On rewriting the Drupal contribution analyzer scripts

Over the last few months I have been incrementally rewriting the scripts behind the stats I publish weekly about Drupal core contributions.

This post gives some background on why and how that happened.

TL;DR: see the conclusion below and the link above for the data produced.

Why

Developer experience

The way the scripts worked in the past was really different from what a (Drupal) web developer is used to.
They relied on many languages and scripting tools to produce their output, which made it harder to start modifying them.

The raw data extracted and processed was stored in CSV files, but interacting with it proved to be tricky and error-prone.

New/better tools available

Since I originally wrote these scripts, better tools have been created or have become stable enough to use.

Buggy

There were some places where the approach was just too fragile and bug-prone, especially the two places where typos in commit messages were corrected by guessing.

Also, a Makefile orchestrated how things were built, but it had grown too big to maintain by hand.

How

Restructure

This project is about a lot of little pieces working together, a.k.a. glue code FTW, so when I wanted to rewrite several pieces in PHP, I needed a way to structure the code. It is also mainly about CLI scripts, so using the Symfony Console component sounded like a good idea.
And since I was going to use external code, I naturally wanted a dependency manager too; I ended up with Composer.

The target was then clear: convert the scripts to PHP piece by piece, as a Symfony Console application containing several commands, one for each piece.

Extracting and processing information from the git history was the first step. Doing the same as before, a Python git wrapper library plus extra logic, did not look like a good option: it is expensive in general, because forking OS processes is expensive.
Now I use the PHP binding for the libgit2 library, and fork git only for operations that are non-trivial via the library. This improves performance a little, since most of the work now runs at C speed.

After that I wanted to offer some customization to the end user, so I introduced configuration files in YAML, read with the Symfony Yaml component.
Three configuration files were added: a general configuration file, a mail-to-username mapping file, and a commit message overrides file keyed by commit hash.
Overriding commit messages lets me fix commits without any matching or replacing based on the message text. It relies on commit hashes instead, which makes the error-prone replacing disappear, mostly.
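
To make the idea concrete, here is a minimal sketch of hash-based overrides and mail mapping. It is written in Python rather than the project's PHP, and the dictionaries, the hashes, the function name and the fallback behaviour are all hypothetical, not taken from the real configuration files.

```python
# Hypothetical data, shaped like the two mapping files described above.
MAIL_TO_USERNAME = {"jane@example.com": "jane"}
MESSAGE_OVERRIDES = {
    # commit hash -> corrected commit message (made-up values)
    "a1b2c3d": "Issue #123 by jane, joe: Fix the typo",
}


def resolve_commit(commit_hash, author_mail, message):
    """Return (username, message) after applying both mappings."""
    # The override is looked up by exact hash, so no message text is matched.
    fixed_message = MESSAGE_OVERRIDES.get(commit_hash, message)
    # Unknown mails fall back to the mail itself (an assumption of this sketch).
    username = MAIL_TO_USERNAME.get(author_mail, author_mail)
    return username, fixed_message
```

Because the lookup key is the commit hash, a typo in one message can be corrected without any risk of touching other commits whose messages happen to look similar.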

Early in development it became clear that automation was going to be key for this project to work, because of the inter-dependencies between steps during a run.
The Makefile orchestrating the run was good enough at the start, but it was really hard to maintain, so it is now generated dynamically from the configuration files.
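
Generating a Makefile from configuration can be sketched roughly as follows. This is Python, not the project's PHP, and the scenario names, the shape of the configuration and the placeholder `echo` recipe are assumptions made for illustration only.

```python
def render_makefile(scenarios):
    """Build Makefile text with one target per scenario.

    `scenarios` maps a target name to the list of targets it depends on.
    """
    lines = ["all: " + " ".join(sorted(scenarios)), ""]
    for name in sorted(scenarios):
        deps = " ".join(scenarios[name])
        # rstrip() drops the trailing space when a target has no dependencies.
        lines.append(f"{name}: {deps}".rstrip())
        # Make requires recipe lines to start with a tab character.
        lines.append(f"\t@echo 'building {name}'")
        lines.append("")
    return "\n".join(lines)


# Hypothetical configuration: one scenario that depends on a shared step.
config = {"core-8.0.x": ["history"], "history": []}
print(render_makefile(config))
```

The point of the design is that adding a scenario means editing a configuration file, while make still takes care of ordering and of skipping work that is already done.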

CSV files are now replaced, for each scenario, by one table in a SQLite database, so the data can be queried easily.
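
As an illustration of why a table is easier to work with than CSV, the sketch below builds a tiny in-memory SQLite table and derives a scoreboard with a single query. The table name, the columns and the rows are invented for the example and do not reflect the project's real schema.

```python
import sqlite3

# In-memory database with a hypothetical per-scenario table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scenario_core (commit_hash TEXT, username TEXT)")
conn.executemany(
    "INSERT INTO scenario_core VALUES (?, ?)",
    [("a1", "jane"), ("a1", "joe"), ("b2", "jane")],
)


def scoreboard(connection):
    """Count commit mentions per username, highest score first."""
    query = (
        "SELECT username, COUNT(*) AS score FROM scenario_core "
        "GROUP BY username ORDER BY score DESC, username"
    )
    return connection.execute(query).fetchall()


print(scoreboard(conn))  # [('jane', 2), ('joe', 1)]
```

A new indicator then becomes a new query, instead of another script that parses and joins CSV files by hand.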

Drupal developers should now find this project more familiar: PHP is the main language, used together with a database.

Generalization

This project has been about Drupal core for a while, but the way it extracts information is not specific to Drupal core; it is specific to the Drupal community.
Apart from the lack of flexibility in the scripts, nothing really prevented using it for non-core Drupal projects, so I added the missing abstraction pieces to make that possible.

New features

This restructuring has proven useful for me: I could add several more pieces relatively easily.

CSV files and files with username:score pairs are useful, but HTML is more natural to show in a browser, so I added variations that generate HTML files of the scoreboards. I started using Twig to handle the related templates.
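
The project uses Twig, a PHP template engine, for this. As a rough stand-in, the following Python sketch shows the same step of turning username:score pairs into an HTML scoreboard; the function name and the markup are hypothetical.

```python
from html import escape


def render_scoreboard(title, scores):
    """Render (username, score) pairs as a small HTML table."""
    rows = "\n".join(
        # Escape usernames so characters like "<" cannot break the markup.
        f"  <tr><td>{escape(user)}</td><td>{score}</td></tr>"
        for user, score in scores
    )
    return f"<h1>{escape(title)}</h1>\n<table>\n{rows}\n</table>"


print(render_scoreboard("Scoreboard", [("jane", 2), ("joe", 1)]))
```

A real template engine such as Twig keeps the markup in separate template files and handles the escaping itself, which is the main reason to prefer it over string formatting like this.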

Data is easier to read in context when plotted, so I added some extra generation code that uses the flot library to draw some indicators.

It also adds some comparisons between scenarios, e.g. indicators for core's 7.x vs 8.0.x branches.

Automation

Historically, this set of scripts has been tricky to set up and install.
In the new runner branch I now maintain a script that automates the process even more.
It is mainly about calling the right commands in the right environment, and hopefully it also works as always-up-to-date documentation on how to run them.

Conclusion

In conclusion, this rewrite:

  • Improves developer experience and maintainability in general: fewer languages and tools, a more unified overall process, and clear points where data can be overridden.
  • Generalizes the target so it can be used with any Drupal project, e.g. contributed modules, so not only core benefits from it.
  • Restructures the logic using new or better tools for some tasks: libgit2, Composer, Symfony components, flot and Twig.
  • Stores the extracted information in a SQLite database table, allowing new indicators to be extracted more naturally through queries.

Future

This code is far from perfect, but I am happy with how it ended up: it is cleaner and easier to develop, maintain and improve.

Hopefully someone else finds this as useful as I do. If you want to help, fix something or request a feature, please use the relevant issue queue, where I try to keep track of pending work. Patches are welcome there!