Analyzing Open Source development (part 1)

A simple analysis of open source development in public administrations is easy to do. This post walks through the first steps needed to reproduce the results shown in the previous post.

We’ll learn how to use Perceval. It’s the tool responsible for data retrieval in GrimoireLab, the free, open source software framework for software development analytics.

Grab some coffee or tea, and let’s start!


Perceval, The Knight in Shepherd’s Boots

First of all, you need a Python environment to start playing. I recommend setting up a virtual environment. Once you have your environment activated, just install Perceval:


$ pip install perceval
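
If you want to confirm that the setup worked, a small check like the following can help. It is only a sketch: `in_virtualenv` and `is_installed` are hypothetical helper names, and the virtual environment test assumes the environment was created with Python's standard venv module.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("perceval importable:", is_installed("perceval"))
```

If the second line prints False, the install probably went to a different Python than the one you are running.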

Perceval is able to gather information from a very diverse set of data sources. Current Perceval backends are: askbot, bugzilla, confluence, discourse, dockerhub, gerrit, git, github, gmane, hyperkitty, jenkins, jira, mbox, mediawiki, meetup, nntp, phabricator, pipermail, redmine, rss, slack, stackexchange, supybot, telegram.

For this analysis, we will focus only on git (you’ll need to have git installed on your system). You can check the options for each backend with --help. For example, for the git backend:


$ perceval git --help
[2017-09-29 10:42:48,434] - Sir Perceval is on his quest.
usage: perceval [-h] [--tag TAG] [--from-date FROM_DATE] [-o OUTFILE]
                [--branches BRANCHES [BRANCHES ...]] [--latest-items]
                [--git-path GIT_PATH | --git-log GIT_LOG]
                uri

positional arguments:
  uri                   URI of the Git log repository

optional arguments:
  -h, --help            show this help message and exit

general arguments:
  --tag TAG             tag the items generated during the fetching process
  --from-date FROM_DATE
                        fetch items updated since this date

output arguments:
  -o OUTFILE, --output OUTFILE
                        output file

Git arguments:
  --branches BRANCHES [BRANCHES ...]
                        Fetch commits only from these branches
  --latest-items        Fetch latest commits added to the repository
  --git-path GIT_PATH   Path where the Git repository will be cloned
  --git-log GIT_LOG     Path to the Git log file

Start coding

Now it’s time to start writing some Python code! Let’s get some data to analyze open source development.

First of all, we need to import the specific module for git data retrieval:


from perceval.backends.core.git import Git

Next, create the data source object:


data_repository = Git(uri='https://github.com/grimoirelab/perceval.git', gitpath='/tmp/perceval.git')

And finally, we start gathering data with the .fetch() method. This method keeps returning items while there is data to retrieve from the defined data source. Optionally, you can specify an initial date to start tracking from by providing the from_date parameter.

For example, to get all the items (commits, in the git case), we could use:


for commit in data_repository.fetch():
    print(commit)
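
To limit the retrieval to recent activity, the optional start date can be passed as a datetime. The sketch below assumes that fetch() accepts a from_date keyword argument, mirroring the --from-date option of the command line; `fetch_since` is a hypothetical helper name, not part of Perceval.

```python
import datetime


def fetch_since(repository, year, month, day):
    """Return an iterator over items updated on or after the given UTC date.

    `repository` is any object exposing fetch(from_date=...), such as the
    data_repository object created earlier.
    """
    from_date = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return repository.fetch(from_date=from_date)


# Usage with the object created earlier (needs Perceval and network access):
# for commit in fetch_since(data_repository, 2017, 1, 1):
#     print(commit['data']['commit'])
```

Using a timezone-aware date avoids any ambiguity about which midnight is meant.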

If you want the output to look nicer, you can use the pprint module:


import pprint
pp = pprint.PrettyPrinter(indent=2)
for commit in data_repository.fetch():
    pp.pprint(commit)

As a result, you’ll get a set of items similar to the following one:


{ 'backend_name': 'Git',
  'backend_version': '0.8.0',
  'category': 'commit',
  'data': { 'Author': 'Santiago Dueñas ',
            'AuthorDate': 'Tue Aug 18 18:08:27 2015 +0200',
            'Commit': 'Santiago Dueñas ',
            'CommitDate': 'Tue Aug 18 18:08:27 2015 +0200',
            'commit': 'dc78c254e464ff334892e0448a23e4cfbfc637a3',
            'files': [ { 'action': 'A',
                         'added': '10',
                         'file': '.gitignore',
                         'indexes': ['0000000...', 'ceaedd5...'],
                         'modes': ['000000', '100644'],
                         'removed': '0'},
                       { 'action': 'A',
                         'added': '1',
                         'file': 'AUTHORS',
                         'indexes': ['0000000...', 'a67f214...'],
                         'modes': ['000000', '100644'],
                         'removed': '0'},
                       { 'action': 'A',
                         'added': '674',
                         'file': 'LICENSE',
                         'indexes': ['0000000...', '94a9ed0...'],
                         'modes': ['000000', '100644'],
                         'removed': '0'}],
            'message': 'Initial import',
            'parents': [],
            'refs': []},
  'origin': 'https://github.com/grimoirelab/perceval.git',
  'perceval_version': '0.9.0',
  'tag': 'https://github.com/grimoirelab/perceval.git',
  'timestamp': 1506666255.440165,
  'updated_on': 1439914107.0,
  'uuid': '29ddd736146e278feb5d84e9dcc1fd310ff50007'}

It’s a JSON-like output, with some fields generated by Perceval (backend_name, backend_version, category, origin, perceval_version, tag, timestamp, updated_on and uuid) and a data field that contains the information retrieved from the data source item.
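
The updated_on and timestamp values are seconds since the Unix epoch, while the dates inside data are plain strings. Here is a minimal sketch of how both could be turned into datetime objects, using a trimmed copy of the item above. `epoch_to_utc` and `parse_git_date` are hypothetical helper names, and the date format is inferred from this single sample.

```python
import datetime

# Trimmed copy of the item shown above; only the fields used here are kept.
item = {
    'timestamp': 1506666255.440165,
    'updated_on': 1439914107.0,
    'data': {'AuthorDate': 'Tue Aug 18 18:08:27 2015 +0200'},
}


def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)


def parse_git_date(text):
    """Parse a date formatted like 'Tue Aug 18 18:08:27 2015 +0200'."""
    # %a and %b expect English day and month names (the default C locale).
    return datetime.datetime.strptime(text, '%a %b %d %H:%M:%S %Y %z')


updated_on = epoch_to_utc(item['updated_on'])
author_date = parse_git_date(item['data']['AuthorDate'])

print(updated_on.isoformat())     # 2015-08-18T16:08:27+00:00
print(author_date.isoformat())    # 2015-08-18T18:08:27+02:00
print(updated_on == author_date)  # True: same instant, different offsets
```

For this item, updated_on and AuthorDate describe the same moment. The timestamp field is much later and appears to record when the item was fetched.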

In the git case, the data field describes a commit, with fields like:

  • Author: Author information, the person who originally wrote the patch
  • AuthorDate: when the commit was originally made
  • Commit: Committer information, the person who last applied the patch
  • CommitDate: changes every time the commit is modified, for example when the branch containing the commit is rebased onto another branch
  • commit: SHA-1 hash that identifies the commit
  • files: information about the files touched, including file name, action performed on it, lines added and removed, indexes and modes
  • message: Commit message
  • parents: the commits this current commit is based on
  • refs: an indirect way of referring to a commit. You can think of it as a user-friendly alias for a commit hash (quoted from: Atlassian Git tutorials)
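
As a first taste of what can be done with these fields, the sketch below adds up the lines added and removed in a commit from its files list. `count_changed_lines` is a hypothetical helper, and the sample is a shortened copy of the data shown earlier. The counts arrive as strings, and the code assumes that non-numeric values may appear (for instance for binary files), so it skips them.

```python
def count_changed_lines(commit_data):
    """Return (added, removed) line totals for the 'data' part of a commit item."""
    added = removed = 0
    for entry in commit_data.get('files', []):
        # Counts are strings; anything non-numeric (assumed to be possible
        # for binary files) is skipped rather than guessed.
        if str(entry.get('added', '')).isdigit():
            added += int(entry['added'])
        if str(entry.get('removed', '')).isdigit():
            removed += int(entry['removed'])
    return added, removed


# Shortened copy of the 'data' field from the item shown earlier.
sample = {
    'commit': 'dc78c254e464ff334892e0448a23e4cfbfc637a3',
    'message': 'Initial import',
    'files': [
        {'action': 'A', 'added': '10', 'file': '.gitignore', 'removed': '0'},
        {'action': 'A', 'added': '1', 'file': 'AUTHORS', 'removed': '0'},
        {'action': 'A', 'added': '674', 'file': 'LICENSE', 'removed': '0'},
    ],
}

print(count_changed_lines(sample))  # (685, 0)
```

With real data you would call it as count_changed_lines(commit['data']) inside the fetch() loop.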

Try other Perceval backends to see which data you can track. For example:


from perceval.backends.core.discourse import Discourse
forum_data = Discourse(url='http://www.communityleadershipforum.com/')

For some backends you might need a token to access their API. For example, for GitHub or Meetup, it would be something like:


from perceval.backends.core.github import GitHub
github_data = GitHub(owner='grimoirelab', repository='perceval', api_token=GITHUB_TOKEN)

from perceval.backends.core.meetup import Meetup
group_data = Meetup(group='python-madrid', sleep_for_rate=True, api_token=MEETUP_API_KEY)
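
GITHUB_TOKEN and MEETUP_API_KEY are not defined in the snippets above. One way to provide them without writing secrets into the script is to read them from environment variables. This is just a sketch: `read_token` is a hypothetical helper, and the variable names are assumptions.

```python
import os


def read_token(variable_name):
    """Return the token stored in an environment variable, or fail loudly."""
    token = os.environ.get(variable_name, '').strip()
    if not token:
        raise RuntimeError('Set the %s environment variable first' % variable_name)
    return token


# Usage (commented out because it needs real credentials):
# GITHUB_TOKEN = read_token('GITHUB_TOKEN')
# MEETUP_API_KEY = read_token('MEETUP_API_KEY')
```

Failing early with a clear message is usually nicer than a confusing authentication error from the API later on.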

What do you get? If you run into any issue, just let me know by commenting on this post. Of course, you can also open issues in the Perceval repository!

What’s next?

Upcoming posts will show how to start playing with the data obtained with Perceval.

But before that, I need to write about PyConES 2017.

So, stay tuned!
