Analyzing Open Source development (part 3)

In last post about analyzing open source development I mentioned that this one would be about massaging people information to have unique identities for all the project contributors.

But before that, I would like to explore something different. How to get data from multiple repositories? What happens when I want data from a whole GitHub organization’s or user’s repositories?

The obvious answer would be:
1. Let’s get the list of repositories:


import requests

def github_git_repositories(orgName):
    query = "org:{}".format(orgName)
    page = 1
    repos = []
    
    r = requests.get('https://api.github.com/search/repositories?q={}&page={}'.format(query, page))
    items = r.json()['items']
    
    while len(items) > 0:
        for item in items:
            repos.append(item['clone_url'])
        page += 1
        r = requests.get('https://api.github.com/search/repositories?q={}&page={}'.format(query, page))
        items = r.json()['items']
    
    return repos

2. And now, for each repository, run the code seen in previous post to get a dataframe for each one in list and concat them with:


df = pd.concat(dataframes)

For organizations or users with a few repositories, it would work. But for those with hundreds of repositories, how long would it take to go one by one fetching and extracting info?

Would there be a fastest approach? Let’s play with threads and queues…

Entering the danger zone

Threads are quite complicated, and I am sure there are better approaches than the one in this post. The idea is to show a simple a way to run several git data downloads in parallel.

First of all, let’s put the gathering data code in a function, that given a git URI, it returns a dataframe with some basic information:


from perceval.backends.core.git import Git
import numpy as np
import pandas as pd

def git_raw_data(gitURI):
    git_repo = gitURI.split('/')[-1]
    
    data_repository = Git(uri=gitURI, gitpath='/tmp/{}'.format(git_repo))
        
    df = pd.DataFrame()
    
    for commit in data_repository.fetch():
        
        if 'message' not in commit['data'].keys():
            commit['data']['message'] = ''
        
        for file in commit['data']['files']:
            df = df.append({'repository': commit['origin'].split('/')[-1].split('.git')[0],
                        'username': commit['data']['Author'],
                        'commit': commit['data']['commit'],
                        'date': commit['data']['AuthorDate'],
                        'filepath': file['file'],
                        'added': int(file['added']) if 'added' in file.keys() and file['added']!='-' else 0,
                        'removed': int(file['removed']) if 'removed' in file.keys() and file['removed']!='-' else 0},
                       ignore_index=True)
    
    df['date'] = pd.to_datetime(df['date'])
    
    return df

And now, let’s define a function that given a set of repositories, create a thread for each one’s data retrieval, and get a set of dataframes that are joined with pd.concat():


from threading import Thread
from queue import Queue

def git_dataframe(repositories):
    dataframes = []
    threads = []
    que = Queue()
    
    for repo in repositories:
        t = Thread(target=lambda q, arg1: q.put(git_raw_data(arg1)), args=(que, repo))
        threads.append(t)
        t.start()
    
    for t in threads:
        t.join()
    
    while not que.empty():
        result = que.get()
        dataframes.append(result)
        
    dataframe = pd.concat(dataframes)
    return dataframe

Part of the magic is done by a queue (que = Queue()) that is used to store the data retrieved in each thread:


t = Thread(target=lambda q, arg1: q.put(git_raw_data(arg1)), args=(que, repo))

Let’s run the code we have:


In[]: repositories = github_git_repositories('grimoirelab')
     repositories
Out[]:
['https://github.com/grimoirelab/perceval.git',
 'https://github.com/grimoirelab/GrimoireELK.git',
 'https://github.com/grimoirelab/sortinghat.git',
 'https://github.com/grimoirelab/use_cases.git',
 'https://github.com/grimoirelab/arthur.git',
 'https://github.com/grimoirelab/panels.git',
 'https://github.com/grimoirelab/mordred.git',
 'https://github.com/grimoirelab/grimoirelab.github.io.git',
 'https://github.com/grimoirelab/reports.git',
 'https://github.com/grimoirelab/grimoirelab-toolkit.git',
 'https://github.com/grimoirelab/perceval-opnfv.git',
 'https://github.com/grimoirelab/perceval-mozilla.git',
 'https://github.com/grimoirelab/perceval-puppet.git',
 'https://github.com/grimoirelab/grimoirelab.git']

In[]: grimoirelab_git_raw_df = git_dataframe(repositories)
In[]: grimoirelab_git_raw_df.info()

Int64Index: 13112 entries, 0 to 4460
Data columns (total 7 columns):
added         13112 non-null float64
commit        13112 non-null object
date          13112 non-null datetime64[ns]
filepath      13112 non-null object
removed       13112 non-null float64
repository    13112 non-null object
username      13112 non-null object
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 819.5+ KB

Cool, now we have a big dataframe with data for all the files edited through commits in a whole organization

Time to enrich the data

Before going a bit further, let’s enrich the data we have. We currently have:

  • username, as the whole name and email used by the commit author. It would be more useful to have it as: name, email, email domain (for example)
  • filepath, as the whole file path to the file edited. It would be interesting to have it as: file name, file extension
  • date, as the whole commit date information. It might be interesting to have also: the hour of the commit and the day of the week

How to do that? It’s a pandas dataframe so we could have an enrich function like this:


def enrich_git_data(df):
    df['name'] = df['username'].map(lambda uname: uname.split('<')[0][:-1])
    df['email'] = df['username'].map(lambda uname: uname.split('<')[1].split('>')[0])
    df['email_domain'] = df['username'].map(lambda uname: uname.split('@')[-1][:-1])
    df['isoweekday'] = df['date'].map(lambda d: d.isoweekday())
    df['hour'] = df['date'].map(lambda d: d.hour)
    df['file_name'] = df['filepath'].map(lambda file: file.split('/')[-1])
    df['file_ext'] = df['file_name'].map(lambda file: file.split('.')[-1])
    
    return df

And let’s apply it to previous dataframe:


In[]: grimoirelab_git_data = enrich_git_data(grimoirelab_git_raw_df)
In[]: grimoirelab_git_data.info()

Int64Index: 13112 entries, 0 to 4460
Data columns (total 14 columns):
added           13112 non-null float64
commit          13112 non-null object
date            13112 non-null datetime64[ns]
filepath        13112 non-null object
removed         13112 non-null float64
repository      13112 non-null object
username        13112 non-null object
name            13112 non-null object
email           13112 non-null object
email_domain    13112 non-null object
isoweekday      13112 non-null int64
hour            13112 non-null int64
file_name       13112 non-null object
file_ext        13112 non-null object
dtypes: datetime64[ns](1), float64(2), int64(2), object(9)
memory usage: 1.5+ MB

Great! Now we have some columns to play with.

What is the average hour we do commits in GrimoireLab?


In[]: grimoirelab_git_data.groupby('commit')['isoweekday'].mean().median()
Out[]: 3.0
In[]: grimoirelab_git_data.groupby('commit')['hour'].mean().median()
Out[]: 13.0

So, it seems that we usually commit on Wednesday around 1pm.. Just on time for lunch!

Which email domains have contributed to GrimoireLab? Which is the average lines of code added and removed by each one?


In[]: grimoirelab_git_data.groupby('email_domain')['added', 'removed'].mean()
Email domains table
Email domains table

What's next

And that's all by now. Christmas time is coming, so next post will be in January, and it will cover the promised contributors information massaging part.

Meanwhile, enjoy these days and let's get back in 2018!

Leave a Reply

Your email address will not be published. Required fields are marked *