Analyzing Open Source development (part 3)

In the last post about analyzing open source development, I mentioned that this one would be about massaging people information to get unique identities for all the project contributors.

But before that, I would like to explore something different: how to get data from multiple repositories. What happens when I want data from all of a GitHub organization’s or user’s repositories?

The obvious answer would be:
1. Let’s get the list of repositories:


import requests

def github_git_repositories(org_name):
    """Return the clone URLs of every repository owned by a GitHub organization."""
    query = 'org:{}'.format(org_name)
    page = 1
    repos = []

    while True:
        # The search API returns results in pages; keep requesting the
        # next page until an empty one comes back
        r = requests.get('https://api.github.com/search/repositories'
                         '?q={}&page={}'.format(query, page))
        items = r.json().get('items', [])
        if not items:
            break
        for item in items:
            repos.append(item['clone_url'])
        page += 1

    return repos

2. And now, for each repository, run the code from the previous post to get a dataframe for each one, and concatenate them all (see the sketch after this list for the whole flow):


import pandas as pd

df = pd.concat(dataframes)
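To make the whole flow concrete, here is a minimal sketch of the sequential version. Note that repository_to_dataframe is a hypothetical stand-in for the extraction code from the previous post, and 'my-organization' is just an example name:


import pandas as pd

def repository_to_dataframe(clone_url):
    # Hypothetical placeholder for the extraction code from the previous
    # post: fetch the repository's commits and return them as a dataframe
    raise NotImplementedError

repos = github_git_repositories('my-organization')
dataframes = [repository_to_dataframe(url) for url in repos]
df = pd.concat(dataframes)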

For organizations or users with a few repositories, this would work. But for those with hundreds of repositories, how long would it take to fetch and extract information from them one by one?

Would there be a faster approach? Let’s play with threads and queues…
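As a preview, here is a minimal sketch of that idea, assuming the same hypothetical repository_to_dataframe helper: a queue holds the clone URLs, and a handful of worker threads drain it, so several repositories are fetched and processed in parallel.


import queue
import threading

import pandas as pd

def worker(pending, results):
    # Each thread takes clone URLs from the shared queue until it is empty
    while True:
        try:
            url = pending.get_nowait()
        except queue.Empty:
            break
        # list.append is atomic in CPython, so no explicit lock is needed
        results.append(repository_to_dataframe(url))

pending = queue.Queue()
for url in github_git_repositories('my-organization'):
    pending.put(url)

results = []
threads = [threading.Thread(target=worker, args=(pending, results))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

df = pd.concat(results)


Since fetching repositories is mostly I/O-bound work, threads can help here despite Python’s GIL.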