GitHub in 2013: Pull Request Actors
In this IPython notebook I give an overview of GitHub pull request events in 2013, based on data obtained from the GitHub Archive. This is part of a series of posts about GitHub in 2013. The source code of this notebook is available in this GitHub repository; the CSV file with the pull request events is not included due to its size. If there is demand, I will look into uploading it to a data sharing service. Suggestions for which service to use are welcome.
Preliminaries
First, load the necessary packages, then set a global footer text and a limit for the charts below.
import datetime
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from utils import graphs3 as graphs
limit = 20
footer = 'Data: githubarchive.org - Source: coderstats.github.io'
Read the event data from a compressed CSV file, set the first column as the index, and specify that it contains dates. Also add a Count column with every value set to 1, so that events can be aggregated by summing.
df_pulls = pd.read_csv('csv/githubarchive/2013/pullrequest_events.csv.bz2', compression='bz2', index_col=0, parse_dates=[0], low_memory=False)
df_pulls['Count'] = 1
df_pulls.head()
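The loading step can be tried without the large CSV file. The following is a minimal sketch on made-up rows: the column names mirror the ones used in this notebook, but the name of the date column (`Created At`) and the exact file layout are assumptions.

```python
import io
import pandas as pd

# Made-up sample rows; the real CSV layout is an assumption.
csv_text = """Created At,Action,Actor,Actor Type,Repo Language
2013-01-01 10:00:00,opened,alice,User,Python
2013-01-01 11:30:00,closed,alice,User,Python
2013-01-02 09:15:00,opened,bob,User,Ruby
"""

# index_col=0 makes the first column the index, parse_dates=[0] parses it as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])
df['Count'] = 1

# Summing the helper column per group is equivalent to counting rows.
per_action = df.groupby('Action')['Count'].sum()
print(per_action)
```

Because every row carries a Count of 1, any later `groupby(...).sum()` on that column yields the number of events per group.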
Distribution of pull request actions
Each pull request event has an associated action, so let's first look at the distribution of different actions.
df_pulls['Action'].value_counts().plot(kind='bar', rot=0)
plt.show()
The most common action is opening a pull request. Most pull requests are closed sooner or later, which doesn't necessarily mean they were accepted.
Opened pull requests
In this notebook we'll focus on pull requests opened by users (not organizations) in 2013.
df_opened = df_pulls[(df_pulls['Action'] == 'opened') & (df_pulls['Actor Type'] == 'User')]
df_opened.head()
actor_grouped = df_opened.groupby('Actor')[['Count']].sum()
# Sort ascending so that the actors with the most pull requests end up in the tail.
top_actor = actor_grouped.sort_values('Count').tail(limit)
graphs.barh(top_actor.index,
top_actor.Count,
'img/%d-top-actors-pulls-github-2013.png' % limit,
figsize=(12, limit / 2),
title='%d top actors by opened pull requests on GitHub in 2013' % limit,
footer=footer)
Let me just say that some of these users look very fishy. Obviously ideatest1 automates pull requests to test ideas, whatever they are about. The first user in this top 20 who doesn't seem to be a bot is juliocamarero with an impressive 1338 (why not one less man?) pull requests or on average 3.67 per day, each day in 2013.
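The daily average quoted above is easy to verify. A quick check, assuming only that 2013 had 365 days:

```python
pull_requests = 1338  # figure quoted above
days_in_2013 = 365    # 2013 was not a leap year

per_day = pull_requests / days_in_2013
print(round(per_day, 2))
```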
Most hyperpolyglot actors
In natural language a person who speaks 6 or more languages is considered a hyperpolyglot, a term coined by the linguist Richard Hudson. Adapting this to the world of programming, a person who writes code in 6 or more programming languages could be considered a hyperpolyglot programmer.
Every GitHub repository is assigned a main language, provided one is detected by GitHub's linguist, so we're going to group pull requests by actor and language to find out who the most hyperpolyglot programmers on GitHub are. This is not without problems, because a pull request can be just a typo fix in the readme file, or even a code change that affects a part of the code base that is not written in the repository's main language. Also, GitHub's linguist first looks at file extensions, so it relies on conventions, which coders can obviously break. Keep this in mind here and whenever you see an analysis of programming languages on GitHub.
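One compact way to express this per-actor language count is `nunique()`, shown here on made-up data as an alternative to the Lang Count helper column used in this notebook. The actor and language values are purely illustrative.

```python
import pandas as pd

# Made-up events; actor and language names are purely illustrative.
events = pd.DataFrame({
    'Actor': ['alice', 'alice', 'alice', 'bob', 'bob', 'carol'],
    'Repo Language': ['Python', 'Python', 'Ruby', 'Go', None, 'C'],
})

# nunique() counts distinct non-null languages per actor, so repositories
# without a detected language do not add to the count.
lang_counts = events.groupby('Actor')['Repo Language'].nunique().sort_values()
print(lang_counts)
```

Several pull requests to repositories of the same language count only once, which is the property the analysis relies on.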
actor_lang_grouped = df_opened.groupby(['Actor', 'Repo Language'])[['Count']].sum()
actor_lang_grouped['Lang Count'] = 1
actor_lang_counts = actor_lang_grouped['Lang Count'].groupby(level=0).sum()
actor_lang_counts = actor_lang_counts.sort_values()
top_actor_lang = actor_lang_counts.tail(limit)
graphs.barh(top_actor_lang.index,
top_actor_lang,
'img/%d-top-actors-pulls-languages-github-2013.png' % limit,
figsize=(12, limit / 2),
title='%d top actors by opened pull request languages on GitHub in 2013' % limit,
footer=footer)
If you look at the profiles of some of these users, chances are very good that they are indeed hyperpolyglot programmers. Still, let's have a look at the distribution of language counts across actors' pull requests.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
actor_lang_counts.hist(bins=53, ax=axes[0])
actor_lang_counts[actor_lang_counts < 15].hist(bins=14, ax=axes[1])
actor_lang_counts[actor_lang_counts < 15].hist(bins=14, log=True, ax=axes[2])
plt.show()
By far most coders submit pull requests to repositories of a single language. Without digging further into it, I assume that most actors submit just one or a few pull requests to the same repo. That makes it all the more impressive how diversified some GitHubbers seem to be.
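To put a number on such a distribution rather than reading it off a histogram, one option is to tabulate the counts directly. This is a sketch on hypothetical values, not on the GitHub Archive data:

```python
import pandas as pd

# Hypothetical language counts per actor (index = actor, value = number of languages).
lang_counts = pd.Series({'a': 1, 'b': 1, 'c': 1, 'd': 2, 'e': 3, 'f': 7})

# Number of actors for each language count, ordered by language count.
distribution = lang_counts.value_counts().sort_index()

# Share of actors whose pull requests touch exactly one language.
single_share = (lang_counts == 1).mean()

print(distribution)
print(single_share)
```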
Language combinations of hyperpolyglots
The last thing we'll look at here is the top language combinations of those hyperpolyglots. We'll only take users with more than 6 and fewer than 15 different languages into account.
hyper = actor_lang_counts[(actor_lang_counts > 6) & (actor_lang_counts < 15)]
is_hyper = df_opened['Actor'].isin(hyper.index)
df_hyper = df_opened[is_hyper]
actor_lang_grouped = df_hyper.groupby(['Actor', 'Repo Language']).count()
actor_lang_grouped.head()
Now create a dictionary of language combinations and their counts and turn it into a DataFrame.
actor_langs = actor_lang_grouped.groupby(level=0)
lang_combs = {}
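for g in actor_langs.groups.values():
    # Each group entry is an (actor, language) index tuple; keep the language.
    langs = [i[1] for i in g]
    # Count every unordered pair of languages used by the same actor.
    for c in itertools.combinations(langs, 2):
        lang_combs[c] = lang_combs.get(c, 0) + 1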
df_lang_combs = pd.DataFrame.from_dict(lang_combs, orient='index')
df_lang_combs.columns = ['Count']
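The pair counting can also be sketched with `collections.Counter` from the standard library. The mapping of actors to languages below is hypothetical; sorting each actor's languages ensures that the same two languages always produce the same key, regardless of input order.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping of actor -> languages of the repositories they sent pull requests to.
actor_languages = {
    'alice': ['Python', 'JavaScript', 'Ruby'],
    'bob': ['Ruby', 'Python'],
    'carol': ['Go', 'Python'],
}

pair_counts = Counter()
for langs in actor_languages.values():
    # Sorting makes ('Python', 'Ruby') and ('Ruby', 'Python') the same key.
    for pair in combinations(sorted(set(langs)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Each actor contributes at most once to a given pair, so the counts are numbers of actors, not numbers of pull requests.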
Finally, plot a graph of the top language combinations.
limit = 50
top = df_lang_combs.sort_values('Count').tail(limit)
graphs.barh(top.index.map(lambda x: ' and '.join(x)),
top.Count,
'img/%d-top-hyper-language-combinations-pulls-github-2013.png' % limit,
figsize=(12, limit / 2),
title='%d top hyperpolyglot language combinations for pull requests on GitHub in 2013' % limit,
footer=footer)
This post was written by Ramiro Gómez (@yaph) and published on . Subscribe to the Geeksta RSS feed to be informed about new posts.
Tags: coderstats git github pandas matplotlib notebook visualization