GitHub in 2013: Pull Request Actors

In this IPython notebook I give an overview of GitHub pull request events in 2013, based on data obtained from the GitHub Archive. This is part of a series of posts about GitHub in 2013. The source code of this notebook is available in this GitHub repository; the CSV file with the pull request events is not included due to its size. If there is demand, I will look into uploading it to a data sharing service. Proposals for which service to use are welcome.

Preliminaries

First, load the necessary packages and set a global footer text and a limit on the number of bars in the charts below.

In [1]:

import datetime
import itertools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import graphs3 as graphs

limit = 20
footer = 'Data: githubarchive.org - Source: coderstats.github.io'

Read the event data from a compressed CSV file, set the first column as the index and specify that it contains dates. Also add a Count column and set its values to 1 for aggregating events.

In [2]:

df_pulls = pd.read_csv('csv/githubarchive/2013/pullrequest_events.csv.bz2', compression='bz2', index_col=0, parse_dates=[0], low_memory=False)
df_pulls['Count'] = 1
df_pulls.head()

Out[2]:

	Action	Actor	Actor Type	Repo Forks	Repo Language	Repo Name	Repo Owner	Repo Size	Repo Stars	Repo Watchers	Repo is Fork	Count
2013-01-01 08:03:58	opened	CODeRUS	User	17	Python	yowsup	tgalal	196	47	47	False	1
2013-01-01 08:05:25	closed	tnm	User	59	C	rugged	libgit2	356	488	488	False	1
2013-01-01 08:06:06	opened	thebinaryhood	User	0	Ruby	beastmaster	thebinaryhood	172	0	0	False	1
2013-01-01 08:07:29	closed	thebinaryhood	User	0	Ruby	beastmaster	thebinaryhood	172	0	0	False	1
2013-01-01 08:09:58	opened	lichenbo	User	1	NaN	2012-a-year-of-no-significance	leon-huang	424	0	0	False	1

5 rows × 12 columns
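
To illustrate why a constant Count column is handy, here is a minimal sketch on a made-up miniature of the event table. The `events` frame and the actor names are hypothetical and not taken from the real data; summing the column of ones per group simply counts the rows in that group.

```python
import pandas as pd

# Hypothetical miniature of the event table; the values are made up.
events = pd.DataFrame(
    {
        "Action": ["opened", "closed", "opened", "opened"],
        "Actor": ["alice", "alice", "bob", "alice"],
    }
)
events["Count"] = 1

# Summing the constant column counts the rows in each group.
per_actor = events.groupby("Actor")["Count"].sum()
print(per_actor)

# groupby(...).size() yields the same numbers without the helper column.
per_actor_size = events.groupby("Actor").size()
```

The helper column mostly pays off when you want to aggregate counts alongside other columns in one go; for a plain row count `size()` does the job as well.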

Distribution of pull request actions

Each pull request event has an associated action, so let's first look at the distribution of different actions.

In [3]:

df_pulls['Action'].value_counts().plot(kind='bar', rot=0)
plt.show()

The most common action is opening a pull request. Most of them are closed sooner or later, which doesn't necessarily mean they were accepted.

Opened pull requests

In this notebook we'll focus on pull requests opened by users (not organizations) in 2013.

In [4]:

df_opened = df_pulls[(df_pulls['Action'] == 'opened') & (df_pulls['Actor Type'] == 'User')]
df_opened.head()

Out[4]:

	Action	Actor	Actor Type	Repo Forks	Repo Language	Repo Name	Repo Owner	Repo Size	Repo Stars	Repo Watchers	Repo is Fork	Count
2013-01-01 08:03:58	opened	CODeRUS	User	17	Python	yowsup	tgalal	196	47	47	False	1
2013-01-01 08:06:06	opened	thebinaryhood	User	0	Ruby	beastmaster	thebinaryhood	172	0	0	False	1
2013-01-01 08:09:58	opened	lichenbo	User	1	NaN	2012-a-year-of-no-significance	leon-huang	424	0	0	False	1
2013-01-01 08:10:12	opened	CODeRUS	User	17	Python	yowsup	tgalal	196	47	47	False	1
2013-01-01 08:12:34	opened	kefirfromperm	User	10	Groovy	grails-quartz	grails-plugins	160	11	11	True	1

5 rows × 12 columns

In [5]:

actor_grouped = df_opened.groupby('Actor')[['Count']].sum()
top_actor = actor_grouped.sort_values('Count').tail(limit)
graphs.barh(top_actor.index,
            top_actor.Count,
            'img/%d-top-actors-pulls-github-2013.png' % limit,
            figsize=(12, limit / 2),
            title='%d top actors by opened pull requests on GitHub in 2013' % limit,
            footer=footer)

Let me just say that some of these users look very fishy. Obviously ideatest1 automates pull requests to test ideas, whatever they are about. The first user in this top 20 who doesn't seem to be a bot is juliocamarero with an impressive 1338 pull requests (why not one less, man?), or 3.67 per day on average, every day of 2013.
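
The daily average quoted above is plain arithmetic and can be checked in a couple of lines. The variable names are only for illustration.

```python
pulls = 1338
days_in_2013 = 365  # 2013 was not a leap year

per_day = pulls / days_in_2013
print(round(per_day, 2))
```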

Most hyperpolyglot actors

In natural language a person who speaks 6 or more languages is considered a hyperpolyglot, a term coined by the linguist Richard Hudson. Adapting this to the world of programming, a person who writes code in 6 or more programming languages could be considered a hyperpolyglot programmer.

Every GitHub repository is assigned a main language, provided one is detected by GitHub's linguist, so we're going to group pull requests by actor and language to find out who the most hyperpolyglot programmers on GitHub are. This is not without problems, because a pull request can just fix a typo in the readme file or even change a part of the code base that is not written in the repository's main language. Also, GitHub's linguist first looks at file extensions, so it relies on conventions, which coders can obviously break. Keep this in mind here and whenever you see an analysis of programming languages on GitHub.
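
The grouping idea described above can be sketched on toy data. This is not the code used for the charts in this notebook, just an equivalent way to count distinct main languages per actor using `nunique`; the `toy` frame and its actors are made up. Note that rows without a detected language (NaN) are not counted.

```python
import pandas as pd

# Hypothetical pull requests; actors and languages are made up.
toy = pd.DataFrame(
    {
        "Actor": ["alice", "alice", "alice", "bob", "bob"],
        "Repo Language": ["Python", "Python", "C", "Ruby", None],
    }
)

# Count distinct repo languages per actor; missing languages are ignored.
lang_counts = toy.groupby("Actor")["Repo Language"].nunique().sort_values()
print(lang_counts)
```

Here alice has opened pull requests against repositories in two languages, no matter how many pull requests she sent to each of them.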

In [6]:

actor_lang_grouped = df_opened.groupby(['Actor', 'Repo Language'])[['Count']].sum()
actor_lang_grouped['Lang Count'] = 1
actor_lang_counts = actor_lang_grouped['Lang Count'].groupby(level=0).sum()
actor_lang_counts = actor_lang_counts.sort_values()

top_actor_lang = actor_lang_counts.tail(limit)
graphs.barh(top_actor_lang.index,
            top_actor_lang,
            'img/%d-top-actors-pulls-languages-github-2013.png' % limit,
            figsize=(12, limit / 2),
            title='%d top actors by opened pull request languages on GitHub in 2013' % limit,
            footer=footer)

If you look at the profiles of some of these users, chances are very good that they are indeed hyperpolyglot programmers. But let's have a look at the distribution of language counts across actors' pull requests.

In [7]:

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

actor_lang_counts.hist(bins=53, ax=axes[0])
actor_lang_counts[actor_lang_counts < 15].hist(bins=14, ax=axes[1])
actor_lang_counts[actor_lang_counts < 15].hist(bins=14, log=True, ax=axes[2])
plt.show()

By far most coders submit pull requests to repositories of a single language. Without digging further into it, I assume that most actors submit just one or a few pull requests to the same repo. So it is quite impressive how diversified some GitHubbers seem to be.
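
If you want a number instead of eyeballing a histogram, the share of single-language actors can be computed from a Series of language counts. The following is a minimal sketch with made-up counts; `toy_counts` is hypothetical and only stands in for the real per-actor language counts.

```python
import pandas as pd

# Hypothetical language counts per actor; the numbers are made up.
toy_counts = pd.Series(
    [1, 1, 1, 1, 1, 2, 2, 7],
    index=["a", "b", "c", "d", "e", "f", "g", "h"],
)

# The mean of a boolean mask is the fraction of True values.
share_single = (toy_counts == 1).mean()
share_hyper = (toy_counts >= 6).mean()

print(f"single language: {share_single:.1%}")
print(f"6 or more languages: {share_hyper:.1%}")
```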

Language combinations of hyperpolyglots

The last thing we'll look at here are the top language combinations of those hyperpolyglots. We'll only take users with more than 6 and fewer than 15 different languages into account, so users with exactly 6 languages are left out even though they meet the definition above.

In [8]:

hyper = actor_lang_counts[(actor_lang_counts > 6) & (actor_lang_counts < 15)]
is_hyper = df_opened['Actor'].isin(hyper.index)
df_hyper = df_opened[is_hyper]
actor_lang_grouped = df_hyper.groupby(['Actor', 'Repo Language']).count()
actor_lang_grouped.head()

Out[8]:

		Action	Actor	Actor Type	Repo Forks	Repo Language	Repo Name	Repo Owner	Repo Size	Repo Stars	Repo Watchers	Repo is Fork	Count
Actor	Repo Language
9034725985	C	38	38	38	38	38	38	38	38	38	38	38	38
	C#	2	2	2	2	2	2	2	2	2	2	2	2
	CSS	4	4	4	4	4	4	4	4	4	4	4	4
	Java	15	15	15	15	15	15	15	15	15	15	15	15
	JavaScript	54	54	54	54	54	54	54	54	54	54	54	54

5 rows × 12 columns

Now create a dictionary of language combinations and their counts and turn it into a DataFrame.

In [9]:

actor_langs = actor_lang_grouped.groupby(level=0)
lang_combs = {}
for g in actor_langs.groups.values():
    langs = [i[1] for i in g]
    for c in itertools.combinations(langs, 2):
        lang_combs[c] = lang_combs.get(c, 0) + 1

df_lang_combs = pd.DataFrame.from_dict(lang_combs, orient='index')
df_lang_combs.columns = ['Count']
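
The pair counting done in the loop above can be illustrated on toy data with `collections.Counter`. This is a sketch of the same idea, not a replacement for the cell above; the actors and their languages are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical actors and their languages, made up for illustration.
actor_languages = {
    "alice": ["C", "JavaScript", "Python"],
    "bob": ["JavaScript", "Python"],
    "carol": ["C", "Python"],
}

pair_counts = Counter()
for langs in actor_languages.values():
    # Sorting keeps each pair in one canonical order, so ("C", "Python")
    # and ("Python", "C") are never counted separately.
    pair_counts.update(combinations(sorted(langs), 2))

print(pair_counts.most_common(2))
```

Each actor contributes every pair of their languages exactly once, so a pair's count is the number of actors who opened pull requests in both languages.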

Finally, plot the top language combinations.

In [10]:

limit = 50
top = df_lang_combs.sort_values('Count').tail(limit)
graphs.barh(top.index.map(lambda x: ' and '.join(x)),
            top.Count,
            'img/%d-top-hyper-language-combinations-pulls-github-2013.png' % limit,
            figsize=(12, limit / 2),
            title='%d top hyperpolyglot language combinations for pull requests on GitHub in 2013' % limit,
            footer=footer)

This post was written by Ramiro Gómez (@yaph) and published on August 01, 2014. Subscribe to the Geeksta RSS feed to be informed about new posts.

Tags: github coderstats pandas notebook visualization matplotlib git
