GitHub in 2013: Follow Events

In this IPython notebook I give a brief overview of GitHub follow events in 2013 based on data obtained from the GitHub Archive. This is a follow-up post to the Event Types 2013 notebook. The source code of this notebook is available in this GitHub repository; the CSV file with the follow events is not included due to its size. If there is demand, I will look into uploading it to a data sharing service. Proposals for which service to use are welcome.

Preliminaries

First, load the necessary packages and set a global footer text and a limit for the bar charts below.

In [1]:
import datetime
import itertools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import graphs

limit = 30
footer = 'Data: githubarchive.org - Source: coderstats.github.io'

Read the follow events from a compressed CSV file, set the first column as the index, and specify that it contains dates. Also add a Count column with every value set to 1, so that events can be aggregated by followers and followed users.

In [2]:
df_follows = pd.read_csv('csv/githubarchive/2013/follow_events.csv.bz2', compression='bz2', index_col=0, parse_dates=[0])
df_follows['Count'] = 1
df_follows.head()
Out[2]:
                     Actor       Target          Count
2013-01-01 08:00:48  kelsonzhao  pjhyett             1
2013-01-01 08:02:05  bedreamer   t-k-                1
2013-01-01 08:03:17  zonuexe     0xrofi              1
2013-01-01 08:03:56  adulau      FredericJacobs      1
2013-01-01 08:04:09  conikeec    garyburd            1
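
Since the CSV file itself is not included, here is a minimal sketch of the same loading pattern on a tiny in-memory sample. The header name created_at and the third row are assumptions made up for illustration; only the first two rows mirror the output above.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed layout of the real file:
# timestamp first, then the following user (Actor) and the followed user (Target).
# The header name "created_at" and the third row are invented for this sketch.
sample_csv = io.StringIO(
    "created_at,Actor,Target\n"
    "2013-01-01 08:00:48,kelsonzhao,pjhyett\n"
    "2013-01-01 08:02:05,bedreamer,t-k-\n"
    "2013-01-02 09:15:00,kelsonzhao,t-k-\n"
)

# index_col=0 makes the timestamp the index; parse_dates=[0] converts it to datetimes.
sample = pd.read_csv(sample_csv, index_col=0, parse_dates=[0])

# A constant Count column turns "number of events" into a plain sum after a groupby.
sample['Count'] = 1

# Number of follow events per following user.
per_actor = sample.groupby('Actor')['Count'].sum()
print(per_actor)
```

With a datetime index in place, the events can later be grouped by calendar day without any further parsing.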

Most Following Users

One eye-catching observation in the previous notebook was a huge spike in the follow events timeline, so let's shed some light on this by looking at the users who followed the most people.

In [3]:
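# DataFrame.sort() was removed from pandas; sort_values() replaces it. Only Count is summed.
top_followers = df_follows.groupby('Actor')[['Count']].sum().sort_values('Count').tail(limit)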
graphs.barh(top_followers.index,
            top_followers.Count,
            'img/%d-most-following-github-users-2013.png' % limit,
            figsize=(12, 16),
            title='%d most following GitHub users from 2013-01-01 to 2013-12-11' % limit,
            footer=footer)

The user (or bot) threejs-cn, which unsurprisingly no longer exists on GitHub, stands out in particular, although a few of the others have quite impressive follow counts as well.
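
As an aside, the same ranking can be obtained without the helper Count column. The following is only a sketch on made-up data (the user names are placeholders), using value_counts instead of a groupby and sum.

```python
import pandas as pd

# Made-up follow events; the user names are placeholders, not real data.
events = pd.DataFrame({
    'Actor':  ['bot', 'bot', 'bot', 'alice', 'alice', 'bob'],
    'Target': ['u1',  'u2',  'u3',  'u1',    'u2',    'u1'],
})

limit = 2

# value_counts() counts rows per Actor and sorts in descending order,
# so head(limit) yields the users who followed the most people.
top = events['Actor'].value_counts().head(limit)

# Reverse to ascending order so that the largest bar ends up at the top
# of a horizontal bar chart, like the sorted tail used in this notebook.
top_for_barh = top.iloc[::-1]
print(top_for_barh)
```

Which variant to use is mostly a matter of taste; the Count column has the advantage that the same frame can be summed by any other key later on.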

Let's look at threejs-cn's activity aggregated by date.

In [4]:
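# Copy the selection so that adding a column does not modify a view of df_follows
bot = df_follows[df_follows['Actor'] == 'threejs-cn'].copy()
bot['date'] = bot.index.date
bot.groupby('date')[['Count']].sum()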
Out[4]:
            Count
date
2013-10-18      1
2013-11-05      5
2013-11-14  40934

Obviously, this account is responsible for the spike we saw. Interestingly, one day earlier, on November 13, 2013, there was a DDoS attack on GitHub pages, but this might just be a coincidence. In any case, I assume that it is no longer possible to follow 40k users on GitHub in a single day.
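
To attribute a spike to a single account more systematically, one could compute each user's share of the events on the busiest day. This is a rough sketch on invented data, assuming the same datetime index and Count column as above; it is not the analysis that produced the numbers in this post.

```python
import pandas as pd

# Made-up events: one account causes most of the activity on 2013-11-14.
idx = pd.to_datetime([
    '2013-11-13 10:00', '2013-11-13 11:00',
    '2013-11-14 09:00', '2013-11-14 09:01', '2013-11-14 09:02', '2013-11-14 12:00',
])
events = pd.DataFrame({
    'Actor': ['alice', 'bob', 'bot', 'bot', 'bot', 'alice'],
    'Count': 1,
}, index=idx)

# Total number of events per calendar day.
daily_total = events['Count'].resample('D').sum()

# The busiest day, and the share of that day's events caused by each actor.
peak_day = daily_total.idxmax()
on_peak = events[events.index.normalize() == peak_day]
share = on_peak.groupby('Actor')['Count'].sum() / daily_total[peak_day]
print(peak_day.date(), share.sort_values(ascending=False))
```

A share close to 1 for a single actor would point to one account being behind the spike, as appears to be the case for threejs-cn on 2013-11-14.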

Most Followed Users

Let's now look at who gained the most followers last year, or more precisely from 2013-01-01 to 2013-12-11.

In [5]:
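# DataFrame.sort() was removed from pandas; sort_values() replaces it. Only Count is summed.
top_followed = df_follows.groupby('Target')[['Count']].sum().sort_values('Count').tail(limit)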
graphs.barh(top_followed.index,
            top_followed.Count,
            'img/%d-most-followed-github-users-2013.png' % limit,
            figsize=(12, 16),
            title='%d most followed GitHub users from 2013-01-01 to 2013-12-11' % limit,
            footer=footer)

I wonder why Tom Preston-Werner (mojombo) got so many more followers than his co-founder Chris Wanstrath (defunkt). Both of them have some pretty popular repositories on GitHub. Maybe looking at their timelines of follow events reveals something.

Top Followed Users Timelines

The code below plots a line graph for each of the 20 most followed users in 2013.

In [6]:
top = top_followed.tail(20)

fig, axes = plt.subplots(nrows=10, ncols=2)
fig.suptitle('GitHub top followed users timelines for 2013', y=1.01, fontsize=14)
fig.set_figheight(32)
fig.set_figwidth(14)

for idx, coords in enumerate(itertools.product(range(10), range(2))):
    ax = axes[coords[0], coords[1]]
    user = top.index[idx]
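    # Copy the selection before adding a column, then sum only the Count column per day
    user_follows = df_follows[df_follows['Target'] == user].copy()
    user_follows['date'] = user_follows.index.date
    grouped = user_follows.groupby('date')[['Count']].sum()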
    grouped.plot(ax=ax, legend=False, rot=45)
    ax.set_title(user)
    ax.set_xlabel('', visible=False)

fig.text(0, 0, footer, fontsize=12)
fig.tight_layout()
plt.show()

The graphs for mojombo and defunkt show some spikes on the same dates, but also a few for mojombo only. Moreover, mojombo's follow counts are higher on average. But the graphs themselves do not indicate what might explain the difference in followers.
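
Comparing two timelines by eye is error-prone, so the daily counts could also be put side by side in one table. The sketch below uses invented events for two placeholder users, user_a and user_b, and only illustrates the idea.

```python
import pandas as pd

# Made-up follow events for two placeholder users.
idx = pd.to_datetime([
    '2013-03-01', '2013-03-01', '2013-03-01',
    '2013-03-02', '2013-03-02',
    '2013-03-03',
])
events = pd.DataFrame({
    'Target': ['user_a', 'user_a', 'user_b', 'user_a', 'user_b', 'user_a'],
    'Count': 1,
}, index=idx)

# One row per day and one column per followed user; missing combinations become 0.
daily = (events.assign(date=events.index.date)
               .groupby(['date', 'Target'])['Count']
               .sum()
               .unstack(fill_value=0))

# Days on which user_a gained followers but user_b did not.
only_a = daily[(daily['user_a'] > 0) & (daily['user_b'] == 0)]
print(daily)
print(only_a)
```

Applied to mojombo and defunkt, a table like this would list the dates of the spikes that only one of them shows, which could then be checked against news from those days.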

Some of the other graphs look interesting too, e.g. that of user funkenstein, who gained almost no followers for most of the year and then more than 2,000 in September.

In [7]:
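# Copy the selection so that adding a column does not modify a view of df_follows
fs = df_follows[df_follows['Target'] == 'funkenstein'].copy()
fs['date'] = fs.index.date
fs.groupby('date')[['Count']].sum().sort_values('Count').tail()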
Out[7]:
            Count
date
2013-10-22      2
2013-09-19      3
2013-10-08      3
2013-09-24    567
2013-09-25   2249

When you look at his profile now, he has "only" 30 followers, so there is definitely something fishy about that. I checked some of the original data files to make sure this is not an error that occurred during pre-processing. Let's also check whether the follow events came from many different users rather than the same one over and over.

In [8]:
fs.groupby('Actor')[['Count']].sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Index: 2846 entries, Abachis to wlaurance
Data columns (total 1 columns):
Count    2846  non-null values
dtypes: int64(1)

Well, that is also not the case, so this remains a bit of a mystery. Somehow, I don't believe that he just lost all these followers in the past few months.
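
A more direct way to check for repeated followers is to compare the number of events with the number of distinct actors. This is a small sketch on made-up data with placeholder names; the variable names are mine and not part of the notebook.

```python
import pandas as pd

# Made-up follow events for a single target; actor "a" appears twice.
events = pd.DataFrame({
    'Actor':  ['a', 'b', 'c', 'a'],
    'Target': ['x', 'x', 'x', 'x'],
})

# Total number of follow events versus number of distinct following users.
total_events = len(events)
distinct_followers = events['Actor'].nunique()

# Actors who appear more than once, i.e. who followed the target repeatedly.
per_actor = events['Actor'].value_counts()
repeat_followers = per_actor[per_actor > 1]
print(total_events, distinct_followers, list(repeat_followers.index))
```

If both numbers are close, as the 2846 distinct actors above suggest for funkenstein, repeated follow and unfollow cycles by a handful of accounts can be ruled out as the explanation.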

For the other graphs with unusual spikes, the numbers are a lot lower and may well be explained by repos of these users getting popular on Hacker News or Reddit.

That's it for my look at follow events. One of the next things I'll explore will be fork and pull request events of repositories, so stay tuned.


This post was written by Ramiro Gómez (@yaph).

Tags: coderstats git github pandas matplotlib notebook visualization
