GitHub in 2013: Follow Events
In this IPython notebook I give a brief overview of GitHub follow events in 2013 based on data obtained from the GitHub Archive. This is a follow-up post to the Event Types 2013 notebook. The source code of this notebook is available in this GitHub repository; the CSV file with the follow events is not included due to its size. If there is demand, I will look into uploading it to a data sharing service. Suggestions for which service to use are welcome.
Preliminaries
First load the necessary packages, set a global footer text and a limit for the bar charts below.
import datetime
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from utils import graphs
limit = 30
footer = 'Data: githubarchive.org - Source: coderstats.github.io'
Read the follow events from a compressed CSV file, set the first column as the index and specify that it contains dates. Also add a Count column and set its values to 1, so that events can be aggregated by followers and followed users.
df_follows = pd.read_csv('csv/githubarchive/2013/follow_events.csv.bz2', compression='bz2', index_col=0, parse_dates=[0])
df_follows['Count'] = 1
df_follows.head()
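To illustrate why the constant Count column is handy, here is a minimal sketch on a tiny made-up frame. The user names and timestamps are invented for illustration; only the column names Actor, Target and Count follow the notebook.

```python
import pandas as pd

# Hypothetical follow events; names and timestamps are made up.
events = pd.DataFrame(
    {
        'Actor': ['alice', 'alice', 'bob', 'alice'],
        'Target': ['carol', 'dave', 'carol', 'erin'],
    },
    index=pd.to_datetime([
        '2013-01-01 10:00', '2013-01-01 11:00',
        '2013-01-02 09:00', '2013-01-03 12:00',
    ]),
)
events['Count'] = 1  # every row represents exactly one follow event

# Summing the constant column counts the events per follower ...
per_actor = events.groupby('Actor')['Count'].sum()

# ... which yields the same numbers as counting the rows of each group.
per_actor_size = events.groupby('Actor').size()

print(per_actor)
```

Both variants give the same counts; the explicit column mainly keeps the later groupby and plotting code uniform.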
Most Following Users
One eye-catching observation in the previous notebook was a huge spike in the follow events timeline, so let's shed some light on this by looking at the users who followed the most people.
# DataFrame.sort() was removed in pandas 0.20; sort_values() is the current equivalent.
top_followers = df_follows.groupby('Actor')[['Count']].sum().sort_values('Count').tail(limit)
graphs.barh(top_followers.index,
            top_followers.Count,
            'img/%d-most-following-github-users-2013.png' % limit,
            figsize=(12, 16),
            title='%d most following GitHub users from 2013-01-01 to 2013-12-11' % limit,
            footer=footer)
Particularly user/bot threejs-cn, who unsurprisingly doesn't exist anymore on GitHub, stands out here, although a few of the others have quite impressive follow counts as well.
Let's look at threejs-cn's activity aggregated by date.
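# Copy the selection so that adding a column does not write to a view of df_follows.
bot = df_follows[df_follows['Actor'] == 'threejs-cn'].copy()
bot['date'] = bot.index.date
bot.groupby('date')[['Count']].sum()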
This is obviously the account responsible for the spike we saw. Interestingly, one day earlier, on November 13, 2013, there was a DDoS attack on GitHub Pages, but this might just be a coincidence. In any case, I assume that it is no longer possible to follow 40k users on GitHub in a single day.
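As an aside, daily totals can also be computed without a helper date column by resampling on the datetime index. The following is a minimal sketch on invented timestamps, assuming the index is a DatetimeIndex as produced by the read_csv call above; the name bot_events is hypothetical.

```python
import pandas as pd

# Hypothetical events of a single account; timestamps are made up.
bot_events = pd.DataFrame(
    {'Count': [1, 1, 1, 1]},
    index=pd.to_datetime([
        '2013-11-13 23:50', '2013-11-14 00:10',
        '2013-11-14 08:30', '2013-11-14 17:45',
    ]),
)

# Bucket the events into calendar days and sum the counts per day.
daily = bot_events['Count'].resample('D').sum()

# The day with the highest number of events.
busiest_day = daily.idxmax()

print(daily)
print(busiest_day)
```

Unlike grouping by a date column, resample also emits days without any events (with a sum of 0), which can be useful for timelines.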
Most Followed Users
Let's now look at who gained the most followers last year, or more exactly from 2013-01-01 to 2013-12-11.
# DataFrame.sort() was removed in pandas 0.20; sort_values() is the current equivalent.
top_followed = df_follows.groupby('Target')[['Count']].sum().sort_values('Count').tail(limit)
graphs.barh(top_followed.index,
            top_followed.Count,
            'img/%d-most-followed-github-users-2013.png' % limit,
            figsize=(12, 16),
            title='%d most followed GitHub users from 2013-01-01 to 2013-12-11' % limit,
            footer=footer)
I wonder why Tom Preston-Werner (mojombo) got so many more followers than his co-founder Chris Wanstrath (defunkt). Both of them have some pretty popular repositories on GitHub. Maybe looking at their timelines of follow events reveals something.
Top Followed Users Timelines
The code below will plot multiple line-graphs for the 20 most followed users in 2013.
top = top_followed.tail(20)
fig, axes = plt.subplots(nrows=10, ncols=2)
fig.suptitle('GitHub top followed users timelines for 2013', y=1.01, fontsize=14)
fig.set_figheight(32)
fig.set_figwidth(14)
for idx, coords in enumerate(itertools.product(range(10), range(2))):
    # coords is a (row, column) position in the 10x2 grid of subplots
    ax = axes[coords[0], coords[1]]
    user = top.index[idx]
    # copy the selection so that adding a column does not write to a view of df_follows
    user_follows = df_follows[df_follows['Target'] == user].copy()
    user_follows['date'] = user_follows.index.date
    grouped = user_follows.groupby('date')[['Count']].sum()
    grouped.plot(ax=ax, legend=False, rot=45)
    ax.set_title(user)
    ax.set_xlabel('', visible=False)
fig.text(0, 0, footer, fontsize=12)
fig.tight_layout()
plt.show()
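A note on the loop above: itertools.product(range(10), range(2)) walks the subplot grid row by row, so the running index from enumerate maps to a grid position the same way divmod does. The small sketch below only checks this mapping and does not draw anything; the names nrows, ncols and positions are introduced here for illustration.

```python
import itertools

nrows, ncols = 10, 2

# All (row, column) positions of the grid in row-major order.
positions = list(itertools.product(range(nrows), range(ncols)))

for idx, (row, col) in enumerate(positions):
    # idx 0 -> (0, 0), idx 1 -> (0, 1), idx 2 -> (1, 0), ...
    assert (row, col) == divmod(idx, ncols)

print(positions[:4])
```

Because top is sorted in ascending order, the first panel in the upper left shows the user with the fewest follows among the top 20, and the last panel in the lower right the most followed one.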
The graphs for mojombo and defunkt show some spikes on the same dates, but also a few for mojombo only. Moreover, mojombo's follow counts are higher on average. But the graphs themselves do not indicate what might explain the difference in followers.
Some of the other graphs look interesting too, e.g. user funkenstein, who got pretty much no followers throughout most of the year and then more than 2000 in September.
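# Copy the selection so that adding a column does not write to a view of df_follows.
fs = df_follows[df_follows['Target'] == 'funkenstein'].copy()
fs['date'] = fs.index.date
fs.groupby('date')[['Count']].sum().sort_values('Count').tail()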
When you look at his profile now, he has "only" 30 followers, so there is definitely something fishy about that. I checked some of the original data files to make sure this is not an error introduced during pre-processing. Let's also check whether the follow events came from the same few users over and over.
fs.groupby('Actor')[['Count']].sum()
Well, that is also not the case, so this remains a bit of a mystery. Somehow, I don't believe that he just lost all these followers in the past few months.
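The check for repeated followers can also be condensed into a few numbers. This is a minimal sketch on a made-up frame, assuming the same Actor and Count columns as above; the user names and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical follow events for one target; 'u1' follows twice.
follows = pd.DataFrame({
    'Actor': ['u1', 'u2', 'u3', 'u1'],
    'Target': ['someone', 'someone', 'someone', 'someone'],
    'Count': [1, 1, 1, 1],
})

per_actor = follows.groupby('Actor')['Count'].sum()

total_events = int(per_actor.sum())              # all follow events
distinct_followers = follows['Actor'].nunique()  # unique accounts
max_per_actor = int(per_actor.max())             # most events by one account

# Share of events that are repeats by an account already seen.
repeat_share = 1 - distinct_followers / total_events

print(total_events, distinct_followers, max_per_actor, repeat_share)
```

A repeat share close to 0 means that nearly every follow event came from a different account, which is the situation described above.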
For the other graphs with unusual spikes, the numbers are a lot lower and may well be explained by repos of these users getting popular on Hacker News or Reddit.
That's it for my look at follow events. One of the next things I'll explore will be fork and pull request events of repositories, so stay tuned.
This post was written by Ramiro Gómez (@yaph).
Tags: coderstats git github pandas matplotlib notebook visualization