GitHub in 2013: Event Types and Commits
In this IPython notebook I present a brief visual overview of GitHub event types in 2013 based on data obtained from the GitHub Archive.
The GitHub Archive makes data available as gzipped files that each contain a stream of JSON encoded GitHub events. There is one archive file for each hour of each day. I downloaded all the files available for 2013 (9 files/hours are missing) and pre-processed them to create the CSV files used here. The pre-processing steps won't be covered in this notebook.
The source code of this notebook and the data files used are available in this GitHub repository.
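Although the actual pre-processing is not shown here, the following is a minimal sketch of how the event types in one hourly archive file could be tallied. It is not the code used to create the CSV files. The function names are made up for illustration, and the sketch assumes each event is a JSON object with a type field and that the objects may be written back to back without a separating newline.

```python
import gzip
import io
import json
from collections import Counter


def iter_json_objects(text):
    """Yield each JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def count_event_types(path_or_file):
    """Count events by their 'type' field in one gzipped archive file."""
    with gzip.open(path_or_file, 'rt', encoding='utf-8') as fh:
        text = fh.read()
    return Counter(event.get('type') for event in iter_json_objects(text))


# Hypothetical in-memory archive standing in for a downloaded file.
raw = '{"type": "PushEvent"}{"type": "PushEvent"}\n{"type": "WatchEvent"}'
counts = count_event_types(io.BytesIO(gzip.compress(raw.encode('utf-8'))))
print(counts)
```

Reading the whole file into memory keeps the sketch short; for large files a streaming parser would be the more careful choice.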
Preliminaries
First load the necessary packages, set some matplotlib configuration parameters and create a list of short weekday names used for labels later on.
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['axes.grid'] = False
plt.rcParams['grid.linewidth'] = 0
weekdays_short = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
Read the event counts by day for 2013 from a CSV file, set the first column as the index, and specify that it contains dates. Also remove the Event suffix from the column labels.
df_events = pd.read_csv('csv/githubarchive/2013/event_counts_by_day.csv', index_col=0, parse_dates=[0])
df_events.columns = df_events.columns.map(lambda x: x.replace('Event', ''))
df_events.head()
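For readers without the data files, here is a small sketch of the same loading step on an in-memory CSV. The layout (a date column followed by one column per event type) and all numbers are assumptions made for illustration, not an excerpt of the real file.

```python
import io

import pandas as pd

# Hypothetical two-day excerpt; the values are made up.
csv_text = """date,PushEvent,WatchEvent,ForkEvent
2013-01-01,100,40,10
2013-01-02,120,35,12
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])
# Strip the literal 'Event' suffix from every column label.
toy.columns = toy.columns.str.replace('Event', '', regex=False)
print(toy)
```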
Events by Type
Let's first look at how frequent the different event types are in total.
event_sums = df_events.sum(axis=0)
event_sums = event_sums.sort_values()
event_sums.plot(kind='barh', figsize=(12, 8))
plt.show()
Unsurprisingly, pushes occur most often; in fact, almost as often as all other events combined.
# event_sums is sorted in ascending order, so pushes are the last entry.
sums = [int(event_sums.iloc[:-1].sum()), int(event_sums.iloc[-1])]
fig = pd.Series(sums, index=('All but Pushes', 'Pushes')).plot(kind='bar')
fig.set_xticklabels(('All but Pushes', 'Pushes'), rotation=0)
fig.annotate(sums[0], [0, sums[0]], ha='left', textcoords='offset points', xytext=(75, 5))
fig.annotate(sums[1], [0, sums[1]], ha='left', textcoords='offset points', xytext=(225, 5))
plt.show()
Event Timelines
Now let's look at how the 17 different GitHub event types evolved in 2013 by plotting a line graph for each type with time on the x-axes and the total number of events aggregated by day on the y-axes. I also add a column with the total number of events to have an even number of data series to plot.
df_events['Total'] = df_events.sum(axis=1)
fig, axes = plt.subplots(nrows=9, ncols=2)
fig.suptitle('GitHub event type timelines for 2013', y=1.01, fontsize=14)
fig.set_figheight(26)
fig.set_figwidth(14)
cols = df_events.columns
lencols = len(cols)
for idx, coords in enumerate(itertools.product(range(9), range(2))):
    if idx < lencols:
        ax = axes[coords[0], coords[1]]
        ax.set_title(cols[idx])
        df_events[cols[idx]].plot(ax=ax)
        ax.set_xlabel('', visible=False)
fig.tight_layout()
We can clearly see that GitHub grew in 2013. One pattern common to all graphs is an increase at the start of the week and a drop-off towards the end. Download and Follow events were obviously removed from the public timeline in the past year, whereas Release events were introduced at the beginning of July.
df_events['Release'].dropna().head()
There are extreme spikes in some of the graphs, for example at the end of November in the Follow events. When exactly does this spike occur?
df_events[['Follow']].dropna().sort_values('Follow').tail()
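Sorting and taking the tail is one way to find the spike. As a small sketch on made-up numbers (the dates and counts below are hypothetical and say nothing about the real data), the pandas methods idxmax and nlargest answer the same question more directly.

```python
import pandas as pd

# Hypothetical daily counts with one obvious outlier.
days = pd.date_range('2013-01-01', periods=5, freq='D')
follows = pd.Series([900, 950, 12000, 1000, 980], index=days, name='Follow')

peak_day = follows.idxmax()    # index label (date) of the largest value
top_three = follows.nlargest(3)  # the three largest days, biggest first
print(peak_day)
print(top_three)
```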
The events DataFrame does not allow us to dig deeper to see what might have caused this spike, but I'll keep this in mind when looking at Follow events in a future notebook. Update: the follow events notebook is published.
Event Types by Weekday
First add a weekday column to the data frame, which is very easy since the index contains a date. Then group by weekday and aggregate the grouped events calculating the mean and median values.
# 0 = Monday
df_events['Weekday'] = df_events.index.weekday
grouped = df_events.groupby('Weekday').agg(['mean', 'median'])
The next step is to plot a bar chart for each event type, showing the distributions of mean and median event counts per weekday.
keys = grouped.keys()
cols = grouped.columns
lencols = len(cols)
fig, axes = plt.subplots(nrows=6, ncols=3)
fig.suptitle('Mean and median frequencies of GitHub event types per weekday', y=1.01, fontsize=14)
fig.set_figheight(20)
fig.set_figwidth(14)
for idx, coords in enumerate(itertools.product(range(6), range(3))):
    if idx < lencols:
        ax = axes[coords[0], coords[1]]
        start = idx * 2
        # Select the mean and median columns of this event type by position.
        grouped.iloc[:, [start, start + 1]].plot(ax=ax, kind='bar', legend=False)
        ax.set_title(cols[start][0])
        ax.set_xticklabels(weekdays_short, rotation=0)
        ax.set_xlabel('', visible=False)
fig.tight_layout()
I haven't figured out a good way to add just a single legend for the whole multi-plot, which shows that blue is mean and purple median.
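One possible approach, shown as a sketch on made-up data rather than a tested fix for the figure above: since every subplot draws the same two series, the legend handles of the first subplot can be passed to the figure-level legend method. The toy frame and its values are hypothetical stand-ins for the grouped data.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the grouped frame: two event types, mean and median each.
toy = pd.DataFrame(
    {('Push', 'mean'): [5, 6, 7], ('Push', 'median'): [4, 6, 6],
     ('Fork', 'mean'): [2, 3, 1], ('Fork', 'median'): [2, 2, 1]},
    index=['Mon', 'Tue', 'Wed'])

fig, axes = plt.subplots(nrows=1, ncols=2)
for ax, name in zip(axes, ['Push', 'Fork']):
    toy[name].plot(ax=ax, kind='bar', legend=False)
    ax.set_title(name)

# All subplots share the same series, so the handles of the first one suffice.
handles, labels = axes[0].get_legend_handles_labels()
legend = fig.legend(handles, labels, loc='upper right')
```

The legend is attached to the figure instead of a single subplot, so it appears only once.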
For most event types these graphs confirm the weekend drop-offs we already saw in the timelines, with more activity on Sundays than on Saturdays. One notable exception is Download events, where Sunday is on average the 3rd most active day of the week.
Moreover, there are considerable differences between mean and median values for Delete and Gist events. Looking back at the Gist timeline we see a huge spike, which must have happened on a Tuesday.
df_events[['Gist', 'Weekday']].dropna().sort_values('Gist').tail(3)
Again, we cannot figure out what happened that day using the current dataset, but it's something to look at more deeply in one of the next posts.
Commits
The GitHub API doesn't have a dedicated Commit event type; instead, commits are contained in push events. Data for push events per day is aggregated in the CSV file loaded next. The number of commits per day is kept in the Event Size column.
df_pushes = pd.read_csv('csv/githubarchive/2013/pushes_by_day.csv', index_col=0, parse_dates=[0])
Since the number of commits per push event varies, let's look at the ratios of commits to pushes over the course of the year.
df_pushes['Commit Push Ratio'] = df_pushes['Event Size'] / df_pushes['Event Count'].astype(float)
df_pushes['Commit Push Ratio'].plot(figsize=(14, 10), title='Ratios of commits per push over time')
plt.show()
There is quite a bit of variation here, more than I had expected. A possible explanation is that some pushes contain a very high number of commits, e.g. when a feature branch is pushed for the first time, and some a very low number, which would be the case for hotfixes.
Commits over Time
The last graph in this overview is commits over time. This time not as a line chart, but using a visualization similar to the one you see on GitHub user pages for their contributions.
To do so, we group by weekday and week number, sum up the commit counts (Event Size), and store them in a list of lists: one list per weekday, containing the commit counts for each week number of the year 2013.
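# ISO week number; replaces the index.week attribute removed in newer pandas versions
df_pushes['Week'] = df_pushes.index.isocalendar().week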
df_pushes['Weekday'] = df_pushes.index.weekday
grouped = df_pushes.groupby(['Weekday', 'Week']).sum()
image = [grouped['Event Size'][i] for i in range(7)]
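The same weekday-by-week layout can also be built as a single table. The sketch below uses made-up data (four full weeks of hypothetical commit counts) and pivot_table, which is an alternative to the list of lists above, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one commit count per day for four full weeks starting on a Monday.
days = pd.date_range('2013-01-07', periods=28, freq='D')
toy = pd.DataFrame({'Event Size': np.arange(28)}, index=days)
toy['Week'] = toy.index.isocalendar().week.astype(int)
toy['Weekday'] = toy.index.weekday

# Rows become weekdays (0 = Monday), columns become ISO week numbers.
matrix = toy.pivot_table(index='Weekday', columns='Week',
                         values='Event Size', aggfunc='sum')
print(matrix)
```

A table like this can be passed to imshow directly, and its column labels give the week numbers for the x-axis ticks.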
A few things to note about the following plot: the visibility of the spines is set to False to hide the lines around the plot, and the ticks are set explicitly, the y-axis ticks to the short weekday names and the x-axis ticks to the numbers from 1 to 52. Otherwise the x-ticks would range from 0 to 51, corresponding to the list indices of the image data passed to the imshow method.
fig, ax = plt.subplots(figsize=(16, 9))
ax.imshow(image, cmap=plt.cm.Greens, interpolation='nearest')
ax.set_title('Commits by weekday and week')
for pos in ['top', 'right', 'bottom', 'left']:
    ax.spines[pos].set_visible(False)
plt.yticks(range(7), weekdays_short)
plt.xticks(range(52), range(1, 53))
plt.show()
Using this type of visualization, the increase of commits over time is not as clearly visible as in the line graphs above, but it makes it easy to spot days with unusual activity, especially days with very high numbers of commits, that are worth exploring further.
Summary
In this notebook I presented an overview of the different types of events that occurred on GitHub in 2013 focusing on the evolution of event types over time and their distribution across weekdays.
A few of the things we could see are that GitHub grew in the past year and that there are some days with extremely high activity for some event types, which call for further investigation.
This is just the tip of the iceberg. There is a lot more information contained in the GitHub Archive data, which I'm going to present in other GitHub Archive posts.
This post was written by Ramiro Gómez (@yaph).
Tags: coderstats git github pandas matplotlib notebook visualization