Last week GitHub announced that the data collected over at the GitHub Archive was made available as a public dataset on Google's BigQuery web service. Pretty exciting news for data analysts, considering that the timeline dataset currently contains more than 7 million records and is growing quickly.
To see what developers come up with analyzing the GitHub timeline and language correlation datasets, they started a data challenge which is open for submissions until May 21st.
Being interested in both natural and programming languages, I decided to expand on this analysis of profanity in git commit messages per programming language with a more general exploration of expressions of emotions as well as looking at messages indicating issues and those containing swear words.
Reading millions of commit messages and categorizing them is certainly not a task I'd be able to accomplish in less than 3 weeks and honestly wouldn't want to. Instead, I assembled lists of words expressing emotions using the resources listed at the bottom of this article and wrote corresponding database queries to count occurrences of these words grouped by programming languages.
Then for each language I calculated the percentage of messages that meet the criteria, i. e. contain certain expressions, in relation to the total number of commits for that language. Only non-empty commit messages of projects with an assigned language were included.
Since the number of commits varies considerably across different languages only those with at least 40,000 commits in the analyzed time frame were accounted for, resulting in a total of of 2,526,650 commit messages within the set of languages displayed in the bar chart below:
Even though a minimum of 40,000 samples per languages seemed adequate to me (I wanted to include Perl), different sample sizes result in varying accuracy, which is a problem and a bit like comparing apples and oranges. Statisticians will probably deny any value of such an approach, still I think it can serve to develop some hypotheses.
Actually, this is not the only problem. Assuming a particular emotion just because a message contains a certain expression does not work in many cases. The expression does not necessarily refer to the person who wrote the message, moreover, natural language is full of ambiguity, there are rhetorical devices like irony, and messages can be written in other languages than English, where the same word may have a completely different meaning.
So to make sure that my approach is not completely messed up, I spent quite some time looking through the resulting messages, refined the queries and excluded many phrases with too much ambiguity. Still, I anticipate that there is no conclusion, this is solely a descriptive study, take it with a grain of salt and enjoy reading the example messages ;)
The rest of this article is divided into sections for the analyzed emotions: anger, joy, amusement and surprise. Additionally there are two separate sections for messages indicating issues and messages containing swear words, which likely indicate emotions, but you can't say which without further analysis.
Each of the following sections contains a regular expression used in the database query, a chart for the corresponding emotion, showing the percentages of commit messages per programming language, that match the expression, and a few example messages.
I do not link to the individual commits, as the URLs appear truncated in the githubarchive dataset. I presume you know how to use search engines, if you really want to track down a particular message.
One last warning before you go on reading at your workplace: this article, most notably the part about swearing, contains more than seven dirty words.
Let's start with anger, an emotion one would expect to be expressed rather frequently in commit messages, as things often go wrong and the process of tracking down bugs and fixing them can be pretty annoying.
What stands out in the anger chart compared to the rest is the prominent gap between the "most angry language" VimL and the other languages. Any theories about why this is so?
For similar reasons one can expect anger one can also expect joy. Commits often solve problems, which should make developers happy. While I'm at it, I decided to omit the word "happy" itself. It is used frequently, but often in phrases like "make X happy" or "X is happy", e. g. add readme to make github happy or in negations like tar is not happy on linux - grr, not really indicating a joyful experience.
Another edge case I excluded is the word "pleasant", often used in phrases like "pleasant to work with", which do convey some kind of joy, but also in "make something more pleasant", which does not, does it? While skimming through messages I found this nice example of irony, which I didn't want to keep for myself: regexes are fun and pleasant to work with, in the same way that oranges are purple.
To detect amusement I heavily relied on Internet slang expressions like "lol" or "rofl" and onomatopoeia like "haha" or "hehe". Looking at the numbers programming cannot be very amusing, but there are probably better ways recognize this emotion.
My attempt to detect surprise, was the least satisfying regarding the low numbers, except for sadness which I left out for that reason. There would have been more surprise if I included "wow", but also more "World of Warcraft".
Still, I'm a bit surprised of the result, not because Perl is the winner, but because PHP does not seem to surprise people that often, which does not reflect my experience with this language at all.
Every developer knows that bugs creep into code all the time, so it's no surprise that messages about them are by far the most frequent ones compared to everything else in this analysis. Something that becomes even more apparent considering the few words I looked for.
A fun fact I have to add: 48.4% of all messages containing "IE", "InternetExplorer" or "Internet Explorer" also satisfy the issue detecting regex below. It's probably so few because I omitted the word problem.
Now to the final and most hilarious part of this analysis: exploring occurrences of swear words, which likely indicate some kind of issue with the code, language, framework or whatever. The chart looks pretty balanced compared to the previous ones and VimL wins again.
Assembling a list of swear words is pretty hard. You easily end up with hundreds or even thousands of words and variations, so you somehow need to limit them. One group of words I left out are those referring to genitals, because there is such a wealth of expressions and running some queries indicated that they are used rarely in commit messages.
To only include frequently used words, I searched for references and found this article about the most popular swear words on network TV in 2010. I created the following regular expression based on it with a few words and acronyms added.
Analyzing emotions in texts based on the occurrence of expressions is certainly a disputable approach. As stated above I am aware of several flaws that skew the results and do not claim that this descriptive study has any scientific value.
Improvements to the approach in order to get more reliable results that could serve as the basis for applying inferential methods and draw conclusions include:
What other improvements come to your mind and why do you think they'd work better? Let us know in the comments below.
Instead of ending with a statement like "X is the most friendly language because of Y", which wasn't my intention anyway, I conclude that exploring GitHub data can be a lot of fun and I'm curious to see what other developers get out of the GitHub Archive.
The database queries and the scripts for calculating percentages and generating charts are available from this GitHub repo.