This idea came to me out of the blue. I was scrolling through Y Combinator's Hacker News board in search of inspiration. I noticed that some posts (very insightful ones, I thought) ended up at the bottom of the list, other posts were popular but had no comments, and some triggered lots of comments but were ranked very low. I can imagine that being popular on Hacker News means a lot to a contributor: s/he gets a ton of views, and the post generates heated discussion and may encourage other people to share the content or reference it in their work. I knew this was my topic. I decided to find out how to become popular on Hacker News. Let's see if I managed to crack this one :)
Since the information on Hacker News is pretty well structured, the easiest way to analyze its posts is to turn the website into a dataset. Just like with my Law School Admission blog post, I used KimonoLabs to scrape the data from Hacker News.
My original dataset contained 1,163 posts submitted by 900 users from 680 sources between July 2010 and March 2015.
Each row captured the following information:
- News Topic (title of the news)
- Source (where the post came from)
- # Comments (how many times readers commented on the post)
- Time Elapsed (how recent the post was)
- User (who added the post)
- # Points (how many times readers up-voted the post)
- Ranking (how the post ranked among others)
Generally, every dataset downloaded (or scraped) from the web needs a certain amount of cleaning and transformation. In my previous post (about Law School Admissions) I mainly focused on data transformations, whereas today my data doesn't need as much massaging. Instead, I have to do a fair amount of standardization to make the data uniform and "analyzable".
And here goes...
My Lesson #3
Before diving into data analysis one needs to make sure that the data s/he is dealing with is in its cleanest state. That is, all fields are standardized so that big round green apples are compared to fruits of the same breed, size, color and shape.
With this dataset I had to standardize the following fields:
- Time Elapsed
  - Transform text values into numerical values (by separating "2 hours ago" into "2" and "hours" in two separate fields)
  - Standardize time elapsed so it is displayed in the same measurement units. In my case I had days, hours and minutes, so I recalculated Time Elapsed field in minutes
- # Comments
  - Transform text values into numerical values (by transforming "2 comments" into "2")
- # Points
  - Transform text values into numerical values (by transforming "2 points" into "2")
I know transformations like these sound trivial, but I have come across many situations where analysts transform text fields into numerical ones but then forget to convert them to the same measurement unit. This is especially common with healthcare data, where patients' test results reported in mg/dL and g/L get analyzed together without any conversion. This is a silly mistake that leads to pretty bad misinterpretations of the data.
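A minimal Python sketch of this kind of standardization might look like the following. The function names and the exact string formats ("2 hours ago", "2 comments", "2 points") are assumptions based on the examples above, not the code used for this post.

```python
import re

# Minutes per unit, for the three units mentioned in the post.
MINUTES_PER_UNIT = {"minute": 1, "hour": 60, "day": 24 * 60}


def elapsed_to_minutes(text):
    """Turn a string like '2 hours ago' into a number of minutes."""
    match = re.fullmatch(r"\s*(\d+)\s+(minute|hour|day)s?\s+ago\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized elapsed-time value: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * MINUTES_PER_UNIT[unit]


def leading_count(text):
    """Turn '2 comments' or '15 points' into the integer at the front."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no leading number in: {text!r}")
    return int(match.group(1))


print(elapsed_to_minutes("2 hours ago"))  # 120
print(elapsed_to_minutes("3 days ago"))   # 4320
print(leading_count("2 comments"))        # 2
```

Raising an error on anything unexpected is deliberate here: a value that silently stays in the wrong unit is exactly the mistake described above.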
Dataset Prep for Text Mining
For this blog post I actually prepared 2 datasets. The original dataset looks very similar to the one displayed above (after standardization & cleanup) and the other dataset is prepared specifically for text mining.
My goal was to decompose the News Topic field into the key words each title consisted of. I used the R tm package for it. Here are the steps I took to parse the News Topic field:
- Create text corpora out of News Topic field (apparently tm package cannot be applied to regular data frames in R)
- Eliminate extra white space between words in each Topic Field
- Convert all words to lower case (otherwise, "awesomeness" would be a different word than "Awesomeness")
- Stem all words in my corpus. This is actually a very important step. When R performs stemming it identifies the root of each word, so that different variants of the same word are treated the same. If I don't do stemming I will end up having words like "program", "programmer" and "programming" listed separately, but in reality I want them to belong to the same group.
- When I was done with the transformations on the corpus I had to convert it back into a matrix, so that I could export something like this
This matrix lists every word parsed out of News Topic field along with Post ID which I can later on link back to my original dataset (R preserved the order of records, so that row 21 in this dataset corresponds to 21st row in my original table).
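The parsing steps can be sketched in plain Python as follows. This is only an illustration: the post itself uses the R tm package, `toy_stem` is a deliberately crude stand-in for a real stemmer, and the two titles are invented.

```python
import re


def toy_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "programm" -> "program".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def title_to_rows(post_id, title):
    """Return (post_id, stemmed_word) pairs for one title."""
    # Lowercasing and extracting letter runs also discards extra whitespace.
    words = re.findall(r"[a-z]+", title.lower())
    return [(post_id, toy_stem(word)) for word in words]


# Invented titles; post IDs follow row order, starting at 1.
titles = ["Awesome   Programming tips", "A programmer writes programs"]
rows = [
    row
    for post_id, title in enumerate(titles, start=1)
    for row in title_to_rows(post_id, title)
]

for row in rows:
    print(row)
```

Because the post ID is simply the row position, each (post ID, word) pair can be joined back to the original table, which is the same trick described above.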
I then obtained basic counts on these words and identified the top 30 most frequently used ones. I also excluded frequent words that don't carry any critical information (e.g. "the", "and", "with", etc.). I ended up with the 53 most common words used in Hacker News posts. Because every post in my new dataset had to contain at least one of these 53 words, my overall sample size went down from 1,163 to 594 posts (about 2x down). This is not a bad sample size reduction given that I went from a list of 3,031 words to 53 words (57x down).
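The counting and filtering step could look roughly like this in Python. The stop-word set holds only the three examples named above (a real list would be much longer), and the rows are made up to show the mechanics.

```python
from collections import Counter

# Only the three examples from the post; an assumption, not the full list used.
STOP_WORDS = {"the", "and", "with"}


def top_words(rows, n):
    """Most frequent non-stop words across (post_id, word) rows."""
    counts = Counter(word for _, word in rows if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


def posts_with_any(rows, keywords):
    """Sorted IDs of posts that contain at least one of the keywords."""
    keywords = set(keywords)
    return sorted({post_id for post_id, word in rows if word in keywords})


# Invented (post_id, word) rows.
rows = [
    (1, "the"), (1, "program"),
    (2, "program"), (2, "with"), (2, "data"),
    (3, "the"), (3, "cat"),
]

keep = top_words(rows, 1)
kept_posts = posts_with_any(rows, keep)
print(keep)        # ['program']
print(kept_posts)  # [1, 2]
```

Post 3 drops out because none of its words survive the cut, which is the same mechanism that shrinks the sample once the word list is trimmed.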
As I stated in the beginning, the goal of my blog post today was to understand how to become popular on Hacker News.
Here is the visualization I put together. It is interactive, so feel free to play with it.
Top charts show Top 10 Sources (on the left) and Top 15 Users (on the right) who posted on Hacker News based on the # posts they submitted. Color coding represents average # points each source or user got per post.
Bottom left scatter plots show all posts submitted based on the # Points and # Comments as well as the # Points and Post Recency (# minutes elapsed after post submission). By clicking on any bar on the top you will be able to filter posts on the bottom chart from a selected source or a user.
Finally, bottom right table outlines top 10 key words used in Hacker News Topics based on average # of points and comments a post using this word received.
Click on chart bars to drill into the data and watch the scatterplot change accordingly
For text mining descriptive statistics I put together a nice visualization in Wordle that sizes words used in Hacker News posts based on frequency.
GitHub is rocking the stage, and not just by building a community of talented developers who share their work with others, but also by telling the world about the awesome things GitHub does. No wonder GitHub submitted the biggest number of posts (N posts = 40) and also ranked 1st in average # points per post (Avg N points = 112). Medium.com and techcrunch.com were next on the list. It is quite interesting that while the top posting sources were tech-oriented (maybe except for Medium), we still observed a large presence of media platforms like Bloomberg, Washington Post, NY Times and BBC.
When it comes to posting sources, there seems to be a pretty strong correlation between # posts and the average # points each post from that source receives. However, this is not always true for users. People who post the most are not always the ones getting the most attention. For example, top user luu submitted 64 posts but on average received only 21 points per post (which is still pretty good by the way), whereas robin_reala only had 10 posts but got an average of 181 points per post. Out of curiosity I looked at all posts by robin_reala (by clicking on the robin_reala bar on the top right chart, which filtered the scatterplot by this author's posts) and realized that most points came from the post named after Sir Terry Pratchett (now I absolutely have to read it!).
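The two numbers behind each bar (how many posts, and the average # points per post) are a simple group-by. Here is a small Python sketch of that aggregation; the field names and the three posts are hypothetical.

```python
from collections import defaultdict


def summarize_by(posts, key):
    """Map each key value to (number of posts, average points per post)."""
    points_by_key = defaultdict(list)
    for post in posts:
        points_by_key[post[key]].append(post["points"])
    return {k: (len(v), sum(v) / len(v)) for k, v in points_by_key.items()}


# Invented posts, just to show the shape of the data.
posts = [
    {"user": "alice", "source": "example.com", "points": 10},
    {"user": "alice", "source": "example.org", "points": 30},
    {"user": "bob", "source": "example.com", "points": 200},
]

by_user = summarize_by(posts, "user")
by_source = summarize_by(posts, "source")
print(by_user)    # alice: 2 posts averaging 20 points; bob: 1 post averaging 200
print(by_source)
```

The same function works for sources and users, since only the grouping key changes.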
From the scatter plots on the bottom it looks like there is a positive correlation between # points a post obtains and # comments it receives. Post ranking, however, doesn't seem to impact either point or comment volumes. I also observed some correlation between recency of the post and # points it obtained. This is intuitive because I would expect older posts to receive more points (just by the nature of being on the website longer). But there seems to be a chunk of posts that don't get many points no matter how long they hang on the news board. On the other hand, some posts that gain popularity are in fact not that old.
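Eyeballing a scatter plot only goes so far, so one way to put a number on "positive correlation" is the Pearson coefficient. The sketch below is a plain Python version; the points and comments values are invented and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented values for five posts.
points = [5, 20, 50, 120, 300]
comments = [0, 6, 12, 40, 95]
print(round(pearson(points, comments), 3))
```

A value close to 1 would back up the visual impression, while a value near 0 would suggest the pattern is mostly in the eye of the beholder.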
Based on my descriptive analysis I generated the following hypotheses:
- # Comments a post receives directly impacts its popularity
- If a contributor uses tech words in the name of the post s/he is more likely to receive more points for the post
- Recency of the post does not affect its popularity
- Post ranking has no effect on its popularity
Since there seems to be a linear relationship between a post's popularity (# Points) and the other factors, I chose a linear regression model to address my hypotheses. So my model looked as follows:
Popularity(i) = β0 + β1 * N_Comments(i) + β2 * Tech_Word(i) + β3 * Post_Recency(i) + β4 * Ranking_Group(i) + ε(i)
Popularity - # Points a post received
N_Comments - # Comments a post received
Tech_Word - whether a post contained techy words or not (based on the viz above)
Post_Recency - # Minutes elapsed since post submission
Ranking_Group - A group of posts based on their ranking on the news board (e.g. Top 100, 100-249, 250-499, etc.)
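To make the model concrete, here is a plain Python sketch of ordinary least squares solved through the normal equations. It is only a stand-in for a proper statistics package. Several things are assumptions made for illustration: Ranking_Group is reduced to a single 0/1 `in_top_100` flag, the rows are synthetic, and `true_beta` holds made-up coefficients, not results from this analysis.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    x = [[1.0] + list(row) for row in features]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic rows: (n_comments, tech_word, post_recency_minutes, in_top_100)
features = [
    (0, 0, 60, 1),
    (5, 1, 120, 1),
    (10, 0, 600, 0),
    (3, 1, 1440, 0),
    (20, 1, 300, 1),
    (8, 0, 2880, 0),
    (1, 1, 30, 0),
]
true_beta = [2.0, 3.0, 10.0, 0.01, 20.0]  # made-up, for checking the fit only
points = [
    true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], row))
    for row in features
]

beta = fit_ols(features, points)  # recovers true_beta up to rounding error
print([round(b, 4) for b in beta])
```

Because the synthetic points are generated without noise, the fit returns the coefficients that produced them, which is a quick sanity check that the solver does what it claims.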
I used the R lm function to perform the analysis and assess each factor's impact on the post's popularity. Here is the output of the linear regression analysis.
Recency of post submission also seems to contribute to the # points it receives; however, the magnitude is low. On average, every minute your post "ages" it gains a fraction of a point. Put differently, with every additional week your post sits on the website it gains 0.00032 * 60 (minutes per hour) * 24 (hours per day) * 7 (days per week) ≈ 3.2 points (p<0.01). This finding is somewhat contradictory to the next one, which tells us that Top 100 posts are likely to gain 21.41 points more than bottom ranked posts (p=0.05). But how will your post be in the top 100 if it has been sitting on the website for a week? I think the post ranking effect in this case outweighs the effect of post recency.
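Unit conversions like this one are easy to get wrong, so here is the arithmetic spelled out in Python. The only input taken from the analysis is the 0.00032 points-per-minute coefficient; a week is counted as 7 full 24-hour days (10,080 minutes), and the function name is just for illustration.

```python
MINUTES_PER_WEEK = 60 * 24 * 7  # 10,080 minutes in seven 24-hour days


def points_gained(coef_per_minute, minutes):
    """Expected extra points after `minutes`, holding everything else fixed."""
    return coef_per_minute * minutes


weekly_gain = points_gained(0.00032, MINUTES_PER_WEEK)
print(round(weekly_gain, 2))  # 3.23
```

Either way the effect is small next to the 21.41-point difference associated with being in the Top 100.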
For curious souls, the R^2 of my model was 0.42, which is not that bad given that I knew almost nothing about the posts themselves (except, of course, for the key words in the post name).
Based on all the findings from this analysis, here is my recipe for becoming popular on Hacker News:
- Your post needs comments from contributors. No wonder the admins of the news board recommend that authors start the discussion themselves to trigger the desired reaction
- Your post name has to be in the Hacker News spirit. In other words, it has to have techy words so that your readers get exactly what they look for on a tech news site. Of course, the content of your post has to be techy too. No one likes bad surprises. robin_reala, I really hope your post about Sir Terry Pratchett was somewhat tech related, although your post name was not :)
- Age brings wisdom, but only to a certain extent. Try to keep your post in the top 100 for as long as you can and it will pay off!
I feel like I have done enough talking for today. What do you think about this analysis? See if you can poke holes, I promise I will not be offended :)