> Firstly, any articles that received under 50 points were filtered out
Why? There's still a lot of information in the posts that didn't receive significant interest, and unexplained filtering here seems more suspicious than anything.
HN users/voters seem habituated and predictable. What if this is applied to link titles to predict likelihood of making it to the front page? What words (or sequences of letters) are associated with popularity?
I also wonder how many unique voters are present, especially regulars and those who vote before a link hits the front page. I bet there aren't that many, and I bet they are mostly INTJ. That is, content seems curated/controlled by a handful of users (i.e., a bubble). What can be done to buffer against that bias? How can submissions be automatically surfaced/tested (shown on the front page to patterned/known users) even before receiving any votes?
I've always thought some percentage of the front page should be dedicated to (randomly chosen, though slightly weighted) new submissions: some percentage of the page, some percentage of the time, shown to some percentage of users able to vote. Or maybe the new tab should be shown inline on the right. That is, I'm guessing most users only see what's shown on the first page.
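A toy sketch of what I mean. Everything here is invented for illustration (slot fraction, recency weighting, function names); the point is just reserving a few front-page slots for unvoted submissions, sampled with a slight bias toward newer ones and scattered among the ranked stories:

```python
import random

FRONT_PAGE_SIZE = 30
NEW_SLOT_FRACTION = 0.1  # e.g. ~10% of slots reserved for new submissions

def build_front_page(ranked, new_submissions, rng=random):
    """Mix a few weighted-random new submissions into the ranked page."""
    n_new = int(FRONT_PAGE_SIZE * NEW_SLOT_FRACTION)
    pool = list(new_submissions)
    # weight newer submissions slightly higher (index 0 = newest)
    weights = [1.0 / (i + 1) for i in range(len(pool))]
    picked = []
    for _ in range(min(n_new, len(pool))):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(idx))
        weights.pop(idx)
    page = ranked[:FRONT_PAGE_SIZE - len(picked)]
    # scatter the new picks into random positions rather than clustering them
    for item in picked:
        page.insert(rng.randrange(len(page) + 1), item)
    return page
```

Per-user randomization would just mean seeding `rng` from the user ID plus a time bucket, so not everyone sees the same experimental slots.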
> HN users/voters seem habituated and predictable. What if this is applied to link titles to predict likelihood of making it to the front page? What words (or sequences of letters) are associated with popularity?
Apropos of nothing, I am working on building a model for predicting HN post performance.
I did one some years ago (a basic attempt using XGBoost) and the AUC was in the high 0.6s or low 0.7s.
I think I treated the title as a raw character string rather than as words, so each column/feature was one character at one position. That is, with 256 possible characters and titles capped at 80 characters, there'd be 256 * 80 * 2 columns (title + title reversed).
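Roughly, the encoding described above would look like this. This is a reconstruction from memory of the idea, not the original code; the sizes and names are assumptions:

```python
import numpy as np

MAX_LEN = 80   # titles truncated/padded to 80 characters
VOCAB = 256    # one slot per possible byte value

def encode_title(title: str) -> np.ndarray:
    """One-hot encode each character position, for the title and its reverse.

    Returns a flat 0/1 vector of length 256 * 80 * 2 = 40960.
    """
    vec = np.zeros(2 * MAX_LEN * VOCAB, dtype=np.uint8)
    for half, text in enumerate((title, title[::-1])):
        for pos, ch in enumerate(text[:MAX_LEN]):
            vec[half * MAX_LEN * VOCAB + pos * VOCAB + (ord(ch) % VOCAB)] = 1
    return vec

features = encode_title("Show HN: My new project")
print(features.shape)  # (40960,)
```

Including the reversed title gives positional features anchored to both the start and the end of the string, which is presumably why both were kept.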
If you are doing a classification model, what’s the score threshold?
The trick with that approach is that if you set the threshold high, it's very easy to get a model with high accuracy, since the classes will be imbalanced. (The vast, vast majority of HN submissions do not do well.)
If I remember correctly, HN's data dump included a made_front_page flag (not the exact name). Also, XGBoost includes a parameter that helps compensate for imbalanced classes. As for thresholds, one option is to only include submissions whose scores fall outside the average +/- one standard deviation.
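The imbalance knob I had in mind is XGBoost's `scale_pos_weight`, conventionally set to the ratio of negative to positive examples. A minimal sketch, with made-up synthetic labels standing in for the front-page flag:

```python
import numpy as np

# Synthetic labels: ~3% positives, roughly the kind of imbalance you get
# when "positive" means "made the front page".
y = np.array([0] * 970 + [1] * 30)
neg, pos = np.bincount(y)

# Common heuristic: weight positives by the negative/positive ratio.
spw = neg / pos
print(round(spw, 2))  # 32.33

# Then pass it to the classifier, e.g.:
#   xgboost.XGBClassifier(scale_pos_weight=spw).fit(X, y)
```

This reweights the loss rather than resampling the data, so AUC is still the more honest metric than raw accuracy here.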
This is very cool research. I would love to see topics like this built into a browser extension (or even into HN itself although that may be beyond the scope of HN's core features). I find that there's a decent bit of content that gets posted on HN that I'm not personally interested in and, on the flip side, there are times when I want to go more in-depth on a topic but can't find more posts that cover it. I don't really want a whole subreddit-style navigation system, so some automatic topic tagging could be a nice middle ground.