> Firstly, any articles that received under 50 points were filtered out
Why? There's still a lot of information in the posts that didn't receive significant interest, and unexplained filtering here seems more suspicious than anything.
HN users/voters seem habituated and predictable. What if this is applied to link titles to predict likelihood of making it to the front page? What words (or sequences of letters) are associated with popularity?
I also wonder how many unique voters are present, especially regulars and those who vote before a link hits the front page. I bet there aren't that many, and I bet they are mostly INTJ. That is, content seems curated/controlled by a handful of users (i.e., a bubble). What can be done to buffer against that bias? How can submissions be automatically surfaced/tested (shown on the front page to patterned/known users) even before receiving any votes?
I've always thought some percentage of the front page should be dedicated to (randomly chosen, though slightly weighted) new submissions: some percentage of the page, some percentage of the time, shown to some percentage of users able to vote. Or maybe the new tab should be shown inline on the right. That is, I'm guessing most users only see what's shown on the first page.
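A toy sketch of what I mean. Everything here is invented for illustration (slot fraction, recency weighting, function names); the point is just reserving a few front-page slots for unvoted submissions, sampled with a slight bias toward newer ones and scattered among the ranked stories:

```python
import random

FRONT_PAGE_SIZE = 30
NEW_SLOT_FRACTION = 0.1  # e.g. ~10% of slots reserved for new submissions

def build_front_page(ranked, new_submissions, rng=random):
    """Mix a few weighted-random new submissions into the ranked page."""
    n_new = int(FRONT_PAGE_SIZE * NEW_SLOT_FRACTION)
    pool = list(new_submissions)
    # weight newer submissions slightly higher (index 0 = newest)
    weights = [1.0 / (i + 1) for i in range(len(pool))]
    picked = []
    for _ in range(min(n_new, len(pool))):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(idx))
        weights.pop(idx)
    page = ranked[:FRONT_PAGE_SIZE - len(picked)]
    # scatter the new picks into random positions rather than clustering them
    for item in picked:
        page.insert(rng.randrange(len(page) + 1), item)
    return page
```

Per-user randomization would just mean seeding `rng` from the user ID plus a time bucket, so not everyone sees the same experimental slots.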
> HN users/voters seem habituated and predictable. What if this is applied to link titles to predict likelihood of making it to the front page? What words (or sequences of letters) are associated with popularity?
Apropos of nothing, I am working on building a model for predicting HN post performance.
I did one some years ago (a basic attempt using XGBoost) and the AUC was in the high 0.6s or low 0.7s.
I think I treated the title as a raw character string rather than as words, so each column/feature was one character at one position. That is, with 256 possible characters and titles capped at 80 characters, there'd be 256 * 80 * 2 columns (title + title reversed).
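Roughly, the encoding described above would look like this. This is a reconstruction from memory of the idea, not the original code; the sizes and names are assumptions:

```python
import numpy as np

MAX_LEN = 80   # titles truncated/padded to 80 characters
VOCAB = 256    # one slot per possible byte value

def encode_title(title: str) -> np.ndarray:
    """One-hot encode each character position, for the title and its reverse.

    Returns a flat 0/1 vector of length 256 * 80 * 2 = 40960.
    """
    vec = np.zeros(2 * MAX_LEN * VOCAB, dtype=np.uint8)
    for half, text in enumerate((title, title[::-1])):
        for pos, ch in enumerate(text[:MAX_LEN]):
            vec[half * MAX_LEN * VOCAB + pos * VOCAB + (ord(ch) % VOCAB)] = 1
    return vec

features = encode_title("Show HN: My new project")
print(features.shape)  # (40960,)
```

Including the reversed title gives positional features anchored to both the start and the end of the string, which is presumably why both were kept.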
If you are doing a classification model, what’s the score threshold?
The trick with that approach is that if you set the threshold high, it's very easy to get a model with high accuracy, since the classes will be imbalanced. (The vast, vast majority of HN submissions do not do well.)
If I remember correctly, HN's data dump included a made_front_page flag (not the exact name). Also, XGBoost includes a parameter that helps compensate for imbalanced classes. As for thresholds, one option is to only include submissions whose scores fall outside the average +/- one standard deviation.
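The imbalance knob I had in mind is XGBoost's `scale_pos_weight`, conventionally set to the ratio of negative to positive examples. A minimal sketch, with made-up synthetic labels standing in for the front-page flag:

```python
import numpy as np

# Synthetic labels: ~3% positives, roughly the kind of imbalance you get
# when "positive" means "made the front page".
y = np.array([0] * 970 + [1] * 30)
neg, pos = np.bincount(y)

# Common heuristic: weight positives by the negative/positive ratio.
spw = neg / pos
print(round(spw, 2))  # 32.33

# Then pass it to the classifier, e.g.:
#   xgboost.XGBClassifier(scale_pos_weight=spw).fit(X, y)
```

This reweights the loss rather than resampling the data, so AUC is still the more honest metric than raw accuracy here.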
This is very cool research. I would love to see topics like this built into a browser extension (or even into HN itself although that may be beyond the scope of HN's core features). I find that there's a decent bit of content that gets posted on HN that I'm not personally interested in and, on the flip side, there are times when I want to go more in-depth on a topic but can't find more posts that cover it. I don't really want a whole subreddit-style navigation system, so some automatic topic tagging could be a nice middle ground.