A Graph of Related Subreddits (anvaka.github.io)
377 points by freediver on Jan 29, 2020 | 81 comments


Oh hey there, I'm the author of this project.

Just wanted to say thank you for sharing this! Would be happy to answer any questions - graphs are my long time hobby, and I love them!

PS: You can find more recent graphs and fun projects here: https://twitter.com/search?q=from%3Aanvaka%20min_retweets%3A...


How does this infer subreddit similarity? For instance, I checked for r/AskHistorians and the results don't seem that relevant.

edit: never mind, I just read your GitHub readme. But the question still stands as to whether "users posting in x, also posted in y" is a good way to infer similarity. Could comparing top-ranked posts be a better comparator?


I used my own metric based on Jaccard similarity, which in turn is based on the "users who posted to X also posted to Y" idea [1].

That said, a few subreddits are too popular and their similarity results were too saturated (/r/videos, /r/funny, etc.), so I did a manual override [2] by looking at the most commonly mentioned other subreddits, and sometimes at the `about` blurb of the subreddit.

Please don't consider these recommendations a source of truth! It's just a fun way to discover other subreddits :).

I'm also very open to changing this metric to something else - please let me know if you have any recommendations!

[1]: https://github.com/anvaka/sayit#the-data - describes the data, indexing scripts are here: https://github.com/anvaka/sayit/tree/master/scripts

[2]: Manual overrides can be found here https://github.com/anvaka/sayit-data#sayit---recommendation-...


I work for reddit and here I’ve used a similar technique to your Jaccard distance but with one twist: divide by the size of the smaller subreddit (in your case, the number of unique posters that you’ve recorded). That gives you a directional relatedness, that is, programming->python but not necessarily python->programming. Used this way you account for the giant subreddit problem automatically, but now the results are less “amitheasshole is related to askreddit” and more like “linguisticshumor is a more niche version of linguistics”.

The great thing is that it’s actually more actionable as far as recommendations go! Everybody has already heard of the bigger version of this subreddit, but they probably haven’t heard of the smaller versions. And it’s self-correcting. As a subreddit gets bigger we are less likely to recommend it (which is great because it needs our help less)
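
A minimal sketch of the directional score described above, assuming each subreddit is represented as a set of unique poster names. The function names and the sample data are hypothetical, and this is only one reading of the comment, not reddit's actual implementation. `containment` divides the overlap by the size of the source subreddit, which is what makes the score directional; `overlap_coefficient` divides by the smaller of the two sizes.

```python
def containment(source: set, target: set) -> float:
    """Share of `source` posters who also posted in `target` (directional)."""
    if not source:
        return 0.0
    return len(source & target) / len(source)


def overlap_coefficient(a: set, b: set) -> float:
    """Shared posters divided by the size of the smaller subreddit."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


# Hypothetical data: a big subreddit and a niche one that mostly sits inside it.
linguistics = {f"user{i}" for i in range(100)}
linguisticshumor = {f"user{i}" for i in range(90, 102)}  # 12 posters, 10 shared

niche_to_big = containment(linguisticshumor, linguistics)  # 10 / 12
big_to_niche = containment(linguistics, linguisticshumor)  # 10 / 100
```

With this toy data the niche subreddit is strongly contained in the big one, but not the other way around, which matches the "more niche version of" reading.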


This is super awesome, thank you for sharing!

If you guys are interested in seeing how your recommendations work for the entire reddit, I'd be happy to build you a spaceship similar to this one https://github.com/anvaka/word2vec-graph .

I couldn't find an easy way to download the entire recommendation graph, but it would be awesome if we could make it work. My email is the same as this account at gmail, and twitter is all open: https://twitter.com/anvaka


In case you haven't come across it already, here is a very exhaustive list of distance measures for dealing with problems of this kind - http://www.iiisci.org/journal/CV$/sci/pdfs/GS315JG.pdf

I fooled around a bit with lastfm data for band recommendation and found this sheet quite helpful.

If you are interested in learning more about asymmetrical similarity, here is a great primer by Tversky - http://www.cogsci.ucsd.edu/~coulson/203/tversky-features.pdf


This is absolute treasure trove. Thank you so much!


Normalizing by the user count should help.

My suggestion would be something much simpler, i.e., to compare the content itself, e.g. the top 1000 posts from each subreddit, and estimate (say) cosine similarity over TF-IDF vectors. Wouldn't that be a better indicator? Also, instead of pairwise comparison, you could try clustering (HDBSCAN, for example) to reduce computational complexity.

But great work, love your visualizations!
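
For anyone curious what the content-based suggestion above could look like, here is a rough, dependency-free sketch of TF-IDF plus cosine similarity. Everything in it is an assumption for illustration: the whitespace tokenizer, the smoothed idf formula, the helper names, and the tiny made-up "top posts" strings. A real pipeline would more likely use an existing library vectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(docs: dict) -> dict:
    """Map each document name to a sparse {term: tf-idf weight} vector."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            # Smoothed idf so a term found in every document keeps a nonzero weight.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for "top posts" concatenated per subreddit.
docs = {
    "python": "python list comprehension generator python decorator",
    "learnpython": "python list loop beginner question python",
    "baking": "sourdough starter oven bread flour",
}
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors["python"], vectors["learnpython"])
sim_unrelated = cosine(vectors["python"], vectors["baking"])
```

Subreddits that share vocabulary score above zero, while ones with no terms in common score exactly zero, regardless of who posts there.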


Can people still make their upvotes/downvotes public? I found that useful for sussing out related subreddits when it was possible back in the day.


I’ve also rediscovered this in similar work I do. But upon later research I’ve learned this is called containment.


If you're interested, in the scientific literature this problem is known as "network backboning". Basically you have all nodes connected to practically all other nodes with weighted edges, and you want to know which are the edges with statistically significant weights.

I wrote on this topic [1]. My method [2] basically uses simple counts on edge weights, and then estimates the expected edge weight and its variance using Bayesian priors. It then attaches a t-score or p-value to each edge, and then you can filter out edges with too low t-score.

The idea is that weak edges can still be statistically significant if they connect "small" nodes. In any case, the library I wrote includes the implementation of a few other methods, in case they work better for your data type.

[1] https://arxiv.org/abs/1701.07336 [2] http://www.michelecoscia.com/?page_id=287
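
To make the backboning idea concrete, here is a deliberately simplified sketch. It is not the Bayesian method from the paper above: it assumes a plain binomial null model in which each unit of edge weight picks its endpoints in proportion to node strength, and it scores each edge by how far its weight sits above that expectation. The function names, the threshold, and the sample counts are all hypothetical.

```python
import math


def edge_scores(edges: list) -> list:
    """Return (a, b, weight, z) for each undirected weighted edge."""
    total = sum(w for _, _, w in edges)
    if total == 0:
        return [(a, b, w, 0.0) for a, b, w in edges]
    strength = {}
    for a, b, w in edges:
        strength[a] = strength.get(a, 0) + w
        strength[b] = strength.get(b, 0) + w

    scored = []
    for a, b, w in edges:
        # Null model: each unit of weight picks its two endpoints in proportion
        # to node strength, so big nodes are expected to share heavy edges.
        p = strength[a] * strength[b] / (2 * total * total)
        expected = total * p
        std = math.sqrt(total * p * (1 - p))
        z = (w - expected) / std if std > 0 else 0.0
        scored.append((a, b, w, z))
    return scored


def backbone(edges: list, z_threshold: float = 2.0) -> list:
    """Keep only the edges whose weight sits well above the null expectation."""
    return [(a, b, w) for a, b, w, z in edge_scores(edges) if z >= z_threshold]


# Hypothetical shared-poster counts between subreddits.
edges = [
    ("askreddit", "funny", 50),
    ("askreddit", "pics", 40),
    ("funny", "pics", 45),
    ("askreddit", "linguistics", 3),
    ("linguistics", "linguisticshumor", 6),
]
kept = backbone(edges)
```

With this toy data the light edge between the two small linguistics subreddits gets the highest score, while the similarly light edge from askreddit to linguistics is filtered out, which is the "weak edges between small nodes can still be significant" point.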


>I used my own metric that is based on jaccard similarity. Which in turn is based on "users who posted to X also posted to Y" metric.

Is that really a good metric of similarity? Speaking just for myself, I post in several unrelated subreddits semi-regularly, from programming to video games to music, art and even stonemasonry. I've posted in subreddits for TV shows I've watched, or just completely random things.

I use reddit as a place where I can learn about and interact with people on nearly any subject or topic and I take advantage of that when I can. I'm sure my posting habits aren't that unusual. I'm just not sure that's really an accurate way to gauge similarity.


Right, I don't think it's 100% accurate either. It just gives some hints about what might be related, but like you said, it's not necessarily the best possible measure of similarity.

Jaccard similarity does not just count the number of people who posted to both A and B. It looks at how many people posted to A, how many posted to B, and how many posted to BOTH A and B. That overlap hints at what is related, and once it is computed, we divide it by the total number of distinct posters across A and B, which gives a value we can compare across subreddits. If that value is close to 1, almost all users who posted to A have also posted to B (and vice versa). If it is close to 0, the overlap is much smaller.
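
As a small illustration of the description above, here is a sketch of plain Jaccard similarity over sets of posters. The poster sets and the `most_similar` helper are made up for the example, and the project's real scripts add their own adjustments on top of this.

```python
def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|: 1.0 means identical poster sets, 0.0 means no overlap."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical poster sets keyed by subreddit name.
posters = {
    "python": {"ann", "bob", "cat", "dan"},
    "learnpython": {"bob", "cat", "dan", "eve"},
    "woodworking": {"ann", "fay"},
}


def most_similar(name: str, data: dict) -> list:
    """Rank every other subreddit by Jaccard similarity to `name`."""
    scores = [(other, jaccard(data[name], data[other])) for other in data if other != name]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("python", posters)
```

Here `python` and `learnpython` share 3 of 5 distinct posters, while `python` and `woodworking` share 1 of 5, so `learnpython` ranks first.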


This has a similar issue to the way Reddit recommends subs itself. A sub will be considered similar if it's the exact opposite due to users of one sub going to another to shitpost or brigade or etc.

I'm not sure of a fix for that, but would it be possible/helpful to weigh it by average upvote/downvote of the comments left from users of said sub? Meaning if sub A is about how much baseball sucks, and sub B is about how amazing baseball is, while determining if the 2 are similar you'd find out most posts from sub A to sub B are heavily downvoted and so probably not similar.


It seems, though, that the relationship can be considered to have a sign: positive when people mostly align in their upvote intents, and negative when people do the brigading, etc.

I'd still see the ability to determine the absolute value of the relationship as a valuable property of a recommender.
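
A rough sketch of the signed-relationship idea, assuming you had per-comment scores and some notion of each user's "home" subreddit. Both are assumptions: the thread does not say such data is available, and the `signed_affinity` helper, the subreddit names, and the sample comments are hypothetical.

```python
from collections import defaultdict


def signed_affinity(comments: list, home: dict) -> dict:
    """Average score of comments that users leave outside their home subreddit.

    `comments` holds (user, subreddit, score) tuples; `home` maps user -> home subreddit.
    Returns {(home_sub, visited_sub): mean score}; a negative mean hints at hostility.
    """
    totals = defaultdict(lambda: [0, 0])  # (home, visited) -> [score sum, count]
    for user, subreddit, score in comments:
        origin = home.get(user)
        if origin is None or origin == subreddit:
            continue  # skip unknown users and comments made at home
        bucket = totals[(origin, subreddit)]
        bucket[0] += score
        bucket[1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


# Hypothetical data: two visiting groups commenting in the same subreddit.
home = {"ann": "baseballsucks", "bob": "baseballsucks", "cat": "baseballstats"}
comments = [
    ("ann", "baseball", -12),
    ("bob", "baseball", -8),
    ("cat", "baseball", 15),
    ("cat", "baseball", 5),
    ("ann", "baseballsucks", 30),
]
affinity = signed_affinity(comments, home)
```

The sign of the mean gives the direction of the relationship, and its absolute value can still be used as the strength of the link.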


A lot of Reddits have "Recommended" or "Related" boxes at the side. Why don't you implement these first, and then use the algorithm(s)?


Thank you for the suggestion!

I think when I created this tool there were no recommendations on reddit.

When they were introduced on reddit later, I contemplated using reddit's own recommendations, but at that time they were missing a few smaller subreddits, so I just put it off onto the shelf of projects to try.


Have to say the link between The_Donald and TwoXChromosomes was a surprising one.


Probably due to both subreddits going on trolling raids on each other.


Reddit tries to keep a lid on brigading, it's actively policed. Surprising to see this as a major component.


Reddit execs give lip service to fighting brigading. It is only enforced in rare cases.


I guess the amazon principle: Users who posted here, also posted there?


The method seems surprisingly effective for a lot of subs, but also bothers me.


Thank you for creating this project!

I share your interest in creating graphs (vide some older project of mine, a map of Stack Exchange tags https://p.migdal.pl/tagoverflow/).

When I dived into various visualizations of subreddits (as background reading for this side-project https://observablehq.com/@stared/tree-of-reddit-sex-life), your vis was the best for exploration (out of many, many). The only other one that I found at this level was a Sigma.js-based https://www.jacobsilterra.com/subreddit_map/network/ (beautiful, but more static).


Yay! Thank you for sharing the links and thank you for your kind words!

I see you are doing something with quantum tensors, which is very impressive! I'm still struggling with the concept of regular tensors, and would love to have a good intuition/visualization for them - do you have any pointers?


It depends on what you are looking for. Tensors as in physics (typically with each dimension being related to space(time)), tensors for quantum information & computing, tensors for deep learning?

I've found it illuminating to see tensor diagrams.

- https://medium.com/@pmigdal/in-the-topic-of-diagrams-i-did-w...

- https://www.math3ma.com/blog/matrices-as-tensor-network-diag...

Also, right now, we are developing a matrix visualization in https://github.com/Quantum-Game. However, it is pretty much work in progress; for a slightly more mature one, go to Quantum Game 2 website and in the element encyclopedia there is one.

I am always up for talking about tensors, so feel invited to drop me an email.


I love your npm package dependency tree graph, I've actually posted links to it very recently on both HN and Reddit. Awesome tool!

https://npm.anvaka.com/#/


Oh thank you! I'm very glad you liked it :)!


This is so cool! You've basically made a different, better version of r/findasubreddit. Which led me to think...

https://anvaka.github.io/sayit/?query=findasubreddit

Super interesting!

So many small and unknown subreddits, and one of the main connections is another subreddit called "Somebody Make This".

I'll take a peek under the hood later, but this on the surface is very cool.


Thank you :)! Under the hood it is all very naive counting of users who posted to A also posted to B.


Thanks, this is very interesting. Mind if I ask how you determine relatedness? https://anvaka.github.io/sayit/?query=UsbCHardware This subreddit is very small, I am a very active poster there and yet I recognize none of the related ones.


I think for smaller subreddits there's not as much data, so results are more noise. Apparently Jaccard similarity was used, except for the really big subreddits: https://github.com/anvaka/sayit

Unrelated, but had to say it somewhere in this discussion: Thanks @anvaka for your open source graph libs! Way more performant than anything else I've tried.


Hi there. Here is a similar question and answer: https://news.ycombinator.com/item?id=22178373


Small request: Can you turn off the initial jumble of nodes before the graph is built and laid out? It's distracting clutter that adds no information.

Animations are fun, but only if they convey artistic or intellectual meaning.

Also, the default zoom level is zoomed in too far (large text, not showing the whole graph). Perhaps this is because viewport resolution/size (phone vs desktop) is not taken into account when choosing a zoom level (font size).


anvaka: your work is just amazing! The viz you did of software repos is outstanding. I just wanted to express thanks to you for sharing as I am a big fan of all things graph as well.


I just saw your city roads project, that is so awesome! Thanks for doing that


any information about coverage? r/electribe is not in auto complete and generates a graph with only itself


Thank you, it's quite a useful app.


great UI for this. very novel and intuitive


Thank you! I'm very happy to hear this!


The most interesting graph I could find so far: https://anvaka.github.io/sayit/?query=chairsunderwater

Speaking of which, it'd be awesome if the site could automatically generate multi-reddits from the results. I think a multi-reddit constructed from that graph would be quite interesting ;)


Thanks, I knew about r/chairsunderwater, but never heard about r/breadstapedtotrees. My life will never be the same.


I didn't know about either. Utterly impressed


I found r/chairsunderwater while looking for a subreddit that could help me pick out a new office chair.

r/chairs has 800 people subscribed. r/chairsunderwater has 115k. It's just reddit things ¯\_(ツ)_/¯


try https://anvaka.github.io/sayit/?query=ocaml

the algorithm is non-deterministic, so if you follow the link more than once, you get a different spatial arrangement each time.

some are very pleasing, with identifiable clusters, and others are just a seemingly random scatter of names.
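
If the run-to-run variation comes from random starting positions (an assumption; the thread does not say how the site seeds its layout), fixing the random seed makes a force-directed layout repeatable. Below is a toy sketch with made-up force constants and a hypothetical `force_layout` helper. It is not the site's renderer.

```python
import math
import random


def force_layout(nodes: list, edges: list, seed: int, iterations: int = 100, step: float = 0.05) -> dict:
    """Tiny force-directed layout; the same seed always gives the same positions."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-3)
                push = 0.1 / (dist * dist)
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Connected nodes attract like springs.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            force[a][0] -= dx
            force[a][1] -= dy
            force[b][0] += dx
            force[b][1] += dy
        for n in nodes:
            # Cap each move so a near-collision cannot fling a node away.
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = step * min(size, 1.0) / size if size > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return {n: tuple(p) for n, p in pos.items()}


# Hypothetical mini-graph.
nodes = ["ocaml", "haskell", "rust", "baking"]
edges = [("ocaml", "haskell"), ("ocaml", "rust")]
first = force_layout(nodes, edges, seed=1)
again = force_layout(nodes, edges, seed=1)
other = force_layout(nodes, edges, seed=2)
```

Two runs with the same seed produce identical coordinates, while a different seed starts from a different jumble and generally settles into a different arrangement.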

also interesting: https://anvaka.github.io/sayit/?query=lostredditors

(no lines are generated here; is this a feature or a bug?)



Damn, that's a nice one as well :D

As a side-note, I'm quite disappointed that r/DMT does not pop up when starting with r/JoeRogan...


Bon Appetit is a New York city based food magazine that also has pretty entertaining YouTube content.

The graph is extremely shocking and not at all what I would have expected.

https://anvaka.github.io/sayit/?query=bon_appetit


The_donald members often post in TwoXChromosomes and TropicalWeather?

https://anvaka.github.io/sayit/?query=the_donald


In fact, a lot of users seem to post both on extreme right-wing and left-wing subreddits, I did the test for a few of these subs, and it's just baffling. Either it's because of the phenomenon called "brigading", where a thread from one side is linked on another sub which leads users of the latter to post in the former, or a lot of users are just trolls playing both sides and inciting drama and outrage between people just for kicks.


> inciting drama and outrage between people just for kicks.

or

"popping their filter bubbles", or "are generally interested in boundary-pushing ideas", or "are susceptible to the rage-inducing trolls who run fringe communities".

"Affinity for extreme ideas of any kind" may be a stronger / more common personality trait than "interested in one extreme point in the vector space of ideas".


JordanPeterson subreddit links to the_donald, but not the other way around?


The metric (Jaccard distance) is symmetric. The asymmetry comes from ranking: JordanPeterson is more special-interest/intense-interest with a narrower, more homogeneous audience than the_donald, so it probably has more very-similar subreddits to it than the_donald does. the_donald would have more related reddits but with less perfect overlap in membership.

A broader base of people (about 1/4 to 1/2 of the voting age US public) is aware and supportive of Donald Trump, for a wide variety of reasons, and much more than half are at least aware of Donald Trump. Jordan Peterson has a much smaller following, who are interested in him for similar reasons to each other.


Makes sense, JP's reddit fanbase is quite small. One slice of his videos has been seen across all the edits and platforms 500million times. He has insane reach.


[flagged]


Is there some reason you can't / are reluctant to make an account? Then you can take this off your subscriptions and add non-default subs that interest you


> for example when all men are blamed for not doing the dishes as if all men are dirty beasts: prepare to get burned down by horde of feminists.

citation needed


I wanted to filter some subreddits from r/all and found out a solution using Filter List for Ublock Origin, it works like a charm.

You can check it out in my repo [1] and gist itself [2].

[1] https://github.com/Offpics/FilterGames

[2] https://raw.githubusercontent.com/Offpics/FilterGames/master...


This is built into reddit you don't need to do anything else


HN is not the place to file bug reports on Reddit, and people here can't help you find workarounds since you aren't enabling tools that allow you to configure your experience.


> I am no Trump bigot

....but let me regurgitate bigoted trumpist talking points which paint me as some sort of victim to the somehow oppressing women.

RES on firefox works perfectly well for me, if twoXchromosome of all subs bothers you on /r/all just try harder to get it working.


The link between r/Android and r/iamverysmart seems legit.


Link also exists with r/apple and r/ios (but not r/windows or r/linux).



This one seems to be doing a worse job. Compare what I'm getting for /r/longevity:

https://subredditstats.com/subreddit-user-overlaps/longevity

vs

https://anvaka.github.io/sayit/?query=longevity


Earlier discussion (june 2019) https://news.ycombinator.com/item?id=18866800


Someone should make a graph of all the moderators letting subreddits turn to garbage bot farms


What a fantastic idea!

I always keep on looking for new interesting subreddits. This tool is the best I saw for this task so far.


I'm so glad to know this! Thank you!



this is so sick! finally an easy way to find some new subreddits. thanks!


Yay! Thank you!


If I pick a relatively obscure subreddit like dataengineering, then the results are noisy. Maybe increase the distance/decrease the charge between nodes as the number of children on a node increases?


I'm not able to see the contents of a subreddit in the sidebar after clicking on it. I'm on MacOS Catalina using Google Chrome 79 (pretty modern).


I heard this might be caused by some adblocking extensions - they consider reddit to be an ad/tracking system, so they block all javascript requests to it. Do you happen to have one of those extensions?


Searching for r/aww links to lots of porn.


Great project.. and great implementation.


Thank you!


*immediately types in porn subreddit.


would be nice to keep that renderer as separate library.


Thank you for your suggestion!

This implementation is tailored to smaller graphs with sometimes long text boxes.

I'll make a note to extract it to a reusable component.


I second this, really neat graph



