A Graph of Related Subreddits

anvaka · on Jan 29, 2020

Oh hey there, I'm the author of this project.

Just wanted to say thank you for sharing this! Would be happy to answer any questions - graphs are my long time hobby, and I love them!

PS: You can find more recent graphs and fun projects here: https://twitter.com/search?q=from%3Aanvaka%20min_retweets%3A...

bobosha · on Jan 29, 2020

How does this infer subreddit similarity? For instance, I checked for r/AskHistorians and the results don't seem that relevant.

edit: never mind, I just read your GitHub readme. But the question still stands: is "users posting in x also posted in y" a good way to infer similarity? Could comparing top-ranked posts be a better comparator?

anvaka · on Jan 29, 2020

I used my own metric that is based on jaccard similarity. Which in turn is based on "users who posted to X also posted to Y" metric.

That said, a few subreddits are too popular and their similarity results were too saturated (/r/videos, /r/funny, etc.), so I did a manual override by looking at the most commonly mentioned other subreddits (and sometimes at the `about` blurb of the subreddit).

Please don't consider these recommendations a source of truth! It's just a fun way to discover other subreddits :).

I'm also very open to change this metric to something else - please let me know if you have any recommendations!

[1]: https://github.com/anvaka/sayit#the-data - describes the data, indexing scripts are here: https://github.com/anvaka/sayit/tree/master/scripts

[2]: Manual overrides can be found here https://github.com/anvaka/sayit-data#sayit---recommendation-...

ketralnis · on Jan 29, 2020

I work for reddit and here I’ve used a similar technique to your jaccard distance but with one twist: divide by the size of the smaller subreddit (in your case, the number of unique posters that you’ve recorded). That gives you a directional relatedness, that is programming->python but not necessarily python->programming. Used this way you account for the giant subreddit problem automatically but now the results are less “amitheasshole is related to askreddit” and more like “linguisticshumor is a more niche version of linguistics”.

The great thing is that it’s actually more actionable as far as recommendations go! Everybody has already heard of the bigger version of this subreddit, but they probably haven’t heard of the smaller versions. And it’s self-correcting. As a subreddit gets bigger we are less likely to recommend it (which is great because it needs our help less)
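To make the "divide by the size of the smaller subreddit" idea concrete, here is a minimal sketch. It is not reddit's actual code: the `posters` data, the function names, and the `min_score` threshold are all hypothetical, and reading "directional" as "only recommend from the bigger subreddit to the smaller one" is an assumption about what the comment describes.

```python
from typing import Dict, List, Set, Tuple


def overlap_by_smaller(a: Set[str], b: Set[str]) -> float:
    """Shared posters divided by the size of the smaller subreddit."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def niche_recommendations(
    source: str, posters: Dict[str, Set[str]], min_score: float = 0.3
) -> List[Tuple[str, float]]:
    """Recommend only subreddits smaller than `source` (assumed direction rule)."""
    src = posters[source]
    scored = []
    for name, users in posters.items():
        # Skip the source itself and anything that is not strictly smaller.
        if name == source or len(users) >= len(src):
            continue
        score = overlap_by_smaller(src, users)
        if score >= min_score:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: 100 posters in programming, 20 in python, 10 shared.
posters = {
    "programming": {f"u{i}" for i in range(100)},
    "python": {f"u{i}" for i in range(90, 110)},
    "cooking": {f"u{i}" for i in range(200, 230)},
}
```

With this toy data, half of the python posters also post in programming, so python shows up as a recommendation from programming, while nothing is recommended in the other direction because programming is the bigger of the two.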

anvaka · on Jan 29, 2020

This is super awesome, thank you for sharing!

If you guys are interested in seeing how your recommendations work for the entire reddit, I'd be happy to build you a spaceship similar to this one https://github.com/anvaka/word2vec-graph .

I couldn't find an easy way to download the entire recommendation graph, but it would be awesome if we could make it work. My email is the same as this account at gmail, and twitter is all open: https://twitter.com/anvaka

yantrams · on Jan 29, 2020

In case you haven't come across it already, here is a very exhaustive list of distance measures for dealing with problems of this kind - http://www.iiisci.org/journal/CV$/sci/pdfs/GS315JG.pdf

I fooled around a bit with lastfm data for band recommendation and found this sheet quite helpful.

If you are interested in learning more about asymmetrical similarity, here is a great primer by Tversky - http://www.cogsci.ucsd.edu/~coulson/203/tversky-features.pdf

anvaka · on Jan 29, 2020

This is absolute treasure trove. Thank you so much!

bobosha · on Jan 29, 2020

Normalizing by the user count should help.

My suggestion would be something much simpler, i.e. compare the content itself: take e.g. the top 1000 posts from each subreddit and estimate (say) cosine similarity over TF-IDF vectors. Wouldn't that be a better indicator? Also, instead of pairwise comparison, you could try clustering (HDBSCAN, for example) to reduce computational complexity.

But great work, love your visualizations!
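The content-based idea above can be sketched with nothing but the standard library. This is only an illustration: the `docs` strings stand in for the concatenated top posts of each subreddit, whitespace tokenization is a simplification, and the smoothed IDF formula is one common variant rather than anything the thread specifies.

```python
import math
from collections import Counter
from typing import Dict


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for "top posts" text of three subreddits.
docs = {
    "python": "python list comprehension decorator pip",
    "learnpython": "python beginner list loop pip",
    "baking": "sourdough bread oven flour",
}
vectors = tfidf_vectors(docs)
```

Here `cosine(vectors["python"], vectors["learnpython"])` comes out higher than `cosine(vectors["python"], vectors["baking"])`, since the last pair shares no terms at all. Whether this beats the user-overlap metric on real subreddits is exactly the open question in the comment.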

jcims · on Jan 29, 2020

Can people still make their upvotes/downvotes public? I found that useful for sussing out related subreddits when it was possible back in the day.

ethn · on Jan 30, 2020

I’ve also rediscovered this in similar work I do. But upon later research I’ve learned this is called containment.

mikk14 · on Jan 29, 2020

If you're interested, in the scientific literature this problem is known as "network backboning". Basically you have all nodes connected to practically all other nodes with weighted edges, and you want to know which are the edges with statistically significant weights.

I wrote on this topic [1]. My method [2] basically uses simple counts on edge weights, and then estimates the expected edge weight and its variance using Bayesian priors. It then attaches a t-score or p-value to each edge, and then you can filter out edges with too low t-score.

The idea is that weak edges can still be statistically significant if they connect "small" nodes. In any case, the library I wrote includes the implementation of a few other methods, in case they work better for your data type.

[1] https://arxiv.org/abs/1701.07336 [2] http://www.michelecoscia.com/?page_id=287
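As a rough illustration of the backboning idea, the sketch below scores each edge against a simple null model. To be clear about what it is not: it does not reproduce the Bayesian method from [2]. It assumes a plain binomial null in which the expected weight of an edge is proportional to the product of its endpoints' total weights, and the edge list is made up.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Edge = Tuple[str, str]


def edge_z_scores(weights: Dict[Edge, float]) -> Dict[Edge, float]:
    """Score undirected weighted edges against a simple binomial null model."""
    total = sum(weights.values())
    if total <= 0:
        return {}
    # Node strength: sum of the weights of all edges touching the node.
    strength: Dict[str, float] = defaultdict(float)
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
    scores = {}
    for (a, b), w in weights.items():
        # Expected weight if edge weight were spread proportionally to strengths.
        expected = strength[a] * strength[b] / (2 * total)
        p = expected / total
        variance = total * p * (1 - p)  # binomial variance (assumed null)
        scores[(a, b)] = (w - expected) / math.sqrt(variance) if variance > 0 else 0.0
    return scores


# Hypothetical co-posting counts: three big hubs and two small subreddits.
weights = {
    ("funny", "pics"): 50,
    ("funny", "videos"): 50,
    ("pics", "videos"): 50,
    ("funny", "ocaml"): 2,
    ("pics", "haskell"): 2,
    ("ocaml", "haskell"): 5,
}
scores = edge_z_scores(weights)
```

In this toy graph the light ocaml/haskell edge scores higher than the heavy funny/pics edge, which matches the point that weak edges can still be significant when they connect "small" nodes. Filtering then amounts to keeping edges whose score exceeds a chosen threshold.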

grawprog · on Jan 29, 2020

>I used my own metric that is based on jaccard similarity. Which in turn is based on "users who posted to X also posted to Y" metric.

Is that really a good metric of similarity? Speaking just for myself, I post in several unrelated subreddits semi-regularly, from programming to video games to music, art and even stonemasonry. I've posted in subreddits for TV shows I've watched, or just completely random things.

I use reddit as a place where I can learn about and interact with people on nearly any subject or topic and I take advantage of that when I can. I'm sure my posting habits aren't that unusual. I'm just not sure that's really an accurate way to gauge similarity.

anvaka · on Jan 29, 2020

Right, I don't think it's 100% accurate either. It gives just some hints about what might be related, but like you said, it's not necessarily the best possible measure of similarity.

Jaccard similarity does not count only the number of people who posted to both A and B. It checks how many people posted to A, how many people posted to B, and how many of those people posted to BOTH A and B. That togetherness gives us hints about what is related, and after it is computed, we divide by the number of distinct posters to A or B, which brings the value into a range that we can use to compare against other subreddits. If that value is close to 1, it means that almost all users who posted to A have also posted to B, and vice versa. If it is close to 0, then the overlap is much smaller.
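For reference, the textbook form of Jaccard similarity is the size of the intersection divided by the size of the union. The sketch below shows only that textbook form; the project uses its own metric based on it, so this is not necessarily the exact computation behind the graph, and the poster names are made up.

```python
from typing import Set


def jaccard_similarity(posters_a: Set[str], posters_b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|: shared posters over all distinct posters."""
    either = posters_a | posters_b
    if not either:
        return 0.0
    return len(posters_a & posters_b) / len(either)


# Hypothetical posters: two shared users out of five distinct ones.
sub_a = {"ann", "bob", "cat", "dan"}
sub_b = {"cat", "dan", "eve"}
similarity = jaccard_similarity(sub_a, sub_b)
```

Two of the five distinct posters appear in both sets here, so the similarity is 2/5. Identical sets give 1 and disjoint sets give 0.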

dannytatom · on Jan 29, 2020

This has a similar issue to the way Reddit recommends subs itself. A sub will be considered similar if it's the exact opposite due to users of one sub going to another to shitpost or brigade or etc.

I'm not sure of a fix for that, but would it be possible/helpful to weigh it by average upvote/downvote of the comments left from users of said sub? Meaning if sub A is about how much baseball sucks, and sub B is about how amazing baseball is, while determining if the 2 are similar you'd find out most posts from sub A to sub B are heavily downvoted and so probably not similar.

anvaka · on Jan 29, 2020

It seems, though, that the relationship can be considered to have a sign: positive when people mostly align in their upvote intents, and negative when people do the brigading, etc.

I'd still see the ability to determine the absolute value of a relationship as a valuable property of a recommender.

zmix · on Jan 29, 2020

A lot of subreddits have "Recommended" or "Related" boxes at the side. Why don't you use these first, and then fall back to the algorithm(s)?

anvaka · on Jan 29, 2020

Thank you for the suggestion!

I think when I created this tool there were no recommendations on reddit.

When they were introduced later on reddit, I was contemplating using reddit's own recommendations, but at that time they were missing a few smaller subreddits, so I just put it off onto the shelf of projects to try.

varjag · on Jan 29, 2020

Have to say the link between The_Donald and TwoXChromosomes was a surprising one.

Mountain_Skies · on Jan 29, 2020

Probably due to both subreddits going on trolling raids on each other.

varjag · on Jan 29, 2020

Reddit tries to keep a lid on brigading, it's actively policed. Surprising to see this as a major component.

MagnumOpus · on Jan 29, 2020

Reddit execs give lip service to fighting brigading. It is only enforced in rare cases.

I_am_tiberius · on Jan 29, 2020

I guess the amazon principle: Users who posted here, also posted there?

catach · on Jan 29, 2020

The method seems surprisingly effective for a lot of subs, but also bothers me.

stared · on Jan 29, 2020

Thank you for creating this project!

I share your interest in creating graphs (vide some older project of mine, a map of Stack Exchange tags https://p.migdal.pl/tagoverflow/).

When I dived into various visualizations of subreddits (as background reading for this side-project https://observablehq.com/@stared/tree-of-reddit-sex-life), your vis was the best for exploration (out of many, many). The only other one that I found at this level was a Sigma.js-based https://www.jacobsilterra.com/subreddit_map/network/ (beautiful, but more static).

anvaka · on Jan 29, 2020

Yay! Thank you for sharing the links and thank you for your kind words!

I see you are doing something with quantum tensors, which is very impressive! I'm still struggling with concept of regular tensors, and would love to have a good intuition/visualization for them - do you have any pointers?

stared · on Jan 29, 2020

It depends on what you are looking for. Tensors as in physics (typically with each dimension being related to space(time)), tensors for quantum information & computing, tensors for deep learning?

I've found it illuminating to see tensor diagrams.

- https://medium.com/@pmigdal/in-the-topic-of-diagrams-i-did-w...

- https://www.math3ma.com/blog/matrices-as-tensor-network-diag...

Also, right now, we are developing a matrix visualization in https://github.com/Quantum-Game. However, it is pretty much work in progress; for a slightly more mature one, go to Quantum Game 2 website and in the element encyclopedia there is one.

I am always up for talking about tensors, so feel invited to drop me an email.

flanbiscuit · on Jan 29, 2020

I love your npm package dependency tree graph, I've actually posted links to it very recently on both HN and Reddit. Awesome tool!

https://npm.anvaka.com/#/

anvaka · on Jan 29, 2020

Oh thank you! I'm very glad you liked it :)!

wsinks · on Jan 29, 2020

This is so cool! You've basically made a different, better version of r/findasubreddit. Which led me to think...

https://anvaka.github.io/sayit/?query=findasubreddit

Super interesting!

So many small and unknown subreddits, and one of the main connections is another subreddit called "Somebody Make This".

I'll take a peek under the hood later, but this on the surface is very cool.

anvaka · on Jan 29, 2020

Thank you :)! Under the hood it is all very naive counting of users who posted to A also posted to B.

_ugfj · on Jan 29, 2020

Thanks, this is very interesting. Mind if I ask how you determine relatedness? https://anvaka.github.io/sayit/?query=UsbCHardware This subreddit is very small, I am a very active poster there and yet I recognize none of the related ones.

rewq4321 · on Jan 29, 2020

I think for smaller subreddits there's not as much data, so results are noisier. Apparently Jaccard similarity was used, except for the really big subreddits: https://github.com/anvaka/sayit

Unrelated, but had to say it somewhere in this discussion: Thanks @anvaka for your open source graph libs! Way more performant than anything else I've tried.

anvaka · on Jan 29, 2020

Hi there. Here is a similar question and answer: https://news.ycombinator.com/item?id=22178373

papln · on Jan 29, 2020

Small request: Can you turn off the initial jumble of nodes before the graph is built and laid out? It's distracting clutter that adds no information.

Animations are fun, but only if they convey artistic or intellectual meaning.

Also, the default zoom level is zoomed in too far (large text, not showing the whole graph). Perhaps this is because viewport resolution/size (phone vs desktop) is not taken into account when choosing a zoom level (font size).

DLA · on Jan 30, 2020

anvaka: your work is just amazing! The viz you did of software repos is outstanding. I just wanted to express thanks to you for sharing as I am a big fan of all things graph as well.

ddoeth · on Jan 29, 2020

I just saw your city roads project, that is so awesome! Thanks for doing that

ptah · on Jan 29, 2020

any information about coverage? r/electribe is not in auto complete and generates a graph with only itself

throw_m239339 · on Jan 29, 2020

Thank you, it's quite a useful app.

hooande · on Jan 29, 2020

great UI for this. very novel and intuitive

anvaka · on Jan 29, 2020

Thank you! I'm very happy to hear this!

mckirk · on Jan 29, 2020

The most interesting graph I could find so far: https://anvaka.github.io/sayit/?query=chairsunderwater

Speaking of which, it'd be awesome if the site could automatically generate multi-reddits from the results. I think a multi-reddit constructed from that graph would be quite interesting ;)

semipro · on Jan 29, 2020

Thanks, I knew about r/chairsunderwater, but never heard about r/breadstapedtotrees. My life will never be the same.

anvaka · on Jan 29, 2020

I didn't know about either. Utterly impressed

mckirk · on Jan 29, 2020

I found r/chairsunderwater while looking for a subreddit that could help me pick out a new office chair.

r/chairs has 800 people subscribed. r/chairsunderwater has 115k. It's just reddit things ¯\_(ツ)_/¯

_bz2r · on Jan 29, 2020

try https://anvaka.github.io/sayit/?query=ocaml

the algorithm is non-deterministic, so if you follow the link more than once, you get a different spatial arrangement each time.

some are very pleasing, with identifiable clusters, and others are just a seemingly random scatter of names.

also interesting: https://anvaka.github.io/sayit/?query=lostredditors

(no lines are generated here; is this a feature or a bug?)

sgentle · on Jan 29, 2020

This cluster is pretty spectacular: https://anvaka.github.io/sayit/?query=DarkFuturology

Edit: oh wow: https://anvaka.github.io/sayit/?query=Jung

mckirk · on Jan 29, 2020

Damn, that's a nice one as well :D

As a side-note, I'm quite disappointed that r/DMT does not pop up when starting with r/JoeRogan...

GloriousKoji · on Jan 29, 2020

Bon Appetit is a New York city based food magazine that also has pretty entertaining YouTube content.

The graph is extremely shocking and not at all what I would have expected.

https://anvaka.github.io/sayit/?query=bon_appetit

orf · on Jan 29, 2020

The_donald members often post in TwoXChromosomes and TropicalWeather?

https://anvaka.github.io/sayit/?query=the_donald

throw_m239339 · on Jan 29, 2020

In fact, a lot of users seem to post both on extreme right-wing and left-wing subreddits, I did the test for a few of these subs, and it's just baffling. Either it's because of the phenomenon called "brigading", where a thread from one side is linked on another sub which leads users of the latter to post in the former, or a lot of users are just trolls playing both sides and inciting drama and outrage between people just for kicks.

papln · on Jan 29, 2020

> inciting drama and outrage between people just for kicks.

or

"popping their filter bubbles", or "are generally interested in boundary-pushing ideas", or "are susceptible to the rage-inducing trolls who run fringe communities".

"Affinity for extreme ideas of any kind" may be a stronger / more common personality trait than "interested in one extreme point in the vector space of ideas".

friendlybus · on Jan 29, 2020

JordanPeterson subreddit links to the_donald, but not the other way around?

papln · on Jan 29, 2020

The metric (Jaccard distance) is symmetric. The asymmetry comes from ranking: JordanPeterson is more special-interest/intense-interest, with a narrower, more homogeneous audience than the_donald, so it probably has more very-similar subreddits than the_donald does. the_donald would have more related subreddits, but with less perfect overlap in membership.

A broader base of people (about 1/4 to 1/2 of the voting-age US public) is aware and supportive of Donald Trump, for a wide variety of reasons, and much more than half are at least aware of Donald Trump. Jordan Peterson has a much smaller following, whose members are interested in him for similar reasons to each other.

friendlybus · on Jan 30, 2020

Makes sense, JP's reddit fanbase is quite small. One slice of his videos has been seen 500 million times across all the edits and platforms. He has insane reach.

blablablerg · on Jan 29, 2020

[flagged]

np_tedious · on Jan 29, 2020

Is there some reason you can't / are reluctant to make an account? Then you can take this off your subscriptions and add non-default subs that interest you

jccalhoun · on Jan 29, 2020

> for example when all men are blamed for not doing the dishes as if all men are dirty beasts: prepare to get burned down by horde of feminists.

citation needed

Offpics · on Jan 29, 2020

I wanted to filter some subreddits from r/all and found out a solution using Filter List for Ublock Origin, it works like a charm.

You can check it out in my repo [1] and gist itself [2].

[1] https://github.com/Offpics/FilterGames

[2] https://raw.githubusercontent.com/Offpics/FilterGames/master...

tayo42 · on Jan 29, 2020

This is built into reddit you don't need to do anything else

papln · on Jan 29, 2020

HN is not the place to file bug reports on Reddit, and people here can't help you find workarounds since you aren't enabling tools that allow you to configure your experience.

fifnir · on Jan 29, 2020

> I am no Trump bigot

....but let me regurgitate bigoted Trumpist talking points which paint me as some sort of victim of the somehow oppressive women.

RES on firefox works perfectly well for me, if twoXchromosome of all subs bothers you on /r/all just try harder to get it working.

rednerrus · on Jan 29, 2020

The link between r/Android and r/iamverysmart seems legit.

soylentcola · on Jan 29, 2020

Link also exists with r/apple and r/ios (but not r/windows or r/linux).

rewq4321 · on Jan 29, 2020

See also: https://subredditstats.com/subreddit-user-overlaps/programmi...

Glosster · on Jan 29, 2020

This one seems to be doing a worse job. Compare what I'm getting for /r/longevity:

https://subredditstats.com/subreddit-user-overlaps/longevity

vs

https://anvaka.github.io/sayit/?query=longevity

FredrikMeyer · on Jan 29, 2020

Earlier discussion (june 2019) https://news.ycombinator.com/item?id=18866800

morceauxdebois · on Jan 29, 2020

Someone should make a graph of all the moderators letting subreddits turn into garbage bot farms.

neiman · on Jan 29, 2020

What a fantastic idea!

I always keep on looking for new interesting subreddits. This tool is the best I saw for this task so far.

anvaka · on Jan 29, 2020

I'm so glad to know this! Thank you!

dredmorbius · on Jan 29, 2020

https://anvaka.github.io/sayit/?query=hackernews

mapleboi · on Jan 29, 2020

this is so sick! finally an easy way to find some new subreddits. thanks!

anvaka · on Jan 29, 2020

Yay! Thank you!

iblaine · on Jan 29, 2020

If I pick a relatively obscure subreddit like dataengineering, then the results are noisy. Maybe increase the distance/decrease the charge between nodes as the number of children on a node increases?

itsmhuang · on Jan 29, 2020

I'm not able to see the contents of a subreddit in the sidebar after clicking on it. I'm on MacOS Catalina using Google Chrome 79 (pretty modern).

anvaka · on Jan 29, 2020

I heard this might be caused by some adblocking extensions - they consider reddit to be an ad/tracking system, so they block all javascript requests to it. Do you happen to have one of those extensions?

justaman · on Jan 29, 2020

Searching for r/aww links to lots of porn.

petey283 · on Jan 29, 2020

Great project.. and great implementation.

anvaka · on Jan 29, 2020

Thank you!

ProbablyRyaan · on Jan 29, 2020

*immediately types in porn subreddit.

shanth · on Jan 29, 2020

would be nice to keep that renderer as separate library.

anvaka · on Jan 29, 2020

Thank you for your suggestion!

This implementation is tailored to smaller graphs with sometimes long text boxes.

I'll make a note to extract it to a reusable component.

totony · on Jan 30, 2020

I second this, really neat graph