How does this infer subreddit similarity? For instance, I checked for r/AskHistorians and the results don't seem that relevant.
edit: never mind, I just read your GitHub readme. But the question still stands: is "users posting in x also posted in y" a good way to infer similarity? Could comparing top-ranked posts be a better comparator?
I used my own metric that is based on Jaccard similarity, which in turn is based on the "users who posted to X also posted to Y" metric.
That said, there are a few subreddits that are too popular and their similarity results were too saturated (/r/videos, /r/funny, etc.), so I did a manual override by looking into the most commonly mentioned other subreddits (and sometimes into the `about` blurb of the subreddit).
Please don't consider these recommendations a source of truth! It's just a fun way to discover other subreddits :).
I'm also very open to changing this metric to something else - please let me know if you have any recommendations!
I work for reddit and here I’ve used a similar technique to your jaccard distance but with one twist: divide by the size of the smaller subreddit (in your case, the number of unique posters that you’ve recorded). That gives you a directional relatedness, that is programming->python but not necessarily python->programming. Used this way you account for the giant subreddit problem automatically but now the results are less “amitheasshole is related to askreddit” and more like “linguisticshumor is a more niche version of linguistics”.
The great thing is that it’s actually more actionable as far as recommendations go! Everybody has already heard of the bigger version of this subreddit, but they probably haven’t heard of the smaller versions. And it’s self-correcting. As a subreddit gets bigger we are less likely to recommend it (which is great because it needs our help less)
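To make the "divide by the smaller subreddit" twist concrete, here is a minimal sketch comparing it with plain Jaccard. The poster sets and the function names (`jaccard`, `overlap_score`) are made up for illustration; this is just one reading of the comment above, not Reddit's actual implementation.

```python
def jaccard(posters_a, posters_b):
    """Shared posters divided by all distinct posters in either subreddit."""
    a, b = set(posters_a), set(posters_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def overlap_score(posters_a, posters_b):
    """Shared posters divided by the size of the smaller subreddit."""
    a, b = set(posters_a), set(posters_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


# Hypothetical data: a big subreddit and a small niche one.
linguistics = {f"u{i}" for i in range(10)}   # 10 posters: u0 .. u9
linguisticshumor = {"u0", "u1", "x"}         # 3 posters, 2 shared with the big one

print(round(jaccard(linguistics, linguisticshumor), 3))        # 0.182
print(round(overlap_score(linguistics, linguisticshumor), 3))  # 0.667
```

With Jaccard the big subreddit's size drags the score down, while dividing by the smaller side keeps the niche subreddit's strong overlap visible.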
If you guys are interested in seeing how your recommendations work for the entire reddit, I'd be happy to build you a spaceship similar to this one https://github.com/anvaka/word2vec-graph .
I couldn't find an easy way to download the entire recommendation graph, but it would be awesome if we could make it work. My email is the same as this account at gmail, and twitter is all open: https://twitter.com/anvaka
My suggestion would be something much simpler, i.e. to compare the content itself: take e.g. the top 1000 posts from each subreddit, and estimate (say) cosine similarity over TF-IDF vectors. Wouldn't that be a better indicator? Also, instead of pairwise comparison, you could try clustering (HDBSCAN for example) to reduce computational complexity.
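A rough sketch of that content-based idea in plain Python, assuming each subreddit's top posts have already been concatenated into one string. The toy texts, the whitespace tokenizer, and the smoothed IDF formula are all assumptions for illustration; a real version would need proper tokenization and far more text.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a subreddit name to its text; returns name -> {term: weight}."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many subreddits does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for "top posts of each subreddit".
docs = {
    "python": "python list comprehension python decorators",
    "learnpython": "python beginner list question",
    "baking": "sourdough bread oven recipe",
}
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs["python"], vecs["learnpython"])  # positive: shared terms
sim_far = cosine(vecs["python"], vecs["baking"])         # 0.0: no shared terms
print(sim_close > sim_far)
```

This still compares subreddits pairwise; the clustering step mentioned above is not covered here.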
If you're interested, in the scientific literature this problem is known as "network backboning". Basically you have all nodes connected to practically all other nodes with weighted edges, and you want to know which are the edges with statistically significant weights.
I wrote on this topic [1]. My method [2] basically uses simple counts on edge weights, and then estimates the expected edge weight and its variance using Bayesian priors. It then attaches a t-score or p-value to each edge, and then you can filter out edges with too low t-score.
The idea is that weak edges can still be statistically significant if they connect "small" nodes. In any case, the library I wrote includes the implementation of a few other methods, in case they work better for your data type.
>I used my own metric that is based on jaccard similarity. Which in turn is based on "users who posted to X also posted to Y" metric.
Is that really a good metric of similarity? Speaking for myself, I post in several unrelated subreddits semi-regularly, from programming to video games to music, art and even stonemasonry. I've posted in subreddits for TV shows I've watched, or just completely random things.
I use reddit as a place where I can learn about and interact with people on nearly any subject or topic and I take advantage of that when I can. I'm sure my posting habits aren't that unusual. I'm just not sure that's really an accurate way to gauge similarity.
Right, I don't think it's 100% accurate either. It gives just some hints about what might be related, but like you said, it's not necessarily the best possible measure of similarity.
Jaccard similarity does not count only the number of people who posted to both A and B. It looks at how many people posted to A, how many people posted to B, and how many of those people posted to both A and B. That togetherness gives us hints about what is related, and after it is computed, we divide by the total number of distinct posters to A or B (their union), which brings the value into a range that we can compare across subreddits. If that value is close to 1, it means that almost all users who posted to A have also posted to B, and vice versa. If it is close to 0, then the overlap is much smaller.
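As a tiny worked example of the computation described here, with made-up poster sets (the names and the `jaccard` helper are hypothetical, and this is not the actual code behind the site):

```python
def jaccard(posters_a, posters_b):
    """Posters in both A and B, divided by distinct posters in A or B."""
    a, b = set(posters_a), set(posters_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical posters per subreddit.
posters = {
    "programming": {"ann", "bob", "cat", "dan"},
    "python": {"bob", "cat", "dan", "eve"},
    "cooking": {"ann", "fay"},
}

# 3 shared posters out of 5 distinct ones
print(jaccard(posters["programming"], posters["python"]))   # 0.6
# 1 shared poster out of 5 distinct ones
print(jaccard(posters["programming"], posters["cooking"]))  # 0.2
```

Ranking the other subreddits by this value for a given subreddit gives the "related" list.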
This has a similar issue to the way Reddit recommends subs itself. A sub will be considered similar even if it's the exact opposite, due to users of one sub going to another to shitpost, brigade, etc.
I'm not sure of a fix for that, but would it be possible/helpful to weigh it by average upvote/downvote of the comments left from users of said sub? Meaning if sub A is about how much baseball sucks, and sub B is about how amazing baseball is, while determining if the 2 are similar you'd find out most posts from sub A to sub B are heavily downvoted and so probably not similar.
It seems, though, that the relationship can be considered to have a sign: positive when people mostly align in their upvote intents, and negative when people do the brigading, etc.
I'd still see the ability to determine the absolute value of the relationship as a valuable property of a recommender.
I think when I created this tool there were no recommendations on reddit.
When they were introduced later on reddit, I was contemplating using reddit's own recommendations, but at that time they were missing a few smaller subreddits, so I just put it off onto the shelf of projects to try.
Yay! Thank you for sharing the links and thank you for your kind words!
I see you are doing something with quantum tensors, which is very impressive! I'm still struggling with the concept of regular tensors, and would love to have a good intuition/visualization for them - do you have any pointers?
It depends on what you are looking for. Tensors as in physics (typically with each dimension being related to space(time)), tensors for quantum information & computing, tensors for deep learning?
I've found it illuminating to see tensor diagrams.
Also, right now, we are developing a matrix visualization in https://github.com/Quantum-Game. However, it is pretty much work in progress; for a slightly more mature one, go to Quantum Game 2 website and in the element encyclopedia there is one.
I am always up for talking about tensors, so feel invited to drop me an email.
Thanks, this is very interesting. Mind if I ask how you determine relatedness? https://anvaka.github.io/sayit/?query=UsbCHardware This subreddit is very small, I am a very active poster there and yet I recognize none of the related ones.
I think for smaller subreddits there's not as much data, so the results are noisier. Apparently Jaccard similarity was used, except for the really big subreddits: https://github.com/anvaka/sayit
Unrelated, but had to say it somewhere in this discussion: Thanks @anvaka for your open source graph libs! Way more performant than anything else I've tried.
Small request: Can you turn off the initial jumble of nodes before the graph is built and laid out? It's distracting clutter that adds no information.
Animations are fun, but only if they convey artistic or intellectual meaning.
Also, the default zoom level is zoomed in too far (large text, not showing the whole graph). Perhaps this is because viewport resolution/size (phone vs desktop) is not taken into account when choosing a zoom level (font size).
anvaka: your work is just amazing! The viz you did of software repos is outstanding. I just wanted to express thanks to you for sharing as I am a big fan of all things graph as well.
Speaking of which, it'd be awesome if the site could automatically generate multi-reddits from the results. I think a multi-reddit constructed from that graph would be quite interesting ;)
In fact, a lot of users seem to post both on extreme right-wing and left-wing subreddits, I did the test for a few of these subs, and it's just baffling. Either it's because of the phenomenon called "brigading", where a thread from one side is linked on another sub which leads users of the latter to post in the former, or a lot of users are just trolls playing both sides and inciting drama and outrage between people just for kicks.
> inciting drama and outrage between people just for kicks.
or
"popping their filter bubbles", or "are generally interested in boundary-pushing ideas", or "are susceptible to the rage-inducing trolls who run fringe communities".
"Affinity for extreme ideas of any kind" may be a stronger / more common personality trait than "interested in one extreme point in the vector space of ideas".
The metric (Jaccard distance) is symmetric. The asymmetry comes from ranking: JordanPeterson is more special-interest/intense-interest, with a narrower, more homogeneous audience than the_donald, so it probably has more very-similar subreddits to it than the_donald does. the_donald would have more related subreddits but with less perfect overlap in membership.
A broader base of people (about 1/4 to 1/2 of the voting age US public) is aware and supportive of Donald Trump, for a wide variety of reasons, and much more than half are at least aware of Donald Trump. Jordan Peterson has a much smaller following, who are interested in him for similar reasons to each other.
Makes sense, JP's reddit fanbase is quite small. One slice of his videos has been seen across all the edits and platforms 500million times. He has insane reach.
Is there some reason you can't / are reluctant to make an account? Then you can take this off your subscriptions and add non-default subs that interest you
HN is not the place to file bug reports on Reddit, and people here can't help you find workarounds since you aren't enabling tools that allow you to configure your experience.
If I pick a relatively obscure subreddit like dataengineering, then the results are noisy. Maybe increase the distance/decrease the charge between nodes as the number of children on a node increases?
I heard this might be caused by some adblocking extensions - they consider reddit to be an ad/tracking system, so they block all javascript requests to it. Do you happen to have one of those extensions?
Just wanted to say thank you for sharing this! Would be happy to answer any questions - graphs are my long time hobby, and I love them!
PS: You can find more recent graphs and fun projects here: https://twitter.com/search?q=from%3Aanvaka%20min_retweets%3A...