Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.
The most interesting thing is that my writing style has changed pretty drastically over the past decade. Searching for my oldest account matched my earliest usernames, whereas searching this account matched the rest.
The details of the algorithm (https://stylometry.net/about) are fascinating, mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.
I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor in social media. Any time my account gathers enough upvotes, I destroy it for another.
I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.
Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.
On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with a single post or comment), so seeing the list of users most similar to me was interesting. A quick peek at their comments didn't turn up any stark similarities, but it was fun to look.
Yeah, top 20 is a little excessive; in my own tests I found it's only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!
As far as I’m concerned, it’s the killer feature of the app. The top 20 results may be noisy, but the bolded results have a signal to noise ratio close to infinity.
False positives become an increasingly difficult problem the more potential authors you introduce. If I had written a fancier model it probably wouldn't be as much of a problem, but what can you do.
Hmm, that wasn’t my intent. I see this tool as a recommendation engine more than a doxxer. By “signal to noise ratio close to infinity,” I meant that if you visit one of the bolded accounts, they’ll probably sound a lot like you.
It’s one of those ideas that makes the tool substantially more effective, yet never would’ve occurred to me. It’s like the simplicity of pg’s “a plan for spam” algorithm: deceptively simple, but (like scrubbing dishes with fingers) works really well.
Hi style-adjacent friend :-). Just briefly looking at your recent comment history, we seem to find different kinds of articles interesting, but maybe have a similar writing style.
The approach I took was a bit different, but also no ML required.
The real trick is pruning and going cross-platform. There are around 100k active HN accounts (meaning they post a few times a year), maybe 200k if you count anyone with at least one post a year. But fewer than 10k post weekly.
It's a very small space to compare, so simple methods work fine.
Exactly. HN emphasizes long-form posts much more than other forums, which makes the commenters here very susceptible to this kind of analysis. Plus, you can fit every single HN comment in RAM on a mid-tier gaming laptop, so it's even easier. I was trying to think of applications for this kind of data, and the only thing I came up with was moderation tools/detecting ban evaders, but what you've done seems much more profitable lol.
It is. The description on the about page is a little simplified, but basically I look at the most common word and character ngrams of sizes 1, 2, and 3 (200 each), put all the frequencies in an array, and then compare against all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
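Roughly something like this sketch (the sklearn link above got truncated, so cosine similarity is an assumption for the comparison metric; the toy users and tiny corpus are just to show the shape of it):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One "document" per user: all their comments concatenated.
users = {
    "alice": "the quick brown fox jumps over the lazy dog",
    "bob":   "the quick brown fox leaps over a sleepy cat",
    "carol": "completely different text with other habits entirely",
}
docs = list(users.values())

# Top word n-grams (n=1..3) and top character n-grams (n=1..3),
# capped at 200 each as described above (caps barely bind on toy data).
word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 3), max_features=200)
char_vec = CountVectorizer(analyzer="char", ngram_range=(1, 3), max_features=200)
word_counts = word_vec.fit_transform(docs).toarray().astype(float)
char_counts = char_vec.fit_transform(docs).toarray().astype(float)

# Normalize counts to frequencies and concatenate into one profile per user.
def to_freq(m):
    return m / m.sum(axis=1, keepdims=True)

profiles = np.hstack([to_freq(word_counts), to_freq(char_counts)])

# Compare every user against every other; cosine similarity is a guess
# at the truncated sklearn metric.
sims = cosine_similarity(profiles)
names = list(users)
best_match = {}
for i, name in enumerate(names):
    j = max((s, j) for j, s in enumerate(sims[i]) if j != i)[1]
    best_match[name] = names[j]
print(best_match)
```

With real data you'd build one profile per account and rank all pairwise similarities, which is what makes the "top N most similar users" list cheap to produce.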
Cool. I only skimmed the description; maybe I needed to read it more carefully.
Have you considered doing rune rather than word ngrams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long long ago in C for automatic document language detection. It was quite accurate.