
Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.

The most interesting thing is that my writing style changed pretty drastically since a decade ago. Searching for my oldest account matches my earliest usernames, whereas searching this account matched the rest.

The details of the algorithm (https://stylometry.net/about) are fascinating, mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but it does nothing so fancy.



Woof.

I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor in social media. Any time my account gathers enough upvotes, I destroy it for another.

I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.

Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.


This makes me melancholic. One should be able to express themselves without the overhead of privacy concerns.


Exact same thing happened to me. Wild.


On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment), so seeing the list of users most similar to me was interesting. I didn't see any stark similarities from a quick peek at their comments, though.


Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!


FWIW, top 20 was necessary for mine. The bolding was a brilliant move. Several of my accounts were ranked 10-20, but popped out due to the bolding.


What does the bolding indicate?


The explanation is here: https://news.ycombinator.com/item?id=33755466

As far as I’m concerned, it’s the killer feature of the app. The top 20 results may be noisy, but the bolded results have a signal to noise ratio close to infinity.


The precision of the bolded results looks like maybe 30% to me. Significantly better than the non-bolded, but nowhere near perfect precision.


False positives become an increasingly difficult problem the more potential authors you introduce. If I had written a fancier model it probably wouldn't be as much of a problem, but what can you do.
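A rough way to see why the candidate pool matters, as a sketch rather than a description of how the tool scores anything: assume each unrelated author independently has some small chance of clearing the match threshold by accident. The function name, the per-author rate, and the pool sizes below are all hypothetical numbers chosen for illustration.

```python
def chance_of_false_positive(per_author_rate, num_authors):
    """Probability that at least one unrelated author clears the threshold.

    Assumes every author is an independent trial with the same accidental
    match rate, which is a simplification made for this sketch.
    """
    return 1.0 - (1.0 - per_author_rate) ** num_authors


if __name__ == "__main__":
    # Hypothetical accidental match rate of 1 in 10,000 per author.
    for pool_size in (100, 10_000, 100_000):
        print(pool_size, chance_of_false_positive(0.0001, pool_size))
```

Even with a tiny per-author rate, the chance of at least one spurious match climbs toward 1 as the pool grows.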


Yes, this wasn't a criticism of the tool. It is crazy good.

But I don't think people should be making the assumption that bolded results are definite alts, which sillysaurus' comment reads like.


Hmm, that wasn’t my intent. I see this tool as a recommendation engine more than a doxxer. By “signal to noise ratio close to infinity,” I meant that if you visit one of the bolded accounts, they’ll probably sound a lot like you.

It’s one of those ideas that makes the tool substantially more effective, yet never would’ve occurred to me. It’s like the simplicity of pg’s “a plan for spam” algorithm: deceptively simple, but (like scrubbing dishes with fingers) works really well.


> I see this tool as a recommendation engine more than a doxxer.

That is absolutely all this will be used for. This is a dangerous tool that serves no real world purpose.


Of my top 20, 19 are bold, all are above 0.6, and I have no alts.


Vast majority of my top 20 were bold, except you funnily enough!

None of them are me (and you were the only one I recognised and thought "yeah, I can see where it gets it from"...)


I have 7 bolded names (0.53-0.62) in the top 20 list, and none are alts of mine.


I'm one of them and I can confirm. But then again that's what I'd say if I was.


Hi style-adjacent friend :-). Just briefly looking at your recent comment history, we seem to find different kinds of articles interesting, but maybe have a similar writing style.


Pretty much the exact same. (I do have a throwaway account but I rarely use it and it probably hasn't been used enough to qualify.)


The funny thing is that I thought of it while eating dinner last night :)


My results have 5 bolded users in my top 20, and I have 0 alt accounts.


Frankly, this is similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)

https://news.ycombinator.com/item?id=17944293

The approach I took was a bit different, but also no ML required.

The real trick is pruning and going cross-platform. There are around 100k active HN accounts (meaning they post a few times a year), maybe 200k if you count at least one post a year, but <10k post weekly.

It’s a very small space to try to compare so simple methods will work fine.


Exactly. HN emphasizes long-form posts much more than other forums, which makes the commenters here very susceptible to this kind of analysis. Plus you can fit every single HN comment in RAM on a mid-tier gaming laptop, so it's even easier. I was trying to think of applications for this kind of data, and the only thing I could think of was moderation tools/detecting ban evaders, but what you've done seems much more profitable lol.


It works like a charm for me too.

I put in my username and found my pre-echelon alt, possibilistic.

(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)


I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.


It is. The description on the about page is a little simplified, but basically I look at the most common word and character ngrams of size 1,2,3 (200 each), put all the frequencies in an array and then compare to all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
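For anyone who wants to code that up and compare, here is a minimal standard-library sketch of the idea as described: word and character n-grams of size 1 to 3, the 200 most common per group, relative frequencies in an array, then a pairwise comparison. It is not the site's actual code. The link above is truncated, so the metric is unknown; cosine similarity is used here purely as an assumption. All function names and the sample users are hypothetical.

```python
import math
from collections import Counter

TOP_K = 200        # features kept per (kind, size) group, per the comment
SIZES = (1, 2, 3)  # n-gram sizes, per the comment


def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def grouped_counts(text):
    """Count word and character n-grams, keyed by (kind, size)."""
    text = text.lower()
    words = text.split()
    groups = {}
    for n in SIZES:
        groups[("word", n)] = Counter(ngrams(words, n))
        groups[("char", n)] = Counter(ngrams(text, n))
    return groups


def build_vocabulary(texts_by_user):
    """Pick the TOP_K most common n-grams in each group across all users."""
    totals = {}
    for text in texts_by_user.values():
        for key, counts in grouped_counts(text).items():
            totals.setdefault(key, Counter()).update(counts)
    vocab = []
    for key in sorted(totals):
        vocab.extend((key, gram) for gram, _ in totals[key].most_common(TOP_K))
    return vocab


def vectorize(text, vocab):
    """Relative frequency of each vocabulary n-gram within its own group."""
    groups = grouped_counts(text)
    sizes = {key: sum(counts.values()) for key, counts in groups.items()}
    return [groups[key][gram] / sizes[key] if sizes[key] else 0.0
            for key, gram in vocab]


def cosine_similarity(a, b):
    """Assumed metric; the one the tool actually uses is not stated here."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(target, texts_by_user):
    """Return (score, user) pairs for every other user, most similar first."""
    vocab = build_vocabulary(texts_by_user)
    vectors = {u: vectorize(t, vocab) for u, t in texts_by_user.items()}
    scores = [(cosine_similarity(vectors[target], vec), user)
              for user, vec in vectors.items() if user != target]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Hypothetical users and comments, for illustration only.
    sample = {
        "alice": "I basically think the simple approach works fine here.",
        "alice_alt": "I basically think a simple approach works fine, really.",
        "bob": "Benchmarks or it didn't happen; show me numbers!",
    }
    for score, user in rank_similar("alice", sample):
        print(user, score)
```

On real data each user's text would be all of their comments concatenated, and the vocabulary would be built once rather than on every query.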


Cool. I only skimmed the description; maybe I needed to read it more carefully.

Have you considered doing rune rather than word ngrams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long long ago in C for automatic document language detection. It was quite accurate.


sillysaurus3 was in mine. :) Clearly we're not the same.


> sillysaurus3

> sillysaurus2

Tbf a human could have found a bunch of them relatively easily



