How Mattermark Teamed Up With Bloomberg Beta to Predict Who Will Start Companies

onion2k · on March 26, 2014

Take two random people, Bob and Sarah. The chance that each of them is going to start a company is equal.

Leave Bob to get on with his life. Maybe he'll start a company, maybe he won't.

Invite Sarah to a private event at a well-funded VC company and call it "Future Founders".

If either of them definitely wasn't going to start a company then they still won't. If either definitely was then they still will. Nothing has changed.

But if either of them had considered starting, and was wondering whether to or not, being labelled a "Future Founder" and being granted access to a group of 349 other people who rank highly on an entrepreneurial scale, plus direct contact with high-profile VCs, seems likely to influence their decision. That nudge could easily account for the difference between founders and non-founders.

Did Mattermark factor that into their findings? As predictions go, this one seems a bit self-fulfilling to me.

roybahat · on March 27, 2014

Hey... I run Bloomberg Beta. I think that creating a self-fulfilling prophecy is absolutely a dynamic here. Our goal isn't so much to succeed at predicting as it is to get to know great people before they start companies. If that action induces them to start companies, and they have a great experience doing it, but our prediction's accuracy is compromised, that's fine. We're using data as a tool to make things better, not for its predictive value in and of itself.

loumf · on March 26, 2014

They don't have findings yet. They are predicting 17% conversion. After a year or so, we'd have to revisit this group to see what happened. At that point, it would be fair to take the "Future Founder" label into account.

Probably, they should compare the #351-700 cohort's prediction against actual to the "Future Founder" prediction against actual.
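A rough sketch of that comparison, assuming you can eventually count founders in both groups. The function name and every count and rate in the example calls are hypothetical placeholders, not Mattermark's figures; only the 17% prediction for the invited group comes from the post.

```python
def observed_over_predicted(n_people, n_founders, predicted_rate):
    """Ratio of the observed founding rate to the rate the model predicted.

    A value near 1.0 means the prediction held up for that cohort.
    """
    observed_rate = n_founders / n_people
    return observed_rate / predicted_rate


# Hypothetical placeholder outcomes, for illustration only.
invited = observed_over_predicted(n_people=350, n_founders=70, predicted_rate=0.17)
runner_up = observed_over_predicted(n_people=350, n_founders=35, predicted_rate=0.10)

# If the invited cohort beats its prediction by more than the uninvited
# #351-700 cohort does, the gap is a crude estimate of the label's effect.
label_effect = invited / runner_up
print(f"invited: {invited:.2f}, runner-up: {runner_up:.2f}, ratio: {label_effect:.2f}")
```

The two cohorts differ in score as well as in being invited, so this only bounds the effect; a randomized holdout would separate the two more cleanly.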

stillsut · on March 26, 2014

You're absolutely correct, Onion.

This is referred to as Uplift Modelling in the direct marketing lit.

To see whether your algo has real-world performance, you would want your invite list to include a (blind) holdout set composed of random people sampled from each decile of the entire scored population, meaning [(360/2)/10]=18 should be the bottom 10% of potential founders according to the algo.
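A minimal sketch of drawing such a holdout, assuming the scored population is available as a mapping from person to score. The function name, the toy scores, and the seed are made up for illustration; the 18 per decile is the figure from the arithmetic above.

```python
import random


def decile_holdout(scores, per_decile, seed=0):
    """Pick `per_decile` random people from each score decile.

    `scores` maps a person id to the model's score (hypothetical input format).
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)  # lowest score first
    n = len(ranked)
    holdout = []
    for d in range(10):
        bucket = ranked[d * n // 10:(d + 1) * n // 10]
        holdout.extend(rng.sample(bucket, min(per_decile, len(bucket))))
    return holdout


# Toy population of 1,000 scored people (illustrative only).
scores = {f"person_{i}": i / 1000 for i in range(1000)}
holdout = decile_holdout(scores, per_decile=18)
print(len(holdout))  # 10 deciles x 18 people
```

Inviting these people without telling anyone which decile they came from lets you later compare founding rates across deciles under the same "Future Founder" treatment.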

seventytwo · on March 26, 2014

Either way, they'll be able to point to their huge successes on "finding" the next Valley heroes.

7Figures2Commas · on March 26, 2014

The title here seems quite exaggerated, as does the claim that "Mattermark Founder Prediction Is 25X Better Than Chance."

Predicting who in a group of 1.5 million technology professionals is likely to start a company presents an unsupervised learning problem. Short of contacting all 1.5 million people and asking them, there is no way to confirm whether the predictions the system made are correct, so you cannot make claims about efficacy.

There are approaches used to deal with unsupervised learning problems, but there are no details in this post even indicating that the folks involved recognized they were dealing with an unsupervised learning problem in the first place. Instead, we just have claims like "While we believe the future founders group has a 17% chance — 25x higher..." for which no further information is provided.

Perhaps more interesting than the bold claims sans important technical details is the notion that an early-stage fund would look to court potential founders before they even made the decision to become founders. While it's true that many seed stage investors adhere to the mantra of "we invest in people," this is as good an example as I've seen of the fact that there is currently way too much capital chasing too few opportunities.

squigs25 · on March 26, 2014

Wow! This is really cool.

In statistics and machine learning this would be considered an unbalanced data set: predicting who will start a company when the vast majority of people will not is a very difficult task. It's similar to predicting who will be a terrorist (another really difficult problem).

I think the threshold they are using is way off however. Even if someone has only a 5% chance of becoming a founder (or less), that's pretty significant. I understand that would probably increase the population by many orders of magnitude, but only capturing 17% of 350 means ~60 startups will be found as a result of this program. Given that the large majority of those are likely to fail, the numbers could be better.
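For reference, a quick back-of-the-envelope check of those numbers, using only the figures quoted from the post (350 people, a 17% predicted rate, "25x better than chance"). The implied base rate is my own inference from those two figures, not something the post states.

```python
cohort_size = 350        # "Future Founders" invited
predicted_rate = 0.17    # predicted chance of founding, per the post
claimed_lift = 25        # "25x better than chance", per the post

# Expected number of founders in the invited cohort.
expected_founders = cohort_size * predicted_rate

# Base rate implied by dividing the predicted rate by the claimed lift
# (an inference, assuming "chance" means the population-wide rate).
implied_base_rate = predicted_rate / claimed_lift

print(f"expected founders: {expected_founders:.1f}")
print(f"implied base rate: {implied_base_rate:.2%}")
```

That works out to about 60 expected founders and an implied base rate of under 1%, which is why even a 5% threshold would still sit well above chance.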

Some really interesting predictors might be which meetup groups the individual belongs to, what their current job title is, what skills and connections they have on LinkedIn and Facebook, how many founders they are "connected" to, and who they follow on Twitter.

It's also worth mentioning that this is probably biased, because the data set of individuals includes data points for founders only after they became founders. You would ideally want the data from before they became a founder. Perhaps over time this model would get better, as non-founder individuals become founders.

kevin_morrill · on March 26, 2014

CTO of Mattermark here. It is a really interesting problem, because as you say, even if you boost the odds 25x they're still really low. We trained the model on venture-backed founders (e.g. Series A or beyond), which is a somewhat higher bar than just any founder. The hope is that once you reach Series A you're less likely to fail than if you only have seed funding. At some point we want to go back and look at what differentiates founders who reach seed vs. venture backing.

JasonCEC · on March 26, 2014

Can you talk a bit about your feature selection or models?

I run a statistical quality control company using machine learning, and picking up on flaws with tiny probabilities (one batch in every twenty or thirty million) might benefit from similar techniques!

ASquare · on March 26, 2014

Related: https://news.ycombinator.com/item?id=7465150

seanccox · on March 26, 2014

Hmmm... My email must've ended up in the spam box...

dotBen · on March 26, 2014

I'm curious where the data came from - LinkedIn seems obvious, but I'm not aware that they make that kind of corpus available for purchase.