Unfortunately it looks like the author of this piece got caught in the p-value trap. He says:
Keep in mind that people flip coins and get five heads (or five tails) in a row all the time. With a p value of only five percent, one in twenty published papers reporting a p value of five percent will be wrong purely by chance.
That's only true if 50% of the hypotheses you test are true, and your experiment is so good that there are no false negatives. In typical medical trials, by contrast, sample sizes are small enough that there's perhaps a 50% chance of a false negative. (With small samples, you can't tell whether an observed difference between groups is due to the tested medication or just chance, so you conclude there's no statistically significant difference.)
Under realistic circumstances, the chance that a p < 0.05 result reflects a true effect can be as low as 45%.
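To make that arithmetic concrete, here's a quick sketch via Bayes' rule; the prior fraction of true hypotheses and the power are illustrative numbers I picked, not figures from the article:

```python
# Positive predictive value of a p < 0.05 result, via Bayes' rule.
# The prior and power values below are illustrative choices, not
# figures taken from any particular study.

def ppv(prior, power, alpha=0.05):
    """P(hypothesis true | significant result)."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# If half of tested hypotheses are true and there are no false negatives,
# the "1 in 20 wrong" reading holds almost exactly:
print(ppv(prior=0.5, power=1.0))   # ~0.952, i.e. ~5% of findings wrong

# But with 50% power and only ~8% of tested hypotheses true, a
# significant result is right less than half the time:
print(ppv(prior=0.08, power=0.5))  # ~0.465
```

The moral is that the false-discovery rate depends on the base rate of true hypotheses and on power, not just on the 5% threshold.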
I interpreted the author's claim to be that, assuming all tested hypotheses were false, 5% would be publishable by chance alone anyway. So the fact that a hypothesis was published must not be taken too strongly as evidence of its truth. (Indeed, this observation has led to cautionary papers like Ioannidis's "Why Most Published Research Findings Are False" [1].)
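That 5%-by-chance-alone figure is easy to check by simulation; the sample sizes and trial counts below are arbitrary choices for the sketch:

```python
import math
import random

# Simulate experiments in which the null is true (both groups drawn from
# the same distribution) and count how often p < 0.05 anyway.

random.seed(1)

def two_sample_p(a, b):
    """Two-sided p-value from a simple z-test (fine for n=50 per group)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

trials = 2000
hits = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if two_sample_p(a, b) < 0.05:
        hits += 1

print(hits / trials)  # close to 0.05, as expected for all-false hypotheses
```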
The issue I take with that (important) observation is that even though the threshold is set at 5%, the actual results can be far more convincing. In an analysis, you may see several components with p<10⁻⁶, a few with p=0.23, several more with p>0.3, and one with p=0.06. They won't all have p=0.05 - in fact, none of them will.
That said, it's worth being reminded occasionally that statistical significance is in many ways just the first step on the road to knowledge, not the last. But when it comes to false findings, stuff like false negatives and exploratory analysis are far more impactful.
Yeah, it always worries me that so many people see a small p-value like 10^-6 and say, "Wow, that one's definitely true."
But p=10^-6 doesn't mean, as commonly believed, that there's only a one-in-a-million chance that the proposed hypothesis is really false, nor does it even mean what many more-statistically-savvy people think it means, that if the proposed hypothesis were false, there would only be a one-in-a-million chance of observing test data as extreme as what was observed. No, what it really means is that – and here's the part most people miss – assuming that the researchers' model of the underlying data-generating process is correct, then, if the proposed hypothesis were false, there would be only a one-in-a-million chance of observing test data as extreme as what was observed.
Yes, as the p-value becomes smaller, it does indeed become easier to believe that the hypothesis of interest is true, assuming that the humans didn't screw up the model. But, in any complex work, I'm going to have a hard time believing, sans replication, that there's not a reasonable chance of humans screwing up.
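As one invented example of humans screwing up the model (mine, not something from the thread): run a textbook test that assumes independent observations on data that are actually autocorrelated. All the parameters here are illustrative.

```python
import math
import random

# A sketch of a misspecified model: an ordinary z-test that assumes
# independent observations, applied to AR(1) data with phi = 0.9.
# Both series share the same mean, so every rejection is a false positive.

random.seed(2)

def ar1(n, phi=0.9):
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, 1)
        out.append(x)
    return out

def rejects(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    z = (ma - mb) / math.sqrt(va / n + vb / n)
    return abs(z) > 1.96          # nominal 5% two-sided test

trials = 1000
fp = sum(rejects(ar1(200), ar1(200)) for _ in range(trials)) / trials
print(fp)  # far above the nominal 0.05
```

With the wrong independence assumption baked in, the "5% test" rejects a true null most of the time, and it will happily hand you p-values of 10^-6 too.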
To me, then, p=10^-6 is the new p=10^-2.
EDIT: Replaced Unicode superscripts (10⁻⁶) with circumflex notation (10^-6) because the superscripts weren't showing up on my Nexus 7.
Yes, the calculation of a p-value is always done by assuming a specific model, though usually one less controversial than the hypothesis being proposed. But I wouldn't go so far as to demand smaller values.
I would prefer we stop thresholding so much altogether, and instead treat whatever has or has not been found as a suggestion supported by evidence, rather than as fact, for all but the best-understood processes.
It is a difficult task for many people, who have been taught facts for decades, to accept that objective knowledge is hard to come by. But everyone understands the value and properties of a crude model.
Sorry, I wasn't clear. I meant that when I see a p-value of 10^-6 in papers, I expect that there's at least a 1% chance that the humans screwed up the models somewhere, so I don't see it as more persuasive than 10^-2. That is, p-values lose credibility once they start getting smaller than a few percent.
So I agree with you. If it were up to me, we'd all report evidence intensity in decibels, anyway.
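For anyone who hasn't seen the decibel framing (it comes from Jaynes): evidence is 10·log10 of the odds, so Bayesian updating becomes simple addition. A toy example with made-up numbers:

```python
import math

# Evidence in decibels, after Jaynes: e = 10 * log10(odds). Updating is
# then additive: posterior dB = prior dB + 10 * log10(likelihood ratio).
# The prior and likelihood ratio below are made up for illustration.

def to_db(odds):
    return 10 * math.log10(odds)

def to_odds(db):
    return 10 ** (db / 10)

prior_odds = 1 / 99        # a long-shot hypothesis: 1% prior probability
likelihood_ratio = 20      # data 20x more likely under the hypothesis

posterior_db = to_db(prior_odds) + to_db(likelihood_ratio)
posterior_odds = to_odds(posterior_db)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(round(posterior_db, 1), round(posterior_prob, 3))
```

The nice property is that each new piece of evidence just adds its decibels to the running total, which makes the strength of evidence easy to compare across experiments.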
p-values must be the most misinterpreted statistic in the history of statistics. As an example of the trouble you can get into with p-values, see the recent discussion of misreporting of the LHC experiments (a good starting point is http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boso...). Every time I see p-values reported in A/B tests I get leery: people are trying to adopt a scientific mindset, but the tools are leading them astray. (Disclosure: I work on an A/B testing product called Myna [mynaweb.com] that adopts a different approach.)
The other point I particularly like is the discussion on mean values and robust statistics. Most analytics packages just report means, losing so much information in the process. Again I bet many bad decisions are made due to poor tools.
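A tiny made-up example of what gets lost when a package reports only the mean:

```python
import statistics

# A toy "revenue per user" sample (invented numbers) where one whale
# dominates the mean, while robust summaries describe the typical user.

revenue = [1, 2, 2, 3, 3, 3, 4, 4, 5, 500]

mean = statistics.mean(revenue)        # 52.7, which describes almost nobody
median = statistics.median(revenue)    # 3.0, the typical user
trimmed = statistics.mean(sorted(revenue)[1:-1])  # trim one value per tail

print(mean, median, trimmed)
```

A dashboard showing only the 52.7 would suggest a very different business than one showing the median of 3.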
After I read this article (which I see is submitted by the founder of this interesting blog, now a blog with many guest articles), I went to the reading list page on the blog site. There are many VERY GOOD books about mathematics listed there, but the subtopic with the most disappointing recommendations was actually the subtopic on statistics. Two of my favorite online articles on statistics education
I would be interested in hearing specific recommendations for self-study of statistics from you.
I currently have Feller (based on reading www.ams.org/notices/200510/comm-fowler.pdf) but don't have a good "taste" for what would be the best books for a self-study approach to statistics. There is also the "Teaching Statistics: A Bag of Tricks" book and I was considering dropping it into the mix.
I am probably over-thinking the book choice and would be fine just diving in to anything, but I would prefer not to pick up references that are going to drive me off a good path to start . . .
I've written more on this problem here: http://www.refsmmat.com/statistics/#the-p-value-and-the-base...