p-hacking: try it yourself!

It’s not new, but it’s still worth sharing:

The instructions go: “You’re a social scientist with a hunch: The U.S. economy is affected by whether Republicans or Democrats are in office. Try to show that a connection exists, using real data going back to 1948. For your results to be publishable in an academic journal, you’ll need to prove that they are ‘statistically significant’ by achieving a low enough p-value.”

The tool is here: https://projects.fivethirtyeight.com/p-hacking/

And there is more on p-hacking here: Wikipedia, to understand why “success” in the exercise above is not what it seems.
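To make the point concrete, here is a minimal simulation sketch (it is not the FiveThirtyEight tool itself, and the numbers and variable names are invented for illustration): there is no true effect in the data at all, yet trying enough outcome measures will usually turn up something “significant”.

```python
# Minimal p-hacking illustration (assumes numpy and scipy are installed).
# There is NO true effect: the "party in power" dummy and every outcome are pure noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_years = 70           # roughly "data going back to 1948"
n_outcomes = 20        # twenty different ways to measure "the economy"

party = rng.integers(0, 2, n_years)                 # 0 = one party, 1 = the other
outcomes = rng.normal(size=(n_outcomes, n_years))   # noise, unrelated to `party`

hits = []
for i, y in enumerate(outcomes):
    # Compare the mean outcome under the two parties; any "effect" is a false positive.
    t, p = stats.ttest_ind(y[party == 1], y[party == 0])
    if p < 0.05:
        hits.append((i, round(p, 3)))

print(f"'Significant' specifications: {len(hits)} of {n_outcomes}", hits)
# With 20 independent noise outcomes, about one p < 0.05 is expected by chance alone.
```

Reporting only the specification that crossed the threshold, and staying silent about the other nineteen, is exactly the kind of “success” the exercise delivers.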

Peer review encourages p-hacking

I’m sure I’m not the first to notice, but it seems to me that peer review encourages p-hacking. Try this: (1) pre-register your regression analysis before doing the analysis and writing the paper (in your lab notes, or actually on OSF). (2) Do the analysis, and (3) submit. How often do we get recommendations or demands to change the model during the peer-review process? How about controlling for X? Should you not do Y? You should do Z, and so on.

Unless we’re looking at a pre-registered report, we’re being asked to change the model. Typically we don’t know whether these suggestions are based on theory or the empirical results. In the former case, we should probably do a new pre-registration and redo the analysis. Sometimes we catch important things like post-treatment bias… In the latter case, simply resist?

And as reviewers, we should probably be conscious of this (in addition to the extra work we’re asking authors to do, because we know that at this stage authors will typically do anything to get the paper accepted).

Photo credit: CC-by GotCredit, https://flic.kr/p/Sc7Dmi

 

Understanding p-hacking through Yahtzee?

P-values are hard enough to understand; they appear ‘magically’ on the screen. So how can we best communicate the problem of p-hacking? How about using Yahtzee as an analogy to explain the intuition behind p-hacking?

In Yahtzee, players roll five dice to make predetermined combinations (e.g. three of a kind, full house). They are allowed three rolls and can lock dice between rolls. Important for the analogy, players decide which combination they want to use for their round after the three rolls. (“I threw these dice, let’s see which combination fits best…”) This is what adds an element of strategy to the game, and players can optimize their expected (average) points.

Compare this with pre-registration (according to Wikipedia, this way of playing is actually a variant of Yatzy, itself a Yahtzee variant; or is Yahtzee a variant of Yatzy? Whatever.). Here players choose a predetermined combination before throwing their dice. (“Now I’m going to try a full house. Let’s see if the dice play along…”)

If the implications are not clear enough, we can play a couple of rounds to see which way we get higher scores. Clearly, the Yahtzee way leads to (significantly?) more points, and a much smaller likelihood of ending up with 0 points because we failed to get, say, the full house we announced before throwing the dice. Sadly, though, p-values are designed for the forced Yatzy variant.
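If dice are not at hand, a short Monte Carlo sketch makes the same point (for simplicity it uses a single throw of five dice, ignores re-rolls and dice-locking, and the list of scoring categories is just illustrative):

```python
# Yahtzee vs. "forced Yatzy" intuition (single throw, no re-rolls, illustrative only).
import random
from collections import Counter

def counts(dice):
    return sorted(Counter(dice).values(), reverse=True)

def full_house(dice):      return counts(dice)[:2] == [3, 2]
def three_of_a_kind(dice): return counts(dice)[0] >= 3
def four_of_a_kind(dice):  return counts(dice)[0] >= 4
def small_straight(dice):
    return any(run <= set(dice) for run in ({1,2,3,4}, {2,3,4,5}, {3,4,5,6}))

combinations = [full_house, three_of_a_kind, four_of_a_kind, small_straight]

N = 100_000
forced_hits = 0    # "pre-registered": full house announced before the throw
flexible_hits = 0  # "Yahtzee": pick whichever combination fits after the throw

for _ in range(N):
    dice = [random.randint(1, 6) for _ in range(5)]
    forced_hits += full_house(dice)
    flexible_hits += any(combo(dice) for combo in combinations)

print(f"Full house announced beforehand: {forced_hits / N:.1%} of throws score")
print(f"Combination chosen afterwards:   {flexible_hits / N:.1%} of throws score")
```

Choosing the combination after seeing the dice “scores” far more often, for exactly the same reason that choosing the model after seeing the data produces more p < 0.05 results.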

Image: CC-by Joe King

Pre-Registration: A Reasonable Approach

We probably all know that pre-registration of experiments is a good thing. It’s a real solution to what is increasingly called ‘p-hacking’: doing analyses until you find a statistically significant association (which you then report).

One problem is that most pre-registration protocols are pretty complicated, and as researchers in the social sciences we usually lack the inclination (and incentives) to follow complicated protocols typically designed for biomedical experiments. A more reasonable approach is probably AsPredicted: nine simple and straightforward questions, and a pre-registration that remains private until the authors make it public (but can be shared with reviewers in the meantime).

Defending the Decimals? Not so Fast!

In a recent article in Sociological Science, Jeremy Freese comes to the defence of what he calls ‘foolishly false precision’. To cut a short story even shorter, the paper argues for keeping the conventional three decimals when reporting research findings, as long as the research community continues to rely so much (too much) on p-values. The reason is that with three decimals readers can recover reasonably precise p-values, even when the text only reports whether the results fall above or below a specific level of significance.
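As a hypothetical example of what Freese has in mind (the coefficient and standard error below are invented, and the normal approximation is my own shortcut), three decimals in a table are enough to back out an approximate p-value even if the text only says “p < 0.05”:

```python
# Recovering an approximate p-value from a regression table (hypothetical numbers).
from scipy import stats

coef, se = 0.187, 0.094            # as they might appear, rounded to three decimals
z = coef / se                      # implied z/Wald statistic
p = 2 * stats.norm.sf(abs(z))      # two-sided p-value, normal approximation

print(f"z = {z:.2f}, p = {p:.3f}")  # roughly 0.047: recoverable from the table,
                                    # whereas the text might only report "p < 0.05"
```

The recovered value is only approximate, but far more informative for a meta-analysis than “p < 0.05” alone.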

While I share the concerns presented in the paper, I think it may actually do more harm than good. Yes, in the academic literature simply appearing more precise than one is will fool nobody with at least a little bit of statistical training. What we miss, however, by including tables with three or four decimals, is communication. It is easier to see that 0.5 is bigger than 0.3 (and roughly by how much) than to compare, say, 0.4958 and 0.307. Cut decimals or keep them? I think we should do both: cut them as much as we can in the main text (graphics are very strong contenders there), and keep them in the appendix or online supplementary material (as I argued a year ago; and if reviewers think otherwise, ignore them!). That’s exactly in the spirit of Jeremy Freese’s paper, I think: give those doing meta-analyses the numbers they need, while keeping the main text nice and clean.