Completed on 16 Dec 2017 by Krzysztof Jacek Gorgolewski.
Mazor and colleagues, in their manuscript titled “Using experimental data as a voucher for study pre-registration”, propose a solution that could prevent bad actors from pretending that they preregistered the analysis protocol prior to data analysis. This is an interesting approach to an increasingly important problem. However, some technical issues might prevent this solution from being practically applicable.
Author response is in blue.
- The proposed solution only works if the authors share raw data (which would be great, but sadly is not common).
- The verification process requires reviewers to reanalyze the data which seems like an unrealistic expectation.
- Differences between the processing pipelines used by the authors and by the reviewers could lead to slightly different results (see Carp et al. 2012) and raise false concerns about changes to the preregistered protocol. This could be exploited further with a randomization scheme that has a very limited set of orders, all of which yield very similar results.
- “Bob then uses the Python script that he found in the protocol folder to generate a pseudorandom sequence of experimental events, based on the resulting protocol-sum.” Isn’t it a problem that the code translating the checksum into a random order is provided by the authors themselves? What if it always gives the same answer? Am I missing some detail?
- A more sophisticated attack would involve temporally rearranging already acquired data so that it complies with a protocol defined post hoc. This would, however, require a highly motivated bad actor.
- RPR approaches do not necessarily provide time locking. One could imagine a situation in which a bad actor collects data, picks an analysis protocol post hoc, and submits to the first stage of a registered report while pretending that no data have been acquired yet. This way they could game the system, but only if the reviewers do not require changes to the acquisition protocol.
Happy to have this interesting exchange with you.
There are three things to note about point 3. First, as the whole process is crowdsourced, reports of inconsistencies between the shared data and the protocol-folder should be evaluated by the community: a single report is never sufficient to draw any definitive conclusions. This is already the case when it comes to detection of inconsistencies in statistical reports, shared data and image manipulation, and we think this should also be true for preregistration.
Second, if several readers fail to find basic, well-established effects in the shared data, this by itself may be a good reason to put less weight on the authors’ conclusions. For example, if Alice reports stronger amygdala responses for warm than for cold colors, but independent readers cannot find evidence for activations in V1 for stimulus>rest or in motor areas for motor response>rest in her data, conclusions about other secondary effects should be taken with a grain of salt, regardless of the integrity of the preregistration process.
Finally, in the supplemental materials we mention that when the design-entropy is high and the data is noisy, reasonable differences between the predictions and the actual observations are expected, and in fact can be used to support the validity of the pre-registration process (section 5).
You make an important point: the strength of the time-locking mechanism depends on sufficient design-entropy. preRNG only makes sense when the experiment randomization script can generate enough distinct experimental designs. Luckily, this is relatively easy to test: all it takes is to look at the experimental randomization script, or to generate multiple experimental designs using different seeds and check for collisions. Typical fMRI experiments seem to have sufficient min-entropy: even when the possible timings and orders of events are constrained (no 3 events of the same condition in a row, ISI sampled from a limited set of possible values), a reasonable number of experimental runs and participants makes the design-entropy large enough for fraud to be impractical.
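The collision check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate_design`, the condition labels, and the ISI values are hypothetical stand-ins for a real randomization script, not the one used in the manuscript.

```python
import random

CONDITIONS = ("A", "B")        # hypothetical condition labels
ISI_CHOICES = (2.0, 4.0, 6.0)  # hypothetical inter-stimulus intervals (s)


def generate_design(seed, n_events=20):
    """Return one run as a tuple of (condition, isi) pairs, seeded by `seed`."""
    rng = random.Random(seed)
    events = []
    for _ in range(n_events):
        allowed = list(CONDITIONS)
        # Constraint: never three events of the same condition in a row.
        if len(events) >= 2 and events[-1][0] == events[-2][0]:
            allowed.remove(events[-1][0])
        events.append((rng.choice(allowed), rng.choice(ISI_CHOICES)))
    return tuple(events)


def count_collisions(n_seeds, n_events=20):
    """Generate one design per seed and count designs that repeat an earlier one."""
    seen = set()
    collisions = 0
    for seed in range(n_seeds):
        design = generate_design(seed, n_events)
        if design in seen:
            collisions += 1
        seen.add(design)
    return collisions


if __name__ == "__main__":
    # A realistic run length versus a deliberately low-entropy one-event design.
    print("20 events:", count_collisions(1000))
    print("1 event:  ", count_collisions(1000, n_events=1))
```

A design with a single event can take only 6 distinct forms here, so nearly every seed collides; a reader running the same check on a real script would look for exactly this kind of warning sign.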
We were happy to put this to the test, so we used the popular optseq2 to generate design matrices for an event-related design, following the second example here: http://surfer.nmr.mgh.harva.... Importantly, we looped over the same command 350,000 times, changing only the PRNG initialization seed (code available here: https://colab.research.goog... ). We did not find a single collision, meaning that the min-entropy of one run is, with very high probability, at least log2(1,000)≈9.97, i.e. more than 9 bits. An experiment with 12 subjects and 4 runs per subject will thus have a min-entropy of at least 9*12*4=432 bits. In other words, there are more possible experimental designs than atoms in the universe.
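The arithmetic behind this bound can be checked with a short sketch. It assumes that runs are randomized independently, so that their min-entropies add, and it uses the commonly cited order-of-magnitude estimate of 10^80 atoms in the observable universe; the function name is ours, for illustration only.

```python
import math


def total_min_entropy_bits(bits_per_run, n_subjects, runs_per_subject):
    """Lower bound on design min-entropy, assuming independently seeded runs."""
    return bits_per_run * n_subjects * runs_per_subject


# Per-run bound from the text: log2(1,000) is about 9.97, rounded down to 9 bits.
bits_per_run = math.floor(math.log2(1000))

# 12 subjects with 4 runs each.
total_bits = total_min_entropy_bits(bits_per_run, n_subjects=12, runs_per_subject=4)

# Assumption: roughly 10**80 atoms in the observable universe.
ATOMS_IN_UNIVERSE = 10 ** 80

if __name__ == "__main__":
    print("bits per run:", bits_per_run)
    print("total bits:", total_bits)
    print("more designs than atoms:", 2 ** total_bits > ATOMS_IN_UNIVERSE)
```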
Limiting the entropy of the sequence generator will be easy for interested readers to detect (this can be checked without access to the data: by reading the methods section or by finding discrepancies between the reported and the actual randomization used), and will automatically mark the time-locking as invalid. Again, this goes back to the central idea of our scheme - every step in the chain is testable, and this fact should make bad players (or badly incentivized players) think twice before they try to cheat.
Happy to hear your thoughts,
Matan, Noam and Roy
[Bob computes the entropy at the end of the second paragraph, page 6. We will make it clearer that Bob is the one who performs this computation. Furthermore, in his verification file he generates 100 different design matrices, all of which give rise to maps that are weaker by orders of magnitude. This excludes the possibility that the code maps all seeds to the same or very similar sequences.]
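A minimal sketch of how a reader might test whether a checksum-to-sequence script is degenerate is given below. It is not the script from the protocol folder: the SHA-256 hashing, the function names, and the use of a simple shuffle are all assumptions made for illustration.

```python
import hashlib
import random


def protocol_sum(files):
    """Hash a mapping of file name -> file bytes into one hex checksum."""
    digest = hashlib.sha256()
    for name in sorted(files):  # sorted so the result does not depend on dict order
        digest.update(name.encode("utf-8"))
        digest.update(files[name])
    return digest.hexdigest()


def event_order(checksum, events):
    """Shuffle `events` with a PRNG seeded by the hex checksum."""
    rng = random.Random(int(checksum, 16))
    order = list(events)
    rng.shuffle(order)
    return order


def distinct_orders(events, n_seeds=100):
    """Count distinct event orders obtained from n_seeds different checksums."""
    orders = set()
    for i in range(n_seeds):
        checksum = hashlib.sha256(str(i).encode("utf-8")).hexdigest()
        orders.add(tuple(event_order(checksum, events)))
    return len(orders)


if __name__ == "__main__":
    files = {"protocol.txt": b"hypothetical protocol text"}
    print(event_order(protocol_sum(files), ["A", "B", "C", "D"]))
    print("distinct orders from 100 seeds:", distinct_orders(list(range(20))))
```

If `distinct_orders` returned 1, or a very small number, for a realistic event list, the script would be mapping different checksums to the same sequence and the time-locking would not hold.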