My lab is interested in associative learning. We train a person to expect mild pain after hearing a certain sound. To quantify learning, we measure their physiological arousal when they hear the sound – say, whether their hands become sweaty. We don't want to mistake random fluctuations of sweat gland activity for evidence of learning, and so we only look at a specific time window after sound onset. But which time window is best – 3.0 s, 3.5 s, or 4.0 s? All of these values make sense, yet we have to choose one.
Similar thorny issues pervade many fields of experimental psychology. How do we analyse reaction times? How do we measure declarative memory, attention, confidence, physical attraction? For any of these constructs, multiple distinct measurement methods are in simultaneous use – by different labs, or even within the same lab.
As a postdoc in the late 2000s, I came across a related problem. I was developing a novel method for analysing skin conductance responses, a surrogate measure of sweat gland activity. My co-authors Karl Friston and Ray Dolan and I thought this model-based approach had far better face validity than traditional methods – but some journal editors and reviewers strongly disagreed. How could we convince them?
The idea we came up with built on the psychometric concept of criterion validity. Going beyond psychometric tradition, however, we would induce that criterion experimentally: we ran experiments in which we were fairly sure what the average person would learn, and then tested how well a particular measurement method could recover this known effect. We termed this criterion "retrodictive validity".
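To make the logic concrete (this is a toy sketch, not the actual method from the paper), the following simulation constructs data in which the true learning effect is known by design, and compares three candidate analysis windows by how strongly each recovers that effect. All numbers – including the assumption that the true response is best captured by a window around 3.5 s – are invented for illustration.

```python
import random
import statistics

random.seed(1)

def simulate_response(window, learned):
    """Toy skin-conductance score for one trial. We assume (hypothetically)
    that the true response is captured best by a window around 3.5 s, so
    that window carries the most signal when learning has occurred."""
    signal = {3.0: 0.6, 3.5: 1.0, 4.0: 0.7}[window]
    effect = signal if learned else 0.0
    return effect + random.gauss(0.0, 1.0)  # add measurement noise

def retrodictive_validity(window, n=2000):
    """Effect size (a Cohen's-d-like score) contrasting trials where the
    person learned vs. did not learn: how well does this window
    reproduce the learning effect we know is there?"""
    learned = [simulate_response(window, True) for _ in range(n)]
    unlearned = [simulate_response(window, False) for _ in range(n)]
    pooled_sd = statistics.pstdev(learned + unlearned)
    return (statistics.mean(learned) - statistics.mean(unlearned)) / pooled_sd

# The window with the largest effect size has the highest
# retrodictive validity in this toy setting.
for w in (3.0, 3.5, 4.0):
    print(f"window {w} s: d = {retrodictive_validity(w):.2f}")
```

In this sketch the 3.5 s window wins because the simulation was built that way; the point is only that a known, experimentally induced effect gives an objective yardstick for choosing between otherwise equally plausible analysis options.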
For more than 10 years, my team has been using this criterion for methods development. But there is a crucial sticking point: to what extent does retrodictive validity generalise across experimental circumstances? If we select the measurement method that best reproduces associative learning in a very simple experiment, is this method still optimal for comparing learning in the same type of experiment, but under a drug vs. placebo? And what if we use a completely different experimental paradigm – for example, pictures instead of sounds, or explicitly telling people to expect mild pain? Can we still measure this association with a method that was optimised on a conditioning experiment?
This is why I started to delve into statistics, to look at the concept of calibration – a fairly standardised and institutionalised procedure in technology – and to team up with colleagues from other fields of psychology: Filip Melinščak, Steve Fleming, and Manuel Völkle. In our new paper in Nature Human Behaviour, we formally derive the conditions under which retrodictive validity is informative and can be generalised. If these conditions are met, a measurement method can be calibrated in a simple experiment and then applied widely, in completely different and more sophisticated experimental manipulations.
We are fairly confident that this generic approach could be useful throughout experimental psychology and behavioural science. We hope that domain experts will take up this concept and put it into practice.
Primer videos: Calibration in psychology