Our paper on the effect of prisons on violence in the community, recently published in Nature Human Behaviour, relies on administrative data from the Michigan Department of Corrections (MDOC). Our research team has been working with MDOC data for over a decade, and in that time we have learned quite a bit about the promises and potential pitfalls of working with administrative data.
Social scientists work with many different kinds of data, including surveys, interviews, and observations. Administrative data have some unique advantages. One is the relatively low cost of data collection (although data cleaning can be a lot of work); another is that administrative databases usually offer data on many more cases. The sheer size of the data makes possible methodologies -- like the instrumental variables methods we use -- that are generally more data hungry, and also allows for investigation of more fine-grained patterns, like different effects for different subgroups.
One key difference between administrative data and other types of data is that administrative data were originally collected for an entirely different purpose (to support the administrative and organizational functions of a program or agency). This means that we know much less about how the data were created, knowledge that is critical to developing measures and interpreting results correctly. For example, what events trigger data entry into the database? What information is written over when new data are added? How many different people in how many different offices are entering information in the database? What other roles do those people play in the agency, and what information do they have access to, or not? What constraints or incentives do they face? Contrast this with data from a survey. Researchers design every aspect of a survey, from the exact wording of the survey questions to the possible responses to the ordering of the questions, so the process of collecting the data also provides the information needed to analyze it properly. Social scientists have spent decades developing and refining survey questions to measure even the seemingly most basic of concepts, like annual family income.
As we worked with the MDOC administrative data, we confronted the questions listed above and many more. Figuring out the answers played an important role in our conclusions. For example, one important issue was the distinction between being admitted to prison for a new crime versus for a parole or probation violation, as prison admission is an important measure of recidivism. We needed to understand how these various events were recorded in the database and the circumstances under which someone might be recorded in one category or the other. Some violations are also crimes, which could be prosecuted as new crimes in court or simply processed as parole violations by staff. In contrast, probation violations require the individual to be re-sentenced to prison by a judge. We learned that individuals on parole were often not prosecuted for lower-level crimes. Rather, they were simply returned to prison on a parole violation, which was administratively easier and required a lower burden of proof.
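To make this distinction concrete, here is a minimal sketch of the kind of data-cleaning step involved. The column names, codes, and records are entirely invented for illustration; the real MDOC database has its own structure and event codes.

```python
# Hypothetical illustration: keeping admission types distinct when
# measuring recidivism. All fields and values here are made up.
import pandas as pd

admissions = pd.DataFrame({
    "person_id": [101, 102, 103, 104],
    "admission_type": [
        "new_sentence",         # convicted of a new crime in court
        "parole_violation",     # returned to prison by parole process
        "probation_violation",  # re-sentenced to prison by a judge
        "parole_violation",
    ],
})

# Because a parole violation may reflect an unprosecuted new crime,
# we tabulate the categories separately rather than collapsing them
# into a single "recidivism" indicator.
counts = admissions["admission_type"].value_counts()
print(counts)
```

The point of keeping the categories separate is analytical: collapsing them would obscure exactly the substitution between prosecution and administrative return that the agency's processes produce.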
We learned such things because we worked hard to understand the processes that generated the data contained in the administrative databases. We talked to people whose job it was to enter different types of data and to agency researchers who regularly worked with the data to prepare their own reports. We learned how various actors like courts, parole boards, and police interacted to produce the data that showed up in the databases. As we cleaned and analyzed the data and came across surprising patterns or puzzling cases, we regularly consulted with MDOC staff, who generously shared their expertise with us. This required building trusting relationships, and we came to see them in many ways as collaborators on the project.
The observation that researchers must work hard to understand their data is important in the current era of big data. Social scientists and other researchers who work with social data are increasingly using data that were originally created for non-research purposes, whether from administrative databases or the “digital exhaust” of our online lives. Such work requires not only time and effort, but also relationships with those who know the data and expertise in human and organizational behavior.