If I've learned one thing from the history of science and technology, it's that the way people ask questions shapes the way that they answer them.
With this in mind, I'm continuing my discussion of the nature of Big Data - or, rather, of different ways in which we can ask the question: "What is Big Data?"
Yesterday, I discussed an essay in which sociologist Nathan Jurgenson tackles Big Data as a "cultural ideology." I thought that this framing yielded some useful critical insights; I just wasn't sure that they were insights that told us much about Big Data as a category of technical practice.*
Today, I'm going to examine how the historian-ethnographer Hallam Stevens frames this question in his 2013 book Life out of Sequence. I think that Stevens offers an especially useful model for how to investigate the role of Big Data in making authoritative knowledge.
Over the past few years, a rapidly growing number of historians of science and fellow-travelers have turned their attention to Big Data. A sample of this scholarship includes special issues of Studies in History and Philosophy of Biological and Biomedical Sciences and the (open-access) International Journal of Communication, as well as a working group at the Max Planck Institute in Berlin whose organizers have hosted conferences and are editing a future volume of Osiris on the subject of "Historicizing Big Data." Dozens more conference panels, articles, and books have come out recently or are in the works. Related projects on the use of lists and of archives in science have illuminated connections between the study of scientific collecting and of scientific data.
What kinds of questions are these historians asking? Some have asked: "When did Data become Big?" (Here, e.g., is a short answer by Patrick McCray, and a long answer by Paul Edwards.)
However, a number of other historians - the majority, I suspect - have framed their questions as skeptical inquiries into the novelty of Big Data. David Sepkoski, for instance, argues that nineteenth-century paleontologists transformed the fossil record from a set of material specimens into a random-access textual record and developed novel graphical techniques of presenting its contents quantitatively. This situates two of the transformations most commonly associated with the rise of contemporary data-driven biological sciences in a much longer history.
Stefan Müller-Wille and Isabelle Charmantier show how Linnaeus himself tackled a quantity of data that he deemed too great for the technology he had at hand. In the process of developing a new medium for storing his data, he produced a new kind of scientific object, the genus. So the emergence of new kinds of objects and knowledge through the management of data, another supposed hallmark of Big Data, is also precedented.**
Bruno Strasser flips this script, presenting the production of twentieth-century biological databases as the extension of natural historical collecting practices. He argues that biological collecting didn't just disappear between the rise of experimental biology and the Human Genome Project, though it did decline in prestige for a time.
For Sepkoski, Müller-Wille, and Charmantier, then, the question is "What is 'Big Data'-like about old science?" For Strasser, the question is "What's old about Big Data?" Both approaches are useful means for providing historical context for aspects of Big Data, a topic of present concern.*** This is what historians do, and we should keep doing it.
Stevens' object of analysis is not the culture of Big Data in general but the particular "data culture" of contemporary biology. He studies it by following biological data through different spaces, observing how scientists and technicians produce, manipulate, and care for it.
Stevens defines "data culture" by analogy with material culture. The opposition that this implies is a bit misleading. As Stevens explains:
It is not that life "becomes" data, but rather that data bring the material and the virtual into new relationships. Following the data provides a way of locating, describing, and analyzing those relationships; it allows us to acknowledge that materiality is never fully erased, but rather, the material of the organism and its elements is replaced with other sorts of material: computer screens, electrons, flash memory, and so on. (8)
We might call this property the heterogeneous materiality of data.
This expansive conception of both data and materiality takes Stevens into distinct spaces in each of his chapters - the well-appointed twenty-first-century-biotech "front" of the Broad and its production-floor "back"; flow-charts mapping lab setups and highly-choreographed processes for updating bioinformatic websites; the internal structure of databases and protocols for connecting them; nearly-haptic encounters between biologists and graphical representations on their computer screens. Stevens carefully investigates how and by whom data are produced, transformed, and used in each of these distinct spaces. In doing so, he draws out how the constraints particular to each space - whether database fields or DNA microarrays - shape biological data and the making of biological knowledge.
Stevens' approach yields telling insights. For example, supervisors at the Broad Institute make use of complex analytical tools from industrial and management research to maximize the productivity with which their ensemble of workers and instruments produces sequence. Scientific work product is notoriously difficult to measure quantitatively, especially in a manner by which researchers themselves will assent to be judged. The introduction of a directly measurable product - nucleotides - makes possible profound transformations of the organization of work within a scientific institution.
Though contemporary bioinformatics provides especially good subject matter for Stevens' approach, many aspects of it could prove useful to others studying data's past and present. Here are a few of the lessons that I've learned from Life out of Sequence about how to ask questions about Big Data.
- Take the oft-cited heterogeneity of Big Data and extend it to the material world in which that data is produced and used. Attend to what I'm calling the heterogeneous materiality of data - that is, the opportunities and resistance afforded by the different objects and media gathered under the genre of data.
- Study a particular but diverse data culture. Michael Mahoney called for historians to "embed computing in the histories of the fields that took up computers" - we should do the same in studying data.****
- Mark off the various spaces in which data is produced, transformed, and used, and attend to what happens in each of them. This means making careful judgment calls about where to draw lines around particular spaces - neither buildings, nor institutions, nor data management systems have self-evident boundaries. But only through marking off such spaces can we get at the different material forms and constraints that characterize data in each of them.
- Remember, first, that data is a product as well as a representation, and, second, that the same systems that produce the data also produce the capacity to treat it as a representation of some bit of the world. Both the production of data and the technology and habits that allow us to forget that data is produced deserve close attention.
Of course, Stevens' approach also imposes limitations. Ethnography is a great way of getting at the distinctive features of a community, but provides limited opportunity for introducing critiques other than those voiced by the subjects of the ethnography themselves. When the community in question has the power and prestige of a lavishly funded biotech research institute, that's a problem. (Stevens recognizes this, and flags moments where further critical engagement seems called for.)
But any way of asking a question is a choice of many questions not to ask. Stevens' question - call it "How is Big Data produced, transformed, constrained, moved around, and used in the spaces of bioinformatics?" - sets us on the path toward material answers to what, if anything, is new and special about the "three V's" - volume, velocity, variety - by which "Big Data" is often breezily defined.
* In defense of Jurgenson's claim that Big Data is a cultural ideology emanating from Silicon Valley, however, I offer this quotation by a senior research scientist at Google:
"Historically, most decisions — political, military, business, and personal — have been made by brains [that] have unpredictable logic and operate on subjective experiential evidence. “Big data” represents a cultural shift in which more and more decisions are made by algorithms with transparent logic, operating on documented immutable evidence. I think “big” refers more to the pervasive nature of this change than to any particular amount of data."
**It's worth noting that this perspective is consistent with how many people in "Big Data" define their field. Several of the participants in a "What is Big Data?" survey by the Berkeley School of Information responded that Big Data is data that can't be handled using "traditional" methods, but that requires novel technology and specific expertise. Some made a point of noting that the definitions of "traditional," "novel," and "specific" are moving targets; yesterday's Big Data is today's traditional method.
***Drawing upon the analytical language of biology itself, Strasser refers to comparisons, on the one hand, and genealogies, on the other, that link historical and present-day data practices as "analogies" and "homologies," respectively. This raises all sorts of questions about the relationship between data and living organisms that I'm not going to tackle here (but Stevens has some things to say on the matter in the conclusion to Life out of Sequence).
****This includes the fields of computational science and data science, subcommunities of generalists with peculiar features of their own.