What is Big Data?
Lots of people are asking and answering this question. (This blog included.) Scholars in various fields, including the history of science and technology, have begun to tackle it as well.
For good reason. Gobs of private and public funding, influence over policymakers, civil liberties, the future of the planet and the people on it, and, oh yes, the practice of science are (reputedly) at stake.
The most familiar answers to "What is Big Data?" fit into three categories:
1. It's going to change the world.
2. It's going to ruin the world.
3. It's actually nothing new. (Does not exclude #2.)
These are very different answers, but they share a tendency to praise or damn Big Data without giving a satisfactory sense of what exactly Big Data is. Today and tomorrow, I'm going to contrast two approaches to making sense of this category. One tackles Big Data directly; the other goes looking for Big Data in particular sites and actors. As the late Mike Mahoney wrote of the history of computing, the most effective way to answer "What is Big Data?" is to look at particular people, disciplines, institutions, and practices.
|"What is Big Data?" answered in word cloud form, based on the responses of forty "thought leaders" tapped by Berkeley's program in data science. Their full answers: http://datascience.berkeley.edu/what-is-big-data/|
Big Data is a "cultural ideology." That's the answer that sociologist Nathan Jurgenson offers in an essay published a few weeks ago in The New Inquiry. In this essay, Jurgenson neatly weaves together a few threads of the emerging humanistic critique of data utopianism.
First, the promises of Big Data aren't anything new. Rather, they're the latest version of the "rationalist fantasy that enough data can be collected with the “right” methodology to provide an objective and disinterested picture of reality": the old positivist dream of a universal social physics based on patterns in quantitative measurements. The big-data experts may aim to describe society through a bunch of distinct regularities rather than a few master equations, but the totalizing ambition remains the same.
Second, Big Data's bigness generates a dangerous combination of interpretive flexibility and rhetorical certainty. Skillful (or unwary) data scientists can cherry-pick conclusions while dazzling the masses and policy-makers with claims of statistical certainty. Writes Jurgenson, "Big Data can be used to give any chosen hypothesis a veneer of science and the unearned authority of numbers."
Third, it's all just a means of legitimizing and extending existing power relations. Claims of "the end of theory" always constitute efforts (knowing or not) to universalize a particular theory. Views from nowhere "must be unmasked as a view from a very specific and familiar somewhere": in this case, from the eyes of a 23-year-old programmer with libertarian leanings and a dearth of life experience living in the Bay Area, or a CEO in his late 30s seeking to make the business of collecting and analyzing data look like a public service and a lucrative business at the same time. In the process, purveyors of big data reinforce the social categories that structure the data that they collect, and the relationships implicit in these categories.
|Collecting data can reinforce social categories.|
These are all very reasonable critiques, but I wonder whether Jurgenson is aiming at the correct target. For instance, Jurgenson singles out Dataclysm, the popular recent book by OkCupid President Christian Rudder, as the exemplar of the "cultural ideology" that he critiques. In passing, Jurgenson mentions "similarly inferential sciences like evolutionary psychology and pop-neuroscience" and their popularity, along with Big Data, in mass-market social science.
I suspect that the cultural ideology that Jurgenson wants to critique is more a matter of genre than method. Big Data, evolutionary psychology, and neuroscience (overlapping categories, to be sure) can all be deployed in making sensational, seemingly well-evidenced claims about everyday life, the grist of bestsellers and magazine articles. But if it would be improper to draw a bright line between a field of scientific practice and its presentation in popular media, it would be equally misleading to assume that we can understand the essence of the former by looking at the latter.
Jurgenson notes that there are examples of responsible, thoughtful research with large data sets, singling out the Data & Society Research Institute as an example. Nevertheless, he writes, "the positivist tendencies of data science — its myths of objectivity and political disinterestedness — loom larger than any study or any set of researchers, and they threaten to transform data science into an ideological tool for legitimizing the tech industry’s approach to product design and data collection."
If you scan the bookshelf from Freakonomics to Dataclysm and then check out Wired and watch a TED Talk or two, that conclusion may be unavoidable. But if our aim is to answer the question "What is Big Data?" from the (situated!) perspective of the history of science and technology, it will take a different approach. This requires setting aside our "digital dualism" (a useful term coined by Jurgenson) to piece together the material systems and embodied practices that constitute distinctive ways of using computers and data to generate scientific knowledge.
In other words, we should keep in mind:
Tomorrow, I'm going to turn to a book that exemplifies how to ask and answer situated but broadly significant questions about the production and use of large datasets: Hallam Stevens' Life out of Sequence.
I love how you've framed this, Evan. In particular, I think it helps draw attention to where the genre question intersects with answer 3 to "What is Big Data?" Even just looking at recent history, there's a huge difference between sociologists, geographers, epidemiologists, and others who build and work with "large" "complex" "datasets" (as the word-cloud characterizes big data), and those who work with "Big Data" (capital-B, capital-D). It makes me think of some responses from chemists to the growing vogue of "nanotechnology" (which they often insisted was just chemistry by another name), and how that response changed when lucrative grants and institutes for nanotechnology became increasingly widespread. Historically, one might look to a similar relationship between proponents of social physics and the bureaucrats, colonialists, and others in whose tables the social physicists found the regularities that justified their project. So: genre, but not mere genre, interacting with method.
Thanks, Michael! For the record, "just important world now" are my favorites from the word cloud.
Another book - Lisa Gitelman's _Paper Knowledge_ - got me thinking about genre. Gitelman makes the point that genres often make more meaningful critical categories than media. If we look at how people bring their experience to bear on engaging with, say, a PDF file, we'll find that it's a lot more like a printed checklist than it is like an MP3 file, or even a word-processing document.
I'm not sure that it's meaningful to count Big Data as either a "genre" or a "method," though we can surely tease out various examples of both that are captured under the capacious umbrella of the term. On the other hand, I think that Dataclysm and Big Data (the book) and Freakonomics and The Tipping Point and Predictably Irrational and The Black Swan and The Innovator's Dilemma do constitute a literary genre. (Which is not to say that they all make the same kind of argument or rely on the same kind of evidence or are equivalently valuable - just that you wouldn't be surprised to see them next to each other on the shelf.)
Anyway: one history of Big Data is certainly "just" a history of people working with large complex datasets - that way of framing the question will emphasize continuity. And that's an important story. But there are other equally important angles. More on all this tomorrow.