On the Boundaries of Data Science

This post is precipitated by reading Stephanie Hicks and Roger Peng’s paper on Elements and Principles for Characterizing Variation between Data Analyses (and I was reminded to post this by their follow-up Design Principles for Data Analysis). The reaction engendered isn’t specifically about their paper per se; there is a collection of similar attempts to discuss the broader practice of data analysis. See, for example, Bin Yu and Karl Kumbier’s Veridical Data Science, or, from some time ago, David Donoho’s 50 Years of Data Science (there are many more). These are all important contributions, attempting to formalize (to the extent possible) our process of data analysis beyond just formal statistics and improve the reliability and reproducibility of the results that we gain thereby. I in no way wish to disparage these.

Rather, I have a very specific bone to pick: the conflation of Data Science and Data Analysis. Hicks and Peng are careful to discuss their model in terms of data analysis, but nonetheless motivate it by a discussion of the field of data science and the need to escape a definition given solely in terms of an intersection of sub-fields. They then proceed to treat their principles as a means of achieving this.

My reaction to this is that defining data science in terms of data analysis is far too narrow. Data analysis is, of course, a hugely important part of data science, and it is a fairly natural focus for statisticians: it’s still focussed on distilling knowledge from data. However, I suspect that knowledge extraction is still the minority in data uses and problems; data services in social media, finance, commerce, and who knows what else would seem to be left out of this (How do we store and access account data efficiently? How do we automate recommendations?). Automated systems, from automated telephone trees to self-driving cars, automated auctions for online ads, and of course the original search and curation services of Google and Yahoo, are all ignored by this focus.

Is it that these problems are left unstudied? Very clearly not! And between topics on databases, high-performance computing, machine learning and signal/image processing you probably capture most of what I discuss above. But they do tend to be absent from many statisticians’ accounts of Data Science. This is a shame: I expect that statisticians could have useful things to contribute to many of these topics (some of them have reinvented statistics for themselves) and would be well served by having some background in how the data get to them (this includes me). I’d regard any program in data science that doesn’t include a course on databases, or network transfer protocols, to be distinctly deficient, but suspect that might include most of them.

The particular topics above can be readily included by simply expanding the union of subdisciplines. That won’t necessarily generate more cross-disciplinary talk, even if I wish it would. But are there topics in the intersection that genuinely haven’t been dealt with as academic disciplines? My perspective is necessarily limited by my own lack of background in the more engineering side of computer science, but here are a few:

  • Algorithm ethics/fairness/recourse. This, of course, is already a going concern, but stands out as a poster-child for disciplines spawned by the data revolution.
  • Data plumbing, or the interaction of databases, networks and services. There are certainly a lot of people doing this (my thanks to an old high school friend for this self-description) and I confess to not being close enough to understand the landscape, other than that it doesn’t talk to statistics much.
  • Data cleaning, which certainly gets done a lot but for which a formal theory is clearly lacking. I’m grateful to Peter Bickel for pointing me to this problem.
  • Data analysis, as a social as well as scientific activity: what do people do, how do they interact with practitioners and data collectors, what influences resulting practice?

That is, data analysis, while certainly a core part of data science, is still only one part of it. It’s absolutely true that it is employed in conjunction with, and often about, other components. It is also the part most fundamentally connected to statistics and this makes it a very natural focus for statisticians. But to envisage the definition of Data Science as bounded solely by data analysis is highly self-limiting.

Hicks and Peng are careful to say that they are describing data analysis, rather than all of data science, and their paper performs a very useful service. But their discussion does tend to read as though the two are equivalent. In that sense, this post should be taken as a general reaction to statisticians’ reactions to data science rather than to theirs in particular. Statisticians will of course tend to focus on data analysis (as Hicks and Peng do) and I’m not sure that I want to require an obligatory “of course data science also includes…” section, but more frequent acknowledgement of the broader discipline would both improve statistics and increase its (much needed) influence in how everyone deals with data.
