What You Could Do with the Shapley Computation

Ok, so after the last post, I really ought to tackle “but what do you think would be better than SHAP?”. In particular, it seems to me an enormous waste to go to all that computation just to produce a single number; could you get more information out of it?

Here is a quick experiment. I’d already run a computation obtaining Shapley values for dropping features from a prediction set. In particular, I used a subset of the Beijing Housing Data with the task of predicting log totalPrice, and looked at Random Forest OOB squared error as my value function when the forest was trained with different subsets of features.

Here I used 100 permutations, and for each feature I attributed to it the change (almost always a decrease) in OOB squared error when it entered the model in each permutation.** After I removed features that were annoying to deal with, I had 18 features, so I ended up with a 100-by-18 matrix of OOB-SSE changes, one entry per permutation and feature.
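For concreteness, the computation might look something like the sketch below. This is not the code I actually ran; the scikit-learn forest settings and helper names are illustrative, and with 18 features and 100 permutations it amounts to 1,800 forest fits, so it is not cheap.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def oob_sse(X, y, features, n_trees=500, seed=0):
    """OOB sum of squared errors for a forest trained on a subset of features."""
    rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                               random_state=seed, n_jobs=-1)
    rf.fit(X[:, features], y)
    # oob_prediction_ holds each observation's prediction from trees that did not see it;
    # n_trees should be large enough that every observation is out-of-bag at least once.
    return np.sum((y - rf.oob_prediction_) ** 2)

def permutation_value_matrix(X, y, n_perms=100, seed=0):
    """Return the (n_perms x p) matrix of OOB-SSE decreases as each feature enters,
    along with the orderings used (needed for the precedence analysis later)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    V = np.zeros((n_perms, p))
    orders = np.zeros((n_perms, p), dtype=int)
    total_ss = np.sum((y - y.mean()) ** 2)      # the "model" with no features at all
    for b in range(n_perms):
        order = rng.permutation(p)
        orders[b] = order
        prev_sse = total_ss
        for k in range(p):
            sse = oob_sse(X, y, order[:k + 1])
            V[b, order[k]] = prev_sse - sse     # decrease in OOB error when this feature enters
            prev_sse = sse
    return V, orders

# Shapley estimates are just the column means; the full matrix V is what gets mined below.
# V, orders = permutation_value_matrix(X, y)
# phi_hat = V.mean(axis=0)
```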

The Shapley value estimate is just the column average of that matrix (one value per feature), but I think we can see how there might be more information here. In particular, part of the information that I thought was useful was dependence between features, along the lines of “You need one of size and number of rooms, but not both”, or “Whether there is an elevator doesn’t tell you much unless you know the number of bathrooms” (although I still haven’t worked out what’s going on with that last one). Here we can see some of this from the order in which the features enter the model.

In particular, for each feature, I created a new 100-by-17 matrix with the Hadamard indicator of precedence of the other features. That is, when I was looking at the importance of X1, say, I went through each permutation and recorded whether X2 came before or after X1: 1 if before, and -1 if after.

I then fit a linear regression of the X1 column of my OOB-change matrix on this new precedence matrix. A positive coefficient for the precedence indicator of X2 indicates that X1 appears to make more difference when X2 is already in the model; a negative coefficient indicates that X1 doesn’t add as much information when X2 is already in the model compared to when it isn’t. In this case, we have the following:


feature        coefficient
Intercept           0.0231
Lat                -0.0023
DOM                 0.0006
followers          -0.0018
square             -0.0004
livingR            -0.0013
drawingR           -0.0012
kitchen            -0.0013
bathR               0.0009
buildiType          0.0009
construction        0.0006
renovation         -0.0022
Structure          -0.0025
ladderRat          -0.0007
elevator           -0.0017
fiveYear           -0.0027
subway             -0.0031
district           -0.0084

If we did all possible permutations, symmetry arguments show that the intercept in this model should equal the Shapley value that we calculated. It won’t match exactly, but the correlation between the two was 0.987, which isn’t bad; the Shapley value for X1 here is 0.026.
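In code, the precedence regression might look something like this sketch, where V and orders are the permutation-by-feature matrix and the orderings from the sketch above (again illustrative, not the code I actually ran):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def precedence_regression(V, orders, j):
    """Regress feature j's column of OOB-error changes on +/-1 indicators of
    whether each other feature preceded it in each permutation."""
    n_perms, p = V.shape
    others = [k for k in range(p) if k != j]
    Z = np.zeros((n_perms, p - 1))
    for b in range(n_perms):
        pos = np.argsort(orders[b])          # pos[k] = position of feature k in permutation b
        Z[b] = [1.0 if pos[k] < pos[j] else -1.0 for k in others]
    fit = LinearRegression().fit(Z, V[:, j])
    # By symmetry, with all permutations the intercept would equal feature j's Shapley value;
    # the coefficients say how much more (or less) j contributes when each other feature
    # is already in the model.
    return fit.intercept_, dict(zip(others, fit.coef_)), fit.score(Z, V[:, j])
```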

Of course, this analysis only allows for main effects and wouldn’t tell us something like “you need two out of three of X1, X5 and X9”, but we can get an idea of how much a linear model tells us about the importance of X1 in the model.

I’ve summarized all of this information in the graph below. Here, nodes correspond to features and an arrow connecting X1 to X2 indicates the effect that the presence of X1 has on the importance of X2. The thickness of the arrow indicates the strength of the effect and the color gives its direction (red if negative, blue if positive). I’ve annotated each node’s name with the R-squared of the corresponding linear model.

[Figure: ShapDiag — the feature-precedence effect graph described above]

Clearly there is a lot to do here still. I thresholded effects at a rather arbitrary 0.003; one might instead use a LASSO penalty as in LIME (although see this paper for cautions). And a lot of the R-squared values are fairly low, indicating that something might be gained from higher-order interactions, although you’d need a new way of visualizing those. Most importantly, this graph is very unstable — repeat the calculation and you’ll get something that looks rather different. But perhaps there is potential.

Note that I’ve employed this in the service of global interpretation — how important are my features to the model as a whole. I’m not sure that this would be as helpful at local scales and I still find the value function that SHAP uses to be far too abstract to be a useful explanation method. But perhaps someone can produce a convincing use case.

** OK, technically, my value function is total SS minus Error SS, but it works out to the same attribution.

SHAP is the Blockchain of xAI

An unexplainable, computationally-costly, buzz-algorithm*** no-one needs.

Ok, maybe this is a little strong, but I have rolled my eyes at enough papers that I think I need to say some things about Shapley values. If you haven’t been paying attention (and for a few years, I wasn’t), SHAP — a particular form of Shapley values — has become a go-to method for explaining the predictions that result from Machine Learning methods. While I appreciate the motivations of the original paper, I think Shapley values are a sub-optimal choice for this purpose. They are also about the most computationally-intensive choice that you could design. It’s therefore somewhat mysterious that they keep being used.

I put this down to over-indexing on axiomatization: Shapley values uniquely satisfy a set of desiderata, and the mathiness of their formulation blinds otherwise-sensible computer scientists to the question of whether they are actually helpful. This is not dissimilar to the way in which blockchain algorithms purport to provide anonymity as well as security, leading to their adoption even when they might not be the optimal solution, hence my title. The two also share the trait of not being readily explainable, or often explained, to outsiders.

In fact, SHAP encompasses two ideas: Shapley values as a general method of valuing members of a set, and the particular way that they are used to produce explanations. I think both the general application of Shapley values to xAI, and the particular choices of the SHAP methodology are problematic. I’ll try to explain both fairly, along with my critiques below.

Shapley Values

Shapley values are a long-standing idea (dating to 1953) about assigning values to the members of a coalition. They are explained in lots of places (Wikipedia is my first go-to) but I want to start by motivating their use from the point of view of one particular problem in machine learning diagnostics (and not the problem that SHAP looks at).

How do we assign a value to an input feature? We might want to do this for screening purposes, or to prioritize features for additional investigation, or maybe just to give ourselves a sense of control. A really easy metric is to say,

What if I learned F_0(x) with all the features, and F_1(x) with everything except x_1?

We might then look at the change in predictive ability when you remove x_1 from the data set: V_1 = \sum_i \left[ L(Y_i, F_1(X_i)) - L(Y_i, F_0(X_i)) \right]. That is, how much does my error rate increase when I remove x_1 and learn a new F without it?

This is, in fact, a very reasonable (and common) thing to do, but you do need to worry about the following: what if x_1 and x_2 are nearly interchangeable? Maybe one measures my height and the other is my trouser inseam. I can predict one from the other nearly perfectly, which means that if I remove height from the model to get F_1(x), it can use inseam to compensate and do nearly as well, and vice versa. That means that neither x_1 nor x_2 will look important under this metric, even if removing both would be very harmful.
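A toy simulation makes the point; this is purely illustrative (cross-validated MSE stands in for the loss, and two strongly correlated predictors stand in for height and inseam):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 1000
height = rng.normal(size=n)
inseam = height + 0.05 * rng.normal(size=n)    # nearly a copy of height
junk = rng.normal(size=n)                      # an irrelevant feature
X = np.column_stack([height, inseam, junk])
y = height + 0.5 * rng.normal(size=n)

def cv_mse(cols):
    rf = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
    scores = cross_val_score(rf, X[:, cols], y, cv=5, scoring="neg_mean_squared_error")
    return -scores.mean()

full = cv_mse([0, 1, 2])
print("drop height:", cv_mse([1, 2]) - full)   # small increase: inseam compensates
print("drop inseam:", cv_mse([0, 2]) - full)   # small increase: height compensates
print("drop both:  ", cv_mse([2]) - full)      # large increase: the signal is gone
```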

You can also come up with the reverse situation: x_1 might not be very helpful unless we also know x_2; to predict obesity I need both height and weight, and one by itself isn’t very useful.

So how to deal with this? What I need to do is to look not just at dropping single features, but also dropping pairs, and triples, and so forth. This is where Shapley values come in. Specifically, Shapley values are calculated the following way**** (a code sketch follows the list):

  • put the features in some order (not the ordering they came in)
  • going along the order, learn a model using the features up until each position in the order
  • x_1 is given a value corresponding to the change in error when it is added to those features that came before it
  • now average the values over all possible ways of ordering the features (in practice, choose a set of orderings at random) to give Shapley value \phi_1.
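As a minimal sketch of this sampling calculation, for an arbitrary value function v over feature subsets (nothing here is specific to any particular model or loss):

```python
import numpy as np

def shapley_estimates(v, p, n_perms=200, seed=0):
    """Monte Carlo Shapley values for features 0..p-1, where v maps a tuple of
    feature indices to a scalar value (e.g. the skill of a model using those features)."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(p)
    for _ in range(n_perms):
        order = rng.permutation(p)
        prev = v(())                               # value of the empty feature set
        for k in range(p):
            cur = v(tuple(sorted(order[:k + 1])))  # sorted so v can cache subsets
            phi[order[k]] += cur - prev            # marginal contribution in this ordering
            prev = cur
    return phi / n_perms
```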

Here, sometimes height will come up before inseam and will make a big difference to predictive skill; sometimes it will come up afterwards and will make very little difference (but inseam will make a big one), and these two cases get averaged over orderings. Note that there is a lot of computing work to do here, hence the need to just look at a random sample of orderings (and even then it’s a lot of work).

Why is this the right thing to do? (Note that it’s different to averaging over the possible sets of features that might have come before.) This was derived by Lloyd Shapley in a 1953 paper for a more general problem in valuing coalitions. Here we suppose we have players x_1,\ldots,x_p and can produce a value v(x_S) for any subset S of the players, and we want to find a way of working out a value \phi_i for each individual player i.***** This maps directly onto the problem above: v(x_S) being the error rate for a model learned only with x_S.

Shapley wrote down a series of reasonable desiderata for any such individual value \phi_i and demonstrated that the calculation above is the only way of satisfying these. Indeed, Shapley values are mostly motivated by starting from a set of axioms that the \phi_i should satisfy, which I suspect contributes to their mathematical dazzle.

As a general proposition, why am I unexcited by this? Firstly, because I don’t think it’s very much more illuminating than the value you get from “What happens if you drop x_1?”; it’s just different. In fact, it tends to smooth out importance, making differences in importance between features look smaller. What it doesn’t tell you is the relationships between features: “We need x_1 or x_2 but not both”, or “x_1 doesn’t add much unless you also have x_2”. The worst part about this is that you could get that information out of the calculation used to obtain Shapley values, but it gets thrown away!

Secondly, however motivated, Shapley values dictate how you weight the improvement from adding x_1 by the size of the feature set you add it to. Frankly, I don’t care in the slightest what happens when you add x_1 to a model that only uses x_2, and care a lot more about what happens when you add x_1 to a model that uses most of the features. Shapley upweights the effects for both very small and very large feature sets, while I might want to just look at the latter.

What about SHAP in particular?

Shapley values as used in Machine Learning make particular choices about the value function to use. In particular, rather than looking at what contributes to predictive accuracy, SHAP is based around explaining individual predictions. How is this done? SHAP looks at the prediction that would have been made if we only knew x_S, calculated from v(x_S) = \int F(x) p(x_{-S}) \, dx_{-S}, which integrates out all the features not in the set S. Various choices can be made about what distribution to use, but I’d argue the only really sensible one is the conditional x_{-S}|x_S.
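For what it’s worth, the marginal (interventional) version of this value function amounts to something like the sketch below; this is an illustration rather than any particular package’s implementation, and the conditional version I’d argue for would instead need a model of x_{-S}|x_S:

```python
import numpy as np

def shap_value_function(model, x, S, background):
    """Approximate v(x_S) = E[F(x)] with x_S held fixed, by pinning the features in S
    at their observed values and averaging the model output over a background sample
    for the remaining features (the marginal choice; conditioning on x_S would require
    a model of x_{-S} | x_S)."""
    S = list(S)
    X_synth = np.array(background, copy=True)   # background rows supply the x_{-S} values
    X_synth[:, S] = x[S]                        # pin the known features at their observed values
    return model.predict(X_synth).mean()
```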

So what does the Shapley value for x_1 represent? It’s how much you change your prediction by being told the value of x_1, averaged over the set of possible features whose values you might already know.

Well that sounds ok! Why don’t you like it (beyond the objections above)? My answer here is “What am I now supposed to do?” First of all, I already know x_1! So telling me what that information changed is pretty artificial. I might want to know how sensitive F(x) is to the value of x_1 (e.g., it tells me how to change the prediction most quickly), but that looks more like LIME. And if you want to explain it to a lay-person, we’re talking in pretty abstract terms.

I think this particular choice of value function was motivated by one of the core Shapley desiderata: the Shapley values should add up to the value for the full set, v(x) = \sum_{j=1}^p \phi_j. So here we are breaking F(x) out into a sum of contributions, one from each feature. There’s some appeal here, but note that this is not a representation as an additive model — each of the \phi_j is itself a rather complex function of all the features. To be actually fair, we should write F(x) = \sum_j \phi_j(x).

Add to this the computational effort in, first, evaluating the integral to calculate v(x_S) and, second, averaging over a set of orderings (and that’s without ensuring that you get stable answers) and you end up with a number that is:

  • Difficult to explain without a lot of technical terminology
  • More computationally costly than just about any alternative
  • Frequently mistaken for meaning something that it doesn’t

The perfect analogue of blockchain!

And just like blockchain, there are some legitimate uses — in this case mostly in aggregating up to something that looks more like global values — but this is far narrower than the way it is currently used.

And what about those axioms?

Shapley values get a lot of their appeal by satisfying a set of desiderata, and I guess I should nominate what I think is irrelevant. The four criteria (expressed as values for features) are:

  • null sets: if a feature adds nothing in addition to any set of features already in the model, it gets Shapley value 0
  • additivity: the Shapley values for the sum of two value functions are the sums of the corresponding Shapley values
  • symmetry: if x_i always contributes the same as x_j to any coalition, they have the same Shapley value
  • efficiency: the Shapley values add up to the value of the whole set.

The first two of these are very reasonable, and by and large the third is, too. But I see no reason that the fourth needs to hold, either for global explanations or to explain particular predictions. It’s fine to have an importance measure that describes changes to a function rather than its whole value! (If a value function is even the right way to think about this.) Unfortunately, unlike the first three desiderata where removing them has interesting effects, removing the last rather creates open slather: we need to think again about what we want an explanation to mean, and usually for what purpose.

*** buzzgorithm?

**** I’m deliberately presenting the calculation first, then the justification, because I want to understand whether something is helpful before being blinded by the axioms.

***** The motivation is sometimes that we value players in a sports team, but then small coalitions don’t make a lot of sense. Political coalitions may be a bit more helpful.

Reproduction Intervals and Uncertainty Quantification in ML

As usual, my blogging is like my personal correspondence — far too infrequent for my mental health or my audience’s engagement. For anyone still out there, thank you for following!

This is a quick exploration of exactly what uncertainty quantification in machine learning should look like. When Lucas Mentch and I first developed uncertainty quantification methods for random forests, one thing that we had real difficulty with was the fact that our theoretical results guaranteed that confidence intervals would cover the expectation of the predictions of an RF, rather than some underlying truth such as E(Y|X). That is, we could not say that the bias in our models was small enough to ignore when performing inference. What were we really learning if we couldn’t relate what we got to real-world quantities?

I always justified this by saying that we were really interested in using the ensemble structure in random forests to develop theory. (And indeed this was/is true!) If you want to control bias, you need to analyze the ensemble members very carefully. That’s just what Susan Athey and Stefan Wager did in their seminal CLT paper, with some very nice mathematical arguments.

But to be honest, even if you use the sort of mechanisms that Wager and Athey advocate, it’s not hard to find bias in simulation. Under the right conditions it might disappear asymptotically, but in finite samples you might still be getting the wrong answer. This is, frankly, true of just about any ML method you care to use, and of most more classical nonparametric smoothing methods as well (that is, you can undersmooth to avoid bias, but in practice the finite sample statistical properties are still not great).

Now I still think that there is some utility in providing uncertainty quantification, even if it isn’t in the form of formalized inference. However, we do need to distinguish what it is that an uncertainty interval is trying to capture. For this, I really like the idea of reproduction intervals that my student Yichen Zhou came up with. Indeed, they seem so natural that it has taken me up until now to realize that they really require some publicity. So far, we’ve only discussed them in the simulation section of this paper.

The idea can be captured rather simply. Traditional confidence intervals for some parameter \theta and data X are given by I(X) = [\hat{\theta}(X)-l, \hat{\theta}(X)+u], where \hat{\theta}(X) is an estimator of the unknown \theta and the end points of the interval are defined so that I(X) includes \theta for 95% of all X. The intervals that Lucas and I produced covered E \hat{\theta}(X) 95% of the time.

What Yichen suggested is to avoid asking about covering an underlying truth. Instead, we just want to know “How different might your answer be with new data?” That is, we want I'(X) so that \hat{\theta}(X') is in I'(X) 95% of the time, where X' is a new set of data independent of X. There are a number of things I like about this:

  1. It makes explicit that we are interested in stability rather than inference. The term reproduction interval comes from asking about reproducing an estimate with new data.
  2. While it doesn’t quite do away with the mental effort of imagining a universe of alternative data sets, it does feel more natural. It’s not quite right to say that for this particular X, I'(X) contains 95% of possible \hat{\theta}(X') — you do have to take a distribution over both X and X' — but just doing the X' part gets you pretty close to what we are aiming at.
  3. It’s computationally no harder than standard confidence intervals. The ideas in the Boulevard paper were based around an asymptotic normal result where if we have that \hat{\theta}(X) \sim N(\mu,\sigma^2) then

    \hat{\theta}(X) - \hat{\theta}(X') \sim N(0,2 \sigma^2)

    and 95% reproduction intervals are given by \hat{\theta}(X) \pm 2 \sqrt{2} \sigma — these are simply \sqrt{2} times as large as standard confidence intervals.

    It’s also pretty easy to provide bootstrap intervals. These are always symmetric because of the symmetry between X and X', and we can simply find the 0.95 quantile of the absolute difference between any two bootstrap estimates, Q = q_{0.95}(|\hat{\theta}(X^b) - \hat{\theta}(X^{b'})|), and form \hat{\theta} \pm Q.
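A sketch of both versions, with all names illustrative:

```python
import numpy as np

def reproduction_interval_normal(theta_hat, sigma, level_z=2.0):
    """If theta_hat ~ N(mu, sigma^2), then theta_hat - theta_hat' ~ N(0, 2 sigma^2),
    so the reproduction interval is sqrt(2) times wider than the usual CI
    (level_z ~ 2 for 95%, as in the post)."""
    half = level_z * np.sqrt(2) * sigma
    return theta_hat - half, theta_hat + half

def reproduction_interval_bootstrap(theta_hat, boot_estimates, level=0.95, seed=0):
    """Bootstrap version: theta_hat +/- the level-quantile of the absolute difference
    between two independently drawn bootstrap estimates."""
    b = np.asarray(boot_estimates)
    rng = np.random.default_rng(seed)
    i, j = rng.integers(0, len(b), size=(2, 5000))   # random pairs of bootstrap replicates
    Q = np.quantile(np.abs(b[i] - b[j]), level)
    return theta_hat - Q, theta_hat + Q
```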

These ideas started from work in this paper, where we asked how large a data set we would need to consistently select the same split when building a decision tree (in our case we were generating the data, so this was possible), which we cast as a power calculation problem. There we really were asking “What would happen with a new data set?”, but that turns pretty naturally into confidence intervals.

Now I wouldn’t want to claim this is all we should do. Bias matters for inference in lots of situations, and providing genuine confidence intervals is important. However, I do think that these at least provide some useful information, particularly in areas like machine learning. One might think of them as fitting neatly into Bin Yu’s PCS framework. Possibly (probably?) they should be thought of as a second-best option, but you could make a case that all our models are biased anyway and this is just being honest.

Now the question is how one publicizes/advocates for it. It really is simple, so I would be surprised if it isn’t already in the literature. I’m very willing to give credit elsewhere if provided with a pointer, but whoever should take credit, I still need a way to say “This is a good idea, you should do it”.

Of Model Selection and a Falsificationist Account of Science

This post is only partially meant as an advertisement for this paper that just came out in Ecology. In truth, I feel both mildly under-qualified to write about model selection (it’s not been a methodological focus) and unsure that, despite my co-authors’ assurances, I really contributed very much. Nonetheless, I think the paper’s prescriptions are statistically uncontroversial — at least for current practice — and the examples are nicely constructed (all credit going to my co-authors in this case). But, that heavy digression aside, I want to use it as a vehicle to discuss the connections between the paper and philosophy of science more broadly.

The paper is written — unsurprisingly — for ecologists, and divides the purpose of a data analysis into three categories: exploration, prediction, and confirmation. These in turn define the (increasingly restrictive) analyses that can be validly employed. This should be fairly familiar to statisticians:

1. Exploration corresponds to exploratory data analysis. You are looking for a good model for the system you study, both in structure and parameters. We might think of this as either a hypothesis generation method, or as what you do when you are looking for a specific effect but want to ensure you specify ancillary parts of your model correctly (see my post on selective inference). This is fairly open slather, with the proviso that no formal statistical inference is going to be valid, a warning about fooling yourself by over-fitting the data, and accompanying advice to keep validation data separated if possible.

2. Prediction is fairly straightforward: you don’t care about the structure of the model, you just want it to do out-of-sample prediction well. Advice here is not too dissimilar to exploration except that holding out a test set becomes mandatory, and we might expect non-parametric methods to be more frequently used when interpretation is no longer a consideration.

3. Confirmation is then formal statistical inference. You need a model and analysis set out pre-data, should undertake minimal exploration beyond checking the validity of model assumptions, and should be careful about testing too many hypotheses, or correct for those tests.

Of course, one cannot pass a single data set down these tasks, although traveling the other way is permissible. These feel like a broadly reasonable categorization of experimental aims, even if, in my consulting experience, scientists aren’t always so clear on exactly what they want to do with data once it’s been generated.

But outside of practice, this hierarchy was largely motivated by the current state of statistical technology. Formal inference methods — particularly those readily available to ecologists — simply don’t account for model selection and thus produce invalid inference. It is certainly true that a considerable amount of statistical effort is currently focussed on addressing exactly this issue; see selective inference or Bayesian model averaging as examples. However, I think it is fair to say that a consensus on best practice has not been reached within statistics, whatever this paper’s referees (who gave differing recommendations) might say.

However, here I want to ask whether these distinctions are also philosophically motivated, as opposed to being the product of current mathematical progress, and consider the challenges that statistical research blurring these distinctions might pose to a philosophy of science.

We can, in particular, map Exploration and Confirmation broadly onto a Popperian view of scientific progress: 1. being the conjectures (and I’ve given K. P. one way they come about — not that he cared) and 3. being refutations.** In a Neyman-Pearson sense, these are refutations directed towards particular alternatives, but I think that’s ok within a falsificationist viewpoint. Much of this blog, particularly the earlier parts, focusses on the problems that Prediction poses for a philosophy of science, but those surround a Prediction-only framework. If prediction is only one goal of many, then it can be treated as more engineering — “As long as the thing works, we don’t care why.” — and we can comfortably ignore its implications for scientific knowledge.***

If we take this mapping seriously, perhaps the dichotomy between exploration and confirmation is more fundamental than how easy the math is. Popper doesn’t rule out using observed patterns to come up with theories, but the idea of falsification implicitly assumes out-of-sample performance — agreeing with known data only counts as induction. This isn’t a perfect match: I read (past tense, longer ago than I’d like) Popper as conceiving of a scientific theory as being globally explanatory with high precision — the physical sciences, where parameters and theories make specific predictions about new experimental conditions, really made up the prototypes of early C20th philosophy of science. Ecological (and other social and biological) sciences have nothing like this level of precision, and statistical methods really only ask “If I repeated this under the same conditions, would I get the same results?” That makes these disciplines verge on the unscientific by Popper’s definitions, and I can rant about the ways in which I think statistical practice exacerbates this another time, but I do want to explore the ways in which model selection, and possibly machine learning, interact with ideas about scientific progress.

I see two points of view: formally incorporating model selection into inference amounts either to automating a falsificationist account of science, or to violating it. On the automation side, we can think of this as simultaneously proposing conjectures and testing them. Assuming that we have testing procedures that give us appropriate frequentist control of error probabilities, we might get something like a list of models that have been rejected by the data and those that remain un-refuted. (In reality, this is a selection out of a set of models limited both by specifying the models to consider and by the search process.) Thus we have automated the tedious work of detailing potential models (“conjectures”) and testing them, allowing scientists to work on a much higher level, specifying what sorts of conjectures to test and with what data.

The alternative viewpoint disputes the appropriateness of equating a model with a conjecture. Instead, by running model selection, you are really making a conjecture about there being some “true” relationship within the class of models that you specify. We can, of course, then at least decide that we have insufficient evidence to support that contention. Note that in doing so we have swapped the roles of the null and alternative hypotheses: our conjecture is that some model explains the data better than none. I guess we could provide evidence of the form that signal to noise ratios can’t be larger than x, or we could declare that the null hypothesis was our conjecture, but both of these seem forced.

More generally, is a falsificationist view of science problematic for being automated? It does feel a little close to “we can always find some explanation for these data”, even if that includes error bounds that account for the search. Part of the question is how specific a conjecture ought to be. The C19th developments in physical sciences that set the paradigms for the falsificationist logic produced predictions that were both numerically precise and widely applicable. Thus a falsification of Newtonian mechanics can involve measuring the deflection of light around the sun in an eclipse — both a new experimental setting and a numerically precise prediction. But can a conjecture be as broad as “There is a relationship between this outcome and these measurements”, or as specific as “If you repeat this experiment under exactly these conditions, you’ll get this answer”? This all starts to make model selection (in the sense of selective inference) feel like it’s still hypothesis generation, in which case the distinction that we made above does start to feel a little more fundamental; and inferential procedures like selective inference are nice to have, but don’t really allow you to combine purposes. They do, then, also make me worry that we would start to drop confirmatory studies altogether.

Ok, so this does get me to my rant: I think statistical methods have encouraged a dangerously unscientific approach to experimentation. This is partly about statistical models that describe relationships in the data, not why they should be that way (I’ve ranted about that before). But it’s also about a focus on evaluating a single experiment under particular conditions, as opposed to saying “we need a model that allows you to numerically estimate a causal relationship in one set of conditions and then transfer it to make predictions in a different system.” You could argue that just replicating an experiment is already challenging — see the ongoing replication crisis — and that numerical correspondence is nigh-on impossible (my ecologist colleagues look petrified when I suggest it). But it’s also true that the incentives aren’t there; the expectation that “this is what you need to do” drives a lot of invention. Is this all statistics’ fault? No, but statistical paradigms do provide a way to avoid transferring mechanisms, and experimental designs that focus on factor-level analysis, partly led by a fear of writing down precise models (since you can always just analyse factor levels), make developing those mechanisms more difficult. A statistical framework that asks “how readily can you transfer the findings of this experiment elsewhere?” would put more pressure on scientists to be more falsificationist. Once again, the machine learning community (for which “transfer learning” is an important topic) may be ahead of the game.

** Ok, Popper is getting pretty long in the tooth here in terms of Phil Sci, but there is still a quality of “why does this work” that’s been refined but not really changed in fundamentals.

*** I’m not sure that we can treat it as only one goal of many given its increasing prevalence in all sorts of places, but for the time being…

On the Boundaries of Data Science

This post is precipitated by reading Stephanie Hicks and Roger Peng’s paper on Elements and Principles for Characterizing Variation between Data Analyses (and I was reminded to post this by their follow-up Design Principles for Data Analysis). The reaction engendered isn’t specifically about their paper per se; there are a collection of similar attempts to discuss the broader practice of data analysis. See, for example, Bin Yu and Karl Kumbier’s Veridical Data Science, or, from some time ago, David Donoho’s 50 Years of Data Science (there are many more). These are all important contributions, attempting to formalize (to the extent possible) our process of data analysis beyond just formal statistics and improve the reliability and reproducibility of the results that we gain thereby. I in no way wish to disparage these.

Rather, I have a very specific bone to pick: the conflation of Data Science and Data Analysis. Hicks and Peng are careful to discuss their model in terms of data analysis, but nonetheless motivate it by a discussion of the field of data science and the need to escape a definition given solely in terms of an intersection of sub-fields. They then proceed to treat their principles as a means of achieving this.

My reaction to this is that defining data science in terms of data analysis is far too narrow. Data analysis is, of course, a hugely important part of data science, and it is a fairly natural focus for statisticians — it’s still focussed on distilling knowledge from data. However, I suspect that knowledge extraction still accounts for a minority of data uses and problems; data services in social media, finance, commerce, and who knows what else would seem to be left out of this (How do we store and access account data efficiently? How do we automate recommendations?). Automated systems from automated telephone trees to self-driving cars, automated auctions for online ads, and of course the original search and curation services of Google and Yahoo are all ignored by this focus.

Is it that these problems are left unstudied? Very clearly not! And between topics on databases, high performance computing, machine learning and signal/image processing you probably capture most of what I discuss above. But they do tend to be absent from many statisticians’ accounts of Data Science. This is a shame: I expect that statisticians could have useful things to contribute to many of these topics (some of them have reinvented statistics for themselves) and would be well served by having some background in how the data get to them (this includes me). I’d regard any program in data science that doesn’t include a course on databases, or network transfer protocols, to be distinctly deficient, but suspect that might include most of them.

The particular topics above can be readily included by simply expanding the union of subdisciplines. That won’t necessarily generate more cross-disciplinary talk, even if I wish it would. But are there topics in the intersection that genuinely haven’t been dealt with as academic disciplines? My perspective is necessarily limited by my own lack of background in the more engineering side of computer science, but here are a few:

  • Algorithm ethics/fairness/recourse. This, of course, is already a going concern, but stands out as a poster-child for disciplines spawned by the data revolution.
  • Data plumbing, or the interaction of databases, networks and services. There are certainly a lot of people doing this (my thanks to an old high school friend for this self-description) and I confess to not being close enough to understand the landscape other than it doesn’t talk to statistics much.
  • Data cleaning, which certainly gets done a lot but for which a formal theory is clearly lacking. I’m grateful to Peter Bickel for pointing me to this problem.
  • Data analysis, as a social as well as scientific activity: what do people do, how do they interact with practitioners and data collectors, what influences resulting practice?

That is, data analysis, while certainly a core part of data science, is still only one part of data science. It’s absolutely true that it is employed in conjunction with, and often about, other components. It is also the part most fundamentally connected to statistics, and this makes it a very natural focus for statisticians. But to envisage the definition of Data Science as bounded solely by data analysis is highly self-limiting.

Hicks and Peng are careful to say that they are describing data analysis, rather than all of data science, and their paper performs a very useful service. But their discussion does tend to read as though the two are equivalent. In that sense, this post should be taken as a general reaction to statisticians’ reactions to data science rather than to theirs in particular. Statisticians will of course tend to focus on data analysis (as Hicks and Peng do) and I’m not sure that I want to require an obligatory “of course data science also includes…” section, but more frequent acknowledgement of the broader discipline would both improve statistics and increase its (much needed) influence in how everyone deals with data.

Yet Another Rant About CS Reviewing (YARACSR)

Inspired by a re-tweet from Tom Dietterich. There have been lots of complaints about reviewing for CS conferences: it’s random, there’s lots of biases, you have to be one of the in-crowd already. All of these are true, but they are also true of reviewing in any other field as well.

But I do think the CS conference system isn’t great for the field. That’s not new; the purpose of this post is to add details to the dynamics that I suspect are at play, in ways that aren’t so different from the journal systems in other fields but nonetheless serve to make the field more random, and more insular, than others.

To begin with, CS conference deadlines certainly do provide incentives to get research done and written up (as an advisor of PhD students, I definitely sympathise with this!), and this process also generates papers that are less complete/well thought out and more incremental than journal-based publications. The sheer volume of submissions to CS conferences also serves to put pressure on the refereeing system, reducing the thoughtfulness and deliberation of the review process. The insistence on giving all papers the same treatment — while apparently democratic — also accentuates this; the editorial board of a journal weeds out clearly unsuitable papers, and then provides more nuanced judgements about who to use as referees, reducing the over-all burden and increasing the relevance of referee expertise.

But Tom’s tweet really reflected a complaint that I have long had about the CS review system as a very occasional reviewer: the lack of iteration. I’ve handled a not-small number of papers that could be great with more work, but without the opportunity to iterate a couple of times, the papers just didn’t meet the standard.

Now you could certainly say that iteration occurs between CS conferences. I’m frankly skeptical that this occurs as much, as opposed to simply submitting much the same manuscript to the next conference. But more, I think it misunderstands the review process: it’s only partly about making the paper better. I think a good portion of iterative review is a conversation between the authors and the referees; one that clarifies the ideas in the paper and allows both parties to understand each other better.

Isn’t this just about reviewing improving the paper? Not really. I think of this as being a bit like teaching: the ideas need to be “turned around” until they’re presented in a way that clicks with the referee. I’ve often been struck by how much the fate of a paper lies in how its introduction is written, and if the referee doesn’t initially see the point, they’re prone to misread the rest of the material. This often seems to be pretty idiosyncratic to the referees.

I’m not sure that this iteration always makes the paper better for those who aren’t the particular referee (although it now makes sense to both referee and authors, so perhaps that’s something?). So is this better than a one-off review process? I think it does decrease the randomness, to some extent; I doubt that there is a presentation that works for everyone, but it does help to get around the presentation and judge the papers on ideas. The perceived randomness in CS conferences is partly about “did I happen to get referees for whom my presentation worked?”, whereas that’s less relevant in an iterative review process. It’s made worse because at some level we all understand that to be the case, and this reduces the incentive to make genuine improvements when we can always say “It’s just that these particular referees didn’t get it.” The dynamic also tends to make the discipline more insular — subjects further from the referee’s way of thinking (and statisticians publishing in CS conferences see this a lot) get much shorter shrift.

There is, however, a further aspect that adds to this: the time pressure in refereeing for CS conferences means that you don’t write reviews that are as long or as thoughtful. That means that the authors also get much less by way of specifics (“you really should do this”), which makes the process feel more adversarial and provides a lot less guidance to indicate that there really are things to be done to make the paper better. Slowing down really can be beneficial.

I gave up on CS conferences a while ago, due to frustrations both as an author and a referee. And I’ve had the same rants as oh so many people. I used to think that was a failing on my part, until I met increasing numbers of researchers whose work I greatly admire who have said that they can’t get published in CS conferences either. This certainly doesn’t make journal reviewing perfect: it’s still full of biases and still depends on who is assigned as editors and reviewers for a particular paper. But I still think it might do a bit better, both from taking more time and from the ability to push back from both directions. I’d love to see that studied, though how, and by what metrics, would be difficult to work out.

Philosophy of Statistics and Over-simplified Models

In which I start with some comments about the types of models generally discussed in philosophy of statistics, and then get sidetracked into musings about the philosophy of science.

This post derives from a comment I made on Deborah Mayo’s blog, Error Statistics, in which she discusses David Trafimow’s position on hypothesis tests. My purpose here isn’t really to revisit this particular debate (I mostly fall in Mayo’s camp) but the discussion brought a particular facet of statistical debates home to me: these debates only focus on farcically-simple models.

In Mayo’s post and the many comments on it, the basic example given is testing m in data from N(m,1), usually from a single observation X. Now there is a lot to be said for this: it removes a lot of extraneous fluff and simplifies the arguments. To be fair, it also maps fairly directly onto practical tests using real data: t-tests and t-tests for differences, at least.

However, I think this over-simplification has some pernicious consequences, also. These come up particularly in the “confidence intervals rather than hypothesis tests” school. I’m very happy to grant that confidence intervals contain more of the relevant statistical information than a p-value when we are interested in a single parameter; you can, after all, read off the result of a test (at the right level) from where the null falls relative to the confidence interval. But once we move to questions about the joint configuration of more than two quantities of interest, our ability to display and understand high-dimensional confidence sets breaks down — I can’t conduct a likelihood ratio test for the hypothesis that more than, say, 3 means are all equal by a visual summary of their estimates.

In fact, even this leads into interesting questions: I can test this hypothesis visually by using uniform confidence intervals, but doing so loses power, and implies larger confidence intervals than I might otherwise need. Does the additional information content of those intervals warrant the loss of power? I’m inclined to say no, but one might argue otherwise. I’ve yet to see it done.
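To make the power question concrete, a toy simulation along the following lines would do it; everything here is illustrative (k groups, one shifted mean, comparing a one-way F-test against rejecting whenever a Bonferroni-simultaneous interval for a pairwise difference excludes zero):

```python
import numpy as np
from scipy import stats

def power_comparison(k=5, n=20, shift=0.5, n_sim=2000, alpha=0.05, seed=0):
    """Estimated power of (a) the one-way ANOVA F-test and (b) a test based on
    Bonferroni-simultaneous CIs for all pairwise mean differences."""
    rng = np.random.default_rng(seed)
    means = np.zeros(k)
    means[-1] = shift                               # one group differs under the alternative
    m = k * (k - 1) // 2                            # number of pairwise comparisons
    tcrit = stats.t.ppf(1 - alpha / (2 * m), df=k * (n - 1))
    rej_f = rej_ci = 0
    for _ in range(n_sim):
        X = rng.normal(means, 1.0, size=(n, k))     # n observations per group
        rej_f += stats.f_oneway(*X.T).pvalue < alpha
        xbar = X.mean(axis=0)
        sp = np.sqrt(X.var(axis=0, ddof=1).mean())  # pooled SD (equal group sizes)
        half = tcrit * sp * np.sqrt(2.0 / n)        # CI half-width for a difference in means
        diffs = np.abs(xbar[:, None] - xbar[None, :])[np.triu_indices(k, 1)]
        rej_ci += np.any(diffs > half)
    return rej_f / n_sim, rej_ci / n_sim

# print(power_comparison())  # compares the two rejection rates; the claim above is
#                            # that the CI-based test loses power
```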

And this is for a model that I could still plausibly define by parameters that are likely to be of interest. The example I was thinking about when reading this discussion was testing a k-factor model in multivariate analysis. One might think of this as a goodness of fit test, or as a demonstration of some more complex effect. But there are no parameters that readily (and simply) define deviation from the model about which one might put even a uniform confidence band.

In fact, I think there are a number of features that this example pulls out. Besides the difficulty of “confidence-interval-ing” the test, its justification is asymptotic, even under Gaussian assumptions, forcing us to accept approximate inference and p-values for which we don’t have tight (and mostly only asymptotic) bounds on how “approximate” the inference might be.

Much of this, and my divergence into philosophy of science below, is built upon working with relatively complex models and data. It’s quite likely that it is my view of scientific practice that’s distorted — you don’t really need a statistician to conduct a t-test, after all — but it is certainly the case that at least some questions in practical science do center around complex effects that can’t be readily visualized, and with the increase in automatically generated and unstructured data and machine learning, I think that this will only become more common. There are real philosophical questions about how these should be dealt with.

So, on to being side-tracked…

On Mayo’s blog, I suggested goodness-of-fit tests as presenting a challenge to David Trafimow’s argument against hypothesis tests, and I still think they are that. However, the gap between my go-to example of a hypothesis test and the philosophers’ suggests a deeper challenge to our view of statistical models.

I have, I think, tended to conceptualize tests as limiting model complexity: I want to use a model as an approximation for the world that is complex enough to both capture the effect that I am interested in, and adequately describe any other processes that might affect my conclusions. In that light, even testing a specific effect of interest can be thought of as justifying added model complexity.

Ok, so here I’m working my way around to an ML-ish philosophy of science: I want to model the world as well as I can given the available data, with additional complexity added as we get the data to support it. Whither hypothesis tests in this view of science? After all, we do have a plethora of model selection criteria that we know have better predictive (and estimation) properties than selecting only those effects that pass significance tests. This is true even when all the models are interpretable.

I think some of the answer might be that model selection results are based on a single data set, but science is really interested in transferability for which greater regularization might be warranted, along with greater confidence in effects. One wants clear evidence of efficacy before releasing a vaccine (to take a topical example), although I’m not really sure that a decision-theoretic approach would yield the same thresholds to which we have defaulted. I continue to be frustrated that statistics doesn’t formalize ways to demand quantitative transferability of effects across experiments and experimental conditions, something that I think would help with the phenomenon of attenuating effect sizes. One might argue that the hypothesis testing framework mitigates something that would otherwise be worse, but I’d like to see a mathematical formalism to demonstrate it.

Does thinking of science as far more incremental — slowly building ever more complex models as the data support them — really change a broadly Popperian view of progress? Or, more generally, one can certainly express paradigm shifts in terms of shifting to new models that provide better compression of the observed phenomena (neural networks instead of support vector machines?). I do think that this misses something qualitative, however, in a conception of science. Physics, for example, tries to build its models from very simple starting propositions, explicitly with the aim of showing that the combination of these explains many different phenomena. This is, of course, a form of model compression, but current statistical treatments provide neither aid nor incentive to the scientist to seek out these lower-level simplicities, and I fear that the incrementality that is currently encouraged by statistical practice (or at least the models we like to work with) will mean it takes a long time to achieve them.

Some further thoughts on Causal Inference

So let me admit: Judea Pearl got under my skin when I read The Book of Why. Not only negatively — I still don’t think he does justice to statisticians — but in the sense of saying that there is something important to think about.

I’m not sure that this has led me in directions that bring me closer to Pearl, though. Some of my reaction is to say

Scientists (and statisticians) have had a habit of paying lip service to causal concerns (“This is an association, not causation”) but actually interpreting effects found through regression, etc., as, in fact, being causal; maybe we should be more careful about this.

I think of this realisation as being a bit like the way statistics has worked out that we’ve maybe been overly cavalier about data processing and exploration. Pearl doesn’t have much to say about demonstrating causal relations — he assumes you know them and just works with the consequences — but (and this may be the statistician in me) I rather feel that there is much more to do here than in the consequences of calling an effect causal. There’s lots of work being done right now, though (on both topics) so I’ll let that rest.

The thing that has niggled me is that the models that I think of as being justified are nothing like the toys that Pearl uses in his book. They’re based on physics or chemistry, or “first principles” understandings of biology or ecology along the lines of “predators eat prey”. And for these, I think I do know what causal relationships look like, even without Pearl — you read these off the model, or you make some change and simulate the model again. (Yes, that is the “do” operator, but what was needed was pretty obvious).

But is this very different from Pearl’s DAGs, except maybe in having more complex models? And it finally struck me that there is something that I think Pearl misses: time. One of the fundamental principles of our understanding of the world is a bar on backwards causation: causes happen along the direction of time. Now in most of Pearl’s examples (and in much of statistical causal inference) time only appears implicitly: in a randomized controlled trial, one applies the treatment then measures the effect; similarly in Pearl’s models of Berkeley admissions. There is nothing in a DAG structure that requires that we satisfy the law of forward causation. I’m sure that the response would be

You do need to write down a model that corresponds to your causal understanding.

or, more generally, just that arrows in your DAG can only point forwards in time. Nonetheless, it does feel somewhat odd that one of the most fundamental properties of our understanding of cause and effect plays such a peripheral role in its mathematical description. I guess one might also view it as enforcing the “acyclic” part of DAGs. (Maybe it also says something about the nature of our understanding of causality.)

But this did get me to wondering about incorporating time explicitly into causal analysis. It’s certainly true that “arrows can only point forward in time” pretty much implies you have a DAG (I think that can be made formal). This also brings Pearl’s concepts closer to Granger causality. It moves you closer to the sort of mechanistic models that I’m prepared to consider “causal” (at least outside randomized trials), and it would be interesting to see a Pearl-style approach to time-series models.

Thoughts on the Statistics Debate

On October 16, Jim Berger, Deborah Mayo and David Trafimow took part in a debate, hosted by NISS, about the use of hypothesis tests and p-values in scientific studies. I’m delighted to see Statistics once again grappling with its philosophical basis, and at least some philosophers coming to help out. I think it’s worth watching if you haven’t. See

https://www.niss.org/news/statisticians-debate-issues-central-inference-and-estimation

Here are some of my own thoughts, having had a few days to digest.

First, the participants had agreed not to prepare beforehand, and while I understand the motivation, I think that I would support Mayo’s tweet wondering if that was, in fact, the best idea. I think that the arguments would have been more cogently stated, and perhaps more directly engaged with, given some more preparation. Trafimow, in particular, took some time to warm up (he may have taken the injunction against preparation most seriously) and that is a shame; I don’t agree with his position (more later), but this is all the more reason to want it to be persuasively argued — there might be something I’ve missed! In retrospect, I might have actually gone in the other direction and had each participant write out a statement of principles to be shared ahead of time and read at the start, and from which the discussion could proceed.

I took there to be really two debates: Mayo vs Berger on p-values versus Bayes factors, and Mayo and Berger versus Trafimow on testing at all. I’ll readily admit to finding it hard to concentrate in online fora the same way I would for an in-person event (no-one can see me check my e-mail) and I had to leave and teach before the discussion period, so I may have missed some details, but these were the most salient discussions that struck me.

For the first of these, it seemed to me that Berger and Mayo came from very different perspectives and largely talked past each other. Berger supported Bayes factors with “this is how scientists want to interpret p-values”, countered by Mayo’s “but that’s not how they ought to interpret them”. As a philosophy of science, I’m in Mayo’s camp here, but have to acknowledge that nearly a century of statistical education still seems not to have found a way to reliably get the point across. We can certainly say that correct science does not need to account for human cognitive failings — it’s ok that it’s hard — but it does open the question of whether there are more readily-understood but equally rigorous frameworks; although a Popperian description of science certainly points to something like classical statistical methods.

But I was sorry to see neither really address the concerns of the other. Berger, of course, would say that Bayesian frameworks provide a coherent alternative for scientific inference and argues that p-value evidence maps poorly onto Bayes factors. Mayo would counter that you can’t equate the scales used by the two systems. I’d agree with Mayo, but the technical argument misses the differentiation of intent: do we work from how scientists actually think and try to improve that, or formulate how they ought to think? The latter is certainly appealing, but does run into human failings.

This is evident in the reproducibility crisis (only part of which is based in bad statistical practice), where Mayo is again technically correct in observing that p-hacking is an indication that statistical significance is actually a pretty challenging requirement, if achieved honestly. However, the neatness of the philosophical system doesn’t account for the crisis very neatly demonstrating Goodhart’s law as generalized by Marilyn Strathern:

When a measure becomes a target, it ceases to be a good measure.

One can, of course, build in guard-rails: pre-registration, or developing methods of post-selection inference, and I would guess that Mayo would be fine with either of these so long as they preserve her severity requirements. It might be better to find ways to reward scientists, not for publishing papers, but for publishing papers that get replicated (not that I have brilliant ideas about how to do this), thereby incentivizing scientists to be honest (with themselves) about their statistical practice. That idea isn’t original, and quite likely neither Mayo nor Berger would disagree with either of these statements. In fact, there may be little to say about their disagreement over a starting point besides “Let me acknowledge the opposing concern, but say that I think we have to just push past it”, but even that statement is useful.

The debate about using hypotheses at all was somewhat shorter. I had initially understood Trafimow’s editorial decision to ban significance tests as something of a sociological response to p-hacking — we will get less distorted models and conclusions if we take away the incentives. I’d agree with Mayo that, on balance, I think that removing checks against randomness is counter-productive, but the proposition is not crazy. However, Trafimow staked out a more philosophical position; initially against basing conclusions on wrong models (the reductio of which would make progress impossible) but then against basing dichotomous decisions on incorrect models. I presume this comes from the discontinuity: I can be slightly off in my picture of the world, but still fairly close in my estimates of effects, until I turn it into an either-or proposition about something (assuming my estimates are somehow continuous in my model space). I’m not sure this is so much of a concern: the p-values Trafimow objects to (or at least their distribution) would still be continuous in the model space, and the potential for disagreement in dichotomous conclusions reduces as the models converge, as effectively argued by Mayo. I will happily buy into confidence intervals being a more informative summary of statistical evidence, when they are available, but we really shouldn’t pretend that they are anything other than a different summary of statistical tests.

I did think it was a shame here not to have a performance interpretation of hypothesis tests articulated, since “Using this model, I would not frequently be wrong” is an easy counter to Trafimow’s statements. This interpretation does leave you working with models rather than with the world they purport to describe, which is one of Trafimow’s objections, but such statements do in fact speak to the projection of the world onto that model. (Inference also requires more assumptions than just obtaining parameter estimates does, but that goes for any sort of uncertainty quantification.) The other concern is that it starts veering a bit Bayesian. Nonetheless, we do work with models, and I think within-model replicability is still at least a minimal and non-trivial requirement. The replication crisis, too, is explicitly stated in performance terms, albeit terms that sit outside any single model, so some statement that the performance criteria are at least entailed by severe testing would be helpful.
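As a small illustration of that performance reading (a sketch only; the normal null model, the particular test, and the sample size are arbitrary choices of mine), one can check that a level-0.05 test applied when its model is exactly true rejects only about 5% of the time:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha, n_reps, rejections = 0.05, 10_000, 0

    # Hypothetical setting: the null model (mean zero, unit variance) is exactly true.
    for _ in range(n_reps):
        sample = rng.normal(loc=0.0, scale=1.0, size=30)
        _, p_value = stats.ttest_1samp(sample, popmean=0.0)
        if p_value < alpha:
            rejections += 1

    # Close to 0.05: within this model, I am "wrong" (reject a true null)
    # only about 5% of the time.
    print(rejections / n_reps)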

One statement of Trafimow’s that I would thoroughly get behind, however, is that replications should not just be about hypothesis tests, but need to also include effect sizes. Indeed, I’ve often worried that the narrow focus of statistical methods on single experiments is damaging: by making evidence for specific effects challenging, we make scientific conclusions contingent on the specific experimental setting they were obtained in. This disincentivizes science from developing knowledge that is transferable across situations. Biology and psychology are much more complex than physics, of course, making generalized quantitative effects much more difficult, but I’m not sure that we couldn’t do better — I have a rant about linear models being deleterious for most fields that statisticians have interacted with, but that’s for another day.

In any case: thanks to all participants for having a go. For all I may seem to complain, the debate clearly stimulated my thought processes, and if the same is true for even a fraction of the audience, that’s the best outcome it could have hoped for a priori. Well done.

On The Book of Why

With the semester over, I am finally able to get back to a bit of less-academic writing, in this case about my bedtime reading: a review of The Book of Why by Judea Pearl.

I was put onto this book by the Nature Podcast who reviewed it on their books segment. Pearl’s ideas had been known on the fringes of the statistics community for years and largely dismissed as “Well if you call part of a model causal, the inference is rather circular”. But enough else was happening in causality in statistics that it seemed a good juncture to actually look at how badly wrong that statement was.

As it turns out, it’s not far off. But that doesn’t mean that Pearl’s ideas are devoid of content (or of no relevance to statistics). In fact, I find myself fairly violently conflicted about his project, which might make for a more-interesting-than-usual review.

The first thing to say is that this is excellent bed-time reading for a statistician. In fact, it’s hard to work out who else might be the intended audience. It requires far too much statistical background for a general audience (even, I expect, for most computer scientists) but has the right level of informal discussion for me to read when too tired for technical work. I fear that may have limited its sales, but it certainly worked well for me.

The second is that it is worth slogging through the opening chapters. These are prototypical examples of CS salesmanship (“the new science of causation”???), both over-promising the remainder of the book and under-rating the contributions of others, particularly in statistics. I spent a good deal of time wanting to throw the book across the room as Pearl airily dismissed Fisher as being completely uninterested in causation (What did he think randomized trials were for?) and dismissed the entire field of statistics as following from that tradition (Had he not heard of Don Rubin or Jamie Robins? How did he think he was going to get inferences out of data, whether he called them causal or not? What about structural equation models?). As it turns out later, Pearl does in fact discuss RCTs as a gold standard (though not as the only standard), and both Rubin and Robins play large roles. Indeed, it’s hard to find a non-historical figure in the book who isn’t either Pearl’s student (and he’s admirably supportive of these) or a statistician! I rather suspect Pearl should have been a statistician, though I don’t know that he’d be prepared to admit that.

But, after plowing through the aggravation, we get down to business. I’ll divide the book into two broad topics. The first of these is, to me, relatively uncontroversial (at least in parts): causality had, surprisingly, not been formalized in terms of probabilistic mathematical models. The do-calculus (I find the name awkward, but at least it’s descriptive) provides this and, along the way, some insight about which relationships to examine in data when you are looking for downstream causal effects.
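For readers who want one concrete instance of what the do-calculus licenses, the back-door adjustment formula rewrites an interventional quantity purely in terms of observational conditional distributions, provided a set of covariates Z blocks every back-door path from X to Y — and that proviso is exactly where the causal assumptions live:

    P(Y \mid \mathrm{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z) \, P(Z = z)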

“Don’t control for mediators” is, for example, a statistically counter-intuitive statement, though it makes causal sense. (Except the statistician in me is more inclined to condition and then reconstruct the causal effect — I haven’t yet worked out whether that leads to a loss of efficiency or the other way around.) He also shows how to view a set of not-obviously-causal problems through this lens, which is certainly interesting and useful. I’ll leave it to others who have looked at the do-calculus more carefully to assess the philosophical contribution (although things are slipperier for continuous quantities than for the discrete values that Pearl — clearly a computer scientist here — finds more comfortable), but I’ll readily admit to surprise at the realisation that it hadn’t been formalized a century or so ago. Statistical application or not, that is an important contribution within philosophy. Some of this really is slippery, though; I’d like to see the Rubin potential-outcomes framework re-written in do-calculus (I think this is possible), or to see how this relates to the much weaker notion of Granger causality. I’m less convinced that Pearl’s “ladder of causation” (maybe step-ladder? It has but three rungs) is really as clear as all that, but I’ll accept it as a means of introducing the lay person to thinking about things.
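To make the mediator point at the start of this paragraph concrete, here is a minimal simulation sketch; the linear system and every coefficient in it are invented purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Hypothetical linear system: X -> M -> Y, plus a direct X -> Y path.
    x = rng.normal(size=n)
    m = 0.8 * x + rng.normal(size=n)             # mediator
    y = 0.5 * m + 0.3 * x + rng.normal(size=n)   # total effect of x: 0.3 + 0.8 * 0.5 = 0.7

    def ols(columns, response):
        """Least-squares coefficients, with an intercept prepended."""
        design = np.column_stack([np.ones(len(response))] + list(columns))
        return np.linalg.lstsq(design, response, rcond=None)[0]

    # Regressing y on x alone recovers the *total* causal effect (about 0.7).
    print(ols([x], y))
    # "Controlling" for the mediator m leaves only the *direct* effect (about 0.3).
    print(ols([x, m], y))

Once the mediator is in the regression, the coefficient on X no longer answers “what happens to Y if I intervene on X?”, which is the sense in which conditioning on mediators is a mistake for total effects.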

Along the way, he skewers statisticians’ traditional interpretation of their own linear models. See my complaints about this at Weasel Words, where it’s not only causal relationships that get ignored (contrary to Pearl, I think a comparison of individuals is still interpretable even though not causal), but those are, too.

The second issue is much more controversial: that scientists are far too afraid of causal language (and, in particular, that this is statisticians’ fault). The argument for this claim goes something like

  1. Scientists do, in fact, think that they are finding causal relationships.
  2. They are hampered in discussing this by depressing party-pooper statisticians forever warning about hidden confounders.

And in fact I agree with both of these, and recall thinking 1. on many occasions myself! Indeed, I also find myself in emphatic agreement with Pearl’s frustration at statisticians’ refusal to go outside of their own (fairly narrow) toolbox and incorporate an understanding of the domain that they work with. See, for example, my comments in Interpretation and Mechanistic Models, although I will come back to that (and I have recently been guilty myself of having too little time to develop enough depth to really work with a problem).

BUT: these claims, and the benefits of causal understanding, are easy to make with the rather pat models that Pearl produces. It’s much harder in the real world where, as in nutrition or public health, the unlicensed causal interpretation of findings and the constant attenuation of cautionary language (e.g. recently by Andrew Gelman) produce features and fads, and much more malicious effects, too. My sister’s reaction to the set of ideas was that encouraging more unsupported causal interpretation would be disastrous in public health.

And in fact, this really is the nub: Pearl’s analysis works beautifully once you have agreed on what the causal relationships are. But almost everywhere, that agreement doesn’t exist (for a nice interaction with the fairness debate, see this paper). Even in The Book of Why, I kept looking at the already-overly-simple models and saying “I’m not sure that’s right.” How on earth are we to agree on causal relations in real scenarios? Every example in Chapter 9 illustrates this for me.

Now I’m pretty sure that Pearl’s response would be that the solution to this isn’t to avoid discussing causation, but to make it explicit. That is, to say “A causal claim is being made here, let’s explicitly discuss that and the evidence for it.” He even (although only once, and in passing) countenances establishing tiers of evidence for causal effects: from randomized trials, from observational data, and so on.

And I’m generally sympathetic to “Let’s be honest about what we think is going on”, even after accounting for “But humans have a horrible tendency to run away with an idea.” And it might be worthwhile to produce a somewhat more formalized framework for discussing causal claims. But the discussion sections of papers do, in fact, often make it clear what the authors think the causal relations are. These relations are not stated elsewhere precisely because the evidence for them is weak.

I think that what Pearl perceives as hostility to causation on the part of statistics can be reasonably attributed to caution. Statistics suffers from an over-abundance of this, partly due to statisticians’ unwillingness (or lack of time) to really get involved with a subject. This leads us to be rather scared of writing down anything but the weakest models, and it certainly leads us to be highly skeptical of causal statements where the do-calculus (i.e., performing an intervention) hasn’t been physically instantiated in an experiment. That attitude is a hindrance, but it’s also born of a century of experience of poor replicability and bad scientific consequences. Even the more widely accepted Rubin analysis has a “no hidden confounders” assumption that many (myself included) find hard to swallow.

Now I agree with Pearl that statisticians are far too scared of writing down a model. One doesn’t have to be Bayesian to say “Let’s start with writing down what I think I know and see how that agrees with the data” — we do that in sample size calculations already. But I’m really not persuaded that sticking with linear models or categorical relationships is the way to do this. One thing that Pearl leaves out of this book is mechanism: how, physically, does this effect take place? (Ok, this comes up as mediation, but in a quite different context.) More importantly, can I transfer this understanding to a different system? Without this sort of understanding — and that’s very difficult — causality has relatively limited uses. But Pearl’s observation-based and rather pat models don’t really get us anything like that.
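Sample size calculations are a good reminder of how much model we already commit to before seeing any data. Here is a minimal sketch in which every input (effect size, standard deviation, level, power) is an assumed number made up for illustration:

    from scipy.stats import norm

    delta = 0.5   # smallest effect we care about detecting (assumed)
    sigma = 1.0   # guessed standard deviation of the outcome
    alpha = 0.05  # two-sided test level
    power = 0.80  # desired power

    # Standard normal-approximation formula for a two-sample comparison:
    # n per group ~ 2 * ((z_{1 - alpha/2} + z_{power}) * sigma / delta)^2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n_per_group = 2 * ((z_a + z_b) * sigma / delta) ** 2
    print(round(n_per_group))   # roughly 63 per group under these assumptions

Everything downstream of those four assumed quantities is exactly the “write down what I think I know and see how that agrees with the data” exercise described above.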

So do read The Book of Why; it is a set of ideas that you should know about, at least informally, and it is useful to think about more than just phenomenological correlation. But then also go and read some physics, or some mathematical biology as well. Statisticians need a really good dose of both.