X-raying the Black Box

Here we get to the original purpose behind the blog and the reason I started thinking about these issues in the first place. It’s also rather self-serving, since it really gets to be about my own work. (Hence the reason it’s three months late? Not really — I had thought of it as being background material and simply got distracted).

At least one aspect of my work has centered around taking otherwise-unintelligible models and trying to come up with the means of allowing humans to understand, at least approximately, what they are doing. This was the focus of my PhD research (which now seems painfully naive in retrospect, but we’ll get to some of it, anyway), but the ideas have, in fact, a considerably longer history. I recall hearing about neural networks in the early ’90s when they were in their first wave of popularity and thinking “Yes, but how do you understand what it’s doing? And is it science if you don’t?” *** These largely stayed in the background until  Jerry Friedman  suggested machine learning diagnostics as a useful topic to investigate (among a number of others) and I found that I had more ideas about this than about building yet more seat-of-the-pants heuristics for making prediction functions. ***

Most of this post will be a run-down of machine learning diagnostics, rather aptly described as “X-raying the Black Box”, *** although Art Owen probably more accurately likens what is done to tomography. However, in doing this, I hope to highlight the key question for all of this blog — is the effort put into developing the tools below of any real scientific value? Or is it all simply cultural affectation that panders to our desire to feel in control of something that we really ought to let run automatically? Models and software that have some of the tools below built in to them certainly sell better, but that doesn’t mean that they’re more useful. And if they are useful for something besides making humans feel better about themselves, what is that, and can we keep it in mind when designing these tools? It is, as I think I have foreshadowed, difficult to come up with real, practical examples to justify the time I’ve spent on the problem — although the tools I developed work quite well. I do have the beginnings of some notions about this — most of them have a huge roadblock in the form of statistical/machine learning reluctance to examine mechanistic models (see Post 6) — but we’ll need to get to what is actually done by way of machine learning diagnostics first.

So let’s look at the basic problem. We have a function F(X) where X = (x1,…,xp) is a vector and we can very cheaply obtain the value of F(X) at any X, but we can’t write down F(X) in any nice algebraic form (think of a neural network or a bagged decision tree from previous posts). What would we like to understand about F? Most of the work on this has been in terms of global models:

– Which elements of X make a difference to the value of F?
– How does F change with these elements?

(there are also local means of interpreting F — I’ll post about these later***). Most of these center around the problem of “What is the importance and the effect of x3?” Or, in more immediately understandable terms “What happens if we change x3?” Fortunately, F(X) is cheap to evaluate, so we can try this and look at
f3(z) = F(x1,x2,z,x4,…,xp) as we change z (note that this is a direct analogue of the linear regression interpretation “all other terms held constant”). This can then be represented graphically as a function of z (see the last blog post on intelligibility of visualizable relationships even when they aren’t algebraically nice). I’ve done this in the left plot below for x2 in the function from the previous post. We get a picture of what changing that coordinate does to the relationship, and if the plot is close to flat, we can decide that the variable really isn’t all that important.

This is fairly reasonable, but presents the problem that the plot you get depends on the values at which all the other elements of X are held. The right-hand plot below shows the relationship graphed at 10 other values of X.

 

[Figure: left — a single slice of the function in one coordinate; right — the same slice at 10 different values of the remaining inputs, with the thick averaged partial dependence curve.]

So what to do? Well the obvious thing to do is to average. There are a bunch of ways to do this; Jerry Friedman defined the notion of partial dependence in one of his seminal papers in which you average the relationship over the values of x that appear in your data set. Leo Breiman defined a notion of variable importance  in a similar manner by saying “mix up the column of x3 values (ie randomly permute these values with respect to the rest of the data set) and measure how much the value of F(X) changes.” There have been a number of variations on this theme. In the right-hand plot above I’ve approximated Jerry’s partial dependence by averaging the 10 curves to produce the thick partial dependence line.
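
To make these two averaging ideas concrete, here is a rough Python sketch — my illustration rather than anything from Jerry’s or Leo’s papers — with the black box F and the data set made up for the purpose:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in black box F(X): cheap to evaluate, not algebraically "nice".
def F(X):
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.sin(3 * x1) + x2 * np.exp(-x3) + 0.5 * x1 * x3

X = rng.uniform(0, 1, size=(500, 3))   # the observed inputs

# Friedman-style partial dependence on x_j: for each grid value z, hold
# x_j at z for every observation and average the slice over the data set.
def partial_dependence(F, X, j, grid):
    pd = np.empty(len(grid))
    for k, z in enumerate(grid):
        Xz = X.copy()
        Xz[:, j] = z
        pd[k] = F(Xz).mean()
    return pd

grid = np.linspace(0, 1, 25)
pd3 = partial_dependence(F, X, j=2, grid=grid)

# Breiman-style importance: permute the x_j column and measure how much
# the value of F(X) changes.
def permutation_importance(F, X, j, n_repeats=20):
    base = F(X)
    shifts = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        shifts.append(np.mean((F(Xp) - base) ** 2))
    return np.mean(shifts)

print("partial dependence of x3 over its grid:", np.round(pd3, 2))
print("permutation importance of x3:", round(permutation_importance(F, X, 2), 3))
```

Plotting pd3 against its grid gives a curve like the thick line in the right-hand plot above; the flat-versus-not judgement is then made by eye, or by the variance book-keeping discussed below.
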

These can all be viewed as some variation on the theme of what is called the functional ANOVA. For those of you used to thinking in rather stodgy statistical terms, fANOVA is just the logical extension of a multi-way ANOVA when you let every possible value of each xi be its own factor level (we can do this because we have a response F(X) at each factor level combination). For those of you for whom the last sentence was so much gibberish, we replace the average above by integrals. So we can define

f3(z) = ∫ … ∫ F(x1, x2, z, x4, …, xp) dx1 dx2 dx4 … dxp

The point of this is that it also allows us to examine how much difference there is due to pairs of variables, after the “main effects” have been taken out:

f2,3(z2, z3) = ∫ … ∫ F(x1, z2, z3, x4, …, xp) dx1 dx4 … dxp – f2(z2) – f3(z3)

This can be extended to combinations of three variables and so on; it gives us a representation of the form

F(X) = f0 + ∑i fi(xi) + ∑i,j fi,j(xi, xj) + ….

There are lots of nice properties of this framework; most relevant here is that we can ascribe an amount of variance explained to each of these effects and parse out “How important is x3?” in terms of the variance due to all components that include x3. We can also plot f3(x3) to display its effect.
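
For those who like to see the book-keeping spelled out, here is a minimal Monte Carlo sketch of that variance attribution — again with a made-up F, and under the assumed (and important) simplification that the inputs are independent and uniform on [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(1)

def F(X):
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.sin(3 * x1) + x2 * np.exp(-x3) + 0.5 * x1 * x3

n = 4000
X = rng.uniform(0, 1, size=(n, 3))    # assume independent uniform inputs
f0 = F(X).mean()                      # grand mean f0

# Main effect f_j(z): integrate F over the other coordinates (by Monte
# Carlo) and subtract f0; its variance is the share due to x_j alone.
def main_effect_variance(F, X, j, grid_size=40):
    grid = np.linspace(0, 1, grid_size)
    fj = np.empty(grid_size)
    for k, z in enumerate(grid):
        Xz = X.copy()
        Xz[:, j] = z
        fj[k] = F(Xz).mean() - f0
    return np.var(fj)

total_var = F(X).var()
for j in range(3):
    share = main_effect_variance(F, X, j) / total_var
    print(f"x{j + 1}: main-effect share of variance is roughly {share:.2f}")
```

The shares need not add to one; whatever is left over is attributable to the interaction components.
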

This framework was a large part of my PhD thesis (although to be fair, Charles Roosen — an earlier student of Jerry Friedman — laid a lot of the groundwork). You might think of plotting all the individual level effects, and all the pair-wise effects in a grid. If these explain most of the changes in the predictions F(X) over your input space, then you can think of F(X) as being nearly like a generalized additive model (exactly as written above, stopping at functions of two variables) in which all the terms are visually interpretable. One of the things I did was to look at combinations of three variables and ask “Are there things that this interpretation is missing, and which variables do they involve?”

Of course there is a question of “Integrate over what range?” Or, alternatively, with respect to what distribution? In fact, this can have a profound effect on which variables you think are important, or how well you can reconstruct F(X) by adding up functions of one dimension. This was another concern of my thesis — particularly when the values of X that we had observed left “holes” in space that F(X) filled in without much guidance as to what the real relationship was. I’ll come back to some of this later.

For the moment, however, we have that almost all tools for “understanding” machine learning functions come down to representing them approximately in terms of low-dimensional visualizable components. The methods that have come with such tools have been very popular, partly because of them. However, I wonder whether they are really doing anything useful and I’ve rarely seen such visualization tools then used in genuine scientific analysis. They can be (and have been) employed to develop more algebraically tractable representations for F(X). This sometimes improves predictive performance by reducing the variance of the estimated relationship, but mostly comes down to having a relationship that humans can get their heads around, and this still doesn’t answer the question of “Why do we need to?”

 

*** As an aside I don’t know how common it is for academics to trace their research interests to questions and ideas far earlier than their conceptualization of their work as a discipline. In my first numerical analysis class, we at one point set up linear regression (I had no statistical training at all) and I recall thinking that I wanted to develop methods to decide if it should instead be quadratic or cubic etc. Some five years or so later I discovered statistical inference. Of course, we always pick out important sign-posts in retrospect.

*** Jerry has an amazing intuition for heuristics and the way that algorithms react to randomness — I quickly decided that competing with him was not really going to be a viable option, but still lacked the patience for long mathematical exercises; hence a penchant for the picking-up of unconsidered problems, even if they lead into Esoterika (that’s got to be a journal name) on occasion.

***  I wish this was my term — I first came across it as the heading of a NISS program.

*** Yes, I do have a list of all the things I’ve said I would post about.

One More Intelligible Model

In my most recent entry, I attempted (apparently not particularly successfully) a distinction between interpretable and mechanistic models. The term interpretable appears to be in common use in statistics and machine learning, but it may be that intelligible would be more appropriate. By this I mean that humans can make sense of the specific mathematical forms employed in order to “understand” how function outputs are related to function inputs.

So what can we find intelligible? In an older post I essentially brought this down to small combinations of simple algebraic operations. Of course what “small” and “simple” mean here will depend on the individual in question — I’m pretty decent at understanding how exponentiating affects a quantity, but that certainly isn’t true for my introductory statistics students — but with the possible exception of a very small number of prodigies, I know nobody who can keep even tens of algebraic manipulations in their head at any one time.

There is an important extension of this, which is that some interpretation is still possible if a relatively simple algebraic expression is embedded in a more complex model and retains its interpretation in this context. For example, a linear regression with hundreds of covariates is not particularly interpretable — there are far too many terms for a human to keep track of — but each individual term can be understood in terms of its effect on the prediction. (This is as a function, ie “with the other terms held constant” — I’ll post something on this weasel formulation later). It is, of course, possible to embed simple terms within complex models in ways that destroy this: the relatively easy interpretation of a linear model within a neural network, for example, is lost when its effects are obscured by the more complex manipulation that is then applied to its output.

For the purposes of describing some means of machine learning diagnostics, there is, however, one further class of mathematical functions that I think humans can get a handle on — those we can visualize. Here

[Figure: two one-dimensional functions plotted over their inputs.]

I have plotted some one and two dimensional functions (I’ll come back to what these are in a bit) that do not have “simple” algebraic structures. Nonetheless, understanding them is easy — just look! We can even read numbers off the plots. We also know how to plot two-dimensional functions and are pretty good at understanding contour plots, heatmaps, and three-dimensional renderings.

[Figure: a two-dimensional function shown as a contour plot, a heatmap, and a three-dimensional rendering.]

If we wanted a function of three inputs it might be possible to stack some of these, or at least lay them out somehow:

[Figure: a function of three inputs laid out as a series of two-dimensional slices.]

and nominally we could try to extend this further, but my brain is already starting to dribble when I actually want to look through these and come up with some sense of what is going on.

None of the functions that I have just presented have algebraically simple expressions. The first is given by a combination of three normal densities (apparently I don’t have any sort of equation editor in this tool so I can’t do square roots) which really isn’t nice, but we can examine it visually. Is this more than a special case? Only to some extent — these can be extended into more complex contexts in the same way that the interpretation of linear terms can be: so long as their effects are the same when put within that context. In fact, statisticians have long used generalized additive models of the form

y = g_1(x_1) + g_2(x_2) + g_3(x_3) + g_4(x_4)

precisely because of their intelligibility (and because estimating such models is more statistically stable). Even in machine learning, this is gaining some traction — see my paper with Yin Lou, who just completed his PhD in Computer Science at Cornell, explicitly looking at estimating these types of prediction functions because of their intelligibility. ***
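
As an illustration — and emphatically not the estimator from that paper — here is a toy backfitting sketch that fits a two-term additive model with a crude binned smoother; the data are simulated for the purpose:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from an additive truth (an assumption for the example).
n = 1000
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
y = np.sin(x1) + 0.3 * x2 ** 2 + rng.normal(0, 0.2, n)

def bin_smoother(x, r, bins=20):
    """Crude smoother: average the partial residuals r within bins of x."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(bins)])
    fitted = means[idx]
    return fitted - fitted.mean()   # centre so the intercept absorbs the mean

# Backfitting: cycle through the coordinates, smoothing the residuals
# left over after removing the other fitted component.
g1, g2 = np.zeros(n), np.zeros(n)
intercept = y.mean()
for _ in range(10):
    g1 = bin_smoother(x1, y - intercept - g2)
    g2 = bin_smoother(x2, y - intercept - g1)

print("residual standard deviation:", np.std(y - intercept - g1 - g2))
```

Plotting g1 against x1 and g2 against x2 gives exactly the kind of visually intelligible component pictures discussed above.
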

By way of example to tie in many themes from the last few posts, we might examine Newton’s law of gravity as it applies to an object near the ground on earth. In the classical form, the vertical height z of an object from the surface is described by the differential equation

D^2 z = – g/(z+c)^2

where c is the distance from earth’s center of mass to its surface and D^2 z means its acceleration. This, of course, is an approximation for many reasons, but partly because earth’s gravity changes over space, and is affected (in very minor ways) by other celestial bodies, so perhaps we should write

D^2 z = – g(x,y,z,t)/(z+c)^2

where x and y provide some representation of latitude and longitude and t is, of course, time. Here the mechanistic interpretation remains — the dynamics of z are governed by acceleration due to gravity — but g(x,y,z,t), unless given in an algebraically nice form, is not particularly intelligible. My larger question in this blog is “does that matter?” Of course, for most practical purposes, the first form of these dynamics is sufficient to predict the trajectory of the object quite well — it’s also a very handy means of producing a simpler, intelligible approximation to the actual underlying dynamics (g is very close to constant) that humans can make use of.+++
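
To make the “sufficient for most practical purposes” claim concrete, here is a small numerical sketch of the first form of the dynamics; the constants are my assumptions, chosen so that the acceleration is -9.81 m/s² at the surface:

```python
# c is roughly the Earth's radius; the numerator is chosen so the
# acceleration reduces to -9.81 m/s^2 at z = 0 (both values assumed).
c = 6.371e6          # metres
g0 = 9.81            # m/s^2

def acceleration(z):
    return -g0 * c**2 / (z + c)**2

# Simple fixed-step integration of D^2 z = acceleration(z) for a ball
# thrown upward at 20 m/s.
dt, t = 0.01, 0.0
z, v = 0.0, 20.0
while z >= 0.0:
    v += acceleration(z) * dt
    z += v * dt
    t += dt

print(f"back at the surface after roughly {t:.2f} seconds")
```

Running the same loop with the acceleration held constant at -9.81 changes the answer only somewhere around the fifth decimal place, which is exactly why the constant-g form is such a useful intelligible stand-in.
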

This idea of producing a simpler model, along with additive models, really makes up most of the tools used — usually informally — to understand high dimensional prediction functions, and that’s something that I’ll get to in the next post.

 

*** I must also thank Yin for pointing out in his thesis that “intelligible” might be a less ambiguous term than “interpretable”, although there is no alternative verb corresponding to “interpret”.

+++ Now I know that a physicist will object that the general law of gravitation applies to any collection of bodies if you know enough. Besides the fact that you never know enough to account for everything (and once you do, not everything behaves approximately according to Newtonian dynamics), I could still ask — what if the inverse square law were a more complicated function, and does the fact that it has a nice algebraic form matter?

Interpretation and Mechanistic Models

I want to devote this post to a very different modeling style to which neither statisticians nor ML-types pay much attention: what I will refer to as mechanistic models. I think these are worth discussing for a number of reasons.

  1. In one sense, they represent one of the best arguments against the ML viewpoint in terms of identifying where human intelligence and understanding becomes important to science.
  2. I want to distinguish mechanistic from interpretable in this context. In particular, my concerns are not really about the benefits of mechanistic models (although this is also an interesting topic) and I want to clarify this.
  3. Statisticians rarely think of modeling in these terms and I think this represents one of the discipline’s greatest deficiencies.

The sense in which I use mechanistic is somewhat broader than is sometimes employed (ie, it encompasses more than simply physical mechanics). The distinction I am making is between these and what I would describe as data-descriptive models; it also roughly distinguishes the models employed by applied mathematicians from those used by statisticians.

To make it clear for the physicists: I use the word  interpretable to be a property of the mathematical form that a model takes, not of its real-world meaning. Ie, I am asking “Should we worry about whether we can understand what the mathematics does?” I am aware of the vagueness of the term “understand” — that’s a large part of the reason for this blog.

Essentially, mechanistic models are generally dynamic models based around a description of processes that we believe are happening in a system, even if we cannot observe these particularly well. That is, they provide a mechanism that generates the patterns we see. They are often given by ordinary differential equations, but this has mostly been because ODEs are easy to analyze, and we can be broader than that. ***

The simplest example that I can think of is the SIR model to describe epidemics and I think this will make a good exposition. We want to describe how a disease spreads through a population. To do so, we’ll divide the population into susceptible individuals (S) who have not been exposed to the disease, infectious (I) who are currently sick, and recovered (R) who have recovered and are now immune***. Any individual has a progression through these stages S -> I -> R; we now need to describe how the progression comes about.

I -> R is the easiest of these to model — people get sick and stay sick for a while before recovering. Since each individual is different, we can expect the length of time that an individual stays sick to be random. For convenience, an exponential distribution is often used (say with parameter m), although the realism of this is debatable.

S -> I is more tricky. In order to become sick you must get infected, presumably by contact with someone in the I group. This means that we must describe both how often you come in contact with an I, and the chances of becoming infected if you do. The simplest models envision that the more I’s there are around, the sooner an S will bump into one and become infected. If we model this waiting time by an exponential distribution (for each S) we give it parameter bI so the more I there are, the sooner you get infected.

If you turn this individual-level model into aggregate numbers (assuming exponential distributions again because of their memoryless property), you get I -> R at rate mI (since we’re talking about the whole I population) and S -> I at rate bSI. You can simulate the model for individuals, or in terms of aggregate quantities, or if the population is large enough (and you re-scale so we don’t have individuals, but a proportion) we can approximate it all by an ODE:

DS = – bSI
DI = bSI – mI
DR = mI

where DS means the time-derivative of S. Doing this turns the model into a deterministic system which can be a reasonable approximation, especially for mathematical analysis, although in real data the noise from individual variability is often evident.
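
A quick simulation makes the behaviour easy to see; this is just a crude fixed-step (Euler) discretization of the ODE above with made-up parameter values:

```python
# Made-up parameters for illustration: b is the infection rate, m the
# recovery rate; S, I, R are proportions of the population.
b, m = 0.5, 0.2
S, I, R = 0.99, 0.01, 0.0

dt, T = 0.1, 100.0
history = []
for step in range(int(T / dt)):
    dS = -b * S * I
    dI = b * S * I - m * I
    dR = m * I
    S, I, R = S + dt * dS, I + dt * dI, R + dt * dR
    history.append((S, I, R))

peak_I = max(h[1] for h in history)
print(f"epidemic peaks at about {100 * peak_I:.0f}% infectious; "
      f"final size about {100 * history[-1][2]:.0f}% recovered")
```

The familiar rise-and-fall epidemic curve comes out of nothing more than the two rates above, which is precisely the appeal of telling the story mechanistically.
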

There are obviously many ways to make this model more complicated — stages of disease progression, sub-populations that mix with each other more than others, geographic spread, visitors, immunization, loss of immunity and a whole bunch of others. The epidemiological literature is littered with these types of elaboration.

The point of this model is that it tells a coherent story about what is happening in the system and how it is happening, hence the moniker “mechanistic”. This is in contrast to most statistical and ML models that seek to describe static relationships without concern as to how they came about — even time-series models are usually explanation-free. I have also avoided the term “causal” — although it would be quite appropriate here — in order to not confuse it with the statistical notions of causal modeling as studied by  Judea Pearl, which are similarly static.

Having gone through all this, there are some observations that I now want to make:

1. I think we can distinguish mechanistic versus interpretable here. My father would be inclined to view this type of model as the only type worth interpreting — he sniffily dismissed the models I examined earlier as all being “correlational”, and would presumably say the same thing of causal models in Pearl’s sense.

I’m not sure he’s wrong in that (see below), but it’s not quite the problem that I want to examine in this blog and I think I can make some distinctions here: while the structure of the SIR model above is clearly motivated by mechanisms, a substantial part of it is dictated by mathematical convenience rather than realism. The exponential distribution, and the assumption that an S is as likely to run into one I as any other, are cases in point. Moreover, there is no particular reason why the description of some of these mechanisms should have algebraically elegant forms. Newton’s law of gravity, for example, would still be a mechanistic description if the force decayed according to some algebraically-complicated function of the distance between objects rather than the inverse square (even if this would be less mathematically elegant).

Indeed, one might imagine employing ML to obtain some term in a mechanistic model if the relationship was complex and there were data that allowed ML to be used. For example, the bSI term in the SIR model is an over-simplification and is often made more complex — it’s not clear that using some black-box model here would really remove much by way of interpretation. My central concern — esoteric though it may be — is with regard to the algebraic (or, more generally, cognitive) simplicity of the mathematical functions that we use.

2. Mechanistic models do, however, provide some more-compelling responses to the ML philosophy. A mechanistic understanding of a system is more suggestive of which additional measurements are going to allow for better prediction and therefore what we might want to target. In work I do with colleagues in ecology, we believe that some dynamics are driven by sub-species structure, and this suggests we will be able to parse things out better after genotyping individuals. Similarly, it allows us to conceive of interventions in the system that we might hope will either test our hypotheses, or pin down certain system parameters more accurately.

An ML philosopher might retort that we can, of course, predict the future with a black box model — just give us some data; that mechanistic interpretation is mostly developed post-hoc, and humans have many times been shown to be very good at making up stories to explain whatever data they see (more on that in another post); and that active learning looks at what new observations would be most helpful, so you could pose this problem in that context, too. Of course, this does rather rely on the circular argument “interpretation is bullshit, therefore interpretation is bullshit”.

3. As a statistician who has spent a considerable amount of time working on these types of models, I am distressed at how foreign this type of modeling is to most of my colleagues. Almost all models taught (and used) in statistics are some variant on linear regression, and basically none attempt to explain how the relationships we see in data come about — even the various time series models (ARIMA, GARCH etc) take a punt on this. The foreignness of these modeling frameworks to statisticians is, I suspect, because they make up no part of the statistical curriculum (when faced with a particularly idiotic referee report I’m somewhat inclined to say it’s that statisticians just aren’t that good at math, myself included) and I think this is the case for three reasons:

a) On the positive side, statisticians have had a healthy skepticism of made-up models (and ODEs really do tend to not fit data well at all). Much of the original statistical modeling categorized the world in terms of levels of an experiment so that exact relationships did not have to be pinned down: your model described plant growth at 0.5kg of fertilizer and 1kg of fertilizer separately and didn’t worry about what happened at 0.75kg. I’m fairly sure many statisticians would be as skeptical about SIR models as an ML proponent, particularly given all the details they leave out.
b) More neutrally, in many disciplines such mechanistic explanations simply aren’t available, or are too remote from the observations to be useful. To return to the agricultural example above, we know something about plant biochemistry, but there is a long chain of connections between fertilizing soil, wash-out with water, nutrient uptake and physical growth. When the desire is largely to assess evidence for the effectiveness of the fertilizer, something more straightforward is probably useful.
Of course, statisticians have chosen these fields, and have not attempted to generate mechanistic models of the processes involved. I sometimes feel that this is due to an inclination to work with colleagues who are less mathematically sophisticated than the statistician and hence cannot question their expertise. I sometimes also think it’s due to a lack of interest in the science, or at least the very generalist approach that statisticians take which means that they don’t know enough of any science to attempt a mechanistic explanation. Both of these may be unfair — see uncharitable parenthetical comments above.
c) Most damningly, it isn’t particularly easy to conduct the sort of theoretical analysis that statisticians like to engage in for these models. And it makes this type of work difficult to publish in journals that have a theory fetish. There are plenty of screeds out there condemning this aspect of statistics and I won’t add another here: it’s not as bad as it used to be (in fact, it never was) and theory can be of enormous practical import. However, convenient statistical theory does tend to drive methodology more than it ought, and it does drive the models that statisticians like to examine.

Of course, everyone thinks that all other researchers should do only what they do. *** Case in point was Session 80 at ENAR 2014, which confirmed my cynical view that “Big Data” did indeed have a precise definition: it’s whatever the statistician in question found interesting. I’m not an exception to this, but then blogs are a vehicle for espousing opinions that couldn’t get published in more thoughtful journals, so….

In any case, mechanistic modeling might be an answer to ML (see Nate Silver for practical corroboration) and I might explore that in more detail. Mechanistic models are distinct from interpretable models, and although they generally employ interpretable mathematical forms, they need not do so. Up next: what can we understand besides simple algebra?
*** Anyone who has examined data from outside the physical sciences should find the idea that an ODE generated it to be laughable, although the ODE can be a useful first approximation.

*** Alternatively R can mean “removed” or dead.

*** This is foolish: who wants all that competition?

 

On Approximate Interpretation

Another seminar that I went to in the McGill Psychology department in 2005 was given by Iris van Rooij, a young researcher in cognitive science with a background in computer science. Her talk focused on the issue of computational complexity within cognitive science, and her thesis went something like this:

When psychologists describe humans as performing some task, they need to bear in mind that humans must have the cognitive resources to do so.

This is not particularly controversial. My earlier posts argued that humans DON’T have the cognitive resources to compute or understand the implications of the average of 800 large decision trees.

However, the example she gave was quite different. Her example was the categorization problem. That is, one of the things cognitive scientists think we do is to automatically categorize the world — to decide that some things are chairs and others plants and yet others pacifiers-with-mustaches-drawn-on. Moreover, we don’t just classify the world, we also work out what the classes (and possibly subclasses) are, and we do so at a very young age. There is, after all, no evolutionary reason that we should be born knowing what a chair is, or a pacifier-with-mustaches, either.

van Rooij’s problem with this was that the classification problem is NP-hard. This takes a bit of unpacking. Imagine the problem that we have a set of objects, and have some measure that quantifies how similar each pair of objects is. We now want to sort them into a set of classes where the elements of any class are closer to each other than they are to elements of any other class. It turns out that this problem, if you want to get it exactly right, takes computational effort that grows very quickly as the number of objects you are dealing with increases. For even a few hundred objects the amount of time required to produce a categorization on the sort of laptop that I run will end up measured in years, and humans are certainly not much faster at most computation.** Thus, said van Rooij, we cannot reasonably say that humans are solving the categorization problem.

Now the natural response is that “Well obviously we’re not carrying out this form of mathematical idealization.” In fact, when computers are required to do something like this they use a set of heuristic approaches that don’t exactly solve the problem, but hopefully come somewhere close. van Rooij’s reply would be (actually was) “Then you should describe what humans are actually doing.” Now this is fair enough as it goes, but I still thought “Surely the description that this is the sort of problem we’re trying to solve still has value.”
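
For what it’s worth, here is the sort of heuristic a computer would actually use — a bare-bones k-means pass on made-up “objects”; it is an example of the genre, not a claim about what either humans or van Rooij have in mind:

```python
import numpy as np

rng = np.random.default_rng(3)

# Some made-up "objects" described by two features, in three loose groups.
objects = np.vstack([rng.normal(0, 1, (50, 2)),
                     rng.normal(5, 1, (50, 2)),
                     rng.normal((0, 5), 1, (50, 2))])

# k-means: usually finds a good categorization quickly, with no guarantee
# that it is the exactly optimal one -- which is the whole point.
k = 3
centres = objects[rng.choice(len(objects), k, replace=False)]
for _ in range(20):
    d = np.linalg.norm(objects[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new_centres = []
    for j in range(k):
        members = objects[labels == j]
        new_centres.append(members.mean(axis=0) if len(members) else centres[j])
    centres = np.array(new_centres)

print("objects per category:", np.bincount(labels, minlength=k))
```
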

This is a specific case of saying “the world behaves approximately like this”, or even “my model behaves approximately like this”. From a scientific perspective, the initial proposition “Humans carry out categorization” opens the way to exploring how we do so, or try to do so. So dismissing this approximate description because it isn’t computationally feasible that we exactly solve a mathematical idealization just prevents psychologists from using a good launching pad. With any such claim, they will almost certainly discover that the statements are naive and humans more error-prone than the claim implies.

But it also opens the question of what description of what humans do would suffice? We could certainly go down to voltages traveling between neurons in the brain, but this is unlikely to be particularly helpful for us “understanding” what is going on (even if that level of detail were experimentally, or computationally, feasible). After all, most of the experiments involving this task involve visual stimuli, at least, so various visual processing systems are involved, as well as memory, spatial processing (since we mostly think of grouping objects into piles) and who knows what else. It’s also not clear how specific all of this will be to the individual human. However, it is likely that any other description is only going to be approximate, even if it is now computationally feasible in a technical sense.

I think the higher level description of “they’re sorting the world into categories” is valuable, even knowing that it’s not exactly right, because it allows scientists to conceptualize the questions they’re asking, or to employ this task for other experiments. Of course, this is a very “science by interpretation” framework; a devotee of the ML viewpoint would presumably say that you should just predict what they will do and plug that into whatever you need it for.

By the same token, an approximate description of what the 800 bagged decision trees are doing is often enough to provide humans with some notion of what they need to think about, at least until we have computers to also plan our experiments for us. Of course any approximation has to come with caveats about where it works and when the narrative it gives you leads you in the wrong direction. It’s perfectly reasonable to say “humans categorize the world” if you are interested in using this task as some form of distraction for subjects while studying something else. It may be too simplistic if the categories they come up with are part of what you are going to look at. Cognitive scientists are forced to start from the broad over-simplified statement and work out experimentally how it needs to be made more complicated. When looking at an ML model, we can see all of it directly and I’ll spend a post (in a little bit) on how that gets done.

Next, however, are models for dynamics and the distinction between being mechanistic and being interpretable.

**  Quantum computers don’t face the same hurdles, but while there are quantum effects in biology, I don’t think we can claim it in the brain.

On Meta-Interpretation

In 2005 while I was technically employed by the Psychology department of McGill University, I went to see a talk by Stevan Harnad, a professor of Cognitive Science at Universite de Quebec a Montreal. This was a rather philosophical discussion of how Cognitive Science ought to proceed and contained, to my understanding, one of the most misguided notions that I have come across. It was based around the proposition that

“Whatever can do whatever we do indistinguishably from us is a cogniser. And the explanation of how it does it is the explanation of how we do it.”

There are, I think, two fundamental misconceptions here (besides the jargon). The first is simply utility: I was inclined to say to this proposition “Give me a willing female and fifteen years and I will produce a machine that does everything humans do indistinguishably from us, and yet I defy you to explain how a teenager can fail to notice the washing left by the stairs, or why, for that matter.” **  This is a logical and correct application of Harnad’s statement — I’m fairly sure that how my niece does things is pretty much the same as how I do things — and yet the cogniser in question is of no greater help to cognitive science than any other human. ** So the fact that you can create a cogniser does not mean you can explain it.

But this notion is wrong at a more fundamental level because it simply fails to acknowledge that the same thing can be done in more than one way. My recollection was that Harnad wasn’t suggesting we need to create some form of cyborg that would mimic humans in all aspects of life, but that his statement applied if we isolated some particular cognitive task and could accurately reproduce human performance at that task. He was particularly enthusiastic about the use of neural networks to carry this out.

This runs straight up against the notion of universal approximators. We saw in Post 2 that there are currently several methods to produce “machines” that accept inputs and produce outputs (using which we can “do” many things) and which are capable of mimicking any reasonable model up to arbitrary accuracy, given enough data. Neural networks are certainly one of these, and they have the advantage that they appear — in a very abstract way — to mimic the biology of the brain. But we might say the same thing about nearest neighbour methods basing outputs on a library of previously seen examples; another plausible explanation, following some introspection. We might also mimic the process as well as you like with a big decision tree. This last idea feels less like the way I observe myself thinking, but certainly would have fit in with early notions of how to build artificial intelligence.

The point is that each of these universal approximators has a very different explanation of “how” they do what they do and yet they can all be made to look like they do the same thing as each other, as closely as you like. More generally, this gets at the notion of meta-interpretation. We can readily understand the mechanics of how a decision tree works, or nearest neighbours, or a neural network in general principles. This does not tell us what it is doing because that depends on the specifics of the parameters, structures and data involved. A general statement of “it does this sort of thing” is not sufficient to “understand” any particular instance.

To be fair to Harnad, there are a number of alternative ways you could think about his statement. You might declare it unfair to be allowed as much data as you want — “what we do” may mean “from scratch” rather than just at the point where we measure performance (although how you define the starting point, given that certainly some of what humans do is simply innate, is somewhat problematic). In this case, you also have to invoke how you get from data to model, which machine learning probably does not do as efficiently as the human brain, at least for doing human sorts of things. But surely I can learn a complicated meta model of how one does that, and then I would still have many different paths to the same output.

Harnad might also have had a much more mechanistic approach in mind (I will get to explaining what that adjective means, someday): “If we can produce an understandable machine that does what humans do, then the explanation for how the machine does it is the explanation for how we do it.” This neatly steps around the teenager in the room and is something of an appeal to Ockham’s razor, relying on interpretability to exclude alternative, more complex, explanations. But that does actually need to be explicit: I might add something extraneous (in Biology, the human appendix is a great example) to the process and still have an understandable, if redundant, model. Simplicity is great for fitting the world into our heads, but to make a claim that the world really does try to minimize some notion of complexity takes more of a stretch.

Most of this blog is fairly ambivalent about the issues it presents, but for this particular post I am prepared to be definitive. The structure of a function, without its specifics, is not an explanation of the phenomenon that it mimics. This does not mean that universal approximators cannot (at least sometimes) produce interpretable prediction models, but the models must be interpreted in their specifics.

Next: on simplicity, approximate models and another commentary left over from McGill.

**  And yes, the initial impetus for this blog was to provide just this retort, some 8 years too late.

**  It is now pre-ordained that she will go into psychology just so the world can prove me wrong.

ML Philosophy and Does Interpretation Matter?

In 2003 I was a graduate student at Stanford when Andreas Weigend, then chief scientist at amazon.com, gave a talk there. Amazon have always been aware of the potential of their data and been innovative and aggressive in using it, long before “big data” became a buzzword, or the term was even coined (in fact, I’d estimate that it takes at least a decade for the wider community to catch on to a new trend in technology as a general principle, maybe two). In the talk, he observed that if you followed users over time, it was pretty easy to divide them up into groups (in Statistics/ML this is called cluster analysis) in terms of the type of thing that they tend to do — I, for example, very rarely make impulse buys but I do spend a lot of time dithering over products. However, individual users still did very different things each time they visited amazon.com and if you could work out what they wanted to do, you could then tailor a strategy around that. For example, if you could work out that they wanted to buy something specific, you could offer an additional discount on that to convince them to buy it at Amazon. On the other hand, if they were just wasting time, you might want to advertise that they can also buy kitchen gadgets so that in six months’ time, when they decide they need a new rice cooker, for example, they also check out what’s available on Amazon (that worked for me, although living in Ithaca meant that I was pretty short of options, anyway).

Being the bushy-tailed time-endowed graduate student that I was back then, I thought I saw a way of using some methods from educational testing theory (my colleague, Matthew Finkelman, had been talking to me about this; happenstance is a great help to the scientific method) to tackle this problem. There was limited public data available on user behavior on websites, but there was one data set from MSNBC and I used this and showed that you could divide sessions into about six different “types”, and then as user clicks came through you could progressively improve how confident you were of which “type” of session the user was conducting. I even developed some tools to look at how you could tailor what you showed the user to maximize your information about what sort of thing they were doing (sessions on both Amazon and MSNBC had an average of six clicks, so you really wanted to maximize what you could learn about them).

I don’t think Andreas Weigend ever actually looked at what I had done, but I did get a paper out of it which counts as some currency. It’s largely been left ignored since; a fate that generally befalls ideas that are not pushed hard in any business. This is not the point; some years later, John Langford (then at Yahoo) came to talk at Cornell on the subject of analyzing browsing behavior and I described this exercise. His reaction was “Why go to all that modeling trouble? All you need to do is to predict what they will click on next.” Now this was the topic of his seminar — it’s a non-standard problem and he did a very nice job — so it’s not particularly surprising he was thinking that way. But it also indicated a philosophy of how one goes about this sort of problem that I think demonstrates the distinction between ML viewpoints and the mathematical modelers that had come before them; “don’t worry about imposing structure on the data that makes sense, just work out what the prediction problem is.”

I’d like, therefore, to give a brief caricature of what I might call the ML philosophy (let me make it plain that this is a caricature, although as we will see, this is not just a straw man to be knocked down later):

“The essential test of science is that it makes good predictions and this is exactly what ML targets. Philosophies such as Popper’s falsifiability make no mention of interpretability, for obvious reasons. By employing interpretable models you restrict the class of relationships you can model unnecessarily and you will likely do better at prediction if you use a more flexible class of models, even if you can’t interpret it.”

It’s important to expand on a couple of things here. First, Popperian falsifiability isn’t classically expressed in terms of probabilistic models in which prediction will never be perfect (as it certainly isn’t for ML), but we can still understand this philosophy in terms of accumulating enough evidence to show that a theory is incorrect. The second important clarification here is that Popper explicitly classes theories that cannot be falsified (because whatever happens, there is an explanation) as being unscientific. Although ML employs methods that are “universal approximators” and can mimic any relationship, this does not rule them out of the realm of science. This is because once you have employed ML and produced a model, that model (I think) is what Popper would refer to as being the theory and it is falsifiable.

How might an interpretationalist respond to this statement? Or what might be the role of interpretation in the business of mathematical modeling? I can think of six possible reasons that interpretation might be put up as being important, along with rebuttals from the ML advocate:

1. Human nature. Humans like to understand things, at least in the sense of feeling like they have some form of control over the process at hand. I don’t know of formal psychological studies stating this, but I’d be very surprised to see it disputed. Even a proponent of the ML modeling philosophy above is unlikely to dispute this; it’s part of what has made particular methods that include diagnostics (we’ll get to these later) popular. However, the ML philosopher would retort that there are lots of parts of human nature that aren’t particularly helpful and the fact that you feel in control of a model does not make it the best thing to use. In short, all the work that I do in this field really is so much cultural affectation.

2. Extrapolation. A second argument goes that using ML methods works fine when you are making predictions near the data that you already have, but if you are faced with a very unusual set of inputs you need to know more about the processes that generate the data. It is, for example, unusual to find a 6’10” 93 year old; if we used a nearest-neighbours method, we might base our prediction on the weight of someone who was 6’3” and 78 years old, which really may not be a reliable guide. Here we might think that the linear regression, precisely because it is less flexible, is more likely to give a good prediction further away from the data.
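
A small sketch of the contrast, with made-up height/age/weight data (the numbers and the fitted models are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)

# Made-up training data: height (inches) and age (years) -> weight (lb).
n = 200
height = rng.normal(68, 3, n)
age = rng.uniform(20, 80, n)
weight = -200 + 5 * height + 0.3 * age + rng.normal(0, 10, n)
X, y = np.column_stack([height, age]), weight

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
lm = LinearRegression().fit(X, y)

# A point well outside the data: 6'10" (82 inches) and 93 years old.
x_new = np.array([[82.0, 93.0]])
print("nearest-neighbour prediction:", knn.predict(x_new)[0])
print("linear regression prediction:", lm.predict(x_new)[0])
# The kNN can do no better than echo its closest (much shorter, younger)
# examples; the linear fit extrapolates its fitted slope well past the data.
```

The point is not that either number is right, but that they disagree for reasons neither model can justify from the data alone.
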

The ML response to this is simply to ask “On what basis do you trust one extrapolation more than another?” The humans who designed the interpretable model saw only the same data as the ML algorithm. Do you have any reason to believe that they are better at extrapolating (let alone designing models that are better at extrapolating) than a computer? There could be lots of things that make weight at such extremes of height and age quite different and a human (with no empirical experience) doesn’t know this any more than the computer does.

3. Generalization to new problems. Another response is that the ML philosophy described above mis-understands the term “prediction”. It is not the ability to predict well on the same sort of data that developed it which is relevant. Rather, it is the ability to generalize well. Newton’s laws were developed by experiments on earth and astronomical observations of the solar system, but ought to apply to any solar system. This is something that Copernican or Ptolemaic models simply couldn’t do and which an ML model also can’t approach since it doesn’t have any ability to go beyond its domain. In the Amazon example, we are interested not just in profit from the user’s clicks this session, but in gaining profit in a few months by selling them a rice cooker.

This is a good argument, but the ML philosopher can still respond in three parts. Firstly, they say, we have moved from an input-output set of relationships (here is the position of Jupiter, Neptune and Earth, where is Mars?) to a much more abstract representation of dynamics, and that’s not entirely fair. Secondly, we can still play the ML game in a new solar system; we just need to observe it for a while. Thirdly, and most importantly, this is about changing the prediction task and the focus. ML could of course be used to uncover Newton’s laws if you present the inputs and outputs in the correct way: as forces acting on a body (inputs) and acceleration in response (outputs). Alternatively, without worrying about predicting velocity or acceleration, just predict the location a little time ahead and you’ll get something pretty similar. For Amazon, you also change the prediction task: how does what I present affect my profit from this user over the next year?

So much for the first three arguments, although the ML response to each is in some ways unsatisfying. The next three are things that I find a bit more compelling, although they are not without challenge.

4. Evidence for statements. This concerns statistical inference and legality in some sense; there are times when we do need to make causal statements, and need to back this up with evidence. Imagine, for example, using ML to predict lung cancer. You get a relationship that may use smoking as an input but with no means of determining how changing smoking affects the probability of getting lung cancer.

Now suppose that you went a bit further, asked what could be changed to most decrease lung cancer risk, and found out that inputs in which smoking=0 always had lower risk. This provides the basis for a reasonable recommendation (although it’s still not causal), but it doesn’t tell you that this hasn’t occurred in your model just by chance. For this you need statistical inference to assess how likely it is that data in which smoking makes no difference would have randomly been arranged so that it looks like it does. These problems are the bread and butter of statistical inference but so far do not exist for ML models.

This, first, is a technical hurdle and something that I work on — I think it’s likely that it will be partially solved soon. A more hard-core ML philosopher will challenge the basis on which you wish to assign causality. Knowing that changing smoking lowers the probability of lung cancer in any one particular example is helpful and should certainly be acted on in that example, but why is the general statement needed? Well, partly for legal reasons if you want to sue tobacco companies, but this goes back to the presupposition that interpreting a model of the world in causal terms is relevant and useful.

5. Reasons for outcomes. An associated objection to ML only is that sometimes you must give a reason for the output. For example, if you are a bank and deny someone a loan, you need to specify why, and if you use an interpretable model this is relatively easy; “Our black box said so” mostly doesn’t cut it.

Another way to manage this is with decision trees. This is because you can always look at the last decision you made before you gave the prediction and say “this is why” (see the explanation of the mechanics of decision trees in the previous post). Fair Isaac  — a credit rating company — actually employs these for this exact reason. Individual decision trees actually tend to be fairly bad at prediction (this is because of the way you get from data to a tree), so what Fair Isaac does is to use a more sophisticated ML algorithm to get a prediction function. They then generate lots and lots and lots of fake data with the outcome given by their prediction function and use this to produce a tree that mimics the function fairly well. In a recent paper I used the same idea to shorten medical questionnaires that screen for depression (because you only need to ask the questions as you get to them when going down the tree).
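
Here is a miniature version of that trick — fit a black box, label a pile of fake inputs with it, and grow a small tree that mimics it. The data, feature names and model choices are all made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(5)

# Made-up "loan" data: two inputs, a binary outcome.
n = 2000
X = rng.uniform(0, 1, size=(n, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, n) > 0.9).astype(int)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Generate lots of fake inputs, label them with the black box, and fit a
# small tree that mimics it -- the tree then supplies the "reasons".
X_fake = rng.uniform(0, 1, size=(50_000, 2))
y_fake = black_box.predict(X_fake)
mimic = DecisionTreeClassifier(max_depth=3).fit(X_fake, y_fake)

print(export_text(mimic, feature_names=["income", "debt_ratio"]))
agreement = (mimic.predict(X_fake) == y_fake).mean()
print(f"the tree reproduces the black box on {100 * agreement:.0f}% of fake cases")
```
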

Besides the obvious artificiality of the “reasons” given here, an ML advocate can, I think, say two things. First, that there are lots of other ways of giving “reasons”; second, she can deny the relevance of “reasons”, in the same way that she would deny the relevance of causality. This wouldn’t prevent listing changes in the inputs that would result in the desired output, but there are many possible such changes and needing a tree to list just one seems rather restrictive (although it satisfies a legal requirement).

6. Choice of inputs. A final argument, and one which I find much more compelling, is that an ML algorithm doesn’t exist in isolation. Someone had to choose the data to employ — what inputs should be available, which should be collected etc. In order to do this well, you need to know something about the process at hand — if I want to predict lung cancer I’d better have some idea that smoking might be related to it. (On a note from a collaboration, we might decide that the proportion of ethnically Hawaiian children is not particularly relevant to predicting songbird abundance at locations along the US east coast). Knowing that a factor is relevant is an important interpretational input, even if the specific way it acts isn’t. Moreover, human learning from modeling exercises then informs future modeling efforts. It’s pretty clear from first principles that we might suspect that smoking would affect lung cancer, but that’s because we already have a lot of interpretational models about human physiology.

An ML response here is that you should simply include all possible inputs that you can get your hands on, but I think that this doesn’t really answer how we even know what we should record, or what new things we should look to record. Of course, there are ways of trying to work out what inputs an ML function makes use of, even if you don’t really know how it uses them. I’ve been involved in some of this work myself, and this may be enough information to serve the purpose. Still, that’s a chink in the non-interpretation armor.

Looking over these, in many ways I find the ML response artificial and sometimes circular (interpretation is unimportant because interpretation is unimportant). In particular, we can deny the relevance of causal models which might be used in a legal case, but I don’t see particular arguments for this other than by stating that there are no particular arguments for causal models, either. They are the way that humans have envisaged the world working and affect a lot of the way we run our lives. However, I find it hard to demonstrate internal inconsistencies in the ML viewpoint.

Note, by the way, that I am in no way arguing that interpretability is always necessary. I don’t see much point in interpretable facial recognition software, or handwriting recognition, or determining if a credit card transaction is fraudulent, or predicting the final price in an auction. There are lots of times when you really don’t particularly care about much except predictive accuracy; the question is whether there are any cases where you do.

Next up: on meta-interpretation and why it isn’t the same.

On Models in Machine Learning

My intention, in writing this blog, is that it should not require you, dear reader, to already possess a sophisticated background in Machine Learning (ML) methods or in statistics. I will therefore spend some of this post outlining the sort of prediction functions that ML produces. This will also help to clarify the problem of interpretation, since most of what you get in ML is understandable in very simple contexts but this becomes impossible as the models are used in practice. I will beg the patience of any of you who know most of this already, but it will be somewhat useful in illustrating my point. I’m not going to discuss, except where necessary, how ML uses data to come up with its models, but rather what the models are like; for those who want to know more, the Wikipedia entries on all of these provide really very good overviews, as do two books out of my old alma mater.

We have already seen linear regression as the archetype of classical statistical models; in ML there are three types of models that get used regularly and I’ll try to provide a description of each of these: examples, trees, and networks.

Methods based on examples

I alluded to these methods in the previous post. What I have called example methods (under which I also include kernel methods) base prediction directly on the examples in our data set. The simplest of these methods is Nearest Neighbours, which does exactly what it says — if you have a new vector of inputs, it scans through the data, finds the example with inputs closest to your new inputs, and then predicts the outcome of that example.

This is illustrated in the figure below:

The red star is a new point and the line connects it to the nearest example in our data set of height and age. For the new point we predict the weight of the individual at the other end of the line. This seems like a fairly simple type of rule that is pretty easy to follow (even on a manual level), although it isn’t particularly easy to see how to glean the relative importance of height versus age in this case. However, it quickly gets complicated (a small sketch follows the list below):

  • It’s often better to use not just the closest example, but the average output of the closest 5, or 10, or 15 examples, or sometimes more.
  • In fact, it’s often better to just take the average of all the outcomes, but reduce the weight that each outcome gets when it is further from the new inputs so that only the nearby inputs have a lot of influence on the prediction. These are called “kernel methods” where the kernel describes how weight decreases further away from your new inputs.
  • We have not addressed how you trade off the different inputs when measuring closeness — is (45 years, 84kg) closer to (45 years, 90kg) than to (43 years, 84kg)? — and this gets harder as you have more and more input measurements. What if we add 20 demographic variables? What about 20,000 gene expression levels?
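
Here is the promised sketch: 1-nearest-neighbour, k-nearest-neighbour and a kernel-weighted prediction on made-up height/age/weight data. The scaling of the inputs and the kernel bandwidth are arbitrary choices, which is rather the point of the last bullet:

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up data: inputs are (height in inches, age in years), output is weight in lb.
n = 69
X = np.column_stack([rng.normal(68, 3, n), rng.uniform(20, 80, n)])
y = -200 + 5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 10, n)

x_new = np.array([66.0, 45.0])

# Scale each input before measuring distance -- one answer to the
# "how do you compare different units?" question.
scale = X.std(axis=0)
d = np.linalg.norm((X - x_new) / scale, axis=1)

# 1-nearest neighbour: predict the outcome of the closest example.
print("1-NN prediction:", y[d.argmin()])

# k nearest neighbours: average the closest k outcomes.
k = 5
print("5-NN prediction:", y[np.argsort(d)[:k]].mean())

# Kernel weighting: every example contributes, with weight decaying in distance.
w = np.exp(-0.5 * (d / 0.5) ** 2)   # Gaussian kernel, bandwidth 0.5 (assumed)
print("kernel prediction:", np.sum(w * y) / np.sum(w))
```
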

There are other methods that pick out particular examples from the data so that you don't need to look at the whole data set. For classification in particular (when the outcome is binary: alive or dead? will they default on their loan? which letter is that a picture of?), support vector machines do this, although somewhat less directly than the nearest neighbours approach suggests. However, even in these cases, the number of examples used can still be very large.
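To make the nearest-neighbour and kernel ideas concrete, here is a minimal sketch in Python (the tiny data set, the choice of k and the Gaussian kernel with its bandwidth are all illustrative assumptions of mine, not anything prescribed by the methods above):

```python
import numpy as np

# Toy data set: inputs are (age in years, height in inches), outcome is weight in lb.
X = np.array([[23, 66], [45, 70], [31, 64], [52, 69], [28, 72]], dtype=float)
y = np.array([135.0, 172.0, 128.0, 180.0, 165.0])

def knn_predict(x_new, k=3):
    """Average the outcomes of the k examples whose inputs are closest to x_new."""
    distances = np.linalg.norm(X - x_new, axis=1)   # Euclidean distance to every example
    nearest = np.argsort(distances)[:k]             # indices of the k closest examples
    return y[nearest].mean()

def kernel_predict(x_new, bandwidth=5.0):
    """Weighted average of all outcomes; weights shrink with distance (Gaussian kernel)."""
    distances = np.linalg.norm(X - x_new, axis=1)
    weights = np.exp(-(distances / bandwidth) ** 2)
    return np.sum(weights * y) / np.sum(weights)

x_new = np.array([40.0, 68.0])
print(knn_predict(x_new, k=1))   # plain nearest neighbour
print(knn_predict(x_new, k=3))   # average of the 3 closest examples
print(kernel_predict(x_new))     # kernel-weighted average of all examples
```

Note that the Euclidean distance here treats a year of age and an inch of height as equally important, which is precisely the comparability problem raised in the last bullet point above.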

Tree-Based Methods

Decision trees are used in more frameworks than statistical prediction, but have been a very popular tool since their development in the late 1970s and early 1980s. (Following some early developments, they first came to fruition in a book by Breiman, Friedman, Olshen and Stone in 1984; many of these ideas were independently re-invented by Ross Quinlan in 1990.) They represent a cascade of decisions as depicted below.

In this figure, we start at the "top" of the tree (why the tree is upside down isn't so clear, but "decision root systems" just didn't sound as good) and decide whether the new height input is less than 69.5 inches. If it is, we take the left branch of the tree and decide whether height is also less than 65.5in. Supposing it is not, we go right and ask whether the new age input is less than 24.5. Let's say it is; we then predict 139.2lb. For any example, we keep following the tree down this way until we arrive at the prediction given by the "leaf". If someone is more than 69.5in tall, we predict 180.3lb immediately.
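Since the cascade just described is simple enough to write out directly, here is a minimal sketch of that particular tree as nested conditions; the split points and the two quoted leaf values come from the description above, while the remaining leaf values are hypothetical placeholders:

```python
def predict_weight(height_in, age_years):
    """Follow the tree described above from the root down to a leaf prediction (in lb)."""
    if height_in < 69.5:                 # root split on height
        if height_in < 65.5:             # left branch: split on height again
            return 127.0                 # hypothetical leaf value; not given in the text
        else:
            if age_years < 24.5:         # right sub-branch: split on age
                return 139.2             # leaf value quoted above
            else:
                return 158.0             # hypothetical leaf value; not given in the text
    else:
        return 180.3                     # taller than 69.5in: predict immediately

print(predict_weight(67, 22))   # goes left at the root, right on height, left on age: 139.2
```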

This, also, is a pretty understandable procedure — a human can follow it quickly (the ability to make manual predictions was originally an asset, and some of these aspects are still useful, as we'll see when I come back to interpretability) and in many ways it's easy to work out what's important: which variables are split on most often, and which splits make the biggest change in prediction? However, this breaks down fairly quickly as the size of the tree grows, as the next figure shows (even with only Height and Age and 69 observations).

As with many of these models, this fairly simple idea has been made considerably more complicated because doing so leads to better predictions. In particular, it is now common to use hundreds or thousands of trees, each a bit different, and to average the predictions made from each of these trees. Methods called Bagging, Boosting and Random Forests all produce models like this. Even giving some form of meaning to the average of the output of two decision trees is difficult; at hundreds we are lost. Of course, with modern computing we can very easily store the structure of the trees and make predictions for hundreds of trees in fractions of a second, even though this is not feasible for the biological machinery in our brains.
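For what an ensemble then does with such trees (nothing more than averaging their individual predictions), a minimal sketch might look like the following; the three stand-in trees are invented for illustration, and how Bagging, Boosting or Random Forests actually construct their trees is a separate story:

```python
def average_trees(trees, height_in, age_years):
    """An ensemble prediction is just the mean of the individual tree predictions."""
    predictions = [tree(height_in, age_years) for tree in trees]
    return sum(predictions) / len(predictions)

# Stand-in trees: in practice there might be hundreds, each fit to a perturbed
# version of the data, and none of them individually meaningful on its own.
trees = [
    lambda h, a: 139.2 if h < 69.5 else 180.3,
    lambda h, a: 150.0 if a < 30.0 else 172.0,
    lambda h, a: 142.5 if h < 67.0 else 168.0,
]
print(average_trees(trees, 67, 22))  # averages the three stand-in predictions
```
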
Networks

To some extent, all machine learning methods originate with some heuristic: nearby examples and trees certainly fall into this camp. Neural networks were developed as a very loose analogy to the workings of the brain (any biologist should cover their eyes, and those of anyone around them, at this point). The idea is that in the brain, neurons pass information by creating a "spike", or action potential, which passes voltage to the neurons they are connected to. In order to spike, a neuron needs input from other neurons; when the voltage received from these is large enough, it spikes — stimulating other neurons — and resets its voltage.

This is mimicked, crudely mathematically, in the following diagram:

On the left are the inputs; each neuron takes these inputs and adds them up with different weights. It then decides whether this weighted sum is large enough and, if so, sends a signal to the neurons downstream. These do the same thing (just with the signals from each upstream neuron as their inputs) and so forth until the end. The "1"s are just a constant "base-rate" stimulus that each neuron always experiences.

In fact, this is an over-simplification: in our case the neurons don't simply spike or not. Instead, they add up their inputs (with weights) and then "squash" that sum so that it lies between 0 and 1. Below is one of the classical squashing functions employed:

On the X-axis is the sum of weighted inputs and on the Y-axis is what the neuron passes on to the neurons in the next level. This "squashed" value is then passed to the neurons downstream (to the right in this case), which all do their job until we get to the output.
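As a minimal sketch of the arithmetic just described (the layer sizes are arbitrary and the weights are random placeholders rather than anything fitted to data):

```python
import numpy as np

def sigmoid(v):
    """The classical squashing function: maps any weighted sum into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, layers):
    """Pass inputs through each layer: weighted sum plus a constant, then squash."""
    a = x
    for W, b in layers:          # b plays the role of the constant "1" inputs above
        a = sigmoid(W @ a + b)
    return a

# Two inputs -> three hidden neurons -> one output; weights are made up.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(3, 2)), rng.normal(size=3)),   # hidden layer
    (rng.normal(size=(1, 3)), rng.normal(size=1)),   # output neuron
]
print(forward(np.array([45.0, 68.0]), layers))       # a number between 0 and 1
```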

In this case, despite biological analogies, it might seem rather difficult to find any way of providing an interpretation for what this particular procedure is doing. However, when only one neuron is used to get from inputs to a binary (0 or 1, dead or alive etc) outcome, this is exactly logistic regression and the weights w_i are interpreted like the b1 and b2 in our linear regression example (they act on the log of the odds of a 1, but look here if you want a more complete explanation). However, it again gets increasingly difficult to provide a mental picture of the map from inputs to outputs as the number of neurons, or the number of levels, increases.
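To make that single-neuron equivalence concrete, here is a minimal sketch with made-up weights; increasing an input by one unit changes the log of the odds by exactly the corresponding weight, which is the usual logistic regression interpretation:

```python
import numpy as np

# One neuron from inputs to a 0/1 outcome is exactly logistic regression.
# Made-up weights; w1 and w2 act on the log of the odds of a "1".
w0, w1, w2 = -4.0, 0.03, 0.02

def prob_of_one(x1, x2):
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x1 + w2 * x2)))

p_a = prob_of_one(45.0, 68.0)
p_b = prob_of_one(46.0, 68.0)           # increase x1 by one unit
log_odds = lambda p: np.log(p / (1 - p))
print(log_odds(p_b) - log_odds(p_a))    # equals w1 = 0.03 (up to rounding)
```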

 

None of these methods is restricted to classification; we can predict numbers as well as categories with most of them (although support vector machines for regression have not been particularly successful), but the same interpretational difficulties apply in that case as well.

——————————————————-

So where has all this gotten us? Well, I think it illustrates at least four aspects of what we mean by interpretation and why it is difficult in ML. The first of these is complicatedness (complexity being a loaded term in many disciplines). There are simply too many things to keep track of — too many examples, too many trees with too many levels, too many weights — for humans to be able to represent all of it mentally. In fact, one means of defining a model as being "interpretable" is to say that "a human can manually get from inputs to outputs." This definition necessarily includes "has the patience to" along with "has the mental capacity to". And to some extent this is a good start; I could put up with a two-node neural network, or a couple of decision trees, or nearest neighbours based on 5 examples. But this description doesn't completely get at what seems to be important; imagine the linear regression model that we have taken as our example of interpretability, but instead of just Age and Height we also include the 20,000 gene expression levels we alluded to earlier**. I'm pretty confident that I have neither the mental capacity nor the patience to add up a weighted sum of 20,003 numbers, and while I do think that this many numbers compromises the interpretability of the model, there is still a lot you can learn from it:

  • The effect of changing any individual input (see the caveats about this in the previous post) is still given by its coefficient.
  • The size of the coefficients can be used as a measure of the relative importance of each input — this can be interpreted in multiple ways to account for the magnitude or variability of the inputs. Using this, we can search out which inputs have large effects, quantify how large those effects are, and measure how much we lose by ignoring the inputs that make less difference.

So in fact, even when the number of coefficients is large, there is a lot that can be obtained in terms of gaining a mental picture of how the output is shaped by the inputs.

This contrast leads into a second aspect of interpretation: the ability to "modularize" complicated models. That is, even high-dimensional linear regression retains some interpretability because we can break it up into smaller components, each of which is clearly interpretable in the "a human could/would do the calculation manually" sense. Moreover, unlike the average of the output of 200 trees, each of which might be simple on its own, a term in linear regression retains its interpretation when put back in the context of the whole model. If we took the time to "understand" (a loaded word that I'm trying to avoid) one of the 200 trees, what we learned from it would be swamped by the effects of all the other trees in the average. So complicated but interpretable models have parts that can be understood, and that understanding continues to be relevant when the parts are put back in the context of the whole model. Notably, none of the ML models discussed above has this property, at least once they become complex.
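To make the modularity point concrete, here is a minimal sketch (the coefficients and inputs are random placeholders): even with thousands of coefficients, a linear model's prediction decomposes exactly into one contribution per input, and each contribution means the same thing on its own as it does in the context of the whole sum.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20_002                      # e.g. age, height and 20,000 expression levels
coefficients = rng.normal(size=p)
intercept = -234.0
x = rng.normal(size=p)          # one (standardized) input vector

contributions = coefficients * x             # one interpretable piece per input
prediction = intercept + contributions.sum() # the whole model is just their sum

# The "important" inputs are simply the ones with the largest contributions,
# and dropping the rest changes the prediction by a quantifiable amount.
top = np.argsort(np.abs(contributions))[::-1][:10]
approx = intercept + contributions[top].sum()
print(prediction, approx, prediction - approx)
```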

A third aspect of all of this, less obvious from these descriptions, is that most ML models have the property of being what is called a "universal approximator". That is, whatever the optimal input-output relationship is (which we don't know, of course), given enough data and computing power any of the methods above can get as close to that optimal relationship as you like**. This is not true of linear regression. One might speculate that it is not true of any interpretable model class. However, I am not sure of this, and I will discuss symbolic regression — a class of models almost universally ignored by statisticians — at some point later.

Finally, I want to note that while what comes out of ML is an input-output function that is not interpretable, the type of function that comes out is interpretable. This is clearly true, since I have just spent this post describing how these functions work (and I hope I have not left everyone fogged up about it!). The process by which you obtain these input-output functions from data is also quite understandable (even if I have not gone through it here). However, any specific function isn't, at least not to the extent that you could readily describe how the relationship it encodes differs from that of some other function built on the same inputs. I will argue in a post or two that while this is somewhat useful (actually it's necessary; humans did have to come up with the methods that produce these functions, after all), it is a very different type of interpretation, and not as useful (if any interpretation is useful) as the sort of process we use for understanding linear regression.

Next up: a caricature of ML modeling philosophy and some justifications for worrying about interpretability.

**Statisticians: remember I don’t care how we get the coefficients.

**For the mathematicians: assuming this relationship behaves somewhat nicely.

Introduction

Welcome, dear reader, to Of Models and Meaning: Musings on Statistics and Learning.

In a world in which publication is essentially free (apart from the effort involved in content creation), I should not need to justify the self-indulgence of putting something like this online. Nonetheless, amid the cacophony of internet opinion, I feel some need to outline why this particular contribution might warrant, for some of you, a few minutes of attention. My first post is, hence, an explanation.

I have started this blog to explore a series of philosophical issues around statistical modeling that I have not found addressed elsewhere. These issues are centered around the question “What does it mean to interpret or understand a mathematical model?”, and why or whether doing so is important. They arise out of my particular intersection of statistics and machine learning and have, for me, been a fairly consistent source of larger questions about the technical work that I do, why I do it and what it means.

This therefore is not a blog in the traditional sense. I am not going to attempt to popularize existing ideas on the topic (especially since I’m not aware of many). The posts are also going to be a lot longer than those found in most blogs. This is more a way of making public material that I would not publish formally and seeing if anyone out there has a perspective on it.

To give a brief history: I started my career studying mathematics in my undergraduate degree at the Australian National University. At the time, like many mathematics students, I found statistics to be only a source of poorly-explained tedium (I may digress onto this topic later). This changed when I took a course on "data mining"**, since relabeled Machine Learning (henceforth ML), which I thought of as what statistics "ought" to be. By happy chance, I worked out that most of the people whose work I found exciting were actually in statistics departments (many of them still are at Stanford) and went through a wrenching change of identity in the nick of time to apply to statistics programs for my PhD.

I went to Stanford (mostly because I chanced on an excellent mathematics program and somehow impressed people) and worked with Jerry Friedman: one of the most original thinkers in machine learning and a hero even before I arrived. A lot of my work was on the interpretation of the models that ML produces, and it really represents the motivation behind this blog. You'll hear more about my work in later posts, although hopefully not gratuitously. I then went on to do a post-doc at McGill University with Jim Ramsay (another tower of creativity) on differential equation models. These models were very different, but also made for a useful contrast to the modeling perspectives in machine learning and in statistics. I somehow landed a job at Cornell, where I try to work in more fields than is good for me and on rare occasions manage to take a step back and think about what I'm doing.

So what am I doing? Let me get to the point. The explosion in computer power over the past 30 years has led to the wide-scale use of a new class of mathematical models and a new modeling framework: machine learning. Prior to the 1980s, almost all mathematical representations of the world were manually designed by humans, could be written down using a small number of algebraic terms, and were readily understood with the right training, even if, as in chaos theory, the behavior of those models was sometimes surprising**. The most common example of this type of model in statistics (and what will make a useful comparison to ML below) is regression:

Y = b0 + b1 X1 + b2 X2 + E

This model represents the relationship between an outcome, Y, and two inputs, X1 and X2. A classical statistical example that I use in introductory classes is to predict someone's weight, Y, from their age, X1, and height, X2. There is an additional error term E for each individual that is assumed to be random and in practice is also used to mop up whatever other factors might influence someone's weight but which we haven't measured (another post?). The model then says that for every additional year of age, you expect a person to be b1 lb heavier, given no further information besides their height (in a class data set, b1 was estimated at 1.23lb/year). Even this interpretation can be challenged: people's height changes as they age, so comparing someone at age 40 and at age 50 it is unreasonable to say that their weight changed by only 10 times b1, because X2 will likely have changed a little as well. However, this is not my principal concern, and I chose the language of inputs and outputs deliberately because I'm interested in the way the model describes how you get from both X1 and X2 to Y. If you like, b1 can be understood in terms of comparing two people who are the same height but 10 years apart in age.

The rest of this model is interpretable in the same manner: b2 is the effect of height (5.2lb/inch from data) and b0 (-234) is, somewhat less meaningfully, the value when both are zero. E is different for each individual, but its standard deviation gives a notion of how much of the natural variability in weight the model is able to account for. These models can be made more complex by using more variables, or quadratic or interaction terms and so forth. But, each of these terms has a simple, understandable structure and indeed we go to great effort — especially in introductory statistics courses — to get students to understand what these models do and say about the world. Most of statistics has, classically, been focussed on obtaining values for b0, b1 and b2 using data, but with the structure of the model fixed.
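As a concrete illustration, here is a minimal sketch of what that fitted model computes; the coefficients are the rounded class-data values quoted above, and the particular ages and heights are made up:

```python
# Minimal sketch of the fitted regression above (coefficients are the rounded
# class-data values quoted in the text; the inputs below are purely illustrative).
b0, b1, b2 = -234.0, 1.23, 5.2   # intercept, lb/year of age, lb/inch of height

def predict_weight(age_years, height_inches):
    """Predicted weight in pounds for a given age and height."""
    return b0 + b1 * age_years + b2 * height_inches

# Two people of the same height, ten years apart in age, differ in predicted
# weight by exactly 10 * b1 = 12.3 lb.
print(predict_weight(50, 70) - predict_weight(40, 70))  # approximately 12.3
```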

Machine learning takes a radically different approach that has only been possible since the explosion of both computing power and data that began in the early 1970s. The central tenet of ML is that the important problem is _prediction_. That is, given new inputs, X1 and X2 above, we want to predict the output Y as accurately as possible by any means that we can. The estimation of b0, b1, b2 is then purely incidental and only relevant if the linear regression model happens to give good predictions; we'd generally expect that linear regression is too simplistic to be the best model unless you have only a small amount of data. This has meant that ML has developed new — and very successful — methods of producing models that, for the first time, cannot be easily written down or understood. I will go into some of these in more detail in later posts, but a useful example rule to think of is "Choose the example in the data set that has X1 and X2 closest to the new inputs and give its value of Y as your prediction." We can understand how the prediction function works, but with a large data set we have little hope of guessing at what it might spit out or at understanding the relative importance of X1 and X2 on the outcome. In some sense (which I would like to make precise), we don't really grasp what the model is doing. There are many more sophisticated approaches to model building in ML but they all result in input-output relationships that are essentially black boxes — you put in X1 and X2, you get out Y and otherwise you don't try to uncover what is going on.

My own work was on trying to take such a black box and tease it apart again — "X-raying the Black Box" is a term that I wish I'd come up with — and I developed and improved some methods for doing so. I'm currently involved in some new work on doing formal statistical inference with certain classes of ML methods. All this has brought me to the problem of "Why?" The issue here is that the ML philosophy is rather appealing: the test of a scientific theory is that it predicts well, so what do you need anything else for? Why not do science by working out what you need to predict, generating lots of data about that problem and throwing your favourite ML method (or the one that works best) at it? Is all of what I do trying to understand a model just so much cultural affectation? And what is this understanding or interpretation, anyway? How do you define it? When does it become impossible, as it is in ML? Even if it is sometimes important, it certainly isn't always; what circumstances make it important?

I have some fumbling ideas about how to answer some of these questions; on others I'm more lost. This blog is largely going to be about me trying to say something coherent about some of this. I'll take a fairly gentle look at the sorts of models that ML looks at next and some of the ways that they can be interpreted. Then I'll ask about the boundaries of interpretation and put up some ideas about when it might be important. Along the way I'll throw in some perspectives on modeling philosophy, on the sort of models statisticians tend to work with, on what makes models mechanistic (I'll define the term when we get to it) and anything else that seems relevant to me.

One particular debate that I do not want to revisit is the perennial Bayesian/Frequentist argument. More than enough has been said on this for there to be an effective impasse — it's like arguing about the existence of gods. For the record, I am philosophically frequentist, although I admire and use Bayesian computational methods. But my concerns here are about the way you choose to represent relationships mathematically, not in how you estimate them and not, for the most part, in how you represent the accuracy of your estimate.

I want to state at the outset that I am not a philosopher, nor do I claim to know much beyond some undergraduate philosophy coursework. I certainly haven’t done any extensive literature search, although the people I know who might have leads have not come up with any. If you know of literature that I ought to read, references are surely welcome as are reactions to my posts.

And with that, dear reader, I have tried your patience enough. The next post will examine some examples of the models you get from ML and the problems they pose for what you mean by “interpretation”.

** a term that had poor connotations in the social sciences and has now been replaced; I haven't seen Wikipedia's distinction between these terms borne out in practice

** modern physics may be an early exception to the “readily understood” bit, but the models arise from very different considerations than the problems I will pose in a minute.