Guest post by Jim Manzi, founder and Chairman of Applied Predictive Technologies, and the author of Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics and Society.
Gabriel, your very
deep post that, in passing, requested my comment was fascinating. My family thanks you for the weekend I just
spent staring off into space.
You open with this:
Sampling error? Omitted variable
bias? Bah, that's for first-year grad students. What I find really interesting
is there are some fairly basic principles for how analysis can get really
screwy but which can't be fixed by adding more control variables, increasing
your sample size, or fiddling with assumptions about the distribution of the
dependent variable.
I spend an enormous amount of time in my book arguing that
this problem is pervasive and significant, and that exactly this triptych of
remedies will fail to enable us to build models that make useful, reliable and
non-obvious predictions for the effects of our interventions in human social
systems. In it, I take apart some
celebrated social science models for failing in this respect. But in the spirit of what's sauce for the
goose is sauce for the gander, I then take apart a model that I built to
estimate the effect of changing the name of a convenience store, to show how
all three together can't put Humpty Dumpty back together again.
Start at the most foundational level: What is
causality? I have an engineer's
perspective on this. What I care about
is my ability to predict the effect of my interventions better than I can
without the model.
Consider two questions:
1. Does A cause B?
2. If I take action A, will it cause outcome B?
I don't care about the first, or more precisely, I might
care about it, but only as scaffolding that might ultimately help me to answer
the second.
For example, in your shoes story, I don't care whether the
characteristic of discomfort causes shoes to be considered attractive. I care about whether, for example, if I take
an existing type of shoes and narrow the toes, this will cause them to get more
coverage in fashion magazines, sell more units or whatever.
In general, the best way to determine this is to take some comfortable
shoes, narrow the toes, and then see what happens to sales. That is, to run an experiment.
There are big problems with this approach. One obvious one is that it is often
impossible or impractical to run the experiment. But even if we assume that I have done
exactly this experiment, I still have the problem of measuring the causal
effect of the intervention. In a
complicated system, like shoe stores, I have to answer the question of how many
pairs I would have sold in the, say, three months after the change if I had not switched my design to
narrow toes. I can't just assume that I would have sold the same number of
wide-toed shoes that I did in the prior three months. For reasons well-known to you, and that I go through
at length in the book, the best way to measure this in a complicated system is
a randomized field trial (RFT) in which I randomly assign some stores to get
the new shoes and others to keep selling the old shoes. In essence, random assignment allows me to
roughly hold constant all of the "screwy" effects that you reference between the
test and control group.
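To make the contrast concrete, here is a minimal simulation sketch of the shoe-store example. Everything in it is an assumption for illustration: the function name `simulate_rft`, the number of stores, the "true" lift from narrow toes, and the seasonal shift are all invented, and real store sales are far messier. It only shows why a before/after comparison can mislead while a comparison of randomly assigned test and control stores does not.

```python
import random
from statistics import mean


def simulate_rft(n_stores=4000, true_lift=5.0, seasonal_shift=-8.0, seed=0):
    """Return (before_after_estimate, rft_estimate) for one simulated trial.

    All numbers are invented: sales are pairs sold per store per quarter.
    """
    rng = random.Random(seed)

    # Random assignment: shuffle the stores and give half of them the new shoes.
    store_ids = list(range(n_stores))
    rng.shuffle(store_ids)
    test_ids = set(store_ids[: n_stores // 2])

    test_before, test_after, control_after = [], [], []
    for store in range(n_stores):
        baseline = rng.gauss(100, 15)                 # store-specific sales level
        prior_sales = baseline + rng.gauss(0, 5)      # quarter before the change
        # Something unrelated to the shoes (season, economy) moves every store.
        post_sales = baseline + seasonal_shift + rng.gauss(0, 5)
        if store in test_ids:
            test_before.append(prior_sales)
            test_after.append(post_sales + true_lift)  # narrow toes add the lift
        else:
            control_after.append(post_sales)

    # Naive estimate: assumes the test stores would have repeated last quarter.
    before_after = mean(test_after) - mean(test_before)
    # RFT estimate: control stores stand in for what would have happened anyway.
    rft = mean(test_after) - mean(control_after)
    return before_after, rft


if __name__ == "__main__":
    before_after, rft = simulate_rft()
    print(f"before/after estimate:     {before_after:+.2f}")
    print(f"test vs. control estimate: {rft:+.2f}")
```

With these made-up settings the before/after number absorbs the seasonal shift and should land near -3, while the test versus control number should land near the +5 that was built into the simulation.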
But what many cheerleaders for randomized experiments gloss
over is that even if I have executed a competent experiment, it is not obvious
how I turn this result into a prediction rule for the future (the problem of
generalization or external validity). Here's
how I put this in an article
a couple of years ago:
In medicine, for example, what we
really know from a given clinical trial is that this particular list of
patients who received this exact treatment delivered in these specific clinics
on these dates by these doctors had these outcomes, as compared with a specific
control group. But when we want to use the trial's results to guide future
action, we must generalize them into a reliable predictive rule for
as-yet-unseen situations. Even if the experiment was correctly executed, how do
we know that our generalization is correct?
A physicist generally answers that
question by assuming that predictive rules like the law of gravity apply
everywhere, even in regions of the universe that have not been subject to
experiments, and that gravity will not suddenly stop operating one second from
now. No matter how many experiments we run, we can never escape the need for
such assumptions. Even in classical therapeutic experiments, the assumption of
uniform biological response is often a tolerable approximation that permits
researchers to assert, say, that the polio vaccine that worked for a test
population will also work for human beings beyond the test population.
But as we climb a ladder of phenomenological complexity from
physics to biology to sociology, this problem of generalization becomes more
severe. As I put it in Uncontrolled:
We can run a clinical trial in
Norfolk, Virginia, and conclude with tolerable reliability that "Vaccine X
prevents disease Y." We can't conclude that if literacy program X works in
Norfolk, then it will work everywhere. The real predictive rule is usually
closer to something like "Literacy program X is effective for children in
urban areas, and who have the following range of incomes and prior test scores,
when the following alternatives are not available in the school district, and
the teachers have the following qualifications, and overall economic conditions
in the district are within the following range." And by the way, even this
predictive rule stops working ten years from now, when different background
conditions obtain in the society.
We must have some model that generalizes. What we really need to do is to build a
distribution of results of "experiments + model" in predicting the results of
future experiments. An example of what I
mean applied to criminology is the following from the article I referenced
above:
One of the most widely publicized
of these [criminology RFTs] tried to determine the best way for police officers
to handle domestic violence. In 1981 and 1982, Lawrence Sherman, a respected
criminology professor at the University of Cambridge, randomly assigned one of
three responses to Minneapolis cops responding to misdemeanor domestic-violence
incidents: they were required to arrest the assailant, to provide advice to
both parties, or to send the assailant away for eight hours. The experiment
showed a statistically significant lower rate of repeat calls for domestic
violence for the mandatory-arrest group. The media and many politicians seized
upon what seemed like a triumph for scientific knowledge, and mandatory arrest
for domestic violence rapidly became a widespread practice in many large
jurisdictions in the United States.
But sophisticated experimentalists
understood that because of the issue's high causal density, there would be
hidden conditionals to the simple rule that "mandatory-arrest policies will
reduce domestic violence." The only way to unearth these conditionals was to
conduct replications of the original experiment under a variety of conditions.
Indeed, Sherman's own analysis of the Minnesota study called for such
replications. So researchers replicated the RFT six times in cities across the
country. In three of those studies, the test groups exposed to the
mandatory-arrest policy again experienced a lower rate of rearrest than the
control groups did. But in the other three, the test groups had a higher
rearrest rate.
Why? In 1992, Sherman surveyed the
replications and concluded that in stable communities with high rates of
employment, arrest shamed the perpetrators, who then became less likely to
reoffend; in less stable communities with low rates of employment, arrest
tended to anger the perpetrators, who would therefore be likely to become more
violent. The problem with this kind of conclusion, though, is that because it
is not itself the outcome of an experiment, it is subject to the same
uncertainty that Aristotle's observations were. How do we know if it is right?
By running an experiment to test it--that is, by conducting still more RFTs in
both kinds of communities and seeing if they bear it out. Only if they do can
we stop this seemingly endless cycle of tests begetting more tests. Even then,
the very high causal densities that characterize human society guarantee that
no matter how refined our predictive rules become, there will always be
conditionals lurking undiscovered. The relevant questions then become whether
the rules as they now exist can improve practices and whether further
refinements can be achieved at a cost less than the benefits that they would
create.
We can then compare the accuracy of this approach
to that of analogous distributions of predictions made by non-experimental methods (which
can vary from sophisticated regression models to newer machine learning
techniques to prediction markets to the judgments of experts, and so on) for
predicting the results of future experiments.
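A rough sketch of what such a comparison could look like in practice follows. The method names, the predicted lifts, and the observed lifts are all hypothetical placeholders, and mean absolute error is just one assumed scoring rule among many. The resulting ranking is an artifact of the invented numbers and is not evidence for any method.

```python
from statistics import mean


def mean_abs_error(predicted, observed):
    """Average absolute gap between predicted and observed experimental effects."""
    return mean(abs(p - o) for p, o in zip(predicted, observed))


def rank_methods(predictions_by_method, observed):
    """Return (method, error) pairs sorted from most to least accurate."""
    errors = {
        name: mean_abs_error(preds, observed)
        for name, preds in predictions_by_method.items()
    }
    return sorted(errors.items(), key=lambda item: item[1])


# Hypothetical results of five later experiments (e.g. percent sales lift).
observed_lifts = [2.0, -1.0, 0.5, 3.0, -0.5]

# Hypothetical predictions each method made before those experiments were run.
predictions = {
    "prior experiments + model": [1.5, -0.5, 1.0, 2.0, 0.0],
    "regression on observational data": [4.0, 2.0, 3.0, 1.0, 2.5],
    "expert judgment": [3.0, 1.0, 0.0, 5.0, 1.0],
}

ranking = rank_methods(predictions, observed_lifts)

if __name__ == "__main__":
    for name, error in ranking:
        print(f"{name}: mean absolute error {error:.2f}")
```

The point of the design is that every method, experimental or not, is scored the same way: against the results of experiments it had not yet seen.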
As I put this in the book:
The job of experimentation in
business is to put rounds on target. Abstract discussion of causality is a
means to the end of using prior experimental results to more accurately predict
the shareholder value impacts of various alternative potential courses of
action.
As I discuss in the book, there is no absolutely secure philosophical
resting place. That is, even if I have
such a distribution of results for the predictions made by various methods, I
can't ever be absolutely certain that this distribution won't suddenly
change. (I expend a lot of effort trying
to unify the problem of induction and the reference class problem to show that
this is always a risk, no matter what.)
But I think this is as close as you can get.
What this demands, of course, is a lot of experiments. This
is why lowering the cost per test is so critical. Not just as an efficiency measure, but
because in practice it enables me to get to much more reliable predictions of
the effects of my proposed interventions.
To come back to where we started, I think this is the
way to evaluate whether some model, tool, guru or whatever has "really"
discovered a causal relationship. A
statement about causality only has operational meaning as a predictor of future
results of rigorous tests of the causal theory for the outcome of an
intervention.