statistics – Greg Laden's Blog

Falsehood: Correlation Implies/Does Not Imply Causality

Greg Laden — Mon, 16 Oct 2017 15:08:23 +0000

As is the case with any good falsehood, one can never really be sure what the falsehood actually is. In this case, there are two falsehoods: 1) When we see a statistical correlation between two measurements or observations, we cannot assume that there is a causal link from one to the other. This is what the statement “Correlation does not imply causality,” or some similar version of that aphorism, generally means, and it is an admonishment we often hear; and 2) When we see a statistical correlation between two measurements or observations, there probably is a causal link in there somewhere, even when we hear the admonishment “Correlation does not imply causality” from someone, usually on the Internet. To put a finer point on this: What do you think people mean when they say “Correlation does not equal causality”? Or, perhaps more importantly, what do you think that statement invokes in other people’s minds?

When I hear it I usually think “Don’t be a dumbass.” I mean, really, nobody is thinking that a mere statistical correlation means that two sets of observations have a definitive causal link. Almost always a correlation is being referred to because there is reason to suspect a causal link between two things, and this link is, we suspect, illustrated by this correlation.

When we hear “Correlation doesn’t imply causation,” is the person saying that two series of, say, 200 pairs of numbers that closely describe a straight line or a nice, well-behaved curve on a graph are not, as they seem, linked by causation happening somewhere, and that it’s just random? Often, yes, that is what they are saying. Recently, a friend of mine mentioned a possible link between a number of physical things about herself and a described medical syndrome, and a friend of hers said “That’s correlation. Correlation doesn’t mean causation.” I thought that was an interesting example of the use of the phrase. My friend with the interesting symptoms was not comparing a series of measurements of two phenomena, but rather a series of attributes, a mixture of quantitative and qualitative attributes at that, and how well they matched a similar list thought to be linked to a certain condition. She was diagnosing, not measuring. She was carrying out a Peircean abductive inference, not a quantitative induction. Yet the phrase came up in a rather scolding manner, from a well-meaning yet somehow paternalistic observer. And it meant, as it often does, nothing helpful.

To explore this concept further, let’s examine what we think “causality” is at a basic level. Most of the time, when we use some variant of that term, we mean that one thing is causing another thing. Gravity causes the apple to fall to the ground instead of sideways when it detaches from the tree (although we are saying nothing about why it detached from the tree, so we are not giving a full causal explanation for the observation). Pacific El Nino cycles cause corresponding cycles of aridity or increased rainfall in other parts of the world. Heavy traffic causes my drive to be longer. And so on.

Sometimes, we have reason to believe that two things co-vary because of one or more external causes. Aridity in one region of the world is correlated with higher rainfall in another region of the world, and it turns out that both meteorological variations are caused by the effects of the Pacific El Nino. Quite often, especially in complex systems like those often dealt with in the social sciences, we can replicate correlations among various phenomena but may have multiple ideas about what the causal structure underlying the phenomenon at hand is. Repeated observations rule out random associations or meaninglessness in the data, but we are faced with multiple alternative models for where to put the causal arrows. In other words, we’re pretty sure there is a “causal link” somewhere, but we can’t see it, or can’t agree amongst ourselves on what it is.

For instance, there is an association between hunting success (by males) in some forager groups and what might be called “mating success,” measured as married vs. not married; age of first marriage; married for more years vs. fewer; one vs. more than one wife; or fewer vs. more children. (There have been a number of studies using a number of variables.) I’m pretty sure that there are two distinctly different causes for this “correlation”: 1) better hunters are preferred by some women; and 2) men who are married and, especially, have a couple of kids are compelled to be more successful at hunting. (The truth is that most forager men are excellent hunters; day-to-day variation in success is mostly random; therefore hunting “success” can be most reliably increased by hunting more, and possibly by simply focusing on the effort more keenly rather than screwing off.) Both causes are probably at work in most systems. The causal arrows are much more varied and fickle than the very arrows the men carry in their quivers.

This means that if you find a correlation between some measure of hunting success and some measure of mating success in a group of hunter-gatherers, the statement “correlation does not imply causation” is meaningless, though the statement “the specific model you present to explain your data is wrong in that you have causation backwards” may be correct! Or not!

Scientists (and others) often arrive at a point where they assume, pragmatically, that there is a causal link between two things even when the link can’t be explained in a coherent model. In fact, this happens quite often and is probably what directs a lot of research, as novel experiments or exploratory programs are designed to pin down such a model. When this happens, the presumption of causality has been derived from mere correlation. It has been said (go look it up in Wikipedia) that correlation does not prove causation, but it can be a hint. In practice, and logically, there is too large a gap between the statement “Correlation implies or proves causality” and “Correlation is a hint.” Correlation is as good as the data, its replicability, the relevant statistics, and yes, even p-values, if you know how to use them. If you take numerous honest stabs at a relationship between phenomena, measure things a few different ways to help rule out a bias in how that is being done, avoid doing stupid statistics (like accidentally correlating a variable to itself), replicate with the same results, and don’t throw out trials unless there is a valid reason to do so, and if the statistics are sufficiently robust or at least correctly chosen and the p-values are kick-ass, then your correlations are not hints. No. Your correlations imply causation. They may not imply a simple causal effect with one thing you’ve measured causing change in the other … see the discussion above for where the causal arrows may be pointing. There may be parts of your model missing or obscured, but correlation implies causation.

So, there are several aspects to this fallacy.

“Correlation equals causation” is a misstatement because there are reasons other than causation that correlations between data series can emerge.

“Correlation equals causation” can be wrong because it specifies a causal structure that happens to be wrong. More subtly, but also more importantly, a correlation between an “X” variable (on the horizontal axis) and a “Y” variable (on the vertical axis) usually implies, even though this is entirely arbitrary, that X causes Y (X being the independent variable, and Y being the dependent variable). Similarly, an equation “Y = mX + b” seems to be saying that X causes Y. Similarly, a statement like “when we increase altitude, temperature seems to decrease” implies that temperature varies as a function of … because of … as something caused by … altitude. But the fact that things can be ordered this way on a graph, put in this kind of equation, or described with this kind of language does not in and of itself mean that the causal arrow has been spotted and tamed.

“Correlation does not imply causation” is entirely wrong. If, that is, you think the word “imply” means “suggest.” Correlation does indeed “suggest” causation, though it may not suggest a particular directionality or structure of causation. So, if a person says:

“I have a correlation over here. This suggests some kinda causal thingie going on here.”

Then the response:

“No, dear, correlation does not imply causation”

is a dumb-ass thing to say. If, on the other hand, a person says:

“I’ve noticed this correlation between thing one and thing two. This strongly implies an underlying truth consisting of thing one causing thing two”

Then the response:

“OK, that’s interesting, but correlation does not mean causation.”

is a worthy missive.

Finally, to the reason I wrote this post to begin with. I think there is a correlation between someone saying “correlation does not imply causation” and that person having an agenda other than spreading the word on introductory-level statistics. Sometimes it is just an effort to get the person off the topic. In the example above, about the illness, the person was trying to get the affected individual not to link symptoms with some awful disease, as a matter of denial: hopefulness that the person didn’t really have the disease. In other cases it is more paternalistic. But then there are those instances that are more troubling and possibly more common: denialism. We see statements like “correlation does not imply causation” when decades of data from multiple sources, analyzed a variety of different ways, consistently and repeatedly link the release of fossil carbon into the atmosphere with warming, for example. In these cases not only is the statement being used incorrectly and even nefariously, it is being used in a more bizarro-land sort of way: correlation means that THERE IS NO CAUSATION. How do you get from a strong statistical argument for something to the idea that a strong statistical argument means the opposite of what it means? By having a statement like “correlation does not imply causation” reach an aphorism level of inanity. Under such linguistic conditions, statements like “I could not possibly care less than I do about this, meaning that I care not at all” transform into statements like “I could care less,” which means the opposite in words but the same in spirit. “Correlation does not bla bla bla” in the denialist context means “Statistics are wrong.” And that’s just wrong.

So, there, I said it but you may not have heard it: “Correlation does not mean causation” or some variant thereof is, sometimes, a dog whistle.

How to do Statistics Wrong

Greg Laden — Thu, 14 Sep 2017 20:56:08 +0000

Telling people that they are doing statistics wrong is a cottage industry that I usually want nothing to do with, for various reasons including the fact that the naysayers are often blindly repeating stuff they heard but do not understand. But, Alex Reinhart, in Statistics Done Wrong: The Woefully Complete Guide, does not do that, and this is a book that is worth reading for anyone who either generates or needs to interpret statistics.

Most of the 10 chapters that address specific technical problems with statistics, where they are misused or misinterpreted, are very helpful in guiding a reader in how to think about statistics, and certain fallacies or common errors may well apply to a particular person’s work on a regular basis. I’ve put the table of contents below so you can see how this may apply to you. This is a worthy addition to the bookshelf. Get this book and stop doing your stats wrong!

The author is a grad student and physical scientist at Carnegie Mellon.

Here’s the table of contents:

Chapter 1: An Introduction to Statistical Significance
Chapter 2: Statistical Power and Underpowered Statistics
Chapter 3: Pseudoreplication: Choose Your Data Wisely
Chapter 4: The p Value and the Base Rate Fallacy
Chapter 5: Bad Judges of Significance
Chapter 6: Double-Dipping in the Data
Chapter 7: Continuity Errors
Chapter 8: Model Abuse
Chapter 9: Researcher Freedom: Good Vibrations?
Chapter 10: Everybody Makes Mistakes
Chapter 11: Hiding the Data
Chapter 12: What Can Be Done?

The Manga Guide to Regression Analysis

Greg Laden — Wed, 13 Jul 2016 17:28:44 +0000

Manga is a Japanese-sounding term, though not one used so much in Japan, for a form of cartooning art that has its roots before World War II but that emerged in its common form during the postwar Occupation period. Used early on in political cartooning, Manga-style drawing is now used for a wide range of expression, and has a place in illustrating a wide range of products, read by Japanese citizens of all sorts and ages. Outside of Japan, Manga is the starting point for the wildly popular Anime style of expression, which of course brings us to…

Pokémon Go

But we are not here to talk about Pokémon Go. We are here to talk about Regression Analysis.

No Starch Press has been producing Manga Guides for some years now. They cover many areas of math, science, and technology. (I’ve provided a list below.)

The most recent Manga Guide is The Manga Guide to Regression Analysis by Shin Takahashi and Iroha Inoue.

This book presents the story of Miu, a young woman who is having some trouble understanding regression analysis. But she has a love interest to inspire her and a brilliant coworker to guide her, and with these motivations and tools she embarks on a learning journey to grasp such concepts as how to calculate the regression equation and check its accuracy, how to use correlation coefficients, test hypotheses, conduct analyses of variance (and analysis of variance is mathematically identical to a regression analysis), predict odds ratios, and do a few parametric statistics to boot.

This is the book that a graduate student who needs to know regression, but is not in a highly mathematical field and skipped college Statistics, will read, learn from, and later claim belongs to his younger brother. Or, that a science-oriented non-scientist who is tired of glossing over the statistical parts of the science she reads can use to get up to speed. Or, that a business person or political junkie who wants to use basic regression tools to spot trends or predict primary outcomes might find helpful.

I think that Manga is a medium that many people relate to and find comfortable, and for such individuals, all of the Manga guides to various math and science concepts are great. If you have a high school student in your life who is facing a stats course, this is a good gift. Even though the book focuses on regression, you should know that regression analysis incorporates, or in some way relates to, the vast majority of statistical techniques. When I’ve taught or tutored graduate-level stats (and I learned this from the famous Mark Pagel), I’ve always focused on regression because it is very intuitive, yet powerful, and touches on everything. In other words, if you are going to learn one advanced statistical technique, make it (multiple variable) regression.

Interestingly, The Manga Guide to Regression Analysis is a great introduction, but it is not confined to basic regression. The material in this book takes you through a number of different ways to do regression, and will bring you to the point where you should be able to understand and swap in any of the numerous alternative modeling approaches that are out there and available in various statistical packages.

An appendix provides a guide to using Excel to do regression analysis.

Other Manga Guides

The Manga Guide to Physiology

The Manga Guide to Physics

The Manga Guide to Electricity

The Manga Guide to Linear Algebra

The Manga Guide to Statistics

The Manga Guide to Biochemistry

The Manga Guide to Calculus

The Manga Guide to Databases

The Manga Guide to Relativity

The Manga Guide to the Universe

The Manga Guide to Molecular Biology

William M. Briggs has misunderstood a high-school level data graph

Greg Laden — Wed, 01 Feb 2012 13:11:19 +0000

And I suspect he’s done so willingly. Well, you know what they say about statistics and liars.

Here’s the story. The Wall Street Journal and the Daily Mail independently published highly misleading and blatantly idiotic pieces on climate change. We’ve covered this extensively already over the last few days. Phil Plait, of the Bad Astronomy Blog on Discovermagazine.com, was one of numerous scientists to respond to those flaming examples of horrific bottom-feeding journalism, with the post “While temperatures rise, denialists reach lower.” In that post, he presented a still image from an animated GIF that has been going around, originally from Skeptical Science. I’ve used the GIF myself just recently, but I’ll re-post it here for your convenience:

The graph represents a bunch of data, from various sources, indicating global temperature over time. The data points are linked together with a “line” to show that they represent data over time, though you could leave the line out because the X axis is labeled as time and the data are basically a scatter of points showing the relationship between two variables: time and temperature. But lines like this are traditionally used in climate science and are technically known as “squiggles.” So the line belongs there.

The purpose of the data scatter on this graph is to demonstrate two conceptually distinct approaches to data analysis: Seeking large scale trends by scientists vs. denying the existence of large scale trends by science deniers.

The dynamic graph begins by displaying in sequence chunks of data, one chunk at a time, from left to right, and with each chunk a downward sloping blue line. This represents the denialist perspective … find a series of points that looks like a downward trend, and put a line on it; A downward facing line, of course. Then the graph shows all the data at once with a line representing the trend shown by these data; There is a clear upward slope.

This graphic is a fun and informative way to demonstrate the difference between a series of attempts to lie about the data and an overall, simple, straightforward presentation of the data. The upward trend line in the final image is actually unnecessary, but it serves as a punch line. Personally, I would have preferred a different color scheme, with the lies in red and the truth in blue, but what the heck.

Phil presented this graph without much comment in his post, and he presented only the part of the image that shows the upward trend. The only context provided for the graph was the short paragraph before it, which read, in reference to the argument that the Earth is cooling and not warming:

The Skeptical Science website destroyed this argument in November 2011, in fact. The OpEd also ignores the fact that nine of the ten hottest years on record all occurred since the year 2000.

Note the link to Skeptical Science that I’ve included here in the block quote. That is a link to the blog post that this nice graph was initially lifted from by everybody else. In that post, Skeptical Science says:

Right now we’re in the midst of a period where most short-term effects are acting in the cooling direction, dampening global warming. Many climate “skeptics” are trying to capitalize on this dampening, trying to argue that this time global warming has stopped, even though it didn’t stop after the global warming “pauses” in 1973 to 1980, 1980 to 1988, 1988 to 1995, 1995 to 2001, or 1998 to 2005 (Figure 1).

And provides this caption for it:

Figure 1: BEST land-only surface temperature data (green) with linear trends applied to the timeframes 1973 to 1980, 1980 to 1988, 1988 to 1995, 1995 to 2001, 1998 to 2005, 2002 to 2010 (blue), and 1973 to 2010 (red). Hat-tip to Skeptical Science contributor Sphaerica for identifying all of these “cooling trends.”

So now you know what the graph is, where it is from, and what Phil said about it and how he used it.

And now, it’s time to play Whack a Mole, because that is what talking about Science Deniers is these days.

William Briggs claims on his blog at wmbriggs.com to be a qualified statistical expert, and he felt moved to criticize Phil’s use of the graph and what the graph shows, but he totally screws that up, clearly demonstrating that he needs to rethink his qualifications. Or his honesty.

First, Briggs obnoxiously tells us that he has already blogged about how to cheat or fool yourself with time series, and that Phil had not read this blog post or done his homework, as though anyone on this planet were required to pay, or interested in paying, much attention to him.

He then tells us that the data “…are not–they most certainly are not–global temperatures.”

Please see the caption above. These are BEST land-only surface temperature data.

He then tells us “Each dot instead is an estimate of global temperature: worse, most dots are also different kinds of estimates from each other. That is, the first dot was estimated using data X and method A, and the second dot was estimated using data Y and method B, and so forth. Well, maybe the first and second dot were the same, but older dots are different than the newer ones.”

The data are from here. Yes, they are from a variety of sources, and have been combined systematically to make a useful source of analysis for climate studies. Here’s what Berkeley Earth says about them:

The Berkeley Earth Surface Temperature Study has created a preliminary merged data set by combining 1.6 billion temperature reports from 15 preexisting data archives. Whenever possible, we have used raw data rather than previously homogenized or edited data. After eliminating duplicate records, the current archive contains 39,390 unique stations. This is more than five times the 7,280 stations found in the Global Historical Climatology Network Monthly data set (GHCN-M) that has served as the focus of many climate studies. The GHCN-M is limited by strict requirements for record length, completeness, and the need for nearly complete reference intervals used to define baselines. We have developed new algorithms that reduce the need to impose these requirements (see methodology), and as such we have intentionally created a more expansive data set.

We performed a series of tests to identify dubious data and merge identical data coming from multiple archives. In general, our process was to flag dubious data rather than simply eliminating it. Flagged values were generally excluded from further analysis, but their content is preserved for future consideration.

There’s more, but that should give you an idea of what we see here. Briggs is trying to make you think that this is a horrid data set that can’t be trusted with huge internal error, but in fact, this is one of the best data sets ever put together for anything.

Then Briggs says:

All you have to remember is these dots are estimates, results from statistical models. The dots are not raw data. That means the dots are uncertain. At the least, Plait should have shown us some “error bars” around those dots; some kind of measure of uncertainty.

This is wrong at three levels. First, the graph was not used by Phil or Skeptical Science in a way that requires error bars. Second, Briggs is inappropriately emphasizing the degree to which the data are estimates or derived or otherwise iffy. Third, one thing we know about these data is that at the global scale (remember, we are measuring global temperature here) the variation is relatively low. Most large-scale variation in temperatures is regional or inter-regional, not global. This is a case of Briggs making a huge mistake in his evaluation: he has lost control of the source and nature of variation. Variation must be understood on a situational basis. For example, if I said that I’ve measured the number of heads per person in a classroom of living humans, and I gave you the estimate of “1,” you would not say “Hang on a sec… variation in opinion polling data tends to be about 5% at two sigmas, so you really should say that the number of heads per person ranges from .95 to 1.05.” If you added error bars to the dots on that graph you would still have the same graph, and if you added error zones to the line itself the line would still be there. Briggs is trying to make you think there is huge error that simply is not there. (More about estimates below.)
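One reason an average over many stations carries a small uncertainty is plain arithmetic: independent errors partly cancel. In the sketch below, each hypothetical station reads a true value plus independent noise with a standard deviation of 2 degrees, and the spread of the station average shrinks roughly as 2/√n. Independent station errors is a simplifying assumption (real networks have partly correlated errors, so real uncertainties are larger than this idealized case), and the noise level and trial counts are arbitrary; this is not how Berkeley Earth computes its uncertainties.

```python
import random
import statistics

NOISE_SD = 2.0  # assumed per-station error, in degrees

def spread_of_average(n_stations, trials=500, seed=0):
    """Standard deviation, over repeated trials, of the average of n noisy station readings."""
    rng = random.Random(seed)
    true_value = 14.0  # arbitrary "true" temperature
    averages = []
    for _ in range(trials):
        readings = [true_value + rng.gauss(0, NOISE_SD) for _ in range(n_stations)]
        averages.append(statistics.fmean(readings))
    return statistics.stdev(averages)

results = {n: spread_of_average(n) for n in (1, 10, 100, 1000)}
for n, spread in results.items():
    # Compare the simulated spread with the textbook standard error, sd / sqrt(n).
    print(f"{n:>5} stations: spread {spread:.3f}  (sd/sqrt(n) = {NOISE_SD / n ** 0.5:.3f})")
```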

He then blathers on further about confusing estimates and predictions, which is totally irrelevant to the graph as well as to Phil’s point. In any event, we do have confidence limits for these data, from Berkeley, shown in this graph:


Another point Briggs makes is that the starting point for a line determines the overall position of the line. That is not true. If you fix the Y-intercept of a line, then the rest of the line will be different than if you let the Y-intercept go where it ends up, but that is not what has happened here. He points out that the starting point for this graph is 1973, “a point which is lower than the years preceding this date.” That’s interesting because it indicates that he’s seen the original raw data, but all of his other comments suggest he does not really know where the data are from. In any event, note that the line is not fixed at any starting point on the left side of the graph. He’s just making this up. If we need to look to the left, to time periods earlier than 1973, to see whether 1972 and earlier were warm and the Skeptical Science graph is thus made up, there are plenty of sources of information and pretty pictures for examining the longer-term trends. Like this one:

Which is from here.

The earth is warming, folks. And Briggs is not making it any easier by pulling wool hats over our eyes.

I’d like to make another point about Briggs’s blathering on about “estimates.” Everything is an estimate. Say you want to know the temperature outside before you head off for work. You can look outside and make an estimate. You see snow, but around the edges it is wet. Someone is walking down the street with a heavy coat, but it is unzipped, yet they are wearing a hat. Someone is jogging and not wearing shorts, and their breath is just a little steamy. Someone else is wearing a medium-size coat and it is zipped up.

Obviously it is about 35 degrees F, minus 3 and plus 6. That is an estimate.

Or, you can look at the thermometer you nailed to your fence last year. It says 34 degrees. Hey, that’s close to our estimate! But the thermometer reading is still an estimate. The thermometer works on the physics of a bimetallic strip inside it. How carefully was it calibrated, and is it accurate at all ranges of temperature? The angle of viewing adds variation of about half a degree. In any event, the temperature indicated by the dial is not a temperature, but rather a response to the physics of two metals contracting and expanding differentially in a spring, and how this positions a pointer on a big dial of numbers. That’s not temperature; it is an indirect effect of temperature. And what temperature are you measuring? You want to know the ambient temperature of the air you will be walking around in, but the air by the thermometer is the air in your snow-covered yard, not the cleared-off sidewalk next to the road. And the thermometer is measuring the temperature of the fence it is attached to and of its own body, on which the sun is shining, along with the air.

And so on and so forth. Don’t let this “it’s an estimate” thing of Briggs fool you. All measurements are estimates, and one of the things scientists do is deal with that reality by understanding where the numbers come from, what the sources of variation are, and what the tolerances of the system are. Briggs does not seem to understand that.

There is a word on estimates and Briggs’s misunderstanding of the term (and his misunderstanding of what an “average” is) over at Open Mind as well.

And now we come to the final and most important point. Briggs appears to be critiquing the points Phil Plait is making in his post, but he is not. He does not really talk about the points Phil makes. He has created straw arguments and then attacked them. He is trying to distract from the central point that the Wall Street Journal and the Daily Mail totally botched their reporting and are essentially engaged in a disinformation campaign, in which Briggs is willfully engaged as well. Briggs is playing the Watch the Monkey strategy.

Pay no attention to the monkey. Pay attention to the Planet.

By the way, if you’ve gotten this far, you probably would like to know about Mike Mann’s brand-new book, The Hockey Stick and the Climate Wars. I’ve not read it yet, but after writing this blog post I feel like I’m in it!

How do we know how bad the Swine Flu is so far?

Greg Laden — Sat, 31 Oct 2009 16:02:53 +0000

I spent about 45 minutes yesterday in the local HMO clinic. They had turned the main waiting room into a Pandemic Novel A/H1N1 Swine (nee Mexican) Influenza quarantine area, and I could feel the flu viruses poking at my skin looking for a way in the whole time I was there.

Amanda, who is 8.3 months pregnant, started getting symptoms of the flu two days ago. As a high school teacher in a school being affected in a state being affected (as most are) she is at high risk for this. She was one of the first people around here to get the vaccine, just a couple of days ago, but it takes about 10 days to take full effect, so it was recommended that she go on Tamiflu for a while.

Tamiflu seems to not work very well against the current (or should I say expected) seasonal flu, but it appears that the Pandemic Swine Flu has virtually no resistance to it. And it normally works fast. Within 24 hours Amanda’s symptoms disappeared. There are three possible explanations for that:

Utter chance;
Tamiflu did its thing; or
The Tamiflu pill was actually a sugar pill with an especially strong Placebo effect.

Today, Amanda and many, many other teachers from across the country are meeting at the National Science Teachers Association. So any mixing up and spreading of the flu that the students have not yet accomplished will be compensated for by the teachers exchanging the virus today and over the weekend. But Amanda has her Tamiflu and the vaccine, so she should be fine. I may ask her to take some extra placebo tonight with dinner.

In the next iteration of a pandemic, we should be providing vaccine for free at conferences and conventions. (Maybe we’re doing that now…. anybody know?)

There are three things you should read on the internet this morning about the flu, vaccines, and related issues:

1) Swine flu: How bad was the first wave?

This is by Revere, and it covers a paper just published in expedited form, open access, so you can read it yourself. I’ll have a few comments to make about this paper below, but the best summary of its results is Revere’s post at Effect Measure.

Then there are these two items by Orac and James Hrynyshyn, respectively, on related issues: 2) The anti-vaccine movement strikes back using misogyny and 3) The link between the climate denial and anti-vaccine crowds

OK, now, about this flu paper. My comments are restricted to two aspects of the method used in this paper, and all I really want to do is add a little to your comfort level in relation to these methods. These are methods commonly used in my own fields of research (including archaeology) and that I’ve thought a bit about and taught in various classes, and I’ve found that people, once they start to learn about them, get all freaked out and refuse to believe that they are of any use. The methods are, to adopt terminology for this post that may not be reflected perfectly in the paper at hand, extrapolation and resampling.

Resampling first. Bootstrapping is also known, depending on its implementation, as Monte Carlo Simulation, Resampling, or just Simulation. There are other terms as well. It is probably best to consider them all under the heading “Resampling.”

To really understand the value of resampling, it is best to start with a concept of the inadequacy of normal parametric statistics. What the heck does that mean? At the risk of oversimplifying…. Let’s say you have calculated two averages and you want to find out if it is statistically OK for you to say that they are different … that they are averages of different populations, instead of two numbers that look different but only for random reasons. So you take the averages, the difference between them, and some kind of estimate of the variation in the population(s) you think you are sampling, and the number of samples you took to get the average.

If the two numbers are farther apart, you can have more confidence that they are different. If the amount of variation in the actual populations you are sampling is low, then you can have more confidence that they are different. If the number of samples you’ve taken is greater, you can have greater confidence that they are different.

Standard statistical methods evaluate this information (difference between means, variation in the population, and size of your sample(s)) to give you a couple of numbers, typically a test statistic and a p-value, that you can use to determine if it is statistically valid to say that the numbers are different.
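To make that concrete, here is a minimal sketch of one such standard method, Welch's two-sample t statistic, written with only the Python standard library. The two samples are made-up numbers and `welch_t` is a hypothetical helper name, not anything from the paper. You would still compare the resulting t against a t distribution with `df` degrees of freedom (from a table or a stats package) to get a p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and its approximate degrees of freedom.

    Combines the three ingredients described above: the difference between
    the means, the variation in each sample, and the sample sizes.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Sample variances (n - 1 denominator) estimate the population variation.
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Squared standard error of each mean.
    se2_a, se2_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up measurements, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.2, 5.7]
group_b = [4.6, 4.8, 5.0, 4.4, 4.9, 5.1, 4.7, 4.5]
t, df = welch_t(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Pushing the means farther apart, shrinking the spread, or adding more samples all make t larger, which is the same intuition as the paragraph above.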

But there is a problem with this. In order for out of the box statistical methods to be used to do this, a number of assumptions have to be made about the underlying distributions of the population(s) you are looking at. For instance, it is common to assume that these populations are "normally distributed" (like a bell curve) or that they follow some other standard, well studied distribution. So, you plug in the numbers you have (the means or the difference between them, the info on variance, and the sample size), and those numbers are run through magic statistical formulas built into computer programs, which compare them to a pre-existing model based on distributions and statistics worked out in earlier studies of those distributions.

Often that works well because the previously studied distributions, and the relationships between the numbers and the distributions and stuff, tend to be the same time after time. If you are studying the behavior of a roulette wheel, the frequency over time of raindrops falling into a bucket, people getting the flu, Prussian soldiers getting killed by their horses, the distribution of stars in the sky, and so on, you may be able to use research on the distributions and statistical measures (and their interactions) carefully carried out on one or two of these phenomena to develop shortcuts to apply in the other situations.
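The horse-kick case is the classic example of one such shortcut, the Poisson distribution: knowing only an average rate, it predicts how often you should see 0, 1, 2, ... events per unit. A minimal sketch follows; the rate and the number of units are made-up values, and `poisson_pmf` is a hypothetical helper name.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when events occur at an average `rate` per unit."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Hypothetical average: 0.7 events per unit (per corps per year, per bucket per minute, ...).
rate = 0.7
units = 200  # hypothetical number of units observed
for k in range(5):
    expected = units * poisson_pmf(k, rate)
    print(f"{k} events: expect about {expected:.1f} of {units} units")
```

The shortcut only works to the extent that your own data really do behave like raindrops or horse kicks.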

And that is the crux of what I want to say: Standard statistical tests (the z-test, the t-test, the F-test, chi-square statistics, etc. etc.) whether they be “parametric” or “non-parametric” are all shortcuts.

These shortcuts exist because it is impossible, by hand, to take thousands or tens of thousands of data points, analyze them to determine the nature of the distributions they represent, and then use those empirically discovered, situation-dependent distributions to calculate test statistics and confidence intervals and stuff.

Unless, of course, we had a machine to do this! If only we had a machine into which we could put all the data, and then this machine would do calculations on the data!

Yes, folks, with modern computers it is quite straightforward to replace the old fashioned shortcuts with a brute force, direct analysis of actual data, which produces (using proper methods and theory) much, much better statistics than before.
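Here is a small sketch of the difference, assuming a made-up, strongly skewed sample: the textbook shortcut (mean plus or minus 1.96 standard errors, which leans on a bell-curve assumption) next to a brute-force percentile bootstrap computed from the data themselves. The helper names are hypothetical; the point is only that the computer does the distributional work directly.

```python
import random
import statistics


def normal_ci(data, z=1.96):
    """Textbook shortcut: mean +/- z * standard error, assuming approximate normality."""
    m = statistics.fmean(data)
    se = statistics.stdev(data) / len(data) ** 0.5
    return m - z * se, m + z * se


def bootstrap_ci(data, reps=10_000, alpha=0.05, seed=1):
    """Brute force: resample the data with replacement and read the interval off the results."""
    rng = random.Random(seed)
    n = len(data)
    means = sorted(statistics.fmean(rng.choices(data, k=n)) for _ in range(reps))
    lo = means[int(alpha / 2 * reps)]
    hi = means[int((1 - alpha / 2) * reps) - 1]
    return lo, hi


# A hypothetical, strongly skewed sample (e.g., days off work), not real data.
sample = [0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 8, 13, 21]
print("shortcut :", normal_ci(sample))
print("bootstrap:", bootstrap_ci(sample))
```

With skewed data the two intervals can disagree, and the bootstrap interval is not forced to be symmetric around the mean.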

I want to explain this two more ways, keeping in mind that I'm still oversimplifying.

1) Here is the sequence of steps one would ideally carry out in a statistical analysis.

a) Formulate a hypothesis about some numbers.

b) Fully analyze the distributional context of those numbers … are the populations they come from uniformly distributed? skewed? degenerate (only one possible number can be obtained no matter how often you sample it)? distributed like a bell curve?

c) Calculate the parameters of the actual distribution linked to the actual numbers you are using.

d) Calculate the actual probability related to your hypothesis, such as "the probability that these two numbers I say are different are actually drawn from the same population and only look different because of the nature of the distributions I analyzed in step 'b' is …"

Here’s what really happens in traditional statistical analysis:

a) Some guy, like two hundred years ago, gets interested in numbers and creates idealized distributions of things and figures out that there are some interesting relationships between and among them.

b) Some other guys, over the next couple of centuries, do the same thing with a bunch of other phenomena and come up with a handful of additional relationship types. With no computers for any of this, that was hard work.

c) Meanwhile, people figure out how to take this handful of distribution sets and use them to estimate what may or may not be going on with a particular data set. But each time one must worry about the degree to which one's own data match the original distribution on which a certain test statistic is based. Over time, people forget what the original distributions even were, and begin to fetishize them. For instance, the degree to which one's data behave just like Prussian Army horses' tendency to kick soldiers to death becomes a matter of great angst and consternation, especially in graduate school.

d) Individual researchers learn which other researchers to emulate, and then they just do what they do and hope nothing goes wrong. The important thing is the p-value anyway.

Here is how resampling works:

a) All of the above is compressed into a single analysis of your actual data.

The distributional behavior of your data is determined by taking repeated random samples of the data (with replacement). Perhaps you will do this at several sample sizes. The result tells you how badly wrong your hypothesis can be … and if the answer is "not too bad" then you're good. (This is all done with numbers, of course.)
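A minimal sketch of that recipe for the two-averages question from earlier, assuming made-up data and a percentile bootstrap interval for the difference between means (one common variant among several): resample each group with replacement many times, recompute the difference each time, and see whether the middle 95% of those differences stays clear of zero. `bootstrap_diff` is a hypothetical name.

```python
import random
import statistics


def bootstrap_diff(a, b, reps=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for mean(a) - mean(b).

    Each replicate resamples each group with replacement at its own size, so
    the spread of the replicates reflects the data's actual distribution
    rather than an assumed bell curve.
    """
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(a, k=len(a))) - statistics.fmean(rng.choices(b, k=len(b)))
        for _ in range(reps)
    )
    return diffs[int(alpha / 2 * reps)], diffs[int((1 - alpha / 2) * reps) - 1]


# Hypothetical measurements from two groups (made up for illustration).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.2, 5.7]
group_b = [4.6, 4.8, 5.0, 4.4, 4.9, 5.1, 4.7, 4.5]
lo, hi = bootstrap_diff(group_a, group_b)
print(f"95% interval for the difference: {lo:.2f} to {hi:.2f}")
print("zero excluded: difference holds up" if lo > 0 or hi < 0 else "zero included: could be noise")
```

Trying several sample sizes would just mean calling the same function on subsets of each group.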

2) For my second parable, imagine that you are in a situation that has nothing to do with statistics but requires you to make a decision. It is complex. The situation is unique although it falls into a known category of situations. So, you go to an experienced expert in this kind of situation, describe only the basic outline, leaving out all details, and ask the expert what she would normally do in this situation.

The expert replies “Well, I don’t know the details, but generally, in this situation, I’d punt (or whatever).”

Alternatively, you are facing the same situation. So you get the expert (from above) and bring them to wherever it is you are working on this. The expert gets to see the exact situation you are in, and how your situation differs from the typical situation. Based on all the information, she draws a very different conclusion than above because there are particulars that matter.

“Don’t punt (or whatever).”

Which would you prefer? The first scenario is your data in a t-test. The second scenario is your data bootstrapped.

The second analytical technique used in the paper covered by Revere is extrapolation. Obviously, extrapolation is dangerous and scary. Which would you feel more comfortable with:

1) Estimate the percentage of people who are sick in the hospital with a possible flu who require IV fluids in a particular hospital in the United States. You are given data on the number of people who walk into a hospital with flu-like symptoms, and the number of these people who get IVs, for five one-week periods distributed evenly across the flu season in ten randomly chosen hospitals plus the one you are charged to calculate this number for. In other words, you are having a statistician's wet dream.

2) Estimate the number of people who have the flu in the United States for a given flu season based on the number of IVs doled out to patients in ten randomly chosen hospitals. You are now having a statistician's nightmare.
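To see why the second task is a nightmare, here is a rough Monte Carlo sketch in which a count from a few hospitals is pushed through a chain of uncertain multipliers. Every number and multiplier name below is hypothetical; this is not the model or the values used in the flu paper. Each multiplier is drawn at random from a plausible-looking range, and the spread of the simulated totals shows how fast uncertainty compounds when you extrapolate.

```python
import random
import statistics

# All values are hypothetical, chosen only to illustrate compounding uncertainty.
OBSERVED_IV_PATIENTS = 120  # IV recipients seen in the sampled hospitals
MULTIPLIER_RANGES = {
    "hospitals_nationwide_per_sampled": (400.0, 700.0),  # scale-up from the sample
    "flu_patients_per_iv_patient": (5.0, 20.0),          # not every flu patient gets an IV
    "infected_per_hospital_visit": (20.0, 100.0),        # most infected people never visit
}


def simulate_totals(reps=20_000, seed=7):
    """Draw each multiplier uniformly from its range and multiply the chain through."""
    rng = random.Random(seed)
    totals = []
    for _ in range(reps):
        total = OBSERVED_IV_PATIENTS
        for low, high in MULTIPLIER_RANGES.values():
            total *= rng.uniform(low, high)
        totals.append(total)
    return sorted(totals)


totals = simulate_totals()
lo = totals[int(0.05 * len(totals))]   # 5th percentile
hi = totals[int(0.95 * len(totals))]   # 95th percentile
mid = statistics.median(totals)
print(f"90% of simulated totals fall between {lo:,.0f} and {hi:,.0f} (median {mid:,.0f})")
```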

Or, consider this somewhat cleaner comparison:

You must dig a hole into which will be placed the concrete base for a gate you hope to have in a fence you are installing in your yard.

1) All of the fence posts are in place, and you are told to put the gate post half way between two of the posts.

2) None of the fence posts are in place, and you are told to measure a line that is 47.5 feet from the NW corner of your house at bearing 312 degrees. You are not quite sure what is meant by “corner” of your house because your foundation has a vertical jog in it, and the original measurement may have been from the siding and not the foundation. Your compass sucks. You are not sure if this is 312 degrees off magnetic north or true north. You don’t have a tape measure that long. And so on.

Taking numbers that are fairly good numbers and dividing them up, looking within their ranges, breaking them into bits, is interpolation, and that can be done fairly accurately. Extending numbers outward long 'distances' (sometimes real distances, sometimes time, sometimes frequencies, etc.) involves a lot more uncertainty. That is what you see in the flu paper. The authors use appropriate techniques, and you will see that the range of numbers they arrive at in answer to the question posed in the title of the paper is quite large … that is because they are extrapolating, but these numbers are well confirmed by a kind of resampling.
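The same point can be put in numbers for the simplest case, a straight line fitted by ordinary least squares. The standard error of the fitted line at a point x0 is s * sqrt(1/n + (x0 - x_bar)^2 / Sxx), which is smallest at the center of the data and grows with the squared distance away from it. The sketch below uses made-up data measured at x = 1 to 10, and `mean_prediction_se` is a hypothetical helper name; it computes that error inside the range (interpolation) and well outside it (extrapolation).

```python
import math


def mean_prediction_se(xs, ys, x0):
    """Standard error of a fitted straight line's mean prediction at x0.

    Uses the ordinary least-squares formula
    se(x0) = s * sqrt(1/n + (x0 - x_bar)^2 / Sxx).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))  # residual standard deviation
    return s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)


# Hypothetical, roughly linear measurements taken at x = 1..10.
xs = list(range(1, 11))
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
for x0 in (5.5, 10, 20, 50):
    print(f"x0 = {x0:>4}: standard error {mean_prediction_se(xs, ys, x0):.3f}")
```

And this only covers noise in a model that is assumed to be right; outside the data you also lose any check on whether the straight line still holds.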

How well all this works depends, as usual, on the question you are asking. One time I needed to find out if a particular house was made of brick vs. timber. The remote farm house had been torn down and most of the debris seemed to be dumped in the cellar hole. There were a lot of bricks, but even a frame house would have had one or two brick chimneys. Also, a frame house could be "nogged," which is where clinkers and seconds (low quality bricks) are used to fill in between the timbers. Or, it could have been a brick house.

So, I did two things. Using the foundation size and what was known for houses at the time, I estimated how many bricks would be used for the following:

A two story brick house
A one story brick house
A two story nogged house
A one story nogged house
A two story house with a brick chimney
A one story house with a brick chimney

Separately, I weighed all the bricks we dug up in several holes, and extrapolated that number to estimate how many bricks would likely be found if we dug up the whole property. I came up with a number closest to the last option on that list: one chimney on a one story frame house.

I did not need to know the actual number of bricks. What I needed to know was which of the plausible alternatives the estimate of brick quantity matched most closely. For the flu, it may be enough at this time to know if the Swine Flu is like the seasonal flu, not nearly as bad, much worse, etc.
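A minimal sketch of that logic: extrapolate from a handful of test pits to the whole site, ask which candidate the estimate sits nearest, then resample the pits to see how often that answer changes. Every number here (the candidate brick counts, the per-pit counts, the number of pit-sized areas on the site) is made up, and `closest` and `extrapolate` are hypothetical helper names.

```python
import random
import statistics

# Every number here is hypothetical; none comes from the actual excavation.
CANDIDATES = {
    "two story brick house": 60_000,
    "one story brick house": 35_000,
    "two story nogged house": 20_000,
    "one story nogged house": 11_000,
    "two story house, brick chimney": 5_000,
    "one story house, brick chimney": 3_000,
}
BRICKS_PER_TEST_PIT = [40, 12, 95, 0, 30, 61, 8, 22]  # bricks recovered per pit
PIT_EQUIVALENTS_ON_SITE = 90  # how many pit-sized areas the whole site spans


def closest(estimate):
    """Name of the candidate whose expected brick count is nearest the estimate."""
    return min(CANDIDATES, key=lambda name: abs(CANDIDATES[name] - estimate))


def extrapolate(pits):
    """Scale the average count per pit up to the whole site."""
    return statistics.fmean(pits) * PIT_EQUIVALENTS_ON_SITE


point = extrapolate(BRICKS_PER_TEST_PIT)
rng = random.Random(3)
# Resample the pits with replacement and record which candidate wins each time.
votes = [closest(extrapolate(rng.choices(BRICKS_PER_TEST_PIT, k=len(BRICKS_PER_TEST_PIT))))
         for _ in range(5_000)]
print(f"point estimate: {point:,.0f} bricks -> {closest(point)}")
for name in CANDIDATES:
    print(f"{name}: chosen in {votes.count(name) / len(votes):.0%} of resamples")
```

Measuring nearness on a ratio scale instead of in raw bricks would arguably be fairer to the large candidates; the absolute difference just keeps the sketch simple.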

Confidence can be increased in extrapolation with confirming evidence. In the case of the farm house, I counted the number of brick faces that were heavily charred (from being inside the chimney) and found that this number relative to uncharred faces was very high. This suggests a fireplace. I noted that the bricks were mostly in one area of the foundation, as if there had been a chimney there. That suggests the chimney idea is more likely than the other ideas. And, I noted that most houses built in Saugerties NY in the 1870s were one story unnogged timber with brick chimneys. Had I started with that last observation and drawn conclusions from it, I might have been guilty of confirmation bias. But instead, I ended with it, and got reasonable confirmation.

The first estimate was truly unreliable …. I could have been way far off with the brick count for a lot of reasons, and I had to make a lot of assumptions (we had not dug very many holes!). But the ratio of burned surfaces was an independent confirmation, and the conclusion was not unexpected. So, I was able to argue against confirmation bias (finding what we expected) and put this house down in the database as yet another timber framed farm house.

Extrapolation is dangerous. Ask any Marine artillery forward observer you may happen to know, because extrapolation is what they do, but they do it with artillery shells, and a misplaced shell may land right on the observer, or on a nearby baby food factory, or on some other thing you don't want to drop a shell on. But with strong empirical background, experience, good theory, and independent confirmation it works. Or at least, it is often the best we can do, and our best is good enough.

Reed, C., Angulo, F., Swerdlow, D., Lipsitch, M., Meltzer, M., Jernigan, D., & Finelli, L. (2009). Estimates of the Prevalence of Pandemic (H1N1) 2009, United States, April-July 2009. Emerging Infectious Diseases, 15(11).