I have been presenting various versions of a model to predict the outcome of upcoming Democratic primaries. The earlier version of the model worked like this: Make some assumptions about the ratio of voting preference (for Sanders vs. Clinton) among the different major ethnic groups, and using the known distribution of said ethnic groups, predict the future.
I started out with the assumption that among whites, the ratio would be 50:50, based on one datum, the outcome from Iowa, which is essentially a white state. I used a bias for African Americans and Hispanic voters favoring Clinton. That worked well to predict several primaries, with the caveat that what happens in Vermont and New Hampshire would be biased by favorite son effects.
The second part of the model is to update the within-ethnic group biases with further information as it becomes available, using primarily exit polling. At no point did polling for future races come into play except to demonstrate in advance that the model might work (by comparing polls from some Super Tuesday states with the model predictions).
Again, the model predicted Super Tuesday’s outcome pretty well, but there were some surprises, especially in the magnitude of Sanders’ wins. In the states he won, I had predicted either something close to a tie or a modest Sanders win, and he did better.
Now that there have been several other races (Louisiana, Nebraska, Kansas, Maine, Mississippi and Michigan), with more exit polling and some more surprises (that, again, I predicted in polarity but not magnitude), I can see that the model works very well in predicting states where Clinton ultimately won, but underestimates Sanders’ delegate take in states where he won. And the states where the latter happens are those that are not part of the “deep south.” This indicates that both “black” and “white” voters (and maybe “hispanic” voters) are doing different things in those different states, and that ethnic mix alone is insufficient. I also considered that whether a contest is “open” or “closed” (or a primary vs. a caucus) may be a factor, and I’m sure this has an effect. However, the simple characterizations of “open” vs. “closed” or even “caucus” vs. “primary” come nowhere close to capturing the real variation among these kinds of states. Plus, sadly, there is a general lack of exit polling information for some of the odder states, so the two factors (a different ethnic pattern vs. the effect of the kind of contest) can’t be compared with each other.
So now I have a new model. This is exactly the same as the first model, but uses different ethnic patterns (how each ethnic group is likely to vote) for states that are “southern” (deep south, not the southwest) vs. states that are not “southern”. This could have been done by looking at the proportion of African Americans in each state to produce an adjustment, and I may well do that eventually, but for now a simple binary distinction seems appropriate. I calculated, using exit polls, ethnic patterns for these two kinds of states.
I have data for eight southern states indicating that the ratio of Clinton to Sanders support for White, Black and Hispanic voters should be 60-40, 88-12, and 71-29, respectively. In contrast, for non-southern states, for which I have data from six states, the ratios are 45-55, 69-31, and 46-54. Note, however, that this “black” ratio is based on only four data points, and the Hispanic ratio for both types of states is based on one state each.
In other words, Black voters always favor Clinton, but much more so in southern states. White voters favor Sanders in non-southern states, but the reverse is true in southern states. Hispanic voters strongly favor Clinton in southern states, and mildly favor Sanders in non-southern states.
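To make the mechanics concrete, here is a minimal sketch of how the two sets of ratios could be turned into a statewide prediction. The ratios are the ones quoted above; everything else is an assumption made for illustration: the function names, the example electorate (70% White, 20% Black, 10% Hispanic), the 100-delegate state, and the purely proportional delegate split, which ignores thresholds and district-level awarding.

```python
# Clinton's expected share within each group, from the exit-poll ratios above.
CLINTON_SHARE = {
    "southern": {"white": 0.60, "black": 0.88, "hispanic": 0.71},
    "non_southern": {"white": 0.45, "black": 0.69, "hispanic": 0.46},
}


def predict_clinton_share(region, electorate):
    """Blend the group ratios by each group's share of the electorate.

    `electorate` maps group name -> fraction of primary voters. Groups with
    no ratio (e.g. "other") are dropped and the rest renormalized; that is
    an assumption of this sketch, not something stated in the post.
    """
    ratios = CLINTON_SHARE[region]
    covered = {g: f for g, f in electorate.items() if g in ratios}
    total = sum(covered.values())
    if total <= 0:
        raise ValueError("electorate has no group with a known ratio")
    return sum(f * ratios[g] for g, f in covered.items()) / total


def split_delegates(clinton_share, delegates):
    """Simplified proportional split: (Clinton delegates, Sanders delegates)."""
    clinton = round(clinton_share * delegates)
    return clinton, delegates - clinton


# Hypothetical electorate, not data for any real state.
example = {"white": 0.70, "black": 0.20, "hispanic": 0.10}

for region in ("southern", "non_southern"):
    share = predict_clinton_share(region, example)
    print(region, f"{share:.3f}", split_delegates(share, 100))
```

With the same made-up electorate, the southern ratios give Clinton about two thirds of the vote and the non-southern ratios give an essentially even split, which is the whole point of treating the two kinds of states separately.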
Applying this model to the past, it does less well than earlier versions of the model did on the first few primaries, and better on later primaries. This may mean that there is a change in voting behavior, or simply differences in the states that happen to go earlier or later. Indeed, the current model still somewhat underestimates Sanders’ performance where he does well, and if the smaller number of later states (i.e., excluding Iowa, New Hampshire and Nevada) is used to estimate these ratios, the White ratio is unchanged but the Black ratio works a bit less against Sanders. But at this point we have broken the data down into too-small units and are nitpicking. (By the way, if I recalculate the ratios weighting for state population size, which might be better because larger states may be better samples, there is no significant difference. More likely, a weighted average that ranks the quality of the exit polling data would be more logical and useful, but I do not have any such quality measures.)
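For readers who want to see what the population weighting amounts to, here is a small sketch comparing a plain average of per-state exit-poll ratios with a weighted one. The three "states", their Clinton shares, and their population weights are invented purely for illustration; they are not the exit-poll numbers behind the ratios in this post. The same function would work with poll-quality scores as weights, if such scores existed.

```python
def unweighted_mean(values):
    """Plain average: every state counts the same."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Average in which each state counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical data: (label, Clinton share within one group, population weight).
exit_polls = [
    ("State A", 0.60, 1.0),
    ("State B", 0.70, 1.0),
    ("State C", 0.80, 2.0),
]

shares = [share for _, share, _ in exit_polls]
weights = [weight for _, _, weight in exit_polls]

print(f"unweighted: {unweighted_mean(shares):.3f}")         # unweighted: 0.700
print(f"weighted:   {weighted_mean(shares, weights):.3f}")  # weighted:   0.725
```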
When retrodicting previous contests with the new model, to see how well it works, the outcome isn’t too bad. It fails to predict Iowa, Nevada, Colorado, and Massachusetts, but is close. The new model predicts a 65-65 split in Michigan, which actually had a 61-69 split, so that’s wrong (but a tie is better than the wrong win).
I could easily adjust the Sanders numbers to make the model predict the outcomes better in those states where he won, and that might be reasonable because of the status-quo part of the status-quo-ethnic model. But it would be an arbitrary adjustment with respect to the ethnic part of the model, so it is better not to.
This model retrodicts that Clinton takes 785 committed delegates and Sanders takes 536 committed delegates to date. By my count (which may vary from other counts because sometimes the delegates are counted funny) Clinton has actually won 769 and Sanders has won 502. That’s not bad, I’ll take it.
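The size of the miss can be worked out directly from those four numbers. The sketch below uses only the delegate counts quoted above; the function names are my own, and the share comparison is included because the retrodicted and counted totals are not the same size (1,321 vs. 1,271), so raw differences alone can mislead.

```python
# Delegate totals quoted above: model retrodiction vs. the author's own count.
retrodicted = {"Clinton": 785, "Sanders": 536}
actual = {"Clinton": 769, "Sanders": 502}


def delegate_errors(predicted, counted):
    """Predicted minus counted delegates, per candidate."""
    return {name: predicted[name] - counted[name] for name in predicted}


def clinton_share(counts):
    """Clinton's fraction of the two-candidate delegate total."""
    return counts["Clinton"] / (counts["Clinton"] + counts["Sanders"])


print(delegate_errors(retrodicted, actual))  # {'Clinton': 16, 'Sanders': 34}
print(f"retrodicted Clinton share: {clinton_share(retrodicted):.3f}")
print(f"counted Clinton share:     {clinton_share(actual):.3f}")
```

Measured as shares, the retrodiction puts Clinton at a bit over 59% of committed delegates against a bit over 60% in the count, a gap of roughly one percentage point.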
So, if this model is any good, I should be able to tell you now who will win the various races in the all-important upcoming Son of Super Tuesday, next week.
Clinton will win Florida, barely. The model projects a tiny lead for Sanders in Illinois, so that may be a tie. Clinton handily wins Missouri and North Carolina. Sanders barely wins Ohio. At the end of the day (aside, again, from delegate awarding oddities) Clinton will have added 376 committed delegates to Sanders’ 314. A Clinton win, but not a big one, is expected for next Tuesday.
Finally, according to this latest version of the status quo ethnic mix model, Clinton will win the nomination. The following graph shows the cumulative delegate count for each candidate, with the first several dates (up to yesterday’s primaries in Mississippi and Michigan) using actual committed delegate counts, and the rest using the projections from the model.
It is very important to note that this model probably underestimates Sanders’ performance in a subset of states. In other words, Sanders’ actual delegate count will be somewhere between the two lines shown here for a few weeks. The question then remains, can he get his line to cross Hillary’s line?
Note that in this scenario, Sanders wins both New York and California, but just by a little. If there is a handful of big states where my “just by a little” actually turns out to be “by a surprising amount,” there could be a different outcome. Indeed, Sanders is expected to outperform Clinton from New York onward in many primaries, and if he does so by “a surprising amount” (which by then won’t seem like a surprising amount anymore) wherever possible, he could pull ahead.