Who Will Win The Democratic Primary? (Updated model)

I have been presenting various versions of a model to predict the outcome of upcoming Democratic primaries. The earlier version of the model worked like this: Make some assumptions about the ratio of voting preference (for Sanders vs. Clinton) among the different major ethnic groups, and using the known distribution of said ethnic groups, predict the future.

I started out with the assumption that among whites, the ratio would be 50:50, based on one datum, the outcome from Iowa, which is essentially a white state. I used a bias for African Americans and Hispanic voters favoring Clinton. That worked well to predict several primaries, with the caveat that what happens in Vermont and New Hampshire would be biased by favorite son effects.

The second part of the model is to update the within-ethnic-group biases with further information as it becomes available, using primarily exit polling. At no point did polling for future races come into play, except to demonstrate in advance that the model might work (by comparing polls for some Super Tuesday states with the model predictions).

Again, the model predicted Super Tuesday’s outcome pretty well, but there were some surprises, especially in the magnitude of the wins where Sanders won. In those states I had predicted either something close to a tie or a modest Sanders win, and he did better.

Now that there have been several other races (Louisiana, Nebraska, Kansas, Maine, Mississippi and Michigan), with more exit polling and some more surprises (that, again, I predicted in polarity but not magnitude), I can see that the model works very well in predicting states where Clinton ultimately won, but underestimates Sanders’ delegate take in states where he won. And the states where the latter happens are those that are not part of the “deep south.” This indicates that both “black” and “white” voters (and maybe “hispanic” voters) are doing different things in those different states, and that ethnic mix alone is insufficient. I also considered that whether a primary is “open” or “closed” (or whether a contest is a primary or a caucus) may be a factor, and I’m sure this has an effect. However, the simple characterizations of “open” vs. “closed” or even “caucus” vs. “primary” come nowhere close to capturing the real variation among these kinds of states. Plus, sadly, there is a general lack of exit polling information for some of the odder states, so the two factors (a different ethnic pattern vs. the effect of the kind of contest) can’t be compared with each other.

So now I have a new model. This is exactly the same as the first model, but uses different ethnic patterns (how each ethnic group is likely to vote) for states that are “southern” (deep south, not the southwest) vs. states that are not “southern”. This could have been done by looking at the proportion of African Americans in each state to produce an adjustment, and I may well do that eventually, but for now a simple binary distinction seems appropriate. I calculated, using exit polls, ethnic patterns for these two kinds of states.

I have data for eight southern states indicating that the ratio of Clinton to Sanders support among White, Black and Hispanic voters should be 60-40, 88-12, and 71-29, respectively. In contrast, for non-southern states, for which I have data from six states, the ratios are 45-55, 69-31, and 46-54. Note, however, that this non-southern “black” ratio is based on only four data points, and the hispanic ratio for both types of states is based on one state each.

In other words, Black voters always favor Clinton, but much more so in southern states. White voters favor Sanders in non-southern states, but the reverse is true in southern states. Hispanic voters strongly favor Clinton in southern states, and mildly favor Sanders in non-southern states.
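As a rough illustration of how these ratios turn into a projection, here is a minimal Python sketch. The function name `project_delegates`, the example electorate, and the strictly proportional awarding of delegates are assumptions made for illustration; only the six ratios come from the exit-poll figures quoted above.

```python
# Clinton's share of the two-way vote within each group, taken from the
# exit-poll ratios quoted in the text (Sanders gets the remainder).
CLINTON_SHARE = {
    "southern": {"white": 0.60, "black": 0.88, "hispanic": 0.71},
    "non_southern": {"white": 0.45, "black": 0.69, "hispanic": 0.46},
}


def project_delegates(ethnic_mix, region, delegates):
    """Return (clinton, sanders) projected committed delegates.

    ethnic_mix: fraction of the Democratic electorate in each group;
                it is normalized here, so groups not listed are ignored.
    region:     "southern" or "non_southern".
    delegates:  committed delegates at stake, assumed (as a simplification)
                to be awarded in proportion to the projected vote.
    """
    shares = CLINTON_SHARE[region]
    total = sum(ethnic_mix.values())
    clinton_fraction = sum(
        ethnic_mix[group] / total * shares[group] for group in ethnic_mix
    )
    clinton = round(delegates * clinton_fraction)
    return clinton, delegates - clinton


# Hypothetical electorate, NOT data for any real state.
example_mix = {"white": 0.70, "black": 0.20, "hispanic": 0.10}
print(project_delegates(example_mix, "non_southern", 100))
print(project_delegates(example_mix, "southern", 100))
```

Running the same hypothetical electorate through both regions shows how much the southern vs. non-southern distinction alone can move a projection, even with the ethnic mix held constant.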

Applied to the past, this model does less well than earlier versions did on the first few primaries, and better on later primaries. This may mean that there is a change in voting behavior, or simply differences in the states that happen to go earlier or later. Indeed, the current model still somewhat underestimates Sanders’ performance where he does well, and if the smaller number of later states (i.e., excluding Iowa, New Hampshire and Nevada) is used to estimate these ratios, the White ratio is unchanged but the Black ratio works a bit less against Sanders. But at this point we have broken the data down into too-small units and are nitpicking. (By the way, if I recalculate the ratios weighting for state population size, which might be better because larger states may be better samples, there is no significant difference. More likely, a weighted average that ranks the quality of the exit polling data would be more logical and useful, but I do not have any such quality measures.)
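For readers who want to see what the weighted recalculation amounts to, here is a small sketch. The helper name `mean_share` and every number in the example are hypothetical; they are not the exit-poll figures behind the ratios above. The weights could be state populations or, if one existed, a poll-quality score.

```python
def mean_share(state_shares, weights=None):
    """Average per-state Clinton shares for one ethnic group.

    state_shares: Clinton's share within the group in each state's exit poll.
    weights:      optional per-state weights (e.g. population); with no
                  weights every state counts equally.
    """
    if weights is None:
        weights = [1.0] * len(state_shares)
    if len(weights) != len(state_shares):
        raise ValueError("need exactly one weight per state")
    weighted_sum = sum(s * w for s, w in zip(state_shares, weights))
    return weighted_sum / sum(weights)


# Hypothetical shares for three states, with hypothetical populations
# (in millions) used as weights.
shares = [0.40, 0.50, 0.60]
populations = [1.0, 1.0, 2.0]

unweighted = mean_share(shares)
weighted = mean_share(shares, populations)
print(f"unweighted {unweighted:.3f}, weighted {weighted:.3f}")
```

When the two figures come out close, as reported above for the real data, the choice of weighting matters little for the projection.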

When retrodicting previous contests with the new model, to see how well it works, the outcome isn’t too bad. It fails to predict Iowa, Nevada, Colorado, and Massachusetts, but is close. The new model predicts a 65-65 split in Michigan, which actually had a 61-69 split, so that’s wrong (but a tie is better than the wrong win).

I could easily adjust the Sanders numbers to make the model predict the outcomes better in those states where he won, and that might be reasonable because of the status-quo part of the status-quo-ethnic model. But it would be an arbitrary adjustment with respect to the ethnic part of the model, so it is better not to.

This model retrodicts that Clinton takes 785 committed delegates and Sanders takes 536 committed delegates to date. By my count (which may vary from other counts because sometimes the delegates are counted funny) Clinton has actually won 769 and Sanders has won 502. That’s not bad, I’ll take it.
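To put those four numbers side by side, here is a short calculation. It uses only the delegate counts quoted above; the variable names are mine, and the choice of error measures is just one reasonable way to summarize the fit.

```python
# Delegate counts quoted in the text.
model = {"Clinton": 785, "Sanders": 536}
actual = {"Clinton": 769, "Sanders": 502}

# Absolute and relative over-prediction for each candidate.
errors = {name: model[name] - actual[name] for name in model}
relative = {name: errors[name] / actual[name] for name in model}

# Clinton's share of the two-candidate total, model vs. actual.
model_share = model["Clinton"] / sum(model.values())
actual_share = actual["Clinton"] / sum(actual.values())

for name in model:
    print(f"{name}: model {model[name]}, actual {actual[name]}, "
          f"error {errors[name]:+d} ({relative[name]:+.1%})")
print(f"Clinton share: model {model_share:.1%}, actual {actual_share:.1%}")
```

Note that the model total (1,321) is larger than the actual total (1,271), so the model over-counts both candidates; comparing shares rather than raw counts sidesteps that difference.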

So, if this model is any good, I should be able to tell you now who will win the various races in the all-important upcoming Son of Super Tuesday, next week.

Clinton will win Florida, barely. The model projects a tiny lead for Sanders in Illinois, so that may be a tie. Clinton handily wins Missouri and North Carolina. Sanders barely wins Ohio. At the end of the day (aside, again, from delegate awarding oddities) Clinton will have added 376 committed delegates to Sanders’ 314. A Clinton win, but not a big one, is expected for next Tuesday.

Finally, according to this latest version of the status quo ethnic mix model, Clinton will win the nomination. The following graph shows the cumulative delegate count for each candidate, with the first several dates (up to yesterday’s primaries in Mississippi and Michigan) using actual committed delegate counts, and the rest using the projections from the model.
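The cumulative count is just a running sum of actual counts followed by projected ones. Here is a minimal sketch of the first two points, using only figures given earlier in this post (769 and 502 actual delegates to date, 376 and 314 projected for next Tuesday); the list layout and variable names are assumptions, and later dates are left out because their projections are not listed in the text.

```python
from itertools import accumulate

# (label, Clinton delegates, Sanders delegates) for each step on the graph.
contests = [
    ("actual, through Mississippi and Michigan", 769, 502),
    ("projected, next Tuesday", 376, 314),
]

# Running totals for each candidate.
clinton_running = list(accumulate(c for _, c, _ in contests))
sanders_running = list(accumulate(s for _, _, s in contests))

for (label, _, _), c, s in zip(contests, clinton_running, sanders_running):
    print(f"{label}: Clinton {c}, Sanders {s}, gap {c - s}")
```

Appending further `(label, clinton, sanders)` rows for later dates would extend both running totals in the same way.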


It is very important to note that this model probably underestimates Sanders’ performance in a subset of states. In other words, Sanders’ actual delegate count will be somewhere between the two lines shown here for a few weeks. The question then remains, can he get his line to cross Hillary’s line?

Note that in this scenario, Sanders wins both New York and California, but just by a little. If there is a handful of big states where my “just by a little” actually turns out to be “by a surprising amount” there could be a different outcome. Indeed, Sanders is expected to outperform Clinton from New York onward in many primaries, and if he does “a surprising amount” (which by then won’t seem like a surprising amount anymore) wherever possible, he could pull ahead.

Super Tuesday: What does it mean for the Democratic Primary?

As you know, I developed a simple model for projecting future primary outcomes in the Democratic party. This model is based on the ethnic mix in each state, among Democratic Party voters. The model attributes a likely voting choice to theoretical primary goers or caucus-goers based on previous behavior by ethnicity. Originally I made two models, one using numbers that the Clinton campaign was banking on, and one using numbers that the Sanders campaign was banking on.

The results of the Super Tuesday primaries demonstrated that the Sanders-favoring model does not predict primary outcomes. Those same results showed that the Clinton-favoring model worked better. But the numbers also indicated that the Clinton-favoring model estimates Clinton’s ultimate delegate take somewhat inaccurately.

I adjusted the model parameter so the model now matches reality for a subset of the primaries that have already happened to within five percent. The model still slightly favors Clinton, but not by much. The subset of primaries includes only the US states (not territories, where I don’t expect the ethnic mix approach to work at all) and excludes states with a strong favorite son effect. This therefore excludes New Hampshire and Vermont. Due to oddities in the Texas delegate system, the adjustment was also made by excluding Texas, though the model results for Texas match very well proportionately.

(Note: Using only the subset of states, the model predicts previously held primaries and caucuses to within less than two tenths of a percent).

The new model now only has one version, which as noted matches primaries so far very well. While there is a somewhat southern bias in the set of primaries that have been carried out so far, that bias is probably not important. I have a fairly high level of confidence in the model.

The result is best seen in this graphic, which shows the cumulative delegate count of committed delegates in US states. So this excludes non-committed delegates (known as “Super Delegates”) and it excludes territories and other non-states (but it does include DC, because DC is like a state).


Assuming a large proportion of the Democratic Party’s uncommitted delegates support Clinton, Clinton will probably achieve the necessary number of delegates to lock the nomination either on the 19th of April with the New York primary, or on the 26th of April, with the Maryland, Connecticut, Delaware, Pennsylvania and Rhode Island primaries.

There are two phases of primaries coming up. First we have a series of weeks with only one or two primaries happening at once, with a total of 300 committed delegates (130 from Michigan). Then we have what is effectively Return of Super Tuesday, with 691 committed delegates, including Florida with 214. For Sanders to regain traction, he has to do well in some of these big states. In particular, Sanders has to outperform the model in Michigan, Florida, Illinois and possibly North Carolina and Ohio.

When we look at many of these states, the model seems to fit very well with the available polling data, except in cases where the polls suggest a stronger outcome for Clinton. The following table compares the model projections with estimates of the delegate split based on polls. All delegates are assumed to be awarded (among the committed delegates only) and the polling data is not very dense and in some cases not too recent, so this is a very rough estimate.


Prior to Super Tuesday, the then-current version of this model projected results that conformed closely with polls. For most states, the outcome of the actual voting matched the projections and the polls pretty well, except in a couple of places. Now, the refined model matches polling data even more closely, but the polling data is not necessarily to be trusted because there has not been enough polling. (I avoided comparisons with really old polls, which are entirely useless.)

Clinton’s path to the nomination is clear. Sanders’ path to the nomination requires something to change, and to change dramatically and quickly.