Almost exactly 50% of the votes have been cast in the Democratic Party primary and caucus process. I’ve been updating a model to predict primary and caucus results all along, and the model has done fairly well. The most recent update, however, was a bit off. That update involved separating states into two groups, southern vs northern, then calculating different sets of likely voting patterns by ethnicity for those two groups, and integrating that with estimates of ethnic distribution (“white, black, hispanic”) among Democratic voters by state.

What I did not do in those models was incorporate whether a primary or caucus is open, closed, or somewhere in between.

Now that we have had quite a few primaries and caucuses, it is possible to move to a somewhat more sophisticated model, because there is (probably) enough data.

I ran a multi-variable regression analysis that coded primary openness (0=closed, 1=semi-open, 2=open) and whether or not a state is southern, then included the percent of each ethnic group by state.
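For readers who want to see what a regression of this shape looks like, here is a minimal sketch in plain Python. It is not the analysis behind this post: every number in it is an invented placeholder, the function names are hypothetical, and I am assuming for illustration that the response is one candidate's vote share and that the ethnic predictors are the black and white shares. It generates a noise-free response from made-up coefficients and then recovers them with ordinary least squares.

```python
# Illustrative only: all values below are invented placeholders, not real data.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations hit both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def design_row(openness, southern, pct_black, pct_white):
    """Intercept, openness code (0/1/2), southern flag (0/1), ethnic shares."""
    return [1.0, float(openness), float(southern), pct_black, pct_white]

# Hypothetical states: (openness, southern, pct_black, pct_white)
states = [
    (0, 1, 0.50, 0.40),
    (2, 1, 0.30, 0.50),
    (1, 0, 0.10, 0.60),
    (0, 0, 0.05, 0.90),
    (2, 0, 0.20, 0.40),
    (1, 1, 0.40, 0.55),
    (0, 0, 0.25, 0.70),
]
rows = [design_row(*s) for s in states]

# Made-up coefficients, used only to generate a noise-free response.
true_beta = [0.30, -0.02, 0.05, 0.60, 0.10]
vote_share = [sum(b * v for b, v in zip(true_beta, r)) for r in rows]

beta = fit_ols(rows, vote_share)
print([round(b, 4) for b in beta])
```

Because the placeholder response contains no noise, the fit recovers the made-up coefficients. With real results the fit would be imperfect, which is what the R-squared value measures.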

The result indicated that the percent of a voting group (by state) that is hispanic did not influence the result. In doing the analysis I looked only at states, and excluded Vermont and New Hampshire because of the strong favorite son effect. The resulting model, naturally, predicts the number of delegates that have already been awarded to each candidate, in total, precisely, for the simple reason that the model is based on that number. Within the data set, the R-squared value is 0.83, which is pretty good. This means, roughly, that 83% of the variation in voting (by percent who voted for each candidate) is explained by those variables. The following table shows the actual delegates won vs. the delegates predicted by the model.

Also indicated is the spread between the two candidates in percent. The spread starts off a bit wonky because there are only a few contests, but then settles into about 20% and remains at that level. Not shown is an analysis of the degree to which Sanders performed relative to expectations. If that number changed a lot, showing a trend, this would be important for predicting the future. The first half of the contests show Sanders underperforming, according to this model, by 2%, and the last half have him overperforming by 2%. So there may be a very low-level “surge,” but not enough to make any real difference in the outcome.
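For anyone unsure what the R-squared value of 0.83 mentioned above measures, here is a small sketch of the usual definition: one minus the ratio of unexplained variation to total variation. The function name and the four sample values are hypothetical and chosen only to make the arithmetic easy to follow. They are not results from the model.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Invented vote shares for four imaginary contests.
actual = [0.2, 0.4, 0.6, 0.8]
predicted = [0.3, 0.3, 0.7, 0.7]

print(round(r_squared(actual, predicted), 3))  # approximately 0.8
```

In this toy case each prediction is off by 0.1, and about 80% of the variation around the mean is still accounted for.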

So, what does the future look like? There are several states coming up where Sanders is likely to do well. But is it enough to make it likely for him to overtake Clinton? With a 20% spread and half the votes counted, Sanders would have to take an average of 60% of the delegates from here on just to draw even. That is very unlikely.
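The 60% figure follows from simple arithmetic, sketched below. It assumes a two-candidate race in which the spread is the leader's share minus the trailer's share. The function name is my own.

```python
def share_needed_to_tie(spread, fraction_awarded):
    """Share of the remaining delegates the trailing candidate needs to
    finish level, assuming a two-candidate race."""
    trailing_share_so_far = (1.0 - spread) / 2.0
    won_so_far = trailing_share_so_far * fraction_awarded
    remaining = 1.0 - fraction_awarded
    return (0.5 - won_so_far) / remaining

# A 20% spread with half of the delegates awarded.
print(round(share_needed_to_tie(0.20, 0.5), 3))  # roughly 0.6
```

A 20% spread means a 60/40 split so far, so the trailing candidate has to reverse that split over the second half just to reach a tie.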

The following table shows the primary and caucus outcomes through the present, followed by the predicted delegate commitments for the rest of the primary season. The percent spread between the candidates is indicated, and it does indeed drop over time, though slowly, reaching a minimum of 8% for the last few races.

The total number of delegates required to lock the nomination is 2,383. There are 717 uncommitted delegates (aka “Super Delegates”). If we assume that all of those uncommitted delegates will simply vote for the majority candidate, then the number of delegates required to have a likely lock on the nomination is 1669. This is not a fully supportable assumption because some of the uncommitted delegates may choose a different path, but it is a reasonable approximation.

The part of the table above marked in yellow indicates the approximate point in time when the leading candidate, Clinton, will get somewhere around 1669 delegates. So, if this model is reasonably accurate, Clinton will achieve a lock around mid-May.
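Finding that point amounts to keeping a running total of predicted delegates and noting the first contest at which it reaches the threshold. Here is a minimal sketch. The contest labels and delegate numbers are invented placeholders rather than values from the table, and the function name is hypothetical. Only the 1669 threshold comes from the post.

```python
def first_lock(contests, threshold, already_won=0):
    """Return (label, running_total) for the first contest at which the
    running total reaches the threshold, or None if it never does."""
    total = already_won
    for label, delegates in contests:
        total += delegates
        if total >= threshold:
            return label, total
    return None

# Placeholder schedule: (contest label, predicted delegates for the leader).
upcoming = [("contest A", 300), ("contest B", 250), ("contest C", 200)]

print(first_lock(upcoming, threshold=1669, already_won=1000))
# ('contest C', 1750)
```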

Next week’s primaries are Arizona, Idaho, and Utah. In my view, these are somewhat hard to predict. Polls suggest a weak Sanders win in Idaho and a weak Clinton win in Utah. My model predicts a strong Clinton win in Arizona, and Sanders victories in Idaho and Utah. The total number of delegates at stake next week is small (131 in total). In order for Sanders to signal that he can overtake Clinton, he would have to win about 79 delegates in total (60% of 131). If he falls short of that, the rest of the road is more uphill. If he does better than that, then he may be seriously in the running.

Sanders is also expected to do well in the next several races (Alaska, Hawaii, Washington, Wisconsin, and Wyoming) according to my model. However, I don’t actually expect my model to work at all in Hawaii. My model suggests that he may well achieve over 55% of the vote in those primaries, but again, he will have to have already achieved 60% (unlikely) on the 22nd for this to start to accumulate to a catch-up number.

Following Wyoming is New York State followed by Super Tuesday III, six states with 631 delegates. My model suggests he will get less than half of these delegates, though he will do well in Pennsylvania and lose by not much in New York. I’m also predicting that he will win in California, in June, but not by much.

Between now and the end of the race, there are 1946 delegates yet to be awarded. Of these, the top five states account for a whopping 1138 delegates. These states are Washington, New York, Pennsylvania, California, and New Jersey. I predict he will come close to even with Clinton or win most of these states (but Clinton will do very well in New Jersey), but in order for Sanders to overtake Clinton by focusing on these states, he’ll have to do VERY well in all or most of them.

This model uses everything that happened before (mostly) to predict everything that will happen in the future. The first half of this series of events is over (in terms of delegate counts) and there is no evidence of any dynamic change occurring at the moment. This model does an excellent job of retrodicting the prior races, but it might slightly underestimate Sanders’s performance, since for the last half of the retrodicted contests Sanders outperforms the model by an average of 2%. However, in order for him to catch up to Clinton, he has to outperform the model by 10%.

The graphic at the top of the post is the predicted delegate counts for the entire primary season. The already-held contests are represented as predictions instead of actual because the final number (today’s delegate count) is the same for both predicted and actual. There is a slight narrowing of the gap (see table above) but not enough to change the outcome of Clinton achieving a lock on the Democratic Party nomination in May.