Tag Archives: Science by Spreadsheet

Will Clinton or Sanders win the Democratic Nomination?

Both Hillary Clinton and Bernie Sanders are viable candidates to win the Democratic nomination to run for President of the United States.

There are polls and pundits to which we may refer to make a guess as to who will win. Or, we could ignore all that, and let the process play out and see what happens. But, spreadsheets exist, so it really is impossible to resist the temptation of creating a simplistic spreadsheet model that predicts the outcome.

But we can take that a step further and suggest alternate scenarios, based on available data. So I did that.

I have removed the so-called “Super Delegates” from the process. This model assumes that the super delegates will ultimately either divide themselves up to reflect the overall distribution of committed delegates, or will mass towards the apparent leader. In any event, it is important that you know that “Super Delegate” is an unofficial, made-up term. They are really called “Uncommitted Delegates” because they are uncommitted: they will walk into the National Convention with no requirement as to whom they cast their vote for. That is their purpose. Meanwhile, it is true that individual Uncommitted Delegates will “endorse” a candidate during the process. Personally, I’m against this because it leads to conspiratorial ideation among activists and other interested parties. If I were King of the Democratic Party, I would make a rule that Uncommitted Delegates may not endorse or in any other way imply support for a candidate. (I would also probably reduce the total number of Uncommitted Delegates somewhat.)

So, in this model, the number of delegates it takes to be assured the nomination, pragmatically if not fully realistically, is the number required by the process minus the number of Uncommitted Delegates, or 2382-712=1670. In the graphs below, I represent this threshold by a wide blue line to reflect uncertainty. When a candidate’s delegate count makes it to the vague blue line first, that is an indicator that this candidate may be anointed. But, if the two candidates are close in delegate count at this point, a proper degree of uncertainty has to be assumed.
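The arithmetic is trivial, but in the spirit of showing the work, here it is as a couple of lines of Python instead of a spreadsheet cell:

```python
# Pragmatic threshold: delegates officially required for the nomination,
# minus the uncommitted ("super") delegates this model sets aside.
OFFICIAL_THRESHOLD = 2382
UNCOMMITTED = 712

pragmatic_threshold = OFFICIAL_THRESHOLD - UNCOMMITTED
print(pragmatic_threshold)  # 1670
```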

This modeling effort explores the effect of ethnicity on the outcome. I assume all voters are White, Black, or Hispanic. I also only look at US states and DC, because things may be very different in the territories and possessions with respect to ethnicity. It is not too hard to estimate the relative preference for either of the two candidates among White, Black, and Hispanic subpopulations. It is probably true that these ethnic divisions work very differently in different areas. For example, union endorsements may affect ethnic voting patterns more or less for different ethnicities in different states. Importantly, it is likely that both preference and turnout will evolve among the ethnic groups as the primary process continues. This, of course, is why we use a spreadsheet. You can change the numbers any time as more information is available.

This model does not involve age directly, but does so indirectly, in that variations in age-graded participation factor into ethnicity. Same with sex, or more accurately, sex is divided evenly across the primary states (I assume) while age might not be, so again, it can factor into ethnicity. But a more sophisticated model that looks at turnout differentials or anomalies across age and sex would be better, and if the information related to this becomes available, perhaps I’ll update the model.

The Iowa Caucus involved mostly White voters, and told us that Clinton and Sanders are very close to even in this demographic. So, the model could assume a 50-50 split among White voters. Currently available and fairly recent polling data tell us that Clinton is preferred by African American Democrats and Hispanic Democrats, but to different degrees. So, a first stab at this model can use a Clinton-Sanders ratio of 70-30 for African American primary voters, and 60-40 for Hispanic primary voters. Using these three sets of ratios, and known statewide demographics across the primary, we can estimate the effects of ethnicity.
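A minimal sketch of the spreadsheet logic, in Python rather than in cells. The state demographics and delegate count below are invented for illustration, and the simple proportional split glosses over the actual Democratic allocation rules:

```python
# Estimate a candidate's share of a state's primary vote from the
# state's ethnic makeup and per-group preference ratios, then split
# delegates proportionally. Demographics below are illustrative.

# Clinton's assumed share of each group's vote (Model 1 in the text).
clinton_share = {"white": 0.50, "black": 0.70, "hispanic": 0.60}

def clinton_vote_share(demographics):
    """demographics: fraction of the primary electorate in each group."""
    return sum(frac * clinton_share[group]
               for group, frac in demographics.items())

def split_delegates(total, share):
    """Proportional split, rounded; real allocation rules are messier."""
    clinton = round(total * share)
    return clinton, total - clinton

# Hypothetical state: 60% White, 25% Black, 15% Hispanic, 50 delegates.
demo = {"white": 0.60, "black": 0.25, "hispanic": 0.15}
share = clinton_vote_share(demo)    # 0.6*0.5 + 0.25*0.7 + 0.15*0.6
print(round(share, 3))              # 0.565
print(split_delegates(50, share))   # (28, 22)
```

Changing the three ratios in `clinton_share` is all it takes to run an alternate scenario, which is exactly the point of doing this in a spreadsheet.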

One problem you might note right away is that the statewide ethnicity profiles are not the same as the Democratic Party ethnicity profiles. A better version of this model will use the primary participant profiles instead. But, the last two election cycles of data are probably biased in this regard because of Obama’s candidacy, and thus may be incorrect. The preferred method will be to recalculate state by state ethnicity profiles, to estimate how many of each of three groups will vote, based on the returns from the first several primaries. I’ll do that. Right now this is impossible because both Iowa and New Hampshire lack the diversity in the voting population to allow it.

I am ignoring the New Hampshire results because I don’t know how to adjust for the Favorite Son Effect there. Also, New Hampshire is an odd state when it comes to primaries. The largest voting bloc in the New Hampshire Primary is uncommitted, and those voters can vote in either primary (but registered Republicans and Democrats cannot switch). This, and some other factors, has resulted in a special culture among New Hampshire voters. So, between the Favorite Son Effect and the special snowflake nature of New Hampshire (which is what makes New Hampshire so interesting and important, of course), I’m ignoring it for now, but will include data from the Granite State when there are more states to consider.

So, the first model assumes the above stated numbers, and produces this effect:

[Figure: Model 1 projected delegate accumulation, in which Clinton wins.]

In this model, Clinton wins the primary. The pattern of delegate accumulation is interesting, and is actually one of the main reasons to do this modeling, but it only becomes understandable when compared to other outcomes, so let’s look at the alternative model I ran and then compare.

The second model takes a cue from the large number of new young voters, combined with their Bernie-ness and their whiteness, to suggest a change in the White ratio to favor Sanders. I sucked on my thumb for a minute and came up with a 40-60 ratio. This model also gives credit to the Sanders campaign’s claims that African Americans will grok the Bern, and lowers the differential among Black voters to 60-40. This model assumes something similar for Hispanic voters, and adds another element: it is possible that in some states labor-related issues will cause Hispanic votes to shift even more strongly to Sanders, so my thumb-suck estimate for this ratio is 40-60.

The second model is designed to favor Sanders in a way that might reasonably reflect actual possible voting-preference shifts that the Sanders campaign is attempting. So, this model assumes Sanders succeeds where he is clearly trying, and produces this result:

[Figure: Model 2 projected delegate accumulation, in which Sanders wins.]

Now, we can compare the two models, which I think are a) reasonable given what we know and b) need to be taken with a grain of salt because of what we don’t know.

The two models show a difference in how the spread between the candidates evolves, and when the projected winner can be seen as anointed by the process. In the case of the Clinton win, which assumes the status quo maintained for the entire campaign, and gives credit to the idea that “Sanders can’t win in the South” (more or less), the two candidates stay close enough to each other that there will be no clear winner for a long time, even if Clinton actually does stay ahead of Sanders the whole time. In this case, the jump into the blue zone, though not by a very large margin, does not happen until April 26th, when there are several primaries including Pennsylvania, with a massive delegate count. Also, importantly, after this date there are still some very large states including New Jersey and especially California, that could flip a result. If this is the pattern that develops, the day after the big primary day on April 26th, if I was Sanders, I’d camp out in California!

In the case of the Sanders win, the pattern is very different. (This is why this is interesting.) Here, Sanders pulls farther ahead, and sooner. The big jump would be on March 15th, which is a day of several primaries, including Florida, Illinois, and North Carolina. In this model, a close campaign shifts to a strong Sanders lead, and Bernie does not look back.

Those two scenarios represent two very different primary seasons, indeed!

I will update or redo these models after the next primary or two. Between Nevada and South Carolina, we can get much better data on the ethnic effects on the numbers, though of course, it will still be very provisional. Those data will be limited by not being extensive, but will represent a lot of diversity. On Super Tuesday (March 1st) enough data from a bunch of primaries across the US will allow, I think, a very accurate model that will probably predict the outcome of the primary season IF whatever the status quo on that day happens to be maintains into the future. After that, differences from whatever looks apparent will require something to happen or change to cause voters to do the unexpected.

How warm will 2014 be?

We just experienced the warmest two months (May and June) on record, meaning, essentially, in well over 100 years. This is because of anthropogenic global warming (AGW). Does this mean that 2014 will be the warmest year on record? Probably not, in part because February was pretty cold and that lowers the score for the year. But it will be a warm year.

There is a strong correlation between the temperature in June and what turns out to be the global mean for the year. This can be shown empirically by calculating a simple correlation coefficient for each month of the year and the year’s average. For this I used the GISS anomaly data.
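For anyone who wants to replicate the idea without a spreadsheet, here is a rough sketch of the calculation. The anomaly table is a synthetic stand-in (a trend plus noise), not the actual GISS series, so it demonstrates the method rather than the seasonal result:

```python
import numpy as np

# One row per year, twelve monthly anomaly columns (synthetic data).
rng = np.random.default_rng(0)
trend = np.linspace(-0.2, 0.8, 60)                  # a warming trend
monthly = trend[:, None] + rng.normal(0, 0.1, (60, 12))

# Correlate each month's column with the annual mean.
annual = monthly.mean(axis=1)
r = [np.corrcoef(monthly[:, m], annual)[0, 1] for m in range(12)]

for month, coeff in zip("JFMAMJJASOND", r):
    print(month, round(coeff, 2))
```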

[Figure: Correlation between each month’s temperature anomaly and the annual mean, by month.]

Clearly, the ability of a month to predict the year follows a seasonal march, with June and its sibling months performing the best. I asked Michael Mann about this and he told me, “I think it is simply a consequence of signal-to-noise. Boreal summer has a large signal-to-noise ratio because the effects of radiative forcing are relatively large compared to those of internal atmospheric dynamics. Winter on the other hand tends to be dominated by synoptic and planetary-scale dynamics, meaning the signal of forcing is buried in more noise.”

Makes sense and the data shows this.

So let’s use June to predict 2014. Running all the data from GISS through a simple regression model, we get this:

[Figure: Annual anomaly regressed on June anomaly, with linear (black) and second-order polynomial (red) fits.]

Yeah, I know, no axis labels. This is just a quick and dirty exercise in Science by Spreadsheet! This is June temperature anomaly on the X axis and annual on the Y. The black regression line has the indicated R-squared and model formula. I added a second order polynomial regression line (in red) to check to see if the ability to predict goes haywire for the higher temperature values (which are also the more recent years). I’m going to say it doesn’t, though if we do a similar model regressing the second half of the year on the first, there is a skew with the higher (and thus later) values:
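The fit itself is a one-liner outside the spreadsheet too. The series below are synthetic stand-ins for the GISS data, and the June value plugged into the prediction at the end is hypothetical:

```python
import numpy as np

# Synthetic June and annual anomaly series (stand-ins for GISS data).
rng = np.random.default_rng(1)
june = np.linspace(-0.3, 0.7, 60) + rng.normal(0, 0.05, 60)
annual = 0.9 * june + 0.02 + rng.normal(0, 0.05, 60)

slope, intercept = np.polyfit(june, annual, 1)   # linear fit
poly2 = np.polyfit(june, annual, 2)              # second-order check

# Predict the year from a hypothetical June anomaly of 0.62.
predicted = slope * 0.62 + intercept
print(round(predicted, 2))
```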

[Figure: Second half of the year regressed on the first half.]

So, I’m reasonably confident that June is a good predictor of the year, though I’m also sure that this method won’t predict the exact ranking for a given year. But we can try it anyway. Here is a list of recent years sorted by how hot they got (using the same data), with 2014 added in as a prediction (the rest of the GAT numbers are observations).

[Table: Recent years ranked by global annual temperature (GAT), with 2014 entered as a prediction.]

Using this table we can see two things. First, it would take only a small difference from the prediction to move 2014 up or down. The average amount the predictions for these years are off is actually large enough to move 2014 up to the third slot, or down to the tenth slot or so, very easily. But given only this prediction, we might expect 2014 to tie as the fifth warmest year (if we round it off) or to be the sixth warmest year, more or less.

This assumes we don’t have warming effects of an El Niño this year. If we don’t, I’m going to guess that 2014 will be about in the middle of the top ten years ever. If we do have an El Niño that affects temperatures during the last few months of the year, we could see a 2014 that is closer to the top of the pile.

That’s my story and I’m sticking to it. Until more data comes along and then I’ll revise as needed, of course.

Here’s a video from Paul Douglas discussing June’s temperature record:

Atlantic Hurricanes and El Niño

I have a little “science by spreadsheet” project for you, concerning the relationship between El Niño and Atlantic hurricanes.

The chance of an El Niño event happening this year seems to go up every few days, with most, perhaps all, climate models suggesting that El Niño will form this Summer or Fall. Climate experts tell us that there are typically fewer hurricanes in the Atlantic during El Niño years. So, I was interested to see how many fewer. Also, there appears to be a different kind of El Niño that happens sometimes, perhaps more often these days as an effect of global warming, which is variously referred to as Modoki or Central Pacific El Niño. The definition of this type of event, and even whether or not it is real, is not well established, but it has been said that the effect of this version of El Niño on Atlantic hurricanes is different.

The data used for this analysis covers the period from 1950 to 2012, simply because that is the range of years for which El Niño and hurricane data are readily available for copy/paste into the spreadsheet. Aside from numbers of hurricanes, we’ll look at the Accumulated Cyclone Energy index (ACE). This is a value calculated from the storms that occur, using measures of wind speed over the life of the storm. Since tropical storms and hurricanes vary in ways not captured by simply counting them, or even by counting them by standard categories (one through five), this measure is a better reflection of overall major storminess in the region. The following figure shows the relationship between ACE and frequency of hurricanes in the Atlantic Basin. Please keep in mind that the clear relationship between these numbers is a given: ACE is calculated, essentially, from Number of Hurricanes together with a measure of hurricane strength, so the same variable (number of hurricanes) is on both axes of the graph. The purpose of this graph is to give an idea of the variation of hurricane frequency around the measurement of overall energy in the system, so this really mainly shows how complex the manifestation of hurricanes in a given season is.
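ACE is simple enough to compute by hand: take each storm’s maximum sustained wind (in knots) at each six-hourly observation while the system is at tropical storm strength or above (35 knots or more), square those values, sum them, and multiply by 10^-4. A sketch with made-up storm tracks:

```python
# Accumulated Cyclone Energy: sum of squared six-hourly maximum
# sustained winds (knots) while at tropical storm strength (>= 35 kt),
# scaled by 1e-4.

def ace(storms):
    """storms: one list of six-hourly wind speeds (knots) per storm."""
    total = sum(w ** 2 for track in storms for w in track if w >= 35)
    return total * 1e-4

# Two invented storm tracks, not real observations.
season = [
    [30, 40, 55, 70, 65, 45, 30],   # briefly reaches hurricane strength
    [35, 45, 40, 30],               # a short-lived tropical storm
]
print(round(ace(season), 2))  # 2.06
```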

[Figure: ACE index versus number of Atlantic hurricanes per year, 1950–2012.]

Now let’s look at the relationship between the number of named Atlantic tropical storms, hurricanes, and major hurricanes, in El Niño, no-El Niño, and CP El Niño years, as well as the ACE.

[Table: Named storms, hurricanes, major hurricanes, and ACE in El Niño, non-El Niño, and CP El Niño years.]

Without bothering with any statistical tests or other mumbo-jumbo (this is Science by Spreadsheet, after all) we can see that the number of named storms, hurricanes, and major hurricanes, as well as the ACE index, are all higher in years that do not have an El Niño. But, it is also apparent (again, no statistical tests) that the difference is not huge. In other words, if you live in a hurricane-susceptible area, and you are thinking that you’re not going to have a problem with hurricanes this year because there will probably be an El Niño, think again. There are still going to be hurricanes. Also of interest is that CP El Niño years, of which there are only a few, are like regular El Niño years, though maybe the reduced number of major hurricanes is a real phenomenon. (Also note that in these data most of the “CP El Niño” years are also El Niño years, but not all, in case you were trying to add up the values of N.)
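A sketch of how the comparison behind that table works, with a handful of made-up seasons standing in for the 1950–2012 record:

```python
import statistics

# Invented seasonal counts, grouped by whether an El Niño was underway.
seasons = [
    {"el_nino": True,  "named": 9,  "hurricanes": 5,  "major": 1},
    {"el_nino": True,  "named": 11, "hurricanes": 6,  "major": 2},
    {"el_nino": False, "named": 15, "hurricanes": 8,  "major": 4},
    {"el_nino": False, "named": 19, "hurricanes": 10, "major": 5},
]

def mean_by(el_nino, field):
    """Average a count across seasons with the given El Niño status."""
    return statistics.mean(s[field] for s in seasons
                           if s["el_nino"] == el_nino)

print(mean_by(True, "named"), mean_by(False, "named"))
```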

Another way of looking at the same data is to ask what percentage of named storms develop into hurricanes, or major hurricanes, under these different ENSO conditions. Here are the percentages:

[Table: Percentage of named storms developing into hurricanes and major hurricanes, by ENSO condition.]

So, just over half of the named storms develop into hurricanes in an average year, regardless of El Niño, and about a quarter into major hurricanes. CP El Niño years seem to show, as we saw above, less development of major hurricanes. But, the total number of these years is small, so this may mean nothing.

Remember last year’s Atlantic hurricanes? No, nobody else does either. It was an anemic year for Atlantic hurricanes. This is attributed to the giant plume of Saharan dust that attenuated tropical storm development in the basin that year. It might be reasonable to say that the number and intensity of hurricanes per year is highly variable for a lot of reasons, and factors such as Saharan dust may have very large impacts on hurricane formation. In other words, the variation introduced into the system by El Niño may be important but not overwhelming.

In order to look at the overlap between El Niño and non El Niño years, I made this frequency histogram:

[Figure: Frequency histogram of named storms per year, El Niño versus non-El Niño years.]

(Note that this frequency histogram uses intervals of 3 storms; the one year on “30” is a year with 28 storms, falling into the interval 27.1 to 30. Science by spreadsheet has its limitations.)

There is a certain amount of overlap. Extremely active Atlantic hurricane seasons seem to occur only in non-El Niño years, over on the right side of the graph, but the distributions of named storm frequency are not separate and distinct. Another way of looking at this is to note that the range of named storms per year for non-El Niño years is 4-20, while the range for El Niño years is 6-18.
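The histogram itself is easy to reproduce. A sketch with invented per-year storm counts and the same width-3 binning:

```python
import numpy as np

# Invented named-storm counts per year, split by El Niño status.
el_nino_years = [6, 8, 9, 11, 12, 14, 18]
other_years = [4, 7, 10, 12, 13, 15, 16, 17, 19, 20]

bins = np.arange(3, 31, 3)   # bin edges 3, 6, 9, ..., 30
h_en, _ = np.histogram(el_nino_years, bins=bins)
h_no, _ = np.histogram(other_years, bins=bins)

print(h_en)   # counts of El Niño years per interval
print(h_no)   # counts of non-El Niño years per interval
```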

Sea surface temperatures influence Atlantic hurricane formation. Here’s a graph from someone else’s spreadsheet showing this relationship:

“A graph showing the correlation between the AMO index and the number of major hurricanes which form in the Atlantic basin. Moving averages for AMO are by the years’ average indexes, 5 years before and 5 years after, not the provided 121-month smoothing.”

Clearly, a large proportion of hurricane frequency is explained by variations in sea surface temperature. Clearly, Saharan dust explains some of the variation. El Niño also explains some of the variation, but it is only part of the story.

AMO-Hurricane graph
El Niño year data
Tropical storm data
CP El Niño data