Similarity

09 Jul 2008 08:39 am

One popular vein of analysis in sports commentary is doing "similarity scores" for different athletes. By analyzing the past career of a current player and seeing which older players he's similar to, you can glean information about his likely future trajectory. Nate Silver, who mainly does quantitative sports analysis though he's recently become known for his political blogging, has done something similar for American states based on a variety of political and demographic factors.

It's an interesting exercise and shows, among other things, that there's less similarity out there than I might have thought. A number of states, including very large ones like Florida and Texas, are essentially unique by this standard, and the closest any pair gets (North Carolina and South Carolina) is 71 out of 100.

Comments (8)

The dimensions of similarity he's chosen, though, are a little more arbitrary than in baseball. It's obvious that one way to measure the similarity of position players is by how many home runs they hit. Percentage of same-sex households is a little less obvious. (Nate found those variables useful for predicting Obama versus Clinton outcomes in the primary, but it's something of a leap to apply them to the general.) This is his first stab at applying PECOTA to states, and it's a good one, but it's still just a stab at this point.

In other words, don't put too much stock in the actual numbers until the model's been refined after an election or two (or at least a month or two).

As a Connecticut native (though now living in New York) I quite easily see the similarity with New Jersey. Both states have an essentially suburban character, being partly within the NYC commuting zone; both are predominantly Democratic; racial breakdowns are in the same general range; and income levels and distributions also seem quite similar.

I like how Colorado turns up as the quintessential mid-size, reasonably affluent, disproportionately white, non-NY/DC-suburb state.

Isn't this just a stupid way of doing regression?

As an (almost) life-long North Carolinian, I find it bizarre that NC and SC are seen as the most similar states in the nation. Details here may be more than anyone wants to know, but they illustrate the point that the similarity scores are very raw in their development (at least as general election predictors). For a lot of the reasons I list below, I would say NC is much more like VA than SC.
I grew up in Asheville in western NC - a place turning into the Santa Fe of the East and extremely culturally liberal by Southern standards. Two of our three TV stations came from upstate SC, an area that is at the Alabama level politically, so much so that my wife and I ruled it out as a place to live specifically because of the politics.
The Research Triangle area is without question the least "Southern" major metro area in the South (accepting the conventional Southern definition that Florida south of Gainesville is not in the South at all). There is nothing remotely like that in SC. Even the major metro area split between the states (Charlotte) contributes to the difference - SC picks up a lot of the very conservative suburbs, and NC gets all the city.
The black proportion of the population is about 40% higher in SC (31% vs 22%) and this has huge impacts. SC goes way over the level where higher black population seems to result in much lower white support for Democrats, and Republicans mostly dominate state-level politics. The exception is a large area where the black population is high enough to virtually guarantee Democratic victory locally. NC is famous for how convoluted the congressional districts were to create two black-majority districts. This wasn't just partisan protection - there is no large area in the state where the African-American percentage is high enough to create simple AA-majority districts. But there is one whole CD in the west that is virtually all-white, and there could easily be a second almost as white if the lines were drawn that way. Although presidential elections are rarely competitive, Democrats mostly control everything of significance at the state level and, if Kissell beats Hayes, would even have a majority of the white-majority CDs.

Geographical proximity is one of Nate's dimensions. He doesn't say how he does this, but if it is based on the geographical center of the state, this would bias strongly in favor of higher similarity scores on that dimension for eastern states, since there is just no way for the large states out west to be as close to each other.
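
To make the concern concrete, here is a minimal Python sketch. It assumes, as the comment speculates and the article does not confirm, that proximity is measured between state geographic centers. The coordinates are rough approximations supplied purely for illustration, and `center_distance_km` is a hypothetical helper, not anything from Nate's model.

```python
import math

# Rough, illustrative geographic centers (latitude, longitude in degrees).
# These are approximations supplied for this sketch, not values from the article.
CENTERS = {
    "NC": (35.5, -79.4),
    "SC": (33.9, -80.9),
    "MT": (47.0, -109.6),
    "WY": (43.0, -107.5),
}

EARTH_RADIUS_KM = 6371.0


def center_distance_km(a, b):
    """Great-circle (haversine) distance between two state centers, in km."""
    lat1, lon1 = map(math.radians, CENTERS[a])
    lat2, lon2 = map(math.radians, CENTERS[b])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Both pairs share a border, yet the western pair's centers sit farther apart.
print("NC-SC:", round(center_distance_km("NC", "SC")), "km")
print("MT-WY:", round(center_distance_km("MT", "WY")), "km")
```

With these approximate inputs, the Montana-Wyoming distance comes out noticeably larger than the North Carolina-South Carolina one, even though both pairs are neighbors. A center-to-center measure would therefore score the eastern pair as "closer" for no reason other than state size.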

Here is the same idea extended:

http://ksghome.harvard.edu/~aabadie/ccs.pdf

It's not just regression - it's more like trying to build a control group to match the treatment group, rather than just assuming no omitted variables and that all observable characteristics have a linear effect on the outcome of interest.

I think there's less to the apparent lack of similarity than meets the eye. First, any analysis that uses 19 separate measures to evaluate the similarity/difference of the states is bound to identify a lot of areas of difference. The 19 things are bound to be somewhat arbitrary.

Second, you have to read the linked article in some detail to catch this, but zero is not the bottom of the similarity scale - "negative similarity" is possible. For some reason, the author set all the negative scores to zero, which is rather misleading. If we can assume that the scale actually runs from -100 to +100 (the article doesn't say how dissimilar states can be, but it's a fair guess), then a score of 71 for NC/SC is really a pretty close match.
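
As a quick check on that arithmetic, here is a small Python sketch. It rests entirely on the guess above that the underlying scale runs from -100 to +100, which the article does not state, and `rescale_to_percent` is a hypothetical helper name.

```python
def rescale_to_percent(score, low=-100.0, high=100.0):
    """Map a score on [low, high] onto a 0-100 scale.

    The default bounds are an assumption: the article does not say
    how negative a similarity score can get.
    """
    return 100.0 * (score - low) / (high - low)


# The NC/SC score of 71, re-expressed on a 0-100 scale.
print(rescale_to_percent(71))  # 85.5
```

Under that assumption, a raw score of zero sits at the midpoint of the scale rather than at the bottom, and 71 lands about 85 percent of the way from "as different as possible" to "identical."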

And can I bitch once again about terrible graphics (in the linked article)? Yet again, we have a table filled with mystery numbers and colors. The colors are not explained anywhere that I can find, although after staring at the chart for a lot longer than I'm usually willing to do, I figured out that they symbolized geographic regions. To understand the numbers, you have to read a good bit of the article. My God, people, is a legend so hard to do?


Comments closed July 23, 2008.

Copyright © 2008 by The Atlantic Monthly Group. All rights reserved.