On June 12th, 85 percent of eligible Iranian voters cast a presidential ballot; on June 13th, many of these same citizens took to the streets to protest the apparent reelection of Ahmadinejad. The final vote tally, as reported by Juan Cole, a prominent Middle East expert and History Professor at the University of Michigan, is below:
So here is what Interior Minister Sadeq Mahsouli said Saturday about the outcome of the Iranian presidential elections:
"Of 39,165,191 votes counted (85 percent), Mahmoud Ahmadinejad won the election with 24,527,516 (62.63 percent)."
He announced that Mir-Hossein Mousavi came in second with 13,216,411 votes (33.75 percent).
Mohsen Rezaei got 678,240 votes (1.73 percent)
Mehdi Karroubi with 333,635 votes (0.85 percent).
He put the void ballots at 409,389 (1.04 percent).
Source: Stealing the Iranian Election via JuanCole.com
Despite the veil of electoral authenticity, rather large anomalies have been identified. Juan Cole quickly provided circumstantial evidence, while the academic folks took a little more time to complete their peer-reviewed papers. A consensus has emerged; even Iranian State TV has acknowledged discrepancies in the election.
The purpose of this article is to invalidate one of the preliminary arguments for election fraud in Iran. That argument came in the form of a graph popularized by The Atlantic columnist Andrew Sullivan. A composite of the original graphs is presented below; the multiple colors depict different perspectives on the same data set:
Andrew Sullivan posted multiple versions of this graph to his blog on June 13th. From the various versions it became clear that the data source was consistent, but the application varied. Iran's Entekhab News and the web-based TehranBureau.com both used the election results provided by JameJamOnline.ir to create their graphs, which Sullivan later referenced.
The percent of the vote reported at each coordinate is calculated with respect to the final two-way vote total and is overlaid near its associated coordinate. It is also important to note that there are two data sets: one is blue and has six dots, while the other is red and has seven dots; six of the red dots are hidden exactly behind the six blue dots.
I will now provide four additional facts which have not been explicitly stated; these facts are crucial either to the creation or to the subsequent interpretation of the original graphs:
1. Ahmadinejad's vote total is represented by the X-axis and Mousavi's vote total by the Y-axis.
2. Entekhab News plotted [PNG] seven data points, while TehranBureau's graph [PNG] excluded the first data point and used the other six.
3. The regression technique is a linear least-squares approximation that is not forced through the origin. Ideally, the linear equations should pass through the origin, because at some point in time, before any votes have been counted, both candidates have zero votes.
4. The original source JameJamOnline.ir is written in Farsi, a language I cannot read; because of this, the coordinates for the data points were not explicitly available. TehranBureau provided [PNG] coordinates for the six data points they used, but the first point used by Entekhab News is still unavailable. It was however possible to use the least-squares equation depicted on their graph and the other six points to determine a very reliable estimate[*] for the first by reversing the regression. The coordinates used on the above graph are presented below:
Report %    Ahmadinejad    Mousavi       Two-Way Total
12.98*      3,469,534      1,429,332     4,898,867
26.45       7,027,919      2,955,131     9,983,050
39.37       10,230,478     4,628,912     14,859,390
54.55       14,011,664     6,575,844     20,587,508
62.10       15,913,256     7,526,117     23,439,373
66.50       16,974,382     8,124,690     25,099,072
72.15       18,302,924     8,929,232     27,232,156
Final       24,527,516     13,216,411    37,743,927
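The "reversing the regression" step mentioned in fact #4 can be sketched as follows. This is only one plausible way to do it, not necessarily the exact procedure used for the table: a least-squares line through n points satisfies two normal equations, so if the slope, the intercept, and all but one of the points are known, the missing point's two coordinates are the only unknowns. The function names are illustrative, and the round trip below uses exact fractions; a slope and intercept read off a published graph would be rounded, which is why the recovered point is an estimate.

```python
from fractions import Fraction

# Six points published by TehranBureau (x = Ahmadinejad, y = Mousavi), from the table above.
KNOWN = [
    (7_027_919, 2_955_131),
    (10_230_478, 4_628_912),
    (14_011_664, 6_575_844),
    (15_913_256, 7_526_117),
    (16_974_382, 8_124_690),
    (18_302_924, 8_929_232),
]


def fit_line(points):
    """Ordinary least squares y = m*x + b, not forced through the origin."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = Fraction(n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b


def recover_missing_point(known, m, b):
    """Solve the normal equations of the (len(known) + 1)-point fit for the one unknown point."""
    n = len(known) + 1
    sx = sum(x for x, _ in known)
    sy = sum(y for _, y in known)
    sxx = sum(x * x for x, _ in known)
    sxy = sum(x * y for x, y in known)
    # From sum(y) = m*sum(x) + n*b:  y0 = m*x0 + c
    c = m * sx + n * b - sy
    # Substituting y0 into sum(x*y) = m*sum(x^2) + b*sum(x) leaves a linear equation in x0.
    x0 = (m * sxx + b * sx - sxy) / (c - b)
    y0 = m * x0 + c
    return x0, y0


# Round trip: fit all seven tabulated points, then recover the first from the other six.
first = (3_469_534, 1_429_332)
m, b = fit_line([first] + KNOWN)
x0, y0 = recover_missing_point(KNOWN, m, b)
print(x0, y0)
```

The division by (c - b) fails only if the missing point lies exactly on the fitted line, in which case the line carries no information about where along it the point sits.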
Using the data from fact #4, it becomes clear that the graphs encompass only about 45% and 60% of the total vote for the six and seven point versions respectively. The entire analysis takes place within this region; the respective linear correlations are only valid within these ranges.
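As a quick check of those window sizes, the arithmetic can be redone from the two-way totals in the table. This is a small sketch; the variable names are mine, and the first total is the starred estimate.

```python
# Two-way totals (Ahmadinejad + Mousavi) from the table above; the first is the estimated point.
FINAL_TWO_WAY = 37_743_927
TWO_WAY = [
    4_898_867, 9_983_050, 14_859_390, 20_587_508,
    23_439_373, 25_099_072, 27_232_156,
]


def percent_reported(two_way, final=FINAL_TWO_WAY):
    """Share of the final two-way vote that had been reported at this data point."""
    return 100 * two_way / final


percents = [percent_reported(t) for t in TWO_WAY]
seven_point_window = percents[-1] - percents[0]  # first to last of all seven points
six_point_window = percents[-1] - percents[1]    # TehranBureau dropped the first point

print(f"six-point window:   {six_point_window:.1f} percentage points")
print(f"seven-point window: {seven_point_window:.1f} percentage points")
```

Both results should land near the 45% and 60% figures quoted above.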
Sullivan initially referenced the Entekhab News version, but it was not, and still is not, useful due to the language barrier; Sullivan would later reference the English analysis by TehranBureau.com. Judging from its about page, TehranBureau.com shares strong ties with the Columbia Journalism School and features a slew of qualified contributors. Muhammad Sahimi, a chemical engineer and TehranBureau contributor, provided the following analysis of the six point graph:
The vertical axis (y) shows Mr. Mousavi's votes, and the horizontal (x) the President's [Ahmadinejad]. R^2 shows the correlation coefficient: the closer it is to 1.0, the more perfect is the fit, and it is 0.9995, as close to 1.0 as possible for any type of data.
Statistically and mathematically, it is impossible to maintain such perfect linear relations between the votes of any two candidates in any election - and at all stages of vote counting. This is particularly true about Iran, a large country with a variety of ethnic groups who usually vote for a candidate who is ethnically one of their own.
Source: Faulty Election Data via TehranBureau.com
[The referenced article has since been removed from TehranBureau.com]
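For readers who want to recompute the quoted R^2, here is a minimal sketch using the coordinates tabulated earlier. For a straight-line fit with an intercept, R^2 equals the squared Pearson correlation between the two vote totals, so no explicit regression is needed. The function name is mine, and the first point is the estimate marked with an asterisk, so the seven point result inherits that uncertainty.

```python
# (Ahmadinejad, Mousavi) from the table of coordinates; the first point is the estimate.
IRAN = [
    (3_469_534, 1_429_332),
    (7_027_919, 2_955_131),
    (10_230_478, 4_628_912),
    (14_011_664, 6_575_844),
    (15_913_256, 7_526_117),
    (16_974_382, 8_124_690),
    (18_302_924, 8_929_232),
]


def r_squared(points):
    """R^2 of a least-squares line with intercept, computed as the squared correlation."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy * sxy / (sxx * syy)


r2_seven = r_squared(IRAN)      # Entekhab News version
r2_six = r_squared(IRAN[1:])    # TehranBureau version, first point excluded

print(f"six-point R^2:   {r2_six:.4f}")
print(f"seven-point R^2: {r2_seven:.4f}")
```

The two values should come out close to the .9995 and .9986 figures attributed to TehranBureau and Entekhab News in this article.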
Muhammad Sahimi's assertion is not well supported and offers no real justification, especially given the 45% window of relevance. In fact, his "impossible" claim appears to be baseless when compared to relevant data from the 2008 Presidential Election in the USA. I intend to reproduce the high R^2 value using real-time data I collected on November 4th, 2008. The reported vote totals from each state were queued for download about 400 times an hour from MSNBC.com; data was collected in a circular queue as fast as possible. This does not mean that I have a complete set of data; networking and storage issues created significant discontinuities within the data, especially as the night progressed. MSNBC was used as the source because it was the only website that presented the election results as pure HTML; CNN, CBS, et al. used an asynchronous reporting scheme that prevented the automated retrieval of their reported election results. Using some of this data, primarily from the East Coast, I will show that a linear trend with a very high R^2 is the expected outcome of such a graph.
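The collection script itself is not reproduced in this article, so the following is only a guess at the general shape of a "circular queue" collector. The names poll_round_robin, fetch and store are hypothetical, and the page download is injected as a function so that nothing here depends on a real URL or page layout.

```python
import itertools
import time


def poll_round_robin(states, fetch, store, max_requests, delay_seconds=0.0):
    """Cycle through the states endlessly, saving one timestamped snapshot per request.

    fetch(state) returns the raw page for a state; store(state, timestamp, page) saves it.
    Returns the number of requests that failed and were skipped.
    """
    failures = 0
    for count, state in enumerate(itertools.cycle(states)):
        if count >= max_requests:
            break
        try:
            store(state, time.time(), fetch(state))
        except OSError:
            # A network or storage hiccup leaves a gap in that state's series.
            failures += 1
        if delay_seconds:
            time.sleep(delay_seconds)
    return failures


# Dry run with a fake fetcher instead of a real download.
snapshots = []
failed = poll_round_robin(
    ["KY", "VA"],
    fetch=lambda state: f"<html>{state}</html>",
    store=lambda state, stamp, page: snapshots.append((state, page)),
    max_requests=5,
)
print(failed, [state for state, _ in snapshots])
```

Skipping a failed request rather than retrying keeps the queue moving, at the cost of the kind of discontinuities described above.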
Let's begin by analyzing Kentucky, one of the first states to report results. The state of Kentucky lies across the Eastern and Central time zones; about half the state's polls closed at 6 ET and the other half at 7 ET. I began collecting data from all states at around 6:40 CT. The graph below depicts each US candidate's vote total as a function of the time at which it was recorded, and it looks decidedly non-linear, as many would expect:
The graph above is simply intended to illustrate the discontinuities and imperfections of our data set in a more logical format. It is not supposed to resemble the Iran graph; the version intended for comparison, using Kentucky data, is presented below:
From simple inspection, the Kentucky graph appears to be reasonably linear, clearly showing a strong similarity to the Iran graph of internet lore. Although the R^2 is slightly lower than its TehranBureau counterpart, an R^2 value of .9995 remains plausible. I would argue that Kentucky represents an acceptable microcosm of "ethnic groups," but other factors may be at play. Kentucky may be the norm or it may be the exception; the only way to find out is to analyze more data. I conducted the same analysis using Virginia's data; first, the votes vs. time graph from Virginia for a glimpse at our data set:
The Virginia data is clearly smoother than its Kentucky equivalent, but the curves resemble the same general form. I looked at a number of other states and the same general shape held across geographic and demographic borders. We'll now explore the relation between each candidate's vote totals in Virginia:
The Virginia graph seems to support the linear trend we saw in the Kentucky graph, but again the R^2 value is slightly lower than our target. This discrepancy can likely be attributed to the large number of points plotted, around 1,500, in the preceding graphs. If we were to strictly adhere to our four previously stated facts, specifically by using just six or seven points, we could probably achieve higher R^2 values. Let's go ahead and do that now.
The observations I made earlier will now play an important role in definitively disproving the "impossibility." Let's first begin by establishing the various threshold reporting levels for Kentucky and Virginia with respect to the original:
Kentucky:
Report %    Time CT    McCain       Obama        Two-Way Total
16.76       18:40      173,406      126,564      299,970
26.49       19:12      268,616      205,480      474,096
36.00       19:32      371,124      273,229      644,353
53.03       20:02      532,940      416,231      949,171
62.09       20:17      621,411      489,848      1,111,259
64.64       20:24      642,008      514,837      1,156,845
80.57       20:27      818,572      623,366      1,441,938
Final                  1,043,264    746,510      1,789,774
Virginia:
Report %    Time CT    McCain       Obama        Two-Way Total
13.38       19:02      278,094      214,706      492,800
26.53       19:32      547,199      430,212      977,411
40.32       19:52      780,552      705,180      1,485,732
56.49       20:17      1,049,451    1,031,789    2,081,240
63.84       20:42      1,179,737    1,172,437    2,352,174
67.88       20:52      1,251,123    1,250,040    2,501,163
72.36       21:07      1,328,103    1,338,087    2,666,190
Final                  1,726,053    1,958,370    3,684,423
Some rough extrapolations must be done to satisfy these thresholds; there are several ways to do this, but the two-way vote total was chosen as the measuring stick. When the distribution of the data resulted in two points equally spaced from the intended threshold, the larger percentage was used. This is not a perfect scenario, but it should still serve to facilitate an unbiased result. If you don't like my methodology you can download the data in CSV format at the end of this article and make your own rules.
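The selection rule just described can be sketched as below. This is my reading of the rule rather than the script actually used: for each target percentage, take the recorded sample whose two-way total is closest to that share of the final two-way total, and on an exact tie prefer the larger percentage. The function names and the tiny sample list are made up for illustration.

```python
def percent_reported(sample, final_two_way):
    """sample is (time_label, candidate_a_votes, candidate_b_votes)."""
    return 100 * (sample[1] + sample[2]) / final_two_way


def pick_threshold_samples(samples, final_two_way, targets):
    """Return, for each target percent, the sample nearest to it by two-way total."""
    picked = []
    for target in targets:
        def rank(sample):
            pct = percent_reported(sample, final_two_way)
            # Primary key: distance from the target. Tie-break: the larger percentage wins.
            return (abs(pct - target), -pct)
        picked.append(min(samples, key=rank))
    return picked


# Hypothetical samples, not taken from the CSV files.
samples = [
    ("18:40", 60, 40),     # 10% of the final two-way total
    ("18:50", 80, 60),     # 14%
    ("19:00", 150, 110),   # 26%
    ("19:10", 300, 200),   # 50%
]
chosen = pick_threshold_samples(samples, final_two_way=1000, targets=[12, 26])
print([label for label, _, _ in chosen])
```

In this toy run the 12% target is equally far from the 10% and 14% samples, so the 14% sample is chosen, followed by the exact match at 26%.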
The composite six and seven point graphs for Kentucky, Virginia, Michigan and Pennsylvania are presented below with strict adherence to the original's methodology:
The Kentucky data set is by far the most variable of the four states depicted and the threshold percentages also have the largest error relative to their corresponding target. Unfortunately, Kentucky is unable to provide definitive evidence, in terms of the R^2 value, to entirely vacate the "impossible" claim. Virginia is our next stop:
The R^2 value associated with the six point regression, .9996, is higher than the R^2 value of .9995 associated with the TehranBureau graph. The seven point R^2 value is however lower than the Entekhab News value of .9986. This unarguably debunks Muhammad Sahimi's assertion of statistical and mathematical impossibility. Such an outcome is very possible, perhaps even probable.
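The Virginia figure can be rechecked from the threshold rows listed earlier (the table whose final two-way total is 3,684,423). The sketch below fits a least-squares line with an intercept and computes R^2 as one minus the ratio of residual to total variation. Which candidate sits on which axis is my assumption, but for a straight-line fit it does not change R^2. The function name is illustrative.

```python
# Virginia threshold rows from the table above: (McCain, Obama).
VIRGINIA = [
    (278_094, 214_706),
    (547_199, 430_212),
    (780_552, 705_180),
    (1_049_451, 1_031_789),
    (1_179_737, 1_172_437),
    (1_251_123, 1_250_040),
    (1_328_103, 1_338_087),
]


def r_squared_from_residuals(points):
    """Fit y = slope*x + intercept by least squares, then return 1 - SS_res / SS_tot."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in points)
        / sum((x - mean_x) ** 2 for x, _ in points)
    )
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


va_seven = r_squared_from_residuals(VIRGINIA)
va_six = r_squared_from_residuals(VIRGINIA[1:])  # drop the first point, as TehranBureau did

print(f"Virginia six-point R^2:   {va_six:.4f}")
print(f"Virginia seven-point R^2: {va_seven:.4f}")
```

The six point value should come out close to the .9996 quoted above.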
Virginia is also reasonably representative of the urban/rural population split in Iran. Virginia has an urban population of 72.9% according to the 2000 Census, while Iran's urban population is 68% according to its 2006 Census. Dissimilarities do however remain, including the margin of victory and the total number of votes cast; and while this may not be an ideal comparison, the aspect of impossibility has been erased. On to Michigan:
The Michigan graph exceeds our seven point target with an R^2 of .9991, but it fails to match the six point result put forth by TehranBureau. The urban population of Michigan, at 75.5%, is also fairly close to Iran's. We have yet another example proving the possibility of such a correlation. Pennsylvania continues the trend:
Pennsylvania's six point R^2 matches the TehranBureau mark, while its seven point R^2 falls just short of the .9986 needed to equal the value presented by Entekhab News. Pennsylvania is however more urban than Iran by about 10%; but given that this is now the third state with an R^2 in excess of, or equal to, a value claimed by an Iranian source, the presence of a linear correlation is irrelevant to the possibility of election fraud.
Although each of the R^2 values was matched individually for the six and seven point data sets, I never ran into a state that met or exceeded both R^2 values. This lack of repeatability may be significant, but based upon the preceding work, it's likely just a case of random coincidence and inconsistent data.
The bottom line is this: a linear relationship between two candidates' vote totals is the expected correlation. The direct result of this research does not however prove or disprove election fraud; it simply invalidates the linear correlation metric as a means of identifying fraud.
And finally, the real-time data as promised:
Kentucky: [CSV, 49KB]
Virginia: [CSV, 60KB]
Michigan: [CSV, 56KB]
Pennsylvania: [CSV, 62KB]
If you do anything useful or interesting with this data, please let me know.