By: TJHalva | Comments [0] | Category: Methodology | 10/4/2008 3:25:56 AM CT
This article focuses on the methods and transparency provided by this site, VoteForAmerica.net, in regard to our electoral projections. If you have ever questioned the impartiality of our mathematical methods, you should continue reading. We are able to separate fact from opinion.
The ultimate purpose of these projections is to predict what will happen in the future based on publicly available information. If personal judgment is injected into a mathematical model, the model quickly degenerates into a personal opinion in disguise. The central issue arises when a mathematical facade is used to promote one's own agenda, despite the stated intent; this process of deception is wrong.
My opinion may differ from yours, and you may think I'm biased, which is fine, but the projections themselves leave no room for bias: the calculation is fully specified. You may question my methodology, but given that methodology, you cannot question the result.
I have created an enormous level of transparency regarding my calculations, beyond that of any other projection site. Our methodology should provide the details necessary to duplicate our results, but I'm going to provide an additional resource. Through the use of KyPlot (Download KyPlot 2 B15) and MathCad I have compiled sample calculation worksheets that validate our approach using polling data from Iowa. The image below shows the output of a Local Regression fit in KyPlot using our parameters. You'll notice that the end points (in the thick black rectangles) exactly match our results.
[Screenshot: KyPlot Local Regression output for Iowa, with the endpoint values boxed]
Our polling graph from Iowa taken on October 4th; observe the equivalence of our projection and that presented in the KyPlot screenshot above.
[Image: VoteForAmerica.net Iowa polling graph, October 4th]
The KyPlot file used to create this plot is available for download; screenshots taken directly from KyPlot are available for McCain and Obama.
The MathCad worksheet produces an identical result, but provides a more thorough example of the applied mathematics of Local Regression. The files for both approaches are available for download; the KyPlot approach offers an easier and quicker validation, while the MathCad approach provides a more in-depth analysis.
There should now be absolutely no question that our published numbers follow from our stated method; if only other sites provided such transparency.
RealClearPolitics provides polling projections based on simple averages. Their results are very easy to check, but the method by which they collate data and arrive at those results is very questionable. There is no publicly available document that explicitly lists their criteria for including a poll in the average. It took me just thirty seconds of scanning their state tables to find an inconsistency in methodology. Comparing the Virginia, Minnesota and Michigan pages highlights it. On the Minnesota page, four polls are included in the average, and the first excluded poll, which shows Obama with a two-point lead, has an end date of 9/17. The Virginia page includes five polls in the average, with the first excluded poll ending on 9/25. Michigan deviates even further, with eight polls included and the first excluded poll ending on 9/21. Judging by these three states there is no discernible pattern; I'm not saying one doesn't exist, but I have no idea what it is. Until this information is published, the quality of RealClearPolitics' averages should be questioned.
FiveThirtyEight.com includes an excellent methodology page, but neglects to specify the bandwidth or degree of its Local Regression method. The bandwidth determines the tightness of fit: if the bandwidth is very low, the trendline bends to fit the most recent data; if it is high, a more gradual trend is approximated from a larger, older subset of polling data. The bandwidth could be altered on a state-by-state basis to tailor the result to a specific agenda. The degree has a negligible effect on the result, but is vitally important in duplicating one. VoteForAmerica.net uses a bandwidth of 15 and a degree of 3 for all states, with some exceptions.
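For anyone who wants to experiment outside KyPlot or MathCad, here is a minimal Python sketch of a local polynomial regression. It assumes the classic tricube LOESS kernel and a nearest-neighbor bandwidth of 15 polls; the exact weighting we use is defined in our Local Regression PDF, so treat this as an illustration of the technique rather than a copy of the site's code.

```python
import numpy as np

def local_regression(y, bandwidth=15, degree=3):
    """Local polynomial trendline over polls ordered by end date.

    Polls are treated as equally spaced (x = 0, 1, 2, ...), mirroring the
    site's convention of discarding the elapsed time between end dates.
    """
    n = len(y)
    x = np.arange(n, dtype=float)
    y = np.asarray(y, dtype=float)
    fitted = np.empty(n)
    for i in range(n):
        # Use the `bandwidth` polls nearest to poll i.
        nearest = np.argsort(np.abs(x - x[i]))[:min(bandwidth, n)]
        xs, ys = x[nearest], y[nearest]
        # Tricube weights: nearby polls count for more, distant polls fade out.
        d = np.abs(xs - x[i])
        dmax = d.max() if d.max() > 0 else 1.0
        w = (1.0 - (d / dmax) ** 3) ** 3
        # Weighted least-squares polynomial fit; sqrt because np.polyfit
        # weights the unsquared residuals.
        deg = min(degree, len(xs) - 1)
        coeffs = np.polyfit(xs, ys, deg, w=np.sqrt(w))
        fitted[i] = np.polyval(coeffs, x[i])
    return fitted
```

Under these assumptions, the fitted value at the last index plays the role of the current projection, analogous to the boxed endpoints in the KyPlot screenshot above.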
FiveThirtyEight.com also computes win probabilities by running 10,000 simulations of a mathematically valid Monte Carlo method. The problem stems from the relatively small number of simulations. In my experience, 10,000 simulations provide reasonable accuracy when applied to 51 events, but are by no means authoritative. The result itself is random, although it falls within a reasonable window of accuracy. I may be wrong about the convergence; 10,000 may be enough, but FiveThirtyEight.com has never directly addressed the question. To eliminate this issue entirely, VoteForAmerica.net uses the Cumulative Distribution Function to arrive at an exact result.
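To illustrate the difference, here is a small Python comparison for a single hypothetical state. The mean, standard deviation and 50% threshold are made-up numbers, and the normal distribution is an assumption for illustration:

```python
import numpy as np
from math import erf, sqrt

mu, sigma = 52.0, 2.0   # hypothetical projected share and standard deviation
threshold = 50.0        # share needed for a majority

# Closed-form answer from the normal CDF: always the same number.
exact = 0.5 * (1.0 + erf((mu - threshold) / (sigma * sqrt(2.0))))

# Monte Carlo estimate with 10,000 draws: changes on every run.
draws = np.random.default_rng().normal(mu, sigma, size=10_000)
estimate = (draws > threshold).mean()

print(f"CDF: {exact:.4f}   Monte Carlo: {estimate:.4f}")
```

With 10,000 draws the Monte Carlo figure typically wanders by a few tenths of a percentage point from run to run (its standard error is roughly sqrt(p(1-p)/10,000)); that run-to-run randomness is exactly what the CDF approach removes.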
Pollster.com provides the service most similar to our own but, like FiveThirtyEight.com, does not disclose its bandwidth or degree. Pollster.com offers very little transparency about its methods, as its FAQ page is still under construction. Overall, Pollster.com suffers from nearly the same shortcomings as FiveThirtyEight.com.
As a final comment: I do not necessarily believe that any of these sites manipulate their results to achieve a certain end. I am simply stating that it is impossible to know, due to a lack of transparency.
By: TJHalva | Comments [0] | Category: Methodology | 9/15/2008 5:58:55 PM CT
When I collect polls I attempt to classify them into three categories: non-partisan, partisan, and polls that are simply not valid for whatever reason. I add all polls to the database, but not all polls are used in the calculations. Partisan polls, as the name implies, were either commissioned directly for a certain candidate or were performed by a pollster with a known bias; they are suffixed with a (D) or an (R) depending on the party affiliation. An invalid poll is essentially converted to a partisan poll (the algorithm only differentiates between non-partisan and partisan polls), with a (D) or an (R) assigned arbitrarily based on whatever criteria I choose, bearing in mind that the choice of (D) or (R) makes no difference to the math behind the model.

Occasionally a pollster will provide both a Likely Voter demographic and a Registered Voter demographic; when this choice presents itself, the Likely Voter result is used. There are also occasions where a pollster releases separate results for different wordings or phrasings of a question; in this situation I average the separate results, assuming the alternatives are comparably equivalent. Pollsters sometimes release results with and without "leaners"; I use the "leaner" result. Pollsters may also include third-party candidates such as Bob Barr (L); I use the result that includes the third-party candidate when available, unless that candidate is not on the ballot in the given state.
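As a concrete (and simplified) sketch of those preferences, something like the following; the tags and schema are hypothetical, since the actual classification is done by hand:

```python
def select_result(releases, third_party_on_ballot):
    """Choose which released numbers to record, following the rules above.

    `releases` is a hypothetical list of dicts, one per result a pollster
    published, tagged with the options used; all field names are illustrative.
    """
    pool = releases
    # Prefer Likely Voter samples over Registered Voter samples.
    lv = [r for r in pool if r.get("sample") == "LV"]
    pool = lv or pool
    # Prefer results that push "leaners".
    leaners = [r for r in pool if r.get("leaners")]
    pool = leaners or pool
    # Use the third-party matchup only if that candidate is on the ballot.
    tp = [r for r in pool if r.get("third_party") == third_party_on_ballot]
    pool = tp or pool
    # If several question wordings survive, average their numbers.
    names = pool[0]["numbers"].keys()
    return {n: sum(r["numbers"][n] for r in pool) / len(pool) for n in names}
```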
When recording polls I use the end date of the poll's sampling window. I prefer the end date to a median or start date because it is much easier to interpret end points on the graph than to extrapolate meaning from a median or start point. The Local Regression method chronologically arranges the polls by end date and discards the elapsed time in between. For example, two polls with end dates of 6/20/08 and 8/15/08 are treated merely as poll 1 and poll 2; the end dates have no significance other than to order the polls. I discard the dates for two reasons. First, I assume a poll holds true until superseded by a more recent poll, at which point the older poll receives a lower weight in the calculation. Second, processing time: I would ideally like to compute a range of influence for each poll and a regression value for every day (instead of a regression value only for days on which a poll is released, as is currently done), but that simply cannot be done in a reasonable amount of computing time while maintaining the dynamic nature of the site.
Moving on to our state graphs: non-partisan polls are represented with a filled-in circle, while partisan (and therefore invalid) polls are denoted by empty centers. A blue circle corresponds to Obama, red to McCain, and purple to a tie. The thick colored lines are the results of the Local Regression algorithm and the dashed lighter lines are the variances. If there is a lot of fluctuation between successive polls, the variance will be larger; if the polls tend to follow a similar path, it will be smaller. I wrote a nice PDF describing the entire Local Regression method. The arrows to the right of each candidate's name are currently meaningless, but will hopefully have a use fairly soon. The percentage in parentheses after the status (more on this later) corresponds to the flipping point calculation.
The colors used on the maps, tables and graphs are all based on the status of a given state. The status is defined in three-point increments: a state where the candidates are within three points of each other is a Toss Up; a lead between three and six points makes the state Weak for the leading party (e.g. Weak Dem if Obama leads); a lead between six and nine makes it Core (e.g. Core Rep if McCain leads); and a lead larger than nine points makes the state Safe for the given party.
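Expressed as a quick sketch (the tier names follow the examples above; treat the exact boundary handling, such as whether a lead of exactly three points is a Toss Up, as an assumption):

```python
def state_status(obama_pct, mccain_pct):
    """Classify a state by projected margin, in three-point increments."""
    margin = obama_pct - mccain_pct
    party = "Dem" if margin > 0 else "Rep"
    lead = abs(margin)
    if lead < 3:
        return "Toss Up"
    if lead < 6:
        return f"Weak {party}"
    if lead < 9:
        return f"Core {party}"
    return f"Safe {party}"
```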
The entire site is dynamically generated: whenever I add a poll to the database, all the maps, images and tables instantly reflect the new values. I try to update the "Poll Update" article for the given day whenever I add a new poll, but this isn't always possible. Luckily, the "Latest Polls" side panel reads from the database, so you can always see which new polls are included in the calculations.
I also do not currently include tracking polls, because their sheer number would outweigh the other, more thoroughly conducted national polls. I eventually plan to add a national tracking poll section separate from the other national polls currently collected.
The entire mathematical procedure uses only polling data; demographic information is not taken into account. I leave that to the pollsters. The algorithm does have a known weakness, however: it cannot predict abnormalities in voting patterns that may be the result of foul play.
By: TJHalva | Comments [0] | Category: Methodology | 8/12/2008 11:20:56 AM CT
On the electoral table (as well as on the state graphs) I've added a flip percent column that lists the probability of a given state anointing the minority party of 2004 as the 2008 victor. For example, Bush won Iowa in 2004, but the algorithm now says there is a 99.75 percent chance of Iowa flipping into the Obama column for the 2008 Election. The probability is calculated using a Cumulative Distribution Function (CDF). The CDF computation begins with the most recent Local Regression result for the 2004 minority party and the variance of that result; the algorithm then outputs the percent chance of that party receiving the majority vote in 2008. The majority-vote threshold is calculated by adding the most recent Local Regression values for McCain and Obama and dividing by two.
In some cases where a state has just a single valid poll, the calculation incurs a zero-divided-by-zero error; when this happens the output is normalized to 0. Washington DC has no polls, which makes calculation difficult; to account for this I assume that DC will be a very large victory for the Democrats (as it historically has been), and in turn assign it a zero percent chance of flipping.
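A minimal sketch of the flip calculation, assuming the CDF in question is that of a normal distribution centered on the Local Regression value (our Local Regression PDF defines the exact distribution; the function name and signature here are illustrative):

```python
from math import erf, sqrt

def flip_percent(minority_value, minority_variance, mccain_value, obama_value):
    """Chance the 2004 minority party clears the majority-vote threshold.

    Assumes a normal CDF; a lone valid poll yields zero variance, which is
    normalized to a 0% flip chance, as described above.
    """
    threshold = (mccain_value + obama_value) / 2.0
    if minority_variance <= 0:
        return 0.0  # single-poll case: no spread to integrate over
    sigma = sqrt(minority_variance)
    z = (minority_value - threshold) / sigma
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))
```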
By: TJHalva | Comments [5] | Category: Methodology | 7/31/2008 2:11:34 PM CT
I rolled out a brand new algorithm that uses Local Regression to generate a trendline from each state's poll data; I created a PDF detailing the method. The algorithm uses a degree of 3, a bandwidth of 15, and a confidence interval of 95%. If a given state has fewer than 5 non-partisan polls, a local least-squares method is used instead. The results of the algorithm are presented on our graphs and in our Electoral College estimates. If two or more polls share the same end date, they are averaged and treated as a single poll (see the sketch below).
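A quick sketch of that same-end-date averaging, with a hypothetical tuple layout for the poll records:

```python
from collections import defaultdict

def collapse_same_day_polls(polls):
    """Average polls sharing an end date so each date contributes one point.

    `polls` is a hypothetical list of (end_date, obama_pct, mccain_pct) tuples.
    """
    grouped = defaultdict(list)
    for end_date, obama, mccain in polls:
        grouped[end_date].append((obama, mccain))
    collapsed = []
    for end_date in sorted(grouped):
        results = grouped[end_date]
        obama = sum(o for o, _ in results) / len(results)
        mccain = sum(m for _, m in results) / len(results)
        collapsed.append((end_date, obama, mccain))
    return collapsed
```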
By: TJHalva | Comments [0] | Category: Methodology | 4/15/2008 6:08:31 PM CT
The Democratic nomination algorithm has been adjusted for the case where there is no poll data available within the last 8 days. Under the new rules, the system selects the most recent poll for that state (regardless of age) and averages it with the current national average. This change allows for unique results in each race that still reflect the will of the people while adjusting for recent trends in public sentiment.
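A minimal sketch of the adjusted rule; the data layout and names are hypothetical:

```python
from datetime import date, timedelta

def state_estimate(state_polls, national_average, window_days=8, today=None):
    """Apply the 8-day fallback rule described above.

    `state_polls` is a hypothetical list of (end_date, {candidate: pct})
    pairs sorted oldest to newest; `national_average` maps candidate to pct.
    """
    today = today or date.today()
    recent = [p for d, p in state_polls
              if (today - d) <= timedelta(days=window_days)]
    if recent:
        # Normal case: average every poll inside the window.
        return {c: sum(p[c] for p in recent) / len(recent) for c in recent[0]}
    # Fallback: blend the newest poll, however old, with the national average.
    _, latest = state_polls[-1]
    return {c: (latest[c] + national_average[c]) / 2.0 for c in latest}
```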
By: TJHalva | Comments [0] | Category: Methodology | 4/10/2008 8:56:10 PM CT
The current algorithm for the Democratic Nomination gathers all relevant polling data for the given race from the last 8 days. If no data is available within that window, the national race is substituted for the selected race. The results of this query are averaged and normalized to 100%; if a poll shows Candidate A at 47% and Candidate B at 40%, the projection gives Candidate A 47/(47 + 40) = 54.02 percent and Candidate B 40/(47 + 40) = 45.98 percent. Each candidate's percentage is then multiplied by the number of delegates at stake, yielding the projected number of delegates each candidate will receive in the given race.
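The same arithmetic in code form (the 100-delegate total is a hypothetical number for illustration):

```python
def project_delegates(poll_averages, total_delegates):
    """Normalize poll shares to 100% and prorate the delegates.

    `poll_averages` is a {candidate: pct} dict of averaged poll results.
    """
    total = sum(poll_averages.values())
    return {c: total_delegates * pct / total for c, pct in poll_averages.items()}

# Example from the text: 47% vs 40% over a hypothetical 100-delegate race.
print(project_delegates({"A": 47.0, "B": 40.0}, 100))
# {'A': 54.02..., 'B': 45.97...}
```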
The algorithm for the General Election (not yet implemented) is slightly different. The system again gathers all relevant poll data for the given state, this time from the last 15 days; if no data is available, the results of the 2004 election are used instead. The percentages are calculated in the same manner as for the Democratic Nomination. Once these percentages are calculated, the candidate with the higher percentage receives all of the state's electoral votes.