Category: Methodology
| 9/15/2008 5:58:55 PM CT
When I collect polls I attempt to classify them into three categories: non-partisan, partisan, and polls that are just not valid for whatever reason. I add all polls to the database, but not all polls are used in the calculations. Partisan polls, as their name implies, were either commissioned directly for a certain candidate or performed by a pollster with a known bias. The partisan polls are suffixed with a (D) or an (R) depending on the party affiliation. An invalid poll is essentially converted to a partisan poll (the algorithm only differentiates between non-partisan and partisan polls) with a (D) or an (R) arbitrarily assigned based on whatever criteria I choose, bearing in mind that the classification of a (D) or an (R) makes no difference to the math behind the model. Occasionally a pollster will provide both a Likely Voter result and a Registered Voter result; when this choice presents itself, the Likely Voter result is used. There are also occasions where a pollster will release a poll with separate results for different wordings or phrasings of the questions. In this situation I average the separate results, assuming of course that the wordings are comparably equivalent. Pollsters sometimes release polls both with and without "leaners"; I use the "leaner" result. Pollsters may also include third party candidates such as Bob Barr (L) in their poll; I use the result which includes the third party candidate when available, unless the third party candidate is not on the ballot for the given state.
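The selection rules above can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: the `PollResult` fields, the function names, and the order in which the preferences are applied are all assumptions of mine, as is the fallback to whatever results exist when a preferred variant is missing.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PollResult:
    # Hypothetical record for one reported variant of a single poll.
    population: str    # "LV" (likely voters) or "RV" (registered voters)
    leaners: bool      # True if leaners are included
    third_party: bool  # True if a third party candidate was named
    obama: float
    mccain: float


def prefer(predicate, results):
    """Keep results matching the predicate, or all of them if none match."""
    kept = [r for r in results if predicate(r)]
    return kept or results


def combine(results, third_party_on_ballot=True):
    """Reduce the variants of one poll to a single (obama, mccain) pair."""
    chosen = list(results)
    chosen = prefer(lambda r: r.population == "LV", chosen)
    chosen = prefer(lambda r: r.leaners, chosen)
    # Use the third party variant only when that candidate is on the ballot.
    chosen = prefer(lambda r: r.third_party == third_party_on_ballot, chosen)
    # Anything still left is assumed to differ only in question wording,
    # so the remaining variants are averaged.
    return mean(r.obama for r in chosen), mean(r.mccain for r in chosen)
```

For example, a poll reporting one Registered Voter result and two Likely Voter results with different wordings would be reduced to the average of the two Likely Voter results.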
When recording polls I use the end date of the poll's sampling window. I prefer the end date to a median or start date because it is much easier to interpret end points on the graph than to extrapolate the meaning of the plot from a median or start point. The Local Regression method arranges the polls chronologically by end date and discards the elapsed time in between. For example, if there are two polls, one with an end date of 6/20/08 and another with an end date of 8/15/08, they are treated merely as poll 1 and poll 2; the end dates have no significance other than to order the polls. I disregard the dates for two reasons. First, I assume a poll is true until proven otherwise by a more recent poll, at which point the older poll receives a lower weight in the calculation. Second, there is processing time: I would ideally like to give each poll a range of influence and then compute a regression value for every day (instead of only for each day on which a poll is released, as is currently done), but that simply cannot be done in a reasonable amount of computing time while still maintaining the dynamic nature of the site.
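To make the "poll 1, poll 2" idea concrete, here is a rough sketch of a local regression over poll order rather than calendar time. The tricube kernel, the straight-line fit, and the `span` parameter are my assumptions for illustration (they are common choices for local regression in general), and the poll values are made up; none of this is taken from the site's actual algorithm.

```python
from datetime import date


def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_fit(values, at, span):
    """Weighted straight-line fit evaluated at poll position `at` (0-based).

    Distance is measured in poll positions, not days, so the gap between
    end dates plays no role. Polls `span` or more positions away get
    zero weight.
    """
    xs = range(len(values))
    w = [tricube((i - at) / span) for i in xs]
    sw = sum(w)
    mx = sum(wi * i for wi, i in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, values)) / sw
    sxx = sum(wi * (i - mx) ** 2 for wi, i in zip(w, xs))
    if sxx == 0:  # only one poll carries weight
        return my
    sxy = sum(wi * (i - mx) * (y - my) for wi, i, y in zip(w, xs, values))
    return my + (sxy / sxx) * (at - mx)


# Made-up polls as (end date, candidate share); only the ordering matters.
polls = [(date(2008, 8, 15), 47.0), (date(2008, 6, 20), 44.0)]
ordered = sorted(polls, key=lambda p: p[0])
values = [share for _, share in ordered]
latest = local_fit(values, at=len(values) - 1, span=3)
```

Evaluating the fit at the most recent position gives the newest poll the largest weight and older polls progressively less, which matches the behavior described above.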
Moving on to our state graphs. Non-partisan polls are represented on the graphs with a filled-in circle, while partisan polls (which include the invalid polls) are denoted by circles with empty centers. A blue circle corresponds to Obama, red to McCain, and a tie is denoted by purple. The thick colored lines are the results of the Local Regression algorithm and the lighter dashed lines are the variances. If there is a lot of fluctuation between successive polls, the variance will be larger; if all the polls tend to follow a similar path, the variance will be smaller. I wrote a nice PDF describing the entire Local Regression method. The arrows to the right of each candidate's name are currently meaningless, but will hopefully have a use fairly soon. The percentage in the parentheses after the status (more on this later) corresponds to the flipping point calculation.
The colors used on the maps, tables and graphs are all based on the status of a given state. The status is defined in three-point increments. A state where one candidate is within three points of the other is defined to be a Toss Up; if Obama has a lead between three and six points, the state is classified as Weak Dem; if McCain has a lead between six and nine points, it's classified as Core Rep; and if either candidate has a lead larger than nine points, the state is Safe for the given party.
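The status bands can be written as a small function. This is a sketch under two assumptions of mine: that the Weak and Core bands apply symmetrically to both parties, and that a lead of exactly three or six points falls into the higher band. The description above only states that a lead has to be larger than nine points to count as Safe.

```python
def classify(obama, mccain):
    """Map a pair of regression values to a state status label."""
    margin = obama - mccain
    lead = abs(margin)
    if lead < 3:
        return "Toss Up"
    party = "Dem" if margin > 0 else "Rep"
    if lead < 6:
        return f"Weak {party}"
    if lead <= 9:
        return f"Core {party}"
    return f"Safe {party}"
```

With these boundaries, `classify(50, 46)` gives "Weak Dem" and `classify(42, 49)` gives "Core Rep".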
The entire site is dynamically created: whenever I add a poll to the database, all the maps, images and tables instantly reflect the new values. I try to update the "Poll Update" article for the given day whenever I add a new poll, but this isn't always possible. Luckily, the "Latest Polls" side panel reads from the database, so you can always see which new polls are included in the calculations.
I also do not currently include tracking polls because they would outweigh (by outnumbering) the other, more thoroughly conducted national polls. I plan to eventually add a national tracking poll section that will be separate from the other national polls that are currently collected.
The entire mathematical procedure uses only polling data; demographic information is not taken into account, as I leave that to the pollsters. The algorithm does have a known weakness, however: it cannot predict abnormalities in voting patterns that may be the result of foul play.
Polling Methodology and Application