They got it wrong – again. Despite most opinion polls and forecasts stating that Hillary Clinton would beat Donald Trump in the US presidential election, the reverse happened. Of course, you could argue that the pollsters were dead-on correct: polls called a tight race with Clinton shading it, and that's exactly what happened – Clinton won the popular vote, after all – but Trump routed her in terms of electoral votes.
But in-depth polls were also done state-by-state, not least by polling guru Nate Silver at FiveThirtyEight, who calculated that Trump had just a 29% chance of winning. Conservative voters were hugely underestimated, but how?
So did 'shy' Trump voters lie to pollsters? Are forecasts based on the wrong data? And can new technology – some of it from a shell-shocked Silicon Valley – help breathe new life into an industry that's now in severe danger of being discredited?
How do opinion polls work?
Opinion polls are all about extrapolating trends from a relatively small data sample. The pollster asks people how they intend to vote, or how they did just vote, and algorithms are applied to create a demographically balanced national picture.
In a country of 231 million potential voters – although around 100 million don't actually vote – any national forecast is always going to be based as much on assumptions as on actual data. Key to this is voter turnout, which is very hard to predict; there's simply no data on it until after election day.
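The demographic balancing described above can be illustrated with a minimal sketch of post-stratification weighting, one common way of rebalancing a sample, though not necessarily what any particular pollster uses. The group names, sample and population shares below are all invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def poststratified_share(responses, population_shares, candidate):
    """Estimate support for `candidate`, reweighting each demographic group
    to its (assumed) share of the voting population.

    responses: list of (group, choice) tuples from the raw sample.
    population_shares: dict mapping group -> share of the population (sums to 1).
    """
    group_totals = Counter(group for group, _ in responses)
    group_support = Counter(
        group for group, choice in responses if choice == candidate
    )
    estimate = 0.0
    for group, share in population_shares.items():
        if group_totals[group] == 0:
            raise ValueError(f"no respondents in group {group!r}")
        # Support within the group, scaled by the group's population share.
        estimate += share * group_support[group] / group_totals[group]
    return estimate


# Hypothetical sample in which under-45s are heavily over-represented.
sample = (
    [("under_45", "A")] * 60
    + [("under_45", "B")] * 20
    + [("45_plus", "A")] * 5
    + [("45_plus", "B")] * 15
)
# Assumed population split; in practice this would come from census data.
population = {"under_45": 0.5, "45_plus": 0.5}

raw = sum(1 for _, choice in sample if choice == "A") / len(sample)
weighted = poststratified_share(sample, population, "A")
print(f"raw: {raw:.2f}, weighted: {weighted:.2f}")
```

In this made-up sample the raw figure puts candidate A well ahead, while the weighted figure is a dead heat. The population shares themselves are an assumption, and so is who actually turns out, which is where the human judgement comes in.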
"The challenge of making any prediction from data is to make sure that the data is representative," says Matt Jones, Analytics Strategist at data science consultancy Tessella. "Traditional statistical analysis of polling data and surveys will only be representative of those that bothered to take part, and that section of the voting population is not representative."
Polls are given huge gravitas by the media to the extent that they can be decisive in whether people bother to vote or not – so they can swing an election.
Machine learning is already used when running election predictions. It's part of standard statistical analysis. "As for any statistical analysis the single most critical factor is the amount of data available on which to run your algorithms, base your predictions," says Claus Jepson, Chief Architect at Unit4. "As of today the data set available is simply too limited to offer precise predictions, making it necessary to include human interpretations – hence making the predictions biased."
For example, pollsters decide how many historical election results to include and how much statistical weight to give them. "At some point in time the data available will be large enough for algorithms to effectively predict, less biased, outcomes based on polls," thinks Jepson.
Social media and sentiment analysis
Some of that 'new' data is from social media, which looks set to become a fresh tool for pollsters looking to track changing opinions. "The use of ‘social listening’ of social media conversations and behaviour may have been an early warning of possible contradictions from official polls," says Mark Skilton, Professor of Practice in the Information Systems & Management Group at Warwick Business School.
This is the science of sentiment analysis – when people write things in Twitter and Facebook posts, it's possible to extract positive, negative, or neutral attitudes. No one is suggesting that pollsters just use Twitter to predict elections, but it can be used to improve a purely statistical model by adding a vital dynamic dimension.
For example, BJSS SPARCK analysed 14 million tweets before the election and correctly predicted the outcome, uncovering that seven out of every ten tweets sent in the last four weeks of the campaign were in favour of Trump.
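As a rough illustration of the idea of sorting posts into positive, negative or neutral, here is a toy word-list classifier. It is a sketch only and not the method BJSS SPARCK or anyone else used: the word lists and example posts are invented, and real sentiment analysis has to cope with sarcasm, context and working out who a post is actually about.

```python
import re
from collections import Counter

# Tiny, hypothetical word lists; real lexicons contain thousands of terms.
POSITIVE = {"great", "love", "win", "strong", "best"}
NEGATIVE = {"bad", "hate", "lose", "weak", "worst"}


def classify(text):
    """Label a post by counting positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_shares(posts):
    """Return the fraction of posts carrying each label."""
    if not posts:
        raise ValueError("no posts to analyse")
    counts = Counter(classify(post) for post in posts)
    return {
        label: counts[label] / len(posts)
        for label in ("positive", "negative", "neutral")
    }


# Invented example posts.
posts = [
    "I love this candidate, best speech yet",
    "Worst debate ever, so weak",
    "Polls open at 7am tomorrow",
    "Great rally tonight",
]
shares = sentiment_shares(posts)
print(shares)
```

Tracking how these shares move day by day is what gives a purely statistical model the dynamic dimension mentioned above.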
"When they use social media, people become less guarded about their true social and political affiliations," says Simon Sear, Practice Leader of BJSS SPARCK. "Their language becomes unfiltered, they ‘like’ content that appeals to them and follow people and organisations which represent their values … contrast that with having to admit embarrassing sentiment and intentions to a potentially judgemental human pollster."