Over the last few years, social media has become a primary news source. For stories ranging from disasters to political uprisings, social networks are routinely the place where news breaks first – through eyewitness reports via text, pictures and videos.
The Hudson River plane crash in 2009 was an early example of how a mobile phone picture can reach a global audience within seconds. Since then, user-generated content posted to social networks such as Twitter, Facebook, and YouTube has shaped coverage of a variety of news events, including the death of Osama Bin Laden in 2011, the Japanese earthquake of 2011, and the popular uprisings in the Middle East. During Hurricane Sandy in 2012, users of Instagram, a social photo-sharing site, posted 10 photographs a second of the devastation, with around half a million photos posted in total.
But amid this deluge of information, we also find a large number of fakes. Analysis by the Guardian newspaper suggested that up to 15% of the most shared items during Hurricane Sandy were misleading or deliberately falsified. Many of these were picked up by the mainstream media and given added credibility – at least for a while.
Figure 1: Fake Hurricane Sandy pictures shared on social media
Over the years there have been a number of similar, well-documented cases. Following the death of Osama Bin Laden in May 2011, ‘Photoshopped’ pictures purporting to show his dead body were distributed on social media and picked up by newspapers, news websites, and TV stations, potentially inflaming passions in the region. After the Connecticut school shootings in December 2012, news agencies and TV outlets misidentified the gunman to millions of readers and viewers, based on false information circulating in social media. Following the Boston Marathon bombings in April 2013, social media users tried to crowdsource the identification of the perpetrators, with unsatisfactory results.
Figure 2: Twitter users make wrong assumptions about the identity of the Boston bombers (April 2013)
Reddit, Twitter and 4chan contained information that wrongly identified innocent people, causing great distress to those individuals as well as their friends and families. Posting fake pictures can also cause unnecessary alarm among the general public. Images that purported to show sharks swimming through New York City streets were posted at the height of Hurricane Sandy, and in 2011 a giant creature washed up on a California beach was falsely linked to the effects of radiation from the Fukushima nuclear plant.
Such incidents have led to increasing concerns about the reliability of news in social networks. Journalists are looking to quickly identify relevant information and trusted sources, but how can they also quickly spot misinformation, faked pictures, and the setting of false trails? Without tackling these issues the use of social media as a source of news will be increasingly problematic.
So what can be done? In other domains, researchers have been able to help mitigate the effects of false and potentially damaging information. The problem of email and web spam in the 1990s led to a series of new and increasingly effective countermeasures. Castillo et al. used link-based and content-based dependencies among web pages to develop automated ways of predicting and identifying spam and irrelevant content. Other studies (Seo et al.) noted that false claims in social media tend to come from a small number of unreliable sources, so research has focused on how to identify this small group of users.
But the news problem is proving particularly hard to crack. Contributors who were reliable in one domain may prove unreliable in another. An eyewitness to a news story may have a very limited footprint on which to base credibility decisions – and yet their content may be uniquely valuable. Content analysis may be confounded by the volume and speed of information, and by the range of formats now flowing through social media. In polarised situations such as Syria and Ukraine, protagonists often use social media to spread false rumours and have become sophisticated at tampering with pictures and video to promote their point of view. These issues are hard to resolve algorithmically with any degree of certainty.
One promising approach to distinguishing between real and fake news images came in research by Gupta et al., where the authors attempted to capture the patterns of fake Twitter content by applying classification models to tweet text and user features. On a dataset built around a single story, this approach achieved accuracy of up to 97%, but this depended on training the classifier on that particular story, introducing considerable bias and overfitting into the learning process.
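The general shape of such a supervised approach – text features combined with user features feeding a classifier – can be sketched as follows. This is an illustrative sketch only, not Gupta et al.'s actual pipeline: the tweets, labels, and feature choices are all invented for demonstration.

```python
# Illustrative sketch of tweet-level fake/real classification combining
# TF-IDF text features with simple user features. Toy invented data,
# not the pipeline or dataset used by Gupta et al.
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data: (tweet text, follower count, account age in days) and labels
tweets = [
    ("shark swimming down the street in brooklyn right now", 120, 30),
    ("BREAKING photo: huge wave hits the statue of liberty", 45, 10),
    ("official evacuation routes posted for lower manhattan", 15000, 2000),
    ("power outage reported across our neighborhood, stay safe", 800, 900),
]
labels = [1, 1, 0, 0]  # 1 = fake, 0 = real

# Text features: TF-IDF over the tweet text
vec = TfidfVectorizer()
X_text = vec.fit_transform(t for t, _, _ in tweets)

# User features, scaled so they are comparable to the TF-IDF values
scaler = StandardScaler()
X_user = scaler.fit_transform([[f, a] for _, f, a in tweets])

# Concatenate sparse text features with the dense user features
X = hstack([X_text, csr_matrix(X_user)])
clf = LogisticRegression().fit(X, labels)

def predict(text, followers, age_days):
    """Classify a new tweet: 1 = predicted fake, 0 = predicted real."""
    xt = vec.transform([text])
    xu = csr_matrix(scaler.transform([[followers, age_days]]))
    return int(clf.predict(hstack([xt, xu]))[0])
```

Note that the TF-IDF vocabulary here is learned from a single event, which is precisely where the story-specific overfitting creeps in: words like "shark" or "wave" are strong signals only for this one story.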
In our research for SocialSensor, an EU research project, we attempted to replicate these results using pictures mined from social media around Hurricane Sandy and the Boston Bombings. Fake images were identified from journalistic blogs to provide a ‘ground truth’ against which we could test various algorithms. While we identified false pictures in around 80% of the cases, this depended on the training set used to create the classifier. Applying the algorithms trained in one dataset (Hurricane Sandy) to the second (Boston Marathon) produced precision that was not much higher than the random baseline.
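The cross-event evaluation protocol – train on one event, test on another, and compare against a random baseline – can be expressed schematically as below. The tweets are toy placeholders and the features are plain TF-IDF; this is not the actual SocialSensor code or data.

```python
# Schematic cross-event evaluation: train a text classifier on tweets
# from one event and measure precision on a second, unseen event,
# alongside a random baseline. Toy placeholder data throughout.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

# Toy labelled tweets (text, label), where 1 = fake and 0 = real
sandy = [
    ("shark photo from the flooded subway", 1),
    ("wave crashing over the statue of liberty", 1),
    ("con edison confirms outage in lower manhattan", 0),
    ("evacuation shelters open in brooklyn tonight", 0),
]
boston = [
    ("photo shows suspect fleeing the marathon finish line", 1),
    ("reddit thread names the marathon bomber", 1),
    ("police confirm two explosions near the finish line", 0),
    ("hospitals ask for blood donations after marathon blasts", 0),
]

def cross_event_precision(train_set, test_set, seed=0):
    """Train on one event, return (model precision, random-baseline
    precision) on the other event."""
    vec = TfidfVectorizer()
    X_train = vec.fit_transform(t for t, _ in train_set)
    y_train = [y for _, y in train_set]
    # Vocabulary comes from the training event only, so event-specific
    # terms from the test event are simply unknown to the model
    X_test = vec.transform(t for t, _ in test_set)
    y_test = [y for _, y in test_set]

    clf = LogisticRegression().fit(X_train, y_train)
    model_p = precision_score(y_test, clf.predict(X_test), zero_division=0)

    rng = random.Random(seed)
    rand_p = precision_score(y_test, [rng.randint(0, 1) for _ in y_test],
                             zero_division=0)
    return model_p, rand_p

model_p, rand_p = cross_event_precision(sandy, boston)
```

Because the training vocabulary barely overlaps with the test event's, the model has little to go on – a small-scale illustration of why precision can collapse towards the random baseline when transferring between stories.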
Working closely with journalists, however, as part of the project, we have built up a deep understanding of how newsrooms verify content from social media and which factors they take into account under different circumstances. Without wishing to understate the complexity of these processes, we think a number of these factors are applicable to a wide range of news stories and could help us to come up with a more generalisable approach that could at least get closer to the answer.
One hypothesis is that adding the geographic location of the user as a key factor would help determine the relevance of content in a breaking news situation. The time a tweet was posted, and its proximity to the start of the story, is another element a journalist would use to assess veracity, and in most cases it can be determined automatically. Finally, we believe that incorporating features from appropriately selected terms (based on statistical analysis of an independent set of fake and real tweets) will also carry considerable predictive power.
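The first two of these signals – distance from the event and posting delay – can be computed automatically as simple feature functions. The sketch below uses the standard haversine formula for great-circle distance; the coordinates and timestamps are illustrative values, not real event data.

```python
# Sketch of two automatically computable verification features:
# the poster's distance from the event (km) and the delay between
# the event breaking and the post (minutes). Coordinates and
# timestamps below are illustrative, not real data.
import math
from datetime import datetime, timezone

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def proximity_features(user_latlon, post_time, event_latlon, event_start):
    """Return (distance from event in km, posting delay in minutes)."""
    dist = haversine_km(*user_latlon, *event_latlon)
    delay = (post_time - event_start).total_seconds() / 60.0
    return dist, delay

# Illustrative example: a user posting from Boston 45 minutes after a
# story that broke in New York City
dist, delay = proximity_features(
    (42.3601, -71.0589),                                  # Boston
    datetime(2012, 10, 29, 20, 45, tzinfo=timezone.utc),  # post time
    (40.7128, -74.0060),                                  # New York City
    datetime(2012, 10, 29, 20, 0, tzinfo=timezone.utc),   # event start
)
```

Both values are easy to feed into a classifier alongside text features; the harder questions are how often reliable geolocation is available at all, and how to weight these signals against the term-based features described above.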
Even without these features, we hope that our work already sets out an experimental framework for assessing the performance of computational verification approaches on social multimedia which will benefit the wider research community – and we have published this on GitHub. We welcome feedback and collaboration to further work on this exciting challenge.
Castillo, C., Donato, D., Gionis, A., Murdock, V., & Silvestri, F. (2007, July). Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th ACM SIGIR Conference (pp. 423–430). ACM.
Seo, E., Mohapatra, P., & Abdelzaher, T. (2012, May). Identifying rumors and their sources in social networks. In SPIE Defense, Security, and Sensing (pp. 83891I–83891I). International Society for Optics and Photonics.
Gupta, A., Lamba, H., Kumaraguru, P., & Joshi, A. (2013, May). Faking Sandy: Characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 729–736).