@SNOW/WWW, 2016, by Stefanie Wiegand and Stuart E. Middleton
Social media are becoming increasingly important as a source for journalists because content is available quickly during breaking news events, when journalists can get first-hand eyewitness reports, often including photos or videos. While there is plenty of genuine information available, there is also plenty of satire, propaganda and copycat content. Journalists are torn between being first to get the story out (and risking their reputation if the information turns out to be false) and verifying that the content is genuine (and publishing with too big a delay). There have been suggestions to use the wisdom of the crowd, but in many cases social media acts as an echo chamber, spreading rumours that often later turn out to be false. This is less of a problem for long-term news stories, because with time it becomes clearer what really happened, but in breaking news situations it can be tricky to quickly distinguish fact from fiction.
Dashboards like TweetDeck or StoryFul help journalists organise the high volumes of content and discover newsworthy stories. Other tools like TinEye or Google image search can be used to verify images found on social media. Journalists also keep lists of trusted sources to help judge whether a content item is true. We agree that having a defined set of trusted (or indeed untrusted) sources is a good way to filter the noise created by the echo chamber that is social media, but we think this checking can be partially automated.
Our trust model enables journalists to maintain a list of their sources, linking new content to authors. While tracking a news story on social media, content items are associated with authors and can be filtered using predefined lists. For each new content item, it becomes clear immediately whether it is in some way related to a source: if it’s been posted by that source, mentions that source or is attributed to it.
We also want journalists to discover new eyewitness content quickly. This means we cannot rely on trending content from news organisations alone, since by then the content is no longer new. Instead, we look for shared content containing images or video that is new (less than 5 minutes since publication) and is starting to be shared by more people. Chances are it has not yet been verified and is potentially eyewitness content.
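The two conditions above (freshness and rising share counts) can be combined into a simple predicate. This is a minimal sketch under assumed names; the thresholds mirror the 5-minute window described in the text:

```python
from datetime import datetime, timedelta

FRESH_WINDOW = timedelta(minutes=5)

def is_candidate(published_at, share_count, prev_share_count, now=None):
    """Flag content that is both new (<5 min since publication) and
    gaining shares -- i.e. likely unverified eyewitness material."""
    now = now or datetime.utcnow()
    is_fresh = (now - published_at) < FRESH_WINDOW
    is_rising = share_count > prev_share_count
    return is_fresh and is_rising

now = datetime.utcnow()
print(is_candidate(now - timedelta(minutes=2), share_count=8, prev_share_count=3, now=now))    # True
print(is_candidate(now - timedelta(minutes=20), share_count=50, prev_share_count=10, now=now)) # False
```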
What we’ve done
We crawled various social media sites (Twitter, YouTube and Instagram) using our own crawling software, searching for content with specific hashtags (e.g. #Paris). We used natural language processing techniques to identify named entities (such as “BBC” or “Le Monde”) in English and French, as well as mentioned URLs. Then we imported the data into our trust model, which already contained a sample list of trusted and untrusted sources (e.g. @BBC was defined as a trusted source, @TheOnion as an untrusted source). This way, we can easily retrieve all content written by, mentioning or attributed to a specified source.
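Our actual pipeline uses proper NLP for named entity recognition; as a much simpler illustration of the extraction step, a regex-based toy version that only pulls out handles, hashtags and URLs might look like this (all names here are assumptions for the sketch):

```python
import re

HANDLE = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+|pic\.twitter\.com/\S+")

def extract(text):
    """Pull handles, hashtags and URLs out of a post's text."""
    return {
        "handles": HANDLE.findall(text),
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
    }

post = "Shots fired near the stadium #Paris, via @BBC https://example.com/img.jpg"
print(extract(post))
```

A real system would replace the regexes with a multilingual named entity recogniser to also catch entities written as plain text (“BBC”, “Le Monde”) rather than as handles.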
To show how using trusted sources can help a journalist, we picked five pictures posted during the night of the Paris attacks; four of them are genuine and one is false. We identified URLs for copies of each posted image that might have been shared instead of the original image URL. We then queried our database in 10-minute intervals during the first hour after each image was published to see how often it was shared (overall and by trusted/untrusted sources).
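The interval counting can be sketched as a small bucketing function. This is an illustrative stand-in for the database query, with assumed names and data shapes:

```python
from collections import Counter
from datetime import datetime, timedelta

def shares_per_interval(shares, published_at, interval_minutes=10, hours=1):
    """Bucket share timestamps into fixed intervals after publication.
    `shares` is a list of (timestamp, source_handle) pairs."""
    n_buckets = hours * 60 // interval_minutes
    buckets = Counter()
    for ts, _src in shares:
        idx = int((ts - published_at) / timedelta(minutes=interval_minutes))
        if 0 <= idx < n_buckets:
            buckets[idx] += 1
    return [buckets[i] for i in range(n_buckets)]

pub = datetime(2015, 11, 13, 22, 0)
shares = [(pub + timedelta(minutes=m), "@x") for m in (1, 3, 12, 45)]
print(shares_per_interval(shares, pub))  # [2, 1, 0, 0, 1, 0]
```

Running the same function separately over shares by trusted and by untrusted sources gives the two curves we compared per image.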
This is one of the genuine tweets we used. The author happened to be nearby when the events unfolded and posted several photos to Twitter during the course of the evening.
Photo des morts ss drap blanc [French: “Photo of the dead under a white sheet”] pic.twitter.com/ZSs2s9Lp4m
— petemystrong (@pierre75010) November 13, 2015
This tweet shows a “fake” image, taken out of context from the Charlie Hebdo attacks in early 2015.
Paris, Not Afraid. pic.twitter.com/0XEtD9urtg
— ian bremmer (@ianbremmer) November 14, 2015
In our second experiment, we sorted URLs by the number of mentions. Every 5 minutes, we ranked the URLs currently being shared on social media and filtered out the old ones (i.e. those that had already been shared in previous windows). By doing this, we tried to detect new eyewitness content to investigate before it went viral.
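A minimal sketch of that ranking-and-filtering loop, with assumed names and a plain dict/set representation instead of our real crawler output:

```python
from collections import Counter

def top_new_urls(window_posts, seen_urls, k=10):
    """Rank URLs in the current 5-minute window by mention count,
    dropping any URL already seen in earlier windows."""
    counts = Counter(u for post in window_posts for u in post["urls"])
    fresh = [(u, c) for u, c in counts.most_common() if u not in seen_urls]
    seen_urls.update(u for u, _ in fresh)
    return fresh[:k]

seen = set()
window1 = [{"urls": ["a", "b"]}, {"urls": ["a"]}]
window2 = [{"urls": ["a", "c"]}, {"urls": ["c"]}]
print(top_new_urls(window1, seen))  # [('a', 2), ('b', 1)]
print(top_new_urls(window2, seen))  # [('c', 2)] -- 'a' is filtered as old
```

The mutable `seen_urls` set carries state between windows, so each call only surfaces URLs that are genuinely new.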
What we found and what it means
When analysing eyewitness content, we found that
- Untrusted sources share images earlier than trusted sources
In the case of an image that has not yet been verified, untrusted sources pick it up earlier.
- Trusted sources are an indication that an image is authentic
Although verification still needs to be done by a human, the fact that trusted sources are related to a piece of user-generated content makes it more likely to be genuine. Typically, this becomes apparent about 30 minutes after a picture has been published.
For verification, our central hypothesis is that the “wisdom of the crowd” is usually no wisdom at all. We think it is often better to base a decision on a few trusted sources than to risk falling victim to the “echo chamber”. Our results show that from about 30 minutes onwards, the involvement of trusted sources gives a good indication of the veracity of a piece of user-generated content. If a journalist is prepared to wait 30 minutes (or discovers an image after that time), it can point them in the right direction for conventional means of verification, such as attempting to contact the source directly and doing factual cross-checking.
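The 30-minute heuristic can be stated as a tiny check. This is a hedged sketch with assumed names, not our production scoring; it only reports whether any trusted source has engaged with the item within the window:

```python
from datetime import datetime, timedelta

def trusted_involvement(shares, published_at, trusted, minutes=30):
    """Check whether any trusted source shared or mentioned the item
    within `minutes` of publication -- a positive (but not conclusive)
    signal that the content is genuine. `shares` holds
    (timestamp, source_handle) pairs."""
    cutoff = published_at + timedelta(minutes=minutes)
    return any(src in trusted for ts, src in shares if ts <= cutoff)

pub = datetime(2015, 11, 13, 21, 30)
shares = [(pub + timedelta(minutes=5), "@random_user"),
          (pub + timedelta(minutes=25), "@BBC")]
print(trusted_involvement(shares, pub, {"@BBC"}))  # True
```

A `False` result does not mean the content is fake, only that human verification should proceed with extra caution.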
For the discovery of newsworthy eyewitness content, we found that it helps to filter old content. We chose a time window of 5 minutes, but others are possible. Using this method, all 5 of our tested images showed up in the top 6% of all content crawled during this period. This means a journalist scanning a social media stream for newsworthy content would not have to check hundreds or thousands of URLs but could focus on the top URLs. Of course this doesn’t mean all top URLs will contain genuine images, but they are more likely to be related eyewitness content. This approach can also be combined with other state-of-the-art filtering approaches, such as automated multimedia fake detection, to further improve the quality of the real-time content recommended to journalists.
Where to go from here
Our results – although preliminary – look promising. An estimate of the likely veracity of social media posts, presented graphically, could help journalists become faster and more efficient. Apart from the trusted source lists we’ve used, our method can easily be extended to use other information, such as weather or lighting conditions. This information is already available and could be obtained dynamically.
The essence of this work is that we try to assist journalists, not to replace them by fully automating the process – we don’t think that is possible anytime soon. By automating the manual, labour-intensive parts of the verification process, however, we can give them a tool to verify and publish faster and with more confidence. Hopefully, this helps them better deal with the pressure of breaking news publishing.
If you want to know more about our work on this, you can read our publications about real-time crisis mapping of natural disasters or extracting attributed verification and debunking reports using social media, visit the REVEAL website, or follow us on Twitter at @RevealEU and @stuart_e_middle.