Tag Archives: snow2016

Hoaxy: A Platform for Tracking Online Misinformation

@SNOW/WWW, 2016, by Truthy Team

Diffusion networks of hoaxes on Twitter

Approximately 65% of American adults access the news through social media. Through our shares and retweets, we participate in propagating the news we find interesting and trustworthy. One implication is that no single authority can dictate what kind of information is distributed on the whole network. While such platforms have, by some accounts, brought about a more egalitarian model of information access, the lack of oversight from expert journalists makes social media vulnerable to the intentional or unintentional spread of misinformation. Do you believe what you read on social media?

Several characteristics of online social networks, such as homophily, polarized echo chambers, algorithmic ranking, and social bubbles, create considerable challenges for our ability to discriminate between facts and misinformation, and to allocate our attention and energy accordingly. Furthermore, the harsh competition for our limited attention created by the fast news life cycle makes it inevitable that some news items will go viral even if they carry false or unreliable information.

It is therefore not too surprising that many hoaxes have spread online in viral fashion, oftentimes with worrying real-world consequences, for example in health and finance. Examples include rumors, false news, and conspiracy theories. The recent emergence of fake news sites is a worrisome phenomenon. While some are funny, many attract eyeballs and advertising profits by spreading uncertainty, fear, panic and civil disorder.

Due to the magnitude of the phenomenon, media organizations are devoting increasing effort to producing accurate verifications in a timely manner. For example, during Hurricane Sandy, false reports that the New York Stock Exchange had been flooded were corrected in less than an hour. The fact-checking network includes Snopes.com, PolitiFact, and FactCheck.org. More recently, efforts have been made to detect and track rumors. Fact-checking assessments are consumed and broadcast by social media users like any other type of news content, leading to a complex interplay between news memes that vie for the attention of users. Political scientists tell us that in many cases, fact-checking efforts may be ineffective or even counterproductive. How to make sense of all this? To date, there is no systematic way to observe and analyze the competition dynamics between online misinformation and its debunking.

To address some of these challenges, researchers at the Indiana University Network Science Institute (IUNI) and the School of Informatics and Computing’s Center for Complex Networks and Systems Research (CNetS) are working on an open platform for the automatic tracking of both online fake news and fact-checking on social media. The goal of the platform, named Hoaxy, is to reconstruct the diffusion networks induced by hoaxes and their corrections as they are shared online and spread from person to person. Hoaxy will allow researchers, journalists, and the general public to study the factors that affect the success and mitigation of massive digital misinformation.

Hoaxy architecture

Hoaxy hasn’t been released to the public yet, but we have been collecting data for a few months. Preliminary analysis of traffic volume reveals that, in the aggregate, the sharing of fact-checking content typically lags that of misinformation by 10-20 hours. More interestingly, we find that the sharing of fake news is dominated by very active users, while fact checking is a more grass-roots activity. A paper with these results, titled Hoaxy: A Platform for Tracking Online Misinformation and authored by Chengcheng Shao, Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer, will be presented this April at the Third Workshop on Social News On the Web (SNOW), to be held in conjunction with the 25th International World Wide Web Conference (WWW 2016) in Montreal.
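To give a rough sense of what this kind of lag analysis involves (this is a toy sketch, not the method used in the paper; the series and function names are invented), the delay between two hourly share-count series can be estimated with a simple lagged cross-correlation:

```python
import numpy as np

def estimate_lag_hours(fake_counts, fact_counts, max_lag=48):
    """Estimate how many hours fact-checking shares lag fake-news shares.

    fake_counts, fact_counts: hourly share counts (same length).
    Returns the lag (in hours) that maximizes the cross-correlation.
    Illustrative only -- not the analysis method used in the Hoaxy paper.
    """
    fake = np.asarray(fake_counts, dtype=float)
    fact = np.asarray(fact_counts, dtype=float)
    # z-score both series so the correlation is scale-free
    fake = (fake - fake.mean()) / fake.std()
    fact = (fact - fact.mean()) / fact.std()
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        # shift the fact-checking series back by `lag` hours and correlate
        n = len(fake) - lag
        corr = float(np.dot(fake[:n], fact[lag:lag + n]) / n)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Synthetic example: fact-checking mirrors fake news, delayed 15 hours.
rng = np.random.default_rng(0)
fake = rng.poisson(20, 200).astype(float)
fact = np.roll(fake, 15)
print(estimate_lag_hours(fake, fact))  # 15
```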

The World Economic Forum ranks massive digital misinformation among the top future global risks, along with water supply crises, major systemic financial failures, and failure to adapt to climate change. Social news observatories such as Hoaxy have the potential to shed light on this phenomenon and help us develop effective countermeasures.

For Your Eyes Only: Consuming vs. Sharing Content

@SNOW/WWW, 2016, by Roy Sasson and Ram Meshulam


Do you share on Facebook every page that you visit?

Assuming that the answer is “no”, how do you determine what to share? The answer to this question is meaningful for publishers, content marketers and researchers alike. Many of them try to infer user engagement from the sharing activity of users, among other signals. The underlying assumption is that highly shared articles are highly interesting/engaging for the users.

Based on more than a billion data points from hundreds of publishers worldwide that use Outbrain’s Engage platform, we show that the above assumption is not necessarily true. There is a dissonance between what users choose to read in private and what they choose to share on Facebook. We denote the (log of the) ratio between private engagement (measured by click-through rate) and social engagement (measured by share rate) as the private-social dissonance.
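A minimal sketch of this definition (the function and its arguments are illustrative, not Outbrain's internal schema):

```python
import math

def private_social_dissonance(clicks, impressions, shares):
    """Log-ratio of private engagement (click-through rate) to social
    engagement (share rate), per the definition above.

    Positive values: read much more than shared; negative values:
    shared relatively more than read. Field names are illustrative.
    """
    ctr = clicks / impressions
    share_rate = shares / impressions
    return math.log(ctr / share_rate)

# A category clicked 500 times but shared only 5 times per 100k impressions
print(round(private_social_dissonance(500, 100_000, 5), 2))  # 4.61
```

Note that because both rates share the same denominator, the dissonance reduces to log(clicks / shares); keeping the rates explicit just mirrors how the two engagement types are measured.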

The private-social dissonance varies consistently across content categories. Categories such as Sex, Crime and Celebrities are characterized by a high positive dissonance: articles in these categories tend to be visited relatively more than they are shared. On the other hand, categories such as Books, Wine and Careers are characterized by a negative dissonance: articles in these categories tend to be shared relatively more than they are visited.


This figure shows content categories, sorted in descending order by their private-social dissonance. Inspecting the categories from top-left to bottom-right, the picture is clear: users tend to read, without sharing, articles from categories that could harm (or at least not increase) their social appeal. Conversely, users tend to share articles from categories that are not especially popular, yet reflect a positive and socially desirable identity of the sharing user. Our results are consistent over time and did not vary substantially during a period of one year.

To further test the value of social signals for measuring engagement, a click-prediction model using several signals was trained and deployed on a live recommendation system. The resulting weights ranked the social signal lower than other signals, such as click-through rate.

What next?

It would be interesting to investigate the relation between private engagement and Facebook’s recently announced Reaction buttons. Will some of the new buttons have a close-to-zero dissonance, and thus serve as an accurate metric for engagement? Twitter is also a good candidate for investigation. Another direction is to use refined private-engagement signals instead of CTR, such as time-on-page or scrolling behavior. An interesting question: do users actually read what they share?

Another interesting direction is to utilize the private-social dissonance in a classifier for inappropriate content. Articles with high positive dissonance are often inappropriate to some extent. Such a classifier would be based on users’ behavior and would not rely on natural language processing or image processing.

In conclusion, publishers, marketers, architects of recommendation-systems and anyone who uses social signals as an engagement metric should be aware of the private-social dissonance.

For more details, please refer to our paper entitled: “For Your Eyes Only: Consuming vs. Sharing Content” by Roy Sasson and Ram Meshulam, Third Workshop on Social News on the Web @ WWW ’16 (SNOW 2016), Montreal, Canada, April 2016.

Predicting News Popularity by Mining Online Discussions

@SNOW/WWW, 2016, by Georgios Rizos, Symeon Papadopoulos, Yiannis Kompatsiaris

The identification of top online news stories very early after their posting, or even before, is an important problem, invaluable to online social media and to the news press and aggregators. Recently, the New York Times launched an internal tool called Blossom that makes data-driven recommendations to journalists about which published stories will go viral when featured on Facebook. The tool applies machine learning to the firm’s big data stores; applications and techniques like these are at the forefront of data-informed journalism.

But what can be done to ensure that the best content bubbles up to the place of most exposure in a social medium? It really depends on what users look for in their information consumption, and it may be more intricate than a simple score count.

Online users may wish to follow the zeitgeist by consuming viral content, or they may be professionals in search of more thought-provoking material, such as a discussion-raising or controversial story. Given users’ limited attention, proper exposure mechanisms such as a smart news feed or ranking process are important services through which social media cater to the information needs of their users and increase online traffic and monetization.

Early prediction of online popularity for multimedia or news stories can be used for improved exposure mechanisms and is crucial to online social media stakeholders, business intelligence specialists and journalists alike.

Regarding news story popularity prediction specifically, a study [1] describes a mathematical model that predicts the size of a discussion at a future point by capturing patterns in the rate at which comments are added. Another study [2] addressed the same task by looking at characteristics such as the time (day, month) a post is made or the number of identified entities present in the post. However, such methods do not take into account the complexity of the structure of social interactions among the implicated users. This has been attempted in studies on hashtag popularity prediction [3] and prediction of the number of shares on Facebook [4], with promising results; however, this success has not been transferred to online discussions. Indeed, only simple characteristics of the structure of online discussions have been used in a study [5] tangentially related to news story popularity prediction.

What about a discussion that has attracted many comments, all of them expressing simple agreement? What if a small number of users have posted multiple times or hijacked the thread for an unrelated discussion?

To address this need, our work on the EU FP7 research project REVEAL led to the development of a machine learning framework for predicting future popularity values of an online news story by analyzing the structural complexity of the online discussion that is generated about it. Our hypothesis is that by being able to capture different configurations of the early structure of the discussion, we can reveal future discussion size, number of contributing users, vote score and perhaps most importantly, controversiality.

Our framework aims to capture information from two structural sources present in online discussions: the comment tree and the network of users contributing to the discussion. We extract a number of features that characterize each of these networks, and a machine learning algorithm uses them to predict the future popularity of the post. An overview of our framework is shown in the following figure. But what is the intuition behind using these two kinds of networks?


A news post that generates a level-one depth comment tree, even one with many comments agreeing with or praising it, might indicate light-hearted or shallow-interest content. In contrast, a story that generates multiple lengthy chained replies might indicate more controversial or discussion-provoking source material.

Similarly, a user network indicating that the majority of replies are exchanged among a small number of users may imply thread hijacking or a conversation among power users. Alternatively, a discussion with deliberation from the full set of implicated users may be of more general interest.
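The tree-side intuition can be made concrete with a toy feature extractor (the paper's actual feature set is larger and also covers the user graph; all names here are invented):

```python
from collections import Counter

def comment_tree_features(parents):
    """Structural features of a discussion's comment tree.

    `parents` maps each comment id to its parent id (None for the root
    post). Returns a small feature dict; a real pipeline would extract
    many more features from both the comment tree and the user network,
    then feed them to a regressor.
    """
    def depth(c):
        d = 0
        while parents[c] is not None:
            c = parents[c]
            d += 1
        return d

    depths = [depth(c) for c in parents if parents[c] is not None]
    children = Counter(p for p in parents.values() if p is not None)
    return {
        "n_comments": len(depths),
        "max_depth": max(depths, default=0),
        "avg_depth": sum(depths) / len(depths) if depths else 0.0,
        "max_branching": max(children.values(), default=0),
    }

# A shallow "praise" thread vs. a chained, discussion-provoking one
flat = {"root": None, "c1": "root", "c2": "root", "c3": "root"}
chain = {"root": None, "c1": "root", "c2": "c1", "c3": "c2"}
print(comment_tree_features(flat)["max_depth"])   # 1
print(comment_tree_features(chain)["max_depth"])  # 3
```

Both toy threads have three comments, but the chained one is much deeper, which is exactly the kind of difference the flat comment count misses.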

To evaluate the performance of our methodology for score and controversiality prediction, we collected a Reddit dataset of posts made in 2014 on several news-related subreddits. Our approach proved superior to simpler past methods, with the greatest improvement achieved in score and controversiality prediction.

To get a feel for the performance of our method, we calculated the overlap (Jaccard coefficient) between the top-100 controversial stories our method predicted and the true top-100. This is shown in the following table, along with the results from two other approaches.

results table

Our method is denoted by all_graph. The method denoted by temporal is based on a subset of the features used to capture growth rate in [4], and the combination of the two is all. The percentages shown refer to the fraction of a post’s lifetime that has elapsed. To make the comparisons at an early stage of the discussions, we show results at 1-14% of the mean time elapsed between posting and the accumulation of 99% of the comments.
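The evaluation measure itself is straightforward; here is a sketch on a hypothetical mini-example (not the paper's data):

```python
def topk_jaccard(predicted_scores, true_scores, k=100):
    """Jaccard overlap between the predicted and true top-k item sets,
    mirroring the top-100 controversiality evaluation in the table.

    Scores are dicts mapping item id -> score.
    """
    top = lambda scores: set(sorted(scores, key=scores.get, reverse=True)[:k])
    p, t = top(predicted_scores), top(true_scores)
    return len(p & t) / len(p | t)

# Hypothetical rankings: prediction ranks post0..post9, truth post2..post11
pred = {f"post{i}": -i for i in range(10)}
true = {f"post{i}": -i for i in range(2, 12)}
print(topk_jaccard(pred, true, k=5))  # overlap {post2,post3,post4} -> 3/7
```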

As an example, at the 5% post lifetime, the most controversial post that our method identified was titled “Gun deaths for U.S. officers rose by 56 percent in 2014: report.” We see that many users contribute to the discussion by linking to more specific information, although some disagree, claiming that the title is worded to evoke sensationalism, and still others discuss how civilian gun deaths are a related and under-reported statistic.

We have shown that a more in-depth representation of the structure of the social interactions around news posts is a successful means of predicting popularity and identifying top material. This network-based approach can even complement other methods, such as text-based ones or methods that examine the poster’s influence on the medium. We will continue to extend our method with improved techniques and additional sources of information.

The code for our method and the full series of experiments we performed can be found on GitHub.


[1] A. Tatar, P. Antoniadis, M. D. De Amorim, and S. Fdida. From popularity prediction to ranking online news. Social Network Analysis and Mining, 4(1):1–12, 2014.

[2] M. Tsagkias, W. Weerkamp, and M. De Rijke. Predicting the volume of comments on online news stories. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1765–1768. ACM, 2009.

[3] L. Weng, F. Menczer, and Y.-Y. Ahn. Predicting successful memes using network and community structure. arXiv preprint arXiv:1403.6199, 2014.

[4] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades be predicted? In Proceedings of the 23rd international conference on World Wide Web, pages 925–936, 2014.

[5] J. Lee, M. Yang, and H. Rim. Discovering high-quality threaded discussions in online forums. Journal of Computer Science and Technology, 29(3):519–531, 2014.

Send in the robots: automated journalism and its potential impact on media pluralism

@SNOW/WWW, 2016, by Pieter-Jan Ombelet, Aleksandra Kuczerawy, Peggy Valcke

Employing robot journalists: legal implications, considerations and recommendations

Resources for investigative journalism are diminishing. In the digital age, this was a foreseeable evolution: publishers typically regard these pieces as time-consuming and expensive, and the results of the research are often unpredictable and potentially disappointing. We analyse automated journalism (also referred to as robotic reporting) as a potential solution to the diminution of investigative journalism, and look at its potential (positive and negative) impact on media content diversity.

What is automated journalism?

Automated journalism was defined by Matt Carlson as “algorithmic processes that convert data into narrative news texts with limited to no human intervention beyond the initial programming”. Narrative Science and Automated Insights are arguably the biggest companies currently specialising in this algorithmic content creation. Given core data to work with, these companies’ software can generate complete news stories from that data. To date, the most common uses of this software have been in sports and financial reporting, often creating niche content that would not exist without the software (such as reports on ‘Little League’ games).
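To make the idea concrete, here is a toy data-to-text sketch in the spirit of such systems; real pipelines are far more sophisticated, and all names and fields below are invented for illustration:

```python
def generate_recap(game):
    """Toy template-based sports recap: convert structured game data
    into a narrative sentence. Field names are illustrative."""
    hi = max(game["home_score"], game["away_score"])
    lo = min(game["home_score"], game["away_score"])
    winner = game["home"] if game["home_score"] > game["away_score"] else game["away"]
    loser = game["away"] if winner == game["home"] else game["home"]
    # small data-driven wording choice, the kind templates allow
    verb = "edged out" if hi - lo <= 3 else "defeated"
    return f"{winner} {verb} {loser} {hi}-{lo} on {game['date']}."

print(generate_recap({"home": "Tigers", "away": "Bears",
                      "home_score": 5, "away_score": 3, "date": "Saturday"}))
# Tigers edged out Bears 5-3 on Saturday.
```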

Don’t forget the humans!

Once these algorithms are optimised to allow newsrooms to use robotic reporters to write and edit news stories independently, this could have a serious impact on human journalists. Stuart Frankel, CEO of Narrative Science, envisions a media landscape in which “a reporter is off researching his or her story, getting information that’s not captured in some database somewhere, really understanding the story in total, and writing part of that story, but also including a portion of the story that in fact is written by a piece of technology.” In his vision, journalists would not be dismissed; the labour would merely be reallocated, ensuring a higher level of efficiency. Moreover, the portions written by the algorithm would often provide meaningful output from complex data, and be less biased, and in that sense more trustworthy, than could be expected from a human journalist.

Other voices have expressed more caution. They emphasise the humanity that is inherently linked to high quality journalism. This argument is valid, especially for wholly automated articles, which indeed lose a sense of complexity, originality, authenticity and emotionality that only a human can express. An article written by an algorithm will never intentionally contain new ideas or viewpoints. And this generic nature is one of the downsides of automated journalism when ensuring a diverse media landscape. The media play a crucial role in a representative democracy, characterised by its culture of dissent and argument. Generic news stories do not invigorate this culture.

Still, the evolution toward a media landscape in which algorithms write portions of stories should be embraced. However, there is an important caveat: these pieces should be edited by human journalists or publishers and supplemented by parts written by the human reporters themselves, to avoid a sole focus on quantitative content diversity, i.e. a merely numerical assessment of diversity that ignores quality.

Moreover, one must not underestimate the possibility of human journalists simply losing their jobs or seeing their jobs change to the role of an editor of algorithmic output. Carlson even highlights the predictions of certain technology analysts, who foresee that “recent developments in computing may mean that some white-collar jobs are more vulnerable to technological change than those of manual workers. Even highly skilled professions, such as law, may not be immune”.

Quality content remains crucial

Indeed, these are possible risks. Still, one should not overestimate these negative side effects and lapse into doom scenarios. People will remain interested in quality content. Reallocation of resources due to converging media value chains has had remarkably interesting consequences that often demonstrate this interest. Original content creation by streaming services such as Netflix and Amazon has had incredible success. Furthermore, the proliferation and popularity of user-generated (journalistic) content and citizen investigative journalism websites (e.g. Bellingcat) shows that interesting new content is emerging, albeit perhaps in a less traditional sense. We should therefore remain hopeful that Frankel’s attractive vision of reporters using technology to enhance the quality of their news stories will have a positive impact on media diversity and pluralism.

Note: This blog was not written by an algorithm! Instead, it is a modified and updated version of a text published on the LSE Media Policy Project blog. This article provides the views of the named authors only and does not represent the position of the LSE Media Policy Project blog, nor the views of the London School of Economics.

Understanding the Competitive Landscape of News Providers on Social Media

@SNOW/WWW, 2016, by Devipsita Bhattacharya and Sudha Ram


New York Times article using a combination of web technologies to display content.

News consumption and distribution have undergone an unprecedented change with the rise of the Internet. News in this day and age has become synonymous with e-articles published on news websites. Web and electronic presentation technologies have enabled news providers to create content-rich webpages that deliver news in a detailed and engaging manner. News articles now contain a variety of content, such as text, images, podcasts, videos and real-time user comments. For anyone with an Internet connection, news is now an on-demand commodity.

New York Times Official Twitter Account Page

With features such as social recommendation, content sharing and micro-blogging, social media websites (e.g. Twitter) also play a critical role in the electronic distribution of news articles. After reading online news articles, users submit their recommendations on Twitter in the form of tweets, which other users then view, leading to visits to the article webpages. User activities such as these have indirectly enabled news providers to make their audience aware of the content published daily on their websites. News article sharing has also helped news providers reach a much wider audience far more cost-effectively than was previously possible. Moreover, in response to the popularity of social media, news agencies have created official accounts on various social media websites and use them to regularly post about selected news articles published on their respective websites.

Our previous work developed methodologies for analyzing news article propagation on social media, e.g., Twitter. We have extracted and analyzed several Twitter-based propagation networks to examine characteristics of user participation in news article propagation. Using these networks, we formulated a framework for performance measurement of news article propagation [1; 2; 4]. Our framework includes measures in two major categories: structural and temporal properties of propagation. Structural properties measure unique characteristics of the propagation network, such as the length of the longest cascade chain, average cascade length, user contribution, effective influence of the news provider, article conversion ratio and multiplier effect. Temporal properties include measures related to the lifespan of articles, retweet response rate, and rate of spread of tweets. We have also extracted implicit networks of user-user relationships based on commonalities of article preference and tweeting activity, and examined how users connect over time based on their article sharing activity [3]. Our work has enabled us to examine news propagation in a unique and multi-faceted way by harnessing the power of network science.

Our current study focuses on similarities and differences in news propagation patterns on social media based on the primary channel of a news provider. Apart from enabling e-articles, the Internet has also transformed the news landscape, with traditional news providers (print, televised, etc.) competing for news article page views. That is, news providers that previously competed primarily within their channel of news distribution now find themselves competing with a whole new set of participants. For instance, before the Internet, newspapers such as the New York Times and the Washington Post competed for subscribers and advertising revenues from their printed newspapers. Similarly, network news companies such as CNBC and CNN competed with each other for audience engagement during the prime-time news hour. However, with each of these news providers now running news websites, the competition is no longer limited to rivals in their primary distribution channel. News providers now also contend to attract advertisers and readers to their article webpages. In our study, we compared the patterns of news article propagation on Twitter based on the primary channel of news distribution. News providers on the Internet can be grouped into different categories based on their primary channel of distribution.

Primary Channels of News Distribution

Generally, Porter’s Five Forces model is used for strategic analysis of organizations in an industry. In our study, however, we develop a network-based methodology for analyzing the competition among these news channel categories on social media. We collected a dataset of 24 million article tweets from Twitter for 32 news providers over a 3-week period (September 1-22, 2015).

Using this dataset, we extracted news propagation patterns for each news provider and analyzed the similarities and differences between their networks. Our Twitter-based propagation network is a user-user network defined for a single news provider. Each node represents a Twitter user participating in the article sharing activity of that news provider. Each edge (directed from source to target) represents the aggregate retweeting relationship established between two users. It is a network of aggregated propagation activity (i.e., across multiple articles) observed over a period of time.
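Such an aggregate network can be sketched as follows (a toy construction with invented field names, not our actual pipeline):

```python
from collections import defaultdict

def build_propagation_network(retweets):
    """Aggregate user-user propagation network for one news provider.

    `retweets` is an iterable of (source_user, retweeting_user, article_id)
    tuples. Edges are directed source -> target and weighted by how many
    times the target retweeted the source, aggregated across articles.
    """
    edges = defaultdict(int)   # (source, target) -> retweet count
    nodes = set()
    for source, target, _article in retweets:
        edges[(source, target)] += 1
        nodes.update((source, target))
    return nodes, dict(edges)

nodes, edges = build_propagation_network([
    ("@nytimes", "alice", "a1"),
    ("@nytimes", "alice", "a2"),   # same pair, different article: weight 2
    ("alice", "bob", "a1"),
])
print(len(nodes), edges[("@nytimes", "alice")])  # 3 2
```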

List of News Providers and their Primary News Channels.

We compared the networks using a number of structural properties of their propagation networks, including the density of each network, the proportion of disconnected users, the average length of user cascade chains, the number of retweeting relationships per user, the ability of users to form communities, and the tweeting frequency of users.
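Two of these properties, density and the number of disconnected components, can be computed with plain Python (an illustrative sketch, not our measurement code):

```python
def directed_density(n_nodes, n_edges):
    """Density of a directed network: fraction of possible edges present."""
    return n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0

def count_components(nodes, edge_pairs):
    """Number of weakly connected components, i.e. disconnected
    communities, via a simple union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edge_pairs:
        parent[find(a)] = find(b)
    return len({find(v) for v in nodes})

nodes = {"a", "b", "c", "d", "e"}
edges = [("a", "b"), ("b", "c")]       # d and e are isolated
print(directed_density(len(nodes), len(edges)))  # 2 / 20 = 0.1
print(count_components(nodes, edges))            # 3
```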

Important Findings

Visualizing the propagation patterns of different news provider's user-user network. Nodes accordingly colored to outline different user communities.

We determined that, compared to the networks of other channels, “online only” news providers have the smallest (in number of nodes and edges) but densest networks. Interestingly, even with high density, their networks had a higher concentration of disconnected nodes. This is expected, since “online only” news providers emerged only recently compared to the other providers in our sample. For the other news channels, we drew mixed inferences when examining the structural properties of their propagation networks, but we were able to establish a statistically significant difference between the news channels based on their structural properties.

Our analysis of the news channels using a network based methodology makes several contributions.

  1. It allows news providers to benchmark their social media based propagation performance against other competitors in the same or in a different primary distribution channel. This is particularly important since on social media, even traditional suppliers of news (e.g. News agencies such as Reuters, Associated Press) are considered direct competitors for any news provider hosting an online news website.
  2. We identified features unique to our Twitter-based aggregate user-user networks. Important among these is the presence of multiple disconnected communities of nodes. On average, we found that a news provider’s propagation network had at least 4,000 disconnected communities containing two or more nodes. This highlights the importance of news article tweeting activity independent of that originating from news provider Twitter accounts.
  3. Network analysis adds a new dimension to competitive analysis, which generally considers participation volume (number of users) to measure engagement. For instance, we observed that “news agency” (Reuters, Associated Press, etc.) propagation networks had lower average counts of nodes and edges than “television” (ABC News, CNN, etc.) news networks, by a margin of 100,000. Judging by these values alone, television news providers emerge as “winners” in audience participation on social media compared to “news agency” networks. However, we also ascertained that television and news agency networks had approximately equal network diameters (19.5 and 19, respectively). So while, on average, television networks show higher tweeting and retweeting activity from their Twitter users, their audience’s ability to connect with each other and form the longest cascade chain over time is the same as that of “news agency” networks, which have lower average Twitter user participation counts.

Our research reveals that analysis of competition among news providers on social media requires a comprehensive consideration of the various facets of user participation. It also shows that network science can provide important insights into the changing landscape of news on social media.


[1] Bhattacharya, D. and Ram, S., 2012. News article propagation on Twitter based on network measures – An exploratory analysis. In Proceedings of the 22nd Workshop on Information Technology and Systems.

[2] Bhattacharya, D. and Ram, S., 2012. Sharing News Articles Using 140 Characters: A Diffusion Analysis on Twitter. In International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 966-971.

[3] Bhattacharya, D. and Ram, S., 2013. Community Analysis of News Article Sharing on Twitter. In Proceedings of the 23rd Workshop on Information Technology and Systems.

[4] Bhattacharya, D. and Ram, S., 2015. RT @News: An Analysis of News Agency Ego Networks in a Microblogging Environment. ACM Trans. Manage. Inf. Syst. 6(3):1-25.

Veracity and Velocity of Social Media Content during Breaking News: Analysis of November 2015 Paris Shootings

@SNOW/WWW, 2016, by Stefanie Wiegand and Stuart E. Middleton


Social media are becoming increasingly important as a source for journalists. Their advantage is that content is available quickly during breaking news events. In a breaking news event, journalists get first-hand eyewitness reports, often including photos or videos. While there is plenty of genuine information available, there is also plenty of satire, propaganda and copycat content. Journalists are torn between getting the story out first, at the risk of damaging their reputation if the information turns out to be false, and verifying that the content is genuine, at the cost of publishing too late. There have been suggestions to use the wisdom of the crowd, but in many cases social media act as an echo chamber, spreading rumours that often later turn out to be false. This is less of a problem for long-term news stories, because with time it becomes clearer what really happened, but in breaking news situations it can be tricky to quickly distinguish fact from fiction.

Our idea

Dashboards like TweetDeck or StoryFul help journalists organise the high volumes of content and discover newsworthy stories. Other tools like TinEye or Google image search can be used to verify images found on social media. Journalists use lists of trusted sources to check whether a content item is true or not. We agree that having a defined set of trusted (or indeed untrusted) sources is a good idea to filter the noise created by the echo chamber that is social media, but we think it can be partially automated.

Our trust model enables journalists to maintain a list of their sources, linking new content to authors. While a news story is being tracked on social media, content items are associated with authors and can be filtered using predefined lists. For each new content item it becomes clear immediately whether it is in some way related to a known source: posted by that source, mentioning that source, or attributed to it.
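A minimal sketch of this lookup, assuming a simple dictionary per content item (field names and source lists are illustrative, not the actual REVEAL implementation):

```python
# Hypothetical sketch of the trust-model relation check.
TRUSTED = {"@BBC", "@LeMonde"}
UNTRUSTED = {"@TheOnion"}

def source_relations(item, known_sources):
    """Return every way an item is related to a known source."""
    relations = []
    for source in known_sources:
        if item["author"] == source:
            relations.append((source, "posted_by"))
        if source in item["mentions"]:
            relations.append((source, "mentions"))
        if source in item["attributed_to"]:
            relations.append((source, "attributed_to"))
    return relations

item = {"author": "@TheOnion",
        "mentions": ["@BBC"],
        "attributed_to": []}
print(sorted(source_relations(item, TRUSTED | UNTRUSTED)))
# [('@BBC', 'mentions'), ('@TheOnion', 'posted_by')]
```

A journalist's filter then only needs to check whether any relation involves a trusted or untrusted source.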

We also want journalists to discover new eyewitness content quickly. This means we cannot rely on trending content from news organisations alone, since by then the content is no longer new. Instead, we look for content containing images or video that is new (less than 5 minutes since publication) and that is starting to be shared by more people. Chances are it has not yet been verified and is potentially eyewitness content.
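Such a discovery filter can be sketched as follows (field names and the minimum-shares threshold are assumptions; only the 5-minute freshness window comes from the text):

```python
from datetime import datetime, timedelta

FRESH_WINDOW = timedelta(minutes=5)   # the "<5 minutes" threshold from the text

def is_discovery_candidate(item, now, min_shares=2):
    """Flag fresh media content whose share count is starting to grow."""
    fresh = now - item["published"] < FRESH_WINDOW
    has_media = item["has_image"] or item["has_video"]
    return fresh and has_media and item["shares"] >= min_shares

now = datetime(2015, 11, 13, 22, 0)
fresh_item = {"published": now - timedelta(minutes=3),
              "has_image": True, "has_video": False, "shares": 4}
stale_item = {"published": now - timedelta(minutes=20),
              "has_image": True, "has_video": False, "shares": 40}
print(is_discovery_candidate(fresh_item, now))  # True
print(is_discovery_candidate(stale_item, now))  # False
```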

What we’ve done

We crawled various social media sites (Twitter, YouTube and Instagram) using our own crawling software, searching for content with specific hashtags (e.g. #Paris). We used natural language processing techniques to identify named entities (such as "BBC" or "Le Monde") in English and French, as well as mentioned URLs. We then imported the data into our trust model, which already contained a sample list of trusted and untrusted sources (e.g. @BBC was defined as a trusted source, @TheOnion as an untrusted one). This way, we can easily retrieve all content written by, mentioning, or attributed to a specified source.
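As a simplified stand-in for that extraction step (the actual pipeline used named-entity recognition in English and French; a regex pass only covers @-mentions and URLs):

```python
import re

# Hypothetical simplification of the NLP step: pull @-mentions and URLs
# out of a post's text so they can be matched against the source lists.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def extract_mentions_and_urls(text):
    return MENTION_RE.findall(text), URL_RE.findall(text)

mentions, urls = extract_mentions_and_urls(
    "RT @BBC: explosion reported near the stadium http://bbc.in/example #Paris")
print(mentions, urls)  # ['@BBC'] ['http://bbc.in/example']
```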

To show how using trusted sources can help a journalist, we picked five pictures posted during the night of the Paris attacks, four of them genuine and one false. We also identified URLs for copies of each posted image that might have been shared instead of the original image URL. We then queried our database in 10-minute intervals during the first hour after each image was published to see how often it was shared (overall and by trusted/untrusted sources).
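The interval query amounts to bucketing share timestamps relative to publication time; a minimal sketch (function name and data layout assumed):

```python
from collections import Counter
from datetime import datetime, timedelta

def shares_per_interval(share_times, published, interval_min=10, horizon_min=60):
    """Count shares in fixed-width intervals after publication."""
    counts = Counter()
    for t in share_times:
        offset = (t - published).total_seconds() / 60
        if 0 <= offset < horizon_min:
            counts[int(offset // interval_min)] += 1
    return [counts[i] for i in range(horizon_min // interval_min)]

published = datetime(2015, 11, 13, 21, 30)
shares = [published + timedelta(minutes=m) for m in (1, 4, 12, 12, 35)]
print(shares_per_interval(shares, published))  # [2, 2, 0, 1, 0, 0]
```

Running the same count restricted to trusted or untrusted authors gives the per-group curves compared in the experiment.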

This is one of the genuine tweets we used. The author happened to be nearby when the events unfolded and posted several photos to Twitter over the course of the evening.

This tweet contains a "fake" image, taken out of context from the Charlie Hebdo attacks in early 2015.

In our second experiment, we ranked URLs by the number of mentions. Every 5 minutes, we compared the currently top-ranking URLs being shared on social media and filtered out the old ones (i.e. those that had already been shared in a previous window). By doing this, we tried to detect new eyewitness content to investigate before it went viral.
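The windowed ranking with old-content filtering can be sketched like this (a sketch under assumed data shapes, not the actual implementation):

```python
from collections import Counter

def new_top_urls(window_mentions, seen, top_n=3):
    """Rank URLs shared in the current window and drop those seen before."""
    counts = Counter(window_mentions)
    fresh = [(url, n) for url, n in counts.most_common() if url not in seen]
    seen.update(counts)          # remember everything for the next window
    return fresh[:top_n]

seen = set()
print(new_top_urls(["a", "a", "b", "c"], seen))  # [('a', 2), ('b', 1), ('c', 1)]
print(new_top_urls(["a", "d", "d"], seen))       # [('d', 2)]
```

In the second window, "a" is suppressed because it already appeared earlier, so only genuinely new content surfaces.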

What we found and what it means

When analysing eyewitness content, we found that

  • Untrusted sources share images earlier than trusted sources
    When an image has not yet been verified, untrusted sources pick it up earlier.
  • Trusted sources are an indication that an image is authentic
    Although verification still needs to be done by a human, the involvement of trusted sources with a piece of user-generated content makes it more likely to be genuine. Typically, this becomes apparent about 30 minutes after a picture has been published.
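As an illustration only, the two observations above could be combined into a crude indicator; the thresholds and labels here are hypothetical, not the paper's model:

```python
def trust_signal(sharers, trusted, untrusted, minutes_since_publish):
    """Crude indicator based on who has shared an image so far.
    Thresholds and labels are illustrative assumptions."""
    if minutes_since_publish < 30:
        return "too early to judge"
    n_trusted = sum(1 for s in sharers if s in trusted)
    n_untrusted = sum(1 for s in sharers if s in untrusted)
    if n_trusted > 0 and n_trusted >= n_untrusted:
        return "likely genuine - verify with the source"
    return "treat with caution"

trusted, untrusted = {"@BBC", "@LeMonde"}, {"@TheOnion"}
print(trust_signal(["@BBC", "@someone"], trusted, untrusted, 35))
# likely genuine - verify with the source
print(trust_signal(["@TheOnion"], trusted, untrusted, 35))
# treat with caution
```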

For verification, our central hypothesis is that the "wisdom of the crowd" is usually no wisdom at all. We think it is often better to base a decision on a few trusted sources than to risk falling victim to the "echo chamber". Our results show that from about 30 minutes onwards, the involvement of trusted sources gives a good indication of the veracity of a piece of user-generated content. If a journalist is prepared to wait 30 minutes (or perhaps discovers an image after that time), this signal can point them in the right direction for conventional means of verification, such as attempting to contact the source directly and doing some factual cross-checking.

For the discovery of newsworthy eyewitness content, we found that it helps to filter out old content. We chose a time window of 5 minutes, but other windows are possible. Using this method, all five of our tested images showed up in the top 6% of all content crawled during this period. This means a journalist scanning a social media stream for newsworthy content would not have to check hundreds or thousands of URLs but could focus on the top-ranked ones. Of course this does not mean all top URLs will contain genuine images, but they are more likely to be relevant eyewitness content. This approach can also be combined with other state-of-the-art filtering approaches, such as automated multimedia fake detection, to further improve the quality of the real-time content recommended to journalists.

Where to go from here

Our results, although preliminary, look promising. An estimate of the truth content of social media posts, presented graphically, could help journalists become faster and more efficient. Beyond the trusted-source lists we have used, our method can easily be extended with other information, such as weather or lighting conditions; this information is already available and could be obtained dynamically.

The essence of this work is that we try to assist journalists, not to replace them by fully automating the process; we do not think that is possible anytime soon. By automating the manual, labour-intensive parts of the verification process, however, we can give them a tool to verify and publish faster and with more confidence. Hopefully, this helps them better deal with the pressure of breaking news publishing.

If you want to know more about our work on this, you can read our publications about real-time crisis mapping of natural disasters or extracting attributed verification and debunking reports using social media, visit the REVEAL website, or follow us on Twitter at @RevealEU and @stuart_e_middle.