Author Archives: Gianmarco De Francisci Morales

Changing News Media landscape in South Korea

@SNOW/WWW, 2017, by Hongjun Lim, Choongho Chung, Jihee Kim, Juho Kim, Sue Moon and Meeyoung Cha

Our research reports one such transformation point in the Korean media landscape, which started with a political scandal in 2016 and is still ongoing as of the writing of this research in February 2017. The scandal, involving allegations of extortion, abuse of power, and bribery against the Korean president, led to the largest protest ever held in the country on December 3rd, with an estimated 2.3 million or more participants (4.5% of South Korea’s population) gathering for peaceful demonstrations. The cumulative count of participants from late October to December adds up to more than 10 million. A motion of impeachment was approved by the National Assembly on December 9th and awaits a final ruling by the Constitutional Court.

Coverage of this scandal by Korean media is noteworthy for several reasons. First, the event increased the news audience. Television viewer ratings for news programs went up by 1.5 times, and the total reactions to news stories on social media went up by 1.9 times. The event also set a new standard for journalistic convention in that many “exclusive” stories were produced by newsrooms. Second, news media rankings changed. Until recently, news influence in Korea was dominated by established newsrooms such as Chosun Daily, Joongang Daily, Kyunghyang, and MBC News, which were founded between half a century and a little over a century ago. The political scandal, however, changed this ranking. The cable network JTBC, established in 2011, was the first to disclose concrete evidence that sparked the presidential scandal. JTBC’s sheer focus on the scandal helped it quickly gain the largest audience in terms of both television viewer ratings and social media reactions. How was this relatively young newsroom able to take the lead over established newsrooms that are half a century old?

The public reception of the scandal measured by content popularity.

We first examine the temporal evolution of the 2016 presidential scandal from the public reactions and the media coverage, in turn. First, the public reaction can be represented by the aggregate Likes count received on news posts. This trend is shown in Figure 1, where the news story (‘Disclosure of evidence on political scandal’ in the figure) published on October 24th evening by the JTBC newsroom marks the beginning of the huge public interest. That news story was the very first instance for any news media outlet to disclose a tangible piece of evidence supporting the prior allegations on the president. Compared to the week before the scandal, news articles on Facebook received on average 1.9 times more Likes per post. While not shown in the figure, the total number of news posts per day, however, is similar to early October in that none of the five media outlets produced significantly more news posts on their Facebook page upon the event.

Figure 1 : Timeline of the 2016 South Korea’s presidential scandal seen by the daily total Likes on Facebook news posts. The Likes count has been aggregated across all five news outlets.

Shifts in media ranking and effects of exclusive news post.

Figure 2(a) shows the average Likes count per post for each media outlet along with error bars measured in standard errors before and after the event. Most platforms show an increment in audience engagement. However, the most popular right-wing outlet that has the largest offline newspaper circulation in South Korea, Chosun Daily, shows a slight decrease in the Likes count. JTBC, the youngest newsroom, in contrast, more than triples the Likes count over the same time period. This finding demonstrates that circumstances surrounding the political scandal did lead to a change in the South Korean media landscape in terms of which media the public favors.
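The per-outlet averages and error bars described above can be illustrated with a minimal sketch. The Likes counts below are hypothetical placeholders, not values from the dataset, and the function name is ours.

```python
import math
from statistics import mean, stdev


def mean_and_stderr(likes):
    """Return (mean, standard error of the mean) for a list of Likes counts."""
    n = len(likes)
    if n < 2:
        raise ValueError("need at least two posts to estimate a standard error")
    # Standard error = sample standard deviation / sqrt(number of posts).
    return mean(likes), stdev(likes) / math.sqrt(n)


# Hypothetical Likes per post for one outlet (illustrative only).
before = [100, 120, 80, 100]
after = [300, 340, 260, 300]

m_before, se_before = mean_and_stderr(before)
m_after, se_after = mean_and_stderr(after)

# Ratio of average Likes per post after vs. before the event.
ratio = m_after / m_before
```

A ratio above 1 corresponds to an outlet that gained engagement over the event, and a ratio below 1 to one that lost it.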

Figure 2(b) shows the percentage of news that we deem ‘exclusive’ stories for each news media outlet. All posts containing either ‘exclusive news’ or ‘special news’ were classified as the exclusive news type, while all other posts were labeled as the general news type. The rate of exclusive news across media outlets was proportional to the rate of Likes the outlets gathered, as seen by the reverse ordering of media outlets in Figures 2(a) and (b). The results show that JTBC’s rise in popularity can be partially explained by its higher percentage of exclusive news posts compared to those of other news media outlets.
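The keyword rule for labeling exclusive posts can be sketched as follows. The English marker strings stand in for whatever markers appear in the original posts, and the sample posts are invented for illustration.

```python
# Assumed English stand-ins for the markers used in the actual posts.
EXCLUSIVE_MARKERS = ("exclusive news", "special news")


def is_exclusive(post_text):
    """True if the post text contains any exclusive-news marker."""
    text = post_text.lower()
    return any(marker in text for marker in EXCLUSIVE_MARKERS)


def exclusive_percentage(posts):
    """Percentage of posts classified as the exclusive news type."""
    if not posts:
        return 0.0
    return 100.0 * sum(is_exclusive(p) for p in posts) / len(posts)


# Hypothetical posts (illustrative only).
sample_posts = [
    "[Exclusive news] New evidence disclosed",
    "Weather forecast for the weekend",
    "Special news: interview with a key witness",
    "Sports results",
]
share = exclusive_percentage(sample_posts)
```

Every post without a marker falls into the general news type, matching the two-way split described above.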

(a) Average Likes count per media (b) Average Likes count per media
Figure 2 : Likes count change and exclusive news percentage for each media outlet

Divergence of topics over time.

We further examine how this newly emerged platform compares to other newsrooms over time, after the convergence of topics among all selected news media. Figure 3 shows the change in topical similarity between JTBC and the most established outlet, Chosun Daily. The topical similarity between the two news providers starts as low as 0.35 and rises to a peak of 0.52 at the disclosure of the political scandal. In essence, the topical similarity increases when the two news outlets have more topics in common. After reaching its peak in the second week, the headline topic similarity gradually decreases over time.

Figure 3 : Diverging topical similarity in news headlines between two news outlets: JTBC and Chosun Daily

In order to look deeper into this topic divergence, we visualized the topical networks of the two media. Figures 4(a) and (b) are network plots of headline topics covered by JTBC and Chosun Daily. Figure 4(a) shows that most of the news stories published by JTBC are about the scandal, indicated by the large coverage of the bubble. Prominent topics in the graph are ‘president’, ‘impeachment’, ‘recording’, and ‘judgment’. Some topics, such as ‘Article’ and ‘Social department’, appear not to be directly related to the event. The visualization in Figure 4(b) for Chosun Daily, in contrast, shows a smaller coverage of the bubble. A number of topics, such as ‘establishment of a new party’, ‘North Korea’, and ‘the THAAD US army missile system’, appear not to be directly related to the scandal. The difference in topical similarity is known to be caused by the differing preferences of each newsroom: to focus on the current biggest issue in more detail, or to focus on other topics unrelated to the political scandal, respectively.

(a) Topic network of JTBC (b) Topic network of Chosun Daily
Figure 4 : Comparison of the two news outlets’ topic networks


In our research, we collect data from the Facebook pages of major newspaper publishers in South Korea in order to examine the changing media landscape over this notable political scandal.

News posts studied in this paper were accessed through the Facebook Graph API over a single day period on January 10th, 2017. We gathered a total of 166,131 posts, which are the entire set of public posts of five news media outlets in South Korea.

Media            Data since     Posts      Total Likes
JTBC             2011/11/28~    17,323     12,612,111
Chosun Daily     2010/04/16~    26,032     25,210,818
Joongang Daily   2010/11/09~    47,089     10,634,283
Kyunghyang       2012/05/28~    33,001     31,655,183
Hankyoreh        2010/08/05~    42,686     27,194,615
Total            2010/04/16~    166,131    107,307,010
Table 1 : Summary of the dataset

Table 1 displays key information about the gathered data, indicating the data period, post count, and aggregate Likes count of news posts. The five outlets opened their Facebook pages at different points in time, which accounts for the different time ranges of data across outlets. For each news post, we collect information about the timestamp (i.e., time when a post was uploaded to Facebook), main text (i.e., short text description about any linked content), news headline (i.e., text headline of the news article that is linked on the Facebook post, if any), news video (i.e., video content that is directly uploaded to the page, if any), news image (i.e., image content that is directly uploaded, if any), as well as the count of Likes and Comments. To ensure that all posts had enough time to circulate within Facebook, we analyze only those news stories posted in 2016.

For more details, please refer to our paper entitled: “Changing News Media landscape in South Korea” by Hongjun Lim, Choongho Chung, Jihee Kim, Juho Kim, Sue Moon and Meeyoung Cha, Fourth Workshop on Social News on the Web @ WWW’17 (SNOW 2017), Perth, Australia, April 2017.

Now You See It, Now You Don’t! A Study of Content Modification Behavior in Facebook

@SNOW/WWW, 2017, by Fuxiang Chen and Ee-Peng Lim

In social media, content posts and comments can be edited or removed. We therefore define two types of content modification, namely: (a) content censorship (CC), and (b) content edit (CE). Content censorship refers to complete deletion of some content post or comment, whereas content edit refers to edits made to a content post or a comment.


We selected 57 public Facebook pages from three regions (Hong Kong, Singapore and the United States), ranging from News to Community, Event, and Group pages, for our study. All these pages contain content mainly in English.

Figure 1: Studying and Tracking Periods

We download the posts and comments created during the period from 1 January 2016 to 23 August 2016 (Study Period) and track them for changes that occur during a period of three weeks, from 8 August 2016 to 23 August 2016 (Tracking Period). The tracking period covers many versions of the Facebook pages. By comparing every two consecutive versions of posts and comments, we determine two types of changes to the posts and comments in these pages. A post or a comment is edited if its content varies in two consecutive versions. A post or a comment is deleted if it is present in the earlier version but not the next version.
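The comparison of consecutive versions can be sketched as below. We assume each version is a snapshot mapping a post or comment id to its text. That representation and the function names are illustrative, not the authors' implementation.

```python
def diff_versions(earlier, later):
    """Compare two snapshots (dict: item id -> text content).

    Returns (edited_ids, deleted_ids):
    - edited: present in both versions but with different content
    - deleted: present in the earlier version but not in the later one
    """
    edited = sorted(i for i in earlier if i in later and earlier[i] != later[i])
    deleted = sorted(i for i in earlier if i not in later)
    return edited, deleted


def track_changes(snapshots):
    """Apply diff_versions to every pair of consecutive snapshots."""
    return [diff_versions(a, b) for a, b in zip(snapshots, snapshots[1:])]


# Hypothetical snapshots of one page (illustrative only).
v1 = {"p1": "hello", "c1": "nice", "c2": "spam"}
v2 = {"p1": "hello world", "c1": "nice"}
v3 = {"p1": "hello world"}

changes = track_changes([v1, v2, v3])
```

Items that first appear in a later snapshot are simply new content and are not reported as edits or deletions.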

Empirical Data Analysis

We first investigate the likelihood of edit and removal for posts and comments.

Figure 2: Likelihood of Posts/Comments getting Edited/Removed

We observed that posts are more likely to be edited than removed (see Figure 2). We believe that Facebook users generally spend more time crafting and writing a post, and thus posts are not likely to be removed. In contrast, comments are more likely to be removed than edited. We believe that Facebook users generally spend much less time writing comments, so the loss of effort is minimal when a comment is removed.

We also investigate the recency effects of content modification in Facebook in two aspects. We first analyse the content censorship and edit actions performed on posts and comments created during the tracking period. For each censorship and edit action, we determine the number of days between the content creation date and the action date. We then bin each action by its number of days, and count the number of censorship and edit actions in each bin.
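The binning step can be sketched in a few lines. The dates below are hypothetical, and we assume an action on the creation date falls into bin 0.

```python
from collections import Counter
from datetime import date


def days_to_action(created, acted):
    """Number of whole days between content creation and the edit/removal."""
    return (acted - created).days


def bin_actions(actions):
    """actions: iterable of (creation_date, action_date) pairs.

    Returns a Counter mapping day offset -> number of actions in that bin.
    """
    return Counter(days_to_action(c, a) for c, a in actions)


# Hypothetical edit actions within the tracking period (illustrative only).
edit_actions = [
    (date(2016, 8, 8), date(2016, 8, 8)),
    (date(2016, 8, 8), date(2016, 8, 9)),
    (date(2016, 8, 10), date(2016, 8, 10)),
    (date(2016, 8, 9), date(2016, 8, 12)),
]
bins = bin_actions(edit_actions)
```

Running the same function separately on edit and censorship actions gives the two sets of counts plotted per day bin.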

Figure 3: Recency Effect in CE (Tracking Period)

Figure 4: Recency Effect in CC (Tracking Period)

Figures 3 and 4 show the edit and censorship action counts for the different day bins, respectively. Figure 3 shows that the number of post and comment edits is highest on the first day and then decreases exponentially until the 7th day. Similarly, Figure 4 shows the same exponentially decreasing trend in the number of post and comment censorship actions. This suggests that users are more likely to modify more recent posts and comments.

CC & CE Annotation

After detecting the edited posts and comments, we then seek to understand the reasons behind these post/comment modifications using a manual annotation task.

Manual judgements on these post/comment removals and edits show that the majority of content censorship is related to negative reports on events and personal grouses, while content edits are mainly performed to improve content quality and correctness.

For more details, please refer to our paper entitled: “Now You See It, Now You Don’t! A Study of Content Modification Behavior in Facebook” by Fuxiang Chen and Ee-Peng Lim, 4th Workshop on Social News on the Web @ WWW ’17 (SNOW 2017), Perth, Australia, April 2017.


This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centres in Singapore Funding Initiative.

Hoaxy: A Platform for Tracking Online Misinformation

@SNOW/WWW, 2016, by Truthy Team

diffusion networks of hoaxes in Twitter

Approximately 65% of American adults access the news through social media. Through our shares and retweets, we participate in the propagation of the news that we find interesting and trustworthy. This has the implication that no individual authority can dictate what kind of information is distributed on the whole network. While such platforms have brought about a more egalitarian model of information access according to some, the lack of oversight from expert journalists makes social media vulnerable to the intentional or unintentional spread of misinformation. Do you believe what you read on social media?

Several characteristics of online social networks, such as homophily, polarized echo chambers, algorithmic ranking, and social bubbles, create considerable challenges for our capability to discriminate between facts and misinformation, and allocate our attention and energy accordingly. Furthermore, the harsh competition for our limited attention created by the fast news life cycle makes it inevitable that some news will go viral even if it carries false or unreliable information.

It is therefore not too surprising that many hoaxes have spread online in viral fashion, oftentimes with worrying real-world consequences, for example in health and finance. Examples include rumors, false news, and conspiracy theories. The recent emergence of fake news sites is a worrisome phenomenon. While some are funny, many attract eyeballs and advertising profits by spreading uncertainty, fear, panic and civil disorder.

Due to the magnitude of the phenomenon, media organizations are devoting increasing efforts to produce accurate verifications in a timely manner. For example, during Hurricane Sandy, false reports that the New York Stock Exchange had been flooded were corrected in less than an hour. The fact-checking network includes Snopes.com, PolitiFact, and others. More recently, efforts have been made to detect and track rumors. Fact-checking assessments are consumed and broadcast by social media users like any other type of news content, leading to a complex interplay between news memes that vie for the attention of users. Political scientists tell us that in many cases, fact-checking efforts may be ineffective or even counterproductive. How to make sense of all this? To date, there is no systematic way to observe and analyze the competition dynamics between online misinformation and its debunking.

To address some of these challenges, researchers at the Indiana University Network Science Institute (IUNI) and the School of Informatics and Computing’s Center for Complex Networks and Systems Research (CNetS) are working on an open platform for the automatic tracking of both online fake news and fact-checking on social media. The goal of the platform, named Hoaxy, is to reconstruct the diffusion networks induced by hoaxes and their corrections as they are shared online and spread from person to person. Hoaxy will allow researchers, journalists, and the general public to study the factors that affect the success and mitigation of massive digital misinformation.

hoaxy architecture

Hoaxy hasn’t been released to the public yet, but we have been collecting data for a few months. Preliminary analysis of traffic volume reveals that, in the aggregate, the sharing of fact-checking content typically lags that of misinformation by 10-20 hours. More interestingly, we find that the sharing of fake news is dominated by very active users, while fact checking is a more grass-roots activity. A paper with these results, titled Hoaxy: A Platform for Tracking Online Misinformation and authored by Chengcheng Shao, Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer, will be presented this April at the Third Workshop on Social News On the Web (SNOW), to be held in conjunction with the 25th International World Wide Web Conference (WWW 2016) in Montreal.

The World Economic Forum ranks massive digital misinformation among the top future global risks, along with water supply crises, major systemic financial failures, and failure to adapt to climate change. Social news observatories such as Hoaxy have the potential to shed light on this phenomenon and help us develop effective countermeasures.

For Your Eyes Only: Consuming vs. Sharing Content

@SNOW/WWW, 2016, by Roy Sasson and Ram Meshulam


Do you share on Facebook every page that you visit?

Assuming that the answer is “no”, how do you determine what to share? The answer to this question is meaningful for publishers, content marketers and researchers alike. Many of them try to infer user engagement from the sharing activity of users, among other signals. The underlying assumption is that highly shared articles are highly interesting/engaging for the users.

Based on more than a billion data points from hundreds of publishers that use Outbrain’s Engage platform worldwide, we show that the above assumption is not necessarily true. There is a dissonance between what users choose to read in private vs. what they choose to share on Facebook. We denote (log of) the ratio between private engagement (measured by click-through rate) and social engagement (measured by share-rate) as the private-social dissonance.
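As a minimal sketch of this definition, the dissonance can be computed as the log of click-through rate over share-rate. The natural logarithm, the absence of any normalization, and the category rates below are all assumptions made for illustration.

```python
import math


def dissonance(ctr, share_rate):
    """Private-social dissonance: log of (click-through rate / share-rate).

    Natural log is an assumption; the base only rescales the values.
    """
    if ctr <= 0 or share_rate <= 0:
        raise ValueError("rates must be positive to take the log of their ratio")
    return math.log(ctr / share_rate)


# Hypothetical (ctr, share_rate) pairs per category (illustrative only).
rates = {
    "category_a": (0.04, 0.01),  # read more than shared
    "category_b": (0.02, 0.02),  # balanced
    "category_c": (0.01, 0.04),  # shared more than read
}

# Sort categories in descending order of dissonance.
ranked = sorted(rates, key=lambda c: dissonance(*rates[c]), reverse=True)
```

Under this definition a positive value means relatively more private reading than sharing, and a negative value means the reverse.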

The private-social dissonance consistently varies across content categories. Content categories such as Sex, Crime and Celebrities are characterized by a high positive dissonance. Articles under these categories tend to be visited relatively more than they are shared. On the other hand, content categories such as Books, Wine and Careers are characterized by a negative dissonance. Articles under these categories tend to be shared relatively more than their popularity would suggest.


This figure shows content categories, sorted in a descending order by their private-social dissonance. To the human eye, inspecting the categories from top-left to bottom-right the picture is clear. Users tend to read without sharing articles from categories that could harm (or not increase) their social appeal. On the other hand, users tend to share categories that are not relatively popular, yet they reflect a positive and socially desirable identity of the sharing user. Our results are time consistent and did not vary substantially during a period of one year.

To further test the value of social signals in terms of engagement, a model which utilizes different signals and produces click-prediction was trained and deployed on a live recommendation system. The resulting weights ranked the social signal lower than other signals, such as click-through rate.

What next?

It would be interesting to investigate the relation between private-engagement and Facebook’s recently announced Reaction buttons. Will some of the new buttons have a close-to-zero dissonance, and thus can be used as an accurate metric for engagement? Twitter is also a good candidate for investigation. Another direction is to use refined private-engagement signals instead of CTR, such as time-on-page or scrolling behavior. An interesting question can be – ‘do users actually read what they share?’.

Another interesting direction is to utilize the private-social dissonance in a classifier for inappropriate content. Articles with high positive dissonance are often inappropriate to some extent. Such a classifier is based on users’ behavior and does not rely on natural language processing or image processing.

In conclusion, publishers, marketers, architects of recommendation-systems and anyone who uses social signals as an engagement metric should be aware of the private-social dissonance.

For more details, please refer to our paper entitled: “For Your Eyes Only: Consuming vs. Sharing Content” by Roy Sasson and Ram Meshulam, Third Workshop on Social News on the Web @ WWW ’16 (SNOW 2016), Montreal, Canada, April 2016.

Predicting News Popularity by Mining Online Discussions

@SNOW/WWW, 2016, by Georgios Rizos, Symeon Papadopoulos, Yiannis Kompatsiaris

The identification of top online news stories very early after their posting, or even before, is an important problem, and solving it is invaluable to online social media, the news press, and aggregators. Recently, the New York Times launched a new tool called Blossom for internal use, which makes data-driven recommendations to journalists about which published stories will go viral when featured on Facebook. The tool employs machine learning on the firm’s big data stores, which shows that such applications and techniques are at the forefront of data-informed journalism.

But what can be done to ensure that the best content bubbles up to the place of most exposure in a social medium? It really depends on what users look for in their information consumption, and it may be more intricate than a simple score count.

Online users may wish to follow the zeitgeist by consuming viral content, or they may be professionals in search of more thought-provoking material, such as a discussion-raising or controversial story. Given the web’s limited attention, proper exposure mechanisms such as a smart news-feed or ranking process are important services provided by social media to cater to the information needs of the users and increase online traffic and monetization on their website.

Early prediction of online popularity for multimedia or news stories can be used for improved exposure mechanisms and is crucial to online social media stakeholders, business intelligence specialists and journalists alike.

Regarding news story popularity prediction specifically, a study in [1] describes a mathematical model that predicts the size of the discussion at a future point based on capturing patterns in the rate comments are added. Another study [2] has addressed the same task by looking at characteristics pertaining to the time (day, month) a post is made or the number of identified entities present in the post. However, such methods do not take into account the complexity of the structure of social interactions among the implicated users. This has been attempted in certain studies concerned with hashtag popularity prediction [3] and prediction of the number of shares in Facebook [4] with promising results, however this success has not been transferred to online discussions. Indeed, only simple characteristics of the structure of online discussions have been used in a study [5] tangentially related to news story popularity prediction.

What about a discussion that has attracted many comments, all of them exclaiming a simple agreement? What if a small number of users have posted multiple times or hi-jacked the thread for an unrelated discussion?

To address this need, our work on the EU FP7 research project REVEAL led to the development of a machine learning framework for predicting future popularity values of an online news story by analyzing the structural complexity of the online discussion that is generated about it. Our hypothesis is that by being able to capture different configurations of the early structure of the discussion, we can reveal future discussion size, number of contributing users, vote score and perhaps most importantly, controversiality.

Our framework aims to capture information from two structural sources present in online discussions: the comment tree and the user network that contribute to the discussion. We extract a number of features that characterize each of these networks. Using these features, a machine learning algorithm is used to make predictions on the future popularity of the post. An overview of our framework is shown in the following figure. But what is the intuition that motivates the utilization of these two kinds of networks?


A news post that generates a level-one depth comment tree, even with many comments agreeing or praising it might be indicative of light-hearted or shallow interest content. Instead, a story that generates multiple lengthy chained replies might indicate a more controversial or discussion-provoking source material.

Similarly, a user network indicating that the majority of replies are made between small numbers of users may imply thread hi-jacking or the presence of a conversation of power users. Alternatively, a discussion in which there is deliberation from the full set of implicated users may be of more general interest.
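To make the two structural sources concrete, here is a minimal sketch that derives one feature from each: the depth of the comment tree and the weighted reply edges of the user network. The input format (a parent map and an author map) and the function names are assumptions for illustration and do not reproduce the framework's actual feature set.

```python
from collections import Counter


def comment_tree_depth(parent_of):
    """Maximum depth of the comment tree.

    parent_of maps comment id -> parent comment id, or None for a direct
    reply to the post (depth 1). Assumes the map describes a valid tree.
    """
    def depth(cid):
        d = 0
        while cid is not None:
            d += 1
            cid = parent_of[cid]
        return d

    return max((depth(c) for c in parent_of), default=0)


def user_reply_network(parent_of, author_of):
    """Weighted directed edges (replier, replied-to user) -> reply count."""
    edges = Counter()
    for cid, pid in parent_of.items():
        if pid is not None:  # skip direct replies to the post itself
            edges[(author_of[cid], author_of[pid])] += 1
    return edges


# Hypothetical discussion (illustrative only).
parent_of = {"c1": None, "c2": "c1", "c3": "c2", "c4": None, "c5": "c3"}
author_of = {"c1": "ann", "c2": "bob", "c3": "ann", "c4": "cat", "c5": "bob"}

tree_depth = comment_tree_depth(parent_of)
reply_edges = user_reply_network(parent_of, author_of)
```

In this toy thread the long chain between two users yields a deep tree but a reply network concentrated on a single pair, which is the kind of configuration the intuition above associates with a conversation among a few users.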

In order to evaluate the performance of our methodology for score and controversiality prediction we collected a Reddit dataset by focusing on posts made in 2014 on several news-related subreddits. Whereas our approach proved superior to simpler past methods, the greatest improvement was achieved in terms of score and controversiality prediction.

In order to get a feeling of the performance of our method, we calculated the percentage (Jaccard coefficient) of the top-100 controversial stories that our method successfully predicted (as compared to the true top-100). This is shown in the following table, along with the results from two other approaches.

results table

Our method is denoted by all_graph. The method denoted by temporal is based on a subset of the features used to capture growth rate in [4] and the combination of the two methods is all. The percentages shown refer to the lifetime that the post has been uploaded. In order to make the comparisons at an early stage of the discussions we show the results from 1-14% of the mean time elapsed between their posting and the accumulation of 99% of the comments.
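The top-100 comparison relies on the Jaccard coefficient between the predicted and true sets of top stories. The sketch below shows the computation on a toy example with k=2. The story ids and scores are hypothetical.

```python
def top_k(scores, k):
    """Ids of the k highest-scoring stories (scores: dict id -> value)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical true and predicted controversiality scores (illustrative only).
true_scores = {"s1": 9, "s2": 7, "s3": 5, "s4": 1}
predicted_scores = {"s1": 8, "s2": 2, "s3": 6, "s4": 7}

overlap = jaccard(top_k(true_scores, 2), top_k(predicted_scores, 2))
```

A value of 1 means the predicted top-k set matches the true one exactly, and 0 means the two sets share no stories.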

As an example, at the 5% post lifetime, the most controversial post that our method identified was titled “Gun deaths for U.S. officers rose by 56 percent in 2014: report.”. We see that many users contribute to the discussion by linking to more specific information, although some disagree, claiming that the title is worded to evoke sensationalism, while others discuss how civilian gun deaths are a related and under-reported statistic.

We have thus shown that a more in-depth representation of the structure of social interactions around news posts is a successful means of predicting popularity and identifying top material. This network-based approach can even be used to complement other methods such as text-based ones or methods that examine the poster’s influence on the medium. We will continue our efforts to extend our method with improved technologies and additional sources of information.

The code for our method and the full series of experiments we performed can be found on GitHub.


[1] A. Tatar, P. Antoniadis, M. D. De Amorim, and S. Fdida. From popularity prediction to ranking online news. Social Network Analysis and Mining, 4(1):1–12, 2014.

[2] M. Tsagkias, W. Weerkamp, and M. De Rijke. Predicting the volume of comments on online news stories. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1765–1768. ACM, 2009.

[3] L. Weng, F. Menczer, and Y.-Y. Ahn. Predicting successful memes using network and community structure. arXiv preprint arXiv:1403.6199, 2014.

[4] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades be predicted? In Proceedings of the 23rd international conference on World Wide Web, pages 925–936, 2014.

[5] J. Lee, M. Yang, and H. Rim. Discovering high-Quality threaded discussions in online forums. Journal of Computer Science and Technology, 29(3):519–531, 2014.

Send in the robots: automated journalism and its potential impact on media pluralism

@SNOW/WWW, 2016, by Pieter-Jan Ombelet, Aleksandra Kuczerawy, Peggy Valcke

Employing robot journalists: legal implications, considerations and recommendations

Resources for investigative journalism are diminishing. In the digital age, this was a foreseeable evolution: publishers typically regard these pieces as time-consuming and expensive, and the results of the research are often unpredictable and potentially disappointing. We analyse automated journalism (also referred to as robotic reporting) as a potential solution to combat the diminution of investigative journalism, and look at the potential (positive and negative) impact of automated journalism on media content diversity.

What is automated journalism?

Automated journalism was defined by Matt Carlson as “algorithmic processes that convert data into narrative news texts with limited to no human intervention beyond the initial programming”. Narrative Science and Automated Insights are arguably the biggest companies at the moment specialising in this algorithmic content creation. Once there is core data to work with, the software of these companies can extrapolate complete news stories out of this data. To date, the most common uses of this software have been in the field of sports and financial reporting, often creating niche content that would not exist without the respective software (such as reports on ‘Little League’ games).

Don’t forget the humans!

Once these algorithms are optimised to allow newsrooms to use robotic reporters to write and edit news stories independently, this could have a serious impact on human journalists. Stuart Frankel, CEO of Narrative Science, envisions a media landscape in which “a reporter is off researching his or her story, getting information that’s not captured in some database somewhere, really understanding the story in total, and writing part of that story, but also including a portion of the story that in fact is written by a piece of technology.” In his vision, journalists would not be discharged. The labour would merely be reallocated, hereby ensuring a higher level of efficiency. Moreover, the portions written by the algorithm would often provide meaningful output from complex data, and be less biased and in that sense more trustworthy than could be expected from a human journalist.

Other voices have expressed more caution. They emphasise the humanity that is inherently linked to high quality journalism. This argument is valid, especially for wholly automated articles, which indeed lose a sense of complexity, originality, authenticity and emotionality that only a human can express. An article written by an algorithm will never intentionally contain new ideas or viewpoints. And this generic nature is one of the downsides of automated journalism when ensuring a diverse media landscape. The media play a crucial role in a representative democracy, characterised by its culture of dissent and argument. Generic news stories do not invigorate this culture.

Still, evolving to a media landscape which uses algorithms to write portions of the story should be embraced. However, there is an important caveat: these pieces should be edited by human journalists or publishers and supplemented by parts written by the human reporters themselves, to combat a sole focus on quantitative content diversity, i.e. a merely numerical assessment of diversity, without taking quality into account.

Moreover, one must not underestimate the possibility of human journalists simply losing their jobs or seeing their jobs change to the role of an editor of algorithmic output. Carlson even highlights the predictions of certain technology analysts, who foresee that “recent developments in computing may mean that some white-collar jobs are more vulnerable to technological change than those of manual workers. Even highly skilled professions, such as law, may not be immune”.

Quality content remains crucial

Indeed, these are possible risks. Still, one should not overestimate these negative side effects and lapse into doom scenarios. People will remain interested in quality content. Reallocation of resources due to converging media value chains has had remarkably interesting consequences that often show this interest. Original content creation by streaming services such as Netflix and Amazon has had incredible success. Furthermore, the proliferation and popularity of user-generated (journalistic) content and citizen investigative journalism websites (e.g. Bellingcat) has shown that there is interesting new content emerging, albeit perhaps in a less traditional sense. We should therefore remain hopeful that Frankel’s attractive vision of reporters using technology to enhance the quality of their news stories will have a positive impact on media diversity and pluralism.

Note: This blog was not written by an algorithm! Instead, it is a modified and updated version of a text published on the LSE Media Policy Project blog. This article provides the views of the named author only and does not represent the position of the LSE Media Policy Project blog, nor the views of the London School of Economics.

Understanding the Competitive Landscape of News Providers on Social Media

@SNOW/WWW, 2016, by Devipsita Bhattacharya and Sudha Ram


New York Times article using a combination of web technologies to display content.

News consumption and distribution have undergone an unprecedented change with the rise of the Internet. News in this day and age has become synonymous with e-articles published on news websites. Use of web and electronic presentation technologies has enabled news providers to create content-rich webpages to deliver news in a detailed and engaging manner. News articles now contain a variety of content such as text, images, podcasts, videos and real-time user comments. For anyone with an Internet connection, news is now an on-demand commodity.

New York Times Official Twitter Account Page.

With features of social recommendation, content sharing and micro-blogging, social media websites (e.g. Twitter) also play a critical role in the electronic distribution of news articles. After reading online news articles, users submit their recommendations on Twitter in the form of tweets, which are then viewed by other users, leading to news article webpage visits. User activities such as these have indirectly enabled news providers to make their audience aware of the content published daily on their websites. News article sharing has also helped news providers to reach a much wider audience in a far more cost-effective manner than was previously possible. Moreover, in response to the popularity of social media, news agencies have created official user accounts on various social media websites and use these accounts to regularly post about selected news articles published on their respective websites.

Our previous work has developed methodologies for analyzing news article propagation on social media, e.g., Twitter. We have extracted and analyzed several Twitter-based propagation networks to examine characteristics of user participation in news article propagation. Using these networks, we have formulated a framework for performance measurement of news article propagation [1; 2; 4]. Our framework includes measures that can be grouped into two major categories, i.e., structural and temporal properties of propagation. Structural properties measure unique characteristics of the propagation network such as length of the longest cascade chain, average cascade length, user contribution, effective influence of the news provider, article conversion ratio and multiplier effect. Temporal properties include measures related to lifespan of articles, retweet response rate, and rate of spread of tweets. We have also extracted implicit networks of user-user relationships based on commonalities of article preference and tweeting activity and examined how users connect over time based on their article sharing activity [3]. Our work has enabled us to examine news propagation in a unique and multi-faceted way by harnessing the power of network science.

Our current study focuses on similarities and differences in news propagation patterns on social media based on the primary channel of a news provider. The Internet, apart from enabling e-articles, has also transformed the news landscape, with traditional news providers (printed, televised etc.) competing for news webpage article views. That is, news providers that previously competed with other providers primarily based on the channel of news distribution now find themselves competing with a whole new set of participants. For instance, before the Internet, newspapers such as the New York Times and the Washington Post were competing for subscribers and advertising revenues from their printed newspapers. Similarly, network news companies such as CNBC and CNN were competing with each other for audience engagement during the prime time news hour. However, with each of these news providers now having news websites, the competition is no longer limited to their rivals in their primary distribution channel. News providers now also contend to attract advertisers and readers for their article webpages. In our study we compared the patterns of news article propagation on Twitter based on the primary channel of news distribution. News providers on the Internet can be grouped into different categories based on their primary channel of distribution.

Primary Channels of News Distribution.

Generally, Porter’s Five Forces model is used for strategic analysis of organizations in an industry. However, in our study, we develop a network-based methodology for analyzing the competition among these news channel categories on social media. We collected a dataset of 24 million article tweets from Twitter for 32 news providers over a 3-week period, i.e., September 1 to September 22, 2015.

Using this dataset, we extracted news propagation patterns for each news provider and analyzed the similarities and differences between their networks. Our Twitter based propagation network is a user – user network defined for a single news provider. Each node in the network represents a Twitter user participating in article sharing activity of a given news provider. Each edge (directed from source to target) represents the aggregate retweeting relationship established between two users. It is a network of aggregated propagation activity (i.e. across multiple articles) observed over a period of time.
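As a rough illustration of the network definition above, the following sketch aggregates retweet events into a weighted, directed user-user network. The function name, the event format and the account names are assumptions made for this example; they are not taken from our actual pipeline, and the edge direction (from the retweeted user to the retweeting user) is one possible reading of "source to target".

```python
from collections import Counter

def build_propagation_network(retweet_events):
    """Aggregate (source_user, target_user) retweet events into weighted edges.

    Assumption: each event means `target_user` retweeted `source_user`,
    possibly for different articles of the same news provider.
    """
    nodes = set()
    edges = Counter()
    for source, target in retweet_events:
        nodes.update((source, target))
        edges[(source, target)] += 1  # edge weight = number of retweets observed
    return nodes, edges

# Hypothetical events collected across several articles of one provider.
events = [
    ("nytimes", "alice"),
    ("alice", "bob"),
    ("nytimes", "alice"),
    ("carol", "dave"),
]
nodes, edges = build_propagation_network(events)
print(len(nodes), "users,", len(edges), "aggregated edges")
```

Because the edges are aggregated, two retweets between the same pair of users collapse into a single edge whose weight records how often the relationship was observed.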

List of News Providers and their Primary News Channels.

We compared the networks using a number of structural properties of their propagation networks including the “density” of each network, proportion of disconnected users, average length of user cascade chains, number of retweeting relationships per user, the ability of users to form communities, and tweeting frequency of user(s).
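Two of the properties listed above can be sketched in a few lines. This is only an illustration under stated assumptions: density is taken to be the standard directed-graph density, and a user is treated as "disconnected" when they fall outside the largest weakly connected component. The exact definitions used in the study may differ, and the toy network is made up.

```python
from collections import defaultdict, deque

def density(num_nodes, num_edges):
    """Directed-graph density: edges present divided by edges possible."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))

def weak_components(nodes, edges):
    """Return weakly connected components as a list of sets (direction ignored)."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

# Hypothetical toy network: one main cascade, one small side community, one isolate.
nodes = {"provider", "a", "b", "c", "d", "e"}
edges = {("provider", "a"), ("a", "b"), ("c", "d")}

components = weak_components(nodes, edges)
largest = max(components, key=len)
disconnected_share = 1 - len(largest) / len(nodes)
side_communities = sum(1 for c in components if c is not largest and len(c) >= 2)
print(density(len(nodes), len(edges)), disconnected_share, side_communities)
```

The count of side communities with two or more nodes corresponds to the kind of "disconnected communities" discussed in the findings, again under the assumed definition.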

Important Findings

Visualizing the propagation patterns of different news provider’s user-user network. Nodes accordingly colored to outline different user communities.

We determined that when compared to networks of other channels, “online only” news providers have the smallest (number of nodes and edges) but the most dense networks. Interestingly, even with high density, their networks were found to have a higher concentration of disconnected nodes. This is expected since “online only” news providers have emerged only recently when compared to other news providers in our sample. For other news channels, we had mixed inferences when examining structural properties of their propagation networks. But, we were able to establish a statistically significant difference between the news channels based on their structural properties.

Our analysis of the news channels using a network based methodology makes several contributions.

  1. It allows news providers to benchmark their social media based propagation performance against other competitors in the same or in a different primary distribution channel. This is particularly important since on social media, even traditional suppliers of news (e.g. News agencies such as Reuters, Associated Press) are considered direct competitors for any news provider hosting an online news website.
  2. We identified features unique to our Twitter-based aggregate user-user networks. Important among these is the presence of multiple disconnected communities of nodes. On average, we found that a news provider’s propagation network had at least 4,000 disconnected communities containing two or more nodes. This highlights the importance of news article tweeting activities independent of those originating from news provider Twitter accounts.
  3. Network analysis adds a new dimension to competitive analysis, which generally considers participation volume (number of users) to measure engagement. For instance, we observed that “news agency” (Reuters, Associated Press etc.) propagation networks had lower average counts of nodes and edges when compared to those of “television” (ABC News, CNN etc.) news networks (by a margin of 100,000). By considering these differences in values, television news providers emerge as “winners” in audience participation on social media when compared to “news agency” networks. However, we also ascertained that television and news agency networks had approximately equal values of network diameter (19.5 and 19 respectively). While on average the networks of television-based news providers show higher tweeting and retweeting activity from their Twitter users, their audience’s ability to connect with each other to form the longest cascade chain over time is the same as that of “news agency” networks, which have a lower average Twitter user participation count.

Our research reveals that analysis of competition among news providers on social media needs a comprehensive consideration of various facets associated with user participation. It also shows that network science can provide important insights into the changing landscape of news on social media.


[1] Bhattacharya, D. and Ram, S., 2012. News article propagation on Twitter based on network measures – An exploratory analysis. In Proceedings of the 22nd Workshop on Information Technology and Systems.

[2] Bhattacharya, D. and Ram, S., 2012. Sharing News Articles Using 140 Characters: A Diffusion Analysis on Twitter. In International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 966-971.

[3] Bhattacharya, D. and Ram, S., 2013. Community Analysis of News Article Sharing on Twitter. In Proceedings of the 23rd Workshop on Information Technology and Systems.

[4] Bhattacharya, D. and Ram, S., 2015. RT @News: An Analysis of News Agency Ego Networks in a Microblogging Environment. ACM Trans. Manage. Inf. Syst. 6(3):1-25.

Veracity and Velocity of Social Media Content during Breaking News: Analysis of November 2015 Paris Shootings

@SNOW/WWW, 2016, by Stefanie Wiegand and Stuart E. Middleton


Social media are becoming increasingly important as a source for journalists. Their advantage is that content is available quickly during breaking news events. In a breaking news event, journalists get first-hand eyewitness reports, often including photos or videos. While there is lots of genuine information available, there is also plenty of satire, propaganda and copycat content. Journalists are torn between being the first to get the story out, at the risk of their reputation if the information proves false, and verifying that the content is genuine, at the risk of publishing too late. There have been suggestions to use the wisdom of the crowd, but in many cases social media acts as an echo chamber, spreading rumours that often turn out to be false. This is less of a problem for long-term news stories, because with time it becomes clearer what really happened, but in breaking news situations it can be tricky to quickly distinguish fact from fiction.

Our idea

Dashboards like TweetDeck or StoryFul help journalists organise the high volumes of content and discover newsworthy stories. Other tools like TinEye or Google image search can be used to verify images found on social media. Journalists use lists of trusted sources to check whether a content item is true or not. We agree that having a defined set of trusted (or indeed untrusted) sources is a good idea to filter the noise created by the echo chamber that is social media, but we think it can be partially automated.

Our trust model enables journalists to maintain a list of their sources, linking new content to authors. While tracking a news story on social media, content items are associated with authors and can be filtered using predefined lists. For each new content item, it becomes clear immediately whether it is in some way related to a source: if it’s been posted by that source, mentions that source or is attributed to it.

We also want journalists to discover new eyewitness content quickly. This means we cannot rely on trending content from news organisations alone since the content is not new anymore. Instead, we want to look at content being shared that contains images or video that is new (<5 minutes since publication) and starts to be shared by more people. Chances are it has not yet been verified and is potentially eyewitness content.

What we’ve done

We crawled various social media sites (Twitter, YouTube and Instagram) using our own crawling software, searching for content with specific hashtags (e.g. #Paris). We used natural language processing techniques to identify named entities (such as “BBC” or “Le Monde”) in English and French, as well as mentioned URLs. Then we imported the data into our trust model, which already contained a sample list of trusted and untrusted sources (e.g. @BBC was defined as a trusted source, @TheOnion as an untrusted source). This way, we can easily retrieve all content written by, mentioning or attributed to a specified source.
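To make the idea of "written by, mentioning or attributed to" a source more concrete, here is a minimal sketch of such a lookup. The `ContentItem` class, its field names and the helper functions are hypothetical and only illustrate the principle; they do not describe the data model of our actual system. The two source lists reuse the example accounts mentioned above.

```python
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    """Hypothetical, simplified view of one crawled post."""
    author: str
    mentions: set = field(default_factory=set)       # accounts or entities named in the text
    attributed_to: set = field(default_factory=set)  # sources the content is credited to

def relations_to(item, source):
    """Return the ways in which an item is related to a given source."""
    relations = []
    if item.author == source:
        relations.append("posted_by")
    if source in item.mentions:
        relations.append("mentions")
    if source in item.attributed_to:
        relations.append("attributed_to")
    return relations

def related_sources(item, sources):
    """Map each source that the item is related to onto its relation types."""
    found = {}
    for source in sources:
        relations = relations_to(item, source)
        if relations:
            found[source] = relations
    return found

TRUSTED = {"@BBC"}
UNTRUSTED = {"@TheOnion"}

# Made-up example post: an ordinary user crediting a trusted source.
item = ContentItem(author="@some_user", mentions={"@BBC"}, attributed_to={"@BBC"})
print(related_sources(item, TRUSTED))
print(related_sources(item, UNTRUSTED))
```

With a structure like this, filtering a stream by trusted or untrusted involvement becomes a simple lookup for each new content item.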

To show how using trusted sources can help a journalist, we picked five pictures posted during the night of the Paris attacks. Three of them are true and one false. We identified URLs for copies of the posted image that might have been shared instead of the original image URL. We then queried our database in 10-minute intervals during the first hour after each image was published to see how often it was shared (overall and by trusted/untrusted sources).
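The interval query described above could look roughly like the sketch below. It assumes that shares are available as (timestamp, author) pairs and that counts are cumulative since publication; both are assumptions for the sake of illustration, and the timestamps and the account "@some_user" are made up.

```python
from datetime import datetime, timedelta

def cumulative_shares(published, shares, trusted, untrusted,
                      step_minutes=10, horizon_minutes=60):
    """Count shares of one image up to each interval boundary after publication."""
    rows = []
    for minutes in range(step_minutes, horizon_minutes + 1, step_minutes):
        cutoff = published + timedelta(minutes=minutes)
        authors = [author for when, author in shares if published <= when <= cutoff]
        rows.append({
            "minutes": minutes,
            "all": len(authors),
            "trusted": sum(author in trusted for author in authors),
            "untrusted": sum(author in untrusted for author in authors),
        })
    return rows

# Hypothetical publication time and shares, for illustration only.
published = datetime(2015, 11, 13, 21, 30)
shares = [
    (published + timedelta(minutes=3), "@TheOnion"),
    (published + timedelta(minutes=12), "@some_user"),
    (published + timedelta(minutes=35), "@BBC"),
]
rows = cumulative_shares(published, shares, trusted={"@BBC"}, untrusted={"@TheOnion"})
for row in rows:
    print(row)
```

Each row gives one snapshot of the first hour, so the moment at which trusted sources first become involved is easy to read off.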

This is one of the genuine tweets we used. The author happened to be nearby when the events unfolded and posted several photos to Twitter during the course of the evening.

This tweet is a “fake” image, taken out of context from the Charlie Hebdo attacks in early 2015.

In our second experiment, we sorted URLs by the number of mentions. Every 5 minutes, we compared the currently top-ranking URLs that were being shared on social media and filtered the old ones out (i.e. the ones that had been shared previously). By doing this, we tried to detect new eyewitness content to investigate before it went viral.
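The filtering step of this second experiment can be sketched as follows. The function name, the `top_n` cut-off and the input format (one list of mentioned URLs per 5-minute window) are assumptions for this example rather than a description of our implementation.

```python
from collections import Counter

def new_top_urls(windows, top_n=3):
    """For each time window, rank URLs by mentions and drop URLs seen in earlier windows.

    `windows` is a list of lists: the URLs mentioned in each consecutive window.
    """
    seen = set()
    ranked_per_window = []
    for mentions in windows:
        counts = Counter(mentions)
        fresh = [(url, n) for url, n in counts.most_common() if url not in seen]
        ranked_per_window.append(fresh[:top_n])
        seen.update(counts)  # everything shared so far counts as "old" from now on
    return ranked_per_window

# Hypothetical stream of URL mentions in three consecutive 5-minute windows.
windows = [
    ["u1", "u1", "u2"],
    ["u1", "u3", "u3", "u2"],
    ["u4", "u3"],
]
for index, ranking in enumerate(new_top_urls(windows)):
    print(index, ranking)
```

In the second window, for example, only "u3" survives the filter, because "u1" and "u2" were already shared in the first window.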

What we found and what it means

When analysing eyewitness content, we found that

  • Untrusted sources share images earlier than trusted sources
    In case of an image which has not yet been verified, untrusted sources pick it up earlier.
  • Trusted sources are an indication for an image to be authentic
    Although verification needs to be carried out by a human, the fact that trusted sources are related to user-generated content makes it more likely to be genuine. Typically, this is the case about 30 minutes after a picture has been published.

For verification, our central hypothesis is that the “wisdom of the crowd” is usually no wisdom at all. We think it’s often better to base a decision on a few trusted sources than to risk falling victim to the “echo chamber”. Our results show that from about 30 minutes onwards, the involvement of trusted sources gives a good indication of the veracity of a piece of user-generated content. If a journalist is prepared to wait for 30 minutes (or perhaps discovers an image after that time), it can point them in the right direction for conventional means of verification, such as attempting to contact the source directly and doing some factual cross-checking.

For the discovery of newsworthy eyewitness content, we found that it helps to filter old content. We chose a time window of 5 minutes, but others are possible. Using this method, all 5 of our tested images showed up in the top 6% of all content crawled during this period. This means a journalist scanning a social media stream for newsworthy content would not have to check hundreds or thousands of URLs but could focus on the top URLs. Of course, this doesn’t mean all top URLs will contain genuine images, but they are more likely to be related eyewitness content. This approach can also be combined with other state-of-the-art filtering approaches, such as automated multimedia fake detection, to further improve the quality of the real-time content recommended to journalists.

Where to go from here

Our results – although preliminary – look promising. An estimate for the truth content in social media posts could help journalists to become faster and more efficient if presented in a graphical way. Apart from the trusted sources lists we’ve used, our method can easily be extended to use other information, such as weather or lighting conditions. This information is already available and could be obtained dynamically.

The essence of this work is that we try to assist journalists, not to replace them by automating the process – we don’t think this is possible anytime soon. By automating manual, labour-intensive parts of the verification process however, we are able to give them a tool that they can use to verify and publish faster and with more confidence. Hopefully, this helps them to better deal with the pressure of breaking news publishing.

If you want to know more about our work on this, you can read our publications about real-time crisis mapping of natural disasters or extracting attributed verification and debunking reports using social media, visit the REVEAL website, or follow us on Twitter at @RevealEU and @stuart_e_middle.

Verification of Social Media Content for News

@SNOW/WWW, 2014, by Christina Boididou, Symeon Papadopoulos, Nic Newman, Steve Schifferes, Yiannis Kompatsiaris

Over the last few years, social media has become a primary news source. On a range of stories such as disasters and political uprisings, social networks are routinely the place where news is broken first – through eyewitness reports via text, pictures and videos.

The Hudson River plane crash in 2009 was an early example of how a mobile phone picture can be distributed to a global audience within seconds. Since then, user-generated content posted to social networks like Twitter, Facebook, and YouTube has shaped coverage of a variety of news events including the death of Osama Bin Laden in 2011, the Japanese earthquake of 2011, and the popular uprisings in the Middle East. During Hurricane Sandy in 2012, users of Instagram, a social photo sharing site, posted 10 photographs a second of the devastation, with around half a million photos being posted in total.

But amid this deluge of information, we also find a large number of fakes. Analysis by the Guardian newspaper suggested up to 15% of the most shared items during Hurricane Sandy were misleading or deliberately falsified. Many of these were picked up by the mainstream media and given added credibility – at least for a while.

Figure 1: Fake Hurricane Sandy pictures sent in social media

Over the years there have been a number of similar well-documented cases.  Following the death of Osama Bin Laden in May 2011, ‘PhotoShopped’ pictures purporting to show his dead body were distributed on social media and picked up by newspapers, news websites, and TV stations, potentially inflaming passions in the region. In the December 2012 Connecticut school shootings, news agency and TV outlets misidentified the gunman to millions of readers and viewers, based on false information in social media. Following the Boston Marathon bombings in April 2013, social media tried to crowdsource the identification of the perpetrators with unsatisfactory results.



Figure 2: Twitter users make wrong assumptions about the identity of the Boston bombers (April 2013)

Reddit, Twitter and 4Chan contained information that wrongly identified innocent people – causing great distress to these individuals as well as friends and family. Posting fake pictures can also cause unnecessary alarm among the general public. Images that purported to show sharks in New York City were posted at the height of Hurricane Sandy. A giant creature washed up on a California beach was linked to effects of radiation from the Fukushima nuclear plant in 2011.

Such incidents have led to increasing concerns about the reliability of news in social networks. Journalists are looking to quickly identify relevant information and trusted sources, but how can they also quickly spot misinformation, faked pictures, and the setting of false trails? Without tackling these issues the use of social media as a source of news will be increasingly problematic.

So what can be done? In other domains researchers have been able to help mitigate the effects of false and potentially damaging information. The problem of email and web spam in the 1990s led to a series of new and increasingly effective approaches to combatting the problem. Link-based and content-based dependencies among Web pages were used (Castillo et al. [1]) to develop an automated way to predict and identify spam and irrelevant content. Other studies (Seo et al. [2]) noted that false claims in social media tend to come from a small number of unreliable sources – so research has focused on how to identify this small group of users.

But the news problem is proving particularly hard to crack. Previously reliable contributors in one domain may prove unreliable in another. An eyewitness to a news story may have a very limited footprint on which to base decisions on credibility – and yet the content may be uniquely valuable. Content analysis may be confounded by the amount and speed of information and the range of different formats now flowing through social media. In polarised situations like Syria and the Ukraine, protagonists are often looking to use social media to spread false rumours and have become sophisticated in tampering with pictures and video to promote their point of view. These issues are hard to resolve algorithmically with any degree of certainty.

One promising approach to distinguishing between real and fake news images came in research by Gupta et al. [3], where the authors attempted to capture the patterns of fake Twitter content by using classification models on tweet text and user features. Around a defined story dataset, this approach achieved success levels of up to 97%, but this depended on training the classifier against this particular story, introducing considerable bias and overfitting in the learning process.

In our research for SocialSensor, an EU research project, we attempted to replicate these results using pictures mined from social media around Hurricane Sandy and the Boston Bombings. Fake images were identified from journalistic blogs to provide a ‘ground truth’ against which we could test various algorithms. While we identified false pictures in around 80% of the cases, this depended on the training set used to create the classifier. Applying the algorithms trained in one dataset (Hurricane Sandy) to the second (Boston Marathon) produced precision that was not much higher than the random baseline.
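The comparison against a random baseline can be illustrated with a small sketch. It assumes that the baseline in question is the precision a random guesser would achieve, which equals the share of fake items in the test set. The labels below are made up for illustration and are not results from our experiments.

```python
def precision(predicted, actual, positive="fake"):
    """Share of items predicted as `positive` that really are `positive`."""
    hits = [truth for guess, truth in zip(predicted, actual) if guess == positive]
    if not hits:
        return 0.0
    return sum(truth == positive for truth in hits) / len(hits)

def random_baseline_precision(actual, positive="fake"):
    """Expected precision of random guessing: the prevalence of the positive class."""
    return sum(truth == positive for truth in actual) / len(actual)

# Hypothetical labels for a held-out event, and predictions from a classifier
# that was trained on a different event.
actual = ["fake", "real", "fake", "real", "real", "fake"]
predicted = ["fake", "fake", "real", "real", "fake", "fake"]

print("cross-event precision:", precision(predicted, actual))
print("random baseline:", random_baseline_precision(actual))
```

When the two numbers are close, as in this toy example, the classifier has learned little that transfers from one news event to another.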

Working closely with journalists, however, as part of the project, we have built up a deep understanding of how newsrooms verify content from social media and which factors they take into account under different circumstances. Without wishing to understate the complexity of these processes, we think a number of these factors are applicable to a wide range of news stories and could help us to come up with a more generalisable approach that could at least get closer to the answer.

One hypothesis is that it would be valuable to add the geographic location of the user as a key factor in determining the relevance of content in a breaking news situation. The time the tweet was posted and its proximity to the beginning of the story is another key element that a journalist would use to assess veracity and in most cases can be determined automatically. Finally, we believe the incorporation of features from appropriately selected terms (based on statistical analysis of an independent set of fake and real tweets) will also carry considerable predictive power.

Even without these features, we hope that our work already sets out an experimental framework for assessing the performance of computational verification approaches on social multimedia which will benefit the wider research community – and we have published this on GitHub. We welcome feedback and collaboration to further work on this exciting challenge.

[1] Castillo, C., Donato, D., Gionis, A., Murdock, V., & Silvestri, F. (2007, July). Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th ACM SIGIR conference (pp. 423-430). ACM
[2] Seo, E., Mohapatra, P., & Abdelzaher, T. (2012, May). Identifying rumors and their sources in social networks. In SPIE Defense, Security, and Sensing (pp. 83891I-83891I). International Society for Optics and Photonics.
[3] Gupta, A., Lamba, H., Kumaraguru, P., & Joshi, A. (2013, May). Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd international conference on World Wide Web companion (pp. 729-736)

Trends of News Diffusion in Social Media based on Crowd Phenomena

@SNOW/WWW, 2014, by Minkyoung Kim, David Newth, Peter Christen

Complex Information Pathways in Social Media

More and more web articles today share information by simply including hyperlinks to different types of social media such as mainstream news (News), social networking sites (SNS), and blogs (Blog), as shown in Figure (a). For instance, within a few mouse clicks, an experienced blogger can merge news, relevant ideas from social networking sites, and supporting blog posts from unlimited digital content. Such behaviour from multiple social media platforms collectively forms complex information pathways on the Web, as shown in Figure (b).


With the help of web technologies such as RSS news feeds, social media aggregators, and miscellaneous mobile applications, a wide range of information from different sources is now more accessible than ever before. However, previous studies on diffusion have focused on a single social platform alone (such as Twitter or Facebook) rather than on combined social media of different types. When considering these real circumstances, it is meaningful to obtain macro-level diffusion trends from emergent information pathways. In this regard, there are several challenges. First, underlying real diffusion structures are not only hard to define, but they are also dynamically changing. Second, high accessibility to diverse information sources increases the heterogeneity of social networks in diffusion. Finally, the diversity of information leads to significant variations in diffusion patterns.

A Few Considerations for the study of Cross-Population Diffusion

For studying the cross-population diffusion phenomena, there are several points to consider. First, the meta-population scheme is important since the way of classifying heterogeneous social networks gives a different interpretation of the dynamics of diffusion across populations. Second, identifying trending topics across target populations helps trace beyond the bounds of site-specific (local) diffusion. Finally, categorizing the topics of information enables us to obtain common or distinct diffusion patterns between different categories.

In this study, we focus on real-world news diffusion in social media. Accordingly, as the meta-population scheme, we considered three different types of social media: News, SNS, and Blog, which constitute over 98% of our original Spinn3r dataset [1]. Also, for the identification of real-world news, we use the Wikipedia Current Events [2] as a note-worthy event registry, which helps us to identify trending topics across diverse social media platforms. As the figure below shows, each bullet point is referred to as news, which describes a short summary of an event for that day, along with reference hyperlinks (purple rectangles), and each bold font text is referred to as the corresponding news category.


Macro-level Information Pathways based on Crowd Phenomena

We analyse crowd phenomena in news diffusion across different types of social media such as News, SNS, and Blog in terms of activity, reactivity, and heterogeneity. We found that News is the most active, SNS is the most reactive, and Blog is the most persistent, which governs time-evolving heterogeneity of populations in news diffusion. Finally, we interpret the collective behaviours of News, SNS, and Blog from various angles using our previous model-free [3] and model-driven [4] approaches, showing that the strength and directionality of influence reflect the discovered crowd phenomena in news diffusion. These attempts enable us to understand diffusion trends of news in social media from diverse aspects in a more consistent and systematic way.

Summarizing our findings, the main trends of news diffusion in social media are:

  • SNS and Blog users are less active but more reactive for real-world news than for other arbitrary topics.
  • Regarding activity, the Pareto principle is not applied uniformly across different online social systems.
  • Active news media are tightly connected, enhancing the opportunity to be exposed to other social systems.
  • One week is a meaningful period for tracking news cascades regardless of system types and news topics.
  • The most active news category in each system corresponds to the most reactive news category.
  • Larger diffusion exhibits higher heterogeneity.
  • News is a diligent creator and diligent adopter, SNS is a lazy creator and diligent adopter, and Blog is a diligent creator and lazy adopter.
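As a small illustration of the Pareto-related finding in the list above, the sketch below computes the share of all posts contributed by the most active 20% of sources. Measuring activity as a post count per source is an assumption made for this example, and the counts are invented; the paper defines activity more carefully.

```python
def top_share(post_counts, top_fraction=0.2):
    """Share of all posts produced by the most active `top_fraction` of sources."""
    ranked = sorted(post_counts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Hypothetical post counts per source in two different systems.
skewed_system = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
uniform_system = [10] * 10

print(top_share(skewed_system))   # activity concentrated in a few sources
print(top_share(uniform_system))  # activity spread evenly
```

A value near 0.8 would match the classic 80/20 rule, while very different values across systems would indicate that the principle does not apply uniformly.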

For more details, please refer to our paper entitled: “Trends of News Diffusion in Social Media based on Crowd Phenomena” by Minkyoung Kim, David Newth and Peter Christen, Second Workshop on Social News on the Web @ WWW ’14 (SNOW 2014), Seoul, Korea, April 2014.

[1] ICWSM’11 Dataset.
[2] Wikipedia Current Events in January, 2011.
[3] M. Kim, D. Newth, and P. Christen. Macro-level information transfer across social networks. In WWW Companion, Seoul, Korea, 2014.
[4] M. Kim, D. Newth, and P. Christen. Modeling dynamics of diffusion across heterogeneous social networks. Entropy, 15(10):4215-4242, 2013. doi:10.3390/e15104215