Predicting News Popularity by Mining Online Discussions

@SNOW/WWW, 2016, by Georgios Rizos, Symeon Papadopoulos, Yiannis Kompatsiaris

Identifying top online news stories very soon after they are posted, or even before, is an important problem, and a solution is invaluable to online social media, the news press and aggregators. Recently, the New York Times launched a tool called Blossom for internal use that makes data-driven recommendations to journalists about which published stories will go viral when featured on Facebook. The tool applies machine learning to the firm’s big data stores, which shows that such applications and techniques are at the forefront of data-informed journalism.

But what can be done to ensure that the best content bubbles up to the place of most exposure in a social medium? It really depends on what users look for in their information consumption, and it may be more intricate than a simple score count.

Online users may wish to follow the zeitgeist by consuming viral content, or they may be professionals in search of more thought-provoking material, such as a discussion-raising or controversial story. Given that attention on the web is limited, proper exposure mechanisms such as a smart news feed or ranking process are important services that social media provide to cater to the information needs of their users and to increase online traffic and monetization on their websites.

Early prediction of online popularity for multimedia or news stories can be used for improved exposure mechanisms and is crucial to online social media stakeholders, business intelligence specialists and journalists alike.

Regarding news story popularity prediction specifically, the study in [1] describes a mathematical model that predicts the size of the discussion at a future point by capturing patterns in the rate at which comments are added. Another study [2] addressed the same task by looking at characteristics such as the time (day, month) a post is made or the number of identified entities present in the post. However, such methods do not take into account the complex structure of the social interactions among the participating users. This has been attempted, with promising results, in studies concerned with hashtag popularity prediction [3] and prediction of the number of shares on Facebook [4], but this success has not yet been transferred to online discussions. Indeed, only simple characteristics of the structure of online discussions have been used, in a study [5] tangentially related to news story popularity prediction.

What about a discussion that has attracted many comments, all of them expressing simple agreement? What if a small number of users have posted multiple times or hijacked the thread for an unrelated discussion?

To address this need, our work on the EU FP7 research project REVEAL led to the development of a machine learning framework for predicting future popularity values of an online news story by analyzing the structural complexity of the online discussion that is generated about it. Our hypothesis is that by being able to capture different configurations of the early structure of the discussion, we can reveal future discussion size, number of contributing users, vote score and perhaps most importantly, controversiality.

Our framework aims to capture information from two structural sources present in online discussions: the comment tree and the network of users who contribute to the discussion. We extract a number of features that characterize each of these networks. A machine learning algorithm then uses these features to predict the future popularity of the post. An overview of our framework is shown in the following figure. But what is the intuition that motivates the utilization of these two kinds of networks?


A news post that generates a comment tree of depth one, even with many comments agreeing with or praising it, might indicate light-hearted or shallow-interest content. In contrast, a story that generates multiple long reply chains might indicate more controversial or discussion-provoking source material.

Similarly, a user network in which the majority of replies are exchanged among a small number of users may imply thread hijacking or a conversation among power users. In contrast, a discussion in which the full set of participating users deliberates may be of more general interest.
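To make the two structures above concrete, here is a minimal sketch of how a comment tree and a user reply network could be built from a flat list of comments, together with a few simple descriptors of each. The input layout (`comment_id`, `parent_id`, `author`), the function names and the three or four features computed are assumptions chosen for illustration; they are not the feature set used in our experiments.

```python
from collections import defaultdict


def build_graphs(comments, post_author):
    """Build a comment tree and a user reply network (illustrative layout).

    comments: list of (comment_id, parent_id, author) tuples, where a
    parent_id of None means the comment replies directly to the post.
    """
    children = defaultdict(list)        # parent comment id -> child comment ids
    author_of = {None: post_author}     # the post itself is keyed by None
    for comment_id, parent_id, author in comments:
        children[parent_id].append(comment_id)
        author_of[comment_id] = author

    # Directed, weighted edges: (replying user, replied-to user) -> reply count.
    user_edges = defaultdict(int)
    for comment_id, parent_id, author in comments:
        target = author_of[parent_id]
        if author != target:            # ignore users replying to themselves
            user_edges[(author, target)] += 1
    return children, user_edges


def tree_features(children):
    """Simple descriptors of the comment tree, found by an iterative traversal."""
    comment_count, max_depth = 0, 0
    stack = [(None, 0)]                 # start from the post at depth 0
    while stack:
        node, depth = stack.pop()
        for child in children.get(node, []):
            comment_count += 1
            max_depth = max(max_depth, depth + 1)
            stack.append((child, depth + 1))
    top_level = len(children.get(None, []))
    return {
        "comment_count": comment_count,
        "max_depth": max_depth,
        "top_level_fraction": top_level / comment_count if comment_count else 0.0,
    }


def user_features(user_edges):
    """Simple descriptors of the user reply network."""
    users = {user for edge in user_edges for user in edge}
    total_replies = sum(user_edges.values())
    heaviest = max(user_edges.values(), default=0)
    return {
        "user_count": len(users),       # users appearing in at least one edge
        "edge_count": len(user_edges),
        "max_pair_share": heaviest / total_replies if total_replies else 0.0,
    }


# Toy discussion: one reply chain of length three and one lone top-level comment.
example_comments = [
    ("c1", None, "bob"),
    ("c2", "c1", "carol"),
    ("c3", "c2", "bob"),
    ("c4", None, "dave"),
]
example_tree, example_edges = build_graphs(example_comments, post_author="alice")
print(tree_features(example_tree))
print(user_features(example_edges))
```

In this toy discussion, a `top_level_fraction` close to 1 would correspond to the shallow, depth-one tree described above, while a large `max_pair_share` would point to a conversation dominated by one pair of users.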

To evaluate the performance of our methodology, we collected a Reddit dataset by focusing on posts made in 2014 on several news-related subreddits. Our approach proved superior to simpler past methods, and the greatest improvement was achieved in score and controversiality prediction.

To give a sense of the performance of our method, we calculated the percentage overlap (Jaccard coefficient) between the top-100 controversial stories predicted by our method and the true top-100. This is shown in the following table, along with the results from two other approaches.
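The top-100 comparison described above can be sketched as follows. This is a minimal illustration that assumes predicted and true controversiality values are available as dictionaries keyed by post id; the function name and the data layout are hypothetical.

```python
def top_k_jaccard(predicted_scores, true_scores, k=100):
    """Jaccard coefficient between the predicted and the true top-k posts.

    Both arguments map a post id to a numeric value (hypothetical layout).
    """
    top_predicted = set(
        sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:k]
    )
    top_true = set(sorted(true_scores, key=true_scores.get, reverse=True)[:k])
    union = top_predicted | top_true
    return len(top_predicted & top_true) / len(union) if union else 0.0


# Toy example with k=2: the two top-2 sets share one post out of three distinct ones.
predicted = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.3}
actual = {"a": 0.7, "b": 0.2, "c": 0.1, "d": 0.9}
print(top_k_jaccard(predicted, actual, k=2))
```

Note that for two sets of equal size k that share m items, the Jaccard coefficient is m / (2k - m), so it is lower than the plain fraction m / k of correctly identified stories unless the two sets coincide.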

results table

Our method is denoted by all_graph. The method denoted by temporal is based on a subset of the features used to capture growth rate in [4], and the combination of the two methods is denoted by all. The percentages shown refer to how much of the post's lifetime has elapsed since it was uploaded. To make the comparisons at an early stage of the discussions, we show the results for 1-14% of the mean time elapsed between posting and the accumulation of 99% of the comments.
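One way to read the lifetime percentages above is sketched below: take, for each post, the time until 99% of its comments have arrived, average that over all posts, and keep only the comments posted within a given percentage of that mean. The function names, the use of plain numeric timestamps and the exact rounding of the 99% point are assumptions for illustration, not a description of our experimental code.

```python
import math


def time_to_fraction(post_time, comment_times, fraction=0.99):
    """Time from posting until `fraction` of the comments have arrived."""
    ordered = sorted(comment_times)
    # Index of the comment that completes the requested fraction (assumed rounding).
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index] - post_time


def mean_lifetime(posts, fraction=0.99):
    """Mean of time_to_fraction over posts given as (post_time, comment_times)."""
    spans = [time_to_fraction(t, c, fraction) for t, c in posts if c]
    if not spans:
        raise ValueError("no post with comments")
    return sum(spans) / len(spans)


def comments_before_cutoff(post_time, comment_times, lifetime, percent):
    """Comments posted within `percent` of the mean lifetime after posting."""
    cutoff = post_time + lifetime * percent / 100.0
    return [t for t in comment_times if t <= cutoff]


# Toy data: timestamps in seconds relative to each post's creation time.
example_posts = [(0, [10, 20, 30, 100]), (0, [50, 300])]
lifetime = mean_lifetime(example_posts)
for percent in (5, 14):
    early = comments_before_cutoff(0, [10, 20, 30, 100], lifetime, percent)
    print(percent, early)
```

Only the comments returned by `comments_before_cutoff` would then be used to build the early comment tree and user network from which a prediction is made.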

As an example, at 5% of the post lifetime, the most controversial post that our method identified was titled “Gun deaths for U.S. officers rose by 56 percent in 2014: report.” Many users contribute to the discussion by linking to more specific information, some disagree by claiming that the title is worded to be sensational, and others discuss how civilian gun deaths are a related and under-reported statistic.

We have thus shown that a more in-depth representation of the structure of the social interactions around news posts is a successful means of predicting popularity and identifying top material. This network-based approach can also complement other methods, such as text-based ones or methods that examine the poster’s influence on the medium. We will continue our efforts to extend our method with improved technologies and additional sources of information.

The code for our method and the full series of experiments we performed can be found on GitHub.


[1] A. Tatar, P. Antoniadis, M. D. De Amorim, and S. Fdida. From popularity prediction to ranking online news. Social Network Analysis and Mining, 4(1):1–12, 2014.

[2] M. Tsagkias, W. Weerkamp, and M. De Rijke. Predicting the volume of comments on online news stories. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1765–1768. ACM, 2009.

[3] L. Weng, F. Menczer, and Y.-Y. Ahn. Predicting successful memes using network and community structure. arXiv preprint arXiv:1403.6199, 2014.

[4] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades be predicted? In Proceedings of the 23rd international conference on World Wide Web, pages 925–936, 2014.

[5] J. Lee, M. Yang, and H. Rim. Discovering high-quality threaded discussions in online forums. Journal of Computer Science and Technology, 29(3):519–531, 2014.