The proceedings of the SNOW Data Challenge are now available at CEUR. They contain an overview paper with the final evaluation scores and all the papers describing the participants’ approaches. In addition, we have made publicly available several of the evaluation resources here.
Consider a scenario in which news professionals use social media to monitor the newsworthy stories that emerge from the crowd. The volume of information is very high, and it is often difficult to extract such stories from a live social media stream. The task of this challenge is to automatically mine social streams to provide journalists with a set of headlines and complementary information that summarize the most important topics for a number of timeslots (time intervals) of interest. In the context of the SocialSensor project, we found this to be an important and challenging problem, and for this reason the project is organizing this challenge to explore novel and effective solutions.
The width of timeslots will vary (from minutes to hours) depending on the topic or type of event. Although newsworthiness can be highly subjective, in the context of this challenge we employ an operational definition: the newsworthiness of topics for a given timeslot is assessed after sufficient time has elapsed, on the basis of their coverage by selected news sites.
Data and topic extraction
We will provide the participants with a common framework to mine the Twitter stream and we will ask them to automatically extract topics corresponding to known events (e.g., politics, sports, entertainment) that will be announced. The crawled data will be divided into timeslots and participants will be asked to produce a fixed number of topics for selected timeslots. To simulate a real-time topic detection setting, only tweets up to the end of the timeslot can be used to extract the topic. Each topic should be in the form of a short headline that summarizes a topic related to a piece of news occurring during that timeslot, accompanied by a set of tweets, URLs of pictures (extracted from the tweets), and a set of keywords. The expected output format will be the following: [headline \t keywords \t tweetIds \t picture_urls]
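As an illustration of the output format above, the following minimal Python sketch builds one topic line and keeps only the tweets posted up to the end of a timeslot. The function names, the `created_at` field, and the comma used to separate items inside a field are assumptions made for this example; the text above only fixes the four tab-separated fields.

```python
from datetime import datetime


def tweets_up_to(tweets, slot_end):
    """Keep only tweets posted no later than the end of the timeslot.

    `tweets` is assumed to be a list of dicts with a `created_at` datetime
    (a hypothetical structure, not the one used by the challenge tools).
    """
    return [t for t in tweets if t["created_at"] <= slot_end]


def format_topic(headline, keywords, tweet_ids, picture_urls, sep=","):
    """Return one topic as a line: headline \\t keywords \\t tweetIds \\t picture_urls.

    The in-field separator `sep` is an assumption; the call does not specify it.
    """
    fields = [
        headline,
        sep.join(keywords),
        sep.join(str(t) for t in tweet_ids),
        sep.join(picture_urls),
    ]
    # Tabs or newlines inside a field would break the layout, so collapse
    # any run of whitespace into a single space.
    return "\t".join(" ".join(f.split()) for f in fields)


if __name__ == "__main__":
    slot_end = datetime(2014, 3, 1, 12, 0)
    tweets = [
        {"id": 1, "created_at": datetime(2014, 3, 1, 11, 59)},
        {"id": 2, "created_at": datetime(2014, 3, 1, 12, 1)},
    ]
    usable = tweets_up_to(tweets, slot_end)
    print(format_topic("Example headline", ["example", "topic"],
                       [t["id"] for t in usable], []))
```

A topic with no pictures simply leaves the last field empty, so every line still contains exactly three tab characters.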
Topics will be evaluated across a mixture of quantitative and qualitative dimensions. A panel of news professionals selected by the task organizers will be in charge of the evaluation phase.
- Precision and recall. The evaluation panel will compile a ground truth of newsworthy topics for each timeslot in the dataset. Automatically extracted topics will be manually matched against the ground truth, and precision and recall will be calculated.
- Readability. The topics should be provided in the form of a textual headline. The evaluation panel will assign a readability score to the headlines of all the topics matching the ground truth.
- Coherence/relevance. The tweets and the pictures associated with a single topic should be related to each other and to the topic headline.
- Diversity. The tweets associated with a single topic should be sufficiently different from each other, i.e. near-duplicates and retweets should be avoided.
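To make the precision and recall criterion concrete, here is a minimal Python sketch of one common way to compute both figures once the manual matching is done. The input representation (one entry per extracted topic, holding the id of the matched ground-truth topic or `None`) and the exact definitions are assumptions for illustration; the panel's own procedure is not specified above.

```python
def precision_recall(matches, num_ground_truth):
    """Compute (precision, recall) for one timeslot.

    matches: one entry per extracted topic, either the id of the ground-truth
             topic it was matched to, or None if it matched nothing.
    num_ground_truth: number of ground-truth topics for the timeslot.
    """
    if not matches or num_ground_truth == 0:
        return 0.0, 0.0
    matched = [m for m in matches if m is not None]
    # Precision: share of extracted topics that match some ground-truth topic.
    precision = len(matched) / len(matches)
    # Recall: share of ground-truth topics covered by at least one extracted
    # topic (several extracted topics may match the same ground-truth topic).
    recall = len(set(matched)) / num_ground_truth
    return precision, recall


if __name__ == "__main__":
    print(precision_recall(["a", "a", None, "b"], 4))
```

Counting distinct ground-truth topics for recall means that repeating the same story several times does not inflate the score.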
The submitted approaches will be ranked for each evaluation parameter above and a final ranking will be obtained by combining all the partial rankings.
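The combination rule is not detailed here. As a sketch of one plausible scheme, the Python example below ranks teams on each criterion and orders them by mean rank. The averaging rule, the tie handling, and the team and criterion names are all assumptions for illustration, not the organizers' actual procedure.

```python
def ranks_from_scores(scores):
    """Map team -> rank (1 = best) for one criterion; higher score is better.

    Tied teams share the best rank available to them.
    """
    ordered = sorted(scores.values(), reverse=True)
    return {team: ordered.index(score) + 1 for team, score in scores.items()}


def combine_rankings(per_criterion_scores):
    """Order teams by their mean rank across all criteria.

    per_criterion_scores: dict criterion -> dict team -> score.
    Every team is assumed to have a score for every criterion.
    """
    per_criterion_ranks = [ranks_from_scores(s) for s in per_criterion_scores.values()]
    teams = per_criterion_ranks[0].keys()
    mean_rank = {
        team: sum(r[team] for r in per_criterion_ranks) / len(per_criterion_ranks)
        for team in teams
    }
    # Sort by mean rank, then by team name to make the order deterministic.
    return sorted(mean_rank.items(), key=lambda kv: (kv[1], kv[0]))


if __name__ == "__main__":
    scores = {
        "precision": {"A": 0.5, "B": 0.3, "C": 0.1},
        "readability": {"A": 4.0, "B": 4.5, "C": 3.0},
    }
    print(combine_rankings(scores))
```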
First prize: $1000 + iPad Air
Second prize: MacBook Air
Third prize: iPad Air
Task signup deadline: Jan 20, 2014
Release of development set: Jan 21, 2014
Release of test set: March 1, 2014
Submission of extracted topics:
March 3, 2014 (23:59 Hawaii Standard Time) March 4, 2014 (18:00 GMT)
Submission of papers:
March 7, 2014 (23:59 Hawaii Standard Time) March 9, 2014 (23:59 Hawaii Standard Time)
Phase 1: Teams interested in participating should notify the organizers by sending a declaration of interest by email to firstname.lastname@example.org by Jan 20, 2014. For team participation, please include the names, emails and affiliations of all team members. Details on the task, including the data crawling tool, timeslots, and number of topics required in output, will be provided to all the teams who sign up by the deadline.
Phase 2: Only subscribed teams will be provided with a development kit containing a set of tweet ids related to a major 2012 event and the corresponding ground truth topics. A tool to crawl the tweet content from the ids will be provided as well.
Phase 3: All subscribed teams will be provided with a new set of tweet ids from which they will be required to extract the topics that will be evaluated for the challenge.
Phase 4: The final submission will include a file (per team) with the detected topics and a short paper (see instructions below) describing the method used and the results.
The accompanying papers must:
- be written in English;
- contain author names, affiliations, and email addresses;
- be formatted according to the ACM SIG Proceedings template with a font size no smaller than 9pt;
- be in PDF (make sure that the PDF can be viewed on any platform), and formatted for US Letter size;
- occupy no fewer than five and no more than six pages, including the abstract and references. Appendices are not counted against the page limit.
It is the authors’ responsibility to ensure that their submissions adhere strictly to the required format. We will provide registered participants with a tweaked ACM Proceedings template with the appropriate copyright statement.
Submissions will be evaluated through a peer-review process and accepted based on their technical quality, but independently of the topic detection performance achieved. All accepted submissions will be invited for short presentations during the workshop and will be published independently from the SNOW 2014 proceedings on this page and on CEUR (note that a minimum number of papers should be submitted in order to be able to publish them on CEUR).
Recommended reading or Acknowledgement
- Aiello et al., “Sensing trending topics in Twitter”, IEEE Transactions on Multimedia (Volume 15, Issue 6), Oct 2013.
- Schifferes et al., “Identifying and verifying news through social media: Developing a user-centered tool for professional journalists”, Digital Journalism (doi: 10.1080/21670811.2014.892747), 2014.
- Symeon Papadopoulos – CERTH – ITI, Greece
- David Corney – Robert Gordon University, UK
- Luca Maria Aiello – Yahoo Labs, Spain