Providing high-quality news recommendations is a challenging task because the set of potentially relevant news items changes continuously, the relevance of news depends strongly on the context, and recommendations must be computed under tight time constraints. Running in its third year, NewsREEL 2016 was organised as a campaign-style evaluation lab as part of the Conference and Labs of the Evaluation Forum (CLEF). NewsREEL featured two tasks related to the evaluation of news recommender systems. Task 1 addressed online evaluation protocols. Task 2 asked participants to conduct experiments on previously recorded interactions. Participants engaging in both tasks observed functional as well as non-functional characteristics of their algorithms, could determine how accurately their methods predict users' preferences, and assessed how well their systems handle realistic conditions. Results were presented at the CLEF conference in Évora, Portugal. An overview is given in the presentation below. Proceedings are available online at CEUR-WS.

Task description

  • Task 1: Benchmarking News Recommendation in a Living Lab. Participants gained access to an operating news recommendation service via the Open Recommendation Platform (ORP). Having deployed a recommender system, participants received recommendation requests from various publishers. These news portals subsequently displayed the recommended news articles on their websites. ORP kept track of users' reactions in terms of clicks. In addition, participants received a stream of data informing them about newly added or altered news articles and about interactions between visitors and publishers. We challenged participants to achieve the highest click-through rate (CTR), i.e. the ratio of clicks on recommendations to the number of requests. Participants had to keep functional aspects such as response time in mind, since failed requests reduce the CTR.
  • Task 2: Benchmarking News Recommendation in a Simulated Environment. Participants received a comprehensive data set derived from log files recorded by ORP. The logs span July to August 2014 and include several large-scale news providers. Participants could replay the logs by means of Idomaar, thereby simulating an environment close to the actual use case. Additionally, they were able to compare different recommendation algorithms on identical data, which improved reproducibility and comparability. We measured how well their methods predicted which news articles a visitor would read in the future, and analysed how many resources their methods consumed. This yielded insights into time and space complexity and allowed us to estimate how well methods could deal with response-time restrictions and load peaks.
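For concreteness, the CTR used to rank Task 1 participants can be sketched in a few lines of Python. The function name and the numbers below are illustrative assumptions, not figures taken from the lab:

```python
def click_through_rate(clicks: int, requests: int) -> float:
    """CTR = clicks / requests.

    Failed or unanswered requests still count toward the denominator,
    which is why slow or unreliable recommenders see a lower CTR.
    """
    if requests == 0:
        return 0.0
    return clicks / requests


# Hypothetical numbers for illustration: 120 clicks out of 10,000
# requests, some of which timed out and were never answered.
ctr = click_through_rate(clicks=120, requests=10_000)
print(f"{ctr:.2%}")  # prints 1.20%
```

Because timeouts are counted as requests without clicks, optimising response time directly improves the metric alongside recommendation quality.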


In the 2016 edition of NewsREEL, 48 participants registered. Both tasks attracted a similar number of participants, with Task 1 slightly ahead at 46 registrations compared to 44 for Task 2. Participants deployed 73 recommendation services in Task 1. While participants who signed up for Task 2 gained access to the plista dataset to perform their experiments, participants of Task 1 gained access to ORP, where they had to register their service.

ORP sent content updates and recommendation requests during the entire CLEF cycle (30 October 2015 – 25 May 2016). This facilitated exploring a variety of algorithms, optimising parameters, and testing hardware settings. In order to give participants an indication of their performance, three evaluation test periods were defined, after each of which they received detailed feedback:

  • 06-12 February 2016
  • 20-26 February 2016
  • 05-11 March 2016

The winner was determined in a four-week period scheduled in April and May 2016.


We received a total of seven working notes describing participants' approaches. We noticed that the level of engagement varied among participants. Some settled for individual strategies while others entered up to 14 different methods. Some competed for 49 days; others had their systems turned off for some days. The average click-through rate varied from 0.42% to 1.23%. Some participants managed to achieve a response rate close to 100%. Results are summarised in the presentation below. Further details can be found in the lab overview paper published in the CEUR proceedings.


Organisers

  • Frank Hopfgartner, University of Glasgow (Lab Organiser)
  • Torben Brodt, plista GmbH (Lab Organiser)
  • Benjamin Kille, TU Berlin (Task 1)
  • Jonas Seiler, plista GmbH (Task 1)
  • Tobias Heintz, plista GmbH (Task 1)
  • Andreas Lommatzsch, TU Berlin (Task 2)
  • Roberto Turrin, Contentwise (Task 2)
  • Martha Larson, TU Delft (Task 2)

Steering Committee

  • Paolo Cremonesi, Politecnico di Milano
  • Arjen de Vries, Radboud University, The Netherlands
  • Michael Ekstrand, Texas State University, USA
  • Hideo Joho, University of Tsukuba
  • Joemon M. Jose, University of Glasgow
  • Noriko Kando, National Institute of Informatics, Japan
  • Udo Kruschwitz, University of Essex
  • Jimmy Lin, University of Maryland
  • Vivien Petras, Humboldt University
  • Till Plumbaum, TU Berlin, Germany
  • Domonkos Tikk, Gravity R&D and Óbuda University