Overview
Commercial providers of information access systems (such as Amazon or Google) usually evaluate the performance of their algorithms by observing how large numbers of their customers interact with different instances of their services. Unfortunately, lacking access to systems of this scale, university-based research struggles to keep up with this evaluation methodology. NewsREEL, short for News Recommendation Evaluation Lab, aims to bridge this “evaluation gap” between academia and industry.
NewsREEL is organised as a campaign-style evaluation lab of CLEF 2017 and addresses the following information access task:
Whenever a visitor of an online news portal reads a news article on the site, the task is to recommend other news articles that the user might be interested in.
NewsREEL offers two tasks to study this use case. The first task, NewsREEL Live, implements the idea of a “living lab”: the provider of a recommendation service grants access to its infrastructure and user base. The second task, NewsREEL Replay, simulates the live setting by replaying a recorded stream of interactions.
Because the service is provided to millions of users, the recommendation scenario requires solutions to significant research challenges, such as processing information in real time, handling vast amounts of data, and providing suitable recommendations. By providing access to the infrastructure of a company, we offer professionals and students the opportunity to develop skills that are in high demand in industry, while at the same time allowing them to familiarise themselves with the academic practice of evaluating information access systems.
Task Description
We continue to explore news recommendation under real-time constraints as we head toward NewsREEL 2017. As in 2015 and 2016, the new edition features two tasks. Task 1, NewsREEL Live, involves a living lab setting with ORP as the backend. Task 2, NewsREEL Replay, offers better comparability: participants simulate recommendation requests locally, so all algorithms are subject to identical conditions.
Task 1: NewsREEL Live.
NewsREEL Live requires researchers to generate news recommendations for millions of readers of online news articles. The quality of these recommendations is evaluated based on users’ click-through rate (CTR). Whenever a visitor of one of the selected news publishers requests an article, the request is forwarded to the recommendation service provider plista, which distributes it to the participants of NewsREEL Live.
The main component of this task is the Open Recommendation Platform (ORP), which grants participants access to a variety of news portals. Participants deploy a recommendation service, which receives recommendation requests. ORP monitors users’ reactions and, in this way, collects interaction data that is used to compare recommendation algorithms. Participants lacking the resources to operate a recommendation service may opt to deploy their algorithms on virtual machines provided by plista.
ORP spreads requests randomly across participants to ensure fair comparisons. Participants will experience conditions similar to those of production recommender systems. Participating systems must respond within 100ms. During the challenge, systems will need to handle peak loads of up to 100 messages per second. ORP records requests, clicks, and errors.
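The sketch below illustrates how a participating service might answer recommendation requests from precomputed suggestions so that the 100ms budget is never at risk. It is a minimal sketch only: the JSON fields, the endpoint, and the recency-based recommendation strategy are our own assumptions, not the actual ORP message format documented by plista.

```python
# Minimal sketch of a NewsREEL Live recommendation endpoint.
# The request/response JSON fields used here are illustrative assumptions,
# not the actual ORP message format, which is documented by plista.
import json
from collections import deque
from http.server import BaseHTTPRequestHandler, HTTPServer

RECENT_ITEMS = deque(maxlen=100)  # most recently seen article ids


class RecommendationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        message = json.loads(body or "{}")

        # Remember the currently viewed article (hypothetical field name).
        item_id = message.get("item_id")
        if item_id is not None and item_id not in RECENT_ITEMS:
            RECENT_ITEMS.appendleft(item_id)

        # Answer immediately with precomputed suggestions so that the
        # 100ms response-time limit is never at risk.
        suggestions = [i for i in RECENT_ITEMS if i != item_id][:6]
        payload = json.dumps({"recs": suggestions}).encode("utf-8")

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RecommendationHandler).serve_forever()
```

Keeping the request handler free of heavy computation (model updates can run in a background process) is one way to stay within the response-time limit under peak load.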
The official NewsREEL evaluation metric for Task 1 is Click-Through Rate (CTR). CTR is the number of recommendations produced by a participating system that are clicked by users, normalised by the total number of requests for recommendations that were sent to that system. Note that in order to receive a click, a well-formed recommendation needs to be produced in response to a request within the given time limit (100ms). The restriction to well-formed recommendations eliminates recommendations that are empty or have invalid item references. On the surface, CTR is a single, simple-to-calculate quantity. However, it is important to note that it actually reflects multiple dimensions along which participating systems must perform well. First, systems must generate recommendations that will be clicked. Second, these recommendations have to be generated within the response time limit. Third, the system must be available, meaning that it can respond to requests with well-formed recommendations. The advantage of online evaluation is that participating systems are forced to respect all of these dimensions, making the evaluation more similar to industry practice. In the online evaluation, the fairness of comparison between participating systems is enforced in two ways: First, all systems are compared on the basis of specifically defined time windows. This ensures that the general properties of the recommendation request stream are the same (i.e., they cover the same time frame and the same real-world news events). Second, recommendation requests are distributed randomly to participating systems.
Example: Participant “rocking recommendations” receives 100,000 recommendation requests. The system manages to provide valid, in-time suggestions in 95,000 cases. Users click on 4,500 suggestions. We compute a CTR of 4,500 / 100,000 = 4.5%.
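Under this definition, CTR follows directly from two counters. The minimal sketch below reproduces the example above; the function and variable names are ours.

```python
def click_through_rate(clicks: int, requests: int) -> float:
    """CTR = clicked recommendations / total recommendation requests.

    Failed requests (late, empty, or malformed responses) still count in
    the denominator, which is why availability matters for the score.
    """
    return clicks / requests if requests else 0.0


# Worked example from the text: 100,000 requests, 95,000 valid in-time
# responses, 4,500 clicks. The 5,000 failed requests cannot earn clicks
# but remain in the denominator.
assert round(click_through_rate(clicks=4_500, requests=100_000), 4) == 0.045
```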
We will determine the winning team on the basis of the CTR achieved during an evaluation window. In addition, we have defined three testing periods during which we record the performance of participating teams. These results will be presented at CLEF 2017 in Dublin. Each participating system receives a share of the request volume. CTR measures the proportion of recommendations that have been clicked. Note that when recommendations fail, either by exceeding the time limit or by including no or invalid item references, the number of requests still increases without a chance of additional clicks. Further, note that we cannot compare systems unless they respond to a similar share of requests. Availability therefore constitutes a major challenge for news recommender systems. We will disregard systems that process fewer than 75% of the requests handled by the baseline.
Task 2: NewsREEL Replay.
NewsREEL Replay aims to reproduce the online evaluation scenario outlined above by simulating a continuous stream of recommendation requests. For this, we offer a comprehensive data set comprising one month's worth of interaction logs from several large-scale news publishers. We provide the Idomaar framework, which “replays” the data set chronologically, thereby simulating a streamed recommendation task and generating evaluation results.
Participants in this task are required to predict which news articles a user will read next. “Replaying” makes it possible to fully control the conditions under which different algorithms are compared. Not only do systems adhere to the same time windows, as is the case in the online evaluation, they are also tested on the identical set of recommendation requests. The downside of replaying is that it underestimates the click response of users, as an offline algorithm is required to generate the item that actually was clicked, rather than one that would potentially be clicked. However, this downside is balanced by the opportunity to study a wide range of aspects of system performance, including comparisons of scalability, peak load handling, and the ability to distribute requests.
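To convey the idea of replaying, the sketch below steps through a hypothetical log file in chronological order and forwards each recommendation request to a candidate algorithm. The event fields, file layout, and recommender interface shown here are illustrative assumptions; Idomaar defines its own data format and evaluation interfaces.

```python
# Sketch of a chronological "replay" of a logged event stream.
# The event fields and file layout are illustrative; Idomaar defines its
# own data format and evaluation interfaces.
import json


def replay(log_path, recommender, notify):
    """Feed logged events to a recommender in timestamp order."""
    with open(log_path, encoding="utf-8") as log:
        events = (json.loads(line) for line in log)
        for event in sorted(events, key=lambda e: e["timestamp"]):
            if event["type"] == "recommendation_request":
                # Ask the candidate algorithm for suggestions and hand the
                # result to the evaluation callback.
                recs = recommender.recommend(event["user_id"], event["item_id"])
                notify(event, recs)
            elif event["type"] == "item_update":
                # Keep the recommender's article pool up to date.
                recommender.add_item(event["item_id"], event.get("text", ""))
```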
The official evaluation metric for Task 2 is prediction accuracy, which can be thought of as “offline CTR.” Parallel to the definition of CTR, it is the number of recommended items predicted by an algorithm that correspond to items that were clicked by a user, normalised by the total number of requests for recommendations that were sent to that system. As above, prediction accuracy implicitly reflects the response time and availability of the system. We also encourage participants to take advantage of the framework to investigate other aspects of their systems, including scalability and response time. For instance, you can run extensive experiments to determine which algorithm handles the most simultaneous load, which algorithm requires the least memory, or which algorithm can be most easily distributed. These factors reflect important constraints in industrial recommender systems.
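Mirroring the CTR computation above, a minimal sketch of this “offline CTR” might look as follows, with a correct prediction simplified to the later-clicked item appearing in the recommended list; the data layout is an assumption.

```python
def offline_ctr(requests):
    """Fraction of replayed requests where the recommended list
    contained the item the user actually clicked next.

    `requests` is an iterable of (recommended_items, clicked_item) pairs.
    Requests without a valid, in-time response should be passed with an
    empty recommendation list so they still count in the denominator.
    """
    total = hits = 0
    for recommended_items, clicked_item in requests:
        total += 1
        if clicked_item in recommended_items:
            hits += 1
    return hits / total if total else 0.0


# Illustrative call: three requests, one successful prediction.
print(offline_ctr([([1, 2, 3], 2), ([4, 5], 9), ([], 7)]))  # ~0.333
```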
Organisers
Lab Organisers
- Frank Hopfgartner, University of Glasgow
- Torben Brodt, plista GmbH
Task Organisers
Task 1
- Benjamin Kille, TU Berlin
- Marvin Grimm, plista GmbH
- Leif Blaese, plista GmbH
Task 2
- Andreas Lommatzsch, TU Berlin
- Roberto Turrin, Moviri
- Martha Larson, TU Delft
Steering Committee
- Paolo Cremonesi, Politecnico di Milano, Italy
- Arjen de Vries, Radboud University, The Netherlands
- Michael Ekstrand, Texas State University, USA
- Hideo Joho, University of Tsukuba, Japan
- Joemon M. Jose, University of Glasgow, UK
- Noriko Kando, National Institute of Informatics, Japan
- Udo Kruschwitz, University of Essex, UK
- Jimmy Lin, University of Waterloo, Canada
- Vivien Petras, Humboldt University, Germany
- Till Plumbaum, TU Berlin, Germany
- Domonkos Tikk, Gravity R&D and Óbuda University, Hungary