We continue to explore news recommenders under real-time constraints as we head toward NewsREEL 2017. The new edition features two tasks similar to those of 2015 and 2016. Task 1, NewsREEL Live, takes place in a living lab setting with ORP as the backend. Task 2, NewsREEL Replay, provides better comparability, as participants simulate recommendation requests and all algorithms are thus subject to identical conditions.
NewsREEL Live requires researchers to generate news recommendations for millions of readers of online news articles. The quality of these recommendations is evaluated based on users’ click-through rate (CTR). Whenever a visitor of one of the selected news publishers requests an article, this request is forwarded to the recommendation service provider plista, which then distributes it to the participants of NewsREEL Live. The figure below visualises this process.
The main component of this task is the Open Recommendation Platform (ORP), which grants participants access to a variety of news portals. Participants deploy a recommendation service that receives recommendation requests. ORP monitors users’ reactions and, in this way, collects interaction data that is used to compare recommendation algorithms. Participants lacking the resources to operate a recommendation service may opt to deploy their algorithms on virtual machines provided by plista.
ORP distributes requests randomly across participants to ensure a fair comparison. Participants will experience settings similar to those of actual recommender systems: participating systems must respond within 100 ms and, during the challenge, will need to handle peak loads of up to 100 messages per second. ORP records requests, clicks, and errors.
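To make these constraints concrete, the following minimal sketch shows one way such a recommendation service could be structured. It assumes that ORP posts JSON messages over HTTP and accepts a JSON reply; the field names "item_id" and "recommended_items" are placeholders rather than the actual ORP schema, and the recency-based suggestion logic merely illustrates how a cheap, precomputed candidate pool keeps responses well within the 100 ms budget.

```python
# Minimal sketch of a NewsREEL Live recommendation service.
# Assumptions: ORP delivers JSON messages via HTTP POST and accepts a JSON
# reply; "item_id" and "recommended_items" are placeholder field names,
# not the real ORP schema.
import json
from collections import deque
from http.server import BaseHTTPRequestHandler, HTTPServer

recent_items = deque(maxlen=100)  # most recently requested article ids

class RecommendationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            message = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            message = {}

        item_id = message.get("item_id")  # placeholder field name
        if item_id is not None:
            recent_items.appendleft(item_id)

        # Recency-based fallback: suggest the latest articles, excluding the current one.
        suggestions = [i for i in recent_items if i != item_id][:6]

        body = json.dumps({"recommended_items": suggestions}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RecommendationHandler).serve_forever()
```

Because the candidate pool is maintained incrementally as messages arrive, each request is answered with a simple list lookup rather than an expensive model computation, which is what makes the response-time and peak-load requirements attainable.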
The official NewsREEL evaluation metric for Task 1 is the click-through rate (CTR). CTR is the number of recommendations produced by a participating system that are clicked by users, normalised by the total number of requests for recommendations that were sent to that system. Note that in order to receive a click, a well-formed recommendation needs to be produced in response to a request within the given time limit (100 ms). The restriction to well-formed recommendations eliminates recommendations that are empty or contain invalid item references. On the surface, CTR is a single, simple-to-calculate quantity. However, it is important to note that it actually reflects multiple dimensions along which participating systems must perform well. First, systems must generate recommendations that will be clicked. Second, these recommendations have to be generated within the response-time limit. Third, the system must be available, meaning that it can respond to requests with well-formed recommendations. The advantage of online evaluation is that participating systems are forced to respect these multiple dimensions, making the evaluation more similar to evaluations in industry. In online evaluation, fairness of the comparison between participating systems is enforced in two ways: First, all systems are compared on the basis of specifically defined time windows. This ensures that the general properties of the recommendation request stream are the same (i.e., they cover the same time frame and the same real-world news events). Second, recommendation requests are distributed randomly to participating systems.
Example: Participant “rocking recommendations” receives 100,000 recommendation requests. The system manages to provide valid, in-time suggestions in 95,000 cases. Users click on 4,500 of these suggestions. We compute a CTR of 4,500 / 100,000 = 4.5%.
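The following minimal sketch reproduces the arithmetic of this example; the counts are taken directly from the example above, and the function simply encodes the definition of CTR given earlier.

```python
# Reproducing the example: CTR normalises clicks by *all* requests sent to
# the system, not only by the requests that were answered in time.
def click_through_rate(clicks: int, requests: int) -> float:
    return clicks / requests if requests else 0.0

requests_sent = 100_000      # all requests routed to "rocking recommendations"
answered_in_time = 95_000    # well-formed, in-time responses (not the denominator)
clicks = 4_500

print(f"CTR = {click_through_rate(clicks, requests_sent):.1%}")  # CTR = 4.5%
```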
We will determine the winning team on the basis of the CTR achieved during an evaluation window. In addition, we have defined three testing periods during which we record the performance of participating teams. These results will be presented at CLEF 2017 in Dublin. Each participant receives a volume of requests, and CTR measures the proportion of recommendations that have been clicked. Note that recommendations that fail, because they exceed the time limit or contain no or invalid item references, still increase the number of requests without any chance of additional clicks. Further note that we cannot compare systems unless they respond to a similar share of requests; availability thus constitutes a major challenge for news recommender systems. We will disregard systems that process fewer than 75% of the requests processed by the baseline, as sketched below.
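The ranking logic can be sketched as follows. This is only an illustration: the team names and counts are invented, and we interpret “processed requests” here as the number of well-formed, in-time responses, measured relative to the baseline.

```python
# Hypothetical ranking sketch: drop systems that answered fewer than 75% as
# many requests as the baseline, then rank the remaining systems by CTR.
def rank_teams(stats, baseline="baseline"):
    """stats: team name -> {"requests": int, "answered": int, "clicks": int}."""
    threshold = 0.75 * stats[baseline]["answered"]
    eligible = {
        team: counts["clicks"] / counts["requests"]
        for team, counts in stats.items()
        if counts["answered"] >= threshold and counts["requests"] > 0
    }
    return sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)

leaderboard = rank_teams({
    "baseline":                {"requests": 100_000, "answered": 98_000, "clicks": 1_200},
    "rocking recommendations": {"requests": 100_000, "answered": 95_000, "clicks": 4_500},
    "flaky service":           {"requests": 100_000, "answered": 40_000, "clicks": 3_000},
})
for team, ctr in leaderboard:
    print(f"{team}: CTR = {ctr:.1%}")  # "flaky service" is disregarded
```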
NewsREEL Replay aims to reproduce the online evaluation scenario outlined above by simulating a continuous stream of article requests. For this, we offer a comprehensive data set comprising one month of interaction logs from several large-scale news publishers. We provide you with the framework Idomaar, which “re-plays” the data set chronologically, in this way simulating a streamed recommendation task, and generates evaluation results. The figure below illustrates this task.
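The core idea of replaying can be sketched in a few lines. This is a conceptual illustration only, not the Idomaar API, and the log fields (timestamp, user_id, item_id, event_type) are assumptions about the data layout.

```python
# Conceptual replay loop (not the Idomaar API): events are sorted by
# timestamp and fed to the algorithm one by one, so every algorithm is
# evaluated against exactly the same chronological stream.
import csv

def replay(log_path, algorithm):
    with open(log_path, newline="") as f:
        events = sorted(csv.DictReader(f), key=lambda e: int(e["timestamp"]))
    for event in events:
        if event["event_type"] == "recommendation_request":
            # The algorithm must answer as it would have at that moment in time.
            algorithm.recommend(event["user_id"], event["item_id"])
        else:
            # Impressions and clicks update the algorithm's internal state.
            algorithm.update(event)
```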
Participants in this task are required to predict which news articles a user will read next. “Replaying” makes it possible to fully control the conditions when comparing various algorithms. Not only do systems adhere to the same time windows, as is the case in online evaluation, they are also tested on the identical set of recommendation requests. The downside of replaying is that it underestimates the click response of users, as an offline algorithm is required to recommend the item that was actually clicked, rather than one that would potentially have been clicked. However, this downside is balanced by the opportunity to study a wide range of aspects of system performance, including comparisons of scalability, peak-load handling, and distributability.
The official evaluation metric for Task 2 is prediction accuracy, which can be thought of as an “offline CTR”. In parallel to the definition of CTR, it is the number of recommended items predicted by an algorithm that correspond to items that were clicked by a user, normalised by the total number of recommendation requests sent to that system. As above, prediction accuracy implicitly reflects the response time and availability of the system. However, we also encourage participants to take advantage of the framework to investigate other aspects of their systems, including scalability and response time. For instance, you can run extensive experiments to determine which algorithm handles the most simultaneous load, which algorithm requires the least memory, or which algorithm can be most easily distributed. These factors reflect important constraints in industrial recommender systems.
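As an illustration of how this metric could be computed over a replayed log, the sketch below counts a request as a hit when at least one of the predicted items was actually clicked by the user afterwards. The data structures and the top-k cutoff of six items are assumptions for the sake of the example, not part of the official evaluation code.

```python
# Sketch of "offline CTR": a prediction counts only if the recommended item
# matches an item the user actually clicked. Field names, the ground-truth
# lookup, and the cutoff k are assumptions about the log structure.
def prediction_accuracy(requests, ground_truth, algorithm, k=6):
    """requests: iterable of (user_id, item_id) recommendation requests;
    ground_truth: maps (user_id, item_id) -> set of items clicked next."""
    hits, total = 0, 0
    for user_id, item_id in requests:
        total += 1
        predicted = algorithm.recommend(user_id, item_id)[:k]
        if set(predicted) & ground_truth.get((user_id, item_id), set()):
            hits += 1
    return hits / total if total else 0.0
```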