I - Main conclusions and lessons learned about benchmarking campaigns

 

Benchmarking campaigns are well suited to fostering exchange between academia and industry.

Challenges measured in benchmarking campaigns are overall judged as relevant by both academia and industry. Motivations differ from one actor to another (e.g. between SMEs, large companies and research institutes), but each of them has an interest in participating in, or in following the results of, benchmarking campaigns. Identified benefits for these actors include: (i) measuring and boosting global research progress; (ii) increasing the visibility of good research; (iii) facilitating access to evaluation data; (iv) facilitating the emergence and sustainability of research communities; (v) fostering the convergence of evaluation methodologies; and (vi) fostering the emergence of private benchmarks modeled on public ones but using business-specific data.

 

Benchmarking campaigns have a positive scientific, technical and economic impact.

NIST, the US National Institute of Standards and Technology, which organizes some of the largest evaluation campaigns, has measured significant technical, industrial and scientific impact of its campaigns. In particular, the TRECVID campaign has allowed system performance to double over a span of 3 to 10 years (depending on the topic). According to a study by RTI International, the return on investment reached a factor of 3 to 5[1]. Finally, TRECVID has generated more than 2000 publications. Similarly significant technical and scientific impact has also been described in publications about other campaigns.

The results of the CHORUS+ survey, as well as the discussions during the Think-Tank, confirmed that benchmarking campaigns have become an important tool for companies to identify relevant research progress and select new technologies for their products. An increasing interest in participating in, and organizing, benchmarking campaigns in the future was measured in both academia and industry.

 

Benchmarking campaigns are criticized on several points.

An important criticism is the implicit cost of participating in an evaluation campaign. Up to 10 additional man-months beyond usual R&D costs are required to participate in an evaluation campaign for the first time. Even if this cost decreases for subsequent participations, this expensive entry price has a negative impact on the participation of SMEs as well as of many research groups worldwide. Another frequently mentioned shortcoming is related to the scale and scope of the data used for benchmarking. Shipping real-world, large-scale data is indeed logistically very difficult and limited by access rights. The consequence is that systems might converge to ad-hoc solutions and therefore generalize poorly when transferred to real-world content. A last criticism concerns the way technologies are evaluated in benchmarking campaigns, and notably the controversial question of user-centered vs. system-oriented evaluation. Some actors from both academia and industry complain that end-users of the technologies are not involved enough in the evaluation process. The large companies that participated in our survey particularly identified this point as critical. On the other hand, user-centered evaluations strongly increase the evaluation cost and are suspected of being more subjective.

 

There is a lack of support for the organization of benchmarking campaigns in Europe.

There is no dedicated funding in Europe to sustain the organization of public benchmarking campaigns at the international level. Large initiatives such as CLEF or MediaEval typically live on heterogeneous and opportunistic research funds, including national and European projects, and on volunteer resources from research institutes. In this context it appears particularly difficult to assess the impact of campaigns over longer periods (in the 5 to 10 year range). On the other hand, the American National Institute of Standards and Technology is in charge of organizing most benchmarking campaigns in the US, with significant permanent resources (complemented by contributing external researchers). There was a consensus during the Think-Tank that Europe should not simply leave the floor to NIST (for several reasons related to scientific, cultural and social diversity, as well as economic strategy). Given its central role in stimulating research and innovation in Europe, the European Commission appears to be a highly recognized candidate to set up and support a sustainable and efficient way to fund and synchronize benchmarking campaigns in Europe.

 

Steering of benchmarking campaigns is controversial.

Selecting and synchronizing the scientific challenges measured in public benchmarking campaigns is a complex process that is sensitive to impartiality and bias. In the US, NIST employs several mechanisms: sometimes the challenge is defined by an agency, and in other cases the research community defines the challenge collectively. Most European benchmarking campaigns, such as CLEF and MediaEval, are based on a bottom-up mechanism: new challenges are proposed by individual research groups or research projects, and the organizers of previous campaigns decide collectively whether the new task should be integrated. There is no specific mechanism to synchronize the campaigns with each other. Some participants in the Think-Tank rather suggested a top-down approach in which the challenges would be defined by public agencies; an EU-based effort should then concentrate on evaluating results of research funded by EU funds. Other participants tempered this approach, so as to avoid adding a layer of bureaucracy and fragmenting research evaluation.



[1] Please refer to: http://trec.nist.gov/pubs/2010.economic.impact.pdf