Exploration vs. exploitation trade-offs in query by example searches

In the query by example search paradigm the user is stirring the search by providing relevance feedback to the examples presented by the search engine. The objective of a searching strategy is to provide the user with satisfying results most quickly. One possible strategy is to select examples with the highest likelihood for satisfying the user’s query. But this is not necessarily an optimal strategy if more information about the user’s query can be obtained from feedback to different examples. Thus, a search strategy is facing an exploration–exploitation dilemma: should it present an example to the user for which informative feedback is to be expected, or should it present an example which is more likely to match the user’s query. This dilemma is even more severe when the current hypothesis about the user’s query favors examples in the vicinity of past relevant examples, but ignores possibly more relevant examples in an unexplored part of the search space.

A similar dilemma exists for a web content provider who needs to display content which is interesting to the user. The system may either choose to display content which is likely to be interesting given the current statistics (exploitation), or it may display content for which statistics are insufficient (exploration).

As a basic scenario for the exploration–exploitation dilemma the bandit problem has been studied in the statics community and more recently also in the machine learning community. While algorithms for the basic bandit problem are well developed, only few algorithms for more involved scenarios do exist. In the context of this project such scenarios need to deal with available side information as well as with delayed feedback. Whereas the basic bandit problem adequately models choices (e.g. of examples to be presented to the user) which are independent of each other, extension of the bandit problem are needed to model choices which can be related to each other through their descriptions and comparison metrics. Descriptions of possible choices provide side information which helps the system to improve its performance.

If a search continues for several steps, the benefit of an example chosen by the system is not immediately revealed but can be assessed only when the user finishes her search. Thus mechanisms for dealing with such delayed feedback are required. In particular reinforcement learning is a suitable approach in this respect.

We will develop and provide algorithms for dealing with the exploration–exploitation dilemma to improve results for query by example search. Building on our previous work about dealing with side information and about reinforcement learning, we will make algorithms available which are tailored to the specific needs of searches with relevance feedback.