Effective and Efficient Search Across Languages
Abstract
In the digital era, the abundance of text content in multiple languages has created a need for search systems that can meet users' diverse information needs. Cross-Language Information Retrieval (CLIR) plays an essential role in overcoming language barriers, allowing users to retrieve content in a language that differs from their query language. However, a central challenge in designing retrieval systems lies in balancing their effectiveness, which reflects the quality of the ranked outputs, with their efficiency, which encompasses document processing latency at indexing time (indexing latency) and content retrieval latency at query time (query latency). This dissertation focuses on designing neural CLIR systems that offer a Pareto-optimal balance between the competing objectives of effectiveness and efficiency.
While neural ranking models that rely on query-document term interactions, such as cross-encoder models, are highly effective, they are computationally prohibitive to run over large document collections for every query. One solution is to build a cascaded pipeline of multiple ranking stages, in which a first-stage retrieval system generates a candidate set of documents that is then reranked by the cross-encoder. Ensuring that the first-stage system provides an accurate and rapid triage of large document collections is crucial to the success of the cascaded pipeline. This dissertation introduces BLADE, a first-stage system that leverages traditional inverted indexes to strike a better balance between retrieval effectiveness and indexing/query latency on the Pareto frontier. Once this smaller candidate set is generated, less efficient but more effective techniques can be applied to it. This dissertation also introduces ColBERT-X, a second-stage technique that achieves the best known balance between retrieval effectiveness and indexing latency on the Pareto frontier. To further address the efficiency challenges of cross-encoders, this dissertation introduces CREPE, an approach that optimizes the tradeoff between retrieval effectiveness and query latency.
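To make the cascaded design concrete, the sketch below shows a two-stage retrieval loop: a cheap first-stage retriever triages the full collection into a small candidate set, and an expensive cross-encoder reranks only those candidates. The `first_stage.retrieve` and `cross_encoder.score` interfaces are hypothetical placeholders, not the actual BLADE, ColBERT-X, or CREPE implementations described in the dissertation.

```python
from typing import List, Tuple

def cascaded_search(query: str,
                    first_stage,          # assumed: cheap retriever over an inverted index
                    cross_encoder,        # assumed: expensive query-document scorer
                    k_candidates: int = 1000,
                    k_final: int = 10) -> List[Tuple[str, float]]:
    """Retrieve a broad candidate set cheaply, then rerank it with a
    slower but more effective model."""
    # Stage 1: fast triage of the full collection into a small candidate set.
    candidates = first_stage.retrieve(query, top_k=k_candidates)

    # Stage 2: apply the costly cross-encoder only to the candidates.
    scored = [(doc_id, cross_encoder.score(query, doc_id))
              for doc_id, _ in candidates]

    # Return the top results ordered by the second-stage score.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k_final]
```

The key efficiency property is that the cross-encoder is invoked only `k_candidates` times per query, rather than once per document in the collection.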
While traditional CLIR methods rely on Machine Translation (MT) to bridge the vocabulary mismatch between queries and documents, neural techniques match terms in a shared vector space, providing a complementary source of matching evidence. Fusion techniques exploit the synergies between these complementary methods by creating ensembles, and the CLIR design space admits many such ensembles. This dissertation highlights the complementary nature of BLADE and ColBERT-X with respect to traditional CLIR approaches and demonstrates further effectiveness gains from ensembling them without adversely affecting indexing-time efficiency. These results pave the way for scalable CLIR systems with a better tradeoff between effectiveness and indexing speed.
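As one illustration of such ensembling, the sketch below fuses a translation-based run with a neural run using reciprocal rank fusion, a standard rank-combination method; it stands in for, and is not necessarily identical to, the specific fusion techniques evaluated in the dissertation. The run contents are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(runs: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs by summing 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Documents favored by multiple runs accumulate higher fused scores.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: an MT-based run and a neural (e.g., BLADE-like) run.
mt_run = ["d3", "d1", "d7"]
neural_run = ["d1", "d9", "d3"]
fused_ranking = reciprocal_rank_fusion([mt_run, neural_run])
```

Because fusion operates only on the ranked outputs, it adds no indexing-time cost, which is consistent with the claim that ensembling does not adversely affect indexing-time efficiency.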