Search Among Sensitive Content

Date

2021

Abstract

Current search engines are designed to find what we want, but many collections cannot be made available to search engines because they contain sensitive content that must be protected. Before release, such content must be examined through a sensitivity review process, which can be difficult and time-consuming. To address this challenge, search technology should be capable of providing access to relevant content while protecting sensitive content.

In this dissertation, we present an approach that leverages evaluation-driven information retrieval (IR) techniques. These techniques optimize an objective function that balances the value of finding relevant content against the imperative to protect sensitive content, which requires evaluation measures that balance relevance and sensitivity. Baselines for the problem are introduced, and a proposed approach based on a listwise learning-to-rank model is described; the model is trained with a modified loss function that optimizes for the evaluation measure. Initial experiments repurpose a LETOR benchmark dataset, OHSUMED, using Medical Subject Heading (MeSH) labels to represent sensitivity. A second test collection is based on the Avocado Research Email Collection. Search topics were developed as a basis for assessing relevance, and two personas describing the sensitivities of representative (but fictional) content creators were created as a basis for assessing sensitivity. These personas were based on interviews with potential donors of historically significant email collections and with archivists who currently manage access to such collections. Two annotators then created relevance and sensitivity judgments for 65 topics for one or both personas. Experimental results show the efficacy of the learning-to-rank approach.
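
To make the objective concrete, the following is a minimal sketch of one way such a measure could be defined: an nDCG-style score in which relevant documents add discounted gain and any sensitive document that is retrieved subtracts a discounted penalty. The gain values, penalty weight, cutoff, and function name are illustrative assumptions rather than the dissertation's exact formulation.

    import math

    def cost_sensitive_dcg(ranking, relevance, sensitivity, penalty=2.0, k=10):
        """Score a ranked list, rewarding relevance and penalizing sensitivity.

        ranking: document ids in ranked order.
        relevance: dict mapping doc id -> graded relevance (0, 1, 2, ...).
        sensitivity: dict mapping doc id -> True if the document is sensitive.
        penalty: discounted cost charged for each sensitive document shown.
        k: evaluation depth.
        """
        score = 0.0
        for rank, doc in enumerate(ranking[:k], start=1):
            discount = 1.0 / math.log2(rank + 1)
            if sensitivity.get(doc, False):
                score -= penalty * discount                # showing sensitive content hurts
            else:
                score += relevance.get(doc, 0) * discount  # finding relevant content helps
        return score

    # Example: a highly relevant document at rank 1, a sensitive one at rank 2.
    print(cost_sensitive_dcg(["d1", "d2"], {"d1": 2}, {"d2": True}))

A measure of this form can then serve as the target that a listwise learning-to-rank loss is modified to optimize.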

The dissertation also includes four extensions that further improve the quality of retrieved results with respect to relevance and sensitivity. First, the use of alternative optimization measures is explored. Second, transformer-based rankers are compared with rankers based on hand-crafted features. Third, a cluster-based replacement strategy that can further improve evaluation-measure scores is introduced. Fourth, a policy that truncates the ranked list according to the query's expected difficulty is investigated. Results show improvements in each case.
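
As an illustration of the fourth extension, the sketch below truncates a ranked list at the depth where the expected marginal utility (estimated relevance gain minus weighted sensitivity risk) turns non-positive, so that queries expected to be difficult receive shorter lists. The probability estimates, cost weight, and function name are hypothetical placeholders, not the dissertation's actual policy.

    def truncate_ranking(docs, p_rel, p_sens, sens_cost=2.0):
        """Return the prefix of a ranked list with positive expected utility.

        docs: document ids in ranked order.
        p_rel[d]: estimated probability that d is relevant (from the ranker).
        p_sens[d]: estimated probability that d is sensitive (from a classifier).
        sens_cost: how much heavier an exposed sensitive document weighs
                   than a relevant one (an illustrative assumption).
        """
        kept = []
        for d in docs:
            expected_utility = p_rel[d] - sens_cost * p_sens[d]
            if expected_utility <= 0:
                break  # stop where expected harm outweighs expected gain
            kept.append(d)
        return kept

    # Queries that look harder (lower relevance estimates, higher sensitivity
    # risk) naturally receive shorter ranked lists.
    print(truncate_ranking(["a", "b", "c"],
                           {"a": 0.9, "b": 0.5, "c": 0.4},
                           {"a": 0.05, "b": 0.1, "c": 0.3}))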
