Predictive Coding Techniques with Manual Review to Identify Privileged Documents in E-Discovery

In twenty-first century civil litigation, discovery focuses on the retrieval of electronically stored information. Lawsuits may be won or lost because of incorrect production of electronic evidence. As organizations generate fewer paper documents, the volume of electronic documents has grown manyfold. Litigants face the task of searching millions of electronic records for responsive, non-privileged documents, making the e-discovery process burdensome and expensive. To ensure that material that must be withheld is not inadvertently revealed, electronic evidence found to be responsive to a production request is typically subjected to an exhaustive manual review for privilege. Although budgetary constraints on review for responsiveness can be met to some degree using automation, attorneys have been hesitant to adopt similar technology to support the privilege review process. This dissertation draws attention to the potential for adopting predictive coding technology during the privilege review phase of the discovery process.

Two questions central to building a privilege classifier are addressed. The first asks which set of annotations can serve as a reliable basis for evaluation. The second asks which of the remaining annotations, when used for training, produce the best classifiers. To answer these questions, binary classifiers are trained on labeled annotations from both junior and senior reviewers. Issues related to training bias and sample variance arising from reviewer expertise are thoroughly discussed. Results show that annotations that were randomly drawn and labeled by senior reviewers are useful for evaluation, while the remaining annotations can be used for classifier training.
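The training-and-evaluation split described above can be sketched in a few lines. This is an illustrative assumption only: the pipeline (TF-IDF features with logistic regression), the toy texts, and the labels below are not the dissertation's actual configuration, merely one plausible instance of training a binary privilege classifier on reviewer annotations and evaluating it on a senior-reviewed set.

```python
# A minimal sketch of binary privilege classification, assuming documents
# arrive as (text, label) pairs. All data and model choices are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training annotations (e.g., labeled by junior reviewers).
train_texts = [
    "attorney-client memo regarding the lawsuit",
    "legal advice on the contract breach",
    "please advise counsel on merger terms",
    "lunch on friday?",
    "quarterly sales figures attached",
    "team outing schedule for next month",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = privileged, 0 = not privileged

# Hypothetical held-out set, randomly drawn and annotated by senior reviewers.
eval_texts = ["counsel's legal opinion on liability", "weekly status report"]

# Train a simple bag-of-words classifier and predict privilege labels.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)
predictions = clf.predict(eval_texts)
```

In practice the evaluation set would be scored with recall-oriented metrics, since missing a privileged document (inadvertent disclosure) is far costlier than over-withholding.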

A research prototype is built to perform a user study. Privilege judgments are gathered from multiple lawyers using two user interfaces, one of which includes automatically generated features to aid the review process. The goal is to help lawyers make faster and more accurate privilege judgments. A significant improvement in recall was observed when users reviewed with the automated annotations, and classifier features related to the people involved in privileged communications proved particularly important for the privilege review task. No measurable change in review time was found.

Because review cost is proportional to review time, this work introduces as its final step a semi-automated framework that aims to optimize the cost of the manual review process. The framework calls for litigants to make rational choices about what to manually review. Documents are first automatically classified for responsiveness and privilege; some of the automatically classified documents are then reviewed by human reviewers for responsiveness and for privilege, with the overall goal of minimizing the expected cost of the entire process, including costs arising from incorrect decisions. A risk-based ranking algorithm determines which documents need manual review, and multiple baselines are used to characterize the cost savings achieved by this approach.
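The risk-based ranking idea can be illustrated with a small sketch. The `expected_risk` function, the cost values, and the classifier probabilities below are hypothetical assumptions, shown only to demonstrate how documents might be prioritized for manual review by the expected cost of accepting the classifier's automatic decision:

```python
# Hedged sketch: rank documents for manual review by the expected cost of
# trusting the automatic privilege decision. Costs are illustrative; in
# e-discovery, disclosing a privileged document is assumed far costlier
# than needlessly withholding a non-privileged one.
def expected_risk(p_priv, cost_disclose=100.0, cost_withhold=1.0):
    """Expected cost of skipping manual review for one document.

    p_priv is the classifier's estimated probability that the document
    is privileged. Accepting the more likely label risks the opposite
    error, weighted by the probability that the other label is correct.
    """
    if p_priv >= 0.5:
        # Classifier withholds: risk is wrongly withholding a non-privileged doc.
        return (1.0 - p_priv) * cost_withhold
    # Classifier produces: risk is inadvertently disclosing a privileged doc.
    return p_priv * cost_disclose

# Hypothetical classifier probabilities of privilege for five documents.
docs = {"d1": 0.05, "d2": 0.45, "d3": 0.60, "d4": 0.95, "d5": 0.30}

# Review documents whose automatic decision carries the highest expected cost.
review_order = sorted(docs, key=lambda d: expected_risk(docs[d]), reverse=True)
# Uncertain documents near the decision boundary on the "produce" side rank first.
```

Under these toy costs, borderline documents that would be produced (d2, d5) dominate the review queue, while confident decisions (d4) are safely left to the classifier.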

Although the work in this dissertation is applied to e-discovery, similar approaches could be applied to any case in which retrieval systems have to withhold a set of confidential documents despite their relevance to the request.