Cost-sensitive Information Acquisition in Structured Domains
Many real-world prediction tasks require collecting information about the domain entities to achieve better predictive performance. Collecting the additional information is often a costly process that involves acquiring the features describing the entities and annotating the entities with target labels. For example, document collections need to be manually annotated for classification and lab tests need to be ordered for medical diagnosis. Annotating the whole document collection and ordering all possible lab tests might be infeasible due to limited resources. In this thesis, I explore effective and efficient ways of choosing the right features and labels to acquire under limited resources.
For the problem of feature acquisition, we are given entities with missing features and the task is to classify them with minimum cost. Acquiring features reduces the likelihood of misclassification, but each acquisition incurs a cost of its own. The objective is to acquire the right set of features, the one that balances acquisition cost against misclassification cost. I introduce a technique that reduces the space of candidate feature sets by exploiting the conditional independence properties of the underlying probability distribution.
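The cost trade-off above can be illustrated with a minimal myopic value-of-information check for a single binary feature. This is a toy sketch, not the thesis's method: all probabilities, the single-feature restriction, and the function names are illustrative assumptions.

```python
def expected_misclassification_cost(p_pos, cost_fp, cost_fn):
    """Bayes-optimal expected cost of predicting from the current posterior:
    predict positive -> risk a false positive; negative -> risk a false negative."""
    return min((1 - p_pos) * cost_fp, p_pos * cost_fn)

def should_acquire(p_pos, p_feat_given_pos, p_feat_given_neg,
                   acq_cost, cost_fp=1.0, cost_fn=1.0):
    """Acquire the feature iff its expected reduction in misclassification
    cost exceeds its acquisition price (a one-step, myopic criterion)."""
    cost_now = expected_misclassification_cost(p_pos, cost_fp, cost_fn)
    # Marginal probability of observing feature = 1.
    p_f = p_pos * p_feat_given_pos + (1 - p_pos) * p_feat_given_neg
    # Posterior over the class after each possible observation (Bayes' rule).
    post_f1 = p_pos * p_feat_given_pos / p_f
    post_f0 = p_pos * (1 - p_feat_given_pos) / (1 - p_f)
    cost_after = (p_f * expected_misclassification_cost(post_f1, cost_fp, cost_fn)
                  + (1 - p_f) * expected_misclassification_cost(post_f0, cost_fp, cost_fn))
    return cost_now - cost_after > acq_cost

# A highly informative feature is worth a small fee...
print(should_acquire(0.5, 0.9, 0.1, acq_cost=0.05))  # True
# ...but not a steep one.
print(should_acquire(0.5, 0.9, 0.1, acq_cost=0.5))   # False
```

Evaluating this criterion over every subset of missing features is what blows up combinatorially; the conditional-independence structure mentioned above is what lets that space be pruned.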
For the problem of label acquisition, I consider two real-world scenarios. In the first, we are given a previously trained model and a budget determining how many labels we can acquire, and the objective is to determine the right set of labels to acquire so that accuracy on the remaining entities is maximized. I describe a system that automatically learns to predict on which entities the underlying classifier is likely to make mistakes, and suggests acquiring the labels of entities that lie in high-density, potentially misclassified regions.
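One simple way to realize the "high-density, potentially misclassified" idea is to rank entities by the product of a mistake-probability score and a local density estimate, then spend the budget on the top-ranked ones. This is an illustrative sketch under that assumption; the scores, function name, and ranking rule are not taken from the thesis.

```python
def select_for_labeling(mistake_prob, density, budget):
    """Return the indices of the `budget` entities whose combined
    mistake-likelihood x density score is highest: correcting a point
    in a dense, error-prone region also helps its many neighbors."""
    scores = [p * d for p, d in zip(mistake_prob, density)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Four entities: predicted mistake probability and a density estimate
# (both illustrative numbers, e.g. from a meta-classifier and a KDE).
mistake_prob = [0.9, 0.2, 0.7, 0.1]
density      = [0.5, 0.9, 0.8, 0.4]
print(select_for_labeling(mistake_prob, density, budget=2))  # [0, 2]
```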
In the second scenario, we are given a network of unlabeled entities, and our objective is to learn a classification model with the least expected future error while acquiring the minimum number of labels. I describe an active learning technique that exploits the relationships in the network both to select informative entities to label and to learn a collective classifier that utilizes the label correlations in the network.
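The interplay between structure-aware selection and collective classification can be sketched with a toy round of network-based active learning. Here a neighbor-majority vote stands in for the collective classifier, and the selection rule (query the unlabeled node with the most unlabeled neighbors) is just one simple way to exploit structure; none of this is the thesis's actual algorithm.

```python
def neighbor_vote(adj, labels):
    """Collectively predict each unlabeled node from the majority
    label of its already-labeled neighbors (None if it has none)."""
    preds = dict(labels)
    for node, nbrs in adj.items():
        if node in labels:
            continue
        votes = [labels[n] for n in nbrs if n in labels]
        preds[node] = max(set(votes), key=votes.count) if votes else None
    return preds

def pick_query(adj, labels):
    """Select the unlabeled node with the most unlabeled neighbors:
    its acquired label propagates to the largest uncertain neighborhood."""
    unlabeled = [n for n in adj if n not in labels]
    return max(unlabeled, key=lambda n: sum(m not in labels for m in adj[n]))

# A four-node network given as an adjacency list, with one known label.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
labels = {0: "A"}
q = pick_query(adj, labels)        # node 1: two unlabeled neighbors (2 and 3)
labels[q] = "A"                    # the oracle answers
preds = neighbor_vote(adj, labels) # nodes 2 and 3 now inherit "A" from neighbors
```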