A distributional and syntactic approach to fine-grained opinion mining
Abstract
This thesis contributes to a larger social science research program of
analyzing the diffusion of IT innovations. We show how to
automatically identify the portions of text that deal with opinions
about innovations by finding {source, target, opinion} triples in text.
In this context, we can derive a list of innovations, which serve as
targets, from the domain itself. We then use this list as an anchor for
finding the other two members of the triple at a "fine-grained"
level: paragraph contexts or smaller.
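To make the goal concrete, the following is a minimal Python sketch of the kind of triple the pipeline aims to extract; the field names and the example values are hypothetical, not taken from the thesis data.

from dataclasses import dataclass

@dataclass
class OpinionTriple:
    """One fine-grained opinion instance extracted from text."""
    source: str   # who holds the opinion, e.g. an author or quoted speaker
    target: str   # the IT innovation under discussion
    opinion: str  # the opinion-bearing expression
    context: str  # the paragraph (or smaller span) the triple was found in

# Hypothetical example of the intended output.
example = OpinionTriple(
    source="the CIO",
    target="cloud computing",
    opinion="promising",
    context="The CIO called cloud computing a promising direction.",
)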
We first demonstrate a vector space model for finding opinionated
contexts in which the innovation targets are mentioned. We find
paragraph-level contexts by searching for an
"expresses-an-opinion-about" relation between sources and targets,
using a supervised SVM model whose features are derived from a
general-purpose subjectivity lexicon and a corpus indexing tool. We
show that our algorithm correctly isolates the domain-relevant subset
of subjectivity terms so that they receive higher weight.
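As a rough illustration, a paragraph-level classifier of this kind could be sketched as below. The lexicon contents, the feature set, and the use of scikit-learn are assumptions for the sake of the example; the thesis describes its own features and indexing tool.

from sklearn.svm import LinearSVC

# Stand-in for a general-purpose subjectivity lexicon.
SUBJECTIVITY_LEXICON = {"promising", "useful", "risky", "disappointing"}

def paragraph_features(paragraph, target):
    """Toy features for one (paragraph, target) pair."""
    tokens = paragraph.lower().split()
    lexicon_hits = [t for t in tokens if t in SUBJECTIVITY_LEXICON]
    return [
        len(lexicon_hits),                          # subjectivity-lexicon hits
        int(target.lower() in paragraph.lower()),   # target mention present
        len(tokens),                                # paragraph length
    ]

def train_opinion_context_classifier(pairs, labels):
    """pairs: list of (paragraph, target); labels: 1 if the paragraph
    expresses an opinion about the target, else 0 (from annotated data)."""
    X = [paragraph_features(p, t) for p, t in pairs]
    clf = LinearSVC()
    clf.fit(X, labels)
    return clf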
We then turn to identifying the opinion itself. Typically, opinions in
opinion mining are taken to be positive or negative. We describe a
crowdsourcing technique developed to create the seed data, capturing
human perception of opinion-bearing language, that our supervised
learning algorithm requires. Our user interface successfully limited the
meta-subjectivity inherent in the task ("What is an opinion?") while
reliably retrieving relevant opinionated words from annotators who were
not experts in the domain.
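One plausible way to turn such crowd judgments into seed data is simple majority-vote aggregation, sketched below; the thesis's interface and aggregation scheme may differ, and the threshold and example words are assumptions.

from collections import Counter

def aggregate_judgments(judgments, min_agreement=0.7):
    """judgments: {word: [True/False labels from individual workers]}.
    Returns words a clear majority of workers marked as opinion-bearing."""
    seed = {}
    for word, labels in judgments.items():
        votes = Counter(labels)
        agreement = votes[True] / len(labels)
        if agreement >= min_agreement:
            seed[word] = agreement
    return seed

seed_words = aggregate_judgments({
    "promising": [True, True, True, False],
    "database": [False, False, True],
})
# -> {"promising": 0.75}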
Finally, we developed a new data structure and modeling technique for
connecting targets with the correct within-sentence opinionated
language. Syntactic relatedness tries (SRTs) contain all paths from a
sentence's dependency graph that connect a target expression to a
candidate opinionated word. We use factor graphs to model how far a
path through the SRT must be followed in order to connect the right
targets to the right words. It turns out that we can correctly label
significant portions of these tries, with minimal processing, using
very rudimentary features such as part-of-speech tags and dependency
labels. This model is trained on the data gathered with the
crowdsourcing technique described above.
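A minimal sketch of building such a trie follows: each dependency path from a target mention to a candidate opinion word is inserted into a trie keyed by rudimentary (dependency label, part-of-speech) steps. Path extraction from a real dependency parser is assumed, and the example path is hypothetical; deciding how far down each path to follow is the role of the factor-graph model described above.

class SRTNode:
    def __init__(self):
        self.children = {}   # (dep_label, pos_tag) -> SRTNode
        self.words = []      # candidate opinion words reached via this path

def insert_path(root, path, candidate_word):
    """path: sequence of (dep_label, pos_tag) steps leading away from the target."""
    node = root
    for step in path:
        node = node.children.setdefault(step, SRTNode())
    node.words.append(candidate_word)

# Hypothetical path for "The CIO called cloud computing promising":
# from the target "cloud computing" up to its head verb and across to "promising".
root = SRTNode()
insert_path(root, [("dobj", "VBD"), ("xcomp", "JJ")], "promising")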
We conclude by placing our work in the context of a larger sentiment
classification pipeline and by describing a model for learning from
the data structures it produces. This work contributes to
computational linguistics by proposing and verifying new data-gathering
techniques and by applying recent developments in machine learning to
inference over grammatical structures for highly subjective purposes.
It applies a suffix tree-based data structure to model opinion in a
specific domain by imposing a restriction on the order in which the
data is stored in the structure.