A distributional and syntactic approach to fine-grained opinion mining

This thesis contributes to a larger social science research program of
analyzing the diffusion of IT innovations. We show how to
automatically identify portions of text that express opinions about
innovations by finding {source, target, opinion} triples in text. In
this context, we can obtain a list of innovations to serve as targets
from the domain itself. We can then use this list as an anchor for
finding the other two members of the triple at a "fine-grained"
level, that is, in contexts of a paragraph or less.

We first demonstrate a vector space model for finding opinionated
contexts in which the innovation targets are mentioned. We can find
paragraph-level contexts by searching for an
"expresses-an-opinion-about" relation between sources and targets,
using a supervised SVM model whose features are derived from a
general-purpose subjectivity lexicon and a corpus indexing tool. We
show that our algorithm correctly selects the domain-relevant subset
of subjectivity terms, so that these terms are weighted more highly.

We then turn to identifying the opinion. Typically, opinions in
opinion mining are taken to be positive or negative. We discuss a
crowdsourcing technique that we developed to create the seed data
needed by our supervised learning algorithm; these data describe
human perception of opinion-bearing language. Our user interface
successfully limited the meta-subjectivity inherent in the task
("What is an opinion?") while reliably retrieving relevant
opinionated words using workers who are not experts in the domain.

Finally, we develop a new data structure and modeling technique for
connecting targets with the correct within-sentence opinionated
language. Syntactic relatedness tries (SRTs) contain all paths in the
dependency graph of a sentence that connect a target expression to a
candidate opinionated word. We use factor graphs to model how far a
path through the SRT must be followed in order to connect the right
targets to the right words. We find that we can correctly label
significant portions of these tries using only rudimentary features,
such as part-of-speech tags and dependency labels, with minimal
processing. The training data for this technique come from the
crowdsourcing technique we developed.
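
To make the SRT idea concrete, the following is a minimal sketch of one
way such a trie could be built. It is an illustration under stated
assumptions and not the implementation used in the thesis: the names
SRTNode, find_paths and build_srt, the "^-1" marker for traversing an
edge from dependent to head, and the example sentence with its
dependency labels are all hypothetical. The factor graph labeling step
is not modeled here.

```python
from collections import defaultdict


class SRTNode:
    """One trie node: a word reached from its parent via a dependency edge."""

    def __init__(self, word):
        self.word = word
        self.children = {}         # (edge_label, word) -> SRTNode
        self.is_candidate = False  # True if this word is a candidate opinion word


def find_paths(adj, start, goal):
    """Return every simple path from start to goal as [(edge_label, word), ...]."""
    results = []

    def dfs(current, visited, path):
        if current == goal:
            results.append(list(path))
            return
        for label, nxt in adj[current]:
            if nxt not in visited:
                visited.add(nxt)
                path.append((label, nxt))
                dfs(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    dfs(start, {start}, [])
    return results


def build_srt(edges, target, candidates):
    """Build a trie, rooted at the target, of all paths to candidate words.

    edges is a list of (head, label, dependent) triples. Words stand in for
    tokens here (an assumption); a real system would use token indices so
    that repeated words stay distinct.
    """
    adj = defaultdict(list)
    for head, label, dep in edges:
        adj[head].append((label, dep))          # head -> dependent
        adj[dep].append((label + "^-1", head))  # dependent -> head (inverse)
    root = SRTNode(target)
    for cand in candidates:
        for path in find_paths(adj, target, cand):
            node = root
            for step in path:
                # Paths that share a prefix share the same trie nodes.
                node = node.children.setdefault(step, SRTNode(step[1]))
            node.is_candidate = True
    return root


# Hypothetical parse of "Analysts praised the reliable platform enthusiastically".
edges = [
    ("praised", "nsubj", "Analysts"),
    ("praised", "dobj", "platform"),
    ("praised", "advmod", "enthusiastically"),
    ("platform", "det", "the"),
    ("platform", "amod", "reliable"),
]
srt = build_srt(edges, "platform", ["praised", "enthusiastically", "reliable"])
```

In this sketch the paths to "praised" and "enthusiastically" share
their first step, so the trie stores that step once. Deciding how far
along each branch the opinion about the target actually extends is the
labeling problem that the abstract assigns to factor graphs.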

We conclude by placing our work in the context of a larger sentiment
classification pipeline and by describing a model for learning from
the data structures produced by our work. This work contributes to
computational linguistics by proposing and verifying new
data-gathering techniques and by applying recent developments in
machine learning to inference over grammatical structures for highly
subjective purposes. It applies a suffix-tree-based data structure to
model opinion in a specific domain by imposing a restriction on the
order in which the data is stored in the structure.