A Preliminary Statistical Investigation into the impact of an N-Gram Analysis Approach based on Word Syntactic Categories toward Text Author Classification

Files

CS-TR-4148.ps (610.06 KB)
No. of downloads: 245
CS-TR-4148.pdf (83.52 KB)
No. of downloads: 1091

Date

2000-06-17

Abstract

Quantitative analysis of literary style has so far relied on semantic elements, namely word counts. This research attempts to identify quantifiable syntactic elements of style that can be used for author identification. The syntactic elements are measured using a dictionary that assigns one part of speech per word, and the analysis looks at phrases delimited by punctuation marks. Permutations of words of different sizes, referred to as grams, are counted within each text. Correlations are measured among the gram frequencies of eight texts by four authors, both contemporary and non-contemporary. The correlations are computed across the different gram sizes. The same treatment is applied to a target text, the Funeral Elegy. The approach classifies texts by time period consistently across the various gram sizes. However, a finer-grained investigation is required to certify the authorship of the Funeral Elegy. (Also cross-referenced as UMIACS-TR-2000-39, LAMP-TR-046)
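
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation. The toy dictionary `POS`, the `UNK` tag for unlisted words, the treatment of grams as contiguous tag sequences, and the use of Pearson correlation are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy dictionary: exactly one part of speech per word.
POS = {
    "the": "DET", "a": "DET",
    "poet": "NOUN", "friend": "NOUN", "verse": "NOUN",
    "wrote": "VERB", "mourned": "VERB",
    "sad": "ADJ", "old": "ADJ",
}

def phrases(text):
    """Yield each punctuation-delimited phrase as a list of lowercase words."""
    for chunk in re.split(r"[.,;:!?]+", text.lower()):
        words = re.findall(r"[a-z']+", chunk)
        if words:
            yield words

def gram_counts(text, n):
    """Count part-of-speech n-grams; grams never cross a punctuation mark."""
    counts = Counter()
    for words in phrases(text):
        # Assumption: words missing from the dictionary get the tag "UNK".
        tags = [POS.get(w, "UNK") for w in words]
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

def pearson(a, b):
    """Pearson correlation of two gram-count tables over the union of grams."""
    keys = sorted(set(a) | set(b))
    x = [a[k] for k in keys]  # a Counter returns 0 for an unseen gram
    y = [b[k] for k in keys]
    mx, my = sum(x) / len(keys), sum(y) / len(keys)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant table
    return cov / sqrt(vx * vy)

if __name__ == "__main__":
    text_a = "The poet wrote a verse, the old friend mourned."
    text_b = "A sad friend mourned; the poet wrote the verse."
    for n in (1, 2, 3):
        r = pearson(gram_counts(text_a, n), gram_counts(text_b, n))
        print(n, r)
```

Because Pearson correlation is invariant to scaling, raw counts and relative frequencies give the same value here. A different similarity measure or a different handling of unknown words could change the results.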