An Information Geometric Treatment of Maximum Likelihood Criteria and Generalization in Hidden Markov Modeling
It is shown here that several techniques for maximum likelihood training of Hidden Markov Models are instances of the EM algorithm and have very similar descriptions when formulated as instances of the Alternating Minimization procedure. The N-Best and Segmental K-Means algorithms are derived under a minimum discrimination information criterion and are shown to result from an additional restriction placed on the minimum discrimination information formulation that yields the Baum-Welch algorithm. This uniform formulation is then employed in an exploration of generalization by the EM algorithm.
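As a rough illustration of the kind of maximum likelihood training discussed above (this sketch is not taken from the thesis), the standard Baum-Welch procedure for a discrete-output HMM alternates a forward-backward E-step with a re-estimation M-step; every quantity below (`A`, `B`, `pi`, the scaling constants `c`) is an assumption of this sketch rather than notation from the source:

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """EM (Baum-Welch) for a discrete HMM: illustrative sketch only."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: scaled forward pass
        alpha = np.zeros((T, n_states)); c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        # E-step: scaled backward pass
        beta = np.zeros((T, n_states)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        # state and transition posteriors
        gamma = alpha * beta
        gamma /= gamma.sum(1, keepdims=True)
        xi = np.zeros((n_states, n_states))
        for t in range(T - 1):
            x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi += x / x.sum()
        # M-step: re-estimate parameters from the posteriors
        pi = gamma[0]
        A = xi / gamma[:-1].sum(0)[:, None]
        B = np.zeros_like(B)
        for k in range(n_symbols):
            B[:, k] = gamma[np.array(obs) == k].sum(0)
        B /= gamma.sum(0)[:, None]
    # log-likelihood of the E-step preceding the final M-step
    return A, B, pi, np.log(c).sum()
```

The N-Best and Segmental K-Means variants mentioned above replace the soft posteriors `gamma`/`xi` with restricted (e.g. single best path) alignments; that restriction is what the minimum discrimination information formulation makes explicit.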
It has been noted that the EM algorithm can introduce artifacts as training progresses. A related phenomenon is over-training: although performance measured on the training set continues to improve as the algorithm progresses, performance on related data sets may eventually begin to deteriorate. This behavior is inherent in the maximum likelihood criterion, and its cause can be seen when the training problem is stated in the Alternating Minimization framework. A modification of the maximum likelihood training criterion is suggested to counter this behavior and is applied to the broader problem of maximum likelihood training of exponential models from incomplete data. It leads to a simple modification of the learning algorithms that relates generalization to learning speed. Relationships to other techniques that encourage generalization, particularly methods of incorporating prior information, are discussed.
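The over-training effect described above can be observed by tracking held-out likelihood alongside training likelihood during EM. The following is a minimal sketch (not the thesis's modified criterion) using EM for a two-component 1-D Gaussian mixture; all names (`em_gmm_1d`, `train_ll`, `held_ll`) are hypothetical:

```python
import math

def em_gmm_1d(train, heldout, n_iter=30):
    """EM for a two-component 1-D Gaussian mixture, recording train and
    held-out log-likelihood each iteration (illustrative sketch only)."""
    mu = [min(train), max(train)]      # crude initialization
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def logpdf(x, m, v):
        return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

    def loglik(data):
        return sum(math.log(sum(w[k] * math.exp(logpdf(x, mu[k], var[k]))
                                for k in range(2))) for x in data)

    train_ll, held_ll = [], []
    for _ in range(n_iter):
        train_ll.append(loglik(train))
        held_ll.append(loglik(heldout))
        # E-step: component responsibilities for each training point
        r = []
        for x in train:
            p = [w[k] * math.exp(logpdf(x, mu[k], var[k])) for k in range(2)]
            s = sum(p)
            r.append([pk / s for pk in p])
        # M-step: weighted re-estimation of weights, means, variances
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            w[k] = nk / len(train)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, train)) / nk
            var[k] = sum(ri[k] * (x - mu[k]) ** 2
                         for ri, x in zip(r, train)) / nk
            var[k] = max(var[k], 1e-6)  # floor to avoid variance collapse
    return train_ll, held_ll

```

EM guarantees that `train_ll` is non-decreasing, while `held_ll` may eventually decline; a criterion modified along the lines sketched in the abstract would trade some training-set fit for held-out behavior.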