Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text
Abstract
Parallel corpora are a valuable resource for machine translation, but at
present their availability and utility are limited by genre- and
domain-specificity, licensing restrictions, and the basic difficulty of
locating parallel texts in all but the most dominant of the world's
languages. A parallel corpus resource not yet explored is the World Wide
Web, which hosts an abundance of pages in parallel translation, offering a
potential solution to some of these problems and unique opportunities of
its own. This paper presents the necessary first step in that
exploration: a method for automatically finding parallel translated
documents on the Web. The technique is conceptually simple, fully
language independent, and scalable, and preliminary evaluation results
indicate that the method may be accurate enough to apply without human
intervention.
(Also cross-referenced as UMIACS-TR-98-41)