Mining the Web for Bilingual Text
Abstract
STRAND (Resnik, 1998) is a language-independent system for automatic
discovery of text in parallel translation on the World Wide Web. This
paper extends the preliminary STRAND results by adding automatic
language identification, scaling up by orders of magnitude, and
formally evaluating performance. The most recent end product is an
automatically acquired parallel corpus comprising 2,491 English-French
document pairs, with approximately 1.5 million words per language.
(Also cross-referenced as UMIACS-TR-2000-44 and LAMP-TR-051)