Mining the Web for Bilingual Text

Files

CS-TR-4153.ps (622.93 KB)
No. of downloads: 299
CS-TR-4153.pdf (204.22 KB)
No. of downloads: 896

Date

2000-06-15

Abstract

STRAND (Resnik, 1998) is a language-independent system for automatically discovering parallel translated text on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end product is an automatically acquired parallel corpus comprising 2,491 English-French document pairs, or approximately 1.5 million words per language. (Also cross-referenced as UMIACS-TR-2000-44 and LAMP-TR-051.)
