Mining the Web for Bilingual Text

Resnik, P.

Files

CS-TR-4153.ps (622.93 KB)

CS-TR-4153.pdf (204.22 KB)

Date

2000-06-15

Authors

Resnik, P.

Abstract

STRAND (Resnik, 1998) is a language-independent system for automatic discovery of text in parallel translation on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end-product is an automatically acquired parallel corpus comprising 2491 English-French document pairs, approximately 1.5 million words per language. (Also cross-referenced as UMIACS-TR-2000-44 and LAMP-TR-051.)

URI (handle)

http://hdl.handle.net/1903/1084

Collections

Technical Reports from UMIACS
Technical Reports of the Computer Science Department
