The impact of sentence alignment errors on phrase-based machine translation performance

Type: Article
Proceedings title: Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas
Conference: The Tenth Biennial Conference of the Association for Machine Translation in the Americas, 28 October-1 November 2012, San Diego, California, USA
Abstract: When parallel or comparable corpora are harvested from the web, there is typically a tradeoff between the size and quality of the data. In order to improve quality, corpus collection efforts often attempt to fix or remove misaligned sentence pairs. At the same time, statistical machine translation (SMT) systems are widely assumed to be relatively robust to sentence alignment errors, although there is little empirical evidence to support and characterize this robustness. This contribution investigates the impact of sentence alignment errors on a typical phrase-based SMT system. We confirm that SMT systems are highly tolerant to noise, and that performance only degrades seriously at very high noise levels. Our findings suggest that when collecting larger, noisy parallel data for training phrase-based SMT, cleaning up by trying to detect and remove incorrect alignments can actually degrade performance. Although fixing errors, when applicable, is a preferable strategy to removal, its benefits only become apparent for fairly high misalignment rates. We provide several explanations to support these findings.
Language: English
Affiliation: Information and Communication Technologies; National Research Council Canada
Peer reviewed: Yes
NPARC number: 21268097
Record identifier: 6aeda7ee-6f72-466f-9c7a-56c71e481d52
Record created: 2013-04-09
Record modified: 2016-05-09