Arabic Preprocessing Schemes for Statistical Machine Translation

Conference: Proceedings of the Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics (HLT/NAACL) 2006, June 5-7, 2006, New York City, New York, USA
Abstract: In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
Affiliation: NRC Institute for Information Technology; National Research Council Canada
Peer reviewed: No
NRC number: 48759
NPARC number: 9167805
Record identifier: 07fcd97a-570b-45e4-a32c-5edef880e6c1
Record created: 2009-06-29
Record modified: 2016-05-09