A co-classification approach to learning from multilingual corpora

DOI: http://doi.org/10.1007/s10994-009-5151-5
Editor: Cesa-Bianchi, Nicolo; Hardoon, David R.; Leen, Gayle
Type: Article
Journal title: Machine Learning
Volume: 79
Issue: 1-2
Pages: 105-121; # of pages: 17
Subject: Text categorization; Multilingual data; Logistic regression; Boosting
Abstract: We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview learning approach based on co-regularization, in which we consider each language as a separate source and minimize a joint loss that combines the monolingual classification losses in each language while ensuring consistency of the categorization across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and, most of the time, even models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which is available for benchmarking.
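
The joint loss described in the abstract can be illustrated with a minimal sketch for two languages and logistic regression. This is not the authors' algorithm: the consistency term used here (a squared difference between the two views' raw scores), the weight lam, the plain gradient descent loop, and the toy "en"/"fr" feature vectors are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def log_loss(y, z):
    # Logistic loss for a label y in {0, 1} and a raw score z.
    p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def joint_loss(w1, w2, X1, X2, Y, lam):
    # Mean of: loss in view 1 + loss in view 2 + lam * disagreement.
    # The squared score difference is an assumed stand-in for the
    # consistency term; the paper may define it differently.
    total = 0.0
    for x1, x2, y in zip(X1, X2, Y):
        z1, z2 = dot(w1, x1), dot(w2, x2)
        total += log_loss(y, z1) + log_loss(y, z2) + lam * (z1 - z2) ** 2
    return total / len(Y)

def train(X1, X2, Y, lam=0.1, lr=0.1, epochs=200):
    # Batch gradient descent on joint_loss, updating both views together.
    w1 = [0.0] * len(X1[0])
    w2 = [0.0] * len(X2[0])
    n = len(Y)
    for _ in range(epochs):
        g1 = [0.0] * len(w1)
        g2 = [0.0] * len(w2)
        for x1, x2, y in zip(X1, X2, Y):
            z1, z2 = dot(w1, x1), dot(w2, x2)
            d = 2.0 * lam * (z1 - z2)      # gradient of the disagreement term
            c1 = sigmoid(z1) - y + d       # d(loss)/d(z1)
            c2 = sigmoid(z2) - y - d       # d(loss)/d(z2)
            for j, v in enumerate(x1):
                g1[j] += c1 * v / n
            for j, v in enumerate(x2):
                g2[j] += c2 * v / n
        w1 = [w - lr * g for w, g in zip(w1, g1)]
        w2 = [w - lr * g for w, g in zip(w2, g2)]
    return w1, w2

# Hypothetical toy data: each document has one feature vector per language
# (two term counts plus a constant bias feature) and a shared label.
X_en = [[1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 1]]
X_fr = [[2, 0, 1], [0, 2, 1], [1, 0, 1], [0, 1, 1]]
Y = [1, 0, 1, 0]

w_en, w_fr = train(X_en, X_fr, Y, lam=0.1)
```

Setting lam to 0 in this sketch decouples the two views, which corresponds to training an independent classifier per language; larger values push the two classifiers toward agreeing on each document.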
Publication date
Language: English
Affiliation: National Research Council Canada (NRC-CNRC); NRC Institute for Information Technology
Peer reviewed: Yes
NPARC number: 16335052
Record identifier: 3bbbda38-91cf-4a07-8c27-4872100b5f16
Record created: 2010-11-05
Record modified: 2016-05-09