Combining Coregularization and Consensus-Based Self-Training for Multilingual Categorization

Proceedings title: Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Conference: 33rd Annual Association for Computing Machinery’s Special Interest Group on Information Retrieval Conference (ACM SIGIR), July 19-23, 2010, Geneva, Switzerland
Pages: 475-482; # of pages: 8
Subject: Multilingual Document Classification; Learning from Multiple Views; Semi-supervised Learning
Abstract: We investigate the problem of learning document classifiers in a multilingual setting, from collections where labels are only partially available. We address this problem in the framework of multiview learning, where different languages correspond to different views of the same document, combined with semi-supervised learning in order to benefit from unlabeled documents. We rely on two techniques, coregularization and consensus-based self-training, that combine multiview and semi-supervised learning in different ways. Our approach trains different monolingual classifiers on each of the views, such that the classifiers’ decisions over a set of unlabeled examples are in agreement as much as possible, and iteratively labels new examples from another unlabeled training set based on a consensus across language-specific classifiers. We derive a boosting-based training algorithm for this task, and analyze the impact of the number of views on the semi-supervised learning results on a multilingual extension of the Reuters RCV1/RCV2 corpus using five different languages. Our experiments show that coregularization and consensus-based self-training are complementary and that their combination is especially effective in the interesting and very common situation where there are few views (languages) and few labeled documents available.
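
The consensus-based self-training step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract describes a boosting-based learner, whereas this sketch swaps in a toy nearest-centroid classifier per view, takes consensus to mean unanimous agreement across views, and leaves out the coregularization term entirely. All names (`train_view`, `consensus`, `self_train`) and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Doc = Dict[str, Vector]  # view name (language) -> feature vector of that version


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean; assumes the list is non-empty."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_view(labeled: List[Tuple[Doc, int]], view: str) -> Callable[[Vector], int]:
    """Fit a toy nearest-centroid classifier on one view (labels are +1 / -1).

    Assumption: both classes occur in `labeled`.
    """
    pos = centroid([d[view] for d, y in labeled if y == 1])
    neg = centroid([d[view] for d, y in labeled if y == -1])
    return lambda x: 1 if sq_dist(x, pos) <= sq_dist(x, neg) else -1


def consensus(votes: List[int]) -> Optional[int]:
    """Return the shared label if every view agrees, otherwise None."""
    return votes[0] if all(v == votes[0] for v in votes) else None


def self_train(labeled, unlabeled, views, rounds=3):
    """Move unlabeled documents into the labeled set when all views agree."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        clfs = {v: train_view(labeled, v) for v in views}
        remaining = []
        for doc in pool:
            label = consensus([clfs[v](doc[v]) for v in views])
            if label is None:
                remaining.append(doc)          # views disagree: keep unlabeled
            else:
                labeled.append((doc, label))   # pseudo-label by consensus
        if len(remaining) == len(pool):        # nothing new was labeled
            break
        pool = remaining
    return labeled, pool


# Toy data: two views ("en", "fr") with one-dimensional features.
seed = [({"en": [0.0], "fr": [0.0]}, -1), ({"en": [1.0], "fr": [1.0]}, 1)]
unlabeled = [
    {"en": [0.9], "fr": [0.8]},  # both views lean positive
    {"en": [0.1], "fr": [0.2]},  # both views lean negative
    {"en": [0.9], "fr": [0.1]},  # views disagree
]
new_labeled, still_unlabeled = self_train(seed, unlabeled, views=["en", "fr"])
```

In this toy run, the two documents on which both views agree are pseudo-labeled and added to the training set, while the third stays in the unlabeled pool because its views disagree.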
Affiliation: National Research Council Canada (NRC-CNRC); NRC Institute for Information Technology
Peer reviewed: Yes
NPARC number: 15469835
Record identifier: cf783c37-e5ce-4280-a7cc-9e0b865245e7
Record created: 2010-06-10
Record modified: 2016-05-09