Semi-supervised Document Classification with a Mislabeling Error Model

Proceedings title: Proceedings. The 30th European Conference on Information Retrieval (ECIR 2008)
Conference: Advances in Information Retrieval, 30th European Conference on IR Research (ECIR'08), Glasgow, UK, March 30 - April 3, 2008
Abstract: This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its own labeling errors. These probabilities are then taken into account when estimating the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA, introduced in [9], which is based on the use of fake labels, yet it retains that extension's simplicity and its ability to solve multiclass problems. In addition, it gives valuable information about the classes that are most uncertain and most difficult to label. We perform experiments on the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach compared with two other semi-supervised algorithms applied to these text classification problems.
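The iterative loop described in the abstract (label the unlabeled documents, estimate the probabilities of labeling errors, re-estimate the parameters) can be illustrated with a minimal sketch. This is not the paper's PLSA-based method: it assumes a weighted multinomial naive Bayes classifier as a stand-in, and every name in it (`fit_nb`, `posterior`, `self_train`, `beta`, `eps`) is hypothetical. Here `beta[k, c]` is an assumed estimate of the probability that a document whose true class is `c` gets assigned label `k`.

```python
import numpy as np

def fit_nb(X, W, alpha=1.0):
    """Weighted multinomial naive Bayes (stand-in model, not PLSA).
    X: (n_docs, n_words) term counts; W[i, c]: weight of doc i for class c."""
    n_classes = W.shape[1]
    prior = (W.sum(axis=0) + alpha) / (W.sum() + alpha * n_classes)
    counts = W.T @ X + alpha                      # (n_classes, n_words)
    word_probs = counts / counts.sum(axis=1, keepdims=True)
    return np.log(prior), np.log(word_probs)

def posterior(X, log_prior, log_word_probs):
    """P(class | doc) for each row of X."""
    scores = X @ log_word_probs.T + log_prior
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

def self_train(X_lab, y_lab, X_unl, n_classes, n_rounds=5, eps=1e-3):
    W_lab = np.eye(n_classes)[y_lab]              # one-hot true labels
    model = fit_nb(X_lab, W_lab)
    beta = np.eye(n_classes)
    for _ in range(n_rounds):
        # Step 1: label the unlabeled documents with the current model.
        post = posterior(X_unl, *model)
        assigned = post.argmax(axis=1)
        A = np.eye(n_classes)[assigned]
        # Step 2: estimate mislabeling probabilities (assumed estimator):
        # beta[k, c] ~ P(assigned label k | true class c), using the model's
        # posteriors as soft true classes. Columns sum to 1.
        beta = A.T @ post + eps
        beta = beta / beta.sum(axis=0, keepdims=True)
        # Step 3: turn each assigned label into soft class weights via
        # Bayes' rule, P(true c | assigned k) ~ beta[k, c] * prior[c].
        W_unl = beta[assigned] * np.exp(model[0])
        W_unl = W_unl / W_unl.sum(axis=1, keepdims=True)
        # Step 4: re-estimate parameters on labeled plus weighted unlabeled docs.
        model = fit_nb(np.vstack([X_lab, X_unl]), np.vstack([W_lab, W_unl]))
    return model, beta
```

Large off-diagonal entries in a column of `beta` would flag a class whose documents are often assigned another label, which is one way to read the abstract's remark about uncertain and difficult classes.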
Affiliation: National Research Council Canada (NRC-CNRC); NRC Institute for Information Technology
Peer reviewed: Yes
NRC number: 50728
NPARC number: 16435926
Record identifier: 96fa3c52-f816-42de-8099-fd7df5fe6de5
Record created: 2010-11-25
Record modified: 2016-05-09