C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling

Author: Chris Drummond; Robert C. Holte
Type: Article
Conference: Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II, July 21, 2003, Washington, DC, USA
Abstract: This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, when combined with one of these sampling schemes, it is quickly becoming the community standard for evaluating new cost-sensitive learning algorithms. The paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison, but recommends that the least-cost classifier be part of that standard, as it can outperform under-sampling for relatively modest costs. Over-sampling, however, shows little cost sensitivity: there is often little difference in performance when misclassification costs are changed. (A code sketch of the two sampling schemes follows the record details below.)
Publication date: 2003
Language: English
Affiliation: NRC Institute for Information Technology; National Research Council Canada
Peer reviewed: No
NRC number: 47381
NPARC number: 5765075
Record identifier: 04bc81ac-d061-4ea9-bef4-04a836a682be
Record created: 2009-03-29
Record modified: 2016-05-09
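
As an illustration of the two sampling schemes compared in the abstract, below is a minimal sketch in Python. It assumes scikit-learn's CART decision tree as a stand-in for C4.5 (which scikit-learn does not provide) and a synthetic imbalanced data set; the helpers `undersample`, `oversample`, and `normalized_expected_cost`, and all parameter choices, are illustrative and not taken from the paper. The last helper computes the quantity a cost curve plots for a classifier at a given operating condition.

```python
# Sketch: random under-sampling vs. over-sampling on an imbalanced two-class problem.
# Assumptions (not from the paper): scikit-learn's CART tree stands in for C4.5,
# the data are synthetic, and both schemes resample to a 1:1 class ratio.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def undersample(X, y, rng):
    """Randomly discard majority-class examples until the classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if idx.size > n_min:
            idx = rng.choice(idx, size=n_min, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]


def oversample(X, y, rng):
    """Randomly duplicate minority-class examples until the classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=n_max - idx.size, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


def normalized_expected_cost(fnr, fpr, pc_plus):
    """Cost-curve value at probability-cost value pc_plus, which mixes class
    priors and misclassification costs; normalized so always-wrong costs 1."""
    return fnr * pc_plus + fpr * (1.0 - pc_plus)


# Synthetic imbalanced data: 900 negatives, 100 positives, slightly separated.
X = rng.normal(size=(1000, 5))
y = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]
X[y == 1] += 1.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

for name, sampler in [("under-sampling", undersample), ("over-sampling", oversample)]:
    Xs, ys = sampler(X_tr, y_tr, rng)
    pred = DecisionTreeClassifier(random_state=0).fit(Xs, ys).predict(X_te)
    fnr = np.mean(pred[y_te == 1] == 0)  # false negative rate
    fpr = np.mean(pred[y_te == 0] == 1)  # false positive rate
    # Evaluate a few operating conditions along the cost-curve x-axis.
    costs = {pc: round(normalized_expected_cost(fnr, fpr, pc), 3)
             for pc in (0.1, 0.25, 0.5)}
    print(f"{name}: FNR={fnr:.2f} FPR={fpr:.2f} cost-curve points={costs}")
```

Plotting `normalized_expected_cost` over the full range of the probability-cost value, rather than at a few points as above, gives the cost curves the paper uses to compare the two schemes across class distributions and misclassification costs.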