Beyond the boundaries of SMOTE: a framework for manifold-based synthetic oversampling

Type: Article
Proceedings title: Machine Learning and Knowledge Discovery in Databases
Series title: Lecture Notes in Computer Science
Conference: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2016, September 19-23, 2016, Riva del Garda, Italy
ISSN: 0302-9743
ISBN: 9783319462264
Pages: 248-263
Subject: machine learning; class imbalance; synthetic oversampling; manifold and embeddings
Abstract: Problems of class imbalance appear in diverse domains, ranging from gene function annotation to spectra and medical classification. On such problems, the classifier becomes biased in favour of the majority class. This leads to inaccuracy on the important minority classes, such as specific diseases and gene functions. Synthetic oversampling mitigates this by balancing the training set, whilst avoiding the pitfalls of random under- and oversampling. The existing methods are primarily based on the SMOTE algorithm, which employs a bias of randomly generating points between nearest neighbours. The relationship between the generative bias and the latent distribution has a significant impact on the performance of the induced classifier. Our research into gamma-ray spectra classification has shown that the generative bias applied by SMOTE is inappropriate for domains that conform to the manifold property, such as spectra, text, image and climate change classification. To this end, we propose a framework for manifold-based synthetic oversampling, and demonstrate its superiority in terms of robustness to the manifold with respect to the AUC on three spectra classification tasks and 16 UCI datasets.
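The generative bias that the abstract attributes to SMOTE (randomly generating points between nearest neighbours) can be illustrated with a minimal sketch. This is not the manifold-based framework proposed in the article, and it is not taken from any library: the function name smote_sketch and its parameters k and seed are hypothetical, and the sketch assumes Euclidean distance on a small, dense minority class.

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority points to interpolate")
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise squared Euclidean distances within the minority class.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Pick a base point, then one of its k neighbours, for each new sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate a random fraction of the way from base to neighbour.
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])
```

Because every synthetic point is a straight-line interpolation in the input space, points generated this way stay inside the convex hull of the minority class but need not follow a curved low-dimensional manifold, which is the mismatch the abstract refers to.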
Publication date
Publisher: Springer
Language: English
Affiliation: National Research Council Canada; NRC Institute for Aerospace Research; Information and Communication Technologies
Peer reviewed: Yes
NPARC number: 23002088
Record identifier: b1787f39-6e92-4586-8155-c85442a2d7c2
Record created: 2017-08-10
Record modified: 2017-08-10