SCUT: multi-class imbalanced data classification using SMOTE and cluster-based undersampling

Proceedings title: Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
Conference: 7th International Conference on Knowledge Discovery and Information Retrieval, November 12-14, 2015, Lisbon, Portugal
Subject: Multi-class imbalance; Undersampling; Oversampling; Classification; Clustering
Abstract: Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the two-class problem has received interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority classes and incorrectly classify instances from the minority classes as belonging to the majority classes, leading to poor predictive accuracies. Further, there is a need to handle both the imbalances between classes as well as address the selection of examples within a class (i.e. the so-called within-class imbalance). In this paper, we propose the SCUT hybrid sampling method, which is used to balance the number of training examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through the generation of synthetic examples and employs cluster analysis in order to undersample majority classes. In addition, it handles both within-class and between-class imbalance. Our experimental results against a number of multi-class problems show that, when the SCUT method is used for pre-processing the data before classification, we obtain highly accurate models that compare favourably to the state-of-the-art.
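The hybrid sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: this record does not state the target class size or the clustering algorithm, so the sketch assumes every class is brought to the mean class size and uses a simple k-means in place of the paper's cluster analysis. The function names (`smote_like`, `kmeans`, `cluster_undersample`, `balance`) and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def smote_like(points, n_new, k=3, rng=random):
    """Create n_new synthetic points by interpolating between a random
    point and one of its k nearest same-class neighbours (SMOTE-style)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        others = sorted((p for p in points if p is not base),
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        if not others:  # a single-example class can only be duplicated
            synthetic.append(tuple(base))
            continue
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def kmeans(points, k, iters=20, rng=random):
    """Plain k-means; an assumed stand-in for the paper's cluster analysis."""
    centres = rng.sample(points, k)
    clusters = [list(points)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[idx].append(p)
        # Recompute each centre; keep the old one if a cluster is empty.
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[i]
                   for i, cl in enumerate(clusters)]
    return [cl for cl in clusters if cl]


def cluster_undersample(points, target, k=3, rng=random):
    """Keep `target` points, drawn round-robin from the clusters so that
    every region of the class stays represented."""
    target = min(target, len(points))
    clusters = kmeans(points, min(k, len(points)), rng=rng)
    for cl in clusters:
        rng.shuffle(cl)
    kept = []
    while len(kept) < target:
        for cl in clusters:
            if cl and len(kept) < target:
                kept.append(cl.pop())
    return kept


def balance(X, y, k=3, seed=0):
    """Oversample small classes and undersample large ones to one size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(tuple(features))
    target = len(X) // len(by_class)  # assumption: mean class size
    X_out, y_out = [], []
    for label, points in by_class.items():
        if len(points) < target:
            points = points + smote_like(points, target - len(points), k, rng)
        elif len(points) > target:
            points = cluster_undersample(points, target, k, rng)
        X_out.extend(points)
        y_out.extend([label] * len(points))
    return X_out, y_out


# Toy data: three classes with 8, 3 and 1 examples.
X = [(i, i % 3) for i in range(8)] + [(10 + i, 5) for i in range(3)] + [(20, 9)]
y = ["a"] * 8 + ["b"] * 3 + ["c"]
X_bal, y_bal = balance(X, y)
print({label: y_bal.count(label) for label in sorted(set(y_bal))})
```

On the toy data each class ends up with four examples. Drawing from every cluster, rather than sampling the majority class uniformly, is what lets an approach of this kind address within-class imbalance as well as between-class imbalance.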
Publisher: SCITEPRESS - Science and Technology Publications
Affiliation: Information and Communication Technologies; National Research Council Canada
Peer reviewed: Yes
NPARC number: 23000070
Record identifier: b49c8505-56e1-4361-a467-b59f56911705
Record created: 2016-06-01
Record modified: 2016-06-01