SCUT: multi-class imbalanced data classification using SMOTE and cluster-based undersampling

Proceedings title: Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
Conference: The 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, November 12-14, 2015, Lisbon, Portugal
Subject: Clustering and Classification Methods; Machine Learning; Pre-Processing and Post-Processing for Data Mining; Multi-Class Imbalance; Undersampling; Oversampling
Abstract: Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the two-class problem has received interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and credit card fraud detection, amongst others. However, handling class imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority classes and to misclassify instances from the minority classes as belonging to the majority classes, leading to poor predictive accuracy. Further, there is a need both to handle the imbalance between classes and to address the selection of examples within a class (i.e. the so-called within-class imbalance). In this paper, we propose the SCUT hybrid sampling method, which is used to balance the number of training examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through the generation of synthetic examples and employs cluster analysis in order to undersample majority classes. In addition, it handles both within-class and between-class imbalance. Our experimental results on a number of multi-class problems show that, when the SCUT method is used for pre-processing the data before classification, we obtain highly accurate models that compare favourably to the state-of-the-art.
Affiliation: Information and Communication Technologies; National Research Council Canada
Peer reviewed: Yes
NPARC number: 21277623
Record identifier: e8c7556d-9f94-466f-a1e5-72cdf9b9513f
Record created: 2016-05-05
Record modified: 2016-05-12
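
The hybrid sampling idea described in the abstract (synthetic oversampling of minority classes, cluster-based undersampling of majority classes) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes every class is resampled to the mean class size, which the abstract does not state, and it uses a plain k-means as a stand-in for whichever clustering method the paper employs. All function names and the parameters `k`, `n_clusters` and `seed` are hypothetical.

```python
import math
import random
from collections import defaultdict


def smote_oversample(points, target, k=3, rng=random):
    """Grow `points` to `target` items by interpolating towards near neighbours."""
    synthetic = []
    while len(points) + len(synthetic) < target:
        base = rng.choice(points)
        # k nearest neighbours of `base` within the same class (excluding itself)
        neighbours = sorted((p for p in points if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        if not neighbours:  # a single-point class can only be duplicated
            synthetic.append(tuple(base))
            continue
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return list(points) + synthetic


def kmeans(points, n_clusters, iters=10, rng=random):
    """Plain Lloyd's k-means (assumed stand-in); returns non-empty clusters."""
    centres = rng.sample(points, n_clusters)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty
        centres = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centres[i]
                   for i, cl in enumerate(clusters)]
    return [cl for cl in clusters if cl]


def cluster_undersample(points, target, n_clusters=2, rng=random):
    """Shrink `points` to `target` items, drawing from every cluster in turn."""
    clusters = kmeans(points, min(n_clusters, len(points)), rng=rng)
    for cl in clusters:
        rng.shuffle(cl)
    kept = []
    while len(kept) < target:
        for cl in clusters:  # round-robin so each cluster stays represented
            if cl and len(kept) < target:
                kept.append(cl.pop())
    return kept


def scut_like_resample(X, y, seed=0):
    """Resample every class to the mean class size (assumed target)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(tuple(features))
    target = round(len(X) / len(by_class))
    X_out, y_out = [], []
    for label, points in by_class.items():
        if len(points) < target:
            points = smote_oversample(points, target, rng=rng)
        elif len(points) > target:
            points = cluster_undersample(points, target, rng=rng)
        X_out.extend(points)
        y_out.extend([label] * len(points))
    return X_out, y_out
```

Drawing from every cluster in turn, rather than sampling the majority class uniformly, is one simple way to keep sparse regions of a class represented after undersampling, which is the within-class concern the abstract raises. In practice, a maintained library would be preferable to this hand-rolled sketch.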