Reducing the size of databases for multirelational classification : a subgraph-based approach

Journal title: Journal of Intelligent Information Systems
Volume: November 2012
Subject: multi-relational classification; relational data mining
Abstract: Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database often spans numerous departments and/or subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory management, financial management, and so on. When considering classification, different phases of the knowledge discovery process are affected by economic utility. For instance, in the data preprocessing process, one must consider the cost associated with acquiring, cleaning, and transforming large volumes of data. When training and testing the data mining models, one has to consider the impact of the data size on the running time of the learning algorithm. In order to address these utility-based issues, the paper presents an approach to create a pruned database for multirelational classification, while minimizing predictive performance loss on the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema, to use for training, and discards all others. The experiments performed show that our strategy is able to, without sacrificing predictive accuracy, significantly reduce the size of the databases, in terms of the number of relations, tuples, and attributes. The approach prunes the sizes of databases by as much as 94%. Such reduction also results in decreasing computational cost of the learning process. The method improves the multirelational learning algorithms’ execution time by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
Publisher: Springer US
Affiliation: Information and Communication Technologies; National Research Council Canada
Peer reviewed: Yes
NPARC number: 21238278
Record identifier: 490f5160-072f-428f-b1f7-1ba0011ead5e
Record created: 2013-02-13
Record modified: 2016-05-09