Knowledge discovery from databases: cost-sensitive and imbalance learning

Publication Type	dissertation
School or College	David Eccles School of Business
Department	Operations & Information Systems
Author	Zhuo, Yang
Title	Knowledge discovery from databases: cost-sensitive and imbalance learning
Date	2010-08
Description	In today's business world, collecting data for business analysis is no longer difficult. The major concern facing business managers is whether they can use data to build predictive models that provide accurate information for decision-making. Knowledge Discovery from Databases (KDD) provides a guideline for collecting data by identifying the knowledge inside the data. As one of the KDD steps, data mining provides a systematic and intelligent approach to learning from large amounts of data and is critical to the success of KDD. Over the past several decades, many data mining algorithms have been developed; they can be categorized as classification, association rule, and clustering algorithms. These algorithms have been demonstrated to be very effective in answering different business questions. Among these types, classification is the most popular and is widely used in all kinds of business areas. However, existing classification algorithms are designed to maximize prediction accuracy under the assumption of equal class distribution and equal error costs. This assumption seldom holds in the real world. Thus, it is necessary to extend current classification methods so that they can deal with data that have imbalanced distributions and unequal costs. In this dissertation, I propose an Iterative Cost-sensitive Naïve Bayes (ICSNB) method aimed at reducing overall misclassification cost regardless of class distribution. During each iteration, K nearest neighbors are identified and form a new training set, which is used to learn unsolved instances. In the second study, using the characteristics of the nearest neighbor method, I also develop a new under-sampling method: a general method that deals with the imbalance problem and identifies noisy instances in the data set to create a balanced data set for learning.
Both methods are validated using multiple real-world data sets. The empirical results show that my methods outperform some existing, popular methods.
Type	Text
Publisher	University of Utah
Subject	Data mining; Knowledge acquisition
Dissertation Institution	University of Utah
Dissertation Name	Doctor of Philosophy
Language	eng
Rights Management	Copyright © Zhuo Yang 2010
Format Medium	application/pdf
Format Extent	2,044,969 bytes
Source	original in Marriott Library Special Collections ; HB33.5 2010 .Y36
ARK	ark:/87278/s6df75xj
Setname	ir_etd
ID	194763
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6df75xj