Introduction to Data Mining
Data mining is the process of extracting knowledge, in the form of data patterns, from voluminous amounts of data contained in data repository systems. Examples of data repository systems include relational databases, data warehouses, transactional databases, the World Wide Web, advanced database systems, data streams, and flat files (Han & Kamber, 2006, p. 9-10).
Origins of Data Mining
Data mining originated from the natural evolution of information technology. This is evident in the database system industry, which has followed an evolutionary path in the development of functionalities such as data collection and database creation, data management, and advanced data analysis (Han & Kamber, 2006, p. 1-3). Advanced data analysis comprises data warehousing and data mining. The combination of advanced database systems, web-based databases, and advanced data analysis has given rise to a new generation of integrated data and information systems.
Data Mining Process
The discovery of knowledge through data mining consists of an iterative sequence of steps. The process starts with stating a problem and formulating a hypothesis to be tested. The second step involves the generation and collection of data. It is followed by data cleaning, which removes noise and other inconsistent data (Han & Kamber, 2006, p. 7). Data integration, the fourth step, may combine multiple data sources. Next, data relevant to the analysis task is retrieved from the database and then transformed or consolidated into forms appropriate for mining. These steps prepare the data for the actual mining step, in which intelligent methods are applied to extract data patterns (Han & Kamber, 2006, p. 7). The patterns are then evaluated using interestingness measures to identify those that represent knowledge. The final step is knowledge presentation to the user, mainly through visualization and other knowledge representation techniques.
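The data-preparation steps above can be sketched as a minimal, self-contained pipeline. The record fields, the noise rule, and the merge key below are hypothetical illustrations, not part of the source.

```python
# Minimal sketch of the knowledge-discovery preparation steps described above.
# The records, field names, and the "negative age is noise" rule are hypothetical.

raw_source_a = [{"id": 1, "age": 34}, {"id": 2, "age": -5}, {"id": 3, "age": 51}]
raw_source_b = [{"id": 1, "city": "Oslo"}, {"id": 3, "city": "Lima"}]

def clean(records):
    """Data cleaning: drop records with impossible (noisy) values."""
    return [r for r in records if r["age"] >= 0]

def integrate(a, b):
    """Data integration: merge multiple sources on a shared key."""
    cities = {r["id"]: r["city"] for r in b}
    return [dict(r, city=cities.get(r["id"])) for r in a]

def select(records):
    """Data selection: keep only attributes relevant to the analysis task."""
    return [{"id": r["id"], "age": r["age"]} for r in records]

def transform(records):
    """Data transformation: consolidate into a form suited for mining."""
    return [(r["id"], r["age"] // 10 * 10) for r in records]  # bin age by decade

prepared = transform(select(integrate(clean(raw_source_a), raw_source_b)))
print(prepared)  # [(1, 30), (3, 50)]
```

The output of such a pipeline is what the actual mining step would consume.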
Types of Data Mining
The two types of data mining are descriptive and predictive. While descriptive mining tasks characterize the general properties of data in the database, predictive mining tasks perform inference on the present data in order to make predictions (Han & Kamber, 2006, p. 21). These types are crucial in specifying the kinds of patterns to be found in data mining tasks.
Data Mining Operations
Data mining operations are undertaken by several data mining functionalities. The class/concept description functionality is useful in data characterization and discrimination. Data characterization is achieved by summarizing the data belonging to the class under study in general terms, and its output is presented in various forms, for instance, pie charts, bar charts, curves, and multidimensional tables (Han & Kamber, 2006, p. 21-22). Data discrimination involves comparing the target class with one or more contrasting classes. These two tasks can also be undertaken simultaneously. The second mining functionality is the mining of frequent patterns, associations, and correlations. Classification and prediction is the third functionality in data mining tasks. Classification involves finding a model that describes and distinguishes data classes or concepts; such models are used to predict the class of objects whose class labels are unknown. The cluster analysis functionality analyzes data objects without consulting a known class label and generates such labels itself, thus facilitating taxonomy formation (Han & Kamber, 2006, p. 25-26). The outlier analysis functionality detects data objects that do not match the common behavior or data model. Finally, the evolution analysis functionality describes and models regularities or trends for data objects that exhibit dynamic behavior. A data pattern is interesting if it is easily understood by humans, valid, potentially useful, and novel. Since data mining is an interdisciplinary field, it can generate a variety of data mining systems that vary in terms of the kinds of data and knowledge mined, the techniques utilized, and the applications adopted (Han & Kamber, 2006, p. 29).
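As a rough illustration of the class/concept description functionality, the sketch below summarizes a hypothetical class of customer records in general terms; the `segment`, `age`, and `spend` attributes are invented for the example, and the resulting summary could feed a bar chart or multidimensional table.

```python
# Sketch of data characterization: summarizing the class under study
# in general terms. The customer records below are hypothetical.
from statistics import mean

customers = [
    {"segment": "premium", "age": 42, "spend": 900},
    {"segment": "premium", "age": 38, "spend": 1100},
    {"segment": "budget",  "age": 25, "spend": 150},
    {"segment": "budget",  "age": 31, "spend": 250},
]

def characterize(records, target_class):
    """Summarize the target class; the output is a simple general-terms profile."""
    rows = [r for r in records if r["segment"] == target_class]
    return {"count": len(rows),
            "avg_age": mean(r["age"] for r in rows),
            "avg_spend": mean(r["spend"] for r in rows)}

print(characterize(customers, "premium"))
```

Data discrimination would run the same summary for a contrasting class (e.g. `"budget"`) and compare the profiles.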
Advantages and Disadvantages of Data Mining
Data mining can be useful in predicting future trends, such as by analyzing the purchasing habits of customers. It can enhance decision-making competencies, which can result in improved organizational management, and it can improve organizational efficiency by increasing revenues and reducing costs. Outlier analysis can be useful in detecting fraud. Its major disadvantage is that it can compromise user privacy or security by collecting critical information without authorization. Its implementation phase is costly, and its operational success may require comprehensive interdisciplinary knowledge and skills (Han & Kamber, 2006, p. 29). The amount of data involved can be overwhelming, with possible data inaccuracies.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that appear frequently in a data set. Examples of mined data patterns include itemsets, subsequences, and substructures. An itemset is a set of items that appear frequently together in a transaction data set, for instance, milk and bread (Han & Kamber, 2006, p. 227). A sequential pattern refers to a subsequence of items that occurs frequently in a history database. Different structural forms, such as subgraphs, subtrees, and sublattices, which can be combined with itemsets and subsequences, make up substructures. Identifying frequent itemsets, sequential patterns, and structured patterns plays a critical role in mining associations, correlations, and many other interesting relationships among data. Frequent itemset mining has proved instrumental in conducting market basket analysis, which analyzes customers’ buying habits by establishing associations between the different items placed in their “shopping baskets” (Han & Kamber, 2006, p. 228-229). Since mining a large database can generate a huge number of itemsets satisfying the minimum support threshold, the closed frequent itemset and the maximal frequent itemset are usually introduced to overcome this challenge. While the actual support counts of frequent itemsets can be recovered from the closed frequent itemsets, they cannot be asserted from the maximal frequent itemsets (Han & Kamber, 2006, p. 232). Frequent pattern mining is classified based on: the completeness of the patterns to be mined, the levels of abstraction involved in the rule set, the number of data dimensions involved in the rule, the types of values handled in the rule, and the kinds of rules and patterns to be mined (Han & Kamber, 2006, p. 232-234).
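Support counting over a toy transaction set can illustrate the itemset idea; the baskets and the minimum support threshold below are hypothetical, and the brute-force enumeration is only workable at this tiny scale.

```python
# Sketch of frequent-itemset support counting over a toy transaction set.
# The baskets and the minimum support threshold are hypothetical.
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
]
min_support = 2  # absolute support count threshold

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_support:
            frequent[combo] = s

print(frequent[("bread", "milk")])  # 3: {milk, bread} appears in three baskets
```

With a real database, enumerating every candidate like this is infeasible, which motivates the pruning methods discussed below.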
Frequent Pattern, Association, and Correlation Mining Techniques
Apriori is among the most efficient and scalable frequent itemset mining methods. The Apriori algorithm finds frequent itemsets using candidate generation based on prior knowledge of frequent itemset properties (Han & Kamber, 2006, p. 234-235). It employs an iterative approach referred to as a level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets. The resulting set is narrowed down progressively using a similar approach until no more frequent k-itemsets can be found. Since the approach requires scanning the entire database, an important property called the “Apriori property” is often used to reduce the search space, thus improving search efficiency. The Apriori property establishes the condition that “all nonempty subsets of a frequent itemset must also be frequent.” The identification of frequent itemsets is often followed by the generation of association rules from them, where a strong rule satisfies both minimum support and minimum confidence. Frequent Pattern Growth (FP-growth) is another, more efficient method for finding frequent itemsets, but unlike Apriori, it achieves this without candidate generation (Han & Kamber, 2006, p. 242-243). Multilevel association rules are normally mined efficiently using concept hierarchies under a support-confidence framework. Mining multidimensional association rules is often accomplished by analyzing relational databases and data warehouses. Association mining is then followed up by correlation analysis of the strong rules between frequent itemsets, since not all strong rules are interesting or useful to the user and some are misleading. Since a data mining process is capable of uncovering thousands of rules from a particular data set, most of which might be unrelated to the user’s interests, the constraint-based association mining approach is usually used to confine the search space to areas specified by the user (Han & Kamber, 2006, p. 265-266).
Constraints are applied by specifying the type of data to be mined, task-relevant data, the desired data dimensions or levels of concept hierarchies to be used, interestingness thresholds, and the forms of rules to be mined. Users should note that association rules do not necessarily signify causality and thus should not be used directly for prediction without further analysis. However, they establish a good starting point for advanced exploration, which makes them instrumental in understanding data.
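The level-wise Apriori search described above can be sketched as follows; the transactions and support threshold are hypothetical, and real implementations avoid rescanning the database at every level.

```python
# Minimal sketch of the level-wise Apriori search: frequent k-itemsets
# generate (k+1)-candidates, which are pruned with the Apriori property
# that every nonempty subset of a frequent itemset must itself be frequent.
# Transactions and the threshold are hypothetical.
from itertools import combinations

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]
min_support = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets, found by a first scan.
level = {frozenset([i]) for t in transactions for i in t}
level = {s for s in level if support(s) >= min_support}
frequent = set(level)

while level:
    # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    # Apriori pruning: discard candidates having an infrequent k-subset.
    candidates = {c for c in candidates
                  if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
    level = {c for c in candidates if support(c) >= min_support}
    frequent |= level

print(sorted(sorted(s) for s in frequent))
# [['bread'], ['bread', 'butter'], ['bread', 'milk'], ['butter'], ['eggs'], ['milk']]
```

Association rules would then be read off each frequent itemset, keeping only those meeting the minimum confidence.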
Classification and Prediction
Classification and prediction, as forms of data analysis, can be useful in extracting models that describe vital data classes or predict future data trends. While classification is critical in predicting categorical labels, prediction normally models continuous-valued functions (Han & Kamber, 2006, p. 285). The accuracy, efficiency, and scalability of the classification and/or prediction process are often improved through data cleaning, relevance analysis, and data transformation and reduction. Classification and prediction methods should be compared and evaluated according to their accuracy, speed, robustness, scalability, and interpretability.
Classification by decision tree induction, the first method, involves the learning of decision trees from class-labeled training tuples. In the decision tree flowchart, each internal (nonleaf) node signifies a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label (Han & Kamber, 2006, p. 291-292). The uppermost node in the tree structure is the root node. The most common attribute selection measures for choosing the splitting criterion include information gain, gain ratio, and the Gini index. Anomalies in the training data arising from noise or outliers are addressed using tree pruning, a measure that removes the least reliable branches. The Bayesian classification method is used to predict class membership probabilities (Han & Kamber, 2006, p. 310). It has two classifiers that exhibit high accuracy and speed when applied to large databases: the naïve Bayesian classifier and Bayesian belief networks. In rule-based classification, the learned model is represented as a set of IF-THEN rules, each an expression of the form IF condition THEN conclusion (Han & Kamber, 2006, p. 318-319). Classification by backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output units, where every connection has a weight associated with it. The network learns by adjusting the weights so that it can predict the correct class labels of the input tuples. Support vector machines constitute a method that can classify both linear and nonlinear data. The method transforms the original data into a higher dimension using nonlinear mapping and then searches for the linear optimal separating hyperplane within the new dimension, thus separating the tuples of one class from another. The associative classification method generates association rules, which are then analyzed for use in classification. It searches for strong associations between frequent patterns and class labels.
Each association rule must satisfy particular criteria regarding its accuracy (confidence) and the proportion of the data set it actually represents. The above-discussed classification methods, that is, decision tree induction, Bayesian classification, rule-based classification, classification by backpropagation, support vector machines, and associative classification, are referred to as “eager learners”: when provided with a set of training tuples, they construct a generalization (classification) model before receiving new tuples, for instance test tuples, to classify (Han & Kamber, 2006, p. 347). Contrary to eager learners, lazy learner methods perform generalization only after seeing the test tuple, classifying it according to its similarity to the stored training tuples (Han & Kamber, 2006, p. 347-348). Examples of lazy learner methods include the k-nearest-neighbor classifier and case-based reasoning classifiers. Other, less commonly commercialized classification methods include genetic algorithms, the rough set approach, and fuzzy set approaches.
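A lazy learner such as the k-nearest-neighbor classifier can be sketched in a few lines; the training points, labels, and the choice of k = 3 are hypothetical.

```python
# Sketch of the k-nearest-neighbor "lazy learner": no model is built up
# front; a test tuple is classified by majority vote among the k stored
# training tuples closest to it. Points and labels are hypothetical.
from collections import Counter
from math import dist

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

def knn_classify(point, k=3):
    """Majority class among the k nearest stored training tuples."""
    nearest = sorted(training, key=lambda xy: dist(point, xy[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((4.0, 3.5)))  # B: the three nearest neighbors are class B
```

Note that all work happens at classification time, which is exactly what distinguishes lazy from eager learners.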
Numeric prediction is normally used to predict a continuous value rather than a categorical label. It is widely performed through regression, a statistical methodology. Regression analysis is used to model the relationship between one or more independent (predictor) variables and a continuous-valued dependent (response) variable (Han & Kamber, 2006, p. 354). In data mining, the predictor variables are the attributes of interest describing the tuple. Regression can be linear or nonlinear. In linear regression, straight-line regression involves a response variable and a single predictor variable, while multiple linear regression, an extension of straight-line regression, employs several predictor variables. Nonlinear regression analysis models data that do not exhibit linear dependence.
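Straight-line regression has a closed-form least-squares solution, sketched below on hypothetical data that roughly follow y = 2x.

```python
# Sketch of straight-line regression: fit y = w0 + w1*x by least squares
# from a single predictor x and a response y. The data are hypothetical.
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.1, 8.0, 9.9]  # roughly y = 2x

x_bar, y_bar = mean(x), mean(y)
# Closed-form least-squares estimates for the slope and intercept.
w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
w0 = y_bar - w1 * x_bar

def predict(xi):
    """Predict the continuous response for a new predictor value."""
    return w0 + w1 * xi

print(round(w1, 2), round(w0, 2))  # slope ~1.97, intercept ~0.09
```

Multiple linear regression generalizes the same idea to several predictors, where the closed form involves a matrix inverse rather than these two scalar sums.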
The clustering process in data mining involves partitioning data sets into groups based on data similarity and then assigning labels to the resulting, relatively small number of groups. Clustering is instrumental in identifying dense and sparse regions in object space, thus enhancing the discovery of overall distribution patterns and interesting correlations among data attributes (Han & Kamber, 2006, p. 383-384). Apart from discovering similarities, the process can be used for outlier detection, where outliers, values far off from any cluster, can be more interesting than those exhibiting commonalities.
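Cluster-based outlier detection can be illustrated as follows; the cluster centers are assumed to be already known, and the points and distance threshold are hypothetical.

```python
# Sketch of outlier detection via clustering: an object lying far from
# every cluster center is flagged. The points, the centers, and the
# distance threshold are hypothetical.
from math import dist

centers = [(1.0, 1.0), (5.0, 5.0)]           # assume clusters already found
points = [(1.1, 0.9), (4.8, 5.2), (9.0, 0.5)]
threshold = 2.0                               # max allowed distance to a center

outliers = [p for p in points if min(dist(p, c) for c in centers) > threshold]
print(outliers)  # [(9.0, 0.5)]: far from both cluster centers
```

In a fraud-detection setting, such far-off objects would be the transactions worth inspecting first.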
Major clustering methods have been classified into several categories, which include partitioning, hierarchical, density-based, grid-based, and model-based methods (Han & Kamber, 2006, p. 398-400). These methods are used in clustering low-dimensional data. In partitioning methods, a partitioning algorithm organizes a data set of objects into a number of partitions, where each partition represents a cluster. It does this using a distance-based dissimilarity function, ensuring that objects within a cluster are similar to one another and dissimilar to objects in other clusters in terms of the data set attributes (Han & Kamber, 2006, p. 401). Its major approaches include the k-means and k-medoids methods. The k-means method is a centroid-based technique, in which cluster similarity is measured in relation to the mean value of the objects within a cluster (Han & Kamber, 2006, p. 402), perceived as the cluster’s centroid or center of gravity. The k-medoids method is a representative object-based technique, as it selects and uses one representative object per cluster (Han & Kamber, 2006, p. 404). Partitioning is based on minimizing the dissimilarity between each object and its corresponding reference point. The hierarchical method performs clustering by grouping data objects into a tree of clusters. The method can be classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is established in a bottom-up (merging) or top-down (splitting) manner (Han & Kamber, 2006, p. 408). Agglomerative hierarchical clustering takes the bottom-up (merging) form, while divisive hierarchical clustering takes the top-down (splitting) form. Neither agglomerative nor divisive hierarchical methods can backtrack to correct a merge or split decision that has turned out to be a poor choice. Density-based methods are employed to discover clusters with arbitrary shapes.
They perceive clusters as dense regions of objects in the data space, separated by low-density regions that depict noise. Grid-based methods make use of a multi-resolution grid data structure. They achieve this by quantizing the object space into a finite number of cells that form a grid structure on which all clustering operations are performed (Han & Kamber, 2006, p. 424). Common examples of the grid-based approach include the Statistical Information Grid (STING), which explores statistical information contained in grid cells, and WaveCluster, which uses the wavelet transform method to cluster objects. Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model, as they are based on the idea that data are generated by a mixture of underlying probability distributions (Han & Kamber, 2006, p. 429). Their major approaches include Expectation-Maximization (EM), conceptual clustering, and the neural network approach.
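The k-means centroid-based technique described earlier can be sketched as follows; the points, the choice of k = 2, and the naive seeding are hypothetical, and production implementations also handle empty clusters and check for convergence.

```python
# Minimal sketch of the k-means centroid-based partitioning method:
# assign each object to its nearest mean, then recompute the means,
# iterating until assignments stabilize. Points and k are hypothetical.
from math import dist
from statistics import mean

points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
          (8.0, 8.0), (8.5, 7.5), (7.8, 8.2)]
centroids = [points[0], points[3]]  # naive initialization: k = 2 seeds

for _ in range(10):  # a few iterations suffice for this toy data
    # Assignment step: each point joins the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    # Update step: each centroid moves to the mean of its cluster.
    centroids = [(mean(p[0] for p in c), mean(p[1] for p in c)) for c in clusters]

print(centroids)
```

A k-medoids variant would instead pick an actual member object of each cluster as its reference point, which makes it less sensitive to outliers.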
The above-discussed methods encounter challenges in clustering high-dimensional data, especially data with more than 10 dimensions. The reason is that as dimensionality increases, usually only a small number of the dimensions remain relevant to particular clusters (Han & Kamber, 2006, p. 434-435). Data in the irrelevant dimensions can produce a lot of noise, which can mask the actual clusters to be discovered. Approaches that can effectively cluster high-dimensional data include dimension-growth subspace clustering (CLIQUE), dimension-reduction projected clustering (PROCLUS), and frequent pattern-based clustering (pCluster). While users have little guidance or interaction in the above clustering methods, they can increase their control over the clustering process by performing constraint-based cluster analysis, where they find clusters satisfying their specific preferences or constraints (Han & Kamber, 2006, p. 444-445). Constraints often used include specifying individual objects, clustering parameters, and distance or similarity functions.
Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). San Francisco, CA: Elsevier.