In the era of big data, data mining is among the most critical tasks. Big data mining is the process of discovering valuable, potentially useful information and knowledge hidden in massive, incomplete, noisy, fuzzy, and random databases; it is also a decision-support process. It draws mainly on AI, machine learning, pattern recognition, statistics, and related fields. Through highly automated analysis of big data, it performs inductive reasoning and uncovers latent patterns, helping enterprises, merchants, and users adjust market strategies, reduce risk, face the market rationally, and make sound decisions. Data mining already solves many problems in a wide range of fields, especially commercial ones such as banking, telecommunications, and e-commerce, including marketing-strategy formulation, background analysis, and business crisis management. Commonly used methods for big data mining include classification, regression analysis, clustering, association rules, neural network methods, and Web data mining; each mines the data from a different perspective.
The importance of data preparation: without high-quality data there can be no high-quality mining results, and data preparation often takes more than 60% of the overall effort.
(1) Classification
Classification finds the common characteristics of a group of data objects in a database and divides them into different categories according to a classification model. The purpose is to map the data items in the database onto given categories through the model. It can be applied to category classification and trend prediction. For example, a Taobao store can divide users' purchases over a period of time into different categories and recommend related products to each group accordingly, thereby increasing the store's sales.
Classification method: Decision tree – the most popular classification method
Features:
a. Each of its splits is based on the most significant feature;
b. The analyzed data sample is called the root of the tree. The algorithm selects the most significant feature among all features and uses it to divide the sample into several subsets;
c. This process repeats until all instances under a branch are “pure”, that is, every instance in the subset belongs to the same category; such a branch can then be fixed as a leaf node. Once all subsets are “pure”, the tree stops growing.
Pruning of decision tree:
a. If the decision tree is built too deep, it easily overfits (that is, it memorizes the training data and fails to generalize to new samples);
b. Pruning usually adopts a top-down approach. Each time, find the branch in the training data that contributes the least to the prediction accuracy and prune it;
c. In short, let the decision tree grow freely first, then gradually prune it back; how far to prune overall is decided by repeated trials against performance on the test set.
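The splitting step described above can be sketched in a few lines: choose the feature whose split yields the lowest weighted Gini impurity. This is a minimal illustration with hypothetical binary features and labels, not a full tree-building implementation.

```python
# Minimal sketch of one decision-tree split: pick the feature whose
# split yields the lowest weighted Gini impurity. Data are hypothetical.

def gini(labels):
    """Gini impurity of a list of class labels (0 means 'pure')."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lbl in labels:
        counts[lbl] = counts.get(lbl, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Return the index of the (binary) feature giving the purest split."""
    best_feature, best_impurity = None, float("inf")
    for f in range(len(rows[0])):
        left = [lbl for row, lbl in zip(rows, labels) if row[f] == 0]
        right = [lbl for row, lbl in zip(rows, labels) if row[f] == 1]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_feature, best_impurity = f, weighted
    return best_feature

# Toy example: feature 1 perfectly separates the two classes.
rows = [[0, 0], [1, 0], [0, 1], [1, 1]]
labels = ["no", "no", "yes", "yes"]
print(best_split(rows, labels))  # prints 1, the feature with the purest split
```

Growing a full tree just repeats this choice on each resulting subset until every subset is pure, as the text describes.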
(2) Regression analysis
Regression analysis reflects the characteristics of attribute values in a database and uses functions to express the mapping relationships among the data, thereby discovering dependencies between attribute values. It can be applied to predicting data sequences and studying correlations. In marketing, regression analysis is used in many ways; for example, regressing this quarter's sales lets us forecast next quarter's sales trend and make targeted marketing adjustments.
Classification method: Logistic regression – a commonly used classification method, very mature and widely applied
Features:
a. Regression can not only be used for classification, but also for discovering causal relationships between variables;
b. The most important regression models are multiple linear regression and logistic regression;
c. Sometimes logistic regression is not regarded as a typical data mining algorithm.
Logistic regression steps:
a. Train first; the purpose is to find the regression coefficients with the best classification effect;
b. Then use the set of regression coefficients obtained from training to score the input data and determine the category each input belongs to.
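The two steps above can be sketched as follows: a tiny gradient-descent fit of the coefficients, then classification of new inputs with those coefficients. The one-dimensional dataset, learning rate, and epoch count are all hypothetical choices for illustration.

```python
import math

# Sketch of the two logistic-regression steps: (a) train regression
# coefficients by gradient descent, (b) classify new inputs with them.
# The tiny 1-D dataset and hyperparameters are hypothetical.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.5, epochs=2000):
    """Step (a): fit intercept w0 and slope w1 by stochastic gradient ascent."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w0 + w1 * x)
            w0 += lr * (y - p)          # gradient of the log-likelihood
            w1 += lr * (y - p) * x
    return w0, w1

def classify(x, w0, w1):
    """Step (b): apply the trained coefficients to a new input."""
    return 1 if sigmoid(w0 + w1 * x) >= 0.5 else 0

# Hypothetical data: class 1 for large x, class 0 for small x.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w0, w1 = train(xs, ys)
print(classify(0.8, w0, w1), classify(3.8, w0, w1))  # prints: 0 1
```

In practice a library implementation would be used; this sketch only makes the train-then-classify split concrete.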
Test of logistic regression model:
Since we hope that the relationship between the input variables in the model and the target variable is strong enough, two diagnostics need to be done for this:
a. Test of the overall model – R², that is, what percentage of the target variable's variability can be explained by all input variables together. The larger R² is, the better the model fits; if R² is too small, the model cannot be used for prediction.
b. The significance (p-value) of each regression coefficient. If an input variable's p-value with respect to the target variable is less than 0.05, that variable can be considered to have a significant effect; insignificant input variables can be removed from the model.
Comparison of decision trees and logistic regression:
1. Because the decision tree uses a segmentation approach, it can drill into the details of the data, but it thereby loses its grasp of the whole: once a branch is formed, its relationship with other branches or nodes is severed, and subsequent mining can only proceed locally;
2. Logistic regression always fits the data as a whole, so it has a better grasp of global patterns;
3. Decision trees are easier to use and require less data preprocessing;
4. The logistic regression model cannot handle missing values and is sensitive to outliers. Therefore, missing values should be dealt with before regression, and outliers should be deleted as much as possible.
Classification and regression analysis are called supervised learning:
1. The data are labeled;
2. By imitating existing, correctly classified data, new data can be classified more accurately – much like teaching a child.
(3) Clustering
Clustering is similar to classification, but its purpose differs: it divides a set of data into several groups based on their similarities and differences, so that data within the same group are highly similar while data in different groups have very low similarity and cross-group correlation.
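One of the simplest clustering algorithms, k-means, captures the idea directly: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments settle. This is a minimal pure-Python sketch on hypothetical 2-D points, not a production implementation.

```python
import math
import random

# Minimal k-means sketch: assign points to the nearest centroid, then
# move each centroid to the mean of its points, until assignments settle.
# The points and the choice k=2 are hypothetical.

def kmeans(points, k, iters=100):
    random.seed(0)                        # fixed seed for reproducibility
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:    # assignments have settled
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated hypothetical groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # prints [3, 3]: one cluster per group
```

The within-group similarity and between-group dissimilarity the text describes show up here as small distances to a point's own centroid and large distances to the other one.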
(4) Association rules
Association rules are associations or relationships hidden between data items; that is, the occurrence of one data item can be inferred from the occurrence of others. The mining process mainly has two stages: the first stage finds all high-frequency itemsets in the massive raw data; the second stage generates association rules from those high-frequency itemsets. Association rule mining has been widely used in the finance industry to predict customer needs; for example, banks improve their marketing by displaying bundled information that customers are likely to be interested in on their ATM screens for users to see and act on.
Clustering and association rules are called unsupervised learning:
1. The data are unlabeled;
2. Clustering: Divide customer groups based on customer characteristics. From this, we can adopt differentiated promotion methods for different customer groups;
3. Association rules: Analysis found that a large proportion of customers who buy bread also buy milk, so we can put milk and bread in the same place.
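The bread-and-milk rule above can be derived with the two stages described earlier: find itemsets whose support clears a threshold, then keep rules whose confidence clears another. The baskets and thresholds below are hypothetical, and for brevity the sketch only enumerates itemsets up to size two.

```python
from itertools import combinations

# Sketch of the two association-rule stages on hypothetical baskets:
# stage 1 finds frequent itemsets by support; stage 2 derives rules
# whose confidence clears a threshold (e.g. bread -> milk).

def support(baskets, itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def frequent_itemsets(baskets, min_support):
    items = set().union(*baskets)
    frequent = []
    for size in (1, 2):                  # pairs suffice for this sketch
        for combo in combinations(sorted(items), size):
            s = frozenset(combo)
            if support(baskets, s) >= min_support:
                frequent.append(s)
    return frequent

def rules(baskets, min_support, min_conf):
    out = []
    for itemset in frequent_itemsets(baskets, min_support):
        if len(itemset) < 2:
            continue
        for lhs in itemset:
            conf = support(baskets, itemset) / support(baskets, frozenset([lhs]))
            if conf >= min_conf:
                out.append((lhs, itemset - {lhs}, conf))
    return out

baskets = [
    {"bread", "milk"}, {"bread", "milk", "eggs"},
    {"bread", "milk"}, {"bread"}, {"milk"},
]
for lhs, rhs, conf in rules(baskets, min_support=0.5, min_conf=0.7):
    print(lhs, "->", set(rhs), f"confidence={conf:.2f}")
```

Here {bread, milk} appears in 3 of 5 baskets (support 0.6), and 0.6 / 0.8 = 0.75 of bread buyers also buy milk, which is exactly the kind of finding that motivates shelving milk next to bread.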
(5) Neural network method
As an advanced artificial intelligence technique, the neural network is well suited to nonlinear problems and to problems characterized by fuzzy, incomplete, and imprecise knowledge or data, thanks to its self-organizing processing, distributed storage, and high fault tolerance; these traits make it a natural fit for data mining. Typical neural network models fall into three categories: the first is the feed-forward network used for classification, prediction, and pattern recognition, represented mainly by function networks and perceptrons; the second is the feedback network used for associative memory and optimization computation, represented by Hopfield's discrete and continuous models; the third is the self-organizing map used for clustering, represented by the ART model. Although neural networks come in many models and algorithms, there are no unified rules for choosing among them in a specific data mining domain, and it is difficult for people to interpret the network's learning and decision-making process.
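The simplest member of the feed-forward family mentioned above, the perceptron, can be shown in a few lines. This is a hedged sketch on a hypothetical toy task (logical AND); the learning rate and epoch count are arbitrary illustrative choices.

```python
# Sketch of the simplest feed-forward model the text mentions, a
# perceptron, trained on a linearly separable toy problem (logical AND).
# The data, learning rate, and epoch count are hypothetical.

def perceptron_train(samples, epochs=20, lr=0.1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            w[0] += lr * err * x1      # classic perceptron update rule
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def perceptron_predict(x, w, b):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Logical AND: output 1 only when both inputs are 1.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(samples)
print([perceptron_predict(x, w, b) for x, _ in samples])  # prints [0, 0, 0, 1]
```

The opacity the text mentions already appears here in miniature: the learned weights fit the task, but they do not by themselves explain the decision in human terms.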
(6) Web data mining
Web data mining is a comprehensive technology: it discovers implicit patterns P from Web document structures and usage collections C. If C is regarded as the input and P as the output, the Web mining process can be viewed as a mapping from input to output.
More and more Web data now appear in the form of data streams, so mining Web data streams is of great significance. Commonly used Web data mining algorithms include the PageRank, HITS, and LOGSOM algorithms; all three treat users generically, without distinguishing individuals. Web data mining still faces open problems, including user classification, the timeliness of website content, users' dwell time on a page, and the number of links into and out of a page. With Web technology developing rapidly, these problems remain worth studying and solving.
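Of the three algorithms named, PageRank is the most compact to illustrate: each page's rank is repeatedly recomputed from the ranks of the pages linking to it. The four-page link graph below is hypothetical; the damping factor 0.85 is the conventional choice from the PageRank literature.

```python
# Minimal PageRank sketch by power iteration on a hypothetical
# four-page link graph; damping factor 0.85 is the conventional choice.

def pagerank(links, damping=0.85, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start with uniform rank
    for _ in range(iters):
        new_rank = {}
        for p in pages:
            # Sum the rank flowing into p from every page that links to it,
            # each contributor splitting its rank among its outgoing links.
            incoming = sum(
                rank[q] / len(links[q]) for q in pages if p in links[q]
            )
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical graph: most pages link to A, so A should rank highest.
links = {
    "A": ["B"],
    "B": ["A"],
    "C": ["A", "B"],
    "D": ["A"],
}
rank = pagerank(links)
print(max(rank, key=rank.get))  # prints A, the most heavily linked page
```

Note that the sketch reflects the point made above: rank depends only on link structure, not on which individual user follows which link.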