CN111046977A - Data preprocessing method based on EM algorithm and KNN algorithm - Google Patents

Data preprocessing method based on EM algorithm and KNN algorithm Download PDF

Info

Publication number
CN111046977A
Authority
CN
China
Prior art keywords
algorithm
data
class
incomplete
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911392045.7A
Other languages
Chinese (zh)
Inventor
唐雪飞
黄永鑫
蒲高飞
胡茂秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Comsys Information Technology Co ltd
Original Assignee
Chengdu Comsys Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Comsys Information Technology Co ltd
Priority: CN201911392045.7A, 2019-12-30
Publication: CN111046977A, 2020-04-21
Legal status: Pending

Classifications

    • G: PHYSICS; G06: COMPUTING, CALCULATING OR COUNTING; G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/10: Pattern recognition; Pre-processing; Data cleansing
    • G06F 16/215: Information retrieval of structured data; Design, administration or maintenance of databases; Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 18/23: Pattern recognition; Analysing; Clustering techniques
    • G06F 18/24147: Pattern recognition; Classification techniques based on distances to training or reference patterns; Distances to closest patterns, e.g. nearest neighbour classification


Abstract

The invention discloses a data preprocessing method based on the EM algorithm and the KNN algorithm, comprising the following steps: S1, dividing the original data set into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as the training sample of the EM algorithm, and clustering with the EM algorithm; S2, filling missing values on the clustering result using the KNN algorithm. Because the EM algorithm performs cluster analysis on the original data set before the KNN missing-value filling, and the KNN filling is then carried out on the resulting clusters, the method is simple to operate and achieves high filling accuracy.

Description

Data preprocessing method based on EM algorithm and KNN algorithm
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a data preprocessing method based on an EM algorithm and a KNN algorithm.
Background
Financial statement analysis processes, analyzes, compares, evaluates and interprets the data provided in an enterprise's financial statements. If bookkeeping and statement preparation embody the recording function of accounting, financial statement analysis carries out its interpretation and evaluation functions. Its purpose is to judge the financial condition of the enterprise and to diagnose problems in its operation and management. Through analysis, one can judge whether the enterprise's financial condition is good, whether its operation and management are sound, and whether its business prospects are bright; at the same time, analysis can locate the weak points in operation and management and suggest how to resolve them. The main methods of financial statement analysis are trend analysis and ratio analysis. Trend analysis compares the direction and magnitude of the increase or decrease of each item across the financial statements of several successive periods, so as to reveal changes and trends in finance and operations.
Data mining requires large amounts of data. In practice, because source databases differ in their initial definitions and structures, data drawn from them contain large amounts of incomplete, noisy, heterogeneous and erroneous records, whereas most data mining algorithms assume clean and complete data sets. Data from real systems therefore cannot be applied directly to analysis: it increases the difficulty of data mining, and unprocessed data can seriously distort the results of knowledge discovery. It follows that data preprocessing is critical to data mining. Statistically, preprocessing accounts for about 60% of the whole data mining process, while the subsequent learning and training account for only about 10% of the work. The quality of preprocessing directly determines the quality of the data and ultimately governs the results of the downstream mining. Effective preprocessing improves the quality of the data as a whole; it not only saves space and time costs, but also helps obtain mining results good enough to guide decisions and assess value.
Various data quality problems arise in data mining, among which data incompleteness is especially prominent. Missing data are common: in the UCI repository widely used in machine learning, data sets containing missing values account for more than 40%. Existing approaches to the data incompleteness problem fall roughly into three types: deletion methods, filling methods, and no-processing methods that retain the original information. Deletion is of very limited applicability: discarding incomplete records loses the original information of the data set, easily wastes useful information, and to some extent harms the accuracy and objectivity of the mining results; it is mainly suitable for data sets in which values are missing completely at random and the proportion of missing data is small. Filling is a comparatively scientific and effective approach: it makes full use of the information in the data so that the estimated filling value is as close as possible to the true value. In contrast to the first two approaches, which alter the original data set, no-processing methods keep the data set in its original state; they use machine learning techniques to weaken the influence of the missing data and learn directly from the incomplete data set, with methods including Bayesian belief networks, rough set methods and artificial neural networks.
Clustering is a typical unsupervised learning method: without the guidance of prior knowledge, it partitions similar example objects into categories so that objects within the same category are as similar as possible while different categories differ as much as possible. Using the KNN algorithm for missing value filling yields K-nearest-neighbour filling: for each incomplete object, the K complete objects closest to it are found in the complete data set, and the information of those K neighbours is used to fill the missing values. Compared with other missing value filling algorithms, K-nearest-neighbour filling is simple to operate and highly accurate, but the value of K must be set manually and differs across training data sets, which makes the algorithm troublesome to tune.
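For orientation, the plain K-nearest-neighbour filling described above can be sketched in a few lines. The snippet below is a minimal illustration using scikit-learn's KNNImputer as a stand-in for this generic baseline; the toy data and the hand-picked K are invented for the example and are not part of the invention:

```python
# Baseline KNN filling: the missing cell of the incomplete record is
# replaced using the K complete records closest to it. K is set by hand,
# which is exactly the inconvenience noted above.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],   # incomplete record
    [1.1, 1.9, 3.2],
    [0.9, 2.1, 3.0],
    [8.0, 7.5, 9.1],
])

imputer = KNNImputer(n_neighbors=2)        # K must be chosen manually
X_filled = imputer.fit_transform(X)        # nan replaced by the mean of the 2 nearest rows
print(X_filled)
```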
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a data preprocessing method based on the EM algorithm and the KNN algorithm that is simple to operate and fills missing values accurately: before the missing values are filled with KNN, the EM algorithm performs cluster analysis on the original data set, and the KNN filling is then carried out on the resulting clusters.
The purpose of the invention is realized by the following technical scheme: the data preprocessing method based on the EM algorithm and the KNN algorithm comprises the following steps:
S1, collecting financial system data, dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as a training sample of the EM (expectation-maximization) algorithm, and clustering by using the EM algorithm;
and S2, filling missing values on the clustering result by using a KNN algorithm.
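A minimal sketch of the splitting operation that opens step S1 is shown below, assuming a hypothetical pandas data frame of financial-system records; the column names are invented for illustration:

```python
# Step S1, first operation: split the collected data into a complete data
# subset and an incomplete data subset according to whether any attribute
# value is missing. The financial-system columns here are hypothetical.
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "revenue": [120.0, 95.5, np.nan, 210.3],
    "profit":  [12.1, np.nan, 8.4, 30.2],
    "assets":  [300.0, 250.0, 180.0, 410.0],
})

has_missing = data.isna().any(axis=1)
complete_subset = data[~has_missing]    # training samples for the EM clustering (S1)
incomplete_subset = data[has_missing]   # records to be filled by KNN (S2)
```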
Further, the step S1 includes the following sub-steps:
S11, recording the complete data subset as (x_1, x_2, ..., x_n), where the samples x_1, x_2, ..., x_n are independent of each other and each sample corresponds to an unknown class z_i; the purpose of the clustering algorithm is to determine the class to which each sample belongs so that the joint distribution p(x_i; z_i) is maximized; the likelihood function of p(x_i; z_i) is:

$$L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)$$

Taking the logarithm of the above formula gives:

$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)$$

where n is the number of sample data, θ is the model parameter of the EM algorithm, and p(x_i, z_i; θ) is the joint distribution of sample x_i and class z_i when the model parameter is θ;
S12, defining the class variable z_i as obeying some distribution Q_i, where the distribution function Q_i(z_i) satisfies:

$$\sum_{z_i} Q_i(z_i) = 1, \qquad Q_i(z_i) \geq 0$$

Transforming the expression for l(θ) in step S11 with Jensen's inequality gives:

$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i)\,\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \geq \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
Since

$$\sum_{z_i} Q_i(z_i)\,\frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$

is the expectation of p(x_i, z_i; θ)/Q_i(z_i) under the distribution Q_i, the inequality above follows from Jensen's inequality: for the concave logarithm function, f(E[X]) ≥ E[f(X)], i.e. the logarithm of this expectation is greater than or equal to the expectation of its logarithm;
By Jensen's inequality, the inequality becomes an equality if and only if X is a constant, so equality requires:

$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C$$

where C is a constant. Summing over the different values of z_i gives:

$$\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i)$$
Since

$$\sum_{z_i} Q_i(z_i) = 1$$

it follows that:

$$\sum_{z_i} p(x_i, z_i; \theta) = C$$

Thus the formula for computing Q_i(z_i) is:

$$Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)$$

where p(z_i | x_i; θ) is the conditional probability that sample x_i belongs to class z_i when the model parameter is θ;
S13, taking the Q_i(z_i) obtained in step S12 as the distribution of the classes, then maximizing the likelihood function to obtain the final clustering result: given an initial value of θ, the loop repeats steps E and M until convergence:

E step: for each x_i, calculate Q_i(z_i) = p(z_i | x_i; θ);

M step: update θ:

$$\theta = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
further, the step S2 includes the following sub-steps:
S21, sorting the records of the incomplete data subset D_i in ascending order of the number of missing attribute values;
s22, calculating the distance from each record r in the sorted incomplete data subset to each cluster center c formed by EM clustering, and sorting from small to large;
s23, classifying each incomplete record into the class of the cluster center c with the minimum distance to the record;
S24, calculating the distance dis between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes in the incomplete record, performing missing value filling using the following formula:

$$v_n(P_r) = \frac{\sum_{i=1}^{n} \alpha_i\, \beta_i(P_r)}{\sum_{i=1}^{n} \alpha_i}$$

where v_n is the incomplete record, β_i is a complete record of the class where cluster center c is located, P_r is a continuous attribute of the incomplete record v_n that contains a missing value, n is the total number of complete records of the class where cluster center c is located, and α_i is the similarity of the two records, obtained from the calculated distance dis;
and S25, filling each discrete attribute in the incomplete record with the mode of the corresponding attribute over the other complete records in the class to which the record belongs.
The invention has the beneficial effects that: before filling missing values with KNN, the method performs cluster analysis on the original data set with the EM algorithm and then carries out the KNN filling on the resulting clusters, so it is simple to operate and achieves high filling accuracy.
Drawings
Fig. 1 is a flow chart of a data preprocessing method based on the EM algorithm and the KNN algorithm.
Detailed Description
The method provided by the invention belongs to a filling method, and the background technology related to the method is explained as follows:
1. Expectation-Maximization (EM) algorithm
The Expectation-Maximization (EM) algorithm finds maximum likelihood estimates or maximum a posteriori estimates of the parameters of a probabilistic model, where the model depends on unobservable latent variables. The algorithm alternates between two steps: the first, the expectation (E) step, uses the current parameter estimates to compute the expected value of the latent variables; the second, the maximization (M) step, maximizes the expected log-likelihood found in the E step to obtain new parameter values. The parameter estimates found in the M step are then used in the next E step, and the two steps alternate until convergence. The most direct application of the EM algorithm is parameter estimation, but if the potential classes are regarded as latent variables and the samples as observations, the clustering problem can be converted into a parameter estimation problem; this is the principle of clustering with the EM algorithm. The main flow of the EM algorithm is as follows:
(1) Randomly initialize the model parameter θ to an initial value θ_0.
(2) Start iteration of the EM algorithm:
(a) First, compute the conditional-probability expectation of the known joint distribution P(x^(i), z^(i); θ), denoted L(θ, θ_j):

$$Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}; \theta_j)$$

$$L(\theta, \theta_j) = \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i(z^{(i)}) \log P(x^{(i)}, z^{(i)}; \theta)$$
(b) Second, maximize L(θ, θ_j) to obtain θ_{j+1}:

$$\theta_{j+1} = \arg\max_{\theta} L(\theta, \theta_j)$$
(c) If θ_{j+1} has converged, end the algorithm; otherwise, continue iterating.
(3) Output the model parameter θ.
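For illustration, the flow above corresponds to Gaussian-mixture clustering as implemented by scikit-learn's GaussianMixture, which runs the same E/M alternation internally. The sketch below is an example under stated assumptions rather than the invention's reference implementation; in particular, the number of clusters is not fixed by the invention and is chosen arbitrarily here:

```python
# EM clustering of the complete data subset via a Gaussian mixture model.
# fit() alternates E steps (posterior responsibilities) and M steps
# (parameter updates) until the change in log-likelihood falls below tol.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
complete = np.vstack([                      # toy stand-in for the complete data subset
    rng.normal(0.0, 1.0, size=(50, 3)),
    rng.normal(5.0, 1.0, size=(50, 3)),
])

gm = GaussianMixture(n_components=2, max_iter=100, tol=1e-4, random_state=0)
labels = gm.fit_predict(complete)           # cluster label of each complete record
centers = gm.means_                         # cluster centers c used by the filling step
```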
The EM algorithm is guaranteed to converge to a stationary point, but not to a global maximum, so it is a locally optimal algorithm. Of course, if the optimization target L(θ, θ_j) is convex, the EM algorithm converges to the global maximum; the same holds for iterative methods such as gradient descent.
2. K Nearest Neighbor (KNN) algorithm
The basic idea of the method is as follows: if most of the K samples most similar to a given sample in the feature space (i.e., its K nearest neighbours) belong to a certain class, then the sample also belongs to that class; KNN is a supervised learning algorithm. The general flow of the KNN algorithm is:
(1) Calculate the distance between the test data point and every training data point, and sort the training points in increasing order of distance;
(2) Select the K training points closest to the current test data point;
(3) Count the frequency with which each class occurs among these K training points;
(4) Return the most frequent class among the K training points as the predicted class of the current test data point.
It is worth noting that the KNN algorithm uses the distance between objects as a measure of their dissimilarity, which avoids the problem of matching objects against each other. The distance is generally computed with the Euclidean distance formula or the Manhattan distance formula; in the invention, the Euclidean distance is used to measure the distance between objects, as shown below:

$$dis(x, y) = \sqrt{\sum_{k=1}^{m} (x_k - y_k)^2}$$

where m is the number of attributes of the two records x and y.
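The distance formula and the KNN flow (1) to (4) translate directly into code; the following is a minimal illustrative sketch:

```python
# Euclidean distance and the four-step KNN classification flow above.
import numpy as np
from collections import Counter

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(np.sum((x - y) ** 2)))

def knn_classify(test_point, train_points, train_labels, k):
    # (1) distance to every training point, sorted in increasing order
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(test_point, train_points[i]))
    # (2) the K nearest points and (3) the frequency of their classes
    votes = Counter(train_labels[i] for i in order[:k])
    # (4) the most frequent class is the prediction
    return votes.most_common(1)[0][0]
```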
the technical scheme of the invention is further explained by combining the attached drawings.
As shown in fig. 1, the data preprocessing method based on EM algorithm and KNN algorithm of the present invention includes the following steps:
S1, collecting financial system data and dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing (a record with a missing attribute value is incomplete data; otherwise it is complete data), taking the complete data subset as a training sample of the EM algorithm, and clustering by using the EM algorithm; the method comprises the following sub-steps:
S11, recording the complete data subset as (x_1, x_2, ..., x_n), where the samples x_1, x_2, ..., x_n are independent of each other and each sample corresponds to an unknown class z_i; the purpose of the clustering algorithm is to determine the class to which each sample belongs so that the joint distribution p(x_i; z_i) is maximized; the likelihood function of p(x_i; z_i) is:

$$L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)$$

Taking the logarithm of the above formula gives:

$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)$$

where n is the number of sample data, θ is the model parameter of the EM algorithm, and p(x_i, z_i; θ) is the joint distribution of sample x_i and class z_i when the model parameter is θ;
S12, defining the class variable z_i as obeying some distribution Q_i, where the distribution function Q_i(z_i) satisfies:

$$\sum_{z_i} Q_i(z_i) = 1, \qquad Q_i(z_i) \geq 0$$

Transforming the expression for l(θ) in step S11 with Jensen's inequality gives:

$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i)\,\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \geq \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
Since

$$\sum_{z_i} Q_i(z_i)\,\frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$

is the expectation of p(x_i, z_i; θ)/Q_i(z_i) under the distribution Q_i, the inequality above follows from Jensen's inequality: for the concave logarithm function, f(E[X]) ≥ E[f(X)], i.e. the logarithm of this expectation is greater than or equal to the expectation of its logarithm;
By Jensen's inequality, the inequality becomes an equality if and only if X is a constant, so equality requires:

$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C$$

where C is a constant. Summing over the different values of z_i gives:

$$\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i)$$
Since

$$\sum_{z_i} Q_i(z_i) = 1$$

it follows that:

$$\sum_{z_i} p(x_i, z_i; \theta) = C$$

Thus the formula for computing Q_i(z_i) is:

$$Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)$$

where p(z_i | x_i; θ) is the conditional probability that sample x_i belongs to class z_i when the model parameter is θ;
S13, taking the Q_i(z_i) obtained in step S12 as the distribution of the classes, then maximizing the likelihood function to obtain the final clustering result: given an initial value of θ, the loop repeats steps E and M until convergence:

E step: for each x_i, calculate Q_i(z_i) = p(z_i | x_i; θ);

M step: update θ:

$$\theta = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
The E step computes, for each data point in the training sample (i.e., each record in the complete data subset), the probability that it belongs to each cluster, and uses these probabilities as weights. The M step estimates the relevant parameters of each cluster (mean and variance) using the weights computed in the previous step: with the step-E probabilities as the weight of each data point, it computes the mean and variance of each cluster, much as K-means does, so as to maximize the overall probability, i.e. the likelihood, of the clusters.
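To make this description concrete, the sketch below writes out one E step and one M step for a mixture of Gaussians. Diagonal covariances are assumed for brevity; the invention does not mandate a particular covariance structure:

```python
# One E step and one M step of EM for a diagonal Gaussian mixture:
# the E step produces per-point cluster probabilities (the weights); the
# M step re-estimates each cluster's mixing weight, mean and variance.
import numpy as np
from scipy.stats import norm

def e_step(X, weights, means, variances):
    # responsibilities resp[i, j] = Q_i(z_i = j) = p(z_i = j | x_i; theta)
    n, k = len(X), len(weights)
    resp = np.zeros((n, k))
    for j in range(k):
        resp[:, j] = weights[j] * norm.pdf(X, means[j], np.sqrt(variances[j])).prod(axis=1)
    return resp / resp.sum(axis=1, keepdims=True)

def m_step(X, resp):
    nk = resp.sum(axis=0)                      # effective number of points per cluster
    weights = nk / len(X)                      # new mixing weights
    means = (resp.T @ X) / nk[:, None]         # responsibility-weighted means
    variances = np.stack([                     # responsibility-weighted variances
        (resp[:, j:j + 1] * (X - means[j]) ** 2).sum(axis=0) / nk[j]
        for j in range(resp.shape[1])
    ])
    return weights, means, variances
```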
S2, filling missing values on the clustering result by using a KNN algorithm; the method comprises the following substeps:
S21, sorting the records of the incomplete data subset D_i in ascending order of the number of missing attribute values;
s22, calculating the distance from each record r in the sorted incomplete data subset to each cluster center c formed by EM clustering, and sorting from small to large;
s23, classifying each incomplete record into the class of the cluster center c with the minimum distance to the record;
S24, calculating the distance dis between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes in the incomplete record, performing missing value filling using the following formula:

$$v_n(P_r) = \frac{\sum_{i=1}^{n} \alpha_i\, \beta_i(P_r)}{\sum_{i=1}^{n} \alpha_i}$$

where v_n is the incomplete record, β_i is a complete record of the class where cluster center c is located, P_r is a continuous attribute of the incomplete record v_n that contains a missing value, n is the total number of complete records of the class where cluster center c is located, and α_i is the similarity of the two records, obtained from the calculated distance dis;
and S25, filling each discrete attribute in the incomplete record with the mode of the corresponding attribute over the other complete records in the class to which the record belongs.
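Tying steps S21 to S25 together, the following sketch fills the incomplete subset given the EM clustering of the complete subset. It rests on stated assumptions: the cluster centers are taken to be the EM means over the continuous attributes, and the similarity α is computed as the inverse of the Euclidean distance dis, which is one plausible reading of the weighting in S24; the function and variable names are invented for illustration:

```python
# Steps S21-S25: sort incomplete records by number of missing attributes,
# assign each to its nearest EM cluster center, fill continuous attributes
# with a similarity-weighted mean of the cluster's complete records, and
# fill discrete attributes with the cluster mode.
import numpy as np
import pandas as pd

def fill_missing(incomplete, complete, labels, centers,
                 continuous_cols, discrete_cols):
    filled = incomplete.copy()
    # S21: ascending order of the number of missing attribute values
    order = incomplete.isna().sum(axis=1).sort_values().index
    for idx in order:
        row = incomplete.loc[idx]
        observed = list(row[continuous_cols].dropna().index)
        pos = [continuous_cols.index(c) for c in observed]
        x = row[observed].to_numpy(dtype=float)
        # S22/S23: nearest cluster center c over the observed attributes
        # (centers assumed to be the EM means over the continuous columns)
        d = np.linalg.norm(centers[:, pos] - x, axis=1)
        members = complete[labels == int(np.argmin(d))]
        # S24: distances dis to the cluster's complete records;
        # similarity alpha assumed to be inverse distance
        dis = np.linalg.norm(members[observed].to_numpy(dtype=float) - x, axis=1)
        alpha = 1.0 / (dis + 1e-9)
        for col in continuous_cols:
            if pd.isna(row[col]):
                filled.loc[idx, col] = np.average(members[col], weights=alpha)
        # S25: mode filling for discrete attributes
        for col in discrete_cols:
            if pd.isna(row[col]):
                filled.loc[idx, col] = members[col].mode().iloc[0]
    return filled
```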
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the invention without departing from its spirit, and such changes and combinations remain within the scope of the invention.

Claims (3)

1. The data preprocessing method based on the EM algorithm and the KNN algorithm is characterized by comprising the following steps of:
S1, collecting financial system data, dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as a training sample of the EM (expectation-maximization) algorithm, and clustering by using the EM algorithm;
and S2, filling missing values on the clustering result by using a KNN algorithm.
2. The method for preprocessing data based on EM algorithm and KNN algorithm as claimed in claim 1, wherein said step S1 comprises the following sub-steps:
S11, recording the complete data subset as (x_1, x_2, ..., x_n), where the samples x_1, x_2, ..., x_n are independent of each other and each sample corresponds to an unknown class z_i; the purpose of the clustering algorithm is to determine the class to which each sample belongs so that the joint distribution p(x_i; z_i) is maximized; the likelihood function of p(x_i; z_i) is:

$$L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)$$

Taking the logarithm of the above formula gives:

$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)$$

where n is the number of sample data, θ is the model parameter of the EM algorithm, and p(x_i, z_i; θ) is the joint distribution of sample x_i and class z_i when the model parameter is θ;
S12, defining the class variable z_i as obeying some distribution Q_i, where the distribution function Q_i(z_i) satisfies:

$$\sum_{z_i} Q_i(z_i) = 1, \qquad Q_i(z_i) \geq 0$$

Transforming the expression for l(θ) in step S11 with Jensen's inequality gives:

$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i)\,\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \geq \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
Since

$$\sum_{z_i} Q_i(z_i)\,\frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$

is the expectation of p(x_i, z_i; θ)/Q_i(z_i) under the distribution Q_i, the inequality above follows from Jensen's inequality: for the concave logarithm function, f(E[X]) ≥ E[f(X)], i.e. the logarithm of this expectation is greater than or equal to the expectation of its logarithm;
By Jensen's inequality, the inequality becomes an equality if and only if X is a constant, so equality requires:

$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C$$

where C is a constant. Summing over the different values of z_i gives:

$$\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i)$$
Since

$$\sum_{z_i} Q_i(z_i) = 1$$

it follows that:

$$\sum_{z_i} p(x_i, z_i; \theta) = C$$

Thus the formula for computing Q_i(z_i) is:

$$Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)$$

where p(z_i | x_i; θ) is the conditional probability that sample x_i belongs to class z_i when the model parameter is θ;
S13, taking the Q_i(z_i) obtained in step S12 as the distribution of the classes, then maximizing the likelihood function to obtain the final clustering result: given an initial value of θ, the loop repeats steps E and M until convergence:

E step: for each x_i, calculate Q_i(z_i) = p(z_i | x_i; θ);

M step: update θ:

$$\theta = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
3. the method for preprocessing data based on EM algorithm and KNN algorithm as claimed in claim 1, wherein said step S2 comprises the following sub-steps:
S21, sorting the records of the incomplete data subset D_i in ascending order of the number of missing attribute values;
s22, calculating the distance from each record r in the sorted incomplete data subset to each cluster center c formed by EM clustering, and sorting from small to large;
s23, classifying each incomplete record into the class of the cluster center c with the minimum distance to the record;
S24, calculating the distance dis between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes in the incomplete record, performing missing value filling using the following formula:

$$v_n(P_r) = \frac{\sum_{i=1}^{n} \alpha_i\, \beta_i(P_r)}{\sum_{i=1}^{n} \alpha_i}$$

where v_n is the incomplete record, β_i is a complete record of the class where cluster center c is located, P_r is a continuous attribute of the incomplete record v_n that contains a missing value, n is the total number of complete records of the class where cluster center c is located, and α_i is the similarity of the two records, obtained from the calculated distance dis;
and S25, filling each discrete attribute in the incomplete record with the mode of the corresponding attribute over the other complete records in the class to which the record belongs.
CN201911392045.7A, filed 2019-12-30 (priority date 2019-12-30): Data preprocessing method based on EM algorithm and KNN algorithm. Status: Pending. Published as CN111046977A (en).

Priority Applications (1)

Application Number: CN201911392045.7A; Priority/Filing Date: 2019-12-30; Title: Data preprocessing method based on EM algorithm and KNN algorithm (published as CN111046977A (en))


Publications (1)

Publication Number: CN111046977A; Publication Date: 2020-04-21

Family

ID=70241570

Family Applications (1)

Application Number: CN201911392045.7A; Title: Data preprocessing method based on EM algorithm and KNN algorithm; Status: Pending; Priority/Filing Date: 2019-12-30

Country Status (1)

Country Link
CN (1) CN111046977A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103177088A (en) * 2013-03-08 2013-06-26 北京理工大学 Biomedicine missing data compensation method
US20150261846A1 (en) * 2014-03-11 2015-09-17 Sas Institute Inc. Computerized cluster analysis framework for decorrelated cluster identification in datasets
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 Text classification method and device
CN108932301A (en) * 2018-06-11 2018-12-04 天津科技大学 Data filling method and device
CN109446185A (en) * 2018-08-29 2019-03-08 广西大学 Collaborative filtering missing data processing method based on user's cluster
CN110275895A (en) * 2019-06-25 2019-09-24 广东工业大学 It is a kind of to lack the filling equipment of traffic data, device and method
CN114741457A (en) * 2022-04-14 2022-07-12 郑州轻工业大学 Data missing value filling method based on function dependence and clustering

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MICROSTRONG0305: "EM算法详解" (A detailed explanation of the EM algorithm), CSDN Blog *
曹建军 (Cao Jianjun) et al.: 《数据质量导论》 (Introduction to Data Quality), Beijing: National Defense Industry Press, 31 October 2017 *
樊东辉 (Fan Donghui) et al.: "基于聚类的KNN算法改进" (An improvement of the KNN algorithm based on clustering), 《电脑知识与技术》 (Computer Knowledge and Technology) *
赵星 (Zhao Xing) et al.: "基于距离最大化和缺失数据聚类的填充算法" (A filling algorithm based on distance maximization and missing-data clustering), 《电子设计工程》 (Electronic Design Engineering) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114004266A (en) * 2020-07-27 2022-02-01 中国电信股份有限公司 Non-equilibrium industrial data classification method and device and computer readable storage medium
CN111984626A (en) * 2020-08-25 2020-11-24 西安建筑科技大学 Statistical mode-based energy consumption data identification and restoration method
CN113065574A (en) * 2021-02-24 2021-07-02 同济大学 Data preprocessing method and device for semiconductor manufacturing system
CN113139570A (en) * 2021-03-05 2021-07-20 河海大学 Dam safety monitoring data completion method based on optimal hybrid valuation
CN113435536A (en) * 2021-07-15 2021-09-24 广东电网有限责任公司 Electricity charge data preprocessing method, device, terminal equipment and medium
CN116739345A (en) * 2023-06-08 2023-09-12 南京工业大学 Real-time evaluation method for possibility of dangerous chemical road transportation accident
CN116739345B (en) * 2023-06-08 2024-03-22 南京工业大学 Real-time evaluation method for possibility of dangerous chemical road transportation accident


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2020-04-21)