CN114968992A

CN114968992A - Data identification cleaning and compensation method and device, electronic equipment and storage medium

Info

Publication number: CN114968992A
Application number: CN202210422231.6A
Authority: CN
Inventors: 刘怀亮; 张静; 杨斌; 赵舰波
Original assignee: Xi'an Zhile Technology Co ltd
Current assignee: Xi'an Zhile Technology Co ltd
Priority date: 2022-04-21
Filing date: 2022-04-21
Publication date: 2022-08-30

Abstract

The invention discloses a data identification, cleaning and compensation method and device, an electronic device and a storage medium. The method comprises: acquiring data to be cleaned, the data to be cleaned having a plurality of attribute features; clustering the data to be cleaned based on an improved k-Means clustering algorithm, in which attribute weights are introduced for the plurality of attribute features of the data to be cleaned so as to change the clustering criterion function; removing abnormal data from the clustering result according to a preset abnormal-data outlier distance to obtain a cleaning result; and compensating the cleaning result based on an improved random forest algorithm to obtain a compensation result, the improved random forest algorithm being one in which the random forest input data is optimized by interpolation. The method treats data cleaning and compensation as closely coupled steps and can achieve better data cleaning and compensation results.

Description

Data identification cleaning and compensation method and device, electronic equipment and storage medium

Technical Field

The invention belongs to the technical field of computer data cleaning, and particularly relates to a data identification cleaning and compensation method and device, electronic equipment and a storage medium.

Background

Data is currently being generated at unprecedented speed. With the widespread use of data acquisition devices, large amounts of data are accumulated and used around the clock. As Internet of Things applications spread through daily life and production, their data-centric character is increasingly prominent. Densely deployed sensor nodes generate large amounts of sensor data, and abnormal values often occur because node energy is limited, the monitoring environment is complex, and the nodes are vulnerable to external attack. Acquired data that is inevitably mixed with abnormal data due to such random factors cannot be converted into useful information. Improving data quality is therefore key to improving data utilization efficiency, and cleaning the acquired data, that is, designing an abnormal-data cleaning algorithm for the Internet of Things that keeps the data reliable and accurate, is a key problem in Internet of Things data analysis.

Data cleaning has been studied extensively. For example, Yan Yingjie, Sheng Gehao (盛戈皞), Chen Yufeng, et al., in "A method for cleaning big data on the state of power transmission and transformation equipment based on time sequence analysis [J]. Automation of Electric Power Systems, 2015(7): 7.", classify the abnormal data of power transmission and transformation equipment and clean transformer and line data with a double-cycle iterative inspection method based on time sequence analysis to obtain higher-quality data; Yang Donghua, Li Ningning, Wang Hongzhi, et al., in "Parallel big data cleaning process optimization based on task combination [J]. Chinese Journal of Computers, 2016, 39(1): 12.", study the cleaning of parallel data and propose an optimization technique based on task combination that merges redundant computation and identical files, reducing system running time and improving data cleaning efficiency; and Dai Jiejie, Song Hui, Yang Yi (杨祎), et al. propose "A method for cleaning state data of power transmission and transformation equipment based on a stacked denoising autoencoder [J]. Automation of Electric Power Systems, 2017, 041(012): 224-.

However, the methods above only clean the data. Cleaning inevitably entails some loss of data, and without effective compensation of the cleaned data, the room for quality improvement is limited even though cleaning improves quality to some extent.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a data identification cleaning and compensation method, device, electronic device and storage medium. The technical problem to be solved by the invention is realized by the following technical scheme:

in a first aspect, an embodiment of the present invention provides a data identification cleaning and compensation method, including:

acquiring data to be cleaned; wherein the data to be cleaned has a plurality of attribute characteristics;

clustering the data to be cleaned based on an improved k-Means clustering algorithm; wherein the improved k-Means clustering algorithm introduces attribute weights for the plurality of attribute features of the data to be cleaned so as to change the clustering criterion function;

removing abnormal data from the clustering result according to a preset abnormal-data outlier distance to obtain a cleaning result;

compensating the cleaning result based on an improved random forest algorithm to obtain a compensation result; wherein the improved random forest algorithm optimizes the random forest input data by interpolation.

In an embodiment of the present invention, the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm includes:

randomly acquiring a plurality of data from the data to be cleaned as initial clustering centers;

constructing a first clustering objective function;

clustering and solving the first clustering objective function according to the initial clustering center to obtain a plurality of intermediate target clustering centers corresponding to the data to be cleaned;

constructing a second clustering objective function;

performing weight solving on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned;

and performing clustering on the second clustering objective function according to the attribute weights to obtain a plurality of target cluster centers corresponding to the data to be cleaned.

In one embodiment of the present invention, the first clustering objective function formula is constructed as:

wherein X represents the number of attribute features of the data to be cleaned, M represents the number of initial cluster centers, P represents the data set to be cleaned, $P_x$ represents any sample in the data set P, $M_m$ represents the m-th initial cluster center, and d(·) represents the Euclidean distance.

In one embodiment of the present invention, the second clustering objective function is constructed as:

wherein X represents the number of attribute features of the data to be cleaned, N represents the number of intermediate target cluster centers, $\lambda^l$ represents the attribute weight corresponding to the l-th class of data to be cleaned,

represents the sample set of the l-th class of data to be cleaned, $x^l$ represents the number of samples in that sample set,

represents the n-th intermediate target cluster center corresponding to the l-th class of data to be cleaned, and $\|\cdot\|^2$ represents the least-squares error.

In an embodiment of the present invention, performing weight solving on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned includes:

solving the weights of the second clustering objective function with a Lagrangian function, according to the intermediate target cluster centers, to obtain the attribute weight corresponding to each attribute in the data to be cleaned.

In an embodiment of the invention, the process of compensating the cleaning result based on the improved random forest algorithm comprises:

interpolating the cleaning result to obtain an interpolation matrix;

constructing a random forest regression model according to the interpolation matrix;

aiming at each column of the interpolation matrix, when the column is taken as a target filling column, the data of the rest columns form a filling matrix; and predicting according to the filling matrix and the data corresponding to the target filling column by using the constructed random forest regression model to obtain a prediction result of the target filling column, and taking the prediction result as the compensation result of the corresponding column.

In an embodiment of the present invention, the interpolating the cleaning result to obtain an interpolation matrix includes:

interpolating the cleaning result by linear interpolation to obtain an interpolation matrix; wherein the interpolation matrix has the same size as the data to be cleaned.

In a second aspect, an embodiment of the present invention provides an abnormal data identification, cleaning and compensation apparatus, including:

the data acquisition module is used for acquiring data to be cleaned; wherein the data to be cleaned has a plurality of attribute characteristics;

the data clustering module is used for clustering the data to be cleaned based on an improved k-Means clustering algorithm; wherein the improved k-Means clustering algorithm introduces attribute weights for the plurality of attribute features of the data to be cleaned so as to change the clustering criterion function;

the data removing module is used for removing abnormal data from the clustering result according to a preset abnormal-data outlier distance to obtain a cleaning result;

the data compensation module is used for compensating the cleaning result based on an improved random forest algorithm to obtain a compensation result; wherein the improved random forest algorithm optimizes the random forest input data by interpolation.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;

the memory is used for storing a computer program;

the processor is used for implementing the steps of any one of the above data identification cleaning and compensation methods when executing the program stored in the memory.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, it implements the steps of any one of the above data identification cleaning and compensation methods.

The invention has the beneficial effects that:

the invention provides a data identification, cleaning and compensation method that treats data cleaning and compensation as closely coupled steps and can achieve better cleaning and compensation results. Specifically: in the data cleaning process, the efficient clustering of the k-Means clustering algorithm is exploited while fully considering that the attribute features of the data to be cleaned do not all influence the data equally; attribute weights are introduced to give the improved k-Means clustering algorithm, which identifies abnormal data quickly and accurately and improves the cleaning quality of the data. In the data compensation process, the fast regression of the random forest algorithm is exploited: the cleaning result of the improved k-Means clustering algorithm is used as input data, and, building on that better cleaning result, the input data of the random forest algorithm is interpolated. Because the interpolated data takes the attribute features of the data to be cleaned into account, it fits the attributes of the samples more closely and meets the data requirements of the algorithm, which improves the data compensation precision and further improves the data cleaning quality.

The present invention will be described in further detail with reference to the accompanying drawings and examples.

Drawings

FIG. 1 is a schematic flow chart of a data recognition cleaning and compensation method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of clustering data to be cleaned based on an improved k-Means clustering algorithm in the data identification cleaning and compensation method according to the embodiment of the present invention;

FIG. 3 is a schematic flow chart of compensating the cleaning result based on the improved random forest algorithm to obtain a compensation result in the data identification cleaning and compensation method provided by the embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an abnormal data recognition, cleaning and compensation apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.

Example one

In order to improve the cleaning quality of the data to be cleaned, the embodiment of the invention provides a data identification cleaning and compensation method, a data identification cleaning and compensation device, electronic equipment and a storage medium.

In a first aspect, referring to fig. 1, an embodiment of the present invention provides a data identification cleaning and compensation method, including the following steps:

S10, acquiring data to be cleaned; wherein the data to be cleaned has a plurality of attribute characteristics.

Specifically, the data to be cleaned may be acquired in any scenario. The amount of acquired data keeps growing and its types keep multiplying, so the data to be cleaned has multiple attribute features. For example, the data to be cleaned may concern animals, plants, people or other types of data, and each type of data has its own characteristics. The characteristics of a given type are very important for identifying data of that type, whereas the characteristics of other types are unimportant for that identification and, if introduced improperly, may even cause data of that type to be misidentified. The data can therefore be identified by comprehensively considering its attribute features.

S20, clustering the data to be cleaned based on an improved k-Means clustering algorithm; the improved k-Means clustering algorithm introduces attribute weights for the plurality of attribute features of the data to be cleaned so as to change the clustering criterion function.

Specifically, following the above analysis, the embodiment of the present invention proposes cleaning the data to be cleaned by using its attribute features. The k-Means clustering algorithm is adopted in the cleaning process. The inventors' analysis shows that the traditional k-Means clustering algorithm treats every attribute feature of the data to be cleaned as having the same influence on the data, whereas in practice the influence of each attribute feature differs, especially across different types of data, and this degrades the k-Means clustering result. Based on this analysis, the embodiment of the present invention clusters the data to be cleaned with an improved k-Means clustering algorithm, in which attribute weights are introduced for the plurality of attribute features of the data to be cleaned so as to change the clustering criterion function. Specifically, referring to fig. 2, the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm includes the following steps:

101. Randomly acquiring a plurality of data from the data to be cleaned as initial cluster centers; for example, the number of initial cluster centers is M.

102. Constructing a first clustering objective function.

Specifically, the first clustering objective function constructed in the embodiment of the present invention is expressed as:

wherein X represents the number of attribute features of the data to be cleaned, M represents the number of initial cluster centers, P represents the data set to be cleaned, $P_x$ represents any sample in the data set P, $M_m$ represents the m-th initial cluster center, and d(·) represents the Euclidean distance. The first clustering objective function therefore does not take the attribute features of the data to be cleaned into account.

103. Performing clustering on the first clustering objective function according to the initial cluster centers to obtain a plurality of intermediate target cluster centers corresponding to the data to be cleaned.

Specifically, the embodiment of the present invention uses the first clustering objective function of formula (1) and continuously adjusts the initial cluster centers so that $V_1(X)$ reaches its minimum. Attribute features are not considered in this clustering pass, but the adjustment yields a plurality of intermediate target cluster centers that are more accurate than the initial cluster centers and that are used in the subsequent clustering based on the attribute features.
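For orientation, the standard k-Means criterion written with the symbols defined above would take the following form. This is a sketch under assumptions rather than the source's formula (1): the sets $C_m$ (the samples currently assigned to the center $M_m$) and the use of the squared distance are assumptions made here for illustration.

$$
V_1(X) = \sum_{m=1}^{M} \sum_{P_x \in C_m} d(P_x, M_m)^2
$$

Minimizing this quantity alternates between assigning every sample to its nearest center and moving each center to the mean of its assigned samples; no attribute weight appears in it.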

104. Constructing a second clustering objective function.

Specifically, because every attribute feature has the same influence in the traditional k-Means clustering algorithm, the clustering effect cannot reach its best. The embodiment of the present invention therefore introduces attribute weights to fully bring out the influence of each attribute feature in clustering, and constructs a second clustering objective function, so that the clustering criterion function changes. The second clustering objective function is expressed as:

wherein X represents the number of attribute features of the data to be cleaned, N represents the number of intermediate target cluster centers, $\lambda^l$ represents the attribute weight corresponding to the l-th class of data to be cleaned,

represents the sample set of the l-th class of data to be cleaned, and $x^l$ represents the number of samples in that sample set.

105. Performing weight solving on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned.

Specifically, in the embodiment of the present invention, a weighted arithmetic mean method may be used to solve the weight of the formula (2) to obtain an attribute weight corresponding to each attribute in the data to be cleaned.

The inventors found that when the data set is used to solve a problem with specific constraints, the clustering result of the improved k-Means clustering algorithm is better than that of the traditional k-Means clustering algorithm, but the constrained solution is complex: the weighted arithmetic mean method computes slowly and neglects computational precision, so the clustering result still falls short of the optimal solution. For identifying and cleaning data sets with such specific constraints, the embodiment of the present invention proposes solving the weights of the second clustering objective function with a Lagrangian function, according to the intermediate target cluster centers, to obtain the attribute weight corresponding to each attribute in the data to be cleaned. Specifically, the attribute weight λ of each attribute feature in formula (2) is calculated under the specific constraint by the Lagrangian method: the constrained extremum problem is converted into an unconstrained one, and solving the extremum of the unconstrained Lagrangian function solves the extremum of the original constrained function, so the attribute weights are obtained effectively and quickly. The attribute weights solved with the Lagrangian function are more accurate than those solved with the weighted arithmetic mean method, and more accurate attribute weights lead to a more accurate clustering result subsequently.
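The Lagrangian step can be illustrated with a short worked derivation. It is a sketch under assumptions the source does not state: the weights are constrained by $\sum_l \lambda^l = 1$, they enter the objective with an exponent $\beta > 1$, and $D_l$ (a hypothetical symbol) denotes the within-cluster squared error of the l-th attribute for fixed intermediate target cluster centers.

$$
\begin{aligned}
L(\lambda, \mu) &= \sum_{l=1}^{X} (\lambda^l)^{\beta} D_l - \mu \Big( \sum_{l=1}^{X} \lambda^l - 1 \Big), \\
\frac{\partial L}{\partial \lambda^l} &= \beta (\lambda^l)^{\beta-1} D_l - \mu = 0
\;\Rightarrow\; \lambda^l = \Big( \frac{\mu}{\beta D_l} \Big)^{1/(\beta-1)}, \\
\sum_{l=1}^{X} \lambda^l = 1 &\;\Rightarrow\; \lambda^l = \frac{(1/D_l)^{1/(\beta-1)}}{\sum_{t=1}^{X} (1/D_t)^{1/(\beta-1)}}.
\end{aligned}
$$

Under these assumptions the constrained problem has a closed-form solution, and attributes with a smaller within-cluster error receive a larger weight.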

106. Performing clustering on the second clustering objective function according to the attribute weights to obtain a plurality of target cluster centers corresponding to the data to be cleaned.

Specifically, in the embodiment of the present invention, different attribute weights are assigned to different attribute features through S105, and the second clustering objective function of formula (2) is used to continuously adjust the intermediate target cluster centers so that $V_2(X)$ reaches its minimum. Because the attribute features are included in this clustering pass, their influence on data clustering is fully considered, so the plurality of target cluster centers obtained through the adjustment are more accurate than the intermediate target cluster centers and are better suited to the final clustering identification. The number of target cluster centers obtained at this point is O.
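Steps 101 to 106 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the exponent `beta`, the constraint that the weights sum to 1, and the closed-form weight formula (the usual result of a Lagrange-multiplier solution for such a weighted criterion) are all assumptions, since the source does not give formula (2) explicitly.

```python
import numpy as np

def assign(data, centers, w):
    """Label each sample with the index of its nearest center under weights w."""
    diff = data[:, None, :] - centers[None, :, :]
    return (w * diff ** 2).sum(axis=2).argmin(axis=1)

def kmeans(data, centers, weights=None, n_iter=100):
    """Lloyd iterations; `weights` scales each attribute in the squared distance."""
    data = np.asarray(data, dtype=float)
    centers = np.array(centers, dtype=float)
    w = np.ones(data.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    for _ in range(n_iter):
        labels = assign(data, centers, w)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = data[labels == k]
            if len(members):  # keep the old center if a cluster becomes empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(data, centers, w)

def attribute_weights(data, centers, labels, beta=2.0, eps=1e-12):
    """Assumed closed form: minimise sum_l w_l**beta * D_l subject to sum_l w_l = 1."""
    data = np.asarray(data, dtype=float)
    # D_l: within-cluster squared error of attribute l (eps avoids division by zero).
    disp = ((data - centers[labels]) ** 2).sum(axis=0) + eps
    inv = (1.0 / disp) ** (1.0 / (beta - 1.0))
    return inv / inv.sum()

def improved_kmeans(data, n_clusters, beta=2.0, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    init = data[rng.choice(len(data), size=n_clusters, replace=False)]  # step 101
    mid_centers, mid_labels = kmeans(data, init)                        # steps 102-103
    weights = attribute_weights(data, mid_centers, mid_labels, beta)    # steps 104-105
    centers, labels = kmeans(data, mid_centers, weights ** beta)        # step 106
    return centers, labels, weights
```

In this sketch an attribute whose values scatter widely inside the clusters receives a small weight, so it contributes little to the distance used in the second clustering pass.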

And S30, removing abnormal data in the clustering result according to the preset outlier distance of the abnormal data to obtain a cleaning result.

Specifically, according to the O target cluster centers obtained in S106, the embodiment of the present invention calculates the distance between each datum to be cleaned and each target cluster center and compares that distance with the preset abnormal-data outlier distance. If the distance is greater than the preset outlier distance, the corresponding data is determined to be abnormal data and is removed; if the distance is less than or equal to the preset outlier distance, the corresponding data is determined to be normal data and is retained and clustered, thereby obtaining the final cleaning result. Abnormal data can thus be accurately identified and removed on the basis of more accurate target cluster centers, which improves the cleaning quality of the data to be cleaned. The abnormal-data outlier distance can be set according to actual requirements.
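The removal rule of S30 can be sketched as below. The function name `remove_outliers` and the choice to measure each sample against its nearest target cluster center are assumptions; the source only states that the distance is compared with a preset outlier distance.

```python
import numpy as np

def remove_outliers(data, centers, max_distance):
    """Keep samples whose distance to the nearest center is at most max_distance."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Euclidean distance from every sample to every center: (n_samples, n_centers).
    dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    keep = dist.min(axis=1) <= max_distance
    # Return the retained samples, their cluster labels, and a mask of removed rows.
    return data[keep], nearest[keep], ~keep
```

The returned mask records which rows were removed, which is the information a later compensation step needs in order to know where values are missing.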

S40, compensating the cleaning result based on the improved random forest algorithm to obtain a compensation result; the improved random forest algorithm optimizes the random forest input data by interpolation.

Specifically, data cleaning is achieved by the processing of S30, but removing data inevitably causes data loss to varying degrees, which in turn affects the cleaning quality. After cleaning, the cleaned data therefore also needs to be compensated. In a conventional compensation process, the cleaned data is used directly as input and a random forest algorithm compensates the input data, which achieves a certain compensation effect.

However, when the conventional random forest algorithm is applied directly, it does not handle regression as well as it handles classification: improper initial input data can seriously affect its initial prediction, and in regression the initial prediction in turn affects the random forest's prediction for each sample of the input data set, which may cause overfitting on some data and degrade the overall prediction. It is therefore very important to consider the initial input data of the random forest algorithm before using it for compensation. The embodiment of the present invention proposes using the cleaned data to construct data carrying more prior information as the input data of the random forest, and compensating the cleaning result based on the improved random forest algorithm. Specifically, as shown in fig. 3, the process of compensating the cleaning result based on the improved random forest algorithm includes the following steps:

S401, interpolating the cleaning result to obtain an interpolation matrix.

Specifically, the embodiment of the present invention does not limit the interpolation method applied to the cleaning result; a linear interpolation method, a polynomial interpolation method, a Newton interpolation method or the like may be used. The embodiment of the present invention interpolates the cleaning result by linear interpolation to obtain an interpolation matrix; the interpolation positions are random, and each interpolated value is a weighted average of adjacent data.

For example, the data to be cleaned is denoted P and has X attribute features, the attribute features corresponding respectively to $x^1, x^2, x^3, \ldots, x^X$ sample data. The sample set corresponding to each attribute feature is denoted as:

The total number of samples over all attribute features is $x^1 + x^2 + x^3 + \ldots + x^X$, so the data P to be cleaned can be formulated as:

The cleaning result obtained after the removal in S30, that is, the cleaned data, is denoted P'. Since the number of target cluster centers is O, the target cluster centers correspond respectively to $y^1, y^2, y^3, \ldots, y^O$ sample data. The sample set corresponding to each target cluster center is denoted as:

The total number of samples after cleaning is $y^1 + y^2 + y^3 + \ldots + y^O$, so the cleaned data P' can be formulated as:

The data P to be cleaned expressed by formula (3) is converted into an $M_1 \times N_1$ matrix storage form, where $M_1 \times N_1 = x^1 + x^2 + x^3 + \ldots + x^X$; during the conversion, data to be cleaned having the same attribute feature are stored adjacently.

Similarly, all the cleaned data P' expressed by formula (4) is converted into an $M_1 \times N_2$ matrix storage form, where $M_1 \times N_2 = y^1 + y^2 + y^3 + \ldots + y^O$. The data stored in the $M_1 \times N_2$ matrix is expressed as:

wherein, similarly, cleaned data P' having the same attribute feature are stored adjacently during the conversion; in formula (5),

each sample set contains $M_1$ sample data. The $M_1 \times N_2$ matrix converted from the cleaned data P' is linearly interpolated to form an interpolation matrix B. The interpolation positions are random, and the value at each interpolation position is obtained from the known adjacent values in the cleaned data P' by weighted averaging of adjacent data. An appropriate amount of interpolation data is inserted: the inventors found that the interpolation matrix should match the $M_1 \times N_1$ matrix converted from the data to be cleaned, so the interpolation matrix is required to be of size $M_1 \times N_1$. The interpolation matrix B formed after the data is inserted can then be expressed as:

wherein

denotes the inserted data. The interpolation matrix of formula (6) is used as the input data when the random forest regression model is constructed with the random forest algorithm.
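The linear interpolation of S401 can be illustrated with the sketch below. It rests on assumptions not fixed by the source: the cleaned data is already laid out as an $M_1 \times N_1$ array in which removed entries are marked with NaN, and each gap is filled along its column from the nearest known neighbours. The function name is hypothetical.

```python
import numpy as np

def linear_interpolate_columns(matrix):
    """Fill NaN entries column by column from the adjacent known values."""
    filled = np.array(matrix, dtype=float)
    rows = np.arange(filled.shape[0])
    for j in range(filled.shape[1]):
        col = filled[:, j]  # a view, so assignments below update `filled`
        known = ~np.isnan(col)
        if known.any():
            # np.interp weights the two neighbouring known values by distance;
            # before the first or after the last known value it repeats that value.
            col[~known] = np.interp(rows[~known], rows[known], col[known])
    return filled
```

Each filled entry is a distance-weighted average of its two nearest known neighbours in the same column, which corresponds to the adjacent-data weighted averaging described above.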

S402, a random forest regression model is built according to the interpolation matrix.

Specifically, the interpolation matrix B generated by formula (6) is used as the training matrix, and a subset is randomly drawn from the training matrix B as the training sample of a regression tree root node to complete the construction of the random forest regression model. The detailed construction of the random forest regression model follows existing methods. The key point of the embodiment of the present invention is that the training data used to construct the random forest regression model is the interpolation matrix B. This training data matches the form of the original data P to be cleaned more closely, and because the attribute features of adjacent data are considered when the interpolation matrix B is formed, the interpolated data at the missing values fits the attributes of the samples more closely, meets the requirements of the algorithm, and improves the data compensation precision.

S403, aiming at each column of the interpolation matrix, when the column is taken as a target filling column, the data of the rest columns form a filling matrix; and predicting according to the filling matrix and the data corresponding to the target filling column by using the constructed random forest regression model to obtain a prediction result of the target filling column, and taking the prediction result as a compensation result of the corresponding column.

Specifically, through a trained random forest regression model, performing regression prediction on each column of the interpolation matrix B, specifically: aiming at each column of the interpolation matrix, when the column is taken as a target filling column, the data of the rest columns form a filling matrix; and predicting according to the filling matrix and the data corresponding to the target filling column by using the random forest regression model constructed in the step S402 to obtain a prediction result of the target filling column, and taking the prediction result as a compensation result of the corresponding column.

For example, taking the ith column in the interpolation matrix B as the target padding column, the padding matrix is formed by other columns except the data corresponding to the target padding column. For ease of understanding, the embodiment of the present invention reformulates the interpolation matrix B of equation (6) as:

For formula (7), the data corresponding to the target filling column is $b_i'$, and the filling matrix can then be expressed as:

The interpolation matrix B is re-expressed here as formula (7) only to make the form of the filling matrix clearer; the data relationship between formula (6) and formula (7) need not be examined.

The data $b_i'$ corresponding to the target filling column is taken as the regression output of the random forest regression model, and the filling matrix

is taken as the input of the random forest regression model; the interpolation data of the i-th column is predicted to obtain a predicted value for the i-th column. To improve prediction accuracy, the interpolation data of the i-th column is predicted multiple times following the ensemble idea, and the mean of the predicted values is taken as the final interpolation of the i-th column, denoted $a_i'$. That is, the i-th column data $b_i'$ in formula (7) is replaced with the current predicted value $a_i'$, expressed as

which completes the prediction of the i-th column of interpolation data.

Similarly, each of the other columns of the interpolation matrix B is taken in turn as the target filling column: the data corresponding to the target filling column is taken as the regression output of the random forest regression model, the corresponding filling matrix is taken as the input of the random forest regression model, and the interpolation data of that column is predicted multiple times to obtain its predicted value. Once all data in the interpolation matrix B have been predicted, the compensation of the cleaned data P' is complete.
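The column-by-column procedure above can be sketched as follows. This is a minimal illustration, not the model of step S402: it assumes a fresh forest is fitted for each target filling column on the filling matrix, uses a small pure-Python random forest (bootstrap resampling, random feature subsets, averaging over trees) as a stand-in, and treats the average over trees as the "multiple predictions". The names `build_tree`, `predict_tree`, `fit_forest` and `compensate`, and all hyperparameters, are hypothetical.

```python
import random
from statistics import fmean


def build_tree(rows, targets, depth, rng, n_try):
    """Grow a regression tree: a leaf is a float, a split is (feature, threshold, left, right)."""
    if depth == 0 or len(rows) < 4 or min(targets) == max(targets):
        return fmean(targets)
    best = None  # (sum of squared errors, feature index, threshold)
    for f in rng.sample(range(len(rows[0])), n_try):  # random feature subset
        for t in sorted({r[f] for r in rows})[:-1]:   # thresholds leaving both sides non-empty
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            ml, mr = fmean(left), fmean(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    if best is None:  # every sampled feature was constant in this node
        return fmean(targets)
    _, f, t = best
    lo = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [y for _, y in lo], depth - 1, rng, n_try),
            build_tree([r for r, _ in hi], [y for _, y in hi], depth - 1, rng, n_try))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, targets, n_trees, rng, depth=4):
    """Bagging: each tree is grown on a bootstrap resample of the training rows."""
    n, n_try = len(rows), max(1, len(rows[0]) // 3)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                                 depth, rng, n_try))
    return forest


def compensate(B, n_trees=25, seed=0):
    """Return a copy of B in which every column is replaced by its averaged forest prediction."""
    rng = random.Random(seed)
    result = [list(row) for row in B]
    for i in range(len(B[0])):
        fill = [[v for j, v in enumerate(row) if j != i] for row in B]  # filling matrix
        target = [row[i] for row in B]                                   # target filling column
        forest = fit_forest(fill, target, n_trees, rng)
        for r, x in enumerate(fill):
            result[r][i] = fmean(predict_tree(tree, x) for tree in forest)
    return result
```

Calling `compensate(B)` on an interpolation matrix given as a list of equal-length rows of floats returns a new matrix of the same shape; the input `B` is left unchanged, so each column is always predicted from the original values of the other columns.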

In order to verify the effectiveness of the abnormal data identification, cleaning and compensation method provided by the embodiment of the invention, the following experiment is performed.

The comparative experiment covers three methods: (1) the traditional k-Means clustering algorithm + support vector machine algorithm; (2) the traditional k-Means clustering algorithm + traditional random forest algorithm; and (3) the improved k-Means clustering algorithm + improved random forest algorithm. The data cleaned in the experiment is divided into five test sample sets, and each of the three methods is used to clean and compensate the data. For the final compensation result of each method, the root mean square error (RMSE), the mean absolute error (MAE), the coefficient of determination R and the algorithm running time t are calculated, and the cleaning and compensation capabilities of the three methods are evaluated and analyzed against these four indices, where:

RMSE represents the sample standard deviation of the differences between the predicted values (compensation results) and the observed values (data samples to be cleaned) and indicates the degree of dispersion of the samples; the calculation formula can be expressed as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N'} \sum_{i=1}^{N'} \left( y_i - y_{pi} \right)^2}$$

where $N'$ represents the total number of samples of data to be cleaned, i.e. $N' = M_1 \times N_1$, $y_i$ represents the i-th sample of data to be cleaned, and $y_{pi}$ represents the predicted estimate of the i-th sample of data to be cleaned.

MAE represents the mean of the absolute errors between the predicted values and the observed values, and the calculation formula can be expressed as:

$$\mathrm{MAE} = \frac{1}{N'} \sum_{i=1}^{N'} \left| y_i - y_{pi} \right|$$

R represents the degree of fit between the curves formed by the predicted values and by the observed values, and the calculation formula can be expressed as:

where $y_a$ denotes the mean of all samples.
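The three accuracy indices can be computed as in the sketch below. RMSE and MAE follow their standard definitions; for R, the sketch assumes the usual coefficient of determination, $1 - \sum_i (y_i - y_{pi})^2 / \sum_i (y_i - y_a)^2$, which is consistent with the symbols defined above but is an assumption about the exact formula intended. The function names are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error between observed values y_i and predictions y_pi."""
    n = len(observed)
    return sqrt(sum((y - yp) ** 2 for y, yp in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between observed values and predictions."""
    n = len(observed)
    return sum(abs(y - yp) for y, yp in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination (assumed form): 1 - SS_res / SS_tot."""
    y_a = sum(observed) / len(observed)  # mean of all observed samples
    ss_res = sum((y - yp) ** 2 for y, yp in zip(observed, predicted))
    ss_tot = sum((y - y_a) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot
```

A smaller RMSE or MAE and a larger R indicate a better compensation result, which is how the indices are read in the comparison that follows.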

The evaluation results of the cleaning and compensation capabilities of the three data cleaning and compensation algorithms are shown in Table 1.

Table 1. Evaluation results of the cleaning and compensation capabilities of the three data cleaning and compensation algorithms

As can be seen from Table 1, compared with the traditional k-Means clustering algorithm + support vector machine algorithm and the traditional k-Means clustering algorithm + traditional random forest algorithm, the method provided by the invention shows smaller error fluctuation and a better fitting trend. Specifically: for the MAE index, the MAE value obtained after data cleaning and compensation with the method of the invention is smaller, so the average cleaning and compensation effect is better; for the RMSE and R indices, the traditional k-Means clustering algorithm + support vector machine algorithm has the smaller RMSE value and the larger R value on test sample 1, where its cleaning and compensation accuracy and its trend fitting are better, whereas the method of the invention has the smaller RMSE value and the larger R value on test samples 2, 3, 4 and 5. In conclusion, the overall cleaning and compensation effect of the method of the invention is superior to that of the traditional k-Means clustering algorithm + support vector machine algorithm and the traditional k-Means clustering algorithm + traditional random forest algorithm: it achieves higher accuracy and better trend fitting, and it is better suited to scenarios in which the cleaned data has a great influence on data quality.

In summary, the abnormal data identification, cleaning and compensation method provided in the embodiment of the present invention treats data cleaning and compensation as closely combined steps and can obtain better data cleaning and compensation effects. Specifically: in the data cleaning process, the method exploits the efficient clustering of the k-Means clustering algorithm while fully considering that the attribute features of the data to be cleaned do not all influence the data to the same degree; by introducing attribute weights, it provides an improved k-Means clustering algorithm that can identify abnormal data quickly and accurately and improves the cleaning quality of the data. In the data compensation process, the method exploits the fast regression of the random forest algorithm and takes the cleaning result of the improved k-Means clustering algorithm as input data; building on the better cleaning result of the improved k-Means clustering algorithm, it interpolates the input data of the random forest algorithm by an interpolation method. Because the interpolated data takes the attribute features of the data to be cleaned into account, it fits the attributes of the samples more closely and meets the data requirements of the algorithm, which improves the data compensation accuracy and further improves the data cleaning quality.

In a second aspect, referring to fig. 4, an embodiment of the invention provides an abnormal data identification, cleaning and compensation apparatus, including:

a data obtaining module 501, configured to obtain data to be cleaned; the data to be cleaned has a plurality of attribute characteristics;

a data clustering module 502, configured to cluster the data to be cleaned based on an improved k-Means clustering algorithm; the improved k-Means clustering algorithm is a mode of introducing attribute weights to a plurality of attribute features of data to be cleaned to change a clustering criterion function;

the data removing module 503 is configured to remove abnormal data in the clustering result according to a preset abnormal data outlier distance to obtain a cleaning result;

a data compensation module 504, configured to compensate the cleaning result based on an improved random forest algorithm to obtain a compensation result; the improved random forest algorithm is a mode of optimizing random forest input data by adopting an interpolation method.
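How the clustering and removal modules could interact is sketched below under stated assumptions: the attribute-weighted distance is taken to be a weighted Euclidean distance, the cluster centers and attribute weights are assumed to be already available from the improved k-Means step, and a sample is treated as abnormal when its distance to the nearest center exceeds the preset outlier distance. The names `weighted_distance` and `clean` and the example values are hypothetical.

```python
from math import sqrt


def weighted_distance(sample, center, weights):
    """Attribute-weighted Euclidean distance (assumed form of the weighted criterion)."""
    return sqrt(sum(w * (s - c) ** 2 for s, c, w in zip(sample, center, weights)))


def clean(samples, centers, weights, outlier_distance):
    """Split samples into kept and removed by their distance to the nearest cluster center."""
    kept, removed = [], []
    for sample in samples:
        d = min(weighted_distance(sample, c, weights) for c in centers)
        (kept if d <= outlier_distance else removed).append(sample)
    return kept, removed


# Illustrative data: two tight clusters and one sample far from both centers.
samples = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.2, 10.0), (5.0, 30.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
weights = (0.5, 0.5)
kept, removed = clean(samples, centers, weights, outlier_distance=2.0)
```

In this example only the sample `(5.0, 30.0)` lies farther than the preset distance from both centers, so it is the one placed in `removed`; the `kept` list plays the role of the cleaning result passed on to the compensation module.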

Further, in the data clustering module 502 according to the embodiment of the present invention, the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm includes:

constructing a first clustering objective function;

clustering and solving the first clustering objective function according to the initial cluster centers to obtain a plurality of intermediate target cluster centers corresponding to the data to be cleaned;

constructing a second clustering objective function;

Further, in the data clustering module 502 according to the embodiment of the present invention, the first clustering objective function formula is expressed as:

wherein M represents the number of initial cluster centers, P denotes the set of data to be cleaned, $P_x$ represents any sample data in the data set P to be cleaned, $M_m$ represents the m-th initial cluster center, and $d(\cdot)$ represents the Euclidean distance.

Further, in the data clustering module 502 according to the embodiment of the present invention, the second clustering objective function formula is expressed as:

wherein X represents the number of attribute features of the data to be cleaned, N represents the number of intermediate target cluster centers, and $\lambda^{l}$ represents the attribute weight corresponding to the l-th class of data to be cleaned.

Further, in the data clustering module 502 according to the embodiment of the present invention, performing weight solution on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned includes:

solving the second clustering objective function for the weights according to the intermediate target cluster centers by means of a Lagrange function, to obtain the attribute weight corresponding to each attribute in the data to be cleaned.
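As an illustration of how a Lagrange-function solution for attribute weights typically proceeds, consider a generic attribute-weighted k-Means objective. The form below is an assumption and may differ from the second clustering objective function of the embodiment: the exponent $\beta > 1$, the per-attribute dispersion $D_l$ (total within-cluster squared deviation along attribute $l$, computed with the cluster centers held fixed) and the multiplier $\alpha$ are illustrative symbols not taken from the source.

```latex
% Assumed problem: minimise the weighted dispersion subject to the weights summing to one.
\min_{\lambda}\ \sum_{l=1}^{X} \left(\lambda^{l}\right)^{\beta} D_l
\quad \text{s.t.} \quad \sum_{l=1}^{X} \lambda^{l} = 1

% Lagrange function and stationarity condition for each attribute l:
\mathcal{L}(\lambda, \alpha) = \sum_{l=1}^{X} \left(\lambda^{l}\right)^{\beta} D_l
  + \alpha \left( \sum_{l=1}^{X} \lambda^{l} - 1 \right),
\qquad
\frac{\partial \mathcal{L}}{\partial \lambda^{l}}
  = \beta \left(\lambda^{l}\right)^{\beta - 1} D_l + \alpha = 0

% Eliminating \alpha with the constraint gives the closed-form weights (for D_l > 0):
\lambda^{l} = \frac{1}{\sum_{t=1}^{X} \left( D_l / D_t \right)^{1/(\beta - 1)}}
```

Under this assumed form, an attribute with a smaller within-cluster dispersion $D_l$ receives a larger weight, which matches the idea that attributes do not all influence the clustering to the same degree.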

Further, in the data compensation module 504 according to the embodiment of the present invention, a process of compensating the cleaning result based on the improved random forest algorithm includes:

interpolating the cleaning result to obtain an interpolation matrix;

for each column of the interpolation matrix, when the column is taken as the target filling column, the data of the remaining columns form a filling matrix; the constructed random forest regression model predicts from the filling matrix and the data corresponding to the target filling column to obtain a prediction result for the target filling column, and the prediction result is taken as the compensation result of the corresponding column.

Further, in the data compensation module 504 according to the embodiment of the present invention, interpolating the cleaning result to obtain an interpolation matrix includes:

In a third aspect, referring to fig. 5, an embodiment of the present invention further provides an electronic device, which includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete communication with each other through the communication bus 604,

a memory 603 for storing a computer program;

the processor 601 is configured to implement the steps of the data identification cleaning and compensation method according to the first aspect when executing the program stored in the memory 603.

The electronic device may be: desktop computers, laptop computers, intelligent mobile terminals, servers, and the like. Without limitation, any electronic device that can implement the present invention is within the scope of the present invention.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In a fourth aspect, corresponding to the data identification cleaning and compensation method provided in the first aspect, an embodiment of the present invention further provides a computer-readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements the steps of the data identification cleaning and compensation method provided in the embodiment of the present invention.

The computer-readable storage medium stores an application program that, when run, executes the data identification cleaning and compensation method provided by the embodiment of the invention.

For the apparatus/electronic device/storage medium embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.

It should be noted that the apparatus, the electronic device and the storage medium according to the embodiments of the present invention are, respectively, an apparatus, an electronic device and a storage medium to which the above data identification cleaning and compensation method is applied; all embodiments of that method are applicable to the apparatus, the electronic device and the storage medium and can achieve the same or similar beneficial effects.

In the description of the present invention, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A data identification cleaning and compensation method is characterized by comprising the following steps:

2. The data identification cleaning and compensation method of claim 1, wherein the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm comprises:

constructing a first clustering objective function;

constructing a second clustering objective function;

3. The data identification cleaning and compensation method of claim 2, wherein the first clustering objective function formula is constructed as:

wherein X represents the number of attribute features of the data to be cleaned, M represents the number of initial cluster centers, P represents the data set to be cleaned, $P_x$ represents any sample data in the data set P to be cleaned, $M_m$ represents the m-th initial cluster center, and $d(\cdot)$ represents the Euclidean distance.

4. The data identification cleaning and compensation method of claim 2, wherein the second clustering objective function formula is constructed as:

5. The data identification cleaning and compensation method according to claim 2, wherein performing weight solution on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned comprises:

6. The data identification cleaning and compensation method of claim 1, wherein the process of compensating the cleaning result based on the improved random forest algorithm comprises:

interpolating the cleaning result to obtain an interpolation matrix;

7. The data identification cleaning and compensation method of claim 6, wherein interpolating the cleaning result to obtain an interpolation matrix comprises:

8. An abnormal data identification, cleaning and compensation device, characterized by comprising:

9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;

the memory is used for storing a computer program;

the processor is used for realizing the steps of the data identification cleaning and compensation method of any one of claims 1 to 7 when executing the program stored in the memory.

10. A computer-readable storage medium, characterized in that,

the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the steps of the data identification cleaning and compensation method of any one of claims 1 to 7.