CN114968992A - Data identification cleaning and compensation method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114968992A
Authority
CN
China
Prior art keywords
data
clustering
cleaned
cleaning
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210422231.6A
Other languages
Chinese (zh)
Inventor
刘怀亮
张静
杨斌
赵舰波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Zhile Technology Co ltd
Original Assignee
Xi'an Zhile Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Zhile Technology Co ltd
Priority to CN202210422231.6A
Publication of CN114968992A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data identification cleaning and compensation method and device, an electronic device, and a storage medium. The method comprises: acquiring data to be cleaned, the data to be cleaned having a plurality of attribute features; clustering the data to be cleaned based on an improved k-Means clustering algorithm, wherein the improved k-Means clustering algorithm modifies the clustering criterion function by introducing attribute weights for the plurality of attribute features of the data to be cleaned; removing abnormal data from the clustering result according to a preset abnormal-data outlier distance to obtain a cleaning result; and compensating the cleaning result based on an improved random forest algorithm to obtain a compensation result, wherein the improved random forest algorithm optimizes the input data of the random forest by interpolation. The invention considers data cleaning and compensation jointly and can thereby achieve better data cleaning and compensation results.

Description

Data identification cleaning and compensation method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of computer data cleaning, and particularly relates to a data identification cleaning and compensation method and device, electronic equipment and a storage medium.
Background
Data is currently being generated at unprecedented speed. With the widespread use of data acquisition devices, large amounts of data are accumulated and used around the clock. As Internet of Things applications spread through daily life and production, systems are becoming increasingly data-centric. Densely deployed sensor nodes generate large amounts of sensor data, and abnormal values often occur because node energy is limited, the monitoring environment is complex, and nodes are vulnerable to external attack. Owing to various random factors, the acquired data is inevitably mixed with abnormal data that cannot be converted into useful information. Improving data quality is therefore key to improving the efficiency of data utilization, and cleaning the acquired data with an abnormal-data cleaning algorithm designed for the Internet of Things, so that the data is reliable and accurate, is a key problem in Internet of Things data analysis.
There are many existing studies on data cleaning. For example, Yan Yingjie, Sheng Gehao, Chen Yufeng, et al., in "Method for cleaning big data of power transmission and transformation equipment state based on time series analysis [J]. Automation of Electric Power Systems, 2015(7):7", classify the abnormal data of power transmission and transformation equipment and clean transformer and line data with a double-loop iterative inspection method based on time series analysis to obtain higher-quality data. Yang Donghua, Li Ningning, Wang Hongzhi, et al., in "Parallel big data cleaning process optimization based on task merging [J]. Chinese Journal of Computers, 2016, 39(1):12", study the cleaning of parallel data and propose an optimization technique based on task merging that merges redundant computations and computations on the same file, reducing system running time and improving data cleaning efficiency. Dai Jiejie, Song Hui, Yang Yi, et al. propose "A method for cleaning state data of power transmission and transformation equipment based on a stacked denoising autoencoder [J]. Automation of Electric Power Systems, 2017, 041(012): 224-."
However, the methods above only clean the data. Cleaning inevitably involves a loss of data quality, and if the cleaned data is not effectively compensated, the room for quality improvement remains limited even though cleaning improves quality to a certain extent.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a data identification cleaning and compensation method, device, electronic device and storage medium. The technical problem to be solved by the invention is realized by the following technical scheme:
In a first aspect, an embodiment of the present invention provides a data identification cleaning and compensation method, including:
acquiring data to be cleaned; wherein the data to be cleaned has a plurality of attribute features;
clustering the data to be cleaned based on an improved k-Means clustering algorithm; wherein the improved k-Means clustering algorithm modifies the clustering criterion function by introducing attribute weights for the plurality of attribute features of the data to be cleaned;
removing abnormal data from the clustering result according to a preset abnormal-data outlier distance to obtain a cleaning result;
compensating the cleaning result based on an improved random forest algorithm to obtain a compensation result; wherein the improved random forest algorithm optimizes the input data of the random forest by interpolation.
In an embodiment of the present invention, the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm includes:
randomly acquiring a plurality of data from the data to be cleaned as initial clustering centers;
constructing a first clustering objective function;
clustering and solving the first clustering objective function according to the initial cluster centers to obtain a plurality of intermediate target cluster centers corresponding to the data to be cleaned;
constructing a second clustering objective function;
performing weight solving on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned;
and clustering and solving the second clustering objective function according to the attribute weights to obtain a plurality of target cluster centers corresponding to the data to be cleaned.
In one embodiment of the present invention, the first clustering objective function formula is constructed as:
Figure BDA0003608354970000031
wherein X represents the number of attribute features of the data to be cleaned, M represents the number of initial cluster centers, P represents the data set to be cleaned, P_x represents any sample in the data set P to be cleaned, M_m represents the m-th initial cluster center, and d(·) represents the Euclidean distance.
In one embodiment of the present invention, the second clustering target function formula is constructed as:
Figure BDA0003608354970000032
wherein X represents the number of attribute features of the data to be cleaned, N represents the number of intermediate target cluster centers, λ_l represents the attribute weight corresponding to the l-th class of data to be cleaned,
Figure BDA0003608354970000033
Figure BDA0003608354970000034
represents the sample set of the l-th class of data to be cleaned, x_l represents the number of samples in that sample set,
Figure BDA0003608354970000035
represents the n-th intermediate target cluster center corresponding to the l-th class of data to be cleaned, and ‖·‖_2 denotes the least squares error.
In an embodiment of the present invention, performing weight solving on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned includes:
solving the weights of the second clustering objective function according to the intermediate target cluster centers by means of a Lagrangian function to obtain the attribute weight corresponding to each attribute in the data to be cleaned.
In an embodiment of the invention, the process of compensating the cleaning result based on the improved random forest algorithm comprises:
interpolating the cleaning result to obtain an interpolation matrix;
constructing a random forest regression model according to the interpolation matrix;
for each column of the interpolation matrix, when that column is taken as the target filling column, the data of the remaining columns form a filling matrix; the constructed random forest regression model predicts from the filling matrix and the data corresponding to the target filling column to obtain a prediction result for the target filling column, and the prediction result is taken as the compensation result of that column.
In an embodiment of the present invention, interpolating the cleaning result to obtain an interpolation matrix includes:
interpolating the cleaning result by linear interpolation to obtain an interpolation matrix; wherein the interpolation matrix has the same size as the data to be cleaned.
In a second aspect, an embodiment of the present invention provides an abnormal data identification, cleaning and compensation apparatus, including:
a data acquisition module, configured to acquire data to be cleaned; wherein the data to be cleaned has a plurality of attribute features;
a data clustering module, configured to cluster the data to be cleaned based on an improved k-Means clustering algorithm; wherein the improved k-Means clustering algorithm modifies the clustering criterion function by introducing attribute weights for the plurality of attribute features of the data to be cleaned;
a data removing module, configured to remove abnormal data from the clustering result according to a preset abnormal-data outlier distance to obtain a cleaning result;
a data compensation module, configured to compensate the cleaning result based on an improved random forest algorithm to obtain a compensation result; wherein the improved random forest algorithm optimizes the input data of the random forest by interpolation.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the steps of any of the above data identification cleaning and compensation methods when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the steps of any of the above data identification cleaning and compensation methods are implemented.
The invention has the beneficial effects that:
The invention provides a data identification cleaning and compensation method that considers data cleaning and compensation jointly and can achieve better data cleaning and compensation results. Specifically, in the data cleaning process, the invention exploits the efficient clustering of the k-Means clustering algorithm, takes into account that the attribute features of the data to be cleaned do not all influence the data equally, and introduces attribute weights to form an improved k-Means clustering algorithm that identifies abnormal data quickly and accurately and improves the cleaning quality of the data. In the data compensation process, the invention exploits the fast regression of the random forest algorithm and uses the cleaning result of the improved k-Means clustering algorithm as input data. Building on the better cleaning result of the improved k-Means clustering algorithm, the input data of the random forest algorithm is interpolated, and the interpolated data takes the attribute features of the data to be cleaned into account, so that it better matches the attributes of the samples and meets the data requirements of the algorithm. This improves the data compensation precision and further improves the data cleaning quality.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a schematic flow chart of a data recognition cleaning and compensation method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of clustering data to be cleaned based on an improved k-Means clustering algorithm in the data identification cleaning and compensation method according to the embodiment of the present invention;
FIG. 3 is a schematic flow chart of compensating a cleaning result based on an improved random forest algorithm to obtain a compensation result in the data identification cleaning and compensation method provided by the embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an abnormal data recognition, cleaning and compensation apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
In order to improve the cleaning quality of the data to be cleaned, the embodiment of the invention provides a data identification cleaning and compensation method, a data identification cleaning and compensation device, electronic equipment and a storage medium.
In a first aspect, referring to fig. 1, an embodiment of the present invention provides a data identification cleaning and compensation method, including the following steps:
S10, acquiring data to be cleaned; wherein the data to be cleaned has a plurality of attribute features.
Specifically, the data to be cleaned may be acquired in any scenario. As the volume of acquired data grows, so does the variety of data types; that is, the data to be cleaned has multiple attribute features. For example, the data to be cleaned may concern animals, plants, people, or other types of data, and each type of data has its own characteristics. Features belonging to a given type are very important for identifying that type, whereas features of other types are unimportant for it, and introducing them improperly may even cause that type of data to be misidentified. The data should therefore be identified by comprehensively considering its attribute features.
S20, clustering the data to be cleaned based on an improved k-Means clustering algorithm; wherein the improved k-Means clustering algorithm modifies the clustering criterion function by introducing attribute weights for the plurality of attribute features of the data to be cleaned.
Specifically, based on the above analysis, the embodiment of the present invention cleans the data to be cleaned by using its attribute features. The k-Means clustering algorithm is used in the cleaning process. The inventor's analysis shows that the traditional k-Means clustering algorithm treats every attribute feature of the data to be cleaned as having the same influence on the data, whereas in practice the influence of each attribute feature differs, especially across different types of data, which degrades the clustering result. Based on this analysis, the embodiment of the present invention clusters the data to be cleaned with an improved k-Means clustering algorithm, which modifies the clustering criterion function by introducing attribute weights for the plurality of attribute features of the data to be cleaned. Specifically, referring to fig. 2, the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm includes the following steps:
S101, randomly acquiring a plurality of data from the data to be cleaned as initial cluster centers; for example, the number of initial cluster centers is M.
S102, constructing a first clustering objective function.
Specifically, the first clustering objective function formula constructed in the embodiment of the present invention is expressed as:
Figure BDA0003608354970000071
wherein X represents the number of attribute features of the data to be cleaned, M represents the number of initial cluster centers, P represents the data set to be cleaned, P_x represents any sample in the data set P to be cleaned, M_m represents the m-th initial cluster center, and d(·) represents the Euclidean distance. The first clustering objective function therefore does not take the attribute features of the data to be cleaned into account.
S103, clustering and solving the first clustering objective function according to the initial cluster centers to obtain a plurality of intermediate target cluster centers corresponding to the data to be cleaned.
Specifically, the embodiment of the present invention uses the first clustering objective function of formula (1) and continuously adjusts the initial cluster centers until V_1(X) reaches its minimum. Attribute features are not considered in this clustering pass, but the adjustment yields a plurality of intermediate target cluster centers that are more accurate than the initial cluster centers and that serve as the starting point for the subsequent clustering with attribute features.
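The first clustering pass can be illustrated with a short Python sketch. It assumes that formula (1) is the usual k-Means criterion, namely the sum of squared Euclidean distances from each sample to its nearest center; the formula image is not reproduced in this text, so that form and the names `objective_v1` and `kmeans` are illustrative assumptions, not the patent's own implementation.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance d(.) between two samples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def objective_v1(points, centers):
    """Assumed form of V_1: squared distance of each sample to its nearest center."""
    return sum(min(euclidean(p, c) ** 2 for c in centers) for p in points)


def kmeans(points, m, iters=50, seed=0):
    """Plain k-Means: returns the intermediate target cluster centers."""
    rng = random.Random(seed)
    # S101: pick m samples at random as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, m)]
    for _ in range(iters):
        clusters = [[] for _ in range(m)]
        for p in points:
            nearest = min(range(m), key=lambda j: euclidean(p, centers[j]))
            clusters[nearest].append(p)
        # S103: move each center to the mean of its members (keep it if empty).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # centers stopped moving, so the objective cannot drop further
        centers = new_centers
    return centers
```

On two well-separated groups such as `[[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]` with `m=2`, the returned centers settle near the two group means regardless of which samples are drawn as the initial centers.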
S104, constructing a second clustering objective function.
Specifically, because every attribute feature has the same influence in the conventional k-Means clustering algorithm, the clustering cannot reach its best effect. The embodiment of the present invention therefore introduces attribute weights to bring out the influence of each attribute feature in clustering, and constructs a second clustering objective function, thereby changing the clustering criterion function. The second clustering objective function is expressed as:
Figure BDA0003608354970000081
wherein X represents the number of attribute features of the data to be cleaned, N represents the number of intermediate target cluster centers, λ_l represents the attribute weight corresponding to the l-th class of data to be cleaned,
Figure BDA0003608354970000082
Figure BDA0003608354970000083
represents the sample set of the l-th class of data to be cleaned, x_l represents the number of samples in that sample set,
Figure BDA0003608354970000084
represents the n-th intermediate target cluster center corresponding to the l-th class of data to be cleaned, and ‖·‖_2 denotes the least squares error.
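How attribute weights change the clustering criterion can be shown with a small Python sketch. Because the image of formula (2) is not reproduced in this text, the sketch assumes one common reading: each attribute's squared deviation is multiplied by its weight before the per-sample distances are summed. The function names and the weight vector `weights` are hypothetical.

```python
def weighted_sq_dist(p, c, weights):
    """Squared distance in which attribute l contributes weights[l] * (p_l - c_l)^2."""
    return sum(w * (pi - ci) ** 2 for w, pi, ci in zip(weights, p, c))


def objective_v2(points, centers, weights):
    """Assumed form of V_2: weighted squared distance to the nearest center, summed."""
    return sum(
        min(weighted_sq_dist(p, c, weights) for c in centers) for p in points
    )


def assign(points, centers, weights):
    """Index of the nearest center for each sample under the weighted distance."""
    return [
        min(range(len(centers)), key=lambda j: weighted_sq_dist(p, centers[j], weights))
        for p in points
    ]
```

For the sample `[0, 9]` and centers `[[0, 0], [5, 9]]`, equal weights `[0.5, 0.5]` assign the sample to the second center, while weights `[0.9, 0.1]` that play down the second attribute assign it to the first. This is the sense in which the weights alter the clustering result.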
S105, performing weight solving on the second clustering objective function according to the intermediate target cluster centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned.
Specifically, in the embodiment of the present invention, a weighted arithmetic mean method may be used to solve formula (2) for the weights, obtaining the attribute weight corresponding to each attribute in the data to be cleaned.
The inventor found that a data set is usually cleaned in order to solve a problem with specific constraints. In that case, although the clustering result of the improved k-Means clustering algorithm is better than that of the traditional k-Means clustering algorithm, the constrained problem is complex to solve: the weighted arithmetic mean method is slow and neglects calculation precision, so the clustering result still falls short of the optimal solution. When identifying and cleaning a data set under such specific constraints, the embodiment of the present invention therefore solves the weights of the second clustering objective function, according to the intermediate target cluster centers, by means of a Lagrangian function to obtain the attribute weight corresponding to each attribute in the data to be cleaned. Specifically, the Lagrangian function is used to calculate the attribute weight λ of each attribute feature in formula (2) under the specific constraint. This converts the constrained extremum problem into an unconstrained one, and the extremum of the unconstrained Lagrangian function is the extremum of the original constrained function, so the attribute weights can be solved effectively and quickly. The attribute weights obtained with the Lagrangian function are more accurate than those obtained with the weighted arithmetic mean method, and more accurate attribute weights lead to a more accurate subsequent clustering result.
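The text does not state the constraint or the resulting closed form, so the following Python sketch only illustrates the Lagrangian technique under stated assumptions. It assumes the weights must sum to 1 and enter the objective as λ_l^β with β > 1 (with purely linear weights the constrained minimum would put all weight on a single attribute). Writing D_l for the within-cluster squared deviation of attribute l, setting the derivative of Σ_l λ_l^β·D_l − μ(Σ_l λ_l − 1) to zero gives λ_l proportional to D_l^(−1/(β−1)). The function name, `beta`, and `eps` are hypothetical.

```python
def attribute_weights(points, labels, centers, beta=2.0, eps=1e-12):
    """Closed-form weights from the Lagrangian, assuming sum(weights) == 1.

    points  : list of samples (each a list of attribute values)
    labels  : cluster index of each sample
    centers : current cluster centers
    """
    if beta <= 1.0:
        raise ValueError("beta must be greater than 1 for this closed form")
    n_attr = len(points[0])
    # D_l: squared deviation of attribute l from the assigned cluster center.
    dispersion = [0.0] * n_attr
    for p, k in zip(points, labels):
        for l in range(n_attr):
            dispersion[l] += (p[l] - centers[k][l]) ** 2
    # Stationary point of the Lagrangian: weight_l ~ D_l^(-1/(beta-1)).
    # eps guards against division by zero for attributes with no spread.
    raw = [(1.0 / (d + eps)) ** (1.0 / (beta - 1.0)) for d in dispersion]
    total = sum(raw)
    return [r / total for r in raw]
```

With `beta=2.0`, an attribute whose dispersion is four times larger receives a weight four times smaller, so attributes along which the clusters are compact dominate the next clustering pass.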
S106, clustering and solving the second clustering objective function according to the attribute weights to obtain a plurality of target cluster centers corresponding to the data to be cleaned.
Specifically, in the embodiment of the present invention, different attribute weights are assigned to different attribute features through S105, and the second clustering objective function of formula (2) is used to continuously adjust the intermediate target cluster centers until V_2(X) reaches its minimum. Because the attribute weights are included in the clustering process, the influence of the attribute features on data clustering is fully considered, so the target cluster centers obtained through this adjustment are more accurate than the intermediate target cluster centers and are better suited to the final clustering and identification. The number of target cluster centers obtained at this point is O.
S30, removing abnormal data from the clustering result according to the preset abnormal-data outlier distance to obtain a cleaning result.
Specifically, using the O target cluster centers obtained in S106, the embodiment of the present invention calculates the distance between each item of data to be cleaned and each target cluster center and compares that distance with the preset abnormal-data outlier distance. If the distance is greater than the preset outlier distance, the corresponding data is determined to be abnormal and is removed; if the distance is less than or equal to the preset outlier distance, the corresponding data is determined to be normal and is retained and clustered, giving the final cleaning result. With more accurate target cluster centers, abnormal data can thus be accurately identified and removed, which improves the cleaning quality of the data to be cleaned. The abnormal-data outlier distance can be set according to actual requirements.
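The thresholding step can be sketched in a few lines of Python. The sketch assumes that a sample is judged by its distance to the nearest target cluster center, which is one reading of the comparison described above; `remove_outliers` and `max_dist` (the preset outlier distance) are hypothetical names.

```python
import math


def remove_outliers(points, centers, max_dist):
    """Split samples into (kept, removed) by distance to the nearest center."""
    kept, removed = [], []
    for p in points:
        nearest = min(math.dist(p, c) for c in centers)
        # Greater than the preset outlier distance: abnormal, so remove it.
        # Less than or equal to it: normal, so keep it.
        if nearest > max_dist:
            removed.append(p)
        else:
            kept.append(p)
    return kept, removed
```

For example, with a single center at `[0, 0]` and `max_dist=2.0`, the samples `[0, 0]` and `[1, 0]` are kept and `[9, 9]` is removed.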
S40, compensating the cleaning result based on the improved random forest algorithm to obtain a compensation result; wherein the improved random forest algorithm optimizes the input data of the random forest by interpolation.
Specifically, the processing of S30 achieves data cleaning, but removing data inevitably causes varying degrees of data loss, which affects the cleaning quality. The cleaned data therefore also needs to be compensated after cleaning. In a typical compensation process, the cleaned data is used directly as input and compensated with a random forest algorithm, which achieves a certain compensation effect.
However, when the conventional random forest algorithm is used directly, it does not solve regression problems as well as it solves classification problems. Improper initial input data can seriously affect its initial prediction, and during regression the initial prediction in turn affects the random forest's prediction for each sample in the input data set, which may cause overfitting in the regression prediction of some data and so degrade the overall prediction. It is therefore important to consider the initial input data before compensating with the random forest algorithm. The embodiment of the present invention uses the cleaned data to construct data carrying more prior information as the input of the random forest and compensates the cleaning result based on this improved random forest algorithm. Specifically, as shown in fig. 3, the process of compensating the cleaning result based on the improved random forest algorithm includes the following steps:
S401, interpolating the cleaning result to obtain an interpolation matrix.
Specifically, the embodiment of the present invention does not fix the interpolation method used on the cleaning result; linear interpolation, polynomial interpolation, Newton interpolation, or the like may be used. Here the cleaning result is interpolated by linear interpolation to obtain an interpolation matrix; the interpolation positions are random, and each interpolated value is a weighted average of the adjacent data.
For example, the data to be cleaned is recorded as P and has X attribute features, the attribute features correspond to x_1, x_2, x_3, ……, x_X sample data respectively, and the sample set corresponding to each attribute feature is recorded as:
Figure BDA0003608354970000101
The total number of samples corresponding to all attribute features is x_1+x_2+x_3+……+x_X, and the data P to be cleaned can then be formulated as:
Figure BDA0003608354970000102
The cleaning result obtained by the removal in S30, that is, the cleaned data, is recorded as P'. Because the number of target cluster centers is O, the target cluster centers correspond to y_1, y_2, y_3, ……, y_O sample data respectively, and the sample set corresponding to each target cluster center is recorded as:
Figure BDA0003608354970000103
The total number of samples after cleaning is y_1+y_2+y_3+……+y_O, and the cleaned data P' can then be formulated as:
Figure BDA0003608354970000104
The data P to be cleaned expressed by formula (3) is converted into an M_1×N_1 matrix storage form, where M_1×N_1 = x_1+x_2+x_3+……+x_X; during the conversion, data to be cleaned that share the same attribute feature are stored adjacently.
Similarly, all the cleaned data P' expressed by formula (4) is converted into an M_1×N_2 matrix storage form, where M_1×N_2 = y_1+y_2+y_3+……+y_O. The data stored in the M_1×N_2 matrix is expressed as:
Figure BDA0003608354970000111
wherein, similarly, cleaned data P' having the same attribute feature are stored adjacently during the conversion. In formula (5),
Figure BDA0003608354970000112
each sample set contains $M_1$ sample data. The $M_1 \times N_2$ matrix P' converted from the cleaned data is linearly interpolated to form an interpolation matrix B. The interpolation positions are random, and at each position the value to be inserted is obtained from the known adjacent values in the cleaned data P' by weighted averaging of the adjacent data. Data are inserted only to the extent needed: the inventors found that, to keep the interpolation matrix consistent with the $M_1 \times N_1$ matrix converted from the data to be cleaned, the interpolation matrix must have size $M_1 \times N_1$. The interpolation matrix B formed after the data are inserted can then be expressed as:
Figure BDA0003608354970000113
wherein
Figure BDA0003608354970000114
denotes the inserted data. The interpolation matrix B of formula (6) is the input data used when the random forest regression model is constructed with the random forest algorithm.
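The interpolation step of S401 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's exact procedure: the function name `build_interpolation_matrix`, the choice to insert whole columns, the equal default weights, and the rule that an edge position copies its only neighbour are all hypothetical, since the text only says that positions are random and values are weighted averages of adjacent data.

```python
import random


def build_interpolation_matrix(cleaned, n_target_cols, weights=(0.5, 0.5), seed=0):
    """Grow an M1 x N2 matrix (list of rows) to M1 x n_target_cols.

    Assumptions (not taken from the patent text): whole columns are
    inserted at random positions, and every inserted entry is the
    weighted average of its left and right neighbours in the same row.
    A position at either edge simply copies its single neighbour.
    """
    rng = random.Random(seed)
    rows = [list(r) for r in cleaned]  # work on a copy
    n_cols = len(rows[0])
    if n_target_cols < n_cols:
        raise ValueError("target width must be >= current width")
    w_left, w_right = weights
    while n_cols < n_target_cols:
        # Insert before index pos; pos == n_cols appends at the end.
        pos = rng.randint(0, n_cols)
        for row in rows:
            if pos == 0:
                value = row[0]
            elif pos == n_cols:
                value = row[-1]
            else:
                value = (w_left * row[pos - 1] + w_right * row[pos]) / (w_left + w_right)
            row.insert(pos, value)
        n_cols += 1
    return rows


# Usage: a 2 x 2 cleaned matrix P' grown back to the 2 x 5 size of P.
B = build_interpolation_matrix([[1.0, 3.0], [2.0, 6.0]], 5)
```

Because every inserted value lies between its neighbours, the interpolated rows stay within the range of the cleaned data, which matches the text's intent that the inserted data follow the attributes of adjacent samples.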
S402, a random forest regression model is built according to the interpolation matrix.
Specifically, the interpolation matrix B generated by formula (6) is used as the training matrix, and a subset is randomly drawn from the training matrix B as the training sample of the root node of a regression tree, thereby completing the construction of the random forest regression model. The detailed construction of the random forest regression model follows existing methods. The key point of the embodiment of the invention is that the training data used to construct the model is the interpolation matrix B: it better matches the form of the original data P to be cleaned, and because the attribute features of adjacent data are considered when B is formed, the interpolated data at the missing positions better match the attributes of the samples, satisfy the data requirements of the algorithm, and improve the data compensation accuracy.
S403, for each column of the interpolation matrix, when the column is taken as the target filling column, the data of the remaining columns form a filling matrix; the constructed random forest regression model predicts from the filling matrix and the data corresponding to the target filling column to obtain a prediction result for the target filling column, and the prediction result is taken as the compensation result of the corresponding column.
Specifically, regression prediction is performed on each column of the interpolation matrix B using the random forest regression model constructed in step S402, in the manner stated in S403.
For example, taking the i-th column of the interpolation matrix B as the target filling column, the filling matrix is formed by the columns other than the target filling column. For ease of understanding, the embodiment of the present invention re-expresses the interpolation matrix B of formula (6) as:
Figure BDA0003608354970000121
For formula (7), the data corresponding to the target filling column is $b_i'$, and the filling matrix can then be expressed as:
Figure BDA0003608354970000122
The interpolation matrix B is re-expressed as formula (7) here only to show the form of the filling matrix more clearly; the data relationship between formula (6) and formula (7) need not be examined.
The data $b_i'$ corresponding to the target filling column is taken as the regression output of the random forest regression model, and the filling matrix
Figure BDA0003608354970000123
is taken as the input of the random forest regression model; the interpolated data of the i-th column are predicted to obtain the predicted value of the i-th column. To improve prediction accuracy, following the ensemble idea, the interpolated data of the i-th column are predicted multiple times, and the average of the predicted values is taken as the final interpolation of the i-th column, denoted $a_i'$. That is, the i-th column data $b_i'$ in formula (7) is replaced with the current predicted value $a_i'$, expressed as
Figure BDA0003608354970000124
This completes the prediction of the i-th column of interpolated data.
Similarly, each of the other columns of the interpolation matrix B is taken in turn as the target filling column: the data corresponding to the target filling column is the regression output of the random forest regression model, the corresponding filling matrix is its input, and the interpolated data of that column are predicted multiple times to obtain the predicted value of the column. Once all data in the interpolation matrix B have been predicted, the compensation of the cleaned data P' is complete.
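The column-by-column loop of S403 can be sketched as below. To keep the example dependency-free, a small ensemble of bootstrapped depth-1 regression trees stands in for a full random forest; that substitution, the per-column refitting, the function names (`fit_stump`, `fit_forest`, `compensate`) and the parameters `n_repeats` and `n_trees` are assumptions made for illustration, not details given in the text.

```python
import random


def fit_stump(X, y, feature_ids):
    """Depth-1 regression tree: best (feature, threshold) by squared error."""
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, ml, mr)
    if best is None:  # no usable split: fall back to the mean
        mean = sum(y) / len(y)
        return lambda row: mean
    _, f, thr, ml, mr = best
    return lambda row: ml if row[f] <= thr else mr


def fit_forest(X, y, n_trees, rng):
    """Bagged stumps with random feature subsets (simplified stand-in)."""
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feats = rng.sample(range(d), k)             # random feature subset
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return lambda row: sum(t(row) for t in trees) / len(trees)


def compensate(B, n_repeats=3, n_trees=20, seed=0):
    """Replace each column b_i' of B by the mean of repeated predictions a_i'."""
    rng = random.Random(seed)
    result = [list(r) for r in B]
    for i in range(len(B[0])):
        # Filling matrix: every column except the target filling column i.
        X = [[v for j, v in enumerate(row) if j != i] for row in B]
        y = [row[i] for row in B]  # target filling column b_i'
        sums = [0.0] * len(B)
        for _ in range(n_repeats):  # ensemble idea: predict several times
            model = fit_forest(X, y, n_trees, rng)
            for r, row in enumerate(X):
                sums[r] += model(row)
        for r in range(len(B)):
            result[r][i] = sums[r] / n_repeats
    return result
```

In this sketch every column is predicted from the original matrix B rather than from columns already replaced; whether the embodiment updates columns in place is not stated, so that ordering is also an assumption.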
In order to verify the effectiveness of the abnormal data identification, cleaning and compensation method provided by the embodiment of the invention, the following experiment is performed.
The methods in the comparative experiment are: the traditional k-Means clustering algorithm + support vector machine algorithm; the traditional k-Means clustering algorithm + traditional random forest algorithm; and the improved k-Means clustering algorithm + improved random forest algorithm. The data to be cleaned in the experiment is divided into five test sample sets, and each of the three methods is used to clean and compensate the data. For the final compensation result of each of the three data cleaning and compensation algorithms, the root mean square error (RMSE), the mean absolute error (MAE), the coefficient of determination R and the algorithm running time t are calculated, and the cleaning and compensation capabilities of the three algorithms are evaluated and analyzed according to these four indices. Wherein,
RMSE represents the sample standard deviation of the differences between the predicted values (compensation results) and the observed values (data samples to be cleaned); it describes the degree of dispersion of the samples, and its calculation formula can be expressed as:
Figure BDA0003608354970000131
wherein N' represents the total number of samples of data to be cleaned, i.e. $M_1 \times N_1$; $y_i$ represents the i-th sample of data to be cleaned; and $y_{pi}$ represents the predicted estimate of the i-th sample of data to be cleaned.
MAE represents the average of absolute errors between the predicted values and the observed values, and the calculation formula can be expressed as:
Figure BDA0003608354970000132
R represents the degree of fit between the curves formed by the predicted values and the observed values respectively, and its calculation formula can be expressed as:
Figure BDA0003608354970000133
wherein $y_a$ denotes the mean of all samples.
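The three accuracy indices can be computed as in the following sketch. The formula images are not reproduced in this text, so the code assumes the standard definitions that match the symbols above ($y_i$, $y_{pi}$, $y_a$, N'): RMSE divides by N' rather than N' - 1, and R is taken to be the usual coefficient of determination. The function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean((y_i - y_pi)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: mean(|y_i - y_pi|)."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with y_a the mean."""
    y_a = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_a) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Usage: observed samples versus a compensation result with one large error.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = (rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))
```

A smaller RMSE or MAE and a larger R indicate a compensation result closer to the observed data, which is how the indices are read in the comparison that follows.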
The cleaning and compensation capability evaluation results corresponding to the three data cleaning and compensation algorithms are shown in table 1.
Table 1 evaluation results of cleaning and compensating abilities corresponding to three data cleaning and compensating algorithms
Figure BDA0003608354970000141
As can be seen from Table 1, compared with the traditional k-Means clustering algorithm + support vector machine algorithm and the traditional k-Means clustering algorithm + traditional random forest algorithm, the method provided by the invention has smaller error fluctuation and a better fitting trend. Specifically: in the MAE evaluation, the MAE value after data cleaning and compensation with the proposed method is smaller, so its average cleaning and compensation effect is better. In the RMSE and R evaluation, the traditional k-Means clustering algorithm + support vector machine algorithm has the smaller RMSE value and larger R value on test sample 1, i.e. higher accuracy and better trend fitting on that sample, whereas the proposed method has the smaller RMSE value and larger R value on test samples 2, 3, 4 and 5. In conclusion, the proposed method cleans and compensates data better overall than the two comparison methods, with higher precision and better trend fitting, and is better suited to scenarios in which cleaning the data has a large influence on data quality.
In summary, the abnormal data identification, cleaning and compensation method provided in the embodiment of the present invention closely combines data cleaning with data compensation and obtains better cleaning and compensation effects. Specifically: in the data cleaning process, the efficient clustering of the k-Means clustering algorithm is exploited while fully taking into account that the attribute features of the data to be cleaned do not all influence the data to the same degree; attribute weights are introduced to give an improved k-Means clustering algorithm that identifies abnormal data quickly and accurately and improves the cleaning quality of the data. In the data compensation process, the fast regression of the random forest algorithm is exploited: the cleaning result of the improved k-Means clustering algorithm, which is already a better cleaning result, is interpolated to form the input data of the random forest algorithm. Because the interpolated data take the attribute features of the data to be cleaned into account, they better fit the attributes of the samples and satisfy the data requirements of the algorithm, which improves the data compensation accuracy and further improves the data cleaning quality.
In a second aspect, referring to fig. 4, an embodiment of the invention provides an abnormal data identification, cleaning and compensation apparatus, including:
a data obtaining module 501, configured to obtain data to be cleaned; the data to be cleaned has a plurality of attribute characteristics;
a data clustering module 502, configured to cluster the data to be cleaned based on an improved k-Means clustering algorithm; the improved k-Means clustering algorithm is a mode of introducing attribute weights to a plurality of attribute features of data to be cleaned to change a clustering criterion function;
the data removing module 503 is configured to remove abnormal data in the clustering result according to a preset abnormal data outlier distance to obtain a cleaning result;
a data compensation module 504, configured to compensate the cleaning result based on an improved random forest algorithm to obtain a compensation result; the improved random forest algorithm is a mode of optimizing random forest input data by adopting an interpolation method.
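The removal step performed by module 503 can be sketched as follows, assuming that the "preset abnormal data outlier distance" is a threshold on the Euclidean distance from a sample to its nearest target cluster center. That reading, the function name `remove_outliers` and the parameter `max_distance` are assumptions for illustration only.

```python
import math


def remove_outliers(points, centers, max_distance):
    """Split points into (kept, removed) by distance to the nearest center.

    Assumption: a sample is abnormal when its Euclidean distance to the
    closest cluster center exceeds the preset threshold max_distance.
    """
    kept, removed = [], []
    for p in points:
        nearest = min(math.dist(p, c) for c in centers)
        (kept if nearest <= max_distance else removed).append(p)
    return kept, removed


# Usage: one sample far from the only cluster center is removed.
kept, removed = remove_outliers([(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)], [(0.0, 0.0)], 1.0)
```

The kept samples form the cleaning result that module 504 then compensates.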
Further, in the data clustering module 502 according to the embodiment of the present invention, the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm includes:
randomly acquiring a plurality of data from the data to be cleaned as initial clustering centers;
constructing a first clustering objective function;
clustering and solving the first clustering objective function according to the initial clustering centers to obtain a plurality of intermediate objective clustering centers corresponding to the data to be cleaned;
constructing a second clustering objective function;
performing weight solving on the second clustering objective function according to the intermediate target clustering centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned;
and clustering and solving the second clustering objective function according to the attribute weights to obtain a plurality of target clustering centers corresponding to the data to be cleaned.
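The two-stage clustering listed above can be sketched as follows. This is an illustration under explicit assumptions rather than the embodiment's algorithm: the second objective is assumed to weight each attribute's squared error by the square of its attribute weight, and the weights are assumed to be inversely proportional to each attribute's within-cluster dispersion and to sum to 1. All function names are hypothetical.

```python
import random


def assign(points, centers, weights):
    """Label each point with the index of its nearest center (weighted distance)."""
    labels = []
    for p in points:
        dists = [sum((w ** 2) * (a - b) ** 2 for w, a, b in zip(weights, p, c))
                 for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Move each center to the mean of its members; keep empty clusters in place."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(list(old))
    return new_centers


def kmeans(points, centers, weights, n_iter=50):
    for _ in range(n_iter):
        new_centers = update(points, assign(points, centers, weights), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers, weights)


def attribute_weights(points, centers, labels, eps=1e-12):
    """Assumed weight rule: inverse within-cluster dispersion, normalised to 1."""
    n_attr = len(points[0])
    disp = [eps] * n_attr
    for p, lab in zip(points, labels):
        for l in range(n_attr):
            disp[l] += (p[l] - centers[lab][l]) ** 2
    inv = [1.0 / d for d in disp]
    total = sum(inv)
    return [v / total for v in inv]


def improved_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    init = [list(p) for p in rng.sample(points, k)]          # random initial centers
    uniform = [1.0] * len(points[0])
    mid_centers, mid_labels = kmeans(points, init, uniform)  # first objective
    weights = attribute_weights(points, mid_centers, mid_labels)
    centers, labels = kmeans(points, mid_centers, weights)   # second objective
    return centers, labels, weights
```

Attributes along which the intermediate clusters are tight receive larger weights in the second pass, which is one way to realise the idea that attribute features do not all influence the clustering equally.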
Further, in the data clustering module 502 according to the embodiment of the present invention, the first clustering objective function formula is expressed as:
Figure BDA0003608354970000161
wherein M represents the number of initial cluster centers, P denotes the set of data to be cleaned, $P_x$ represents any sample datum in the data set P to be cleaned, $M_m$ represents the m-th initial cluster center, and d(·) represents the Euclidean distance.
Further, in the data clustering module 502 according to the embodiment of the present invention, the second clustering objective function formula is expressed as:
Figure BDA0003608354970000162
wherein X represents the number of attribute features of the data to be cleaned, N represents the number of intermediate target cluster centers, $\lambda_l$ represents the attribute weight corresponding to the l-th class of data to be cleaned,
Figure BDA0003608354970000163
Figure BDA0003608354970000164
represents the sample set of the l-th class of data to be cleaned, x represents the number of samples in the set,
Figure BDA0003608354970000165
represents the n-th intermediate target cluster center corresponding to the l-th class of data to be cleaned, and $\|\cdot\|_2$ denotes the least-squares error.
Further, in the data clustering module 502 according to the embodiment of the present invention, performing weight solving on the second clustering objective function according to the intermediate target clustering centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned includes:
solving the weights of the second clustering objective function according to the intermediate target clustering centers by the Lagrangian method, to obtain the attribute weight corresponding to each attribute in the data to be cleaned.
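The closed form of the weight solution is not reproduced in this text. As an illustration only, suppose the second objective has the form $J = \sum_{l=1}^{X} \lambda_l^{\beta} D_l$ with the constraint $\sum_{l=1}^{X} \lambda_l = 1$, where $D_l$ is the squared error of the l-th attribute about the intermediate target cluster centers and $\beta > 1$ is an exponent. Both $D_l$ and $\beta$ are hypothetical symbols introduced here, not taken from the embodiment. A Lagrangian solution would then proceed as:

```latex
\begin{aligned}
\mathcal{L}(\lambda, \mu) &= \sum_{l=1}^{X} \lambda_l^{\beta} D_l
  - \mu \Bigl( \sum_{l=1}^{X} \lambda_l - 1 \Bigr), \\
\frac{\partial \mathcal{L}}{\partial \lambda_l}
  &= \beta \lambda_l^{\beta - 1} D_l - \mu = 0
  \;\Longrightarrow\;
  \lambda_l = \Bigl( \frac{\mu}{\beta D_l} \Bigr)^{\frac{1}{\beta - 1}}, \\
\sum_{l=1}^{X} \lambda_l = 1
  &\;\Longrightarrow\;
  \lambda_l = \frac{D_l^{-\frac{1}{\beta - 1}}}{\sum_{t=1}^{X} D_t^{-\frac{1}{\beta - 1}}}.
\end{aligned}
```

Under this assumed form, an attribute with a smaller within-cluster error $D_l$ receives a larger weight; for $\beta = 2$ the weights are simply proportional to $1 / D_l$.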
Further, in the data compensation module 504 according to the embodiment of the present invention, a process of compensating the cleaning result based on the improved random forest algorithm includes:
interpolating the cleaning result to obtain an interpolation matrix;
constructing a random forest regression model according to the interpolation matrix;
aiming at each column of the interpolation matrix, when the column is taken as a target filling column, the data of the rest columns form a filling matrix; and predicting according to the filling matrix and the data corresponding to the target filling column by using the constructed random forest regression model to obtain a prediction result of the target filling column, and taking the prediction result as a compensation result of the corresponding column.
Further, in the data compensation module 504 according to the embodiment of the present invention, interpolating the cleaning result to obtain an interpolation matrix includes:
interpolating the cleaning result by using a linear interpolation method to obtain an interpolation matrix; wherein the interpolation matrix has the same size as the data to be cleaned.
In a third aspect, referring to fig. 5, an embodiment of the present invention further provides an electronic device, which includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with each other through the communication bus 604;
a memory 603 for storing a computer program;
the processor 601 is configured to implement the steps of the data identification cleaning and compensation method according to the first aspect when executing the program stored in the memory 603.
The electronic device may be: desktop computers, laptop computers, intelligent mobile terminals, servers, and the like. Without limitation, any electronic device that can implement the present invention is within the scope of the present invention.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the processor may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In a fourth aspect, corresponding to the method for cleaning and compensating for data identification provided in the first aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the method for cleaning and compensating for data identification provided in the embodiment of the present invention.
The computer readable storage medium stores an application program that executes the method for data recognition cleaning and compensation provided by the embodiment of the invention when the application program runs.
For the apparatus/electronic device/storage medium embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
It should be noted that, the apparatus, the electronic device and the storage medium according to the embodiments of the present invention are respectively an apparatus, an electronic device and a storage medium to which the above-mentioned method for data recognition, cleaning and compensation is applied, and all the embodiments of the above-mentioned method for data recognition, cleaning and compensation are applicable to the apparatus, the electronic device and the storage medium, and can achieve the same or similar beneficial effects.
In the description of the present invention, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A data identification cleaning and compensation method is characterized by comprising the following steps:
acquiring data to be cleaned; wherein the data to be cleaned has a plurality of attribute characteristics;
clustering the data to be cleaned based on an improved k-Means clustering algorithm; wherein the improved k-Means clustering algorithm is a manner of changing the clustering criterion function by introducing attribute weights for the plurality of attribute features of the data to be cleaned;
removing abnormal data in the clustering result according to a preset abnormal data outlier distance to obtain a cleaning result;
compensating the cleaning result based on an improved random forest algorithm to obtain a compensation result; the improved random forest algorithm is a mode of optimizing random forest input data by adopting an interpolation method.
2. The data identification cleaning and compensation method of claim 1, wherein the process of clustering the data to be cleaned based on the improved k-Means clustering algorithm comprises:
randomly acquiring a plurality of data from the data to be cleaned as initial clustering centers;
constructing a first clustering objective function;
clustering and solving the first clustering objective function according to the initial clustering center to obtain a plurality of intermediate target clustering centers corresponding to the data to be cleaned;
constructing a second clustering objective function;
performing weight solving on the second clustering objective function according to the intermediate target clustering centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned;
and clustering and solving the second clustering objective function according to the attribute weights to obtain a plurality of target clustering centers corresponding to the data to be cleaned.
3. The data recognition cleaning and compensation method of claim 2, wherein the first clustering objective function formula is constructed as:
Figure FDA0003608354960000011
wherein X represents the number of attribute features of the data to be cleaned, M represents the number of initial cluster centers, P represents the data set to be cleaned, $P_x$ represents any sample datum in the data set P to be cleaned, $M_m$ represents the m-th initial cluster center, and d(·) represents the Euclidean distance.
4. The data recognition cleaning and compensation method of claim 2, wherein the second clustering objective function formula is constructed as:
Figure FDA0003608354960000021
wherein X represents the number of attribute features of the data to be cleaned, N represents the number of intermediate target cluster centers, $\lambda_l$ represents the attribute weight corresponding to the l-th class of data to be cleaned,
Figure FDA0003608354960000022
Figure FDA0003608354960000023
represents the sample set of the l-th class of data to be cleaned, x represents the number of samples in the set,
Figure FDA0003608354960000024
represents the n-th intermediate target cluster center corresponding to the l-th class of data to be cleaned, and $\|\cdot\|_2$ denotes the least-squares error.
5. The data identification, cleaning and compensation method according to claim 2, wherein the performing weight solving on the second clustering objective function according to the intermediate target clustering centers to obtain an attribute weight corresponding to each attribute in the data to be cleaned comprises:
solving the weights of the second clustering objective function according to the intermediate target clustering centers by the Lagrangian method, to obtain the attribute weight corresponding to each attribute in the data to be cleaned.
6. The data recognition cleaning and compensation method of claim 1, wherein the process of compensating the cleaning result based on the improved random forest algorithm comprises:
interpolating the cleaning result to obtain an interpolation matrix;
constructing a random forest regression model according to the interpolation matrix;
aiming at each column of the interpolation matrix, when the column is taken as a target filling column, the data of the rest columns form a filling matrix; and predicting according to the filling matrix and the data corresponding to the target filling column by using the constructed random forest regression model to obtain a prediction result of the target filling column, and taking the prediction result as the compensation result of the corresponding column.
7. The data recognition cleaning and compensation method of claim 6, wherein the interpolating the cleaning result to obtain an interpolation matrix comprises:
interpolating the cleaning result by using a linear interpolation method to obtain an interpolation matrix; wherein the interpolation matrix has the same size as the data to be cleaned.
8. An abnormal data identification, cleaning and compensation device, characterized by comprising:
the data acquisition module is used for acquiring data to be cleaned; wherein the data to be cleaned has a plurality of attribute characteristics;
the data clustering module is used for clustering the data to be cleaned based on an improved k-Means clustering algorithm; wherein the improved k-Means clustering algorithm is a manner of changing the clustering criterion function by introducing attribute weights for the plurality of attribute features of the data to be cleaned;
the data removing module is used for removing abnormal data in the clustering result according to a preset abnormal data outlier distance to obtain a cleaning result;
the data compensation module is used for compensating the cleaning result based on an improved random forest algorithm to obtain a compensation result; the improved random forest algorithm is a mode of optimizing random forest input data by adopting an interpolation method.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor is used for realizing the steps of the data identification cleaning and compensation method of any one of claims 1 to 7 when executing the program stored in the memory.
10. A computer-readable storage medium, characterized in that,
the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the data recognition cleaning and compensation method steps of any one of claims 1 to 7.
CN202210422231.6A 2022-04-21 2022-04-21 Data identification cleaning and compensation method and device, electronic equipment and storage medium Withdrawn CN114968992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422231.6A CN114968992A (en) 2022-04-21 2022-04-21 Data identification cleaning and compensation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210422231.6A CN114968992A (en) 2022-04-21 2022-04-21 Data identification cleaning and compensation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114968992A true CN114968992A (en) 2022-08-30

Family

ID=82979440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422231.6A Withdrawn CN114968992A (en) 2022-04-21 2022-04-21 Data identification cleaning and compensation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114968992A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117077051A (en) * 2023-07-18 2023-11-17 重庆交通大学 Self-adaptive identification method for dam monitoring abnormal data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220830