CN114528284A - Bottom layer data cleaning method and device, mobile terminal and storage medium - Google Patents


Info

Publication number
CN114528284A
CN114528284A (application CN202210152348.7A)
Authority
CN
China
Prior art keywords
data
cleaned
cleaning
bottom layer
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210152348.7A
Other languages
Chinese (zh)
Inventor
王峰
李一泉
邓旭阳
谭乾
朱佳
刘世丹
温涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd and Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority to CN202210152348.7A
Publication of CN114528284A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a bottom layer data cleaning method and device, a mobile terminal and a storage medium. The method comprises: acquiring bottom layer data to be cleaned, and calculating, through a K-Means clustering algorithm, the Euclidean distances between a plurality of objects in the bottom layer data to be cleaned and a plurality of initial clustering centers in a MapReduce model, wherein the initial clustering centers are calculated by the K-Means clustering algorithm; performing classification and sorting according to the Euclidean distances through the MapReduce model, and performing iterative computation according to the sorting result to obtain final aggregation clustering centers; determining the final category of the bottom layer data to be cleaned according to the final aggregation clustering centers, performing abnormal value processing on the bottom layer data to be cleaned according to the final category, and then performing an integrity repair operation according to the abnormal value processing result to finish the data cleaning of the bottom layer data to be cleaned. The invention can improve the cleaning efficiency of bottom layer data.

Description

Bottom layer data cleaning method and device, mobile terminal and storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for cleaning bottom layer data, a mobile terminal and a storage medium.
Background
At present, transformer substations in China are in the development stage of the intelligent transformer substation, and as the scale and complexity of the power information system increase, it becomes harder for the relay protection system to resist network attacks. The relay protection system depends on bottom layer data to operate, and a reliable bottom layer data base is the key to realizing intelligent relay protection. The sources of bottom layer data are wide, the data is dynamic and uncontrolled, and the data types are numerous, so cleaning the data becomes a necessary step for improving the protection accuracy of the relay protection system: only by improving the quality of the cleaned data can accurate protection by the relay protection system be ensured.
However, the traditional relay protection system has low cleaning efficiency for the bottom layer data, which finally results in low protection accuracy of the relay protection system.
Disclosure of Invention
The embodiment of the invention provides a method and a device for cleaning bottom data, a mobile terminal and a storage medium, which improve the cleaning efficiency of the bottom data and further improve the protection accuracy of a relay protection system.
A first aspect of an embodiment of the present application provides a method for cleaning underlying data, including:
after acquiring bottom data to be cleaned, calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm; wherein, the initial clustering center is obtained by calculating through a K-Means clustering algorithm;
classifying and sorting according to Euclidean distance through a MapReduce model, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center;
determining the final category of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final category to obtain an abnormal value processing result;
and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
In a possible implementation manner of the first aspect, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in a MapReduce model according to a maximum and minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
In a possible implementation manner of the first aspect, the method further includes:
the MapReduce model divides the bottom data to be cleaned into a plurality of segments with fixed sizes, stores the segments into key value pairs, and performs distributed computation according to the key value pairs and a plurality of objects to obtain distributed computation results.
In a possible implementation manner of the first aspect, the MapReduce model is used to perform sorting and ordering according to euclidean distances, and a final aggregation clustering center is obtained through iterative computation according to an ordering result, specifically:
and obtaining a new clustering center according to the sequencing result, calculating a change value between the new clustering center and the initial clustering center, and taking the new clustering center as a final aggregation clustering center when the change value is smaller than a preset value.
In a possible implementation manner of the first aspect, the acquiring of the bottom layer data to be cleaned specifically includes:
acquiring initial bottom layer data, performing dimensionality reduction on the initial bottom layer data, generating and acquiring bottom layer data to be cleaned; the method for acquiring the initial bottom layer data specifically comprises the following steps:
the initial bottom layer data is acquired based on the Hadoop technology; wherein the initial bottom layer data comprises: four remote data, platen data, fixed value data, alarm data, fault signals, and action event data.
In a possible implementation manner of the first aspect, the method further includes:
after data cleaning of bottom data to be cleaned is finished, generating a first cleaning result;
and acquiring the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in the distributed file system.
A second aspect of an embodiment of the present application provides an underlying data cleaning apparatus, including: the device comprises a first calculation module, a second calculation module and a cleaning module;
the first calculation module is used for calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm after the bottom data to be cleaned is obtained; wherein, the initial clustering center is obtained by calculating through a K-Means clustering algorithm;
the second calculation module is used for carrying out classification and sequencing according to Euclidean distances through a MapReduce model and carrying out iterative calculation according to a sequencing result to obtain a final aggregation clustering center;
the cleaning module is used for determining the final category of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final category to obtain an abnormal value processing result; and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
In a possible implementation manner of the second aspect, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in a MapReduce model according to a maximum and minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
A third aspect of the embodiments of the present application provides a mobile terminal, which includes a processor and a memory, where the memory stores a computer-readable program code, and when the processor executes the computer-readable program code, the steps of the method for cleaning underlying data described above are implemented.
A fourth aspect of embodiments of the present application provides a storage medium storing computer-readable program code, which when executed implements the steps of an underlying data scrubbing method described above.
Compared with the prior art, the method, the device, the mobile terminal and the storage medium for cleaning the bottom layer data provided by the embodiment of the invention comprise the following steps: after acquiring bottom data to be cleaned, calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm; wherein, the initial clustering center is obtained by calculating through a K-Means clustering algorithm; classifying and sorting according to Euclidean distance through a MapReduce model, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center; determining the final category of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final category to obtain an abnormal value processing result; and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
The beneficial effects are that: according to the method and the device, the final category of the bottom data to be cleaned is obtained through the MapReduce model calculation, and the abnormal value processing is carried out according to the final category, so that the efficiency of abnormal value processing can be effectively improved; and after integrity restoration is carried out according to the abnormal value processing result obtained quickly, data cleaning of the bottom data to be cleaned is completed, so that the data cleaning efficiency of the bottom data to be cleaned is improved, and the protection accuracy of the relay protection system is further improved.
Meanwhile, the abnormal value processing is carried out according to the final category, so that the accuracy of the abnormal value processing can be improved, and the precision of the abnormal value processing result can be improved; the quality of the underlying data can be improved, and a high-quality underlying data base is provided for other applications, so that the accuracy and performance of data mining or data stream mining are improved. The initial clustering center is obtained through calculation of a K-Means clustering algorithm, the use range of the K-Means clustering algorithm is expanded to a cloud computing platform from a single machine under a MapReduce framework, the operation time of the K-Means clustering algorithm is greatly reduced facing mass data, and the operation efficiency is remarkably improved.
Moreover, the data acquisition and storage are realized based on the Hadoop technology, so that the data can be effectively acquired by the digital twin acquisition layer in the data cleaning and conversion process, and the efficiency and the accuracy of data cleaning are further improved.
In addition, the embodiment of the invention provides a 'collector' interface conforming to the IEC-61850 standard to collect original data layer signals and data. The standardization of the interface can optimize the automation system of the transformer substation, improve the safety and reliability of the whole system, and finally realize information sharing and system integration within the substation.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for cleaning underlying data according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an underlying data cleaning apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a schematic flow chart of a method for cleaning underlying data according to an embodiment of the present invention, the method includes steps S101 to S104:
s101: after the bottom data to be cleaned is obtained, calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm.
Wherein the initial clustering center is calculated by a K-Means clustering algorithm.
In this embodiment, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
In a specific embodiment, the method further comprises:
the MapReduce model divides the bottom data to be cleaned into a plurality of fragments with fixed sizes, stores the fragments into key value pairs, and performs distributed computation according to the key value pairs and the objects to obtain distributed computation results.
In this embodiment, the acquiring of the data of the bottom layer to be cleaned specifically includes:
acquiring initial bottom layer data, performing dimensionality reduction on the initial bottom layer data, generating and acquiring the bottom layer data to be cleaned; the acquiring of the initial bottom layer data specifically includes:
acquiring the initial bottom layer data based on the Hadoop technology; wherein the initial bottom layer data comprises: four remote data, platen data, fixed value data, alarm data, fault signals, and action event data.
Further, after the dimension reduction processing is performed on the initial bottom layer data, the bottom layer data to be cleaned is generated and acquired, specifically:
and reducing the dimension of the initial bottom data by a Logsf feature selection algorithm, and eliminating redundant features to obtain the bottom data to be cleaned.
Further, after data cleaning and data conversion are performed on the bottom layer data to be cleaned, the result is written back to the distributed file system HDFS (Hadoop Distributed File System), specifically: after the data cleaning of the bottom layer data to be cleaned is finished, a first cleaning result is generated; the data property of the first cleaning result is acquired, a data conversion operation is performed according to the data property to generate a first conversion result, and the first conversion result is stored in the distributed file system.
Further, the distributed file system HDFS divides nodes into 3 types of roles, which are: a main server node (Namenode), a data block server node (Datanode), and a Client (Client). The main server node is a management node of the HDFS system, is used for storing metadata of the system and plays a management role. The data block server node is responsible for specific massive information storage work, and all files are adjusted to 64 MB-sized data blocks for multi-copy storage. The client provides an access interface for the application program, and can interact with the data block server node.
The main idea of the Logsf algorithm is: in the process of calculating the loss function of the data set, the energy function and the nearest neighbor classification idea are applied to convert the complex and nonlinear problem in any group of characteristic data sets into a simple and easily understood local linear problem.
Assume that the training sample set R is:
R = {M, N} = {(m_i, n_i) | i = 1, …, X}, m_i = (m_i1, m_i2, …, m_id) ∈ R^d
wherein m_i is the i-th training sample in the data set, n_i is the label corresponding to that training sample, X is the number of samples in the training sample set, M is the training sample set, and N is the corresponding label set.
The loss function for sample mi is then:
L(β, m_i) = log(1 + exp(−β^T F_1));
wherein F_1 = |m_i − m'_i| − |m_i − n'_i| is an intermediate variable, m'_i is the sample nearest to m_i but with a different label, n'_i is the sample nearest to m_i with the same label, and β is the feature weight vector. Minimizing the loss function yields the ideal weight β*, under which the distance between the sample m_i and its nearest same-label sample n'_i is smaller than the distance between m_i and m'_i.
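The Logsf loss function above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name and the way the nearest-miss and nearest-hit samples are supplied are assumptions for the example.

```python
import numpy as np

def logsf_loss(beta, m_i, near_miss, near_hit):
    """Logistic loss for one sample m_i, per the formula above.

    near_miss: nearest sample to m_i with a different label (m'_i)
    near_hit:  nearest sample to m_i with the same label (n'_i)
    beta:      per-feature weight vector
    """
    # F1 = |m_i - m'_i| - |m_i - n'_i|  (element-wise absolute differences)
    f1 = np.abs(m_i - near_miss) - np.abs(m_i - near_hit)
    # L(beta, m_i) = log(1 + exp(-beta^T F1))
    return float(np.log1p(np.exp(-beta @ f1)))
```

Minimizing this loss over β pushes each sample closer to its nearest same-label neighbor than to its nearest different-label neighbor; features receiving near-zero weights can then be eliminated as redundant, which is the dimension reduction step described above.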
In a specific embodiment, the obtaining initial underlying data specifically includes:
acquiring the initial bottom layer data from an HDFS data warehouse based on the Hadoop technology, wherein the initial bottom layer data comprises: four remote data, platen data, fixed value data, alarm data, fault signals and action event data. The collected initial bottom layer data is stored in the Hadoop storage system. The initial bottom layer data is located in the original data layer, and data in the original data layer is transmitted to the digital twin layer so that the digital twin layer can perform a series of data processing operations such as data cleaning. A 'collector' interface conforming to the IEC-61850 standard is therefore defined: the original data layer realizes data transmission and communication with the digital twin layer through the collector, that is, the collector transmits the important initial bottom layer data of the original data layer to the digital twin layer.
Further, the collector interface is also used for uploading the electrical parameter signal into the digital twin collection layer.
In a specific embodiment, the IEC-61850 compliant collector comprises 4 functional modules: the device comprises a synchronous signal module, a data acquisition module, a digital signal processing module and a framing coding communication module.
A synchronization signal module: an externally input 1PPS (pulse-per-second) signal is correctly identified and tracked through an FPGA (field programmable gate array), a signal is then generated, and after exception handling of abnormal signals, a synchronous sampling signal is sent to the primary equipment.
A data acquisition module: after the collector sends synchronous sampling control signals to each path of A/D converter, the FPGA in the collector receives digital quantity or analog small-signal quantity data.
The digital signal processing module: the DSP in the data collector performs filtering and FFT (fast Fourier transform) on the data collected by the FPGA to obtain sampled data values such as current, voltage and phase for panel display, and the PowerPC corrects the phase error of the original signal collected by the FPGA.
The framing coding communication module: in PowerPC, after calibrating each signal sampling point, the data is subjected to framing coding according to IEC61850 standard and is sent to a collection layer for deep processing.
In a specific embodiment, the data property of the first cleaning result is obtained, a data conversion operation is performed according to the data property to generate a first conversion result, and the first conversion result is stored in a distributed file system, specifically:
the purpose of data conversion is to transform data into a uniform format or format suitable for analysis, which is achieved through data normalization operations. Normalization refers to scaling the attribute data to fall within a small specific interval. This example used maximum-minimum normalization and z-score normalization to perform data transformation on the first wash results to obtain first transformation results.
The maximum and minimum normalized calculation formula is as follows:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
wherein max_A and min_A are the maximum and minimum values of attribute A, v is a value of attribute A, v' is the mapping of v onto the interval [new_min_A, new_max_A], and new_max_A and new_min_A are the new maximum and minimum values of attribute A after mapping.
Through the maximum-minimum normalization formula, the values of attribute A are mapped into the range [new_min_A, new_max_A]. The disadvantage of maximum-minimum normalization is that newly added data may change max_A and min_A, requiring the mapping to be recomputed; Z-score normalization avoids this problem.
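A minimal sketch of the maximum-minimum normalization formula above (plain Python; the function name is illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map all values of an attribute A onto [new_min, new_max]:
    v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
```

For example, `min_max_normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`. Note that appending a value outside `[2, 6]` changes `min_A`/`max_A` and forces all outputs to be recomputed, which is the drawback discussed above.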
The Z-score normalized calculation formula is as follows:
v' = (v − Ā) / δ_A
wherein Ā is the mean value of attribute A and δ_A is the standard deviation of attribute A; v' is obtained by performing Z-score normalization on the value v of attribute A.
Z-score normalization is valid where the maximum and minimum values of attribute A are unknown.
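The Z-score formula above can be sketched as follows (population standard deviation; the function name is illustrative):

```python
import math

def z_score_normalize(values):
    """Z-score normalize an attribute: v' = (v - mean_A) / std_A."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]
```

Unlike maximum-minimum normalization, this requires only the mean and standard deviation, so it remains applicable when the true maximum and minimum of attribute A are unknown.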
Further, obtaining the bottom layer data to be cleaned means that preprocessing (namely dimension reduction) of the initial bottom layer data has already been completed. A distance-first cleaning rule is then applied to the bottom layer data to be cleaned: given the error Δv between the bottom layer data to be cleaned and the real data, whether Δv meets the minimum distance requiring cleaning is judged. If so, S102-S103 are performed and the cleaning result is recorded into the HDFS; if not, the bottom layer data to be cleaned is written back to the HDFS directly for storage.
S102: and carrying out classification sorting according to Euclidean distances through a MapReduce model, and carrying out iterative computation according to a sorting result to obtain a final aggregation clustering center.
In this embodiment, the performing, by the MapReduce model, classification and sorting according to the euclidean distance, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center specifically:
and obtaining a new clustering center according to the sequencing result, calculating a change value between the new clustering center and the initial clustering center, and taking the new clustering center as the final aggregation clustering center when the change value is smaller than a preset numerical value.
Further, when the change value is greater than or equal to the preset value, the new clustering center replaces the initial clustering center and S101-S102 are repeated, iteratively updating the clustering centers, until the change value between the latest generation of clustering centers and the previous generation is smaller than the preset value; the iterative computation is then complete and the final aggregation clustering centers are obtained.
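The iterate-until-convergence scheme of S101-S102 can be sketched as below. This is an in-process illustration, not the patented MapReduce implementation; the initial centers are passed in directly (in the embodiment they come from the maximum-minimum distance algorithm), and `eps` plays the role of the preset value.

```python
def euclid(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmeans(points, centers, eps=1e-4, max_iter=100):
    """Repeat S101 (assignment) and S102 (center update) until the change
    value between successive generations of centers falls below eps."""
    k = len(centers)
    for _ in range(max_iter):
        # S101: assign each object to its nearest clustering center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: euclid(p, centers[j]))
            clusters[j].append(p)
        # S102: new center = mean of each cluster (keep old center if empty)
        new_centers = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        change = max(euclid(c_new, c_old) for c_new, c_old in zip(new_centers, centers))
        centers = new_centers
        if change < eps:
            break  # final aggregation clustering centers reached
    return centers
```

When the maximum center movement drops below `eps`, the latest generation of centers is returned as the final aggregation clustering centers.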
S103: and determining the final class of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final class to obtain an abnormal value processing result.
In a specific embodiment, the abnormal value processing of the underlying data to be cleaned is performed according to the final category, specifically:
the outlier processing includes missing value filling, and further, the missing value filling method is the most reasonable method for processing the missing data problem. The missing value records and the complete data set have a lot of information correlation, and a data set similar to the missing value can be found by clustering and analyzing the data, so that the missing value filling is carried out more accurately. Preferably, the K-Means clustering algorithm is used as a missing value filling method, and has the advantages of simplicity and high efficiency. It organizes the objects into multiple mutually exclusive groups or clusters, considering that the closer two objects are, the greater their similarity.
The principle of the K-Means clustering algorithm is as follows: assume that the data set D contains n objects in Euclidean space, and the objects of D are to be assigned to k clusters C_1, …, C_k, such that C_i ⊂ D and C_i ∩ C_j = ∅ for 1 ≤ i, j ≤ k, i ≠ j.
Let p be a point in space representing a given data object and c_i the center of cluster C_i, where p and c_i are both multi-dimensional. The Euclidean distance is used as the evaluation index: dist(x, y) denotes the Euclidean distance between two points x and y, and the difference between an object p ∈ C_i and the cluster representative c_i is denoted dist(p, c_i). The quality of cluster C_i is measured by its intra-cluster variation, i.e., the sum of the squared errors between all objects in C_i and the center c_i, defined as:
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)^2
wherein E is the sum of the squared errors of all objects in the data set.
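The sum-of-squared-errors E above can be computed directly (a minimal sketch; clusters are given as lists of points alongside their centers):

```python
def sse(clusters, centers):
    """Intra-cluster variation E: sum over clusters C_i of the squared
    Euclidean distances between each object p in C_i and its center c_i."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for cl, c in zip(clusters, centers)
        for p in cl
    )
```

K-Means seeks the assignment and centers that minimize E; for a fixed assignment, E is minimized by taking each c_i to be the mean of C_i.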
Further, before the missing value filling, the method further includes:
1. determining the missing data range: calculating the missing data proportion of each field, and then formulating a strategy for each field according to its missing proportion and importance;
2. removing unnecessary fields.
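The two preparatory steps above — measuring the per-field missing proportion and dropping unnecessary fields — can be sketched in Python as follows; the record layout and field names are hypothetical:

```python
def missing_ratios(records, fields):
    """Step 1: fraction of None values per field across all records."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}

def drop_fields(records, unwanted):
    """Step 2: remove unnecessary fields before filling missing values."""
    return [{k: v for k, v in r.items() if k not in unwanted} for r in records]

rows = [
    {"u": 1.0, "i": None, "note": "x"},
    {"u": 2.0, "i": 0.5, "note": None},
    {"u": None, "i": 0.7, "note": "y"},
    {"u": 4.0, "i": 0.9, "note": "z"},
]
print(missing_ratios(rows, ["u", "i"]))  # {'u': 0.25, 'i': 0.25}
cleaned = drop_fields(rows, {"note"})    # "note" judged unnecessary here
```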
Specifically, the embodiment of the invention performs clustering processing on the bottom data to be cleaned and generates corresponding categories, and then performs abnormal value processing according to the categories, so that the efficiency of abnormal value processing can be effectively improved, and the efficiency of data cleaning is further improved.
In a specific embodiment, the K-Means clustering algorithm runs under the MapReduce model as follows:
In the Map stage, the Euclidean distances between the objects in the bottom layer data to be cleaned and the initial clustering centers are calculated and recorded, and the initial category of each object is determined from the object and its corresponding Euclidean distances. In the Reduce stage, the objects are classified and sorted according to the Euclidean distances obtained in the Map stage, and a new clustering center is calculated for the next round of Map. If the change between the new clustering center obtained in the Reduce stage and the clustering center of the previous round is smaller than a preset value, the algorithm ends; otherwise, a new round of the MapReduce process is performed. The iterative computation finishes when the change between the latest generation of clustering centers and the previous generation is smaller than the preset value, yielding the final clustering centers.
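As a single-process analogue of the iteration described above — Map assigns each object to its nearest center, Reduce recomputes the centers, and the loop stops once every center moves less than a preset value — consider this Python sketch; the names and the convergence threshold are illustrative:

```python
import math

def nearest(point, centers):
    """Map step: index of the closest center by Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, eps=1e-4, max_rounds=100):
    """Iterate map (assign) + reduce (recompute means) until every
    center moves less than eps, mirroring the MapReduce loop."""
    for _ in range(max_rounds):
        buckets = [[] for _ in centers]
        for p in points:                      # Map: assign each object
            buckets[nearest(p, centers)].append(p)
        new = [
            tuple(sum(c) / len(b) for c in zip(*b)) if b else centers[i]
            for i, b in enumerate(buckets)    # Reduce: new centers
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new))
        centers = new
        if shift < eps:                       # change below preset value
            break
    return centers

pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(kmeans(pts, [(0.0,), (10.0,)]))  # [(0.5,), (10.5,)]
```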
Further, the operation principle of the MapReduce model is as follows:
in the Map stage, the parallel computing framework divides the input data into fixed-size splits and stores each split as key-value pairs in the format <key1, value1>; each Mapper performs distributed computation on its input key-value pairs to obtain intermediate results <key2, value2>, which are then sorted by key2, with results sharing the same key2 value grouped together to form <key2, List(value2)>. In the Reduce stage, the Reducer merges the intermediate results output by different Mappers, sorts them, and then calls the user-defined reduce() function to process them.
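The <key1, value1> → <key2, value2> → <key2, List(value2)> flow can be modelled in a few lines of Python. This sketch collapses the distributed framework into one process and uses word count purely as an illustration:

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer):
    """Minimal single-process model of the MapReduce flow:
    map -> <key2, value2>, shuffle -> <key2, List(value2)>, reduce."""
    intermediate = defaultdict(list)
    for key1, value1 in splits:          # fixed-size input splits
        for key2, value2 in mapper(key1, value1):
            intermediate[key2].append(value2)   # group equal key2 together
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}

# Word count as the canonical illustration of the model
splits = [(0, "a b a"), (1, "b c")]
mapper = lambda k, line: [(w, 1) for w in line.split()]
reducer = lambda word, counts: sum(counts)
print(run_mapreduce(splits, mapper, reducer))  # {'a': 2, 'b': 2, 'c': 1}
```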
In a specific embodiment, the initial clustering center is calculated by a K-Means clustering algorithm, specifically:
(1) Each Map node reads the data set uploaded to the data acquisition layer and generates a plurality of cluster sets using the maximum-minimum distance algorithm.
(2) In the Reduce stage, K initial clustering centers are generated from the cluster sets produced in the Map stage using the K-Means clustering algorithm.
(3) The information of the generated initial clustering centers is written into the Cluster directory, and the files in that directory are added to the Hadoop Distributed Cache as global shared information for the next clustering iteration.
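The maximum-minimum distance seeding of step (1) can be sketched as follows. This simplified Python version fixes the number of centers k in advance, whereas the classical algorithm derives the cluster count from a distance threshold; that simplification is an assumption of the sketch:

```python
import math

def max_min_centers(points, k):
    """Maximum-minimum distance seeding: start from the first point,
    then repeatedly pick the point whose distance to its nearest
    already-chosen center is largest."""
    centers = [points[0]]
    while len(centers) < k:
        far = max(points,
                  key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    return centers

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
print(max_min_centers(pts, 2))  # [(0.0, 0.0), (10.0, 10.0)]
```

Seeding centers this way spreads them across the data, which tends to reduce the number of K-Means iterations needed afterwards.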
Further, the MapReduce implementation of the K-Means algorithm is specifically as follows:
(1) Each Map node reads, in its setup() method, the cluster center information generated by the previous iteration from the distributed cache.
(2) The map() method calculates the Euclidean distance between each data point and each cluster center, finds the nearest cluster center, and emits the ID of that cluster center as the key and the data point information as the value.
(3) A Combiner on the Map side merges the values sharing the same cluster ID key on each Map node, so as to reduce the network transmission overhead of the data.
(4) On the Reduce side, the results produced by the Combiners are merged, and the data points of the same cluster are used to calculate a temporary center point according to the formula

a_i = \frac{1}{m_i} \sum_{x \in C_i} x

which is then added to the distributed cache. Here, a_i is the temporary center point of cluster C_i, m_i is the total number of data points in C_i, and x ranges over the data points in C_i.
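Steps (3) and (4) — per-node combining followed by the a_i = (1/m_i)·Σx reduction — can be sketched for one-dimensional points as follows; the (sum, count) partial representation is an implementation choice of this sketch, not mandated by the embodiment:

```python
from collections import defaultdict

def combine(map_output):
    """Combiner: merge values with the same cluster ID on one Map node
    into a partial (sum, count) so less data crosses the network."""
    partial = defaultdict(lambda: [0.0, 0])
    for cluster_id, x in map_output:
        partial[cluster_id][0] += x
        partial[cluster_id][1] += 1
    return dict(partial)

def reduce_centers(partials):
    """Reduce: a_i = (1/m_i) * sum of the points in cluster C_i."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for cid, (s, m) in part.items():
            total[cid][0] += s
            total[cid][1] += m
    return {cid: s / m for cid, (s, m) in total.items()}

node1 = combine([(0, 1.0), (0, 3.0), (1, 10.0)])   # Map node 1 output
node2 = combine([(0, 2.0), (1, 12.0)])             # Map node 2 output
print(reduce_centers([node1, node2]))  # {0: 2.0, 1: 11.0}
```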
S104: and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
In a specific embodiment, the performing an integrity repair operation according to the abnormal value processing result specifically includes:
detecting the data format (namely the data property) in the abnormal value processing result and preprocessing it; judging whether the preprocessed data conforms to the data integrity constraint, and repairing the data if it does not; if the data still violates the integrity constraint after being repaired, repairing it again until the requirement is met; after the repair, restoring the data to its original format, thereby completing the data cleaning of the bottom layer data to be cleaned. The integrity constraint may be expressed as a first-order formula of the form

\forall \bar{x} : \left( P_1(\bar{x}_1) \wedge \cdots \wedge P_m(\bar{x}_m) \wedge \varphi \right) \rightarrow \left( P'_1(\bar{y}_1) \wedge \cdots \wedge P'_n(\bar{y}_n) \wedge \psi \right)

where each P_i denotes a relation; \bar{x}_i and \bar{y}_j denote tuple variables and constants; \varphi and \psi denote formulas containing only built-in predicates; m and n are positive integers; \bar{x}_i is the attribute set corresponding to P_i, and \bar{y}_j is the attribute set corresponding to P'_j.
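The repair-until-consistent loop described above can be sketched as follows; the example constraint (a reading must be non-negative) and the repair action are hypothetical, not taken from the embodiment:

```python
def repair_until_valid(records, constraints, repair, max_passes=10):
    """Re-run repair while any integrity constraint is violated,
    as in the repeat-until-consistent step of the method.
    `constraints` are predicates over one record; `repair` fixes one."""
    for _ in range(max_passes):
        bad = [r for r in records if not all(c(r) for c in constraints)]
        if not bad:
            return records        # every record satisfies every constraint
        records = [repair(r) if r in bad else r for r in records]
    raise ValueError("constraints still violated after max_passes")

# Hypothetical constraint: a voltage reading must be non-negative
non_negative = lambda r: r["v"] >= 0.0
clip = lambda r: {**r, "v": max(r["v"], 0.0)}
data = [{"v": 3.0}, {"v": -1.0}]
print(repair_until_valid(data, [non_negative], clip))  # [{'v': 3.0}, {'v': 0.0}]
```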
In this embodiment, the method further includes:
after data cleaning of the bottom layer data to be cleaned is finished, generating a first cleaning result;
and acquiring the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in a distributed file system.
In a specific embodiment, the obtaining of the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in a distributed file system specifically includes:
The purpose of data conversion is to transform the data into a uniform format suitable for analysis, which is achieved through data normalization. Normalization scales the attribute data so that it falls within a small, specific interval. This embodiment performs data conversion on the first cleaning result using max-min normalization and Z-score normalization to obtain the first conversion result.
The max-min normalization formula is as follows:

v' = \frac{v - min_A}{max_A - min_A} \left( new\_max_A - new\_min_A \right) + new\_min_A

where max_A and min_A are the maximum and minimum values of attribute A respectively, v is a value of attribute A, and v' is v mapped to the interval [new\_min_A, new\_max_A], whose endpoints new\_max_A and new\_min_A are the new maximum and minimum values of attribute A.

Through this formula, the values of attribute A are mapped to v' in the range [new\_min_A, new\_max_A]. The disadvantage of max-min normalization is that newly added data may change max_A and min_A, requiring them to be redefined and the normalization recomputed.
The Z-score normalization formula is as follows:

v' = \frac{v - \bar{A}}{\delta_A}

where \bar{A} is the mean of attribute A, \delta_A is the standard deviation of attribute A, and v' is obtained by Z-score normalization of the value v of attribute A.

Z-score normalization is effective in cases where the maximum and minimum values of attribute A are unknown.
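Both normalization formulas can be sketched in Python. `statistics.pstdev` is used here for the population standard deviation δ_A; whether the embodiment intends population or sample deviation is an assumption of this sketch:

```python
import statistics

def max_min_normalize(values, new_min=0.0, new_max=1.0):
    """v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score_normalize(values):
    """v' = (v - mean_A) / stddev_A; usable when min/max are unknown."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

a = [10.0, 20.0, 30.0]
print(max_min_normalize(a))      # [0.0, 0.5, 1.0]
print(z_score_normalize(a))      # roughly [-1.2247, 0.0, 1.2247]
```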
To further explain the bottom layer data cleaning device, please refer to fig. 2, fig. 2 is a schematic structural diagram of a bottom layer data cleaning device according to an embodiment of the present invention, including: a first calculation module 201, a second calculation module 202 and a cleaning module 203;
the first calculating module 201 is configured to calculate euclidean distances between a plurality of objects and a plurality of initial clustering centers in the bottom data to be cleaned through a K-Means clustering algorithm in a MapReduce model after the bottom data to be cleaned is acquired; wherein the initial clustering center is calculated by the K-Means clustering algorithm;
the second calculation module 202 is configured to perform classification and sorting according to the euclidean distance through the MapReduce model, and perform iterative calculation according to a sorting result to obtain a final aggregation clustering center;
the cleaning module 203 is configured to determine a final category of the bottom layer data to be cleaned according to the final aggregation clustering center, and perform abnormal value processing on the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result; and completing data cleaning of the bottom layer data to be cleaned after integrity repairing operation is carried out according to the abnormal value processing result.
In this embodiment, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
In an embodiment, the present invention provides a mobile terminal, which includes a processor and a memory, where the memory stores a computer-readable program code, and the processor implements the steps of the above-mentioned method for cleaning underlying data when executing the computer-readable program code.
In one embodiment, the present invention provides a storage medium storing computer readable program code that when executed implements the steps of an underlying data cleansing method described above.
According to the embodiment of the invention, after the bottom layer data to be cleaned is obtained, the first calculation module 201 calculates the Euclidean distances between a plurality of objects in the bottom layer data to be cleaned and a plurality of initial clustering centers through the K-Means clustering algorithm in the MapReduce model, wherein the initial clustering centers are calculated by the K-Means clustering algorithm; the second calculation module 202 classifies and sorts according to the Euclidean distances through the MapReduce model and performs iterative calculation according to the sorting result to obtain the final aggregation clustering center; finally, the cleaning module 203 determines the final category of the bottom layer data to be cleaned according to the final aggregation clustering center and performs abnormal value processing on the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result, and the data cleaning of the bottom layer data to be cleaned is completed after the integrity repair operation is performed according to the abnormal value processing result.
According to the method and the device, the final category of the bottom data to be cleaned is obtained through the MapReduce model calculation, and the abnormal value processing is carried out according to the final category, so that the efficiency of abnormal value processing can be effectively improved; and after integrity restoration is carried out according to the rapidly obtained abnormal value processing result, data cleaning of the bottom data to be cleaned is completed, so that the data cleaning efficiency of the bottom data to be cleaned is improved, and the protection accuracy of the relay protection system is further improved.
Meanwhile, the abnormal value processing is carried out according to the final category, so that the accuracy of the abnormal value processing can be improved, and the precision of the abnormal value processing result can be improved; the quality of the underlying data can be improved, and a high-quality underlying data base is provided for other applications, so that the accuracy and performance of data mining or data stream mining are improved. The initial clustering center is obtained through calculation of a K-Means clustering algorithm, the application range of the K-Means clustering algorithm is expanded to a cloud computing platform from a single machine under a MapReduce framework, the operation time of the K-Means clustering algorithm is greatly reduced in the face of mass data, and the operation efficiency is remarkably improved.
Moreover, the data acquisition and storage are realized based on the Hadoop technology, so that the data can be effectively acquired by the digital twin acquisition layer in the data cleaning and conversion process, and the efficiency and the accuracy of data cleaning are further improved.
In addition, the embodiment of the invention provides the 'collector' interface which accords with the IEC-61850 standard to collect the original data layer signals and data, the standardization of the interface can optimize the automation system of the transformer substation, the safety and the reliability of the whole system are improved, and the sharing and the system integration of the information in the substation are finally realized.
Finally, the Logsf feature selection algorithm is used to reduce the dimension of the initial bottom layer data, so that a complex, nonlinear problem can be converted into a simple, easily understood locally linear problem, improving the data processing efficiency.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. An underlying data scrubbing method, comprising:
after acquiring bottom data to be cleaned, calculating Euclidean distances between a plurality of objects and a plurality of initial clustering centers in the bottom data to be cleaned in a MapReduce model through a K-Means clustering algorithm; wherein the initial clustering center is calculated by the K-Means clustering algorithm;
classifying and sorting according to the Euclidean distance through the MapReduce model, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center;
determining the final category of the bottom layer data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result;
and finishing the data cleaning of the bottom layer data to be cleaned after performing integrity repair operation according to the abnormal value processing result.
2. The method for cleaning underlying data according to claim 1, wherein the initial clustering center is calculated by a K-Means clustering algorithm, and specifically comprises:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
3. The underlying data scrubbing method of claim 1, further comprising:
the MapReduce model divides the bottom data to be cleaned into a plurality of fragments with fixed sizes, stores the fragments into key value pairs, and performs distributed computation according to the key value pairs and the objects to obtain distributed computation results.
4. The method for cleaning the underlying data according to claim 1, wherein the MapReduce model is used for sorting and ordering according to the euclidean distance, and a final aggregation clustering center is obtained through iterative computation according to an ordering result, specifically:
and obtaining a new clustering center according to the sequencing result, calculating a change value between the new clustering center and the initial clustering center, and taking the new clustering center as the final aggregation clustering center when the change value is smaller than a preset numerical value.
5. The method for cleaning the underlying data according to claim 1, wherein the acquiring of the underlying data to be cleaned specifically comprises:
acquiring initial bottom layer data, performing dimension reduction processing on the initial bottom layer data, generating and acquiring the bottom layer data to be cleaned; the acquiring of the initial bottom layer data specifically includes:
acquiring and obtaining the initial bottom layer data based on a Hadoop technology; wherein the initial underlying data comprises: four remote data, platen data, fixed value data, alarm data, fault signals and action event data.
6. The method of claim 5, further comprising:
after data cleaning of the bottom layer data to be cleaned is finished, generating a first cleaning result;
and acquiring the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in a distributed file system.
7. An underlying data cleaning apparatus, comprising: the device comprises a first calculation module, a second calculation module and a cleaning module;
the first calculation module is used for calculating Euclidean distances between a plurality of objects and a plurality of initial clustering centers in the bottom data to be cleaned through a K-Means clustering algorithm in a MapReduce model after the bottom data to be cleaned is obtained; wherein the initial clustering center is calculated by the K-Means clustering algorithm;
the second calculation module is used for carrying out classification and sorting according to the Euclidean distance through the MapReduce model and carrying out iterative calculation according to a sorting result to obtain a final aggregation clustering center;
the cleaning module is used for determining the final category of the bottom layer data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result; and finishing the data cleaning of the bottom layer data to be cleaned after performing integrity repair operation according to the abnormal value processing result.
8. The bottom-layer data cleaning apparatus of claim 7, wherein the initial clustering center is calculated by a K-Means clustering algorithm, specifically:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
9. A mobile terminal comprising a processor and a memory, the memory storing computer readable program code, the processor implementing the steps of a method for scrubbing underlying data as claimed in any one of claims 1 to 6 when the computer readable program code is executed by the processor.
10. A storage medium storing computer readable program code which when executed performs the steps of a method of cleaning underlying data as claimed in any one of claims 1 to 6.
CN202210152348.7A 2022-02-18 2022-02-18 Bottom layer data cleaning method and device, mobile terminal and storage medium Pending CN114528284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210152348.7A CN114528284A (en) 2022-02-18 2022-02-18 Bottom layer data cleaning method and device, mobile terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114528284A true CN114528284A (en) 2022-05-24

Family

ID=81623720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210152348.7A Pending CN114528284A (en) 2022-02-18 2022-02-18 Bottom layer data cleaning method and device, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114528284A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115718744A * 2022-11-28 2023-02-28 北京中航路通科技有限公司 Data quality measurement method based on big data
CN115718744B * 2022-11-28 2023-07-21 北京中航路通科技有限公司 Data quality measurement method based on big data
CN116774639A * 2023-08-24 2023-09-19 中国水利水电第九工程局有限公司 Sewage treatment equipment remote control system based on internet
CN116774639B * 2023-08-24 2023-10-27 中国水利水电第九工程局有限公司 Sewage treatment equipment remote control system based on internet


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination