CN114528284A - Bottom layer data cleaning method and device, mobile terminal and storage medium - Google Patents


Info

Publication number
CN114528284A
CN114528284A (application CN202210152348.7A)
Authority
CN
China
Prior art keywords
data
cleaned
cleaning
bottom layer
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210152348.7A
Other languages
Chinese (zh)
Inventor
王峰
李一泉
邓旭阳
谭乾
朱佳
刘世丹
温涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd and Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority to CN202210152348.7A
Publication of CN114528284A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a bottom layer data cleaning method and device, a mobile terminal and a storage medium. The method comprises: acquiring bottom layer data to be cleaned, and calculating, through a K-Means clustering algorithm, the Euclidean distances between a plurality of objects in the bottom layer data to be cleaned and a plurality of initial clustering centers in a MapReduce model, wherein the initial clustering centers are calculated by the K-Means clustering algorithm; performing classification and sorting according to the Euclidean distances through the MapReduce model, and performing iterative computation according to the sorting result to obtain final aggregation clustering centers; determining the final category of the bottom layer data to be cleaned according to the final aggregation clustering centers, performing abnormal value processing on the bottom layer data to be cleaned according to the final category, and then performing an integrity repair operation according to the abnormal value processing result to finish the data cleaning of the bottom layer data to be cleaned. The invention can improve the cleaning efficiency of bottom layer data.

Description

Bottom layer data cleaning method and device, mobile terminal and storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for cleaning bottom layer data, a mobile terminal and a storage medium.
Background
At present, transformer substations in China are in the development stage of the intelligent transformer substation, and as the scale and complexity of the power information system increase, it becomes harder for the relay protection system to resist network attacks. The relay protection system depends on bottom layer data to operate, and a reliable bottom layer data base is the key to realizing intelligent relay protection. The sources of bottom layer data are wide, the data is dynamic and uncontrolled, and the data types are numerous, so cleaning the data becomes a necessary step for improving the protection accuracy of the relay protection system: only by improving the quality of the cleaned data can accurate protection by the relay protection system be ensured.
However, the traditional relay protection system has low cleaning efficiency for the bottom layer data, which finally results in low protection accuracy of the relay protection system.
Disclosure of Invention
The embodiment of the invention provides a method and a device for cleaning bottom data, a mobile terminal and a storage medium, which improve the cleaning efficiency of the bottom data and further improve the protection accuracy of a relay protection system.
A first aspect of an embodiment of the present application provides a method for cleaning underlying data, including:
after acquiring bottom data to be cleaned, calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm; wherein, the initial clustering center is obtained by calculating through a K-Means clustering algorithm;
classifying and sorting according to Euclidean distance through a MapReduce model, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center;
determining the final category of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final category to obtain an abnormal value processing result;
and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
In a possible implementation manner of the first aspect, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in a MapReduce model according to a maximum and minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
In a possible implementation manner of the first aspect, the method further includes:
the MapReduce model divides the bottom data to be cleaned into a plurality of segments with fixed sizes, stores the segments into key value pairs, and performs distributed computation according to the key value pairs and a plurality of objects to obtain distributed computation results.
In a possible implementation manner of the first aspect, the MapReduce model is used to perform sorting and ordering according to euclidean distances, and a final aggregation clustering center is obtained through iterative computation according to an ordering result, specifically:
and obtaining a new clustering center according to the sequencing result, calculating a change value between the new clustering center and the initial clustering center, and taking the new clustering center as a final aggregation clustering center when the change value is smaller than a preset value.
In a possible implementation manner of the first aspect, the acquiring of the bottom layer data to be cleaned specifically includes:
acquiring initial bottom layer data, performing dimensionality reduction on the initial bottom layer data, generating and acquiring bottom layer data to be cleaned; the method for acquiring the initial bottom layer data specifically comprises the following steps:
the initial bottom layer data is acquired based on the Hadoop technology; wherein the initial bottom layer data comprises: four remote data, platen data, fixed value data, alarm data, fault signals, and action event data.
In a possible implementation manner of the first aspect, the method further includes:
after data cleaning of bottom data to be cleaned is finished, generating a first cleaning result;
and acquiring the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in the distributed file system.
A second aspect of an embodiment of the present application provides an underlying data cleaning apparatus, including: the device comprises a first calculation module, a second calculation module and a cleaning module;
the first calculation module is used for calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm after the bottom data to be cleaned is obtained; wherein, the initial clustering center is obtained by calculating through a K-Means clustering algorithm;
the second calculation module is used for carrying out classification and sequencing according to Euclidean distances through a MapReduce model and carrying out iterative calculation according to a sequencing result to obtain a final aggregation clustering center;
the cleaning module is used for determining the final category of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final category to obtain an abnormal value processing result; and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
In a possible implementation manner of the second aspect, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in a MapReduce model according to a maximum and minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
A third aspect of the embodiments of the present application provides a mobile terminal, which includes a processor and a memory, where the memory stores a computer-readable program code, and when the processor executes the computer-readable program code, the steps of the method for cleaning underlying data described above are implemented.
A fourth aspect of embodiments of the present application provides a storage medium storing computer-readable program code, which when executed implements the steps of an underlying data scrubbing method described above.
Compared with the prior art, the method, the device, the mobile terminal and the storage medium for cleaning the bottom layer data provided by the embodiment of the invention comprise the following steps: after acquiring bottom data to be cleaned, calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm; wherein, the initial clustering center is obtained by calculating through a K-Means clustering algorithm; classifying and sorting according to Euclidean distance through a MapReduce model, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center; determining the final category of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final category to obtain an abnormal value processing result; and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
The beneficial effects are that: according to the method and the device, the final category of the bottom data to be cleaned is obtained through the MapReduce model calculation, and the abnormal value processing is carried out according to the final category, so that the efficiency of abnormal value processing can be effectively improved; and after integrity restoration is carried out according to the abnormal value processing result obtained quickly, data cleaning of the bottom data to be cleaned is completed, so that the data cleaning efficiency of the bottom data to be cleaned is improved, and the protection accuracy of the relay protection system is further improved.
Meanwhile, the abnormal value processing is carried out according to the final category, so that the accuracy of the abnormal value processing can be improved, and the precision of the abnormal value processing result can be improved; the quality of the underlying data can be improved, and a high-quality underlying data base is provided for other applications, so that the accuracy and performance of data mining or data stream mining are improved. The initial clustering center is obtained through calculation of a K-Means clustering algorithm, the use range of the K-Means clustering algorithm is expanded to a cloud computing platform from a single machine under a MapReduce framework, the operation time of the K-Means clustering algorithm is greatly reduced facing mass data, and the operation efficiency is remarkably improved.
Moreover, the data acquisition and storage are realized based on the Hadoop technology, so that the data can be effectively acquired by the digital twin acquisition layer in the data cleaning and conversion process, and the efficiency and the accuracy of data cleaning are further improved.
In addition, the embodiment of the invention provides a 'collector' interface conforming to the IEC-61850 standard to collect original data layer signals and data. The standardization of the interface can optimize the automation system of the transformer substation, improve the safety and reliability of the whole system, and finally realize information sharing and system integration within the substation.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for cleaning underlying data according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an underlying data cleaning apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a schematic flow chart of a method for cleaning underlying data according to an embodiment of the present invention, the method includes steps S101 to S104:
s101: after the bottom data to be cleaned is obtained, calculating Euclidean distances between a plurality of objects in the bottom data to be cleaned and a plurality of initial clustering centers in a MapReduce model through a K-Means clustering algorithm.
Wherein the initial clustering center is calculated by a K-Means clustering algorithm.
In this embodiment, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
In a specific embodiment, the method further comprises:
the MapReduce model divides the bottom data to be cleaned into a plurality of fragments with fixed sizes, stores the fragments into key value pairs, and performs distributed computation according to the key value pairs and the objects to obtain distributed computation results.
In this embodiment, the acquiring of the data of the bottom layer to be cleaned specifically includes:
acquiring initial bottom layer data, performing dimensionality reduction on the initial bottom layer data, generating and acquiring the bottom layer data to be cleaned; the acquiring of the initial bottom layer data specifically includes:
acquiring the initial bottom layer data based on the Hadoop technology; wherein the initial bottom layer data comprises: four remote data, platen data, fixed value data, alarm data, fault signals, and action event data.
Further, after the dimension reduction processing is performed on the initial bottom layer data, the bottom layer data to be cleaned is generated and acquired, specifically:
and reducing the dimension of the initial bottom data by a Logsf feature selection algorithm, and eliminating redundant features to obtain the bottom data to be cleaned.
Further, after data cleaning and data conversion are performed on the bottom layer data to be cleaned, the result is written back to the distributed file system HDFS (Hadoop Distributed File System), specifically: after the data cleaning of the bottom layer data to be cleaned is finished, a first cleaning result is generated; the data property of the first cleaning result is acquired, a data conversion operation is performed according to the data property to generate a first conversion result, and the first conversion result is stored in the distributed file system.
Further, the distributed file system HDFS divides nodes into 3 types of roles, which are: a main server node (Namenode), a data block server node (Datanode), and a Client (Client). The main server node is a management node of the HDFS system, is used for storing metadata of the system and plays a management role. The data block server node is responsible for specific massive information storage work, and all files are adjusted to 64 MB-sized data blocks for multi-copy storage. The client provides an access interface for the application program, and can interact with the data block server node.
The main idea of the Logsf algorithm is: in the process of calculating the loss function of the data set, the energy function and the nearest neighbor classification idea are applied to convert the complex and nonlinear problem in any group of characteristic data sets into a simple and easily understood local linear problem.
Assume that the training sample set R is:
R = {M, N} = {(m_i, n_i) | i = 1, …, X}, m_i = (m_i1, m_i2, …, m_id) ∈ R^d
wherein m_i is the i-th training sample in the data set, n_i is the label corresponding to that training sample, X is the number of samples in the training sample set, M is the training sample set, and N is the corresponding label set.
The loss function for sample mi is then:
L(β, m_i) = log(1 + exp(−β^T F_1));
wherein F_1 = |m_i − m'_i| − |m_i − n'_i| is an intermediate variable, m'_i is the sample nearest to m_i but with a different label, n'_i is the sample nearest to m_i with the same label, and β is the feature weight vector. Minimizing the loss function yields the ideal weight β*, under which the distance between the sample m_i and its nearest same-label sample n'_i is smaller than the distance between m_i and m'_i.
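The Logsf loss function above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name and the way the nearest-miss and nearest-hit samples are supplied are assumptions for the example.

```python
import numpy as np

def logsf_loss(beta, m_i, near_miss, near_hit):
    """Logistic loss for one sample m_i, per the formula above.

    near_miss: nearest sample to m_i with a different label (m'_i)
    near_hit:  nearest sample to m_i with the same label (n'_i)
    beta:      per-feature weight vector
    """
    # F1 = |m_i - m'_i| - |m_i - n'_i|  (element-wise absolute differences)
    f1 = np.abs(m_i - near_miss) - np.abs(m_i - near_hit)
    # L(beta, m_i) = log(1 + exp(-beta^T F1))
    return float(np.log1p(np.exp(-beta @ f1)))
```

Minimizing this loss over β pushes each sample closer to its nearest same-label neighbor than to its nearest different-label neighbor; features receiving near-zero weights can then be eliminated as redundant, which is the dimension reduction step described above.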
In a specific embodiment, the obtaining initial underlying data specifically includes:
acquiring the initial bottom layer data from an HDFS data warehouse based on the Hadoop technology, wherein the initial bottom layer data comprises: four remote data, platen data, fixed value data, alarm data, fault signals and action event data. The collected initial bottom layer data is stored in the Hadoop storage system. The initial bottom layer data is located in the original data layer, and data in the original data layer is transmitted to the digital twin layer so that the digital twin layer can perform a series of data processing operations such as data cleaning. A 'collector' interface conforming to the IEC-61850 standard is therefore defined: the original data layer realizes data transmission and communication with the digital twin layer through the collector, that is, the collector transmits the important initial bottom layer data of the original data layer to the digital twin layer.
Further, the collector interface is also used for uploading the electrical parameter signal into the digital twin collection layer.
In a specific embodiment, the IEC-61850 compliant collector comprises 4 functional modules: the device comprises a synchronous signal module, a data acquisition module, a digital signal processing module and a framing coding communication module.
A synchronization signal module: an externally input 1PPS (pulse-per-second) signal is correctly identified and tracked through an FPGA (field programmable gate array), a signal is then generated, and after exception handling of abnormal signals, a synchronous sampling signal is sent to the primary equipment.
A data acquisition module: after the collector sends synchronous sampling control signals to each path of A/D converter, the FPGA in the collector receives digital quantity or analog small-signal quantity data.
The digital signal processing module: the DSP in the data collector performs filtering and FFT (fast Fourier transform) on the data collected by the FPGA to obtain sampled data values such as current, voltage and phase for panel display, and the PowerPC corrects the phase error of the original signal collected by the FPGA.
The framing coding communication module: in PowerPC, after calibrating each signal sampling point, the data is subjected to framing coding according to IEC61850 standard and is sent to a collection layer for deep processing.
In a specific embodiment, the data property of the first cleaning result is obtained, a data conversion operation is performed according to the data property to generate a first conversion result, and the first conversion result is stored in a distributed file system, specifically:
the purpose of data conversion is to transform data into a uniform format or format suitable for analysis, which is achieved through data normalization operations. Normalization refers to scaling the attribute data to fall within a small specific interval. This example used maximum-minimum normalization and z-score normalization to perform data transformation on the first wash results to obtain first transformation results.
The maximum and minimum normalized calculation formula is as follows:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
wherein max_A and min_A are the maximum and minimum values of attribute A, v is a value of attribute A, v' is the mapping of v onto the interval [new_min_A, new_max_A], and new_max_A and new_min_A are the new maximum and minimum values of attribute A after mapping.
Through the maximum-minimum normalization formula, the values of attribute A are mapped into the range [new_min_A, new_max_A]. The disadvantage of maximum-minimum normalization is that newly added data may change max_A and min_A, requiring the mapping to be recomputed; Z-score normalization avoids this problem.
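A minimal sketch of the maximum-minimum normalization formula above (plain Python; the function name is illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map all values of an attribute A onto [new_min, new_max]:
    v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
```

For example, `min_max_normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`. Note that appending a value outside `[2, 6]` changes `min_A`/`max_A` and forces all outputs to be recomputed, which is the drawback discussed above.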
The Z-score normalized calculation formula is as follows:
v' = (v − Ā) / δ_A
wherein Ā is the mean value of attribute A and δ_A is the standard deviation of attribute A; v' is obtained by performing Z-score normalization on the value v of attribute A.
Z-score normalization is valid where the maximum and minimum values of attribute A are unknown.
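The Z-score formula above can be sketched as follows (population standard deviation; the function name is illustrative):

```python
import math

def z_score_normalize(values):
    """Z-score normalize an attribute: v' = (v - mean_A) / std_A."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]
```

Unlike maximum-minimum normalization, this requires only the mean and standard deviation, so it remains applicable when the true maximum and minimum of attribute A are unknown.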
Further, obtaining the bottom layer data to be cleaned means that preprocessing (namely dimension reduction) of the initial bottom layer data has already been completed. A distance-first cleaning rule is then applied to the bottom layer data to be cleaned: given the error Δv between the bottom layer data to be cleaned and the real data, whether Δv meets the minimum distance requiring cleaning is judged. If so, S102-S103 are performed and the cleaning result is recorded into the HDFS; if not, the bottom layer data to be cleaned is written back to the HDFS directly for storage.
S102: and carrying out classification sorting according to Euclidean distances through a MapReduce model, and carrying out iterative computation according to a sorting result to obtain a final aggregation clustering center.
In this embodiment, the performing, by the MapReduce model, classification and sorting according to the euclidean distance, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center specifically:
and obtaining a new clustering center according to the sequencing result, calculating a change value between the new clustering center and the initial clustering center, and taking the new clustering center as the final aggregation clustering center when the change value is smaller than a preset numerical value.
Further, when the change value is greater than or equal to the preset value, the new clustering center replaces the initial clustering center and S101-S102 are repeated, iteratively updating the clustering centers, until the change value between the latest generation of clustering centers and the previous generation is smaller than the preset value; the iterative computation is then complete and the final aggregation clustering centers are obtained.
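The iterate-until-convergence scheme of S101-S102 can be sketched as below. This is an in-process illustration, not the patented MapReduce implementation; the initial centers are passed in directly (in the embodiment they come from the maximum-minimum distance algorithm), and `eps` plays the role of the preset value.

```python
def euclid(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmeans(points, centers, eps=1e-4, max_iter=100):
    """Repeat S101 (assignment) and S102 (center update) until the change
    value between successive generations of centers falls below eps."""
    k = len(centers)
    for _ in range(max_iter):
        # S101: assign each object to its nearest clustering center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: euclid(p, centers[j]))
            clusters[j].append(p)
        # S102: new center = mean of each cluster (keep old center if empty)
        new_centers = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        change = max(euclid(c_new, c_old) for c_new, c_old in zip(new_centers, centers))
        centers = new_centers
        if change < eps:
            break  # final aggregation clustering centers reached
    return centers
```

When the maximum center movement drops below `eps`, the latest generation of centers is returned as the final aggregation clustering centers.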
S103: and determining the final class of the bottom data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom data to be cleaned according to the final class to obtain an abnormal value processing result.
In a specific embodiment, the abnormal value processing of the underlying data to be cleaned is performed according to the final category, specifically:
the outlier processing includes missing value filling, and further, the missing value filling method is the most reasonable method for processing the missing data problem. The missing value records and the complete data set have a lot of information correlation, and a data set similar to the missing value can be found by clustering and analyzing the data, so that the missing value filling is carried out more accurately. Preferably, the K-Means clustering algorithm is used as a missing value filling method, and has the advantages of simplicity and high efficiency. It organizes the objects into multiple mutually exclusive groups or clusters, considering that the closer two objects are, the greater their similarity.
The principle of the K-Means clustering algorithm is as follows: assume that the data set D contains n objects in Euclidean space, and the objects of D are to be assigned to k clusters C_1, …, C_k, such that C_i ⊂ D and C_i ∩ C_j = ∅ for 1 ≤ i, j ≤ k, i ≠ j.
Let p be a point in space representing a given data object and c_i the center of cluster C_i, where p and c_i are both multi-dimensional. The Euclidean distance is used as the evaluation index: dist(x, y) denotes the Euclidean distance between two points x and y, and the difference between an object p ∈ C_i and the cluster representative c_i is denoted dist(p, c_i). The quality of cluster C_i is measured by its intra-cluster variation, i.e., the sum of the squared errors between all objects in C_i and the center c_i, defined as:
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)^2
wherein E is the sum of the squared errors of all objects in the data set.
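The sum-of-squared-errors E above can be computed directly (a minimal sketch; clusters are given as lists of points alongside their centers):

```python
def sse(clusters, centers):
    """Intra-cluster variation E: sum over clusters C_i of the squared
    Euclidean distances between each object p in C_i and its center c_i."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for cl, c in zip(clusters, centers)
        for p in cl
    )
```

K-Means seeks the assignment and centers that minimize E; for a fixed assignment, E is minimized by taking each c_i to be the mean of C_i.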
Further, before the missing value filling, the method further includes:
1. determining the missing data range: calculating the missing data proportion of each field, and then formulating a strategy for each field according to its missing proportion and importance;
2. removing unnecessary fields.
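The two preparatory steps above — measuring the per-field missing proportion and dropping unnecessary fields — can be sketched in Python as follows; the record layout and field names are hypothetical:

```python
def missing_ratios(records, fields):
    """Step 1: fraction of None values per field across all records."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}

def drop_fields(records, unwanted):
    """Step 2: remove unnecessary fields before filling missing values."""
    return [{k: v for k, v in r.items() if k not in unwanted} for r in records]

rows = [
    {"u": 1.0, "i": None, "note": "x"},
    {"u": 2.0, "i": 0.5, "note": None},
    {"u": None, "i": 0.7, "note": "y"},
    {"u": 4.0, "i": 0.9, "note": "z"},
]
print(missing_ratios(rows, ["u", "i"]))  # {'u': 0.25, 'i': 0.25}
cleaned = drop_fields(rows, {"note"})    # "note" judged unnecessary here
```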
Specifically, the embodiment of the invention performs clustering processing on the bottom data to be cleaned and generates corresponding categories, and then performs abnormal value processing according to the categories, so that the efficiency of abnormal value processing can be effectively improved, and the efficiency of data cleaning is further improved.
In a specific embodiment, the K-Means clustering algorithm runs under the MapReduce model as follows:
In the Map stage, the Euclidean distances between the objects in the bottom layer data to be cleaned and the initial clustering centers are calculated and recorded, and the initial category of each object is determined from the object and its corresponding Euclidean distances. In the Reduce stage, the objects are classified and sorted according to the Euclidean distances obtained in the Map stage, and a new clustering center is calculated for the next round of Map. If the change between the new clustering center obtained in the Reduce stage and the clustering center of the previous round is smaller than a preset value, the algorithm ends; otherwise, a new round of the MapReduce process is performed. The iterative computation finishes when the change between the latest generation of clustering centers and the previous generation is smaller than the preset value, yielding the final clustering centers.
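As a single-process analogue of the iteration described above — Map assigns each object to its nearest center, Reduce recomputes the centers, and the loop stops once every center moves less than a preset value — consider this Python sketch; the names and the convergence threshold are illustrative:

```python
import math

def nearest(point, centers):
    """Map step: index of the closest center by Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, eps=1e-4, max_rounds=100):
    """Iterate map (assign) + reduce (recompute means) until every
    center moves less than eps, mirroring the MapReduce loop."""
    for _ in range(max_rounds):
        buckets = [[] for _ in centers]
        for p in points:                      # Map: assign each object
            buckets[nearest(p, centers)].append(p)
        new = [
            tuple(sum(c) / len(b) for c in zip(*b)) if b else centers[i]
            for i, b in enumerate(buckets)    # Reduce: new centers
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new))
        centers = new
        if shift < eps:                       # change below preset value
            break
    return centers

pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(kmeans(pts, [(0.0,), (10.0,)]))  # [(0.5,), (10.5,)]
```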
Further, the operation principle of the MapReduce model is as follows:
in the Map stage, the parallel computing framework divides the input data into fixed-size splits and stores each split as key-value pairs in the format <key1, value1>; each Mapper performs distributed computation on its input key-value pairs to obtain intermediate results <key2, value2>, which are then sorted by key2, with results sharing the same key2 value grouped together to form <key2, List(value2)>. In the Reduce stage, the Reducer merges the intermediate results output by different Mappers, sorts them, and then calls the user-defined reduce() function to process them.
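The <key1, value1> → <key2, value2> → <key2, List(value2)> flow can be modelled in a few lines of Python. This sketch collapses the distributed framework into one process and uses word count purely as an illustration:

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer):
    """Minimal single-process model of the MapReduce flow:
    map -> <key2, value2>, shuffle -> <key2, List(value2)>, reduce."""
    intermediate = defaultdict(list)
    for key1, value1 in splits:          # fixed-size input splits
        for key2, value2 in mapper(key1, value1):
            intermediate[key2].append(value2)   # group equal key2 together
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}

# Word count as the canonical illustration of the model
splits = [(0, "a b a"), (1, "b c")]
mapper = lambda k, line: [(w, 1) for w in line.split()]
reducer = lambda word, counts: sum(counts)
print(run_mapreduce(splits, mapper, reducer))  # {'a': 2, 'b': 2, 'c': 1}
```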
In a specific embodiment, the initial clustering center is calculated by a K-Means clustering algorithm, specifically:
(1) Each Map node reads the data set uploaded to the data acquisition layer and generates a plurality of cluster sets using the maximum-minimum distance algorithm.
(2) In the Reduce stage, K initial clustering centers are generated from the cluster sets produced in the Map stage using the K-Means clustering algorithm.
(3) The information of the generated initial clustering centers is written into the Cluster directory, and the files in that directory are added to the Hadoop Distributed Cache as global shared information for the next clustering iteration.
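The maximum-minimum distance seeding of step (1) can be sketched as follows. This simplified Python version fixes the number of centers k in advance, whereas the classical algorithm derives the cluster count from a distance threshold; that simplification is an assumption of the sketch:

```python
import math

def max_min_centers(points, k):
    """Maximum-minimum distance seeding: start from the first point,
    then repeatedly pick the point whose distance to its nearest
    already-chosen center is largest."""
    centers = [points[0]]
    while len(centers) < k:
        far = max(points,
                  key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    return centers

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
print(max_min_centers(pts, 2))  # [(0.0, 0.0), (10.0, 10.0)]
```

Seeding centers this way spreads them across the data, which tends to reduce the number of K-Means iterations needed afterwards.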
Further, the MapReduce implementation of the K-Means algorithm is specifically as follows:
(1) Each Map node reads, in its setup() method, the cluster center information generated by the previous iteration from the distributed cache.
(2) The map() method calculates the Euclidean distance between each data point and each cluster center, finds the nearest cluster center, and emits the ID of that cluster center as the key and the data point information as the value.
(3) A Combiner on the Map side merges the values sharing the same cluster ID key on each Map node, so as to reduce the network transmission overhead of the data.
(4) On the Reduce side, the results produced by the Combiners are merged, and the data points of the same cluster are used to calculate a temporary center point according to the formula

a_i = \frac{1}{m_i} \sum_{x \in C_i} x

which is then added to the distributed cache. Here, a_i is the temporary center point of cluster C_i, m_i is the total number of data points in C_i, and x ranges over the data points in C_i.
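Steps (3) and (4) — per-node combining followed by the a_i = (1/m_i)·Σx reduction — can be sketched for one-dimensional points as follows; the (sum, count) partial representation is an implementation choice of this sketch, not mandated by the embodiment:

```python
from collections import defaultdict

def combine(map_output):
    """Combiner: merge values with the same cluster ID on one Map node
    into a partial (sum, count) so less data crosses the network."""
    partial = defaultdict(lambda: [0.0, 0])
    for cluster_id, x in map_output:
        partial[cluster_id][0] += x
        partial[cluster_id][1] += 1
    return dict(partial)

def reduce_centers(partials):
    """Reduce: a_i = (1/m_i) * sum of the points in cluster C_i."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for cid, (s, m) in part.items():
            total[cid][0] += s
            total[cid][1] += m
    return {cid: s / m for cid, (s, m) in total.items()}

node1 = combine([(0, 1.0), (0, 3.0), (1, 10.0)])   # Map node 1 output
node2 = combine([(0, 2.0), (1, 12.0)])             # Map node 2 output
print(reduce_centers([node1, node2]))  # {0: 2.0, 1: 11.0}
```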
S104: and finishing the data cleaning of the bottom layer data to be cleaned after the integrity repairing operation is carried out according to the abnormal value processing result.
In a specific embodiment, the performing an integrity repair operation according to the abnormal value processing result specifically includes:
detecting the data format (namely the data property) in the abnormal value processing result and preprocessing it; judging whether the preprocessed data conforms to the data integrity constraint, and repairing the data if it does not; if the data still violates the integrity constraint after being repaired, repairing it again until the requirement is met; after the repair, restoring the data to its original format, thereby completing the data cleaning of the bottom layer data to be cleaned. The integrity constraint may be expressed as a first-order formula of the form

\forall \bar{x} : \left( P_1(\bar{x}_1) \wedge \cdots \wedge P_m(\bar{x}_m) \wedge \varphi \right) \rightarrow \left( P'_1(\bar{y}_1) \wedge \cdots \wedge P'_n(\bar{y}_n) \wedge \psi \right)

where each P_i denotes a relation; \bar{x}_i and \bar{y}_j denote tuple variables and constants; \varphi and \psi denote formulas containing only built-in predicates; m and n are positive integers; \bar{x}_i is the attribute set corresponding to P_i, and \bar{y}_j is the attribute set corresponding to P'_j.
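The repair-until-consistent loop described above can be sketched as follows; the example constraint (a reading must be non-negative) and the repair action are hypothetical, not taken from the embodiment:

```python
def repair_until_valid(records, constraints, repair, max_passes=10):
    """Re-run repair while any integrity constraint is violated,
    as in the repeat-until-consistent step of the method.
    `constraints` are predicates over one record; `repair` fixes one."""
    for _ in range(max_passes):
        bad = [r for r in records if not all(c(r) for c in constraints)]
        if not bad:
            return records        # every record satisfies every constraint
        records = [repair(r) if r in bad else r for r in records]
    raise ValueError("constraints still violated after max_passes")

# Hypothetical constraint: a voltage reading must be non-negative
non_negative = lambda r: r["v"] >= 0.0
clip = lambda r: {**r, "v": max(r["v"], 0.0)}
data = [{"v": 3.0}, {"v": -1.0}]
print(repair_until_valid(data, [non_negative], clip))  # [{'v': 3.0}, {'v': 0.0}]
```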
In this embodiment, the method further includes:
after data cleaning of the bottom layer data to be cleaned is finished, generating a first cleaning result;
and acquiring the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in a distributed file system.
In a specific embodiment, the obtaining of the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in a distributed file system specifically includes:
The purpose of data conversion is to transform the data into a uniform format suitable for analysis, which is achieved through data normalization. Normalization scales the attribute data so that it falls within a small, specific interval. This embodiment performs data conversion on the first cleaning result using max-min normalization and Z-score normalization to obtain the first conversion result.
The max-min normalization formula is as follows:

v' = \frac{v - min_A}{max_A - min_A} \left( new\_max_A - new\_min_A \right) + new\_min_A

where max_A and min_A are the maximum and minimum values of attribute A respectively, v is a value of attribute A, and v' is v mapped to the interval [new\_min_A, new\_max_A], whose endpoints new\_max_A and new\_min_A are the new maximum and minimum values of attribute A.

Through this formula, the values of attribute A are mapped to v' in the range [new\_min_A, new\_max_A]. The disadvantage of max-min normalization is that newly added data may change max_A and min_A, requiring them to be redefined and the normalization recomputed.
The Z-score normalization formula is as follows:

v' = \frac{v - \bar{A}}{\delta_A}

where \bar{A} is the mean of attribute A, \delta_A is the standard deviation of attribute A, and v' is obtained by Z-score normalization of the value v of attribute A.

Z-score normalization is effective in cases where the maximum and minimum values of attribute A are unknown.
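Both normalization formulas can be sketched in Python. `statistics.pstdev` is used here for the population standard deviation δ_A; whether the embodiment intends population or sample deviation is an assumption of this sketch:

```python
import statistics

def max_min_normalize(values, new_min=0.0, new_max=1.0):
    """v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score_normalize(values):
    """v' = (v - mean_A) / stddev_A; usable when min/max are unknown."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

a = [10.0, 20.0, 30.0]
print(max_min_normalize(a))      # [0.0, 0.5, 1.0]
print(z_score_normalize(a))      # roughly [-1.2247, 0.0, 1.2247]
```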
To further explain the bottom layer data cleaning device, please refer to fig. 2, fig. 2 is a schematic structural diagram of a bottom layer data cleaning device according to an embodiment of the present invention, including: a first calculation module 201, a second calculation module 202 and a cleaning module 203;
the first calculating module 201 is configured to calculate euclidean distances between a plurality of objects and a plurality of initial clustering centers in the bottom data to be cleaned through a K-Means clustering algorithm in a MapReduce model after the bottom data to be cleaned is acquired; wherein the initial clustering center is calculated by the K-Means clustering algorithm;
the second calculation module 202 is configured to perform classification and sorting according to the euclidean distance through the MapReduce model, and perform iterative calculation according to a sorting result to obtain a final aggregation clustering center;
the cleaning module 203 is configured to determine a final category of the bottom layer data to be cleaned according to the final aggregation clustering center, and perform abnormal value processing on the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result; and completing data cleaning of the bottom layer data to be cleaned after integrity repairing operation is carried out according to the abnormal value processing result.
In this embodiment, the initial clustering center is calculated by a K-Means clustering algorithm, and specifically includes:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
In an embodiment, the present invention provides a mobile terminal, which includes a processor and a memory, where the memory stores a computer-readable program code, and the processor implements the steps of the above-mentioned method for cleaning underlying data when executing the computer-readable program code.
In one embodiment, the present invention provides a storage medium storing computer readable program code that when executed implements the steps of an underlying data cleansing method described above.
According to the embodiment of the invention, after the bottom layer data to be cleaned is obtained, the first calculation module 201 calculates the Euclidean distances between a plurality of objects in the bottom layer data to be cleaned and a plurality of initial clustering centers through the K-Means clustering algorithm in the MapReduce model, wherein the initial clustering centers are calculated by the K-Means clustering algorithm; the second calculation module 202 classifies and sorts according to the Euclidean distances through the MapReduce model and performs iterative calculation according to the sorting result to obtain the final aggregation clustering center; finally, the cleaning module 203 determines the final category of the bottom layer data to be cleaned according to the final aggregation clustering center and performs abnormal value processing on the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result, and the data cleaning of the bottom layer data to be cleaned is completed after the integrity repair operation is performed according to the abnormal value processing result.
According to the method and the device, the final category of the bottom data to be cleaned is obtained through the MapReduce model calculation, and the abnormal value processing is carried out according to the final category, so that the efficiency of abnormal value processing can be effectively improved; and after integrity restoration is carried out according to the rapidly obtained abnormal value processing result, data cleaning of the bottom data to be cleaned is completed, so that the data cleaning efficiency of the bottom data to be cleaned is improved, and the protection accuracy of the relay protection system is further improved.
Meanwhile, the abnormal value processing is carried out according to the final category, so that the accuracy of the abnormal value processing can be improved, and the precision of the abnormal value processing result can be improved; the quality of the underlying data can be improved, and a high-quality underlying data base is provided for other applications, so that the accuracy and performance of data mining or data stream mining are improved. The initial clustering center is obtained through calculation of a K-Means clustering algorithm, the application range of the K-Means clustering algorithm is expanded to a cloud computing platform from a single machine under a MapReduce framework, the operation time of the K-Means clustering algorithm is greatly reduced in the face of mass data, and the operation efficiency is remarkably improved.
Moreover, the data acquisition and storage are realized based on the Hadoop technology, so that the data can be effectively acquired by the digital twin acquisition layer in the data cleaning and conversion process, and the efficiency and the accuracy of data cleaning are further improved.
In addition, the embodiment of the invention provides the 'collector' interface which accords with the IEC-61850 standard to collect the original data layer signals and data, the standardization of the interface can optimize the automation system of the transformer substation, the safety and the reliability of the whole system are improved, and the sharing and the system integration of the information in the substation are finally realized.
Finally, the Logsf feature selection algorithm is used to reduce the dimension of the initial bottom layer data, so that a complex, nonlinear problem can be converted into a simple, easily understood locally linear problem, improving the data processing efficiency.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. An underlying data scrubbing method, comprising:
after acquiring bottom data to be cleaned, calculating Euclidean distances between a plurality of objects and a plurality of initial clustering centers in the bottom data to be cleaned in a MapReduce model through a K-Means clustering algorithm; wherein the initial clustering center is calculated by the K-Means clustering algorithm;
classifying and sorting according to the Euclidean distance through the MapReduce model, and performing iterative computation according to a sorting result to obtain a final aggregation clustering center;
determining the final category of the bottom layer data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result;
and finishing the data cleaning of the bottom layer data to be cleaned after performing integrity repair operation according to the abnormal value processing result.
2. The method for cleaning underlying data according to claim 1, wherein the initial clustering center is calculated by a K-Means clustering algorithm, and specifically comprises:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
3. The underlying data scrubbing method of claim 1, further comprising:
the MapReduce model divides the bottom data to be cleaned into a plurality of fragments with fixed sizes, stores the fragments into key value pairs, and performs distributed computation according to the key value pairs and the objects to obtain distributed computation results.
4. The method for cleaning the underlying data according to claim 1, wherein the MapReduce model is used for sorting and ordering according to the euclidean distance, and a final aggregation clustering center is obtained through iterative computation according to an ordering result, specifically:
and obtaining a new clustering center according to the sequencing result, calculating a change value between the new clustering center and the initial clustering center, and taking the new clustering center as the final aggregation clustering center when the change value is smaller than a preset numerical value.
5. The method for cleaning the underlying data according to claim 1, wherein the acquiring of the underlying data to be cleaned specifically comprises:
acquiring initial bottom layer data, performing dimension reduction processing on the initial bottom layer data, generating and acquiring the bottom layer data to be cleaned; the acquiring of the initial bottom layer data specifically includes:
acquiring and obtaining the initial bottom layer data based on a Hadoop technology; wherein the initial underlying data comprises: four remote data, platen data, fixed value data, alarm data, fault signals and action event data.
6. The method of claim 5, further comprising:
after data cleaning of the bottom layer data to be cleaned is finished, generating a first cleaning result;
and acquiring the data property of the first cleaning result, performing data conversion operation according to the data property to generate a first conversion result, and storing the first conversion result in a distributed file system.
7. An underlying data cleaning apparatus, comprising: the device comprises a first calculation module, a second calculation module and a cleaning module;
the first calculation module is used for calculating Euclidean distances between a plurality of objects and a plurality of initial clustering centers in the bottom data to be cleaned through a K-Means clustering algorithm in a MapReduce model after the bottom data to be cleaned is obtained; wherein the initial clustering center is calculated by the K-Means clustering algorithm;
the second calculation module is used for carrying out classification and sorting according to the Euclidean distance through the MapReduce model and carrying out iterative calculation according to a sorting result to obtain a final aggregation clustering center;
the cleaning module is used for determining the final category of the bottom layer data to be cleaned according to the final aggregation clustering center, and processing the abnormal value of the bottom layer data to be cleaned according to the final category to obtain an abnormal value processing result; and finishing the data cleaning of the bottom layer data to be cleaned after performing integrity repair operation according to the abnormal value processing result.
8. The bottom-layer data cleaning apparatus of claim 7, wherein the initial clustering center is calculated by a K-Means clustering algorithm, specifically:
calculating to obtain a plurality of cluster sets in the MapReduce model according to a maximum-minimum distance algorithm;
and calculating to obtain a plurality of initial clustering centers according to the plurality of clustering sets and the K-Means clustering algorithm.
9. A mobile terminal comprising a processor and a memory, the memory storing computer readable program code, the processor implementing the steps of a method for scrubbing underlying data as claimed in any one of claims 1 to 6 when the computer readable program code is executed by the processor.
10. A storage medium storing computer readable program code which when executed performs the steps of a method of cleaning underlying data as claimed in any one of claims 1 to 6.
CN202210152348.7A 2022-02-18 2022-02-18 Bottom layer data cleaning method and device, mobile terminal and storage medium Pending CN114528284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210152348.7A CN114528284A (en) 2022-02-18 2022-02-18 Bottom layer data cleaning method and device, mobile terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114528284A true CN114528284A (en) 2022-05-24

Family

ID=81623720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210152348.7A Pending CN114528284A (en) 2022-02-18 2022-02-18 Bottom layer data cleaning method and device, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114528284A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115718744A * 2022-11-28 2023-02-28 北京中航路通科技有限公司 Data quality measurement method based on big data
CN115718744B * 2022-11-28 2023-07-21 北京中航路通科技有限公司 Data quality measurement method based on big data
CN116774639A * 2023-08-24 2023-09-19 中国水利水电第九工程局有限公司 Sewage treatment equipment remote control system based on internet
CN116774639B * 2023-08-24 2023-10-27 中国水利水电第九工程局有限公司 Sewage treatment equipment remote control system based on internet


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination