CN115168326A - Hadoop big data platform distributed energy data cleaning method and system - Google Patents


Info

Publication number
CN115168326A
CN115168326A (application CN202210508315.1A)
Authority
CN
China
Prior art keywords
data
distributed energy
type
hadoop big
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210508315.1A
Other languages
Chinese (zh)
Inventor
刘洋
李立生
张世栋
于海东
刘明林
黄敏
王浩
房牧
刘文彬
刘合金
苏国强
张鹏平
李帅
王峰
文祥宇
由新红
张林利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202210508315.1A priority Critical patent/CN115168326A/en
Publication of CN115168326A publication Critical patent/CN115168326A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a distributed energy data cleaning method and system for a Hadoop big data platform.

Description

Hadoop big data platform distributed energy data cleaning method and system
Technical Field
The invention relates to the technical field of electric power data processing, in particular to a distributed energy data cleaning method and system for a Hadoop big data platform.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Photovoltaic power generation technology continues to be digitized, and with it come data of large volume, high density, and many types. Data obtained after screening for valid records, cleaning, and handling abnormal records is the basis of later data analysis. A photovoltaic system produces a large number of abnormal values during actual operation, caused by data transmission signal noise, sensor faults, communication failures, faulty measuring equipment at power stations, and the like; this abnormal data reduces the effectiveness of the data set. Screening valid data for qualitative and quantitative analysis achieves effective cleaning of distributed energy data. Modeling and machine analysis on a Hadoop big data platform greatly improves efficiency and effectively avoids the errors of manual analysis, and the cleaned data provides powerful support for subsequent big data modeling, analysis, and prediction.
The inventor finds that existing photovoltaic power generation data processing methods do not clean data according to the characteristics of distributed energy, handle consistency checks, invalid values, and missing values of distributed energy data poorly, and cannot obtain usable photovoltaic power generation data efficiently and accurately.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a distributed energy data cleaning method and system for a Hadoop big data platform. Supported by the Hadoop big data platform, a python scientific computing library performs preliminary exploration and analysis of the data; the third-party plotting library plot is used to analyze the data and identify abnormal data types, missing values, the scale of the data set, and the data distribution under each characteristic. Cluster analysis on the Hadoop big data platform then fills missing values and handles abnormal values, redundant data is removed, and the distributed energy data is cleaned quickly and accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a distributed energy data cleaning method for a Hadoop big data platform in a first aspect.
A distributed energy data cleaning method for a Hadoop big data platform comprises the following steps:
obtaining distributed energy data based on a Hadoop big data platform, and converting the obtained distributed energy data into a first specific data type;
drawing a discrete graph by using data of a first specific data type, finding abnormal data, and determining the type of the abnormal data;
converting the data of the first specific data type into a second specific data type, constructing a unique primary key through time and equipment codes, and clearing redundant data of the second specific data type;
judging whether the time sequence of the data with redundant data removed is complete, filling data via similar days where the time sequence is incomplete, and performing missing-value processing on data whose time sequence is complete;
analyzing and predicting missing values with a preset clustering algorithm and filling them;
and judging the importance degree of the data attribute of the obtained abnormal data type, clustering the data with the attribute importance degree larger than a preset value, and backfilling a normal value to obtain the cleaned distributed energy data.
As an optional implementation manner, analyzing and predicting missing values with a preset clustering algorithm and filling them includes:
when data is abnormal or missing, selecting the optimal value to fill it according to the similarity between the label of the current abnormal or missing value and each group.
As an optional implementation manner, the preset clustering algorithm includes:
the input data are: the text file inputFile storing the sample data, the SequenceFile inputPath of the sample data, the SequenceFile centerPath of the centroid data, the path clusterPath of the clustering result file (a SequenceFile), and the number of classes k;
the output data is: k classes;
reading inputPath, screening neighborhood core points by minimum variance, selecting with the maximum-minimum distance method k high-density, widely separated core points as the initial centroids, and writing the centroid data into centerPath;
while the clustering termination condition is not met: in the Mapper stage, reading inputPath, traversing all centroids for the point corresponding to each key, selecting the nearest centroid, taking that centroid's serial number as the key, and passing the point's serial number to the Reducer as the value;
in the Reducer stage, merging and outputting the values passed from the Mapper stage according to their keys, and writing the result into clusterPath;
reading clusterPath, recalculating the centroids, writing the result into centerPath, and repeating the process until the clustering termination condition is met.
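The maximum-minimum distance initialization described above can be illustrated with a small single-machine sketch. This covers only the distance-based selection step (the minimum-variance core-point screening is omitted); the function name and sample points are hypothetical, not from the patent.

```python
import math

def max_min_init(points, k):
    """Maximum-minimum distance initialization: start from one point,
    then repeatedly add the point whose distance to its nearest
    already-chosen centroid is largest, so centroids end up far apart."""
    centroids = [points[0]]
    while len(centroids) < k:
        def nearest_dist(p):
            # distance from candidate p to the closest chosen centroid
            return min(math.dist(p, c) for c in centroids)
        centroids.append(max(points, key=nearest_dist))
    return centroids

# two tight groups plus one distant point; k = 3 picks widely separated points
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 9.9), (0.0, 10.0)]
cs = max_min_init(pts, 3)
```

Because each new centroid maximizes its distance to the existing ones, no two initial centroids fall inside the same dense group, which is the property the text relies on to avoid local extrema.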
As an alternative implementation, an ETL tool is used to convert the acquired distributed energy data into the json type.
As an alternative implementation, the discrete graph is drawn in python with the third-party plotting library plot.
As an alternative implementation, the json-type data is converted into the DataFrame type using python, and the data is cleaned with the extension library pandas.
As an optional implementation manner, missing values are filled and abnormal data are processed by using cluster analysis through a Hadoop big data platform.
In a second aspect, the invention provides a distributed energy data cleaning system for a Hadoop big data platform.
A Hadoop big data platform distributed energy data cleaning system comprises:
a data acquisition module configured to: obtaining distributed energy data based on a Hadoop big data platform, and converting the obtained distributed energy data into a first specific data type;
an anomaly data identification module configured to: drawing a discrete graph by using data of a first specific data type, finding abnormal data, and determining the type of the abnormal data;
a redundant data cleansing module configured to: converting the data of the first specific data type into a second specific data type, constructing a unique primary key through time and equipment codes, and clearing redundant data of the second specific data type;
a time series integrity determination module configured to: judging whether the time series of the data with redundant data removed is complete, filling data via similar days where the time series is incomplete, and performing missing-value processing where the time series is complete;
a missing value padding module configured to: analyzing and predicting missing values with a preset clustering algorithm and filling them;
an exception data handling module configured to: and judging the importance degree of the data attribute of the obtained abnormal data type, clustering the data with the attribute importance degree larger than a preset value, and backfilling a normal value to obtain the cleaned distributed energy data.
In a third aspect, the present invention provides a computer readable storage medium, on which a program is stored, wherein the program, when executed by a processor, implements the steps in the Hadoop big data platform distributed energy data cleaning method according to the first aspect of the present invention.
In a fourth aspect, the invention provides an electronic device comprising a memory, a processor, and a program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the Hadoop big data platform distributed energy data cleaning method according to the first aspect of the invention.
Compared with the prior art, the invention has the beneficial effects that:
1. The Hadoop big data platform distributed energy data cleaning method and system provided by the invention are based on the Hadoop big data platform: a python scientific computing library performs preliminary exploration and analysis of the data; the third-party plotting library plot is used to analyze the data and identify abnormal data types, missing values, the scale of the data set, and the data distribution under each characteristic; cluster analysis on the Hadoop platform then fills missing values and handles abnormal values, redundant data is deduplicated, and rapid, accurate cleaning of the distributed energy data is realized.
2. In the method and system, after the data is acquired, the original data is backed up; an ETL tool converts it into an operable data type such as json, and abnormal data types are identified through python plotting analysis: repeated data, missing values, or abnormal values. For redundant data processing, python converts the data into the DataFrame type, a unique primary key is determined as time plus equipment code, and redundant records are removed. For missing-value processing, data whose attributes have low importance and a low missing rate are filled simply; data with high importance and a high missing rate are modeled, and the values are predicted and filled through cluster analysis. Abnormal values of attributes with high importance that fall outside the normal range are treated as missing values and filled through modeled cluster prediction; values with low importance and large offsets are deleted. The precision of data cleaning is thereby greatly improved.
3. The method and system process abnormal data whose attributes have high importance with a modeling method, effectively overcoming the poor handling of invalid values, missing values, and the like in existing data cleaning; compared with statistical methods and expert completion methods, a model method trains once and is highly reusable. The traditional K-means clustering algorithm has drawbacks: the choice of initial centroids directly determines the final clustering result, and noise data and outliers cannot be handled. Against K-means's sensitivity to outliers, the invention adopts an improved K-center point algorithm (K-Medoids) and, by screening minimum-variance core points with the maximum-minimum distance method, selects k high-density, widely separated neighborhood core points as initial points, effectively reducing the local extrema caused by random selection of initial points, while fully considering parallel and distributed execution of the algorithm.
In the clustering algorithm, time, photovoltaic, and weather characteristics are used as data labels; after historical data is clustered, a dictionary is populated, the similarity between a label and each group in the dictionary is computed, and standard values are filled in for missing or invalid entries. This effectively improves data usability, reduces the influence of noise, missing values, and inconsistent data, cleans the data specifically according to the characteristics of distributed energy, and the high availability of the cleaned data effectively improves the prediction accuracy of subsequent algorithm models.
4. In actual business use, the method and system combine the constructed Hadoop big data table structure with HBase's pre-partition function: 20 partitions are preset, with 20 GB of storage reserved for each; the number of the station area to which a user belongs is hashed and taken modulo the partition count, and the resulting value is used as the partition key, so that the data of each administrative region is distributed evenly across the partitions, improving system performance and query efficiency.
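The pre-partition routing just described amounts to a hash-and-modulo mapping from station-area number to region. The sketch below is a hypothetical illustration: the character-code hash, the function name, and the sample numbers are assumptions, not the production rowkey design.

```python
NUM_PARTITIONS = 20  # matches the 20 pre-split regions described above

def partition_key(station_area_no: str) -> int:
    """Hash the user's station-area number and take the remainder modulo
    the partition count; records with the same result land in the same
    pre-split HBase region."""
    # illustrative stable hash: sum of character codes (an assumption)
    h = sum(ord(ch) for ch in station_area_no)
    return h % NUM_PARTITIONS
```

Any stable hash works here; the essential property is that all records of one station area map to one region while different areas spread across all 20.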
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
Fig. 1 is a schematic flow diagram of a Hadoop big data platform distributed energy data cleaning method provided in embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of a detailed process of a cluster analysis algorithm provided in embodiment 1 of the present invention.
Fig. 3 is a schematic flow chart of a data cleansing module according to embodiment 1 of the present invention.
FIG. 4 shows Hbase table region design of the Hadoop big data platform.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments of the invention may be combined with each other without conflict.
Example 1:
as shown in fig. 1 and 2, embodiment 1 of the present invention provides a distributed energy data cleaning method for a Hadoop big data platform, including the following steps:
s1: after the original distributed energy data is obtained and backed up, an ETL tool is used to convert the data type into json;
s2: drawing a discrete graph in python with the third-party plotting library plot to find abnormal data;
s3: converting the data type into a DataFrame by using python, cleaning data by using an extended program library pandas, constructing a unique primary key by time and equipment codes, and clearing redundant data;
s4: judging whether the data time series is complete after redundant data is removed, filling data via similar days when it is incomplete, and processing missing values when it is complete;
s5: analyzing and predicting filling missing values by using a clustering algorithm based on a Hadoop big data platform;
s6: judging, through plot analysis, the attribute importance and the normal range of the abnormal data types obtained, performing cluster analysis on the data whose attribute importance is high, and backfilling normal values;
s7: and finally, obtaining the cleaned distributed energy data.
Specifically, examples are as follows:
For example, one month of original distributed energy data for a city is acquired, converted to json with an ETL tool, and output, and the original data is backed up; cleaning is then performed on the missing and abnormal values of the power-generation attribute. First, since generated energy is an important attribute, its abnormal and missing values are predicted and filled through the cluster analysis algorithm of the Hadoop platform. After plotting analysis, records of clearly abnormal types are deleted, such as installed capacity with no generated energy, or generated power with no photovoltaic capacity. Outlier data is deleted by computing each user's generated power and plotting it; the outlier rule treats any value outside the mean plus or minus three standard deviations (STD) as an outlier. For redundant data processing, a unique primary key is determined as equipment number plus time, and redundant records are cleared with pandas' duplicated() method. For missing and abnormal values, the data source is extracted to HDFS storage through an ETL tool, the original data is cleaned, processed, and computed through Hive, and values are filled after cluster analysis and prediction.
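The redundant-record removal and the three-standard-deviation outlier rule from this example can be sketched with pandas on toy data (column names and values are hypothetical). The text mentions pandas' duplicated(); drop_duplicates() used below is its convenience wrapper, keeping the first row per key.

```python
import pandas as pd

# toy records; (device_id, time) forms the unique primary key from the text
rows = [("D1", f"08:{m:02d}", 5.0) for m in range(14)]
rows.append(("D1", "08:14", 100.0))   # obviously abnormal reading
rows.append(("D1", "08:00", 5.0))     # redundant duplicate of the first row
df = pd.DataFrame(rows, columns=["device_id", "time", "power"])

# redundant data: keep only the first row for each primary-key pair
df = df.drop_duplicates(subset=["device_id", "time"])

# outlier rule from the text: values outside mean +/- 3 * std are outliers
mean, std = df["power"].mean(), df["power"].std()
clean = df[(df["power"] - mean).abs() <= 3 * std]
```

Note that the 3-sigma rule needs enough samples to work: a single extreme value among very few rows inflates the standard deviation so much that nothing gets flagged, which is why the toy set uses fifteen readings.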
In S5, analyzing and predicting the filling missing value by using a clustering algorithm based on a Hadoop big data platform, wherein the method comprises the following steps:
Based on the historical data of photovoltaic users, the users' installed capacity, time, and weather are used as features; clustering groups the data, the group averages are stored in a dictionary, and when a value is abnormal or missing, the most appropriate value is selected to fill it according to the similarity between the label of the current abnormal or missing value and each group.
S5.1: cluster definition
Clustering divides similar things into groups; it does not ask what a class is, only that similar things end up together. A clustering algorithm therefore only needs to know how to compute similarity before it can start working, so clustering usually requires no training data; in machine learning this is called unsupervised learning, which makes it well suited to data preprocessing. Clustering uses no guiding information: the classes are formed entirely from the distribution of the data rather than given in advance, grouping proceeds according to the similarity and distance of the data, and the similarity between objects is computed from the distance between them.
Common methods for calculating distance include the Euclidean and Manhattan distances:
(1) Euclidean: d = sqrt(∑(x_i1 − x_i2)²);
(2) Manhattan: d = | X1-X2| + | Y1-Y2|;
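The two distance formulas above can be written as small python helpers; a minimal sketch with hypothetical names:

```python
import math

def euclidean(p, q):
    """d = sqrt(sum_i (p_i - q_i)^2): straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """d = sum_i |p_i - q_i|: city-block distance."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

Both generalize the two-dimensional formulas in the text to points of any dimension.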
the similarity calculation method may use cosine similarity pearson.
The cosine similarity algorithm is the more common of the two. It differs from the Euclidean distance, which refers to the true distance between two points in m-dimensional space, or the natural length of a vector (i.e., the distance of the point from the origin); the Euclidean distance in two and three dimensions is the actual distance between two points.
The cosine distance, also called cosine similarity, measures the difference between two individuals by the cosine of the angle between two vectors in a vector space. The closer the cosine value is to 1, the closer the included angle is to 0 degrees and the more similar the two vectors are. In other words, the cosine distance uses the cosine of the angle between two vectors as the measure of the difference between two individuals; compared with the Euclidean distance, it focuses on the difference of the two vectors in direction.
Cosine similarity measures the similarity between two vectors by the cosine of their included angle. The cosine of a 0 degree angle is 1, the cosine of any other angle is no greater than 1, and its minimum value is −1; the cosine of the angle between two vectors thus indicates whether they point in approximately the same direction. When the two vectors have the same direction the cosine similarity is 1; when their included angle is 90 degrees it is 0; when they point in completely opposite directions it is −1. The result depends only on the directions of the vectors, not their lengths. Cosine similarity is commonly used in positive spaces, where it gives values between 0 and 1 (in general, between −1 and 1).
The cosine value between two vectors can be found by using the euclidean dot product formula.
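The Euclidean dot product formulation mentioned above gives cosine similarity directly; a minimal python sketch (function name hypothetical):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|), ranging over [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

As the text notes, scaling a vector does not change the result: (1, 2) and (2, 4) point the same way and score 1.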
Intra-class similarity: (equation image in the original)
Inter-class similarity: (equation image in the original)
The Pearson correlation coefficient is widely used to measure the degree of correlation between two variables, with a value between −1 and 1. It was developed by Karl Pearson from a similar but slightly different earlier idea; this correlation coefficient is also referred to as the "Pearson product-moment correlation coefficient".
The pearson correlation coefficient between two variables is defined as the quotient of the covariance and the standard deviation between the two variables:
Figure BDA0003638292480000111
The formula above defines the population correlation coefficient; estimating the covariance and standard deviations from a sample gives the sample Pearson correlation coefficient:
r = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )
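The sample Pearson coefficient can be computed directly from its definition; a minimal python sketch (name hypothetical):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation: sample covariance divided by the
    product of the sample standard deviations (the shared 1/(n-1)
    factors cancel, so raw sums of deviations suffice)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Perfectly linearly related samples score 1 (or −1 when the relation is decreasing), matching the stated range.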
A partition-based clustering method is selected: given a collection of n data objects, k partitions are constructed, each representing a cluster, with k far less than n and each group containing at least one object. Most partition methods are distance-based and use an iterative relocation technique, such as the k-means and k-center point algorithms.
K-means divides the data set into K non-empty subsets: K centroids are chosen at random, data points are assigned to the cluster of the nearest centroid by distance, the mean point of each current cluster is computed as its new centroid, and all data objects are reassigned, each to the class of its nearest seed point; this repeats until the centroid positions no longer change, at which point clustering is considered to have reached the expected result and the algorithm terminates. K-means converges quickly and works well when the clusters are clearly separated; the only parameter to tune is the number of clusters K. However, it applies only to data for which a mean can be defined; K must be specified in advance or determined with other techniques; the choice of initial centroids directly determines the final clustering result; and noise data and outliers cannot be handled. For the problem that K-means is sensitive to outliers, an improved K-center point algorithm (K-Medoids) is selected.
The k-center point algorithm resembles k-means except in how the centroid is updated: k-means takes the mean, while the k-center point method selects k objects at random as representative points of the initial k clusters and repeatedly replaces representative points with non-representative points until the points minimizing the sum of squared errors are found as the cluster centers. Partitioning thus proceeds on the principle of minimizing the sum of dissimilarities between all objects and their reference points, which is the basis of the k-medoids method. Compared with k-means, k-center points are more robust to noise and outlier processing, since a center point is less easily influenced by an outlier.
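A simplified single-machine sketch of the k-medoids idea: assign each point to its nearest medoid, then make each cluster's most central member the new medoid. This is an illustrative variant with hypothetical data, not the full PAM swap procedure.

```python
import math

def k_medoids(points, medoids, max_iter=10):
    """Assign points to the nearest medoid, then replace each medoid by
    the cluster member whose total distance to the other members is
    smallest; repeat until the medoids stop moving."""
    medoids = list(medoids)
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: math.dist(p, m))
            clusters[nearest].append(p)
        new_medoids = [
            min(members, key=lambda c: sum(math.dist(c, q) for q in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids

# (50, 0) is an outlier; it does not drag the medoid away from (10, 10),
# illustrating the robustness claim above (a mean would have been pulled)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 0)]
meds = k_medoids(pts, [(0, 0), (10, 10)])
```

Because a medoid must be an actual data point, a single extreme value cannot shift it the way it shifts a mean, which is the robustness property the text attributes to k-center points.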
S5.2: hadoop-based distributed clustering specific implementation
A traditional clustering algorithm needs to gather the data at one site for processing, which means all data objects must be centralized at one site and loaded into memory at once. Because of factors such as bandwidth limits of network nodes and privacy protection of the data, centralizing the data together is almost impossible; and even where large-scale centralization is permitted, the algorithm would execute too inefficiently or crash, at a cost unacceptable to users. Facing massive or distributed data, the traditional centralized clustering algorithm falls short. The change in how data is stored places new requirements on clustering algorithms: parallel and distributed execution must be considered.
The pseudo code for realizing the K-Means clustering algorithm in the Hadoop distributed environment is as follows:
Input:
parameter 0: the text file inputFile where the sample data is stored;
parameter 1: the SequenceFile inputPath storing the sample data;
parameter 2: the SequenceFile centerPath storing the centroid data;
parameter 3: the path clusterPath where the clustering result file (a SequenceFile) is stored;
parameter 4: the number of classes k;
Output: k classes
Begin
Read inputPath, select the first k points as the initial centroids, and write the centroid data into centerPath;
While the clustering termination condition is not met:
in the Mapper stage, reading inputPath, traversing all centroids for points corresponding to keys, selecting the nearest centroid, taking the serial number of the centroid as a key, and transmitting the serial number of the point to a Reducer as a value;
in the Reducer stage, the values transmitted from the Mapper stage are merged and output according to keys, and the result is written into the clusterPath;
reading clusterPath, recalculating the centroid, and writing the result into the centerPath;
EndWhile
End
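The Mapper/Reducer loop in the pseudocode above can be simulated in one process. The sketch below mimics a single round (map: emit the nearest-centroid index with the point; reduce: average each group into a new centroid) and loops until the centroids stop changing, which stands in for the While termination condition. Data and names are hypothetical.

```python
import math
from collections import defaultdict

def kmeans_iteration(points, centroids):
    """One MapReduce-style K-means round."""
    # Mapper: key = index of the nearest centroid, value = the point
    groups = defaultdict(list)
    for p in points:
        key = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        groups[key].append(p)
    # Reducer: merge values by key and recompute each centroid as the mean
    new_centroids = list(centroids)
    for key, members in groups.items():
        dim = len(members[0])
        new_centroids[key] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(dim))
    return new_centroids

pts = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 8.0)]
c = [(0.0, 0.0), (10.0, 10.0)]
while True:
    nxt = kmeans_iteration(pts, c)
    if nxt == c:          # termination condition: centroids unchanged
        break
    c = nxt
```

On a real cluster the groups dictionary corresponds to the shuffle by key between the Mapper and Reducer stages, and each loop iteration is one MapReduce job reading inputPath and rewriting centerPath.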
A common index for judging the clustering effect is the value of the following criterion function:
E = Σ_{i=1}^{k} Σ_{p∈C_i} |p − m_i|², where C_i is the i-th cluster and m_i its centroid.
it is reasonable to think that the smaller the value, the better the clustering effect, and as the cycle continues, the criterion function value will converge to a very small value, so that the value can be used as the end condition of clustering cycle.
In the map stage, the distance between each data point and every cluster center is calculated, and the nearest center for the sample is found. The input <key, value> uses the default map format: key is the offset of the current sample relative to the start of the input data file, and value is the string formed by the values of each dimension of the current sample. The output is <key', value'>, where key' is the index of the nearest cluster and value' is the sample point.
In the reduce stage, the cluster centers are recalculated from the <key, value> pairs of the map stage and updated. The output is <key (cluster index), new cluster center>.
To reduce data communication in the map stage, a combine() is performed locally on the map <key, value> pairs to shrink the data volume: specifically, the values are added dimension by dimension according to the key, and the count is recorded.
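A sketch of such a local combine step, assuming each mapped pair carries the cluster index as the key and the sample vector as the value; the function name and sample pairs are illustrative:

```python
from collections import defaultdict

def combine(mapped_pairs):
    """Local combiner: for each key, add the value vectors dimension by
    dimension and record the count, so the Reducer receives one partial
    (sum, count) record per key instead of every individual point."""
    acc = defaultdict(lambda: [None, 0])
    for key, vec in mapped_pairs:
        if acc[key][0] is None:
            acc[key][0] = list(vec)
        else:
            acc[key][0] = [a + b for a, b in zip(acc[key][0], vec)]
        acc[key][1] += 1
    return {k: (tuple(s), n) for k, (s, n) in acc.items()}

pairs = [(0, (1.0, 2.0)), (0, (3.0, 4.0)), (1, (5.0, 5.0))]
print(combine(pairs))  # {0: ((4.0, 6.0), 2), 1: ((5.0, 5.0), 1)}
```

The Reducer can then recover each new center as (sum of partial sums) / (sum of counts), which is why only sums and counts need to cross the network.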
The invention provides research on and application of a massive distributed energy data cleaning technology based on the Hadoop big data platform, which can automatically generate cleaned data. The cleaned distributed energy data is finally obtained by performing discrete graph analysis on the distributed energy data, processing redundant data, and filling missing values and abnormal values of the attributes through cluster analysis and prediction, thereby ensuring the accuracy and high availability of the data and laying a foundation for subsequent data analysis and prediction.
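The redundant-data removal (unique primary key from time and device code) and similar-day filling steps described above can be sketched with pandas; the column names and sample records here are hypothetical, not taken from the patent:

```python
import pandas as pd

# raw records as they might arrive from the platform (json-like dicts);
# "time", "device", "power" are illustrative column names
records = [
    {"time": "2022-05-01 00:00", "device": "D1", "power": 3.2},
    {"time": "2022-05-01 00:00", "device": "D1", "power": 3.2},   # redundant row
    {"time": "2022-05-01 00:15", "device": "D1", "power": None},  # missing value
    {"time": "2022-05-02 00:15", "device": "D1", "power": 3.5},   # similar day
]
df = pd.DataFrame(records)
df["time"] = pd.to_datetime(df["time"])

# unique primary key built from time + device code: drop redundant rows
df = df.drop_duplicates(subset=["time", "device"])

def fill_similar_day(df):
    """Similar-day filling: replace a missing value with the mean of the
    same device's readings at the same time of day on other days."""
    key = [df["device"], df["time"].dt.time]
    df["power"] = df["power"].fillna(df.groupby(key)["power"].transform("mean"))
    return df

df = fill_similar_day(df)
print(df["power"].tolist())  # [3.2, 3.5, 3.5]
```

This covers the deterministic part of the cleaning; values that remain missing or abnormal after this step are what the clustering-based prediction is then applied to.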
Embodiment 2:
the embodiment 2 of the invention provides a distributed energy data cleaning system for a Hadoop big data platform, which comprises:
a data acquisition module configured to: the method comprises the steps of obtaining distributed energy data obtained based on a Hadoop big data platform, and converting the obtained distributed energy data into a first specific data type;
an anomaly data identification module configured to: drawing a discrete graph by using data of a first specific data type, finding abnormal data, and determining the type of the abnormal data;
a redundant data cleansing module configured to: converting the data of the first specific data type into a second specific data type, constructing a unique primary key through time and equipment codes, and clearing redundant data of the second specific data type;
a time series integrity determination module configured to: judge whether the time sequence of the data with the redundant data removed is complete, fill data with an incomplete time sequence through similar days, and proceed to missing value processing for data whose time sequence is complete;
a missing value filling module configured to: analyze and predict the filling of missing values by using a preset clustering algorithm;
an exception data handling module configured to: and judging the importance degree of the data attribute of the obtained abnormal data type, clustering the data with the attribute importance degree larger than a preset value, and backfilling a normal value to obtain the cleaned distributed energy data.
The working method of the system is the same as the method for cleaning the distributed energy data of the Hadoop big data platform provided in embodiment 1, and details are not repeated here.
Embodiment 3:
embodiment 3 of the present invention provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the steps in the Hadoop big data platform distributed energy data cleaning method according to embodiment 1 of the present invention.
Embodiment 4:
embodiment 4 of the present invention provides an electronic device, which includes a memory, a processor, and a program that is stored in the memory and is executable on the processor, and when the processor executes the program, the steps in the Hadoop big data platform distributed energy data cleaning method according to embodiment 1 of the present invention are implemented.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A Hadoop big data platform distributed energy data cleaning method is characterized by comprising the following steps:
the method comprises the following steps:
the method comprises the steps of obtaining distributed energy data obtained based on a Hadoop big data platform, and converting the obtained distributed energy data into a first specific data type;
drawing a discrete graph by using data of a first specific data type, finding abnormal data, and determining the type of the abnormal data;
converting the data of the first specific data type into a second specific data type, constructing a unique primary key through time and equipment codes, and clearing redundant data of the second specific data type;
judging whether the time sequence of the data with the redundant data removed is complete, filling data with an incomplete time sequence through similar days, and proceeding to missing value processing for data whose time sequence is complete;
analyzing and predicting the filling missing value by using a preset clustering algorithm;
and judging the importance degree of the data attribute of the obtained abnormal data type, clustering the data with the attribute importance degree larger than a preset value, and backfilling a normal value to obtain the cleaned distributed energy data.
2. The Hadoop big data platform distributed energy data cleaning method as claimed in claim 1, characterized in that:
analyzing and predicting the filling missing value by using a preset clustering algorithm, wherein the method comprises the following steps:
and when the data is abnormal or missing, selecting an optimal value to fill the abnormal data according to the similarity between the label of the current abnormal or missing value and each group.
3. The Hadoop big data platform distributed energy data cleaning method as claimed in claim 1 or 2, characterized in that:
the preset clustering algorithm comprises the following steps:
the input data is: the text file inputfile storing the sample data, the SequenceFile inputPath storing the sample data, the SequenceFile centerPath storing the centroid data, the path clusterPath where the clustering result file (SequenceFile) is located, and the number of classes k;
the output data is: k classes;
reading inputPath, selecting the front k points from it as the initial centroids, and writing the centroid data into centerPath;
when the clustering termination condition is not met, reading inputPath at the Mapper stage, traversing all centroids for the point corresponding to the key, selecting the nearest centroid, taking the serial number of the centroid as a key, and transmitting the serial number of the point to a Reducer as a value;
in the Reducer stage, the values transmitted from the Mapper stage are merged and output according to keys, and the result is written into the clusterPath;
and reading the clusterPath, recalculating the centroid, writing the result into the centerPath, and circulating the process until the clustering termination condition is met.
4. The Hadoop big data platform distributed energy data cleaning method as claimed in claim 1, characterized in that:
and converting the acquired distributed energy data into json type by using an ETL tool button.
5. The Hadoop big data platform distributed energy data cleaning method as claimed in claim 1, characterized in that:
the discretization graph is drawn using python referencing a third party drawing library plot.
6. The Hadoop big data platform distributed energy data cleaning method as claimed in claim 1, characterized in that:
the json type data is converted to the DataFrame type using python, and the data is cleaned using the extension library pandas.
7. The Hadoop big data platform distributed energy data cleaning method according to claim 1, characterized by:
missing values are filled and abnormal data are processed by a Hadoop big data platform through clustering analysis.
8. A distributed energy data cleaning system of a Hadoop big data platform is characterized in that:
the method comprises the following steps:
a data acquisition module configured to: the method comprises the steps of obtaining distributed energy data obtained based on a Hadoop big data platform, and converting the obtained distributed energy data into a first specific data type;
an anomaly data identification module configured to: drawing a discrete graph by using data of a first specific data type, finding abnormal data, and determining the type of the abnormal data;
a redundant data cleansing module configured to: converting the data of the first specific data type into a second specific data type, constructing a unique primary key through time and equipment codes, and clearing redundant data of the second specific data type;
a time series integrity determination module configured to: judge whether the time sequence of the data with the redundant data removed is complete, fill data with an incomplete time sequence through similar days, and proceed to missing value processing for data whose time sequence is complete;
a missing value filling module configured to: analyze and predict the filling of missing values by using a preset clustering algorithm;
an exception data handling module configured to: and judging the importance degree of the data attribute of the obtained abnormal data type, clustering the data with the attribute importance degree larger than a preset value, and backfilling a normal value to obtain the cleaned distributed energy data.
9. A computer readable storage medium having a program stored thereon, wherein the program when executed by a processor implements the steps in the Hadoop big data platform distributed energy data washing method according to any of claims 1 to 7.
10. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the steps of the Hadoop big data platform distributed energy data cleansing method according to any one of claims 1-7 when executing the program.
CN202210508315.1A 2022-05-11 2022-05-11 Hadoop big data platform distributed energy data cleaning method and system Pending CN115168326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508315.1A CN115168326A (en) 2022-05-11 2022-05-11 Hadoop big data platform distributed energy data cleaning method and system


Publications (1)

Publication Number Publication Date
CN115168326A true CN115168326A (en) 2022-10-11

Family

ID=83483091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508315.1A Pending CN115168326A (en) 2022-05-11 2022-05-11 Hadoop big data platform distributed energy data cleaning method and system

Country Status (1)

Country Link
CN (1) CN115168326A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117278343A (en) * 2023-11-24 2023-12-22 戎行技术有限公司 Data multi-level output processing method based on big data platform data
CN117278343B (en) * 2023-11-24 2024-02-02 戎行技术有限公司 Data multi-level output processing method based on big data platform data

Similar Documents

Publication Publication Date Title
US10916333B1 (en) Artificial intelligence system for enhancing data sets used for training machine learning-based classifiers
Beretta et al. Learning the structure of Bayesian Networks: A quantitative assessment of the effect of different algorithmic schemes
US10152673B2 (en) Method for pseudo-recurrent processing of data using a feedforward neural network architecture
BR112020022270A2 (en) systems and methods for unifying statistical models for different data modalities
CN108154198B (en) Knowledge base entity normalization method, system, terminal and computer readable storage medium
CN114332984B (en) Training data processing method, device and storage medium
CN113705793B (en) Decision variable determination method and device, electronic equipment and medium
KR20180137386A (en) Community detection method and community detection framework apparatus
Zhou et al. Hierarchical surrogate-assisted evolutionary optimization framework
CN109542949B (en) Formal vector-based decision information system knowledge acquisition method
CN109961129A (en) A kind of Ocean stationary targets search scheme generation method based on improvement population
CN115168326A (en) Hadoop big data platform distributed energy data cleaning method and system
Priya et al. Community Detection in Networks: A Comparative study
CN112257332B (en) Simulation model evaluation method and device
He et al. Parallel outlier detection using kd-tree based on mapreduce
CN116304213B (en) RDF graph database sub-graph matching query optimization method based on graph neural network
CN112286996A (en) Node embedding method based on network link and node attribute information
CN110209895B (en) Vector retrieval method, device and equipment
CN112163106A (en) Second-order similarity perception image Hash code extraction model establishing method and application thereof
Li et al. An alternating nonmonotone projected Barzilai–Borwein algorithm of nonnegative factorization of big matrices
Dhoot et al. Efficient Dimensionality Reduction for Big Data Using Clustering Technique
Kalatzis et al. Density estimation on smooth manifolds with normalizing flows
Zhang et al. Self-Adaptive-Means Based on a Covering Algorithm
RU2718409C1 (en) System for recovery of rock sample three-dimensional structure
CN106971011A (en) A kind of big data analysis method based on cloud platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination