CN116662326B - Multi-energy variety data cleaning and collecting method - Google Patents
Multi-energy variety data cleaning and collecting method
- Publication number: CN116662326B
- Application number: CN202310920648.XA
- Authority
- CN
- China
- Prior art keywords
- data
- energy
- fusion
- cleaning
- algorithm
- Prior art date
- Legal status (the legal status is an assumption and is not a legal conclusion)
- Active
Classifications
- G06F16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/2471 — Distributed queries
- G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/284 — Relational databases
- G06F16/285 — Clustering or classification
- G06Q30/0201 — Market modelling; Market analysis; Collecting market data
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a multi-energy variety data cleaning and collecting method, and belongs to the technical field of data cleaning. The method comprises the steps of: obtaining multi-energy variety data; performing cleaning preprocessing on abnormal data in the multi-energy variety data based on a distributed-platform data cleaning method; transforming the preprocessed data through a data transformation rule based on a mapping relation to obtain data to be processed; integrating and fusing the data to be processed with a distance-class-based POI fusion algorithm to obtain fusion data; performing data reduction on the fusion data based on a nonlinear data dimension reduction algorithm to obtain dimension-reduced data; and filtering the dimension-reduced data with a quality evaluation rule based on prefabricated evaluation indexes to obtain target data corresponding to the multi-energy variety data. The application efficiently connects the data cleaning, transformation, fusion and reduction processing methods in series, realizes the cleaning and collection of massive data from different sources, improves the efficiency of data processing at each stage, and ensures the similarity and high quality of the data.
Description
Technical Field
The application belongs to the technical field of data cleaning, and particularly relates to a multi-energy variety data cleaning and collecting method.
Background
The performance of big data analysis depends on the quality of the data: the correct use of high-quality data supports better predictions and decisions as well as more reliable analysis, and the success of big data analysis depends largely on how the data are cleaned, transformed and integrated. As society develops, the data volume of each industry grows roughly exponentially and the sources of the data become more and more complex; in practical applications the data often contain incomplete, incorrect or irrelevant records, and such inaccuracy arises for a variety of reasons. Quality is a particular concern when integrating large-scale data from different sources, such as multi-energy variety data; these data typically come from homogeneous or heterogeneous databases, file systems and service interfaces, which reduces their reliability. Data cleaning and collection is an important means of improving data quality and data query results, and is generally used to ensure the accuracy and value of big data analysis and evaluation results, so it is receiving more and more attention in the big data research field.
At present, data processing methods for massive data from different sources, such as data cleaning, data transformation, data fusion and data reduction, are mostly developed and applied independently, and each of them mostly adopts complex and cumbersome processing algorithms and rules, so the quality of data processing is not ideal and the mining of the data is hindered.
Therefore, how to efficiently connect the data cleaning, transformation, fusion and reduction processing methods in series, so as to realize the cleaning and collection of massive data from different sources, improve the efficiency of data processing at each stage and ensure the similarity and high quality of the data, is particularly important to those skilled in the art.
Disclosure of Invention
In order to solve the technical problems, the application provides a multi-energy variety data cleaning and collecting method which can clean and collect massive data with different sources, improve the high efficiency of data processing at each stage and ensure the similarity and high quality of the data.
In a first aspect, the present application provides a method for cleaning and collecting multi-energy variety data, comprising:
acquiring multi-energy variety data; the multi-energy variety data comprise total coal consumption, total petroleum consumption, total natural gas consumption and total clean energy consumption;
Cleaning pretreatment is carried out on abnormal data in the multi-energy variety data based on a distributed platform data cleaning method; wherein the abnormal data includes missing data, erroneous data, and duplicate data;
transforming the multi-energy variety data after cleaning pretreatment through a data transformation rule based on a mapping relation to obtain data to be processed;
performing integrated fusion on the data to be processed by using a POI fusion algorithm based on the distance class to obtain fusion data;
performing data reduction processing on the fusion data based on a nonlinear data dimension reduction algorithm to obtain dimension reduction data;
and filtering the reduced dimension data by adopting a quality evaluation rule based on a prefabricated evaluation index to obtain target data corresponding to the multi-energy variety data.
Preferably, the step of cleaning and preprocessing the abnormal data in the multi-energy variety data based on the distributed platform data cleaning method specifically includes:
matching and filling the missing data in the multi-energy variety data by combining a distributed platform and a clustering filling algorithm;
correcting and repairing the error data in the multi-energy variety data by combining a distributed platform and association rules;
And integrating the repeated data in the multi-energy variety data by combining a distributed platform and a clustering partition algorithm.
Preferably, the step of performing matching filling processing on missing data in the multi-energy variety data by combining a distributed platform and a clustering filling algorithm specifically includes:
loading a Hive warehouse with multi-energy variety data and a multi-energy ontology knowledge base;
executing a Map function on the Hive warehouse to identify the missing data contained therein;
matching the missing data with the multi-energy ontology knowledge base, and judging whether rules in the missing data and the multi-energy ontology knowledge base have corresponding association relations or not;
if yes, directly filling;
if not, searching similar complete data filling by taking the missing data as a clustering core.
Preferably, the step of transforming the multi-energy variety data after the cleaning pretreatment by a data transformation rule based on a mapping relationship to obtain data to be processed specifically includes:
constructing conversion rules of the two data models before and after conversion based on the XML data template;
confirming the source data adaptation of the XML data template and the data model before transformation; wherein the source data is the multi-energy variety data after cleaning pretreatment;
And carrying out data extraction and transformation processing on the source data through an XML data template to obtain data to be processed.
Preferably, the XML data template specifies the starting rows and ending columns of the file to be read, as well as the row names and column names of the data when the file is imported into the database.
Preferably, the step of integrating and fusing the to-be-processed data by the POI fusion algorithm based on the distance category to obtain fused data specifically includes:
carrying out position clustering on the data to be processed by adopting a nearest neighbor algorithm to obtain a primary fusion set;
calculating the name similarity between the fusion objects in the primary fusion set by adopting a Jaro-Winkler algorithm, and collecting the fusion objects meeting the name similarity and the category investigation requirements of the fusion objects into a single set;
calculating the name similarity of the objects in the single set by adopting a Jaro-Winkler algorithm, calculating the position similarity based on the spherical distance, and collecting the objects whose distance is below a distance threshold and the objects whose name similarity is above a similarity threshold and whose categories are consistent into a fusion set;
and merging the fusion set and the single set to obtain fusion data.
Preferably, the name-similarity and category investigation requirements of the fusion objects are specifically: investigating fusion objects whose categories are consistent, fusion objects whose categories are inconsistent and below a first threshold, and fusion objects whose categories are inconsistent and below a second threshold, wherein the first threshold is less than the second threshold.
Preferably, the step of performing data reduction processing on the fusion data based on the nonlinear data dimension reduction algorithm to obtain dimension reduction data specifically includes:
transforming the fusion data from a low-dimensional subspace to a high-dimensional feature space based on a nonlinear mapping function to obtain mapping data; wherein the nonlinear mapping function is a gaussian kernel function;
projecting the mapping data along the direction of the corresponding feature vector to obtain a nonlinear principal component vector;
constructing a covariance matrix based on the nonlinear principal component vector in a high-dimensional feature space;
solving eigenvalues and eigenvectors of the covariance matrix;
obtaining new feature vectors through Schmidt orthogonalization and unitization based on the eigenvalues and eigenvectors;
and extracting a preset number of new feature vectors from the new feature vectors through accumulated contribution rates to obtain the dimensionality reduction data.
Preferably, the pre-fabricated evaluation index includes a trusted index and an available index.
In a second aspect, the present application provides a multi-energy variety data cleaning and collecting system, comprising:
the acquisition module is used for acquiring the multi-energy variety data; the multi-energy variety data comprise total coal consumption, total petroleum consumption, total natural gas consumption and total clean energy consumption;
the cleaning module is used for cleaning and preprocessing abnormal data in the multi-energy variety data based on a distributed platform data cleaning method; wherein the abnormal data includes missing data, erroneous data, and duplicate data;
the conversion module is used for converting the multi-energy variety data after the cleaning pretreatment through a data conversion rule based on a mapping relation to obtain data to be processed;
the integration module is used for carrying out integration fusion on the data to be processed based on a POI fusion algorithm of the distance class to obtain fusion data;
the reduction module is used for carrying out data reduction processing on the fusion data based on a nonlinear data dimension reduction algorithm to obtain dimension reduction data;
and the filtering module is used for filtering the reduced-dimension data by adopting a quality evaluation rule based on a prefabricated evaluation index to obtain target data corresponding to the multi-energy variety data.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the multi-energy variety data cleaning and collecting method according to the first aspect when the processor executes the computer program.
In a fourth aspect, the present application provides a storage medium having stored thereon a computer program which when executed by a processor implements the multi-energy variety data cleansing acquisition method of the first aspect.
Compared with the prior art, the multi-energy variety data cleaning and collecting method provided by the application has the following beneficial effects:
1. by means of the advantages of efficient and distributed processing of the distributed platform, the identified missing data is subjected to association filling by using a clustering filling algorithm; correcting and repairing the error data identified by the association rule; and integrating the identified repeated data by using a clustering partition algorithm; therefore, high recognition rate of data and high efficiency of data processing are realized, and high-quality data is provided for acquisition and mining of subsequent data.
2. By defining the transformation rules in the data transformation framework and importing the data by adopting an XML data template, the data is extracted more smoothly, and the accuracy and the high efficiency of the data transformation are ensured.
3. The POI fusion algorithm based on the nearest neighbor algorithm and the Jaro-Winkler algorithm and based on the distance category is adopted to integrate and fuse the data, so that the data with different sources, formats and characteristic properties are logically or physically organically concentrated, the fusion process is simplified, and the data fusion efficiency is improved.
4. The complex nonlinear problem is converted into a linear feature-space problem by utilizing the nonlinear data dimension reduction algorithm, irrelevant features are removed, and a large amount of data is processed effectively; the approach both saves resource space on the edge server and avoids the reduced accuracy that some dimension reduction algorithms suffer on complex nonlinear systems, so the results stay stable and the similarity of the data is preserved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a multi-energy variety data cleaning and collecting method provided in embodiment 1 of the present invention;
FIG. 2 is a model structure of a data transformation framework provided in embodiment 1 of the present invention;
fig. 3 is a flowchart of a POI fusion algorithm based on distance category provided in embodiment 1 of the present invention;
FIG. 4 is a block diagram of a multi-energy variety data cleaning and collecting system corresponding to the method of embodiment 1 provided in embodiment 2 of the present invention;
fig. 5 is a schematic hardware structure of an electronic device according to embodiment 3 of the present invention.
Reference numerals illustrate:
10-an acquisition module;
20-cleaning module, 21-filling unit, 22-repairing unit, 23-integrating unit;
a 30-transformation module, a 31-construction unit, a 32-confirmation unit and a 33-transformation unit;
40-integration module, 41-clustering unit, 42-investigation unit, 43-collection unit, 44-merging unit;
a 50-reduction module, a 51-mapping unit, a 52-projection unit, a 53-construction unit, a 54-solving unit, a 55-orthogonalization unit and a 56-extraction unit;
60-a filtration module;
70-bus, 71-processor, 72-memory, 73-communication interface.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Example 1
Specifically, fig. 1 is a schematic flow chart of a multi-energy variety data cleaning and collecting method according to the present embodiment.
As shown in fig. 1, the multi-energy variety data cleaning and collecting method of the embodiment includes the following steps:
s101, acquiring multi-energy variety data.
The multi-energy variety data comprise total coal consumption, total petroleum consumption, total natural gas consumption and total clean energy consumption.
Specifically, the multi-energy variety data in this embodiment are mainly data collected on line for the energy consumption of each energy variety. The collected data differ in data scope (whole plant, production process unit, key energy-consuming equipment) and in collection frequency (15-minute real time, daily, monthly, yearly). Although the real-time, on-line collected data have no update requirement, version control can be performed with a time stamp for data correction; such data have an extremely high update frequency and an extremely large scale and require a large amount of summary calculation, so a conventional relational database cannot meet the requirements on storage scale and calculation efficiency. This embodiment may therefore use a non-relational database for storage. However, because a non-relational database imposes weak constraints on data quality, data quality management needs to be performed before the intermediate calculation process and again when the intermediate calculation results are stored. While selecting an appropriate storage mode, it is also necessary to consider whether the stored data structure can support the resource consumption caused by large-scale JOIN association and GROUP aggregation.
S102, cleaning pretreatment is carried out on abnormal data in the multi-energy variety data based on a distributed platform data cleaning method.
Wherein the abnormal data includes missing data, erroneous data, and duplicate data.
Specifically, the present embodiment uses a cluster filling algorithm to perform association filling on the identified missing data by virtue of the advantages of efficient and distributed processing of the distributed platform; correcting and repairing the identified error data by using the association rule; and integrating the identified repeated data by using a clustering partition algorithm; therefore, high recognition rate of data and high efficiency of data processing are realized, and high-quality data is provided for acquisition and mining of subsequent data.
Further, the specific steps of step S102 include:
s1021, carrying out matching filling processing on missing data in the multi-energy variety data by combining a distributed platform and a clustering filling algorithm.
Specifically, the cleaning of missing data can be summarized into three modes: the first is to take no action and keep the data as they are; the second is to delete the whole record directly; the third is to fill in the missing values. The first mode is simple, requires no action, and is suitable for attributes that have no influence on later data mining work. The second mode applies when the attribute does have some influence but the whole record has little impact on later data mining, and it is not recommended. The third mode is the core mode of missing-data cleaning. The missing data are matched and filled by combining a distributed platform with a clustering filling algorithm, which specifically includes the following steps (a hedged sketch of the filling procedure follows the step list):
Step one, loading a Hive warehouse with multi-energy variety data and a multi-energy ontology knowledge base;
step two, executing Map functions aiming at the Hive warehouse to identify missing data contained in the Hive warehouse;
step three, matching the missing data with the multi-energy ontology knowledge base, and judging whether rules in the missing data and the multi-energy ontology knowledge base have corresponding association relations or not;
and step four, if yes, directly filling. Or in other embodiments, if not, searching for similar complete data filling by taking the missing data as a clustering core.
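The following Python sketch illustrates one possible reading of this filling procedure. The rule table passed in as ontology_rules, the column names and the use of pandas/NumPy are assumptions made for illustration; in the patent the same logic runs as Map functions over a Hive warehouse and a multi-energy ontology knowledge base.

```python
# Hedged sketch: rule-based filling backed by similarity search over complete records.
# Column names, the rule format and the pandas/NumPy usage are illustrative assumptions.
import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame, ontology_rules: dict) -> pd.DataFrame:
    """ontology_rules maps a column name to a fill value taken from the
    multi-energy ontology knowledge base (assumed format)."""
    out = df.copy()
    complete = out.dropna()                      # candidate donors for cluster-style filling
    numeric_cols = out.select_dtypes(include="number").columns
    for idx, row in out[out.isna().any(axis=1)].iterrows():
        for col in row.index[row.isna()]:
            if col in ontology_rules:            # an association rule exists: fill directly
                out.at[idx, col] = ontology_rules[col]
            elif len(complete) and col in numeric_cols:
                # no rule: treat the incomplete record as a clustering core and
                # borrow the value from its most similar complete records
                known = [c for c in numeric_cols if not pd.isna(row[c]) and c != col]
                if known:
                    dist = np.linalg.norm(
                        complete[known].values - row[known].values.astype(float), axis=1)
                    donors = complete.iloc[np.argsort(dist)[:3]]
                    out.at[idx, col] = donors[col].mean()
    return out

if __name__ == "__main__":
    data = pd.DataFrame({"coal": [10.0, 12.0, None], "petroleum": [5.0, 6.0, 5.5]})
    print(fill_missing(data, {"coal": 11.0}))
```

Records whose missing attribute matches a knowledge-base rule are filled directly; the remaining gaps are filled from the nearest complete records, mirroring the rule-first, cluster-second order of the steps above.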
S1022, correcting and repairing the error data in the multi-energy variety data by combining the distributed platform and the association rule.
In particular, for multi-energy variety data coming from many different structures, directly clustering such a huge and complex data set is not ideal and requires a large space complexity, so processing it directly with a clustering method is very inefficient. This embodiment therefore combines a distributed platform with association rules to correct and repair the error data, as follows:
step one, storing multi-energy variety data by a Hive warehouse and importing the multi-energy variety data into a multi-energy ontology knowledge base;
Step two, giving weights to all the attributes according to the multi-energy ontology knowledge base;
step three, executing Map functions, and carrying out clustering partition on the data by using an SNM algorithm;
determining an area needing abnormal value detection, and detecting the area needing abnormal value detection;
and fifthly, correcting the abnormal value by using the association rule.
It should be noted that, to reduce the number of times the data are compared with the association rules and thereby greatly improve efficiency, the data are first partitioned before being compared with the rules in the ontology knowledge base, and blocks without abnormal values are removed, so that only the blocks that may contain abnormal values need to be processed. Although this adds a step, compared with checking the massive data in full to locate a small amount of erroneous data, this operation greatly improves efficiency.
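A minimal sketch of this "partition first, then correct only suspicious blocks" idea is given below. The block key, the valid-range rule and the correction callback are illustrative assumptions; the patent itself partitions with an SNM algorithm inside Map functions and draws its rules from the multi-energy ontology knowledge base.

```python
# Hedged sketch: partition the records, skip blocks without abnormal values,
# and correct the abnormal values in the remaining blocks with a rule callback.
from typing import Callable, Dict, List, Tuple

def correct_errors(records: List[dict], block_key: str, field: str,
                   valid_range: Tuple[float, float],
                   correct: Callable[[dict], float]) -> List[dict]:
    lo, hi = valid_range
    # 1) partition the data on a sort key (simplified stand-in for SNM partitioning)
    blocks: Dict[str, List[dict]] = {}
    for r in sorted(records, key=lambda r: r[block_key]):
        blocks.setdefault(r[block_key], []).append(r)
    # 2) keep only blocks that may contain abnormal values, 3) correct those values by rule
    for block in blocks.values():
        if all(lo <= r[field] <= hi for r in block):
            continue                      # block has no abnormal value: skip it entirely
        for r in block:
            if not (lo <= r[field] <= hi):
                r[field] = correct(r)     # repair with the association rule
    return [r for block in blocks.values() for r in block]

# usage: coal consumption outside an assumed plausible range is replaced by a rule-derived value
fixed = correct_errors(
    [{"plant": "A", "coal": 10.2}, {"plant": "A", "coal": 990.0}, {"plant": "B", "coal": 11.0}],
    block_key="plant", field="coal", valid_range=(0.0, 100.0), correct=lambda r: 10.2)
print(fixed)
```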
S1023, combining a distributed platform and a clustering partition algorithm to integrate and process repeated data in the multi-energy variety data.
Specifically, by means of the advantages of efficient and distributed processing of the distributed platform, the characteristic that the number of times of calculation of similarity among records can be effectively reduced by utilizing the clustering partition algorithm is utilized, and the work of similar repeated data among massive multi-energy variety data can be completed with high efficiency and high recognition rate. The step of integrating the repeated data by combining the distributed platform and the clustering partition algorithm specifically comprises the following steps:
Step one, storing multi-energy variety data by a Hive warehouse and importing the multi-energy variety data into a multi-energy ontology knowledge base;
step two, giving weights to all the attributes according to the multi-energy ontology knowledge base;
step three, executing Map functions, and carrying out clustering partition on the data by using an SNM algorithm;
multiplying the similarity between the attributes by a weight coefficient to obtain the similarity between every two records;
and fifthly, calculating the similarity among the records in each partition, and integrating the records with high similarity.
It should be noted that the similarity between numerical data is measured by the difference of their values. For example, let the data set contain n records, S = {S1, S2, …, Sn}, each record having m attributes, and let different weights be given to the attributes according to their importance, Q = {q1, q2, …, qm}. If the data type of the k-th attribute is numerical, the similarity between the k-th attribute value s_ik of record S_i and the k-th attribute value s_jk of record S_j is calculated from the difference between the two values.
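The following sketch shows how such a weighted record similarity can drive duplicate integration within one partition. Because the exact per-attribute similarity formula is not reproduced above, a normalized value-difference similarity and a 0.95 merge threshold are used here as assumptions.

```python
# Hedged sketch of weighted record similarity for duplicate integration inside one partition.
from typing import Dict, List

def attribute_similarity(a: float, b: float, value_range: float) -> float:
    # assumed per-attribute similarity: 1 minus the normalized value difference
    if value_range == 0:
        return 1.0 if a == b else 0.0
    return max(0.0, 1.0 - abs(a - b) / value_range)

def record_similarity(r1: Dict[str, float], r2: Dict[str, float],
                      weights: Dict[str, float], ranges: Dict[str, float]) -> float:
    # similarity between two records = weighted sum of per-attribute similarities
    total_w = sum(weights.values())
    return sum(weights[k] * attribute_similarity(r1[k], r2[k], ranges[k]) for k in weights) / total_w

def integrate_duplicates(partition: List[Dict[str, float]], weights, ranges,
                         threshold: float = 0.95) -> List[Dict[str, float]]:
    kept: List[Dict[str, float]] = []
    for rec in partition:                 # comparisons stay within one partition, keeping them cheap
        if any(record_similarity(rec, k, weights, ranges) >= threshold for k in kept):
            continue                      # highly similar to an already kept record: integrate (drop)
        kept.append(rec)
    return kept

# usage with assumed attribute weights and value ranges
part = [{"coal": 10.0, "gas": 3.0}, {"coal": 10.01, "gas": 3.0}, {"coal": 55.0, "gas": 9.0}]
print(integrate_duplicates(part, weights={"coal": 0.6, "gas": 0.4}, ranges={"coal": 100.0, "gas": 20.0}))
```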
s103, the multi-energy variety data after cleaning pretreatment is transformed through a data transformation rule based on a mapping relation to obtain data to be processed.
Specifically, by defining the transformation rules in the data transformation framework and importing the data by adopting an XML data template, the data is extracted more smoothly, and the accuracy and the high efficiency of the data transformation are ensured. In this embodiment, the model structure of the data transformation framework is shown in fig. 2, and the model structure is divided into three layers, namely a rule recognition layer, a model recognition layer and a data conversion layer. The transformation rules are defined as follows:
OutputData = f(InputData, InputTransModel, OutputTransModel, rule), where OutputData represents the data converted by the data transformation framework, InputData is the initial data, i.e. the source data, InputTransModel represents the source data conversion model, OutputTransModel represents the target data conversion model, rule represents the conversion rule, and f represents the target mapping of the data transformation in the data transformation framework.
Further, the specific steps of step S103 include:
s1031, constructing conversion rules of the data models before and after conversion based on the XML data template.
The XML data template specifies the starting rows and ending columns of the file to be read, as well as the row names and column names of the data when the file is imported into the database.
S1032, confirming the source data adaptation of the XML data template and the pre-transformation data model; the source data are the multi-energy variety data after cleaning pretreatment.
S1033, extracting and transforming the source data through an XML data template to obtain data to be processed.
In connection with S1031 to S1033, when data are stored from a file, the unstructured data and the structured data in the checked file are first stored separately. At the same time, the data in the file are scanned to determine whether row names exist; if not, a new table is created according to the row names and column names specified in the XML template to store the data; if so, the corresponding connection is obtained according to the row name, and the newly imported data are stored into HBase. Once a successfully created table exists in HBase, subsequently extracted heterogeneous data are quickly loaded according to the data mapping rules that have already been created, so that new data are extracted efficiently.
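A minimal sketch of this template-driven extraction is shown below. The XML tag and attribute names, and the in-memory dictionary standing in for the HBase table, are assumptions chosen for illustration.

```python
# Hedged sketch: an XML template states where the data start, which columns to keep
# and what the target column names are; a dict stands in for the HBase table.
import csv
import io
import xml.etree.ElementTree as ET

TEMPLATE = """<template table="energy_daily" startRow="2" endCol="3">
  <column index="1" name="region"/>
  <column index="2" name="coal_total"/>
  <column index="3" name="gas_total"/>
</template>"""

def extract_with_template(template_xml: str, raw_text: str) -> dict:
    tpl = ET.fromstring(template_xml)
    start_row = int(tpl.get("startRow"))
    end_col = int(tpl.get("endCol"))
    names = {int(c.get("index")): c.get("name") for c in tpl.findall("column")}
    table = {"name": tpl.get("table"), "rows": []}       # stand-in for the HBase table
    for i, row in enumerate(csv.reader(io.StringIO(raw_text)), start=1):
        if i < start_row:
            continue                                     # skip header / preamble rows
        cells = row[:end_col]
        table["rows"].append({names[j + 1]: cells[j] for j in range(len(cells))})
    return table

raw = "source file header\nNorth,120.5,30.2\nSouth,98.1,25.7\n"
print(extract_with_template(TEMPLATE, raw))
```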
S104, integrating and fusing the data to be processed by using a POI fusion algorithm based on the distance category to obtain fusion data.
Specifically, the POI fusion algorithm based on the nearest neighbor algorithm and the Jaro-Winkler algorithm and based on the distance category is adopted to integrate and fuse the data, so that the data with different sources, formats and characteristic properties are logically or physically and organically concentrated, and meanwhile, the fusion process is simplified, and the data fusion efficiency is improved. In this embodiment, a flowchart of a POI fusion algorithm based on distance category is shown in fig. 3.
Further, the specific steps of step S104 include:
s1041, carrying out position clustering on the data to be processed by adopting a nearest neighbor algorithm to obtain a primary fusion set.
Specifically, the nearest neighbor algorithm, also called KNN algorithm (K-nearest neighbor), is a basic classification and regression method, which has no training phase, and classifies or predicts a new sample directly with a training set.
S1042, calculating the name similarity between the fusion objects in the primary fusion set by adopting a Jaro-Winkler algorithm, and collecting the fusion objects meeting the name similarity and the category investigation requirements of the fusion objects into a single set.
The name-similarity and category investigation requirements of the fusion objects are specifically: investigating fusion objects whose categories are consistent, fusion objects whose categories are inconsistent and below a first threshold, and fusion objects whose categories are inconsistent and below a second threshold, wherein the first threshold is less than the second threshold.
Specifically, the Jaro-Winkler algorithm is an algorithm for calculating the similarity between two character strings and is used in the field of record linkage / duplicate detection; the higher its final score, the greater the similarity. The final score is calculated as:

d_j = (1/3) · ( m/|s1| + m/|s2| + (m − t)/m )

wherein: d_j represents the final score, s1 and s2 represent the two character strings to be compared (with |s1| and |s2| their lengths), m represents the number of matched characters, and t represents the number of transpositions.
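For reference, a plain implementation of the Jaro score d_j defined above is sketched below; the Winkler common-prefix bonus used by the full Jaro-Winkler variant is omitted for brevity.

```python
# Hedged sketch: a direct implementation of the Jaro score d_j defined above.
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):                      # count matching characters m
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0
    for i in range(len(s1)):                         # count transpositions t
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(jaro("coal plant no.1", "coal plant no.2"))    # high name similarity, about 0.96
```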
S1043, calculating the name similarity of the objects in the single set by adopting a Jaro-Winkler algorithm, calculating the position similarity based on the spherical distance, and collecting the objects with the distance lower than a distance threshold and the objects with the name similarity higher than a similarity threshold and the same category into a fusion set.
Specifically, according to the principles of spherical trigonometry, the spherical distance between two points is calculated as follows:

C = sin(Lat1)·sin(Lat2) + cos(Lat1)·cos(Lat2)·cos(Lon1 − Lon2)

D(P1, P2) = R·arccos(C)

wherein: P1 and P2 represent two points of the spatial data, P1 having coordinates (Lon1, Lat1) and P2 having coordinates (Lon2, Lat2), and R is the spherical radius.
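A corresponding sketch of the spherical-distance calculation is given below; the coordinates are taken in degrees and the Earth's mean radius of 6371 km is an assumed value for R.

```python
# Hedged sketch of the spherical (great-circle) distance above.
import math

def spherical_distance(lon1: float, lat1: float, lon2: float, lat2: float,
                       radius_km: float = 6371.0) -> float:
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    c = (math.sin(lat1) * math.sin(lat2)
         + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    return radius_km * math.acos(max(-1.0, min(1.0, c)))   # clamp guards rounding errors

print(spherical_distance(116.40, 39.90, 121.47, 31.23))     # Beijing - Shanghai, roughly 1067 km
```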
S1044, merging the fusion set and the single set to obtain fusion data.
S105, performing data reduction processing on the fusion data based on a nonlinear data dimension reduction algorithm to obtain dimension reduction data.
Specifically, a nonlinear data dimension reduction algorithm is utilized to convert a complex nonlinear problem into a linear feature-space problem, remove irrelevant features and effectively process a large amount of data; the approach both saves resource space on the edge server and avoids the reduced accuracy that some dimension reduction algorithms suffer on complex nonlinear systems, so the results stay stable and the similarity of the data is preserved.
Further, the specific steps of step S105 include:
s1051, transforming the fusion data from the low-dimensional subspace to the high-dimensional feature space based on the nonlinear mapping function to obtain mapping data.
The nonlinear mapping function is a Gaussian kernel function, which is more flexible than other kernel functions. With a Gaussian kernel, by selecting an appropriate σ value, Gaussian-kernel principal component analysis has an appropriate capture range, which strengthens the link between data points that are close to each other in the original feature space.
Specifically, kernel principal component analysis with a Gaussian kernel is a nonlinear extension of the PCA algorithm. PCA is a linear algorithm and does not handle nonlinear industrial data well, whereas the kernel-based principal component analysis method can mine the nonlinear characteristics hidden in industrial data sets.
And S1052, projecting the mapping data along the direction of the corresponding feature vector to obtain a nonlinear principal component vector.
In particular, since the mapping data concerned has a high-dimensional characteristic, in order to acquire the nonlinear principal component vector corresponding thereto, it is necessary to project the mapping data concerned in a certain dimension direction, which is preferably the direction of the eigenvector corresponding to the mapping data.
S1053, constructing a covariance matrix based on the nonlinear principal component vector in a high-dimensional feature space.
Specifically, the covariance matrix in the high-dimensional feature space is:

C = (1/n) · Σ_{i=1..n} Φ(x_i)·Φ(x_i)^T

wherein: C represents the covariance matrix, n indicates the number of fused data samples, and Φ(x_i) represents the kernel mapping of x_i.
S1054, solving eigenvalues and eigenvectors of the covariance matrix.
Specifically, the eigenvalues and eigenvectors are obtained by solving:

C·v = λ·v, which is equivalent to the kernel eigenproblem n·λ·α = K·α

wherein: α = (α_1, …, α_i, …, α_n)^T denotes an eigenvector of the kernel matrix, λ represents an eigenvalue, λ_1, λ_2, λ_3, …, λ_n represent the eigenvalues corresponding to the eigenvectors of the kernel matrix with λ_1 ≥ λ_2 ≥ λ_3 ≥ … ≥ λ_n, and K represents the kernel (coefficient) matrix.
S1055, obtaining a new feature vector through Schmidt orthogonalization and unitization based on the feature value and the feature vector.
S1056, extracting a preset number of new feature vectors from the new feature vectors through the accumulated contribution rate to obtain the dimensionality reduction data.
In combination with the above steps, existing linear principal component analysis can only extract two visible clusters, whereas the kernel function acts as a conversion medium between the linear and the nonlinear, so kernel principal component analysis can be visualized better in a two-dimensional subspace; besides reducing the dimensionality of the data set through feature extraction, it can also increase sample density and remove noise.
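The following sketch walks through steps S1051 to S1056 with a Gaussian kernel. The kernel width σ and the 90% cumulative-contribution threshold are assumptions; NumPy's eigh already returns orthonormal eigenvectors, so the Schmidt orthogonalization of S1055 is implicit in the decomposition.

```python
# Hedged sketch of the Gaussian-kernel PCA dimension reduction described in S1051-S1056.
import numpy as np

def gaussian_kernel_pca(X: np.ndarray, sigma: float = 1.0, contribution: float = 0.90) -> np.ndarray:
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))  # Gaussian kernel matrix
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones       # centre the data in feature space
    eigval, eigvec = np.linalg.eigh(Kc)                  # solve the kernel eigenproblem
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    eigval = np.clip(eigval, 0, None)
    cum = np.cumsum(eigval) / np.sum(eigval)
    d = int(np.searchsorted(cum, contribution) + 1)      # keep components by cumulative contribution
    alphas = eigvec[:, :d] / np.sqrt(eigval[:d] + 1e-12) # unit-norm feature-space eigenvectors
    return Kc @ alphas                                   # nonlinear principal components of the samples

# usage on two assumed clusters of synthetic fusion data
X = np.vstack([np.random.default_rng(0).normal(0, 1, (20, 5)),
               np.random.default_rng(1).normal(5, 1, (20, 5))])
Z = gaussian_kernel_pca(X, sigma=2.0)
print(Z.shape)
```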
And S106, filtering the reduced-dimension data by adopting a quality evaluation rule based on a prefabricated evaluation index to obtain target data corresponding to the multi-energy variety data.
Specifically, the quality evaluation rules essentially assess the quality of the cleaned data, and the evaluation of data quality is a process of optimizing the value of the data by measuring and improving its comprehensive characteristics. The difficulty in researching data quality evaluation indexes and methods lies in the meaning, content, classification and grading of data quality and in the evaluation indexes themselves. A data quality assessment should contain at least the following basic evaluation indicators (a sketch of rule-based filtering on these indicators follows the list):
1. The data must be trustworthy to the user. Trustworthiness includes indexes such as accuracy, integrity, consistency, validity and uniqueness. Accuracy: whether the data are consistent with the characteristics of the corresponding objective entity. Integrity: whether records or fields are missing from the data. Consistency: whether values describing the same attribute of the same entity are consistent across different systems. Validity: whether the data meet user-defined conditions or fall within a certain threshold. Uniqueness: whether the data contain duplicate records.
2. The data must be available to the user. Availability includes timeliness, stability and so on. Timeliness: whether the data are current data or historical data. Stability: whether the data are stable, i.e. within their validity period.
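A minimal sketch of applying such indicators as filtering rules is given below. The concrete predicates, field names and the one-day timeliness window are illustrative assumptions rather than the patent's exact rules.

```python
# Hedged sketch: each indicator becomes a predicate; records failing any rule are filtered out.
from datetime import datetime, timedelta
from typing import Callable, Dict, List

Record = Dict[str, object]

def build_rules(now: datetime) -> Dict[str, Callable[[Record], bool]]:
    return {
        "integrity": lambda r: all(v is not None for v in r.values()),          # no missing field
        "validity": lambda r: r["coal_total"] is not None
                              and 0 <= float(r["coal_total"]) <= 1e6,            # within a threshold
        "timeliness": lambda r: now - r["timestamp"] <= timedelta(days=1),       # current, not historical
    }

def filter_by_quality(records: List[Record], rules: Dict[str, Callable[[Record], bool]]) -> List[Record]:
    return [r for r in records if all(rule(r) for rule in rules.values())]

now = datetime(2023, 7, 1)
records = [
    {"coal_total": 120.5, "timestamp": now - timedelta(hours=3)},
    {"coal_total": None, "timestamp": now - timedelta(hours=2)},      # fails integrity
    {"coal_total": 98.1, "timestamp": now - timedelta(days=30)},      # fails timeliness
]
print(filter_by_quality(records, build_rules(now)))
```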
In summary, the multi-energy variety data including the total consumption of coal, the total consumption of petroleum, the total consumption of natural gas and the total consumption of clean energy are obtained on line. By means of the advantages of efficient and distributed processing of the distributed platform, the missing data, the error data and the repeated data are respectively processed by using a clustering filling algorithm, an association rule and a clustering partition algorithm, and high-quality data is provided for acquisition and mining of subsequent data. The data is imported by adopting an XML data template through defining the transformation rules in the data transformation framework, so that the accuracy and the high efficiency of data transformation are ensured. And adopting a POI fusion algorithm based on a distance category and a Jaro-Winkler algorithm to carry out high-efficiency integration fusion on the data. The nonlinear data dimension reduction algorithm is utilized to convert complex nonlinear problems into linear characteristic space problems, irrelevant characteristics are removed, a large amount of data is effectively processed, the stability of results can be kept, and the similarity of the data can be kept.
Example 2
This embodiment provides a block diagram of a system corresponding to the method described in embodiment 1. Fig. 4 is a block diagram of a multi-energy variety data cleaning and collecting system according to the present embodiment, and as shown in fig. 4, the system includes:
an acquisition module 10 for acquiring multi-energy variety data; the multi-energy variety data comprise total coal consumption, total petroleum consumption, total natural gas consumption and total clean energy consumption;
the cleaning module 20 is used for cleaning and preprocessing abnormal data in the multi-energy variety data based on a distributed platform data cleaning method; wherein the abnormal data includes missing data, erroneous data, and duplicate data;
the conversion module 30 is configured to convert the multi-energy variety data after the cleaning pretreatment by a data conversion rule based on a mapping relationship to obtain data to be processed;
the integration module 40 is configured to perform integrated fusion on the data to be processed according to a POI fusion algorithm based on a distance class to obtain fusion data;
the reduction module 50 is configured to perform data reduction processing on the fusion data based on a nonlinear data dimension reduction algorithm to obtain dimension reduction data;
the filtering module 60 is configured to filter the reduced-dimension data by using a quality evaluation rule based on a prefabricated evaluation index to obtain target data corresponding to the multi-energy variety data; wherein the prefabricated evaluation index comprises a trusted index and an available index.
Further, the cleaning module 20 specifically includes:
and the filling unit 21 is used for carrying out matching filling processing on missing data in the multi-energy variety data by combining a distributed platform and a clustering filling algorithm. Wherein, this filling unit is used for:
loading a Hive warehouse with multi-energy variety data and a multi-energy ontology knowledge base;
executing a Map function on the Hive warehouse to identify the missing data contained therein;
matching the missing data with the multi-energy ontology knowledge base, and judging whether rules in the missing data and the multi-energy ontology knowledge base have corresponding association relations or not;
if yes, directly filling;
if not, searching similar complete data filling by taking the missing data as a clustering core.
And the repairing unit 22 is used for correcting and repairing the error data in the multi-energy variety data by combining the distributed platform and the association rule.
And the integration unit 23 is used for integrating the repeated data in the multi-energy variety data by combining a distributed platform and a clustering partition algorithm.
Further, the transformation module 30 specifically includes:
a construction unit 31, configured to construct the conversion rules of the two data models before and after conversion based on the XML data template; the XML data template specifies the starting rows and ending columns of the file to be read, as well as the row names and column names of the data when the file is imported into the database.
A validation unit 32 for validating the adaptation of the XML data template to the source data of the pre-transformation data model; the source data are the multi-energy variety data after cleaning pretreatment.
And the transformation unit 33 is used for carrying out data extraction transformation processing on the source data through an XML data template to obtain data to be processed.
Further, the integrated module 40 specifically includes:
and the clustering unit 41 is used for carrying out position clustering on the data to be processed by adopting a nearest neighbor algorithm to obtain a primary fusion set.
The investigation unit 42 is configured to calculate the name similarity between the fusion objects in the primary fusion set by using a Jaro-Winkler algorithm, and to collect the fusion objects meeting the name-similarity and category investigation requirements into a single set. The name-similarity and category investigation requirements of the fusion objects are specifically: investigating fusion objects whose categories are consistent, fusion objects whose categories are inconsistent and below a first threshold, and fusion objects whose categories are inconsistent and below a second threshold, wherein the first threshold is less than the second threshold.
And a collecting unit 43, configured to calculate a name similarity for the objects in the single set by using a Jaro-Winkler algorithm, and calculate a position similarity based on the spherical distance, and collect the objects with a distance below a distance threshold and the objects with a name similarity above a similarity threshold and a consistent category into a fusion set.
And a merging unit 44, configured to merge the fusion set and the single set to obtain fusion data.
Further, the reduction module 50 specifically includes:
a mapping unit 51, configured to transform the fusion data from a low-dimensional subspace to a high-dimensional feature space based on a nonlinear mapping function to obtain mapping data; wherein the nonlinear mapping function is a gaussian kernel function.
A projection unit 52, configured to project the mapping data along the direction of the corresponding feature vector to obtain a nonlinear principal component vector;
a construction unit 53 for constructing a covariance matrix based on the nonlinear principal component vector in a high-dimensional feature space;
a solving unit 54, configured to solve eigenvalues and eigenvectors of the covariance matrix;
an orthogonalization unit 55 for obtaining new feature vectors through Schmidt orthogonalization and unitization based on the eigenvalues and eigenvectors;
and an extracting unit 56, configured to extract a preset number of new feature vectors from the new feature vectors through the accumulated contribution rate to obtain a reduced-dimension data.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
Example 3
The multi-energy variety data cleaning and collecting method described in connection with fig. 1 can be implemented by an electronic device. Fig. 5 is a schematic diagram of the hardware structure of the electronic device according to the present embodiment.
The electronic device may include a processor 71 and a memory 72 storing computer program instructions.
In particular, the processor 71 may comprise a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or may be configured as one or more integrated circuits embodying the present application.
Memory 72 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 72 may comprise a Hard Disk Drive (HDD), floppy Disk Drive, solid state Drive (Solid State Drive, SSD), flash memory, optical Disk, magneto-optical Disk, tape, or universal serial bus (Universal Serial Bus, USB) Drive, or a combination of two or more of the foregoing. The memory 72 may include removable or non-removable (or fixed) media, where appropriate. The memory 72 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 72 is a Non-Volatile memory. In particular embodiments, memory 72 includes Read-Only Memory (ROM) and random access Memory (Random Access Memory, RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (Programmable Read-Only Memory, abbreviated PROM), an erasable PROM (Erasable Programmable Read-Only Memory, abbreviated EPROM), an electrically erasable PROM (Electrically Erasable Programmable Read-Only Memory, abbreviated EEPROM), an electrically rewritable ROM (Electrically Alterable Read-Only Memory, abbreviated EAROM), or a FLASH Memory (FLASH), or a combination of two or more of these. The RAM may be Static Random-Access Memory (SRAM) or dynamic Random-Access Memory (Dynamic Random Access Memory DRAM), where the DRAM may be a fast page mode dynamic Random-Access Memory (Fast Page Mode Dynamic Random Access Memory FPMDRAM), extended data output dynamic Random-Access Memory (Extended Date Out Dynamic Random Access Memory EDODRAM), synchronous dynamic Random-Access Memory (Synchronous Dynamic Random-Access Memory SDRAM), or the like, as appropriate.
Memory 72 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 71.
The processor 71 reads and executes the computer program instructions stored in the memory 72 to realize the multi-energy variety data cleansing acquisition method of embodiment 1 described above.
In some of these embodiments, the electronic device may also include a communication interface 73 and a bus 70. As shown in fig. 5, the processor 71, the memory 72, and the communication interface 73 are connected to each other through the bus 70 and perform communication with each other.
The communication interface 73 is used to enable communication between the modules, apparatuses, units and/or devices in the present application. The communication interface 73 may also enable data communication with external components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
Bus 70 includes hardware, software, or both, coupling the components of the device to one another. Bus 70 includes, but is not limited to, at least one of: a Data Bus, an Address Bus, a Control Bus, an Expansion Bus, or a Local Bus. By way of example, and not limitation, the bus 70 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Bus 70 may include one or more buses, where appropriate. Although a particular bus is described and illustrated, the present application contemplates any suitable bus or interconnect.
The electronic device may implement the multi-energy variety data cleaning and collecting system described above and execute the multi-energy variety data cleaning and collecting method of embodiment 1.
In addition, in combination with the multi-energy variety data cleaning and collecting method of embodiment 1 above, the present application may be implemented by providing a storage medium. The storage medium has computer program instructions stored thereon; when the computer program instructions are executed by a processor, the multi-energy variety data cleaning and collecting method of embodiment 1 described above is implemented.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this description.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.
Claims (7)
1. A multi-energy variety data cleaning and collecting method, characterized by comprising the following steps:
acquiring multi-energy variety data; the multi-energy variety data comprise total coal consumption, total petroleum consumption, total natural gas consumption and total clean energy consumption;
performing cleaning preprocessing on abnormal data in the multi-energy variety data based on a distributed platform data cleaning method; wherein the abnormal data includes missing data, erroneous data, and duplicate data;
transforming the multi-energy variety data after cleaning preprocessing through a data transformation rule based on a mapping relationship to obtain data to be processed;
performing integrated fusion on the data to be processed by using a POI fusion algorithm based on the distance category to obtain fusion data;
performing data reduction processing on the fusion data based on a nonlinear data dimension reduction algorithm to obtain dimension reduction data;
filtering the dimension reduction data by adopting a quality evaluation rule based on preset evaluation indexes to obtain target data corresponding to the multi-energy variety data;
the step of performing cleaning preprocessing on the abnormal data in the multi-energy variety data based on the distributed platform data cleaning method specifically comprises:
matching and filling the missing data in the multi-energy variety data by combining a distributed platform and a clustering filling algorithm;
correcting and repairing the erroneous data in the multi-energy variety data by combining a distributed platform and association rules;
integrating the duplicate data in the multi-energy variety data by combining a distributed platform and a clustering partition algorithm;
the step of matching and filling the missing data in the multi-energy variety data by combining the distributed platform and the clustering filling algorithm specifically comprises the following steps:
loading a Hive warehouse with multi-energy variety data and a multi-energy ontology knowledge base;
executing a Map function on the Hive warehouse to identify the missing data contained therein;
matching the missing data with the multi-energy ontology knowledge base, and judging whether a corresponding association relationship exists between the missing data and the rules in the multi-energy ontology knowledge base;
if yes, filling directly;
if not, taking the missing data as a clustering core and searching for similar complete data for filling;
the step of correcting and repairing the erroneous data in the multi-energy variety data by combining the distributed platform and the association rules specifically comprises:
storing the multi-energy variety data in a Hive warehouse and importing the multi-energy variety data into a multi-energy ontology knowledge base;
giving weight to each attribute according to the multi-energy ontology knowledge base;
executing Map functions, and carrying out clustering partition on the data by using an SNM algorithm;
determining the regions requiring abnormal value detection, and performing detection on those regions;
correcting the abnormal value by using the association rule;
wherein, before the rule comparison against the ontology knowledge base, the data are first partitioned to eliminate blocks containing no abnormal values;
the step of integrating the duplicate data in the multi-energy variety data by combining the distributed platform and the clustering partition algorithm specifically comprises:
storing the multi-energy variety data in a Hive warehouse and importing the multi-energy variety data into a multi-energy ontology knowledge base;
giving weight to each attribute according to the multi-energy ontology knowledge base;
executing Map functions, and carrying out clustering partition on the data by using an SNM algorithm;
multiplying the similarity between the attributes by the weight coefficient to obtain the similarity between every two records;
calculating the similarity between records in each partition, and integrating records with high similarity;
the similarity between numerical data is measured by calculating the difference between the numerical values; the data set has n records, each record has m attributes, and different weights are assigned to the attributes according to their importance, i.e., Q = {Q1, Q2, …, Qm}; if the data type of the kth attribute is numerical, the similarity between the kth attribute data S_ik of record S_i and the kth attribute data S_jk of record S_j is calculated by the corresponding similarity formula.
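For illustration only, the following Python sketch shows how the weighted record-similarity and duplicate-integration step at the end of claim 1 could be realized. The claimed numerical similarity formula itself is not reproduced in the text above, so the normalised-difference form used here, the example attribute weights, and the 0.9 merge threshold are assumptions, not the patent's own definition.

```python
import numpy as np

def numeric_similarity(a, b, value_range):
    # Assumed form: similarity falls off linearly with the absolute difference,
    # normalised by the attribute's value range (the claimed formula itself is
    # not reproduced in the source text).
    if value_range == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / value_range)

def record_similarity(rec_i, rec_j, weights, ranges):
    # Weighted attribute similarities, mirroring the claim's "multiply the
    # similarity between attributes by the weight coefficient" step.
    sims = [numeric_similarity(a, b, r) for a, b, r in zip(rec_i, rec_j, ranges)]
    return float(np.dot(weights, sims))

def integrate_duplicates(records, weights, threshold=0.9):
    # Within one cluster partition, keep a record only if it is not highly
    # similar to a record already kept (i.e. integrate near-duplicates).
    ranges = [max(col) - min(col) for col in zip(*records)]
    kept = []
    for rec in records:
        if all(record_similarity(rec, k, weights, ranges) < threshold for k in kept):
            kept.append(rec)
    return kept

# Usage with assumed weights Q = {Q1, Q2} summing to 1:
# integrate_duplicates([[120.5, 30.2], [120.6, 30.2], [98.0, 45.1]], weights=[0.6, 0.4])
```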
2. The multi-energy variety data cleaning and collecting method according to claim 1, wherein the step of transforming the multi-energy variety data after cleaning preprocessing through the data transformation rule based on the mapping relationship to obtain the data to be processed specifically comprises:
constructing conversion rules between the data models before and after transformation based on an XML data template;
confirming the adaptation between the XML data template and the source data of the pre-transformation data model; wherein the source data is the multi-energy variety data after cleaning preprocessing;
and performing data extraction and transformation processing on the source data through the XML data template to obtain the data to be processed.
3. The multi-energy variety data cleaning and collecting method according to claim 2, wherein the XML data template specifies the starting row and final column of the file to be read, and the row names and column names under which the data in the file are imported into the database.
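As an illustration of claims 2–3, the sketch below applies a mapping-based transformation rule driven by an XML data template. The template's tag and attribute names (startRow, endCol, field source/target) are hypothetical; the patent does not define a concrete template schema, so this is only a minimal sketch under those assumptions.

```python
import xml.etree.ElementTree as ET

# A hypothetical template layout: the starting row, final column, and target
# field names are illustrative assumptions, not the patent's actual schema.
TEMPLATE = """
<template startRow="2" endCol="4">
  <field source="col1" target="region"/>
  <field source="col2" target="coal_total"/>
  <field source="col3" target="gas_total"/>
  <field source="col4" target="clean_total"/>
</template>
"""

def transform_with_template(rows, template_xml=TEMPLATE):
    """Apply a mapping-based transformation rule: pick the rows/columns the
    template names and rename them to the target data model's fields."""
    root = ET.fromstring(template_xml)
    start_row = int(root.get("startRow")) - 1      # template counts rows from 1
    end_col = int(root.get("endCol"))
    targets = [f.get("target") for f in root.findall("field")]
    out = []
    for row in rows[start_row:]:
        values = row[:end_col]
        out.append(dict(zip(targets, values)))
    return out

# Usage:
# rows = [["region", "c", "g", "cl"], ["East", 10.2, 3.1, 1.7], ["West", 8.9, 2.4, 2.2]]
# print(transform_with_template(rows))
```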
4. The multi-energy variety data cleaning and collecting method according to claim 1, wherein the step of performing integrated fusion on the data to be processed by using the POI fusion algorithm based on the distance category specifically comprises the following steps:
carrying out position clustering on the data to be processed by adopting a nearest neighbor algorithm to obtain a primary fusion set;
calculating the name similarity between fusion objects in the primary fusion set by adopting a Jaro-Winkler algorithm, and collecting the fusion objects meeting the name similarity and fusion object category investigation requirements into a single set;
calculating the name similarity of the objects in the single set by adopting the Jaro-Winkler algorithm, calculating the position similarity based on the spherical distance, and collecting objects whose distance is lower than a distance threshold, whose name similarity is higher than a similarity threshold, and whose categories are consistent into a fusion set;
and merging the fusion set and the single set to obtain fusion data.
5. The multi-energy variety data cleaning and collecting method according to claim 4, wherein the name similarity and fusion object category investigation requirements are specifically: checking fusion objects whose categories are consistent, fusion objects whose categories are inconsistent and whose name similarity is less than a first threshold, and fusion objects whose categories are inconsistent and whose name similarity is less than a second threshold, wherein the first threshold is less than the second threshold.
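For claims 4–5, the sketch below combines Jaro-Winkler name similarity with a spherical (great-circle) distance check and a category test to decide whether two fusion objects belong in the same fusion set. The 50-metre distance threshold, the 0.85 name-similarity threshold, and the dictionary field names are illustrative assumptions; the claimed multi-stage set construction is not reproduced in full.

```python
import math

def jaro_winkler(s1, s2, p=0.1):
    # Jaro similarity with the Winkler common-prefix bonus (standard definition).
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transposed, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transposed += 1
            k += 1
    jaro = (matches / len1 + matches / len2 + (matches - transposed / 2) / matches) / 3
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

def spherical_distance(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    # Great-circle (haversine) distance in metres.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def should_fuse(p, q, dist_threshold=50.0, name_threshold=0.85):
    """Distance / name-similarity / category test for two POI records; the
    thresholds and the field names "name", "lat", "lon", "category" are assumed."""
    close = spherical_distance(p["lat"], p["lon"], q["lat"], q["lon"]) < dist_threshold
    similar = jaro_winkler(p["name"], q["name"]) > name_threshold
    return close and similar and p["category"] == q["category"]
```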
6. The multi-energy variety data cleaning and collecting method according to claim 1, wherein the step of performing data reduction processing on the fusion data based on a nonlinear data dimension reduction algorithm to obtain dimension reduction data specifically comprises the following steps:
transforming the fusion data from a low-dimensional subspace to a high-dimensional feature space based on a nonlinear mapping function to obtain mapping data; wherein the nonlinear mapping function is a Gaussian kernel function;
projecting the mapping data along the direction of the corresponding feature vector to obtain a nonlinear principal component vector;
constructing a covariance matrix based on the nonlinear principal component vector in a high-dimensional feature space;
solving eigenvalues and eigenvectors of the covariance matrix;
obtaining new feature vectors through Schmidt orthogonalization and unitization based on the eigenvalues and eigenvectors;
and extracting a preset number of the new feature vectors according to the cumulative contribution rate to obtain the dimension reduction data.
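For claim 6, the following is a minimal kernel-PCA-style sketch of the nonlinear dimension reduction with a Gaussian kernel. It uses the usual kernel-matrix formulation rather than an explicit high-dimensional covariance matrix, and the kernel width gamma and the 90% cumulative contribution rate are assumptions rather than values from the claim.

```python
import numpy as np

def gaussian_kernel(X, gamma=0.1):
    # Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca_reduce(X, gamma=0.1, contribution=0.90):
    """Project the fusion data onto enough nonlinear principal components to
    reach the given cumulative contribution rate (assumed to be 90%)."""
    n = X.shape[0]
    K = gaussian_kernel(X, gamma)
    # Centre the kernel matrix, equivalent to centring in the feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigen-decomposition; eigh returns orthonormal (already orthogonalised and
    # unitised) eigenvectors in ascending order of eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    eigvals = np.clip(eigvals, 0.0, None)
    # Keep the smallest number of components whose cumulative contribution rate
    # reaches the preset threshold.
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    m = int(np.searchsorted(ratio, contribution)) + 1
    # Scale so each retained feature-space axis has unit norm, then project.
    alphas = eigvecs[:, :m] / np.sqrt(np.maximum(eigvals[:m], 1e-12))
    return Kc @ alphas

# Usage: reduced = kernel_pca_reduce(np.asarray(fusion_data, dtype=float))
```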
7. The multi-energy variety data cleaning and collecting method according to claim 1, wherein the preset evaluation indexes include a credibility index and an availability index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310920648.XA CN116662326B (en) | 2023-07-26 | 2023-07-26 | Multi-energy variety data cleaning and collecting method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116662326A CN116662326A (en) | 2023-08-29 |
CN116662326B true CN116662326B (en) | 2023-10-20 |