CN116340437A - Multi-clustering method for large-scale multi-source heterogeneous data

Info

Publication number: CN116340437A
Application number: CN202310297924.1A
Authority: CN
Inventors: 张宏俊; 李鹏; 樊卫北; 王汝传; 徐鹤; 朱枫; 程海涛; 薛状状; 孟凡硕
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-03-24
Filing date: 2023-03-24
Publication date: 2023-06-27

Abstract

The invention discloses a multi-clustering method for large-scale multi-source heterogeneous data, which relates to the technical field of data processing and comprises the following steps: preprocessing heterogeneous data from different sources through an ETL tool and converting it into a unified target data format; classifying the data according to voltage class, equipment type, and acquisition measurement type; constructing a topology analysis engine for the classified multi-source heterogeneous data set according to the correlation between the power distribution network and its network elements; rejecting non-conforming data sets based on the topology analysis to obtain data sets to be fused; performing observation coefficient analysis on each data set to be fused and allocating a corresponding number of processing terminals to fuse it, which improves data fusion efficiency and realizes cross-composite in-depth analysis of historical data and quasi-real-time data of the power distribution network; and outputting a data fusion result for research and analysis by power distribution network staff, providing guidance for fine-grained energy management and user service, enabling timely fault early warning, and improving power safety.

Description

Multi-clustering method for large-scale multi-source heterogeneous data

Technical Field

The invention relates to the technical field of data processing, in particular to a multi-clustering method for large-scale multi-source heterogeneous data.

Background

As smart grid construction advances, the service systems of the power distribution network differ in professional focus, construction time, and architecture, and their operation generates a large amount of multi-source heterogeneous data such as measurement data, service form data, and account information data. This data is varied in structure, complex in source, non-uniform in time scale, and different in spatial scale. It is estimated that a medium-scale distribution network produces hundreds of TB of data each year. Because the data is isolated within the respective service systems, it cannot be fused effectively and its value cannot be fully mined and exploited; and when the data of power equipment becomes abnormal, the related information cannot be pushed to the relevant staff promptly and accurately, so that fault early warning cannot be achieved.

The prior art has studied these problems to some extent. For example, patent application CN109241169A discloses a method for integrating a multi-source heterogeneous data fusion database of operation information of a power distribution network, which accesses different service subsystems of the power distribution network as required to obtain a target data set, selects the data sets in the target data set that meet certain conditions based on a topology analysis engine, constructs a data fusion model based on a regularized residual search method, eliminates bad data from the target data set after topology analysis processing, and then performs fusion. Patent application CN114238464A discloses a heterogeneous fusion method for multi-element energy data, which fuses heterogeneous data from different sources after preprocessing. However, existing multi-source heterogeneous data cluster analysis systems cannot intelligently allocate the number of terminals used for cluster analysis of the production and operation data of the power distribution network, so resource utilization and data analysis efficiency are both low.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a multi-clustering method for large-scale multi-source heterogeneous data, which analyzes the observation coefficient of a data set to be fused and intelligently allocates the number of processing terminals according to that observation coefficient, so as to improve data processing efficiency.

The invention discloses a multi-clustering method for large-scale multi-source heterogeneous data, which comprises the following steps:

step one: in a selected time period, continuously accessing different service subsystems of the power distribution network according to requirements to acquire a target data set on line so as to form a multi-source heterogeneous data set;

step two: preprocessing heterogeneous data of different sources through an ETL tool to convert multiple formats of original data into a unified target data format; preprocessing comprises data screening and data restoration;

step three: classifying the preprocessed multi-source heterogeneous data set according to voltage class, equipment type and acquisition measurement type; constructing a topology analysis engine for the classified multi-source heterogeneous data set according to the correlation between the power distribution network and the network element;

step four: based on analysis of a topology analysis engine, selecting a data set which satisfies KCL law and has consistent voltage, current and power in the multi-source heterogeneous data set at the same time section, and removing the unsatisfied data set to obtain a data set to be fused; wherein the data set to be fused carries a time section;

step five: performing observation coefficient GF analysis on the data set to be fused, and allocating a corresponding number of processing terminals according to the observation coefficient GF to fuse the data set to be fused, wherein the fusion is based on an HFCM clustering algorithm;

step six: outputting a data fusion result for the research and analysis of power distribution network staff, and providing guidance for energy fine management and user service; wherein the data fusion result carries a time section.

Further, the observation coefficient GF analysis is carried out on the data set to be fused, and the specific analysis steps are as follows:

acquiring the time section corresponding to the data set to be fused, and retrieving the research attraction value YG corresponding to that time section;

counting the data size of the data set to be fused as D1; acquiring the power distribution network corresponding to the data set to be fused, and retrieving the scale value GM and the fault coefficient GZ of that power distribution network;

the observation coefficient GF of the data set to be fused is calculated by using the formula GF=YG×g1+D1×g2+GM×g3+GZ×g4; wherein g1, g2, g3 and g4 are coefficient factors.

Further, allocating a corresponding number of processing terminals according to the observation coefficient GF to fuse the data set to be fused specifically comprises the following steps:

a comparison relation table of the observation coefficient range and the distribution quantity threshold value is stored in the database; firstly, determining an observation coefficient range corresponding to an observation coefficient GF, and then determining an allocation quantity threshold corresponding to the observation coefficient range and marking the allocation quantity threshold as L1, namely allocating L1 processing terminals to fuse the data sets to be fused.
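The GF calculation and the range-table lookup can be sketched as follows. This is a minimal illustration only: the coefficient factors g1 to g4, the range boundaries, and the terminal counts are hypothetical placeholders (the text leaves them to be set by the practitioner), and the inputs are assumed to be already dimensionless.

```python
from bisect import bisect_right

def observation_coefficient(yg, d1, gm, gz, g=(0.5, 0.25, 0.125, 0.125)):
    """GF = YG*g1 + D1*g2 + GM*g3 + GZ*g4; the default g values are placeholders."""
    g1, g2, g3, g4 = g
    return yg * g1 + d1 * g2 + gm * g3 + gz * g4

# Hypothetical comparison table: lower bound of each GF range -> threshold L1.
RANGE_LOWER_BOUNDS = [0.0, 10.0, 50.0, 100.0]
TERMINAL_COUNTS = [1, 2, 4, 8]

def allocate_terminals(gf):
    """Find the range containing GF and return its allocation threshold L1."""
    idx = bisect_right(RANGE_LOWER_BOUNDS, gf) - 1
    return TERMINAL_COUNTS[max(idx, 0)]  # values below the first bound use the first range

gf = observation_coefficient(yg=20.0, d1=100.0, gm=50.0, gz=10.0)
l1 = allocate_terminals(gf)
print(gf, l1)
```

Keeping the table as two parallel sorted lists makes the lookup a binary search; in a deployment the table would be read from the database mentioned above rather than hard-coded.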

Further, the method further comprises the following steps: the data fusion result is subjected to access monitoring, and the research attraction value YG analysis is carried out according to the access record, wherein the specific analysis steps are as follows:

acquiring an access record of a data fusion result within a preset time, wherein the access record comprises an access starting time and an access ending time; acquiring a time section corresponding to a data fusion result;

for each time section, counting the number of accesses to that time section as C1; accumulating the access duration of each access to obtain the total access duration ZT; the research attraction value YG of the time section is calculated by using the formula YG=C1×a1+ZT×a2, wherein a1 and a2 are coefficient factors.
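A minimal sketch of the YG calculation from access records follows. The record layout (time section label, start time, end time), the use of minutes as the unit of ZT, and the values of a1 and a2 are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

def research_attraction(records, a1=1.0, a2=0.5):
    """Return {time_section: YG} with YG = C1*a1 + ZT*a2.

    records: iterable of (time_section, access_start, access_end).
    ZT is accumulated in minutes (an assumption; the text fixes no unit).
    """
    counts = defaultdict(int)          # C1 per time section
    total_minutes = defaultdict(float)  # ZT per time section
    for section, start, end in records:
        counts[section] += 1
        total_minutes[section] += (end - start).total_seconds() / 60.0
    return {s: counts[s] * a1 + total_minutes[s] * a2 for s in counts}

records = [
    ("peak", datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 9, 30)),
    ("peak", datetime(2023, 3, 2, 14, 0), datetime(2023, 3, 2, 14, 10)),
    ("valley", datetime(2023, 3, 2, 2, 0), datetime(2023, 3, 2, 2, 4)),
]
yg = research_attraction(records)
print(yg)
```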

Further, the method further comprises the following steps: the power distribution network is subjected to scale value GM analysis, and specifically comprises the following steps:

acquiring a power supply area of a power distribution network; counting the length of a power supply line in the power supply area as DL, the number of power supplies as HL and the average power consumption as VL; the scale value GM of the power distribution network is calculated by using the formula GM=DL×a3+HL×a4+VL×a5, wherein a3, a4, a5 are coefficient factors.

Further, the method further comprises the following steps: carrying out maintenance tracking on the power distribution network, and carrying out fault coefficient GZ evaluation on the power distribution network according to maintenance information; the method comprises the following steps:

acquiring all overhaul information of the power distribution network in a preset time period; the overhaul information comprises a fault network element, overhaul duration and overhaul grades;

counting the number of overhauls of the power distribution network as G1; for each piece of overhaul information, marking the number of faulty network elements as GL, the overhaul duration as GT, and the overhaul grade as GD; the overhaul value JXi is calculated by using the formula JXi=GL×b1+GT×b2+GD×b3, wherein b1, b2, b3 are coefficient factors;

comparing the overhaul value JXi with an overhaul threshold; counting the number of times JXi exceeds the overhaul threshold as G2, and, for each JXi that exceeds the overhaul threshold, obtaining the difference between JXi and the overhaul threshold and summing these differences to obtain an excess-overhaul value CJ; the excess-overhaul coefficient CP is calculated by using the formula CP=G2×b4+CJ×b5, wherein b4 and b5 are coefficient factors; the fault coefficient GZ is then calculated by using a formula (not reproduced in this text), wherein f1 and f2 are coefficient factors.
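The intermediate quantities G1, JXi, G2, CJ, and CP can be sketched as below. The coefficient factors b1 to b5 and the threshold are hypothetical values. The final formula for GZ is not reproduced in the text, so the sketch stops at CP rather than guessing it.

```python
def overhaul_value(gl, gt, gd, b=(1.0, 0.5, 2.0)):
    """JXi = GL*b1 + GT*b2 + GD*b3; the default b values are placeholders."""
    b1, b2, b3 = b
    return gl * b1 + gt * b2 + gd * b3

def excess_coefficient(overhauls, threshold, b4=1.0, b5=0.5):
    """Return (G1, G2, CJ, CP) for a list of (GL, GT, GD) overhaul records."""
    values = [overhaul_value(gl, gt, gd) for gl, gt, gd in overhauls]
    # Differences for the overhauls whose JXi is strictly above the threshold.
    over = [jx - threshold for jx in values if jx > threshold]
    g1 = len(values)        # total number of overhauls
    g2 = len(over)          # number of overhauls above the threshold
    cj = sum(over)          # summed excess over the threshold
    cp = g2 * b4 + cj * b5  # CP = G2*b4 + CJ*b5
    return g1, g2, cj, cp

overhauls = [(2, 4.0, 1), (5, 10.0, 3), (1, 2.0, 1)]
g1, g2, cj, cp = excess_coefficient(overhauls, threshold=6.0)
print(g1, g2, cj, cp)
```

In this example the first overhaul has JXi exactly equal to the threshold and is therefore not counted, matching the "greater than" wording of the text.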

Further, in step three the voltage classes are 35kV, 20kV and 10kV; the equipment types are divided, by service subsystem, into transformers, switch cabinets, and lines; and the acquisition measurement types are divided into state quantities and analog quantities, and into real-time data and non-real-time data.
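The classification can be pictured as grouping records by a composite key. The sketch below assumes each preprocessed record is a dictionary; the field names (`voltage_kv`, `device_type`, `signal_kind`, `timeliness`) are hypothetical and not taken from the text.

```python
from collections import defaultdict

VOLTAGE_CLASSES_KV = {35, 20, 10}

def classify(records):
    """Group records by (voltage class, equipment type, signal kind, timeliness)."""
    groups = defaultdict(list)
    for r in records:
        if r["voltage_kv"] not in VOLTAGE_CLASSES_KV:
            continue  # outside the voltage classes named in the text
        key = (r["voltage_kv"], r["device_type"], r["signal_kind"], r["timeliness"])
        groups[key].append(r)
    return dict(groups)

records = [
    {"voltage_kv": 10, "device_type": "transformer", "signal_kind": "analog", "timeliness": "real-time", "value": 0.98},
    {"voltage_kv": 10, "device_type": "transformer", "signal_kind": "analog", "timeliness": "real-time", "value": 1.01},
    {"voltage_kv": 35, "device_type": "switch cabinet", "signal_kind": "state", "timeliness": "non-real-time", "value": 1},
    {"voltage_kv": 110, "device_type": "line", "signal_kind": "analog", "timeliness": "real-time", "value": 5.0},
]
groups = classify(records)
print({k: len(v) for k, v in groups.items()})
```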

The beneficial effects of the invention are as follows: the invention monitors access to the data fusion results and analyzes the research attraction value YG according to the access records; performs scale value GM analysis on the power distribution network by acquiring its power supply area and combining the power supply line length, the number of power supply lines, and the average household power consumption in that area; tracks overhauls of the power distribution network and evaluates its fault coefficient GZ according to the overhaul information; and, together with the data size of the data set to be fused, calculates the observation coefficient GF of that data set. A corresponding number of processing terminals is then allocated according to the observation coefficient GF to fuse the data set to be fused, that is, observation coefficients in different ranges correspond to different numbers of processing terminals. More processing terminals can be allocated to data sets to be fused that have a high observation coefficient GF, which maximizes resource utilization and improves data processing efficiency.

Drawings

FIG. 1 is a schematic block diagram of a multi-clustering method for large-scale multi-source heterogeneous data.

Detailed Description

The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

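As shown in FIG. 1, a multi-clustering method for large-scale multi-source heterogeneous data includes:

step one: in a selected time period, continuously accessing different service subsystems of the power distribution network according to requirements to acquire a target data set on line so as to form a multi-source heterogeneous data set;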

step two: preprocessing heterogeneous data from different sources through an ETL tool to convert the multiple formats of the original data into a unified target data format; data screening, data restoration and the like are carried out through the ETL tool so as to unify the formats of the multi-source heterogeneous data; the purpose is to organize the data preliminarily so that it can be mined conveniently and accurately;
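As an illustration of this step, the sketch below maps two differently shaped sources (a CSV export and a JSON export) onto one record layout, then applies a simple screening rule and a simple restoration rule. The source layouts, field names, and rules (drop records without a device or timestamp; restore a missing value from the device's previous reading) are assumptions; the text does not name a specific ETL tool or rule set.

```python
import csv
import io
import json

def from_csv(text):
    """Map CSV rows with columns id, ts, val onto the unified layout."""
    for row in csv.DictReader(io.StringIO(text)):
        value = float(row["val"]) if row["val"] else None
        yield {"device_id": row["id"], "timestamp": row["ts"], "value": value}

def from_json(text):
    """Map JSON objects with keys device, time, reading onto the unified layout."""
    for item in json.loads(text):
        yield {"device_id": item["device"], "timestamp": item["time"], "value": item.get("reading")}

def screen_and_restore(records):
    cleaned, last_value = [], {}
    for r in records:
        if not r["device_id"] or not r["timestamp"]:
            continue  # screening: unusable without a device and a timestamp
        if r["value"] is None:
            if r["device_id"] not in last_value:
                continue  # nothing to restore from
            r = {**r, "value": last_value[r["device_id"]]}  # restoration
        last_value[r["device_id"]] = r["value"]
        cleaned.append(r)
    return cleaned

csv_text = "id,ts,val\nT1,2023-03-01T00:00,10.5\nT1,2023-03-01T00:15,\n"
json_text = '[{"device": "S2", "time": "2023-03-01T00:00", "reading": 1}, {"device": "", "time": "2023-03-01T00:15", "reading": 0}]'
unified = screen_and_restore(list(from_csv(csv_text)) + list(from_json(json_text)))
print(unified)
```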

step three: classifying the preprocessed multi-source heterogeneous data set according to voltage class, equipment type, and acquisition measurement type; the voltage classes are 35kV, 20kV and 10kV; the equipment types are divided, by service subsystem, into transformers, switch cabinets, and lines; the acquisition measurement types are divided into state quantities and analog quantities, and into real-time data and non-real-time data;

step four: constructing a topology analysis engine for the classified multi-source heterogeneous data set according to the correlation between the power distribution network and its network elements; wherein the topology analysis engine is built according to the following principles:

1) Dependencies are established for the circuit breakers and switches in the power distribution network according to the connection relationships of the power distribution network;

2) Information on which network elements depend on one another: the relationship between switch position information and acquired measurements;

3) The dependencies of network elements on data, including temporary and fixed dependencies under different operation modes, dependencies between switch positions and acquired measurements, and dependencies on historical data;

4) The information dependencies of network elements on the protection system, and on the alarm events and SOE events in the distribution network operation information;

based on analysis of a topology analysis engine, selecting a data set which satisfies KCL law and has consistent voltage, current and power in the multi-source heterogeneous data set at the same time section, and removing the unsatisfied data set to obtain a data set to be fused; the data set to be fused carries a corresponding time section;
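The screening can be sketched as two checks per time section. The KCL check requires the signed currents at a node to sum to approximately zero. The text does not define what "consistent voltage, current and power" means numerically, so the second check below is a simplified assumption (P ≈ V·I within a relative tolerance) that ignores power factor and three-phase factors; the record layout and tolerances are likewise hypothetical.

```python
def satisfies_kcl(node_currents, tol=1e-3):
    """Signed currents at one node (inflow positive, outflow negative) must sum to ~0."""
    return abs(sum(node_currents)) <= tol

def power_consistent(voltage, current, power, rel_tol=0.05):
    """Simplified assumption: reported power is within rel_tol of voltage * current."""
    expected = voltage * current
    return abs(power - expected) <= rel_tol * abs(expected)

def select_for_fusion(sections):
    """Keep only the time sections that pass both checks; the rest are rejected."""
    kept = []
    for s in sections:
        ok_kcl = satisfies_kcl(s["node_currents"])
        ok_power = all(power_consistent(v, i, p) for v, i, p in s["branches"])
        if ok_kcl and ok_power:
            kept.append(s)
    return kept

sections = [
    {"time_section": "t0", "node_currents": [10.0, -4.0, -6.0],
     "branches": [(100.0, 4.0, 400.0), (100.0, 6.0, 610.0)]},
    {"time_section": "t1", "node_currents": [10.0, -4.0, -5.0],
     "branches": [(100.0, 4.0, 400.0)]},
]
kept = select_for_fusion(sections)
print([s["time_section"] for s in kept])
```

Each kept entry still carries its `time_section`, mirroring the requirement that the data set to be fused carries its time section.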

in this embodiment, the operation of the power distribution network in different time periods is different within 24 hours a day, and is divided into a peak period, a valley period and a stationary period; therefore, the data efficiency and the data value of the production operation data generated by the power distribution network in different periods are different;

step five: performing observation coefficient GF analysis on the data set to be fused, and allocating a corresponding number of processing terminals according to the observation coefficient GF to fuse the data set to be fused, wherein the fusion is based on an HFCM clustering algorithm and is used to mine the value of the multi-source heterogeneous data, so as to realize interconnection, exchange, and sharing of the multi-source heterogeneous data; the specific analysis steps are as follows:

acquiring the time section corresponding to the data set to be fused, and retrieving the research attraction value YG of that time section; counting the data size of the data set to be fused as D1;

acquiring the power distribution network corresponding to the data set to be fused; retrieving the scale value GM and the fault coefficient GZ of that power distribution network; the observation coefficient GF of the data set to be fused is calculated by using the formula GF=YG×g1+D1×g2+GM×g3+GZ×g4; wherein g1, g2, g3, g4 are coefficient factors;

the allocation number of the processing terminals is determined to be L1 according to the observation coefficient GF, specifically: a comparison relation table of the observation coefficient range and the distribution quantity threshold value is stored in the database; firstly, determining an observation coefficient range corresponding to an observation coefficient GF, and then determining an allocation quantity threshold corresponding to the observation coefficient range and marking the allocation quantity threshold as L1; namely, L1 processing terminals are allocated to fuse the data sets to be fused;

step six: outputting a data fusion result for the research and analysis of power distribution network staff, and providing guidance for the energy fine management and user service so as to realize the supply and demand interaction between the user and the power grid; the data fusion result carries a corresponding time section.
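The text names an HFCM clustering algorithm for the fusion in step five but does not spell it out. As a stand-in, the sketch below implements standard fuzzy c-means (FCM), not HFCM itself, to show the alternating membership and centre updates that this family of algorithms relies on. The deterministic initialisation, the fuzzifier m=2, and the sample points are illustrative assumptions.

```python
def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6):
    """Standard FCM. Returns (centers, u) where u[i][j] is the membership of point i in cluster j."""
    n, dim = len(points), len(points[0])
    # Deterministic initial centres: c points spread evenly through the input order.
    centers = [list(points[(j * (n - 1)) // max(c - 1, 1)]) for j in range(c)]
    u = []
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        u = []
        for p in points:
            dist = [max(sum((p[d] - ctr[d]) ** 2 for d in range(dim)) ** 0.5, 1e-12)
                    for ctr in centers]
            u.append([1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dist) for dj in dist])
        # Centre update: mean of the points weighted by u_ij^m.
        new_centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            new_centers.append([sum(w[i] * points[i][d] for i in range(n)) / total
                                for d in range(dim)])
        shift = max(abs(a - b) for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift < tol:
            break
    return centers, u

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fuzzy_c_means(points, c=2)
labels = [max(range(2), key=lambda j: row[j]) for row in u]
print(labels)
```

How the work is divided among the L1 allocated processing terminals is not described in the text and is not modelled here.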

According to the invention, the large amount of multi-source heterogeneous data generated by the power distribution network is classified, topologically analyzed, and fused after bad data is removed. This addresses the extraction and integration of multi-source heterogeneous operation information of the power distribution network and improves its data quality, realizes cross-composite in-depth analysis of historical data and quasi-real-time data of the power distribution network, and allows the related information to be pushed to the relevant staff promptly and accurately when data abnormality occurs in power equipment, so that fault early warning is achieved and power safety is improved;

in addition, because the operation data of the power distribution network is massive, an efficient data processing framework is needed to complete the data processing task; first, based on analysis by the topology analysis engine, the data sets in the multi-source heterogeneous data set that satisfy the KCL law and have consistent voltage, current, and power at the same time section are selected, and the data sets that do not are removed, to obtain the data sets to be fused; then, according to the observation coefficient GF of each data set to be fused, a corresponding number of processing terminals is allocated to fuse it, so that resource utilization is maximized and data processing efficiency is improved;

wherein the method further comprises: the data fusion result is accessed and monitored, research attraction analysis is carried out according to the access record, and the specific analysis steps are as follows:

counting the access times of the time section as C1 for the same time section; accumulating the access time length of each access to obtain the total access time length ZT; the study attraction value YG of the time section is calculated by using a formula YG=C1×a1+ZT×a2, wherein a1 and a2 are coefficient factors;

wherein the method further comprises: performing scale value GM analysis on the power distribution network, specifically:

acquiring a power supply area of a power distribution network; counting the length of a power supply line in the power supply area as DL, the number of power supplies as HL and the average power consumption as VL; calculating a scale value GM of the power distribution network by using a formula GM=DL×a3+HL×a4+VL×a5, wherein a3, a4 and a5 are coefficient factors;

wherein the method further comprises: carrying out maintenance tracking on the power distribution network, recording maintenance information and carrying out fault coefficient evaluation on the power distribution network according to the maintenance information when the power distribution network is monitored to be overhauled; the method comprises the following steps:

acquiring all overhaul information of the power distribution network in a preset time period; the overhaul information comprises a fault network element, overhaul duration and overhaul grades; the maintenance grade is evaluated according to the input maintenance resources after maintenance is completed by maintenance personnel; the more maintenance resources are put into, the higher the maintenance grade is;

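counting the number of overhauls of the power distribution network as G1; for each piece of overhaul information, marking the number of faulty network elements as GL, the overhaul duration as GT, and the overhaul grade as GD; the overhaul value JXi is calculated by using the formula JXi=GL×b1+GT×b2+GD×b3, wherein b1, b2, b3 are coefficient factors;

comparing the overhaul value JXi with an overhaul threshold; counting the number of times JXi exceeds the overhaul threshold as G2, and, for each JXi that exceeds the overhaul threshold, obtaining the difference between JXi and the overhaul threshold and summing these differences to obtain an excess-overhaul value CJ; the excess-overhaul coefficient CP is calculated by using the formula CP=G2×b4+CJ×b5, wherein b4 and b5 are coefficient factors; the fault coefficient GZ is then calculated by using a formula (not reproduced in this text), wherein f1 and f2 are coefficient factors.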

All of the above formulas operate on dimensionless numerical values. Each formula was obtained by collecting a large amount of data and running software simulations to find the form closest to the actual situation; the preset parameters and preset thresholds in the formulas are set by a person skilled in the art according to the actual situation or are obtained by simulation on a large amount of data.

The working principle of the invention is as follows:

in operation, the multi-clustering method for large-scale multi-source heterogeneous data continuously accesses, within a selected time period and according to requirements, the different service subsystems of the power distribution network to acquire a target data set on line, so as to form a multi-source heterogeneous data set; heterogeneous data from different sources is preprocessed through an ETL tool to convert the multiple formats of the original data into a unified target data format; the preprocessed multi-source heterogeneous data set is classified according to voltage class, equipment type, and acquisition measurement type; a topology analysis engine is constructed for the classified multi-source heterogeneous data set according to the correlation between the power distribution network and its network elements; based on analysis by the topology analysis engine, non-conforming data sets are rejected to obtain the data sets to be fused; the data sets to be fused are fused based on an HFCM clustering algorithm and a data fusion result is output, which realizes cross-composite in-depth analysis of historical data and quasi-real-time data of the power distribution network, provides guidance for fine-grained energy management and user service, and realizes supply and demand interaction between users and the power grid;

the method further includes observation coefficient analysis of the data set to be fused: acquiring the time section corresponding to the data set to be fused, and retrieving the research attraction value YG of that time section; counting the data size of the data set to be fused as D1; acquiring the power distribution network corresponding to the data set to be fused, and retrieving its scale value GM and fault coefficient GZ; calculating the observation coefficient GF of the data set to be fused by using the formula GF=YG×g1+D1×g2+GM×g3+GZ×g4; and determining the allocation number L1 of processing terminals according to the observation coefficient GF, that is, allocating L1 processing terminals to fuse the data set to be fused, which maximizes resource utilization and thereby improves data processing efficiency.

In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims

1. A multi-clustering method for large-scale multi-source heterogeneous data is characterized by comprising the following steps:

2. The multi-clustering method for large-scale multi-source heterogeneous data according to claim 1, wherein the observation coefficient GF analysis is performed on the data set to be fused, and the specific analysis steps are as follows:

3. The multi-clustering method for large-scale multi-source heterogeneous data according to claim 2, wherein the method is characterized in that the processing terminals with corresponding numbers are allocated to the data sets to be fused according to the observation coefficients GF for fusion, and specifically comprises the following steps:

4. The multi-clustering method for large-scale multi-source heterogeneous data according to claim 2, wherein the access monitoring is performed on the data fusion result, and the research attraction value YG analysis is performed according to the access record, and the specific analysis steps are as follows:

5. The multi-clustering method for large-scale multi-source heterogeneous data according to claim 2, wherein the method is characterized in that the power distribution network is subjected to scale value GM analysis, specifically:

6. The multi-clustering method for large-scale multi-source heterogeneous data according to claim 2, wherein the power distribution network is overhauled and tracked, and fault coefficients GZ of the power distribution network are evaluated according to overhauling information; the method comprises the following steps:

7. The multi-clustering method for large-scale multi-source heterogeneous data according to claim 1, wherein in step three the voltage classes are 35kV, 20kV and 10kV; the equipment types are divided, by service subsystem, into transformers, switch cabinets, and lines; and the acquisition measurement types are divided into state quantities and analog quantities, and into real-time data and non-real-time data.