CN111897888A

CN111897888A - Household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm

Info

Publication number: CN111897888A
Application number: CN202010824192.3A
Authority: CN
Inventors: 黄旭; 李刚; 宋树宏; 胡伟; 刘越; 郭秋婷
Original assignee: Tsinghua University; State Grid Corp of China SGCC; State Grid Liaoning Electric Power Co Ltd
Current assignee: Tsinghua University; State Grid Corp of China SGCC; State Grid Liaoning Electric Power Co Ltd
Priority date: 2020-08-17
Filing date: 2020-08-17
Publication date: 2020-11-06

Abstract

The application discloses a method for identifying household-transformer relationships based on the Spark framework and an agglomerative hierarchical clustering algorithm. The method comprises the following steps: collecting time-series voltage data of a transformer and of the users in a transformer area; preprocessing the time-series voltage data with Spark SQL to obtain processed data; reducing the dimensionality of the processed data by principal component analysis and extracting voltage time-series features; and performing cluster analysis on the voltage time-series features with an agglomerative hierarchical clustering algorithm to determine the classification of the power users and obtain a household-transformer relationship identification result. By introducing the Spark distributed computing platform and calling the Spark MLlib machine learning library, the method implements the principal component analysis dimensionality reduction and the agglomerative hierarchical clustering, completes the household-transformer relationship identification computation, and addresses the low efficiency of that computation as the data volume grows.

Description

Household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm

Technical Field

The application relates to the technical field of power system analysis, and in particular to a household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm.

Background

In a power grid, user information in a distribution network transformer area is often disordered, missing, or inaccurate, which severely restricts the construction of the smart grid: after the power company changes the wiring or modifies a line to balance the distribution load, the recorded user information no longer matches the actual situation if it is not updated in time. For ease of management, a power company manages low-voltage distribution network users by transformer area. Identifying the household-transformer relationship is the basis for lean marketing and line-loss reduction, and is also a prerequisite for electricity theft detection.

Summary of the Application

The present application is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, a first objective of the present application is to provide a method for identifying household-transformer relationships based on the Spark framework and an agglomerative hierarchical clustering algorithm. By introducing the Spark distributed computing platform and calling the Spark MLlib machine learning library, the method implements dimensionality reduction by principal component analysis and clustering by an agglomerative hierarchical clustering algorithm, completes the household-transformer relationship identification computation, and addresses the low efficiency of that computation as the data volume grows.

A second objective of the present application is to provide a household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm.

A third object of the present application is to provide an electronic device.

A fourth object of the present application is to propose a computer readable storage medium.

In order to achieve the above object, an embodiment of the first aspect of the present application provides a household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm, including the following steps:

collecting time sequence voltage data of a transformer and users in a transformer area;

preprocessing the time sequence voltage data by utilizing Spark SQL (Structured Query Language) to obtain processed data;

reducing the dimension of the processed data by adopting a principal component analysis method, and extracting voltage time sequence data characteristics; and

performing cluster analysis on the voltage time-series features based on an agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result.

In addition, the household-transformer relationship identification method based on the Spark framework and the agglomerative hierarchical clustering algorithm according to the above embodiment of the present application may further have the following additional technical features:

Optionally, collecting the time-series voltage data of the transformer and the users in the transformer area includes:

pulling the full set of time-series voltage data of the transformer and the users in the transformer area from a preset external system database through Sqoop.

Optionally, the method further comprises:

storing the time-series voltage data of the transformer and the users in the transformer area into HDFS (Hadoop Distributed File System) and associating it with a Hive table.

Optionally, the preprocessing the time-series voltage data by using Spark SQL includes:

calculating the missing values of the time-series voltage data and filling them;

extracting feature vectors from the time-series voltage data and normalizing them to obtain the processed data.

In order to achieve the above object, an embodiment of the second aspect of the present application provides a household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm, including:

the acquisition module is used for collecting time-series voltage data of the transformer and the users in the transformer area;

the processing module is used for preprocessing the time-series voltage data by utilizing Spark SQL to obtain processed data;

the extraction module is used for reducing the dimension of the processed data by adopting a principal component analysis method and extracting voltage time-series data features; and

the identification module is used for performing cluster analysis on the voltage time-series data features based on an agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result.

Optionally, the acquisition module includes:

the pulling unit is used for pulling the full set of time-series voltage data of the transformer and the users in the transformer area from a preset external system database through Sqoop.

Optionally, the device further comprises:

the storage module is used for storing the time-series voltage data of the transformer and the users in the transformer area into HDFS and associating it with a Hive table.

Optionally, the processing module includes:

the computing unit is used for calculating the missing values of the time-series voltage data and filling them;

the extraction unit is used for extracting feature vectors from the time-series voltage data and normalizing them to obtain the processed data.

To achieve the above object, an embodiment of a third aspect of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm as described in the above embodiments.

In order to achieve the above object, an embodiment of a fourth aspect of the present application provides a computer-readable storage medium storing computer instructions for causing a computer to execute the household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm as described in the above embodiments.

Therefore, time-series voltage data of the transformer and the users in the transformer area can be collected; the time-series voltage data is preprocessed with Spark SQL to obtain processed data; the dimensionality of the processed data is reduced by principal component analysis and voltage time-series features are extracted; and cluster analysis is performed on the voltage time-series features based on an agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result. As a result, only user voltage time-series data sampled at a certain rate is required, without manual inspection or additional equipment, and the household-transformer relationship can be identified remotely. In addition, the Spark distributed computing framework greatly improves computing efficiency through memory-based parallel computing, which addresses the high time complexity of the agglomerative hierarchical clustering algorithm.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart of a household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm according to an embodiment of the present application;

FIG. 2 is a flowchart of a household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm according to an embodiment of the present application;

FIG. 3 is an example diagram of a household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.

The following describes a household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm according to an embodiment of the present application with reference to the drawings.

Specifically, FIG. 1 is a schematic flowchart of a household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm provided in an embodiment of the present application.

It can be understood that household-transformer relationship identification algorithms based on machine learning theory have been widely applied in recent years, and that machine learning algorithms involve many iterative computation scenarios. Spark is a distributed computing framework that uses a state-of-the-art DAG scheduler and can greatly improve computing efficiency through memory-based parallel computing. Spark MLlib is Spark's scalable machine learning library; it includes machine learning algorithms such as classification, regression, clustering, and collaborative filtering, so a variety of machine learning algorithms can be implemented through it. As the scale of transformer-area power data keeps growing, using Spark to identify the household-transformer relationship of a transformer area preserves the accuracy of the algorithm while greatly improving computing efficiency. The acquired voltage time-series data is preprocessed and subjected to principal component analysis under the Spark framework to extract static features of the time-series data; agglomerative hierarchical clustering is then applied to the voltage time-series data so that it is automatically divided into the corresponding categories, which realizes the household-transformer relationship identification.

As shown in FIG. 1, the household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm includes the following steps:

In step S101, time-series voltage data of the transformer and the users in the transformer area is collected.

Optionally, in some embodiments, with reference to FIG. 1 and FIG. 2, collecting the time-series voltage data of the transformer and the users in the transformer area includes: pulling the full set of time-series voltage data of the transformer and the users in the transformer area from a preset external system database through Sqoop. Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database (such as MySQL, Oracle, or PostgreSQL) into Hadoop's HDFS, and can also export data from HDFS into a relational database.

Specifically, the embodiment of the present application may obtain the raw power data of the transformer area from the external system, that is, the time-series data of the users in the transformer area, including voltage, current, power factor, and the like. The preset external system database may be a Hive table.

Optionally, in some embodiments, with reference to FIG. 1 and FIG. 2, the above household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm further includes: storing the time-series voltage data of the transformer and the users in the transformer area into HDFS and associating it with a Hive table. HDFS is a highly fault-tolerant distributed file system with modest requirements on server configuration; it both ensures the fault tolerance of the data and provides high-throughput data access, which makes it well suited to large-scale data sets. The HDFS distributed storage system therefore solves the problem of storing large volumes of data.

It can be understood that, in the embodiments of the present application, a Hive table for storing the transformer-area raw data may be created by examining the structure of that raw data, that is, the time-series voltage data of the users in the transformer area. The raw data structure of the transformer area comprises: collection point number, meter number, transformer area identifier, usage type, data type, date, phase sequence, and the collected value at each time point (24 points a day). Based on this raw data structure, a Hive external table is created for storing the transformer-area raw data.
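The raw data structure listed above can be illustrated with a small Python sketch of one record. The class name, field names, field types, and the date format are all assumptions made for illustration; the description names the fields only in prose and does not specify a schema.

```python
from dataclasses import dataclass
from typing import List, Optional

READINGS_PER_DAY = 24  # the description states 24 collection points per day


@dataclass
class AreaVoltageRecord:
    """One raw record; every field name and type here is an illustrative assumption."""
    collection_point_no: str
    meter_no: str
    area_id: str
    usage_type: str
    data_type: str
    date: str  # e.g. "2020-08-17"; the format is an assumption
    phase_sequence: str
    readings: List[Optional[float]]  # None marks a missing collected value

    def __post_init__(self) -> None:
        # Reject records that do not carry exactly one day of readings.
        if len(self.readings) != READINGS_PER_DAY:
            raise ValueError(
                f"expected {READINGS_PER_DAY} readings, got {len(self.readings)}"
            )

    def missing_ratio(self) -> float:
        """Fraction of the day's readings that are missing."""
        return sum(r is None for r in self.readings) / READINGS_PER_DAY


record = AreaVoltageRecord(
    "cp-001", "m-001", "area-01", "residential", "voltage",
    "2020-08-17", "A", [220.0] * 18 + [None] * 6,
)
```

A per-record missing ratio of this kind is what a later cleaning step could compare against a threshold.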

Therefore, the full set of transformer-area data such as current and voltage can be pulled from the external system database through Sqoop and loaded into the Hive table. The command parameters include: --target-dir (HDFS storage path), --num-mappers (number of map tasks), --hive-database (name of the database containing the Hive table), and --hive-table (Hive table name).
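As a rough illustration of how these parameters fit together, the following Python sketch assembles a Sqoop import command line. The function name and all example values (JDBC URL, table names, paths) are hypothetical. The `--connect`, `--table`, and `--hive-import` options are standard Sqoop options that the description does not mention; they are included here only as an assumption about what a complete command would need, and credentials are omitted.

```python
import shlex
from typing import List


def build_sqoop_import(jdbc_url: str, source_table: str, target_dir: str,
                       num_mappers: int, hive_database: str,
                       hive_table: str) -> List[str]:
    """Return the argument list for a full (non-incremental) Sqoop import into Hive."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # assumption: not named in the description
        "--table", source_table,           # assumption: not named in the description
        "--target-dir", target_dir,        # HDFS storage path
        "--num-mappers", str(num_mappers), # number of parallel map tasks
        "--hive-import",                   # assumption: needed to load into Hive
        "--hive-database", hive_database,
        "--hive-table", hive_table,
    ]


cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com:3306/metering",  # hypothetical
    source_table="area_voltage",                           # hypothetical
    target_dir="/data/area_voltage",                       # hypothetical
    num_mappers=4,
    hive_database="power",                                 # hypothetical
    hive_table="area_voltage_raw",                         # hypothetical
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each value a separate argument, so it can be handed to `subprocess.run` without shell quoting problems.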

In step S102, the time-series voltage data is preprocessed with Spark SQL to obtain processed data.

Optionally, in some embodiments, with reference to FIG. 1 and FIG. 2, preprocessing the time-series voltage data with Spark SQL includes: calculating the missing values of the time-series voltage data and filling them; and extracting feature vectors from the time-series voltage data and normalizing them to obtain the processed data.

It can be understood that the application can perform data cleaning and data preprocessing on the raw data through Spark SQL, including, for example, missing value filling, feature vector selection, outlier processing, and normalization.

Specifically, in the embodiment of the present application, a SparkSession may be created first, which provides a unified entry point to the various functions of Spark; the data in the Hive table is then accessed through Spark SQL; records in which more than 20% of the values are missing are removed from the acquired data, abnormal values are replaced with the mean value, and finally the data is normalized to obtain the processed data.
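The cleaning rules in this step can be sketched in plain Python to make the logic concrete. This is not the Spark SQL implementation the embodiment describes, and several details are assumptions: the function name is hypothetical, the mean is taken per user series, the remaining missing readings are the values replaced by the mean (the description does not say how abnormal values are detected, so outlier detection is left out), and min-max scaling is used as the normalization.

```python
from typing import List, Optional

MAX_MISSING_RATIO = 0.2  # series with more than 20% missing readings are dropped


def preprocess(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Drop sparse series, fill remaining gaps with the series mean, min-max normalize."""
    cleaned = []
    for row in rows:
        present = [v for v in row if v is not None]
        missing = len(row) - len(present)
        if not present or missing > MAX_MISSING_RATIO * len(row):
            continue  # too many gaps to repair reliably
        mean = sum(present) / len(present)
        filled = [mean if v is None else v for v in row]
        low, high = min(filled), max(filled)
        span = high - low
        # A constant series has no spread; map it to zeros instead of dividing by zero.
        cleaned.append([(v - low) / span if span else 0.0 for v in filled])
    return cleaned


result = preprocess([
    [1.0, None, 3.0, 2.0, 2.0],   # 20% missing: kept, gap filled with the mean 2.0
    [None, None, 1.0, 2.0, 3.0],  # 40% missing: dropped
    [5.0, 5.0, 5.0, 5.0, 5.0],    # constant: kept, normalized to zeros
])
```

In a Spark job the same three rules would typically be expressed as DataFrame filters and column expressions, so that they run in parallel across partitions.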

In step S103, the dimensionality of the processed data is reduced by principal component analysis, and voltage time-series data features are extracted.

Specifically, in the embodiment of the application, the Spark MLlib package may be called, and principal component analysis is used to reduce the dimensionality of the processed data and extract the voltage time-series data features.
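To show what the dimensionality reduction computes, here is a minimal single-machine sketch of principal component analysis in plain Python. It is a stand-in for illustration, not the Spark MLlib call the embodiment uses: the function names are hypothetical, and the eigenvectors are found with power iteration plus deflation, a technique chosen here only because it needs no external library.

```python
import math
from typing import List

Matrix = List[List[float]]


def dominant_eigenvector(sym: Matrix, iters: int = 200) -> List[float]:
    """Power iteration on a symmetric positive semi-definite matrix."""
    d = len(sym)
    # Fixed, non-uniform start vector so the run is deterministic. Power iteration
    # only fails if the start is exactly orthogonal to the dominant eigenvector.
    vec = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iters):
        nxt = [sum(sym[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance left in any direction
        vec = [x / norm for x in nxt]
    return vec


def pca_project(data: Matrix, k: int) -> Matrix:
    """Center `data` (rows are samples) and project it onto its top-k components."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(k):
        vec = dominant_eigenvector(cov)
        eigval = sum(vec[i] * cov[i][j] * vec[j]
                     for i in range(d) for j in range(d))
        components.append(vec)
        # Deflation: subtract the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


# Four samples lying on the line y = 2x: one component captures all the variance.
features = pca_project([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]], k=1)
```

Applied to the voltage data, each row would be one user's normalized daily series and `k` the number of features kept for clustering; how `k` is chosen is not stated in the description.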

Spark computes in memory, so intermediate results need not be written to disk, which avoids the time spent on disk reads and writes. The dependencies between RDDs are modeled as a DAG (directed acyclic graph), the execution order of the operations is scheduled automatically, and shuffled data is reduced, which greatly improves computing efficiency.

In step S104, cluster analysis is performed on the voltage time-series data features based on an agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result.

Specifically, the embodiment of the application may use the agglomerative hierarchical clustering algorithm to classify users by transformer, thereby identifying the household-transformer relationship.
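The clustering step can be illustrated with a minimal agglomerative hierarchical clustering sketch in plain Python. The description does not state the linkage criterion, the distance measure, or the stopping rule, so this sketch assumes average linkage, Euclidean distance, and a target cluster count (for example, the number of transformers being compared). The function name is hypothetical, and the single-machine loop stands in for whatever distributed implementation the embodiment runs on Spark.

```python
import math
from typing import List, Sequence


def agglomerative(points: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Average-linkage agglomerative clustering; returns clusters as lists of point indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every point in its own cluster, then merge bottom-up.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise Euclidean distance between the two clusters.
                total = sum(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Two well-separated groups of feature vectors, e.g. users fed by two transformers.
groups = agglomerative(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]],
    n_clusters=2,
)
```

Each pass scans all cluster pairs, which is the high time complexity the description refers to and the reason it moves the computation onto a distributed framework.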

According to the household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm, time-series voltage data of the transformer and the users in the transformer area can be collected; the time-series voltage data is preprocessed with Spark SQL to obtain processed data; the dimensionality of the processed data is reduced by principal component analysis and voltage time-series features are extracted; and cluster analysis is performed on the voltage time-series features based on the agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result. As a result, only user voltage time-series data sampled at a certain rate is required, without manual inspection or additional equipment, and the household-transformer relationship can be identified remotely. In addition, the Spark distributed computing framework greatly improves computing efficiency through memory-based parallel computing, which addresses the high time complexity of the agglomerative hierarchical clustering algorithm.

Next, a household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm according to an embodiment of the present application will be described with reference to the drawings.

FIG. 3 is a schematic block diagram of a household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm according to an embodiment of the present application.

As shown in FIG. 3, the household-transformer relationship identification device 10 based on the Spark framework and an agglomerative hierarchical clustering algorithm includes: an acquisition module 100, a processing module 200, an extraction module 300, and an identification module 400.

The acquisition module 100 is used for acquiring time sequence voltage data of the transformer and the users in the transformer area;

the processing module 200 is configured to perform preprocessing on the time-series voltage data by using Spark SQL to obtain processed data;

the extraction module 300 is configured to perform dimension reduction on the processed data by using a principal component analysis method, and extract voltage timing data features; and

the identification module 400 is configured to perform cluster analysis on the voltage time-series data features based on an agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result.

Optionally, in some embodiments, the acquisition module 100 comprises a pulling unit.

The pulling unit is used for pulling the full set of time-series voltage data of the transformer and the users in the transformer area from a preset external system database through Sqoop.

Optionally, in some embodiments, the above household-transformer relationship identification device 10 based on the Spark framework and an agglomerative hierarchical clustering algorithm further includes a storage module, which is used for storing the time-series voltage data of the transformer and the users in the transformer area into HDFS and associating it with a Hive table.

Optionally, in some embodiments, the processing module 200 comprises: a calculation unit and an extraction unit.

The computing unit is used for computing the missing value of the time sequence voltage data and performing missing value filling processing on the time sequence voltage data;

and the extraction unit is used for extracting the characteristic vector of the time sequence voltage data and carrying out normalization processing to obtain processed data.

It should be noted that the foregoing explanation of the method embodiment also applies to the household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm in this embodiment, and details are not repeated here.

According to the household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm, time-series voltage data of the transformer and the users in the transformer area can be collected; the time-series voltage data is preprocessed with Spark SQL to obtain processed data; the dimensionality of the processed data is reduced by principal component analysis and voltage time-series features are extracted; and cluster analysis is performed on the voltage time-series features based on the agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result. As a result, only user voltage time-series data sampled at a certain rate is required, without manual inspection or additional equipment, and the household-transformer relationship can be identified remotely. In addition, the Spark distributed computing framework greatly improves computing efficiency through memory-based parallel computing, which addresses the high time complexity of the agglomerative hierarchical clustering algorithm.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:

a memory 1201, a processor 1202, and a computer program stored on the memory 1201 and executable on the processor 1202.

The processor 1202 executes the program to implement the household-transformer relationship identification method based on the Spark framework and the agglomerative hierarchical clustering algorithm provided in the above embodiments.

Further, the electronic device further includes:

a communication interface 1203 for communication between the memory 1201 and the processor 1202.

A memory 1201 for storing computer programs executable on the processor 1202.

The memory 1201 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.

If the memory 1201, the processor 1202, and the communication interface 1203 are implemented independently, they may be connected to one another through a bus and communicate with one another. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 4, but this does not mean that there is only one bus or one type of bus.

Optionally, in a specific implementation, if the memory 1201, the processor 1202, and the communication interface 1203 are integrated on a chip, the memory 1201, the processor 1202, and the communication interface 1203 may complete mutual communication through an internal interface.

Processor 1202 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application.

The present embodiment also provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the above household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or N executable instructions for implementing the steps of a specified logical function or process. Alternate implementations are included within the scope of the preferred embodiments of the present application, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present application pertain.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

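1. A household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm, characterized by comprising the following steps:

collecting time-series voltage data of a transformer and users in a transformer area;

preprocessing the time-series voltage data by utilizing Spark SQL to obtain processed data;

reducing the dimension of the processed data by adopting a principal component analysis method, and extracting voltage time-series data features; and

performing cluster analysis on the voltage time-series data features based on an agglomerative hierarchical clustering algorithm to determine the classification of power users and obtain a household-transformer relationship identification result.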

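2. The method of claim 1, wherein collecting the time-series voltage data of the transformer and the users in the transformer area comprises:

pulling the full set of time-series voltage data of the transformer and the users in the transformer area from a preset external system database through Sqoop.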

3. The method of claim 1 or 2, further comprising:

and storing the time sequence voltage data of the transformer area and the users of the transformer area into the HDFS, and associating with a Hive table.

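4. The method of claim 1, wherein preprocessing the time-series voltage data using Spark SQL comprises:

calculating the missing values of the time-series voltage data and filling them;

extracting feature vectors from the time-series voltage data and normalizing them to obtain the processed data.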

5. A household-transformer relationship identification device based on the Spark framework and an agglomerative hierarchical clustering algorithm, characterized by comprising:

6. The apparatus of claim 5, wherein the acquisition module comprises:

7. The apparatus of claim 5 or 6, further comprising:

8. The apparatus of claim 5, wherein the processing module comprises:

9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm according to any one of claims 1 to 4.

10. A computer-readable storage medium on which a computer program is stored, the program being executed by a processor to implement the household-transformer relationship identification method based on the Spark framework and an agglomerative hierarchical clustering algorithm according to any one of claims 1 to 4.