CN105069703A - Mass data management method of power grid

Info

Publication number: CN105069703A
Application number: CN201510487734.1A
Authority: CN
Inventors: 刘志刚; 魏晓光; 陈剑飞; 刘小宝; 戴昭
Original assignee: State Grid Corp of China SGCC; Jinan Power Supply Co of State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Jinan Power Supply Co of State Grid Shandong Electric Power Co Ltd
Priority date: 2015-08-10
Filing date: 2015-08-10
Publication date: 2015-11-18
Anticipated expiration: 2035-08-10
Also published as: CN105069703B

Abstract

The invention provides a mass data management method for a power grid. The method comprises: constructing a power grid user data management system, integrating the data collected by each power grid subsystem, and using a parallel computing framework to mine and analyze the power grid user data; and, on the basis of the data management system, using a distributed load prediction algorithm to realize parallel load prediction. The method fuses and integrates the data of each power grid user system and migrates traditional data computation methods to a distributed platform to meet the computing requirements of mass data.

Description

A mass data management method for a power grid

Technical field

The present invention relates to smart grids, and in particular to a mass data management method for a power grid.

Background technology

Collecting, transmitting, and storing the real-time data of power grid users, and rapidly analyzing the accumulated massive multi-source historical data, can effectively improve demand management; managing and processing user data supports the safe, robust, and reliable operation of the smart grid. As the number of sensors and smart devices keeps increasing, the data that these devices acquire and transmit are growing exponentially. These data include not only the electricity consumption collected by smart meters, but also the temperature, weather, humidity, geographic information, wind speed, and other information that various sensors collect at fixed frequencies. User data are therefore becoming increasingly complex.

China's power generation and transmission technology differs little from that of other countries, but power distribution and consumption, particularly on the user side, differ considerably. Because a suitable market mechanism has not yet formed, the conditions for implementing smart power consumption technology in China are not yet mature, and it is difficult to support the effective integration of intelligent power distribution systems and user management systems. In general, mass data management for power grid users faces the following challenges. The rapid development of smart meters and Internet of Things technology produces massive data in widely varying formats, with inconsistent data definitions across organizations, which makes processing and integration difficult. For mass data, how to build a model that expresses them in a standardized way, and how to integrate data based on that model, are problems that urgently need to be solved. Because data are acquired in many ways and the quality of the communication channels varies, the received data are of poor quality and control over the data is insufficient; consequently, the knowledge discovered by mining such poor-quality data is unreliable and cannot support accurate decisions. This has adverse effects worldwide and seriously troubles the information society. Data types are complex, and traditional relational databases and file storage formats cannot meet the demands of rapidly growing mass data.

Summary of the invention

To solve the problems of the prior art described above, the present invention proposes a mass data management method for a power grid, comprising:

Constructing a power grid user data management system, integrating the data collected by each power grid subsystem, and using a parallel computing framework to mine and analyze the power grid user data; and, based on the data management system, using a distributed load prediction algorithm to realize parallel load prediction.

Preferably, the architecture of the power grid user data management system is divided into an application layer, a data analysis and computation layer, and a data management layer. Hadoop is used to build the power grid user data management system; on the platform, HDFS and HBase are used to establish the data storage system, and the MapReduce parallel computing framework and the Storm in-memory parallel computing framework are built as the mass data computation and analysis system to analyze the mass data of power grid users. The data management layer collects and integrates data. Data collection covers data gathered from smart meters, the data acquisition and monitoring system, and various sensors; integrating these data comprises migrating them to cluster servers for management. During data integration, a data migration tool is used to extract and integrate the data: the data produced by each independent system, together with historical data, are extracted and integrated into HBase with the data migration tool; a Java persistence tool is used to operate the column-oriented database, and the online data produced by applications based on distributed computing are written into HBase. The data analysis and computation layer is used for storage and computational analysis of mass data: HBase stores the power load data and related data; the parallel computing module MapReduce performs parallel batch computation and analysis on the mass data, and the in-memory parallel computing module Storm is used for data-intensive iterative computation, in which the data required by a service are read into memory and queried directly from memory when needed.

Preferably, using a distributed load prediction algorithm to realize parallel load prediction based on the data management system further comprises:

Three MapReduce jobs execute the training process of the algorithm, with the output of each MapReduce job serving as the input of the next; the decision model obtained after training is stored in the Hadoop distributed cluster; the process is divided into three parts: generating a data dictionary; generating decision trees; and forming a decision tree ensemble;

wherein generating the data dictionary comprises describing the training sample data: a file is produced that describes the condition attributes and the decision attribute of the samples, recording the types of the condition attribute values, the position of the decision attribute, and whether the model to be created performs classification or regression; this process is completed by the first MapReduce job, in which each Map process reads a part of the experimental data and records the attribute types of the data and the load value or class label; the resulting description file is stored in the Hadoop file system HDFS in key/value form;

wherein generating the decision trees comprises the following parallel procedure:

1) K sample data sets TS_1, TS_2, ..., TS_K, each the same size as the original sample data set, are drawn at random with replacement from the original data set; each sample data set is the training set of one decision tree; the sample data sets differ from one another and each is the same size as the original data set;

2) the number m of attributes randomly selected at each node is determined from the number M of attributes in the sample data, where m<<M; for a classification model m is the square root of M, and for a regression model m is 1/3 of M; the information content of each of the m attributes is calculated, and the best attribute is selected for branching;

3) nodes are built recursively to generate a decision tree; the K decision trees are generated in parallel, with one Map generating one decision tree, and this process is completed by the second MapReduce job;

wherein forming the decision tree ensemble comprises combining the individual decision tree classifiers; each decision tree produces a result; if the ensemble is used for classification, the final result is chosen by voting; when it is used for regression prediction, the K trees give K values and the final value is the mean of the values of the trees; this process is completed by the third MapReduce job.

Preferably, in the deployment architecture of the HBase system, a scheduling center acts as the manager of the whole distributed real-time database and stores metadata, including key information such as the division of labor among nodes, node status, the data partitioning scheme, data block locations, task scheduling, and security management; the scheduling centers keep their metadata consistent with one another through a synchronization mechanism; the data analysis and computation layers are logically peers, deploying the same processes and performing the same logical operations; a transaction-based redundant backup mechanism is used among the data analysis and computation layers; the power grid user data management system uses HDFS as the underlying distributed file system, on which a time-series control component for power grid mass data is built to store the time-series data of power grid services; the time-series control component builds a time-series data model, collected time-series data are received and stored uniformly according to this specific model, and a unified query interface is provided externally;

In terms of storage, data are stored in key-value form, that is, column-oriented storage in which the column family is the basic unit of storage and access control; empty columns occupy no actual space in storage, following a sparse table design; in terms of data architecture deployment, the traditional C/S pattern of multiple clients and a single server is abandoned in favor of a distributed multi-server cluster, and all data are dispersed across multiple computers in the cluster according to a replication factor; the time-series control component relies on the column-oriented database at its bottom layer, and the processing of time-series data is abstracted as the basic operations of reading, writing, adding, deleting, and modifying data in the HBase database; the top software layer consists of the clients of the time-series control component and third-party application clients; all clients perform operations through the Java API; each API call is decomposed by the type-parsing module into one database operation or an ordered set of database operations; these sets of database operations are invoked through RPC inside the control component, and the data operations are finally completed uniformly through the asynchronous HBase operation API.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes a mass data management method for a power grid that fuses and integrates the data of each power grid user system and migrates traditional data computation methods to a distributed platform, meeting the computing requirements of mass data.

Embodiment

The following is a detailed description of one or more embodiments of the present invention. The invention is described in conjunction with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these details.

One aspect of the present invention provides a method for processing the mass data of power grid users. A Hadoop cluster is used to build the basic management system for mass data; the data collected by each power grid subsystem are integrated into mass data storage, and a parallel computing framework is used to rapidly mine and analyze the mass data of power grid users. For the power load prediction application, traditional load prediction is migrated to a distributed computing platform, and parallel load prediction is realized with a decision-tree-based load prediction algorithm. In view of the practical needs of mass data analysis for power grid users, the invention builds a power grid user data management system oriented mainly toward analysis and computation, whose basic architecture is divided into an application layer, a data analysis and computation layer, and a data management layer.

This architecture uses Hadoop to build the power grid user data management system. On the platform, HDFS and HBase are used to establish the mass data storage system, and the MapReduce parallel computing framework and the Storm in-memory parallel computing framework are built as the mass data computation and analysis system to analyze the mass data of power grid users.

The data management layer collects and integrates data. Data collection covers data gathered from smart meters, the data acquisition and monitoring system, and various sensors. These data include not only data internal to the power grid but also a large amount of related data; they are produced by equipment from different vendors, in widely varying formats and with inconsistent data definitions across organizations, forming a massive data flow that is difficult to process and integrate. Integrating these data means migrating the data generated by legacy systems to cluster servers for efficient management.

To address the difficulty of data integration, the platform uses a data migration tool to extract and integrate data: the data produced by each independent system, together with historical data, are extracted and integrated into HBase with the data migration tool. A Java persistence tool is used to operate the column-oriented database, and the online data produced by applications based on distributed computing are written into HBase.

The data analysis and computation layer provides storage and computational analysis of mass data. The distributed computing layer is built with Hadoop; mass data are stored in the distributed file system HDFS and managed with HBase.

The platform uses HBase to store power load data and related data. The HBase database uses the column as its storage unit, which makes whole-column queries convenient; the prediction algorithm used later needs to read and compute over whole columns repeatedly during learning, so the storage characteristics of HBase meet the data operation requirements.

The parallel computing module MapReduce performs parallel batch computation and analysis on the mass data, and the in-memory parallel computing module Storm is used for data-intensive iterative computation. Storm provides an in-memory parallel computing framework: the framework reads the data required by a service into memory, and the data are queried directly from memory when needed. This is faster than the disk-based data access of MapReduce, and it reduces both service run time and I/O operations.

Load prediction is a key link in power grid planning and an important computational basis for substation and grid structure planning; high-precision short-term load prediction can effectively reduce generation costs and plays a key role. The invention uses an improved ensemble learning method with the decision tree as the basic learning unit. The ensemble comprises multiple decision trees trained by the random subspace method: a sample to be classified is input, each decision tree produces its own classification result, and the final classification result is chosen by voting on the results of the trees. This overcomes some shortcomings of a single decision tree and offers good scalability and parallelism; it can effectively address the fast processing of mass data and has good application prospects for power load prediction in a mass data environment.

The whole load prediction process uses three MapReduce jobs to execute the training process of the algorithm, with the output of each MapReduce job serving as the input of the next. The decision model obtained after training is stored in the Hadoop distributed cluster. The process is divided into three parts: generating a data dictionary; generating decision trees; and forming a decision tree ensemble. Generating the data dictionary means describing the training sample data: a file is produced that describes the condition attributes and the decision attribute of the samples, recording the types of the condition attribute values, the position of the decision attribute, and whether the model to be created performs classification or regression. This process is completed by the first MapReduce job: each Map process reads a part of the experimental data and records the attribute types of the data and the load value or class label. The resulting description file is stored in the Hadoop file system HDFS in key/value form for use by the subsequent MapReduce jobs.
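The data dictionary step can be pictured with a small single-process sketch. It is only an illustration under stated assumptions, not the patented implementation: the function names `map_describe` and `reduce_describe`, the string-based type test, and the rule that a numeric decision column implies regression are all hypothetical choices made here, and plain Python generators stand in for real MapReduce tasks.

```python
from collections import defaultdict


def map_describe(rows):
    """Map step (hypothetical): emit (column index, observed type) for one data split."""
    for row in rows:
        for idx, value in enumerate(row):
            try:
                float(value)
                yield idx, "numeric"
            except ValueError:
                yield idx, "categorical"


def reduce_describe(pairs, decision_index):
    """Reduce step (hypothetical): merge per-split observations into one descriptor."""
    seen = defaultdict(set)
    for idx, kind in pairs:
        seen[idx].add(kind)
    # A column counts as numeric only if every observed value parsed as a number.
    attributes = {
        idx: ("numeric" if kinds == {"numeric"} else "categorical")
        for idx, kinds in seen.items()
    }
    # Assumption: a numeric decision attribute means regression, otherwise classification.
    task = "regression" if attributes[decision_index] == "numeric" else "classification"
    return {"attributes": attributes, "decision_index": decision_index, "task": task}


# Two splits of toy sample data: temperature, weather, load value.
splits = [
    [["21.5", "sunny", "310.2"], ["19.0", "rain", "295.7"]],
    [["23.1", "cloudy", "322.4"]],
]
pairs = [pair for split in splits for pair in map_describe(split)]
dictionary = reduce_describe(pairs, decision_index=2)
```

In a real deployment the resulting descriptor would be serialized as key/value pairs to HDFS so that the later jobs can read it; that serialization is omitted here.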

Generating the decision trees is the core of the whole parallel algorithm; its parallelism lies in the following aspects: 1) K sample data sets TS_1, TS_2, ..., TS_K, each the same size as the original sample data set, are drawn at random with replacement from the original data set. Because the sampling is with replacement, the draws from the original data set can be performed in parallel without affecting any TS. Each TS is the training set of one decision tree; the TS differ from one another and each is the same size as the original data set, which both ensures diversity among the decision trees and preserves the scale of knowledge in the original data set.

2) The number m of attributes randomly selected at each node (m<<M) is determined from the number M of attributes in the sample data: for a classification model m is the square root of M, and for a regression model m is 1/3 of M. The information content of each of the m attributes is calculated, and the best attribute is selected for branching;

3) Nodes are built recursively to generate a decision tree. The K decision trees are generated in parallel, with one Map generating one decision tree, which parallelizes the algorithm. This process is completed by the second MapReduce job, which has only a Map phase and no Reduce phase.
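Steps 1) and 2) above can be sketched in a few lines. This is a minimal illustration, not the patent's code: the helper names are hypothetical, and entropy-based information gain over categorical attributes is assumed as the "information content" measure, which the text does not specify. In the described system each Map task would run this logic for one tree; here everything runs in a single process.

```python
import math
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Step 1: draw len(dataset) rows with replacement to form one training set TS_i."""
    return [rng.choice(dataset) for _ in range(len(dataset))]


def attributes_per_node(total_attributes, task):
    """Step 2: m = sqrt(M) for classification, M/3 for regression (at least 1)."""
    if task == "classification":
        return max(1, math.isqrt(total_attributes))
    return max(1, total_attributes // 3)


def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(rows, attr, label_index):
    """Entropy reduction from splitting rows on one categorical attribute (assumed measure)."""
    base = entropy([r[label_index] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[label_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def choose_split(rows, candidate_attrs, m, label_index, rng):
    """Pick m candidate attributes at random and return the one with the highest gain."""
    subset = rng.sample(candidate_attrs, m)
    return max(subset, key=lambda a: information_gain(rows, a, label_index))


rng = random.Random(0)
data = [
    ("hot", "weekday", "high"),
    ("hot", "weekend", "high"),
    ("cold", "weekday", "low"),
    ("cold", "weekend", "low"),
]
ts = bootstrap_sample(data, rng)
best = choose_split(data, [0, 1], 2, label_index=2, rng=rng)
```

In the toy data the first attribute separates the labels perfectly, so it is the one chosen for branching; recursive node construction (step 3) would repeat `choose_split` on each resulting partition.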

Forming the decision tree ensemble means combining the individual decision tree classifiers. Each decision tree produces a result. If the ensemble is used for classification, the final result is chosen by voting; when it is used for regression prediction, the K trees give K values and the final value is the mean of the values of the trees. This process is completed by the third MapReduce job.
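The combination rule is simple enough to show directly. The sketch below assumes the K per-tree results have already been collected into a list; the function name `aggregate` is hypothetical, and ties in the vote are resolved by whichever label was seen first, which is an assumption of this sketch rather than something the text specifies.

```python
from collections import Counter
from statistics import mean


def aggregate(predictions, task):
    """Combine per-tree results: majority vote for classification, mean for regression."""
    if task == "classification":
        return Counter(predictions).most_common(1)[0][0]
    return mean(predictions)


label = aggregate(["high", "low", "high"], "classification")
load = aggregate([300.0, 310.0, 320.0], "regression")
```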

The whole model is built on the Hadoop distributed cluster, which stores the mass data in a distributed manner, and MapReduce is used to parallelize the algorithm, so that mining and predictive computation over the overall sample set S can rely on the storage and computing capacity of the Hadoop cluster. The whole process executes in parallel, which can effectively improve prediction accuracy and the ability of the load prediction system to process mass data.

In the deployment architecture of the HBase system described above, a scheduling center acts as the manager of the whole distributed real-time database and stores metadata, including key information such as the division of labor among nodes, node status, the data partitioning scheme, data block locations, task scheduling, and security management. Two scheduling centers are generally deployed (more are also possible); they keep their metadata consistent with one another through a synchronization mechanism, which eliminates the risk that a single point of failure at the scheduling center disables the whole system and also lays a foundation for load balancing of concurrent requests. The data analysis and computation layers store the mass data in shards and also carry out the various computations; their number is limited only by hard constraints such as Ethernet bandwidth and the physical conditions of the machine room. The data analysis and computation layers are logically peers, deploying the same processes and performing the same logical operations; following the scheduling center's partitioning rules, each stores only the data belonging to its own partition, which achieves distributed storage. Because node failures and faults occur frequently in a distributed architecture, a transaction-based redundant backup mechanism is used among the data analysis and computation layers: the same transaction is synchronized to one or several other data analysis and computation layers (depending on a configurable replication factor), which makes the data highly reliable and also lays a foundation for load balancing of data access.
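How a partitioning rule and a replication factor interact can be sketched as follows. The text leaves the actual rule to the scheduling center, so this is purely an assumed scheme: keys are hashed with MD5 to pick a primary node, and replicas go to the next nodes in list order. The function name `owner_nodes` and the node names are hypothetical.

```python
import hashlib


def owner_nodes(key, nodes, replication_factor):
    """Return the primary node for a key followed by its replica nodes (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(nodes)
    # Never place more copies than there are nodes.
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


nodes = ["node-1", "node-2", "node-3", "node-4"]
owners = owner_nodes("meter-001", nodes, replication_factor=2)
```

Because placement depends only on the key and the node list, any peer can compute where a record lives, and a read can be served by whichever replica is least loaded.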

The power grid user data management system uses HDFS as the underlying distributed file system, and on this basis a time-series control component for power grid mass data is built to store the time-series data of power grid services. The time-series control component builds a time-series data model; collected time-series data are received and stored uniformly according to this specific model, and a unified query interface is provided externally.

In terms of storage, unlike the row-oriented table structure of traditional relational databases, data are stored in key-value form, that is, column-oriented storage in which the column family is the basic unit of storage and access control. Empty columns occupy no actual space in storage, following a sparse table design. This solves the problem of wasted space caused by differing sampling periods. In terms of data architecture deployment, the traditional C/S pattern of multiple clients and a single server is abandoned. Instead, a distributed multi-server cluster is used: all data are dispersed across multiple computers in the cluster according to the replication factor, which strengthens the storage security of the data and improves query efficiency.
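The effect of a sparse, column-family layout can be seen in a toy in-memory model. This sketch only mimics the idea with nested dictionaries; the class `SparseTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class SparseTable:
    """Toy column-family store: only cells that were written occupy space."""

    def __init__(self):
        # row key -> column family -> column qualifier -> value
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value):
        self._rows[row][family][qualifier] = value

    def get(self, row, family, qualifier, default=None):
        return self._rows.get(row, {}).get(family, {}).get(qualifier, default)

    def cell_count(self):
        return sum(len(cols) for fams in self._rows.values() for cols in fams.values())


table = SparseTable()
# A meter sampled every 15 minutes and a sensor sampled hourly share one table.
table.put("meter-001", "load", "00:15", 3.2)
table.put("meter-001", "load", "00:30", 3.4)
table.put("sensor-9", "weather", "00:00", 21.5)
```

The hourly sensor has no `00:15` or `00:30` cells at all, so different sampling periods do not leave padded empty slots, which is the space saving the paragraph above describes.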

The time-series control component relies on the column-oriented database at its bottom layer. The processing of time-series data can be abstracted as basic operations on the HBase database such as reading, writing, adding, deleting, and modifying. The top software layer consists of the clients of the time-series control component and third-party application clients. All clients perform operations through the Java API. Each API call can be decomposed by the type-parsing module into one database operation or an ordered set of database operations. These sets of database operations are invoked through RPC inside the control component, and the data operations are finally completed uniformly through the asynchronous HBase operation API.

A time-series data record consists of four fields: measurement object, timestamp, measured value, and tags. The tags consist of one or more key/value pairs that further describe the measurement object; a measurement object combined with its tags forms a measurement item. The tag design makes it easy for users to query the values of the measurement items they care about. The control component uses a storage layer to store data; the storage layer is a distributed file storage system with a key/value structure. Storing time-series data efficiently in the distributed storage layer, and easily storing more than ten billion data points with minimal memory and disk space, is the key problem that a good storage structure design must solve. To this end, the design of the HBase column-oriented database tables on which the distributed real-time database management layer relies follows these principles: the time-series control component uses a fixed-length primary key, which should contain as much retrieval information as possible; the stored data generally include a large number of measurement objects and tags, and these fields are of variable length, so an ID table is set up to store this information and assign globally unique numbers, and the number and the timestamp are merged to form the primary key; each row should store as much information as possible. For example, the data collected within a certain time period are combined and committed as one row. This scheme reduces the number of row keys in the whole table and thus speeds up row retrieval. Data are stored as time advances using a stateless storage scheme, which provides system disaster tolerance.

The keys and values of every measurement object and tag are numbered by hash mapping. To improve query efficiency, this mapping information is stored in the ID table in two parts: one maps measurement objects, tag keys, and tag values to their hash numbers, and the other maps hash numbers back to measurement objects, tag keys, and tag values. All hash numbers have a fixed length of 3 bytes. The time-series data of measurement objects are stored in another table, whose row key is: measurement object ID + reference time + tag key ID + tag value ID, where the reference time field is the top-of-the-hour time corresponding to the time-series record to be stored. Except for the reference time, which is 4 bytes, every field is 3 bytes. The time-series data within one hour are stored in one row of the table; a given record is stored in the row's column corresponding to its offset Δt from the reference time, where Δt = record timestamp - reference time. When a row is full, the next row is opened to continue storing.
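The row-key layout described above can be made concrete with a short sketch. Several details are assumptions rather than statements from the text: sequential counters stand in for the hash numbering, timestamps are taken to be Unix seconds, tag pairs are sorted by key before being appended, and the names `IdTable` and `row_key_and_offset` are hypothetical.

```python
import struct

ID_BYTES = 3   # fixed ID length stated in the text
HOUR = 3600    # one row holds one hour of data


class IdTable:
    """Two-way mapping between names and fixed-length 3-byte IDs."""

    def __init__(self):
        self._to_id = {}
        self._to_name = {}

    def id_for(self, name):
        if name not in self._to_id:
            # Assumption: a sequential counter replaces the hash numbering.
            new_id = (len(self._to_id) + 1).to_bytes(ID_BYTES, "big")
            self._to_id[name] = new_id
            self._to_name[new_id] = name
        return self._to_id[name]

    def name_for(self, id_bytes):
        return self._to_name[id_bytes]


def row_key_and_offset(ids, measurement, timestamp, tags):
    """Build object ID + 4-byte reference time + (tag key ID + tag value ID) pairs."""
    base = timestamp - timestamp % HOUR          # top-of-the-hour reference time
    key = ids.id_for(measurement) + struct.pack(">I", base)
    for tag_key in sorted(tags):
        key += ids.id_for(tag_key) + ids.id_for(tags[tag_key])
    return key, timestamp - base                 # offset selects the column in the row


ids = IdTable()
key, delta = row_key_and_offset(ids, "load.active", 1439190930, {"meter": "m-001"})
key2, delta2 = row_key_and_offset(ids, "load.active", 1439191800, {"meter": "m-001"})
```

Both records fall in the same hour, so they share one row key and differ only in the column offset (930 and 1800 seconds), which is how an hour of samples ends up in a single row.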

Obviously, those skilled in the art should understand that each module or step of the invention described above can be implemented with a general-purpose computing system; the modules or steps can be concentrated on a single computing system or distributed across a network of multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the invention is not limited to any specific combination of hardware and software.

It should be understood that the above embodiments of the invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or equivalents of such scope and boundaries.

Claims

1. A mass data management method for a power grid, characterized in that it comprises:

2. The method according to claim 1, characterized in that the architecture of the power grid user data management system is divided into an application layer, a data analysis and computation layer, and a data management layer; Hadoop is used to build the power grid user data management system; on the platform, HDFS and HBase are used to establish the data storage system, and the MapReduce parallel computing framework and the Storm in-memory parallel computing framework are built as the mass data computation and analysis system to analyze the mass data of power grid users; the data management layer collects and integrates data; data collection covers data gathered from smart meters, the data acquisition and monitoring system, and various sensors, and integrating these data comprises migrating them to cluster servers for management; during data integration, a data migration tool is used to extract and integrate the data, the data produced by each independent system, together with historical data, being extracted and integrated into HBase with the data migration tool, and a Java persistence tool is used to operate the column-oriented database, the online data produced by applications based on distributed computing being written into HBase; the data analysis and computation layer is used for storage and computational analysis of mass data; HBase stores the power load data and related data; the parallel computing module MapReduce performs parallel batch computation and analysis on the mass data, and the in-memory parallel computing module Storm is used for data-intensive iterative computation, in which the data required by a service are read into memory and queried directly from memory when needed.

3. The method according to claim 2, characterized in that using a distributed load prediction algorithm to realize parallel load prediction based on the data management system further comprises:

4. The method according to claim 3, characterized in that, in the deployment architecture of the HBase system, a scheduling center acts as the manager of the whole distributed real-time database and stores metadata, including key information such as the division of labor among nodes, node status, the data partitioning scheme, data block locations, task scheduling, and security management; the scheduling centers keep their metadata consistent with one another through a synchronization mechanism; the data analysis and computation layers are logically peers, deploying the same processes and performing the same logical operations; a transaction-based redundant backup mechanism is used among the data analysis and computation layers; the power grid user data management system uses HDFS as the underlying distributed file system, on which a time-series control component for power grid mass data is built to store the time-series data of power grid services; the time-series control component builds a time-series data model, collected time-series data are received and stored uniformly according to this specific model, and a unified query interface is provided externally,