CN106547882A

CN106547882A - Real-time processing method and system for marketing big data in a smart grid

Info

Publication number: CN106547882A
Application number: CN201610953688.4A
Authority: CN
Inventors: 杨云; 吕跃春; 朱珠; 罗春雷; 聂静; 吴彬; 张晓勇; 雷娟; 张伟; 晏尧; 徐鑫; 徐光侠
Original assignee: Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd; State Grid Corp of China SGCC; State Grid Chongqing Electric Power Co Ltd
Current assignee: Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd; State Grid Corp of China SGCC; State Grid Chongqing Electric Power Co Ltd
Priority date: 2016-11-03
Filing date: 2016-11-03
Publication date: 2017-03-29

Abstract

A real-time processing method and system for marketing big data in a smart grid. The proposed system comprises five parts: a data acquisition module, a data processing module, a data storage module, a business logic module, and a display module. The stream computing model and the batch computing model are tightly integrated and can readily interoperate, so that marketing data are served better. The method helps process the massive real-time data generated in a smart grid in a timely manner and mines the similarity among the various types of marketing data in depth; it offers grid users more personalized, higher-quality service while also providing a reliable safeguard for the grid system.

Description

Real-time processing method and system for marketing big data in a smart grid

Technical field

The present invention relates to the field of power grid data processing, and in particular to a method and system for processing grid marketing data.

Background art

With the sustained growth of data and the rapid expansion of power marketing systems, China's power industry has also entered the big data era. As the construction of big data marketing systems in the smart grid deepens and advances, the volume of data produced by grid operation and by equipment detection and monitoring, together with the amount of personal information on grid end users, is growing rapidly, and real-time processing of smart grid marketing big data has become a common concern of power utilities. In a smart grid environment, everything from generation through transmission and transformation to the collection of customer electricity consumption information requires real-time processing. Real-time big data processing is generally divided into two categories: real-time analysis of streaming data and real-time batch processing.

The typical representative of real-time stream processing platforms is Storm. Storm is a distributed, fault-tolerant real-time computing system. It makes it easy to write complex real-time computations on a computer cluster, guarantees that every message is processed, and can process millions of messages per second even on a small cluster. Its main features are as follows:

1) Simple programming model. Storm gives users a parallel framework for processing stream data with a development model similar to MapReduce, which reduces the complexity of real-time stream processing.

2) Support for many programming languages. Several languages can be used on Storm, including Clojure, Java, Ruby, and Python. Adding support for another language only requires implementing the Storm communication protocol.

3) High fault tolerance. Storm manages the failure of worker processes and nodes and recovers quickly.

4) Horizontal scalability. Computing tasks run in parallel across multiple threads, processes, and servers.

5) Reliable message processing. Storm guarantees that every message is fully processed at least once. When a task fails, the message is replayed from the message source.

6) Speed. Storm is designed so that messages are processed quickly, and its lower layer uses a message queue mechanism.

The typical representative of the batch computing model is Hadoop, a distributed system architecture. Traditional big data batch processing, as represented by Hadoop, requires frequent disk I/O and has low computational efficiency, so it struggles to meet the needs of online condition monitoring and assessment in power systems. To address such problems, the in-memory computing platform Spark has come into wide use. Spark is an open-source cluster computing system based on in-memory computation whose purpose is to make data analysis faster. Spark uses a cluster computing framework similar to Hadoop's, but it targets a particular type of workload: computations that need to share a working dataset across multiple parallel iterative operations (machine learning algorithms, for example). To optimize such computations, Spark introduces in-memory cluster computing, that is, it caches datasets in memory to reduce disk access latency. Spark computations use resilient distributed datasets (RDDs) to improve efficiency. An RDD is a read-only collection of objects distributed across a group of nodes. These collections can be rebuilt when part of the dataset is lost, which gives Spark its fault-tolerance mechanism. Rebuilding part of a dataset relies on maintaining lineage: the process that generated the data is recorded and then used to rebuild the lost part. In Spark, an RDD can be obtained in four ways: from a Scala object created from a file in HDFS (Hadoop Distributed File System); by parallelizing a collection into slices distributed across nodes; by transforming another RDD; or by changing the persistence of an existing RDD, for example by caching it in memory. On certain tasks Spark runs one to two orders of magnitude faster than Hadoop, that is, up to 100 times the speed of Hadoop MapReduce.

Spark Streaming is a framework built on Spark for processing streaming data. Its basic principle is to divide the data stream into small time slices (a few seconds each) and to process each small portion of data in a batch-like manner. Spark Streaming is built on Spark for two reasons. First, Spark's low-latency execution engine (100 ms and up) can be used for real-time computation. Second, compared with record-based processing frameworks such as Storm, RDD datasets make efficient fault-tolerant processing easier. In addition, the micro-batch approach makes Spark Streaming compatible with the logic and algorithms of both batch and real-time data processing, which suits applications that need joint analysis of historical and real-time data.
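
The micro-batch principle described above can be illustrated with a minimal sketch in plain Python. It does not use the Spark Streaming API; the function names `micro_batches` and `process_batch`, the 2-second slice width, and the sample readings are all assumptions made for illustration.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-width time slices.

    Slices that contain no events are simply absent from the result.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        slice_index = int(timestamp // batch_seconds)
        batches[slice_index].append(value)
    return [batches[i] for i in sorted(batches)]

def process_batch(values):
    """Batch-style computation applied to one slice: here, the mean reading."""
    return sum(values) / len(values)

# Hypothetical meter readings as (seconds since start, value) pairs.
events = [(0.2, 10.0), (1.7, 14.0), (2.1, 20.0), (3.9, 30.0), (4.0, 8.0)]
batch_means = [process_batch(b) for b in micro_batches(events, batch_seconds=2)]
```

Each slice is handled by the same function that a batch job would use, which is the property the text relies on when it says batch and real-time logic can be shared.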

Most existing smart grid marketing data processing systems use centralized, single-machine processing, mainly handle structured, non-real-time data, and collect and process only part of the business data in real time. Their capacity for data storage, processing, exchange, presentation, and interaction has limited room for improvement, and massive data assets are not used rationally and effectively. They also lack stream processing and therefore cannot support the real-time data needs of the various grid domains.

Summary of the invention

One object of the present invention is to provide a real-time processing method for marketing big data in a smart grid, which analyzes and processes grid big data and provides technical support for grid marketing.

This object is achieved by the following technical scheme, comprising the steps of:

1) Multiple servers are grouped into a Kafka cluster that receives the grid marketing data collected by Flume; a partition function selects a partition for the grid marketing data, and the partitioned data are stored on the corresponding servers;

2) Storm performs rapid diagnosis and assessment of the partitioned grid marketing data by stream computing, distinguishing the real-time requirements of the data;

3) The diagnosed and assessed grid marketing data are processed in real time: data with loose time requirements are processed with MapReduce; data with strict real-time requirements are processed with a Spark-based K-Means clustering algorithm to uncover the information hidden in the data;

4) The output of step 3) is written to HBase, and the corresponding business logic is set up.

Further, in step 1) the partition is selected for the grid marketing data using the partition function as follows:

Using a Flume interceptor, the key in the event header of the grid marketing data is read, and the partition is then selected according to that key.

Further, the stream computing with Storm in step 2) comprises: building the topology of the grid marketing data processing procedure, and performing denoising, feature computation, and result analysis on the grid marketing data in sequence.

Further, the data obtained from the processing in step 2) are stored in HDFS, the distributed file storage system of Hadoop.

Further, the Spark-based K-Means clustering algorithm in step 3) comprises the following steps:

3-1) File blocks in HDFS are read into memory, and each block is converted into a resilient distributed dataset (RDD) that contains the set of feature quantities of the monitoring data;

3-2) A map operation is applied to the RDD: the monitoring data are abstracted as feature vectors (Vector), each dimension of a Vector corresponding to a dimension of a Point; the cluster number (Class) of each Vector (Point) is computed, Class being the number of a cluster center; and key-value pairs (K, V) of the form (Class, (Point, 1)) are output, where K is the key, i.e. the number of each dimension of the data, and V is the value, i.e. the actual value of the data; a new RDD is generated in this way;

3-3) Each new RDD is shuffled so that data belonging to the same cluster are stored together, and each cluster center is computed within the RDD;

3-4) The distance between each center and its previous position is evaluated; if it meets the requirement, the algorithm terminates; otherwise it repeats from the second step until the termination condition is met;

3-5) The output is written to HDFS.

Further, each cluster center in step 3-3) is computed within the RDD as follows:

k cluster centers are selected at random and denoted $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$, where $\mathbb{R}^n$ is the n-dimensional real vector space.

The following procedure is repeated until convergence. For each sample i, the class to which it should belong is computed:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2$$

where $c^{(i)}$ is the class of the i-th sample, $x^{(i)}$ is the feature vector of the i-th sample, and $\mu_j$ is the feature vector of the j-th cluster center. For each class j, the center of that class is recomputed:

$$\mu_j := \frac{1}{m} \sum_{i:\, c^{(i)} = j} x^{(i)}$$

where m is the total number of feature vectors in the j-th class.

Another object of the present invention is to provide a real-time processing system for marketing big data in a smart grid, which can intelligently analyze the marketing data in the grid and provide data support for grid marketing.

This object is achieved by the following technical scheme, which comprises:

A data acquisition module, for receiving through a Kafka cluster the grid marketing data collected by Flume and integrating the data sources;

A data processing module, for processing data by stream computing using a Storm cluster and then performing real-time computation using the Spark-based K-Means clustering algorithm;

A data storage module, for writing the output of the data processing module to HBase;

A business logic module, for implementing user management and user permission logic;

A display module, for providing a Web interactive interface.

Further, the Storm cluster consists of one master node and multiple worker nodes. The master node is Nimbus, which is responsible for task assignment, code distribution, and cluster monitoring; each worker node is a Supervisor, corresponds to one physical machine, and is used to start processes.

Further, each worker node contains multiple processes, and each process contains multiple threads.

By adopting the above technical scheme, the present invention has the following advantages:

Taking the characteristics of marketing data into account, the present invention designs, at the system level, a real-time processing architecture that combines Hadoop, Storm, and Spark, and completes the real-time processing of marketing big data in a smart grid with higher efficiency. The method helps process the massive real-time data generated in a smart grid in a timely manner and mines the similarity among the various types of marketing data in depth; it offers grid users more personalized, higher-quality service while also providing a reliable safeguard for the grid system.

Other advantages, objects, and features of the invention will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the following description and the claims.

Description of the drawings

The drawings of the present invention are described as follows.

Fig. 1 is a module diagram of the real-time processing system for marketing big data in a smart grid according to the present invention;

Fig. 2 is an architecture model of the Storm cluster in the present invention;

Fig. 3 is the real-time processing architecture for marketing big data in a smart grid according to the present invention;

Fig. 4 is the topology diagram of the marketing data stream processing designed in the present invention;

Fig. 5 is a schematic of the Spark-based K-Means implementation process in the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and embodiments.

A real-time processing method for marketing big data in a smart grid specifically comprises the following steps:

Step one, multi-class parallel message transmission: each marketing data collection terminal in the smart grid is abstracted as a producer, and a topic is created for each class of marketing data. The producer publishes messages to the specified topic, and a specific partition function selects the partition. Finally, the Kafka cluster delivers the messages to consumers. Grid marketing data comprise many types, such as residential, commercial, and large commercial electricity use. Flume preprocesses the various kinds of data, and the message middleware Kafka receives them. Within one topic, data are stored in partitions on different servers according to their key-value format.
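
Key-based partition selection of the kind described in step one can be sketched as follows. This is a plain-Python illustration, not the Kafka or Flume API: the partition count, the event layout with a `headers` dictionary, and the names `select_partition` and `route` are assumptions, and CRC32 merely stands in for whatever hash the real partition function uses.

```python
import zlib

NUM_PARTITIONS = 3  # assumed number of partitions in one topic

def select_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition index with a stable hash.

    zlib.crc32 is used instead of the built-in hash() because str hashes
    are randomized per Python process and would not be repeatable.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def route(events, num_partitions=NUM_PARTITIONS):
    """Distribute events over partitions according to the key in each header."""
    partitions = {p: [] for p in range(num_partitions)}
    for event in events:
        key = event["headers"]["key"]
        partitions[select_partition(key, num_partitions)].append(event)
    return partitions

# Hypothetical events: the key identifies the class of marketing data.
events = [
    {"headers": {"key": "residential"}, "body": 3.2},
    {"headers": {"key": "commercial"}, "body": 41.0},
    {"headers": {"key": "residential"}, "body": 2.9},
]
partitions = route(events)
```

Because the partition depends only on the key, all messages of one data class land in the same partition and keep their relative order there.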

Step two, processing by stream computing with Storm: when data are processed with Storm, the topology of the data processing procedure, i.e. the logical order of the processing stages, is designed first. Marketing data are processed in the following order: data acquisition, denoising, feature computation, result analysis.

Step three, real-time processing of the grid data: tasks with loose time requirements are processed with MapReduce; for tasks with strict real-time requirements, the K-Means clustering algorithm is combined with Spark's in-memory parallel computing to divide the data into different classes and discover the valuable information hidden in the marketing data. Whether a task's real-time requirement is high or low is determined from statistical data. The Spark-based K-Means clustering algorithm is shown in Fig. 5; its steps are as follows:

(1) The file blocks stored on HDFS are read into memory, and each block is converted into an RDD that contains the set of feature quantities of the monitoring data.

(2) A map operation is then applied to the RDD: the monitoring data are abstracted as feature vectors (Vector), each dimension of a Vector corresponding to a dimension of a Point; the cluster number (Class) of each Vector (Point) is computed, Class being the number of a cluster center; and key-value pairs (K, V) of the form (Class, (Point, 1)) are output, where K is the key, i.e. the number of each dimension of the data, and V is the value, i.e. the actual value of the data. A new RDD is generated in this way.

(3) In the reduce operation, each new RDD is shuffled so that data belonging to the same cluster are stored together, and each cluster center is computed within the RDD.

(4) The distance between each center and its previous position is then evaluated; if it meets the requirement, the algorithm terminates; otherwise it repeats from the second step until the termination condition is met.

(5) Finally, the output is written to HDFS.

The cluster centers are computed as follows. k cluster centers are selected at random and denoted $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$, where $\mathbb{R}^n$ is the n-dimensional real vector space.

The following procedure is repeated until convergence. For each sample i, the class to which sample i should belong is computed:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2$$

where $c^{(i)}$ is the class of the i-th sample, $x^{(i)}$ is the feature vector of the i-th sample, and $\mu_j$ is the feature vector of the j-th cluster center. For each class j, the center of that class is recomputed:

$$\mu_j := \frac{1}{m} \sum_{i:\, c^{(i)} = j} x^{(i)}$$

where m is the total number of feature vectors in the j-th class.
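
The iteration described above (assign each sample to its nearest center, emit (Class, (Point, 1)) pairs, sum them per class, recompute the centers, and stop once the centers barely move) can be sketched in plain Python. This is a single-machine illustration under stated assumptions, not the Spark implementation: the function names, the tolerance `tol`, the iteration cap, and the fixed initial centers (chosen instead of random ones so the run is repeatable) are not from the source.

```python
import math

def nearest_center(point, centers):
    """Index j of the center closest to the point (the assignment step)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def kmeans_step(points, centers):
    """One iteration written as a map phase followed by a reduce phase."""
    # Map: emit (Class, (Point, 1)) for every point.
    pairs = [(nearest_center(p, centers), (p, 1)) for p in points]
    # Reduce: per class, sum the points coordinate-wise and count them.
    sums = {}
    for cls, (p, n) in pairs:
        acc, count = sums.get(cls, ([0.0] * len(p), 0))
        sums[cls] = ([a + b for a, b in zip(acc, p)], count + n)
    # New center = coordinate sum / member count; empty classes keep their center.
    new_centers = list(centers)
    for cls, (acc, count) in sums.items():
        new_centers[cls] = tuple(a / count for a in acc)
    return new_centers

def kmeans(points, centers, tol=1e-6, max_iter=100):
    """Repeat the step until no center moves farther than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_step(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 0.0)])
```

Carrying the count 1 alongside each point is what lets the reduce phase compute a mean from partial sums, which is why the pair has the form (Point, 1).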

As shown in Fig. 1, the invention discloses a real-time processing system for smart grid marketing big data based on stream computing and batch computing, comprising:

A grid marketing data acquisition module, which uses Kafka to implement the message queue and integrates the data sources;

A data processing module, which first processes data by stream computing using Storm and then performs real-time computation using the in-memory Spark framework;

A data storage module, which stores the output of the data processing module in HBase;

A business logic module, which implements the system's user management and user permission logic;

A display module, which provides a Web interactive interface.

The data acquisition module integrates Flume and Kafka: Flume acts as the message producer, the messages it produces are saved in Kafka, and the Storm topology acts as the message consumer.

Preferably, in the proposed real-time processing system the stream computing module uses a Storm cluster and the batch computing module mainly uses a Spark cluster; other clusters may of course be used in other embodiments. Their functions are described as follows:

(1) The Storm cluster has a master-worker architecture consisting of one master node and multiple worker nodes. The master node is Nimbus, which is responsible for task assignment, code distribution, cluster monitoring, and similar work. Each worker node is a Supervisor, corresponds to one physical machine, and is used to start workers. Each worker node runs multiple workers; a worker is a process, and each worker contains multiple Tasks, a Task being a thread. The architecture of the Storm cluster is shown in Fig. 2. Nimbus and Supervisor are fail-fast and stateless, so a node can be restarted immediately after it crashes without affecting the operation of the system. Coordination between the master node and the worker nodes is handled by Zookeeper.
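
The node, process, and thread hierarchy described in (1) can be pictured with a small data model. This is only an illustrative sketch in plain Python, not Storm configuration; the class names mirror the terms in the text, while the host names and the counts are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Worker:
    """One worker process; 'tasks' is the number of task threads it runs."""
    tasks: int

@dataclass
class Supervisor:
    """One worker node, i.e. one physical machine that starts workers."""
    host: str
    workers: List[Worker] = field(default_factory=list)

@dataclass
class Cluster:
    """One Nimbus master node plus any number of Supervisor nodes."""
    nimbus_host: str
    supervisors: List[Supervisor] = field(default_factory=list)

    def total_workers(self):
        return sum(len(s.workers) for s in self.supervisors)

    def total_tasks(self):
        return sum(w.tasks for s in self.supervisors for w in s.workers)

# Hypothetical layout: 2 machines, 2 worker processes each, 4 threads per process.
cluster = Cluster(
    nimbus_host="nimbus-host",
    supervisors=[
        Supervisor("node-1", [Worker(tasks=4), Worker(tasks=4)]),
        Supervisor("node-2", [Worker(tasks=4), Worker(tasks=4)]),
    ],
)
```

With this layout the available parallelism is the product of the three levels: 2 machines, 2 processes per machine, and 4 threads per process.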

(2) Spark is a cluster computing system based on in-memory computation whose purpose is to make data analysis faster. Spark uses a cluster computing framework similar to Hadoop's, but it targets a particular type of workload: computations that need to share a working dataset across multiple parallel iterative operations, such as machine learning algorithms. To optimize such computations, Spark introduces in-memory cluster computing, that is, it caches datasets in memory to reduce disk access latency. Spark computations use resilient distributed datasets (RDDs) to improve efficiency. An RDD is a read-only collection of objects distributed across a group of nodes. These collections can be rebuilt when part of the dataset is lost, which gives Spark its fault-tolerance mechanism. Rebuilding part of a dataset relies on maintaining lineage: the process that generated the data is recorded and then used to rebuild the lost part. In Spark, an RDD can be obtained in four ways: from a Scala object created from the HDFS file system; by parallelizing a collection into slices distributed across nodes; by transforming another RDD; or by changing the persistence of an existing RDD, for example by caching it in memory. On certain tasks Spark runs one to two orders of magnitude faster than Hadoop, that is, up to 100 times the speed of Hadoop MapReduce, because while Spark is running the servers can keep intermediate data in RAM instead of loading it from disk repeatedly.
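
The lineage idea described in (2) can be illustrated with a toy class. This is a deliberately simplified sketch in plain Python, not the Spark RDD API: `ToyRDD`, its `lose` method, and the assumption that source partitions always survive are all made up for illustration.

```python
class ToyRDD:
    """Read-only partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: the transformation that produced it
        self._cache = dict(enumerate(partitions)) if partitions is not None else {}

    def map(self, fn):
        """Record a transformation; nothing is computed until it is needed."""
        return ToyRDD(parent=self, fn=fn)

    def partition(self, index):
        """Return one partition, recomputing it from the parent if it is missing."""
        if index not in self._cache:
            if self.parent is None:
                raise KeyError(f"source partition {index} is gone")
            self._cache[index] = [self.fn(x) for x in self.parent.partition(index)]
        return self._cache[index]

    def lose(self, index):
        """Simulate losing a cached partition, e.g. when a node fails."""
        self._cache.pop(index, None)

source = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
before = doubled.partition(1)   # computed on first access
doubled.lose(1)                 # the cached copy disappears
after = doubled.partition(1)    # rebuilt from the recorded lineage
```

Only the lost partition is recomputed, and only from the recorded parent and transformation, which is why no replica of the derived data has to be kept.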

The real-time processing architecture for marketing big data in a smart grid according to the present invention is shown in Fig. 3. The acquired data enter the system as a stream and are processed by stream computing with Storm, which performs rapid diagnosis and assessment of the data. After this processing, the data are stored in HDFS, the distributed file storage system of Hadoop. Data analysis tasks with loose time requirements are completed with MapReduce, which processes the data on disk directly. For tasks with strict real-time requirements, the data are read from HDFS, converted into resilient distributed datasets (RDDs), and computed with the in-memory Spark framework.
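
The routing decision in this architecture can be sketched as a simple rule. The source states only that the real-time requirement is judged from statistical data, so the fixed 60-second threshold below, the function name `choose_engine`, and the two sample tasks are stand-in assumptions rather than the patented criterion.

```python
def choose_engine(max_latency_seconds, threshold_seconds=60.0):
    """Pick the in-memory path for latency-sensitive tasks, the disk path otherwise."""
    return "spark" if max_latency_seconds <= threshold_seconds else "mapreduce"

# Hypothetical analysis tasks with the longest delay each one can tolerate.
tasks = [
    {"name": "load_anomaly_clustering", "max_latency_seconds": 5.0},
    {"name": "monthly_billing_report", "max_latency_seconds": 86400.0},
]
plan = {t["name"]: choose_engine(t["max_latency_seconds"]) for t in tasks}
```

Both paths read the same data from HDFS, so the rule only decides which engine runs the job, not where the data live.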

The stream computing topology of the present invention is shown in Fig. 4. A Spout represents the origin of the marketing data; multiple data sources are supported and are processed separately. A Bolt represents one stage of data processing, such as denoising, feature computation, or result analysis; different feature computation methods and different analysis methods are expressed as different Bolts. The output of one Bolt can serve as the input of another Bolt.
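
The chaining of a Spout and several Bolts can be imitated with Python generators. This sketch does not use the Storm API, and a real bolt would handle tuples one at a time rather than a whole window; the function names, the valid range used for denoising, the chosen features, and the alert limit are all assumptions.

```python
def spout(readings):
    """Source of the stream: emits raw readings one by one."""
    for reading in readings:
        yield reading

def denoise_bolt(stream, low, high):
    """Drop readings that fall outside a plausible range."""
    for reading in stream:
        if low <= reading <= high:
            yield reading

def feature_bolt(stream):
    """Compute feature quantities over the readings of one window."""
    values = list(stream)
    if values:
        yield {"count": len(values), "mean": sum(values) / len(values), "max": max(values)}

def analysis_bolt(stream, limit):
    """Analyze the features: flag windows whose peak exceeds the limit."""
    for features in stream:
        yield {**features, "alert": features["max"] > limit}

readings = [5.0, 6.0, -1.0, 7.0, 999.0]  # hypothetical window of meter readings
topology = analysis_bolt(
    feature_bolt(denoise_bolt(spout(readings), low=0.0, high=100.0)),
    limit=6.5,
)
results = list(topology)
```

Swapping in a different feature computation or analysis method means replacing one function in the chain, which mirrors how different Bolts express different methods.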

This completes one full pass of the data processing module. The output is then written to HBase and the corresponding business logic is set up, i.e. the system's user management and user permission logic are implemented. Finally, a Web interactive interface is provided for staff to run queries.

Taking the characteristics of marketing data into account, the present invention designs, at the system level, a real-time processing architecture that combines Hadoop, Storm, and Spark, and completes the real-time processing of marketing big data in a smart grid with higher efficiency.

It will be apparent to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that it can be embodied in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in all respects as illustrative and not restrictive, the scope of the invention being defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced by the invention. No reference sign in the claims should be construed as limiting the claim concerned.

Finally, the above embodiments are intended only to illustrate, not to limit, the technical scheme of the invention. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the technical scheme may be modified or equivalently substituted without departing from its purpose and scope, and all such modifications shall fall within the scope of the claims of the invention.

Claims

1. A real-time processing method for marketing big data in a smart grid, characterized by comprising the following steps:

1) Multiple servers are grouped into a Kafka cluster that receives the grid marketing data collected by Flume; a partition function selects a partition for the grid marketing data, and the partitioned data are stored on the corresponding servers;

2) Storm performs rapid diagnosis and assessment of the partitioned grid marketing data by stream computing, distinguishing the real-time requirements of the data;

3) The diagnosed and assessed grid marketing data are processed in real time: data with loose time requirements are processed with MapReduce; data with strict real-time requirements are processed with a Spark-based K-Means clustering algorithm to uncover the information hidden in the data;

4) The output of step 3) is written to HBase, and the corresponding business logic is set up.

2. The real-time processing method for marketing big data in a smart grid according to claim 1, characterized in that in step 1) the partition is selected for the grid marketing data using the partition function as follows:

Using a Flume interceptor, the key in the event header of the grid marketing data is read, and the partition is then selected according to that key.

3. The real-time processing method for marketing big data in a smart grid according to claim 1, characterized in that the stream computing with Storm in step 2) comprises: building the topology of the grid marketing data processing procedure, and performing denoising, feature computation, and result analysis on the grid marketing data in sequence.

4. The real-time processing method for marketing big data in a smart grid according to claim 1, characterized in that the data obtained from the processing in step 2) are stored in HDFS, the distributed file storage system of Hadoop.

5. The real-time processing method for marketing big data in a smart grid according to claim 4, characterized in that the Spark-based K-Means clustering algorithm in step 3) comprises the following steps:

3-1) File blocks in HDFS are read into memory, and each block is converted into a resilient distributed dataset (RDD) that contains the set of feature quantities of the monitoring data;

3-2) A map operation is applied to the RDD: the monitoring data are abstracted as feature vectors (Vector), each dimension of a Vector corresponding to a dimension of a Point; the cluster number (Class) of each Vector (Point) is computed, Class being the number of a cluster center; and key-value pairs (K, V) of the form (Class, (Point, 1)) are output, where K is the key, i.e. the number of each dimension of the data, and V is the value, i.e. the actual value of the data; a new RDD is generated in this way;

3-3) Each new RDD is shuffled so that data belonging to the same cluster are stored together, and each cluster center is computed within the RDD;

3-4) The distance between each center and its previous position is evaluated; if it meets the requirement, the algorithm terminates; otherwise it repeats from the second step until the termination condition is met;

3-5) The output is written to HDFS.

6. The real-time processing method for marketing big data in a smart grid according to claim 5, characterized in that each cluster center in step 3-3) is computed within the RDD as follows:

The following procedure is repeated until convergence. For each sample i, the class to which it should belong is computed:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2$$

where $c^{(i)}$ is the class of the i-th sample, $x^{(i)}$ is the feature vector of the i-th sample, and $\mu_j$ is the feature vector of the j-th cluster center. For each class j, the center of that class is recomputed:

$$\mu_j := \frac{1}{m} \sum_{i:\, c^{(i)} = j} x^{(i)}$$

where m is the total number of feature vectors in the j-th class.

7. A real-time processing system for marketing big data in a smart grid using the method of any one of claims 1-6, characterized in that the system comprises:

A data acquisition module, for receiving through a Kafka cluster the grid marketing data collected by Flume and integrating the data sources;

A data processing module, for processing data by stream computing using a Storm cluster and then performing real-time computation using the Spark-based K-Means clustering algorithm;

A display module, for providing a Web interactive interface.

8. The real-time processing system for marketing big data in a smart grid according to claim 7, characterized in that the Storm cluster consists of one master node and multiple worker nodes; the master node is Nimbus, which is responsible for task assignment, code distribution, and cluster monitoring; each worker node is a Supervisor, corresponds to one physical machine, and is used to start processes.

9. The real-time processing system for marketing big data in a smart grid according to claim 8, characterized in that each worker node contains multiple processes, and each process contains multiple threads.