CN107704594A

CN107704594A - Real-time processing method for power system log data based on Spark Streaming

Info

Publication number: CN107704594A
Application number: CN201710951969.0A
Authority: CN
Inventors: 宋爱波; 涂金林
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2018-02-16
Anticipated expiration: 2037-10-13
Also published as: CN107704594B

Abstract

The invention discloses a real-time processing method for power system log data based on Spark Streaming. First, to address the sharp growth of network-wide log data streams and the constantly changing categories and attributes of the log data that the processing system receives, statistical models are predefined, which reduces the preprocessing time of the processing system. Then, by analyzing the relationship between the block interval and the processing time, it is found that dynamically adjusting the block interval can bring the processing time of query tasks to its optimum. Finally, an efficient dynamic adjustment strategy is designed on this basis to find the optimal block interval promptly and reduce the processing time of query tasks, so that the running state and trajectory of the power dispatching automation system can be analyzed and the analysis of power system health can move from qualitative to quantitative. The invention thus provides an efficient, easy-to-use real-time processing method for the effective management of power system log data.

Description

Real-time processing method for power system log data based on Spark Streaming

Technical field

The present invention relates to a real-time processing method for power system log data, and in particular to a real-time processing method for power system log data based on Spark Streaming.

Background

Electric power is a basic industry underpinning the operation and development of modern society, and the safety and stability of the power system affect every aspect of social life. The power dispatching automation system is a data processing system comprising power system operation information, analysis and decision tools, and control devices. While running, it produces status, debugging, error, and other data, collectively referred to as log data. Log data is one representation of power system operation information, and analyzing it quickly and accurately plays an important role in ensuring the safe and stable operation of the power system.

As dispatching automation systems grow in scale, the volume of log data that the power system must process in real time increases sharply. Network-wide real-time log data is large in volume and grows rapidly; the demands of computing, analyzing, simulating, and optimizing it far exceed the capacity of ordinary computing systems, and traditional log management methods cannot meet the management and analysis needs of massive log data. Earlier stream processing systems coped either by discarding part of the input stream (for example, load shedding by category) and processing only data with salient features, or by elastically adding extra resources. In most cases, however, discarding data is not a good option, because the discarded data may well be important and its loss affects the correctness of the results; and for high-throughput real-time streams, acquiring the necessary resources in advance is very costly.

To determine trends and patterns in system operation and to locate faults, the running state and trajectory of the power dispatching automation system must be analyzed online in real time. Because of the limits of disk performance, log data that is not processed in time can be lost, so the high throughput of memory is needed. At the same time, as system resources and state keep changing, the processing system must be able to adjust promptly so that its processing time stays optimal.

In response to these problems, researchers have turned to using memory resources to break through the I/O bottleneck, raise data throughput, and speed up data processing. Apache Spark is an open-source computing framework that stands out in this respect. Spark's in-memory iterative computing framework can operate on a given data set many times in memory, enabling fast analysis of big data. Spark Streaming, a higher-level tool built on Spark, provides interval-based real-time processing. The time span over which the data stream is cut into one data block is called the block interval, and the time span over which data blocks are combined into one batch is called the batch interval. This approach meets the power dispatching automation system's requirement to process the data of a given period in real time.

In general, the lower the degree of parallelism with which Spark Streaming processes data (the number of data blocks in one batch = batch interval / block interval), the lower both the resource overhead (for example, task creation and interaction) and the resource utilization. Large-scale parallel computation incurs substantial resource overhead but also achieves high resource utilization. To understand the running state and trajectory of the power dispatching automation system in time, and to move the analysis of power system health from qualitative to quantitative, query tasks must achieve both low resource overhead and high resource utilization. To balance resource overhead against utilization, the degree of parallelism must be adjusted promptly as system state and resources change.

In recent years, the need to process real-time data streams has driven the development of distributed real-time computing frameworks. For example, the paper "High-Throughput Robust Architecture for Log Analysis and Data Stream Mining" uses Apache Storm as the real-time computing framework to receive real-time data and then analyze it. Spark Streaming, as a higher-level extension of Spark, differs from Storm in that it does not process the stream one record at a time; instead, it divides the stream by time interval into batch jobs covering successive periods. Storm is an event-level real-time computing framework, whereas the power dispatching automation system mostly needs stateful batch computation and analysis over the data stream of a given period. Moreover, Storm processes each record at least once: when a node recovers from a failure, records may be recomputed, which does not satisfy the safety and reliability requirements of the power dispatching automation system.

By dynamically adjusting the batch interval or the data block size, a system can indeed be kept stable without prior knowledge of the stream characteristics and the running environment. These approaches, however, focus mainly on read/write throughput and resource utilization. For complex computations, such dynamic adjustment fails to select a better batch interval or data block size, so processing time grows longer and longer, and the dispatching automation system's need for fast processing is ignored entirely.

Summary of the invention

Object of the invention: In view of the above problems, the present invention proposes a real-time processing method for power system log data based on Spark Streaming.

Technical solution: To achieve the object of the invention, the technical solution adopted is a real-time processing method for power system log data based on Spark Streaming, comprising the following steps:

(1) define statistical models for the different log categories;

(2) build a relational model between the Spark Streaming block interval and the data stream processing time;

(3) dynamically adjust the block interval to find the optimal block interval.

Further, in step (1), the statistical model comprises the following elements: data set, result set, grouping condition, group filter, and rule action.

Further, in step (2), the time span over which the data stream is divided into one data block is the block interval, and the time span over which data blocks are combined into one batch is the batch interval.

The relational model is built as follows:

(1) the batching module divides the received data stream into independent data blocks according to the block interval;

(2) the data blocks within one batch interval are packed into a batch, which enters the batch queue and waits to be processed;

(3) the data of all block intervals within one batch interval is processed in parallel.

Further, in step (3), for a given batch interval, a greedy algorithm dynamically adjusts the block interval to find the optimal block interval.

The steps of the greedy algorithm are:

(1) the initial block interval is β and the adjustment step is i;

(2) if the batch processing time at block interval β is less than that at block interval β+i, the optimal block interval lies to the left of the initial block interval; if the batch processing time at block interval β is less than that at block interval β-i, the optimal block interval lies to the right of the initial block interval;

(3) once the direction of the optimal block interval has been determined, exploration continues in that direction until the processing time can no longer be reduced.

Beneficial effects: The method takes the characteristics of power system log data into account. As system resources and state keep changing, the processing system can adjust dynamically and promptly to changes in the data stream without redefining statistical functions and models, thereby achieving higher resource utilization and shorter processing time.

Brief description of the drawings

Fig. 1 is a schematic diagram of the block interval;

Fig. 2 shows curves of the effect of the block interval on processing time.

Detailed description

The technical solution of the invention is further described below with reference to the drawings and embodiments.

Addressing the shortcomings of existing real-time computing frameworks in processing log data streams, and considering the relationship between the block interval and the data stream processing time, the invention proposes a real-time processing method for power system log data based on Spark Streaming. It aims to let the Spark Streaming block interval adjust dynamically as system resources and state change, speeding up the processing of real-time data streams, so that the running state and trajectory of the power dispatching automation system can be analyzed and the analysis of power system health can move from qualitative to quantitative.

First, to address the sharp growth of network-wide log data streams and the constantly changing categories and attributes of the log data received, the invention defines statistical models for the different log categories in advance, reducing the preprocessing time of the processing system. Next, by analyzing the relationship between the block interval and the processing time, it finds that dynamically adjusting the block interval can effectively reduce the system's processing time. Finally, on the basis of this analysis, it designs a dynamic adjustment strategy based on a greedy algorithm that finds the optimal block interval promptly, speeds up the processing of the log data stream, and reduces the processing time of query tasks.

The real-time processing method for power system log data based on Spark Streaming comprises the following steps:

Step 1: Define statistical models for the different log categories, and perform fast real-time analysis according to these models.

Because the categories and attributes of the log data received by the processing system change constantly, a statistical model is defined in advance for the fields used when analyzing each log category, which reduces the preprocessing time of the processing system.

A statistical model describes the set of elements needed in one real-time analysis. Following the format of the SELECT statement in Structured Query Language (SQL), a statistical model comprises the following elements:

(1) Data set: equivalent to the FROM and WHERE clauses. The data set specifies the subscribed log category, the statistics time window, and so on. If log data of a given category needs further filtering, logical expressions over layout elements are supported.

(2) Result set: equivalent to the SELECT clause. The result set specifies the result fields that the analysis ultimately produces, mainly layout elements and statistical fields. Statistical fields support several statistical functions: COUNT, SUM, MAX, MIN, TOP(N), ASSERT.

(3) Grouping condition: equivalent to the GROUP BY clause. The grouping condition may contain only fields defined in the result set.

(4) Group filter: may contain only statistical fields from the result set. The operators supported for numeric elements are =, >, >=, <, <=, !=; the operators supported for string elements are EQUAL, CONTAIN, BEGINWITH, ENDWITH.

(5) Rule action: the action matched against the content of the result set, either storage or alarm. Storage writes the computed result to an external system; alarm sets a threshold on the result of a statistical operation and sends alarm information when the result exceeds the threshold.
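The five elements listed above can be pictured as a small data structure. The following Python sketch is only an illustration: the class name, field names, the tuple form of the group filter, and the example values are all assumptions introduced here, not definitions from the method. It checks the two constraints stated in the list, namely that grouping may use only result-set fields and that the group filter may use only statistical fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Statistical functions and rule actions named in the text.
STAT_FUNCTIONS = {"COUNT", "SUM", "MAX", "MIN", "TOP", "ASSERT"}
RULE_ACTIONS = {"store", "alarm"}


@dataclass
class StatisticalModel:  # hypothetical name and layout
    # (1) Data set: subscribed log category, time window, optional filter.
    log_category: str
    window_seconds: int
    where: Optional[str] = None
    # (2) Result set: layout elements plus statistical fields (alias -> function).
    layout_fields: List[str] = field(default_factory=list)
    stat_fields: Dict[str, str] = field(default_factory=dict)
    # (3) Grouping condition.
    group_by: List[str] = field(default_factory=list)
    # (4) Group filter as (statistical field alias, operator, value).
    group_filter: Optional[Tuple[str, str, float]] = None
    # (5) Rule actions.
    actions: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if a constraint from the element list is violated."""
        result_fields = set(self.layout_fields) | set(self.stat_fields)
        for alias, func in self.stat_fields.items():
            if func not in STAT_FUNCTIONS:
                raise ValueError(f"unsupported statistical function for {alias}: {func}")
        for name in self.group_by:
            if name not in result_fields:
                raise ValueError(f"grouping field not in result set: {name}")
        if self.group_filter is not None and self.group_filter[0] not in self.stat_fields:
            raise ValueError("group filter must refer to a statistical field")
        for action in self.actions:
            if action not in RULE_ACTIONS:
                raise ValueError(f"unknown rule action: {action}")


# Illustrative model: count error logs per host each minute, alarm above 100.
model = StatisticalModel(
    log_category="error",
    window_seconds=60,
    layout_fields=["host"],
    stat_fields={"error_count": "COUNT"},
    group_by=["host"],
    group_filter=("error_count", ">", 100),
    actions=["alarm"],
)
model.validate()  # passes silently; a violated constraint raises ValueError
```

Defining such a model once per log category is what lets the processing system skip per-query preprocessing: incoming records only need to be matched against a model that has already been checked.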

Analysis targets and example statistical models are shown in Table 1:

Table 1

Step 2: Build the relational model between the Spark Streaming block interval and the data stream processing time.

Analyze the relationship between the Spark Streaming block interval and the data stream processing time, and find the block interval that minimizes the data stream processing time.

As shown in Fig. 1, the batching module in the figure is the Spark Streaming batching module, whose role is to divide the received data stream into batches, each of which is then processed separately. To form a batch, the batching module needs two important parameters: the block interval and the batch interval. The time span over which the data stream is divided into one data block is called the block interval, and the time span over which data blocks are combined into one batch is called the batch interval.

Accordingly, the batching module first divides the received data stream into independent data blocks according to the block interval (block interval < batch interval). After one batch interval has elapsed, all data blocks from that period are packed into a batch, which finally enters the batch queue to wait for processing.
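The batching behavior just described can be imitated on a list of timestamped records. This is a simplified sketch and not Spark Streaming's actual receiver code: the function name, the `(timestamp_ms, payload)` record shape, and the sample values are assumptions, and the batch interval is assumed to be a whole multiple of the block interval so that no block straddles two batches.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_into_batches(
    records: List[Tuple[int, str]],
    block_interval_ms: int,
    batch_interval_ms: int,
) -> Dict[int, Dict[int, List[str]]]:
    """Group (timestamp_ms, payload) records into batches of data blocks.

    Returns {batch_index: {block_index: [payload, ...]}}.
    """
    if not 0 < block_interval_ms < batch_interval_ms:
        raise ValueError("block interval must be positive and smaller than the batch interval")
    batches: Dict[int, Dict[int, List[str]]] = defaultdict(lambda: defaultdict(list))
    for timestamp_ms, payload in records:
        block_index = timestamp_ms // block_interval_ms   # which data block
        batch_index = timestamp_ms // batch_interval_ms   # which batch
        batches[batch_index][block_index].append(payload)
    # Batches are returned in time order, as they would enter the batch queue.
    return {b: dict(blocks) for b, blocks in sorted(batches.items())}


records = [(0, "a"), (150, "b"), (250, "c"), (999, "d"), (1000, "e")]
print(split_into_batches(records, block_interval_ms=200, batch_interval_ms=1000))
# {0: {0: ['a', 'b'], 1: ['c'], 4: ['d']}, 1: {5: ['e']}}
```

Each inner list corresponds to one data block, so the blocks of a batch are the units that can be processed in parallel.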

It follows that the execution parallelism of a batch is determined by batch interval / block interval, which is the number of data blocks in a batch. With the same resource allocation, the lower the degree of parallelism, the lower both the resource overhead (for example, task creation and interaction) and the resource utilization; large-scale parallel computation incurs substantial resource overhead but also achieves high resource utilization. To balance resource overhead against utilization, the degree of parallelism must be adjusted promptly as system state and resources change. Understanding the running state and trajectory of the power dispatching automation system, and moving the analysis of power system health from qualitative to quantitative, requires the batch interval to remain relatively constant. The execution parallelism of the processing system is therefore governed mainly by the block interval.
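As a rough illustration of this ratio, the following Python sketch computes the number of data blocks per batch. The function name, the millisecond units, and the sample block intervals are assumptions made for this example.

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Number of data blocks in one batch (batch interval / block interval)."""
    if not 0 < block_interval_ms < batch_interval_ms:
        raise ValueError("block interval must be positive and smaller than the batch interval")
    # Ceiling division: a partial trailing block still counts as one block.
    return -(-batch_interval_ms // block_interval_ms)


print(blocks_per_batch(3000, 200))  # 15
print(blocks_per_batch(3000, 500))  # 6
```

With the batch interval held fixed, shrinking the block interval is the only way to raise the block count, which is why the adjustment in this method targets the block interval.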

From the above analysis, the block interval determines the execution parallelism of the processing system and therefore also affects its processing performance. Fig. 2 shows the effect of the block interval on processing time at data receiving rates of 2 MB/s and 4 MB/s, with the batch interval of the Reduce workflow fixed at 3 seconds and that of the Join workflow fixed at 1 second. For each receiving rate the curve is approximately parabolic, so the optimal block interval that minimizes processing time is the vertex of the parabola. In practice, because of changes in the operating environment and noise, the relationship between block interval and processing time is not a true parabola. There is no doubt, however, that the optimal block interval changes with the data receiving rate: the faster the rate, the more data falls in each block interval; the slower the rate, the less data; and the amount of data directly affects the processing time of the processing system.
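If the curve in Fig. 2 is idealized as an exact parabola, the optimal block interval has a closed form. The symbols $T$, $\beta^{*}$ and the coefficients $a$, $b$, $c$ below are introduced only for this illustration and do not appear in the original description.

```latex
T(\beta) \approx a\beta^{2} + b\beta + c, \qquad a > 0

\frac{dT}{d\beta} = 2a\beta + b = 0
\;\Longrightarrow\;
\beta^{*} = -\frac{b}{2a}, \qquad
T(\beta^{*}) = c - \frac{b^{2}}{4a}
```

Since the measured curve is only approximately parabolic and its shape shifts with the data receiving rate, the coefficients are not known in advance, which is consistent with locating the minimum by search over measured processing times instead of by formula.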

Based on the above observations, for a given batch interval the processing time of query tasks can be brought to its optimum by adjusting the size of the block interval.

Step 3: When analyzing the log data stream in real time, dynamically adjust the Spark Streaming block interval according to the relational model of Step 2 to reduce the processing time of query tasks.

According to the condition under which the data stream processing time reaches its minimum, a greedy method is used to find the optimal block interval promptly; the block interval is then adjusted dynamically as the resources and state of the processing system change, reducing the processing time of query tasks.

The optimization goal of the invention is that each time the processing system finishes a batch, the block interval for receiving the next batch of data is determined. As Fig. 2 shows, if the initial block interval is chosen too small or too large, finding the optimal block interval takes a long time. A compromise is to choose half of the batch interval as the initial block interval, which avoids frequent exploration, and then to increase or decrease the block interval step by step until the processing time can no longer be reduced.

Table 2 gives the algorithm for computing the next block interval. The initial block interval is β and the adjustment step is i; during the computation, β denotes the next block interval. P₁ and P₂ denote the processing times of the previous two batches.

The dynamic adjustment strategy based on the greedy algorithm is shown in Table 2:

Table 2

The computation consists of two parts. If the batch processing time at block interval β is less than that at block interval β+i, the optimal block interval lies to the left of the initial block interval; if the batch processing time at block interval β is less than that at block interval β-i, the optimal block interval lies to the right of the initial block interval. Once the direction of the optimal block interval has been determined, exploration continues in that direction until the processing time can no longer be reduced.
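The two-part procedure above can be sketched in Python as follows. Table 2 is not reproduced in this text, so the sketch follows only the prose description and may differ in detail from the tabulated algorithm. The function name, the `measure` callback (standing in for "process one batch at this block interval and report its processing time"), the `max_rounds` guard, and the synthetic cost curve in the usage example are all assumptions.

```python
from typing import Callable


def greedy_block_interval(
    measure: Callable[[int], float],
    batch_interval: int,
    initial: int,
    step: int,
    max_rounds: int = 1000,
) -> int:
    """Search for the block interval with the lowest measured processing time."""
    beta = initial
    best = measure(beta)

    # Part 1: decide the direction by probing beta + i and beta - i.
    direction = 0
    for d in (+1, -1):
        candidate = beta + d * step
        if 0 < candidate < batch_interval:
            t = measure(candidate)
            if t < best:
                beta, best, direction = candidate, t, d
                break
    if direction == 0:
        return beta  # neither neighbor is faster: keep the current interval

    # Part 2: keep moving in that direction until time stops decreasing.
    for _ in range(max_rounds):
        candidate = beta + direction * step
        if not 0 < candidate < batch_interval:
            break  # the block interval must stay inside (0, batch interval)
        t = measure(candidate)
        if t >= best:
            break
        beta, best = candidate, t
    return beta


# Synthetic, purely illustrative cost curve with its minimum at 300 ms.
def synthetic_cost(block_interval_ms: int) -> float:
    return (block_interval_ms - 300) ** 2 + 50.0


print(greedy_block_interval(synthetic_cost, batch_interval=1000, initial=500, step=50))
# 300
```

In a running system each call to `measure` would correspond to one real batch, so the search costs one batch per probe; a larger step converges in fewer batches but can overshoot a narrow minimum.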

If the data receiving rate and the running environment remain unchanged, the optimal block interval stays stable. When the running environment changes, the optimal block interval changes too, and the algorithm must adjust promptly to adapt to the new environment. Restarting from scratch would lengthen the convergence time, so the invention uses the block interval from before the environment change as the initial block interval when restarting the greedy adjustment.

Claims

1. A real-time processing method for power system log data based on Spark Streaming, characterized by comprising the following steps:

(1) define statistical models for the different log categories;

(2) build a relational model between the Spark Streaming block interval and the data stream processing time;

(3) dynamically adjust the block interval to find the optimal block interval.
2. The real-time processing method for power system log data based on Spark Streaming according to claim 1, characterized in that: in step (1), the statistical model comprises the following elements: data set, result set, grouping condition, group filter, and rule action.
3. The real-time processing method for power system log data based on Spark Streaming according to claim 2, characterized in that: in step (2), the time span over which the data stream is divided into one data block is the block interval, and the time span over which data blocks are combined into one batch is the batch interval.
4. The real-time processing method for power system log data based on Spark Streaming according to claim 3, characterized in that the relational model in step (2) is built as follows:

(1) the batching module divides the received data stream into independent data blocks according to the block interval;

(2) the data blocks within one batch interval are packed into a batch, which enters the batch queue and waits to be processed;

(3) the data of all block intervals within one batch interval is processed in parallel.
5. The real-time processing method for power system log data based on Spark Streaming according to claim 4, characterized in that: in step (3), for a given batch interval, a greedy algorithm dynamically adjusts the block interval to find the optimal block interval.
6. The real-time processing method for power system log data based on Spark Streaming according to claim 5, characterized in that the steps of the greedy algorithm are:

(1) the initial block interval is β and the adjustment step is i;

(2) if the batch processing time at block interval β is less than that at block interval β+i, the optimal block interval lies to the left of the initial block interval; if the batch processing time at block interval β is less than that at block interval β-i, the optimal block interval lies to the right of the initial block interval;

(3) once the direction of the optimal block interval has been determined, exploration continues in that direction until the processing time can no longer be reduced.