CN107704594B

CN107704594B - Real-time processing method for log data of power system based on Spark Streaming

Info

Publication number: CN107704594B
Application number: CN201710951969.0A
Authority: CN
Inventors: 宋爱波; 涂金林
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2021-02-09
Anticipated expiration: 2037-10-13
Also published as: CN107704594A

Abstract

The invention discloses a Spark Streaming based method for processing power system log data in real time. First, because the log data stream of the whole network is growing sharply and the types and attributes of the log data acquired by the processing system vary, statistical models are predefined to reduce the preprocessing time of the processing system. Next, analysis of the relation between the block interval and the processing time shows that dynamically adjusting the block interval can optimize the processing time of a query task. Finally, an efficient dynamic adjustment strategy is designed on this basis to find the optimal block interval promptly and reduce the processing time of query tasks, so that the running state and trajectory of the power dispatching automation system can be analyzed and the health of the power system can be assessed quantitatively rather than only qualitatively. The invention provides an efficient and easy-to-use real-time processing method for the effective management of power system log data.

Description

Real-time processing method for log data of power system based on Spark Streaming

Technical Field

The invention relates to a real-time processing method of log data of a power system, in particular to a real-time processing method of log data of a power system based on Spark Streaming.

Background

Electric power is a fundamental industry for the operation and development of modern society, and the safety and stability of the electric power system affect every aspect of social life. The power dispatching automation system is a data processing system comprising power system operation information, analysis and decision tools, and control means. During operation, the power dispatching automation system generates state, debugging, error, and other data, collectively called log data. Log data is one form in which the operation information of the power system is expressed, and analyzing it quickly and accurately is an important safeguard for the safe and stable operation of the power system.

As dispatching automation systems continue to grow in scale, the volume of log data that the power system must process in real time is increasing sharply. The real-time log data of the whole network is large and grows rapidly, and the demands it places on computation, analysis, simulation, and optimization far exceed the capacity of an ordinary computing system, so traditional log management cannot meet the management and analysis requirements of massive log data. Earlier stream processing systems either processed only part of the data by dropping a portion of the incoming stream (e.g., hierarchical offload) or flexibly added resources. However, discarding data is generally a poor choice, because the discarded data may well be important, which compromises the correctness of the result; and for a high-throughput real-time data stream, acquiring the necessary resources in advance is very costly.

To determine the running trend and patterns of the system, locate faults, and analyze the running state and trajectory of the power dispatching automation system, online real-time analysis is needed. Limited by disk performance, log data cannot be processed in time and data is lost, so the fast processing capability of memory is required. Meanwhile, as system resources and states keep changing, the processing system must be able to adjust promptly so that its processing time stays optimal.

To address these problems, researchers have focused on using memory resources to break through the I/O bottleneck, improve data throughput, and increase data processing speed. Apache Spark is an open-source computing framework that stands out among these efforts. Spark's in-memory iterative computation framework can operate on a given data set in memory multiple times, enabling rapid analysis and processing of big data. Spark Streaming, a higher-level component built on Spark, provides interval-based real-time processing. The time span over which the data stream is cut into one data block is called the block interval, and the time span over which several data blocks are combined into one batch is called the batch interval. This approach is well suited to the power dispatching automation system's requirement for real-time processing of data within a given time period.

Generally, the lower the parallelism with which Spark Streaming processes data (the number of data blocks in a batch, equal to batch interval/block interval), the smaller the resource overhead, such as task creation and interaction, and the lower the resource utilization. Large-scale parallel computation incurs a large resource overhead together with very high resource utilization. To follow the running state and trajectory of the power dispatching automation system promptly and to assess the health of the power system quantitatively rather than only qualitatively, query tasks must achieve low resource overhead and high resource utilization. To balance resource overhead against utilization, the processing parallelism must be adjusted promptly as system states and resources change.

In recent years, the need to process real-time data streams has driven the development of distributed real-time computing frameworks. For example, the document "High-through debug Architecture for Log Analysis and Data Stream Mining" uses Apache Storm as the real-time computing framework, receiving real-time data and then analyzing it. Spark Streaming, as a higher-level component of Spark, differs from Storm in that it does not process the data stream record by record; instead, it divides the data stream into time slices at fixed intervals and processes each slice as a batch job. Storm is an event-level real-time computing framework, whereas the power dispatching automation system mostly requires stateful batch analysis of the data stream within a given time period. Moreover, Storm processes each record at least once, so a record is recomputed when a node recovers from a failure, which does not meet the safety and reliability requirements of the power dispatching automation system.

Existing methods that dynamically adjust the batch interval or the data block size can keep a system running stably without prior knowledge of the state of the data stream or the running environment. However, these methods focus mainly on read-write throughput and resource utilization. For complex computations, such dynamic adjustment cannot select a good batch interval or data block size, so the processing time grows longer and longer, and the dispatching automation system's requirement for rapid processing is ignored entirely.

Disclosure of Invention

Purpose of the invention: in view of the above problems, the invention provides a real-time processing method for power system log data based on Spark Streaming.

Technical solution: to achieve the purpose of the invention, the following technical solution is adopted: a real-time processing method for power system log data based on Spark Streaming, comprising the following steps:

(1) defining statistical models of different log categories;

(2) constructing a relation model of Spark Streaming block intervals and data stream processing time;

(3) dynamically adjusting the block interval and searching for the optimal block interval.

Further, in step (1), the statistical model includes the elements: data sets, result sets, grouping conditions, grouping filters, and rule actions.

Further, in step (2), the time span over which the data stream is divided into one data block is the block interval, and the time span over which several data blocks are combined into one batch is the batch interval.

The relation model is constructed as follows:

(1) the batching module divides the received data stream into independent data blocks according to the block interval;

(2) the data blocks within one batch interval are wrapped into a batch, which enters the batch queue to wait for processing;

(3) the data of all block intervals within one batch interval are processed in parallel.

Further, in step (3), given a batch interval, a greedy algorithm is used to dynamically adjust the block interval and search for the optimal block interval.

The greedy algorithm comprises the following steps:

(1) the initial block interval is denoted β, and the adjustment step size is i;

(2) if the batch processing time at block interval β is less than that at block interval β + i, the optimal block interval lies to the left of the initial block interval; if the batch processing time at block interval β is less than that at block interval β - i, the optimal block interval lies to the right of the initial block interval;

(3) once the direction of the optimal block interval has been determined, exploration continues in that direction in a loop until the processing time can no longer be reduced.

Beneficial effects: the method takes full account of the characteristics of power system log data. As system resources and states keep changing, the processing system does not need to redefine statistical functions or statistical models when the data stream changes, and it can adjust dynamically and promptly, achieving higher resource utilization and shorter processing time.

Drawings

FIG. 1 is a schematic diagram of the block interval;

FIG. 2 is a graph of the effect of the block interval on processing time.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Addressing the shortcomings of existing real-time computing frameworks in processing log data streams, and taking into account the relation between the block interval and data stream processing time, the invention provides a real-time processing method for power system log data based on Spark Streaming. The method ensures that the Spark Streaming block interval can be adjusted dynamically as system resources and states change, speeds up the processing of real-time data streams, supports analysis of the running state and trajectory of the power dispatching automation system, and enables the health of the power system to be assessed quantitatively rather than only qualitatively.

First, because the log data stream of the whole network is growing rapidly and the types and attributes of the log data acquired by the processing system vary, the invention defines statistical models for the different log types in advance to reduce the preprocessing time of the processing system. Next, analysis of the relation between the block interval and the processing time shows that dynamically adjusting the block interval can effectively reduce the processing time. Finally, based on this analysis, a dynamic adjustment strategy based on a greedy algorithm is designed that finds the optimal block interval promptly, speeds up processing of the log data stream, and shortens the processing time of query tasks.

The method for processing the log data of the power system in real time based on Spark Streaming comprises the following steps:

Step 1: defining statistical models for the different log categories, and performing rapid real-time analysis according to the statistical models.

Because the types and attributes of the log data acquired by the processing system keep changing, a statistical model is defined in advance for each field involved in processing and analyzing the different log types, which shortens the preprocessing time of the processing system.

The statistical model describes the set of elements required in a real-time analysis process. Following the format of the SELECT statement in Structured Query Language (SQL), a statistical model needs to contain the following elements:

(1) Data set: corresponds to the FROM and WHERE clauses. The data set must specify the subscribed log categories, the statistical time window, and so on; if log data of a given category needs further screening, logical expressions over the layout elements are supported.

(2) Result set: corresponds to the SELECT clause. The result set must specify the result fields that the current analysis will ultimately produce, mainly layout elements and statistics fields. Statistics fields support several statistical functions: COUNT, SUM, MAX, MIN, TOP(N), ASSERT.

(3) Grouping condition: corresponds to the GROUP BY clause. The grouping condition can only contain fields defined in the result set.

(4) Grouping filter: a grouping filter can only contain statistics fields in the result set. Numeric elements support comparison operators such as =, >, <, and !=; character-type elements support operators such as EQUAL, CONTAIN, BEGINWITH, and ENDWITH.

(5) Rule action: the action triggered when the content of the result set matches a rule, namely warehousing or alarming. Warehousing means storing the calculation result in an external system; alarming means setting a threshold for the result of a statistical operation and sending alarm information when the result exceeds the threshold.
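The five elements above can be pictured as the fields of a small record type. The following Python sketch is only an illustration under stated assumptions: the class `StatisticalModel`, the function `evaluate`, the record fields (`category`, `host`, `latency_ms`) and the sample data are all hypothetical and do not come from the patent, and only four of the listed statistical functions are covered.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical; the patent does not define a concrete API.
@dataclass
class StatisticalModel:
    category: str                            # data set: subscribed log category (FROM)
    where: Callable[[dict], bool]            # data set: record-level screening (WHERE)
    group_by: Tuple[str, ...]                # grouping condition (GROUP BY)
    statistic: str                           # result set: "COUNT", "SUM", "MAX" or "MIN"
    field: str                               # field the statistic is computed over
    group_filter: Callable[[float], bool]    # grouping filter applied to the statistic
    action: Callable[[tuple, float], None]   # rule action: warehouse or alarm

_AGGREGATES = {"COUNT": len, "SUM": sum, "MAX": max, "MIN": min}

def evaluate(model: StatisticalModel, records: List[dict]) -> Dict[tuple, float]:
    """Apply one predefined model to the records of one statistical time window."""
    groups = defaultdict(list)
    for record in records:
        if record["category"] == model.category and model.where(record):
            key = tuple(record[name] for name in model.group_by)
            groups[key].append(record.get(model.field, 0))
    results = {}
    for key, values in groups.items():
        value = _AGGREGATES[model.statistic](values)
        if model.group_filter(value):        # keep only groups that pass the filter
            results[key] = value
            model.action(key, value)         # e.g. store the result or raise an alarm
    return results

# Example: alarm on any host that logged more than one error in the window.
alarms = []
model = StatisticalModel(
    category="error",
    where=lambda r: r["host"].startswith("scada"),
    group_by=("host",),
    statistic="COUNT",
    field="latency_ms",
    group_filter=lambda count: count > 1,
    action=lambda key, value: alarms.append((key, value)),
)
records = [
    {"category": "error", "host": "scada-1", "latency_ms": 12},
    {"category": "error", "host": "scada-1", "latency_ms": 30},
    {"category": "error", "host": "scada-2", "latency_ms": 7},
    {"category": "state", "host": "scada-1", "latency_ms": 5},
]
result = evaluate(model, records)
```

Because the model is defined once per log category, a change in the incoming data only changes which records match it; nothing has to be redefined at run time, which is the preprocessing saving the text describes.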

The analysis targets and statistical models are shown in Table 1:

TABLE 1

Step 2: constructing a relation model between the Spark Streaming block interval and data stream processing time.

The relation between the Spark Streaming block interval and data stream processing time is analyzed to find the block interval that minimizes the data stream processing time.

As shown in FIG. 1, the batching module is the batching module of Spark Streaming; it divides the received data stream into batches and then processes each batch separately. To form a batch, the batching module requires two important parameters: the block interval and the batch interval. The block interval is the time span over which the data stream is cut into one data block, and the batch interval is the time span over which several data blocks are combined into one batch.

Thus, the batching module divides the received data stream into independent data blocks according to the block interval (block interval < batch interval); after each batch interval elapses, all data blocks received within it are wrapped into a batch, which then enters the batch queue to wait for processing.
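The batching behavior described above can be mimicked in a few lines of plain Python. This is a minimal sketch, not Spark Streaming's actual receiver code: the function `batch_stream` and the millisecond timestamps are assumptions made for illustration, and blocks that receive no data are simply omitted.

```python
from typing import Any, Dict, Iterable, List, Tuple

def batch_stream(records: Iterable[Tuple[int, Any]],
                 block_interval_ms: int,
                 batch_interval_ms: int) -> List[List[List[Any]]]:
    """Group (timestamp_ms, payload) records into batches, each a list of blocks."""
    if not 0 < block_interval_ms < batch_interval_ms:
        raise ValueError("block interval must be positive and smaller than the batch interval")
    batches: Dict[int, Dict[int, List[Any]]] = {}
    for timestamp_ms, payload in records:
        batch_id = timestamp_ms // batch_interval_ms                        # which batch
        block_id = (timestamp_ms % batch_interval_ms) // block_interval_ms  # which block in it
        batches.setdefault(batch_id, {}).setdefault(block_id, []).append(payload)
    # Emit batches in arrival order, each with its blocks in arrival order.
    return [[batches[b][k] for k in sorted(batches[b])] for b in sorted(batches)]

# One record every 100 ms for 2 s, batch interval 1 s, block interval 250 ms.
stream = [(t, t) for t in range(0, 2000, 100)]
batches = batch_stream(stream, block_interval_ms=250, batch_interval_ms=1000)
blocks_per_batch = [len(batch) for batch in batches]
```

With these values each batch holds 1000/250 = 4 blocks, so four tasks could process the batch in parallel; halving the block interval would double that number.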

It can be seen that the execution parallelism of a batch is determined by batch interval/block interval, which is the number of data blocks in the batch. Under equal resource allocation, lower processing parallelism means smaller resource overhead (such as task creation and interaction) and lower resource utilization, whereas large-scale parallel computation incurs a large resource overhead together with very high resource utilization. To balance overhead against utilization, the processing parallelism must be adjusted promptly as system states and resources change. To follow the running state and trajectory of the power dispatching automation system and to assess the health of the power system quantitatively rather than only qualitatively, the batch interval needs to be kept relatively constant. Thus, the execution parallelism of the processing system is determined mainly by the block interval.

According to the above analysis, the block interval determines the execution parallelism of the processing system and therefore affects its processing performance. FIG. 2 shows the effect of the block interval on processing time at data stream receive rates of 2 MB/s and 4 MB/s, with the batch interval held constant at 3 seconds for the Reduce workflow and at 1 second for the Join workflow. For each receive rate, the resulting curve approximates a parabola, and the optimal block interval, which minimizes processing time, lies at the vertex of the parabola. In practice, the relation between block interval and processing time is not a true parabola because of changes in the operating environment, noise, and similar factors. There is no doubt, however, that the optimal block interval changes with the data receive rate: the faster the receive rate, the more data each block interval contains, and the slower the rate, the less; and the amount of data directly affects the processing time of the processing system.

Based on the above observations, for a given batch interval, the processing time of the query task can be optimized by adjusting the size of the block interval.
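To make the U-shaped trade-off concrete, the following Python sketch uses a deliberately simple toy cost model. It is an assumption made purely for illustration and is not taken from the patent or measured on Spark: per-task overhead grows with the number of blocks, per-block work grows with the amount of data in a block, and the constants `task_overhead_ms` and `ms_per_mb` are invented.

```python
from typing import Iterable

def batch_processing_time(block_ms: float, batch_ms: float, rate_mb_s: float,
                          task_overhead_ms: float = 5.0,
                          ms_per_mb: float = 400.0) -> float:
    """Toy model: task overhead for every block plus the work of one block."""
    n_blocks = batch_ms / block_ms                  # parallelism = batch / block interval
    overhead = task_overhead_ms * n_blocks          # more blocks, more task overhead
    block_work = ms_per_mb * rate_mb_s * block_ms / 1000.0  # bigger blocks, slower tasks
    return overhead + block_work

def best_block_interval(batch_ms: float, rate_mb_s: float,
                        candidates: Iterable[int]) -> int:
    """Exhaustively pick the candidate block interval with the lowest modeled time."""
    return min(candidates, key=lambda b: batch_processing_time(b, batch_ms, rate_mb_s))

candidates = range(10, 501, 10)                     # block intervals from 10 ms to 500 ms
best_at_2mb = best_block_interval(1000, 2.0, candidates)
best_at_4mb = best_block_interval(1000, 4.0, candidates)
```

In this toy model the minimum sits where the two terms balance, and it shifts when the receive rate changes, which is the qualitative behavior the text attributes to FIG. 2. An exhaustive scan like this is not practical on a live stream, because every candidate costs a full batch.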

Step 3: when the log data stream is analyzed in real time, the processing time of query tasks is reduced by dynamically adjusting the Spark Streaming block interval according to the relation model of Step 2.

Based on the condition under which the block interval minimizes the data stream processing time, the optimal block interval is searched for promptly with a greedy method, and the block interval is adjusted dynamically as the resources and state of the processing system change, thereby reducing the processing time of query tasks.

The optimization objective of the invention is to ensure that, for each batch the processing system handles, the block interval used to receive the next batch has already been determined. As can be seen in FIG. 2, if the selected initial block interval is too small or too large, finding the optimal block interval takes a long time. As a compromise that avoids frequent exploration, batch interval/2 is selected as the initial block interval, and the block interval is then gradually increased or decreased until the processing time can no longer be reduced.

Table 2 gives the algorithm for calculating the next block interval. The initial block interval is denoted β and the adjustment step size is i; during the calculation, β denotes the next block interval. P₁ and P₂ denote the processing times of the previous two batches.

The dynamic adjustment strategy based on the greedy algorithm is shown in Table 2:

TABLE 2

The calculation has two main parts: determining the search direction and exploring along it. If the batch processing time at block interval β is less than that at block interval β + i, the optimal block interval lies to the left of the initial block interval; if the batch processing time at block interval β is less than that at block interval β - i, the optimal block interval lies to the right of the initial block interval. Once the direction of the optimal block interval has been determined, exploration continues in that direction in a loop until the processing time can no longer be reduced.
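Since the body of Table 2 is not reproduced here, the following Python sketch is only one possible reading of the greedy procedure described in the text, not the patented algorithm itself. The function `greedy_block_interval`, its bounds `lower` and `upper`, and the synthetic `measure` function are assumptions; in a running system each call to `measure` would correspond to processing one batch at the given block interval and recording its processing time.

```python
from typing import Callable

def greedy_block_interval(measure: Callable[[int], float],
                          beta: int, step: int, lower: int, upper: int) -> int:
    """Move beta in steps of `step` while the measured processing time keeps falling."""
    current = measure(beta)
    for direction in (+1, -1):                 # probe right first, then left
        candidate = beta + direction * step
        if not lower <= candidate <= upper:
            continue
        elapsed = measure(candidate)
        if elapsed < current:                  # this side is faster: direction found
            while elapsed < current:
                beta, current = candidate, elapsed
                candidate = beta + direction * step
                if not lower <= candidate <= upper:
                    break
                elapsed = measure(candidate)   # keep exploring in the same direction
            return beta
    return beta                                # neither neighbor is faster: stay put

# Synthetic convex timing curve with its minimum near 120 ms (illustrative only).
def measure(block_ms: int) -> float:
    return (block_ms - 120) ** 2 + 300.0

batch_interval_ms = 1000
initial_beta = batch_interval_ms // 2          # start from batch interval / 2
found = greedy_block_interval(measure, initial_beta, step=50, lower=50, upper=950)
```

The search needs one measurement per step, so a start point far from the optimum costs many batches; passing the previously found value as `beta` after a change in conditions shortens the next search.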

If the data receive rate and the system operating environment remain unchanged, the optimal block interval remains stable. When the operating environment changes, however, the optimal block interval changes too, and the algorithm must readjust promptly to adapt to the new environment. Restarting the search from the beginning would prolong convergence, so the invention takes the block interval in use before the change as the initial block interval and restarts the greedy adjustment from there.

Claims

1. A method for processing log data of a power system in real time based on Spark Streaming, characterized by comprising the following steps:

(1) defining a statistical model of different log categories, the statistical model comprising the elements: data sets, result sets, grouping conditions, grouping filters, and rule actions;

(2) constructing a relation model of the Spark Streaming block interval and data stream processing time, wherein the time over which the data stream is divided into a data block is the block interval, and the time over which a plurality of data blocks are combined into a batch is the batch interval;

(3) setting batch intervals, dynamically adjusting block intervals by using a greedy algorithm, and searching for optimal block intervals;

the greedy algorithm comprises the following steps:

(3.1) the initial block interval is denoted β, and the adjustment step size is i;

(3.2) if the batch processing time at block interval β is less than the batch processing time at block interval β + i, the optimal block interval is to the left of the initial block interval; if the batch processing time at block interval β is less than the batch processing time at block interval β - i, the optimal block interval is to the right of the initial block interval;

(3.3) once the direction of the optimal block interval has been found, continuing to explore in a loop until the processing time cannot be reduced further.

2. The Spark Streaming based power system log data real-time processing method according to claim 1, characterized in that the step (2) of constructing the relation model comprises the following steps:

(2.1) the batching module dividing the received data stream into independent data blocks according to the block interval;

(2.2) wrapping the data blocks in one batch interval time into a batch, and entering a batch queue to queue for being processed;

(2.3) processing the data of all block intervals within one batch interval in parallel.