CN111124650B

CN111124650B - Stream data processing method and device

Info

Publication number: CN111124650B
Application number: CN201911369508.8A
Authority: CN
Inventors: 章彩红; 赵子健; 庹艳林
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2023-10-24
Anticipated expiration: 2039-12-26
Also published as: CN111124650A

Abstract

The application discloses a method and a device for processing streaming data. When a streaming computing system enters a normal state, real-time data is processed by a main program, and the time period corresponding to the processed real-time data is recorded. A complement program then scans the recorded time periods to determine the target batches of historical data that remain to be processed, and processes those target batches. Historical data processing and real-time data processing are thereby separated, and resource isolation prevents the two from preempting each other's computing resources or propagating exceptions to each other. Historical data is thus processed effectively, and the accuracy of data analysis is improved.

Description

Stream data processing method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing streaming data in a streaming computing system.

Background

In a cloud computing environment, data volumes grow rapidly, and mining and analyzing this data can yield valuable information for business systems. Big data analysis scenarios require data statistics, that is, statistical results provided across multiple dimensions, so that clearer and more concise analysis views or metrics are available for subsequent use. A data processing engine based on streaming computation can obtain quasi-real-time data for a period of time from message middleware (e.g., Kafka) for computation. Spark Streaming, part of Spark, is a widely used streaming computing engine today.

However, when a Spark Streaming-based application is being upgraded to a new version or encounters an exception, data cannot be processed normally and accumulates during that period. Historical data therefore exists once the application recovers, and existing processing approaches cannot handle this data effectively, so the data processing results are inaccurate and the accuracy of subsequent data analysis suffers.

Disclosure of Invention

In view of the above problems, the application provides a streaming data processing method and device that process historical data effectively and improve the accuracy of data analysis.

In order to achieve the above object, the present application provides the following technical solutions:

a streaming data processing method applied to a streaming computing system including a main program and a complement program, processing real-time data when the main program is executed, and processing history data when the complement program is executed, the method comprising:

recording time periods corresponding to all real-time data processed by the main program in response to the main program start, wherein the main program start represents that the streaming computing system enters a normal data processing state;

scanning a time period corresponding to the real-time data through the complement program to determine a complement time period;

carrying out batch splitting on the data in the complement time period to obtain a target batch;

calculating a target message offset in a batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;

determining target historical data according to the target message offset, and controlling the complement program to process the target historical data to obtain a first data processing result;

and generating a data stream processing result according to a second data processing result obtained by the main program for processing the real-time data and the first data processing result.

Optionally, the recording the time period corresponding to all real-time data processed by the main program includes:

controlling the main program to run, so that when it is started again the main program looks up the batch time of the most recently successfully processed data, and records, for each batch of data in the processing time period corresponding to the real-time data, the message offsets of the start message and the end message in the message middleware;

storing the message offset to a pre-created data table.

Optionally, the scanning, by the complement program, the time period corresponding to the historical data, and determining the complement time period includes:

acquiring processing time corresponding to each batch of data in a time period corresponding to the historical data;

judging whether the difference between the processing times of two adjacent batches exceeds a preset time threshold and whether the data of the later batch is not in a target state, and if so, determining the time period between the two adjacent batches as a complement time period; the target state is either complement in progress or complement completed.

Optionally, the determining the target historical data according to the target message offset controls the complement program to process the target historical data to obtain a first data processing result, including:

and if the target batch comprises a plurality of batches of data, controlling the complement program to start a multithreaded concurrent processing mode and process the target historical data to obtain a first data processing result.

Optionally, the batch splitting of the data in the complement period to obtain a target batch includes:

and acquiring a batch time period corresponding to each target batch, so that the complement program is executed to process the data of the target batch corresponding to that batch time period.

A streaming data processing apparatus, the apparatus being applied to a streaming computing system including a main program and a complement program, processing real-time data when the main program is executed, and processing history data when the complement program is executed, the apparatus comprising:

the acquisition unit is used for recording, in response to the starting of the main program, the time periods corresponding to all real-time data processed by the main program, wherein the starting of the main program indicates that the streaming computing system has entered a normal data processing state;

the scanning unit is used for scanning the time period corresponding to the real-time data through the complement program and determining the complement time period;

the splitting unit is used for splitting the data in the complement time period into batches to obtain a target batch;

the calculation unit is used for calculating the target message offset in the batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;

the control unit is used for determining target historical data according to the target message offset and controlling the complement program to process the target historical data to obtain a first data processing result;

and the generating unit is used for generating a data stream processing result according to a second data processing result obtained by the main program for processing the real-time data and the first data processing result.

Optionally, the acquiring unit includes:

the first control subunit is used for controlling the operation of the main program, so that the main program searches the batch time of the data successfully processed last time, and records the message offset of the start message and the end message of the data of each batch in the message middleware in the processing time period corresponding to the real-time data when the main program is started again;

and the storage subunit is used for storing the message offset into a pre-created data table.

Optionally, the scanning unit includes:

the first acquisition subunit is used for acquiring processing time corresponding to each batch of data in a time period corresponding to the historical data;

the judging subunit is used for judging whether the difference between the processing times of two adjacent batches exceeds a preset time threshold and whether the data of the later batch is not in a target state, and if so, determining the time period between the two adjacent batches as a complement time period; the target state is either complement in progress or complement completed.

Optionally, the control unit is specifically configured to:

Optionally, the splitting unit includes:

the splitting subunit is used for carrying out batch splitting on the data in the complement time period to obtain a target batch;

and the second acquisition subunit is used for acquiring the batch time period corresponding to each target batch, so that the complement program is executed to process the data of the target batch corresponding to that batch time period.

Compared with the prior art, the application provides a streaming data processing method and device. When a streaming computing system enters a normal state, real-time data is processed by a main program, and the time period corresponding to the processed real-time data is recorded. The complement program then scans the recorded time periods to determine the target batches of historical data that remain to be processed, and processes those target batches. Historical data processing and real-time data processing are thereby separated, and resource isolation prevents the two from preempting each other's computing resources or propagating exceptions to each other. Historical data is thus processed effectively, and the accuracy of data analysis is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of a streaming data processing method according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a streaming data processing apparatus according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The terms first and second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to the listed steps or elements but may include steps or elements not expressly listed.

In order to facilitate understanding of embodiments of the present application, terms used in the embodiments of the present application will now be explained.

Spark: a computing engine designed for large-scale data processing.

Streaming computation: computation performed on a data stream in real time.

Spark Streaming: a Spark-based streaming computing engine.

Kafka: a distributed message middleware.

Topic: a category attribute used in the message middleware to divide data.

Partition: a smaller unit into which the data of a Topic is split.

The embodiment of the application provides a streaming data processing method applied to a streaming computing system. Unlike the prior art, in which a single set of programs performs the streaming computation, the streaming computing system in the application comprises a main program and a complement program: real-time data is processed when the main program is executed, and historical data is processed when the complement program is executed. Referring to fig. 1, the method may comprise the following steps:

S101, responding to the starting of the main program, and recording time periods corresponding to all real-time data processed by the main program.

The starting of the main program indicates that the streaming computing system has entered a normal data processing state. Take Spark Streaming, a Spark-based streaming computing engine, as an example. When a Spark Streaming-based application is being upgraded or encounters an exception, data cannot be processed normally and the main program cannot be started. After recovery, the main program is started and looks up the batch time of the last successfully processed data.

Specifically, on startup the main program looks up, in the storage repository, the batch time A of the data most recently written to storage and takes it as the last batch successfully processed by the main program. The interval between time A and the main program start time B is taken as the time period of the historical data, and this time period is recorded.

In the database, the next batch recorded for batch time A is updated to the first batch after start time B. For example, if the batch time of the last data written to storage is 9:00 on a given day and the main program start time is 10:00, all data between 9:00 and 10:00 is regarded as historical data.
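The derivation of the historical data time period from batch time A and start time B can be sketched as follows. This is only an illustrative sketch: the function name `historical_window` and the calendar date are hypothetical, and the lookup of A in the storage repository is not shown.

```python
from datetime import datetime
from typing import Optional, Tuple


def historical_window(
    last_batch_time: datetime, start_time: datetime
) -> Optional[Tuple[datetime, datetime]]:
    """Return (A, B) as the historical data time period, or None if there is no gap.

    last_batch_time is batch time A (the last batch written to storage);
    start_time is the main program start time B.
    """
    if start_time <= last_batch_time:
        return None
    return (last_batch_time, start_time)


# Example from the description: last stored batch at 9:00, main program started at 10:00.
# The calendar date is arbitrary.
a = datetime(2019, 12, 26, 9, 0)
b = datetime(2019, 12, 26, 10, 0)
window = historical_window(a, b)
```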

S102, scanning the time period corresponding to the real-time data through the complement program, and determining the complement time period.

In the embodiment of the application, the main program can be controlled to run so that it records, for each batch of data in the processing time period corresponding to the real-time data, the message offsets of the start message and the end message in the message middleware; the message offsets are stored in a pre-created data table.

The process may include:

The main program records the offsets of the start message and the end message of each batch in Kafka and stores them in the data table Table1. Different partitions under different Topics of different Kafka clusters must be distinguished. Each record contains the batch time, Kafka cluster identifier, Topic, partition number, offset of the start message, offset of the end message, batch status, and next processing batch. The main program supports joint processing of data from multiple Kafka clusters, and the Topic names and partition numbers may differ for each Kafka cluster. The next-processing-batch attribute is not set when the current batch is inserted; it is filled in by an update when the next batch starts.
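One possible shape for a record in the data table is sketched below. The field list follows the record information named above; the class name, field names, status strings, and the `link_next_batch` helper are hypothetical choices made for illustration, not names taken from the application.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BatchOffsetRecord:
    """One row per batch, per Kafka cluster, Topic, and partition."""

    batch_time: datetime
    kafka_cluster: str      # Kafka cluster identifier
    topic: str
    partition: int
    start_offset: int       # offset of the batch's start message
    end_offset: int         # offset of the batch's end message
    status: str             # hypothetical values: "NORMAL", "COMPLEMENTING", "COMPLEMENTED"
    next_batch_time: Optional[datetime] = None  # left unset when the row is inserted


def link_next_batch(previous: BatchOffsetRecord, current: BatchOffsetRecord) -> None:
    """Fill in the next-processing-batch attribute once the next batch has started."""
    previous.next_batch_time = current.batch_time


first = BatchOffsetRecord(datetime(2019, 12, 26, 9, 0, 0), "kafka-1", "orders", 0, 100, 149, "NORMAL")
second = BatchOffsetRecord(datetime(2019, 12, 26, 9, 0, 10), "kafka-1", "orders", 0, 150, 199, "NORMAL")
link_next_batch(first, second)
```

Keeping the cluster identifier, Topic, and partition number in every row is what allows offsets from several Kafka clusters to be told apart in a single table.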

The complement program is then controlled to run, scan the time period corresponding to the real-time data, and determine the complement time period. The processing time corresponding to each batch of data in the time period corresponding to the historical data is acquired. It is then judged whether the difference between the processing times of two adjacent batches exceeds a preset time threshold and whether the data of the later batch is not in a target state; if so, the time period between the two adjacent batches is determined as a complement time period. The target state is either complement in progress or complement completed.
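The adjacent-batch check can be sketched as a scan over (batch time, status) pairs. This is a minimal sketch under stated assumptions: the function name, the status strings, and the in-memory list standing in for the data table are hypothetical, and the 2-minute default is only an example threshold.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical status strings for the two target states.
TARGET_STATES = {"COMPLEMENTING", "COMPLEMENTED"}


def find_complement_periods(
    batches: List[Tuple[datetime, str]],
    threshold: timedelta = timedelta(minutes=2),
) -> List[Tuple[datetime, datetime]]:
    """Return (T1, T2) for each adjacent pair whose gap exceeds the threshold,
    skipping pairs whose later batch is already in a target state."""
    ordered = sorted(batches, key=lambda b: b[0])
    periods = []
    for (t1, _), (t2, next_status) in zip(ordered, ordered[1:]):
        if t2 - t1 > threshold and next_status not in TARGET_STATES:
            periods.append((t1, t2))
    return periods


day = datetime(2019, 12, 26)
table = [
    (day.replace(hour=8, minute=59, second=50), "NORMAL"),
    (day.replace(hour=9), "NORMAL"),
    (day.replace(hour=10), "NORMAL"),              # first batch after the restart
    (day.replace(hour=10, second=10), "NORMAL"),
]
periods = find_complement_periods(table)
```

Checking the status of the later batch is what keeps a repeated scan from picking up a gap that is already being filled or has been filled.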

S103, carrying out batch splitting on the data in the complement time period to obtain a target batch;

S104, calculating a target message offset in a batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;

S105, determining target historical data according to the target message offset, and controlling the complement program to process the target historical data to obtain a first data processing result.

During processing, the data in the complement time period is split into batches to obtain target batches, and the batch time period corresponding to each target batch is acquired, so that the complement program is executed to process the data of the target batch corresponding to that batch time period. If the target batch comprises a plurality of batches of data, the complement program is controlled to start a multithreaded concurrent processing mode and process the target historical data to obtain a first data processing result.
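Batch splitting and concurrent processing might be sketched as follows. This is a sketch rather than the application's implementation: the function names, the 10-second default interval, the worker count, and the `process_batch` callback are illustrative assumptions, and a standard-library thread pool stands in for the multithreaded concurrent processing mode.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

Period = Tuple[datetime, datetime]


def split_into_batches(
    start: datetime, end: datetime, interval: timedelta = timedelta(seconds=10)
) -> List[Period]:
    """Split [start, end) into consecutive batch time periods of length `interval`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    batches = []
    cursor = start
    while cursor < end:
        batch_end = min(cursor + interval, end)
        batches.append((cursor, batch_end))
        cursor = batch_end
    return batches


def complement_period(period: Period, process_batch: Callable[[Period], object]) -> list:
    """Poll the batches of one complement time period in order."""
    return [process_batch(batch) for batch in split_into_batches(*period)]


def complement_all(
    periods: List[Period], process_batch: Callable[[Period], object], max_workers: int = 4
) -> List[list]:
    """Process one period inline, or several periods concurrently on a thread pool."""
    if len(periods) <= 1:
        return [complement_period(p, process_batch) for p in periods]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: complement_period(p, process_batch), periods))


day = datetime(2019, 12, 26)
hour_batches = split_into_batches(day.replace(hour=9), day.replace(hour=10))
```

Within one period the batches are processed sequentially, so concurrency applies only across separate complement time periods.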

Specifically, the complement program obtains the historical data time period to be processed, that is, it determines the complement time period. The complement program is controlled to scan the data table Table1 at regular intervals and check whether the time T1 of each batch differs from the processing time T2 of the next batch by more than a certain time (for example, 2 minutes) while the status of the next batch is neither complement in progress nor complement completed. If so, the historical data in the time period T1-T2 is considered to require processing, and complement processing of the historical data in that time period begins. If several batches meet the condition, multithreading is started to process the several historical data time periods concurrently. Complement processing of a single historical data time period proceeds as follows: the start time T1 and the end time T2 are obtained, and the batch time periods are divided one by one according to the batch interval of the complement program (for example, 10 s). For example, if T1 is 09:00 and T2 is 10:00, the batches are 09:00:00-09:00:10, 09:00:10-09:00:20, 09:00:20-09:00:30, … in sequence. Each batch is polled in turn, and the following processing is performed within each batch.

For example:

The start time and the end time of the batch are acquired, the partitions of each Topic under each Kafka server are polled, and the Kafka interface is called to obtain the start offset offset1 and the end offset offset2 of the messages in that time period. If the batch is the first batch of the time period, the end offset offset_A of the batch at batch time A is acquired, and if offset_A is smaller than offset1, then offset1 = offset_A. If the batch is the last batch of the time period, the start offset offset_B of the batch at main program start time B is acquired, and if offset_B is larger than offset2, then offset2 = offset_B;

The data from offset1 to offset2 is read from Kafka and the business logic is applied; if the business logic for historical data differs from that for near-real-time data, the adjustment is also made in this step.
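The boundary adjustment for the first and last batch of a complement time period can be sketched as a small pure function. The function and parameter names are hypothetical; the Kafka lookup itself is omitted, and offset1, offset2, the end offset of batch time A, and the start offset of the batch at start time B are passed in as plain integers.

```python
from typing import Optional, Tuple


def adjust_offsets(
    offset1: int,
    offset2: int,
    *,
    is_first: bool,
    is_last: bool,
    offset_a: Optional[int] = None,
    offset_b: Optional[int] = None,
) -> Tuple[int, int]:
    """Apply the boundary rules for one partition and return (start, end).

    offset_a: end offset recorded for the batch at batch time A.
    offset_b: start offset recorded for the batch at main program start time B.
    """
    start, end = offset1, offset2
    # First batch of the period: use offset_a when it lies before offset1.
    if is_first and offset_a is not None and offset_a < start:
        start = offset_a
    # Last batch of the period: use offset_b when it lies after offset2.
    if is_last and offset_b is not None and offset_b > end:
        end = offset_b
    return start, end


first_range = adjust_offsets(105, 200, is_first=True, is_last=False, offset_a=100)
last_range = adjust_offsets(300, 395, is_first=False, is_last=True, offset_b=400)
middle_range = adjust_offsets(200, 300, is_first=False, is_last=False)
```

Only the two boundary batches are adjusted; every batch in between keeps the offsets returned for its own time period.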

S106, generating a data stream processing result according to a second data processing result obtained by the main program for processing the real-time data and the first data processing result.

In the present application, the startup program of the streaming computing system is divided into two parts. The first part processes near-real-time data and is called the main program. The second part processes historical data and is called the complement program. The two programs have independent startup methods, are allocated different Spark resources, and are started and maintained independently, which prevents them from preempting each other's computing resources or propagating exceptions to each other. After startup, the main program actively looks up the batch time of the last successful processing and, during each batch, records the offsets of the start message and the end message in Kafka. The complement program periodically scans for historical data time periods to be processed, calculates the start and end message offsets for each batch, and consumes exactly the messages within that time range from Kafka for business processing.

Therefore, in the embodiment of the application, unprocessed historical data generated while the streaming computing system was abnormal can be processed. That is, the data processed by the streaming computing system as a whole consists of the second data processing result, obtained by the main program processing the real-time data, and the first data processing result, obtained by the complement program processing the historical data. The historical data in the message middleware is processed accurately without affecting the processing of normal quasi-real-time data. In addition, the start time of the historical data time period is taken from a query of the storage repository, which makes it more accurate; historical data processing and quasi-real-time data processing are performed separately with isolated resources, which prevents mutual preemption of computing resources and propagation of exceptions; the batch times used for historical data processing are accurate and do not cause statistical errors; and the historical data processing logic is allowed to differ from the near-real-time processing logic.

Correspondingly, the embodiment of the application also provides a streaming data processing device which is applied to a streaming computing system, wherein the streaming computing system comprises a main program and a complement program, when the main program is executed, real-time data are processed, and when the complement program is executed, historical data are processed.

The device comprises:

an obtaining unit 10, configured to record a period of time corresponding to all real-time data processed by the main program in response to the main program being started, where the main program is started to characterize that the streaming computing system enters a normal data processing state;

a scanning unit 20, configured to scan, by using the complement program, a time period corresponding to the real-time data, and determine a complement time period;

a splitting unit 30, configured to split the data in the complement period into batches, so as to obtain a target batch;

a calculating unit 40, configured to calculate a target message offset in a batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;

a control unit 50, configured to determine target historical data according to the target message offset, and control the complement program to process the target historical data to obtain a first data processing result;

and a generating unit 60, configured to generate a data stream processing result according to a second data processing result obtained by the main program for real-time data processing and the first data processing result.

On the basis of the above embodiment, the acquisition unit includes:

On the basis of the above embodiment, the scanning unit includes:

On the basis of the above embodiment, the control unit is specifically configured to:

On the basis of the above embodiment, the splitting unit includes:

The application provides a streaming data processing device. When a streaming computing system enters a normal state, real-time data is processed by a main program, and the time period corresponding to the processed real-time data is recorded. The complement program then scans the recorded time periods to determine the target batches of historical data that remain to be processed, and processes those target batches. Historical data processing and real-time data processing are thereby separated, and resource isolation prevents the two from preempting each other's computing resources or propagating exceptions to each other. Historical data is thus processed effectively, and the accuracy of data analysis is improved.

In the present specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A streaming data processing method, wherein the method is applied to a streaming computing system, the streaming computing system comprises a main program and a complement program, real-time data is processed when the main program is executed, historical data is processed when the complement program is executed, and the main program and the complement program are allocated with different Spark resources and are independently started, the method comprises:

responding to the starting of the main program, recording the corresponding time periods of all real-time data processed by the main program, and storing the time periods into a data table, wherein the starting of the main program characterizes that the streaming computing system enters a normal data processing state;

scanning the time period corresponding to the real-time data through the complement program to determine a complement time period, wherein the method comprises the following steps: scanning the data table to obtain processing time corresponding to each batch of data in a time period corresponding to the historical data; judging whether the processing time of two adjacent batches exceeds a preset time threshold, if so, determining the time period of the two adjacent batches as a complement time period, wherein the data of the next batch is not in a target state; the target state is in the complement or the complement is completed;

2. The method according to claim 1, wherein said recording the time period corresponding to all real-time data processed by the main program comprises:

storing the message offset to a pre-created data table.

3. The method of claim 1, wherein determining the target history data according to the target message offset, controlling the complement program to process the target history data, and obtaining a first data processing result, includes:

and if the target batch comprises a plurality of batch data, controlling the complement program to start a multithreading concurrent processing mode, and processing the target historical data to obtain a first data processing result.

4. The method of claim 1, wherein the batch splitting of the data in the complement time period to obtain the target batch comprises:

5. A streaming data processing apparatus, the apparatus being applied to a streaming computing system including a main program and a complement program, the main program processing real-time data when executed and processing history data when the complement program is executed, the main program and the complement program being allocated with different Spark resources and being independently started, the apparatus comprising:

the acquisition unit is used for responding to the starting of the main program, recording the time periods corresponding to all real-time data processed by the main program and storing the time periods into a data table, wherein the starting of the main program characterizes that the streaming computing system enters a normal data processing state;

the scanning unit includes: a first obtaining subunit, configured to scan the data table to obtain processing time corresponding to each batch of data in a time period corresponding to the historical data; the judging subunit is used for judging whether the processing time of two adjacent batches exceeds a preset time threshold, if so, determining the time period of the two adjacent batches as a complement time period, wherein the data of the next batch is not in a target state; the target state is in the complement or the complement is completed;

6. The apparatus of claim 5, wherein the acquisition unit comprises:

7. The device according to claim 5, wherein the control unit is specifically configured to:

8. The apparatus of claim 5, wherein the splitting unit comprises: