CN111124650B - Stream data processing method and device - Google Patents

Info

Publication number
CN111124650B
Authority
CN
China
Prior art keywords
data
batch
complement
time
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911369508.8A
Other languages
Chinese (zh)
Other versions
CN111124650A (en)
Inventor
章彩红
赵子健
庹艳林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN201911369508.8A
Publication of CN111124650A
Application granted
Publication of CN111124650B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/547 Messaging middleware
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a method and a device for processing streaming data. When a streaming computing system enters a normal state, real-time data is processed by a main program and the time period corresponding to the processed real-time data is recorded. A complement program can then scan the time period corresponding to the real-time data to determine the target batch data of the historical data to be processed, and then process the target batch data. Historical data processing and real-time data processing are thus performed separately, and resource isolation prevents the two from preempting each other's computing resources or propagating exceptions to each other. The historical data is therefore processed effectively, and the accuracy of data analysis is improved.

Description

Stream data processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing streaming data in a streaming computing system.
Background
In a cloud computing environment, data volume grows rapidly, and mining and analysing the various data can yield valuable data for a business system. Big data analysis scenarios require data statistics, that is, statistical results provided along multiple dimensions, so that clearer and more concise analysis views or metrics are available for subsequent use. A data processing engine based on streaming computation may obtain quasi-real-time data for a period of time from message middleware (e.g., Kafka) for computation. Spark Streaming, part of Spark, is a widely used engine for streaming computation today.
However, when an application based on Spark Streaming undergoes a version update or encounters an exception, data cannot be processed normally and the data arriving during that period accumulates. After the application recovers, this historical data remains, and existing processing approaches cannot process it effectively, so the data processing result is inaccurate and the accuracy of subsequent data analysis is affected.
Disclosure of Invention
Aiming at the problems, the application provides a stream data processing method and device, which realize effective processing of historical data and improve the accuracy of data analysis.
In order to achieve the above object, the present application provides the following technical solutions:
a streaming data processing method applied to a streaming computing system including a main program and a complement program, processing real-time data when the main program is executed, and processing history data when the complement program is executed, the method comprising:
recording time periods corresponding to all real-time data processed by the main program in response to the main program start, wherein the main program start represents that the streaming computing system enters a normal data processing state;
scanning a time period corresponding to the real-time data through the complement program to determine a complement time period;
carrying out batch splitting on the data in the complement time period to obtain a target batch;
calculating a target message offset in a batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;
determining target historical data according to the target message offset, and controlling the complement program to process the target historical data to obtain a first data processing result;
and generating a data stream processing result according to a second data processing result obtained by the main program for processing the real-time data and the first data processing result.
Optionally, the recording the time period corresponding to all real-time data processed by the main program includes:
controlling the main program to run, so that the main program searches the batch time of the data successfully processed last time, and records the message offset of the start message and the end message of the data of each batch in the message middleware in the processing time period corresponding to the real-time data when the main program is started again;
storing the message offset to a pre-created data table.
Optionally, the scanning, by the complement program, the time period corresponding to the historical data, and determining the complement time period includes:
acquiring processing time corresponding to each batch of data in a time period corresponding to the historical data;
judging whether the difference between the processing times of two adjacent batches exceeds a preset time threshold while the data of the later batch is not in a target state, and if so, determining the time period between the two adjacent batches as a complement time period; the target state is either complement in progress or complement completed.
Optionally, the determining the target historical data according to the target message offset controls the complement program to process the target historical data to obtain a first data processing result, including:
and if the target batch comprises data of a plurality of batches, controlling the complement program to start a multithreaded concurrent processing mode, and processing the target historical data to obtain a first data processing result.
Optionally, the batch splitting of the data in the complement period to obtain a target batch includes:
carrying out batch splitting on the data in the complement time period to obtain a target batch;
and acquiring a batch time period corresponding to each target batch, so that the complement program is executed to process the data of the target batch corresponding to the batch time period.
A streaming data processing apparatus, the apparatus being applied to a streaming computing system including a main program and a complement program, processing real-time data when the main program is executed, and processing history data when the complement program is executed, the apparatus comprising:
the acquisition unit is used for responding to the starting of the main program and recording the time periods corresponding to all real-time data processed by the main program, and the starting of the main program characterizes the streaming computing system to enter a normal data processing state;
the scanning unit is used for scanning the time period corresponding to the real-time data through the complement program and determining the complement time period;
the splitting unit is used for splitting the data in the complement time period into batches to obtain a target batch;
the calculation unit is used for calculating the target message offset in the batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;
the control unit is used for determining target historical data according to the target message offset and controlling the complement program to process the target historical data to obtain a first data processing result;
and the generating unit is used for generating a data stream processing result according to a second data processing result obtained by the main program for processing the real-time data and the first data processing result.
Optionally, the acquiring unit includes:
the first control subunit is used for controlling the operation of the main program, so that the main program searches the batch time of the data successfully processed last time, and records the message offset of the start message and the end message of the data of each batch in the message middleware in the processing time period corresponding to the real-time data when the main program is started again;
and the storage subunit is used for storing the message offset into a pre-created data table.
Optionally, the scanning unit includes:
the first acquisition subunit is used for acquiring processing time corresponding to each batch of data in a time period corresponding to the historical data;
the judging subunit is used for judging whether the difference between the processing times of two adjacent batches exceeds a preset time threshold while the data of the later batch is not in a target state, and if so, determining the time period between the two adjacent batches as a complement time period; the target state is either complement in progress or complement completed.
Optionally, the control unit is specifically configured to:
and if the target batch comprises data of a plurality of batches, controlling the complement program to start a multithreaded concurrent processing mode, and processing the target historical data to obtain a first data processing result.
Optionally, the splitting unit includes:
the splitting subunit is used for carrying out batch splitting on the data in the complement time period to obtain a target batch;
and the second acquisition subunit is used for acquiring the batch time period corresponding to each target batch, so that the complement program is executed to process the data of the target batch corresponding to the batch time period.
Compared with the prior art, the application provides a streaming data processing method and device. When a streaming computing system enters a normal state, real-time data is processed by a main program and the time period corresponding to the processed real-time data is recorded. The complement program can then scan the time period corresponding to the real-time data to determine the target batch data of the historical data to be processed, and then process the target batch data. Historical data processing and real-time data processing are thus performed separately, and resource isolation prevents the two from preempting each other's computing resources or propagating exceptions to each other. The historical data is therefore processed effectively, and the accuracy of data analysis is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a streaming data processing method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a streaming data processing apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first and second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to the listed steps or elements but may include steps or elements not expressly listed.
In order to facilitate understanding of embodiments of the present application, terms used in the embodiments of the present application will now be explained.
Spark: a computing engine designed to process large-scale data.
Streaming computation: computation performed on a data stream in real time.
Spark Streaming: a Spark-based streaming computing engine.
Kafka: a distributed message middleware.
Topic: a category attribute used in the message middleware to classify data.
Partition: a smaller unit into which the data in a Topic is split.
The embodiment of the application provides a streaming data processing method applied to a streaming computing system. Unlike the prior art, in which a single program executes the streaming computation, the streaming computing system of the application includes a main program and a complement program: real-time data is processed when the main program is executed, and historical data is processed when the complement program is executed. Referring to fig. 1, the method may include the following steps:
S101, in response to the start of the main program, recording the time periods corresponding to all real-time data processed by the main program.
The start of the main program indicates that the streaming computing system has entered a normal data processing state. Take Spark Streaming, a Spark-based streaming computing engine, as an example. When an application based on Spark Streaming undergoes a version update or encounters an exception, data cannot be processed normally and the main program cannot run. After recovery, the main program is started and looks up the batch time of the last successfully processed data.
Specifically, on start-up the main program looks up in the repository the batch time A of the most recently stored data and takes it as the batch last successfully processed by the main program. The interval between time A and the main program start time B is taken as the time period of the historical data, and this time period is recorded.
In the repository, the batch that follows batch time A is the first batch after start time B. For example, if the batch time of the last stored data is 9:00 of the current day and the main program start time is 10:00, all data between 9:00 and 10:00 is regarded as historical data.
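The lookup in S101 can be illustrated with a minimal Python sketch. It assumes the repository can be queried for the batch times of stored data; the function name `historical_period` and the in-memory list standing in for the repository are hypothetical and not part of the application.

```python
from datetime import datetime

def historical_period(stored_batch_times, start_time):
    """Return (A, B), where A is the latest stored batch time and B is the
    main program start time, or None if there is nothing to complement."""
    last_success = max(stored_batch_times)   # batch time A
    if last_success >= start_time:           # no gap before start-up
        return None
    return last_success, start_time          # historical data lies in (A, B)

# Hypothetical repository contents: the last stored batch is at 09:00.
stored = [datetime(2019, 12, 26, 8, 59, 50), datetime(2019, 12, 26, 9, 0, 0)]
period = historical_period(stored, datetime(2019, 12, 26, 10, 0, 0))
```

With these inputs, `period` spans 09:00 to 10:00, matching the example in the text.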
S102, scanning the time period corresponding to the real-time data through the complement program, and determining the complement time period.
In the embodiment of the application, the main program can be controlled to run, so that the main program records the message offset of the start message and the end message of each batch of data in the message middleware in the processing time period corresponding to the real-time data; storing the message offset to a pre-created data table.
The process may include:
the main program records the offsets of the start message and the end message of each batch in Kafka and stores them in data Table1. Different partitions under different Topics of different Kafka clusters must be distinguished. Each record contains the batch time, Kafka cluster identifier, Topic, partition number, offset of the start message, offset of the end message, batch status, and next processing batch. The main program supports joint processing of data from multiple Kafka clusters, and the Topic names and partition numbers may differ for each cluster. The next-processing-batch attribute is left unset when the current batch is inserted and is filled in by an update when the next batch starts.
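One possible shape for such a record is sketched below in Python. The field names, the status string, and the in-memory `OffsetTable` class are assumptions made for illustration; the application only lists the attributes and does not prescribe a schema or storage engine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple

@dataclass
class BatchRecord:
    batch_time: datetime
    kafka_cluster: str            # Kafka cluster identifier
    topic: str
    partition: int
    start_offset: int             # offset of the batch's start message
    end_offset: int               # offset of the batch's end message
    status: str = "processed"     # hypothetical status value
    next_batch_time: Optional[datetime] = None  # unset at insert time

Key = Tuple[str, str, int]        # (cluster, topic, partition)

class OffsetTable:
    """In-memory stand-in for the pre-created data table."""

    def __init__(self) -> None:
        self.rows: Dict[Key, List[BatchRecord]] = {}

    def insert(self, rec: BatchRecord) -> None:
        key = (rec.kafka_cluster, rec.topic, rec.partition)
        history = self.rows.setdefault(key, [])
        if history:
            # Fill in the previous record's next batch only now,
            # when the following batch starts.
            history[-1].next_batch_time = rec.batch_time
        history.append(rec)
```

Keying the rows by cluster, Topic and partition keeps offsets from different partitions apart, which is the distinction the text calls for.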
The complement program is controlled to run, scan the time period corresponding to the real-time data, and determine the complement time period: the processing time corresponding to each batch of data in the time period corresponding to the historical data is acquired; it is judged whether the difference between the processing times of two adjacent batches exceeds a preset time threshold while the data of the later batch is not in a target state; if so, the time period between the two adjacent batches is determined as a complement time period. The target state is either complement in progress or complement completed.
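The scan can be sketched as a pass over adjacent pairs of recorded batches. This is a minimal Python illustration: the status strings, the function name `find_complement_periods`, and the 2-minute default threshold are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

IN_COMPLEMENT = "complementing"      # hypothetical status names
COMPLEMENT_DONE = "complemented"

def find_complement_periods(batches, threshold=timedelta(minutes=2)):
    """batches: list of (batch_time, status) tuples sorted by batch_time.
    Returns the (t1, t2) gaps that still need complement processing."""
    periods = []
    for (t1, _), (t2, next_status) in zip(batches, batches[1:]):
        gap_too_long = (t2 - t1) > threshold
        already_handled = next_status in (IN_COMPLEMENT, COMPLEMENT_DONE)
        if gap_too_long and not already_handled:
            periods.append((t1, t2))
    return periods

day = datetime(2019, 12, 26)
recorded = [
    (day.replace(hour=8, minute=59, second=50), "processed"),
    (day.replace(hour=9), "processed"),
    (day.replace(hour=10), "processed"),   # first batch after the restart
    (day.replace(hour=10, second=10), "processed"),
]
gaps = find_complement_periods(recorded)
```

Here only the 09:00 to 10:00 gap exceeds the threshold, so `gaps` contains that single period. Marking the later batch as in progress or completed keeps a repeated scan from picking up the same gap twice.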
S103, carrying out batch splitting on the data in the complement time period to obtain a target batch;
S104, calculating a target message offset in a batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;
S105, determining target historical data according to the target message offset, and controlling the complement program to process the target historical data to obtain a first data processing result.
During processing, the data in the complement time period is split into batches to obtain target batches, and the batch time period corresponding to each target batch is acquired, so that the complement program is executed to process the data of the target batch corresponding to that batch time period. If the target batches comprise data of a plurality of batches, the complement program is controlled to start a multithreaded concurrent processing mode to process the target historical data and obtain a first data processing result.
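Batch splitting and concurrent processing might look like the following Python sketch. It assumes a fixed batch interval (10 seconds here) and one worker thread per complement time period; `process_batch` is a hypothetical placeholder for the real business logic, and the thread pool size is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

def split_batches(t1, t2, interval=timedelta(seconds=10)):
    """Split [t1, t2) into consecutive batch time periods of `interval`."""
    batches = []
    start = t1
    while start < t2:
        end = min(start + interval, t2)   # last batch may be shorter
        batches.append((start, end))
        start = end
    return batches

def process_batch(start, end):
    # Placeholder: a real complement program would read and process
    # the messages that fall inside this batch time period.
    return (start, end)

def complement_period(period):
    """Process one complement time period batch by batch, in order."""
    t1, t2 = period
    return [process_batch(s, e) for s, e in split_batches(t1, t2)]

def complement_all(periods, max_workers=4):
    """Process several complement time periods concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(complement_period, periods))
```

Splitting 09:00 to 10:00 at a 10-second interval yields 360 batch time periods. Batches inside one period are handled sequentially in this sketch, while separate periods run on separate threads.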
Specifically, the complement program obtains the historical data time period to be processed, that is, it determines the complement time period. The complement program is controlled to scan data Table1 periodically and check whether the time T1 of each batch differs from the processing time T2 of the next batch by more than a certain duration (for example, 2 minutes) while the status of the next batch is neither complement in progress nor complement completed. If so, the historical data in the time period T1-T2 is considered to need processing and complement processing for that period begins; if several batches meet the condition, multiple threads are started to process the historical data time periods concurrently. Complement processing for a single historical data time period proceeds as follows: the start time T1 and the end time T2 are obtained, and the period is divided into successive batch time periods according to the batch interval of the complement program (e.g. 10 s). For example, if T1 is 09:00 and T2 is 10:00, the batches are 09:00:00-09:00:10, 09:00:10-09:00:20, 09:00:20-09:00:30, … in sequence. The batches are polled one by one, and the following processing is performed within each batch.
For example:
the start time and the end time of the batch are acquired, the partitions of each Topic under the Kafka servers are polled, and the Kafka interface is called to acquire the start offset offset1 and the end offset offset2 of the messages in the time period. If the batch is the first batch of the time period, the end offset offset_A of the batch at batch time A is acquired, and if offset_A is smaller than offset1, then offset1 = offset_A. If the batch is the last batch of the time period, the start offset offset_B of the batch at the main program start time B is acquired, and if offset_B is larger than offset2, then offset2 = offset_B;
the data from offset1 to offset2 is read from Kafka and the business logic is applied; if the business logic for historical data differs from that for quasi-real-time data, the adjustment is also made in this part.
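The offset calculation for one partition can be sketched in Python as follows. A sorted list of message timestamps stands in for the Kafka timestamp-to-offset lookup (the Java consumer exposes such a lookup as `offsetsForTimes`); the helper names and the half-open range convention [offset1, offset2) are assumptions of this sketch, since the text does not state whether the end offset is inclusive.

```python
from bisect import bisect_left

def offsets_in_window(timestamps, start_ts, end_ts):
    """timestamps: sorted message timestamps of one partition, where the
    list index plays the role of the message offset.
    Returns (offset1, offset2) for messages with start_ts <= ts < end_ts."""
    return bisect_left(timestamps, start_ts), bisect_left(timestamps, end_ts)

def adjust_offsets(offset1, offset2, is_first, is_last,
                   offset_a=None, offset_b=None):
    """Widen the range at the edges of the complement time period so that
    no message between the main program's batches and the window is lost."""
    if is_first and offset_a is not None and offset_a < offset1:
        offset1 = offset_a        # resume where batch A ended
    if is_last and offset_b is not None and offset_b > offset2:
        offset2 = offset_b        # run up to where the batch at B started
    return offset1, offset2

# Hypothetical partition with six messages, timestamps in seconds.
ts = [100, 105, 110, 115, 120, 125]
window = offsets_in_window(ts, 110, 120)
```

For this partition `window` is `(2, 4)`. For the first batch with `offset_a=1` the range widens to `(1, 4)`, and for the last batch with `offset_b=5` it widens to `(2, 5)`.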
S106, generating a data stream processing result according to a second data processing result obtained by the main program for processing the real-time data and the first data processing result.
In the present application, the program run in the streaming computing system is divided into two parts. The first part, called the main program, processes quasi-real-time data. The second part, called the complement program, processes historical data. The two programs have independent start-up methods, are allocated different Spark resources, and are started and maintained independently, which prevents them from preempting each other's computing resources or propagating exceptions to each other. After start-up, the main program actively looks up the batch time of the last successful processing and, for each batch it processes, records the offsets of the start and end messages in Kafka. The complement program periodically scans for historical data time periods to be processed, calculates the start and end message offsets for each batch, and consumes exactly the messages within that range from Kafka for business processing.
Therefore, in the embodiment of the application, unprocessed historical data generated while the streaming computing system was abnormal can be processed; that is, the data processed by the whole streaming computing system consists of the second data processing result, obtained by the main program processing the real-time data, and the first data processing result, obtained by the complement program processing the historical data. The historical data in the message middleware is processed accurately without affecting the processing of normal quasi-real-time data. Moreover, the start of the historical data time period is taken from a query of the repository, which makes it more accurate; historical data processing and quasi-real-time data processing are performed separately with isolated resources, preventing mutual preemption of computing resources or propagation of exceptions; the batch times of historical data processing are accurate, so no statistical errors are introduced; and historical data processing logic that differs from the quasi-real-time processing logic is supported.
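How the two results are combined is not specified in the application. As one hedged illustration, if both results are per-key counts (a common shape for the statistics mentioned in the background), they could be merged by summation. The function name `merge_results` and the dictionary representation are assumptions.

```python
def merge_results(first_result, second_result):
    """Sum per-key statistics from the complement program (first result)
    and the main program (second result)."""
    merged = dict(second_result)
    for key, value in first_result.items():
        merged[key] = merged.get(key, 0) + value
    return merged

# Hypothetical per-key counts.
complement_counts = {"deposit": 3, "transfer": 1}
realtime_counts = {"deposit": 2, "withdrawal": 5}
total = merge_results(complement_counts, realtime_counts)
```

Summation is only valid for additive metrics; averages or distinct counts would need a different merge rule.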
Correspondingly, the embodiment of the application also provides a streaming data processing device which is applied to a streaming computing system, wherein the streaming computing system comprises a main program and a complement program, when the main program is executed, real-time data are processed, and when the complement program is executed, historical data are processed.
The device comprises:
an obtaining unit 10, configured to record a period of time corresponding to all real-time data processed by the main program in response to the main program being started, where the main program is started to characterize that the streaming computing system enters a normal data processing state;
a scanning unit 20, configured to scan, by using the complement program, a time period corresponding to the real-time data, and determine a complement time period;
a splitting unit 30, configured to split the data in the complement period into batches, so as to obtain a target batch;
a calculating unit 40, configured to calculate a target message offset in a batch period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;
a control unit 50, configured to determine target historical data according to the target message offset, and control the complement program to process the target historical data to obtain a first data processing result;
and a generating unit 60, configured to generate a data stream processing result according to a second data processing result obtained by the main program for real-time data processing and the first data processing result.
On the basis of the above embodiment, the acquisition unit includes:
the first control subunit is used for controlling the operation of the main program, so that the main program searches the batch time of the data successfully processed last time, and records the message offset of the start message and the end message of the data of each batch in the message middleware in the processing time period corresponding to the real-time data when the main program is started again;
and the storage subunit is used for storing the message offset into a pre-created data table.
On the basis of the above embodiment, the scanning unit includes:
the first acquisition subunit is used for acquiring processing time corresponding to each batch of data in a time period corresponding to the historical data;
the judging subunit is used for judging whether the difference between the processing times of two adjacent batches exceeds a preset time threshold while the data of the later batch is not in a target state, and if so, determining the time period between the two adjacent batches as a complement time period; the target state is either complement in progress or complement completed.
On the basis of the above embodiment, the control unit is specifically configured to:
and if the target batch comprises data of a plurality of batches, controlling the complement program to start a multithreaded concurrent processing mode, and processing the target historical data to obtain a first data processing result.
On the basis of the above embodiment, the splitting unit includes:
the splitting subunit is used for carrying out batch splitting on the data in the complement time period to obtain a target batch;
and the second acquisition subunit is used for acquiring the batch time period corresponding to each target batch, so that the complement program is executed to process the data of the target batch corresponding to the batch time period.
The application provides a streaming data processing device. When a streaming computing system enters a normal state, real-time data is processed by a main program and the time period corresponding to the processed real-time data is recorded. The complement program can then scan the time period corresponding to the real-time data to determine the target batch data of the historical data to be processed, and then process the target batch data. Historical data processing and real-time data processing are thus performed separately, and resource isolation prevents the two from preempting each other's computing resources or propagating exceptions to each other. The historical data is therefore processed effectively, and the accuracy of data analysis is improved.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A streaming data processing method, wherein the method is applied to a streaming computing system, the streaming computing system comprises a main program and a complement program, real-time data is processed when the main program is executed, historical data is processed when the complement program is executed, and the main program and the complement program are allocated with different Spark resources and are independently started, the method comprises:
responding to the starting of the main program, recording the corresponding time periods of all real-time data processed by the main program, and storing the time periods into a data table, wherein the starting of the main program characterizes that the streaming computing system enters a normal data processing state;
scanning, through the complement program, the time periods corresponding to the real-time data to determine a complement time period, which comprises: scanning the data table to obtain the processing time corresponding to each batch of data in the time period corresponding to the historical data; judging whether the interval between the processing times of two adjacent batches exceeds a preset time threshold, and if so, determining the time period between the two adjacent batches as a complement time period, wherein the data of the later batch is not in a target state, the target state being either complement in progress or complement completed;
carrying out batch splitting on the data in the complement time period to obtain a target batch;
calculating a target message offset in a batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;
determining target historical data according to the target message offset, and controlling the complement program to process the target historical data to obtain a first data processing result;
and generating a data stream processing result according to the first data processing result and a second data processing result obtained by the main program processing the real-time data.
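The batch splitting and offset calculation steps of claim 1 may be sketched as follows. The claim does not state how the target message offsets are derived. The sketch therefore assumes that the missing messages lie between the end offset of the batch before the gap and the start offset of the batch after it, and that this range is apportioned evenly across the target batches. Both assumptions, and the function names `split_into_batches` and `apportion_offsets`, are illustrative only. An actual system might instead look up offsets by timestamp in the message middleware.

```python
from datetime import datetime, timedelta


def split_into_batches(start, end, batch_interval):
    """Split [start, end) into consecutive periods of at most batch_interval."""
    if batch_interval <= timedelta(0):
        raise ValueError("batch_interval must be positive")
    batches = []
    cursor = start
    while cursor < end:
        upper = min(cursor + batch_interval, end)
        batches.append((cursor, upper))
        cursor = upper
    return batches


def apportion_offsets(first_offset, last_offset, n_batches):
    """Divide the inclusive range [first_offset, last_offset] into
    n_batches contiguous inclusive ranges of nearly equal size.

    Even apportioning is an assumption of this sketch. A range whose
    upper bound is below its lower bound is empty.
    """
    total = last_offset - first_offset + 1
    if n_batches <= 0 or total < 0:
        raise ValueError("need n_batches > 0 and a non-negative range")
    base, extra = divmod(total, n_batches)
    ranges = []
    cursor = first_offset
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        ranges.append((cursor, cursor + size - 1))
        cursor += size
    return ranges
```

For a complement time period from 10:01 to 10:10 and a three-minute batch interval, `split_into_batches` yields three target batches. If the batch before the gap ended at offset 100 and the batch after it started at offset 111, `apportion_offsets(101, 110, 3)` assigns the ranges 101 to 104, 105 to 107, and 108 to 110.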
2. The method according to claim 1, wherein said recording the time period corresponding to all real-time data processed by the main program comprises:
controlling the main program to run so that, when the main program is started again, the main program looks up the batch time of the most recently successfully processed data and records, for each batch of data in the processing time period corresponding to the real-time data, the message offsets of the start message and the end message of that batch in the message middleware;
storing the message offset to a pre-created data table.
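The recording step of claim 2 may be illustrated with a minimal sketch that uses an SQLite table as a stand-in for the pre-created data table. The table name `batch_offsets`, its columns, and the helper functions are hypothetical. The application does not specify the storage engine or schema.

```python
import sqlite3


def create_offset_table(conn):
    """Create the (assumed) table holding one row per processed batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batch_offsets ("
        "batch_time TEXT PRIMARY KEY, "
        "start_offset INTEGER NOT NULL, "
        "end_offset INTEGER NOT NULL)"
    )


def record_batch(conn, batch_time, start_offset, end_offset):
    """Store the start and end message offsets of one batch."""
    conn.execute(
        "INSERT OR REPLACE INTO batch_offsets VALUES (?, ?, ?)",
        (batch_time, start_offset, end_offset),
    )
    conn.commit()


def last_successful_batch_time(conn):
    """Return the latest recorded batch time, or None if the table is empty.

    Batch times are stored as ISO 8601 strings, so MAX() orders them
    chronologically.
    """
    row = conn.execute("SELECT MAX(batch_time) FROM batch_offsets").fetchone()
    return row[0]
```

On restart, a main program built along these lines could call `last_successful_batch_time` to find where it left off and then call `record_batch` for each new batch it processes.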
3. The method of claim 1, wherein determining the target history data according to the target message offset, controlling the complement program to process the target history data, and obtaining a first data processing result, includes:
if the target batch comprises a plurality of batches of data, controlling the complement program to start a multithreaded concurrent processing mode to process the target historical data and obtain the first data processing result.
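The multithreaded mode of claim 3 may be sketched with a thread pool. The function `process_batch` is a hypothetical stand-in for the actual processing of one target batch; here it only counts the messages in an inclusive offset range. The worker count is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(offset_range):
    """Hypothetical stand-in: return the number of messages in the range."""
    first, last = offset_range
    return last - first + 1


def process_target_batches(offset_ranges, max_workers=4):
    """Process target batches, using threads only when there are several.

    Results are returned in the same order as `offset_ranges`.
    """
    if len(offset_ranges) <= 1:
        return [process_batch(r) for r in offset_ranges]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_batch, offset_ranges))
```

Because `pool.map` preserves input order, the per-batch results can be combined into the first data processing result without re-sorting.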
4. The method of claim 1, wherein the batch splitting of the data in the complement time period to obtain the target batch comprises:
carrying out batch splitting on the data in the complement time period to obtain a target batch;
and acquiring a batch time period corresponding to each target batch, so that the complement program is executed to process the data of the target batch corresponding to that batch time period.
5. A streaming data processing apparatus, the apparatus being applied to a streaming computing system including a main program and a complement program, the main program processing real-time data when executed and the complement program processing historical data when executed, the main program and the complement program being allocated different Spark resources and being independently started, the apparatus comprising:
the acquisition unit is used for responding to the starting of the main program, recording the time periods corresponding to all real-time data processed by the main program and storing the time periods into a data table, wherein the starting of the main program characterizes that the streaming computing system enters a normal data processing state;
the scanning unit is used for scanning the time period corresponding to the real-time data through the complement program and determining the complement time period;
the scanning unit includes: a first obtaining subunit, configured to scan the data table to obtain the processing time corresponding to each batch of data in the time period corresponding to the historical data; and a judging subunit, configured to judge whether the interval between the processing times of two adjacent batches exceeds a preset time threshold, and if so, to determine the time period between the two adjacent batches as a complement time period, wherein the data of the later batch is not in a target state, the target state being either complement in progress or complement completed;
the splitting unit is used for splitting the data in the complement time period into batches to obtain a target batch;
the calculation unit is used for calculating the target message offset in the batch time period corresponding to the target batch according to the message offset of the data of each batch in the message middleware;
the control unit is used for determining target historical data according to the target message offset and controlling the complement program to process the target historical data to obtain a first data processing result;
and the generating unit is used for generating a data stream processing result according to the first data processing result and a second data processing result obtained by the main program processing the real-time data.
6. The apparatus of claim 5, wherein the acquisition unit comprises:
the first control subunit is used for controlling the main program to run so that, when the main program is started again, the main program looks up the batch time of the most recently successfully processed data and records, for each batch of data in the processing time period corresponding to the real-time data, the message offsets of the start message and the end message of that batch in the message middleware;
and the storage subunit is used for storing the message offset into a pre-created data table.
7. The device according to claim 5, wherein the control unit is specifically configured to:
if the target batch comprises a plurality of batches of data, controlling the complement program to start a multithreaded concurrent processing mode to process the target historical data and obtain the first data processing result.
8. The apparatus of claim 5, wherein the splitting unit comprises:
the splitting subunit is used for carrying out batch splitting on the data in the complement time period to obtain a target batch;
and the second acquisition subunit is used for acquiring the batch time period corresponding to each target batch, so that the complement program is executed to process the data of the target batch corresponding to that batch time period.
CN201911369508.8A 2019-12-26 2019-12-26 Stream data processing method and device Active CN111124650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911369508.8A CN111124650B (en) 2019-12-26 2019-12-26 Stream data processing method and device

Publications (2)

Publication Number Publication Date
CN111124650A (en) 2020-05-08
CN111124650B (en) 2023-10-24

Family

ID=70503407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911369508.8A Active CN111124650B (en) 2019-12-26 2019-12-26 Stream data processing method and device

Country Status (1)

Country Link
CN (1) CN111124650B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094241B (en) * 2021-05-07 2023-09-05 北京京东振世信息技术有限公司 Method, device, equipment and storage medium for determining accuracy of real-time program
CN113515374B (en) * 2021-05-18 2024-02-27 中国工商银行股份有限公司 Data processing method and device, electronic equipment and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776855A (en) * 2016-11-29 2017-05-31 上海轻维软件有限公司 The processing method of Kafka data is read based on Spark Streaming
CN107870763A (en) * 2017-11-27 2018-04-03 深圳市华成峰科技有限公司 For creating the method and its device of the real-time sorting system of mass data
CN108108126A (en) * 2017-12-15 2018-06-01 北京奇艺世纪科技有限公司 A kind of data processing method, device and equipment
CN108509299A (en) * 2018-03-29 2018-09-07 努比亚技术有限公司 Message treatment method, equipment and computer readable storage medium
CN109634784A (en) * 2018-12-24 2019-04-16 康成投资(中国)有限公司 Spark application control method and control device
CN110490229A (en) * 2019-07-16 2019-11-22 昆明理工大学 A kind of electric energy meter calibration error diagnostics method based on spark and clustering algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10541953B2 (en) * 2017-12-13 2020-01-21 Chicago Mercantile Exchange Inc. Streaming platform reader

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant