WO2023082681A1

WO2023082681A1 - Data processing method and apparatus based on batch-stream integration, computer device, and medium

Info

Publication number: WO2023082681A1
Application number: PCT/CN2022/105078
Authority: WO
Inventors: 罗静; 王博一; 王晓; 霍星志; 郭宇鹏; 毛少将
Original assignee: 通号通信信息集团有限公司
Priority date: 2021-11-09
Filing date: 2022-07-12
Publication date: 2023-05-19
Also published as: CN113779094A; CN113779094B

Abstract

The present disclosure provides a data processing method based on batch-stream integration, comprising: obtaining data to be processed; according to a processing link, processing, layer by layer, the data to be processed so as to obtain first data; in processing layers, processing data input into the present processing layer to obtain processed real-time data, storing the processed real-time data in a Hive module on the basis of a Flink stream, and inputting the processed data into the next processing layer; processing the first data in a data application layer to obtain second data; in response to detecting an error in the second data, correcting, according to offline data of the present processing layer, wrong data in the processing layer in which a data error occurs, and inputting the corrected data into the next processing layer, so that the next processing layer processes the input data. The present disclosure further provides a data processing apparatus, a computer device, and a medium.

Description

Data processing method, device, computer equipment and medium based on batch-flow integration

Technical field

The present disclosure relates to, but is not limited to, the technical field of big data processing.

Background

A real-time/offline fusion platform is essentially a kind of data warehouse. As product requirements and internal decision-making demand increasingly timely data, real-time data warehouse capabilities are needed. The data timeliness of a traditional offline data warehouse is T+1 and its scheduling frequency is measured in days, which cannot support the data requirements of real-time scenarios. Even if the scheduling frequency is raised to hourly, this only covers scenarios with low timeliness requirements and cannot satisfy scenarios with high timeliness requirements.

The real-time data warehouse can effectively solve the above problems, but Kafka (open source stream processing platform) is only a temporary storage medium, and the data will have a timeout period, for example, only 7 days of data will be saved, which will lead to the loss of historical data. When an error occurs in a task, since there is no historical data, it is impossible to recalculate the data.

Moreover, in the related art, a real-time data warehouse built on the Lambda architecture suffers from a split between offline and real-time processing: the same data source yields two different calculation results, one offline and one real-time. In addition, two separate frameworks, real-time and offline, must be maintained, which increases operation and maintenance costs.

Summary of the invention

The disclosure provides a data processing method, device, computer equipment and medium based on batch-flow integration.

In a first aspect, an embodiment of the present disclosure provides a data processing method based on batch-flow integration. The method is applied to a data processing device that includes a data application layer and a plurality of processing layers, the processing layers forming a processing link, and the method comprises:

Obtaining data to be processed, the data to be processed being real-time data;

Processing the data to be processed layer by layer according to the processing link to obtain first data; wherein, in each of the processing layers, the data input to that processing layer is processed to obtain processed data, the processed data is real-time data, the processed real-time data is stored in a Hive module based on a Flink stream, and the processed data is input to the next processing layer; the first data is the processed data obtained by the last processing layer in the processing link;

Processing the first data in the data application layer to obtain second data;

In response to detecting that the second data is erroneous, correcting, in the processing layer where the data error occurred, the erroneous data according to the offline data of that processing layer to obtain corrected data, and inputting the corrected data to the next processing layer, so that the next processing layer processes the input data.

In another aspect, an embodiment of the present disclosure also provides a data processing device, including an acquisition module, a first processing module, and a second processing module. The second processing module forms a data application layer, the first processing module includes a plurality of processing layers, the processing layers form a processing link, and each processing layer includes a first processing unit and a second processing unit;

The acquiring module is configured to acquire data to be processed, and the data to be processed includes real-time data;

The first processing module is configured to process the data to be processed layer by layer according to the processing link to obtain first data;

The first processing unit is configured to: process the data input to its processing layer to obtain processed data, the processed data being real-time data; store the processed real-time data in the Hive module based on a Flink stream; and input the processed data to the next processing layer, the first data being the processed data obtained by the last processing layer in the processing link; and to receive the corrected data sent by the second processing unit and input the corrected data to the first processing unit of the next processing layer, so that the first processing unit of the next processing layer processes the input data;

The second processing unit is configured to, in response to a data error occurring in its processing layer, correct the erroneous data according to the offline data of that processing layer to obtain corrected data, and send the corrected data to the first processing unit;

The second processing module is configured to process the first data in the data application layer to obtain second data.

In yet another aspect, an embodiment of the present disclosure further provides a computer device, including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the batch-flow integration-based data processing method as described above.

In yet another aspect, an embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, wherein when the program is executed, the batch-flow integration-based data processing method as described above is implemented.

Description of drawings

FIG. 1 is a first schematic flow diagram of a batch-flow integration-based data processing method according to an embodiment of the present disclosure;

FIG. 2 is the second schematic flow diagram of the data processing method based on batch-flow integration provided by an embodiment of the present disclosure;

FIG. 3 is a first structural schematic diagram of a data processing device provided by an embodiment of the present disclosure;

FIG. 4 is a second structural diagram of a data processing device provided by an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of a specific example of a data processing device provided by an embodiment of the present disclosure.

Detailed description

In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the embodiments of the present disclosure will be further described in detail below through specific implementation manners in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present disclosure, not to limit the present disclosure.

An embodiment of the present disclosure provides a data processing method based on batch-flow integration. The method is applied to a data processing device that includes a data application (Application Data Service, ADS) layer and a plurality of processing layers. The processing layers form a processing link, and data processed along the link reaches the data application layer. In some embodiments, the processing layers may include an ODS (Operational Data Store) layer, a DWD (Data Warehouse Detail) layer and a DWS (Data Warehouse Service) layer, forming the processing link ODS layer -> DWD layer -> DWS layer. Data is passed through the processing layers in the order of the processing link, and the processing result of each processing layer serves as the data source of the next.

FIG. 1 is a first schematic flowchart of a batch-flow integration-based data processing method provided by an embodiment of the present disclosure. As shown in FIG. 1 , the batch-flow integration-based data processing method includes steps 11-14.

In step 11, the data to be processed is acquired, and the data to be processed is real-time data.

Real-time data refers to data whose lifetime (that is, the duration of data existence) is less than or equal to the timeout period.

Step 12: Process the data to be processed layer by layer according to the processing link to obtain the first data; wherein, in each processing layer, process the data input to the processing layer to obtain processed data, and the processed data is real-time data , store the processed real-time data in the Hive module based on the Flink flow, and input the processed data to the next processing layer; the first data is the processed data obtained by the last processing layer in the processing link.

In this step, the data to be processed is processed layer by layer along the processing link formed by the processing layers, and the processing result of each processing layer is input to the next processing layer as its data source. In each processing layer, the real-time data input to that layer is processed to obtain processed real-time data. On the one hand, the processed real-time data is stored in the Hive module based on a Flink stream, to facilitate subsequent query and retrieval. It should be noted that once the lifetime of the processed real-time data exceeds the timeout period, the data is converted to offline data. Hive is a Hadoop-based data warehouse tool for data extraction, transformation and loading; it provides a mechanism to store, query and analyze large-scale data stored in Hadoop. Hive is not suitable for online transaction processing and does not provide real-time query functions; it is suited to batch jobs over large amounts of immutable data. On the other hand, the processed real-time data is input, as the processing result of this layer, to the next processing layer in the processing link for further processing. In this manner, the processing result of the last processing layer in the processing link is obtained, and this result is the first data.
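The layer-by-layer flow described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the class ProcessingLayer, the function run_link, the per-layer transforms and the in-memory list standing in for the Hive module are all hypothetical names chosen for illustration, and no Flink or Hive API is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class ProcessingLayer:
    """One layer of the processing link (for example ODS, DWD or DWS)."""
    name: str
    transform: Callable[[Record], Record]  # hypothetical per-layer logic
    store: List[Record] = field(default_factory=list)  # stand-in for the Hive module

    def process(self, records: List[Record]) -> List[Record]:
        processed = [self.transform(r) for r in records]
        self.store.extend(processed)  # persist this layer's output
        return processed  # hand the same output to the next layer


def run_link(layers: List[ProcessingLayer], records: List[Record]) -> List[Record]:
    """Pass the data through every layer in order; the last output is the first data."""
    for layer in layers:
        records = layer.process(records)
    return records


layers = [
    ProcessingLayer("ODS", lambda r: {**r, "source": "ods"}),
    ProcessingLayer("DWD", lambda r: {**r, "amount": float(r["amount"])}),
    ProcessingLayer("DWS", lambda r: {"user": r["user"], "amount": r["amount"]}),
]
first_data = run_link(layers, [{"user": "u1", "amount": "12.5"}])
```

Each layer keeps its own copy of what it produced, which is what later makes a per-layer correction possible without re-reading the original input.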

Step 13, process the first data in the data application layer to obtain the second data.

The processing in the data application layer refers to data analysis, for example OLAP (Online Analytical Processing). OLAP is a collection of analysis-oriented operations implemented on the multidimensional model of the data warehouse. Its advantages rest on the data warehouse's subject-oriented, integrated, history-preserving and immutable data storage, and on its multi-dimensional, multi-view and multi-level organization of data.

In this step, the OLAP module analyzes the first data produced by the processing link, and the analysis result is the second data. The second data is the full data set, including both real-time data and offline data, all stored in the OLAP module, so that real-time data and offline data are integrated.

Step 14, in response to detecting that the second data is wrong, in the processing layer where the data error occurs, correct the wrong data according to the offline data of this processing layer, obtain the corrected data, and input the corrected data to the next processing layer so that the next processing layer can process the input data.

Offline data, also called historical data, refers to data whose lifetime is longer than the timeout period. For example, with a timeout period of 7 days, data whose lifetime is less than or equal to 7 days is real-time data; once its lifetime exceeds 7 days, the data becomes offline data.
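The distinction between real-time and offline data reduces to a comparison of lifetime against the timeout period. A small sketch, assuming the 7-day timeout used as the example in the text (the function name classify and the sample dates are illustrative only):

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(days=7)  # the 7-day example from the text


def classify(created_at: datetime, now: datetime, timeout: timedelta = TIMEOUT) -> str:
    """Return 'real-time' while the lifetime is within the timeout, else 'offline'."""
    lifetime = now - created_at
    return "real-time" if lifetime <= timeout else "offline"


now = datetime(2022, 7, 12)
labels = [
    classify(now - timedelta(days=3), now),
    classify(now - timedelta(days=7), now),  # exactly at the boundary: still real-time
    classify(now - timedelta(days=8), now),
]
```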

In this step, if the data in the ADS layer is found to be wrong, the processing layer in which the data error occurred can be determined. In that processing layer, the offline data is obtained from the Hive module and used to correct the erroneous real-time data. Because data is passed and processed layer by layer, a change in the processing result of one layer changes the processing results of all subsequent layers; the corrected data therefore needs to be input to the next processing layer, which processes the data again.
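One way to picture the correction step is sketched below. It assumes, purely for illustration, that records carry an "id" key and that correcting a record means replacing it with the offline copy that shares that key; the disclosure does not specify the correction rule, and the function correct_and_reprocess and the sample transforms are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Layer = Tuple[str, Callable[[Record], Record]]


def correct_and_reprocess(
    layers: List[Layer],
    faulty_layer: str,
    faulty_output: List[Record],
    offline_store: Dict[str, Record],
) -> List[Record]:
    """Fix the faulty layer's output from its offline copy, then rerun later layers."""
    names = [name for name, _ in layers]
    start = names.index(faulty_layer)
    # Replace each record with the offline copy that shares its key, if one exists.
    records = [offline_store.get(str(r["id"]), r) for r in faulty_output]
    # Only the layers after the faulty one need to process the corrected data again.
    for _, transform in layers[start + 1:]:
        records = [transform(r) for r in records]
    return records


layers: List[Layer] = [
    ("ODS", lambda r: r),
    ("DWD", lambda r: {**r, "amount": round(float(r["amount"]), 2)}),
    ("DWS", lambda r: {"id": r["id"], "amount_cents": int(r["amount"] * 100)}),
]
bad_dwd_output = [{"id": 1, "amount": -5.0}, {"id": 2, "amount": 3.0}]
dwd_offline = {"1": {"id": 1, "amount": 5.0}}  # hypothetical trusted copy from the layer's store

corrected = correct_and_reprocess(layers, "DWD", bad_dwd_output, dwd_offline)
```

Layers before the faulty one are not rerun, which matches the text: only the results downstream of the corrected layer change.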

In the data processing method based on batch-flow integration provided by the embodiments of the present disclosure, the data processing device includes a data application layer and a plurality of processing layers, and the processing layers form a processing link. The method includes: acquiring data to be processed, the data to be processed being real-time data; processing the data to be processed layer by layer according to the processing link to obtain the first data, wherein, in each processing layer, the data input to that layer is processed to obtain processed data, the processed data is real-time data, the processed real-time data is stored in the Hive module based on a Flink stream, and the processed data is input to the next processing layer; processing the first data in the data application layer to obtain the second data; and, in response to detecting that the second data is erroneous, correcting the erroneous data in the processing layer where the data error occurred according to the offline data of that layer to obtain corrected data, and inputting the corrected data to the next processing layer so that the next processing layer processes the input data. The embodiments of the present disclosure can thus collect and preprocess both real-time data and offline data, and fuse the two so that they share the same data source, the same computing engine and the same calculation criteria. This simplifies the data application architecture: a single system architecture supports both offline and real-time data analysis, reducing architectural complexity and operation and maintenance costs.

FIG. 2 is a second schematic flow diagram of the data processing method based on batch-flow integration provided by an embodiment of the present disclosure. In some embodiments, as shown in FIG. 2 , after the first data is processed in the data application layer to obtain the second data, the batch-flow integration-based data processing method may further include steps 21-22.

Step 21, in response to receiving a data query request, obtain a query result, the query result includes at least one of the following: offline data of each processing layer, and second data.

In this step, different data sources can be queried separately, or associated queries can be performed across different data sources. A separate query queries the offline data of each processing layer independently, whereas an associated query queries the offline data of the processing layers together with the second data.

In some embodiments, OpenLooKeng or Presto can be used to implement data query. OpenLooKeng is an open-source, high-performance data virtualization engine that provides a unified SQL (Structured Query Language) interface, has cross-data-source and cross-data-center analysis capabilities, and targets interactive, batch and stream fusion query scenarios. OpenLooKeng can connect to the Hive module and the OLAP module to realize unified query of offline data and real-time data. Presto is a data query engine that enables fast interactive analysis of more than 250PB of data.
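The difference between a separate query and an associated query can be shown with plain SQL. The sketch below uses Python's built-in sqlite3 module only as a stand-in for a federated engine; the table names dwd_offline and ads_result, their columns and the sample rows are assumptions made for illustration, and a real deployment would address the Hive and OLAP sources through the query engine's own catalogs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dwd_offline (order_id INTEGER, user_id TEXT, amount REAL);  -- stand-in for a Hive table
    CREATE TABLE ads_result  (user_id TEXT, total_amount REAL);              -- stand-in for an OLAP table
    INSERT INTO dwd_offline VALUES (1, 'u1', 10.0), (2, 'u1', 5.0), (3, 'u2', 7.5);
    INSERT INTO ads_result  VALUES ('u1', 15.0), ('u2', 7.5);
    """
)

# Separate query: one source only.
offline_rows = conn.execute(
    "SELECT order_id, amount FROM dwd_offline WHERE user_id = ? ORDER BY order_id",
    ("u1",),
).fetchall()

# Associated query: join the offline detail with the application-layer result.
joined_rows = conn.execute(
    """
    SELECT d.user_id, COUNT(*) AS orders, a.total_amount
    FROM dwd_offline AS d
    JOIN ads_result AS a ON a.user_id = d.user_id
    GROUP BY d.user_id, a.total_amount
    ORDER BY d.user_id
    """
).fetchall()
conn.close()
```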

Step 22, sending the query result.

The embodiments of the present disclosure can realize unified query of offline data and real-time data.

In related technologies, the data processing scheme based on the integration of batch and flow has no unified external query interface, and there is a problem of complex data landing management. In order to solve this problem, the embodiment of the present disclosure provides a unified external query interface.

In some embodiments, sending the query result includes: sending the query result through a preset query interface, where the query interface may include at least one of the following: a JDBC API interface, a REST API interface.

The JDBC (Java Database Connectivity) API (Application Programming Interface) is a Java interface for executing SQL statements. Through the JDBC API interface, a program can connect to a relational database and use SQL statements to query and update data.

REST in RESTful API stands for Representational State Transfer. Put simply, a URL (Uniform Resource Locator) represents a resource, and HTTP methods represent operations on that resource. A RESTful API is a REST-style API interface, a typical interface based on the HTTP protocol, ensuring the security of interactive data transmission. After a terminal sends a data query request to the server, if a RESTful API interface is not used, a corresponding return format needs to be defined for each terminal's data query requests to suit the front-end display. With a RESTful API interface, the front end is required to send data query requests in a predefined syntax format, so the server can define a unified response interface and no longer needs to parse data query requests in various formats, which simplifies interface management.
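The idea of a unified response interface can be sketched without any web framework. In the following illustration the resource paths, the RESOURCES table and the envelope fields status and data are all hypothetical; the point is only that every request, whatever its origin, receives a response of the same shape.

```python
import json
from typing import Dict, List

# Hypothetical in-memory data keyed by resource path.
RESOURCES: Dict[str, List[dict]] = {
    "/layers/dwd/offline": [{"order_id": 1, "amount": 10.0}],
    "/ads/results": [{"user_id": "u1", "total_amount": 10.0}],
}


def handle(method: str, path: str) -> str:
    """Return one response envelope for every request, whatever the client."""
    if method != "GET":
        status, data = 405, []
    elif path not in RESOURCES:
        status, data = 404, []
    else:
        status, data = 200, RESOURCES[path]
    return json.dumps({"status": status, "data": data})


ok = json.loads(handle("GET", "/ads/results"))
missing = json.loads(handle("GET", "/unknown"))
```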

By setting the Rest API interface, the data processing device can connect to visual display components (such as Tableau), and query the full amount of data in a custom way on the web client (Web), thereby supporting the visual display of front-end data.

In some embodiments, processing the data input to the current processing layer includes: using a stream data processing engine to process the data input to the current processing layer. In the embodiment of the present disclosure, the Kafka module of each processing layer uses Flink to process the data input to that processing layer.

Kafka is an open source stream processing platform written in Scala and Java. It is a high-throughput distributed publish-subscribe message system that can process all action stream data of consumers in the website.

Flink is an open source stream processing framework whose core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime system can execute both batch and stream processing programs. When executed, Flink programs are mapped to streaming dataflows. Each Flink dataflow starts with one or more sources (data inputs, such as a message queue or file system) and ends with one or more sinks (data outputs, such as a message queue, file system or database).
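The source, operator and sink structure of a dataflow can be mimicked with Python generators. This is a conceptual sketch only: it does not use the Flink API, and the functions source, transform and sink are illustrative names rather than Flink calls.

```python
from typing import Callable, Iterable, Iterator, List


def source(events: Iterable[str]) -> Iterator[str]:
    """Data input: stands in for a message queue or file."""
    yield from events


def transform(stream: Iterator[str], fn: Callable[[str], str]) -> Iterator[str]:
    """One operator: applied record by record as data flows through."""
    for event in stream:
        yield fn(event)


def sink(stream: Iterator[str], out: List[str]) -> None:
    """Data output: stands in for a queue, file system or database."""
    for event in stream:
        out.append(event)


results: List[str] = []
sink(transform(source(["login", "click"]), str.upper), results)
```

Because generators are lazy, each record travels from source to sink before the next one is read, which loosely mirrors pipelined execution.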

It should be noted that the batch data processing engine can also be used to process the data input to this processing layer, but compared with the stream data processing engine, the real-time performance is not good.

In some embodiments, correcting the erroneous data according to the offline data of the current processing layer includes: using a stream data processing engine to correct the erroneous data according to the offline data of the current processing layer. In the embodiment of the present disclosure, the Hive module of each processing layer uses Flink to correct the erroneous data of that layer according to the layer's stored offline data. That is, the Hive module uses Flink to modify the topic of the real-time data according to the offline data, and returns the modified real-time data to the Kafka module.

In some embodiments, the data to be processed may include log data and business data. Correspondingly, acquiring the data to be processed may include: obtaining the business data from a business database by means of CDC (Change Data Capture), and obtaining the log data from a log collection system (Flume).

CDC can monitor and capture changes in the database (including insertion, update and deletion of data or data tables), record the changes in the order in which they occur, and write them to message middleware for other services to subscribe to and consume.
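Why the order of captured changes matters can be seen by replaying a change log on the consumer side. The sketch below assumes a simplified change event of the form (operation, primary key, row); this format and the function replay are illustrative and do not correspond to any particular CDC tool.

```python
from typing import Dict, List, Optional, Tuple

# (operation, primary key, row); a delete carries no row.
Change = Tuple[str, int, Optional[dict]]


def replay(changes: List[Change]) -> Dict[int, dict]:
    """Apply captured changes in the order they occurred to rebuild the table."""
    table: Dict[int, dict] = {}
    for op, key, row in changes:
        if op in ("insert", "update") and row is not None:
            table[key] = row
        elif op == "delete":
            table.pop(key, None)
    return table


change_log: List[Change] = [
    ("insert", 1, {"name": "a"}),
    ("insert", 2, {"name": "b"}),
    ("update", 1, {"name": "a2"}),
    ("delete", 2, None),
]
state = replay(change_log)
```

Replaying the same events in a different order could leave a deleted row in place or an outdated value on top, which is why the changes are recorded in the order they occur.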

In the embodiment of the present disclosure, the business database may be a relational database, such as a MySQL database.

Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive amounts of log data. Flume supports customizing various data senders in the log system to collect data, and writes the processed data to the data receiving end.

The data processing scheme based on batch-stream integration in the embodiment of the present disclosure supports the collection and preprocessing of real-time data and offline data, supports unified data query, and supports publishing through JDBC and RESTful external interfaces. It can fuse real-time data and offline data, and thereby solve the problems of inconsistent processing of real-time and offline data and complex data landing management in big data platforms.

The data processing scheme based on batch-stream integration in the embodiment of the present disclosure supports batch-stream integration and extends OLAP capability to all scenarios: batch data and stream data can be accessed at the same time through one data model and one SQL statement, and a unified query interface is provided. Compared with the Lambda architecture, it achieves the same data source, the same computing engine and the same calculation criteria. It supports analysis of both historical data and near-real-time data, reduces architectural complexity and lowers operation and maintenance costs. It can help enterprises greatly simplify their data application architecture, using a single system architecture to meet different needs at once and thereby respond to business needs with greater agility.

The embodiment of the present disclosure also provides a data processing device. FIG. 3 is a first structural schematic diagram of the data processing device provided by the embodiment of the present disclosure. As shown in FIG. 3, the data processing device includes an acquisition module 101, a first processing module 102 and a second processing module 103. The second processing module 103 forms a data application layer, the first processing module 102 includes a plurality of processing layers, the processing layers form a processing link, and each processing layer includes a first processing unit 1021 and a second processing unit 1022.

The acquiring module 101 is configured to acquire data to be processed, and the data to be processed includes real-time data.

The first processing module 102 is configured to process the data to be processed layer by layer according to the processing link to obtain the first data.

The first processing unit 1021 is configured to: process the data input to its processing layer to obtain processed data, the processed data being real-time data; store the processed real-time data in the Hive module based on a Flink stream; and input the processed data to the next processing layer, the first data being the processed data obtained by the last processing layer in the processing link; and to receive the corrected data sent by the second processing unit 1022 and input the corrected data to the first processing unit of the next processing layer, so that the first processing unit of the next processing layer processes the input data.

The second processing unit 1022 is configured to, in response to a data error occurring in its processing layer, correct the erroneous data according to the offline data of that processing layer to obtain corrected data, and send the corrected data to the first processing unit 1021.

The second processing module 103 is configured to process the first data in the data application layer to obtain second data.

In some embodiments, the second processing unit 1022 is configured to store the processed data so as to generate offline data of this processing layer.

In some embodiments, the first processing unit 1021 is configured to use a stream data processing engine to process the data input to this processing layer.

FIG. 4 is a second structural diagram of a data processing device provided by an embodiment of the present disclosure. In some embodiments, as shown in FIG. 4, the data processing device further includes a query module 104 configured to: in response to receiving a data query request, obtain a query result, the query result including at least one of the following: offline data of each processing layer, and the second data; and send the query result.

In some embodiments, the query module 104 is configured to send the query result through a preset query interface, and the query interface at least includes: a JDBC API interface and a Rest API interface.

In some embodiments, the second processing unit 1022 is configured to use a streaming data processing engine to correct the erroneous data according to the offline data of this processing layer.

In some embodiments, the data to be processed includes log data and business data, and the acquisition module 101 is configured to obtain the business data from the business database by means of change data capture (CDC), and to obtain the log data from the log collection system.

In order to clearly describe the technical solution of the embodiment of the present disclosure, the solution is described below through a specific example in conjunction with FIG. 5. FIG. 5 is a schematic structural diagram of a specific example of a data processing device provided by an embodiment of the present disclosure. As shown in FIG. 5, the data processing device includes an acquisition module 201, a first processing module 202, a second processing module 203 and a query module 204. The first processing module 202 includes an ODS layer, a DWD layer and a DWS layer; each of these three processing layers includes a Kafka module and a Hive module, the Kafka module and the Hive module in one processing layer form a processing unit, and the three processing layers form a processing link in the order ODS layer -> DWD layer -> DWS layer. The ODS layer, DWD layer and DWS layer are connected through their Kafka modules, so that data is passed layer by layer. The second processing module 203 is located at the ADS layer and may be an OLAP module. The query module 204 is connected to the Hive module of each processing layer and to the OLAP module of the ADS layer, so as to realize cross-source query.

The acquisition module 201 can acquire business data from the MySQL database through CDC, collect log data from Flume, and send the business data and log data to the Kafka module in the ODS layer.

Taking the ODS layer as an example, the Kafka module uses Flink to process the real-time data input to the ODS layer, obtains the processed real-time data, and loads it to the Hive module through the Flink stream for storage. The Kafka module of the ODS layer sends the processed real-time data to the Kafka module of the DWD layer, so as to continue data processing in the DWD layer.

The query module 204 uses an OpenLooKeng connector provided with a JDBC API interface and a REST API interface. After receiving a data query request through one of these interfaces, it initiates a data query to at least one of the following modules: the Hive module of each processing layer, and the OLAP module; and it returns, through the same interface, at least one of the following queried data: offline data, real-time data.

When a data query reveals an error in the second data in the OLAP module, and the error occurred in the DWD layer, the offline data stored in the Hive module of the DWD layer is used to correct the erroneous data, and the corrected data is input to the Kafka module of the DWS layer, which continues to process the data.

An embodiment of the present disclosure also provides a computer device, including: one or more processors and a storage device, wherein one or more programs are stored on the storage device, and when the one or more programs are executed by the one or more processors, the one or more processors implement the batch-flow integration-based data processing method provided by the foregoing embodiments.

An embodiment of the present disclosure also provides a computer-readable medium on which a computer program is stored, wherein when the computer program is executed, the batch-flow integration-based data processing method provided in the foregoing embodiments is implemented.

Those skilled in the art can understand that all or some of the steps in the method disclosed above and the functional modules/units in the device can be implemented as software, firmware, hardware and an appropriate combination thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be composed of several physical components. Components cooperate to execute. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit . Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As known to those of ordinary skill in the art, the term computer storage media includes both volatile and nonvolatile media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. permanent, removable and non-removable media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, tape, magnetic disk storage or other magnetic storage devices, or can Any other medium used to store desired information and which can be accessed by a computer. 
In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Example embodiments have been disclosed herein, and while specific terms have been employed, they are used and should be construed in a generic descriptive sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that features, characteristics and/or elements described in connection with a particular embodiment may be used alone, or in combination with features, characteristics and/or elements described in connection with other embodiments, unless explicitly stated otherwise. Accordingly, it will be understood by those of ordinary skill in the art that various changes in form and details may be made without departing from the scope of the present invention as set forth in the appended claims.

Claims

A data processing method based on batch-flow integration, wherein the method is applied to a data processing device, the data processing device includes a data application layer and a plurality of processing layers, the processing layers form a processing link, and the method comprises:

obtaining data to be processed, the data to be processed being real-time data;

processing the data to be processed layer by layer according to the processing link to obtain first data, wherein, in each of the processing layers, the data input to the processing layer is processed to obtain processed data, the processed data being real-time data, the processed real-time data is stored in a Hive module based on a Flink stream, and the processed data is input to the next processing layer; the first data is the processed data obtained by the last processing layer in the processing link;

processing the first data in the data application layer to obtain second data;

in response to detecting that the second data is erroneous, correcting, in the processing layer where the data error occurs, the erroneous data according to the offline data of that processing layer to obtain corrected data, and inputting the corrected data to the next processing layer, so that the next processing layer processes the input data.
The method according to claim 1, wherein said processing the data input into the processing layer comprises: using a stream data processing engine to process the data input into the processing layer.
The method according to claim 1, wherein, after processing the first data in the data application layer and obtaining the second data, the method further comprises:

in response to receiving a data query request, obtaining a query result, the query result including at least one of the following: offline data of each processing layer, or the second data;

sending the query result.
The method according to claim 3, wherein said sending said query result comprises: sending said query result through a preset query interface, said query interface comprising at least one of the following: a JDBC API interface, or a REST API interface.
The method according to claim 1, wherein said correcting said erroneous data according to the offline data of this processing layer comprises: using a stream data processing engine to correct said erroneous data according to the offline data of this processing layer.
The method according to any one of claims 1-5, wherein the data to be processed includes log data and business data, and the obtaining data to be processed includes:

obtaining the business data from a business database by means of change data capture (CDC), and obtaining the log data by means of a log collection system.
A data processing device, comprising an acquisition module, a first processing module and a second processing module, wherein the second processing module forms a data application layer, the first processing module comprises a plurality of processing layers, the processing layers form a processing link, and each of the processing layers comprises a first processing unit and a second processing unit;

The acquisition module is configured to acquire data to be processed, the data to be processed including real-time data;

The first processing module is configured to process the data to be processed layer by layer according to the processing link to obtain first data;

The first processing unit is configured to process the data input to the processing layer to obtain processed data, the processed data being real-time data, to store the processed real-time data in a Hive module based on a Flink stream, and to input the processed data to the next processing layer, the first data being the processed data obtained by the last processing layer in the processing link; and to receive the corrected data sent by the second processing unit and input the corrected data to the first processing unit of the next processing layer, so that the first processing unit of the next processing layer processes the input data;

The second processing unit is configured to, in response to a data error occurring in the processing layer, correct the erroneous data according to the offline data of the processing layer to obtain the corrected data, and send the corrected data to the first processing unit;

The second processing module is configured to process the first data in the data application layer to obtain second data.
A computer device comprising:

one or more processors;

a storage device having one or more programs stored thereon;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the batch-flow integration-based data processing method according to any one of claims 1-6.
A computer-readable medium, on which a computer program is stored, wherein, when the program is executed, the batch-flow integration-based data processing method according to any one of claims 1-6 is realized.
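
As a non-limiting illustration only, the layer-by-layer processing and correction flow recited in claim 1 may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: an in-memory list stands in for the Hive module, plain Python functions stand in for Flink jobs, and every name used (ProcessingLayer, run_link, application_layer, correct_and_replay, the sample transforms, and the rule that a negative value is erroneous) is hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, int]


class ProcessingLayer:
    """One processing layer of the processing link (hypothetical stand-in)."""

    def __init__(self, name: str, transform: Callable[[Record], Record]) -> None:
        self.name = name
        self.transform = transform
        # Assumption: an in-memory list replaces the Hive module that holds
        # this layer's offline data.
        self.offline_store: List[Record] = []

    def process(self, records: List[Record]) -> List[Record]:
        processed = [self.transform(r) for r in records]
        # Keep a copy of the processed data as this layer's offline data.
        self.offline_store = list(processed)
        return processed


def run_link(layers: List[ProcessingLayer], records: List[Record],
             start: int = 0) -> List[Record]:
    """Pass records through layers[start:] in order and return the first data."""
    data = records
    for layer in layers[start:]:
        data = layer.process(data)
    return data


def application_layer(first_data: List[Record]) -> Dict[str, int]:
    """Data application layer: derive the second data (here, a simple total)."""
    return {"total": sum(r["value"] for r in first_data)}


def correct_and_replay(layers: List[ProcessingLayer], bad_index: int,
                       is_bad: Callable[[Record], bool],
                       repair: Callable[[Record], Record]) -> Dict[str, int]:
    """Correct a layer's offline data, then feed it to the following layers."""
    layer = layers[bad_index]
    corrected = [repair(r) if is_bad(r) else r for r in layer.offline_store]
    layer.offline_store = corrected
    first_data = run_link(layers, corrected, start=bad_index + 1)
    return application_layer(first_data)


# Two hypothetical layers and three hypothetical input records.
layers = [
    ProcessingLayer("layer_1", lambda r: {"id": r["id"], "value": r["value"] * 2}),
    ProcessingLayer("layer_2", lambda r: {"id": r["id"], "value": r["value"] + 1}),
]
incoming = [{"id": 1, "value": 1}, {"id": 2, "value": -5}, {"id": 3, "value": 3}]

second_data = application_layer(run_link(layers, incoming))

# Assume the negative value produced in layer_1 is the detected error.
corrected_second_data = correct_and_replay(
    layers,
    bad_index=0,
    is_bad=lambda r: r["value"] < 0,
    repair=lambda r: {**r, "value": 0},
)
```

In this sketch only the layers after the one containing the error are re-run, which mirrors the claimed step of inputting the corrected data to the next processing layer rather than reprocessing the whole link from the source.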