CN112667614A - Data processing method and device and computer equipment

Info

Publication number: CN112667614A
Application number: CN202011563032.4A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data, processing, queue, processed, real
Inventor: 唐杰
Assignee (current and original): Volkswagen Mobvoi Beijing Information Technology Co Ltd
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data processing method, a data processing device and computer equipment. The method comprises the following steps: acquiring data to be processed and adding the data to be processed to a first data message queue; processing each piece of data to be processed in the first data message queue in a streaming data processing mode based on a Flink real-time processing framework to obtain queue processing data; adding the queue processing data to a second data message queue; and performing real-time data processing on each piece of queue processing data in the second data message queue based on the Flink real-time processing framework. The technical scheme ensures the consistency and the real-time performance of the data processing process.

Description

Data processing method and device and computer equipment
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data processing method, a data processing device and computer equipment.
Background
With the rapid development of the internet, data is becoming increasingly diverse and is frequently generated in real time. Processing such big data relies on technologies such as distributed processing and distributed databases, and ensuring the consistency and real-time performance of the data throughout the processing flow has always been a key issue in data processing.
Currently, in the field of data processing, there are generally two task types: batch computation and real-time stream computation. Flink is an open source data platform oriented to both distributed real-time stream processing and batch data processing; running on the same Flink real-time processing framework, it can support both stream processing tasks and batch processing tasks. To ensure data consistency in a real-time processing system, it is usually necessary to perform either an idempotent write operation or a transactional write operation on the data. An idempotent write operation means that writing a piece of data to a system any number of times affects the target system only once; it requires the data itself to be idempotent. A transactional write operation, combined with Flink's consistency checkpoint (Checkpoint) mechanism, ensures that the external output is affected only once, but only data confirmed by a checkpoint can be written externally; because there is a certain time interval between checkpoints, the real-time performance of the data is reduced. Therefore, how to keep the data consistent and real-time during processing based on the Flink real-time processing framework is a problem to be solved urgently.
Disclosure of Invention
The embodiment of the invention provides a data processing method, a data processing device and computer equipment, which are used for ensuring the consistency and the real-time performance of a data processing process.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
acquiring data to be processed, and adding the data to be processed to a first data message queue;
processing each piece of data to be processed in the first data message queue in a streaming data processing mode based on a Flink real-time processing framework to obtain queue processing data;
adding the queue processing data to a second data message queue;
and performing real-time data processing on each piece of queue processing data in the second data message queue based on the Flink real-time processing framework.
In a second aspect, an embodiment of the present invention further provides a data processing apparatus, including:
the first data message queue generating module is configured to acquire data to be processed and add the data to be processed to the first data message queue;
the queue processing data generating module is configured to process each piece of data to be processed in the first data message queue in a streaming data processing mode based on a Flink real-time processing framework to obtain queue processing data;
the second data message queue generating module is configured to add the queue processing data to a second data message queue;
and the real-time data processing module is configured to perform real-time data processing on each piece of queue processing data in the second data message queue based on the Flink real-time processing framework.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data processing method according to any embodiment of the present invention.
In a fourth aspect, the embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data processing method according to any embodiment of the present invention.
In the technical scheme provided by the embodiment of the invention, the acquired data to be processed is added to the first data message queue, and each piece of data to be processed in the first data message queue is processed in a streaming data processing mode based on the Flink real-time processing framework to obtain queue processing data. The queue processing data is then added to the second data message queue, so that each piece of queue processing data in the second data message queue can be processed in real time based on the Flink real-time processing framework. By using two data message queues, data can be processed in real time based on the Flink real-time processing framework without any special requirement on the data itself, which solves the problem that consistency and real-time performance are difficult to guarantee effectively in the existing data processing process, and ensures the consistency and real-time performance of the data processing process.
Drawings
Fig. 1 is a schematic flow chart of a data processing method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a data processing method according to a second embodiment of the present invention;
fig. 3 is a schematic flow chart of data breakpoint resuming according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a data cleaning and processing operation according to a second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention;
fig. 6 is a schematic hardware configuration diagram of a computer device in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a data processing method provided in an embodiment of the present invention, where the embodiment of the present invention is applicable to a case where data of any type is processed based on a Flink real-time processing framework to ensure consistency and real-time performance of a data processing process, and the method may be executed by a data processing apparatus provided in an embodiment of the present invention, where the apparatus may be implemented in software and/or hardware, and may be generally integrated in a computer device.
As shown in fig. 1, the data processing method provided in this embodiment specifically includes:
and S110, acquiring data to be processed, and adding the data to be processed to the first data message queue.
The data to be processed may be log data of various sources, types and formats, for example buried point data (i.e., event tracking data), log files or external data. That is, the data to be processed in the embodiment of the present invention may be any type of data and is not required to satisfy idempotency.
In the embodiment of the invention, the data to be processed can be acquired by a data collector. The data collector may be based on any technology having a data collecting function, which is not specifically limited in this embodiment. Optionally, huge amounts of data to be processed of various kinds may be collected in a centralized manner by a data collector based on the Flume technology; for example, such a collector may collect log data from a website server, so that collection, aggregation and transmission of massive distributed data such as logs can be realized, which further ensures the stability and fault tolerance of the system.
A data message queue refers to a sequence of messages in which data can be written and read. Optionally, the first data message queue may be a distributed Kafka message queue, the data to be processed is written into the Kafka message queue, and when the data is processed, the data may be read from the Kafka message queue, so that a problem that a speed of acquiring the data to be processed is inconsistent with a speed of processing the data is avoided, and stability of the system is further ensured.
After the data to be processed is acquired, the acquired data to be processed is added to the first data message queue for subsequent processing of the data to be processed in the first data message queue.
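By way of a purely illustrative example, when the first data message queue is a Kafka message queue, the acquired data to be processed can be appended to it with a standard Kafka producer, as in the following Java sketch; the broker address, the topic name "raw-data" and the record content are assumptions made for the example and are not limiting.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RawDataProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Append each piece of acquired data to be processed to the first data
        // message queue (here an assumed topic named "raw-data").
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String toBeProcessed = "{\"lat\":39.9,\"lng\":116.4,\"ts\":1608883200}"; // example record
            producer.send(new ProducerRecord<>("raw-data", toBeProcessed));
        }
    }
}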
And S120, processing each data to be processed in the first data message queue by adopting a streaming data processing mode based on the Flink real-time processing framework to obtain queue processing data.
The Flink real-time processing framework is an open source data platform oriented to both distributed real-time stream processing and batch data processing; the same Flink real-time processing framework can support both stream processing and batch processing while running.
The streaming data processing refers to processing one or more event streams in real time, that is, processing one piece of data in real time after each piece of data is read, so that the real-time property of the data is ensured.
The queue processing data refers to the data obtained by processing each piece of data to be processed in the first data message queue in a streaming data processing mode based on the Flink real-time processing framework.
Each piece of data to be processed in the first data message queue is processed in a streaming data processing mode based on the Flink real-time processing framework, so that queue processing data is obtained and the consistency and real-time performance of the data are ensured. When processing each piece of data to be processed, an external service may be invoked to assist the processing. For example, during data cleaning, the longitude and latitude in the acquired data to be processed need to be converted into a specific position; at this time, an external service (such as an electronic map) may be called to accurately locate, on the map, the specific position corresponding to the longitude and latitude in the data to be processed. The cleaning process is the process of converting the data to be processed into directional data; for example, longitude and latitude data may be converted into specific location information after cleaning.
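As an illustrative sketch of this cleaning step (and not a limiting implementation), the following Java fragment reads the data to be processed from an assumed Kafka topic and converts each record with a hypothetical GeoCoder helper standing in for the external map service; all names are assumptions for the example.

import java.util.Properties;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class CleaningJob {

    /** Hypothetical wrapper for the external map service (an assumption for this sketch). */
    public static class GeoCoder {
        public static String resolve(String rawRecord) {
            // A real job would call the external electronic-map service here;
            // this stub only marks the record as resolved.
            return rawRecord + " -> resolved location";
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "cleaning-job");            // assumed consumer group

        // Read each piece of data to be processed from the first data message queue
        // (assumed Kafka topic "raw-data").
        DataStream<String> toBeProcessed = env.addSource(
                new FlinkKafkaConsumer<>("raw-data", new SimpleStringSchema(), props));

        // Streaming cleaning: each record is converted to directional data as soon
        // as it is read, e.g. longitude/latitude resolved to a concrete location.
        DataStream<String> queueProcessed = toBeProcessed.map(new MapFunction<String, String>() {
            @Override
            public String map(String record) {
                return GeoCoder.resolve(record);
            }
        });

        queueProcessed.print(); // placeholder sink; writing to the second queue is sketched later
        env.execute("cleaning-job");
    }
}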
Optionally, before processing each to-be-processed data in the first data message queue based on the Flink real-time processing framework, the method may further include: reading each data to be processed in the first data message queue; and storing each data to be processed to the distributed file system.
Distributed file system refers to a file system for storing and managing data, wherein the stored data can be communicated and transmitted between nodes through a computer network, and the nodes can be distributed at any position in a data communication link. The data to be processed are stored in the distributed file system, and the data to be processed can be expanded to the whole network, so that the expandability, the stability and the execution efficiency of the system are enhanced.
Before each piece of data to be processed in the first data message queue is processed based on the Flink real-time processing framework, each piece of data to be processed in the first data message queue is read and stored in the distributed file system. In this way the acquired original data, which has not been processed in any way, is retained, ensuring the originality and integrity of the data; the retained original data can be regularly compared with the real-time streaming data and batch data processed by the Flink real-time processing framework to verify the accuracy of the data processing results.
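For illustration only, the raw data read from the first data message queue could be retained in a distributed file system such as HDFS with Flink's StreamingFileSink, as sketched below; the HDFS path and topic name are assumptions.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class RawDataArchiver {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "raw-archiver");            // assumed consumer group

        // Read the unmodified data to be processed from the first data message queue.
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer<>("raw-data", new SimpleStringSchema(), props));

        // Keep the original data in the distributed file system for later
        // comparison and verification against the processed results.
        StreamingFileSink<String> hdfsSink = StreamingFileSink
                .forRowFormat(new Path("hdfs://namenode:8020/archive/raw"),  // assumed path
                              new SimpleStringEncoder<String>("UTF-8"))
                .build();
        raw.addSink(hdfsSink);

        env.execute("raw-data-archiver");
    }
}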
And S130, adding the queue processing data to a second data message queue.
Alternatively, the second data message queue may be of the same type as the first data message queue, and may also be a distributed Kafka message queue. And adding queue processing data which is obtained by processing based on a Flink real-time processing framework and by adopting a streaming data processing mode to a second data message queue for subsequent other application processing, such as real-time stream data analysis and calculation, batch data analysis and calculation and the like. At this time, the data in the second data message queue is obtained by processing the acquired data to be processed.
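As a non-limiting sketch, the queue processing data can be appended to the second data message queue by attaching a Kafka producer sink to the cleaned stream from the previous sketch; the topic name "cleaned-data" and broker address are assumptions.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class SecondQueueSink {
    // Attaches a sink that appends each piece of queue processing data to the
    // second data message queue (an assumed Kafka topic named "cleaned-data").
    public static void attach(DataStream<String> queueProcessed) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker

        queueProcessed.addSink(
                new FlinkKafkaProducer<>("cleaned-data", new SimpleStringSchema(), props));
    }
}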
And S140, performing real-time data processing on each queue processing data in the second data message queue based on the Flink real-time processing framework.
The real-time data processing refers to processing, such as analysis and calculation, of data or data volumes that require real-time characteristics, for example counting the number of unique visitors (UV) of a website at the current time, counting the page views (PV) at the current time, and counting other indexes with strong real-time requirements such as the current total visit amount.
After the queue processing data is added to the second data message queue, real-time data processing is performed on each queue processing data in the second data message queue based on a Flink real-time processing framework, so that a corresponding real-time processing result can be obtained.
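By way of example only, a real-time page view (PV) count over the queue processing data could be expressed as a Flink windowed aggregation as sketched below; the record layout (page identifier as the first comma-separated field), the topic name and the window size are assumptions for the example.

import java.util.Properties;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class RealTimePvJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "realtime-pv");             // assumed consumer group

        // Each piece of queue processing data is read from the second data message queue.
        DataStream<String> queueProcessed = env.addSource(
                new FlinkKafkaConsumer<>("cleaned-data", new SimpleStringSchema(), props));

        // Count page views per page over one-minute windows; the page identifier is
        // assumed to be the first comma-separated field of each record.
        queueProcessed
                .map(new MapFunction<String, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(String record) {
                        return Tuple2.of(record.split(",")[0], 1L);
                    }
                })
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> value) {
                        return value.f0;
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print();

        env.execute("realtime-pv");
    }
}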
In the data processing process, two message queues, namely the first data message queue and the second data message queue, are adopted. This solves the problem that the speed of acquiring the data to be processed is inconsistent with the speed of processing the data, and when a system failure occurs it ensures that each piece of data is processed exactly once, avoiding repeated or multiple processing, so that each piece of data influences the final processing result only once and the consistency of the data is guaranteed.
As an optional implementation manner, after performing real-time data processing on each queue processing data in the second data message queue, the method may further include:
reading the processing data of each queue from the second data message queue and storing the processing data in a static database; the static database is used for processing the batch data of the processing data of each queue;
storing the real-time data processing result and the batch data processing result in a result database; or
And reading the processing data of each queue from the second data message queue and sending the processing data to the target application program through a preset interface.
A static database refers to a database that can store offline data and other non-real-time data. Optionally, the static database may be a Hive data warehouse, which can be used to process, query and analyze the data. Because the data volume that a data message queue can bear is limited, if the data volume in the data message queue has reached its maximum and new data is written, the earliest data in the queue is automatically deleted, so complete data cannot be retained; batch data processing over the queue processing data would then be less accurate because of the missing data. Therefore, each piece of queue processing data in the second data message queue is stored in the static database, which preserves the integrity of the queue processing data in the second data message queue, enables batch data processing of the queue processing data with improved accuracy, and also makes the data available for other data analysis and calculation in the future.
The batch data processing refers to processing such as analyzing and calculating data or data amount in a period of time, for example, the batch data such as the total browsing amount of a webpage in a specific period of time and the total transaction amount of each day in the last month may be counted based on each queue processing data in the second data message queue stored in the static database.
The result database refers to a database that can store the real-time data processing results and the batch data processing results so as to serve as a cache. For example, the result database may be a MySQL database or the like. When a real-time data processing result or a batch data processing result obtained by previous processing needs to be accessed again, the corresponding processing result can be read directly from the result database without performing the real-time data processing or batch data processing again, so that the data results are reusable, repeated processing of the data is reduced, and the data processing efficiency is improved.
After real-time data processing is performed on each piece of queue processing data in the second data message queue, each piece of queue processing data can be read from the second data message queue and stored in a static database used for batch data processing of the queue processing data. The real-time data processing results and the batch data processing results can then be stored in a result database as a cache, so that they can be read directly in subsequent data processing, avoiding repeated calculation. This applies a modular design and decouples the two functions of real-time stream processing and batch data processing in the Flink real-time processing framework.
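As an illustrative sketch under the assumption that the static database is a Hive data warehouse and the result database is MySQL, a batch indicator could be computed and cached with plain JDBC as follows; the connection strings, table names and SQL are examples only, and the Hive and MySQL JDBC drivers are assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BatchToResultCache {
    public static void main(String[] args) throws Exception {
        // Batch data processing: aggregate one day's records in the static database (Hive).
        try (Connection hive = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "user", "");   // assumed
             Statement stmt = hive.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT count(*) FROM queue_processed_data WHERE dt = '2020-12-24'")) {
            rs.next();
            long dailyTotal = rs.getLong(1);

            // Cache the batch result in the result database (MySQL) so that later
            // queries can read it directly instead of recomputing it.
            try (Connection mysql = DriverManager.getConnection(
                         "jdbc:mysql://mysql-host:3306/results", "user", "password"); // assumed
                 PreparedStatement ps = mysql.prepareStatement(
                         "INSERT INTO daily_total (dt, total) VALUES (?, ?)")) {
                ps.setString(1, "2020-12-24");
                ps.setLong(2, dailyTotal);
                ps.executeUpdate();
            }
        }
    }
}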
The preset interface refers to a preset interface through which the queue processing data in the second data message queue can be transmitted to a third-party application program. The preset interface may be an Application Programming Interface (API). When a third-party application program needs to acquire the queue processing data in the second data message queue, each piece of queue processing data in the second data message queue can be obtained through the preset interface.
The target application program refers to a third-party application program that needs to acquire each queue processing data in the second data message queue, for example, if a certain map APP needs to acquire processed real-time data, the map APP is the target application program.
After the real-time data processing is performed on each queue processing data in the second data message queue, each queue processing data can be read from the second data message queue and sent to the target application program through the preset interface, so that the framework expansion of the Flink real-time processing framework is realized, the data requirements corresponding to different application programs are met, and the flexibility of the framework is improved.
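Purely as an illustrative sketch, the preset interface could be an HTTP endpoint exposed by the target application program, to which each piece of queue processing data read from the second data message queue is forwarded; the endpoint URL, the topic name and the use of HTTP are assumptions for the example, not the claimed implementation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QueueToTargetApp {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker
        props.put("group.id", "target-app-forwarder");             // assumed consumer group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        HttpClient http = HttpClient.newHttpClient();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cleaned-data")); // second queue (assumed topic)
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Forward each piece of queue processing data to the target
                    // application program through the preset (assumed) HTTP interface.
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create("http://target-app.example.com/api/data")) // assumed URL
                            .header("Content-Type", "application/json")
                            .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                            .build();
                    http.send(request, HttpResponse.BodyHandlers.ofString());
                }
            }
        }
    }
}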
According to the technical scheme provided by the embodiment of the invention, the acquired data to be processed is added to the first data message queue, and each piece of data to be processed in the first data message queue is processed in a streaming data processing mode based on the Flink real-time processing framework to obtain queue processing data. The queue processing data is then added to the second data message queue, so that each piece of queue processing data in the second data message queue can be processed in real time based on the Flink real-time processing framework. By using two data message queues, data can be processed in real time based on the Flink real-time processing framework without any special requirement on the data itself, which solves the problem that consistency and real-time performance are difficult to guarantee effectively in the existing data processing process, and ensures the consistency and real-time performance of the data processing process.
Example two
Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention. The embodiment is embodied on the basis of the above embodiment, wherein the to-be-processed data may be added to the first data message queue, specifically:
determining a first target partition sequence number and a first target partition position corresponding to data to be processed;
and adding the data to be processed to the first target data partition in the first data message queue according to the first target partition sequence number and the first target partition position.
Further, processing each to-be-processed data in the first data message queue by adopting a streaming data processing mode based on the Flink real-time processing framework may include:
determining a current partition sequence number and a data processing progress identifier corresponding to a current data partition;
determining the current data to be processed according to the current partition serial number and the data processing progress mark;
and processing the current data to be processed in real time based on a Flink real-time processing framework.
As shown in fig. 2, the data processing method provided in this embodiment specifically includes:
s210, acquiring data to be processed, and determining a first target partition sequence number and a first target partition position corresponding to the data to be processed.
S220, adding the data to be processed to the first target data partition in the first data message queue according to the first target partition sequence number and the first target partition position.
The target partition sequence number refers to the sequence number corresponding to each data partition, for example partition 0 (job0), partition 1 (job1), and so on; all data are distributed to different data partitions by this sequence number. Big data is distributed data, so the data set can be distributed to different data partitions in a data partitioning mode, and the data can then be processed or queried across multiple data partitions, which improves the efficiency of data management and query. A data partition may distribute data to different areas of the same disk, or to different disks or machine devices, which is not specifically limited in the present invention. The first target partition sequence number may be the sequence number of the first target data partition in the first data message queue in which the current data to be processed is located. The first target data partition is the partition in the first data message queue to which the current data to be processed is added.
The target partition position refers to a specific storage position of each piece of data in the target data partition with the corresponding sequence number. The data can be placed at the specific target partition position of the corresponding target data partition, and the corresponding data can be inquired according to the target partition position. According to the position of the target partition corresponding to the data, the specific position of the data in the target data partition in the data message queue can be determined. Wherein the target partition position may be represented by an offset (offset). The first target partition location may be a specific queue location of the current data to be processed in the first target data partition in the first data message queue.
After the data to be processed is obtained, a first target partition sequence number and a first target partition position corresponding to the data to be processed need to be determined, and then the data to be processed is added to a first target data partition in a first data message queue according to the first target partition sequence number and the first target partition position.
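For illustration, when the first data message queue is a Kafka message queue, the first target partition sequence number corresponds to a Kafka partition and the first target partition position corresponds to the offset of the record within that partition. The following sketch writes a record to an explicit partition and reads back the offset it received; the broker, topic, partition number and record content are assumed example values.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class PartitionedWrite {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            int firstTargetPartition = 2;        // first target partition sequence number (example)
            String toBeProcessed = "39.9,116.4"; // example data to be processed

            // Write the data into the chosen partition of the first queue; the returned
            // metadata carries the partition position (offset) the record received.
            RecordMetadata meta = producer.send(new ProducerRecord<>(
                    "raw-data", firstTargetPartition, null, toBeProcessed)).get();
            System.out.println("partition=" + meta.partition() + ", offset=" + meta.offset());
        }
    }
}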
And S230, processing each data to be processed in the first data message queue by adopting a streaming data processing mode based on the Flink real-time processing framework.
It is worth pointing out that, during data processing, the system may be affected by various unexpected factors, such as a sudden increase of traffic, network jitter, problems in cloud service resource allocation, or downtime or breakdown because the system is over-stressed. In such cases, the Flink real-time processing framework restarts the job. To ensure consistency of the data processing process, that is, to ensure that the data processing result obtained after the fault is handled and the system recovers is as correct as the result that would have been obtained without any fault (in other words, that the occurrence of the system fault does not affect the obtained data processing result), a Checkpoint (consistency checkpoint) mechanism of Flink is usually adopted to ensure that the external output is affected only once; however, a certain time interval exists between checkpoints, which may reduce the real-time performance of the data. Therefore, when the Flink real-time processing framework starts to process data normally, or when the system fails and the Flink real-time processing framework is restarted, the embodiment of the invention processes the data in a closed-loop mode in order to simultaneously ensure the consistency and real-time performance of the data processing process and realize breakpoint resuming of the data.
Wherein S230 may specifically include the following operations S231-S233:
s231, determining a current partition sequence number corresponding to the current data partition and a first data processing progress identifier.
The current data partition may be a data partition in which to-be-processed data currently being processed is located.
The data processing progress identifier refers to an identifier that indicates the progress of data processing. In the embodiment of the present invention, the first data processing progress identifier may be represented by the last partition position in the first data message queue, and is used to identify the data processing progress of the first data message queue. The last partition position in the first data message queue refers to the first target partition position corresponding to the last piece of data to be processed, in the currently processed data partition of the first data message queue, whose real-time processing based on the Flink real-time processing framework was completed before the Flink real-time processing framework was normally started or before the system failed and the Flink real-time processing framework was restarted.
And determining a current partition sequence number and a data processing progress identifier corresponding to the current data partition so as to be used for determining the current data to be processed.
And S232, determining the current data to be processed according to the current partition sequence number and the first data processing progress mark.
And determining the current data to be processed in the currently processed data partition in the first data message queue according to the serial number of the current partition and the first data processing progress mark, so as to process the current data to be processed in real time based on a Flink real-time processing framework.
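A minimal sketch of this resumption step, assuming the first data message queue is a Kafka topic, is shown below: the Flink Kafka consumer is started from the partition and offset recorded as the current partition sequence number and first data processing progress identifier. The partition number and offset are example values only.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;

public class ResumeFromProgress {
    public static FlinkKafkaConsumer<String> build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "cleaning-job");            // assumed consumer group

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("raw-data", new SimpleStringSchema(), props);

        // Current partition sequence number 2 with data processing progress identifier 4:
        // after a restart, reading continues from offset 4 of partition 2, so the data
        // to be processed is picked up exactly where processing stopped.
        Map<KafkaTopicPartition, Long> progress = new HashMap<>();
        progress.put(new KafkaTopicPartition("raw-data", 2), 4L);
        consumer.setStartFromSpecificOffsets(progress);

        return consumer;
    }
}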
And S233, processing the current data to be processed in real time based on the Flink real-time processing framework to obtain queue processing data.
And processing the current data to be processed in the first data message queue in real time based on a Flink real-time processing framework to obtain queue processing data, wherein the current data to be processed can be processed in a streaming data processing mode.
Optionally, the processing of the current data to be processed in real time based on the Flink real-time processing framework may include: and converting the current data to be processed into target directional data in real time based on a Flink real-time processing framework.
The target directional data is data containing certain information that can be intuitively understood by a person, for example information on a specific position or on a certain product.
For example, when the Flink-based real-time processing framework is used for processing the current data to be processed in real time, the current data to be processed may be cleaned, so as to convert the current data to be processed into the target directional data in real time. For example, the current data to be processed is longitude and latitude, and after the real-time processing based on the Flink real-time processing framework is carried out, the longitude and latitude can be converted into specific position information which can be visually seen by people; for another example, the current data to be processed is the ID of a certain product, and after the data is processed in real time based on the Flink real-time processing framework, the ID of the product can be converted into specific commodity information. And converting the current data to be processed into target directional data in real time based on a Flink real-time processing framework, and finishing the cleaning processing of the current data to be processed.
And S240, adding the queue processing data to the second data message queue.
And adding queue processing data obtained by processing in a streaming data processing mode based on a Flink real-time processing frame to a second data message queue.
Optionally, adding the queue processing data to the second data message queue may include: determining a second target partition sequence number and a second data processing progress identifier corresponding to the queue processing data; determining a second target partition position of the queue processing data in the second target partition sequence number according to the second target partition sequence number and the second data processing progress mark; and adding the queue processing data to a second target data partition in the second data message queue according to the second target partition sequence number and the second target partition position.
The second target partition sequence number may be the sequence number of the second target data partition, in the second data message queue, in which the processed queue processing data is located. The second target partition position may be the specific queue position of the queue processing data within the second target data partition of the second data message queue. The second target data partition is the partition in the second data message queue to which the processed queue processing data is added. When the queue processing data is added to the second data message queue, the second target partition sequence number and the second data processing progress identifier corresponding to the queue processing data may be determined first. The last partition position in the second data message queue refers to the second target partition position corresponding to the last piece of queue processing data added to the second data message queue before the normal start of the Flink real-time processing framework or before the system was restarted due to a fault. Then the second target partition position of the queue processing data within the second target partition is determined according to the second target partition sequence number and the second data processing progress identifier, the second target partition position corresponding to the second data processing progress identifier. Finally, the queue processing data is added to the second target data partition in the second data message queue according to the second target partition sequence number and the second target partition position.
It should be noted that, when each piece of to-be-processed data in the first data message queue is processed in a streaming data processing manner based on the Flink real-time processing framework, the determined first data processing progress identifier is used to lock the data processing progress of the first data message queue. Similarly, when the queue processing data is added to the second data message queue, the determined second data processing progress mark is used for locking the data adding progress of the second data message queue. That is, the first data processing progress marker and the second data processing progress marker are two different progress markers.
For example, assume that the first target partition sequence number of the first target data partition in which the current data to be processed is located in the first data message queue is 2, and the corresponding first data processing progress identifier is 04, which indicates that the 4th piece of data in data partition 2 of the first data message queue is currently being processed. Because the data processing speed may not match the speed at which data to be processed is acquired, data to be processed may already have been added up to first target partition position 08 in first target data partition 2, and possibly even into first target data partition 3. In this case, if the system restarts the Flink real-time processing framework after a failure occurs during data processing, then when the current data to be processed is re-determined, it must be determined according to the current partition sequence number corresponding to the current data partition (i.e., partition 2) and the first data processing progress identifier (04); the current data to be processed is then processed based on the Flink real-time processing framework.
It should be noted that, if the speed of acquiring the to-be-processed data is consistent with the data processing speed, that is, after one to-be-processed data is acquired, the to-be-processed data is immediately processed, and then the next to-be-processed data that is just acquired is immediately processed, in this case, the current partition sequence number corresponding to the current data partition corresponds to the first target partition sequence number of the to-be-processed data in the first target data partition, and the first data processing progress identifier corresponds to the first target partition position, so that the current to-be-processed data can be determined only according to the first target partition sequence number and the first target partition position in the first data message queue.
For another example, when queue processing data obtained after the current data to be processed is processed based on the Flink real-time processing framework is added to the second data message queue, considering the problem that the data processing speed is not consistent with the speed of adding data to the second data message queue, it is necessary to determine the second target partition number and the second data processing progress identifier corresponding to the queue processing data, and assume that the corresponding second target partition number is 2 and the second data processing progress identifier is 04, and then it may be determined that the queue processing data should be added to the second target partition position 04 in the second target partition 2, and according to the second target partition number 2 and the second target partition position 04, the queue processing data is added to the second target data partition 2 in the second data message queue.
It should be noted that, if the data processing speed is consistent with the speed of adding the queue processing data to the second data message queue, that is, after a piece of data to be processed is processed, the obtained queue processing data can be immediately added to the second data message queue, and then the obtained next queue processing data can be immediately added, in this case, the second target partition number corresponds to the first target partition number, and the second data processing progress identifier corresponds to the first data processing progress identifier, so that the obtained queue processing data can be added to the second data message queue according to the first target partition number and the first data processing progress identifier.
It can be understood that the first data message queue and the second data message queue have the same data storage mode, that is, the same data storage positions in the two data message queues are the same. For example, if the first target partition sequence number corresponding to the data to be processed in the first data message queue is 0 and the first target partition position is 2, when queue processing data obtained by processing the data to be processed in real time is added to the second data message queue, the corresponding second target partition sequence number is 0 and the second target partition position is 2.
Fig. 3 is a schematic flow chart of breakpoint resuming according to an embodiment of the present invention, and in a specific example, as shown in fig. 3, when a speed of acquiring data to be processed, a data processing speed, and a speed of adding queue processing data to a second data message queue are not consistent with each other, and data is processed and transmitted based on a Flink real-time processing framework, the obtained queue processing data may be added to a second target data partition in a corresponding second data message queue according to a second target sequence number and a second target partition position. When the system has a fault in the data processing process and restarts the Flink real-time processing framework, the current partition serial number corresponding to the current data partition and the data processing progress (namely, a first data processing progress identifier) of the first data message queue can be determined, then the current data to be processed is determined according to the current partition serial number in the first data message queue and the first data processing progress identifier, and the current data to be processed is continuously processed based on the Flink real-time processing framework, so that the data to be processed can be continuously processed at the position where the data is processed before the system fault, and the breakpoint continuous transmission of the data is realized. And then, the second target partition position of the queue processing data in the second target partition sequence number can be determined according to the second target partition sequence number and the data adding progress (i.e. the second data processing progress identifier) of the second data message queue, and the queue processing data continues to be added to the second target data partition in the second data message queue.
And S250, performing real-time data processing on each queue processing data in the second data message queue based on the Flink real-time processing framework.
For those parts of this embodiment that are not explained in detail, reference is made to the aforementioned embodiments, which are not repeated herein.
Fig. 4 is a schematic flow chart of data cleaning and processing according to an embodiment of the present invention. In a specific example, as shown in fig. 4, data to be cleaned is collected by a data collector based on the Flume technology, where the collected data to be cleaned may include log files, buried point data or external data. The collected data to be cleaned is added to a first data message queue. Each piece of data to be cleaned in the first data message queue may then be read and stored in a distributed file system, to retain the original data for comparison and verification; each piece of data to be cleaned in the first data message queue may also be cleaned in a streaming processing manner based on the Flink real-time processing framework, and the obtained queue processing data is added to a second data message queue, so that real-time data processing can be performed on each piece of queue processing data in the second data message queue based on the Flink real-time processing framework. In addition, each piece of queue processing data can be read from the second data message queue and stored in the static database for batch data processing, and the batch data processing results and real-time data processing results can be stored in the result database; alternatively, each piece of queue processing data is read from the second data message queue and sent to the target application program through the preset interface. This extends the Flink real-time processing framework, meets the data requirements of different application programs, and improves the flexibility of the framework.
According to the technical scheme, the acquired data to be processed is added to the first target data partition in the first data message queue according to the first target partition sequence number and the first target partition position. When each piece of data to be processed in the first data message queue is then processed in a streaming data processing mode based on the Flink real-time processing framework, the data is processed in a closed-loop mode, namely: the current partition sequence number and the data processing progress identifier corresponding to the current data partition are determined first; the current data to be processed is determined according to the current partition sequence number and the data processing progress identifier; and the current data to be processed is then processed in real time based on the Flink real-time processing framework to obtain queue processing data. This avoids the problems of data inconsistency and reduced real-time performance when the Flink real-time processing framework is normally started or the system is restarted after a fault. The queue processing data is then added to the second data message queue, so that real-time data processing can be performed on each piece of queue processing data in the second data message queue based on the Flink real-time processing framework. Data is thus processed in real time based on the Flink real-time processing framework using two data message queues, a closed-loop mode is adopted in the data processing process, and no special requirement is placed on the data. This solves the problem that the consistency and real-time performance of the existing data processing process are difficult to guarantee effectively, ensures the consistency and real-time performance of the data processing process, and realizes the function of data breakpoint resuming.
EXAMPLE III
Fig. 5 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention, and the third embodiment of the present invention is applicable to a case where any type of data is processed based on a Flink real-time processing framework to ensure consistency and real-time performance of a data processing process, and the apparatus may be implemented in software and/or hardware, and may be generally integrated in a computer device.
As shown in fig. 5, the data processing apparatus specifically includes: a first data message queue generating module 510, a queue processing data generating module 520, a second data message queue generating module 530, and a real-time data processing module 540. Wherein,
a first data message queue generating module 510, configured to obtain data to be processed, and add the data to be processed to a first data message queue;
a queue processing data generating module 520 configured to process each to-be-processed data in the first data message queue in a streaming data processing manner based on a Flink real-time processing framework to obtain queue processing data;
a second data message queue generating module 530 arranged to add the queue processing data to a second data message queue;
a real-time data processing module 540, configured to perform real-time data processing on each queue processing data in the second data message queue based on the Flink real-time processing framework.
According to the technical scheme provided by the embodiment of the invention, the acquired data to be processed is added to the first data message queue, and each piece of data to be processed in the first data message queue is processed in a streaming data processing mode based on the Flink real-time processing framework to obtain queue processing data. The queue processing data is then added to the second data message queue, so that each piece of queue processing data in the second data message queue can be processed in real time based on the Flink real-time processing framework. By using two data message queues, data can be processed in real time based on the Flink real-time processing framework without any special requirement on the data itself, which solves the problem that consistency and real-time performance are difficult to guarantee effectively in the existing data processing process, and ensures the consistency and real-time performance of the data processing process.
Optionally, the first data message queue generating module 510 is specifically configured to:
determining a first target partition sequence number and a first target partition position corresponding to the data to be processed;
and adding the data to be processed to a first target data partition in the first data message queue according to the first target partition sequence number and the first target partition position.
Optionally, the queue processing data generating module 520 further includes: a current partition serial number and processing progress identification determining unit, a current data to be processed determining unit and a current data to be processed processing unit, wherein,
the current partition serial number and processing progress identification determining unit is configured to determine a current partition sequence number and a first data processing progress identifier corresponding to a current data partition;
the current data to be processed determining unit is configured to determine the current data to be processed according to the current partition sequence number and the first data processing progress identifier;
and the current data to be processed processing unit is configured to process the current data to be processed in real time based on the Flink real-time processing framework.
Optionally, the current to-be-processed data processing unit is specifically configured to:
and converting the current data to be processed into target directional data in real time based on the Flink real-time processing framework.
Optionally, the second data message queue generating module 530 is specifically configured to:
determining a second target partition sequence number and a second data processing progress identifier corresponding to the queue processing data;
determining a second target partition position of the queue processing data in the second target partition sequence number according to the second target partition sequence number and the second data processing progress mark;
and adding the queue processing data to a second target data partition in the second data message queue according to the second target partition sequence number and the second target partition position.
Optionally, the apparatus further comprises: a to-be-processed data storage module, wherein the to-be-processed data storage module is specifically configured to read each to-be-processed data in the first data message queue before the Flink-based real-time processing framework processes each to-be-processed data in the first data message queue;
and storing each piece of data to be processed to a distributed file system.
Optionally, the apparatus further comprises: the data processing result storage module or the processing data sending module is specifically set as follows: after the real-time data processing is performed on each queue processing data in the second data message queue, reading each queue processing data from the second data message queue and storing the queue processing data in a static database; the static database is used for processing batch data of the queue processing data;
storing the real-time data processing result and the batch data processing result in a result database; or
The processing data sending module is specifically set as follows: after the real-time data processing is performed on each queue processing data in the second data message queue, each queue processing data is read from the second data message queue and is sent to a target application program through a preset interface.
The data processing device can execute the data processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the data processing method.
Example four
Fig. 6 is a schematic diagram of a hardware structure of a computer device according to a fourth embodiment of the present invention. FIG. 6 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 6 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 6, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, and commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be appreciated that although not shown in FIG. 6, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and performs data processing by running programs stored in the system memory 28, for example to implement the data processing method provided by an embodiment of the present invention. That is, when executing the program, the processing unit implements:
acquiring data to be processed, and adding the data to be processed to a first data message queue;
processing each data to be processed in the first data message queue by adopting a streaming data processing mode based on a Flink real-time processing framework to obtain queue processing data;
adding the queue processing data to a second data message queue;
and performing real-time data processing on each queue processing data in the second data message queue based on the Flink real-time processing framework.
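As a purely illustrative aid (not part of the claimed embodiments), the staged flow just listed can be sketched as a single Flink streaming job that reads the first data message queue, transforms each record, and writes the result to the second data message queue. The sketch below assumes Kafka topics named raw-data and processed-data, a local broker address, and a placeholder transformation; all of these names are hypothetical.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class TwoQueuePipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.setProperty("group.id", "stream-processing-job");   // hypothetical consumer group

        // First data message queue: holds the acquired data to be processed.
        DataStream<String> toBeProcessed = env.addSource(
                new FlinkKafkaConsumer<>("raw-data", new SimpleStringSchema(), props));

        // Streaming transformation that yields the queue processing data;
        // a trivial normalization stands in for the real business logic.
        DataStream<String> queueProcessingData = toBeProcessed.map(value -> value.trim());

        // Second data message queue: consumed afterwards for real-time data processing.
        queueProcessingData.addSink(
                new FlinkKafkaProducer<>("processed-data", new SimpleStringSchema(), props));

        env.execute("two-queue-pipeline-sketch");
    }
}

Keeping an intermediate queue between the two stages is what allows the downstream real-time processing to resume from a stable position, which the later claims express through partition sequence numbers and data processing progress identifiers.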
EXAMPLE five
Embodiment five of the present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the data processing method according to any embodiment of the present application; that is, when executed by the processor, the program implements:
acquiring data to be processed, and adding the data to be processed to a first data message queue;
processing each data to be processed in the first data message queue by adopting a streaming data processing mode based on a Flink real-time processing framework to obtain queue processing data;
adding the queue processing data to a second data message queue;
and performing real-time data processing on each queue processing data in the second data message queue based on the Flink real-time processing framework.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A data processing method, comprising:
acquiring data to be processed, and adding the data to be processed to a first data message queue;
processing each data to be processed in the first data message queue by adopting a streaming data processing mode based on a Flink real-time processing framework to obtain queue processing data;
adding the queue processing data to a second data message queue;
and performing real-time data processing on each queue processing data in the second data message queue based on the Flink real-time processing framework.
2. The method of claim 1, wherein adding the pending data to a first data message queue comprises:
determining a first target partition sequence number and a first target partition position corresponding to the data to be processed;
and adding the data to be processed to a first target data partition in the first data message queue according to the first target partition sequence number and the first target partition position.
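If the first data message queue is realized with Kafka (an assumption; the claim does not name a particular queue), one reading of claim 2 is that the first target partition sequence number selects an explicit partition and the first target partition position corresponds to the offset assigned within it. A minimal sketch under that assumption, with a hypothetical topic name, key, and partition count:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class PartitionedEnqueueSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String toBeProcessed = "{\"deviceId\":\"42\",\"payload\":\"...\"}"; // hypothetical record
            int partitionCount = 8;                                             // assumed partition count

            // First target partition sequence number: derived here from a key hash (an assumption).
            int firstTargetPartition = Math.abs("42".hashCode()) % partitionCount;

            RecordMetadata meta = producer.send(new ProducerRecord<>(
                    "raw-data", firstTargetPartition, "42", toBeProcessed)).get();

            // First target partition position: the offset assigned inside that partition.
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}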
3. The method according to claim 2, wherein the processing each data to be processed in the first data message queue by adopting a streaming data processing mode based on the Flink real-time processing framework comprises:
determining a current partition sequence number corresponding to a current data partition and a first data processing progress identifier;
determining the current data to be processed according to the current partition sequence number and the first data processing progress identifier;
and processing the current data to be processed in real time based on the Flink real-time processing framework.
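Under the same Kafka assumption, the current partition sequence number and the first data processing progress identifier map naturally onto a topic partition and an offset: the consumer is assigned that partition, seeks to the recorded offset, and everything from there on is the current data to be processed. A hedged sketch with illustrative values only:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ProgressResumeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "resume-sketch");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Current partition sequence number and progress identifier (both illustrative values).
            TopicPartition currentPartition = new TopicPartition("raw-data", 3);
            long progressIdentifier = 1200L;

            consumer.assign(Collections.singletonList(currentPartition));
            consumer.seek(currentPartition, progressIdentifier);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                // Everything at or after the progress identifier is the current data to be processed.
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}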
4. The method according to claim 3, wherein the real-time processing the current data to be processed based on the Flink real-time processing framework comprises:
and converting the current data to be processed into target directional data in real time based on the Flink real-time processing framework.
5. The method of any of claims 2-4, wherein adding the queue processing data to a second queue of data messages comprises:
determining a second target partition sequence number and a second data processing progress identifier corresponding to the queue processing data;
determining, according to the second target partition sequence number and the second data processing progress identifier, a second target partition position of the queue processing data within the partition indicated by the second target partition sequence number;
and adding the queue processing data to a second target data partition in the second data message queue according to the second target partition sequence number and the second target partition position.
6. The method according to claim 1, wherein before processing each data to be processed in the first data message queue based on the Flink real-time processing framework, the method further comprises:
reading each piece of to-be-processed data in the first data message queue;
and storing each piece of data to be processed to a distributed file system.
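One hedged way to picture this archiving step is a small Flink job that copies the first data message queue into a distributed file system; the sketch below assumes HDFS and Flink's row-format file sink, and the topic name and target path are placeholders:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ArchiveToDfsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // the file sink finalizes part files on checkpoints

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.setProperty("group.id", "archive-job");

        // Row-format sink writing each raw record to an assumed HDFS path.
        StreamingFileSink<String> dfsSink = StreamingFileSink
                .forRowFormat(new Path("hdfs:///warehouse/raw-data"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .build();

        // Read every piece of data to be processed from the first data message queue and archive it.
        env.addSource(new FlinkKafkaConsumer<>("raw-data", new SimpleStringSchema(), props))
           .addSink(dfsSink);

        env.execute("archive-to-dfs-sketch");
    }
}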
7. The method of claim 1, further comprising, after performing real-time data processing on each queue processing data in the second data message queue:
reading each queue processing data from the second data message queue and storing the queue processing data in a static database, wherein the static database is used for batch processing of the queue processing data;
storing the real-time data processing result and the batch data processing result in a result database; or
and reading each queue processing data from the second data message queue, and sending the queue processing data to a target application program through a preset interface.
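For the first branch of this claim, the queue processing data could be drained from the second data message queue into a static database that later serves batch processing. The sketch below is only an illustration of that idea: it assumes a plain Kafka consumer, a PostgreSQL static database, and a single-column table, none of which are specified by the claim.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QueueToStaticDbSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "batch-loader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             // Static database, table, and credentials are placeholders, not part of the claim.
             Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/static_db", "user", "secret");
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO queue_processing_data (payload) VALUES (?)")) {

            // The second data message queue (hypothetical topic name).
            consumer.subscribe(Collections.singletonList("processed-data"));

            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    insert.setString(1, r.value());
                    insert.executeUpdate(); // stored for later batch processing
                }
            }
        }
    }
}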
8. A data processing apparatus, comprising:
the first data message queue generating module is used for acquiring data to be processed and adding the data to be processed to the first data message queue;
the queue processing data generation module is set to process each data to be processed in the first data message queue by adopting a streaming data processing mode based on a Flink real-time processing frame to obtain queue processing data;
a second data message queue generating module configured to add the queue processing data to a second data message queue;
and the real-time data processing module is configured to perform real-time data processing on each queue processing data in the second data message queue based on the Flink real-time processing framework.
9. The apparatus of claim 8, wherein the first data message queue generating module is specifically configured to:
determining a first target partition sequence number and a first target partition position corresponding to the data to be processed;
and adding the data to be processed to a first target data partition in the first data message queue according to the first target partition sequence number and the first target partition position.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data processing method according to any of claims 1-7 when executing the program.
CN202011563032.4A 2020-12-25 2020-12-25 Data processing method and device and computer equipment Pending CN112667614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011563032.4A CN112667614A (en) 2020-12-25 2020-12-25 Data processing method and device and computer equipment


Publications (1)

Publication Number Publication Date
CN112667614A true CN112667614A (en) 2021-04-16

Family

ID=75409100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011563032.4A Pending CN112667614A (en) 2020-12-25 2020-12-25 Data processing method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN112667614A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553327A (en) * 2021-07-06 2021-10-26 杭州网易云音乐科技有限公司 Data processing method and device, medium and computing equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083378A1 (en) * 2015-09-18 2017-03-23 Salesforce.Com, Inc. Managing processing of long tail task sequences in a stream processing framework
CN106648904A (en) * 2017-01-09 2017-05-10 大连理工大学 Self-adaptive rate control method for stream data processing
CN107577717A (en) * 2017-08-09 2018-01-12 阿里巴巴集团控股有限公司 A kind of processing method, device and server for ensureing data consistency
US20190130004A1 (en) * 2017-10-27 2019-05-02 Streamsimple, Inc. Streaming Microservices for Stream Processing Applications
CN110007913A (en) * 2019-03-21 2019-07-12 佳都新太科技股份有限公司 Visual flow chart of data processing setting method, device, equipment and storage medium
CN110784419A (en) * 2019-10-22 2020-02-11 中国铁道科学研究院集团有限公司电子计算技术研究所 Method and system for visualizing professional data of railway electric affairs
CN110928906A (en) * 2019-11-08 2020-03-27 杭州安恒信息技术股份有限公司 Method for writing carbon data only once based on flink
CN111381977A (en) * 2018-12-29 2020-07-07 北大方正集团有限公司 Message processing method and device
CN112035534A (en) * 2020-09-18 2020-12-04 上海依图网络科技有限公司 Real-time big data processing method and device and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination