CN112905668B - Database derivative method, device and medium based on distributed data stream processing engine - Google Patents


Info

Publication number
CN112905668B
CN112905668B (Application CN202110254713.0A)
Authority
CN
China
Prior art keywords
data
processing engine
stream processing
read
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110254713.0A
Other languages
Chinese (zh)
Other versions
CN112905668A (en
Inventor
张灵星
王海霖
陈黄
张国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongjing Huizhong Technology Co ltd
Original Assignee
Beijing Zhongjing Huizhong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongjing Huizhong Technology Co ltd filed Critical Beijing Zhongjing Huizhong Technology Co ltd
Priority to CN202110254713.0A priority Critical patent/CN112905668B/en
Publication of CN112905668A publication Critical patent/CN112905668A/en
Application granted granted Critical
Publication of CN112905668B publication Critical patent/CN112905668B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2458 - Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 - Distributed queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2457 - Query processing with adaptation to user needs
    • G06F16/24578 - Query processing with adaptation to user needs using ranking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Computing Systems (AREA)
  • Retry When Errors Occur (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A database import method, apparatus, and medium based on a distributed data stream processing engine. The method comprises the following steps: reading a plurality of data records to be imported into a database from a plurality of partitions of a messaging system; storing the plurality of records in a storage unit of the distributed data stream processing engine; importing the data in the storage unit into the database; and triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the checkpointing operation comprises: in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtaining a position parameter of the record currently read from each of the plurality of partitions; storing the position parameters of the plurality of check data; marking each of the plurality of check data with a barrier; and determining that the checkpointing operation is complete in response to all of the barrier-marked check data being successfully read in.

Description

Database import method, device and medium based on distributed data stream processing engine
Technical Field
The present disclosure relates to the field of data processing technology, and in particular, to a method, apparatus, and medium for database import based on a distributed data stream processing engine.
Background
Data processing modes fall mainly into batch processing and stream processing. Flink is a distributed data stream processing engine for stateful computation over unbounded and bounded data streams. Flink is designed to run in all common cluster environments, performing computations at in-memory speed and at any scale, achieving both low latency and high throughput.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
Writing massive amounts of data into a database is time-consuming; an abnormal condition can terminate the import, and after the condition is resolved the data must be rewritten from the beginning. It would be advantageous to provide a mechanism that alleviates, mitigates, or even eliminates one or more of the above problems.
According to an aspect of the present disclosure, there is provided a database import method based on a distributed data stream processing engine, comprising: reading a plurality of data records to be imported into a database from a plurality of partitions of a messaging system; storing the plurality of records in a storage unit of the distributed data stream processing engine; importing the data in the storage unit into the database; and triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the checkpointing operation comprises the following steps: in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtaining a position parameter of the record currently read from each of the plurality of partitions, so that the record currently read from each partition can serve as check data; storing the position parameters of the plurality of check data, so that breakpoint resume of the database import can be achieved based on the stored position parameters; marking each of the plurality of check data with a barrier; and determining that the checkpointing operation is complete in response to all of the barrier-marked check data being successfully read in.
According to another aspect of the present disclosure, there is provided a database import device based on a distributed data stream processing engine, comprising: a reading unit configured to read a plurality of data records to be imported into the database from a plurality of partitions of the messaging system; a writing unit configured to store the plurality of records into a plurality of storage units of the distributed data stream processing engine, respectively; an importing unit configured to import the data in the plurality of storage units into the database; and a triggering unit configured to trigger the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the distributed data stream processing engine comprises a checkpointing unit configured to perform the checkpointing operation, the checkpointing unit comprising: an acquisition module configured to acquire, in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, the position parameter of the record currently read from each of the plurality of partitions, so that the record currently read from each partition can serve as check data; a storage module configured to store the position parameters of the plurality of check data, so that breakpoint resume of the database import can be achieved based on the stored position parameters; a labeling module configured to mark each of the plurality of check data with a barrier; and a determination module configured to determine that the checkpointing operation is complete in response to all of the barrier-marked check data being successfully read in.
According to yet another aspect of the present disclosure, there is provided a computer apparatus comprising: a memory, a processor and a computer program stored on said memory, wherein the processor is configured to execute the computer program to carry out the steps of the above method.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the above method.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program when executed by a processor realizes the steps of the above method.
These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
The accompanying drawings illustrate embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for example only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
Further details, features and advantages of the present disclosure are disclosed in the following description of exemplary embodiments, with reference to the following drawings, wherein:
FIG. 1 illustrates a flowchart of a database import method based on a distributed data stream processing engine, according to an exemplary embodiment;
FIG. 2 illustrates a flowchart of a method for triggering a distributed data stream processing engine to perform a checkpointing operation according to preset rules, in accordance with an embodiment of the present disclosure;
FIGS. 3A-3F are process diagrams illustrating checkpointing, according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a scenario of failure recovery from a checkpoint;
FIG. 5 is a schematic block diagram illustrating a database import device based on a distributed data stream processing engine, in accordance with an exemplary embodiment; and
FIG. 6 is a block diagram illustrating an exemplary computer device that can be applied to the exemplary embodiments.
Detailed Description
In the present disclosure, unless otherwise indicated, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of those elements; such terms are used merely to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, an element whose number is not specifically limited may be one or more. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based at least in part on". Furthermore, the terms "and/or" and "at least one of ..." encompass any and all possible combinations of the listed items.
Before introducing exemplary embodiments of the present disclosure, several terms used herein will first be explained.
1. Streaming data
Streaming data refers to data continuously generated by multiple data sources, typically sent simultaneously in the form of small data records (on the order of a few kilobytes). Streaming data has four characteristics: 1) the data arrives in real time; 2) the order of arrival is independent and not controlled by the application system; 3) the volume of data is large and its maximum cannot be predicted; 4) once processed, the data cannot be re-fetched for processing, or re-fetching is expensive, unless it is deliberately preserved. For most scenarios where new data is continuously generated, streaming data processing is advantageous.
2、Kafka
Kafka is a distributed messaging system responsible for transferring data from one application to another; an application need only focus on the data itself, not on how the data is transferred between two or more applications. Distributed messaging is based on reliable message queues that asynchronously transfer messages between client applications and the messaging system. There are two main messaging modes: point-to-point delivery and publish-subscribe. Most messaging systems use the publish-subscribe mode, and Kafka's messaging mode is publish-subscribe.
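As an illustration of the publish-subscribe mode described above, the following Python sketch models a toy broker in which every subscriber of a topic receives every published message. This is a minimal sketch, not the actual Kafka API; all names are illustrative.

```python
# Minimal publish-subscribe sketch (illustrative; not the Kafka API).
from collections import defaultdict

class MiniBroker:
    """Toy broker: producers publish to topics; every subscriber of a
    topic receives every message (publish-subscribe mode, in contrast
    to point-to-point delivery where each message has one receiver)."""
    def __init__(self):
        self.topics = defaultdict(list)        # topic -> message log
        self.subscribers = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        self.topics[topic].append(message)     # persist to the log
        for cb in self.subscribers[topic]:     # fan out to all subscribers
            cb(message)

broker = MiniBroker()
received_a, received_b = [], []
broker.subscribe("imports", received_a.append)
broker.subscribe("imports", received_b.append)
broker.publish("imports", "row-1")
broker.publish("imports", "row-2")
# Both subscribers see every message, unlike point-to-point delivery.
```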
3、Flink
Flink is a distributed data stream processing engine for stateful computation over unbounded and bounded data streams. Flink is designed to run in all common cluster environments, performing computations at in-memory speed and at any scale. Combining low latency with high throughput, Flink is a leading choice for enterprise stream computing deployments.
To write massive amounts of data into a database, the import may be performed in a streaming manner. The import process is often time-consuming, and abnormal conditions may terminate it, so a breakpoint-resume capability is urgently needed.
On this basis, the present disclosure provides a database import method based on a distributed data stream processing engine, which achieves breakpoint resume during data warehousing by combining the distributed data stream processing engine with a messaging system. When a task fails for any reason during the database import, the data need not be written from the beginning; writing simply continues from the last position before the failure. This is analogous to a file download interrupted by a network failure: the file need not be downloaded again, only the download continued, which can greatly save time and computing resources. The distributed data stream processing engine provides an error-retry mechanism, and by setting checkpoints, writing can continue from the last failure position when a fault occurs, achieving breakpoint resume. In addition, adopting a distributed data stream processing engine can improve the efficiency of writing data into the database.
Exemplary embodiments of the present disclosure are described in detail below with reference to the attached drawings.
FIG. 1 is a flowchart illustrating a database import method 100 based on a distributed data stream processing engine, in accordance with an exemplary embodiment. Referring to FIG. 1, the method may include: step 101, reading a plurality of data records to be imported into a database from a plurality of partitions of a messaging system; step 102, storing the plurality of records in a storage unit of the distributed data stream processing engine; step 103, importing the data in the storage unit into the database; and step 104, triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the checkpointing operation comprises: step 105, in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtaining the position parameter of the record currently read from each of the plurality of partitions, so that the record currently read from each partition can serve as check data; step 106, storing the position parameters of the plurality of check data, so that breakpoint resume of the database import can be achieved based on the stored position parameters; step 107, marking each of the plurality of check data with a barrier; and step 108, determining that the checkpointing operation is complete in response to all of the barrier-marked check data being successfully read in. The method achieves breakpoint resume during streaming data warehousing, saving users' waiting time, improving the efficiency of data warehousing, and allowing the data to be written completely in one pass.
The distributed data stream processing engine may be, but is not limited to, a Flink engine. The storage unit may be a third-party storage unit or a storage unit built into the distributed data stream processing engine. The messaging system may be, for example, the distributed messaging system Kafka.
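The overall flow of steps 101-108 can be sketched as follows. This is a hedged, self-contained Python model of the method, not the Flink API; a real job would use Flink's DataStream API and state backends, and all class and method names here are illustrative.

```python
# Illustrative sketch of steps 101-108: read from partitions, stage in
# a storage unit, import to a database, and snapshot per-partition read
# positions as a checkpoint (barrier alignment elided for brevity).

class ImportJob:
    def __init__(self, partitions):
        self.partitions = partitions                 # partition id -> record list
        self.offsets = {p: 0 for p in partitions}    # current read positions
        self.staged = []                             # the engine's storage unit
        self.database = []                           # destination table
        self.checkpoint = None                       # last completed checkpoint

    def read_one(self, pid):
        """Step 101: read the next record from one partition."""
        records = self.partitions[pid]
        if self.offsets[pid] < len(records):
            self.staged.append(records[self.offsets[pid]])  # step 102: stage
            self.offsets[pid] += 1

    def flush_to_db(self):
        """Step 103: import staged data into the database."""
        self.database.extend(self.staged)
        self.staged.clear()

    def take_checkpoint(self):
        """Steps 105-106: snapshot the current offsets as the position
        parameters of the check data, enabling breakpoint resume."""
        self.checkpoint = dict(self.offsets)

job = ImportJob({0: ["A", "B", "C"], 1: ["A", "B"]})
job.read_one(0); job.read_one(1); job.read_one(0)
job.flush_to_db()
job.take_checkpoint()
# checkpoint now records offsets {0: 2, 1: 1}
```

On failure, reading would restart from `job.checkpoint` instead of offset zero, which is the breakpoint-resume behavior the method describes.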
According to some embodiments, the storage unit of the distributed data stream processing engine may include a correspondence between the plurality of check data and their position parameters, and the method may further include: in response to detecting an interruption of the database import process, continuing the import of the data in the storage unit into the database based on the position parameters of the plurality of check data. Thus, when warehousing is interrupted, the resume position can be quickly determined from the stored check data, achieving breakpoint resume and saving recovery time.
FIG. 2 illustrates a flowchart of a method for triggering a distributed data stream processing engine to perform a checkpointing operation according to preset rules, in accordance with an embodiment of the present disclosure. As shown in FIG. 2, triggering the distributed data stream processing engine to perform the checkpointing operation according to the preset rule in step 104 may include: triggering the engine to perform the checkpointing operation at a preset period. By periodically triggering checkpointing, breakpoint resume is possible whenever a fault occurs, further ensuring the efficiency of data warehousing.
According to some embodiments, triggering the distributed data stream processing engine to perform the checkpointing operation at a preset period in step 104 may include: step 201, counting the records read from the messaging system (each record corresponding to one message in a partition) and timing the duration of reading within each cycle; and step 202, triggering the engine to perform the checkpointing operation in response to either the count of records read reaching a preset number or the reading duration reaching a preset length. By combining a duration trigger with a message-count trigger, the method adapts to warehousing scenarios with large data volumes: it prevents the amount of data between two checkpoints from severely degrading breakpoint-resume efficiency, while also ensuring a prompt response to faults by avoiding overly long intervals between checkpoints. In this case, the engine is effectively triggered to checkpoint at a non-fixed period.
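The dual trigger of steps 201-202 can be sketched as below: a checkpoint fires when either the record count or the elapsed time since the last checkpoint crosses its threshold. The thresholds and class name are illustrative, and a pluggable clock keeps the example deterministic.

```python
# Sketch of the count-or-duration checkpoint trigger (steps 201-202).
import time

class CheckpointTrigger:
    def __init__(self, max_records=1000, max_seconds=60.0, clock=time.monotonic):
        self.max_records = max_records    # preset number of records
        self.max_seconds = max_seconds    # preset duration
        self.clock = clock
        self.reset()

    def reset(self):
        """Start a new cycle after a checkpoint completes."""
        self.count = 0
        self.started = self.clock()

    def record_read(self):
        """Call once per message consumed from the messaging system."""
        self.count += 1

    def should_checkpoint(self):
        """Fire when EITHER threshold is reached."""
        elapsed = self.clock() - self.started
        return self.count >= self.max_records or elapsed >= self.max_seconds

# Simulated clock so the behavior is deterministic.
now = [0.0]
trigger = CheckpointTrigger(max_records=3, max_seconds=10.0, clock=lambda: now[0])
trigger.record_read()
assert not trigger.should_checkpoint()      # 1 record, 0 s elapsed: no trigger
now[0] = 12.0                               # duration threshold crossed
fired_by_time = trigger.should_checkpoint()
trigger.reset()
trigger.record_read(); trigger.record_read(); trigger.record_read()
fired_by_count = trigger.should_checkpoint()  # count threshold crossed
```

Because either condition can fire first, the interval between checkpoints varies with load, matching the non-fixed period described above.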
According to some embodiments, the position parameter of each check datum obtained in step 105 may include a first code associated with the partition the check datum corresponds to and a second code associated with the check datum's read order. In this step, the position parameter records which partition the check datum came from, so that when a task is interrupted, data transfer can resume separately for each partition.
According to some embodiments, for each partition, the number of records read is accumulated as the plurality of records are read from it, and for that partition, the accumulated count at the time a check datum is determined is taken as the check datum's second code. This step records each partition's read offset, so that the resume position can be obtained and breakpoint resume achieved.
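The position parameter described in the two paragraphs above can be sketched as a (first code, second code) pair, where the second code is the running per-partition read count. This is a hedged illustration; the class and method names are invented for the example.

```python
# Sketch of the position parameter: (partition id, per-partition offset).
from collections import defaultdict

class PositionTracker:
    def __init__(self):
        self.read_counts = defaultdict(int)   # partition -> records read so far

    def on_read(self, partition_id):
        """Accumulate the per-partition count and return the position
        parameter of the record just read: the first code identifies
        the partition, the second code is the read offset within it."""
        self.read_counts[partition_id] += 1
        return (partition_id, self.read_counts[partition_id])

tracker = PositionTracker()
positions = [tracker.on_read(0), tracker.on_read(0), tracker.on_read(1)]
# positions == [(0, 1), (0, 2), (1, 1)]
```

Storing such pairs at checkpoint time is exactly what lets each partition resume independently after an interruption.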
According to some embodiments, storing the position parameters of the plurality of check data may include: updating the previously stored position parameters to the currently determined ones. This step updates the checkpoint data in real time to ensure the accuracy of the recovery position. Optionally, the method may further include: accumulating the number of completed checkpointing operations, which helps ensure that the storage states of all operator tasks in the checkpointing mechanism are consistent, i.e., that they refer to the same checkpoint.
According to some embodiments, in response to a record read from one of the plurality of partitions of the messaging system being successfully read in, reading may continue from the partition that record corresponds to. This keeps the data aligned for computing the corresponding position offsets of the checkpoint, supporting an exactly-once database import guarantee.
According to some embodiments, in response to a barrier-marked check datum being successfully read in, reading may continue from the partition that check datum corresponds to. Illustratively, when the barrier-marked check datum from one data source (corresponding to one partition of the messaging system) has been successfully read in, but the barrier-marked check datum from another source has not, the first source's data are buffered while waiting for the other source's check datum. For a given checkpoint, until the check data from all sources have arrived, subsequent data processing is withheld to preserve the ordering of data processing.
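The buffering behavior just described is the essence of barrier alignment, and can be sketched as follows. This is a simplified model of the alignment protocol (Flink's actual implementation differs in detail); all names are illustrative. Records arriving from a source whose barrier has already been seen are held back until the barriers from all sources have arrived.

```python
# Sketch of checkpoint-barrier alignment across multiple sources.

class BarrierAligner:
    def __init__(self, sources):
        self.sources = set(sources)
        self.arrived = set()        # sources whose barrier has been seen
        self.buffered = []          # post-barrier records held back
        self.processed = []         # records released downstream

    def on_record(self, source, record):
        if source in self.arrived:
            self.buffered.append(record)    # behind this source's barrier: wait
        else:
            self.processed.append(record)   # pre-barrier: process normally

    def on_barrier(self, source):
        self.arrived.add(source)
        if self.arrived == self.sources:    # barriers from ALL sources arrived
            self.processed.extend(self.buffered)  # release held-back records
            self.buffered.clear()
            self.arrived.clear()
            return True                     # checkpoint aligned/complete
        return False

a = BarrierAligner(["p0", "p1"])
a.on_record("p0", "B")
done_first = a.on_barrier("p0")     # p0's barrier arrives first: not aligned yet
a.on_record("p0", "C")              # buffered: it is behind p0's barrier
a.on_record("p1", "A")              # processed: p1's barrier not yet seen
done_second = a.on_barrier("p1")    # all barriers in: alignment completes
```

Withholding the buffered records until alignment is what keeps every operator's checkpointed state consistent with the same set of source offsets.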
According to some embodiments, the distributed data stream processing engine may include a connector through which it reads the plurality of records to be imported into the database from the plurality of partitions of the messaging system, thereby interfacing the engine with the messaging system and enabling data transfer. Illustratively, the engine may be a Flink engine, the messaging system may be a Kafka messaging system, and the connector may be a FlinkKafkaConsumer. Further, in response to the Flink engine being triggered to perform a checkpointing operation, the position parameters of the currently read check data may be obtained from the FlinkKafkaConsumer. The position parameter of a check datum may include the Kafka partition it corresponds to and its position offset within that partition.
According to some embodiments, the distributed data stream processing engine may include a Kafka consumer group corresponding to the plurality of partitions, the group comprising a plurality of Kafka consumers, where reading data from the partitions of the messaging system is accomplished by each consumer consuming data from its one-to-one corresponding partition. One consumer per partition yields sequential reads and writes, which are more efficient than other approaches such as random reads and writes.
Alternatively, in response to a record read from one of the plurality of partitions of the messaging system being successfully read in, the Kafka consumer corresponding to that record may continue consuming data from its partition. That is, when a consumer's record is successfully read, that consumer continues to consume the next record, keeping the data aligned and facilitating computation of the position offset at checkpoint time.
Alternatively, in response to a barrier-marked check datum being successfully read in, the Kafka consumer corresponding to that check datum may continue consuming data from its partition. That is, when the check datum in a given consumer is successfully read, that consumer continues to consume the next record, which can improve data transfer efficiency. Specifically, when the barrier-marked check datum from one data source (corresponding to one partition) has been successfully read in but the barrier-marked check datum from another source has not been received, the first source's check datum is buffered while waiting for the other's. For a given checkpoint, until the check data from all sources have been received, subsequent data processing is withheld to preserve the ordering of data processing.
Fig. 3A to 3F are process diagrams illustrating checkpointing according to an exemplary embodiment.
In the example illustrated in FIG. 3A, the messaging system has two partitions, each containing messages "A", "B", "C", "D", and "E". The offsets of both partitions are set to zero.
In the example illustrated in FIG. 3B, the offset recorded in each consumer is 0. Message "A" from partition 0 is in flight being processed, and the offset of the first consumer becomes 1.
In the example illustrated in FIG. 3C, message "A" arrives at the mapping task module of the distributed data stream processing engine and is successfully read in. Both consumers read their next records (message "B" for partition 0 and message "A" for partition 1), and their offsets are updated to 2 and 1, respectively. According to some embodiments, a master node server in the distributed data stream processing engine decides to trigger a checkpoint at the sources (at the consumers); message "B" of partition 0 and message "A" of partition 1 become the check data.
In the example illustrated in FIG. 3D, the position parameters (2, 1) of the checkpoint's check data are stored in the master node server of the distributed data stream processing engine. The master node server issues a checkpoint barrier after messages "B" and "A" from partitions 0 and 1, respectively. The checkpoint barrier aligns the checkpoints of all operator tasks and ensures the consistency of the entire checkpoint. Message "A" arrives at the mapping task module and is successfully read in, and its consumer continues reading its next record (message "C").
In the example illustrated in FIG. 3E, the mapping task module successfully reads in the two barrier-marked check data, and the consumers continue consuming more data from the two partitions of the messaging system.
In the example illustrated in FIG. 3F, after the mapping task module successfully reads in the two barrier-marked check data, it communicates with the master node server to notify it that checkpointing is complete. The master node server updates the checkpoint completion count to 1.
Checkpointing is accomplished through the steps of FIGS. 3A-3F to enable breakpoint resume during failure recovery; the distributed data stream processing engine can recover from potential system failures independently of the messaging system's position offsets.
Although the operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, nor should it be understood that all illustrated operations must be performed in order to achieve desirable results.
FIG. 4 illustrates a scenario in which breakpoint resume is achieved during failure recovery. Referring to FIG. 4, the offsets of the messaging system's two consumers are 2 and 1, respectively, for subsequent resumption, since these are the offsets corresponding to the completed checkpoint. When the database import restarts after failure recovery, each consumer continues consuming data from the offset of the checkpoint stored in the master node server.
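The recovery step above can be sketched as follows: on restart, each partition's consumption resumes from the offset stored with the last completed checkpoint rather than from the beginning. This is an illustrative sketch only; the function name and data layout are invented for the example.

```python
# Sketch of breakpoint resume from checkpointed offsets (FIG. 4 scenario).

def resume_import(partitions, checkpoint_offsets):
    """Return the remaining records of each partition, starting at the
    offset recorded by the last completed checkpoint. Partitions with
    no checkpointed offset restart from the beginning."""
    remaining = {}
    for pid, records in partitions.items():
        start = checkpoint_offsets.get(pid, 0)
        remaining[pid] = records[start:]     # skip already-imported records
    return remaining

# Two partitions of five messages each; checkpointed offsets are 2 and 1,
# matching the offsets described for FIG. 4.
partitions = {0: ["A", "B", "C", "D", "E"], 1: ["A", "B", "C", "D", "E"]}
checkpoint = {0: 2, 1: 1}
todo = resume_import(partitions, checkpoint)
# todo == {0: ["C", "D", "E"], 1: ["B", "C", "D", "E"]}
```

Only the unimported tail of each partition is re-read, which is the time and resource saving the disclosure attributes to breakpoint resume.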
FIG. 5 is a schematic block diagram illustrating a database import device based on a distributed data stream processing engine, in accordance with an exemplary embodiment. As shown in FIG. 5, the apparatus 500 may include: a reading unit 510 configured to read a plurality of data records to be imported into the database from a plurality of partitions of the messaging system; a writing unit 520 configured to store the plurality of records into a plurality of storage units of the distributed data stream processing engine, respectively; an importing unit 530 configured to import the data in the plurality of storage units into the database; and a triggering unit 540 configured to trigger the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the distributed data stream processing engine comprises a checkpointing unit 550 configured to perform the checkpointing operation, the checkpointing unit comprising: an acquisition module 551 configured to acquire, in response to the engine being triggered to perform a checkpointing operation, the position parameter of the record currently read from each of the plurality of partitions, so that the record currently read from each partition can serve as check data; a storage module 552 configured to store the position parameters of the plurality of check data, wherein breakpoint resume of the database import can be achieved based on the stored position parameters; a labeling module 553 configured to mark each of the plurality of check data with a barrier; and a determination module 554 configured to determine that the checkpointing operation is complete in response to all barrier-marked check data being successfully read in.
The operations of the units 510-550 and the modules 551-554 of the distributed data stream processing engine based database derivative device 500 are similar to the operations of steps S101-S108 described above, respectively, and will not be repeated here.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the individual modules discussed herein may be divided among multiple modules, and/or at least some functions of multiple modules may be combined into a single module. Reference herein to a particular module performing an action includes the particular module itself performing the action, or the particular module invoking or otherwise accessing another component or module that performs the action (or that performs the action in conjunction with the particular module). For example, the reading unit 510 and the writing unit 520 described above may be combined into a single unit in some embodiments. For another example, the acquisition module 551 may include the storage module 552 in some embodiments.
It should also be appreciated that the various techniques may be described herein in the general context of software and hardware elements or program modules. The various modules described above with respect to fig. 5 may be implemented in hardware or in hardware combined with software and/or firmware. For example, these modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the reading unit 510, the writing unit 520, the importing unit 530, the triggering unit 540, the checkpointing unit 550, the acquisition module 551, the storage module 552, the labeling module 553, and the determination module 554 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip including one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
According to an aspect of the present disclosure, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory. The processor is configured to execute a computer program to implement the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
Illustrative examples of such computer devices, non-transitory computer readable storage media, and computer program products are described below in connection with fig. 6.
Fig. 6 illustrates an example configuration of a computer device 600 that may be used to implement the methods described herein.
The computer device 600 may be a variety of different types of devices, such as a server of a service provider, a device associated with a client (e.g., a client device), a system-on-chip, and/or any other suitable computer device or computing system. Examples of the computer device 600 include, but are not limited to: a desktop, server, notebook, or netbook computer, a mobile device (e.g., a tablet, a cellular or other wireless telephone (e.g., a smartphone), a notepad computer, a mobile station), a wearable device (e.g., glasses, a watch), an entertainment appliance (e.g., a set-top box communicatively coupled to a display device, a gaming machine), a television or other display device, an automotive computer, and so forth. Thus, the computer device 600 may range from full-resource devices with significant memory and processor resources (e.g., personal computers, game consoles) to low-resource devices with limited memory and/or processing resources (e.g., traditional set-top boxes, hand-held game consoles).
Computer device 600 may include at least one processor 602, memory 604, communication interface(s) 606, display device 608, other input/output (I/O) devices 610, and one or more mass storage devices 612, capable of communicating with each other, such as via a system bus 614 or other suitable connection.
The processor 602 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 602 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 602 may be configured to, among other capabilities, obtain and execute computer-readable instructions stored in the memory 604, mass storage device 612, or other computer-readable medium, such as program code for the operating system 616, program code for the application programs 618, program code for the other programs 620, and so forth.
Memory 604 and mass storage device 612 are examples of computer-readable storage media for storing instructions that are executed by processor 602 to implement the various functions as previously described. For example, memory 604 may generally include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, mass storage device 612 may generally include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network attached storage, storage area networks, and the like. Memory 604 and mass storage device 612 may both be referred to herein collectively as memory or a computer-readable storage medium, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 602 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 612. These programs include an operating system 616, one or more application programs 618, other programs 620, and program data 622, and may be loaded into the memory 604 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: the read unit 510, the write unit 520, the import unit 530, the trigger unit 540, the checkpointing unit 550, the acquisition module 551, the storage module 552, the labeling module 553, and the determination module 554, the method 100 (including any suitable steps of the method 100), and/or further embodiments described herein.
Although illustrated in fig. 6 as being stored in the memory 604 of the computer device 600, the operating system 616, application programs 618, other programs 620, and program data 622, or portions thereof, may be implemented using any form of computer-readable media accessible by the computer device 600. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. Computer storage media as defined herein do not include communication media.
The computer device 600 may also include one or more communication interfaces 606 for exchanging data with other devices, such as via a network, direct connection, or the like, as previously discussed. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless (such as IEEE 802.11 Wireless LAN (WLAN)) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. The communication interface 606 may facilitate communication within a variety of network and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 606 may also provide for communication with external storage devices (not shown), such as in a storage array, network attached storage, storage area network, or the like.
In some examples, a display device 608, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 610 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
While the disclosure has been illustrated and described in detail in the drawings and the foregoing description, such illustration and description are to be considered illustrative and exemplary and not restrictive; the present disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps than those listed, and the word "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (19)

1. A database derivative method based on a distributed data stream processing engine, comprising:
reading a plurality of data to be imported into a database from a plurality of partitions of a messaging system;
storing the plurality of data into a plurality of storage units of the distributed data stream processing engine, respectively;
importing the data in the plurality of storage units into the database;
triggering the distributed data stream processing engine to execute checkpointing operations according to preset rules,
wherein the checkpointing operation comprises:
responsive to the distributed data stream processing engine being triggered to perform a checkpointing operation, obtaining location parameters of data currently read from each of the plurality of partitions to enable the data currently read from each of the plurality of partitions to be used as check data;
storing location parameters of a plurality of inspection data, so that breakpoint resume of the database derivative can be implemented based on the stored location parameters of the plurality of inspection data;
marking each inspection data of the plurality of inspection data with a barrier mark; and
determining that the checkpointing operation is complete in response to the plurality of inspection data marked with barrier marks each being successfully read in.
2. The method of claim 1, wherein triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule comprises:
triggering the distributed data stream processing engine to execute checkpointing operation according to a preset period.
3. The method of claim 2, wherein triggering the distributed data stream processing engine to perform a checkpointing operation at a preset period comprises:
counting data read from the messaging system and timing a duration of reading data from the messaging system in each cycle;
and triggering the distributed data stream processing engine to perform a checkpointing operation in response to the count of data read from the messaging system reaching a preset number or the duration of reading data from the messaging system reaching a preset duration.
4. The method of claim 1, wherein the storage unit stores correspondences between the plurality of inspection data and their location parameters,
the method further comprises the steps of:
in response to detecting an interruption in the database derivative process, continuing to import the data in the storage unit into the database based on the location parameters of the plurality of inspection data.
5. The method of claim 1, wherein the location parameter of each inspection data comprises a first code associated with a partition corresponding to the inspection data and a second code associated with a read order of the inspection data.
6. The method of claim 5, wherein, for the plurality of data read from each partition, the number of data read is accumulated,
wherein the accumulated number corresponding to a given inspection data is used as the second code of that inspection data for the partition.
7. The method of claim 1, wherein storing the location parameters of the plurality of inspection data comprises:
the position parameters of the plurality of inspection data stored before are updated to the position parameters of the plurality of inspection data determined currently.
8. The method of claim 7, further comprising:
the number of checkpointing operations performed is accumulated.
9. The method of claim 1, wherein, in response to data read from the plurality of partitions of the messaging system being successfully read in, reading of data continues from the partitions to which the data correspond.
10. The method of claim 1, wherein, in response to certain inspection data marked with a barrier mark being successfully read in, reading of data continues from the partition to which the inspection data corresponds.
11. The method of any of claims 1-10, wherein the distributed data stream processing engine comprises a connector through which the distributed data stream processing engine reads a plurality of data to be imported into a database from a plurality of partitions of a messaging system.
12. The method of claim 11, wherein the distributed data stream processing engine is a Flink engine.
13. The method of any of claims 1-10, wherein the messaging system is a Kafka messaging system.
14. The method of claim 13, wherein the distributed data stream processing engine comprises a Kafka consumer group corresponding to the plurality of partitions, the Kafka consumer group comprising a plurality of Kafka consumers,
wherein reading data from the partitions of the messaging system is accomplished by each Kafka consumer consuming data from the partition with which it is in one-to-one correspondence.
15. The method of claim 13, further comprising:
in response to data read from the plurality of partitions of the messaging system being successfully read in, the Kafka consumers corresponding to the data continue to consume data from the corresponding partitions.
16. The method of claim 13, further comprising:
in response to certain inspection data marked with a barrier mark being successfully read in, the Kafka consumer corresponding to the inspection data continues to consume data from the corresponding partition.
17. A database derivative device based on a distributed data stream processing engine, comprising:
a reading unit configured to read a plurality of data to be imported into the database from a plurality of partitions of the message system;
a writing unit configured to store the plurality of data into a plurality of storage units of the distributed data stream processing engine, respectively;
an importing unit configured to import data in the plurality of storage units into a database;
a triggering unit configured to trigger the distributed data stream processing engine to perform checkpointing operations according to preset rules,
wherein the distributed data stream processing engine comprises a checkpointing unit configured to perform checkpointing operations, and the checkpointing unit comprises:
an acquisition module configured to acquire, in response to the distributed data stream processing engine being triggered to perform a checkpointing operation, a location parameter of the data currently read from each of the plurality of partitions, so that the data currently read from each of the plurality of partitions can be used as inspection data;
a storage module configured to store the location parameters of the plurality of inspection data, wherein breakpoint resume of the database derivative can be implemented based on the stored location parameters of the plurality of inspection data;
a labeling module configured to label each of the plurality of inspection data with a barrier mark; and
a determination module configured to determine that the checkpointing operation is complete in response to the plurality of inspection data marked with barrier marks each being successfully read in.
18. A computer device, comprising:
a memory, a processor and a computer program stored on the memory,
wherein the processor is configured to execute the computer program to implement the steps of the method of any one of claims 1-16.
19. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1-16.
CN202110254713.0A 2021-03-05 2021-03-05 Database derivative method, device and medium based on distributed data stream processing engine Active CN112905668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110254713.0A CN112905668B (en) 2021-03-05 2021-03-05 Database derivative method, device and medium based on distributed data stream processing engine


Publications (2)

Publication Number Publication Date
CN112905668A CN112905668A (en) 2021-06-04
CN112905668B true CN112905668B (en) 2023-06-06





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant