CN112905668A

CN112905668A - Database derivative method, apparatus, and medium based on distributed data stream processing engine

Info

Publication number: CN112905668A
Application number: CN202110254713.0A
Authority: CN
Inventors: 张灵星; 王海霖; 陈黄; 张国庆
Original assignee: Beijing Zhongjing Huizhong Technology Co ltd
Current assignee: Beijing Zhongjing Huizhong Technology Co ltd
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2021-06-04
Anticipated expiration: 2041-03-05
Also published as: CN112905668B

Abstract

A database derivative method, apparatus, and medium based on a distributed data stream processing engine. The method comprises: reading a plurality of data to be imported into a database from a plurality of partitions of a message system; storing the plurality of data into storage units of the distributed data stream processing engine, respectively; importing the data in the storage units into the database; and triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the checkpointing operation comprises: in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtaining location parameters of the data currently read from each of the plurality of partitions, the currently read data serving as check data; storing the location parameters of the plurality of check data; marking each of the plurality of check data with a barrier mark; and determining that the checkpointing operation is completed in response to all of the plurality of check data marked with the barrier mark being successfully read in.

Description

Database derivative method, apparatus, and medium based on distributed data stream processing engine

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a database derivative method, device, and medium based on a distributed data stream processing engine.

Background

The data processing mode is mainly divided into batch data processing and streaming data processing. Flink is a distributed data stream processing engine for performing stateful computations on unbounded and bounded data streams. Flink is designed to operate in all common clustered environments, performing computations at memory speed and any scale, and is capable of both low latency and high throughput.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.

Disclosure of Invention

When a large volume of data is written into a database, the process is time-consuming, and the database derivative (that is, the import of data into the database) may be terminated by an abnormal condition, after which the data must be rewritten from the beginning once the abnormal condition is resolved. It would be advantageous to provide a mechanism that alleviates, mitigates or even eliminates one or more of the above-mentioned problems.

According to an aspect of the present disclosure, there is provided a database derivative method based on a distributed data stream processing engine, including: reading a plurality of data to be imported into a database from a plurality of partitions of a message system; storing the plurality of data into storage units of the distributed data stream processing engine, respectively; importing the data in the storage units into the database; and triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the checkpointing operation includes: in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtaining location parameters of the data currently read from each of the plurality of partitions, so that the data currently read from each of the plurality of partitions can be used as check data; storing the location parameters of the plurality of check data, so that breakpoint resume of the database derivative can be implemented based on the stored location parameters of the plurality of check data; marking each of the plurality of check data with a barrier mark; and determining that the checkpointing operation is completed in response to all of the plurality of check data marked with the barrier mark being successfully read in.

According to another aspect of the present disclosure, there is provided a database derivative apparatus based on a distributed data stream processing engine, including: a reading unit configured to read a plurality of data to be imported into a database from a plurality of partitions of a message system; a writing unit configured to store the plurality of data into a plurality of storage units of the distributed data stream processing engine, respectively; an importing unit configured to import the data in the plurality of storage units into the database; and a triggering unit configured to trigger the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the distributed data stream processing engine includes a checkpointing unit configured to perform the checkpointing operation, and the checkpointing unit includes: an obtaining module configured to obtain, in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, a location parameter of the data currently read from each of the plurality of partitions, so that the data currently read from each of the plurality of partitions can be used as check data; a storage module configured to store the location parameters of the plurality of check data, wherein breakpoint resume of the database derivative can be implemented based on the stored location parameters of the plurality of check data; a tagging module configured to mark each of the plurality of check data with a barrier mark; and a determination module configured to determine that the checkpointing operation is completed in response to all of the plurality of check data marked with the barrier mark being successfully read in.

According to yet another aspect of the present disclosure, there is provided a computer apparatus including: a memory, a processor and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the above method.

According to yet another aspect of the present disclosure, a non-transitory computer readable storage medium is provided, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above-described method.

According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the steps of the above-mentioned method when executed by a processor.

These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain the exemplary embodiments. The illustrated embodiments are for purposes of example only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates a flow diagram of a database derivative method based on a distributed data stream processing engine in accordance with exemplary embodiments;

FIG. 2 illustrates a flowchart of a method for triggering a distributed data stream processing engine to perform a checkpointing operation according to preset rules, in accordance with an embodiment of the present disclosure;

FIGS. 3A-3F are process diagrams illustrating checkpointing according to exemplary embodiments;

FIG. 4 is a diagram illustrating a scenario when failover is performed in accordance with a checkpoint;

FIG. 5 is a schematic block diagram illustrating a distributed data stream processing engine based database derivative arrangement in accordance with an illustrative embodiment;

FIG. 6 is a block diagram illustrating an exemplary computer device that can be applied to the exemplary embodiments.

Detailed Description

In this disclosure, unless otherwise specified, the use of the terms "first," "second," etc. to describe various elements is not intended to define positional, chronological, or importance relationships of the elements, and such terms are used merely to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.

The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based, at least in part, on". Further, the terms "and/or" and "at least one of …" encompass any and all possible combinations of the listed items.

Before describing exemplary embodiments of the present disclosure, a number of terms used herein are first explained.

1. Streaming data

Streaming data refers to data that is continuously generated by a plurality of data sources, and is usually transmitted simultaneously in the form of data records, which are small in size (about several kilobytes). Streaming data has four characteristics: 1) data arrive in real time; 2) the data arrival sequence is independent and is not controlled by an application system; 3) the data scale is large and the maximum value cannot be predicted; 4) once the data is processed, it cannot be retrieved again for processing unless purposely saved, or it is expensive to retrieve the data again. For most scenarios where dynamic new data is continuously generated, it is advantageous to employ stream data processing.

2. Kafka

Kafka is a distributed messaging system that is responsible for passing data from one application to another, with applications only having to focus on the data and not on how the data is passed between two or more applications. Distributed messaging is based on reliable message queues, asynchronously delivering messages between client applications and a messaging system. There are two main modes of messaging: point-to-point delivery mode, publish-subscribe mode. Most messaging systems use a publish-subscribe model. The messaging mode of Kafka is the publish-subscribe mode.

3. Flink

Flink is a distributed data stream processing engine for performing stateful computations on unbounded and bounded data streams. Flink is designed to run in all common clustered environments, performing calculations at memory speed and any scale. Flink considers both low latency and high throughput and is the first choice when an enterprise deploys stream computing.

To write massive data into a database, the derivative can be performed in a streaming manner. The database derivative process is time-consuming, and an abnormal condition may terminate it, so a breakpoint resume function is urgently needed.

Based on this, the present disclosure provides a database derivative method based on a distributed data stream processing engine, which implements breakpoint resume during data storage by combining the distributed data stream processing engine with a message system. When a task fails for any reason during the database derivative process, the data does not need to be written from the beginning; writing only needs to continue from the position of the last failure. This is similar to downloading a file: after an interruption caused by the network, the file does not need to be downloaded again from the start, and the download simply continues, which greatly saves time and computing resources. The distributed data stream processing engine has an error retry mechanism and, by setting checkpoints, can continue writing from the last failed position when a fault occurs, thereby achieving breakpoint resume. In addition, adopting the distributed data stream processing engine can improve the efficiency of writing data into the database.

Exemplary embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a database derivative method 100 based on a distributed data stream processing engine in accordance with an exemplary embodiment. Referring to fig. 1, the method may include: step 101, reading a plurality of data to be imported into a database from a plurality of partitions of a message system; step 102, storing the plurality of data into storage units of the distributed data stream processing engine, respectively; step 103, importing the data in the storage units into the database; and step 104, triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the checkpointing operation includes: step 105, in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtaining a location parameter of the data currently read from each of the plurality of partitions, so that the data currently read from each of the plurality of partitions can be used as check data; step 106, storing the location parameters of the plurality of check data, so that breakpoint resume of the database derivative can be implemented based on the stored location parameters; step 107, marking each of the plurality of check data with a barrier mark; and step 108, determining that the checkpointing operation is completed in response to all of the plurality of check data marked with the barrier mark being successfully read in. The method thereby implements breakpoint resume while data is streamed into the database, which reduces the user's waiting time, improves the efficiency of data storage, and allows the data to be written completely in a single run.

The distributed data stream processing engine may be, but is not limited to, a Flink engine. The storage unit may be a third-party storage unit or a built-in storage unit of the distributed data stream processing engine. The message system may be, for example, the distributed message system Kafka.

According to some embodiments, the storage unit of the distributed data stream processing engine may store a plurality of correspondences between the check data and the location parameters, and the method may further include: in response to detecting that the database derivative process is interrupted, continuing to import the data in the storage unit into the database based on the location parameters of the plurality of check data. Thus, when the import into the database is interrupted, the position from which to resume can be quickly determined from the check data that have been set, breakpoint resume is achieved, and the time needed to recover from the interruption is reduced.

FIG. 2 illustrates a flow diagram of a method for triggering a distributed data stream processing engine to perform a checkpointing operation according to preset rules, in accordance with an embodiment of the present disclosure. As shown in fig. 2, step 104 of triggering the distributed data stream processing engine to perform the checkpointing operation according to the preset rule may include: triggering the distributed data stream processing engine to perform the checkpointing operation according to a preset period. By triggering the checkpointing operation periodically, breakpoint resume can be achieved whenever a fault occurs, which further safeguards the efficiency of data warehousing.

According to some embodiments, triggering the distributed data stream processing engine to perform the checkpointing operation according to the preset period in step 104 may include: step 201, in each period, counting the data read from the message system (each datum corresponding to one message in a partition of the message system) and timing the duration for which data has been read from the message system; and step 202, in response to the count of data read from the message system reaching a preset number or the duration of reading reaching a preset duration, triggering the distributed data stream processing engine to perform the checkpointing operation. Setting both a preset duration and a preset message count as triggers suits application scenarios in which a large volume of data is put into the database: it prevents the volume of data between two checkpoints from growing so large that it degrades breakpoint resume efficiency, ensures a prompt response to faults, and avoids the situation in which breakpoint resume cannot be achieved because the time interval between two checkpoints is too long. In this case, the distributed data stream processing engine is triggered to perform the checkpointing operation with a non-fixed period.
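The count-or-duration trigger of steps 201 and 202 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class name `CheckpointTrigger`, its parameters, and the injectable `clock` are hypothetical names introduced here, and the thresholds are arbitrary.

```python
import time
from typing import Callable


class CheckpointTrigger:
    """Hypothetical helper: fires when either a record count or a duration is reached."""

    def __init__(self, max_records: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_records = max_records
        self.max_seconds = max_seconds
        self.clock = clock
        self.count = 0                 # records read in the current period (step 201)
        self.period_start = clock()    # start time of the current period (step 201)

    def on_record(self) -> bool:
        """Count one record read; return True if a checkpoint should be triggered."""
        self.count += 1
        elapsed = self.clock() - self.period_start
        if self.count >= self.max_records or elapsed >= self.max_seconds:
            # Step 202: either threshold was reached, so start a new period.
            self.count = 0
            self.period_start = self.clock()
            return True
        return False


# Usage with a fake clock so the example is deterministic.
now = [0.0]
trigger = CheckpointTrigger(max_records=3, max_seconds=10.0, clock=lambda: now[0])
fired_by_count = [trigger.on_record() for _ in range(3)]  # third record hits the count
now[0] = 12.0                                             # 12 s pass with no new records
fired_by_time = trigger.on_record()                       # duration threshold is reached
```

Because whichever threshold is reached first ends the period, the interval between checkpoints varies with the data rate, which matches the non-fixed period described above.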

According to some embodiments, the location parameter of each check data acquired in step 105 may include a first code associated with the partition to which the check data corresponds, and a second code associated with the reading order of the check data. In this step, the location parameter of the check data records which partition the check data came from, so that when the task is interrupted, data transmission can be resumed separately for each partition.

According to some embodiments, for the plurality of data read from each partition, the number of data read is accumulated, and the accumulated number at the time the check data is determined is used as the second code of the check data for that partition. In this step, the read offset of each partition is recorded, so that the position from which to resume can be obtained and breakpoint resume can be achieved.
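A location parameter made of a first code (the partition) and a second code (the accumulated read count) might be modeled as below. This is a sketch under stated assumptions: `LocationParameter` and `OffsetTracker` are hypothetical names, and partitions are assumed to be identified by consecutive integers starting at 0.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LocationParameter:
    partition: int  # "first code": which partition the check data came from
    offset: int     # "second code": number of records read so far from that partition


class OffsetTracker:
    """Hypothetical helper that accumulates the read count per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.read_counts: Dict[int, int] = {p: 0 for p in range(num_partitions)}

    def record_read(self, partition: int) -> None:
        self.read_counts[partition] += 1

    def snapshot(self) -> List[LocationParameter]:
        """Location parameters of the data currently read from every partition."""
        return [LocationParameter(p, n) for p, n in sorted(self.read_counts.items())]


# Two records read from partition 0 and one from partition 1.
tracker = OffsetTracker(num_partitions=2)
for partition in (0, 1, 0):
    tracker.record_read(partition)
snapshot = tracker.snapshot()
```

Here `snapshot` holds offset 2 for partition 0 and offset 1 for partition 1, the same pair of values used in the FIG. 3C example.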

According to some embodiments, step 106 of storing the location parameters of the plurality of check data may include: updating the previously stored location parameters of the plurality of check data to the location parameters of the plurality of check data currently determined. This step updates the checkpoint data in real time to ensure the accuracy of the recovery position. Optionally, the method may further include: accumulating the number of checkpointing operations performed, so that the storage states of all operator tasks in the checkpointing mechanism can be kept consistent, that is, the storage states belonging to the same checkpoint are consistent.
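The update-in-place behavior of step 106, together with the optional count of checkpointing operations, can be illustrated with a small sketch. The class `CheckpointStore` and its attribute names are hypothetical, and an in-memory dictionary stands in for whatever storage an actual engine would use.

```python
from typing import Dict


class CheckpointStore:
    """Hypothetical store that keeps only the latest location parameters."""

    def __init__(self) -> None:
        self.location_parameters: Dict[int, int] = {}  # partition -> offset
        self.checkpoints_performed = 0

    def save(self, new_parameters: Dict[int, int]) -> None:
        # Replace the previously stored parameters with the current ones.
        self.location_parameters = dict(new_parameters)
        # Accumulate the number of checkpointing operations performed.
        self.checkpoints_performed += 1


store = CheckpointStore()
store.save({0: 2, 1: 1})  # first checkpoint
store.save({0: 4, 1: 3})  # second checkpoint overwrites the first
```

After the second call only the newer offsets remain, so a recovery would start from the most recent checkpoint rather than an older one.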

According to some embodiments, in response to data read from the plurality of partitions of the message system being successfully read in, data may continue to be read from the partition to which that data corresponds. This keeps the data aligned so that the position offset corresponding to a checkpoint can be calculated, providing a one-time database derivative guarantee.

According to some embodiments, in response to a given check data marked with a barrier mark being successfully read in, data may continue to be read from the partition corresponding to that check data. Illustratively, when the check data marked with the barrier mark from one data source (corresponding to one partition of the message system) has been successfully read and the check data marked with the barrier mark from another data source has not yet been successfully read, the check data from the first data source is cached and waits for the check data from the other data source. For a given checkpoint, until the check data from all data sources have been cached, subsequent data processing may be withheld to preserve the ordering of data processing.
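The caching behavior described above can be sketched as a small alignment routine. This is an illustrative model only: `AligningTask` and its methods are hypothetical, the `barrier` flag stands in for the barrier mark, and releasing held records in source order once all sources are aligned is an assumption of this sketch rather than something the description specifies.

```python
from typing import Dict, List, Tuple


class AligningTask:
    """Hypothetical downstream task that aligns barrier-marked check data."""

    def __init__(self, num_sources: int) -> None:
        self.num_sources = num_sources
        self.held: Dict[int, List[str]] = {}       # source -> cached check data and followers
        self.processed: List[Tuple[int, str]] = []
        self.checkpoints_completed = 0

    def receive(self, source: int, record: str, barrier: bool = False) -> None:
        if barrier:
            self.held[source] = [record]           # cache the check data of this source
            if len(self.held) == self.num_sources:
                self._complete_checkpoint()        # check data from all sources is cached
        elif source in self.held:
            self.held[source].append(record)       # withhold data behind the check data
        else:
            self.processed.append((source, record))

    def _complete_checkpoint(self) -> None:
        for source in sorted(self.held):
            self.processed.extend((source, r) for r in self.held[source])
        self.held.clear()
        self.checkpoints_completed += 1


task = AligningTask(num_sources=2)
task.receive(0, "A")
task.receive(0, "B", barrier=True)   # check data from source 0 is cached
task.receive(0, "C")                 # held until source 1 catches up
before_alignment = list(task.processed)
task.receive(1, "A", barrier=True)   # check data from source 1 arrives: aligned
```

Until the second barrier-marked record arrives, only the record that preceded the first barrier has been processed; afterwards the cached records are released and the checkpoint is counted as complete.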

According to some embodiments, the distributed data stream processing engine may include a connector, and the distributed data stream processing engine reads the plurality of data to be imported into the database from the plurality of partitions of the message system through the connector, so as to interface the distributed data stream processing engine with the message system and realize data transmission. Illustratively, the distributed data stream processing engine may be a Flink engine, the message system may be a Kafka message system, and the connector may be a Flink Kafka consumer. Further, the location parameters of the currently read check data may be retrieved from the Flink Kafka consumer in response to the Flink engine being triggered to perform a checkpointing operation. The location parameters of the check data may include the Kafka partition to which the check data corresponds and the position offset of the check data in that partition.

According to some embodiments, the distributed data stream processing engine may include a Kafka consumer group corresponding to the plurality of partitions, the Kafka consumer group including a plurality of Kafka consumers, wherein reading data from the partitions of the message system is accomplished by each Kafka consumer consuming data from its one-to-one corresponding partition. Because one consumer corresponds to one partition, reads and writes are sequential, which is more efficient than other modes such as random reads and writes.

Alternatively, in response to data from the plurality of partitions of the message system being successfully read in, the Kafka consumer corresponding to that data may continue to consume data from the corresponding partition. That is, once the data in a Kafka consumer has been successfully read, that Kafka consumer goes on to consume the next data, which keeps the data aligned and facilitates calculation of the position offset at checkpoint time.

Alternatively, in response to a given check data marked with a barrier mark being successfully read in, the Kafka consumer corresponding to that check data continues to consume data from the corresponding partition. That is, once the check data in a given Kafka consumer has been successfully read, that Kafka consumer goes on to consume the next data, which can improve data transmission efficiency. Specifically, when the check data marked with the barrier mark from one data source (corresponding to one partition) has been successfully read and the check data marked with the barrier mark from another data source has not yet been received, the check data from the first data source is cached and waits for the check data from the other data source. For a given checkpoint, until the check data from all data sources have been cached, subsequent data processing may be withheld to preserve the ordering of data processing.

Fig. 3A to 3F are process diagrams illustrating checkpointing according to an exemplary embodiment.

In the example illustrated in FIG. 3A, two partitions in a message system each contain message "A", message "B", message "C", message "D", and message "E". We set the offset of the two partitions to zero.

In the example illustrated in FIG. 3B, the offset recorded in each consumer is initially 0. Message "A" from partition 0 is "in flight" (being processed), and the offset of the first consumer becomes 1.

In the example illustrated in FIG. 3C, message "A" arrives at the map task module of the distributed data stream processing engine and is successfully read in. Both consumers read their next record (message "B" for partition 0 and message "A" for partition 1). The offsets in the two consumers are updated to 2 and 1, respectively. According to some embodiments, the master node server in the distributed data stream processing engine decides to trigger a checkpoint at the source (consumer), with message "B" for partition 0 and message "A" for partition 1 being the check data.

In the example illustrated in fig. 3D, the location parameters (2,1) of the plurality of check data of the checkpoint are stored in the master node server of the distributed data stream processing engine. The master node server issues checkpoint barriers after messages "B" and "A" from partitions 0 and 1, respectively. The checkpoint barrier is used to align checkpoints for all operator tasks and ensure consistency across checkpoints. Message "A" arrives at the map task module and is successfully read in, and the consumer continues to read its next record (message "C").

In the example illustrated in FIG. 3E, the map task module successfully reads in the two pieces of check data marked with barrier marks, and the consumers continue to consume more data from the two partitions of the message system.

In the example illustrated in fig. 3F, after the map task module successfully reads in the two pieces of check data marked with barrier marks, the map task module communicates with the master node server to notify it that the checkpoint setting is completed. The master node server sets the number of completed checkpoints to 1.
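The interaction between the map task module and the master node server in FIGS. 3D and 3F can be sketched as follows. This is a simplified model under stated assumptions: `MasterNode`, `trigger`, and `acknowledge` are hypothetical names, and treating a checkpoint as "pending" until every expected task has reported is an assumption of this sketch.

```python
from typing import FrozenSet, Optional, Set, Tuple


class MasterNode:
    """Hypothetical coordinator that stores offsets and counts completed checkpoints."""

    def __init__(self, expected_tasks: FrozenSet[str]) -> None:
        self.expected_tasks = expected_tasks
        self.pending_offsets: Optional[Tuple[int, ...]] = None
        self.completed_offsets: Optional[Tuple[int, ...]] = None
        self.acknowledged: Set[str] = set()
        self.completed_count = 0

    def trigger(self, offsets: Tuple[int, ...]) -> None:
        """Store the location parameters of the check data (FIG. 3D)."""
        self.pending_offsets = offsets
        self.acknowledged = set()

    def acknowledge(self, task: str) -> None:
        """A task reports that it has read in all barrier-marked check data (FIG. 3F)."""
        self.acknowledged.add(task)
        if self.pending_offsets is not None and self.acknowledged == self.expected_tasks:
            self.completed_offsets = self.pending_offsets
            self.pending_offsets = None
            self.completed_count += 1


master = MasterNode(expected_tasks=frozenset({"map"}))
master.trigger((2, 1))                    # offsets of messages "B" and "A"
count_before_ack = master.completed_count
master.acknowledge("map")                 # the map task notifies the master node
```

The completed count changes from 0 to 1 only after the notification arrives, and the offsets (2, 1) become the ones a later recovery would use.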

Through the steps of FIGS. 3A-3F, a checkpoint is set that allows the import to resume from the breakpoint after a failure, and the distributed data stream processing engine can recover from a potential system failure without relying on the position offsets maintained by the message system.

Although the operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, nor that all illustrated operations be performed, to achieve desirable results.

Fig. 4 illustrates a scenario of implementing breakpoint resume during fault recovery. Referring to fig. 4, the two consumers of the message system resume from offsets 2 and 1, respectively, because these are the offsets of the completed checkpoint. When the database derivative is restarted through fault recovery, each consumer continues to consume data from the offset corresponding to the checkpoint stored in the master node server.
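Resuming from the checkpointed offsets in the FIG. 4 scenario might look like the sketch below. The function `resume_import`, the `p0:`/`p1:` row labels, and the in-memory lists are hypothetical stand-ins; the sketch also assumes that the database contains exactly the rows covered by the completed checkpoint when the restart happens.

```python
from typing import Dict, List


def resume_import(partitions: Dict[int, List[str]],
                  checkpoint_offsets: Dict[int, int],
                  database: List[str]) -> None:
    """Continue importing each partition from its checkpointed offset."""
    for partition_id in sorted(partitions):
        start = checkpoint_offsets.get(partition_id, 0)  # 0 if no checkpoint exists
        for message in partitions[partition_id][start:]:
            database.append(f"p{partition_id}:{message}")


# Both partitions hold messages "A" to "E", as in FIG. 3A.
partitions = {0: list("ABCDE"), 1: list("ABCDE")}
# Rows already imported before the failure, matching checkpoint offsets 2 and 1.
database = ["p0:A", "p0:B", "p1:A"]
resume_import(partitions, {0: 2, 1: 1}, database)
```

After the call, every message of both partitions appears in `database` once, and nothing before the checkpointed offsets is read again.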

FIG. 5 is a schematic block diagram illustrating a database derivative apparatus based on a distributed data stream processing engine in accordance with an illustrative embodiment. As shown in fig. 5, the apparatus 500 may include: a reading unit 510 configured to read a plurality of data to be imported into the database from a plurality of partitions of the message system; a writing unit 520 configured to store the plurality of data into a plurality of storage units of the distributed data stream processing engine, respectively; an importing unit 530 configured to import the data in the plurality of storage units into the database; and a triggering unit 540 configured to trigger the distributed data stream processing engine to perform a checkpointing operation according to a preset rule, wherein the distributed data stream processing engine comprises a checkpointing unit 550 configured to perform the checkpointing operation, and the checkpointing unit comprises: an obtaining module 551 configured to, in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtain a location parameter of the data currently read from each of the plurality of partitions, so that the data currently read from each of the plurality of partitions can be used as check data; a storage module 552 configured to store the location parameters of the plurality of check data, wherein breakpoint resume of the database derivative can be implemented based on the stored location parameters of the plurality of check data; a tagging module 553 configured to mark each of the plurality of check data with a barrier mark; and a determination module 554 configured to determine that the checkpointing operation is completed in response to all of the plurality of check data marked with the barrier mark being successfully read in.

The operations of the units 510-550 and the modules 551-554 of the database derivative apparatus 500 based on the distributed data stream processing engine are similar to the operations of steps 101-108 described above, respectively, and are not described again.

Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided into multiple modules and/or at least some of the functionality of multiple modules may be combined into a single module. Performing an action by a particular module discussed herein includes the particular module itself performing the action, or alternatively the particular module invoking or otherwise accessing another component or module that performs the action (or performs the action in conjunction with the particular module). Thus, a particular module that performs an action can include the particular module that performs the action itself and/or another module that the particular module invokes or otherwise accesses that performs the action. For example, the reading unit 510 and the writing unit 520 described above may be combined into a single unit in some embodiments. As another example, the obtaining module 551 may include the storage module 552 in some embodiments.

It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to fig. 5 may be implemented in hardware or in hardware in combination with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the reading unit 510, the writing unit 520, the importing unit 530, the triggering unit 540, the checkpointing unit 550, the obtaining module 551, the storage module 552, the tagging module 553, and the determination module 554 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip (which includes one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry), and may optionally execute received program code and/or include embedded firmware to perform functions.

According to an aspect of the disclosure, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory. The processor is configured to execute the computer program to implement the steps of any of the method embodiments described above.

According to an aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.

According to an aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.

Illustrative examples of such computer devices, non-transitory computer-readable storage media, and computer program products are described below in connection with FIG. 6.

Fig. 6 illustrates an example configuration of a computer device 600 that may be used to implement the methods described herein.

The computer device 600 may be a variety of different types of devices, such as a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computer device or computing system. Examples of computer device 600 include, but are not limited to: a desktop computer, a server computer, a notebook or netbook computer, a mobile device (e.g., a tablet, a cellular or other wireless telephone (e.g., a smartphone), a notepad computer, a mobile station), a wearable device (e.g., glasses, a watch), an entertainment device (e.g., an entertainment appliance, a set-top box communicatively coupled to a display device, a gaming console), a television or other display device, an automotive computer, and so forth. Thus, the computer device 600 may range from a full resource device with substantial memory and processor resources (e.g., personal computers, game consoles) to a low-resource device with limited memory and/or processing resources (e.g., traditional set-top boxes, hand-held game consoles).

The computer device 600 may include at least one processor 602, memory 604, communication interface(s) 606, display device 608, other input/output (I/O) devices 610, and one or more mass storage devices 612, capable of communicating with each other, such as through a system bus 614 or other suitable connection.

Processor 602 may be a single processing unit or multiple processing units, all of which may include single or multiple computing units or multiple cores. The processor 602 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 602 can be configured to retrieve and execute computer readable instructions stored in the memory 604, mass storage device 612, or other computer readable medium, such as program code for an operating system 616, program code for an application program 618, program code for other programs 620, and so forth.

Memory 604 and mass storage device 612 are examples of computer readable storage media for storing instructions that are executed by processor 602 to implement the various functions described above. By way of example, memory 604 may generally include both volatile and nonvolatile memory (e.g., RAM, ROM, and the like). In addition, mass storage device 612 may generally include a hard disk drive, solid state drive, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. Memory 604 and mass storage device 612 may both be referred to herein collectively as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 602 as a particular machine configured to implement the operations and functions described in the examples herein.

A number of program modules may be stored on the mass storage device 612. These programs include an operating system 616, one or more application programs 618, other programs 620, and program data 622, which can be loaded into memory 604 for execution. Examples of such applications or program modules may include, for instance, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: the read unit 510, the write unit 520, the import unit 530, the trigger unit 540, the checkpointing unit 550, the obtaining module 551, the storing module 552, the labeling module 553, and the determining module 554, the method 100 (including any suitable step of the method 100), and/or further embodiments described herein.

Although illustrated in FIG. 6 as being stored in memory 604 of computer device 600, modules 616, 618, 620, and 622, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computer device 600. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer storage media and communication media.

Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.

In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Computer storage media, as defined herein, does not include communication media.

The computer device 600 may also include one or more communication interfaces 606 for exchanging data with other devices, such as over a network, direct connection, and the like, as previously discussed. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless interface (such as an IEEE 802.11 wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. The communication interface 606 may facilitate communication within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the internet, and so forth. The communication interface 606 may also provide for communication with external storage devices (not shown), such as in storage arrays, network attached storage, storage area networks, and so forth.

In some examples, a display device 608, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 610 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so forth.

While the disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and exemplary and not restrictive; the present disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps than those listed and the words "a" or "an" do not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
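By way of illustration only, the checkpointing operation described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the disclosed implementation and not Flink's or Kafka's actual API: the class name `CheckpointSketch` and all of its method names are hypothetical, the location parameter is modeled as a pair of partition identifier (first code) and accumulated read count (second code), and "successfully read in" is modeled as a callback per partition.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointSketch:
    """Toy model of the checkpointing operation (hypothetical names throughout)."""
    num_partitions: int
    read_counts: dict = field(default_factory=dict)     # partition -> records read so far
    stored_offsets: dict = field(default_factory=dict)  # last stored location parameters
    pending: set = field(default_factory=set)           # partitions whose barrier is in flight
    completed: int = 0                                   # number of finished checkpoints

    def on_read(self, partition):
        # Accumulate the per-partition count; the pair returned plays the role
        # of the location parameter (first code, second code).
        self.read_counts[partition] = self.read_counts.get(partition, 0) + 1
        return (partition, self.read_counts[partition])

    def trigger(self):
        # Obtain the current position in every partition, overwrite the
        # previously stored positions, and mark one barrier per partition.
        snapshot = {p: self.read_counts.get(p, 0) for p in range(self.num_partitions)}
        self.stored_offsets = snapshot
        self.pending = set(snapshot)
        return snapshot

    def on_barrier_read_in(self, partition):
        # The checkpoint completes only once every barrier-labeled record
        # has been read in; returns True exactly at that moment.
        if partition not in self.pending:
            return False
        self.pending.remove(partition)
        if self.pending:
            return False
        self.completed += 1
        return True

    def resume_positions(self):
        # After an interruption, importing would continue from these positions.
        return dict(self.stored_offsets)
```

In this sketch, a caller would invoke `on_read` for every record, `trigger` whenever the preset rule fires, and `on_barrier_read_in` once per partition as the barrier-labeled records are read in; `resume_positions` returns the stored location parameters from which an interrupted import could continue.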

Claims

1. A database derivative method based on a distributed data stream processing engine, comprising:

reading a plurality of data to be imported into a database from a plurality of partitions of a message system;

storing the plurality of data into storage units of the distributed data stream processing engine respectively;

importing the data in the storage unit into a database;

triggering the distributed data stream processing engine to execute a checkpointing operation according to a preset rule,

wherein the checkpointing operation comprises:

in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, obtaining location parameters of data currently read from each of the plurality of partitions, so that the data currently read from each of the plurality of partitions serves as check data;

storing the location parameters of the plurality of check data, so that breakpoint resumption of the database derivative can be achieved based on the stored location parameters of the plurality of check data;

labeling each of the plurality of check data with a barrier label; and

determining that the checkpointing operation is complete in response to each of the plurality of check data labeled with the barrier label being successfully read in.

2. The method of claim 1, wherein triggering the distributed data stream processing engine to perform a checkpointing operation according to a preset rule comprises:

triggering the distributed data stream processing engine to perform the checkpointing operation according to a preset period.

3. The method of claim 2, wherein triggering the distributed data stream processing engine to perform the checkpointing operation according to a preset period comprises:

in each cycle, counting the data read from the messaging system and timing the duration of reading data from the messaging system; and

triggering the distributed data stream processing engine to perform the checkpointing operation in response to the count of data read from the messaging system reaching a preset number or the duration of reading data from the messaging system reaching a preset duration.

4. The method of claim 1, wherein the storage unit includes a correspondence between the plurality of check data and the location parameters,

the method further comprising:

in response to detecting that the database derivative process is interrupted, continuing to import the data in the storage unit into the database based on the location parameters of the plurality of check data.

5. The method of claim 1, wherein the location parameter of each check data comprises a first code associated with a partition to which the check data corresponds and a second code associated with a reading order of the check data.

6. The method of claim 5, further comprising, for the plurality of data read from each partition, accumulating the number of data read,

wherein the number accumulated for the partition to which a given check data corresponds is used as the second code of that check data.

7. The method of claim 1, wherein storing the location parameters of the plurality of check data comprises:

updating the previously stored location parameters of the plurality of check data to the currently determined location parameters of the plurality of check data.

8. The method of claim 7, further comprising:

accumulating the number of checkpointing operations performed.

9. The method of claim 1, further comprising: in response to data read from one of the plurality of partitions of the messaging system being successfully read in, continuing to read data from the partition to which the data corresponds.

10. The method of claim 1, further comprising: in response to a given check data labeled with the barrier label being successfully read in, continuing to read data from the partition to which the check data corresponds.

11. The method of any of claims 1-10, wherein the distributed data stream processing engine comprises a connector through which the distributed data stream processing engine reads a plurality of data to be imported into the database from a plurality of partitions of the messaging system.

12. The method of claim 11, wherein the distributed data stream processing engine is Flink.

13. The method of any one of claims 1-10, wherein the messaging system is a Kafka messaging system.

14. The method of claim 13, wherein the distributed data stream processing engine comprises a Kafka consumer group corresponding to the plurality of partitions, the Kafka consumer group comprising a plurality of Kafka consumers,

wherein reading data from the partitions of the messaging system is accomplished by each Kafka consumer consuming data from the partition with which it is in one-to-one correspondence.

15. The method of claim 13, further comprising:

in response to data read from one of the plurality of partitions of the messaging system being successfully read in, the Kafka consumer corresponding to the data continuing to consume data from the corresponding partition.

16. The method of claim 13, further comprising:

in response to a given check data labeled with the barrier label being successfully read in, the Kafka consumer corresponding to the check data continuing to consume data from the corresponding partition.

17. A database derivative apparatus based on a distributed data stream processing engine, comprising:

a reading unit configured to read a plurality of data to be imported into the database from a plurality of partitions of the message system;

a writing unit configured to store the plurality of data into a plurality of storage units of the distributed data stream processing engine, respectively;

an importing unit configured to import data in the plurality of storage units into a database;

a triggering unit configured to trigger the distributed data stream processing engine to perform a checkpointing operation according to a preset rule,

wherein the distributed data stream processing engine comprises a checkpointing unit configured to perform the checkpointing operation, the checkpointing unit comprising:

an obtaining module configured to obtain, in response to the distributed data stream processing engine being triggered to perform the checkpointing operation, location parameters of data currently read from each of the plurality of partitions, so that the data currently read from each of the plurality of partitions serves as check data;

a storing module configured to store the location parameters of the plurality of check data, wherein breakpoint resumption of the database derivative can be achieved based on the stored location parameters of the plurality of check data;

a labeling module configured to label each of the plurality of check data with a barrier label; and

a determining module configured to determine that the checkpointing operation is complete in response to each of the plurality of check data labeled with the barrier label being successfully read in.

18. A computer device, comprising:

a memory, a processor, and a computer program stored on the memory,

wherein the processor is configured to execute the computer program to implement the steps of the method of any one of claims 1-16.

19. A non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1-16.

20. A computer program product comprising a computer program, wherein the computer program realizes the steps of the method of any one of claims 1-16 when executed by a processor.
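As a non-limiting illustration of the trigger rule recited in claim 3, the following Python sketch fires a checkpoint when either a preset number of records has been read or a preset duration has elapsed in the current cycle, whichever comes first. The class name `CountOrTimeTrigger`, its parameters, and the injectable `clock` argument are hypothetical and chosen only for this example; they are not taken from the disclosure or from any Flink interface.

```python
import time


class CountOrTimeTrigger:
    """Fires when a record count or an elapsed duration reaches its preset value."""

    def __init__(self, max_records, max_seconds, clock=time.monotonic):
        self.max_records = max_records    # preset number of records per cycle
        self.max_seconds = max_seconds    # preset duration per cycle, in seconds
        self._clock = clock               # injectable so the sketch can be tested
        self._count = 0
        self._cycle_start = clock()

    def record_read(self):
        # Count the record, time the current cycle, and report whether the
        # checkpointing operation should be triggered now.
        self._count += 1
        elapsed = self._clock() - self._cycle_start
        if self._count >= self.max_records or elapsed >= self.max_seconds:
            # Start a new cycle: reset both the counter and the timer.
            self._count = 0
            self._cycle_start = self._clock()
            return True
        return False
```

In this sketch the duration is only evaluated when a record arrives; a deployment that must also checkpoint during idle periods would need a separate timer, which is outside the scope of the example.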