CN114238516A

CN114238516A - Data synchronization method, system and computer readable medium

Info

Publication number: CN114238516A
Application number: CN202111571164.6A
Authority: CN
Inventors: 王仕凯; 陈诚; 戴橙
Original assignee: Zhejiang Taimei Medical Technology Co Ltd
Current assignee: Zhejiang Taimei Medical Technology Co Ltd
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2022-03-25

Abstract

The invention provides a data synchronization method, a data synchronization system and a computer readable medium. The method comprises the following steps: in a full data synchronization stage, extracting data from a first database through a first streaming computing program corresponding to a first computing framework to form a first data storage message queue; in an incremental data synchronization stage after the full data synchronization stage is finished, determining a data starting point of incremental data synchronization through the first streaming computing program, and extracting data from the data starting point to form a second data storage message queue; extracting data from the first data storage message queue or the second data storage message queue through a second streaming computing program; performing data screening and format conversion operations on the data extracted from the first data storage message queue or the second data storage message queue to form processed data; and storing the processed data in a second database. Because incremental synchronization and full synchronization are performed in the same program, the maintenance cost is reduced.

Description

Data synchronization method, system and computer readable medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a data synchronization method, system and computer readable medium.

Background

When multiple copies of the same data must be stored, a consistency problem arises, so synchronization is required. Synchronization is divided into two types: full synchronization and incremental synchronization. Full synchronization refers to storing all data into the target system at a set time or periodically. Incremental synchronization builds on full synchronization: it captures the data that has changed since a certain moment or checkpoint and synchronizes only that differential data to the target system. The moment or checkpoint that determines where incremental synchronization starts is called the update point.
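By way of illustration only, the distinction between the two types of synchronization can be sketched as follows. The in-memory dictionaries, the change-log layout and the function names are assumptions made for this sketch and are not taken from the application.

```python
from typing import Dict, List, Tuple

# Hypothetical change log: (sequence number, key, value), ordered by sequence.
ChangeLog = List[Tuple[int, str, str]]


def full_sync(source: Dict[str, str]) -> Dict[str, str]:
    """Full synchronization: copy every record into a fresh target."""
    return dict(source)


def incremental_sync(target: Dict[str, str], log: ChangeLog, update_point: int) -> int:
    """Incremental synchronization: apply only changes after the update point.

    Returns the new update point (the last sequence number applied).
    """
    for seq, key, value in log:
        if seq > update_point:
            target[key] = value
            update_point = seq
    return update_point


source = {"a": "1", "b": "2"}
target = full_sync(source)
log = [(1, "a", "1"), (2, "b", "2"), (3, "c", "3"), (4, "a", "9")]
# Entries 1 and 2 are already covered by the full copy, so start after 2.
new_point = incremental_sync(target, log, update_point=2)
```

In this sketch the update point is simply a sequence number; only the entries after it are applied to the target.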

The currently common approach is to perform the full synchronization phase through Spark and the incremental synchronization phase through canal, which reads the Binlog of the MySQL database. Because the two phases are executed separately, the update point in the Binlog cannot be determined in the incremental synchronization phase after full synchronization completes, so a separate canal client is needed to maintain the log, which increases the maintenance cost. In addition, canal requires its own cluster, which increases the development cost. Moreover, canal provides no semantic guarantee against data loss.

Therefore, a need exists for a data synchronization method, system, and computer readable medium with low maintenance and development costs.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a data synchronization method, a data synchronization system and a computer readable medium that address the high maintenance cost and development cost of existing data synchronization methods.

In order to solve the technical problem, the invention provides a data synchronization method. The method comprises the following steps: in a full data synchronization stage, extracting data from a first database through a first streaming computing program corresponding to a first computing framework to form a first data storage message queue; in an incremental data synchronization stage after the full data synchronization stage is finished, determining a data starting point of incremental data synchronization through the first streaming computing program corresponding to the first computing framework, and extracting data from the data starting point to form a second data storage message queue; in the full data synchronization stage and the incremental data synchronization stage, extracting data from the first data storage message queue or the second data storage message queue through a second streaming computing program corresponding to the first computing framework; performing data screening and format conversion operations on the data extracted from the first data storage message queue or the second data storage message queue to form processed data; and storing the processed data in a second database.

In an embodiment of the invention, the method further comprises: monitoring the operation time of extracting data from the first database through the first streaming computing program corresponding to the first computing framework to obtain an operation delay value; comparing the operation delay value with a set first threshold value to obtain a judgment result; and determining, based on the judgment result, whether to request a new running resource for the data extraction operation.

In an embodiment of the present invention, monitoring the operation time of extracting data from the first database through the first streaming computing program corresponding to the first computing framework to obtain the operation delay value includes: obtaining the number of pieces of data extracted within a set first time interval; obtaining an average extraction time per piece of data based on the number of pieces of data and the first time interval; and taking the average extraction time per piece of data as the operation delay value.

In one embodiment of the invention, the running resources include processing resources and storage resources.

In an embodiment of the present invention, the first computing framework comprises a flink computing framework, the first streaming computing program comprises a flink-cdc streaming computing program, and the second streaming computing program comprises a flink streaming computing program.

In an embodiment of the present invention, the data extracted comprises Binlog data.

In one embodiment of the invention, the first and second data storage message queues comprise kafka message queues.

In an embodiment of the invention, the first database comprises a MySQL database and the second database comprises a KUDU database.

In order to solve the above technical problem, the present invention provides a data synchronization system, including: a memory for storing instructions executable by the processor; a processor for executing the instructions to implement the data synchronization method as described above.

To solve the above technical problem, the present invention provides a computer-readable medium storing computer program code, which when executed by a processor implements the data synchronization method as described above.

Compared with the prior art, the invention has the following advantages:

The data synchronization method of the invention performs incremental synchronization and full synchronization in the same program and depends on fewer components, which reduces the maintenance cost and the development cost; the method switches seamlessly to the incremental data synchronization stage after the full data synchronization stage is finished, and no separate canal client is needed to maintain the synchronization log, which reduces the maintenance cost; and the invention integrates a Binlog delay indicator on top of flink-cdc to judge whether to request new running resources, thereby ensuring that data synchronization remains real-time.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the principle of the application. In the drawings:

FIG. 1 is an exemplary flow diagram of a data synchronization method according to an embodiment of the invention;

FIG. 2 is an exemplary timing diagram of a data synchronization method according to an embodiment of the present invention;

FIG. 3 is a system block diagram of a data synchronization system in accordance with an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments disclosed below.

As used in this application and the appended claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.

The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description. Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate. In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

It should be noted that the terms "first", "second", and the like are used only to distinguish the corresponding components for convenience; unless otherwise stated, these terms have no special meaning and therefore should not be construed as limiting the scope of protection of the present application. Further, although the terms used in the present application are selected from publicly known and commonly used terms, some of the terms mentioned in the specification may have been selected by the applicant at his or her discretion, and their detailed meanings are described in the relevant parts of the description herein. The present application should therefore be understood not only through the actual terms used but also through the meaning that each term carries.

Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, various steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more operations may be removed from them.

Fig. 1 is an exemplary flowchart of a data synchronization method according to an embodiment of the present invention. As shown in fig. 1, the data synchronization method 10 of the present embodiment includes the following steps:

step S11, in the full data synchronization stage, extracting data from the first database through a first streaming computing program corresponding to a first computing framework to form a first data storage message queue;

step S12, in the incremental data synchronization stage after the full data synchronization stage is finished, determining a data starting point of the incremental data synchronization through the first streaming computing program corresponding to the first computing framework, and extracting data from the data starting point to form a second data storage message queue;

step S13, in the full data synchronization stage and the incremental data synchronization stage, extracting data from the first data storage message queue or the second data storage message queue through a second streaming computing program corresponding to the first computing framework;

step S14, performing data screening and format conversion operations on the data extracted from the first data storage message queue or the second data storage message queue to form processed data;

and step S15, storing the processed data in a second database.
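By way of illustration only, the order of steps S11 to S15 can be sketched with in-memory stand-ins. The lists, queues, field names and the screening rule below are hypothetical and merely take the place of the first database, the two data storage message queues and the second database.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical stand-ins for the first database: a snapshot and a change log.
first_db_snapshot: List[Dict] = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
first_db_changes: List[Dict] = [{"id": 3, "name": "c"}, {"id": 4, "name": ""}]


def extract_full(queue: Deque[Dict]) -> int:
    """S11: push the full data into the first queue and return the start point."""
    queue.extend(first_db_snapshot)
    return 0  # position in the change log where incremental reading begins


def extract_incremental(queue: Deque[Dict], start_point: int) -> None:
    """S12: push changes from the start point onward into the second queue."""
    queue.extend(first_db_changes[start_point:])


def consume(queue: Deque[Dict], second_db: Dict[int, Dict]) -> None:
    """S13 to S15: read from a queue, screen, convert and store."""
    while queue:
        record = queue.popleft()            # S13: extract from the queue
        if not record["name"]:              # S14: screening (assumed rule)
            continue
        row = {"ID": record["id"], "NAME": record["name"].upper()}  # S14: conversion
        second_db[row["ID"]] = row          # S15: store in the second database


first_queue: Deque[Dict] = deque()
second_queue: Deque[Dict] = deque()
second_db: Dict[int, Dict] = {}

start = extract_full(first_queue)
consume(first_queue, second_db)
extract_incremental(second_queue, start)
consume(second_queue, second_db)
```

The same `consume` function serves both queues, which mirrors step S13 operating in both the full and the incremental stage.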

The above steps S11-S15 are explained in detail with reference to fig. 1-2.

In step S11, in the full data synchronization stage, data is extracted from the first database by the first streaming computing program corresponding to the first computing framework, so as to form a first data storage message queue. In some embodiments, the first computing framework comprises a flink computing framework and the first streaming computing program comprises a flink-cdc streaming computing program. flink is a new-generation distributed streaming data processing framework whose unified processing engine can process both batch data and streaming data. flink-cdc is a component of flink. cdc is an abbreviation of Change Data Capture; it can synchronize incremental change records of a source database (Source) to one or more data targets (Sink). In other words, flink-cdc is a Source component that can read both the full data and the incremental change data directly from the database.

In some embodiments, the first database may be a MySQL database, or a PostgreSQL, Oracle, MongoDB, or the like database, which is not limited in this application. When the first database is a MySQL database, the extracted data may be Binlog data.

The advantage of extracting the full data from the MySQL database through flink-cdc is that the historical full data of the database is read first and reading then switches smoothly to the Binlog, which guarantees that each piece of data is read neither more than once nor less than once. Even if a fault occurs, the data is still processed with Exactly-Once semantics. After the reading is finished, the first data storage message queue is formed.

In some embodiments, the first data storage message queue comprises a kafka message queue.

Fig. 2 is an exemplary timing diagram of a data synchronization method according to an embodiment of the present invention. As shown in FIG. 2, the components that run the data synchronization method include: MySQL database 21, flink-cdc streaming program 22, kafka cluster 23, flink streaming program 24, KUDU database 25. The flink-cdc streaming program 22 extracts the full data from the MySQL database 21, determines which kafka topic each piece of data is sent to by controlling, within the program, the flow direction of the data of each database, forms a first kafka message queue, and then inputs the first kafka message queue into the kafka cluster 23. Formation of the first kafka message queue marks the completion of the full data synchronization stage, after which the flink-cdc streaming program 22 switches seamlessly to the incremental data synchronization stage.
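By way of illustration only, the selection of a kafka topic for each piece of data can be sketched as a lookup keyed on the originating database and table. The routing table, the database and table names, and the default topic are hypothetical and not taken from the application.

```python
from typing import Dict, Tuple

# Hypothetical routing table: (database, table) -> kafka topic name.
TOPIC_ROUTES: Dict[Tuple[str, str], str] = {
    ("trial_db", "subject"): "sync.trial_db.subject",
    ("trial_db", "visit"): "sync.trial_db.visit",
}
DEFAULT_TOPIC = "sync.unrouted"  # assumed fallback for unlisted tables


def route_topic(database: str, table: str) -> str:
    """Return the topic that data from the given database and table is sent to."""
    return TOPIC_ROUTES.get((database, table), DEFAULT_TOPIC)
```

Keeping the routes in one table means a new source table can be directed to a topic without changing the extraction logic.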

In step S12, in the incremental data synchronization stage after the full data synchronization stage is completed, a data starting point of the incremental data synchronization is determined by the first streaming computing program corresponding to the first computing framework, and data is extracted from the data starting point to form a second data storage message queue.

In some embodiments, the second data storage message queue may be a kafka message queue, which is not limited by this application. Specifically, flink-cdc first divides the data table holding the full data into snapshot chunks (Snapshot Chunk) by primary key and then distributes the Snapshot Chunks to a plurality of SourceReaders. When each Snapshot Chunk is read, an algorithm provides consistent reading without locks. While the SourceReaders are reading, checkpoints at Chunk granularity are supported, and the checkpoint is the data starting point of incremental data synchronization. After all Snapshot Chunks have been read, a Binlog Chunk is issued to read the Binlog data of the incremental part from the data starting point (the checkpoint), forming the second data storage message queue.

As shown in fig. 2, on the basis of the total amount of data extracted from the MySQL database 21, the flink-cdc streaming computation program 22 reads Binlog data of the incremental part from the data starting point checkpoint to form a second kafka message queue, and then inputs the second kafka message queue into the kafka cluster 23.
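By way of illustration only, dividing a table into chunks by primary key and distributing the chunks to several readers can be sketched as follows. An integer primary key, a fixed chunk size and round-robin distribution are assumptions of this sketch; it does not reproduce the lock-free consistency algorithm or the internals of flink-cdc.

```python
from typing import List, Optional, Tuple

# A chunk is a half-open primary-key range [start, end); None means unbounded.
Chunk = Tuple[Optional[int], Optional[int]]


def split_into_chunks(min_key: int, max_key: int, chunk_size: int) -> List[Chunk]:
    """Divide the primary-key range into half-open chunks.

    In this sketch the first chunk is unbounded below and the last is
    unbounded above, so keys outside [min_key, max_key] are still covered.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    boundaries = list(range(min_key + chunk_size, max_key + 1, chunk_size))
    starts: List[Optional[int]] = [None] + boundaries
    ends: List[Optional[int]] = boundaries + [None]
    return list(zip(starts, ends))


def assign_chunks(chunks: List[Chunk], reader_count: int) -> List[List[Chunk]]:
    """Distribute chunks round-robin across the given number of readers."""
    if reader_count <= 0:
        raise ValueError("reader_count must be positive")
    assignments: List[List[Chunk]] = [[] for _ in range(reader_count)]
    for index, chunk in enumerate(chunks):
        assignments[index % reader_count].append(chunk)
    return assignments


chunks = split_into_chunks(min_key=1, max_key=10, chunk_size=4)
per_reader = assign_chunks(chunks, reader_count=2)
```

Because each chunk is an independent key range, a reader that fails can resume from the last completed chunk rather than from the beginning of the table.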

In step S13, in the full data synchronization stage and the incremental data synchronization stage, data is extracted from the first data storage message queue or the second data storage message queue by the second streaming computing program corresponding to the first computing framework.

In some embodiments, the second streaming computing program comprises a flink streaming computing program. flink provides a dedicated Kafka connector for reading data from or writing data to a Kafka topic. Specifically, as shown in fig. 2, the flink streaming program 24 connects to the kafka cluster 23 through the Kafka connector. First, Kafka-related parameters, such as the topic name and the port number of the data source, are set in the flink streaming program 24. After the setting is completed, the flink streaming program 24 reads the data in the kafka message queue whose topic name matches the configured name and generates the original target data stream.
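By way of illustration only, setting the Kafka-related parameters and then reading the data under the configured topic name can be sketched as follows. The settings class, its field names and the dictionary standing in for the kafka cluster are hypothetical; this is not the API of the flink Kafka connector.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class KafkaSourceSettings:
    """Hypothetical settings holder; the field names are illustrative only."""
    topic: str
    host: str
    port: int

    @property
    def bootstrap_servers(self) -> str:
        """Combine host and port into the usual host:port address form."""
        return f"{self.host}:{self.port}"


def read_topic(cluster: Dict[str, List[bytes]], settings: KafkaSourceSettings) -> Iterator[bytes]:
    """Yield the messages stored under the configured topic name."""
    yield from cluster.get(settings.topic, [])


settings = KafkaSourceSettings(topic="sync.trial_db.subject", host="localhost", port=9092)
cluster = {"sync.trial_db.subject": [b"r1", b"r2"], "sync.trial_db.visit": [b"v1"]}
original_stream = list(read_topic(cluster, settings))
```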

In step S14, data screening and format conversion operations are performed on the data extracted from the first data storage message queue or the second data storage message queue to form processed data. Specifically, the original target data stream is screened through flink operators; the screening may take the form of grouping (GROUP BY), multi-table association (JOIN), and the like, which is not limited in the present application. After the data screening is completed, the DataStream-level API provided by flink converts the format of the screened original target data stream according to the requirements of the second database and generates a target data stream.
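By way of illustration only, screening followed by format conversion can be sketched in plain Python. The change-event layout, the operation codes, the table name and the target column names are assumptions of this sketch, and a simple filter stands in for the GROUP BY or JOIN operators mentioned above.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical change-event layout:
#   {"table": str, "op": "c" | "u" | "d", "after": dict or None}


def screen(events: Iterable[Dict]) -> Iterator[Dict]:
    """Keep only events from the table of interest that carry a row image."""
    for event in events:
        if event.get("table") == "subject" and event.get("after") is not None:
            yield event


def convert(event: Dict) -> Dict:
    """Map a change event to the column layout assumed for the second database."""
    after = event["after"]
    return {
        "subject_id": int(after["id"]),
        "subject_name": str(after["name"]).strip(),
        "op": event["op"],
    }


events: List[Dict] = [
    {"table": "subject", "op": "c", "after": {"id": "7", "name": " Ann "}},
    {"table": "subject", "op": "d", "after": None},
    {"table": "visit", "op": "c", "after": {"id": "1", "name": "x"}},
]
target_rows = [convert(event) for event in screen(events)]
```

Separating `screen` from `convert` keeps the selection rule independent of the column layout required by the second database.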

In step S15, the processed data, in the form of the target data stream, is stored in the second database. In some embodiments, the second database may be a KUDU database, which is not limited in this application. As shown in fig. 2, the target data stream generated by the flink streaming program 24 is stored in the KUDU database 25. The target data stream stored in the KUDU database may be used as a data source for the data operations layer of the data warehouse in order to speed up the extraction of data.

In some embodiments, the data synchronization method further comprises: monitoring the operation time of extracting data from the first database through the first streaming computing program corresponding to the first computing framework to obtain an operation delay value; comparing the operation delay value with a set first threshold value to obtain a judgment result; and determining, based on the judgment result, whether to request a new running resource for the data extraction operation.

For example, the scheme of the present invention may add a Binlog delay indicator on top of flink-cdc, integrate that indicator into the flink metrics, and monitor the data delay time through the grafana component, so that the corresponding resources can be adjusted according to the indicator. The Binlog delay here refers to the time from when data is generated in MySQL until the flink task pulls the data into the program for processing. As shown in FIG. 2, the operation time of the flink-cdc streaming program 22 extracting data from the MySQL database 21 is monitored by the monitoring component grafana. The time when the flink-cdc streaming program 22 starts to extract data from the MySQL database 21 is the start time, and the time when the extracted data is stored in the kafka cluster 23 is the end time. The start time is subtracted from the end time to obtain the operation delay value, which is compared with the set first threshold value. If the operation delay value is greater than the first threshold value, the current running resources are insufficient to keep data synchronization real-time, and a new running resource is therefore requested for the data extraction operation. In some embodiments, the running resources include processing resources and storage resources.
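By way of illustration only, the comparison of the delay value with the first threshold can be sketched as follows. The function names, the timestamps and the threshold values are hypothetical; how a new running resource is actually requested is outside the scope of this sketch.

```python
from datetime import datetime, timedelta


def operation_delay(start_time: datetime, end_time: datetime) -> timedelta:
    """Delay value: end time (data stored in kafka) minus start time (extraction begins)."""
    if end_time < start_time:
        raise ValueError("end_time must not be earlier than start_time")
    return end_time - start_time


def needs_new_resource(delay: timedelta, first_threshold: timedelta) -> bool:
    """Judgment result: True when the delay exceeds the first threshold."""
    return delay > first_threshold


start = datetime(2021, 12, 21, 10, 0, 0)
end = start + timedelta(seconds=8)
delay = operation_delay(start, end)
request = needs_new_resource(delay, first_threshold=timedelta(seconds=5))
```

With an 8 second delay and a 5 second threshold the judgment result is positive; a delay equal to the threshold does not trigger a request in this sketch.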

In some embodiments, monitoring the operation time of extracting data from the first database through the first streaming computing program corresponding to the first computing framework to obtain the operation delay value includes: obtaining the number of pieces of data extracted within a set first time interval; obtaining an average extraction time per piece of data based on the number of pieces of data and the first time interval; and taking the average extraction time per piece of data as the operation delay value. Illustratively, as shown in fig. 2, a first time interval is set, the number of pieces of data extracted from the MySQL database 21 by the flink-cdc streaming program 22 within the first time interval is counted, and the first time interval is divided by the total number of pieces of data to obtain the average extraction time per piece of data, which is used as the operation delay value. The operation delay value is then compared with the first threshold value to judge whether to request a new running resource.
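By way of illustration only, the averaging described above amounts to dividing the first time interval by the number of pieces of data extracted within it. The function name, the units (seconds) and the handling of an empty interval are assumptions of this sketch.

```python
from typing import Optional


def average_extraction_time(first_time_interval_s: float, record_count: int) -> Optional[float]:
    """Average extraction time per piece of data, in seconds.

    Returns None when nothing was extracted in the interval, since the
    average is undefined in that case (an assumption of this sketch).
    """
    if first_time_interval_s <= 0:
        raise ValueError("first_time_interval_s must be positive")
    if record_count < 0:
        raise ValueError("record_count must not be negative")
    if record_count == 0:
        return None
    return first_time_interval_s / record_count


# 12000 pieces of data extracted in a 60 second interval: 60 / 12000 seconds each.
delay_value = average_extraction_time(60.0, 12000)
first_threshold_s = 0.01  # hypothetical first threshold
request_new_resource = delay_value is not None and delay_value > first_threshold_s
```

Here the average is 0.005 seconds per piece of data, which is below the assumed threshold, so no new running resource is requested.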

In one embodiment, maintaining the synchronization log includes, for example, maintaining data start point information for incremental data synchronization.

The invention also includes a data synchronization system comprising a memory and a processor. Wherein the memory is to store instructions executable by the processor; the processor is configured to execute the instructions to implement the foregoing data synchronization method.

FIG. 3 is a system block diagram of a data synchronization system in accordance with an embodiment of the present invention. Referring to FIG. 3, the data synchronization system 300 may include an internal communication bus 301, a processor 302, a Read Only Memory (ROM) 303, a Random Access Memory (RAM) 304, and a communication port 305. When used on a personal computer, the data synchronization system 300 may also include a hard disk 306. The internal communication bus 301 enables data communication between the components of the data synchronization system 300. The processor 302 may make determinations and issue prompts. In some embodiments, the processor 302 may be comprised of one or more processors. The communication port 305 enables data communication between the data synchronization system 300 and the outside. In some embodiments, the data synchronization system 300 may send and receive information and data over a network through the communication port 305. The data synchronization system 300 may also include various forms of program storage units and data storage units, such as the hard disk 306, the Read Only Memory (ROM) 303 and the Random Access Memory (RAM) 304, capable of storing various data files for computer processing and/or communication, as well as possible program instructions for execution by the processor 302. The processor executes these instructions to implement the main parts of the method. The results processed by the processor are communicated to the user device through the communication port and displayed on the user interface.

The data synchronization method described above can be implemented as a computer program, stored in the hard disk 306, and loaded into the processor 302 for execution, so as to implement the data synchronization of the present application.

The invention also comprises a computer-readable medium having stored thereon computer program code which, when executed by a processor, implements the data synchronization method as described above.

When the data synchronization method is implemented as a computer program, it may also be stored in a computer-readable storage medium as an article of manufacture. For example, computer-readable storage media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD)), smart cards, and flash memory devices (e.g., Electrically Erasable Programmable Read Only Memory (EEPROM), card, stick, key drive). In addition, various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media (and/or storage media) capable of storing, containing, and/or carrying code and/or instructions and/or data.

It should be understood that the above-described embodiments are illustrative only. The embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processor may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and/or other electronic units designed to perform the functions described herein, or a combination thereof.

Aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof. Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD)), smart cards, and flash memory devices (e.g., card, stick, key drive).

The computer readable medium may comprise a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. The computer readable medium can be any computer readable medium that can communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, radio frequency signals, or the like, or any combination of the preceding.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing disclosure is by way of example only, and is not intended to limit the present application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present application and thus fall within the spirit and scope of the exemplary embodiments of the present application.

Also, this application uses specific language to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.

Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, the embodiments may be characterized as having fewer than all of the features of a single embodiment disclosed above.

Although the present application has been described with reference to the present specific embodiments, it will be recognized by those skilled in the art that the foregoing embodiments are merely illustrative of the present application and that various changes and substitutions of equivalents may be made without departing from the spirit of the application, and therefore, it is intended that all changes and modifications to the above-described embodiments that come within the spirit of the application fall within the scope of the claims of the application.

Claims

1. A method of data synchronization comprising the steps of:

in a full data synchronization stage, extracting data from a first database through a first streaming computing program corresponding to a first computing framework to form a first data storage message queue;

in an incremental data synchronization stage after the full data synchronization stage is finished, determining a data starting point of incremental data synchronization through the first streaming computing program corresponding to the first computing framework, and extracting data from the data starting point to form a second data storage message queue;

in the full data synchronization stage and the incremental data synchronization stage, extracting data from the first data storage message queue or the second data storage message queue through a second streaming computing program corresponding to the first computing framework;

performing data screening and format conversion operations on the data extracted from the first data storage message queue or the second data storage message queue to form processed data;

and storing the processed data to a second database.

2. The data synchronization method of claim 1, further comprising:

monitoring the operation time of extracting data from the first database through the first streaming computing program corresponding to the first computing framework to obtain an operation delay value;

comparing the operation delay value with a set first threshold value to obtain a judgment result;

and determining whether to request a new running resource for the data extraction operation based on the judgment result.

3. The data synchronization method of claim 2, wherein monitoring the operation time of extracting data from the first database through the first streaming computing program corresponding to the first computing framework to obtain the operation delay value comprises:

obtaining the number of pieces of data extracted within a set first time interval;

obtaining an average extraction time per piece of data based on the number of pieces of data and the first time interval;

and taking the average extraction time per piece of data as the operation delay value.

4. The data synchronization method of claim 2, wherein the running resources comprise processing resources and storage resources.

5. The data synchronization method of claim 1, wherein the first computing framework comprises a flink computing framework, wherein the first streaming computing program comprises a flink-cdc streaming computing program, and wherein the second streaming computing program comprises a flink streaming computing program.

6. The data synchronization method of claim 1, wherein the extracted data comprises Binlog data.

7. The data synchronization method of claim 1, wherein the first and second data storage message queues comprise kafka message queues.

8. The data synchronization method of claim 1, wherein the first database comprises a MySQL database and the second database comprises a KUDU database.

9. A data synchronization system, comprising:

a memory for storing instructions executable by the processor; and

a processor for executing the instructions to implement the method of any one of claims 1-8.

10. A computer-readable medium having stored thereon computer program code which, when executed by a processor, implements the method of any of claims 1-8.