CN113032477B - Long-distance data synchronization method and device based on GTID and computing equipment

Info

Publication number: CN113032477B
Application number: CN201911349579.1A
Authority: CN
Inventors: 刘阎; 赵伟峰; 王印森; 樊宇
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Online Services Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Online Services Co Ltd
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2023-07-21
Anticipated expiration: 2039-12-24
Also published as: CN113032477A

Abstract

Embodiments of the invention relate to the technical field of big data and disclose a GTID-based long-distance data synchronization method, device and computing equipment. The method comprises: subscribing, from the current Zookeeper cluster of a first data center, to the data node that stores the metadata information to be audited, so as to trigger an audit; calculating checksums block by block over the data of the first data center database and the second data center database, and comparing them to find the data blocks that contain data differences; repeatedly adjusting the size of each data block that contains a data difference and recalculating and comparing the checksums until the difference data with inconsistent checksums is located; and further comparing the checksums of that difference data to obtain an audit result. In this way, embodiments of the invention can effectively detect differences in massive volumes of synchronized data, achieve efficient cross-center data difference comparison, and greatly improve data auditing efficiency.

Description

Long-distance data synchronization method and device based on GTID and computing equipment

Technical Field

Embodiments of the invention relate to the technical field of big data, and in particular to a GTID-based long-distance data synchronization method, device and computing equipment.

Background

As internet architectures evolve, guaranteeing high availability of application services has become the core of architecture construction. Remote disaster recovery and remote dual-active (active-active) deployment are important means of guaranteeing that high availability. The core problem that both architectures must solve, however, is how to guarantee data consistency for applications that span data centers. Cross-center data synchronization is an effective solution for data consistency across multiple data centers, and can be divided into two approaches: storage replication and data replication. Storage replication synchronously or asynchronously replicates disks to a different data center by means of storage replication technology. Data replication achieves data synchronization between data centers through database technology or third-party software.

Existing open-source database synchronization systems and commercial cross-center data synchronization services mainly address the scenario of cross-center, one-way master-slave data synchronization. In remote disaster recovery and remote dual-active scenarios, the data synchronization platform is the core of the whole architecture, so how to make the data synchronization system itself highly available is an important consideration when deploying the platform in a production environment.

Existing open-source software and systems do not provide a highly reliable, scalable data synchronization platform architecture and lack effective monitoring and management means, so they cannot meet the requirements of real production environments. Nor do they provide an efficient and accurate data auditing scheme that can promptly discover differences in multi-center data synchronization and report inconsistent multi-center business data.

Disclosure of Invention

In view of the above problems, embodiments of the present invention provide a long-distance data synchronization method, apparatus and computing device based on GTID, which overcome or at least partially solve the above problems.

According to one aspect of the embodiments of the present invention, a GTID-based long-distance data synchronization method is provided, the method including: subscribing, from the current Zookeeper cluster of a first data center, to the data node that stores the metadata information to be audited, so as to trigger an audit; calculating checksums block by block over the data of the first data center database and the second data center database, and comparing them to find the data blocks that contain data differences; repeatedly adjusting the size of each data block that contains a data difference and recalculating and comparing the checksums until the difference data with inconsistent checksums is located; and further comparing the checksums of that difference data to obtain an audit result.

In an optional manner, calculating checksums block by block over the data of the first data center database and the second data center database, and comparing them to find the data blocks that contain data differences, includes: performing a CRC32 check on the data of the second data center database block by block, calculating a first checksum and extracting the data block boundary; performing a CRC32 check on the data of the first data center database according to the data block boundary, and calculating a second checksum; and comparing the first checksum with the second checksum. If the first checksum is consistent with the second checksum, the data of the data block is synchronized between the first data center database and the second data center database; if the first checksum is inconsistent with the second checksum, the data block contains a data difference between the two databases.

In an optional manner, repeatedly adjusting the size of the data block that contains a data difference and recalculating and comparing the checksums until the difference data with inconsistent checksums is located includes: reducing the size of the data block; calculating a checksum for each reduced data block on both sides, comparing the checksums, and finding the data blocks that contain data differences; and repeating the size reduction and the checksum calculation and comparison until the data block that contains a data difference consists only of the difference data with inconsistent checksums.

In an optional manner, further comparing the checksums of the difference data with inconsistent checksums to obtain an audit result includes: after waiting for a preset time, recalculating and comparing the first checksum and the second checksum; if they are consistent, the difference data is synchronized between the first data center database and the second data center database; if they are not consistent, comparing the currently calculated first checksum of the difference data with the previously calculated second checksum; if these are consistent, the difference data is subject to a synchronization delay between the two databases; and if these are inconsistent, the difference data is being continuously modified in the first data center database and the second data center database.

In an optional manner, the method further comprises: acquiring the data that currently needs to be parsed according to the BinLog position information or the global transaction identifier (GTID) stored in the first data center; parsing the data and supplementing it with field types, field names, primary key information, a loop-back mark and the latest update time; transmitting the parsed data to the second data center; and updating and storing the current BinLog position information or GTID.

In an optional manner, the method further comprises: receiving and storing the parsed data transmitted by the second data center; filtering out, according to the loop-back mark in the data, data that originated from the first data center; and, for the same data, generating conflict alarm information if the latest update time of the data in the first data center is greater than the latest update time of the data transmitted from the second data center.

In an optional manner, the method further comprises: configuring message queue topics in response to an end user's request; storing the parsed data in different message queues according to the message queue topic; and completing data subscription and consumption in response to the data consumption and conversion requests of different end users.

According to another aspect of the embodiments of the present invention, a GTID-based long-distance data synchronization device is provided, the device including: an audit triggering unit, configured to subscribe, from the current Zookeeper cluster of the first data center, to the data node that stores the metadata information to be audited, so as to trigger an audit; a first searching unit, configured to calculate checksums block by block over the data of the first data center database and the second data center database and compare them to find the data blocks that contain data differences; a second searching unit, configured to repeatedly adjust the size of each data block that contains a data difference and recalculate and compare the checksums until the difference data with inconsistent checksums is located; and a data checking unit, configured to further compare the checksums of the difference data with inconsistent checksums to obtain an audit result.

According to another aspect of an embodiment of the present invention, there is provided a computing device including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;

the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform the steps of the GTID-based long-distance data synchronization method described above.

According to yet another aspect of the embodiments of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, the executable instruction causing a processor to perform the steps of the GTID-based long-distance data synchronization method described above.

According to the embodiments of the invention, an audit is triggered by subscribing, from the current Zookeeper cluster of the first data center, to the data node that stores the metadata information to be audited; checksums are calculated block by block over the data of the first data center database and the second data center database and compared to find the data blocks that contain data differences; the size of each such data block is repeatedly adjusted and the checksums recalculated and compared until the difference data with inconsistent checksums is located; and the checksums of that difference data are further compared to obtain an audit result. As a result, differences in massive volumes of synchronized data can be detected effectively, efficient cross-center data difference comparison is achieved, and data auditing efficiency is greatly improved.

The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the specification, specific embodiments of the present invention are set forth below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

Fig. 1 shows a schematic diagram of a remote dual-active architecture to which a GTID-based long-distance data synchronization method provided by an embodiment of the present invention is applied;

Fig. 2 shows a flow diagram of a GTID-based long-distance data synchronization method according to an embodiment of the present invention;

Fig. 3 shows a schematic diagram of conflict detection in a GTID-based long-distance data synchronization method according to an embodiment of the present invention;

Fig. 4 shows a schematic structural diagram of a GTID-based long-distance data synchronization device according to an embodiment of the present invention;

Fig. 5 shows a schematic diagram of a computing device provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Fig. 1 shows a schematic diagram of a remote dual-active architecture to which the GTID-based long-distance data synchronization method provided by an embodiment of the present invention is applied. As shown in Fig. 1, the long-distance data synchronization method based on the global transaction identifier (GTID) is used to synchronize data between a first data center and a second data center. The configuration management platform, the data auditing node (drp-tracker) and the data synchronization node (drp-tube) are the processes for configuration management, data auditing and data synchronization, respectively, and are deployed in the GTID-based long-distance data synchronization device 10.

In the embodiment of the invention, a Zookeeper cluster is deployed in each of the first data center and the second data center to store the binary log (BinLog) position information of the database, the global transaction identifier set (GTID Set), the scheduling management information of the data synchronization nodes and the scheduling management information of the data auditing nodes. The configuration management platform adopts a single-center cluster deployment architecture (in Fig. 1 it is deployed in the first data center) and achieves high availability through a cross-center active-standby mode. The data auditing node (drp-tracker) is deployed in a single center because it only triggers data auditing operations. The data auditing node (drp-tracker) and the data synchronization node (drp-tube) are deployed independently of each other, which avoids the contention for disk I/O, memory and CPU resources that would arise if the data auditing and data parsing processes ran on the same physical host. In the embodiment of the present invention, both the first data center and the second data center are provided with the GTID-based long-distance data synchronization device 10 that applies the method. The device 10 in each data center must deploy a data synchronization node (drp-tube), whereas the configuration management platform and the data auditing node (drp-tracker) need only be deployed in the device 10 of one of the two data centers; as shown in Fig. 1, they are deployed in the first data center.

In the embodiment of the invention, a synchronization task is started or an auditing task is initiated through the configuration management platform. The data synchronization node of the first or second data center parses the data of its database according to the BinLog position information or the GTID and transmits the parsed data to the data synchronization node at the opposite end, which stores the received data in its own database to achieve data synchronization. A message queue (MQ) cluster is used to separate the data parsing node from the data transmission node, providing a platform-level service in which data is parsed once and consumed by multiple terminals. The data synchronization node writes the parsed data into the message queue, from which different end users can obtain it. The data auditing node audits the data of the first data center and the second data center, detects differences in the synchronized data, feeds the difference data back to the configuration management platform and issues a conflict alarm.

Fig. 2 shows a flow chart of a GTID-based long-distance data synchronization method according to an embodiment of the present invention. As shown in fig. 2, the long-distance data synchronization method based on GTID includes:

Step S11: subscribe, from the current Zookeeper cluster of the first data center, to the data node that stores the metadata information to be audited, so as to trigger the audit.

In step S11, in the first data center, the metadata information to be audited is written into the Zookeeper cluster according to the configuration information, and an auditing task is initiated. The data node in the Zookeeper cluster that stores the metadata information to be audited is subscribed to, and the auditing task against the database is triggered according to that metadata information.

Step S12: calculate checksums block by block over the data of the first data center database and the second data center database, and compare them to find the data blocks that contain data differences.

In step S12, a CRC32 check is performed block by block on the data of the second data center database, a first checksum is calculated and the data block boundary is extracted. A CRC32 check is then performed on the data of the first data center database according to the data block boundary, and a second checksum is calculated. The first checksum is compared with the second checksum to find the data blocks that contain data differences. If the first checksum is consistent with the second checksum, the data of the data block is synchronized between the first data center database and the second data center database; the data block is removed from the difference data and its audit ends. If the first checksum is inconsistent with the second checksum, the data block contains a data difference between the two databases. The data block boundary may be represented by the BinLog position information, or the global transaction identifier (GTID), of the start position and the end position of the data block.
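Step S12 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each database is a list of `(primary_key, value)` rows and that a block boundary is a pair of primary keys; the patent itself expresses boundaries as BinLog position information or GTIDs. The function names are hypothetical.

```python
import zlib


def _crc32_of(rows):
    """CRC32 over a canonical text encoding of primary-key-ordered rows."""
    payload = "\n".join(f"{pk}:{value}" for pk, value in sorted(rows))
    return zlib.crc32(payload.encode("utf-8"))


def chunk_checksums(rows, chunk_size):
    """Split rows into fixed-size blocks; return (lower, upper, crc32) per block."""
    ordered = sorted(rows)
    blocks = []
    for start in range(0, len(ordered), chunk_size):
        block = ordered[start:start + chunk_size]
        blocks.append((block[0][0], block[-1][0], _crc32_of(block)))
    return blocks


def find_differing_blocks(first_dc_rows, second_dc_rows, chunk_size):
    """Return the block boundaries whose checksums differ between the two sides."""
    differing = []
    # First checksum and block boundary come from the second data center.
    for lower, upper, first_checksum in chunk_checksums(second_dc_rows, chunk_size):
        # Second checksum is computed on the first data center within that boundary.
        in_range = [(pk, v) for pk, v in first_dc_rows if lower <= pk <= upper]
        second_checksum = _crc32_of(in_range)
        if first_checksum != second_checksum:
            differing.append((lower, upper))
    return differing
```

Because the boundaries are taken from one side only, rows that exist solely on the other side outside every boundary are not covered by this simplified sketch.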

Step S13: repeatedly adjust the size of each data block that contains a data difference and recalculate and compare the checksums until the difference data with inconsistent checksums is located.

Specifically, the size of the data block that contains a data difference is reduced; a checksum is calculated for each reduced data block on both sides and the checksums are compared, so as to continue finding the data blocks that contain data differences; and the size reduction and the checksum calculation and comparison are repeated until the data block that contains a data difference consists only of the difference data with inconsistent checksums, at which point the difference data has been located. In the embodiment of the invention, the size of the data blocks that contain data differences is reduced continuously, and the CRC32 checksum is repeatedly calculated and compared for the data blocks of the first data center and the second data center until one or more pieces of difference data are finally found.
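The narrowing in step S13 might be sketched as follows. The text only says that the block size is reduced; halving the range on each pass is an assumption made here, as are the integer primary keys, the dict-based tables and the function names.

```python
import zlib


def range_crc32(rows, lower, upper):
    """CRC32 of the rows (dict: primary key -> value) whose key lies in [lower, upper]."""
    payload = "\n".join(
        f"{pk}:{rows[pk]}" for pk in sorted(rows) if lower <= pk <= upper
    )
    return zlib.crc32(payload.encode("utf-8"))


def narrow(first_dc, second_dc, lower, upper):
    """Return the keys in [lower, upper] whose rows differ between the two sides."""
    if range_crc32(first_dc, lower, upper) == range_crc32(second_dc, lower, upper):
        return []          # block is consistent, stop auditing it
    if lower == upper:
        return [lower]     # block has shrunk to a single differing row
    mid = (lower + upper) // 2
    # Reduce the block size and repeat the checksum comparison on each half.
    return narrow(first_dc, second_dc, lower, mid) + narrow(first_dc, second_dc, mid + 1, upper)
```

With this scheme a block of n rows containing a single difference needs on the order of log2(n) checksum comparisons rather than n row comparisons.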

Step S14: further compare the checksums of the difference data with inconsistent checksums to obtain an audit result.

For any piece of difference data, the first checksum and the second checksum are recalculated and compared after waiting for a preset time. If they are consistent, the difference data is synchronized between the first data center database and the second data center database; it is removed from the difference data and its audit ends. If they are not consistent, the currently calculated first checksum is compared with the previously calculated second checksum. If these are consistent, the difference data is subject to a synchronization delay between the two databases; it is removed from the difference data and its audit ends. If these are inconsistent, the difference data is being continuously modified in the first data center database and the second data center database; the result information for the difference data is fed back to the management node and a conflict alarm is issued. The preset waiting time can be set by the end user according to need or experience; it is generally set to an integer multiple of the synchronization delay, preferably twice the synchronization delay. According to the embodiment of the invention, the fixed-length sliding-chunk data auditing algorithm provides accurate and efficient data auditing and supports flexible configuration strategies. It can effectively detect differences in massive volumes of synchronized data, greatly improves auditing efficiency compared with a conventional one-to-one comparison strategy, and achieves efficient cross-center data difference comparison.
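The three outcomes of step S14 can be written down as a small decision function. This is a sketch of the comparison order described above; the enum and parameter names are hypothetical.

```python
from enum import Enum


class AuditResult(Enum):
    SYNCHRONIZED = "synchronized"
    SYNC_DELAY = "sync_delay"
    CONTINUOUSLY_MODIFIED = "continuously_modified"


def classify(previous_second_checksum, current_first_checksum, current_second_checksum):
    """Classify one piece of difference data after the preset wait.

    first checksum  = computed on the second data center database
    second checksum = computed on the first data center database
    """
    if current_first_checksum == current_second_checksum:
        # Both sides agree now: drop the item from the difference data.
        return AuditResult.SYNCHRONIZED
    if current_first_checksum == previous_second_checksum:
        # One side has only just reached the other side's earlier state.
        return AuditResult.SYNC_DELAY
    # Neither comparison matches: report it and raise a conflict alarm.
    return AuditResult.CONTINUOUSLY_MODIFIED
```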

In the embodiment of the invention, data is also synchronously parsed and transmitted in the first data center and the second data center. Taking the first data center as an example: the data that currently needs to be parsed is acquired according to the BinLog position information or global transaction identifier stored in the first data center; the data is parsed and supplemented with field types, field names, primary key information, a loop-back mark and the latest update time; the parsed data is transmitted to the second data center; and the current BinLog position information or GTID is updated and stored. Specifically, the data parsing scheme is selected by configuration as either the position-based BinLog Position Mode or the globally unique transaction-based GTID Mode. If BinLog Position Mode is selected, the position information stored after the last parse is obtained from the Zookeeper cluster; if none is configured, the current latest BinLog position information of MySQL is obtained from the Zookeeper cluster. If GTID Mode is selected, the GTID Set stored after the last parse is obtained from the Zookeeper cluster; if none is configured, the GTID Set of the current database is obtained from the Zookeeper cluster. A connection to the database is then established. In GTID Mode, a COM_BINLOG_DUMP_GTID command is sent to the server together with the GTID Set; in BinLog Position Mode, a COM_BINLOG_DUMP command is sent together with the BinLog position information. The BinLog data that the database (MySQL) pushes on the basis of the requested position information or GTID Set is then received.

The binary log parser (BinLog Parse) performs protocol parsing on the received BinLog data and supplements it with information such as field type, field name, primary key information, loop-back mark and latest update time. The Event transaction data is then transmitted synchronously to the second data center to achieve synchronization. Data transmission is a blocking operation that continues until the transmission succeeds. After the data has been transmitted successfully, the current GTID Set or BinLog Position information is updated and stored in the Zookeeper cluster.
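The ordering described above (block until the transfer succeeds, and only then persist the new position) can be sketched as follows. A plain dict stands in for the Zookeeper cluster and `transmit` for the cross-center transfer; both, along with the event layout, are assumptions made for illustration.

```python
import time


def replicate(events, transmit, checkpoint_store, retry_delay=0.0):
    """Send parsed events in order; advance the stored position only after success."""
    for event in events:
        # Blocking step: keep retrying until the peer accepts the event.
        while not transmit(event):
            time.sleep(retry_delay)
        # Only now is it safe to record the GTID Set or BinLog position.
        checkpoint_store["position"] = event["position"]
```

Persisting the position after the transfer, rather than before, means that a crash between the two steps leads to a resend of the last event on restart instead of a lost event.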

The embodiment of the invention applies a data parsing method that is compatible with both BinLog position information and the GTID globally unique transaction identifier. It achieves master-slave database switchover that is imperceptible to synchronization, uninterrupted data synchronization, efficient real-time bidirectional cross-center synchronization of massive data, and efficient, accurate BinLog data parsing.

In the embodiment of the invention, monitoring and alarming are also performed while data is synchronously parsed and transmitted between the first data center and the second data center, including link running-state monitoring, operation log recording, difference data recording, real-time conflict alarms and difference data alarms. Taking link running-state monitoring as an example: the running-state data of the data synchronization process is collected during parsing and transmission; the synchronization throughput, synchronization progress, synchronization delay and last synchronization time are written into a database (DB); metadata such as the BinLog position or GTID Set and the synchronization link state is written into the Zookeeper cluster; when a user selects a synchronization link and initiates a page request, the request parameters are assembled from the selected link and the monitoring and alarm service is called; the running-state data of the synchronization link is obtained from the database, and the packaged position information and link state are obtained according to the synchronization link id; and the result of the running-state request is returned.

In the embodiment of the invention, loop-back control is also applied to the parsed data transmitted by the opposite-end data center. MySQL introduced the global transaction ID (GTID) to strengthen the database's primary-standby consistency, fault recovery and fault tolerance, replacing the traditional way of locating the replication position by BinLog file offset. With GTIDs, when a master-slave switchover occurs, the other slave nodes of MySQL can automatically find the correct replication position on the new master node. Loop-back control of data in the cross-center bidirectional synchronization scenario can therefore be built on the basic principle of the GTID.

According to the MySQL documentation, a GTID is composed of a source ID (source_id) and a transaction ID (transaction_id): GTID = source_id:transaction_id. The source_id identifies the MySQL instance that initiated the transaction, and its value is the universally unique identifier (server_uuid) of that instance. The server_uuid is generated automatically by MySQL at first start and persisted in the auto.cnf file. The transaction_id is an auto-incrementing counter that starts at 1 and denotes the nth transaction executed on this master. MySQL guarantees a 1:1 mapping between transactions and GTIDs.
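As a small illustration of the `source_id:transaction_id` layout, a single GTID can be split as shown below. This sketch handles only one GTID, not a GTID Set with ranges, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    source_id: str        # server_uuid of the instance that committed the transaction
    transaction_id: int   # sequence number on that instance, starting at 1


def parse_gtid(text):
    """Split 'source_id:transaction_id' into its two parts."""
    source_id, _, transaction_id = text.strip().rpartition(":")
    if not source_id or not transaction_id.isdecimal() or int(transaction_id) < 1:
        raise ValueError(f"not a single GTID: {text!r}")
    return Gtid(source_id, int(transaction_id))
```

For example, `parse_gtid("3E11FA47-71CA-11E1-9E33-C80AA9429562:23")` yields a source_id of `3E11FA47-71CA-11E1-9E33-C80AA9429562` and a transaction_id of 23.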

In the embodiment of the invention, the parsed data transmitted by the second data center is received and stored, and data that originated from the first data center is filtered out according to the loop-back mark in the data. The bidirectional loop-back control strategy based on the GTID globally unique transaction identifier effectively realizes loop-back control of both DDL statements and DML statements. Specifically, during data parsing the source_id of each parsed GTID is written into the Zookeeper cluster; the source_ids subscribed from Zookeeper that share the GTID's origin are cached locally; in the data import stage, transactions with a same-origin source_id are filtered out, and only BinLog data from a different origin is imported into the database.
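The import-stage filter can be sketched in a few lines. The locally cached set of same-origin source_ids is modeled as a Python set, and each incoming event is assumed to carry its GTID under a `gtid` key; both are illustrative assumptions.

```python
def source_id_of(gtid):
    """Return the source_id part of 'source_id:transaction_id'."""
    return gtid.rpartition(":")[0]


def filter_loopback(events, local_source_ids):
    """Keep only events whose transaction did not originate in this data center."""
    return [
        event for event in events
        if source_id_of(event["gtid"]) not in local_source_ids
    ]
```

An event that was first committed locally, replicated to the peer and then sent back carries a local source_id and is dropped, which is what breaks the replication loop.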

Compared with existing loop-back control strategies and methods based on transaction identification, the bidirectional loop-back control strategy based on the GTID globally unique transaction identifier handles both Data Definition Language (DDL) statements and Data Manipulation Language (DML) statements: it effectively solves the problem of bidirectional loop-back control of DDL statements, and its loop-back control of DML statements is more stable and efficient.

For the data conflict scenario in which the same data is written in both centers during cross-center bidirectional synchronization, the embodiment of the invention adopts a conflict detection method based on an update-time field, and remedies conflicts after the fact to guarantee eventual consistency of the data. Whether a conflict has occurred is judged by whether cmos_modify_time has changed, relying on the property that MySQL automatically sets this field to the current timestamp whenever the row changes. For the same data, if the latest update time of the data in the first data center is greater than the latest update time of the data transmitted from the second data center, conflict alarm information is generated.

In the embodiment of the invention, as shown in Fig. 3, part (a) is a schematic diagram of updating a field in the MySQL database; part (b) is a schematic diagram of a data table field updated by the first data center; and part (c) is a schematic diagram of a data table field updated by the second data center. An automatically updated field is added to every bidirectionally synchronized data table in MySQL, and an index is added on that field. For example, table t2 is defined as:

create table t2 (id int primary key, value int);

Adding the field:

ALTER TABLE t2 ADD `cmos_modify_time` datetime(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3) ON UPDATE CURRENT_TIMESTAMP(3);

Creating the index:

create index idx_cmos_modify_time on t2 (cmos_modify_time);

This field is transparent to the application, and the original application does not need any modification.

When a record in the table changes (see part (a) of Fig. 3), cmos_modify_time is automatically set to the time at which the change was recorded.

insert into t2 (id, value) values (1, 1000);

The source end parses the time field out of the BinLog and uses it to construct the where condition. The processing flow is illustrated with the first data center A and the second data center B updating the same record at the same time, '2018-03-28 16:14:22.520'. For example, referring to fig. b, the data table field is updated at center A:

update t2 set value = 10001 where id = 1;

The statement transmitted in the A center -> B center direction is:

update t2 set value = 10001, cmos_modify_time = '2018-03-28 16:14:22.520' where id = 1 and cmos_modify_time <= '2018-03-28 16:14:22.520';

The statement is executed at the target end. A return value of 1 indicates that the statement was applied correctly. A return value of 0 indicates that the update time of the record at center B is greater than that of center A (2018-03-28 16:14:22.520), and corresponding alarm information is generated.

Referring to fig. c, the data table field is updated at center B:

update t2 set value = 10002 where id = 1;

The statement transmitted in the B center -> A center direction is:

update t2 set value = 10002, cmos_modify_time = '2018-03-28 16:14:22.520' where id = 1 and cmos_modify_time < '2018-03-28 16:14:22.520';

The statement is executed at the target end. A return value of 1 indicates that the statement was applied correctly. A return value of 0 indicates that the update time of the record at center A is equal to or greater than the update time transmitted from center B (2018-03-28 16:14:22.520), and corresponding alarm information is generated.

The data conflict detection based on the time update field can realize millisecond-level data conflict detection, so that the problem of data inconsistency caused by double writing of multi-center data is effectively solved, and the consistency of the two-way synchronous data of the cross-center data is ensured.
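The conditional update described above can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module in place of MySQL, stores the timestamp as text, and the helper names make_center and apply_remote_update are assumptions rather than part of the described system. The affected-row count plays the role of the "return value" of the transmitted statement.

```python
import sqlite3

T = "2018-03-28 16:14:22.520"  # both centers changed the row at this instant


def make_center(value, modify_time):
    """Create an in-memory stand-in for one data center's copy of t2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t2 (id INTEGER PRIMARY KEY, value INTEGER, "
        "cmos_modify_time TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO t2 VALUES (1, ?, ?)", (value, modify_time))
    return conn


def apply_remote_update(conn, row_id, value, modify_time, inclusive):
    """Apply a change replicated from the peer; return True if it was applied.

    The WHERE clause matches only when the local row is not newer than the
    incoming change, so an affected-row count of 0 signals a conflict.
    """
    op = "<=" if inclusive else "<"
    cur = conn.execute(
        f"UPDATE t2 SET value = ?, cmos_modify_time = ? "
        f"WHERE id = ? AND cmos_modify_time {op} ?",
        (value, modify_time, row_id, modify_time),
    )
    conn.commit()
    return cur.rowcount == 1


def demo():
    center_a = make_center(10001, T)  # local write at center A
    center_b = make_center(10002, T)  # local write at center B
    applied_at_b = apply_remote_update(center_b, 1, 10001, T, inclusive=True)
    applied_at_a = apply_remote_update(center_a, 1, 10002, T, inclusive=False)
    values = [
        c.execute("SELECT value FROM t2 WHERE id = 1").fetchone()[0]
        for c in (center_a, center_b)
    ]
    return applied_at_b, applied_at_a, values
```

In this sketch the A -> B statement is applied at center B, while the B -> A statement affects no rows at center A and would raise the alarm. Both copies end with the value written at center A, which appears to be the purpose of using `<=` in one direction and `<` in the other.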

An actual service scenario requires that data from the same data source be stored and converted for different data terminals. To meet this requirement, message queue (MQ) cluster middleware is applied so that BinLog parsing publishes data messages and data consumption subscribes to them, which separates the original data parsing from data consumption. A data consumption service interface is opened externally, which provides a data consistency guarantee and realizes a platform-level service for heterogeneous data conversion. In the embodiment of the invention, the message queue topic is configured in response to the request of the terminal user; the parsed data is stored into different message queues according to the message queue topic; and data subscription consumption is completed in response to the data consumption conversion requests of different terminal users.

Specifically, the terminal user configures the message middleware Topic at the database schema level on the Web management platform, and the metadata information is written into the Zookeeper cluster; the parsed data is written into different message queues of the MQ cluster according to the configured Topic; in response to applications for the data consumption conversion service submitted by different data terminal users on the configuration management platform, each application is reviewed according to the operation authority of the user, and after the review is passed, a security authentication token (Token) and the Topic information of the MQ queue are issued to the terminal user; the different data terminal users integrate the open data consumption service SDK (drp-Service) and configure the security authentication Token and the Topic information of the MQ queue to complete the specified data subscription consumption; and after a data terminal user commits a batch of data, the metadata information of the subscribed Topic is written into the Zookeeper cluster, wherein the metadata information comprises the consumed GTID Set or BinLog site information and the offset of the message queue.
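The publish and subscribe flow above can be sketched with an in-memory stand-in. The class TopicBroker, its method names, and the dictionary that stands in for the Zookeeper metadata are hypothetical and chosen only for illustration; the actual MQ cluster, drp-Service SDK and Zookeeper interfaces are not specified in this description.

```python
from collections import defaultdict


class TopicBroker:
    """Toy model: events are parsed once and consumed per Topic by many users."""

    def __init__(self):
        self.queues = defaultdict(list)  # Topic -> ordered parsed events
        self.tokens = {}                 # Token -> Topic it is authorized for
        self.metadata = {}               # (Token, Topic) -> committed progress

    def publish(self, topic, event):
        """Parsing side: write one parsed BinLog event to its Topic."""
        self.queues[topic].append(event)

    def grant(self, token, topic):
        """Platform side: issue a Token after the application is approved."""
        self.tokens[token] = topic

    def consume(self, token, topic, batch_size=100):
        """Consumer side: read the next batch and commit offset and GTID."""
        if self.tokens.get(token) != topic:
            raise PermissionError("token is not authorized for this topic")
        start = self.metadata.get((token, topic), {}).get("offset", 0)
        batch = self.queues[topic][start:start + batch_size]
        if batch:
            # Stand-in for writing consumption metadata to Zookeeper.
            self.metadata[(token, topic)] = {
                "offset": start + len(batch),
                "gtid": batch[-1]["gtid"],
            }
        return batch
```

Because each consumer keeps its own committed offset, two terminal users subscribed to the same Topic both receive every event although the BinLog was parsed only once.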

According to the embodiment of the invention, introducing the message middleware separates data parsing from data consumption, so that BinLog data is parsed once and consumed multiple times, which effectively solves the resource waste and low efficiency caused by the repeated parsing of BinLog data in a conventional data synchronization system. In addition, the open data service interface, which is externally adapted to various data terminals, realizes the platform-level service capability of heterogeneous data conversion and improves data parsing efficiency.

The GTID-based long-distance data synchronization method provided by the embodiment of the invention provides a comprehensive high-availability construction scheme, meets the high-availability and extensible architecture requirements of a platform, and has been put into large-scale application practice in an actual production environment.

Fig. 4 shows a schematic structural diagram of a GTID-based long-distance data synchronization apparatus according to an embodiment of the present invention. As shown in fig. 4, the GTID-based long-distance data synchronization apparatus includes: an audit triggering unit 401, a first searching unit 402, a second searching unit 403, a data checking unit 404, a synchronous analyzing unit 405, a loop control unit 406 and an analysis consumption separating unit 407. Wherein:

The auditing triggering unit 401 is configured to subscribe a data node storing metadata information to be audited from a Zookeeper cluster of a current first data center to trigger auditing; the first search unit 402 is configured to calculate checksums for the data of the first data center database and the second data center database according to data blocks, and compare and search the data blocks with data differences; the second searching unit 403 is configured to continuously adjust the size of the data block with the data difference and repeatedly calculate a comparison checksum until difference data with inconsistent checksums is found; the data checking unit 404 is configured to further compare the checksum with the difference data inconsistent with the checksum, and obtain an audit result.

In an alternative way, the first search unit 402 is configured to: performing CRC32 check on the data of the second data center database according to the data blocks, calculating a first checksum and taking out the boundaries of the data blocks; performing CRC32 check on the data of the first data center database according to the data block boundary, and calculating a second checksum; comparing the first checksum with the second checksum, and if the first checksum is consistent with the second checksum, indicating that the data of the data block is synchronous in the first data center database and the second data center database; and if the first checksum is inconsistent with the second checksum, indicating that the data block has data difference in the first data center database and the second data center database.
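The block-wise comparison performed by the first search unit can be sketched as follows. The tables are modeled as plain dictionaries from primary key to row value, and the function names are assumptions for illustration; a real implementation would compute the checksums inside each database.

```python
import zlib


def chunk_checksum(rows):
    """CRC32 over a list of (primary key, row) pairs, in order."""
    crc = 0
    for row in rows:
        crc = zlib.crc32(repr(row).encode("utf-8"), crc)
    return crc


def rows_in_range(table, lo, hi):
    """Rows whose primary key lies within the inclusive boundary [lo, hi]."""
    return [(k, table[k]) for k in sorted(table) if lo <= k <= hi]


def find_differing_chunks(second_center, first_center, chunk_size):
    """Return the boundaries of data blocks whose checksums disagree."""
    keys = sorted(second_center)
    differing = []
    for i in range(0, len(keys), chunk_size):
        # Data block boundaries are taken from the second data center.
        lo = keys[i]
        hi = keys[min(i + chunk_size, len(keys)) - 1]
        first_checksum = chunk_checksum(rows_in_range(second_center, lo, hi))
        # The same boundaries are then applied to the first data center.
        second_checksum = chunk_checksum(rows_in_range(first_center, lo, hi))
        if first_checksum != second_checksum:
            differing.append((lo, hi))
    return differing
```

For example, with four rows split into blocks of two, a change to the row with key 3 in one center flags only the block bounded by keys 3 and 4.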

In an alternative way, the second search unit 403 is configured to: reduce the size of the data block; calculate checksums for the reduced data blocks respectively, compare them, and continue searching for the data blocks with data differences; and repeatedly reduce the size of the data block with the data difference and calculate and compare the checksums until the data block with the data difference only comprises the difference data with inconsistent checksums.
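One way to picture the repeated reduction of the block size is a recursive halving of a differing block, sketched below. Halving is an assumption made for illustration, since the description only says that the block size is reduced, and the dictionary-based tables and function names are likewise hypothetical.

```python
import zlib


def checksum(table, keys):
    """CRC32 over the rows of `table` selected by `keys` (missing rows as None)."""
    crc = 0
    for k in keys:
        crc = zlib.crc32(repr((k, table.get(k))).encode("utf-8"), crc)
    return crc


def narrow_differences(second_center, first_center, keys):
    """Shrink a differing block until only the differing rows remain."""
    if not keys or checksum(second_center, keys) == checksum(first_center, keys):
        return []              # this block is consistent, stop searching it
    if len(keys) == 1:
        return list(keys)      # block reduced to a single differing row
    mid = len(keys) // 2
    return (narrow_differences(second_center, first_center, keys[:mid])
            + narrow_differences(second_center, first_center, keys[mid:]))
```

Only blocks whose checksums disagree are subdivided further, so consistent ranges are never examined row by row.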

In an alternative way, the data verification unit 404 is configured to: recalculate and compare the first checksum and the second checksum after waiting for a preset time; if they are consistent, the difference data has been synchronized between the first data center database and the second data center database; if they are not consistent, compare the currently calculated first checksum of the difference data with the previously calculated second checksum; if these are consistent, the difference data is subject to a synchronization delay between the first data center database and the second data center database; and if they are inconsistent, the difference data is being continuously modified in the first data center database and the second data center database.
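The three possible audit outcomes can be expressed as a small decision function. The names AuditResult and recheck, and the use of callables to supply the freshly computed checksums, are assumptions made for this sketch.

```python
import time
from enum import Enum


class AuditResult(Enum):
    SYNCHRONIZED = "synchronized"
    SYNC_DELAY = "synchronization delay"
    CONTINUOUSLY_MODIFIED = "continuously modified"


def recheck(first_checksum_fn, second_checksum_fn,
            previous_second_checksum, wait_seconds=0.0):
    """Re-audit one piece of difference data after a preset wait.

    first_checksum_fn / second_checksum_fn recompute the first checksum
    (second data center) and the second checksum (first data center).
    """
    time.sleep(wait_seconds)
    current_first = first_checksum_fn()
    current_second = second_checksum_fn()
    if current_first == current_second:
        return AuditResult.SYNCHRONIZED
    if current_first == previous_second_checksum:
        # The peer has caught up with the earlier state: replication lag.
        return AuditResult.SYNC_DELAY
    return AuditResult.CONTINUOUSLY_MODIFIED
```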

In an alternative way, the synchronization parsing unit 405 is configured to: acquiring data needing to be analyzed currently according to BinLog locus information or a global transaction identifier stored in the first data center; analyzing the data, and supplementing field types, field names, primary key information, loop marks and latest update time; transmitting the parsed data to the second data center; and updating and storing the current BinLog site information or the global transaction identifier GTID.
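The parsing step can be sketched as a loop that skips transactions already recorded in the checkpoint, enriches each remaining event, forwards it, and then records its GTID. The event layout, the schema dictionary, and the function parse_and_forward are hypothetical; a Python set stands in for the stored GTID set.

```python
def parse_and_forward(raw_events, schema, origin, executed_gtids, send):
    """Enrich not-yet-parsed BinLog events and forward them to the peer center."""
    for raw in raw_events:
        if raw["gtid"] in executed_gtids:
            continue  # already parsed and transmitted earlier
        table = schema[raw["table"]]
        event = {
            "gtid": raw["gtid"],
            "table": raw["table"],
            "field_names": list(table["fields"]),
            "field_types": list(table["fields"].values()),
            "primary_key": table["primary_key"],
            "values": raw["values"],
            "loop_mark": origin,  # data center where the change originated
            "modify_time": raw["values"]["cmos_modify_time"],
        }
        send(event)
        executed_gtids.add(raw["gtid"])  # update the stored checkpoint
```

Recording the GTID only after the event has been sent means that a restart may resend an event but should not skip one.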

In an alternative way, the loop control unit 406 is configured to: receive and store the parsed data transmitted by the second data center; and filter out the data originating from the first data center according to the loop mark in the data. The conflict detection unit 408 is configured to: for the same data, generate conflict alarm information if the latest update time of the data of the first data center is greater than the latest update time of the data transmitted from the second data center.
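The loop filtering can be sketched as follows, assuming each parsed event carries a loop_mark naming the data center where the change originated. The function receive_from_peer and the list used as local storage are illustrative assumptions.

```python
def receive_from_peer(incoming_events, local_center, store):
    """Store events received from the peer and return those safe to apply.

    Events whose loop mark names the local center originated here and have
    merely travelled around the replication loop, so they are filtered out.
    """
    store.extend(incoming_events)  # receive and store the parsed data
    return [e for e in incoming_events if e["loop_mark"] != local_center]
```

Without this filter, a change made at the first data center would be replayed there again after passing through the second data center, and bidirectional synchronization would loop indefinitely.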

In an alternative way, the parse-consumption separation unit 407 is configured to: configure the message queue topic in response to the request of the terminal user; store the parsed data into different message queues according to the message queue topic; and complete data subscription consumption in response to the data consumption conversion requests of different terminal users.

Embodiments of the present invention provide a non-volatile computer storage medium storing at least one executable instruction that may perform the GTID-based long-range data synchronization method in any of the above method embodiments.

The executable instructions may be particularly useful for causing a processor to:

subscribing data nodes which store metadata information to be checked from a current Zookeeper cluster of a first data center to trigger the check;

Respectively calculating checksums for the data of the first data center database and the second data center database according to the data blocks, and comparing and searching the data blocks with data differences;

continuously adjusting the size of the data block with the data difference and repeatedly calculating a comparison checksum until difference data with inconsistent checksums are found;

and further comparing the checksum with the difference data with inconsistent checksum to obtain an auditing result.

In one alternative, the executable instructions cause the processor to:

performing CRC32 check on the data of the second data center database according to the data blocks, calculating a first checksum and taking out the boundaries of the data blocks;

performing CRC32 check on the data of the first data center database according to the data block boundary, and calculating a second checksum;

comparing the first checksum with the second checksum, and if the first checksum is consistent with the second checksum, indicating that the data of the data block is synchronous in the first data center database and the second data center database; and if the first checksum is inconsistent with the second checksum, indicating that the data block has data difference in the first data center database and the second data center database.

In one alternative, the executable instructions cause the processor to:

reducing the size of the data block;

respectively calculating checksums for the reduced data blocks, comparing them, and continuing to search for the data blocks with data differences;

and repeatedly reducing the size of the data block with the data difference and calculating and comparing the checksum until the data block with the data difference only comprises the difference data with inconsistent checksum.

In one alternative, the executable instructions cause the processor to:

recalculating and comparing the first checksum and the second checksum after waiting for a preset time; if they are consistent, the difference data has been synchronized between the first data center database and the second data center database;

if they are not consistent, comparing the currently calculated first checksum of the difference data with the previously calculated second checksum; if these are consistent, the difference data is subject to a synchronization delay between the first data center database and the second data center database;

and if they are inconsistent, the difference data is being continuously modified in the first data center database and the second data center database.

In one alternative, the executable instructions cause the processor to:

acquiring data needing to be analyzed currently according to BinLog locus information or a global transaction identifier stored in the first data center;

analyzing the data, and supplementing field types, field names, primary key information, loop marks and latest update time;

transmitting the parsed data to the second data center;

and updating and storing the current BinLog site information or the global transaction identifier GTID.

In one alternative, the executable instructions cause the processor to:

receiving and storing the parsed data transmitted by the second data center;

filtering the data from the first data center according to a loop mark in the data;

for the same data, if the latest update time of the data of the first data center is greater than the latest update time of the data transmitted from the second data center, generating conflict alarm information.

In one alternative, the executable instructions cause the processor to:

configuring the message queue topic in response to the request of the terminal user;

storing the parsed data into different message queues according to the message queue topic;

and completing data subscription consumption in response to the data consumption conversion requests of different end users.

An embodiment of the present invention provides a computer program product comprising a computer program stored on a computer storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the GTID-based long-range data synchronization method in any of the method embodiments described above.

FIG. 5 illustrates a schematic structural diagram of a computing device according to an embodiment of the present invention; the specific embodiments of the present invention do not limit the specific implementation of the device.

As shown in fig. 5, the computing device may include: a processor 502, a communication interface (Communications Interface) 504, a memory 506, and a communication bus 508.

Wherein: the processor 502, the communication interface 504, and the memory 506 communicate with each other via the communication bus 508. The communication interface 504 is used for communicating with network elements of other devices, such as clients or other servers. The processor 502 is configured to execute the program 510, and may specifically perform the relevant steps in the above-described embodiments of the GTID-based long-distance data synchronization method.

In particular, program 510 may include program code including computer-operating instructions.

The processor 502 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The device includes one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.

The memory 506 is used for storing the program 510. The memory 506 may comprise high-speed RAM memory and may also include non-volatile memory, such as at least one disk memory.

The program 510 may be specifically operable to cause the processor 502 to perform the operations of the GTID-based long-distance data synchronization method described in the above method embodiments.

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims

1. A GTID-based long-range data synchronization method, the method comprising:

analyzing the data of the databases corresponding to the first data center or the second data center according to the BinLog locus information or the global transaction identifier GTID, and transmitting the data to the database of the opposite end so as to realize bidirectional data synchronization between the first data center and the second data center; the data synchronization process comprises the following steps: acquiring data to be analyzed currently according to BinLog site information or a global transaction identifier stored in the first data center;

transmitting the parsed data to the second data center to realize data synchronization between the first data center and the second data center;

updating and storing the current BinLog site information or the global transaction identifier GTID;

receiving and storing the parsed data transmitted by the second data center;

2. The method of claim 1, wherein the computing a checksum on the data of the first data center database and the second data center database, respectively, on a data block basis, and performing a comparison lookup for the data blocks having data differences, comprises:

3. The method of claim 1, wherein the continuously adjusting the size of the data block having the data difference and repeatedly calculating a comparison checksum until difference data having a checksum inconsistency is found, comprises:

Reducing the size of the data block;

4. The method of claim 2, wherein the further comparing the checksum against the difference data that is inconsistent with the checksum to obtain an audit result comprises:

5. The method according to claim 1, wherein the method further comprises:

6. The method according to claim 1, wherein the method further comprises:

7. A GTID-based long-range data synchronization apparatus, the apparatus comprising:

the auditing triggering unit is used for subscribing the data node which stores the metadata information to be audited from the current Zookeeper cluster of the first data center to trigger auditing;

the first searching unit is used for respectively calculating checksums according to data blocks for the data of the first data center database and the second data center database and comparing and searching the data blocks with data differences;

The second searching unit is used for continuously adjusting the size of the data block with the data difference and repeatedly calculating the comparison checksum until the difference data with inconsistent checksums are searched;

the data checking unit is used for further comparing the checksum with the difference data with inconsistent checksum to obtain an auditing result;

the synchronous analysis unit is used for analyzing the data of the databases corresponding to the first data center or the second data center respectively according to the BinLog locus information or the global transaction identifier GTID and transmitting the data to the database of the opposite end so as to realize bidirectional data synchronization between the first data center and the second data center; the data synchronization process comprises the following steps: acquiring data needing to be analyzed currently according to BinLog locus information or a global transaction identifier stored in the first data center; analyzing the data, and supplementing field types, field names, primary key information, loop marks and latest update time; transmitting the parsed data to the second data center; updating and storing the current BinLog site information or the global transaction identifier GTID;

the loop control unit is used for receiving and storing the analyzed data transmitted by the second data center; and filtering the data from the first data center according to the loop marks in the data.

8. A computing device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;

the memory is configured to store at least one executable instruction that causes the processor to perform the steps of the GTID-based long-range data synchronization method according to any one of claims 1-6.

9. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform the steps of the GTID-based long range data synchronization method of any one of claims 1-6.