CN117370462A - Data synchronization method - Google Patents

Publication number
CN117370462A
Authority
CN
China
Prior art keywords
synchronized
transaction
operation record
synchronization
records
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311386289.0A
Other languages
Chinese (zh)
Inventor
郑奕彬
彭荣杰
王顺云
侯银花
王文虎
汤金林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202311386289.0A
Publication of CN117370462A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/466: Transaction processing

Abstract

The application provides a data synchronization method comprising the following steps: acquiring a plurality of transactions to be synchronized based on a database log, where the database log records the insert, delete and update operations performed on a source database, and each transaction to be synchronized contains at least one operation record for a source database table; for any operation record of any transaction to be synchronized, determining a synchronization identifier corresponding to the operation record according to the database name, the table name, the unique key and the value of the unique key in the operation record; determining a synchronization strategy according to the log time order of the plurality of transactions to be synchronized and the operation records of each of them, where, under the synchronization strategy, transactions to be synchronized whose operation records have different synchronization identifiers are executed concurrently by different threads, and transactions to be synchronized that have at least one synchronization identifier in common are executed serially by the same thread; and synchronizing data between the source database and a target storage unit according to the synchronization strategy. With this scheme, real-time data synchronization can be achieved.

Description

Data synchronization method
Technical Field
The present disclosure relates to the field of computers, and in particular, to a data synchronization method.
Background
At present, MySQL is the mainstream relational database in the industry, and with table design and index optimization it can meet everyday data storage and query needs. However, MySQL is inefficient for complex queries and aggregate queries. To address this, a common industry practice is to synchronize data from MySQL to another data source, such as Elasticsearch, and to serve those queries from that data source.
Synchronizing a MySQL data source to Elasticsearch typically relies on the DataX component. However, the DataX component cannot synchronize in real time. Because the tables of a database are not being changed at every moment, frequent queries issued by the DataX component may find nothing new, which adds to the query load on the source database. Moreover, the DataX component synchronizes data at table granularity, so a hot table or hot record makes the synchronization of that table a bottleneck.
In the prior art, data can also be synchronized by reading the database log, that is, the binlog. The binlog allows changes in the source database to be detected without spending resources on fruitless queries or polling. However, binlog events are mainly distributed by table, so performance is poor, particularly with hot tables, hot records and large transactions; real-time synchronization cannot be achieved, and performance can be even worse than that of the DataX component.
How to synchronize data in real time is therefore a problem to be solved.
Disclosure of Invention
The present application provides a data synchronization method that can solve the problem of data synchronization delay and achieve real-time data synchronization.
In a first aspect, embodiments of the present application provide a data synchronization method, which may be performed by a data synchronization device, where the data synchronization device may be a terminal device or a module for a terminal device, or a server or a module for a server. This application does not limit the entity that executes the method. The method comprises the following steps: acquiring a plurality of transactions to be synchronized based on a database log, where the database log records the insert, delete and update operations performed on a source database, and each transaction to be synchronized contains at least one operation record for a source database table; for any operation record of any transaction to be synchronized, determining a synchronization identifier corresponding to the operation record according to the database name, the table name, the unique key and the value of the unique key in the operation record; determining a synchronization strategy according to the log time order of the plurality of transactions to be synchronized and the operation records of each of them, where, under the synchronization strategy, transactions to be synchronized whose operation records have different synchronization identifiers are executed concurrently by different threads, and transactions to be synchronized that have at least one synchronization identifier in common are executed serially by the same thread; and synchronizing data between the source database and a target storage unit according to the synchronization strategy.
With this scheme, on the one hand, the plurality of transactions to be synchronized are obtained from the database log, so changes in the source database can be detected without spending resources on fruitless queries, which saves resources. On the other hand, the synchronization strategy is determined at the level of operation records, whose granularity is finer than that of a database table. In the prior art, data synchronization is based on the table dimension: synchronization of different database tables can be executed concurrently by different threads, but synchronization of the same database table can only be executed serially by one thread. In particular, when a hot table appears, that is, a frequently updated database table, most transactions to be synchronized wait for serial execution on the same thread while other threads may sit idle, which causes synchronization delay and wasted resources. The synchronization strategy of this application effectively addresses the delay problem, achieves real-time data synchronization, and improves resource utilization.
In one possible implementation, the synchronization period and the working thread group of each transaction to be synchronized are determined one by one in the log time order of the plurality of transactions to be synchronized, and threads are allocated to each transaction to be synchronized according to its working thread group. If a second transaction to be synchronized and a first transaction to be synchronized have at least one synchronization identifier in common, the two belong to the same working thread group, and the synchronization period of the second transaction to be synchronized comes after that of the first; here the log time of the first transaction to be synchronized is earlier than that of the second. If the second transaction to be synchronized has no synchronization identifier in common with any first transaction to be synchronized, its working thread group is different from the working thread group of every first transaction to be synchronized.
With this scheme, on the one hand, the synchronization period and the working thread group of each transaction to be synchronized are determined one by one in log time order, which ensures the correctness of data synchronization, because the operation records in the database log are strictly ordered in time. For example, if a first operation record updates A to 1 and a second operation record updates A to 2, the value of record A finally stored in the source database is 2. If synchronization did not follow the log time order, and the second operation record were synchronized first (updating A to 2) and the first operation record afterwards (updating A to 1), the value of record A stored in the target storage unit would be 1, and the source database and the target storage unit would be inconsistent. On the other hand, when the second transaction to be synchronized and the first transaction to be synchronized have no synchronization identifier in common, they are executed in parallel. That is, even if the two transactions operate on the same table, they can be executed in parallel as long as their synchronization identifiers differ, which effectively addresses the problem of data synchronization delay, achieves real-time data synchronization, and improves resource utilization.
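The grouping rule above can be illustrated with a minimal Python sketch. The function name `plan`, the tuple layout and the integer numbering of groups and periods are assumptions made for illustration and do not come from the application. The application also does not say what happens when one transaction shares identifiers with transactions in several groups; the sketch assumes those groups are merged.

```python
from typing import Dict, List, Set, Tuple

def plan(transactions: List[Tuple[str, Set[str]]]) -> Dict[str, Tuple[int, int]]:
    """Map each transaction name to (working thread group, synchronization period).

    `transactions` must already be sorted by log time. Each entry is
    (name, set of synchronization identifiers). All names are hypothetical.
    """
    owner: Dict[str, int] = {}        # synchronization identifier -> group
    last_period: Dict[int, int] = {}  # group -> latest period handed out
    schedule: Dict[str, Tuple[int, int]] = {}
    next_group = 0
    for name, ids in transactions:
        hit = sorted({owner[i] for i in ids if i in owner})
        if not hit:
            # No identifier in common with any earlier transaction: new group.
            group, period = next_group, 0
            next_group += 1
        else:
            # Shares an identifier: same group, strictly later period.
            group = hit[0]
            period = max(last_period[g] for g in hit) + 1
            for other in hit[1:]:
                # Assumption: fold every other conflicting group into `group`.
                for key, g in list(owner.items()):
                    if g == other:
                        owner[key] = group
                for tx, (g, p) in list(schedule.items()):
                    if g == other:
                        schedule[tx] = (group, p)
                del last_period[other]
        for i in ids:
            owner[i] = group
        last_period[group] = period
        schedule[name] = (group, period)
    return schedule

example = plan([("T1", {"a"}), ("T2", {"b"}), ("T3", {"a", "c"}), ("T4", {"d"})])
```

In this example T3 shares identifier `a` with T1, so it lands in T1's group with a later period, while T2 and T4 get groups of their own and can run in parallel with T1.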
In one possible implementation, a hash operation is performed on the database name, the table name, the primary key and the value of the primary key in the operation record to obtain a first hash value; a hash operation is performed on the database name, the table name, the unique index and the pre-operation value of the unique index in the operation record to obtain a second hash value, where the pre-operation value is the value of the unique index before the operation record is executed; a hash operation is performed on the database name, the table name, the unique index and the post-operation value of the unique index in the operation record to obtain a third hash value, where the post-operation value is the value of the unique index after the operation record is executed; and the first hash value, the second hash value and the third hash value are taken as the plurality of synchronization identifiers corresponding to the operation record.
With this scheme, the synchronization identifiers are determined from the database name, the table name, the primary key and its value, and the unique index and its pre-operation and post-operation values, so that whether different transactions to be synchronized are related can be effectively determined. Transactions to be synchronized that are not related to one another are executed concurrently in different threads, which effectively addresses the problem of data synchronization delay, achieves real-time data synchronization, and improves resource utilization.
In one possible implementation, any working thread group includes at least one thread. Allocating threads to the transaction to be synchronized according to its working thread group comprises: determining whether the transaction to be synchronized is a large transaction, where a large transaction is one in which the number of operation records and/or the number of operation record types is greater than a first threshold; if the transaction to be synchronized is not a large transaction, allocating one thread to it from its working thread group; if the transaction to be synchronized is a large transaction, determining conflict sets according to the identifier of each operation record in the transaction, and allocating different threads from the working thread group to different conflict sets. The operation records within one conflict set have the same identifier, and operation records in different conflict sets have different identifiers, where the identifier comprises one or more of a primary key, an auto-increment primary key, a composite primary key or a row number.
With this scheme, the operation records of a large transaction can be further divided into smaller groups according to their identifiers: operation records in the same conflict set have the same identifier, and operation records in different conflict sets have different identifiers. The operation records in one conflict set are executed serially in one thread, and the operation records in different conflict sets are executed in parallel in different threads, which effectively addresses the problem of data synchronization delay, achieves real-time data synchronization, and improves resource utilization.
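A minimal Python sketch of splitting a large transaction into conflict sets follows. The record layout (`pk`, `value`), the threshold value, the dictionary used as a stand-in target, and the function names are all assumptions for illustration; the application does not prescribe them.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

FIRST_THRESHOLD = 3  # assumed value; the application only names "a first threshold"

def split_into_conflict_sets(records):
    """Group operation records by identifier (here: primary key), keeping log order."""
    sets = defaultdict(list)
    for record in records:
        sets[record["pk"]].append(record)
    return list(sets.values())

def replay(conflict_set, target):
    """Apply one conflict set serially; `target` is a dict standing in for the target store."""
    for record in conflict_set:
        target[record["pk"]] = record["value"]

def sync_transaction(records, target, max_threads=4):
    """Return the number of units replayed (1 for a small transaction)."""
    if len(records) <= FIRST_THRESHOLD:
        replay(records, target)  # not a large transaction: one thread, in order
        return 1
    conflict_sets = split_into_conflict_sets(records)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Each conflict set runs serially inside one task; different sets run in parallel.
        list(pool.map(lambda s: replay(s, target), conflict_sets))
    return len(conflict_sets)

records = [
    {"pk": 1, "value": "a"}, {"pk": 2, "value": "b"}, {"pk": 1, "value": "c"},
    {"pk": 3, "value": "d"}, {"pk": 2, "value": "e"},
]
target = {}
units = sync_transaction(records, target)
```

Because records with the same primary key stay in one conflict set and keep their log order, the final value for each key is the one written last in the log, regardless of how the sets are interleaved across threads.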
In one possible implementation, the number of threads in any working thread group may be dynamically adjusted.
With this scheme, the number of threads can be adjusted dynamically according to the number of conflict sets. When there are many conflict sets, the number of threads is increased to speed up processing of the transactions to be synchronized. Once those transactions have been synchronized, the added threads become idle, and the dynamic thread pool can reclaim threads that have been idle for a certain time, which saves thread pool resources.
In one possible implementation, whether each conflict set meets an aggregation condition is determined, and an aggregation operation is performed on the operation records in each conflict set that meets the aggregation condition to obtain aggregated records; the operation records and/or the aggregated records of each transaction to be synchronized are then synchronized according to the synchronization strategy, thereby synchronizing data between the source database and the target storage unit.
With this scheme, the operation records in a conflict set can be aggregated according to the aggregation condition, which reduces the number of operation records processed in a thread, so that data synchronization completes quickly; this effectively addresses the problem of data synchronization delay and achieves real-time data synchronization.
In one possible implementation, the size of a time window is determined; according to the size of the time window, the operation records in a conflict set are segmented to determine pieces of aggregation data, where the operation records in one piece of aggregation data belong to the same time window. Determining whether each conflict set meets the aggregation condition then comprises determining whether each piece of aggregation data meets the aggregation condition.
With this scheme, the number of operation records in a conflict set may be large, and aggregating over a large conflict set as a whole may take a long time and be inefficient. The operation records in the conflict set are therefore segmented by time window, and the aggregation operation is performed on the segmented data, which improves the efficiency of the aggregation operation, effectively addresses the problem of data synchronization delay, and achieves real-time data synchronization.
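The time-window segmentation can be sketched as follows in Python. The timestamp field `ts` (in milliseconds), the fixed-width window and the function name are assumptions for illustration; the application does not specify how the window boundaries are chosen.

```python
def split_by_window(records, window_ms):
    """Split one conflict set into segments whose records fall in the same time window.

    Each record is assumed to carry a log timestamp `ts` in milliseconds.
    Segments are returned in time order, and log order is kept inside each segment.
    """
    buckets = {}
    for record in records:
        buckets.setdefault(record["ts"] // window_ms, []).append(record)
    return [buckets[key] for key in sorted(buckets)]

conflict_set = [{"ts": t} for t in (0, 40, 120, 130, 260)]
segments = split_by_window(conflict_set, window_ms=100)
```

Each segment can then be checked against the aggregation condition on its own, which keeps each aggregation pass small.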
In one possible implementation, it is determined that a conflict set includes an insert operation record and an update operation record, and that executing the insert operation record and the update operation record gives the same result as executing the update operation record alone; performing the aggregation operation on the operation records in the conflict set that meets the aggregation condition then comprises deleting the insert operation record.
With this scheme, because executing the insert operation record and the update operation record is equivalent to executing the update operation record alone, the insert operation record is deleted and only the update operation record is kept. This reduces the number of operation records processed in a thread, so that data synchronization completes quickly, which effectively addresses the problem of data synchronization delay and achieves real-time data synchronization.
In one possible implementation, it is determined that a conflict set includes an insert operation record and a delete operation record, and that executing the insert operation record and the delete operation record gives the same result as executing the delete operation record alone; performing the aggregation operation on the operation records in the conflict set that meets the aggregation condition then comprises deleting the insert operation record.
With this scheme, because executing the insert operation record and the delete operation record is equivalent to executing the delete operation record alone, the insert operation record is deleted and only the delete operation record is kept. This reduces the number of operation records processed in a thread, so that data synchronization completes quickly, which effectively addresses the problem of data synchronization delay and achieves real-time data synchronization.
In one possible implementation, it is determined that a conflict set includes at least two update operation records whose primary keys are the same; performing the aggregation operation on the operation records in the conflict set that meets the aggregation condition then comprises merging the at least two update operation records into a first update operation record, where the execution result of the first update operation record is equivalent to the execution result of the at least two update operation records.
With this scheme, because the execution result of the first update operation record is equivalent to that of the at least two update operation records, merging them into the first update operation record reduces the number of update operation records processed in a thread, so that data synchronization completes quickly, which effectively addresses the problem of data synchronization delay and achieves real-time data synchronization.
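The three aggregation rules above can be sketched together in Python. This is only an illustration under stated assumptions: every record in the conflict set refers to the same primary key, an update record carries the column values it writes in a `row` dictionary, and the target applies updates as upserts, so that dropping a preceding insert does not change the result. The record layout and the function name are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]  # assumed shape: {"op": ..., "pk": ..., "row": {...}}

def aggregate(conflict_set: List[Record]) -> List[Record]:
    """Collapse a conflict set (same primary key, log order) into fewer records."""
    out: List[Record] = []
    for record in conflict_set:
        prev = out[-1] if out else None
        if prev is not None and prev["op"] == "insert" and record["op"] in ("update", "delete"):
            # insert + update -> update; insert + delete -> delete
            out[-1] = record
        elif prev is not None and prev["op"] == "update" and record["op"] == "update":
            # update + update -> one update; later column values win
            out[-1] = {"op": "update", "pk": record["pk"],
                       "row": {**prev["row"], **record["row"]}}
        else:
            out.append(record)
    return out

ops = [
    {"op": "insert", "pk": 1, "row": {"a": 1, "b": 1}},
    {"op": "update", "pk": 1, "row": {"a": 2, "b": 1}},
    {"op": "update", "pk": 1, "row": {"a": 3, "b": 1}},
]
aggregated = aggregate(ops)
```

Whether a given pair of records is truly equivalent to a single record depends on the target storage unit; the equivalence check is part of the aggregation condition in the text and is taken for granted here.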
In a second aspect, an embodiment of the present application provides a data synchronization device, including an acquisition unit, a determination unit and a synchronization unit. The acquisition unit is configured to acquire a plurality of transactions to be synchronized based on a database log, where the database log records the insert, delete and update operations performed on a source database, and each transaction to be synchronized contains at least one operation record for a source database table. The determination unit is configured to determine, for any operation record of any transaction to be synchronized, a synchronization identifier corresponding to the operation record according to the database name, the table name, the unique key and the value of the unique key in the operation record, and to determine a synchronization strategy according to the log time order of the plurality of transactions to be synchronized and the operation records of each of them, where, under the synchronization strategy, transactions to be synchronized whose operation records have different synchronization identifiers are executed concurrently by different threads, and transactions to be synchronized that have at least one synchronization identifier in common are executed serially by the same thread. The synchronization unit is configured to synchronize data between the source database and a target storage unit according to the synchronization strategy.
In a possible implementation method, a determining unit is specifically configured to sequentially determine, according to a log time sequence of a plurality of transactions to be synchronized, a synchronization period and an affiliated working thread group of each transaction to be synchronized, and allocate threads for the transactions to be synchronized according to the affiliated working thread group; if the second transaction to be synchronized and the first transaction to be synchronized have at least one same synchronization identifier, the second transaction to be synchronized and the working thread group to which the first transaction to be synchronized belong are the same, and the synchronization period of the second transaction to be synchronized is located after the synchronization period of the first transaction to be synchronized; the log time of the first transaction to be synchronized is earlier than the log time of the second transaction to be synchronized; if the second transaction to be synchronized and any first transaction to be synchronized do not have the same synchronization identification, the working thread group to which the second transaction to be synchronized belongs is different from the working thread group to which any first transaction to be synchronized belongs.
In a possible implementation method, a determining unit is specifically configured to perform hash operation on a database name, a table name, a primary key and a numerical value of the primary key in an operation record to obtain a first hash value; performing hash operation on the database name, the table name, the unique index and the pre-operation numerical value of the unique index in the operation record to obtain a second hash value; the pre-operation value is a value corresponding to the unique index before the operation record is executed; performing hash operation on the database name, the table name, the unique index and the operated numerical value of the unique index in the operation record to obtain a third hash value; the value after operation is the value corresponding to the unique index after operation record is executed; and taking the first hash value, the second hash value and the third hash value as a plurality of synchronous identifiers corresponding to the operation record.
In one possible implementation, any working thread group includes at least one thread. The determination unit is further configured to determine whether the transaction to be synchronized is a large transaction, where a large transaction is one in which the number of operation records and/or the number of operation record types is greater than a first threshold; if the transaction to be synchronized is not a large transaction, to allocate one thread to it from its working thread group; and if the transaction to be synchronized is a large transaction, to determine conflict sets according to the identifier of each operation record in the transaction and to allocate different threads from the working thread group to different conflict sets. The operation records within one conflict set have the same identifier, and operation records in different conflict sets have different identifiers, where the identifier comprises one or more of a primary key, an auto-increment primary key, a composite primary key or a row number.
In one possible implementation, the number of threads in any one work thread group may be dynamically adjusted.
In a possible implementation method, the device further includes an aggregation unit, configured to determine whether each conflict set meets an aggregation condition, and aggregate operation is performed on operation records in the conflict set that meets the aggregation condition to obtain an aggregated record; and the synchronization unit is used for carrying out data synchronization on the operation records and/or the aggregation records in each transaction to be synchronized according to the synchronization strategy so as to realize the data synchronization between the source database and the target storage unit.
In a possible implementation method, the determining unit is further configured to determine a size of the time window; according to the size of the time window, performing data segmentation processing on at least one operation record in the conflict set, and determining aggregated data; wherein the aggregate data belongs to the same time window; and the aggregation unit is used for judging whether each piece of aggregation data meets the aggregation condition.
In a possible implementation method, the aggregation unit is specifically configured to determine that each conflict set includes an insert operation record and an update operation record, and an execution result of the insert operation record and the update operation record is equivalent to an execution result of the update operation record; the insert operation record is deleted.
In a possible implementation method, the aggregation unit is specifically configured to determine that each conflict set includes an insert operation record and a delete operation record, and an execution result of the insert operation record and the delete operation record is equivalent to an execution result of the delete operation record; the insert operation record is deleted.
In a possible implementation method, the aggregation unit is specifically configured to determine that each conflict set includes at least two update operation records, and primary keys corresponding to the at least two update operation records are the same; merging at least two update operation records into a first update operation record; wherein the execution result of the first update operation record is equivalent to the execution result of at least two update operation records.
In a third aspect, embodiments of the present application further provide a computing device, including:
a memory for storing program instructions;
and a processor for calling program instructions stored in the memory and executing any method implementing the first aspect according to the obtained program instructions.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium having stored therein computer-readable instructions which, when read and executed by a computer, implement any of the methods of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program executable by a computer device to cause the computer device to perform any of the methods of the first aspect described above when the program is run on the computer device.
Drawings
FIG. 1 is a schematic flow chart of a data synchronization method according to an embodiment of the present application;
FIG. 2 is a flow chart of a large transaction processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a standard data format according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an aggregation operation method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an inheritance relationship of an aggregator according to an embodiment of the present disclosure;
FIG. 6 is a flow chart of a data synchronization method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data synchronization device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a data synchronization device according to an embodiment of the present application.
Detailed Description
FIG. 1 is a schematic flow chart of a data synchronization method provided in an embodiment of the present application. The method may be performed by a data synchronization device, and the data synchronization device may be a terminal device or a module for a terminal device, or a server or a module for a server. This application does not limit the entity that executes the method.
The method comprises the following steps:
Step 101, acquiring a plurality of transactions to be synchronized based on a database log.
The database log records the insert, delete and update operations performed on the source database; any transaction to be synchronized contains at least one operation record for a source database table.
In one possible implementation, insert, delete and update operations on the source database generate operation records, and these operation records are written to a file called the database log. The source database may be MySQL, SQL Server, Oracle, etc., which is not limited in this application.
In one possible implementation, a plurality of transactions to be synchronized are recorded in the database log, and each transaction to be synchronized contains at least one operation record for a source database table. An operation record comprises an operation type, an operation object and an operation value, where the operation type is an insert operation, a delete operation or an update operation. For example, in the operation record "update t1 set a=6 where id=1", update denotes an update operation, t1 is the table name, and id is the primary key. The record means that the value of attribute a in the row of t1 whose id equals 1 is changed to 6: the operation type is an update operation, the operation object is attribute a of the row with id equal to 1 in t1, and the operation value is 6. An operation record may also include other content, such as its generation time, the database name, the table name and primary key information; the content of the operation record is not limited in this application.
Step 102: for any operation record of any transaction to be synchronized, determine the synchronization identifier corresponding to the operation record according to the database name, the table name, the unique key and the value of the unique key in the operation record.
In one possible implementation, the unique key includes a primary key and a unique index, and may include other contents, which is not limited in this application.
In one possible implementation, if the operation record includes only a primary key, a hash operation is performed on the database name, the table name, the primary key and the value of the primary key in the operation record to obtain a first hash value, and the first hash value is used as the synchronization identifier corresponding to the operation record.
In another possible implementation, if the operation record includes both a primary key and a unique index, a hash operation is performed on the database name, the table name, the primary key and the value of the primary key in the operation record to obtain a first hash value; a hash operation is performed on the database name, the table name, the unique index and the pre-operation value of the unique index to obtain a second hash value, where the pre-operation value is the value of the unique index before the operation record is executed; and a hash operation is performed on the database name, the table name, the unique index and the post-operation value of the unique index to obtain a third hash value, where the post-operation value is the value of the unique index after the operation record is executed. The first, second and third hash values are used as the synchronization identifiers corresponding to the operation record. In this scheme, the synchronization identifiers are determined from the database name, the table name, the primary key and its value, and the unique index together with its pre-operation and post-operation values. This makes it possible to determine whether different transactions to be synchronized are related, so that unrelated transactions are executed concurrently in different threads, which effectively reduces data synchronization delay, enables real-time data synchronization, and improves resource utilization.
In another possible implementation, if the operation record includes only a unique index, a hash operation is performed on the database name, the table name, the unique index and the pre-operation value of the unique index to obtain a second hash value, where the pre-operation value is the value of the unique index before the operation record is executed; and a hash operation is performed on the database name, the table name, the unique index and the post-operation value of the unique index to obtain a third hash value, where the post-operation value is the value of the unique index after the operation record is executed. The second and third hash values are used as the synchronization identifiers corresponding to the operation record.
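The three cases above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the application does not name a hash algorithm or a concatenation format, so SHA-256, the field separator, and the function names sync_key and sync_identifiers are all hypothetical, and the sample values are illustrative.

```python
import hashlib

def sync_key(database, table, key_name, key_value):
    """Hash of database name + table name + unique key name + key value.
    SHA-256 and the separator are assumptions, not specified by the source."""
    raw = "\x1f".join([database, table, key_name, str(key_value)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def sync_identifiers(database, table, primary_key=None, unique_index=None):
    """primary_key: (name, value) or None.
    unique_index: (name, value_before, value_after) or None."""
    ids = set()
    if primary_key is not None:
        name, value = primary_key
        ids.add(sync_key(database, table, name, value))    # first hash value
    if unique_index is not None:
        name, before, after = unique_index
        ids.add(sync_key(database, table, name, before))   # second hash value
        ids.add(sync_key(database, table, name, after))    # third hash value
    return ids

# Illustrative records: two updates that touch unique index "a", one that does not.
tx1 = sync_identifiers("db1", "t1", ("id", 1), ("a", 1, 6))
tx2 = sync_identifiers("db1", "t1", ("id", 2), ("a", 2, 1))
tx3 = sync_identifiers("db1", "t1", ("id", 3))
shares_identifier = bool(tx1 & tx2)   # both touch a = 1
```

Two records share a synchronization identifier exactly when their identifier sets intersect, which is the test the synchronization strategy relies on.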
Step 103: determine a synchronization strategy according to the log time order of the plurality of transactions to be synchronized and each operation record of the plurality of transactions to be synchronized.
The synchronization strategy is as follows: transactions to be synchronized whose operation records have no synchronization identifier in common are executed concurrently by different threads; transactions to be synchronized that have at least one identical synchronization identifier are executed serially by the same thread.
In one possible implementation, the synchronization period and the working thread group of each transaction to be synchronized are determined in turn according to the log time order of the plurality of transactions to be synchronized, and threads are allocated to the transactions to be synchronized according to the working thread group to which they belong. Here the log time of a first transaction to be synchronized is earlier than the log time of a second transaction to be synchronized. If the second transaction to be synchronized and the first transaction to be synchronized have at least one identical synchronization identifier, the second transaction belongs to the same working thread group as the first transaction, and the synchronization period of the second transaction is located after the synchronization period of the first transaction. If the second transaction to be synchronized has no synchronization identifier in common with any first transaction to be synchronized, the working thread group to which the second transaction belongs is different from the working thread group of any first transaction.
In this scheme, on the one hand, determining the synchronization period and working thread group of each transaction in log time order effectively ensures the correctness of data synchronization, because the operation records in the database log have a strict time order. For example, if the first operation record updates A to 1 and the second operation record updates A to 2, the value of record A finally saved in the source database is 2. If data synchronization did not follow the log time order, the second operation record might be synchronized first (A updated to 2) and the first operation record afterwards (A updated to 1); record A in the target storage unit would then be 1, and the source database and the target storage unit would be inconsistent. On the other hand, when the second transaction to be synchronized and the first transaction to be synchronized have no synchronization identifier in common, they are executed in parallel. That is, even if the two transactions operate on the same table, they can be executed in parallel as long as their synchronization identifiers differ, which effectively reduces data synchronization delay, enables real-time data synchronization, and improves resource utilization.
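The assignment rule can be sketched as follows. This is a simplified, hypothetical planner rather than the implementation of the application: the function name assign_groups and the placeholder identifier strings are assumptions, and the handling of a transaction that shares identifiers with several earlier groups (the groups are merged and ordered by log position) is an assumption, since the text does not describe that case.

```python
def assign_groups(transactions):
    """transactions: list of (name, set_of_sync_identifiers) in log-time order.
    Returns {name: (worker_group, synchronization_period)}."""
    owner = {}        # synchronization identifier -> worker group id
    members = {}      # worker group id -> log positions of its transactions
    next_group = 0
    for index, (_name, ids) in enumerate(transactions):
        hit = sorted({owner[i] for i in ids if i in owner})
        if hit:
            group = hit[0]
            # Assumption: groups bridged by this transaction are merged.
            for other in hit[1:]:
                members[group].extend(members.pop(other))
                for key in [k for k, g in owner.items() if g == other]:
                    owner[key] = group
        else:
            # No identifier in common with any earlier transaction: new group.
            group = next_group
            next_group += 1
            members[group] = []
        members[group].append(index)
        for i in ids:
            owner[i] = group
    plan = {}
    for group, indexes in members.items():
        # Within a group, periods follow log order, so execution is serial.
        for period, index in enumerate(sorted(indexes)):
            plan[transactions[index][0]] = (group, period)
    return plan

# Placeholder identifiers: TX1 and TX2 share "key2", TX3 shares nothing.
log = [
    ("TX1", {"key1", "key2", "key3"}),
    ("TX2", {"key4", "key5", "key2"}),
    ("TX3", {"key7"}),
]
plan = assign_groups(log)
```

In this sketch TX1 and TX2 land in the same group with consecutive periods, while TX3 gets its own group and can run concurrently with them.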
Step 104: implement data synchronization between the source database and the target storage unit according to the synchronization strategy.
In one possible implementation, the target storage unit is a unit capable of storing a large amount of data and of representing association relationships between data. For example, the target storage unit may be a database or another data source such as Elasticsearch; the type of the target storage unit is not limited in this application.
In one possible implementation, the target storage unit reads the transactions to be synchronized from the working thread group and performs the corresponding operations on its own data according to each operation record of the transactions to be synchronized, so as to complete data synchronization.
In this scheme, on the one hand, the transactions to be synchronized are obtained from the database log, so changes to the source database are perceived actively and no resources are wasted on invalid queries. On the other hand, the synchronization strategy is determined according to operation records, whose granularity is finer than that of a database table. In the prior art, data synchronization is based on the table dimension: data synchronization for different database tables can be executed simultaneously by different threads, but data synchronization for the same database table can only be executed serially by the same thread. In particular, when a hot spot table appears, that is, a database table that is updated very frequently, most transactions to be synchronized wait for serial execution in the same thread while other threads may be idle, which causes data synchronization delay and wastes resources. The synchronization strategy of this application effectively reduces data synchronization delay, enables real-time data synchronization, and improves resource utilization.
The following describes a specific implementation procedure of steps 101 to 104 with a specific example.
In one possible implementation, it is assumed that there is a database named db1 containing a table t1 whose structure is shown below, and tables t2 and t3 whose structures are similar to that of table t1. Here id is the primary key and a is the unique index.
In one possible implementation, as shown below, there are 10 transactions to be processed, namely session1 to session10. With transaction abbreviated as TX, the transactions to be processed 1 to 10 are denoted TX1 to TX10. Each transaction contains at least one operation record for a source database table; with operation record abbreviated as R, the operation records are denoted R1 to R28.
In one possible implementation, a table that is updated very frequently is called a hot spot table. For example, among TX1 to TX10 above, many transactions to be processed operate on table t1, so t1 is called a hot spot table.
In one possible implementation, a large transaction is a transaction that contains multiple types of SQL operations, or whose SQL operations involve many records, or whose SQL operations involve lock waits, etc., so that the overall execution time of the transaction is relatively long. For example, among TX1 to TX10 above, TX7 includes multiple types of SQL operations, such as update, insert and delete, and the number of operation records in TX7 is greater than the first threshold, so TX7 is called a large transaction.
In one possible implementation, a record that is updated very frequently is called a hot spot record. For example, among TX1 to TX10 above, TX7 operates multiple times on the same records, namely those with id=7, 30, 40 and 50, so the records with id=7, 30, 40 and 50 are all hot spot records.
In one possible implementation, when the transactions to be processed are distributed according to the table dimension, the execution order of the different transactions is shown in Table 1 below, where worker_1, worker_2, worker_3 and worker_4 are different working thread groups. As can be seen from Table 1, because table t1 is a hot spot table, the different transactions that update table t1 degenerate into serial execution in the same worker. In addition, transaction 9 (TX9) updates both table t1 and table t3 and therefore depends on transaction 7 (TX7) and transaction 8 (TX8), so TX9 can only be executed after TX7 and TX8 have finished; during this period no other transaction can be executed, so multi-threaded concurrent processing degenerates into single-threaded processing. TX7 and TX8 are independent of each other and may be executed in parallel. As can also be seen from Table 1, worker_3 and worker_4 remain idle throughout because of the above problems.
TABLE 1
In one possible implementation, taking TX1 to TX10 above as an example, steps 102 and 103 are carried out as follows. The operation records corresponding to TX1 and TX2 include the unique index a in addition to the primary key; it is assumed that the initial value of a is 1 in TX1 and 2 in TX2. For TX1, a hash operation is performed on the database name, the table name, the primary key and the value of the primary key in the operation record to obtain the first hash value, namely key1 = hash algorithm (db1 + t1 + "id" + 1); a hash operation is performed on the database name, the table name, the unique index and the pre-operation value of the unique index to obtain the second hash value, namely key2 = hash algorithm (db1 + t1 + "a" + 1); and a hash operation is performed on the database name, the table name, the unique index and the post-operation value of the unique index to obtain the third hash value, namely key3 = hash algorithm (db1 + t1 + "a" + 6). The first hash value key1, the second hash value key2 and the third hash value key3 are the 3 synchronization identifiers corresponding to operation record R1 and can be recorded in hash set A as {key1, key2, key3}. For TX2, the first hash value is key4 = hash algorithm (db1 + t1 + "id" + 2); the second hash value is key5 = hash algorithm (db1 + t1 + "a" + 2); and the third hash value is key6 = hash algorithm (db1 + t1 + "a" + 1). The first hash value key4, the second hash value key5 and the third hash value key6 are the 3 synchronization identifiers corresponding to operation record R2 and can be recorded in hash set B as {key4, key5, key6}.
In one possible implementation, it is determined whether the hash sets corresponding to TX1 and TX2 contain the same synchronization identifier; if so, TX1 and TX2 are allocated to the same working thread group. From hash set A and hash set B it can be seen that key2 and key6 are the same, that is, TX1 and TX2 have the same synchronization identifier: both transactions touch the record in library db1, table t1 whose unique index a has the value 1. The modification made by TX2 therefore interacts with TX1, and TX1 and TX2 need to be allocated to the same working thread group for serial execution. Suppose TX1 and TX2 were allocated to different working thread groups for concurrent execution; TX2 might then be executed first. Because the operation record of TX2 modifies the value of unique index a to 1, and TX1 has not yet been executed, the row with id 1 still has a = 1, so a unique key conflict occurs. TX1 and TX2 therefore cannot be executed in parallel, only serially. This is why the present application uses "library name + table name + unique key name + unique key value" as the basis for determining whether transactions to be processed can be executed concurrently, where the unique key includes the primary key and the unique index. For operation records that have both a primary key and a unique index, using the primary key alone as the basis for this decision would lead to unique key conflicts.
In one possible implementation, for the operation record corresponding to a transaction to be processed, a synchronization value is calculated in addition to the synchronization identifier; the synchronization value represents the number of occurrences of the unique key value corresponding to the operation record. Illustratively, for TX1, key1 = hash algorithm (db1 + t1 + "id" + 1), value = 2, where value denotes the synchronization value. The value for key1 is 2 because the primary key id of the operation record is unchanged before and after the modification of unique index a, so it appears twice. Similarly, key2 = hash algorithm (db1 + t1 + "a" + 1), value = 1; and key3 = hash algorithm (db1 + t1 + "a" + 6), value = 1.
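The counting of synchronization values can be sketched as below, assuming that an operation record carries a row image before and after the operation and that the value counts how often each unique key value appears across the two images. The function name sync_values and the use of plain tuples instead of hash values are simplifications made for readability, not details from the application.

```python
from collections import Counter

def sync_values(database, table, before, after, unique_keys):
    """Count occurrences of each unique key value in the before and after images.
    Tuples stand in for the hashed synchronization identifiers (assumption)."""
    counts = Counter()
    for image in (before, after):
        for key in unique_keys:
            if key in image:
                counts[(database, table, key, image[key])] += 1
    return counts

# TX1: "update t1 set a=6 where id=1", with a = 1 before the update.
values = sync_values(
    "db1", "t1",
    before={"id": 1, "a": 1},
    after={"id": 1, "a": 6},
    unique_keys=["id", "a"],
)
```

Here the primary key entry is counted twice because id is unchanged by the update, while the old and new values of a are each counted once.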
In one possible implementation, for TX3, the synchronization identifier is determined according to the operation record corresponding to TX3. This operation record includes only a primary key and no unique index, so a hash operation is performed only on the database name, the table name, the primary key and the value of the primary key to obtain the first hash value, namely key7 = hash algorithm (db1 + t1 + "id" + 3), which can be recorded in hash set C as {key7}. It is then determined whether hash set C intersects hash set A or hash set B. As can be seen from the above example, hash set C intersects neither hash set A nor hash set B, so TX3 can be executed concurrently with TX1 and TX2. For TX4 to TX10, hash sets are obtained in the same way; in log time order, it is determined whether the hash set of each transaction to be processed intersects the hash sets of the transactions with earlier log times, and a corresponding thread group is allocated to each transaction to be processed. When the transactions to be processed are distributed according to the synchronization strategy of this application, the execution order of the different transactions is shown in Table 2 below. Pending transactions 3 to 6 are written to the database log in the order pending transaction 3 > pending transaction 4 > pending transaction 5 > pending transaction 6, and their operations on table records do not involve the unique index a, so only the primary key id needs to be considered.
Pending transactions 3 to 5 all update table t1 but update different operation records (records R3, R4 and R5), and pending transaction 6 updates table t2 (operation record R6), so these four transactions can be executed in parallel under the new transaction distribution algorithm. Pending transactions 7 to 10 are written to the database log in the order pending transaction 7 > pending transaction 8 > pending transaction 9 > pending transaction 10, and their operations on table records do not involve the unique index a, so only the primary key id needs to be considered. When transactions are distributed in the table dimension, pending transaction 9 depends on pending transactions 7 and 8: pending transactions 7 and 8 can be executed in parallel, pending transaction 9 must wait for them to finish, and pending transaction 10 is executed only after that. In this application, because pending transactions 7, 8 and 9 do not update the same operation record of the same table, there is no conflict and they can be executed concurrently; pending transaction 10 does not update the same operation record as the first three pending transactions either, so it can also be executed concurrently with them.
TABLE 2
In one possible implementation, after step 103 the method further includes determining whether the transaction to be synchronized is a large transaction, where a large transaction is one in which the number of operation records and/or the number of operation record types is greater than a first threshold. If the transaction to be synchronized is not a large transaction, a thread is allocated to it from the working thread group to which it belongs. If the transaction to be synchronized is a large transaction, conflict sets are determined according to the identifier of each operation record in the transaction, and different threads are allocated to different conflict sets from the working thread group. The operation records in the same conflict set have the same identifier, and the identifiers of operation records in different conflict sets are different; the identifier includes one or more of a primary key, an auto-increment primary key, a joint primary key or a row number. Of course, whether the transaction to be synchronized is a large transaction may also be determined after step 101; after the large transaction has been processed, the synchronization identifier corresponding to each operation record is determined according to the database name, the table name, the unique key and the value of the unique key in the operation record, and the transactions to be processed are distributed according to the synchronization strategy. The processing of large transactions may also be performed independently of steps 101 to 104. The present application is not limited in this regard.
In this scheme, the operation records in a large transaction are further divided into smaller groups according to their identifiers: the operation records in the same conflict set have the same identifier and are executed serially in the same thread, while operation records in different conflict sets have different identifiers and are executed in parallel in different threads. This effectively reduces data synchronization delay, enables real-time data synchronization, and improves resource utilization.
In one possible implementation, the processing of a large transaction, as shown in Fig. 2, includes the following steps:
Step 201: determine whether the transaction to be synchronized is a large transaction.
In one possible implementation, the large transaction includes one or more of the following features: the number of operation records in the transaction to be processed is larger than a first threshold, and the first threshold is not limited in the application; the transaction to be processed comprises operation records of different operation types; the corresponding operation record in the pending transaction involves lock waiting.
Step 202, if the transaction to be synchronized is not a large transaction, a thread is allocated to the transaction to be synchronized from the working thread group to which the transaction to be synchronized belongs.
In a possible implementation method, if the transaction to be synchronized is not a large transaction, a thread is allocated to the transaction to be synchronized from a working thread group to which the transaction to be synchronized belongs, and the transaction to be synchronized is queued in the thread to be executed.
Step 203, if the transaction to be synchronized is a large transaction, determining a conflict set according to the identification of each operation record in the transaction to be synchronized; different threads are allocated for different conflict sets from the worker thread group.
The operation records in the same conflict set have the same identification, and the identifications of the operation records in different conflict sets are different, wherein the identifications comprise one or more of a main key, an auto-increment main key, a combined main key or a line number.
In one possible implementation, the different operation records in a large transaction would ordinarily be processed serially in time order; however, operation records that do not conflict with one another can be executed in parallel. The operation records in the large transaction can therefore be regrouped according to the identifier of each operation record: operation records with the same identifier are placed in the same conflict set and executed serially, and operation records with different identifiers are placed in different conflict sets and executed in parallel.
In one possible implementation, the primary key is used as an identifier to determine the conflict set.
In one possible implementation, if the database table defines an auto-increment primary key, the auto-increment primary key is used as the identifier to determine the conflict set.
In one possible implementation, if the database table has a joint primary key, the fields of the joint primary key are taken and their field names are sorted in dictionary order; the field values corresponding to the sorted field names are then spliced in that order and used as the identifier to determine the conflict set. Illustratively, there is a joint primary key index (name, age, depth) with the following sets of values: ("Zhang San", "20", 2), ("Li Si", "30", 1), ("Wang Wu", "18", 10). Sorting the field names of the joint primary key in dictionary order gives age, depth, name. Splicing the field values in this order yields "20_2_Zhang San", "30_1_Li Si" and "18_10_Wang Wu", and the spliced field values are used as identifiers to determine the conflict sets.
In one possible implementation, if neither an auto-increment primary key ID nor a joint primary key is defined, the database generates a row number field, namely _rowid, as the primary key by default, and _rowid may be used as the identifier to determine the conflict set.
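The identifier rules above and the grouping into conflict sets can be sketched together as follows. The function names conflict_identifier and build_conflict_sets, the dictionary layout of a record, and the sample records R1 to R3 are hypothetical; only the dictionary-order sorting of field names, the "_" splicing and the _rowid fallback come from the text.

```python
def conflict_identifier(row, primary_key_fields):
    """Identifier of one operation record.
    Single or auto-increment key: its value. Joint key: field names sorted in
    dictionary order, values spliced with "_". No primary key: _rowid."""
    if not primary_key_fields:
        return str(row["_rowid"])
    return "_".join(str(row[name]) for name in sorted(primary_key_fields))

def build_conflict_sets(records, primary_key_fields):
    """records: operation records in log order, each {"name": ..., "row": {...}}.
    Records with the same identifier go into the same conflict set, in order."""
    conflict_sets = {}
    for record in records:
        identifier = conflict_identifier(record["row"], primary_key_fields)
        conflict_sets.setdefault(identifier, []).append(record["name"])
    return conflict_sets

# Joint primary key (name, age, depth): sorted field names are age, depth, name.
joint = conflict_identifier(
    {"name": "Zhang San", "age": "20", "depth": 2}, ["name", "age", "depth"]
)

# Hypothetical operation records of one large transaction, keyed by id.
sets = build_conflict_sets(
    [
        {"name": "R1", "row": {"id": 7}},
        {"name": "R2", "row": {"id": 30}},
        {"name": "R3", "row": {"id": 7}},
    ],
    ["id"],
)
```

Each resulting list preserves log order, so a thread that processes one conflict set serially keeps the original order of updates to that row.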
In one possible implementation, the number of threads in any working thread group may be adjusted dynamically. The specific process is as follows: after the conflict sets are determined according to the identifier of each operation record in the transaction to be synchronized, it is determined whether the number of conflict sets is greater than the number of currently idle threads. If the number of conflict sets is less than or equal to the number of currently idle threads, the conflict sets are allocated to the corresponding idle threads to wait for execution. If the number of conflict sets is greater than the number of currently idle threads, the number of threads in the working thread group is dynamically increased up to a second threshold, which is not limited in this application. In this scheme, the number of threads is adjusted dynamically according to the number of conflict sets: when there are many conflict sets, the number of threads is increased dynamically to speed up the processing of the transaction to be synchronized; after the transaction has been processed, the newly added threads become idle, and the dynamic thread pool reclaims threads that have been idle for a certain time, which saves thread pool resources.
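The sizing decision can be written as a small function. This is a sketch under one reading of the text, namely that the second threshold acts as an upper bound on the size of the working thread group; the function name threads_to_add and its parameters are hypothetical.

```python
def threads_to_add(conflict_sets, idle_threads, current_threads, max_threads):
    """Number of threads to add to a working thread group.
    max_threads plays the role of the second threshold (assumed to be a cap)."""
    if conflict_sets <= idle_threads:
        return 0                      # idle threads are sufficient
    shortfall = conflict_sets - idle_threads
    headroom = max(0, max_threads - current_threads)
    return min(shortfall, headroom)   # never grow past the cap

enough_idle = threads_to_add(conflict_sets=4, idle_threads=4, current_threads=8, max_threads=16)
grow = threads_to_add(conflict_sets=4, idle_threads=1, current_threads=4, max_threads=16)
capped = threads_to_add(conflict_sets=10, idle_threads=0, current_threads=4, max_threads=8)
```

Reclaiming idle threads after a timeout is left to the thread pool itself and is not modelled here.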
In one possible implementation, because the operation records in the same conflict set are also executed in time order, the sliding time window of stream computing can be used to divide and process the operation records in the same conflict set along the time window dimension. Operation records whose processing is completed within a time window are sent to the next processing node for further processing, such as aggregation calculation or persistence to the database. The operation records in the same conflict set can thus be sent to the next processing node without waiting for all operation records in the conflict set to be processed, which further improves concurrency. For timeliness of data processing, the time window is set to one second, but it may take other values, which is not limited in this application.
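The window-based splitting can be sketched as follows. As a simplification, this sketch uses fixed, non-overlapping one-second windows rather than a true sliding window; the function name split_by_window, the (timestamp, name) record layout and the sample timestamps are assumptions.

```python
from itertools import groupby

def split_by_window(records, window_ms=1000):
    """records: list of (timestamp_ms, name) for one conflict set, in log order.
    Returns [(window_start_ms, [names])]; each batch can be handed to the next
    processing node as soon as its window closes."""
    batches = []
    for window, items in groupby(records, key=lambda r: r[0] // window_ms):
        batches.append((window * window_ms, [name for _, name in items]))
    return batches

# Hypothetical timestamps for four operation records of one conflict set.
batches = split_by_window([(0, "R7"), (400, "R8"), (1200, "R9"), (2500, "R14")])
```

Because the input is already in log order, the records inside each batch and the batches themselves stay in time order.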
In one possible implementation method, if the transaction to be synchronized is a large transaction, data cleaning can be performed on the operation record corresponding to the large transaction, and the operation record is arranged into a standard data format. The standard format may be as shown in table 3 below, but may be other standard formats, which are not limited in this application. And data cleaning is carried out on the operation records corresponding to the large transaction, and the operation records are arranged into a standard data format, so that a conflict set can be rapidly and accurately determined.
TABLE 3
Standard data field name | Meaning of the field
optType | 1: insert; 2: update; 3: delete
database | Database name
table | Table name
primaryKey | Primary key (auto-increment ID, joint primary key, or _rowid)
beforeModifyColumnMap | Key-value pairs of the record before modification (including the primary key, etc.)
afterModifyColumnMap | Key-value pairs of the record after modification (including the primary key, etc.)
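A record in this standard format could be modelled as shown below. The field names follow Table 3; the class name StandardRecord, the function to_standard and the layout of the raw input dictionary are hypothetical, since the application does not describe the raw log format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

OPT_TYPES = {"insert": 1, "update": 2, "delete": 3}

@dataclass
class StandardRecord:
    optType: int                      # 1: insert; 2: update; 3: delete
    database: str
    table: str
    primaryKey: str
    beforeModifyColumnMap: Dict[str, Any] = field(default_factory=dict)
    afterModifyColumnMap: Dict[str, Any] = field(default_factory=dict)

def to_standard(raw):
    """Clean one raw operation record (hypothetical layout) into the standard format."""
    before = dict(raw.get("before") or {})
    after = dict(raw.get("after") or {})
    image = after or before           # a delete only has a before image
    primary_key = "_".join(str(image[k]) for k in sorted(raw["pk_fields"]))
    return StandardRecord(OPT_TYPES[raw["op"]], raw["db"], raw["table"],
                          primary_key, before, after)

record = to_standard({
    "op": "update", "db": "db1", "table": "t1", "pk_fields": ["id"],
    "before": {"id": 7, "a": 1}, "after": {"id": 7, "a": 2},
})
```

Having the primary key precomputed in primaryKey is what lets the conflict set of a record be determined directly from the cleaned data.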
In one possible implementation, the result of performing data cleansing on the large transaction TX7 according to the contents of Table 3 is shown in Fig. 3.
In one possible implementation, for the large transaction TX7, 4 conflict sets are determined according to steps 201 to 203 above, namely conflict set A, conflict set B, conflict set C and conflict set D. Conflict set A includes operation records R7 > R8 > R9 > R14, conflict set B includes operation records R10 > R11 > R17 > R18 > R19, conflict set C includes operation records R12 > R13 > R20 > R21, and conflict set D includes operation records R15 > R16 > R22 > R23 > R24. According to the number of conflict sets, 4 threads are dynamically added to process the operation records in the conflict sets; the 4 threads are queue_1, queue_2, queue_3 and queue_4. When transaction distribution is performed according to conflict sets, the execution order of the different conflict sets is shown in Table 4 below.
TABLE 4
In one possible implementation, an operation record that is updated very frequently is called a hot spot record. For example, suppose a cinema offers 5000 movie tickets in a flash sale and each person may buy one ticket. After the sale starts, as users rush in to take part, the ticket inventory is updated frequently (the inventory is reduced by 1 after each successful purchase), so the inventory record is updated very frequently and becomes a hot spot record. In addition, while one transaction is executing, other transactions must wait, and lock waiting accounts for a large part of the transaction execution time, which is why updating a hot spot record takes a long time overall.
In one possible implementation method, in addition to the hot spot record problem, the network round-trip IO delay also needs to be considered. Assume that executing one insert operation takes 5ms and one network round trip takes 2ms. If the insert operations are sent one at a time, 1000 insert operations take 5000ms of execution time in total, plus 2000ms of network round-trip IO delay. If instead the insert operations are merged before sending, for example 100 insert operations per send, the 1000 insert operations need only 10 sends. Even assuming each merged send takes 5ms of network round-trip IO delay (this time is spent mainly on the TCP three-way handshake and four-way connection teardown; the packet transmission time is relatively small unless the packet size changes by an order of magnitude), the total network round-trip IO delay is 50ms rather than 2000ms, so the efficiency is improved.
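The arithmetic in this example can be checked with a short calculation. The function name `total_network_delay_ms` is hypothetical; the figures (1000 operations, 2ms per single send, 5ms per merged send of 100 operations) are the ones assumed in the text.

```python
def total_network_delay_ms(num_ops, batch_size, round_trip_ms):
    """Total round-trip delay when num_ops operations are sent batch_size at a time."""
    sends = -(-num_ops // batch_size)  # ceiling division
    return sends * round_trip_ms


one_by_one = total_network_delay_ms(1000, 1, 2)   # 1000 sends of 2ms each
batched = total_network_delay_ms(1000, 100, 5)    # 10 sends of 5ms each
```

Under these assumptions the network delay drops from 2000ms to 50ms, while the 5000ms of execution time is unchanged.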
In a possible implementation method, after the conflict sets are determined, whether each conflict set meets an aggregation condition is further required to be judged, and operation records in the conflict sets meeting the aggregation condition are aggregated to obtain aggregated records. Of course, before determining the conflict set, it may also be determined whether the entire transaction to be processed meets the aggregation condition, and the operation records in the transaction to be processed that meets the aggregation condition are aggregated. The execution order of the present application is not limited. Of course, determining whether the pending transaction satisfies the aggregation condition may be performed alone, without being performed in combination with steps 101 to 104, and steps 201 to 203. According to the scheme, the operation records in the conflict set can be aggregated according to the aggregation conditions, and the number of the operation records is reduced, so that the number of the operation records processed in the thread is reduced, the data synchronization is rapidly completed, further, the problem of data synchronization delay can be effectively solved, and the real-time data synchronization is realized.
In a possible implementation method, after determining conflict sets, determining whether each conflict set meets an aggregation condition, and performing an aggregation operation on operation records in the conflict set meeting the aggregation condition, a specific flow is shown in fig. 4, and the process includes the following steps:
step 401, judging whether each conflict set meets an aggregation condition, and performing an aggregation operation on operation records in the conflict set meeting the aggregation condition to obtain an aggregation record.
In one possible implementation method, since the operation records contained in each conflict set may be relatively large, if all the operation records contained in the conflict set are subjected to the judgment of the aggregation condition, the judgment may be very time-consuming, resulting in low efficiency. Therefore, each conflict set can be subjected to data segmentation according to the time window, and the operation records in the same time window are subjected to aggregation operation, which specifically comprises the following steps: determining a size of a time window; according to the size of the time window, performing data segmentation processing on at least one operation record in the conflict set, and determining aggregated data; wherein the aggregate data belongs to the same time window; judging whether each piece of aggregation data meets aggregation conditions or not; and performing aggregation operation on the operation records in the aggregation data meeting the aggregation conditions to obtain the aggregation records. In this scheme, the number of operation records in the conflict set may be relatively large, and if the aggregation operation is performed according to the large conflict set, it may be very time-consuming, resulting in inefficiency; therefore, according to the time window, the operation records in the conflict set are subjected to data segmentation processing, and the segmented data are subjected to aggregation operation, so that the efficiency of the aggregation operation can be improved, further, the problem of data synchronization delay can be effectively solved, and real-time data synchronization is realized.
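The segmentation step can be sketched as below. The timestamps, the 100ms window size and the function name `split_by_time_window` are illustrative assumptions; the source does not specify how the window size is chosen.

```python
def split_by_time_window(records, window_ms):
    """Split one conflict set into consecutive time windows, keeping log order."""
    windows = {}
    for rec in records:
        index = rec["ts_ms"] // window_ms  # which window the record falls in
        windows.setdefault(index, []).append(rec["name"])
    return [windows[i] for i in sorted(windows)]


# Hypothetical generation times for the records of conflict set B.
conflict_set_b = [
    {"name": "R10", "ts_ms": 0},
    {"name": "R11", "ts_ms": 40},
    {"name": "R17", "ts_ms": 120},
    {"name": "R18", "ts_ms": 130},
    {"name": "R19", "ts_ms": 250},
]

segments = split_by_time_window(conflict_set_b, window_ms=100)
```

Each returned segment is then checked against the aggregation conditions on its own, so the check never has to scan the whole conflict set at once.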
Step 402, according to the synchronization policy, data synchronization is performed on the operation record and/or the aggregation record in each transaction to be synchronized, so as to achieve data synchronization between the source database and the target storage unit.
In a possible implementation method, in step 401, determining whether each conflict set meets an aggregation condition includes: determining that each conflict set contains an insert operation record and an update operation record, and that the execution result of the insert operation record followed by the update operation record is equivalent to the execution result of the update operation record alone; performing the aggregation operation on the operation records in a conflict set meeting the aggregation condition includes: deleting the insert operation record. Illustratively, in the conflict set A, the two operation records R7 and R8 both operate on the piece of data whose primary key id=7; the insert operation is performed first and the update operation second. Apart from the primary key id, the insert operation sets the b field to 1, and the update operation sets the b field to 3. That is, after the insert operation is performed, the state of the data is { "id": "7", "b": "1" }, and after the update operation is performed, the state of the data is { "id": "7", "b": "3" }. It can be seen that after the update operation is performed, the state of the data is identical to the R8 record, that is, the execution effect of "insert+update" = "update", so the insert operation R7 can be omitted, which saves IO. According to this scheme, because the execution result of the insert operation record and the update operation record is equivalent to the execution result of the update operation record, the insert operation record is deleted and only the update operation record is retained, so that the number of operation records processed in a thread is reduced and the data synchronization is completed quickly, which effectively alleviates the data synchronization delay problem and achieves real-time data synchronization.
In a possible implementation method, in step 401, determining whether each conflict set meets an aggregation condition includes: determining that each conflict set contains an insert operation record and a delete operation record, and that the execution result of the insert operation record followed by the delete operation record is equivalent to the execution result of the delete operation record alone; performing the aggregation operation on the operation records in a conflict set meeting the aggregation condition includes: deleting the insert operation record. Illustratively, in the conflict set C, the two operation records R20 and R21 both operate on the piece of data whose primary key id=40; the insert operation is performed first and the delete operation second. Apart from the primary key id, the insert operation sets the b field to 1, and the delete operation deletes the record. That is, after the insert operation is performed, the state of the data is { "id": "40", "b": "1" }, and after the delete operation is performed, the state of the data is { } (empty, the record does not exist). It can be seen that after the delete operation is performed, the state of the data is identical to the R21 record, that is, the execution effect of "insert+delete" = "delete", so the insert operation R20 can be omitted, which saves IO. According to this scheme, because the execution result of the insert operation record and the delete operation record is equivalent to the execution result of the delete operation record, the insert operation record is deleted and only the delete operation record is retained, so that the number of operation records processed in a thread is reduced and the data synchronization is completed quickly, which effectively alleviates the data synchronization delay problem and achieves real-time data synchronization.
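The two insert-based aggregation conditions can be sketched together. The record layout (`op`, `key`, `after`) and the function `aggregate_insert_pair` are assumptions made for illustration; `after` plays the role of the after-modification key value pairs of a record. How the retained update is applied on the target when the row was never inserted there is outside this sketch.

```python
def aggregate_insert_pair(first, second):
    """Return the records to keep for two consecutive records on one primary key."""
    if first["op"] != "insert" or first["key"] != second["key"]:
        return [first, second]
    if second["op"] == "delete":
        # insert + delete leaves no row, the same as the delete alone.
        return [second]
    if second["op"] == "update":
        # State after running both records, compared with the update's own image.
        state = {**first["after"], **second["after"]}
        if state == second["after"]:
            return [second]
    return [first, second]


r7 = {"op": "insert", "key": "7", "after": {"id": "7", "b": "1"}}
r8 = {"op": "update", "key": "7", "after": {"id": "7", "b": "3"}}
r20 = {"op": "insert", "key": "40", "after": {"id": "40", "b": "1"}}
r21 = {"op": "delete", "key": "40", "after": {}}

kept_a = aggregate_insert_pair(r7, r8)     # only R8 remains
kept_c = aggregate_insert_pair(r20, r21)   # only R21 remains
```

If the insert sets a field that the update image does not carry, the two states differ and both records are kept.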
In a possible implementation method, in step 401, determining whether each conflict set meets an aggregation condition includes: determining that each conflict set comprises at least two update operation records, wherein the primary keys corresponding to the at least two update operation records are the same; performing aggregation operation on operation records in a conflict set meeting aggregation conditions, wherein the aggregation operation comprises the following steps: merging at least two update operation records into a first update operation record; wherein the execution result of the first update operation record is equivalent to the execution result of at least two update operation records. According to the scheme, the execution result of the first updating operation record is equivalent to the execution result of at least two updating operation records, and the at least two updating operation records are combined into the first updating operation record, so that the number of the updating operation records processed in the thread can be reduced, the data synchronization can be rapidly completed, and further, the problem of data synchronization delay can be effectively solved, and the real-time data synchronization can be realized.
Illustratively, in the conflict set B, the three operation records R17, R18 and R19 all operate on the data whose primary key id=30 and are all update operations, but the three update operations update different fields: R17 updates a to 2, R18 updates b to 3, and R19 updates c to 4. After R17, R18 and R19 are executed in sequence, the data state of the record whose primary key id=30 is { "id": "30", "a": "2", "b": "3", "c": "4" }. In other words, a single update operation that sets { "id": "30", "a": "2", "b": "3", "c": "4" } has the same final effect as R17+R18+R19. The operation R19 can therefore be modified to perform { "id": "30", "a": "2", "b": "3", "c": "4" } directly, and the two operations R17 and R18 can be omitted, which saves IO.
Illustratively, in the conflict set D, the three operation records R22, R23 and R24 all operate on the data whose primary key id=50 and are all update operations. The data state after R22 is { "id": "50", "b": "1" }, the data state after R23 is { "id": "50", "b": "2" }, and the data state after R24 is { "id": "50", "b": "3" }. It can be seen that the data state after executing R22, R23 and R24 in sequence is identical to the data state after executing R24 directly; because the three operations all update the same field b, the effect of executing R24 alone is equivalent to R22+R23+R24. In this way, the two operations R22 and R23 can be omitted, which saves IO.
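Both update examples reduce to the same merge: later assignments overwrite earlier ones field by field. The sketch below assumes a hypothetical record layout with a `set` mapping of the fields each update assigns; `merge_updates` is an illustrative name.

```python
def merge_updates(updates):
    """Merge consecutive updates on one primary key into a single update."""
    key = updates[0]["key"]
    if any(u["key"] != key for u in updates):
        raise ValueError("all updates must target the same primary key")
    merged = {}
    for u in updates:            # log order: later values overwrite earlier ones
        merged.update(u["set"])
    return {"op": "update", "key": key, "set": merged}


# Conflict set B: different fields are updated.
r17 = {"op": "update", "key": "30", "set": {"a": "2"}}
r18 = {"op": "update", "key": "30", "set": {"b": "3"}}
r19 = {"op": "update", "key": "30", "set": {"c": "4"}}

# Conflict set D: the same field is updated three times.
r22 = {"op": "update", "key": "50", "set": {"b": "1"}}
r23 = {"op": "update", "key": "50", "set": {"b": "2"}}
r24 = {"op": "update", "key": "50", "set": {"b": "3"}}

merged_b = merge_updates([r17, r18, r19])
merged_d = merge_updates([r22, r23, r24])
```

In both cases three records sent to the target become one.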
In one possible implementation method, other aggregation conditions may be included in addition to the three aggregation conditions listed above, and the present application does not limit the aggregation conditions.
In one possible implementation, after the aggregation operation is performed on the transaction TX7, the data operation in table 4 is optimized as shown in the following table 5, and it can be seen from table 5 that many operation records are saved, that is, many IO operations are saved.
TABLE 5
In one possible implementation method, in the above examples the operations that update the same record are all adjacent, but in actual operation records they may be non-adjacent. In that case, the operation records within a single time window may not meet an aggregation condition, while the operation records of several time windows combined may meet it. Therefore, when the total amount of data in a time window reaches a certain threshold, whether the aggregation condition is met is judged; if it is met, the aggregation calculation is performed, and the calculation result is output and stored in a database. When the data does not reach the threshold, the calculation can be delayed by one time unit, so that the data of two time windows are combined for the aggregation calculation and the aggregation is more effective.
In a possible implementation method, aggregation calculation is performed by defining different aggregators. The inheritance relation diagram of the aggregators is shown in fig. 5. AggInterface is the aggregator interface and includes the properties of an aggregator, a data set, an aggregation calculation method and the like. The aggregators under AggInterface are InsertUpdateAgg, InsertDeleteAgg and AbstractUpdateAgg, and under AbstractUpdateAgg are UpdateSetAgg and UpdateModifyAgg. InsertUpdateAgg represents the execution effect of "insert+update" = "update"; InsertDeleteAgg represents the execution effect of "insert+delete" = "delete"; UpdateSetAgg represents the execution effect of "update+update" = "update" and is used to merge update operations on different fields; UpdateModifyAgg represents the execution effect of "update+update" = "update" and is used to merge update operations on the same field.
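The inheritance relation can be sketched as a small class hierarchy. The class names follow the aggregator names in the paragraph above with their spelling normalized, which is an assumption; the `matches` method, the record layout and `pick_aggregator` are likewise hypothetical and only show how an aggregator might be selected for a group of records on one primary key.

```python
from abc import ABC, abstractmethod


class AggInterface(ABC):
    """Aggregator interface: decides whether it applies to a group of records."""

    @abstractmethod
    def matches(self, records):
        ...


class InsertUpdateAgg(AggInterface):
    def matches(self, records):
        return [r["op"] for r in records] == ["insert", "update"]


class InsertDeleteAgg(AggInterface):
    def matches(self, records):
        return [r["op"] for r in records] == ["insert", "delete"]


class AbstractUpdateAgg(AggInterface):
    def matches(self, records):
        all_updates = all(r["op"] == "update" for r in records)
        return len(records) >= 2 and all_updates and self.fields_match(records)

    @abstractmethod
    def fields_match(self, records):
        ...


class UpdateSetAgg(AbstractUpdateAgg):
    """Updates that touch different fields."""

    def fields_match(self, records):
        seen = set()
        for r in records:
            if seen & set(r["set"]):
                return False
            seen |= set(r["set"])
        return True


class UpdateModifyAgg(AbstractUpdateAgg):
    """Updates that touch the same fields."""

    def fields_match(self, records):
        first = set(records[0]["set"])
        return all(set(r["set"]) == first for r in records)


def pick_aggregator(records):
    """Return the name of the first aggregator that applies, or None."""
    candidates = (InsertUpdateAgg(), InsertDeleteAgg(), UpdateSetAgg(), UpdateModifyAgg())
    for agg in candidates:
        if agg.matches(records):
            return type(agg).__name__
    return None
```

For example, three updates that each set field b select UpdateModifyAgg, while updates that set a, b and c respectively select UpdateSetAgg.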
In a possible implementation method, a specific flow of distributing the transactions to be synchronized to different threads is shown in fig. 6. Part A is the on-demand data window module, which executes steps 101 to 104 and distributes different transactions to be synchronized to the corresponding working thread groups according to whether dependency relationships exist between the transactions to be synchronized, and according to the number of transactions to be synchronized already distributed to each working thread group or the busyness of each working thread group. Part B is the sliding time window module, which addresses the large transaction problem: it divides the operation records into different conflict sets according to the identifications of the operation records in the transaction to be synchronized, further divides the operation records in each conflict set into time windows according to their generation time, and queues and executes them window by window in a sliding manner. Part C is the delay time window module, which addresses the hot spot record problem: it implements several custom aggregation calculators and reduces invalid IO requests by merging instructions in memory. The data volume may differ between moments and between scenes; for windows with dense data, the calculation scheme that aggregates according to data volume is preferred, and for windows with sparse data, the calculation scheme based on the delay time window is preferred.
Based on the same technical concept, fig. 7 exemplarily illustrates a data synchronization apparatus 700 provided in an embodiment of the present application. As shown in fig. 7, the apparatus includes an acquisition unit 701, a determination unit 702, and a synchronization unit 703. The acquisition unit 701 is configured to acquire a plurality of transactions to be synchronized based on a database log; the database log records the adding, deleting and modifying operations on the source database; any transaction to be synchronized contains at least one operation record for a source database table. The determination unit 702 is configured to determine, for any operation record of any transaction to be synchronized, a synchronization identifier corresponding to the operation record according to the database name, the table name, the unique key, and the numerical value of the unique key in the operation record; and to determine a synchronization strategy according to the log time sequence of the plurality of transactions to be synchronized and each operation record of the plurality of transactions to be synchronized, wherein under the synchronization strategy, transactions to be synchronized whose operation records have different synchronization identifiers are synchronized by different threads, and transactions to be synchronized that have at least one same synchronization identifier are executed serially by the same thread. The synchronization unit 703 is configured to implement data synchronization between the source database and the target storage unit according to the synchronization strategy.
In a possible implementation method, the determining unit 702 is specifically configured to sequentially determine, according to a log time sequence of a plurality of transactions to be synchronized, a synchronization period and an affiliated working thread group of each transaction to be synchronized, and allocate threads for the transactions to be synchronized according to the affiliated working thread group; if the second transaction to be synchronized and the first transaction to be synchronized have at least one same synchronization identifier, the second transaction to be synchronized and the working thread group to which the first transaction to be synchronized belong are the same, and the synchronization period of the second transaction to be synchronized is located after the synchronization period of the first transaction to be synchronized; the log time of the first transaction to be synchronized is earlier than the log time of the second transaction to be synchronized; if the second transaction to be synchronized and any first transaction to be synchronized do not have the same synchronization identification, the working thread group to which the second transaction to be synchronized belongs is different from the working thread group to which any first transaction to be synchronized belongs.
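The assignment rule can be sketched as follows, under simplifying assumptions: each working thread group is modelled as an ordered list whose positions stand for synchronization periods, the transaction names and identifier values are made up, and `assign_groups` is a hypothetical helper. A transaction that shares identifiers with several existing groups would need extra handling, which this sketch does not attempt; it simply joins the lowest-numbered matching group.

```python
def assign_groups(transactions):
    """transactions: list of (name, set of synchronization identifiers) in log order.

    Returns a list of working thread groups, each an ordered list of names.
    """
    owner = {}    # synchronization identifier -> index of the group that saw it
    groups = []   # group index -> transactions in synchronization-period order
    for name, ids in transactions:
        hits = sorted({owner[i] for i in ids if i in owner})
        if hits:
            group = hits[0]          # same group, later period than the earlier one
        else:
            group = len(groups)      # no shared identifier: a different group
            groups.append([])
        groups[group].append(name)
        for i in ids:
            owner.setdefault(i, group)
    return groups


log_ordered = [
    ("TX1", {"h1", "h2"}),
    ("TX2", {"h3"}),
    ("TX3", {"h2", "h4"}),   # shares h2 with TX1
    ("TX4", {"h5"}),
]
groups = assign_groups(log_ordered)
```

Here TX3 lands behind TX1 in the same group, so the two run serially, while TX2 and TX4 get groups of their own and can be synchronized in parallel.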
In a possible implementation method, the determining unit 702 is specifically configured to perform a hash operation on a database name, a table name, a primary key, and a numerical value of the primary key in the operation record, to obtain a first hash value; performing hash operation on the database name, the table name, the unique index and the pre-operation numerical value of the unique index in the operation record to obtain a second hash value; the pre-operation value is a value corresponding to the unique index before the operation record is executed; performing hash operation on the database name, the table name, the unique index and the operated numerical value of the unique index in the operation record to obtain a third hash value; the value after operation is the value corresponding to the unique index after operation record is executed; and taking the first hash value, the second hash value and the third hash value as a plurality of synchronous identifiers corresponding to the operation record.
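A minimal sketch of the three hash values follows. The source does not name a hash function or an encoding; SHA-256 over the parts joined with a separator byte is an assumption, as are the function names and the sample database, table and column names.

```python
import hashlib


def _hash(*parts):
    """Hash the given string parts; the separator avoids ambiguous concatenation."""
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def sync_identifiers(db, table, pk_name, pk_value, uk_name, uk_before, uk_after):
    """Return the synchronization identifiers of one operation record."""
    first = _hash(db, table, pk_name, str(pk_value))    # primary key
    second = _hash(db, table, uk_name, str(uk_before))  # unique index, pre-operation value
    third = _hash(db, table, uk_name, str(uk_after))    # unique index, post-operation value
    return {first, second, third}


# Hypothetical records: rec2 takes over the unique value that rec1 just wrote.
rec1 = sync_identifiers("shop", "user", "id", 1, "email", "a@x", "b@x")
rec2 = sync_identifiers("shop", "user", "id", 2, "email", "b@x", "c@x")
rec3 = sync_identifiers("shop", "user", "id", 3, "email", "d@x", "e@x")
```

Hashing both the pre-operation and the post-operation value of the unique index is what lets rec1 and rec2 be recognised as conflicting even though their primary keys differ, while rec3 shares no identifier with either.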
In one possible implementation, any working thread group includes at least one thread. The determining unit 702 is configured to determine whether the transaction to be synchronized is a large transaction, a large transaction being one in which the number of operation records and/or the number of operation record types is greater than a first threshold value; if the transaction to be synchronized is not a large transaction, allocate a thread to the transaction to be synchronized from the working thread group to which it belongs; if the transaction to be synchronized is a large transaction, determine conflict sets according to the identification of each operation record in the transaction to be synchronized, and allocate different threads from the working thread group to different conflict sets. The operation records in the same conflict set have the same identification, and the identifications of the operation records differ between different conflict sets, wherein the identification comprises one or more of a primary key, an auto-increment primary key, a joint primary key or a line number.
In one possible implementation, the number of threads in any one work thread group may be dynamically adjusted.
In a possible implementation method, the device further includes an aggregation unit 704, where the aggregation unit 704 is configured to determine whether each conflict set meets an aggregation condition, and aggregate operation is performed on operation records in the conflict set that meets the aggregation condition to obtain an aggregated record; and the synchronization unit 703 is configured to perform data synchronization on the operation record and/or the aggregate record in each transaction to be synchronized according to the synchronization policy, so as to achieve data synchronization between the source database and the target storage unit.
In a possible implementation method, the determining unit 702 is further configured to determine a size of the time window; according to the size of the time window, performing data segmentation processing on at least one operation record in the conflict set, and determining aggregated data; wherein the aggregate data belongs to the same time window; and an aggregation unit 704, configured to determine whether each piece of aggregation data meets an aggregation condition.
In a possible implementation manner, the aggregation unit 704 is specifically configured to determine that each conflict set includes an insert operation record and an update operation record, and an execution result of the insert operation record and the update operation record is equivalent to an execution result of the update operation record; the insert operation record is deleted.
In a possible implementation manner, the aggregation unit 704 is specifically configured to determine that each conflict set includes an insert operation record and a delete operation record, and an execution result of the insert operation record and the delete operation record is equivalent to an execution result of the delete operation record; the insert operation record is deleted.
In a possible implementation method, the aggregation unit 704 is specifically configured to determine that each conflict set includes at least two update operation records, and primary keys corresponding to the at least two update operation records are the same; merging at least two update operation records into a first update operation record; wherein the execution result of the first update operation record is equivalent to the execution result of at least two update operation records.
Based on the same technical concept, the embodiments of the present application provide a data synchronization apparatus 800, where the data synchronization apparatus 800 may be, for example, a computing device. As shown in fig. 8, a data synchronization device 800 includes at least one processor 801 and a memory 802 connected to the at least one processor, in this embodiment of the present application, a specific connection medium between the processor 801 and the memory 802 is not limited, and in fig. 8, the processor 801 and the memory 802 are connected by a bus, for example. The buses may be divided into address buses, data buses, control buses, etc.
In the embodiment of the present application, the memory 802 stores instructions executable by the at least one processor 801, and the at least one processor 801 may perform one of the data synchronization methods described above by executing the instructions stored in the memory 802.
The processor 801 is the control center of the data synchronization device 800. It may connect various parts of the computer device using various interfaces and lines, and perform resource setting by running or executing instructions stored in the memory 802 and calling data stored in the memory 802. Optionally, the processor 801 may include one or more processing units, and the processor 801 may integrate an application processor, which primarily handles the operating system, user interfaces, application programs, and the like, with a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may also not be integrated into the processor 801. In some embodiments, the processor 801 and the memory 802 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 801 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
Memory 802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 802 may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read Only Memory, PROM), read-only memory (Read Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory 802 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 802 in the embodiments of the present application may also be circuitry or any other device capable of implementing a storage function, for storing program instructions and/or data.
The embodiments of the present application also provide a computer-readable storage medium storing a computer-executable program, the computer-executable program being used to cause a computer to execute the data synchronization method set forth in any one of the above manners.
The embodiments of the present application provide a computer program product comprising a computer program executable by a computer device, which when run on the computer device, causes the computer device to perform a data synchronization method as set forth in any one of the above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A method of data synchronization, comprising:
acquiring a plurality of transactions to be synchronized based on a database log; the database log records the adding, deleting and modifying operations aiming at the source database; any transaction to be synchronized contains at least one operation record for the source database table;
for any operation record of any transaction to be synchronized, determining a synchronization identifier corresponding to the operation record according to the database name, the table name, the unique key and the numerical value of the unique key in the operation record;
determining a synchronization strategy according to the log time sequence of the plurality of transactions to be synchronized and each operation record of the plurality of transactions to be synchronized, wherein under the synchronization strategy, transactions to be synchronized whose operation records have different synchronization identifiers are synchronized by different threads, and transactions to be synchronized that have at least one same synchronization identifier are executed serially by the same thread;
and realizing data synchronization between the source database and the target storage unit according to the synchronization strategy.
2. The method of claim 1, wherein the determining a synchronization strategy according to the log time order of the plurality of transactions to be synchronized and the operation records of the plurality of transactions to be synchronized comprises:
sequentially determining, in the log time order of the plurality of transactions to be synchronized, a synchronization period and a worker thread group for each transaction to be synchronized, and allocating threads to the transaction to be synchronized from the worker thread group to which it belongs; wherein,
if a second transaction to be synchronized and a first transaction to be synchronized share at least one synchronization identifier, the second transaction to be synchronized belongs to the same worker thread group as the first transaction to be synchronized, and the synchronization period of the second transaction to be synchronized is after the synchronization period of the first transaction to be synchronized, the log time of the first transaction to be synchronized being earlier than the log time of the second transaction to be synchronized;
if the second transaction to be synchronized shares no synchronization identifier with any first transaction to be synchronized, the worker thread group to which the second transaction to be synchronized belongs is different from the worker thread group of every first transaction to be synchronized.
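The grouping rule of claim 2 can be illustrated with a minimal Python sketch. The names `Txn` and `plan` are hypothetical and do not come from the application. The claim does not say what happens when one transaction shares identifiers with transactions in several groups; merging those groups, as done here, is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    sync_ids: frozenset  # synchronization identifiers of all its operation records


def plan(transactions):
    """Return {txn name: (group, period)} for transactions given in log-time order."""
    id_group = {}     # synchronization identifier -> group currently owning it
    last_period = {}  # group -> highest synchronization period handed out so far
    members = {}      # group -> names of the transactions placed in it
    result = {}
    next_group = 0
    for txn in transactions:
        hit = sorted({id_group[s] for s in txn.sync_ids if s in id_group})
        if hit:
            group = hit[0]
            # Assumption (not in the claim): a transaction that bridges
            # several groups merges them into the lowest-numbered one.
            for other in hit[1:]:
                for name in members.pop(other):
                    result[name] = (group, result[name][1])
                    members[group].append(name)
                last_period[group] = max(last_period[group], last_period.pop(other))
                for s in list(id_group):
                    if id_group[s] == other:
                        id_group[s] = group
            # Run strictly after every earlier transaction in the group.
            period = last_period[group] + 1
        else:
            # No shared identifier: open a fresh worker thread group.
            group, period = next_group, 0
            next_group += 1
            members[group] = []
        last_period[group] = period
        members[group].append(txn.name)
        result[txn.name] = (group, period)
        for s in txn.sync_ids:
            id_group[s] = group
    return result
```

In this sketch, transactions in different groups can run in parallel, while transactions in one group are ordered by their period.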
3. The method of claim 1, wherein the determining a synchronization identifier corresponding to the operation record according to the database name, the table name, the unique key, and the value of the unique key in the operation record comprises:
performing a hash operation on the database name, the table name, the primary key, and the value of the primary key in the operation record to obtain a first hash value;
performing a hash operation on the database name, the table name, the unique index, and the pre-operation value of the unique index in the operation record to obtain a second hash value, the pre-operation value being the value of the unique index before the operation record is executed;
performing a hash operation on the database name, the table name, the unique index, and the post-operation value of the unique index in the operation record to obtain a third hash value, the post-operation value being the value of the unique index after the operation record is executed;
and taking the first hash value, the second hash value, and the third hash value as a plurality of synchronization identifiers corresponding to the operation record.
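As a rough illustration of claim 3, the sketch below derives the hash values for one operation record. The record layout (`db`, `table`, `pk_name`, `pk_value`, `unique_indexes`), the choice of SHA-256, and the field separator are all assumptions made for the example; the claim names no particular hash function or record format.

```python
import hashlib


def _hash(*parts):
    # Join fields with a separator unlikely to occur in names or values.
    joined = "\x1f".join(str(p) for p in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def sync_identifiers(record):
    """Return the set of synchronization identifiers for one operation record."""
    db, table = record["db"], record["table"]
    # First hash value: database name, table name, primary key and its value.
    ids = {_hash(db, table, record["pk_name"], record["pk_value"])}
    for index_name, (before, after) in record.get("unique_indexes", {}).items():
        # Second hash value: unique index with its pre-operation value.
        if before is not None:
            ids.add(_hash(db, table, index_name, before))
        # Third hash value: unique index with its post-operation value.
        if after is not None:
            ids.add(_hash(db, table, index_name, after))
    return ids
```

Hashing both the pre-operation and post-operation values means that a record releasing a unique-index value and a later record claiming the same value end up with a common identifier, so the two are not reordered.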
4. The method of claim 2, wherein each worker thread group includes at least one thread;
the allocating threads to the transaction to be synchronized from the worker thread group to which it belongs comprises:
determining whether the transaction to be synchronized is a large transaction, a large transaction being a transaction that contains a plurality of operation record types and/or whose number of operation records is greater than a first threshold;
if the transaction to be synchronized is not a large transaction, allocating one thread to the transaction to be synchronized from the worker thread group to which the transaction to be synchronized belongs;
if the transaction to be synchronized is a large transaction, determining conflict sets according to the identifier of each operation record in the transaction to be synchronized, and allocating different threads to different conflict sets from the worker thread group; wherein the operation records in the same conflict set have the same identifier, the identifiers of the operation records differ between different conflict sets, and the identifier comprises one or more of a primary key, an auto-increment primary key, a composite primary key, or a row number.
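The large-transaction handling of claim 4 might look like the following sketch. The record fields `op` and `key`, the function names, and the thread labels are hypothetical. Sharing threads round-robin when conflict sets outnumber threads is also an assumption; the claims instead contemplate adjusting the size of the worker thread group.

```python
from collections import defaultdict


def is_large(records, first_threshold):
    """A transaction is large if it mixes operation types or exceeds the threshold."""
    kinds = {r["op"] for r in records}
    return len(kinds) > 1 or len(records) > first_threshold


def conflict_sets(records):
    """Group records by identifier (here the primary key), keeping log order."""
    sets = defaultdict(list)
    for r in records:
        sets[r["key"]].append(r)
    return dict(sets)


def allocate(records, threads, first_threshold):
    """Return {thread: [records]} for one transaction within its worker thread group."""
    if not is_large(records, first_threshold):
        # Small transaction: a single thread handles every record.
        return {threads[0]: list(records)}
    plan = defaultdict(list)
    # Assumption: round-robin when conflict sets outnumber available threads.
    for i, group in enumerate(conflict_sets(records).values()):
        plan[threads[i % len(threads)]].extend(group)
    return dict(plan)
```

Records with the same identifier stay on one thread in log order, so operations on a single row are never reordered.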
5. The method of claim 4, wherein the allocating different threads to different conflict sets from the worker thread group comprises:
dynamically adjusting the number of threads in any worker thread group.
6. The method of claim 4, wherein the data synchronization between the source database and the target storage unit according to the synchronization strategy comprises:
determining whether each conflict set satisfies an aggregation condition, and aggregating the operation records in the conflict sets that satisfy the aggregation condition to obtain aggregated records;
and synchronizing, according to the synchronization strategy, the operation records and/or the aggregated records in each transaction to be synchronized, thereby achieving data synchronization between the source database and the target storage unit.
7. The method of claim 6, further comprising, before the determining whether each conflict set satisfies an aggregation condition:
determining a size of a time window;
segmenting the at least one operation record in the conflict set according to the size of the time window to determine pieces of aggregation data, wherein the operation records in each piece of aggregation data belong to the same time window;
wherein the determining whether each conflict set satisfies an aggregation condition comprises:
determining whether each piece of aggregation data satisfies the aggregation condition.
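The time-window segmentation of claim 7 can be sketched as follows. The numeric `ts` field, the fixed windows aligned at zero, and the assumption that records arrive in log-time order are illustrative choices, not details taken from the application.

```python
from itertools import groupby


def split_by_window(conflict_set, window_size):
    """Cut one conflict set into pieces whose records share a time window.

    Assumes records are already in log-time order and carry a numeric
    timestamp under the hypothetical key "ts".
    """
    def window_index(record):
        return int(record["ts"] // window_size)

    return [list(piece) for _, piece in groupby(conflict_set, key=window_index)]
```

Each returned piece can then be checked against the aggregation condition on its own, which bounds how long a record waits before it is synchronized.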
8. The method of claim 6, wherein the determining whether each conflict set satisfies an aggregation condition comprises:
determining that the conflict set contains an insert operation record and an update operation record, and that the execution result of the insert operation record together with the update operation record is equivalent to the execution result of the update operation record alone;
the aggregating the operation records in the conflict sets that satisfy the aggregation condition comprises:
deleting the insert operation record.
9. The method of claim 6, wherein the determining whether each conflict set satisfies an aggregation condition comprises:
determining that the conflict set contains an insert operation record and a delete operation record, and that the execution result of the insert operation record together with the delete operation record is equivalent to the execution result of the delete operation record alone;
the aggregating the operation records in the conflict sets that satisfy the aggregation condition comprises:
deleting the insert operation record.
10. The method of claim 6, wherein the determining whether each conflict set satisfies an aggregation condition comprises:
determining that the conflict set contains at least two update operation records, the at least two update operation records corresponding to the same primary key;
the aggregating the operation records in the conflict sets that satisfy the aggregation condition comprises:
combining the at least two update operation records into a first update operation record, wherein the execution result of the first update operation record is equivalent to the execution result of the at least two update operation records.
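The aggregation rules of claims 8 to 10 can be illustrated together with a short sketch. The record layout (`op`, `key`, `values`) and the function names are hypothetical. The equivalence tests rest on stated assumptions: the target applies an update as an upsert, an update that sets every column of the preceding insert makes that insert redundant, and deleting a missing row on the target is harmless.

```python
def merge_updates(records):
    """Claim 10: collapse runs of consecutive updates on one key into one update."""
    out = []
    for r in records:
        if r["op"] == "update" and out and out[-1]["op"] == "update":
            # Later column values override earlier ones.
            merged = {**out[-1]["values"], **r["values"]}
            out[-1] = {"op": "update", "key": r["key"], "values": merged}
        else:
            out.append(dict(r))
    return out


def drop_redundant_insert(records, target_upserts=True):
    """Claims 8 and 9: drop a leading insert that the next record supersedes."""
    if len(records) < 2 or records[0]["op"] != "insert":
        return records
    first, second = records[0], records[1]
    if second["op"] == "delete":
        # Assumption: deleting a row absent from the target is a no-op.
        return records[1:]
    if (second["op"] == "update" and target_upserts
            and set(first["values"]) <= set(second["values"])):
        # Assumption: the update rewrites every inserted column as an upsert.
        return records[1:]
    return records


def aggregate(conflict_set):
    """Apply both reductions to the records of one conflict set."""
    return drop_redundant_insert(merge_updates(conflict_set))
```

When neither equivalence holds, for example an update that touches only some of the inserted columns, the records are left unchanged and synchronized one by one.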
CN202311386289.0A 2023-10-24 2023-10-24 Data synchronization method Pending CN117370462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311386289.0A CN117370462A (en) 2023-10-24 2023-10-24 Data synchronization method

Publications (1)

Publication Number Publication Date
CN117370462A (en) 2024-01-09

Family

ID=89407391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311386289.0A Pending CN117370462A (en) 2023-10-24 2023-10-24 Data synchronization method

Country Status (1)

Country Link
CN (1) CN117370462A (en)

Similar Documents

Publication Publication Date Title
CN111338766B (en) Transaction processing method and device, computer equipment and storage medium
CN110166282B (en) Resource allocation method, device, computer equipment and storage medium
CN111597015B (en) Transaction processing method and device, computer equipment and storage medium
CN111159252B (en) Transaction execution method and device, computer equipment and storage medium
CN106776967B (en) Method and device for storing massive small files in real time based on time sequence aggregation algorithm
US8381230B2 (en) Message passing with queues and channels
US10866970B1 (en) Range query capacity allocation
CN108959510B (en) Partition level connection method and device for distributed database
CN113419823B (en) Alliance chain system suitable for high concurrency transaction and design method thereof
CN109240607B (en) File reading method and device
CN110633378A (en) Graph database construction method supporting super-large scale relational network
CN111722918A (en) Service identification code generation method and device, storage medium and electronic equipment
CN112162846B (en) Transaction processing method, device and computer readable storage medium
CN112749198A (en) Multi-level data caching method and device based on version number
CN113687964A (en) Data processing method, data processing apparatus, electronic device, storage medium, and program product
CN108399175B (en) Data storage and query method and device
US8543722B2 (en) Message passing with queues and channels
CN114721594A (en) Distributed storage method, device, equipment and machine readable storage medium
CN117271531B (en) Data storage method, system, equipment and medium
US20130332465A1 (en) Database management device and database management method
CN114327642A (en) Data read-write control method and electronic equipment
CN109460406A (en) A kind of data processing method and device
CN115114294A (en) Self-adaption method and device of database storage mode and computer equipment
CN115760405A (en) Transaction execution method, device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication