CN115145937A

CN115145937A - Data synchronization method and system based on middleware

Info

Publication number: CN115145937A
Application number: CN202210823871.8A
Authority: CN
Inventors: 沙正辉; 李华靖; 骆海东; 颜嘉梁
Original assignee: Hangzhou Jushuitan Network Technology Co ltd; Shanghai Juhuotong E Commerce Co ltd; Shanghai Jushuitan Network Technology Co ltd
Current assignee: Hangzhou Jushuitan Network Technology Co ltd; Shanghai Juhuotong E Commerce Co ltd; Shanghai Jushuitan Network Technology Co ltd
Priority date: 2022-07-13
Filing date: 2022-07-13
Publication date: 2022-10-04

Abstract

The invention relates to a middleware-based data synchronization method and system in the technical field of data processing. The method comprises the following steps: acquiring the data tasks to be synchronized on all task nodes, and determining the task instance corresponding to each data task to be synchronized; determining synchronization task groups, where a synchronization task group is an array formed by a plurality of data tasks to be synchronized that correspond to the same task instance; and, for each synchronization task group, when the number of data tasks to be synchronized in the group is greater than a task number safety threshold, issuing a task configuration update instruction to change the working state of the data tasks to be synchronized on the corresponding task nodes, the working state comprising a synchronous state and a stop synchronization state. The invention schedules and configures large-scale synchronization processes for data on the scale of millions of tables, achieves real-time data synchronization, and minimizes the performance pressure on the database.

Description

Data synchronization method and system based on middleware

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data synchronization method and system based on a middleware.

Background

Currently, a company's Enterprise Resource Planning (ERP) services mainly use SQLSERVER as the database, with no standby database and no change data capture (CDC) mechanism, while DataWorks, the support system of the data platform, does not support second-level data synchronization. If offline synchronization is used for batch collection, it may put performance and traffic pressure on the business database; if online synchronization is used, multiple databases must be provisioned to achieve real-time data synchronization, which incurs significant cost, such as the expense of purchasing and maintaining both the source instances and the standby databases.

Disclosure of Invention

The invention aims to provide a middleware-based data synchronization method and a middleware-based data synchronization system, which are used for carrying out scheduling configuration on a large-scale synchronization process aiming at million-level table scale data, realizing real-time data synchronization and reducing the performance pressure of a database to the maximum extent.

In order to achieve the purpose, the invention provides the following scheme:

the invention provides a data synchronization method based on middleware, which comprises the following steps:

acquiring data tasks to be synchronized on all task nodes, and determining a task instance corresponding to each data task to be synchronized;

determining a synchronous task group; the synchronous task group is an array formed by a plurality of data tasks to be synchronized, which have the same corresponding task instances;

for each synchronization task group, when the number of the data tasks to be synchronized in the synchronization task group is greater than a task number safety threshold, a task configuration updating instruction is sent out so as to change the working state of the data tasks to be synchronized in the corresponding task nodes; the working state comprises a synchronous state and a stop synchronization state.

Optionally, the acquiring the to-be-synchronized data tasks on all task nodes specifically includes:

acquiring preliminary trigger tasks on all task nodes;

judging whether the preliminary trigger task is in a task abnormal state or not aiming at each preliminary trigger task;

if the preliminary trigger task is in a task abnormal state, sending a task abnormal notification;

and if the preliminary trigger task is not in an abnormal state, determining the preliminary trigger task as a data task to be synchronized.

Optionally, the middleware-based data synchronization method further includes:

counting the working states of the data tasks to be synchronized in all the task nodes to obtain a task scheduling table;

and updating the task scheduling table according to the task configuration updating instruction.

Optionally, the middleware-based data synchronization method further includes:

acquiring local synchronous data; the local synchronous data is data obtained after the data task to be synchronized is executed;

judging whether the local synchronization data is consistent with a task instance corresponding to the data task to be synchronized or not, and obtaining a first result;

if the first result indicates yes, outputting a synchronization error-free notification;

and if the first result shows no, returning to the step of determining the synchronous task group.

Optionally, the task configuration update instruction is determined according to the timestamp field type supported by the SQLSERVER of the same data source.

The invention also provides a data synchronization system based on the middleware, which comprises:

the synchronous task determining module is used for acquiring data tasks to be synchronized on all task nodes and determining a task instance corresponding to each data task to be synchronized;

the grouping module is used for determining a synchronous task group; the synchronous task group is an array formed by a plurality of data tasks to be synchronized, which have the same corresponding task instances;

the synchronous task updating module is used for sending a task configuration updating instruction to change the working state of the data tasks to be synchronized in the corresponding task nodes when the number of the data tasks to be synchronized in the synchronous task groups is greater than a task number safety threshold value aiming at each synchronous task group; the working state comprises a synchronous state and a stop synchronization state.

Optionally, in terms of acquiring data tasks to be synchronized on all task nodes, the synchronization task determining module specifically includes:

the preliminary task determining submodule is used for acquiring preliminary trigger tasks on all task nodes;

the abnormality checking submodule is used for judging whether the preliminary trigger task is in a task abnormal state or not aiming at each preliminary trigger task;

the exception notification submodule is used for sending out task exception notification when the preliminary trigger task is in a task exception state;

and the task preparation submodule is used for determining the preliminary trigger task as a to-be-processed data synchronization task when the preliminary trigger task is not in an abnormal state.

Optionally, the middleware-based data synchronization system further includes:

the scheduling and counting module is used for counting the working states of the data tasks to be synchronized in all the task nodes to obtain a task scheduling table;

and the scheduling updating module is used for updating the task scheduling table according to the task configuration updating instruction.

Optionally, the middleware-based data synchronization system further includes:

the local synchronous data acquisition module is used for acquiring local synchronous data; the local synchronous data is data obtained after the data task to be synchronized is executed;

the data before and after synchronization judging module is used for judging whether the local synchronization data is consistent with the task instance corresponding to the data task to be synchronized or not and obtaining a first result;

an end data module for outputting a synchronization error-free notification when the first result indicates yes;

and a returning module, configured to return to the step of determining the synchronization task group when the first result indicates no.

Optionally, the task configuration update instruction is determined according to the timestamp field type supported by the same data source SQLSERVER.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention provides a data synchronization method and a data synchronization system based on middleware, which aim at million-level table scale data, check data tasks to be synchronized on all task nodes, determine the number of the data tasks to be synchronized with the same corresponding task instances, and send a task configuration updating instruction to reconfigure the corresponding data tasks to be synchronized when the number of the corresponding data tasks to be synchronized exceeds a task number safety threshold value, so that the data tasks to be synchronized on some task nodes are continuously performed, and the data tasks to be synchronized on some task nodes are stopped, thereby realizing the scheduling configuration of large-scale data synchronization, providing a stable and timely data source for a data platform on the premise of minimizing the influence on the performance of a service library, and realizing the real-time synchronization of large-scale data.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a middleware-based data synchronization method according to the present invention;

FIG. 2 is a schematic diagram of a middleware-based data synchronization system according to the present invention;

FIG. 3 is a swim lane diagram of a data synchronization middleware according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present invention will be described in further detail with reference to the accompanying drawings and detailed description, in order to make the objects, features and advantages thereof more comprehensible.

CDC stands for Change Data Capture, which uses the SQL SERVER agent to record insert, update, and delete activity applied to a table. DataWorks (data factory) is a unified end-to-end big data development and governance platform on Alibaba Cloud (Aliyun), built on big data engines such as MaxCompute, Hologres, EMR and CDP, which supports solutions such as data warehouses, data lakes and integrated lakehouses. DataX is an offline data synchronization tool open-sourced by Alibaba Cloud that supports data synchronization between data sources of any type. ODS stands for Operational Data Store, the operational data layer (source-aligned layer), which is the rawest data layer of the data warehouse.

Example one

As shown in fig. 1, the present embodiment provides a data synchronization method based on middleware, including:

step 100, acquiring data tasks to be synchronized on all task nodes, and determining a task instance corresponding to each data task to be synchronized.

The acquiring of the to-be-synchronized data tasks on all task nodes specifically includes:

1) Acquiring the preliminary trigger tasks on all task nodes. A preliminary trigger task may be a complete data synchronization task, or an incremental synchronization task for a data task that has already been synchronized. In the latter case, the incremental synchronization task covers only the data that has changed in the data task to be synchronized; the unchanged data does not belong to the incremental synchronization task and need not be synchronized again.

2) For each preliminary trigger task, judging whether the preliminary trigger task is in a task abnormal state.

3) If the preliminary trigger task is in a task abnormal state, sending a task abnormal notification.

4) If the preliminary trigger task is not in an abnormal state, determining the preliminary trigger task as a data task to be synchronized.
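The four sub-steps above can be sketched as a simple filter. This is a minimal illustration only, not the patented implementation: the `TriggerTask` record, its `abnormal` flag and the `notify` callback are hypothetical names that do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TriggerTask:
    """Hypothetical record for a preliminary trigger task."""
    task_id: str
    node: str              # task node the task was collected from
    instance: str          # task instance the task corresponds to
    abnormal: bool = False


def collect_pending_tasks(
    preliminary: Iterable[TriggerTask],
    notify: Callable[[TriggerTask], None],
) -> List[TriggerTask]:
    """Keep the normal tasks and send a notification for each abnormal one."""
    pending = []
    for task in preliminary:
        if task.abnormal:
            notify(task)          # sub-step 3: task abnormal notification
        else:
            pending.append(task)  # sub-step 4: becomes a data task to be synchronized
    return pending


alerts: List[str] = []
tasks = [
    TriggerTask("t1", "node-a", "orders"),
    TriggerTask("t2", "node-b", "orders", abnormal=True),
    TriggerTask("t3", "node-b", "stock"),
]
pending = collect_pending_tasks(tasks, lambda t: alerts.append(t.task_id))
```

Here the notification is just appended to a list; in a real deployment it would presumably be routed to whatever alerting channel the workers use.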

Step 200, determining a synchronous task group; the synchronous task group is an array formed by a plurality of data tasks to be synchronized, wherein the corresponding task instances are the same.

Step 300, for each synchronization task group, when the number of the data tasks to be synchronized in the synchronization task group is greater than a task number safety threshold, sending a task configuration update instruction to change the working state of the data tasks to be synchronized in the corresponding task node; the working state comprises a synchronous state and a stop synchronization state. The task configuration update instruction is determined according to the timestamp field type supported by the SQLSERVER of the same data source. It should be noted that the timestamp (TS) is a mechanism commonly used by the SQLSERVER database to add a version stamp to table rows: the TS is updated each time a record is updated (including insertion and deletion), it only increases and never decreases, and it is shared at the database level, so that it is globally incrementing and unique. If no task number safety threshold is set, the volume of data pointing to the same synchronization task may become too large to coordinate, which makes overall flow control more difficult. Once the task number safety threshold is set, the tasks in excess of the threshold can be stopped immediately through monitoring, so that the flow can be controlled.
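Steps 200 and 300 can be sketched as a grouping pass followed by a threshold check. The sketch below is illustrative only. The state names, the function names and, in particular, the choice of which tasks keep running (the first `threshold` tasks of an oversized group) are assumptions; the patent only states that tasks in excess of the threshold are stopped.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SYNCING = "synchronizing"   # hypothetical label for the synchronous state
STOPPED = "stopped"         # hypothetical label for the stop synchronization state


def group_by_instance(tasks: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Step 200: group (task_id, instance) pairs by task instance."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for task_id, instance in tasks:
        groups[instance].append(task_id)
    return dict(groups)


def plan_config_updates(
    groups: Dict[str, List[str]], threshold: int
) -> Dict[str, str]:
    """Step 300: for each group larger than the threshold, build an update
    that keeps the first `threshold` tasks running and stops the rest
    (the selection rule is an assumption of this sketch)."""
    updates: Dict[str, str] = {}
    for task_ids in groups.values():
        if len(task_ids) > threshold:
            for position, task_id in enumerate(task_ids):
                updates[task_id] = SYNCING if position < threshold else STOPPED
    return updates


tasks = [("t1", "orders"), ("t2", "orders"), ("t3", "orders"), ("t4", "stock")]
groups = group_by_instance(tasks)
updates = plan_config_updates(groups, threshold=2)
```

Groups at or below the threshold produce no update, which matches the text: an instruction is issued only when the safety threshold is exceeded.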

Further, the middleware-based data synchronization method further includes:

step 400, acquiring local synchronous data; the local synchronization data is data obtained after the data task to be synchronized is executed.

Step 500, judging whether the local synchronization data is consistent with the task instance corresponding to the data task to be synchronized, and obtaining a first result.

Step 600, if the first result indicates yes, outputting a synchronization error-free notification.

Step 700, if the first result indicates no, returning to the step of determining the synchronous task group.
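The check-and-retry loop of steps 400 to 700 can be sketched as follows. The `run_round` and `is_consistent` callbacks and the `max_rounds` bound are assumptions added for illustration; the patent does not specify how consistency is evaluated or whether the number of retries is limited.

```python
from typing import Callable, Dict


def synchronize_until_consistent(
    run_round: Callable[[], Dict[int, str]],
    is_consistent: Callable[[Dict[int, str]], bool],
    max_rounds: int = 3,
) -> int:
    """Repeat synchronization rounds until the local data passes the check.

    Returns the number of rounds used; raises if the data is still
    inconsistent after `max_rounds` (a bound assumed by this sketch).
    """
    for round_no in range(1, max_rounds + 1):
        local_data = run_round()        # regroup, execute, collect local data (step 400)
        if is_consistent(local_data):   # step 500: obtain the first result
            return round_no             # step 600: synchronization error-free
        # step 700: first result is "no", so go back and regroup
    raise RuntimeError("local data still inconsistent after max_rounds")


source = {1: "a", 2: "b"}
local: Dict[int, str] = {}


def run_round() -> Dict[int, str]:
    # Copy one missing row per round to mimic a partial synchronization.
    for key, value in source.items():
        if key not in local:
            local[key] = value
            break
    return dict(local)


rounds = synchronize_until_consistent(run_round, lambda data: data == source)
```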

Preferably, the middleware-based data synchronization method further includes:

1) Counting the working states of the data tasks to be synchronized in all the task nodes to obtain a task scheduling table.

2) Updating the task scheduling table according to the task configuration update instruction.
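One possible shape for the task scheduling table is a mapping from task to node and working state, with the update instruction applied as a set of state changes. The layout and the names `ScheduleTable`, `apply_update` and `summarize` are hypothetical; the patent does not describe the table's structure.

```python
from collections import Counter
from typing import Dict, Tuple

# Hypothetical layout: task_id -> (task node, working state)
ScheduleTable = Dict[str, Tuple[str, str]]


def apply_update(table: ScheduleTable, updates: Dict[str, str]) -> ScheduleTable:
    """Return a new table with the states from an update instruction applied.

    Tasks named in the update but absent from the table are ignored.
    """
    new_table = dict(table)
    for task_id, state in updates.items():
        if task_id in new_table:
            node, _old_state = new_table[task_id]
            new_table[task_id] = (node, state)
    return new_table


def summarize(table: ScheduleTable) -> Counter:
    """Count tasks per (node, working state) pair."""
    return Counter(table.values())


table: ScheduleTable = {
    "t1": ("node-a", "synchronizing"),
    "t2": ("node-a", "synchronizing"),
    "t3": ("node-b", "synchronizing"),
}
updated = apply_update(table, {"t3": "stopped", "t9": "stopped"})
summary = summarize(updated)
```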

In a specific practical application, in order to synchronize data on the scale of millions of tables to the database in real time, the alternative real-time synchronization schemes for the SQLSERVER of the same data source are compared first, as shown in Table 1.

Table 1 alternative comparison table

Secondly, a real-time synchronization strategy is determined according to the timestamp field types supported by the SQLSERVER of the same data source, and different schemes exist for the real-time synchronization strategy, so that different effects are achieved, as shown in Table 2.

Table 2 comparison of the effects achieved after executing different schemes based on the real-time synchronization strategy

Specifically, there are three schemes implemented based on the real-time synchronization policy:

1) Auto-increment ID: the database supports auto-increment IDs, and each time a record is added/inserted, a globally unique auto-increment ID is assigned. For example, if the current ID of a certain file is ID1 and the ID has become ID2 when the file is checked again, the data of the file has changed and a synchronization update is needed.

2) Modification time: the database table records a system timestamp that is updated automatically on every update, and whether the file data needs a synchronization update is determined by checking whether the timestamp has changed. The data corresponding to the current maximum timestamp is generally taken as the reference for the update, but with modification-time increments the number of records per update cannot be controlled.

3) Update count: for a file or data table, the change count is incremented by 1 for each record added/inserted. For example, if the current change count of a certain file is 10, the file content at change count 10 is synchronized; if the change count is 11 or 12 at the next check, the file content has changed, and the part changed between change count 10 and change count 11 or 12 needs a synchronization update.
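All three schemes rely on a monotonically increasing marker that the synchronizer remembers between runs. The sketch below illustrates that watermark idea with an in-memory SQLite database standing in for SQLSERVER; the `orders` table, the `row_ver` column (maintained by hand here, whereas a real version stamp would be maintained by the database) and the `pull_changes` helper are all hypothetical.

```python
import sqlite3
from typing import List, Tuple

# In-memory SQLite is used only as a stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, row_ver INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", 1), (2, "new", 2), (3, "new", 3)],
)


def pull_changes(
    conn: sqlite3.Connection, last_ver: int, batch_size: int
) -> Tuple[List[tuple], int]:
    """Fetch up to `batch_size` rows changed after `last_ver`.

    Returns the rows and the new watermark (unchanged if nothing was found).
    """
    rows = conn.execute(
        "SELECT id, status, row_ver FROM orders "
        "WHERE row_ver > ? ORDER BY row_ver LIMIT ?",
        (last_ver, batch_size),
    ).fetchall()
    new_ver = rows[-1][2] if rows else last_ver
    return rows, new_ver


# First pull is limited to two rows; the watermark advances to 2.
first, watermark = pull_changes(conn, 0, 2)

# A later update bumps the version of row 1 (done by hand in this sketch).
conn.execute("UPDATE orders SET status = 'paid', row_ver = 4 WHERE id = 1")

# Second pull returns the row not yet read plus the updated row.
second, watermark = pull_changes(conn, watermark, 10)
```

Because the marker is unique and ordered, the batch size can be capped with `LIMIT`, which is the kind of record-count control that the modification-time scheme is said to lack.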

Then, an architecture design is performed based on the data synchronization method of the embodiment, specifically, a distributed master-slave mode is adopted to perform synchronization task management on data, and the example is shown in table 3.

Table 3 comparison table of distributed master-slave mode corresponding results

Finally, the middleware-based data synchronization method provided by this embodiment can perform real-time data synchronization in either a synchronous task issuing state or an asynchronous task issuing state during execution. Specifically, in the synchronous task issuing state, the data synchronization method includes the following:

A worker issues data synchronization tasks at the front end; the task nodes randomly select data tasks to be synchronized; the selected data tasks that belong to the same task instance are checked to judge whether their number is greater than the task number safety threshold; if not, the synchronization tasks are executed in batch. While a task node executes a task, the state of the executed task must be checked: if the selected task is in a task abnormal state, a task abnormal notification is sent to alert the worker; if the selected task is in a normal working state, the next data synchronization proceeds, and the result of the synchronized data is displayed at the front end.

In the asynchronous task issuing state, the data synchronization method differs from the synchronous case in that, when the number of tasks is not greater than the task number safety threshold, a return result can be output directly, so that the task node can continue to perform other data synchronization tasks while the task state check is carried out.
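The difference between the two issuing states can be illustrated with a thread pool: a synchronous issue waits for the batch result, while an asynchronous issue returns a handle immediately. This is only a sketch under assumptions; `run_batch`, `issue` and the choice to raise an error above the threshold are hypothetical and are not taken from the patent.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Union


def run_batch(task_ids: List[str]) -> List[str]:
    """Stand-in for executing a batch of synchronization tasks."""
    return [f"{task_id}:done" for task_id in task_ids]


def issue(
    task_ids: List[str],
    threshold: int,
    executor: ThreadPoolExecutor,
    asynchronous: bool,
) -> Union[List[str], Future]:
    """Issue a batch if it is within the task number safety threshold.

    Synchronous mode blocks until the batch finishes; asynchronous mode
    returns a Future at once so the caller can carry on with other work.
    """
    if len(task_ids) > threshold:
        # Handling of oversized groups is outside the scope of this sketch.
        raise ValueError("task count exceeds the safety threshold")
    future = executor.submit(run_batch, task_ids)
    return future if asynchronous else future.result()


with ThreadPoolExecutor(max_workers=1) as pool:
    sync_result = issue(["t1", "t2"], 5, pool, asynchronous=False)
    pending = issue(["t3"], 5, pool, asynchronous=True)
    async_result = pending.result()
```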

The data synchronization method of this embodiment further includes querying and synchronizing the task synchronization state, which can likewise be done by synchronous query or asynchronous query. Specifically, in a synchronous query, the worker sends a task state query instruction to the task node from the front end, and the task state returned by the task node is then displayed at the front end. In an asynchronous query, the worker sends a task state query instruction to the task node from the front end; on the one hand, a result is returned so that the next query can be made, and on the other hand, the task node executes its internal timed task, which is scheduled through the data middle platform DP. It should be understood that in the asynchronous query case the task is executed at intervals, and the corresponding node action is invoked only when an external trigger is received; otherwise the node takes no action.

For monitoring of the task nodes, the data middle platform collects the state data and task data of a plurality of task nodes; after a worker sends a node data query request through the front end, the data middle platform sends the state data and task data of the corresponding task nodes to the front end for display.

Example two

As shown in fig. 2, the present embodiment provides a middleware-based data synchronization system, including:

the synchronization task determining module 101 is configured to acquire data tasks to be synchronized on all task nodes, and determine a task instance corresponding to each data task to be synchronized.

In terms of acquiring data tasks to be synchronized on all task nodes, the synchronization task determining module 101 specifically includes:

The preliminary task determining submodule is used for acquiring preliminary trigger tasks on all task nodes.

The abnormality checking submodule is used for judging, for each preliminary trigger task, whether the preliminary trigger task is in a task abnormal state.

The exception notification submodule is used for sending a task abnormal notification when the preliminary trigger task is in a task abnormal state.

The task preparation submodule is used for determining the preliminary trigger task as a data task to be synchronized when the preliminary trigger task is not in an abnormal state.

A grouping module 201 for determining a synchronization task group; the synchronous task group is an array formed by a plurality of data tasks to be synchronized, wherein the corresponding task instances are the same.

A synchronization task updating module 301, configured to, for each synchronization task group, send a task configuration updating instruction when the number of the data tasks to be synchronized in the synchronization task group is greater than a task number safety threshold, so as to change a working state of the data tasks to be synchronized in a corresponding task node; the working state comprises a synchronous state and a stop synchronization state. Specifically, the task configuration updating instruction is determined according to the type of the timestamp field supported by the SQLSERVER of the same data source.

The data synchronization system based on the middleware further comprises a scheduling statistic module, a scheduling updating module, a local synchronous data acquisition module, a data before and after synchronization judgment module, an end data module and a return module.

The scheduling and counting module is used for counting the working states of the data tasks to be synchronized in all the task nodes to obtain a task scheduling table; and the scheduling updating module is used for updating the task scheduling table according to the task configuration updating instruction.

The local synchronous data acquisition module is used for acquiring local synchronous data; the local synchronous data is data obtained after the data task to be synchronized is executed; the data before and after synchronization judging module is used for judging whether the local synchronization data is consistent with the task instance corresponding to the data task to be synchronized and obtaining a first result; the end data module is used for outputting a synchronous error-free notice when the first result shows yes; and the returning module is used for returning to the step of determining the synchronous task group when the first result shows no.

In a practical application, as shown in fig. 3, the middleware-based data synchronization system of this embodiment may be configured as a data synchronization middleware comprising five parts: a scheduling module, a deployment node module, a management back-end module, a page module and a data warehouse platform.

The scheduling module is used for scheduling a start script which is responsible for regularly triggering the deployment node.

The deployment node module is used for checking and starting an extraction program after the Agent node script is started.

The management back-end module is used for distributing the extraction tasks to the active agents and pulling task status in real time so as to adjust the task distribution plan in real time.

The page module is used for displaying, configuring and adjusting.

The data warehouse platform is used for compiling statistics on the raw data collected by the ODS layer, synchronizing the raw data to the collection-processing back end, and triggering a compensation task as the situation requires.
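As an illustration of how the management back-end module might distribute tasks to active agents, the sketch below assigns each task to the least-loaded active agent. The least-loaded rule, the `distribute` function and the agent names are assumptions; the patent states only that tasks are distributed to active agents and that the plan is adjusted from the task status pulled back.

```python
from typing import Dict, List, Set


def distribute(
    task_ids: List[str], agent_load: Dict[str, int], active: Set[str]
) -> Dict[str, str]:
    """Assign each task to the active agent with the fewest tasks so far.

    Ties are broken by agent name so that the plan is deterministic.
    """
    load = {agent: n for agent, n in agent_load.items() if agent in active}
    if not load:
        raise ValueError("no active agent available")
    plan: Dict[str, str] = {}
    for task_id in task_ids:
        agent = min(load, key=lambda name: (load[name], name))
        plan[task_id] = agent
        load[agent] += 1   # reflect the new assignment before the next pick
    return plan


# agent-3 is inactive and therefore never receives a task.
current_load = {"agent-1": 2, "agent-2": 0, "agent-3": 1}
plan = distribute(["t1", "t2", "t3"], current_load, active={"agent-1", "agent-2"})
```

Re-running `distribute` with freshly pulled load figures is one simple way to model the real-time adjustment of the distribution plan.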

During operation of the data synchronization middleware, the scheduling module outputs task trigger information, and the deployment node module performs a state check on the corresponding tasks according to the task trigger information. The state check result is either normal or abnormal, and it is reported to the management back-end module and then uploaded to the page module for display. A worker deploys and adjusts each synchronization task through the page module, and the task deployment information is sent to the management back-end module. The management back-end module schedules tasks according to the task deployment information and sends the schedule to the deployment node module, and at the same time generates an execution record of the task deployment information so that the synchronization steps can be queried later. Task execution in the management back-end module can also synchronize task state with task execution in the deployment node module, and the result of the state synchronization is displayed on the page module.

Finally, task execution management sends the collected statistics on the synchronization tasks to the quality management function of the page module, which samples the data after the data synchronization is completed and then audits its quality (whether the synchronized data is complete and whether it is correct); when the audit result is that the data synchronization is unqualified, the compensation task in the data warehouse platform is triggered.
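The sampling audit described above can be sketched as a comparison of a random sample of source rows against the synchronized copy. The `audit_sample` function, the dictionary representation of the data and the sample size are hypothetical; the patent does not specify how the sample is drawn or how completeness and correctness are measured.

```python
import random
from typing import Dict, List


def audit_sample(
    source: Dict[int, str],
    target: Dict[int, str],
    sample_size: int,
    rng: random.Random,
) -> List[int]:
    """Return the sampled keys whose target row is missing or different."""
    keys = sorted(source)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return sorted(key for key in sample if target.get(key) != source[key])


source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "x"}           # row 2 is wrong, row 3 is missing

# Sampling all three keys makes the outcome independent of the random order.
failed = audit_sample(source, target, sample_size=3, rng=random.Random(0))
needs_compensation = bool(failed)   # an unqualified audit triggers compensation
```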

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A middleware-based data synchronization method, comprising:

2. The middleware-based data synchronization method according to claim 1, wherein the acquiring data tasks to be synchronized on all task nodes specifically comprises:

acquiring preliminary trigger tasks on all task nodes;

3. The middleware-based data synchronization method of claim 1 further comprising:

4. The middleware based data synchronization method as recited in claim 1, further comprising:

judging whether the local synchronization data are consistent with the task instance corresponding to the data task to be synchronized or not, and obtaining a first result;

5. The middleware-based data synchronization method of claim 1 wherein said task configuration update instruction is determined according to timestamp field type supported by SQLSERVER of the same data source.

6. A middleware-based data synchronization system, comprising:

7. The middleware-based data synchronization system of claim 6, wherein in terms of acquiring the data tasks to be synchronized on all task nodes, the synchronization task determination module specifically comprises:

the abnormal notification sub-module is used for sending out a task abnormal notification when the preliminary trigger task is in a task abnormal state;

8. The middleware-based data synchronization system of claim 6 further comprising:

9. The middleware-based data synchronization system of claim 6 further comprising:

the data before and after synchronization judging module is used for judging whether the local synchronization data are consistent with the task instance corresponding to the data task to be synchronized or not and obtaining a first result;

10. The middleware-based data synchronization system of claim 6 wherein said task configuration update instructions are determined from timestamp field types supported by the same data source SQLSERVER.