CN118410106B - Cross-data source real-time synchronization method based on time line mapping
- Publication number: CN118410106B
- Application number: CN202410850487.6A
- Authority: CN (China)
- Legal status: Active
Abstract
The invention provides a cross-data-source real-time synchronization method based on time line mapping. The method identifies and accesses the different data sources to be synchronized and configures corresponding access rights for each type of data source; analyzes the data type of each data source and marks its key structure information; creates a unified global timeline, monitors all connected data sources to capture and record data change operations, assigns timestamps to the change operations of the different data sources, and maps the timestamps onto the global timeline; schedules data transmission according to the timestamp mapping result and synchronously updates the data to the corresponding target data sources; and detects data conflicts during the synchronization update and resolves them using a predetermined policy. The beneficial effects of the invention are that it solves the ordering and consistency problems of data synchronization in a multi-data-source environment, improves the accuracy and efficiency of data processing, and markedly improves the adaptability and flexibility of the synchronization technique.
Description
Technical Field
The invention belongs to the field of data management, and particularly relates to a cross-data source real-time synchronization method based on time line mapping.
Background
With the rapid development of information technology, databases and data storage systems of many types are widely used across industries, generating massive data resources. Different business scenarios and data processing requirements lead to data being stored in different database systems, such as SQL databases and NoSQL databases. An accompanying problem is that real-time synchronization and consistency maintenance of data between different data sources become particularly important and challenging.
At present, data synchronization technology relies mainly on batch processing or periodic updates, which cannot meet the needs of application scenarios with strict real-time requirements. In financial transactions or online services, for example, data delay or inconsistency may degrade the user experience or lead to erroneous business decisions. How to synchronize data across data sources efficiently, accurately, and in real time while guaranteeing data integrity and consistency has therefore become a technical problem urgently awaiting a solution.
Existing data synchronization solutions often require complex configuration and incur high maintenance costs, which poses a significant challenge for resource-limited organizations. At the same time, data security and privacy protection are key factors that must be considered during synchronization, especially in scenarios involving sensitive information.
Disclosure of Invention
In view of the foregoing, the present invention aims to provide a real-time synchronization method across data sources based on timeline mapping, so as to solve at least one of the above-mentioned technical problems.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a cross-data-source real-time synchronization method based on time line mapping comprises the following steps:
identifying and accessing different data sources to be synchronized, and configuring corresponding access rights for each type of data source;
Analyzing the data type of each type of data source and marking key structure information in the data source;
Creating a unified global time line, monitoring all connected data sources to capture and record data change operations, distributing time stamps for the data change operations of different data sources, and mapping the time stamps into the global time line;
Scheduling data transmission according to the mapping result of the time stamp, and synchronously updating the data to the corresponding target data source;
data conflicts are detected during the synchronization update and resolved using a predetermined policy.
Further, the process of identifying and accessing different data sources includes:
Automatically scanning the network environment to identify available data sources, classifying and recording the identified data sources, executing connection test on each identified data source, and verifying connection parameters of the data sources;
And carrying out a secure connection test through an encryption protection protocol, and recording the connection state and error information of each data source.
Further, corresponding access rights are configured for each type of data source according to the security level and the access requirement of the data source, all the data sources are confirmed to be ready and accessible after the data source is accessed, and an access log and a data source configuration document are generated.
Further, the process of analyzing the data type of each type of data source and marking the key structure information in the data source includes:
acquiring metadata from the data source, and deriving dependencies between data by analyzing the foreign key constraints and referential integrity relationships between tables recorded in the metadata;
and imposing rule restrictions on the data format and data range according to the business logic, and periodically running check queries to monitor whether the data satisfies those restrictions.
Further, the process of assigning time stamps to data change operations of different data sources and mapping to a global timeline includes:
Distributing corresponding time stamps for the data change operation of each data source, converting the time stamps generated by different data sources into a uniform format and a uniform time zone, and calibrating the time stamps by using a network time protocol;
and mapping the data change time of each data source onto the global timeline according to the unified and calibrated time stamps, and organizing all the time stamps using a time tree structure.
Further, the process of monitoring all connected data sources for capture and recording of data change operations includes:
Deploying a monitor on each data source, and capturing specific information of a data change operation by using a database trigger; the specific information comprises a change type, a data snapshot and a time stamp;
The data change capture process is monitored in real time; when a data capture abnormality or error occurs, a corresponding error log is recorded, the error information is classified, and different exception handling is automatically executed according to the error type;
if the automatically executed exception handling cannot resolve the problem, the error information is reported for manual repair.
Further, the process of scheduling data transmission according to the mapping result of the time stamp includes:
assigning different priorities to different data change operations according to actual business needs, and deriving the synchronization frequency of the different data change operations from the timestamp mapping result;
and ordering all the data change operations by priority and synchronization frequency, then performing the synchronization updates in that order.
Further, the process of detecting the data collision in the process of synchronous update includes:
monitoring the synchronous updating process in real time, and detecting potential conflicts through the time stamp, wherein the types of the conflicts comprise updating conflicts and deleting conflicts;
An update conflict occurs when different data sources modify the same data at the same time, and a deletion conflict occurs when different data sources perform different operations on the same piece of data at the same time.
Further, the process of resolving the conflict using the predetermined policy includes:
in the process of recording data change operations, each data source records a corresponding sequence number for the operation: if the current data change operation is the first, the sequence number is set to 1, and the sequence number is then incremented by 1 on each subsequent data change;
when update conflict occurs, the latest data change operation is selected according to the sequence number and the time stamp of the data change operation, and when deletion conflict occurs, merging modification is executed;
and merging or revising the conflict data, and keeping change histories of all relevant data.
Compared with the prior art, the cross-data source real-time synchronization method based on the time line mapping has the following beneficial effects:
by introducing a real-time data capturing and timestamp mapping mechanism, the problems of time sequence and consistency of data synchronization in a multi-data source environment are effectively solved, and the accuracy and efficiency of data processing are improved;
The synchronization method based on the time line mapping provides a brand new solution strategy for real-time data synchronization among heterogeneous data sources, and the adaptability and the flexibility of the synchronization technology are remarkably improved;
The invention is suitable for data-intensive business environments, such as financial services and cloud services, can support real-time decision making and improve business continuity, and has high market application value and wide practical prospect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a real-time synchronization method across data sources based on timeline mapping according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of time stamp configuration and use according to an embodiment of the invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
A cross-data-source real-time synchronization method based on time line mapping comprises the following steps:
step one, identifying and accessing different data sources to be synchronized, and configuring corresponding access rights for each type of data source;
Analyzing the data type of each type of data source and marking key structure information in the data source;
Step three, a unified global time line is created, all connected data sources are monitored to capture and record data change operations, time stamps are distributed to the data change operations of different data sources, and the data change operations are mapped into the global time line;
Step four, scheduling data transmission according to the mapping result of the time stamp, and synchronously updating the data to the corresponding target data source;
And fifthly, detecting data conflict in the synchronous updating process, and resolving the conflict by using a preset strategy.
The specific implementation process of the first step comprises the following steps:
Automatically scanning the network environment, identifying available data sources including databases and API endpoints, identifying the various types of data sources using predefined configuration templates (for example MySQL, MongoDB, REST API, and the like), then classifying and recording the identified data sources in preparation for further in-depth analysis. The predefined configuration templates are sets of preset parameters and rules for automatically identifying and accessing different types of data sources; these templates typically contain the basic configuration information and connection parameters required for each data source type, such as: database type (e.g., MySQL, MongoDB), connection parameters (e.g., IP address, port number, user name, password), security settings (e.g., SSL/TLS configuration), data structure information (e.g., tables, views, stored procedures), and access rights (read-write permission settings). Predefined configuration templates simplify the configuration procedure, standardize data source management, improve access efficiency, enhance security, and facilitate troubleshooting and optimization;
data source connection test and verification: performing a connection test on each identified data source, verifying connection parameters such as IP address, port number, user name and password;
Performing a secure connection test using SSL/TLS encryption to ensure the security of data transmission, recording the connection state and any error information of each data source for troubleshooting and optimization, and analyzing the data structure of each data source, including data tables, views, stored procedures, and the like;
Key data structure information such as key types, indexes, data types, and association relationships is extracted, and the analysis results are stored in a central management system for use in building and optimizing the synchronization strategy. Data access rights are configured according to the security level and access requirements of each data source, and corresponding read-write permissions are configured for system operators and synchronization tasks, ensuring that all data operations conform to enterprise data security policies and regulations.
After all data sources have been accessed, a final confirmation is performed to verify that every data source is ready and accessible, and an access log and a data source configuration document are generated, including data source details, connection parameters, and permission settings; this information serves as baseline data for system maintenance and audit. This step ensures that the system can effectively identify and access the various data sources, lays a solid foundation for real-time data synchronization, and provides the necessary information and tools for subsequent synchronization operations.
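By way of non-limiting illustration only, the configuration templates and connection tests described above could be sketched as follows in Python; the template names, fields, host name, and the simple TCP/TLS reachability check are assumptions added for demonstration and are not part of the claimed method:

```python
import socket
import ssl

# Hypothetical predefined configuration templates: preset parameters per
# data source type (names and fields are illustrative assumptions).
CONFIG_TEMPLATES = {
    "mysql":    {"default_port": 3306,  "use_tls": True},
    "mongodb":  {"default_port": 27017, "use_tls": True},
    "rest_api": {"default_port": 443,   "use_tls": True},
}

def test_connection(source_type, host, port=None, timeout=5.0):
    """Basic reachability / secure-connection test; returns the connection
    state and any error information for the access log."""
    template = CONFIG_TEMPLATES[source_type]
    port = port or template["default_port"]
    result = {"source_type": source_type, "host": host, "port": port,
              "status": "unknown", "error": None}
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        try:
            if template["use_tls"]:
                # Secure connection test: wrap the socket in SSL/TLS.
                ctx = ssl.create_default_context()
                with ctx.wrap_socket(sock, server_hostname=host):
                    result["status"] = "ok (TLS verified)"
            else:
                result["status"] = "ok"
        finally:
            sock.close()
    except OSError as exc:
        # Recorded so failures can be investigated and optimized later.
        result["status"] = "failed"
        result["error"] = str(exc)
    return result

print(test_connection("mysql", "db.example.internal"))
```

A production system would of course use the actual database drivers for the connection test; the sketch only demonstrates the template-driven flow and the logging of connection state and errors.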
The specific implementation process of the second step comprises the following steps:
Connecting to each data source and performing a deep scan of its structure (for example, using basic SQL queries, or Python-based scripts with an SQL query tool) to examine all tables and their metadata; the basic information includes the tables' column names, data types, key constraints (such as primary keys and foreign keys), which columns are indexed and the index types, views, and the like;
for example, for an SQL database, a query such as SELECT * FROM information_schema.tables is executed to obtain a list of all tables;
for a NoSQL database such as MongoDB, db.getCollectionInfos() is used to obtain the structure information of the collections;
Determining data dependencies and associations between different tables by analyzing foreign key constraints and referential integrity, and manually writing data consistency rules according to the business logic and data relationships, for example using regular expressions and custom verification scripts to enforce data format and range restrictions, such as applying the regular expression ^\d{4}-\d{2}-\d{2}$ to date fields to ensure the YYYY-MM-DD format;
Constraints are also placed on data length or value ranges, for example requiring that a social security number field be a 9-digit number. The extracted data model information is stored in a central database for reference and analysis in subsequent steps, particularly when building and optimizing the data synchronization strategy, and check queries are run periodically to monitor data consistency, such as checking whether data falls outside a preset range.
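As a purely illustrative sketch of such consistency rules (the field names are assumptions for demonstration; the regex rules are the ones given above), a periodic check could be scripted as follows:

```python
import re

# Date fields must match YYYY-MM-DD; a social security number field must be
# exactly 9 digits (both rules taken from the examples above).
RULES = {
    "order_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "ssn":        re.compile(r"^\d{9}$"),
}

def check_record(record):
    """Return the list of rule violations for one record (empty if clean)."""
    violations = []
    for field, pattern in RULES.items():
        value = str(record.get(field, ""))
        if not pattern.fullmatch(value):
            violations.append(f"{field}: {value!r} violates {pattern.pattern}")
    return violations

# Periodic check over a sample of rows pulled from the data source:
sample = [{"order_date": "2024-06-28", "ssn": "123456789"},
          {"order_date": "28/06/2024", "ssn": "12345"}]
for row in sample:
    print(check_record(row))   # second row reports two violations
```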
Through this step, the system will fully understand and record the structure and behavior characteristics of all data sources, which is critical to achieving effective data synchronization, and accurate analysis of the data model can guide formulation of synchronization strategies, so as to ensure accuracy and high efficiency of data synchronization.
The data dependencies and associations between the different tables serve the following purposes:
In the data synchronization process, knowing the dependency relationship between tables can ensure that the integrity and consistency of the database are not destroyed when the data is synchronized, and the optimal sequence of the synchronized data can be determined by analyzing the dependency relationship between tables;
For example, if table a relies on table B (foreign keys in table a reference the primary keys in table B), then the data of table B must be synchronized first and then the data of table a in order to ensure the integrity of the foreign key references. In synchronizing a table containing foreign key constraints, it must be ensured that the associated tables are already synchronized to ensure data integrity.
Understanding the association relationships between tables helps to detect and resolve conflicts that may occur during the data synchronization process, for example, if the data of the association tables are updated differently in two different data sources, by knowing these association relationships, conflicts can be detected more accurately and appropriate policies can be adopted for resolution.
During real-time synchronization, a change in data in one table may trigger an update of data in the associated table, e.g., when a change in data in the master table occurs, the associated slave table data may also need to be updated accordingly. Knowing the dependencies between tables helps to handle these changes correctly during the synchronization process.
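As a non-limiting sketch of deriving the synchronization order from foreign-key dependencies (the table names below are illustrative assumptions; in practice the dependency graph comes from the metadata extracted above), a topological sort yields referenced tables before the tables that reference them:

```python
from graphlib import TopologicalSorter

# Foreign-key dependencies: each table maps to the set of tables it references.
fk_deps = {
    "orders":      {"customers", "products"},   # orders has FKs into both
    "order_items": {"orders", "products"},
    "customers":   set(),
    "products":    set(),
}

# static_order() emits predecessors first, which is exactly the order needed
# to preserve referential integrity during synchronization.
sync_order = list(TopologicalSorter(fk_deps).static_order())
print(sync_order)  # e.g. ['customers', 'products', 'orders', 'order_items']
```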
The specific implementation process of the third step comprises the following steps:
Distributing time stamps to the data change of each data source, and mapping the time stamps on a global time line to ensure the time consistency and the sequence accuracy of the data synchronization;
An accurate timestamp is automatically assigned to each data source's data change events (e.g., create, update, delete operations), ensuring that the way timestamps are generated reflects the actual time the change occurred; given time differences between servers, the timestamps may need to be corrected to maintain consistency;
Timestamp normalization: the time stamps generated by different data sources are converted into a uniform time format and time zone, so that the data sources can be aligned correctly on a global time line, and the time of a data source server is calibrated by using a standard time synchronization protocol (such as NTP) to reduce time errors;
Global timeline construction: the data change events of each data source are mapped onto a global timeline according to the time stamps. These timestamps are organized using a data structure such as a timeline or a time tree to facilitate querying and access;
time conflict detection and processing: detecting time stamp conflicts on a global time line, namely updating the same data record by different data sources at the same or close time points, and processing the conflicts according to preset conflict resolution strategies (such as latest write priority, merging change and the like);
data change record: the time stamp and the corresponding data change are recorded in the persistent storage, support is provided for data recovery and history tracking, the integrity and the safety of recording operation are ensured, and the information is protected by using proper encryption and access control measures.
By this step, the consistency in time and the accuracy of the synchronization operation between different data sources can be ensured, which is crucial for achieving efficient and reliable data synchronization, and the timestamp mapping not only provides an explicit time reference, but also helps to solve the problem of time conflict possibly occurring in the data synchronization process.
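By way of illustration only, timestamp normalization and the global timeline could be sketched as follows; the class, source names, and time zones are assumptions added for demonstration:

```python
from bisect import insort
from datetime import datetime, timezone
from itertools import count
from zoneinfo import ZoneInfo

class GlobalTimeline:
    """Minimal global-timeline sketch: per-source timestamps are normalized
    to UTC and kept ordered, so change events from heterogeneous sources can
    be compared and scheduled on a single axis."""

    def __init__(self):
        self._events = []    # sorted list of (utc_ts, seq, source, change)
        self._seq = count()  # tie-breaker for identical timestamps

    def record(self, source, local_ts, source_tz, change):
        # Timestamp normalization: attach the source's time zone, then
        # convert to UTC so all sources align on the global timeline.
        utc_ts = local_ts.replace(tzinfo=ZoneInfo(source_tz)).astimezone(timezone.utc)
        insort(self._events, (utc_ts, next(self._seq), source, change))

    def between(self, start, end):
        """All change events with start <= t < end, in global time order."""
        return [e for e in self._events if start <= e[0] < end]

tl = GlobalTimeline()
tl.record("mysql_cn", datetime(2024, 6, 28, 9, 0, 0), "Asia/Shanghai",
          {"op": "update", "id": 42})
tl.record("pg_eu", datetime(2024, 6, 28, 3, 0, 30), "Europe/Berlin",
          {"op": "delete", "id": 42})
print([(ts.isoformat(), src) for ts, _, src, _ in tl._events])
```

A sorted list stands in here for the timeline or time-tree structure mentioned above; an interval tree or similar index would serve the same role at scale.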
Monitoring all connected data sources in real time, capturing and recording every data change event and its timestamp, ensures that the synchronization system can respond to updates in the data sources as they happen;
establishing a data monitoring mechanism: a monitor (listener) is deployed on each data source to watch for data change events such as insert, modify and delete operations; the monitor captures data changes in real time through database triggers or an API monitoring function;
Timestamp record: for each captured data change event, automatically generating a time stamp through a database system, ensuring the accuracy of the time stamp, and reflecting the specific time when the data change occurs;
Data change detail record: the detailed information of each captured change event is recorded, including the change type, the data snapshots before and after the change, and the associated timestamp; this information is stored to support subsequent synchronization and analysis, with the change records kept in a cache or temporary database and processed when data synchronization is scheduled;
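As an illustrative sketch of buffering captured changes for scheduled synchronization (the trigger-populated audit table, its column names, and the MySQL-style %s placeholders are assumptions for demonstration; the patent itself leaves the capture mechanism to triggers or API hooks):

```python
import queue
import time
from datetime import datetime

change_queue: "queue.Queue[dict]" = queue.Queue()   # cache for scheduled sync

def poll_changes(cursor, table, last_seen):
    """Fetch change events recorded since the previous scan, assuming a
    trigger-populated audit table '<table>_changes' with op_type, snapshot,
    and changed_at columns (an illustrative convention)."""
    cursor.execute(
        f"SELECT op_type, snapshot, changed_at FROM {table}_changes "
        "WHERE changed_at > %s ORDER BY changed_at",
        (last_seen,))
    for op_type, snapshot, changed_at in cursor.fetchall():
        yield {"change_type": op_type,      # insert / update / delete
               "data_snapshot": snapshot,   # row state captured by the trigger
               "timestamp": changed_at}

def monitor_loop(cursor, table, interval=1.0):
    """Deployable per data source: buffers every captured change event."""
    last_seen = datetime(1970, 1, 1)
    while True:
        for event in poll_changes(cursor, table, last_seen):
            last_seen = max(last_seen, event["timestamp"])
            change_queue.put(event)
        time.sleep(interval)
```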
exception and error handling: in a real-time data capturing system, due to various reasons such as network problems, data format errors, access right problems and the like, data capturing anomalies or errors can be frequently encountered, and if effective processing is not performed, the problems can cause data loss or synchronization errors, so that the reliability of the whole system and the accuracy of the data are affected;
During data capture, all operations are monitored in real time; once an abnormality or error is found, an alarm is triggered immediately, and all captured abnormalities and errors are recorded in detail in the system log, including error type, occurrence time, scope of impact, and probable cause. Captured abnormalities are classified, distinguishing serious errors from general warnings so that different handling measures can be taken, and the system automatically retries operations that failed due to temporary problems (such as network delay or a busy data source);
The number of retries and the retry intervals are configured, for example retrying for the first time 30 seconds after the initial failure and, if that fails, retrying a second time after 1 minute. Erroneous data is isolated to prevent it from affecting the processing of other data, and a data recovery scheme is defined for serious faults, such as using the transaction log to restore the data to its state before the error occurred, ensuring data integrity. After a complex exception or repeated retry failures, an exception report is provided to the technical team for manual intervention and resolution.
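A minimal sketch of this retry-and-quarantine policy, assuming the 30-second and 1-minute schedule above and an illustrative TransientError convention for classifying temporary failures:

```python
import logging
import time

logger = logging.getLogger("capture")

class TransientError(Exception):
    """Marker for temporary failures worth retrying (network delay, busy
    data source); the classification itself is an assumed convention."""

RETRY_DELAYS = (30, 60)   # first retry after 30 s, second after 1 min

def capture_with_retry(operation, quarantine):
    """Run a capture operation; retry transient failures on the schedule
    above and quarantine anything unresolved for manual repair."""
    for attempt, delay in enumerate((0,) + RETRY_DELAYS, start=1):
        if delay:
            time.sleep(delay)
        try:
            return operation()
        except TransientError as exc:
            logger.warning("attempt %d failed: %s", attempt, exc)
        except Exception as exc:
            # Serious error: log it, stop retrying, escalate immediately.
            logger.error("non-retryable error: %s", exc)
            break
    quarantine.append(operation)   # isolated so other data keeps flowing
    return None
```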
Capturing and processing high frequency data changes in real time may have an impact on system performance. High performance data processing frameworks are introduced, such as using message queues (e.g., kafka) to asynchronously process data change events, distribute processing pressure and increase the scalability and response speed of the system.
Through the step, the system can be ensured to accurately capture the change of each data source in real time, and necessary real-time data support is provided for data synchronization. This not only improves the accuracy of data synchronization, but also ensures high availability and reliability of the system.
The specific implementation process of the fourth step comprises the following steps:
According to the timestamp mapping and data capture results, synchronization between data sources is scheduled and executed so that the data in all sources remains consistent and up to date. The data that need synchronizing, the synchronization frequency, and the priority of each synchronization task are evaluated; the main considerations are: determining which data are critical to business operations (for example, in financial services, real-time synchronization of transaction data may matter more than other data types), and the change frequency, that is, how often the data are updated, since frequently changing data may require a higher synchronization frequency to remain real-time and accurate;
According to the importance and change frequency of the data, all pending synchronization tasks are ordered so that the data with the greatest business impact is processed first; the tasks are managed with technical means such as priority queues so that high-priority tasks (such as key business data) execute first, and the priority of synchronization plans and tasks is adjusted dynamically as operating conditions and business demands change. For example, an e-commerce platform may raise the synchronization priority of its product and inventory data during a promotion so as to respond to rapid shifts in market demand.
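A minimal priority-queue sketch of this scheduling (the class and the priority values are illustrative assumptions; lower numbers run first, with a counter preserving timestamp/FIFO order among equal priorities):

```python
import heapq
import itertools

class SyncScheduler:
    """Priority-queue scheduler: high-priority sync tasks run first; equal
    priorities fall back to submission order via a monotonic counter."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # stable tie-breaker

    def submit(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def run_next(self):
        if self._heap:
            _priority, _, task = heapq.heappop(self._heap)
            task()

sched = SyncScheduler()
# During a promotion, product/inventory sync could be bumped to priority 0:
sched.submit(0, lambda: print("sync inventory delta"))
sched.submit(5, lambda: print("sync analytics log"))
sched.run_next()   # runs the inventory sync first
```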
And the high-efficiency data transmission protocol (such as HTTP, FTP or special data transmission protocol) is used for data transmission, and the transmitted data is encrypted, so that the safety of the data in the transmission process is ensured. During the data synchronization process, the data conflict is processed by using a conflict resolution strategy defined previously, such as using a time stamp to determine whether the data is old or new, and decide which version to keep;
For certain key data, manual review is used to resolve conflicts, ensuring data accuracy and consistency. After synchronization completes, data integrity and consistency checks are performed to verify that the data were synchronized correctly; automated scripts spot-check the synchronized data to confirm that nothing was lost or corrupted, and all synchronization operations are logged in detail, including synchronization time, data volume, results, and any anomalies.
In a multi-data-source environment, data synchronization may involve large data transfers, and data formats and structures may be inconsistent; data format conversion and mapping are therefore used to ensure that data from different sources can be correctly matched and converted. Before synchronizing multiple data sources, a unified data model must first be defined; this model should cover the common fields of all data sources and resolve differences in field names, data types, and the like. A centralized metadata repository is created and maintained to store the data model information of each source, including field names, data types, and data lengths, and a unified data definition standard is formulated so that mapping relationships between different sources can be matched and converted in terms of data format and structure. For complex conversion requirements, SQL scripts, or conversion logic written in a scripting language such as Python, may be needed.
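As an illustrative sketch of such field mapping into a unified data model (the source names, field names, and converters below are assumptions added for demonstration; the patent leaves them abstract):

```python
from datetime import datetime

# Per-source field mappings into a hypothetical unified model:
# source field -> (unified field, type converter).
FIELD_MAPS = {
    "mysql_orders": {
        "order_id":   ("id", int),
        "created":    ("created_at", datetime.fromisoformat),
        "amount_cny": ("amount", float),
    },
    "mongo_orders": {
        "_id":        ("id", int),
        "createdAt":  ("created_at", datetime.fromisoformat),
        "amount":     ("amount", float),
    },
}

def to_unified(source, record):
    """Convert a source record into the unified model, resolving differences
    in field names and data types."""
    unified = {}
    for src_field, (dst_field, convert) in FIELD_MAPS[source].items():
        if src_field in record:
            unified[dst_field] = convert(record[src_field])
    return unified

row = {"order_id": "7", "created": "2024-06-28T09:00:00", "amount_cny": "19.9"}
print(to_unified("mysql_orders", row))
# e.g. {'id': 7, 'created_at': datetime.datetime(2024, 6, 28, 9, 0), 'amount': 19.9}
```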
Through the step, the data can be ensured to be accurately synchronized among different data sources, and the real-time updating and consistency of the data are maintained, so that the real-time decision and operation of enterprises are supported, the process not only enhances the efficiency of data synchronization, but also improves the data management capability of the whole system.
The fifth specific implementation process comprises the following steps:
In the process of synchronizing multiple data sources, conflicts caused by data updating inconsistency need to be effectively solved, so that the final consistency of the data among all sources is ensured, and the conflicts in the process of synchronizing the data are monitored and detected;
Conflicts typically occur when multiple data sources attempt to update the same data record, each time a data update is accompanied by a time stamp, when two or more data sources attempt to update the same record, the system compares the respective time stamps, and if the time stamps indicate that the updates are nearly simultaneous, the system recognizes the situation as a potential conflict;
The change history of each data record is tracked by maintaining a version number that increases with every update; if the system detects that updates submitted by two data sources share the same starting version number but produce different final versions, an update conflict exists between them. The system can also detect conflicts by directly comparing data contents;
For example, if updates to the same record provided by two data sources have different values in certain fields, the system treats this as a conflict. In some cases business rules may also be used to determine conflicts; for example, if an update from one data source violates a data integrity rule maintained by another (such as a foreign key constraint), it may be regarded as a conflict. Timestamps and data version numbers are used to determine the data's freshness, facilitating subsequent processing.
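A minimal sketch of version-plus-timestamp conflict detection, with the record layout and the near-simultaneity window chosen as illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Update:
    source: str
    record_id: int
    base_version: int     # version the source read before modifying
    new_version: int
    timestamp: datetime
    fields: dict

def detect_conflict(a, b, window=timedelta(seconds=1)):
    """Flag a potential update conflict: same record, same starting version,
    divergent field values, and near-simultaneous timestamps (the window is
    an illustrative threshold)."""
    return (a.record_id == b.record_id
            and a.base_version == b.base_version
            and a.fields != b.fields
            and abs(a.timestamp - b.timestamp) <= window)
```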
Identifying the type of conflict, such as update conflict, deletion conflict and the like, wherein the update conflict refers to that two data sources modify the same data item at the same time; a deletion conflict refers to one data source being updated and another data source deleting the same data item;
Predefined conflict resolution policy application: conflicts are resolved according to predefined policies; common policies include "last write first" (the latest change overrides older ones) and "merge changes" (combining the changes from both data sources). For complex conflicts, custom resolution policies may need to be defined based on business rules, for example when financial data or critical business data is involved;
The merging modification specifically comprises: the system detects that different data sources have performed delete operations on the same record, and confirms the conflict by comparing the timestamps of the delete operations; the latest delete operation is selected according to the timestamp and its result is retained; for the data sources whose operation was not retained, the deleted record is restored so that every source is consistent with the latest operation; the operation logs in all relevant data sources are updated to record how the conflict was resolved; and a data consistency check may also be performed to ensure that all data sources remain consistent after resolution.
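A minimal sketch of the "last write first" rule, using the timestamp with the per-source sequence number described earlier as a tie-breaker; the record layout is an assumption for demonstration, and losing changes are kept as history per the step below:

```python
def resolve_update_conflict(ops):
    """Keep the change with the newest timestamp; the sequence number breaks
    ties. Losing changes are retained as history for audit."""
    winner = max(ops, key=lambda op: (op["timestamp"], op["seq"]))
    history = [op for op in ops if op is not winner]
    return {"applied": winner, "history": history}

ops = [
    {"source": "mysql", "seq": 12, "timestamp": "2024-06-28T09:00:01", "value": 5},
    {"source": "mongo", "seq": 9,  "timestamp": "2024-06-28T09:00:01", "value": 7},
]
print(resolve_update_conflict(ops)["applied"]["source"])   # -> mysql
```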
Data merge and revision: after the solution strategy is determined, the conflict data are combined or revised to ensure the consistency and the integrity of the data, and the change history records of all relevant data are reserved so as to facilitate future audit or further analysis;
Automation and manual intervention: the system is set up to handle most common conflicts automatically, while complex or high-risk conflicts are resolved through manual intervention.
In environments with high data volume and complex data relationships, conflict detection and resolution can become very complex; efficient algorithms are developed to optimize the detection and handling process, for example using parallel processing and advanced data structures to speed up conflict recognition and resolution, while a robust logging and monitoring system is established to ensure that all operations are traceable and transparent. Through this step, the system can guarantee the accuracy and consistency of data synchronization in a multi-data-source environment and effectively resolve the various data conflicts that may arise during synchronization, supporting business continuity and data reliability. This flow not only improves the efficiency of data management but also enhances the stability and usability of the whole system.
Those of ordinary skill in the art will appreciate that the elements and method steps of each example described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the elements and steps of each example have been described generally in terms of functionality in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and systems may be implemented in other ways. For example, the above-described division of units is merely a logical function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The units may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (6)
1. A cross-data-source real-time synchronization method based on time line mapping, characterized by comprising the following steps:
identifying and accessing different data sources to be synchronized, and configuring corresponding access rights for each type of data source;
Analyzing the data type of each type of data source and marking key structure information in the data source;
Creating a unified global time line, monitoring all connected data sources to capture and record data change operations, distributing time stamps for the data change operations of different data sources, and mapping the time stamps into the global time line;
Scheduling data transmission according to the mapping result of the time stamp, and synchronously updating the data to the corresponding target data source;
Detecting data conflict in the process of synchronous updating, and resolving the conflict by using a preset strategy;
Wherein the process of assigning time stamps to data change operations of different data sources and mapping to a global timeline includes:
Distributing corresponding time stamps for the data change operation of each data source, converting the time stamps generated by different data sources into a uniform format and a uniform time zone, and calibrating the time stamps by using a network time protocol;
Mapping the data change time of each data source onto the global time line according to the unified and calibrated time stamps, and organizing all the time stamps using a time tree structure;
wherein the process of monitoring all connected data sources for capture and recording of data change operations comprises:
Deploying a monitor on each data source, and capturing specific information of a data change operation by using a database trigger; the specific information comprises a change type, a data snapshot and a time stamp;
The data change capture process is monitored in real time; when a data capture abnormality or error occurs, a corresponding error log is recorded, the error information is classified, and different exception handling is automatically executed according to the error type;
if the automatically executed exception handling cannot resolve the problem, the error information is reported for manual repair;
wherein, the process of scheduling data transmission according to the mapping result of the time stamp comprises:
assigning different priorities to different data change operations according to actual business needs, and deriving the synchronization frequency of the different data change operations from the timestamp mapping result;
and ordering all the data change operations by priority and synchronization frequency, then performing the synchronization updates in that order.
2. The timeline mapping-based real-time synchronization method of cross data sources according to claim 1, wherein the process of identifying and accessing different data sources comprises:
Automatically scanning the network environment to identify available data sources, classifying and recording the identified data sources, executing connection test on each identified data source, and verifying connection parameters of the data sources;
And carrying out a secure connection test through an encryption protection protocol, and recording the connection state and error information of each data source.
3. The timeline mapping-based cross-data source real-time synchronization method of claim 1, wherein:
And configuring corresponding access rights for each type of data source according to the security level and the access requirement of the data source, confirming that all the data sources are ready and accessible after the data source is accessed, and generating an access log and a data source configuration document.
4. The method for real-time synchronization across data sources based on timeline mapping according to claim 1, wherein the process of analyzing the data type of each type of data source and marking the key structure information in the data source comprises:
acquiring metadata from the data source, and deriving dependencies between data by analyzing the foreign key constraints and referential integrity relationships between tables recorded in the metadata;
and imposing rule restrictions on the data format and data range according to the business logic, and periodically running check queries to monitor whether the data satisfies those restrictions.
5. The method for real-time synchronization across data sources based on timeline mapping according to claim 1, wherein the process of detecting data collision during synchronization update comprises:
monitoring the synchronous updating process in real time, and detecting potential conflicts through the time stamp, wherein the types of the conflicts comprise updating conflicts and deleting conflicts;
An update conflict occurs when different data sources modify the same data at the same time, and a deletion conflict occurs when different data sources perform different operations on the same piece of data at the same time.
6. The method of real-time synchronization across data sources based on timeline mapping according to claim 5, wherein the process of resolving conflicts using predefined policies comprises:
in the process of recording data change operations, each data source records a corresponding sequence number for the operation: if the current data change operation is the first, the sequence number is set to 1, and the sequence number is then incremented by 1 on each subsequent data change;
when update conflict occurs, the latest data change operation is selected according to the sequence number and the time stamp of the data change operation, and when deletion conflict occurs, merging modification is executed;
and merging or revising the conflict data, and keeping change histories of all relevant data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410850487.6A (CN118410106B) | 2024-06-28 | 2024-06-28 | Cross-data source real-time synchronization method based on time line mapping
Publications (2)
Publication Number | Publication Date
---|---
CN118410106A | 2024-07-30
CN118410106B | 2024-09-20
Family
ID=92032562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202410850487.6A (granted as CN118410106B, active) | Cross-data source real-time synchronization method based on time line mapping | 2024-06-28 | 2024-06-28

Country Status (1)

Country | Link
---|---
CN | CN118410106B (en)
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116502182A (en) * | 2023-06-25 | 2023-07-28 | 金锐同创(北京)科技股份有限公司 | Method, device, equipment and medium for constructing twin platform based on mapping data |
CN117319419A (en) * | 2023-10-11 | 2023-12-29 | 山西省信息产业技术研究院有限公司 | Distributed data synchronization method based on micro-service architecture |
CN117725122A (en) * | 2023-12-25 | 2024-03-19 | 北京全网数商科技股份有限公司 | Order synchronization method of service management platform |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416043A (en) * | 2018-03-14 | 2018-08-17 | 中煤科工集团重庆研究院有限公司 | Multi-platform spatial data fusion and synchronization method |
US11194769B2 (en) * | 2020-04-27 | 2021-12-07 | Richard Banister | System and method for re-synchronizing a portion of or an entire source database and a target database |
CN117874047B (en) * | 2024-01-15 | 2024-08-16 | 广西壮族自治区自然资源信息中心 | Collaborative updating method for distributed spatial database based on crowd source space-time information |
Legal Events
Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| GR01 | Patent grant