WO2023185309A1 - Data synchronization method and system, and computer-readable storage medium - Google Patents

Data synchronization method and system, and computer-readable storage medium

Info

Publication number
WO2023185309A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
target
fingerprint
source data
database
Prior art date
Application number
PCT/CN2023/077058
Other languages
English (en)
French (fr)
Inventor
Tong Liwei (佟立伟)
Original Assignee
BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co., Ltd.
Publication of WO2023185309A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database

Definitions

  • the present disclosure relates to the field of data processing technology, and in particular, to a data synchronization method and system, and a computer-readable storage medium.
  • heterogeneous data source offline synchronization tools are dedicated to achieving stable and efficient data synchronization functions between various heterogeneous data sources including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, etc.
  • relational databases: MySQL, Oracle, etc.
  • HDFS: Hadoop Distributed File System
  • Hive: a Hadoop-based data warehouse
  • ODPS: Open Data Processing Service
  • HBase: a distributed, column-oriented Hadoop database
  • FTP: File Transfer Protocol
  • the heterogeneous data source offline synchronization tool itself is built using the Framework+plugin architecture. It abstracts data source reading and writing into Reader/Writer plug-ins and incorporates them into the entire synchronization framework.
  • the data to be synchronized comes from different data sources, and the data from different data sources may use different types of primary keys (such as string-type uuid, numeric primary keys that are auto-incremented by the database, or primary keys of custom rules).
  • primary keys such as string-type uuid, numeric primary keys that are auto-incremented by the database, or primary keys of custom rules.
  • some source data read by the Reader plug-in do not have a primary key.
  • in that case, a relational database must be used to generate an auto-incrementing primary key, which causes repeated data insertion during incremental data synchronization.
  • because the Reader plug-in and Writer plug-in may run on different nodes, and existing heterogeneous data source offline synchronization tools have no security mechanism, data that has been tampered with during the synchronization transmission cannot be detected.
  • the present disclosure provides a data synchronization method and system, and a computer-readable storage medium to solve the deficiencies of related technologies.
  • a data synchronization method is provided, applied to a data synchronization system, including:
  • generating fingerprint data of the initial source data includes:
  • the initial source data is input into the fingerprint generation model as input data of the fingerprint generation model to obtain the fingerprint data.
  • the preset fingerprint generation model is implemented using at least one of a message digest algorithm, a secure hash algorithm, a message authentication code algorithm, and a key-based message authentication code algorithm.
  • synchronizing the target source data to the target database includes:
  • the target source data is sent to the target database.
  • the method also includes:
  • the target database obtains the primary key in the target source data and matches it against the primary keys of the stored data;
  • when no matching primary key is found, the target source data is inserted into the target database.
  • synchronizing the target source data to the target database includes:
  • the target source data and the second fingerprint data are updated to the target database.
  • the method also includes:
  • Fingerprint data is generated according to the column data of the column combination, and the fingerprint data is synchronously stored in the target database as a secondary key of the target source data.
  • a data synchronization system including a source database, a target database and a data synchronization device;
  • the data synchronization device is used to obtain initial source data from the source database and generate fingerprint data of the initial source data to obtain target source data containing the fingerprint data; the fingerprint data serves as the primary key of the initial source data; and to synchronize the target source data to the target database, so that the target database stores the target source data after the primary key verification passes.
  • the data synchronization device includes a Framework module, a fingerprint data generation module, a reading plug-in and a writing plug-in; the Framework module is connected to the reading plug-in and the writing plug-in respectively,
  • the reading plug-in is used to read the initial source data to be synchronized from the source database
  • the fingerprint data generation module is used to generate fingerprint data of the initial source data and use the fingerprint data as the primary key of the initial source data;
  • the Framework module is used to forward the initial source data and the primary key as target source data to the writing plug-in;
  • the writing plug-in is used to write the target source data to the target database.
  • the fingerprint data generation module is integrated into the reading plug-in and/or Framework module.
  • the fingerprint data generation module is integrated into the writing plug-in and is used to generate verification fingerprint data based on the initial source data in the target source data; the writing plug-in is also used to compare the primary key in the target source data with the verification fingerprint data, and to send the target source data to the target database when they are the same.
  • the write plug-in in the data synchronization device is also used to: obtain first fingerprint data of the data columns excluding the newly added data column and second fingerprint data of the data columns including the newly added data column; match the first fingerprint data against the primary keys of the target database; and, when the primary key of the target source data exists in the target database, update the target source data and the second fingerprint data to the target database.
  • the data synchronization device is also used to obtain historical task information; count the column combinations in the target source data used in the historical task information; generate fingerprint data according to the column data of the column combinations; and synchronously store the fingerprint data in the target database as secondary keys of the target source data.
  • a non-transitory computer-readable storage medium which can implement the above method when an executable computer program in the storage medium is executed by a processor.
  • the initial source data can be obtained from the source database; then, the fingerprint data of the initial source data is generated to obtain the target source data containing the fingerprint data, with the fingerprint data serving as the primary key of the initial source data; then, the target source data is synchronized to the target database, so that the target database stores the target source data after the primary key verification passes.
  • FIG. 1 is a block diagram of a data synchronization system according to an exemplary embodiment.
  • FIG. 2 is a block diagram of another data synchronization system according to an exemplary embodiment.
  • Figure 3 is a flow chart of a data synchronization method according to an exemplary embodiment.
  • Figure 4 is a schematic diagram of an application scenario of a data synchronization system according to an exemplary embodiment.
  • Figures 5 to 8 are schematic diagrams showing the effects of a configuration task according to an exemplary embodiment.
  • Figure 9 is a flow chart of data synchronization according to an exemplary embodiment.
  • FIG. 1 is a block diagram of a data synchronization system according to an exemplary embodiment.
  • the data synchronization system includes a source database, a target database and a data synchronization device, where the data synchronization device is connected to the source database and the target database respectively.
  • the data synchronization device is used to obtain the initial source data from the source database and generate the fingerprint data of the initial source data to obtain the target source data containing the fingerprint data; the fingerprint data is used as the primary key of the initial source data; and to synchronize the target source data to the target database so that the target database stores the target source data after the primary key verification passes.
  • fingerprint data with unique characteristics for the initial source data
  • the source database can be a single database or multiple databases, that is, the number of source databases can be set according to the specific scenario and is not limited here.
  • the target source data can also be synchronized to different target databases. Therefore, the number of target databases can be one or more, and can be set according to specific scenarios, and is not limited here.
  • this disclosure directly uses a source database and a target database as examples to describe the solutions of each embodiment.
  • FIG. 2 is a block diagram of another data synchronization system according to an exemplary embodiment.
  • the data synchronization device in the data synchronization system may include a Framework module, a fingerprint data generation module, a reading plug-in and a writing plug-in.
  • the Framework module is connected to the read plug-in and the write plug-in respectively.
  • the reading plug-in is used to read the initial source data to be synchronized from the source database;
  • the fingerprint data generation module is used to generate the fingerprint data of the initial source data and use the fingerprint data as the primary key of the initial source data;
  • the Framework module is used to forward the initial source data and the primary key to the writing plug-in as the target source data;
  • the writing plug-in is used to write the target source data to the target database.
  • the data synchronization device can achieve the effect of synchronizing data in the source database to the target database without duplication.
  • the fingerprint data can be a message digest.
  • the fingerprint data generation module can include a preset fingerprint generation model.
  • the fingerprint generation model can include, but is not limited to, a message digest algorithm (Message Digest, MD), a secure hash algorithm (Secure Hash Algorithm, SHA) and a message authentication code algorithm (Message Authentication Code, MAC); technicians can choose the appropriate algorithm according to the specific scenario. As long as fingerprint data can be generated, the corresponding solution falls within the scope of protection of this disclosure.
  • the fingerprint generation model is a Hash-based Message Authentication Code (HMAC) algorithm.
  • the HMAC algorithm takes a message M and a key K as input and generates a fixed-length message digest as output.
  • the message M is the initial source data
  • the fixed-length message digest is the fingerprint data.
  • the fingerprint data is a message digest that can prevent tampering during data transmission, and can be compared with the primary key in the target database as a unique identifier to avoid repeated synchronization of data.
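As a hedged sketch of how such an HMAC fingerprint could be computed (the function name, field separator, and key value are illustrative assumptions, not interfaces defined by the patent):

```python
import hashlib
import hmac

def generate_fingerprint(row, key):
    """Serialize a row deterministically and compute its HMAC-SHA256 digest.

    The digest serves both as a unique identifier (primary key) and as
    tamper evidence for the row.
    """
    # The unit separator \x1f keeps ("ab", "c") and ("a", "bc") distinct.
    message = "\x1f".join(str(v) for v in row).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

key = b"shared-secret"  # hypothetical key shared by reader and writer
fp = generate_fingerprint(("Alice", 170, 60, 30), key)
```

Identical rows always yield the same digest, so re-synchronizing a row reproduces the same primary key and the duplicate can be detected; any change to the row or the key changes the digest.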
  • generating fingerprint data from the initial source data can unify the primary key types of existing data, avoid repeated insertion of data, and improve synchronization efficiency.
  • This embodiment only describes the function of the data synchronization device to transmit target source data.
  • it serves as a data transmission channel between the reading plug-in and the writing plug-in, and can also handle buffering, data flow control, concurrent processing, and data processing.
  • the corresponding function can be selected according to the specific scenario. If the target source data can be transmitted normally, the corresponding solution falls within the protection scope of the present disclosure.
  • the fingerprint data generation module is located between the reading plug-in and the Framework module. This is because the fingerprint data generation module can be integrated into the reading plug-in and can also be integrated into the Framework module; that is, the fingerprint data generation module is integrated into the reading plug-in and/or the Framework module, which can be selected according to the specific scenario, and the corresponding solution falls within the protection scope of the present disclosure.
  • the fingerprint data generation module can be integrated into the reading plug-in.
  • Plug-ins for different data sources need to follow the framework's conventions for plug-ins, so that each plug-in completes the common steps of data operations, that is, the segmentation of concurrent tasks and the reading and sending of data.
  • after the task starts, the framework can call the startRead method of the reading plug-in.
  • the startRead method uses the recordSender interface of the framework as a parameter.
  • the startRead method connects to the data source according to the task configuration and reads the data to be synchronized, then uniformly encapsulates the data to be synchronized into the framework's standard record object (Record object) and sends the Record object to the Framework for processing.
  • the fingerprint data generation module can generate fingerprint data for the data to be synchronized.
  • the generated fingerprint data is encapsulated in the Record object as the primary key column, so the reading plug-in can directly obtain the target source data.
  • that is to say, after reading the initial source data, the reading plug-in can generate the fingerprint data of the initial source data; at this point, the reading plug-in directly obtains the target source data and uploads the target source data to the Framework module.
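The reader-side flow described above might look like the following sketch; the record layout, function name, and key are assumptions for illustration, since the patent does not specify these interfaces:

```python
import hashlib
import hmac

def build_record(columns, key):
    """Sketch of the reading plug-in: wrap a row read from the source
    database into a record and attach its fingerprint as the primary-key
    column before handing the record to the framework."""
    message = "\x1f".join(str(v) for v in columns).encode("utf-8")
    fingerprint = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"primary_key": fingerprint, "columns": list(columns)}

# name, height, weight, age -- the four example columns used later on
record = build_record(("Alice", 170, 60, 30), b"shared-secret")
```

Because the fingerprint travels inside the record, the Framework module receives the target source data (initial source data plus primary key) in one unit and does not need to generate anything itself.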
  • the Framework module does not need to generate fingerprint data, which can reduce the amount of data processed by the Framework module.
  • the reading plug-in generates fingerprint data
  • the fingerprint data can be verified in the Framework module or writing plug-in to determine whether the initial source data has been tampered with, which will help improve the security of the data synchronization process.
  • the startRead method of the reading plug-in can be called after the task is started.
  • the startRead method uses the recordSender interface of the framework as a parameter. This method connects to the data source and reads the data according to the configuration, and then encapsulates the data into the framework's unified record object, i.e., the Record object.
  • the fingerprint data generation module can be used as the recordSender interface method. This interface method is called during the processing of the Record object by the Framework module to generate a unified fingerprint for the transmitted data. At this time, the fingerprint data generation module can generate fingerprint data of the initial source data transmitted by different reading plug-ins.
  • the fingerprint data generation module only needs to be deployed in the Framework module, and does not need to be generated in each reading plug-in, which can reduce the workload of the reading plug-in. Moreover, generating fingerprint data of the initial source data in the Framework module can reduce the possibility of data transmission being tampered with in the subsequent process, which is beneficial to improving the security of the data synchronization process.
  • after the task starts, the startWrite method of the writing plug-in is called.
  • the startWrite method uses the recordReceive interface of the framework as a parameter. This method receives the record object (Record object) from the Framework, connects to the data source according to the task configuration, and writes the target source data.
  • the fingerprint data generation module can obtain the initial source data in the target source data, such as the column data in the Record object, and generate verification fingerprint data. It is understandable that the verification fingerprint data in the writing plug-in is generated in the same way as the fingerprint data in the reading plug-in.
  • the writing plug-in can compare the primary key column and the verification fingerprint data in the Record object; when the primary key column and the verification fingerprint data are the same, it determines to send the target source data to the target database; when the primary key column and the verification fingerprint data differ, it determines that the target source data has been tampered with and does not synchronize the target source data.
  • this embodiment can perform security verification on the target source data by integrating the fingerprint data generation module in the writing plug-in, thereby improving the security of the data synchronization process.
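A minimal sketch of the writer-side check, assuming a hypothetical record layout (fingerprint in a `primary_key` field, row values in `columns`) and a key shared with the reader:

```python
import hashlib
import hmac

def verify_record(record, key):
    """Recompute the fingerprint from the record's column data and compare
    it with the primary-key column carried in the record; a mismatch means
    the row was altered in transit and must not be written."""
    message = "\x1f".join(str(v) for v in record["columns"]).encode("utf-8")
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(expected, record["primary_key"])

key = b"shared-secret"
msg = "\x1f".join(str(v) for v in ["Alice", 170]).encode("utf-8")
good = {"primary_key": hmac.new(key, msg, hashlib.sha256).hexdigest(),
        "columns": ["Alice", 170]}
# Same primary key but altered column data, as if modified in transit.
tampered = {"primary_key": good["primary_key"], "columns": ["Alice", 999]}
```

An attacker who changes the column data cannot produce a matching digest without the key, which is why a keyed algorithm such as HMAC (rather than a plain hash) is useful here.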
  • while the fingerprint data generation module is integrated into the reading plug-in, it can also be integrated into the Framework module and/or the writing plug-in, and the latter integrated fingerprint data generation module performs security verification on the target source data.
  • the fingerprint data generation module is integrated in the reading plug-in and/or Framework module, and the fingerprint data generation module is integrated in the writing plug-in, and the fingerprint data generation module in the writing plug-in performs security verification on the target source data.
  • when the fingerprint data generation module is deployed in the reading plug-in, it can give developers greater flexibility, such as specifying which data is used as input for generating fingerprints, that is, generating the primary key from specified data columns only.
  • the writing plug-in also needs to be consistent with the fingerprint generation implementation of the reading plug-in.
  • developers cannot decide the data range for generating fingerprints.
  • Fingerprint generation becomes a fixed and unified process, and all read plug-ins use unified rules to generate fingerprints. That is to say, the fingerprint data generated by the fingerprint data generation module can be used as the primary key or as the verification fingerprint data, and the deployment location can be selected according to specific scenarios. The corresponding solution falls within the protection scope of the present disclosure.
  • the initial source data includes four columns of data: name, height, weight and age.
  • the fingerprint data generation module can generate fingerprint data and write it into the target source data, or perform data validation on the initial source data.
  • the fingerprint data generation module can first generate fingerprint data for the first four columns of data, which is used as the first fingerprint data; then, the fingerprint data generation module can generate fingerprint data from the original source data together with the newly added column, that is, generate fingerprint data for all five columns of data, which is used as the second fingerprint data. The writing plug-in can then use the first fingerprint data for checksum matching during the writing process.
  • the newly added data column is inserted into the row corresponding to the first fingerprint data, and at the same time the second fingerprint data is inserted into that row; alternatively, the data corresponding to the first fingerprint data in the target database can be directly replaced with the target source data.
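The two-fingerprint handling of a newly added column could be sketched like this; the column names and the plain SHA-256 choice are illustrative assumptions:

```python
import hashlib

def fingerprint(values):
    """Digest a sequence of column values with SHA-256."""
    data = "\x1f".join(str(v) for v in values).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

# Row after a fifth column ("blood_type") was added at the source.
row = {"name": "Alice", "height": 170, "weight": 60, "age": 30,
       "blood_type": "A"}
original_cols = ["name", "height", "weight", "age"]

# First fingerprint: original columns only -- matches rows already stored
# in the target database under the old schema.
first_fp = fingerprint([row[c] for c in original_cols])

# Second fingerprint: all columns including the new one -- becomes the
# primary key for the row going forward.
second_fp = fingerprint(row.values())
```

The writer locates the existing row via `first_fp`, adds the new column value, and stores `second_fp` in that row, so later incremental runs keyed on the full schema still deduplicate correctly.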
  • before writing to the target database, the data synchronization system can perform the following preprocessing on the target source data: the data synchronization system can obtain historical task information and count the column combinations in the target source data used by the historical task information. Then, the data synchronization system can group the data columns of the target source data based on these column combinations. For example, the four columns name, height, weight and age can be divided into column combinations such as {name, height}, {name, weight}, {name, age}, {name, age, height}, {name, age, weight} and {name, age, height, weight}. Afterwards, the data synchronization system can generate different fingerprint data based on each column combination.
  • the fingerprint data will be synchronized and stored in the target database as secondary keys of the target source data, which facilitates users who, when creating new tasks, use different data columns and need to insert data into different column combinations in the target database.
  • one preprocessing in this scenario can provide fingerprint data for multiple subsequent data synchronizations, reducing the data processing volume of subsequent synchronization processes and helping to improve data synchronization efficiency.
  • this scenario can be applied to the de-redundancy, simplification and security processing of data in large-scale data. It can provide data support for users to create tasks and reduce the management difficulty of the data synchronization system.
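The preprocessing above — one fingerprint per column combination, stored as a secondary key — might be sketched as follows. The function names are assumptions, and for brevity the sketch enumerates all subsets of the given columns, whereas a real system would enumerate only the combinations observed in historical tasks:

```python
import hashlib
from itertools import combinations

def combination_fingerprints(row, columns):
    """Generate one digest per column combination; each digest can be
    stored as a secondary key so a later task that syncs only a subset of
    columns can still deduplicate against the target database."""
    fps = {}
    for r in range(1, len(columns) + 1):
        for combo in combinations(columns, r):
            data = "\x1f".join(str(row[c]) for c in combo).encode("utf-8")
            fps[combo] = hashlib.sha256(data).hexdigest()
    return fps

row = {"name": "Alice", "height": 170, "weight": 60, "age": 30}
fps = combination_fingerprints(row, ["name", "height", "weight", "age"])
```

For four columns this yields 15 combinations; restricting generation to historically used combinations keeps the secondary-key storage proportional to actual task patterns rather than to the full power set.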
  • the target database can read the primary key from the target source data and compare it with its stored primary keys.
  • the primary key in the target source data already exists in the target database, it means that the target source data has been stored in the target database.
  • the target database can update or replace the target source data.
  • when the primary key does not exist in the target database, the target source data can be inserted into the target database. In this way, this embodiment can achieve the effect of synchronizing the initial source data in the source database to the target database.
  • the initial source data can be obtained from the source database; then, the fingerprint data of the initial source data is generated to obtain the target source data containing the fingerprint data; the fingerprint data serves as the initial source The primary key of the data; then, synchronize the target source data to the target database, so that the target database stores the target source data after the primary key verification passes.
  • the problem of repeated synchronization during the data synchronization process can be solved, and the accuracy and efficiency of data synchronization can be improved.
  • FIG. 3 is a flow chart of a data synchronization method according to an exemplary embodiment.
  • a data synchronization method includes steps 31 to 33.
  • step 31 initial source data is obtained from the source database.
  • the data synchronization device can obtain configuration information, and determine the source database, the source data to be synchronized in the source database, and the target database based on the above configuration information. Then, the read plug-in in the data synchronization device can obtain the initial source data from the source database.
  • step 32 fingerprint data of the initial source data is generated to obtain target source data containing the fingerprint data; the fingerprint data serves as the primary key of the initial source data.
  • the fingerprint data generation module in the data synchronization device can generate fingerprint data of the initial source data.
  • the data synchronization device can call a preset fingerprint generation model, and then input the initial source data into the fingerprint generation model as input data of the fingerprint generation model to obtain fingerprint data.
  • the data synchronization device can use the initial source data and the fingerprint data as the target source data, that is, obtain the target source data including the fingerprint data.
  • the above fingerprint data is used as the primary key of the initial source data. This primary key can be used for subsequent security verification and primary key comparison in the storage process.
  • the fingerprint data generation module in the data synchronization device can be integrated into the reading plug-in and/or the Framework module to generate fingerprint data used as the primary key, and can also be integrated into the Framework module and/or the writing plug-in to generate verification fingerprint data for verification. For specific content, please refer to the data synchronization system solution; it will not be described in detail here.
  • step 33 the target source data is synchronized to the target database, so that the target database stores the target source data after the primary key verification passes.
  • a fingerprint data generation module can be deployed in the data synchronization device.
  • the data synchronization device can generate verification fingerprint data based on the initial source data in the target source data. Then, when the primary key in the target source data is the same as the verification fingerprint data, the data synchronization device can send the target source data to the target database.
  • the target database can obtain the primary key in the target source data and match it against the primary keys of the stored data.
  • when the primary key of the target source data exists in the target database, the target database can update the target source data; when the primary key of the target source data does not exist in the target database, the target database can insert the target source data into the target database.
  • through the update and insert operations, it can be ensured that the data synchronized to the target database is not duplicated.
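The update-or-insert decision can be sketched with a dictionary standing in for the target table; the record layout and the dictionary-as-table representation are illustrative assumptions:

```python
def upsert(target_table, record):
    """Write a record keyed by its fingerprint primary key: update when
    the key already exists, insert otherwise, so repeated synchronization
    runs never produce duplicate rows."""
    pk = record["primary_key"]
    existed = pk in target_table
    target_table[pk] = record["columns"]
    return "updated" if existed else "inserted"

table = {}
upsert(table, {"primary_key": "fp-1", "columns": ["Alice", 170]})
```

In a real relational target this corresponds to a single atomic upsert statement (e.g. `INSERT ... ON DUPLICATE KEY UPDATE` in MySQL) keyed on the fingerprint column.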
  • the data synchronization system includes: a WEB container, a GateWay gateway, a Service container, an Executor container, external data sources and an execution engine. Among them:
  • WEB container includes data source management module, task configuration module, job management module, and system management module.
  • the data source management module is used to configure and manage the connection information of data sources, such as IP address, port, cluster configuration parameters, authentication method and other information. It manages all data sources created by users and provides common search, edit and delete operations; users can also perform connectivity testing and external permission settings on a data source.
  • the task configuration module is used to manage user-configured tasks. Users can combine existing data sources to create exchange tasks, and the created tasks will be mounted under the corresponding project, where scheduled task execution and historical data reruns can be configured.
  • the job management module is used to list all execution jobs under user-related tasks, including job calling time, completion time, execution parameters, execution nodes, and completion status. You can click the detailed log to view specific execution details.
  • the system management module is used to manage users and permissions of the system and manage service execution nodes.
  • the WEB container can provide a visual UI interface for users to interact with, making it convenient for users to configure and manage various aspects of the data synchronization process.
  • the user can interact with the data source management module, causing the data source management module to call the SERVICE service API interface to add, delete, modify or query the source data source and target data source.
  • these data sources can be understood as candidate source data sources and target data sources, that is, the data sources of the reading plug-in and the write targets of the writing plug-in.
  • Users can interact with the task configuration module to specify the source data source of the data to be synchronized and the target data source for writing, and to configure the field correspondence between the data to be synchronized. This task configuration information is saved to the database through the SERVICE service API interface for subsequent use by the read and write plug-ins.
  • Users can interact with the job management module, which can add, delete, modify, query and execute configured tasks. For example, clicking to execute a task will call the SERVICE service API interface, and the SERVICE service will read the task information. After obtaining the task-related configuration, the SERVICE service will call the Exexutor service to actually execute the task. The Exexutor service starts the reading plug-in and writing plug-in based on the task information to complete the synchronization of task data.
  • Users can interact with the system management module, which can manage users, permissions, etc.
  • GateWay gateway communicates with Service container and Exexutor container, including user unified authentication component, user single sign-in component, container permission configuration component, authentication credential generator, credential refresh entrance, and service routing module.
  • the GateWay gateway is used for authentication and routing between various service containers, authentication of external requests and interaction with external authentication servers, and maintains the routing link of the entire platform.
  • As the sole entry point for the SERVICE and Exexutor services, the GateWay gateway uniformly receives requests from the WEB container and completes login verification and distribution of requests.
  • SERVICE container includes data permission management module, execution node monitoring and management module, load balancer, job queue scheduler, RPC calling module, task configuration module, data source management module, and task job information status management module.
  • the SERVICE service provided by the SERVICE container can manage multiple groups of API interfaces related to user usage. It is the entrance for users to create configuration data sources, configure and perform data exchange tasks and other business operations.
  • the SERVICE container receives requests from the front-end WEB container and processes the requests. These processes include addition, deletion, modification, and query operations on the database and scheduling of tasks by calling the Exexutor container service.
  • Exexutor container includes task execution management container, task executor, resource allocation module, task job sub-process, job daemon thread, job log, job information callback interface, job runtime information, etc.
  • the Executor service provided by the Exexutor container is the container that actually performs data exchange tasks. It is used to connect to the execution engine, and it also maintains multiple listening threads for task jobs, monitoring job logs, timeouts and job resource allocation.
  • the Exexutor container receives requests from the WEB container and processes the requests, including querying the task job status, querying the task job log, etc.
  • the Exexutor container also maintains a heartbeat connection with the SERVICE container.
  • when the WEB container starts a task, the SERVICE service calls the Exexutor container service by passing the task information.
  • the Exexutor container service starts the data synchronization execution engine (i.e. the Framework module) according to the task configuration information, and the execution engine starts the read and write plug-ins respectively to run the data synchronization job.
  • the execution engine includes a Hadoop/Hive environment and a heterogeneous data source offline synchronization tool (ie, the data synchronization device in the above embodiment).
  • the heterogeneous data source offline synchronization tool includes data source reading and writing plug-ins (i.e., the read plug-in and the write plug-in in the above embodiment), a data transmission channel (i.e., the Framework module), a checkpoint module, a data post-processor, a channel speed-adjustment strategy group, etc.
  • the execution engine is used to read the initial source data to be synchronized from an external data source and generate fingerprint data to obtain the target source data, which is then written through the write plug-in to the external data source, that is, the target database.
  • the read plug-in and the write plug-in obtain the source database, the target database and the correspondence between source data and target data from the task configuration information to perform data synchronization.
  • This configuration information is configured by the user through the UI interface provided by the WEB container and saved to the database by calling the SERVICE container service; when a task starts, the SERVICE container service reads the task configuration information and passes it to the Exexutor container service, which finally passes the configuration information to the execution engine so that the read and write plug-ins synchronize data according to the configuration.
  • the user can open the interactive interface and enter the data management center in the interactive interface.
  • the data management center includes a data management module.
  • the user can trigger the data management module and configure the source data source (i.e., source database) and target data source (i.e., target database). For example, the user clicks to create a new data source; after the user fills in the data source information and clicks save, a request to add the data source is generated.
  • This request calls the API interface in the Service service to store the data source information in the database.
  • the interface returns a response indicating the data source was added successfully, and the UI interface displays a prompt that the save succeeded.
  • the Service service calls the API interface of the executor service to pass the task information to the Executor service for execution.
  • the Executor service starts the heterogeneous data source offline synchronization tool (i.e., the data synchronization device in Figure 4) to execute the data synchronization process, see Figure 9, including:
  • the engine creates a task container JobContainer and starts it through the task container startup method JobContainer.start().
  • the task container startup method JobContainer.start() sequentially executes the task preprocessing method preHandler(), task initialization method init(), read-write plug-in ready method prepare(), task splitting method split(), task scheduling method schedule(), task post method post(), task post-processing method postHandle() and other methods.
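The calling order above can be sketched as a minimal Python skeleton. This is a placeholder, not the tool's actual implementation: only the sequence of calls mirrors the description, and the method bodies are stand-ins.

```python
class JobContainer:
    """Placeholder skeleton: only the calling order of start() mirrors the
    sequence described above; real method bodies are omitted."""

    def __init__(self):
        self.trace = []

    def start(self):
        # Execute the lifecycle steps in the documented order.
        for step in (self.pre_handler, self.init, self.prepare, self.split,
                     self.schedule, self.post, self.post_handle):
            step()

    def pre_handler(self): self.trace.append("preHandler")
    def init(self):        self.trace.append("init")
    def prepare(self):     self.trace.append("prepare")
    def split(self):       self.trace.append("split")
    def schedule(self):    self.trace.append("schedule")
    def post(self):        self.trace.append("post")
    def post_handle(self): self.trace.append("postHandle")
```

Calling `JobContainer().start()` runs the seven steps in the order listed in the description.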
  • the task initialization method init() initializes the reading plug-in reader and writing plug-in writer according to the task configuration information.
  • the task ready method prepare() implements the class loading of the read-write plug-in by calling the read-write plug-in ready method prepare().
  • the task splitting method split() adjusts the number of channels through the channel number adjustment method adjustChannelNumber(), and simultaneously performs the finest-grained splitting of the reading plug-in reader and the writing plug-in writer.
  • the channel count is determined mainly by rate limits on bytes and records (i.e., the source data to be synchronized, or initial source data). If the user has not set the number of channels, the channel count needs to be calculated in the first step of the task splitting method split().
  • the read-write subtask configuration merging method mergeReaderAndWriterTaskConfigs() inside the task splitting method split() is responsible for merging the relationship between the reading plug-in reader, writing plug-in writer, and transmitter transformer to generate the configuration of the sub-task task.
  • the task scheduling method schedule() distributes the sub-task configurations generated by the task splitting method split() into sub-task group taskGroup objects. The number of sub-task groups taskGroup is obtained by dividing the number of sub-tasks by the number of sub-tasks supported by a single sub-task group taskGroup.
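The division described in the step above can be sketched as follows. The function name and the rounding-up behavior are assumptions: the text only says the two counts are divided, and rounding up is one reasonable choice so that no sub-task is left without a group.

```python
import math


def compute_task_group_count(task_count, tasks_per_group):
    """Number of taskGroup containers: the split sub-task count divided by
    how many sub-tasks one group supports, rounded up so none are dropped."""
    if tasks_per_group <= 0:
        raise ValueError("tasks_per_group must be positive")
    return math.ceil(task_count / tasks_per_group)
```

For example, 10 sub-tasks with a per-group capacity of 4 would yield 3 task groups.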
  • the task scheduling method schedule() executes internally through the scheduler AbstractScheduler's schedule() method, and then executes the start-all-sub-task-groups method startAllTaskGroup() to create the sub-tasks associated with all sub-task group containers TaskGroupContainer.
  • the sub-task group container runner TaskGroupContainerRunner is responsible for running the sub-task group container TaskGroupContainer to execute the assigned sub-tasks.
  • the sub-task group container execution service taskGroupContainerExecutorService starts a fixed thread pool to execute the sub-task group container runner TaskGroupContainerRunner object.
  • the execution method run() of TaskGroupContainerRunner calls the sub-task group container startup method taskGroupContainer.start(), which creates a subtask executor TaskExecutor for each channel and starts the task through taskExecutor.doStart().
  • the recordSender data fingerprint interface method uses the HMAC algorithm to generate the message digest (i.e. fingerprint data) of the record, thereby generating fingerprint data for the source data to be synchronized and obtaining the target source data.
  • the writing plug-in synchronizes the target source data to the target database, and can perform security verification on the above target source data before synchronization.
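A minimal sketch of the two steps above — generating an HMAC fingerprint for a record and re-checking it on the writer side — using Python's standard library. The key value, the column separator, and SHA-256 as the underlying hash are assumptions; the text only specifies that an HMAC message digest is used.

```python
import hashlib
import hmac

SECRET_KEY = b"shared-key"  # assumption: key distribution is out of scope here


def record_fingerprint(columns, key=SECRET_KEY):
    """Join the record's column values and produce a fixed-length HMAC
    digest that serves as the record's primary-key column."""
    message = "\x1f".join(str(c) for c in columns).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_record(columns, claimed_pk, key=SECRET_KEY):
    """Writer-side check: recompute the fingerprint from the column data
    and compare it with the primary key carried in the record."""
    return hmac.compare_digest(record_fingerprint(columns, key), claimed_pk)
```

For a row `["Alice", "170", "60", "30"]`, `verify_record` returns `True` for the untampered row and `False` once any column value changes, which is the tamper check performed before synchronization.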
  • the present disclosure adds an interface method for generating data fingerprints to the heterogeneous data source offline synchronization tool. Before the read plug-in finishes processing a data record and sends it to the write plug-in, this interface method is called to generate a digest string (the message digest or fingerprint data in the above embodiment) for the record.
  • This digest string serves as the record's identifier and verification value, so that data synchronized to the target database carries a unified data identifier, avoiding the problem of inconsistent data identifiers across different data sources; moreover, it prevents data tampering and avoids repeated insertion of data.
  • a computer-readable storage medium is also provided, for example, a memory including an executable computer program.
  • the executable computer program stored in the memory can be executed by a processor to implement the method of the embodiment shown in FIG. 3.
  • the readable storage medium can be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.


Abstract

The present disclosure relates to a data synchronization method and system, and a computer-readable storage medium. The method includes: obtaining initial source data from a source database; generating fingerprint data of the initial source data to obtain target source data containing the fingerprint data, the fingerprint data serving as the primary key of the initial source data; and synchronizing the target source data to a target database, so that the target database stores the target source data after the primary-key check passes. In this embodiment, generating fingerprint data with a uniqueness characteristic for the initial source data can solve the problem of repeated synchronization during data synchronization and improve the accuracy and efficiency of data synchronization.

Description

Data synchronization method and system, computer-readable storage medium
Technical Field
The present disclosure relates to the technical field of data processing, and in particular to a data synchronization method and system and a computer-readable storage medium.
Background
Existing heterogeneous-data-source offline synchronization tools are dedicated to achieving stable and efficient data synchronization between various heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, etc. As an offline data synchronization framework, such a tool is built on a Framework + plugin architecture, abstracting data source reading and writing into Reader/Writer plug-ins that are incorporated into the overall synchronization framework.
Considering that the data to be synchronized comes from different data sources, data from different sources may use different types of primary keys (e.g., a string uuid, a database auto-increment numeric primary key, or a primary key with custom rules). In practice, some source data read by Reader plug-ins has no primary key, and a relational database must then generate an auto-increment primary key, which causes duplicate insertions during incremental data synchronization. In addition, since the Reader and Writer plug-ins may reside on different nodes, and existing heterogeneous-data-source offline synchronization tools have no security mechanism, data tampered with during synchronization and transmission cannot be detected.
Summary
The present disclosure provides a data synchronization method and system and a computer-readable storage medium to address the deficiencies of the related art.
According to a first aspect of the embodiments of the present disclosure, a data synchronization method is provided, applied to a data synchronization system, including:
obtaining initial source data from a source database;
generating fingerprint data of the initial source data to obtain target source data containing the fingerprint data; the fingerprint data serves as the primary key of the initial source data;
synchronizing the target source data to a target database, so that the target database stores the target source data after the primary-key check passes.
Optionally, generating the fingerprint data of the initial source data includes:
invoking a preset fingerprint generation model;
inputting the initial source data into the fingerprint generation model as its input data to obtain the fingerprint data.
Optionally, the preset fingerprint generation model is implemented using at least one of a message digest algorithm, a secure hash algorithm, a message authentication code algorithm, and a key-based message authentication code algorithm.
Optionally, synchronizing the target source data to the target database includes:
generating verification fingerprint data based on the initial source data in the target source data;
when the primary key in the target source data is identical to the verification fingerprint data, sending the target source data to the target database.
Optionally, the method further includes:
the target database obtaining the primary key in the target source data and matching it against the primary keys of stored data;
when the primary key of the target source data exists in the target database, updating the target source data into the target database;
when the primary key of the target source data does not exist in the target database, inserting the target source data into the target database.
Optionally, synchronizing the target source data to the target database includes:
obtaining first fingerprint data of the data columns excluding a newly added data column and second fingerprint data of the data columns including the newly added data column;
matching the first fingerprint data against the primary keys of the target database;
when the primary key of the target source data exists in the target database, updating the target source data and the second fingerprint data into the target database.
Optionally, the method further includes:
obtaining historical task information;
collecting statistics on the column combinations in the target source data used in the historical task information;
generating fingerprint data from the column data of the column combinations, and storing the fingerprint data into the target database as secondary keys of the target source data during synchronization.
According to a second aspect of the embodiments of the present disclosure, a data synchronization system is provided, including a source database, a target database, and a data synchronization device;
the data synchronization device is configured to obtain initial source data from the source database and generate fingerprint data of the initial source data to obtain target source data containing the fingerprint data; the fingerprint data serves as the primary key of the initial source data; and to synchronize the target source data to the target database, so that the target database stores the target source data after the primary-key check passes.
Optionally, the data synchronization device includes a Framework module, a fingerprint data generation module, a read plug-in, and a write plug-in; the Framework module is connected to the read plug-in and the write plug-in respectively,
the read plug-in is configured to read the initial source data to be synchronized from the source database;
the fingerprint data generation module is configured to generate the fingerprint data of the initial source data and use the fingerprint data as the primary key of the initial source data;
the Framework module is configured to forward the initial source data and the primary key to the write plug-in as target source data;
the write plug-in is configured to write the target source data into the target database.
Optionally, the fingerprint data generation module is integrated into the read plug-in and/or the Framework module.
Optionally, the fingerprint data generation module is integrated into the write plug-in and configured to generate verification fingerprint data based on the initial source data in the target source data; the write plug-in is further configured to compare the primary key in the target source data with the verification fingerprint data, and to send the target source data to the target database when the primary key is identical to the verification fingerprint data.
Optionally, the write plug-in in the data synchronization device is further configured to: obtain first fingerprint data of the data columns excluding a newly added data column and second fingerprint data of the data columns including the newly added data column; match the first fingerprint data against the primary keys of the target database; and when the primary key of the target source data exists in the target database, update the target source data and the second fingerprint data into the target database.
Optionally, the data synchronization device is further configured to obtain historical task information; collect statistics on the column combinations in the target source data used in the historical task information; generate fingerprint data from the column data of the column combinations; and store the fingerprint data into the target database as secondary keys of the target source data during synchronization.
According to a third aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided; when the executable computer program in the storage medium is executed by a processor, the above method can be implemented.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
As can be seen from the above embodiments, in the solution provided by the embodiments of the present disclosure, initial source data can be obtained from a source database; then, fingerprint data of the initial source data is generated to obtain target source data containing the fingerprint data, the fingerprint data serving as the primary key of the initial source data; afterwards, the target source data is synchronized to a target database so that the target database stores the target source data after the primary-key check passes. In this way, generating fingerprint data with a uniqueness characteristic for the initial source data can solve the problem of repeated synchronization during data synchronization and improve the accuracy and efficiency of data synchronization.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The drawings herein are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure, and together with the specification serve to explain the principles of the present disclosure.
FIG. 1 is a block diagram of a data synchronization system according to an exemplary embodiment.
FIG. 2 is a block diagram of another data synchronization system according to an exemplary embodiment.
FIG. 3 is a flowchart of a data synchronization method according to an exemplary embodiment.
FIG. 4 is a schematic diagram of an application scenario of a data synchronization system according to an exemplary embodiment.
FIGS. 5 to 8 are schematic diagrams of the effect of configuring a task according to an exemplary embodiment.
FIG. 9 is a flowchart of data synchronization according to an exemplary embodiment.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses consistent with some aspects of the present disclosure as detailed in the appended claims. It should be noted that, where no conflict arises, the features in the following embodiments and implementations may be combined with one another.
Existing heterogeneous-data-source offline synchronization tools are dedicated to achieving stable and efficient data synchronization between various heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, etc. As an offline data synchronization framework, such a tool is built on a Framework + plugin architecture, abstracting data source reading and writing into Reader/Writer plug-ins that are incorporated into the overall synchronization framework.
Considering that the data to be synchronized comes from different data sources, data from different sources may use different types of primary keys (e.g., a string uuid, a database auto-increment numeric primary key, or a primary key with custom rules). In practice, some source data read by Reader plug-ins has no primary key, and a relational database must then generate an auto-increment primary key, which causes duplicate insertions during incremental data synchronization. In addition, since the Reader and Writer plug-ins may reside on different nodes, and existing heterogeneous-data-source offline synchronization tools have no security mechanism, data tampered with during synchronization and transmission cannot be detected.
To solve the above technical problems, embodiments of the present disclosure provide a data synchronization method and system and a computer-readable storage medium. FIG. 1 is a block diagram of a data synchronization system according to an exemplary embodiment. Referring to FIG. 1, the data synchronization system includes a source database, a target database, and a data synchronization device. The data synchronization device is connected to the source database and the target database respectively, and is configured to obtain initial source data from the source database and generate fingerprint data of the initial source data to obtain target source data containing the fingerprint data; the fingerprint data serves as the primary key of the initial source data; and to synchronize the target source data to the target database so that the target database stores the target source data after the primary-key check passes. In this embodiment, generating fingerprint data with a uniqueness characteristic for the initial source data can solve the problem of repeated synchronization during data synchronization and improve the accuracy and efficiency of data synchronization.
It should be noted that the source data may come from one source database or multiple source databases; that is, the number of source databases can be set according to the specific scenario and is not limited here. Likewise, the target source data can be synchronized to different target databases, so the number of target databases may be one or more, can be set according to the specific scenario, and is not limited here. For ease of description, the present disclosure describes the solutions of the embodiments using one source database and one target database as an example.
In this embodiment, the data synchronization device may be implemented using a heterogeneous-data-source offline synchronization tool, i.e., built on a Framework + plugin architecture. FIG. 2 is a block diagram of another data synchronization system according to an exemplary embodiment. Referring to FIG. 2, the data synchronization device in this system may include a Framework module, a fingerprint data generation module, a read plug-in, and a write plug-in. The Framework module is connected to the read plug-in and the write plug-in respectively. The read plug-in is configured to read the initial source data to be synchronized from the source database; the fingerprint data generation module is configured to generate the fingerprint data of the initial source data and use the fingerprint data as the primary key of the initial source data; the Framework module is configured to forward the initial source data and the primary key to the write plug-in as target source data; and the write plug-in is configured to write the target source data into the target database. In this way, the data synchronization device can synchronize the data in the source database to the target database without duplication.
In this embodiment, the fingerprint data may be a message digest. In this case, the fingerprint data generation module may include a preset fingerprint generation model, which may include but is not limited to a message digest algorithm (Message Digest, MD), a secure hash algorithm (Secure Hash Algorithm, SHA), and a message authentication code algorithm (Message Authentication Code, MAC). Technicians can select a suitable algorithm according to the specific scenario; as long as fingerprint data can be generated, the corresponding solution falls within the protection scope of the present disclosure.
In one example, the fingerprint generation model is a key-based message authentication code algorithm (Hash-based Message Authentication Code, HMAC). The HMAC algorithm takes a message M and a key K as input and produces a fixed-length message digest as output, where the message M is the initial source data and the fixed-length message digest is the fingerprint data. Because the fingerprint data is a message digest, it can prevent tampering during data transmission, and it can serve as a unique identifier to be compared against primary keys in the target database, avoiding repeated synchronization of data. In addition, generating fingerprint data for the initial source data in this example unifies the primary-key types of existing data, which avoids duplicate insertion of data and improves synchronization efficiency.
This embodiment only describes the function of the data synchronization device in transmitting the target source data. In practice, as the data transmission channel between the read plug-in and the write plug-in, the device may also handle buffering, data flow control, concurrency, data conversion, and other functions, which can be selected according to the specific scenario; as long as the target source data can be transmitted normally, the corresponding solution falls within the protection scope of the present disclosure.
It should be noted that in the data synchronization system shown in FIG. 2, the fingerprint data generation module is located between the read plug-in and the Framework module. This is because the fingerprint data generation module can be integrated into the read plug-in or into the Framework module, i.e., the fingerprint data generation module is integrated into the read plug-in and/or the Framework module; the location can be chosen according to the specific scenario, and the corresponding solution falls within the protection scope of the present disclosure.
In one embodiment, considering that read plug-ins are usually deployed for different source databases, the fingerprint data generation module may be integrated into the read plug-in. Plug-ins for different data sources must all follow the framework's plug-in contract so that each plug-in performs the common steps of data operations: splitting concurrent tasks and reading and sending data. In this embodiment, after a task starts, the read plug-in's startRead method is called with the framework's recordSender interface as a parameter. This startRead method connects to the data source according to the task configuration, reads the data to be synchronized, uniformly wraps it into the framework's standard Record objects, and sends the Record objects to the Framework for processing. While the read plug-in wraps the read data into Record objects, the fingerprint data generation module can generate fingerprint data for the data to be synchronized, and the generated fingerprint data is packaged into the Record object as a primary-key column; that is, the read plug-in can directly obtain the target source data. In other words, after reading the initial source data, the read plug-in can immediately generate its fingerprint data; at this point the read plug-in directly obtains the target source data and uploads it to the Framework module. In this way, the Framework module does not need to generate fingerprint data, which reduces the amount of data it has to process. Moreover, when fingerprints are generated in the read plug-in, the fingerprint data can be verified in the Framework module or in the write plug-in to determine whether the initial source data has been tampered with, which helps improve the security of the data synchronization process.
In another embodiment, after a task starts, the read plug-in's startRead method is called with the framework's recordSender interface as a parameter; this method connects to the data source according to the configuration, reads the data, and wraps it into the framework's unified Record objects. The fingerprint data generation module can be implemented as a recordSender interface method that is invoked while the Framework module processes Record objects, generating a unified fingerprint for the transmitted data. In this case, the fingerprint data generation module can generate fingerprint data for initial source data transmitted by different read plug-ins. Thus, the fingerprint data generation module only needs to be deployed in the Framework module rather than in each read plug-in, which simplifies the read plug-ins' workload. Furthermore, generating the fingerprint data of the initial source data inside the Framework module reduces the possibility of tampering during subsequent data transmission, which helps improve the security of the data synchronization process.
In one embodiment, after a task starts, the write plug-in's startWrite method is called with the framework's recordReceive interface as a parameter. This method receives Record objects from the Framework, connects to the data source according to the task configuration, and writes the target source data. At this point the fingerprint data generation module can obtain the initial source data in the target source data, e.g., the column data in the Record object, and generate verification fingerprint data. It should be understood that the verification fingerprint data in the write plug-in is generated in the same way as the fingerprint data in the read plug-in. The write plug-in can then compare the primary-key column in the Record object with the verification fingerprint data: when they are identical, it determines that the target source data should be sent to the target database; when they differ, it determines that the target source data has been tampered with and need not be synchronized. In this way, by integrating the fingerprint data generation module into the write plug-in, this embodiment can perform security verification on the target source data, improving the security of the data synchronization process.
It should be noted that, combining the functions and deployment locations of the fingerprint data generation module described above: when the module is integrated into the read plug-in, a fingerprint data generation module may also be integrated into the Framework module and/or the write plug-in, with the latter performing security verification on the target source data. Alternatively, the module may be integrated into the read plug-in and/or the Framework module, and also into the write plug-in, with the module in the write plug-in performing the security verification. When the fingerprint data generation module is deployed in the read plug-in, developers have greater flexibility, e.g., specifying which data serves as the input for fingerprint generation, i.e., using only the specified data columns to generate the primary key and verifying only those columns; the corresponding write plug-in must then keep its fingerprint generation consistent with the read plug-in's. When the module is deployed in the Framework module, developers cannot decide the range of data used for fingerprints; fingerprint generation becomes a fixed, unified process, and all read plug-ins use the same rules. In other words, the fingerprint data generated by the module can serve as the primary key or as verification fingerprint data, and the deployment location can be chosen according to the specific scenario; the corresponding solution falls within the protection scope of the present disclosure.
In one scenario, the initial source data includes four columns: name, height, weight, and age. During reading and verification, the fingerprint data generation module can generate fingerprint data, write it into the target source data, or verify the initial source data. In another scenario, a business needs to add a "gender" column. To match overlapping data, the fingerprint data generation module can first generate fingerprint data over the original four columns, which serves as the first fingerprint data; then it can generate fingerprint data over the original plus the newly added column, i.e., over all five columns, which serves as the second fingerprint data. During writing, the write plug-in can use the first fingerprint data for verification and matching: when the same fingerprint exists in the target database, it inserts the newly added data column into the row corresponding to the first fingerprint data and writes the second fingerprint data into that row; alternatively, it can simply replace the data corresponding to the first fingerprint data in the target database with the target source data.
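The two-fingerprint idea in this scenario can be sketched as follows. The key, the column separator, and SHA-256 as the underlying hash are assumptions, and the column values are the illustrative name/height/weight/age example from the text.

```python
import hashlib
import hmac


def fingerprint(columns, key=b"shared-key"):
    """Fixed-length HMAC digest over the given column values."""
    message = "\x1f".join(str(c) for c in columns).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# name, height, weight, age — the four columns already in the target database
old_columns = ["Alice", "170", "60", "30"]
# a newly added "gender" column extends the record to five columns
new_columns = old_columns + ["female"]

first_fp = fingerprint(old_columns)   # matches the key already stored in the target
second_fp = fingerprint(new_columns)  # becomes the row's key once the new column lands
```

The writer would look rows up by `first_fp`, then store the extended row together with `second_fp`.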
In another scenario, before writing to the target database, the data synchronization system can preprocess the target source data as follows: the system obtains historical task information and collects statistics on the column combinations of the target source data used in the historical tasks. The system can then group the data columns of the target source data based on these column combinations; for example, the four columns name, height, weight, and age can be divided into combinations such as {name, height}, {name, weight}, {name, age}, {name, age, height}, {name, age, weight}, {name, age, height, weight}. Afterwards, the system can generate different fingerprint data for each column combination, and these fingerprints are stored into the target database as secondary keys of the target source data during synchronization, which serves users' needs to use different data columns when creating new tasks and to insert data into different column combinations in the target database. In this scenario, a single preprocessing pass provides fingerprint data for many subsequent synchronizations, reducing the data processing load of later synchronization and helping to improve synchronization efficiency. In addition, this scenario can be applied to de-duplication, streamlining, and security processing of large-scale data, providing data support for users creating tasks and reducing the management difficulty of the data synchronization system.
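A sketch of generating per-combination secondary-key fingerprints. The text derives the combinations from historical-task statistics; this illustration simply enumerates every non-empty column combination instead, and the key, separator, and hash choice are assumptions.

```python
import hashlib
import hmac
from itertools import combinations


def fingerprint(values, key=b"shared-key"):
    """Fixed-length HMAC digest over the given values."""
    message = "\x1f".join(str(v) for v in values).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def secondary_keys(row, key=b"shared-key"):
    """Fingerprint each column combination of the row; the digests can be
    stored alongside the record as secondary keys in the target database."""
    cols = sorted(row)  # fixed column order keeps the digests deterministic
    keys = {}
    for r in range(1, len(cols) + 1):
        for combo in combinations(cols, r):
            keys[combo] = fingerprint([row[c] for c in combo], key)
    return keys
```

For the four-column example row this yields 2^4 − 1 = 15 secondary keys, one per non-empty column combination.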
In this embodiment, after the write plug-in writes the target source data to the target database, the target database can read the primary key from the written target source data and compare it with its stored primary keys. When the primary key in the target source data already exists in the target database, the target source data has already been stored there, and the target database can update or replace it. When the primary key in the target source data does not exist in the target database, the target database can insert the target source data. In this way, this embodiment achieves the effect of synchronizing the initial source data in the source database to the target database.
Thus, in the solution provided by the embodiments of the present disclosure, initial source data can be obtained from a source database; then, fingerprint data of the initial source data is generated to obtain target source data containing the fingerprint data, the fingerprint data serving as the primary key of the initial source data; afterwards, the target source data is synchronized to a target database so that the target database stores the target source data after the primary-key check passes. By generating fingerprint data with a uniqueness characteristic for the initial source data, this embodiment solves the problem of repeated synchronization during data synchronization and improves the accuracy and efficiency of data synchronization.
In conjunction with the data synchronization systems shown in FIGS. 1 and 2, an embodiment of the present disclosure further provides a data synchronization method. FIG. 3 is a flowchart of a data synchronization method according to an exemplary embodiment. Referring to FIG. 3, the data synchronization method includes steps 31 to 33.
In step 31, initial source data is obtained from a source database.
In this embodiment, the data synchronization device can obtain configuration information and, based on it, determine the source database, the source data to be synchronized in the source database, and the target database. The read plug-in in the data synchronization device can then obtain the initial source data from the source database.
In step 32, fingerprint data of the initial source data is generated to obtain target source data containing the fingerprint data; the fingerprint data serves as the primary key of the initial source data.
In this embodiment, the fingerprint data generation module in the data synchronization device can generate the fingerprint data of the initial source data. For example, the device can invoke a preset fingerprint generation model and input the initial source data into it as input data to obtain the fingerprint data. The device can then treat the initial source data together with the fingerprint data as the target source data, i.e., obtain target source data containing the fingerprint data. In this example, the fingerprint data is used as the primary key of the initial source data; this primary key can be used for subsequent security verification and for primary-key comparison during storage.
It should be understood that the fingerprint data generation module may be integrated into the read plug-in or the Framework module to generate the fingerprint data used as the primary key, or integrated into the Framework module or the write plug-in to generate the verification fingerprint data used for checking; for details, refer to the description of the data synchronization system, which is not repeated here.
In step 33, the target source data is synchronized to a target database, so that the target database stores the target source data after the primary-key check passes.
In this embodiment, a fingerprint data generation module can be deployed in the data synchronization device; the device can then generate verification fingerprint data based on the initial source data in the target source data. When the primary key in the target source data is identical to the verification fingerprint data, the device can send the target source data to the target database.
In this embodiment, the target database can obtain the primary key in the target source data and match it against the primary keys of stored data. When the primary key of the target source data exists in the target database, the target database can update the target source data into itself; when the primary key of the target source data does not exist in the target database, the target database can insert the target source data. Through these update and insert operations, this embodiment ensures that data synchronized to the target database is not duplicated.
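The update-if-present / insert-if-absent behavior described above can be sketched with SQLite's upsert; the table layout and column names are illustrative assumptions, with the fingerprint stored as the primary-key column.

```python
import sqlite3


def sync_record(conn, pk, name, age):
    """Insert the record when its fingerprint primary key is absent from the
    target table; otherwise update the existing row in place."""
    conn.execute(
        "INSERT INTO target (pk, name, age) VALUES (?, ?, ?) "
        "ON CONFLICT(pk) DO UPDATE SET name = excluded.name, age = excluded.age",
        (pk, name, age),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (pk TEXT PRIMARY KEY, name TEXT, age INTEGER)")
sync_record(conn, "fp-1", "Alice", 30)
sync_record(conn, "fp-1", "Alice", 31)  # same fingerprint: row is updated, not duplicated
sync_record(conn, "fp-2", "Bob", 25)    # new fingerprint: row is inserted
```

Re-synchronizing a record with an existing fingerprint leaves a single row for it, which is exactly the no-duplicates guarantee the fingerprint primary key provides.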
In this embodiment, synchronizing the target source data to the target database includes:
obtaining first fingerprint data of the data columns excluding a newly added data column and second fingerprint data of the data columns including the newly added data column;
matching the first fingerprint data against the primary keys of the target database;
when the primary key of the target source data exists in the target database, updating the target source data and the second fingerprint data into the target database.
In this embodiment, the method further includes:
obtaining historical task information;
collecting statistics on the column combinations in the target source data used in the historical task information;
generating fingerprint data from the column data of the column combinations, and storing the fingerprint data into the target database as secondary keys of the target source data during synchronization.
It should be noted that the method shown in this embodiment matches the content of the system embodiment; refer to the system embodiment above for details, which are not repeated here.
The working principle of the data synchronization system is described below with reference to an embodiment. Referring to FIG. 4, the data synchronization system includes: a WEB container, a GateWay gateway, a Service container, an Exexutor container, external data sources, and an execution engine. Among them,
the WEB container includes a data source management module, a task configuration module, a job management module, and a system management module. The data source management module is used to configure and manage the connection information of data sources, such as IP address, port, cluster configuration parameters, and authentication method; it manages all data sources created by users, provides common search, edit, and delete operations, and supports connectivity testing and external permission settings for data sources. The task configuration module is used to manage user-configured tasks; users can combine existing data sources to create exchange tasks, a created task is mounted under the corresponding project, and scheduled execution and historical data reruns can be configured. The job management module is used to list all execution jobs under a user's tasks, including job invocation time, completion time, execution parameters, execution node, and completion status; the detailed log can be clicked to view specific execution details. The system management module is used to manage the system's users and permissions and to manage service execution nodes.
The WEB container provides a visual UI for user interaction, making it convenient for users to configure and manage each stage of the data synchronization process. For example, a user can interact with the data source management module, which calls the SERVICE service API to add, delete, modify, or query source and target data sources; these data sources can be understood as candidate source and target data sources, i.e., the data sources of the read plug-in and the write targets of the write plug-in. A user can interact with the task configuration module to specify the source data source of the data to be synchronized and the target data source for writing, and to configure the field correspondence between the data to be synchronized; this task configuration information is saved to the database through the SERVICE service API for later use by the read and write plug-ins. A user can interact with the job management module, which can add, delete, modify, query, and execute configured tasks; for example, clicking to execute a task calls the SERVICE service API, the SERVICE service reads the task information to obtain the task-related configuration and then calls the Exexutor service to actually execute the task, and the Exexutor service starts the read and write plug-ins according to the task information to complete the synchronization of the task data. A user can interact with the system management module, which manages users, permissions, and so on.
The GateWay gateway communicates with the Service container and the Exexutor container, and includes a unified user authentication component, a user single sign-on component, a container permission configuration component, an authentication credential generator, a credential refresh entry, and a service routing module. The gateway handles authentication and routing between the service containers, authentication of external requests and interaction with external authentication servers, and maintains the routing links of the entire platform. As the sole entry point for the SERVICE and Exexutor services, the GateWay gateway uniformly receives requests from the WEB container and completes login verification and distribution of the requests.
The SERVICE container includes a data permission management module, an execution node monitoring and management module, a load balancer, a job queue scheduler, an RPC calling module, a task configuration module, a data source management module, and a task/job information status management module. The SERVICE service provided by this container manages multiple groups of user-facing API interfaces and is the entry point for business operations such as creating and configuring data sources and configuring and executing data exchange tasks. The SERVICE container receives requests from the front-end WEB container and processes them, including add, delete, modify, and query operations on the database and task scheduling by calling the Exexutor container service.
The Exexutor container includes a task execution management container, task executors, a resource allocation module, task job sub-processes, job daemon threads, job logs, a job information callback interface, job runtime information, etc. The Executor service provided by the Exexutor container is the container that actually performs data exchange tasks; it connects to the execution engine and also maintains multiple listening threads for task jobs, monitoring job logs, timeouts, and job resource allocation. The Exexutor container receives requests from the WEB container and processes them, including querying task job status and task job logs. The Exexutor container also maintains a heartbeat connection with the SERVICE container. When the WEB container starts a task, the SERVICE service calls the Exexutor container service by passing the task information; the Exexutor container service starts the data synchronization execution engine (i.e., the Framework module) according to the task configuration information, and the execution engine starts the read and write plug-ins respectively to run the data synchronization job.
The execution engine includes a Hadoop/Hive environment and the heterogeneous-data-source offline synchronization tool (i.e., the data synchronization device in the above embodiments). The tool includes data source read/write plug-ins (i.e., the read and write plug-ins in the above embodiments), a data transmission channel (i.e., the Framework module), a checkpoint module, a data post-processor, a channel speed-adjustment strategy group, etc. The execution engine reads the initial source data to be synchronized from an external data source, generates fingerprint data to obtain the target source data, and then writes it through the write plug-in to an external data source, i.e., the target database. The read and write plug-ins obtain the source and target databases and the correspondence between source and target data from the task configuration information to perform data synchronization; this configuration information is configured by the user through the UI provided by the WEB container and saved to the database via the SERVICE container service. When a task starts, the SERVICE container service reads the task configuration information and passes it to the Exexutor container service, which finally passes it to the execution engine so that the read/write plug-ins synchronize data according to the configuration.
The working principle of the data synchronization system shown in FIG. 4 is as follows:
(1) When using the data synchronization system, a user can open the interactive interface and enter the data management center, which includes a data management module. The user can trigger this module and configure the source data source (i.e., the source database) and the target data source (i.e., the target database). For example, the user clicks to create a new data source; after filling in the data source information and clicking save, a request to add the data source is generated. The request calls the API interface in the Service service to store the data source information in the database; the interface returns a response indicating the data source was added successfully, and the UI displays a prompt that the save succeeded.
(2) Trigger the task configuration module to enter the data integration management task configuration page. Create a new data synchronization task, fill in basic information such as the task name and the project the task belongs to, and click Next to configure the source data source. Select any type of source data source, e.g., a REST data source; configure the REST API address, request method, and request parameters, and test the availability of the data source connection, as shown in FIG. 5.
(3) Click Next and configure the target data source; select a postgresql data source, as shown in FIG. 6.
(4) Click Next and configure the correspondence between source fields and target fields, mapping the fields of the source data source one-to-one to the fields and types of the target data source, i.e., specifying which column in the target data source each column of the source data corresponds to. The field correspondence must be one-to-one, not one-to-many or many-to-one, and no field column may be left unmatched, as shown in FIG. 7.
(5) Click Next, configure the data synchronization speed, and click Save; the API interface of the Service service is called to save the configured data synchronization task information to the database, as shown in FIG. 8.
(6) Trigger the job management module to enter the task management interface. Find the task just configured and click Execute; the API interface of the Service service is then called to execute the task.
(7) The Service service calls the API interface of the Executor service to pass the task information to the Executor service for execution.
(8) The Executor service starts the heterogeneous-data-source offline synchronization tool (i.e., the data synchronization device in FIG. 4) to execute the data synchronization process; referring to FIG. 9, this includes:
(a) Parse the configuration files, including the task configuration file job.json, the engine configuration file core.json, and the read/write plug-in configuration file plugin.json.
(b) Merge the parsed configuration information into the task configuration information required by the task.
(c) Start the engine through the engine startup method Engine.start() according to the task configuration information.
(d) The engine creates a task container JobContainer and starts it through the task container startup method JobContainer.start(). JobContainer.start() sequentially executes the task preprocessing method preHandler(), the task initialization method init(), the read/write plug-in ready method prepare(), the task splitting method split(), the task scheduling method schedule(), the task post method post(), the task post-processing method postHandle(), and other methods.
(e) The task initialization method init() initializes the read plug-in reader and the write plug-in writer according to the task configuration information.
(f) The task ready method prepare() performs class loading of the read/write plug-ins by calling the plug-ins' ready method prepare().
(g) The task splitting method split() adjusts the number of channels through the channel-count adjustment method adjustChannelNumber(), and simultaneously performs the finest-grained splitting of the read plug-in reader and the write plug-in writer.
(h) The channel count is determined mainly by rate limits on bytes and records (i.e., the source data to be synchronized, or initial source data). If the user has not set the number of channels, the channel count must be calculated in the first step of the task splitting method split().
(i) The read/write sub-task configuration merging method mergeReaderAndWriterTaskConfigs() inside split() merges the relationships among the reader plug-in, the writer plug-in, and the transformer, generating the configuration of the sub-tasks.
(j) The task scheduling method schedule() distributes the sub-task configurations generated by split() into sub-task group taskGroup objects; the number of taskGroups is obtained by dividing the number of sub-tasks by the number of sub-tasks supported by a single taskGroup.
(k) Internally, schedule() executes through the scheduler AbstractScheduler's schedule() method, and then executes the start-all-sub-task-groups method startAllTaskGroup() to create the sub-tasks associated with all sub-task group containers TaskGroupContainer; the sub-task group container runner TaskGroupContainerRunner is responsible for running a TaskGroupContainer to execute its assigned sub-tasks.
(l) The sub-task group container execution service taskGroupContainerExecutorService starts a fixed thread pool to execute the TaskGroupContainerRunner objects; the runner's execution method run() calls the sub-task group container startup method taskGroupContainer.start(), which creates a sub-task executor TaskExecutor for each channel and starts the task through taskExecutor.doStart().
(m) The recordSender data fingerprint interface method uses the HMAC algorithm to generate a message digest (i.e., fingerprint data) for each record, thereby generating fingerprint data for the source data to be synchronized and obtaining the target source data.
(n) The write plug-in synchronizes the target source data to the target database, and can perform security verification on the target source data before synchronization.
In this way, by adding an interface method for generating data fingerprints to the heterogeneous-data-source offline synchronization tool, and calling this method before the read plug-in finishes processing a data record and sends it to the write plug-in, the present disclosure generates a digest string (the message digest or fingerprint data in the above embodiments) for each data record. This digest string serves as the record's identifier and verification value, so that data synchronized to the target database carries a unified data identifier, avoiding the problem of inconsistent data identifiers across different data sources; it also prevents data tampering and avoids repeated insertion of data.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example, a memory storing an executable computer program; the executable computer program can be executed by a processor to implement the method of the embodiment shown in FIG. 3. The readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations that follow the general principles of the present disclosure and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

  1. A data synchronization method, applied to a data synchronization system, comprising:
    obtaining initial source data from a source database;
    generating fingerprint data of the initial source data to obtain target source data containing the fingerprint data; the fingerprint data serves as a primary key of the initial source data;
    synchronizing the target source data to a target database, so that the target database stores the target source data after the primary-key check passes.
  2. The method according to claim 1, wherein generating the fingerprint data of the initial source data comprises:
    invoking a preset fingerprint generation model;
    inputting the initial source data into the fingerprint generation model as input data of the fingerprint generation model to obtain the fingerprint data.
  3. The method according to claim 2, wherein the preset fingerprint generation model is implemented using at least one of a message digest algorithm, a secure hash algorithm, a message authentication code algorithm and a key-based message authentication code algorithm.
  4. The method according to claim 1, wherein synchronizing the target source data to the target database comprises:
    generating verification fingerprint data based on the initial source data in the target source data;
    when the primary key in the target source data is identical to the verification fingerprint data, sending the target source data to the target database.
  5. The method according to claim 4, further comprising:
    the target database obtaining the primary key in the target source data and matching it against primary keys of stored data;
    when the primary key of the target source data exists in the target database, updating the target source data into the target database;
    when the primary key of the target source data does not exist in the target database, inserting the target source data into the target database.
  6. The method according to claim 1, wherein synchronizing the target source data to the target database comprises:
    obtaining first fingerprint data of the data columns excluding a newly added data column and second fingerprint data of the data columns including the newly added data column;
    matching the first fingerprint data against primary keys of the target database;
    when the primary key of the target source data exists in the target database, updating the target source data and the second fingerprint data into the target database.
  7. The method according to claim 1, further comprising:
    obtaining historical task information;
    collecting statistics on the column combinations in the target source data used in the historical task information;
    generating fingerprint data from the column data of the column combinations, and storing the fingerprint data into the target database as a secondary key of the target source data during synchronization.
  8. A data synchronization system, comprising a source database, a target database and a data synchronization device;
    the data synchronization device is configured to obtain initial source data from the source database and generate fingerprint data of the initial source data to obtain target source data containing the fingerprint data; the fingerprint data serves as a primary key of the initial source data; and to synchronize the target source data to the target database, so that the target database stores the target source data after the primary-key check passes.
  9. The system according to claim 8, wherein the data synchronization device comprises a Framework module, a fingerprint data generation module, a read plug-in and a write plug-in; the Framework module is connected to the read plug-in and the write plug-in respectively,
    the read plug-in is configured to read initial source data to be synchronized from the source database;
    the fingerprint data generation module is configured to generate the fingerprint data of the initial source data and use the fingerprint data as the primary key of the initial source data;
    the Framework module is configured to forward the initial source data and the primary key to the write plug-in as target source data;
    the write plug-in is configured to write the target source data into the target database.
  10. The system according to claim 9, wherein the fingerprint data generation module is integrated into the read plug-in and/or the Framework module.
  11. The system according to claim 9, wherein the fingerprint data generation module is integrated into the write plug-in and configured to generate verification fingerprint data based on the initial source data in the target source data; the write plug-in is further configured to compare the primary key in the target source data with the verification fingerprint data, and to send the target source data to the target database when the primary key is identical to the verification fingerprint data.
  12. The system according to claim 8, wherein the write plug-in in the data synchronization device is further configured to: obtain first fingerprint data of the data columns excluding a newly added data column and second fingerprint data of the data columns including the newly added data column; match the first fingerprint data against primary keys of the target database; and when the primary key of the target source data exists in the target database, update the target source data and the second fingerprint data into the target database.
  13. The system according to claim 8, wherein the data synchronization device is further configured to obtain historical task information; collect statistics on the column combinations in the target source data used in the historical task information; generate fingerprint data from the column data of the column combinations; and store the fingerprint data into the target database as a secondary key of the target source data during synchronization.
  14. A non-transitory computer-readable storage medium, wherein, when an executable computer program in the storage medium is executed by a processor, the method according to any one of claims 1 to 7 can be implemented.
PCT/CN2023/077058 2022-03-28 2023-02-20 Data synchronization method and system, computer-readable storage medium WO2023185309A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210316593.7 2022-03-28
CN202210316593.7A CN114722118A (zh) Data synchronization method and system, computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023185309A1 true WO2023185309A1 (zh) 2023-10-05

Family

ID=82239903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/077058 WO2023185309A1 (zh) 2022-03-28 2023-02-20 数据同步方法和系统、计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN114722118A (zh)
WO (1) WO2023185309A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722118A (zh) * 2022-03-28 2022-07-08 京东方科技集团股份有限公司 数据同步方法和系统、计算机可读存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107172112A (zh) * 2016-03-07 2017-09-15 Alibaba Group Holding Limited Computer file transmission method and device
CN108804697A (zh) * 2018-06-15 2018-11-13 Ping An Life Insurance Company of China, Ltd. Spark-based data synchronization method and apparatus, computer device and storage medium
CN110781243A (zh) * 2019-11-06 2020-02-11 Hangzhou Dbappsecurity Technology Co., Ltd. Dual-master incremental data synchronization method and system for relational databases
US10977275B1 (en) * 2018-12-21 2021-04-13 Village Practice. Management Company, Llc System and method for synchronizing distributed databases
CN114722118A (zh) * 2022-03-28 2022-07-08 BOE Technology Group Co., Ltd. Data synchronization method and system, computer-readable storage medium


Also Published As

Publication number Publication date
CN114722118A (zh) 2022-07-08

Similar Documents

Publication Publication Date Title
US20200177373A1 (en) System and method for storing contract data structures on permissioned distributed ledgers
US20190361848A1 (en) Methods and systems for a database
WO2021151316A1 (zh) 数据查询方法、装置、电子设备及存储介质
CN111338766A (zh) 事务处理方法、装置、计算机设备及存储介质
CN111414381B (zh) 数据处理方法、装置、电子设备及存储介质
US20170206208A1 (en) System and method for merging a mainframe data file to a database table for use by a mainframe rehosting platform
KR20220044603A (ko) 블록체인 데이터베이스 관리 시스템
CN111930489B (zh) 一种任务调度方法、装置、设备及存储介质
WO2023185309A1 (zh) 数据同步方法和系统、计算机可读存储介质
CN113407600B (zh) 一种动态实时同步多源大表数据的增强实时计算方法
US11907262B2 (en) System and method for data pruning via dynamic partition management
WO2022061878A1 (en) Blockchain transaction processing systems and methods
Erraissi et al. Meta-modeling of Zookeeper and MapReduce processing
CN113901078A (zh) 业务订单关联查询方法、装置、设备及存储介质
Chung et al. Performance tuning and scaling enterprise blockchain applications
JP7221799B2 (ja) 情報処理システム、及び情報処理システムの制御方法
Rovere et al. A centralized support infrastructure (CSI) to manage CPS digital twin, towards the synchronization between CPS deployed on the shopfloor and their digital representation
US7752225B2 (en) Replication and mapping mechanism for recreating memory durations
CN112559525B (zh) 数据检查系统、方法、装置和服务器
Zarei et al. Past, present and future of Hadoop: A survey
US11567957B2 (en) Incremental addition of data to partitions in database tables
CN112395308A (zh) 一种基于hdfs数据库的数据查询方法
CN115080663A (zh) 一种分布式数据库同步方法及系统及装置及介质
CN113486019A (zh) 自动触发对远程多数据库数据实时批量同步方法和装置
Subbiah et al. Job starvation avoidance with alleviation of data skewness in Big Data infrastructure

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 18559010

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777688

Country of ref document: EP

Kind code of ref document: A1