WO2023101610A2 - Device and method for synchronizing data between data sources - Google Patents

Device and method for synchronizing data between data sources

Info

Publication number
WO2023101610A2
Authority
WO
WIPO (PCT)
Prior art keywords
data source
data
configuration
configuration file
synchronization task
Prior art date
Application number
PCT/SG2022/050875
Other languages
French (fr)
Other versions
WO2023101610A3 (en)
Inventor
Liufeng WANG
Libin ZHOU
Wei Hu
Lei Feng
Renyuan SUN
Original Assignee
Shopee IP Singapore Private Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shopee IP Singapore Private Limited
Publication of WO2023101610A2
Publication of WO2023101610A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • Various aspects of this disclosure relate to devices and methods for synchronizing data between data sources.
  • Various embodiments concern a method for synchronizing data between data sources, comprising presenting a user with a graphical user interface to input values for configuration parameters for a synchronization task between a first data source and a second data source, determining values of the configuration parameters at least partially based on user input to the graphical user interface, generating a synchronization task configuration file by including each determined configuration parameter value into a section of the configuration file assigned to the configuration parameter according to a pre-defined configuration file structure and starting a synchronization task in accordance with the generated configuration file.
  • starting the synchronization task in accordance with the generated configuration file comprises supplying the generated configuration file to a data synchronization framework.
  • the synchronization task comprises a transmission of data from the first data source to the second data source.
  • the values of the configuration parameters include the information that the first data source is an origin data source of the transmission and that the second data source is a destination data source of the transmission.
  • the values of the configuration parameters include information about formatting functions to perform on the data for transmitting data from the first data source to the second data source.
  • the values of the configuration parameters include information about the first data source and the second data source.
  • the information about the first data source includes an identification of the first data source and the information about the second data source includes an identification of the second data source.
  • the synchronization task configuration file comprises a section for parameters related to the first data source, a section for parameters related to the second data source and a general section and generating the synchronization task configuration file comprises classifying the determined configuration parameter values according to whether they relate to the first data source, relate to the second data source or are general configuration parameters and putting each configuration parameter value in the section of the synchronization task configuration file to which it relates.
  • determining values of the configuration parameters comprises presenting possible values of the configuration parameters via the graphical user interface for selection and confirmation.
  • determining values of the configuration parameters comprises receiving an original synchronization task configuration file and parsing the original synchronization task configuration file to determine possible values of the configuration parameters.
  • generating the synchronization task configuration file includes keeping configuration parameter values of the original configuration file which have been confirmed by user input to the graphical user interface and updating configuration parameter values which have been changed by user input to the graphical user interface.
  • the pre-defined configuration file structure is a FlinkX configuration file structure.
  • the method comprises performing the synchronization task using FlinkX.
  • the pre-defined configuration file structure is a JSON format.
  • a data synchronization device comprising a communication interface, a memory interface and a processing unit configured to perform the method for synchronizing data between data sources as described above.
  • a computer program element comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method for synchronizing data between data sources described above.
  • a computer-readable medium comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method for synchronizing data between data sources described above.
  • FIG. 1 shows a data distribution arrangement.
  • FIG. 2 shows excerpts of an exemplary configuration file.
  • FIG. 3 shows a flow diagram for the generation of a configuration file for a synchronization task.
  • FIG. 4 illustrates a configuration system.
  • FIG. 5 shows a graphical user interface presenting a drop-down menu to select an origin data source.
  • FIG. 6 shows a graphical user interface presenting a drop-down menu to select a destination data source.
  • FIG. 7 shows a graphical user interface for inputting base information.
  • FIG. 8 shows a graphical user interface for inputting resource information.
  • FIG. 9 shows a graphical user interface presenting a drop-down menu to select parameters to configure a data formatting function.
  • FIG. 10 shows a graphical user interface presenting a window to input a JSON configuration file.
  • FIG. 11 shows a flow diagram illustrating a method for synchronizing data between data sources.
  • FIG. 12 shows a data synchronization device according to an embodiment.
  • Embodiments described in the context of one of the devices or methods are analogously valid for the other devices or methods. Similarly, embodiments described in the context of a device are analogously valid for a method, and vice-versa.
  • the articles “a”, “an” and “the” as used with regard to a feature or element include a reference to one or more of the features or elements.
  • the term “and/or” includes any and all combinations of one or more of the associated listed items.
  • FIG. 1 shows a data distribution arrangement 100.
  • the data distribution arrangement 100 comprises multiple data sources 101, 102, 103 forming a distributed data storage system.
  • Clients 104 may retrieve data from the data sources 101, 102, 103.
  • the data sources 101, 102, 103 and the clients 104 are connected by a communication network 109 (e.g. the Internet) which may be implemented by multiple subnetworks.
  • the data sources 101, 102, 103 are heterogeneous.
  • a data source 101, 102 may operate according to Hdfs, Hbase, SQLServer, MySQL, Oracle, Kafka, etc.
  • a client 104 may (depending on its type) contact a suitable data source 101, 102, 103 to retrieve data. To ensure that a client 104 may retrieve data from any one of the data sources 101, 102, 103, the data sources 101, 102, 103 are synchronized.
  • FlinkX is a distributed offline data synchronization framework based on Flink that enables efficient data migration between multiple heterogeneous data sources.
  • FlinkX provides a simplification and encapsulation of the configuration of synchronization (e.g. streaming) jobs, and can be applied in the field of big data to synchronize various heterogeneous data sources.
  • In FlinkX, when a data transmission 105 from a first data source 101 to a second data source 102 is performed for synchronization, the first data source 101 acts as a “reader” and is handled by a corresponding reader-plug-in (e.g. a Kafka reader) and the second data source 102 acts as a “writer” and is handled by a corresponding writer-plug-in (e.g. a MySQL writer).
  • In FlinkX, different data sources 101, 102 are abstracted into different reader plug-ins and different data destinations are abstracted into different writer plug-ins.
  • the FlinkX framework can support data synchronization for any data source type.
  • as an ecosystem, each newly added data source can be connected to the existing data sources.
  • a data synchronisation task, i.e. transmission 105, between two data sources 101, 102 is denoted as a job.
  • a job is defined by a configuration file which is, in FlinkX, a JSON file with a certain pre-defined structure. A synchronization engine 106 thus controls the data transmission 105 in accordance with a configuration file 107.
  • the synchronization engine 106 may run on a separate computer or on a computer which implements (possibly together with one or more other computers, e.g. in a cloud implementation) one of the data sources 101, 102.
  • the configuration file has the following structure:
  • Table 1 top-level configuration file structure
  • the data synchronization task contains one job element, and this element contains two parts: setting and content.
  • Setting is for example used to configure the speed limit and error control.
  • Setting may also be used to configure managing of dirty data, logging and restoring.
  • the content part is used to configure certain task information, including the origin data source 101 (reader plug-in information) and destination data source 102 (writer plug-in information).
  • FIG. 2 shows excerpts of an exemplary configuration file 200 as an example for a configuration file 107 in JSON for illustration (written in columns from left to right, omissions are indicated
  • the first data source is a Kafka data source and the second data source (writer) is a MySQL data source.
  • this for example includes core configuration parameters such as
  • these core configuration parameters are mixed at various positions into the configuration file 107, e.g. partially in context of a data source (i.e. in the content part) and partially in the setting part.
  • an approach is provided which allows management of the core parameters for synchronization tasks which in the end allows reliable synchronization between data sources 101, 102, 103 and thus maintenance of a distributed data storage system.
  • Performing a synchronization task, e.g. a transmission from the first data source 101 to the second data source 102, may in particular include the determination of core parameters related to the second data source 102.
  • these core parameters need to be extracted and managed. According to various embodiments, corresponding functions for this are provided.
  • FIG. 3 shows a flow diagram 300 for the generation of a configuration file for a synchronization task.
  • a synchronization event is triggered.
  • the synchronization engine 106 is configured to regularly synchronize data between the data sources 101, 102 and, in the present example, determines that data needs to be transferred from the first data source 101 to the second data source 102.
  • the synchronization event may include the creation of an original (base) configuration file (e.g. a template, possibly taken from a previous synchronization task).
  • the synchronization engine 106 determines whether it should perform the synchronization task based on FlinkX. If that is not the case (e.g. because FlinkX is not configured to be used for the present synchronization task), the synchronization engine 106 skips the processing described in the following (and e.g. continues in a conventional manner using another synchronization framework).
  • the synchronization engine 106 starts using a (synchronization) configuration system 108 for gathering configuration information in 303.
  • FIG. 4 illustrates a configuration system 400.
  • the configuration system 400 includes a data source information retrieval module 401. It reads information about the read data source (i.e. the data source acting as reader) and information about the write data source (i.e. the data source acting as writer) to ensure the uniqueness of the read data source and write data source.
  • the data source information retrieval module 401 may for example retrieve this information from an original (e.g. base) configuration file that was created (and triggers the job e.g. in 301) or from another file indicating the data sources 101, 102 which should be synchronized, e.g. provided or input by a user.
  • the configuration system 400 further includes a unified management module 402 which gets the information about the data sources 101, 102 and in 304 confirms the information about the data sources (e.g. upon confirmation input from a user). These may be seen as the main system parameters.
  • the configuration system 400 further comprises a core parameter module 403 which, in 305, configures and confirms core parameters. These may for example include parameters controlling functions required for transmitting data from one data source to another such as reformatting (e.g. table splitting).
  • one or more additional modules 404 of the configuration system 400 may set the values of additional fields of the configuration file and confirm configuration values (e.g. upon confirmation input from a user).
  • If, in 306, the configuration file is to be created (e.g. when a user inputs a corresponding instruction, e.g. by pressing a corresponding button), the configuration system 400 generates the configuration file 107 according to the set parameters and fields and stores it in 307.
  • the synchronization engine 106 then runs the synchronization task according to the configuration file 107 using FlinkX.
  • the setting of configuration parameters may include querying a user for corresponding information.
  • This may include presenting a graphical user interface to the user for inputting the information.
  • the configuration system 108 may communicate with one of the clients 104 to cause the client 104 to display a corresponding graphical user interface on a screen.
  • the user may for example be presented with a drop down menu to select the origin data source 101 and similarly with a drop down menu to select the destination data source 102.
  • the user may further be presented with a graphical user interface to enter base information (such as the channel), resource information (e.g. information about the origin data source, such as a Kafka group ID) and parameters to configure certain functions controlling how data is synchronized (e.g. formatted).
  • the user may also be presented with the opportunity (e.g. a button) to input a JSON configuration file from which the configuration system parses the various configuration parameters.
  • FIG. 5 shows a graphical user interface 500 presenting a drop-down menu to select the origin data source.
  • FIG. 6 shows a graphical user interface 600 presenting a drop-down menu to select the destination data source.
  • FIG. 7 shows a graphical user interface 700 for inputting base information.
  • FIG. 8 shows a graphical user interface 800 for inputting resource information.
  • FIG. 9 shows a graphical user interface 900 presenting a drop-down menu to select parameters to configure a data formatting function.
  • FIG. 10 shows a graphical user interface 1000 presenting a window to input a JSON configuration file.
  • the configuration system 400 provides centralized management of data synchronization and controllable configuration of core parameters of synchronization jobs. This may include management according to the various data source schemas, such as the different data formats used by the different data sources, i.e. setting configuration parameters to correctly synchronize data between data sources of different types (e.g. Kafka and MySQL).
  • FIG. 11 shows a flow diagram 1100 illustrating a method for synchronizing data between data sources.
  • a user is presented with a graphical user interface to input values for configuration parameters for a synchronization task between a first data source and a second data source.
  • values of the configuration parameters are determined at least partially based on user input to the graphical user interface.
  • a synchronization task configuration file is generated by including each determined configuration parameter value into a section of the configuration file assigned to the configuration parameter according to a pre-defined configuration file structure.
  • a synchronization task is started in accordance with the generated configuration file (e.g. a synchronization framework is instructed or executed to perform a synchronization task in accordance with the generated configuration file; according to the synchronization framework, control messages may then be sent to the data sources).
  • a configuration method is provided (e.g. performed by a configuration tool running on a configuration or synchronization system or device) which allows managing configuration parameters of a synchronization task between two data sources (e.g. a database synchronization job between two databases).
  • the method of FIG. 11 for example allows addressing issues of configuration confusion and complexity in the synchronization of data based on the configuration file of FlinkX.
  • the method of FIG. 11 is for example carried out by a server computer as illustrated in FIG. 12.
  • FIG. 12 shows a data synchronization device 1200 according to an embodiment.
  • the data synchronization device 1200 is implemented by one or more computers and includes a communication interface 1201 (e.g. configured to transmit data for displaying the graphical user interface on a display, receive user input, send control data to the data sources for starting the synchronization task, etc.).
  • the data synchronization device 1200 further includes a processing unit 1202 and a memory 1203.
  • the memory 1203 may be used by the processing unit 1202 to store, for example, values of the configuration parameters and the generated synchronization task configuration file.
  • the data synchronization device is configured to perform the method of FIG. 11.
  • a “circuit” may be understood as any kind of a logic implementing entity, which may be hardware, software, firmware, or any combination thereof.
  • a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor.
  • a “circuit” may also be software being implemented or executed by a processor, e.g. any kind of computer program, e.g. a computer program using a virtual machine code. Any other kind of implementation of the respective functions which are described herein may also be understood as a “circuit” in accordance with an alternative embodiment.
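The configuration-file generation flow of FIG. 3 described in the list above (trigger in 301, gathering in 303, confirmation of the data sources in 304, core parameters in 305, creation request in 306, storing in 307) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the function names, the `event` dictionary layout, the `confirm` callback standing in for the graphical user interface, and the file name `job.json` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def generate_configuration(event, confirm, out_dir):
    """Walk through the FIG. 3 flow once; return the stored file path or None."""
    # Decision after the trigger (301): only FlinkX-based tasks are handled here.
    if event.get("framework") != "flinkx":
        return None
    # 303: gather configuration information, here from an optional base file.
    base = event.get("base_config", {})
    # 304: confirm the information about the two data sources.
    sources = confirm("sources", {"reader": base.get("reader"),
                                  "writer": base.get("writer")})
    # 305: configure and confirm the core parameters.
    core = confirm("core", base.get("core", {}))
    # 306: only continue if creation of the file is requested.
    if not confirm("create", True):
        return None
    # 307: generate the configuration file and store it.
    path = Path(out_dir) / "job.json"
    path.write_text(json.dumps({"sources": sources, "core": core}, indent=2))
    return path


def accept_all(step, proposal):
    """Hypothetical stand-in for the user: confirm every proposal unchanged."""
    return proposal


# Illustrative event; all keys and values are assumptions.
event = {
    "framework": "flinkx",
    "base_config": {"reader": "kafka", "writer": "mysql",
                    "core": {"table_splitting": False}},
}
with tempfile.TemporaryDirectory() as tmp:
    stored = generate_configuration(event, accept_all, tmp)
    stored_config = json.loads(stored.read_text())
skipped = generate_configuration({"framework": "other"}, accept_all, ".")
```

Passing the user interaction in as a callback keeps the flow testable without a graphical user interface; a real system would replace `accept_all` with code that shows the proposal and returns the confirmed or edited value.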

Abstract

Various embodiments concern a method for synchronizing data between data sources, comprising presenting a user with a graphical user interface to input values for configuration parameters for a synchronization task between a first data source and a second data source, determining values of the configuration parameters at least partially based on user input to the graphical user interface, generating a synchronization task configuration file by including each determined configuration parameter value into a section of the configuration file assigned to the configuration parameter according to a pre-defined configuration file structure and starting a synchronization task in accordance with the generated configuration file.

Description

DEVICE AND METHOD FOR SYNCHRONIZING DATA BETWEEN DATA SOURCES
TECHNICAL FIELD
[0001] Various aspects of this disclosure relate to devices and methods for synchronizing data between data sources.
BACKGROUND
[0002] In a distributed storage system where there are multiple data sources storing the same data, data needs to be synchronized between the data sources. However, for example to support different clients, data sources may operate according to different data storage schemas. This makes the synchronization a complex task. While efficient synchronization frameworks like FlinkX exist, the configuration of these frameworks is, in accordance with the complexity of the synchronization, complex and therefore requires a lot of effort, in particular to avoid errors.
[0003] Accordingly, efficient approaches for synchronizing data between data sources are desirable.
SUMMARY
[0004] Various embodiments concern a method for synchronizing data between data sources, comprising presenting a user with a graphical user interface to input values for configuration parameters for a synchronization task between a first data source and a second data source, determining values of the configuration parameters at least partially based on user input to the graphical user interface, generating a synchronization task configuration file by including each determined configuration parameter value into a section of the configuration file assigned to the configuration parameter according to a pre-defined configuration file structure and starting a synchronization task in accordance with the generated configuration file.
[0005] According to one embodiment, starting the synchronization task in accordance with the generated configuration file comprises supplying the generated configuration file to a data synchronization framework.
[0006] According to one embodiment, the synchronization task comprises a transmission of data from the first data source to the second data source.
[0007] According to one embodiment, the values of the configuration parameters include the information that the first data source is an origin data source of the transmission and that the second data source is a destination data source of the transmission.
[0008] According to one embodiment, the values of the configuration parameters include information about formatting functions to perform on the data for transmitting data from the first data source to the second data source.
[0009] According to one embodiment, the values of the configuration parameters include information about the first data source and the second data source.
[0010] According to one embodiment, the information about the first data source includes an identification of the first data source and the information about the second data source includes an identification of the second data source.
[0011] According to one embodiment, the synchronization task configuration file comprises a section for parameters related to the first data source, a section for parameters related to the second data source and a general section, and generating the synchronization task configuration file comprises classifying the determined configuration parameter values according to whether they relate to the first data source, relate to the second data source or are general configuration parameters and putting each configuration parameter value in the section of the synchronization task configuration file to which it relates.
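The classification step of paragraph [0011] can be illustrated with a short sketch. It assumes a fixed lookup table from parameter name to section; the parameter names, the section names `first_source`, `second_source` and `general`, and the helper functions are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical assignment of each configuration parameter to a section.
# None of these parameter names are taken from the disclosure.
SECTION_OF = {
    "origin_type": "first_source",
    "origin_topic": "first_source",
    "destination_type": "second_source",
    "destination_table": "second_source",
    "channel": "general",
    "error_record_limit": "general",
}


def classify(values):
    """Group determined parameter values by the section they relate to."""
    sections = {"first_source": {}, "second_source": {}, "general": {}}
    for name, value in values.items():
        # Unknown parameters are rejected rather than silently misplaced.
        if name not in SECTION_OF:
            raise KeyError(f"no section assigned to parameter {name!r}")
        sections[SECTION_OF[name]][name] = value
    return sections


def generate_config_text(values):
    """Serialize the classified values as a JSON configuration file body."""
    return json.dumps(classify(values), indent=2, sort_keys=True)


# Illustrative values as they might be determined from user input.
user_values = {
    "origin_type": "kafka",
    "origin_topic": "orders",
    "destination_type": "mysql",
    "destination_table": "orders",
    "channel": 1,
    "error_record_limit": 100,
}
config_text = generate_config_text(user_values)
```

Keeping the section assignment in one table means each parameter has exactly one place in the generated file, which is the property the paragraph relies on.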
[0012] According to one embodiment, determining values of the configuration parameters comprises presenting possible values of the configuration parameters via the graphical user interface for selection and confirmation.
[0013] According to one embodiment, determining values of the configuration parameters comprises receiving an original synchronization task configuration file and parsing the original synchronization task configuration file to determine possible values of the configuration parameters.
[0014] According to one embodiment, generating the synchronization task configuration file includes keeping configuration parameter values of the original configuration file which have been confirmed by user input to the graphical user interface and updating configuration parameter values which have been changed by user input to the graphical user interface.
[0015] According to one embodiment, the pre-defined configuration file structure is a FlinkX configuration file structure.
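Paragraphs [0013] and [0014] describe parsing an original configuration file to obtain candidate values and then keeping confirmed values while updating changed ones. A minimal sketch of that idea, assuming a nested JSON file and dotted paths as parameter identifiers, could look as follows. The helper names and the key names (`speed`, `channel`, `errorLimit`, `record`) are assumptions for illustration and are not confirmed by the text.

```python
import copy
import json


def flatten(node, prefix=""):
    """Return {dotted_path: leaf_value} for a nested dict (lists stay leaves)."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def apply_changes(original, changes):
    """Copy `original` and overwrite only the paths the user changed."""
    result = copy.deepcopy(original)
    for path, new_value in changes.items():
        node = result
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = new_value
    return result


# Hypothetical original configuration file content.
original_text = (
    '{"job": {"setting": {"speed": {"channel": 1},'
    ' "errorLimit": {"record": 100}}}}'
)
original = json.loads(original_text)

# Candidate values that a GUI could present for selection and confirmation.
candidates = flatten(original)

# The user edits one value and confirms the rest unchanged.
changes = {"job.setting.speed.channel": 4}
updated = apply_changes(original, changes)
```

Working on a deep copy leaves the original file content untouched, so confirmed values are carried over verbatim and only the edited paths differ in the generated file.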
[0016] According to one embodiment, the method comprises performing the synchronization task using FlinkX.
[0017] According to one embodiment, the pre-defined configuration file structure is a JSON format.
[0018] According to one embodiment, a data synchronization device is provided comprising a communication interface, a memory interface and a processing unit configured to perform the method for synchronizing data between data sources as described above.
[0019] According to one embodiment, a computer program element is provided comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method for synchronizing data between data sources described above.
[0020] According to one embodiment, a computer-readable medium is provided comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method for synchronizing data between data sources described above.
[0021] It should be noted that embodiments described in context of the method for synchronizing data between data sources are analogously valid for the data synchronization device and vice versa.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] The invention will be better understood with reference to the detailed description when considered in conjunction with the non-limiting examples and the accompanying drawings, in which:
- FIG. 1 shows a data distribution arrangement.
- FIG. 2 shows excerpts of an exemplary configuration file.
- FIG. 3 shows a flow diagram for the generation of a configuration file for a synchronization task.
- FIG. 4 illustrates a configuration system.
- FIG. 5 shows a graphical user interface presenting a drop-down menu to select an origin data source.
- FIG. 6 shows a graphical user interface presenting a drop-down menu to select a destination data source.
- FIG. 7 shows a graphical user interface for inputting base information.
- FIG. 8 shows a graphical user interface for inputting resource information.
- FIG. 9 shows a graphical user interface presenting a drop-down menu to select parameters to configure a data formatting function.
- FIG. 10 shows a graphical user interface presenting a window to input a JSON configuration file.
- FIG. 11 shows a flow diagram illustrating a method for synchronizing data between data sources.
- FIG. 12 shows a data synchronization device according to an embodiment.
DETAILED DESCRIPTION
[0023] The following detailed description refers to the accompanying drawings that show, by way of illustration, specific details and embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure. Other embodiments may be utilized, and structural and logical changes may be made without departing from the scope of the disclosure. The various embodiments are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.
[0024] Embodiments described in the context of one of the devices or methods are analogously valid for the other devices or methods. Similarly, embodiments described in the context of a device are analogously valid for a method, and vice-versa.
[0025] Features that are described in the context of an embodiment may correspondingly be applicable to the same or similar features in the other embodiments. Features that are described in the context of an embodiment may correspondingly be applicable to the other embodiments, even if not explicitly described in these other embodiments. Furthermore, additions and/or combinations and/or alternatives as described for a feature in the context of an embodiment may correspondingly be applicable to the same or similar feature in the other embodiments.
[0026] In the context of various embodiments, the articles “a”, “an” and “the” as used with regard to a feature or element include a reference to one or more of the features or elements.
[0027] As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0028] In the following, embodiments will be described in detail.
[0029] FIG. 1 shows a data distribution arrangement 100.
[0030] The data distribution arrangement 100 comprises multiple data sources 101, 102, 103 forming a distributed data storage system. Clients 104 may retrieve data from the data sources 101, 102, 103. The data sources 101, 102, 103 and the clients 104 are connected by a communication network 109 (e.g. the Internet) which may be implemented by multiple subnetworks.
[0031] In the present example, it is assumed that the data sources 101, 102, 103 are heterogeneous. For example, a data source 101, 102 may operate according to Hdfs, Hbase, SQLServer, MySQL, Oracle, Kafka, etc. A client 104 may (depending on its type) contact a suitable data source 101, 102, 103 to retrieve data. To ensure that a client 104 may retrieve data from any one of the data sources 101, 102, 103, the data sources 101, 102, 103 are synchronized.
[0032] According to various embodiments, data between two data sources 101, 102 is synchronized using FlinkX. FlinkX is a distributed offline data synchronization framework based on Flink that enables efficient data migration between multiple heterogeneous data sources. FlinkX provides a simplification and encapsulation of the configuration of synchronization (e.g. streaming) jobs, and can be applied in the field of big data to synchronize various heterogeneous data sources.
[0033] In FlinkX, when a data transmission 105 from a first data source 101 to a second data source 102 is performed for synchronization, the first data source 101 acts as a “reader” and is handled by a corresponding reader-plug-in (e.g. a Kafka reader) and the second data source 102 acts as a “writer” and is handled by a corresponding writer-plug-in (e.g. a MySQL writer).
[0034] In FlinkX, different data sources 101, 102 are abstracted into different reader plug-ins and different data destinations are abstracted into different writer plug-ins. Theoretically, the FlinkX framework can support data synchronization for any data source type. Because the plug-ins form one ecosystem, each newly added data source type can be synchronized with the data source types that are already supported.
[0035] A data synchronisation task, i.e. transmission 105, between two data sources 101, 102 is denoted as a job.
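The reader/writer abstraction described above can be illustrated with a minimal sketch. It is not FlinkX code: the class names `Reader`, `Writer`, `ListReader`, `ListWriter` and the function `run_job` are hypothetical, and in-memory lists stand in for real data sources such as Kafka or MySQL.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


class Reader(ABC):
    """Hypothetical reader plug-in interface: yields records from an origin data source."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Writer(ABC):
    """Hypothetical writer plug-in interface: stores records in a destination data source."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class ListReader(Reader):
    """Stand-in for e.g. a Kafka reader; reads from an in-memory list."""

    def __init__(self, rows: List[Record]) -> None:
        self._rows = rows

    def read(self) -> Iterator[Record]:
        yield from self._rows


class ListWriter(Writer):
    """Stand-in for e.g. a MySQL writer; appends to an in-memory list."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        # Return the number of records written by this call.
        return len(self.rows) - before


def run_job(reader: Reader, writer: Writer) -> int:
    """One synchronization job: move every record from the reader to the writer."""
    return writer.write(reader.read())
```

Under this assumed design, supporting a new data source type only requires one new `Reader` and/or `Writer` subclass; `run_job` stays unchanged for every reader/writer pairing.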
[0036] A job is defined by a configuration file which is, in FlinkX, a JSON file with a certain pre-defined structure. Accordingly, a synchronization engine 106 controls the data transmission 105 in accordance with a configuration file 107. The synchronization engine 106 may run on a separate computer or on a computer which implements (possibly together with one or more other computers, e.g. in a cloud implementation) one of the data sources 101, 102.
[0037] The configuration file has the following structure:
Table 1: top-level configuration file structure
[0038] The data synchronization task contains one job element, and this element contains two parts of setting and content.
[0039] Setting is for example used to configure the speed limit and error control.
Table 2: example of speed information element
Table 3: example of error limit information element
[0040] Setting may also be used to configure managing of dirty data, logging and restoring.
[0041] The content part is used to configure certain task information, including the origin data source 101 (reader plug-in information) and destination data source 102 (writer plug-in information).
Table 4: example of content part structure of configuration file
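Since the tables are not reproduced in this text, the described layout (one job element with a setting part and a content part, the latter naming a reader and a writer) can only be sketched under assumptions. In the following sketch all key names (`job`, `setting`, `speed`, `errorLimit`, `content`, `reader`, `writer`, `name`, `parameter`), the plug-in names and the error limit value are illustrative placeholders and are not taken from Tables 1 to 4.

```python
import json

# Illustrative skeleton only: the key names, the plug-in names and the
# error limit value are assumptions, not the contents of Tables 1 to 4.
config = {
    "job": {
        # "setting" part: speed limit and error control.
        "setting": {
            "speed": {"channel": 32},
            "errorLimit": {"record": 100},  # placeholder value
        },
        # "content" part: origin (reader) and destination (writer).
        "content": [
            {
                "reader": {"name": "kafkareader", "parameter": {}},
                "writer": {"name": "mysqlwriter", "parameter": {}},
            }
        ],
    }
}

# Serialize to JSON text, the format the description names for the file.
text = json.dumps(config, indent=2)
```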
[0042] FIG. 2 shows excerpts of an exemplary configuration file 200 as an example for a configuration file 107 in JSON for illustration (written in columns from left to right, omissions are indicated
[0043] In the example of FIG. 2, the first data source (reader) is a Kafka data source and the second data source (writer) is a MySQL data source.
[0044] In the configuration file 107, all parameters that the synchronization engine 106 needs to carry out the synchronization task 105 are defined.
[0045] In the exemplary configuration file 200, this for example includes core configuration parameters such as
[0046] "channel": 32
[0047] "splitTableRange": 6
[0048] "splitTableIndex": 8
[0049] "splitDBIndex": 4
[0050] etc.
[0051] As can be seen in the example of FIG. 2, these core configuration parameters are mixed at various positions into the configuration file 107, e.g. partially in context of a data source (i.e. in the content part) and partially in the setting part.
[0052] Thus, the configuration of a job is inconvenient and difficult to handle and the maintenance effort for a job is high.
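The difficulty can be made concrete with a small sketch that walks a nested configuration and reports where each core parameter sits. The function name `find_core_parameters`, the set `CORE_KEYS` (using the key spellings assumed from the list above) and the example nesting are hypothetical and do not reproduce FIG. 2.

```python
from typing import Any, Iterator, Tuple

# Core parameter names as listed in the description (spellings assumed).
CORE_KEYS = {"channel", "splitTableRange", "splitTableIndex", "splitDBIndex"}


def find_core_parameters(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every core parameter found in a nested config."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in CORE_KEYS:
                yield child, value
            else:
                yield from find_core_parameters(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_core_parameters(item, f"{path}[{index}]")


# Hypothetical nesting: one core parameter in the setting part,
# the others inside the writer section of the content part.
example = {
    "job": {
        "setting": {"speed": {"channel": 32}},
        "content": [
            {
                "writer": {
                    "parameter": {
                        "splitTableRange": 6,
                        "splitTableIndex": 8,
                        "splitDBIndex": 4,
                    }
                }
            }
        ],
    }
}

found = dict(find_core_parameters(example))
```

In this example `found` maps paths such as `job.setting.speed.channel` and `job.content[0].writer.parameter.splitTableRange` to their values, which shows how far apart related parameters can end up in one file.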
[0053] Therefore, according to various embodiments, an approach is provided for managing the core parameters of synchronization tasks, which in turn allows reliable synchronization between the data sources 101, 102, 103 and thus maintenance of a distributed data storage system. Performing a synchronization task, e.g. a transmission from the first data source 101 to the second data source 102, may in particular include determining core parameters related to the second data source 102. These core parameters therefore need to be extracted and managed, and various embodiments provide corresponding functions for this.
[0054] When FlinkX is used for a synchronization job, for example, these functions simplify the configuration of the (possibly large-scale streaming) job: they extract the core parameters of the streaming job (e.g. according to a data source schema) and manage them in a controllable manner, which in the end increases the reliability of synchronization and reduces the operational workload.
[0055] FIG. 3 shows a flow diagram 300 for the generation of a configuration file for a synchronization task.
[0056] In 301, a synchronization event is triggered. For example, the synchronization engine 106 is configured to regularly synchronize data between the data sources 101, 102 and, in the present example, determines that data needs to be transferred from the first data source 101 to the second data source 102. The synchronization event may include the creation of an original (base) configuration file (e.g. a template, possibly taken from a previous synchronization task).
[0057] In 302, the synchronization engine 106 determines whether it should perform the synchronization task based on FlinkX. If that is not the case (e.g. because FlinkX is not configured to be used for the present synchronization task), the synchronization engine 106 skips the processing described in the following (and e.g. continues in a conventional manner using another synchronization framework).
[0058] If the synchronization task is to be carried out using FlinkX, the synchronization engine 106 starts using a (synchronization) configuration system 108 for gathering configuration information in 303.
[0059] FIG. 4 illustrates a configuration system 400.
[0060] The configuration system 400 includes a data source information retrieval module 401. It reads information about the read data source (i.e. the data source acting as reader) and information about the write data source (i.e. the data source acting as writer) to ensure the uniqueness of the read data source and write data source.
[0061] The data source information retrieval module 401 may for example retrieve this information from an original (e.g. base) configuration file that was created (and triggers the job e.g. in 301) or from another file indicating the data sources 101, 102 which should be synchronized, e.g. provided or input by a user.
[0062] The configuration system 400 further includes a unified management module 402 which gets the information about the data sources 101, 102 and in 304 confirms the information about the data sources (e.g. upon confirmation input from a user). These may be seen as the main system parameters.
[0063] The configuration system 400 further comprises a core parameter module 403 which, in 305, configures and confirms core parameters. These may for example include parameters controlling functions required for transmitting data from one data source to another such as reformatting (e.g. table splitting).
[0064] Further, in 305, one or more additional modules 404 of the configuration system 400 may set the values of additional fields of the configuration file and confirm configuration values (e.g. upon confirmation input from a user).
[0065] If, in 306, the configuration file is to be created (e.g. when a user inputs a corresponding instruction, e.g. by pressing a corresponding button), the configuration system 400 generates the configuration file 107 according to the set parameters and fields and stores it in 307.
[0066] The synchronization engine 106 then runs the synchronization task according to the configuration file 107 using FlinkX.
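The flow of 302 to 307 can be summarized in a short sketch. It is a simplification under stated assumptions: the function `prepare_job`, its parameters, the key names of the configuration and the choice to place core parameters in the setting part are all hypothetical, and the user confirmation is modelled as a plain callback.

```python
import copy
import json
from typing import Any, Callable, Dict, Optional


def prepare_job(
    base_config: Dict[str, Any],
    use_flinkx: bool,
    core_parameters: Dict[str, Any],
    create_requested: Callable[[], bool],
) -> Optional[str]:
    """Return the generated configuration as JSON text, or None if skipped."""
    # 302: skip this processing if the task is not to be run with FlinkX.
    if not use_flinkx:
        return None
    # Work on a copy so the original (base) configuration stays untouched.
    config = copy.deepcopy(base_config)
    # 303/304: gather the reader and writer information and check it is present.
    content = config["job"]["content"][0]
    for role in ("reader", "writer"):
        if not content[role].get("name"):
            raise ValueError(f"missing {role} data source")
    # 305: set the core parameters (placed in the setting part as an assumption).
    config["job"]["setting"].update(core_parameters)
    # 306: only generate the file once its creation has been requested.
    if not create_requested():
        return None
    # 307: generate the configuration; a real system would also store it.
    return json.dumps(config, indent=2)
```

A caller would pass the returned text to the synchronization engine, which then runs the task as described in [0066].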
[0067] The setting of configuration parameters may include querying a user for corresponding information.
[0068] This may include presenting a graphical user interface to the user for inputting the information. For example, the configuration system 108 may communicate with one of the clients 104 to cause the client 104 to display a corresponding graphical user interface on a screen.
[0069] The user may for example be presented with a drop down menu to select the origin data source 101 and similarly with a drop down menu to select the destination data source 102.
[0070] The user may further be presented with a graphical user interface to enter base information (such as channel), resource info (such as information about the origin data source such as a Kafka group ID), parameters to configure certain functions controlling how data is synchronized (e.g. formatted).
[0071] The user may also be presented with the opportunity (e.g. a button) to input a JSON configuration file from which the configuration system parses the various configuration parameters.
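Parsing an input JSON configuration file into parameters that can be shown to the user, and then applying the user's changes, might look as follows. This is a minimal sketch: the helper names `flatten`, `parse_parameters` and `apply_user_input` and the dotted-path naming of parameters are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def flatten(node: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested JSON objects into dotted-path -> value pairs."""
    flat: Dict[str, Any] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    else:
        # Scalars and lists are treated as leaf values.
        flat[prefix] = node
    return flat


def parse_parameters(text: str) -> Dict[str, Any]:
    """Parse JSON configuration text into parameters to present for confirmation."""
    return flatten(json.loads(text))


def apply_user_input(parsed: Dict[str, Any], changed: Dict[str, Any]) -> Dict[str, Any]:
    """Keep parsed values the user left as they are and update the changed ones."""
    unknown = set(changed) - set(parsed)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**parsed, **changed}
```

For instance, parsing `{"job": {"setting": {"speed": {"channel": 1}}}}` yields the single parameter `job.setting.speed.channel`, which the user could then confirm or change to another value before the configuration file is generated.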
[0072] Examples for corresponding GUIs are shown in FIG. 5 to FIG. 10.
[0073] FIG. 5 shows a graphical user interface 500 presenting a drop-down menu to select the origin data source.
[0074] FIG. 6 shows a graphical user interface 600 presenting a drop-down menu to select the destination data source.
[0075] FIG. 7 shows a graphical user interface 700 for inputting base information.
[0076] FIG. 8 shows a graphical user interface 800 for inputting resource information.
[0077] FIG. 9 shows a graphical user interface 900 presenting a drop-down menu to select parameters to configure a data formatting function.
[0078] FIG. 10 shows a graphical user interface 1000 presenting a window to input a JSON configuration file.
[0079] According to various embodiments, the configuration system 400 provides a centralized management of data synchronization and controllable configuration of core parameters of synchronization jobs. This may include the management according to the various data source schemas, such as different data formats etc. used by the different data sources, i.e. setting configuration parameters to correctly synchronize data between data sources of different types (e.g. Kafka and MySQL etc.).
[0080] In summary, according to various embodiments, a method is provided as illustrated in FIG. 11.
[0081] FIG. 11 shows a flow diagram 1100 illustrating a method for synchronizing data between data sources.
[0082] In 1101, a user is presented with a graphical user interface to input values for configuration parameters for a synchronization task between a first data source and a second data source.
[0083] In 1102, values of the configuration parameters are determined at least partially based on user input to the graphical user interface.
[0084] In 1103, a synchronization task configuration file is generated by including each determined configuration parameter value into a section of the configuration file assigned to the configuration parameter according to a pre-defined configuration file structure.
[0085] In 1104, a synchronization task is started in accordance with the generated configuration file (e.g. a synchronization framework is instructed or executed to perform a synchronization task in accordance with the generated configuration file; according to the synchronization framework, control messages may then be sent to the data sources).
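Step 1103 can be sketched as a function that sorts each determined parameter value into the section it relates to (first data source, second data source, or general) and then serializes the result. The function `build_configuration`, the scope labels and all key names are hypothetical; the sketch assumes the values have already been determined in 1102.

```python
import json
from typing import Any, Dict, Iterable, Tuple

# (scope, name, value); scope is "reader", "writer" or "general".
Parameter = Tuple[str, str, Any]


def build_configuration(
    reader_name: str, writer_name: str, parameters: Iterable[Parameter]
) -> str:
    """Place each parameter value in its section and return JSON text."""
    reader: Dict[str, Any] = {"name": reader_name, "parameter": {}}
    writer: Dict[str, Any] = {"name": writer_name, "parameter": {}}
    setting: Dict[str, Any] = {}
    for scope, name, value in parameters:
        if scope == "reader":
            reader["parameter"][name] = value
        elif scope == "writer":
            writer["parameter"][name] = value
        elif scope == "general":
            setting[name] = value
        else:
            raise ValueError(f"unknown scope: {scope!r}")
    # Assumed file layout: a general "setting" section plus one
    # reader/writer pair in the "content" section.
    config = {
        "job": {
            "setting": setting,
            "content": [{"reader": reader, "writer": writer}],
        }
    }
    return json.dumps(config, indent=2)
```

Calling `build_configuration("kafkareader", "mysqlwriter", [("general", "channel", 32), ("reader", "groupId", "g1"), ("writer", "splitTableRange", 6)])` would produce a file in which each value appears only in the section it relates to; the plug-in names and the `groupId` key are placeholders.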
[0086] According to various embodiments, in other words, a configuration method (e.g. performed by a configuration tool running on a configuration or synchronization system or device) is provided which allows managing configuration parameters of a synchronization task between two data sources (e.g. a data base synchronization job between two data bases).
[0087] The approach of FIG. 11 for example allows addressing issues of configuration confusion and complexity in synchronization of data based on the configuration file of FlinkX.
[0088] The method of FIG. 11 is for example carried out by a server computer as illustrated in FIG. 12.
[0089] FIG. 12 shows a data synchronization device 1200 according to an embodiment.
[0090] The data synchronization device 1200 is implemented by one or more computers and includes a communication interface 1201 (e.g. configured to transmit data for displaying the graphical user interface on a display, to receive user input, to send control data to the data sources for starting the synchronization task, etc.).
[0091] The data synchronization device 1200 further includes a processing unit 1202 and a memory 1203. The memory 1203 may be used by the processing unit 1202 to store, for example, values of the configuration parameters and the generated synchronization task configuration file. The data synchronization device is configured to perform the method of FIG. 11.
[0092] The methods described herein may be performed and the various processing or computation units and the devices and computing entities described herein may be implemented by one or more circuits. In an embodiment, a "circuit" may be understood as any kind of a logic implementing entity, which may be hardware, software, firmware, or any combination thereof. Thus, in an embodiment, a "circuit" may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor. A "circuit" may also be software being implemented or executed by a processor, e.g. any kind of computer program, e.g. a computer program using a virtual machine code. Any other kind of implementation of the respective functions which are described herein may also be understood as a "circuit" in accordance with an alternative embodiment.
[0093] While the disclosure has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

1. A method for synchronizing data between data sources, comprising: presenting a user with a graphical user interface to input values for configuration parameters for a synchronization task between a first data source and a second data source; determining values of the configuration parameters at least partially based on user input to the graphical user interface; generating a synchronization task configuration file by including each determined configuration parameter value into a section of the configuration file assigned to the configuration parameter according to a pre-defined configuration file structure; and starting a synchronization task in accordance with the generated configuration file, wherein the synchronization task configuration file comprises a section for parameters related to the first data source, a section for parameters related to the second data source and a general section, and generating the synchronization task configuration file comprises classifying the determined configuration parameter values according to whether they relate to the first data source, relate to the second data source or are general configuration parameters and putting each configuration parameter value in the section of the synchronization task configuration file to which it relates.
2. The method of claim 1, wherein starting the synchronization task in accordance with the generated configuration file comprises supplying the generated configuration file to a data synchronization framework.
3. The method of claim 1 or 2, wherein the synchronization task comprises a transmission of data from the first data source to the second data source.
4. The method of claim 3, wherein the values of the configuration parameters include the information that the first data source is an origin data source of the transmission and that the second data source is a destination data source of the transmission.
5. The method of any one of claims 1 to 4, wherein the values of the configuration parameters include information about formatting functions to perform on the data for transmitting data from the first data source to the second data source.
6. The method of any one of claims 1 to 5, wherein the values of the configuration parameters include information about the first data source and the second data source.
7. The method of claim 6, wherein the information about the first data source includes an identification of the first data source and wherein the information about the second data source includes an identification of the second data source.
8. The method of any one of claims 1 to 7, wherein determining values of the configuration parameters comprises presenting possible values of the configuration parameters via the graphical user interface for selection and confirmation.
9. The method of claim 8, wherein determining values of the configuration parameters comprises receiving an original synchronization task configuration file and parsing the original synchronization task configuration file to determine possible values of the configuration parameters.
10. The method of claim 9, wherein generating the synchronization task configuration file includes keeping configuration parameter values of the original configuration file which have been confirmed by user input to the graphical user interface and updating configuration parameter values which have been changed by user input to the graphical user interface.
11. The method of any one of claims 1 to 10, wherein the pre-defined configuration file structure is a FlinkX configuration file structure.
12. The method of any one of claims 1 to 11, comprising performing the synchronization task using FlinkX.
13. The method of any one of claims 1 to 12, wherein the pre-defined configuration file structure is a JSON format.
14. A data synchronization device comprising a communication interface, a memory interface and a processing unit configured to perform the method of any one of claims 1 to 13.
15. A computer program element comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 13.
16. A computer-readable medium comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 13.
PCT/SG2022/050875 2021-12-01 2022-12-01 Device and method for synchronizing data between data sources WO2023101610A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202113349Y 2021-12-01
SG10202113349Y 2021-12-01

Publications (2)

Publication Number Publication Date
WO2023101610A2 (en) 2023-06-08
WO2023101610A3 (en) 2023-08-10

Family

ID=86613201

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2022/050875 WO2023101610A2 (en) 2021-12-01 2022-12-01 Device and method for synchronizing data between data sources

Country Status (1)

Country Link
WO (1) WO2023101610A2 (en)


Also Published As

Publication number Publication date
WO2023101610A3 (en) 2023-08-10
