CN114036238A - Data synchronization method, device, equipment and storage medium

Info

Publication number: CN114036238A
Application number: CN202111399158.7A
Authority: CN
Inventors: 史鹏飞; 熊承鹏; 林攀学; 胡建宇; 陈飞
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2021-11-23
Filing date: 2021-11-23
Publication date: 2022-02-11

Abstract

The invention belongs to the field of computers and discloses a data synchronization method, device, equipment and storage medium. The method comprises: obtaining data source information of a data source to be synchronized; determining the data source type according to the data source information; when the data source type is a distributed message data source, acquiring configuration information corresponding to the data source to be synchronized; and pulling the data to be synchronized from the data source to be synchronized and writing it into the target database according to the configuration information. Existing approaches support only streaming synchronization for a distributed message data source, that is, each piece of data is processed immediately upon receipt. By pulling the data and then writing it according to the configuration information, the invention enables a distributed message data source to support batch offline synchronization.

Description

Data synchronization method, device, equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data synchronization method, apparatus, device, and storage medium.

Background

In network management systems, communication systems, electronic commerce systems and banking systems, a large amount of system data is continuously generated. If the system fails, this data is lost, with serious consequences, so the data must be backed up in a timely manner. Meanwhile, the data may be queried heavily at any time, and to speed up queries it needs to be stored in databases at several different locations. In existing data synchronization, when a distributed message data source such as Kafka is used as the synchronization data source, only streaming synchronization is supported, and batch offline synchronization is not.

The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.

Disclosure of Invention

The main object of the invention is to provide a data synchronization method, device, equipment and storage medium, so as to solve the technical problem in the prior art that batch offline synchronization is not supported when the data source for data synchronization is a distributed message data source.

In order to achieve the above object, the present invention provides a data synchronization method, comprising the steps of:

acquiring data source information of a data source to be synchronized;

determining the type of a data source according to the data source information;

when the data source type is a distributed message data source, acquiring configuration information corresponding to the data source to be synchronized;

and pulling the data to be synchronized from the data source to be synchronized, and writing the data to be synchronized into a target database according to the configuration information.

Optionally, the pulling the data to be synchronized from the data source to be synchronized includes:

and pulling the data to be synchronized in batches from the data source to be synchronized.

Optionally, the step of pulling the data to be synchronized in batches from the data source to be synchronized includes:

acquiring quantity information of data to be synchronized according to the configuration information;

and determining the pulling times according to the quantity information, and pulling the data to be synchronized in batches from the data source to be synchronized according to the pulling times.

Optionally, the step of determining the number of pulling times according to the quantity information, and pulling the data to be synchronized in batches from the data source to be synchronized according to the number of pulling times includes:

determining the pulling times according to the quantity information and preset single batch pulling number information;

and pulling the data to be synchronized in batches from the data source to be synchronized according to the preset single-batch pulling number information and the pulling times.

Optionally, after the step of pulling the data to be synchronized in batches from the data source to be synchronized according to the preset single-batch pulling number information and the pulling times, the method includes:

when the data pulling corresponding to the pulling times is finished, performing data pulling on the data source to be synchronized again;

and when the data is not pulled within the preset time length, judging that the data to be synchronized is pulled completely.

Optionally, when the data is not pulled within the preset time period, after the step of determining that the pulling of the data to be synchronized is completed, the method includes:

acquiring a target database and a field mapping rule in the configuration information;

and writing the data to be synchronized into the target database according to the field mapping rule.

Optionally, when the data is not pulled within the preset time period, after the step of determining that the pulling of the data to be synchronized is completed, the method further includes:

and sending a data synchronization termination instruction to a preset Flink process so that the preset Flink process stops running after receiving the data synchronization termination instruction.

Optionally, after the step of obtaining the data source information of the data source to be synchronized, the method further includes:

acquiring first synchronous data corresponding to a synchronous data task according to the data source information;

and uploading the first synchronous data to a target partition corresponding to the synchronous data task according to a preset data writing mode.

Optionally, before the step of uploading the first synchronization data to the target partition corresponding to the synchronization data task according to a preset data writing mode, the method further includes:

acquiring a data destination corresponding to the synchronous data task;

when the data destination is a data warehouse, acquiring preconfigured data synchronization partition information;

and creating partitions in the data warehouse according to the data synchronization partition information to obtain target partitions.

Optionally, the step of creating a partition in the data warehouse according to the data synchronization partition information and obtaining a target partition includes:

generating a partition field array according to the data synchronization partition information;

and creating partitions in the data warehouse according to the partition field arrays to obtain target partitions.

Optionally, the step of generating a partition field array according to the data synchronization partition information includes:

acquiring the partition type in the data synchronization partition information;

judging whether the partition type is a data source field partition or a time field partition;

and when the partition type is a time field partition, generating a partition field array according to the data synchronization partition information.

Optionally, after the step of determining whether the partition type is a data source field partition or a time field partition, the method further includes:

when the partition type is a data source field partition, acquiring a partition name, a binding field name and a binding field subscript according to the data synchronization partition information;

and generating a partition field array according to the partition name, the binding field name and the binding field subscript.
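As a non-limiting illustration, the two partition-type branches described above might be sketched as follows. The layout of the partition information (a list of dictionaries with the keys `type`, `name`, `bind_field` and `bind_index`) and the `PartitionField` class are hypothetical and are chosen only for this example; the text does not specify how the partition field array is represented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PartitionField:
    """One entry of the partition field array (hypothetical layout)."""
    name: str                   # partition name in the data warehouse
    bind_field: Optional[str]   # bound source field name, None for time partitions
    bind_index: Optional[int]   # subscript of the bound field in a source row


def build_partition_fields(partition_info: List[Dict]) -> List[PartitionField]:
    """Build the partition field array from preconfigured partition information."""
    fields = []
    for item in partition_info:
        if item["type"] == "source_field":
            # Data source field partition: needs name, binding field name and subscript.
            fields.append(
                PartitionField(item["name"], item["bind_field"], item["bind_index"])
            )
        elif item["type"] == "time_field":
            # Time field partition: no source field is bound in this sketch.
            fields.append(PartitionField(item["name"], None, None))
        else:
            raise ValueError(f"unknown partition type: {item['type']!r}")
    return fields
```

Keeping one entry per partition, in configuration order, is one simple way to support several partitions in a single synchronization task.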

Optionally, after the step of determining the type of the data source according to the data source information, the method further includes:

when the type of the data source is a relational data source, acquiring a preset storage database corresponding to the relational data source;

acquiring a synchronous mapping relation table corresponding to the relational data source and the preset storage database;

and storing the data to be synchronized in the relational data source into the preset storage database according to the synchronous mapping relation table.

Optionally, the step of storing the data to be synchronized in the relational data source into the preset storage database according to the synchronization mapping relationship table includes:

acquiring data to be synchronized in the relational data source according to a preset receiving rule;

performing type conversion on the data to be synchronized according to the synchronous mapping relation table to obtain converted synchronous data;

and storing the converted synchronous data into the preset storage database.

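Optionally, after the step of determining the type of the data source according to the data source information, the method further includes:

when the data source type is a time sequence type data source, acquiring data synchronization configuration information;

according to the data synchronization configuration information, fragmenting the data to be synchronized in the data synchronization task to obtain a fragmentation result;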

and uploading the data to be synchronized to a distributed file system according to the fragmentation result.

Optionally, the step of fragmenting the data to be synchronized in the data synchronization task according to the data synchronization configuration information to obtain a fragmentation result includes:

determining whether the data synchronization task is incremental synchronization according to the data synchronization configuration information;

if not, determining the data to be synchronized according to the data synchronization configuration information;

and fragmenting the data to be synchronized according to the data volume of the data to be synchronized and a preset fragmentation rule to obtain a fragmentation result.

Optionally, after the step of determining whether the data synchronization task is incremental synchronization according to the data synchronization configuration information, the method further includes:

when the data synchronization task is incremental synchronization, determining the initial position of the data synchronization task according to data synchronization configuration information;

and determining data to be synchronized according to the data synchronization configuration information and the initial position, and executing the step of fragmenting the data to be synchronized according to the data volume of the data to be synchronized and a preset fragmentation rule to obtain a fragmentation result.
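As a non-limiting illustration, the full and incremental branches described above might be sketched as follows. The preset fragmentation rule is not specified in the text, so fixed-size row ranges are assumed here; the function name `plan_shards` and its parameters are hypothetical, and row offsets stand in for whatever position marker a real time sequence database would use.

```python
from typing import List, Tuple


def plan_shards(total_rows: int, shard_size: int,
                incremental: bool = False, start: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (begin, end) row ranges covering the data to synchronize.

    Full synchronization starts at row 0; incremental synchronization starts at
    the configured initial position `start`.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    begin = start if incremental else 0
    return [
        (lo, min(lo + shard_size, total_rows))
        for lo in range(begin, total_rows, shard_size)
    ]
```

For example, `plan_shards(10, 4)` yields three fragments for a full synchronization, while `plan_shards(10, 4, incremental=True, start=6)` yields a single fragment covering only the rows after the initial position.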

In addition, to achieve the above object, the present invention also provides a data synchronization apparatus, including:

the acquisition module is used for acquiring data source information of a data source to be synchronized;

the determining module is used for determining the type of the data source according to the data source information;

the configuration information determining module is used for acquiring the configuration information corresponding to the data source to be synchronized when the data source type is a distributed message data source;

and the synchronization module is used for pulling the data to be synchronized from the data source to be synchronized and writing the data to be synchronized into a target database according to the configuration information.

In addition, to achieve the above object, the present invention further provides a data synchronization device, including: a memory, a processor and a data synchronization program stored on the memory and executable on the processor, the data synchronization program being configured to implement the steps of the data synchronization method as described above.

Furthermore, to achieve the above object, the present invention further provides a storage medium having a data synchronization program stored thereon, which when executed by a processor implements the steps of the data synchronization method as described above.

The method acquires data source information of a data source to be synchronized; determines the data source type according to the data source information; when the data source type is a distributed message data source, acquires configuration information corresponding to the data source to be synchronized; and pulls the data to be synchronized from the data source to be synchronized and writes it into the target database according to the configuration information. Compared with the existing approach, in which a distributed message data source supports only streaming synchronization, that is, each piece of data is processed as soon as it is received, the approach of the invention supports batch offline synchronization.

Further, the method acquires data source information of a first synchronous data source and, according to the data source information, acquires first synchronous data corresponding to the synchronous data task; acquires the data destination corresponding to the synchronous data task; when the data destination is a data warehouse, acquires preconfigured data synchronization partition information; generates a partition field array according to the data synchronization partition information; creates partitions in the data warehouse according to the partition field array to obtain target partitions; and uploads the first synchronous data to the target partition corresponding to the synchronous data task according to a preset data writing mode. This solves the problems that the data warehouse does not support multi-partition data synchronization and that partition fields cannot be specified in a customized manner.

Further, when the data source type is a relational data source, a preset storage database corresponding to the relational data source is obtained, together with a synchronous mapping relation table between the relational data source and the preset storage database. During synchronization, type conversion is performed on the data to be synchronized in the relational data source according to the synchronous mapping relation table to obtain converted synchronous data, and the converted synchronous data is stored into the preset storage database. This solves the technical problem in the prior art that, when a relational database is synchronized to a database or data warehouse such as Hive or ES, data synchronization may fail because some data types of the relational database are not supported, so that all data types of the databases can be synchronized successfully.

Further, the invention determines the data source type according to the data source information and, when the data source type is a time sequence type data source, acquires data synchronization configuration information; fragments the data to be synchronized in the data synchronization task according to the data synchronization configuration information to obtain a fragmentation result; and uploads the data to be synchronized to a distributed file system according to the fragmentation result to complete the data synchronization. This solves the technical problem in the prior art that data synchronization cannot be performed stably when the time sequence database holds too much data to be synchronized. Through dynamic fragmentation, full or incremental synchronization of the time sequence database remains stable even when each task involves a large amount of data.

Drawings

FIG. 1 is a schematic structural diagram of a data synchronization device in a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart of a first embodiment of the data synchronization method according to the present invention;

FIG. 3 is a flowchart of a second embodiment of the data synchronization method according to the present invention;

FIG. 4 is a flowchart of a third embodiment of the data synchronization method according to the present invention;

FIG. 5 is a flowchart of a fourth embodiment of the data synchronization method according to the present invention;

FIG. 6 is a block diagram of a first embodiment of the data synchronization apparatus according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a data synchronization device in a hardware operating environment according to an embodiment of the present invention.

As shown in FIG. 1, the data synchronization device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration shown in FIG. 1 does not constitute a limitation of the data synchronization device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.

As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a data synchronization program.

In the data synchronization device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 may be provided in the data synchronization device, which calls the data synchronization program stored in the memory 1005 through the processor 1001 and executes the data synchronization method provided by the embodiment of the present invention.

Based on the data synchronization device, an embodiment of the present invention provides a data synchronization method, and referring to fig. 2, fig. 2 is a schematic flow diagram of a first embodiment of the data synchronization method according to the present invention.

In this embodiment, the data synchronization method includes the following steps:

Step S10: acquiring data source information of a data source to be synchronized.

It should be noted that the execution subject of the embodiment may be a computing service device with data processing, network communication and program running functions, such as a mobile phone, a tablet computer, a personal computer, etc., or an electronic device or a data synchronization device capable of implementing the same or similar functions. The present embodiment and the following embodiments will be described below by taking a data synchronization device as an example.

It should be noted that the data source to be synchronized may be a database or a data warehouse that provides the data to be synchronized in the data synchronization process, for example, a distributed message data source such as Kafka, a time sequence database such as InfluxDB, TDEngine or OpenTSDB, or a data warehouse such as Hive or ES. In a specific implementation, if data in database A needs to be synchronized into database B, then database A is the data source to be synchronized, and database B may be referred to as the data destination or target database. The data source information may include type information of the data source to be synchronized, the data amount of the data to be synchronized, and the like.

Step S20: determining the data source type according to the data source information.

It should be noted that the data source type may be the category of the data source to be synchronized. For example, when the data source is Kafka, the corresponding data source type is a distributed message data source; when the data source is InfluxDB, TDEngine or OpenTSDB, the corresponding data source type is a time sequence database; and so on. It should be emphasized that this embodiment does not specifically limit the manner in which data source types are divided and determined.
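As a non-limiting illustration, determining the data source type from the data source information might be sketched as a simple lookup. The `SourceType` enumeration, the `name` key and the lookup table are hypothetical; in particular, the entry for a relational product is an assumed example, since the text names no relational database.

```python
from enum import Enum
from typing import Dict


class SourceType(Enum):
    DISTRIBUTED_MESSAGE = "distributed_message"
    TIME_SERIES = "time_series"
    RELATIONAL = "relational"


# Hypothetical lookup table keyed by lower-cased product name.
_KNOWN_SOURCES: Dict[str, SourceType] = {
    "kafka": SourceType.DISTRIBUTED_MESSAGE,
    "influxdb": SourceType.TIME_SERIES,
    "tdengine": SourceType.TIME_SERIES,
    "opentsdb": SourceType.TIME_SERIES,
    "mysql": SourceType.RELATIONAL,  # assumed example, not named in the text
}


def determine_source_type(source_info: Dict[str, str]) -> SourceType:
    """Map data source information to a data source type."""
    name = source_info["name"].lower()
    try:
        return _KNOWN_SOURCES[name]
    except KeyError:
        raise ValueError(f"unsupported data source: {source_info['name']!r}") from None
```

A lookup table is only one possible way to divide the types; the embodiment leaves the manner of division open.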

Step S30: when the data source type is a distributed message data source, acquiring configuration information corresponding to the data source to be synchronized.

It should be noted that the configuration information may be parameter information set before data synchronization, for example: the data source to be synchronized, the data destination, the start time of data synchronization, the start position of data synchronization, the size of the data volume to be synchronized, the synchronization mode (incremental synchronization or full synchronization), and the like.
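As a non-limiting illustration, the parameters listed above might be grouped into a single configuration object. The class name `SyncConfig`, the field names, the default values and the example source and destination strings are all hypothetical; the text only lists the kinds of parameters, not their representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncConfig:
    """Parameters set before data synchronization (hypothetical field names)."""
    source: str                        # data source to be synchronized
    destination: str                   # data destination (target database)
    start_time: Optional[str] = None   # start time of data synchronization
    start_position: int = 0            # start position of data synchronization
    data_volume: int = 0               # amount of data to be synchronized
    mode: str = "full"                 # "full" or "incremental"

    def __post_init__(self) -> None:
        # Reject synchronization modes other than the two named in the text.
        if self.mode not in ("full", "incremental"):
            raise ValueError(f"unknown synchronization mode: {self.mode!r}")


# Example values are made up for illustration.
config = SyncConfig(source="kafka://orders", destination="hive.orders", data_volume=990)
```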

In a specific implementation, when detecting that the data source type is a distributed message data source, the data synchronization device acquires parameter information, that is, the configuration information, set before data synchronization.

Step S40: pulling the data to be synchronized from the data source to be synchronized, and writing the data to be synchronized into a target database according to the configuration information.

It should be noted that the data to be synchronized may be data that needs to be synchronized. The target database may be the database into which the data to be synchronized is to be written. For example, if data B in database A needs to be written into database C, then database A is the data source, data B is the data to be synchronized, and database C is the target database.

Further, the pulling the data to be synchronized from the data source to be synchronized includes: and pulling the data to be synchronized in batches from the data source to be synchronized.

It should be noted that, the pulling of the data to be synchronized in batches may be performed by pulling the data to be synchronized from the data source to be synchronized multiple times according to a certain data size or number.

In a specific implementation, the data synchronization device may obtain a preset single-pull quantity and pull the data to be synchronized from the data source repeatedly until all of it has been pulled. For example, if there are 990 pieces of data to be synchronized and the preset single-pull quantity is 100 pieces, the device pulls 100 pieces the first time, 100 pieces the second time, and so on, until the remaining 90 pieces are pulled on the tenth pull. The 990 pulled pieces are then written into the target database.
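As a non-limiting illustration, the batch pulling in the example above might be sketched with a generator. The function name `pull_in_batches` is hypothetical, and an in-memory iterable stands in for the distributed message data source.

```python
from typing import Any, Iterable, Iterator, List


def pull_in_batches(records: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield successive batches of at most `batch_size` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Any] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final partial batch, e.g. the last 90 of 990 records.
        yield batch


# 990 records pulled 100 at a time: nine full batches and one batch of 90.
sizes = [len(batch) for batch in pull_in_batches(range(990), 100)]
```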

Further, to avoid a pull failure or missing data during the pull when there is too much data to be synchronized, step S40 includes:

Step S401: acquiring quantity information of the data to be synchronized according to the configuration information;

Step S402: determining the pulling times according to the quantity information, and pulling the data to be synchronized in batches from the data source to be synchronized according to the pulling times;

Step S403: writing the data to be synchronized into a target database according to the configuration information.

It should be noted that the quantity information may be information such as the number of pieces of data to be synchronized or the size of the data amount. The number of pulls may be the number of pulls of the data to be synchronized from the data source.

Further, in order to pull the data to be synchronized accurately, step S402 may include:

Step S4021: determining the pulling times according to the quantity information and preset single-batch pulling number information;

Step S4022: pulling the data to be synchronized in batches from the data source to be synchronized according to the preset single-batch pulling number information and the pulling times.

The preset single-batch pulling number information may be the number of pieces of data pulled each time, preset by the user; for example, it may be set to 100 or 200 pieces. The pulling times may then be obtained by dividing the number of pieces of data to be synchronized by the preset single-batch pulling number, rounding up when the division leaves a remainder so that a final partial batch is still pulled. Data may also be pulled according to a preset data size; for example, the data size pulled in a single batch may be preset to 1000 B, 1000 KB or 1000 MB, so that each pull retrieves that amount. In this case, the pulling times may be obtained by dividing the data size of the data to be synchronized by the preset single-batch data size.
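As a non-limiting illustration, the calculation of the pulling times might be sketched as a ceiling division. Rounding up is an assumption made here so that a remainder still gets its own pull; the function name `pull_count` is hypothetical. The same function applies whether the quantity is a number of pieces or a data size, provided both arguments use the same unit.

```python
def pull_count(total: int, per_batch: int) -> int:
    """Number of pulls needed to cover `total` units at `per_batch` units per pull."""
    if per_batch <= 0:
        raise ValueError("per_batch must be positive")
    if total < 0:
        raise ValueError("total must not be negative")
    # Integer ceiling division avoids floating point rounding on large sizes.
    return (total + per_batch - 1) // per_batch


pulls_by_count = pull_count(1000, 100)                    # 1000 pieces, 100 per pull
pulls_by_size = pull_count(5 * 1024 * 1024, 1000 * 1024)  # 5 MB of data, 1000 KB per pull
```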

Further, in order to accurately monitor whether data synchronization is completed, after step S4022, the method further includes the steps of:

Step S40221: when the data pulling corresponding to the pulling times is finished, performing data pulling on the data source to be synchronized again;

Step S40222: when no data is pulled within the preset time length, judging that the data to be synchronized has been pulled completely.

It should be noted that the preset time length may be a user-defined time limit applied to each data pull: if a pull takes longer than this time length, the pull is judged to have timed out and is ended. The data pulling corresponding to the pulling times is finished when the actual number of pulls equals the previously calculated pulling times. At that point, data is pulled from the data source to be synchronized once more, and if this pull times out without retrieving any data, the pulling of the data to be synchronized is judged to be complete.

In a specific implementation, for example, there are 1000 pieces of data and the user has preset 100 pieces per pull, so the calculated pulling times is 10. After 10 actual pulls, the data pulling corresponding to the pulling times (10) is judged to be finished, and data is pulled once more (the 11th pull in total). If the preset time length is 20 seconds and the 11th pull exceeds 20 seconds without retrieving any data, the data to be synchronized is judged to have been pulled completely.
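As a non-limiting illustration, the completion check in the example above might be sketched as follows. A standard-library `queue.Queue` stands in for the distributed message data source, and the function names `pull_batch` and `pull_all` are hypothetical; a real consumer client would replace the queue calls.

```python
import queue
from typing import Any, List, Tuple


def pull_batch(source: "queue.Queue", batch_size: int, timeout: float) -> List[Any]:
    """Wait up to `timeout` seconds for data, then take up to `batch_size` records."""
    batch: List[Any] = []
    try:
        batch.append(source.get(timeout=timeout))
        while len(batch) < batch_size:
            batch.append(source.get_nowait())
    except queue.Empty:
        pass  # timed out, or the source ran dry mid-batch
    return batch


def pull_all(source: "queue.Queue", planned_pulls: int, batch_size: int,
             idle_timeout: float) -> Tuple[List[Any], int]:
    """Run the planned pulls, then keep pulling until one pull returns nothing."""
    records: List[Any] = []
    pulls = 0
    for _ in range(planned_pulls):
        records.extend(pull_batch(source, batch_size, idle_timeout))
        pulls += 1
    while True:
        batch = pull_batch(source, batch_size, idle_timeout)
        pulls += 1
        if not batch:
            # Nothing arrived within the preset time length: pulling is complete.
            return records, pulls
        records.extend(batch)
```

With 1000 queued records, 10 planned pulls of 100 and a short timeout, the 11th pull comes back empty and ends the loop, mirroring the example above.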

Further, after the step S40222, in order to enable normal access to the data after data synchronization, the method further includes the steps of: acquiring a target database and a field mapping rule in the configuration information; and writing the data to be synchronized into the target database according to the field mapping rule.

It should be noted that the field mapping rule may specify the data type in which the same data is stored in different databases. For example, suppose database B does not support storing floating point data, while the data source A to be synchronized contains floating point data. If the data were synchronized directly, an error would occur when writing the data to be synchronized into database B.

Therefore, in a specific implementation, the data synchronization device may determine, according to the field mapping rule, the data type that the data to be synchronized should have when written into the target database, and then write the data into the target database in that data type. For example, if the field mapping rule shows that data C to be synchronized (floating point data) in data source A is stored as integer data in database B, then data C is converted into integer data before being written into database B, which completes the data synchronization.
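As a non-limiting illustration, applying a field mapping rule before writing might be sketched as follows. The rule format (field name mapped to a target type name), the `CONVERTERS` table and the function name `apply_field_mapping` are hypothetical. Note that Python's `int()` truncates the fractional part; the text does not say whether truncation or rounding is intended.

```python
from typing import Any, Callable, Dict

# Hypothetical converters keyed by target type name.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "int": int,
    "float": float,
    "str": str,
}


def apply_field_mapping(row: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Convert each mapped field of `row` to the type the target database stores."""
    converted: Dict[str, Any] = {}
    for name, value in row.items():
        target_type = mapping.get(name)
        if target_type is None:
            converted[name] = value  # no rule for this field: keep the value as-is
        else:
            converted[name] = CONVERTERS[target_type](value)
    return converted


# A floating point field stored as an integer in the target database.
row_for_target = apply_field_mapping({"id": "7", "price": 3.7}, {"price": "int"})
```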

Further, in order to timely end the process of the system and release the system resources, after the step S40222, the method further includes the steps of: and sending a data synchronization termination instruction to a preset Flink process so that the preset Flink process stops running after receiving the data synchronization termination instruction.

It should be noted that the preset Flink process may be a Flink process for implementing data synchronization in this embodiment. The data synchronization termination instruction may be a command to terminate the preset Flink process.

It should be understood that Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and Flink's pipelined runtime system can execute both batch and stream processing programs. The data synchronization in this embodiment can be completed through the Flink process. In Flink stream processing, however, the native data pulling program does not judge whether the data has ended, so the Flink process may not end after data pulling is completed, and consequently the offline synchronization operator may not end either. Therefore, when data synchronization is completed, a data synchronization termination instruction needs to be actively sent to the preset Flink process, so that the preset Flink process stops running after receiving it. The offline synchronization operator is used for reading the task on the Hadoop resource manager, namely YARN; when the task on YARN is in a completed state, the data synchronization task can be considered fully executed. When the offline synchronization operator finds that the state on YARN is terminated, the page shows that data synchronization is complete.

This embodiment obtains the data source information of the data source to be synchronized; determines the data source type according to the data source information; when the data source type is a distributed message data source, acquires configuration information corresponding to the data source to be synchronized; acquires quantity information of the data to be synchronized according to the configuration information; determines the pulling times according to the quantity information and the preset single-batch pulling number information; pulls the data to be synchronized in batches from the data source to be synchronized according to the preset single-batch pulling number information and the pulling times; and writes the data to be synchronized into the target database according to the configuration information. Compared with the existing mode, in which a distributed message data source can only support streaming synchronization, that is, each piece of data is processed as soon as it is received, this embodiment judges the data source type and, when the data source type is a distributed message data source, pulls the data in batches and writes the data to be synchronized into the target database according to the configuration information, so that distributed message data sources support batch offline synchronization.
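As a rough illustration of how the pulling times might follow from the quantity information and the single-batch pulling number, consider the sketch below. Rounding up, so that a final partial batch is still pulled, is an assumption; the function names are hypothetical, and an in-memory list stands in for the distributed message data source.

```python
from typing import Iterator, List, Sequence


def pull_count(total: int, batch_size: int) -> int:
    """Number of pulls needed to cover `total` records, rounding up."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (total + batch_size - 1) // batch_size


def pull_in_batches(source: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield successive batches; the list stands in for the real data source."""
    for index in range(pull_count(len(source), batch_size)):
        start = index * batch_size
        yield list(source[start:start + batch_size])


batches = list(pull_in_batches(list(range(7)), 3))
```

Here 7 records with a single-batch size of 3 give three pulls, the last one containing a single record.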

Referring to fig. 3, fig. 3 is a flowchart illustrating a data synchronization method according to a second embodiment of the present invention.

Based on the first embodiment described above, in the present embodiment, after the step S10, the method includes:

Step S50: acquiring first synchronous data corresponding to the synchronous data task according to the data source information.

It should be noted that the synchronous data task may be a data synchronization job to be executed, and may include a data source to be synchronized, a data destination, first synchronization data, parameter information during synchronization, and the like. The first synchronization data may be data stored in a data source that needs to be synchronized to a data destination.

Step S60: uploading the first synchronous data to a target partition corresponding to the synchronous data task according to a preset data writing mode.

It should be noted that the target partition may be the area, in the data destination corresponding to the synchronous data task, in which the first synchronization data is stored. The data destination may be a data warehouse or destination database in which the first synchronization data is stored, such as the data warehouse tool Hive. The preset data writing mode may be a data synchronization mode set before data synchronization. For example, it may be a mode in which the first synchronization data is sliced, pulled from the data source in batches, and then stored into the data destination; or it may be a mode in which data is synchronized to the data destination by uploading files. For example, because Hive data is stored in HDFS (Hadoop Distributed File System), data can be written into a Hive table by uploading a file into HDFS.

Further, in order to support synchronization of multiple partitions, so as to facilitate classification of data and improve reading and writing speed of data, the step S60 includes:

Step S601: acquiring the data destination corresponding to the synchronous data task;

Step S602: when the data destination is a data warehouse, acquiring preconfigured data synchronization partition information;

Step S603: creating partitions in the data warehouse according to the data synchronization partition information to obtain target partitions.

It should be noted that the data destination may be the data warehouse or database in which the first synchronization data of the synchronous data task is stored after synchronization. The data warehouse may be any warehouse capable of storing data, such as the Hadoop data warehouse tool Hive or the Lucene-based search server ES. The data synchronization partition information may include the partition type set by a user, the parameter information of a partition, and the like, where the parameter information may be information such as a partition name, a partition field, a time formatting standard, and the subscript of the partition field. The target partition may be a partition created in the data warehouse based on the data synchronization partition information for storing the first synchronization data of the data synchronization task.

Further, in order to create the partition more quickly and accurately, the step S603 may include:

Step S6031: generating a partition field array according to the data synchronization partition information;

Step S6032: creating partitions in the data warehouse according to the partition field arrays to obtain target partitions.

It should be noted that the partition field array may be an array formed by extracting the parameter information in the data synchronization partition information. For example, suppose the data synchronization partition information describes two partitions: the partition named name 1 uses the current system time as its partition field, with the time format year-month-day; the partition named name 2 uses fields in the data source as its partition fields, namely fields A and B, whose subscripts are 3 and 4 respectively. The partition field arrays generated according to the data synchronization partition information may then be [name 1, system time, year-month-day] and [name 2, A, 3, B, 4]. The generated partition field array may also take other forms, and the embodiment is not limited herein. When the data warehouse is Hive, creating partitions in the data warehouse according to the partition field array to obtain the target partition may be creating a partition in a Hive table according to the partition field array to obtain the target partition.
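The example arrays above can be reproduced with a short sketch. The dictionary keys (`type`, `name`, `time_field`, `time_format`, `fields`) and the function name are hypothetical stand-ins for however the data synchronization partition information is actually represented.

```python
from typing import Any, Dict, List


def build_partition_field_array(info: Dict[str, Any]) -> List[Any]:
    """Flatten one partition description into a partition field array."""
    if info["type"] == "time":
        return [info["name"], info["time_field"], info["time_format"]]
    if info["type"] == "source_field":
        array: List[Any] = [info["name"]]
        for field_name, subscript in info["fields"]:
            array += [field_name, subscript]
        return array
    raise ValueError(f"unknown partition type: {info['type']!r}")


partition_info = [
    {"type": "time", "name": "name 1",
     "time_field": "system time", "time_format": "year-month-day"},
    {"type": "source_field", "name": "name 2",
     "fields": [("A", 3), ("B", 4)]},
]
arrays = [build_partition_field_array(info) for info in partition_info]
```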

Further, in order to improve the reading and writing speed of the data, the step S6031 may include:

Step S60311: acquiring the partition type in the data synchronization partition information;

Step S60312: judging whether the partition type is a data source field partition or a time field partition;

Step S60313: when the partition type is a time field partition, generating a partition field array according to the data synchronization partition information.

It should be noted that the data source field partition may be a partition according to a data field in a data source, the time field partition may be a partition according to time, and the time field partition includes a system time field partition and a data source time field partition. The system time field partitioning may be partitioning according to the time of the system. The data source time field partition may be based on the time of data storage in the data source or the time carried by the data itself. When the partition type is a time field partition, the generating a partition field array according to the data synchronization partition information may be: when the partition type is a time field partition, acquiring a partition name and a time formatting standard according to the data synchronization partition information; and generating a partition field array according to the partition name, the time formatting standard and the partition type.

It should be understood that, when the partition type is time-related, the time data in the first synchronization data may take different forms; for example, some data represents time as a timestamp and some as year, month, and day. Therefore, when partitioning by time, the relevant time representations need to be formatted uniformly into the same form, and the user is required to set a time formatting standard so that the time representations of the synchronized data are consistent.
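A minimal sketch of such uniform formatting follows. It assumes only two input forms, a Unix timestamp in seconds and a `YYYY-MM-DD` string, and interprets timestamps in UTC; the function name and the use of a `strftime` pattern as the time formatting standard are illustrative assumptions.

```python
from datetime import datetime, timezone
from typing import Union


def normalize_time(value: Union[int, float, str], time_format: str = "%Y-%m-%d") -> str:
    """Render a timestamp or a year-month-day string in one common format."""
    if isinstance(value, (int, float)):
        moment = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        text = value.strip()
        if text.isdigit():
            # A timestamp that arrived as text.
            moment = datetime.fromtimestamp(int(text), tz=timezone.utc)
        else:
            moment = datetime.strptime(text, "%Y-%m-%d")
    return moment.strftime(time_format)


# 1637668800 is 2021-11-23 12:00:00 UTC.
from_timestamp = normalize_time(1637668800)
from_date_text = normalize_time("2021-11-23")
```

Both inputs end up with the same partition value, which is the point of requiring a single formatting standard.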

Further, in order to distinguish the system time field partition from the data source time field partition, after the step S60313, the method further includes the steps of: when the time field partition is the system time field partition, generating a partition field array according to the partition name, the time formatting standard and the system time field; and when the time field partition is the data source time field partition, generating a partition field array according to the partition name, the time formatting standard and the data source time field.

Further, in order to quickly locate the partition field in the data source field partition, after step S60312, the method further includes the steps of: when the partition type is a data source field partition, acquiring a partition name, a binding field name and a binding field subscript according to the data synchronization partition information; and generating a partition field array according to the partition name, the binding field name and the binding field subscript.

It should be noted that the binding field name may be the name of the field according to which partitioning is performed, and the binding field subscript may be the subscript corresponding to the binding field. For example, in a data table whose first column is an ID field and whose second column is a name field, the subscript of the ID field is 0, the subscript of the name field is 1, and so on.
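To show why the subscript is useful, the sketch below looks up partition values in a row directly by position. It assumes the array layout [partition name, field name, subscript, field name, subscript, ...] from the earlier example; the function name and the sample row are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def partition_values(row: Sequence[Any], field_array: List[Any]) -> Dict[str, Any]:
    """Map each binding field name to the row value found at its subscript."""
    pairs = field_array[1:]  # skip the partition name
    if len(pairs) % 2 != 0:
        raise ValueError("expected alternating field names and subscripts")
    return {pairs[i]: row[pairs[i + 1]] for i in range(0, len(pairs), 2)}


row = [7, "alice", "x", "north", "2021"]
by_id = partition_values(row, ["by_id", "ID", 0])
by_a_and_b = partition_values(row, ["name 2", "A", 3, "B", 4])
```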

This embodiment acquires data source information of the data source to be synchronized, and acquires first synchronous data corresponding to a synchronous data task according to the data source information; acquires the data destination corresponding to the synchronous data task; when the data destination is a data warehouse, acquires preconfigured data synchronization partition information; generates a partition field array according to the data synchronization partition information; creates partitions in the data warehouse according to the partition field array to obtain target partitions; and uploads the first synchronous data to the target partition corresponding to the synchronous data task according to a preset data writing mode. By acquiring the first synchronous data corresponding to the synchronous data task and the preconfigured data synchronization partition information, generating the partition field array from the data synchronization partition information, creating partitions in the data warehouse to obtain the target partition, and uploading the first synchronous data to the target partition according to the preset data writing mode, this embodiment solves the problems that the data warehouse does not support multi-partition data synchronization and that partition fields cannot be specified in a customized manner.

Referring to fig. 4, fig. 4 is a flowchart illustrating a data synchronization method according to a third embodiment of the present invention.

Based on the foregoing embodiments, in this embodiment, after the step S20, the method further includes:

Step S210: when the data source type is a relational data source, acquiring a preset storage database corresponding to the relational data source.

It should be noted that the relational data source may be a relational database as a data source for providing data to be synchronized when data is synchronized. The preset storage database may be a data destination in data synchronization, that is, a database or a data warehouse that needs to acquire data to be synchronized from a data source.

It should be understood that a relational database refers to a database that uses a relational model to organize data, and stores data in rows and columns, and a series of rows and columns of the relational database are called tables, and a set of tables constitutes the database.

Step S220: acquiring a synchronous mapping relation table corresponding to the relational data source and the preset storage database.

It should be noted that the synchronization mapping relationship table may record the field types in which the data to be synchronized is stored in the relational data source and in the preset storage database, respectively, when data synchronization is performed between the relational data source and the preset storage database.

It should be understood that, during data synchronization, some data in the data source database may not be directly synchronizable to the destination database, because the field storage rules of the destination database may not support the storage of certain data fields. For example, if the destination database F does not support the storage of floating-point data and the data source E to be synchronized contains floating-point data, an error may occur when the data to be synchronized is written into database F. In that case, the floating-point data in E needs to be converted into a data type supported by database F before being stored, which requires the synchronization mapping relationship table corresponding to the relational data source and the preset storage database. Table 1 gives the correspondence between the field types of the MySQL database and Hive, and Table 2 gives the correspondence between the field types of the Oracle database and Hive. According to Table 1, when the source database is a MySQL database and the data destination is Hive, data to be synchronized of type TINYINT needs to be stored in Hive as TINYINT, SMALLINT, INT, or BIGINT.

Table 1: Field type correspondence between the MySQL database and Hive

Table 2: Field type correspondence between the Oracle database and Hive

Step S230: storing the data to be synchronized in the relational data source into the preset storage database according to the synchronous mapping relation table.

It should be understood that, storing the data to be synchronized in the relational data source into the preset storage database according to the synchronization mapping relationship table may be to perform type conversion on the data in the relational data source according to a correspondence between a data type in the relational data source in the synchronization mapping relationship table and a data type in the preset storage database, and then store the data in the preset storage database.

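Further, in order to avoid a data synchronization failure caused by data types that cannot be read during data synchronization, the step S230 may include:

acquiring the data to be synchronized in the relational data source according to a preset receiving rule;

performing type conversion on the data to be synchronized according to the synchronous mapping relation table to obtain converted synchronous data;

and storing the converted synchronous data into the preset storage database.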

It should be noted that the preset receiving rule may be a reading method for reading data in the data source. Some data types cannot be read by the conventional reading method and can only be read by a specific method; when such a data type is encountered, it needs to be read by that specific method.

In a specific implementation, for example, in the synchronization mapping relationship table for a PostgreSQL data source database, when the data type in the data source database is SMALLINT or INT2, the corresponding data type in the preset storage database is SMALLINT, INT, or BIGINT. In this case, the data to be synchronized of type SMALLINT or INT2 in the data source database needs to be converted into SMALLINT, INT, or BIGINT to obtain the converted data, and the converted data is stored in the preset storage database.
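One possible way to represent such a mapping table is as a nested lookup, sketched below. Only the entries stated in the text are included (MySQL TINYINT, PostgreSQL SMALLINT and INT2); the dictionary layout, the function name, and the choice of the first listed type as the default are assumptions.

```python
from typing import Dict, Optional, Tuple

# Source database -> source type -> types allowed in the preset storage database.
SYNC_TYPE_MAPPING: Dict[str, Dict[str, Tuple[str, ...]]] = {
    "mysql": {"TINYINT": ("TINYINT", "SMALLINT", "INT", "BIGINT")},
    "postgresql": {
        "SMALLINT": ("SMALLINT", "INT", "BIGINT"),
        "INT2": ("SMALLINT", "INT", "BIGINT"),
    },
}


def resolve_target_type(source_db: str, source_type: str,
                        preferred: Optional[str] = None) -> str:
    """Pick a storage type that the mapping table allows for this source type."""
    allowed = SYNC_TYPE_MAPPING[source_db.lower()][source_type.upper()]
    if preferred is None:
        return allowed[0]  # assumption: default to the first listed type
    if preferred.upper() not in allowed:
        raise ValueError(f"{preferred} is not allowed for {source_type}")
    return preferred.upper()


default_type = resolve_target_type("PostgreSQL", "int2")
widened_type = resolve_target_type("PostgreSQL", "int2", preferred="bigint")
```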

In this embodiment, when the data source type is a relational data source, a preset storage database corresponding to the relational data source is obtained; a synchronous mapping relation table corresponding to the relational data source and the preset storage database is acquired; and the data to be synchronized in the relational data source is stored into the preset storage database according to the synchronous mapping relation table. By acquiring the synchronous mapping relation table corresponding to the relational data source and the preset storage database, this embodiment performs type conversion on the data to be synchronized in the relational data source according to the table during synchronization to obtain converted synchronous data, and stores the converted synchronous data into the preset storage database. This solves the technical problem in the prior art that data synchronization may fail because the data types of a relational database are not supported when the relational database is synchronized to a database or data warehouse such as Hive or ES, so that the data types of each database can be successfully synchronized.

Referring to fig. 5, fig. 5 is a flowchart illustrating a data synchronization method according to a fourth embodiment of the present invention.

Step S240: when the data source type is a time-series data source, acquiring data synchronization configuration information.

It should be noted that the time-series data source may be a time series database serving as the data source that provides the data to be synchronized during data synchronization. The data synchronization configuration information may include information such as the data destination of the data synchronization, the data to be synchronized, whether the synchronization is incremental or full, and the start position of incremental synchronization. The data destination may be a database or a data warehouse that needs to acquire the data to be synchronized from the data source.

It should be understood that a time series database is a database mainly used for processing time-stamped data, that is, data that changes in time order; such time-stamped data is also called time series data.

Step S250: fragmenting the data to be synchronized in the data synchronization task according to the data synchronization configuration information to obtain a fragmentation result.

It should be noted that fragmenting the data to be synchronized in the data synchronization task according to the data synchronization configuration information may be fragmenting it according to a fragmentation parameter in the data synchronization configuration information, where the fragmentation parameter may be the data volume or the number of pieces of data in each fragment. During data synchronization, adaptive fragmentation may also be performed dynamically according to the data volume of the data to be synchronized; for example, the larger the data volume, the larger each fragment.

In a specific implementation, for example, if there are 900 pieces of data to be synchronized, fragmenting them may divide the 900 pieces into 9 fragments of 100 pieces each. During synchronization, the fragments are read in sequence, and after each fragment is read the read position is recorded, so that reading of the next fragment starts from the position where the previous read ended. If there are 5000 pieces of data to be synchronized, fragmenting them may divide the 5000 pieces into 5 fragments of 1000 pieces each, which realizes dynamic adaptive fragmentation.
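The two examples can be reproduced with the sketch below. The text does not give the actual adaptive rule, so the threshold in `fragment_size_for` is a hypothetical rule chosen only to match the 900 and 5000 cases; each fragment is represented as a (start, end) pair so that the recorded read position is simply the end of the previous fragment.

```python
from typing import List, Tuple


def fragment_size_for(total: int) -> int:
    """Hypothetical adaptive rule: larger data volumes get larger fragments."""
    return 100 if total < 1000 else 1000


def make_fragments(total: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges covering `total` records."""
    size = fragment_size_for(total)
    fragments: List[Tuple[int, int]] = []
    position = 0  # recorded read position
    while position < total:
        end = min(position + size, total)
        fragments.append((position, end))
        position = end  # the next fragment starts where this one ended
    return fragments


small_task = make_fragments(900)
large_task = make_fragments(5000)
```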

Step S260: uploading the data to be synchronized to a distributed file system according to the fragmentation result.

It should be noted that the distributed file system may be a Hadoop distributed file system HDFS or a system or a database supporting data or file storage. The uploading of the data to be synchronized to the distributed file system according to the fragmentation result may be reading the data to be synchronized from the time-sequence data source according to the fragmentation result, and uploading the read data to be synchronized to a target database in a data synchronization task or a corresponding distributed file system in a data warehouse.

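Further, in order to save time and reduce the workload of data synchronization, the step S250 includes:

judging whether the data synchronization task is incremental synchronization according to the data synchronization configuration information;

if not, determining the data to be synchronized according to the data synchronization configuration information;

and fragmenting the data to be synchronized according to the data volume of the data to be synchronized and a preset fragmentation rule to obtain a fragmentation result.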

It should be understood that incremental synchronization and full synchronization are two modes of database synchronization. Full synchronization synchronizes all data at once, whereas incremental synchronization synchronizes only the parts in which the two databases differ. In order to save time and reduce the workload of data synchronization, when data synchronization is performed it is determined whether the data synchronization task is incremental synchronization. If it is, the synchronization start position set by the user in the data synchronization configuration information is acquired, and data synchronization starts from that position. If the task is incremental synchronization but the user has not set a start position, it is necessary to determine whether this is the first synchronization. If it is the first synchronization, no data from the data source has yet been stored in the destination database or data warehouse, so the synchronization is effectively a full synchronization, and the data is synchronized in the full synchronization mode. If it is not the first synchronization, the destination database or data warehouse already stores part of the data in the data source, so the data in the destination needs to be compared with the data in the data source and only the differing parts are synchronized, which saves time and workload. After the data to be synchronized is read, it can be acquired through a Flink interface or the like, parsed and encapsulated, and then carried by Flink's native object array Row, thereby completing the data synchronization.

In a specific implementation, the data synchronization device determines from the data synchronization task whether this synchronization is incremental. If it is, the device acquires the start position of data synchronization set by the user. If the user has not set a start position, the device determines whether this is the first synchronization, and if so synchronizes the data in the full synchronization mode. If it is not the first synchronization, the device acquires the data information in the data source database and the data information in the target database, and synchronizes the data in the data source database to the target database according to that information, that is, only the parts in which the two databases differ are synchronized.
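The decision flow above can be summarized in a small sketch. The function name, parameters, and returned mode labels are hypothetical; only the branching order is taken from the text.

```python
from typing import Optional


def choose_sync_mode(incremental: bool, start_position: Optional[int],
                     first_sync: bool) -> str:
    """Select how this synchronization run should proceed."""
    if not incremental:
        return "full"
    if start_position is not None:
        # The user configured where incremental synchronization starts.
        return "incremental_from_position"
    if first_sync:
        # Nothing has been stored in the destination yet, so synchronize everything.
        return "full"
    # Compare source and destination and synchronize only the differing parts.
    return "incremental_diff"


mode = choose_sync_mode(incremental=True, start_position=None, first_sync=True)
```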

This embodiment determines the data source type according to the data source information, and acquires data synchronization configuration information when the data source type is a time-series data source; fragments the data to be synchronized in the data synchronization task according to the data synchronization configuration information to obtain a fragmentation result; and uploads the data to be synchronized to a distributed file system according to the fragmentation result. By fragmenting the data to be synchronized and uploading it to the distributed file system according to the fragmentation result to complete the data synchronization, this embodiment solves the technical problem in the prior art that data synchronization cannot be performed stably when the time series database contains too much data to be synchronized. Through dynamic fragmentation, full or incremental synchronization of the time series database remains stable even when the amount of data in each task is very large.

Referring to fig. 6, fig. 6 is a block diagram of a first embodiment of the data synchronization apparatus according to the present invention.

As shown in fig. 6, the data synchronization apparatus according to the embodiment of the present invention includes:

an obtaining module 10, configured to obtain data source information of a data source to be synchronized;

a determining module 20, configured to determine a data source type according to the data source information;

a configuration information determining module 30, configured to obtain configuration information corresponding to the data source to be synchronized when the data source type is a distributed message data source;

and the synchronization module 40 is configured to pull the data to be synchronized from the data source to be synchronized, and write the data to be synchronized into a target database according to the configuration information.

The embodiment acquires data source information of a data source to be synchronized; determining the type of the data source according to the data source information; when the data source type is a distributed message data source, acquiring configuration information corresponding to the data source to be synchronized; and pulling the data to be synchronized in batches from the data source to be synchronized, and writing the data to be synchronized into the target database according to the configuration information. According to the invention, when the data source type is the distributed message data source, the data to be synchronized is pulled from the data source to be synchronized in batches, and the data to be synchronized is written into the target database according to the configuration information. Compared with the existing distributed message data source which can only support streaming synchronization, namely, a mode of processing each time one data is received, the mode of the invention can support batch offline synchronization.

It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.

In addition, for technical details that are not described in detail in this embodiment, reference may be made to the data synchronization method provided in any embodiment of the present invention, and details are not described herein again.

Based on the first embodiment of the data synchronization apparatus of the present invention, a second embodiment of the data synchronization apparatus of the present invention is provided.

In this embodiment, the synchronization module 40 is further configured to obtain quantity information of data to be synchronized according to the configuration information; determining the number of pulling times according to the quantity information, and pulling the data to be synchronized in batches from the data source to be synchronized according to the number of pulling times; and writing the data to be synchronized into a target database according to the configuration information.

Further, the synchronization module 40 is further configured to determine the number of pulling times according to the quantity information and preset single batch number of pulling pieces information; and pulling the data to be synchronized in batches from the data source to be synchronized according to the preset single-batch pulling number information and the pulling times.

Further, the synchronization module 40 is further configured to perform data pulling on the data source to be synchronized again when the data pulling corresponding to the pulling times is completed; and when the data is not pulled within the preset time length, judging that the data to be synchronized is pulled completely.

Further, the synchronization module 40 is further configured to obtain a target database and a field mapping rule in the configuration information; and writing the data to be synchronized into the target database according to the field mapping rule.

Further, the synchronization module 40 is further configured to send a data synchronization termination instruction to a preset Flink process, so that the preset Flink process stops running after receiving the data synchronization termination instruction.

Further, the obtaining module 10 is further configured to obtain first synchronization data corresponding to a synchronization data task according to the data source information; and uploading the first synchronous data to a target partition corresponding to the synchronous data task according to a preset data writing mode.

Further, the obtaining module 10 is further configured to obtain the data destination corresponding to the synchronous data task; when the data destination is a data warehouse, acquiring preconfigured data synchronization partition information; and creating partitions in the data warehouse according to the data synchronization partition information to obtain target partitions.

Further, the obtaining module 10 is further configured to generate a partition field array according to the data synchronization partition information; and creating partitions in the data warehouse according to the partition field arrays to obtain target partitions.

Further, the obtaining module 10 is further configured to obtain a partition type in the data synchronization partition information; judging whether the partition type is a data source field partition or a time field partition; and when the partition type is a time field partition, generating a partition field array according to the data synchronization partition information.

Further, the obtaining module 10 is further configured to, when the partition type is a data source field partition, obtain a partition name, a binding field name, and a binding field subscript according to the data synchronization partition information; and generating a partition field array according to the partition name, the binding field name and the binding field subscript.
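The generation of a partition field array for the two partition types can be sketched as follows. The dictionary keys (`type`, `name`, `field`, `index`, `format`), the class `PartitionField`, and the default time format are hypothetical and chosen only for illustration; the embodiment does not specify the layout of the data synchronization partition information.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class PartitionField:
    name: str                          # partition name
    bound_field: Optional[str] = None  # binding field name (source field partition)
    bound_index: Optional[int] = None  # binding field subscript (source field partition)
    time_format: Optional[str] = None  # strftime pattern (time field partition)


def build_partition_fields(partition_info: List[dict]) -> List[PartitionField]:
    """Turn partition configuration entries into a partition field array."""
    fields: List[PartitionField] = []
    for item in partition_info:
        if item["type"] == "source_field":
            fields.append(
                PartitionField(item["name"], item["field"], item["index"])
            )
        elif item["type"] == "time_field":
            # Assumed default: one partition per day.
            fields.append(
                PartitionField(item["name"], time_format=item.get("format", "%Y%m%d"))
            )
        else:
            raise ValueError(f"unknown partition type: {item['type']!r}")
    return fields


def partition_values(
    fields: List[PartitionField], row: list, now: datetime
) -> Dict[str, str]:
    """Compute the partition value of each partition field for one row."""
    values: Dict[str, str] = {}
    for f in fields:
        if f.bound_index is not None:
            values[f.name] = str(row[f.bound_index])
        else:
            values[f.name] = now.strftime(f.time_format)
    return values
```

Under these assumptions, a source field partition takes its value from the row position given by the binding field subscript, while a time field partition takes its value from the synchronization time.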

Further, the obtaining module 10 is further configured to obtain a preset storage database corresponding to the relational data source when the data source type is the relational data source; acquiring a synchronous mapping relation table corresponding to the relational data source and the preset storage database; and storing the data to be synchronized in the relational data source into the preset storage database according to the synchronous mapping relation table.

Further, the obtaining module 10 is further configured to obtain data to be synchronized in the relational data source according to a preset receiving rule; performing type conversion on the data to be synchronized according to the synchronous mapping relation table to obtain converted synchronous data; and storing the converted synchronous data into the preset storage database.
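The type conversion according to the synchronous mapping relation table can be illustrated with the sketch below. The table layout (source column mapped to a target column and a target type name) and the set of supported type names are assumptions for this example, not details given by the embodiment.

```python
from datetime import datetime
from decimal import Decimal
from typing import Callable, Dict, Mapping, Tuple

# Hypothetical target type names and the Python callables that produce them.
CONVERTERS: Dict[str, Callable[[object], object]] = {
    "int": int,
    "float": float,
    "str": str,
    "decimal": Decimal,
    "datetime": datetime.fromisoformat,
}


def convert_row(
    row: Mapping[str, object],
    mapping: Mapping[str, Tuple[str, str]],
) -> Dict[str, object]:
    """Rename and convert one source row according to the mapping table.

    mapping: source column -> (target column, target type name).
    NULL values (None) are passed through unchanged.
    """
    converted: Dict[str, object] = {}
    for src, (dst, type_name) in mapping.items():
        value = row.get(src)
        converted[dst] = None if value is None else CONVERTERS[type_name](value)
    return converted
```

The converted rows can then be stored into the preset storage database; an unknown type name raises `KeyError` in this sketch rather than silently storing an unconverted value.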

Further, the obtaining module 10 is further configured to obtain data synchronization configuration information when the data source type is a time-series data source; according to the data synchronization configuration information, slicing data to be synchronized in the data synchronization task to obtain a slicing result; and uploading the data to be synchronized to a distributed file system according to the fragmentation result.

Further, the obtaining module 10 is further configured to determine whether the data synchronization task is incremental synchronization according to the data synchronization configuration information; if not, determining the data to be synchronized according to the data synchronization configuration information; and fragmenting the data to be synchronized according to the data volume of the data to be synchronized and a preset fragmentation rule to obtain a fragmentation result.

Further, the obtaining module 10 is further configured to determine, when the data synchronization task is incremental synchronization, an initial position of the data synchronization task according to data synchronization configuration information; and determining data to be synchronized according to the data synchronization configuration information and the initial position, and executing the step of fragmenting the data to be synchronized according to the data volume of the data to be synchronized and a preset fragmentation rule to obtain a fragmentation result.
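The fragmentation step for full and incremental synchronization can be sketched as follows. The preset fragmentation rule is assumed here to be a fixed number of rows per fragment, and positions are assumed to be row offsets; both are illustrative choices, as are the function and parameter names.

```python
from typing import List, Tuple


def plan_fragments(
    total_rows: int,
    fragment_size: int,
    incremental: bool = False,
    start_position: int = 0,
) -> List[Tuple[int, int]]:
    """Split the data to be synchronized into half-open ranges [lo, hi).

    For a full synchronization the ranges start at 0. For an incremental
    synchronization they start at `start_position`, the initial position
    taken from the data synchronization configuration information.
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    first = start_position if incremental else 0
    return [
        (lo, min(lo + fragment_size, total_rows))
        for lo in range(first, total_rows, fragment_size)
    ]
```

Each resulting range could then be uploaded to the distributed file system as a separate unit of work. When the initial position is already at or beyond the end of the data, the fragment list is empty and nothing is uploaded.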

Other embodiments or specific implementation manners of the data synchronization apparatus of the present invention may refer to the above method embodiments, and are not described herein again.

Furthermore, an embodiment of the present invention further provides a storage medium, where the storage medium stores a data synchronization program, and the data synchronization program, when executed by a processor, implements the steps of the data synchronization method described above.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent process modifications made using the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims

1. A data synchronization method, characterized in that the data synchronization method comprises the steps of:

acquiring data source information of a data source to be synchronized;

determining the type of a data source according to the data source information;

2. The data synchronization method of claim 1, wherein the pulling the data to be synchronized from the data source to be synchronized comprises:

3. The data synchronization method of claim 2, wherein the pulling the data to be synchronized in batches from the data source to be synchronized comprises:

4. The data synchronization method according to claim 3, wherein the step of determining the number of times of pulling according to the quantity information and pulling the data to be synchronized in batches from the data source to be synchronized according to the number of times of pulling comprises:

5. The data synchronization method according to claim 3, wherein after the step of pulling the data to be synchronized in batches from the data source to be synchronized according to the preset single-batch pulling number information and the pulling times, the method comprises:

6. The data synchronization method according to claim 5, wherein the step of determining that pulling of the data to be synchronized is complete when no data is pulled within the preset time period comprises:

7. The data synchronization method according to claim 5, wherein after the step of determining that pulling of the data to be synchronized is complete when no data is pulled within the preset time period, the method further comprises:

8. The data synchronization method of claim 1, wherein after the step of obtaining the data source information of the data source to be synchronized, the method further comprises:

9. The data synchronization method according to claim 8, wherein before the step of uploading the first synchronization data to the target partition corresponding to the synchronization data task according to a preset data writing manner, the method further comprises:

acquiring a data destination corresponding to the synchronous data task;

10. The data synchronization method of claim 9, wherein the step of creating a partition in a data warehouse based on the data synchronization partition information to obtain a target partition comprises:

11. The data synchronization method of claim 10, wherein the step of generating a partition field array according to the data synchronization partition information comprises:

acquiring the partition type in the data synchronization partition information;

12. The data synchronization method of claim 10, wherein the step of determining whether the partition type is a data source field partition or a time field partition further comprises:

13. The data synchronization method of claim 1, wherein after the step of determining the data source type according to the data source information, the method further comprises:

14. The data synchronization method according to claim 13, wherein the step of storing the data to be synchronized in the relational data source into the preset storage database according to the synchronization mapping relationship table comprises:

and storing the converted synchronous data into the preset storage database.

15. The data synchronization method of claim 1, wherein after the step of determining the data source type according to the data source information, the method further comprises:

16. The data synchronization method according to claim 15, wherein the step of fragmenting the data to be synchronized in the data synchronization task according to the data synchronization configuration information to obtain a fragmentation result comprises:

17. The data synchronization method of claim 16, wherein after the step of determining whether the data synchronization task is incremental synchronization according to the data synchronization configuration information, the method further comprises:

18. A data synchronization apparatus, characterized in that the data synchronization apparatus comprises:

19. A data synchronization apparatus, characterized in that the apparatus comprises: a memory, a processor, and a data synchronization program stored on the memory and executable on the processor, the data synchronization program being configured to implement the steps of the data synchronization method of any one of claims 1 to 17.

20. A storage medium having stored thereon a data synchronization program which, when executed by a processor, implements the steps of the data synchronization method of any one of claims 1 to 17.