CN116595101B

CN116595101B - Data synchronization method, device, equipment and computer readable storage medium

Info

Publication number: CN116595101B
Application number: CN202310835733.6A
Authority: CN
Inventors: 唐立志; 罗英群; 肖飞秋
Original assignee: Shenzhen Skyworth Smart Technology Co ltd
Current assignee: Shenzhen Skyworth Smart Technology Co ltd
Priority date: 2023-07-10
Filing date: 2023-07-10
Publication date: 2023-10-24
Anticipated expiration: 2043-07-10
Also published as: CN116595101A

Abstract

The application discloses a data synchronization method, apparatus, device and computer-readable storage medium. The data synchronization method is applied to a data synchronization framework and comprises the following steps: reading at least one data record from an original data table, and determining an original data field included in the data record; determining an original field type of the original data field, converting the original field type into a target field type of a preset format, and updating the data record with the target field type to obtain a target data record, wherein the preset format comprises the Avro format of a data serialization system; and writing the target data record into a preset destination table so as to synchronize the data record in the original data table into the preset destination table. The application realizes data synchronization between other data sources and Hudi based on DataX.

Description

Data synchronization method, device, equipment and computer readable storage medium

Technical Field

The present application relates to the field of data identification and transmission technologies, and in particular, to a data synchronization method, apparatus, device, and computer readable storage medium.

Background

The data synchronization framework DataX is a tool for exchanging data between heterogeneous databases or file systems at high speed, and can realize data synchronization between arbitrary data systems.

To solve the problem of data synchronization between heterogeneous data sources, DataX turns a complex mesh of synchronization links into a star-shaped set of links, with DataX acting as the intermediate transport carrier that connects the various data sources. DataX provides efficient data synchronization between various heterogeneous data sources, but the current DataX does not support the data lake Hudi and therefore cannot synchronize data between other data sources and Hudi.

Disclosure of Invention

The application mainly aims to provide a data synchronization method, a device, equipment and a computer readable storage medium, and aims to solve the technical problem that data synchronization between other data sources and Hudi cannot be realized by DataX.

To achieve the above object, the present application provides a data synchronization method applied to a data synchronization framework, the data synchronization method comprising the steps of:

reading at least one data record from an original data table, and determining an original data field included in the data record;

Determining an original field type of the original data field, converting the original field type into a target field type of a preset format, and updating the data record with the target field type to obtain a target data record, wherein the preset format comprises an Avro format of a data serialization system;

and writing the target data record into a preset destination table so as to synchronize the data record in the original data table into the preset destination table.

Optionally, the step of updating the data record with the target field type to obtain a target data record includes:

acquiring an original field value corresponding to the original data field, and updating the original field value to a target field value in the preset format;

taking the target field type and the target field value as target data fields, and updating the original data fields of the data records into the target data fields;

and taking the updated data record as a target data record.

Optionally, the original data field includes an incremental synchronization field;

the step of reading at least one data record from the original data table comprises:

acquiring history synchronization information of the original data table, wherein the history synchronization information comprises a history incremental synchronization field initial value, a history incremental synchronization field end value and a history synchronized record count;

determining the field value of the incremental synchronization field of each data record in the original data table, taking the range between the history incremental synchronization field initial value and the history incremental synchronization field end value as a history field value range, and determining, among the field values, the first target field values that fall within the history field value range;

determining the first data records corresponding to the first target field values, determining the total number of the first data records, and taking this total as the actual record count;

and if the history synchronized record count is consistent with the actual record count, determining the current synchronization data record of the original data table, and reading at least one data record from the original data table starting from the current synchronization data record.

Optionally, the step of determining the current synchronization data record of the original data table includes:

determining a next increment synchronization field value corresponding to the history increment synchronization field end value;

taking the next increment synchronization field value as a current increment synchronization field initial value, and determining a second target synchronization field value matched with the current increment synchronization field initial value in the field values;

and determining the second data record corresponding to the second target synchronization field value, and taking the second data record as the current synchronization data record of the original data table.

Optionally, after the step of converting the original field type to the destination format data type, the method includes:

if the history synchronized record count is inconsistent with the actual record count, determining, among the field values, a third target synchronization field value that matches the history incremental synchronization field initial value;

and determining the third data record corresponding to the third target synchronization field value, and reading at least one data record from the original data table starting from the third data record.
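For illustration only, the breakpoint-resume decision described in the preceding optional steps can be sketched as below. The sketch assumes an integer incremental synchronization field and an inclusive history range; `SyncHistory` and `resume_values` are hypothetical names introduced here and are not part of DataX or Hudi.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SyncHistory:
    """History synchronization information (field names are illustrative)."""
    start_value: int   # history incremental synchronization field initial value
    end_value: int     # history incremental synchronization field end value
    synced_count: int  # number of records synchronized in the previous run


def resume_values(field_values: Sequence[int], history: SyncHistory) -> List[int]:
    """Return the incremental-field values that still have to be read."""
    ordered = sorted(field_values)
    # Actual number of records whose field value lies in the history range.
    actual = sum(history.start_value <= v <= history.end_value for v in ordered)
    if actual == history.synced_count:
        # Counts agree: continue from the value following the history end value.
        return [v for v in ordered if v > history.end_value]
    # Counts differ: read again starting from the history initial value.
    return [v for v in ordered if v >= history.start_value]
```

When the counts agree, only records after the previous end value are read; otherwise the previous range is read again so that missing records are recovered.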

Optionally, after the step of writing the target data record into a preset destination table, the method further includes:

acquiring data synchronization information of the data synchronization framework, wherein the data synchronization information comprises data reading delay, data writing delay, data reading speed and data writing speed; the data reading delay refers to a reading time length corresponding to the unit data quantity, and the data writing delay refers to a writing time length corresponding to the unit data quantity;

determining comprehensive reading efficiency and comprehensive writing efficiency based on the data synchronization information;

and determining the ratio of the comprehensive reading efficiency to the comprehensive writing efficiency, and if the ratio is smaller than a preset threshold value, adjusting the reading speed of reading the data record from the original data table or adjusting the writing speed of writing the target data record into the preset destination table.

Optionally, the step of determining the integrated reading efficiency and the integrated writing efficiency based on the data synchronization information includes:

determining the read data amount read from the original data table within a preset time duration, and dividing the read data amount by the preset time duration to obtain a data reading rate;

determining the writing data quantity written into the preset destination table in a preset time duration, and dividing the writing data quantity by the preset time duration to obtain a data writing rate;

dividing the read data quantity by the data reading rate to obtain current reading efficiency;

dividing the writing data quantity by the data writing rate to obtain the current writing efficiency;

multiplying the data reading rate by the current reading efficiency, and dividing the product by the data reading delay to obtain the comprehensive reading efficiency;

multiplying the data writing rate and the current writing efficiency, and dividing the data writing rate and the current writing efficiency by the data writing delay to obtain the comprehensive writing efficiency.
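For illustration only, the calculation steps above can be followed literally as in the sketch below. The function names, units and sample threshold are assumptions introduced here; the sketch simply applies the stated divisions and multiplications in order and then compares the ratio of the two comprehensive efficiencies with a preset threshold.

```python
def comprehensive_efficiency(amount: float, duration: float, delay: float) -> float:
    """Apply the stated steps for one direction (reading or writing).

    amount   -- data amount read or written within the preset time duration
    duration -- the preset time duration
    delay    -- time needed per unit of data (reading or writing delay)
    """
    if amount <= 0 or duration <= 0 or delay <= 0:
        raise ValueError("amount, duration and delay must be positive")
    rate = amount / duration   # data reading (or writing) rate
    current = amount / rate    # current reading (or writing) efficiency
    return rate * current / delay


def needs_adjustment(read_eff: float, write_eff: float, threshold: float) -> bool:
    """True when the read/write efficiency ratio is below the preset threshold."""
    return read_eff / write_eff < threshold
```

For example, `comprehensive_efficiency(1000, 10, 0.5)` and `comprehensive_efficiency(500, 10, 0.5)` could be passed to `needs_adjustment` together with whatever threshold the deployment presets.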

Optionally, the step of adjusting a reading speed of reading data records from the original data table includes:

substituting the data reading rate, the data writing rate, the comprehensive reading efficiency and the comprehensive writing efficiency into the following formula I to obtain an initial maximum rate limit;

Formula I is:

$$D_1 = \min\left(S_{\text{read}} \cdot J_{\text{read}},\; S_{\text{write}} \cdot J_{\text{write}}\right)$$

wherein $D_1$ represents the initial maximum rate limit, $S_{\text{read}}$ represents the data reading rate, $S_{\text{write}}$ represents the data writing rate, $J_{\text{read}}$ represents the comprehensive reading efficiency, and $J_{\text{write}}$ represents the comprehensive writing efficiency;

substituting the initial maximum rate limit and the preset maximum rate limit into the following formula II to obtain an adjustment step length;

Formula II is as follows: [formula not reproduced in this text], wherein M represents the adjustment step length, and the two remaining symbols of the formula represent the preset maximum rate limit and the initial maximum rate limit, respectively;

adding the preset maximum rate limit and the adjustment step length to obtain a current maximum rate limit;

the read speed is adjusted based on the current maximum rate limit.
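For illustration only, Formula I and the final addition step can be sketched as follows. Because Formula II is not reproduced in this text, the adjustment step length is treated here as an input supplied by the caller; both function names are hypothetical.

```python
def initial_max_rate_limit(s_read: float, j_read: float,
                           s_write: float, j_write: float) -> float:
    """Formula I: the smaller of the rate-times-efficiency products."""
    return min(s_read * j_read, s_write * j_write)


def current_max_rate_limit(preset_limit: float, step: float) -> float:
    """Add the adjustment step length to the preset maximum rate limit.

    `step` is the result of Formula II, which is not reproduced here,
    so it must be computed elsewhere and passed in.
    """
    return preset_limit + step
```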

In addition, to achieve the above object, the present application also provides a data synchronization apparatus including a data synchronization framework, the data synchronization apparatus further including:

The reading module is used for reading at least one data record from the original data table and determining an original data field included in the data record;

the conversion module is used for determining an original field type of the original data field, converting the original field type into a target field type of a preset format, and updating the data record by the target field type to obtain a target data record, wherein the preset format comprises an Avro format of a data serialization system;

and the writing module is used for writing the target data record into a preset destination table so as to synchronize the data record in the original data table into the preset destination table.

In addition, to achieve the above object, the present application also provides a data synchronization apparatus, including: the system comprises a memory, a processor and a data synchronization program stored in the memory and capable of running on the processor, wherein the data synchronization program realizes the steps of the data synchronization method when being executed by the processor.

In addition, in order to achieve the above object, the present application also provides a computer-readable storage medium, on which a data synchronization program is stored, which when executed by a processor, implements the steps of the data synchronization method as described above.

According to the application, at least one data record is read from an original data table based on the DataX framework, the original field type of the data record is converted into a target field type corresponding to a preset format, the preset format including the Avro format, the data record is updated to obtain a target data record, and the target data record is synchronized to the preset destination table. In other words, the field type of the data record can be converted into the Avro format supported by the data lake Hudi. This avoids the situation in the prior art in which data synchronization cannot be carried out between other data sources and Hudi because the DataX framework does not support Hudi. By converting the field type of the data record into the Avro format, the target data record in the Avro format can be written into the Hudi table, so that data synchronization between other data sources and Hudi is achieved based on the DataX framework.

Drawings

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

FIG. 1 is a schematic diagram of a terminal/device structure of a hardware operating environment according to an embodiment of the present application;

FIG. 2 is a flowchart of a data synchronization method according to a first embodiment of the present application;

FIG. 3 is a schematic diagram of the whole data synchronization process in the data synchronization method of the present application;

FIG. 4 is a schematic diagram of a task flow for data synchronization configuration in the data synchronization method of the present application;

FIG. 5 is a schematic diagram of a task flow for generating data synchronization in the data synchronization method of the present application;

FIG. 6 is a schematic flow chart of generating an operation script file in the data synchronization method of the present application;

FIG. 7 is a schematic diagram of a flow chart of a task for running data synchronization in the data synchronization method of the present application;

FIG. 8 is a schematic diagram of a target data record encapsulation flow in the data synchronization method of the present application;

FIG. 9 is a schematic diagram of a device module of the data synchronization device of the present application.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

Referring to fig. 1, fig. 1 is a schematic diagram of a data synchronization device structure of a hardware running environment according to an embodiment of the present application.

As shown in fig. 1, the data synchronization device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the aforementioned processor 1001.

It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the data synchronization device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.

As shown in fig. 1, the memory 1005, as one type of computer-readable storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a data synchronization program.

In the data synchronization device shown in fig. 1, the network interface 1004 is mainly used for data communication with other devices; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the data synchronization device of the present application may be provided in the data synchronization device, and the data synchronization device calls the data synchronization program stored in the memory 1005 through the processor 1001 and executes the data synchronization method provided by the embodiment of the present application.

Referring to fig. 2, the present application provides a data synchronization method, which is applied to a data synchronization framework in a first embodiment of the data synchronization method, the data synchronization method comprising the steps of:

step S10, at least one data record is read from an original data table, and original data fields included in the data record are determined;

DataX is an open-source data synchronization framework widely used by domestic enterprises. It provides efficient data synchronization among various heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, HDFS (Hadoop Distributed File System), Hive, ClickHouse (a column-oriented database management system), HBase (Hadoop Database), StarRocks, and Kudu, but the open-source version of DataX does not support Hudi.

Hudi provides a variety of offline schemes for ingesting data into the lake, mainly three: Spark DataSource Writer, DeltaStreamer, and Flink SQL Writer. In the first scheme, Hudi-Spark provides a DataSource API (Application Programming Interface) through which any DataFrame can be written to (or read from) a Hudi dataset; information such as the field names, field types, recordKey and precombine (pre-merge) field of the Hudi table must be configured, and because the job depends on Spark, Spark must be installed and deployed to execute it. The second scheme is essentially similar to the first: the Hudi table field names, field types, recordKey and precombine field, as well as the source and target data tables, must be configured, and the job must finally be submitted to a Spark cluster for execution. In the third scheme, Hudi-Flink provides a Flink SQL connector through which data can be written to (or read from) a Hudi dataset; the Hudi table field names, field types, recordKey and precombine field must likewise be configured, and because the job depends on Flink, Flink must be installed and deployed to execute it. Taken together, these three schemes require more server resources and involve complex task configuration.

Based on the above, in this embodiment secondary development is performed on the DataX source code to extend it with a Hudi write plug-in. The original data source information, the original data table information and the storage path of the target Hudi table are configured; metadata such as the table name, fields and primary key of the original data table is obtained; the field metadata is automatically mapped to the Avro format supported by Hudi; and the recordKey and precombine (pre-merge) field required by the Hudi table are automatically configured from the primary key metadata. A job.json file executable by DataX is then generated, and the data is synchronized to Hudi by executing it through the DataX framework. For example, referring to fig. 3, a data synchronization task is configured based on a data synchronization framework (such as the DataX framework), the data synchronization task is generated, and the task is then run to realize data synchronization.

In this embodiment, the data synchronization request may be triggered when data in the original data source is updated, when a user inputs a data synchronization request instruction, or according to a preset rule, for example at preset intervals (such as one week or one month). When the user inputs the data synchronization request instruction, the original data table to be synchronized may be input at the same time, together with storage information of the destination table to which the data in the original data table is to be synchronized, such as the storage path and table name of the destination table, so that the destination table can be located.

Further, after the data synchronization request is received, referring to fig. 4, the data synchronization task may be configured. First, the original data source, i.e., the data source whose data is to be synchronized, is configured; information such as its storage path and reading manner may be configured so that data can be obtained from it. Then, according to the actual situation, the original data table to be synchronized is selected from the original data source; information such as the storage area where the original data table is located and the reading method may be configured so that data records can be read from the original data table. Next, one of the original data fields included in the original data table, such as a date field, a sequence number field or a scoring field, is selected and configured as the incremental synchronization field, so that the progress of data synchronization is recorded through the incremental synchronization field.

Further, information such as the table name and storage path of the destination table may be input, and further information related to the destination table may be configured, so that data can be synchronized to the destination table. It should be noted that, if the user does not input a pre-merge field, one of the original data fields included in the original data table may be selected and configured as the pre-merge field. The pre-merge field and the incremental synchronization field may be the same field or two different fields and may be configured according to the actual situation. When a data record of the original data table is written into the destination table, the pre-merge field is used to determine whether the record is inserted into the destination table, updates the destination table, or triggers no action. For example, suppose the date field is selected as the pre-merge field, the primary key of the original data table is the name field, and a data record R has the primary key Zhang San and the date field value the 10th of month A. If the destination table already holds a data record R1 with the primary key Zhang San and the date the 20th of month A, no write or insert operation is performed and the next data record is processed. If the destination table holds a data record R2 with the primary key Zhang San and the date the 5th of month A, the date field of R2 in the destination table is updated to the 10th of month A. If the destination table holds no data record with the primary key Zhang San, R is inserted into the destination table.
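For illustration only, the insert, update or no-action decision driven by the pre-merge field can be sketched as below. The destination table is modeled as an in-memory dictionary keyed by primary key, the pre-merge field is assumed to be a date, and equal dates are assumed to trigger no action; `upsert_action` and `apply_record` are hypothetical names.

```python
from datetime import date
from typing import Dict, Optional


def upsert_action(existing: Optional[date], incoming: date) -> str:
    """Decide what to do with an incoming record based on the pre-merge field."""
    if existing is None:
        return "insert"   # no record with this primary key yet
    if incoming > existing:
        return "update"   # incoming record is newer than the stored one
    return "skip"         # stored record is as new or newer (assumption for ties)


def apply_record(table: Dict[str, date], key: str, incoming: date) -> str:
    """Apply one record to the in-memory destination table and report the action."""
    action = upsert_action(table.get(key), incoming)
    if action != "skip":
        table[key] = incoming
    return action
```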

At least one data record can be read from the original data table, and it should be noted that a preset number of data records, such as 100 data records, 1000 data records, etc., can be read at one time.

Step S20, determining an original field type of the original data field, converting the original field type into a target field type of a preset format, and updating the data record with the target field type to obtain a target data record, wherein the preset format comprises an Avro format of a data serialization system;

All the original data fields included in the original data table can be read directly, and the original field type of each original data field, such as the integer (int) type, the character (char) type, the long integer (long) type or a time type, is determined in turn. In this embodiment, obtaining data fields and determining field types are mature techniques and are not described in detail. Further, the original field type is converted into a target field type in a preset format. The preset format may differ according to the table type of the destination table, that is, the preset format may be any data format supported by the destination table. For example, when the destination table is a PostgreSQL table, the preset format may be the JSON format; when the destination table is a Hive table, the preset format may be the SequenceFile format; and when the destination table is a data lake table (such as a Hudi table), the preset format may be the Avro format.

It will be appreciated that, for each original data field in the original data table, the field type is converted in turn into a target field type in the preset format. Taking the Avro format as an example, several original field types are converted as follows: an original field type of decimal is mapped to the Avro logical type decimal, whose underlying data type is FIXED (a fixed-length byte array); an original field type of time is mapped to the Avro logical type time-millis, whose underlying data type is INT (a 32-bit signed integer); and an original field type of date is mapped in the same manner to an Avro logical type whose underlying data type is INT (a 32-bit signed integer). The target field types corresponding to different original field types can be preset, so that each original field type can be converted into the corresponding target field type in the preset format, and the original field type of each original data field in the data record is updated to the target field type. It can be understood that, after the field type of an original data field is updated, the original field value corresponding to that field is updated accordingly, so that the updated target data record is obtained.
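For illustration only, a preset mapping from original field types to Avro types might be sketched as follows. The source type names, the 16-byte FIXED size, and the default precision and scale are assumptions made for the example; the logical type names (decimal, time-millis, date) are those defined by the Avro specification. `to_avro_type` and `build_avro_schema` are hypothetical helpers, not DataX or Hudi functions.

```python
from typing import Dict, List, Tuple, Union

AvroType = Union[str, Dict[str, object]]

# Original field types that map directly to Avro primitive types.
_PRIMITIVES = {
    "int": "int", "bigint": "long", "long": "long",
    "float": "float", "double": "double", "boolean": "boolean",
    "char": "string", "varchar": "string", "string": "string",
}


def to_avro_type(field_name: str, source_type: str,
                 precision: int = 18, scale: int = 2) -> AvroType:
    """Map one original field type to an Avro type definition."""
    kind = source_type.lower()
    if kind in _PRIMITIVES:
        return _PRIMITIVES[kind]
    if kind == "decimal":
        # decimal logical type over a fixed-length byte array (size assumed).
        return {"type": "fixed", "name": f"{field_name}_fixed", "size": 16,
                "logicalType": "decimal", "precision": precision, "scale": scale}
    if kind == "time":
        return {"type": "int", "logicalType": "time-millis"}
    if kind == "date":
        return {"type": "int", "logicalType": "date"}
    raise ValueError(f"no preset mapping for type {source_type!r}")


def build_avro_schema(table: str, columns: List[Tuple[str, str]]) -> Dict[str, object]:
    """Build an Avro record schema covering all columns of the original table."""
    fields = [{"name": name, "type": to_avro_type(name, kind)}
              for name, kind in columns]
    return {"type": "record", "name": table, "fields": fields}
```

Calling `build_avro_schema("orders", [("id", "bigint"), ("price", "decimal")])` yields a record schema whose field names match the original table while the field types are Avro types.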

Further, referring to fig. 5, after the data synchronization task is configured, the data synchronization task is generated. Optionally, taking the data synchronization framework DataX as an example: first, the field names, field data types (i.e., the original field types) and primary key information of the original data table are obtained according to the configured original data source and original data table. Then, the DataX read plug-in template is obtained according to the configured original data source, and the configuration of the read plug-in is assembled. Read plug-in templates corresponding to different original data sources can be preset: if the original data source is MySQL, the MySQL read plug-in template is obtained; if it is Hive, the Hive read plug-in template is obtained; and if it is HBase, the HBase read plug-in template is obtained. Next, the DataX write plug-in is configured. As with the read plug-in, different destination tables can correspond to different write plug-ins: for example, a Hive table corresponds to the Hive write plug-in, an Oracle table to the Oracle write plug-in, and a Hudi table to the Hudi write plug-in. For the Hudi write plug-in, at least the following parameters are generated: the Hudi table name, generated from the input table name; the HDFS path where the Hudi table data is stored, generated from the input table storage path; the Hudi table fields, generated from the field names of the original data table (the field names of the Hudi table can be consistent with those of the original data table, with the field types converted into target field types in the Avro format); the Hudi table pre-merge field, configured from the input pre-merge field or from a selected original data field; and the Hudi table record key, configured from the primary key information of the original data table (the primary key field of the original data table is configured as the record key of the Hudi table). Further, a Hudi PayloadClassName (Hudi payload class name) is generated as the write carrier of the target data record. After the write plug-in and the read plug-in are configured, a job.json file is generated for running the DataX incremental acquisition and breakpoint-resume task.
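For illustration only, assembling a job.json file from a read plug-in configuration and a write plug-in configuration might look like the sketch below. The overall `job` / `setting` / `content` / `reader` / `writer` layout follows the usual DataX job file structure; the writer name `hudiwriter` and all of its parameter names are hypothetical, since the text does not give them, and the angle-bracket values are placeholders.

```python
import json
from typing import Dict, List


def build_job(source_table: str, columns: List[str], primary_key: str,
              precombine_field: str, hudi_table: str, hudi_path: str,
              avro_fields: List[Dict[str, object]]) -> Dict[str, object]:
    """Assemble a DataX-style job description as a plain dictionary."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "<user>",
            "password": "<password>",
            "column": columns,
            "connection": [{
                "table": [source_table],
                "jdbcUrl": ["jdbc:mysql://<host>:3306/<database>"],
            }],
        },
    }
    writer = {
        "name": "hudiwriter",                     # hypothetical plug-in name
        "parameter": {                            # hypothetical parameter names
            "tableName": hudi_table,              # from the input table name
            "tablePath": hudi_path,               # HDFS storage path of the table
            "recordKey": primary_key,             # primary key of the original table
            "precombineField": precombine_field,  # pre-merge field
            "column": avro_fields,                # fields with Avro target types
        },
    }
    return {"job": {"setting": {"speed": {"channel": 1}},
                    "content": [{"reader": reader, "writer": writer}]}}


def write_job_file(job: Dict[str, object], path: str) -> None:
    """Serialize the job description to a job.json file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(job, handle, indent=2)
```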

Step S30, writing the target data record into a preset destination table so as to synchronize the data record in the original data table into the preset destination table.

The target data record in the preset format is written into a preset destination table, which can be located via the input storage path. If the preset destination table cannot be found at the storage path, an empty table is created and initialized by the write plug-in in the data synchronization framework, for example by configuring the table fields, table name, table record key and table storage path; after initialization, the empty table is used as the destination table and the target data record is written into it. It should be noted that the target data record may be written by insert writing or by batch writing: generally, if the preset destination table is empty and stores no data, batch writing is used, and if data is already stored in the preset destination table, insert writing is used.

Further, the target data records may be written into a preset buffer area, and after the number of the data records stored in the buffer area is greater than a preset threshold value, the target data records are written into a preset target table in batches. Wherein different write plug-ins can set different preset thresholds.
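For illustration only, buffering target data records and writing them in batches once the buffer exceeds a preset threshold can be sketched as follows. `BufferedWriter` and its `sink` callback are hypothetical; the callback stands in for whatever batch write the concrete write plug-in performs.

```python
from typing import Callable, List


class BufferedWriter:
    """Collect records and hand them to `sink` in batches."""

    def __init__(self, sink: Callable[[List[object]], None], threshold: int) -> None:
        self._sink = sink            # performs the actual batch write
        self._threshold = threshold  # preset threshold on buffered records
        self._buffer: List[object] = []

    def write(self, record: object) -> None:
        self._buffer.append(record)
        # Flush once the number of buffered records exceeds the threshold.
        if len(self._buffer) > self._threshold:
            self.flush()

    def flush(self) -> None:
        """Write any buffered records and empty the buffer."""
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer.clear()
```

A final call to `flush()` at the end of the task writes whatever remains in the buffer.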

Referring to fig. 7, taking a Hudi table as the preset destination table, the running process of the data synchronization task that writes the target data record into the preset destination table is described. First, it is determined whether the write plug-in is the Hudi write plug-in; if the preset destination table is not a Hudi table, the processing logic of another write plug-in is used. The processing logic of the different write plug-ins is preconfigured in DataX and is not described in detail here. Next, the configuration information of the Hudi table is obtained, including at least the table name, table storage path and table fields, and it is checked whether the table name, table storage path and table fields are empty. The Hudi table is then initialized and checked for existing data; if the Hudi table contains no data, the write mode is optimized to batch_insert (batch insertion). The preSql (pre-execution structured query statement) of the Hudi table is obtained. Each column (i.e., each original data field) is mapped to an Avro data type according to the field information of the configured original data table, an Avro schema containing all columns is created, and the HoodieWriteConfig (Hoodie write configuration) is set. The received data records are then traversed in a loop: for each DataX Record, a generic record (GenericRecord) is built according to the Avro schema, the field values of the Record are set into the corresponding fields of the generic record, with default values used where applicable, and the record key field and the incremental synchronization field value are set on the generic record.

In this embodiment, at least one data record is read from an original data table based on the DataX framework, and the original field type of the data record is converted into a target field type of a preset format, the preset format including the Avro format, so as to obtain a target data record, which is then synchronized to the preset destination table. That is, the field type of the data record can be converted into the Avro format supported by the data lake Hudi, which avoids the problem in the prior art that the DataX framework does not support Hudi and data therefore cannot be synchronized between other data sources and Hudi.

Further, based on the first embodiment of the present application, a second embodiment of the data synchronization method of the present application is provided. In this embodiment, step S20 of the foregoing embodiment (determining an original field type of the original data field, converting the original field type into a target field type of a preset format, and updating the data record with the target field type to obtain the target data record) is refined to include:

Step a, acquiring an original field value corresponding to the original data field, and updating the original field value to a target field value of the preset format;

In this embodiment, the data synchronization framework is preconfigured with a data lake write plug-in. Preferably, the data synchronization framework is the DataX framework, and the data lake write plug-in includes at least a Hudi write plug-in. It can be understood that preset destination tables of different table types may use different write plug-ins, and the write plug-ins corresponding to the different table types can be preconfigured in the DataX framework.

If the table type of the preset destination table is a data lake table, the data lake write plug-in is run; further, if the table type is a Hudi data lake table, the Hudi write plug-in is run. The Hudi table supports the Avro format, so the original field type can be converted into a target field type corresponding to the Avro format. Avro is a data serialization format that supports multiple data types, including: the null type NULL; the Boolean type BOOLEAN; the 32-bit signed integer type INT; the 64-bit signed integer type LONG; the single-precision floating-point type FLOAT; the double-precision floating-point type DOUBLE; the byte sequence type BYTES; the string type STRING; the enumeration type ENUM; the fixed-length byte array type FIXED; the array type ARRAY; the key-value pair collection type MAP; the type UNION, which is one of several possible data types; and the complex data type RECORD, which consists of a plurality of fields with names and types.

It can be appreciated that the original field type may be converted to a target field type supported by the Avro format based on a preset mapping relationship. For example, an original field type representing a time of day is converted into a target field type whose Avro data type is INT with the logical type time-millis (time in milliseconds), and an original field type representing a timestamp is converted into a target field type whose Avro data type is the 64-bit signed integer LONG with the logical type timestamp-millis (timestamp in milliseconds).

It will be appreciated that, after the original field types are converted, the original field values corresponding to the original data fields are correspondingly converted into target field values of the preset format. In this way, the field type of each original data field of each data record is converted into a target field type in a format supported by the preset destination table, and the corresponding original field value is converted into a target field value of the preset format.
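A preset mapping relationship of this kind could look like the following sketch, which builds an Avro record schema (expressed as a Python dictionary in Avro's JSON schema form) from a list of source columns. The source type names in `TYPE_MAPPING` and the choice to make every field nullable are assumptions for illustration, not the mapping used by the actual write plug-in.

```python
from typing import Dict, List, Tuple, Union

AvroType = Union[str, Dict[str, str]]

# Hypothetical mapping from source column types to Avro types.
TYPE_MAPPING: Dict[str, AvroType] = {
    "BOOLEAN": "boolean",
    "INT": "int",
    "BIGINT": "long",
    "FLOAT": "float",
    "DOUBLE": "double",
    "VARCHAR": "string",
    "BLOB": "bytes",
    "TIME": {"type": "int", "logicalType": "time-millis"},
    "TIMESTAMP": {"type": "long", "logicalType": "timestamp-millis"},
}


def build_avro_schema(name: str, columns: List[Tuple[str, str]]) -> dict:
    """Map (column name, source type) pairs to an Avro record schema."""
    fields = []
    for column_name, source_type in columns:
        avro_type = TYPE_MAPPING[source_type.upper()]
        # Assumption: every field is a union with null so that empty
        # source values remain representable.
        fields.append({"name": column_name, "type": ["null", avro_type], "default": None})
    return {"type": "record", "name": name, "fields": fields}


schema = build_avro_schema("orders", [("id", "bigint"), ("created_at", "timestamp")])
```

An unknown source type raises `KeyError` here, which makes an unmapped column visible before any data is written.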

Step b, taking the target field type and the target field value as target data fields, and updating the original data fields of the data records into the target data fields;

Step c, taking the updated data record as a target data record.

Referring to fig. 8, the encapsulation process of the target data record is illustrated by taking the Hudi table and the Avro format as an example. First, the Avro schema is acquired and a GenericRecord object is created. Each field in the Avro schema is then iterated, that is, each original data field is traversed and converted into a target data field: the corresponding original field value is acquired from the Record object of DataX, the DataX column value is converted into the Avro format to obtain the target field value, and the target field value is set in the GenericRecord object. Finally, the converted GenericRecord object is returned as the target data record.

In this example, the original field type of the original data field is converted into the target field type of the preset format, and the original field value is correspondingly updated, so that the original field value is also converted into the target field value of the preset format, and the target data record is obtained.
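The encapsulation process can be sketched as follows, with a plain dictionary standing in for the GenericRecord object and a list of column values standing in for the DataX Record. The schema, field names and conversion rules are assumptions for illustration; only the two logical types mentioned in the description are converted.

```python
from datetime import datetime, time, timezone
from typing import Any, Dict, List

# Minimal Avro schema for illustration (field names are hypothetical).
SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "opened_at", "type": {"type": "int", "logicalType": "time-millis"}},
        {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}


def to_avro_value(value: Any, avro_type: Any) -> Any:
    """Convert one column value into the representation its Avro type expects."""
    logical = avro_type.get("logicalType") if isinstance(avro_type, dict) else None
    if value is None:
        return None
    if logical == "time-millis":  # milliseconds after midnight
        seconds = (value.hour * 60 + value.minute) * 60 + value.second
        return seconds * 1000 + value.microsecond // 1000
    if logical == "timestamp-millis":  # milliseconds since the Unix epoch
        return round(value.timestamp() * 1000)
    return value


def to_generic_record(schema: dict, column_values: List[Any]) -> Dict[str, Any]:
    """Iterate over the schema fields and set each converted value."""
    record: Dict[str, Any] = {}
    for field, value in zip(schema["fields"], column_values):
        record[field["name"]] = to_avro_value(value, field["type"])
    return record


row = [7, time(0, 0, 1, 500000), datetime(1970, 1, 1, 0, 0, 2, tzinfo=timezone.utc)]
target_record = to_generic_record(SCHEMA, row)
```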

In one embodiment, the original data field includes an incremental synchronization field, and the step of reading at least one data record from the original data table includes:

step e, acquiring history synchronization information of an original data table, wherein the history synchronization information comprises a history increment synchronization field initial value, a history increment synchronization field end value and a history synchronization data number;

After each data synchronization finishes, the synchronization information of that run can be recorded. The synchronization information comprises an initial value of the increment synchronization field, an end value of the increment synchronization field and the amount of synchronized data (namely, the number of synchronized data records). The initial value of the increment synchronization field is the field value of the increment synchronization field of the first data record synchronized in that run, and the end value of the increment synchronization field is the field value of the increment synchronization field of the last data record synchronized in that run. When data synchronization starts, the history synchronization information can be obtained; preferably, the synchronization information of the data synchronization task closest in time to the current data synchronization task is used as the history synchronization information.

Step f, determining a field value of the increment synchronization field of each data record in the original data table, taking the range between the history increment synchronization field initial value and the history increment synchronization field end value as a history field value range, and determining, among the field values, first target field values that fall within the history field value range;

Step g, determining a first data record corresponding to the first target field value, determining the total number of data records included in the first data record, and taking the total number of data records as the actual number of data records;

Step h, if the number of the historical synchronous data is consistent with the number of the actual data, determining the current synchronous data record of the original data table, and reading at least one data record from the original data table starting from the current synchronous data record.

Based on the obtained history synchronization information, it is judged whether the last data synchronization task succeeded. Optionally, the field value of the increment synchronization field of each data record in the original data table is determined, and it is determined whether each field value lies within the history field value range; a field value that lies within the range is marked as a first target field value. The number of first data records corresponding to the first target field values is then determined to obtain the actual number of data records, and it is judged whether the number of history synchronization data records is consistent with the actual number of data records. If they are consistent, the last data synchronization task succeeded, and the current data synchronization starts from the data record following the one at which the last data synchronization ended. Otherwise, the last data synchronization task failed, and the data records that failed to synchronize last time are synchronized again in the current run.
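The consistency check can be sketched as follows, assuming an integer increment synchronization field and a history field value range that includes both of its end points. The function name `count_in_history_range` is hypothetical; against a real database the same count would more likely be obtained with a counting query over the range than by iterating over field values.

```python
from typing import Iterable


def count_in_history_range(field_values: Iterable[int], start: int, end: int) -> int:
    """Count field values inside the history range (assumed inclusive)."""
    return sum(1 for value in field_values if start <= value <= end)


# Field values of the increment synchronization field in the original table.
values = [1, 2, 3, 4, 5, 6, 7]
actual_count = count_in_history_range(values, start=1, end=5)

# The last task recorded 5 synchronized records for the range [1, 5].
last_sync_succeeded = actual_count == 5
```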

In one embodiment, the step of determining the current synchronization data record of the original data table comprises:

step i, determining a next increment synchronization field value corresponding to the history increment synchronization field end value;

It will be understood that, if the increment synchronization field is arranged in ascending order in the original data table, the next increment synchronization field value is obtained by increasing the history increment synchronization field end value by 1 at its precision, that is, by adding one unit of the data type corresponding to the increment synchronization field; for example, if the increment synchronization field is a date field, 1 is added to the date. Similarly, if the increment synchronization field is arranged in descending order in the original data table, the next increment synchronization field value is obtained by decreasing the history increment synchronization field end value by 1 at its precision.

Step j, taking the next increment synchronization field value as the current increment synchronization field initial value, and determining, among the field values, a second target synchronization field value matching the current increment synchronization field initial value;

Step k, determining the second data record corresponding to the second target synchronization field value, and taking the second data record as the current synchronization data record of the original data table.

In this embodiment, whether the last data synchronization task succeeded is determined based on the number of history synchronization data records and the actual number of data records. If it succeeded, the next increment synchronization field value corresponding to the history increment synchronization field end value is determined, and the second data record corresponding to the second target synchronization field value is used as the current synchronization data record of the original data table from which data synchronization starts, thereby avoiding repeated writing of data records.
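The "precision plus 1" rule of steps i to k could be sketched as follows. The unit chosen for each data type (one day for a date, one millisecond for a date-time, one for an integer) is an assumption for illustration; the description only states that one unit of the field's data type is added or subtracted.

```python
from datetime import date, datetime, timedelta
from typing import Union

IncrementValue = Union[int, date, datetime]


def next_increment_value(end_value: IncrementValue, ascending: bool = True) -> IncrementValue:
    """Return the value one unit after (or before) the last end value."""
    step = 1 if ascending else -1
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(end_value, datetime):
        return end_value + timedelta(milliseconds=step)  # assumed millisecond precision
    if isinstance(end_value, date):
        return end_value + timedelta(days=step)
    if isinstance(end_value, int):
        return end_value + step
    raise TypeError(f"unsupported increment field type: {type(end_value).__name__}")


next_id = next_increment_value(100)
next_day = next_increment_value(date(2023, 7, 10))
previous_id = next_increment_value(100, ascending=False)
```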

In an embodiment, after the step of converting the original field type to the destination format data type, the method includes:

step l, if the number of the historical synchronous data is inconsistent with the number of the actual data, determining a third target synchronous field value matched with the initial value of the historical increment synchronous field in the field values;

Step m, determining the third data record corresponding to the third target synchronization field value, and reading at least one data record from the original data table starting from the third data record.

In this example, if the number of the history synchronization data is inconsistent with the number of the actual data, it is indicated that the previous data synchronization task fails, and data synchronization is performed from the third data record corresponding to the initial value of the history increment synchronization field, so that accurate synchronization of the data records in the original data table to the preset destination table is ensured.

In addition, to assist understanding of the flow of the data synchronization method in the present embodiment, the following description is given by way of example.

Referring to fig. 6, the synchronization information of the last synchronization task is obtained according to the configured original data source, original data table and increment synchronization field (i.e., increment field). If no synchronization information is obtained, the data has not yet been synchronized, and the minimum value of the data type of the increment field is used as the actual initial value of the current synchronization. If synchronization information is obtained, the initial value, end value and synchronized data amount of the last synchronization are available; the data amount of the original table within the range between the last initial value and end value is obtained and compared with the synchronized data amount. If the two are consistent, the last synchronization was complete and its data was successfully synchronized into the destination table, so the end value of the last synchronization with its precision increased by 1 is used as the initial value of the current synchronization. If the two are inconsistent, the last synchronization was abnormal: a preSql (pre-executed SQL) statement that deletes the data within the range between the last initial value and end value is generated and converted by a preset dialect converter into the preSql of the Hudi write plug-in, and the initial value of the last synchronization is used as the initial value of the current synchronization. Finally, the incremental condition is converted by the preset dialect converter for the corresponding reader plug-in, and the reader plug-in reads the data.
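The three branches of this flow can be summarized in the following sketch, which assumes an integer increment field. The names `SyncInfo`, `SyncPlan` and `plan_sync` are hypothetical, the generated statement is generic SQL rather than the output of the dialect converter, and identifiers are assumed to come from trusted configuration.

```python
from typing import NamedTuple, Optional

INT64_MIN = -(2 ** 63)  # assumed minimum value of an integer increment field


class SyncInfo(NamedTuple):
    start_value: int
    end_value: int
    synced_count: int


class SyncPlan(NamedTuple):
    start_value: int
    pre_sql: Optional[str]


def plan_sync(history: Optional[SyncInfo], actual_count: int,
              dest_table: str, field: str) -> SyncPlan:
    """Decide the initial value of the current run and any pre-executed SQL."""
    if history is None:
        # Nothing has been synchronized yet: start from the type minimum.
        return SyncPlan(INT64_MIN, None)
    if actual_count == history.synced_count:
        # Last run was complete: continue after its end value.
        return SyncPlan(history.end_value + 1, None)
    # Last run was abnormal: delete its range and repeat it.
    pre_sql = (f"DELETE FROM {dest_table} WHERE {field} "
               f"BETWEEN {history.start_value} AND {history.end_value}")
    return SyncPlan(history.start_value, pre_sql)


first_run = plan_sync(None, 0, "orders", "id")
resumed = plan_sync(SyncInfo(1, 100, 100), 100, "orders", "id")
repeated = plan_sync(SyncInfo(1, 100, 100), 97, "orders", "id")
```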

In an embodiment, after the step of writing the target data record into the preset destination table, the method further includes:

step n, acquiring data synchronization information of the data synchronization frame, wherein the data synchronization information comprises data reading delay, data writing delay, data reading speed and data writing speed; the data reading delay refers to a reading time length corresponding to the unit data quantity, and the data writing delay refers to a writing time length corresponding to the unit data quantity;

Since DataX and Hudi may run on different terminal devices between which data is synchronized, the data synchronization rate depends on the software and hardware resources of the two terminal devices. If the synchronization rate exceeds what the software and hardware resources of a terminal device can sustain, data synchronization is easily interrupted or data is easily lost. To ensure stable data synchronization while maximizing the synchronization rate, the software and hardware resources of both terminal devices need to be considered together so that a reasonable data synchronization rate can be allocated.

Based on the above, a speedController parameter is added to the configuration file of DataX to specify the strategy and parameters of data synchronization rate control. At DataX runtime, a SpeedController object is created based on the speedController parameter to monitor and adjust the data synchronization rate. The SpeedController object may have different implementations that control the rate according to different policies.

First, the data synchronization information of the data synchronization framework may be acquired, together with an initial rate limit and an initial data synchronization rate, the initial rate limit including a maximum rate limit and a minimum rate limit. Before the data synchronization rate is adjusted, data synchronization between the original data table and the preset destination table is performed at the initial data synchronization rate. During data synchronization, the data end where the original data table is located and the destination end where the preset destination table is located are subjected only to synchronization rate control, so as to prevent data synchronization interruption, data loss and similar situations.
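One possible shape of such a controller is sketched below. The description does not specify the interface of the SpeedController object, so the fields `min_rate`, `max_rate` and `rate` and the clamping behaviour of `adjust` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpeedController:
    """Holds the current synchronization rate between two rate limits.

    Hypothetical sketch: the attribute names and the clamping policy are
    assumptions, not the interface of the actual DataX object.
    """

    min_rate: float  # minimum rate limit, e.g. in MB/s
    max_rate: float  # maximum rate limit, e.g. in MB/s
    rate: float      # initial data synchronization rate

    def __post_init__(self) -> None:
        if not self.min_rate <= self.rate <= self.max_rate:
            raise ValueError("initial rate must lie within the rate limits")

    def adjust(self, new_rate: float) -> float:
        """Set a new rate, clamped to the configured limits."""
        self.rate = max(self.min_rate, min(self.max_rate, new_rate))
        return self.rate


controller = SpeedController(min_rate=1.0, max_rate=8.0, rate=4.0)
too_fast = controller.adjust(10.0)
too_slow = controller.adjust(0.5)
```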

It should be noted that the data reading speed represents the amount of data that the data end can read per second. The number of bytes or data packets read from the data end can be recorded within a certain time window and then divided by the length of the time window to obtain the amount of data read per second, that is, the data reading speed.

Step o, determining comprehensive reading efficiency and comprehensive writing efficiency based on the data synchronization information;

specifically, step o specifically includes the steps of:

step o1, determining the read data amount read from the original data table in a preset time duration, and dividing the read data amount by the preset time duration to obtain a data reading rate;

step o2, determining the writing data amount written into the preset destination table in the preset time duration, and dividing the writing data amount by the preset time duration to obtain a data writing rate;

step o3, dividing the read data quantity by the data reading rate to obtain the current reading efficiency;

step o4, dividing the written data volume by the data writing rate to obtain the current writing efficiency;

step o5, multiplying the data reading rate by the current reading efficiency and dividing the data reading rate by the data reading delay to obtain the comprehensive reading efficiency;

and step o6, multiplying the data writing rate and the current writing efficiency, and dividing the data writing rate and the current writing efficiency by the data writing delay to obtain the comprehensive writing efficiency.

In this embodiment, the comprehensive reading efficiency is calculated from several data dimensions (the data reading rate, the current reading efficiency and the data reading delay) to evaluate the load of the data end device, and the comprehensive writing efficiency is calculated from the data writing rate, the current writing efficiency and the data writing delay to evaluate the load of the destination end device. Because the calculation is based on multi-dimensional data, the device load can be evaluated with higher accuracy.
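Steps o1 to o6 can be sketched as a single function applied once to the read side and once to the write side. The sketch follows the steps as literally stated; the function name, parameter names and units are assumptions for illustration.

```python
def comprehensive_efficiency(data_amount: float, duration: float, delay: float) -> float:
    """Follow steps o1-o6 for one side (read or write) as literally stated.

    data_amount: data read or written in the preset duration (e.g. MB)
    duration:    length of the preset duration (e.g. seconds)
    delay:       read or write time per unit of data (e.g. seconds)
    """
    if data_amount <= 0 or duration <= 0 or delay <= 0:
        raise ValueError("data_amount, duration and delay must be positive")
    rate = data_amount / duration            # steps o1 / o2
    # Steps o3 / o4: amount divided by rate. Taken literally, this equals
    # the length of the preset duration.
    current_efficiency = data_amount / rate
    return rate * current_efficiency / delay  # steps o5 / o6


read_efficiency = comprehensive_efficiency(data_amount=150.0, duration=10.0, delay=0.1)
write_efficiency = comprehensive_efficiency(data_amount=120.0, duration=10.0, delay=0.2)
```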

Step p, determining the ratio of the comprehensive reading efficiency to the comprehensive writing efficiency, and if the ratio is smaller than a preset threshold value, adjusting the reading speed of reading the data record from the original data table or adjusting the writing speed of writing the target data record into the preset destination table.

The comprehensive reading efficiency of the data end and the comprehensive writing efficiency of the destination end are respectively determined based on the data synchronization information, and the ratio between the two is calculated (namely, the comprehensive reading efficiency divided by the comprehensive writing efficiency). If the ratio is greater than the threshold (e.g., greater than 0.8), the current data synchronization rate is appropriate and no adjustment is required. If the ratio is smaller than the threshold (e.g., smaller than 0.8), one of the data end and the destination end is overloaded, so the current data synchronization rate needs to be adjusted. In addition, the read buffer size or the write buffer size can be adjusted, as can the number of concurrent read threads or concurrent write threads.

In this embodiment, the read speed and the write speed of the data are correspondingly adjusted by the comprehensive read efficiency and the comprehensive write efficiency, so that the stability of data synchronization is ensured.
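The decision of step p reduces to a single comparison, sketched below with the example threshold of 0.8 as the default. The function name is hypothetical, and the sketch only reports whether an adjustment is needed; which of the reading speed or the writing speed is adjusted is left to the caller.

```python
def needs_rate_adjustment(read_efficiency: float, write_efficiency: float,
                          threshold: float = 0.8) -> bool:
    """Return True when the efficiency ratio falls below the threshold."""
    if write_efficiency <= 0:
        raise ValueError("write_efficiency must be positive")
    return read_efficiency / write_efficiency < threshold


balanced = needs_rate_adjustment(90.0, 100.0)                 # ratio 0.9
overloaded = needs_rate_adjustment(54.0, 380.0, threshold=1.0)
```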

In one embodiment, the step of adjusting the reading speed of reading the data record from the original data table includes:

Step p1, substituting the data reading rate, the data writing rate, the comprehensive reading efficiency and the comprehensive writing efficiency into the following formula I to obtain an initial maximum rate limit;

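the first formula is:

$D_1 = \min\left(S_{\text{reading}} \cdot J_{\text{reading}},\ S_{\text{writing}} \cdot J_{\text{writing}}\right)$;

where $D_1$ represents the initial maximum rate limit, $S_{\text{reading}}$ and $S_{\text{writing}}$ represent the data reading rate and the data writing rate, $J_{\text{reading}}$ and $J_{\text{writing}}$ represent the comprehensive reading efficiency and the comprehensive writing efficiency, and min represents taking the minimum value. This rate limit may represent the maximum rate that can be achieved without affecting the data processing performance of the data end and the destination end.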

Step p2, substituting the initial maximum rate limit and the preset maximum rate limit into the following formula II to obtain an adjustment step length;

the formula II is as follows:

$M = \dfrac{D_1 - D_0}{N}$;

wherein $M$ represents the adjustment step, $D_0$ represents the preset maximum rate limit, $D_1$ represents the initial maximum rate limit, and $N$ represents a preset value;

That is, the adjustment step is calculated from the initial maximum rate limit and the preset maximum rate limit: adjustment step = (initial maximum rate limit - preset maximum rate limit) / preset value, where the preset value may be 5, 10, 15, etc. The adjustment step represents the amount by which the rate limit is increased or decreased at each adjustment.

Step p3, adding the preset maximum rate limit and the adjustment step length to obtain a current maximum rate limit;

and step p4, adjusting the reading speed based on the current maximum speed limit.

In this embodiment, an initial maximum rate limit is calculated from the data reading rate, the data writing rate, the comprehensive reading efficiency and the comprehensive writing efficiency; an adjustment step is then obtained from the initial maximum rate limit and the preset maximum rate limit, so as to obtain the current maximum rate limit, according to which the reading speed is adjusted. Because the adjustment step is calculated from several data dimensions and the current maximum rate limit is derived from it, the synchronization rate between the data end and the destination end can be better balanced, which helps prevent data synchronization interruption, data loss and similar situations.

For example, assume that the data reading speed of the data end and the data writing speed of the destination end are 15 MB/s and 12 MB/s, the data reading efficiency and data writing efficiency are 60% and 80%, the data reading delay and data writing delay are 0.1 s and 0.2 s, and the current reading efficiency and current writing efficiency are 90% and 95%, so that the comprehensive reading efficiency and comprehensive writing efficiency are 54 and 380, respectively. Assuming that the preset threshold is 1, the ratio of the comprehensive reading efficiency to the comprehensive writing efficiency is smaller than the preset threshold and the comprehensive writing efficiency is too large, which indicates that the destination end has a performance bottleneck and the maximum rate limit needs to be reduced. The initial maximum rate limit is min(15 × 54, 12 × 380) = 810 KB/s. Assuming that the preset maximum rate limit is 8 MB/s and the preset value is 10, the adjustment step is (0.81 - 8)/10 ≈ -0.72 MB/s, and the current maximum rate limit is 8 - 0.72 = 7.28 MB/s.
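The arithmetic of formula I, formula II and step p3 in the example above can be reproduced with the following sketch. The function names are hypothetical, and the conversion from 810 KB/s to 0.81 MB/s simply follows the unit reading used in the example. Without intermediate rounding the result is 7.281 MB/s, which the example rounds to 7.28 MB/s.

```python
def initial_max_rate_limit(read_rate: float, read_efficiency: float,
                           write_rate: float, write_efficiency: float) -> float:
    """Formula I: the smaller of the two rate-efficiency products."""
    return min(read_rate * read_efficiency, write_rate * write_efficiency)


def adjustment_step(initial_max: float, preset_max: float, preset_value: float) -> float:
    """Formula II: (initial maximum - preset maximum) / preset value."""
    return (initial_max - preset_max) / preset_value


def current_max_rate_limit(initial_max: float, preset_max: float,
                           preset_value: float = 10.0) -> float:
    """Step p3: preset maximum rate limit plus the adjustment step."""
    return preset_max + adjustment_step(initial_max, preset_max, preset_value)


initial_kb = initial_max_rate_limit(15, 54, 12, 380)  # read as KB/s in the example
initial_mb = initial_kb / 1000                         # same value in MB/s
step_mb = adjustment_step(initial_mb, preset_max=8.0, preset_value=10.0)
limit_mb = current_max_rate_limit(initial_mb, preset_max=8.0, preset_value=10.0)
```

Because the step is negative whenever the initial maximum rate limit lies below the preset one, each adjustment moves the limit only part of the way toward the computed value rather than jumping to it.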

In addition, referring to fig. 9, the present application also provides a data synchronization device, where the data synchronization device includes a data synchronization frame, and the data synchronization device further includes:

the reading module A10 is used for reading at least one data record from the original data table and determining the original data field included in the data record;

the conversion module A20 is configured to determine an original field type of the original data field, convert the original field type into a target field type of a preset format, and update the data record with the target field type to obtain a target data record, where the preset format includes an Avro format of a data serialization system;

and the writing module A30 is used for writing the target data record into a preset destination table so as to synchronize the data record in the original data table into the preset destination table.

In addition, the embodiment of the application also provides a data synchronization device, which comprises a memory, a processor and a data synchronization program stored in the memory and executable on the processor, wherein the data synchronization program realizes the steps of the data synchronization method when being executed by the processor.

The specific implementation manner of the data synchronization device of the present application is substantially the same as that of each embodiment of the data synchronization method described above, and will not be repeated here.

The specific implementation manner of the computer readable storage medium of the present application is basically the same as the above embodiments of the data synchronization method, and will not be repeated here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a computer readable storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a cloud server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.

The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims

1. A data synchronization method, wherein the data synchronization method is applied to a data synchronization framework, the data synchronization method comprising the steps of:

writing the target data record into a preset destination table so as to synchronize the data record in the original data table into the preset destination table;

after the step of writing the target data record into a preset destination table, the method further comprises the following steps:

multiplying the data reading rate by the current reading efficiency and dividing the data reading rate by the data reading delay to obtain comprehensive reading efficiency;

multiplying the data writing rate by the current writing efficiency, and dividing the data writing rate by the data writing delay to obtain comprehensive writing efficiency;

Determining the ratio of the comprehensive reading efficiency to the comprehensive writing efficiency, and if the ratio is smaller than a preset threshold value, adjusting the reading speed of reading the data record from the original data table or adjusting the writing speed of writing the target data record into a preset destination table;

wherein the step of adjusting the reading speed of reading the data record from the original data table comprises:

the first formula is:

$D_1 = \min\left(S_{\text{reading}} \cdot J_{\text{reading}},\ S_{\text{writing}} \cdot J_{\text{writing}}\right)$;

The read speed is adjusted based on the current maximum rate limit.

2. The data synchronization method of claim 1, wherein the step of updating the data record with the target field type results in a target data record, comprising:

and taking the updated data record as a target data record.

3. The data synchronization method of claim 1, wherein the original data field comprises an incremental synchronization field;

4. The data synchronization method of claim 3, wherein the step of determining a current synchronization data record of the original data table comprises:

5. The data synchronization method of claim 3, wherein after the step of converting the original field type to a destination format data type, comprising:

6. A data synchronization device, the data synchronization device comprising: memory, a processor and a data synchronization program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the data synchronization method according to any one of claims 1 to 5.

7. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a data synchronization program, which when executed by a processor, implements the steps of the data synchronization method according to any of claims 1 to 5.