CN114490554A - Data synchronization method and device, electronic equipment and storage medium

Info

Publication number: CN114490554A
Application number: CN202210134131.3A
Authority: CN
Inventors: 林丹; 沈贇; 刘雪晶; 阳兵
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2022-02-14
Filing date: 2022-02-14
Publication date: 2022-05-13

Abstract

The invention discloses a data synchronization method and device, electronic equipment and a storage medium, and relates to the technical field of big data. The synchronization method comprises: collecting an audit log of a preset distributed cluster system; extracting the target table and the storage data partition operated by each data processing statement with data change; comparing the file update time of each data file under the target table and the storage data partition; and, when the file update time triggers a data synchronization instruction, writing the data under the storage data partition with updated data into a target database. The invention solves the technical problem in the related art that data synchronization cannot be triggered automatically when data changes.

Description

Data synchronization method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of big data, in particular to a data synchronization method and device, electronic equipment and a storage medium.

Background

With the rapid development of big data technology, data has grown explosively in scale and diversified in form, and most enterprises have begun to build big data architectures on the open-source Hadoop ecosystem. In the related art, data is stored uniformly in the Hadoop file system and can be batch-processed by multiple computing engines such as Hive and Spark. In online access scenarios, however, Hive has low timeliness, so most enterprises synchronize the data into a relational database. Consequently, when data changes on Hadoop, the change needs to be detected quickly and data synchronization needs to be triggered automatically, so that the data served in online access scenarios stays timely.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a data synchronization method and device, electronic equipment and a storage medium, which are used for at least solving the technical problem that data synchronization cannot be automatically triggered when data change occurs in the related technology.

According to an aspect of an embodiment of the present invention, there is provided a data synchronization method, including: collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: data change occurs and no data change occurs; extracting the target table and the storage data partition operated by the data processing statement with data change; comparing the file update time of each data file under the target table and the storage data partition; and, when the file update time triggers a data synchronization instruction, writing the data in the storage data partition with updated data into the target database.

Optionally, after collecting an audit log of a preset distributed cluster system, the method further includes: dividing the audit log according to a preset format to obtain a statement identifier of each data processing statement; acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identification; intercepting the execution content of each data processing statement in the execution log, wherein the execution content at least comprises: log completion time; and writing the data processing statements into a preset sorting queue in sequence according to the sequence of the log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.

Optionally, after extracting the target table and the storage data partition operated by the data processing statement in which the data change occurs, the method further includes: querying a data modification list to be processed based on the table identifier of the target table and the area identifier of the storage data partition; and extracting the statement execution end time of the data processing statement with the modification identifier in the data modification list.

Optionally, the step of comparing the file update time of each data file in the target table and the storage data partition includes: acquiring file synchronization time of each data file under the target table and the storage data partition; under the condition that the statement execution ending time is less than or equal to the file synchronization time, confirming that the data processing statement with the modification mark is synchronized and completed; and setting the statement execution ending time as the file updating time under the condition that the statement execution ending time is greater than the file synchronization time.

Optionally, after comparing the file update time of each data file under the target table and the storage data partition, the method further includes: comparing the file updating time with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result; and when the file updating time is different from the maximum deadline in the data synchronization task log table, determining that the file updating time triggers a data synchronization instruction.

Optionally, in a case that the file update time triggers a data synchronization instruction, the step of writing data in the storage data partition in which the update data exists into the target database includes: when the file update time triggers a data synchronization instruction, reading a source cluster, a source table user, a target table and a target table user in the data synchronization instruction; accessing the storage data partition with updated data under the target table of the source cluster by using a first user identifier of the source table user, so as to read partition data; clearing, from the target table, the data whose partition field is the storage data partition with the updated data; and writing the partition data into a target database.

Optionally, after writing the partition data into the target database, the method further includes: acquiring the end time of writing the partition data into the target database to obtain the synchronous end time; and updating the maximum deadline in the data synchronization task log table as the synchronization end time.

According to another aspect of the embodiments of the present invention, there is also provided a data synchronization apparatus, including: an acquisition unit, configured to collect an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: data change occurs and no data change occurs; an extraction unit, configured to extract the target table and the storage data partition operated by the data processing statement with data change; a comparison unit, configured to compare the file update time of each data file under the target table and the storage data partition; and a writing unit, configured to write the data in the storage data partition with updated data into the target database when the file update time triggers the data synchronization instruction.

Optionally, the synchronization apparatus further includes: the first segmentation module is used for segmenting an audit log of a preset distributed cluster system according to a preset format after the audit log is collected, so as to obtain a statement identifier of each data processing statement; the first acquisition module is used for acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identifications; a first intercepting module, configured to intercept execution content of each data processing statement in the execution log, where the execution content at least includes: log completion time; and the first writing module is used for sequentially writing the data processing statements into a preset sorting queue according to the sequence of the log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.

Optionally, the synchronization apparatus further includes: the first query module is used for querying a data modification list to be processed based on a table identifier of the target table and an area identifier of the storage data partition after extracting the target table and the storage data partition operated by the data processing statement with data change; and the first extraction module is used for extracting the statement execution ending time of the data processing statement with the modification identifier in the data modification list.

Optionally, the comparing unit includes: the first acquisition module is used for acquiring the file synchronization time of each data file under the target table and the storage data partition; a first determining module, configured to determine that the data processing statement with the modification identifier is synchronized and completed when the statement execution end time is less than or equal to the file synchronization time; and the first setting module is used for setting the statement execution end time as the file update time when the statement execution end time is greater than the file synchronization time.

Optionally, the synchronization apparatus further includes: the first comparison module is used for comparing the file updating time with the maximum deadline in the data synchronization task log table after comparing the file updating time of each data file in the target table and the stored data partition, so as to obtain a deadline comparison result; and the first triggering module is used for confirming that the file updating time triggers the data synchronization instruction when the file updating time is different from the maximum deadline in the data synchronization task log table.

Optionally, the writing unit includes: the first reading module is used for reading a source cluster, a source table user, a target table and a target table user in a data synchronization instruction under the condition that the file updating time triggers the data synchronization instruction; the first access module is used for accessing a storage data partition with update data under a target table of the source cluster by adopting a first user identifier of the source table user so as to read partition data; the first clearing module is used for clearing the data of the storage data partition with the updated data in the partition field under the target table; and the second writing module is used for writing the partition data into a target database.

Optionally, the synchronization apparatus further includes: the second acquisition module is used for acquiring the end time of the partition data written into the target database after the partition data is written into the target database, so as to obtain the synchronous end time; and the first updating module is used for updating the maximum deadline in the data synchronization task log table to be the synchronization end time.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the above data synchronization method.

According to another aspect of embodiments of the present invention, there is also provided an electronic device, including one or more processors and a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the data synchronization method described above.

In the method, an audit log of a preset distributed cluster system is collected, a target table and a storage data partition operated by a data processing statement with data change are extracted, the file update time of each data file under the target table and the storage data partition is compared, and under the condition that the file update time triggers a data synchronization instruction, data under the storage data partition with the update data are written into a target database. In the method and the device, the target table with changed data and the stored data partition can be extracted, and the file updating time of each data file under the target table and the stored data partition is compared, so that the data under the stored data partition with updated data can be written into the target database in time, the timeliness of triggering data synchronization is improved, the process is full-automatic, and the technical problem that the data synchronization cannot be automatically triggered when the data are changed in the related technology is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of an alternative method of data synchronization according to an embodiment of the present invention;

FIG. 2 is a flow diagram of an alternative parsing of an SQL statement according to an embodiment of the invention;

FIG. 3 is a flow diagram of an alternative synchronized task scheduling according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an alternative data synchronization system for monitoring data change triggering based on Hive execution SQL according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an alternative data synchronization apparatus according to an embodiment of the present invention;

FIG. 6 is a block diagram of a hardware structure of an electronic device (or mobile device) for a data synchronization method according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

To facilitate understanding of the invention by those skilled in the art, some terms or nouns referred to in the embodiments of the invention are explained below:

Hadoop: a software framework that performs distributed processing of large amounts of data in a reliable, efficient and scalable manner.

Hive: a data warehouse tool based on Hadoop, used for data extraction, transformation and loading; it provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop.

Spark: a fast, general-purpose computing engine designed specifically for large-scale data processing.

It should be noted that the data synchronization method and the data synchronization device in the present disclosure may be used in the field of big data technology under the condition of data synchronization, and may also be used in any field except the field of big data technology under the condition of data synchronization, and the application fields of the data synchronization method and the data synchronization device in the present disclosure are not limited.

It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.

The embodiments of the invention described below can be applied to various systems/applications/devices for data synchronization. The method collects the audit log of big data SQL execution, analyzes the SQL statements in the audit log, and extracts the target table and the storage data partition in which data may have changed. It then compares the file change time of the target table and the storage data partition with the state of the most recently synchronized data, triggers data synchronization, reads the data in the changed partition and writes it into a target database (for example, a relational database).

The present invention will be described in detail with reference to examples.

Example one

In accordance with an embodiment of the present invention, there is provided a data synchronization method embodiment, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.

Fig. 1 is a flow chart of an alternative data synchronization method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:

Step S101, collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: data change occurs and no data change occurs.

Step S102, extracting the target table and the storage data partition operated by the data processing statement with data change.

Step S103, comparing the file update time of each data file under the target table and the storage data partition.

Step S104, writing the data in the storage data partition with updated data into the target database when the file update time triggers the data synchronization instruction.

Through the steps, the audit log of the preset distributed cluster system can be collected, the target table and the storage data partition operated by the data processing statement with data change are extracted, the file updating time of each data file under the target table and the storage data partition is compared, and the data under the storage data partition with the updated data is written into the target database under the condition that the data synchronization instruction is triggered by the file updating time. In the embodiment of the invention, the target table with changed data and the stored data partition can be extracted, and the file updating time of each data file under the target table and the stored data partition can be compared, so that the data under the stored data partition with updated data can be written into the target database in time, the timeliness of triggering data synchronization is improved, the process is full-automatic, and the technical problem that the data synchronization cannot be automatically triggered when the data is changed in the related technology is solved.

The following will explain the embodiments of the present invention in detail with reference to the above steps.

Step S101, collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: data change occurs and no data change occurs.

In the embodiment of the present invention, an audit log of a preset distributed cluster system (for example, a Hadoop cluster system) may be collected and read line by line. The audit log contains data processing statements, each data processing statement corresponds to a statement type, and the statement types include: data change occurs and no data change occurs.

Optionally, after collecting the audit log of the preset distributed cluster system, the method further includes: dividing the audit log according to a preset format to obtain a statement identifier of each data processing statement; acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identifiers; intercepting the execution content of each data processing statement in the execution log, wherein the execution content at least comprises: log completion time; and writing the data processing statements into a preset sorting queue in order of log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.

In the embodiment of the present invention, after the audit log of the preset distributed cluster system is collected, the audit log may be divided according to a preset format (which may be set according to the actual situation) to obtain a statement identifier of each data processing statement. Based on the statement identifiers, the execution logs (e.g., Hive logs) of all data processing statements (e.g., SQL statements) in a preset data warehouse (e.g., a Hive data warehouse) are acquired, and the execution content of each data processing statement is intercepted from the execution log (the execution content at least includes the log completion time). The data processing statements are then written, in order of log completion time, into a preset sorting queue (e.g., a Kafka topic queue, that is, a Kafka queue), which is used for sequentially analyzing the statement type of each data processing statement.
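As a non-limiting illustration, the ordering step above can be sketched as follows. The pipe-delimited line layout, the `Statement` fields and the in-memory `Queue` standing in for the Kafka queue are assumptions made for this sketch only; the embodiment does not specify the preset format.

```python
from dataclasses import dataclass
from datetime import datetime
from queue import Queue


@dataclass(frozen=True)
class Statement:
    statement_id: str      # statement identifier split out of the audit log
    finished_at: datetime  # log completion time taken from the execution log
    succeeded: bool        # execution state of the statement
    sql: str


def parse_line(line: str) -> Statement:
    # Hypothetical preset format: id|completion time|status|sql
    statement_id, finished, status, sql = line.rstrip("\n").split("|", 3)
    return Statement(statement_id, datetime.fromisoformat(finished), status == "OK", sql)


def enqueue_in_completion_order(lines, queue: Queue) -> None:
    # Sort by completion time so that consumers analyse statements
    # in the order in which they finished.
    for stmt in sorted(map(parse_line, lines), key=lambda s: s.finished_at):
        queue.put(stmt)


audit_lines = [
    "q2|2021-01-01T10:05:00|OK|insert into t1 partition (pt_dt='2021-01-01') select x from t2",
    "q1|2021-01-01T10:01:00|OK|use db1",
]
sorting_queue = Queue()  # stand-in for the preset sorting queue
enqueue_in_completion_order(audit_lines, sorting_queue)
ordered_ids = [sorting_queue.get().statement_id for _ in range(sorting_queue.qsize())]
```

In this sketch the statement that finished first (`q1`) is consumed first, even though it appears later in the collected lines.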

In the embodiment of the present invention, after the data processing statements are written into the preset sorting queue, the SQL statements (i.e., the data processing statements) may be obtained from the queue and screened by execution state so that only successfully executed statements are kept. The screened SQL statements are then analyzed by type and divided into 2 types according to their relation to data change: (1) statements unrelated to data change, i.e., statements whose execution does not change table data, such as set-queue statements, use database statements, authorization (grant) statements, query statements and plain create-table statements; (2) statements related to data change, i.e., statements whose execution may change table data, such as insert, delete and create select statements. This embodiment mainly processes the statements related to data change.
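As a non-limiting illustration, the two-way classification can be sketched with a keyword match. The keyword list is an assumption for this sketch, and "create select" is interpreted here as `CREATE TABLE ... AS SELECT`; a production parser would more likely rely on the SQL grammar of the data warehouse.

```python
import re

# Statements that may change table data: insert, delete, create ... as select.
# Everything else (set, use, grant, select, plain create table) is treated as
# unrelated to data change. The keyword list is illustrative only.
_CHANGES_DATA = re.compile(
    r"^\s*(insert\s|delete\s|create\s+table\b.*\bas\s+select\b)",
    re.IGNORECASE | re.DOTALL,
)


def changes_data(sql: str) -> bool:
    """Return True when the statement belongs to type (2), related to data change."""
    return _CHANGES_DATA.match(sql) is not None


def needs_processing(sql: str, succeeded: bool) -> bool:
    """Keep only successfully executed statements that may change data."""
    return succeeded and changes_data(sql)
```

For example, `needs_processing("insert into t1 select * from t2", True)` evaluates to `True`, while a failed insert or a successful `use db1` is skipped.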

In the embodiment of the present invention, after the data processing statements are classified, statements unrelated to data change need no processing, while SQL statements related to data change are analyzed further: the table name and data partition of the operated target table are extracted according to the syntax structure (i.e., the target table and the storage data partition operated by the data processing statement with data change are extracted). For example, for the SQL statement insert into t1 partition (pt_dt='2021-01-01') select xxx from t2, the target table name t1 and the storage data partition 2021-01-01 are extracted.
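As a non-limiting illustration, the extraction for an insert statement with a single static partition can be sketched with a regular expression. The pattern and the function name `extract_target` are assumptions for this sketch; it does not cover multi-level or dynamic partitions.

```python
import re
from typing import Optional, Tuple

_INSERT = re.compile(
    r"insert\s+(?:into|overwrite)\s+(?:table\s+)?(?P<table>[\w.]+)"
    r"(?:\s+partition\s*\(\s*(?P<key>\w+)\s*=\s*'?(?P<value>[^')]+)'?\s*\))?",
    re.IGNORECASE,
)


def extract_target(sql: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (target table, partition value) for an insert statement, else None."""
    match = _INSERT.search(sql)
    if match is None:
        return None
    value = match.group("value")
    return match.group("table"), value.strip() if value is not None else None


target = extract_target(
    "insert into t1 partition (pt_dt='2021-01-01') select xxx from t2"
)
# target == ("t1", "2021-01-01")
```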

Optionally, after extracting the target table and the storage data partition operated by the data processing statement with the data change, the method further includes: querying a data modification list to be processed based on the table identifier of the target table and the area identifier of the storage data partition; and extracting the statement execution end time of the data processing statement with the modification identifier in the data modification list.

In the embodiment of the present invention, a data modification list to be processed (i.e., the data synchronization task log table) is queried based on the table identifier of the target table and the area identifier of the storage data partition, and the statement execution end time of each data processing statement that has a modification identifier in the data modification list is extracted. For example, the "data synchronization task log table" is looked up by the table name t1: if table t1 is not involved in any data synchronization task, no processing is required; if a related task is found (i.e., there is a data processing statement with a modification identifier), the table name, the data partition and the execution end time of the SQL statement are recorded in the "data modification record list table".

In this embodiment, FIG. 2 is a flowchart of an optional way of parsing SQL statements according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps: reading SQL statements from the Kafka queue; screening the SQL statements by execution state and skipping statements that did not execute successfully; analyzing the type of each successfully executed statement; and dividing the SQL statements into 2 types according to their relation to data change: (1) statements unrelated to data change, i.e., statements whose execution does not change table data, such as set-queue statements, use database statements, authorization (grant) statements, query statements and plain create-table statements; (2) statements related to data change, i.e., statements whose execution may change table data, such as insert, delete and create select statements.

Statements unrelated to data change are skipped without processing. SQL statements related to data change are analyzed further, and the table name and data partition of the operated target table are extracted according to the syntax structure (i.e., the table name and the partition are extracted). For example, for the SQL statement insert into t1 partition (pt_dt='2021-01-01') select xxx from t2, the table name t1 and the data partition 2021-01-01 are extracted.

The "data synchronization task log table" is then looked up by the table name t1 to determine whether a data synchronization task exists. If table t1 has no data synchronization task, the statement is skipped without processing; if it does, the table name, the data partition and the execution end time of the SQL statement are recorded in the "data modification record list table".
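As a non-limiting illustration, the final lookup-and-record step can be sketched as follows. The in-memory set and list stand in for the "data synchronization task log table" and the "data modification record list table"; the class and field names are assumptions for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set


@dataclass
class ModificationRecord:
    table: str
    partition: str
    sql_end_time: datetime  # execution end time of the SQL statement
    completed: bool = False


def record_change(
    table: str,
    partition: str,
    sql_end_time: datetime,
    tables_with_sync_task: Set[str],
    modification_list: List[ModificationRecord],
) -> bool:
    """Record the change only when the table has a data synchronization task."""
    if table not in tables_with_sync_task:
        return False  # no synchronization task for this table: skip
    modification_list.append(ModificationRecord(table, partition, sql_end_time))
    return True


pending: List[ModificationRecord] = []
recorded = record_change("t1", "2021-01-01", datetime(2021, 1, 1, 10, 5), {"t1"}, pending)
skipped = record_change("t9", "2021-01-01", datetime(2021, 1, 1, 10, 6), {"t1"}, pending)
```

Here only the change to `t1` is recorded as pending; the change to `t9` is ignored because no synchronization task refers to that table.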

Optionally, the step of comparing the file update time of each data file in the target table and the storage data partition includes: acquiring file synchronization time of each data file under a target table and a storage data partition; under the condition that the execution ending time of the statement is less than or equal to the file synchronization time, confirming that the data processing statement with the modification mark is synchronized and completed; and in the case that the statement execution end time is greater than the file synchronization time, setting the statement execution end time as a file update time.

In the embodiment of the present invention, the last synchronized data file time of the table and partition (e.g., the pt1 partition of table t1) is acquired from the "data synchronization task log table" (i.e., the file synchronization time of each data file under the target table and the storage data partition is acquired). When the statement execution end time is less than or equal to the file synchronization time, it is determined that the data processing statement with the modification identifier has already been synchronized (i.e., the corresponding record in the "data modification record list table" is set to completed); when the statement execution end time is greater than the file synchronization time, the statement execution end time is set as the file update time.

Optionally, after comparing the file update time of each data file in the target table and the storage data partition, the method further includes: comparing the file updating time with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result; and when the file updating time is different from the maximum deadline in the data synchronization task log table, determining that the file updating time triggers a data synchronization instruction.

In the embodiment of the invention, the file update time is compared with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result. When the file update time is equal to the maximum deadline, the data is already the latest and need not be synchronized again, so the corresponding record in the "data modification record list table" is set to completed. When the file update time differs from the maximum deadline in the data synchronization task log table, it is determined that the file update time triggers the data synchronization instruction, i.e., the data has been updated and needs to be synchronized, and the data synchronization instruction is sent.
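As a non-limiting illustration, the two comparisons described above can be sketched as a pair of small functions. The function names are assumptions for this sketch, and `None` is used here to represent "already synchronized".

```python
from datetime import datetime
from typing import Optional


def file_update_time(
    statement_end_time: datetime, file_sync_time: datetime
) -> Optional[datetime]:
    """Return the file update time, or None when the statement is already synchronized."""
    if statement_end_time <= file_sync_time:
        return None  # record can be marked completed
    return statement_end_time


def triggers_sync(update_time: Optional[datetime], max_deadline: datetime) -> bool:
    """A synchronization instruction is triggered only when the times differ."""
    return update_time is not None and update_time != max_deadline


synced_at = datetime(2021, 1, 1, 10, 0)
update_time = file_update_time(datetime(2021, 1, 1, 10, 5), synced_at)
should_sync = triggers_sync(update_time, max_deadline=synced_at)
```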

In this embodiment, fig. 3 is a flowchart of an optional synchronization task scheduling method according to an embodiment of the present invention. As shown in fig. 3, the method includes the following steps: the pending data modification records of the same table and partition (for example, partition pt1 of table t1) are taken from the data modification record list table and merged, and the maximum SQL statement execution end time is taken (denoted as lastsqlexecuttime); the time of the last synchronized data file for that table and partition (denoted as lasttreplitadatatime) is then obtained from the data synchronization task log table; finally, lastsqlexecuttime is compared with lasttreplitadatatime:

(1) when lastsqlexecuttime is less than or equal to lasttreplitadatatime, the merged records in the data modification record list table are set to completed;

(2) when lastsqlexecuttime is greater than lasttreplitadatatime, the cluster is accessed to read the latest update time of the data files under the partition of table t1 (denoted as lastFileDataTime), and lastFileDataTime is compared with lasttreplitadatatime: when lastFileDataTime = lasttreplitadatatime, the synchronized data is already the latest and repeated synchronization is not needed, so the merged records in the "data modification record list table" are set to completed; when lastFileDataTime <> lasttreplitadatatime (where "<>" indicates that the two are different), the data of the table has been updated and needs to be synchronized, so a data synchronization request is sent.
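As a non-limiting illustration, the two-stage comparison of cases (1) and (2) can be sketched in Python as follows. The record shape, the function name `schedule`, the return values "completed" and "sync", and the callback `read_last_file_data_time` are assumptions introduced only for this sketch; the embodiment does not prescribe them. The variables `last_sql_execute_time`, `last_replica_data_time` and `last_file_data_time` stand for lastsqlexecuttime, lasttreplitadatatime and lastFileDataTime respectively.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Sequence


@dataclass(frozen=True)
class ModificationRecord:
    """One pending row of the data modification record list table (hypothetical shape)."""
    table: str
    partition: str
    sql_end_time: datetime


def schedule(
    records: Sequence[ModificationRecord],
    last_replica_data_time: datetime,
    read_last_file_data_time: Callable[[], datetime],
) -> str:
    """Return "completed" or "sync" for the merged records of one table/partition."""
    # Merge the pending records by keeping only the latest SQL end time.
    last_sql_execute_time = max(r.sql_end_time for r in records)

    # Case (1): every change finished no later than the last synchronized file time.
    if last_sql_execute_time <= last_replica_data_time:
        return "completed"

    # Case (2): only now is the cluster asked for the newest file time.
    last_file_data_time = read_last_file_data_time()
    if last_file_data_time == last_replica_data_time:
        return "completed"
    return "sync"
```

In this sketch the cluster is queried only in case (2), so records that are already covered by an earlier synchronization are closed without any access to the cluster.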

Optionally, in a case that the file update time triggers a data synchronization instruction, the step of writing data in the storage data partition where the update data exists into the target database includes: when the file update time triggers the data synchronization instruction, reading a source cluster, a source table user, a target table and a target table user in the data synchronization instruction; accessing, with a first user identifier of the source table user in the source table, the storage data partition with updated data under the target table of the source cluster to read partition data; clearing, under the target table, the data whose partition field is the storage data partition with the updated data; and writing the partition data into the target database.

In the embodiment of the present invention, when the file update time triggers a data synchronization instruction, the source cluster, source table user, target cluster, target table and target table user in the data synchronization instruction are read. The storage data partition in which updated data exists under the target table of the source cluster is accessed with the first user identifier of the source table user in the source table to read partition data (for example, partition pt1 of table t1 in the source cluster is accessed as the source table user to read data). Then, under the target table, the data whose partition field is the storage data partition with the updated data is cleared, and the partition data is written into the target database (for example, the data whose partition field is pt1 is cleared from the target table of the target cluster, and the data read from the source cluster is written in).

In this embodiment, the data clearing operation and the write operation may be completed within one transaction to ensure data consistency.
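As a non-limiting illustration of completing the clearing and the writing within one transaction, the following Python sketch uses the standard-library sqlite3 module as a stand-in for the target relational database. The table layout (columns `pt` and `value`), the function name `replace_partition` and the sample rows are assumptions made only for this sketch.

```python
import sqlite3
from typing import Iterable, Optional


def replace_partition(conn: sqlite3.Connection, partition: str,
                      values: Iterable[Optional[str]]) -> None:
    """Clear one partition of the target table and write the new rows atomically."""
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so readers never observe the partition cleared but not yet refilled.
    with conn:
        conn.execute("DELETE FROM t1 WHERE pt = ?", (partition,))
        conn.executemany(
            "INSERT INTO t1 (pt, value) VALUES (?, ?)",
            [(partition, v) for v in values],
        )


# Hypothetical target table with two partitions already loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (pt TEXT NOT NULL, value TEXT NOT NULL)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [("pt1", "old"), ("pt2", "keep")])
conn.commit()

# Only partition pt1 is replaced; pt2 is left untouched.
replace_partition(conn, "pt1", ["new-a", "new-b"])
```

If the write fails part-way, the rollback also restores the rows removed by the clearing step, so the target table keeps its previous content for that partition.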

Optionally, after writing the partition data into the target database, the method further includes: acquiring the end time of writing the partition data into the target database to obtain the synchronization end time; and updating the maximum deadline in the data synchronization task log table to be the synchronization end time.

In the embodiment of the present invention, after data synchronization is completed, the end time of writing partition data into the target database is obtained, and then the latest time of the data file synchronized on the cluster is updated to the lasttreplitadatatime field in the "data synchronization task log table" (i.e., the maximum deadline in the data synchronization task log table is updated to the synchronization end time).

In the embodiment of the invention, the audit log of big data SQL execution is acquired, the SQL statements in the audit log are parsed, the target table and storage data partition in which data may have changed are extracted, the file change time of the target table and storage data partition is compared with the state of the most recently synchronized data, data synchronization is triggered, and the data in the changed partition is read and written into a preset database. In this way, the following beneficial effects can be achieved:

(1) Data changes are analyzed from the executed SQL recorded in the big data Hadoop audit log, which is decoupled from the flow that initiates the SQL request, so no additional operation is imposed on the big data batch processing application.

(2) A monitoring mode that automatically and quickly triggers data synchronization in big data is provided; a data change can be found in real time after execution of the SQL finishes, so timeliness is high.

(3) In the data synchronization scheduling process, repeated synchronization of the same data within a short time is reduced by comparing several times and by merging the executed SQL operations.

Example two

Fig. 4 is a schematic diagram of an optional system that monitors data changes based on the SQL executed in Hive and triggers data synchronization. As shown in fig. 4, the system includes: an audit log collection module, a modification table analysis module, a synchronization task scheduling module, a data synchronization module, Hive and a relational database, wherein:

The audit log collection module can be used for collecting the execution logs of all SQL statements in Hive, intercepting the SQL statements in the logs and providing the intercepted SQL statements to the modification table analysis module.

The modification table analysis module can be used for analyzing the received SQL, screening out the statements that involve a change of table data, then parsing those SQL statements and extracting the target table name and partition.

The synchronization task scheduling module can be used for checking the tables with data changes, confirming whether data synchronization is needed, and sending a data synchronization request to the data synchronization module if synchronization is needed.

The data synchronization module can be used for reading data under the changed partition from Hive and then writing the data into the relational database.
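As a non-limiting illustration of the screening and extraction performed by the modification table analysis module, the following Python sketch recognizes only a simplified form of INSERT OVERWRITE TABLE and INSERT INTO statements with a regular expression. The pattern, the function name `extract_target` and the sample statement are assumptions made for this sketch; the embodiment does not state how the SQL is parsed, and a practical implementation would more likely rely on a full SQL parser.

```python
import re
from typing import Optional, Tuple

# Simplified pattern: covers only INSERT OVERWRITE TABLE / INSERT INTO [TABLE],
# optionally followed by a PARTITION (...) clause.
_CHANGE_PATTERN = re.compile(
    r"^\s*insert\s+(?:overwrite\s+table|into(?:\s+table)?)\s+(?P<table>[\w.]+)"
    r"(?:\s+partition\s*\((?P<partition>[^)]*)\))?",
    re.IGNORECASE,
)


def extract_target(sql: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (table, partition spec) for a data-changing statement, else None."""
    match = _CHANGE_PATTERN.match(sql)
    if match is None:
        return None  # treated as a statement in which no data change occurs
    partition = match.group("partition")
    return match.group("table"), (partition.strip() if partition is not None else None)


target = extract_target("INSERT OVERWRITE TABLE t1 PARTITION (pt='pt1') SELECT * FROM src")
```

Statements that do not match, such as plain queries, return None and would therefore not produce a record for the synchronization task scheduling module.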

In the embodiment of the invention, by using the system that monitors data changes based on the SQL executed in Hive and triggers data synchronization, the audit log of big data SQL execution can be collected, the SQL statements in the audit log are parsed, the target table and storage data partition in which data may have changed are extracted, the file change time of the target table and storage data partition is compared with the state of the most recently synchronized data, data synchronization is triggered, and the data in the changed partition is read and written into a preset database.

Example three

The data synchronization apparatus provided in this embodiment includes a plurality of implementation units, and each implementation unit corresponds to each implementation step in the first embodiment.

Fig. 5 is a schematic diagram of an alternative data synchronization apparatus according to an embodiment of the present invention, and as shown in fig. 5, the synchronization apparatus may include: an acquisition unit 50, an extraction unit 51, a comparison unit 52, a writing unit 53, wherein,

the acquisition unit 50 is configured to collect an audit log of a preset distributed cluster system, where each data processing statement in the audit log corresponds to a statement type, and the statement types include: data change occurs and data change does not occur;

the extraction unit 51 is configured to extract the target table and the storage data partition operated on by a data processing statement in which a data change occurs;

the comparison unit 52 is configured to compare the file update time of each data file under the target table and the storage data partition;

and the writing unit 53 is configured to write the data in the storage data partition in which updated data exists into the target database, in a case where the file update time triggers the data synchronization instruction.

In the synchronization apparatus, the acquisition unit 50 collects the audit log of the preset distributed cluster system, the extraction unit 51 extracts the target table and the storage data partition operated on by the data processing statement in which a data change occurs, the comparison unit 52 compares the file update time of each data file under the target table and the storage data partition, and the writing unit 53 writes the data in the storage data partition in which updated data exists into the target database when the file update time triggers the data synchronization instruction. In the embodiment of the invention, the target table and storage data partition whose data has changed can be extracted, and the file update time of each data file under them can be compared, so that the data in the storage data partition with updated data is written into the target database in time. This improves the timeliness of triggering data synchronization, the process is fully automatic, and the technical problem in the related art that data synchronization cannot be triggered automatically when data changes is solved.

Optionally, the synchronization apparatus further includes: a first segmentation module, used for segmenting the audit log according to a preset format after the audit log of the preset distributed cluster system is collected, so as to obtain the statement identifier of each data processing statement; a first acquisition module, used for acquiring the execution logs of all data processing statements in a preset data warehouse based on the statement identifiers; a first intercepting module, configured to intercept the execution content of each data processing statement in the execution log, where the execution content at least includes: log completion time; and a first writing module, used for writing the data processing statements into a preset sorting queue in order of log completion time, where the preset sorting queue is used for analyzing the statement type of each data processing statement in sequence.
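As a non-limiting illustration of a sorting queue ordered by log completion time, the following Python sketch uses the standard-library heapq module. The class name `SortingQueue`, its methods and the tie-breaking counter are assumptions made only for this sketch; the embodiment does not specify how the preset sorting queue is implemented.

```python
import heapq
from datetime import datetime
from typing import Iterator, List, Tuple


class SortingQueue:
    """Hands out data processing statements in order of log completion time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[datetime, int, str, str]] = []
        self._counter = 0  # tie-breaker: equal times keep their arrival order

    def push(self, log_completion_time: datetime, statement_id: str, sql: str) -> None:
        heapq.heappush(self._heap, (log_completion_time, self._counter, statement_id, sql))
        self._counter += 1

    def drain(self) -> Iterator[Tuple[str, str]]:
        """Yield (statement_id, sql) pairs, earliest completion time first."""
        while self._heap:
            _, _, statement_id, sql = heapq.heappop(self._heap)
            yield statement_id, sql
```

Statements may thus be pushed in whatever order their execution logs are read, while the statement type analysis still sees them in the order in which they completed.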

Optionally, the synchronization apparatus further includes: a first query module, used for querying a data modification list to be processed based on the table identifier of the target table and the area identifier of the storage data partition, after the target table and the storage data partition operated on by a data processing statement with data change are extracted; and a first extraction module, used for extracting the statement execution end time of the data processing statement with the modification identifier in the data modification list.

Optionally, the comparison unit includes: a first acquisition module, used for acquiring the file synchronization time of each data file under the target table and the storage data partition; a first determining module, used for determining that the data processing statement with the modification identifier has been synchronized in a case where the statement execution end time is less than or equal to the file synchronization time; and a first setting module, used for setting the statement execution end time as the file update time in a case where the statement execution end time is greater than the file synchronization time.

Optionally, the synchronization apparatus further includes: a first comparison module, used for comparing, after the file update time of each data file under the target table and the storage data partition has been compared, the file update time with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result; and a first triggering module, used for confirming that the file update time triggers the data synchronization instruction when the file update time is different from the maximum deadline in the data synchronization task log table.

Optionally, the writing unit includes: a first reading module, used for reading a source cluster, a source table user, a target table and a target table user in a data synchronization instruction when the file update time triggers the data synchronization instruction; a first access module, used for accessing, with a first user identifier of the source table user in the source table, the storage data partition with updated data under the target table of the source cluster so as to read partition data; a first clearing module, used for clearing, under the target table, the data whose partition field is the storage data partition with the updated data; and a second writing module, used for writing the partition data into the target database.

The above-mentioned synchronization device may further include a processor and a memory, and the above-mentioned acquisition unit 50, the extraction unit 51, the comparison unit 52, the writing unit 53, and the like are all stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters, the data in the storage data partition with the updated data is written into the target database when the file update time triggers the data synchronization instruction.

The memory may include, among computer-readable media, volatile memory such as Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: collecting an audit log of a preset distributed cluster system, extracting the target table and the storage data partition operated on by a data processing statement with data change, comparing the file update time of each data file under the target table and the storage data partition, and writing the data in the storage data partition with updated data into a target database when the file update time triggers a data synchronization instruction.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, which includes a stored computer program, wherein when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to execute the above-mentioned data synchronization method.

According to another aspect of embodiments of the present invention, there is also provided an electronic device, including one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data synchronization method described above.

Fig. 6 is a block diagram of a hardware structure of an electronic device (or mobile device) for a data synchronization method according to an embodiment of the present invention. As shown in fig. 6, the electronic device may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. In addition, the electronic device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a keyboard, a power supply, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 6 is only an illustration and is not intended to limit the structure of the electronic device. For example, the electronic device may also include more or fewer components than shown in fig. 6, or have a different configuration than shown in fig. 6.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method of data synchronization, comprising:

collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement type comprises: data change occurs and data change does not occur;

extracting a target table and a storage data partition operated on by the data processing statement with data change;

comparing the file update time of each data file under the target table and the storage data partition;

and under the condition that the file updating time triggers a data synchronization instruction, writing the data in the storage data partition with the updated data into the target database.

2. The method of claim 1, after collecting the audit log of the preset distributed cluster system, further comprising:

dividing the audit log according to a preset format to obtain a statement identifier of each data processing statement;

acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identification;

intercepting the execution content of each data processing statement in the execution log, wherein the execution content at least comprises: log completion time;

and writing the data processing statements into a preset sorting queue in sequence according to the sequence of the log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.

3. The method according to claim 1, further comprising, after extracting the target table and the storage data partition operated by the data processing statement in which the data change occurs:

inquiring a data modification list to be processed based on the table identifier of the target table and the area identifier of the storage data partition;

and extracting the statement execution ending time of the data processing statement with the modification identifier in the data modification list.

4. The method of claim 3, wherein the step of comparing the file update time of each data file under the target table and the storage data partition comprises:

acquiring file synchronization time of each data file in the target table and the storage data partition;

under the condition that the statement execution ending time is less than or equal to the file synchronization time, confirming that the data processing statement with the modification identifier has been synchronized;

and setting the statement execution ending time as the file updating time under the condition that the statement execution ending time is greater than the file synchronization time.

5. The method of claim 4, further comprising, after comparing the file update time of each data file under the target table and the storage data partition:

comparing the file updating time with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result;

and when the file updating time is different from the maximum deadline in the data synchronization task log table, determining that the file updating time triggers a data synchronization instruction.

6. The method according to claim 1, wherein in the case that the file update time triggers a data synchronization instruction, the step of writing data under the storage data partition where the update data exists into the target database comprises:

under the condition that the file updating time triggers a data synchronization instruction, reading a source cluster, a source table user, a target table and a target table user in the data synchronization instruction;

accessing a storage data partition with updated data under a target table of the source cluster by adopting a first user identifier of the source table user in the source table to read partition data;

clearing, under the target table, the data whose partition field is the storage data partition with the updated data;

and writing the partition data into a target database.

7. The method of claim 6, after writing the partition data to the target database, further comprising:

acquiring the end time of writing the partition data into the target database to obtain the synchronization end time;

and updating the maximum deadline in the data synchronization task log table as the synchronization end time.

8. A data synchronization apparatus, comprising:

an acquisition unit, used for collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement type comprises: data change occurs and data change does not occur;

an extraction unit, used for extracting a target table and a storage data partition operated on by the data processing statement with data change;

the comparison unit is used for comparing the file updating time of each data file under the target table and the storage data partition;

and the writing unit is used for writing the data in the storage data partition with the updated data into the target database under the condition that the file updating time triggers the data synchronization instruction.

9. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the data synchronization method of any one of claims 1 to 7.

10. An electronic device comprising one or more processors and memory storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data synchronization method of any of claims 1-7.