CN114490554A - Data synchronization method and device, electronic equipment and storage medium - Google Patents

Data synchronization method and device, electronic equipment and storage medium

Info

Publication number
CN114490554A
Authority
CN
China
Prior art keywords
data
partition
statement
file
synchronization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210134131.3A
Other languages
Chinese (zh)
Inventor
林丹
沈贇
刘雪晶
阳兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210134131.3A
Publication of CN114490554A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data synchronization method and device, electronic equipment, and a storage medium, and relates to the technical field of big data. The synchronization method comprises the following steps: collecting an audit log of a preset distributed cluster system; extracting the target table and the storage data partition operated on by a data processing statement in which a data change occurs; comparing the file update time of each data file under the target table and the storage data partition; and, when the file update time triggers a data synchronization instruction, writing the data under the storage data partition that has updated data into a target database. The invention solves the technical problem in the related art that data synchronization cannot be triggered automatically when data changes.

Description

Data synchronization method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of big data, in particular to a data synchronization method and device, electronic equipment and a storage medium.
Background
With the rapid development of big data technology, data scale has grown explosively and data forms have diversified, and most enterprises have begun to build big data system architectures based on the open-source Hadoop ecosystem. In the related art, data is stored uniformly in the Hadoop file system and can be batch-processed by multiple computing engines such as Hive and Spark. In online access scenarios, however, Hive's timeliness is low, so most enterprises synchronize the data into a relational database. Consequently, when data changes on Hadoop, the change needs to be detected quickly and data synchronization needs to be triggered automatically, so that the data served in online access scenarios takes effect in a timely manner.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a data synchronization method and device, electronic equipment and a storage medium, which are used for at least solving the technical problem that data synchronization cannot be automatically triggered when data change occurs in the related technology.
According to an aspect of an embodiment of the present invention, there is provided a data synchronization method, including: collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: a data change occurs, and no data change occurs; extracting the target table and the storage data partition operated on by a data processing statement in which a data change occurs; comparing the file update time of each data file under the target table and the storage data partition; and, when the file update time triggers a data synchronization instruction, writing the data under the storage data partition that has updated data into the target database.
Optionally, after collecting an audit log of a preset distributed cluster system, the method further includes: dividing the audit log according to a preset format to obtain a statement identifier of each data processing statement; acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identification; intercepting the execution content of each data processing statement in the execution log, wherein the execution content at least comprises: log completion time; and writing the data processing statements into a preset sorting queue in sequence according to the sequence of the log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.
Optionally, after extracting the target table and the storage data partition operated on by the data processing statement in which a data change occurs, the method further includes: querying a to-be-processed data modification list based on the table identifier of the target table and the area identifier of the storage data partition; and extracting the statement execution end time of each data processing statement that has a modification identifier in the data modification list.
Optionally, the step of comparing the file update time of each data file under the target table and the storage data partition includes: acquiring the file synchronization time of each data file under the target table and the storage data partition; when the statement execution end time is less than or equal to the file synchronization time, confirming that the data processing statement with the modification identifier has already been synchronized; and when the statement execution end time is greater than the file synchronization time, setting the statement execution end time as the file update time.
Optionally, after comparing the file update time of each data file under the target table and the storage data partition, the method further includes: comparing the file updating time with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result; and when the file updating time is different from the maximum deadline in the data synchronization task log table, determining that the file updating time triggers a data synchronization instruction.
Optionally, the step of writing the data under the storage data partition that has updated data into the target database when the file update time triggers a data synchronization instruction includes: when the file update time triggers the data synchronization instruction, reading the source cluster, the source table user, the target table, and the target table user in the data synchronization instruction; using a first user identifier of the source table user to access the storage data partition that has updated data under the target table of the source cluster, so as to read the partition data; clearing, under the target table, the data whose partition field corresponds to the storage data partition that has updated data; and writing the partition data into the target database.
Optionally, after writing the partition data into the target database, the method further includes: acquiring the end time of writing the partition data into the target database to obtain the synchronous end time; and updating the maximum deadline in the data synchronization task log table as the synchronization end time.
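The trigger check and the write-back described in the optional steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: in-memory lists of dictionaries stand in for the source partition and the target database, a dictionary stands in for the data synchronization task log table, the function names and the clock parameter are hypothetical, and the source and target user identities are omitted.

```python
from datetime import datetime


def triggers_sync(file_update_time, max_deadline):
    """Trigger a sync when the file update time differs from the logged maximum deadline."""
    return file_update_time != max_deadline


def sync_partition(source_rows, target_rows, partition_field, partition_value,
                   task_log, clock=datetime.now):
    """Replace one partition's rows in the target and record the sync end time."""
    # Read the rows of the updated partition from the (stand-in) source table.
    partition_rows = [r for r in source_rows if r[partition_field] == partition_value]
    # Clear the target rows whose partition field matches the updated partition.
    target_rows[:] = [r for r in target_rows if r[partition_field] != partition_value]
    # Write the partition data into the (stand-in) target database.
    target_rows.extend(partition_rows)
    # Record the synchronization end time as the new maximum deadline.
    task_log["max_deadline"] = clock()
    return len(partition_rows)


# Hypothetical sample data.
source = [{"pt_dt": "2021-01-01", "v": 1}, {"pt_dt": "2021-01-02", "v": 2}]
target = [{"pt_dt": "2021-01-01", "v": 0}, {"pt_dt": "2020-12-31", "v": 9}]
task_log = {"max_deadline": datetime(2021, 1, 1)}
sync_end = datetime(2021, 1, 2, 2, 0, 0)

written = 0
if triggers_sync(datetime(2021, 1, 2, 1, 0, 0), task_log["max_deadline"]):
    written = sync_partition(source, target, "pt_dt", "2021-01-01",
                             task_log, clock=lambda: sync_end)
```

Clearing the partition before writing keeps the sketch idempotent: re-running the same synchronization replaces the partition's rows instead of duplicating them.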
According to another aspect of the embodiments of the present invention, there is also provided a data synchronization apparatus, including: an acquisition unit, configured to collect an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: a data change occurs, and no data change occurs; an extraction unit, configured to extract the target table and the storage data partition operated on by a data processing statement in which a data change occurs; a comparison unit, configured to compare the file update time of each data file under the target table and the storage data partition; and a writing unit, configured to write the data under the storage data partition that has updated data into the target database when the file update time triggers a data synchronization instruction.
Optionally, the synchronization apparatus further includes: the first segmentation module is used for segmenting an audit log of a preset distributed cluster system according to a preset format after the audit log is collected, so as to obtain a statement identifier of each data processing statement; the first acquisition module is used for acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identifications; a first intercepting module, configured to intercept execution content of each data processing statement in the execution log, where the execution content at least includes: log completion time; and the first writing module is used for sequentially writing the data processing statements into a preset sorting queue according to the sequence of the log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.
Optionally, the synchronization apparatus further includes: the first query module is used for querying a data modification list to be processed based on a table identifier of the target table and an area identifier of the storage data partition after extracting the target table and the storage data partition operated by the data processing statement with data change; and the first extraction module is used for extracting the statement execution ending time of the data processing statement with the modification identifier in the data modification list.
Optionally, the comparison unit includes: a first acquisition module, configured to acquire the file synchronization time of each data file under the target table and the storage data partition; a first determining module, configured to confirm that the data processing statement with the modification identifier has already been synchronized when the statement execution end time is less than or equal to the file synchronization time; and a first setting module, configured to set the statement execution end time as the file update time when the statement execution end time is greater than the file synchronization time.
Optionally, the synchronization apparatus further includes: the first comparison module is used for comparing the file updating time with the maximum deadline in the data synchronization task log table after comparing the file updating time of each data file in the target table and the stored data partition, so as to obtain a deadline comparison result; and the first triggering module is used for confirming that the file updating time triggers the data synchronization instruction when the file updating time is different from the maximum deadline in the data synchronization task log table.
Optionally, the writing unit includes: the first reading module is used for reading a source cluster, a source table user, a target table and a target table user in a data synchronization instruction under the condition that the file updating time triggers the data synchronization instruction; the first access module is used for accessing a storage data partition with update data under a target table of the source cluster by adopting a first user identifier of the source table user so as to read partition data; the first clearing module is used for clearing the data of the storage data partition with the updated data in the partition field under the target table; and the second writing module is used for writing the partition data into a target database.
Optionally, the synchronization apparatus further includes: the second acquisition module is used for acquiring the end time of the partition data written into the target database after the partition data is written into the target database, so as to obtain the synchronous end time; and the first updating module is used for updating the maximum deadline in the data synchronization task log table to be the synchronization end time.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the above data synchronization method.
According to another aspect of embodiments of the present invention, there is also provided an electronic device, including one or more processors and a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the data synchronization method described above.
In the present disclosure, an audit log of a preset distributed cluster system is collected; the target table and the storage data partition operated on by a data processing statement in which a data change occurs are extracted; the file update time of each data file under the target table and the storage data partition is compared; and, when the file update time triggers a data synchronization instruction, the data under the storage data partition that has updated data is written into a target database. Because the target table and the storage data partition whose data has changed are extracted and the file update times of their data files are compared, the data under the storage data partition that has updated data can be written into the target database promptly. This improves the timeliness of triggering data synchronization, makes the process fully automatic, and solves the technical problem in the related art that data synchronization cannot be triggered automatically when data changes.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of an alternative method of data synchronization according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an alternative parsing of an SQL statement according to an embodiment of the invention;
FIG. 3 is a flow diagram of an alternative synchronized task scheduling according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative data synchronization system for monitoring data change triggering based on Hive execution SQL according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative data synchronization apparatus according to an embodiment of the present invention;
FIG. 6 is a block diagram of a hardware structure of an electronic device (or mobile device) for a data synchronization method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
To facilitate understanding of the invention by those skilled in the art, some terms or nouns referred to in the embodiments of the invention are explained below:
Hadoop: a software framework for distributed processing of large amounts of data in a reliable, efficient, and scalable manner.
Hive: a Hadoop-based data warehouse tool used for data extraction, transformation, and loading; it provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop.
Spark: a fast, general-purpose computing engine designed for large-scale data processing.
It should be noted that the data synchronization method and the data synchronization device in the present disclosure may be used in the field of big data technology under the condition of data synchronization, and may also be used in any field except the field of big data technology under the condition of data synchronization, and the application fields of the data synchronization method and the data synchronization device in the present disclosure are not limited.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
The embodiments of the invention described below can be applied to various systems, applications, and devices for data synchronization. The invention collects the audit log of big data SQL execution, parses the SQL statements in the audit log, extracts the target tables and storage data partitions in which data changes may have occurred, compares the file change time of each target table and storage data partition with that of the most recently synchronized data, and triggers data synchronization, reading the data in the changed partition and writing it into a target database (for example, a relational database).
The present invention will be described in detail with reference to examples.
Example one
In accordance with an embodiment of the present invention, an embodiment of a data synchronization method is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.
FIG. 1 is a flow chart of an alternative data synchronization method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S101, collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: data changes occur and no data changes occur.
Step S102, extracting a target table operated by a data processing statement with data change and a storage data partition.
Step S103, the file updating time of each data file under the target table and the stored data partition is compared.
Step S104, writing the data in the storage data partition with the updated data into the target database under the condition that the file updating time triggers the data synchronization instruction.
Through the above steps, the audit log of the preset distributed cluster system can be collected; the target table and the storage data partition operated on by a data processing statement in which a data change occurs can be extracted; the file update time of each data file under the target table and the storage data partition can be compared; and, when the file update time triggers a data synchronization instruction, the data under the storage data partition that has updated data can be written into the target database. In the embodiment of the invention, because the target table and the storage data partition whose data has changed are extracted and the file update times of their data files are compared, the data under the storage data partition that has updated data can be written into the target database promptly. This improves the timeliness of triggering data synchronization, makes the process fully automatic, and solves the technical problem in the related art that data synchronization cannot be triggered automatically when data changes.
The following will explain the embodiments of the present invention in detail with reference to the above steps.
Step S101, collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: data changes occur and no data changes occur.
In the embodiment of the present invention, an audit log of a preset distributed cluster system (for example, a Hadoop cluster system) may be collected, and data may be obtained line by line, where the audit log has data processing statements, each data processing statement corresponds to a statement type, and the statement types include: data changes occur and no data changes occur.
Optionally, after collecting the audit log of the preset distributed cluster system, the method further includes: dividing the audit log according to a preset format to obtain a statement identifier of each data processing statement; acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identifiers; intercepting the execution content of each data processing statement in the execution log, wherein the execution content at least comprises a log completion time; and writing the data processing statements into a preset sorting queue in order of log completion time, wherein the preset sorting queue is used for analyzing the statement type of each data processing statement in sequence.
In the embodiment of the present invention, after the audit log of the preset distributed cluster system is collected, the audit log may be split according to a preset format (which may be set according to the actual situation) to obtain the statement identifier of each data processing statement. Based on the statement identifiers, the execution logs (e.g., Hive logs) of all data processing statements (e.g., SQL statements) in a preset data warehouse (e.g., a Hive data warehouse) are acquired, and the execution content of each data processing statement, which at least includes the log completion time, is intercepted from the execution log. The data processing statements are then written, in order of log completion time, into a preset sorting queue (e.g., a Kafka topic queue, that is, a Kafka queue), which is used for analyzing the statement type of each data processing statement in sequence.
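The splitting and ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions: the audit-line layout ('<statement_id>|<other fields>'), the execution-log dictionary, and all function names are hypothetical, since the text only specifies "a preset format", and an in-memory deque stands in for the Kafka topic queue.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Statement:
    statement_id: str
    sql: str
    status: str
    completed_at: datetime


def split_audit_line(line: str) -> str:
    """Return the statement identifier from one audit-log line.

    Assumed (hypothetical) layout: '<statement_id>|<other fields>'.
    """
    return line.split("|", 1)[0].strip()


def order_statements(audit_lines, execution_log):
    """Look up each statement's execution record and sort by log completion time.

    execution_log is a hypothetical mapping:
    statement_id -> (sql, status, completion time as 'YYYY-MM-DD HH:MM:SS').
    """
    statements = []
    for line in audit_lines:
        statement_id = split_audit_line(line)
        record = execution_log.get(statement_id)
        if record is None:
            continue  # no execution record was found for this identifier
        sql, status, completed = record
        completed_at = datetime.strptime(completed, "%Y-%m-%d %H:%M:%S")
        statements.append(Statement(statement_id, sql, status, completed_at))
    return sorted(statements, key=lambda s: s.completed_at)


# Hypothetical sample input; the deque stands in for the Kafka queue.
audit_lines = ["q2|user=etl", "q1|user=etl", "q3|user=bi"]
execution_log = {
    "q1": ("insert into t1 partition (pt_dt='2021-01-01') select xxx from t2",
           "SUCCESS", "2021-01-02 01:00:00"),
    "q2": ("select * from t1", "SUCCESS", "2021-01-02 01:05:00"),
}
queue = deque(order_statements(audit_lines, execution_log))
```

Ordering by completion time before enqueueing means a downstream consumer sees changes to the same table and partition in the order in which they finished.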
In the embodiment of the present invention, after the data processing statements are written into the preset sorting queue, the SQL statements (i.e., the data processing statements) may be read from the queue and screened by execution state, so that only successfully executed statements are kept. The screened SQL statements are then analyzed by type and divided into two types according to their relation to data changes: (1) statements unrelated to data changes, i.e., statements whose execution does not change table data, such as setting a queue, use database, granting privileges, query statements, and plain table creation; (2) statements related to data changes, i.e., statements whose execution may change table data, such as insert, delete, and create select statements. This embodiment mainly processes the statements related to data changes.
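A keyword-based version of this screening and classification might look like the sketch below. The patterns are an assumption made for illustration (the text does not specify how statement types are recognized), and "create select" is interpreted here as a create table statement that contains a select.

```python
import re

# Leading keywords assumed (for illustration) to mark statements that may change table data.
CHANGE_PATTERNS = [
    re.compile(r"^\s*insert\b", re.IGNORECASE),
    re.compile(r"^\s*delete\b", re.IGNORECASE),
    re.compile(r"^\s*create\s+table\b.*\bselect\b", re.IGNORECASE | re.DOTALL),
]


def classify(sql: str) -> str:
    """Return 'data_change' or 'no_data_change' for one SQL statement."""
    for pattern in CHANGE_PATTERNS:
        if pattern.search(sql):
            return "data_change"
    return "no_data_change"


def select_changing(statements):
    """Keep successfully executed statements that may change table data.

    statements is an iterable of (sql, succeeded) pairs.
    """
    return [sql for sql, succeeded in statements
            if succeeded and classify(sql) == "data_change"]
```

For example, `classify("create table t4 (id int)")` falls into the "no data change" type because the statement contains no select, while a failed insert is dropped by `select_changing` before classification matters.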
Step S102, extracting a target table operated by a data processing statement with data change and a storage data partition.
In the embodiment of the present invention, after the data processing statements are classified, statements unrelated to data changes need no processing, while the SQL statements related to data changes are analyzed further: the table name and the data partition of the operated target table are extracted according to the syntax structure (i.e., the target table and the storage data partition operated on by the data processing statement in which a data change occurs are extracted). For example, for the SQL statement insert into t1 partition (pt_dt='2021-01-01') select xxx from t2, the target table name t1 and the storage data partition 2021-01-01 are extracted.
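The extraction in the example above can be sketched with a regular expression. This is an illustrative simplification that handles only insert statements with an optional static partition clause; a production parser would need to follow the full SQL grammar, and the function name is hypothetical.

```python
import re

# Matches 'insert into|overwrite [table] <name> [partition (<spec>)]'.
INSERT_RE = re.compile(
    r"insert\s+(?:into|overwrite)\s+(?:table\s+)?(?P<table>[\w.]+)"
    r"(?:\s+partition\s*\((?P<spec>[^)]*)\))?",
    re.IGNORECASE,
)


def extract_target(sql):
    """Return (table_name, {partition_field: value}) or None if no insert is found."""
    match = INSERT_RE.search(sql)
    if match is None:
        return None
    partitions = {}
    for item in (match.group("spec") or "").split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            # Drop surrounding whitespace and quotes from the partition value.
            partitions[key.strip()] = value.strip().strip("'\"")
    return match.group("table"), partitions


result = extract_target(
    "insert into t1 partition (pt_dt='2021-01-01') select xxx from t2"
)
# result == ("t1", {"pt_dt": "2021-01-01"})
```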
Optionally, after extracting the target table and the storage data partition operated on by the data processing statement in which a data change occurs, the method further includes: querying a to-be-processed data modification list based on the table identifier of the target table and the area identifier of the storage data partition; and extracting the statement execution end time of each data processing statement that has a modification identifier in the data modification list.
In the embodiment of the present invention, the to-be-processed data modification list (i.e., the data synchronization task log table) is queried based on the table identifier of the target table and the area identifier of the storage data partition, and the statement execution end time of each data processing statement that has a modification identifier in the data modification list is extracted. For example, the "data synchronization task log table" is checked using the table name t1: if the t1 table is not involved in any data synchronization task, no processing is required; if a related task is found (i.e., a data processing statement with a modification identifier exists), the table name, the data partition, and the execution end time of the SQL statement are recorded in the "data modification record list table".
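The lookup-then-record step can be sketched as below. A set of table names stands in for the "data synchronization task log table" and a list stands in for the "data modification record list table"; both structures, the field names, and the function name are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical stand-in: tables that have a configured data synchronization task.
sync_task_tables = {"t1"}
# Hypothetical stand-in for the "data modification record list table".
modification_records = []


def record_modification(table, partition, finished_at):
    """Record a change only if the table is involved in a synchronization task."""
    if table not in sync_task_tables:
        return False  # no related task: skip without processing
    modification_records.append({
        "table": table,
        "partition": partition,
        "finished_at": finished_at,
        "done": False,  # set to True once the change is confirmed as synchronized
    })
    return True


recorded = record_modification("t1", "2021-01-01", datetime(2021, 1, 2, 1, 0, 0))
skipped = record_modification("t9", "2021-01-01", datetime(2021, 1, 2, 1, 0, 0))
```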
In this embodiment, fig. 2 is a flowchart of an optional parsing SQL statement according to an embodiment of the present invention, and as shown in fig. 2, the method includes the following steps: reading SQL sentences of a Kafka queue, screening the SQL sentences, judging the execution state, skipping and not processing the sentences which are not successfully executed, analyzing the types of the sentences which are successfully executed, and dividing the SQL sentences into 2 types according to the correlation with data change: (1) the statement is irrelevant to data change, namely the statement which does not cause table data change after SQL is executed, such as: setting a queue, setting a use library database, empowerment, query statement and single table-building SQL statement; (2) statements related to data changes, i.e. SQL executes and may cause changes to table data, such as: insert, delete, create select SQL statements.
Statements irrelevant to data change are skipped directly without processing. SQL statements related to data change are analyzed further, and the table name and the data partition of the target table operated on are extracted according to the syntax structure. For example, for the SQL statement: insert into t1 partition (pt_dt='2021-01-01') select xxx from t2, the table name t1 and the data partition 2021-01-01 are extracted.
The "data synchronization task log table" is then checked using the table name t1 to determine whether a data synchronization task exists. If no data synchronization task exists for table t1, the statement is skipped without processing; if one exists, the table name, the data partition, and the execution end time of the SQL statement are recorded in the "data modification record list table".
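The classification and extraction steps can be illustrated with a small sketch. It is an assumption-laden simplification: it uses regular expressions rather than a real SQL parser, it only extracts the target of insert statements with a single quoted partition value, and the function names `is_data_change` and `extract_target` are hypothetical.

```python
import re


def is_data_change(sql):
    """Return True if the statement may change table data (simplified rule)."""
    s = sql.strip().lower()
    if s.startswith(("insert ", "delete ")):
        return True
    # "create ... select" writes data; a plain "create table" does not.
    return s.startswith("create ") and re.search(r"\bselect\b", s) is not None


# Matches "insert into|overwrite [table] <name> [partition (<col>='<value>')]".
_INSERT_TARGET = re.compile(
    r"insert\s+(?:into|overwrite)\s+(?:table\s+)?(?P<table>[\w.]+)"
    r"(?:\s+partition\s*\(\s*\w+\s*=\s*'(?P<partition>[^']*)'\s*\))?",
    re.IGNORECASE,
)


def extract_target(sql):
    """Return (table, partition) for an insert statement, or None."""
    m = _INSERT_TARGET.search(sql)
    if m is None:
        return None
    return m.group("table"), m.group("partition")


target = extract_target(
    "insert into t1 partition (pt_dt='2021-01-01') select xxx from t2"
)
```

Here `target` is the pair of table name t1 and partition 2021-01-01 from the example statement. A production implementation would more likely rely on a proper SQL parser, since regular expressions do not cover multi-column partitions, comments, or nested statements.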
Step S103, the file updating time of each data file under the target table and the stored data partition is compared.
Optionally, the step of comparing the file update time of each data file under the target table and the storage data partition includes: acquiring the file synchronization time of each data file under the target table and the storage data partition; in a case that the statement execution end time is less than or equal to the file synchronization time, confirming that the data processing statement with the modification identifier has already been synchronized; and in a case that the statement execution end time is greater than the file synchronization time, setting the statement execution end time as the file update time.
In the embodiment of the present invention, the time of the last synchronized data file for the table and partition (e.g., partition pt1 of table t1), i.e., the file synchronization time of each data file under the target table and the storage data partition, is acquired from the "data synchronization task log table". When the statement execution end time is less than or equal to the file synchronization time, it is confirmed that the data processing statement with the modification identifier has already been synchronized (i.e., the corresponding record in the "data modification record list table" is set as completed). When the statement execution end time is greater than the file synchronization time, the statement execution end time is set as the file update time.
Optionally, after comparing the file update time of each data file in the target table and the storage data partition, the method further includes: comparing the file updating time with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result; and when the file updating time is different from the maximum deadline in the data synchronization task log table, determining that the file updating time triggers a data synchronization instruction.
In the embodiment of the invention, the file update time is compared with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result. When the file update time is equal to the maximum deadline, the data is already the latest data and no repeated synchronization is needed, so the corresponding record in the "data modification record list table" is set as completed. When the file update time differs from the maximum deadline in the data synchronization task log table, it is determined that the file update time triggers a data synchronization instruction, i.e., the data has been updated and needs to be synchronized, and the data synchronization instruction is sent.
In this embodiment, fig. 3 is a flowchart of an optional synchronous task scheduling method according to an embodiment of the present invention. As shown in fig. 3, the method includes the following steps: taking the pending data modification records of the same table and partition (for example, partition pt1 of table t1) from the "data modification record list table", merging the records, and taking the maximum SQL statement execution end time (denoted as lastsqlexecuttime); then obtaining the time of the last synchronized data file (denoted as lasttreplitadatatime) of the table and partition from the "data synchronization task log table"; and then comparing lastsqlexecuttime with lasttreplitadatatime:
(1) when lastsqlexecuttime is less than or equal to lasttreplitadatatime, the merged records in the "data modification record list table" are set as completed;
(2) when lastsqlexecuttime is greater than lasttreplitadatatime, the cluster is accessed to read the latest update time of the data files under the partition of table t1 (denoted as lastFileDataTime), and lastFileDataTime is compared with lasttreplitadatatime: when lastFileDataTime = lasttreplitadatatime, the data is already the latest data and no repeated synchronization is needed, so the merged records in the "data modification record list table" are set as completed; when lastFileDataTime <> lasttreplitadatatime (where "<>" indicates that the two are different), the table data has been updated and needs to be synchronized, so a data synchronization request is sent.
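The two-stage comparison in the scheduling flow can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the function name `decide_sync`, its parameters, and the string results "done" and "sync" are hypothetical, and the cluster access is modeled as a callable so that it is only invoked when the first comparison does not already settle the outcome.

```python
from datetime import datetime


def decide_sync(pending_end_times, last_replica_time, read_last_file_time):
    """Decide whether one table partition needs synchronization.

    pending_end_times: SQL end times of the pending records for the partition.
    last_replica_time: time of the last synchronized data file (from the log table).
    read_last_file_time: callable returning the latest file update time on the cluster.
    """
    # Merge the pending records by keeping only the latest SQL end time.
    last_sql_time = max(pending_end_times)
    if last_sql_time <= last_replica_time:
        return "done"  # every pending statement finished before the last sync
    # Only now access the cluster to read the latest file update time.
    last_file_time = read_last_file_time()
    if last_file_time == last_replica_time:
        return "done"  # files unchanged since the last sync
    return "sync"      # files differ: send a data synchronization request


t = datetime(2021, 1, 2, 10, 0, 0)
already_synced = decide_sync([datetime(2021, 1, 2, 9, 0, 0)], t, lambda: t)
unchanged_files = decide_sync([datetime(2021, 1, 2, 11, 0, 0)], t, lambda: t)
needs_sync = decide_sync(
    [datetime(2021, 1, 2, 11, 0, 0)], t, lambda: datetime(2021, 1, 2, 11, 0, 5)
)
```

Deferring the cluster read to the second stage mirrors the order in the flow above, where the cheaper log-table comparison is made first.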
And step S104, writing the data in the storage data partition with the updated data into the target database under the condition that the file updating time triggers the data synchronization instruction.
Optionally, in a case that the file update time triggers a data synchronization instruction, the step of writing the data under the storage data partition in which update data exists into the target database includes: in a case that the file update time triggers the data synchronization instruction, reading a source cluster, a source table user, a target table and a target table user in the data synchronization instruction; accessing the storage data partition with updated data under the target table of the source cluster by using a first user identifier of the source table user in the source table, so as to read partition data; clearing, from the target table, the data whose partition field corresponds to the storage data partition with the updated data; and writing the partition data into the target database.
In the embodiment of the present invention, in a case that the file update time triggers a data synchronization instruction, a source cluster, a source table user, a target cluster, a target table, and a target table user in the data synchronization instruction are read. The storage data partition in which update data exists under the target table of the source cluster is accessed with the first user identifier of the source table user in the source table to read partition data (for example, partition pt1 of table t1 on the source cluster is accessed as the source table user to read data). Then, the data whose partition field corresponds to the storage data partition with the updated data is cleared from the target table, and the partition data is written into the target database (for example, the data whose partition field is pt1 is cleared from the target table on the target cluster, and the data read from the source cluster is written in).
In this embodiment, the data clearing operation and the write operation may be completed within one transaction to ensure data consistency.
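Completing the clear and the write within one transaction can be sketched as follows. The source does not name a specific relational database; this sketch assumes SQLite through Python's standard `sqlite3` module purely for illustration, and the table layout (`pt_dt`, `value`) and the function name `replace_partition` are hypothetical.

```python
import sqlite3


def replace_partition(conn, partition, rows):
    """Clear one partition in the target table and write the new rows atomically."""
    # Using the connection as a context manager commits on success and
    # rolls back if an exception is raised inside the block.
    with conn:
        conn.execute("DELETE FROM t1 WHERE pt_dt = ?", (partition,))
        conn.executemany(
            "INSERT INTO t1 (pt_dt, value) VALUES (?, ?)",
            [(partition, value) for value in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (pt_dt TEXT, value TEXT)")
conn.execute("INSERT INTO t1 VALUES ('2021-01-01', 'old')")
conn.execute("INSERT INTO t1 VALUES ('2021-01-02', 'keep')")
conn.commit()

replace_partition(conn, "2021-01-01", ["new-a", "new-b"])
result = conn.execute(
    "SELECT pt_dt, value FROM t1 ORDER BY pt_dt, value"
).fetchall()
```

Because the delete and the inserts share one transaction, a reader of the target table sees either the old partition contents or the new ones, and a failed write leaves the old data in place.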
Optionally, after writing the partition data into the target database, the method further includes: acquiring the end time of writing the partition data into the target database to obtain the synchronous end time; and updating the maximum deadline in the data synchronization task log table to be the synchronization end time.
In the embodiment of the present invention, after data synchronization is completed, the end time of writing partition data into the target database is obtained, and then the latest time of the data file synchronized on the cluster is updated to the lasttreplitadatatime field in the "data synchronization task log table" (i.e., the maximum deadline in the data synchronization task log table is updated to the synchronization end time).
In the embodiment of the invention, by acquiring the audit log executed by big data SQL, analyzing SQL sentences in the audit log, extracting a target table and a storage data partition which are likely to have data change, comparing the file change time of the target table and the storage data partition with the change condition of the latest synchronized data, triggering data synchronization, reading the data in the change partition and writing the data in a preset database, the following beneficial effects can be achieved:
(1) The data change condition is analyzed based on the executed SQL recorded in the big data Hadoop audit log, which is decoupled from the flow that initiates the SQL request, so that no additional operation is required of the big data batch processing application.
(2) The monitoring mode of automatically and quickly triggering data synchronization in the big data is provided, the change of the data can be found in real time after the execution of the SQL is finished, and the timeliness is high.
(3) In the data synchronization scheduling process, repeated synchronization of data within a short time can be reduced by performing multiple time comparisons and by merging the executed SQL operations.
Example two
Fig. 4 is a schematic diagram of an optional data synchronization system in which data synchronization is triggered by monitoring data changes based on Hive-executed SQL. As shown in fig. 4, the system includes: an audit log collection module, a modification table analysis module, a synchronous task scheduling module, a data synchronization module, Hive, and a relational database, wherein:
The audit log collection module can be used for collecting execution logs of all SQL statements in Hive, intercepting the SQL statements in the logs, and providing the intercepted SQL statements to the modification table analysis module.
The modification table analysis module can be used for analyzing the received SQL, screening out statements related to table data change, then parsing those SQL statements, and extracting the table name and the partition of the target table.
The synchronous task scheduling module can be used for checking the tables with data change, confirming whether data synchronization is needed, and sending a data synchronization request to the data synchronization module if it is needed.
The data synchronization module can be used for reading data under the changed partition from Hive and then writing the data into the relational database.
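How the four modules hand work to one another can be sketched with plain callables. This is only an illustration of the data flow, assuming each module is reduced to a single function; the `Pipeline` class, its field names, and the stub implementations are hypothetical and stand in for Hive, the scheduler, and the relational database.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

Change = Tuple[str, str]  # (table name, partition)


@dataclass
class Pipeline:
    collect: Callable[[], Iterable[str]]        # audit log collection module
    analyze: Callable[[str], Optional[Change]]  # modification table analysis module
    needs_sync: Callable[[Change], bool]        # synchronous task scheduling module
    synchronize: Callable[[Change], None]       # data synchronization module

    def run_once(self) -> List[Change]:
        """Pass every collected statement through the remaining modules."""
        synced = []
        for sql in self.collect():
            change = self.analyze(sql)
            if change is None:
                continue  # statement unrelated to data change
            if self.needs_sync(change):
                self.synchronize(change)
                synced.append(change)
        return synced


def analyze(sql):
    # Stub: treat any insert as a change to partition 2021-01-01 of table t1.
    return ("t1", "2021-01-01") if sql.lower().startswith("insert") else None


written = []
pipeline = Pipeline(
    collect=lambda: [
        "use db1",
        "insert into t1 partition (pt_dt='2021-01-01') select xxx from t2",
    ],
    analyze=analyze,
    needs_sync=lambda change: True,
    synchronize=written.append,
)
synced = pipeline.run_once()
```

Keeping each module behind a narrow callable interface reflects the decoupling described above: the collection side never needs to know how synchronization is performed.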
In the embodiment of the invention, by using the data synchronization system in which data synchronization is triggered by monitoring data changes based on Hive-executed SQL, the audit log of big data SQL execution can be collected, the SQL statements in the audit log are analyzed, the target table and the storage data partition in which a data change may have occurred are extracted, the file change time of the target table and the storage data partition is compared with the state of the most recently synchronized data, data synchronization is triggered, and the data in the changed partition is read and written into a preset database.
Example three
The data synchronization apparatus provided in this embodiment includes a plurality of implementation units, and each implementation unit corresponds to each implementation step in the first embodiment.
Fig. 5 is a schematic diagram of an alternative data synchronization apparatus according to an embodiment of the present invention, and as shown in fig. 5, the synchronization apparatus may include: an acquisition unit 50, an extraction unit 51, a comparison unit 52, a writing unit 53, wherein,
the acquisition unit 50 is configured to collect an audit log of a preset distributed cluster system, where each data processing statement in the audit log corresponds to a statement type, and the statement types include: data change occurs and data change does not occur;
an extracting unit 51 for extracting a target table operated by a data processing statement in which a data change occurs and a storage data partition;
a comparison unit 52 for comparing the file update times of the target table and the respective data files under the stored data partition;
and a writing unit 53, configured to write data in the storage data partition in which the update data exists into the target database, in a case where the file update time triggers the data synchronization instruction.
The synchronization device can collect the audit log of the preset distributed cluster system through the acquisition unit 50, extract the target table and the storage data partition operated on by the data processing statement with a data change through the extraction unit 51, compare the file update time of each data file under the target table and the storage data partition through the comparison unit 52, and write the data under the storage data partition with updated data into the target database through the writing unit 53 in a case that the file update time triggers the data synchronization instruction. In the embodiment of the invention, the target table and the storage data partition in which data has changed can be extracted, and the file update time of each data file under the target table and the storage data partition can be compared, so that the data under the storage data partition with updated data can be written into the target database in time. This improves the timeliness of triggering data synchronization, makes the process fully automatic, and solves the technical problem in the related art that data synchronization cannot be triggered automatically when data changes.
Optionally, the synchronization apparatus further includes: the first segmentation module is used for segmenting the audit log according to a preset format after the audit log of the preset distributed cluster system is collected, so as to obtain statement identification of each data processing statement; the first acquisition module is used for acquiring execution logs of all data processing sentences in a preset data warehouse based on the sentence marks; a first intercepting module, configured to intercept execution content of each data processing statement in the execution log, where the execution content at least includes: log completion time; the first writing module is used for sequentially writing the data processing statements into a preset sorting queue according to the sequence of the log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.
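The splitting and ordering performed by these modules can be sketched as follows. The source does not specify the preset log format, so this sketch assumes a hypothetical pipe-delimited line of statement identifier, completion time, execution status, and SQL text; it also uses an in-process `queue.Queue` in place of the sorting queue, and the function names are illustrative.

```python
from datetime import datetime
from queue import Queue


def parse_execution_line(line):
    """Split one log line in the assumed 'id|completion time|status|sql' format."""
    statement_id, completed_at, status, sql = line.split("|", 3)
    return {
        "statement_id": statement_id,
        "completed_at": datetime.fromisoformat(completed_at),
        "status": status,
        "sql": sql,
    }


def build_sorted_queue(lines):
    """Write statements into a queue in order of log completion time."""
    entries = sorted(
        (parse_execution_line(line) for line in lines),
        key=lambda entry: entry["completed_at"],
    )
    queue = Queue()
    for entry in entries:
        queue.put(entry)
    return queue


lines = [
    "q2|2022-02-14 10:00:05|SUCCESS|insert into t1 partition (pt_dt='2021-01-01') select xxx from t2",
    "q1|2022-02-14 10:00:01|SUCCESS|use db1",
]
queue = build_sorted_queue(lines)
order = [queue.get()["statement_id"] for _ in range(queue.qsize())]
```

Consuming the queue in completion-time order means a later statement against the same partition is always analyzed after an earlier one, which is what the statement-type analysis relies on.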
Optionally, the synchronization apparatus further includes: the first query module is used for querying a data modification list to be processed based on a table identifier of a target table and an area identifier of a storage data partition after the target table and the storage data partition operated on by a data processing statement with data change are extracted; and the first extraction module is used for extracting the statement execution end time of the data processing statement with the modification identifier in the data modification list.
Optionally, the comparing unit includes: the first acquisition module is used for acquiring the file synchronization time of each data file under the target table and the storage data partition; the first determining module is used for determining that the data processing statement with the modification identifier has already been synchronized in a case that the statement execution end time is less than or equal to the file synchronization time; and the first setting module is used for setting the statement execution end time as the file update time in a case that the statement execution end time is greater than the file synchronization time.
Optionally, the synchronization apparatus further includes: the first comparison module is used for comparing the file updating time with the maximum deadline in the data synchronization task log table after comparing the file updating time of each data file in the target table and the stored data partition to obtain a deadline comparison result; and the first triggering module is used for confirming that the file updating time triggers the data synchronization instruction when the file updating time is different from the maximum deadline in the data synchronization task log table.
Optionally, the writing unit includes: the first reading module is used for reading a source cluster, a source table user, a target table and a target table user in a data synchronization instruction under the condition that the data synchronization instruction is triggered at the file updating time; the first access module is used for accessing a storage data partition with updated data under a target table of a source cluster by adopting a first user identifier of a source table user in the source table so as to read partition data; the first clearing module is used for clearing the data of the storage data partition with the updated data in the partition field under the target table; and the second writing module is used for writing the partition data into the target database.
Optionally, the synchronization apparatus further includes: the second acquisition module is used for acquiring the end time of the partition data written into the target database after the partition data is written into the target database, so as to obtain the synchronous end time; and the first updating module is used for updating the maximum deadline in the data synchronization task log table to be the synchronization end time.
The above-mentioned synchronization device may further include a processor and a memory, and the above-mentioned acquisition unit 50, the extraction unit 51, the comparison unit 52, the writing unit 53, and the like are all stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement corresponding functions.
The processor comprises a kernel, and the kernel calls a corresponding program unit from the memory. The kernel can be set to be one or more, and the kernel parameters are adjusted to write the data in the storage data partition with the updated data into the target database under the condition that the data synchronization instruction is triggered at the file updating time.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program that initializes the following method steps: collecting an audit log of a preset distributed cluster system, extracting a target table and a storage data partition operated on by a data processing statement with data change, comparing the file update time of each data file under the target table and the storage data partition, and writing the data under the storage data partition with updated data into a target database in a case that the file update time triggers a data synchronization instruction.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, which includes a stored computer program, wherein when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to execute the above-mentioned data synchronization method.
According to another aspect of embodiments of the present invention, there is also provided an electronic device, including one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data synchronization method described above.
Fig. 6 is a block diagram of a hardware structure of an electronic device (or mobile device) for a data synchronization method according to an embodiment of the present invention. As shown in fig. 6, the electronic device may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data. In addition, the electronic device may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a keyboard, a power supply, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 6 is only an illustration and is not intended to limit the structure of the electronic device. For example, the electronic device may also include more or fewer components than shown in fig. 6, or have a different configuration than shown in fig. 6.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method of data synchronization, comprising:
collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement type comprises: data change occurs and data change does not occur;
extracting a target table operated by the data processing statement with data change and a storage data partition;
comparing the file update time of each data file under the target table and the storage data partition;
and under the condition that the file updating time triggers a data synchronization instruction, writing the data in the storage data partition with the updated data into the target database.
2. The method of claim 1, after collecting the audit log of the preset distributed cluster system, further comprising:
dividing the audit log according to a preset format to obtain a statement identifier of each data processing statement;
acquiring execution logs of all data processing statements in a preset data warehouse based on the statement identification;
intercepting the execution content of each data processing statement in the execution log, wherein the execution content at least comprises: log completion time;
and writing the data processing statements into a preset sorting queue in sequence according to the sequence of the log completion time, wherein the preset sorting queue is used for sequentially analyzing the statement type of each data processing statement.
3. The method according to claim 1, further comprising, after extracting the target table and the storage data partition operated by the data processing statement in which the data change occurs:
inquiring a data modification list to be processed based on the table identifier of the target table and the area identifier of the storage data partition;
and extracting the statement execution end time of the data processing statement with the modification identifier in the data modification list.
4. The method of claim 3, wherein the step of comparing the target table with the file update times of the respective data files under the stored data partition comprises:
acquiring file synchronization time of each data file in the target table and the storage data partition;
under the condition that the statement execution ending time is less than or equal to the file synchronization time, confirming that the data processing statement with the modification identifier has already been synchronized;
and setting the statement execution ending time as the file updating time under the condition that the statement execution ending time is greater than the file synchronization time.
5. The method of claim 4, further comprising, after comparing the file update times of the target table and the respective data files under the stored data partition:
comparing the file updating time with the maximum deadline in the data synchronization task log table to obtain a deadline comparison result;
and when the file updating time is different from the maximum deadline in the data synchronization task log table, determining that the file updating time triggers a data synchronization instruction.
6. The method according to claim 1, wherein in the case that the file update time triggers a data synchronization instruction, the step of writing data under the storage data partition where the update data exists into the target database comprises:
under the condition that the file updating time triggers a data synchronization instruction, reading a source cluster, a source table user, a target table and a target table user in the data synchronization instruction;
accessing a storage data partition with updated data under a target table of the source cluster by adopting a first user identifier of the source table user in the source table to read partition data;
clearing, from the target table, the data whose partition field corresponds to the storage data partition with the updated data;
and writing the partition data into a target database.
7. The method of claim 6, after writing the partition data to the target database, further comprising:
acquiring the end time of writing the partition data into the target database to obtain the synchronous end time;
and updating the maximum deadline in the data synchronization task log table as the synchronization end time.
8. A data synchronization apparatus, comprising:
the acquisition unit is used for collecting an audit log of a preset distributed cluster system, wherein each data processing statement in the audit log corresponds to a statement type, and the statement types comprise: data change occurs and data change does not occur;
the extraction unit is used for extracting a target table operated by the data processing statement with data change and a storage data partition;
the comparison unit is used for comparing the file updating time of each data file under the target table and the storage data partition;
and the writing unit is used for writing the data in the storage data partition with the updated data into the target database under the condition that the file updating time triggers the data synchronization instruction.
9. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the data synchronization method of any one of claims 1 to 7.
10. An electronic device comprising one or more processors and memory storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data synchronization method of any of claims 1-7.
Application Number: CN202210134131.3A
Priority Date: 2022-02-14
Filing Date: 2022-02-14
Title: Data synchronization method and device, electronic equipment and storage medium
Status: Pending
Publication Number: CN114490554A (en)
Publication Date: 2022-05-13
Family ID: 81480844
Country: CN

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936212A (en) * 2022-07-26 2022-08-23 北京安华金和科技有限公司 Audit data synchronous processing method and device
CN114936212B (en) * 2022-07-26 2022-09-23 北京安华金和科技有限公司 Audit data synchronous processing method and device
CN116991331A (en) * 2023-09-25 2023-11-03 苏州元脑智能科技有限公司 Log file storage method and device, storage medium and electronic device
CN116991331B (en) * 2023-09-25 2024-01-26 苏州元脑智能科技有限公司 Log file storage method and device, storage medium and electronic device

CN111651531A (en) Data import method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination