CN115827777A - Self-adaptive synchronization and difference identification method, device and equipment for multiple data sources

Info

Publication number: CN115827777A
Application number: CN202211458209.3A
Authority: CN
Inventors: 吴林峰; 谢家宝; 熊施; 严光兵; 顾振华; 文艺
Original assignee: Peoples Insurance Company of China
Current assignee: Peoples Insurance Company of China
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2023-03-21

Abstract

The invention discloses a method, a device and equipment for self-adaptive synchronization and difference identification of multiple data sources, wherein the method comprises the following steps: acquiring a logic log from a source end database server, analyzing the logic log into a json message, and sending the json message to a message queue; monitoring a message queue, pulling a json message from the message queue and persisting the json message to the bottom layer of a destination database; acquiring a base table in the bottom layer of a target end database, and configuring the base table through a defined configuration file; and reading the configuration file by setting a timing task, and performing consistency comparison and difference identification on the base table according to conditions in the configuration file to obtain a comparison result and an identification result.

Description

Self-adaptive synchronization and difference identification method, device and equipment for multiple data sources

Technical Field

The invention relates to the field of databases, in particular to a method, a device and equipment for self-adaptive synchronization and difference identification of multiple data sources.

Background

In a big data environment, a system synchronizes data from multiple related systems in daily increments, involving multiple data sources and upwards of a thousand tables. Traditional database synchronization generally uses timed synchronization, i.e., synchronization is performed at a fixed time point or at a fixed time interval.

Traditional database synchronization generally has the following defects:

1. It cannot adapt to structural changes at the source end. When the structure of the source end changes, the destination end must be changed in step, so adaptive synchronization of data works poorly.

2. Synchronization is delayed, the amount of computation and the resource overhead are large, and bottlenecks appear.

3. It cannot adapt to scenarios with multiple data sources and multiple data domains. The system has dozens of data sources to connect to, and the total number of data tables can reach the thousands; using the traditional scheme would waste a large amount of manpower.

4. It is relatively slow. The traditional scheme is efficient when processing hundreds of thousands of records, but the system synchronizes and compares hundreds of millions of records each day, and at that data volume the comparison efficiency is poor.

Disclosure of Invention

The invention provides a method, a device and equipment for self-adaptive synchronization and difference identification of multiple data sources, which ensure the reliability and the synchronization efficiency of data synchronization.

A self-adaptive synchronization and difference identification method for multiple data sources comprises the following steps:

acquiring a logic log from a source end database server, analyzing the logic log into a json message, and sending the json message to a message queue;

monitoring the message queue, pulling a json message from the message queue and persisting the json message to the bottom layer of a destination database;

acquiring a base table in the bottom layer of the target end database, and configuring the base table through a defined configuration file;

and reading the configuration file by setting a timing task, and performing consistency comparison and difference identification on the base table according to conditions in the configuration file to obtain a comparison result and an identification result.

In an embodiment of the present invention, acquiring a logical log from a source end database server, parsing the logical log into a json message, and sending the json message to a message queue specifically includes: deploying a logical log analysis tool on the database server at the source end; reading, by the log analysis tool, the logical log starting from the current latest log number or from a specified log number; cutting each logical log into smaller log blocks for analysis; parsing each log block into a JavaScript Object Notation (json) message, wherein the json message includes the database, the data table, and the specific data; and sending the json message to a message queue.

In an embodiment of the present invention, monitoring the message queue, pulling a json message from the message queue, and persisting the json message to the bottom layer of a destination database specifically includes: starting a listener to monitor a specified message queue, and pulling data from the message queue at a preset fixed frequency; filtering the pulled json messages according to preset database and data table information, and retaining the needed json messages; mapping the needed json messages to java objects, and merging java objects of the same kind; and establishing a destination database connection, storing the merged json messages in the bottom layer of the destination database, and releasing the connection.

In an embodiment of the present invention, performing consistency comparison on the base table according to the conditions in the configuration file to obtain a comparison result specifically includes: reading the base tables and the query conditions in the configuration file, computing the data volume and the summary value of each base table, and recording the data volume and the summary value; comparing the data volume and the summary value of the base tables at the source end and the destination end to obtain a statistical result; determining the consistency of the statistical results, and recording the base tables with inconsistent statistical results for identifying difference data; and if the statistical results of all the base tables are consistent, ending the current flow.

In an embodiment of the present invention, performing difference identification on the base table according to the conditions in the configuration file to obtain an identification result specifically includes: according to the consistency of the statistical results, exporting the detail data of the base tables that differ into the same base table to obtain a detail difference base table; and identifying detail differences at large data volumes in the detail difference base table through Spark SQL (Spark Structured Query Language) to find the inconsistent detail data.

In an embodiment of the present invention, performing consistency comparison and difference identification on the base table according to the conditions in the configuration file to obtain a comparison result and an identification result specifically includes: sequentially executing the statistics script of each base table at the source end and at the destination end according to the timed task, and computing the total count and the total amount of each base table; writing the total count and the total amount computed at the source end and at the destination end into the corresponding statistics files, and judging whether the statistics files of the source end and the destination end are consistent; if they are inconsistent, exporting, according to the statistics files, the primary keys and the amount fields of the base tables whose total count or total amount is inconsistent to a detail difference base table; importing the detail difference base table into the destination end database; and accessing the destination end database through Spark SQL, counting and identifying the data with differences in the detail difference base table, and sending the data to the relevant personnel for processing.

In an embodiment of the present invention, after persisting the json packet to a bottom layer of a destination-side database, the method further includes: and structuring the json message through a relational database, mapping the data in the json message into a table structure, and operating the data at the bottom layer of the target end database by using SQL.

An adaptive synchronization and difference identification apparatus for multiple data sources, comprising:

the synchronization module is used for acquiring a logic log from a source end database server, analyzing the logic log into a json message and sending the json message to a message queue; monitoring the message queue, pulling a json message from the message queue and persisting the json message to the bottom layer of a destination database;

the configuration module is used for acquiring a base table in the bottom layer of the destination database and configuring the base table through a defined configuration file;

and the consistency checking module is used for reading the configuration file by setting a timing task, and performing consistency comparison and difference identification on the base table according to conditions in the configuration file to obtain a comparison result and an identification result.

An adaptive synchronization and difference identification device for multiple data sources, comprising:

at least one processor; and

a memory communicatively coupled with the at least one processor via a bus; wherein

the memory stores instructions executable by the at least one processor to perform:

A non-volatile storage medium storing computer-executable instructions for execution by a processor to perform the steps of:

The invention provides a method, a device and equipment for self-adaptive synchronization and difference identification of multiple data sources, which have at least the following beneficial effects: the scheme automates the process and makes it self-adaptive, ensures reliability, greatly improves efficiency, and saves a large amount of manpower and time in scenarios involving multiple systems, thousands of tables, and hundreds of millions of records. The service system monitors a message queue, pulls json messages for consumption, and persists them to the bottom layer of a database, which unifies the data format and improves log synchronization efficiency; computing statistics for each data table and running the difference check only on the tables whose statistics differ increases the speed of the difference check.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic diagram of the steps of an adaptive synchronization and difference identification method for multiple data sources according to an embodiment of the present invention;

FIG. 2 is an execution flow chart of an application scenario according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an adaptive synchronization and difference identification apparatus for multiple data sources according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an adaptive synchronization and difference identification device for multiple data sources according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described clearly and fully below with reference to the following embodiments. It is to be understood that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.

It is to be understood that one of ordinary skill in the art may, explicitly or implicitly, combine the described embodiments of the present invention with other embodiments where no conflict arises. Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by one of ordinary skill in the art to which this invention belongs. The terms "a," "an," "the" and similar referents used in describing the invention (including in the specification and claims) do not denote a limitation of quantity and may refer to the singular or the plural. The terms "comprises," "comprising," "includes," "including," "has," "having" and any variations thereof, as used in the present invention, are intended to cover non-exclusive inclusions; the terms "first," "second," "third," and the like are used merely to distinguish between similar objects and do not necessarily represent a particular ordering of the objects.

In the current big data environment, the business system of each unit needs to synchronize data from multiple related source end systems in daily increments, involving multiple data sources and several thousand tables. The table structure of the source system changes frequently, so the business system needs to shield the differences in table structure when storing the data, and to check whether the data synchronized each day is complete. To meet these requirements, the invention provides a verification scheme that can verify whether the data synchronized each day is complete. The verification scheme is substantially as follows: 1. After the data is synchronized, the tool logs in to the database servers of the business system and of the related system respectively. 2. The data statistics of the corresponding base tables in the business system and in the related system are calculated according to the configured statistical indicators. 3. The second step is repeated until statistics have been completed for all base tables. 4. According to the statistical results, for the base tables that differ, the detail data is further obtained with a related tool and the detail lists are compared to identify the specific field differences, finally yielding a data comparison conclusion and a list of difference details. 5. The comparison differences are then handed over to development and operation and maintenance personnel to troubleshoot and resolve. The present invention is described in detail below.

Fig. 1 is a schematic step diagram of an adaptive synchronization and difference identification method for multiple data sources according to an embodiment of the present invention, which may include the following steps:

S110: Acquire a logical log from a source end database server, parse the logical log into a json message, and send the json message to a message queue.

In an embodiment of the present invention, acquiring a logical log from a source end database server, parsing the logical log into a json message, and sending the json message to a message queue specifically includes: deploying a logical log analysis tool on the database server at the source end; reading, by the log analysis tool, the logical log starting from the current latest log number or from a specified log number; cutting each logical log into smaller log blocks for analysis; parsing each log block into a JavaScript Object Notation (json) message, wherein the json message includes the database, the data table, and the specific data; and sending the json message to a message queue.

Specifically, a synchronization module synchronizes data between the source end and the destination end. The synchronization module has two main functions, production and consumption: production parses the logical log and produces the json message, and consumption stores the json message in the database.

Production: a logical log analysis tool is deployed on the database server at the source end; it parses the logical log into a more readable json message and sends the json message to a message queue.

The log analysis tool reads the logical log starting from the current latest log number or from a specified log number, and cuts each log into smaller log blocks for analysis. The log is quickly parsed into a json format message that includes the database, the data table, the specific data, and so on; the format is as follows:

The parsed json message is sent to the message queue to await consumption by the system.
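The production step described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the log record layout, the json field names, or the message queue product, so the record tuple, the fields `log_no`, `database`, `table`, `op` and `data`, the block size, and the in-process `queue.Queue` standing in for the message queue are all hypothetical.

```python
import json
import queue
from typing import Iterable, Iterator

# Hypothetical pre-decoded log records: (log_no, database, table, op, row).
LogRecord = tuple[int, str, str, str, dict]


def split_into_blocks(records: list[LogRecord], block_size: int) -> Iterator[list[LogRecord]]:
    """Cut one logical log into smaller blocks so each can be parsed separately."""
    for start in range(0, len(records), block_size):
        yield records[start:start + block_size]


def parse_block(block: Iterable[LogRecord]) -> list[str]:
    """Turn each record of a block into a json message (field names are assumed)."""
    return [
        json.dumps(
            {"log_no": log_no, "database": db, "table": table, "op": op, "data": row},
            sort_keys=True,
        )
        for log_no, db, table, op, row in block
    ]


def produce(records: list[LogRecord], start_log_no: int,
            mq: "queue.Queue[str]", block_size: int = 2) -> int:
    """Read from a given log number onward, parse block by block, and enqueue."""
    pending = [r for r in records if r[0] >= start_log_no]
    sent = 0
    for block in split_into_blocks(pending, block_size):
        for message in parse_block(block):
            mq.put(message)  # stand-in for sending to the real message queue
            sent += 1
    return sent
```

Passing the latest known log number as `start_log_no` corresponds to reading from the current latest log number, while any earlier value corresponds to reading from a specified log number.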

S120: Monitor the message queue, pull the json message from the message queue, and persist the json message to the bottom layer of a destination database.

In an embodiment of the present invention, monitoring a message queue, pulling a json message from the message queue, and persisting the json message to the bottom layer of a destination database specifically includes: starting a listener to monitor a specified message queue, and pulling data from the message queue at a preset fixed frequency; filtering the pulled json messages according to preset database and data table information, and retaining the needed json messages; mapping the needed json messages to java objects, and merging java objects of the same kind; and establishing a destination database connection, storing the merged json messages in the bottom layer of the destination database, and releasing the connection.

In an embodiment of the present invention, after persisting the json packet to the bottom layer of the target-end database, the json packet is structured by using the relational database, data in the json packet is mapped to a table structure, and SQL is used to operate on data in the bottom layer of the target-end database.
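As an illustration of mapping json data into a table structure, the sketch below uses Python's built-in `sqlite3` as a stand-in for the relational database. The message fields `table` and `data`, the assumption that every row carries an `id` key, and the choice to add a column whenever a message introduces a new field are hypothetical; the last of these is one possible way to absorb source-side structure changes, not a mechanism stated in the source.

```python
import json
import sqlite3


def ensure_columns(conn: sqlite3.Connection, table: str, row: dict) -> None:
    """Create the table if needed and add any column the message introduces."""
    # Identifiers are interpolated directly, so they must come from trusted input.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (id INTEGER PRIMARY KEY)')
    existing = {info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')}
    for column in row:
        if column not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}"')


def upsert_message(conn: sqlite3.Connection, message: str) -> None:
    """Map one json message onto a row of the table it names."""
    payload = json.loads(message)
    table, row = payload["table"], payload["data"]
    ensure_columns(conn, table, row)
    columns = ", ".join(f'"{c}"' for c in row)
    marks = ", ".join("?" for _ in row)
    # INSERT OR REPLACE overwrites the whole row that has the same primary key.
    conn.execute(
        f'INSERT OR REPLACE INTO "{table}" ({columns}) VALUES ({marks})',
        list(row.values()),
    )
```

Once the data is in table form, ordinary SQL, including multi-table joins, can be run against it.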

Consumption: the system monitors the message queue, pulls json messages for consumption, and persists them to the bottom layer of the database. The specific steps are as follows:

(1) Monitoring: start a listener, designate the message queue to listen to, and pull data from the message queue at a fixed frequency.

(2) Filtering: filter the pulled json messages according to the database, data table, and other information they carry, retaining only the required data.

(3) Mapping: map each json message to the correct java object.

(4) Classification: merge operations of the same kind and process them together.

(5) Warehousing: establish a destination database connection, store the json message in the bottom layer of the destination database, and release the connection.

(6) Usage: the data can be used in two ways:

(1) Using the unstructured json data directly: single-table operations are very convenient, but support for multi-table join operations is insufficient.

(2) Structuring the json data with a relational database (MySQL or PostgreSQL): the data is mapped into a table structure, and the underlying data is operated on with SQL.
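The consumption steps can be sketched as one polling cycle. The source maps messages to java objects; this sketch uses a Python dataclass in the same role. The in-process `queue.Queue`, the `sqlite3` destination, the `WANTED` filter, the `ChangeEvent` fields, and the `raw_messages` table are all hypothetical stand-ins.

```python
import json
import queue
import sqlite3
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical filter: only messages for these (database, table) pairs are kept.
WANTED = {("policy_db", "policy")}


@dataclass(frozen=True)
class ChangeEvent:
    """Typed object that a json message is mapped to (field names are assumed)."""
    database: str
    table: str
    op: str
    data: dict


def pull(mq: "queue.Queue[str]", max_messages: int) -> list[dict]:
    """Pull up to max_messages from the queue without blocking."""
    pulled = []
    while len(pulled) < max_messages:
        try:
            pulled.append(json.loads(mq.get_nowait()))
        except queue.Empty:
            break
    return pulled


def consume_once(mq: "queue.Queue[str]", db_path: str, max_messages: int = 100) -> int:
    """Run one polling cycle and return the number of messages persisted."""
    kept = [m for m in pull(mq, max_messages) if (m["database"], m["table"]) in WANTED]
    events = [ChangeEvent(m["database"], m["table"], m["op"], m["data"]) for m in kept]
    grouped = defaultdict(list)
    for event in events:
        grouped[(event.table, event.op)].append(event)  # same-kind operations together
    conn = sqlite3.connect(db_path)  # establish the destination connection
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS raw_messages (tbl TEXT, op TEXT, body TEXT)")
        for (table, op), batch in grouped.items():
            conn.executemany(
                "INSERT INTO raw_messages VALUES (?, ?, ?)",
                [(table, op, json.dumps(e.data, sort_keys=True)) for e in batch],
            )
        conn.commit()
    finally:
        conn.close()  # always release the connection
    return len(events)
```

Pulling at a fixed frequency then amounts to calling `consume_once` in a loop with a sleep of the configured interval between calls.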

During synchronization, data may be lost for abnormal reasons, so the two ends need to be compared and checked to determine how the data differs.

S130: Acquire a base table in the bottom layer of the destination end database, and configure the base table through a defined configuration file.

Specifically, an operator writes a shell script and configures the base tables to be checked for consistency, thereby implementing functions such as data volume statistics, comparison, export, and difference identification.
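The source does not give the layout of the configuration file. As a hedged sketch, the example below assumes a json file that lists, for each base table, a query condition, a primary key field, and an amount field, and shows how a count-and-sum statistics query could be built from one entry. The key names and the sample table `policy` are hypothetical, and the source implements this step as a shell script rather than in Python.

```python
import json
from dataclasses import dataclass

# Hypothetical configuration: one entry per base table to be checked.
EXAMPLE_CONFIG = """
{
  "tables": [
    {
      "table": "policy",
      "condition": "sync_date = '2022-11-21'",
      "key_field": "policy_id",
      "amount_field": "premium"
    }
  ]
}
"""


@dataclass(frozen=True)
class TableCheck:
    table: str         # base table to check
    condition: str     # query condition limiting the compared rows
    key_field: str     # primary key exported when a difference is found
    amount_field: str  # numeric field that is summed


def load_checks(config_text: str) -> list[TableCheck]:
    """Parse the configuration text into one TableCheck per configured table."""
    return [TableCheck(**entry) for entry in json.loads(config_text)["tables"]]


def statistics_sql(check: TableCheck) -> str:
    """Build the count-and-sum query that is run at both ends for one table."""
    # Names come from the operator-written configuration, so they are trusted here.
    return (
        f"SELECT COUNT(*), COALESCE(SUM({check.amount_field}), 0) "
        f"FROM {check.table} WHERE {check.condition}"
    )
```

Keeping the table list in a configuration file means a new base table can be brought under the check without changing the script itself.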

S140: Read the configuration file by setting a timed task, and perform consistency comparison and difference identification on the base table according to the conditions in the configuration file, to obtain a comparison result and an identification result.

In an embodiment of the present invention, performing consistency comparison on the base table according to the conditions in the configuration file to obtain a comparison result specifically includes: reading the base tables and the query conditions in the configuration file, computing the data volume and the summary value of each base table, and recording the data volume and the summary value; comparing the data volume and the summary value of the base tables at the source end and the destination end to obtain a statistical result; determining the consistency of the statistical results, and recording the base tables with inconsistent statistical results for identifying difference data; and if the statistical results of all the base tables are consistent, ending the current flow.

In an embodiment of the present invention, performing difference identification on the base table according to the conditions in the configuration file to obtain an identification result specifically includes: according to the consistency of the statistical results, exporting the detail data of the base tables that differ into the same base table to obtain a detail difference base table; and identifying detail differences at large data volumes in the detail difference base table through Spark SQL (Spark Structured Query Language) to find the inconsistent detail data.

Specifically, after the shell script is defined, a crontab timed task is configured to execute the shell script on schedule; executing the shell script performs the following functions.

Data volume statistics: read the base tables and query conditions in the configuration file, compute the data volume and the summary value of each table, and record them.

Comparison: compare the statistical results of the source end system and the destination end system, and record the base tables whose statistical results are inconsistent for the subsequent data export; if the statistical results of all tables are consistent, the process ends.

Export: according to the statistical results obtained by the comparison, export the detail data of the tables that differ into the same base table.

Difference identification: use Spark SQL to identify detail differences at large data volumes, find the inconsistent detail data, and insert inconsistency marks into the database.
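The statistics and comparison functions can be sketched as follows, with two `sqlite3` connections standing in for the source end and destination end databases. The function names, the `(table, amount_field)` pairs, and the default condition `1=1` are hypothetical; the source runs these steps from a shell script.

```python
import sqlite3


def table_statistics(conn: sqlite3.Connection, table: str, amount_field: str,
                     condition: str = "1=1") -> tuple[int, int]:
    """Return (row count, sum of the amount field) for one base table."""
    # Table and field names are assumed to come from trusted configuration.
    row = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM({amount_field}), 0) "
        f"FROM {table} WHERE {condition}"
    ).fetchone()
    return row[0], row[1]


def inconsistent_tables(src: sqlite3.Connection, dst: sqlite3.Connection,
                        checks: list[tuple[str, str]]) -> list[str]:
    """Return the tables whose statistics differ between the two ends."""
    mismatched = []
    for table, amount_field in checks:
        src_stats = table_statistics(src, table, amount_field)
        dst_stats = table_statistics(dst, table, amount_field)
        if src_stats != dst_stats:
            mismatched.append(table)  # recorded for the later detail export
    return mismatched
```

An empty result corresponds to the case where the flow ends. Note that matching counts and sums are a cheap screen rather than proof of row-level equality, since offsetting errors can leave both values unchanged.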

In an embodiment of the present invention, performing consistency comparison and difference identification on the base table according to the conditions in the configuration file to obtain a comparison result and an identification result specifically includes: sequentially executing the statistics script of each base table at the source end and at the destination end according to the timed task, and computing the total count and the total amount of each base table; writing the total count and the total amount computed at the source end and at the destination end into the corresponding statistics files, and judging whether the statistics files of the source end and the destination end are consistent; if they are inconsistent, exporting, according to the statistics files, the primary keys and the amount fields of the base tables whose total count or total amount is inconsistent to a detail difference base table; importing the detail difference base table into the destination end database; and accessing the destination end database through Spark SQL, counting and identifying the data with differences in the detail difference base table, and sending the data to the relevant personnel for processing.
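The detail difference step can be illustrated with a single grouped query. The source runs this step in Spark SQL; the sketch below substitutes Python's built-in `sqlite3` so that it runs without a cluster, and the query uses only standard `CASE`, `GROUP BY`, and `HAVING` constructs. The table `detail_diff`, its columns `side`, `pk`, and `amount`, and the assumption of at most one row per primary key on each side are hypothetical.

```python
import sqlite3

# Rows from both ends are exported into one table, tagged by the side they came from.
DIFF_QUERY = """
SELECT pk,
       MAX(CASE WHEN side = 'src' THEN amount END) AS src_amount,
       MAX(CASE WHEN side = 'dst' THEN amount END) AS dst_amount
FROM detail_diff
GROUP BY pk
HAVING COUNT(CASE WHEN side = 'src' THEN 1 END)
       <> COUNT(CASE WHEN side = 'dst' THEN 1 END)
    OR MAX(CASE WHEN side = 'src' THEN amount END)
       <> MAX(CASE WHEN side = 'dst' THEN amount END)
ORDER BY pk
"""


def load_details(conn: sqlite3.Connection, src_rows: list[tuple[int, int]],
                 dst_rows: list[tuple[int, int]]) -> None:
    """Import the exported primary keys and amounts of both ends into one table."""
    conn.execute("CREATE TABLE detail_diff (side TEXT, pk INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO detail_diff VALUES ('src', ?, ?)", src_rows)
    conn.executemany("INSERT INTO detail_diff VALUES ('dst', ?, ?)", dst_rows)


def find_differences(conn: sqlite3.Connection) -> list[tuple]:
    """Return (pk, src_amount, dst_amount) for rows missing on one side or unequal."""
    return conn.execute(DIFF_QUERY).fetchall()
```

A `None` amount in the result marks a row that exists at only one end, while two different amounts mark a row whose value changed in transit.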

Specifically, FIG. 2 shows the execution flow in one application scenario of the present invention. An operator configures a consistency check script and sets up a crontab timed task for it on both the service platform (the destination end system) and the external system (the source end system). crontab runs the consistency check script on schedule; the script executes the statistics script of each base table in turn, computes the total count and the total amount, and writes the statistical results to a statistics file. Whether the statistical results are consistent is then manually checked. If they are inconsistent, the difference data in the base table is identified: the primary key and amount fields of the inconsistent base tables are exported according to the statistical results and imported into the destination end database, Spark SQL is used to count the data with differences, and development and operation and maintenance personnel then resolve the problems.

Based on the same inventive concept as the foregoing adaptive synchronization and difference identification method for multiple data sources, an embodiment of the present invention further provides a corresponding adaptive synchronization and difference identification apparatus for multiple data sources, as shown in FIG. 3.

The synchronization module 301 is configured to acquire a logical log from a source-end database server, analyze the logical log into a json packet, and send the json packet to a message queue; monitoring a message queue, pulling a json message from the message queue and persisting the json message to the bottom layer of a destination database; a configuration module 302, configured to obtain a base table in a bottom layer of a destination database, and configure the base table through a defined configuration file; and the consistency checking module 303 is configured to read the configuration file by setting the timing task, and perform consistency comparison and difference identification on the base table according to conditions in the configuration file to obtain a comparison result and an identification result.

An embodiment of the invention also provides a corresponding adaptive synchronization and difference identification device for multiple data sources, as shown in FIG. 4.

The embodiment provides an adaptive synchronization and difference identification device for multiple data sources, which comprises:

at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401 via a bus 403; wherein the memory 402 stores instructions executable by the at least one processor, the instructions being executable by the at least one processor 401 to enable the at least one processor 401 to perform:

acquiring a logic log from a source end database server, analyzing the logic log into a json message, and sending the json message to a message queue; monitoring a message queue, pulling a json message from the message queue and persisting the json message to the bottom layer of a database at a target end; acquiring a base table in the bottom layer of a target end database, and configuring the base table through a defined configuration file; and reading the configuration file by setting a timing task, and performing consistency comparison and difference identification on the base table according to conditions in the configuration file to obtain a comparison result and an identification result.

Based on the same idea, some embodiments of the present invention also provide media corresponding to the above method.

Some embodiments of the invention provide a storage medium storing computer-executable instructions for execution by a processor to perform the steps of:

acquiring a logic log from a source end database server, analyzing the logic log into a json message, and sending the json message to a message queue; monitoring a message queue, pulling a json message from the message queue and persisting the json message to the bottom layer of a destination database; acquiring a base table in the bottom layer of a target end database, and configuring the base table through a defined configuration file; and reading the configuration file by setting a timing task, and performing consistency comparison and difference identification on the base table according to conditions in the configuration file to obtain a comparison result and an identification result.

The embodiments of the present invention are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device and media embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference may be made to some descriptions of the method embodiments for relevant points.

The device and the medium provided by the embodiment of the invention correspond to the method one by one, so the device and the medium also have the beneficial technical effects similar to the corresponding method.

It is also noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in the process, method, article, or device comprising the element.

The above are merely examples of the present invention, and are not intended to limit the present invention. Although the invention has been described in detail with respect to the general description and the specific embodiments thereof, it will be apparent to those skilled in the art that modifications and improvements can be made based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. An adaptive synchronization and difference identification method for multiple data sources, comprising:

2. The method of claim 1, wherein the obtaining a logical log from a source-side database server, parsing the logical log into json packets, and sending the json packets to a message queue, specifically comprises:

deploying a logic log analysis tool on a database server at a source end;

reading the logic log according to the current latest log number or the appointed log number by a log analysis tool;

cutting each logic log into smaller log blocks for analysis;

parsing each log block into a JavaScript Object Notation (json) message, wherein the json message comprises the database, the data table and the specific data;

and sending the json message to a message queue.

3. The method according to claim 1, wherein the monitoring the message queue, pulling a json packet from the message queue, and persisting the json packet to a bottom layer of a destination database specifically comprises:

starting a monitor to monitor a specified message queue, and pulling data from the message queue according to a preset fixed frequency;

filtering the pulled json message according to preset database and data table information, and reserving the needed json message;

mapping the needed json messages to java objects, and merging java objects of the same kind;

and establishing a target end database connection, storing the merged json message into a bottom layer of the target end database, and then releasing the connection.

4. The method according to claim 1, wherein the performing consistency comparison on the library table according to the condition in the configuration file to obtain a comparison result specifically includes:

reading the base tables and the query conditions in the configuration file, counting the data volume and the summary value of each base table, and recording the data volume and the summary value;

comparing the data volume and the summary value in the base tables of the source end and the destination end to obtain a statistical result;

determining the consistency of the statistical results, and recording base tables with inconsistent statistical results for identifying difference data;

and if the statistical results of all the base tables are consistent, ending the current flow.

5. The method according to claim 4, wherein the performing difference recognition on the library table according to the condition in the configuration file to obtain a recognition result specifically comprises:

exporting data details of the base tables with the differences to the same base table according to the consistency of the statistical results to obtain the base tables with the detail differences;

and identifying detail differences at large data volumes in the base table with the detail differences by using Spark SQL (Spark Structured Query Language), and finding out the inconsistent detail data.

6. The method according to claim 1, wherein the performing consistency comparison and difference identification on the library table according to the condition in the configuration file to obtain a comparison result and an identification result specifically comprises:

sequentially executing the statistics script of each base table at the source end and at the destination end according to the timed task, and computing the total count and the total amount of each base table;

writing the total count and the total amount computed at the source end and at the destination end into the corresponding statistics files, and judging whether the statistics files of the source end and the destination end are consistent;

if they are inconsistent, exporting, according to the statistics files, the primary keys and the amount fields of the base tables whose total count or total amount is inconsistent to a detail difference base table;

importing the detail difference table into a destination end database;

and accessing the target end database through spark SQL, counting and identifying data with differences in the detail difference database table, and sending the data to related personnel for processing.

7. The method of claim 1, wherein after persisting the json message to the bottom layer of the destination database, the method further comprises:

and structuring the json message through a relational database, mapping the data in the json message into a table structure, and operating the data at the bottom layer of the target end database by using SQL.

8. An adaptive synchronization and difference identification apparatus for multiple data sources, comprising:

9. An adaptive synchronization and difference identification device for multiple data sources, comprising:

at least one processor; and

monitoring the message queue, pulling a json message from the message queue and persisting the json message to a bottom layer of a target-end database;

10. A non-transitory storage medium storing computer-executable instructions, the computer-executable instructions being executable by a processor to perform the steps of: