CN115080666A - Data synchronization method, system, electronic device and storage medium - Google Patents

Info

Publication number
CN115080666A
Authority
CN
China
Prior art keywords
data
preset
synchronization
target
storage medium
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210791558.0A
Other languages
Chinese (zh)
Inventor
许斌
马涛
金丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Information Service Shanghai Co Ltd
Original Assignee
Ctrip Travel Information Service Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Information Service Shanghai Co Ltd
Priority to CN202210791558.0A
Publication of CN115080666A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data synchronization method, a system, an electronic device and a storage medium. The data synchronization method comprises the following steps: constructing data configuration information corresponding to a preset storage medium; matching target data from a preset data source based on the data configuration information; and synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule. By abstracting full synchronization, full comparison and incremental synchronization, the invention forms a simple and easy-to-use data synchronization framework with reserved extension interfaces. Usage scenarios can be extended through configuration, so that rapid iteration of actual business functions can be completed with less development effort and development cost is effectively reduced.

Description

Data synchronization method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data synchronization method, system, electronic device, and storage medium.
Background
ETL (a data transmission technology) refers to the process of extracting, transforming and loading data from a data source end to a target end. DataX is a widely used offline data synchronization ETL tool, but it is not suitable for some service scenarios, such as real-time data processing. On an online travel platform, business travel account data is used throughout the whole reservation process; the core application program interfaces have a high query rate per second and put great pressure (especially read pressure) on the database. Therefore, to improve response speed, reduce database pressure, improve data accuracy and improve service quality, a cache inevitably has to be introduced. In addition, as the amount of account-related data grows, the design of the data structures becomes disordered, and for some special query scenarios a relational database cannot meet the query requirements. The distributed search and data analysis engine Elasticsearch has therefore been put into practice in the business travel system and is widely used for functions such as search and recommendation in each business scenario.
Data synchronization has always been a troublesome part and is discussed repeatedly on various communication platforms. Taking search as an example, a solution is often built for the service scenario at hand and, after a period of service development, can no longer meet current requirements. In practice the situation is more complicated: the number of index fields may range from a few to dozens or more, and the code, database performance and business logic all need to be considered. Assembling data is very tedious work and always involves processing business logic, which greatly increases the workload of developers and lengthens the time needed to complete business functions.
Disclosure of Invention
The invention aims to overcome the defects of data synchronization in the prior art, in which development tasks are complex, repetitive, time-consuming and inefficient and cannot adapt flexibly to various scenarios, by providing a data synchronization method, a data synchronization system, an electronic device and a storage medium.
The invention solves the technical problems through the following technical scheme:
as a first aspect of the present invention, there is provided a data synchronization method including:
constructing data configuration information corresponding to a preset storage medium;
matching to obtain target data from a preset data source based on the data configuration information;
and synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule.
Preferably, different preset storage media correspond to different data configuration information.
Preferably, the step of constructing the data configuration information corresponding to the preset storage medium includes:
judging whether the data configuration information meets a preset configuration condition, if so, executing the step of obtaining target data from a preset data source based on the data configuration information; otherwise, generating prompt information of abnormal configuration.
Preferably, the preset storage medium includes: a distributed search and data analysis engine, a remote dictionary;
and/or,
the preset data source comprises: a data warehouse tool and/or a relational database.
Preferably, when the preset data synchronization rule includes a DB (database) synchronization manner corresponding to the relational database, the DB synchronization manner is to divide the target data based on a dynamically adjusted data division time boundary for synchronous transmission;
or,
when the preset data synchronization rule comprises a Hive (data warehouse tool) synchronization mode of the data warehouse tool, the Hive synchronization mode is to divide the target data based on a multithreading technology to synchronously transmit the target data.
Preferably, when the DB synchronization mode is adopted, the step of obtaining target data from a preset data source based on the data configuration information includes:
acquiring a time field of the relational database, wherein the time field comprises a start boundary time and an end boundary time, and the start boundary time and the end boundary time divide the relational database into different data blocks;
comparing the data volume of the time field with a preset threshold, and if the data volume of the time field is lower than the preset threshold, querying a data block corresponding to the time field to obtain the target data;
if the data volume of the time field is higher than or equal to the preset threshold, reducing the end boundary time of the time field to the midpoint of the start boundary time and the end boundary time, and re-acquiring the data volume of the time field, until the data volume of the time field is lower than the preset threshold, and updating the time field.
Preferably, when the Hive synchronization manner is adopted, after the step of synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule, the method further includes:
judging whether the target data is the same as the set data in the storage medium or not by adopting a preset data comparison rule;
and if not, correcting the data until the target data is the same as the set data.
Preferably, the preset data comparison rule includes: a forward comparison rule or a reverse comparison rule;
the forward comparison rule is used for comparing the data in the preset data source with the data on the target cluster;
the reverse comparison rule is used for checking whether the target data exists on the target cluster.
Preferably, the data synchronization method further includes:
monitoring incremental log data of the data source, wherein the incremental log data comprise incremental change data;
and synchronizing the incremental change data to the target cluster.
As a second aspect of the present invention, there is provided a data synchronization system comprising:
the configuration information construction module is used for constructing data configuration information corresponding to a preset storage medium;
the target data matching module is used for matching target data from a preset data source based on the data configuration information;
and the data synchronization module is used for synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule.
As a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the data synchronization method described above when executing the computer program.
As a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data synchronization method described above.
The positive progress effects of the invention are as follows:
According to the data synchronization method, the data synchronization system, the electronic device and the storage medium, a simple and easy-to-use data synchronization framework is formed by abstracting full synchronization, full comparison and incremental synchronization, and extension interfaces are reserved. Usage scenarios can be extended through configuration, so that rapid iteration of actual business functions can be completed with less development effort and development cost is effectively reduced.
Drawings
Fig. 1 is a first flowchart of a data synchronization method according to embodiment 1 of the present invention;
Fig. 2 is a second flowchart of the data synchronization method according to embodiment 1 of the present invention;
Fig. 3 is a third flowchart of the data synchronization method according to embodiment 1 of the present invention;
Fig. 4 is a first framework structure diagram of the data synchronization method according to embodiment 1 of the present invention;
Fig. 5 is a second framework structure diagram of the data synchronization method according to embodiment 1 of the present invention;
Fig. 6 is a third flowchart of the data synchronization method according to embodiment 1 of the present invention;
Fig. 7 is a third framework structure diagram of the data synchronization method according to embodiment 1 of the present invention;
Fig. 8 is a block diagram of a data synchronization system according to embodiment 2 of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
As shown in fig. 1, the data synchronization method of the present embodiment includes the following steps:
Step 101, constructing data configuration information corresponding to a preset storage medium, wherein different preset storage media correspond to different data configuration information, and the preset storage media include a distributed search and data analysis engine, a remote dictionary, and the like.
In one embodiment, the data configuration information of the distributed search and data analysis engine includes a source node, a target node and a configuration node. The source node determines which field data of which database table is used, the target node determines how the data is organized, and the configuration node determines the basic configuration of the index. The source node includes the data source reference, tables, fields and the like; the target node includes information such as the engine fields, the database field mapping, the index field types and word segmentation; the configuration node includes the number of shards, the number of replicas and the refresh interval information. The numbers of shards and replicas are generally estimated from the configuration and do not need to be set manually.
In another embodiment, the data configuration information of the remote dictionary differs from that of the distributed search and data analysis engine and includes a target library node, a table node and a keyword node. The target library node specifies the database reference; the table node contains information such as the data table, and determines which field data of the target table is used, how the data is organized and which cache namespace the data is synchronized to; the keyword node specifies the associated-table information of the current table and determines whether associated table data needs to be organized.
Before the data configuration information of the distributed search and data analysis engine and the remote dictionary is constructed, as shown in fig. 2 and fig. 3, validity verification needs to be performed, which specifically includes the following steps:
Step 1011, determining whether the data configuration information meets the preset configuration condition, that is, checking the correctness of the field information so that the corresponding index can subsequently be created and used for synchronization.
If the data configuration information does not meet the preset configuration condition, prompt information indicating abnormal configuration is generated so that the configuration can be redone until it meets the requirements of the actual scenario;
if the data configuration information meets the preset configuration condition, the following step 102 is executed.
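The validity check of step 1011 can be illustrated with a minimal Python sketch. The key names (`source`, `target`, `config`, `fields`, `mapping`) and the specific checks are assumptions made for illustration only; the text does not specify what the preset configuration condition contains.

```python
from typing import Any


def validate_es_config(cfg: dict[str, Any]) -> list[str]:
    """Return a list of configuration problems; an empty list means valid.

    Hypothetical rules: all three nodes must exist, the source node must
    declare fields, and every mapped database field must be declared.
    """
    problems = [f"missing node: {node}"
                for node in ("source", "target", "config") if node not in cfg]
    if problems:
        return problems

    source_fields = set(cfg["source"].get("fields", []))
    if not source_fields:
        problems.append("source node declares no fields")

    # Each target field must map to a field that the source node provides.
    for es_field, db_field in cfg["target"].get("mapping", {}).items():
        if db_field not in source_fields:
            problems.append(f"{es_field} maps to unknown field {db_field}")
    return problems
```

A non-empty result corresponds to the "abnormal configuration" prompt; an empty result lets processing continue with step 102.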
Step 102, matching target data from a preset data source based on the data configuration information, wherein the preset data source includes a data warehouse tool, a relational database, and the like.
Step 103, synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule.
Specifically, as shown in fig. 4, synchronizing data to either ES (a storage medium) or Redis (a storage medium) requires solving the configuration problem, namely which data is used, how the data is organized, and which path the data is synchronized along. If hard coding is adopted, the corresponding logic has to be developed every time a requirement changes; since some functions are similar, a lot of time is spent on repeated work. A configuration center (Qconfig) is therefore introduced to solve this problem.
In an implementable scheme, the data configuration specifically includes:
(1) Data source configuration, specifically using conventional JDBC (Java Database Connectivity) operations.
(2) ES configuration, which mainly comprises: a source node (an ES node), including data source references, tables, fields and the like; a target node (an ES node), including information such as ES fields, db field mapping (a field type mapping relation), index field types and word segmentation; and a config node (an ES node), including the number of shards (index logical slices), the number of replicas (copies of index logical slices) and the refresh interval information. The source node determines which field data of which database tables is used, the target node determines how the data is organized, and the config node determines the basic configuration of the index. The numbers of shards and replicas are generally estimated from the configuration and do not need to be set manually. The configuration also determines which index the data is synchronized to. Once the data source is configured, the ES configuration can use it directly by reference.
(3) Redis configuration, which is based on replacing the corresponding DB (database) queries and differs from the ES configuration. After simplification it comprises: a dbName node (database name), which specifies the database reference; a table node, which contains data table information and the like; and a relationKeys node (keyword node), which specifies the associated-table information of the current table. The table node determines which field data of which table is used, how the data is organized and which cache namespace the data is synchronized to, and the relationKeys node determines whether there is associated table data to organize.
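To make the node structure above more concrete, the following Python sketch shows one possible shape of an ES configuration and a Redis configuration. Only the node names `source`, `target`, `config`, `dbName`, `table` and `relationKeys` come from the text; every other key, value and the `cache_key` helper are hypothetical and chosen purely for illustration.

```python
# Hypothetical ES synchronization configuration (illustrative values only).
es_sync_config = {
    "source": {"dataSource": "accountDb", "table": "account",
               "fields": ["id", "name", "update_time"]},
    "target": {
        "index": "account_index",
        "mapping": {
            "id": {"dbField": "id", "type": "long"},
            "name": {"dbField": "name", "type": "text", "analyzer": "standard"},
        },
    },
    # None means "estimate from the configuration instead of setting manually".
    "config": {"shards": None, "replicas": None, "refreshInterval": "1s"},
}

# Hypothetical Redis synchronization configuration (illustrative values only).
redis_sync_config = {
    "dbName": "accountDb",
    "table": {"name": "account", "fields": ["id", "name"],
              "namespace": "account:basic", "key": "id"},
    "relationKeys": [],  # no associated tables to organize in this example
}


def cache_key(cfg: dict, row: dict) -> str:
    """Build the cache key for one row from the configured namespace."""
    table = cfg["table"]
    return f'{table["namespace"]}:{row[table["key"]]}'
```

With such a layout, a new synchronization scenario would be added by writing another configuration entry rather than new code, which is the point the text makes about avoiding hard coding.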
When the preset data synchronization rule includes a DB synchronization mode corresponding to the relational database, the DB synchronization mode divides the target data based on a dynamically adjusted data division time boundary for synchronous transmission.
When the preset data synchronization rule includes a Hive synchronization mode corresponding to the data warehouse tool, the Hive synchronization mode divides the target data based on multithreading for synchronous transmission.
Step 104, monitoring incremental log data of the data source, wherein the incremental log data includes incremental change data, and synchronizing the incremental change data to the target cluster.
After full synchronization, the data needs to be supplemented with incremental data in real time. In one possible implementation, Otter (a database synchronization system) monitors data changes and sends the changed data to an mq (message queue). If an exception occurs during synchronization, the retry mechanism of the message queue is used to retry; if the number of retries exceeds a threshold, the failed data is persisted to a database for subsequent compensation.
Specifically, in one possible embodiment, the data synchronization method for MySQL (a relational database) is as follows: Otter obtains incremental log data based on Canal (a binlog incremental subscription and consumption component). Canal simulates the interaction protocol of a MySQL slave (slave database), disguises itself as a MySQL slave and sends a dump protocol request to the MySQL master (master database). It then receives and parses the binary log of the MySQL master, and the changes are pushed through Otter to the internal message middleware qmq (a distributed message queue), so that incremental change data can be synchronized in real time.
In one possible embodiment, for SqlServer (a relational database), data is synchronized by transactionally sending incremental log data to a custom message queue.
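The retry-then-persist handling described for incremental synchronization can be sketched as follows. This is a simplified illustration, not the framework's code: in the text the retries are performed by the message queue's own retry mechanism, whereas here a plain loop stands in for it, and `apply_change`, `failed_store` and `max_retries` are hypothetical names.

```python
from typing import Callable


def consume_change(change: dict,
                   apply_change: Callable[[dict], None],
                   failed_store: list,
                   max_retries: int = 3) -> bool:
    """Apply one incremental change to the target cluster with retries.

    Returns True on success. If every attempt fails, the change is
    recorded in failed_store (standing in for a database table) so that
    a later compensation job can replay it, and False is returned.
    """
    last_error = ""
    for _ in range(max_retries):
        try:
            apply_change(change)
            return True
        except Exception as exc:  # any synchronization failure triggers a retry
            last_error = repr(exc)
    failed_store.append({"change": change, "error": last_error})
    return False
```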
Specifically, as shown in fig. 2, in one possible embodiment, when the DB synchronization mode is adopted, step 102 specifically includes:
Step 1021, obtaining a time field of the relational database, wherein the time field includes a start boundary time and an end boundary time, and the start boundary time and the end boundary time divide the relational database into different data blocks.
Dividing the database into data blocks by time field allows all data to be queried completely, and the query speed can also be guaranteed when the time field divisions are small. However, if data is cleaned manually in large batches, a large amount of data may be concentrated in a certain time period, so that the data in some time fields is very dense while the data in others is very sparse. The fixed time field therefore needs to be changed into an elastic time boundary, which specifically includes the following steps:
Step 1022, comparing the data volume of the time field with a preset threshold;
if the data volume of the time field is lower than the preset threshold, querying the data block corresponding to the time field to obtain the target data;
if the data volume of the time field is higher than or equal to the preset threshold, reducing the end boundary time of the time field to the midpoint of the start boundary time and the end boundary time, and re-acquiring the data volume in the new time period, until the data volume of the time field is lower than the preset threshold, and updating the time field.
In other words, a larger time field is given first, and a count operation is performed and compared with the preset threshold. If the count is higher than or equal to the preset threshold, the data volume in the time field is too large to process conveniently, so the end boundary time is reduced to the midpoint of the start boundary time and the end boundary time and the comparison is repeated until the data volume of the time field is lower than the preset threshold. Segmenting the data by dynamically changing the time field boundary in this way keeps the impact of the query operations on the database as small as possible.
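The elastic time boundary can be sketched in Python as below. The sketch assumes half-open windows `[start, end)`, a caller-supplied `count_rows(start, end)` function (for example a `COUNT(*)` on the indexed update time field), and that after a window is processed the next window starts at its end boundary and is again opened up to the overall end; the `min_span` guard against rows sharing a single timestamp is also an addition not found in the text.

```python
from datetime import datetime, timedelta


def split_by_elastic_boundary(count_rows, start, end, threshold,
                              min_span=timedelta(seconds=1)):
    """Yield (start, end) windows whose row count is below threshold.

    For each window, begin with the whole remaining range and move the
    end boundary to the midpoint until the count drops below threshold
    (or the window cannot reasonably shrink any further).
    """
    while start < end:
        window_end = end
        while (count_rows(start, window_end) >= threshold
               and window_end - start > min_span):
            window_end = start + (window_end - start) / 2
        yield start, window_end
        start = window_end  # continue with the rest of the range
```

Each yielded window can then be queried as one data block; dense periods produce many short windows and sparse periods produce a few long ones.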
In one possible embodiment, a Hive synchronization mode is provided in addition to the DB synchronization mode. The query operation of Hive itself is fast, but the queried data needs to be combined, so a single query takes seconds or longer. To increase the query speed, the data segments to be handled by each thread therefore need to be preset, and the queries are processed in multiple threads.
In particular, data in the storage medium may be synchronized from a relational database (e.g., MySQL, SqlServer) or a data warehouse tool (Hive).
In one practical solution, synchronizing from the database is the most direct and the data is the most accurate. However, when the data volume is large or the primary key of the table is not an integer, querying all data with traditional paging makes it difficult to complete a full query because of database usage restrictions such as connection timeouts, and it becomes even more difficult if speed is also required.
Selecting different query modes for different primary key types would complicate the query logic and make the operations inconsistent, so a uniform processing mode is needed: the specified table must have a creation time field and an update time field, and an index is created on the update time field by default. The data can then be divided into data blocks by fixed time fields, all data can be queried completely, and the query speed can also be guaranteed when the time field divisions are small.
When a large amount of data has been cleaned manually, a large amount of data is concentrated in a certain time field, so that the data in some time fields is very dense while the data in others is very sparse. An elastic time boundary is therefore used: a larger time field is given first, a count operation is performed, and the count is compared with a preset threshold. When the count is lower than the threshold, the data in the time field can be queried normally. When the count is greater than or equal to the preset threshold, the data volume of the time field is too large to process easily, so the time field is narrowed by reducing the end boundary time to the midpoint of the start boundary time and the end boundary time, and the count operation and the comparison with the preset threshold are repeated until the count is smaller than the preset threshold. In this way the impact of the query operations on the database is kept to a minimum.
In an implementable scheme, data can be synchronized from both the database and Hive. The query operation of Hive itself is fast, but the queried data needs to be combined, so a query takes seconds or longer. To improve the query speed, the data is therefore preprocessed and then handled by multiple threads. Specifically, the ods layer (data interface layer) data is resynchronized to a Hive table that has a preset sid integer field (a field in the data table), and the data is divided by the sid integer field. This improves part of the query speed, reduces the usage of Hive local storage and avoids system instability caused by Hive.
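Dividing the data by the sid integer field and processing the segments in multiple threads might look like the following sketch. The functions `query_segment` and `sync_rows` are hypothetical callables supplied by the caller (the former would run the Hive query for one sid range, the latter would write rows to the target cluster); segment size and worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def sid_ranges(min_sid: int, max_sid: int, segment_size: int):
    """Split [min_sid, max_sid] into inclusive ranges of at most segment_size ids."""
    return [(lo, min(lo + segment_size - 1, max_sid))
            for lo in range(min_sid, max_sid + 1, segment_size)]


def sync_segments(query_segment, sync_rows, min_sid, max_sid,
                  segment_size, workers=4) -> int:
    """Query and synchronize each sid segment in a thread pool.

    Returns the total number of rows handed to sync_rows.
    """
    def work(bounds):
        rows = query_segment(*bounds)  # e.g. WHERE sid BETWEEN lo AND hi
        sync_rows(rows)
        return len(rows)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, sid_ranges(min_sid, max_sid, segment_size)))
```

Because the segments are fixed in advance, each thread issues a bounded query and no thread depends on the result of another.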
When the Hive synchronization mode is adopted, the initial synchronization of data may have problems, and it cannot be fully guaranteed that the synchronized data is correct or complete. A fallback mechanism is therefore needed to detect and correct erroneous data automatically. As shown in fig. 3, after step 103, the method further includes:
Step 10301, judging whether the target data is the same as the set data in the storage medium by using a preset data comparison rule;
if not, correcting the data until the target data is the same as the set data.
Generally, the preset data comparison rule includes a forward comparison rule, which compares the data in the preset data source with the data on the target cluster. The comparison is carried out after synchronization to ensure the accuracy of the data.
Specifically, the forward comparison rule starts from the data source: it queries the data of the database and checks whether the synchronized data is the same as the database data or is missing entries. Different tasks are therefore created for each table of synchronized data, and the processing logic is basically the same as that of full DB data synchronization, so it is not described again here.
The reverse comparison rule checks whether the target data exists on the target cluster; starting from the synchronized data, it queries whether the target data exists in the distributed search and data analysis engine and the remote dictionary.
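One possible reading of the two comparison rules is sketched below with plain dictionaries keyed by primary key. Treating the reverse direction as "walk the synchronized data and find entries with no counterpart in the source" is an interpretation, not something the text states explicitly, and the function names are hypothetical.

```python
def forward_compare(source_rows: dict, target_rows: dict) -> dict:
    """Start from the data source: rows missing from, or different on, the target."""
    return {key: row for key, row in source_rows.items()
            if target_rows.get(key) != row}


def reverse_compare(source_rows: dict, target_rows: dict) -> list:
    """Start from the target: keys that have no counterpart in the data source."""
    return [key for key in target_rows if key not in source_rows]


def reconcile(source_rows: dict, target_rows: dict) -> None:
    """Correct the target in place until it matches the source."""
    for key, row in forward_compare(source_rows, target_rows).items():
        target_rows[key] = row   # re-synchronize missing or differing rows
    for key in reverse_compare(source_rows, target_rows):
        del target_rows[key]     # drop stale entries
```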
In an implementation scheme, the reverse comparison rule for ES is a full query. Neither the "shallow" paging method from-size (a data query method) nor the scroll method (a data query method) can be used: from-size causes performance and efficiency problems with deep paging, while the scroll_id (data generated by the query) produced by scroll not only occupies a large amount of resources (especially for sorted requests) but also queries a historical snapshot on which data changes are not reflected. The search_after method (a data query method) is therefore used for the query.
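The search_after paging loop can be sketched as follows. The `search` argument is an assumed callable that takes a request body and returns a response shaped like an Elasticsearch search response (hits under `["hits"]["hits"]`, each with a `sort` array); sorting on a unique `id` field is also an assumption, made so that every page has an unambiguous continuation point.

```python
def iterate_all(search, page_size=1000):
    """Yield every hit of a full query, page by page, using search_after."""
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{"id": "asc"}],  # hypothetical unique field used as sort key
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Continue after the sort values of the last hit on this page.
        body["search_after"] = hits[-1]["sort"]
```

Unlike from-size, each request only carries the last sort values, so the cost of a page does not grow with its depth.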
In an implementable scheme, the reverse comparison rule of Redis is also a full query, and a scan mode (a data query mode) is adopted for query.
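A cursor-based scan over the cache keys might be written as in the sketch below. It assumes a client object whose `scan(cursor=..., match=..., count=...)` method returns a `(next_cursor, keys)` pair, as redis-py's `Redis.scan` does; the key pattern is hypothetical. Note that SCAN may return the same key more than once, so callers that need uniqueness should deduplicate.

```python
def scan_keys(client, pattern, count=500):
    """Yield all keys matching pattern by following the SCAN cursor."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # a zero cursor marks the end of the iteration
            return
```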
As shown in the framework of fig. 5, full comparison starts automatically after full synchronization is completed. When the data being compared is about to catch up with the current time, that is, when the difference between the update time of the compared data and the current time is less than the configured threshold, incremental synchronization is turned on automatically. The whole flow of synchronization, verification and correction can thus be completed automatically simply by configuring the relevant tasks.
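The switch condition for turning on incremental synchronization reduces to a single comparison, sketched here with hypothetical parameter names; the threshold value itself would come from the task configuration.

```python
from datetime import datetime, timedelta


def should_enable_incremental(last_compared_update_time: datetime,
                              now: datetime,
                              lag_threshold: timedelta) -> bool:
    """True once the comparison has caught up to within lag_threshold of now."""
    return now - last_compared_update_time < lag_threshold
```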
The implementation principle of the data synchronization method of this embodiment is described below with reference to an example:
Taking the synchronization of data from MySQL to ES as an example, as shown in fig. 5, the specific steps are as follows:
S1, ES configuration parsing and mapping, as shown in fig. 6:
S11, creating an ES configuration parsing class for parsing, checking, estimating the number of shards, constructing configuration mapping information, and the like.
S2, full synchronization preparation.
S21, creating a full synchronization preprocessing class of the distributed search and data analysis engine for processing the basic information of full synchronization.
S22, creating a full synchronization trigger class of the distributed search and data analysis engine for processing the index compression, refresh time and comparison switch information.
S23, creating a configuration mapping class of the distributed search and data analysis engine for obtaining valid mapping configuration information.
S3, full comparison preparation.
S31, creating a full comparison preprocessing class of the distributed search and data analysis engine for processing the basic information of full comparison.
S32, creating a full comparison switch class of the distributed search and data analysis engine for starting incremental comparison.
S4, data synchronization and comparison, as shown in fig. 7:
S41, creating a data synchronization class for converting and assembling the prepared data into the data required by the distributed search and data analysis engine, and for completing functions such as initial batch data synchronization, batch data comparison and single-record synchronization.
S5, incremental data consumption.
S51, creating an otter queue consumption class; a consumption entry is added by adding a consumer group, so that it is isolated from other consumers and they do not affect one another.
S52, creating a data filtering class for judging whether the data in the current otter queue can be processed.
In this embodiment, full synchronization, full comparison and incremental synchronization are abstracted into a simple, easy-to-use data synchronization framework with reserved extension interfaces. Usage scenarios can be extended through configuration, so rapid iteration of actual business functions can be completed with little development effort, which effectively reduces development cost.
Example 2
As shown in fig. 8, the data synchronization system of the present embodiment specifically includes:
a configuration information constructing module 201, configured to construct data configuration information corresponding to a preset storage medium, where different preset storage media correspond to different data configuration information, and the preset storage medium includes a distributed search and data analysis engine, a remote dictionary, and the like.
In one embodiment, the data configuration information of the distributed search and data analysis engine includes a source node, a target node and a configuration node. The source node determines which field data of which database table are the target, the target node determines how the data are organized, and the configuration node determines the underlying configuration of the index. The source node includes the data source reference, tables, fields, and the like; the target node includes information such as the distributed search and data analysis engine fields, the database field mapping, the index field types and the word segmentation; the configuration node includes the number of shards, the number of replicas and the refresh interval. The numbers of shards and replicas are generally estimated from the configuration and do not need to be set manually.
In another embodiment, the data configuration information of the remote dictionary differs from that of the distributed search and data analysis engine and includes a target library node, a table node and a keyword node. The target library node contains the specified database reference. The table node contains information such as the data table; it determines which field data of the target table are used and how the data are organized, and determines the cache namespace to which the data are synchronized. The keyword node contains the information of the tables associated with the current table and determines whether the data of the associated tables need to be organized.
After the data configuration information of the distributed search and data analysis engine and the remote dictionary is constructed, and before it is used for synchronization, its validity needs to be verified. The configuration information construction module 201 therefore further includes:
a determining unit 2011, configured to determine whether the data configuration information meets a preset configuration condition, that is, to check the correctness of the field information so that the corresponding index can subsequently be created and used for synchronization.
If the data configuration information does not meet the preset configuration condition, prompt information indicating a configuration exception is generated;
if the data configuration information meets the preset configuration condition, the target data matching module 202 is called.
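The validity check performed by the determining unit can be sketched as follows. This is only a minimal illustration: the dictionary layout, the key names (`source`, `target`, `config`, `fields`, `field_mapping`) and the helper `validate_config` are assumptions made for the example and are not taken from the disclosure.

```python
from typing import Any, Dict, List

# Hypothetical layout: the node and key names are illustrative assumptions.
REQUIRED_NODES = ("source", "target", "config")


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the configuration is usable."""
    problems: List[str] = [
        f"missing node: {node}" for node in REQUIRED_NODES if node not in cfg
    ]
    if problems:
        return problems

    source_fields = set(cfg["source"].get("fields", []))
    # Every target field must map to a field that the source actually provides.
    for target_field, source_field in cfg["target"].get("field_mapping", {}).items():
        if source_field not in source_fields:
            problems.append(
                f"target field {target_field!r} maps to unknown source field {source_field!r}"
            )
    return problems


example = {
    "source": {"datasource": "order_db", "table": "orders", "fields": ["id", "name"]},
    "target": {"field_mapping": {"orderId": "id", "title": "subject"}},
    "config": {"refresh_interval": "30s"},
}
issues = validate_config(example)
if issues:
    # Corresponds to generating the prompt information for a configuration exception.
    print("configuration abnormal:", issues)
```

Only when the returned list is empty would the target data matching step be invoked.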
A target data matching module 202, configured to match target data from a preset data source based on the data configuration information, where the preset data source includes a data warehouse tool, a relational database, and the like.
A data synchronization module 203, configured to synchronize the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule.
When the preset data synchronization rule includes a DB synchronization mode corresponding to the relational database, the DB synchronization mode divides the target data based on dynamically adjusted time boundaries for synchronous transmission.
When the preset data synchronization rule includes a Hive synchronization mode corresponding to the data warehouse tool, the Hive synchronization mode divides the target data based on multithreading for synchronous transmission.
An incremental monitoring module 204, configured to monitor incremental log data of the data source, where the incremental log data include incremental change data, and to synchronize the incremental change data to the target cluster.
After the full synchronization, the data need to be supplemented with incremental data in real time. In one possible implementation, Otter monitors data changes and sends the changed data to a message queue (MQ). If an exception occurs during synchronization, the retry mechanism of the message queue is used to retry; if the number of retries exceeds a threshold, the failed data are persisted to a database for subsequent compensation.
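The retry-then-persist behaviour can be illustrated with the following sketch. It is written under stated assumptions: a plain loop stands in for the retry mechanism of the message queue, an in-memory list stands in for the database table that stores failed data, and the names `consume`, `MAX_RETRIES` and `failed_store` are hypothetical. The disclosure does not give a concrete threshold value.

```python
from typing import Callable, Dict, List

MAX_RETRIES = 3  # assumed threshold; the source does not state a value


def consume(message: Dict, sync: Callable[[Dict], None],
            failed_store: List[Dict], max_retries: int = MAX_RETRIES) -> bool:
    """Try to sync one change message; keep it for compensation if retries run out."""
    last_error = ""
    for _attempt in range(max_retries):
        try:
            sync(message)
            return True
        except Exception as exc:  # a real consumer would catch narrower errors
            last_error = str(exc)
    # Retries exhausted: record the failed message for a later compensation job.
    failed_store.append({"message": message, "error": last_error})
    return False


calls = {"n": 0}


def flaky_sync(msg: Dict) -> None:
    """Fails on the first two calls, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("target cluster unavailable")


def broken_sync(msg: Dict) -> None:
    """Always fails."""
    raise RuntimeError("mapping conflict")


failed: List[Dict] = []
ok_first = consume({"id": 1}, flaky_sync, failed)
ok_second = consume({"id": 2}, broken_sync, failed)
```

In this run the first message succeeds on its third attempt, while the second is recorded in `failed` for later compensation.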
Specifically, in one possible embodiment, for data synchronization from MySQL, Otter obtains the incremental log data based on Canal.
In one possible embodiment, for data synchronization from SQL Server, the incremental log data are sent to a custom message queue in a transactional manner.
Specifically, in one possible embodiment, when the DB synchronization approach is adopted, the target data matching module 202 includes:
a time field obtaining unit 2021, configured to obtain a time field of the relational database, where the time field includes a starting boundary time and an ending boundary time, and the starting boundary time and the ending boundary time divide the relational database into different data blocks.
Dividing the database into data blocks by the time field ensures that all data can be queried completely, and the query speed can be guaranteed even when the number of time-field divisions is small. However, if data are manually cleaned in large batches, a large amount of data may be concentrated in a certain time period, so that some time ranges are very dense while others are very sparse. The fixed time field therefore needs to be changed into an elastic time boundary. Specifically:
a comparing unit 2022, configured to compare the data volume of the time field with a preset threshold.
If the data volume of the time field is lower than the preset threshold, the data block corresponding to the time field is queried to obtain the target data;
if the data volume of the time field is higher than or equal to the preset threshold, the ending boundary time of the time field is reduced to the midpoint of the starting boundary time and the ending boundary time, and the data volume in the new time field is re-acquired, until the data volume in the time field is lower than the preset threshold, whereupon the time field is updated.
In other words, a relatively large time field is given first, a count is performed, and the result is compared with the preset threshold. If the count is higher than or equal to the preset threshold, the data volume in the time field is too large to process conveniently, so the ending boundary time is reduced to the midpoint of the starting and ending boundary times and the comparison is repeated until the data volume is lower than the preset threshold. Segmenting the data by dynamically changing the time field boundary in this way keeps the impact of the query operations on the database as small as possible.
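The elastic time boundary can be sketched as below. The sketch makes several assumptions that the disclosure does not state: times are integers, windows are half-open, the next window starts where the previous one ended, and halving stops once a window is one time unit wide. The function `split_by_time` and the in-memory `count_rows` stand in for the counting query against the relational database.

```python
from typing import Callable, Iterator, List, Tuple


def split_by_time(start: int, end: int, threshold: int,
                  count_rows: Callable[[int, int], int]) -> Iterator[Tuple[int, int]]:
    """Yield half-open windows [lo, hi) that together cover [start, end)."""
    lo = start
    while lo < end:
        hi = end
        # Reduce the ending boundary to the midpoint until the window is small enough.
        # The width guard (an assumption) avoids looping forever on one dense instant.
        while count_rows(lo, hi) >= threshold and hi - lo > 1:
            hi = lo + (hi - lo) // 2
        yield lo, hi
        lo = hi  # assumed way of "updating the time field" for the next round


# Update times of ten rows: dense near 0..7, sparse afterwards.
update_times: List[int] = [0, 1, 2, 3, 4, 5, 6, 7, 40, 90]


def count_in_memory(lo: int, hi: int) -> int:
    """Stand-in for a COUNT query restricted to lo <= update_time < hi."""
    return sum(lo <= t < hi for t in update_times)


windows = list(split_by_time(0, 100, threshold=4, count_rows=count_in_memory))
print(windows)  # [(0, 3), (3, 6), (6, 53), (53, 100)]
```

The dense region is cut into narrow windows while the sparse region is covered by wide ones, so each query touches fewer rows than the threshold.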
In an implementation, besides the DB synchronization mode there is also a Hive synchronization mode. A query against Hive is itself fast, but the queried data must be combined, so a single query takes on the order of seconds or more. To increase the query speed, the data segments to be processed by each thread are therefore set in advance and the queries are executed in multiple threads.
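A possible shape of the multithreaded processing is sketched below. The segmentation by contiguous id range is an assumption (the disclosure only says that the segments are set in advance), and `TABLE` with `query_segment` is an in-memory stand-in for the Hive query; only the Python standard library is used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def make_segments(min_id: int, max_id: int, parts: int) -> List[Tuple[int, int]]:
    """Split [min_id, max_id] into at most `parts` contiguous half-open id ranges."""
    total = max_id - min_id + 1
    step = -(-total // parts)  # ceiling division
    return [(lo, min(lo + step, max_id + 1)) for lo in range(min_id, max_id + 1, step)]


# Stand-in for the data held in the data warehouse.
TABLE: List[Dict[str, int]] = [{"id": i, "value": i * i} for i in range(1, 11)]


def query_segment(segment: Tuple[int, int]) -> List[Dict[str, int]]:
    """Stand-in for one query restricted to lo <= id < hi."""
    lo, hi = segment
    return [row for row in TABLE if lo <= row["id"] < hi]


segments = make_segments(1, 10, parts=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in segment order, so the combined rows stay ordered.
    chunks = list(pool.map(query_segment, segments))
rows = [row for chunk in chunks for row in chunk]
```

Because the segments are fixed before any thread starts, no two threads query the same rows and the results can be concatenated directly.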
When the Hive synchronization mode is adopted, the initial synchronization of data is more or less error-prone, and it cannot be fully guaranteed that the synchronized data are correct and complete. A fallback mechanism is therefore needed so that data errors are automatically detected and corrected when they occur. The data synchronization system accordingly further includes:
a detecting module 20301, configured to determine whether the target data is the same as the set data in the storage medium according to a preset data comparison rule;
if not, the data are corrected until the target data is the same as the set data.
Generally, the preset data comparison rule includes a forward comparison rule and a reverse comparison rule. The forward comparison rule represents comparing the data in the preset data source with the data on the target cluster; the comparison is performed after synchronization to ensure the accuracy of the data.
The reverse comparison rule represents checking, on the target cluster, whether the target data exist: starting from the synchronized data, it queries whether the target data exist in the distributed search and data analysis engine and the remote dictionary.
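The two comparison directions can be illustrated with in-memory dictionaries. This sketch follows one reading of the text: the forward pass walks the source and checks the target, while the reverse pass walks the synchronized data on the target and checks that each record still has a counterpart in the source. The function names and the correction strategy (overwrite and delete) are assumptions for the example.

```python
from typing import Dict, List

Records = Dict[str, dict]


def forward_compare(source: Records, target: Records) -> List[str]:
    """Keys whose source record is missing from, or differs in, the target."""
    return sorted(k for k, v in source.items() if target.get(k) != v)


def reverse_compare(source: Records, target: Records) -> List[str]:
    """Keys present on the target that have no counterpart in the source."""
    return sorted(k for k in target if k not in source)


def correct(source: Records, target: Records) -> None:
    """Repair the target in place so that it matches the source."""
    for key in forward_compare(source, target):
        target[key] = dict(source[key])
    for key in reverse_compare(source, target):
        del target[key]


source = {"1": {"name": "a"}, "2": {"name": "b"}, "3": {"name": "c"}}
target = {"1": {"name": "a"}, "2": {"name": "stale"}, "9": {"name": "orphan"}}

stale_or_missing = forward_compare(source, target)  # ['2', '3']
orphans = reverse_compare(source, target)           # ['9']
correct(source, target)
```

After `correct` runs, both comparisons return empty lists, which corresponds to repeating the correction until the data are the same.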
In an implementation, the reverse comparison of the distributed search and data analysis engine is a full query, for which neither the "shallow" from-size paging mode nor the scroll mode is suitable. from-size causes performance and efficiency problems under deep paging, while the scroll_id generated by scroll not only occupies a large amount of resources (especially for sorted requests) but also queries a historical snapshot, on which later data changes are not reflected. The search_after mode is therefore used for the query.
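The paging pattern behind search_after can be sketched without any client library. In the example below, `search_page` is a hypothetical in-memory stand-in for a sorted search request that accepts the sort values of the last hit of the previous page; the field names `updated` and `id` are assumptions. The sort key includes the unique `id` as a tiebreaker so that records sharing an update time are neither skipped nor repeated.

```python
from typing import Iterator, List, Optional, Tuple

SortKey = Tuple[int, int]

# Stand-in index: several documents share the same update time.
DOCS: List[dict] = [{"id": i, "updated": 100 + i // 2} for i in range(1, 8)]


def search_page(size: int, after: Optional[SortKey]) -> List[dict]:
    """Return up to `size` docs ordered by (updated, id) that sort after `after`."""
    ordered = sorted(DOCS, key=lambda d: (d["updated"], d["id"]))
    if after is not None:
        ordered = [d for d in ordered if (d["updated"], d["id"]) > after]
    return ordered[:size]


def scan_all(size: int) -> Iterator[dict]:
    """Walk the whole index page by page, carrying only the last sort key."""
    after: Optional[SortKey] = None
    while True:
        page = search_page(size, after)
        if not page:
            return
        yield from page
        last = page[-1]
        after = (last["updated"], last["id"])


seen_ids = [doc["id"] for doc in scan_all(size=3)]
```

Each request depends only on the last sort key, so no server-side snapshot has to be kept between pages, and every page is evaluated against the current data.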
In an implementation, the reverse comparison of the remote dictionary is also a full query, which is carried out in scan mode.
When the data being compared are about to catch up with the current time, that is, when the difference between the update time of the compared data and the current time is smaller than a configured threshold, incremental synchronization is switched on automatically, so that automatic synchronization, verification and correction of the whole flow can be completed simply by configuring the relevant tasks.
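The switching condition amounts to a single comparison, sketched below. The function name, the use of seconds as the time unit and the sample values are assumptions made for illustration only.

```python
from typing import List, Optional


def should_start_incremental(last_update_time: float, now: float,
                             threshold_seconds: float) -> bool:
    """True once the compared data are within the threshold of the current time."""
    return now - last_update_time < threshold_seconds


# Update time of the last record in each compared batch (assumed sample values).
batch_update_times: List[float] = [1000.0, 1600.0, 1990.0]
now = 2000.0

# Index of the first batch after which incremental synchronization would be opened.
switch_at: Optional[int] = next(
    (i for i, t in enumerate(batch_update_times)
     if should_start_incremental(t, now, threshold_seconds=30.0)),
    None,
)
```

With these values the first two batches lag by 1000 and 400 seconds, so the switch happens only after the third batch, which lags by 10 seconds.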
It should be noted that the working principle of the data synchronization system in this embodiment is the same as that of the data synchronization method in embodiment 1, and therefore, the description thereof is omitted here.
Example 3
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device includes a memory, a processor, and a computer program stored in the memory and running on the processor, and the processor executes the computer program to implement the data synchronization method. The electronic device 30 shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
The electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as Random Access Memory (RAM)321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as the above-described data synchronization method, by executing the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may be through input/output (I/O) interfaces 35. Also, the electronic device 30 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via a network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 over the bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Example 4
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the data synchronization method as in the above embodiments.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the present invention can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the data synchronization method of the above embodiments.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (12)

1. A data synchronization method, characterized in that the data synchronization method comprises:
constructing data configuration information corresponding to a preset storage medium;
matching target data from a preset data source based on the data configuration information;
and synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule.
2. The data synchronization method of claim 1, wherein different preset storage media correspond to different data configuration information.
3. The data synchronization method according to claim 1 or 2, wherein the step of constructing the data configuration information corresponding to the preset storage medium is followed by:
judging whether the data configuration information meets a preset configuration condition; if so, executing the step of matching target data from the preset data source based on the data configuration information; otherwise, generating prompt information indicating a configuration exception.
4. The data synchronization method of claim 1, wherein the preset storage medium comprises: a distributed search and data analysis engine, a remote dictionary;
and/or,
the preset data source comprises: a data warehouse tool and/or a relational database.
5. The data synchronization method according to claim 4, wherein when the preset data synchronization rule includes a DB synchronization mode corresponding to the relational database, the DB synchronization mode divides the target data based on dynamically adjusted time boundaries for synchronous transmission;
or,
when the preset data synchronization rule includes a Hive synchronization mode corresponding to the data warehouse tool, the Hive synchronization mode divides the target data based on multithreading for synchronous transmission.
6. The data synchronization method of claim 5, wherein the step of obtaining the target data from a preset data source based on the data configuration information in the DB synchronization mode comprises:
acquiring a time field of the relational database, wherein the time field comprises a starting boundary time and an ending boundary time, and the starting boundary time and the ending boundary time divide the relational database into different data blocks;
comparing the data volume of the time field with a preset threshold, and if the data volume of the time field is lower than the preset threshold, querying a data block corresponding to the time field to obtain the target data;
if the data volume of the time field is higher than or equal to the preset threshold, reducing the ending boundary time of the time field to the midpoint of the starting boundary time and the ending boundary time, and re-acquiring the data volume of the time field until the data volume of the time field is lower than the preset threshold, and updating the time field.
7. The data synchronization method of claim 5, wherein when the Hive synchronization mode is adopted, the step of synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule further comprises:
judging whether the target data is the same as the set data in the storage medium or not by adopting a preset data comparison rule;
and if not, correcting the data until the target data is the same as the set data.
8. The data synchronization method of claim 7, wherein the preset data comparison rule comprises: a forward comparison rule or a reverse comparison rule;
the forward comparison rule is used for representing the comparison of the data in the preset data source with the data on the target cluster;
the reverse comparison rule is used for representing checking, on the target cluster, whether the target data exist.
9. The data synchronization method of claim 1 or 2, wherein the data synchronization method further comprises:
monitoring incremental log data of the data source, wherein the incremental log data comprise incremental change data;
and synchronizing the incremental change data to the target cluster.
10. A data synchronization system, characterized in that the data synchronization system comprises:
the configuration information construction module is used for constructing data configuration information corresponding to a preset storage medium;
the target data matching module is used for matching target data from a preset data source based on the data configuration information;
and the data synchronization module is used for synchronizing the target data to a plurality of target clusters in the preset storage medium according to a preset data synchronization rule.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the data synchronization method of any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data synchronization method of any one of claims 1 to 9.
CN202210791558.0A 2022-07-05 2022-07-05 Data synchronization method, system, electronic device and storage medium Pending CN115080666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210791558.0A CN115080666A (en) 2022-07-05 2022-07-05 Data synchronization method, system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210791558.0A CN115080666A (en) 2022-07-05 2022-07-05 Data synchronization method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115080666A true CN115080666A (en) 2022-09-20

Family

ID=83258093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210791558.0A Pending CN115080666A (en) 2022-07-05 2022-07-05 Data synchronization method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115080666A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392802A (en) * 2022-10-28 2022-11-25 江苏智云天工科技有限公司 Method, system, medium, and apparatus for detecting defects of industrial products
CN115952178A (en) * 2022-12-01 2023-04-11 北京华宇九品科技有限公司 Multilevel associated data heterogeneous data synchronization method

Similar Documents

Publication Publication Date Title
US10691722B2 (en) Consistent query execution for big data analytics in a hybrid database
WO2020224374A1 (en) Data replication method and apparatus, and computer device and storage medium
US9589041B2 (en) Client and server integration for replicating data
US9892153B2 (en) Detecting lost writes
US10191932B2 (en) Dependency-aware transaction batching for data replication
US8949222B2 (en) Changing the compression level of query plans
CN115080666A (en) Data synchronization method, system, electronic device and storage medium
US20210382877A1 (en) Data read method and apparatus, computer device, and storage medium
US20040148317A1 (en) System and method for efficient multi-master replication
CN111506556A (en) Multi-source heterogeneous structured data synchronization method
US12093241B2 (en) Method for replaying log on data node, data node, and system
EP2380090B1 (en) Data integrity in a database environment through background synchronization
CN111651519B (en) Data synchronization method, data synchronization device, electronic equipment and storage medium
CN110737720A (en) DB2 database data synchronization method, device and system
WO2018196729A1 (en) Query processing method, data source registration method and query engine
CN114661816B (en) Data synchronization method and device, electronic equipment and storage medium
CN112131214A (en) Method, system, equipment and storage medium for data writing and data query
CN112163038A (en) Cross-cluster data synchronization method, device, equipment and storage medium
US9047354B2 (en) Statement categorization and normalization
CN108090056B (en) Data query method, device and system
CN116842244A (en) Search engine data synchronization method, system, device and storage medium
CN116340114A (en) Stream processing log alarming method
CN116186082A (en) Data summarizing method based on distribution, first server and electronic equipment
CN114003580A (en) Database construction method and device applied to distributed scheduling system
CN111813779A (en) Data query method, system, device and medium based on data interface configuration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination