CN113010609B

CN113010609B - Differentiated synchronization method and system applied to store operation

Info

Publication number: CN113010609B
Application number: CN202110377970.3A
Authority: CN
Inventors: 吴昭松; 潘威; 王忆新; 王元盛; 王新; 彭肖溶; 朱婵娟
Original assignee: Shanghai Heading Information Engineering Co ltd
Current assignee: Shanghai Heading Information Engineering Co ltd
Priority date: 2020-12-23
Filing date: 2021-04-08
Publication date: 2023-05-16
Anticipated expiration: 2041-04-08
Also published as: CN113010609A

Abstract

The invention relates to the technical field of data synchronization and provides a differentiated synchronization method and system applied to store operation. The method comprises the following steps: a data extraction end registers with a central scheduling service; the central scheduling service generates task information and sends it to the data extraction end; after receiving the task information, the data extraction end executes the task according to the task type and the extraction mode, encapsulates the extracted data into a data packet, and sends the data packet to a data processing end; and the data processing end processes the data packet, synchronizing the data in the data packet into a target database when the task type is data synchronization, and, when the task type is verification, pulling the synchronized data from the target database and comparing it with the data in the data packet to verify the accuracy of the synchronized data. Tables of different forms in the source database are each synchronized in a matching mode, so that the integrity, accuracy and efficiency of each synchronization can be ensured.

Description

Differentiated synchronization method and system applied to store operation

Technical Field

The invention relates to the technical field of data synchronization, and in particular to a differentiated synchronization method and system applied to store operation.

Background

In store management systems, particularly chain stores, it is common to include a plurality of different data sources. For example, different stores may have their own individual store systems corresponding to different source databases. For another example, sales systems, supplier systems, etc. may be involved in store operations and may correspond to different source databases.

When the data from each store, sales system or supplier system is finally processed statistically, the data in each source database must be synchronized and consolidated into a single target database system so that it can conveniently be summarized and counted afterwards.

When data is extracted from a source database and synchronized to a target database, integrity, accuracy and efficiency are the key goals. However, because the tables in the source databases take different forms, synchronizing all of them in one identical mode causes a number of problems, and the integrity, accuracy and efficiency of each synchronization cannot be guaranteed.

Disclosure of Invention

In view of the above problems, the invention aims to provide a differentiated synchronization method and system applied to store operation, in which tables of different forms in a source database are each synchronized in a matching mode, so that the integrity, accuracy and efficiency of each synchronization can be ensured.

The above object of the present invention is achieved by the following technical solutions:

a differentiated synchronization method applied to store operations, comprising the steps of:

s1: a data extraction end for extracting data in a source database is established, and when the data extraction work in the source database is started, the data extraction end registers with a central dispatching service, wherein the central dispatching service is used for dispatching the data extraction of a plurality of groups of source databases;

s2: when the central scheduling service receives registration information of the data extraction end, generating task information comprising a task type and an extraction mode, sending the task information to the data extraction end, and starting a data processing end for executing a data processing task after the data extraction end extracts data in the source database, wherein the task type comprises data synchronization and verification, and the extraction mode is an adapted extraction mode selected according to the form of each table, including full-table synchronization, single-table incremental synchronization and slave-table incremental synchronization;

s3: after receiving the task information, the data extraction end executes a task according to the task type and the extraction mode, encapsulates the extracted data into a data packet and sends the data packet to the data processing end, wherein when executing the task according to the task type and the extraction mode, optimal synchronous task parameters are calculated by adopting a decision tree according to the performance of a synchronous task host, and the utilization rate of system resources is improved and the time consumption of the whole task is reduced by improving concurrency and batch data size and optimizing task queue allocation on the premise of not influencing the operation of the host and the operation of other applications, and the method specifically comprises the following steps:

Establishing a decision tree for calculating optimal synchronization task parameters; when a task is started, collecting parameters of the host machine and of the data storage ETL, including CPU core count, CPU utilization, I/O, network, memory, number of table fields and table field size; and inputting the collected parameters into the decision tree;

the decision tree outputs task configuration, and a data storage ETL task is started;

calculating and outputting task configuration including optimal concurrency number, batch data size and task queue allocation of the batch of tasks according to the use condition of the current system and historical task logs, collecting information including resource occupation and time consumption of task execution, and storing the information into a task log library to provide basis for next decision;

s4: the data processing end processes the data packet after analyzing the data packet; and when the task type is data synchronization, synchronizing the data in the data packet into a target database, and when the task type is verification, pulling the synchronized data in the target database, and comparing the synchronized data with the data in the data packet to verify the accuracy of the synchronized data in the target database.

Further, according to the form difference in the source database, selecting different extraction modes to perform data extraction and synchronization, specifically:

Full-table synchronization is suitable for tables with a small amount of data and for tables with a large amount of data but a low synchronization frequency; all data in the table is synchronized;

single-table incremental synchronization is suitable for a single table with a large amount of data; only the newly added or updated data in the table is synchronized incrementally;

slave-table incremental synchronization is suitable for a slave table associated with a master table; the associated data in the slave table is synchronized incrementally according to the newly added or updated data in the master table.
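As an illustration of the three extraction modes above, the following minimal sketch builds one query per mode. The column name `last_update_time`, the mode names, and the function `build_extract_sql` are assumptions made for this example and do not come from the patent text; the sketch only shows how a slave table could be filtered through the update time of its master table.

```python
def build_extract_sql(mode, table, since=None, master=None, key=None):
    """Return (sql, params) for one extraction mode.

    mode   -- "full", "single_incremental" or "slave_incremental" (hypothetical names)
    since  -- lower bound of the last-update timestamp for the incremental modes
    master -- master table name (slave_incremental only)
    key    -- column joining the slave table to the master table

    Table and column names are interpolated directly, so they must come from
    trusted configuration, never from user input.
    """
    if mode == "full":
        # Full-table synchronization: read every row.
        return f"SELECT * FROM {table}", ()
    if mode == "single_incremental":
        # Single-table incremental: only rows added or updated since the last run.
        return f"SELECT * FROM {table} WHERE last_update_time > ?", (since,)
    if mode == "slave_incremental":
        # Slave-table incremental: pick slave rows whose master row changed.
        sql = (
            f"SELECT s.* FROM {table} s "
            f"JOIN {master} m ON s.{key} = m.{key} "
            f"WHERE m.last_update_time > ?"
        )
        return sql, (since,)
    raise ValueError(f"unknown extraction mode: {mode}")
```

The timestamp is passed as a bound parameter rather than spliced into the statement, which keeps the generated SQL reusable between runs.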

Further, in step S3, a decision tree for calculating the optimal synchronization task parameters is established, specifically:

the method comprises the steps of establishing a configuration decision tree by adopting a C4.5 algorithm, taking the task log library as a training set, calculating the information gain rate of each attribute when each task executes log, and selecting the attribute with the highest information gain rate for division, wherein the method specifically comprises the following steps:

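the data set formed by all attribute sets in the task log library is denoted D, in which there are K classes of task configuration;

calculating the information entropy of the whole data:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

wherein $C_k$ represents the k-th class of task configuration;

calculating the information entropy of each attribute A:

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$

wherein D is divided into n different classes by attribute A, and $D_i$ is the i-th set divided by attribute A;

calculating the information gain of each attribute A:

$$\mathrm{Gain}(D, A) = H(D) - H(D \mid A)$$

calculating the information gain rate:

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

for the information gain rate of each attribute A, the attribute with the highest information gain rate is found and used as the splitting node of the decision tree.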
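The gain-rate computation can be sketched in a few lines. This is a minimal illustration of the standard C4.5 splitting criterion on a toy task log; the attribute names (`cpu_load`, `table_width`), the class labels and the sample records are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Information gain rate of `attribute` with respect to `target`."""
    h_d = entropy([r[target] for r in rows])            # H(D)
    groups = {}
    for r in rows:                                      # partition D into D_i by attribute value
        groups.setdefault(r[attribute], []).append(r[target])
    total = len(rows)
    h_d_a = sum(len(g) / total * entropy(g) for g in groups.values())  # H(D|A)
    split_info = entropy([r[attribute] for r in rows])  # H_A(D)
    gain = h_d - h_d_a
    return gain / split_info if split_info > 0 else 0.0

def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain rate."""
    return max(attributes, key=lambda a: gain_ratio(rows, a, target))

# Hypothetical task log: host load and table width versus the chosen configuration.
LOG_SAMPLE = [
    {"cpu_load": "low", "table_width": "wide", "config": "aggressive"},
    {"cpu_load": "low", "table_width": "narrow", "config": "aggressive"},
    {"cpu_load": "high", "table_width": "wide", "config": "conservative"},
    {"cpu_load": "high", "table_width": "narrow", "config": "conservative"},
]
```

In this toy log `cpu_load` separates the two configurations perfectly, so `best_attribute(LOG_SAMPLE, ["cpu_load", "table_width"], "config")` selects it as the first split.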

Further, when the task type is verification, the verification scheme includes instant quick verification, daily verification, weekly verification and dynamic verification, specifically:

the instant quick verification promptly verifies tables with a large amount of data and many data fields;

the daily verification verifies data that grows over time, each time verifying the data that changed within one day;

the weekly verification verifies data that grows over time, each time verifying the data that changed within one week;

the dynamic verification splits the data to be verified into a plurality of data segments at fixed time intervals and verifies the data segments separately.
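The segmentation used by dynamic verification can be sketched as follows. This is a minimal example that assumes the data to be verified is addressed by a time range; the function name `split_into_segments` and the half-open interval convention are choices made for the illustration.

```python
from datetime import datetime, timedelta

def split_into_segments(start, end, interval):
    """Split [start, end) into consecutive half-open segments of at most `interval`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    segments = []
    cursor = start
    while cursor < end:
        upper = min(cursor + interval, end)  # the last segment may be shorter
        segments.append((cursor, upper))
        cursor = upper
    return segments
```

Each returned pair can then bound one verification query, so that a mismatch is narrowed to a single segment instead of the whole range.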

Further, the instant quick verification specifically includes:

the dimension of the wide table is reduced by a PCA algorithm, and principal component data amounting to 5%-20% of the original data is extracted, which retains more than 95% of the information in the original data; an MD5 value is then calculated on the principal component data and synchronized to the target database together with the original data;

the target database calculates the MD5 value of the principal component data according to the same logic and the two MD5 values are compared; if they are consistent, the instant quick verification passes.

Further, the dimension of the wide table is reduced by the PCA algorithm: an optimal set of basis vectors is obtained by maximizing the variance of the data after projection, and the data is then projected onto it to reduce the dimension of the matrix, specifically comprising the following steps:

the extracted original data is arranged by columns into a two-dimensional matrix X of n rows and m columns;

the matrix X is initialized to zero mean, and feature scaling to -0.5 is applied;

an orthogonal basis vector $u_j$ is set, and the projection distance of a data point $x_i$ on this basis vector is $x_i^T u_j$;

the variance $J_j$ of the projection of all data onto this basis vector is:

$$J_j = \frac{1}{m} \sum_{i=1}^{m} \left( x_i^T u_j - x_{center}^T u_j \right)^2 \tag{1}$$

where m is the number of samples; since the matrix X has been initialized to zero mean, i.e. $x_{center} = 0$, then:

$$J_j = \frac{1}{m} \sum_{i=1}^{m} \left( x_i^T u_j \right)^2 \tag{2}$$

so that:

$$J_j = \frac{1}{m} \sum_{i=1}^{m} u_j^T x_i x_i^T u_j = u_j^T \left( \frac{1}{m} X X^T \right) u_j \tag{3}$$

calculating the covariance matrix:

$$S = \frac{1}{m} X X^T$$

SVD decomposition is performed on the covariance matrix to obtain the eigenvalues and corresponding eigenvectors;

substituting the covariance matrix into formula (3) gives the following expression, whose extremum is found with a Lagrange multiplier:

$$J_j = u_j^T S u_j \tag{4}$$

constructing the function:

$$F(u_j) = u_j^T S u_j + \lambda_j \left( 1 - u_j^T u_j \right)$$

solving

$$\frac{\partial F}{\partial u_j} = 0$$

gives:

$$S u_j = \lambda_j u_j$$

when $u_j$ and $\lambda_j$ are respectively an eigenvector and the corresponding eigenvalue of the covariance matrix S, $J_j$ has an extremum; substituting this result into formula (4) gives:

$$J_j = u_j^T \lambda_j u_j = \lambda_j$$

the eigenvalues are sorted from large to small and the eigenvectors corresponding to the first k eigenvalues are taken to obtain a new k-dimensional coordinate system P; for any orthogonal basis vector meeting the condition, the variance of the data after projection is the corresponding eigenvalue of the matrix S, so that:

$$J_{max} = \sum_{j=1}^{k} \lambda_j$$

where the eigenvalues are ordered from large to small:

$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k$$

and the orthogonal basis used for projection consists of the eigenvectors of S corresponding to the k largest eigenvalues;

according to the relation between eigenvectors and the SVD, substituting the eigenvectors obtained from the matrix S gives the new orthogonal basis that maximizes the spread of the data after projection:

$$P = \{u_1, u_2, \dots, u_k\}$$

the matrix is mapped into the new coordinate system, reducing the matrix of n rows and m columns to a matrix of k rows and m columns;
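The dimension-reduction steps above can be illustrated with a small pure-Python sketch that extracts only the first principal component. Two simplifications are assumptions of this example and not part of the patent: samples are stored as rows (the transpose of the matrix X described above), and power iteration stands in for the SVD step. Power iteration converges to the dominant eigenvector only if the start vector is not orthogonal to it.

```python
import math

def center(samples):
    """Subtract the per-feature mean so that every feature has zero mean."""
    m, n = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / m for j in range(n)]
    return [[s[j] - means[j] for j in range(n)] for s in samples]

def covariance(centered):
    """Covariance matrix S = (1/m) * sum of x_i x_i^T over centered samples."""
    m, n = len(centered), len(centered[0])
    return [[sum(s[a] * s[b] for s in centered) / m for b in range(n)] for a in range(n)]

def top_component(S, iterations=200):
    """Dominant eigenvector u and eigenvalue lam of S by power iteration."""
    n = len(S)
    u = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        v = [sum(S[a][b] * u[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            break
        u = [x / norm for x in v]
    # Rayleigh quotient u^T S u equals the projected variance J for unit u.
    lam = sum(u[a] * sum(S[a][b] * u[b] for b in range(n)) for a in range(n))
    return u, lam

def project(centered, u):
    """Project every centered sample onto the unit vector u."""
    return [sum(x * w for x, w in zip(s, u)) for s in centered]
```

For data lying on a straight line, the variance of the projected values equals the returned eigenvalue, which matches the relation between the projected variance and the eigenvalue derived above.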

further, when the task type is verification, the following verification method is further included:

record count verification: comparing the number of records synchronized for the full table or the increment;

total value verification: comparing the totals of fields such as amount and quantity for the full table or the increment;

check code verification: comparing the MD5 values of the table records synchronized for the full table or the increment.
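The three checks above can be sketched together. This is a minimal example that assumes rows are available as dictionaries on both sides; the field name passed as `amount_field`, the canonical text form used for hashing, and the function names are choices made for the illustration.

```python
import hashlib
from decimal import Decimal

def summarize(rows, amount_field):
    """Return (record count, total of amount_field, MD5 over all rows)."""
    digest = hashlib.md5()
    # Hash rows in a canonical sorted order so that row order does not matter.
    for line in sorted(repr(sorted(r.items())) for r in rows):
        digest.update(line.encode("utf-8"))
    # Decimal avoids floating-point drift when summing monetary values.
    total = sum((Decimal(str(r[amount_field])) for r in rows), Decimal("0"))
    return len(rows), total, digest.hexdigest()

def verify(source_rows, target_rows, amount_field):
    """Compare record count, total value and check code of both sides."""
    s = summarize(source_rows, amount_field)
    t = summarize(target_rows, amount_field)
    return {
        "record_count": s[0] == t[0],
        "total": s[1] == t[1],
        "check_code": s[2] == t[2],
    }
```

The record count and total are cheap and catch most gross errors; the MD5 comparison is the stricter check because it also detects changes that leave the count and total unchanged.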

Further, in step S4, the method further includes: when the task type is verification, and when the synchronous data in the target database is verified to be inaccurate, the data in the target database are re-synchronized, specifically:

the central scheduling service generates a corresponding SQL statement according to the extraction mode and sends it to the data extraction end, and the data extraction end executes the SQL statement to synchronize the data again.

Further, the task information also includes a task state;

the task state marks the progress and completion status of the data synchronization or verification task;

when the task state is failure, the central scheduling service re-initiates the task to perform data synchronization.
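The re-initiation on failure can be sketched as a small retry loop. The state names, the `execute` callable and the `max_attempts` limit are assumptions of this example; the text above only states that a failed task is re-initiated.

```python
from enum import Enum

class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

def run_with_retry(execute, max_attempts=3):
    """Call execute() until it does not report FAILED or attempts run out.

    `execute` is a hypothetical callable that runs one synchronization task
    and returns its final TaskState. Returns (final_state, attempts_used).
    """
    state = TaskState.PENDING
    for attempt in range(1, max_attempts + 1):
        state = execute()
        if state is not TaskState.FAILED:
            return state, attempt
    return state, max_attempts
```

Bounding the number of attempts is a design choice of the sketch, intended to keep a permanently failing source from blocking the scheduler.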

Further, the data extraction end communicates with the central dispatching service and the data processing end through a message application service.

A system for performing the above-described differentiated synchronization method applied to store operations, comprising:

a data extraction end establishing module, used for establishing a data extraction end for extracting data in a source database, wherein when the data extraction work in the source database is started, the data extraction end registers with a central scheduling service, and the central scheduling service is used for scheduling the data extraction of a plurality of groups of source databases;

a task information generation module, used for generating, after the central scheduling service receives registration information of the data extraction end, task information comprising a task type and an extraction mode, sending the task information to the data extraction end, and starting a data processing end for executing a data processing task after the data extraction end extracts data in the source database, wherein the task type comprises data synchronization and verification, and the extraction mode is an adapted extraction mode selected according to the form of each table, including full-table synchronization, single-table incremental synchronization and slave-table incremental synchronization;

a data extraction module, used for, after the data extraction end receives the task information, executing the task according to the task type and the extraction mode, encapsulating the extracted data into a data packet and sending the data packet to the data processing end, wherein when the task is executed according to the task type and the extraction mode, optimal synchronization task parameters are calculated using a decision tree according to the performance of the synchronization task host, and, on the premise of not affecting the operation of the host or of other applications, the utilization of system resources is improved and the overall task time is reduced by raising concurrency, increasing the batch data size and optimizing task queue allocation;

a data processing module, used for processing the data packet after the data processing end parses it, wherein when the task type is data synchronization, the data in the data packet is synchronized into a target database, and when the task type is verification, the synchronized data is pulled from the target database and compared with the data in the data packet to verify the accuracy of the synchronized data in the target database.

An electronic device comprising a processor and a memory, wherein at least one instruction, at least one program, code set, or instruction set is stored in the memory, and wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method described above.

A computer readable storage medium storing computer code which, when executed, performs a method as described above.

Compared with the prior art, the invention has at least one of the following beneficial effects:

(1) By providing a differentiated synchronization method applied to store operations, the method comprises the following steps:

s1: a data extraction end for extracting data in a source database is established, and when the data extraction work in the source database is started, the data extraction end registers with a central dispatching service, wherein the central dispatching service is used for dispatching the data extraction of a plurality of groups of source databases; s2: when the central scheduling service receives registration information of the data extraction end, generating task information comprising a task type and an extraction mode, and sending the task information to the data extraction end, and starting a data processing end for executing a data processing task after the data extraction end extracts data in the source database, wherein the task type comprises data synchronization and verification, and the extraction mode comprises a full table synchronization mode, a single table increment synchronization mode and an adaptive extraction mode selected according to different table forms and including the table increment synchronization mode;

S3: after receiving the task information, the data extraction end executes tasks according to the task types and the extraction modes, packages the extracted data into data packets and sends the data packets to the data processing end, wherein when the tasks are executed according to the task types and the extraction modes, optimal synchronous task parameters are calculated by adopting decision trees according to the performance of a synchronous task host machine, and the utilization rate of system resources is improved and the time consumption of the whole task is reduced by improving concurrency and batch data size and optimizing task queue allocation on the premise that the operation of the host machine and the operation of other applications are not influenced; s4: the data processing end processes the data packet after analyzing the data packet; and when the task type is data synchronization, synchronizing the data in the data packet into a target database, and when the task type is verification, pulling the synchronized data in the target database, and comparing the synchronized data with the data in the data packet to verify the accuracy of the synchronized data in the target database. According to the technical scheme, different extraction modes are selected for synchronization according to different form forms, so that each synchronization can be complete, accurate and efficient.

(2) And checking the data with different updating frequencies by selecting different checking schemes. The efficiency of the verification work is ensured, and the accuracy of the synchronized data is ensured.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.

FIG. 1 is an overall flow chart of a differentiated synchronization method for store operation according to the present invention;

FIG. 2 is a flow chart of the present invention for calculating optimal sync task parameters through a decision tree;

FIG. 3 is a schematic diagram of a time-consuming record of the same batch of tasks performed on the same host machine according to the present invention;

FIG. 4 is a diagram of a table structure with a relatively large data size and a large number of data fields according to the present invention;

FIG. 5 is a schematic diagram of the instant quick verification of the present invention;

FIG. 6 is a schematic diagram of data added by PCA dimension reduction calculation in a first embodiment of the present invention;

FIG. 7 is a diagram showing the result of PCA dimension reduction calculation in the first embodiment of the present invention;

FIG. 8 is a time-consuming comparison of the full field participation verification and the principal component verification of the present invention;

FIG. 9 is a block diagram of a differentiated synchronization method applied to store operations in accordance with the present invention;

FIG. 10 is an overall structure diagram of a differentiated synchronization system applied to store operation according to the present invention.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The invention adopts a differentiated synchronization mode to improve synchronization efficiency, approached from the following two aspects:

1. "Source saving" (reducing the data to be moved): from the viewpoint of synchronization task configuration, one of the following three modes is selected according to how frequently the table data is updated, so that the amount of data to be synchronized is reduced as far as possible and synchronization efficiency is improved.

(1) Full-table synchronization, suitable for a large table with a low synchronization frequency or a small table with a relatively high synchronization frequency;

(2) single-table incremental synchronization, suitable for synchronizing data that has changed recently;

(3) slave-table incremental synchronization, in which the slave table achieves incremental synchronization by using the last-update-time field of the master table.

2. "Opening the flow" (raising throughput): optimal synchronization task parameters are calculated using a decision tree according to the performance of the synchronization task host, and, on the premise of not affecting the operation of the host or of other applications, the utilization of system resources is improved and the overall task time is reduced by raising concurrency, increasing the batch data size and optimizing task queue allocation.

The following is described by way of specific examples:

first embodiment

As shown in fig. 1, the present embodiment provides a differentiated synchronization method applied to store operations, including the following steps:

S1: a data extraction end for extracting data in a source database is established, and when the data extraction work in the source database is started, the data extraction end registers with a central dispatching service, wherein the central dispatching service is used for dispatching a plurality of groups of data extraction of the source database.

Specifically, before extracting the data in each source database, a data extraction end corresponding to the source database needs to be established. And the data extraction work can be started only after registration is performed in the central dispatching service.

The registration information of the data extraction end in the central scheduling service comprises the IP address of the data extraction end, the name of the source database, the state, and the name of the data pipeline used (topic).

The central scheduling service can simultaneously perform data synchronization service for a plurality of groups of source databases. Before the data extraction end is registered in the central scheduling service, the central scheduling service does not know the existence of the data extraction end, and the corresponding data processing end is not started. The purpose of registration is to let the central dispatch service know that there is an extraction end to start working, and that a data processing end needs to be started to process data according to registration information. This facilitates more flexible scheduling.

S2: After the central scheduling service receives the registration information of the data extraction end, it generates task information comprising a task type and an extraction mode, sends the task information to the data extraction end, and starts a data processing end for executing data processing tasks after the data extraction end extracts data from the source database, wherein the task type comprises data synchronization and verification, and the extraction mode is an adapted extraction mode selected according to the form of each table, including full-table synchronization, single-table incremental synchronization and slave-table incremental synchronization.

Specifically, after the central scheduling service receives the registration information sent by the data extraction end, it generates a piece of task information and sends it to the data extraction end. The data extraction end can extract data only after receiving the task information; otherwise, the extracted data would not subsequently be processed by the data processing end.

The task information comprises a task type and an extraction mode. The task type determines whether the data extracted this time is used for data synchronization or for verification, and the data is processed differently in the subsequent data processing end depending on the task type. The extraction mode determines, according to the form of the table, which mode is used to extract the data; in the subsequent data synchronization, the data in the target database is updated in the synchronization mode matching the extraction mode.
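The registration and task-information exchange of S1 and S2 can be sketched with two small data classes. The class names, field names and the in-memory `CentralScheduler` are hypothetical and only illustrate the flow; in particular, adding the topic to a set is a stand-in for starting a real data processing end.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Registration:
    ip: str          # IP address of the data extraction end
    source_db: str   # name of the source database
    state: str       # state reported by the extraction end
    topic: str       # name of the data pipeline used

@dataclass(frozen=True)
class TaskInfo:
    task_id: int
    task_type: str        # "sync" or "verify" (hypothetical labels)
    extraction_mode: str  # "full", "single_incremental" or "slave_incremental"
    topic: str

class CentralScheduler:
    def __init__(self):
        self._ids = count(1)
        self.extractors = {}     # source_db -> Registration
        self.processors = set()  # topics for which a processing end was started

    def register(self, reg, task_type="sync", extraction_mode="full"):
        """Record the extraction end, start a processing end, return task info."""
        self.extractors[reg.source_db] = reg
        self.processors.add(reg.topic)  # stand-in for starting a processing end
        return TaskInfo(next(self._ids), task_type, extraction_mode, reg.topic)
```

Because the processing end is started as a side effect of registration, the scheduler never runs a processing end for a source that has not announced itself.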

Specifically, extraction and synchronization are implemented as follows: JDBC is used to connect to the source database, and a different extraction mode is selected for each scenario to generate the corresponding SQL, which fetches the data from the tables of the source database. After the data is received on the target side, the loading SQL is generated according to the synchronization mode that corresponds to the extraction mode and is executed in the target database.

S3: After receiving the task information, the data extraction end executes the task according to the task type and the extraction mode, encapsulates the extracted data into a data packet, and sends the data packet to the data processing end. When the task is executed according to the task type and the extraction mode, optimal synchronization task parameters are calculated using a decision tree according to the performance of the synchronization task host, and, on the premise of not affecting the operation of the host or of other applications, the utilization of system resources is improved and the overall task time is reduced by raising concurrency, increasing the batch data size and optimizing task queue allocation.

Specifically, after the data extraction end receives the task information, the data needs to be packaged to form a data packet, so that the data is convenient to transmit.

An example of how a data packet is assembled is as follows:

The field names and field types of the extracted data are spliced as key-value pairs to generate the field object of the data; the field object is then spliced with the table type and the table name to generate the data packet. The splicing format is not limited, provided that it is convenient for transmission. An example splicing format is: table type-! @ is-! table name-! @ is-! serialized key-value pairs of fields and field types.
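The packet assembly described above can be sketched as follows. Since the text leaves the splicing format open, the separator `!@#!` and the use of JSON for the serialized field object are assumptions of this example, as are the function names.

```python
import json

DELIMITER = "!@#!"  # hypothetical separator; the splicing format is not fixed by the text

def encode_packet(table_type, table_name, fields):
    """Splice table type, table name and the serialized field object."""
    body = json.dumps(fields, ensure_ascii=False, sort_keys=True)
    return DELIMITER.join([table_type, table_name, body])

def decode_packet(packet):
    """Restore (table_type, table_name, fields) from a spliced packet."""
    # maxsplit=2 keeps the body intact even if it happens to contain the separator.
    table_type, table_name, body = packet.split(DELIMITER, 2)
    return table_type, table_name, json.loads(body)
```

A table type or table name that itself contains the separator would break decoding, so a real format would need escaping or length prefixes.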

Further, as shown in fig. 2, when executing the task according to the task type and the extraction mode, according to the performance of the synchronous task host, a decision tree is adopted to calculate the optimal synchronous task parameter, and on the premise of not affecting the operation of the host and the operation of other applications, the utilization rate of system resources is improved by improving concurrency, improving the batch data size and optimizing task queue allocation, and the time consumption of the whole task is reduced, specifically:

according to the use condition of the current system and the historical task log, calculating and outputting task configuration of the batch of tasks including optimal concurrence number, batch data size and task queue allocation, collecting information such as resource occupation and time consumption of task execution, and storing the information in a task log library, and providing basis for next decision.

Wherein, for the collection of parameters, the following manner can be adopted:

for system information, it is generally available via Linux system commands, for example:

Obtaining the number of CPU cores: cat /proc/cpuinfo | grep "processor" | wc -l

Obtaining CPU utilization and I/O status: vmstat

Obtaining memory usage: free

The table-related information is obtained by database query.
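The host-side collection can also be done from inside the synchronization process instead of through shell commands. The following minimal sketch uses only the Python standard library; the metric names in the returned dictionary are chosen for this example, and the load average and `/proc/meminfo` values are simply omitted on platforms that do not provide them.

```python
import os

def collect_host_metrics():
    """Collect a few host metrics that could serve as decision-tree inputs."""
    metrics = {"cpu_cores": os.cpu_count()}
    # os.getloadavg is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    # /proc/meminfo exists on Linux only; its values are reported in kB.
    try:
        with open("/proc/meminfo") as fh:
            for line in fh:
                name, _, value = line.partition(":")
                if name in ("MemTotal", "MemAvailable"):
                    metrics[name] = int(value.split()[0])
    except OSError:
        pass
    return metrics
```

Network and disk I/O figures are left out of the sketch because the standard library offers no portable call for them.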

Further, a decision tree for calculating the optimal synchronous task parameters is established, specifically:

The configuration decision tree is built using the C4.5 algorithm, with the task log library as the training set. (C4.5 is an algorithm developed by Ross Quinlan for generating decision trees, as an extension of his earlier ID3 algorithm; the decision trees it generates can be used for classification, so the algorithm is also used for statistical classification.) C4.5 is characterized by using the information gain rate as the splitting criterion. In this system, the information gain rate of each attribute in the task execution logs is calculated, and the attribute with the highest information gain rate is selected for splitting, specifically comprising the following steps:

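calculating the information entropy of the whole data set D:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

wherein C_k represents the kth class of task configuration;

calculating the conditional entropy for each attribute A, where D_i is the subset of D in which A takes its ith value:

$$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$

calculating the information gain of each attribute A:

$$\mathrm{Gain}(D, A) = H(D) - H(D|A)$$

calculating the information gain rate:

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$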
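
The attribute selection step can be sketched as follows, using the standard C4.5 definitions: entropy H(D), conditional entropy H(D|A), gain H(D) - H(D|A), and gain rate equal to the gain divided by the entropy of the attribute's own value distribution. This is a minimal illustration, not the patented implementation. It assumes each task log entry is a dictionary whose attribute values are already discretized, and the names `entropy`, `gain_ratio` and `best_attribute` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attribute, label_key):
    """Information gain rate of `attribute` with respect to the label column."""
    labels = [row[label_key] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label_key])
    total = len(rows)
    # H(D|A): entropy of the labels inside each attribute value, weighted by size
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # H_A(D): entropy of the attribute's own value distribution
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, attributes, label_key):
    """Pick the attribute with the highest information gain rate."""
    return max(attributes, key=lambda a: gain_ratio(rows, a, label_key))
```

A full C4.5 implementation would also handle continuous attributes by searching for split thresholds and would recurse on each subset; both are omitted here to keep the selection criterion visible.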

Fig. 3 provides a specific application example. It records the time consumed by repeated executions of the same batch of tasks on the same host, on which other applications run simultaneously in order to simulate a real environment. The figure shows how the time consumption changes with each execution. At the beginning, the output task configuration is relatively conservative in order to avoid competing with other applications for resources, which results in longer execution times. Subsequent runs apply optimization strategies such as increasing the concurrency number, enlarging the batch size, and adjusting the task queue, which gradually shorten the ETL task time until it reaches a relatively stable state.

Specifically, after receiving the data packet, the data processing end parses the data packet and restores it to the original data structure. After parsing, it determines the task type of the current task and executes the data synchronization or verification operation.

Typically, the verification task is performed after the data is synchronized in order to verify whether the synchronized data is correct; when the data is incorrect, the central scheduling service reschedules the synchronization task.

For the data synchronization task, the corresponding sql statements are generated and executed in the target database, and the data are updated into the target database. For the verification task, the corresponding query and statistics sql statements are generated according to the specified verification mode and executed in the source database and the target database respectively, and the execution results are compared. Different verification modes and verification times can be selected flexibly according to scene requirements and server conditions, so the influence on server performance is small.
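
The dispatch on task type can be illustrated with a minimal sketch. The JSON packet layout (`task_type`, `table`, `columns`, `rows`), the function name `process_packet`, and the use of an SQLite connection as a stand-in for the target database are all assumptions made for this example. The verification branch follows the variant in which the synchronized data is pulled from the target database and compared with the data in the packet.

```python
import json
import sqlite3


def process_packet(target: sqlite3.Connection, packet: bytes) -> bool:
    """Parse a data packet, then synchronize or verify according to its task type."""
    task = json.loads(packet.decode("utf-8"))  # restore the original structure
    table, columns, rows = task["table"], task["columns"], task["rows"]
    # Identifiers are interpolated into SQL, so accept only plain names.
    if not all(name.isidentifier() for name in [table, *columns]):
        raise ValueError("unexpected identifier in packet")
    column_list = ", ".join(columns)

    if task["task_type"] == "sync":
        marks = ", ".join("?" for _ in columns)
        target.executemany(
            f"INSERT OR REPLACE INTO {table} ({column_list}) VALUES ({marks})", rows
        )
        target.commit()
        return True

    if task["task_type"] == "verify":
        stored = target.execute(f"SELECT {column_list} FROM {table}").fetchall()
        # Order-independent comparison of target rows with the packet rows.
        return sorted(stored) == sorted(tuple(row) for row in rows)

    raise ValueError(f"unknown task type: {task['task_type']}")
```

A `False` result from the verification branch is the signal on which the scheduling side would reschedule the synchronization task.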

(1) Full-table synchronization is suitable for tables with a small data volume (small tables) and for tables with a large data volume but a low synchronization frequency (large tables); all data in the table are synchronized;

(2) Single-table incremental synchronization is suitable for a single table with a large data volume; the newly added or updated data in the table are synchronized incrementally;

(3) Slave-table incremental synchronization is suitable for a slave table associated with a master table; the associated data in the slave table are synchronized incrementally according to the newly added or updated data in the master table.
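
The three extraction modes could map to extraction statements roughly as in the sketch below. The mode names, the `updated_at` change-tracking column, and the master/slave key parameters are hypothetical; the original does not specify how newly added or updated rows are detected.

```python
def build_extract_sql(mode, table, since=None, *, time_col="updated_at",
                      master=None, foreign_key=None, master_key="id"):
    """Return (sql, params) for one extraction mode.

    Table and column names are assumed to come from trusted task
    configuration, since they are interpolated into the statement.
    """
    if mode == "full":
        # Full-table synchronization: extract every row.
        return f"SELECT * FROM {table}", ()
    if mode == "single_incremental":
        # Single-table incremental: rows added or updated after `since`.
        return f"SELECT * FROM {table} WHERE {time_col} > ?", (since,)
    if mode == "slave_incremental":
        # Slave-table incremental: slave rows whose master row changed after `since`.
        sql = (
            f"SELECT s.* FROM {table} AS s "
            f"JOIN {master} AS m ON s.{foreign_key} = m.{master_key} "
            f"WHERE m.{time_col} > ?"
        )
        return sql, (since,)
    raise ValueError(f"unknown extraction mode: {mode}")
```

In the slave-table case the change marker is read from the master table, which is why a slave table without its own timestamp column can still be synchronized incrementally.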

Further, when the task type is verification, a suitable verification scheme is selected from instant quick verification, daily verification, weekly verification and dynamic verification according to the data size and the update frequency. Specifically:

(1) Instant quick verification, as shown in fig. 4, promptly verifies tables with a large data volume and many fields. If all fields were included in the verification calculation in the conventional way, a great deal of time and computing resources would be consumed;

(2) Daily verification applies to time-incremental data; each run verifies the data that changed within one day;

(3) Weekly verification applies to time-incremental data; each run verifies the data that changed within one week;

(4) Dynamic verification splits the data to be verified into a plurality of data segments at fixed time intervals and verifies each segment separately.
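
For dynamic verification, the splitting into fixed-interval segments can be sketched as below. The function name and the half-open `(start, end)` representation of a segment are assumptions made for this example.

```python
from datetime import datetime, timedelta


def split_into_segments(start: datetime, end: datetime, interval: timedelta):
    """Split [start, end) into consecutive segments of at most `interval` length."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    segments = []
    cursor = start
    while cursor < end:
        upper = min(cursor + interval, end)  # the last segment may be shorter
        segments.append((cursor, upper))
        cursor = upper
    return segments
```

Each returned pair can then bound the time column in a verification query, so that a mismatch is localized to one segment instead of the whole table.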

As shown in fig. 5, the instant quick verification specifically includes:

The dimension of the wide table is reduced by a PCA algorithm: principal component data amounting to 5%-20% of the original is extracted, which retains more than 95% of the information in the original data. MD5 values are then calculated on the principal component data and synchronized to the target library together with the original data. The target library calculates the MD5 value of the principal component data by the same logic and compares the two values; if the MD5 values are consistent, the instant quick verification passes.

Specifically, the source library end transmits the extracted data over a JDBC connection, and a verifier processes the batch of data to generate the corresponding verification code. After the target end obtains the data, verification is performed again, and the two verification codes are compared to judge whether the batch of data was transmitted accurately; follow-up remedial measures are then selected so that the consistency of the data is ensured. The verifier is implemented mainly with the PCA dimension reduction algorithm, which aims to express data having many features with data having fewer features, that is, data compression. After the main features of the data are obtained, MD5 processing is performed to generate the verification code, which improves the processing speed. The idea of the algorithm is to maximize the variance of the projected data, obtain the optimal coordinate basis for the data matrix, and then project the data to reduce the matrix dimension. The implementation principle is as follows:
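
A minimal, dependency-free sketch of this check is given below. It is an illustration under stated assumptions and not the patented verifier: the numeric fields of a batch are assumed to be available as rows of floats in the same order on both sides, the principal directions are found by plain power iteration with deflation, and the projected values are rounded to a fixed number of decimals (a hypothetical choice) before hashing so that both ends serialize them identically. All function names are invented for this example.

```python
import hashlib


def _center(rows):
    """Subtract the column means so every column has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def _covariance(centered):
    """Covariance matrix S of zero-mean rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def _top_eigenvectors(cov, k, iterations=500):
    """Leading k eigenvectors by power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]
    vectors = []
    for _ in range(k):
        v = [1.0] * d  # fixed start vector keeps the result deterministic
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        vectors.append(v)
        # Deflate: remove the component that was just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return vectors


def principal_component_md5(rows, k, decimals=6):
    """MD5 of the batch projected onto its k leading principal components."""
    centered = _center(rows)
    components = _top_eigenvectors(_covariance(centered), k)
    projected = [[sum(r[j] * u[j] for j in range(len(u))) for u in components]
                 for r in centered]
    text = ";".join(",".join(f"{x:.{decimals}f}" for x in row) for row in projected)
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

The source end and the target end would each call `principal_component_md5` on their copy of the batch and compare the two digests. Because only the leading components are hashed, a change confined to the discarded directions may go undetected, which is the price paid for hashing far fewer values per row.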

initializing a matrix X by 0 mean value, and applying feature scaling to-0.5;

so that:

calculating covariance matrix

construction function:

solving for

Obtaining:

when u_j and λ_j are respectively an eigenvector and the corresponding eigenvalue of the covariance matrix S, J_j has an extremum; substituting the above result into formula (4) gives:

wherein λ is sorted from large to small

P = {u_1, u_2, …, u_k}
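
The steps above match the standard maximum-variance derivation of PCA, which can be sketched as follows. The symbols m (number of rows), n (number of fields) and x^{(i)} (the ith row of X written as a column vector), as well as the exact form of each equation, are assumptions based on that standard derivation rather than the original formulas.

```latex
% Zero-mean data
\frac{1}{m}\sum_{i=1}^{m} x^{(i)} = 0

% Covariance matrix
S = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top}

% Projected variance along a unit vector u_j, with a Lagrange multiplier
J_j = u_j^{\top} S u_j + \lambda_j \left(1 - u_j^{\top} u_j\right)

% Stationary point
\frac{\partial J_j}{\partial u_j} = 2 S u_j - 2 \lambda_j u_j = 0
\quad\Longrightarrow\quad S u_j = \lambda_j u_j

% Value of the objective at the stationary point
J_j = u_j^{\top} S u_j = \lambda_j u_j^{\top} u_j = \lambda_j

% Keep the k directions with the largest eigenvalues
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n, \qquad P = \{u_1, u_2, \dots, u_k\}
```

Under these assumptions the projected variance along u_j equals λ_j, so keeping the eigenvectors with the largest eigenvalues retains the most variance for a given k.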

For example, PCA dimension reduction is performed on a table with 91 fields; the raw data are shown in fig. 6 and the calculated result in fig. 7.

Performance verification: fig. 8 compares the time consumed in synchronizing the same batch of tables using the two instant verification modes. When the number of table fields is small, the two modes take roughly the same time; as the number of table fields increases, the efficiency advantage of verifying only the principal components becomes prominent, and efficiency is clearly improved compared with the conventional mode.

(1) Record count verification: the number of records is compared for full-table or incremental synchronization;

(2) Total value verification: the totals of fields such as amount and quantity are compared for full-table or incremental synchronization;

(3) Check code verification: the MD5 values of the table records are compared for full-table or incremental synchronization.
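
The three verification means can be illustrated with the sketch below, which uses SQLite connections as stand-ins for the source and target databases. The function names, the `repr`-based row serialization fed to MD5, and the assumption that table and column names come from trusted configuration are all choices made for this example.

```python
import hashlib
import sqlite3


def record_count(conn: sqlite3.Connection, table: str) -> int:
    """Number of records in the table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def column_total(conn: sqlite3.Connection, table: str, column: str):
    """Total of one numeric field such as amount or quantity."""
    return conn.execute(f"SELECT COALESCE(SUM({column}), 0) FROM {table}").fetchone()[0]


def table_md5(conn: sqlite3.Connection, table: str, key: str) -> str:
    """MD5 over all rows, read in key order so both sides hash the same sequence."""
    digest = hashlib.md5()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def verify(source, target, table, key, total_columns):
    """Run the three checks and report which of them pass."""
    return {
        "record_count": record_count(source, table) == record_count(target, table),
        "totals": all(
            column_total(source, table, c) == column_total(target, table, c)
            for c in total_columns
        ),
        "check_code": table_md5(source, table, key) == table_md5(target, table, key),
    }
```

The checks are ordered from cheapest to most thorough: two rows whose amounts were swapped still pass the count and total checks, and only the check code reveals the difference.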

Further, in step S4, the method further includes: when the task type is verification and the synchronized data in the target database is found to be inaccurate, the data in the target database is resynchronized. Specifically, the central scheduling service generates the corresponding sql statement according to the extraction mode and sends it to the data extraction end, and the data extraction end executes the sql statement to synchronize the data again.

Specifically, an adaptive synchronization mode is formed according to the differentiated synchronization modes and their characteristics, and each synchronization mode has data verification and repair functions. This greatly improves the efficiency and accuracy of data synchronization and lays a good foundation for a big data system that synchronizes large amounts of data around the clock.

If a difference is found, the central dispatching service generates the specified sql according to the verification mode and sends it to the extraction end, and the extraction end executes the sql to resynchronize the data.

The re-extraction process generally has two modes:

(1) Full-table verification: the full-table data is re-extracted.

(2) Incremental verification: the time period in which the difference occurs is identified, and the data in that time period is resynchronized.
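
The two re-extraction modes can be sketched as follows. The example assumes that verification has already produced one digest per time period on each side, keyed by a `(start, end)` pair, and that rows carry an `updated_at` column; these names and the digest-per-period layout are hypothetical.

```python
def differing_periods(source_digests: dict, target_digests: dict) -> list:
    """Time periods whose digests differ or that exist on one side only."""
    periods = sorted(set(source_digests) | set(target_digests))
    return [p for p in periods if source_digests.get(p) != target_digests.get(p)]


def plan_reextraction(mode, table, source_digests=None, target_digests=None,
                      time_col="updated_at"):
    """Return a list of (sql, params) statements for the extraction end to run."""
    if mode == "full":
        # Full-table mode: re-extract everything.
        return [(f"SELECT * FROM {table}", ())]
    if mode == "incremental":
        # Incremental mode: re-extract only the periods that differ.
        sql = f"SELECT * FROM {table} WHERE {time_col} >= ? AND {time_col} < ?"
        return [(sql, period)
                for period in differing_periods(source_digests, target_digests)]
    raise ValueError(f"unknown re-extraction mode: {mode}")
```

An empty plan in incremental mode means that every period matched and nothing needs to be resynchronized.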

Further, the task information also includes: task state;

Second embodiment

The steps of this embodiment are basically the same as those of the first embodiment, except that the data extraction end communicates with the central scheduling service and the data processing end through a message application service.

As shown in fig. 9, a specific implementation of the differentiated synchronization method that communicates through the message application service is as follows:

The network environment uses 100 Mbps Ethernet, and the apparatus comprises a source database 1, a data extraction end 2, a message application service 3 (with a first service message pipeline 31 and a second data message pipeline 32), a central dispatching service 4 (with a database 41) and a data processing end 5.

The operating platform is Linux. The message application service 3, the central dispatching service 4 and the data processing end 5 must be in the same network segment, so the message application service 3 is equipped with two 100 Mbps network cards: one monitors data packets from the external network, and the other connects to the central dispatching service 4 and the data processing end 5 in the internal network environment, thereby ensuring orderly transmission of data between the internal and external network environments.

The data extraction end 2 first establishes a network connection with the message application service 3 and then sends a registration message to the first service message pipeline 31. The central dispatching service 4 receives the registration message from the first service message pipeline 31, parses it, applies to create a data table for the data extraction end 2 in the database 41, and sends task information to the first service message pipeline 31. The data extraction end 2 receives the task message, parses the task, and performs tasks such as extraction or verification. The data processing end 5 continuously monitors the second data message pipeline 32 and performs the data processing task after receiving a data packet.
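
The message flow can be traced with a small single-process simulation in which two in-memory queues stand in for the message pipelines. The message fields, the extractor identifier `store-001`, and the step in which the extraction end publishes its data packet to the second pipeline are assumptions inferred from the roles described above, not details stated in the original.

```python
import queue


def run_round(source_rows):
    """Simulate one register, task, extract, process round trip."""
    service_pipe = queue.Queue()  # stands in for the first service message pipeline 31
    data_pipe = queue.Queue()     # stands in for the second data message pipeline 32
    registry = {}                 # stands in for database 41

    # Data extraction end: register with the central dispatching service.
    service_pipe.put({"kind": "register", "extractor_id": "store-001"})

    # Central dispatching service: record the extraction end, then issue a task.
    message = service_pipe.get_nowait()
    registry[message["extractor_id"]] = "registered"
    service_pipe.put({"kind": "task", "task_type": "sync",
                      "mode": "full", "table": "sales"})

    # Data extraction end: receive the task, extract, publish a data packet.
    task = service_pipe.get_nowait()
    data_pipe.put({"task": task, "rows": list(source_rows)})

    # Data processing end: pick up the packet from the data pipeline.
    packet = data_pipe.get_nowait()
    return registry, packet
```

In a deployment the four roles would run as separate processes attached to a message broker; the sequential version only makes the order of messages explicit.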

Third embodiment

As shown in fig. 10, the present embodiment provides a system for performing the differentiated synchronization method applied to store operations as in the first embodiment, including:

the data extraction end establishing module 1, used for establishing a data extraction end for extracting data from a source database; when data extraction from the source database starts, the data extraction end registers with a central dispatching service, wherein the central dispatching service is used for dispatching the data extraction of a plurality of groups of source databases;

the task information generating module 2, used for generating task information including a task type and an extraction mode after the central dispatching service receives the registration information of the data extraction end, sending the task information to the data extraction end, and starting a data processing end for executing a data processing task after the data extraction end extracts data from the source database, wherein the task type includes data synchronization and verification, and the extraction mode is an adapted extraction mode selected according to the different table forms, including full-table synchronization, single-table incremental synchronization and slave-table incremental synchronization;

the data extraction module 3, used for the data extraction end to execute the task according to the task type and the extraction mode after receiving the task information, encapsulate the extracted data into a data packet, and send the data packet to the data processing end;

the data processing module 4, used for the data processing end to parse the data packet and then process it: when the task type is data synchronization, the data in the data packet is synchronized into a target database; when the task type is verification, the synchronized data in the target database is pulled and compared with the data in the data packet to verify the accuracy of the synchronized data in the target database.

A computer device comprising a memory and one or more processors, the memory having stored therein computer code that, when executed by the one or more processors, causes the one or more processors to perform the method of the first embodiment.

A computer readable storage medium storing computer code which, when executed, performs a method as described above. Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer readable storage medium, and the storage medium may include: Read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disk, and the like.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

It should be noted that the above embodiments can be freely combined as needed.

The software program of the present invention may be executed by a processor to perform the steps or functions described above. Likewise, the software programs of the present invention (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, a magnetic or optical drive, a diskette and the like. In addition, some of the steps or functions of the present invention may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various functions or steps. The methods disclosed in the embodiments of the present specification may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and it may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of this specification. A general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present specification may be embodied directly in hardware, in a decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, or registers. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.

Embodiments also provide a computer readable storage medium storing one or more programs that, when executed by an electronic system comprising a plurality of application programs, cause the electronic system to perform the method of embodiment one, which will not be described in detail here.

Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

Furthermore, portions of the present invention may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present invention by way of operation of the computer. Program instructions for invoking the inventive methods may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. An embodiment according to the invention comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the invention as described above.

Claims

1. A differentiated synchronization method applied to store operations, comprising the steps of:

Establishing a decision tree for calculating optimal synchronous task parameters, collecting parameters including CPU core number, CPU utilization rate, IO, network, memory, table field number and table field size of a host machine and data storage ETL when a task is started, and inputting the collected parameters into the decision tree;

2. The differentiated synchronization method applied to store operations according to claim 1, further comprising: according to the different forms of the tables in the source database, different extraction modes are selected to carry out data extraction and synchronization, specifically:

3. The differentiated synchronization method applied to store operations according to claim 1, wherein in step S3, a decision tree for calculating optimal synchronization task parameters is established, specifically:

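calculating the information entropy of the whole data set D:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

wherein C_k represents the kth class of task configuration;

calculating the conditional entropy for each attribute A, where D_i is the subset of D in which A takes its ith value:

$$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$

calculating the information gain of each attribute A:

$$\mathrm{Gain}(D, A) = H(D) - H(D|A)$$

calculating the information gain rate:

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$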

4. The differentiated synchronization method applied to store operations according to claim 1, wherein when the task type is verification, verification schemes including immediate quick verification, daily verification, weekly verification, dynamic verification are specifically:

5. The differentiated synchronization method applied to store operations according to claim 4, wherein the instant quick verification is specifically:

The dimension of the wide table is reduced by a PCA algorithm, 5% -20% of main component data are extracted, the contained information can reach more than 95% of the original data, then MD5 values are calculated on the main component data, and the main component data are synchronized to a target library together with the original data;

6. The differentiated synchronization method applied to store operations according to claim 5, wherein the dimension of the wide table is reduced by a PCA algorithm, the variance after the maximized data projection is calculated, the optimal data matrix is obtained, and then the data projection is performed to reduce the dimension of the matrix, specifically comprising the following steps:

initializing a matrix X by 0 mean value, and applying feature scaling to-0.5;

so that:

calculating covariance matrix

construction function:

solving for

Obtaining:

wherein λ is sorted from large to small

P = {u_1, u_2, …, u_k}

7. The differentiated synchronization method applied to store operations according to claim 1, wherein when the task type is verification, further comprising the following verification means:

8. The differentiated synchronization method applied to store operations according to claim 1, further comprising, in step S4: when the task type is verification, and when the synchronous data in the target database is verified to be inaccurate, the data in the target database are re-synchronized, specifically:

9. A system for performing the differentiated synchronization method applied to store operations of any one of claims 1-8, comprising:

10. A computer readable storage medium storing computer code which, when executed, performs the method of any one of claims 1 to 8.