CN112069160B - CAP-based data cleaning synchronization method - Google Patents

CAP-based data cleaning synchronization method

Info

Publication number
CN112069160B
CN112069160B CN202010896649.1A
Authority
CN
China
Prior art keywords
data
cleaning
message
retry
synchronization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010896649.1A
Other languages
Chinese (zh)
Other versions
CN112069160A (en)
Inventor
桑礼荣
李峰
李长江
查敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Huarui Information Co ltd
Original Assignee
Zhejiang Huarui Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Huarui Information Co ltd filed Critical Zhejiang Huarui Information Co ltd
Priority to CN202010896649.1A priority Critical patent/CN112069160B/en
Publication of CN112069160A publication Critical patent/CN112069160A/en
Application granted granted Critical
Publication of CN112069160B publication Critical patent/CN112069160B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue

Abstract

The invention relates to a data cleaning and synchronization method, in particular to a CAP-based data cleaning and synchronization method, belonging to the field of big data technology. The method comprises the following steps: pre-analysis and processing, data publishing, data subscription, local cleaning and storage, and business relation flow. The CAP-based data cleaning and synchronization method further ensures the consistency and efficiency of data synchronization.

Description

CAP-based data cleaning synchronization method
Technical Field
The invention relates to a data cleaning and synchronization method, in particular to a CAP-based data cleaning and synchronization method, belonging to the field of big data technology.
Background
As business grows, a single-node system can no longer meet service requirements, so distributed systems are used. A distributed system has the following basic characteristic: a group of independent computers is presented to the user as a unified whole, as if it were a single system. The system owns various general physical and logical resources, can dynamically allocate tasks to them, and the dispersed physical and logical resources exchange information through a computer network.
A distributed system is a loosely coupled system of multiple processors interconnected by communication lines. From the point of view of any one processor in the system, the other processors and their resources are remote, and only its own resources are local. To date, no unified definition of a distributed system has been agreed upon. It is generally accepted that a distributed system should have the following four features:
(1) Distribution: a distributed system consists of multiple computers that are geographically dispersed and may be spread across an organization, a city, a country, or even the whole world. The functions of the whole system are implemented in a distributed manner across the nodes, so data processing in a distributed system is itself distributed.
(2) Autonomy: each node in a distributed system contains its own processor and memory and has an independent data-processing capability. Nodes are generally equal in status, with no master/slave relationship; each can work autonomously, and they exchange information over shared communication lines to coordinate task processing.
(3) Parallelism: a large task can be divided into several sub-tasks, each executing on a different host.
(4) Globality: a single, global inter-process communication mechanism must exist in a distributed system, so that any process can communicate with any other process without distinguishing local from remote communication. There should also be a global protection mechanism, and a unified set of system calls adapted to the distributed environment must be available on all machines in the system. Running the same kernel on all CPUs makes coordination easier. Advantages of distributed systems:
(1) Resource sharing: different nodes are interconnected through a communication network, and a user on one node can use resources on other nodes. For example, a distributed system allows device sharing, so that many users can share expensive peripherals such as a color printer; it allows data sharing, so that many users can access a shared database; and remote files can be shared, remote special-purpose hardware can be used, and other operations performed.
(2) Faster computation: if a computing task can be divided into several sub-tasks that run in parallel, the sub-tasks can be distributed to different nodes and run on them simultaneously, increasing computing speed. A distributed system also supports computation migration: if the load on one node is too heavy, some jobs can be moved to other nodes for execution, relieving that node. This job migration is known as load balancing.
(3) High reliability: distributed systems are highly reliable. If one node fails, the remaining nodes can continue to operate, and the whole system will not collapse because of the failure of one or a few nodes. A distributed system therefore has good fault tolerance.
The system must be able to detect the failure of a node and take appropriate measures to recover from it. Once the system determines which node has failed, that node is no longer used to provide service until it resumes normal operation. If the functions of the failed node can be taken over by other nodes, the system must ensure that the transfer is carried out correctly. When the failed node is restored or repaired, the system must smoothly reintegrate it.
(4) Convenient and fast communication: the nodes in a distributed system are interconnected by a communication network composed of communication lines, modems, communication processors and the like, so users on different nodes can exchange information conveniently. At the lowest level, the systems communicate by passing messages.
Distributed systems provide remote communication between nodes and great convenience for information exchange between people. Users in different regions can jointly complete a project by transferring project files, sending e-mail, remotely logging in to each other's systems to run programs, and coordinating one another's work.
A system operating in a network environment has many nodes, and these nodes may become unstable for a variety of reasons. Here the CAP principle is a very important concept. It guides the design of most distributed systems. Roughly, the CAP principle states that a distributed system has three desirable properties, consistency, availability and partition tolerance, and that not all three can be satisfied at the same time in the design of a distributed system.
When building a distributed system under an SOA or microservice architecture, events are usually needed to integrate the services. In this process, simply using a message queue cannot guarantee eventual consistency of the data. CAP therefore adopts a local message table integrated with the current database to handle the exceptions that may arise in each link of the calls between distributed components, so that event messages are not lost under any circumstances.
A distributed system that must store data in different places faces the problems of data-transmission consistency, partition tolerance and availability. This system designs a CAP data synchronization and cleaning tool. The program is based on a C# library for .NET Standard that provides a solution for handling distributed transactions and also acts as an EventBus; it is lightweight, easy to use and high-performance. It avoids losing synchronized data or leaving data inconsistent when a single subsystem fails during synchronization; the synchronization software can be deployed in a distributed manner, and the service state is managed and monitored through the service-discovery tool Consul, so that the efficiency of data synchronization is maximized.
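By way of illustration only, the bootstrap of such a tool could look like the sketch below. It assumes the open-source DotNetCore.CAP package with its Entity Framework and RabbitMQ extensions and ASP.NET Core minimal hosting (.NET 6+); AppDbContext, the host names, credentials and the Alerting helper are assumptions made for this sketch, and exact option names may vary between CAP versions.

```csharp
// Program.cs of the synchronization service: a minimal bootstrap sketch.
using DotNetCore.CAP;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddDbContext<AppDbContext>();      // business database of the production system

builder.Services.AddCap(x =>
{
    // Local message table in the same database, so a business write and its
    // outgoing event message can be committed in a single transaction.
    x.UseEntityFramework<AppDbContext>();

    // RabbitMQ is the message pipeline between the production system and replica nodes.
    x.UseRabbitMQ(o =>
    {
        o.HostName = "rabbitmq.internal";
        o.UserName = "sync";
        o.Password = "change-me";
    });

    x.FailedRetryCount = 10;                    // customized retry strategy: up to 10 attempts
    x.SucceedMessageExpiredAfter = 24 * 3600;   // successful messages are purged after 24 hours
    x.FailedThresholdCallback = failed =>       // retries exhausted: notify the relevant personnel
        Alerting.Notify("A synchronization message has exhausted its retries."); // hypothetical helper
});

builder.Services.AddHostedService<TempTableCleaner>();  // daily cleanup job, sketched under step 7)

var app = builder.Build();
app.Run();
```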
Although a distributed system has many advantages, it also has drawbacks, mainly a shortage of available software: system software, programming languages, applications and development tools are relatively scarce. In addition, the communication network may become saturated or lose information, and there are network-security problems, since convenient data sharing also means that confidential data is easily stolen. Although distributed systems have these potential problems, their advantages far outweigh their shortcomings, and the shortcomings are being overcome. Distributed systems therefore remain a direction of research, development and application.
In a distributed system, N working nodes are connected together through a network and work cooperatively.
First, a complete copy of data X cannot be stored on only one node: once that node stops working for any reason and X can no longer be accessed, the system fails the availability requirement, and if the node can never be restored, X is lost forever. Therefore, several copies of data X should be stored on different nodes; the more copies are stored, the better the safety of X can be guaranteed, and X remains accessible even if several nodes are unavailable at the same time. This is the partition-tolerance requirement, and according to common experience the number of copies of X should be at least 3.
How, then, should these copies be updated when data X changes? Ideally, once a client has issued an update request for X, the latest state of X can be obtained by reading X from any node; this is the consistency requirement. Of course, this ideal effect is too theoretical.
A distributed system operating over a network is affected by many external factors:
What if a replica node cannot be reached during synchronization?
What if a replica node runs into a data deadlock at the same time?
If the theoretical consistency requirement is to be met, every client that needs to read or write data X can only wait until all copies of X have been synchronized before it receives a response; obviously, this does not satisfy the availability requirement.
Disclosure of Invention
The invention mainly overcomes the defects of the prior art. It addresses data synchronization in a distributed system, balances the relationship among the CAP properties, achieves efficient synchronization and cleaning of data, supports conventional structured data as well as document formats such as text and pictures, and introduces a message queue into the synchronization process so that data signals are exchanged in a publish-subscribe mode.
The technical problems of the invention are mainly solved by the following technical solution:
the CAP-based data cleaning synchronization method comprises the following steps:
(I) Pre-analysis and processing:
data synchronization in the distributed system balances the relationship among the CAP properties and achieves efficient synchronization and cleaning of data; the supported data formats include conventional structured data as well as document formats such as text and pictures; a message queue is introduced into the synchronization process so that data signals are exchanged in a publish-subscribe mode;
in the data publishing process, the data to be published is first buffered in a temporary table of the production system, and the data synchronization system sends the data from this temporary table to the queue database; this prevents publishing failures caused by network or queue-database faults during synchronization, and only active and changed data are synchronized during publishing. Replica nodes can set up different queues through the message exchange provided by the queue database, and each queue serves as a new signal source for a replica node. The acknowledgement mode of RabbitMQ is adopted: after data is received, the data to be processed is first stored in a temporary table of the replica node, and only after this store succeeds is the acknowledgement returned to complete the signal exchange of the message queue. The data in the replica node's temporary table is then cleaned immediately and stored once cleaning is complete; at the same time, the synchronization state of the data in the temporary table is updated through the generated data fingerprint. To keep data synchronization efficient, the system automatically cleans temporary-table data 24 hours after success and 30 days after failure;
data whose cleaning or synchronization fails is retried 3 times immediately; a customized retry strategy, which defaults to 10 attempts, is then applied; if it still fails, the synchronization system triggers a system alert and notifies the relevant personnel to make repairs;
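For illustration, the generation of the unique data fingerprint and the updating of the synchronization state in the temporary table could be sketched as follows; the SyncTempTable name and its columns (Fingerprint, State, UpdatedAt) are hypothetical stand-ins for the temporary table described above.

```csharp
// Fingerprint generation and synchronization-state bookkeeping for the temporary table.
using System;
using System.Security.Cryptography;
using System.Text;
using Microsoft.Data.SqlClient;

public static class SyncFingerprint
{
    // SHA-256 over the serialized record gives a stable identifier that the
    // publisher and every replica node can compute independently.
    public static string Compute(string payloadJson)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(payloadJson));
        return Convert.ToHexString(hash);
    }

    // Mark a temporary-table row as 'Succeeded' or 'Failed' after cleaning.
    public static void RecordState(SqlConnection conn, string fingerprint, string state)
    {
        const string sql = @"
            UPDATE SyncTempTable
               SET State = @state, UpdatedAt = SYSUTCDATETIME()
             WHERE Fingerprint = @fp";
        using var cmd = new SqlCommand(sql, conn);
        cmd.Parameters.AddWithValue("@state", state);
        cmd.Parameters.AddWithValue("@fp", fingerprint);
        cmd.ExecuteNonQuery();
    }
}
```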
(II) Data publishing:
when the service data is stored, the data is simultaneously published to the exchange of the message queue through the message-pipeline middleware; a third-party application system can customize several message queues according to its service requirements, and the exchange automatically distributes the data to these queues;
if the message pipeline or the network becomes abnormal, the data to be sent is stored locally instead, and the system waits for the message middleware to recover before automatically resuming the update and synchronization;
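A sketch of this transactional publishing step is shown below, again assuming the DotNetCore.CAP publisher (ICapPublisher) with Entity Framework Core; the Order entity, AppDbContext and the topic name are illustrative.

```csharp
// Transactional publishing: the business row and the event message are committed
// together through CAP's local message table; the BeginTransaction overload is the
// CAP extension for EF Core.
using System.Threading.Tasks;
using DotNetCore.CAP;
using Microsoft.EntityFrameworkCore;

public class OrderService
{
    private readonly AppDbContext _db;
    private readonly ICapPublisher _publisher;

    public OrderService(AppDbContext db, ICapPublisher publisher)
    {
        _db = db;
        _publisher = publisher;
    }

    public async Task SaveAndPublishAsync(Order order)
    {
        // Commits the business write and the outgoing message together, or rolls both back.
        using var tx = _db.Database.BeginTransaction(_publisher, autoCommit: true);

        _db.Orders.Add(order);
        await _db.SaveChangesAsync();

        // The exchange routes the message to every queue a replica node has bound.
        await _publisher.PublishAsync("sync.order.changed", order);
    }
}
```

Because the message is written to the local message table inside the same transaction as the business row, a broker or network outage only delays delivery; the stored message is forwarded once the middleware recovers, which corresponds to the local-storage fallback described above.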
(III) Data subscription:
multiple nodes subscribe to the data, and data in the same queue is consumed on a first-in, first-out basis; after the subscription data is obtained, it is stored locally first and a unique identification fingerprint is generated for subsequent cleaning and verification;
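A replica-node subscriber that follows this record-first, clean-later order might look like the following sketch; the topic name, ReplicaDbContext, SyncTempRow entity and CleaningPipeline helper are assumptions for illustration.

```csharp
// Replica-node subscriber: record first, clean later.
using System.Text.Json;
using System.Threading.Tasks;
using DotNetCore.CAP;

public class OrderSyncSubscriber : ICapSubscribe
{
    private readonly ReplicaDbContext _db;

    public OrderSyncSubscriber(ReplicaDbContext db) => _db = db;

    [CapSubscribe("sync.order.changed")]
    public async Task OnOrderChanged(Order order)
    {
        var payload = JsonSerializer.Serialize(order);
        var fingerprint = SyncFingerprint.Compute(payload);

        // 1. Store the raw payload and its fingerprint in the replica temporary table.
        _db.SyncTempRows.Add(new SyncTempRow { Fingerprint = fingerprint, Payload = payload, State = "Received" });
        await _db.SaveChangesAsync();

        // 2. Clean and store afterwards; an exception thrown here is intercepted by the
        //    middleware and handed to the retry strategy described in step 6).
        CleaningPipeline.CleanAndStore(new SyncTempRow { Fingerprint = fingerprint, Payload = payload });
    }
}
```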
(IV) Local cleaning and storage:
the data subscription terminal uses a synchronization plug-in and automatically subscribes to data according to authorized configuration such as the message-queue name; the subscribed data is first stored as a sample, and the relevant data cleaning is then performed according to the service requirements; during data processing, the synchronization middleware automatically intercepts exceptions and provides a retry function based on the exception information;
(V) Business relation flow:
1) Data publishing:
method interception is implemented through the system's injection/reflection interface, the method parameters and the corresponding text are extracted, and the data is prepared before publishing;
2) Data transmission:
before transmission, the network is checked and the state of the message exchange is detected; under abnormal conditions all records are retained for later transmission, and a unique fingerprint identifier is generated for each record;
3) Data exchange:
the plug-in that wraps RabbitMQ simplifies the creation of exchanges and message queues; throughout its use, only the relevant service logic needs to be written, and the message exchange and its bindings to the message queues are built automatically;
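For reference, what such a wrapper does underneath can be sketched directly against the RabbitMQ.Client 6.x API: declare a durable exchange, bind one queue per replica node, and acknowledge a delivery only after the local store succeeds, matching the flow of step (I). The exchange, queue and host names and the SaveToTemporaryTable helper are illustrative.

```csharp
// Raw RabbitMQ sketch of the exchange/queue setup and the store-then-ack ordering.
using System;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "rabbitmq.internal" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

channel.ExchangeDeclare("sync.exchange", ExchangeType.Fanout, durable: true);
channel.QueueDeclare("replica-node-1", durable: true, exclusive: false, autoDelete: false);
channel.QueueBind("replica-node-1", "sync.exchange", routingKey: "");

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (_, ea) =>
{
    var json = Encoding.UTF8.GetString(ea.Body.ToArray());
    SaveToTemporaryTable(json);                          // store in the replica temporary table
    channel.BasicAck(ea.DeliveryTag, multiple: false);   // ack only after the store succeeds
};
channel.BasicConsume("replica-node-1", autoAck: false, consumer);

Console.ReadLine();                                      // keep the consumer alive

static void SaveToTemporaryTable(string json)
{
    // Placeholder: INSERT the payload and its fingerprint into the replica temporary table.
}
```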
4) Subscription distributed deployment:
the service-discovery tool Consul maintains the deployment relationships of the system, automatically performs periodic health checks, and automatically distributes subscription tasks according to the health state of the system and the corresponding data-subscription processes;
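Registration of a subscriber instance with Consul, including the periodic health check, could be sketched as follows using the Consul .NET client package; the addresses, service names and the /health endpoint are assumptions.

```csharp
// Register a subscriber instance with Consul and attach an HTTP health check,
// so that unhealthy nodes are dropped and their subscription tasks redistributed.
using System;
using Consul;

using var consul = new ConsulClient(cfg => cfg.Address = new Uri("http://consul.internal:8500"));

var registration = new AgentServiceRegistration
{
    ID      = "data-sync-subscriber-1",
    Name    = "data-sync-subscriber",
    Address = "10.0.0.21",
    Port    = 5000,
    Check   = new AgentServiceCheck
    {
        HTTP                           = "http://10.0.0.21:5000/health",
        Interval                       = TimeSpan.FromSeconds(10),   // regular health detection
        DeregisterCriticalServiceAfter = TimeSpan.FromMinutes(1)     // drop nodes that stay unhealthy
    }
};

await consul.Agent.ServiceRegister(registration);
```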
5) Data cleaning:
after the subscribed data is obtained, the data fingerprint is verified according to the new service scenario, the service state of the data is retrieved, and the data is processed; the service code automatically performs exception interception and hierarchical alerting, and an exception strategy can be set for a message; the current strategy covers two aspects: failover and timed fault retry;
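The fingerprint verification that precedes cleaning can be sketched as below; ApplyBusinessCleaningRules and StoreToBusinessTables are placeholders for the service-specific cleaning rules and the final store, and an exception thrown here is picked up by the interception and retry mechanism described next.

```csharp
// Verify the data fingerprint before cleaning, then apply the cleaning rules.
using System;

public static class CleaningPipeline
{
    public static void CleanAndStore(SyncTempRow row)
    {
        // The payload must still hash to the fingerprint generated when it arrived.
        var expected = SyncFingerprint.Compute(row.Payload);
        if (!string.Equals(expected, row.Fingerprint, StringComparison.OrdinalIgnoreCase))
            throw new InvalidOperationException("Fingerprint mismatch: payload altered in transit.");

        var cleaned = ApplyBusinessCleaningRules(row.Payload);
        StoreToBusinessTables(cleaned);
    }

    // Placeholder cleaning rule and store, standing in for the business logic.
    private static string ApplyBusinessCleaningRules(string payload) => payload.Trim();
    private static void StoreToBusinessTables(string cleaned) { /* write to the business tables */ }
}
```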
6) Fault retry:
when a fault or exception is found, the retry strategy is executed immediately, with 3 retries by default; after 3 failures the task is degraded, and the degraded-retry strategy is user-defined. This retry design covers sporadic task exceptions while avoiding the waste of system resources that high-frequency retries would cause during a temporary fault of the network or a replica node; if an executing task exceeds the defined rules, it triggers an exception alert and is escalated from the system level to manual handling, which ensures that the task is eventually executed in full;
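A plain sketch of such a strategy, with 3 immediate retries, a user-defined degraded schedule and final escalation to manual handling, is given below; the delay values and the Alerting helper are assumptions.

```csharp
// Retry with degradation and escalation to manual handling.
using System;
using System.Threading.Tasks;

public static class FaultRetry
{
    private static readonly TimeSpan[] DegradedDelays =
        { TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(5), TimeSpan.FromMinutes(30) };

    public static async Task RunAsync(Func<Task> syncTask, string fingerprint)
    {
        for (int attempt = 1; attempt <= 3; attempt++)       // immediate retries for sporadic faults
        {
            try { await syncTask(); return; }
            catch { /* fall through to the next attempt */ }
        }

        foreach (var delay in DegradedDelays)                // degraded, user-defined schedule
        {
            await Task.Delay(delay);                         // avoid hammering a temporarily faulty node
            try { await syncTask(); return; }
            catch { /* keep degrading */ }
        }

        // Beyond the defined rules: escalate from the system level to manual handling.
        Alerting.Notify($"Synchronization task {fingerprint} exhausted all retries.");
    }
}
```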
7) Timed cleaning:
an expiration time is set for the fingerprint information in the temporary library, and every day the system automatically deletes from the temporary table successful signal data older than 24 hours and failed signal data older than 30 days; this prevents the cache from becoming congested with too much information, reasonably guarantees automatic data repair and resynchronization during fault handling, and further improves the speed of lookups in the cache.
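The timed cleaning could be implemented as a background job along the lines of the sketch below, which reuses the hypothetical SyncTempTable schema from the earlier sketches.

```csharp
// Daily cleanup of the temporary table: successful rows older than 24 hours and
// failed rows older than 30 days are deleted.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Hosting;

public class TempTableCleaner : BackgroundService
{
    private readonly string _connectionString;

    public TempTableCleaner(IConfiguration config) =>
        _connectionString = config.GetConnectionString("Replica");

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            using (var conn = new SqlConnection(_connectionString))
            {
                await conn.OpenAsync(stoppingToken);
                const string sql = @"
                    DELETE FROM SyncTempTable
                     WHERE (State = 'Succeeded' AND UpdatedAt < DATEADD(HOUR, -24, SYSUTCDATETIME()))
                        OR (State = 'Failed'    AND UpdatedAt < DATEADD(DAY, -30, SYSUTCDATETIME()))";
                using var cmd = new SqlCommand(sql, conn);
                await cmd.ExecuteNonQueryAsync(stoppingToken);
            }
            await Task.Delay(TimeSpan.FromDays(1), stoppingToken);   // run once per day
        }
    }
}
```

In the bootstrap sketch shown earlier, this cleaner is registered as a hosted service with AddHostedService.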
The CAP theorem is very well known in the big-data field; the currently popular big-data technologies generally treat it as a theoretical basis and as the cornerstone of NoSQL databases, and many architects regard it as a criterion for designing distributed systems.
Similar data-synchronization schemes currently work as follows: data is synchronized by timed polling, which, when the business system holds a large volume of data, consumes considerable resources on the production system; the program design is also highly coupled, difficult to extend and manage, and can easily affect the production system. In addition, some databases provide a built-in function for subscribing to and publishing data, but they cannot perform data-heterogeneity handling and cleaning; the subscribed and published data is constrained by the database vendor, and such functions are not well suited to cleaning and heterogeneous synchronization between different databases.
In the data publishing process, the data to be published is temporarily buffered in a temporary table of the production system, and the data synchronization system sends the data from this temporary table to the queue database. The temporary table improves publishing efficiency and at the same time prevents publishing failures and lost records caused by network or queue-database faults during synchronization; only active and changed data are synchronized during publishing.
A replica node can set up different queue rules through the message exchange provided by the queue database, and each message queue serves as a new signal source. The acknowledgement mode of RabbitMQ is adopted: after data is received, the data to be processed is first stored in the replica node's temporary table; once the data has been stored, the acknowledgement is returned to complete the consumption step of the message-queue signal exchange. The replica node then immediately cleans, converts and stores the data according to its state in the temporary table, and at the same time the synchronization state of the data in the temporary table is updated through the generated unique data fingerprint. This ensures the efficiency and consistency of data synchronization.
Effect analysis:
the present invention mainly solves the following common problems.
1. The temporary library in the production system effectively removes the need for the production system to wait in order to meet the theoretical consistency requirement, and achieves transactional publishing of data.
2. The data cleaning tool reads the data from the temporary library of the production system and stores it in a message queue first, routing the published data into the queues through a RabbitMQ exchange. This decouples the production system from the replica nodes.
3. Cluster deployment relieves system pressure and the sporadic failure of a single node.
4. After acquiring the data, the subscription service immediately performs the cleaning work through its temporary table: the data is recorded first and then cleaned and stored, which guarantees the efficiency and consistency of the data signals and of the conversion process.
5. Exception alerting: the synchronization state is maintained through the synchronization-task fingerprints, and after several failures the system automatically triggers an alert and notifies staff in time to take part in equipment diagnosis, repair and related work.
6. The temporary library of the subscription service is cleaned locally according to the data-signal state recorded in it, which addresses data heterogeneity, the long synchronization time caused by complicated cleaning rules, and synchronization failures caused by sporadic network faults; at the same time, cleaning and synchronization are guaranteed to take place in the same transaction, further ensuring the consistency and efficiency of data synchronization.
Therefore, the CAP-based data cleaning and synchronizing method further ensures the consistency and efficiency of data synchronization.
Drawings
Fig. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical solution of the invention is further described below by way of an embodiment and with reference to the accompanying drawings.
Example 1: as shown in the figure, the CAP-based data cleaning synchronization method comprises the following steps: (I) Pre-analysis and processing:
data synchronization in the distributed system balances the relationship among the CAP properties and achieves efficient synchronization and cleaning of data; the supported data formats include conventional structured data as well as document formats such as text and pictures; a message queue is introduced into the synchronization process so that data signals are exchanged in a publish-subscribe mode;
in the data publishing process, the data to be published is first buffered in a temporary table of the production system, and the data synchronization system sends the data from this temporary table to the queue database; this prevents publishing failures caused by network or queue-database faults during synchronization, and only active and changed data are synchronized during publishing. Replica nodes can set up different queues through the message exchange provided by the queue database, and each queue serves as a new signal source for a replica node. The acknowledgement mode of RabbitMQ is adopted: after data is received, the data to be processed is first stored in a temporary table of the replica node, and only after this store succeeds is the acknowledgement returned to complete the signal exchange of the message queue. The data in the replica node's temporary table is then cleaned immediately and stored once cleaning is complete; at the same time, the synchronization state of the data in the temporary table is updated through the generated data fingerprint. To keep data synchronization efficient, the system automatically cleans temporary-table data 24 hours after success and 30 days after failure;
data whose cleaning or synchronization fails is retried 3 times immediately; a customized retry strategy, which defaults to 10 attempts, is then applied; if it still fails, the synchronization system triggers a system alert and notifies the relevant personnel to make repairs;
(II) Data publishing:
when the service data is stored, the data is simultaneously published to the exchange of the message queue through the message-pipeline middleware; a third-party application system can customize several message queues according to its service requirements, and the exchange automatically distributes the data to these queues;
if the message pipeline or the network becomes abnormal, the data to be sent is stored locally instead, and the system waits for the message middleware to recover before automatically resuming the update and synchronization;
(III) Data subscription:
multiple nodes subscribe to the data, and data in the same queue is consumed on a first-in, first-out basis; after the subscription data is obtained, it is stored locally first and a unique identification fingerprint is generated for subsequent cleaning and verification;
(IV) Local cleaning and storage:
the data subscription terminal uses a synchronization plug-in and automatically subscribes to data according to authorized configuration such as the message-queue name; the subscribed data is first stored as a sample, and the relevant data cleaning is then performed according to the service requirements; during data processing, the synchronization middleware automatically intercepts exceptions and provides a retry function based on the exception information;
(V) Business relation flow:
1) Data publishing:
method interception is implemented through the system's injection/reflection interface, the method parameters and the corresponding text are extracted, and the data is prepared before publishing;
2) Data transmission:
before transmission, the network is checked and the state of the message exchange is detected; under abnormal conditions all records are retained for later transmission, and a unique fingerprint identifier is generated for each record;
3) Data exchange:
the plug-in that wraps RabbitMQ simplifies the creation of exchanges and message queues; throughout its use, only the relevant service logic needs to be written, and the message exchange and its bindings to the message queues are built automatically;
4) Subscription distributed deployment:
the service-discovery tool Consul maintains the deployment relationships of the system, automatically performs periodic health checks, and automatically distributes subscription tasks according to the health state of the system and the corresponding data-subscription processes;
5) Data cleaning:
after the subscribed data is obtained, the data fingerprint is verified according to the new service scenario, the service state of the data is retrieved, and the data is processed; the service code automatically performs exception interception and hierarchical alerting, and an exception strategy can be set for a message; the current strategy covers two aspects: failover and timed fault retry;
6) Fault retry:
when a fault or exception is found, the retry strategy is executed immediately, with 3 retries by default; after 3 failures the task is degraded, and the degraded-retry strategy is user-defined. This retry design covers sporadic task exceptions while avoiding the waste of system resources that high-frequency retries would cause during a temporary fault of the network or a replica node; if an executing task exceeds the defined rules, it triggers an exception alert and is escalated from the system level to manual handling, which ensures that the task is eventually executed in full;
7) Timed cleaning:
an expiration time is set for the fingerprint information in the temporary library, and every day the system automatically deletes from the temporary table successful signal data older than 24 hours and failed signal data older than 30 days; this prevents the cache from becoming congested with too much information, reasonably guarantees automatic data repair and resynchronization during fault handling, and further improves the speed of lookups in the cache.

Claims (1)

1. A CAP-based data cleaning and synchronization method, characterized by comprising the following steps:
(I) Pre-analysis and processing:
data synchronization in the distributed system balances the relationship among the CAP properties and achieves efficient synchronization and cleaning of data; the supported data formats include conventional structured data and text and picture document formats; a message queue is introduced into the synchronization process so that data signals are exchanged in a publish-subscribe mode;
the data publishing process is as follows: the data to be published is temporarily buffered in a temporary table of the production system, and the data synchronization system sends the data from this temporary table to the queue database, which prevents publishing failures caused by network or queue-database faults during synchronization; only active and changed data are synchronized during publishing; replica nodes can set up different queues through the message exchange provided by the queue database, and each queue serves as a new signal source for a replica node; the acknowledgement mode of RabbitMQ is adopted: after data is received, the data to be processed is first stored in a temporary table of the replica node, and only after this store succeeds is the acknowledgement returned to complete the signal exchange of the message queue; the data in the replica node's temporary table is then cleaned immediately and stored once cleaning is complete; at the same time, the synchronization state of the data in the temporary table is updated through the generated data fingerprint; to keep data synchronization efficient, the system automatically cleans temporary-table data 24 hours after success and 30 days after failure;
data whose cleaning or synchronization fails is retried 3 times immediately; a customized retry strategy, which defaults to 10 attempts, is then applied; if it still fails, the synchronization system triggers a system alert and notifies the relevant personnel to make repairs;
(II) Data publishing:
when the service data is stored, the message-pipeline middleware is used to publish the data to the exchange of the message queue; a third-party application system can customize a plurality of message queues according to its service requirements, and the exchange automatically distributes the data to the plurality of queues;
if a message-pipeline or network abnormality occurs, the data to be transmitted is stored locally, and the system waits for the message queue and the network to recover before automatically resuming the update and synchronization;
(III) Data subscription:
multiple nodes subscribe to the data, and data in the same queue is consumed on a first-in, first-out basis; after the subscription data is obtained, it is stored locally first and a unique identification fingerprint is generated for subsequent cleaning and verification;
(IV) Local cleaning and storage:
the data subscription terminal uses a synchronization plug-in, and data subscription is automatically carried out according to the authorized configuration of the message-queue name; the subscribed data is first stored as a sample, and the relevant data cleaning is then performed according to the service requirements; during data processing, the synchronization middleware automatically intercepts exceptions and provides a retry function based on the exception information;
(V) Business relation flow:
1) Data publishing:
method interception is implemented through the system's injection/reflection interface, the method parameters and the corresponding text are extracted, and preparation is made before data publishing;
2) Data transmission:
before transmission, the network is checked and the state of the message exchange is detected; under abnormal conditions all records are retained for later transmission, and a unique fingerprint identifier is generated for each record;
3) Data exchange:
the plug-in that wraps RabbitMQ simplifies the creation of exchanges and message queues; throughout its use, only the relevant service logic needs to be written, and the message exchange and its bindings to the message queues are built automatically;
4) Subscription distributed deployment:
the service-discovery tool Consul maintains the deployment relationships of the system, automatically performs periodic health checks, and automatically distributes subscription tasks according to the health state of the system and the corresponding nodes in the data-subscription process;
5) Data cleaning:
after the subscribed data is obtained, the data fingerprint is verified according to the new service scenario, the service state of the data is retrieved, and the data is processed; the service code automatically performs exception interception and hierarchical alerting, an exception strategy is set for the message, and processing is carried out in two aspects: failover and timed fault retry;
6) Fault retry:
when a fault or exception is found, the retry strategy is executed immediately, with 3 retries by default; after 3 failures the task is degraded, and the degraded-retry strategy is user-defined; this retry design covers sporadic task exceptions while avoiding the waste of system resources that high-frequency retries would cause during a temporary fault of the network or a replica node; if an executing task exceeds the defined rules, it triggers an exception alert and is escalated from the system level to manual handling, which ensures that the task is eventually executed in full;
7) Timed cleaning:
an expiration time is set for the fingerprint information in the temporary library, and every day the system automatically deletes from the temporary table successful signal data older than 24 hours and failed signal data older than 30 days.
CN202010896649.1A 2020-08-31 2020-08-31 CAP-based data cleaning synchronization method Active CN112069160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010896649.1A CN112069160B (en) 2020-08-31 2020-08-31 CAP-based data cleaning synchronization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010896649.1A CN112069160B (en) 2020-08-31 2020-08-31 CAP-based data cleaning synchronization method

Publications (2)

Publication Number Publication Date
CN112069160A CN112069160A (en) 2020-12-11
CN112069160B true CN112069160B (en) 2023-06-27

Family

ID=73666116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010896649.1A Active CN112069160B (en) 2020-08-31 2020-08-31 CAP-based data cleaning synchronization method

Country Status (1)

Country Link
CN (1) CN112069160B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102724304A (en) * 2012-06-06 2012-10-10 哈尔滨工程大学 Information warehouse federation in subscription/release system and data synchronization method
CN105681401A (en) * 2015-12-31 2016-06-15 深圳前海微众银行股份有限公司 Distributed architecture
CN106815338A (en) * 2016-12-25 2017-06-09 北京中海投资管理有限公司 A kind of real-time storage of big data, treatment and inquiry system
US10701009B1 (en) * 2015-08-10 2020-06-30 Amazon Technologies, Inc. Message exchange filtering

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10911536B2 (en) * 2015-10-27 2021-02-02 Talkcycle Llc Real-time synchronization of data between disparate cloud data sources

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102724304A (en) * 2012-06-06 2012-10-10 哈尔滨工程大学 Information warehouse federation in subscription/release system and data synchronization method
US10701009B1 (en) * 2015-08-10 2020-06-30 Amazon Technologies, Inc. Message exchange filtering
CN105681401A (en) * 2015-12-31 2016-06-15 深圳前海微众银行股份有限公司 Distributed architecture
CN106815338A (en) * 2016-12-25 2017-06-09 北京中海投资管理有限公司 A kind of real-time storage of big data, treatment and inquiry system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Implementation of Data Synchronization Mechanism in Distributed Systems; Cui Wei; Wang Shilin; Computer Engineering and Design (No. 10); full text *

Also Published As

Publication number Publication date
CN112069160A (en) 2020-12-11

Similar Documents

Publication Publication Date Title
US6546403B1 (en) Mechanism to resubmit queries in a parallel database system
US8856091B2 (en) Method and apparatus for sequencing transactions globally in distributed database cluster
US20090177914A1 (en) Clustering Infrastructure System and Method
JP4461147B2 (en) Cluster database using remote data mirroring
US20050114285A1 (en) Data replication system and method
CN110807064B (en) Data recovery device in RAC distributed database cluster system
WO2007028248A1 (en) Method and apparatus for sequencing transactions globally in a distributed database cluster
CN104793988A (en) Cross-database distributed transaction implementation method and device
CN103905537A (en) System for managing industry real-time data storage in distributed environment
KR20110044858A (en) Maintain data indetermination in data servers across data centers
US11550820B2 (en) System and method for partition-scoped snapshot creation in a distributed data computing environment
CN112654978A (en) Method, equipment and system for checking data consistency in distributed heterogeneous storage system in real time
CN112527759B (en) Log execution method and device, computer equipment and storage medium
CN114448983A (en) ZooKeeper-based distributed data exchange method
US6212595B1 (en) Computer program product for fencing a member of a group of processes in a distributed processing environment
CN112069160B (en) CAP-based data cleaning synchronization method
CA2619778C (en) Method and apparatus for sequencing transactions globally in a distributed database cluster with collision monitoring
CN110659303A (en) Read-write control method and device for database nodes
US11522966B2 (en) Methods, devices and systems for non-disruptive upgrades to a replicated state machine in a distributed computing environment
JP5480046B2 (en) Distributed transaction processing system, apparatus, method and program
CN109376193B (en) Data exchange system based on self-adaptive rule
CN113448775A (en) Multi-source heterogeneous data backup method and device
US6205510B1 (en) Method for fencing a member of a group of processes in a distributed processing environment
CN106844021B (en) Computing environment resource management system and management method thereof
CN114185972A (en) System and method for realizing data read-write separation by computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant