CN112328697A - Data synchronization method based on big data - Google Patents

Data synchronization method based on big data

Info

Publication number: CN112328697A
Application number: CN202011300354.XA
Authority: CN (China)
Prior art keywords: data, virtual, source end, synchronization, heterogeneous
Legal status: Withdrawn (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 樊馨
Current Assignee: Individual
Original Assignee: Individual
Priority/filing date: 2020-11-18
Publication date: 2021-02-05

Classifications

    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F 16/2255: Indexing; hash tables
    • G06F 16/23: Updating of structured data, e.g. relational data
    • G06F 3/0631: Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F 3/0647: Migration mechanisms (horizontal data movement between storage devices or systems)
    • G06F 3/0665: Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • G06F 3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H04L 67/1095: Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes

Abstract

The application discloses a data synchronization method based on big data, comprising the following steps. A cloud server obtains database configuration information of a source end and a destination end, where both ends comprise a plurality of instances. If the data type is homogeneous data, the cloud server completes the synchronization from source-end instances to destination-end instances through SQL synchronization commands based on a hash mapping. If the data type is heterogeneous data, the cloud server virtualizes it through a virtualization strategy into virtual homogeneous data for the different source-end instances, establishes different virtual tables in those instances, and adds the virtual homogeneous data of each instance to the corresponding virtual table. Based on the SQL synchronization commands, the virtual tables of the different source-end instances are synchronized to the corresponding destination-end instances according to the hash-mapping association, so that the destination end can extract the virtual homogeneous data from the virtual tables and convert it back into heterogeneous data in a time-shared manner.

Description

Data synchronization method based on big data
Technical Field
The application relates to the technical field of data processing, in particular to a data synchronization method based on big data.
Background
Big data refers to data sets so large that they far exceed the acquisition, storage, management, and analysis capabilities of traditional database software tools; big data is characterized by large scale, rapid circulation, diverse types, and low value density.
For PB-scale data, synchronization is a difficult problem. In the prior art, data synchronization is completed by periodically scheduling tasks through an SQL interface: data at the source end is distributed sequentially to destination-end nodes through a MASTER node. This places heavy demands on the master node and easily creates a performance bottleneck; moreover, only timed scheduling of homogeneous data is supported, and fast synchronization of heterogeneous data remains unsolved.
Disclosure of Invention
The embodiment of the application provides a data synchronization method based on big data, which solves the prior-art problems that data synchronization easily causes a performance bottleneck and that heterogeneous data cannot be synchronized quickly.
The embodiment of the invention provides a data synchronization method based on big data, which comprises the following steps:
the method comprises the steps that a cloud server obtains database configuration information of a source end and a destination end, wherein the source end and the destination end each comprise a plurality of instances;
the cloud server judges the data type in each instance of the source end and the destination end:
if the data type is homogeneous data, the cloud server establishes a hash mapping between the plurality of instances in the source end and the plurality of instances in the destination end and, based on the hash mapping, completes the synchronization from source-end instances to destination-end instances through an SQL synchronization command, wherein source-end instances and destination-end instances are associated according to the hash-mapping relation, and the SQL synchronization command transmits the data of a source-end instance directly, point to point, to the corresponding destination-end instance;
if the data type is heterogeneous data, the cloud server virtualizes the heterogeneous data through a virtualization strategy to generate virtual homogeneous data for the different source-end instances, establishes different virtual tables in the different source-end instances, and correspondingly adds the virtual homogeneous data of each source-end instance to its virtual table; the cloud server establishes a hash mapping between the plurality of instances in the source end and the plurality of instances in the destination end; and, based on the SQL synchronization command, the cloud server synchronizes the virtual tables of the different source-end instances to the corresponding destination-end instances according to the hash-mapping association, so that the destination end extracts the virtual homogeneous data from the virtual tables and converts it back into heterogeneous data in a time-shared manner.
Optionally, the cloud server virtualizing the heterogeneous data through a virtualization strategy to generate virtual homogeneous data for the different source-end instances includes:
the cloud server obtains the data types of the heterogeneous data;
setting different identification bits for the different data types of the heterogeneous data, wherein the identification bits correspond one to one to the data types;
constructing virtual homogeneous data with a uniform format, encapsulating the heterogeneous data into the virtual homogeneous data as its payload data, and writing the identification bit into the message of the virtual homogeneous data as its header, wherein the virtual homogeneous data comprises the header, the payload data, and an end mark;
if the size of the heterogeneous data is larger than that of the virtual homogeneous data, splitting the heterogeneous data into a plurality of sub-heterogeneous data and writing each sub-heterogeneous datum into one virtual homogeneous datum as its payload data, wherein the data volume of each sub-heterogeneous datum is smaller than or equal to that of the virtual homogeneous data and each sub-heterogeneous datum carries a split identifier.
Optionally, the cloud server virtualizing the heterogeneous data through a virtualization strategy to generate virtual homogeneous data for the different source-end instances includes:
constructing a plurality of virtual volumes of different sizes through a Storage Virtualization Manager (SVM), wherein the virtual volumes of different sizes are homogeneous data;
and sequentially encapsulating the heterogeneous data into the virtual volumes and establishing index pointers, wherein an index pointer associates a virtual volume with its heterogeneous data, and the virtual homogeneous data is the set of the virtual volumes and the index pointers.
Optionally, the destination end converting the virtual homogeneous data back into heterogeneous data in a time-shared manner includes:
the destination end monitors the resource occupancy of its different instances; when the resource occupancy is lower than a preset threshold, the virtual homogeneous data are reverse-encapsulated in sequence, and the heterogeneous data inside are extracted and verified,
or,
if the verification fails, the original heterogeneous data in the corresponding source-end instance are fetched in a directed manner through the index pointer;
when the resource occupancy is higher than the preset threshold, the inverse conversion is suspended.
Optionally, completing the synchronization from the source-end instances to the destination-end instances through an SQL synchronization command includes:
generating snapshots of the different source-end instances and, using a lock function, setting them as first static snapshots;
at time T0, migrating the first static snapshots to the corresponding destination-end instances through the SQL synchronization command according to the hash-mapping association, and recording the time at which this migration completes as time T1;
at time T1, judging whether the data of the different source-end instances were updated between T0 and T1; if so, using the lock function to take the data updated between T0 and T1 as a second static snapshot, migrating the second static snapshot to the corresponding destination-end instances according to the hash-mapping association, and recording the time at which this migration completes as time T2;
at time T2, judging whether the data of the different source-end instances were updated between T1 and T2; if so, using the lock function to take the data updated between T1 and T2 as a third static snapshot and migrating it to the corresponding destination-end instances according to the hash-mapping association; if the data of the different source-end instances were not updated, no migration is performed.
Optionally, the synchronization types include stock synchronization and full synchronization, and completing the synchronization from the source-end instances to the destination-end instances through an SQL synchronization command includes:
when the synchronization type is stock synchronization, the cloud server compares the source-end and destination-end instances, extracts their difference data, and completes the synchronization from the source-end instances to the destination-end instances through the SQL synchronization command;
and when the synchronization type is full synchronization, the cloud server creates blank tables in the different destination-end instances and inserts the data of the source-end instances into those blank tables through the SQL synchronization command.
In the big-data-based data synchronization method of the embodiments, source-end and destination-end instances are matched directly through a hash mapping and data are synchronized point to point (P2P), so no distribution through a master node is needed and synchronization performance improves greatly; at the same time, heterogeneous data are virtualized into homogeneous data, synchronized first, and then converted back into heterogeneous data, which saves communication and transmission overhead during synchronization and improves data synchronization efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below.
FIG. 1 is a schematic flow diagram illustrating big data based data synchronization in one embodiment;
FIG. 2 is a diagram of source and destination data synchronization, in one embodiment;
FIG. 3 is a diagram illustrating time-sharing synchronization during data synchronization according to an embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
FIG. 1 is a flow diagram of a big data based data synchronization method in one embodiment. The method in the embodiment comprises the following steps:
s101, a cloud server acquires database configuration information of a source end and a destination end, wherein the source end and the destination end both comprise a plurality of instances;
the cloud servers are basic elements of the cloud computing network, and a plurality of cloud servers form a cloud service cluster to provide services such as data storage, data processing and data transmission.
The source end is the primary storage node, responsible for communicating with edge nodes or user terminals and for acquiring, updating, and storing data. The destination end is the node corresponding to the source end, usually serving as a backup or secondary storage node, responsible for backing up and migrating source-end data. Both ends can be cloud server architectures, and both comprise a plurality of instances. In the embodiment of the invention, an instance may be a storage node or a virtualized storage space. Source-end instances and destination-end instances can establish a matching relationship; once it is established, the plurality of source-end instances are matched directedly with the plurality of destination-end instances, so data synchronization no longer passes solely through the source end's master node and the transmission bottleneck problem is solved.
As shown in fig. 2, in the embodiment of the present invention the source end comprises a MASTER node and multiple instances, as does the destination end; once a hash mapping (the dotted lines in the figure) is established between the source end and the destination end, data are transmitted (synchronized) point to point according to that mapping during synchronization.
The database configuration information of the source end and the destination end comprises the database ID and IP address of the source end, the database ID and IP address of the destination end, the port number of the interface, and so on. After this configuration information is obtained, data synchronization between the source end and the destination end can proceed according to it.
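For illustration only, the configuration gathered in S101 might be represented as follows; this is a minimal sketch, and every field name and value is an assumption rather than something fixed by the disclosure:

    # Sketch of the S101 configuration; all names and values are illustrative
    # assumptions, not part of the original disclosure.
    source_config = {
        "db_id": "src-db-01",                   # database ID of the source end
        "ip": "10.0.0.1",                       # IP address of the source end
        "port": 5432,                           # port number of the SQL interface
        "instances": ["s0", "s1", "s2", "s3"],  # the source end's instances
    }
    destination_config = {
        "db_id": "dst-db-01",
        "ip": "10.0.1.1",
        "port": 5432,
        "instances": ["d0", "d1", "d2"],
    }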
S102, the cloud server judges the data types in all the instances of the source end and the destination end:
the data types are divided into two types, one is isomorphic data, and the other is heterogeneous data. Wherein, the isomorphic data indicates that the data belongs to the same type, such as a plurality of pictures with the format of JPEG, and each picture is isomorphic data; heterogeneous data is in contrast to homogeneous data, i.e., different kinds of data, such as pictures of JPEG and video data of MPEG, are heterogeneous data.
Heterogeneity manifests at five levels:
1. Heterogeneous computer architectures: the physical storage of the data originates from computers of different architectures, such as mainframes, minicomputers, workstations, PCs, or embedded systems.
2. Heterogeneous operating systems: the data originate from different operating systems, such as Unix, Windows, Linux, OS/400, etc.
3. Heterogeneous data formats: the storage management mechanisms differ; the data may live in relational database systems such as Oracle, SQL Server, or DB2, or in two-dimensional line-oriented files such as TXT, CSV, or XLS.
4. Heterogeneous storage locations: the data are stored at distributed physical locations, which is common in large organizations; for example, sales data are stored in the local sales systems of branches in Beijing, Shanghai, Japan, Korea, and so on.
5. Heterogeneous logical models: the data are stored and maintained under different business logics, so data with the same meaning are expressed differently; for example, department codes differ between an independent sales system and an independent purchasing system.
Often, data are heterogeneous not at a single level but at several levels at once.
S103a, if the data type is homogeneous data, the cloud server establishes a hash mapping between the plurality of instances in the source end and the plurality of instances in the destination end and, based on the hash mapping, completes the synchronization from source-end instances to destination-end instances through an SQL synchronization command, wherein source-end instances and destination-end instances are associated according to the hash-mapping relation, and the SQL synchronization command transmits the data of a source-end instance directly, point to point, to the corresponding destination-end instance;
A HashMap, also called a hash map or hash table, is a collection that stores key-value pairs; each pair is called an Entry, and the array holding the entries is the HashMap. A key-value pair is added to the HashMap through the put() function, and the value for a key is retrieved through the get() function. In the embodiment of the invention, the source-end instances are grouped through the hash map and mapped respectively to the destination-end instances. The mapping may be many-to-many or crossed.
In the embodiment of the invention, the source end and the destination end each deploy a plurality of instances interconnected through the network; the source end performs hash distribution according to the total number of instances and assigns each of its instances to one of the destination end's instances. The cloud server obtains the IDs, IP addresses, and port numbers of the source end and the destination end and writes them into a key-value dictionary. Specifically, the source end treats each (ID, IP address, port number) triple as one element and, through a hash algorithm with a MOD (remainder) operation, maps the source-end triples onto the destination end's instances; the result is an array in which the IDs of the different source-end instances are mapped to the IDs of different destination-end instances, and during synchronization data are transmitted to the designated instance according to this per-instance hash mapping. Before synchronization, the cloud server also stores the tables' column names and lengths in the dictionary by key and records each table's distribution keys and their order; the synchronization operation is allowed only if this verification succeeds.
A hash algorithm converts an input of arbitrary length into an output of fixed length, the hash value. The transformation is a compression mapping: the space of hash values is much smaller than the input space, so different inputs may hash to the same output.
The SQL command statements are, in essence, data-update operations (insert, delete, modify, read, etc.) on the databases of the different instances. An SQL command can be converted into a system-level operation instruction that executes as a main() function in each destination-end instance and operates on the databases of the different instances directly and in parallel, which greatly improves synchronization performance.
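As a rough illustration of S103a, the per-instance mapping and the point-to-point dispatch can be sketched as follows. The hash function (MD5 over the instance ID alone, for brevity, rather than the full ID/IP/port triple), the MOD step, and all helper names are assumptions, not details fixed by the disclosure:

    import hashlib

    def build_hash_mapping(src_ids, dst_ids):
        # Hash each source instance ID and take the remainder (MOD) over the
        # number of destination instances, yielding the per-instance mapping.
        mapping = {}
        for sid in src_ids:
            h = int(hashlib.md5(sid.encode()).hexdigest(), 16)
            mapping[sid] = dst_ids[h % len(dst_ids)]
        return mapping

    def sync_point_to_point(mapping):
        # One SQL synchronization command per mapped pair; the pairs can run
        # in parallel, so no master node has to relay the data.
        for sid, did in mapping.items():
            print(f"SYNC {sid} -> {did}")  # stands in for the SQL command

    sync_point_to_point(build_hash_mapping(["s0", "s1", "s2", "s3"],
                                           ["d0", "d1", "d2"]))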
S103b, if the data type is heterogeneous data, the cloud server virtualizes the heterogeneous data through a virtualization strategy to generate virtual homogeneous data for the different source-end instances, establishes different virtual tables in the different source-end instances, and correspondingly adds the virtual homogeneous data of each source-end instance to its virtual table; the cloud server establishes a hash mapping between the plurality of instances in the source end and the plurality of instances in the destination end; and, based on the SQL synchronization command, the cloud server synchronizes the virtual tables of the different source-end instances to the corresponding destination-end instances according to the hash-mapping association, so that the destination end extracts the virtual homogeneous data from the virtual tables and converts it back into heterogeneous data in a time-shared manner.
For heterogeneous data, because of the diversity of data types, hash-mapped synchronization cannot proceed directly as it does for homogeneous data; the heterogeneous data must first be made "homogeneous". The embodiment of the invention therefore borrows the virtualization concept of virtual machines and applies it to heterogeneous data. For example, JPEG, MPEG, and TXT files are picture, video, and text format data, respectively; their sizes, formats, and presentation forms all differ, making it difficult to transmit the three kinds of data in parallel through a unified interface.
A virtual table is a logical-level abstraction of a database, or of a physical table in a database; it belongs to a specific virtual database, has the related characteristics, and has a schema and constraints. It corresponds one to one with a concrete physical table through the pr and vr modes. Virtual tables satisfy the relational operators, and operations such as join and union can form new virtual tables, i.e., virtual views, establishing access and mapping relations between virtual tables in a hierarchy. Like physical tables, virtual tables support creating, deleting, and modifying, as well as reading, writing, changing, and deleting data. Because the virtual homogeneous data are not themselves real, valid heterogeneous data, a virtual table is needed as a transmission carrier: the virtual homogeneous data are written into the virtual table, and the virtual table is synchronized to the destination end.
In summary, in the embodiments of the invention, heterogeneous data are virtualized into homogeneous data, and the homogeneous-data synchronization mechanism is then used to transmit (synchronize) the different instances in parallel.
In the embodiment of the present invention, one heterogeneous-data virtualization method specifically comprises:
S21, the cloud server obtains the data types of the heterogeneous data;
S22, setting different identification bits for the different data types of the heterogeneous data, wherein the identification bits correspond one to one to the data types;
For example, the identification bits may be 01, 02, and 03, representing data in JPEG, MPEG, and TXT formats, respectively.
S23, constructing virtual homogeneous data with a uniform format, encapsulating the heterogeneous data into the virtual homogeneous data as its payload data, and writing the identification bit into the message of the virtual homogeneous data as its header, wherein the virtual homogeneous data comprises the header, the payload data, and an end mark;
In the embodiment of the invention, a homogeneous data unit with a uniform format may be defined, consisting of a header HEAD, a payload PAYLOAD, and a trailer END, where the payload size is fixed (the specific size is customizable) and the header and trailer identify the ID and the sequence of the data, respectively. Heterogeneous data are encapsulated into such homogeneous units, and the identification bit added in the header indicates the kind of data the virtualized unit carries.
S24, if the size of the heterogeneous data is larger than that of the virtual homogeneous data, splitting the heterogeneous data into a plurality of sub-heterogeneous data and writing each sub-heterogeneous datum into one virtual homogeneous datum as its payload data, wherein the data volume of each sub-heterogeneous datum is smaller than or equal to that of the virtual homogeneous data and each sub-heterogeneous datum carries a split identifier.
If the size of the heterogeneous data is smaller than or equal to that of the virtual homogeneous data, the heterogeneous data can be encapsulated whole into the payload of the virtual homogeneous data, with the insufficient bits filled by the end filler "00". For example, if the heterogeneous data is "1111" and the payload of the virtual homogeneous data is 6 bits, the payload of the new virtual homogeneous data after encapsulation is "111100".
If the size of the heterogeneous data is larger than that of the virtual homogeneous data, a single virtual homogeneous unit cannot encapsulate it, so the heterogeneous data must be split. The split can be even, e.g., into two sub-heterogeneous data A and B, or can use the payload size of the virtual homogeneous data as the standard: for 32 KB of heterogeneous data and 18 KB virtual homogeneous units, the data can be split into two 16 KB sub-data, or into one 18 KB and one 14 KB sub-datum, the remaining 4 KB being filled with the end filler "00". The split sub-heterogeneous data carry split identifiers so that the destination end can recombine or splice them.
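A minimal sketch of S23/S24 under the 6-bit-payload example above; the packet layout and all names are illustrative assumptions, not the disclosure's exact format:

    PAYLOAD_BITS = 6                       # payload size is fixed but customizable
    TYPE_IDS = {"JPEG": "01", "MPEG": "02", "TXT": "03"}  # identification bits

    def virtualize(kind, bits):
        # Split the heterogeneous bit string into payload-sized chunks; each
        # chunk becomes one virtual homogeneous unit with a header carrying
        # the identification bit, a split identifier for reassembly, a
        # "00"-padded payload, and an end mark.
        chunks = [bits[i:i + PAYLOAD_BITS] for i in range(0, len(bits), PAYLOAD_BITS)]
        units = []
        for seq, chunk in enumerate(chunks):
            units.append({
                "head": TYPE_IDS[kind],
                "split": f"{seq + 1}/{len(chunks)}",        # split identifier
                "payload": chunk.ljust(PAYLOAD_BITS, "0"),  # pad with the "00" filler
                "end": "END",
            })
        return units

    # "1111" fits in one 6-bit payload and becomes "111100", as in the text.
    print(virtualize("JPEG", "1111"))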
In the embodiment of the present invention, another heterogeneous-data virtualization method is as follows:
S31, constructing a plurality of virtual volumes of different sizes through a Storage Virtualization Manager (SVM), wherein the virtual volumes of different sizes are homogeneous data;
S32, sequentially encapsulating the heterogeneous data into the virtual volumes and establishing index pointers, wherein an index pointer associates a virtual volume with its heterogeneous data, and the virtual homogeneous data is the set of the virtual volumes and the index pointers.
SVM is a virtual storage product developed by StoreAge. It is a SAN appliance that provides virtual volume management in heterogeneous environments and enables sharing of storage capacity and storage performance across the whole space. The SVM's volume management software creates one or more virtual volumes of different sizes in the storage pool as needed and assigns them, under access authorization, to one or more of the cloud servers (source ends or destination ends) it manages. When a cloud server accesses data in a virtual volume assigned to it, it neither knows nor needs to know where, in which storage pool, the data physically reside. The SVM's out-of-band virtualization design avoids performance loss and provides the SAN with high scalability and high availability.
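A sketch of S31/S32; the class and function names below are illustrative assumptions and are not the SVM product's API:

    class VirtualVolume:
        # A homogeneous carrier of a chosen size, as built by the SVM.
        def __init__(self, volume_id, size):
            self.volume_id, self.size, self.data = volume_id, size, None

    index_pointers = {}  # heterogeneous-data key -> its virtual volume

    def encapsulate(volume, key, blob):
        # Pack the heterogeneous data into the volume and establish the
        # index pointer associating the two; the virtual homogeneous data
        # is the set of the volumes plus the index pointers.
        volume.data = blob
        index_pointers[key] = volume

    vol = VirtualVolume("vv-0", size=18 * 1024)
    encapsulate(vol, "clip.mpeg", b"\x00\x01\x02")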
In summary, in S103b, the destination end converting the virtual homogeneous data back into heterogeneous data in a time-shared manner may specifically be:
the destination end monitors the resource occupancy of its different instances; when the resource occupancy is lower than a preset threshold, the virtual homogeneous data are reverse-encapsulated in sequence (that is, the header and trailer of each virtual homogeneous datum are removed and its payload data extracted), and the heterogeneous data inside are extracted and verified,
or,
if the verification fails (for example, because data were received incorrectly during transmission or the table storage order is wrong), the original heterogeneous data in the corresponding source-end instance are fetched in a directed manner through the index pointer;
and when the resource occupancy is higher than the preset threshold, the inverse conversion is suspended.
In the traditional data synchronization process, a lock function must lock the data to be synchronized: no insertion, deletion, or modification is allowed during synchronization until it finishes and the lock is released. This avoids inconsistencies caused by reading and writing a table at the same time, but updates cannot be imported during synchronization, so update responses are delayed. In the embodiment of the invention, synchronization can therefore proceed via static snapshots without blocking data updates during synchronization.
SNIA (Storage Networking Industry Association) defines a snapshot as a fully usable copy of a given data set that includes an image of the corresponding data at some point in time (the point at which the copy begins). The snapshot may be a copy of the data it represents or a replica of the data.
A snapshot is in fact a reference mark or pointer to data stored on the storage device, capturing the data's state at a given moment; the core of its working principle is to build a pointer list indicating the addresses of the data to be read, provide an instantaneous image of the data, and copy data when they change. Snapshots fall roughly into two types: copy-on-write snapshots, commonly called pointer snapshots (VSS belongs to this type), and split-mirror snapshots, often called mirror snapshots. Pointer snapshots occupy little space and barely affect system performance, but if the original data disk is damaged and there is no backup, the data cannot be recovered. A mirror snapshot is a full image of the data at that moment; it places some load on system performance and occupies space equal to the data's capacity, but even if the original data are damaged the system is little affected. The embodiment of the invention mainly adopts the second, mirror type of snapshot.
Therefore, in S103b, completing the synchronization from the source-end instances to the destination-end instances through the SQL synchronization command may specifically be:
S41, generating snapshots of the different source-end instances and, using the lock function lock(), setting them as first static snapshots (the embodiment of the invention may also denote them static snapshots 1, 2, 3); a static snapshot is a locked snapshot: no insertion, deletion, or modification of the snapshot is allowed during synchronization.
S42, at time T0, migrating the first static snapshots to the corresponding destination-end instances through the SQL synchronization command according to the hash-mapping association, and recording the time at which this migration completes as time T1;
S43, at time T1, judging whether the data of the different source-end instances were updated between T0 and T1; if so, using the lock function to take the data updated between T0 and T1 as a second static snapshot, migrating the second static snapshot to the corresponding destination-end instances according to the hash-mapping association, and recording the time at which this migration completes as time T2;
S44, at time T2, judging whether the data of the different source-end instances were updated between T1 and T2; if so, using the lock function to take the data updated between T1 and T2 as a third static snapshot and migrating it to the corresponding destination-end instances according to the hash-mapping association; if the data of the different source-end instances were not updated, no migration is performed.
S45, recording the time at which the third static snapshot finishes migrating as T3; if data were updated between T2 and T3, repeating step S43 or S44; if not, performing no migration.
As shown in fig. 3, the migration volume is large at time T0; that migration completes at time T1, and if data were updated between T0 and T1, the updated data must be migrated again, and so on repeatedly until no data update occurs within the current time window.
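Steps S41-S45 reduce to the following loop; lock_snapshot, lock_updates, migrate, and clock are hypothetical helpers standing in for the lock function, the SQL synchronization command, and the T0/T1/T2 timestamps:

    def snapshot_sync(source, migrate, clock):
        # Lock the current state as a static snapshot, migrate it, then lock
        # and migrate whatever was updated during the migration window,
        # repeating until a window passes with no update (S41-S45).
        snapshot = source.lock_snapshot()        # first static snapshot
        while snapshot is not None:
            t_start = clock()                    # Tn
            migrate(snapshot)                    # per the hash-mapping association
            t_end = clock()                      # Tn+1
            snapshot = source.lock_updates(t_start, t_end)  # None if no update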
Optionally, the synchronization types include stock synchronization and full synchronization. Stock synchronization means the destination end has already synchronized part of the source data and only the remaining (stock) data need to be synchronized; full synchronization means the destination end is empty and all source data must be synchronized. In S103b, completing the synchronization from the source-end instances to the destination-end instances through the SQL synchronization command includes:
when the synchronization type is stock synchronization, the cloud server compares the source-end and destination-end instances, extracts their difference data, and completes the synchronization from the source-end instances to the destination-end instances through the SQL synchronization command; for example, the difference data may be captured from the LOG.
And when the synchronization type is full synchronization, the cloud server creates blank tables in the different destination-end instances and inserts the data of the source-end instances into those blank tables through the SQL synchronization command.
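The two branches might be sketched as follows; run_sql and the row interface are assumptions standing in for the SQL synchronization command:

    def synchronize(sync_type, src_rows, dst_rows, run_sql):
        if sync_type == "stock":
            # Stock synchronization: transfer only the difference data,
            # e.g. rows present at the source but absent at the destination
            # (the text notes the difference may be captured from the LOG).
            diff = [r for r in src_rows if r not in dst_rows]
            run_sql("INSERT", diff)
        elif sync_type == "full":
            # Full synchronization: create a blank table at the destination
            # and insert all source data into it.
            run_sql("CREATE TABLE", [])
            run_sql("INSERT", list(src_rows))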
In the big-data-based data synchronization method of the embodiments, source-end and destination-end instances are matched directly through a hash mapping and data are synchronized point to point (P2P), so no distribution through a master node is needed and synchronization performance improves greatly; at the same time, heterogeneous data are virtualized into homogeneous data, synchronized first, and then converted back into heterogeneous data, which saves communication and transmission overhead during synchronization and improves data synchronization efficiency.
The embodiment of the invention also provides a big-data-based data synchronization system, comprising a processor and a memory storing a computer program runnable on the processor; when the processor runs the computer program, it executes the big-data-based data synchronization method of the above embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A data synchronization method based on big data is characterized by comprising the following steps:
the method comprises the steps that a cloud server obtains database configuration information of a source end and a destination end, wherein the source end and the destination end each comprise a plurality of instances;
the cloud server judges the data type in each instance of the source end and the destination end:
if the data type is homogeneous data, the cloud server establishes a hash mapping between the plurality of instances in the source end and the plurality of instances in the destination end and, based on the hash mapping, completes the synchronization from source-end instances to destination-end instances through an SQL synchronization command, wherein source-end instances and destination-end instances are associated according to the hash-mapping relation, and the SQL synchronization command transmits the data of a source-end instance directly, point to point, to the corresponding destination-end instance;
if the data type is heterogeneous data, the cloud server virtualizes the heterogeneous data through a virtualization strategy to generate virtual homogeneous data for the different source-end instances, establishes different virtual tables in the different source-end instances, and correspondingly adds the virtual homogeneous data of each source-end instance to its virtual table; the cloud server establishes a hash mapping between the plurality of instances in the source end and the plurality of instances in the destination end; and, based on the SQL synchronization command, the cloud server synchronizes the virtual tables of the different source-end instances to the corresponding destination-end instances according to the hash-mapping association, so that the destination end extracts the virtual homogeneous data from the virtual tables and converts it back into heterogeneous data in a time-shared manner.
2. The method of claim 1, wherein the cloud server virtualizing the heterogeneous data through a virtualization strategy to generate virtual homogeneous data for the different source-end instances includes:
the cloud server obtains the data types of the heterogeneous data;
setting different identification bits for the different data types of the heterogeneous data, wherein the identification bits correspond one to one to the data types;
constructing virtual homogeneous data with a uniform format, encapsulating the heterogeneous data into the virtual homogeneous data as its payload data, and writing the identification bit into the message of the virtual homogeneous data as its header, wherein the virtual homogeneous data comprises the header, the payload data, and an end mark;
if the size of the heterogeneous data is larger than that of the virtual homogeneous data, splitting the heterogeneous data into a plurality of sub-heterogeneous data and writing each sub-heterogeneous datum into one virtual homogeneous datum as its payload data, wherein the data volume of each sub-heterogeneous datum is smaller than or equal to that of the virtual homogeneous data and each sub-heterogeneous datum carries a split identifier.
3. The method of claim 1, wherein the cloud server virtualizing the heterogeneous data through a virtualization strategy to generate virtual homogeneous data for the different source-end instances includes:
constructing a plurality of virtual volumes of different sizes through a Storage Virtualization Manager (SVM), wherein the virtual volumes of different sizes are homogeneous data;
and sequentially encapsulating the heterogeneous data into the virtual volumes and establishing index pointers, wherein an index pointer associates a virtual volume with its heterogeneous data, and the virtual homogeneous data is the set of the virtual volumes and the index pointers.
4. The method according to claim 2 or 3, wherein the destination end converting the virtual homogeneous data back into heterogeneous data in a time-shared manner includes:
the destination end monitors the resource occupancy of its different instances; when the resource occupancy is lower than a preset threshold, the virtual homogeneous data are reverse-encapsulated in sequence, and the heterogeneous data inside are extracted and verified,
or,
if the verification fails, the original heterogeneous data in the corresponding source-end instance are fetched in a directed manner through the index pointer;
when the resource occupancy is higher than the preset threshold, the inverse conversion is suspended.
5. The method of claim 1, wherein completing the synchronization from the source-end instances to the destination-end instances through an SQL synchronization command comprises:
generating snapshots of the different source-end instances and, using a lock function, setting them as first static snapshots;
at time T0, migrating the first static snapshots to the corresponding destination-end instances through the SQL synchronization command according to the hash-mapping association, and recording the time at which this migration completes as time T1;
at time T1, judging whether the data of the different source-end instances were updated between T0 and T1; if so, using the lock function to take the data updated between T0 and T1 as a second static snapshot, migrating the second static snapshot to the corresponding destination-end instances according to the hash-mapping association, and recording the time at which this migration completes as time T2;
at time T2, judging whether the data of the different source-end instances were updated between T1 and T2; if so, using the lock function to take the data updated between T1 and T2 as a third static snapshot and migrating it to the corresponding destination-end instances according to the hash-mapping association; if the data of the different source-end instances were not updated, no migration is performed.
6. The method of claim 1, wherein the synchronization types include stock synchronization and full synchronization, and completing the synchronization from the source-end instances to the destination-end instances through the SQL synchronization command comprises:
when the synchronization type is stock synchronization, the cloud server compares the source-end and destination-end instances, extracts their difference data, and completes the synchronization from the source-end instances to the destination-end instances through the SQL synchronization command;
and when the synchronization type is full synchronization, the cloud server creates blank tables in the different destination-end instances and inserts the data of the source-end instances into those blank tables through the SQL synchronization command.
CN202011300354.XA · filed 2020-11-18 · Data synchronization method based on big data · Withdrawn · published as CN112328697A (en)

Priority Applications (1)

Application Number: CN202011300354.XA · Priority/Filing Date: 2020-11-18 · Title: Data synchronization method based on big data

Publications (1)

Publication Number: CN112328697A · Publication Date: 2021-02-05

Family

ID=74321538

Family Applications (1)

Application Number: CN202011300354.XA · Title: Data synchronization method based on big data · Priority/Filing Date: 2020-11-18

Country Status (1)

CN: CN112328697A (en)

Cited By (3)

* Cited by examiner, † Cited by third party

CN113722396A · 2021-11-30 · 武汉达梦数据库股份有限公司 · Method and equipment for switching main and standby services of a data synchronization receiving end
CN113722396B · 2023-12-22 · 武汉达梦数据库股份有限公司 · Method and equipment for switching main and standby services of a data synchronization receiving end
CN114338708A · 2022-04-12 · 南京拓蝶软件科技有限公司 · Data synchronization method based on cloud service


Legal Events

PB01 · Publication
SE01 · Entry into force of request for substantive examination
WW01 · Invention patent application withdrawn after publication (application publication date: 2021-02-05)