CN108134775B - Data processing method and equipment - Google Patents


Info

Publication number
CN108134775B
CN108134775B (application CN201711167866.1A)
Authority
CN
China
Prior art keywords
data block
data
fingerprint
similar
transmitted
Prior art date
Legal status (the status listed is an assumption, not a legal conclusion)
Active
Application number
CN201711167866.1A
Other languages
Chinese (zh)
Other versions
CN108134775A (en)
Inventor
冷继南
关坤
李定
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201711167866.1A
Publication of CN108134775A
Application granted
Publication of CN108134775B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/565 Conversion or adaptation of application format or content

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and data processing device, relates to the field of computer technology, and helps save bandwidth resources. The method comprises the following steps: a first device computes similar fingerprints of data to be transmitted, the similar fingerprints including a similar fingerprint of a first data block; the first device sends the similar fingerprints of the data to be transmitted to a second device, which uses them to search whether reference data blocks similar to the data to be transmitted are stored in the second device; the first device receives fingerprints of the reference data blocks sent by the second device, including a fingerprint of a first reference data block whose similar fingerprint is the same as that of the first data block; the first device finds the first reference data block in the first device according to the fingerprint of the first reference data block; and the first device sends data to the second device based on the fingerprints of the reference data blocks, the data including difference data between the first reference data block and the first data block.

Description

Data processing method and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and device.
Background
With the continuing advance of the cloud computing industry, the cloud infrastructures of mainstream vendors, such as cloud computing centers, cloud disaster-recovery centers, and edge clouds, are widely deployed. These infrastructures form a complex wide area network (WAN) topology, and transferring data between them consumes a large amount of WAN bandwidth. Under limited bandwidth conditions, WAN acceleration techniques are typically used to save bandwidth.
WAN acceleration installs a dedicated device, a WAN accelerator (ACC), at each end of a WAN link. The WAN accelerator caches part or all of the transmitted data and applies deduplication, reducing the amount of data carried on the WAN link, thereby saving bandwidth resources and shortening the total transmission time, that is, accelerating transmission. Specifically: if the WAN accelerator installed on the sending-end device side determines that data the sending-end device sends to the receiving-end device is already stored in the WAN accelerator installed on the receiving-end device side, the sending-side accelerator need not transmit that data to the receiving end, saving bandwidth resources.
In practice, however, the data sent by the sending-end device is seldom completely identical to data cached by the WAN accelerator on the receiving-end device side, so this WAN acceleration technique yields only limited bandwidth savings.
Disclosure of Invention
In view of this, the present application provides a data processing method and device that help save bandwidth resources.
In a first aspect, the present application provides a data processing method, which may include: a first device computes similar fingerprints of data to be transmitted, the similar fingerprints including a similar fingerprint of a first data block, where the first data block is one data block of the data to be transmitted; the first device sends the similar fingerprints of the data to be transmitted to a second device, which uses them to search whether reference data blocks similar to the data to be transmitted are stored in the second device; the first device receives fingerprints of the reference data blocks sent by the second device, including a fingerprint of a first reference data block, where the similar fingerprint of the first reference data block is the same as the similar fingerprint of the first data block; the first device finds the first reference data block in the first device according to the fingerprint of the first reference data block; and the first device sends data to the second device based on the fingerprints of the reference data blocks, where the data includes difference data between the first reference data block and the first data block. In this scheme, the first device interacts with the second device and, on determining that the second device stores a first reference data block similar to the first data block, sends only the difference data between the first data block and the first reference data block. Compared with transmitting the whole first data block, this saves bandwidth resources and thus reduces transmission time, that is, accelerates transmission.
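As an illustration only, the exchange described in the first aspect can be sketched as follows. All names are invented for this sketch; the 4-byte-prefix similarity test and the XOR-style difference data are toy stand-ins for a real locality-sensitive hash and delta encoder, not part of the patent.

```python
import hashlib

def full_fp(block: bytes) -> bytes:
    # Full fingerprint: derived from all of the block's content.
    return hashlib.sha1(block).digest()

def similar_fp(block: bytes) -> bytes:
    # Toy similar fingerprint: derived from specific content only (here
    # the first 4 bytes), so similar blocks can share it. A real system
    # would use a locality-sensitive hash such as minhash.
    return hashlib.sha1(block[:4]).digest()

def diff(ref: bytes, new: bytes) -> bytes:
    # Placeholder difference data; real systems use a delta encoder.
    return bytes(a ^ b for a, b in zip(ref, new)) + new[len(ref):]

class SecondDevice:
    """Receiving side: stores reference blocks, indexed two ways."""
    def __init__(self):
        self.store = {}    # full fingerprint -> block
        self.by_sfp = {}   # similar fingerprint -> full fingerprint

    def add(self, block: bytes) -> None:
        fp = full_fp(block)
        self.store[fp] = block
        self.by_sfp[similar_fp(block)] = fp

    def lookup_similar(self, sfps):
        # Search for stored reference blocks whose similar fingerprint
        # matches; return their full fingerprints (None where no match).
        return [self.by_sfp.get(s) for s in sfps]

def first_device_send(blocks, second_device, local_store):
    # Step 1: compute and send similar fingerprints of the data blocks.
    ref_fps = second_device.lookup_similar([similar_fp(b) for b in blocks])
    out = []
    for block, ref_fp in zip(blocks, ref_fps):
        if ref_fp is not None and ref_fp in local_store:
            # A similar reference block exists on both sides,
            # so only the difference data is sent.
            out.append(("delta", ref_fp, diff(local_store[ref_fp], block)))
        else:
            out.append(("raw", None, block))
    return out
```

When a block has no similar reference on the second device, the sketch falls back to sending the raw block, matching the second-data-block case discussed later.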
The fingerprint of a data block is identification information for marking the data block, obtained from all of the block's characteristic information; the fingerprints of different data blocks differ. The similar fingerprint of a data block is identification information for marking the data block, obtained from specific characteristic information of the block; similar fingerprints of different data blocks may or may not be the same. The similar fingerprints of the data to be transmitted may include a similar fingerprint of each data block in the data to be transmitted. The fingerprints of the reference data blocks may include a fingerprint of a reference data block similar to each data block of the data to be transmitted. A reference data block similar to the data to be transmitted is, specifically, a reference data block similar to some data block in the data to be transmitted.
In one possible design, the first device calculating similar fingerprints of data to be transmitted may include: the first device performs hash operation on the data block of the data to be transmitted by using a locality sensitive hash algorithm to obtain a similar fingerprint of the data block. The locality sensitive hashing algorithm may be, for example, but not limited to, minhash, simhash, and the like.
In one possible design, the hashing by the first device of a data block of the data to be transmitted using a locality-sensitive hashing algorithm to obtain a similar fingerprint of the data block may include: the first device divides the data to be transmitted to obtain data blocks, and for each data block performs the following operations: extract at least one sub-data block from the data block; perform hash operations on the at least one sub-data block using m hash algorithms to obtain m hash sequences, where each of the m hash algorithms is applied to the at least one sub-data block to obtain one hash sequence, and m is an integer of 2 or more; then either combine the maximum value of each of the m hash sequences and take the combined sequence as the similar fingerprint of the data block, or combine the minimum value of each of the m hash sequences and take the combined sequence as the similar fingerprint of the data block. This possible design provides a concrete implementation of minhash, in which the m hash sequences can be computed in parallel, shortening the time spent computing similar fingerprints.
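A minimal sketch of this minhash design, under the assumption that salted SHA-1 stands in for the m independent hash algorithms (the patent does not name them):

```python
import hashlib

def minhash_fingerprint(sub_blocks, m=4):
    """Hash the extracted sub-blocks with m hash algorithms to get m
    hash sequences, take the minimum of each sequence, and combine the
    minima into the similar fingerprint."""
    minima = []
    for i in range(m):
        # Each iteration is independent, so the m sequences could be
        # computed in parallel, shortening fingerprint computation.
        seq = [hashlib.sha1(bytes([i]) + s).digest() for s in sub_blocks]
        minima.append(min(seq))
    return b"".join(minima)
```

Because only the per-sequence minima survive, data blocks sharing most of their sub-blocks tend to produce the same fingerprint; taking the maxima instead, as the design also allows, works the same way.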
In one possible design, the method may further include: the first device performs differential compression on the first reference data block and the first data block using a differential compression algorithm. Thus, bandwidth resources can be further saved, and the information transmission time is reduced, namely, the information transmission speed is accelerated.
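The patent does not specify a particular differential compression algorithm; as one illustrative stand-in, zlib's preset-dictionary feature can delta-compress a block against a similar reference block:

```python
import zlib

def delta_encode(ref: bytes, new: bytes) -> bytes:
    # Compress the new block with the reference block as a preset
    # dictionary: content shared with the reference is emitted as
    # back-references, so similar blocks produce small difference data.
    c = zlib.compressobj(zdict=ref)
    return c.compress(new) + c.flush()

def delta_decode(ref: bytes, delta: bytes) -> bytes:
    # The receiver holds the same reference block, so it can invert
    # the delta with the same dictionary.
    d = zlib.decompressobj(zdict=ref)
    return d.decompress(delta) + d.flush()
```

Production systems typically use dedicated delta encoders (xdelta- or bsdiff-style), but the bandwidth effect is the same: the more the first data block resembles the first reference data block, the smaller the difference data.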
In one possible design, the similar fingerprint of the data to be transmitted further includes a similar fingerprint of a second data block, and the second data block is another data block in the data to be transmitted; the fingerprint of the reference data block does not contain the fingerprint of the second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also contains a second block of data.
In one possible design, the first device includes a first-level cache and a second-level cache, the first-level cache being a non-persistent medium and the second-level cache a persistent medium, the first-level cache caching some or all of the data blocks stored in the second-level cache together with their fingerprints and similar fingerprints. The method may further include: the first device looks up the fingerprint of the first reference data block in the first-level cache; if it is not found there, the first device looks it up in the second-level cache. Because data blocks in the first-level cache have a high hit probability, the first reference data block can usually be found in the first-level cache, which improves lookup efficiency.
In one possible design, the second-level cache includes one or more containers, each container being a set of at least two data blocks together with the fingerprint and similar fingerprint of each of them, where the contents of the data blocks in a container are correlated. The method may further include: if the first device finds a data block in the second-level cache, it caches the container holding that data block into the first-level cache. Because data blocks in the first-level cache have a high hit probability, a data block can then usually be found in the first-level cache, which improves lookup efficiency.
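A sketch of the two-level lookup, under the assumption that the first-level cache is an in-memory map and the second level a container-organised store; the names are illustrative:

```python
class TwoLevelCache:
    """First-level (non-persistent) cache in front of a persistent
    second level organised into containers. On a second-level hit the
    whole container is promoted, since its blocks' contents are
    correlated and likely to be hit soon."""
    def __init__(self, containers):
        # containers: container_id -> {fingerprint: block}
        self.l2 = containers
        self.l1 = {}          # fingerprint -> block

    def lookup(self, fp):
        if fp in self.l1:                     # first-level hit
            return self.l1[fp]
        for blocks in self.l2.values():       # fall back to second level
            if fp in blocks:
                self.l1.update(blocks)        # promote the whole container
                return blocks[fp]
        return None
```

Promoting the container rather than the single block is what raises the first-level hit probability for subsequent, correlated lookups.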
In a second aspect, the present application provides a data processing method, which may include: a second device receives similar fingerprints of data to be transmitted sent by a first device, the similar fingerprints including a similar fingerprint of a first data block, where the first data block is one data block of the data to be transmitted; the second device finds, according to the similar fingerprints of the data to be transmitted, reference data blocks stored in the second device that are similar to the data to be transmitted, the reference data blocks including a first reference data block whose similar fingerprint is the same as that of the first data block; the second device sends fingerprints of the reference data blocks to the first device, including a fingerprint of the first reference data block, which the first device uses to send data to the second device, the data including difference data between the first reference data block and the first data block; and the second device receives the data sent by the first device. In this scheme, the second device interacts with the first device and, on determining that it stores a reference data block similar to the first data block, sends the fingerprint of that reference data block to the first device, so that the first device sends only the difference data between the first data block and the reference data block.
Therefore, compared with the transmission of the first data block, the bandwidth resource can be saved, so that the information transmission time is reduced, namely, the information transmission speed is accelerated.
In one possible design, the method may further include: the second device receives fingerprints of the data to be transmitted sent by the first device, including a fingerprint of the first data block; when the second device determines from that fingerprint that the first data block is not stored in the second device, it searches, according to the similar fingerprint of the first data block, whether the first reference data block is stored in the second device. If the second device does store the first data block, the first device need not send the first data block at all; compared with transmitting even the difference data between the first data block and a reference data block, this further saves bandwidth resources.
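The resulting receiver-side decision can be sketched as a three-way classification (names and return values are illustrative, not from the patent):

```python
def classify_block(fp, sfp, stored_fps, sfp_index):
    """Decide how a block must travel: a full-fingerprint match means
    the block is already stored and nothing need be sent; otherwise a
    similar-fingerprint match means a reference block exists and only
    difference data is needed; otherwise the whole block is sent."""
    if fp in stored_fps:
        return "duplicate"
    if sfp in sfp_index:
        return ("delta", sfp_index[sfp])  # fingerprint of the reference block
    return "raw"
```

The three outcomes correspond to, respectively, full deduplication, differential transmission, and the fallback of sending the block itself.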
In one possible design, the similar fingerprint of the data to be transmitted includes a similar fingerprint of a second data block, the second data block being another data block of the data to be transmitted; the fingerprint of the reference data block does not contain the fingerprint of the second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also contains a second data block.
In one possible design, the second device includes a first level cache and a second level cache, the first level cache being a non-persistent medium and the second level cache being a persistent medium, the first level cache for caching some or all of the data blocks stored in the second level cache and fingerprints and similar fingerprints of some or all of the data blocks; the method may further comprise: the second device looks up the fingerprint (or similar fingerprint) of the first data block in the first-level cache; if the fingerprint (or similar fingerprint) of the first data block cannot be found in the first-level cache, the fingerprint (or similar fingerprint) of the first data block is found in the second-level cache. In this way, since the probability that the data block in the first-level cache is hit is high, that is, the first data block (or the first reference data block) can be usually found in the first-level cache, the information search efficiency can be improved.
In one possible design, the second-level cache includes one or more containers, each container being a set of at least two data blocks together with the fingerprint and similar fingerprint of each of them, where the contents of the data blocks in a container are correlated; the method may further include: if the second device finds a data block in the second-level cache, it caches the container holding that data block into the first-level cache. The data block may be any data block of the data to be transmitted, or a reference data block similar to any such data block. Because data blocks in the first-level cache have a high hit probability, a data block can then usually be found in the first-level cache, which improves lookup efficiency.
In a third aspect, the present application provides a data processing apparatus for performing any one of the methods provided in the first aspect above. The data processing device may specifically be the first device described above.
In a possible design, the data processing apparatus may be divided into functional modules according to the method provided in the first aspect, for example, the functional modules may be divided corresponding to the functions, or two or more functions may be integrated into one processing module.
In another possible design, the apparatus may include a processor and a memory, the memory storing a computer program that, when executed by the processor, causes any of the methods provided in the first aspect to be performed.
In a fourth aspect, the present application provides a data processing apparatus for performing any of the methods provided by the second aspect above. The data processing device may specifically be the second device described above.
In a possible design, the data processing apparatus may be divided into functional modules according to the method provided in the second aspect, for example, the functional modules may be divided corresponding to the functions, or two or more functions may be integrated into one processing module.
In another possible design, the apparatus may include a processor and a memory, the memory storing a computer program that, when executed by the processor, causes any of the methods provided in the second aspect to be performed.
The embodiment of the application further provides a processing apparatus, configured to implement the functions of the first device or the second device, where the processing apparatus includes a processor and an interface; the processing device may be a chip, and the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated with the processor, located external to the processor, or stand-alone.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform any of the possible methods of the first to second aspects described above.
The present application also provides a computer program product which, when run on a computer, causes any of the methods provided in the first aspect to the second aspect to be performed.
It is understood that any of the data processing devices, computer storage media, or computer program products provided above is used to execute a corresponding method provided above; for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding method, which are not repeated here.
Drawings
Fig. 1 is a schematic structural diagram of a system to which the data processing method according to an embodiment of the present application is applied;
Fig. 2 is an interaction diagram of a data processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the size relationship between data to be transmitted, a unit length, and a data block according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a process for computing similar fingerprints of data blocks according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a first transmission list according to an embodiment of the present application;
Fig. 6 is a flowchart of determining the category of a data block according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a second transmission list according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a third transmission list according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of another system to which the data processing method according to an embodiment of the present application is applied;
Fig. 10 is a schematic diagram of a process for updating information stored in a first-level cache according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application.
Detailed Description
A data block refers to a collection of a portion of data to be transmitted. The size of the different data blocks may be the same or different.
The fingerprint of a data block refers to identification information for marking the data block, which is obtained based on all characteristic information of the data block. The fingerprints of different data blocks are different.
The similar fingerprint of the data block refers to identification information for marking the data block, which is obtained based on the specific characteristic information of the data block. For example, a certain data block is a character string "78905", and if the specific feature information is the feature information of the character at the 2 nd position, the similar fingerprint of the data block is the feature information of "8"; if the specific feature information is the feature information of the character at the 5 th position, the similar fingerprint of the data block is the feature information of "5". Similar fingerprints for different data chunks may or may not be the same. For example, two data blocks are character strings "78905" and "12345", respectively, and if the specific feature information is the feature information of the character at the 2 nd position, the similar fingerprints of the two data blocks are different, and are the feature information of "8" and the feature information of "2", respectively; if the specific feature information is the feature information of the character at the 5 th position, the similar fingerprints of the two data blocks are the same and are both the feature information of "5". Here, the character feature information may be the character itself, or information on the character calculated according to a specific algorithm.
A reference data block that is similar to a data block refers to a data block that has the same similar fingerprint as the data block. For example, if two data blocks are the character strings "78905" and "12345" respectively, and the specific feature information is the feature information of the character at the 5 th position, the similar fingerprints of the two data blocks are the same and are both the feature information of "5", in this case, "78905" may be used as the reference data block of "12345", and "12345" may also be used as the reference data block of "78905".
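The running example above can be written out directly; here the character itself serves as the feature information, and positions are 1-based as in the text:

```python
def position_feature_fp(block: str, pos: int) -> str:
    # Similar fingerprint derived from specific characteristic
    # information: the character at one chosen (1-based) position.
    return block[pos - 1]
```

With position 2 the similar fingerprints of "78905" and "12345" differ ('8' vs '2'); with position 5 they match (both '5'), so either string can serve as the other's reference data block.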
A container refers to a collection of information. A container may include a plurality of data chunks, a fingerprint and a similar fingerprint for each data chunk of the plurality of data chunks. Each container has a container identifier for marking the container. The accelerator may uniformly schedule the information in one container, for example, write all the information in one container into a cache, and the like.
The term "and/or" in this application only describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The symbol "/" indicates an "or" relationship between the associated objects; for example, A/B indicates A or B. The terms "first", "second", and the like are used to distinguish different objects, not to describe a particular order of the objects. "Plurality" means two or more.
Fig. 1 is a schematic structural diagram of a system to which the data processing method provided in the embodiment of the present application is applied. The system shown in fig. 1 includes: a sending end device 1, a sending end accelerator 2, a receiving end accelerator 3, and a receiving end device 4. The sending end device 1 transmits information to the receiving end device 4 via the sending end accelerator 2 and the receiving end accelerator 3. The sending end accelerator 2 is installed on the sending end device 1 side, the receiving end accelerator 3 is installed on the receiving end device 4 side, and the two accelerators communicate over a WAN. The sending end device 1 and the receiving end device 4 may be data centers, such as a cloud computing center, a cloud disaster-recovery center, or an edge cloud. Both the sending end accelerator 2 and the receiving end accelerator 3 may be referred to as WAN accelerators. The data processing method provided by the application can be applied to scenarios such as data backup and data recovery.
It can be understood that a sending end device in a certain data transmission process may be used as a receiving end device in another data transmission process; accordingly, the transmitting-end accelerator installed on the transmitting-end device side is used as the receiving-end accelerator in the other data transmission process. Similarly, the receiving end device in a certain data transmission process may be used as the sending end device in another data transmission process; accordingly, the receiving-side accelerator installed on the receiving-side device side is used as the sending-side accelerator in the other data transmission process.
It is understood that the sending end accelerator 2 may be the first device described in this application and the receiving end accelerator 3 the second device; alternatively, the first device described in this application may refer to the sending end device 1 and the second device to the receiving end device 4; in another implementation, the first device may refer to the combination of the sending end device 1 and the sending end accelerator 2, and the second device to the combination of the receiving end accelerator 3 and the receiving end device 4.
Fig. 2 is an interaction diagram of a data processing method according to an embodiment of the present application. The method shown in fig. 2 may be applied to the system architecture shown in fig. 1. The method shown in fig. 2 includes the following steps S101 to S110:
s101: and the sending end equipment sends the data to be transmitted to the sending end accelerator.
In the field of cloud computing, a large amount of data is generally transmitted, periodically or aperiodically, from a sending end device to a receiving end device through a sending end accelerator and a receiving end accelerator. The size of the data to be transmitted may be the same or different from one transmission to the next. For example, the data that needs to be transmitted at one time may be 10 GB.
S102: after receiving data to be transmitted sent by sending end equipment, a sending end accelerator divides the data to be transmitted into a plurality of unit lengths, and then performs variable-length block division on the data of each unit length to obtain a plurality of data blocks.
Because the size of the data to be transmitted may differ from one transmission to the next, the concept of a "unit length" is introduced for ease of management; its size may be, for example but not limited to, 4 MB. Generally, the sending end accelerator and the receiving end accelerator process data one unit length at a time (for example, computing fingerprints and similar fingerprints, transmitting, and so on, per unit length).
Variable-length blocking is a blocking algorithm that divides data into blocks according to the data content. It may be implemented, for example but not limited to, using a sliding-window technique together with Rabin fingerprinting. Any two data blocks obtained by variable-length blocking may be of equal or unequal size, and different unit lengths of data may yield the same or different numbers of blocks. For example, assuming the unit length is 4 MB and the average block size produced by variable-length blocking is about 8 KB, one unit length of data may be divided into 511 data blocks, another into 512, another into 514, and so on. Because block boundaries depend on data content, after data shifts, data blocks with the same content as before the shift can still be cut out, which facilitates the subsequent deduplication service, that is, the service whereby the first type of data block described below is not transmitted again.
Fig. 3 is a schematic diagram of the size relationship between the data to be transmitted, the unit length, and the data block according to an embodiment of the present application. In fig. 3, the data to be transmitted is 10 GB, the unit length is 4 MB, and the average data block size is about 8 KB. In this example, after receiving the 10 GB of data to be transmitted sent by the sending-end device, the sending-end accelerator may first segment the 10 GB of data, in its arrangement order and at a granularity of 4 MB, into 2560 unit lengths (marked as unit lengths 1 to 2560 in fig. 3); then, the 4 MB of data in each unit length is divided into 512 data blocks (marked as data blocks 1 to 512 in fig. 3).
The blocking algorithm is not limited to the variable-length blocking algorithm; it may also be, for example, a fixed-length blocking algorithm, in which every data block obtained from a unit length has the same size.
S103: for any unit length, the sending-end accelerator calculates the fingerprint and similar fingerprint of each data block in the unit length.
The sending-end accelerator may compute the fingerprint of a data block through a hash algorithm. The hash algorithm may be, for example, but is not limited to, any of the following: secure hash algorithm 1 (SHA-1), message digest algorithm 5 (MD5), a modulo algorithm, an algorithm that cuts out part of the bytes, and the like.
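As a concrete illustration, a fingerprint of this kind can be computed with one line of standard-library code; SHA-1 is used here only because it is one of the algorithms named above, and the hex-string representation is an illustrative choice.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    """Content fingerprint of a data block: identical blocks always produce
    identical fingerprints, so a fingerprint match implies a duplicate block."""
    return hashlib.sha1(block).hexdigest()
```

Any collision-resistant hash serves the same deduplication role; only the fingerprint width and collision probability change.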
The sending-end accelerator may calculate the similar fingerprint of a data block through a locality-sensitive hashing (LSH) algorithm. Locality-sensitive hashing improves similarity-query efficiency by designing hash functions that satisfy a special property, namely locality sensitivity. Similar fingerprints of the same data block obtained with different locality-sensitive hashing algorithms may be the same or different. The locality-sensitive hashing algorithm may be, for example, but is not limited to, any of the following: minhash, simhash, and the like.
Optionally, in order to shorten the time consumed for calculating the similar fingerprints, in some embodiments of the present application, a method for calculating the similar fingerprints of the data block is provided, which may specifically include the following steps 1) to 4):
1) For any data block, segment the data block into a plurality of sub data blocks.
The algorithm used when the data block is divided into a plurality of sub-data blocks may be a fixed-length block partitioning algorithm or a variable-length block partitioning algorithm. For the variable length blocking algorithm, the size of each sub data block may be, for example, but not limited to, 8-16 bytes (Byte). The sizes of the different sub data blocks may be the same or different.
2) Extract n target sub data blocks from the plurality of sub data blocks, where n is an integer greater than or equal to 1. Each target sub data block may be regarded as one piece of characteristic information of the data block.
For example, assuming that an 8 KB data block is divided into 1000 sub data blocks in step 1), that the target sub data blocks are the 4k-th sub data blocks, and that k is an integer greater than or equal to 1, the extracted target sub data blocks may be: sub data blocks 4, 8, 12, 16, 20, … 1000.
Steps 1) and 2) are one specific implementation of extracting characteristic information from a data block; the present application is not limited thereto.
3) Perform a hash operation on each target sub data block in the data block using m different hash algorithms to obtain m hash sequences, where m is an integer greater than or equal to 2.
Performing a hash operation on one target sub data block yields one hash value, so performing the hash operation on the n target sub data blocks with one hash algorithm yields n hash values, which form one hash sequence. Therefore, performing the hash operation on every target sub data block in the data block with m different hash algorithms yields m hash sequences, each containing n hash values.
4) Take the maximum value of each hash sequence and use the sequence obtained by combining the m maximum values as the similar fingerprint of the data block; alternatively, take the minimum value of each hash sequence and use the sequence obtained by combining the m minimum values as the similar fingerprint. Combining the m values (maxima or minima) means ordering them according to a fixed arrangement of the m hash algorithms to obtain a sequence. Once the arrangement of the m hash algorithms is determined, that same arrangement is used when computing the similar fingerprint of every data block.
Fig. 4 is a process diagram of this alternative implementation. In fig. 4, the m different hash algorithms are hash algorithms 1, 2, and 3; hash operations are performed on the target sub data blocks with each of them to obtain hash sequences 1, 2, and 3, respectively. In this optional implementation, the sending-end accelerator may compute the m hash sequences in parallel, thereby shortening the time consumed for computing the similar fingerprints of the data blocks.
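Steps 1) to 4) above can be sketched as a minhash-style computation. This is a hedged illustration, not the patent's exact method: the fixed-length sub-block split, the "every 4th sub block" target rule taken from the example above, and the use of seeded BLAKE2 digests as the m different hash algorithms are all illustrative assumptions; the minimum-value variant of step 4) is shown.

```python
import hashlib

def similar_fingerprint(block: bytes, sub_size: int = 8, m: int = 3) -> tuple:
    # 1) segment the block into fixed-length sub data blocks
    subs = [block[i:i + sub_size] for i in range(0, len(block), sub_size)]
    # 2) extract the 4k-th sub blocks as targets (fall back for tiny blocks)
    targets = subs[3::4] or subs
    mins = []
    for seed in range(m):                       # 3) m different hash algorithms
        seq = [int.from_bytes(
                   hashlib.blake2b(t, salt=seed.to_bytes(16, "big"),
                                   digest_size=8).digest(), "big")
               for t in targets]                # one hash sequence of n values
        mins.append(min(seq))                   # 4) minimum of each sequence
    return tuple(mins)                          # combined in fixed algorithm order
```

Because the fingerprint keeps only one extreme value per hash algorithm, blocks that share most of their target sub blocks tend to share the same similar fingerprint, which is exactly the locality-sensitive property S105 relies on.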
It should be noted that, because the sending-end accelerator processes data one unit length at a time, in actual implementation it need not wait until the fingerprints and similar fingerprints of all data blocks of the data to be transmitted have been calculated before performing S104; instead, it may proceed to S104 for a unit length as soon as the fingerprints and similar fingerprints of every data block in that unit length have been calculated.
It will be appreciated that the fingerprints of the data to be transmitted may include a fingerprint of each data block in the data to be transmitted, and the similar fingerprints of the data to be transmitted may include a similar fingerprint of each data block in the data to be transmitted.
S104: the sending-end accelerator transmits the fingerprint and the similar fingerprint of each data block in the unit length to the receiving-end accelerator.
Illustratively, the sending-end accelerator constructs a first transmission list from the fingerprints and similar fingerprints of the data blocks in the unit length and then sends it to the receiving-end accelerator. The first transmission list indicates the fingerprint and similar fingerprint of each data block in the unit length. This embodiment is described with the sending-end accelerator sending one unit length's fingerprints and similar fingerprints in the form of a list, but the present application is not limited thereto.
Fig. 5 is a schematic diagram of a first transmission list provided in an embodiment of the present application. The first transmission list may include: header information, and the fingerprint information of each data block in the unit length arranged in a certain sequence (hereinafter referred to as the first sequence). The header information records overview information of the contents of the first transmission list and may include, for example and without limitation: the number of data blocks in the unit length, and the start position of the fingerprint information of the first data block. The start position may be information indicating at which bit of the first transmission list the fingerprint information of the first data block begins. The first sequence may be, for example, but is not limited to, the sequence in which the data blocks were obtained during variable-length blocking in S102. For example, if the unit length was divided in order into data block 1, data block 2, data block 3, and so on, the fingerprint information in the first transmission list may be, in order: the fingerprint information of data block 1, of data block 2, of data block 3, and so on, where the fingerprint information of a data block includes the fingerprint of the data block and the similar fingerprint of the data block. Fig. 5 illustrates an example in which one unit length includes data block 1 to data block 512. The fingerprint information of each data block in the first transmission list is not limited to the example shown in fig. 5.
For example, the fingerprint information of each data block in the first transmission list may instead be, in order: the fingerprint of data block 1, the fingerprint of data block 2, …, the fingerprint of data block 512, followed by the similar fingerprint of data block 1, the similar fingerprint of data block 2, …, the similar fingerprint of data block 512.
S105: after receiving the fingerprints and similar fingerprints of the data blocks in the unit length sent by the sending-end accelerator, the receiving-end accelerator determines the category of each data block in the unit length.
The accelerator at the receiving end may sequentially determine the category of each data block according to the first sequence, or may simultaneously determine the categories of at least two data blocks, which is not limited in the present application.
The categories of the data blocks include: first-type data blocks, second-type data blocks, and third-type data blocks. If a data block is a first-type data block, the receiving-end accelerator has already stored the data block. If a data block is a second-type data block, the receiving-end accelerator does not store the data block but does store a reference data block similar to it. If a data block is a third-type data block, the receiving-end accelerator stores neither the data block nor any reference data block similar to it.
For example, for any data block, as shown in fig. 6, its category can be determined through the following steps T1-T6:
T1: the receiving-end accelerator obtains the fingerprint and similar fingerprint of the data block.
T2: the accelerator at the receiving end judges whether the fingerprint of the data block can be found locally.
If yes, the receiving accelerator stores the data block, and T3 is executed.
If not, the receiving-end accelerator does not store the data block, and then T4 is executed.
It should be noted that, in general, if a data block is stored in the receiving-end accelerator, its fingerprint is stored with it. Therefore, determining whether the fingerprint of the data block is stored locally amounts to determining whether the data block itself is stored locally.
T3: the accelerator at the receiving end determines that the data block is a first type data block.
After execution of T3, it ends.
T4: the accelerator at the receiving end judges whether the similar fingerprint of the data block can be found locally.
If yes, it indicates that the accelerator at the receiving end has stored a reference data block similar to the data block, then T5 is executed.
If not, indicating that the accelerator at the receiving end does not store the reference data block similar to the data block, T6 is executed.
It should be noted that, in general, if a data block is stored in the receiving-end accelerator, its similar fingerprint is stored with it. Therefore, determining whether the similar fingerprint of the data block is stored locally amounts to determining whether a reference data block similar to the data block is stored locally.
T5: the receiving end accelerator determines that the data block is a second type data block.
After execution of T5, it ends.
It should be noted that, since similar fingerprints of different data blocks may be the same, a plurality of data blocks with the same similar fingerprint may be cached in the receiving-end accelerator, and based on this, in T5, the receiving-end accelerator may use one of the data blocks as a reference data block. Or, when the receiving-end accelerator caches the data blocks, only one of the data blocks may be cached for a plurality of data blocks with the same similar fingerprint, so that the situation that a plurality of data blocks with the same similar fingerprint are cached in the receiving-end accelerator can be avoided.
It should be noted that, after determining that a certain data block is a second type data block, the accelerator at the receiving end may also obtain a fingerprint of a reference data block similar to the certain data block, so as to prepare for performing S106.
T6: the receiving end accelerator determines that the data block is a third type data block.
After execution of T6, it ends.
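Steps T1-T6 reduce to two index lookups on the receiving end. The sketch below is a hedged illustration: the two dictionaries stand in for whatever fingerprint and similar-fingerprint stores the accelerator actually uses, and the returned reference fingerprint anticipates the feedback in S106.

```python
FIRST, SECOND, THIRD = 1, 2, 3  # the three data block categories

def classify(fp, sim_fp, fp_index, sim_index):
    """fp_index maps fingerprint -> stored block;
    sim_index maps similar fingerprint -> fingerprint of the reference block."""
    if fp in fp_index:                        # T2/T3: the block itself is stored
        return FIRST, None
    if sim_fp in sim_index:                   # T4/T5: a similar reference exists
        return SECOND, sim_index[sim_fp]      # reference fingerprint for S106
    return THIRD, None                        # T6: nothing similar is stored
```

A usage example: with `fp_index = {"fpA": b"block A"}` and `sim_index = {"simB": "fpB"}`, a block with fingerprint `"fpA"` classifies as first-type, one with similar fingerprint `"simB"` as second-type (returning `"fpB"`), and anything else as third-type.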
S106: and the receiving-end accelerator feeds back the category of each data block in the unit length and the fingerprint of the reference data block similar to each second-type data block in the unit length to the transmitting-end accelerator.
For example, the receiving-end accelerator constructs a second transmission list according to the category of each data block in the unit length, and sends the second transmission list to the sending-end accelerator. Wherein the second transmission list is used for indicating the category of each data block in the unit length and the fingerprint of each reference data block.
Fig. 7 is a schematic diagram of a second transmission list provided in an embodiment of the present application. The second transmission list may include: header information, a category identifier for each data block in the first sequence, and the fingerprint of the reference data block similar to each second-type data block in the first sequence. The header information records overview information of the contents of the second transmission list and may include, for example and without limitation: the total length of the category identifiers of the data blocks in the unit length, and the number and start position of the fingerprints of the reference data blocks of the second-type data blocks. The start position may be information indicating at which bit of the second transmission list the fingerprint of the first second-type data block's reference data block begins. Fig. 7 illustrates an example in which the unit length includes 100 second-type data blocks, whose similar reference data blocks are labeled reference data block 1 to reference data block 100. In addition, the category identifier of a first-type data block may be the binary number "11", that of a second-type data block the binary number "10", and that of a third-type data block the binary number "00", although the application is not limited thereto.
S107: after receiving the category of each data block in the unit length and the fingerprints of the reference data blocks similar to the second-type data blocks in the unit length fed back by the receiving-end accelerator, the sending-end accelerator transmits data to the receiving-end accelerator according to the following strategies 1-3:
Strategy 1: for a first-type data block, no data is transmitted.
Strategy 2: for a second-type data block, the sending-end accelerator determines whether a reference data block similar to it can be found locally. If so, it performs differential compression on the second-type data block against that reference data block and sends the resulting data to the receiving-end accelerator. If not, it compresses the second-type data block and sends the compressed information to the receiving-end accelerator. Differential compression can be understood as: computing the difference data between the second-type data block and the similar reference data block, and then compressing the difference data.
Strategy 3: for a third-type data block, the sending-end accelerator compresses the data block and sends the compressed information.
For the above strategy 1, since the first type data block is cached in the receiving end accelerator, the sending end accelerator may not send the data block to the receiving end accelerator.
For the above policy 2, since the reference data block similar to the second type data block is cached in the receiving end accelerator, the sending end accelerator only sends the difference data between the second type data block and the reference data block to the receiving end accelerator, so that the receiving end accelerator can recover the second type data block according to the difference data and the reference data block. If the reference data block is not stored in the sending-end accelerator, the step of calculating the difference data cannot be executed, in this case, the sending-end accelerator needs to send the second-type data block to the receiving-end accelerator.
For strategy 3, since the receiving-end accelerator stores neither the third-type data block nor any reference data block similar to it, the sending-end accelerator needs to send the third-type data block to the receiving-end accelerator.
In addition, the differential compression or compression performed by the sending-end accelerator in strategies 2 and 3 further saves bandwidth resources, thereby reducing the information transmission time, i.e., accelerating the information transmission rate. The present application does not limit the algorithms used; the differential compression algorithm may be, for example, but is not limited to, any of the following: x-delta, LZ-delta, and the like, and the compression algorithm may be, for example and without limitation, any of the following: gzip, LZ4, bzip, 7zip, and the like.
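Strategies 1-3 on the sending side can be sketched as below. This is a hedged illustration under stated substitutions: zlib's preset-dictionary feature stands in for a real delta encoder such as x-delta (the reference block acts as the dictionary, so only material absent from the reference is encoded), and plain zlib stands in for gzip/LZ4-style whole-block compression; the category constants and tuple format are illustrative.

```python
import zlib

def diff_compress(block: bytes, reference: bytes) -> bytes:
    # the reference block serves as a preset dictionary: shared content
    # is encoded as back-references, approximating difference data
    co = zlib.compressobj(zdict=reference)
    return co.compress(block) + co.flush()

def diff_decompress(payload: bytes, reference: bytes) -> bytes:
    do = zlib.decompressobj(zdict=reference)
    return do.decompress(payload) + do.flush()

def send(block: bytes, category: int, reference: bytes = None):
    if category == 1:                            # strategy 1: stored remotely, send nothing
        return None
    if category == 2 and reference is not None:  # strategy 2: send only the difference
        return ("diff", diff_compress(block, reference))
    # strategy 2 fallback (reference not found locally) and strategy 3
    return ("full", zlib.compress(block))
```

The fallback branch mirrors the text above: if the sending-end accelerator cannot find the reference block locally, it cannot compute difference data and must send the (compressed) block itself.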
For example, in executing S107, after the sending-end accelerator receives the category of each data block in the unit length and the fingerprints of the reference data blocks similar to the second-type data blocks in the unit length fed back by the receiving-end accelerator, it may construct a third transmission list according to the above strategies and send the third transmission list to the receiving-end accelerator. The third transmission list indicates each piece of differentially compressed information (hereinafter referred to as difference-compressed data) and each piece of compressed information (hereinafter referred to as compressed data).
Fig. 8 is a schematic diagram of a third transmission list provided in an embodiment of the present application. The third transmission list may include: header information, a category identifier for each data block in the first sequence, each piece of difference-compressed data, and each piece of compressed data. The header information records overview information of the contents of the third transmission list and may include, for example and without limitation: the total length of the category identifiers of the data blocks in the unit length, the number and start position of the difference-compressed data, the number and start position of the compressed data, and the like. The start position of the difference-compressed data (or compressed data) may be information indicating at which bit of the third transmission list the first piece of difference-compressed data (or compressed data) begins. Fig. 8 illustrates an example in which 90 second-type data blocks have reference data blocks that can be found in the sending-end accelerator, yielding difference-compressed data 1 to difference-compressed data 90 after differential compression. In addition, the category identifier of a first-type data block may be the binary number "11", that of a second-type data block whose reference data block can be found in the sending-end accelerator "10", that of a second-type data block whose reference data block cannot be found in the sending-end accelerator "01", and that of a third-type data block "00".
It is to be understood that the same binary identifier "10" is used here both for a second-type data block whose reference data block can be found in the sending-end accelerator and for the second-type data blocks determined by the receiving-end accelerator in S105; in actual implementation the two identifiers may differ.
S108: after receiving the information, the receiving-end accelerator performs differential decompression and/or decompression according to the following strategies 4-5:
Strategy 4: for difference-compressed data, determine the fingerprint of the corresponding second-type data block; for example, the receiving-end accelerator may locate a piece of difference-compressed data according to the header information of the third transmission list and determine the fingerprint of the second-type data block it corresponds to. The accelerator then obtains the reference data block similar to that second-type data block and performs differential decompression on the reference data block and the difference-compressed data to recover the second-type data block.
Strategy 5: for compressed data, decompress it to obtain a data block. It will be appreciated that, based on strategies 2 and 3 above, the data block may be a second-type data block or a third-type data block.
S109: the receiving end accelerator assembles each data block obtained in S108 and each first type data block in the unit length to recover the data in the unit length. It can be understood that the receiving end accelerator can restore the data to be transmitted after assembling the data of a plurality of unit lengths.
S110: and the receiving terminal accelerator sends the data to be transmitted to the receiving terminal equipment.
In the data processing method provided by the embodiment of the application, a sending-end accelerator performs information interaction with a receiving-end accelerator to determine that a reference data block similar to one or some data blocks in data to be transmitted is stored in the receiving-end accelerator, and then sends difference data between the data block and the reference data block to the receiving-end accelerator. Therefore, compared with the transmission of the whole data block, the method can save bandwidth resources, thereby reducing the information transmission time, namely accelerating the information transmission speed.
Referring to fig. 1, the sending-end accelerator 2 may include: an accelerator agent (agent) 21, a first-level cache 22, a second-level cache 23, and an interface 24; the receiving-end accelerator 3 may include: an accelerator agent 31, a first-level cache 32, a second-level cache 33, and an interface 34, as shown in fig. 9. The connections between these components are shown in fig. 9.
For any accelerator, the accelerator agent it contains is the control center of that accelerator. For example, with reference to fig. 2, the segmentation and variable-length blocking steps in S102, the calculation step in S103, and the like, which are executed by the sending-end accelerator, may be specifically executed by the accelerator agent 21 in the sending-end accelerator. The determination step in S105, the differential decompression and decompression steps in S108, and the like, which are performed by the receiving-end accelerator, may be specifically executed by the accelerator agent 31 in the receiving-end accelerator.
For any accelerator, the interface communicates with the WAN. It may be based on a proxy protocol and may therefore also be referred to as a proxy interface. For example, when one accelerator sends information to another (e.g., in S104, S106, and S107 above), the accelerator agent of the first accelerator sends the information via that accelerator's interface; likewise, when an accelerator receives information sent by another accelerator, its accelerator agent receives the information via its interface.
For any accelerator, the first-level cache is a non-persistent medium, such as a cache memory (cache), and the second-level cache is a persistent medium, such as a disk. The first-level cache caches some or all of the data blocks stored in the second-level cache, together with the fingerprint and similar fingerprint of each of those data blocks. In an alternative implementation, for ease of management, the first-level cache may include a data cache, a fingerprint cache, and a similar-fingerprint cache, as shown in fig. 9. The data cache holds some or all of the data blocks stored in the second-level cache, the fingerprint cache holds the fingerprints of those data blocks, and the similar-fingerprint cache holds their similar fingerprints.
For any accelerator, typically, the capacity of the second level cache is greater than the capacity of the first level cache. For example, the capacity of the second level cache is 20GB, and the capacity of the first level cache is 30 KB. The first-level cache is arranged in one accelerator, so that the information searching efficiency can be improved, and the cache performance is improved. The second-level cache is arranged in one accelerator, so that the cache capacity can be increased, the hit rate of information search is improved, bandwidth resources are saved, and the information transmission time is shortened.
In some embodiments of the present application, when an accelerator (which may be a sending-end accelerator or a receiving-end accelerator) finds whether a data block is stored in the accelerator (e.g., the above-mentioned T2, or S107), first, a fingerprint of the data block is found in a first-level cache of the accelerator; if the fingerprint of the data block can not be searched in the first-level cache, the data block is searched in the second-level cache of the accelerator. The data block may be any one of the data blocks to be transmitted, or a reference data block similar to any one of the second-type data blocks in the data to be transmitted. In this way, since the probability that the data block in the first-level cache is hit is high, the data block can be usually found in the first-level cache, and therefore, the information search efficiency can be improved.
In some embodiments of the present application, when the receiving-end accelerator searches, according to the similar fingerprint of a data block, whether a reference data block similar to that data block is stored locally, it may first search the first-level cache for the similar fingerprint; if the similar fingerprint is found, the data block is determined to be a second-type data block. If the similar fingerprint is not found in the first-level cache, one of the following two implementations may be performed, for example and without limitation:
one implementation may be: if the similar fingerprint is not found in the first-level cache, continuing to find whether the similar fingerprint is stored in the second-level cache, and when the similar fingerprint is found, judging that the data block is a second-class data block; when the data block is not searched, the information indicating that the reference data block similar to the data block is not stored in the receiving-end accelerator is fed back to the transmitting-end accelerator. In this way, since the probability that the data block in the first-level cache is hit is high, the reference data block can be usually found in the first-level cache, and therefore, the information search efficiency can be improved.
Another implementation may be: if the similar fingerprint is not found in the first-level cache, directly feeding back information indicating that the reference data block similar to the data block is not stored in the receiving-end accelerator to the sending-end accelerator. It should be noted that, on one hand, when there are many similar fingerprints stored in the second-level cache, the time consumed by the process of searching for one similar fingerprint in the second-level cache by the accelerator at the receiving end is usually long; on the other hand, from the analysis hereinbefore, it can be considered that: if the similar fingerprint cannot be found in the first-level cache, the probability of finding the similar fingerprint in the second-level cache is smaller. Therefore, in this case, the receiving-end accelerator may directly feed back information indicating that no reference data block similar to the data block is stored in the receiving-end accelerator to the sending-end accelerator, thereby saving the time consumed for searching for similar fingerprints.
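The two implementations differ only in whether a first-level miss falls through to the second level. A hedged sketch, with plain dictionaries standing in for the two cache media and a flag selecting between the implementations:

```python
def lookup_similar(sim_fp, l1, l2, check_l2=True):
    """Return the reference-block fingerprint for a similar fingerprint, or None.
    check_l2=True models the first implementation (fall through to the
    second-level cache); check_l2=False models the second (an L1 miss is
    immediately reported as 'no similar reference stored')."""
    if sim_fp in l1:
        return l1[sim_fp]        # hit in the non-persistent first level
    if check_l2 and sim_fp in l2:
        return l2[sim_fp]        # hit in the persistent second level
    return None                  # fed back as "no similar reference data block"
```

The second implementation trades a small chance of a missed second-level hit for never paying the (potentially long) second-level search, matching the trade-off analyzed above.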
The information stored in the first level cache and the second level cache can be updated. For the first-level cache, the cached data blocks may be the data blocks that are accessed recently, and/or the data blocks before and after the data blocks that are accessed recently. For the second level cache, the cached data block may be a data block with an access number greater than or equal to a threshold value and/or a data block that has been accessed recently. Generally, for any level of cache, when a data block is cached in the cache, the fingerprint and similar fingerprint of the data block are cached therewith; when a data block is deleted from the cache, the fingerprint of the data block and the similar fingerprint are deleted.
For example, after the accelerator agent of the sending-end accelerator receives the data to be transmitted sent by the sending-end device and calculates the fingerprint and similar fingerprint of each data block in one unit length of data, it caches the data blocks into which that unit length was divided, along with their fingerprints and similar fingerprints, in the first-level cache and the second-level cache. If the free space of either level of cache (the first-level cache or the second-level cache) is insufficient to hold these data blocks and their fingerprints and similar fingerprints, the data block stored and/or accessed earliest in that level of cache is deleted, together with its fingerprint and similar fingerprint. If the free space of that level of cache is sufficient, the data blocks and their fingerprints and similar fingerprints are added to it directly.
As another example, in some embodiments of the present application, for any accelerator, the second-level cache includes one or more containers, and each container includes at least two data blocks and the fingerprint and similar fingerprint of each of the at least two data blocks. If the accelerator finds a data block in the second-level cache, the accelerator caches the container in which the data block is located in the first-level cache. The data block may be any one of the data blocks in the data to be transmitted, or a reference data block similar to any one of the second-type data blocks in the data to be transmitted. As shown in fig. 10, assuming that the second-level cache includes a plurality of containers (containers 1 to 3 are shown in fig. 10), if a data block is found in container 1 of the second-level cache in a certain process (for example, in T2 or S107 above), container 1 is cached in the first-level cache. Fig. 10 is drawn based on fig. 9, and the first-level cache and the second-level cache in fig. 10 may belong to the sending-end accelerator or to the receiving-end accelerator. It should be noted that, because the contents of consecutive data blocks are correlated, once one data block is hit, the probability that the several data blocks before and after it will also be hit is high. Caching the whole container therefore improves the subsequent hit rate.
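The container-promotion step above can be sketched as follows; the container layout (a list of block/fingerprint tuples) and the function name are assumptions made for illustration.

```python
# Illustrative container promotion: when a lookup hits a data block inside an
# L2 container, the whole container (all blocks plus their fingerprints) is
# copied into the L1 cache, exploiting the correlation between the contents of
# consecutive data blocks.

def promote_on_hit(l1: dict, l2_containers: dict, wanted_fp: str):
    """Find wanted_fp in the L2 containers; on a hit, cache the whole
    container in L1 and return the container id, else return None."""
    for cid, container in l2_containers.items():
        if any(fp == wanted_fp for _, fp, _ in container):
            for block, fp, sfp in container:  # promote every block it holds
                l1[fp] = (block, sfp)
            return cid
    return None

l2 = {"container1": [(b"blk1", "fp1", "sf1"), (b"blk2", "fp2", "sf2")]}
l1 = {}
assert promote_on_hit(l1, l2, "fp2") == "container1"
assert "fp1" in l1 and "fp2" in l1  # neighbouring blocks were promoted too
```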
The scheme provided by the embodiments of the present application has mainly been introduced from the perspective of the method. To implement the above functions, the data processing device includes corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments of the present application, the functional modules of a data processing device (which may be the sending-end accelerator or the receiving-end accelerator described above) may be divided according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and is merely a logical function division; there may be other division manners in actual implementation.
As shown in fig. 11, a data processing device 11 according to an embodiment of the present application is provided. The data processing device 11 may be the sending-end accelerator described above, a sending-end device, or a combination of a sending-end device and a sending-end accelerator. The data processing device 11 shown in fig. 11 may include: a calculation unit 1101, a sending unit 1102, a receiving unit 1103, and a search unit 1104. The calculation unit 1101 is configured to calculate a similar fingerprint of data to be transmitted; the similar fingerprint of the data to be transmitted contains a similar fingerprint of a first data block, and the first data block is one data block in the data to be transmitted. The sending unit 1102 is configured to send the similar fingerprint of the data to be transmitted to a second device, where the similar fingerprint of the data to be transmitted is used to find whether a reference data block similar to the data to be transmitted is stored in the second device. The receiving unit 1103 is configured to receive a fingerprint of the reference data block sent by the second device, where the fingerprint of the reference data block comprises a fingerprint of a first reference data block, and the similar fingerprint of the first reference data block is the same as the similar fingerprint of the first data block. The search unit 1104 is configured to find the first reference data block in the data processing device 11 according to the fingerprint of the first reference data block. The sending unit 1102 is further configured to send data to the second device based on the fingerprint of the reference data block, where the data comprises difference data between the first reference data block and the first data block. For example, referring to fig. 2, the data processing device 11 may specifically be a sending-end accelerator, and the second device may specifically be a receiving-end accelerator.
The first data block may be a second-type data block as described above. The calculation unit 1101 may be configured to perform the step of calculating similar fingerprints in S103. The sending unit 1102 may be configured to perform the step of sending the similar fingerprint in S104 and the step of sending the data in S107. The receiving unit 1103 may be configured to perform the step of receiving the fingerprint of the reference data block in S106.
In one possible design, the calculation unit 1101 may be specifically configured to: perform a hash operation on the data blocks of the data to be transmitted by using a locality-sensitive hashing algorithm to obtain the similar fingerprints of the data blocks.
In one possible design, the calculation unit 1101 may be specifically configured to: segment the data to be transmitted to obtain data blocks. For each data block, the calculation unit 1101 performs the following operations: extracting at least one sub data block from the data block; performing a hash operation on the at least one sub data block by using m hash algorithms to obtain m hash sequences, or performing a hash operation on the at least one sub data block by using 1 hash algorithm to obtain 1 hash sequence, where m is an integer greater than or equal to 2; and merging the maximum value of each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block, or merging the minimum value of each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block. For example, the calculation unit 1101 may be specifically configured to execute each step in the process shown in fig. 4.
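The merge-the-minima variant resembles a min-hash scheme, and can be sketched as follows. The patent does not fix the hash family; deriving m hash functions by salting SHA-256, the function name, and the parameter defaults are all assumptions for illustration.

```python
import hashlib

def similar_fingerprint(block: bytes, m: int = 4, sub_size: int = 8) -> tuple:
    """Illustrative min-hash-style similar fingerprint: split the block into
    sub data blocks, hash every sub block with m salted hash functions to get
    m hash sequences, and merge the minimum value of each sequence."""
    subs = [block[i:i + sub_size] for i in range(0, len(block), sub_size)]
    fingerprint = []
    for salt in range(m):  # m independent hash functions, derived by salting
        seq = [int.from_bytes(hashlib.sha256(bytes([salt]) + s).digest()[:8], "big")
               for s in subs]
        fingerprint.append(min(seq))  # minimum value of this hash sequence
    return tuple(fingerprint)

# Two blocks that differ in a single byte keep most sub data blocks identical,
# so their similar fingerprints tend to share components, whereas an exact
# whole-block hash would differ completely.
fp_a = similar_fingerprint(b"hello world, this is data block one.")
assert len(fp_a) == 4
assert fp_a == similar_fingerprint(b"hello world, this is data block one.")
```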
In one possible design, the data processing device 11 may further include: a difference compression unit 1105, configured to perform difference compression on the first reference data block and the first data block by using a difference compression algorithm. For example, the difference compression unit 1105 may be specifically configured to perform the difference compression step in S107.
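A difference compression step of this kind can be sketched as follows. The patent does not specify the delta algorithm; encoding the first data block as copy ranges from the reference block plus literal inserts via `difflib.SequenceMatcher` is an assumed stand-in.

```python
from difflib import SequenceMatcher

def delta_encode(reference: bytes, target: bytes) -> list:
    """Encode target as copy ranges from reference plus literal inserts
    (illustrative delta format; not the patent's algorithm)."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=reference, b=target).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # bytes shared with the reference
        else:
            ops.append(("insert", target[j1:j2]))  # literal difference data
    return ops

def delta_decode(reference: bytes, ops: list) -> bytes:
    """Rebuild the target data block from the reference block and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += reference[start:start + length]
        else:
            out += op[1]
    return bytes(out)

ref = b"The quick brown fox jumps over the lazy dog"
tgt = b"The quick brown cat jumps over the lazy dog"
delta = delta_encode(ref, tgt)
assert delta_decode(ref, delta) == tgt
# Only the differing bytes travel as literals; the rest are copy ranges,
# which is what makes sending difference data cheaper than sending the block.
```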
In one possible design, the similar fingerprint of the data to be transmitted further includes a similar fingerprint of a second data block, and the second data block is another data block in the data to be transmitted; the fingerprint of the reference data block does not contain the fingerprint of the second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also contains a second data block. The second data block may be a third type data block as described above.
In one possible design, the data processing device 11 further includes a first-level cache and a second-level cache, where the first-level cache is a non-persistent medium and the second-level cache is a persistent medium, and the first-level cache is configured to cache some or all of the data blocks stored in the second-level cache, together with the fingerprints and similar fingerprints of those data blocks. In this case, the search unit 1104 may specifically be configured to: search for the fingerprint of the first reference data block in the first-level cache; and if the fingerprint of the first reference data block cannot be found in the first-level cache, search for it in the second-level cache.
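The two-level search order described here can be sketched as a minimal Python fallback (the dictionary caches and function name are illustrative assumptions):

```python
# Illustrative two-level lookup: try the fast, non-persistent first-level
# cache first, and fall back to the persistent second-level cache only on a
# miss, mirroring the search order of the search unit described above.

def find_fingerprint(l1_cache: dict, l2_cache: dict, ref_fp):
    """Return the entry for ref_fp from L1 if present, else from L2,
    else None if the fingerprint is stored at neither level."""
    if ref_fp in l1_cache:
        return l1_cache[ref_fp]
    return l2_cache.get(ref_fp)

assert find_fingerprint({"fp1": b"blk1"}, {"fp2": b"blk2"}, "fp1") == b"blk1"
assert find_fingerprint({"fp1": b"blk1"}, {"fp2": b"blk2"}, "fp2") == b"blk2"
assert find_fingerprint({}, {}, "fp3") is None
```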
In one possible design, the second-level cache includes one or more containers, each container being a set of at least two data blocks and the fingerprint and similar fingerprint of each of the at least two data blocks, where the contents of the at least two data blocks in each container are correlated. In this case, the search unit 1104 may be further configured to: if a data block is found in the second-level cache, cache the container in which the data block is located into the first-level cache. For example, the search unit 1104 may be used to perform the steps in the scenario shown in fig. 10.
In a possible design, the sending unit 1102 and the receiving unit 1103 may specifically correspond to the interface 24 in fig. 9. Some or all of the calculation unit 1101, the search unit 1104, and the difference compression unit 1105 may correspond to the accelerator agent 21 in fig. 9.
As shown in fig. 12, a data processing device 12 according to an embodiment of the present application is provided. The data processing device 12 may be the receiving-end accelerator described above, a receiving-end device, or a combination of a receiving-end device and a receiving-end accelerator. The data processing device 12 shown in fig. 12 may include: a receiving unit 1201, a search unit 1202, and a sending unit 1203. The receiving unit 1201 is configured to receive a similar fingerprint of data to be transmitted, where the similar fingerprint of the data to be transmitted includes a similar fingerprint of a first data block, and the first data block is one data block of the data to be transmitted. The search unit 1202 is configured to search, according to the similar fingerprint of the data to be transmitted, for a reference data block that is similar to the data to be transmitted and is stored in the data processing device 12, where the reference data block comprises a first reference data block, and the similar fingerprint of the first reference data block is the same as the similar fingerprint of the first data block. The sending unit 1203 is configured to send a fingerprint of the reference data block to the first device, where the fingerprint of the reference data block comprises the fingerprint of the first reference data block, and the fingerprint of the reference data block is used by the first device to send data to the data processing device 12, the data containing difference data between the first reference data block and the first data block. The receiving unit 1201 is further configured to receive the data sent by the first device. For example, referring to fig. 2, the data processing device 12 may specifically be a receiving-end accelerator, and the first device may specifically be a sending-end accelerator. The first data block may be a second-type data block as described above.
The receiving unit 1201 may specifically be configured to perform the step of receiving similar fingerprints in S104. The sending unit 1203 may specifically be configured to execute the step of sending the fingerprint of the reference data block in S106.
In one possible design, the receiving unit 1201 may be further configured to receive a fingerprint of the data to be transmitted sent by the first device, where the fingerprint of the data to be transmitted includes the fingerprint of the first data block. In this case, the search unit 1202 may be further configured to: when it is found, according to the fingerprint of the first data block, that the first data block is not stored in the data processing device 12, search, according to the similar fingerprint of the first data block, whether the first reference data block is stored in the data processing device 12. In connection with fig. 6, the receiving unit 1201 may be configured to perform T2, T4, and the like.
In one possible design, the similar fingerprint of the data to be transmitted includes a similar fingerprint of a second data block, the second data block being another data block of the data to be transmitted; the fingerprint of the reference data block does not contain the fingerprint of the second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also contains a second data block. The second data block may be a third type data block as described above.
In a possible design, the receiving unit 1201 and the sending unit 1203 may specifically correspond to the interface 34 in fig. 9. The search unit 1202 may correspond to the accelerator agent 31 in fig. 9.
Since the data processing device provided in the embodiments of the present application may be configured to execute the data processing method, reference may be made to the method embodiments for the technical effects that can be obtained; details are not repeated herein.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art will recognize that in one or more of the examples described above, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over, as one or more instructions or code, a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above embodiments further describe the objectives, technical solutions, and advantages of the present application in detail. It should be understood that the above embodiments are only specific embodiments of the present application and are not intended to limit the scope of the present application.

Claims (18)

1. A method of data processing, the method comprising:
the first device calculates similar fingerprints of data to be transmitted, comprising: the first device divides the data to be transmitted to obtain data blocks; for each data block, the first device performs the following operations: extracting at least one sub data block from the data block; performing a hash operation on the at least one sub data block by using m hash algorithms to obtain m hash sequences, or performing a hash operation on the at least one sub data block by using 1 hash algorithm to obtain 1 hash sequence, m being an integer greater than or equal to 2; and merging the maximum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block, or merging the minimum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block; wherein the similar fingerprint of the data to be transmitted comprises a similar fingerprint of a first data block, the first data block being one data block in the data to be transmitted; and the similar fingerprint of the first data block is identification information obtained based on specific characteristic information of the first data block and used for marking the first data block;
the first device sends the similar fingerprint of the data to be transmitted to a second device, wherein the similar fingerprint of the data to be transmitted is used for searching whether a reference data block similar to the data to be transmitted is stored in the second device;
the first device receives the fingerprint of the reference data block sent by the second device; wherein the fingerprint of the reference data block comprises a fingerprint of a first reference data block; the similar fingerprint of the first reference data block is the same as the similar fingerprint of the first data block;
the first device finds the first reference data block in the first device according to the fingerprint of the first reference data block;
the first device sending data to the second device based on the fingerprint of the reference data block; wherein the data comprises difference data between the first reference data block and the first data block.
2. The method of claim 1, further comprising:
the first device performs differential compression on the first reference data block and the first data block using a differential compression algorithm.
3. The method of claim 1, wherein the similar fingerprints of the data to be transmitted further include similar fingerprints of a second data chunk, the second data chunk being another data chunk of the data to be transmitted; the fingerprint of the reference data block does not include a fingerprint of a second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also includes the second data block.
4. The method of claim 1, wherein the first device comprises a first level cache and a second level cache, wherein the first level cache is a non-persistent medium, wherein the second level cache is a persistent medium, wherein the first level cache is configured to cache a portion or all of the data chunks stored in the second level cache, and fingerprints and similar fingerprints of the portion or all of the data chunks; the method further comprises the following steps:
the first device searches the fingerprint of the first reference data block in the first-level cache; and if the fingerprint of the first reference data block cannot be found in the first-level cache, finding the fingerprint of the first reference data block in the second-level cache.
5. The method of claim 4, wherein the second level cache comprises one or more containers, each container being a set of at least two data chunks and a fingerprint and similar fingerprints for each of the at least two data chunks, there being a correlation between contents of at least two data chunks in each container; the method further comprises the following steps:
and if one data block is found in the second-level cache, the first device caches the container in which the data block is located in the first-level cache.
6. A method of data processing, the method comprising:
a second device receives similar fingerprints of data to be transmitted sent by a first device, wherein the similar fingerprints of the data to be transmitted comprise a similar fingerprint of a first data block, and the first data block is one data block in the data to be transmitted; the similar fingerprint of the first data block is identification information obtained based on specific characteristic information of the first data block and used for marking the first data block; and the similar fingerprints of the data to be transmitted are calculated by the first device as follows: the first device divides the data to be transmitted to obtain data blocks; for each data block, the first device performs the following operations: extracting at least one sub data block from the data block; performing a hash operation on the at least one sub data block by using m hash algorithms to obtain m hash sequences, or performing a hash operation on the at least one sub data block by using 1 hash algorithm to obtain 1 hash sequence, m being an integer greater than or equal to 2; and merging the maximum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block, or merging the minimum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block;
the second equipment finds out a reference data block similar to the data to be transmitted and stored in the second equipment according to the similar fingerprint of the data to be transmitted; wherein the reference data block comprises a first reference data block whose similar fingerprint is the same as the similar fingerprint of the first data block;
the second device sending a fingerprint of a reference data block to the first device; wherein the fingerprint of the reference data block comprises a fingerprint of the first reference data block; the fingerprint of the reference data block is used for the first device to send data to the second device, the data including difference data between the first reference data block and the first data block;
and the second equipment receives the data sent by the first equipment.
7. The method of claim 6, further comprising:
the second device receives the fingerprint of the data to be transmitted, which is sent by the first device, wherein the fingerprint of the data to be transmitted comprises the fingerprint of the first data block;
and when the second device finds that the first data block is not stored in the second device according to the fingerprint of the first data block, the second device finds whether the first reference data block is stored in the second device according to the similar fingerprint of the first data block.
8. The method according to claim 6 or 7, characterized in that the similar fingerprint of the data to be transmitted comprises a similar fingerprint of a second data block, the second data block being another data block of the data to be transmitted; the fingerprint of the reference data block does not include a fingerprint of a second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also includes the second data block.
9. A data processing apparatus, characterized in that the apparatus comprises:
a calculation unit, configured to calculate similar fingerprints of data to be transmitted, comprising: segmenting the data to be transmitted to obtain data blocks; for each data block, the calculation unit performs the following operations: extracting at least one sub data block from the data block; performing a hash operation on the at least one sub data block by using m hash algorithms to obtain m hash sequences, or performing a hash operation on the at least one sub data block by using 1 hash algorithm to obtain 1 hash sequence, m being an integer greater than or equal to 2; and merging the maximum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block, or merging the minimum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block; wherein the similar fingerprint of the data to be transmitted comprises a similar fingerprint of a first data block, the first data block being one data block in the data to be transmitted; and the similar fingerprint of the first data block is identification information obtained based on specific characteristic information of the first data block and used for marking the first data block;
a sending unit, configured to send a similar fingerprint of the data to be transmitted to a second device, where the similar fingerprint of the data to be transmitted is used to find whether a reference data block similar to the data to be transmitted is stored in the second device;
a receiving unit, configured to receive a fingerprint of a reference data block sent by the second device; wherein the fingerprint of the reference data block comprises a fingerprint of a first reference data block; the similar fingerprint of the first reference data block is the same as the similar fingerprint of the first data block;
a searching unit, configured to search the first reference data block in the device according to the fingerprint of the first reference data block;
the sending unit is further configured to send data to the second device based on the fingerprint of the reference data block; wherein the data comprises difference data between the first reference data block and the first data block.
10. The apparatus of claim 9, further comprising:
a difference compression unit for performing difference compression on the first reference data block and the first data block by using a difference compression algorithm.
11. The apparatus of claim 9, wherein the similar fingerprints of the data to be transmitted further include similar fingerprints of a second data chunk, the second data chunk being another data chunk of the data to be transmitted; the fingerprint of the reference data block does not include a fingerprint of a second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also includes the second data block.
12. The apparatus of claim 9, further comprising a first level cache and a second level cache, the first level cache being a non-persistent medium and the second level cache being a persistent medium, the first level cache for caching some or all of the data chunks stored in the second level cache and fingerprints and similar fingerprints for the some or all of the data chunks;
the search unit is specifically configured to: searching the fingerprint of the first reference data block in the first-level cache; and if the fingerprint of the first reference data block cannot be found in the first-level cache, finding the fingerprint of the first reference data block in the second-level cache.
13. The apparatus of claim 12, wherein the second level cache comprises one or more containers, each container being a set of at least two data chunks and a fingerprint and similar fingerprints for each of the at least two data chunks, there being a correlation between contents of at least two data chunks in each container;
the lookup unit is further configured to: and if one data block is found in the second-level cache, caching the container in which the data block is located into the first-level cache.
14. A data processing apparatus, characterized in that the apparatus comprises:
a receiving unit, configured to receive similar fingerprints of data to be transmitted sent by a first device, wherein the similar fingerprints of the data to be transmitted comprise a similar fingerprint of a first data block, and the first data block is one data block in the data to be transmitted; the similar fingerprint of the first data block is identification information obtained based on specific characteristic information of the first data block and used for marking the first data block; and the similar fingerprints of the data to be transmitted are calculated by the first device as follows: the first device divides the data to be transmitted to obtain data blocks; for each data block, the first device performs the following operations: extracting at least one sub data block from the data block; performing a hash operation on the at least one sub data block by using m hash algorithms to obtain m hash sequences, or performing a hash operation on the at least one sub data block by using 1 hash algorithm to obtain 1 hash sequence, m being an integer greater than or equal to 2; and merging the maximum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block, or merging the minimum value in each of the m hash sequences and taking the hash sequence obtained after merging as the similar fingerprint of the data block;
the searching unit is used for searching a reference data block which is stored in the equipment and is similar to the data to be transmitted according to the similar fingerprint of the data to be transmitted; wherein the reference data block comprises a first reference data block whose similar fingerprint is the same as the similar fingerprint of the first data block;
a sending unit, configured to send a fingerprint of a reference data block to the first device; wherein the fingerprint of the reference data block comprises a fingerprint of the first reference data block; the fingerprint of the reference data block is used for the first device to send data to the device, the data containing difference data between the first reference data block and the first data block;
the receiving unit is further configured to receive the data sent by the first device.
15. The apparatus of claim 14,
the receiving unit is further configured to receive a fingerprint of the data to be transmitted, where the fingerprint of the data to be transmitted includes a fingerprint of the first data block, and the fingerprint of the data to be transmitted is sent by the first device;
the searching unit is further configured to, when the first data block is found not to be stored in the device according to the fingerprint of the first data block, find whether the first reference data block is stored in the device according to the similar fingerprint of the first data block.
16. The apparatus according to claim 14 or 15, wherein the similar fingerprint of the data to be transmitted comprises a similar fingerprint of a second data block, the second data block being another data block of the data to be transmitted; the fingerprint of the reference data block does not include a fingerprint of a second reference data block; the similar fingerprint of the second reference data block is the same as the similar fingerprint of the second data block; the data also includes the second data block.
17. A data processing apparatus, characterized by comprising: memory and a processor, wherein the memory is for storing a computer program that, when executed by the processor, causes the method of any of claims 1 to 8 to be performed.
18. A computer-readable storage medium, on which a computer program is stored, which, when run on a computer, causes the method according to any one of claims 1 to 8 to be performed.
CN201711167866.1A 2017-11-21 2017-11-21 Data processing method and equipment Active CN108134775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711167866.1A CN108134775B (en) 2017-11-21 2017-11-21 Data processing method and equipment

Publications (2)

Publication Number Publication Date
CN108134775A CN108134775A (en) 2018-06-08
CN108134775B true CN108134775B (en) 2020-10-09

Family

ID=62388793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711167866.1A Active CN108134775B (en) 2017-11-21 2017-11-21 Data processing method and equipment

Country Status (1)

Country Link
CN (1) CN108134775B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109309670B (en) * 2018-09-07 2021-02-12 深圳市网心科技有限公司 Data stream decoding method and system, electronic device and computer readable storage medium
CN111064471B (en) * 2018-10-16 2023-04-11 阿里巴巴集团控股有限公司 Data processing method and device and electronic equipment
CN109710502B (en) * 2018-12-19 2022-06-14 苏州科达科技股份有限公司 Log transmission method, device and storage medium
WO2021012162A1 (en) * 2019-07-22 2021-01-28 华为技术有限公司 Method and apparatus for data compression in storage system, device, and readable storage medium
CN112416694A (en) * 2019-08-20 2021-02-26 中国电信股份有限公司 Information processing method, system, client and computer readable storage medium
CN112988041A (en) * 2019-12-18 2021-06-18 华为技术有限公司 Data storage method in storage system and related equipment
CN113868013A (en) * 2020-06-30 2021-12-31 华为技术有限公司 Data transmission method, system, device, equipment and medium
CN114662160B (en) * 2022-05-25 2022-08-23 成都易我科技开发有限责任公司 Digital summarization method, system and digital summarization method in network transmission

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833486A (en) * 2010-04-07 2010-09-15 山东高效能服务器和存储研究院 Method for designing remote backup and recovery system
CN102185889A (en) * 2011-03-28 2011-09-14 北京邮电大学 Data deduplication method based on internet small computer system interface (iSCSI)
CN102495894A (en) * 2011-12-12 2012-06-13 成都市华为赛门铁克科技有限公司 Method, device and system for searching repeated data
CN103020174A (en) * 2012-11-28 2013-04-03 华为技术有限公司 Similarity analysis method, device and system
CN103858125A (en) * 2013-12-17 2014-06-11 华为技术有限公司 Repeating data processing methods, devices, storage controller and storage node


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a Data Disaster-Recovery System Based on Data Deduplication Technology; Liao Haisheng; China Masters' Theses Full-text Database, Information Science and Technology; 2011-12-15; body text, page 11 line 17 to page 13 line 21, and page 25 line 1 to the last paragraph of page 51 *

Also Published As

Publication number Publication date
CN108134775A (en) 2018-06-08

Similar Documents

Publication Publication Date Title
CN108134775B (en) Data processing method and equipment
US8954392B2 (en) Efficient de-duping using deep packet inspection
CN102684827B (en) Data processing method and data processing equipment
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
US20220382458A1 (en) System and method for data compaction and security using multiple encoding algorithms
US11609882B2 (en) System and method for random-access manipulation of compacted data files
CN106980680B (en) Data storage method and storage device
CN106990914B (en) Data deleting method and device
US9935889B2 (en) Communication apparatus and method
US8868584B2 (en) Compression pattern matching
CN111198855B (en) Log data processing method and device
CN115174561B (en) File segmented transmission method and system
CN116010362A (en) File storage and file reading method, device and system
CN114466004A (en) File transmission method, system, electronic equipment and storage medium
CN113900990A (en) File fragment storage method, device, equipment and storage medium
JP7024578B2 (en) Communication device, communication control method, and communication control program
US20170048303A1 (en) On the fly statistical delta differencing engine
CN106998361B (en) Data transmission method and system
CN113726832B (en) Data storage method, device, system and equipment of distributed storage system
US11853262B2 (en) System and method for computer data type identification
CN112765421B (en) Data retrieval method and device and terminal equipment
CN116346757A (en) Message transmission method, device, equipment and storage medium
US20230315288A1 (en) System and method for data compaction and security using multiple encoding algorithms with pre-coding and complexity estimation
CN117255990A (en) Data management method in data storage system, data index module and data storage system
CN113515491A (en) Cloud storage file level duplication removing method based on double-layer Bloom filter

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant