WO2017096532A1 - Data saving method and apparatus - Google Patents

Data saving method and apparatus

Publication number
WO2017096532A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
block
data
saved
blocks
Prior art date
Application number
PCT/CN2015/096696
Other languages
English (en)
French (fr)
Inventor
李夫路
张程伟
徐培
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN201580056658.7A (CN107046812B)
Priority to EP15910007.2A (EP3376393B1)
Priority to PCT/CN2015/096696 (WO2017096532A1)
Publication of WO2017096532A1
Priority to US16/002,585 (US20180285014A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device

Definitions

  • The embodiments of the present invention relate to the field of computers, and in particular, to a data saving method and apparatus.
  • To slow the growth of storage consumption, reduce the data footprint and cost, and make the most of existing resources, deduplication technology has become an active research topic.
  • Deduplication technology (hereinafter referred to as "deduplication") optimizes the utilization of storage space by eliminating identical files or data blocks distributed across the storage system.
  • Deduplication can also reduce the amount of data transmitted over the network, thereby reducing energy consumption and network costs and saving a large amount of network bandwidth during data replication.
  • The deduplication module in such a system compares data content to find redundant, recurring data units. For a recurring unit it only needs to record a reference to the location of the identical data already stored, so that in the end only non-repeating data is written to the storage medium.
  • However, deduplication algorithms with a better compression effect require a relatively large amount of computation, while algorithms with a small amount of computation achieve a relatively poor compression effect.
  • An embodiment of the present invention provides a data saving method and apparatus that deduplicate and compress the data block to be saved, thereby reducing the storage space required to store it.
  • The application provides a data saving method, including:
  • dividing the data block to be saved into N to-be-saved sub-blocks of the same size, where the N to-be-saved sub-blocks correspond to N position identifiers, each to-be-saved sub-block corresponds to one position identifier, and N is a positive integer greater than 1;
  • selecting one of at least two saved comparison data blocks as a reference data block, where the reference data block includes N reference sub-blocks, the N reference sub-blocks correspond to N position identifiers, and each reference sub-block corresponds to one position identifier;
  • comparing the to-be-saved sub-block corresponding to the i-th position identifier with the reference sub-block corresponding to the i-th position identifier, and determining a first sub-block from the N to-be-saved sub-blocks, where i is a positive integer increasing from 1 to N and the first sub-block is a to-be-saved sub-block that differs from the reference sub-block it was compared with;
  • selecting a representative sub-block for the first sub-block; XORing the data of the first sub-block with the data of the representative sub-block; compressing the result of the XOR operation using run-length encoding; and saving the compression result and the position information of the first sub-block.
  • the size of the to-be-stored data block and the reference data block are the same, the size of the to-be-saved sub-block and the reference sub-block are the same, and the order of the sub-block in the data block can be used as the location identifier. For example, in the order of position, the position of the i-th sub-data block to be saved is identified as "i". Comparing the to-be-saved sub-block and the reference sub-block corresponding to the same location identifier, that is, comparing the to-be-saved sub-block and the reference sub-block in the same location.
  • The method further includes: by comparing the to-be-saved sub-blocks and the reference sub-blocks that have the same position identifier, determining a second sub-block from the N to-be-saved sub-blocks, where the second sub-block is a to-be-saved sub-block that is identical to the reference sub-block it was compared with; and performing a deduplication operation on the second sub-block.
  • The data block to be saved is divided into sub-blocks of finer granularity, and the sub-blocks at the same position in the data block to be saved and in the reference data block are compared one by one. Identical sub-blocks are thus deduplicated with a small amount of computation, while differing sub-blocks are XORed with reference sub-blocks and the XOR result is compressed with run-length encoding, which greatly reduces the storage space required to record the data block to be saved.
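  • The save path described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the (count, value) run-length format with runs capped at 255, the dictionary used as the saved record, and the `restore_block` read path are all assumptions made for the example. Sub-blocks are compared directly by content here, and each differing sub-block uses the same-position reference sub-block as its representative.

```python
def split(block: bytes, n: int) -> list:
    """Divide a block into n equal-size sub-blocks (position i is list index i-1)."""
    size, rem = divmod(len(block), n)
    if rem:
        raise ValueError("block length must be a multiple of n")
    return [block[k * size:(k + 1) * size] for k in range(n)]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rle_encode(data: bytes) -> bytes:
    """Assumed RLE format: repeated (run length 1..255, byte value) pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(enc: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    for k in range(0, len(enc), 2):
        out += bytes([enc[k + 1]]) * enc[k]
    return bytes(out)


def save_block(block: bytes, reference: bytes, n: int) -> dict:
    """Return {position: RLE(XOR)} for sub-blocks that differ from the reference.

    Identical sub-blocks (the "second" sub-blocks) are deduplicated: nothing
    is stored for them beyond the implicit link to the reference block.
    """
    record = {}
    pairs = zip(split(block, n), split(reference, n))
    for position, (sub, ref) in enumerate(pairs, start=1):
        if sub != ref:  # a "first" sub-block
            record[position] = rle_encode(xor_bytes(sub, ref))
    return record


def restore_block(record: dict, reference: bytes, n: int) -> bytes:
    """Rebuild the original block from the reference block and the record."""
    subs = split(reference, n)
    for position, enc in record.items():
        subs[position - 1] = xor_bytes(rle_decode(enc), subs[position - 1])
    return b"".join(subs)
```

  • With a 1024-byte block that differs from its reference in a single byte and N = 4, only one sub-block is recorded, and its XOR result is almost entirely zeros, so it encodes to a handful of bytes under the assumed format.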
  • Comparing the to-be-saved sub-block corresponding to the i-th position identifier with the reference sub-block corresponding to the i-th position identifier includes: comparing the fingerprint of the to-be-saved sub-block with the fingerprint of the reference sub-block. If the fingerprints are the same, the two sub-blocks are the same; if the fingerprints differ, the two sub-blocks are different.
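  • A fingerprint comparison of this kind might look like the sketch below. The text does not name a fingerprint function; SHA-256 from Python's `hashlib` is used purely as an assumption, and `fingerprint` and `classify` are hypothetical names.

```python
import hashlib


def fingerprint(sub_block: bytes) -> bytes:
    """Fingerprint of one sub-block (SHA-256 is an assumed choice)."""
    return hashlib.sha256(sub_block).digest()


def classify(sub_fps: list, ref_fps: list) -> tuple:
    """Compare fingerprints position by position.

    Returns (first_positions, second_positions) with 1-based positions:
    "first" sub-blocks differ from the reference, "second" ones match it.
    """
    first, second = [], []
    for position, (fp, ref_fp) in enumerate(zip(sub_fps, ref_fps), start=1):
        (second if fp == ref_fp else first).append(position)
    return first, second
```

  • Because the fingerprints of saved reference sub-blocks can be stored once and reused, each comparison reduces to comparing two short digests rather than two full sub-blocks.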
  • Selecting one of the at least two saved comparison data blocks as the reference data block includes: selecting, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the data block to be saved. Further, among comparison data blocks that have the same number of sub-blocks identical to those of the data block to be saved, the one with the fewest clusters of differing sub-blocks may be selected as the reference data block, where a cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions whose neighbouring sub-blocks are identical sub-blocks.
  • The number of identical sub-blocks measures the similarity between each comparison data block and the data block to be saved: the comparison data block with the most sub-blocks identical to those of the data block to be saved is the most similar one.
  • Among the comparison data blocks with the highest similarity to the data block to be saved, selecting the one with the fewest clusters of differing sub-blocks as the reference data block further improves the compression achieved by the deduplication operation.
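  • The two-level selection rule (most identical sub-blocks first, fewest clusters of differing sub-blocks as the tie-breaker) can be sketched as below. The function names are hypothetical, sub-blocks are compared directly instead of by fingerprint, and ties that remain after both criteria are resolved in favour of the earliest candidate, which is an assumption of this example.

```python
def same_mask(block_subs: list, candidate_subs: list) -> list:
    """True at each position where the two sub-blocks are identical."""
    return [a == b for a, b in zip(block_subs, candidate_subs)]


def count_clusters(mask: list) -> int:
    """Count maximal runs of consecutive differing (False) positions."""
    clusters = 0
    previous_same = True
    for same in mask:
        if not same and previous_same:
            clusters += 1  # a new run of differing sub-blocks starts here
        previous_same = same
    return clusters


def choose_reference(block_subs: list, candidates: list) -> int:
    """Return the index of the comparison block to use as the reference."""
    def key(index: int) -> tuple:
        mask = same_mask(block_subs, candidates[index])
        # More identical sub-blocks is better, then fewer clusters.
        return (-sum(mask), count_clusters(mask))

    return min(range(len(candidates)), key=key)
```

  • For example, two candidates that both share four of six sub-blocks with the block to be saved are separated by whether their two differing sub-blocks are adjacent (one cluster) or apart (two clusters).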
  • Selecting the representative sub-block of the first sub-block includes: for each first sub-block, selecting the reference sub-block with the same position identifier as the representative sub-block of that first sub-block.
  • Selecting, for each first sub-block, the reference sub-block at the same position can produce a large number of consecutive "0"s in the XOR result, which greatly reduces the run-length coding compression ratio, that is, the compressed output becomes smaller.
  • Alternatively, selecting the representative sub-block of the first sub-block includes: XORing the first sub-block with each of at least two saved sub-blocks and, according to the run-length coding compression ratio of each XOR result, selecting as the representative sub-block the saved sub-block whose XOR result with the first sub-block has the smallest run-length coding compression ratio.
  • Selecting, from the saved sub-blocks, the one that minimizes the run-length coding compression ratio of its XOR result with the first sub-block as the representative sub-block reduces, to a large extent, the storage space required to save the data block to be saved.
  • Alternatively, selecting the representative sub-block of the first sub-block includes: XORing the first sub-blocks with one another pairwise and, according to the run-length coding compression ratios of the XOR results, selecting as the representative sub-block the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length coding compression ratio.
  • In this way, the storage space required to save the data block can be reduced without depending on the saved sub-blocks.
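  • The last two selection strategies both rank candidates by how well the XOR result compresses, so they can share one helper. The sketch below is illustrative only: it assumes a (count, value) run-length format with runs capped at 255, uses the encoded size as a stand-in for the compression ratio (all sub-blocks have the same size, so the ranking is the same), and, for the pairwise variant, sums the sizes over all other first sub-blocks, which is one possible reading of the text. All function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rle_size(data: bytes, max_run: int = 255) -> int:
    """Size in bytes of an assumed (run length, value) run-length encoding."""
    size = 0
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < max_run and data[i + run] == data[i]:
            run += 1
        size += 2  # one byte for the run length, one for the value
        i += run
    return size


def choose_from_saved(first_sub: bytes, saved_subs: list) -> int:
    """Index of the saved sub-block whose XOR with first_sub compresses best."""
    return min(
        range(len(saved_subs)),
        key=lambda k: rle_size(xor_bytes(first_sub, saved_subs[k])),
    )


def choose_among_first(first_subs: list) -> int:
    """Index of the first sub-block whose XORs with all the others compress best."""
    def total(k: int) -> int:
        return sum(
            rle_size(xor_bytes(first_subs[k], other))
            for j, other in enumerate(first_subs)
            if j != k
        )

    return min(range(len(first_subs)), key=total)
```

  • The first function needs access to previously saved sub-blocks, while the second works only with the differing sub-blocks of the block being saved, which matches the remark that the pairwise variant does not depend on saved sub-blocks.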
  • Before the data block to be saved is divided into N to-be-saved sub-blocks of the same size, the method further includes: extracting a sample from the data to be saved; determining the save compression ratio, the saving speed, and the reading speed of the sample for each of several different to-be-saved sub-block sizes; and determining the size of N from the sub-block size that has the smallest save compression ratio among the sizes whose saving speed and reading speed satisfy a constraint condition.
  • The constraint condition is: the saving speed is greater than or equal to a preset first threshold, and the reading speed is greater than or equal to a preset second threshold.
  • the storage space required for storing the data block can be minimized under the premise of satisfying the storage processing speed and the reading processing speed.
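  • The constrained choice of sub-block size can be sketched as a filter followed by a minimum. The `Measurement` record, its field names, and the units are assumptions made for illustration; how the sample is measured is left out, and the numbers in the usage note are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    sub_block_size: int       # candidate sub-block size in bytes
    compression_ratio: float  # compressed size / original size (smaller is better)
    save_speed: float         # measured saving speed on the sample
    read_speed: float         # measured reading speed on the sample


def choose_sub_block_size(
    measurements: list, min_save_speed: float, min_read_speed: float
) -> Optional[int]:
    """Pick the size with the smallest compression ratio among the sizes
    that meet both speed thresholds; return None if none qualifies."""
    feasible = [
        m for m in measurements
        if m.save_speed >= min_save_speed and m.read_speed >= min_read_speed
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m.compression_ratio).sub_block_size
```

  • With hypothetical measurements such as `Measurement(128, 0.30, 80, 90)`, `Measurement(256, 0.35, 120, 150)` and `Measurement(512, 0.45, 200, 220)` and both thresholds set to 100, the 128-byte size is rejected for being too slow and 256 bytes is chosen; for an 8192-byte block this gives N = 8192 / 256 = 32.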
  • the method further includes: saving the fingerprint of the first sub-block. Therefore, when the first sub-block is subsequently used as a reference sub-block of another data block, the fingerprint is not repeatedly calculated.
  • The present application provides a computer-readable medium comprising computer-executable instructions. When a processor of a computer executes the computer-executable instructions, the computer performs the method of the first aspect or of any possible implementation of the first aspect.
  • The present application provides a computing device, including: a processor, a memory, a bus, and a communication interface. The memory is configured to store execution instructions, and the processor is connected to the memory through the bus. When the computing device is running, the processor executes the execution instructions stored in the memory to cause the computing device to perform the method of the first aspect or of any possible implementation of the first aspect.
  • the application provides a data storage device, including:
  • a dividing unit, configured to divide the data block to be saved into N to-be-saved sub-blocks of the same size, where the N to-be-saved sub-blocks correspond to N position identifiers, each to-be-saved sub-block corresponds to one position identifier, and N is a positive integer greater than 1;
  • a selecting unit, configured to select one of at least two saved data blocks as a reference data block, where the reference data block includes N reference sub-blocks, the N reference sub-blocks correspond to N position identifiers, and each reference sub-block corresponds to one position identifier;
  • a comparing unit, configured to compare the to-be-saved sub-block corresponding to the i-th position identifier with the reference sub-block corresponding to the i-th position identifier, and to determine a first sub-block from the N to-be-saved sub-blocks, where i is a positive integer increasing from 1 to N,
  • the size of the to-be-stored data block and the reference data block are the same, the size of the to-be-saved sub-block and the reference sub-block are the same, and the order of the sub-block in the data block can be used as the location identifier. For example, in the order of position, the position of the i-th sub-data block to be saved is identified as "i". Comparing the to-be-saved sub-block and the reference sub-block corresponding to the same location identifier, that is, comparing the to-be-saved sub-block and the reference sub-block in the same location.
  • The device further includes a deduplication unit. The comparing unit is further configured to determine a second sub-block from the N to-be-saved sub-blocks, where the second sub-block is a to-be-saved sub-block that is identical to the reference sub-block it was compared with; the deduplication unit is configured to perform a deduplication operation on the second sub-block.
  • The data block to be saved is divided into sub-blocks of finer granularity, and the sub-blocks at the same position in the data block to be saved and in the reference data block are compared one by one. Identical sub-blocks are thus deduplicated with a small amount of computation, while differing sub-blocks are XORed with reference sub-blocks and the XOR result is compressed with run-length encoding, which greatly reduces the storage space required to record the data block to be saved.
  • The comparing unit is specifically configured to compare the fingerprint of the to-be-saved sub-block corresponding to the i-th position identifier with the fingerprint of the reference sub-block corresponding to the i-th position identifier. If the fingerprints are the same, the two sub-blocks are the same; if the fingerprints differ, the two sub-blocks are different.
  • When selecting one of the at least two saved comparison data blocks as the reference data block, the selecting unit is configured to select, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the data block to be saved.
  • The number of identical sub-blocks measures the similarity between each comparison data block and the data block to be saved: the comparison data block with the most sub-blocks identical to those of the data block to be saved is the most similar one.
  • Among the comparison data blocks with the highest similarity to the data block to be saved, selecting the one with the fewest clusters of differing sub-blocks as the reference data block further improves the compression achieved by the deduplication operation.
  • When selecting the representative sub-block of the first sub-block, the selecting unit is configured to select, for each first sub-block, the reference sub-block with the same position identifier as the representative sub-block of that first sub-block.
  • The selecting unit may further be configured to select, as the reference data block, the comparison data block that has the same number of sub-blocks identical to those of the data block to be saved and the fewest clusters of differing sub-blocks, where a cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions whose neighbouring sub-blocks are identical sub-blocks.
  • Selecting, for each first sub-block, the reference sub-block at the same position can produce a large number of consecutive "0"s in the XOR result, which greatly reduces the run-length coding compression ratio, that is, the compressed output becomes smaller.
  • Alternatively, when selecting the representative sub-block of the first sub-block, the selecting unit is configured to: have the XOR unit XOR the first sub-block with each of at least two saved sub-blocks and, according to the run-length coding compression ratio of each XOR result, select as the representative sub-block the saved sub-block whose XOR result with the first sub-block has the smallest run-length coding compression ratio.
  • Selecting, from the saved sub-blocks, the one that minimizes the run-length coding compression ratio of its XOR result with the first sub-block as the representative sub-block reduces, to a large extent, the storage space required to save the data block to be saved.
  • Alternatively, when selecting the representative sub-block of the first sub-block, the selecting unit is configured to: have the XOR unit XOR the first sub-blocks with one another pairwise and, according to the run-length coding compression ratios of the XOR results, select as the representative sub-block the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length coding compression ratio.
  • In this way, the storage space required to save the data block can be reduced without depending on the saved sub-blocks.
  • The selecting unit is further configured to: extract a sample from the data to be saved; determine the save compression ratio, the saving speed, and the reading speed of the sample for each of several different to-be-saved sub-block sizes; and determine the sub-block size that has the smallest save compression ratio among the sizes whose saving speed and reading speed satisfy a constraint condition, where the constraint condition is: the saving speed is greater than or equal to a preset first threshold, and the reading speed is greater than or equal to a preset second threshold.
  • the storage space required for storing the data block can be minimized under the premise of satisfying the storage processing speed and the reading processing speed.
  • the comparing unit is further configured to save the fingerprint of the first sub data block. Therefore, when the first sub-block is subsequently used as a reference sub-block of another data block, the fingerprint is not repeatedly calculated.
  • In the embodiments of the present invention, the data block to be saved is divided into sub-blocks of finer granularity, and the sub-blocks at the same position in the data block to be saved and in the reference data block are compared one by one. Identical sub-blocks are deduplicated with a small amount of computation, while differing sub-blocks are XORed with reference sub-blocks and the XOR result is compressed with run-length encoding, which greatly reduces the storage space required to record the data block to be saved.
  • FIG. 1 is a block diagram of an exemplary networked environment of a data storage system
  • FIG. 2 is a schematic structural diagram of a hardware of a computing device according to an embodiment of the invention.
  • FIG. 3 is an exemplary flowchart of a data saving method according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a data block segmentation method according to an embodiment of the invention.
  • FIG. 5 is a schematic diagram of comparison of comparison data blocks according to an embodiment of the invention.
  • FIG. 6 is a schematic diagram showing the logical structure of a data storage device according to an embodiment of the invention.
  • The data used by the client is stored on the storage device of the database server. The data stored there may or may not be deduplicated and compressed, but deduplication and compression can greatly reduce the storage space that the data requires on the storage device of the database server, saving storage resources.
  • FIG. 1 shows a block diagram of an exemplary networked environment of a data storage system 100 that includes a client 102 and a database server 104, and further includes a storage device 108. It should be understood that the above naming is merely for convenience of description and should not be construed as limiting the invention.
  • The storage device of the database server 104 stores the data used by the client 102, and the database server 104 provides services to the applications of the client 102. These services may include storage, query, update, transaction management, indexing, caching, query optimization, security, and multi-user access control.
  • the database server 104 can be composed of one or more computers and database management system software.
  • Both database server 104 and client 102 are configured with a communication interface for communication between the two. Alternatively, database server 104 and client 102 communicate over network 106.
  • The network 106 may be the Internet, an intranet, a local area network (LAN), a wireless local area network (WLAN), a storage area network (SAN), or the like, or a combination of the above.
  • Storage device 108 may be coupled to database server 104 and/or client 102 via a communication interface, or via network 106. Both database server 104 and client 102, or either one of them, can access the storage device 108.
  • The storage device 108 can serve as a storage device of the database server 104 and store the data that has been deduplicated and compressed by the database server 104.
  • FIG. 1 merely shows exemplary participants of the data storage system 100 and their interrelationships. The depicted system 100 is therefore greatly simplified; the embodiments of the present invention are described only in general terms, and their implementation is not limited in any way.
  • the client 102 and the database server 104 in FIG. 1 may be of any architecture, which is not limited by the embodiment of the present invention.
  • the client 102 and/or database server 104 shown in FIG. 1 can be implemented by the computing device 200 shown in FIG. 2.
  • computing device 200 includes a processor 202, a memory unit 204, an input/output interface 206, a communication interface 208, a bus 210, and a storage device 212.
  • the processor 202, the memory unit 204, the input/output interface 206, the communication interface 208, and the storage device 212 implement communication connections with each other through the bus 210.
  • the processor 202 is a control center of the computing device 200 for executing related programs to implement the technical solutions provided by the embodiments of the present invention.
  • the processor 202 includes one or more central processing units (CPUs), such as the central processing unit 1 and the central processing unit 2 shown in FIG.
  • the computing device 200 can also include multiple processors 202, each of which can be a single core processor (including one CPU) or a multi-core processor (including multiple CPUs).
  • A component that performs a specific function, for example the processor 202 or the memory unit 204, may be implemented by configuring a general-purpose component to perform the corresponding function, or by a dedicated component that specifically performs that function.
  • The processor 202 can be a general-purpose central processing unit, a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs to implement the technical solutions provided by the present application.
  • Processor 202 can be coupled to one or more storage schemes via bus 210.
  • the storage scheme can include a memory unit 204 and a storage device 212.
  • the storage device 212 can be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • Memory unit 204 can be a random access memory.
  • The memory unit 204 can be integrated with or inside the processor 202, or it can be one or more memory units independent of the processor 202.
  • Program code for execution by the processor 202 or a CPU internal to the processor 202 may be stored in the storage device 212 or the memory unit 204.
  • Program code (e.g., an operating system, an application, a deduplication compression module, or a communication module) stored in storage device 212 is copied to memory unit 204 for execution by processor 202.
  • Storage device 212 may include high speed random access memory (RAM), and may also include non-volatile memory, such as one or more magnetic disk memories, flash memory, or other non-volatile memory.
  • The storage device may further include remote memory separate from the one or more processors 202, such as a network disk accessed through the communication interface 208 and a communication network, which may be the Internet, an intranet, local area networks (LANs), wireless local area networks (WLANs), storage area networks (SANs), etc., or a combination of the above.
  • The storage device 212 can also be used to store the data that has been deduplicated and compressed by the database server 104.
  • The operating system includes various software components and/or drivers that control and manage general system tasks (such as memory management, storage device control, and power management) and that facilitate communication between hardware and software components.
  • the input/output interface 206 is for receiving input data and information, and outputting data such as operation results.
  • Communication interface 208 uses a transceiver-type apparatus, such as but not limited to a transceiver, to enable communication between computing device 200 and other devices or communication networks.
  • Bus 210 may include a path for communicating information between various components of computing device 200, such as processor 202, memory unit 204, input/output interface 206, communication interface 208, and storage device 212.
  • the bus 210 can use a wired connection or a wireless communication mode, which is not limited in this application.
  • Although the computing device 200 shown in FIG. 2 only shows the processor 202, the memory unit 204, the input/output interface 206, the communication interface 208, the bus 210, and the storage device 212, those skilled in the art will appreciate that, in a specific implementation, computing device 200 also includes other devices necessary for proper operation.
  • The computing device 200 can be a general-purpose computer or a special-purpose computing device, including but not limited to a portable computer, a personal desktop computer, a web server, a tablet computer, a mobile phone, a personal digital assistant (PDA), or the like, or a combination of two or more of these devices. The present application does not limit the specific implementation form of the computing device 200.
  • computing device 200 of FIG. 2 is merely an example of one computing device 200, which may include more or fewer components than those shown in FIG. 2, or have different component configurations.
  • computing device 200 may also include hardware devices that implement other additional functions, depending on the particular needs.
  • computing device 200 may also only include the components necessary to implement embodiments of the present invention, and does not necessarily include all of the devices shown in FIG.
  • the various components shown in Figure 2 can be implemented in hardware, software, or a combination of hardware and software.
  • FIG. 2 and the above description are applicable to various computing devices provided by the embodiments of the present invention, and are applicable to performing various data saving methods provided by the embodiments of the present invention.
  • The memory unit 204 of the computing device 200 includes a deduplication compression module, and the processor 202 executes the program code of the deduplication compression module to deduplicate and compress the data to be stored in the storage device 212 of the database server 104.
  • When performing database storage, the database server 104 stores the data in units of data blocks. Different data blocks are highly similar to one another, and identical data occupies the same relative position in different data blocks.
  • Therefore, when similar-data deduplication is performed, it is enough to compare the data at the same relative position in the data block to be saved and in the reference data block. This avoids cross-comparison between data at different relative positions and greatly reduces the complexity of the comparison.
  • the deduplication compression module may be comprised of one or more operational instructions to cause the computing device to perform one or more method steps in accordance with the above description. The specific method steps are described in detail in the following sections of this application.
  • FIG. 3 is an exemplary flowchart of a data saving method 300 according to an embodiment of the present invention. As shown in FIG. 3, the method 300 includes:
  • The data block to be saved is divided into N sub-blocks to be saved of the same size. The N sub-blocks to be saved correspond to N location identifiers, one per sub-block, where N is a positive integer greater than 1.
  • The location identifier of a sub-block to be saved may be represented by its position order within the data block to be saved.
  • The size of the data block to be saved may be the basic storage unit size used in database storage; the basic storage unit of the database is thus further divided into smaller sub-blocks.
  • For example, if the basic storage unit is an 8K-byte data block and 256 bytes is the optimal sub-block size, the data block to be saved is divided into 32 sub-blocks to be saved of 256 bytes each.
  • Slicing the 8K-byte data block to be saved into 256-byte sub-blocks makes 256 bytes the basic unit of comparison for similarity deduplication.
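  • The division described above can be illustrated with a minimal Python sketch. The function name `split_block` and the sample data are illustrative assumptions, not part of the embodiment; the location identifier is taken to be the 1-based position order of each sub-block.

```python
def split_block(block: bytes, sub_size: int) -> list[bytes]:
    """Split a data block into equally sized sub-blocks, kept in data order."""
    if sub_size <= 0 or len(block) % sub_size != 0:
        raise ValueError("block length must be a positive multiple of sub_size")
    return [block[i:i + sub_size] for i in range(0, len(block), sub_size)]


# An 8K-byte basic storage unit cut into 256-byte sub-blocks (sample content only).
block = bytes(range(256)) * 32
sub_blocks = split_block(block, 256)

# Assumption: the location identifier of a sub-block is its 1-based position order.
located = dict(enumerate(sub_blocks, start=1))
print(len(located), len(located[1]))
```

  • Keeping the sub-blocks in data order means the list index alone carries the location identifier, so no extra metadata is needed at this stage.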
  • S304: Select one of at least two saved comparison data blocks as the reference data block.
  • The reference data block includes N reference sub-blocks. The N reference sub-blocks correspond to the N location identifiers, one per reference sub-block.
  • The reference data block is divided in the same manner as the data block to be saved, and its location identifiers are represented in the same manner.
  • S306: Compare the sub-block to be saved and the reference sub-block that correspond to the same location identifier, and determine the first sub-blocks, where a first sub-block is a sub-block to be saved that differs from the reference sub-block it is compared with.
  • Specifically, the sub-block to be saved corresponding to the i-th location identifier is compared with the reference sub-block corresponding to the i-th location identifier, and the first sub-blocks are determined from the N sub-blocks to be saved, where i is a positive integer incremented from 1 to N.
  • There may be one first sub-block or more than one.
  • S308: Select a representative sub-block for the first sub-block.
  • S310: XOR the first sub-block with its representative sub-block.
  • S312: Compress the result of the XOR operation of step S310 using run-length encoding, and save the compression result and the location information of the first sub-block.
  • In step S306, to avoid a byte-by-byte comparison of the sub-block to be saved and the reference sub-block that correspond to the same location identifier, their fingerprints may be compared instead. If the fingerprints are the same, the two sub-blocks are the same; if the fingerprints differ, the two sub-blocks are different.
  • A fingerprint serves as the identity credential of a sub-block.
  • The fingerprint of a sub-block can be calculated in various ways, for example as the SHA1 or MD5 hash value of the sub-block; this is not limited in the embodiments of the present invention.
  • The data block 402 to be saved is divided, in data order, into N sub-blocks to be saved of the same size, numbered 1 to N.
  • The location identifier of a sub-block to be saved may be represented by its order; for example, i denotes the location identifier of the i-th sub-block to be saved.
  • The reference data block 404 is likewise divided into N reference sub-blocks of the same size, numbered 1 to N. The fingerprints of the sub-block to be saved and the reference sub-block at each location identifier of the data block 402 to be saved and the reference data block 404 are then compared in position order.
  • That is, the fingerprint of the sub-block to be saved corresponding to the i-th location identifier is compared with the fingerprint of the reference sub-block corresponding to the i-th location identifier, where i is incremented from 1 to N. If the fingerprints are the same, the two sub-blocks are the same; if the fingerprints differ, the two sub-blocks are different.
  • If the fingerprints are the same, the i-th sub-block to be saved may be deduplicated: it is not saved, and when the data is read, the i-th reference sub-block of the reference data block 404 is used to restore the i-th sub-block of the data block 402 to be saved.
  • If the fingerprints differ, the i-th sub-blocks of the data block 402 to be saved and of the reference data block 404 are different. In that case the location information of the i-th sub-block to be saved is stored, the operations of steps S308 to S312 are performed on the i-th sub-block of the data block 402 to be saved, and the resulting information is retained.
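  • The position-by-position fingerprint comparison of step S306 might look like the following sketch. SHA1 is one of the hash functions the description mentions; the function names `fingerprint` and `classify` are assumptions made for illustration.

```python
import hashlib


def fingerprint(sub_block: bytes) -> bytes:
    """SHA1 digest used as the identity credential of a sub-block."""
    return hashlib.sha1(sub_block).digest()


def classify(to_save: list[bytes], reference: list[bytes]) -> tuple[list[int], list[int]]:
    """Compare sub-blocks at the same 1-based location identifier.

    Returns (first, second): locations whose fingerprints differ (kept for
    XOR and run-length encoding) and locations whose fingerprints match
    (candidates for deduplication).
    """
    if len(to_save) != len(reference):
        raise ValueError("both blocks must contain the same number of sub-blocks")
    first, second = [], []
    for i, (s, r) in enumerate(zip(to_save, reference), start=1):
        (second if fingerprint(s) == fingerprint(r) else first).append(i)
    return first, second


first, second = classify([b"aa", b"bb", b"cc"], [b"aa", b"xx", b"cc"])
print(first, second)
```

  • Only N comparisons are made, one per location identifier; sub-blocks at different locations are never compared with each other.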
  • Alternatively, the data block to be saved and the reference data block may be compared by bisection (the "dichotomy" method) to find the location information of the sub-blocks whose fingerprints differ at the same location identifier in the two blocks.
  • The specific process is as follows. The data block to be saved and the reference data block are each divided into a "left data block" and a "right data block" of the same size, and the fingerprints of the "left data block" and the "right data block" of the data block to be saved are calculated. The "left data block" fingerprints of the data block to be saved and of the reference data block are compared, and likewise the "right data block" fingerprints. If the fingerprints of the "left data block" and the "right data block" of the reference data block were saved, the saved values can be used directly; otherwise they are calculated from the reference data block.
  • If the fingerprints of a "left (right) data block" differ, that "left (right) data block" is bisected again and the fingerprints of the resulting halves are compared, and so on, until the size of the divided data block equals the sub-block size. This yields the location information of every sub-block whose fingerprint differs from that of the reference sub-block at the same position, and that location information is recorded.
  • Bisection can reduce the amount of computation needed to locate the sub-blocks with differing fingerprints at the same position of the data block to be saved and the reference data block, and so increase the speed of deduplication and compression.
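  • A recursive sketch of the bisection search is shown below. It is illustrative only and makes several assumptions: the number of sub-blocks is a power of two, fingerprints are recomputed rather than read from saved values, and the whole block is also compared once before the first split. The names `fp` and `differing_positions` are hypothetical.

```python
import hashlib


def fp(data: bytes) -> bytes:
    """Fingerprint of an arbitrary span of bytes."""
    return hashlib.sha1(data).digest()


def differing_positions(to_save: bytes, reference: bytes,
                        sub_size: int, base: int = 1) -> list[int]:
    """Return 1-based locations of sub-blocks that differ, found by bisection.

    Assumes len(to_save) == len(reference) == sub_size * 2**k for some k >= 0.
    """
    if fp(to_save) == fp(reference):
        return []                      # identical span: no need to look inside
    if len(to_save) <= sub_size:
        return [base]                  # reached a single differing sub-block
    half = len(to_save) // 2
    left = differing_positions(to_save[:half], reference[:half], sub_size, base)
    right = differing_positions(to_save[half:], reference[half:], sub_size,
                                base + half // sub_size)
    return left + right


reference = bytes(32)                  # 8 sub-blocks of 4 bytes each
changed = bytearray(reference)
changed[5] = 1                         # falls inside sub-block 2
changed[27] = 1                        # falls inside sub-block 7
print(differing_positions(bytes(changed), reference, sub_size=4))
```

  • When a whole half matches, none of its sub-blocks is fingerprinted individually, which is where the saving in computation comes from.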
  • The location information of the first sub-blocks can be saved in various ways. It can be saved by recording the location information of the first sub-blocks, in which case the sub-blocks to be saved at unrecorded locations are the second sub-blocks, that is, the sub-blocks to be saved that are identical to their reference sub-blocks. It can also be saved by recording the location information of the second sub-blocks, in which case the sub-blocks to be saved at unrecorded locations are the first sub-blocks.
  • The manner of saving the location information of the first sub-blocks is not limited in the embodiments of the present invention.
  • The location information of a first sub-block may be represented by its order in the data block to be saved. For example, i may represent the location information of the i-th sub-block to be saved in the data block to be saved, where i is a positive integer greater than 0 and less than or equal to N.
  • Accordingly, either the position order of the first sub-blocks or the position order of the second sub-blocks may be recorded; the sub-blocks to be saved at unrecorded positions are then of the other kind.
  • The location information of the first sub-blocks may also be recorded as the start address and end address of each first sub-block, or as its start address and address length. If several first sub-blocks are at consecutive locations, only the start address and end address of the consecutive range, or its start address and address length, need be recorded, which saves the storage space required to record consecutive location information. It should be understood that the start address of a first sub-block here may be an offset from the start address of the data block to be saved.
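  • The compact recording of consecutive locations can be sketched as follows. The representation as `(start, length)` pairs over 1-based position orders is an assumption chosen for illustration; byte offsets could be recorded in the same way.

```python
def to_runs(positions: list[int]) -> list[tuple[int, int]]:
    """Collapse sorted 1-based positions into (start, length) runs."""
    runs: list[tuple[int, int]] = []
    for p in positions:
        if runs and p == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)   # extends the current consecutive range
        else:
            runs.append((p, 1))              # starts a new range
    return runs


print(to_runs([2, 3, 7]))
```

  • Here the first sub-blocks at positions 2 and 3 collapse into a single entry, so two locations cost one record instead of two.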
  • The data blocks stored in the database server 104 are highly similar, and the relative positions of the identical parts of different data blocks are generally the same. Exploiting this characteristic, the data block to be saved is divided into smaller sub-blocks, and the fingerprints of the sub-blocks at the same location identifier of the data block to be saved and of the reference data block are compared. This avoids cross-comparison of sub-blocks at different location identifiers, so the identical parts of the two blocks can be found with little computation.
  • The identical parts are then deduplicated: a sub-block to be saved that has the same position and the same fingerprint as a sub-block of the reference data block is not saved, and when the data is restored, the deduplicated sub-block is recovered from the sub-block at the corresponding position of the reference data block. This greatly reduces the storage space required to record the data block to be saved.
  • The fingerprints of the reference sub-blocks may be calculated when the reference data block is saved and stored in the database, or they may be calculated when the reference sub-block and the sub-block to be saved at the same location identifier are compared; this is not limited in the embodiments of the present invention.
  • Selecting one of the at least two saved comparison data blocks as the reference data block includes: comparing the data block to be saved with each of the at least two comparison data blocks according to the method of step S306, and selecting, as the reference data block, the comparison data block that has the most sub-blocks identical to those of the data block to be saved. Further, among the comparison data blocks with the most identical sub-blocks, the one with the fewest clusters of differing sub-blocks may be selected as the reference data block. A cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions whose neighboring sub-blocks are identical sub-blocks.
  • Comparing the data block to be saved with each comparison data block determines their similarity: the comparison data block with the most sub-blocks identical to those of the data block to be saved is the one most similar to it.
  • Selecting, among the comparison data blocks most similar to the data block to be saved, the one with the fewest clusters of differing sub-blocks as the reference data block further improves the compression achieved by the deduplication operation.
  • For example, the data block 402 to be saved is compared with three comparison data blocks: comparison data block 502, comparison data block 504, and comparison data block 506. The fingerprints of the sub-blocks at the same position of the data block 402 to be saved and of each comparison data block are compared; identical fingerprints indicate identical sub-blocks, and differing fingerprints indicate differing sub-blocks. As shown in FIG. 5, comparison data block 502 has three sub-blocks that differ from the data block 402 to be saved: the 2nd, the (N-2)-th, and the (N-1)-th sub-blocks.
  • Comparison data block 504 has two sub-blocks that differ from the data block 402 to be saved: the 2nd and the (N-2)-th sub-blocks.
  • Comparison data block 506 has two sub-blocks that differ from the data block 402 to be saved: the 2nd and the 3rd sub-blocks. The comparison shows that comparison data blocks 504 and 506 share the most sub-blocks with the data block 402 to be saved, N-2 each.
  • Comparison data block 506 has one cluster of differing sub-blocks, namely the set of the 2nd and 3rd sub-blocks, while comparison data block 504 has two clusters, namely the 2nd sub-block and the (N-2)-th sub-block. Comparison data block 506, which has the fewest clusters of differing sub-blocks, is therefore selected as the reference data block.
  • The location information of the 2nd and 3rd sub-blocks can then be recorded as one cluster. For example, the start position and end position of the cluster may be recorded, or the start position and the data length of the cluster; this is not limited in the embodiments of the present invention.
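  • The selection rule can be sketched in Python using a small stand-in for the FIG. 5 example with N = 8. For brevity the sketch compares sub-blocks directly rather than by fingerprint, and all function names and sample data are illustrative assumptions. Since N is fixed, the block with the most identical sub-blocks is the one with the fewest differing sub-blocks.

```python
def differing(to_save: list[bytes], candidate: list[bytes]) -> list[int]:
    """1-based locations at which the two blocks hold different sub-blocks."""
    return [i for i, (s, c) in enumerate(zip(to_save, candidate), start=1) if s != c]


def count_clusters(positions: list[int]) -> int:
    """Number of maximal runs of consecutive positions."""
    return sum(1 for k, p in enumerate(positions)
               if k == 0 or p != positions[k - 1] + 1)


def select_reference(to_save: list[bytes], candidates: list[list[bytes]]) -> int:
    """Index of the candidate with the fewest differing sub-blocks,
    ties broken by the fewest clusters of differing sub-blocks."""
    def key(idx: int) -> tuple[int, int]:
        diff = differing(to_save, candidates[idx])
        return (len(diff), count_clusters(diff))
    return min(range(len(candidates)), key=key)


def with_changes(block: list[bytes], positions: set[int]) -> list[bytes]:
    """Copy of block whose sub-blocks at the given locations are altered."""
    return [b"\xff" if i in positions else s for i, s in enumerate(block, start=1)]


n = 8
to_save = [bytes([i]) for i in range(1, n + 1)]
block_502 = with_changes(to_save, {2, n - 2, n - 1})   # 3 differing sub-blocks
block_504 = with_changes(to_save, {2, n - 2})          # 2 differing, 2 clusters
block_506 = with_changes(to_save, {2, 3})              # 2 differing, 1 cluster
print(select_reference(to_save, [block_502, block_504, block_506]))  # 2, i.e. block_506
```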
  • Selecting the representative sub-block of the first sub-block includes: selecting, for each first sub-block, the reference sub-block corresponding to the same location identifier as its representative sub-block.
  • For example, the reference data block of the data block 402 to be saved is comparison data block 506.
  • Comparison data block 506 has two sub-blocks that differ from the data block 402 to be saved, the 2nd and the 3rd. The 2nd sub-block of comparison data block 506 is therefore selected as the representative sub-block of the 2nd sub-block to be saved of data block 402, and the 3rd sub-block of comparison data block 506 as the representative sub-block of the 3rd sub-block to be saved of data block 402.
  • The 2nd sub-block to be saved of data block 402 is XORed with the 2nd sub-block of comparison data block 506, and the 3rd sub-block to be saved is XORed with the 3rd sub-block of comparison data block 506. Because the 2nd and 3rd sub-blocks to be saved are at consecutive positions, their XOR results can be run-length encoded together and their location information recorded together.
  • Alternatively, selecting the representative sub-block of the first sub-block includes: XORing the first sub-block with each of at least two saved sub-blocks, and, based on the run-length encoding compression ratio of each XOR result, selecting as the representative sub-block the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio.
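  • A minimal sketch of this selection is given below. It assumes a byte-level run-length encoding and uses the number of runs in the XOR result as a stand-in for the compression ratio, taken here as encoded size over original size so that smaller is better; because every candidate has the same length, ranking by run count is equivalent under that assumption. The function names are hypothetical.

```python
def run_count(data: bytes) -> int:
    """Number of runs of identical bytes; fewer runs means a shorter RLE output."""
    return sum(1 for i in range(len(data)) if i == 0 or data[i] != data[i - 1])


def pick_representative(first: bytes, saved: list[bytes]) -> int:
    """Index of the saved sub-block whose XOR with `first` has the fewest runs."""
    def cost(idx: int) -> int:
        xor_result = bytes(a ^ b for a, b in zip(first, saved[idx]))
        return run_count(xor_result)
    return min(range(len(saved)), key=cost)


first = b"AAAABBBB"
saved = [b"AxAxBxBx", b"AAAACCCC", b"ZYXWVUTS"]
print(pick_representative(first, saved))   # 1: the XOR result has only two runs
```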
  • Alternatively, selecting the representative sub-block of the first sub-block includes: XORing the first sub-blocks pairwise, and, based on the run-length encoding compression ratios of the XOR results, selecting as the representative sub-block the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio.
  • The representative sub-block is saved.
  • For example, suppose the data block to be saved has M first sub-blocks, numbered 1 to M in the order of their positions in the data block to be saved for ease of description. The M first sub-blocks are XORed pairwise, and the run-length encoding compression ratio of the XOR result of each first sub-block with the other M-1 first sub-blocks is calculated; that is, statistics are computed for the 1st first sub-block against the 2nd through M-th first sub-blocks, and likewise for each of the other first sub-blocks.
  • The compression ratio of the run-length encoding of each XOR result can be determined by counting the consecutive identical characters in the XOR result and thus the space that run-length encoding saves. The larger the space saved by run-length encoding, the larger the compression ratio of the run-length encoding.
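  • The pairwise variant might be sketched as follows. The description does not say how the M-1 XOR results of one candidate are combined, so summing their run counts is an assumption of this sketch, as are the byte-level run-length model (fewer runs taken as better compression) and the function names.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def run_count(data: bytes) -> int:
    """Number of runs of identical bytes in data."""
    return sum(1 for i in range(len(data)) if i == 0 or data[i] != data[i - 1])


def pick_among_first(first_subs: list[bytes]) -> int:
    """Index of the first sub-block whose XORs with all the others
    contain the fewest runs in total (assumed aggregation: sum)."""
    def total_runs(i: int) -> int:
        return sum(run_count(xor_bytes(first_subs[i], other))
                   for j, other in enumerate(first_subs) if j != i)
    return min(range(len(first_subs)), key=total_runs)


first_subs = [b"AAAA", b"AAAB", b"ABAB"]
print(pick_among_first(first_subs))   # 1: totals are 6, 5 and 7 runs
```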
  • The first sub-block is XORed with its representative sub-block. The XOR operation is performed on the binary data bits of the sub-blocks; for example, the sub-block to be saved is XORed with the reference sub-block. Where the bit values at the same position are the same, the XOR result is 0; where they differ, the XOR result is 1.
  • The XOR result is a binary string consisting of "0" and/or "1".
  • The XOR result is compressed using run-length encoding, which saves the storage space required to record the information.
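  • Steps S310 and S312 can be illustrated with the sketch below. The description treats the XOR result as a bit string; for brevity this sketch run-length encodes whole bytes and represents the output as `(value, count)` pairs, both of which are assumptions, as are the function names.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of a first sub-block with its representative sub-block."""
    if len(a) != len(b):
        raise ValueError("sub-blocks must have the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Run-length encode data as (byte value, run length) pairs."""
    runs: list[tuple[int, int]] = []
    for byte in data:
        if runs and runs[-1][0] == byte:
            runs[-1] = (byte, runs[-1][1] + 1)
        else:
            runs.append((byte, 1))
    return runs


first = b"ABCDEFGH"
representative = b"ABCDxFGH"          # differs from `first` in one byte only
print(rle_encode(xor_bytes(first, representative)))   # [(0, 4), (61, 1), (0, 3)]
```

  • The more similar the two sub-blocks are, the longer the runs of zeros in the XOR result and the fewer pairs the encoder emits.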
  • The method 300 also includes saving the fingerprint of the first sub-block, so that the fingerprint need not be recalculated when the first sub-block later serves as a reference sub-block for another data block.
  • When the data is read, the XOR result can be restored from the saved run-length encoding result, and the first sub-block can be restored from the restored XOR result and the saved representative sub-block: a "1" at a position in the XOR result indicates that the value of the first sub-block at that position is the opposite of the value of the representative sub-block, and a "0" indicates that the two values are the same. The original data block to be saved can then be restored from the restored first sub-blocks, the saved location information of the first sub-blocks, and the saved reference data block.
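  • The restoration path can be sketched as follows, assuming the run-length result is stored as `(value, count)` pairs over bytes and the restored first sub-blocks are keyed by their 1-based location; these storage formats and the function names are illustrative assumptions.

```python
def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Expand (byte value, run length) pairs back into the XOR result."""
    return b"".join(bytes([value]) * count for value, count in runs)


def restore_sub_block(runs: list[tuple[int, int]], representative: bytes) -> bytes:
    """XOR the decoded result with the representative sub-block."""
    xor_result = rle_decode(runs)
    if len(xor_result) != len(representative):
        raise ValueError("decoded XOR result and representative differ in size")
    return bytes(x ^ r for x, r in zip(xor_result, representative))


def restore_block(reference_subs: list[bytes], restored_first: dict[int, bytes]) -> bytes:
    """Rebuild the saved block: restored first sub-blocks at their recorded
    locations, reference sub-blocks everywhere else."""
    return b"".join(restored_first.get(i, ref)
                    for i, ref in enumerate(reference_subs, start=1))


sub = restore_sub_block([(0, 4), (61, 1), (0, 3)], b"ABCDxFGH")
print(sub)                                             # b'ABCDEFGH'
print(restore_block([b"aa", b"xx", b"cc"], {2: b"bb"}))  # b'aabbcc'
```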
  • The method 300 further includes: extracting a sample from the data to be saved; determining the save compression ratio, the saving speed, and the reading speed of the sample for each of several different sub-block sizes; and selecting, among the sub-block sizes whose saving speed and reading speed satisfy a constraint, the sub-block size with the smallest save compression ratio. The constraint is that the saving speed is greater than or equal to a preset first threshold and the reading speed is greater than or equal to a preset second threshold.
  • An appropriate sub-block size and number N of sub-blocks may thus be determined from the sample data. If the size of the data block to be saved is K and the size of a sub-block to be saved is M, the value of M must meet the following constraints:
  • K/M ≥ 2, that is, the number of sub-blocks to be saved must be greater than or equal to 2;
  • K mod M = 0, that is, K is exactly divisible by M, which ensures that all sub-blocks to be saved have the same size.
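  • The two constraints on M and the threshold-based choice can be sketched as below. The `Measurement` record and both function names are hypothetical; how the compression ratio and the speeds are measured on the sample is left outside the sketch.

```python
from dataclasses import dataclass
from typing import Optional


def valid_sub_block_sizes(k: int) -> list[int]:
    """All sizes M with K mod M == 0 and K / M >= 2."""
    return [m for m in range(1, k // 2 + 1) if k % m == 0]


@dataclass
class Measurement:
    """Sample statistics for one candidate sub-block size (hypothetical record)."""
    sub_size: int
    compression_ratio: float   # stored size / original size, smaller is better
    save_speed: float
    read_speed: float


def choose_sub_size(measurements: list[Measurement],
                    min_save: float, min_read: float) -> Optional[int]:
    """Smallest-ratio size among those meeting both speed thresholds."""
    ok = [m for m in measurements
          if m.save_speed >= min_save and m.read_speed >= min_read]
    if not ok:
        return None            # no candidate satisfies the constraint
    return min(ok, key=lambda m: m.compression_ratio).sub_size


# Candidate sizes for an 8K-byte block, restricted here to at least 64 bytes.
print([m for m in valid_sub_block_sizes(8192) if m >= 64])
```

  • Returning `None` when no size meets both thresholds leaves the fallback policy to the caller, since the description does not specify one.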
  • The save compression ratio is the ratio of the size of the data after deduplication and compression to the size of the data block to be saved. The data after deduplication and compression is the data required to store the information of the data block to be saved, including the run-length encoding result, the data required to record the reference data block and reference sub-block information, and the data recording the location information of the first sub-blocks.
  • The method 300 can be performed by the client 102, by the database server 104, or partly by each. For example, the client 102 may divide the data block to be saved into N sub-blocks to be saved, calculate the fingerprint of each sub-block to be saved, and send the fingerprints of the N sub-blocks to be saved to the database server 104. The database server 104 then performs step S304 to select a reference data block for the data block to be saved, determines the first sub-blocks by comparing the fingerprints of the sub-blocks at the same location identifier, and sends indication information of the first sub-blocks to the client 102. The client 102 sends the first sub-blocks to the database server 104, which performs the subsequent operations.
  • The entity that performs the method 300 is not limited in the embodiments of the present invention.
  • The data block to be saved is divided into sub-blocks of smaller granularity, and the sub-blocks at the same position of the data block to be saved and of the reference data block are compared one by one. Identical sub-blocks are deduplicated with little computation, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space required to record the data block to be saved.
  • FIG. 6 is a schematic diagram showing the logical structure of a data saving apparatus 600 according to an embodiment of the present invention.
  • The apparatus 600 includes a dividing unit 602, a selecting unit 604, a comparing unit 606, a calculating unit 608, and a compressing unit 610.
  • The dividing unit 602 is configured to divide the data block to be saved into N sub-blocks to be saved of the same size. The N sub-blocks to be saved correspond to N location identifiers, one per sub-block, where N is a positive integer greater than 1.
  • The selecting unit 604 is configured to select one of at least two saved comparison data blocks as the reference data block. The reference data block includes N reference sub-blocks, which correspond to the N location identifiers, one per reference sub-block.
  • The comparing unit 606 is configured to compare the sub-block to be saved corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier, and to determine the first sub-blocks from the N sub-blocks to be saved, where i is a positive integer from 1 to N and a first sub-block is a sub-block to be saved that differs from the reference sub-block it is compared with.
  • The selecting unit 604 is further configured to select a representative sub-block for the first sub-block.
  • The calculating unit 608 is configured to XOR the data of the first sub-block with the data of the representative sub-block.
  • The compressing unit 610 is configured to compress the result of the XOR operation using run-length encoding, and to save the compression result and the location information of the first sub-block.
  • The apparatus 600 further includes a deduplication unit. The comparing unit 606 is further configured to determine the second sub-blocks from the N sub-blocks to be saved, where a second sub-block is a sub-block to be saved that is identical to the reference sub-block it is compared with; the deduplication unit is configured to perform a deduplication operation on the second sub-blocks.
  • The data block to be saved is divided into sub-blocks of smaller granularity, and the sub-blocks at the same position of the data block to be saved and of the reference data block are compared one by one. Identical sub-blocks are deduplicated with little computation, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space required to record the data block to be saved.
  • The comparing unit 606 is specifically configured to compare the fingerprint of the sub-block to be saved corresponding to the i-th location identifier with the fingerprint of the reference sub-block corresponding to the i-th location identifier. If the fingerprints are the same, the two sub-blocks are the same; if they differ, the two are different. Comparing fingerprints avoids a bitwise comparison of the sub-blocks and so reduces the complexity of the comparison operation.
  • That the selecting unit 604 is configured to select one of the at least two saved comparison data blocks as the reference data block includes: the selecting unit 604 is configured to select, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the data block to be saved as the reference data block.
  • Further, among the comparison data blocks with the most sub-blocks identical to those of the data block to be saved, the selecting unit 604 may select the one with the fewest clusters of differing sub-blocks as the reference data block. A cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions whose neighboring sub-blocks are identical sub-blocks.
  • Comparing the data block to be saved with each comparison data block determines their similarity: the comparison data block with the most sub-blocks identical to those of the data block to be saved is the one most similar to it.
  • Selecting, among the comparison data blocks most similar to the data block to be saved, the one with the fewest clusters of differing sub-blocks as the reference data block further improves the compression achieved by the deduplication operation.
  • That the selecting unit 604 is configured to select a representative sub-block of the first sub-block includes: the selecting unit 604 is configured to select, for each first sub-block, the corresponding reference sub-block as its representative sub-block.
  • Selecting the reference sub-block corresponding to each first sub-block can produce a large number of consecutive "0"s in the XOR result, thereby greatly reducing the compression ratio of the run-length encoding.
  • Alternatively, that the selecting unit 604 is configured to select a representative sub-block of the first sub-block includes: the selecting unit 604 is configured to XOR, by using the XOR unit, the first sub-block with each of at least two saved sub-blocks, and, based on the run-length encoding compression ratios of the XOR results, to select as the representative sub-block the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio.
  • Selecting from the saved sub-blocks the one that minimizes the run-length encoding compression ratio of its XOR result with the first sub-block can greatly reduce the storage space required to save the data block to be saved.
  • Alternatively, that the selecting unit 604 is configured to select a representative sub-block of the first sub-block includes: the selecting unit 604 is configured to XOR, by using the XOR unit, the first sub-blocks pairwise, and, based on the run-length encoding compression ratios of the XOR results, to select as the representative sub-block the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio.
  • In this way the storage space required for the data block to be saved can be reduced without depending on already saved sub-blocks.
  • The selecting unit 604 is further configured to: extract a sample from the data to be saved; determine the save compression ratio, the saving speed, and the reading speed of the sample for each of several different sub-block sizes; and select, among the sub-block sizes whose saving speed and reading speed satisfy a constraint, the sub-block size with the smallest save compression ratio. The constraint is that the saving speed is greater than or equal to a preset first threshold and the reading speed is greater than or equal to a preset second threshold.
  • In this way the storage space required to save the data block can be minimized while the required saving and reading speeds are still met.
  • The comparing unit 606 is further configured to save the fingerprint of the first sub-block, so that the fingerprint need not be recalculated when the first sub-block later serves as a reference sub-block for another data block.
  • This embodiment of the present invention is the apparatus embodiment corresponding to the method 300. The description of features in the embodiments of the method 300 applies to this embodiment and is not repeated here.
  • The data block to be saved is divided into sub-blocks of smaller granularity, and the sub-blocks at the same position of the data block to be saved and of the reference data block are compared one by one. Identical sub-blocks are deduplicated with little computation, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space required to record the data block to be saved.
  • The disclosed systems, devices, and methods may be implemented in other manners.
  • The device embodiments described above are merely illustrative.
  • The division into modules is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or module, and may be electrical, mechanical, or of another form.
  • The modules described as separate components may or may not be physically separate.
  • The components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • The functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module.
  • The above integrated modules can be implemented in the form of hardware, or in the form of hardware plus software function modules.
  • The above integrated modules implemented in the form of software function modules can be stored in a computer-readable storage medium.
  • The software function modules are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

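A data storage method and apparatus for deduplicating and compressing data stored in a database. The method includes: dividing a to-be-saved data block and a reference data block each into N sub-blocks of equal size (N to-be-saved sub-blocks and N reference sub-blocks); comparing the to-be-saved sub-block and the reference sub-block that correspond to the same location identifier; determining, among the N to-be-saved sub-blocks, which sub-blocks can be deduplicated and which cannot; deduplicating the sub-blocks that can be deduplicated; selecting a representative sub-block for each sub-block that cannot be deduplicated; performing an exclusive-OR (XOR) operation on the data of the non-deduplicable sub-block and the data of its representative sub-block; compressing the XOR result with run-length encoding; and saving the compressed result together with the location information of the non-deduplicable sub-block. Deduplicating and compressing the to-be-saved data block in this way reduces the storage space needed to save it.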

Description

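Data Storage Method and Apparatus
Technical Field
Embodiments of the present invention relate to the computer field, and in particular to a data storage method and apparatus.
Background
With the explosive growth of digital information, data occupies more and more space. As data grows exponentially, enterprises must handle more and more backup and recovery points quickly, and the cost of managing stored data, as well as data center space and energy consumption, keeps rising. Studies have found that up to 60% of the data saved by application systems is redundant, and that the share of redundant data grows over time.
To ease the space growth of storage systems, reduce the space that data occupies, lower cost, and make the most of existing resources, data deduplication has become a popular research topic. On the one hand, data deduplication (hereinafter "deduplication") optimizes storage space utilization by eliminating identical files or data blocks distributed across a storage system. On the other hand, it reduces the amount of data transmitted over the network, which lowers energy consumption and network cost and saves a large amount of network bandwidth for data replication.
As deduplication technology has developed, it has been widely applied in storage backup and archiving systems. In such a system, the deduplication module compares and analyzes data content to find redundant data. For a data unit that appears repeatedly, only the location of the identical data that has already been stored and can be referenced needs to be recorded, so that only non-duplicate data is finally written to the storage medium.
In the prior art, deduplication algorithms that compress well are computationally expensive, while computationally cheap algorithms compress poorly. Studies have found that data blocks in database applications are highly similar to one another, so an effective storage algorithm is needed to compress data in database applications.
Summary
In view of this, embodiments of the present invention provide a data storage method and apparatus that deduplicate and compress a to-be-stored data block, thereby reducing the storage space needed to store it.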
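According to a first aspect, this application provides a data storage method, including:
dividing a to-be-saved data block into N to-be-saved sub-blocks of equal size, where the N to-be-saved sub-blocks correspond to N location identifiers, each to-be-saved sub-block corresponds to one location identifier, and N is a positive integer greater than 1; selecting one of at least two saved comparison data blocks as a reference data block, where the reference data block includes N reference sub-blocks, the N reference sub-blocks correspond to the N location identifiers, and each reference sub-block corresponds to one location identifier; comparing the to-be-saved sub-block corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier, and determining a first sub-block among the N to-be-saved sub-blocks, where i is a positive integer that increments from 1 to N, and a first sub-block is a to-be-saved sub-block that differs from the reference sub-block it is compared with; selecting a representative sub-block for the first sub-block; performing an XOR operation on the data of the first sub-block and the data of the representative sub-block; and compressing the result of the XOR operation with run-length encoding, and saving the compressed result and the location information of the first sub-block.
In a specific implementation, because the to-be-saved data block and the reference data block have the same size, and the to-be-saved sub-blocks and the reference sub-blocks also have the same size, the order of a sub-block within its data block can serve as the location identifier. For example, in positional order, the location identifier of the i-th to-be-saved sub-block is "i". Comparing the to-be-saved sub-block and the reference sub-block that correspond to the same location identifier therefore means comparing the sub-blocks at the same position.
With reference to the first aspect, in a first possible implementation of the first aspect, the method further includes: determining, through the comparison of the to-be-saved sub-blocks and reference sub-blocks that correspond to the same location identifiers, a second sub-block among the N to-be-saved sub-blocks, where a second sub-block is a to-be-saved sub-block that is identical to the reference sub-block it is compared with; and performing a deduplication operation on the second sub-block.
Data similarity in a database is position-dependent. Exploiting this, the to-be-saved data block is divided into finer-grained sub-blocks, and the sub-blocks at the same positions of the to-be-saved data block and the reference data block are compared one by one. Identical sub-blocks are deduplicated at a small computational cost, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space needed to record the to-be-saved data block.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a second possible implementation of the first aspect, comparing the to-be-saved sub-block corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier includes: comparing the fingerprint of the to-be-saved sub-block corresponding to the i-th location identifier with the fingerprint of the reference sub-block corresponding to the i-th location identifier; if the fingerprints are the same, the two sub-blocks are the same, and if the fingerprints differ, the two sub-blocks differ.
Comparing sub-block fingerprints avoids bit-by-bit comparison between sub-blocks and thus reduces the complexity of the comparison.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a third possible implementation of the first aspect, selecting one of at least two saved comparison data blocks as the reference data block includes: selecting, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the to-be-saved data block as the reference data block. Further, the comparison data block that has the most identical sub-blocks and the fewest clusters of differing sub-blocks may be selected as the reference data block, where a cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions such that the sub-blocks adjacent to the set are identical sub-blocks.
By comparing the to-be-saved data block with multiple comparison data blocks, the similarity between each comparison data block and the to-be-saved data block can be determined; the comparison data block with the most identical sub-blocks is the one most similar to the to-be-saved data block. Further, selecting, among the most similar comparison data blocks, the one with the fewest clusters of differing sub-blocks as the reference data block reduces the information needed to record the positions of the differing sub-blocks and further improves the compression achieved by deduplication.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a fourth possible implementation of the first aspect, selecting a representative sub-block for the first sub-block includes: selecting, for each first sub-block, the reference sub-block corresponding to the same location identifier as the representative sub-block of that first sub-block.
Because data blocks stored in a database are highly similar, and the identical parts of different data blocks generally sit at the same relative positions within the blocks, selecting the same-position reference sub-block as the representative sub-block of each first sub-block produces long runs of consecutive "0"s in the XOR result, which greatly lowers the run-length encoding compression ratio (that is, the encoded result becomes much smaller).
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a fifth possible implementation of the first aspect, selecting a representative sub-block for the first sub-block includes: XORing the first sub-block with each of at least two saved sub-blocks and, based on the run-length encoding compression ratio of the XOR results, selecting the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block.
Selecting, by comparison, the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block can reduce, to a large extent, the storage space needed to save the to-be-saved data block.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a sixth possible implementation of the first aspect, selecting a representative sub-block for the first sub-block includes: XORing the first sub-blocks pairwise and, based on the run-length encoding compression ratio of the XOR results, selecting the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio as the representative sub-block.
Finding, among the first sub-blocks, the one most similar to the other first sub-blocks and using it as the representative sub-block reduces the storage space needed for the to-be-saved data block without depending on saved sub-blocks.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a seventh possible implementation of the first aspect, before dividing the to-be-saved data block into N to-be-saved sub-blocks of equal size, the method further includes: extracting a sample from the to-be-saved data; determining, for different to-be-saved sub-block sizes, the storage compression ratio, save speed, and read speed of the sample; and determining N from the sub-block size that has the smallest storage compression ratio among those whose save speed and read speed satisfy a constraint, where the constraint is: the save speed is greater than or equal to a preset first threshold, and the read speed is greater than or equal to a preset second threshold.
By sampling and choosing the most suitable way to split the data block, the storage space needed to store the to-be-saved data block can be minimized while the required save and read processing speeds are still met.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in an eighth possible implementation of the first aspect, the method further includes: saving the fingerprint of the first sub-block, so that the fingerprint need not be recomputed when the first sub-block later serves as a reference sub-block for another data block.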
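According to a second aspect, this application provides a computer-readable medium including computer-executable instructions; when a processor of a computer executes the computer-executable instructions, the computer performs the method of the first aspect or of any possible implementation of the first aspect.
According to a third aspect, this application provides a computing device, including a processor, a memory, a bus, and a communication interface. The memory is configured to store execution instructions, and the processor is connected to the memory through the bus. When the computing device runs, the processor executes the execution instructions stored in the memory, so that the computing device performs the method of the first aspect or of any possible implementation of the first aspect.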
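According to a fourth aspect, this application provides a data storage apparatus, including:
a splitting unit, configured to divide a to-be-saved data block into N to-be-saved sub-blocks of equal size, where the N to-be-saved sub-blocks correspond to N location identifiers, each to-be-saved sub-block corresponds to one location identifier, and N is a positive integer greater than 1; a selection unit, configured to select one of at least two saved comparison data blocks as a reference data block, where the reference data block includes N reference sub-blocks, the N reference sub-blocks correspond to the N location identifiers, and each reference sub-block corresponds to one location identifier; a comparison unit, configured to compare the to-be-saved sub-block corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier, and determine a first sub-block among the N to-be-saved sub-blocks, where i is a positive integer that increments from 1 to N, and a first sub-block is a to-be-saved sub-block that differs from the reference sub-block it is compared with; the selection unit being further configured to select a representative sub-block for the first sub-block; a calculation unit, configured to perform an XOR operation on the data of the first sub-block and the data of the representative sub-block; and a compression unit, configured to compress the result of the XOR operation with run-length encoding, and save the compressed result and the location information of the first sub-block.
In a specific implementation, because the to-be-saved data block and the reference data block have the same size, and the to-be-saved sub-blocks and the reference sub-blocks also have the same size, the order of a sub-block within its data block can serve as the location identifier. For example, in positional order, the location identifier of the i-th to-be-saved sub-block is "i". Comparing the to-be-saved sub-block and the reference sub-block that correspond to the same location identifier therefore means comparing the sub-blocks at the same position.
With reference to the fourth aspect, in a first possible implementation of the fourth aspect, the apparatus further includes a deduplication unit. The comparison unit is further configured to determine a second sub-block among the N to-be-saved sub-blocks, where a second sub-block is a to-be-saved sub-block that is identical to the reference sub-block it is compared with; the deduplication unit is configured to perform a deduplication operation on the second sub-block.
Data similarity in a database is position-dependent. Exploiting this, the to-be-saved data block is divided into finer-grained sub-blocks, and the sub-blocks at the same positions of the to-be-saved data block and the reference data block are compared one by one. Identical sub-blocks are deduplicated at a small computational cost, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space needed to record the to-be-saved data block.
With reference to the fourth aspect or any of the foregoing possible implementations of the fourth aspect, in a second possible implementation of the fourth aspect, the comparison unit is specifically configured to compare the fingerprint of the to-be-saved sub-block corresponding to the i-th location identifier with the fingerprint of the reference sub-block corresponding to the i-th location identifier; if the fingerprints are the same, the two sub-blocks are the same, and if the fingerprints differ, the two sub-blocks differ.
Comparing sub-block fingerprints avoids bit-by-bit comparison between sub-blocks and thus reduces the complexity of the comparison.
With reference to the fourth aspect or any of the foregoing possible implementations of the fourth aspect, in a third possible implementation of the fourth aspect, the selection unit being configured to select one of at least two saved comparison data blocks as the reference data block includes: the selection unit is configured to select, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the to-be-saved data block as the reference data block. Further, the selection unit may be configured to select the comparison data block that has the most identical sub-blocks and the fewest clusters of differing sub-blocks as the reference data block, where a cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions such that the sub-blocks adjacent to the set are identical sub-blocks.
By comparing the to-be-saved data block with multiple comparison data blocks, the similarity between each comparison data block and the to-be-saved data block can be determined; the comparison data block with the most identical sub-blocks is the one most similar to the to-be-saved data block. Further, selecting, among the most similar comparison data blocks, the one with the fewest clusters of differing sub-blocks as the reference data block reduces the information needed to record the positions of the differing sub-blocks and further improves the compression achieved by deduplication.
With reference to the fourth aspect or any of the foregoing possible implementations of the fourth aspect, in a fourth possible implementation of the fourth aspect, the selection unit being configured to select a representative sub-block for the first sub-block includes: the selection unit is configured to select, for each first sub-block, the reference sub-block corresponding to the same location identifier as the representative sub-block of that first sub-block.
Because data blocks stored in a database are highly similar, and the identical parts of different data blocks generally sit at the same relative positions within the blocks, selecting the same-position reference sub-block as the representative sub-block of each first sub-block produces long runs of consecutive "0"s in the XOR result, which greatly lowers the run-length encoding compression ratio (that is, the encoded result becomes much smaller).
With reference to the fourth aspect or any of the foregoing possible implementations of the fourth aspect, in a fifth possible implementation of the fourth aspect, the selection unit being configured to select a representative sub-block for the first sub-block includes: the selection unit is configured to XOR, by means of the XOR unit, the first sub-block with each of at least two saved sub-blocks and, based on the run-length encoding compression ratio of the XOR results, select the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block.
Selecting, by comparison, the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block can reduce, to a large extent, the storage space needed to save the to-be-saved data block.
With reference to the fourth aspect or any of the foregoing possible implementations of the fourth aspect, in a sixth possible implementation of the fourth aspect, the selection unit being configured to select a representative sub-block for the first sub-block includes: the selection unit is configured to XOR, by means of the XOR unit, the first sub-blocks pairwise and, based on the run-length encoding compression ratio of the XOR results, select the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio as the representative sub-block.
Finding, among the first sub-blocks, the one most similar to the other first sub-blocks and using it as the representative sub-block reduces the storage space needed for the to-be-saved data block without depending on saved sub-blocks.
With reference to the fourth aspect or any of the foregoing possible implementations of the fourth aspect, in a seventh possible implementation of the fourth aspect, the selection unit is further configured to: extract a sample from the to-be-saved data; determine, for different to-be-saved sub-block sizes, the storage compression ratio, save speed, and read speed of the sample; and determine N from the sub-block size that has the smallest storage compression ratio among those whose save speed and read speed satisfy a constraint, where the constraint is: the save speed is greater than or equal to a preset first threshold, and the read speed is greater than or equal to a preset second threshold.
By sampling and choosing the most suitable way to split the data block, the storage space needed to store the to-be-saved data block can be minimized while the required save and read processing speeds are still met.
With reference to the fourth aspect or any of the foregoing possible implementations of the fourth aspect, in an eighth possible implementation of the fourth aspect, the comparison unit is further configured to save the fingerprint of the first sub-block, so that the fingerprint need not be recomputed when the first sub-block later serves as a reference sub-block for another data block.
According to the technical solutions disclosed in the present invention, data similarity in a database is position-dependent. Exploiting this, the to-be-saved data block is divided into finer-grained sub-blocks, and the sub-blocks at the same positions of the to-be-saved data block and the reference data block are compared one by one. Identical sub-blocks are deduplicated at a small computational cost, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space needed to record the to-be-saved data block.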
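Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a block diagram of an exemplary networking environment of a data storage system;
FIG. 2 is a schematic diagram of the hardware structure of a computing device according to an embodiment of the present invention;
FIG. 3 is an exemplary flowchart of a data storage method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data block splitting method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the comparison against comparison data blocks according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the logical structure of a data storage apparatus according to an embodiment of the present invention.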
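Detailed Description
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings.
According to this application, the storage device of a database server holds data for use by clients. Data to be stored on the storage device of the database server may or may not be deduplicated and compressed, but deduplicating and compressing it can greatly reduce the storage space the data needs on the storage device of the database server and save storage resources. FIG. 1 shows a block diagram of an exemplary networking environment of a data storage system 100. The data storage system 100 includes a client 102 and a database server 104, and may further include a storage apparatus 108. It should be understood that these names are used only for ease of description and should not limit the present invention in any way.
The storage device of the database server 104 holds data for use by the client 102 and serves the applications of the client 102. The services may include storage, query, update, transaction management, indexing, caching, query optimization, security, and multi-user access control. The database server 104 may consist of one or more computers together with database management system software.
Both the database server 104 and the client 102 are configured with communication interfaces for communicating with each other. Optionally, the database server 104 and the client 102 communicate through a network 106.
The network 106 may be the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a storage area network (SAN), or the like, or a combination of these networks.
The storage apparatus 108 may be coupled to the database server 104 and/or the client 102 through a communication interface, or through the network 106. Either or both of the database server 104 and the client 102 can access the storage apparatus 108. Optionally, the storage apparatus 108 may act as the storage device of the database server 104 and store the data that the database server 104 has deduplicated and compressed.
It should be understood that FIG. 1 is intended only to introduce, by way of example, the participants of the data storage system 100 and their relationships. The depicted system 100 is therefore greatly simplified; the embodiments of the present invention describe it only in general terms and do not limit its implementation in any way. The client 102 and the database server 104 in FIG. 1 may be of any architecture, which is not limited in the embodiments of the present invention.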
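The client 102 and/or the database server 104 shown in FIG. 1 may be implemented by the computing device 200 shown in FIG. 2.
FIG. 2 is a simplified schematic diagram of the logical structure of the computing device 200. As shown in FIG. 2, the computing device 200 includes a processor 202, a memory unit 204, an input/output interface 206, a communication interface 208, a bus 210, and a storage device 212. The processor 202, memory unit 204, input/output interface 206, communication interface 208, and storage device 212 are communicatively connected to one another through the bus 210.
The processor 202 is the control center of the computing device 200 and executes the relevant programs to implement the technical solutions provided by the embodiments of the present invention. Optionally, the processor 202 includes one or more central processing units (CPUs), for example, CPU 1 and CPU 2 shown in FIG. 2. Optionally, the computing device 200 may include multiple processors 202, each of which may be a single-core processor (one CPU) or a multi-core processor (multiple CPUs). Unless otherwise stated, in the present invention a component that performs a specific function, such as the processor 202 or the memory unit 204, may be implemented either by configuring a general-purpose component to perform the function or by a dedicated component that performs only that function; this application does not limit this. The processor 202 may be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided by this application.
The processor 202 may be connected to one or more storage schemes through the bus 210. A storage scheme may include the memory unit 204 and the storage device 212. The storage device 212 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory unit 204 may be a random access memory. The memory unit 204 may be integrated with the processor 202 or inside the processor 202, or may be one or more storage units independent of the processor 202.
Program code executed by the processor 202, or by a CPU inside the processor 202, may be stored in the storage device 212 or the memory unit 204. Optionally, program code stored in the storage device 212 (for example, the operating system, applications, the deduplication and compression module, or the communication module) is copied into the memory unit 204 for execution by the processor 202.
The storage device 212 may include high-speed random access memory (RAM) and may also include non-volatile memory, such as one or more magnetic disk memories, flash memory, or other non-volatile memory. In some embodiments, the storage device may further include remote memory separate from the one or more processors 202, for example a network drive accessed over a communication network through the communication interface 208. The communication network may be the Internet, an intranet, a LAN, a WAN, a SAN, or the like, or a combination of these networks. The storage device 212 may also be used to store the data that the database server 104 has deduplicated and compressed.
The operating system (for example Darwin, RTXC, LINUX, UNIX, OS X, WINDOWS, or an embedded operating system such as Vxworks) includes various software components and/or drivers for controlling and managing general system tasks (such as memory management, storage device control, and power management) and for facilitating communication between hardware and software components.
The input/output interface 206 receives input data and information and outputs data such as operation results.
The communication interface 208 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the computing device 200 and other devices or communication networks.
The bus 210 may include a path that carries information between the components of the computing device 200 (for example, the processor 202, memory unit 204, input/output interface 206, communication interface 208, and storage device 212). Optionally, the bus 210 may use a wired connection or wireless communication; this application does not limit this.
It should be noted that although the computing device 200 shown in FIG. 2 shows only the processor 202, memory unit 204, input/output interface 206, communication interface 208, bus 210, and storage device 212, a person skilled in the art will understand that in a specific implementation the computing device 200 also includes other components necessary for normal operation.
The computing device 200 may be a general-purpose computer or a special-purpose computing device, including but not limited to a portable computer, a personal desktop computer, a network server, a tablet computer, a mobile phone, a personal digital assistant (PDA), or any other electronic device, or a combination of two or more of these. This application does not limit the specific form of the computing device 200.
In addition, the computing device 200 of FIG. 2 is only one example; a computing device 200 may include more or fewer components than shown in FIG. 2, or a different component configuration. A person skilled in the art will understand that, depending on specific needs, the computing device 200 may also include hardware components that implement other additional functions, and that the computing device 200 may include only the components necessary to implement the embodiments of the present invention rather than all the components shown in FIG. 2. The components shown in FIG. 2 may be implemented in hardware, software, or a combination of both.
The hardware structure shown in FIG. 2 and the description above apply to the various computing devices provided by the embodiments of the present invention and are suitable for performing the various data storage methods provided by these embodiments.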
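As shown in FIG. 2, the memory unit 204 of the computing device 200 contains a deduplication and compression module. The processor 202 executes the program code of this module to deduplicate and compress data that is to be stored in the storage device 212 of the database server 104. The database server 104 stores data in units of data blocks. Different data blocks are highly similar, and the parts that are identical across different data blocks sit at the same relative positions within the blocks. Based on this property, similar-data deduplication can compare only the data at the same relative positions of the to-be-stored data block and the reference data block, which avoids cross-comparison between data at different relative positions and greatly reduces the complexity of the comparison.
The deduplication and compression module may consist of one or more operation instructions that cause the computing device to perform one or more method steps according to the description above. The specific method steps are described in detail in the following parts of this application.
FIG. 3 is an exemplary flowchart of a data storage method 300 according to an embodiment of the present invention. As shown in FIG. 3, the method 300 includes:
S302: Divide the to-be-saved data block into N to-be-saved sub-blocks of equal size.
The N to-be-saved sub-blocks correspond to N location identifiers, each to-be-saved sub-block corresponds to one location identifier, and N is a positive integer greater than 1. Specifically, when the to-be-saved data block is divided into N to-be-saved sub-blocks, the positional order of a to-be-saved sub-block within the to-be-saved data block can be used as its location identifier.
In a specific implementation, the size of the to-be-saved data block may be the size of the basic storage unit of the database; the basic storage unit, a data block, is further refined by cutting it into smaller sub-blocks. For example, if the basic storage unit is an 8 KB data block and 256 bytes is the optimal to-be-saved sub-block size, the to-be-saved data block is divided into 32 to-be-saved sub-blocks of 256 bytes each (8192 / 256 = 32). The basic unit of comparison for similarity deduplication thus becomes 256 bytes.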
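As a rough illustration of step S302, the following Python sketch splits an 8 KB block into 256-byte sub-blocks. The function name `split_block` is hypothetical and chosen only for this example; the list index plus one plays the role of the location identifier.

```python
def split_block(block: bytes, sub_size: int) -> list[bytes]:
    """Split a block into equally sized sub-blocks.

    The sub-block at list index k has location identifier k + 1.
    """
    if sub_size <= 0 or len(block) % sub_size != 0:
        raise ValueError("sub_size must be positive and divide the block size")
    return [block[i:i + sub_size] for i in range(0, len(block), sub_size)]


# An 8 KB block of zero bytes stands in for a real database block.
block = bytes(8 * 1024)
subs = split_block(block, 256)
print(len(subs))  # 32
```

Rejecting sizes that do not divide the block evenly keeps every sub-block the same length, which the same-position comparison in later steps relies on.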
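S304: Select one of at least two saved comparison data blocks as the reference data block.
The reference data block includes N reference sub-blocks, the N reference sub-blocks correspond to the N location identifiers, and each reference sub-block corresponds to one location identifier. The reference data block is split in the same way as the to-be-saved data block, and its location identifiers are expressed in the same way.
S306: Compare the to-be-saved sub-blocks and reference sub-blocks that correspond to the same location identifiers, and determine the first sub-block, where a first sub-block is a to-be-saved sub-block that differs from the reference sub-block it is compared with.
Specifically, the to-be-saved sub-block corresponding to the i-th location identifier is compared with the reference sub-block corresponding to the i-th location identifier, and the first sub-block is determined among the N to-be-saved sub-blocks, where i is a positive integer that increments from 1 to N. There may be one first sub-block or more than one.
S308: Select a representative sub-block for the first sub-block.
S310: XOR the first sub-block with its representative sub-block.
S312: Compress the result of the XOR operation of step S310 with run-length encoding, and save the compressed result and the location information of the first sub-block.
In step S306, to avoid byte-by-byte comparison when comparing the to-be-saved sub-blocks and reference sub-blocks that correspond to the same location identifiers, the fingerprints of the to-be-saved sub-block and the reference sub-block can be compared instead. If the fingerprints are the same, the two sub-blocks are the same; if the fingerprints differ, the two sub-blocks differ. A fingerprint is a credential of a sub-block's identity. There are many ways to compute a sub-block fingerprint; for example, a hash fingerprint can be computed as an SHA-1 or MD5 hash value. The embodiments of the present invention do not limit this.
The fingerprints of the to-be-saved sub-blocks and reference sub-blocks that correspond to the same location identifiers are compared, the first sub-block is determined, and the location information of the first sub-block is saved.
One possible way of splitting and comparing the to-be-saved data block and the reference data block is shown in FIG. 4. The to-be-saved data block 402 is divided, in data order, into N to-be-saved sub-blocks of equal size, numbered 1 to N. The order of a to-be-saved sub-block can serve as its location identifier; for example, i is the location identifier of the i-th to-be-saved sub-block. Likewise, the reference data block 404 is divided into N sub-blocks of equal size, numbered 1 to N. The fingerprints of the sub-blocks at the same location identifiers of the to-be-saved data block 402 and the reference data block 404 are compared one by one in positional order. That is, for i incrementing from 1 to N, the fingerprint of the to-be-saved sub-block corresponding to the i-th location identifier is compared with the fingerprint of the reference sub-block corresponding to the i-th location identifier; if the fingerprints are the same, the two sub-blocks are the same, and if the fingerprints differ, the two sub-blocks differ.
If the fingerprints are the same, the to-be-saved sub-block at the i-th location identifier can be deduplicated: the i-th to-be-saved sub-block is not saved, and when the data is read, the i-th reference sub-block of the reference data block 404 is used to restore the i-th to-be-saved sub-block of the to-be-saved data block 402. If the fingerprints differ, the i-th sub-blocks of the to-be-saved data block 402 and the reference data block 404 differ; the location information of the i-th to-be-saved sub-block is saved, the i-th sub-block of the to-be-saved data block 402 is processed by the subsequent steps S308 to S312, and the resulting information is retained. Here i increments from 1 to N.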
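The same-position fingerprint comparison of step S306 can be sketched in Python as follows. This is a minimal sketch that assumes SHA-1 fingerprints (one of the options the text names); the function names `fingerprint` and `classify` are hypothetical.

```python
import hashlib


def fingerprint(sub: bytes) -> bytes:
    """SHA-1 digest used as the sub-block fingerprint (an assumed choice)."""
    return hashlib.sha1(sub).digest()


def classify(to_save: list[bytes], reference: list[bytes]) -> tuple[list[int], list[int]]:
    """Compare same-position sub-blocks by fingerprint.

    Returns (first, second): the 1-based positions of sub-blocks that
    differ from the reference (first sub-blocks) and of sub-blocks that
    match it and can be deduplicated (second sub-blocks).
    """
    if len(to_save) != len(reference):
        raise ValueError("both blocks must have the same number of sub-blocks")
    first, second = [], []
    for i, (s, r) in enumerate(zip(to_save, reference), start=1):
        if fingerprint(s) == fingerprint(r):
            second.append(i)
        else:
            first.append(i)
    return first, second


to_save = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
reference = [b"aaaa", b"bxbb", b"cccc", b"dddd"]
first, second = classify(to_save, reference)
print(first, second)  # [2] [1, 3, 4]
```

In a real system the reference fingerprints would typically be stored rather than recomputed on every comparison, as the text notes later.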
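Optionally, in one implementation of the embodiments of the present invention, the to-be-saved data block and the reference data block can be compared by bisection to find the positions at which the fingerprints of their same-location sub-blocks differ.
The specific procedure is as follows. The to-be-saved data block and the reference data block are each divided into a "left data block" and a "right data block" of equal size, and the fingerprints of the "left data block" and "right data block" of the to-be-saved data block are computed. The "left data block" fingerprints of the to-be-saved data block and the reference data block are compared, and so are the "right data block" fingerprints. (If the fingerprints of the "left data block" and "right data block" of the reference data block have been saved, the saved values can be used directly; otherwise they are computed.) If the fingerprints of the "left (right) data blocks" of the two blocks are the same, the two "left (right) data blocks" are the same and are not split further. If the fingerprints differ, the two "left (right) data blocks" differ; following the same principle, they are each bisected again and the fingerprints of the resulting halves are compared, and so on, until the size of the split data blocks equals the sub-block size. In this way, all positions at which the same-location sub-blocks of the to-be-saved data block and the reference data block have different fingerprints are found and recorded.
When the to-be-saved data block and the reference data block are highly similar, bisection reduces the computation needed to find the positions of same-location sub-blocks with different fingerprints and speeds up deduplication and compression.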
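The bisection comparison can be sketched recursively in Python. The sketch below makes several assumptions that are not in the source: SHA-1 fingerprints, a sub-block count that is a power of two (so that every half falls on a sub-block boundary), and a first comparison of the whole range before halving. The name `differing_positions` is hypothetical.

```python
import hashlib


def differing_positions(a: bytes, b: bytes, sub_size: int) -> list[int]:
    """Find 1-based positions of differing sub-blocks by repeated halving."""
    n = len(a) // sub_size
    if len(a) != len(b) or len(a) % sub_size != 0 or n == 0 or n & (n - 1):
        raise ValueError("blocks must be equal in size with a power-of-two sub-block count")

    def fp(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()

    def walk(lo: int, hi: int) -> list[int]:
        # Equal fingerprints: the whole range matches, so stop splitting.
        if fp(a[lo:hi]) == fp(b[lo:hi]):
            return []
        # The range is a single sub-block that differs: record its position.
        if hi - lo == sub_size:
            return [lo // sub_size + 1]
        mid = lo + (hi - lo) // 2
        return walk(lo, mid) + walk(mid, hi)

    return walk(0, len(a))


a = bytes(32)               # 8 sub-blocks of 4 bytes each
b = bytearray(a)
b[5] = 1                    # falls in sub-block 2
b[30] = 1                   # falls in sub-block 8
print(differing_positions(a, bytes(b), 4))  # [2, 8]
```

When only a few sub-blocks differ, most halves are discarded after a single fingerprint comparison, which is where the saving described above comes from.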
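It should be understood that the location information of the first sub-block can be saved in many ways. It can be done by recording the location information of the first sub-blocks, in which case the to-be-saved sub-blocks at the positions not recorded are second sub-blocks; or by recording the location information of the second sub-blocks, in which case the to-be-saved sub-blocks at the positions not recorded are first sub-blocks. The embodiments of the present invention do not limit how the location information of the first sub-block is saved.
In a specific implementation, the order of the first sub-block within the to-be-saved data block can represent its location information. For example, as shown in FIG. 4, i can represent the location information of the i-th to-be-saved sub-block in the to-be-saved data block, where i is a positive integer greater than 0 and less than or equal to N. The location information can then be recorded either as the positional order of the first sub-blocks, in which case the sub-blocks at unrecorded positions are second sub-blocks, or as the positional order of the second sub-blocks, in which case the sub-blocks at unrecorded positions are first sub-blocks.
In a specific implementation, the location information of a first sub-block can also be recorded as its start address and end address, or as its start address and address length. If several first sub-blocks occupy consecutive positions, only the start address and end address of the consecutive range, or its start address and address length, need to be recorded, which saves the storage space needed to record the range. It should be understood that the start address of a first sub-block here may be an offset relative to the start address of the to-be-saved data block.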
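Recording consecutive first sub-blocks as a single range can be sketched as below. The function name `to_runs` and the (first position, count) representation are assumptions made for this example; a byte offset can be derived as (position - 1) multiplied by the sub-block size.

```python
def to_runs(positions: list[int]) -> list[tuple[int, int]]:
    """Collapse sorted 1-based positions into (first_position, count) runs."""
    runs: list[tuple[int, int]] = []
    for p in positions:
        if runs and runs[-1][0] + runs[-1][1] == p:
            # p directly follows the current run, so extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((p, 1))
    return runs


print(to_runs([2, 3, 7]))  # [(2, 2), (7, 1)]
```

Two adjacent differing sub-blocks therefore cost one record instead of two, which matches the space saving described above.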
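Data blocks stored in the database server 104 are highly similar, and the identical parts of different data blocks generally sit at the same relative positions within the blocks. Exploiting this, the to-be-saved data block is split into finer-grained to-be-saved sub-blocks, and the fingerprints of the sub-blocks at the same location identifiers of the to-be-saved data block and the reference data block are compared. This avoids cross-comparison of sub-blocks at different location identifiers and finds the parts that the to-be-saved data block shares with the reference data block at a small computational cost. The shared parts are deduplicated; that is, a sub-block of the to-be-saved data block that has the same position and the same fingerprint as a sub-block of the reference data block is not saved. During data recovery, a deduplicated sub-block is restored from the sub-block at the corresponding position of the reference data block. This greatly reduces the storage needed to record the to-be-saved data block.
The fingerprints of the reference sub-blocks of the reference data block may be computed when the reference data block is saved and kept in the database, or computed when the reference sub-blocks and to-be-saved sub-blocks at the same location identifiers are compared. The embodiments of the present invention do not limit this.
Specifically, in S304, selecting one of at least two saved comparison data blocks as the reference data block includes: comparing the to-be-saved data block with the at least two comparison data blocks in the manner of step S306, and selecting, from the at least two comparison data blocks, the one that has the most sub-blocks identical to those of the to-be-saved data block as the reference data block. Further, the comparison data block that has the most identical sub-blocks and the fewest clusters of differing sub-blocks may be selected as the reference data block, where a cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions such that the sub-blocks adjacent to the set are identical sub-blocks.
By comparing the to-be-saved data block with multiple comparison data blocks, the similarity between each comparison data block and the to-be-saved data block can be determined; the comparison data block with the most identical sub-blocks is the one most similar to the to-be-saved data block. Further, selecting, among the most similar comparison data blocks, the one with the fewest clusters of differing sub-blocks as the reference data block reduces the information needed to record the positions of the differing sub-blocks and further improves the compression achieved by deduplication.
For example, as shown in FIG. 5, the to-be-saved data block 402 is compared with three comparison data blocks: comparison data block 502, comparison data block 504, and comparison data block 506. The fingerprints of the sub-blocks at the same positions of the to-be-saved data block 402 and each comparison data block are compared; identical fingerprints indicate identical sub-blocks, and different fingerprints indicate differing sub-blocks. As shown in FIG. 5, comparison data block 502 has three sub-blocks that differ from the to-be-saved data block 402: the 2nd, the (N-2)-th, and the (N-1)-th. Comparison data block 504 has two differing sub-blocks: the 2nd and the (N-2)-th. Comparison data block 506 has two differing sub-blocks: the 2nd and the 3rd. The comparison shows that comparison data blocks 504 and 506 share the most sub-blocks with the to-be-saved data block 402, N-2 each. Comparison data block 506 has one cluster of differing sub-blocks, namely the set consisting of the 2nd and 3rd sub-blocks, whereas comparison data block 504 has two clusters, namely the 2nd sub-block and the (N-2)-th sub-block. Comparison data block 506, which has the fewest clusters of differing sub-blocks, is therefore selected as the reference data block.
Comparing the fingerprints of the same-position sub-blocks of the to-be-saved data block 402 and comparison data block 506 shows that the fingerprints of the 2nd and 3rd sub-blocks differ, so the location information of the 2nd and 3rd sub-blocks is recorded. For example, the start position and end position of the cluster formed by the 2nd and 3rd sub-blocks may be recorded, or the start position and the data length of the cluster. The embodiments of the present invention do not limit this.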
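The reference selection of S304 can be sketched in Python using the FIG. 5 scenario with N = 8. For brevity the sketch compares sub-blocks directly instead of by fingerprint, and all function names are hypothetical. Candidates are ranked first by the number of differing sub-blocks and then by the number of clusters.

```python
def differing_positions(to_save, candidate):
    """1-based positions where same-position sub-blocks differ."""
    return [i for i, (s, c) in enumerate(zip(to_save, candidate), start=1) if s != c]


def cluster_count(positions):
    """Number of maximal runs of consecutive positions."""
    return sum(1 for k, p in enumerate(positions)
               if k == 0 or p != positions[k - 1] + 1)


def choose_reference(to_save, candidates):
    """Pick the key of the candidate with the fewest differing sub-blocks,
    breaking ties by the fewest clusters of differing sub-blocks."""
    def score(name):
        diffs = differing_positions(to_save, candidates[name])
        return (len(diffs), cluster_count(diffs))
    return min(candidates, key=score)


N = 8
block_402 = [bytes([i]) * 4 for i in range(1, N + 1)]


def with_diffs(positions):
    """Copy of block_402 with the given 1-based positions overwritten."""
    return [b"\x00" * 4 if i in positions else s
            for i, s in enumerate(block_402, start=1)]


candidates = {
    502: with_diffs({2, N - 2, N - 1}),  # 3 differing sub-blocks, 2 clusters
    504: with_diffs({2, N - 2}),         # 2 differing sub-blocks, 2 clusters
    506: with_diffs({2, 3}),             # 2 differing sub-blocks, 1 cluster
}
chosen = choose_reference(block_402, candidates)
print(chosen)  # 506
```

Blocks 504 and 506 tie on the number of differing sub-blocks, so the cluster count decides in favor of 506, as in the worked example above.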
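Optionally, in S308, selecting a representative sub-block for the first sub-block includes: selecting, for each first sub-block, the reference sub-block corresponding to the same location identifier as the representative sub-block of that first sub-block.
For example, as shown in FIG. 5, the reference data block of the to-be-saved data block 402 is comparison data block 506, which has two sub-blocks that differ from the to-be-saved data block 402: the 2nd and the 3rd. The 2nd sub-block of comparison data block 506 is therefore selected as the representative sub-block of the 2nd to-be-saved sub-block of the to-be-saved data block 402, and the 3rd sub-block of comparison data block 506 as the representative sub-block of the 3rd to-be-saved sub-block. The 2nd sub-block of the to-be-saved data block 402 is XORed with the 2nd sub-block of comparison data block 506, and the 3rd with the 3rd. Because the 2nd and 3rd to-be-saved sub-blocks are at consecutive positions, their XOR results can be run-length encoded together and their location information recorded together.
Optionally, in S308, selecting a representative sub-block for the first sub-block includes: XORing the first sub-block with each of at least two saved sub-blocks and, based on the run-length encoding compression ratio of the XOR results, selecting the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block.
That is, each of several candidate saved sub-blocks is XORed with all the first sub-blocks, the run-length encoding compression ratio of the XOR results is measured for each candidate, and the saved sub-block with the smallest corresponding compression ratio is selected as the representative sub-block.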
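Choosing a representative among saved sub-blocks by run-length encoding compression ratio might look like the sketch below. The source does not fix an encoding format, so the byte-wise (count, value) scheme with runs capped at 255 is an assumption, as are the names `rle_size` and `choose_representative`. The sketch handles a single first sub-block.

```python
def rle_size(data: bytes) -> int:
    """Size in bytes of a byte-wise (count, value) run-length encoding,
    with runs capped at 255 (an assumed format)."""
    size, i = 0, 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        size += 2  # one count byte plus one value byte per run
        i = j
    return size


def choose_representative(first_sub: bytes, saved: list[bytes]) -> int:
    """Index of the saved sub-block whose XOR with first_sub encodes smallest."""
    def ratio(candidate: bytes) -> float:
        delta = bytes(x ^ y for x, y in zip(first_sub, candidate))
        return rle_size(delta) / len(first_sub)
    return min(range(len(saved)), key=lambda k: ratio(saved[k]))


first_sub = b"AAAAAAAAAAAAAAAB"
saved = [b"ABABABABABABABAB", b"AAAAAAAABBBBBBBB", b"AAAAAAAAAAAAAAAA"]
best = choose_representative(first_sub, saved)
print(best)  # 2
```

The third candidate differs from the first sub-block in only the last byte, so its XOR result is a long run of zero bytes followed by a single non-zero byte, which encodes in the fewest bytes.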
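Optionally, in S308, selecting a representative sub-block for the first sub-block includes: XORing the first sub-blocks pairwise and, based on the run-length encoding compression ratio of the XOR results, selecting the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio as the representative sub-block, and saving the representative sub-block.
For example, suppose the to-be-saved data block has M first sub-blocks. For ease of description, they are numbered 1 to M in the order of their positions in the to-be-saved data block. The M first sub-blocks are XORed pairwise, and for each first sub-block the run-length encoding compression ratio of its XOR results with the other M-1 first sub-blocks is measured: for the 1st first sub-block, its XOR results with the 2nd to M-th first sub-blocks; for the 2nd, its XOR results with the other M-1 first sub-blocks (the 1st and the 3rd to M-th); and so on, up to the M-th first sub-block and its XOR results with the preceding M-1 first sub-blocks. The first sub-block whose XOR results with the other M-1 first sub-blocks have the smallest run-length encoding compression ratio is selected as the representative sub-block.
Specifically, run-length encoding compresses by recording a character that occurs consecutively together with the number of times it occurs. In a specific implementation, the compression achieved on the XOR results between each first sub-block and the other first sub-blocks can therefore be judged from the space saved by run-length encoding the consecutive characters in the XOR results: the more space run-length encoding saves, the stronger the compression, and hence the smaller the ratio of encoded size to original size. For example, suppose three identical characters occur consecutively, that is, 24 consecutive "0" bits. The space saved is 3 bytes minus the storage needed to record the fact that three identical characters occur consecutively; if recording that fact takes 2 bytes, the space saved is 1 byte.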
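The pairwise selection among the first sub-blocks can be sketched as follows. As before, the byte-wise (count, value) encoding with runs capped at 255 is an assumed format, and the names `rle_size` and `choose_representative_pairwise` are hypothetical. Each sub-block is scored by the total encoded size of its XOR results with all the others.

```python
def rle_size(data: bytes) -> int:
    """Size in bytes of a byte-wise (count, value) run-length encoding,
    with runs capped at 255 (an assumed format)."""
    size, i = 0, 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        size += 2
        i = j
    return size


def choose_representative_pairwise(first_subs: list[bytes]) -> int:
    """Index of the first sub-block whose XORs with all the others
    have the smallest total run-length encoded size."""
    def total(k: int) -> int:
        return sum(
            rle_size(bytes(x ^ y for x, y in zip(first_subs[k], other)))
            for m, other in enumerate(first_subs) if m != k
        )
    return min(range(len(first_subs)), key=total)


first_subs = [b"AAAAAAAA", b"AAAAAAAB", b"AAAAAABB"]
rep = choose_representative_pairwise(first_subs)
print(rep)  # 0
```

Because every sub-block has the same length, minimizing the total encoded size is equivalent to minimizing the compression ratio. The cost is quadratic in the number of first sub-blocks, so this variant suits blocks with only a few differing sub-blocks.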
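In S310, the first sub-block is XORed with its representative sub-block. The XOR operation acts on the binary bits of the sub-blocks; that is, a bitwise XOR is performed on the to-be-saved sub-block and the reference sub-block. Where the two have the same bit value at the same position, the XOR result is 0; where the bit values differ, the XOR result is 1.
Because the XOR result is a binary string of "0"s and/or "1"s, compressing it with run-length encoding saves the storage space needed to record the information.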
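Steps S310 and S312 can be illustrated with the Python sketch below. The source does not specify the run-length encoding format, so the byte-wise (count, value) pairs with runs capped at 255 are an assumption; `xor_bytes` and `rle_encode` are hypothetical names.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equally sized sub-blocks."""
    if len(a) != len(b):
        raise ValueError("sub-blocks must have the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, value) byte pairs, runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out += bytes((j - i, data[i]))
        i = j
    return bytes(out)


# A 256-byte sub-block that differs from its representative in one byte.
first_sub = b"A" * 100 + b"B" + b"A" * 155
representative = b"A" * 256
delta = xor_bytes(first_sub, representative)
encoded = rle_encode(delta)
print(len(delta), len(encoded))  # 256 6
```

The XOR result is 100 zero bytes, one byte 0x03, and 155 zero bytes, so three runs are enough to describe the whole 256-byte difference.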
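The method 300 further includes saving the fingerprint of the first sub-block, so that the fingerprint need not be recomputed when the first sub-block later serves as a reference sub-block for another data block.
When the data is read, the XOR result can be recovered from the saved run-length encoding result. The differing sub-block can then be recovered from the recovered XOR result and the saved representative sub-block: a "1" in the XOR result means that the first sub-block's bit at that position is the inverse of the representative sub-block's bit, and a "0" means the two bits are the same. Finally, the original to-be-saved data block can be recovered from the recovered first sub-blocks, the saved location information of the first sub-blocks, and the saved reference data block.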
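The read path can be sketched as below, assuming the same byte-wise (count, value) run-length format as an illustration and assuming that each first sub-block used the same-position reference sub-block as its representative. The `stored` mapping and both function names are hypothetical.

```python
def rle_decode(encoded: bytes) -> bytes:
    """Inverse of a (count, value) byte-pair run-length encoding."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        count, value = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)


def restore_block(reference_subs: list[bytes], stored: dict[int, bytes]) -> bytes:
    """Rebuild the original block.

    stored maps the 1-based position of each first sub-block to the
    run-length encoded XOR of that sub-block with the reference
    sub-block at the same position. Positions absent from stored were
    deduplicated and are copied from the reference block.
    """
    restored = []
    for i, ref in enumerate(reference_subs, start=1):
        if i in stored:
            delta = rle_decode(stored[i])
            restored.append(bytes(x ^ y for x, y in zip(delta, ref)))
        else:
            restored.append(ref)
    return b"".join(restored)


reference_subs = [b"aaaa", b"bbbb", b"cccc"]
stored = {2: bytes([1, 0, 1, 3, 2, 0])}  # XOR delta 00 03 00 00
original = restore_block(reference_subs, stored)
print(original)  # b'aaaababbcccc'
```

XORing the decoded delta with the representative flips exactly the bits marked "1", which is the recovery rule described above.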
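In one implementation of the embodiments of the present invention, before step S302, the method 300 further includes: extracting a sample from the to-be-saved data; determining, for different to-be-saved sub-block sizes, the storage compression ratio, save speed, and read speed of the sample; and determining N from the sub-block size that has the smallest storage compression ratio among those whose save speed and read speed satisfy a constraint, where the constraint is: the save speed is greater than or equal to a preset first threshold, and the read speed is greater than or equal to a preset second threshold.
In a specific implementation, a suitable sub-block size and number of sub-blocks N can be determined from sample data. Let the size of the to-be-saved data block be K and the size of a to-be-saved sub-block be M. M must satisfy the following constraints:
M > 0, that is, M must be positive;
K/M >= 2, that is, there must be at least two to-be-saved sub-blocks;
K mod M = 0, that is, K modulo M is 0, which ensures that all to-be-saved sub-blocks have the same size.
Among the values of M that satisfy these conditions, the storage compression ratio, save speed, and read speed of the sample are measured for each to-be-saved sub-block size M by following the flow of method 300. Among the values of M that satisfy the constraint (the save speed is greater than or equal to the preset first threshold, and the read speed is greater than or equal to the preset second threshold), the one with the smallest storage compression ratio is selected, and the number of sub-blocks is determined as N = K/M. Here the storage compression ratio is the ratio of the size of the deduplicated and compressed data to the size of the to-be-saved data block. The deduplicated and compressed data is the data needed to preserve the information of the to-be-saved data block: the run-length encoding result, the data needed to record the reference data block and reference sub-block information, and the data recording the location information of the first sub-blocks.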
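The selection of the sub-block size M can be sketched as follows. The measurement figures in the example are made-up placeholders, not results from the source; in practice they would come from running method 300 on the sample for each candidate M. The function names are hypothetical.

```python
def valid_sub_sizes(block_size: int) -> list[int]:
    """All M with M > 0, K / M >= 2 and K mod M == 0, for K = block_size."""
    return [m for m in range(1, block_size // 2 + 1) if block_size % m == 0]


def choose_sub_size(measurements: dict[int, tuple[float, float, float]],
                    min_save_speed: float, min_read_speed: float) -> int:
    """measurements maps M to (compression_ratio, save_speed, read_speed).

    Returns the M with the smallest compression ratio among those that
    meet both speed thresholds.
    """
    ok = {m: v for m, v in measurements.items()
          if v[1] >= min_save_speed and v[2] >= min_read_speed}
    if not ok:
        raise ValueError("no sub-block size meets the speed thresholds")
    return min(ok, key=lambda m: ok[m][0])


K = 8192
# Placeholder numbers for illustration only (ratio, save speed, read speed).
measurements = {
    128: (0.30, 80.0, 90.0),
    256: (0.35, 120.0, 150.0),
    512: (0.45, 200.0, 220.0),
}
M = choose_sub_size(measurements, min_save_speed=100.0, min_read_speed=100.0)
N = K // M
print(M, N)  # 256 32
```

With these placeholder figures, M = 128 compresses best but fails both speed thresholds, so M = 256 is chosen and N = 8192 / 256 = 32.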
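It should be understood that the above is only an example of how the sub-block size and the number of sub-blocks N can be determined. They can be determined from statistics on sample data or in other ways, for example adjusted dynamically based on empirical values or set before the system runs. The embodiments of the present invention do not limit this.
It should be understood that the method 300 may be performed by the client 102, by the database server 104, or partly by each. For example, the client may divide the to-be-saved data block into N to-be-saved sub-blocks, compute the fingerprint of each, and send the N fingerprints to the database server 104. The database server 104 performs step S304 to select a reference data block for the to-be-saved data block, determines the first sub-block by comparing the to-be-saved sub-blocks and reference sub-blocks at the same location identifiers, and sends indication information of the first sub-block to the client 102. The client 102 then sends the first sub-block to the database server 104, which performs the subsequent operations. The embodiments of the present invention do not limit which entity performs the method 300.
According to the technical solutions disclosed in the embodiments of the present invention, data similarity in a database is position-dependent. Exploiting this, the to-be-saved data block is divided into finer-grained sub-blocks, and the sub-blocks at the same positions of the to-be-saved data block and the reference data block are compared one by one. Identical sub-blocks are deduplicated at a small computational cost, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space needed to record the to-be-saved data block.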
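FIG. 6 is a schematic diagram of the logical structure of a data storage apparatus 600 according to an embodiment of the present invention. As shown in FIG. 6, the apparatus 600 includes a splitting unit 602, a selection unit 604, a comparison unit 606, a calculation unit 608, and a compression unit 610.
The splitting unit 602 is configured to divide a to-be-saved data block into N to-be-saved sub-blocks of equal size, where the N to-be-saved sub-blocks correspond to N location identifiers, each to-be-saved sub-block corresponds to one location identifier, and N is a positive integer greater than 1.
The selection unit 604 is configured to select one of at least two saved comparison data blocks as a reference data block, where the reference data block includes N reference sub-blocks, the N reference sub-blocks correspond to the N location identifiers, and each reference sub-block corresponds to one location identifier.
The comparison unit 606 is configured to compare the to-be-saved sub-block corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier, and determine a first sub-block among the N to-be-saved sub-blocks, where i is a positive integer that increments from 1 to N, and a first sub-block is a to-be-saved sub-block that differs from the reference sub-block it is compared with.
The selection unit 604 is further configured to select a representative sub-block for the first sub-block.
The calculation unit 608 is configured to perform an XOR operation on the data of the first sub-block and the data of the representative sub-block.
The compression unit 610 is configured to compress the result of the XOR operation with run-length encoding, and save the compressed result and the location information of the first sub-block.
In a specific implementation, the apparatus 600 further includes a deduplication unit. The comparison unit 606 is further configured to determine a second sub-block among the N to-be-saved sub-blocks, where a second sub-block is a to-be-saved sub-block that is identical to the reference sub-block it is compared with; the deduplication unit is configured to perform a deduplication operation on the second sub-block.
Data similarity in a database is position-dependent. Exploiting this, the to-be-saved data block is divided into finer-grained sub-blocks, and the sub-blocks at the same positions of the to-be-saved data block and the reference data block are compared one by one. Identical sub-blocks are deduplicated at a small computational cost, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space needed to record the to-be-saved data block.
To avoid byte-by-byte comparison when comparing the to-be-saved sub-blocks and reference sub-blocks that correspond to the same location identifiers, the comparison unit 606 is specifically configured to compare the fingerprint of the to-be-saved sub-block corresponding to the i-th location identifier with the fingerprint of the reference sub-block corresponding to the i-th location identifier; if the fingerprints are the same, the two sub-blocks are the same, and if the fingerprints differ, the two sub-blocks differ. Comparing sub-block fingerprints avoids bit-by-bit comparison between sub-blocks and thus reduces the complexity of the comparison.
Optionally, the selection unit 604 being configured to select one of at least two saved comparison data blocks as the reference data block includes: the selection unit 604 is configured to select, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the to-be-saved data block as the reference data block. Further, the selection unit 604 may select the comparison data block that has the most identical sub-blocks and the fewest clusters of differing sub-blocks as the reference data block, where a cluster of differing sub-blocks is a set of differing sub-blocks at consecutive positions such that the sub-blocks adjacent to the set are identical sub-blocks.
By comparing the to-be-saved data block with multiple comparison data blocks, the similarity between each comparison data block and the to-be-saved data block can be determined; the comparison data block with the most identical sub-blocks is the one most similar to the to-be-saved data block. Further, selecting, among the most similar comparison data blocks, the one with the fewest clusters of differing sub-blocks as the reference data block reduces the information needed to record the positions of the differing sub-blocks and further improves the compression achieved by deduplication.
Optionally, the selection unit 604 being configured to select a representative sub-block for the first sub-block includes: the selection unit 604 is configured to select, for each first sub-block, the reference sub-block corresponding to the same location identifier as the representative sub-block of that first sub-block.
Because data blocks stored in a database are highly similar, and the identical parts of different data blocks generally sit at the same relative positions within the blocks, selecting the same-position reference sub-block as the representative sub-block of each first sub-block produces long runs of consecutive "0"s in the XOR result, which greatly lowers the run-length encoding compression ratio (that is, the encoded result becomes much smaller).
Optionally, the selection unit 604 being configured to select a representative sub-block for the first sub-block includes: the selection unit 604 is configured to XOR, by means of the XOR unit (that is, the calculation unit 608, which performs the XOR operation), the first sub-block with each of at least two saved sub-blocks and, based on the run-length encoding compression ratio of the XOR results, select the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block.
Selecting, by comparison, the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block can reduce, to a large extent, the storage space needed to save the to-be-saved data block.
Optionally, the selection unit 604 being configured to select a representative sub-block for the first sub-block includes: the selection unit 604 is configured to XOR, by means of the XOR unit (that is, the calculation unit 608), the first sub-blocks pairwise and, based on the run-length encoding compression ratio of the XOR results, select the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio as the representative sub-block.
Finding, among the first sub-blocks, the one most similar to the other first sub-blocks and using it as the representative sub-block reduces the storage space needed for the to-be-saved data block without depending on saved sub-blocks.
To determine the optimal value of N, the selection unit 604 is further configured to: extract a sample from the to-be-saved data; determine, for different to-be-saved sub-block sizes, the storage compression ratio, save speed, and read speed of the sample; and determine N from the sub-block size that has the smallest storage compression ratio among those whose save speed and read speed satisfy a constraint, where the constraint is: the save speed is greater than or equal to a preset first threshold, and the read speed is greater than or equal to a preset second threshold.
By sampling and choosing the most suitable way to split the data block, the storage space needed to store the to-be-saved data block can be minimized while the required save and read processing speeds are still met.
Optionally, the comparison unit 606 is further configured to save the fingerprint of the first sub-block, so that the fingerprint need not be recomputed when the first sub-block later serves as a reference sub-block for another data block.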
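This embodiment of the present invention is the apparatus embodiment corresponding to the method 300; the feature descriptions of the method 300 embodiment apply to it and are not repeated here.
According to the technical solutions disclosed in the present invention, data similarity in a database is position-dependent. Exploiting this, the to-be-saved data block is divided into finer-grained sub-blocks, and the sub-blocks at the same positions of the to-be-saved data block and the reference data block are compared one by one. Identical sub-blocks are deduplicated at a small computational cost, differing sub-blocks are XORed with the reference sub-blocks, and the XOR results are compressed with run-length encoding, which greatly reduces the storage space needed to record the to-be-saved data block.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation; multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated modules may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Integrated modules implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
Finally, it should be noted that the foregoing embodiments are intended only to describe the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of protection of the technical solutions of the embodiments of the present invention.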

Claims (20)

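  1. A data storage method, comprising:
    dividing a to-be-saved data block into N to-be-saved sub-blocks of equal size, wherein the N to-be-saved sub-blocks correspond to N location identifiers, each to-be-saved sub-block corresponds to one location identifier, and N is a positive integer greater than 1;
    selecting one of at least two saved comparison data blocks as a reference data block, wherein the reference data block comprises N reference sub-blocks, the N reference sub-blocks correspond to the N location identifiers, and each reference sub-block corresponds to one location identifier;
    comparing the to-be-saved sub-block corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier, and determining a first sub-block among the N to-be-saved sub-blocks, wherein i is a positive integer that increments from 1 to N, and the first sub-block is a to-be-saved sub-block that differs from the reference sub-block it is compared with;
    selecting a representative sub-block for the first sub-block;
    performing an XOR operation on the data of the first sub-block and the data of the representative sub-block; and
    compressing the result of the XOR operation with run-length encoding, and saving the compressed result and the location information of the first sub-block.
  2. The method according to claim 1, further comprising:
    determining a second sub-block among the N to-be-saved sub-blocks, wherein the second sub-block is a to-be-saved sub-block that is identical to the reference sub-block it is compared with; and
    performing a deduplication operation on the second sub-block.
  3. The method according to claim 1 or 2, wherein comparing the to-be-saved sub-block corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier comprises: comparing the fingerprint of the to-be-saved sub-block corresponding to the i-th location identifier with the fingerprint of the reference sub-block corresponding to the i-th location identifier, wherein identical fingerprints indicate that the two are the same, and different fingerprints indicate that the two differ.
  4. The method according to any one of claims 1 to 3, wherein selecting one of at least two saved comparison data blocks as a reference data block comprises: selecting, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the to-be-saved data block as the reference data block.
  5. The method according to any one of claims 1 to 4, wherein selecting a representative sub-block for the first sub-block comprises: selecting, for each first sub-block, the reference sub-block corresponding to the same location identifier as the representative sub-block of that first sub-block.
  6. The method according to any one of claims 1 to 4, wherein selecting a representative sub-block for the first sub-block comprises: XORing the first sub-block with each of at least two saved sub-blocks and, based on the run-length encoding compression ratio of the results of the XOR operations, selecting the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block.
  7. The method according to any one of claims 1 to 4, wherein selecting a representative sub-block for the first sub-block comprises: XORing the first sub-blocks pairwise and, based on the run-length encoding compression ratio of the results of the XOR operations, selecting the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio as the representative sub-block.
  8. The method according to any one of claims 1 to 7, further comprising, before dividing the to-be-saved data block into N to-be-saved sub-blocks of equal size:
    extracting a sample from the to-be-saved data; determining, for different to-be-saved sub-block sizes, the storage compression ratio, save speed, and read speed of the sample; and determining N from the sub-block size that has the smallest storage compression ratio among those whose save speed and read speed satisfy a constraint, wherein the constraint is: the save speed is greater than or equal to a preset first threshold, and the read speed is greater than or equal to a preset second threshold.
  9. The method according to claim 3, further comprising: saving the fingerprint of the first sub-block.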
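  10. A data storage apparatus, comprising:
    a splitting unit, configured to divide a to-be-saved data block into N to-be-saved sub-blocks of equal size, wherein the N to-be-saved sub-blocks correspond to N location identifiers, each to-be-saved sub-block corresponds to one location identifier, and N is a positive integer greater than 1;
    a selection unit, configured to select one of at least two saved comparison data blocks as a reference data block, wherein the reference data block comprises N reference sub-blocks, the N reference sub-blocks correspond to the N location identifiers, and each reference sub-block corresponds to one location identifier;
    a comparison unit, configured to compare the to-be-saved sub-block corresponding to the i-th location identifier with the reference sub-block corresponding to the i-th location identifier, and determine a first sub-block among the N to-be-saved sub-blocks, wherein i is a positive integer that increments from 1 to N, and the first sub-block is a to-be-saved sub-block that differs from the reference sub-block it is compared with;
    the selection unit being further configured to select a representative sub-block for the first sub-block;
    a calculation unit, configured to perform an XOR operation on the data of the first sub-block and the data of the representative sub-block; and
    a compression unit, configured to compress the result of the XOR operation with run-length encoding, and save the compressed result and the location information of the first sub-block.
  11. The apparatus according to claim 10, further comprising a deduplication unit, wherein the comparison unit is further configured to determine a second sub-block among the N to-be-saved sub-blocks, the second sub-block being a to-be-saved sub-block that is identical to the reference sub-block it is compared with; and the deduplication unit is configured to perform a deduplication operation on the second sub-block.
  12. The apparatus according to claim 10 or 11, wherein the comparison unit is specifically configured to compare the fingerprint of the to-be-saved sub-block corresponding to the i-th location identifier with the fingerprint of the reference sub-block corresponding to the i-th location identifier, wherein identical fingerprints indicate that the two are the same, and different fingerprints indicate that the two differ.
  13. The apparatus according to any one of claims 10 to 12, wherein the selection unit being configured to select one of at least two saved comparison data blocks as a reference data block comprises: the selection unit is configured to select, from the at least two comparison data blocks, the comparison data block that has the most sub-blocks identical to those of the to-be-saved data block as the reference data block.
  14. The apparatus according to any one of claims 10 to 13, wherein the selection unit being configured to select a representative sub-block for the first sub-block comprises: the selection unit is configured to select, for each first sub-block, the reference sub-block corresponding to the same location identifier as the representative sub-block of that first sub-block.
  15. The apparatus according to any one of claims 10 to 13, wherein the selection unit being configured to select a representative sub-block for the first sub-block comprises: the selection unit is configured to XOR, by means of the XOR unit, the first sub-block with each of at least two saved sub-blocks and, based on the run-length encoding compression ratio of the results of the XOR operations, select the saved sub-block whose XOR result with the first sub-block has the smallest run-length encoding compression ratio as the representative sub-block.
  16. The apparatus according to any one of claims 10 to 13, wherein the selection unit being configured to select a representative sub-block for the first sub-block comprises: the selection unit is configured to XOR, by means of the XOR unit, the first sub-blocks pairwise and, based on the run-length encoding compression ratio of the results of the XOR operations, select the first sub-block whose XOR results with the other first sub-blocks have the smallest run-length encoding compression ratio as the representative sub-block.
  17. The apparatus according to any one of claims 10 to 16, wherein the selection unit is further configured to:
    extract a sample from the to-be-saved data; determine, for different to-be-saved sub-block sizes, the storage compression ratio, save speed, and read speed of the sample; and determine N from the sub-block size that has the smallest storage compression ratio among those whose save speed and read speed satisfy a constraint, wherein the constraint is: the save speed is greater than or equal to a preset first threshold, and the read speed is greater than or equal to a preset second threshold.
  18. The apparatus according to claim 12, wherein the comparison unit is further configured to save the fingerprint of the first sub-block.
  19. A computer-readable medium, comprising computer-executable instructions, wherein when a processor of a computer executes the computer-executable instructions, the computer performs the method according to any one of claims 1 to 9.
  20. A computing device, comprising: a processor, a memory, a bus, and a communication interface;
    wherein the memory is configured to store execution instructions, the processor is connected to the memory through the bus, and when the computing device runs, the processor executes the execution instructions stored in the memory, so that the computing device performs the method according to any one of claims 1 to 9.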
PCT/CN2015/096696 2015-12-08 2015-12-08 Data storage method and apparatus WO2017096532A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201580056658.7A CN107046812B (zh) 2015-12-08 2015-12-08 Data storage method and apparatus
EP15910007.2A EP3376393B1 (en) 2015-12-08 2015-12-08 Data storage method and apparatus
PCT/CN2015/096696 WO2017096532A1 (zh) 2015-12-08 2015-12-08 Data storage method and apparatus
US16/002,585 US20180285014A1 (en) 2015-12-08 2018-06-07 Data storage method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/096696 WO2017096532A1 (zh) 2015-12-08 2015-12-08 Data storage method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/002,585 Continuation US20180285014A1 (en) 2015-12-08 2018-06-07 Data storage method and apparatus

Publications (1)

Publication Number Publication Date
WO2017096532A1 true WO2017096532A1 (zh) 2017-06-15

Family

ID=59012447

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/096696 WO2017096532A1 (zh) 2015-12-08 2015-12-08 Data storage method and apparatus

Country Status (4)

Country Link
US (1) US20180285014A1 (zh)
EP (1) EP3376393B1 (zh)
CN (1) CN107046812B (zh)
WO (1) WO2017096532A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10853257B1 (en) * 2016-06-24 2020-12-01 EMC IP Holding Company LLC Zero detection within sub-track compression domains
CN113641308A (zh) * 2021-08-12 2021-11-12 南京冰鉴信息科技有限公司 Incremental update method and apparatus for a compressed file index, and electronic device

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10963436B2 (en) 2018-10-31 2021-03-30 EMC IP Holding Company LLC Deduplicating data at sub-block granularity
CN110007855B (zh) * 2019-02-28 2020-04-28 华中科技大学 Hardware-supported 3D stacked NVM memory data compression method and system
US10963437B2 (en) * 2019-05-03 2021-03-30 EMC IP Holding Company, LLC System and method for data deduplication
US10733158B1 (en) 2019-05-03 2020-08-04 EMC IP Holding Company LLC System and method for hash-based entropy calculation
US10990565B2 (en) 2019-05-03 2021-04-27 EMC IP Holding Company, LLC System and method for average entropy calculation
US10817475B1 (en) 2019-05-03 2020-10-27 EMC IP Holding Company, LLC System and method for encoding-based deduplication
US11138154B2 (en) 2019-05-03 2021-10-05 EMC IP Holding Company, LLC System and method for offset-based deduplication
EP4111591A1 (en) * 2020-03-25 2023-01-04 Huawei Technologies Co., Ltd. Method and system of differential compression
WO2022089755A1 (en) * 2020-10-30 2022-05-05 Huawei Technologies Co., Ltd. Method and system for differential deduplication in untrusted storage
US20230049329A1 (en) * 2021-08-10 2023-02-16 Samsung Electronics Co., Ltd. Systems, methods, and apparatus for processing data at a storage device
CN116541828B (zh) * 2023-07-03 2023-09-22 北京双鑫汇在线科技有限公司 Intelligent management method for service information data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
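CN101039374A (zh) * 2006-03-14 2007-09-19 联想(北京)有限公司 Lossless image compression and image decompression method
CN101796492A (zh) * 2007-04-11 2010-08-04 Emc公司 Cluster storage using sub-segmenting
CN102592682A (zh) * 2012-02-20 2012-07-18 中国科学院声学研究所 Test data encoding and compression method
CN103955355A (zh) * 2013-03-18 2014-07-30 清华大学 Segmented parallel compression method and system for non-volatile processors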

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102218732B1 (ko) * 2014-01-23 2021-02-23 삼성전자주식회사 저장 장치 및 그것의 동작 방법

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10853257B1 (en) * 2016-06-24 2020-12-01 EMC IP Holding Company LLC Zero detection within sub-track compression domains
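CN113641308A (zh) * 2021-08-12 2021-11-12 南京冰鉴信息科技有限公司 Incremental update method and apparatus for a compressed file index, and electronic device
CN113641308B (zh) * 2021-08-12 2024-04-23 南京冰鉴信息科技有限公司 Incremental update method and apparatus for a compressed file index, and electronic device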

Also Published As

Publication number Publication date
EP3376393A4 (en) 2018-12-19
US20180285014A1 (en) 2018-10-04
EP3376393A1 (en) 2018-09-19
EP3376393B1 (en) 2021-02-17
CN107046812B (zh) 2021-02-12
CN107046812A (zh) 2017-08-15

Similar Documents

Publication Publication Date Title
WO2017096532A1 (zh) Data storage method and apparatus
US10303797B1 (en) Clustering files in deduplication systems
EP2256934B1 (en) Method and apparatus for content-aware and adaptive deduplication
US9678974B2 (en) Methods and apparatus for network efficient deduplication
US8751462B2 (en) Delta compression after identity deduplication
US8849772B1 (en) Data replication with delta compression
US8712963B1 (en) Method and apparatus for content-aware resizing of data chunks for replication
WO2012065408A1 (zh) Disaster recovery data backup method and system
US9690501B1 (en) Method and system for determining data profiles using block-based methodology
US20210397350A1 (en) Data Processing Method and Apparatus, and Computer-Readable Storage Medium
US20190138507A1 (en) Data Processing Method and System and Client
WO2014184857A1 (ja) Deduplication system and method
WO2014094479A1 (zh) Data deduplication method and apparatus
WO2017020576A1 (zh) Method and apparatus for file compaction in a key-value storage system
US10838923B1 (en) Poor deduplication identification
CN108415671B (zh) Data deduplication method and system for green cloud computing
WO2021082926A1 (zh) Data compression method and apparatus
US10915260B1 (en) Dual-mode deduplication based on backup history
EP3432168B1 (en) Metadata separated container format
Wu et al. A feature-based intelligent deduplication compression system with extreme resemblance detection
Vikraman et al. A study on various data de-duplication systems
WO2022206334A1 (zh) Data compression method and apparatus
Xue et al. A comprehensive study of present data deduplication
Majed et al. Cloud based industrial file handling and duplication removal using source based deduplication technique
US20240143212A1 (en) Inline snapshot deduplication

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15910007

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2015910007

Country of ref document: EP