WO2015042909A1 - Data processing method, system, and client - Google Patents

Data processing method, system, and client

Publication number
WO2015042909A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
data
client
storage node
fingerprint value
Prior art date
Application number
PCT/CN2013/084597
Other languages
English (en)
French (fr)
Inventor
黄岩
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP13894821.1A priority Critical patent/EP3015999A4/en
Priority to CN201380002196.1A priority patent/CN104823184B/zh
Priority to PCT/CN2013/084597 priority patent/WO2015042909A1/zh
Publication of WO2015042909A1 publication Critical patent/WO2015042909A1/zh
Priority to US15/011,074 priority patent/US10210186B2/en
Priority to US16/240,358 priority patent/US11163734B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0813 Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof

Definitions

  • Embodiments of the present invention relate to storage technology, and in particular, to a data processing method, system, and client.

Background

  • Deduplication, also known as smart compression or single-instance storage, is a data storage technology that automatically searches for duplicate data, keeps only one copy of identical data, and replaces the other duplicates with pointers to that single copy, thereby eliminating redundancy and reducing storage capacity requirements.
  • Deduplication technology is widely used in application environments such as backup and virtual desktops.
  • The data processing system is composed of multiple storage nodes. Each storage node has its own deduplication processing engine and storage medium, such as a hard disk.
  • The data is divided in the cache to obtain a plurality of data blocks, a fingerprint value is calculated for each data block, and a sampled part of the fingerprint values is sent to all physical nodes in the data processing system for query. The target physical node with the most duplicate fingerprint values is obtained from the query results, and all data block information in the data packet corresponding to the sampled metadata information is then sent to that target physical node for a duplicate data query.
  • With this approach, the amount of calculation grows as the number of physical nodes in the data processing system increases, which degrades the deduplication performance of the system.

Summary of the invention

  • Embodiments of the present invention provide a data processing method, system, and client, which improve deduplication performance.
  • An embodiment of the present invention provides a data processing method applied to a data processing system. The data processing system includes at least one client and multiple storage nodes, and each client is connected to each storage node in the data processing system. Each storage node corresponds to a first vector, and each client stores the first vectors corresponding to all storage nodes in the data processing system.
  • The method includes: at least one of the clients receives data, divides the data into a plurality of data blocks, and obtains a second fingerprint value of each data block; obtains a second vector corresponding to the received data, where the second vector represents a feature of the received data; compares the second vector with each first vector stored on the client receiving the data to determine a target storage node; and either sends the second fingerprint values corresponding to the plurality of data blocks to the target storage node for a duplicate data search, or loads the first fingerprint values corresponding to the data blocks stored in the target storage node to the client receiving the data for a duplicate data search.
  • In a first possible manner, the method further includes: obtaining the non-duplicate data blocks in the received data, and storing the obtained non-duplicate data blocks and the third fingerprint values corresponding to them in a cache of the client receiving the data; when the non-duplicate data blocks stored in that cache satisfy a preset storage condition, obtaining a third vector of the non-duplicate data blocks in the cache, where the third vector represents a feature of all non-duplicate data blocks in the cache; and comparing the third vector with each first vector stored on the client receiving the data to determine the storage node that is to store the non-duplicate data blocks in the cache and the third fingerprint values corresponding to them.
  • In a second possible manner, the value of each bit of each second fingerprint value is a feature word, and obtaining the second vector corresponding to the received data includes: extracting N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and, among all the extracted feature words, adding the feature words that have the same position in the second fingerprint values to obtain N values, where the N values constitute the second vector corresponding to the received data.
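  • In a third possible manner, comparing the second vector with each first vector stored on the client receiving the data to determine the target storage node includes: determining the positions of the second vector and the first vectors in the same multi-dimensional space, comparing the second vector with the first vectors in that space, and determining at least one first vector closest to the second vector, or at least one first vector having the smallest cosine value with the second vector; the storage node corresponding to the at least one first vector is the target storage node.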
  • An embodiment of the present invention provides a client, where the client exists in a data processing system that further includes multiple storage nodes. The client is connected to each storage node in the data processing system, each storage node corresponds to a first vector, and the client stores the first vectors corresponding to all storage nodes in the data processing system.
  • The client includes: a receiving unit, configured to receive data, divide the data into a plurality of data blocks, and obtain a second fingerprint value of each data block; a second vector obtaining unit, configured to obtain a second vector corresponding to the received data, where the second vector represents a feature of the received data; and a processing unit, configured to compare the second vector with each first vector stored on the client, determine a target storage node, and either send the second fingerprint values corresponding to the multiple data blocks to the target storage node for a duplicate data search, or load the first fingerprint values corresponding to the data blocks stored in the target storage node to the client for a duplicate data search.
  • In a first possible implementation manner, the client further includes a storage unit, configured to: obtain the non-duplicate data blocks in the received data, and store the obtained non-duplicate data blocks and the third fingerprint values corresponding to them in a cache of the client; when the non-duplicate data blocks stored in the cache of the client satisfy a preset storage condition, obtain a third vector of the non-duplicate data blocks in the cache, where the third vector represents a feature of all non-duplicate data blocks in the cache; and compare the third vector with each first vector stored on the client to determine the storage node that is to store the non-duplicate data blocks in the cache and the third fingerprint values corresponding to them.
  • In a second possible manner, the value of each bit of each second fingerprint value is a feature word, and the second vector obtaining unit is specifically configured to: extract N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and, among all the extracted feature words, add the feature words that have the same position in the second fingerprint values to obtain N values, where the N values constitute the second vector corresponding to the received data.
  • In a third possible manner, the processing unit is specifically configured to: determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors in that space, and determine at least one first vector closest to the second vector, or at least one first vector having the smallest cosine value with the second vector; the storage node corresponding to the at least one first vector is the target storage node.
  • An embodiment of the present invention provides a data processing system that includes a plurality of storage nodes and the foregoing client. Each storage node corresponds to a first vector, each client stores the first vector corresponding to each storage node in the data processing system, and each client is connected to each storage node in the data processing system.
  • An embodiment of the present invention further provides a client that includes a processor, a memory, a communication interface, and a bus.
  • The processor, the communication interface, and the memory communicate with each other through the bus; the communication interface is configured to receive and transmit data; and the memory is configured to store a program.
  • The processor is configured to execute the program in the memory to perform the method of any of the first aspects described above.
  • In the embodiments of the present invention, the target storage node is determined by comparing the second vector of the received data with the first vectors, stored on the client receiving the data, that correspond to all the storage nodes. It is no longer necessary to sample a part of the fingerprint values from the received data, send them to all storage nodes in the data processing system for query, and wait for feedback from the storage nodes in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth consumption and delay.
  • FIG. 1-A is a schematic structural diagram of a data processing system according to an embodiment of the present invention;
  • FIG. 1-B is a schematic structural diagram of another data processing system according to an embodiment of the present invention;
  • FIG. 2-A is a flowchart of a data processing method in a data processing system according to an embodiment of the present invention;
  • FIG. 2-B is a flowchart of a data processing method in another data processing system according to an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of a second vector calculation method provided by an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of comparing a second vector and a first vector according to an embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of a client provided by an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a client according to an embodiment of the present invention.

Detailed description

  • An embodiment of the present invention provides a data processing system that includes at least one client and multiple storage nodes. The clients and storage nodes may be deployed in multiple manners, and the embodiments of the present invention provide two deployment modes.
  • Mode 1, as shown in FIG. 1-A: each client is connected to the storage nodes through a network, each storage node in the data processing system corresponds to a first vector, and each client stores the first vectors corresponding to all storage nodes in the data processing system. The client can be deployed on the user side as a software or hardware device. A system consisting of the storage nodes can be called a cluster storage system, and a system composed of the storage nodes and the clients can be called a data processing system.
  • Mode 2: the client may be integrated as a hardware device on a storage node, or deployed as a software function module on a storage node. A cluster storage system with integrated clients is also called a data processing system.
  • The client receives the data sent by the user and processes it.
  • A data processing system 10 provided in an embodiment of the present invention includes at least one client 101, 102, ..., 10n and a plurality of storage nodes 111, 112, ..., 11n. Each client receives data sent by a user through an interface, which may be a standard protocol interface such as a Network File System (NFS) protocol interface. Each storage node corresponds to a first vector, and the first vector represents a feature of the data stored in the corresponding storage node.
  • Each client stores the first vectors corresponding to all the storage nodes in the data processing system.
  • In this embodiment, the fingerprint value corresponding to data stored in a storage node in the data processing system is referred to as a first fingerprint value. Each client is connected to each storage node in the data processing system, for example through a network connection or by other means.
  • Data blocks and the fingerprint values corresponding to those data blocks are stored on each storage node in the data processing system.
  • The first vector corresponding to each storage node may be set manually during initialization, and the first vectors corresponding to the storage nodes may be uniformly distributed in the multi-dimensional space. How the first vectors are allocated may be determined by the user at initialization according to the actual situation; this is not limited in the embodiments of the present invention.
  • The client in the data processing system may be a separate entity independent of the storage nodes, may be deployed as a software module on a storage node, or may be deployed on other hardware and connected to the storage nodes in the data processing system through the network.
  • FIG. 2-A and FIG. 2-B are flowcharts of a data processing method in a data processing system according to an embodiment of the present invention. The method is described below for one client in the data processing system (hereinafter referred to as the "first client"), and may include:
  • Step 201: Receive data, divide the data into multiple data blocks, and obtain a fingerprint value of each data block. In this embodiment, a fingerprint value obtained from data received by a client is referred to as a second fingerprint value.
  • The second fingerprint value corresponding to a data block represents a feature of that data block. There are various methods in the prior art for obtaining the second fingerprint value; for example, the hash value of the data block is calculated and used as the fingerprint value of the data block.
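Step 201 can be sketched as follows. This is a minimal illustration, not the patent's implementation: fixed-size chunking, the 4096-byte block size, the use of SHA-1 (chosen only because it yields a 160-bit value, matching the 160-bit example used later in the text), and all function names are assumptions.

```python
import hashlib


def split_into_blocks(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Divide data into fixed-size blocks (chunking scheme is an assumption)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block: bytes) -> int:
    """Return the hash of a block as a 160-bit integer (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(block).digest(), "big")


def second_fingerprints(data: bytes, block_size: int = 4096) -> list[int]:
    """Compute one second fingerprint value per data block of the received data."""
    return [fingerprint(block) for block in split_into_blocks(data, block_size)]
```

Identical blocks produce identical fingerprint values, which is the property the duplicate data search relies on.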
  • Each bit of each second fingerprint value is treated as a feature word, and the second vector of the received data may be obtained as follows. N feature words are extracted from each second fingerprint value, where N is an integer greater than or equal to 1. Among all the extracted feature words, the feature words at the same position in the second fingerprint values are added to obtain N values, and the N values form an N-dimensional array that constitutes the second vector corresponding to the received data.
  • Alternatively, with each bit of each second fingerprint value still treated as a feature word, the second vector may be obtained as follows. N feature words are extracted from each second fingerprint value, where N is an integer greater than or equal to 1. The extracted feature words whose value is 0 are first converted to -1, and the feature words at the same position in the second fingerprint values are then added to obtain N values, which form an N-dimensional array that constitutes the second vector corresponding to the received data.
  • Which feature words are extracted can be chosen by the user according to the actual situation and needs. For example, if each second fingerprint value is 160 bits, the 64 feature words in the lowest 64 bits of the second fingerprint value can be extracted, or the 64 feature words in the highest 64 bits; all 160 feature words of the second fingerprint value may also be used.
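The two ways of building the second vector described above can be sketched as follows. The function names, the choice of the lowest N bits, and the convention that index 0 is the least significant bit are assumptions made for illustration; the text leaves the selection of feature words to the user.

```python
def feature_words(fp: int, n: int = 64) -> list[int]:
    """Extract the lowest n bits of a fingerprint; index 0 is the least significant bit."""
    return [(fp >> i) & 1 for i in range(n)]


def second_vector(fps: list[int], n: int = 64, signed: bool = False) -> list[int]:
    """Add feature words position by position across all fingerprints.

    With signed=True, feature words equal to 0 are converted to -1 before
    they are added (the alternative method described in the text).
    """
    vec = [0] * n
    for fp in fps:
        for i, bit in enumerate(feature_words(fp, n)):
            if signed and bit == 0:
                bit = -1
            vec[i] += bit
    return vec
```

For example, with the two 3-bit fingerprints `0b101` and `0b011`, `second_vector([0b101, 0b011], n=3)` gives `[2, 1, 1]`, and the signed variant gives `[2, 0, 0]`.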
  • Step 203: Compare the second vector with each first vector stored on the first client, determine a target storage node, and either send the second fingerprint values corresponding to the multiple data blocks to the target storage node for a duplicate data search, or load the first fingerprint values corresponding to the data blocks stored in the target storage node to the first client for a duplicate data search.
  • It should be noted that one or more second vectors may be obtained for the received data. If the received data is divided into several parts, a second vector is obtained for each part, so multiple parts yield multiple second vectors.
  • When multiple second vectors are obtained, each second vector is handled in the same way as a single second vector. When the target storage node is determined, however, the multiple second vectors determine multiple target storage nodes, with each second vector corresponding to one target storage node. Sending the second fingerprint values corresponding to the data blocks to the target storage node for a duplicate data search then means sending the second fingerprint values of the part of the data corresponding to each second vector to the target storage node of that second vector.
  • The second vector may be compared with the first vectors in step 203 using either of the following methods.
  • Method 1: determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors in that space, and determine at least one first vector closest to the second vector; the storage node corresponding to that first vector is the target storage node.
  • Method 2: determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors, and determine at least one first vector having the smallest cosine value with the second vector; the storage node corresponding to that first vector is the target storage node.
  • One or more first vectors closest to the second vector may be determined; the user presets the number according to actual conditions. For example, if the two first vectors closest to the second vector are determined, the storage nodes corresponding to those two first vectors are the target storage nodes.
  • The dimensions of the second vector and the first vectors may be the same or different. If they differ, the shorter vector is zero-padded so that the first vectors and the second vector can be positioned in the same multi-dimensional space and compared.
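Method 1 with zero-padding can be sketched as follows. This is an illustration under stated assumptions: Euclidean distance is used as the notion of "closest", the first vectors are held in a dictionary keyed by a node name, and the names `pad`, `euclidean`, and `target_nodes` are hypothetical.

```python
import math


def pad(vec: list[float], dim: int) -> list[float]:
    """Zero-pad a vector up to the given dimension."""
    return list(vec) + [0] * (dim - len(vec))


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance after padding both vectors to a common dimension."""
    dim = max(len(a), len(b))
    a, b = pad(a, dim), pad(b, dim)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def target_nodes(second_vec: list[float],
                 first_vectors: dict[str, list[float]],
                 k: int = 1) -> list[str]:
    """Return the k storage nodes whose first vectors are closest to second_vec."""
    ranked = sorted(first_vectors,
                    key=lambda node: euclidean(second_vec, first_vectors[node]))
    return ranked[:k]
```

Because every first vector is already stored on the client, this selection runs locally and needs no round trip to the storage nodes; `k` plays the role of the user-preset number of first vectors to determine.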
  • The method in this embodiment may further include: every predetermined period, each storage node updates its corresponding first vector according to the data stored in the storage node, so that the first vector represents a feature of the data stored in the storage node, and notifies the clients of the updated first vector; each client receives the update message for the first vector sent by the storage node.
  • The first vector may be updated using the same method as that used to calculate the second vector of the received data.
  • In this embodiment, the second vector is compared with the first vectors, stored on the client receiving the data, that correspond to all the storage nodes in the data processing system, and the target storage node is determined by this vector comparison. The data stored on the target storage node is considered highly similar to the received data, and the received data is compared with the data in the target storage node; the target storage node is therefore also referred to as a similar storage node.
  • The second vector reflects the feature of the received data, and the first vector corresponding to a storage node reflects the feature of the data stored by that storage node. Comparing the second vector with the first vectors therefore compares the feature of the received data with the feature of the data stored on each storage node, which yields the storage node whose first vector is closest to the feature of the received data; that storage node is used as the similar storage node.
  • Because the target storage node is determined by comparing the second vector of the received data with the first vectors corresponding to all the storage nodes, it is no longer necessary to sample a part of the fingerprint values from the received data and send them to all storage nodes in the data processing system for query. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth consumption and delay.
  • In the foregoing embodiment, a second vector corresponding to the received data is obtained, where the second vector represents a feature of the received data as a whole, and each storage node is preset with a corresponding first vector at initialization.
  • After the duplicate data search, the non-duplicate data blocks that need to be stored in the data processing system are obtained and stored. The embodiment of the present invention provides two modes for this.
  • Mode A, as shown in FIG. 2-A: the embodiment of the present invention may further include:
  • Step 204A: Obtain the non-duplicate data blocks in the received data, and store the obtained non-duplicate data blocks and the third fingerprint values corresponding to them in a cache of the first client.
  • Step 205A: When the non-duplicate data blocks stored in the cache of the first client satisfy a preset storage condition, obtain a third vector of the non-duplicate data blocks in the cache, where the third vector represents a feature of all non-duplicate data blocks in the cache.
  • Step 206A: Compare the third vector with each first vector stored on the first client, and determine the storage node that is to store the non-duplicate data blocks in the cache and the third fingerprint values corresponding to them.
  • The method for determining that storage node is the same as the method for determining the target storage node described above.
  • As with the second vector, one or more third vectors may be obtained for the non-duplicate data blocks in the cache: all the non-duplicate data blocks in the cache may correspond to one third vector, or they may be divided into multiple parts, with a corresponding third vector determined for each part. For each part, the storage node that stores its data blocks is then determined according to the method provided by the embodiment of the present invention.
  • The preset storage condition may be, for example, that the data stored in the cache reaches the size of a storage stripe preset in the hard disk, or the size of a storage unit in the hard disk. The preset storage condition is not limited in the embodiments of the present invention.
  • Each storage node is assigned a corresponding first vector during initialization. Because the first vector needs to reflect the feature of the data stored on the corresponding storage node, a third vector reflecting the feature of the non-duplicate data is obtained and compared with all the first vectors to determine the storage node that stores the non-duplicate data.
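Steps 204A to 206A can be sketched as a small client-side cache. This is only an illustration: the class name `NonDuplicateCache`, the byte-count flush threshold standing in for the "preset storage condition", the unsigned bit-sum third vector, the Euclidean comparison, and the assumption that all first vectors have dimension `n` are choices made here, not details given in the text.

```python
import math


class NonDuplicateCache:
    """Client-side cache of non-duplicate blocks (hypothetical helper)."""

    def __init__(self, first_vectors: dict[str, list[float]],
                 flush_bytes: int, n: int = 64):
        self.first_vectors = first_vectors  # node name -> first vector (dimension n assumed)
        self.flush_bytes = flush_bytes      # stands in for the preset storage condition
        self.n = n                          # number of feature words per fingerprint
        self.entries: list[tuple[int, bytes]] = []  # (third fingerprint, block)
        self.size = 0

    def add(self, fp: int, block: bytes):
        """Cache a block; return (node, entries) once the flush condition is met, else None."""
        self.entries.append((fp, block))
        self.size += len(block)
        if self.size < self.flush_bytes:
            return None
        node = self._choose_node()
        flushed, self.entries, self.size = self.entries, [], 0
        return node, flushed

    def _third_vector(self) -> list[int]:
        # Sum the lowest n bits of every cached fingerprint, position by position.
        return [sum((fp >> i) & 1 for fp, _ in self.entries) for i in range(self.n)]

    def _choose_node(self) -> str:
        third = self._third_vector()
        return min(self.first_vectors,
                   key=lambda node: math.dist(third, self.first_vectors[node]))
```

The caller is expected to write the returned blocks and fingerprints to the returned node; that transfer is outside the scope of this sketch.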
  • Mode B: Because the data that a client receives at a given time is usually continuous, the data is highly similar to itself, so the non-duplicate data found in the data received at a given time can be stored directly in the target storage node on which the duplicate data search was performed. Therefore, the embodiment of the present invention may further include:
  • Step 204B: Obtain the non-duplicate data blocks in the received data, and store the non-duplicate data blocks in the target storage node.
  • Specifically, the non-duplicate fingerprint values that the target storage node found among the second fingerprint values are received from the target storage node, and the data blocks corresponding to those non-duplicate fingerprint values are regarded as non-duplicate data blocks. In this way, the non-duplicate data blocks in the received data are obtained.
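The duplicate data search that yields the non-duplicate data blocks can be sketched as a set lookup. The function name and signature are hypothetical, and treating a block repeated within the same batch as a duplicate of its first occurrence is an assumption of this sketch rather than something the text specifies.

```python
def find_non_duplicates(second_fps: list[int],
                        blocks: list[bytes],
                        first_fps: set[int]) -> list[tuple[int, bytes]]:
    """Return (fingerprint, block) pairs whose fingerprint is not already stored.

    first_fps holds the first fingerprint values of the target storage node.
    """
    seen = set(first_fps)  # copy so the caller's set is not modified
    result = []
    for fp, block in zip(second_fps, blocks):
        if fp not in seen:
            seen.add(fp)   # assumption: later repeats in the same batch count as duplicates
            result.append((fp, block))
    return result
```

The same lookup applies whether it runs on the target storage node or on the client after the first fingerprint values have been loaded to it.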
  • The following describes, with a specific example, how the second vector and the target storage node are obtained in this embodiment.
  • The feature words in the lowest 64 bits are extracted from the second fingerprint value corresponding to each data block.
  • The feature words at the same position in the second fingerprint values are added: the first feature word of the fingerprint value FW1 is added to the first feature words of FW2 to FWn to obtain the value A01, the second feature word of FW1 is added to the second feature words of FW2 to FWn to obtain the value A02, and so on in turn to obtain the values A03, A04, ..., A64.
  • The 64 values are combined into a 64-dimensional array that forms the second feature vector A corresponding to the received data. Referring to FIG. 4, the position of the second feature vector A in the 64-dimensional space is determined, and the distance between A and the first vector corresponding to each storage node is calculated.
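  • The distance between two multi-dimensional space vectors can be calculated as, for example, the Euclidean distance $\mathrm{dist}(A, B) = \sqrt{\sum_{i=1}^{64} (A_i - B_i)^2}$, where $B$ denotes a first vector and $A_i$, $B_i$ are the $i$-th components of the two vectors.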
  • The embodiment of the present invention determines the target storage node by comparing the second vector of the received data with the first vectors of all storage nodes stored on the client that receives the data. It is no longer necessary to sample the received data, send the samples to all storage nodes in the data processing system for query, and wait for feedback from the storage nodes in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth usage and latency.
  • An embodiment of the present invention provides a client for performing the data processing method described in the foregoing embodiment. The client exists in a data processing system that includes at least one client and a plurality of storage nodes. Each storage node corresponds to a first vector, each client stores the first vectors corresponding to all storage nodes in the data processing system, the fingerprint value corresponding to data stored in a storage node is a first fingerprint value, and each client is connected to each storage node in the data processing system.
  • The client includes:
  • the receiving unit 501, configured to receive data, divide the data into a plurality of data blocks, and obtain a second fingerprint value of each data block;
  • the second vector obtaining unit 502, configured to obtain a second vector corresponding to the received data, where the second vector represents a feature of the received data; and
  • the processing unit 503, configured to compare the second vector with each first vector stored on the client, determine a target storage node, and either send the second fingerprint values corresponding to the multiple data blocks to the target storage node for a duplicate data search, or load the first fingerprint values corresponding to the data blocks stored in the target storage node to the client for a duplicate data search.
  • The client may further include:
  • the storage unit 504, configured to: obtain the non-duplicate data blocks in the received data, and store the obtained non-duplicate data blocks and the third fingerprint values corresponding to them in a cache of the client; when the non-duplicate data blocks stored in the cache of the client satisfy a preset storage condition, obtain a third vector of the non-duplicate data blocks in the cache, where the third vector represents a feature of all non-duplicate data blocks in the cache; and compare the third vector with each first vector stored on the client to determine the storage node that is to store the non-duplicate data blocks in the cache and the third fingerprint values corresponding to them.
  • The embodiment of the present invention further provides another client, which has the same structure as the client described above except that the function of the storage unit 504 is different: the storage unit 504 is configured to obtain the non-duplicate data blocks in the received data and store the non-duplicate data blocks in the target storage node.
  • The processing unit 503 is specifically configured to determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors in that space, and determine at least one first vector closest to the second vector, or at least one first vector having the smallest cosine value with the second vector; the storage node corresponding to the at least one first vector is the target storage node.
  • the embodiment of the present invention further provides a data processing system.
  • The data processing system includes a plurality of storage nodes and the client described in the previous embodiment. Each storage node corresponds to one first vector, each client stores the first vectors corresponding to all storage nodes in the data processing system, the fingerprint values corresponding to the data stored in the storage nodes are first fingerprint values, and each client is connected to each storage node in the data processing system.
  • an embodiment of the present invention further provides a client 600, which includes a processor 61, a memory 62, a communication interface 63, and a communication bus 64.
  • the processor 61, the communication interface 63, and the memory 62 communicate with each other through the communication bus 64; the communication interface is configured to receive and transmit data;
  • the memory is for storing a program;
  • the memory 62 may include a high speed RAM memory, and may also include a non-volatile memory such as at least one disk storage;
  • the processor 61 is configured to execute the program in the memory, and perform the method provided by the foregoing method embodiment.
  • The embodiments of the present invention determine the target storage node by comparing the second vector of the received data with the first vectors, corresponding to all storage nodes, that are pre-stored on the client receiving the data. It is no longer necessary to sample some of the fingerprint values from the received data, send them to all storage nodes in the data processing system for querying, and wait for feedback from the storage nodes in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth usage and latency.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function.
  • In an actual implementation there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interface, device or unit, and may be electrical, mechanical or otherwise.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

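Embodiments of the present invention provide a data processing method, system, and client. The target storage node is determined by comparing a second vector of the received data with the first vectors, corresponding to all storage nodes, that are pre-stored on the client receiving the data. It is no longer necessary to sample some of the fingerprint values from the received data, send them to all storage nodes in the data processing system for querying, and wait for feedback from the storage nodes in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth usage and latency.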

Description

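Data Processing Method, System, and Client
Technical Field
Embodiments of the present invention relate to storage technologies, and in particular to a data processing method, system, and client.
Background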
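Data deduplication, also called intelligent compression or single-instance storage, is a storage technology that automatically searches for duplicate data, keeps only one unique copy of identical data, and replaces the other duplicate copies with pointers to that single copy, thereby eliminating redundant data and reducing the required storage capacity.
In the prior art, deduplication is widely used in application environments such as backup and virtual desktops. A data processing system consists of multiple storage nodes, each with its own deduplication engine and storage medium, for example a hard disk. When data needs to be written to a file, the data is divided in a cache into multiple data chunks and a fingerprint value is calculated for each data chunk. A portion of the fingerprint values is sampled from the fingerprints of the data chunks and sent to all physical nodes in the data processing system for querying. The target physical node with the most duplicate fingerprint values is obtained from the query results, and the information of all data chunks in the data group corresponding to the sampled metadata information is then sent to that target physical node for a duplicate-data query.
Through research, the inventor found that in prior-art cluster deduplication the sampled fingerprint values must be sent to all physical nodes for querying, which causes too many interactions between physical nodes during deduplication. When the data processing system contains many physical nodes, the amount of computation each physical node performs during deduplication grows with the number of physical nodes in the system, which degrades the system's deduplication performance.
Summary of the Invention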
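Embodiments of the present invention provide a data processing method, system, and client that improve deduplication performance.
In a first aspect, an embodiment of the present invention provides a data processing method. The method is applied to a data processing system that includes at least one client and a plurality of storage nodes. Each client is connected to each storage node in the data processing system, each storage node corresponds to one first vector, and each client stores the first vectors corresponding to all storage nodes in the data processing system. The method includes:
at least one client receiving data, dividing the data into a plurality of data chunks, and obtaining a second fingerprint value of each data chunk;
obtaining a second vector corresponding to the received data, the second vector representing a characteristic of the received data; and
comparing the second vector with each first vector stored on the client that received the data, determining a target storage node, and either sending the second fingerprint values corresponding to the plurality of data chunks to the target storage node for a duplicate-data lookup, or loading the first fingerprint values corresponding to the data chunks stored in the target storage node onto the client that received the data for a duplicate-data lookup.
With reference to the first aspect, in a first possible implementation the method further includes:
obtaining the non-duplicate data chunks in the received data, and storing them, together with their corresponding third fingerprint values, in a cache of the client that received the data;
when the non-duplicate data chunks stored in the cache of the client that received the data meet a preset storage condition, obtaining a third vector of the non-duplicate data chunks in the cache, the third vector representing the characteristics of all non-duplicate data chunks in the cache; and
comparing the third vector with each first vector stored on the client that received the data, and determining the storage node that is to store the non-duplicate data chunks in the cache and their corresponding third fingerprint values.
With reference to the first possible implementation of the first aspect, in a second possible implementation the value of each bit of each second fingerprint value is one feature word, and obtaining the second vector corresponding to the received data includes:
extracting N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and
among all the extracted feature words, adding together the feature words that occupy the same position in the second fingerprint values to obtain N values, the N values forming the second vector corresponding to the received data.
With reference to the first possible implementation of the first aspect, in a third possible implementation, comparing the second vector with each first vector stored on the client that received the data and determining the target storage node includes:
determining the positions of the second vector and the first vectors in the same multi-dimensional space, comparing the second vector with the first vectors in that space, and determining at least one first vector closest to the second vector, or at least one first vector whose included-angle cosine with the second vector is smallest, the storage node corresponding to the at least one first vector being the target storage node.
In a second aspect, an embodiment of the present invention provides a client. The client exists in a data processing system that further includes a plurality of storage nodes. The client is connected to each storage node in the data processing system, each storage node corresponds to one first vector, and the client stores the first vectors corresponding to all storage nodes in the data processing system. The client includes:
a receiving unit, configured to receive data, divide the data into a plurality of data chunks, and obtain a second fingerprint value of each data chunk;
a second vector obtaining unit, configured to obtain a second vector corresponding to the received data, the second vector representing a characteristic of the received data; and
a processing unit, configured to compare the second vector with each first vector stored on the client, determine a target storage node, and either send the second fingerprint values corresponding to the plurality of data chunks to the target storage node for a duplicate-data lookup, or load the first fingerprint values corresponding to the data chunks stored in the target storage node onto the client for a duplicate-data lookup.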
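With reference to the second aspect, in a first possible implementation the client further includes:
a storage unit, configured to: obtain the non-duplicate data chunks in the received data, and store them, together with their corresponding third fingerprint values, in a cache of the client; when the non-duplicate data chunks stored in the cache of the client meet a preset storage condition, obtain a third vector of the non-duplicate data chunks in the cache, the third vector representing the characteristics of all non-duplicate data chunks in the cache; and compare the third vector with each first vector stored on the client to determine the storage node that is to store the non-duplicate data chunks in the cache and their corresponding third fingerprint values.
With reference to the second aspect and the first possible implementation of the second aspect, in a second possible implementation the value of each bit of each second fingerprint value is one feature word, and the second vector obtaining unit is specifically configured to:
extract N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and
among all the extracted feature words, add together the feature words that occupy the same position in the second fingerprint values to obtain N values, the N values forming the second vector corresponding to the received data.
With reference to the second aspect and the first possible implementation of the second aspect, in a third possible implementation the processing unit is specifically configured to: determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors in that space, and determine at least one first vector closest to the second vector, or at least one first vector whose included-angle cosine with the second vector is smallest, the storage node corresponding to the at least one first vector being the target storage node.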
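In a third aspect, an embodiment of the present invention provides a data processing system. The data processing system includes a plurality of storage nodes and the client described above. Each storage node corresponds to one first vector, each client stores the first vector corresponding to each storage node in the data processing system, and each client is connected to each storage node in the data processing system.
In a fourth aspect, an embodiment of the present invention further provides a client, including a processor, a memory, a communication interface, and a bus.
The processor, the communication interface, and the memory communicate with one another through the bus. The communication interface is configured to receive and send data. The memory is configured to store a program.
The processor is configured to execute the program in the memory to perform the method according to any implementation of the first aspect.
The embodiments of the present invention determine the target storage node by comparing the second vector of the received data with the first vectors, corresponding to all storage nodes, that are pre-stored on the client receiving the data. It is no longer necessary to sample some of the fingerprint values from the received data, send them to all storage nodes in the data processing system for querying, and wait for feedback from the storage nodes in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth usage and latency.
Brief Description of the Drawings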
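To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1-A is a schematic architecture diagram of a data processing system according to an embodiment of the present invention;
FIG. 1-B is a schematic architecture diagram of another data processing system according to an embodiment of the present invention;
FIG. 2-A is a flowchart of a data processing method in a data processing system according to an embodiment of the present invention;
FIG. 2-B is a flowchart of another data processing method in a data processing system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a second vector calculation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of comparing a second vector with a first vector according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a client according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a client according to an embodiment of the present invention.
Detailed Description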
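To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a data processing system that includes at least one client and a plurality of storage nodes. The clients and storage nodes can be deployed in several ways, and the embodiments provide two. In deployment mode 1, shown in FIG. 1-A, each client is connected to the storage nodes through a network. Each storage node in the data processing system corresponds to one first vector, and each client stores the first vectors corresponding to all storage nodes in the data processing system. The client may be deployed on the user side as a software or hardware device. A system consisting of storage nodes may be called a cluster storage system, and a system consisting of storage nodes and clients may be called a data processing system.
Alternatively, in deployment mode 2, shown in FIG. 1-B, the client may be integrated into a storage node as a hardware apparatus, or deployed on a storage node as a software functional module. A cluster storage system with integrated clients may also be called a data processing system. The client processes the data after receiving it from the user. A data processing system 10 provided in an embodiment of the present invention includes at least one client 101, 102, ..., 10n and a plurality of storage nodes 111, 112, ..., 11n. Each client receives data sent by a user through an interface, which may be a standard protocol interface such as a Network File System (NFS) protocol interface. Each storage node corresponds to one first vector, which represents the characteristics of the data stored in that storage node. In this embodiment, each client stores the first vectors corresponding to all storage nodes in the data processing system. For ease of description, the fingerprint values corresponding to the data stored in the storage nodes of the data processing system are called first fingerprint values. Each client is connected to each storage node in the data processing system, for example through a network connection or a connection of another kind.
Each storage node in the data processing system stores data chunks and the fingerprint value corresponding to each data chunk.
The first vector corresponding to each storage node may be set manually at initialization. The first vectors corresponding to the storage nodes may be distributed evenly in the multi-dimensional space, but the user may decide the specific allocation at initialization according to the actual situation; the embodiments of the present invention impose no limitation on this.
In a specific application, a client in the data processing system may be a separate entity independent of the storage nodes, may be deployed in a storage node as a software module, or may be deployed on other hardware and connected to the storage nodes in the data processing system through a network.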
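FIG. 2-A and FIG. 2-B are flowcharts of a data processing method in a data processing system according to an embodiment of the present invention. Taking one client in the data processing system (hereinafter called the "first client") as the executing entity, and as shown in FIG. 2-A and FIG. 2-B, the method in this embodiment may include the following steps.
Step 201: Receive data, divide the data into a plurality of data chunks, and obtain a fingerprint value of each data chunk. The fingerprint values corresponding to the data chunks obtained by the division are second fingerprint values.
To distinguish them in the description from the first fingerprint values, which correspond to the data stored in the storage nodes, the fingerprint values obtained from the data received by the client are called second fingerprint values, and the first client in the data processing system is used as an example to describe the execution process of this embodiment. The second fingerprint value corresponding to a data chunk represents the characteristics of that data chunk. The prior art offers several methods of obtaining the second fingerprint values; for example, a hash value of the data chunk is calculated and used as the fingerprint value of that data chunk.
Step 202: Obtain a second vector corresponding to the received data, the second vector representing the characteristics of the received data.
The value of each bit of each second fingerprint value is one feature word, and the second vector of the received data may be obtained as follows:
Extract N feature words from each second fingerprint value, where N may be an integer greater than or equal to 1. Among all the extracted feature words, add together the feature words that occupy the same position in the second fingerprint values to obtain N values. The N values form an N-dimensional array that constitutes the second vector corresponding to the received data.
Alternatively, again treating the value of each bit of each second fingerprint value as one feature word, the second vector corresponding to the received data may be obtained as follows:
Extract N feature words from each second fingerprint value, where N may be an integer greater than or equal to 1. Among all the extracted feature words, first convert every feature word whose value is 0 to -1, and then add together the feature words that occupy the same position in the second fingerprint values to obtain N values. The N values form an N-dimensional array that constitutes the second vector corresponding to the received data. The number of feature words to extract can be chosen by the user according to the actual situation and needs. For example, if each second fingerprint value has 160 bits, the 64 feature words in the lowest 64 bits of the second fingerprint value may be extracted, the 64 feature words in the high-order bits may be extracted, or all 160 feature words of the second fingerprint value may be extracted.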
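The two ways of computing the second vector described in step 202 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `feature_vector`, the representation of a fingerprint value as a Python integer, and the choice of the lowest `n_bits` bits with position 0 being the least significant bit are all assumptions made here, not details fixed by the embodiment.

```python
def feature_vector(fingerprints, n_bits=64, signed=False):
    """Sum, position by position, the lowest n_bits feature words (bits)
    of every fingerprint value.

    fingerprints: iterable of fingerprint values held as integers (assumption).
    signed:       if True, a feature word equal to 0 is converted to -1
                  before summing (the second variant in the text).
    """
    if n_bits < 1:
        raise ValueError("N must be an integer greater than or equal to 1")
    vector = [0] * n_bits
    for fp in fingerprints:
        for pos in range(n_bits):
            word = (fp >> pos) & 1          # feature word at this bit position
            if signed and word == 0:
                word = -1
            vector[pos] += word
    return vector


if __name__ == "__main__":
    fps = [0b101, 0b011]
    print(feature_vector(fps, n_bits=3))               # plain sum per position
    print(feature_vector(fps, n_bits=3, signed=True))  # 0 converted to -1 first
```

With the signed variant, positions where 0 and 1 occur equally often sum to roughly zero, so the vector components are centred around zero instead of around half the number of data chunks.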
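Step 203: Compare the second vector with each first vector stored on the first client, determine a target storage node, and either send the second fingerprint values corresponding to the plurality of data chunks to the target storage node for a duplicate-data lookup, or load the first fingerprint values corresponding to the data chunks stored in the target storage node onto the first client for a duplicate-data lookup.
It should be noted that one or more second vectors may be obtained for the received data. If the received data is divided into several parts and one second vector is obtained for each part, multiple parts may yield multiple second vectors. When multiple second vectors are obtained, each of them is handled in the same way as a single second vector, except that multiple second vectors determine multiple target storage nodes, one target storage node per second vector. In that case, sending the second fingerprint values corresponding to the plurality of data chunks to the target storage node for a duplicate-data lookup in step 203 correspondingly means sending the second fingerprint values of the part of the data corresponding to each second vector to the corresponding target storage node for a duplicate-data lookup.
In step 203, the second vector may be compared with the first vectors as follows. Method 1: determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors in that space, and determine at least one first vector closest to the second vector; the corresponding storage node is the target storage node. Method 2: determine the positions of the second vector and the first vectors in a multi-dimensional space, compare the second vector with the first vectors, and determine the storage node corresponding to at least one first vector whose included-angle cosine with the second vector is smallest as the target storage node.
One or more first vectors closest to the second vector may be determined; the user presets the number to be determined according to the actual situation. For example, the 2 first vectors closest to the second vector are determined, and the storage nodes corresponding to these 2 first vectors are the target storage nodes.
In the embodiments of the present invention, the second vector and the first vectors may have the same or different dimensions. If the dimensions differ, they must be made equal by zero-padding so that the positions of the first vectors and the second vector can be determined in the same multi-dimensional space and the two vectors can be compared.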
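Method 1 of step 203, together with the zero-padding rule and the option of selecting more than one target storage node, might look like the following sketch. It assumes Euclidean distance as the distance measure and a plain dictionary from node identifier to first vector; the names `pad`, `euclidean`, and `pick_target_nodes` are hypothetical. The cosine-based Method 2 is not shown.

```python
import math


def pad(vector, dim):
    """Zero-pad a vector to the given dimension."""
    return list(vector) + [0] * (dim - len(vector))


def euclidean(x, y):
    """Euclidean distance after padding both vectors to a common dimension."""
    dim = max(len(x), len(y))
    x, y = pad(x, dim), pad(y, dim)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pick_target_nodes(second_vector, first_vectors, k=1):
    """Return the identifiers of the k storage nodes whose first vectors
    are closest to the second vector.

    first_vectors: dict mapping node identifier -> first vector (assumption).
    k:             preset number of target storage nodes.
    """
    ranked = sorted(
        first_vectors,
        key=lambda node: euclidean(second_vector, first_vectors[node]),
    )
    return ranked[:k]


if __name__ == "__main__":
    first_vectors = {"node1": [0, 0], "node2": [3, 4, 0], "node3": [10, 10]}
    print(pick_target_nodes([3, 4], first_vectors, k=2))
```

Because the table of first vectors lives on the client, this selection needs no round trip to any storage node, which is the point the embodiment makes about avoiding repeated client-node interaction.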
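To make a first vector reflect the characteristics of the data stored in its storage node more accurately, the method in this embodiment may further include the following:
At a predetermined interval, a storage node updates its corresponding first vector according to the data stored in the storage node, the first vector representing the characteristics of the data stored in the storage node, and notifies the client of the updated first vector. The client receives the update message for the corresponding first vector sent by the storage node.
The specific update method may be the same as the method of calculating the second vector of the received data.
In the embodiments of the present invention, a second vector corresponding to the received data is obtained and compared with the first vectors, stored on the client that received the data, that correspond to all storage nodes in the data processing system. The target storage node is determined through this vector comparison. The data stored on the target storage node is considered to be relatively similar to the received data, and the data in the target storage node is used as the object against which the received data is compared; the target storage node is therefore also called a similar storage node. The second vector reflects the characteristics of the received data, and each first vector corresponds to one storage node and can reflect the characteristics of the data stored in that storage node. Comparing the second vector with the first vectors in the multi-dimensional space amounts to comparing the characteristics of the received data with the characteristics of the data already stored on the storage nodes. The storage node corresponding to the first vector closest to the characteristics of the received data can thus be obtained and used as the similar storage node.
The embodiments of the present invention determine the target storage node by comparing the second vector of the received data with the first vectors corresponding to all storage nodes. It is no longer necessary to sample some of the fingerprint values from the received data and send them to all storage nodes in the data processing system for querying in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth usage and latency.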
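In the embodiments of the present invention, a second vector corresponding to the received data is obtained, and the second vector represents the characteristics of the received data as a whole, while each storage node is preset with a corresponding first vector at initialization. After the duplicate-data lookup, the non-duplicate data chunks that need to be stored in the data processing system are obtained. The embodiments of the present invention provide two ways of storing the non-duplicate data chunks. Referring to FIG. 2-A, mode A:
A vector corresponding to the non-duplicate data chunks to be stored is calculated and then compared with the first vectors corresponding to the storage nodes to determine a second target storage node for storing the non-duplicate data. Accordingly, this embodiment may further include:
Step 204A: Obtain the non-duplicate data chunks in the received data, and store them, together with their corresponding third fingerprint values, in a cache of the first client.
Step 205A: When the non-duplicate data chunks stored in the cache of the first client meet a preset storage condition, obtain a third vector of the non-duplicate data chunks in the cache, the third vector representing the characteristics of all non-duplicate data chunks in the cache.
Step 206A: Compare the third vector with each first vector stored on the first client, and determine the storage node that is to store the non-duplicate data chunks in the cache and their corresponding third fingerprint values.
The method of determining the storage node that is to store the non-duplicate data chunks in the cache and their corresponding third fingerprint values is the same as the method of determining the target storage node described above.
It is worth noting that one or more third vectors may be obtained for the non-duplicate data chunks in the cache. When one is obtained, all non-duplicate data in the cache corresponds to a single third vector. Alternatively, the non-duplicate data chunks in the cache may be divided into multiple parts, with one third vector determined for each part; for each part, the storage node for storing its data chunks is then determined separately according to the method provided in the embodiments of the present invention.
The preset storage condition may be, for example, that the data stored in the cache reaches the size of one storage stripe preset on the hard disk, or the size of one storage unit on the hard disk. The embodiments of the present invention do not limit the preset storage condition.
In the embodiments of the present invention, each storage node is assigned a corresponding first vector at initialization. Because a first vector needs to reflect the characteristics of the data stored on its storage node, a third vector reflecting the characteristics of the non-duplicate data is obtained and compared with all first vectors to determine the storage node for storing the non-duplicate data.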
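Steps 204A to 206A of mode A can be illustrated with a small client-side cache. The sketch below is hypothetical throughout: the class name `NonDuplicateCache`, a byte-count threshold standing in for the preset storage condition, bit-wise summation for the third vector, and Euclidean distance for the comparison are all assumptions chosen for illustration, and every first vector is assumed to have `n_bits` components.

```python
import math


class NonDuplicateCache:
    """Client-side cache of non-duplicate data chunks (mode A sketch)."""

    def __init__(self, first_vectors, flush_bytes, n_bits=64):
        self.first_vectors = first_vectors  # node id -> first vector
        self.flush_bytes = flush_bytes      # stands in for the preset storage condition
        self.n_bits = n_bits
        self.chunks = []                    # (third fingerprint value, chunk bytes)
        self.size = 0

    def add(self, chunk, fingerprint):
        """Step 204A: cache a non-duplicate chunk with its third fingerprint value.
        Returns (node id, batch) once the storage condition is met, else None."""
        self.chunks.append((fingerprint, chunk))
        self.size += len(chunk)
        if self.size >= self.flush_bytes:
            return self.flush()
        return None

    def flush(self):
        """Steps 205A and 206A: compute the third vector and pick the node."""
        third_vector = [
            sum((fp >> pos) & 1 for fp, _ in self.chunks)
            for pos in range(self.n_bits)
        ]
        node = min(
            self.first_vectors,
            key=lambda n: math.dist(third_vector, self.first_vectors[n]),
        )
        batch, self.chunks, self.size = self.chunks, [], 0
        return node, batch


if __name__ == "__main__":
    cache = NonDuplicateCache({"a": [0, 0], "b": [2, 2]}, flush_bytes=4, n_bits=2)
    print(cache.add(b"xx", 0b11))   # condition not met yet
    print(cache.add(b"yy", 0b11))   # condition met: node and batch to store
```

Batching until a stripe-sized amount of data has accumulated means the routing decision is made once per batch rather than once per chunk.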
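Referring to FIG. 2-B, mode B: the data that a client receives in one operation usually has internal continuity, so the data is highly similar to itself. The non-duplicate data found in the data received in one operation can therefore be stored directly on the target storage node on which the duplicate-data lookup was performed. Accordingly, this embodiment may further include:
Step 204B: Obtain the non-duplicate data chunks in the received data, and store them in the target storage node.
In both mode A and mode B, the client receives from the target storage node the fingerprint values, among the second fingerprint values, for which no duplicate was found. The data chunks corresponding to these non-duplicate fingerprint values are regarded as non-duplicate data chunks, which is how the non-duplicate data chunks in the received data are finally obtained.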
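The duplicate-data lookup that yields the non-duplicate fingerprint values can be sketched as a set-membership test. The function name `find_non_duplicates` is hypothetical, and the in-batch check (reporting a fingerprint value only once even if it repeats within the received data) is an extra assumption added here that the text does not spell out.

```python
def find_non_duplicates(second_fingerprints, first_fingerprints):
    """Return the second fingerprint values that have no match among the
    first fingerprint values held by the target storage node.

    The original order is preserved; repeats inside the batch are reported
    only once (an assumption of this sketch).
    """
    known = set(first_fingerprints)
    reported = set()
    non_duplicates = []
    for fp in second_fingerprints:
        if fp not in known and fp not in reported:
            non_duplicates.append(fp)
            reported.add(fp)
    return non_duplicates


if __name__ == "__main__":
    print(find_non_duplicates([1, 2, 3, 2, 4], [2, 9]))
```

The same function fits both variants of step 203: it can run on the target storage node against its own fingerprint index, or on the client after the first fingerprint values have been loaded from the target storage node.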
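The following uses a specific implementation as an example to illustrate how the second vector and the target storage node are obtained in this embodiment. After the received data is divided into chunks, the feature words in the lowest 64 bits are extracted from the second fingerprint value corresponding to each data chunk. Referring to FIG. 3, among the feature words extracted from each second fingerprint value, those occupying the same position in the second fingerprint values are added together: the first feature word of fingerprint value FW1 is added to the first feature words of FW2 to FWn to obtain the value A01, the second feature word of FW1 is added to the second feature words of FW2 to FWn to obtain the value A02, and so on, yielding 64 values A01, A02, A03, A04, ..., A64. The 64 values form a 64-dimensional array that constitutes the second feature vector A corresponding to the received data. Referring to FIG. 4, the position of the second feature vector A is determined in a 64-dimensional space, and the distance between the second feature vector A and the first vector corresponding to each storage node is calculated. The distance between two vectors in a multi-dimensional space may be calculated as:
$$\mathrm{dist}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where X and Y denote the two vectors, and i = 1, 2, ..., n. The first vector B with the smallest distance to the second feature vector A, or the first vector B whose included-angle cosine with the second feature vector A is smallest, is determined, and the storage node corresponding to the first vector B is determined as the target storage node.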
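The worked example above can be followed end to end in a short sketch. Several choices here are assumptions for illustration only: fixed-size chunking, SHA-1 as the hash (it yields a 160-bit fingerprint value, matching the 160-bit case mentioned earlier), the helper names, and Euclidean distance for the comparison.

```python
import hashlib
import math


def chunk(data, size):
    """Split data into fixed-size chunks (the chunking rule is an assumption)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block):
    """Use the SHA-1 hash of a chunk as its 160-bit fingerprint value."""
    return int.from_bytes(hashlib.sha1(block).digest(), "big")


def second_vector(fingerprints, n_bits=64):
    """Sum the lowest n_bits feature words position by position (A01..A64)."""
    return [sum((fp >> pos) & 1 for fp in fingerprints) for pos in range(n_bits)]


def target_node(vector, first_vectors):
    """Pick the node whose first vector has the smallest Euclidean distance."""
    return min(first_vectors, key=lambda n: math.dist(vector, first_vectors[n]))


if __name__ == "__main__":
    data = b"abcdefgh" * 100
    blocks = chunk(data, 64)
    vec = second_vector([fingerprint(b) for b in blocks])
    # Two hypothetical first vectors: one equal to vec, one far away from it.
    first_vectors = {"n1": list(vec), "n2": [c + 100 for c in vec]}
    print(len(blocks), len(vec), target_node(vec, first_vectors))
```

Each component of the resulting vector lies between 0 and the number of chunks, so first vectors assigned at initialization would need to cover a comparable range for the distances to be meaningful.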
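The embodiments of the present invention determine the target storage node by comparing the second vector of the received data with the first vectors, corresponding to all storage nodes, that are pre-stored on the client receiving the data. It is no longer necessary to sample some of the fingerprint values from the received data, send them to all storage nodes in the data processing system for querying, and wait for feedback from the storage nodes in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth usage and latency.
Referring to FIG. 5, an embodiment of the present invention provides a client for performing the data processing method described in the foregoing embodiments. The client exists in a data processing system that includes at least one client and a plurality of storage nodes. Each storage node corresponds to one first vector, each client stores the first vectors corresponding to all storage nodes in the data processing system, the fingerprint values corresponding to the data stored in the storage nodes are first fingerprint values, and each client is connected to each storage node in the data processing system.
The client includes:
a receiving unit 501, configured to receive data, divide the data into a plurality of data chunks, and obtain a second fingerprint value of each data chunk;
a second vector obtaining unit 502, configured to obtain a second vector corresponding to the received data, the second vector representing a characteristic of the received data; and
a processing unit 503, configured to compare the second vector with each first vector stored on the client, determine a target storage node, and either send the second fingerprint values corresponding to the plurality of data chunks to the target storage node for a duplicate-data lookup, or load the first fingerprint values corresponding to the data chunks stored in the target storage node onto the client for a duplicate-data lookup.
The client may further include:
a storage unit 504, configured to: obtain the non-duplicate data chunks in the received data, and store them, together with their corresponding third fingerprint values, in a cache of the client; when the non-duplicate data chunks stored in the cache of the client meet a preset storage condition, obtain a third vector of the non-duplicate data chunks in the cache, the third vector representing the characteristics of all non-duplicate data chunks in the cache; and compare the third vector with each first vector stored on the client to determine the storage node that is to store the non-duplicate data chunks in the cache and their corresponding third fingerprint values.
An embodiment of the present invention further provides another client, which has the same structure as the client described above except that the storage unit 504 functions differently: the storage unit 504 is configured to obtain the non-duplicate data chunks in the received data and store them in the target storage node.
The processing unit 503 is specifically configured to determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors in that space, and determine at least one first vector closest to the second vector, or at least one first vector whose included-angle cosine with the second vector is smallest; the storage node corresponding to the at least one first vector is the target storage node.
The detailed working principle of the client provided in the embodiments of the present invention is the same as that of the foregoing method embodiments. Only the structure of the client is described here; for details, refer to the description in the foregoing method embodiments.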
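An embodiment of the present invention further provides a data processing system. Referring to FIG. 1-A and FIG. 1-B, the data processing system includes a plurality of storage nodes and the client described in the foregoing embodiments. Each storage node corresponds to one first vector, each client stores the first vectors corresponding to all storage nodes in the data processing system, the fingerprint values corresponding to the data stored in the storage nodes are first fingerprint values, and each client is connected to each storage node in the data processing system.
Referring to FIG. 6, an embodiment of the present invention further provides a client 600, including a processor 61, a memory 62, a communication interface 63, and a communication bus 64.
The processor 61, the communication interface 63, and the memory 62 communicate with one another through the communication bus 64. The communication interface is configured to receive and send data.
The memory is configured to store a program. The memory 62 may include a high-speed RAM and may also include a non-volatile memory, for example at least one disk storage.
The processor 61 is configured to execute the program in the memory to perform the method provided in the foregoing method embodiments.
The embodiments of the present invention determine the target storage node by comparing the second vector of the received data with the first vectors, corresponding to all storage nodes, that are pre-stored on the client receiving the data. It is no longer necessary to sample some of the fingerprint values from the received data, send them to all storage nodes in the data processing system for querying, and wait for feedback from the storage nodes in order to determine the target storage node. This avoids multiple interactions between the client and the storage nodes, improves deduplication performance, and reduces both network bandwidth usage and latency.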
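In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and in an actual implementation there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceived by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.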

Claims

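Claims
1. A data processing method, applied to a data processing system that includes at least one client and a plurality of storage nodes, each client being connected to each storage node in the data processing system, characterized in that each storage node corresponds to one first vector, each client stores the first vectors corresponding to all storage nodes in the data processing system, and the method includes: at least one client receiving data, dividing the data into a plurality of data chunks, and obtaining a second fingerprint value of each data chunk; obtaining a second vector corresponding to the received data, the second vector representing a characteristic of the received data; and comparing the second vector with each first vector stored on the client that received the data, determining a target storage node, and either sending the second fingerprint values corresponding to the plurality of data chunks to the target storage node for a duplicate-data lookup, or loading the first fingerprint values corresponding to the data chunks stored in the target storage node onto the client that received the data for a duplicate-data lookup.
2. The method according to claim 1, characterized by further including: obtaining the non-duplicate data chunks in the received data, and storing them, together with their corresponding third fingerprint values, in a cache of the client that received the data; when the non-duplicate data chunks stored in the cache of the client that received the data meet a preset storage condition, obtaining a third vector of the non-duplicate data chunks in the cache, the third vector representing the characteristics of all non-duplicate data chunks in the cache; and comparing the third vector with each first vector stored on the client that received the data, and determining the storage node that is to store the non-duplicate data chunks in the cache and their corresponding third fingerprint values.
3. The method according to claim 1, characterized by further including: obtaining the non-duplicate data chunks in the received data, and storing the non-duplicate data chunks in the target storage node.
4. The method according to claim 1, 2, or 3, characterized in that the value of each bit of each second fingerprint value is one feature word, and obtaining the second vector corresponding to the received data includes: extracting N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and, among all the extracted feature words, adding together the feature words that occupy the same position in the second fingerprint values to obtain N values, the N values forming the second vector corresponding to the received data.
5. The method according to claim 1, 2, or 3, characterized in that the value of each bit of each second fingerprint value is one feature word, and obtaining the second vector corresponding to the received data includes: extracting N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and, among all the extracted feature words, first converting every feature word whose value is 0 to -1, and then adding together the feature words that occupy the same position in the second fingerprint values to obtain N values, the N values forming the second vector corresponding to the received data.
6. The method according to claim 1, 2, or 3, characterized in that comparing the second vector with each first vector stored on the client that received the data and determining the target storage node includes: determining the positions of the second vector and the first vectors in the same multi-dimensional space, comparing the second vector with the first vectors in that space, and determining at least one first vector closest to the second vector, or at least one first vector whose included-angle cosine with the second vector is smallest, the storage node corresponding to the at least one first vector being the target storage node.
7. The method according to any one of claims 1 to 6, characterized by further including: receiving an update message, sent by a storage node, for the corresponding first vector, the update message being generated after the storage node, at a predetermined interval, updates its corresponding first vector according to the data stored in the storage node, the first vector representing the characteristics of the data stored in the storage node.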
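8. A client, the client existing in a data processing system that further includes a plurality of storage nodes, the client being connected to each storage node in the data processing system, characterized in that each storage node corresponds to one first vector, the client stores the first vectors corresponding to all storage nodes in the data processing system, and the client includes: a receiving unit, configured to receive data, divide the data into a plurality of data chunks, and obtain a second fingerprint value of each data chunk; a second vector obtaining unit, configured to obtain a second vector corresponding to the received data, the second vector representing a characteristic of the received data; and a processing unit, configured to compare the second vector with each first vector stored on the client, determine a target storage node, and either send the second fingerprint values corresponding to the plurality of data chunks to the target storage node for a duplicate-data lookup, or load the first fingerprint values corresponding to the data chunks stored in the target storage node onto the client for a duplicate-data lookup.
9. The client according to claim 8, characterized by further including: a storage unit, configured to: obtain the non-duplicate data chunks in the received data, and store them, together with their corresponding third fingerprint values, in a cache of the client; when the non-duplicate data chunks stored in the cache of the client meet a preset storage condition, obtain a third vector of the non-duplicate data chunks in the cache, the third vector representing the characteristics of all non-duplicate data chunks in the cache; and compare the third vector with each first vector stored on the client to determine the storage node that is to store the non-duplicate data chunks in the cache and their corresponding third fingerprint values.
10. The client according to claim 8, characterized by further including: a storage unit, configured to obtain the non-duplicate data chunks in the received data and store the non-duplicate data chunks in the target storage node.
11. The client according to any one of claims 8 to 10, characterized in that the value of each bit of each second fingerprint value is one feature word, and the second vector obtaining unit is specifically configured to: extract N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and, among all the extracted feature words, add together the feature words that occupy the same position in the second fingerprint values to obtain N values, the N values forming the second vector corresponding to the received data.
12. The client according to any one of claims 8 to 10, characterized in that the value of each bit of each second fingerprint value is one feature word, and the second vector obtaining unit is specifically configured to: extract N feature words from each second fingerprint value, where N is an integer greater than or equal to 1; and, among all the extracted feature words, first convert every feature word whose value is 0 to -1, and then add together the feature words that occupy the same position in the second fingerprint values to obtain N values, the N values forming the second vector corresponding to the received data.
13. The client according to any one of claims 8 to 10, characterized in that the processing unit is specifically configured to: determine the positions of the second vector and the first vectors in the same multi-dimensional space, compare the second vector with the first vectors in that space, and determine at least one first vector closest to the second vector, or at least one first vector whose included-angle cosine with the second vector is smallest, the storage node corresponding to the at least one first vector being the target storage node.
14. A data processing system, characterized in that the data processing system includes a plurality of storage nodes and the client according to any one of claims 8 to 10, each storage node corresponds to one first vector, each client stores the first vector corresponding to each storage node in the data processing system, and each client is connected to each storage node in the data processing system.
15. A client, characterized by including a processor, a memory, a communication interface, and a bus, wherein the processor, the communication interface, and the memory communicate with one another through the bus; the communication interface is configured to receive and send data; the memory is configured to store a program; and the processor is configured to execute the program in the memory to perform the method according to any one of claims 1 to 7.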
PCT/CN2013/084597 2013-09-29 2013-09-29 Data processing method, system, and client WO2015042909A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP13894821.1A EP3015999A4 (en) 2013-09-29 2013-09-29 METHOD OF PROCESSING DATA, SYSTEM AND CLIENT
CN201380002196.1A CN104823184B (zh) 2013-09-29 2013-09-29 Data processing method, system, and client
PCT/CN2013/084597 WO2015042909A1 (zh) 2013-09-29 2013-09-29 Data processing method, system, and client
US15/011,074 US10210186B2 (en) 2013-09-29 2016-01-29 Data processing method and system and client
US16/240,358 US11163734B2 (en) 2013-09-29 2019-01-04 Data processing method and system and client

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2013/084597 WO2015042909A1 (zh) 2013-09-29 2013-09-29 Data processing method, system, and client

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/011,074 Continuation US10210186B2 (en) 2013-09-29 2016-01-29 Data processing method and system and client

Publications (1)

Publication Number Publication Date
WO2015042909A1 true WO2015042909A1 (zh) 2015-04-02

Family

ID=52741845

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/084597 WO2015042909A1 (zh) 2013-09-29 2013-09-29 Data processing method, system, and client

Country Status (4)

Country Link
US (2) US10210186B2 (zh)
EP (1) EP3015999A4 (zh)
CN (1) CN104823184B (zh)
WO (1) WO2015042909A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
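CN107179878A (zh) * 2016-03-11 2017-09-19 伊姆西公司 Method and apparatus for data storage based on application optimization
CN111194055A (zh) * 2019-12-30 2020-05-22 广东博智林机器人有限公司 Data storage frequency processing method and apparatus, electronic device, and storage medium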
US20220269657A1 (en) * 2021-02-22 2022-08-25 International Business Machines Corporation Cache indexing using data addresses based on data fingerprints

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10764311B2 (en) * 2016-09-21 2020-09-01 Cequence Security, Inc. Unsupervised classification of web traffic users
WO2019028809A1 (zh) * 2017-08-11 2019-02-14 深圳配天智能技术研究院有限公司 Traffic data processing method and vehicle-mounted client
US10761945B2 (en) * 2018-06-19 2020-09-01 International Business Machines Corporation Dynamically directing data in a deduplicated backup system
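CN109063121B (zh) * 2018-08-01 2024-04-05 平安科技(深圳)有限公司 Data storage method and apparatus, computer device, and computer storage medium
CN112181869A (zh) * 2020-09-11 2021-01-05 中国银联股份有限公司 Information storage method and apparatus, server, and medium
CN112152937B (zh) * 2020-09-29 2022-08-19 锐捷网络股份有限公司 Packet deduplication method and apparatus, electronic device, and storage medium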

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
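CN102833298A (zh) * 2011-06-17 2012-12-19 英业达集团(天津)电子技术有限公司 Distributed data deduplication system and processing method thereof
CN103177111A (zh) * 2013-03-29 2013-06-26 西安理工大学 Data deduplication system and deduplication method thereof
CN103189867A (zh) * 2012-10-30 2013-07-03 华为技术有限公司 Duplicate data retrieval method and device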

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8438119B2 (en) * 2006-03-30 2013-05-07 Sap Ag Foundation layer for services based enterprise software architecture
US8412682B2 (en) * 2006-06-29 2013-04-02 Netapp, Inc. System and method for retrieving and using block fingerprints for data deduplication
US8321648B2 (en) * 2009-10-26 2012-11-27 Netapp, Inc Use of similarity hash to route data for improved deduplication in a storage server cluster
CN102479245B (zh) * 2010-11-30 2013-07-17 英业达集团(天津)电子技术有限公司 Data block segmentation method
US8458145B2 (en) * 2011-01-20 2013-06-04 Infinidat Ltd. System and method of storage optimization
US8745003B1 (en) * 2011-05-13 2014-06-03 Emc Corporation Synchronization of storage using comparisons of fingerprints of blocks
US8484170B2 (en) 2011-09-19 2013-07-09 International Business Machines Corporation Scalable deduplication system with small blocks
US8868520B1 (en) * 2012-03-01 2014-10-21 Netapp, Inc. System and method for removing overlapping ranges from a flat sorted data structure
US9164688B2 (en) 2012-07-03 2015-10-20 International Business Machines Corporation Sub-block partitioning for hash-based deduplication
US9195608B2 (en) * 2013-05-17 2015-11-24 International Business Machines Corporation Stored data analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
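CN102833298A (zh) * 2011-06-17 2012-12-19 英业达集团(天津)电子技术有限公司 Distributed data deduplication system and processing method thereof
CN103189867A (zh) * 2012-10-30 2013-07-03 华为技术有限公司 Duplicate data retrieval method and device
CN103177111A (zh) * 2013-03-29 2013-06-26 西安理工大学 Data deduplication system and deduplication method thereof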

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3015999A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
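CN107179878A (zh) * 2016-03-11 2017-09-19 伊姆西公司 Method and apparatus for data storage based on application optimization
CN111194055A (zh) * 2019-12-30 2020-05-22 广东博智林机器人有限公司 Data storage frequency processing method and apparatus, electronic device, and storage medium
CN111194055B (zh) * 2019-12-30 2022-09-09 广东博智林机器人有限公司 Data storage frequency processing method and apparatus, electronic device, and storage medium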
US20220269657A1 (en) * 2021-02-22 2022-08-25 International Business Machines Corporation Cache indexing using data addresses based on data fingerprints
US11625179B2 (en) * 2021-02-22 2023-04-11 International Business Machines Corporation Cache indexing using data addresses based on data fingerprints

Also Published As

Publication number Publication date
US20160147800A1 (en) 2016-05-26
US10210186B2 (en) 2019-02-19
US20190138507A1 (en) 2019-05-09
EP3015999A1 (en) 2016-05-04
CN104823184A (zh) 2015-08-05
CN104823184B (zh) 2016-11-09
EP3015999A4 (en) 2016-08-17
US11163734B2 (en) 2021-11-02

Similar Documents

Publication Publication Date Title
WO2015042909A1 (zh) Data processing method, system, and client
US10956370B2 (en) Techniques for improving storage space efficiency with variable compression size unit
US10303797B1 (en) Clustering files in deduplication systems
US9678688B2 (en) System and method for data deduplication for disk storage subsystems
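JP6537214B2 (ja) Deduplication method and storage device
CN110235098B (zh) Storage system access method and apparatus
EP3376393B1 (en) Data storage method and apparatus
WO2014067063A1 (zh) Duplicate data retrieval method and device
WO2017020576A1 (zh) Method and apparatus for file compaction in a key-value storage system
WO2017113124A1 (zh) Server and method for compressing data by a server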
US11573928B2 (en) Techniques for data deduplication
CN108984103B (zh) Method and device for deduplication
WO2013155417A2 (en) Data coreset compression
CN108540510B (zh) Cloud host creation method and apparatus, and cloud service system
US11106374B2 (en) Managing inline data de-duplication in storage systems
US10996898B2 (en) Storage system configured for efficient generation of capacity release estimates for deletion of datasets
CN110199270B (zh) Method and apparatus for managing storage devices in a storage system
US8918378B1 (en) Cloning using an extent-based architecture
US10929239B2 (en) Storage system with snapshot group merge functionality
US20220398220A1 (en) Systems and methods for physical capacity estimation of logical space units
US10762047B2 (en) Relocating compressed extents using file-system hole list
US10761762B2 (en) Relocating compressed extents using batch-hole list
WO2023050856A1 (zh) Data processing method and storage system
US11615063B2 (en) Similarity deduplication
US11068208B2 (en) Capacity reduction in a storage system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13894821

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2013894821

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE