CN106201771B - Data-storage system and data read-write method - Google Patents

Data-storage system and data read-write method

Info

Publication number
CN106201771B
Authority
CN
China
Prior art keywords
bucket
fingerprint
deduplication node
data block
fingerprint information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510226830.0A
Other languages
Chinese (zh)
Other versions
CN106201771A (en)
Inventor
蒋雄伟
吴锐
李勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201510226830.0A
Publication of CN106201771A
Application granted
Publication of CN106201771B
Current legal status: Active
Anticipated expiration

Abstract

The invention discloses a data storage system comprising a central node and deduplication nodes. The central node assigns each Bucket to a corresponding deduplication node according to a preset policy, creates a routing table from the correspondence between Buckets and deduplication nodes, and synchronizes the routing table to every deduplication node. Each deduplication node stores, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks that the fingerprint information represents. The system achieves global deduplicated storage management of raw data at the 100 PB scale and above and of fingerprint information at the 100 TB scale and above.

Description

Data-storage system and data read-write method
Technical field
The invention belongs to the field of Internet technology and specifically relates to a data storage method, a data storage system, a data read-write method, a client for reading and writing data, and a system for reading and writing data.
Background art
In recent years the volume of data that Internet companies need to back up and archive has grown explosively. For cost reasons, tape has long been the primary storage medium of backup and archiving systems and of virtual machine primary storage systems. Tape, however, demands a carefully controlled storage environment and has a short service life: data generally has to be copied onto new tape every 4-5 years. Once the number of tapes reaches tens of thousands or even hundreds of thousands, this re-copying work becomes a nightmare.
With the development of magnetic disks, capacity has reached 6 TB or even 8 TB, and the price per unit of capacity is gradually approaching that of tape. The advantage of disk over tape is that it supports random access, which makes data deduplication technology applicable. Combining disks with data deduplication can greatly reduce the cost of backup and archiving.
Existing commercial deduplication products on the market, such as the EMC DD990, the HP StoreOnce B6200 appliance, and the SEPATON DeltaStor software, are essentially single-machine designs with very limited scalability: the maximum usable capacity is 1.6 PB and the maximum throughput is 31 TB/h (8.8 GB/s). In neither capacity nor performance can they meet the storage needs of Internet companies.
The Institute of Computing Technology of the Chinese Academy of Sciences, in "A scalable distributed data deduplication system supporting massive data backup", addresses the shortcomings of the single-machine model in two respects, the scalability and the deduplication efficiency of the deduplication system. It proposes a distributed Bloom filter for routing data to deduplication nodes in a distributed deduplication system and a sampling-based fingerprint query mechanism to speed up fingerprint lookups, and implements the distributed deduplication system 3D-deduper. This is referred to below as scheme one.
EMC has also developed a cluster deduplication storage system on the basis of its single-node model. Its approach is to add several backup servers that chunk the data stream, compute the fingerprints of the data blocks, pack the blocks into super chunks, and route each super chunk according to some policy to a deduplication node for processing. This is referred to below as scheme two.
Strictly speaking, neither scheme is a distributed system; both are cluster systems. The basic idea of a cluster system is to balance task load across multiple reliable single nodes. The basic idea of a distributed system is to distribute data across multiple unreliable single nodes (when data is evenly distributed, task load is naturally balanced) and to ensure reliability through means such as multiple replicas or check codes.
In both schemes the fingerprint store of the cluster system is decentralized. Although measures are taken to route a previously seen data block and its fingerprint, as far as possible, to the deduplication node that handled it before, it is hard to avoid routing the block to a deduplication node that has never processed it, where it is misjudged as a new data block and stored again. Scheme one goes further and uses a sampling-based sparse index over fingerprints, which further increases the probability that a data block is misjudged. Even a misjudgment rate of only 2% is unacceptable for a system at the 100 PB scale or larger.
Summary of the invention
In view of this, the application provides a data storage method, a data storage system, a data read-write method, a client for reading and writing data, and a system for reading and writing data, which solve the technical problem that, in large-scale deduplication systems, decentralized fingerprint information leads to a high probability of misjudgment.
To solve the above technical problem, the application discloses a data storage method applied to a data storage system that includes a central node and deduplication nodes. The method comprises: the central node assigns each bucket (Bucket) to a corresponding deduplication node according to a preset policy; the central node creates a routing table from the correspondence between Buckets and deduplication nodes and synchronizes the routing table to every deduplication node; each deduplication node stores, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks that the fingerprint information represents.
Storing, by the deduplication node according to the routing table, the fingerprint information corresponding to each assigned Bucket and the data blocks that the fingerprint information represents includes: the deduplication node creates a corresponding container (Container) file for each assigned Bucket; the deduplication node saves the corresponding fingerprint information in each assigned Bucket and saves the data blocks that the fingerprint information represents in the Container file corresponding to that Bucket.
The deduplication node judges whether the size of the Container file exceeds a preset threshold; when it does, the deduplication node archives the Container file to a background server.
Assigning, by the central node according to the preset policy, each Bucket to a corresponding deduplication node includes: the central node assigns each Bucket to multiple corresponding deduplication nodes and determines, among them, one master node and at least one standby node.
The central node judges whether each deduplication node is available and whether a new deduplication node has been added. When it determines that a deduplication node is unavailable or that a new deduplication node has been added, the central node reassigns the Buckets, updates the routing table, and synchronizes it to every deduplication node; the deduplication nodes then migrate data according to the updated routing table.
Migrating data by the deduplication nodes according to the updated routing table includes: the master node initiates the data migration according to the updated routing table.
Reassigning the Buckets by the central node when a deduplication node is judged unavailable includes: when the master node is judged unavailable, the central node selects a new master node from the at least one standby node. Migrating data according to the updated routing table then includes: the newly selected master node initiates the data migration according to the updated routing table.
Each deduplication node includes a fingerprint store. The fingerprint store is a cuckoo hash table stored on a solid-state drive and holds, for each Bucket of the deduplication node, the corresponding fingerprint information and the storage information of the data blocks that the fingerprint information represents.
M cuckoo hash tables run concurrently on the solid-state drive, and N cuckoo hash functions are used at the same time, where M × N = 128.
32 cuckoo hash tables run concurrently on the solid-state drive, and 4-way cuckoo hash functions are used at the same time.
To solve the above technical problem, the application also discloses a data read-write method, comprising: cutting data into multiple data blocks and computing the fingerprint information of each data block; determining the Bucket corresponding to the fingerprint information of each data block; determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket; sending a fingerprint query request, which includes the fingerprint information of the data blocks, to the deduplication node corresponding to the Bucket; receiving from that deduplication node the fingerprint information that was not found; and uploading the not-found fingerprint information and the data blocks it represents to the deduplication node corresponding to the Bucket.
Determining the Bucket corresponding to the fingerprint information of each data block includes: performing a modulo operation on the fingerprint information with the total number of Buckets, and determining the Bucket corresponding to the fingerprint information from the result of the modulo operation.
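The modulo mapping from fingerprint to Bucket can be sketched as follows. This is a minimal illustration, not the patented implementation: the SHA-1 fingerprint, the Bucket total of 1024, and the names `fingerprint` and `bucket_for` are all assumptions, since the text specifies neither a hash algorithm nor a Bucket count.

```python
import hashlib

BUCKET_COUNT = 1024  # assumed total number of Buckets; the text gives no value


def fingerprint(block: bytes) -> bytes:
    """Compute a fingerprint for a data block (SHA-1 is an assumption)."""
    return hashlib.sha1(block).digest()


def bucket_for(fp: bytes, bucket_count: int = BUCKET_COUNT) -> int:
    """Map a fingerprint to a Bucket number by modulo over the Bucket total."""
    return int.from_bytes(fp, "big") % bucket_count


fp = fingerprint(b"example data block")
bucket_id = bucket_for(fp)
```

Because the mapping depends only on the fingerprint and the Bucket total, every client computes the same Bucket for the same data block, which is what allows duplicates to meet on the same deduplication node.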
The method further includes: when the not-found fingerprint information and the data blocks it represents have all been uploaded, uploading the mapping file of the data to the deduplication nodes. The mapping file includes the fingerprint information of each data block of the data, arranged in the order in which the data blocks were cut.
Uploading the mapping file of the data to the deduplication nodes includes: cutting the mapping file into multiple data blocks and computing the hash value of each; determining the Bucket corresponding to the hash value of each data block of the mapping file; determining from the routing table the deduplication node corresponding to that Bucket; and uploading each data block of the mapping file together with its hash value to that deduplication node.
Cutting the mapping file into multiple data blocks includes: cutting the header information of the mapping file as the first of the multiple data blocks. The header information includes information such as the total size of the mapping file and the total number of the data blocks.
The method further includes: obtaining the mapping file of the data from the deduplication nodes; obtaining each data block of the data from the deduplication nodes according to the fingerprint information of each data block recorded in the mapping file; and splicing the data together in the order in which the fingerprint information of the data blocks appears in the mapping file.
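The role of the mapping file on the read path can be sketched as follows. This is an illustrative model only: the dictionary `store` stands in for the deduplication nodes, SHA-1 is an assumed fingerprint function, and the helper names are hypothetical.

```python
import hashlib


def fp_of(block: bytes) -> str:
    """Fingerprint of a block (the hash choice is an assumption)."""
    return hashlib.sha1(block).hexdigest()


def build_mapping(blocks):
    """Mapping file body: fingerprints listed in the order the blocks were cut."""
    return [fp_of(b) for b in blocks]


def restore(mapping, fetch_block):
    """Fetch each block by fingerprint and splice in mapping-file order."""
    return b"".join(fetch_block(fp) for fp in mapping)


blocks = [b"aaaa", b"bbbb", b"aaaa"]   # the third block duplicates the first
store = {fp_of(b): b for b in blocks}  # stands in for the deduplication nodes
mapping = build_mapping(blocks)
data = restore(mapping, store.__getitem__)
```

The store keeps only two distinct blocks, while the mapping file keeps three entries, so the original byte sequence is still recoverable after deduplication.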
Obtaining the mapping file of the data from the deduplication nodes includes: obtaining each data block of the mapping file from the deduplication nodes according to the name of the mapping file and the data block sequence number; and splicing the data blocks of the mapping file into the mapping file of the data.
Determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket includes: when storing data for the first time, obtaining the routing table from the central node; and determining the deduplication node corresponding to the Bucket according to that routing table.
Determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket further includes: sending a request packet to the deduplication node corresponding to the Bucket; receiving a response packet returned by that deduplication node, the response packet including the version information of the node's routing table; and judging whether the version information in the response packet is the same as the version information of the routing table obtained from the central node. When the two versions are the same, the deduplication node corresponding to the Bucket is determined according to the routing table obtained from the central node. When they differ, the updated routing table is obtained from the central node, and the deduplication node corresponding to the Bucket is redetermined according to the updated routing table.
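The version check described above can be sketched from the client's side. The classes `RoutingTable`, `CentralNode`, and `Client` are hypothetical, and the network exchange is reduced to a function argument: `node_reported_version` stands for the version carried in the deduplication node's response packet.

```python
from dataclasses import dataclass


@dataclass
class RoutingTable:
    version: int
    bucket_to_node: dict  # Bucket number -> deduplication node id


class CentralNode:
    """Hypothetical stand-in for the central node."""

    def __init__(self, table):
        self.table = table

    def get_routing_table(self):
        return self.table


class Client:
    def __init__(self, central):
        self.central = central
        self.table = None

    def node_for(self, bucket, node_reported_version):
        # First store: fetch the routing table from the central node.
        if self.table is None:
            self.table = self.central.get_routing_table()
        # A version mismatch means the cached table is stale, so refresh it.
        if node_reported_version != self.table.version:
            self.table = self.central.get_routing_table()
        return self.table.bucket_to_node[bucket]


central = CentralNode(RoutingTable(1, {0: "A", 1: "B"}))
client = Client(central)
first = client.node_for(0, node_reported_version=1)
central.table = RoutingTable(2, {0: "C", 1: "B"})  # Bucket 0 was reassigned
second = client.node_for(0, node_reported_version=2)
```

Piggybacking the version on ordinary responses lets the client detect a stale table without polling the central node on every request.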
To solve the above technical problem, the application also discloses a data read-write method, comprising:
A central node sends a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes; a deduplication node receives a fingerprint query request from the client, the request including fingerprint information corresponding to a Bucket assigned to that deduplication node; the deduplication node queries the fingerprint information and returns to the client the fingerprint information that was not found; and the deduplication node receives the not-found fingerprint information and the data blocks it represents, uploaded by the client.
The method further includes: the deduplication node saves the not-found fingerprint information in the assigned Bucket, saves the data block in the Container file corresponding to that Bucket, and returns to the client a message that the data block was saved successfully.
Before the deduplication node returns the save-success message to the client, the method further includes: the deduplication node backs up the not-found fingerprint information and the data block it represents to the standby node.
The method further includes: the deduplication node saves the data blocks of the mapping file uploaded by the client and the corresponding hash values.
Saving the data blocks of the mapping file uploaded by the client and the corresponding hash values includes: saving the data block of the mapping file in the Container file corresponding to the corresponding Bucket; and saving, in the corresponding Bucket, the hash value of the data block of the mapping file and first storage information.
The first storage information includes: the name of the Container file that holds the data block of the mapping file, the offset of that data block within the Container file, and the size of that data block.
The method further includes: the deduplication node receives the client's request to obtain the data blocks of the mapping file and sends those data blocks to the client; the deduplication node receives the client's request to obtain the data block represented by each piece of fingerprint information in the mapping file and sends those data blocks to the client.
Sending to the client the data block represented by each piece of fingerprint information includes: the deduplication node determines, from the fingerprint information, second storage information of the data block, which includes the name of the Container file that holds the data block, the offset of the data block within the Container file, and the size of the data block; the deduplication node then judges from the name of the Container file whether the Container file has been archived to the background server. When the Container file has been archived, the deduplication node obtains the data block from the background server according to its offset and size and sends it to the client. When the Container file is still stored locally, the deduplication node obtains the data block locally according to its offset and size and sends it to the client.
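Locating a data block through its storage information (Container name, offset, size) can be sketched as below. The two dictionaries standing in for local storage and the background server, and the names `StorageInfo` and `read_block`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StorageInfo:
    container: str  # name of the Container file
    offset: int     # offset of the block inside the Container file
    size: int       # size of the block in bytes


def read_block(info, local, archived):
    """Return the block bytes from the archive if the Container was archived,
    otherwise from local storage. Both stores map Container name -> bytes."""
    source = archived if info.container in archived else local
    data = source[info.container]
    return data[info.offset:info.offset + info.size]


local = {"c1": b"helloworld"}     # Containers still on the deduplication node
archived = {"c0": b"foobarbaz"}   # Containers moved to the background server
index = {"fp-a": StorageInfo("c1", 5, 5), "fp-b": StorageInfo("c0", 3, 3)}
block_a = read_block(index["fp-a"], local, archived)
block_b = read_block(index["fp-b"], local, archived)
```

Since the offset and size are the same wherever the Container lives, archiving a Container does not require rewriting the fingerprint index.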
Sending the routing table from the central node to the client includes: when the client stores data for the first time, the central node receives a routing table request from the client and sends the routing table to the client.
Sending the routing table from the central node to the client further includes: the deduplication node receives a request packet from the client and sends the client a response packet that includes the version information of the routing table held by the deduplication node; when the version information of the routing table held by the client is inconsistent with that held by the deduplication node, the central node receives a routing table request from the client and sends the updated routing table to the client.
Querying the fingerprint information and returning the not-found fingerprint information to the client includes: the deduplication node judges through a Bloom filter whether the fingerprint information exists. When the Bloom filter indicates that the fingerprint information does not exist, the fingerprint information is determined to be not found. When the Bloom filter indicates that the fingerprint information exists, the deduplication node looks it up in the fingerprint store: if it is found there, the fingerprint information is determined to already exist; if it is not found there, it is determined to be not found.
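The two-stage lookup (Bloom filter first, fingerprint store second) can be sketched as follows. The filter size, the number of hash functions, the use of SHA-256 to derive bit positions, and the Python `set` standing in for the SSD-resident fingerprint store are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def unmatched(fingerprints, bloom, fingerprint_store):
    """Return the fingerprints the node does not already hold."""
    missing = []
    for fp in fingerprints:
        # A Bloom filter never gives false negatives, so "absent" is final.
        if not bloom.might_contain(fp):
            missing.append(fp)
        # "Present" may be a false positive, so confirm in the fingerprint store.
        elif fp not in fingerprint_store:
            missing.append(fp)
    return missing


bloom, fp_store = BloomFilter(), set()
for known in (b"fp1", b"fp2"):
    bloom.add(known)
    fp_store.add(known)
result = unmatched([b"fp1", b"fp3"], bloom, fp_store)
```

The filter answers most queries for new fingerprints from memory, so the slower fingerprint store is consulted only when the filter reports a possible match.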
To solve the above technical problem, the application also discloses a data storage system comprising a central node and one or more deduplication nodes. The central node is configured to assign each bucket (Bucket) to a corresponding deduplication node according to a preset policy, to create a routing table from the correspondence between Buckets and deduplication nodes, and to synchronize the routing table to every deduplication node. Each deduplication node is configured to store, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks that the fingerprint information represents.
To solve the above technical problem, the application also discloses a client for reading and writing data, comprising: a cutting and computing module for cutting data into multiple data blocks and computing the fingerprint information of each data block; a bucket determining module for determining the Bucket corresponding to the fingerprint information of each data block; a node determining module for determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket; a request sending module for sending a fingerprint query request, which includes the fingerprint information of the data blocks, to the deduplication node corresponding to the Bucket; an information receiving module for receiving from that deduplication node the fingerprint information that was not found; and a data uploading module for uploading the not-found fingerprint information and the data blocks it represents to the deduplication node corresponding to the Bucket.
To solve the above technical problem, the application also discloses a system for reading and writing data, comprising a central node and deduplication nodes. The central node is configured to send a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes. Each deduplication node is configured to receive a fingerprint query request from the client, the request including fingerprint information corresponding to a Bucket assigned to that deduplication node; to query the fingerprint information and return to the client the fingerprint information that was not found; and to receive the not-found fingerprint information and the data blocks it represents, uploaded by the client.
Compared with the prior art, the application can achieve the following technical effects: global deduplicated storage management of raw data at the 100 PB scale and above and of fingerprint information at the 100 TB scale and above, with very high scalability. After a new deduplication node is added to the system, the central node redistributes data according to the preset policy and the deduplication nodes complete the data migration automatically, so the performance and capacity of the system can be expanded easily.
Of course, a product implementing the application need not achieve all of the above technical effects at the same time.
Brief description of the drawings
The drawings described here provide a further understanding of the application and form part of it. The illustrative embodiments of the application and their descriptions explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a structural schematic diagram of a data storage system (system for reading and writing data) according to an embodiment of the application;
Fig. 2 is a schematic diagram of the routing table according to an embodiment of the application;
Fig. 3 is a flow diagram of a data read-write method according to an embodiment of the application;
Fig. 4 is a flow diagram of a data read-write method according to an embodiment of the application;
Fig. 5 is a structural schematic diagram of a client for reading and writing data according to an embodiment of the application.
Specific embodiment
Embodiments of the invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve the technical problem and achieve the technical effects can be fully understood and implemented.
Fig. 1 shows the data storage system provided by an embodiment of the application (hereinafter "the system"), which includes a central node 10 and deduplication nodes 11; the central node 10 is coupled to the deduplication nodes 11. In the system, the central node 10 is responsible for distributed management of the multiple deduplication nodes 11 and for data distribution and replica management within the system. The deduplication nodes 11 manage and save the data blocks together with their fingerprint information and storage information, and complete data replication and migration under the distributed management of the central node 10. The deduplication nodes 11 have an abstract storage engine layer, so new storage engines can be added easily.
The system manages data blocks in units of buckets (Bucket). A Bucket is a logical concept in the system. Each Bucket is assigned a Bucket number, and a preset hash algorithm establishes the correspondence between Bucket numbers and the fingerprint information of each data block, so that data blocks are stored separately by Bucket number and a correspondence between data blocks and storage files is established. The central node manages globally, through Buckets, the data blocks saved by the system and their fingerprint information.
The central node assigns each Bucket to a corresponding deduplication node according to a preset policy. The preset policy can be a load-balancing policy: for example, the central node obtains the load data of each deduplication node, determines each node's current load from the real-time changes in that data, and preferentially assigns Buckets to deduplication nodes with lower current load, so that balanced data distribution yields balanced load across the deduplication nodes. The preset policy can also be a location-security policy: for example, the central node assigns different Buckets to different deduplication nodes according to the confidentiality level of the data or the permissions of the client, so that client data of different confidentiality levels or permissions is stored on different deduplication nodes.
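One way to realize the load-balancing policy is a greedy assignment to the currently least-loaded node, sketched below. The function name `assign_buckets`, the abstract load units, and the rule that each assigned Bucket adds one unit of load are assumptions; the text does not specify how load is measured.

```python
import heapq


def assign_buckets(bucket_count, node_loads):
    """Greedy sketch: give each Bucket to the currently least-loaded node.
    node_loads maps node id -> current load (units are an assumption)."""
    heap = [(load, node) for node, load in node_loads.items()]
    heapq.heapify(heap)
    assignment = {}
    for bucket in range(bucket_count):
        load, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (load + 1, node))  # one Bucket adds one load unit
    return assignment


assignment = assign_buckets(7, {"A": 0, "B": 0, "C": 2})
```

Here node C starts with a higher load, so it receives a Bucket only after A and B have caught up with it.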
After assigning each Bucket to a corresponding deduplication node, the central node establishes the correspondence between Buckets and deduplication nodes from the Bucket numbers and the deduplication node identifiers, and creates the routing table from that correspondence. The routing table can be regarded as a mapping table that records the mapping between Buckets and deduplication nodes. Fig. 2 is an example of the routing table in an embodiment of the application, in which the numbers in the horizontal header are Bucket numbers, the numbers in the vertical header are the copy identifiers of a Bucket, and the letters in the table denote different deduplication nodes. As shown in Fig. 2, for the Bucket numbered 0, copy 0 is assigned to deduplication node D and copy 1 to deduplication node A; for the Bucket numbered 1, copy 0 is assigned to deduplication node A and copy 1 to deduplication node B. Fig. 2 only illustrates the routing table in an embodiment of the application and does not limit the scope of protection of the application: any number of deduplication nodes can be set up in the system, each deduplication node can be assigned multiple Buckets, and each Bucket can have one or more backups located on different deduplication nodes.
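The two Buckets described for Fig. 2 can be modeled as a nested mapping, as in the sketch below. The dictionary layout and the helper names `nodes_for_bucket` and `buckets_on_node` are assumptions chosen for illustration; only the four assignments come from the text.

```python
# Entries follow the Fig. 2 description: routing_table[bucket][copy_id] -> node.
routing_table = {
    0: {0: "D", 1: "A"},
    1: {0: "A", 1: "B"},
}


def nodes_for_bucket(table, bucket):
    """All deduplication nodes holding a copy of the Bucket, ordered by copy id."""
    return [table[bucket][copy_id] for copy_id in sorted(table[bucket])]


def buckets_on_node(table, node):
    """The (bucket, copy id) pairs that a deduplication node is responsible for."""
    return [(b, c) for b, copies in sorted(table.items())
            for c, n in sorted(copies.items()) if n == node]


holders = nodes_for_bucket(routing_table, 0)
on_a = buckets_on_node(routing_table, "A")
```

A client uses the first kind of lookup to find where to send a fingerprint, while a deduplication node uses the second to learn which Buckets it must store.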
After the central node creates the routing table, it synchronizes the routing table to every deduplication node. Each deduplication node in the system determines from the routing table which Buckets are assigned to it and stores the fingerprint information corresponding to those Buckets and the data blocks that the fingerprint information represents. To do so, the deduplication node creates a corresponding container (Container) file for each assigned Bucket, saves the corresponding fingerprint information in each Bucket, and saves the data blocks that the fingerprint information represents in the Container file corresponding to that Bucket. The correspondence between fingerprint information and Buckets is obtained by taking the fingerprint information modulo the total number of Buckets in the system and determining the Bucket number from the result; this calculation is usually performed by the client that stores data to the system. As more and more fingerprint information corresponds to a Bucket, the number of data blocks saved in the corresponding Container file grows, and so does the storage space the Container file occupies. To ensure that each deduplication node can store copies of multiple Buckets and to control the load on the deduplication node, when the size of the Container file corresponding to a Bucket exceeds a preset threshold, the deduplication node archives the Container file to the background server 12. As shown in Fig. 1, every deduplication node is coupled to the background server 12; when the deduplication node later receives further data blocks for that Bucket, it stores them in the Container file located on the background server 12.
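The per-Bucket storage and the archive threshold can be sketched with an in-memory model. The class `BucketStore`, the byte threshold of 8, and the `archived` flag that stands for "the Container file now lives on the background server" are assumptions made for illustration.

```python
class BucketStore:
    """Per-Bucket fingerprint index plus one Container, modeled in memory."""

    def __init__(self, bucket_id, archive_threshold):
        self.bucket_id = bucket_id
        self.archive_threshold = archive_threshold  # bytes; value is an assumption
        self.index = {}               # fingerprint -> (offset, size)
        self.container = bytearray()  # contents of the Container file
        self.archived = False         # True once moved to the background server

    def put(self, fp, block):
        """Store a block once; return False if its fingerprint is already known."""
        if fp in self.index:
            return False
        self.index[fp] = (len(self.container), len(block))
        self.container += block
        # Archive once the Container grows past the preset threshold.
        if not self.archived and len(self.container) > self.archive_threshold:
            self.archived = True
        return True


store = BucketStore(0, archive_threshold=8)
first = store.put("fp1", b"12345")
duplicate = store.put("fp1", b"12345")
archived_before = store.archived
second = store.put("fp2", b"67890")
```

After the second distinct block the Container holds 10 bytes, which exceeds the assumed threshold of 8, so the sketch marks it as archived while later writes keep appending to the same Container.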
Central node according to preset strategy by each Bucket distribute to it is corresponding remove multiple knot when, can will be each Bucket be assigned to it is multiple it is corresponding remove multiple knot, so that each Bucket is there are multiple copies in systems, and each to copy Shellfish distributes different copy marks.Such as in routing table shown in Fig. 2, central node is that each Bucket is assigned to two duplicate removals Node, each Bucket is in the different copy marks 0 and 1 for going the copy of multiple knot to be respectively provided with.
When the central node assigns a Bucket to several deduplication nodes, it designates one of them as the primary node and at least one as a standby node. The central node can make this choice by copy ID: among the copy IDs of each Bucket, one is the primary copy ID and the rest are backup copy IDs. For example, the copy with ID 0 is the primary copy of each Bucket, and copies with other IDs are backup copies. The deduplication node that holds copy 0 of a Bucket is the primary node of that Bucket, and the deduplication nodes that hold its other copies are its standby nodes. This prevents the storage and reading of a Bucket's data from becoming impossible when a single deduplication node is unavailable. Each copy of a Bucket includes the fingerprint information of the Bucket and the Container file that holds the data blocks represented by that fingerprint information.
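The routing table described above can be pictured with a small sketch. This is only an illustration under assumptions: the class name `RoutingTable`, its fields, and the node names are hypothetical, and the patent does not prescribe any particular data layout. Copy ID 0 is treated as the primary copy, as in the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

PRIMARY_COPY_ID = 0  # assumption: copy ID 0 marks the primary copy, as in the text's example


@dataclass
class RoutingTable:
    """Hypothetical in-memory routing table: Bucket number -> {copy ID -> node name}."""
    version: int = 1
    assignments: Dict[int, Dict[int, str]] = field(default_factory=dict)

    def assign(self, bucket: int, nodes: List[str]) -> None:
        # One copy of the Bucket per listed node, with copy IDs 0, 1, 2, ...
        self.assignments[bucket] = dict(enumerate(nodes))

    def primary(self, bucket: int) -> str:
        return self.assignments[bucket][PRIMARY_COPY_ID]

    def backups(self, bucket: int) -> List[str]:
        copies = sorted(self.assignments[bucket].items())
        return [node for copy_id, node in copies if copy_id != PRIMARY_COPY_ID]


table = RoutingTable()
table.assign(1, ["A", "B"])  # Bucket 1: primary on node A, backup on node B
table.assign(2, ["B", "C"])  # Bucket 2: primary on node B, backup on node C
```

Keeping the copy ID as the inner key means that promoting a standby node only requires rewriting the entries of the affected Bucket.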
The central node determines whether each deduplication node is available and whether new deduplication nodes have been added to the system. It does so through the heartbeat messages it exchanges with the deduplication nodes. When the central node finds that a deduplication node is unavailable, or that a new deduplication node has been added, it reassigns the Buckets, and the mapping between Buckets and deduplication nodes changes. When a deduplication node is unavailable, the central node reassigns the Buckets of that node to other deduplication nodes according to the preset strategy; when a new deduplication node is added, the central node redistributes the Buckets in the system according to the preset strategy. In both cases the central node updates the routing table to reflect the new mapping and synchronizes the updated routing table to every deduplication node. Because the deduplication nodes to which some Buckets are assigned have changed, the deduplication nodes migrate data according to the updated routing table.
The data migration is initiated by the primary node of each Bucket whose assigned deduplication nodes have changed. For example, if copy 0 (the primary copy) of Bucket 1 moves in the routing table from deduplication node A to deduplication node B, node A initiates the migration of copy 0 of Bucket 1 to node B. Node B then checks the routing table to see whether any standby node of Bucket 1 has changed; if one has, for example from node D to node E, node B backs up the data of Bucket 1 to node E, and the copy of Bucket 1 on node E carries a backup copy ID. After the migration completes, each deduplication node can delete, according to the updated routing table, the data of any local Bucket that is no longer mapped to it. When the primary node of a Bucket becomes unavailable, the central node selects a new primary node from the Bucket's standby nodes, and the new primary node initiates the data migration for that Bucket according to the updated routing table. For example, when node A, the primary node of Bucket 1, is unavailable, the central node chooses between the standby nodes of Bucket 1, nodes B and C, and selects node B as the new primary node. The copy ID of the copy of Bucket 1 on node B then becomes the primary copy ID (for example 0). In the updated routing table the standby nodes of Bucket 1 are nodes C and D, so node B backs up the data of Bucket 1 to node D.
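The failover example above (node A fails, node B is promoted, node D receives a new backup) can be sketched as follows. The function `fail_over` and the rule "promote the first surviving standby node" are assumptions made for illustration; the text only says that the central node picks a new primary node from the standby nodes.

```python
from typing import List, Optional, Tuple


def fail_over(copies: List[str], failed: str,
              spare: Optional[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return the new copy list and the (source, target) migrations it requires.

    `copies` lists node names by copy ID, with index 0 as the primary node.
    Assumption: the first surviving standby node becomes the new primary node.
    """
    survivors = [node for node in copies if node != failed]
    if not survivors:
        raise RuntimeError("no surviving copy of this Bucket")
    new_copies = list(survivors)
    if spare is not None and len(new_copies) < len(copies):
        new_copies.append(spare)  # restore the original number of copies
    new_primary = new_copies[0]
    # The new primary node initiates a backup to every node that lacks the data.
    migrations = [(new_primary, node) for node in new_copies if node not in copies]
    return new_copies, migrations


# Bucket 1 from the example: primary A, standby B and C; A fails, D is the spare.
new_copies, migrations = fail_over(["A", "B", "C"], failed="A", spare="D")
```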
Each deduplication node maintains a fingerprint database. The fingerprint database holds, for each Bucket of the deduplication node, the fingerprint information and the storage information of the data blocks that the fingerprint information represents. It can take the form of a Key-Value Store, with the fingerprint information as Key and the storage information of the data block it represents as Value. Reading and writing data involves a large number of fingerprint lookups and comparisons. Each deduplication node uses a Bloom filter to absorb part of the query load, but because a Bloom filter can produce false positives, many requests still have to be resolved by the fingerprint database. The read performance required of the fingerprint database (Key-Value Store) is therefore very high. For a small Key-Value Store, the Key-Value Pairs can be stored on an ordinary hard disk and an index built in memory to locate the Key-Value Pairs on disk quickly. This system, however, targets data storage of 100 PB and above, where the volume of fingerprint information and storage information is very large (about 50 TB of fingerprint and storage information for 100 PB of raw data), so the index cannot be held in device memory. The inventors of the present application found that a hash table holding all the Key-Value Pairs of the fingerprint database can be implemented entirely on a solid state drive (SSD). The hash table stored on the SSD is a cuckoo hash table. Fingerprint lookups on a deduplication node go through the Bloom filter first, and the false positives that hash collisions cause there are then resolved in the cuckoo hash table. Cuckoo hashing is a way of handling hash collisions. Its basic idea is to compute two candidate positions for a Key with two different hash functions: (1) if both positions are free, one of them is chosen for the insertion; (2) if only one position is free, the Key is inserted there; (3) if neither position is free, one of the two is chosen at random and the Key occupying it is evicted, and the evicted Key is then inserted at the position given by its other hash value. If that position is empty the insertion completes; if not, its occupant is evicted in turn, and the process continues until a free position is found. This procedure can obviously loop forever, so a maximum number of evictions is usually set, and when it is reached the hash table is considered full. The inventors chose cuckoo hashing because the number of input/output operations needed to look up a Key is a constant.
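The three insertion cases above can be illustrated with a minimal in-memory sketch. It is not the on-SSD structure described in the text: the class name `CuckooTable`, the slot count, the eviction limit, and the use of salted BLAKE2b from the Python standard library in place of seeded murmur2 are all assumptions made to keep the example self-contained.

```python
import hashlib
import random


class CuckooTable:
    """Two-choice cuckoo hash table with one Key-Value Pair per position."""

    def __init__(self, slots=64, max_kicks=32, seed=0):
        self.slots = [None] * slots
        self.max_kicks = max_kicks        # upper bound on evictions per insert
        self.rng = random.Random(seed)

    def _positions(self, key):
        # Two hash functions derived from one by salting (stand-in for seeded murmur2).
        return [int.from_bytes(hashlib.blake2b(key, digest_size=8,
                                               salt=bytes([i])).digest(), "big")
                % len(self.slots) for i in (1, 2)]

    def get(self, key):
        # A lookup probes at most two positions, so its cost is constant.
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        p1, p2 = self._positions(key)
        for pos in (p1, p2):              # overwrite an existing Key
            if self.slots[pos] is not None and self.slots[pos][0] == key:
                self.slots[pos] = (key, value)
                return
        for pos in (p1, p2):              # cases (1) and (2): a free position exists
            if self.slots[pos] is None:
                self.slots[pos] = (key, value)
                return
        entry = (key, value)              # case (3): evict and relocate
        pos = self.rng.choice((p1, p2))
        for _ in range(self.max_kicks):
            entry, self.slots[pos] = self.slots[pos], entry  # entry is now the evicted pair
            a, b = self._positions(entry[0])
            pos = b if pos == a else a    # the evicted Key's other position
            if self.slots[pos] is None:
                self.slots[pos] = entry
                return
        # The pair still in hand is reported so the caller can rehash or grow the table.
        raise RuntimeError(f"table considered full; displaced pair: {entry!r}")


table = CuckooTable()
for i in range(8):
    table.insert(f"fp-{i}".encode(), i)
```

Because `get` never probes more than two positions, the lookup cost stays fixed regardless of how full the table is, which is the property the text gives as the reason for choosing cuckoo hashing.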
An ordinary cuckoo hash table achieves a utilization of only 49%, so two main variants of cuckoo hashing are commonly used: 1) increasing the number of hash functions; 2) increasing the number of Keys that each position can store. Both variants raise the utilization of the cuckoo hash table. The inventors chose murmur2 as the base hash function; by setting different seeds, the same Key yields different hash values.
Solid state drives (SSDs) based on the NVMe (Non-Volatile Memory Express) protocol use a 4 KB page as their basic unit, so the fingerprint database reads and writes in units of 4 KB. A Key-Value Pair in the fingerprint database is 256 bytes, so one 4 KB page holds 16 Key-Value Pairs. Each position of the cuckoo hash table therefore stores 16 Key-Value Pairs. The pairs are written into the page in insertion order rather than sorted by Key, which avoids the cost of sorting on the SSD. In the inventors' tests, an asynchronous mode with a concurrency of 128 (queue depth × number of jobs = 128) fully exploits the IOPS (Input/Output Operations Per Second) capacity of NVMe (450K). To generate this much concurrency, the inventors optimized in two respects: 1. running multiple Key-Value Stores, each in the form of a cuckoo hash table, on one NVMe drive; 2. using multiple cuckoo hash functions on one NVMe drive, with asynchronous reads. The two must together satisfy the condition that the number of cuckoo hash tables on each SSD multiplied by the number of cuckoo hash functions equals 128. The inventors found that the QPS (Queries Per Second) of a single cuckoo hash table drops markedly as the number of hash functions grows: going from 4 to 8 hash functions halves the QPS. With too few hash functions, on the other hand, the space utilization of the cuckoo hash table drops markedly. Weighing performance against space utilization, the inventors chose 4 cuckoo hash functions, with which space utilization reaches 98.66%. One NVMe drive therefore needs to run 32 cuckoo hash tables. A further benefit of splitting into multiple hash tables is a finer locking granularity for the fingerprint database.
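The sizing figures in this passage follow from simple arithmetic, worked through below. The constant names are hypothetical. The last figure, an average data block size of about 512 KiB, is not stated in the text; it is derived here from the quoted 100 PB and 50 TB figures under the assumption of binary units and one 256-byte pair per data block.

```python
PAGE_SIZE = 4 * 1024         # bytes per NVMe page, from the text
PAIR_SIZE = 256              # bytes per Key-Value Pair, from the text
TARGET_CONCURRENCY = 128     # hash tables x hash functions, from the text
HASH_FUNCTIONS = 4           # the inventors' choice, from the text

pairs_per_page = PAGE_SIZE // PAIR_SIZE                  # pairs held by one position
tables_per_disk = TARGET_CONCURRENCY // HASH_FUNCTIONS   # cuckoo hash tables per drive

# Derived, not stated in the text: the average data block size implied by
# "100 PB of raw data, about 50 TB of fingerprint and storage information".
PIB, TIB = 2 ** 50, 2 ** 40
pair_count = (50 * TIB) // PAIR_SIZE
implied_block_size = (100 * PIB) // pair_count

print(pairs_per_page, tables_per_disk, implied_block_size)
```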
The process by which a client reads and writes data with the data storage system described above is explained further below. When the client writes data to the system, as shown in Fig. 3, the process includes the following steps.
In step S201, the client splits the data into multiple data blocks and calculates the fingerprint information of each data block.
The hash value of each data block is calculated as its fingerprint information using a hash algorithm with a low collision rate, such as SHA-1 or MD5.
In step S202, the client determines the Bucket corresponding to the fingerprint information of each data block.
The client takes the fingerprint information of the data block modulo the total number of Buckets in the system and matches the result against the Bucket numbers to determine the Bucket for that fingerprint. For example, if the hash value of the data block is a and the total number of Buckets in the system is p, the client computes a % p; if the result is 2, the fingerprint information of the data block corresponds to Bucket 2.
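Steps S201 and S202 can be sketched in a few lines. SHA-1 is one of the algorithms the text names; the function names and the choice to interpret the digest as a big-endian integer are assumptions made for illustration.

```python
import hashlib


def fingerprint(block: bytes) -> bytes:
    """Fingerprint of a data block (SHA-1, one of the algorithms named in the text)."""
    return hashlib.sha1(block).digest()


def bucket_for(fp: bytes, total_buckets: int) -> int:
    """Bucket number = fingerprint modulo the total number of Buckets."""
    # Assumption: the fingerprint bytes are read as a big-endian unsigned integer.
    return int.from_bytes(fp, "big") % total_buckets


bucket = bucket_for(fingerprint(b"example data block"), total_buckets=1024)
```

Because the mapping depends only on the fingerprint and the total number of Buckets, every client computes the same Bucket for the same data block without consulting the central node.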
In step S203, the client determines the deduplication node corresponding to the Bucket according to the routing table obtained from the central node.
The client uses its saved routing table to determine the deduplication node that hosts the Bucket corresponding to the fingerprint information. When the client writes data to the system for the first time, it first requests the routing table from the central node. For example, suppose the fingerprint information of a data block corresponds to Bucket 2, and in the routing table Bucket 2 is assigned to deduplication nodes A and B, where node A is the primary node of Bucket 2 and node B is its standby node. The fingerprint information of the data block must then be sent to node A for the fingerprint query.
In step S204, the client sends a fingerprint query request to the deduplication node corresponding to the Bucket; the fingerprint query request includes the fingerprint information of the data block.
The client comprises read threads, a send thread, and a logic-processing thread. Multiple read threads each take a different part of the data, split it into blocks, and calculate the fingerprint information of each data block, then place the fingerprint information in query request queues. Each read thread has multiple query request queues, one per Bucket number, so the client holds fingerprint information belonging to the same Bucket in the same query request queue. When the data in a query request queue exceeds a certain amount, or the delay of the queue expires, the read thread moves the query requests into the buffer of the send thread.
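The per-Bucket queueing described above might look like the following single-threaded sketch. The class `QueryBatcher`, its parameters, and the injected `send` callback are hypothetical; the real client uses separate read and send threads, which are omitted here to keep the batching rule visible.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List


class QueryBatcher:
    """Groups fingerprints by Bucket and flushes on a size or delay limit."""

    def __init__(self, max_items: int, max_delay: float,
                 send: Callable[[int, List[bytes]], None],
                 clock: Callable[[], float] = time.monotonic):
        self.max_items = max_items
        self.max_delay = max_delay
        self.send = send                  # stands in for the send thread's buffer
        self.clock = clock
        self.queues: Dict[int, List[bytes]] = defaultdict(list)
        self.first_seen: Dict[int, float] = {}

    def add(self, bucket: int, fp: bytes) -> None:
        queue = self.queues[bucket]
        if not queue:
            self.first_seen[bucket] = self.clock()   # start this queue's delay timer
        queue.append(fp)
        if len(queue) >= self.max_items:
            self.flush(bucket)

    def flush_expired(self) -> None:
        now = self.clock()
        expired = [b for b, q in self.queues.items()
                   if q and now - self.first_seen[b] >= self.max_delay]
        for bucket in expired:
            self.flush(bucket)

    def flush(self, bucket: int) -> None:
        batch, self.queues[bucket] = self.queues[bucket], []
        if batch:
            self.send(bucket, batch)


sent = []
batcher = QueryBatcher(max_items=2, max_delay=10.0,
                       send=lambda bucket, batch: sent.append((bucket, batch)),
                       clock=lambda: 0.0)
batcher.add(1, b"fp-x")   # queue for Bucket 1 holds one item, nothing is sent yet
batcher.add(1, b"fp-y")   # size limit reached, the batch for Bucket 1 is sent
```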
The send thread sends each request packet to the deduplication node that hosts the Bucket of the corresponding query request queue (the primary node of that Bucket). In one embodiment, the send thread has four buffers. Two of them hold requests currently being transmitted to the system, one for fingerprint query requests and one for data block upload requests; the other two receive new requests from other threads inside the client, again one for fingerprint query requests and one for data block upload requests. Keeping separate buffers for requests in transit and for newly arriving requests prevents other threads from blocking for a long time while they write new requests.
When the send thread receives a response packet from a deduplication node, it passes the response packet to the logic-processing thread. Based on the fingerprint query result returned by the deduplication node, the logic-processing thread passes upload requests for the fingerprints that were not found, and for the data blocks they represent, to the send thread, which sends them to the corresponding deduplication node. This division of work among threads keeps the transmission of requests continuous and smooth.
The response packet received by the send thread includes the version information of the routing table currently stored by the deduplication node. The client checks whether this version matches the version of the routing table it obtained from the central node. If the versions differ, the central node has updated the routing table and synchronized it to the deduplication nodes in the system. The client then obtains the updated routing table from the central node through the send thread, redetermines the deduplication node for each Bucket from the updated routing table, and thereby redetermines the deduplication node to which the fingerprints that were not found, and the data blocks they represent, should be uploaded. If the versions match, the client continues to determine the deduplication node for each Bucket from the routing table it already has, and the upload target is unchanged.
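The version check can be summarized in a short sketch. The dictionary layout of the routing table and the `fetch_latest` callback, which stands in for a request to the central node, are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical routing table shape: {"version": int, "primary": {bucket: node}}
Table = Dict[str, object]


def upload_targets(cached: Table, response_version: int,
                   fetch_latest: Callable[[], Table],
                   missing_buckets: Iterable[int]) -> Tuple[Table, Dict[int, str]]:
    """Pick the node to upload to for each Bucket with fingerprints not found."""
    if response_version != cached["version"]:
        table = fetch_latest()   # the central node has published a newer table
    else:
        table = cached           # versions match: upload targets are unchanged
    primary = table["primary"]
    return table, {bucket: primary[bucket] for bucket in missing_buckets}


cached = {"version": 1, "primary": {2: "A"}}
latest = {"version": 2, "primary": {2: "B"}}

_, same = upload_targets(cached, 1, lambda: latest, [2])
_, moved = upload_targets(cached, 2, lambda: latest, [2])
```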
In step S205, the deduplication node looks up the fingerprint information and returns the fingerprints that were not found to the client.
Each deduplication node contains a Bloom filter and a fingerprint database. The Bloom filter holds a hash index of all the fingerprint information currently stored on the deduplication node; the fingerprint database stores, as Key-Value Pairs, all the fingerprint information and the storage information of the data blocks it represents. For every fingerprint in a fingerprint query request, the deduplication node consults the Bloom filter first and then the fingerprint database. The Bloom filter computes the hash index of the fingerprint and checks it against the hash indexes it holds. If there is no match, the deduplication node holds neither that fingerprint nor the data block it represents. If there is a match, the fingerprint only possibly exists, because hash collisions can make the Bloom filter report a false positive, so the fingerprint database is queried to confirm: the fingerprint exists if it is found in the fingerprint database and does not exist otherwise. Querying the Bloom filter first improves the efficiency of fingerprint lookups on the deduplication node, and the fingerprint database corrects the false positives that hash collisions cause in the Bloom filter, which improves the accuracy of the lookups. The deduplication node places all fingerprints in the query request that were not found into a response packet and returns it to the client. The response packet also includes the version information of the routing table currently stored by the deduplication node, so that the client can decide whether it needs to update its routing table.
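The two-stage lookup of step S205 can be sketched as follows. The `BloomFilter` class here is a minimal illustration with assumed parameters (bit count, three salted BLAKE2b hashes), and a plain dictionary stands in for the SSD-resident fingerprint database.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits: int = 1 << 16, hashes: int = 3):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _indexes(self, key: bytes):
        for i in range(self.hashes):
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.bits

    def add(self, key: bytes) -> None:
        for i in self._indexes(key):
            self.array[i // 8] |= 1 << (i % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.array[i // 8] & (1 << (i % 8)) for i in self._indexes(key))


def find_missing(fingerprints: Iterable[bytes], bloom: BloomFilter,
                 fingerprint_db: Dict[bytes, tuple]) -> List[bytes]:
    """Return the fingerprints this node does not hold."""
    missing = []
    for fp in fingerprints:
        # A Bloom filter miss is definitive; a hit must be confirmed in the database.
        if not bloom.might_contain(fp) or fp not in fingerprint_db:
            missing.append(fp)
    return missing


bloom = BloomFilter()
fingerprint_db = {b"fp-known": ("2_abcd234_010515", 0, 4096)}
bloom.add(b"fp-known")
missing = find_missing([b"fp-known", b"fp-new"], bloom, fingerprint_db)
```

The database is consulted only when the Bloom filter reports a possible hit, so most queries for new data never reach the SSD.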
In step S206, the client uploads the fingerprints that were not found, and the data blocks they represent, to the deduplication node corresponding to the Bucket.
The client uploads the fingerprints listed in the response packet as not found, together with the data blocks they represent, to the deduplication node that hosts the Bucket corresponding to those fingerprints. If the version information of the routing table has not changed, that deduplication node is the same node that performed the fingerprint query in step S205. The other fingerprints in the fingerprint query request already exist on the deduplication node, so their data blocks do not need to be uploaded again, which prevents the system from storing identical data blocks repeatedly.
In step S207, the deduplication node saves the fingerprints that were not found in the Bucket assigned to it, and saves the data blocks in the Container file corresponding to that Bucket.
The deduplication node receives the fingerprints that were not found and the data blocks they represent, saves the fingerprints in the Bucket corresponding to them, and saves the data blocks in the Container file of that Bucket. The name of a Container file consists of the number of the corresponding Bucket, a universally unique identifier (UUID) internal to the system, and the date (Date), for example 2_abcd234_010515. To guarantee that a data block has reached the disk, the data block is written to the Container file with the O_SYNC flag set, so that each write returns only after it has completed. After pwrite returns, the fingerprint information of the data block is written to the fingerprint database: the fingerprint information of the data block is the Key, the second storage information of the data block is the Value, and the resulting Key-Value Pair is stored in the fingerprint database. The second storage information includes the name of the Container file that holds the data block, the offset (Offset) of the data block within the Container file, and the size (Chunksize) of the data block. The hash index of the newly saved Key-Value Pair is then added to the Bloom filter for use in subsequent deduplication queries.
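The write ordering described above (synchronous write to the Container file first, fingerprint database update second) is sketched below for a POSIX system. The function names, the append-at-end placement of data blocks, and the dictionary standing in for the fingerprint database are assumptions; only the naming pattern, the O_SYNC flag, and the (name, offset, size) storage information come from the text.

```python
import os
import tempfile


def container_name(bucket: int, uuid_part: str, date_part: str) -> str:
    """Bucket number + UUID + date, e.g. 2_abcd234_010515."""
    return f"{bucket}_{uuid_part}_{date_part}"


def store_block(directory: str, name: str, block: bytes,
                fp: bytes, fingerprint_db: dict) -> tuple:
    """Write the data block synchronously, then record where it was stored."""
    path = os.path.join(directory, name)
    # O_SYNC makes each write return only after the data has reached the device.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o644)
    try:
        offset = os.lseek(fd, 0, os.SEEK_END)   # assumption: blocks are appended
        if os.pwrite(fd, block, offset) != len(block):
            raise OSError("short write to Container file")
    finally:
        os.close(fd)
    # The database is updated only after pwrite has returned successfully.
    fingerprint_db[fp] = (name, offset, len(block))
    return fingerprint_db[fp]


def read_block(directory: str, storage_info: tuple) -> bytes:
    """Fetch a data block using its (Container name, Offset, Chunksize)."""
    name, offset, size = storage_info
    fd = os.open(os.path.join(directory, name), os.O_RDONLY)
    try:
        return os.pread(fd, size, offset)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as directory:
    db = {}
    name = container_name(2, "abcd234", "010515")
    store_block(directory, name, b"hello", b"fp-1", db)
    info = store_block(directory, name, b"world!", b"fp-2", db)
    second_block = read_block(directory, info)
```

If the write fails, the exception propagates before the database is touched, so the fingerprint database never points at a data block that is not on disk.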
In step S208, the deduplication node returns a message to the client indicating that the data block was saved successfully.
After the fingerprints that were not found and the data blocks they represent have been saved, the deduplication node returns a success message to the client. Alternatively, in one embodiment, when the Bucket corresponding to those fingerprints has standby nodes in the system, the primary node first saves the fingerprints and data blocks uploaded by the client, then backs them up to the standby nodes of the Bucket, and returns the success message to the client only after the backup has finished.
In step S209, when all the fingerprints that were not found and the data blocks they represent have been uploaded, the client uploads the mapping file of the data to the deduplication node.
The mapping file contains the fingerprint information of every data block of the data, arranged in the order in which the client split the data into data blocks, which guarantees that the mapping file maps correctly back to the data.
The client uploads the mapping file in blocks as well. The client splits the mapping file into multiple data blocks and calculates the hash value of each of them, for example with the murmur2 hash function. The client determines the Bucket corresponding to the hash value of each mapping file data block, uses the routing table to determine the deduplication node corresponding to that Bucket, and uploads the mapping file data block and its hash value to that deduplication node. When uploading mapping file data blocks, the client likewise performs a fingerprint query with the hash value of each block and uploads only the blocks whose hash values were not found, so duplicate mapping file data blocks are not uploaded. When the client splits the mapping file into multiple data blocks, it places the header information of the mapping file in the first data block. The header information includes the total size of the mapping file and the total number of its data blocks.
In step S210, the deduplication node saves the mapping file data blocks uploaded by the client and their hash values.
In the Bucket corresponding to the hash value of a mapping file data block, the deduplication node saves the hash value and the first storage information, and it saves the mapping file data block in the Container file of that Bucket. The first storage information includes the name of the Container file that holds the mapping file data block, the offset of the block within the Container file, and the size of the block. Then, with the mapping file name plus the data block sequence number as Key and the first storage information of the mapping file data block as Value, the deduplication node updates the fingerprint database with this Key-Value Pair. This completes the process by which the client writes data to the system.
Fig. 4 shows the process by which the client reads data from the system in the embodiment of the present application. The process includes the following steps.
In step S301, the client requests the mapping file from the deduplication node using the mapping file name of the data and the data block sequence number.
The client first requests the first data block of the mapping file from the deduplication node. The first data block contains the header information of the mapping file, which includes the size of the mapping file and the total number of its data blocks. Using the header information, the client sends the deduplication node requests for the other data blocks of the mapping file.
In step S302, the deduplication node sends the data blocks of the mapping file to the client.
The deduplication node matches the mapping file name and data block sequence number sent by the client against the Keys in the fingerprint database, finds the Key-Value Pair of the mapping file data block, and obtains the first storage information corresponding to that Key. The Container file name in the first storage information identifies the Container file in which the mapping file data block is stored, and the offset and size of the block are then used to read it from that Container file.
In step S303, the client assembles the mapping file from its data blocks and requests data blocks from the deduplication node according to the fingerprint information of each data block listed in the mapping file.
The client assembles the complete mapping file according to the sequence numbers of the mapping file data blocks. The mapping file contains the fingerprint information of all data blocks, arranged in the order in which the data was split. The client determines the Bucket corresponding to each fingerprint, uses the routing table to determine the deduplication node that hosts that Bucket, and sends that deduplication node a request for the corresponding data block.
In step S304, the deduplication node sends the data blocks represented by the fingerprint information in the mapping file to the client.
The deduplication node queries the fingerprint database with the fingerprint information in the data block request and obtains the second storage information corresponding to that fingerprint. The Container file name in the second storage information identifies the Container file in which the data block is stored, and the offset and size of the data block are used to read it from that Container file. In one embodiment, after identifying the Container file from the second storage information, the deduplication node determines whether that Container file has been archived to the background server; if it has, the deduplication node reads the data block from the Container file stored on the background server and sends it to the client.
In step S305, the client reassembles the data by concatenating the data blocks in the order in which their fingerprint information appears in the mapping file.
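The role of the mapping file in the write and read paths can be illustrated end to end with a toy, single-process sketch. Fixed-size splitting, the function names, and the dictionary that stands in for all deduplication nodes are assumptions; the text does not specify how the client chooses block boundaries.

```python
import hashlib
from typing import Dict, List


def write_data(data: bytes, block_size: int, store: Dict[bytes, bytes]) -> List[bytes]:
    """Split, fingerprint, and store only unseen blocks; return the mapping list."""
    mapping = []
    for start in range(0, len(data), block_size):   # assumption: fixed-size blocks
        block = data[start:start + block_size]
        fp = hashlib.sha1(block).digest()
        if fp not in store:        # the fingerprint query found nothing: upload
            store[fp] = block
        mapping.append(fp)         # fingerprints are kept in split order
    return mapping


def read_data(mapping: List[bytes], store: Dict[bytes, bytes]) -> bytes:
    """Concatenate the blocks in the order their fingerprints appear in the mapping."""
    return b"".join(store[fp] for fp in mapping)


store: Dict[bytes, bytes] = {}
original = b"abcabcabcx"
mapping = write_data(original, 3, store)
restored = read_data(mapping, store)
```

The three identical blocks are stored once but appear three times in the mapping, which is why the order of fingerprints in the mapping file must match the split order.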
As shown in Fig. 5, the client for reading and writing data in the embodiment of the present application comprises:
A splitting and calculation module 501, for splitting data into multiple data blocks and calculating the fingerprint information of each data block;
A Bucket determining module 502, for determining the Bucket corresponding to the fingerprint information of each data block;
A node determining module 503, for determining the deduplication node corresponding to the Bucket according to the routing table obtained from the central node;
A request sending module 504, for sending a fingerprint query request to the deduplication node corresponding to the Bucket, the fingerprint query request including the fingerprint information of the data block;
An information receiving module 505, for receiving the fingerprints that were not found, as returned by the deduplication node corresponding to the Bucket;
A data uploading module 506, for uploading the fingerprints that were not found, and the data blocks they represent, to the deduplication node corresponding to the Bucket; when all of them have been uploaded, the module is also used to upload the mapping file of the data to the deduplication node, the mapping file including the fingerprint information of each data block of the data, arranged in the order in which the data blocks were split.
In addition, the embodiment of the present application also discloses a system for reading and writing data. With reference to Fig. 1, it comprises a central node 10 and one or more deduplication nodes 11, wherein
The central node 10 is used to send a routing table to the client, the routing table including the correspondence between Buckets and deduplication nodes;
The deduplication node 11 is used to receive a fingerprint query request from the client, the fingerprint query request including fingerprint information corresponding to a Bucket assigned to the deduplication node; to look up the fingerprint information and return the fingerprints that were not found to the client; and to receive the fingerprints that were not found, and the data blocks they represent, as uploaded by the client.
It should be noted that the features of the system for reading and writing data shown in Fig. 1 correspond to those of the embodiments shown in Figs. 3 and 4, as do the features of the client for reading and writing data shown in Fig. 5. Details not covered in the embodiments of Figs. 1 and 5 can therefore be found in the description of the embodiments shown in Figs. 3 and 4 and are not repeated here.
The data storage method, data storage system, data read-write method, client for reading and writing data, and system for reading and writing data provided by the embodiments of the present application achieve global deduplicated storage management for raw data at the scale of 100 PB and above and fingerprint information at the scale of 100 TB and above. The system is highly scalable: after a new deduplication node is added, the central node redistributes data according to the preset strategy and the deduplication nodes complete the data migration automatically, so the performance and capacity of the system can be extended easily. Each deduplication node implements a high-performance fingerprint database on a solid state drive, in which a large-capacity cuckoo hash table is built. This overcomes the technical difficulty that, when the volume of fingerprint information is very large, an index cannot be built in memory and deduplication queries therefore cannot be performed, while ensuring the efficiency of fingerprint queries and improving their accuracy.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Certain terms are used in the specification and claims to refer to specific components. Those skilled in the art will understand that hardware manufacturers may refer to the same component by different names. This specification and the claims do not distinguish components by differences in name, but by differences in function. The term "comprising" as used throughout the specification and claims is open-ended and should therefore be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, those skilled in the art can solve the technical problem and basically achieve the technical effect. In addition, the term "coupled" herein includes any direct or indirect means of electrical coupling. Therefore, if a first device is described as being coupled to a second device, the first device may be directly electrically coupled to the second device, or indirectly electrically coupled to the second device through other devices or coupling means. The subsequent description sets out preferred embodiments for implementing the invention; it is given for the purpose of illustrating the general principles of the invention and is not intended to limit its scope. The scope of protection of the invention is defined by the appended claims.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a product or system comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a product or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the product or system that comprises the element.
The above description shows and describes several preferred embodiments of the invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the scope of protection of the appended claims.

Claims (33)

1. A data storage method, characterized in that it is applied to a data storage system comprising a central node and deduplication nodes, the data storage method comprising:
the central node assigning each Bucket (bucket) to a corresponding deduplication node according to a preset strategy;
the central node creating a routing table according to the correspondence between Buckets and deduplication nodes, and synchronizing the routing table to each deduplication node;
the deduplication node storing, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks represented by the fingerprint information.
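The assignment and synchronization steps of claim 1 can be illustrated with a minimal Python sketch. The claim does not say what the preset strategy is, so round-robin assignment is used here purely as a stand-in; the names `RoutingTable`, `assign_buckets`, and `synchronize`, and the `version` field, are hypothetical and not taken from the patent text.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    version: int                                   # assumed field, used to detect stale copies
    bucket_to_node: dict = field(default_factory=dict)


def assign_buckets(num_buckets, nodes, version=1):
    """Assign every Bucket to a node; round-robin stands in for the 'preset strategy'."""
    if not nodes:
        raise ValueError("at least one deduplication node is required")
    table = RoutingTable(version=version)
    for bucket_id in range(num_buckets):
        table.bucket_to_node[bucket_id] = nodes[bucket_id % len(nodes)]
    return table


def synchronize(table, node_tables):
    """Give every deduplication node named in the table its own copy of it."""
    for node in set(table.bucket_to_node.values()):
        node_tables[node] = RoutingTable(table.version, dict(table.bucket_to_node))


table = assign_buckets(8, ["node-a", "node-b", "node-c"])
node_tables = {}
synchronize(table, node_tables)
```

Each node can then decide which fingerprints and data blocks it is responsible for by looking up its own name in its copy of the table.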
2. The data storage method according to claim 1, characterized in that the deduplication node storing, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks represented by the fingerprint information comprises:
the deduplication node creating a corresponding Container (container) file for each Bucket assigned to it;
the deduplication node saving the corresponding fingerprint information in each Bucket assigned to it, and saving the data blocks represented by the fingerprint information in the Container file corresponding to each assigned Bucket.
3. The data storage method according to claim 2, characterized in that:
the deduplication node judges whether the size of the Container file is greater than a preset threshold;
when the size of the Container file is greater than the preset threshold, the deduplication node archives the Container file to a back-end server.
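As a rough illustration of claims 2 and 3, the sketch below keeps one Container per Bucket, appends new data blocks to it, and archives the Container once it grows past a threshold. It is a simplification under stated assumptions: the Container is an in-memory `bytearray` rather than a file, the back-end server is a plain dictionary, and all class and attribute names are hypothetical.

```python
class Container:
    def __init__(self, name):
        self.name = name
        self.data = bytearray()          # stand-in for the Container file contents


class BucketStore:
    def __init__(self, bucket_id, archive_threshold):
        self.bucket_id = bucket_id
        self.archive_threshold = archive_threshold
        self.index = {}                  # fingerprint -> (container name, offset, size)
        self.archived = {}               # container name -> bytes; stand-in for the back-end server
        self._seq = 0
        self.current = self._new_container()

    def _new_container(self):
        self._seq += 1
        return Container(f"bucket{self.bucket_id}-container{self._seq}")

    def put(self, fingerprint, block):
        """Store a block once; return False if the fingerprint is already known."""
        if fingerprint in self.index:
            return False
        offset = len(self.current.data)
        self.current.data += block
        self.index[fingerprint] = (self.current.name, offset, len(block))
        # Archive the Container once it exceeds the preset threshold.
        if len(self.current.data) > self.archive_threshold:
            self.archived[self.current.name] = bytes(self.current.data)
            self.current = self._new_container()
        return True


store = BucketStore(bucket_id=0, archive_threshold=8)
store.put("fp1", b"hello")
store.put("fp2", b"world!")
```

After the second `put`, the first Container holds 11 bytes, exceeds the threshold of 8, and is handed to the archive; later blocks go to a fresh Container.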
4. The data storage method according to claim 1, characterized in that the central node assigning each Bucket to a corresponding deduplication node according to a preset strategy comprises:
the central node assigning each Bucket to a plurality of corresponding deduplication nodes, and determining one master node and at least one standby node among the plurality of corresponding deduplication nodes.
5. The data storage method according to claim 4, characterized in that:
the central node judges whether each deduplication node is available, or whether a new deduplication node has been added;
when it is judged that a deduplication node is unavailable, or when a new deduplication node is added, the central node reassigns each Bucket;
the central node updates the routing table and synchronizes it to each deduplication node;
the deduplication nodes perform data migration according to the updated routing table.
6. The data storage method according to claim 5, characterized in that the deduplication nodes performing data migration according to the updated routing table comprises:
the master node initiating the data migration according to the updated routing table.
7. The data storage method according to claim 5, characterized in that, when it is judged that a deduplication node is unavailable, the central node reassigning each Bucket comprises:
when it is judged that the master node is unavailable, the central node redetermining a master node from the at least one standby node;
and the deduplication nodes performing data migration according to the updated routing table comprises:
the redetermined master node initiating the data migration according to the updated routing table.
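The failover path of claims 4 to 7 can be sketched as follows. This is an assumption-laden illustration: the routing entry layout (`"master"` and `"standbys"` keys), the rule of promoting the first available standby, and the function name `promote_standbys` are all hypothetical. Re-filling the standby list and the migration itself are left out.

```python
def promote_standbys(routing, available):
    """Return (updated routing, migrations) after some nodes became unavailable.

    routing maps bucket -> {"master": node, "standbys": [node, ...]}.
    migrations lists (bucket, new master) pairs; the new master is the node
    that would initiate data migration for that Bucket.
    """
    updated, migrations = {}, []
    for bucket, entry in routing.items():
        master, standbys = entry["master"], list(entry["standbys"])
        if master not in available:
            candidates = [node for node in standbys if node in available]
            if not candidates:
                raise RuntimeError(f"no available node for bucket {bucket}")
            master = candidates[0]           # promote the first available standby
            standbys.remove(master)
            migrations.append((bucket, master))
        updated[bucket] = {"master": master, "standbys": standbys}
    return updated, migrations


routing = {
    0: {"master": "a", "standbys": ["b", "c"]},
    1: {"master": "b", "standbys": ["a"]},
}
updated, migrations = promote_standbys(routing, available={"b", "c"})
```

Here node `a` has failed, so Bucket 0 is handed to `b`, while Bucket 1 keeps its master and only loses a usable standby.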
8. The data storage method according to claim 1, characterized in that each deduplication node comprises a fingerprint database, the fingerprint database being stored in a cuckoo hash table on a solid-state drive and including the fingerprint information corresponding to each Bucket of the deduplication node and the storage information of the data blocks represented by the fingerprint information.
9. The data storage method according to claim 8, characterized in that M cuckoo hash tables run simultaneously on the solid-state drive and N cuckoo hash functions are used simultaneously, where M × N = 128.
10. The data storage method according to claim 9, characterized in that 32 cuckoo hash tables run simultaneously on the solid-state drive and 4-way cuckoo hash functions are used simultaneously.
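To make the 32-table, 4-way arrangement of claim 10 more concrete, here is a small in-memory sketch of cuckoo hashing in Python. It does not model the solid-state drive layout at all. The slot count, the salted SHA-256 hash functions, the random eviction policy, the overflow stash, and the rule for picking one of the 32 tables are all assumptions made for illustration only.

```python
import hashlib
import random


class CuckooTable:
    def __init__(self, num_slots=64, num_hashes=4, max_kicks=100, seed=0):
        self.slots = [None] * num_slots      # each slot holds (key, value) or None
        self.num_hashes = num_hashes
        self.max_kicks = max_kicks
        self.stash = {}                      # overflow area for entries that found no slot
        self._rng = random.Random(seed)

    def _positions(self, key):
        # One candidate slot per hash function, derived from a salted SHA-256 digest.
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:8], "big") % len(self.slots)
            for i in range(self.num_hashes)
        ]

    def get(self, key):
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        return self.stash.get(key)

    def put(self, key, value):
        if key in self.stash:
            self.stash[key] = value
            return
        for pos in self._positions(key):     # overwrite if the key is already stored
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                self.slots[pos] = (key, value)
                return
        entry = (key, value)
        for _ in range(self.max_kicks):
            positions = self._positions(entry[0])
            for pos in positions:
                if self.slots[pos] is None:
                    self.slots[pos] = entry
                    return
            # All candidates are full: evict one occupant and re-place it next round.
            victim = self._rng.choice(positions)
            entry, self.slots[victim] = self.slots[victim], entry
        self.stash[entry[0]] = entry[1]      # give up kicking; keep the entry in the stash


# 32 tables x 4 hash functions = 128; the first fingerprint byte picks the table.
tables = [CuckooTable() for _ in range(32)]


def table_for(fp):
    return tables[fp[0] % len(tables)]


fingerprints = [hashlib.sha1(str(i).encode()).digest() for i in range(200)]
for i, fp in enumerate(fingerprints):
    table_for(fp).put(fp, ("container-1", i * 4096, 4096))
```

A lookup touches at most four slots (plus the stash) in a single table, which is the property that makes cuckoo hashing attractive when each probe is a storage read.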
11. A data read-write method, characterized by comprising:
splitting data into a plurality of data blocks and calculating the fingerprint information of each data block;
determining the Bucket corresponding to the fingerprint information of each data block;
determining, according to a routing table obtained from a central node, the deduplication node corresponding to the Bucket;
sending a fingerprint query request to the deduplication node corresponding to the Bucket, the fingerprint query request including the fingerprint information of the data blocks;
receiving the fingerprint information that was not found, as returned by the deduplication node corresponding to the Bucket;
uploading the fingerprint information that was not found and the data blocks it represents to the deduplication node corresponding to the Bucket.
12. The method according to claim 11, characterized in that determining the Bucket corresponding to the fingerprint information of each data block comprises:
performing a modulo operation on the fingerprint information and the total number of Buckets, and determining the Bucket corresponding to the fingerprint information according to the result of the modulo operation.
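The client-side steps of claims 11 and 12 up to the query request can be sketched in Python as below. Fixed-size splitting and SHA-1 fingerprints are assumptions chosen for brevity (the claims fix neither), and the function names and the shape of the `routing` dictionary are hypothetical.

```python
import hashlib


def split_into_blocks(data, block_size):
    """Fixed-size splitting; the claims do not prescribe a chunking scheme."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block):
    """SHA-1 hex digest used as the fingerprint (an assumption)."""
    return hashlib.sha1(block).hexdigest()


def bucket_for(fp_hex, total_buckets):
    """Fingerprint modulo the total number of Buckets, as in claim 12."""
    return int(fp_hex, 16) % total_buckets


def plan_queries(data, block_size, total_buckets, routing):
    """Group the fingerprints of all blocks by the node responsible for their Bucket."""
    queries = {}
    for block in split_into_blocks(data, block_size):
        fp = fingerprint(block)
        node = routing[bucket_for(fp, total_buckets)]
        queries.setdefault(node, []).append(fp)
    return queries


routing = {bucket: f"node-{bucket % 2}" for bucket in range(16)}
blocks = split_into_blocks(b"abcdefghij", 4)
queries = plan_queries(b"abcdefghij", 4, 16, routing)
```

Each entry of `queries` corresponds to one fingerprint query request; the node's reply would list the fingerprints it does not yet hold, and only those blocks need to be uploaded.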
13. The method according to claim 11, characterized in that the method further comprises:
when the fingerprint information that was not found and the data blocks it represents have all been uploaded, uploading a mapping file of the data to the deduplication node, the mapping file including the fingerprint information of each data block of the data, arranged in the order in which the data blocks were split.
14. The method according to claim 13, characterized in that uploading the mapping file of the data to the deduplication node comprises:
splitting the mapping file into a plurality of data blocks and calculating the hash value of each data block of the mapping file;
determining the Bucket corresponding to the hash value of each data block of the mapping file;
determining, according to the routing table, the deduplication node corresponding to the Bucket that corresponds to the hash value of each data block of the mapping file;
uploading the data blocks of the mapping file and their corresponding hash values to the deduplication node corresponding to the Bucket that corresponds to the hash value of each data block of the mapping file.
15. The method according to claim 14, characterized in that splitting the mapping file into a plurality of data blocks comprises:
splitting the header information of the mapping file off as the first data block of the plurality of data blocks, the header information of the mapping file including the size of the mapping file and the total number of the plurality of data blocks.
16. The method according to claim 13, characterized in that the method further comprises:
obtaining the mapping file of the data from the deduplication node;
obtaining each data block of the data from the deduplication node according to the fingerprint information in the mapping file;
splicing the data together according to the order of the fingerprint information of each data block in the mapping file.
17. The method according to claim 16, characterized in that obtaining the mapping file of the data from the deduplication node comprises:
obtaining each data block of the mapping file from the deduplication node according to the name of the mapping file and the data block sequence numbers;
splicing the data blocks of the mapping file into the mapping file of the data.
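The role of the mapping file in claims 13 and 16 can be shown with a short sketch: writing records the fingerprints in splitting order, and reading concatenates the blocks in that same order. The mapping file is modeled as a plain list and the deduplication node as a dictionary; both, along with SHA-1 and fixed-size splitting, are assumptions for illustration.

```python
import hashlib


def write(data, block_size, block_store):
    """Store each distinct block once and return the ordered fingerprint list (the mapping)."""
    mapping = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fp = hashlib.sha1(block).hexdigest()
        block_store.setdefault(fp, block)    # duplicate blocks are not stored again
        mapping.append(fp)
    return mapping


def restore(mapping, block_store):
    """Fetch blocks by fingerprint and splice them in mapping-file order."""
    return b"".join(block_store[fp] for fp in mapping)


block_store = {}
original = b"abcdabcdxy"
mapping = write(original, 4, block_store)
restored = restore(mapping, block_store)
```

The input contains the block `abcd` twice, so the mapping has three entries while the store holds only two blocks.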
18. The method according to claim 11, characterized in that determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket comprises:
when storing data for the first time, obtaining the routing table from the central node;
determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket.
19. The method according to claim 18, characterized in that determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket further comprises:
sending a request packet to the deduplication node corresponding to the Bucket;
receiving a response packet returned by the deduplication node corresponding to the Bucket, the response packet including the version information of a routing table;
judging whether the version information of the routing table in the response packet is the same as the version information of the routing table obtained from the central node;
when the version information of the routing table in the response packet is the same as the version information of the routing table obtained from the central node, determining the deduplication node corresponding to the Bucket according to the routing table obtained from the central node;
when the version information of the routing table in the response packet differs from the version information of the routing table obtained from the central node, obtaining an updated routing table from the central node, and redetermining the deduplication node corresponding to the Bucket according to the updated routing table.
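The version check of claims 18 and 19 might look like the following sketch. Network packets are replaced by a dictionary `node_versions` that maps each node to the routing table version it would report in its response packet; that dictionary and the class and method names are hypothetical.

```python
class CentralNode:
    def __init__(self, version, table):
        self.version = version
        self.table = dict(table)

    def get_routing_table(self):
        return self.version, dict(self.table)


class Client:
    def __init__(self, central):
        self.central = central
        # On the first store, fetch the routing table from the central node.
        self.version, self.table = central.get_routing_table()

    def node_for(self, bucket, node_versions):
        node = self.table[bucket]
        # node_versions[node] stands in for the version carried in the response packet.
        if node_versions[node] != self.version:
            # Versions differ: refresh from the central node and look the Bucket up again.
            self.version, self.table = self.central.get_routing_table()
            node = self.table[bucket]
        return node


central = CentralNode(version=1, table={0: "node-a"})
client = Client(central)

# The central node reassigns Bucket 0 and bumps the version.
central.version, central.table = 2, {0: "node-b"}
node_versions = {"node-a": 2, "node-b": 2}
chosen = client.node_for(0, node_versions)
```

The client keeps using its cached table until a node reports a different version, which avoids contacting the central node on every request.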
20. A data read-write method, characterized by comprising:
a central node sending a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes;
a deduplication node receiving a fingerprint query request from the client, the fingerprint query request including fingerprint information corresponding to a Bucket assigned to the deduplication node;
the deduplication node querying the fingerprint information and returning the fingerprint information that was not found to the client;
the deduplication node receiving, from the client, the fingerprint information that was not found and the data blocks it represents.
21. The method according to claim 20, characterized in that the method further comprises:
the deduplication node saving the fingerprint information that was not found in the assigned Bucket, and saving the data blocks in the Container file corresponding to the assigned Bucket;
the deduplication node returning to the client a message indicating that the data blocks were saved successfully.
22. The method according to claim 21, characterized in that, before the deduplication node returns to the client the message indicating that the data blocks were saved successfully, the method further comprises:
the deduplication node backing up the fingerprint information that was not found and the data blocks it represents to a standby node.
23. The method according to claim 21, characterized in that the method further comprises:
the deduplication node saving the data blocks of the mapping file uploaded by the client and their corresponding hash values.
24. The method according to claim 23, characterized in that saving the data blocks of the mapping file uploaded by the client and their corresponding hash values comprises:
saving the data blocks of the mapping file in the Container file corresponding to the corresponding Bucket;
saving the hash values and the first storage information of the data blocks of the mapping file in the corresponding Bucket.
25. The method according to claim 24, characterized in that the first storage information includes: the name of the Container file in which the data block of the mapping file is saved, the offset of the data block of the mapping file within the Container file, and the size of the data block of the mapping file.
26. The method according to claim 23, characterized in that the method further comprises:
the deduplication node receiving a request from the client to obtain the data blocks of the mapping file;
the deduplication node sending the data blocks of the mapping file to the client;
the deduplication node receiving a request from the client to obtain the data block represented by each piece of fingerprint information in the mapping file;
the deduplication node sending the data block represented by each piece of fingerprint information to the client.
27. The method according to claim 26, characterized in that the deduplication node sending the data block represented by each piece of fingerprint information to the client comprises:
the deduplication node determining second storage information of the data block according to the fingerprint information, the second storage information including the name of the Container file in which the data block is saved, the offset of the data block within the Container file, and the size of the data block;
the deduplication node judging, according to the name of the Container file, whether the Container file has been archived to the back-end server;
when the Container file has been archived to the back-end server, the deduplication node obtaining the data block from the back-end server according to the offset of the data block within the Container file and the size of the data block, and sending it to the client;
when the Container file is still stored locally, the deduplication node obtaining the data block locally according to the offset of the data block within the Container file and the size of the data block, and sending it to the client.
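The read path of claim 27 reduces to a lookup by Container name followed by a slice by offset and size. The sketch below assumes Containers are available as byte strings keyed by name and that the set of archived Container names is known; the function and parameter names are hypothetical.

```python
def read_block(storage_info, local_containers, backend_containers, archived_names):
    """Return the data block described by (container name, offset, size)."""
    name, offset, size = storage_info
    if name in archived_names:
        container = backend_containers[name]   # Container was archived to the back-end server
    else:
        container = local_containers[name]     # Container is still stored locally
    return container[offset:offset + size]


local_containers = {"container-2": b"xyz"}
backend_containers = {"container-1": b"helloworld!"}
archived_names = {"container-1"}

archived_block = read_block(("container-1", 5, 6), local_containers, backend_containers, archived_names)
local_block = read_block(("container-2", 0, 2), local_containers, backend_containers, archived_names)
```

Because the storage information carries the offset and size, only the requested range has to be read, regardless of where the Container currently lives.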
28. The method according to claim 20, characterized in that the central node sending the routing table to the client comprises:
when the client stores data for the first time, the central node receiving a request from the client to obtain the routing table;
the central node sending the routing table to the client.
29. The method according to claim 28, characterized in that the central node sending the routing table to the client further comprises:
the deduplication node receiving a request packet from the client;
the deduplication node sending a response packet to the client, the response packet including the version information of the routing table saved by the deduplication node;
when the version information of the routing table saved by the client is inconsistent with the version information of the routing table saved by the deduplication node, the central node receiving a routing table request from the client;
the central node sending the updated routing table to the client.
30. The method according to claim 20, characterized in that the deduplication node querying the fingerprint information and returning the fingerprint information that was not found to the client comprises:
the deduplication node judging, by means of a Bloom filter, whether the fingerprint information exists;
when the Bloom filter indicates that the fingerprint information does not exist, determining that the fingerprint information is fingerprint information that was not found;
when the Bloom filter indicates that the fingerprint information exists, querying the fingerprint database to check whether the fingerprint information exists;
when the fingerprint information is found in the fingerprint database, determining that the fingerprint information already exists;
when the fingerprint information is not found in the fingerprint database, determining that the fingerprint information is fingerprint information that was not found.
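The two-stage lookup of claim 30 can be sketched as follows: a Bloom filter answers "definitely absent" cheaply, and only possible hits go on to the fingerprint database. The filter size, the number of hash functions, the salted SHA-256 hashing, and the use of a dictionary as the fingerprint database are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def find_missing(fingerprints, bloom, fingerprint_db):
    """Return the fingerprints that are not stored yet."""
    missing = []
    for fp in fingerprints:
        # A negative Bloom answer is final; a positive one must be confirmed in the database.
        if not bloom.might_contain(fp) or fp not in fingerprint_db:
            missing.append(fp)
    return missing


bloom = BloomFilter()
fingerprint_db = {b"fp-1": ("container-1", 0, 4096)}
bloom.add(b"fp-1")
missing = find_missing([b"fp-1", b"fp-2"], bloom, fingerprint_db)
```

Since a Bloom filter never yields false negatives, skipping the database on a negative answer is safe; false positives only cost one extra database lookup.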
31. A data storage system, characterized by comprising a central node and one or more deduplication nodes, wherein:
the central node is configured to assign each Bucket (bucket) to a corresponding deduplication node according to a preset strategy, to create a routing table according to the correspondence between Buckets and deduplication nodes, and to synchronize the routing table to each deduplication node;
the deduplication node is configured to store, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks represented by the fingerprint information.
32. A client for reading and writing data, characterized by comprising:
a splitting and calculation module, configured to split data into a plurality of data blocks and calculate the fingerprint information of each data block;
a Bucket determination module, configured to determine the Bucket corresponding to the fingerprint information of each data block;
a node determination module, configured to determine, according to a routing table obtained from a central node, the deduplication node corresponding to the Bucket;
a request sending module, configured to send a fingerprint query request to the deduplication node corresponding to the Bucket, the fingerprint query request including the fingerprint information of the data blocks;
an information receiving module, configured to receive the fingerprint information that was not found, as returned by the deduplication node corresponding to the Bucket;
a data upload module, configured to upload the fingerprint information that was not found and the data blocks it represents to the deduplication node corresponding to the Bucket.
33. A system for reading and writing data, characterized by comprising a central node and a deduplication node, wherein:
the central node is configured to send a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes;
the deduplication node is configured to receive a fingerprint query request from the client, the fingerprint query request including fingerprint information corresponding to a Bucket assigned to the deduplication node; to query the fingerprint information and return the fingerprint information that was not found to the client; and to receive, from the client, the fingerprint information that was not found and the data blocks it represents.
CN201510226830.0A 2015-05-06 2015-05-06 Data-storage system and data read-write method Active CN106201771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510226830.0A CN106201771B (en) 2015-05-06 2015-05-06 Data-storage system and data read-write method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510226830.0A CN106201771B (en) 2015-05-06 2015-05-06 Data-storage system and data read-write method

Publications (2)

Publication Number Publication Date
CN106201771A CN106201771A (en) 2016-12-07
CN106201771B CN106201771B (en) 2019-07-05

Family

ID=57459493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510226830.0A Active CN106201771B (en) 2015-05-06 2015-05-06 Data-storage system and data read-write method

Country Status (1)

Country Link
CN (1) CN106201771B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539950A (en) * 2009-05-08 2009-09-23 成都市华为赛门铁克科技有限公司 Data storage method and device
US9292530B2 (en) * 2011-06-14 2016-03-22 Netapp, Inc. Object-level identification of duplicate data in a storage system
CN102968498B (en) * 2012-12-05 2016-08-10 华为技术有限公司 Data processing method and device

Also Published As

Publication number Publication date
CN106201771A (en) 2016-12-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant