CN103098015B - Storage system

Storage system

Info

Publication number
CN103098015B
CN103098015B (application CN201180043251.2A)
Authority
CN
China
Prior art keywords
data
storage
blocks
block
stored
Prior art date
Legal status
Active
Application number
CN201180043251.2A
Other languages
Chinese (zh)
Other versions
CN103098015A (en)
Inventor
M. Welnicki
J. Szczepkowski
C. Dubnicki
Current Assignee
NEC Corp
Original Assignee
NEC Corp
Priority date
Filing date
Publication date
Application filed by NEC Corp
Publication of CN103098015A
Application granted
Publication of CN103098015B

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0608 Saving storage space on storage systems (via G06F3/0602, specifically adapted to achieve a particular effect)
    • G06F3/0641 De-duplication techniques (via G06F3/0628 making use of a particular technique; G06F3/0638 organizing, formatting or addressing of data; G06F3/064 management of blocks)
    • G06F3/0689 Disk arrays, e.g. RAID, JBOD (via G06F3/0668 adopting a particular infrastructure; G06F3/0671 in-line storage system; G06F3/0683 plurality of storage devices)

Abstract

A storage system includes a data storage control unit that stores, in a distributed manner across a plurality of storage devices, a plurality of units of block data generated by dividing storage target data, and that performs duplicate storage elimination. The data storage control unit stores, in a particular storage device among the plurality of storage devices, a plurality of consecutive units of the block data generated by dividing the storage target data; stores, in the particular storage device, feature data based on the content of the block data in association with storage position information representing the position at which the block data is stored in the particular storage device, as a storage position specification table; and stores storage device identification information identifying the particular storage device in association with the feature data of the block data stored in that device, as a storage device specification table.

Description

Storage system
Technical field
The present invention relates to a storage system, and in particular to a storage system having a function of eliminating duplicated storage (deduplication).
Background technology
Deduplication of secondary storage has recently attracted wide attention in both research and commercial applications. By identifying identical blocks in the data and storing only a single copy of each such block, deduplication dramatically reduces storage capacity requirements. Past results have shown that backup data contains substantial duplication. This is not surprising, since successive backups of the same system are usually very similar.
Deduplicating storage systems differ along several dimensions. Some systems deduplicate only identical files, while others divide files into smaller blocks and deduplicate those blocks. The present invention focuses on block-level deduplication, because backup applications typically aggregate the individual files of the backed-up file system into large tar-like archives, so file-level deduplication would yield little space reduction.
Blocks can have a fixed or a variable size; variable-sized blocks are typically produced by content-defined chunking. Using content-defined, variable-sized blocks has been shown to significantly improve deduplication effectiveness.
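For illustration, the following is a minimal content-defined chunking sketch in Python. It uses a simple polynomial rolling hash over a fixed window; the window size, boundary mask, and size limits are assumptions made for this example, not parameters disclosed by the patent.

    PRIME, MOD = 31, 1 << 32
    WINDOW = 48                      # bytes in the rolling window (assumed)
    MASK = (1 << 13) - 1             # boundary on average every 8 KiB (assumed)
    POW = pow(PRIME, WINDOW, MOD)    # contribution of the byte leaving the window

    def chunks(data: bytes, min_size=2048, max_size=65536):
        """Yield variable-sized chunks whose boundaries depend only on local content."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = (h * PRIME + b) % MOD
            if i - start >= WINDOW:
                h = (h - POW * data[i - WINDOW]) % MOD   # slide the window forward
            size = i + 1 - start
            if size >= min_size and ((h & MASK) == MASK or size >= max_size):
                yield data[start:i + 1]                  # content-defined boundary
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]                           # trailing chunk

Because a boundary depends only on the bytes in the window, inserting data near the beginning of a stream shifts block contents locally but leaves most later boundaries, and hence most blocks, unchanged; this is what makes variable-sized chunking effective for deduplication.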
Most systems eliminate only identical blocks, while some require stored blocks merely to be similar and store the differences efficiently. Although this can improve deduplication effectiveness, it requires reading the earlier block from disk, which makes it difficult to deliver high write throughput. The present invention therefore focuses on deduplication of identical blocks.
(Overview of deduplicating storage)
A backup storage system is typically fed long data streams created by a backup application, usually archive files or virtual tape images. Each stream is divided into blocks, and a secure hash (for example SHA-1) is computed for each block. These hash values are then compared with the hash values of blocks previously stored in the system. Since finding a hash collision of a secure hash function is extremely unlikely, blocks with identical hash values can be assumed to be identical (so-called compare-by-hash). Thus, if a block with an identical hash is found, the new block is considered a duplicate and is not stored. The identifiers of all blocks forming a data stream are retained so that the original stream can be reconstructed on read.
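A minimal sketch of this compare-by-hash write path, with an in-memory dictionary standing in for the system's block store and index (names and structure are assumptions for the example):

    import hashlib

    storage: dict[bytes, bytes] = {}   # hash -> block; in-memory stand-in for the block store
    recipe: list[bytes] = []           # identifiers of all blocks forming the stream, in order

    def write_block(block: bytes) -> None:
        h = hashlib.sha1(block).digest()
        if h not in storage:           # compare by hash: equal digests assumed to mean equal blocks
            storage[h] = block         # unique block: store the single copy
        recipe.append(h)               # a duplicate contributes only its identifier

    def read_stream() -> bytes:
        """Reconstruct the original data stream from the stored recipe."""
        return b"".join(storage[h] for h in recipe)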
Reference list
Non-patent literature
NPL 1: Dubnicki, C., Gryz, L., Heldt, L., Kaczmarczyk, M., Kilian, W., Strzelczak, P., Szczepkowski, J., Ungureanu, C., and Welnicki, M. HYDRAstor: a Scalable Secondary Storage. In 7th USENIX Conference on File and Storage Technologies (San Francisco, California, USA, February 2009).
NPL 2: Zhu, B., Li, K., and Patterson, H. Avoiding the disk bottleneck in the Data Domain deduplication file system. In FAST '08: 6th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2008), USENIX Association, pp. 1-14.
NPL 3: Birk, Y. Random RAIDs with selective exploitation of redundancy for high performance video servers, pp. 671-681.
NPL 4: Ungureanu, C., Aranya, A., Gokhale, S., Rago, S., Atkin, B., Bohra, A., Dubnicki, C., and Calkowski, G. HydraFS: A high-throughput file system for the HYDRAstor content addressable storage system. In FAST '10: Proceedings of the 8th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2010), USENIX Association, pp. 225-239.
NPL 5: Dubnicki, C., Ungureanu, C., and Kilian, W. FPN: A Distributed Hash Table for Commercial Applications. In Proceedings of the Thirteenth International Symposium on High-Performance Distributed Computing (HPDC-13) (Honolulu, Hawaii, June 2004), pp. 120-128.
NPL 6: Ben-Or, M. Another advantage of free choice (extended abstract): Completely asynchronous agreement protocols. In PODC '83: Proceedings of the Second Annual ACM Symposium on Principles of Distributed Computing (New York, NY, USA, 1983), ACM, pp. 27-30.
NPL 7: Lamport, L. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (1998), 133-169.
Summary of the invention
Technical problem
(Performance challenges of disk-based deduplication)
To realize a large-scale deduplicating storage system, several major performance challenges must be overcome.
A large system stores too many blocks for their hashes to fit in main memory. A simple on-disk hash index would deliver very poor performance, because index lookups are effectively random reads.
Some systems solve this problem by temporarily storing all incoming blocks and performing deduplication off-line. Since all new blocks are known in advance, hash lookups can be reordered by hash value and performed efficiently in batches. However, off-line deduplication requires a large, high-performance staging area for temporary block storage. In-line systems, by contrast, avoid writing duplicate blocks in the first place, and therefore provide higher write performance when, as is typical, the duplicate ratio is high.
Most systems (such as the one disclosed in NPL 1) address the problem by relying on the stream-locality observation: the duplicate blocks of a subsequent backup typically occur in the same order as in the original backup. By preserving the locality of the backup stream, the hashes of many duplicate blocks can be prefetched efficiently. Non-duplicate blocks can be identified efficiently by using an in-memory Bloom filter, or by accepting approximate deduplication, trading some missed duplicates for better performance.
Another problem is the degradation of stream read performance caused by stream fragmentation. Because duplicate blocks are stored in locations different from those of freshly written blocks, a large sequential read inherently decomposes into many shorter reads. In a system performing exact deduplication this problem is intrinsic: if two streams are stored in the system and one is a random permutation of the other, at least one of them must be served with many small random reads. In practice, the same stream locality that enables efficient deduplication keeps this worst case from occurring. However, since fragmentation typically grows with the age of the system, care must be taken that internal locality is not further reduced by poor data placement.
(Scalable global deduplication)
Centralized systems such as the one described in NPL 2 have limited scalability in system size. Multiple independent systems can be deployed to scale capacity, but this precludes deduplication between them and increases the maintenance burden of assigning backups to isolated storage islands.
Some systems (NPL 1) introduce scalable, global deduplication by distributing blocks to storage nodes based on their hashes. The large block index is thereby spread efficiently over all nodes, each node being responsible for a part of the hash space.
Although this architecture provides scalability and good performance in single-client setups, performance problems can arise when multiple clients read or write simultaneously.
Degradation of stream locality
Because blocks are distributed uniformly over all nodes, each node receives on average a fraction of the input stream that shrinks by the system-size factor. In a large system this causes a significant reduction of stream locality: whatever stream locality was present in the original stream is reduced by the same factor on each node.
Reading back any substantial part of a stream requires the participation of all nodes in the system. If many clients attempt to read back (different) streams simultaneously, they compete for the same resources on every node. To sustain high throughput, each storage node would need a read cache proportional in size to the number of clients; this is known as the buffer explosion problem (NPL 3). The degradation of stream locality compounds the problem by reducing prefetch efficiency. As a result, in a very large system, sequential reads of the original streams degenerate into random reads across the storage nodes.
The same problem applies to deduplication lookups: prefetching the hashes of existing blocks likewise degenerates into random reads. The negative effect is less pronounced for deduplication, however, because hashes are much smaller than block data and fit more easily in a cache of moderate size.
Symmetric network throughput
Because blocks are distributed uniformly to the storage nodes, all nodes receive roughly the same number of blocks from the clients. As the number of clients grows, the required network throughput grows too, to accommodate the writes of all non-duplicate blocks.
As a result, a network with high symmetric point-to-point throughput is necessary for the system to deliver high write throughput. As discussed below, building such a network for a large system is difficult.
Accordingly, an exemplary object of the present invention is to prevent the performance degradation of a storage system with deduplication, which is the problem to be solved as described above.
Solution to problem
According to an aspect of the present invention, a storage system includes a data storage control unit that stores, in a distributed manner across a plurality of storage devices, a plurality of units of block data generated by dividing storage target data, and that, when attempting to store other storage target data having the same content as storage target data already stored in a storage device, performs duplicate storage elimination by treating the already stored data, by reference, as the other storage target data. The data storage control unit stores, in a particular storage device among the plurality of storage devices, a plurality of consecutive units of the block data generated by dividing the storage target data; stores, in the particular storage device, feature data based on the content of the block data in association with storage position information representing the position at which the block data is stored in the particular storage device, as a storage position specification table; and stores storage device identification information identifying the particular storage device in association with the feature data of the block data stored in that device, as a storage device specification table.
According to another aspect of the present invention, a computer-readable medium stores a program including instructions for causing an information processing device to realize a data storage control unit that stores, in a distributed manner across a plurality of storage devices, a plurality of units of block data generated by dividing storage target data, and that, when attempting to store other storage target data having the same content as storage target data already stored in a storage device, performs duplicate storage elimination by treating the already stored data, by reference, as the other storage target data, wherein the data storage control unit stores, in a particular storage device among the plurality of storage devices, a plurality of consecutive units of the block data generated by dividing the storage target data, stores, in the particular storage device, feature data based on the content of the block data in association with storage position information representing the position of the block data in the particular storage device, as a storage position specification table, and stores storage device identification information identifying the particular storage device in association with the feature data of the block data stored in that device, as a storage device specification table.
According to another aspect of the present invention, a data storage method includes storing, in a distributed manner across a plurality of storage devices, a plurality of units of block data generated by dividing storage target data, and, when attempting to store other storage target data having the same content as storage target data already stored in a storage device, performing duplicate storage elimination by treating the already stored data, by reference, as the other storage target data. The method includes storing, in a particular storage device among the plurality of storage devices, a plurality of consecutive units of the block data generated by dividing the storage target data; storing, in the particular storage device, feature data based on the content of the block data in association with storage position information representing the position of the block data in the particular storage device, as a storage position specification table; and storing storage device identification information identifying the particular storage device in association with the feature data of the block data stored in that device, as a storage device specification table.
Beneficial effect of the present invention
With the configuration described above, the present invention can improve the performance of a storage system with deduplication.
Accompanying drawing explanation
Fig. 1 is a table showing the block address types in pointer blocks in the first exemplary embodiment;
Fig. 2 is a chart showing the effect on write bandwidth of the load caused by increasing system size in the first exemplary embodiment;
Fig. 3 is a chart showing the effect on write bandwidth of the load caused by increasing system size in the first exemplary embodiment;
Fig. 4 is a block diagram showing the configuration of the whole system including the storage system of the second exemplary embodiment;
Fig. 5 is a block diagram schematically showing the configuration of the storage system of the second exemplary embodiment;
Fig. 6 is a functional block diagram showing the configuration of an access node of the second exemplary embodiment;
Fig. 7 is an explanatory view illustrating an aspect of the data storage process in the storage system disclosed in Fig. 5;
Fig. 8 is an explanatory view illustrating the same aspect of the data storage process in the storage system disclosed in Fig. 5;
Fig. 9 is an explanatory view illustrating an aspect of data retrieval in the storage system disclosed in Fig. 6;
Fig. 10 is an explanatory view illustrating the same aspect of data retrieval in the storage system disclosed in Fig. 6;
Fig. 11 is a block diagram showing the configuration of the storage system according to supplementary note 1.
Embodiment
<First exemplary embodiment>
The present invention introduces a new architecture for a scalable storage system with global in-line deduplication. By separating the storage of data from the index of duplicates, the proposed system remedies the shortcomings of existing systems: the degradation of restore performance with system size, and the requirement of uniform bandwidth between all nodes.
The first exemplary embodiment is organized as follows. First, the requirements and assumptions under which the system was designed are introduced. Then, the architecture realizing those requirements is described, and the key operations on the proposed data organization are explained. Finally, it is evaluated how the proposed system delivers the required features, and the trade-offs faced during its design are presented.
(Requirements and assumptions)
Before describing the proposed system architecture, we summarize the requirements and assumptions about the environment in which it will operate.
(Overview of storage system requirements)
The main application of the storage system will be backup. To maximize deduplication savings, the storage system will store the backups of many clients. This environment demands high capacity and reliability, and it has some distinctive performance characteristics. Since backups must complete within a short backup window, very high aggregate write throughput is essential. The system is used mostly for writing; data is written much more often than it is read. Reads occur mainly during restores, when the backed-up system has suffered a failure. Because the time to restore such a system is usually critical, reasonably high read throughput is also necessary.
For the above reasons, the deduplication realized by the storage system should meet the following criteria:
block-level,
of identical blocks,
of variable-sized blocks, with block boundaries set by content-defined chunking,
compare-by-hash,
exact,
in-line,
distributed, and
global in scope.
To keep costs low, the system should be built from commodity hardware and should scale to hundreds or thousands of nodes, corresponding to petabytes of raw storage.
Interface
The system must provide an industry-standard backup interface to client machines. In a disk-to-disk backup environment, this is usually a file system exported as NAS (network-attached storage) or a VTL (virtual tape library).
Since the details of the NAS or VTL implementation are irrelevant to the subject of the present invention, we focus on a simpler block store interface, similar to the one described in NPL 1. A file system can be built on top of this block store, as described in NPL 4.
In brief, the block store allows variable-sized blocks of data to be stored. Blocks are immutable, and they can be retrieved by the address generated when the block is stored. Deduplication is achieved by assigning the same address to blocks with identical content.
Special pointer blocks can be used to organize individual data blocks into large data streams. These blocks contain the addresses of the blocks they point to, whether regular data blocks or other pointer blocks. Like regular blocks, pointer blocks are immutable, and identical pointer blocks are deduplicated. A tree of pointer blocks with regular data blocks at the leaves can be constructed to represent a data stream; the address of the root pointer block is sufficient to retrieve the whole stream.
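A sketch of this block store interface, assuming content-derived SHA-1 addresses and an in-memory store; the names put, put_stream, and get_stream are hypothetical and used only for this example:

    import hashlib

    store: dict[bytes, tuple[str, bytes]] = {}       # address -> (kind, payload)

    def put(kind: str, payload: bytes) -> bytes:
        """Store an immutable block; identical content yields an identical address (deduplication)."""
        addr = hashlib.sha1(kind.encode() + payload).digest()
        store.setdefault(addr, (kind, payload))
        return addr

    def put_stream(blocks: list[bytes]) -> bytes:
        """Build a one-level pointer-block tree over data blocks and return the root address."""
        addrs = [put("data", b) for b in blocks]
        return put("ptr", b"".join(addrs))           # pointer blocks hold addresses of other blocks

    def get_stream(addr: bytes) -> bytes:
        kind, payload = store[addr]
        if kind == "data":
            return payload
        children = [payload[i:i + 20] for i in range(0, len(payload), 20)]  # SHA-1 = 20 bytes
        return b"".join(get_stream(c) for c in children)   # the root address retrieves the stream

Deeper trees are built the same way, with pointer blocks pointing to other pointer blocks; identical subtrees deduplicate automatically because identical content produces identical addresses.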
(Network model)
The storage system needs an internal network that scales to the required large capacity and that connects the data sources, i.e., the client backup machines. The network must provide high throughput both between the nodes of the storage system and on the links to the data sources.
As the system grows, building a large network with high aggregate throughput between all nodes becomes difficult and expensive. Traditionally, networks in large data centers are built hierarchically: individual machines are connected by first-level switches (for example 1 Gbit), first-level switches are connected by faster second-level switches (for example 10 Gbit), and so on. The links between switches need to be faster to provide reasonable aggregate throughput, which increases network hardware cost when faster interconnects are used, or cabling complexity when multiple physical links are bonded.
In small systems, of course, no hierarchy arises: all nodes can be connected to the same first-level switch, and the same high throughput is available between all nodes. Moreover, given sufficient resources, even a large network with high aggregate throughput can be built from commodity networking hardware.
The storage system should therefore be able to adapt to both:
a hierarchical network with high intra-switch throughput but lower throughput between aggregation switches, and
a symmetric network in which full cross-section bandwidth is available between any two nodes.
(Client performance limitations)
The data written to or read from the storage system must ultimately pass through a client machine (backup server). Each client backup server has limited resources for producing and consuming data; its disks or its network connection become the bottleneck.
It is therefore unnecessary for the storage system to provide high throughput for a single stream; a small number (say, a dozen) of storage nodes can easily exceed the resources of a single client machine. However, when multiple streams are read or written simultaneously by multiple client machines, the system should still deliver good aggregate performance.
(Architecture)
(Overview)
The storage system proposed in the present invention is composed of nodes of the following types:
access nodes, which act as gateways to the system and to which client machines connect,
storage nodes, which store the bulk of the data, and
index nodes, which are responsible for identifying and locating duplicates.
Nodes of different functions can optionally be combined on the same physical machine, if hardware considerations (such as power consumption, cooling, or data center floor space) make this worthwhile.
To meet the above requirements, the present invention proposes a storage system that realizes the following design goals.
Locality-preserving storage
Sequences of non-duplicate blocks belonging to one stream are stored close together on a small subset of the storage nodes. This preserves the stream-based locality described above, allowing efficient sequential reads during restore. It is also important for deduplication performance, since it enables efficient prefetching of the hashes of duplicate blocks.
This approach differs from the earlier in-line global deduplicating systems described in NPL 1. Those systems combine the duplicate index with block storage and force blocks to be distributed uniformly over the whole system. Although they also try to preserve stream locality within each storage node, the initial dispersal reduces its effectiveness.
Hash-based global index
Since the storage node to which a block is written no longer depends on the block's hash, a separate block index must be maintained. This index is partitioned over all index nodes in the system based on block hashes. Hashing is appropriate here because there is no locality in the hash space anyway, and it provides good scalability, parallelism, and load balancing.
Storage capacity balancing
Preserving stream locality is meaningful only up to some maximum run length, determined by the efficiency of sequential disk access. After enough sequential blocks have accumulated in one location, further blocks can be stored elsewhere. Consequently, the nodes to which the non-duplicate blocks of a given stream are written change over time. This helps maintain good capacity balance and prevents some storage nodes from filling up faster than others.
Asymmetric network performance
Because data location is not determined by block hashes, the proposed system is free to keep data on storage nodes close to the client devices writing it. In an asymmetric network this can greatly improve write bandwidth by avoiding data transfers across the higher-level switches and the associated network throughput bottlenecks. Deduplication queries still need to be sent uniformly to all nodes in the network, but they are much smaller and do not need substantial bandwidth. The following is a description of the logical components that form the system.
(Front end)
The front end exports a file system, VTL, or similar image to the clients. It chunks the incoming write streams into variable-sized blocks and submits them for deduplication and storage. It is hosted on the access nodes. This part of the system can be identical to its counterpart in the HYDRAstor system described in NPL 1.
(DHT overlay network)
A distributed hash table combined with distributed consensus is used to implement the overlay network layer. The DHT is the basis of the system's scalability. The overlay network provides:
virtualization of object location, allowing an efficient mapping of logical objects to physical machines in the face of failures and system reconfiguration,
failure detection and tolerance,
load balancing (assuming objects are distributed uniformly in the DHT's key space), and
propagation and maintenance of a small amount of system-wide state (the global state).
(FPN with supernodes)
The DHT used in the present invention is a Fixed Prefix Network (NPL 5) with supernodes. Its use within a storage system has been described in NPL 1; here we only summarize the functions of the overlay in the context of this system.
The overlay network maps keys (hashes) to the set of nodes responsible for those keys. It is organized into supernodes, each consisting of a constant number of supernode components. The supernode components are hosted on the physical nodes (in this case, the index and storage nodes). The number of components per supernode, the supernode cardinality (SNC), is fixed for a given instance of the FPN. Components that are members of the same supernode are called peers.
Each supernode is responsible for a part of the hash key space; the hash space is partitioned between the supernodes so that the whole space is covered and no two supernodes overlap in responsibility.
Node failures are handled within each supernode: the components of a given supernode continuously ping each other to detect failures and to spread state changes. When a node fails, the components it hosted are recovered by the remaining peers.
A distributed consensus algorithm, as described in NPL 6 or NPL 7, is used to guarantee that all components have a consistent view of the supernode membership. To maintain a quorum for consensus, more than half of the SNC components of each supernode must be alive at all times. This also prevents a network partition from causing "split-brain" operation.
The FPN also provides a degree of load balancing. It attempts to spread components over the physical machines in proportion to the resources available on them. The underlying premise is that each supernode receives roughly the same load (in used capacity and in requests per second). Peer components are also kept off the same physical node to improve fault tolerance.
A different DHT could easily be used in place of the FPN, as long as it can be extended to provide fault tolerance and global state broadcast. The use of an FPN with supernodes in the present invention is motivated by its successful use in the HYDRAstor system.
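A sketch of the prefix-based key-to-supernode mapping, with an assumed prefix length; real FPN routing, membership management, and consensus are omitted:

    def supernode_for(hash_key: bytes, prefix_bits: int) -> int:
        """Map a hash key to the supernode responsible for it: the top prefix_bits of the key."""
        key = int.from_bytes(hash_key[:8], "big")    # the 64 high-order bits suffice here
        return key >> (64 - prefix_bits)

    # With prefix_bits = 4 there are 16 supernodes covering the key space without overlap;
    # each supernode is realized as SNC peer components hosted on distinct physical nodes.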
(Data and index FPN)
There are two separate DHT instances in the system:
the "data FPN", which maps logical data locations to the storage nodes responsible for storing them. Data FPN components are hosted on the storage nodes. This mapping virtualizes data location: a logical location does not change on system reconfiguration or failure, even if the storage nodes hosting the data change. The data FPN is described in detail later.
the "index FPN", which maps a block hash to the index node maintaining the translation for that hash. The components of this network are placed on the index nodes. It is discussed in detail later.
Using separate FPN networks for index and storage nodes allows these node types to be placed on different hardware. For example, index nodes may need substantial CPU power, RAM, and IOPS, while storage nodes need large storage capacity and high disk and network throughput.
Even when the components of both networks are placed on the same physical machines, load balancing can usually be done independently in each network because, as noted above, they use different resources. Furthermore, the two networks can have different supernode cardinalities (SNC_index and SNC_data, respectively) and can grow independently (without synchronizing FPN splits between them).
(Block store)
(Data organization overview)
All user data stored in the system is kept as blocks managed by the data FPN components. A block is erasure-coded into SNC_data fragments, some original and some redundant. The ratio of original to redundant fragments is determined by the resilience class assigned to the data by the user. On write, a block is assigned to a data FPN supernode; the details of the assignment policy are given later.
(Synchruns and SCCs)
Within a data FPN supernode, stored blocks are grouped into synchruns. Fragments belonging to the same block are placed in corresponding synchrun components: there are SNC_data synchrun components for each synchrun, corresponding to fragment numbers 0 through SNC_data - 1. The synchrun is the atomic unit of data maintenance operations; during background maintenance, blocks never cross synchrun boundaries.
An integral number of synchrun components are grouped into a synchrun component container (SCC); SCCs are stored on the data disks of the storage nodes. SCCs are append-only: once fully written, an SCC becomes immutable, and subsequent maintenance operations can modify it only by rewriting it.
Grouping whole synchrun components into SCCs limits the number of entities a storage node must track, since synchruns shrink as blocks are deleted from the system. By concatenating consecutive synchruns as their sizes decrease, the SCC size is kept close to the original size of one synchrun component (about 64 MB).
Streamruns
A number of consecutive synchruns are grouped into a streamrun. This grouping is static and is determined when the synchruns are allocated. A streamrun corresponds to a run of blocks from the same stream, which should be kept in the same supernode for good locality; streamruns are the unit of storage balancing.
There is a trade-off between locality preservation and capacity balancing quality, which can be controlled through the streamrun size. This trade-off is explored in more detail below.
(Synchrun identification)
Each synchrun is identified by a 64-bit identifier. The synchrun id statically determines the supernode to which the synchrun belongs.
A synchrun id is logically divided into three parts:
the supernode zone prefix,
the streamrun id within this supernode, and
the sequence number within this streamrun.
The number of bits used for the sequence number is fixed; the number of bits interpreted as the supernode prefix grows as the system grows and the data FPN zone prefixes lengthen. Details are described later.
(Block identification and fragment lookup)
Every block stored in the system is assigned a sequence number reflecting its write order within its synchrun. This sequence number, combined with the synchrun id, identifies the block uniquely across the whole system. The pair (synchrun id, block sequence number) is therefore called the unique block address. Such an address is never reused, even if the block is later removed.
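A sketch of the synchrun id layout and the unique block address, with assumed field widths (the text fixes only that the sequence-number width is constant and that the prefix width grows with the system):

    SEQ_BITS = 16                     # fixed width (value assumed for the sketch)
    PREFIX_BITS = 8                   # grows with the system; assumed here
    RUN_BITS = 64 - PREFIX_BITS - SEQ_BITS

    def make_synchrun_id(prefix: int, streamrun: int, seq: int) -> int:
        """Pack (zone prefix, streamrun id, sequence number) into a 64-bit synchrun id."""
        return (prefix << (RUN_BITS + SEQ_BITS)) | (streamrun << SEQ_BITS) | seq

    def split_synchrun_id(sid: int) -> tuple[int, int, int]:
        return (sid >> (RUN_BITS + SEQ_BITS),
                (sid >> SEQ_BITS) & ((1 << RUN_BITS) - 1),
                sid & ((1 << SEQ_BITS) - 1))

    # A unique block address pairs the synchrun id with the block's sequence number within it:
    unique_block_address = (make_synchrun_id(0b01, 3, 0), 42)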
(Write initiator)
Requests to store new blocks in a given supernode always pass through one fixed component of that supernode, the write initiator. The initiator is responsible for assigning unique block identifiers within synchruns and for coordinating write operations with the other components of the supernode and with the index FPN.
(SCC index)
Besides raw fragment data, each SCC stores metadata for the fragments belonging to it. This metadata includes the block's hash, its unique block id, the position of its data within the SCC, the fragment's size, and so on.
This metadata is stored separately from the data, in the SCC index. The SCC index can therefore be read and updated quickly, without skipping over fragment data.
If the position of a fragment within the SCC is known, the metadata of an individual block can also be read directly from the SCC index. Because of block deletions, the unique block id alone cannot determine a fragment's position; an external lookup is required for that.
(Global block index)
The global block index (GBI) is a distributed hash table mapping the hashes of stored blocks to their unique block identifiers (i.e., (synchrun id, block sequence number) pairs). It is implemented on the index FPN.
The hash table is partitioned based on prefixes of the block hash keys. The nodes responsible for storing the entry of a given block hash are those hosting the index FPN components whose region covers that hash. Within an index node, the mapping is stored in an on-disk hash table.
The global block index is fault-tolerant: each region is replicated on all SNC_index components of its supernode.
Because of its size, the index is stored on disk. Updates are buffered in memory and applied in batches in the background. By using in-memory Bloom filters, the index supports cheap queries for non-existent blocks; a query for an existing block requires one random disk read.
(Compact disk index)
Within each storage node, its portion of the global block index is kept in an on-disk data structure called the compact disk index (DCI). The DCI must identify non-duplicate blocks with high performance.
The DCI can be implemented on standard disks as an on-disk hash table with an in-memory Bloom filter for negative (non-duplicate) queries. This is similar to the index described in NPL 2.
In this solution, all updates (translation inserts and removals) are placed in an in-memory buffer to avoid random writes. The on-disk hash table, the write buffer, and the Bloom filter are partitioned into buckets, each bucket corresponding to a part of the key space. When the write buffer starts to fill up, a background sweep processes each bucket in turn (a sketch of one bucket follows this list):
read the bucket of the on-disk hash table,
apply any updates from the write buffer,
rebuild the Bloom filter part for this bucket, and
flush the updated bucket back to disk.
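The sketch below models one DCI bucket under assumed sizes and hash functions: updates are buffered in memory, negative lookups are answered by a per-bucket Bloom-filter slice, and a background flush performs the four steps just listed.

    import hashlib

    class DciBucket:
        """One bucket of the on-disk hash table, its Bloom-filter slice, and its write buffer."""

        def __init__(self, bloom_bits: int = 1 << 16):
            self.entries: dict[bytes, tuple] = {}    # stands in for the on-disk bucket contents
            self.bloom = bytearray(bloom_bits // 8)
            self.buffer: list[tuple] = []            # pending ("insert"/"remove", key, value) updates

        def _positions(self, key: bytes):
            for seed in (b"0", b"1", b"2"):          # three hash functions (assumed)
                yield int.from_bytes(hashlib.sha1(seed + key).digest()[:4], "big") % (len(self.bloom) * 8)

        def maybe_contains(self, key: bytes) -> bool:
            """Cheap negative queries: a miss here proves the block is not a duplicate.
            A real implementation would also consult the write buffer for very recent inserts."""
            return all(self.bloom[p // 8] & (1 << (p % 8)) for p in self._positions(key))

        def flush(self) -> None:
            """Background sweep: read the bucket, apply buffered updates, rebuild its Bloom slice, write back."""
            for op, key, value in self.buffer:
                if op == "insert":
                    self.entries[key] = value
                else:
                    self.entries.pop(key, None)
            self.buffer.clear()
            self.bloom = bytearray(len(self.bloom))
            for key in self.entries:
                for p in self._positions(key):
                    self.bloom[p // 8] |= 1 << (p % 8)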
Alternatively, the index can be stored on flash-based SSDs. This has been studied in recent research and has the potential advantages of substantial power savings and reduced RAM consumption.
To reduce the size of the hash table, the DCI need not store the full keys (block hashes) explicitly. When there is a collision in the hash table, all matching translations are returned. The candidate blocks can then be verified by reading their metadata from the appropriate SCC indexes and checking whether the full block hash matches. If a few additional bits of the key are stored in the DCI, the average number of candidates can be kept close to 1.
(Block index updates)
The global block index is updated after a block has been successfully written to its synchrun, and when a block is removed by the garbage collection process. Because the index node hosting the global block index region of a block is usually different from the storage node actually storing the block, careful synchronization of index updates is necessary.
When a block is written by the write initiator in the data FPN, a translation from its hash key to its (synchrun id, block sequence number) pair is created for it. The translation is sent to the index node hosting the appropriate block index region. There it is stored in the destination index node's translation log, to be written to the DCI in the background. As soon as the translation is persisted in the translation log, the index node replies to the write initiator.
Since translation insert requests may be lost, each write initiator maintains a (persistent) log of the translations it has inserted into the global block index. Insert requests for logged translations are retransmitted periodically until a success reply is received from the index node.
An index node may therefore receive duplicate translation insert requests. Because the (synchrun id, block sequence number) pair is unique for each write, duplicate inserts can be safely discarded. Duplicates are usually detected while the earlier insert is still in the DCI write buffer, but they can also be removed during DCI sweeps.
(Removals)
Translations are removed from the global block index only by garbage collection. In the simplest solution, the whole global block index can be rebuilt from the remaining blocks after garbage collection completes; the more sophisticated solution described below is also feasible.
For the purposes of garbage collection, the system's lifetime is divided into phases called epochs. Every block write is performed within some epoch. The current epoch number is maintained in the global state and is advanced when a garbage collection process starts. The epoch may advance to n+1 only after all blocks from epoch n-1 have been added to the GBI. Garbage collection in epoch n removes only blocks stored up to epoch n-2 (that is, only blocks that are certainly in the GBI).
These phases help avoid races between GBI translation inserts, block removals, and GBI translation removals. GBI insert requests (translation log entries) are timestamped with their epoch; since the receiving index node tolerates duplicates, requests from epochs that are too old are simply discarded. If garbage collection decides to remove a block, a translation removal request is sent for it, also timestamped with the current epoch. If the block was stored again in the meantime, it resides in a different synchrun, and its translation is therefore a different one.
(Hash leases)
A translation is added to the global block index only after its block has been successfully stored in a synchrun. If two or more clients attempt to write an identical block at the same time, this would create a race, and multiple copies of the identical block could be stored.
To prevent this race, clients are required to take out a lease on a block's hash from the global block index before submitting the block for storage. The lease signals to other potential writers that the block is being written and that they should synchronize with the original writer. The lease is returned when the actual translation for that hash is inserted, if the write fails, or if the lease expires (for example because the access node handling the write has stopped responding).
(Translation cache)
The translation cache is an in-memory cache of SCC indexes used to deduplicate against stored blocks efficiently. It exploits the locality of duplicate blocks within a stream: runs of duplicate blocks tend to be rewritten in the same order in which they were originally stored.
The translation cache is located on the access nodes. When determining whether a block is a duplicate, each access node consults its local translation cache. The cache is populated by downloading SCC indexes from the storage nodes hosting them. Since the cache has limited capacity, an SCC index whose translations have not been used recently can be evicted from it.
If the underlying SCC changes, the SCC indexes stored in the translation cache may become stale. Since the contents of the translation cache are always verified at the storage nodes before use, stale entries can simply be discarded from the cache lazily, on verification failure.
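A sketch of the translation cache as an LRU collection of whole SCC indexes; the eviction granularity (entire SCC indexes) follows the text, while the capacity, structure, and method names are assumptions:

    from collections import OrderedDict

    class TranslationCache:
        """Access-node cache of SCC indexes; the least recently useful SCC index is evicted."""

        def __init__(self, max_sccs: int = 128):
            self.sccs: OrderedDict[int, dict] = OrderedDict()  # scc_id -> {hash: (block_id, offset)}
            self.max_sccs = max_sccs

        def lookup(self, block_hash: bytes):
            """Return a cached translation or None; hits must still be verified at the storage node.
            A real cache would keep a hash-to-SCC map beside this to avoid the linear scan."""
            for scc_id, index in self.sccs.items():
                if block_hash in index:
                    self.sccs.move_to_end(scc_id)              # this SCC index was useful: keep it
                    return index[block_hash]
            return None

        def add_scc_index(self, scc_id: int, index: dict) -> None:
            """Install an SCC index downloaded from a storage node, evicting the LRU one if full."""
            self.sccs[scc_id] = index
            self.sccs.move_to_end(scc_id)
            if len(self.sccs) > self.max_sccs:
                self.sccs.popitem(last=False)

        def invalidate(self, scc_id: int) -> None:
            """Drop a stale SCC index after verification at the storage node fails."""
            self.sccs.pop(scc_id, None)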
(Operations)
Next, it is described how normal operations are performed on the data organization presented above.
(Writes and duplicate elimination)
Writes from users are first processed by the front end on the access nodes, where they are chunked into variable-sized blocks and formed into block trees. For each block, its SHA-1 hash key is computed and used to determine whether the block is unique or a duplicate.
(Duplicate blocks)
The block's hash is first looked up in the translation cache. If it is present there, the synchrun and unique block id of the candidate original block are found. Using the synchrun id, a request is sent to the storage node hosting it, to verify that the translation cache entry is not stale and that the block has sufficient resilience for the write to be deduplicated against it. If this verification succeeds, the write operation is complete.
If the block is not found in the translation cache, or verification fails, a query for the block's hash key is sent to the global block index. It is routed through the DHT to the appropriate index node. The global block index is then read, and a set of candidate block locations is returned.
The candidates (in practice, on average only one) are then verified one by one. For each candidate, a request is sent to the storage node hosting the candidate's synchrun. Using the unique block id, the position of the fragment metadata is found by searching the SCC index. The fragment metadata contains the block's hash, which can be compared with the hash of the new block. If they match and the block has sufficient resilience, a duplicate has been found; otherwise the remaining candidates are checked.
When a duplicate block is eliminated, the SCC index of the original block is considered for loading into the translation cache, to speed up subsequent duplicate elimination.
(Unique blocks)
If the translation cache contains no usable entry, the global block index is consulted. Thanks to the Bloom filters, if the block is not yet in the global block index, a negative answer can be returned with high probability without any disk access. If no candidates are found, or all candidate blocks are rejected, the block is unique and will be stored.
An access node maintains an open synchrun for each data stream being written. All new blocks are stored in this synchrun. If there is no open synchrun for a stream, or the previous one has exceeded its capacity, a new synchrun is allocated.
After the open synchrun for the block is selected, the block is erasure-coded into SNC_data fragments, and the fragments are sent to the components of the supernode hosting the open synchrun. One of these components, the write initiator, is responsible for coordinating the write operation: it sends the request to insert the translation for the block being stored into the global block index, collects the confirmations of storage of the SNC_data fragments, and replies to the access node with success or failure.
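For illustration, a single XOR parity fragment stands in for the erasure code below; real codes produce several redundant fragments per resilience class, but the split-and-rebuild shape is the same.

    from functools import reduce

    def encode_fragments(block: bytes, n_original: int) -> list:
        """Split a block into n_original data fragments plus one XOR parity fragment."""
        size = -(-len(block) // n_original)                          # ceiling division
        frags = [block[i * size:(i + 1) * size].ljust(size, b"\0")   # pad the last fragment
                 for i in range(n_original)]
        parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
        return frags + [parity]

    def rebuild_missing(frags: list) -> list:
        """Recover a single missing fragment (None) by XOR-ing the surviving ones."""
        missing = frags.index(None)
        survivors = [f for f in frags if f is not None]
        frags[missing] = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))
        return frags

With this stand-in, any n_original of the n_original + 1 fragments suffice to rebuild the block; stronger codes extend the same idea to tolerate more lost fragments.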
(Synchrun allocation)
New synchruns are always created by the write initiator of the supernode responsible for them. The write initiator knows which streamruns have been allocated and which synchruns of those streamruns have been used in the past, and can thus guarantee that a newly allocated synchrun has a unique id.
An access node needs to allocate a synchrun in two cases:
before writing the first unique block of a new stream, and
when the previous synchrun becomes full.
If the access node already has an open synchrun for the stream, it will usually try to allocate the next synchrun within the same streamrun. Since the streamrun id determines the supernode, the allocation request can be routed through the data FPN to the appropriate write initiator. If the allocation succeeds, the write initiator assigns the next synchrun id and returns it to the access node, which then submits all new writes with this synchrun id. If the allocation fails because the streamrun is exhausted or the supernode is out of space, the access node needs to allocate a new streamrun.
To allocate a new streamrun, the access node first selects a new supernode to host it. The supernode is selected by drawing a random key and sending an allocation request to the write initiator responsible for that key in the data FPN. If the allocation succeeds, the id of the first synchrun of the new streamrun is returned to the access node; otherwise the access node selects another supernode. This basic allocation policy can be modified to provide features such as support for asymmetric networks.
Normally, a separate synchrun is allocated for each client stream. However, since each open synchrun consumes some resources on the storage node side, there is a limit on the number of simultaneously open streams per supernode. If too many streams are written simultaneously, the same synchrun will be shared by multiple streams. The negative effect of such synchrun sharing is that unrelated data is mixed in the same synchrun, reducing the benefits of stream locality. We do not expect the number of simultaneously written streams to be excessive in practice, and therefore do not optimize for this case.
(Simultaneous writes of duplicate blocks)
If multiple access nodes attempt to write an identical block at the same time, multiple copies of that block could be stored. The global block index leases are used to prevent this from happening in practice.
A lease is always taken before a new block is written; it can be obtained automatically when a global block index lookup returns no candidates, or explicitly when all candidates have been rejected. A lease contains the hash of the block being written and the address of the access node writing it.
If an active lease for the requested hash is found during a global block index lookup, a notification is returned that another access node is writing the identical block at this very moment. The subsequent writer then contacts the original access node and waits until the original block has been written.
A lease can be released when the translation for the same hash is inserted into the GBI, when the write operation fails (for example due to lack of space), or after a timeout (for example on access node failure). A lease is granted only by a selected component of the index FPN supernode responsible for the block's hash, and only while that component has a quorum within its supernode; otherwise the lease is not granted. This limits the possibility of identical blocks being stored simultaneously to short time windows around index FPN component failures or network partitions.
(Reads)
Depending on which type of address is kept in the pointer blocks (discussed in detail below), a block can be read either by its hash key or by its unique block id. A block can be reconstructed by reading a sufficient number of its fragments. To actually read the data, the offsets of the fragments within their SCCs must first be looked up.
Reading by hash requires an additional step to find the unique block id; as in deduplication, this is done by consulting the translation cache and the global block index.
The translation cache on the access node is used to find SCC offsets. If the unique block id is found in the cache, the associated entry contains the data offset. The offset may be stale, so it is verified when the fragment read request is processed on the storage node. If there is no entry for the fragment in the translation cache, the fragment read request is forwarded to the storage node hosting the fragment's synchrun.
The storage node can read the data directly using the offset found in the translation cache. If the offset is unknown or invalid, the SCC index entry must be read. In the common case this needs to be done on only one component, since the fragments of a given block are usually stored at the same offset in all SNC_data SCCs.
As in duplicate elimination, the index of an SCC from which enough fragments are read is downloaded into the translation cache, to speed up future reads.
Only the original fragments need to be read to reconstruct a block. Original fragments are preferred, since reconstructing the raw data from them requires no erasure decoding; however, it may be useful to read some redundant fragments instead, to spread the read requests more uniformly across the disks.
(Failure recovery)
Failures of index and storage nodes are detected by the appropriate FPN layer. The FPN components hosted on a failed node are recreated (using consensus) on different index/storage nodes, selected so as to keep the number of components per node well balanced.
When a component's location changes, all data associated with that component (synchruns or global block index entries, respectively) is transferred from its previous location or reconstructed from its peer components. This reconstruction proceeds in the background.
In the index FPN, entries are replicated, so global block index translations can simply be copied. In the data FPN, lost SCCs are reconstructed by reading the remaining fragments, reconstructing the original blocks, re-encoding the lost fragments, and writing them at the new component's location.
Because of load balancing, the recovered components are typically spread over multiple nodes. Data reconstruction therefore writes to multiple nodes in parallel, yielding high reconstruction performance and a quick return to the desired resilience class.
(Deletion and space reclamation)
Deletion of blocks is accomplished with a distributed garbage collection process. The same global algorithm described in NPL 1 can be adapted to this system.
Distributed garbage collection
In general, a reference counter is maintained in the SCC index for each block. A block's reference counter is the number of pointer blocks referring to that block.
Counter values are changed only by the periodic garbage collection process. Garbage collection runs in multiple phases, globally synchronized using the global state mechanism.
In the first phase, all new pointer blocks written since the last garbage collection are processed, and counter increment requests are sent to the storage nodes hosting the pointed-to blocks. After all blocks have been processed, the reference counter updates are sorted by unique block id and applied in batches to all blocks in a given SCC. Then, pointer blocks whose reference counter is 0 are identified. Since these blocks will be removed, counter decrement requests are sent for all blocks they point to. The reference counter updates are applied again, and if more pointer blocks die, another decrement phase is started (a minimal sketch of these decrement phases follows).
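The in-memory sketch below assumes a block table in which refs counts the pointer blocks referring to each block, and roots models stream roots still retained by users; the distributed batching and SCC grouping are omitted.

    def collect_garbage(blocks: dict, roots: set) -> dict:
        """Repeat decrement phases until no more pointer blocks die.

        blocks: address -> {"refs": int, "children": [address, ...]}
        """
        changed = True
        while changed:
            changed = False
            for addr in list(blocks):
                if blocks[addr]["refs"] == 0 and addr not in roots:
                    for child in blocks[addr]["children"]:
                        blocks[child]["refs"] -= 1    # a dead pointer block releases its references
                    del blocks[addr]                  # marked dead; space is reclaimed later, SCC by SCC
                    changed = True
        return blocks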
The epochs mentioned earlier synchronize block writes with the global block index update phases: a block is never removed in the same epoch in which it was written, and advancing to the next epoch requires all pending global block index updates to complete.
Space reclamation
The garbage-collection process only marks blocks as dead: their translations are removed from the global block index and new duplicates can no longer be eliminated against them, but their storage is not yet released. That space is reclaimed in the background, one SCC at a time.
Space reclamation reduces the average size of synchronous runs. To keep the per-SCC metadata from growing without bound, consecutive SCCs are concatenated so that the average SCC size stays within limits.
Only SCCs holding consecutive synchronous runs can be concatenated. Concatenation gives priority to synchronous runs from the same stream; synchronous runs from different streams are placed in one SCC only if there is no other SCC with data from that stream.
(system growth)
When new storage nodes are added to the system and its capacity grows, the number of FPN supernodes must grow as well to maintain good load balance. This is done by increasing the length of the zone prefix: each FPN component is split into two new components with longer prefixes.
Global block index entries are split between the new components based on their hash keys.
Synchronous runs are also split between the new supernodes. This is done by extending the number of bits of the synchronous-run identifier that are interpreted as the zone prefix: the least significant bit of the stream-run id is moved into the zone prefix. For example, synchronous runs with ids (prefix : stream run : sequence number) 01:0:0, 01:1:0, 01:2:0, 01:3:0, 01:4:0 and 01:5:0 become, after the split, 010:0:0, 011:0:0, 010:1:0, 011:1:0, 010:2:0 and 011:2:0; a sketch of this id rewriting follows below.
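The id rewriting can be shown in a few lines. This is a minimal sketch under the assumption that prefixes are kept as bit strings; the function name is hypothetical.

def split_synchrun_id(prefix, stream_run, seq):
    """Move the least significant bit of the stream-run id into the
    zone prefix, as done when a supernode is split in two."""
    new_prefix = prefix + str(stream_run & 1)  # extend prefix by the LSB
    new_stream_run = stream_run >> 1           # drop the moved bit
    return new_prefix, new_stream_run, seq

# Reproduces the example from the text:
print([split_synchrun_id("01", r, 0) for r in range(6)])
# [('010', 0, 0), ('011', 0, 0), ('010', 1, 0),
#  ('011', 1, 0), ('010', 2, 0), ('011', 2, 0)]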
As a result, when the system grows, synchronous runs are redistributed evenly between the new supernodes, with stream-run granularity.
If synchronous runs that belong to different supernodes after a split had been concatenated into a single SCC, that SCC would have to be divided by a maintenance operation. This rarely happens, however, because concatenation within a stream run is given priority over concatenation across stream runs.
Components (and therefore data) are thus always rebalanced onto the newly added nodes, so that high write bandwidth is available immediately.
(data organization discussion and evaluation)
(impact of the stream-run size)
The stream-run size determines how often a new supernode will be selected for a data stream. There is a trade-off in the choice of the stream-run size. For load balancing, switching to a new supernode often (for example, after every synchronous run) is good, but:
- it causes data to be dispersed among supernodes after the system grows, and
- it prevents the disks from reaching full streaming speed.
An appropriate balance must be found between switching after every synchronous run and switching only after a supernode has filled up.
(capacity balancing)
Supernode components are used to balance capacity utilization in the system. Components are assigned to storage nodes in proportion to the storage capacity present on each node. Since components are always transferred whole, several of them reside on each storage node, which makes the balancing granularity finer.
Balancing at the supernode-component level equalizes capacity utilization provided that all supernodes have roughly the same size. The uniform random assignment of stream runs to supernodes prevents any significant imbalance of supernode sizes from forming. Supernodes remain balanced even in the face of correlations in the input data and in deletions.
Compared with a hash-distributed system, the allocation unit is relatively large: the proposed system allocates whole stream runs, which are at least three orders of magnitude larger than blocks. If stream runs are too large and a simple uniform assignment to supernodes is used, the maximum utilization of the system suffers. An experiment was performed to evaluate how the choice of the allocation-unit size affects the maximum utilization achievable with random assignment. Stream runs were assigned to randomly selected supernodes until a full supernode was hit. The experiment assumed a 48 TB system in which each supernode has a size of 1.5 TB.
For stream runs of size 64 MB, the imbalance between supernodes was on average 2%. With a strictly uniform random assignment policy, the system would become full when 98% of its capacity had been written. If the originally selected supernode is short on space, this can be improved by attempting the allocation in a different supernode. This allows new writes to reach almost 100% utilization, while data deletion will still not cause significant imbalance.
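A rough re-creation of this experiment under the strict uniform policy (no retry in a different supernode) is sketched below; the parameters follow the text, everything else is an assumption.

import random

def simulate(system_tb=48, supernode_tb=1.5, run_mb=64, trials=5):
    """Assign 64 MB stream runs to uniformly random supernodes until an
    allocation hits a full supernode; report capacity utilization."""
    n = int(system_tb / supernode_tb)               # 32 supernodes
    cap = int(supernode_tb * 1024 * 1024 / run_mb)  # stream runs each
    utilizations = []
    for _ in range(trials):
        fill = [0] * n
        written = 0
        while True:
            s = random.randrange(n)
            if fill[s] == cap:                      # strict policy: stop
                break
            fill[s] += 1
            written += 1
        utilizations.append(written / (n * cap))
    return sum(utilizations) / trials

print(f"utilization when the first full supernode is hit: {simulate():.3f}")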
(redundancy and parallelism)
The supernode cardinality of the data FPN determines:
- the redundancy of the data FPN itself: fewer than half of the active FPN components may fail permanently, otherwise the consensus quorum is lost;
- the number of available data resilience classes: the erasure code can be configured to produce from 0 up to SNC_data - 1 redundant fragments; and
- the amount of parallelism available to a single stream.
Each block write requires writing SNC_data fragments, and a block read requires reading at least the block's original fragments. A single data stream is thus effectively striped over SNC_data storage nodes. This striping improves per-stream throughput by parallelizing data access over up to SNC_data storage disks. SNC_data can be increased to configure the system for higher single-stream throughput; too high a value, however, will degrade stream locality and random-read performance, because many disks must be accessed to read a single block.
The default supernode cardinality is 12, which should provide enough parallelism to saturate the throughput of a single client while maintaining good stream locality and random-read performance.
The supernode cardinality of the index FPN may be lower, because global block index translations are replicated rather than erasure-coded, and parallelism is provided inherently by the hash-based load distribution. Only network resilience and availability therefore need to be considered in this case.
(block addresses in pointer blocks)
Pointer blocks are blocks that refer to previously stored blocks. They can be used to link blocks into data structures such as files or whole file-system snapshots.
Each block stored in the system can be accessed either by a content-derived hash address or by a location-dependent unique block address. In principle, either of these two addresses could be stored in pointer blocks. The choice of pointer type comes with several trade-offs, which are outlined in Fig. 1.
Address size
A hash address is the hash of the block's content combined with some metadata (such as the resilience class). This address is large enough that hash collisions are negligible in a system of the anticipated size. Assuming the SHA-1 hash function is used, the size of a hash address is 20 bytes.
A unique block address is a (synchronous run id, block sequence number) pair that uniquely identifies a block within the system. This address can be made much smaller than a hash: since synchronous run ids are assigned deterministically rather than at random, collisions cannot occur. The number of bits required to uniquely identify a block depends on the number of non-duplicate blocks written to the system during its lifetime. Even assuming a minimal 1 KB block size and 2^16 blocks per synchronous run, a 64-bit synchronous-run identifier space will not be exhausted until 2^40 petabytes of non-duplicate data have been written to the system.
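The lifetime capacity figure follows from simple arithmetic, which the following lines merely re-derive.

# 2^64 synchronous-run ids, 2^16 blocks per run, blocks of >= 2^10 bytes:
total_bytes = 2**64 * 2**16 * 2**10     # = 2^90 bytes of non-duplicate data
petabyte = 2**50
assert total_bytes == 2**40 * petabyte  # i.e. 2^40 petabytes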
Read performance
A block's location must be looked up before its data can be read. If blocks are read sequentially, in the same order in which they were originally written, most of these lookups will be handled by the translation cache without any disk access. The translation cache, however, may not contain translations for the first few blocks of a stream (until the stream's SCC index has been prefetched), and it is completely ineffective for random reads. In these cases an expensive fragment-location lookup must be performed.
If pointer blocks hold hash addresses, this lookup must go through the global block index, causing a disk seek. With unique block addresses this is unnecessary, because the required synchronous run id is contained in the address itself.
Block relocation
With a static mapping of synchronous runs to supernodes, it may in some cases be useful to move blocks to a different synchronous run. This may be necessary, for example, to improve load balance in an asymmetric network.
If hash addresses are used in pointer blocks, a block's synchronous run can be changed without changing the contents of the pointer blocks that refer to it. If unique block addresses are used, on the other hand, all pointer blocks pointing to the relocated block need to be updated. This update must be propagated all the way to the root of the block tree, because the addresses stored in a pointer block are included in the computation of that pointer block's hash.
Requirements on hash lookup
Reading a block by its hash address depends on its translation being present in the global block index. If this is the only way to read blocks, the system must guarantee that the GBI has been successfully updated before a block write completes. This would increase the latency of block writes, or would require durable hash leases.
System healing
If the system suffers more failures than it is configured to withstand, some blocks may become unreadable. Because of deduplication, all file-system snapshots that contain an unreadable block are affected.
In many cases, the lost data still exists in the primary system and is written to the system again with the next backup. The block will then be stored again, in a new synchronous run but with the same hash address.
If pointer blocks contain hash addresses instead of unique block addresses, this new block can also be used when reading the old file systems that originally pointed to the unreadable block. In effect, rewriting the lost blocks will automatically "heal" the system.
Pointer blocks with hints
The benefits of hash addresses (block relocation, system healing) can be combined with those of unique block addresses (better random-read performance, weaker requirements on hash lookup) by keeping both the hash address and the unique block address of each pointer in the pointer block. The hash address is authoritative, and only it affects the pointer block's hash. The unique block address serves as a hint for avoiding global block index lookups, as long as the hint is up to date. The hint may become stale (when the pointed-to block changes location or becomes unreadable), in which case it can be refreshed lazily. The drawback of this approach is that it requires more storage capacity for pointer blocks.
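The hint-based resolution can be sketched as follows. The ptr object, the hash_address() helper (recomputing the content-derived address) and the storage/GBI interfaces are hypothetical names introduced only for this illustration.

def resolve_pointer(ptr, storage, gbi):
    """Resolve a pointer holding an authoritative hash address plus a
    unique-block-address hint (all names are illustrative)."""
    if ptr.hint is not None:
        data = storage.read_by_unique_address(ptr.hint)
        if data is not None and hash_address(data) == ptr.hash_address:
            return data                  # fresh hint: no GBI lookup needed
    # Hint missing or stale: fall back to the global block index, then
    # refresh the hint lazily so that later reads avoid the extra hop.
    ptr.hint = gbi.lookup(ptr.hash_address)
    return storage.read_by_unique_address(ptr.hint)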
(performance of unique block writes)
As mentioned above, backup systems are written to much more often than they are read, and high write throughput is essential for the viability of such a system.
In the architecture proposed in the present invention, each stream of unique data, when it is first written, is striped over SNC_data disks. In a system with hash-based block distribution, by contrast, such writes are spread uniformly over all disks. The proposed system therefore provides significantly lower single-stream write throughput. As noted above, however, a single client typically cannot make use of such high throughput anyway, so we find this limitation insignificant.
Load balancing
In a large system, multiple streams will typically be written simultaneously. Synchronous runs are assigned to each stream randomly and independently, so the same supernode may be selected to host several synchronous runs, forcing multiple streams to share the throughput of a single storage node.
This load imbalance can be alleviated by using multiple random choices in the synchronous-run allocation algorithm: when a new supernode is to be selected, d randomly chosen supernodes are queried, and the one hosting the fewest actively writing stream runs is selected. Multiple random choices have been shown to significantly improve randomized load balancing; a sketch of the selection step follows below.
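The sketch below is illustrative only (Supernode and active_runs are assumed names); it shows the multiple-random-choices selection, not the full allocation protocol.

import random

class Supernode:
    def __init__(self):
        self.active_runs = 0        # stream runs actively writing here

def allocate_supernode(supernodes, d):
    """Query d randomly chosen supernodes; pick the least loaded one."""
    candidates = random.sample(supernodes, d)
    return min(candidates, key=lambda s: s.active_runs)

nodes = [Supernode() for _ in range(64)]
for _ in range(64):                 # one new stream run per active stream
    allocate_supernode(nodes, d=10).active_runs += 1
print(max(n.active_runs for n in nodes))  # stays far below the d=1 case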
Fig. 2 and Fig. 3 show the load imbalance and how it affects write bandwidth as the system size grows. The assignment of n stream runs to n supernodes was simulated for different numbers of supernodes and allocation queries. Note that the number of supernodes is kept directly proportional to the system size.
Fig. 2 shows the average of the maximum number of stream runs assigned to a single supernode. As expected, even one additional allocation query dramatically reduces the maximum number of stream runs per supernode. Even with many queries, however, a supernode hosting multiple active stream runs can still be found with high probability. Streams whose runs are assigned to such a supernode experience degraded write throughput until the stream run is exhausted and another one is allocated.
Fig. 3 shows that even though individual streams may experience some slowdown, the impact of this load imbalance on the aggregate write bandwidth is small. The write bandwidth was computed by counting the supernodes to which at least one stream run is assigned (the underlying premise being that a single stream is enough to saturate the throughput of one supernode). With 10 queries, the achieved bandwidth is within 5% of the maximum even for very large systems.
Stream sorting
In systems with hash-based distribution, writes belonging to different streams are multiplexed into the same storage containers. Since unrelated streams will not be read together, reading such multiplexed containers is inefficient, because the reader must skip over unneeded data. NPL1 uses stream sorting to combine data from the same stream into larger chunks and thereby improve future reads. Stream sorting, however, either increases latency (if it is performed inline during the write process) or requires all data to be rewritten in stream-sorted order by a background process.
The architecture proposed in the present invention avoids multiplexing data from different streams altogether, because a separate stream run is created for each stream.
(read throughput)
The main motivation for the proposed architecture is to improve read throughput in large systems by preserving more stream locality.
(stream locality preservation)
Stream locality degrades naturally in a storage system that performs exact deduplication. Since the focus here is the additional degradation caused by the storage system's internal data organization, we factor out the effect of deduplication by analyzing how locality is preserved for streams of unique, non-duplicate blocks.
Initially, synchronous-run-sized portions of the input stream are placed sequentially on disk. The expected synchronous-run size ranges from a few megabytes to tens of megabytes, so a sequential read of the input stream will cause a negligible number of seeks on the storage disks.
Deletion can remove blocks from the middle of a synchronous run, and garbage collection will subsequently shrink the synchronous run. Before synchronous runs shrink enough to hurt sequential read performance, consecutive synchronous runs are concatenated as described above. Concatenation preserves data locality up to the stream-run size. If deletions remove so many blocks that only half-synchronous-run-sized pieces of a stream's data remain within a stream run, concatenation will begin merging synchronous runs that belong to different streams, and locality preservation for the original streams will no longer be effective.
After the system grows, existing data is transferred to the new nodes to keep capacity utilization balanced. As noted above, however, a stream run is always kept together as a unit, so stream locality is not affected by the addition of new storage nodes.
(comparison with hash-based block distribution)
Read throughput depends significantly on the access patterns during both writing and reading, under both hash-based block distribution and the per-stream block placement proposed in the present invention. To make the trade-offs between the two architectures more visible, we analyze how the systems behave in several typical scenarios.
Single-stream write, single-stream read
The simplest case, although very unlikely to occur in a large system, is a data stream that is initially stored as the only stream being written and is later read back sequentially. In this case hash-based distribution is very effective, delivering the combined throughput of all nodes. The architecture proposed in the present invention performs well enough with the parallelism of SNC_data storage nodes, assuming that SNC_data storage nodes in parallel are sufficient to saturate a single client.
Multi-stream write, single-stream read
In practical systems, the situation in which multiple streams are written simultaneously and only one of them is read back later is arguably quite typical. It can easily occur when multiple systems are backed up in parallel during a shared backup window and only one of them later suffers a failure and is restored from backup.
For a system using hash-based distribution, this scenario is rather unfavorable. Because blocks belonging to all streams are distributed uniformly into the same on-disk containers, reading back just one stream requires seeking over, or skipping, the other streams' blocks. NPL1 attempts to solve this problem by sorting the blocks in a container by stream id, inline, while the blocks are buffered waiting to be committed. The effectiveness of this stream sorting is limited by the container size.
The architecture proposed in the present invention is not affected by this problem, because writes from different data streams are stored in separate containers. Read throughput in this case remains the combined throughput of SNC_data storage nodes.
Multi-stream read
Multiple streams may be read back simultaneously, for example if multiple backup images are restored in parallel after a large-scale failure of several backed-up systems. Moreover, when reading a heavily fragmented deduplicated stream, even a single external read stream may look like multiple streams to the system.
In a system with hash-based distribution, every storage node effectively stores a scaled-down version of each stream. All of these scaled-down streams must be read in parallel to reassemble the whole stream, so each storage node must service accesses from every stream being read in the system. Because both storage nodes and access nodes have a fixed amount of memory for read buffering, an increased number of streams must be served with smaller disk reads. Small disk reads drastically reduce throughput, ultimately degrading sequential reads into random block reads.
The proposed system does not suffer from the same problem, because each data stream is striped over only a small subset of the storage nodes. Unlike hash-based distribution, however, it suffers from imperfect load balancing: many streams may have to be read from a small set of storage nodes while other storage nodes stay idle. Load balance can be improved, at the cost of higher CPU consumption by the erasure-coding algorithm, by reading redundant fragments in place of some of the original fragments. For a large number of simultaneous read streams, however, read performance is significantly higher than with hash-based block distribution.
(global block index updates)
As mentioned above, the global block index maps hashes to blocks' unique block addresses (synchronous run id and sequence number within the synchronous run). Thanks to this decision, global block index translations do not need to change when data locations change or when garbage collection completes: a block's address remains valid until the block is removed.
An alternative solution would be to keep the SCC id and the block's offset within that SCC. This could improve random-read performance by avoiding the translation from (synchronous run id, sequence number) to (SCC id, offset). However, it would require updating GBI translations after any maintenance operation that changes the offsets of fragments within an SCC (space reclamation, concatenation), and would therefore increase the load on the index nodes.
(support for asymmetric networks)
Hash-based distribution spreads a data stream uniformly over all storage nodes, so an access node must send an equal amount of data to every storage node. The bandwidth of writing a data stream is then limited by the throughput of the slowest network link between the access node and the storage nodes.
In the architecture proposed in the present invention, access nodes have more freedom in choosing supernodes, and therefore the storage nodes on which they store data. This can be used to improve write performance in asymmetric systems.
As mentioned above, the present invention assumes that the network consists of groups of nodes. Nodes within a group can communicate with high point-to-point throughput, while the links between groups provide lower per-node throughput.
Access nodes try to allocate stream runs only on the storage nodes in their own group, to avoid using inter-group links for writing. Because stream runs are assigned to supernodes rather than directly to storage nodes, the data FPN key space is partitioned so that ranges of data FPN prefixes correspond to node groups. If a supernode is assigned to a node group, all of its components are kept on storage nodes belonging to that group.
The stream-run allocation algorithm is modified to consider only the supernodes in the same group as the access node. Only when all such supernodes are full does it fall back to the normal allocation, unconstrained by node groups.
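Combining the group constraint with the earlier allocation sketch gives roughly the following; the group and full attributes are assumptions, and the fallback mirrors the "only when the group is full" rule above.

import random

def allocate_group_local(supernodes, access_group, d):
    """Prefer supernodes in the access node's own group; fall back to an
    allocation unconstrained by node groups only if the group is full."""
    local = [s for s in supernodes if s.group == access_group and not s.full]
    pool = local if local else [s for s in supernodes if not s.full]
    candidates = random.sample(pool, min(d, len(pool)))
    return min(candidates, key=lambda s: s.active_runs)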
This group-local allocation policy eliminates most of the bandwidth-intensive data transfers over the slower links. Unless the capacity of a group is exhausted, block writes are handled only by storage nodes in the same group as the access node. GBI queries are still sent uniformly to all index nodes, but they do not consume much bandwidth. Similarly, if duplicates are stored in a different group, the SCC index prefetches performed by the translation cache when writing duplicate blocks may use some inter-group bandwidth; but since SCC indexes are small compared with the data size, they should not overwhelm the inter-group throughput. Data reconstruction after a failure does not need much inter-group bandwidth either, because all of a supernode's components are in the same group.
This policy comes with some trade-offs, however. Capacity is balanced only within individual node groups: if some clients write more data than others, the free space in their groups will be depleted more quickly than in other groups. And if failures of storage nodes within the same group are not independent, the resilience of the system may be reduced, because all components of a supernode are placed in the same node group.
Although new writes generate no cross-group network traffic, the impact on reads depends on the deduplication pattern. For example, when an access node writes data that has already been written through an access node connected to a different group, the data is stored only in the original group. Reading that data from the second access node requires transferring all of it from the original group. In this case read performance may be worse than if the data had been spread uniformly over all supernodes.
The present invention argues that, the lower worst-case read throughput notwithstanding, deploying an asymmetric network may make sense given its lower cost. First, if the same client is always backed up through access nodes in one network group, any data unique to that client will most likely be stored in that group and can be read back with high throughput. Second, the restore of a failed client system typically involves reading only a few backup images; if very few streams are read simultaneously, the inter-group links should be fast enough not to become a bottleneck even when the data is stored in other node groups. And finally, reads from a remote node group do not compete for network throughput with concurrent writes.
(latency and resilience to marauders)
Compared with hash-based distribution, the proposed architecture can introduce more latency for block writes, because an extra network hop is needed to query the global block index. It may also have higher write latency with multiple relatively slow clients, which need more time to fill the large buffers required for sequential writes. This is a consequence of not mixing blocks from different streams: in a system with uniform hash-based distribution, blocks from all streams can accumulate in the same write buffers and be flushed to disk sequentially.
On the other hand, the inline stream sorting required in hash-based distribution systems, which can significantly increase write latency, is unnecessary in this system.
The proposed architecture is also more resilient to marauders: nodes that work just fast enough to be declared non-faulty, but operate more slowly than the other nodes. In this architecture, slowness or failure of a given node affects only the streams that access that node. With hash-based distribution, the performance of the whole system is determined by the slowest node in the network.
Because only a few storage nodes service the write requests of a single stream, an explicit flush of a stream's outstanding data can be requested to reduce latency. This is useful, for example, when handling NFS commit requests from clients that stop further operation until all submitted data has been made durable. An access node can request an explicit high-priority flush because the writes of one stream go to only one synchronous run at a time. In a system with hash-based distribution this is not feasible, because such requests would have to be sent to all storage nodes.
(static versus dynamic assignment of synchronous runs to supernodes)
In the solution proposed in the present invention, synchronous runs are assigned to supernodes statically. The assignment is based solely on the synchronous run id and cannot change without changing the synchronous run's id.
A dynamic mapping of synchronous runs to supernodes could be considered, in which the storage nodes holding a synchronous run's data would have to be looked up rather than determined statically from the synchronous run id. The advantage of such a dynamic mapping is that individual synchronous runs could change location to adapt to changes in the system. For example, in an asymmetric network, synchronous runs could be moved closer to the access nodes that access them most frequently.
The present invention decided against this additional mapping in the proposed system, because it would introduce an extra network hop for looking up the storage nodes of a synchronous run, thereby increasing read latency.
(conclusion)
The present invention introduces a new architecture for efficient, scalable, high-performance inline deduplication, which separates the DHT-based global block index used for exact deduplication from stream-aware sequential data placement.
The description above has shown that, compared with existing solutions, the architecture proposed in the present invention improves read performance in large systems in which the number of simultaneously read streams grows with the system size. The system preserves stream locality even in the face of data deletion and node additions, while maintaining good capacity balance between the storage nodes. It also avoids interleaving blocks from different streams when multiple streams are written simultaneously.
In symmetric networks, hash-based distribution provides slightly higher write throughput, but at a significant cost in read performance. In asymmetric systems, the architecture proposed in the present invention delivers clearly higher write performance even when concurrent reads occur, although read performance depends heavily on the access pattern.
Existing systems with hash-based distribution are more effective in small to medium-sized systems, as they avoid the problems of load balancing and hot spots. We find, however, that when high multi-stream read throughput is needed, the architecture proposed in the present invention is better suited to large installations.
<Second exemplary embodiment>
A second exemplary embodiment of the present invention will be described with reference to Fig. 4 to Fig. 10. Fig. 4 is a block diagram showing the configuration of the whole system. Fig. 5 is a block diagram schematically showing the storage system, and Fig. 6 is a functional block diagram showing its configuration. Fig. 7 to Fig. 10 are explanatory views for illustrating the operation of the storage system.
This exemplary embodiment shows the case in which the storage system is a system such as HYDRAstor, configured by connecting a plurality of server computers. The storage system of the present invention is not, however, limited to a configuration with a plurality of computers, and may be configured as a single computer.
As shown in Fig. 4, the storage system 10 of the present invention is connected via a network N to a backup system 11 that controls a backup process. The backup system 11 acquires backup target data (storage target data) stored in a backup target device 12 connected via the network N, and requests the storage system 10 to store it. The storage system 10 thus stores, as a backup, the backup target data it is requested to store.
As shown in Fig. 5, the storage system 10 of this exemplary embodiment adopts a configuration in which a plurality of server computers are connected. Specifically, the storage system 10 is equipped with an access node 10A (first server), a server computer that controls the storage and retrieval operations of the storage system 10; a storage node 10B (second server), a server computer equipped with a storage device for storing data; and an index node 10C (third server) for storing index data registering the storage destinations of data. The numbers of access nodes 10A, storage nodes 10B and index nodes 10C are not limited to those shown in Fig. 5, and a configuration in which more nodes 10A, 10B and 10C are connected may be adopted.
Furthermore, the storage system 10 of this exemplary embodiment has a function of dividing the storage target data and storing the pieces in a distributed fashion on the storage nodes 10B serving as storage devices. The storage system 10 also has a function of checking, by using unique hash values representing the features of the stored target data (block data), whether data of the same content has already been stored, and, for data already stored, eliminating duplicate storage by referring to the storage location of that data. The concrete storing process will be described in detail below.
Fig. 6 shows the configuration of the storage system 10. As shown in the figure, the access node 10A constituting the storage system 10 includes a data storage control unit 21 for controlling the reading and writing of the data to be stored.
It should be noted that the data storage control unit 21 is configured by a program installed in an arithmetic device such as a CPU (central processing unit) of the access node 10A shown in Fig. 5.
For example, the above program is provided to the storage system 10 in a state of being stored in a storage medium such as a CD-ROM. Alternatively, the program may be stored in a storage device of another server computer on the network and provided to the storage system 10 from that other server computer via the network.
Hereinafter, the configuration of the data storage control unit 21 will be described in detail. First, when the data storage control unit 21 receives an input of stream data as backup target data A, it divides the backup target data A into block data D of a predetermined capacity (for example, 64 KB), as shown in Fig. 7. Then, based on the data content of this block data D, the data storage control unit 21 calculates a unique hash value H (feature data) representing that data content. For example, the hash value H is calculated from the data content of the block data D by using a preset hash function.
Next, the data storage control unit 21 performs a duplication determination of whether the block data D to be newly stored has already been stored in the storage nodes 10B serving as storage devices. Specifically, the data storage control unit 21 checks whether the hash value of the block data D is present in any SCC index B2 (described below) recently read into the access node 10A. If the hash value of the block data D is not present in any such SCC index B2, the data storage control unit 21 then checks whether the hash value of the block data D to be newly stored is present in the global block index C1 stored in the index nodes 10C. Likewise, when no SCC index B2 has been read into the access node 10A, the data storage control unit 21 checks whether the hash value of the block data D to be newly stored is present in the global block index C1 stored in the index nodes 10C.
If the hash value of the block data D to be newly stored is not present in the global block index C1 stored in the index nodes 10C, the data storage control unit 21 stores this newest block data of the stream data in a storage node 10B. The manner in which the data storage control unit 21 stores the block data D in the storage nodes 10B will now be described concretely with reference to Fig. 7 and Fig. 8.
The data storage control unit 21 sequentially stores the block data D1 and so on, generated by dividing the data stream of the backup target data A, into an SCC file B1 formed in a specific storage node 10B. At this time, the data storage control unit 21 determines, as the specific storage node 10B for storing the block data D1 and so on, a storage node 10B whose used storage capacity is smallest or in which an open SCC file B1 exists. It should be noted that the data storage control unit 21 may determine the storage node 10B for storing the block data D1 and so on by other methods.
Then, the data storage control unit 21 stores a plurality of consecutive units of the block data D1, D2, D3 and so on of the data stream to be stored in the SCC file B1. At this time, the data storage control unit 21 associates the storage locations of the units of block data D1, D2, D3 and so on within the SCC file B1 with the hash values H of the stored block data D1, D2, D3 and so on, and stores them, as an SCC index B2 (storage location specification table), in the storage node 10B storing the block data D1, D2, D3 and so on. Furthermore, the data storage control unit 21 associates an ID serving as identification information (storage device identification information) for specifying the storage node 10B storing the block data D1, D2, D3 (for example, an ID representing a specific region in the specific SCC file B1 (see Fig. 8)) with the hash values of the block data D1, D2, D3, and stores them, as a global block index C1 (storage device specification table), in the index nodes 10C. Here, the data storage control unit 21 may associate the ID specifying the storage node 10B not with the whole hash value but with a part of the hash value, and store them. The data storage control unit 21 stores the global block index C1 in a distributed fashion over the plurality of index nodes 10C; any method may be used to store the hash values and IDs in a distributed fashion.
Since data is stored as described above, a plurality of consecutive units of the block data D1, D2, D3 and so on of the backup target data A are stored consecutively in the same storage node 10B, and the data units indicating their storage locations are stored consecutively in the SCC index B2. The storage node 10B (the specific region in the specific SCC file B1) storing the block data D1, D2, D3 and so on is managed by the global block index C1.
It should be noted that the storing process of the block data D1, D2, D3 and so on described above is actually performed in such a way that a group of storage nodes 10B (a supernode) serves as the specific storage node 10B, and the units of block data D1, D2, D3 and so on are stored in a distributed fashion. The manner in which block data is stored by further dividing the block data will now be described with reference to Fig. 7.
The data storage control unit 21 compresses the block data D to be newly stored as described above, and divides it into fragments, namely a plurality of pieces of fragment data of a predetermined capacity, as shown in Fig. 7. For example, as shown by reference numerals E1 to E9 in Fig. 7, the data storage control unit 21 divides the data into nine pieces of fragment data (division data 41). Furthermore, the data storage control unit 21 generates redundant data such that the original data can be restored even if some of the fragment data obtained by the division are lost, and adds the redundant data to the fragment data 41 obtained by the division. For example, as shown by reference numerals E10 to E12 in Fig. 7, the data storage control unit 21 adds three pieces of fragment data (redundant data 42). The data storage control unit 21 thus generates a data set 40 that includes twelve pieces of fragment data composed of the nine pieces of division data 41 and the three pieces of redundant data 42.
The data storage control unit 21 then distributes and stores the pieces of fragment data composing the generated data set, one piece at a time, into the storage regions 31 formed in the group of storage nodes 10B serving as the supernode. For example, as shown in Fig. 7, when the twelve pieces of fragment data E1 to E12 are generated, the data storage control unit 21 stores each of the fragment data E1 to E12 into one of the data storage files F1 to F12 (data storage regions) formed in the twelve storage regions 31.
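This 9-plus-3 fragmenting can be sketched as follows. The erasure code itself is abstracted behind a parameter, since the embodiment does not specify it; the dummy encoder shown provides no real protection and merely keeps the sketch runnable. All names are illustrative.

def store_block(block, files, erasure_encode):
    """Divide a block into 9 division fragments E1..E9, add 3 redundant
    fragments E10..E12, and store one fragment per data storage file
    F1..F12 (one per storage region of the supernode)."""
    n_data, n_redundant = 9, 3
    size = -(-len(block) // n_data)                 # ceil(len / 9)
    fragments = [block[i*size:(i+1)*size].ljust(size, b"\0")
                 for i in range(n_data)]            # division data 41
    fragments += erasure_encode(fragments, n_redundant)  # redundant data 42
    assert len(fragments) == len(files) == 12       # data set 40
    for fragment, storage_file in zip(fragments, files):
        storage_file.append(fragment)

# Stand-in for the erasure code (a real system would use, for example,
# Reed-Solomon coding so that any 3 lost fragments can be rebuilt):
dummy_encode = lambda frags, k: frags[:k]

files = [[] for _ in range(12)]                     # F1 .. F12
store_block(b"x" * 64 * 1024, files, dummy_encode)  # one 64 KB block D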
Next, a case will be described with reference to Fig. 9 and Fig. 10 in which a data stream of backup target data A', having data content almost identical to that of the data stream A described above, is input as new storage target data. First, the data storage control unit 21 performs a duplication determination of whether the block data D1 of the backup target data A' has already been stored in the storage nodes 10B serving as storage devices. At this time, the data storage control unit 21 checks whether an SCC index B2 has been read into the access node 10A. In this case, since no SCC index B2 has been read yet, the data storage control unit 21 checks whether the hash value of the block data D1 to be newly stored (here, a part of the hash value) is present in the global block index C1 stored in the index nodes 10C.
If the hash value (the part of the hash value) of the block data D1 to be newly stored is present in the global block index C1 stored in the index nodes 10C, the data storage control unit 21 specifies the storage node 10B (the region of the specific SCC file B1) associated with that hash value (the part of the hash value), and refers to the SCC index B2 in that storage node 10B. The data storage control unit 21 compares the hash value stored in the SCC index B2 with the hash value of the block data D1 to be newly stored, and if they match, refers to the SCC index B2 and uses the storage location of the block data in the SCC file B1 as the block data D1 to be newly stored. Thus, the block data D1 to be newly stored is not itself actually stored, and duplicate storage can be eliminated.
Meanwhile, the data storage control unit 21 reads the SCC index B2 referred to above from the storage node 10B into the access node 10A. Then, for the subsequent block data D2 and D3 of the backup target data A', the data storage control unit 21 compares the hash values of the block data D2 and D3 with the hash values stored in the SCC index B2 that has been read into the access node 10A, and if they match, refers to the SCC index B2 and uses the storage locations of the block data stored in the SCC file B1 as the block data D2 and D3 to be newly stored. Thus, the block data D2 and D3 to be newly stored are not themselves actually stored, and duplicate storage can be eliminated. In addition, the duplication determination can be performed at higher speed.
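The duplication determination of this embodiment can be condensed into the following sketch. The helper names (hash_value, part_of, fetch_scc_index) and the object interfaces are assumptions made for illustration; in particular, part_of stands for the "part of the hash value" registered in the global block index C1.

import hashlib

def hash_value(block):
    return hashlib.sha1(block).hexdigest()       # a preset hash function

def part_of(h):
    return h[:8]                                 # a part of the hash value

def deduplicate(block, access_node, global_block_index, fetch_scc_index):
    """Return the storage location of an existing copy of 'block', or
    None if the block is new and must actually be stored."""
    h = hash_value(block)
    # 1. Check the SCC indexes B2 already read into the access node 10A.
    for scc_index in access_node.scc_indexes:
        if h in scc_index:
            return scc_index.location_of(h)
    # 2. Otherwise consult the global block index C1 on the index nodes.
    node_id = global_block_index.get(part_of(h))
    if node_id is None:
        return None                              # new block data
    # 3. Read the SCC index B2 from that storage node, verify the full
    #    hash value, and keep the index for the following blocks D2, D3.
    scc_index = fetch_scc_index(node_id)
    access_node.scc_indexes.append(scc_index)
    return scc_index.location_of(h) if h in scc_index else None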
As described above, the present invention includes a plurality of storage nodes 10B and achieves distributed data storage that keeps the capacity well balanced between the storage nodes. In addition, according to the present invention, locality can be maintained within a specific group of storage nodes 10B (a supernode) by storing a predetermined number of sequential units of the block data generated by dividing the storage target data. Consequently, the deduplication process can be performed at higher speed, and the data reading process can also be performed at higher speed.
<Supplementary notes>
The whole or part of the exemplary embodiments disclosed above can be described as the following supplementary notes. Below, the outlines of the configuration of a storage system 100 of the present invention (see Fig. 11), of a computer-readable medium storing a program, and of a data storage method will be described. However, the present invention is not limited to the following configurations.
(Supplementary note 1)
A storage system 100, comprising:
a data storage control unit 101 that stores, in a distributed fashion in a plurality of storage devices 110, a plurality of units of block data generated by dividing storage target data, and, when attempting to store in the storage devices 110 other storage target data having the same data content as storage target data already stored in the storage devices 110, performs elimination of duplicate storage by referring to the already stored storage target data as the other storage target data, wherein
the data storage control unit 101 stores, in a specific storage device 110 among the plurality of storage devices 110, a plurality of sequential units of the block data of the storage target data generated by dividing the storage target data; stores, in the specific storage device 110 and associated with each other, feature data based on the data content of the block data and storage location information representing the storage location of the block data in the specific storage device 110, as a storage location specification table; and stores, associated with each other, storage device identification information for identifying the specific storage device 110 and the feature data of the block data stored in the specific storage device 110, as a storage device specification table.
(Supplementary note 2)
The storage system according to supplementary note 1, wherein
the data storage control unit refers to the storage device specification table based on the feature data of block data generated by dividing storage target data to be newly stored, so as to specify the specific storage device storing the storage location specification table that includes the feature data of that block data, and reads the storage location specification table from the specific storage device.
(Supplementary note 3)
The storage system according to supplementary note 2, wherein
the data storage control unit determines, based on the storage location specification table read from the specific storage device, whether the block data generated by dividing the storage target data to be newly stored has already been stored in the storage devices.
(Supplementary note 4)
The storage system according to supplementary note 3, wherein
if the feature data of the block data generated by dividing the storage target data to be newly stored is not present in the storage location specification table read from the specific storage device, the data storage control unit refers to the storage device specification table based on the feature data of that block data, so as to specify another specific storage device storing another storage location specification table that includes the feature data of the block data, and reads the other storage location specification table from the other specific storage device.
(Supplementary note 5)
The storage system according to supplementary note 1, further comprising:
at least one first server that controls the operation of storing the storage target data in the plurality of storage devices, and
a plurality of second servers that constitute the plurality of storage devices, wherein
the data storage control unit reads the storage location specification table from one of the second servers into the first server.
(Supplementary note 6)
The storage system according to supplementary note 5, further comprising:
a plurality of third servers that store the storage device specification table, wherein
the data storage control unit stores the storage device specification table in a distributed fashion in the plurality of third servers.
(Supplementary note 7)
A computer-readable medium storing a program comprising instructions for causing an information processing device to realize:
a data storage control unit that stores, in a distributed fashion in a plurality of storage devices, a plurality of units of block data generated by dividing storage target data, and, when attempting to store in the storage devices other storage target data having the same data content as storage target data already stored in the storage devices, performs elimination of duplicate storage by referring to the already stored storage target data as the other storage target data, wherein
the data storage control unit stores, in a specific storage device among the plurality of storage devices, a plurality of sequential units of the block data of the storage target data generated by dividing the storage target data; stores, in the specific storage device and associated with each other, feature data based on the data content of the block data and storage location information representing the storage location of the block data in the specific storage device, as a storage location specification table; and stores, associated with each other, storage device identification information for identifying the specific storage device and the feature data of the block data stored in the specific storage device, as a storage device specification table.
(Supplementary note 8)
The computer-readable medium storing the program according to supplementary note 7, wherein
the data storage control unit refers to the storage device specification table based on the feature data of block data generated by dividing storage target data to be newly stored, so as to specify the specific storage device storing the storage location specification table that includes the feature data of that block data, and reads the storage location specification table from the specific storage device.
(Supplementary note 9)
A data storage method for storing, in a distributed fashion in a plurality of storage devices, a plurality of units of block data generated by dividing storage target data, and, when attempting to store in the storage devices other storage target data having the same data content as storage target data already stored in the storage devices, performing elimination of duplicate storage by referring to the already stored storage target data as the other storage target data, the method comprising:
storing, in a specific storage device among the plurality of storage devices, a plurality of sequential units of the block data of the storage target data generated by dividing the storage target data;
storing, in the specific storage device and associated with each other, feature data based on the data content of the block data and storage location information representing the storage location of the block data in the specific storage device, as a storage location specification table; and
storing, associated with each other, storage device identification information for identifying the specific storage device and the feature data of the block data stored in the specific storage device, as a storage device specification table.
(Supplementary note 10)
The data storage method according to supplementary note 9, further comprising:
referring to the storage device specification table based on the feature data of block data generated by dividing storage target data to be newly stored, so as to specify the specific storage device storing the storage location specification table that includes the feature data of that block data, and
reading the storage location specification table from the specific storage device.

Claims (4)

1. A storage system, comprising:
a data storage control unit that stores, in a distributed fashion in a plurality of storage devices, a plurality of units of block data generated by dividing storage target data, and, when attempting to store in the storage devices other storage target data having the same data content as storage target data already stored in the storage devices, performs elimination of duplicate storage by referring to said already stored storage target data as said other storage target data,
wherein said data storage control unit stores, in a specific storage device among said plurality of storage devices, a plurality of sequential units of the block data of said storage target data generated by dividing said storage target data; stores, in said specific storage device and associated with each other, feature data based on the data content of said block data and storage location information representing the storage location of said block data in said specific storage device, as a storage location specification table; and stores, associated with each other, storage device identification information for identifying said specific storage device and said feature data of said block data stored in said specific storage device, as a storage device specification table,
wherein said data storage control unit refers to said storage device specification table based on said feature data of block data generated by dividing storage target data to be newly stored, so as to specify said specific storage device storing said storage location specification table that includes said feature data of said block data, and reads said storage location specification table from said specific storage device,
wherein said data storage control unit determines, based on said storage location specification table read from said specific storage device, whether said block data generated by dividing said storage target data to be newly stored has already been stored in said storage devices, and
wherein, if said feature data of said block data generated by dividing said storage target data to be newly stored is not present in said storage location specification table read from said specific storage device, said data storage control unit refers to said storage device specification table based on said feature data of said block data, so as to specify another specific storage device storing another storage location specification table that includes said feature data of said block data, and reads said other storage location specification table from said other specific storage device.
2. The storage system according to claim 1, further comprising:
at least one first server that controls the operation of storing the storage target data in the plurality of storage devices, and
a plurality of second servers that constitute said plurality of storage devices, wherein
said data storage control unit reads said storage location specification table from one of said second servers into said first server.
3. The storage system according to claim 2, further comprising:
a plurality of third servers that store said storage device specification table, wherein
said data storage control unit stores said storage device specification table in a distributed fashion in said plurality of third servers.
4. A data storage method for storing, in a distributed manner into a plurality of storage devices, units of block data generated by dividing storage target data, and, when storing other storage target data having the same data content as storage target data already stored in a storage device, eliminating duplicate storage by referring to the storage target data already stored in the storage device as said other storage target data, the method comprising:
storing sequential units of the block data generated by dividing the storage target data into a particular storage device among the plurality of storage devices, storing, in association with each other in the particular storage device, feature data based on the data content of the block data and storage location information representing the storage location of the block data in the particular storage device, as a storage location specification table, and storing, in association with each other, storage device identification information for identifying the particular storage device and the feature data of the block data stored in the particular storage device, as a storage device specification table;
referring to the storage device specification table based on the feature data of block data generated by dividing storage target data to be newly stored, so as to specify the particular storage device storing the storage location specification table that includes the feature data of the block data, and reading the storage location specification table from the particular storage device;
determining, based on the storage location specification table read from the particular storage device, whether the block data generated by dividing the storage target data to be newly stored has already been stored in the storage devices; and
if the feature data of the block data generated by dividing the storage target data to be newly stored is not present in the storage location specification table read from the particular storage device, referring to the storage device specification table based on that feature data, specifying another particular storage device storing another storage location specification table that includes the feature data of the block data, and reading said another storage location specification table from said another particular storage device.
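Putting the steps of claim 4 together, a minimal end-to-end sketch of the method, under the same assumptions as the sketches above (fixed-size splitting, SHA-1 feature data, in-memory dictionaries for both specification tables), might look like this:

import hashlib

BLOCK_SIZE = 8192                        # fixed-size splitting; an assumption
devices = {"device-0": []}               # device id -> list of stored blocks
location_tables = {"device-0": {}}       # per device: feature -> block index
device_table = {}                        # feature -> device ids (spec table)

def store(data, device_id="device-0"):
    table = location_tables[device_id]
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        feature = hashlib.sha1(block).digest()
        duplicate = feature in table or any(
            feature in location_tables.get(d, {})
            for d in device_table.get(feature, []))
        if duplicate:
            continue   # eliminate duplicate storage: reference the old copy
        devices[device_id].append(block)              # sequential placement
        table[feature] = len(devices[device_id]) - 1  # storage location info
        device_table.setdefault(feature, []).append(device_id)

A real system would use content-defined chunking and persistent, replicated tables rather than in-memory dictionaries, but the control flow, split, hash, consult the two tables, then either reference or append, follows the claimed method.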
CN201180043251.2A 2010-09-30 2011-09-21 Storage system Active CN103098015B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US38826210P 2010-09-30 2010-09-30
US61/388,262 2010-09-30
PCT/JP2011/005301 WO2012042792A1 (en) 2010-09-30 2011-09-21 Storage system

Publications (2)

Publication Number Publication Date
CN103098015A CN103098015A (en) 2013-05-08
CN103098015B true CN103098015B (en) 2015-11-25

Family

ID=45892285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201180043251.2A Active CN103098015B (en) 2010-09-30 2011-09-21 Storage system

Country Status (6)

Country Link
US (1) US9256368B2 (en)
EP (1) EP2622452A4 (en)
JP (1) JP5500257B2 (en)
CN (1) CN103098015B (en)
CA (1) CA2811437C (en)
WO (1) WO2012042792A1 (en)

Families Citing this family (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9400799B2 (en) * 2010-10-04 2016-07-26 Dell Products L.P. Data block migration
US9933978B2 (en) 2010-12-16 2018-04-03 International Business Machines Corporation Method and system for processing data
US8589406B2 (en) * 2011-03-03 2013-11-19 Hewlett-Packard Development Company, L.P. Deduplication while rebuilding indexes
EP2698718A4 (en) * 2011-05-31 2014-02-19 Huawei Tech Co Ltd Data reading and writing method, device and storage system
US9639591B2 (en) * 2011-06-13 2017-05-02 EMC IP Holding Company LLC Low latency replication techniques with content addressable storage
US9383928B2 (en) * 2011-06-13 2016-07-05 Emc Corporation Replication techniques with content addressable storage
US9069707B1 (en) 2011-11-03 2015-06-30 Permabit Technology Corp. Indexing deduplicated data
US9208082B1 (en) * 2012-03-23 2015-12-08 David R. Cheriton Hardware-supported per-process metadata tags
CN103019960B (en) * 2012-12-03 2016-03-30 华为技术有限公司 Distributed caching method and system
US9158468B2 (en) 2013-01-02 2015-10-13 International Business Machines Corporation High read block clustering at deduplication layer
US8862847B2 (en) 2013-02-08 2014-10-14 Huawei Technologies Co., Ltd. Distributed storage method, apparatus, and system for reducing a data loss that may result from a single-point failure
US9953042B1 (en) 2013-03-01 2018-04-24 Red Hat, Inc. Managing a deduplicated data index
US8751763B1 (en) * 2013-03-13 2014-06-10 Nimbus Data Systems, Inc. Low-overhead deduplication within a block-based data storage
US9361028B2 (en) * 2013-05-07 2016-06-07 Veritas Technologies, LLC Systems and methods for increasing restore speeds of backups stored in deduplicated storage systems
US9256612B1 (en) 2013-06-11 2016-02-09 Symantec Corporation Systems and methods for managing references in deduplicating data systems
US9298724B1 (en) 2013-06-14 2016-03-29 Symantec Corporation Systems and methods for preserving deduplication efforts after backup-job failures
US20150039645A1 (en) * 2013-08-02 2015-02-05 Formation Data Systems, Inc. High-Performance Distributed Data Storage System with Implicit Content Routing and Data Deduplication
US9348531B1 (en) 2013-09-06 2016-05-24 Western Digital Technologies, Inc. Negative pool management for deduplication
EP3061253A1 (en) * 2013-10-25 2016-08-31 Microsoft Technology Licensing, LLC Hash-based block matching in video and image coding
CN105684409B (en) * 2013-10-25 2019-08-13 微软技术许可有限责任公司 Each piece is indicated using hashed value in video and image coding and decoding
US9792063B2 (en) 2014-01-15 2017-10-17 Intel Corporation Deduplication-based data security
KR102218732B1 (en) * 2014-01-23 2021-02-23 삼성전자주식회사 Stoarge device and method operation thereof
WO2015125765A1 (en) * 2014-02-18 2015-08-27 日本電信電話株式会社 Security device, method therefor and program
US10368092B2 (en) * 2014-03-04 2019-07-30 Microsoft Technology Licensing, Llc Encoder-side decisions for block flipping and skip mode in intra block copy prediction
US10567754B2 (en) * 2014-03-04 2020-02-18 Microsoft Technology Licensing, Llc Hash table construction and availability checking for hash-based block matching
WO2015196322A1 (en) * 2014-06-23 2015-12-30 Microsoft Technology Licensing, Llc Encoder decisions based on results of hash-based block matching
CN104268091B (en) * 2014-09-19 2016-02-24 盛杰 File storage method and file modification method
US9740632B1 (en) 2014-09-25 2017-08-22 EMC IP Holding Company LLC Snapshot efficiency
KR102490706B1 (en) 2014-09-30 2023-01-19 마이크로소프트 테크놀로지 라이센싱, 엘엘씨 Hash-based encoder decisions for video coding
KR20160083762A (en) * 2015-01-02 2016-07-12 삼성전자주식회사 Method for managing mapping table in storage system and storage system adopting the same
US9921910B2 (en) * 2015-02-19 2018-03-20 Netapp, Inc. Virtual chunk service based data recovery in a distributed data storage system
US10241689B1 (en) 2015-06-23 2019-03-26 Amazon Technologies, Inc. Surface-based logical storage units in multi-platter disks
US10311023B1 (en) 2015-07-27 2019-06-04 Sas Institute Inc. Distributed data storage grouping
US10310748B2 (en) 2015-08-26 2019-06-04 Pivotal Software, Inc. Determining data locality in a distributed system using aggregation of locality summaries
US10706070B2 (en) * 2015-09-09 2020-07-07 Rubrik, Inc. Consistent deduplicated snapshot generation for a distributed database using optimistic deduplication
WO2017068617A1 (en) * 2015-10-19 2017-04-27 株式会社日立製作所 Storage system
US10152527B1 (en) 2015-12-28 2018-12-11 EMC IP Holding Company LLC Increment resynchronization in hash-based replication
US9697224B1 (en) * 2016-02-09 2017-07-04 International Business Machines Corporation Data deduplication for an eventually consistent system
US9898200B2 (en) 2016-02-18 2018-02-20 Samsung Electronics Co., Ltd Memory device having a translation layer with multiple associative sectors
US10574751B2 (en) 2016-03-22 2020-02-25 International Business Machines Corporation Identifying data for deduplication in a network storage environment
US10324782B1 (en) 2016-03-24 2019-06-18 Emc Corporation Hiccup management in a storage array
US10101934B1 (en) 2016-03-24 2018-10-16 Emc Corporation Memory allocation balancing for storage systems
US9857990B1 (en) 2016-03-24 2018-01-02 EMC IP Holding Company LLC Fast startup for modular storage systems
US10705907B1 (en) 2016-03-24 2020-07-07 EMC IP Holding Company LLC Data protection in a heterogeneous random access storage array
US10437785B2 (en) 2016-03-29 2019-10-08 Samsung Electronics Co., Ltd. Method and apparatus for maximized dedupable memory
JP6420489B2 (en) 2016-04-19 2018-11-07 華為技術有限公司Huawei Technologies Co.,Ltd. Vector processing for segmented hash calculation
SG11201704733VA (en) 2016-04-19 2017-11-29 Huawei Tech Co Ltd Concurrent segmentation using vector processing
CN106055274A (en) * 2016-05-23 2016-10-26 联想(北京)有限公司 Data storage method, data reading method and electronic device
CN106201338B (en) 2016-06-28 2019-10-22 华为技术有限公司 Date storage method and device
US10515064B2 (en) * 2016-07-11 2019-12-24 Microsoft Technology Licensing, Llc Key-value storage system including a resource-efficient index
US10390039B2 (en) 2016-08-31 2019-08-20 Microsoft Technology Licensing, Llc Motion estimation for screen remoting scenarios
US10417064B2 (en) * 2016-09-07 2019-09-17 Military Industry—Telecommunication Group (Viettel) Method of randomly distributing data in distributed multi-core processor systems
US10152371B1 (en) 2016-09-30 2018-12-11 EMC IP Holding Company LLC End-to-end data protection for distributed storage
US10223008B1 (en) 2016-09-30 2019-03-05 EMC IP Holding Company LLC Storage array sizing for compressed applications
US10255172B1 (en) 2016-09-30 2019-04-09 EMC IP Holding Company LLC Controlled testing using code error injection
CN106527981B (en) * 2016-10-31 2020-04-28 华中科技大学 Data fragmentation method of self-adaptive distributed storage system based on configuration
KR102610996B1 (en) 2016-11-04 2023-12-06 에스케이하이닉스 주식회사 Data management system and method for distributed data processing
US11095877B2 (en) 2016-11-30 2021-08-17 Microsoft Technology Licensing, Llc Local hash-based motion estimation for screen remoting scenarios
JP6805816B2 (en) 2016-12-27 2020-12-23 富士通株式会社 Information processing equipment, information processing system, information processing method and program
US10489288B2 (en) * 2017-01-25 2019-11-26 Samsung Electronics Co., Ltd. Algorithm methodologies for efficient compaction of overprovisioned memory systems
US10282127B2 (en) 2017-04-20 2019-05-07 Western Digital Technologies, Inc. Managing data in a storage system
US10809928B2 (en) 2017-06-02 2020-10-20 Western Digital Technologies, Inc. Efficient data deduplication leveraging sequential chunks or auxiliary databases
CN107329903B (en) * 2017-06-28 2021-03-02 苏州浪潮智能科技有限公司 Memory garbage recycling method and system
US11429587B1 (en) 2017-06-29 2022-08-30 Seagate Technology Llc Multiple duration deduplication entries
US10706082B1 (en) 2017-06-29 2020-07-07 Seagate Technology Llc Deduplication database management
US10503608B2 (en) 2017-07-24 2019-12-10 Western Digital Technologies, Inc. Efficient management of reference blocks used in data deduplication
US10289566B1 (en) 2017-07-28 2019-05-14 EMC IP Holding Company LLC Handling data that has become inactive within stream aware data storage equipment
US10372681B2 (en) * 2017-09-12 2019-08-06 International Business Machines Corporation Tape drive memory deduplication
CN107589917B (en) * 2017-09-29 2020-08-21 苏州浪潮智能科技有限公司 Distributed storage system and method
CN109726037B (en) 2017-10-27 2023-07-21 伊姆西Ip控股有限责任公司 Method, apparatus and computer program product for backing up data
KR20190074897A (en) * 2017-12-20 2019-06-28 에스케이하이닉스 주식회사 Memory system and operating method thereof
US11153094B2 (en) * 2018-04-27 2021-10-19 EMC IP Holding Company LLC Secure data deduplication with smaller hash values
US10592136B2 (en) * 2018-06-29 2020-03-17 EMC IP Holding Company LLC Block based striped backups
JP2020057305A (en) * 2018-10-04 2020-04-09 富士通株式会社 Data processing device and program
CN111833189A (en) 2018-10-26 2020-10-27 创新先进技术有限公司 Data processing method and device
US11392551B2 (en) * 2019-02-04 2022-07-19 EMC IP Holding Company LLC Storage system utilizing content-based and address-based mappings for deduplicatable and non-deduplicatable types of data
US11658882B1 (en) * 2020-01-21 2023-05-23 Vmware, Inc. Algorithm-based automatic presentation of a hierarchical graphical representation of a computer network structure
US11202085B1 (en) 2020-06-12 2021-12-14 Microsoft Technology Licensing, Llc Low-cost hash table construction and hash-based block matching for variable-size blocks
CN114490517A (en) * 2020-10-23 2022-05-13 华为技术有限公司 Data processing method, device, computing node and computer readable storage medium
CN112199326B (en) * 2020-12-04 2021-02-19 中国人民解放军国防科技大学 Method and device for dynamically constructing software supernodes on array heterogeneous computing system
CN114442927B (en) * 2021-12-22 2023-11-03 天翼云科技有限公司 Management method and device for data storage space
US11829341B2 (en) 2022-03-31 2023-11-28 Dell Products L.P. Space-efficient persistent hash table data structure
CN115599316B (en) * 2022-12-15 2023-03-21 南京鹏云网络科技有限公司 Distributed data processing method, apparatus, device, medium, and computer program product

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005502096A (en) * 2001-01-11 2005-01-20 ゼット−フォース コミュニケイションズ インコーポレイテッド File switch and exchange file system
US7418454B2 (en) * 2004-04-16 2008-08-26 Microsoft Corporation Data overlay, self-organized metadata overlay, and application level multicasting
JP2008059398A (en) * 2006-08-31 2008-03-13 Brother Ind Ltd Identification information allocation device, information processing method therefor, and program therefor
US8315984B2 (en) * 2007-05-22 2012-11-20 Netapp, Inc. System and method for on-the-fly elimination of redundant data
US8209506B2 (en) * 2007-09-05 2012-06-26 Emc Corporation De-duplication in a virtualized storage environment
US8176269B2 (en) * 2008-06-30 2012-05-08 International Business Machines Corporation Managing metadata for data blocks used in a deduplication system
US8392791B2 (en) * 2008-08-08 2013-03-05 George Saliba Unified data protection and data de-duplication in a storage system
US7992037B2 (en) * 2008-09-11 2011-08-02 Nec Laboratories America, Inc. Scalable secondary storage systems and methods
JP5413948B2 (en) * 2009-01-27 2014-02-12 日本電気株式会社 Storage system
JP5339432B2 (en) * 2009-02-25 2013-11-13 日本電気株式会社 Storage system
JP5407430B2 (en) * 2009-03-04 2014-02-05 日本電気株式会社 Storage system
JP5691229B2 (en) * 2010-04-08 2015-04-01 日本電気株式会社 Online storage system and method for providing online storage service
US8397080B2 (en) * 2010-07-29 2013-03-12 Industrial Technology Research Institute Scalable segment-based data de-duplication system and method for incremental backups

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010170475A (en) * 2009-01-26 2010-08-05 Nec Corp Storage system, data write method in the same, and data write program
JP2010176181A (en) * 2009-01-27 2010-08-12 Nec Corp Storage system

Also Published As

Publication number Publication date
JP2013514560A (en) 2013-04-25
JP5500257B2 (en) 2014-05-21
US20130036289A1 (en) 2013-02-07
US9256368B2 (en) 2016-02-09
WO2012042792A1 (en) 2012-04-05
CA2811437C (en) 2016-01-19
EP2622452A1 (en) 2013-08-07
CN103098015A (en) 2013-05-08
CA2811437A1 (en) 2012-04-05
EP2622452A4 (en) 2017-10-04

Similar Documents

Publication Publication Date Title
CN103098015B (en) Storage system
US11153380B2 (en) Continuous backup of data in a distributed data store
US10956601B2 (en) Fully managed account level blob data encryption in a distributed storage environment
US11120152B2 (en) Dynamic quorum membership changes
US10764045B2 (en) Encrypting object index in a distributed storage environment
US11755415B2 (en) Variable data replication for storage implementing data backup
US10331655B2 (en) System-wide checkpoint avoidance for distributed database systems
US10229011B2 (en) Log-structured distributed storage using a single log sequence number space
US10558565B2 (en) Garbage collection implementing erasure coding
US9507843B1 (en) Efficient replication of distributed storage changes for read-only nodes of a distributed database
US10659225B2 (en) Encrypting existing live unencrypted data using age-based garbage collection
JP5539683B2 (en) Scalable secondary storage system and method
US11030055B2 (en) Fast crash recovery for distributed database systems
CN106687911B (en) Online data movement without compromising data integrity
WO2021011053A1 (en) Data deduplication across storage systems
US20140279929A1 (en) Database system with database engine and separate distributed storage service
US10725666B2 (en) Memory-based on-demand data page generation
US10803012B1 (en) Variable data replication for storage systems implementing quorum-based durability schemes
US10223184B1 (en) Individual write quorums for a log-structured distributed storage system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant