Background
With the rapid development of computer and network technology, the volume of data that must be stored and processed is growing geometrically. A traditional storage system keeps all data on a centralized storage server; that server becomes the performance bottleneck of the system as well as the focal point for reliability and security, and it cannot meet the needs of mass-storage applications. Distributed storage schemes arose in response. They use techniques such as multiple replicas and striping to satisfy users' demands for data security and efficient reads and writes. As shown in Figure 1, a distributed storage system adopts a scalable architecture in which multiple storage servers share the storage load. A master server is connected to the storage servers over a communication network and manages the storage devices (physical storage resources) on them. This arrangement improves the reliability, availability, and access efficiency of the system and is also easy to expand. A partition is a virtual container for storing data. Each partition corresponds to a segment of storage space on some storage node, and each storage node usually hosts multiple partitions. For storage security, the multiple replicas of one partition are usually placed on different storage nodes; that is, the multiple copies of the data are stored on different storage nodes. Figure 2 shows the correspondence among data, partitions, and storage nodes.
Every distributed storage system must solve one key problem: how to distribute the replicas of data across the storage nodes. The algorithm that solves it is called a partition allocation algorithm, also known as a replica placement algorithm or a data allocation algorithm. This algorithm determines whether the whole storage system can be optimal with respect to the following indicators:
(1) Security: the security level can be rack, server, disk, and so on. Depending on the security level, the replicas of data must be stored on storage nodes in different units of that level. For example, when the security level is server, the different replicas of one partition must be stored on different servers;
(2) Load balance across nodes: whether the total amount of data on each storage node, and the amount of each type of replica (primary replica, standby replica 1, standby replica 2; abbreviated primary, standby 1, standby 2) on each storage node, are identical or close;
(3) Dispersion: whether, after a node fails, the load of the failed node can be spread evenly over the other nodes; and
(4) Invalid data migration: whether invalid data migration occurs during node failure and failure recovery. Invalid data migration means that, when a node fails, data that is not on the failed node is migrated to other nodes; or that, when a failed node recovers or a node is added, data is migrated from other nodes to a node that is neither the recovered node nor the newly added node.
Among these indicators, security determines whether the stored data is safe, while the latter three affect the service life of the storage nodes and the performance of data I/O. The indicators are interrelated and mutually constraining, and balancing all of them is the difficult part of a partition allocation algorithm.
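The first two indicators can be expressed as simple checks over an allocation. The following is a minimal sketch, assuming a security level of server; the function names, the node names, and the tiny topology are hypothetical and serve only as illustration.

```python
def violates_security(replica_nodes, domain_of):
    """True if two replicas of one partition share a security-level unit."""
    domains = [domain_of[node] for node in replica_nodes]
    return len(set(domains)) < len(domains)


def load_spread(assignment, nodes):
    """Largest minus smallest replica count over all nodes (0 = balanced)."""
    counts = {node: 0 for node in nodes}
    for replica_nodes in assignment.values():
        for node in replica_nodes:
            counts[node] += 1
    return max(counts.values()) - min(counts.values())


# Hypothetical topology: four OSDs on two servers, security level = server.
server_of = {"osd0": "s0", "osd1": "s0", "osd2": "s1", "osd3": "s1"}
# Partition id -> nodes holding its replicas (primary first).
assignment = {1: ["osd0", "osd2"], 2: ["osd3", "osd1"]}

print(violates_security(assignment[1], server_of))
print(load_spread(assignment, server_of))
```

A real allocator would also track the count of each replica type per node, as indicator (2) requires; the sketch counts all replicas together for brevity.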
A commonly used partition allocation algorithm at present is the rule-based partition allocation algorithm:
Each storage node in the storage system (rack, server, OSD (Object Storage Device), etc.) records information related to partition allocation, for example: the number of partitions assigned to the storage node, the total number of partitions, the number of each type of partition replica, the number of standby replicas, corresponding to the primary and standby replicas held on this storage node, that are placed on each of the other storage nodes, and the weight corresponding to the storage capacity of each storage node. During partition allocation, the storage nodes are compared on the basis of this recorded information according to rules such as security, balance, dispersion, and no invalid migration, and the storage node best suited to hold each replica of the partition is selected. When a node fails, all partition replicas on the failed node are taken out and reallocated by the same method. On failure recovery or capacity expansion, the partition replicas on the existing storage nodes are compared globally, again using the recorded information and the rules of security, balance, dispersion, and no invalid migration; the least suitably placed replicas are selected and reallocated, by the same rules, to the recovered or newly added storage node.
When the rule-based algorithm allocates all partitions across all storage nodes in a single pass, the result can have nearly perfect balance and dispersion while satisfying security. However, to keep the allocation optimal each time a failed node is removed or restored, and additionally to satisfy the no-invalid-migration condition, the algorithm is constrained in such a way that, once all failed storage nodes have recovered, the resulting allocation is worse than the one obtained by allocating all partitions across all storage nodes in a single pass. If nodes then fail again, the allocation deteriorates further from that already degraded basis and becomes worse and worse.
Summary of the invention
In view of this, the present invention provides a partition allocation method, a partition allocation apparatus, and a distributed storage system that keep the partition allocation as close to optimal as possible when nodes in the storage system fail.
To achieve the above object, in a first aspect, an embodiment of the present invention provides a partition allocation method, comprising:
establishing in advance a snapshot of the partition allocation result under a stable storage node topology, where the stable storage node topology is a topology with no failed nodes, and the snapshot records each partition replica and the storage node allocated to it;
when a storage node fails, allocating the partition replicas on the failed node to the remaining storage nodes; and
if the failed node recovers, reallocating back to the recovered node, according to the record in the snapshot, the partition replicas that were allocated from it to the remaining storage nodes.
With reference to the first aspect, in a first possible implementation, if there are multiple failed nodes and at least one of them recovers, then for each recovered node, after the step of reallocating back to the recovered node, according to the record in the snapshot, the partition replicas that were allocated from it to the remaining storage nodes, the method further comprises:
allocating to the recovered node, according to the record in the snapshot, some of the partition replicas that were allocated to the remaining storage nodes from failed nodes that have not yet recovered.
With reference to the first aspect or its first possible implementation, in a second possible implementation, the step of reallocating back to the recovered node, according to the record in the snapshot, the partition replicas that were allocated from it to the remaining storage nodes comprises:
if there is a conflicting replica that has a security conflict with a partition replica to be reallocated back to the recovered node, exchanging the partition replica to be reallocated with the conflicting replica, and then allocating the conflicting replica, after the exchange, back to the recovered node;
where the conflicting replica is a partition replica that is stored on another storage node in the same security-level unit as the recovered node and that belongs to the same partition as the partition replica to be reallocated back to the recovered node; and
the exchange comprises an exchange of replica types and an exchange, in the snapshot, of the correspondence between replicas and storage nodes.
With reference to the first aspect or its first or second possible implementation, in a third possible implementation, the step of allocating the partition replicas on the failed node to the remaining storage nodes is specifically: on the premise that security is satisfied, allocating the partition replicas on the failed node according to the principle of making the balance and dispersion of the allocation result optimal.
In a second aspect, an embodiment of the present invention provides a partition allocation apparatus, comprising:
a snapshot establishing unit, configured to establish in advance a snapshot of the partition allocation result under a stable storage node topology, where the stable storage node topology is a topology with no failed nodes, and the snapshot records each partition replica and the storage node allocated to it;
an allocation unit, configured to allocate, when a storage node fails, the partition replicas on the failed node to the remaining storage nodes; and
a reallocation unit, configured to reallocate back to the recovered node, when the failed node recovers and according to the record in the snapshot, the partition replicas that were allocated from it to the remaining storage nodes.
With reference to the second aspect, in a first possible implementation, the reallocation unit comprises:
a load balancing module, configured to allocate to the recovered node, according to the record in the snapshot, some of the partition replicas that were allocated to the remaining storage nodes from failed nodes that have not yet recovered.
With reference to the second aspect or its first possible implementation, in a second possible implementation, the reallocation unit comprises:
an exchange module, configured to exchange a partition replica to be reallocated back to the recovered node with a conflicting replica, and then allocate the conflicting replica, after the exchange, back to the recovered node;
where the conflicting replica, which has a security conflict, is a partition replica that is stored on another storage node in the same security-level unit as the recovered node and that belongs to the same partition as the partition replica to be reallocated back to the recovered node; and
the exchange comprises an exchange of replica types and an exchange, in the snapshot, of the correspondence between partition replicas and storage nodes.
With reference to the second aspect or its first or second possible implementation, in a third possible implementation, the allocation unit allocates the partition replicas on the failed node, on the premise that security is satisfied, according to the requirement that the balance and dispersion of the allocation result be optimal.
In a third aspect, an embodiment of the present invention provides a distributed storage system comprising a client server, a master server, and storage servers, where the master server comprises the partition allocation apparatus of the second aspect or of any implementation of the second aspect.
The method, apparatus, and distributed storage system provided by the embodiments of the present invention take the partition allocation result of the stable state as a baseline and reallocate partition replicas according to a snapshot of that result. When nodes in the storage system fail, the partition allocation can therefore be kept as close to optimal as possible, so that the storage system runs essentially under an optimal partition allocation scheme.
Further features and aspects of the present invention will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Detailed description of the embodiments
Various exemplary embodiments, features, and aspects of the present invention are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" as used here means "serving as an example, embodiment, or illustration". Any embodiment described here as "exemplary" is not necessarily to be construed as preferred over or superior to other embodiments.
In addition, numerous specific details are given in the embodiments below to better explain the present invention. Those skilled in the art will understand that the present invention can be practiced without some of these details. In other instances, well-known methods, means, elements, and circuits are not described in detail, so as to highlight the gist of the present invention.
As shown in Figure 3, the partition allocation method of an embodiment of the present invention comprises:
S1. Establish in advance a snapshot of the partition allocation result under a stable storage node topology.
The snapshot records each partition replica and the storage node allocated to it. The stable storage node topology is a topology under which the user expects the system to run for a long time; in other words, a topology with no failed nodes. It may be the topology of the initial configuration of the storage system, the topology after the user expands capacity, the topology after the user reduces capacity, or any other topology that the user wants the system to keep for a long time. The partition allocation result under such a topology can be taken as the snapshot.
Under this stable topology, all partition replicas to be allocated and the topology of the storage nodes can be analyzed globally, and all partition replicas to be allocated are assigned to storage nodes by some partition allocation algorithm (for example, the rule-based partition allocation algorithm). Because all replicas and all storage nodes are analyzed globally, the overall allocation result can be optimal, or a best-effort near-optimal result, in security, balance, and dispersion. The snapshot of this result can serve as the basis for reallocation.
The snapshot records the partition allocation scenario and information about the partition allocation result. The scenario comprises the information that influences the allocation result, such as queue information of the storage nodes and topology information of the storage nodes; the information recorded for the scenario differs with the method used for the initial partition allocation. The allocation result is the correspondence between every replica of every partition and the storage nodes. Figure 4 shows the result of partition allocation by the rule-based partition allocation algorithm, that is, the information recorded in the snapshot established for that allocation result.
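One possible in-memory shape for such a snapshot is sketched below. It is an illustration only: the `Snapshot` class, its field names, and the `take_snapshot` helper are hypothetical, and the scenario is reduced to a node-to-server map, whereas the text notes that the recorded scenario depends on the initial allocation method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    topology: dict    # scenario: storage node -> server
    placement: dict   # result: (partition, replica type) -> storage node

    def replicas_on(self, node):
        """Replicas that the snapshot assigns to `node`, in sorted order."""
        return sorted(r for r, n in self.placement.items() if n == node)


def take_snapshot(topology, placement):
    # Copy both mappings so later changes to the live state leave the snapshot intact.
    return Snapshot(dict(topology), dict(placement))


topology = {"osd0": "server0", "osd4": "server1"}
live = {(1, "primary"): "osd0", (1, "standby"): "osd4"}
snap = take_snapshot(topology, live)
live[(1, "primary")] = "osd8"   # a later failure moves the replica in the live state
print(snap.replicas_on("osd0"))
```

Copying the mappings at snapshot time is the design point: the live placement changes during failures, while the snapshot must keep describing the stable state.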
S2. When a storage node suffers a permanent failure, allocate the partition replicas on that storage node (hereinafter, the failed node) to the remaining storage nodes. On the premise that security is satisfied, the replicas should be allocated so that the balance and dispersion of the allocation result are as close to optimal as possible.
Note that the method does not act on a node with a transient failure but waits for it to recover. If the node has still not recovered after a certain waiting time, it is confirmed to have a permanent failure. Once the storage device of the permanently failed node has been replaced and the node is confirmed to have recovered, the partition replicas are reallocated. Accordingly, "failed node" in the method of this embodiment means a storage node with a permanent failure.
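Step S2 can be sketched as a greedy pass over the replicas of the failed node. The source does not prescribe a particular selection rule, so the rule below is an assumption: among the nodes that keep security (server level), pick the least loaded one, and break ties in favor of nodes that have taken over fewer replicas from the failed node. The `evacuate` function and the topology are hypothetical.

```python
def evacuate(failed, placement, server_of, healthy):
    """Reassign every replica on `failed`; return the list of (replica, src, dst) moves."""
    load = {node: 0 for node in healthy}
    for node in placement.values():
        if node in load:
            load[node] += 1
    received = {node: 0 for node in healthy}   # replicas taken over from `failed`
    moves = []
    for replica in sorted(r for r, n in placement.items() if n == failed):
        # Servers already holding another replica of the same partition.
        used = {server_of[n] for r, n in placement.items()
                if r[0] == replica[0] and r != replica}
        candidates = [n for n in healthy if server_of[n] not in used]
        if not candidates:
            raise RuntimeError(f"no safe node for replica {replica}")
        # Balance first, then dispersion, then name for determinism.
        target = min(candidates, key=lambda n: (load[n], received[n], n))
        placement[replica] = target
        load[target] += 1
        received[target] += 1
        moves.append((replica, failed, target))
    return moves


# Hypothetical topology: six OSDs on three servers.
server_of = {"osd0": "s0", "osd1": "s0", "osd2": "s1",
             "osd3": "s1", "osd4": "s2", "osd5": "s2"}
placement = {(1, "primary"): "osd0", (1, "standby"): "osd2",
             (2, "primary"): "osd0", (2, "standby"): "osd4",
             (3, "primary"): "osd3", (3, "standby"): "osd5"}
moves = evacuate("osd0", placement, server_of, ["osd1", "osd2", "osd3", "osd4", "osd5"])
print(moves)
```

Every move starts at the failed node, so the pass itself produces no invalid migration in the sense defined in the background.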
S3. If the failed node recovers, reallocate back to the recovered node, according to the record in the snapshot, the partition replicas that were allocated from it to the remaining storage nodes.
The method of this embodiment takes the partition allocation result under the stable storage node topology as a baseline and reallocates partition replicas according to the snapshot of that result. When nodes in the storage system fail, the partition allocation can be kept as close to optimal as possible, so that the storage system runs essentially under a near-optimal partition allocation scheme.
Specifically, after a storage node suffers a permanent failure, the partition replicas assigned to that node are allocated to the remaining normal storage nodes according to certain rules. In step S3, after the failed storage device of the node has been replaced and the node has recovered, partition replicas are reallocated preferentially according to the snapshot: the replicas that the snapshot records as assigned to this node are reallocated back to it.
If there are multiple failed nodes, then each time partition replicas have been reallocated back to a recovered node, a further balancing allocation is needed: some of the partition replicas that the snapshot records as belonging to failed nodes that have not yet recovered (and that were allocated to the remaining storage nodes in step S2) are allocated to the recovered node, so that the load of the storage nodes is balanced.
Take again the snapshot shown in Figure 4. Suppose that, after this snapshot has been established, the three storage nodes OSD 1, OSD 2, and OSD 4 suffer permanent failures. According to the method of this embodiment, step S2 distributes the partition replicas on these three nodes evenly over the remaining normal storage nodes. The resulting allocation on the remaining normal storage nodes is shown in Figure 5: each node holds the replicas that the snapshot records for it and, in addition, replicas transferred from the three failed nodes. Suppose OSD 4 recovers. After step S3, OSD 4 holds only the replicas recorded for it in the snapshot, and the allocation on the other normal storage nodes is as shown in Figure 6: each holds the replicas that the snapshot records for it plus replicas transferred from the failed nodes OSD 1 and OSD 2. OSD 4 now holds fewer partition replicas than the other normal storage nodes, so a further balancing allocation is needed: some of the replicas that the snapshot records as belonging to OSD 1 and OSD 2 (currently spread over the other normal storage nodes) are moved from those nodes to OSD 4, so that the partition replicas on all normal storage nodes in the storage system are load balanced. If OSD 2 then recovers, the handling is similar to that for OSD 4. Finally, OSD 1 recovers. Because it is the last failed node to recover, no further balancing allocation is needed; after OSD 1 recovers and step S3 is applied, the partition allocation state of the whole storage system returns to the state recorded in the snapshot.
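The recovery step with several failed nodes can be sketched as two phases: move back what the snapshot assigns to the recovered node, then top it up with replicas that belong to nodes that are still failed. The sketch below does not reproduce Figures 4 to 6; the six-node topology (one OSD per server), the `restore` function, and the stopping rule (stop when no donor holds at least two replicas more than the recovered node) are assumptions made for illustration. The first phase omits the security check, which the exchange of conflicting replicas described later is meant to handle.

```python
def restore(recovered, snapshot, placement, still_failed, server_of):
    """Return (replica, src, dst) moves made when `recovered` comes back."""
    moves = []

    def move(replica, dst):
        moves.append((replica, placement[replica], dst))
        placement[replica] = dst

    def load(node):
        return sum(1 for n in placement.values() if n == node)

    def safe(replica, node):
        # No other replica of the same partition may sit on the same server.
        return all(server_of[n] != server_of[node] for r, n in placement.items()
                   if r[0] == replica[0] and r != replica)

    # Phase 1: replicas whose snapshot home is the recovered node go back.
    for replica, home in sorted(snapshot.items()):
        if home == recovered and placement[replica] != recovered:
            move(replica, recovered)

    # Phase 2: top up with replicas that belong to nodes that are still failed.
    while True:
        candidates = [(r, n) for r, n in sorted(placement.items())
                      if n != recovered and snapshot[r] in still_failed
                      and load(n) - load(recovered) >= 2 and safe(r, recovered)]
        if not candidates:
            return moves
        replica, _donor = max(candidates, key=lambda c: load(c[1]))
        move(replica, recovered)


server_of = {f"osd{i}": f"server{i}" for i in range(6)}
snapshot = {(1, "primary"): "osd0", (1, "standby"): "osd1",
            (2, "primary"): "osd1", (2, "standby"): "osd2",
            (3, "primary"): "osd2", (3, "standby"): "osd3",
            (4, "primary"): "osd3", (4, "standby"): "osd4",
            (5, "primary"): "osd4", (5, "standby"): "osd5",
            (6, "primary"): "osd5", (6, "standby"): "osd0"}
# Live state after osd1, osd2, and osd3 have failed and been evacuated.
placement = dict(snapshot)
placement.update({(1, "standby"): "osd4", (2, "primary"): "osd0", (2, "standby"): "osd5",
                  (3, "primary"): "osd4", (3, "standby"): "osd0", (4, "primary"): "osd5"})
moves = restore("osd3", snapshot, placement, {"osd1", "osd2"}, server_of)
print(moves)
```

In this run two replicas return to osd3 from the snapshot and one replica that belongs to the still-failed osd1 is added for balance, leaving three replicas on each of the four working nodes; every move ends at the recovered node.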
When a node fails, the method only migrates partition replicas from the failed node to other storage nodes; when a failed node recovers, it only migrates partition replicas from other storage nodes to the recovered node. No invalid migration therefore occurs at any point during storage node failure and recovery. Throughout the life cycle of the distributed storage system, the security, balance, and dispersion of its data distribution are maintained, and no invalid data migration takes place.
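The no-invalid-migration property can be stated as a check over a list of moves. This is a minimal sketch; the `invalid_moves` helper and the `(replica, source, destination)` tuple layout are hypothetical.

```python
def invalid_moves(moves, event, node):
    """Return the moves that count as invalid migration for the given event.

    moves: iterable of (replica, source node, destination node)
    event: "fail" (node has failed) or "recover" (node has recovered or been added)
    """
    if event == "fail":
        # Only data leaving the failed node may move.
        return [m for m in moves if m[1] != node]
    if event == "recover":
        # Only data arriving at the recovered or new node may move.
        return [m for m in moves if m[2] != node]
    raise ValueError(f"unknown event: {event}")


on_failure = [("p1-primary", "osd0", "osd8"), ("p7-standby", "osd3", "osd5")]
print(invalid_moves(on_failure, "fail", "osd0"))
```

Here the second move is flagged because it relocates data that was not on the failed node.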
In addition, to avoid security conflicts while partitions are moved back, step S3 of the method also includes exchanging conflicting partition replicas. Specifically, suppose there is a conflicting replica that has a security conflict with a partition replica to be reallocated back to the recovered node, that is, a replica that is stored on another storage node in the same security-level unit as the recovered node and that belongs to the same partition as the replica to be reallocated. The replica to be reallocated is then exchanged with the conflicting replica; the exchange comprises an exchange of replica types and an exchange of the correspondence, recorded in the snapshot, between partition replicas and storage nodes. After the exchange, the former conflicting replica is allocated back to the recovered node, and the replica that was originally to be reallocated is not moved.
Take the topology shown in Figure 7 as an example: each partition has two replicas (a primary replica and a standby replica), and the security level of the storage system is server (that is, different replicas of one partition cannot be assigned to the same server). After the initial partition allocation is complete, a snapshot is established from the allocation result. Suppose that in the snapshot the primary replica of partition 1 is assigned to OSD 0 and the standby replica to OSD 4, where OSD 0 belongs to server 0 and OSD 4 belongs to server 1. Then, according to the method of this embodiment:
If OSD 0 fails, the primary replica of partition 1 is temporarily assigned to OSD 8, which belongs to server 3. Suppose OSD 4 then fails as well, and the standby replica of partition 1 is temporarily assigned to OSD 2.
After OSD 0 recovers, the snapshot calls for the primary replica of partition 1 to be moved back from OSD 8 to OSD 0. At this point, however, the standby replica of partition 1 is on OSD 2. If the primary replica were reallocated back to OSD 0, both replicas of partition 1 would be on server 0 at the same time, which violates security. Moving the standby replica of partition 1 to an OSD under another server first and then moving the primary replica back to OSD 0 would solve the problem, but it would introduce an invalid migration.
Therefore, in step S3 the method first performs a primary/standby switch between the two replicas of partition 1, comprising an exchange of replica types and an exchange, in the snapshot, of the correspondence between the replicas and storage nodes. The primary replica of partition 1 is now on OSD 2 and the standby replica is on OSD 8. The primary replica is then moved from OSD 2 back to OSD 0 according to the snapshot. This avoids the invalid migration, and the partition allocation is restored in line with the snapshot.
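The OSD 0 / OSD 4 / OSD 8 / OSD 2 example can be traced in a few lines. The sketch is an illustration under assumptions: the `Replica` class, its `home` field (the node recorded for the replica in the snapshot), and the `return_home` function are hypothetical, and OSD 2 is placed on server 0 as the example implies.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    rid: str    # identity of the physical copy
    kind: str   # "primary" or "standby"
    node: str   # where the copy currently lives
    home: str   # node recorded for this copy in the snapshot


def return_home(replica, siblings, recovered, server_of):
    """Bring one replica of a partition back to `recovered`; return the single move."""
    for other in siblings:
        if server_of[other.node] == server_of[recovered]:
            # Security conflict: exchange replica types and snapshot homes,
            # then move the conflicting copy instead of the original one.
            replica.kind, other.kind = other.kind, replica.kind
            replica.home, other.home = other.home, replica.home
            replica = other
            break
    move = (replica.rid, replica.node, recovered)
    replica.node = recovered
    return move


server_of = {"osd0": "server0", "osd2": "server0", "osd4": "server1", "osd8": "server3"}
a = Replica("a", "primary", node="osd8", home="osd0")   # evacuated from OSD 0
b = Replica("b", "standby", node="osd2", home="osd4")   # evacuated from OSD 4
move = return_home(a, [b], "osd0", server_of)
print(move, a, b)
```

After the call, copy `b` is the primary on OSD 0 and copy `a` is the standby on OSD 8 with OSD 4 as its snapshot home, so one migration suffices and the later recovery of OSD 4 can bring the standby back without any extra move.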
In summary, the method of this embodiment makes a distributed storage system more stable and efficient and, by distributing data evenly and avoiding invalid data migration, extends the service life of the storage nodes.
Figure 8 is a structural block diagram of a partition allocation apparatus 800 according to an embodiment of the present invention. The apparatus 800 performs partition allocation according to the method of the embodiments shown in Figures 3 to 7 and comprises:
a snapshot establishing unit 810, configured to establish in advance a snapshot of the partition allocation result under a stable storage node topology, where the stable storage node topology is a topology with no failed nodes, and the snapshot records each partition replica and the storage node allocated to it;
an allocation unit 820, configured to allocate, when a storage node fails and on the premise that security is satisfied, the partition replicas on the failed node to the remaining storage nodes according to the requirement that the balance and dispersion of the allocation result be optimal; and
a reallocation unit 830, configured to reallocate back to the recovered node, when the failed node recovers and according to the record in the snapshot, the partition replicas that were allocated from it to the remaining storage nodes.
The apparatus of this embodiment takes the partition allocation result under the stable storage node topology as a baseline and reallocates partitions according to the snapshot of that result. When nodes in the storage system fail, the partition allocation can be kept as close to optimal as possible, so that the storage system runs essentially under a near-optimal partition allocation scheme.
In addition, in the apparatus of this embodiment, the reallocation unit 830 may comprise a load balancing module 831, configured to allocate to the recovered node, according to the record in the snapshot, some of the partition replicas that were allocated to the remaining storage nodes from failed nodes that have not yet recovered, so that the load of the storage nodes is balanced.
The reallocation unit 830 may further comprise an exchange module 832, configured to exchange a partition replica to be reallocated back to the recovered node with a conflicting replica, and then allocate the conflicting replica, after the exchange, back to the recovered node. The conflicting replica is a partition replica that is stored on another storage node in the same security-level unit as the recovered node and that belongs to the same partition as the partition replica to be reallocated back to the recovered node. The exchange comprises an exchange of replica types and an exchange, in the snapshot, of the correspondence between partition replicas and storage nodes.
When a node fails, the apparatus only migrates partition replicas from the failed node to other storage nodes; when a failed node recovers, it only migrates partition replicas from other storage nodes to the recovered node. No invalid migration therefore occurs at any point during storage node failure and recovery. Throughout the life cycle of the distributed storage system, the security, balance, and dispersion of its data distribution are maintained, and no invalid data migration takes place. The apparatus of this embodiment makes a distributed storage system more stable and efficient and, by distributing data evenly and avoiding invalid data migration, extends the service life of the storage nodes.
An embodiment of the present invention further provides a distributed storage system as shown in Figure 1, comprising a client server, a master server, and storage servers. The specific embodiments of the present invention do not limit the specific implementation of the master server. As shown in Figure 9, the master server 900 may comprise:
a processor 910, a communications interface 920, a memory 930, and a communication bus 940, where:
The processor 910, the communications interface 920, and the memory 930 communicate with one another over the communication bus 940.
The communications interface 920 is configured to communicate with network elements such as clients.
The processor 910 is configured to execute a program 932 and may specifically perform the relevant steps of the method embodiments shown in Figures 3 to 7.
Specifically, the program 932 may comprise program code, and the program code comprises computer operation instructions.
The processor 910 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 930 is configured to store the program 932. The memory 930 may comprise high-speed RAM and may also comprise non-volatile memory, for example at least one magnetic disk storage device. The program 932 may specifically comprise:
a snapshot establishing unit, configured to establish in advance a snapshot of the partition allocation result under a stable storage node topology;
an allocation unit, configured to allocate, when a storage node fails and on the premise that security is satisfied, the partition replicas on the failed node to the remaining storage nodes according to the requirement that the balance and dispersion of the allocation result be optimal; and
a reallocation unit, configured to reallocate back to the recovered node, when the failed node recovers and according to the record in the snapshot, the partition replicas that were allocated from it to the remaining storage nodes.
For the specific implementation of each unit in the program 932, refer to the corresponding unit in the apparatus embodiment shown in Figure 8; details are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
Those of ordinary skill in the art will recognize that the units and method steps of the examples described in the embodiments disclosed here can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium comprises any medium that can store program code, such as a USB flash drive, a portable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.