CN104331478A - Data consistency management method for self-compaction storage system - Google Patents
- Publication number
- CN104331478A CN104331478A CN201410614846.4A CN201410614846A CN104331478A CN 104331478 A CN104331478 A CN 104331478A CN 201410614846 A CN201410614846 A CN 201410614846A CN 104331478 A CN104331478 A CN 104331478A
- Authority
- CN
- China
- Prior art keywords
- metadata
- node
- domain
- data
- inactive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
- G06F16/24557—Efficient disk access during query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2237—Vectors, bitmaps or matrices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2365—Ensuring data consistency and integrity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24573—Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Library & Information Science (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data consistency management method for a thin-provisioning (self-compaction) storage system, belongs to the technical field of automatic thin-provisioning configuration, and designs a metadata structure for data block management together with an implementation scheme for the metadata storage structure. For the metadata structure that manages data blocks, an improved B+Tree structure is designed, which cooperates with metadata such as the superblock, metadata bitmap, and data bitmap to realize thin-provisioned management of the data blocks. On the basis of the original B+Tree data structure, the space of each non-leaf node is doubled and divided into an active domain and an inactive domain, so that essentially no additional disk space is allocated while modifying the B+Tree, reducing the complexity of metadata modification operations; meanwhile, the extra inactive domain space can also be used as a metadata copy or for historical operation records, reducing the cost of copy maintenance or log maintenance in the storage system.
Description
Technical field
The present invention relates to the field of thin provisioning, and specifically to a data consistency management method for a thin-provisioning storage system.
Background art
The volume of data produced by today's Internet is growing explosively, placing ever higher demands on the capacity and performance of storage systems. Existing storage systems suffer from low disk utilization and wasted storage resources, which has led to the emergence of thin provisioning in recent years.
Thin provisioning uses an "allocate on write" strategy: by changing storage systems to on-demand resource allocation, it improves disk space utilization and storage performance while reducing deployment cost and resource consumption. "Allocate on write" means that space is allocated from the thin-provisioning storage pool only when data is written to a thin logical volume. In practice, the pool's storage space is divided into equal-sized data blocks that are organized and managed through structures such as a B+Tree, supporting operations including block allocation, reclamation, and lookup. The thin-provisioning storage pool is divided into a data area and a metadata area: the data area stores user data, while the metadata area contains the pool superblock, metadata bitmap, data bitmap, logical volume information, and so on. The metadata organizes and governs the thin-provisioning storage pool and is therefore critical: once it becomes erroneous or inconsistent, user data can be lost and the whole storage system may even crash. During normal operation, metadata resides in memory and is periodically flushed to disk; if the system fails, for example through hard faults such as controller failure, controller power loss, RAID failure, or RAID power loss, a flush may fail and the metadata may be corrupted. Ensuring metadata consistency is therefore a key concern for storage systems and thin-provisioning technology.
In implementations of thin provisioning, data blocks are mostly managed with a B+Tree data structure. One management method that can be taken to keep the metadata B+Tree consistent is: when modifying the B+Tree, create an additional identical copy of the tree, apply the operation to the copy, and once the whole operation completes, redirect the pointer that references the original root node to the root of the newly created tree and free the space of the original tree, thereby achieving the metadata modification. The advantage of this method is that it guarantees the consistency of the metadata B+Tree: at every stage of the modification, the metadata stored on disk remains consistent, which protects well against metadata inconsistency caused by hard faults such as controller failure. Its drawbacks are equally clear: every modification of the metadata B+Tree requires rebuilding a B+Tree of equal size, allocating space for every level during the rebuild; and, to guarantee metadata availability, a copy must also be stored elsewhere on the same RAID as the metadata B+Tree. The method is therefore expensive in both time and space.
Another metadata management method dedicates a single RAID in the storage system as the metadata space, storing only metadata; this easily makes that RAID a "hotspot" and a single point of failure for data access in the system. One remedy is to scatter the metadata across several RAIDs, but a controller failure in the system can still severely affect metadata access.
Summary of the invention
The invention provides a data consistency management method for a thin-provisioning storage system that overcomes the shortcomings of the two methods above: it guarantees metadata consistency while also reducing the time and space complexity of metadata operations.
The invention designs a metadata structure for data block management and an implementation scheme for the metadata storage structure. For the block-management metadata structure, an improved B+Tree is designed, which cooperates with metadata such as the superblock, metadata bitmap, and data bitmap to realize thin-provisioned management of data blocks. On top of the original B+Tree data structure, the space of each non-leaf node is doubled and divided into an active domain and an inactive domain, so that modifying the B+Tree requires essentially no additional disk space allocation, reducing the complexity of metadata modification operations; the extra inactive domain space can also serve as a metadata copy or a record of historical operations, reducing the cost of copy maintenance or log maintenance in the storage system.
A data consistency management method for a thin-provisioning storage system, characterized by:
S1: In the B+Tree that organizes the metadata, enlarge the space of every non-leaf node. On top of the original B+Tree data structure, double the space of each non-leaf node and divide it into two parts, an active domain and an inactive domain. The active domain stores the mapping B+Tree node's data, i.e. (key, value) pairs; the inactive domain may, depending on policy, store a copy of the active domain's data, or the node's data as it was before the last modification. Modifications to a node are carried out in its inactive domain; when the modification of the node completes, the active and inactive domains are exchanged. Each non-leaf node's start address is aligned to the node size when allocated: for example, with a node size of 8 KB, where the active and inactive domains occupy 4 KB each, the node start address is aligned to 8 KB. In this way no external storage space needs to be allocated while modifying the metadata, reducing the complexity of metadata modification operations.
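As an illustration of this copy-modify-swap discipline (a toy model with names of our own choosing, not code from the patent), the sketch below keeps a node's two domains as plain Python dicts: edits go only to the inactive domain, and the commit is simply flipping which half counts as active.

```python
class ShadowNode:
    """Toy model of a non-leaf node with an active and an inactive domain.
    All modifications go to the inactive domain; the swap is the commit."""

    def __init__(self):
        self.domains = [{}, {}]   # [A domain, B domain] as key -> pointer maps
        self.active = 0           # index of the currently active domain

    def begin(self):
        """Copy the active domain into the inactive one and return the
        inactive (editable) domain."""
        inactive = 1 - self.active
        self.domains[inactive] = dict(self.domains[self.active])
        return self.domains[inactive]

    def commit(self):
        """Swap roles: the modified inactive domain becomes active."""
        self.active = 1 - self.active


node = ShadowNode()
draft = node.begin()
draft[42] = "block-ptr"                      # modify only the inactive domain
assert 42 not in node.domains[node.active]   # active data untouched so far
node.commit()
assert node.domains[node.active][42] == "block-ptr"
```

Because both halves live inside one node-size-aligned allocation, a modification requests no extra disk space; the pointer flip is the single commit point.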
Three kinds of operations modify the metadata: adding a data block mapping, deleting a data block mapping, and modifying a data block mapping. Each operation proceeds as follows:
A. Adding a data block mapping
1. Search the mapping B+Tree for the parent node N of the data block to be added;
2. Copy N's active domain data to its inactive domain;
3. Modify N's inactive domain: add the key and index pointer, the added pointer referencing the new node, i.e. insert the new node under N;
4. Determine whether N needs to split. If not, go to step 7; if N needs to split, go to step 5;
5. Split N into nodes N' and N''; after the split, the parent of the original node N points to N';
5.1. Search the metadata bitmap B+Tree for a free metadata block;
5.2. Allocate and initialize the new node N'', and update the metadata bitmap B+Tree;
5.3. Compute the metadata each of the post-split nodes N' and N'' will contain, i.e. their ranges of (key, value) pairs;
5.4. According to the computed result, copy one part of the splitting node N's active domain data to its inactive domain and the other part to the active domain of the newly allocated node N''; node N is now called N';
6. Go to step 2 and insert node N'' into the parent node M of the split node;
7. Redirect the pointer in each modified node's parent to that node's modified inactive domain;
8. Update the bit corresponding to the newly added data block in the data bitmap B+Tree, marking it in use;
9. Update other metadata such as the superblock, changing the sizes of logical device objects such as the storage pool and logical volumes;
10. The operation is complete.
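Steps 2-7 above, including the split in step 5, can be sketched as follows. This is a simplified single-node model under assumed names (`Node`, `add_mapping`, a small `capacity`), not the patent's on-disk layout: the shadow copy stands in for the inactive domain, and the returned promoted key models handing N'' up to the parent M in step 6.

```python
class Node:
    """Minimal stand-in for a mapping B+Tree non-leaf node: 'active' holds
    the live (key, pointer) pairs, 'inactive' the shadow used while editing."""

    def __init__(self, capacity=4):
        self.active, self.inactive = {}, {}
        self.capacity = capacity


def add_mapping(node, key, ptr):
    """Steps 2-5 of operation A: shadow-copy, edit, split on overflow.
    Returns (node, new_sibling_or_None, promoted_key_or_None)."""
    node.inactive = dict(node.active)          # step 2: active -> inactive copy
    node.inactive[key] = ptr                   # step 3: add key + index pointer
    if len(node.inactive) <= node.capacity:    # step 4: no split needed
        node.active, node.inactive = node.inactive, node.active  # step 7: swap
        return node, None, None
    items = sorted(node.inactive.items())      # step 5: split N into N', N''
    mid = len(items) // 2
    sibling = Node(node.capacity)              # 5.1/5.2: allocate + init N''
    node.inactive = dict(items[:mid])          # 5.4: lower half stays in N'
    sibling.active = dict(items[mid:])         #      upper half goes to N''
    node.active, node.inactive = node.inactive, node.active
    return node, sibling, items[mid][0]        # promoted key for the parent M
```

A caller would recurse with the promoted key into the parent (step 6) and finally flip the parent pointers (step 7); the sketch stops at a single node for brevity.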
B. Deleting a data block mapping
1. Search the mapping B+Tree for the parent node N of the node to be deleted;
2. Copy N's active domain data to its inactive domain;
3. Modify N's inactive domain, deleting the key and index pointer;
4. Determine whether N needs to merge with another node. If not, go to step 7; if a merge is needed, go to step 5;
5. Find the node N' to merge with and perform the node merge; after the merge, N's parent points to the new merged node M;
5.1. Examine the nodes N and N' to be merged and determine the metadata the merged node will contain;
5.2. According to the computed result, copy the data of nodes N and N' into the inactive domain of N'; node N' is now called node M;
6. Go to step 2 and delete the merged-away node N from its parent;
7. Redirect the pointer in each modified node's parent to that node's modified inactive domain;
8. Free the space of the deleted node;
9. Update the bit corresponding to the deleted data block in the data bitmap B+Tree, marking it unused;
10. Update other metadata such as the superblock, changing the sizes of logical device objects such as the storage pool and logical volumes;
11. The operation is complete.
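A minimal sketch of the deletion path, again with illustrative names (`make_node`, `delete_mapping`, `merge_nodes`) and Python dicts standing in for the on-disk domains: deletion edits the shadow copy and reports underflow, and the merge folds both nodes' entries into the survivor's inactive domain before the swap.

```python
def make_node(entries=None):
    """A node as a dict with 'active' and 'inactive' domains (toy model)."""
    return {"active": dict(entries or {}), "inactive": {}}


def delete_mapping(node, key, min_fill=2):
    """Steps 2-4 of operation B: shadow-copy, delete, report underflow.
    The caller merges with a sibling (step 5) when True is returned."""
    node["inactive"] = dict(node["active"])     # step 2: shadow copy
    del node["inactive"][key]                   # step 3: drop key + pointer
    node["active"], node["inactive"] = node["inactive"], node["active"]
    return len(node["active"]) < min_fill       # step 4: merge needed?


def merge_nodes(survivor, victim):
    """Step 5: combine both nodes' entries in the survivor's inactive
    domain, then swap; the victim's space is freed afterwards (step 8)."""
    survivor["inactive"] = dict(survivor["active"])
    survivor["inactive"].update(victim["active"])
    survivor["active"], survivor["inactive"] = (
        survivor["inactive"], survivor["active"])
    return survivor
```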
C. Modifying a data block mapping
1. Search the mapping B+Tree to determine the parent node N to which the data block to be modified currently belongs and the parent node N' it belongs to after the mapping change;
2. Delete the data block mapping from node N;
3. Insert the data block under node N';
4. The operation is complete.
S2: Scatter the metadata across each underlying storage unit of the storage pool.
The metadata is distributed across all the RAIDs in the storage pool and organized and managed through structures such as B+Trees; this both improves metadata access performance and reduces the risk of metadata loss caused by hardware failures.
Because metadata is stored on every RAID of the pool, and the pool is expanded or shrunk in units of RAIDs, the metadata space on each RAID is likewise doubled and divided into an active space and an inactive space. When the system expands or shrinks, only the inactive metadata space is modified, leaving normal access to the active space unaffected. Once the metadata has been laid out in the inactive space, the active and inactive metadata spaces on each RAID are exchanged and the new metadata is brought into service, completing the expansion or shrink of the storage system; finally, the data in the active and inactive spaces is synchronized and the cross-RAID metadata copies are re-established.
The beneficial effects of the invention are as follows. 1) Metadata flushing to disk is improved: compared with existing B+Tree block management, the scattering of metadata allocations across the disk is reduced. 2) Space allocation overhead during metadata modification is reduced: apart from allocating a new node during a node split, no operation requires additional space. 3) The scheme is flexible to apply: the inactive domain of a mapping B+Tree non-leaf node can serve either as a copy of the mapping B+Tree metadata or as a record of historical operations supporting rollback. When the inactive domains are used as a copy, after each mapping B+Tree operation finishes, the active domain of every node is first synchronized into its inactive domain; the inactive-domain pointers are then rebuilt so that each node points to the inactive domains of its children. The inactive domains of all nodes then form an independent copy of the mapping B+Tree: if the active domain of any node other than the root is corrupted, merely redirecting the pointer that references the root switches access to the inactive-domain copy and quickly restores normal access to the mapping B+Tree metadata. Because the copy is kept together with the original, the time and space cost of the extra disk accesses needed to maintain copy consistency is saved. When the inactive domain is used as an operation history, it preserves its data from when it was last the active domain, reducing the amount of data the system log must record, the time and space cost of logging, and the complexity of reconstructing data during a rollback. 4) Metadata consistency is protected: because the active domain of every node is copied to its inactive domain before a mapping B+Tree operation, and the operation is carried out in the inactive domain, the data in the active domains remains consistent even if a controller fails or a RAID loses power during the operation; only the unfinished modification in the inactive domain is affected. Moreover, even if a single RAID loses data, the data can be rebuilt from the cross copies stored on the other RAIDs. 5) Metadata access performance is improved: distributing the metadata across all RAIDs of the system fully exploits the concurrency of multiple RAIDs, raising the IOPS of metadata access and removing the metadata single-point performance bottleneck, while also supporting online expansion and shrinking of the storage system with seamless switchover between old and new metadata afterwards.
This method does away with the complex operations existing thin-provisioning storage systems adopt to keep metadata safe, and reduces the overhead and storage space fragmentation caused by repeatedly allocating and freeing disk space as metadata grows and is deleted. It improves the access performance of metadata while guaranteeing metadata consistency, and its highly distributed metadata storage also avoids the single-point-of-failure problem in metadata access.
Brief description of the drawings
Figure 1 is a schematic diagram of the metadata structure.
Figure 2 is a schematic diagram of the mapping B+Tree structure.
Figure 3 shows node insertion, step 1.
Figure 4 shows node insertion, step 2.
Figure 5 shows node insertion, step 3.
Figure 6 shows node splitting, step 1.
Figure 7 shows node splitting, step 2.
Figure 8 shows node splitting, step 3.
Figure 9 shows node splitting, step 4.
Figure 10 shows node splitting, step 5.
Figure 11 is a schematic diagram of the metadata storage structure.
Detailed description of embodiments
With reference to the drawings, and taking the node insertion and node splitting operations on the mapping B+Tree as examples, the mapping B+Tree operations used by the invention's add, delete, and modify data block mapping procedures are emphasized below. Node deletion and node merging are the inverse processes of node insertion and node splitting respectively and are not repeated here. The storage structure of the metadata on each RAID and the metadata operations during storage system expansion and shrinking are also described.
Figure 1 is a schematic diagram of the mapping B+Tree data structure. Each non-leaf node comprises an active domain and an inactive domain of equal size with adjacent address spaces, and each non-leaf node's start address is aligned to the node size. Leaf nodes are pointers to data blocks. In a non-leaf node, which half is active is determined by the pointer with which the parent references the current node. Call the two address-adjacent spaces within a node the A domain and the B domain, where the A domain's start address is the node's start address. Because non-leaf node addresses are aligned to the node size: if the address stored in the parent's pointer to the current node is the A domain's start address, which is also the current node's start address, then the A domain is the active domain and the B domain the inactive domain; otherwise, the address stored in the parent's pointer is the B domain's start address, which cannot be aligned to the node size, and the A domain is the inactive domain while the B domain is the active domain.
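Because the node start address is aligned to the node size, the active half can be recovered from the parent's pointer alone. A small sketch of that arithmetic, assuming the 8 KB node size used earlier (the function names are ours):

```python
NODE_SIZE = 8192   # assumed node size; A domain = first half, B = second half


def active_domain(parent_ptr):
    """Classify which half is active from the parent pointer alone:
    node allocations are NODE_SIZE-aligned, so a pointer that is itself
    aligned must reference the A domain; otherwise it references B."""
    return "A" if parent_ptr % NODE_SIZE == 0 else "B"


def inactive_start(parent_ptr):
    """Start address of the inactive domain, derived from the same pointer."""
    node_start = parent_ptr - parent_ptr % NODE_SIZE   # align down
    if active_domain(parent_ptr) == "A":
        return node_start + NODE_SIZE // 2             # B half is inactive
    return node_start                                  # A half is inactive
```

Flipping the parent pointer between `parent_ptr` and `inactive_start(parent_ptr)` is exactly the active/inactive exchange the text describes.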
Figures 3 to 5 describe the process of inserting a node into the mapping B+Tree:
1) As shown in Figure 3, first search the mapping B+Tree to determine the parent of the new leaf node, and copy the data in that node's active domain to its inactive domain;
2) As shown in Figure 4, modify the node's inactive domain, adding the index of the new leaf node and modifying the key values;
3) As shown in Figure 5, modify the pointer in the current node's parent so that it points to the current node's inactive domain, completing the exchange of active and inactive domains and bringing the new node metadata into service; the node insertion is complete.
After a new node is inserted under some non-leaf node of the mapping B+Tree, the number of index entries in that node may exceed the node limit of the mapping B+Tree data structure, requiring a node split that forms two new nodes, each storing part of the original node's data. Figures 6 to 10 describe the process of splitting a node in the mapping B+Tree:
1) As shown in Figure 6, the index count of some non-leaf node in the mapping B+Tree reaches the maximum after an insertion, and the node must split;
2) As shown in Figure 7, copy the current node's active domain data to its inactive domain;
3) As shown in Figure 8, allocate and initialize a new non-leaf node, and migrate part of the data in the current node's inactive domain into the active domain of the newly allocated node;
4) As shown in Figure 9, insert the newly allocated node into the mapping B+Tree as a new node;
5) As shown in Figure 10, modify all index pointers referencing the affected nodes so that they point to the inactive domains holding the new metadata, completing the exchange of active and inactive domains on every node and bringing the new node metadata into service; the node split is complete.
Figure 11 is a schematic diagram of the actual metadata storage structure. A copy of the metadata superblock is kept on every RAID in the system; the superblock stores the root nodes of the data block mapping B+Tree, the metadata bitmap B+Tree, and the data bitmap B+Tree, along with other system metadata such as device UUIDs, device names, device object indexes, and device attribute information. The data block mapping B+Tree, metadata bitmap B+Tree, and data bitmap B+Tree do not keep identical data on every RAID; rather, the blocks composing each B+Tree are scattered across all RAIDs according to a load-balancing policy. The inactive metadata space on each RAID stores a copy of that RAID's active metadata space; additionally, following the scattering policy, two further copies of each RAID's data block mapping B+Tree, metadata bitmap B+Tree, and data bitmap B+Tree are stored on other RAIDs, with the two copies never on the same RAID, ensuring that the scattered metadata remains intact even when two RAIDs in the system fail.
The metadata handling procedure when the storage system expands is as follows:
1. Initialize the newly added RAID;
2. Compute the post-expansion metadata distribution according to load balancing;
3. Synchronize the active and inactive metadata spaces on each RAID so that the metadata stored in both is consistent;
4. According to the result of step 2, replicate metadata into the active metadata space of the newly added RAID;
5. According to the result of step 2, modify the inactive metadata space on each RAID to the post-expansion state;
6. Update the superblock, metadata bitmap B+Tree, and data bitmap B+Tree on each RAID;
7. Bring the inactive metadata space of each original RAID into service as the active space;
8. Synchronize the active and inactive metadata spaces on each RAID, and re-establish the cross-RAID metadata copies;
9. The operation is complete.
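Steps 2-8 can be modeled as below, with each RAID reduced to a dict holding an 'active' and an 'inactive' metadata space and a caller-supplied `rebalance` function computing the post-expansion layout (all names are illustrative, not from the patent):

```python
def expand(raids, new_raid, rebalance):
    """Sketch of expansion steps 2-8. 'rebalance(raids, new_raid)' returns
    one metadata-shard dict per RAID (the new RAID's shard last)."""
    layout = rebalance(raids, new_raid)               # step 2: new layout
    for r in raids:                                   # step 3: sync spaces
        r["inactive"] = dict(r["active"])
    new_raid["active"] = layout[-1]                   # step 4: fill new RAID
    for r, part in zip(raids, layout):                # step 5: stage new state
        r["inactive"] = part
    for r in raids:                                   # step 7: flip spaces
        r["active"], r["inactive"] = r["inactive"], r["active"]
    all_raids = raids + [new_raid]
    for r in all_raids:                               # step 8: re-sync copies
        r["inactive"] = dict(r["active"])
    return all_raids
```

The flip in step 7 is what makes the switchover seamless: until then, readers only ever touch the unmodified active spaces.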
The shrinking operation of the storage system is similar to the expansion procedure and is not repeated here.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention.
Claims (6)
1. A data consistency management method for a thin-provisioning storage system, characterized by designing a metadata structure for data block management and an implementation scheme for the metadata storage structure, wherein
the metadata structure comprises:
(1) an improved B+Tree data structure;
(2) the operations of adding, deleting, and modifying data block mappings realized with the improved B+Tree data structure;
(3) the procedure for determining the active domain and inactive domain of a non-leaf node in the improved B+Tree data structure;
the metadata being distributed across all RAIDs in the storage system according to allocation strategies and organized and managed through structures such as B+Trees, with cross backups of the metadata kept on different RAIDs;
and the metadata storage structure comprises: (1) storage and backup of metadata across all RAIDs; (2) storage system expansion and shrinking operations applying the metadata storage structure.
2. The method according to claim 1, characterized in that, in the improved B+Tree data structure design, extra space is allocated for each non-leaf node of the original B+Tree data structure so that the improved non-leaf node space is twice the original size, with the node start address aligned to the node size; the node space is divided into an adjacent active domain and inactive domain, the active domain serving normal metadata query operations and the inactive domain storing a copy of the active domain or a record of the previous operation.
3. The method according to claim 1, characterized in that, in the data block management operations realized with the improved B+Tree data structure, when a thin-provisioned data block mapping operation is performed, the data in the active domain of each node of the improved B+Tree to be modified is copied to its inactive domain, and the data modification is carried out in the inactive domain of every modified node; after the data in the nodes has been modified, the pointers referencing each node with modified data are updated to point to that node's original inactive domain, so that the inactive domain of each node holding modified data becomes the active domain and each original active domain becomes the inactive domain.
4. The method according to claim 1, characterized in that the procedure for determining the active domain and inactive domain of a non-leaf node in the improved B+Tree data structure determines them from the address stored in the parent's pointer to the current node; the node is divided into two address-adjacent spaces, an A domain and a B domain, the start address of the A domain being the start address of the node; because non-leaf node addresses are aligned to the node size, if the address stored in the parent's pointer to the current node is the A domain's start address, which is also the current node's start address, then the A domain is the active domain and the B domain the inactive domain; otherwise the address stored is the B domain's start address, which cannot be aligned to the node size, and the A domain is the inactive domain while the B domain is the active domain.
5. The method according to claim 1, characterized in that, for the storage and backup of metadata across all RAIDs, the metadata space on each RAID is divided into two equal-sized, address-adjacent parts, an active metadata space and an inactive metadata space; data structures with small data volume, such as the superblock, keep an identical copy in the active metadata space of every RAID in the system; data structures with larger data volume, such as the metadata mapping B+Tree, metadata bitmap B+Tree, and data bitmap B+Tree, are distributed across the active metadata spaces of the RAIDs according to load balancing, each RAID holding a part of the metadata, with the inactive metadata space on each RAID holding a copy of its active metadata space; in addition, each RAID holds, according to a defined strategy, copies of the metadata of two other RAIDs.
6. The method according to claim 1, characterized in that, in the storage system expansion and shrinking operations applying the metadata storage structure, during an expansion or shrinking operation the active metadata space and inactive metadata space are first synchronized, the inactive metadata space on each RAID is then modified according to the metadata scattering strategy, and finally the inactive metadata space of each RAID is switched to become the active space, bringing the new metadata into service and completing the expansion or shrinking operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410614846.4A CN104331478B (en) | 2014-11-05 | 2014-11-05 | Data consistency management method for self-compaction storage system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104331478A true CN104331478A (en) | 2015-02-04 |
CN104331478B CN104331478B (en) | 2017-09-22 |
Family
ID=52406205
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410614846.4A Active CN104331478B (en) | 2014-11-05 | 2014-11-05 | Data consistency management method for self-compaction storage system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104331478B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7802063B1 (en) * | 2005-06-10 | 2010-09-21 | American Megatrends, Inc. | Method, system, apparatus, and computer-readable medium for improving disk array performance |
CN101997918A (en) * | 2010-11-11 | 2011-03-30 | 清华大学 | Method for allocating mass storage resources according to needs in heterogeneous SAN (Storage Area Network) environment |
CN103020201A (en) * | 2012-12-06 | 2013-04-03 | 浪潮电子信息产业股份有限公司 | Storage pool capable of automatically simplifying configuration for storage system and organization and management method |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016134639A1 (en) * | 2015-02-27 | 2016-09-01 | 阿里巴巴集团控股有限公司 | Data block processing method and device |
CN104731905A (en) * | 2015-03-24 | 2015-06-24 | 浪潮集团有限公司 | Volume reducing method for simplified storage pool |
CN104820575A (en) * | 2015-04-27 | 2015-08-05 | 西北工业大学 | Method for realizing thin provisioning of storage system |
CN104820575B (en) * | 2015-04-27 | 2017-08-15 | 西北工业大学 | Realize the method that storage system is simplified automatically |
CN105354315A (en) * | 2015-11-11 | 2016-02-24 | 华为技术有限公司 | Region division method in distributed database, Region node and system |
US11868315B2 (en) | 2015-11-11 | 2024-01-09 | Huawei Cloud Computing Technologies Co., Ltd. | Method for splitting region in distributed database, region node, and system |
CN105354315B (en) * | 2015-11-11 | 2018-10-30 | 华为技术有限公司 | Method, sublist node and the system of distributed data base neutron table splitting |
WO2017080139A1 (en) * | 2015-11-11 | 2017-05-18 | 华为技术有限公司 | Region division method in distributed database, region node and system |
CN105630417B (en) * | 2015-12-24 | 2018-07-20 | 创新科软件技术(深圳)有限公司 | A kind of RAID5 systems and in the subsequent method for continuing data of RAID5 thrashings |
CN105630417A (en) * | 2015-12-24 | 2016-06-01 | 创新科软件技术(深圳)有限公司 | RAID5 (Redundant Array Of Independent Disks) system and method for continuously writing data after failure of RAID5 system |
CN105718217B (en) * | 2016-01-18 | 2018-10-30 | 浪潮(北京)电子信息产业有限公司 | A kind of method and device of simplify configuration storage pool data sign processing |
CN105718217A (en) * | 2016-01-18 | 2016-06-29 | 浪潮(北京)电子信息产业有限公司 | Method and device for maintaining data consistency of thin provisioning database |
CN107301183A (en) * | 2016-04-14 | 2017-10-27 | 杭州海康威视数字技术股份有限公司 | A kind of file memory method and device |
CN107729142A (en) * | 2017-09-29 | 2018-02-23 | 郑州云海信息技术有限公司 | A kind of thread dispatching method for simplifying metadata certainly |
WO2020034695A1 (en) * | 2018-08-14 | 2020-02-20 | 华为技术有限公司 | Data storage method, data recovery method, apparatus, device and storage medium |
CN110232057B (en) * | 2019-05-29 | 2021-03-12 | 掌阅科技股份有限公司 | Data rollback method, electronic device and storage medium |
CN110232057A (en) * | 2019-05-29 | 2019-09-13 | 掌阅科技股份有限公司 | Data rewind method, electronic equipment, storage medium |
US11295031B2 (en) | 2019-10-08 | 2022-04-05 | International Business Machines Corporation | Event log tamper resistance |
US11392348B2 (en) | 2020-02-13 | 2022-07-19 | International Business Machines Corporation | Ordering records for timed meta-data generation in a blocked record environment |
CN111338568B (en) * | 2020-02-16 | 2020-11-06 | 西安奥卡云数据科技有限公司 | Data logic position mapping method |
CN111338568A (en) * | 2020-02-16 | 2020-06-26 | 西安奥卡云数据科技有限公司 | Data logic position mapping method |
CN112306971A (en) * | 2020-10-27 | 2021-02-02 | 苏州浪潮智能科技有限公司 | File storage method, device, equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN104331478B (en) | 2017-09-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104331478A (en) | Data consistency management method for self-compaction storage system | |
US11704290B2 (en) | Methods, devices and systems for maintaining consistency of metadata and data across data centers | |
US11755415B2 (en) | Variable data replication for storage implementing data backup | |
US10817478B2 (en) | System and method for supporting persistent store versioning and integrity in a distributed data grid | |
CA2913036C (en) | Index update pipeline | |
US8521694B1 (en) | Leveraging array snapshots for immediate continuous data protection | |
US8850144B1 (en) | Active replication switch | |
US6405284B1 (en) | Distributing data across multiple data storage devices in a data storage system | |
CN101866305B (en) | Continuous data protection method and system supporting data inquiry and quick recovery | |
US7529887B1 (en) | Methods, systems, and computer program products for postponing bitmap transfers and eliminating configuration information transfers during trespass operations in a disk array environment | |
CN104281506A (en) | Data maintenance method and system for file system | |
JP2013541057A (en) | Map Reduce Instant Distributed File System | |
CN103354923A (en) | Method, device and system for data reconstruction | |
JP2016529633A (en) | Snapshot and clone replication | |
CN103186554A (en) | Distributed data mirroring method and data storage node | |
WO2008109321A1 (en) | Method and system for a self managing and scalable grid storage | |
CN103034739A (en) | Distributed memory system and updating and querying method thereof | |
CN101986276B (en) | Methods and systems for storing and recovering files and server | |
CN102073739A (en) | Method for reading and writing data in distributed file system with snapshot function | |
US10803012B1 (en) | Variable data replication for storage systems implementing quorum-based durability schemes | |
US9002906B1 (en) | System and method for handling large transactions in a storage virtualization system | |
CN109144416A (en) | The method and apparatus for inquiring data | |
US20040078398A1 (en) | System and method to enhance availability of a relational database | |
CN104391802A (en) | Simplified pool metadata node refreshing consistency protection method | |
Klein et al. | Dxram: A persistent in-memory storage for billions of small objects |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||