CN106293537A - An autonomous block management method for a lightweight data-intensive file system

Info

Publication number: CN106293537A
Application number: CN201610665489.3A
Authority: CN
Inventors: 陈付梅; 韩德志; 毕坤; 王军
Original assignee: Shanghai Maritime University
Current assignee: Shanghai Maritime University
Priority date: 2016-08-12
Filing date: 2016-08-12
Publication date: 2017-01-04
Anticipated expiration: 2036-08-12
Also published as: CN106293537B

Abstract

The invention discloses an autonomous block management method for a lightweight data-intensive file system. Intersected Shifted Declustering (ISD) is used to map data blocks to data storage nodes, to look up data blocks quickly within a data storage node, to recover data blocks quickly when a data storage node fails, and to redistribute data blocks quickly when a new data storage node is added. As a result, the master node of the data-intensive file system is responsible only for storing and maintaining the file namespace, while the storage and maintenance of the block-to-node mapping information, the replacement of data blocks when a data storage node fails, and the redistribution of data blocks when a new data storage node joins are all carried out autonomously by the data storage nodes. The invention saves memory on the master node, improves the master node's processing capacity, and can substantially improve the block management efficiency of data-intensive file systems in big data environments.

Description

An autonomous block management method for a lightweight data-intensive file system

Technical field

The present invention relates to computer security technology, and in particular to an autonomous block management method for a lightweight data-intensive file system.

Background art

Data-intensive file systems (DiFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), have become the main file systems for big data storage management. Current data-intensive file systems use a master-slave architecture: the master node (metadata server) manages all metadata, while the slave nodes (data storage nodes) are responsible only for storing data. To maintain high availability, these storage systems generally split each data file into fixed-size blocks; each data block usually has 3 replicas, which are assigned to data storage nodes in different clusters. The master node must record the addresses of hundreds or thousands of data storage nodes, as well as the mapping of every data block of every data file to those storage nodes. In addition, the master node must periodically check for changes in the address mapping information of all data blocks. As the data volume keeps growing, this metadata occupies more and more of the master node's memory, reduces its processing capacity, and severely limits its scalability.

To solve these problems, approaches have emerged that separate the allocation and maintenance of a data file's physical blocks from metadata management and let each data storage node maintain the block-to-node mapping information itself. With such an approach, the master node no longer needs to keep large amounts of block metadata and block-to-node mapping information; instead, the mappings from data blocks to data storage nodes and from data storage nodes to data blocks are computed with a set of invertible mapping functions.

Data-intensive file systems manage massive amounts of data with the following characteristics: 1) the data volume is large and grows quickly; 2) the storage performance requirements are high; 3) high reliability and high recoverability are required: when data is lost or a data storage node fails, the data must be recoverable quickly without affecting normal operation; 4) the storage location of a data block must be found quickly; 5) as little of the master node's memory as possible should be used, and the master node's processing capacity should be affected as little as possible.

The above analysis shows that the management methods of traditional file systems are not suited to data-intensive file systems, mainly because: 1) as the data volume grows, the file data block address table takes up a large amount of storage space; 2) the master node is responsible for maintaining the file data block address table, and the growth of this table greatly reduces the master node's processing capacity; 3) the growing data volume not only occupies a large amount of the master node's storage space and raises metadata maintenance costs such as address maintenance, but also reduces the master node's scalability; 4) every data storage node must first consult the master node when storing and querying, which increases addressing time.

Summary of the invention

To meet the data block storage and lookup management needs of data-intensive file systems, the invention provides an autonomous block management method for a lightweight data-intensive file system. The allocation of physical data blocks, their lookup, and the related metadata maintenance are separated from traditional metadata management and carried out by the individual data storage nodes, which reduces the memory overhead and load of the master node. The invention can improve the scalability of data-intensive file systems in big data environments, reduce block addressing time, and substantially improve overall system performance.

The technical principle of the invention is that autonomous management of data blocks is achieved through Intersected Shifted Declustering (ISD): a set of invertible mathematical functions implements the mapping from data blocks to data storage nodes and from data storage nodes to data blocks, which provides distributed storage and fast lookup of data blocks.

The invention comprises the following operations:

Operation 1, data block storage;

Operation 2, data block lookup;

Operation 3, failure handling for a failed data storage node;

Operation 4, adding a new data storage node.

(1) The data block storage operation comprises the following steps:

Step 1.1: the master node selects the logical group (LG) for the data block using an invertible linear hash function;

Step 1.2: the master node selects the data storage nodes within the logical group that will store the block data, using an invertible shifted declustering function;

Step 1.3: the data storage nodes store the block data and the data block address mapping information.

(2) The data block lookup operation comprises the following steps:

Step 2.1: the data storage node holding data block b uses its own index and the inverse function to compute the new ID of data block b within its logical group;

Step 2.2: from the new ID of data block b within its logical group, the data storage node holding data block b uses the inverse function to compute the physical ID of data block b, which makes it possible for the file system to reassemble the complete data file;

Step 2.3: from the physical ID of the data block, the data storage node obtains the block's mapping information on the storage node;

Step 2.4: using the mapping information of data block b, the data storage node reads the data of block b and sends it to the file system.

(3) The failure handling operation for a failed data storage node comprises the following steps:

Step 3.1: determine the logical groups that contain the failed data storage node;

Step 3.2: select the least-loaded data storage node outside the logical group of the failed node as a backup node;

Step 3.3: multiple backup data storage nodes use the intelligent reorganization mapping method to replicate, in parallel, the data that the failed data storage node held in each corresponding logical group.

(4) The operation of adding a new data storage node comprises the following steps:

Step 4.1: compute the average load COV_ave of the data storage nodes in all logical groups of the whole system;

Step 4.2: select a logical group and compute the maximum load COV_max among all data storage nodes in that group;

Step 4.3: compare COV_max and COV_ave. If COV_max ≥ COV_ave, replace the most heavily loaded data storage node of this logical group with the newly added data storage node. Otherwise, choose the next logical group and repeat steps 4.1, 4.2 and 4.3, until the load of the newly added data storage node reaches or approaches the average load of the data storage nodes in the system.

The advantages of this autonomous block management method for data-intensive file systems are:

(1) The memory overhead of the master node is greatly reduced. The block-to-node mapping information is separated from the traditional metadata and is stored and managed autonomously by each data storage node, so the master node does not need to keep and maintain large amounts of data block address information. The metadata kept by the master node is reduced by more than 90% compared with a traditional file system.

(2) The processing capacity of the master node is greatly improved. The mapping information between data blocks and data storage nodes is stored and maintained autonomously by each data storage node, which relieves the master node of this burden. Compared with the distributed file system HDFS, the method can improve the master node's processing performance by more than 30%.

(3) The recoverability and scalability of the system are improved. The intelligent reorganization mapping method is used when a data storage node fails, and the decoupled address mapping method is used when a new data storage node is added, so only a small number of data blocks need to be migrated to recover the data of a failed node or to populate a newly added node. This substantially improves the system's recoverability and scalability.

Brief description of the drawings

Fig. 1 is a flow chart of the operations of the invention;

Fig. 2 is a schematic diagram of the division of management functions between the master node and the data storage nodes;

Fig. 3 is an example of mapping consecutive blocks to data nodes and of looking up blocks from a data node;

Fig. 4 is an example of the data node failure recovery process;

Fig. 5 is an example of the process of adding a new data node.

Detailed description of the embodiments

To make the technical means, inventive features, objectives and effects of the invention easy to understand, the autonomous block management method for a lightweight data-intensive file system proposed by the invention is further described below with reference to the drawings and a specific embodiment.

In the autonomous block management method for a lightweight data-intensive file system, a set of invertible mathematical functions implements the mapping from data blocks to data nodes and from data nodes to data blocks. As shown in Fig. 2, the functions are divided among the nodes as follows: the master node is responsible only for maintaining the system namespace, allocating data blocks to data storage nodes, and managing the data storage nodes; each data storage node is responsible for consistency checking of data blocks, data block recovery, and storing and maintaining the block-to-node mapping information.

As shown in Fig. 1, the autonomous block management method of the invention comprises the following operations:

Operation 1, data block storage;

Operation 2, data block lookup;

Operation 3, failure handling for a failed data storage node;

Operation 4, adding a new data storage node.

(1) The data block storage operation comprises the following steps:

Step 1.1: the master node selects the logical group (LG) for the block using an invertible linear hash function;

In step 1.1 of the data block storage operation, the logical group (LG) for the data block is selected by an invertible linear hash function with the following parameters:

g is the ID of the logical group to map to, x is the current number of groups in the system, X is the number of logical groups in the system at startup, b is the block ID, within its file, of the data block to be stored, and s is the number of newly added logical groups.

In step 1.2 of the data block storage operation, selecting the data storage nodes within the logical group using the invertible shifted declustering function comprises:

a) computing the new block ID of data block b after it is mapped to logical group g,

where a is the new ID of the data block within logical group g, x is the current number of logical groups, X is the initial number of logical groups, b is the ID of the given data block, and s is the number of newly added logical groups;

b) computing the index of the data storage node within logical group g to which data block b is mapped:

$$d = \mathrm{node}(a, i) = (a + i) \bmod 4 \tag{3}$$

where a is the new block ID of data block b within logical group g, i is the replica number of data block b (values 0, 1, 2), and d is the index (values 0, 1, 2, 3) of the data storage node selected for data block b within the logical group.

The replica number refers to the three replicas that the data-intensive file system keeps for each data block to ensure reliability; they are numbered 0, 1, 2. The index of a data storage node refers to the numbering of all data storage nodes within a logical group; in the invention each logical group contains 4 data storage nodes, with indexes 0, 1, 2, 3.
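As an illustration of formula (3), the following minimal Python sketch computes the node indexes that hold the three replicas of a block from its in-group ID a. The names `NODES_PER_GROUP`, `REPLICAS` and `placement` are hypothetical and do not appear in the patent; only the expression (a + i) mod 4 and the group size and replica count come from the text.

```python
NODES_PER_GROUP = 4  # the text fixes 4 data storage nodes per logical group
REPLICAS = 3         # the text fixes 3 replicas, numbered 0, 1, 2


def node(a: int, i: int) -> int:
    """Formula (3): index d of the node that stores replica i of in-group block a."""
    return (a + i) % NODES_PER_GROUP


def placement(a: int) -> list[int]:
    """Node indexes for all replicas of in-group block a (hypothetical helper)."""
    return [node(a, i) for i in range(REPLICAS)]


if __name__ == "__main__":
    for a in range(6):
        print(a, placement(a))
```

Because consecutive replica numbers shift the index by one, the three replicas of a block always land on three different nodes of the group.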

(2) The data block lookup operation comprises the following steps:

Step 2.4: the data storage node reads the data of block b according to the mapping information of data block b and sends it to the file system.

In step 2.1 of the data block lookup operation, the new ID of data block b within its logical group is computed from the data storage node index d with the inverse function:

$$d = (a + i) \bmod 4 \;\Longrightarrow\; a = 4j + \big((d - i) \bmod 4\big) \tag{4}$$

where i is the replica number of the data block, iterated over 0, 1, 2, and j takes the values 0, 1, 2, ..., n.
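The inverse lookup of formula (4) can be sketched in Python as follows. The function name `blocks_on_node` and the bound `max_j` (standing in for the upper limit n of j) are assumptions made for illustration; the sketch only enumerates in-group block IDs and does not cover the later step that recovers the physical block ID.

```python
NODES_PER_GROUP = 4  # nodes per logical group, as stated in the text
REPLICAS = 3         # replica numbers 0, 1, 2, as stated in the text


def blocks_on_node(d: int, max_j: int) -> dict[int, list[int]]:
    """Formula (4): in-group block IDs a = 4*j + (d - i) % 4 held by node d, per replica i."""
    return {
        i: [NODES_PER_GROUP * j + (d - i) % NODES_PER_GROUP for j in range(max_j + 1)]
        for i in range(REPLICAS)
    }


if __name__ == "__main__":
    d = 2
    held = blocks_on_node(d, max_j=1)
    print(held)
    # Round trip: every recovered ID maps back to node d under formula (3).
    assert all((a + i) % NODES_PER_GROUP == d for i, ids in held.items() for a in ids)
```

Python's `%` returns a non-negative result for a positive modulus, so `(d - i) % 4` is safe when d < i. In languages where the remainder can be negative, such as C, the same expression needs an adjustment, for example `(d - i + 4) % 4`.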

In step 2.2 of the data block lookup operation, the physical ID of data block b is computed with the inverse function,

where g is the index of the logical group that contains the given data storage node.

Fig. 3(a) shows consecutive data blocks being mapped to the logical groups by linear hashing and then distributed across the data storage nodes within each logical group by shifted declustering; Fig. 3(b) takes data node 2 as an example and demonstrates the reverse lookup of data blocks through the inverse functions.

Step 3.1: determine the logical groups that contain the failed data storage node;

In the intelligent reorganization mapping method, the number of backup data storage nodes chosen equals the number of logical groups that contain the failed data storage node. A failed data storage node may belong to several logical groups, and each backup data storage node is responsible only for replicating the part of the failed node's data that belongs to its corresponding logical group.
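The backup selection described above can be sketched as follows. This is a simplified model under stated assumptions, not the patent's implementation: `groups` (logical group ID to member node IDs) and `load` (node ID to a numeric load) are hypothetical data structures, and the sketch assigns a distinct backup node to each affected group so that the number of backups equals the number of groups containing the failed node.

```python
def choose_backups(
    failed: int,
    groups: dict[int, list[int]],
    load: dict[int, float],
) -> dict[int, int]:
    """Pick one least-loaded backup node per logical group that contains the failed node."""
    backups: dict[int, int] = {}
    used: set[int] = set()  # assumption: each backup node serves only one group
    for gid, members in sorted(groups.items()):
        if failed not in members:
            continue  # only groups containing the failed node need a backup
        candidates = [
            n for n in load
            if n != failed and n not in members and n not in used
        ]
        # min() raises ValueError if no node outside the group is available.
        best = min(candidates, key=lambda n: (load[n], n))
        backups[gid] = best
        used.add(best)
    return backups


if __name__ == "__main__":
    groups = {0: [0, 1, 2, 3], 1: [2, 4, 5, 6], 2: [7, 8, 9, 10]}
    load = {0: 5, 1: 3, 2: 4, 3: 6, 4: 2, 5: 7, 6: 1, 7: 8, 8: 2, 9: 9, 10: 4}
    print(choose_backups(2, groups, load))
```

Each selected backup then copies only the failed node's blocks for its own group, so the copies for different groups can proceed in parallel.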

Fig. 4 takes the failure of data node 2 as an example and demonstrates how each logical group replaces data node 2 and recovers its data.

(4) The operation of adding a new data storage node mainly uses the decoupled address mapping method and comprises the following steps:

Step 4.3: compare COV_max and COV_ave. If COV_max ≥ COV_ave, replace the most heavily loaded data storage node in the logical group with the newly added data storage node. Otherwise, choose the next logical group and repeat steps 4.1, 4.2 and 4.3, until the load of the newly added data storage node reaches or approaches the average load of the data storage nodes in the system.
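The loop formed by steps 4.1 to 4.3 can be sketched in Python under a simple load model. Everything about the model is an assumption made for illustration: `share[(gid, node)]` is a hypothetical count of blocks a node holds for a logical group, a node's load is the sum of its shares, and "reaches or approaches the average" is simplified to "is at least the average".

```python
def add_node(
    new_node: int,
    groups: dict[int, list[int]],
    share: dict[tuple[int, int], int],
) -> list[tuple[int, int]]:
    """Insert new_node into groups; return (group ID, replaced node) pairs."""
    total_blocks = sum(share.values())
    moves: list[tuple[int, int]] = []
    for gid in sorted(groups):
        nodes = {n for members in groups.values() for n in members} | {new_node}
        load = dict.fromkeys(nodes, 0)
        for (_, n), blocks in share.items():
            load[n] += blocks
        cov_ave = total_blocks / len(nodes)               # step 4.1
        if load[new_node] >= cov_ave:
            break                                         # new node is loaded enough
        members = groups[gid]
        heaviest = max(members, key=load.__getitem__)     # step 4.2
        cov_max = load[heaviest]
        if heaviest != new_node and cov_max >= cov_ave:   # step 4.3
            members[members.index(heaviest)] = new_node
            # The new node takes over the replaced node's blocks for this group only.
            share[(gid, new_node)] = share.pop((gid, heaviest))
            moves.append((gid, heaviest))
    return moves


if __name__ == "__main__":
    groups = {0: [0, 1, 2, 3], 1: [0, 4, 5, 6]}
    share = {(0, 0): 10, (0, 1): 4, (0, 2): 4, (0, 3): 4,
             (1, 0): 10, (1, 4): 4, (1, 5): 4, (1, 6): 4}
    print(add_node(7, groups, share), groups)
```

In this model only the blocks of the replaced node within one group move to the new node, which matches the text's point that adding a node migrates a small number of blocks.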

Fig. 5 demonstrates the data block migration across the whole system when the new data node node₁₂₈ is added and forms the new logical group LG₁₀₀₀.

Figs. 4 and 5 show that, by using invertible functions together with the intelligent reorganization mapping method and the decoupled address mapping method, only a few data blocks are migrated when a data node fails or a new data node is added, which ensures the stability of the system and its availability to users.

The method is illustrated below with an example.

HDFS is chosen as the data-intensive file system. In a simulated big data environment with 10000 data nodes and 1000000 data blocks, the master node's memory usage with and without the autonomous block management method for a lightweight data-intensive file system is shown in Table 1, and the master node's CPU usage is shown in Table 2. The 1000000 data blocks are distributed evenly across the 10000 data nodes, and each data block is 64 MB.

Table 1. Master node memory usage for data block management

Number of data nodes	1000	2000	5000	7000	9000	10000
Memory usage after optimization (MB)	15	20	27	36	42	50
Memory usage without optimization (MB)	180	186	189	192	194	196

Table 2. Master node CPU usage for data block management

Number of data nodes	500	2000	3000	4000	5000
CPU usage after optimization (%)	1.4	2.3	2.5	3.1	4.2
CPU usage without optimization (%)	6.3	12.1	16.6	19.8	23.2

Tables 1 and 2 show that, with the autonomous block management method for a lightweight data-intensive file system, the master node's memory usage and CPU usage are clearly lower than without it.
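To put the comparison in relative terms, the short Python sketch below computes the percentage reduction for each column of Tables 1 and 2. The figures are copied from the tables; the variable names and the helper `reduction_percent` are hypothetical.

```python
# Table 1: master node memory usage (MB) per number of data nodes
nodes_mem = [1000, 2000, 5000, 7000, 9000, 10000]
mem_optimized = [15, 20, 27, 36, 42, 50]
mem_baseline = [180, 186, 189, 192, 194, 196]

# Table 2: master node CPU usage (%) per number of data nodes
nodes_cpu = [500, 2000, 3000, 4000, 5000]
cpu_optimized = [1.4, 2.3, 2.5, 3.1, 4.2]
cpu_baseline = [6.3, 12.1, 16.6, 19.8, 23.2]


def reduction_percent(optimized: list[float], baseline: list[float]) -> list[float]:
    """Relative saving of the optimized figure against the baseline, in percent."""
    return [100.0 * (1.0 - o / b) for o, b in zip(optimized, baseline)]


if __name__ == "__main__":
    for n, r in zip(nodes_mem, reduction_percent(mem_optimized, mem_baseline)):
        print(f"memory, {n} nodes: {r:.1f}% lower")
    for n, r in zip(nodes_cpu, reduction_percent(cpu_optimized, cpu_baseline)):
        print(f"CPU, {n} nodes: {r:.1f}% lower")
```

On the Table 1 figures, the relative memory saving is largest at 1000 data nodes and narrows as the node count grows, because the optimized figure rises faster than the unoptimized one.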

Although the invention has been described in detail through the preferred embodiment above, the description should not be regarded as limiting the invention. Various modifications and substitutions will be apparent to those skilled in the art after reading the foregoing. The scope of protection of the invention should therefore be defined by the appended claims.

Claims

1. An autonomous block management method for a lightweight data-intensive file system, characterized in that the data-intensive file system achieves autonomous management of data blocks through the intersected shifted declustering method, that is, a set of invertible mathematical functions is used to implement the mapping from data blocks to data storage nodes and from data storage nodes to data blocks, providing distributed storage and lookup of data blocks;

The data block storage operation is implemented by the autonomous block management method and comprises the following steps:

Step 1.1: the master node selects the logical group for the data block using an invertible linear hash function;

Step 1.2: the master node selects data storage nodes within the logical group using an invertible shifted declustering function;

Step 1.3: the selected data storage nodes store the data of the data block and the data block address mapping information.

2. The autonomous block management method for a lightweight data-intensive file system as claimed in claim 1, characterized in that

The data block lookup operation is implemented by the autonomous block management method and comprises the following steps:

Step 2.1: the data storage node holding the data block uses its own index and the inverse function to compute the new ID of the data block within its logical group;

Step 2.2: from the new ID of the data block within its logical group, the data storage node holding the data block uses the inverse function to compute the physical ID of the data block;

Step 2.3: from the physical ID of the data block, the data storage node obtains the block's mapping information on the storage node;

Step 2.4: using the mapping information of the data block, the data storage node reads the data of the data block and sends it to the data-intensive file system.

3. The autonomous block management method for a lightweight data-intensive file system as claimed in claim 2, characterized in that

The failure handling operation for a failed data storage node is implemented by the autonomous block management method and comprises the following steps:

Step 3.1: determine the logical groups that contain the failed data storage node;

Step 3.2: select the least-loaded data storage node outside the logical group of the failed node as a backup node;

Step 3.3: multiple backup data storage nodes use the intelligent reorganization mapping method to replicate, in parallel, the data that the failed data storage node held in each corresponding logical group;

In the intelligent reorganization mapping method, the number of backup data storage nodes chosen equals the number of logical groups that contain the failed data storage node; a failed data storage node belongs to several logical groups, and each backup data storage node replicates only the part of the failed node's data that belongs to its corresponding logical group.

4. The autonomous block management method for a lightweight data-intensive file system as claimed in claim 2 or 3, characterized in that

The operation of adding a new data storage node is implemented by the autonomous block management method, uses the decoupled address mapping method, and comprises the following steps:

Step 4.2: select any logical group and compute the maximum load COV_max among all data storage nodes in that logical group;

Step 4.3: compare COV_max and COV_ave. If COV_max ≥ COV_ave, replace the most heavily loaded data storage node in the logical group with the newly added data storage node; otherwise, choose the next logical group and repeat steps 4.1, 4.2 and 4.3, until the load of the newly added data storage node reaches or approaches the average load of the data storage nodes in the system.

5. The autonomous block management method for a lightweight data-intensive file system as claimed in claim 1, characterized in that

In step 1.1, the logical group for the data block is selected by an invertible linear hash function with the following parameters:

g is the logical group ID, x is the current number of logical groups in the system, X is the initial number of logical groups in the system, b is the block ID, within its file, of the data block to be stored, and s is the number of newly added logical groups;

In step 1.2, selecting the data storage nodes within the logical group using the invertible shifted declustering function comprises computing

a, the new ID of data block b within logical group g, and then

$$d = \mathrm{node}(a, i) = (a + i) \bmod 4 \tag{3}$$

where i is the replica number of data block b, and d is the index of the data storage node selected for data block b within the logical group.

6. The autonomous block management method for a lightweight data-intensive file system as claimed in claim 5, characterized in that

In step 2.1, the new ID of data block b within its logical group is computed with the inverse function:

$$a = 4j + \big((d - i) \bmod 4\big) \tag{4}$$

This formula is obtained by inverting $d = (a + i) \bmod 4$;

where the replica number i of the data block is iterated over 0, 1, 2, and j is zero or a positive integer;

In step 2.2, the physical ID of data block b is computed with the inverse function.

7. The autonomous block management method for a lightweight data-intensive file system as claimed in claim 1, characterized in that

In the data-intensive file system, the master node maintains the system namespace, allocates data blocks to data storage nodes, and manages the data storage nodes;

and each data storage node is responsible for consistency checking of data blocks, data block recovery, and storing and maintaining the block-to-node mapping information.