CN106293537B - Lightweight autonomous block management method for a data-intensive file system - Google Patents

Lightweight autonomous block management method for a data-intensive file system

Info

Publication number
CN106293537B
CN106293537B
Authority
CN
China
Prior art keywords
data
memory node
block
data memory
data block
Prior art date
Legal status
Active
Application number
CN201610665489.3A
Other languages
Chinese (zh)
Other versions
CN106293537A (en)
Inventor
陈付梅
韩德志
毕坤
王军
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN201610665489.3A
Publication of CN106293537A
Application granted
Publication of CN106293537B
Legal status: Active


Classifications

    • G - Physics
    • G06 - Computing; Calculating or Counting
    • G06F - Electric Digital Data Processing
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 - Improving or facilitating administration, e.g. storage management
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 - Organizing or formatting or addressing of data
    • G06F 3/064 - Management of blocks
    • G06F 3/0643 - Management of files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a lightweight autonomous block management method for data-intensive file systems. Intersected Shifted Declustering (ISD) is used to realize the mapping of data blocks to data storage nodes, the fast lookup of data blocks on the data storage nodes, the fast recovery of data blocks when a data storage node fails, and the fast redistribution of data blocks when a new data storage node is added. The master node of the data-intensive file system is thus responsible only for storing and maintaining the file namespace, while the storage and maintenance of the block-to-node mapping information, the replacement of data blocks when a data storage node fails, and the redistribution of data blocks when a new data storage node is added are all completed autonomously by the data storage nodes. The invention saves memory on the master node of the data-intensive file system, improves the processing capacity of the master node, and can substantially increase the block management efficiency of data-intensive file systems in big-data environments.

Description

Lightweight autonomous block management method for a data-intensive file system
Technical field
The present invention relates to computer storage and security techniques, and more particularly to a lightweight autonomous block management method for data-intensive file systems.
Background art
Data-intensive file systems (DiFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), have become the main file systems for big-data storage management. Current data-intensive file systems use a master-slave architecture: the master node (metadata server) manages all metadata, while the slave nodes (data storage nodes) are responsible only for data storage. To maintain high availability, these storage systems usually split data files into fixed-size blocks; each data block typically has three replicas, which are assigned to data storage nodes in different clusters. The master node must record the addresses of hundreds or thousands of data storage nodes and the mapping of every data block of every file to those storage nodes. Moreover, the master node must periodically check for changes in the address mapping information of all data blocks. As the data volume keeps growing, this metadata occupies a large amount of the master node's memory, degrades its processing capacity, and severely limits its scalability.
To solve these problems of data-intensive file systems, the allocation and maintenance of the physical blocks of data files are separated from metadata management, and the maintenance of the block-to-node mapping information is carried out by each data storage node. With this approach the master node no longer needs to keep a large amount of data block metadata or block-to-node mapping information; instead, a group of invertible mapping functions between data blocks and data storage nodes, and between data storage nodes and data blocks, completes the mapping.
A data-intensive file system manages massive amounts of data with the following characteristics: 1) the data volume is large and grows quickly; 2) high data storage performance is required; 3) high reliability and recoverability are required: when data is lost or a data storage node fails, the data must be recovered quickly without affecting normal operation; 4) the storage location of a data block must be found quickly; 5) the memory footprint on the master node and the impact on its processing capacity must be as small as possible.
The analysis above shows that the management methods of traditional file systems do not suit data-intensive file systems, for the following main reasons: 1) as the data volume keeps growing, the file block address table occupies a large amount of storage space; 2) the master node is responsible for maintaining the file block address table, and as the table keeps growing the processing capacity of the master node drops sharply; 3) the continuous growth of data not only occupies a large amount of the master node's storage but also increases metadata maintenance costs such as addressing, while reducing the scalability of the master node; 4) every data storage node must first consult the master node for storage and query operations, which increases addressing time.
Summary of the invention
To meet the block storage and lookup management requirements of data-intensive file systems, the present invention provides a lightweight autonomous block management method in which the allocation and lookup of physical data blocks and the maintenance of the related metadata are separated from traditional metadata management and carried out by each data storage node, reducing the storage overhead and burden on the master node. The invention improves the scalability of data-intensive file systems in big-data environments, reduces block addressing time, and can greatly improve the performance of the whole system.
The technical principle of the invention is as follows: the invention realizes the autonomous management of data blocks through an Intersected Shifted Declustering (ISD) method, i.e., a group of reversible mathematical functions realizes the mapping from data blocks to data storage nodes and from data storage nodes to data blocks, completing the distributed storage and fast lookup of data blocks.
The invention specifically comprises the following operations:
Operation 1: data block storage;
Operation 2: data block lookup;
Operation 3: failure handling for a failed data storage node;
Operation 4: addition of a new data storage node.
(1) The data block storage operation comprises the following steps:
Step 1.1: the master node selects the logical group (LG) for the data block by a reversible linear hash function;
Step 1.2: the master node selects, by a reversible shift-partition (shifted declustering) function, the data storage nodes in the logical group that will store the block data;
Step 1.3: the data storage nodes store the data of the block and the block address mapping information.
(2) The data block lookup operation comprises the following steps:
Step 2.1: the data storage node holding data block b computes, from its own index number and with the reverse invertible function, the new ID of the logical group containing block b;
Step 2.2: the data storage node holding block b computes, from that logical group ID and with the reverse invertible function, the physical ID of block b, providing the basis for the file system to recover the complete data file;
Step 2.3: the data storage node obtains, from the physical ID of the block, the mapping information of the block on the storage node;
Step 2.4: the data storage node returns the data of block b to the file system according to the mapping information of block b.
(3) The failure handling operation for a failed data storage node comprises the following steps:
Step 3.1: determine the logical groups that contain the failed data storage node;
Step 3.2: in each of those logical groups, select the least-loaded data storage node other than the failed node as the backup node;
Step 3.3: using the intelligent regrouping mapping method, the multiple backup data storage nodes copy in parallel, each within its own logical group, the data of the failed data storage node contained in that group.
(4) The operation of adding a new data storage node comprises the following steps:
Step 4.1: compute the average load COV_ave of the data storage nodes over all logical groups in the whole system;
Step 4.2: select a logical group and compute the maximum load COV_max over all data storage nodes in that group;
Step 4.3: compare COV_max with COV_ave; if COV_max ≥ COV_ave, replace the most heavily loaded data storage node in that logical group with the newly added data storage node; otherwise, choose the next logical group and repeat steps 4.1, 4.2 and 4.3, until the load of the newly added data storage node reaches or approaches the average load of the data storage nodes in the system.
The advantages of this autonomous block management method for data-intensive file systems are:
(1) The storage overhead of the master node is greatly reduced. The block-to-storage-node mapping information is separated from the traditional metadata and is stored and managed autonomously by each data storage node, so the master node no longer needs to keep and maintain a large amount of block address information; the metadata kept by the master node is reduced by more than 90% compared with traditional file systems.
(2) The processing capacity of the master node is greatly improved. The mapping information between data blocks and data storage nodes is stored and maintained autonomously by each data storage node, which removes this burden from the master node. Compared with the distributed file system HDFS, this method can improve the processing performance of the master node by more than 30%.
(3) The recoverability and scalability of the system are improved. By using the intelligent regrouping mapping method when a data storage node fails and the decoupled address mapping method when a new data storage node is added, only a small number of data blocks need to be migrated to recover the data of a failed data node or to populate a newly added data node, which greatly improves the recoverability and scalability of the system.
Brief description of the drawings
Fig. 1 is a flow chart of the specific operations of the invention;
Fig. 2 is a schematic diagram of the division of management functions between the master node and the data storage nodes in the invention;
Fig. 3 is an example of the mapping of consecutive blocks to data nodes and of the lookup of blocks from a data node;
Fig. 4 is an example of the recovery process after a data node failure;
Fig. 5 is an example of the process of adding a new data node.
Detailed description of the embodiments
In order to make the technical means, creative features, objectives and effects of the present invention easy to understand, the lightweight autonomous block management method for data-intensive file systems proposed by the present invention is further explained below with reference to the drawings and specific embodiments.
In the lightweight autonomous block management method for data-intensive file systems, a group of reversible mathematical functions realizes the mapping from data blocks to data nodes and from data nodes to data blocks. As shown in Fig. 2, the management functions of the nodes are divided as follows: the master node is responsible only for maintaining the system namespace, allocating data blocks to data storage nodes, and managing the data storage nodes; each data storage node is responsible for the consistency checking of data blocks, data block recovery, and the storage and maintenance of the data storage node's mapping information.
As shown in Fig. 1, the autonomous block management method of the invention specifically comprises the following operations:
Operation 1: data block storage;
Operation 2: data block lookup;
Operation 3: failure handling for a failed data storage node;
Operation 4: addition of a new data storage node.
(1) The data block storage operation comprises the following steps:
Step 1.1: the master node selects the logical group (LG) for the data block by the reversible linear hash function;
Step 1.2: the master node selects, by the reversible shift-partition function, the data storage nodes in the logical group that will store the block data;
Step 1.3: the data storage nodes store the data of the block and the block address mapping information.
In step 1.1 of the data block storage operation, the logical group (LG) for the data block is selected by the reversible linear hash function of formula (1), where g is the logical group ID to be mapped to, x is the current number of logical groups in the system, X is the number of logical groups when the system starts, b is the block ID of the data block to be stored within its file, and s is the number of newly added logical groups.
In step 1.2 of the data block storage operation, the process of selecting the data storage nodes in the logical group by the reversible shift-partition function comprises:
a) computing, by formula (2), the new block identifier of data block b after it is mapped into logical group g, where a is the new identifier of the data block within logical group g, x is the current number of logical groups, X is the initial number of logical groups, b is the given block ID, and s is the number of newly added logical groups;
b) computing the index of the data storage node in logical group g to which data block b is mapped:
d = node(a, i) = (a + i) % 4    (3)
where a is the new identifier of data block b within logical group g, i is the replica number of block b (0, 1 or 2), and d is the index (0, 1, 2 or 3) of the data storage node selected in the logical group for block b.
The replica number reflects the fact that the data-intensive file system keeps three replicas of each data block, numbered 0, 1 and 2, to fully guarantee its availability; the index of a data storage node is its number among all data storage nodes of a logical group, and in the present invention each logical group contains 4 data storage nodes, with indices 0, 1, 2 and 3.
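As an illustration only, the following Python sketch implements the block placement of steps 1.1 and 1.2 under stated assumptions: select_group and the in-group identifier a are hypothetical stand-ins for formulas (1) and (2), which are not reproduced in this text, while node_index follows formula (3) with 4 data storage nodes per logical group and 3 replicas per block.

```python
# Illustrative sketch only: select_group() and the in-group identifier are
# assumptions standing in for formulas (1) and (2); node_index() follows
# formula (3) of the description.

NODES_PER_GROUP = 4   # each logical group holds 4 data storage nodes
REPLICAS = 3          # each data block keeps 3 replicas, numbered 0, 1, 2

def select_group(block_id: int, num_groups: int) -> int:
    """Step 1.1 (assumed): map a block ID to a logical group with a simple
    reversible modulo hash; the patent's formula (1) is not shown."""
    return block_id % num_groups

def node_index(a: int, replica: int) -> int:
    """Step 1.2 b), formula (3): d = node(a, i) = (a + i) % 4."""
    return (a + replica) % NODES_PER_GROUP

def place_block(block_id: int, num_groups: int):
    """Return (logical group, storage node index) for every replica of a block."""
    g = select_group(block_id, num_groups)
    a = block_id // num_groups   # assumed stand-in for formula (2)
    return [(g, node_index(a, i)) for i in range(REPLICAS)]

# Example: place block 42 in a system that currently has 10 logical groups.
print(place_block(42, 10))   # [(2, 0), (2, 1), (2, 2)]
```

Because d = (a + i) % 4 assigns the three replicas of one block to three different indices, the replicas of a block never land on the same data storage node within a logical group.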
(2) The data block lookup operation comprises the following steps:
Step 2.1: the data storage node holding data block b computes, from its own index number and with the reverse invertible function, the new ID of the logical group containing block b;
Step 2.2: the data storage node holding block b computes, from that logical group ID and with the reverse invertible function, the physical ID of block b, providing the basis for the file system to recover the complete data file;
Step 2.3: the data storage node obtains, from the physical ID of the block, the mapping information of the block on the storage node;
Step 2.4: the data storage node sends the obtained data of block b to the file system according to the mapping information of block b.
In step 2.1 of the data block lookup operation, the new ID of the logical group containing data block b is computed from the index number d of the data storage node by the reverse invertible function:
d = (a + i) % 4  →  (invertible)  →  a = 4j + (d - i) % 4    (4)
where i denotes the replica number of the data block and is iterated over 0, 1 and 2, and j can take the values 0, 1, 2, ..., n.
In step 2.2 of the data block lookup operation, the physical ID of data block b is computed by the reverse invertible function of formula (5), where g is the index of the logical group containing the given data storage node.
Fig. 3(a) shows consecutive data blocks being mapped to the logical groups by linear hashing and then distributed, by shift partitioning, over the data storage nodes within each logical group; Fig. 3(b) demonstrates, taking data node 2 as an example, the reverse process of looking up data blocks through the invertible functions.
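A minimal sketch of the reverse step of formula (4) follows. It only enumerates the candidate in-group block identifiers a = 4j + (d - i) % 4 held by a storage node of index d; the further inverses that yield the logical group ID and the physical block ID correspond to formulas (1), (2) and (5), which are not reproduced in this text, so they are omitted, and the bound max_j is an assumption for illustration.

```python
# Sketch of formula (4): from a storage node's index d, enumerate the
# candidate (replica number, in-group block identifier) pairs that the
# node can hold, i.e. all a with d = (a + i) % 4.

NODES_PER_GROUP = 4
REPLICAS = 3

def candidate_block_ids(d: int, max_j: int):
    """Yield (replica i, in-group id a) pairs with a = 4*j + (d - i) % 4."""
    for j in range(max_j + 1):
        for i in range(REPLICAS):
            yield i, NODES_PER_GROUP * j + (d - i) % NODES_PER_GROUP

# Example: the first candidates held by the node with index 2 (cf. Fig. 3(b)).
for i, a in candidate_block_ids(d=2, max_j=1):
    print(f"replica {i} of in-group block {a}")
```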
(3) The failure handling operation for a failed data storage node comprises the following steps:
Step 3.1: determine the logical groups that contain the failed data storage node;
Step 3.2: in each of those logical groups, select the least-loaded data storage node other than the failed node as the backup node;
Step 3.3: using the intelligent regrouping mapping method, the multiple backup data storage nodes copy in parallel, each within its own logical group, the data of the failed data storage node contained in that group.
In the intelligent regrouping mapping method, the number of backup data storage nodes chosen is equal to the number of logical groups that contain the failed data storage node: a failed data storage node may be contained in several logical groups, and each backup data storage node is responsible only for copying the part of the failed node's data that belongs to its own logical group.
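The following Python sketch illustrates this per-group recovery plan under stated assumptions: the structures groups (logical group to member nodes), load (node to current load) and blocks_in_group (the failed node's blocks, grouped by logical group) are hypothetical and serve only to show the selection of one least-loaded backup node per affected group and the parallel copy assignment.

```python
# Sketch of the intelligent regrouping mapping method (steps 3.1 - 3.3).
# groups, load and blocks_in_group are hypothetical structures used only
# to illustrate per-group backup selection and the parallel copy plan.

def plan_recovery(failed_node, groups, load, blocks_in_group):
    """For every logical group containing the failed node, pick its
    least-loaded surviving member as backup; each backup copies only the
    failed node's blocks that belong to its own group."""
    plan = {}
    for gid, members in groups.items():
        if failed_node not in members:
            continue                                  # group not affected
        survivors = [n for n in members if n != failed_node]
        backup = min(survivors, key=lambda n: load[n])
        plan[gid] = (backup, blocks_in_group[gid])    # copies run in parallel
    return plan

# Example with two logical groups that both contain the failed node n2.
groups = {"LG0": ["n0", "n1", "n2", "n3"], "LG1": ["n2", "n4", "n5", "n6"]}
load = {"n0": 5, "n1": 2, "n3": 4, "n4": 1, "n5": 3, "n6": 2}
blocks = {"LG0": ["b7", "b11"], "LG1": ["b3"]}
print(plan_recovery("n2", groups, load, blocks))
# {'LG0': ('n1', ['b7', 'b11']), 'LG1': ('n4', ['b3'])}
```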
Fig. 4 demonstrates, taking the failure of data node 2 as an example, how each logical group replaces and recovers data node 2.
(4) The operation of adding a new data storage node mainly uses the decoupled address mapping method and comprises the following steps:
Step 4.1: compute the average load COV_ave of the data storage nodes over all logical groups in the whole system;
Step 4.2: select a logical group and compute the maximum load COV_max over all data storage nodes in that group;
Step 4.3: compare COV_max with COV_ave; if COV_max ≥ COV_ave, replace the most heavily loaded data storage node in that logical group with the newly added data storage node; otherwise, choose the next logical group and repeat steps 4.1, 4.2 and 4.3, until the load of the newly added data storage node reaches or approaches the average load of the data storage nodes in the system.
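A minimal sketch of the node-addition loop of steps 4.1 to 4.3 is given below, under the assumption that per-node loads are available as a hypothetical group-to-node-to-load mapping; the new node takes the place of the most heavily loaded node of each group whose maximum load is at least the system-wide average, and the loop stops once the new node's accumulated load reaches that average.

```python
# Sketch of the decoupled address mapping loop (steps 4.1 - 4.3).
# group_loads is a hypothetical {group: {node: load}} structure.

def add_new_node(new_node, group_loads):
    all_loads = [l for nodes in group_loads.values() for l in nodes.values()]
    cov_ave = sum(all_loads) / len(all_loads)        # step 4.1: COV_ave
    new_load = 0.0
    replaced = []
    for gid, nodes in group_loads.items():
        if new_load >= cov_ave:                      # new node is balanced: stop
            break
        heaviest = max(nodes, key=nodes.get)         # step 4.2: COV_max of the group
        cov_max = nodes[heaviest]
        if cov_max >= cov_ave:                       # step 4.3: take over this node
            new_load += nodes.pop(heaviest)
            nodes[new_node] = cov_max
            replaced.append((gid, heaviest))
    return replaced, new_load, cov_ave

# Example with three logical groups of four data storage nodes each.
group_loads = {
    "LG0": {"n0": 9, "n1": 2, "n2": 3, "n3": 2},
    "LG1": {"n4": 2, "n5": 2, "n6": 2, "n7": 2},
    "LG2": {"n8": 8, "n9": 3, "n10": 2, "n11": 3},
}
print(add_new_node("n_new", group_loads))
```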
Fig. 5 demonstrates the data block migration process of the whole system when the newly added data node node_128 forms the new logical group LG_1000.
As can be seen from Fig. 4 and Fig. 5, by means of the invertible functions together with the intelligent regrouping mapping method and the decoupled address mapping method, only very few data blocks are migrated when a data node fails or a new data node is added, which fully guarantees the stability of the system and its availability to users.
The method is illustrated below with an example.
HDFS is selected as the data-intensive file system, and a big-data environment with 10,000 data nodes and 1,000,000 data blocks is simulated. The memory usage of the master node with and without the lightweight autonomous block management method is shown in Table 1, and the CPU usage of the master node is shown in Table 2. The 1,000,000 data blocks are distributed roughly evenly over the 10,000 data nodes, and each data block is 64 MB.
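For scale, a quick back-of-the-envelope computation of the simulated data volume follows; the three-replica factor is taken from the general description above and is an assumption about how the simulated blocks are counted.

```python
# Rough scale of the simulated environment used for Tables 1 and 2.
blocks = 1_000_000
block_mb = 64
nodes = 10_000

raw_tb = blocks * block_mb / 1_000_000           # 64 TB of unique data
per_node_blocks = blocks / nodes                 # ~100 blocks per data node
per_node_gb = per_node_blocks * block_mb / 1000  # ~6.4 GB per data node
replicated_tb = raw_tb * 3                       # ~192 TB if 3 replicas are kept (assumed)
print(raw_tb, per_node_blocks, per_node_gb, replicated_tb)
```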
Table 1. Master node memory usage for data block management
Number of data nodes:                    1000   2000   5000   7000   9000   10000
Memory used with optimization (MB):        15     20     27     36     42      50
Memory used without optimization (MB):    180    186    189    192    194     196
Table 2. Master node CPU usage for data block management
Number of data nodes:                     500   2000   3000   4000   5000
CPU usage with optimization (%):          1.4    2.3    2.5    3.1    4.2
CPU usage without optimization (%):       6.3   12.1   16.6   19.8   23.2
Tables 1 and 2 show that, after the lightweight autonomous block management method for data-intensive file systems is adopted, the memory usage and CPU usage of the master node are substantially lower than without it.
Although the content of the present invention has been described in detail through the preferred embodiments above, it should be understood that the above description is not to be regarded as a limitation of the present invention. Various modifications and substitutions of the present invention will be apparent to those skilled in the art after reading the above content. Therefore, the protection scope of the present invention shall be defined by the appended claims.

Claims (6)

1. A lightweight autonomous block management method for a data-intensive file system, characterized in that the data-intensive file system realizes the autonomous management of data blocks by an intersected shifted declustering method, i.e., a group of reversible mathematical functions is used to realize the mapping from data blocks to data storage nodes and from data storage nodes to data blocks, completing the distributed storage and lookup of data blocks;
the autonomous block management method realizes a data block storage operation comprising the following steps:
step 1.1: the master node selects the logical group for the data block by a reversible linear hash function;
step 1.2: the master node selects the data storage nodes in the logical group by a reversible shift-partition function;
step 1.3: the selected data storage nodes store the data of the data block and the block address mapping information;
the autonomous block management method realizes a failure handling operation for a failed data storage node comprising the following steps:
step 3.1: determining the logical groups that contain the failed data storage node;
step 3.2: in each of those logical groups, selecting the least-loaded data storage node other than the failed node as the backup node;
step 3.3: using the intelligent regrouping mapping method, the multiple backup data storage nodes copy in parallel, each within its own logical group, the data of the failed data storage node contained in that group;
wherein in the intelligent regrouping mapping method the number of backup data storage nodes selected is equal to the number of logical groups that contain the failed data storage node, a failed data storage node being contained in a plurality of logical groups, and each backup data storage node copies only the part of the failed node's data that belongs to its own logical group.
2. The lightweight autonomous block management method for a data-intensive file system according to claim 1, characterized in that
the autonomous block management method realizes a data block lookup operation comprising the following steps:
step 2.1: the data storage node holding the data block computes, from its own index number and with the reverse invertible function, the new ID of the logical group containing the block;
step 2.2: the data storage node holding the data block computes, from the new logical group ID and with the reverse invertible function, the physical ID of the block;
step 2.3: the data storage node obtains, from the physical ID of the block, the mapping information of the block on the storage node;
step 2.4: the data storage node sends the obtained data of the block to the data-intensive file system according to the mapping information of the block.
3. The lightweight autonomous block management method for a data-intensive file system according to claim 1 or 2, characterized in that
the autonomous block management method realizes an operation of adding a new data storage node, which uses the decoupled address mapping method and comprises the following steps:
step 4.1: computing the average load COV_ave of the data storage nodes over all logical groups in the whole system;
step 4.2: selecting any one logical group and computing the maximum load COV_max over all data storage nodes in that logical group;
step 4.3: comparing COV_max with COV_ave; if COV_max ≥ COV_ave, replacing the most heavily loaded data storage node in the logical group with the newly added data storage node; otherwise, choosing the next logical group and repeating steps 4.1, 4.2 and 4.3, until the load of the newly added data storage node reaches or approaches the average load of the data storage nodes in the system.
4. The lightweight autonomous block management method for a data-intensive file system according to claim 1, characterized in that
in step 1.1, the logical group for the data block is selected by the reversible linear hash function of formula (1), where g is the logical group ID, x is the current number of logical groups in the system, X is the initial number of logical groups in the system, b is the block ID of the data block to be stored within its file, and s is the number of newly added logical groups;
in step 1.2, the process of selecting the data storage nodes in the logical group by the reversible shift-partition function comprises:
a) computing, by formula (2), the new block identifier of data block b after it is mapped into logical group g, where a is the new identifier of data block b within logical group g;
b) computing the index of the data storage node in logical group g to which data block b is mapped:
d = node(a, i) = (a + i) % 4    (3)
where i is the replica number of data block b and d is the index of the data storage node selected in the logical group for data block b.
5. The lightweight autonomous block management method for a data-intensive file system according to claim 4, characterized in that
in step 2.1, the new ID of the logical group containing data block b is computed with the reverse invertible function
a = 4j + (d - i) % 4    (4)
obtained by inverting the formula d = (a + i) % 4,
where the replica number i of the data block is iterated over 0, 1 and 2, and j takes zero or a positive integer;
in step 2.2, the physical ID of data block b is computed with the reverse invertible function of formula (5).
6. The lightweight autonomous block management method for a data-intensive file system according to claim 1, characterized in that
in the data-intensive file system, the master node maintains the system namespace, allocates data blocks to data storage nodes, and manages the data storage nodes;
and each data storage node is responsible for the consistency checking of data blocks, data block recovery, and the storage and maintenance of the data storage node mapping information.
CN201610665489.3A 2016-08-12 2016-08-12 Lightweight autonomous block management method for a data-intensive file system Active CN106293537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610665489.3A CN106293537B (en) 2016-08-12 2016-08-12 Lightweight autonomous block management method for a data-intensive file system


Publications (2)

Publication Number Publication Date
CN106293537A CN106293537A (en) 2017-01-04
CN106293537B true CN106293537B (en) 2019-11-12

Family

ID=57670722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610665489.3A Active CN106293537B (en) 2016-08-12 2016-08-12 Lightweight autonomous block management method for a data-intensive file system

Country Status (1)

Country Link
CN (1) CN106293537B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783398B (en) * 2019-01-18 2020-09-15 上海海事大学 Performance optimization method for FTL (fiber to the Home) solid state disk based on relevant perception page level
CN114844911B (en) * 2022-04-20 2024-07-09 网易(杭州)网络有限公司 Data storage method, device, electronic equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488104A (en) * 2009-02-26 2009-07-22 北京世纪互联宽带数据中心有限公司 System and method for implementing high-efficiency security memory
CN104052576A (en) * 2014-06-07 2014-09-17 华中科技大学 Data recovery method based on error correcting codes in cloud storage
CN104077423A (en) * 2014-07-23 2014-10-01 山东大学(威海) Consistent hash based structural data storage, inquiry and migration method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012147087A1 (en) * 2011-04-29 2012-11-01 Tata Consultancy Services Limited Archival storage and retrieval system


Also Published As

Publication number Publication date
CN106293537A (en) 2017-01-04

Similar Documents

Publication Publication Date Title
US10761758B2 (en) Data aware deduplication object storage (DADOS)
CN106066896B (en) Application-aware big data deduplication storage system and method
Chen et al. The dynamic cuckoo filter
CN106776967B (en) Method and device for storing massive small files in real time based on time sequence aggregation algorithm
TWI472935B (en) Scalable segment-based data de-duplication system and method for incremental backups
CN110058822B (en) Transverse expansion method for disk array
CN102855294B (en) Intelligent hash data layout method, cluster storage system and method thereof
CN105069111B (en) Block level data duplicate removal method based on similitude in cloud storage
US8965856B2 (en) Increase in deduplication efficiency for hierarchical storage system
CN104077423A (en) Consistent hash based structural data storage, inquiry and migration method
CN105683898A (en) Set-associative hash table organization for efficient storage and retrieval of data in a storage system
CN101079034A (en) System and method for eliminating redundancy file of file storage system
JP2013514560A (en) Storage system
CN107667363A (en) Object-based storage cluster with plurality of optional data processing policy
CN103929454A (en) Load balancing storage method and system in cloud computing platform
CN103139300A (en) Virtual machine image management optimization method based on data de-duplication
CN105117351A (en) Method and apparatus for writing data into cache
US11755557B2 (en) Flat object storage namespace in an object storage system
CN103902735A (en) Application perception data routing method oriented to large-scale cluster deduplication and system
US20180373456A1 (en) Metadata Load Distribution Management
CN105354250A (en) Data storage method and device for cloud storage
CN108073472B (en) Memory erasure code distribution method based on heat perception
US20200341639A1 (en) Lattice layout of replicated data across different failure domains
CN108415671A (en) A kind of data de-duplication method and system of Oriented Green cloud computing
CN106293537B (en) A kind of autonomous block management method of the data-intensive file system of lightweight

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant