CN111190542A

CN111190542A - Method and system for realizing file layout of file system

Info

Publication number: CN111190542A
Application number: CN201911371156.XA
Authority: CN
Inventors: 李兆龙
Original assignee: Tianjin Zhongke Shuguang Storage Technology Co Ltd
Current assignee: Tianjin Zhongke Shuguang Storage Technology Co Ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2020-05-22
Anticipated expiration: 2039-12-27
Also published as: CN111190542B

Abstract

The invention discloses a method for implementing a file system file layout, comprising: traversing the vset array to obtain a two-dimensional array of disk correlation values and hit times; constructing a topo according to the disk association trend reflected by the two-dimensional array; and filling all physical disks in the disk pool into the topo to complete its creation. The method uses a set built from heterogeneous topos as the file system file layout, which markedly increases the degree of association among the disks in the same disk pool: subject to the data reliability constraints (node redundancy and the like), each disk holds copies or redundancy in common with as many other disks in the pool as possible, which greatly improves read bandwidth and repair speed when a repair operation occurs.

Description

Method and system for realizing file layout of file system

Technical Field

The invention relates to the technical field of computer data processing, in particular to a method and a system for realizing file system file layout.

Background

Layout is an important component of a file system and one of the main members of file metadata. It maintains the mapping from the logical position to the physical position of file content and, using a limited amount of data, identifies the specific disk that holds the content and even the offset within that disk.

Implementations of layout fall broadly into two categories: three-level indirect address layout and extended segment block layout. Put simply, the three-level indirect address layout records file positions through a fixed mapping relation (mapping algorithm). It is used in local file systems such as ext2/ext3, but it is not suitable for a distributed file system, because the layout description itself is large and the data volume of a distributed file system usually reaches the PB level.

The layout of a distributed file system usually adopts the extended segment block idea: the file is divided into segments of fixed size. Until the file has been written up to a certain size it has only one segment, and a further segment is allocated to the file only when its size exceeds one segment. The layout of the file is an extensible array of segment descriptions; since the segment size is fixed, describing each segment only requires describing clearly the mapping between the segment and its physical location.
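The fixed-size segment idea can be illustrated with a minimal sketch. The segment size of 64 MiB and the helper names `locate` and `segments_needed` are assumptions made purely for illustration; the source does not state a segment size or any function names.

```python
# Hypothetical segment size; the source only says that segments have a fixed size.
SEGMENT_SIZE = 64 * 1024 * 1024


def locate(offset, segment_size=SEGMENT_SIZE):
    """Map a logical file offset to (segment index, offset inside that segment)."""
    return divmod(offset, segment_size)


def segments_needed(file_size, segment_size=SEGMENT_SIZE):
    """Number of segment descriptions the layout array holds for a file.

    A file always has at least one segment; another one is allocated only
    once the file grows past the end of the segments it already has.
    """
    full_segments = -(-file_size // segment_size)  # ceiling division
    return max(1, full_segments)
```

Because every segment has the same size, the layout array only has to store where each segment lives; the segment index for any offset follows from simple arithmetic.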

There are many distributed file system products on the market today, such as whale fleet file systems, OceanStor 9000, and others. Their file system layout implementations are all based on the extended segment block layout scheme, and their application scenarios are basically the same: all are massive distributed file systems in which file data is stored in a multi-node, multi-disk computer cluster according to a specific ratio.

The existing schemes all aim to write continuously created files evenly across all the disks in the system so as to obtain the maximum write bandwidth of the system. The products describe their segment blocks in different ways, but the conventional implementations share a common problem: a disk in the system has a fixed relationship with only a limited number of the other disks when segment block descriptions are generated, so the data on one disk can be mutually copied and made redundant with only that limited set of disks.

When a disk or a node in the system fails, or a disk is deliberately removed, space must be found on other nodes and disks in the system and the data of the failed disk must be rebuilt there in order to ensure data reliability. If the data on that disk is duplicated or redundant with only a limited number of disks, the read bandwidth of the repair operation is severely limited and the repair is slow.

In view of the above, the present invention is particularly proposed.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a method and a system for realizing file layout of a file system, which ensure the reliability of data in a distributed file system.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A method for implementing a file system file layout comprises:

traversing the vset array to obtain a two-dimensional array of disk correlation values and hit times;

constructing a topo according to the disk association trend reflected by the two-dimensional array;

and filling all physical disks in the disk pool into the topo to complete the creation.

Further, in the step of "constructing topo according to the disk association trend reflected by the two-dimensional array" of the method for implementing the file system layout, the disk association trend of the two-dimensional array is expressed by a correlation value for each disk: the sum of the hit times of the disk with the other disks divided by the maximum value of the hit times of the disk with the other disks. The calculation formula is as follows:

c_value(i)＝(hit(i-1)+…+hit(i-n))/hit(max in disk(i))；

where i is the disk number; c_value(i) is the correlation value of the current disk; and hit is the longitudinal array of the disk.

Further, in the method for implementing the file system layout, the "constructing topo according to the disk association trend reflected by the two-dimensional array" includes:

S1, finding the disk with the minimum c_value as the first disk and creating a vset for it first; the node where this disk is located is fixed as the first node;

S2, after the first disk of the new vset has been determined, traversing the longitudinal hit array of that disk and, giving priority to keeping the node distribution unchanged, selecting the disks that have formed vsets with it the fewest times to form the vset with it;

when one vset has been determined, the disk with the smallest c_value is again found among the remaining disks, and steps S1-S2 are repeated until all the physical disks in the disk pool are filled into the topo.

Further, in the method for implementing the file system layout, the "filling all physical disks in the disk pool into topo to complete creating" includes: filling all physical disks in the disk pool into the topo to complete the construction of the topo;

and traversing the topo to construct a new disk combination.

The invention also provides a file system layout constructed according to the method.

The invention also provides a device for constructing the file system layout, which comprises a processor and a memory, wherein the memory stores a program, and when the program is executed by the processor, the program executes the following steps:

Further, in the apparatus for constructing a file system layout, in the "constructing topo according to the disk association trend reflected by the two-dimensional array", the disk association trend of the two-dimensional array is that the correlation value of each disk is the sum of the hit times of the disk and the rest of disks divided by the maximum value of the hit times of the disk and the rest of disks; the calculation formula is as follows:

c_value(i)＝(hit(i-1)+…+hit(i-n))/hit(max in disk(i))；

Further, in the apparatus for constructing a file system layout, when the program executes the "constructing topo according to the disk association trend reflected by the two-dimensional array", the method includes:

Further, in the apparatus for constructing a file system layout, when the program executes the "fill all physical disks in the disk pool into topo and complete creation", the program includes: filling all physical disks in the disk pool into the topo to complete the construction of the topo;

and traversing the topo to construct a new disk combination.

The invention has the beneficial effects that:

The method of the invention uses a set built from heterogeneous topos as the file system layout, which markedly increases the degree of association among the disks in the same disk pool: subject to the data reliability constraints (node redundancy and the like), each disk holds copies or redundancy in common with as many other disks in the pool as possible, which greatly improves read bandwidth and repair speed when a repair operation occurs.

Drawings

In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.

FIG. 1 is a schematic diagram of a hierarchy of system resources;

FIG. 2 is a schematic diagram of one embodiment of a disk pool topo structure;

FIG. 3 is a schematic diagram of a two-dimensional array according to one embodiment of the method of the present invention;

FIG. 4 is a schematic diagram of a small set of disks obtained after topo is constructed by the method of the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.

First, technical terms involved in the present invention are explained:

The hierarchy of system resources is shown in FIG. 1. The system contains hierarchically related resources such as the storage pool (storagepool), disk pool (disk pool), node (ds), disk (disk), and logical disk. Physical disk space is divided to form logical disks; the mapping between a logical disk and its specific position on the physical disk is managed by the local file system, and the space of one logical disk may even be discontiguous on the physical disk. A group of logical disks that conforms to a specific rule, namely the ratio, is called a vset, and the vset is the segment description of the extended segment block layout. The layout of a file is an extensible vset array.

Example 1

The invention discloses a method for realizing a file system layout, which comprises the following steps

The method of the invention implements the file system layout by creating new disk sets that increase the degree of disk association. When several disk pools exist, the set is in effect the superposition of the construction results of the individual disk pools, so a sound set construction algorithm only needs to be provided for a single disk pool.

The invention constructs a set in a disk pool to achieve the following three goals:

the physical disks in the disk pool are covered evenly;

the node redundancy of each vset is as high as possible;

every physical disk in the disk pool is associated (forms vsets) with all the other physical disks in the pool as uniformly as possible.

To meet these three goals, the disks in the disk pool need to be covered many times, especially when the number of disks in the pool is large. But creating a complete large set directly would require the system to create a large number of logical disks on every physical disk from the start, which is clearly undesirable. The set construction scheme therefore first lays down the following three rules:

small sets are constructed one after another and merged into a large set, so that the large set eventually meets the goals;

each small set is constructed so that it covers all the disks in the disk pool as uniformly as possible;

the final degree of association among all physical disks is ensured by the heterogeneity of the small sets.

If the disks were large enough and small sets were constructed enough times, then in theory the goals could be met simply by randomly constructing a uniformly covering small set each time. In practice, however, small sets are not constructed often enough for this, so the construction of each small set has to be deliberately guided.

As shown in FIG. 2, the topo structure of a disk pool is roughly as shown in FIG. 2-1: a linked list of nodes runs horizontally, and a linked list of the disks under each node runs vertically. One way to construct a small set is to traverse the disk pool's topo directly to construct vsets and the set, that is, to traverse the nodes and disks until every disk in the topo has been included. Because the topo structure is fixed, small sets built this way are not sufficiently heterogeneous: even if a translation (shift) is introduced each time a small set is constructed, when the number of disks in the disk pool is large one disk can be associated with only a few fixed disks. However, this simple topo traversal has two advantages:

the constructed small set covers all the disks in the disk pool essentially uniformly;

the logical disks in each vset are distributed across as many nodes as possible (ensuring maximum node redundancy).
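The simple traversal described above can be sketched as follows. This is only an illustrative reading of the text: the function name `traverse_topo`, the list-of-lists representation of the topo, and the particular way the shift is applied per node are assumptions, not details given in the source.

```python
def traverse_topo(topo, shift=0):
    """Build one small set by walking a fixed topo row by row.

    topo  -- list of per-node disk lists, all of the same length
             (hypothetical representation of the node and disk linked lists)
    shift -- hypothetical "translation": node k is rotated by k * shift rows
    """
    rows = len(topo[0])
    small_set = []
    for r in range(rows):
        # One disk from every node, so each vset spans all nodes.
        vset = [disks[(r + k * shift) % rows] for k, disks in enumerate(topo)]
        small_set.append(vset)
    return small_set


# Assumed example pool: 3 nodes with 2 disks each.
POOL = [[1, 2], [3, 4], [5, 6]]
plain = traverse_topo(POOL)
shifted = traverse_topo(POOL, shift=1)
```

Every disk is covered exactly once and every vset spans all nodes, which reflects the two advantages listed above; but with a fixed topo the set of possible partners for a given disk is determined by the shift alone, which is the limitation the text points out.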

Building on these two advantages, the invention adopts a more reasonable approach of constructing heterogeneous topos: each time a small set is to be constructed, a topo is first built according to a specific algorithm, and that topo is then traversed to construct a small set that fully meets the requirements. To ensure node redundancy, the constructed topo must still preserve the original ownership relationship between nodes and disks; the construction only purposefully adjusts the order of the nodes and the order of the disks under each node.

The extended segment block layout scheme with fixed segment size is applied to a distributed file system, and in order to ensure the reliability of data, the file data is stored in a plurality of disks of a plurality of nodes according to a specific ratio.

Taking a specific scenario as an example, suppose that the disk pool topo is as shown in FIG. 2, that the 3 nodes 1, 2, and 3 each have 2 disks, and that a 3-copy set (the most common kind) is created with a stripe width of 1 and a node redundancy of 2. Assuming that there are already 6 vsets in the large set, the vset distribution is as follows:

vset1:[disk1 disk3 disk5]

vset2:[disk2 disk4 disk6]

vset3:[disk1 disk4 disk6]

vset4:[disk2 disk3 disk5]

vset5:[disk1 disk3 disk5]

vset6:[disk2 disk4 disk6]

A two-dimensional array about the disks can then be obtained by traversing these vsets, as shown in FIG. 3. The numbers listed longitudinally under each disk in the figure are the number of times that the disk at the top of the column has formed a vset with the disk corresponding to each number, and the numbers marked transversely at the top are the correlation value (c_value) of each disk. The calculation formula is as follows:

c_value(i)＝(hit(i-1)+…+hit(i-n))/hit(max in disk(i))；

where i is the disk number (i.e., disk id); hit is the longitudinal array of the disk, i.e. the hit times of the current disk with the rest of the disks.

The c_value of each disk is the sum of the hit times of the disk with the rest of the disks divided by the maximum value of those hit times. By this formula, the c_value is normalized across all the disks in one disk pool, and its maximum value is n - 1, where n is the number of disks in the disk pool. The formula also shows that the more of the other disks in the pool a disk has formed vsets with, the larger the value, and the more evenly it has formed vsets with them, the larger the value. The c_value therefore points well to the disks that need attention: the smaller a disk's c_value, the higher its priority for processing when a new small set is created.
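The calculation can be reproduced for the six vsets of the example with a short sketch. The function names `build_hit_matrix` and `c_value` and the dictionary representation of the two-dimensional array are assumptions for illustration; returning 0.0 for a disk that has never appeared in any vset is also an assumption, since the source does not cover that case.

```python
from itertools import combinations

# The six existing vsets from the example, written as disk ids.
VSETS = [
    [1, 3, 5], [2, 4, 6], [1, 4, 6],
    [2, 3, 5], [1, 3, 5], [2, 4, 6],
]
DISKS = [1, 2, 3, 4, 5, 6]


def build_hit_matrix(vsets, disks):
    """hit[a][b] = number of vsets that contain both disk a and disk b."""
    hit = {a: {b: 0 for b in disks if b != a} for a in disks}
    for vset in vsets:
        for a, b in combinations(vset, 2):
            hit[a][b] += 1
            hit[b][a] += 1
    return hit


def c_value(hit, disk):
    """Sum of a disk's hit times divided by its largest single hit count."""
    counts = list(hit[disk].values())
    peak = max(counts)
    # Assumption: a disk that has never been used gets the lowest value.
    return sum(counts) / peak if peak else 0.0


HIT = build_hit_matrix(VSETS, DISKS)
C_VALUES = {d: c_value(HIT, d) for d in DISKS}
```

For this data, disks 1 and 2 come out at 3.0 and disks 3 to 6 at 2.0: all six disks have the same total of 6 hits, but disks 3 to 6 have one partner they met 3 times, so their associations are less even and their c_value is lower.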

The topo used to select the small set is constructed from the above two-dimensional array, and the steps are as follows:

S1, finding the disk with the minimum c_value as the first disk and creating a vset for it first; the node where this disk is located is naturally fixed as the first node (when several candidate disks or nodes have the same reference value, the one with the smallest id may be selected, or one may be selected at random).

S2, after the first disk of the new vset has been determined, traversing the longitudinal hit array of that disk and, giving priority to keeping the node distribution unchanged, selecting the disks that have formed vsets with it the fewest times to form the vset with it.

S3, after one vset has been determined, again finding the disk with the minimum c_value among the remaining disks and repeating steps S1-S2 until all physical disks in the disk pool are filled into the topo, which completes the construction of the topo; the constructed topo is shown in FIG. 4.

S4, traversing the topo and constructing a small set containing 2 vsets, which are respectively as follows:

vset7:[disk3 disk6 disk2]

vset8:[disk4 disk5 disk1]

The disk combinations of the newly generated small set are constructed according to the disk association trend. In fact, the first 6 vsets in the example were themselves constructed with this method, starting from scratch, as 3 small sets.
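Steps S1 to S4 can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions, not the patented implementation: the disk-to-node mapping `NODE_OF` (node 1 holds disks 1 and 2, node 2 holds disks 3 and 4, node 3 holds disks 5 and 6) is inferred from the example vsets because FIG. 2 is not reproduced here; ties are broken by smallest disk id; "keeping the node distribution unchanged" is simplified to taking at most one disk per node whenever possible; and all function names are hypothetical.

```python
from itertools import combinations

VSETS = [[1, 3, 5], [2, 4, 6], [1, 4, 6], [2, 3, 5], [1, 3, 5], [2, 4, 6]]
NODE_OF = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}  # assumed disk -> node mapping


def build_hit_matrix(vsets, disks):
    """hit[a][b] = number of vsets that contain both disk a and disk b."""
    hit = {a: {b: 0 for b in disks if b != a} for a in disks}
    for vset in vsets:
        for a, b in combinations(vset, 2):
            hit[a][b] += 1
            hit[b][a] += 1
    return hit


def c_value(hit, disk):
    counts = list(hit[disk].values())
    peak = max(counts)
    return sum(counts) / peak if peak else 0.0  # assumption for unused disks


def build_small_set(vsets, node_of, width):
    """Greedy sketch of S1-S4: returns the vsets of one new small set."""
    hit = build_hit_matrix(vsets, sorted(node_of))
    remaining = set(node_of)
    small_set = []
    while len(remaining) >= width:
        # S1: the disk with the smallest c_value starts the vset.
        first = min(remaining, key=lambda d: (c_value(hit, d), d))
        remaining.remove(first)
        group, used_nodes = [first], {node_of[first]}
        # S2: add the least-associated disks, preferring unused nodes.
        while len(group) < width:
            pool = [d for d in remaining if node_of[d] not in used_nodes]
            pool = pool or list(remaining)  # relax if nodes run out
            pick = min(pool, key=lambda d: (hit[first][d], d))
            remaining.remove(pick)
            group.append(pick)
            used_nodes.add(node_of[pick])
        small_set.append(group)  # S3: repeat with the remaining disks
    return small_set


NEW_SET = build_small_set(VSETS, NODE_OF, width=3)
```

Under these assumptions the sketch yields `[3, 6, 2]` and `[4, 5, 1]`, the same disk combinations as vset7 and vset8 in the example.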

The invention uses a set built from heterogeneous topos as the file system layout, which markedly increases the degree of association among the disks in the same disk pool: subject to the data reliability constraints (node redundancy and the like), each disk holds copies or redundancy in common with as many other disks in the pool as possible, which greatly improves read bandwidth and repair speed when a repair operation occurs. With 96 disks in the same storage pool and a (4+2):1 ratio, the repair time for 1 TB of data is improved to about 1 hour.
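The effect on the degree of association can be checked on the small example from Example 1 by counting, for one disk, how many distinct disks share a vset with it before and after the new small set is added. The helper name `peers` is hypothetical, and the count is used here only as a rough proxy for how many disks can serve reads during a repair.

```python
def peers(vsets, disk):
    """Distinct disks that share at least one vset with the given disk."""
    return {d for vset in vsets if disk in vset for d in vset if d != disk}


OLD_VSETS = [[1, 3, 5], [2, 4, 6], [1, 4, 6], [2, 3, 5], [1, 3, 5], [2, 4, 6]]
NEW_VSETS = OLD_VSETS + [[3, 6, 2], [4, 5, 1]]  # vset7 and vset8 from the example

before = peers(OLD_VSETS, 3)
after = peers(NEW_VSETS, 3)
```

In the example, disk3 shares vsets with 3 other disks before vset7 is added and with 4 afterwards; the only disk it never pairs with is disk4, which sits on the same node under the node mapping implied by the example vsets.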

Example 2

The invention also provides a file system implemented by the above method. Disk sets are constructed based on the degree of association of the disks in the disk pool, so that each disk holds copies or redundancy in common with as many of the other disks in the pool as possible. When a repair operation occurs, the read bandwidth is therefore markedly improved, and the repair speed is markedly improved as a result.

Example 3

The invention also provides a device for constructing the file system layout. The system comprises a processor and a memory, wherein the memory stores a program, and when the program is executed by the processor, the program executes:

In the step of "constructing topo according to the disk association trend reflected by the two-dimensional array", the disk association trend of the two-dimensional array is that the correlation value of each disk is the sum of the hit times of the disk with the rest of the disks divided by the maximum value of the hit times of the disk with the rest of the disks; the calculation formula is as follows:

c_value(i)＝(hit(i-1)+…+hit(i-n))/hit(max in disk(i))；

The program executing the step of constructing topo according to the disk association trend reflected by the two-dimensional array comprises the following steps:

The program executing the step of "filling all physical disks in the disk pool into the topo, and completing creation" includes: filling all physical disks in the disk pool into the topo to complete the construction of the topo;

and traversing the topo to construct a new disk combination.

The steps and principles of the program execution of the apparatus of the present invention for implementing the method of the present invention can be explained with reference to the above embodiment 1, and are not described herein again.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims

1. A method for realizing file system file layout is characterized by comprising the following steps:

2. The method for implementing file system file layout according to claim 1, wherein in the "constructing topo according to the disk association trend reflected by the two-dimensional array", the disk association trend of the two-dimensional array is that the correlation value of each disk is the sum of the hit times of the disk and the other disks divided by the maximum value of the hit times of the disk and the other disks; the calculation formula is as follows:

c_value(i)＝(hit(i-1)+…+hit(i-n))/hit(max in disk(i))；

3. The method for implementing file system file layout according to claim 2, wherein the "building topo according to the disk association trend reflected by the two-dimensional array" includes:

4. The method for implementing file system file layout according to claim 3, wherein the "filling all physical disks in a disk pool into topo, completing creation" includes: filling all physical disks in the disk pool into the topo to complete the construction of the topo;

and traversing the topo to construct a new disk combination.

5. A file system file layout constructed according to the method of any of claims 1-4.

6. An apparatus for constructing a file system file layout, comprising a processor and a memory, the memory having a program stored therein, wherein the program, when executed by the processor, performs:

7. The apparatus for constructing a file system layout according to claim 6, wherein in the "constructing topo according to the disk association trend reflected by the two-dimensional array", the disk association trend of the two-dimensional array is that the correlation value of each disk is the sum of the hit times of the disk and the rest of disks divided by the maximum value of the hit times of the disk and the rest of disks; the calculation formula is as follows:

c_value(i)＝(hit(i-1)+…+hit(i-n))/hit(max in disk(i))；

8. The method for implementing file system file layout according to claim 7, wherein the step of executing the "build topo according to the disk association trend reflected by the two-dimensional array" by the program comprises:

9. The method for implementing file system file layout according to claim 8, wherein the step of program executing "fill all physical disks in the disk pool into topo, complete creation" includes: filling all physical disks in the disk pool into the topo to complete the construction of the topo;

and traversing the topo to construct a new disk combination.