CN101556557B

CN101556557B - Object file organization method based on object storage device

Info

Publication number: CN101556557B
Application number: CN2009100985619A
Authority: CN
Inventors: 尹建伟; 孙鹏; 吴朝晖; 邓水光; 吴健; 李莹
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2009-05-14
Filing date: 2009-05-14
Publication date: 2011-03-23
Anticipated expiration: 2029-05-14
Also published as: CN101556557A

Abstract

The invention relates to an object file organization method based on an object storage device, comprising the following steps: establishing the layout of an object file system on a disk and loading information including the object description, the object bitmap and the like into memory; having a space allocator detect the size of the object file before allocating space; if the size of the object file is known, allocating space for the object file on the disk by a pre-allocation method; if the size of the object file cannot be obtained, first writing part of the data of the object file into a buffer; detecting whether the buffer is full or whether the client requests release of the cached data, then allocating space on the disk for the data in the buffer and writing the data to the disk; and, when the data in the buffer is far larger than the logical data block size, having the space allocator allocate storage intervals for the data. The invention allocates a plurality of data blocks contiguously, which reduces the time spent searching for and allocating free blocks and remedies the limitations and shortcomings of disk space allocation in the object file systems of current distributed file systems.

Description

An object file organization method based on an object storage device

Technical field

The present invention relates to the technical field of distributed computer storage, and in particular to an object file organization method based on an object storage device.

Background technology

An object is the basic unit of an object storage system; it combines the data of a file with a set of attributes. The attributes of an object may include the RAID parameters, quality of service, access information and data distribution information of the file.

An object storage device exposes an object interface and uses the object as its basic unit of access. The device has a degree of intelligence: it has its own CPU, memory, network and disk system, and implements data storage, intelligent distribution and object metadata management. Within an object storage device every object has a unique identifier, and a file object is accessed through its object identifier. The device manages the file object data; the data is placed on standard disk devices, but no block interface is offered, so a client reads and writes data by object identifier and offset. The device can use its own CPU and memory resources to optimize data distribution, and by setting up a cache it supports data prefetching and buffered reads, reducing the number of physical disk reads.

A traditional file system uses an indirect data block mapping table to record how the data blocks of each file are distributed. Storing a very large file takes many data blocks, so the mapping table becomes very large and hard to maintain. Data blocks are handed out by a block allocator, and a typical block allocator can allocate only one data block at a time, which means the block allocator has to be called many times when data is written to the disk. Because the block allocator cannot know, at allocation time, how many blocks will be needed in total, it cannot optimize the placement of the space it allocates.

Lustre is an open-source file system developed by Cluster File Systems, Inc.; the system consists of clients, a cluster management system and an object storage system (Object Storage Target). The NASD project at CMU uses intelligent disk drives that offer the user an object disk interface rather than a disk block interface. Both Lustre and NASD rely on traditional file systems: Lustre uses Ext3 from Linux, and NASD uses a modified UFS. Because a traditional file system does not exploit the fact that objects vary in size, it allocates space in small blocks, so the data blocks are scattered and allocation is slow. On the access side, the lack of an efficient caching mechanism increases the number of disk reads and writes, so the advantages of an object storage system cannot be fully realized.

Summary of the invention

The technical problem to be solved by the present invention is to provide an object file organization method based on an object storage device that allocates a plurality of data blocks contiguously, reduces the time spent searching for and allocating free blocks, and remedies the limitations and shortcomings of object file system disk space allocation in current distributed file systems.

To solve the above problem, the present invention adopts the following technical solution. The specific steps of the method are:

(1) establish the layout of the object file system on the disk, and load the object description, object bitmap, data bitmap and object node table information into memory;

(2) before allocating space, the space allocator detects the size of the object file;

(3) if the size of the object file is known, allocate space for the object file on the disk by the pre-allocation method;

(4) if the size of the object file is not known, first write part of the data of the object file into a buffer;

(5) detect whether the buffer is full or whether the client has requested release of the cached data, then allocate space on the disk for the data in the buffer and write the data to the disk; when the data in the buffer is far larger than the logical data block size of the disk, the space allocator allocates storage intervals for it.

The pre-allocation procedure of the present invention is:

(a) determine, from the size of the object file, the interval group from which the required space should be taken;

(b) search the group for the best-fitting interval and, if one is found, use it to store the data;

(c) if no suitable interval is found, first allocate space for half the size of the object file, then return to step (a);

(d) lock the interval found, in preparation for allocating it to the object file;

(e) determine whether the size of the interval equals the size of the object file and, if so, allocate it directly;

(f) if the interval is larger than the object file, split it: the front part is allocated to the object file and the back part is placed into the interval group that matches its size.

In step (4) of the present invention, an in-memory buffer caches part of the data of the object file, and object node information is looked up as follows:

(a) search the cache for the object node information by object ID;

(b) on a cache hit, return the object node information;

(c) on a cache miss, check the request wait list; if a request for the node information of this object ID is already in it, join the wait queue;

(d) read the node information for this object ID from the disk and return the result to all waiting requesters;

(e) place the retrieved node information in the cache, using a least-recently-used (LRU) policy for cache replacement.
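The lookup steps (a) to (e) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the class name OnodeCache, the callback style, and the split into lookup and complete_read (which stands in for the asynchronous disk read so the example stays single-threaded) are all hypothetical.

```python
from collections import OrderedDict


class OnodeCache:
    """LRU cache of object node records with coalesced disk reads (sketch)."""

    def __init__(self, disk_table, capacity):
        self.disk_table = disk_table   # stand-in for the on-disk object node table
        self.capacity = capacity       # maximum number of cached nodes
        self.cache = OrderedDict()     # object_id -> node, least recently used first
        self.pending = {}              # object_id -> callbacks waiting for a disk read
        self.disk_reads = 0

    def lookup(self, object_id, callback):
        # (a)/(b): search the cache; on a hit, answer immediately.
        if object_id in self.cache:
            self.cache.move_to_end(object_id)
            callback(self.cache[object_id])
            return
        # (c): a read for this ID is already outstanding, so join its wait list.
        if object_id in self.pending:
            self.pending[object_id].append(callback)
            return
        # First miss for this ID: open a wait list; the disk read follows.
        self.pending[object_id] = [callback]

    def complete_read(self, object_id):
        # (d): read the node from disk once and answer every waiter.
        node = self.disk_table[object_id]
        self.disk_reads += 1
        # (e): insert into the cache and evict the least recently used entries.
        self.cache[object_id] = node
        self.cache.move_to_end(object_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        for waiter in self.pending.pop(object_id, []):
            waiter(node)
```

Two requests for the same uncached object ID therefore cost one disk read: the second request only joins the wait list, and both callbacks fire when complete_read runs.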

In step (5) of the present invention, the logical data block size of the disk is 1 Kbyte or 4 Kbyte.

The storage partition of the disk of the present invention is logically divided into a number of data blocks of 1024 bytes or 4096 bytes each; the data block size is set by the system administrator when the system is created, or is left to the file system creation program to decide.

An interval in the present invention is a run of contiguous data blocks on the disk, represented by a start position and a length.

Compared with the prior art, the present invention has the following beneficial effects: the object file system used by the method inherits the good properties of traditional general-purpose file systems, including stability, efficiency and high reliability; managing disk space by intervals ensures efficient writes for large objects, and journaling of object metadata ensures data consistency and high fault tolerance; space is allocated by interval, so objects are stored contiguously and space management is efficient, and the buffer mechanism ensures high-performance object access.

Description of drawings

Fig. 1 is a diagram of the distributed file system environment of the present invention.

Fig. 2 is a diagram comparing the object file system with a traditional file system.

Fig. 3 is a schematic diagram of the disk layout of the object file system.

Fig. 4 is the overall flowchart of the space allocator.

Fig. 5 is the flowchart of the disk space pre-allocation method.

Embodiment

In the object file system, file data is organized and managed as objects. Each object has a global object ID, and also carries attributes such as size, creation and modification times, access characteristics and association relationships.

The general-purpose Ext3 file system does not support large files well: the data blocks of a file may be scattered over non-contiguous disk space, and a very long file block list has to be maintained. The present invention improves on the Ext3 file system. Small objects (4 to 512 Kbyte) are stored in the original Ext3 manner, while large objects (over 512 Kbyte) are divided into intervals and placed in a few contiguous regions of the disk. This shortens the file block list and reduces head seek and rotational delay, ensuring efficient reads and writes of large objects.

The present invention calls a run of contiguous data blocks on the disk an interval, and represents an interval by a start position and a length. Disk space is no longer allocated in blocks of a fixed size; it is allocated in intervals. This replaces the original, very long list of data blocks: a small number of intervals is enough to describe all the space allocated to a file, and the allocated space is relatively contiguous, which makes disk space management more efficient.
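To make the saving concrete, the following Python sketch collapses a per-block list into (start, length) intervals, which file systems commonly call extents. The function name and the sample block numbers are hypothetical and chosen only for illustration.

```python
def blocks_to_extents(block_numbers):
    """Collapse a sorted list of block numbers into (start, length) pairs."""
    extents = []
    for block in block_numbers:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # The block directly follows the last interval: extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # A gap on disk: open a new interval.
            extents.append((block, 1))
    return extents


# A 1 Mbyte object in 1 Kbyte blocks, stored as two contiguous runs.
blocks = list(range(100, 612)) + list(range(2048, 2560))
print(len(blocks))                # 1024 entries in a per-block list
print(blocks_to_extents(blocks))  # [(100, 512), (2048, 512)]
```

The per-block list needs 1024 entries, while the interval form needs two.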

The specific steps of the method are steps (1) to (5) set out in the summary of the invention, and the pre-allocation procedure is steps (a) to (f) set out there.

In the present invention, the pre-allocation method is exposed through a corresponding API at the file system level. P2P software, multi-source download software and the like can use this interface to pre-create an empty file of the same size as the file to be downloaded. Sufficient storage space is thereby allocated in advance, which guarantees that the later download will not fail for lack of disk space; the space is also contiguous on the disk, so excessive fragmentation is avoided. Unlike the allocate-as-early-as-possible strategy of traditional file systems, the present invention uses delayed allocation when the object size is unknown: part of the data of the file object is first written to a buffer, and the buffered data is written to the disk only when the buffer fills up or the client requests release of the cached data. By then the buffered data is far larger than the logical data block size of the disk (1 Kbyte or 4 Kbyte), and the space allocator allocates storage intervals for it. This strategy optimizes data block allocation across the whole file system and significantly improves performance.

In step (4) above, an in-memory buffer caches part of the data of the object file, and object node information is looked up as follows:

(a) search the cache for the object node information by object ID;

(b) on a cache hit, return the object node information;

Referring to Fig. 1, the whole system consists of a metadata server cluster, an object storage server cluster and file object access clients. The metadata servers manage the metadata of objects, maintain the mapping from files to objects, and are responsible for user permissions and authentication. The object storage servers provide data read and write operations to clients through an object-based access interface, manage and organize the data on disk, and allocate disk space; the object file organization method of the present invention is a key component of the object storage server.

Referring to Fig. 2, the object file storage system improves on traditional file systems as follows: it moves the storage module out of kernel space to the bottom layer of the storage subsystem, which provides management of disk space and capacity, allocation of data blocks, and metadata caching.

Referring to Fig. 3, in this object file system the disk storage partition is first logically divided into data blocks of 1024 bytes or 4096 bytes each. The size is set by the system administrator when the system is created, or is left to the file system creation program, which selects a suitable value according to the size of the disk partition. The data blocks are in turn divided among several storage spaces; the number of blocks in each storage space is fixed when the file system is created.

In this object file system, the start of the disk defines the basic parameters of the file system, and the space needed for the basic functions of the file system is set aside there. To stay compatible with ordinary file systems, the space from the start of the disk partition (byte 1) to byte 1024 holds the disk header, which describes the whole object file system. This space is called the superblock and contains the file system signature, disk number, total space, used space, file system state, mount time, position and size of the journal file, number of objects stored on the disk, data block count, reserved data block count, free data block count, data block size, sequence number of the last operation, internal system flags, superblock version number and so on. For reliability two superblocks are kept; the IDs assigned to them in the object file system are 0 and 1 respectively.

The section description records the positions of the object bitmap, the data bitmap and the object node table by means of three position pointers. The object node table stores object node (Onode) information; an Onode is equivalent to the inode of a traditional file system. Each Onode stores the data block list of its object. The list has 15 entries. The first 12 pointers are direct data block pointers, and the blocks they point to hold the object data itself. If these 12 direct blocks cannot hold all of the data, the 13th pointer is used as an indirect data block pointer, and the block it points to holds nothing but data block pointers. The 14th and 15th pointers are double-indirect and triple-indirect data block pointers respectively. This hierarchy is enough to hold a sufficiently long object data list.
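As a worked example of how far the 15-entry list reaches, the Python sketch below computes the largest object that 12 direct pointers plus one single-, one double- and one triple-indirect pointer can address. The 4-byte pointer size is an assumption (the text does not state it), and the function names are hypothetical.

```python
def max_object_bytes(block_size, pointer_size=4, direct=12):
    """Largest object addressable through the 15-entry block list.

    pointer_size is an assumed value; the source does not specify it.
    """
    per_block = block_size // pointer_size   # pointers held by one indirect block
    blocks = direct + per_block + per_block ** 2 + per_block ** 3
    return blocks * block_size


def pointer_level(block_index, per_block, direct=12):
    """Return 0 for a direct block, or 1/2/3 for the indirection depth."""
    if block_index < direct:
        return 0
    block_index -= direct
    for level in (1, 2, 3):
        if block_index < per_block ** level:
            return level
        block_index -= per_block ** level
    raise ValueError("block index beyond the triple-indirect range")


for size in (1024, 4096):
    print(size, max_object_bytes(size))
```

Under these assumptions a 1024-byte block size addresses about 16 GiB per object and a 4096-byte block size about 4 TiB.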

The object lookup table is a hash table that maps an object ID to the Onode number of the object. Each object occupies one entry in the hash table, and the size of the hash table equals the number of objects stored on the disk. The interval table stores and manages the allocation state of intervals: each interval is saved into one of several groups according to its size, and the intervals within a group are sorted, which makes space allocation convenient. The namespace of the file system is not implemented by the object file system; the object file system does not resolve file path names, and that work is done by the metadata server.

Referring to Fig. 4, before allocating space the space allocator uses the pre-allocation method if it has obtained the size of the object. Otherwise it uses the delayed allocation strategy: data is first written to the buffer, and when the buffer is full or all the data has been written, space for the buffered data is allocated by the pre-allocation method.
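The decision described for Fig. 4 can be sketched in Python as below. This is a minimal illustration, not the patented implementation: ObjectWriter, its parameters, and the preallocate callback (assumed to take a size in bytes and return the allocated intervals) are hypothetical names, and the disk is modelled as a list of (intervals, data) pairs.

```python
class ObjectWriter:
    """Choose between pre-allocation and delayed allocation (sketch)."""

    def __init__(self, preallocate, buffer_capacity, known_size=None):
        self.preallocate = preallocate        # hypothetical: size in bytes -> intervals
        self.buffer_capacity = buffer_capacity
        self.buffer = bytearray()
        self.flushed = []                     # (intervals, data) pairs written to "disk"
        # Size known up front: reserve the space before any data arrives.
        self.reserved = preallocate(known_size) if known_size is not None else None

    def write(self, data):
        if self.reserved is not None:
            # Space is already reserved, so the data goes straight to disk.
            self.flushed.append((self.reserved, bytes(data)))
            return
        # Size unknown: delay allocation and collect the data in the buffer.
        self.buffer += data
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        """Called when the buffer is full or the client releases cached data."""
        if self.buffer:
            intervals = self.preallocate(len(self.buffer))
            self.flushed.append((intervals, bytes(self.buffer)))
            self.buffer.clear()
```

With delayed allocation the allocator is called once per flush with the accumulated size, instead of once per incoming write.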

Referring to Fig. 5, for disk space allocation the object file system provides disk read and write operations with the object as the interface. For a read, the object ID is passed in as a parameter; the Onode number of the object is first found by a hash table lookup, and the start position of the object on disk and the intervals it occupies are then looked up in the Onode table. Once the interval information is obtained, the object data can be read from the disk. For a write of a new object, storage space has to be allocated for the object.
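The read path (object ID, hash lookup, Onode, intervals, data) can be sketched in Python as follows. The data layout is an assumption made for illustration: the lookup table is a dict, each Onode is a dict with hypothetical "size" and "extents" fields, and the disk is a bytes object.

```python
def read_object(object_id, lookup_table, onode_table, disk, block_size):
    """Read an object's data by following its Onode's intervals (sketch)."""
    onode_number = lookup_table[object_id]   # hash lookup: object ID -> Onode number
    onode = onode_table[onode_number]        # Onode holds the size and interval list
    data = bytearray()
    for start, length in onode["extents"]:   # each interval is (start block, block count)
        begin = start * block_size
        data += disk[begin:begin + length * block_size]
    return bytes(data[:onode["size"]])       # trim padding in the last block


# Tiny example: 4-byte blocks, one object spread over two intervals.
disk = bytes(range(40))
lookup_table = {7: 0}
onode_table = [{"size": 10, "extents": [(1, 2), (5, 1)]}]
print(list(read_object(7, lookup_table, onode_table, disk, block_size=4)))
```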

For disk space management, the free intervals on the disk are divided into several groups by approximate size. The size of each group can be configured by the system administrator at start-up, or can be determined from the disk size when the file system is formatted. Within each group the intervals are sorted by size. To allocate an interval of size N, the following steps are carried out:

(1) determine, from the object size N, the interval group from which the required space should be taken;

(2) search the group for the best-fitting interval; because the intervals in a group are sorted, whether a suitable interval exists can be determined in O(log n) time. If a suitable interval is found in the group, go to step (3); if not, prepare to allocate space for half the object size and restart the procedure from the beginning;

(3) prepare to allocate the interval to the object, and lock it so that it cannot be allocated to another object;

(4) determine whether the size of the interval equals the object size; if so, allocate it directly and go to step (6);

(5) split the interval: the front part is allocated to the object, and the back part is placed into the interval group that matches its size;

(6) the space allocation is complete and the procedure ends.

Because the disk manages its remaining space in blocks of a fixed size, the interval obtained is generally larger than the object. The front part of the interval is chosen to hold the data, and the free space of the back part is returned to the free interval table.
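A single allocation attempt, covering the group lookup, the O(log n) search, the split and the return of the tail, can be sketched in Python as below. Several details are assumptions not fixed by the text: groups are formed by powers of two, larger groups are also searched when the matching group has no fit, and removing an interval from the free table stands in for locking it. All names are hypothetical.

```python
import bisect


class ExtentAllocator:
    """Free intervals grouped by size and sorted inside each group (sketch)."""

    def __init__(self, free_extents):
        self.groups = {}   # group index -> sorted list of (length, start)
        for start, length in free_extents:
            self._release(start, length)

    @staticmethod
    def _group_of(length):
        # Assumed grouping: group k holds lengths in [2**k, 2**(k + 1)).
        return length.bit_length() - 1

    def _release(self, start, length):
        group = self.groups.setdefault(self._group_of(length), [])
        bisect.insort(group, (length, start))

    def allocate(self, n):
        """Return (start, n) for n contiguous blocks, or None if nothing fits."""
        if n <= 0:
            raise ValueError("n must be positive")
        for g in sorted(k for k in self.groups if k >= self._group_of(n)):
            group = self.groups[g]
            # Binary search for the smallest interval whose length is >= n.
            i = bisect.bisect_left(group, (n, 0))
            if i == len(group):
                continue
            length, start = group.pop(i)   # taking it out of the table "locks" it
            if length > n:
                # Split: the front part is used, the tail is regrouped by size.
                self._release(start + n, length - n)
            return (start, n)
        return None
```

For example, with free intervals (0, 100), (200, 16) and (300, 20), a request for 18 blocks takes the front of the 20-block interval and files the remaining 2 blocks under the group for small intervals.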

If the space required for a large object cannot be allocated in one piece, the allocator first tries to allocate half of it and, if that also fails, a quarter, and so on. After repeated allocations the object ends up with storage space made up of several segments. Although the segments are not contiguous with one another, each is larger than a disk data block, so the block allocation list of a traditional file system is shortened.
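The halving fallback can be sketched in Python as follows. To keep the example short it uses a plain list of free (start, length) intervals with a linear best-fit search rather than sorted groups; the function names and the rollback on failure are assumptions added for illustration. Sizes are counted in blocks.

```python
def take(free, n):
    """Best-fit: carve n blocks from the smallest free interval that fits."""
    candidates = [extent for extent in free if extent[1] >= n]
    if not candidates:
        return None
    start, length = min(candidates, key=lambda extent: extent[1])
    free.remove((start, length))
    if length > n:
        free.append((start + n, length - n))   # return the unused tail
    return (start, n)


def allocate_segments(free, size):
    """Allocate size blocks, halving the request whenever no interval fits."""
    segments, remaining, chunk = [], size, size
    while remaining > 0 and chunk > 0:
        got = take(free, min(chunk, remaining))
        if got is None:
            chunk //= 2          # try half, then a quarter, and so on
        else:
            segments.append(got)
            remaining -= got[1]
    if remaining > 0:
        free.extend(segments)    # assumed behaviour: roll back on failure
        return None
    return segments


free = [(0, 40), (100, 30), (200, 30)]
print(allocate_segments(free, 80))
```

Here no single free interval holds 80 blocks, so the object is placed in one 40-block segment and two 20-block segments.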

After this object file system is mounted, the object bitmap and the data bitmap are loaded into memory. The object node table is too large to be loaded, so a cache is set up for it. To obtain the Onode for an object ID, the cache is queried first; on a miss the Onode has to be read from the disk. Before the disk is read, the Onode request wait list is checked for an existing request for the same Onode; if one is found, the new request joins the wait list, and all waiters are notified once the Onode has been read from the disk. The Onode is placed in the cache under the LRU policy so that later accesses hit as often as possible.

An object buffer holds part of the accessed data. Because the buffer is limited in size, some data may be evicted by the LRU policy, so when the data of an object is read, part of it may be in the buffer and part may be missing. The offset and length of the requested object data are compared with the data in the buffer to find the offset and length of the missing data, and only the missing data is requested from the disk, which effectively reduces the number of disk I/O operations.
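The comparison of a request against the buffered ranges can be sketched in Python as below. The function name and the representation of the buffer contents as a list of (offset, length) pairs are assumptions for illustration.

```python
def missing_ranges(offset, length, cached):
    """Return the (offset, length) gaps of a request not covered by the buffer.

    cached is a list of (offset, length) ranges currently held in the buffer.
    """
    gaps = []
    pos, end = offset, offset + length
    for c_off, c_len in sorted(cached):
        c_end = c_off + c_len
        if c_end <= pos:
            continue                 # cached range lies entirely before the cursor
        if c_off >= end:
            break                    # cached range lies entirely after the request
        if c_off > pos:
            gaps.append((pos, c_off - pos))   # uncovered stretch before this range
        pos = max(pos, c_end)
        if pos >= end:
            break
    if pos < end:
        gaps.append((pos, end - pos))         # uncovered tail of the request
    return gaps


# Request bytes 0..99 while the buffer holds 10..29 and 50..59.
print(missing_ranges(0, 100, [(10, 20), (50, 10)]))  # [(0, 10), (30, 20), (60, 40)]
```

Only the three gaps are requested from the disk; the two buffered ranges are served from memory.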

Claims

1. An object file organization method based on an object storage device, characterized in that the specific steps are:

(5) detect whether the buffer is full or whether the client has requested release of the cached data, then allocate space on the disk for the data in the buffer and write the data to the disk; when the data in the buffer is far larger than the logical data block size of the disk, the space allocator allocates storage intervals for it;

The pre-allocation procedure is:

2. The object file organization method based on an object storage device according to claim 1, characterized in that: in step (4), an in-memory buffer caches part of the data of the object file, and object node information is looked up as follows:

(a) search the cache for the object node information by object ID;

(b) on a cache hit, return the object node information;

3. The object file organization method based on an object storage device according to claim 1, characterized in that: in step (5), the logical data block size of the disk is 1 Kbyte or 4 Kbyte.

4. The object file organization method based on an object storage device according to claim 1, characterized in that: the storage partition of the disk is logically divided into a number of data blocks of 1024 bytes or 4096 bytes each; the data block size is set by the system administrator when the system is created, or is left to the file system creation program to decide.

5. The object file organization method based on an object storage device according to claim 1, characterized in that: the interval is a run of contiguous data blocks on the disk, represented by a start position and a length.