CN104375782A - Read-write solution for tens of millions of small file data - Google Patents

Read-write solution for tens of millions of small file data

Info

Publication number
CN104375782A
Authority
CN
China
Prior art keywords
file
data
small file
disk
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410560613.0A
Other languages
Chinese (zh)
Inventor
张砚波
吴丙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201410560613.0A
Publication of CN104375782A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a read-write solution for tens of millions of small file data. When small files are stored, the solution allocates large, contiguous regions of disk space to hold them; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is that each small file may only be stored within a single block and may not span two blocks, each folder owns one or more blocks that hold only the data of that folder, and the data of each file is stored on contiguous disk space. Compared with the prior art, logically contiguous data are stored on contiguous space of the physical disks as far as possible, a cache plays the role of the metadata server, and cache utilization is improved through simplified file information nodes, thereby improving small-file access performance.

Description

A read-write solution for tens of millions of small file data
Technical field
The present invention relates to the field of computer application technology, and in particular to a read-write solution for tens of millions of small file data.
Background technology
At the present stage, reading and writing small files is the most common pattern of data access in the storage field. Unlike large files, whose striping into slices improves the concurrency of user access to files, small files (≤64KB) are not suited to striping; the traditional method is therefore to store each single file on an individual data server. However, once the number of small files reaches a certain scale, repeated access to large quantities of small files places a performance burden on the data server and creates an I/O bottleneck. Because the information on the Internet is presented mainly in the form of high-frequency small files, and the reads and stores of ordinary users involve mostly small files, research on the read/write performance of high-frequency small files on the Internet has important practical significance.
At the present stage, traditional approaches to managing and operating on tens of millions of small files mainly suffer from problems in the following three aspects:
(1) small files are accessed frequently and require repeated disk accesses, so disk I/O performance is low;
(2) because the files are small, file fragmentation forms easily and disk space is wasted;
(3) establishing a connection for each small-file request easily introduces network delay, which lowers the read rate of small files.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a read-write solution for tens of millions of small file data that reduces the transmission delay of small files and improves small-file read and storage performance through a file transmission method that sends files with high access frequency in advance, in batches.
The technical scheme of the present invention is realized in the following manner; the method is as follows:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the present invention allocates large, contiguous regions of disk space to hold a large number of small files; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible, and the data of the same file, or the files under the same folder, are stored on contiguous disk array blocks as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is: each small file may only be stored within a single block and may not span two blocks; each folder owns one or more blocks, and these blocks hold only the data of that folder; and the data of each file is stored on contiguous disk space.
The data structure for storing small files on the disk array is as follows:
This patent designs a simplified inode (file information node) attribute structure suited to small files. At the same time, the attribute information of the file information node is kept on the metadata server, so only the disk location information of a file is needed to access it. The inode data structure is therefore simplified to retain only a file's disk location information and a small number of data members, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
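As a hedged illustration, a minimal C sketch of such a simplified file information node might look like the following; the concrete field types and widths are assumptions, since the text only specifies the member names and their meanings.

```c
#include <stdint.h>
#include <pthread.h>

/* Minimal sketch of the simplified file information node described above.
 * Only the members named in the text are kept; the types and widths are
 * illustrative assumptions. */
typedef struct FileNode {
    uint64_t        File_id;        /* file identifier */
    uint32_t        StartPosition;  /* start position of the file within its 64 KB block */
    uint32_t        Long;           /* length of the file in bytes (at most 64 KB) */
    uint32_t        Weight;         /* file weight: access frequency of the file */
    uint64_t        Block_id;       /* identifier of the block in which the file is stored */
    uint64_t        Count;          /* access counter of the file */
    pthread_mutex_t Lock;           /* file lock */
} FileNode;
```

Keeping the node down to a handful of fixed-size members is what allows a large number of such nodes to fit in the cache used by the read design below.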
The read operation for small files is designed as follows:
The latency of reading and writing small files is spent mainly on seeking and positioning the disk head; once the head is positioned, the time needed to read one data block differs little from the time needed to read several consecutive data blocks. Therefore, combined with the optimized data storage structure described above, this solution adopts a read-ahead approach in which the files in the same block are read out together, reducing the number of disk I/O operations. To address the poor overall I/O performance caused by frequently accessing the disks on the metadata server, the method uses a cache to play the role of the metadata server: the cache holds the file information nodes, and the simplified inode data structure lets each file information node keep only the file's disk location information plus a small amount of other useful information, which raises cache utilization and allows the cache to hold a large number of file information nodes. In this way, the number of disk accesses and the overhead of reading file information nodes are reduced, and I/O performance is improved.
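To illustrate the read-ahead and cache idea, the sketch below is an assumption-laden illustration that reuses the FileNode structure sketched earlier: it caches whole 64 KB blocks keyed by Block_id and relies on a hypothetical helper read_block() that fetches one block with a single disk I/O.

```c
#include <stdint.h>

#define BLOCK_SIZE  (64 * 1024)
#define CACHE_SLOTS 1024                 /* assumed cache capacity */

/* One cached 64 KB block: after a single disk I/O, every small file
 * stored in this block can be served from memory. */
typedef struct {
    uint64_t block_id;
    int      valid;
    char     data[BLOCK_SIZE];
} CachedBlock;

static CachedBlock block_cache[CACHE_SLOTS];

/* Hypothetical helper that issues one disk I/O for a whole 64 KB block. */
extern int read_block(uint64_t block_id, char *buf);

/* Read-ahead: locate the file's block in the cache, loading it from disk
 * at most once per block rather than once per small file. */
const char *read_small_file(const FileNode *n)
{
    CachedBlock *slot = &block_cache[n->Block_id % CACHE_SLOTS];
    if (!slot->valid || slot->block_id != n->Block_id) {
        if (read_block(n->Block_id, slot->data) != 0)
            return NULL;                         /* disk error */
        slot->block_id = n->Block_id;
        slot->valid = 1;
    }
    return slot->data + n->StartPosition;        /* n->Long bytes are valid here */
}
```

Because every file in a block is brought into memory by the first access, later requests for neighbouring files in the same block (or the same folder) hit the cache without touching the disk.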
The advantages of the present invention are:
Compared with the prior art, the read-write solution for tens of millions of small file data of the present invention stores logically contiguous data on contiguous space of the physical disks as far as possible, uses a cache to play the role of the metadata server, and raises cache utilization through simplified file information nodes, thereby improving small-file access performance. Updated data and the related data in the same folder are aggregated into a single I/O write request, which reduces the amount of file fragmentation; for reads, the small files with high access frequency are sent in advance, in batches, which reduces frequent I/O operations and improves file transfer performance.
Embodiment
The read-write solution for tens of millions of small file data of the present invention is described in detail below.
In the read-write solution for tens of millions of small file data of the present invention, the method is as follows:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the present invention allocates large, contiguous regions of disk space to hold a large number of small files; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible, and the data of the same file, or the files under the same folder, are stored on contiguous disk array blocks as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is: each small file may only be stored within a single block and may not span two blocks; each folder owns one or more blocks, and these blocks hold only the data of that folder; and the data of each file is stored on contiguous disk space.
The data structure for storing small files on the disk array is as follows:
This patent designs a simplified inode (file information node) attribute structure suited to small files. At the same time, the attribute information of the file information node is kept on the metadata server, so only the disk location information of a file is needed to access it. The inode data structure is therefore simplified to retain only a file's disk location information and a small number of data members, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
The read operation for small files is designed as follows:
The latency of reading and writing small files is spent mainly on seeking and positioning the disk head; once the head is positioned, the time needed to read one data block differs little from the time needed to read several consecutive data blocks. Therefore, combined with the optimized data storage structure described above, this solution adopts a read-ahead approach in which the files in the same block are read out together, reducing the number of disk I/O operations. To address the poor overall I/O performance caused by frequently accessing the disks on the metadata server, the method uses a cache to play the role of the metadata server: the cache holds the file information nodes, and the simplified inode data structure lets each file information node keep only the file's disk location information plus a small amount of other useful information, which raises cache utilization and allows the cache to hold a large number of file information nodes. In this way, the number of disk accesses and the overhead of reading file information nodes are reduced, and I/O performance is improved.
The present invention stores a large number of small files by allocating large, contiguous regions of disk space when the small files are stored. The disk space is first divided into multiple blocks, each 64 KB in size, and a large contiguous region of disk space is formed by a series of these blocks. When the files encountered are small, each small file may only be stored within a single block and may not span two blocks, and the data of each file occupies contiguous disk space. For example, A1, A2, A3, A4 and A5 are five files stored contiguously one after another, such as A1 next to A2 and A3 next to A4; the sectors marked red are the fragments of the block. When a file smaller than such a fragment occurs, it should preferentially be stored in the fragment. To raise the hit rate of the read-ahead data, the storage layout designed in the present invention stores logically contiguous data on contiguous space of the physical disks as far as possible, and stores the data of the same file, or the files under the same folder, on contiguous blocks of disk space as far as possible; each folder owns one or more blocks, and these blocks hold only the files of that folder.
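As an assumed illustration of these placement rules (the BlockInfo bookkeeping structure and the function name are not taken from the patent), the C sketch below places a small file preferentially into a fragment that can hold it, otherwise appends it to a block with enough free space, and never splits a file across two blocks.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)

/* Per-block allocation state for one folder's blocks; these bookkeeping
 * fields are assumptions made for illustration. */
typedef struct {
    uint64_t block_id;
    uint32_t used;       /* bytes already allocated from the front of the block */
    uint32_t frag_off;   /* offset of one reclaimed fragment, 0 if none */
    uint32_t frag_len;   /* length of that fragment in bytes */
} BlockInfo;

/* Choose where to place a small file of `size` bytes inside the blocks
 * owned by its folder. Returns 0 on success, -1 if the file is not a
 * small file, and 1 if a new 64 KB block must be allocated first. */
int place_file(BlockInfo *blocks, size_t nblocks, uint32_t size,
               uint64_t *out_block, uint32_t *out_off)
{
    if (size == 0 || size > BLOCK_SIZE)
        return -1;                             /* not a small file */

    /* First pass: prefer a fragment that is large enough. */
    for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i].frag_len >= size) {
            *out_block = blocks[i].block_id;
            *out_off   = blocks[i].frag_off;
            blocks[i].frag_off += size;
            blocks[i].frag_len -= size;
            return 0;
        }
    }

    /* Second pass: append to the first block with enough free space,
     * so the file never crosses a block boundary. */
    for (size_t i = 0; i < nblocks; i++) {
        if (BLOCK_SIZE - blocks[i].used >= size) {
            *out_block = blocks[i].block_id;
            *out_off   = blocks[i].used;
            blocks[i].used += size;
            return 0;
        }
    }
    return 1;                                  /* folder needs another block */
}
```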
In the data storage structure of this system, the attribute information of the file information node is kept on the metadata server, and on the I/O server only the disk location information of a file is needed to access it. The I/O server therefore only needs to record the disk location information of a file and does not need to record its other attributes, such as creation time, last access time and owning user. On this basis, the inode data structure is simplified to retain only a file's disk location information and a small number of data members, yielding the simplified inode attribute structure, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
To handle the data access-frequency aspect of this system, a global variable NodeList is first designed. NodeList is a linked list of inodes sorted by the access frequency of the files; it is designed to optimize file transfer and serves the whole small-file read and storage solution. For each folder, NodeList maintains a list of the folder's files sorted by access frequency. When a user accesses a file under that folder, the system automatically sends along the files in the list with high access frequency. To avoid sending too many files, a threshold on access frequency is set, and all files whose access frequency exceeds the threshold are divided, in order, into groups; each group may contain multiple files, and the total size of the files in a group does not exceed 64 KB. When the user requests one file in the current folder, the system sends the whole group together, in order, thereby reducing file transfer latency.
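As a sketch of this grouping under stated assumptions (the ListNode structure and function name are illustrative, and the per-folder list is assumed to be already sorted by descending access frequency), the code below partitions the files above a frequency threshold into consecutive groups whose combined size stays within 64 KB.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)

/* One entry of the hypothetical NodeList for a folder, kept sorted by
 * descending access frequency (weight). */
typedef struct ListNode {
    uint64_t         file_id;
    uint32_t         length;   /* file size in bytes */
    uint32_t         weight;   /* access frequency of the file */
    int              group;    /* prefetch group assigned below, -1 if none */
    struct ListNode *next;
} ListNode;

/* Divide the files whose access frequency reaches `threshold` into
 * consecutive groups whose total size does not exceed 64 KB, so that a
 * whole group can be sent along with the file the user actually requests. */
void assign_groups(ListNode *head, uint32_t threshold)
{
    int group = 0;
    uint32_t group_bytes = 0;

    for (ListNode *n = head; n != NULL; n = n->next) {
        if (n->weight < threshold) {
            n->group = -1;                 /* below threshold: not prefetched */
            continue;
        }
        if (group_bytes + n->length > BLOCK_SIZE) {
            group++;                       /* start a new group within 64 KB */
            group_bytes = 0;
        }
        n->group = group;
        group_bytes += n->length;
    }
}
```

When a user then requests one file of the folder, the server would transmit every file carrying the same group number in a single batch, which is the prefetch behaviour described above.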
The read-write solution for tens of millions of small file data of the present invention is simple and convenient to implement and can be carried out according to this specification.
Technical features not described in this specification are known to those skilled in the art.

Claims (1)

1. A read-write solution for tens of millions of small file data, characterized in that:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the present invention allocates large, contiguous regions of disk space to hold a large number of small files; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible, and the data of the same file, or the files under the same folder, are stored on contiguous disk array blocks as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is: each small file may only be stored within a single block and may not span two blocks; each folder owns one or more blocks, and these blocks hold only the data of that folder; and the data of each file is stored on contiguous disk space.
The data structure for storing small files on the disk array is as follows:
This patent designs a simplified inode (file information node) attribute structure suited to small files. At the same time, the attribute information of the file information node is kept on the metadata server, so only the disk location information of a file is needed to access it. The inode data structure is therefore simplified to retain only a file's disk location information and a small number of data members, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
The read operation for small files is designed as follows:
The latency of reading and writing small files is spent mainly on seeking and positioning the disk head; once the head is positioned, the time needed to read one data block differs little from the time needed to read several consecutive data blocks. Therefore, combined with the optimized data storage structure described above, this solution adopts a read-ahead approach in which the files in the same block are read out together, reducing the number of disk I/O operations. To address the poor overall I/O performance caused by frequently accessing the disks on the metadata server, the method uses a cache to play the role of the metadata server: the cache holds the file information nodes, and the simplified inode data structure lets each file information node keep only the file's disk location information plus a small amount of other useful information, which raises cache utilization and allows the cache to hold a large number of file information nodes. In this way, the number of disk accesses and the overhead of reading file information nodes are reduced, and I/O performance is improved.
CN201410560613.0A 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data Pending CN104375782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410560613.0A CN104375782A (en) 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410560613.0A CN104375782A (en) 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data

Publications (1)

Publication Number Publication Date
CN104375782A true CN104375782A (en) 2015-02-25

Family

ID=52554739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410560613.0A Pending CN104375782A (en) 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data

Country Status (1)

Country Link
CN (1) CN104375782A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574076A (en) * 2015-11-27 2016-05-11 湖南大学 Key value pair storage structure based on Bloom Filter and method
CN105574076B (en) * 2015-11-27 2019-02-12 湖南大学 A kind of key-value pair storage organization and method based on Bloom Filter
CN107066505A (en) * 2017-01-10 2017-08-18 郑州云海信息技术有限公司 The system and method that a kind of small documents storage of performance optimization is accessed
CN107193492A (en) * 2017-05-18 2017-09-22 郑州云海信息技术有限公司 The method and device that a kind of small documents update
CN107391423A (en) * 2017-07-26 2017-11-24 Tcl移动通信科技(宁波)有限公司 Method, storage medium and the mobile terminal of file are transmitted by OTG functions
CN109086006A (en) * 2018-07-24 2018-12-25 浪潮电子信息产业股份有限公司 A kind of method and relevant apparatus of reading data
CN109086006B (en) * 2018-07-24 2021-10-15 浪潮电子信息产业股份有限公司 Data reading method and related device

Similar Documents

Publication Publication Date Title
CN103176754A (en) Reading and storing method for massive amounts of small files
CN102332029B (en) Hadoop-based mass classifiable small file association storage method
US9792344B2 (en) Asynchronous namespace maintenance
US9836514B2 (en) Cache based key-value store mapping and replication
US9311252B2 (en) Hierarchical storage for LSM-based NoSQL stores
CN104978362B (en) Data migration method, device and the meta data server of distributed file system
CN104375782A (en) Read-write solution for tens of millions of small file data
CN104391961A (en) Tens of millions of small file data read and write solution strategy
US10210188B2 (en) Multi-tiered data storage in a deduplication system
CN102819586B (en) A kind of URL sorting technique based on high-speed cache and equipment
CN102323958A (en) Data de-duplication method
CN103139300A (en) Virtual machine image management optimization method based on data de-duplication
CN104133882A (en) HDFS (Hadoop Distributed File System)-based old file processing method
WO2010062554A2 (en) Index compression in databases
CN104462389A (en) Method for implementing distributed file systems on basis of hierarchical storage
CN102129472A (en) Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN106066818B (en) A kind of data layout method improving data de-duplication standby system restorability
CN103514210A (en) Method and device for processing small files
CN102915340A (en) Expanded B+ tree-based object file system
CN111159176A (en) Method and system for storing and reading mass stream data
CN105630810A (en) Method for uploading mass small files in distributed storage system
CN102737068A (en) Method and equipment for performing cache management on retrieval data
CN107066505A (en) The system and method that a kind of small documents storage of performance optimization is accessed
US10712943B2 (en) Database memory monitoring and defragmentation of database indexes
CN103942301A (en) Distributed file system oriented to access and application of multiple data types

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20150225)