CN104375782A - Read-write solution for tens of millions of small file data - Google Patents
- Publication number
- CN104375782A (application CN201410560613.0A)
- Authority
- CN
- China
- Prior art keywords
- file
- data
- small documents
- disk
- read
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0629—Configuration or reconfiguration of storage systems
- G06F3/0631—Configuration or reconfiguration of storage systems by allocating resources to storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a read-write solution for tens of millions of small files. When storing small files, the solution allocates large contiguous regions of disk space to hold them, so that logically contiguous data is stored in contiguous space on the disk array as far as possible. The disk space is divided into blocks of 64 KB each. The basic idea is that each small file is stored within a single block and never spans two blocks; each folder is given one or more blocks, which hold only that folder's data; and each file's data is stored in contiguous disk space. Compared with the prior art, logically contiguous data is stored in contiguous space on each physical disk as far as possible, a cache takes on the role of the metadata server, and simplified file information nodes improve cache utilization, which together improve small-file access performance.
Description
Technical field
The present invention relates to the field of computer application technology, and specifically to a read-write solution for tens of millions of small files.
Background
At present, reading and storing small files is the most common form of data access in the storage field. Striping slices large files and thereby improves the concurrency of user access, but small files (≤ 64 KB) are not suited to striping. The traditional approach is to store each file whole on a single data server. Once the number of small files reaches a certain level, however, repeated access to large numbers of small files burdens the data server and creates an I/O bottleneck. Most data on the Internet takes the form of frequently accessed small files, and reads and writes of small files account for most of what ordinary users read and store. Research on the read/write performance of frequently accessed small files on the Internet therefore has important practical significance.
At present, traditional approaches to managing and operating on tens of millions of small files have three main problems:
(1) Small files are accessed frequently and each access goes to disk, so disk I/O performance is low;
(2) Because the files are small, file fragments form easily and waste disk space;
(3) Establishing a connection for each small-file request easily introduces network latency, which reduces the read rate of small files.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and provide a read-write solution for tens of millions of small files. By sending frequently accessed files in batches in advance, the solution reduces the transfer latency of small files and improves their read and storage performance.
The technical scheme of the invention is realized as follows:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the invention allocates large contiguous regions of disk space to hold large numbers of small files. That is, logically contiguous data is stored in contiguous space on the disk array as far as possible: the data of a single file, or the files under the same folder, are stored on contiguous disk-array blocks as far as possible. The disk space is divided into multiple blocks of 64 KB each. The basic idea is that each small file is stored within a single block and never spans two blocks; each folder has one or more blocks, which hold only that folder's data; and each file's data is stored in contiguous disk space.
The data structure used to store small files on the disk array is as follows:
This patent designs a simplified Node (file information node) that holds only the attributes small files need. The attribute information of the file information node is kept on the metadata server, so accessing a file requires only its disk space information. The Node data structure is therefore simplified to retain only the file's disk space information and a small number of related data members: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, which in this method represents the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; Lock is the file lock.
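As an illustration only, the simplified Node could be written as the following Python sketch. The field names follow the members listed above; the Python types, the boundary check, and the `disk_offset` helper are assumptions made for this example (in particular, `disk_offset` assumes blocks are numbered from 0 and laid out back to back, which the patent does not state).

```python
from dataclasses import dataclass, field
import threading

BLOCK_SIZE = 64 * 1024  # 64 KB block size, as stated in the text


@dataclass
class Node:
    """Simplified file information node: disk location plus a few counters."""
    file_id: int            # File_id: file identifier
    block_id: int           # Block_id: identifier of the block holding the file
    start_position: int     # StartPosition: offset of the file inside its block
    length: int             # Long: file length in bytes
    weight: float = 0.0     # Weight: access frequency of the file
    count: int = 0          # Count: access counter
    lock: threading.Lock = field(          # Lock: file lock
        default_factory=threading.Lock, repr=False, compare=False)

    def __post_init__(self):
        # A small file must fit entirely inside one block.
        if self.start_position < 0 or self.length < 0:
            raise ValueError("offset and length must be non-negative")
        if self.start_position + self.length > BLOCK_SIZE:
            raise ValueError("file would cross a block boundary")

    def disk_offset(self) -> int:
        # Hypothetical helper: absolute byte offset, assuming block i
        # starts at byte i * BLOCK_SIZE.
        return self.block_id * BLOCK_SIZE + self.start_position
```

Because the node carries no timestamps, owner, or permissions, each entry stays small, which is what allows a cache to hold many of them.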
The read operation for small files is designed as follows:
The latency of reading and writing a small file is spent mainly on seeking and positioning the disk head. Once the head is positioned, reading one data block takes nearly as long as reading several contiguous data blocks. Building on the optimized storage structure described above, this solution therefore uses read-ahead: the files in the same block are read out together, which reduces the number of disk I/O operations. Frequent access to the disk on the metadata server also degrades the I/O performance of the whole system. To address this, the method uses a cache in the role of the metadata server. The cache holds the file information nodes, and the simplified Node data structure means each node retains only the file's disk space information and a small amount of other useful information. This improves cache utilization and lets the cache hold a large number of file information nodes. In this way, the number of disk accesses and the cost of reading file information nodes are reduced, and I/O performance improves.
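The read path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `ReadAheadStore`, the in-memory dictionaries, and the `disk_reads` counter are all hypothetical, the data cache is unbounded, and the "disk" is any seekable binary file object in which block i starts at byte i * 64 KB.

```python
BLOCK_SIZE = 64 * 1024  # 64 KB block size, as stated in the text


class ReadAheadStore:
    """Reads a whole block on a miss and keeps every file found in it."""

    def __init__(self, disk, nodes):
        self.disk = disk               # seekable binary file object (assumed)
        # In-memory node cache: file_id -> (block_id, start, length).
        self.node_cache = dict(nodes)
        self.data_cache = {}           # file_id -> bytes, filled by read-ahead
        self.disk_reads = 0            # number of block reads issued

    def read(self, file_id):
        if file_id in self.data_cache:
            return self.data_cache[file_id]      # served without disk I/O
        block_id, _, _ = self.node_cache[file_id]
        # One seek and one read fetch the whole 64 KB block.
        self.disk.seek(block_id * BLOCK_SIZE)
        block = self.disk.read(BLOCK_SIZE)
        self.disk_reads += 1
        # Read-ahead: keep every file stored in the same block.
        # (A linear scan is used for brevity; an index by block would be
        # the obvious refinement.)
        for fid, (bid, start, length) in self.node_cache.items():
            if bid == block_id:
                self.data_cache[fid] = block[start:start + length]
        return self.data_cache[file_id]
```

With two files stored in block 0, reading the first issues one block read and the second is then served from memory, so `disk_reads` stays at 1 until a file in another block is requested.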
The advantages of the invention are:
Compared with the prior art, the read-write solution for tens of millions of small files of the present invention stores logically contiguous data in contiguous space on the physical disk as far as possible, uses a cache in the role of the metadata server, and improves cache utilization through simplified file information nodes, thereby improving small-file access performance. On write, updated data and the related data in its file domain are aggregated into a single I/O request, which reduces the number of file fragments. On read, small files with a high access rate are sent in batches in advance, which reduces frequent I/O operations and improves file transfer performance.
Embodiment
The read-write solution for tens of millions of small files of the present invention is described in detail below.
The solution works as follows:
The invention stores large numbers of small files by allocating large contiguous regions of disk space. The disk space is first divided into multiple blocks of 64 KB each, and a series of these blocks forms one large contiguous region of disk space. A small file is stored within a single block and never spans two blocks, and each file's data occupies contiguous disk space. For example, A1, A2, A3, A4 and A5 are five files; files are stored back to back, as with A1 and A2 or A3 and A4. The red portion is the fragment of a block, that is, its leftover free space. When a file smaller than such a fragment arrives, it should preferentially be stored in the fragment.
To improve the hit rate of read-ahead, the storage layout keeps logically contiguous data in contiguous space on the physical disk as far as possible: the data of a single file, or the files under the same folder, are stored on contiguous disk blocks as far as possible. Each folder has one or more blocks, and these blocks hold only the files of that folder.
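The placement rule described above (no file spans two blocks, fragments are filled first, and blocks belong to a single folder) can be sketched as follows. This is a hedged illustration: the class name `FolderAllocator` is hypothetical, first-fit is an assumed policy since the text only says fragments are preferred, and only the free space at the end of each block is tracked (deletions are not modelled).

```python
BLOCK_SIZE = 64 * 1024  # 64 KB block size, as stated in the text


class FolderAllocator:
    """Places the files of one folder into blocks owned by that folder."""

    def __init__(self):
        # used[i] = bytes already occupied in the folder's i-th block
        self.used = []

    def place(self, size):
        """Return (block_index, start_offset) for a file of `size` bytes."""
        if not 0 < size <= BLOCK_SIZE:
            raise ValueError("a small file must fit in a single 64 KB block")
        # Prefer an existing block whose trailing fragment is big enough
        # (first fit), so leftover space is reused before a new block opens.
        for index, used in enumerate(self.used):
            if BLOCK_SIZE - used >= size:
                self.used[index] = used + size
                return index, used
        # No fragment fits: open a new block for this folder.
        self.used.append(size)
        return len(self.used) - 1, 0
```

For instance, after placing a 40 KB file and then a 30 KB file (which opens a second block), a 20 KB file lands in the 24 KB fragment left at the end of the first block rather than in a new block.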
In the data storage structure of this system, the attribute information of the file information node is kept on the metadata server. On the I/O server, a file can be accessed knowing only its disk space information, so the I/O server records only that information and not other attributes such as creation time, last access time and owner. On this basis, the Node data structure is simplified to retain only the file's disk space information and a small number of related data members. In the resulting simplified Node, File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, which in this method represents the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; Lock is the file lock.
To handle access frequency in this system, a global variable NodeList is first designed. NodeList is a linked list of Nodes sorted by file access frequency. It is designed to optimize file transfer and serves the whole small-file read and storage solution. A NodeList is kept for each folder and forms a list of the folder's files sorted by access frequency. When a user accesses a file under the folder, the system automatically sends the frequently accessed files in the list along with it. To avoid sending too many files, a threshold for high access frequency is set, and all files whose access frequency exceeds the threshold are divided, in order, into multiple groups. Each group may contain multiple files, and the total size of the files in a group does not exceed 64 KB. When the user requests a file in the current folder, the system sends the files of one group together, in order, which reduces file transfer latency.
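The grouping step can be illustrated with a short sketch. The function name `build_groups`, the tuple layout of its input, and the greedy packing are assumptions made for this example; the text specifies only the sort order, the threshold, and the 64 KB cap per group, and it does not say which group is sent for a given request.

```python
GROUP_LIMIT = 64 * 1024  # total size cap per group, as stated in the text


def build_groups(files, threshold):
    """Group a folder's hot files for batched sending.

    files: iterable of (file_id, size_bytes, access_count) tuples.
    Returns a list of groups, each a list of file ids, ordered from the
    most to the least frequently accessed.
    """
    # Keep only files above the threshold, most frequently accessed first.
    hot = sorted((f for f in files if f[2] > threshold),
                 key=lambda f: f[2], reverse=True)
    groups, current, current_size = [], [], 0
    for file_id, size, _ in hot:
        # Close the current group when the next file would exceed the cap.
        if current and current_size + size > GROUP_LIMIT:
            groups.append(current)
            current, current_size = [], 0
        current.append(file_id)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Packing in frequency order keeps the hottest files together in the first group, at the cost of sometimes leaving a group partly empty; a tighter bin-packing would break the ordering the text asks for.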
The read-write solution for tens of millions of small files of the present invention is simple and convenient to implement and can be realized as described in the specification.
Apart from the technical features described in the specification, everything else is known to those skilled in the art.
Claims (1)
1. A read-write solution for tens of millions of small files, characterized in that:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the invention allocates large contiguous regions of disk space to hold large numbers of small files. That is, logically contiguous data is stored in contiguous space on the disk array as far as possible: the data of a single file, or the files under the same folder, are stored on contiguous disk-array blocks as far as possible. The disk space is divided into multiple blocks of 64 KB each. The basic idea is that each small file is stored within a single block and never spans two blocks; each folder has one or more blocks, which hold only that folder's data; and each file's data is stored in contiguous disk space;
The data structure used to store small files on the disk array is as follows:
A simplified Node (file information node) is designed that holds only the attributes small files need. The attribute information of the file information node is kept on the metadata server, so accessing a file requires only its disk space information. The Node data structure is simplified to retain only the file's disk space information and a small number of related data members, wherein File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, which in this method represents the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; Lock is the file lock;
The read operation for small files is designed as follows:
The latency of reading and writing a small file is spent mainly on seeking and positioning the disk head. Once the head is positioned, reading one data block takes nearly as long as reading several contiguous data blocks. Building on the optimized storage structure described above, the solution therefore uses read-ahead: the files in the same block are read out together, which reduces the number of disk I/O operations. To address the poor I/O performance of the whole system caused by frequent access to the disk on the metadata server, the method uses a cache in the role of the metadata server. The cache holds the file information nodes, and the simplified Node data structure means each node retains only the file's disk space information and a small amount of other useful information, which improves cache utilization and lets the cache hold a large number of file information nodes. In this way, the number of disk accesses and the cost of reading file information nodes are reduced, and I/O performance improves.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410560613.0A CN104375782A (en) | 2014-10-21 | 2014-10-21 | Read-write solution for tens of millions of small file data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410560613.0A CN104375782A (en) | 2014-10-21 | 2014-10-21 | Read-write solution for tens of millions of small file data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104375782A true CN104375782A (en) | 2015-02-25 |
Family
ID=52554739
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410560613.0A Pending CN104375782A (en) | 2014-10-21 | 2014-10-21 | Read-write solution for tens of millions of small file data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104375782A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105574076A (en) * | 2015-11-27 | 2016-05-11 | 湖南大学 | Key value pair storage structure based on Bloom Filter and method |
CN107066505A (en) * | 2017-01-10 | 2017-08-18 | 郑州云海信息技术有限公司 | The system and method that a kind of small documents storage of performance optimization is accessed |
CN107193492A (en) * | 2017-05-18 | 2017-09-22 | 郑州云海信息技术有限公司 | The method and device that a kind of small documents update |
CN107391423A (en) * | 2017-07-26 | 2017-11-24 | Tcl移动通信科技(宁波)有限公司 | Method, storage medium and the mobile terminal of file are transmitted by OTG functions |
CN109086006A (en) * | 2018-07-24 | 2018-12-25 | 浪潮电子信息产业股份有限公司 | A kind of method and relevant apparatus of reading data |
-
2014
- 2014-10-21 CN CN201410560613.0A patent/CN104375782A/en active Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105574076A (en) * | 2015-11-27 | 2016-05-11 | 湖南大学 | Key value pair storage structure based on Bloom Filter and method |
CN105574076B (en) * | 2015-11-27 | 2019-02-12 | 湖南大学 | A kind of key-value pair storage organization and method based on Bloom Filter |
CN107066505A (en) * | 2017-01-10 | 2017-08-18 | 郑州云海信息技术有限公司 | The system and method that a kind of small documents storage of performance optimization is accessed |
CN107193492A (en) * | 2017-05-18 | 2017-09-22 | 郑州云海信息技术有限公司 | The method and device that a kind of small documents update |
CN107391423A (en) * | 2017-07-26 | 2017-11-24 | Tcl移动通信科技(宁波)有限公司 | Method, storage medium and the mobile terminal of file are transmitted by OTG functions |
CN109086006A (en) * | 2018-07-24 | 2018-12-25 | 浪潮电子信息产业股份有限公司 | A kind of method and relevant apparatus of reading data |
CN109086006B (en) * | 2018-07-24 | 2021-10-15 | 浪潮电子信息产业股份有限公司 | Data reading method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150225 |