CN104375782A - Read-write solution for tens of millions of small file data - Google Patents

Read-write solution for tens of millions of small file data

Info

Publication number
CN104375782A
Authority
CN
China
Prior art keywords
file
data
small file
disk
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410560613.0A
Other languages
Chinese (zh)
Inventor
张砚波
吴丙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201410560613.0A
Publication of CN104375782A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a read-write solution for tens of millions of small file data. When small files are stored, the solution allocates large, contiguous regions of disk space to hold them; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is that each small file may only be stored within a single block and may not span two blocks, each folder owns one or more blocks that hold only the data of that folder, and the data of each file is stored on contiguous disk space. Compared with the prior art, logically contiguous data are stored on contiguous space of the physical disks as far as possible, a cache plays the role of the metadata server, and cache utilization is improved through simplified file information nodes, thereby improving small-file access performance.

Description

A read-write solution for tens of millions of small file data
Technical field
The present invention relates to the field of computer application technology, and in particular to a read-write solution for tens of millions of small file data.
Background technology
At the present stage, reading and writing small files is the most common pattern of data access in the storage field. Unlike large files, whose striping into slices improves the concurrency of user access to files, small files (≤64KB) are not suited to striping; the traditional method is therefore to store each single file on an individual data server. However, once the number of small files reaches a certain scale, repeated access to large quantities of small files places a performance burden on the data server and creates an I/O bottleneck. Because the information on the Internet is presented mainly in the form of high-frequency small files, and the reads and stores of ordinary users involve mostly small files, research on the read/write performance of high-frequency small files on the Internet has important practical significance.
At the present stage, traditional approaches to managing and operating on tens of millions of small files mainly suffer from problems in the following three aspects:
(1) small files are accessed frequently and require repeated disk accesses, so disk I/O performance is low;
(2) because the files are small, file fragmentation forms easily and disk space is wasted;
(3) establishing a connection for each small-file request easily introduces network delay, which lowers the read rate of small files.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a read-write solution for tens of millions of small file data that reduces the transmission delay of small files and improves small-file read and storage performance through a file transmission method that sends files with high access frequency in advance, in batches.
The technical scheme of the present invention is realized in the following manner; the method is as follows:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the present invention allocates large, contiguous regions of disk space to hold a large number of small files; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible, and the data of the same file, or the files under the same folder, are stored on contiguous disk array blocks as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is: each small file may only be stored within a single block and may not span two blocks; each folder owns one or more blocks, and these blocks hold only the data of that folder; and the data of each file is stored on contiguous disk space.
The data structure for storing small files on the disk array is as follows:
This patent designs a simplified inode (file information node) attribute structure suited to small files. At the same time, the attribute information of the file information node is kept on the metadata server, so only the disk location information of a file is needed to access it. The inode data structure is therefore simplified to retain only a file's disk location information and a small number of data members, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
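As a hedged illustration, a minimal C sketch of such a simplified file information node might look like the following; the concrete field types and widths are assumptions, since the text only specifies the member names and their meanings.

```c
#include <stdint.h>
#include <pthread.h>

/* Minimal sketch of the simplified file information node described above.
 * Only the members named in the text are kept; the types and widths are
 * illustrative assumptions. */
typedef struct FileNode {
    uint64_t        File_id;        /* file identifier */
    uint32_t        StartPosition;  /* start position of the file within its 64 KB block */
    uint32_t        Long;           /* length of the file in bytes (at most 64 KB) */
    uint32_t        Weight;         /* file weight: access frequency of the file */
    uint64_t        Block_id;       /* identifier of the block in which the file is stored */
    uint64_t        Count;          /* access counter of the file */
    pthread_mutex_t Lock;           /* file lock */
} FileNode;
```

Keeping the node down to a handful of fixed-size members is what allows a large number of such nodes to fit in the cache used by the read design below.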
The read operation for small files is designed as follows:
The latency of reading and writing small files is spent mainly on seeking and positioning the disk head; once the head is positioned, the time needed to read one data block differs little from the time needed to read several consecutive data blocks. Therefore, combined with the optimized data storage structure described above, this solution adopts a read-ahead approach in which the files in the same block are read out together, reducing the number of disk I/O operations. To address the poor overall I/O performance caused by frequently accessing the disks on the metadata server, the method uses a cache to play the role of the metadata server: the cache holds the file information nodes, and the simplified inode data structure lets each file information node keep only the file's disk location information plus a small amount of other useful information, which raises cache utilization and allows the cache to hold a large number of file information nodes. In this way, the number of disk accesses and the overhead of reading file information nodes are reduced, and I/O performance is improved.
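To illustrate the read-ahead and cache idea, the sketch below is an assumption-laden illustration that reuses the FileNode structure sketched earlier: it caches whole 64 KB blocks keyed by Block_id and relies on a hypothetical helper read_block() that fetches one block with a single disk I/O.

```c
#include <stdint.h>

#define BLOCK_SIZE  (64 * 1024)
#define CACHE_SLOTS 1024                 /* assumed cache capacity */

/* One cached 64 KB block: after a single disk I/O, every small file
 * stored in this block can be served from memory. */
typedef struct {
    uint64_t block_id;
    int      valid;
    char     data[BLOCK_SIZE];
} CachedBlock;

static CachedBlock block_cache[CACHE_SLOTS];

/* Hypothetical helper that issues one disk I/O for a whole 64 KB block. */
extern int read_block(uint64_t block_id, char *buf);

/* Read-ahead: locate the file's block in the cache, loading it from disk
 * at most once per block rather than once per small file. */
const char *read_small_file(const FileNode *n)
{
    CachedBlock *slot = &block_cache[n->Block_id % CACHE_SLOTS];
    if (!slot->valid || slot->block_id != n->Block_id) {
        if (read_block(n->Block_id, slot->data) != 0)
            return NULL;                         /* disk error */
        slot->block_id = n->Block_id;
        slot->valid = 1;
    }
    return slot->data + n->StartPosition;        /* n->Long bytes are valid here */
}
```

Because every file in a block is brought into memory by the first access, later requests for neighbouring files in the same block (or the same folder) hit the cache without touching the disk.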
The advantages of the present invention are:
Compared with the prior art, the read-write solution for tens of millions of small file data of the present invention stores logically contiguous data on contiguous space of the physical disks as far as possible, uses a cache to play the role of the metadata server, and raises cache utilization through simplified file information nodes, thereby improving small-file access performance. Updated data and the related data in the same folder are aggregated into a single I/O write request, which reduces the amount of file fragmentation; for reads, the small files with high access frequency are sent in advance, in batches, which reduces frequent I/O operations and improves file transfer performance.
Embodiment
The read-write solution for tens of millions of small file data of the present invention is described in detail below.
In the read-write solution for tens of millions of small file data of the present invention, the method is as follows:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the present invention allocates large, contiguous regions of disk space to hold a large number of small files; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible, and the data of the same file, or the files under the same folder, are stored on contiguous disk array blocks as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is: each small file may only be stored within a single block and may not span two blocks; each folder owns one or more blocks, and these blocks hold only the data of that folder; and the data of each file is stored on contiguous disk space.
The data structure for storing small files on the disk array is as follows:
This patent designs a simplified inode (file information node) attribute structure suited to small files. At the same time, the attribute information of the file information node is kept on the metadata server, so only the disk location information of a file is needed to access it. The inode data structure is therefore simplified to retain only a file's disk location information and a small number of data members, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
The read operation for small files is designed as follows:
The latency of reading and writing small files is spent mainly on seeking and positioning the disk head; once the head is positioned, the time needed to read one data block differs little from the time needed to read several consecutive data blocks. Therefore, combined with the optimized data storage structure described above, this solution adopts a read-ahead approach in which the files in the same block are read out together, reducing the number of disk I/O operations. To address the poor overall I/O performance caused by frequently accessing the disks on the metadata server, the method uses a cache to play the role of the metadata server: the cache holds the file information nodes, and the simplified inode data structure lets each file information node keep only the file's disk location information plus a small amount of other useful information, which raises cache utilization and allows the cache to hold a large number of file information nodes. In this way, the number of disk accesses and the overhead of reading file information nodes are reduced, and I/O performance is improved.
The present invention stores a large number of small files by allocating large, contiguous regions of disk space when the small files are stored. The disk space is first divided into multiple blocks, each 64 KB in size, and a large contiguous region of disk space is formed by a series of these blocks. When the files encountered are small, each small file may only be stored within a single block and may not span two blocks, and the data of each file occupies contiguous disk space. For example, A1, A2, A3, A4 and A5 are five files stored contiguously one after another, such as A1 next to A2 and A3 next to A4; the sectors marked red are the fragments of the block. When a file smaller than such a fragment occurs, it should preferentially be stored in the fragment. To raise the hit rate of the read-ahead data, the storage layout designed in the present invention stores logically contiguous data on contiguous space of the physical disks as far as possible, and stores the data of the same file, or the files under the same folder, on contiguous blocks of disk space as far as possible; each folder owns one or more blocks, and these blocks hold only the files of that folder.
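As an assumed illustration of these placement rules (the BlockInfo bookkeeping structure and the function name are not taken from the patent), the C sketch below places a small file preferentially into a fragment that can hold it, otherwise appends it to a block with enough free space, and never splits a file across two blocks.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)

/* Per-block allocation state for one folder's blocks; these bookkeeping
 * fields are assumptions made for illustration. */
typedef struct {
    uint64_t block_id;
    uint32_t used;       /* bytes already allocated from the front of the block */
    uint32_t frag_off;   /* offset of one reclaimed fragment, 0 if none */
    uint32_t frag_len;   /* length of that fragment in bytes */
} BlockInfo;

/* Choose where to place a small file of `size` bytes inside the blocks
 * owned by its folder. Returns 0 on success, -1 if the file is not a
 * small file, and 1 if a new 64 KB block must be allocated first. */
int place_file(BlockInfo *blocks, size_t nblocks, uint32_t size,
               uint64_t *out_block, uint32_t *out_off)
{
    if (size == 0 || size > BLOCK_SIZE)
        return -1;                             /* not a small file */

    /* First pass: prefer a fragment that is large enough. */
    for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i].frag_len >= size) {
            *out_block = blocks[i].block_id;
            *out_off   = blocks[i].frag_off;
            blocks[i].frag_off += size;
            blocks[i].frag_len -= size;
            return 0;
        }
    }

    /* Second pass: append to the first block with enough free space,
     * so the file never crosses a block boundary. */
    for (size_t i = 0; i < nblocks; i++) {
        if (BLOCK_SIZE - blocks[i].used >= size) {
            *out_block = blocks[i].block_id;
            *out_off   = blocks[i].used;
            blocks[i].used += size;
            return 0;
        }
    }
    return 1;                                  /* folder needs another block */
}
```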
In the data storage structure of this system, the attribute information of the file information node is kept on the metadata server, and on the I/O server only the disk location information of a file is needed to access it. The I/O server therefore only needs to record the disk location information of a file and does not need to record its other attributes, such as creation time, last access time and owning user. On this basis, the inode data structure is simplified to retain only a file's disk location information and a small number of data members, yielding the simplified inode attribute structure, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
To handle the data access-frequency aspect of this system, a global variable NodeList is first designed. NodeList is a linked list of inodes sorted by the access frequency of the files; it is designed to optimize file transfer and serves the whole small-file read and storage solution. For each folder, NodeList maintains a list of the folder's files sorted by access frequency. When a user accesses a file under that folder, the system automatically sends along the files in the list with high access frequency. To avoid sending too many files, a threshold on access frequency is set, and all files whose access frequency exceeds the threshold are divided, in order, into groups; each group may contain multiple files, and the total size of the files in a group does not exceed 64 KB. When the user requests one file in the current folder, the system sends the whole group together, in order, thereby reducing file transfer latency.
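As a sketch of this grouping under stated assumptions (the ListNode structure and function name are illustrative, and the per-folder list is assumed to be already sorted by descending access frequency), the code below partitions the files above a frequency threshold into consecutive groups whose combined size stays within 64 KB.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)

/* One entry of the hypothetical NodeList for a folder, kept sorted by
 * descending access frequency (weight). */
typedef struct ListNode {
    uint64_t         file_id;
    uint32_t         length;   /* file size in bytes */
    uint32_t         weight;   /* access frequency of the file */
    int              group;    /* prefetch group assigned below, -1 if none */
    struct ListNode *next;
} ListNode;

/* Divide the files whose access frequency reaches `threshold` into
 * consecutive groups whose total size does not exceed 64 KB, so that a
 * whole group can be sent along with the file the user actually requests. */
void assign_groups(ListNode *head, uint32_t threshold)
{
    int group = 0;
    uint32_t group_bytes = 0;

    for (ListNode *n = head; n != NULL; n = n->next) {
        if (n->weight < threshold) {
            n->group = -1;                 /* below threshold: not prefetched */
            continue;
        }
        if (group_bytes + n->length > BLOCK_SIZE) {
            group++;                       /* start a new group within 64 KB */
            group_bytes = 0;
        }
        n->group = group;
        group_bytes += n->length;
    }
}
```

When a user then requests one file of the folder, the server would transmit every file carrying the same group number in a single batch, which is the prefetch behaviour described above.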
The read-write solution for tens of millions of small file data of the present invention is simple and convenient to implement and can be carried out according to this specification.
Technical features not described in this specification are known to those skilled in the art.

Claims (1)

1. A read-write solution for tens of millions of small file data, characterized in that:
The storage layout of small files on the disk array is designed as follows:
When storing small files, the present invention allocates large, contiguous regions of disk space to hold a large number of small files; that is, logically contiguous data are stored on contiguous space of the disk array as far as possible, and the data of the same file, or the files under the same folder, are stored on contiguous disk array blocks as far as possible. The disk space is divided into multiple blocks, each 64 KB in size. The basic idea is: each small file may only be stored within a single block and may not span two blocks; each folder owns one or more blocks, and these blocks hold only the data of that folder; and the data of each file is stored on contiguous disk space.
The data structure for storing small files on the disk array is as follows:
This patent designs a simplified inode (file information node) attribute structure suited to small files. At the same time, the attribute information of the file information node is kept on the metadata server, so only the disk location information of a file is needed to access it. The inode data structure is therefore simplified to retain only a file's disk location information and a small number of data members, where: File_id is the file identifier; StartPosition is the start position of the file within its block; Long is the length of the file; Weight is the file weight, representing in this method the access frequency of the file; Block_id is the identifier of the block in which the file is stored; Count is the access counter of the file; and Lock is the file lock.
The read operation for small files is designed as follows:
The latency of reading and writing small files is spent mainly on seeking and positioning the disk head; once the head is positioned, the time needed to read one data block differs little from the time needed to read several consecutive data blocks. Therefore, combined with the optimized data storage structure described above, this solution adopts a read-ahead approach in which the files in the same block are read out together, reducing the number of disk I/O operations. To address the poor overall I/O performance caused by frequently accessing the disks on the metadata server, the method uses a cache to play the role of the metadata server: the cache holds the file information nodes, and the simplified inode data structure lets each file information node keep only the file's disk location information plus a small amount of other useful information, which raises cache utilization and allows the cache to hold a large number of file information nodes. In this way, the number of disk accesses and the overhead of reading file information nodes are reduced, and I/O performance is improved.
CN201410560613.0A 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data Pending CN104375782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410560613.0A CN104375782A (en) 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410560613.0A CN104375782A (en) 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data

Publications (1)

Publication Number Publication Date
CN104375782A true CN104375782A (en) 2015-02-25

Family

ID=52554739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410560613.0A Pending CN104375782A (en) 2014-10-21 2014-10-21 Read-write solution for tens of millions of small file data

Country Status (1)

Country Link
CN (1) CN104375782A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574076A (en) * 2015-11-27 2016-05-11 湖南大学 Key value pair storage structure based on Bloom Filter and method
CN105574076B (en) * 2015-11-27 2019-02-12 湖南大学 A kind of key-value pair storage organization and method based on Bloom Filter
CN107066505A (en) * 2017-01-10 2017-08-18 郑州云海信息技术有限公司 The system and method that a kind of small documents storage of performance optimization is accessed
CN107193492A (en) * 2017-05-18 2017-09-22 郑州云海信息技术有限公司 The method and device that a kind of small documents update
CN107391423A (en) * 2017-07-26 2017-11-24 Tcl移动通信科技(宁波)有限公司 Method, storage medium and the mobile terminal of file are transmitted by OTG functions
CN109086006A (en) * 2018-07-24 2018-12-25 浪潮电子信息产业股份有限公司 A kind of method and relevant apparatus of reading data
CN109086006B (en) * 2018-07-24 2021-10-15 浪潮电子信息产业股份有限公司 Data reading method and related device

Similar Documents

Publication Publication Date Title
CN103176754A (en) Reading and storing method for massive amounts of small files
CN102332029B (en) Hadoop-based mass classifiable small file association storage method
US9792344B2 (en) Asynchronous namespace maintenance
US9836514B2 (en) Cache based key-value store mapping and replication
US9311252B2 (en) Hierarchical storage for LSM-based NoSQL stores
CN104978362B (en) Data migration method, device and the meta data server of distributed file system
CN104375782A (en) Read-write solution for tens of millions of small file data
CN104391961A (en) Tens of millions of small file data read and write solution strategy
US10210188B2 (en) Multi-tiered data storage in a deduplication system
CN102819586B (en) A kind of URL sorting technique based on high-speed cache and equipment
CN102323958A (en) Data de-duplication method
CN103139300A (en) Virtual machine image management optimization method based on data de-duplication
CN104133882A (en) HDFS (Hadoop Distributed File System)-based old file processing method
WO2010062554A2 (en) Index compression in databases
CN104462389A (en) Method for implementing distributed file systems on basis of hierarchical storage
CN102129472A (en) Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN106066818B (en) A kind of data layout method improving data de-duplication standby system restorability
CN103514210A (en) Method and device for processing small files
CN102915340A (en) Expanded B+ tree-based object file system
CN111159176A (en) Method and system for storing and reading mass stream data
CN105630810A (en) Method for uploading mass small files in distributed storage system
CN102737068A (en) Method and equipment for performing cache management on retrieval data
CN107066505A (en) The system and method that a kind of small documents storage of performance optimization is accessed
US10712943B2 (en) Database memory monitoring and defragmentation of database indexes
CN103942301A (en) Distributed file system oriented to access and application of multiple data types

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20150225)