A Ceph-based access optimization method for massive small files
Technical field
The present invention relates to the technical field of distributed file storage, and in particular to a Ceph-based access optimization method for massive small files.
Background art
With the rapid development of cloud computing and big data, the amount of data worldwide is growing exponentially, and traditional storage systems can no longer meet people's storage needs owing to factors such as equipment cost and maintenance cost. In addition, as the number of small files keeps increasing, most distributed storage systems cannot satisfy the demand for efficient storage and reading of small files. How to solve the storage and management of massive small files, and thereby improve the efficiency of storing and accessing them, is currently a major challenge.
Ceph is a distributed file system that achieves efficient storage and management when handling large files, but it still has shortcomings when storing massive small files:
(1) The storage efficiency for massive small files is relatively low. Ceph's local storage interface supports transactions by introducing a journal mechanism, so every write operation must first be written to the journal and then written to the local file system through the object storage interface. Under sustained large-scale I/O, the throughput actually delivered by the disk is therefore only half of its physical capability, resulting in poor small-file storage performance;
(2) The reading efficiency for massive small files is low. When small files are accessed frequently, the cluster has to jump back and forth among multiple storage nodes to locate them, so the small-file read performance of a Ceph cluster is poor.
Summary of the invention
The problem to be solved by the present invention is that Ceph suffers from low storage and reading efficiency when handling massive small files; to this end, a Ceph-based access optimization method for massive small files is provided.
To solve the above problem, the present invention is realized through the following technical solution:
A Ceph-based access optimization method for massive small files comprises the following steps:
Step 1: obtain the file names and file sizes of the files to be uploaded by the client in the same period, and classify these files according to a preset file threshold: when the size of a file to be uploaded is greater than the file threshold, it is judged to be a big file and uploaded directly to the Ceph cluster; when the size of a file to be uploaded is equal to or less than the file threshold, it is judged to be a small file;
Step 2: group the small files by association using the K-means clustering algorithm, sort the small files within each group by file size in descending order, merge the small files in each group in turn and upload the result to the Ceph cluster, and meanwhile generate an index file from the mapping relationship of each small file within its merge file;
Step 3: when a user issues an access request, the client judges whether the requested file is in the client's cache: if it is, the requested file is accessed directly from the client's cache; otherwise, the client sends the request information to the Ceph cluster;
Step 4: the Ceph cluster receives the request information and determines the file type from the name of the requested file. If the requested file is a big file, it is read directly from the Ceph cluster and stored in the client cache for the user to access; if the requested file is a small file, its specific location within its merge file is first determined from the index file, and the file is then read from the Ceph cluster and stored in the client cache for the user to access.
In the above step 1, the file threshold is set according to the file block size of the Ceph cluster.
In the above step 2, while the small files in each group are being associatively merged, it must be judged whether the sum of the size of the small file to be merged and the size of the merge file generated so far exceeds the file threshold. If the sum is less than or equal to the file threshold, the small file is merged directly into the previously generated merge file; otherwise, a new merge file must be applied for.
In the above step 2, the index file has the structure <key, value>, where key stores the file name of a small file, and value stores the start position file_offset of the small file within its merge file and the size file_length of the small file.
As an improvement, the Ceph-based massive-small-file access optimization method further comprises a file prefetching process, namely:
When a file is read from the Ceph cluster and the requested file is a small file, the correlation ratio Ψ between each small file in the merge file containing the requested file and the requested file is computed, and the small files in that merge file whose correlation ratio Ψ exceeds a correlation threshold are read out together with the requested file and stored in the client cache. The correlation ratio Ψ is computed from n, d, and sum, where n denotes the number of times the requested file is accessed within the statistical period, d denotes the number of times the small file in the merge file is accessed within the statistical period, and sum denotes the total number of times all small files are accessed within the statistical period.
As a further improvement, in the file prefetching process, when the number of small files in the merge file whose correlation ratio Ψ exceeds the correlation threshold is greater than a given maximum prefetch number num, only the top num small files ranked by correlation ratio Ψ are stored in the client cache together with the requested file.
In the above scheme, the maximum prefetch number num is:
num = math.floor((T_w - T_Ceph) / T_pre)
where math.floor(*) denotes rounding down, T_w denotes the maximum latency tolerated by the user, T_Ceph denotes the time from the Ceph cluster receiving an access request to returning the file, and T_pre denotes the time for the Ceph cluster to prefetch one file.
As an improvement, the Ceph-based massive-small-file access optimization method further comprises a cache optimization process for the files cached at the client, namely: the weight R_w of each file is computed separately, and the cached files are sorted according to their weights R_w, with high-weight files stored in the client's level-two cache and low-weight files stored in the level-one cache. When a file newly read from the Ceph cluster needs to be stored in the client cache and the cache space is insufficient, the files with the lowest weight R_w are deleted one by one from the level-one cache. The weight R_w of a file is:
R_w = e^(-(N_t - N_r) × t)
where N_t denotes the maximum capacity of the client cache, N_r denotes the number of times the cached file has been accessed, and t denotes the time since the cache entry was updated.
Compared with the prior art, the present invention has the following features:
1. Through association judgment, the merging of small files, and the establishment of the index file, the storage efficiency of massive small files can be improved.
2. By using a small-file read-ahead mechanism, related files are prefetched into the cache while a small file is being read, reducing the interaction between the user and the cluster and thereby improving the reading efficiency of small files.
3. In the file caching mechanism, the weight factor of a cache object can be computed dynamically from its access count and access time, and the access and eviction order of cache objects is determined by their weight factors, which reduces cache waste, improves the cache hit rate, and further improves the reading efficiency of small files.
Description of the drawings
Fig. 1 is a functional block diagram of the Ceph-based massive-small-file access optimization method according to a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of file writing.
Fig. 3 is a diagram of the index structure.
Fig. 4 is a schematic diagram of file reading.
Fig. 5 is a schematic diagram of file prefetching.
Fig. 6 is a schematic diagram of cache optimization.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific examples and the accompanying drawings.
In the Ceph-based massive-small-file access optimization method, when a user stores files, the K-means clustering algorithm is first used to obtain the association groups of the small files; the files in each group are then sorted in descending order, and the associated files in each association group are merged and stored in Ceph. When a user initiates an access request, the system first checks whether the requested file is in the cache; if it is, the requested file is read and returned directly. Otherwise, the request information is sent to the Ceph cluster, the small file is read, and, according to the correlation between the requested file and the other small files in its merge file, related small files are prefetched and cached; the requested file and the prefetched small files are then returned.
Specifically, the Ceph-based massive-small-file access optimization method, as shown in Fig. 1, comprises a file writing stage and a file reading stage. The specific steps are as follows:
(1) File writing, as shown in Fig. 2.
Step S1: obtain the file names and file sizes of the files to be uploaded by the client in the same period, and classify these files according to a preset file threshold: when the size of a file to be uploaded is greater than the file threshold, it is judged to be a big file and uploaded directly to the Ceph cluster; when the size of a file to be uploaded is equal to or less than the file threshold, it is judged to be a small file, and step S2 is entered. The upload procedure is shown in Fig. 2.
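The classification in step S1 can be sketched as follows. This is a minimal illustration only; the 4 MB threshold follows the block size mentioned later in the text, and the function name is not part of the invention.

```python
# Step S1 sketch: split files to upload into big files and small files
# according to a size threshold (assumed 4 MB, per the Ceph block size).

FILE_THRESHOLD = 4 * 1024 * 1024  # bytes

def classify(files):
    """Split (name, size) pairs: big files go straight to the Ceph
    cluster, small files are handed to the merge pipeline (step S2)."""
    big, small = [], []
    for name, size in files:
        if size > FILE_THRESHOLD:
            big.append((name, size))
        else:
            small.append((name, size))
    return big, small
```

Files exactly at the threshold are treated as small files, matching the "equal to or less than" rule in the text.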
Step S2: group the small files by association using the K-means clustering algorithm, sort the small files within each group by file size in descending order, merge the small files in each group in turn and upload the result to the Ceph cluster, and meanwhile generate an index file from the mapping relationship of each small file within its merge file.
The K-means clustering algorithm is used to cluster these small files into different groups. The similarity between the small files within a group is high, that is, the association between them is strong, so the small files in the same group can be merged. To avoid file-block fragmentation, the files in a group are first sorted in descending order, and the small files in the group are then merged in turn and stored in the Ceph cluster. In addition, during association merging, in order to avoid a file being stored across blocks, it must be judged whether the sum of the size of the newly merged small file and the size of the already merged file exceeds the 4 MB threshold; if it does, a new merge file must be applied for.
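The descending sort and threshold-bounded merging described above can be sketched as follows (illustrative only; each "merge file" is represented here simply as a list of its member files):

```python
FILE_THRESHOLD = 4 * 1024 * 1024  # 4 MB, the merge-file size threshold

def pack_group(small_files):
    """Pack one association group's (name, size) pairs into merge files.

    Files are sorted largest-first to reduce fragmentation, then appended
    to the current merge file unless the combined size would exceed the
    threshold, in which case a new merge file is opened."""
    merged = []                      # list of merge files
    current, current_size = [], 0
    for name, size in sorted(small_files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > FILE_THRESHOLD:
            merged.append(current)   # close the current merge file
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        merged.append(current)
    return merged
```

For example, files of 3 MB, 2 MB, and 1 MB yield two merge files: [3 MB] and [2 MB, 1 MB], since 3 MB + 2 MB would cross the 4 MB threshold.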
The index file has the structure <key, value>: key stores the file name of a small file, and value stores the start position file_offset of the small file within its merge file and the size file_length of the small file. The index structure is shown in Fig. 3.
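The <key, value> index for one merge file can be built as a simple sketch, with value holding (file_offset, file_length) as the text specifies:

```python
def build_index(packed):
    """Build index entries for one merge file.

    packed: the (name, size) pairs of the small files, in the order they
    were written into the merge file.
    Returns {name: (file_offset, file_length)}, where file_offset is the
    small file's start position inside the merge file."""
    index, offset = {}, 0
    for name, size in packed:
        index[name] = (offset, size)
        offset += size
    return index
```

A lookup is then a single dictionary access, which is what makes small-file location inside a merge file efficient.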
(2) File reading, as shown in Fig. 4.
Step S3: when a user initiates an access request, the client receives the request information and checks whether the file exists in the client's cache. If it is in the cache, the small file is read from the cache and the requested file is returned; otherwise, the file has not yet been read, the request information is sent to the Ceph cluster, and step S4 is entered.
Step S4: the Ceph cluster receives the request information and determines the file type from the name of the requested file. If the requested file is a big file, it is read directly from the Ceph cluster and stored in the client cache for the user to access; if the requested file is a small file, its specific location within its merge file is first determined from the index file, and the file is then read from the Ceph cluster and stored in the client cache for the user to access.
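Steps S3 and S4 together form a dispatch routine, sketched below. For the sketch to be self-contained, the index value here also carries the merge-file identifier (an assumption; the text's index stores only file_offset and file_length), and `ceph_read` stands in for the cluster read interface.

```python
def read_file(name, cache, index, ceph_read):
    """Serve a request from the client cache if possible (step S3);
    otherwise fetch from the Ceph cluster (step S4) and populate the cache.

    index: {name: (merge_obj, file_offset, file_length)} for small files;
    ceph_read(obj, offset, length): hypothetical cluster read accessor."""
    if name in cache:                        # step S3: cache hit
        return cache[name]
    if name in index:                        # small file: locate in merge file
        merge_obj, offset, length = index[name]
        data = ceph_read(merge_obj, offset, length)
    else:                                    # big file: read directly
        data = ceph_read(name, 0, None)
    cache[name] = data                       # store for subsequent accesses
    return data
```

A second request for the same file is then served from the cache without contacting the cluster, which is exactly the interaction reduction the method aims for.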
Step S5: the file prefetching mechanism, as shown in Fig. 5.
To effectively improve file reading speed, a read-ahead mechanism can also be used during small-file reading to prefetch related small files, returning the requested small file and the prefetched files at the same time.
For the merge file containing the requested file, the correlation ratio Ψ between the requested file and each of the other small files currently in the merge file is judged, and Ψ is compared with a preset correlation threshold: when the correlation ratio Ψ of a small file exceeds the correlation threshold, that small file is read ahead and stored in the client cache. The correlation ratio Ψ is computed from n, d, and sum, where n denotes the number of times the requested file is accessed within the statistical period, d denotes the number of times the small file in the merge file is accessed within the statistical period, and sum denotes the total number of times all small files are accessed within the statistical period.
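The prefetch selection can be sketched as follows. Note a loudly flagged assumption: the text defines Ψ only through the counts n, d, and sum (the formula itself is not reproduced here), so this sketch assumes Ψ = (n + d) / sum purely for illustration; any concrete deployment would substitute the invention's actual formula.

```python
def files_to_prefetch(request_file, access_counts, threshold):
    """Return sibling small files in the same merge file whose correlation
    ratio with `request_file` exceeds `threshold`, highest first.

    access_counts: {name: times accessed in the statistical period} for
    all small files in the merge file containing `request_file`."""
    total = sum(access_counts.values())      # sum in the text
    if total == 0:
        return []
    n = access_counts.get(request_file, 0)   # n in the text
    candidates = []
    for name, d in access_counts.items():    # d in the text
        if name == request_file:
            continue
        psi = (n + d) / total                # ASSUMED form of the ratio
        if psi > threshold:
            candidates.append((psi, name))
    return [name for _, name in sorted(candidates, reverse=True)]
```

Raising the threshold shrinks the prefetch set, trading cache space against the chance of future hits.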
Considering the limited client cache space, in the file prefetching process, when the number of small files in the merge file whose correlation ratio Ψ exceeds the correlation threshold is greater than a given maximum prefetch number num, only the top num small files ranked by correlation ratio Ψ are stored in the client cache together with the requested file. The maximum prefetch number num can be set manually, or calculated according to the following formula:
num = math.floor((T_w - T_Ceph) / T_pre)
where math.floor(*) denotes rounding down, T_w denotes the maximum latency tolerated by the user, T_Ceph denotes the time from the Ceph cluster receiving an access request to returning the file, and T_pre denotes the time for the Ceph cluster to prefetch one file.
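The num calculation follows directly from the three timing quantities: after spending T_Ceph answering the request itself, the remaining latency budget T_w - T_Ceph is divided among prefetches of T_pre each. A sketch (with a floor at zero added so a negative budget simply disables prefetching):

```python
import math

def max_prefetch_count(t_wait, t_ceph, t_pre):
    """num = floor((T_w - T_Ceph) / T_pre): how many files can be
    prefetched without exceeding the user's maximum tolerated latency.

    t_wait : maximum latency the user will tolerate (T_w)
    t_ceph : time from request arrival to returning the file (T_Ceph)
    t_pre  : time for the cluster to prefetch one file (T_pre)"""
    if t_pre <= 0:
        raise ValueError("t_pre must be positive")
    return max(0, math.floor((t_wait - t_ceph) / t_pre))
```

For example, with a 10 s budget, a 2 s base response time, and 3 s per prefetch, at most two files are prefetched.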
Step S6: the cache optimization mechanism, as shown in Fig. 6.
According to the access frequency and access time of the cached files, the weight R_w of each file is computed separately. The priority of a cache object is determined by the size of its weight R_w, and the cached files are sorted by weight: files with a relatively high weight R_w have high priority and are stored in the client's level-two cache, while files with a relatively low weight R_w have low priority and are stored in the client's level-one cache. When a file newly read from the Ceph cluster needs to be stored in the client cache and the cache space is insufficient, the files with the lowest weight R_w are deleted one by one from the level-one cache. If a cached file is not accessed for a long time, its weight R_w decays accordingly, which prevents files from wasting cache space because they have not been accessed for a long time.
The weight R_w of a file is:
R_w = e^(-(N_t - N_r) × t)
where N_t denotes the maximum capacity of the client cache, N_r denotes the number of times the cached file has been accessed, and t denotes the time since the cache entry was updated.
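The weight formula and the resulting eviction choice can be sketched as follows (illustrative only; in the common case N_r < N_t, the exponent is negative, so the weight decays toward zero as t grows, and frequently accessed files decay more slowly):

```python
import math

def weight(n_t, n_r, t):
    """R_w = e^(-(N_t - N_r) * t): N_t is the cache's maximum capacity,
    N_r the number of times the cached file has been accessed, and t the
    time since the cache entry was last updated."""
    return math.exp(-(n_t - n_r) * t)

def evict_victim(entries, n_t):
    """Pick the lowest-weight file to delete when cache space runs out.

    entries: {name: (n_r, t)} for the cached files."""
    return min(entries, key=lambda name: weight(n_t, *entries[name]))
```

With two files idle for the same time, the one accessed far less often has the smaller weight and is evicted first.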
When a file is written, the present invention first performs file detection, then uses the K-means clustering algorithm to cluster the small files and obtain their association groups, and then associatively merges the files in each association group and stores them in the Ceph cluster. During association merging, an index file is generated from the mapping relationship between the small files and their merge files and stored at the client, improving the lookup efficiency of small files. When a file is read, small files are prefetched and cached according to the correlation between the requested file and the other files in its file block. Through the cache optimization mechanism, the prefetched small files are cached and corresponding weight factors are computed; a weight decreases over time, and when it falls below a given threshold the file is removed from the cache, which reduces the waste of cache space and improves the cache hit rate. By reducing the interaction between the user and the cluster, the present invention reduces user access time, improves the storage and reading efficiency of massive small files, and improves the overall performance of the system.
It should be noted that although the above embodiments of the present invention are illustrative, they do not limit the present invention, and the present invention is therefore not limited to the above specific embodiments. Without departing from the principles of the present invention, any other embodiment obtained by those skilled in the art under the inspiration of the present invention shall be regarded as falling within the protection of the present invention.