A Ceph-based access optimization method for massive small files
Technical field
The present invention relates to the technical field of distributed file storage, and in particular to a Ceph-based access optimization method for massive small files.
Background art
With the rapid development of cloud computing and big data, the amount of data worldwide is growing exponentially, and traditional storage systems can no longer meet people's storage needs owing to factors such as equipment cost and maintenance cost. In addition, as the number of small files keeps increasing, most distributed storage systems cannot satisfy the demand for efficient storage and reading of small files. How to solve the storage and management of massive small files, and thereby improve the efficiency of storing and accessing them, is currently a major challenge.
Ceph is a distributed file system that achieves efficient storage and management when handling large files, but it still has shortcomings when storing massive small files:
(1) The storage efficiency for massive small files is relatively low. Ceph's local storage interface supports transactions by introducing a journal mechanism, so every write operation must first be written to the journal and then written to the local file system through the object storage interface. Under sustained large-scale I/O, the throughput actually delivered by the disk is therefore only half of its physical capability, resulting in poor small-file storage performance;
(2) The reading efficiency for massive small files is low. When small files are accessed frequently, the cluster has to jump back and forth among multiple storage nodes to locate them, so the small-file read performance of a Ceph cluster is poor.
Summary of the invention
The problem to be solved by the present invention is that Ceph suffers from low storage and reading efficiency when handling massive small files; to this end, a Ceph-based access optimization method for massive small files is provided.
To solve the above problem, the present invention is realized through the following technical solution:
A Ceph-based access optimization method for massive small files comprises the following steps:
Step 1: obtain the file names and file sizes of the files to be uploaded by the client in the same period, and classify these files according to a preset file threshold: when the size of a file to be uploaded is greater than the file threshold, it is judged to be a big file and uploaded directly to the Ceph cluster; when the size of a file to be uploaded is equal to or less than the file threshold, it is judged to be a small file;
Step 2: group the small files by association using the K-means clustering algorithm, sort the small files within each group by file size in descending order, merge the small files in each group in turn and upload the result to the Ceph cluster, and meanwhile generate an index file from the mapping relationship of each small file within its merge file;
Step 3: when a user issues an access request, the client judges whether the requested file is in the client's cache: if it is, the requested file is accessed directly from the client's cache; otherwise, the client sends the request information to the Ceph cluster;
Step 4: the Ceph cluster receives the request information and determines the file type from the name of the requested file. If the requested file is a big file, it is read directly from the Ceph cluster and stored in the client cache for the user to access; if the requested file is a small file, its specific location within its merge file is first determined from the index file, and the file is then read from the Ceph cluster and stored in the client cache for the user to access.
In the above step 1, the file threshold is set according to the file block size of the Ceph cluster.
In the above step 2, while the small files in each group are being associatively merged, it must be judged whether the sum of the size of the small file to be merged and the size of the merge file generated so far exceeds the file threshold. If the sum is less than or equal to the file threshold, the small file is merged directly into the previously generated merge file; otherwise, a new merge file must be applied for.
In the above step 2, the index file has the structure <key, value>, where key stores the file name of a small file, and value stores the start position file_offset of the small file within its merge file and the size file_length of the small file.
As an improvement, the Ceph-based massive-small-file access optimization method further comprises a file prefetching process, namely:
When a file is read from the Ceph cluster and the requested file is a small file, the correlation ratio Ψ between each small file in the merge file containing the requested file and the requested file is computed, and the small files in that merge file whose correlation ratio Ψ exceeds a correlation threshold are read out together with the requested file and stored in the client cache. The correlation ratio Ψ is computed from n, d, and sum, where n denotes the number of times the requested file is accessed within the statistical period, d denotes the number of times the small file in the merge file is accessed within the statistical period, and sum denotes the total number of times all small files are accessed within the statistical period.
As a further improvement, in the file prefetching process, when the number of small files in the merge file whose correlation ratio Ψ exceeds the correlation threshold is greater than a given maximum prefetch number num, only the top num small files ranked by correlation ratio Ψ are stored in the client cache together with the requested file.
In the above scheme, the maximum prefetch number num is:
num = math.floor((T_w - T_Ceph) / T_pre)
where math.floor(*) denotes rounding down, T_w denotes the maximum latency tolerated by the user, T_Ceph denotes the time from the Ceph cluster receiving an access request to returning the file, and T_pre denotes the time for the Ceph cluster to prefetch one file.
As an improvement, the Ceph-based massive-small-file access optimization method further comprises a cache optimization process for the files cached at the client, namely: the weight R_w of each file is computed separately, and the cached files are sorted according to their weights R_w, with high-weight files stored in the client's level-two cache and low-weight files stored in the level-one cache. When a file newly read from the Ceph cluster needs to be stored in the client cache and the cache space is insufficient, the files with the lowest weight R_w are deleted one by one from the level-one cache. The weight R_w of a file is:
R_w = e^(-(N_t - N_r) × t)
where N_t denotes the maximum capacity of the client cache, N_r denotes the number of times the cached file has been accessed, and t denotes the time since the cache entry was updated.
Compared with the prior art, the present invention has the following features:
1. Through association judgment, the merging of small files, and the establishment of the index file, the storage efficiency of massive small files can be improved.
2. By using a small-file read-ahead mechanism, related files are prefetched into the cache while a small file is being read, reducing the interaction between the user and the cluster and thereby improving the reading efficiency of small files.
3. In the file caching mechanism, the weight factor of a cache object can be computed dynamically from its access count and access time, and the access and eviction order of cache objects is determined by their weight factors, which reduces cache waste, improves the cache hit rate, and further improves the reading efficiency of small files.
Description of the drawings
Fig. 1 is a functional block diagram of the Ceph-based massive-small-file access optimization method according to a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of file writing.
Fig. 3 is a diagram of the index structure.
Fig. 4 is a schematic diagram of file reading.
Fig. 5 is a schematic diagram of file prefetching.
Fig. 6 is a schematic diagram of cache optimization.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific examples and the accompanying drawings.
In the Ceph-based massive-small-file access optimization method, when a user stores files, the K-means clustering algorithm is first used to obtain the association groups of the small files; the files in each group are then sorted in descending order, and the associated files in each association group are merged and stored in Ceph. When a user initiates an access request, the system first checks whether the requested file is in the cache; if it is, the requested file is read and returned directly. Otherwise, the request information is sent to the Ceph cluster, the small file is read, and, according to the correlation between the requested file and the other small files in its merge file, related small files are prefetched and cached; the requested file and the prefetched small files are then returned.
Specifically, the Ceph-based massive-small-file access optimization method, as shown in Fig. 1, comprises a file writing stage and a file reading stage. The specific steps are as follows:
(1) File writing, as shown in Fig. 2.
Step S1: obtain the file names and file sizes of the files to be uploaded by the client in the same period, and classify these files according to a preset file threshold: when the size of a file to be uploaded is greater than the file threshold, it is judged to be a big file and uploaded directly to the Ceph cluster; when the size of a file to be uploaded is equal to or less than the file threshold, it is judged to be a small file, and step S2 is entered. The upload procedure is shown in Fig. 2.
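The classification in step S1 can be sketched as follows. This is a minimal illustration only; the 4 MB threshold follows the block size mentioned later in the text, and the function name is not part of the invention.

```python
# Step S1 sketch: split files to upload into big files and small files
# according to a size threshold (assumed 4 MB, per the Ceph block size).

FILE_THRESHOLD = 4 * 1024 * 1024  # bytes

def classify(files):
    """Split (name, size) pairs: big files go straight to the Ceph
    cluster, small files are handed to the merge pipeline (step S2)."""
    big, small = [], []
    for name, size in files:
        if size > FILE_THRESHOLD:
            big.append((name, size))
        else:
            small.append((name, size))
    return big, small
```

Files exactly at the threshold are treated as small files, matching the "equal to or less than" rule in the text.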
Step S2: group the small files by association using the K-means clustering algorithm, sort the small files within each group by file size in descending order, merge the small files in each group in turn and upload the result to the Ceph cluster, and meanwhile generate an index file from the mapping relationship of each small file within its merge file.
The K-means clustering algorithm is used to cluster these small files into different groups. The similarity between the small files within a group is high, that is, the association between them is strong, so the small files in the same group can be merged. To avoid file-block fragmentation, the files in a group are first sorted in descending order, and the small files in the group are then merged in turn and stored in the Ceph cluster. In addition, during association merging, in order to avoid a file being stored across blocks, it must be judged whether the sum of the size of the newly merged small file and the size of the already merged file exceeds the 4 MB threshold; if it does, a new merge file must be applied for.
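The descending sort and threshold-bounded merging described above can be sketched as follows (illustrative only; each "merge file" is represented here simply as a list of its member files):

```python
FILE_THRESHOLD = 4 * 1024 * 1024  # 4 MB, the merge-file size threshold

def pack_group(small_files):
    """Pack one association group's (name, size) pairs into merge files.

    Files are sorted largest-first to reduce fragmentation, then appended
    to the current merge file unless the combined size would exceed the
    threshold, in which case a new merge file is opened."""
    merged = []                      # list of merge files
    current, current_size = [], 0
    for name, size in sorted(small_files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > FILE_THRESHOLD:
            merged.append(current)   # close the current merge file
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        merged.append(current)
    return merged
```

For example, files of 3 MB, 2 MB, and 1 MB yield two merge files: [3 MB] and [2 MB, 1 MB], since 3 MB + 2 MB would cross the 4 MB threshold.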
The index file has the structure <key, value>: key stores the file name of a small file, and value stores the start position file_offset of the small file within its merge file and the size file_length of the small file. The index structure is shown in Fig. 3.
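The <key, value> index for one merge file can be built as a simple sketch, with value holding (file_offset, file_length) as the text specifies:

```python
def build_index(packed):
    """Build index entries for one merge file.

    packed: the (name, size) pairs of the small files, in the order they
    were written into the merge file.
    Returns {name: (file_offset, file_length)}, where file_offset is the
    small file's start position inside the merge file."""
    index, offset = {}, 0
    for name, size in packed:
        index[name] = (offset, size)
        offset += size
    return index
```

A lookup is then a single dictionary access, which is what makes small-file location inside a merge file efficient.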
(2) File reading, as shown in Fig. 4.
Step S3: when a user initiates an access request, the client receives the request information and checks whether the file exists in the client's cache. If it is in the cache, the small file is read from the cache and the requested file is returned; otherwise, the file has not yet been read, the request information is sent to the Ceph cluster, and step S4 is entered.
Step S4: the Ceph cluster receives the request information and determines the file type from the name of the requested file. If the requested file is a big file, it is read directly from the Ceph cluster and stored in the client cache for the user to access; if the requested file is a small file, its specific location within its merge file is first determined from the index file, and the file is then read from the Ceph cluster and stored in the client cache for the user to access.
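Steps S3 and S4 together form a dispatch routine, sketched below. For the sketch to be self-contained, the index value here also carries the merge-file identifier (an assumption; the text's index stores only file_offset and file_length), and `ceph_read` stands in for the cluster read interface.

```python
def read_file(name, cache, index, ceph_read):
    """Serve a request from the client cache if possible (step S3);
    otherwise fetch from the Ceph cluster (step S4) and populate the cache.

    index: {name: (merge_obj, file_offset, file_length)} for small files;
    ceph_read(obj, offset, length): hypothetical cluster read accessor."""
    if name in cache:                        # step S3: cache hit
        return cache[name]
    if name in index:                        # small file: locate in merge file
        merge_obj, offset, length = index[name]
        data = ceph_read(merge_obj, offset, length)
    else:                                    # big file: read directly
        data = ceph_read(name, 0, None)
    cache[name] = data                       # store for subsequent accesses
    return data
```

A second request for the same file is then served from the cache without contacting the cluster, which is exactly the interaction reduction the method aims for.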
Step S5: the file prefetching mechanism, as shown in Fig. 5.
To effectively improve file reading speed, a read-ahead mechanism can also be used during small-file reading to prefetch related small files, returning the requested small file and the prefetched files at the same time.
For the merge file containing the requested file, the correlation ratio Ψ between the requested file and each of the other small files currently in the merge file is judged, and Ψ is compared with a preset correlation threshold: when the correlation ratio Ψ of a small file exceeds the correlation threshold, that small file is read ahead and stored in the client cache. The correlation ratio Ψ is computed from n, d, and sum, where n denotes the number of times the requested file is accessed within the statistical period, d denotes the number of times the small file in the merge file is accessed within the statistical period, and sum denotes the total number of times all small files are accessed within the statistical period.
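The prefetch selection can be sketched as follows. Note a loudly flagged assumption: the text defines Ψ only through the counts n, d, and sum (the formula itself is not reproduced here), so this sketch assumes Ψ = (n + d) / sum purely for illustration; any concrete deployment would substitute the invention's actual formula.

```python
def files_to_prefetch(request_file, access_counts, threshold):
    """Return sibling small files in the same merge file whose correlation
    ratio with `request_file` exceeds `threshold`, highest first.

    access_counts: {name: times accessed in the statistical period} for
    all small files in the merge file containing `request_file`."""
    total = sum(access_counts.values())      # sum in the text
    if total == 0:
        return []
    n = access_counts.get(request_file, 0)   # n in the text
    candidates = []
    for name, d in access_counts.items():    # d in the text
        if name == request_file:
            continue
        psi = (n + d) / total                # ASSUMED form of the ratio
        if psi > threshold:
            candidates.append((psi, name))
    return [name for _, name in sorted(candidates, reverse=True)]
```

Raising the threshold shrinks the prefetch set, trading cache space against the chance of future hits.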
Considering the limited client cache space, in the file prefetching process, when the number of small files in the merge file whose correlation ratio Ψ exceeds the correlation threshold is greater than a given maximum prefetch number num, only the top num small files ranked by correlation ratio Ψ are stored in the client cache together with the requested file. The maximum prefetch number num can be set manually, or calculated according to the following formula:
num = math.floor((T_w - T_Ceph) / T_pre)
where math.floor(*) denotes rounding down, T_w denotes the maximum latency tolerated by the user, T_Ceph denotes the time from the Ceph cluster receiving an access request to returning the file, and T_pre denotes the time for the Ceph cluster to prefetch one file.
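The num calculation follows directly from the three timing quantities: after spending T_Ceph answering the request itself, the remaining latency budget T_w - T_Ceph is divided among prefetches of T_pre each. A sketch (with a floor at zero added so a negative budget simply disables prefetching):

```python
import math

def max_prefetch_count(t_wait, t_ceph, t_pre):
    """num = floor((T_w - T_Ceph) / T_pre): how many files can be
    prefetched without exceeding the user's maximum tolerated latency.

    t_wait : maximum latency the user will tolerate (T_w)
    t_ceph : time from request arrival to returning the file (T_Ceph)
    t_pre  : time for the cluster to prefetch one file (T_pre)"""
    if t_pre <= 0:
        raise ValueError("t_pre must be positive")
    return max(0, math.floor((t_wait - t_ceph) / t_pre))
```

For example, with a 10 s budget, a 2 s base response time, and 3 s per prefetch, at most two files are prefetched.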
Step S6: the cache optimization mechanism, as shown in Fig. 6.
According to the access frequency and access time of the cached files, the weight R_w of each file is computed separately. The priority of a cache object is determined by the size of its weight R_w, and the cached files are sorted by weight: files with a relatively high weight R_w have high priority and are stored in the client's level-two cache, while files with a relatively low weight R_w have low priority and are stored in the client's level-one cache. When a file newly read from the Ceph cluster needs to be stored in the client cache and the cache space is insufficient, the files with the lowest weight R_w are deleted one by one from the level-one cache. If a cached file is not accessed for a long time, its weight R_w decays accordingly, which prevents files from wasting cache space because they have not been accessed for a long time.
The weight R_w of a file is:
R_w = e^(-(N_t - N_r) × t)
where N_t denotes the maximum capacity of the client cache, N_r denotes the number of times the cached file has been accessed, and t denotes the time since the cache entry was updated.
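The weight formula and the resulting eviction choice can be sketched as follows (illustrative only; in the common case N_r < N_t, the exponent is negative, so the weight decays toward zero as t grows, and frequently accessed files decay more slowly):

```python
import math

def weight(n_t, n_r, t):
    """R_w = e^(-(N_t - N_r) * t): N_t is the cache's maximum capacity,
    N_r the number of times the cached file has been accessed, and t the
    time since the cache entry was last updated."""
    return math.exp(-(n_t - n_r) * t)

def evict_victim(entries, n_t):
    """Pick the lowest-weight file to delete when cache space runs out.

    entries: {name: (n_r, t)} for the cached files."""
    return min(entries, key=lambda name: weight(n_t, *entries[name]))
```

With two files idle for the same time, the one accessed far less often has the smaller weight and is evicted first.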
When a file is written, the present invention first performs file detection, then uses the K-means clustering algorithm to cluster the small files and obtain their association groups, and then associatively merges the files in each association group and stores them in the Ceph cluster. During association merging, an index file is generated from the mapping relationship between the small files and their merge files and stored at the client, improving the lookup efficiency of small files. When a file is read, small files are prefetched and cached according to the correlation between the requested file and the other files in its file block. Through the cache optimization mechanism, the prefetched small files are cached and corresponding weight factors are computed; a weight decreases over time, and when it falls below a given threshold the file is removed from the cache, which reduces the waste of cache space and improves the cache hit rate. By reducing the interaction between the user and the cluster, the present invention reduces user access time, improves the storage and reading efficiency of massive small files, and improves the overall performance of the system.
It should be noted that although the above embodiments of the present invention are illustrative, they do not limit the present invention, and the present invention is therefore not limited to the above specific embodiments. Without departing from the principles of the present invention, any other embodiment obtained by those skilled in the art under the inspiration of the present invention shall be regarded as falling within the protection of the present invention.