Background Art
With the rapid development of mobile Internet technology, network data is growing explosively. To cope with the many problems posed by access to massive data, distributed storage technology has also developed rapidly. Relatively mature distributed file systems currently include Google's GFS, the Hadoop Distributed File System (HDFS), Lustre, FastDFS, MooseFS, MogileFS, NFS, and Taobao's TFS. These distributed file systems target different application scenarios, and their characteristics therefore differ.
The performance differences these distributed file systems exhibit in production are mainly manifested in the efficiency of reading a particular file, or set of files, out of a massive collection of files. Solutions to this problem fall into two categories: on the one hand, optimizing the distributed file system itself, including its read and storage mechanisms, storage formats, and file metadata; on the other hand, optimizing read and storage performance from outside the distributed file system, mainly through file prefetching and caching techniques. The present invention improves the read performance of a distributed file system through prefetching and caching.
The basic idea of caching is that when a file in the distributed file system is accessed, the file just read is kept in memory rather than discarded, so that a subsequent read can be served directly from memory without going back to the distributed file system; this greatly improves the responsiveness of the entire distributed file system. However, when very many files are read, it is impossible to keep every accessed file in the cache, because memory is limited. Various eviction strategies are therefore paired with the cache, including the LRU and LIRS algorithms. LRU, i.e. Least Recently Used, is a page-replacement algorithm that evicts the least recently used page. The LIRS algorithm uses the reuse distance between two accesses to the same file (the number of distinct files accessed in between) as a metric to dynamically rank the accessed files and choose a victim. Because a distributed file system holds a very large number of files, the cache hit rates these strategies achieve are relatively low and cannot satisfy certain demands.
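As a concrete illustration of the LRU strategy named above, the following sketch evicts the least recently used entry when capacity is exceeded. The `LRUCache` class, its capacity of 2, and the sample file names are illustrative only, not part of the described method:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU file cache: evicts the least recently used entry
    when capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # order in the dict == recency order

    def get(self, name):
        if name not in self.entries:
            return None
        self.entries.move_to_end(name)  # mark as most recently used
        return self.entries[name]

    def put(self, name, data):
        if name in self.entries:
            self.entries.move_to_end(name)
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a.txt", b"A")
cache.put("b.txt", b"B")
cache.get("a.txt")        # touch a.txt, so b.txt becomes least recently used
cache.put("c.txt", b"C")  # over capacity: b.txt is evicted
```

As the text notes, such recency-only policies ignore how often a file is reused, which is why their hit rates suffer on large file populations.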
Prefetching generally uses the following techniques. In prefetching models based on access frequency, Web accesses follow certain rules and exhibit historical, relatively concentrated preferences, so the files to be accessed in the future are predicted from group-based interest and access behavior. In prefetching models based on data mining, data-mining techniques extract users' interest association rules, which serve as the basis for prefetching the pages a user is about to access. In prefetching models based on popularity, the access counts of web pages are collected periodically, the most frequently accessed pages form a popular page set, and, according to the size of the client's recent request volume, an equivalent number of pages is prefetched from the popular page sets on the servers into the cache or delivered directly to the user; Zipf's first and second laws are used to model access popularity, yielding a prefetching model based on Web popularity. These prefetching techniques are applied mainly to Web pages; prefetching of files on distributed file systems is comparatively rare.
The prediction methods above are seldom applied in distributed file systems, but patent CN 104933110 A proposes a data prefetching method based on MapReduce. Through capacity assessment it predicts the volume of data blocks each compute node will process, and through a series of calculations it assesses which compute nodes will receive non-local tasks. For a non-local task identified by this assessment, the data is prefetched to the compute node's local storage before the node has even applied to process the task, so the compute node never stalls waiting on computation. This prefetching greatly improves the efficiency of executing MapReduce tasks, but the prefetching rule is unsuitable for non-MapReduce scenarios.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a file prefetching/caching method and apparatus based on a distributed file system.
The present invention proposes a file prefetching/caching method based on a distributed file system, comprising:
a file caching step: when a client accesses a file in the distributed file system, recording the access time and file information of the file in an access log, and judging from the access time and the file information whether the file is to be cached;
a file prefetching step: extracting from the access log each access time point of the file within a time period TP, obtaining a neighborhood of each access time point, computing by a clustering algorithm the degree of association between the file and the files accessed within the neighborhoods, storing the files together with their degrees of association in a to-be-prefetched queue in string form, computing the pairwise similarities between files in the to-be-prefetched queue by a cosine-similarity algorithm, recombining the files in the to-be-prefetched queue, compressing each combination in light of the similarities, computing the total degree of association between each combination and the file, and taking the combination with the largest total degree of association as the group of files to be prefetched.
The file caching step includes setting a threshold N on the number of times the file is accessed, and judging from the access times the number num of accesses to the file within the current time window T; if num is greater than the threshold N, the file is cached in the distributed cache, and otherwise the file is not cached.
The neighborhood is the period obtained by taking a period TN before and after a time point tt.
The file prefetching step further includes compressing the group of files to be prefetched and prefetching the compressed files into the distributed cache.
The method further includes a file eviction step, comprising:
step 31: taking from queue Qf the file with the fewest accesses within time t, where Qf is the queue in the distributed cache ordered by the number of accesses within time t;
step 32: with the middle position of queue Qt as the critical eviction point, locating in Qt the file with the fewest accesses within time t and judging its position in Qt; if that file lies behind the critical eviction point and its earliest access time is earlier than the earliest access time of the file at the critical eviction point, evicting it, and otherwise executing the file caching step, where Qt is the queue in the distributed cache ordered by access time;
step 33: taking from queue Qf the files accessed fewer than a threshold M times within time t, and executing step 32.
The present invention also proposes a file prefetching/caching system based on a distributed file system, comprising:
a file caching module, configured to record, when a client accesses a file in the distributed file system, the access time and file information of the file in an access log, and to judge from the access time and the file information whether the file is to be cached;
a file prefetching module, configured to extract from the access log each access time point of the file within a time period TP, obtain a neighborhood of each access time point, compute by a clustering algorithm the degree of association between the file and the files accessed within the neighborhoods, store the files together with their degrees of association in a to-be-prefetched queue in string form, compute the pairwise similarities between files in the to-be-prefetched queue by a cosine-similarity algorithm, recombine the files in the to-be-prefetched queue, compress each combination in light of the similarities, compute the total degree of association between each combination and the file, and take the combination with the largest total degree of association as the group of files to be prefetched.
The file caching module includes setting a threshold N on the number of times the file is accessed, and judging from the access times the number num of accesses to the file within the current time window T; if num is greater than the threshold N, the file is cached in the distributed cache, and otherwise the file is not cached.
The neighborhood is the period obtained by taking a period TN before and after a time point tt.
The file prefetching module is further configured to compress the group of files to be prefetched and to prefetch the compressed files into the distributed cache.
The system further includes a file eviction module, configured to perform:
step 31: taking from queue Qf the file with the fewest accesses within time t, where Qf is the queue in the distributed cache ordered by the number of accesses within time t;
step 32: with the middle position of queue Qt as the critical eviction point, locating in Qt the file with the fewest accesses within time t and judging its position in Qt; if that file lies behind the critical eviction point and its earliest access time is earlier than the earliest access time of the file at the critical eviction point, evicting it, and otherwise invoking the file caching module, where Qt is the queue in the distributed cache ordered by access time;
step 33: taking from queue Qf the files accessed fewer than a threshold M times within time t, and executing step 32.
As can be seen from the above scheme, the present invention has the following advantages. Through analysis, the caching module of the invention effectively caches the files that are likely to be accessed again, and does not cache files unlikely to be accessed again. The prefetching module computes file associations with a clustering algorithm and a cosine-similarity algorithm and, combined with compression techniques, prefetches effectively, so that the prefetched files are actually accessed and the prefetch hit rate is improved. The eviction module maintains two queues: a queue Qt ordered by access time and a queue Qf ordered by the number of accesses within time t; combining the temporal and frequency strategies ensures that the evicted file is the one least likely to be accessed again.
Specific embodiment
The present invention provides a distributed buffer layer on top of an existing distributed file system for caching and prefetching the files in the distributed file system, with the goal of accelerating the read speed of the distributed file system.
To achieve the above object, the technical solution adopted by the invention is as follows.
The present invention proposes a file prefetching/caching method based on a distributed file system, comprising the following steps, as shown in Fig. 1:
A. Cache a file, implemented as follows:
A1. A client accesses a file in the distributed file system;
A2. The access time and file information of the file are recorded in the access log;
A3. Whether to cache the file is judged from the file's recorded access times and corresponding file information in the access log.
B. file, its implementation are prefetched are as follows:
B1. when a file is cached in memory, according to the access log of history, this document in the TP period is taken out
Each access time point t1, t2, t3 ... tN;
B2. traversal history access log, time neighborhood of a point refer to, respectively take the TN period before and after sometime point tt,
The obtained period.Accessed file f i 1, fi2, the fi3 ... in t1 neighborhood are taken out, one column of composition obtain an access letter
Cease Inf1;
B3. t2 neighborhood, the access information Inf2, Inf3 ... of t3 neighborhood ... tN neighborhood are taken out respectively by step B2
InfN;
B4. above-mentioned access information Inf1, Inf2, Inf3 ... InfN are calculated, is obtained with existing clustering algorithm
The degree of association r1, r2, r3 ... of file are accessed in above-mentioned cache file and above-mentioned time vertex neighborhood.
B5. by the degree of association size f1, f2, f3 ... are ranked up and filename is stored in the form of character string to
It prefetches in queue, according to the size of distributed memory, the space size M of file can be prefetched by setting one file of every caching, from pass
Queue to be prefetched is intercepted from big to small according to the degree of association in connection file, makes the size p*M of file to be prefetched in queue to be prefetched,
The desirable arbitrary number greater than 1 of p, it is proposed that take 3.
B6. The pairwise similarities between the files in the to-be-prefetched queue are computed with the cosine-similarity algorithm, whose steps are: preprocessing → selection of text feature terms → weighting → generation of a vector-space model → computation of the cosine. The resulting cosine value is the similarity of the two files.
B7. The files in the to-be-prefetched queue are recombined; using the inter-file similarities, an existing compression algorithm is applied to each combination, and the combinations whose compressed total size is M, or slightly smaller than M, are extracted.
B8. The total degree of association between each group of files from B7 and the cached file is computed, and the group with the largest total association is taken as the group of files to be prefetched.
B9. The group of files obtained in B8 is compressed with an existing compression technique, and the compressed files are then prefetched into the distributed cache; prefetching ends.
C. Evict files, implemented as follows:
When distributed memory is insufficient, some files must be evicted from memory. The eviction method is as follows.
C1. Two file queues are maintained in distributed memory: a queue Qt ordered by access time, and a queue Qf ordered by the number of accesses within time t.
C2. The file with the fewest accesses within time t is taken from Qf.
C3. With the middle position of Qt as the critical eviction point, the file with the fewest accesses within time t is located in Qt and its position judged. If the file lies behind the critical eviction point, i.e. its earliest access time is earlier than the earliest access time of the file at the critical eviction point, the file is evicted directly and eviction ends; otherwise step C4 is executed.
C4. The files accessed fewer than the threshold M times within time t are taken from Qf, and C3 is executed.
The steps of the present invention are further described below. The object of the present invention is to cache and prefetch the files in a distributed file system and thereby improve the response speed of the distributed file system. The detailed implementation comprises: A. caching files into distributed memory; B. prefetching files highly associated with the cached file into distributed memory; C. when distributed memory is insufficient, evicting some files from memory. A specific embodiment is as follows.
A. Cache a file, implemented as follows:
A1. A client accesses a file in the distributed file system;
A2. The access time and file information of the file are recorded in the access log;
A3. From the recorded access times of the file in the access log, the number num of accesses within the current time window T is determined, and a threshold N on the number of accesses is set. If num within T is greater than the threshold N, the file is cached into the distributed cache; otherwise the file is not cached. In a distributed file system, many files are accessed only once over a long period, so there is no need to cache such files, which improves the utilization of distributed memory and avoids useless caching and eviction operations.
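The caching decision of step A3 can be sketched as follows; the dictionary layout of the access log and the parameter names are assumptions made for illustration, not mandated by the method:

```python
import time

def should_cache(access_log, filename, window_T, threshold_N, now=None):
    """Step A3 sketch: cache a file only if it was accessed more than
    N times within the last T seconds.  'access_log' maps a filename
    to a list of access timestamps (an assumed layout)."""
    now = time.time() if now is None else now
    accesses = access_log.get(filename, [])
    # num: accesses falling inside the current window T
    num = sum(1 for t in accesses if now - t <= window_T)
    return num > threshold_N

log = {"hot.dat": [100.0, 105.0, 108.0], "cold.dat": [10.0]}
hot = should_cache(log, "hot.dat", window_T=60, threshold_N=2, now=110.0)
cold = should_cache(log, "cold.dat", window_T=60, threshold_N=2, now=110.0)
```

Here "hot.dat" (3 recent accesses, above N=2) would be cached, while "cold.dat" (no access within the window) would not, matching the rationale that files accessed only once over a long period are not worth caching.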
B. file, its implementation are prefetched are as follows:
B1. when a file is cached in memory, according to history access log, this document is each in the taking-up TP time
Access time point t1, t2, t3 ... tN;
B2. traversal history access log, time neighborhood of a point refer to, respectively take the TN period before and after sometime point tt,
The obtained period.Accessed the file f i1, fi2, fi3 ... in t1 neighborhood are taken out, one column of composition obtain an access letter
Cease Inf1;
B3. t2 neighborhood, the access information Inf2, Inf3 ... of t3 neighborhood ... tN neighborhood are taken out respectively by step B2
InfN;
B4. above-mentioned access information Inf1, Inf2, Inf3 ... InfN are calculated, is obtained with existing clustering algorithm
The degree of association r1, r2, r3 ... of file are accessed in above-mentioned cache file and above-mentioned time vertex neighborhood.
B5. by the degree of association size f1, f2, f3 ... are ranked up and filename is stored in the form of character string to
It prefetches in queue, according to the size of distributed memory, the space size M of file can be prefetched by setting one file of every caching, from pass
Queue to be prefetched is intercepted from big to small according to the degree of association in connection file, makes the size p*M of file to be prefetched in queue to be prefetched,
The desirable arbitrary number greater than 1 of p, it is proposed that take 3.
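A minimal sketch of the neighborhood extraction of steps B1–B5 follows. For simplicity, co-occurrence counts stand in for the clustering-derived association degrees r1, r2, ... (the method itself uses an existing clustering algorithm), and the log is assumed to be a list of (timestamp, filename) records:

```python
from collections import Counter

def neighborhood_associations(log, target, TP_start, TP_end, TN):
    """Steps B1-B5 sketch: for each access time of 'target' within
    [TP_start, TP_end], take the neighborhood [t - TN, t + TN] and
    collect the other files accessed there (Inf1, Inf2, ...).
    Returns file names ordered by descending association."""
    # B1: access time points of the target file within TP
    points = [t for t, f in log if f == target and TP_start <= t <= TP_end]
    assoc = Counter()
    for t in points:
        # B2/B3: files accessed inside this time point's neighborhood
        neighborhood = {f for ts, f in log if abs(ts - t) <= TN and f != target}
        assoc.update(neighborhood)       # B4: co-occurrence as association
    # B5: to-be-prefetched queue, sorted by descending association
    return [f for f, _ in assoc.most_common()]

log = [(1, "a"), (2, "b"), (3, "x"), (10, "b"), (11, "x"), (30, "c")]
queue = neighborhood_associations(log, target="x", TP_start=0, TP_end=20, TN=2)
# "b" co-occurs in both neighborhoods of "x", "a" in one, "c" in none
```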
B6. The pairwise similarities between the files in the to-be-prefetched queue are computed with the cosine-similarity algorithm, whose steps are: preprocessing → selection of text feature terms → weighting → generation of a vector-space model → computation of the cosine. The resulting cosine value is the similarity of the two files.
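The cosine-similarity computation of step B6 can be sketched as below. Tokenization by whitespace and raw term counts as weights are simplifying assumptions, since the step leaves the feature-selection and weighting schemes open:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Step B6 sketch: preprocessing -> feature-term selection ->
    weighting -> vector-space model -> cosine.  Terms are whitespace
    tokens; weights are raw term counts."""
    va = Counter(text_a.lower().split())   # vector for file A
    vb = Counter(text_b.lower().split())   # vector for file B
    terms = set(va) | set(vb)              # axes of the vector space
    dot = sum(va[t] * vb[t] for t in terms)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)         # the cosine is the similarity

same = cosine_similarity("cache file read", "cache file read")   # identical texts
disjoint = cosine_similarity("cache file", "network latency")    # no shared terms
```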
B7. The files in the to-be-prefetched queue are recombined; using the inter-file similarities, an existing compression algorithm is applied to each combination, and the new combinations whose compressed total size is M, or slightly smaller than M, are extracted. The specific steps are as follows:
B7-1. Let FN be the number of files in the to-be-prefetched queue. X files are taken from the FN files, ensuring that the size of the X files after compression is greater than M. Recombining the FN files yields C(FN, X) combinations.
B7-2. Using the pairwise similarities computed in B6, the compressed size of the two most similar files in each combination is computed.
B7-3. From the combinations, the groups of files whose size after compression is equal to or slightly less than M are further screened out.
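The combination enumeration of B7-1 through B7-3 can be sketched as follows. The fixed compression ratio and the 0.9*M lower margin for "slightly less than M" are assumptions standing in for a real compressor:

```python
from itertools import combinations

def candidate_groups(sizes, X, M, ratio=0.5):
    """B7 sketch: enumerate the C(FN, X) combinations of X files out of
    the FN files in the to-be-prefetched queue, estimate each
    combination's compressed size, and keep the groups whose size is
    equal to, or slightly less than, M."""
    names = list(sizes)
    groups = []
    for combo in combinations(names, X):          # C(FN, X) combinations
        compressed = ratio * sum(sizes[n] for n in combo)
        if 0.9 * M <= compressed <= M:            # equal to or slightly below M
            groups.append(combo)
    return groups

sizes = {"f1": 40, "f2": 60, "f3": 100, "f4": 110}   # raw sizes (illustrative)
groups = candidate_groups(sizes, X=2, M=50)          # pairs compressing to ~M
```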
B8. The total degree of association between each group of files from B7 and the cached file is computed, and the group with the largest total association is taken as the group of files to be prefetched.
B9. The group of files obtained in B8 is compressed with an existing compression technique, and the compressed files are then prefetched into the distributed cache; prefetching ends.
C. Evict files, implemented as follows:
When distributed memory is insufficient, some files must be evicted from memory. The eviction method is as follows.
C1. Two file queues are maintained in distributed memory: a queue Qt ordered by access time, and a queue Qf ordered by the number of accesses within time t.
C2. The file with the fewest accesses within time t is taken from Qf.
C3. With the middle position of Qt as the critical eviction point, the file with the fewest accesses within time t is located in Qt and its position judged. If the file lies behind the critical eviction point, i.e. its earliest access time is earlier than the earliest access time of the file at the critical eviction point, the file is evicted directly and eviction ends; otherwise step C4 is executed.
C4. The files accessed fewer than the threshold M times within time t are taken from Qf, and C3 is executed.
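The two-queue eviction of steps C1–C4 can be sketched as below; the per-file record layout (name, earliest access time, access count within the last t seconds) is an assumed representation of what the distributed cache would track:

```python
def evict(files, threshold_M):
    """Steps C1-C4 sketch.  Qt orders files by access time, Qf by the
    access count within the last t seconds; the middle of Qt is the
    critical eviction point."""
    Qt = sorted(files, key=lambda f: f["earliest_access"])   # C1: time order
    Qf = sorted(files, key=lambda f: f["recent_count"])      # C1: frequency order
    critical = Qt[len(Qt) // 2]                              # critical eviction point

    def evictable(candidate):
        # behind the critical point: accessed earlier than the critical file
        return candidate["earliest_access"] < critical["earliest_access"]

    coldest = Qf[0]                                          # C2: fewest recent accesses
    if evictable(coldest):                                   # C3
        return coldest["name"]
    for f in Qf:                                             # C4: all below threshold M
        if f["recent_count"] < threshold_M and evictable(f):
            return f["name"]
    return None                                              # nothing safely evictable

files = [
    {"name": "old.log", "earliest_access": 5,  "recent_count": 0},
    {"name": "mid.dat", "earliest_access": 50, "recent_count": 3},
    {"name": "new.bin", "earliest_access": 90, "recent_count": 8},
]
victim = evict(files, threshold_M=5)   # old.log: coldest and oldest
```

Combining the two orderings, as the advantages section notes, keeps a recently cached but not-yet-reaccessed file from being evicted purely for its low access count.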
The present invention also proposes a file prefetching/caching system based on a distributed file system, as shown in Fig. 2, comprising:
a file caching module, configured to record, when a client accesses a file in the distributed file system, the access time and file information of the file in an access log, and to judge from the access time and the file information whether the file is to be cached;
a file prefetching module, configured to extract from the access log each access time point of the file within a time period TP, obtain a neighborhood of each access time point, compute by a clustering algorithm the degree of association between the file and the files accessed within the neighborhoods, store the files together with their degrees of association in a to-be-prefetched queue in string form, compute the pairwise similarities between files in the to-be-prefetched queue by a cosine-similarity algorithm, recombine the files in the to-be-prefetched queue, compress each combination in light of the similarities, compute the total degree of association between each combination and the file, and take the combination with the largest total degree of association as the group of files to be prefetched.
The file caching module includes setting a threshold N on the number of times the file is accessed, and judging from the access times the number num of accesses to the file within the current time window T; if num is greater than the threshold N, the file is cached in the distributed cache, and otherwise the file is not cached.
The neighborhood is the period obtained by taking a period TN before and after a time point tt.
The file prefetching module is further configured to compress the group of files to be prefetched and to prefetch the compressed files into the distributed cache.
The system further includes a file eviction module, configured to perform:
step 31: taking from queue Qf the file with the fewest accesses within time t, where Qf is the queue in the distributed cache ordered by the number of accesses within time t;
step 32: with the middle position of queue Qt as the critical eviction point, locating in Qt the file with the fewest accesses within time t and judging its position in Qt; if that file lies behind the critical eviction point and its earliest access time is earlier than the earliest access time of the file at the critical eviction point, evicting it, and otherwise invoking the file caching module, where Qt is the queue in the distributed cache ordered by access time;
step 33: taking from queue Qf the files accessed fewer than a threshold M times within time t, and executing step 32.