Background Art
With the rapid development of mobile Internet technology, network data is growing explosively. To cope with the many problems posed by access to massive data, distributed storage technology has also developed rapidly. Relatively mature distributed file systems currently include Google's GFS, the Hadoop Distributed File System (HDFS), Lustre, FastDFS, MooseFS, MogileFS, NFS, and Taobao's TFS. These distributed file systems target different application scenarios, and their characteristics therefore differ.
The performance differences these distributed file systems exhibit in production are mainly manifested in the efficiency of reading a particular file, or set of files, out of a massive collection of files. Solutions to this problem fall into two categories: on the one hand, optimizing the distributed file system itself, including its read and storage mechanisms, storage formats, and file metadata; on the other hand, optimizing read and storage performance from outside the distributed file system, mainly through file prefetching and caching techniques. The present invention improves the read performance of a distributed file system through prefetching and caching.
The basic idea of caching is that when a file in the distributed file system is accessed, the file just read is kept in memory rather than discarded, so that a subsequent read can be served directly from memory without going back to the distributed file system; this greatly improves the responsiveness of the entire distributed file system. However, when very many files are read, it is impossible to keep every accessed file in the cache, because memory is limited. Various eviction strategies are therefore paired with the cache, including the LRU and LIRS algorithms. LRU, i.e. Least Recently Used, is a page-replacement algorithm that evicts the least recently used page. The LIRS algorithm uses the reuse distance between two accesses to the same file (the number of distinct files accessed in between) as a metric to dynamically rank the accessed files and choose a victim. Because a distributed file system holds a very large number of files, the cache hit rates these strategies achieve are relatively low and cannot satisfy certain demands.
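As a concrete illustration of the LRU strategy named above, the following sketch evicts the least recently used entry when capacity is exceeded. The `LRUCache` class, its capacity of 2, and the sample file names are illustrative only, not part of the described method:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU file cache: evicts the least recently used entry
    when capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # order in the dict == recency order

    def get(self, name):
        if name not in self.entries:
            return None
        self.entries.move_to_end(name)  # mark as most recently used
        return self.entries[name]

    def put(self, name, data):
        if name in self.entries:
            self.entries.move_to_end(name)
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a.txt", b"A")
cache.put("b.txt", b"B")
cache.get("a.txt")        # touch a.txt, so b.txt becomes least recently used
cache.put("c.txt", b"C")  # over capacity: b.txt is evicted
```

As the text notes, such recency-only policies ignore how often a file is reused, which is why their hit rates suffer on large file populations.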
Prefetching generally uses the following techniques. In prefetching models based on access frequency, Web accesses follow certain rules and exhibit historical, relatively concentrated preferences, so the files to be accessed in the future are predicted from group-based interest and access behavior. In prefetching models based on data mining, data-mining techniques extract users' interest association rules, which serve as the basis for prefetching the pages a user is about to access. In prefetching models based on popularity, the access counts of web pages are collected periodically, the most frequently accessed pages form a popular page set, and, according to the size of the client's recent request volume, an equivalent number of pages is prefetched from the popular page sets on the servers into the cache or delivered directly to the user; Zipf's first and second laws are used to model access popularity, yielding a prefetching model based on Web popularity. These prefetching techniques are applied mainly to Web pages; prefetching of files on distributed file systems is comparatively rare.
The prediction methods above are seldom applied in distributed file systems, but patent CN 104933110 A proposes a data prefetching method based on MapReduce. Through capacity assessment it predicts the volume of data blocks each compute node will process, and through a series of calculations it assesses which compute nodes will receive non-local tasks. For a non-local task identified by this assessment, the data is prefetched to the compute node's local storage before the node has even applied to process the task, so the compute node never stalls waiting on computation. This prefetching greatly improves the efficiency of executing MapReduce tasks, but the prefetching rule is unsuitable for non-MapReduce scenarios.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a file prefetching/caching method and apparatus based on a distributed file system.
The present invention proposes a file prefetching/caching method based on a distributed file system, comprising:
a file caching step: when a client accesses a file in the distributed file system, recording the access time and file information of the file in an access log, and judging from the access time and the file information whether the file is to be cached;
a file prefetching step: extracting from the access log each access time point of the file within a time period TP, obtaining a neighborhood of each access time point, computing by a clustering algorithm the degree of association between the file and the files accessed within the neighborhoods, storing the files together with their degrees of association in a to-be-prefetched queue in string form, computing the pairwise similarities between files in the to-be-prefetched queue by a cosine-similarity algorithm, recombining the files in the to-be-prefetched queue, compressing each combination in light of the similarities, computing the total degree of association between each combination and the file, and taking the combination with the largest total degree of association as the group of files to be prefetched.
The file caching step includes setting a threshold N on the number of times the file is accessed, and judging from the access times the number num of accesses to the file within the current time window T; if num is greater than the threshold N, the file is cached in the distributed cache, and otherwise the file is not cached.
The neighborhood is the period obtained by taking a period TN before and after a time point tt.
The file prefetching step further includes compressing the group of files to be prefetched and prefetching the compressed files into the distributed cache.
The method further includes a file eviction step, comprising:
step 31: taking from queue Qf the file with the fewest accesses within time t, where Qf is the queue in the distributed cache ordered by the number of accesses within time t;
step 32: with the middle position of queue Qt as the critical eviction point, locating in Qt the file with the fewest accesses within time t and judging its position in Qt; if that file lies behind the critical eviction point and its earliest access time is earlier than the earliest access time of the file at the critical eviction point, evicting it, and otherwise executing the file caching step, where Qt is the queue in the distributed cache ordered by access time;
step 33: taking from queue Qf the files accessed fewer than a threshold M times within time t, and executing step 32.
The present invention also proposes a file prefetching/caching system based on a distributed file system, comprising:
a file caching module, configured to record, when a client accesses a file in the distributed file system, the access time and file information of the file in an access log, and to judge from the access time and the file information whether the file is to be cached;
a file prefetching module, configured to extract from the access log each access time point of the file within a time period TP, obtain a neighborhood of each access time point, compute by a clustering algorithm the degree of association between the file and the files accessed within the neighborhoods, store the files together with their degrees of association in a to-be-prefetched queue in string form, compute the pairwise similarities between files in the to-be-prefetched queue by a cosine-similarity algorithm, recombine the files in the to-be-prefetched queue, compress each combination in light of the similarities, compute the total degree of association between each combination and the file, and take the combination with the largest total degree of association as the group of files to be prefetched.
The file caching module includes setting a threshold N on the number of times the file is accessed, and judging from the access times the number num of accesses to the file within the current time window T; if num is greater than the threshold N, the file is cached in the distributed cache, and otherwise the file is not cached.
The neighborhood is the period obtained by taking a period TN before and after a time point tt.
The file prefetching module is further configured to compress the group of files to be prefetched and to prefetch the compressed files into the distributed cache.
The system further includes a file eviction module, configured to perform:
step 31: taking from queue Qf the file with the fewest accesses within time t, where Qf is the queue in the distributed cache ordered by the number of accesses within time t;
step 32: with the middle position of queue Qt as the critical eviction point, locating in Qt the file with the fewest accesses within time t and judging its position in Qt; if that file lies behind the critical eviction point and its earliest access time is earlier than the earliest access time of the file at the critical eviction point, evicting it, and otherwise invoking the file caching module, where Qt is the queue in the distributed cache ordered by access time;
step 33: taking from queue Qf the files accessed fewer than a threshold M times within time t, and executing step 32.
As can be seen from the above scheme, the present invention has the following advantages. Through analysis, the caching module of the invention effectively caches the files that are likely to be accessed again, and does not cache files unlikely to be accessed again. The prefetching module computes file associations with a clustering algorithm and a cosine-similarity algorithm and, combined with compression techniques, prefetches effectively, so that the prefetched files are actually accessed and the prefetch hit rate is improved. The eviction module maintains two queues: a queue Qt ordered by access time and a queue Qf ordered by the number of accesses within time t; combining the temporal and frequency strategies ensures that the evicted file is the one least likely to be accessed again.
Specific embodiment
The present invention provides a distributed buffer layer on top of an existing distributed file system for caching and prefetching the files in the distributed file system, with the goal of accelerating the read speed of the distributed file system.
To achieve the above object, the technical solution adopted by the invention is as follows.
The present invention proposes a file prefetching/caching method based on a distributed file system, comprising the following steps, as shown in Fig. 1:
A. Cache a file, implemented as follows:
A1. A client accesses a file in the distributed file system;
A2. The access time and file information of the file are recorded in the access log;
A3. Whether to cache the file is judged from the file's recorded access times and corresponding file information in the access log.
B. file, its implementation are prefetched are as follows:
B1. when a file is cached in memory, according to the access log of history, this document in the TP period is taken out
Each access time point t1, t2, t3 ... tN;
B2. traversal history access log, time neighborhood of a point refer to, respectively take the TN period before and after sometime point tt,
The obtained period.Accessed file f i 1, fi2, the fi3 ... in t1 neighborhood are taken out, one column of composition obtain an access letter
Cease Inf1;
B3. t2 neighborhood, the access information Inf2, Inf3 ... of t3 neighborhood ... tN neighborhood are taken out respectively by step B2
InfN;
B4. above-mentioned access information Inf1, Inf2, Inf3 ... InfN are calculated, is obtained with existing clustering algorithm
The degree of association r1, r2, r3 ... of file are accessed in above-mentioned cache file and above-mentioned time vertex neighborhood.
B5. by the degree of association size f1, f2, f3 ... are ranked up and filename is stored in the form of character string to
It prefetches in queue, according to the size of distributed memory, the space size M of file can be prefetched by setting one file of every caching, from pass
Queue to be prefetched is intercepted from big to small according to the degree of association in connection file, makes the size p*M of file to be prefetched in queue to be prefetched,
The desirable arbitrary number greater than 1 of p, it is proposed that take 3.
B6. The pairwise similarities between the files in the to-be-prefetched queue are computed with the cosine-similarity algorithm, whose steps are: preprocessing → selection of text feature terms → weighting → generation of a vector-space model → computation of the cosine. The resulting cosine value is the similarity of the two files.
B7. The files in the to-be-prefetched queue are recombined; using the inter-file similarities, an existing compression algorithm is applied to each combination, and the combinations whose compressed total size is M, or slightly smaller than M, are extracted.
B8. The total degree of association between each group of files from B7 and the cached file is computed, and the group with the largest total association is taken as the group of files to be prefetched.
B9. The group of files obtained in B8 is compressed with an existing compression technique, and the compressed files are then prefetched into the distributed cache; prefetching ends.
C. Evict files, implemented as follows:
When distributed memory is insufficient, some files must be evicted from memory. The eviction method is as follows.
C1. Two file queues are maintained in distributed memory: a queue Qt ordered by access time, and a queue Qf ordered by the number of accesses within time t.
C2. The file with the fewest accesses within time t is taken from Qf.
C3. With the middle position of Qt as the critical eviction point, the file with the fewest accesses within time t is located in Qt and its position judged. If the file lies behind the critical eviction point, i.e. its earliest access time is earlier than the earliest access time of the file at the critical eviction point, the file is evicted directly and eviction ends; otherwise step C4 is executed.
C4. The files accessed fewer than the threshold M times within time t are taken from Qf, and C3 is executed.
The steps of the present invention are further described below. The object of the present invention is to cache and prefetch the files in a distributed file system and thereby improve the response speed of the distributed file system. The detailed implementation comprises: A. caching files into distributed memory; B. prefetching files highly associated with the cached file into distributed memory; C. when distributed memory is insufficient, evicting some files from memory. A specific embodiment is as follows.
A. Cache a file, implemented as follows:
A1. A client accesses a file in the distributed file system;
A2. The access time and file information of the file are recorded in the access log;
A3. From the recorded access times of the file in the access log, the number num of accesses within the current time window T is determined, and a threshold N on the number of accesses is set. If num within T is greater than the threshold N, the file is cached into the distributed cache; otherwise the file is not cached. In a distributed file system, many files are accessed only once over a long period, so there is no need to cache such files, which improves the utilization of distributed memory and avoids useless caching and eviction operations.
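The caching decision of step A3 can be sketched as follows; the dictionary layout of the access log and the parameter names are assumptions made for illustration, not mandated by the method:

```python
import time

def should_cache(access_log, filename, window_T, threshold_N, now=None):
    """Step A3 sketch: cache a file only if it was accessed more than
    N times within the last T seconds.  'access_log' maps a filename
    to a list of access timestamps (an assumed layout)."""
    now = time.time() if now is None else now
    accesses = access_log.get(filename, [])
    # num: accesses falling inside the current window T
    num = sum(1 for t in accesses if now - t <= window_T)
    return num > threshold_N

log = {"hot.dat": [100.0, 105.0, 108.0], "cold.dat": [10.0]}
hot = should_cache(log, "hot.dat", window_T=60, threshold_N=2, now=110.0)
cold = should_cache(log, "cold.dat", window_T=60, threshold_N=2, now=110.0)
```

Here "hot.dat" (3 recent accesses, above N=2) would be cached, while "cold.dat" (no access within the window) would not, matching the rationale that files accessed only once over a long period are not worth caching.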
B. file, its implementation are prefetched are as follows:
B1. when a file is cached in memory, according to history access log, this document is each in the taking-up TP time
Access time point t1, t2, t3 ... tN;
B2. traversal history access log, time neighborhood of a point refer to, respectively take the TN period before and after sometime point tt,
The obtained period.Accessed the file f i1, fi2, fi3 ... in t1 neighborhood are taken out, one column of composition obtain an access letter
Cease Inf1;
B3. t2 neighborhood, the access information Inf2, Inf3 ... of t3 neighborhood ... tN neighborhood are taken out respectively by step B2
InfN;
B4. above-mentioned access information Inf1, Inf2, Inf3 ... InfN are calculated, is obtained with existing clustering algorithm
The degree of association r1, r2, r3 ... of file are accessed in above-mentioned cache file and above-mentioned time vertex neighborhood.
B5. by the degree of association size f1, f2, f3 ... are ranked up and filename is stored in the form of character string to
It prefetches in queue, according to the size of distributed memory, the space size M of file can be prefetched by setting one file of every caching, from pass
Queue to be prefetched is intercepted from big to small according to the degree of association in connection file, makes the size p*M of file to be prefetched in queue to be prefetched,
The desirable arbitrary number greater than 1 of p, it is proposed that take 3.
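A minimal sketch of the neighborhood extraction of steps B1–B5 follows. For simplicity, co-occurrence counts stand in for the clustering-derived association degrees r1, r2, ... (the method itself uses an existing clustering algorithm), and the log is assumed to be a list of (timestamp, filename) records:

```python
from collections import Counter

def neighborhood_associations(log, target, TP_start, TP_end, TN):
    """Steps B1-B5 sketch: for each access time of 'target' within
    [TP_start, TP_end], take the neighborhood [t - TN, t + TN] and
    collect the other files accessed there (Inf1, Inf2, ...).
    Returns file names ordered by descending association."""
    # B1: access time points of the target file within TP
    points = [t for t, f in log if f == target and TP_start <= t <= TP_end]
    assoc = Counter()
    for t in points:
        # B2/B3: files accessed inside this time point's neighborhood
        neighborhood = {f for ts, f in log if abs(ts - t) <= TN and f != target}
        assoc.update(neighborhood)       # B4: co-occurrence as association
    # B5: to-be-prefetched queue, sorted by descending association
    return [f for f, _ in assoc.most_common()]

log = [(1, "a"), (2, "b"), (3, "x"), (10, "b"), (11, "x"), (30, "c")]
queue = neighborhood_associations(log, target="x", TP_start=0, TP_end=20, TN=2)
# "b" co-occurs in both neighborhoods of "x", "a" in one, "c" in none
```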
B6. The pairwise similarities between the files in the to-be-prefetched queue are computed with the cosine-similarity algorithm, whose steps are: preprocessing → selection of text feature terms → weighting → generation of a vector-space model → computation of the cosine. The resulting cosine value is the similarity of the two files.
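The cosine-similarity computation of step B6 can be sketched as below. Tokenization by whitespace and raw term counts as weights are simplifying assumptions, since the step leaves the feature-selection and weighting schemes open:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Step B6 sketch: preprocessing -> feature-term selection ->
    weighting -> vector-space model -> cosine.  Terms are whitespace
    tokens; weights are raw term counts."""
    va = Counter(text_a.lower().split())   # vector for file A
    vb = Counter(text_b.lower().split())   # vector for file B
    terms = set(va) | set(vb)              # axes of the vector space
    dot = sum(va[t] * vb[t] for t in terms)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)         # the cosine is the similarity

same = cosine_similarity("cache file read", "cache file read")   # identical texts
disjoint = cosine_similarity("cache file", "network latency")    # no shared terms
```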
B7. The files in the to-be-prefetched queue are recombined; using the inter-file similarities, an existing compression algorithm is applied to each combination, and the new combinations whose compressed total size is M, or slightly smaller than M, are extracted. The specific steps are as follows:
B7-1. Let FN be the number of files in the to-be-prefetched queue. X files are taken from the FN files, ensuring that the size of the X files after compression is greater than M. Recombining the FN files yields C(FN, X) combinations.
B7-2. Using the pairwise similarities computed in B6, the compressed size of the two most similar files in each combination is computed.
B7-3. From the combinations, the groups of files whose size after compression is equal to or slightly less than M are further screened out.
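The combination enumeration of B7-1 through B7-3 can be sketched as follows. The fixed compression ratio and the 0.9*M lower margin for "slightly less than M" are assumptions standing in for a real compressor:

```python
from itertools import combinations

def candidate_groups(sizes, X, M, ratio=0.5):
    """B7 sketch: enumerate the C(FN, X) combinations of X files out of
    the FN files in the to-be-prefetched queue, estimate each
    combination's compressed size, and keep the groups whose size is
    equal to, or slightly less than, M."""
    names = list(sizes)
    groups = []
    for combo in combinations(names, X):          # C(FN, X) combinations
        compressed = ratio * sum(sizes[n] for n in combo)
        if 0.9 * M <= compressed <= M:            # equal to or slightly below M
            groups.append(combo)
    return groups

sizes = {"f1": 40, "f2": 60, "f3": 100, "f4": 110}   # raw sizes (illustrative)
groups = candidate_groups(sizes, X=2, M=50)          # pairs compressing to ~M
```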
B8. The total degree of association between each group of files from B7 and the cached file is computed, and the group with the largest total association is taken as the group of files to be prefetched.
B9. The group of files obtained in B8 is compressed with an existing compression technique, and the compressed files are then prefetched into the distributed cache; prefetching ends.
C. Evict files, implemented as follows:
When distributed memory is insufficient, some files must be evicted from memory. The eviction method is as follows.
C1. Two file queues are maintained in distributed memory: a queue Qt ordered by access time, and a queue Qf ordered by the number of accesses within time t.
C2. The file with the fewest accesses within time t is taken from Qf.
C3. With the middle position of Qt as the critical eviction point, the file with the fewest accesses within time t is located in Qt and its position judged. If the file lies behind the critical eviction point, i.e. its earliest access time is earlier than the earliest access time of the file at the critical eviction point, the file is evicted directly and eviction ends; otherwise step C4 is executed.
C4. The files accessed fewer than the threshold M times within time t are taken from Qf, and C3 is executed.
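The two-queue eviction of steps C1–C4 can be sketched as below; the per-file record layout (name, earliest access time, access count within the last t seconds) is an assumed representation of what the distributed cache would track:

```python
def evict(files, threshold_M):
    """Steps C1-C4 sketch.  Qt orders files by access time, Qf by the
    access count within the last t seconds; the middle of Qt is the
    critical eviction point."""
    Qt = sorted(files, key=lambda f: f["earliest_access"])   # C1: time order
    Qf = sorted(files, key=lambda f: f["recent_count"])      # C1: frequency order
    critical = Qt[len(Qt) // 2]                              # critical eviction point

    def evictable(candidate):
        # behind the critical point: accessed earlier than the critical file
        return candidate["earliest_access"] < critical["earliest_access"]

    coldest = Qf[0]                                          # C2: fewest recent accesses
    if evictable(coldest):                                   # C3
        return coldest["name"]
    for f in Qf:                                             # C4: all below threshold M
        if f["recent_count"] < threshold_M and evictable(f):
            return f["name"]
    return None                                              # nothing safely evictable

files = [
    {"name": "old.log", "earliest_access": 5,  "recent_count": 0},
    {"name": "mid.dat", "earliest_access": 50, "recent_count": 3},
    {"name": "new.bin", "earliest_access": 90, "recent_count": 8},
]
victim = evict(files, threshold_M=5)   # old.log: coldest and oldest
```

Combining the two orderings, as the advantages section notes, keeps a recently cached but not-yet-reaccessed file from being evicted purely for its low access count.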
The present invention also proposes a file prefetching/caching system based on a distributed file system, as shown in Fig. 2, comprising:
a file caching module, configured to record, when a client accesses a file in the distributed file system, the access time and file information of the file in an access log, and to judge from the access time and the file information whether the file is to be cached;
a file prefetching module, configured to extract from the access log each access time point of the file within a time period TP, obtain a neighborhood of each access time point, compute by a clustering algorithm the degree of association between the file and the files accessed within the neighborhoods, store the files together with their degrees of association in a to-be-prefetched queue in string form, compute the pairwise similarities between files in the to-be-prefetched queue by a cosine-similarity algorithm, recombine the files in the to-be-prefetched queue, compress each combination in light of the similarities, compute the total degree of association between each combination and the file, and take the combination with the largest total degree of association as the group of files to be prefetched.
The file caching module includes setting a threshold N on the number of times the file is accessed, and judging from the access times the number num of accesses to the file within the current time window T; if num is greater than the threshold N, the file is cached in the distributed cache, and otherwise the file is not cached.
The neighborhood is the period obtained by taking a period TN before and after a time point tt.
The file prefetching module is further configured to compress the group of files to be prefetched and to prefetch the compressed files into the distributed cache.
The system further includes a file eviction module, configured to perform:
step 31: taking from queue Qf the file with the fewest accesses within time t, where Qf is the queue in the distributed cache ordered by the number of accesses within time t;
step 32: with the middle position of queue Qt as the critical eviction point, locating in Qt the file with the fewest accesses within time t and judging its position in Qt; if that file lies behind the critical eviction point and its earliest access time is earlier than the earliest access time of the file at the critical eviction point, evicting it, and otherwise invoking the file caching module, where Qt is the queue in the distributed cache ordered by access time;
step 33: taking from queue Qf the files accessed fewer than a threshold M times within time t, and executing step 32.