CN105930096B - Data block pre-caching method based on PageRank - Google Patents

Data block pre-caching method based on PageRank Download PDF

Info

Publication number
CN105930096B
CN105930096B (application CN201610227750.1A)
Authority
CN
China
Prior art keywords
data block
data
model
block
pagerank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610227750.1A
Other languages
Chinese (zh)
Other versions
CN105930096A (en)
Inventor
肖殷洪
刘震
王晨光
王天凯
王斌
王强富
郑峰弓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Travelsky Technology Co Ltd
Original Assignee
China Travelsky Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Travelsky Technology Co Ltd filed Critical China Travelsky Technology Co Ltd
Priority to CN201610227750.1A
Publication of CN105930096A
Application granted
Publication of CN105930096B
Expired - Fee Related
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659Command handling arrangements, e.g. command buffers, queues, command scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data block pre-caching method based on PageRank, comprising: collecting statistics on data block scheduling; building the model; updating the model; saving the model; loading the model; setting the model retention cycle H; ranking based on the PageRank algorithm; handling missing-block interrupts; and pre-loading data blocks. The invention is a solution to the degraded service performance caused by frequent disk I/O on data blocks during big data processing and the resulting low data block cache hit rate. It can be widely applied in big data processing: data block scheduling is recorded in real time, and, based on spatial locality, temporal locality, and the closeness between data blocks computed by the PageRank algorithm, data blocks are proactively pushed into the cache ahead of time, which raises the data block cache hit rate and greatly improves service performance.

Description

Data block pre-caching method based on PageRank
Technical field
The invention belongs to the field of server data caching technology, and in particular relates to a data block pre-caching method based on PageRank.
Background technique
With the development of the Internet and the mobile Internet, a large number of Internet applications rely on processing massive amounts of data to provide services. Emerging Internet applications exhibit data storage and access characteristics that differ from traditional applications. At present, large Internet companies mostly store data by merging many files into data blocks of fixed size on disk. This storage mode facilitates the management of data by the enterprise; however, when massive numbers of data blocks are processed, the data block cache policy often leads to a low cache hit rate and to repeated disk I/O, which seriously reduces service performance. Most current approaches to this problem keep adding physical memory, use multi-level caches, or add solid-state disks. These approaches essentially improve performance through hardware, and memory, cache, and solid-state disks are expensive, which incurs a large hardware cost. No method for data block pre-caching or active caching based on the PageRank algorithm has been found so far.
Summary of the invention
To solve the above problems, the purpose of the present invention is to provide a data block pre-caching method based on PageRank.
A data block pre-caching method based on PageRank, characterized by comprising the following steps:
Step 1: determine whether a data model exists in the system; if a data model exists, load the data model from the specified directory, initialize the PageRank value of each data block according to the data model, and then go to step 6; otherwise go to step 2;
Step 2: initialize the parameters required by the data model;
Step 3: collect statistics on data block usage within the time window Δt and generate the relational matrix A between data blocks;
Step 4: from the relational matrix A, generate the probability transfer matrix M and compute the PageRank value of each data block according to the PageRank formula V', completing model construction; then simulate according to the PageRank value of each data block, the specific simulation process being: mark the N data blocks with the highest PageRank values and load these N data blocks into the cache; within Δt1, count the number of accesses n1 to the marked N data blocks and the total number of data block accesses n made by the computer; the hit rate in Δt1 is then p = n1/n (a sketch of this simulation is given after step 11);
Step 5: determine whether the hit rate p exceeds the preset hit rate P; if so, put the model into actual production; otherwise execute step 3;
Step 6: load the N data blocks with the highest PageRank values into the cache;
Step 7: collect statistics on data block usage and update the relational matrix A;
Step 8: during data processing, when reading data, determine whether the data is in the cache; if so, the computer reads the data from the cached data block; otherwise a missing-block interrupt occurs and step 11 is executed;
Step 9: determine whether it is time to save the model; if the time has come, save or update the model; otherwise execute step 7;
Step 10: save or update the current data model to disk, so that the computer can load the model directly when it restarts;
Step 11: determine whether the number of missing-block interrupts MBIs exceeds the value C set at initialization; if so, first load CulBlock (the data block currently being accessed) into the cache, and update the model at the same time; otherwise load CulBlock into the cache, record the data block loaded into the cache, and increment the missing-block interrupt count MBIs by 1.
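The simulation in step 4 reduces to counting how many accesses in Δt1 hit the N marked blocks. A minimal sketch in Python, assuming the access trace for Δt1 is available as a list of block identifiers and the PageRank values as a dictionary (these names and data structures are illustrative, not taken from the patent):

```python
def simulate_hit_rate(pagerank, trace_dt1, N):
    """Estimate the hit rate p = n1 / n for the N highest-PageRank blocks.

    pagerank : dict mapping block id -> PageRank value
    trace_dt1: list of block ids accessed during the window dt1
    N        : number of data blocks the cache is allowed to hold
    """
    # Mark the N blocks with the highest PageRank values (the simulated cache).
    cached = set(sorted(pagerank, key=pagerank.get, reverse=True)[:N])
    n = len(trace_dt1)                               # total block accesses in dt1
    n1 = sum(1 for b in trace_dt1 if b in cached)    # accesses that hit marked blocks
    return n1 / n if n else 0.0
```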
Further, in step 3, the relational matrix A is the n × n matrix A = [a_{k,q}] (1 ≤ k, q ≤ n), where a_{k,q} is the number of times data block q is accessed at the next moment after data block k is accessed, a_{k,q} = 0 when k = q, a_{k,q} represents the degree of relationship between data block k and data block q, and n is the total number of data blocks used by the computer.
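As an illustration of how the relational matrix A of step 3 might be accumulated from an observed access sequence, the following sketch assumes that blocks are indexed 0..n−1 and that the trace for the window Δt is a Python list; these conventions are assumptions made for illustration, not specified by the patent:

```python
import numpy as np

def build_relational_matrix(trace, n):
    """Build A where A[k, q] counts how often block q follows block k in the trace.

    trace: sequence of block indices (0..n-1) observed during the window dt
    n    : total number of data blocks used by the computer
    """
    A = np.zeros((n, n))
    for k, q in zip(trace, trace[1:]):   # consecutive accesses: k, then q
        if k != q:                       # a_{k,q} = 0 when k == q
            A[k, q] += 1
    return A
```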
Further, in step 4, the PageRank formula V' is:
V' = αMV + (1 − α)e,
where α is the probability that the computer uses a data block directly from the cache, 1 − α is the probability that a missing-block interrupt occurs and the data block is used from disk, M is the probability transfer matrix derived from the relational matrix A, V is the vector of PageRank values in each iteration, and e is the vector whose elements are all 1/n, with n the total number of data blocks used by the computer.
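A minimal power-iteration sketch of V' = αMV + (1 − α)e. The construction of M by row-normalizing A and transposing, the default damping value 0.85, and the convergence tolerance are illustrative assumptions; the patent only states that M is generated from A:

```python
import numpy as np

def pagerank_values(A, alpha=0.85, tol=1e-8, max_iter=100):
    """Iterate V' = alpha * M @ V + (1 - alpha) * e until V stabilizes.

    A     : n x n relational matrix, A[k, q] = times block q followed block k
    alpha : probability of serving a data block directly from the cache
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    row_sums = A.sum(axis=1, keepdims=True)
    # P[k, q] = probability that q is accessed right after k (row-normalized A);
    # rows with no observed successor fall back to a uniform distribution.
    P = np.divide(A, row_sums, out=np.full_like(A, 1.0 / n), where=row_sums > 0)
    M = P.T                                  # probability transfer matrix
    e = np.full(n, 1.0 / n)                  # vector with all elements 1/n
    V = e.copy()
    for _ in range(max_iter):
        V_next = alpha * (M @ V) + (1 - alpha) * e
        if np.abs(V_next - V).sum() < tol:
            return V_next
        V = V_next
    return V
```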
The features of the data block pre-caching method based on PageRank provided by the invention are as follows:
1. The invention places data blocks into the data block cache in advance through an active, software-level strategy, avoiding disk I/O during data block processing and improving the cache hit rate.
2. The invention is proposed mainly for the case where most data files are merged into data blocks of fixed size, in order to solve the problem of enterprises processing massive numbers of large data blocks quickly; it is not applicable to traditional operating-system-level caches.
3. The invention mainly collects statistics on data block scheduling in a real-time environment and, based on spatial locality and temporal locality together with an optimized PageRank ranking, loads data blocks that are to be processed or have not yet been processed into the cache in advance, while continuously updating the data model in the real-time environment, giving the method good dynamic behavior.
The present invention has the following advantages and beneficial effects:
The data block pre-caching method based on PageRank is a solution to the degraded service performance caused by frequent disk I/O on data blocks during big data processing and the resulting low data block cache hit rate. It can be widely applied in big data processing: by recording data block scheduling in real time and then using spatial locality, temporal locality, and the closeness between data blocks computed by the PageRank algorithm, data blocks are proactively pushed into the cache in advance, which raises the data block cache hit rate and greatly improves service performance.
Detailed description of the invention
Fig. 1 is the schematic diagram of the actual production environment of the preferred embodiment of the present invention;
Fig. 2 is a schematic diagram of data block eviction in the preferred embodiment of the present invention;
Fig. 3 is the flow chart of the preferred embodiment of the present invention.
Specific embodiment
The data block pre-caching method based on PageRank provided by the invention is described in detail below with reference to the drawings and specific embodiments.
Fig. 1, Fig. 2 and Fig. 3 together illustrate the core idea of the data block pre-caching method based on PageRank provided by the invention.
The data block pre-caching method based on PageRank provided by the invention is implemented in an actual production environment. As shown in Fig. 1, the actual production environment comprises: a data processing centre, memory, a disk, and a model factory.
Data processing centre: the centre that processes massive data, including the computation of the model and of the PageRank value of each data block;
Memory: caches the data blocks with high PageRank values, providing cache support for the data processing centre, improving the hit rate of the data processing centre, and reducing disk I/O;
Disk: stores massive data in the form of data blocks, and serves as the persistent storage device for the data model;
Model factory: mainly comprises four modules: a logger, a counter, a metadata manager, and a scheduler. The logger is mainly used to record data block usage in the data processing centre and to build the relational matrix between data blocks; the scheduler mainly manages the loading and eviction of data blocks in the cache according to the data model, the building, loading, updating and saving of the data model, and the initialization and updating of the various parameters; the counter is mainly used to count missing-block interrupts, the model retention cycle, and so on; the metadata manager is mainly used to manage the metadata of the data blocks in memory.
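A structural sketch of the four model factory modules; the class and method names below are illustrative assumptions rather than an interface defined by the patent:

```python
from collections import defaultdict

class Logger:
    """Records data block usage and accumulates the relational matrix A."""
    def __init__(self):
        self.A = defaultdict(lambda: defaultdict(int))
        self.prev = None
    def record_access(self, block_id):
        if self.prev is not None and self.prev != block_id:
            self.A[self.prev][block_id] += 1   # block_id was accessed right after self.prev
        self.prev = block_id

class Counter:
    """Counts missing-block interrupts and tracks the model retention cycle H."""
    def __init__(self, retention_cycle_h):
        self.missing_block_interrupts = 0
        self.retention_cycle_h = retention_cycle_h

class MetadataManager:
    """Manages the metadata of the data blocks currently held in memory."""
    def __init__(self):
        self.cached_blocks = {}

class Scheduler:
    """Loads/evicts cache blocks according to the data model and maintains the model."""
    def __init__(self, logger, counter, metadata):
        self.logger, self.counter, self.metadata = logger, counter, metadata
```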
As shown in Fig. 2, the data block eviction modes of the data block pre-caching method based on PageRank provided by the invention include two modes, (a) and (b):
(a) in Fig. 2 shows the eviction mode used when a missing-block interrupt occurs in the S12 and S13 stages of Fig. 3: for a data block loaded from disk, the data block with the smallest PageRank value in the Cache is evicted;
(b) in Fig. 2 shows the load-and-evict mode based on the model computed with the PageRank algorithm: the N data blocks with the highest PageRank values (N being the number of data blocks that memory allows to be cached) are loaded into the cache; if a data block to be loaded is already in memory it is not loaded again but only marked; only data blocks that are not in memory are loaded into memory;
Explanation of Fig. 2: each grid represents a data block; CulBlock denotes the data block currently being accessed; StandByBlock denotes a data block (cache block) that is likely to be accessed at the next moment.
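A minimal sketch of eviction mode (a): when a missing-block interrupt occurs, the block fetched from disk replaces the cached block with the smallest PageRank value. Representing the cache as a set of block ids is an assumption made for illustration:

```python
def handle_missing_block(cache, pagerank, cul_block, capacity):
    """cache: set of cached block ids; cul_block: block just read from disk."""
    if len(cache) >= capacity:
        victim = min(cache, key=lambda b: pagerank.get(b, 0.0))  # smallest PageRank value
        cache.remove(victim)
    cache.add(cul_block)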
As shown in Fig. 3, the data block pre-caching method based on PageRank provided by the invention comprises the following steps, executed in order:
Step 1, stage S1, determine whether the system has a data model: if a data model exists, load the data model from the specified directory; otherwise go to the following step to initialize the data model;
Step 2, stage S2, initialize data parameters: initialize the parameters required by the data model;
Step 3, stage S3, record cache statistics within the time window Δt: collect statistics on data block usage within Δt and generate the relational matrix A between data blocks;
Step 4, stage S4, construct the data model, simulate loading into memory, and compute the hit rate: from the relational matrix A generated in stage S3, generate the corresponding probability transfer matrix M and compute the PageRank value of each data block according to the PageRank formula V', completing model construction; then simulate according to the PageRank value of each data block (mark the N data blocks with the highest PageRank values, assume they are loaded into memory, and count, within Δt1, the number of accesses n1 to the marked N data blocks and the total number of data block accesses n2 made by the computer); the hit rate in Δt1 is then p = n1/n2;
Step 5, stage S5, determine whether the hit rate exceeds P: check whether the hit rate p computed in stage S4 exceeds the preset hit rate P; if so, put the model into actual production; otherwise continue to refine the model of the initialization phase;
Step 6, stage S6, load the N most relevant data blocks into the cache according to the data model: load the N data blocks with the highest PageRank values (N being the number of data blocks memory allows) into the cache, loading only the data blocks that are not already in the current cache;
Step 7, stage S7, data processing: in this stage the computer processes data while continuing to collect statistics on data block usage and updating the relational matrix A;
Step 8, stage S8, determine whether the data used by the computer is in the cache: during data processing, when reading data the computer checks whether the data is in the cache; if so, the computer reads the data from the cached data block; otherwise a missing-block interrupt occurs and stage S11 is executed;
Step 9, stage S9, determine whether it is time to save the model: if the time has come, save or update the model; otherwise continue data processing;
Step 10, stage S10, save or update the data model: save or update the current data model to disk, so that the computer can load the model directly when it restarts;
Step 11, stage S11, determine whether the number of missing-block interrupts exceeds the value set at initialization: check whether the number of missing-block interrupts MBIs exceeds the value C set at initialization; if so, first load the CulBlock data block into memory and update the model at the same time; otherwise simply load the CulBlock data block into memory, record the data block loaded into the cache, and increment the missing-block interrupt count MBIs by 1.
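The S6–S11 stages can be condensed into a single service loop. The sketch below reuses the helpers from the earlier sketches (Logger, Counter, handle_missing_block) and injects the model-update and persistence routines as callables; all of these names, and the exact ordering of the checks, are illustrative assumptions rather than the patent's definitive flow:

```python
def runtime_loop(requests, cache, pagerank, logger, counter, N, C, capacity,
                 update_model, save_model, time_to_save):
    """Serve a stream of block requests while maintaining the model (stages S6-S11)."""
    # S6: pre-load the N highest-PageRank blocks into the cache.
    for block in sorted(pagerank, key=pagerank.get, reverse=True)[:N]:
        cache.add(block)
    for block in requests:                      # S7: data processing continues
        logger.record_access(block)             # keep updating the relational matrix A
        if block not in cache:                  # S8: cache miss -> missing-block interrupt
            # S11 / Fig. 2(a): load CulBlock, evicting the lowest-PageRank block if full.
            handle_missing_block(cache, pagerank, block, capacity)
            counter.missing_block_interrupts += 1
            if counter.missing_block_interrupts > C:
                pagerank = update_model(logger)   # refresh PageRank values from new statistics
        if time_to_save():                      # S9/S10: periodically persist the model
            save_model(pagerank)
```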
The embodiments of the present invention have been described in detail above, but the content is only a preferred embodiment of the invention and should not be regarded as limiting the scope of the invention. Any equivalent changes and improvements made within the scope of the present application shall still fall within the scope of this patent.

Claims (1)

1. A data block pre-caching method based on PageRank, characterized by comprising the following steps:
Step 1: determine whether a data model exists in the system; if a data model exists, load the data model from the specified directory, initialize the PageRank value of each data block according to the data model, and then go to step 6; otherwise go to step 2;
Step 2: initialize the parameters required by the data model;
Step 3: collect statistics on data block usage within the time window Δt, and generate the relational matrix A between data blocks according to the order in which the data blocks are accessed, where A is the n × n matrix A = [a_{k,q}] (1 ≤ k, q ≤ n);
parameters in the formula: a_{k,q} is the number of times data block q is accessed at the next moment after data block k is accessed, a_{k,q} = 0 when k = q, a_{k,q} represents the degree of relationship between data block k and data block q, and n is the total number of data blocks used by the computer;
Step 4: from the relational matrix A, generate the probability transfer matrix M and compute the PageRank value of each data block according to the PageRank formula V', completing model construction; then simulate according to the PageRank value of each data block, the specific simulation process being: mark the N data blocks with the highest PageRank values and load these N data blocks into the cache; within Δt1, count the number of accesses n1 to the marked N data blocks and the total number of data block accesses n made by the computer; the hit rate in Δt1 is then p = n1/n;
V' is expressed as:
V' = αMV + (1 − α)e
parameters in the formula: α is the probability that the computer uses a data block directly from the cache, 1 − α is the probability that a missing-block interrupt occurs and the computer uses the data block from disk, M is the probability transfer matrix of the relational matrix A, V is the vector of PageRank values in each iteration, and e is the vector whose elements are all 1/n, where n is the total number of data blocks used by the computer;
Step 5: determine whether the above hit rate p exceeds the preset hit rate P; if so, put the model into actual production; otherwise execute step 3;
Step 6: load the N data blocks with the highest PageRank values into the cache;
Step 7: collect statistics on data block usage and update the relational matrix A;
Step 8: during data processing, when reading data, determine whether the data is in the cache; if so, the computer reads the data from the cached data block; otherwise a missing-block interrupt occurs and step 11 is executed;
Step 9: determine whether it is time to save the model; if the time has come, save or update the model; otherwise execute step 7;
Step 10: save or update the current data model to disk, so that the computer can load the model directly when it restarts;
Step 11: determine whether the number of missing-block interrupts MBIs exceeds the value C set at initialization; if so, first load the data block currently being accessed into the cache, and update the model at the same time; otherwise load the data block currently being accessed into the cache, record the data block loaded into the cache, and increment the missing-block interrupt count MBIs by 1.
CN201610227750.1A 2016-04-12 2016-04-12 Data block pre-caching method based on PageRank Expired - Fee Related CN105930096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610227750.1A CN105930096B (en) 2016-04-12 2016-04-12 Data block pre-caching method based on PageRank

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610227750.1A CN105930096B (en) 2016-04-12 2016-04-12 Data block pre-caching method based on PageRank

Publications (2)

Publication Number Publication Date
CN105930096A CN105930096A (en) 2016-09-07
CN105930096B (en) 2019-01-11

Family

ID=56839005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610227750.1A Expired - Fee Related CN105930096B (en) 2016-04-12 2016-04-12 Data block pre-caching method based on PageRank

Country Status (1)

Country Link
CN (1) CN105930096B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952110B (en) * 2023-03-09 2023-06-06 浪潮电子信息产业股份有限公司 Data caching method, device, equipment and computer readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1932817A (en) * 2006-09-15 2007-03-21 陈远 Common interconnection network content keyword interactive system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9734063B2 (en) * 2014-02-27 2017-08-15 École Polytechnique Fédérale De Lausanne (Epfl) Scale-out non-uniform memory access

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1932817A (en) * 2006-09-15 2007-03-21 陈远 Common interconnection network content keyword interactive system

Also Published As

Publication number Publication date
CN105930096A (en) 2016-09-07


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190111

CF01 Termination of patent right due to non-payment of annual fee