CN108416054A

CN108416054A - Dynamic HDFS replica count calculation method based on file access heat

Info

Publication number: CN108416054A
Application number: CN201810228575.7A
Authority: CN
Inventors: 代钰; 杨雷; 化红翠; 王际烽; 张斌
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2018-03-20
Filing date: 2018-03-20
Publication date: 2018-08-17
Anticipated expiration: 2038-03-20
Also published as: CN108416054B

Abstract

The present invention provides a dynamic HDFS replica count calculation method based on file access heat and relates to the technical field of data analysis. The method first uses an improved Markov model to analyze how the access heat of hot files changes over time and, using the file access heat formula, predicts the access heat of each file. It then applies a queueing theory algorithm to give a formula for the replica count and dynamically adjusts the number of replicas of hot files. The method resolves the access bottleneck on hot files and improves the service efficiency of the cluster.

Description

Dynamic HDFS replica count calculation method based on file access heat

Technical field

The present invention relates to the technical field of data analysis, and in particular to a dynamic HDFS replica count calculation method based on file access heat.

Background art

With the development of modern Internet technology and the progress of science and technology, data, characterized by large volume, variety, high velocity and veracity, has penetrated every industry and field of social development. Given the growth of massive data, managing data and resources reasonably while ensuring data reliability has become a key problem facing the field of cloud computing.

Hadoop, the distributed system infrastructure developed by the Apache Software Foundation, implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access). In the HDFS replica management mechanism, the cluster by default keeps 3 replicas of every data block of a file. This cannot satisfy the differing access demands that different users place on different files: when accesses to a particular file increase, the default number of block replicas cannot respond to the large volume of access requests, which causes an access bottleneck on hot files. Current replica management methods are gradually shifting from static replica creation strategies to dynamic ones, so that when the external environment changes the overall performance of the cluster stays stable, or the cluster can keep serving clients efficiently. However, dynamic replica creation strategies still leave out some factors that significantly affect the working efficiency of the cluster.

In the prior art, the document "Research on efficient multi-replica management in cloud environments" proposes a dynamic replica creation method aimed at guaranteeing the cost-effectiveness of large-scale cloud storage systems. It considers the relationship between replica count and availability and adjusts the replica count on the premise that the availability of the cloud storage system is met, but it does not consider the relationship between file access heat and replica count. The document "An Elastic Replication Management System for HDFS" proposes an active/standby storage model to manage HDFS replicas elastically. It uses a complex transaction processing engine to identify the volume of data accessed in real time, adjusts the replica count dynamically, and introduces erasure codes to manage the replica count. Although this system effectively improves the performance of HDFS, its implementation is complicated, and identifying real-time access data has high complexity.

Summary of the invention

In view of the defects of the prior art, the present invention provides a dynamic HDFS replica count calculation method based on file access heat, which computes the number of replicas dynamically.

A dynamic HDFS replica count calculation method based on file access heat includes the following steps:

Step 1: From the file access log table on the distributed file system HDFS, compute the access heat of each file in each statistical period using the file access heat formula. Sort the files in descending order of their total access heat over the statistics window, select the top 20% of the sorted list as hot files, and build the hot file-access heat sequence as the sequence to be predicted, which is used to predict access heat;

The file access heat formula is as follows:

Wherein Hot(f) denotes the access heat of file f; AF(f) = N/T denotes the access frequency of file f, where N is the number of accesses to file f within the statistical period T; f_size denotes the size of file f; and the number of data blocks of file f is obtained from f_size and the data block size of file f using the floor operation, which returns the largest integer not greater than its argument;
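The ranking part of Step 1 can be sketched as follows. This is a minimal illustration in Python, assuming the per-period heat values have already been computed with the heat formula; the function name `select_hot_files`, the file names other than flu.txt and all heat values other than those of flu.txt are hypothetical.

```python
from typing import Dict, List


def select_hot_files(heat_by_file: Dict[str, List[float]],
                     top_fraction: float = 0.2) -> List[str]:
    """Rank files by summed access heat and keep the top fraction."""
    # Sum each file's per-period heat values over the statistics window.
    totals = {name: sum(values) for name, values in heat_by_file.items()}
    # Descending order of total heat; ties are broken by name for determinism.
    ranked = sorted(totals, key=lambda name: (-totals[name], name))
    # Keep at least one file so that a small catalogue still yields a hot set
    # (an assumption of this sketch, not stated in the text).
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]


# Hypothetical input: only the flu.txt values come from the embodiment.
heat = {
    "flu.txt": [262, 486, 632, 300, 570],
    "a.log": [10, 12, 9, 11, 8],
    "b.csv": [120, 90, 150, 80, 60],
    "c.dat": [5, 4, 6, 3, 2],
    "d.bin": [40, 35, 50, 45, 30],
}
hot_files = select_hot_files(heat)
```

With five files and a 20% cut, exactly one file is kept; the heat sequences of the selected files then serve as the sequences to be predicted.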

Step 2: Divide the state space of the hot file-access heat sequence using a hierarchical clustering algorithm. The specific method is:

The hot file-access heat sequence forms a data set of length N, in which each object represents the access heat of the hot file at a different moment. Hierarchical clustering of this data set proceeds as follows:

(1) Treat each object in the data set as its own class, giving N classes. The distance between two classes is the median of the squared pairwise distances between the data points of the two classes;

(2) Merge the two closest classes into one class, so that the total number of classes decreases by one;

(3) Recompute the distances between the new class and the other classes;

(4) Repeat steps (2) and (3) until all data objects in the data set have been merged into a single class;

These steps yield the clustering tree of the hot file-access heat sequence, and the state space of the Markov partition is defined from the structure of the clustering tree;
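The clustering steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: instead of building the full tree and cutting it afterwards, it stops merging once `n_states` classes remain, which corresponds to cutting the tree at that level. The names `class_distance` and `cluster_states` are hypothetical; the sample values are the first six heat values of the embodiment.

```python
from itertools import product
from statistics import median
from typing import List


def class_distance(a: List[float], b: List[float]) -> float:
    """Median of the squared pairwise distances between two classes."""
    return median((x - y) ** 2 for x, y in product(a, b))


def cluster_states(values: List[float], n_states: int) -> List[List[float]]:
    """Agglomerative clustering, stopped when n_states classes remain."""
    # (1) Every observation starts as its own class.
    classes = [[v] for v in values]
    while len(classes) > n_states:
        # (2) Find the two closest classes and merge them.
        i, j = min(
            ((i, j) for i in range(len(classes))
             for j in range(i + 1, len(classes))),
            key=lambda pair: class_distance(classes[pair[0]], classes[pair[1]]),
        )
        merged = classes[i] + classes[j]
        # (3) Distances to the new class are recomputed on the next pass.
        classes = [c for k, c in enumerate(classes) if k not in (i, j)] + [merged]
    # Order the states by mean heat so that state labels are monotone.
    return sorted((sorted(c) for c in classes), key=lambda c: sum(c) / len(c))


states = cluster_states([262, 486, 632, 300, 570, 401], n_states=3)
```

Recomputing every pairwise class distance on each pass is quadratic per merge, which is acceptable for the short heat sequences considered here; a production version would cache the distance matrix.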

Step 3: Test the hot file-access heat sequence, whose state space has been divided, for the Markov property. If the sequence satisfies the Markov property, use it as the input sequence of the improved Markov model; otherwise the sequence cannot be processed with the improved Markov model;

Step 4: Use the hot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predict the access heat of the hot file at the next moment, and write the predicted access heat into the hot file-access heat database table. The specific method is:

Step 4.1: Based on the divided state space, compute the one-step state transition probability matrix P from the file-access heat sequence;

Step 4.2: Set the state corresponding to the file access heat value at the current moment as the initial state distribution, denoted p(0), and compute the state probability distribution at the next moment from the one-step state transition probability matrix P as p(1) = p(0)P;

Step 4.3: Take the state with the largest probability in the next-moment state distribution p(1) as the state at the next moment, and take the sum of the standard deviation of the hot file-access heat sequence and the mean of the target state space as the predicted access heat value at the next moment;

Step 4.4: Remove the first value of the input sequence, append the newly predicted access heat value to the sequence to be predicted as the last value of the next prediction sequence, and repeat the above steps to predict the access heat of the hot file at the following moment;
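Steps 4.1 to 4.4 can be sketched as a sliding-window forecaster. The sketch below makes several assumptions that the text does not fix: states are represented by ascending interval bounds (`bounds`), the standard deviation is the population standard deviation, the "mean of the target state space" is the mean of the window values that fall in the predicted state, and an empty transition row falls back to the current state. All function names and sample values are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev
from typing import List, Sequence


def to_state(value: float, bounds: Sequence[float]) -> int:
    """Map a heat value to a state index using ascending interval bounds."""
    return bisect_right(bounds, value)


def predict_next(window: Sequence[float], bounds: Sequence[float]) -> float:
    n = len(bounds) + 1
    states = [to_state(v, bounds) for v in window]
    # Step 4.1: one-step transition counts (row-normalising them gives P).
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    current = states[-1]
    row = counts[current]
    # Step 4.2: p(0) is a unit vector at the current state, so p(1) = p(0)P
    # is row `current` of P; normalisation does not change the argmax.
    # Step 4.3: most probable next state (stay put if the row is empty).
    target = max(range(n), key=row.__getitem__) if sum(row) else current
    members = [v for v, s in zip(window, states) if s == target]
    return pstdev(window) + mean(members)


def rolling_forecast(history: Sequence[float], bounds: Sequence[float],
                     steps: int) -> List[float]:
    window = list(history)
    forecasts = []
    for _ in range(steps):
        nxt = predict_next(window, bounds)
        forecasts.append(nxt)
        # Step 4.4: drop the oldest value and append the new prediction.
        window = window[1:] + [nxt]
    return forecasts


# Hypothetical two-state example: heat below 200 is state 0, otherwise state 1.
first = predict_next([100, 300, 100, 300, 100, 300], bounds=[200])
path = rolling_forecast([100, 300, 100, 300, 100, 300], bounds=[200], steps=2)
```

Because the window is refreshed after every prediction, the transition counts are rebuilt from the most recent data each time, which is the difference from a plain Markov forecast that keeps iterating a fixed matrix P.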

Step 5: Model replica access requests with the M/M/r single-queue, multi-server queueing model and, on this basis, set the throughput of a replica on a node in order to determine the number of replicas. The specific method is:

Step 5.1: Query the hot file-access heat database table to obtain the average access request rate λ of the replicas of the specified hot file in the next statistical period;

Step 5.2: Set the CPU utilization threshold U of the server hosting a replica. According to the CPU utilization law, CPU utilization equals the request arrival rate divided by the service rate. The request service rate μ of a single server is calculated with the following formula:

Step 5.3: Set the total cluster throughput constraint to Q. Based on Little's formula in queueing theory, the service sojourn time equals the reciprocal of the service rate multiplied by a further factor, and the throughput equals the reciprocal of the service sojourn time. In a homogeneous cluster environment the servers hosting the replicas have the same service rate, so the replica count r is calculated with the following two formulas:
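The two formulas for r are not reproduced in this text, so the following Python sketch does not implement them. It shows only the utilization law stated in Step 5.2 (read here as U = λ/μ) together with a generic textbook M/M/r sizing rule: choose the smallest r for which the per-server utilization λ/(rμ) stays within the threshold, and report the mean sojourn time from the Erlang C formula. The arrival rate of 120 requests/s is hypothetical, and treating the embodiment's per-replica throughput of 100/s as the service rate μ is an assumption of this sketch.

```python
import math


def service_rate(arrival_rate: float, utilization_threshold: float) -> float:
    """Utilization law U = lambda / mu, solved for mu."""
    return arrival_rate / utilization_threshold


def erlang_c(r: int, offered_load: float) -> float:
    """Probability that an arriving request has to wait in an M/M/r queue."""
    rho = offered_load / r
    if rho >= 1:
        return 1.0
    head = sum(offered_load ** k / math.factorial(k) for k in range(r))
    tail = offered_load ** r / (math.factorial(r) * (1 - rho))
    return tail / (head + tail)


def mean_sojourn_time(r: int, lam: float, mu: float) -> float:
    """Mean time in system (waiting plus service) for an M/M/r queue."""
    if lam >= r * mu:
        return math.inf  # the queue is unstable
    return erlang_c(r, lam / mu) / (r * mu - lam) + 1 / mu


def replicas_needed(lam: float, mu: float, max_utilization: float) -> int:
    """Smallest r with per-server utilization lam / (r * mu) <= threshold."""
    return max(1, math.ceil(lam / (mu * max_utilization)))


lam = 120.0  # hypothetical predicted request rate, requests per second
mu = 100.0   # assumed per-replica service rate, requests per second
r = replicas_needed(lam, mu, max_utilization=0.5)
sojourn = mean_sojourn_time(r, lam, mu)
```

With these inputs three replicas are needed, giving a per-server utilization of 0.4. Any rule that additionally enforces a total throughput constraint Q would sit on top of this calculation.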

As the above technical solution shows, the beneficial effects of the present invention are as follows. The dynamic HDFS replica count calculation method based on file access heat predicts the access heat of files with an improved Markov model, which improves prediction accuracy. The queueing-theory-based replica count calculation takes into account how the access heat of hot files changes over time and adjusts their replica count dynamically, so as to cope with highly concurrent access to hot files. Using queueing theory, the replicas stored on nodes are treated as service resources. By analyzing the request rate and response rate of hot file replicas, with guaranteed cluster throughput and reliability as the goal, the replica count is obtained from the replica calculation formula, which lays the groundwork for subsequent dynamic adjustment of the replica count.

Description of the drawings

Fig. 1 is a flowchart of the dynamic HDFS replica count calculation method based on file access heat provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the process of predicting the access heat of a hot file at the next moment with the improved Markov model, provided by an embodiment of the present invention;

Fig. 3 is a comparison of the values predicted by the Markov model, the values predicted by the improved Markov model and the actual values, provided by an embodiment of the present invention;

Fig. 4 is a comparison of the replica count calculated based on queueing theory and the replica count calculated from the actual replica throughput, provided by an embodiment of the present invention.

Detailed description of the embodiments

The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following embodiments illustrate the present invention and do not limit its scope.

In this embodiment, 3 racks are built with 4 virtual machines configured in each rack, and three additional physical machines are set up. One serves as the NameNode in the Active state and one as the NameNode in the Standby state, to prevent a NameNode single point of failure. The third physical machine serves as the compute node, which obtains the file access logs, predicts file access heat and calculates the replica count. The cluster configuration is: Hadoop version Hadoop-2.2.0, 32 GB memory, Intel(R) Core(TM) i3-2120 CPU @ 3.30 GHz, operating system CentOS-6.7, 2 TB hard disk, development languages Java, R and Matlab.

The dynamic HDFS replica count calculation method based on file access heat, as shown in Fig. 1, includes the following steps:

Step 1: From the file access log table on the distributed file system HDFS, compute the access heat of each file in each statistical period using the access heat formula. Sort the files in descending order of their total access heat over the statistics window, select the top 20% of the sorted list as hot files, and build the hot file-access heat sequence as the sequence to be predicted, which is used to predict access heat;

The file access heat formula is as follows:

Wherein Hot(f) denotes the access heat of file f; AF(f) = N/T denotes the access frequency of file f, where N is the number of accesses to file f within the statistical period T; f_size denotes the size of file f; and the number of data blocks of file f is obtained from f_size and the data block size of file f using the floor operation, which returns the largest integer not greater than its argument.

In this embodiment, 5 days form one statistical period, and the access frequency of the file flu.txt is counted over 5 periods in total. From the file access log table and the access heat formula, the access heat information of flu.txt within the statistical periods is calculated, as shown in Table 1.

Table 1. Access heat information of the file flu.txt

Access time	2017-08-01	2017-08-02	2017-08-03	2017-08-04	2017-08-05
Access heat	262	486	632	300	570
Access time	2017-08-06	…	…	2017-10-02	…
Access heat	401	…	…	382	…

(3) Recompute the distances between the new class and the other classes;

These steps yield the clustering tree of the hot file-access heat sequence, and the state space of the Markov partition is defined from the structure of the clustering tree.

In this embodiment, hierarchical clustering is used to divide the historical access heat into states: the historical data are divided into 5 states, and the data set is labeled with A, B, C, D and E.

The specific method of the Markov property test is:

For an index sequence X_n = {x₁, x₂, ..., x_n} with n possible states, the sum of column j of the transition frequency matrix divided by the sum over all rows and columns is called the marginal probability, given by:

$$p_{\cdot j} = \frac{\sum_{i} f_{ij}}{\sum_{i}\sum_{j} f_{ij}}$$

Wherein f_ij denotes the frequency with which the index sequence X_n = {x₁, x₂, ..., x_n} moves from state i to state j in one step, i, j ∈ E;

The test statistic then has the χ² distribution with (n-1)² degrees of freedom as its limiting distribution.

Given a significance level α, if the statistic exceeds the corresponding χ² quantile with (n-1)² degrees of freedom, the sequence X_n satisfies the Markov property; otherwise the sequence cannot be processed with a Markov model.
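The test can be sketched in Python as follows. The statistic itself is not reproduced in this text, so the sketch assumes the form commonly used for this test, χ² = 2 Σ_i Σ_j f_ij |ln(p_ij / p_.j)| with p_ij = f_ij / Σ_j f_ij; that formula, the function names and the sample sequence are assumptions. The critical value is passed in by the caller, for example from a χ² table.

```python
import math
from typing import List, Sequence


def transition_counts(states: Sequence[int], n: int) -> List[List[int]]:
    """One-step transition frequency matrix f_ij for states 0..n-1."""
    f = [[0] * n for _ in range(n)]
    for a, b in zip(states, states[1:]):
        f[a][b] += 1
    return f


def markov_chi_square(f: List[List[int]]) -> float:
    """Assumed statistic: 2 * sum_ij f_ij * |ln(p_ij / p_.j)|."""
    n = len(f)
    total = sum(map(sum, f))
    # Marginal probability p_.j: column sum divided by the grand total.
    marginal = [sum(f[i][j] for i in range(n)) / total for j in range(n)]
    stat = 0.0
    for i in range(n):
        row_sum = sum(f[i])
        for j in range(n):
            if f[i][j] > 0:  # zero counts contribute nothing
                p_ij = f[i][j] / row_sum
                stat += 2 * f[i][j] * abs(math.log(p_ij / marginal[j]))
    return stat


def has_markov_property(f: List[List[int]], critical_value: float) -> bool:
    """True when the statistic exceeds the chi-square critical value."""
    return markov_chi_square(f) > critical_value


# Hypothetical two-state sequence that alternates strictly between states.
seq = [0, 1] * 10
f = transition_counts(seq, 2)
# 2.706 is the tabulated 0.10 upper quantile for (2 - 1)^2 = 1 degree of freedom.
result = has_markov_property(f, critical_value=2.706)
```

For the five-state embodiment the degrees of freedom are (5 - 1)² = 16, so the critical value would be looked up for 16 degrees of freedom at α = 0.1.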

In this embodiment, the R language is used to compute the one-step transition frequency matrix f_ij and the transition probability matrix p_ij shown below, as well as the marginal probabilities p_.j shown in Table 2.

Table 2. Marginal probabilities

State	1	2	3	4	5
p_.j	0.17021277	0.42553191	0.17021277	0.08510638	0.14893617

The statistic is calculated from the values above, giving the χ² statistic calculation table shown in Table 3.

Table 3. χ² statistic calculation table

In this embodiment, the significance level is α = 0.1 and n = 5, and the statistic obtained from the χ² statistic calculation table is compared with the corresponding quantile. The historical access heat of this file therefore satisfies the Markov property, and a Markov model can be used to predict the access heat of the file.

Step 4: Use the hot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predict the access heat of the hot file at the next moment, and write the predicted access heat into the hot file-access heat database table, as shown in Fig. 2. The specific method is:

Step 4.4: Remove the first value of the input sequence, append the newly predicted access heat value to the sequence to be predicted as the last value of the next prediction sequence, and repeat the above steps to predict the access heat of the hot file at the following moment.

In this embodiment, to verify the prediction accuracy of the method, the improved and the unimproved Markov models are each used to predict the access heat of flu.txt over 5 periods, and the results are compared. Fig. 3 compares the values predicted by the Markov model, the values predicted by the improved Markov model and the actual values. As the figure shows, when the access heat at the first moment after the first period is predicted, both models work from the same access heat sequence, so their predictions deviate from the actual access heat by the same amount, and neither prediction differs much from the actual value. When later moments are predicted, the improved Markov model keeps updating the sequence to be predicted, so its predicted access heat stays close to the actual value. The unimproved Markov model, because of the ergodicity and stationary distribution of the Markov model itself, deviates increasingly from the actual value as the number of predictions grows. The results show that the access heat predicted by the improved Markov model is close to the actual value and that the model also predicts the access trend of hot files well.

Step 5.2: Set the CPU utilization threshold U of the server hosting a replica. According to the CPU utilization law, CPU utilization equals the request arrival rate divided by the service rate. The request service rate μ of a single server is calculated with the following formula:

In this embodiment, after the access heat of the hot file has been predicted with the improved Markov model, the node CPU utilization threshold is set to 0.5 and the throughput of a replica on a node is set to 100/s according to the access heat of the hot file. With 11 hours of access per day, the daily throughput is 100/s × 11 h × 3600 s/h, approximately 4 million requests. The replica count is calculated based on queueing theory and compared with the replica count obtained from the actual replica throughput, as shown in Fig. 4. As the figure shows, the replica count calculation, which takes into account the user request rate and the replica response rate, adjusts the replica count according to the trend of access heat. In the first period the access heat of the hot file trends downward, and the replica count obtained from queueing theory is lower than the actual replica count. In the subsequent periods the replica count is adjusted dynamically according to the trend of the hot file's access heat and stays close to the replica count calculated from the actual throughput, which demonstrates the validity of the method.

Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.

Claims

1. A dynamic HDFS replica count calculation method based on file access heat, characterized in that it includes the following steps:

Step 1: From the file access log table on the distributed file system HDFS, compute the access heat of each file in each statistical period using the file access heat formula; sort the files in descending order of their total access heat over the statistics window, select the top 20% of the sorted list as hot files, and build the hot file-access heat sequence as the sequence to be predicted, which is used to predict access heat;

Step 2: Divide the state space of the hot file-access heat sequence using a hierarchical clustering algorithm;

Step 3: Test the hot file-access heat sequence, whose state space has been divided, for the Markov property; if the sequence satisfies the Markov property, use it as the input sequence of the improved Markov model, otherwise the sequence cannot be processed with the improved Markov model;

Step 4: Use the hot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predict the access heat of the hot file at the next moment, and write the predicted access heat into the hot file-access heat database table;

Step 5: Model replica access requests with the M/M/r single-queue, multi-server queueing model and, on this basis, set the throughput of a replica on a node in order to determine the number of replicas.

2. The dynamic HDFS replica count calculation method based on file access heat according to claim 1, characterized in that the file access heat formula in Step 1 is as follows:

Wherein Hot(f) denotes the access heat of file f; AF(f) = N/T denotes the access frequency of file f, where N is the number of accesses to file f within the statistical period T; f_size denotes the size of file f; and the number of data blocks of file f is obtained from f_size and the data block size of file f using the floor operation, which returns the largest integer not greater than its argument.

3. The dynamic HDFS replica count calculation method based on file access heat according to claim 1, characterized in that the specific method of Step 2 is:

The hot file-access heat sequence forms a data set of length N, in which each object represents the access heat of the hot file at a different moment, and hierarchical clustering of this data set proceeds as follows:

(1) Treat each object in the data set as its own class, giving N classes; the distance between two classes is the median of the squared pairwise distances between the data points of the two classes;

(3) Recompute the distances between the new class and the other classes;

These steps yield the clustering tree of the hot file-access heat sequence, and the state space of the Markov partition is defined from the structure of the clustering tree.

4. The dynamic HDFS replica count calculation method based on file access heat according to claim 1, characterized in that the specific method of Step 4 is:

Step 4.1: Based on the divided state space, compute the one-step state transition probability matrix P from the file-access heat sequence;

Step 4.2: Set the state corresponding to the file access heat value at the current moment as the initial state distribution, denoted p(0), and compute the state probability distribution at the next moment from the one-step state transition probability matrix P as p(1) = p(0)P;

Step 4.3: Take the state with the largest probability in the next-moment state distribution p(1) as the state at the next moment, and take the sum of the standard deviation of the hot file-access heat sequence and the mean of the target state space as the predicted access heat value at the next moment;

Step 4.4: Remove the first value of the input sequence, append the newly predicted access heat value to the sequence to be predicted as the last value of the next prediction sequence, and repeat the above steps to predict the access heat of the hot file at the following moment.

5. The dynamic HDFS replica count calculation method based on file access heat according to claim 1, characterized in that the specific method of Step 5 is:

Step 5.1: Query the hot file-access heat database table to obtain the average access request rate λ of the replicas of the specified hot file in the next statistical period;

Step 5.2: Set the CPU utilization threshold U of the server hosting a replica; according to the CPU utilization law, CPU utilization equals the request arrival rate divided by the service rate, and the request service rate μ of a single server is calculated with the following formula:

Step 5.3: Set the total cluster throughput constraint to Q; based on Little's formula in queueing theory, the service sojourn time equals the reciprocal of the service rate multiplied by a further factor, and the throughput equals the reciprocal of the service sojourn time; in a homogeneous cluster environment the servers hosting the replicas have the same service rate, so the replica count r is calculated with the following two formulas:

。