CN108416054B

CN108416054B - Method for dynamically calculating the number of HDFS (Hadoop Distributed File System) copies based on file access heat

Info

Publication number: CN108416054B
Application number: CN201810228575.7A
Authority: CN
Inventors: 代钰; 杨雷; 化红翠; 王际烽; 张斌
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2018-03-20
Filing date: 2018-03-20
Publication date: 2021-10-22
Anticipated expiration: 2038-03-20
Also published as: CN108416054A

Abstract

The invention provides a method for dynamically calculating the number of HDFS (Hadoop Distributed File System) copies based on file access heat, and relates to the technical field of data analysis. The method first analyzes how the access heat of hot files changes over time using an improved Markov model, and predicts the access heat of the files according to a calculation formula of file access heat. It then derives a calculation formula for the number of copies using queuing theory and dynamically adjusts the number of copies of each hot file. The method relieves the access bottleneck on hot files and improves the service efficiency of the cluster.

Description

Method for dynamically calculating the number of HDFS (Hadoop Distributed File System) copies based on file access heat

Technical Field

The invention relates to the technical field of data analysis, and in particular to a method for dynamically calculating the number of HDFS (Hadoop Distributed File System) copies based on file access heat.

Background

With the development of modern internet technology and the progress of science and technology, data now permeates every industry and field of social development, characterized by high volume, variety, high velocity and veracity. Coping with the growth of mass data, managing data and resources reasonably, and guaranteeing data reliability have become key problems facing the cloud computing field.

Hadoop, the distributed system infrastructure developed by the Apache Foundation, implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed for deployment on low-cost hardware; it provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream. Under the default HDFS copy management mechanism, a cluster stores 3 copies of each data block of a file. This fixed number cannot meet the access requirements of different users for different files: when accesses to a certain file increase, the default number of copies of its data blocks cannot serve the large number of access requests, which creates an access bottleneck for hot files. Copy management methods are therefore gradually shifting from static copy creation policies to dynamic ones, so that the cluster can maintain its overall performance or keep serving clients efficiently when the external environment changes. However, existing dynamic copy creation policies still leave out some factors that significantly affect the working efficiency of the cluster.

In the prior art, the document "high-efficiency multi-copy management research in cloud environment" proposes a dynamic copy creation method that addresses the cost-benefit guarantee of large-scale cloud storage systems. It considers the relationship between the number of copies and availability, that is, it adjusts the number of copies on the premise of preserving the availability of the cloud storage system, but it does not consider the relationship between file access heat and the number of copies. The document "An Elastic Replication Management System for HDFS" proposes an active/standby storage model to manage HDFS copies flexibly. That method uses a complex transaction engine to identify the volume of data accessed in real time, dynamically adjusts the number of copies, and introduces erasure codes to manage the number of copies. Although the system effectively improves HDFS performance, its implementation is complex, and identifying real-time access data is costly.

Disclosure of Invention

To address the defects of the prior art, the invention provides a method for dynamically calculating the number of HDFS copies based on file access heat.

A method for calculating the number of copies of a dynamic HDFS based on file access heat comprises the following steps:

step 1, calculating according to a file access log table on a distributed file system HDFS and a calculation formula of file access heat to obtain access heat of each file in a statistical period, sorting the files in a descending order according to the sum of the access heat of the files in statistical time, selecting the first 20% of the files in the sorted list as hot files, and constructing a hot file-access heat sequence as a sequence to be predicted;

the calculation formula of the file access heat is shown as the following formula:

where hot(f) represents the access heat of file f; af(f) represents the access frequency of file f; N represents the number of accesses to file f within the statistical period T; the block-size term represents the data block size of file f; f_size represents the size of file f; and ⌊·⌋ denotes the largest integer not greater than its argument, which is used to obtain the number of data blocks of file f;

step 2, performing state space division on the hotspot file-access heat sequence by adopting a hierarchical clustering algorithm, wherein the specific method comprises the following steps:

forming a data set with the length of N by the hotspot file-access heat sequence, wherein objects in the data set represent the access heat of the hotspot files at different moments, and the process of hierarchically clustering the hotspot file-access heat data set comprises the following steps:

(1) regarding each object in the data set as a class, and obtaining N classes in total, wherein the distance between the classes is the middle value of the square of the distance between every two data points in the two classes;

(2) merging two classes with the nearest distance into one class, so that the total number of the classes is reduced by one;

(3) recalculating distances between the new class and other classes;

(4) repeating the steps (2) - (3) until all data objects in the data set are finally merged into one class;

based on the steps, obtaining a clustering tree of the hotspot file-access heat sequence, and defining a Markov division state space according to the clustering tree structure;

step 3, conducting Markov test on the hot spot file-access heat sequence divided into the state space, if the Markov test is satisfied, using the sequence as an input sequence of the improved Markov model, otherwise, the sequence can not be processed by the improved Markov model;

step 4, taking the hot file-access heat sequence meeting the Markov property as an input sequence of an improved Markov model, predicting the access heat of the hot file at the next moment, and writing the predicted access heat into a hot file-access heat database table, wherein the specific method comprises the following steps:

step 4.1: calculating to obtain a one-step state transition probability matrix P according to the file-access heat sequence based on the divided state space;

step 4.2: setting the state corresponding to the file access heat value at the current moment as the initial state distribution, denoted p(0), and calculating the state probability distribution at the next moment, p(1) = p(0)P, according to the one-step state transition probability matrix P;

step 4.3: taking the state with the maximum probability in the state probability distribution p(1) as the state at the next moment, and taking the sum of the standard deviation of the hotspot file-access heat sequence and the average value of the target state space as the predicted access heat value at the next moment;

step 4.4: removing the first value of the input sequence, and appending the newly predicted access heat value to the sequence to be predicted as its last value for the next prediction;

step 4.5: repeating the steps 4.1-4.4, and predicting the access heat of the hot spot file at the next moment;

step 5, modeling the copy access request based on the queue model of the M/M/r single-queue multi-service desk, and setting the throughput of the copies on the node to determine the number of the copies, wherein the specific method comprises the following steps:

step 5.1: obtaining the average access request rate λ of the copies of the specified hotspot file in the next statistical period by querying the hotspot file-access heat database table;

step 5.2: setting a CPU utilization threshold U for the server where the copy is located; according to the CPU utilization law, the CPU utilization equals the request arrival rate divided by the service rate, so the request service rate μ of a single server is calculated by the following formula:

$$\mu = \frac{\lambda}{U}$$

step 5.3: setting the total throughput constraint of the cluster to Q; based on Little's formula in queuing theory, the service dwell time is expressed in terms of the service rate, and the throughput equals the inverse of the service dwell time; in a homogeneous cluster environment, the servers where the multiple copies are located have the same service rate, so the number of copies r is calculated by the following two formulas:

The technical scheme above has the following beneficial effects. The method predicts the access heat of files with an improved Markov model, which improves prediction accuracy. The queuing-theory-based calculation of the number of copies takes into account how the access heat of hot files changes over time, and dynamically adjusts the number of copies to cope with highly concurrent access to hot files. With the queuing theory method, the copies stored on the nodes are treated as service resources, and the request rate and response rate of the hot file copies are analyzed with the goal of guaranteeing cluster throughput and reliability. The number of copies is then obtained from the copy calculation formula, which lays a foundation for the subsequent dynamic adjustment of the number of copies.

Drawings

FIG. 1 is a flowchart of the method for dynamically calculating the number of HDFS copies based on file access heat according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the process of predicting the access heat of a hot file at the next moment using the improved Markov model according to an embodiment of the present invention;

FIG. 3 is a graph comparing the predicted values of the Markov model and of the improved Markov model with the true values according to an embodiment of the present invention;

FIG. 4 is a graph comparing the number of copies calculated based on queuing theory with the number of copies calculated from the actual copy throughput according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

In this embodiment, 3 racks are built with 4 virtual machines configured on each rack, and three additional physical machines are set up. Two of them serve as the NameNode in the active state and the NameNode in the standby state, which prevents a single point of failure of the NameNode. The third physical machine serves as the computing node that collects the file access log, predicts the file access heat and calculates the number of copies. The cluster configuration is: Hadoop version hadoop-2.2.0; memory 32 GB; CPU Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz; operating system CentOS 6.7; hard disk 2 TB; development languages Java, R and Matlab.

The method for calculating the number of the copies of the dynamic HDFS based on the file access heat comprises the following steps as shown in FIG. 1:

step 1, calculating according to a file access log table on a distributed file system HDFS and a calculation formula of access heat to obtain access heat of each file in a statistical period, sorting the files in a descending order according to the sum of the access heat of the files in statistical time, selecting the first 20% of the files in the sorted list as hot files, and constructing a hot file-access heat sequence as a sequence to be predicted;

where ⌊·⌋ denotes the largest integer not greater than its argument, which is used to obtain the number of data blocks of file f.

In this embodiment, 5 days are taken as one statistical period, and the access frequency of the file flu.txt is counted over 5 periods. The access heat of flu.txt in the statistical period, calculated from the file access log table and the access heat formula, is shown in Table 1.

Table 1. Access heat information for flu.txt

Access date	2017-08-01	2017-08-02	2017-08-03	2017-08-04	2017-08-05
Access heat	262	486	632	300	570
Access date	2017-08-06	...	...	2017-10-02	...
Access heat	401	...	...	382	...
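The ranking and selection in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the per-period access heat values have already been computed, the function name `select_hot_files` and every file other than flu.txt are hypothetical, and rounding the 20% cut upward is an assumption.

```python
import math


def select_hot_files(heat_by_file, fraction=0.2):
    """Rank files by total access heat and keep the top `fraction` of them."""
    # Sum the access heat of each file over the statistical time.
    totals = {name: sum(series) for name, series in heat_by_file.items()}
    # Sort file names in descending order of total heat.
    ranked = sorted(totals, key=totals.get, reverse=True)
    # Assumption: round the cut-off up and keep at least one file.
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


heat_by_file = {
    "flu.txt": [262, 486, 632, 300, 570],  # first five values of Table 1
    "a.log": [12, 9, 15, 11, 10],          # hypothetical
    "b.csv": [40, 38, 51, 47, 44],         # hypothetical
    "c.dat": [5, 7, 6, 4, 8],              # hypothetical
    "d.txt": [90, 85, 70, 60, 75],         # hypothetical
}
hot_files = select_hot_files(heat_by_file)
```

With five files, the 20% cut keeps a single file, so only flu.txt enters the sequence to be predicted.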

(3) recalculating distances between the new class and other classes;

based on the steps, a cluster tree of the hotspot file-access heat sequence is obtained, and a Markov division state space is defined according to the cluster tree structure.

In this embodiment, hierarchical clustering is used to divide the historical access heat into a state space of 5 states, and the resulting subsets of the data set are labeled A, B, C, D and E.
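The state-space division can be sketched with a small agglomerative clustering routine. This is an illustrative sketch under stated assumptions: the "middle value" of the squared pairwise distances is read here as the mean (a median would be another reading), the tree is cut when the requested number of states remains, and the function names are hypothetical.

```python
def cluster_distance(a, b):
    """Mean of the squared distances between every pair of points in a and b."""
    return sum((x - y) ** 2 for x in a for y in b) / (len(a) * len(b))


def agglomerate(values, n_states):
    """Start with one class per value and merge the closest pair repeatedly."""
    clusters = [[v] for v in values]
    while len(clusters) > n_states:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge class j into class i
        del clusters[j]                          # total number of classes drops by one
    # Order the classes by their smallest member so labels follow increasing heat.
    return sorted(clusters, key=min)


def label_states(values, n_states=5, labels="ABCDE"):
    """Map each access heat value to the label of its class."""
    clusters = agglomerate(values, n_states)
    lookup = {v: labels[k] for k, members in enumerate(clusters) for v in members}
    return [lookup[v] for v in values]


heat = [262, 486, 632, 300, 570, 401, 382]  # values listed in Table 1
states = label_states(heat, n_states=5)
```

Only the seven values printed in Table 1 are used here; the embodiment applies the division to the full historical series.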

The specific method of the Markov property test is as follows:

For an index-value sequence $X_n = \{x_1, x_2, \ldots, x_n\}$ with n possible states, the sum of the j-th column of the transition frequency matrix divided by the sum over all rows and columns is called the marginal probability, as shown in the following formula:

$$p_{\cdot j} = \frac{\sum_{i=1}^{n} f_{ij}}{\sum_{i=1}^{n}\sum_{k=1}^{n} f_{ik}}$$

where $f_{ij}$ is the frequency with which the index sequence $X_n = \{x_1, x_2, \ldots, x_n\}$ moves from state i to state j in one step, with $i, j \in E$.

The statistic

$$\chi^2 = 2\sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij}\left|\ln\frac{p_{ij}}{p_{\cdot j}}\right|$$

has the $\chi^2$ distribution with $(n-1)^2$ degrees of freedom as its limiting distribution, where

$$p_{ij} = \frac{f_{ij}}{\sum_{k=1}^{n} f_{ik}}$$

Given a significance level α, if

$$\chi^2 > \chi^2_{\alpha}\left((n-1)^2\right)$$

then the sequence $X_n$ satisfies the Markov property; otherwise the sequence cannot be processed with the Markov model.

In this embodiment, the R language is used to obtain the one-step transition frequency matrix $(f_{ij})$, the transition probability matrix $(p_{ij})$, and the marginal probabilities $p_{\cdot j}$ shown in Table 2.

Table 2. Marginal probabilities

State	1	2	3	4	5
p_.j	0.17021277	0.42553191	0.17021277	0.08510638	0.14893617

The statistic is calculated from the above values, giving the χ² statistic calculation table shown in Table 3.

Table 3. χ² statistic calculation table

In this embodiment, the significance level is α = 0.1. Comparing the χ² statistic obtained from Table 3 with the quantile of the χ² distribution with (n-1)² degrees of freedom, where n = 5, shows that the historical access heat of the file satisfies the Markov property, so the access heat of the file can be predicted with the Markov model.
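The Markov property test can be sketched in a few lines. The sketch assumes the commonly used form of the statistic, χ² = 2 Σ f_ij |ln(p_ij / p_.j)|, with (n-1)² degrees of freedom; the function names are hypothetical, and the critical value is passed in by the caller (for example, the tabulated 0.1-level quantile is about 23.54 for 16 degrees of freedom and about 2.706 for 1 degree of freedom).

```python
import math


def markov_chi_square(states, n_states):
    """Return (chi2, dof) for a sequence of state indices 0..n_states-1."""
    # One-step transition frequency matrix f_ij.
    f = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        f[a][b] += 1
    total = sum(sum(row) for row in f)
    # Marginal probability p_.j: column sum divided by the grand total.
    marginal = [sum(f[i][j] for i in range(n_states)) / total
                for j in range(n_states)]
    chi2 = 0.0
    for i in range(n_states):
        row_sum = sum(f[i])
        for j in range(n_states):
            if f[i][j] > 0:  # zero-frequency cells contribute nothing
                p_ij = f[i][j] / row_sum
                chi2 += 2 * f[i][j] * abs(math.log(p_ij / marginal[j]))
    return chi2, (n_states - 1) ** 2


def satisfies_markov(states, n_states, critical_value):
    """True when the statistic exceeds the chi-square critical value."""
    chi2, _ = markov_chi_square(states, n_states)
    return chi2 > critical_value


# Hypothetical two-state sequence that alternates strictly between states.
demo = [0, 1] * 10
demo_chi2, demo_dof = markov_chi_square(demo, 2)
```

For the five-state embodiment the test has 16 degrees of freedom, so the statistic would be compared against the corresponding χ² quantile at α = 0.1.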

Step 4, taking the hot file-access heat sequence satisfying the markov property as an input sequence of the improved markov model, predicting the access heat of the hot file at the next moment, and writing the predicted access heat into a hot file-access heat database table, as shown in fig. 2, the specific method is as follows:

step 4.5: and repeating the steps 4.1-4.4, and predicting the access heat of the hotspot file at the next moment.

In this embodiment, to verify the prediction accuracy of the method, the access heat of flu.txt over 5 periods is predicted with both the improved and the unimproved Markov model. FIG. 3 compares the predicted values of the Markov model and of the improved Markov model with the true values. As the figure shows, when the access heat at the next moment of the first period is predicted, the two models use the same access heat sequence, so their predictions deviate from the actual access heat by the same amount, and neither differs much from the actual value. When the access heat at later moments is predicted, however, the improved Markov model uses a continuously updated sequence to be predicted, so its predictions deviate little from the actual values. The unimproved Markov model deviates more as the number of prediction steps increases, because of the ergodicity and stationary distribution of the Markov model. The results show that the access heat predicted by the improved Markov model is relatively close to the actual value and captures the access trend of the hot file well.
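The sliding-window prediction of steps 4.1 to 4.5 can be sketched as follows. Several simplifications are assumptions rather than part of the described method: states are assigned with fixed, hypothetical heat thresholds (`bounds`) instead of re-running the clustering, the "average value of the target state space" is read as the mean of the window values in that state, the population standard deviation is used, and a state with no recorded outgoing transitions is assumed to persist.

```python
import bisect
import statistics


def predict_next(window, bounds):
    """Predict the next access heat value from the current window."""
    n = len(bounds) + 1
    # Map each heat value to a state index using the thresholds.
    states = [bisect.bisect_right(bounds, v) for v in window]
    # Step 4.1: one-step transition counts; each row is proportional to a row of P.
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    # Step 4.2: with p(0) concentrated on the current state, p(1) = p(0)P
    # is the current state's row of P.
    row = counts[states[-1]]
    if sum(row) == 0:
        target = states[-1]  # assumption: stay in the current state
    else:
        # Step 4.3: most probable next state (first index wins on ties).
        target = max(range(n), key=lambda j: row[j])
    members = [v for v, s in zip(window, states) if s == target]
    # Step 4.3: standard deviation of the sequence plus the mean of the target state.
    return statistics.pstdev(window) + statistics.mean(members)


def forecast(window, bounds, steps):
    """Steps 4.4 and 4.5: slide the window forward and predict repeatedly."""
    window = list(window)
    predictions = []
    for _ in range(steps):
        nxt = predict_next(window, bounds)
        predictions.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest value, append the prediction
    return predictions


heat = [262, 486, 632, 300, 570, 401, 382]  # values listed in Table 1
bounds = [350, 500]                          # hypothetical state thresholds
first = predict_next(heat, bounds)
```

Because each prediction is appended to the window, the transition counts are rebuilt at every step, which is the difference from the unimproved model that the comparison above describes.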

step 5.2: setting a CPU utilization threshold U for the server where the copy is located; according to the CPU utilization law, the CPU utilization equals the request arrival rate divided by the service rate. Thus, the request service rate μ of a single server is calculated by the following formula:

$$\mu = \frac{\lambda}{U}$$

In this embodiment, after the access heat of the hotspot file is predicted by the improved Markov model, the node CPU utilization threshold is set to 0.5 and the throughput of a node copy is set to 100 requests per second. With 11 hours of access per day, the daily throughput is 100 × 11 × 3600 = 3,960,000, or about 4 million requests. The number of copies calculated from queuing theory is compared with the number of copies calculated from the actual copy throughput, as shown in FIG. 4. The figure shows that the copy calculation method, which takes both the request access rate and the response rate of the copies into account, adjusts the number of copies according to the trend of the access heat. In the first period, the access heat of the hotspot file is falling, and the number of copies obtained from queuing theory is lower than the actual number of copies. In the subsequent periods, the number of copies is dynamically adjusted to follow the trend of the access heat and comes closer to the number calculated from the actual throughput, which verifies the effectiveness of the method.
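The arithmetic in this paragraph and the utilization relation of step 5.2 can be checked with a short sketch. It assumes the utilization law is read as U = λ/μ, so that μ = λ/U; the function name and the sample arrival rate of 50 requests per second are hypothetical, and the formulas for the number of copies r are not reproduced here.

```python
def service_rate(arrival_rate, cpu_threshold):
    """Service rate implied by U = lambda / mu, i.e. mu = lambda / U."""
    if not 0 < cpu_threshold <= 1:
        raise ValueError("CPU utilization threshold must lie in (0, 1]")
    return arrival_rate / cpu_threshold


# Daily throughput of one copy: 100 requests/s over 11 hours of access.
daily_requests = 100 * 11 * 3600

# Hypothetical arrival rate of 50 requests/s with the embodiment's threshold U = 0.5.
mu = service_rate(50, 0.5)
```

Under these assumptions, a lower utilization threshold implies a higher required service rate for the same arrival rate.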

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims

1. A method for dynamically calculating the number of HDFS copies based on file access heat, characterized by comprising the following steps:

step 2, performing state space division on the hotspot file-access heat sequence by adopting a hierarchical clustering algorithm;

step 4, taking the hot file-access heat sequence meeting Markov property as an input sequence of an improved Markov model, predicting the access heat of the hot file at the next moment, and writing the predicted access heat into a hot file-access heat database table;

step 5, modeling the copy access request based on the queue model of the M/M/r single-queue multi-service desk, and setting the throughput of the copies on the node to determine the number of the copies;

the calculation formula of the file access heat in the step 1 is shown as the following formula:

where ⌊·⌋ denotes the largest integer not greater than its argument, which is used to obtain the number of data blocks of file f;

the specific method of the step 2 comprises the following steps:

(3) recalculating distances between the new class and other classes;

the specific method of the step 4 comprises the following steps:

the specific method of the step 5 comprises the following steps:

step 5.1: obtaining the average access request rate λ of the copies of the specified hotspot file in the next statistical period by querying the hotspot file-access heat database table;

Throughput is equal to the inverse of service dwell time; in the homogeneous cluster environment, the service rates of the servers where the multiple copies are located are the same, so that the number r of the copies in the HDFS system is calculated by the following two formulas: