CN108416054B - Method for calculating number of copies of dynamic HDFS (Hadoop distributed File System) based on file access heat - Google Patents

Publication number
CN108416054B
Authority
CN
China
Prior art keywords
file
access heat
access
sequence
copies
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810228575.7A
Other languages
Chinese (zh)
Other versions
CN108416054A (en)
Inventor
代钰
杨雷
化红翠
王际烽
张斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/1734Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for calculating the number of copies in a dynamic HDFS (Hadoop Distributed File System) based on file access heat, and relates to the technical field of data analysis. The method first uses an improved Markov model to analyze how the access heat of hot files changes over time, and predicts the access heat of each file from a file access heat formula. It then derives a formula for the number of copies from queuing theory and dynamically adjusts the number of copies of each hot file. The method relieves the access bottleneck on hot files and improves the service efficiency of the cluster.

Description

Method for calculating number of copies of dynamic HDFS (Hadoop distributed File System) based on file access heat
Technical Field
The invention relates to the technical field of data analysis, in particular to a method for calculating the number of copies of a dynamic HDFS (Hadoop distributed File System) based on file access heat.
Background
With the development of Internet technology and the progress of science and technology, data, characterized by large volume, variety, high velocity, and veracity, has permeated every industry and field of society. The growth of massive data, the sound management of data and resources, and the guarantee of data reliability have become key problems in the field of cloud computing.
Hadoop, the distributed system infrastructure developed by the Apache Foundation, implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes some POSIX requirements to allow streaming access to file system data. In the HDFS copy management mechanism, a cluster stores 3 copies of each data block of a file by default, which cannot meet the differing access requirements of different users for different files: when accesses to a particular file increase, the default number of copies of its data blocks cannot serve the large number of access requests, which creates an access bottleneck on hot files. Copy management methods are currently shifting from static copy creation policies to dynamic ones, so that when the external environment changes, the overall performance of the cluster is maintained or the cluster continues to serve clients efficiently. However, dynamic copy creation policies still leave out some factors that significantly affect the working efficiency of the cluster.
In the prior art, the document "high-efficiency multi-copy management research in cloud environment" proposes a dynamic copy creation method that addresses the cost-effectiveness of large-scale cloud storage systems. It considers the relationship between the number of copies and availability, that is, it adjusts the number of copies while accounting for the availability of the cloud storage system, but it does not consider the relationship between file access heat and the number of copies. The document "An Elastic Replication Management System for HDFS" proposes an active/standby storage model for flexible management of HDFS copies; the method uses a complex transaction engine to identify the volume of data accessed in real time, dynamically adjusts the number of copies, and introduces erasure codes to manage the number of copies. Although the system effectively improves HDFS performance, its implementation is complicated, and identifying real-time access data is highly complex.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a method for calculating the number of copies in a dynamic HDFS based on file access heat, which is used to calculate the number of copies dynamically.
A method for calculating the number of copies of a dynamic HDFS based on file access heat comprises the following steps:
step 1, calculating the access heat of each file in each statistical period from the file access log table on the distributed file system HDFS and the file access heat formula, sorting the files in descending order of their total access heat over the statistical time, selecting the first 20% of the files in the sorted list as hot files, and constructing a hot file-access heat sequence as the sequence to be predicted;
the calculation formula of the file access heat is shown as the following formula:
Figure GDA0003143655550000021
where hot(f) denotes the access heat of file f; af(f) denotes the access frequency of file f; N denotes the number of accesses to file f within the statistical period T; f_size denotes the size of file f; the symbol in figure GDA0003143655550000022 denotes the data block size of file f; the expression in figure GDA0003143655550000023 denotes the largest integer not exceeding the value in figure GDA0003143655550000024; and the expression in figure GDA0003143655550000025 gives the number of data blocks of file f;
step 2, dividing the hot file-access heat sequence into a state space using a hierarchical clustering algorithm, the specific method being as follows:
the hot file-access heat sequence forms a data set of length N, in which each object represents the access heat of the hot file at a different time, and the hierarchical clustering of this data set proceeds as follows:
(1) treating each object in the data set as its own class, giving N classes in total, where the distance between two classes is the middle value of the squared distances between every pair of data points drawn from the two classes;
(2) merging the two closest classes into one class, reducing the total number of classes by one;
(3) recalculating the distances between the new class and the other classes;
(4) repeating steps (2)-(3) until all data objects in the data set are merged into one class;
based on the above steps, obtaining a clustering tree of the hot file-access heat sequence, and defining the state space of the Markov model according to the structure of the clustering tree;
step 3, performing a Markov property test on the hot file-access heat sequence after state space division; if the sequence satisfies the Markov property, using it as the input sequence of the improved Markov model; otherwise, the sequence cannot be processed by the improved Markov model;
step 4, taking the hot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predicting the access heat of the hot file at the next moment, and writing the predicted access heat into the hot file-access heat database table, the specific method being as follows:
step 4.1: calculating the one-step state transition probability matrix P from the file-access heat sequence based on the divided state space;
step 4.2: setting the state corresponding to the file access heat value at the current moment as the initial state distribution, denoted p(0), and calculating the state probability distribution at the next moment as p(1) = p(0)P using the one-step state transition probability matrix P;
step 4.3: taking the state with the largest probability in p(1) as the state at the next moment, and taking the sum of the standard deviation of the hot file-access heat sequence and the mean value of the target state space as the predicted access heat value at the next moment;
step 4.4: removing the first value of the input sequence and appending the newly predicted access heat value as the last value, to form the sequence to be predicted next;
step 4.5: repeating steps 4.1-4.4 to predict the access heat of the hot file at subsequent moments;
step 5, modeling copy access requests with an M/M/r single-queue, multi-server queuing model, and setting the throughput of the copies on a node to determine the number of copies, the specific method being as follows:
step 5.1: obtaining the average access request rate λ of the copies of the specified hot file in the next statistical period by querying the hot file-access heat database table;
step 5.2: setting a CPU utilization threshold U for the server where the copy is located; according to the utilization law, the CPU utilization equals the request arrival rate divided by the service rate, so the request service rate μ of a single server is calculated by the following formula:
$$\mu = \frac{\lambda}{U}$$
step 5.3: setting the total throughput constraint of the cluster as Q; based on Little's law in queuing theory, the service sojourn time is given by
Figure GDA0003143655550000032
and the throughput equals the reciprocal of the service sojourn time; in a homogeneous cluster environment, the servers holding the copies have the same service rate, so the number of copies r is calculated by the following two formulas:
Figure GDA0003143655550000033
according to the above technical scheme, the beneficial effects of the invention are as follows: the method for calculating the number of copies in a dynamic HDFS based on file access heat predicts the access heat of files with an improved Markov model, which improves prediction accuracy. The queuing-theory-based calculation of the number of copies takes into account how the access heat of hot files changes over time and dynamically adjusts the number of copies to cope with highly concurrent access to hot files. Using queuing theory, the copies stored on the nodes are treated as service resources, and the request rate and response rate of hot file copies are analyzed with the goal of guaranteeing cluster throughput and reliability; the number of copies is then obtained from the copy calculation formula, which lays the foundation for subsequent dynamic adjustment of the number of copies.
Drawings
FIG. 1 is a flowchart of a method for calculating the number of copies in a dynamic HDFS based on file access heat according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the process of predicting the access heat of a hot file at the next moment using an improved Markov model according to an embodiment of the present invention;
FIG. 3 is a graph comparing the predicted values of the Markov model and of the improved Markov model with the true values according to an embodiment of the present invention;
FIG. 4 is a graph comparing the number of copies calculated based on queuing theory with the number of copies calculated from the actual copy throughput according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
In this embodiment, 3 racks are set up with 4 virtual machines configured on each rack, and three additional physical machines are set up. Two of them serve as a NameNode in the active state and a NameNode in the standby state, to prevent a single point of failure of the NameNode. The third physical machine serves as a computing node that collects the file access logs, predicts the access heat of files, and calculates the number of copies. The cluster configuration is: Hadoop version hadoop-2.2.0; memory 32 GB; CPU Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz; operating system CentOS 6.7; hard disk 2 TB; development languages Java, R, and MATLAB.
The method for calculating the number of the copies of the dynamic HDFS based on the file access heat comprises the following steps as shown in FIG. 1:
step 1, calculating the access heat of each file in each statistical period from the file access log table on the distributed file system HDFS and the access heat formula, sorting the files in descending order of their total access heat over the statistical time, selecting the first 20% of the files in the sorted list as hot files, and constructing a hot file-access heat sequence as the sequence to be predicted;
the calculation formula of the file access heat is shown as the following formula:
Figure GDA0003143655550000041
where hot(f) denotes the access heat of file f; af(f) denotes the access frequency of file f; N denotes the number of accesses to file f within the statistical period T; f_size denotes the size of file f; the symbol in figure GDA0003143655550000042 denotes the data block size of file f; the expression in figure GDA0003143655550000043 denotes the largest integer not exceeding the value in figure GDA0003143655550000044; and the expression in figure GDA0003143655550000045 gives the number of data blocks of file f.
In this embodiment, 5 days are taken as one statistical period, and the access frequency of the file flu.txt over 5 periods is counted. The access heat of flu.txt in the statistical periods, calculated from the file access log table and the access heat formula, is shown in Table 1.
Table 1. Access heat of flu.txt
Access time   2017-08-01   2017-08-02   2017-08-03   2017-08-04   2017-08-05
Access heat   262          486          632          300          570
Access time   2017-08-06   ...          ...          2017-10-02   ...
Access heat   401          ...          ...          382          ...
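The selection rule in step 1 can be sketched in Python as follows. This is a minimal illustration, assuming the per-period access heat values have already been computed from the log table. The function name `select_hot_files`, the rounding-up of the 20% cutoff, and every file other than flu.txt are hypothetical; only the first five flu.txt values come from Table 1.

```python
import math


def select_hot_files(heat_by_file, fraction=0.2):
    """Rank files by total access heat and keep the top `fraction` of them.

    Rounding the cutoff up (and keeping at least one file) is an assumption;
    the source only says "the first 20% of the files".
    """
    totals = {name: sum(series) for name, series in heat_by_file.items()}
    ranked = sorted(totals, key=lambda name: totals[name], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical heat series; only the flu.txt values are taken from Table 1.
heat_by_file = {
    "flu.txt": [262, 486, 632, 300, 570],
    "a.log": [12, 9, 15, 11, 10],
    "b.csv": [40, 38, 51, 47, 45],
    "c.dat": [5, 7, 6, 4, 8],
    "d.txt": [90, 120, 80, 110, 95],
}
hot_files = select_hot_files(heat_by_file)
print(hot_files)
```

With five files and a 20% cutoff, a single file is selected, and its heat series becomes the sequence to be predicted.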
Step 2, dividing the hot file-access heat sequence into a state space using a hierarchical clustering algorithm, the specific method being as follows:
the hot file-access heat sequence forms a data set of length N, in which each object represents the access heat of the hot file at a different time, and the hierarchical clustering of this data set proceeds as follows:
(1) treating each object in the data set as its own class, giving N classes in total, where the distance between two classes is the middle value of the squared distances between every pair of data points drawn from the two classes;
(2) merging the two closest classes into one class, reducing the total number of classes by one;
(3) recalculating the distances between the new class and the other classes;
(4) repeating steps (2)-(3) until all data objects in the data set are merged into one class;
based on the above steps, a clustering tree of the hot file-access heat sequence is obtained, and the state space of the Markov model is defined according to the structure of the clustering tree.
In this embodiment, hierarchical clustering is used to divide the historical access heat data into 5 states, which are labeled A, B, C, D, and E.
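A rough Python sketch of the state division in step 2 is given below. It assumes that the "middle value" in step (1) means the median of the squared pairwise distances, which is only one possible reading of the text. It also stops merging once k classes remain, which yields the same partition as building the full clustering tree and cutting it at k classes. The names `agglomerate` and `label_states` are hypothetical, the input reuses six values from Table 1, and k = 3 is chosen only to keep the example small (the embodiment uses 5 states on the full history).

```python
from statistics import median


def cluster_distance(values, a, b):
    """Median of squared distances over all cross-class pairs (assumed linkage)."""
    return median((values[i] - values[j]) ** 2 for i in a for j in b)


def agglomerate(values, k):
    """Merge the two closest classes repeatedly until k classes remain."""
    clusters = [[i] for i in range(len(values))]  # each class holds indices
    while len(clusters) > k:
        pairs = [
            (cluster_distance(values, clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)          # closest pair of classes
        clusters[i] += clusters[j]    # merge class j into class i
        del clusters[j]               # j > i, so index i stays valid
    # Order classes by their smallest heat value so labels run low to high.
    return sorted(clusters, key=lambda c: min(values[i] for i in c))


def label_states(values, k, names="ABCDE"):
    """Return one state label per observation."""
    if k > len(names):
        raise ValueError("not enough state names for k classes")
    labels = [None] * len(values)
    for name, cluster in zip(names, agglomerate(values, k)):
        for i in cluster:
            labels[i] = name
    return labels


heat = [262, 486, 632, 300, 570, 401]  # first six values of Table 1
print(label_states(heat, 3))
```

The resulting label sequence is what the Markov property test and the transition matrix in the following steps operate on.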
Step 3, performing a Markov property test on the hot file-access heat sequence after state space division; if the sequence satisfies the Markov property, using it as the input sequence of the improved Markov model; otherwise, the sequence cannot be processed by the improved Markov model;
the specific method of the Markov property test is as follows:
for an index sequence X_n = {x_1, x_2, ..., x_n} with n possible states, the sum of the j-th column of the transition frequency matrix divided by the sum of all rows and columns is called the marginal probability, as shown in the following formula:
$$p_{\cdot j} = \frac{\sum_{i=1}^{n} f_{ij}}{\sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij}}$$
where f_ij denotes the frequency with which the index sequence X_n = {x_1, x_2, ..., x_n} moves from state i to state j in one step, i, j ∈ E;
then the statistic
$$\chi^2 = 2\sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij}\left|\ln\frac{p_{ij}}{p_{\cdot j}}\right|$$
has the χ² distribution with (n-1)² degrees of freedom as its limiting distribution, where
$$p_{ij} = \frac{f_{ij}}{\sum_{j=1}^{n} f_{ij}}$$
given a significance level α, if
$$\chi^2 > \chi^2_{\alpha}\left((n-1)^2\right)$$
then the sequence X_n satisfies the Markov property; otherwise, the sequence cannot be processed using a Markov model.
In this embodiment, R is used to obtain the one-step transition frequency matrix (f_ij) and the transition probability matrix (p_ij) shown in the following formulas, and the marginal probabilities p.j shown in Table 2.
Figure GDA0003143655550000061
Figure GDA0003143655550000062
Table 2. Marginal probabilities
State   1            2            3            4            5
p.j     0.17021277   0.42553191   0.17021277   0.08510638   0.14893617
The statistic
Figure GDA0003143655550000063
is calculated from the above values, giving the χ² statistic calculation table shown in Table 3.
Table 3. χ² statistic calculation table
Figure GDA0003143655550000064
In this example, the significance level α is 0.1, and the quantile
Figure GDA0003143655550000065
is obtained by table lookup, where n = 5. The historical access heat of the file therefore satisfies the Markov property, and the access heat of the file can be predicted using a Markov model.
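The Markov property test can be illustrated with the short Python sketch below. It assumes the statistic takes the commonly used form χ² = 2 Σ_i Σ_j f_ij |ln(p_ij / p.j)| with (n-1)² degrees of freedom, which matches the definitions of f_ij and p.j given in the text. The function names and the toy state sequence are hypothetical, and the critical value is passed in by the caller (here 2.706, the standard upper 0.1 quantile of the χ² distribution with 1 degree of freedom) rather than computed.

```python
import math


def transition_counts(states, symbols):
    """One-step transition frequency matrix f_ij of a state sequence."""
    index = {s: i for i, s in enumerate(symbols)}
    n = len(symbols)
    counts = [[0] * n for _ in range(n)]
    for cur, nxt in zip(states, states[1:]):
        counts[index[cur]][index[nxt]] += 1
    return counts


def markov_chi_square(counts):
    """Assumed form of the statistic: 2 * sum f_ij * |ln(p_ij / p_.j)|."""
    n = len(counts)
    total = sum(sum(row) for row in counts)
    # Marginal probability: column sum divided by the sum of all entries.
    marginal = [sum(counts[i][j] for i in range(n)) / total for j in range(n)]
    stat = 0.0
    for i in range(n):
        row_total = sum(counts[i])
        for j in range(n):
            f = counts[i][j]
            if f == 0:
                continue  # a zero frequency contributes nothing to the sum
            p_ij = f / row_total
            stat += f * abs(math.log(p_ij / marginal[j]))
    return 2.0 * stat


def satisfies_markov_property(counts, critical_value):
    """Compare the statistic with a chi-square quantile supplied by the caller."""
    return markov_chi_square(counts) > critical_value


# Toy two-state sequence (hypothetical data), so (n - 1)^2 = 1 degree of freedom.
states = list("ABABABAB")
counts = transition_counts(states, "AB")
chi2 = markov_chi_square(counts)
print(counts, round(chi2, 4), satisfies_markov_property(counts, 2.706))
```

For the five-state embodiment, the same functions would be called with the five state labels and the quantile for 16 degrees of freedom taken from a χ² table.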
Step 4, taking the hot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predicting the access heat of the hot file at the next moment, and writing the predicted access heat into the hot file-access heat database table, as shown in FIG. 2, the specific method being as follows:
step 4.1: calculating the one-step state transition probability matrix P from the file-access heat sequence based on the divided state space;
step 4.2: setting the state corresponding to the file access heat value at the current moment as the initial state distribution, denoted p(0), and calculating the state probability distribution at the next moment as p(1) = p(0)P using the one-step state transition probability matrix P;
step 4.3: taking the state with the largest probability in p(1) as the state at the next moment, and taking the sum of the standard deviation of the hot file-access heat sequence and the mean value of the target state space as the predicted access heat value at the next moment;
step 4.4: removing the first value of the input sequence and appending the newly predicted access heat value as the last value, to form the sequence to be predicted next;
step 4.5: repeating steps 4.1-4.4 to predict the access heat of the hot file at subsequent moments.
In this embodiment, to verify the prediction accuracy of the method, the access heat of flu.txt over 5 periods is predicted with both the improved and the unimproved Markov model and the results are compared. The comparison of the predicted values of the Markov model, the predicted values of the improved Markov model, and the true values is shown in FIG. 3. As the figure shows, when predicting the access heat at the next moment in the first period, the two models start from the same access heat sequence, so their predictions deviate from the actual value by the same amount, and neither differs much from the actual access heat. When predicting the access heat at later moments, however, the improved Markov model uses a continuously updated sequence to be predicted, so its predictions deviate little from the actual values, whereas the deviation of the unimproved Markov model grows with the number of prediction steps because of the ergodicity and stationary distribution of the Markov model. The results show that the access heat predicted by the improved Markov model is relatively close to the actual values and that the model predicts the access trend of hot files relatively well.
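Steps 4.1 to 4.5 can be sketched as a sliding-window loop in Python. This is an illustration under stated assumptions, not the patented implementation: states are assigned by hypothetical cut points (`cuts`) rather than by the clustering tree, the standard deviation is taken as the population standard deviation, the "mean value of the target state space" is read as the mean of the window values that fall in the target state, and the process stays in the current state when no outgoing transition has been observed. All function names and the sample data are hypothetical.

```python
import bisect
from statistics import mean, pstdev


def predict_next(window, cuts):
    """One prediction step (steps 4.1-4.3) on the current window."""
    k = len(cuts) + 1
    states = [bisect.bisect_right(cuts, v) for v in window]
    # Step 4.1: one-step transition counts, then the row of P for the current state.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    row = counts[states[-1]]
    total = sum(row)
    if total == 0:
        target = states[-1]  # assumption: no observed transition, stay put
    else:
        # Step 4.2: p(0) is one-hot on the current state, so p(1) is this row of P.
        p1 = [c / total for c in row]
        # Step 4.3: the most probable next state.
        target = max(range(k), key=lambda j: p1[j])
    members = [v for v, s in zip(window, states) if s == target]
    # Predicted heat = standard deviation of the sequence + mean of the target state.
    return pstdev(window) + mean(members)


def rolling_forecast(history, cuts, steps):
    """Steps 4.4-4.5: drop the oldest value, append the prediction, repeat."""
    window = list(history)
    predictions = []
    for _ in range(steps):
        nxt = predict_next(window, cuts)
        predictions.append(nxt)
        window = window[1:] + [nxt]
    return predictions


history = [10, 30, 10, 30, 10]  # hypothetical heat values, two states split at 20
print(rolling_forecast(history, cuts=[20], steps=2))
```

The continuously updated window is what distinguishes the improved model from repeatedly multiplying a fixed transition matrix, which drifts toward the stationary distribution as the number of steps grows.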
Step 5, modeling copy access requests with an M/M/r single-queue, multi-server queuing model, and setting the throughput of the copies on a node to determine the number of copies, the specific method being as follows:
step 5.1: obtaining the average access request rate λ of the copies of the specified hot file in the next statistical period by querying the hot file-access heat database table;
step 5.2: setting a CPU utilization threshold U for the server where the copy is located; according to the utilization law, the CPU utilization equals the request arrival rate divided by the service rate, so the request service rate μ of a single server is calculated by the following formula:
$$\mu = \frac{\lambda}{U}$$
step 5.3: setting the total throughput constraint of the cluster as Q; based on Little's law in queuing theory, the service sojourn time is given by
Figure GDA0003143655550000081
and the throughput equals the reciprocal of the service sojourn time; in a homogeneous cluster environment, the servers holding the copies have the same service rate, so the number of copies r is calculated by the following two formulas:
Figure GDA0003143655550000082
In this embodiment, after the access heat of the hot file is predicted by the improved Markov model, the node CPU utilization threshold is set to 0.5 and the throughput of a copy on a node is set to 100/s, so that for 11 hours of access per day the average daily throughput is 100 × 11 × 3600 = 3,960,000, or approximately 4 million. The number of copies is calculated based on queuing theory and compared with the number of copies calculated from the actual copy throughput; the comparison is shown in FIG. 4. As the figure shows, the method takes the request access rate and response rate of the copies into account and adjusts the number of copies to follow the trend of the access heat. In the first period, the access heat of the hot file is decreasing, and the number of copies obtained from queuing theory is smaller than the actual number of copies. In the subsequent periods, the number of copies is dynamically adjusted to follow the trend of the access heat of the hot file and is closer to the number calculated from the actual throughput, which verifies the effectiveness of the method.
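The arithmetic in this embodiment and the relation in step 5.2 can be checked with the small Python sketch below. The first two functions follow the text directly: the service rate from the utilization law, and the daily throughput of one copy. The third function is not the patent's replica formula, which survives only as an image in this text; it substitutes a standard utilization bound for an M/M/r system, choosing the smallest r such that λ / (r × node rate) stays at or below the threshold U. All function names and the sample arrival rates are hypothetical.

```python
import math


def service_rate(arrival_rate, cpu_threshold):
    """Step 5.2: utilization U = lambda / mu, hence mu = lambda / U."""
    if not 0 < cpu_threshold <= 1:
        raise ValueError("cpu_threshold must be in (0, 1]")
    return arrival_rate / cpu_threshold


def daily_throughput(per_second, hours_per_day):
    """Requests one copy can serve per day at a fixed per-second throughput."""
    return per_second * hours_per_day * 3600


def replicas_for_utilization(arrival_rate, node_rate, cpu_threshold):
    """Substitute sizing rule (assumption): smallest r with
    arrival_rate / (r * node_rate) <= cpu_threshold."""
    return max(1, math.ceil(arrival_rate / (node_rate * cpu_threshold)))


print(daily_throughput(100, 11))                 # embodiment: 100/s for 11 hours
print(service_rate(50, 0.5))                     # hypothetical lambda = 50/s
print(replicas_for_utilization(180, 100, 0.5))   # hypothetical lambda = 180/s
```

Under this substitute rule, a predicted arrival rate of 180/s with 100/s per copy and a 0.5 threshold would call for 4 copies; the figures produced by the patent's own formulas may differ.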
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (1)

1. A method for calculating the number of copies in a dynamic HDFS based on file access heat, characterized by comprising the following steps:
step 1, calculating the access heat of each file in each statistical period from the file access log table on the distributed file system HDFS and the file access heat formula, sorting the files in descending order of their total access heat over the statistical time, selecting the first 20% of the files in the sorted list as hot files, and constructing a hot file-access heat sequence as the sequence to be predicted;
step 2, dividing the hot file-access heat sequence into a state space using a hierarchical clustering algorithm;
step 3, performing a Markov property test on the hot file-access heat sequence after state space division; if the sequence satisfies the Markov property, using it as the input sequence of the improved Markov model; otherwise, the sequence cannot be processed by the improved Markov model;
step 4, taking the hot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predicting the access heat of the hot file at the next moment, and writing the predicted access heat into the hot file-access heat database table;
step 5, modeling copy access requests with an M/M/r single-queue, multi-server queuing model, and setting the throughput of the copies on a node to determine the number of copies;
the calculation formula of the file access heat in step 1 is shown as the following formula:
Figure FDA0003121191260000011
where hot(f) denotes the access heat of file f; af(f) denotes the access frequency of file f; N denotes the number of accesses to file f within the statistical period T; f_size denotes the size of file f; the symbol in figure FDA0003121191260000012 denotes the data block size of file f; the expression in figure FDA0003121191260000013 denotes the largest integer not exceeding the value in figure FDA0003121191260000014; and the expression in figure FDA0003121191260000015 gives the number of data blocks of file f;
the specific method of step 2 is as follows:
the hot file-access heat sequence forms a data set of length N, in which each object represents the access heat of the hot file at a different time, and the hierarchical clustering of this data set proceeds as follows:
(1) treating each object in the data set as its own class, giving N classes in total, where the distance between two classes is the middle value of the squared distances between every pair of data points drawn from the two classes;
(2) merging the two closest classes into one class, reducing the total number of classes by one;
(3) recalculating the distances between the new class and the other classes;
(4) repeating steps (2)-(3) until all data objects in the data set are merged into one class;
based on the above steps, obtaining a clustering tree of the hot file-access heat sequence, and defining the state space of the Markov model according to the structure of the clustering tree;
the specific method of step 4 is as follows:
step 4.1: calculating the one-step state transition probability matrix P from the file-access heat sequence based on the divided state space;
step 4.2: setting the state corresponding to the file access heat value at the current moment as the initial state distribution, denoted p(0), and calculating the state probability distribution at the next moment as p(1) = p(0)P using the one-step state transition probability matrix P;
step 4.3: taking the state with the largest probability in p(1) as the state at the next moment, and taking the sum of the standard deviation of the hot file-access heat sequence and the mean value of the target state space as the predicted access heat value at the next moment;
step 4.4: removing the first value of the input sequence and appending the newly predicted access heat value as the last value, to form the sequence to be predicted next;
step 4.5: repeating steps 4.1-4.4 to predict the access heat of the hot file at subsequent moments;
the specific method of the step 5 comprises the following steps:
step 5.1: obtaining the average access request rate λ of the copy of the specified hotspot file in the next statistical period by querying the hotspot file-access heat database table;
step 5.2: setting a CPU utilization threshold U for the server where the copy is located; according to the CPU utilization law, the CPU utilization equals the request arrival rate divided by the service rate, and the request service rate μ of a single server is calculated by the following formula:
Figure FDA0003121191260000021
step 5.3: setting the total throughput constraint of the cluster as Q, and based on Little's law in queuing theory, the service dwell time is equal to the service rate multiplied by the service rate:
Figure FDA0003121191260000022
Throughput is equal to the inverse of service dwell time; in the homogeneous cluster environment, the service rates of the servers where the multiple copies are located are the same, so that the number r of the copies in the HDFS system is calculated by the following two formulas:
Figure FDA0003121191260000023
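The relation worded in step 5.2 (CPU utilization equals request arrival rate divided by service rate) can be rearranged as μ = λ / U. The sketch below covers only that rearrangement, which is inferred from the claim wording rather than copied from the formula image; the function name and the sample numbers are hypothetical, and the replica-count formulas of step 5.3 are not attempted here.

```python
def required_service_rate(arrival_rate, cpu_threshold):
    """Service rate mu at which utilization equals the threshold U.

    Assumption: U = lambda / mu, as worded in step 5.2, so mu = lambda / U.
    """
    if not 0 < cpu_threshold <= 1:
        raise ValueError("CPU utilization threshold must be in (0, 1]")
    if arrival_rate < 0:
        raise ValueError("arrival rate must be non-negative")
    return arrival_rate / cpu_threshold


# Illustrative values only: 120 requests/s against an 80% CPU threshold
mu = required_service_rate(arrival_rate=120.0, cpu_threshold=0.8)
```

A lower threshold U demands a higher service rate for the same request load, which is what pushes the replica count up for hot files.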

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810228575.7A CN108416054B (en) 2018-03-20 2018-03-20 Method for calculating number of copies of dynamic HDFS (Hadoop distributed File System) based on file access heat

Publications (2)

Publication Number Publication Date
CN108416054A (en) 2018-08-17
CN108416054B (en) 2021-10-22

Family

ID=63132988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810228575.7A Expired - Fee Related CN108416054B (en) 2018-03-20 2018-03-20 Method for calculating number of copies of dynamic HDFS (Hadoop distributed File System) based on file access heat

Country Status (1)

Country Link
CN (1) CN108416054B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112689166A (en) * 2020-12-18 2021-04-20 武汉市烽视威科技有限公司 Method and system for flexibly increasing and decreasing CDN hot content in real time
CN113391765A (en) * 2021-06-22 2021-09-14 中国工商银行股份有限公司 Data storage method, device, equipment and medium based on distributed storage system
CN115033187B (en) * 2022-08-10 2022-11-08 蓝深远望科技股份有限公司 Big data based analysis management method
CN115544377B (en) * 2022-11-25 2023-04-07 浙江星汉信息技术股份有限公司 Cloud storage-based file heat evaluation and updating method
CN116600015B (en) * 2023-07-18 2023-10-10 湖南快乐阳光互动娱乐传媒有限公司 Resource node adjustment method, system, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150347A (en) * 2013-02-07 2013-06-12 浙江大学 Dynamic replica management method based on file heat
CN105574153A (en) * 2015-12-16 2016-05-11 南京信息工程大学 Transcript placement method based on file heat analysis and K-means
CN107632994A (en) * 2016-07-19 2018-01-26 普天信息技术有限公司 A kind of reliability Enhancement Method and system based on HDFS file system
CN107770259A (en) * 2017-09-30 2018-03-06 武汉理工大学 Copy amount dynamic adjusting method based on file temperature and node load

Also Published As

Publication number Publication date
CN108416054A (en) 2018-08-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211022