CN105574153A - Transcript placement method based on file heat analysis and K-means - Google Patents
- Publication number
- CN105574153A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/128—Details of file system snapshots on the file-level, e.g. snapshot creation, administration, deletion
Abstract
The invention provides a replica placement method based on file heat analysis and K-means. First, the access frequency of a file in a given period of time is analyzed to calculate the access heat of the file. Files likely to have high access heat are then predicted from the access heat in combination with the K-means algorithm, and the number and placement positions of file replicas are dynamically adjusted on demand, taking into account the statistical period, file size, working environment and other factors. The replica placement method provided by the invention can effectively shorten the average response time of file access and improve data service performance.
Description
Technical field
The invention belongs to the field of cloud computing, and specifically relates to a method that uses heat statistics analysis and the K-means algorithm to dynamically adjust the placement of replicas of high-heat files in a cloud environment.
Background technology
With the development of society and the growth of computer storage and data-processing capacity, explosive data growth has become a defining feature of the current era. According to an estimate of data growth by International Data Corporation (IDC), 40 ZB ($1\,\mathrm{ZB} = 1.1805916207174113 \times 10^{21}\,\mathrm{B}$) of data will be produced by 2020, equivalent to 5247 GB per person on Earth (http://datacenter.watchstor.com/infra-143421.htm). In the face of this ever-growing mass of data, its storage and management have attracted increasing attention.
To improve system reliability and access efficiency, conventional replication techniques make multiple copies of each data item and store them on multiple nodes of a distributed file system. To meet the different data-access requirements that have arisen over time, a variety of replication strategies have been proposed, mainly master-slave, hierarchical, peer-to-peer (P2P), and graph-based.
A replication strategy usually has to decide two things, the number of replicas and where they are stored, and strategies can be divided into static and dynamic according to when the decision is made. In 2001, Ian Foster and Kavitha Ranganathan proposed six replica creation strategies for hierarchically structured networks: no replication, best client, cascading, plain caching, caching plus cascading, and fast spread (Li Lin, "Research and implementation of a replica optimization strategy based on an economic model in a data grid environment"). These strategies reduce access latency in most cases, but the cascading, caching plus cascading, and fast spread strategies apply only to data grids in which the data are stored at the top-level node, while the best client and plain caching strategies do not take into account features such as topology, data distribution, network bandwidth, and node storage capacity (Sun Haiyan, Wang Xiaodong, Zhou Bin, et al., "A two-layer dynamic replica creation strategy based on storage alliances: SADDERS"), nor the effect of file size and network bandwidth on access latency.
The present invention analyzes the access frequency of files within a given time period and calculates each file's access heat according to a heat formula. Using the access heat in combination with the K-means algorithm, it predicts the files likely to have high access heat in the next period (Rao Lei, Yang Fande, Li Xinming, Liu Dong, "Dynamic replica creation algorithm based on heat analysis"), and, taking into account factors such as the statistical period, file size, and working environment, dynamically adjusts the number and placement locations of file replicas.
Summary of the invention
The technical problem to be solved by the present invention is replica placement in a distributed system or cloud computing platform. A replica placement method based on file heat analysis and K-means is proposed: according to the execution times of tasks, the maximum value is chosen as the time period, and the access heat of each file is calculated within that period. Using the access heat of the files in combination with the K-means algorithm, the files likely to have high access heat in the next period are predicted; taking into account factors such as the statistical period, file size, and working environment, the number and placement locations of file replicas are dynamically adjusted as needed. The invention can effectively reduce the average response time of file access and improve data service performance.
Technical solution:
A replica placement method based on file heat analysis and K-means, comprising the following steps:
Step 1) According to the execution times of tasks, select the minimum value as the time period for heat analysis, and analyze the access frequency of files within this period;
Step 2) From the file access frequency obtained in step 1), calculate the access heat value of each file;
Step 3) From the file access heat values obtained in step 2), obtain the information of the files with high heat values and use the K-means algorithm to predict the high-heat files of the next operating period;
Step 4) Based on the high-heat file information obtained in step 3), and taking into account factors such as file size, number of files, file location, and working environment, dynamically adjust the number and placement locations of file replicas;
Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 1) selects, according to the execution times of tasks, the maximum value as the time period for heat analysis, and analyzes the access frequency of files within this period. The invention uses a file access counter and a statistical-period timer. At initialization, the default access count of a file is 1. In each statistical period, the counter is incremented by 1 each time the file is accessed and decremented by 1 if the file is not accessed; if the access count is already 1, the counter is no longer decremented. If a file access times out without completing, the access counter is incremented by 1. The access frequency of a file in the k-th statistical period is $f_k = n/t$, where $n$ is the number of times the file is accessed in the statistical period and $t$ is the total duration of those accesses in the statistical period;
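The counter rules and the frequency $f_k = n/t$ can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `AccessCounter` and its fields are hypothetical, and treating a timed-out access as an ordinary increment is an assumption about an ambiguous sentence.

```python
from dataclasses import dataclass


@dataclass
class AccessCounter:
    """Hypothetical per-file counter following the rules stated in the text."""
    count: int = 1           # default access count at initialization
    hits: int = 0            # n: accesses in the current statistical period
    busy_time: float = 0.0   # t: summed access duration in the current period

    def record_access(self, duration: float) -> None:
        # Each access adds 1. A timed-out access is recorded the same way
        # (assumption: the text only says the counter "adds 1").
        self.count += 1
        self.hits += 1
        self.busy_time += duration

    def close_period(self) -> float:
        """End the period, return f_k = n / t, and apply the idle decrement."""
        freq = self.hits / self.busy_time if self.busy_time > 0 else 0.0
        if self.hits == 0 and self.count > 1:
            self.count -= 1  # not accessed: subtract 1, but never go below 1
        self.hits, self.busy_time = 0, 0.0
        return freq
```

A file accessed twice for a total of 5 time units in one period gets $f_k = 2/5$; in later idle periods its counter decays by one per period until it reaches 1.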
Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 2) uses the file access frequency obtained in step 1) and the formula $h_{ij} = \alpha F_j / (S_i + 1)$ to calculate the access heat value of file $i$ at time $j$. In the formula, $\alpha$ is a constant used to normalize the data; $F_j$ represents the effect of access frequency on the file access heat, and $S_i$ represents the effect of file size on the file access heat. [Defining formulas for $F_j$ and $S_i$ not recoverable.]
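A minimal sketch of the heat formula $h_{ij} = \alpha F_j / (S_i + 1)$ follows. The function name `access_heat` and the default $\alpha = 1$ are illustrative assumptions; how $F_j$ and $S_i$ are themselves derived is left to the caller.

```python
def access_heat(freq_effect: float, size_effect: float, alpha: float = 1.0) -> float:
    """Return h_ij = alpha * F_j / (S_i + 1).

    freq_effect is F_j, size_effect is S_i; alpha is the normalizing constant.
    """
    if size_effect < 0:
        raise ValueError("size_effect must be non-negative")
    # The +1 in the denominator keeps the value finite when S_i is 0.
    return alpha * freq_effect / (size_effect + 1.0)
```

With $F_j = 6$ and $\alpha = 1$, a file with $S_i = 0$ has heat 6.0 and a file with $S_i = 2$ has heat 2.0, so for equal access frequency a larger size term lowers the heat.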
Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 3) obtains, from the file access heat values obtained in step 2), the information of the files with high heat values, chooses k files as initial centers, calculates the distance from each file to each center file, and assigns each file to the nearest cluster. The process is then repeated on the resulting clusters until a termination condition is met. The termination conditions include:
(1) no files (or only a minimal number) are reassigned to different clusters;
(2) no cluster centers (or only a minimal number) change;
(3) the sum of squared errors (SSE) reaches a local minimum,
$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2,$$
where $x$ represents a file, $m_j$ represents the center of cluster $C_j$, and $\mathrm{dist}(x, m_j)$ represents the distance between file $x$ and cluster center $m_j$;
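The clustering loop and its termination conditions can be sketched as below. This is an illustrative, standard-library K-means over scalar heat values; the function name `kmeans_heat`, the use of absolute difference as the distance, and the iteration cap `max_iter` are assumptions, since the text does not fix a feature vector or distance measure.

```python
from typing import List, Sequence, Tuple


def kmeans_heat(heats: Sequence[float], centers: Sequence[float],
                max_iter: int = 100) -> Tuple[List[float], List[int], float]:
    """Cluster scalar heat values; returns (centers, labels, sse)."""
    centers = list(centers)
    labels = [-1] * len(heats)
    for _ in range(max_iter):
        # Assign every file to the nearest cluster center.
        new_labels = [min(range(len(centers)), key=lambda j: abs(h - centers[j]))
                      for h in heats]
        # Recompute each center as the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = []
        for j, old in enumerate(centers):
            members = [h for h, lab in zip(heats, new_labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else old)
        # Conditions (1) and (2): no file reassigned and no center changed.
        done = new_labels == labels and new_centers == centers
        labels, centers = new_labels, new_centers
        if done:
            break
    # Condition (3) is tracked through the sum of squared errors.
    sse = sum((h - centers[lab]) ** 2 for h, lab in zip(heats, labels))
    return centers, labels, sse
```

For heats `[0.1, 0.2, 0.15, 0.9, 1.0, 0.95]` with initial centers `[0.1, 0.9]`, the loop separates the three low-heat files from the three high-heat files and stops once a pass changes neither labels nor centers.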
Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 4) uses the clustering information obtained in step 3): according to the access heat of each cluster center, and taking into account factors such as file size, number of files, file location, and working environment, the number and placement locations of file replicas are dynamically adjusted. The number of replicas is increased appropriately for high-heat clusters and reduced appropriately for low-heat clusters.
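One possible reading of the replica-count adjustment is sketched below. The step size of one replica, the bounds `min_replicas` and `max_replicas`, and the choice to touch only the hottest and coldest clusters are all assumptions made for illustration; the text says only that counts are increased or reduced "appropriately". Placement location is not modeled because no rule for it is stated.

```python
from typing import List, Sequence


def adjust_replicas(labels: Sequence[int], centers: Sequence[float],
                    current: Sequence[int], min_replicas: int = 1,
                    max_replicas: int = 5) -> List[int]:
    """Add a replica to files in the hottest cluster, drop one in the coldest."""
    hottest = max(range(len(centers)), key=lambda j: centers[j])
    coldest = min(range(len(centers)), key=lambda j: centers[j])
    result = []
    for lab, n in zip(labels, current):
        if hottest != coldest and lab == hottest:
            result.append(min(n + 1, max_replicas))   # cap at the upper bound
        elif hottest != coldest and lab == coldest:
            result.append(max(n - 1, min_replicas))   # never drop below the floor
        else:
            result.append(n)                          # middle clusters unchanged
    return result
```

With cluster centers `[0.15, 0.95]`, files in cluster 1 gain a replica up to the cap and files in cluster 0 lose one down to the floor.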
Beneficial effects
The invention addresses replica placement in distributed systems or cloud computing platforms. By combining file access heat with the K-means algorithm, it helps achieve reasonable replica placement in heavily accessed systems. The method remedies the shortcoming of earlier replica placement methods that rely solely on file heat analysis and place replicas based only on the file heat of the current statistical period. In addition, to improve access response times in subsequent statistical periods, the K-means clustering algorithm is used to predict the files likely to be hot in the next period so that file replicas can be adjusted in advance. Together, the two aspects make replica placement more reasonable, reduce response time, and reduce I/O congestion.
Brief description of the drawings
Fig. 1 is a flowchart of the replica placement method based on file heat analysis and K-means.
Detailed description of the embodiments
The implementation of the technical solution is described in further detail below with reference to the accompanying drawing:
The replica placement method based on file heat analysis and K-means of the present invention is described in further detail with reference to the flowchart and an implementation example.
This example uses file heat analysis and the K-means algorithm to adjust the placement of replicas in a distributed system or cloud environment. As shown in Fig. 1, the method comprises the following steps:
Step 1) According to the execution times of tasks, select the minimum value as the time period for heat analysis, and analyze the access frequency of files within this period;
Step 101) In a distributed system or cloud environment, different tasks have different execution times. When file heat analysis is performed, a replica adjustment can be carried out whenever a task completes, so that the information produced by the previous task execution is applied promptly to subsequent operation. The execution time of a task can be obtained by simulation or from empirical values;
Step 102) According to the formula $f_k = n/t$, calculate the access frequency of each file within the given time period.
Step 2) From the file access frequency obtained in the previous step, calculate the access heat value of each file;
Step 201) From the obtained file access frequency, calculate the effect of access frequency on the file's heat; it is determined by the access frequencies of the file in the most recent $l$ statistical periods and their weights.
Step 202) Calculate the effect of file size on file access heat; it is determined by the file size $s_i$ and the data block size of the distributed system;
Step 203) According to the formula $h_{ij} = \alpha F_j / (S_i + 1)$, combine the corresponding values obtained in the previous two steps and normalize them to obtain the access heat value of file $i$ at time $j$.
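Steps 201) to 203) could be sketched as follows. The exact expressions for the frequency and size terms are not given in the text, so both forms here are assumptions chosen for illustration: a weighted sum over the last $l$ periods for the frequency term, and the number of data blocks occupied for the size term. Scaling by the largest heat stands in for the normalizing constant $\alpha$. All function names are hypothetical.

```python
from typing import List, Sequence


def frequency_effect(recent_freqs: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed form: weighted sum of access frequencies of the last l periods."""
    if len(recent_freqs) != len(weights):
        raise ValueError("need one weight per period")
    return sum(w * f for w, f in zip(weights, recent_freqs))


def size_effect(file_size: int, block_size: int) -> int:
    """Assumed form: number of data blocks the file occupies (ceiling division)."""
    return -(-file_size // block_size)


def normalized_heats(raw: Sequence[float]) -> List[float]:
    """Scale heats so the largest becomes 1.0 (plays the role of alpha)."""
    top = max(raw)
    return [h / top if top > 0 else 0.0 for h in raw]


# Example: three periods with more weight on recent ones, 64 MB blocks.
F = frequency_effect([0.2, 0.4, 0.8], [0.2, 0.3, 0.5])
S = size_effect(200 * 2**20, 64 * 2**20)
raw_heat = F / (S + 1)   # heat before normalization
```

Weighting recent periods more heavily makes the heat follow current demand, while the block count penalizes files that are expensive to replicate; both choices are design assumptions rather than statements from the text.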
Step 3) From the file access heat values obtained in the previous step, obtain the information of the files with high heat values and use the K-means algorithm to predict the high-heat files of the next operating period;
Step 301) From the results of the previous step, identify the files with high heat values and retrieve the information of these files from the system.
Step 302) Choose K files from the high-heat files as center files, calculate the distance from every file to each center file, and, according to the results, assign each file to the nearest cluster center;
Step 303) Repeat the previous step until a termination condition is met;
Step 4) Based on the clustering information obtained in the previous step and the access heat of each cluster center, and taking into account factors such as file size, number of files, and working environment, adjust the number of replicas and the placement location of each file. Clusters whose centers have high access heat have their replica counts increased appropriately; clusters with lower access heat have their replica counts reduced accordingly.
Claims (5)
1. A replica placement method based on file heat analysis and K-means, characterized by comprising the following steps:
Step 1) According to the execution times of tasks, select the minimum value as the time period for heat analysis, and analyze the access frequency of files within this period;
Step 2) From the file access frequency obtained in step 1), calculate the access heat value of each file;
Step 3) From the file access heat values obtained in step 2), obtain the information of the files with high heat values and use the K-means algorithm to predict the high-heat files of the next operating period;
Step 4) Based on the high-heat file information obtained in step 3), and taking into account factors such as file size, number of files, file location, and working environment, dynamically adjust the number and placement locations of file replicas.
2. The method according to claim 1, characterized in that step 1) uses a file access counter and a statistical-period timer; at initialization, the default access count of a file is 1; in each statistical period, the counter is incremented by 1 each time the file is accessed and decremented by 1 if the file is not accessed; if the access count is already 1, the counter is no longer decremented; if a file access times out without completing, the access counter is incremented by 1; the access frequency of a file in the k-th statistical period is $f_k = n/t$, where $n$ is the number of times the file is accessed in the statistical period and $t$ is the total duration of those accesses in the statistical period.
3. The method according to claim 1, characterized in that step 2) uses the file access frequency obtained in step 1) and the formula $h_{ij} = \alpha F_j / (S_i + 1)$ to calculate the access heat value of file $i$ at time $j$; in the formula, $\alpha$ is a constant used to normalize the data; $F_j$ represents the effect of access frequency on the file access heat, and $S_i$ represents the effect of file size on the file access heat. [Defining formulas for $F_j$ and $S_i$ not recoverable.]
4. The method according to claim 1, characterized in that step 3) obtains, from the file access heat values obtained in step 2), the information of the files with high heat values, chooses k files as initial centers, calculates the distance from each file to each center file, and assigns each file to the nearest cluster; the process is then repeated on the resulting clusters until a termination condition is met; the termination conditions include:
(1) no files (or only a minimal number) are reassigned to different clusters;
(2) no cluster centers (or only a minimal number) change;
(3) the sum of squared errors (SSE) reaches a local minimum,
$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2,$$
where $x$ represents a file, $m_j$ represents the center of cluster $C_j$, and $\mathrm{dist}(x, m_j)$ represents the distance between file $x$ and cluster center $m_j$.
5. The method according to claim 1, characterized in that step 4) uses the clustering information obtained in step 3): according to the access heat of each cluster center, and taking into account factors such as file size, number of files, file location, and working environment, the number and placement locations of file replicas are dynamically adjusted; the number of replicas is increased appropriately for high-heat clusters and reduced appropriately for low-heat clusters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510943677.3A CN105574153A (en) | 2015-12-16 | 2015-12-16 | Transcript placement method based on file heat analysis and K-means |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105574153A (en) | 2016-05-11 |
Family
ID=55884284
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510943677.3A Pending CN105574153A (en) | 2015-12-16 | 2015-12-16 | Transcript placement method based on file heat analysis and K-means |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105574153A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150347A (en) * | 2013-02-07 | 2013-06-12 | 浙江大学 | Dynamic replica management method based on file heat |
CN103425756A (en) * | 2013-07-31 | 2013-12-04 | 西安交通大学 | Copy management strategy for data blocks in HDFS |
CN103440182A (en) * | 2013-09-12 | 2013-12-11 | 重庆大学 | Adaptive allocation method and device and adaptive replica consistency method |
US20150169583A1 (en) * | 2013-12-18 | 2015-06-18 | Attivio, Inc. | Trending analysis for streams of documents |
CN104978362A (en) * | 2014-04-11 | 2015-10-14 | 中兴通讯股份有限公司 | Data migration method of distributive file system, data migration device of distributive file system and metadata server |
Non-Patent Citations (1)
Title |
---|
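Rao Lei (饶磊) et al.: "Dynamic replica creation algorithm based on heat analysis" (基于热度分析的动态副本创建算法), 《计算机应用》 (Journal of Computer Applications) * |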
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106161170B (en) * | 2016-07-12 | 2019-08-02 | 广东工业大学 | A kind of asynchronous file selection and Replica placement method that interval executes |
CN106161170A (en) * | 2016-07-12 | 2016-11-23 | 广东工业大学 | A kind of asynchronous file being spaced execution selects and Replica placement method |
CN107302561A (en) * | 2017-05-23 | 2017-10-27 | 南京邮电大学 | A kind of hot spot data Replica placement method in cloud storage system |
CN107302561B (en) * | 2017-05-23 | 2019-08-13 | 南京邮电大学 | A kind of hot spot data Replica placement method in cloud storage system |
CN107770259A (en) * | 2017-09-30 | 2018-03-06 | 武汉理工大学 | Copy amount dynamic adjusting method based on file temperature and node load |
CN108416054A (en) * | 2018-03-20 | 2018-08-17 | 东北大学 | Dynamic HDFS copy number calculating methods based on file access temperature |
CN108416054B (en) * | 2018-03-20 | 2021-10-22 | 东北大学 | Method for calculating number of copies of dynamic HDFS (Hadoop distributed File System) based on file access heat |
CN108804351B (en) * | 2018-05-30 | 2021-10-29 | 郑州云海信息技术有限公司 | Cache replacement method and device |
CN108804351A (en) * | 2018-05-30 | 2018-11-13 | 郑州云海信息技术有限公司 | A kind of caching replacement method and device |
CN110879852A (en) * | 2018-09-05 | 2020-03-13 | 南京大学 | Video content caching method |
CN113312329A (en) * | 2020-02-26 | 2021-08-27 | 阿里巴巴集团控股有限公司 | Data file scheduling method, device and equipment |
CN113312329B (en) * | 2020-02-26 | 2024-03-01 | 阿里巴巴集团控股有限公司 | Scheduling method, device and equipment for data files |
CN114205416A (en) * | 2021-10-27 | 2022-03-18 | 北京旷视科技有限公司 | Resource caching method and device, electronic equipment and computer readable medium |
CN114205416B (en) * | 2021-10-27 | 2024-03-12 | 北京旷视科技有限公司 | Resource caching method, device, electronic equipment and computer readable medium |
CN115190181A (en) * | 2022-09-07 | 2022-10-14 | 睿至科技集团有限公司 | Resource management method and system based on cloud management |
CN115190181B (en) * | 2022-09-07 | 2023-02-17 | 睿至科技集团有限公司 | Resource management method and system based on cloud management |
CN115292389A (en) * | 2022-10-08 | 2022-11-04 | 南通君合云起信息科技有限公司 | Big data self-adaptive storage method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105574153A (en) | Transcript placement method based on file heat analysis and K-means | |
CN103150347B (en) | Based on the dynamic replication management method of file temperature | |
CN107330056A (en) | Wind power plant SCADA system and its operation method based on big data cloud computing platform | |
CN104023088A (en) | Storage server selection method applied to distributed file system | |
CN102984280A (en) | Data backup system and method for social cloud storage network application | |
CN103106152A (en) | Data scheduling method based on gradation storage medium | |
WO2020013884A1 (en) | Machine-learned prediction of network resources and margins | |
EP2671152A1 (en) | Estimating a performance characteristic of a job using a performance model | |
CN110888714A (en) | Container scheduling method, device and computer-readable storage medium | |
CN102857560A (en) | Multi-service application orientated cloud storage data distribution method | |
CN103905517A (en) | Data storage method and equipment | |
Domínguez et al. | Multi-chronological hierarchical clustering to solve capacity expansion problems with renewable sources | |
CN108416054B (en) | Method for calculating number of copies of dynamic HDFS (Hadoop distributed File System) based on file access heat | |
CN103294912B (en) | A kind of facing mobile apparatus is based on the cache optimization method of prediction | |
Zhai et al. | A two-layer algorithm based on PSO for solving unit commitment problem | |
Shen et al. | Host load prediction with bi-directional long short-term memory in cloud computing | |
Yu et al. | Achieving load-balanced, redundancy-free cluster caching with selective partition | |
CN110971468B (en) | Delayed copy incremental container check point processing method based on dirty page prediction | |
Myint et al. | Comparative analysis of adaptive file replication algorithms for cloud data storage | |
Zhao et al. | A weight-based dynamic replica replacement strategy in data grids | |
Soosai et al. | Dynamic replica replacement strategy in data grid | |
Spivak et al. | Storage tier-aware replicative data reorganization with prioritization for efficient workload processing | |
Biran et al. | Enabling green content distribution network by cloud orchestration | |
Zhao et al. | Improve the performance of data grids by value-based replication strategy | |
CN102096723A (en) | Data query method based on copy replication algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160511 |