CN105574153A - Transcript placement method based on file heat analysis and K-means - Google Patents

Info

Publication number
CN105574153A
Authority
CN
China
Prior art keywords
file
access
temperature
document
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510943677.3A
Other languages
Chinese (zh)
Inventor
马廷淮
李坚
田伟
金子龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201510943677.3A
Publication of CN105574153A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/128 Details of file system snapshots on the file-level, e.g. snapshot creation, administration, deletion

Abstract

The invention provides a replica (transcript) placement method based on file heat analysis and K-means. First, the access frequency of a file in a given period of time is analyzed to calculate the access heat of the file. Files likely to have high access heat are then predicted from the access heat in combination with the K-means algorithm, and the number and placement positions of file replicas are dynamically adjusted on demand by comprehensively considering the statistical cycle, file size, working environment and other factors. The method can effectively shorten the average response time of file access and improve data service performance.

Description

A replica placement method based on file heat analysis and K-means
Technical field
The invention belongs to the field of cloud computing, and specifically relates to a method that uses statistical heat analysis and the K-means algorithm to dynamically adjust the placement of replicas of high-heat files in a cloud environment.
Background
With the development of society and the improvement of computer storage and data processing capacity, explosive data growth has become a defining feature of the current era. According to an estimate of data growth by International Data Corporation (IDC), 40 ZB (1 ZB = 1.1805916207174113 × 10^21 B) of data will be produced by 2020, equivalent to 5247 GB per person on Earth (http://datacenter.watchstor.com/infra-143421.htm). Faced with this ever-growing mass of data, the storage and management of massive data has received increasing attention.
To improve system reliability and access efficiency, conventional replication technology copies each data item several times and stores the copies on multiple nodes of a distributed file system. To meet the different data access requirements raised at different historical stages, a variety of replication strategies have been proposed, mainly master-slave, hierarchical, peer-to-peer (P2P), and graph-based strategies.
A replication strategy usually has to decide two things: the number of replicas and where they are stored. Depending on when the decision is made, strategies are either static or dynamic. In 2001, Ian Foster and Kavitha Ranganathan proposed six replica creation strategies for hierarchically structured network topologies: no replication, best client, cascading, plain caching, caching plus cascading, and fast spread (Li Lin, "Research and implementation of a replica optimization strategy based on an economic model in a data grid environment"). These strategies reduce access latency in most cases, but the cascading, caching plus cascading, and fast spread strategies only apply to data grids in which data is stored at the top-level node, while the best client and plain caching strategies do not consider characteristics such as topology, data distribution, network bandwidth, and node storage capacity (Sun Haiyan, Wang Xiaodong, Zhou Bin, et al., "SADDERS: a two-layer dynamic replica creation strategy based on storage alliances"), nor the effect of file size and network bandwidth on access latency.
The present invention analyzes the access frequency of files within a given time period and calculates the access heat of each file according to a heat formula. Using the access heat in combination with the K-means algorithm, it predicts the files likely to have high access heat in the next period (Rao Lei, Yang Fande, Li Xinming, Liu Dong, "Dynamic replica creation algorithm based on heat analysis"), and, taking into account factors such as the statistics period, file size, and working environment, dynamically adjusts the number and placement of file replicas.
Summary of the invention
The technical problem to be solved by the present invention is replica placement in distributed systems or cloud computing platforms. A replica placement method based on file heat analysis and K-means is proposed: according to the execution times of tasks, the maximum value is chosen as the time period, and the access heat of files is calculated within that period. Using the access heat in combination with the K-means algorithm, the files likely to have high access heat in the next period are predicted, and the number and placement of file replicas are dynamically adjusted as required, taking into account factors such as the statistics period, file size, and working environment. The invention can effectively reduce the average response time of file access and improve data service performance.
Technical solution:
A replica placement method based on file heat analysis and K-means, comprising the following steps:
Step 1): according to the execution times of tasks, select the minimum value as the time period for heat analysis, and analyze the access frequency of files within this period;
Step 2): calculate the access heat value of each file from the file access frequency obtained in step 1);
Step 3): from the file access heat values obtained in step 2), obtain information on the files with high heat values, and use the K-means algorithm to compute and predict the high-heat files of the next operating period;
Step 4): based on the high-heat file information obtained in step 3), dynamically adjust the number and placement of file replicas, taking into account factors such as file size, number of files, file location, and working environment;
Further, in step 1) of the method, according to the execution times of tasks, the maximum value is selected as the time period for heat analysis, and the access frequency of files is analyzed within this period. The invention uses a file access counter and a statistics period timer. At initialization, the default access count of a file is 1. In each statistics period, the counter is incremented by 1 each time the file is accessed, and decremented by 1 if the file is not accessed. If the access count is already 1, the counter is not decremented further. If a file access times out without completing, the access counter is incremented by 1. The access frequency of a file in the k-th statistics period is $f_k = n/t$, where n is the number of times the file is accessed in the statistics period and t is the total duration of the accesses in the statistics period;
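The counter rule described above can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the patent's implementation: the class name `AccessCounter`, the method `end_of_period`, and the choice to treat a timed-out access as one more increment alongside completed accesses are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AccessCounter:
    """Per-file access counter (hypothetical name), default count 1."""

    count: int = 1

    def end_of_period(self, completed: int, timed_out: int = 0) -> int:
        """Apply one statistics period to the counter.

        completed -- accesses that finished within the period
        timed_out -- accesses that timed out without completing
        Assumption: every access attempt, completed or timed out, adds 1.
        """
        attempts = completed + timed_out
        if attempts > 0:
            self.count += attempts
        elif self.count > 1:
            # File was not accessed: decay by 1, but never drop below 1.
            self.count -= 1
        return self.count


counter = AccessCounter()
history = [counter.end_of_period(c) for c in (3, 0, 0, 0, 0)]
```

With three accesses followed by four idle periods, the counter rises to 4 and then decays back to the floor of 1.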
Further, in step 2) of the method, the access heat value of file i at time j is calculated from the file access frequency obtained in step 1) using the formula $h_{ij} = \alpha F_j / (S_i + 1)$. In the formula, $\alpha$ is a constant used to normalize the data, $F_j$ represents the influence of access frequency on the file's access heat, and $S_i$ represents the influence of file size on the file's access heat. [The defining expressions for $F_j$ and $S_i$ are missing from this text.]
Further, in step 3) of the method, information on the files with high heat values is obtained from the file access heat values computed in step 2). k files are chosen as initial centers, the distance from each file to each center file is calculated, and each file is assigned to the nearest cluster. This process is repeated on the resulting clusters until a termination condition is met. The termination conditions are:
(1) no (or only a minimal number of) files are reassigned to a different cluster;
(2) no (or only a minimal number of) cluster centers change;
(3) the sum of squared errors (SSE) reaches a local minimum, where $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$, x denotes a file, $m_j$ denotes the center of cluster $C_j$, and $dist(x, m_j)$ denotes the distance between file x and cluster center $m_j$;
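The clustering loop and its termination conditions can be illustrated with a small one-dimensional K-means over heat values. This is a minimal sketch under stated assumptions: each file is represented only by its access heat (hot value), the distance is taken to be the absolute difference of heat values, and the function name `kmeans_heat` and its parameters are hypothetical rather than taken from the patent.

```python
def kmeans_heat(values, initial_centers, max_iter=100, tol=0.0):
    """Cluster heat values; return (labels, centers, sse).

    Stops when no file changes cluster (condition 1) or when no
    center moves by more than `tol` (condition 2). The returned SSE
    is the quantity named in condition 3.
    """
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assign every file to the nearest cluster center.
        new_labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Recompute each center as the mean heat of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, new_labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        moved = max(abs(a - b) for a, b in zip(centers, new_centers))
        unchanged = new_labels == labels
        labels, centers = new_labels, new_centers
        if unchanged or moved <= tol:
            break
    sse = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return labels, centers, sse


heats = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
labels, centers, sse = kmeans_heat(heats, initial_centers=[0.1, 1.0])
```

Here the first three files end up in the low-heat cluster and the last three in the high-heat cluster. A production system would more likely call an existing K-means implementation; the loop is written out only to make the stopping rules visible.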
Further, in step 4) of the method, based on the clustering information obtained in step 3) and the access heat of each cluster center, the number and placement of file replicas are dynamically adjusted, taking into account factors such as file size, number of files, file location, and working environment: the number of replicas is increased appropriately for high-heat clusters and reduced appropriately for low-heat clusters.
Beneficial effects
The present invention addresses replica placement in distributed systems or cloud computing platforms. By combining file access heat analysis with the K-means algorithm, it helps achieve reasonable replica placement in systems with high access load. The method remedies a shortcoming of earlier replica placement methods that relied on file heat analysis alone and placed replicas purely according to the file heat in the current statistics period. In addition, to improve access response time in subsequent statistics periods, it uses the K-means clustering algorithm to predict the files likely to have high heat in the next period and adjusts file replicas in advance. The combination of these two aspects makes replica placement more reasonable, reduces response time, and also reduces I/O congestion.
Description of the drawings
Fig. 1 is a flowchart of the replica placement method based on file heat analysis and K-means.
Embodiment
The implementation of the technical solution is described in further detail below with reference to the accompanying drawing:
The replica placement method based on file heat analysis and K-means of the present invention is described in further detail with reference to the flowchart and an implementation case.
This implementation case uses file heat analysis and the K-means algorithm to adjust the placement of replicas in a distributed system or cloud environment. As shown in Fig. 1, the method comprises the following steps:
Step 1): according to the execution times of tasks, select the minimum value as the time period for heat analysis, and analyze the access frequency of files within this period;
Step 101): in a distributed system or cloud environment, different tasks have different execution times. When file heat analysis is performed, a replica adjustment can be carried out as soon as a task has completed, so that the information produced by the most recent task execution is applied promptly to subsequent applications. The execution time of a task can be obtained by simulation or from empirical values;
Step 102): according to the formula $f_k = n/t$, calculate the access frequency of each file within the given time period.
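Steps 101) and 102) can be sketched as follows. The log format (tuples of file id, start time, access duration), the function names, and the `pick` parameter are assumptions made for illustration; `pick` defaults to `min` to match step 1) as stated here.

```python
from collections import defaultdict


def choose_period(task_times, pick=min):
    """Pick the statistics period length from task execution times."""
    if not task_times:
        raise ValueError("at least one task execution time is required")
    return pick(task_times)


def access_frequencies(events, period_start, period_length):
    """Compute f_k = n / t per file for one statistics period.

    events -- iterable of (file_id, start_time, duration) tuples (assumed format)
    n is the number of accesses that start inside the period and
    t is the sum of their durations.
    """
    n = defaultdict(int)
    t = defaultdict(float)
    period_end = period_start + period_length
    for file_id, start, duration in events:
        if period_start <= start < period_end:
            n[file_id] += 1
            t[file_id] += duration
    # Files whose total access duration is zero are skipped to avoid division by zero.
    return {f: n[f] / t[f] for f in n if t[f] > 0}


log = [("a", 0.0, 2.0), ("a", 1.0, 2.0), ("b", 3.0, 0.5), ("a", 12.0, 1.0)]
period = choose_period([10.0, 30.0, 20.0])
freqs = access_frequencies(log, period_start=0.0, period_length=period)
```

In this example the period is 10.0, file "a" has n = 2 and t = 4.0, and file "b" has n = 1 and t = 0.5; the access at time 12.0 falls outside the period and is ignored.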
Step 2): calculate the access heat value of each file from the file access frequency obtained in the previous step;
Step 201): from the obtained file access frequency, calculate the influence of access frequency on the file's heat; this is determined by the file's access frequencies in the most recent l statistics periods and their weights.
Step 202): calculate the influence of file size on the file's access heat; this is determined by the file size $s_i$ and the data block size of the distributed system;
Step 203): according to the formula $h_{ij} = \alpha F_j / (S_i + 1)$, combine the corresponding values obtained in the previous two steps and normalize them to obtain the access heat value of file i at time j.
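The heat formula from step 203) can be tried out with the short sketch below. The text does not give the exact expressions for the frequency term and the size term, so both are assumptions here: the frequency term is taken as a weighted sum of the access frequencies of the most recent l periods, and the size term as the number of data blocks the file occupies. The function names and the default `alpha=1.0` are likewise hypothetical.

```python
import math


def frequency_term(recent_freqs, weights):
    """Assumed form of F: weighted sum over the most recent l periods."""
    if len(recent_freqs) != len(weights):
        raise ValueError("one weight per statistics period is required")
    return sum(w * f for w, f in zip(weights, recent_freqs))


def size_term(file_size, block_size):
    """Assumed form of S: number of data blocks occupied by the file."""
    return math.ceil(file_size / block_size)


def access_heat(recent_freqs, weights, file_size, block_size, alpha=1.0):
    """h = alpha * F / (S + 1), the formula given in step 203)."""
    f = frequency_term(recent_freqs, weights)
    s = size_term(file_size, block_size)
    return alpha * f / (s + 1)


# Two recent periods with equal weights; a 128 MB file on 64 MB blocks.
h = access_heat([2.0, 4.0], [0.5, 0.5], file_size=128, block_size=64)
```

With these inputs the frequency term is 3.0 and the size term is 2, giving a heat of 1.0. The `+ 1` in the denominator keeps the formula defined even when the size term is zero.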
Step 3): from the file access heat values obtained in the previous step, obtain information on the files with high heat values, and use the K-means algorithm to compute and predict the high-heat files of the next operating period;
Step 301): from the result calculated in the previous step, identify the files with high heat values and obtain the information on these files from the system.
Step 302): choose K files from the high-heat files as center files, calculate the distance from every file to each center file, and assign each file to the nearest cluster center according to the result;
Step 303): repeat the previous step until a termination condition is met;
Step 4): based on the clustering information obtained in the previous step and the access heat of each cluster center, adjust the number of replicas and the placement of each file, taking into account factors such as file size, number of files, and working environment. The number of replicas is increased appropriately for clusters whose center has high access heat, and reduced correspondingly for clusters with lower access heat.
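The replica-count part of step 4) might look like the sketch below. The text only says that replica counts are increased for high-heat clusters and reduced for low-heat ones, so the thresholds, the step size of one replica, and the minimum and maximum bounds are all hypothetical parameters. Placement location, which depends on the working environment, is not modeled.

```python
def adjust_replicas(current, cluster_of, center_heat,
                    hot_threshold, cold_threshold,
                    min_replicas=1, max_replicas=5):
    """Return new replica counts per file.

    current     -- {file_id: current replica count}
    cluster_of  -- {file_id: cluster index}
    center_heat -- access heat of each cluster center, indexed by cluster
    Thresholds and bounds are illustrative assumptions.
    """
    result = {}
    for file_id, replicas in current.items():
        heat = center_heat[cluster_of[file_id]]
        if heat >= hot_threshold:
            replicas += 1   # high-heat cluster: add a replica
        elif heat <= cold_threshold:
            replicas -= 1   # low-heat cluster: remove a replica
        result[file_id] = max(min_replicas, min(max_replicas, replicas))
    return result


new_counts = adjust_replicas(
    current={"a": 3, "b": 3, "c": 1},
    cluster_of={"a": 0, "b": 1, "c": 1},
    center_heat=[0.9, 0.1],
    hot_threshold=0.8,
    cold_threshold=0.2,
)
```

File "a" belongs to the hot cluster and gains a replica, "b" loses one, and "c" stays at the lower bound of one replica.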

Claims (5)

1. A replica placement method based on file heat analysis and K-means, characterized by comprising the following steps:
Step 1): according to the execution times of tasks, select the minimum value as the time period for heat analysis, and analyze the access frequency of files within this period;
Step 2): calculate the access heat value of each file from the file access frequency obtained in step 1);
Step 3): from the file access heat values obtained in step 2), obtain information on the files with high heat values, and use the K-means algorithm to compute and predict the high-heat files of the next operating period;
Step 4): based on the high-heat file information obtained in step 3), dynamically adjust the number and placement of file replicas, taking into account factors such as file size, number of files, file location, and working environment.
2. The method according to claim 1, characterized in that step 1) uses a file access counter and a statistics period timer; at initialization, the default access count of a file is 1; in each statistics period, the counter is incremented by 1 each time the file is accessed and decremented by 1 if the file is not accessed; if the access count is already 1, the counter is not decremented further; if a file access times out without completing, the access counter is incremented by 1; the access frequency of a file in the k-th statistics period is $f_k = n/t$, where n is the number of times the file is accessed in the statistics period and t is the total duration of the accesses in the statistics period.
3. The method according to claim 1, characterized in that in step 2) the access heat value of file i at time j is calculated from the file access frequency obtained in step 1) using the formula $h_{ij} = \alpha F_j / (S_i + 1)$; in the formula, $\alpha$ is a constant used to normalize the data, $F_j$ represents the influence of access frequency on the file's access heat, and $S_i$ represents the influence of file size on the file's access heat. [The defining expressions for $F_j$ and $S_i$ are missing from this text.]
4. The method according to claim 1, characterized in that in step 3) information on the files with high heat values is obtained from the file access heat values computed in step 2); k files are chosen as initial centers, the distance from each file to each center file is calculated, and each file is assigned to the nearest cluster; this process is repeated on the resulting clusters until a termination condition is met; the termination conditions are:
(1) no (or only a minimal number of) files are reassigned to a different cluster;
(2) no (or only a minimal number of) cluster centers change;
(3) the sum of squared errors (SSE) reaches a local minimum, where $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$, x denotes a file, $m_j$ denotes the center of cluster $C_j$, and $dist(x, m_j)$ denotes the distance between file x and cluster center $m_j$.
5. The method according to claim 1, characterized in that in step 4), based on the clustering information obtained in step 3) and the access heat of each cluster center, the number and placement of file replicas are dynamically adjusted, taking into account factors such as file size, number of files, file location, and working environment: the number of replicas is increased appropriately for high-heat clusters and reduced appropriately for low-heat clusters.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510943677.3A CN105574153A (en) 2015-12-16 2015-12-16 Transcript placement method based on file heat analysis and K-means

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510943677.3A CN105574153A (en) 2015-12-16 2015-12-16 Transcript placement method based on file heat analysis and K-means

Publications (1)

Publication Number Publication Date
CN105574153A true CN105574153A (en) 2016-05-11

Family

ID=55884284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510943677.3A Pending CN105574153A (en) 2015-12-16 2015-12-16 Transcript placement method based on file heat analysis and K-means

Country Status (1)

Country Link
CN (1) CN105574153A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150347A (en) * 2013-02-07 2013-06-12 浙江大学 Dynamic replica management method based on file heat
CN103425756A (en) * 2013-07-31 2013-12-04 西安交通大学 Copy management strategy for data blocks in HDFS
CN103440182A (en) * 2013-09-12 2013-12-11 重庆大学 Adaptive allocation method and device and adaptive replica consistency method
US20150169583A1 (en) * 2013-12-18 2015-06-18 Attivio, Inc. Trending analysis for streams of documents
CN104978362A (en) * 2014-04-11 2015-10-14 中兴通讯股份有限公司 Data migration method of distributive file system, data migration device of distributive file system and metadata server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
饶磊等: ""基于热度分析的动态副本创建算法"", 《计算机应用》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106161170B (en) * 2016-07-12 2019-08-02 广东工业大学 A kind of asynchronous file selection and Replica placement method that interval executes
CN106161170A (en) * 2016-07-12 2016-11-23 广东工业大学 A kind of asynchronous file being spaced execution selects and Replica placement method
CN107302561A (en) * 2017-05-23 2017-10-27 南京邮电大学 A kind of hot spot data Replica placement method in cloud storage system
CN107302561B (en) * 2017-05-23 2019-08-13 南京邮电大学 A kind of hot spot data Replica placement method in cloud storage system
CN107770259A (en) * 2017-09-30 2018-03-06 武汉理工大学 Copy amount dynamic adjusting method based on file temperature and node load
CN108416054A (en) * 2018-03-20 2018-08-17 东北大学 Dynamic HDFS copy number calculating methods based on file access temperature
CN108416054B (en) * 2018-03-20 2021-10-22 东北大学 Method for calculating number of copies of dynamic HDFS (Hadoop distributed File System) based on file access heat
CN108804351B (en) * 2018-05-30 2021-10-29 郑州云海信息技术有限公司 Cache replacement method and device
CN108804351A (en) * 2018-05-30 2018-11-13 郑州云海信息技术有限公司 A kind of caching replacement method and device
CN110879852A (en) * 2018-09-05 2020-03-13 南京大学 Video content caching method
CN113312329A (en) * 2020-02-26 2021-08-27 阿里巴巴集团控股有限公司 Data file scheduling method, device and equipment
CN113312329B (en) * 2020-02-26 2024-03-01 阿里巴巴集团控股有限公司 Scheduling method, device and equipment for data files
CN114205416A (en) * 2021-10-27 2022-03-18 北京旷视科技有限公司 Resource caching method and device, electronic equipment and computer readable medium
CN114205416B (en) * 2021-10-27 2024-03-12 北京旷视科技有限公司 Resource caching method, device, electronic equipment and computer readable medium
CN115190181A (en) * 2022-09-07 2022-10-14 睿至科技集团有限公司 Resource management method and system based on cloud management
CN115190181B (en) * 2022-09-07 2023-02-17 睿至科技集团有限公司 Resource management method and system based on cloud management
CN115292389A (en) * 2022-10-08 2022-11-04 南通君合云起信息科技有限公司 Big data self-adaptive storage method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160511