CN105574153A

CN105574153A - Replica placement method based on file heat analysis and K-means

Info

Publication number: CN105574153A
Application number: CN201510943677.3A
Authority: CN
Inventors: 马廷淮; 李坚; 田伟; 金子龙
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2015-12-16
Filing date: 2015-12-16
Publication date: 2016-05-11

Abstract

The invention provides a replica placement method based on file heat analysis and K-means. First, the access frequency of a file within a given time period is analyzed to calculate the access heat of the file. The access heat is then used, in combination with the K-means algorithm, to predict the files likely to have high access heat, and the number and placement locations of file replicas are dynamically adjusted on demand, taking into account the statistical period, file size, working environment and other factors. The replica placement method provided by the invention can effectively shorten the average response time of file access and improve data service performance.

Description

A replica placement method based on file heat analysis and K-means

Technical field

The invention belongs to the field of cloud computing, and specifically relates to a method that uses heat statistics analysis and the K-means algorithm to dynamically adjust the placement of replicas of high-heat files in a cloud environment.

Background art

With the development of society and the improvement of computer storage and data-processing capacity, explosive data growth has become a key characteristic of the current era. According to estimates of data growth by the International Data Corporation (IDC), 40 ZB (1 ZB = 1.1805916207174113 × 10²¹ B) of data will be produced by 2020, equivalent to 5247 GB for every person on Earth (http://datacenter.watchstor.com/infra-143421.htm). In the face of this ever-growing volume of data, the storage and management of massive data have attracted increasing attention.

To improve system reliability and access efficiency, conventional replication techniques make multiple copies of a data item and store them on multiple nodes of a distributed file system. To meet the different data-access requirements raised at each historical stage, several replication strategies have been proposed, mainly master-slave, hierarchical, peer-to-peer (P2P), and graph-based strategies.

A replication strategy usually has to decide two things: the number of replicas and where they are stored. Depending on when the decision is made, strategies fall into two classes, static and dynamic. In 2001, Ian Foster and Kavitha Ranganathan proposed six replica creation strategies for hierarchical network topologies: no replication, best client, cascading, plain caching, caching plus cascading, and fast spread (Li Lin, "Research and implementation of replica optimization strategies based on an economic model in data grid environments"). These strategies reduce access delay in most cases. However, the cascading, caching plus cascading, and fast spread strategies apply only to data grids in which the data are stored at the top-level node, while the best client and plain caching strategies do not consider characteristics such as topology, data distribution, network bandwidth, and node storage capacity (Sun Haiyan, Wang Xiaodong, Zhou Bin, et al., "SADDERS: a two-layer dynamic replica creation strategy based on storage alliances"), nor do they consider the effect of file size and network bandwidth on access delay.

The present invention analyzes the access frequency of files within a given time period and calculates the access heat of each file according to a heat formula. Using the access heat of the files, in combination with the K-means algorithm, it predicts the files likely to have high access heat in the next period (Fu Lei, Yang Fande, Li Xinming, Liu Dong, "A dynamic replica creation algorithm based on heat analysis"), and at the same time takes into account factors such as the statistical period, file size, and working environment to dynamically adjust the number and placement locations of file replicas.

Summary of the invention

The technical problem to be solved by the present invention is replica placement in a distributed system or cloud computing platform. A replica placement method based on file heat analysis and K-means is proposed: the maximum task execution time is chosen as the time period, and the access heat of each file within that period is calculated. Using the access heat of the files, in combination with the K-means algorithm, the files likely to have high access heat in the next period are predicted, and the number and placement locations of file replicas are dynamically adjusted on demand, taking into account factors such as the statistical period, file size, and working environment. The invention can effectively reduce the average response time of file access and improve data service performance.

Technical scheme:

A replica placement method based on file heat analysis and K-means, comprising the following steps:

Step 1), according to the execution times of tasks, select the minimum value as the time period for heat analysis, and analyze the access frequency of files within this time period;

Step 2), according to the file access frequency obtained in step 1), calculate the access heat value of each file;

Step 3), according to the file access heat values obtained in step 2), obtain the information of the files with high heat values, and use the K-means algorithm to calculate and predict the high-heat files of the next operating period;

Step 4), according to the high-heat file information obtained in step 3), dynamically adjust the number and placement locations of file replicas, taking into account factors such as file size, number of files, file location, and working environment.

Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 1) selects, according to the execution times of tasks, the maximum value as the time period for heat analysis, and analyzes the access frequency of files within this time period. The invention uses a file access counter and a statistical-period timer. At initialization, the default access count of a file is 1. In each statistical period, the counter is incremented by 1 each time the file is accessed and decremented by 1 if the file is not accessed. Once the access count has reached 1, the counter is no longer decremented. If a file access times out without completing, the access counter is incremented by 1. The access frequency of a file in the k-th statistical period is $f_k = n/t$, where $n$ is the number of times the file is accessed within the statistical period and $t$ is the total duration of those accesses within the statistical period.
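The counter rules and the frequency $f_k = n/t$ described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and method names are hypothetical, a timed-out access is assumed to add one extra increment on top of the regular one, and the "not accessed" decrement is assumed to be applied once at the end of each statistical period.

```python
from dataclasses import dataclass


@dataclass
class FileAccessStats:
    """Per-file access counter for statistical periods (hypothetical names)."""
    count: int = 1          # access counter, default value 1 at initialization
    accesses: int = 0       # n: accesses seen in the current period
    busy_time: float = 0.0  # t: total duration of those accesses

    def record_access(self, duration: float, timed_out: bool = False) -> None:
        """Register one access; each access increments the counter by 1."""
        self.count += 1
        self.accesses += 1
        self.busy_time += duration
        if timed_out:
            # Assumption: an access that times out adds one more increment.
            self.count += 1

    def end_period(self) -> float:
        """Close the period, return f_k = n / t, and reset n and t."""
        freq = self.accesses / self.busy_time if self.busy_time > 0 else 0.0
        if self.accesses == 0 and self.count > 1:
            # Not accessed in this period: decrement, but never below 1.
            self.count -= 1
        self.accesses = 0
        self.busy_time = 0.0
        return freq
```

For example, two accesses lasting 2.0 time units each give $f_k = 2 / 4.0 = 0.5$; a following period with no accesses returns a frequency of 0 and lowers the counter by 1.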

Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 2) uses the file access frequency obtained in step 1) and the formula $h_{ij} = \alpha F_j / (S_i + 1)$ to calculate the access heat value of file $i$ at time $j$. In the formula, $\alpha$ is a constant used to normalize the data; $F_j$ represents the influence of access frequency on file access heat, and $S_i$ represents the influence of file size on file access heat. Wherein,

Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 3) uses the file access heat values obtained in step 2) to obtain the information of the files with high heat values, chooses k files as initial centers, calculates the distance from each file to each center file, and assigns each file to the nearest cluster. The process is then repeated on the current cluster assignment until a termination condition is met. The termination conditions include:

(1) no files (or only a minimal number) are reassigned to a different cluster;

(2) no cluster centers (or only a minimal number) change;

(3) the sum of squared errors (SSE) reaches a local minimum, $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2$, where $x$ represents a file, $m_j$ represents the cluster center of cluster $C_j$, and $\mathrm{dist}(x, m_j)$ represents the distance between file $x$ and cluster center $m_j$;
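The clustering loop and its three termination conditions can be illustrated with a short sketch. It assumes that each file is represented only by its scalar heat value and that distance is the absolute difference; the function name, the `min_changes` threshold, and the iteration cap are hypothetical, and "local minimum of SSE" is approximated by stopping once the SSE no longer decreases.

```python
from typing import List, Tuple


def kmeans_heat(heats: List[float], centers: List[float],
                min_changes: int = 0,
                max_iter: int = 100) -> Tuple[List[int], List[float], float]:
    """Cluster scalar heat values; returns (labels, centers, sse)."""
    centers = list(centers)
    labels = [-1] * len(heats)
    prev_sse = float("inf")
    sse = prev_sse
    for _ in range(max_iter):
        # Assignment step: each file goes to the nearest cluster center.
        new_labels = [min(range(len(centers)), key=lambda j: abs(h - centers[j]))
                      for h in heats]
        moves = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        # Update step: each center becomes the mean heat of its members.
        new_centers = []
        for j, c in enumerate(centers):
            members = [h for h, lab in zip(heats, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else c)
        changed = sum(a != b for a, b in zip(centers, new_centers))
        centers = new_centers
        sse = sum((h - centers[lab]) ** 2 for h, lab in zip(heats, labels))
        # (1) few reassignments, (2) few center changes, (3) SSE stopped falling.
        if moves <= min_changes or changed <= min_changes or sse >= prev_sse:
            break
        prev_sse = sse
    return labels, centers, sse
```

Calling `kmeans_heat([0.1, 0.2, 0.15, 5.0, 5.2, 4.8], [0.1, 5.0])` separates the three low-heat files from the three high-heat files; the cluster whose center has the higher heat is the candidate set of high-heat files for the next period.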

Further, in the replica placement method based on file heat analysis and K-means of the present invention, step 4) uses the clustering information obtained in step 3) and the access heat of each cluster center to dynamically adjust the number and placement locations of file replicas, taking into account factors such as file size, number of files, file location, and working environment: the number of replicas is suitably increased for high-heat clusters and suitably reduced for low-heat clusters.

Beneficial effect

The present invention addresses replica placement in a distributed system or cloud computing platform. It combines file access heat with the K-means algorithm in a comprehensive analysis, which helps to place replicas reasonably in systems with heavy access. The method remedies a shortcoming of earlier replica placement methods that rely on file heat analysis alone and place replicas using only the file heat of the current statistical period. In addition, to improve the response time of accesses in subsequent statistical periods, the K-means clustering algorithm is used to predict the files likely to be high-heat in the next period, so that file replicas can be adjusted in advance. Combining the two aspects makes replica placement more reasonable, reduces response time, and also reduces I/O congestion.

Brief description of the drawings

Fig. 1 is a flow chart of the replica placement method based on file heat analysis and K-means.

Embodiment

The implementation of the technical scheme is described in further detail below with reference to the accompanying drawing:

The replica placement method based on file heat analysis and K-means of the present invention is described in further detail with reference to the flow chart and an implementation case.

This implementation case uses file heat analysis and the K-means algorithm to adjust the placement of replicas in a distributed system or cloud environment. As shown in Fig. 1, the method comprises the following steps:

Step 101), in a distributed system or cloud environment, different tasks have different execution times. When file heat analysis is carried out, a replica adjustment can be performed as soon as a task completes, so that the information produced by the most recent task execution is applied promptly to subsequent applications. The execution times of tasks can be obtained by simulation or from empirical values;

Step 102), according to the formula $f_k = n/t$, calculate the access frequency of each file within the given time period.

Step 2), according to the file access frequency obtained in the previous step, calculate the access heat value of each file;

Step 201), from the obtained file access frequency, calculate the influence of access frequency on the file's heat, which is determined by the file's access frequencies in the most recent l statistical periods and their weights;

Step 202), calculate the influence of file size on file access heat, which is determined by the file size $s_i$ and the data block size of the distributed system;

Step 203), according to the formula $h_{ij} = \alpha F_j / (S_i + 1)$, combine the corresponding values obtained in the two preceding steps and normalize them to calculate the access heat value of file $i$ at time $j$.
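Steps 201) to 203) can be sketched as below. The exact expressions for $F_j$ and $S_i$ are not reproduced in this text, so the sketch makes two labeled assumptions: $F_j$ is taken to be a weighted sum of the access frequencies in the most recent l statistical periods, and $S_i$ is taken to be the file size divided by the data block size. All function and parameter names are hypothetical.

```python
from typing import Sequence


def frequency_factor(freqs: Sequence[float], weights: Sequence[float]) -> float:
    """F_j, assumed here to be a weighted sum of the last l period frequencies."""
    if len(freqs) != len(weights):
        raise ValueError("need exactly one weight per statistical period")
    return sum(w * f for w, f in zip(weights, freqs))


def size_factor(file_size: float, block_size: float) -> float:
    """S_i, assumed here to be the file size expressed in data blocks."""
    return file_size / block_size


def access_heat(freqs: Sequence[float], weights: Sequence[float],
                file_size: float, block_size: float,
                alpha: float = 1.0) -> float:
    """h_ij = alpha * F_j / (S_i + 1), the heat formula given in the text."""
    f_j = frequency_factor(freqs, weights)
    s_i = size_factor(file_size, block_size)
    return alpha * f_j / (s_i + 1)
```

With frequencies 2.0 and 4.0 weighted 0.5 each, a 128 MB file, and 64 MB blocks, the sketch gives $F_j = 3.0$, $S_i = 2.0$, and a heat of 1.0 for $\alpha = 1$. Under these assumptions a larger file with the same access pattern receives a lower heat value, since replicating it is more expensive.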

Step 3), according to the file access heat values obtained in the previous step, obtain the information of the files with high heat values, and use the K-means algorithm to calculate and predict the high-heat files of the next operating period;

Step 301), from the results calculated in the previous step, identify the files with high heat values and obtain the information of these files from the system.

Step 302), choose K files from the high-heat files as center files, calculate the distance from every file to each center file, and assign each file to the nearest cluster center according to the results;

Step 303), repeat the previous step until a termination condition is met;

Step 4), according to the clustering information obtained in the previous step and the access heat of each cluster center, adjust the number of replicas and the placement location of each file, taking into account factors such as file size, number of files, and working environment. The number of replicas is suitably increased for clusters whose centers have high access heat, and correspondingly reduced for clusters with lower access heat.
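One possible way to turn the cluster-center heat into replica counts is sketched below. The text does not specify thresholds, step sizes, or bounds, so the `hot` and `cold` thresholds, the step of one replica, and the `min_replicas`/`max_replicas` limits are all hypothetical; the choice of placement location is not modeled.

```python
from typing import Dict, List


def adjust_replicas(current: Dict[str, int], labels: Dict[str, int],
                    center_heat: List[float], hot: float, cold: float,
                    min_replicas: int = 1,
                    max_replicas: int = 5) -> Dict[str, int]:
    """Return new replica counts per file from the heat of its cluster center."""
    adjusted = {}
    for name, count in current.items():
        heat = center_heat[labels[name]]
        if heat >= hot:
            count += 1   # high-heat cluster: add a replica
        elif heat <= cold:
            count -= 1   # low-heat cluster: remove a replica
        # Keep the result inside the (assumed) allowed range.
        adjusted[name] = max(min_replicas, min(max_replicas, count))
    return adjusted
```

For instance, with cluster-center heats `[5.0, 0.15]`, `hot=1.0`, and `cold=0.5`, a file in cluster 0 with 3 replicas moves to 4, while a file in cluster 1 with 3 replicas drops to 2 and a file already at 1 replica stays at 1.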

Claims

1. A replica placement method based on file heat analysis and K-means, characterized by comprising the following steps:

Step 4), according to the high-heat file information obtained in step 3), dynamically adjust the number and placement locations of file replicas, taking into account factors such as file size, number of files, file location, and working environment.

2. The method according to claim 1, characterized in that a file access counter and a statistical-period timer are used in step 1); at initialization, the default access count of a file is 1; in each statistical period, the counter is incremented by 1 each time the file is accessed and decremented by 1 if the file is not accessed; once the access count has reached 1, the counter is no longer decremented; if a file access times out without completing, the access counter is incremented by 1; the access frequency of a file in the k-th statistical period is $f_k = n/t$, where $n$ is the number of times the file is accessed within the statistical period and $t$ is the total duration of those accesses within the statistical period.

3. The method according to claim 1, characterized in that step 2) uses the file access frequency obtained in step 1) and the formula $h_{ij} = \alpha F_j / (S_i + 1)$ to calculate the access heat value of file $i$ at time $j$; in the formula, $\alpha$ is a constant used to normalize the data; $F_j$ represents the influence of access frequency on file access heat, and $S_i$ represents the influence of file size on file access heat; wherein,

4. The method according to claim 1, characterized in that step 3) uses the file access heat values obtained in step 2) to obtain the information of the files with high heat values, chooses k files as initial centers, calculates the distance from each file to each center file, and assigns each file to the nearest cluster; the process is then repeated on the current cluster assignment until a termination condition is met; the termination conditions include:

(2) no cluster centers (or only a minimal number) change;

(3) the sum of squared errors (SSE) reaches a local minimum, $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2$, where $x$ represents a file, $m_j$ represents the cluster center of cluster $C_j$, and $\mathrm{dist}(x, m_j)$ represents the distance between file $x$ and cluster center $m_j$.

5. The method according to claim 1, characterized in that step 4) uses the clustering information obtained in step 3) and the access heat of each cluster center to dynamically adjust the number and placement locations of file replicas, taking into account factors such as file size, number of files, file location, and working environment; the number of replicas is suitably increased for high-heat clusters and suitably reduced for low-heat clusters.