CN102223404A

CN102223404A - Replica selection method based on access cost and transmission time

Info

Publication number: CN102223404A
Application number: CN2011101512234A
Authority: CN
Inventors: 刘伟; 杜薇; 石飞燕; 位凯志
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2011-06-07
Filing date: 2011-06-07
Publication date: 2011-10-19

Abstract

The invention relates to a replica selection method based on access cost and transmission time. For any data-intensive task that needs to access multiple data replicas, the replica selection problem is first modeled as a weighted set covering problem (WSCP) and then converted into a matrix. A weighted greedy algorithm then selects, at each iteration, the storage node with the minimum average replica access cost, until all data replicas needed by the task are obtained. In this way, low-cost replicas are selected and the transmission time of the replicas is reduced at the same time. The method is simple, executes efficiently, and is suitable for replica selection in data-intensive computing environments.

Description

A replica selection method based on access cost and transmission time

Technical field

The present invention relates to replica selection methods in data-intensive computing, and in particular to a replica selection method based on access cost and transmission time.

Background

From search engines and video sharing to e-commerce, Internet services are increasingly centered on massive data processing, and their quality of service depends to a large extent on the ability to process data. Managing these data, i.e. locating, operating on, storing, moving, sharing, and describing them, has become a performance bottleneck in the evolution of computing capability. Against this background, data-intensive computing (DIC) has emerged as a supporting technology for new services and has attracted wide attention from industry and academia.

Replica technology, i.e. data replication, is a widely adopted and effective technique for improving quality of service in data-intensive computing. In a data-intensive environment, distributed storage and data redundancy are used to store multiple replicas of the same data on different physical storage nodes. This not only improves the reliability and availability of the data but also effectively reduces network access latency and improves network load balancing. The scheduling and execution of a task depends to a large extent on the storage nodes that hold the replicas the task requires, and optimizing the choice of those storage nodes can improve the execution efficiency of the application while meeting the quality-of-service requirements of the user task. Therefore, when executing a user task, selecting the best storage nodes holding the data replicas required by the task is essential.

At present, most research at home and abroad on replica selection methods targets data grid environments:

Srikumar Venugopal and Rajkumar Buyya of the University of Melbourne, Australia, have long worked on data-intensive applications in grid environments. In "An SCP-based heuristic approach for scheduling distributed data-intensive applications on global grids" (Journal of Parallel and Distributed Computing, vol. 69, no. 4, pp. 471-487) they proposed a replica selection method that uses a tree-search algorithm based on the set covering problem (SCP) to select replicas in data-intensive applications. Its starting point is to select the minimum number of physical storage nodes that cover all replicas the task needs, so as to reduce the time spent moving replicas.

Sun Min et al. of Tianjin University, in "Ant algorithm for file replica selection in data grid" (First International Conference on Semantics, Knowledge and Grid Materials Research, SKG 2005, paper number 4125852, pp. 64-66), proposed an ant colony algorithm to solve the data replica selection problem in data-intensive computing, in order to reduce data access latency, bandwidth consumption, and distributed storage load.

Jin Hai et al. of Huazhong University of Science and Technology, in "Using classification techniques to improve replica selection in data grid" (On The Move to Meaningful Internet Systems, vol. 4276, pp. 1376-1387), proposed a new replica selection strategy based on classification techniques. It uses historical data-transfer information to predict the physical location of the best replica and adopts the K-nearest neighbor (KNN) algorithm to select the optimal data replica.

Cloud computing, an Internet-based service model, is an emerging data-intensive computing model that can process large-scale data and has great commercial value, and it has drawn wide attention from all sectors of society. Major software vendors worldwide are actively advancing cloud computing research and applications and have proposed their own cloud solutions and implementations, among them information giants such as Google, Amazon, IBM, and Microsoft. Replica selection in cloud environments is also becoming a concern of researchers at home and abroad:

Li Jing et al. of Chongqing University, in "A replica selection decision in cloud computing environment" (Advanced Materials Research, vol. 121-122, pp. 801-806), proposed a new replica selection algorithm. Based on the GM(1,1) grey dynamic model, it uses grey system theory to predict data response time and a Markov chain to predict the reliability of data replicas, which can improve load balance among storage nodes in a cloud computing environment.

In summary, most of the replica selection methods proposed by earlier researchers are improvements and optimizations under certain specific conditions. Their shortcomings are that none of them considers the value of the data itself, i.e. the access cost of a replica, and none pays sufficient attention to the time required to transmit replicas.

Summary of the invention

To solve the problems that remain in current replica selection and to address the shortcomings of the methods proposed by earlier researchers, the purpose of this invention is to provide a replica selection method based on access cost and transmission time. It mainly solves the replica selection problem in data-intensive computing, so that low-cost replicas are selected while the transmission time of replicas is reduced, improving the execution efficiency of data-intensive applications.

The specific steps of the invention are as follows:

First step: Take the set of data replicas that the task needs to access, together with the set of all storage nodes in the data-intensive environment that hold data replicas, as the initial input of the replica selection process;

Second step: Create a matrix for the task, in which each storage node holding any replica required by the task forms a row and each file replica required by the task forms a column; each nonzero value in the matrix represents the cost of accessing that replica from that storage node; initialize the related variables;

Third step: Sort the rows of the matrix in ascending order of the average replica access cost of the storage nodes;

Fourth step: Select the first row of the sorted matrix, i.e. the storage node that currently has the minimum average replica access cost, add it to the set of storage nodes holding the best replicas selected for the task, delete the row corresponding to this storage node and the columns corresponding to the files it covers, and update the matrix;

Fifth step: Determine whether all replicas are covered. If some replica is not yet covered, return to the third step and continue the replica selection process; otherwise the replica selection process ends, and the set of storage nodes holding the optimal replicas needed by the task has been obtained.

Features of the invention

By adopting a weighted greedy algorithm, the invention selects at each iteration the storage node with the minimum average replica access cost, within the (ln|X|+1) bound in polynomial time, so that low-cost data replicas are selected while the movement of replicas is reduced. The method is simple, executes efficiently, and is suitable for replica selection in data-intensive computing environments.

Description of drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 is a diagram of the replica selection model of the invention, based on the weighted set covering problem (WSCP).

Embodiment

The invention is described in further detail below with reference to the drawings and embodiments.

In the following description of the embodiment, a data-intensive task J needs to access k (k > 0) data replicas distributed across m (m > 0) storage nodes, and the access cost of the replicas of the same file differs between storage nodes. As shown in Fig. 1, the specific steps are as follows:

First step: Take the set of the k (k > 0) data replicas that task J needs to access, together with the set D of all storage nodes in the data-intensive environment that hold data replicas, as the initial input of the replica selection process;

Second step: Create a matrix for the task, in which each storage node holding any replica required by the task forms a row and each file replica required by the task forms a column; each nonzero value in the matrix represents the cost of accessing that replica from that storage node; initialize the related variables;

As shown in Fig. 2, the set of replicas that J needs is denoted F, F = {f_1, f_2, ..., f_k}; the set of storage nodes holding these replicas is denoted D, D = {d_1, d_2, ..., d_m}; and w_mk denotes the cost of accessing replica f_k from storage node d_m. This model can be further converted into a matrix A = [a_ij], where i denotes the i-th storage node with 1 ≤ i ≤ m, and j denotes the j-th replica with 1 ≤ j ≤ k. If the VM can access data replica f_j from storage node d_i at cost w_ij, then a_ij = w_ij (w_ij > 0); otherwise, if d_i does not contain f_j, then a_ij = 0. If a row of the matrix contains a cost w_ij in some column, the row is said to "cover" that column with cost w_ij. The replica selection problem can then be converted into finding an optimal set of matrix rows that covers all columns with minimum average weight, which is a weighted set covering problem (WSCP). In the following description, set C stores the storage nodes selected so far that hold the best replicas, and set E stores the replicas covered so far by the selected storage nodes. Both are initialized as empty: C ← ∅, E ← ∅;
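The matrix construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_cost_matrix`, the dictionary layout of `costs`, and the sample nodes and cost values are all assumptions made for the example.

```python
def build_cost_matrix(nodes, replicas, costs):
    """Build A = [a_ij]: a_ij = w_ij if node i holds replica j, else 0.

    `costs` maps (node, replica) pairs to a positive access cost (assumed layout).
    Only nodes holding at least one required replica become rows.
    """
    row_nodes, matrix = [], []
    for d in nodes:
        row = [costs.get((d, f), 0) for f in replicas]
        if any(row):  # skip nodes that hold none of the required replicas
            row_nodes.append(d)
            matrix.append(row)
    return row_nodes, matrix


# Hypothetical input: three required replicas, four storage nodes.
replicas = ["f1", "f2", "f3"]
nodes = ["d1", "d2", "d3", "d4"]
costs = {
    ("d1", "f1"): 4, ("d1", "f2"): 2,
    ("d2", "f2"): 1, ("d2", "f3"): 3,
    ("d3", "f3"): 6,
}  # d4 holds none of the required replicas

row_nodes, A = build_cost_matrix(nodes, replicas, costs)
C, E = set(), set()  # selected nodes and covered replicas, both initially empty
```

Here `row_nodes` keeps the node label of each row, so that rows can later be reordered or deleted without losing track of which storage node they represent.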

Third step: Sort the rows of the matrix in ascending order of the average replica access cost of the storage nodes;

According to the formula

\frac{W_{d_{i}}}{|d_{i} \cap F|}

calculate the average replica access cost of each storage node in the matrix, i.e. the ratio of the total cost of the data replicas required by task J that the storage node covers to the number of replicas covered, where d_i denotes the i-th storage node and W_{d_{i}} denotes the total access cost of the replicas covered by storage node d_i, computed as:

W_{d_{i}} = \sum_{1 \leq j \leq k} w_{ij}

and |d_i ∩ F| denotes the number of replicas required by task J that storage node d_i covers. From the formula

\frac{W_{d_{i}}}{|d_{i} \cap F|}

it can be seen that the more of the replicas required by task J a storage node d_i covers, and the smaller the total access cost of these replicas, the smaller the average access cost of the replicas; that is, the replica access cost is lower and the covered replicas are more concentrated.

To select replicas that are low in cost and relatively concentrated, the matrix rows are sorted in ascending order of the average replica access cost of each storage node, so that each selection starts from the storage node with the minimum average access cost.
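As a rough sketch of this step, the average access cost of a row can be computed as the sum of its nonzero entries divided by their count, and the rows sorted on that value. The helper name `average_access_cost` and the sample matrix are assumptions for illustration only.

```python
def average_access_cost(row):
    """W_{d_i} / |d_i ∩ F|: total cost of covered replicas over their count."""
    covered = [w for w in row if w > 0]
    if not covered:
        return float("inf")  # a node covering nothing is never preferred
    return sum(covered) / len(covered)


# Hypothetical cost matrix: rows are storage nodes, columns are replicas.
names = ["d1", "d2", "d3"]
A = [
    [4, 2, 0],  # d1: average (4 + 2) / 2 = 3.0
    [0, 1, 3],  # d2: average (1 + 3) / 2 = 2.0
    [0, 0, 6],  # d3: average 6 / 1 = 6.0
]

# Sort row indices in ascending order of average access cost.
order = sorted(range(len(A)), key=lambda i: average_access_cost(A[i]))
sorted_names = [names[i] for i in order]
sorted_A = [A[i] for i in order]
```

With these sample values, `d2` moves to the first row because it covers two replicas at the lowest average cost.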

Fourth step: Select the first row of the sorted matrix, i.e. the storage node that currently has the minimum average replica access cost, add it to the set of storage nodes holding the best replicas selected for the task, delete the row corresponding to this storage node and the columns corresponding to the files it covers, and then update the matrix;

In the third step, the matrix was sorted in ascending order of the average access cost of the replicas required by task J that each storage node covers. Therefore, the storage node d corresponding to the first row of the matrix is selected, i.e. the node whose current average replica access cost is minimal, and it is added to the set C of storage nodes holding the best replicas selected for task J, i.e. C ← C ∪ d. At the same time, the replicas corresponding to the matrix columns covered by d are added to E, i.e. E ← E ∪ d. The matrix A is then updated by deleting the first row, which is the row corresponding to storage node d, and the columns corresponding to the replicas covered by d.
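One selection-and-update step might look like the sketch below. It assumes the matrix is already sorted so that the first row has the minimum average cost; the function name `select_and_update` and the sample data are hypothetical.

```python
def select_and_update(names, columns, matrix, C, E):
    """Pick the first row, record it in C and E, and return the reduced matrix.

    Assumes `matrix` is already sorted by ascending average access cost.
    """
    d, row = names[0], matrix[0]
    covered = {j for j, w in enumerate(row) if w > 0}

    C.add(d)                                # C <- C ∪ d
    E.update(columns[j] for j in covered)   # E <- E ∪ (replicas covered by d)

    # Delete the selected row and every column it covers.
    keep = [j for j in range(len(columns)) if j not in covered]
    new_columns = [columns[j] for j in keep]
    new_matrix = [[r[j] for j in keep] for r in matrix[1:]]
    return names[1:], new_columns, new_matrix


# Hypothetical sorted matrix: d2 has the lowest average cost.
names = ["d2", "d1", "d3"]
columns = ["f1", "f2", "f3"]
matrix = [[0, 1, 3], [4, 2, 0], [0, 0, 6]]
C, E = set(), set()

names, columns, matrix = select_and_update(names, columns, matrix, C, E)
```

After this step only column `f1` remains, and `d3` is left with an all-zero row because its single replica is already covered.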

Fifth step: Determine whether all replicas are covered;

Determine whether the set E of replicas covered so far equals the set F of all replicas required by task J. If E ≠ F, some replica is still uncovered, so return to the third step, re-sort the new matrix, and continue selecting storage nodes. Otherwise all replicas are covered and the replica selection process ends, yielding the set C of storage nodes holding the optimal replicas needed by the task.
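Putting the five steps together, the whole loop can be sketched as below. This is an illustrative reading of the procedure rather than the patent's own code: taking the minimum-cost row with `min` stands in for sorting and picking the first row, and the function name, the error handling for uncoverable replicas, and the sample data are assumptions.

```python
def select_replica_nodes(names, replicas, matrix):
    """Greedy replica selection: repeat steps 3 to 5 until E equals F."""
    names = list(names)
    columns = list(replicas)
    rows = [list(r) for r in matrix]
    F, C, E = set(replicas), [], set()

    def avg(row):
        costs = [w for w in row if w > 0]
        return sum(costs) / len(costs)

    while E != F:
        # Only rows that still cover an uncovered replica are candidates.
        candidates = [i for i, r in enumerate(rows) if any(r)]
        if not candidates:
            raise ValueError("some required replica is stored on no node")
        # Equivalent to sorting ascending and taking the first row.
        best = min(candidates, key=lambda i: avg(rows[i]))
        covered = {j for j, w in enumerate(rows[best]) if w > 0}

        C.append(names[best])
        E.update(columns[j] for j in covered)

        # Delete the chosen row and the columns it covers.
        keep = [j for j in range(len(columns)) if j not in covered]
        columns = [columns[j] for j in keep]
        rows = [[r[j] for j in keep] for i, r in enumerate(rows) if i != best]
        names = [n for i, n in enumerate(names) if i != best]
    return C


# Hypothetical example: d2 is chosen first (average 2.0), then d1 covers f1.
chosen = select_replica_nodes(
    ["d1", "d2", "d3"],
    ["f1", "f2", "f3"],
    [[4, 2, 0], [0, 1, 3], [0, 0, 6]],
)
```

Returning `C` as a list preserves the order in which the storage nodes were selected, which makes the greedy trace easy to inspect.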

Claims

1. A replica selection method based on access cost and transmission time, characterized in that: