CN108989098B - Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment - Google Patents


Info

Publication number
CN108989098B
Authority
CN
China
Legal status: Active
Application number
CN201810700970.0A
Other languages
Chinese (zh)
Other versions
CN108989098A (en)
Inventor
林兵
项滔
卢宇
黄志高
陈星
郭文忠
蔡飞雄
Current Assignee: Fujian Normal University
Original Assignee: Fujian Normal University
Application filed by Fujian Normal University filed Critical Fujian Normal University
Priority to CN201810700970.0A priority Critical patent/CN108989098B/en
Publication of CN108989098A publication Critical patent/CN108989098A/en
Application granted granted Critical
Publication of CN108989098B publication Critical patent/CN108989098B/en

Classifications

    • H04L41/0823: Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H04L41/083: Configuration setting for increasing network speed
    • H04L41/142: Network analysis or design using statistical or mathematical methods
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention discloses a time-delay-optimization-oriented scientific workflow data layout method for a hybrid cloud environment. The method accounts for the data layout characteristics of a hybrid cloud, the dependency relationships among scientific workflow data, and the influence on transmission delay of factors such as the bandwidth between cloud data centers and the number and capacity of private cloud data centers. A preprocessing operation is first performed on the data layout strategy, improving the execution efficiency of the later-stage data layout strategy. By introducing the crossover and mutation operators of the genetic algorithm, the method avoids the premature convergence problem of the particle swarm optimization algorithm, improves the diversity of population evolution, effectively compresses the data transmission delay, and effectively reduces the scientific workflow data transmission delay in the hybrid cloud environment. The invention improves the execution efficiency of the data layout strategy and optimizes the transmission delay of the scientific workflow data layout.

Description

Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment
Technical Field
The invention relates to a scientific workflow data layout method in the field of parallel and distributed high-performance computing, in particular to a time-delay-optimization-oriented scientific workflow data layout method in a hybrid cloud environment.
Background
Scientific workflow systems are data-intensive applications that have been widely used in research fields such as astronomy, high-energy physics and bioinformatics. Scientific workflow applications are data-driven: complex data dependencies exist among the computing task nodes, and the processed data sets can reach TB or even PB scale. These data sets include the existing raw input data sets as well as the intermediate and final data sets generated during processing and analysis. Because scientific workflow applications are structurally complex and handle large data volumes, they place strict requirements on the computing power and data storage of the deployment environment. Traditional distributed environments such as grids are usually built for the study of one specific scientific application and share resources poorly with one another, so deploying scientific workflows in such environments causes serious resource waste.
Cloud computing virtualizes geographically distributed resources into a resource pool through virtualization technology and serves end users in a pay-as-you-go manner; it is efficient, flexible and customizable, providing an economical solution for scientific workflow deployment. A hybrid cloud computing environment typically includes one public cloud and multiple private clouds: the public cloud guarantees resource supply and maintains service quality when the scientific workflow load fluctuates severely, while the private clouds safeguard the security of the scientific workflow's private data. As big data grows in importance in scientific applications, scientific workflow data layout in the hybrid cloud environment has become a research hotspot. In emergency management applications, a large number of concurrent instances exist, and the time delay requirements on the scientific workflow data layout are strict. However, because private scientific workflow data are stored in fixed data centers, application execution incurs a large amount of cross-data-center transmission; moving TB- or even PB-scale data sets over the limited network bandwidth between data centers creates a severe transmission delay.
Therefore, researching a reasonable scientific workflow data layout scheme in the hybrid cloud environment is very important, specifically in the following respects: (1) The scientific workflow application structure has complex dependencies and large data volumes; in a hybrid cloud, multi-data-center environment, a reasonable data layout scheme should ensure high cohesion within a single data center and low coupling between data centers, thereby reducing the time overhead of cross-data-center transmission. (2) For security, private data are required to be stored in specific private cloud data centers, yet the limited capacity of those centers forces cross-data-center transmission; under limited transmission bandwidth and fixed storage of private data, and taking bandwidth differences into account, optimizing the data transmission delay is a key challenge of scientific workflow data layout. (3) An effective data layout scheme should also make effective use of data center resources while compressing the data transmission delay.
Existing work on scientific workflow data layout is mainly based on clustering methods and intelligent methods. Clustering methods mainly pursue load-balanced layout across multiple data centers and use data center resources effectively. In a hybrid cloud environment, however, a scientific workflow with private data needs a layout with high cohesion within each data center and low coupling between data centers to guarantee low transmission delay, and traditional load-balancing-based clustering cannot meet this requirement. Traditional intelligent methods are mainly data layout strategies based on genetic algorithms; they focus chiefly on load balancing and easily fall into local optima. Existing research mainly optimizes the number of cross-data-center transmissions and the amount of data transferred, with little work on compressing the data transmission delay itself, and it rarely considers the differences in transmission bandwidth between data centers. Hence, for the time-delay-optimization-oriented scientific workflow data layout problem in the hybrid cloud environment, current research has not yet produced a complete and effective solution.
Disclosure of Invention
The invention aims to provide a time delay optimization-oriented scientific workflow data layout method in a mixed cloud environment.
The technical scheme adopted by the invention is as follows:
A time-delay-optimization-oriented scientific workflow data layout method in a hybrid cloud environment comprises the following steps:
step 1: constructing a data layout scheme model based on scientific workflow in a hybrid cloud environment;
the definition of the whole data layout scheme is S ═ S (DS, DC, Map, T)total) Wherein Map ═ Ui=1,2,...,|DS|{<dci,dsk,dcj> "represents the mapping of the data set DS to the data center set DC, TtotalRepresents the total time overhead incurred by data transmission across data centers during the data placement process; the time delay optimized scientific workflow data layout problem in the hybrid cloud environment is formally expressed as formula (8),
Figure GDA0003022072380000021
wherein u isij{0,1} represents a data set dsjWhether or not to be stored in data centre dciIf yes, uijIs 1, otherwise is 0; t istotalRepresenting the time overhead incurred by data transmission across data centers during the data placement process. In the data layout process, data is continuously transmitted and migrated, so that capacity limitation judgment is carried out on a certain private cloud data center when new data are placed in the certain private cloud data center. The core idea is to pursue the total time overhead TtotalAt a minimum, while meeting the storage capacity limitations of each data center.
Step 2: preprocess the scientific workflow, merging adjacent data sets that are related to only one task, thereby reducing the number of data sets and improving the execution efficiency of the data layout algorithm;
Step 3: initialize the population size, the maximum number of iterations, the inertia weight factor and the cognition factors, and generate the initial population by supervised randomization; initialize each first-generation particle's own historical best particle and the initial population's global best particle; note that the quantile value of a private data set is the number of its designated fixed data center;
Step 4: construct n-dimensional candidate-solution particles for the preprocessed data set using a discrete encoding;
A particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment; the position X_i^t of particle i at the t-th iteration is shown in equation (11):

X_i^t = {x_i1^t, x_i2^t, ..., x_in^t}   (11)

Each particle has n quantiles, where n is the number of data sets after the preprocessing operation; x_ik^t denotes the storage location of the k-th data set at the t-th iteration, and its value is a data center number, i.e. x_ik^t ∈ {1, 2, ..., |DC|}.
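As a hedged illustration of this discrete encoding, the following sketch (the helper name `init_particle` and its parameters are illustrative, not from the patent) builds an n-dimensional particle whose k-th quantile holds the number of the data center storing the k-th data set, with the quantiles of private data sets pinned to their designated centers:

```python
import random

def init_particle(n_datasets, n_centers, fixed):
    """Build one discrete-encoded particle: position[k] is the number
    (1..n_centers) of the data center storing data set k. `fixed` maps
    the index of each private data set to its designated private cloud
    data center number; those quantiles are pinned."""
    position = [random.randint(1, n_centers) for _ in range(n_datasets)]
    for k, dc in fixed.items():
        position[k] = dc  # private data sets keep their fixed center
    return position

particle = init_particle(n_datasets=5, n_centers=3, fixed={0: 2, 3: 1})
```

The `fixed` map realizes the rule from step 3 that a private data set's quantile value is the number of its designated fixed data center.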
And 4, step 4: mapping the data layout result and the candidate solution particles to obtain cross-data center transmission time and a corresponding data layout scheme;
and 5: calculating the fitness of each encoding particle, setting each particle as a self historical optimal particle, and selecting a feasible solution particle with the minimum fitness value as a population global optimal particle;
step 6: updating the particles based on the particle updating formula, and recalculating the fitness of each updated particle;
and 7: updating the self history optimal particle of the particle;
if the fitness value of the updated particle is smaller than the self historical optimal value, setting the updated particle as the self historical optimal particle; otherwise, jumping to step 9;
and 8: updating global optimal particles of the population;
if the fitness value of the updated particle is smaller than that of the population global optimal particle, setting the updated particle as the population global optimal particle;
and step 9: checking whether an algorithm termination condition reaching the maximum iteration number is met, and ending when the algorithm termination condition is met; otherwise, go to step 6.
Further, the method for calculating T_total in step 1 is as follows:
Step 1-1: a mapping <dc_i, ds_k, dc_j> represents the transfer of data set ds_k from source data center dc_i to target data center dc_j; the data transmission time T_transfer is shown in equation (6):

T_transfer(<dc_i, ds_k, dc_j>) = dsize_k / band_ij   (6)

where ds_k is a data set, dc_i is the source data center, dc_j is the target data center, and both dc_i and dc_j belong to the data center set DC; dsize_k is the size of data set ds_k, and band_ij is the bandwidth value of the network link between data centers dc_i and dc_j;
Step 1-2: the time overhead T_total incurred by cross-data-center transmission during data layout is calculated as follows:

T_total = Σ_{i=1}^{|DC|} Σ_{j=1}^{|DC|} Σ_{k=1}^{|DS|} e_ijk · T_transfer(<dc_i, ds_k, dc_j>)   (7)

where e_ijk ∈ {0,1} indicates whether data set ds_k is transferred from source data center dc_i to target data center dc_j during data layout; e_ijk = 1 if so, 0 otherwise.
Further, in step 1-1, a data set ds_k = <dsize_k, gt_k, lc_k, flc_k>, where dsize_k is the data set size, gt_k is the task that generates data set ds_k, lc_k is the storage location of ds_k, and flc_k is the final layout position of ds_k. gt_k and lc_k are defined as follows:

gt_k = Task(ds_k) if ds_k ∈ DS_gen; gt_k = null if ds_k ∈ DS_ini   (4)
lc_k = fix(ds_k) if ds_k ∈ DS_fix; lc_k = flexible if ds_k ∈ DS_flex   (5)

where DS_ini is the initial data set and DS_gen is the generated data set: the initial data sets are the original inputs of the scientific workflow, while the generated data sets are intermediate data sets produced during the execution of the scientific workflow, and these data sets are often the input data sets of other tasks; Task(ds_k) denotes the task that generates ds_k. By storage location, data sets are divided into DS_fix, the fixed-storage (private) data sets, and DS_flex, the freely stored (non-private) data sets. Private data sets DS_fix can only be stored in a private cloud data center DC_pri, and fix(ds_k) denotes the number of the private cloud data center designated to store the private data set.
Further, in step 1, the data center set DC = {DC_pub, DC_pri}, where DC_pub is the public cloud and DC_pri is the private cloud, each composed of several data centers;
The data center dc_k numbered k in the data center set DC is represented as follows:

dc_k = <capacity_k, type_k>   (1)

where capacity_k is the storage capacity of data center dc_k; the data sets stored on a data center cannot exceed its storage capacity. type_k ∈ {0,1} denotes the cloud service provider to which dc_k belongs: when type_k = 0, dc_k belongs to the public cloud and can store only non-private data; when type_k = 1, dc_k belongs to the private cloud and can store both private and non-private data.
Further, the specific steps of step 2 are as follows:
Step 2-1: record the out-degree and in-degree of all tasks and data sets of the scientific workflow G;
Step 2-2: search for a "one-way data cut edge" e_ij;
Step 2-3: when a "one-way data cut edge" e_ij exists and ds_i and ds_j are not both private data, delete e_ij, merge ds_i and ds_j into a new data set ds_k, and return to step 2-2; terminate when no "one-way data cut edge" remains.
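The merging loop of steps 2-1 to 2-3 can be sketched as follows; this is a simplified reading in which data sets related to only one task, and to the same task, are treated as joined by a "one-way data cut edge" and merged, while a pair of private data sets is left unmerged (the function and data-structure names are hypothetical):

```python
def merge_single_task_datasets(related_tasks, private):
    """Simplified sketch of the step-2 preprocessing: data sets that are
    each related to exactly one task, and to the *same* task, are merged
    into one group, unless two or more of them are private (private data
    sets are fixedly stored and must stay separate). `related_tasks`
    maps data set name -> set of task names; `private` is the set of
    private data set names. Returns the list of merged groups."""
    groups = {}
    merged = []
    for ds, tasks in related_tasks.items():
        if len(tasks) == 1:
            groups.setdefault(next(iter(tasks)), []).append(ds)
        else:
            merged.append([ds])  # related to several tasks: keep as-is
    for task, members in groups.items():
        if len([d for d in members if d in private]) >= 2:
            merged.extend([d] for d in members)  # cannot merge privates
        else:
            merged.append(members)
    return merged
```

Using `{'d1': {'t1'}, 'd2': {'t1'}, 'd3': {'t1', 't2'}}` with no private data sets, `d1` and `d2` collapse into one group while `d3` stays alone, mirroring how preprocessing shrinks the particle dimension n.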
Further, in step 3, the adjustment mechanism of the inertia weight factor w adapts according to the degree of difference between the current particle and the global best particle:

div(X^{t-1}, gBest^{t-1}) = (number of quantiles on which X^{t-1} and gBest^{t-1} take different values) / n   (9)
w = w_min + (w_max − w_min) · div(X^{t-1}, gBest^{t-1})   (10)

where div(X^{t-1}, gBest^{t-1}) measures the quantiles on which the current particle X^{t-1} and the global best particle gBest^{t-1} take different values.
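A minimal sketch of this adaptive mechanism, assuming div(...) is the fraction of differing quantiles and that w interpolates linearly between illustrative bounds 0.4 and 0.9 (both the linear form and the bounds are assumptions, not values taken from the patent):

```python
def diversity(position, gbest):
    """Fraction of quantiles on which the particle and the global best
    particle take different values (the div(...) measure; dividing by n
    is an assumed normalization)."""
    n = len(position)
    return sum(1 for a, b in zip(position, gbest) if a != b) / n

def inertia_weight(position, gbest, w_min=0.4, w_max=0.9):
    """Hedged sketch: w grows with the particle/global-best difference,
    enlarging the search range when the particle is far from gBest and
    shrinking it to speed convergence when it is close."""
    return w_min + (w_max - w_min) * diversity(position, gbest)
```

When the particle equals gBest, w falls to w_min and the search contracts; when every quantile differs, w rises to w_max and the search widens, matching the behavior described for div above.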
Further, the fitness of the particles in step 6 is calculated as follows:
If the two encoded particles are of the same type, the encoded particle with the shorter cross-data-center transmission time is selected, and the fitness function is defined as:

F(X_i^t) = T_total(X_i^t)   (12)

If the two encoded particles are of different types, i.e. a combination of a feasible-solution particle and an infeasible-solution particle, the fitness function is defined so that the feasible-solution particle is always preferred, by penalizing the capacity violation:

F(X_i^t) = T_total(X_i^t) + Σ_{i=1}^{|DC|} max(0, Σ_{j=1}^{|DS|} u_ij · dsize_j − capacity_i)   (13)

where capacity_i is the storage capacity of data center dc_i, and u_ij ∈ {0,1} indicates whether data set ds_j is stored in data center dc_i; u_ij = 1 if so, 0 otherwise.
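The fitness rule can be sketched as below; the overload-penalty form and the penalty constant are assumptions chosen so that any feasible-solution particle always beats any infeasible one, as the text requires (helper names are illustrative):

```python
def overload(position, sizes, capacities):
    """Total storage-capacity violation of the layout encoded by
    `position`, where position[k] is the (0-based, for simplicity)
    index of the data center storing data set k."""
    used = [0.0] * len(capacities)
    for k, dc in enumerate(position):
        used[dc] += sizes[k]
    return sum(max(0.0, u - c) for u, c in zip(used, capacities))

def fitness(position, sizes, capacities, transfer_time, penalty=1e9):
    """Hedged sketch of the fitness above: feasible particles are ranked
    by cross-data-center transfer time alone; infeasible ones add a large
    penalty (the constant is an assumption) proportional to the capacity
    violation, so any feasible particle wins the comparison."""
    over = overload(position, sizes, capacities)
    t = transfer_time(position)
    return t if over == 0 else t + penalty * over
```

Here `transfer_time` stands in for the T_total computation of equation (7); smaller fitness is better throughout.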
Further, the update formula for updating particle i in step 7 is as follows:

X_i^t = c_2 ⊕ Cg(c_1 ⊕ Cp(w ⊕ Mu(X_i^{t-1}), pBest_i^{t-1}), gBest^{t-1})   (14)

where Cg() and Cp() represent the crossover operators of the genetic algorithm and Mu() represents the mutation operator of the genetic algorithm; ⊕ denotes applying the preceding operator with the probability given by its left operand (w, c_1 or c_2); pBest_i^{t-1} and gBest^{t-1} represent, respectively, the particle's individual best position after several iterations and the population's global best position; X_i^t is the position of particle i at time t, and X_i^{t-1} is the position of particle i at time t-1.
Further, the particle update formula is decomposed into three core parts, inertia cognition, individual cognition and social cognition; then:
(1) Combining the standard PSO algorithm with the mutation operation of the genetic algorithm, the inertia part A_i^t of particle i at time t is given by formula (15):

A_i^t = Mu(X_i^{t-1}), if r_3 < w;  A_i^t = X_i^{t-1}, otherwise   (15)

where r_3 is a random factor with value range (0,1); w is the inertia weight factor, used to adjust the search capability of the particles over the solution space; Mu() performs supervised random selection of a quantile in the encoded particle and randomly mutates the value of that quantile, the new value satisfying the corresponding value range; X_i^{t-1} is the position of particle i at time t-1.
(2) Combining the standard PSO algorithm with the crossover operation of the genetic algorithm, the individual cognition part B_i^t and the global cognition part X_i^t of particle i at time t are given by formulas (16) and (17), respectively:

B_i^t = Cp(A_i^t, pBest_i^{t-1}), if r_1 < c_1;  B_i^t = A_i^t, otherwise   (16)
X_i^t = Cg(B_i^t, gBest^{t-1}), if r_2 < c_2;  X_i^t = B_i^t, otherwise   (17)

where c_1 is the individual cognition factor and c_2 is the global cognition factor; pBest_i^{t-1} and gBest^{t-1} represent, respectively, the particle's individual best position after several iterations and the population's global best position; Cp() and Cg() represent crossover operations that randomly select two quantiles of a particle and cross the values on the same quantiles with pBest_i^{t-1} or gBest^{t-1}; r_1 and r_2 are random variables with value range [0,1], used to enhance randomness in the iterative search process.
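A hedged sketch of the three-part update, assuming operators that fire with probabilities w, c_1 and c_2, a two-point segment crossover, and pinned private quantiles (all helper names are illustrative, and the segment form of the crossover is an interpretation of "randomly select two quantiles"):

```python
import random

def mutate(position, n_centers, pinned):
    """Mu(): supervised mutation: pick a non-pinned quantile and give it
    a new random data center number (pinned = private quantiles)."""
    free = [k for k in range(len(position)) if k not in pinned]
    child = list(position)
    k = random.choice(free)
    child[k] = random.randint(1, n_centers)
    return child

def crossover(position, guide, pinned):
    """Cp()/Cg(): pick two cut quantiles and copy the guide particle's
    (pBest or gBest) values on that segment, skipping pinned quantiles."""
    i, j = sorted(random.sample(range(len(position)), 2))
    child = list(position)
    for k in range(i, j + 1):
        if k not in pinned:
            child[k] = guide[k]
    return child

def update_particle(x, pbest, gbest, w, c1, c2, n_centers, pinned):
    """One update mirroring X_i^t = c2 (+) Cg(c1 (+) Cp(w (+) Mu(X),
    pBest), gBest): each operator fires with its probability."""
    if random.random() < w:
        x = mutate(x, n_centers, pinned)
    if random.random() < c1:
        x = crossover(x, pbest, pinned)
    if random.random() < c2:
        x = crossover(x, gbest, pinned)
    return x
```

Because both operators skip the pinned quantiles, private data sets never leave their designated private cloud data centers during the search.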
Further, the individual cognition factor c_1 and the global cognition factor c_2 are set to vary linearly, decreasing and increasing respectively; formulas (21) and (22) give the update mechanisms of c_1 and c_2:

c_1 = c_1^start − (c_1^start − c_1^end) · t / T   (21)
c_2 = c_2^start + (c_2^end − c_2^start) · t / T   (22)

where c_1^start and c_1^end are the set initial and final values of the self-cognition factor c_1, c_2^start and c_2^end are the set initial and final values of the population cognition factor c_2, t is the current iteration number and T is the maximum number of iterations. When div(X^{t-1}) is large, the current particle X^{t-1} differs greatly from gBest^{t-1} and the search range needs to be enlarged, so the weight w should be increased to make the particles search the problem solution over a wider range and avoid falling into a local optimum too early; otherwise the search range is narrowed and the weight w is reduced, accelerating the convergence process within a small range and finding an optimized solution more quickly.
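The linear schedules can be sketched as follows; the start and end values used here are illustrative assumptions, not constants from the patent:

```python
def cognition_factors(t, t_max, c1_start=0.9, c1_end=0.2,
                      c2_start=0.4, c2_end=0.9):
    """Linearly decrease the individual factor c1 and increase the
    global factor c2 over the run, as in formulas (21)-(22): early
    iterations favor each particle's own best, later iterations pull
    the swarm toward the global best."""
    frac = t / t_max
    c1 = c1_start - (c1_start - c1_end) * frac
    c2 = c2_start + (c2_end - c2_start) * frac
    return c1, c2
```

At t = 0 the pair is (c1_start, c2_start); at t = t_max it is (c1_end, c2_end), so exploration gradually gives way to exploitation.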
Further, supervised randomization covers the following two cases:
Case 1: if the encoded particle is a feasible-solution particle, the selected quantiles exclude the quantiles where the private data sets are located; since the private data sets are fixedly stored, their storage locations cannot be changed.
Case 2: if the encoded particle is an infeasible-solution particle, the selected quantile is one corresponding to an overloaded data center's code. The data layout scheme corresponding to an infeasible-solution particle may have several overloaded data centers; a quantile corresponding to one randomly chosen overloaded data center's code is selected for the mutation operation, so that the infeasible-solution particle may mutate into a feasible-solution particle.
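The two supervision cases can be sketched as a quantile-selection routine (the helper names and the 0-based data center indexing are illustrative):

```python
import random

def pick_mutation_quantile(position, sizes, capacities, private_quantiles):
    """Supervised selection of the quantile to mutate (cases 1 and 2):
    for a feasible particle any non-private quantile may be chosen; for
    an infeasible particle, a quantile currently placed in one randomly
    chosen overloaded data center is selected instead."""
    used = [0.0] * len(capacities)
    for k, dc in enumerate(position):
        used[dc] += sizes[k]
    overloaded = [dc for dc, u in enumerate(used) if u > capacities[dc]]
    if overloaded:                       # case 2: infeasible particle
        dc = random.choice(overloaded)
        candidates = [k for k, d in enumerate(position)
                      if d == dc and k not in private_quantiles]
    else:                                # case 1: feasible particle
        candidates = [k for k in range(len(position))
                      if k not in private_quantiles]
    return random.choice(candidates)
```

Steering the mutation toward an overloaded center is what gives an infeasible-solution particle a chance of mutating back into a feasible one.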
By adopting the technical scheme, the data layout characteristics in a mixed cloud environment are considered, the dependence relationship among scientific workflow data is combined, and the influence of factors such as bandwidth among cloud data centers, the number and capacity of private cloud data centers and the like on transmission delay is considered; by introducing the crossover operator and the mutation operator of the genetic algorithm, the problem of premature convergence of the particle swarm optimization algorithm is avoided, the diversity of population evolution is improved, the data transmission delay is effectively compressed, and the scientific workflow data transmission delay in the mixed cloud environment is effectively reduced.
In order to compress the scale of the scientific workflow data, the method first preprocesses the scientific workflow data, improving the execution efficiency of the later-stage data layout strategy; it avoids the premature convergence problem of the particle swarm optimization algorithms used in the prior art to solve this NP-hard problem, improves the diversity of population evolution, and optimizes the transmission delay of the scientific workflow data layout.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a scientific workflow diagram of the present invention;
FIG. 2 is one example of a data layout for a scientific workflow of the present invention;
FIG. 3 is a second example of a data layout for a scientific workflow of the present invention;
FIG. 4 is a flow chart of the algorithm of the present invention;
FIG. 5 is a schematic diagram of compressing a "one-way data cut edge" in the preprocessing process of the present invention;
FIG. 6 is a structure of Epigenomics workflow before and after pretreatment according to the present invention;
FIG. 7 is a diagram of an example of data layout particle encoding according to the present invention;
FIG. 8 is a cross-operator graph of the individual cognitive factors and global cognitive factors of the present invention;
FIG. 9 is a diagram of the mutation operator of the inertia part of the present invention.
Detailed Description
As shown in FIG. 1 to FIG. 7, the present invention discloses a time-delay-optimization-oriented scientific workflow data layout method in a hybrid cloud environment; the invention is described in detail below with reference to the accompanying drawings.
1 problem definition and analysis
This section defines the relevant concepts of the time-delay-optimization-oriented scientific workflow data layout problem in the hybrid cloud environment and analyzes the problem with examples. The problem definition mainly covers the hybrid cloud environment, the scientific workflow and the data layout scheme.
1.1 problem definition
The hybrid cloud DC = {DC_pub, DC_pri} mainly consists of a public cloud and a private cloud, each composed of several data centers. The public cloud DC_pub = {dc_1, dc_2, ..., dc_n} consists of n data centers, and the private cloud DC_pri = {dc_1, dc_2, ..., dc_m} consists of m data centers. Since the focus here is the data layout problem, only the storage capacity of each data center is considered and its computing capacity is ignored. The data center dc_i numbered i is represented as follows:

dc_i = <capacity_i, type_i>   (1)

where capacity_i is the storage capacity of data center dc_i; the data sets stored on a data center cannot exceed its storage capacity. type_i ∈ {0,1} denotes the cloud service provider to which dc_i belongs: when type_i = 0, dc_i belongs to the public cloud and can store only non-private data; when type_i = 1, dc_i belongs to the private cloud and can store both private and non-private data. In addition, the bandwidth between the various data centers is represented as follows:

B = {b_ij | dc_i, dc_j ∈ DC, i ≠ j}   (2)
b_ij = <band_ij, type_i, type_j>   (3)

where, for every pair dc_i, dc_j ∈ DC with i ≠ j, b_ij represents the network bandwidth between data centers dc_i and dc_j, and band_ij is its bandwidth value. It is assumed here that bandwidth values between data centers are known and do not fluctuate.
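A minimal sketch of the data center and bandwidth model of equations (1)-(3); the class and helper names are illustrative, and the bandwidth table below is example data, not values from the patent:

```python
from dataclasses import dataclass

PUBLIC, PRIVATE = 0, 1

@dataclass
class DataCenter:
    """dc_i = <capacity_i, type_i>: storage capacity plus a type flag,
    0 for a public cloud center (non-private data only), 1 for a
    private cloud center (private and non-private data)."""
    capacity: float
    type: int

    def can_store_private(self):
        return self.type == PRIVATE

bandwidth = {  # band_ij: known, non-fluctuating link bandwidths
    (0, 1): 100.0,
    (0, 2): 50.0,
    (1, 2): 150.0,
}

def band(i, j):
    """Symmetric lookup of band_ij between data centers i and j."""
    return bandwidth[(min(i, j), max(i, j))]
```

Storing the links with sorted index pairs keeps the bandwidth symmetric, matching the assumption that band_ij does not depend on transfer direction.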
A scientific workflow is represented by a directed acyclic graph G = (T, E, DS), where T = {t_1, t_2, ..., t_r} denotes the set of r task nodes, E = {e_12, e_13, ..., e_ij} represents the data dependencies between tasks, and DS = {ds_1, ds_2, ..., ds_n} denotes the set of all data of the scientific workflow.
Each data dependency edge e_ij = (t_i, t_j) indicates that task t_i and task t_j have a data dependency relationship, where task t_i is the direct predecessor (parent) node of task t_j, and task t_j is the direct successor (child) node of task t_i. In the scientific workflow scheduling process, a task can be executed only after all of its predecessor nodes have finished executing. In a given directed acyclic graph representing a scientific workflow, a task without predecessor nodes is called an "entry task"; similarly, a task without successor nodes is called an "exit task".
For a certain subtask t_i = <IDS_i, ODS_i>, its set of input data is IDS_i and its set of output data is ODS_i. The correspondence between tasks and data is many-to-many: one datum may be used by multiple tasks, and one task may require multiple input data when executed.
For a certain data set ds_k = <dsize_k, gt_k, lc_k, flc_k>, dsize_k is the data set size, gt_k is the task that generates data set ds_k, lc_k is the storage location of ds_k, and flc_k is the final layout position of ds_k. gt_k and lc_k are defined as follows:

gt_k = Task(ds_k) if ds_k ∈ DS_gen; gt_k = null if ds_k ∈ DS_ini   (4)
lc_k = fix(ds_k) if ds_k ∈ DS_fix; lc_k = flexible if ds_k ∈ DS_flex   (5)

By source, data sets are divided into the initial data set DS_ini and the generated data set DS_gen: the initial data sets are the original inputs of the scientific workflow, while the generated data sets are intermediate data sets produced during the execution of the scientific workflow, and these data sets are often the input data sets of other tasks; Task(ds_k) denotes the task that generates ds_k. By storage location, data sets are divided into fixed-storage (private) data sets DS_fix and freely stored (non-private) data sets DS_flex. Private data sets can only be stored in a private cloud data center DC_pri, and fix(ds_k) denotes the number of the private cloud data center designated to store the private data set ds_k.
The purpose of data layout is to minimize data transfer time while meeting task execution requirements. Any task execution needs to satisfy two conditions: (1) the task is scheduled to a data center for execution; (2) the input data sets required by the task are already in that data center. Since the time to schedule a task to a data center is much shorter than the time to transfer data to that data center, this work focuses on data layout rather than task scheduling: each task is simply scheduled to the data center with the smallest transfer-time overhead. The whole data layout scheme is defined as S = (DS, DC, Map, Ttotal), where Map = ∪i=1,2,...,|DS| {<dci, dsk, dcj>} denotes the mapping of the data set DS to the data centers DC. A mapping <dci, dsk, dcj> represents transferring data set dsk from source data center dci to target data center dcj; the data transmission time of this process is given by equation (6). Ttotal, the total time overhead caused by cross-data-center data transmission during data layout, is given by equation (7).

$$T_{transfer}(ds_k,dc_i,dc_j)=\frac{dsize_k}{band_{ij}}\qquad(6)$$

$$T_{total}=\sum_{i=1}^{|DC|}\sum_{j=1}^{|DC|}\sum_{k=1}^{|DS|}e_{ijk}\cdot\frac{dsize_k}{band_{ij}}\qquad(7)$$
Wherein eijk ∈ {0,1} indicates whether data set dsk is transferred from source data center dci to target data center dcj during the data layout process; eijk is 1 if such a transfer occurs, and 0 otherwise.
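Equations (6) and (7) can be sketched in Python; the `Mapping` type and the variable names below are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    """One entry <dc_i, ds_k, dc_j> of Map: move data set ds from src to dst."""
    src: int   # source data center number
    ds: int    # data set number
    dst: int   # target data center number

def transfer_time(dsize, band, m):
    """Equation (6): time to move data set m.ds from m.src to m.dst."""
    if m.src == m.dst:
        return 0.0                       # no cross-data-center transfer
    return dsize[m.ds] / band[m.src][m.dst]

def total_time(dsize, band, mappings):
    """Equation (7): total cross-data-center transmission time."""
    return sum(transfer_time(dsize, band, m) for m in mappings)

# Sizes and bandwidths of the FIG. 1 example (GB and GB/s; 10/20/150 MB/s).
dsize = {1: 3, 2: 5, 3: 3, 4: 3, 5: 5, 6: 8}
band = {1: {2: 0.010, 3: 0.020}, 2: {1: 0.010, 3: 0.150},
        3: {1: 0.020, 2: 0.150}}
```

For instance, moving ds3 (3GB) from dc1 to dc2 at 10MB/s takes 300 seconds, which is why the bandwidth term dominates the layout quality in the example below.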
Based on the above definitions, the delay-optimization-oriented scientific workflow data layout problem in a hybrid cloud environment can be formally expressed as equation (8); its core idea is to minimize the total time overhead Ttotal while meeting the storage capacity limit of each data center.

$$\min T_{total}\quad s.t.\;\sum_{j=1}^{|DS|}u_{ij}\cdot dsize_j\le capacity_i,\;\forall\, dc_i\in DC_{pri}\qquad(8)$$
Wherein uij ∈ {0,1} indicates whether data set dsj is stored in data center dci; uij is 1 if so, and 0 otherwise. During the data layout process, data is continuously transmitted and migrated, so the capacity limit of a private cloud data center must be checked whenever new data is placed in it.
1.2 problem analysis
FIG. 1 is an example of a scientific workflow consisting of 5 tasks {t1, t2, t3, t4, t5}, 5 original input data sets {ds1, ds2, ds3, ds4, ds5} and 1 intermediate data set ds6. The sizes of the 6 data sets {dsize1, dsize2, dsize3, dsize4, dsize5, dsize6} are {3GB, 5GB, 3GB, 3GB, 5GB, 8GB}, respectively, where ds4 is a private data set and must be stored in data center dc2. The input data sets of task t4 are {ds3, ds4, ds6}; since ds4 is private data that must be fixedly stored in dc2, t4 must also be executed in dc2. Likewise, ds5 is a private data set that must be stored in dc3, so t5 must also be executed in dc3. FIGS. 2 and 3 show two data layout schemes. dc1 is a public cloud data center with unlimited storage capacity, while dc2 and dc3 are two private cloud data centers with a storage capacity of 20GB each. The bandwidth between private cloud data centers is roughly 10 times the bandwidth from the public cloud data center to a private cloud data center, so the bandwidths between the 3 data centers {band12, band13, band23} are assumed to be {10M/s, 20M/s, 150M/s}, respectively.
FIG. 2 shows the data layout scheme generated by the dependency-matrix partition model of Lijun et al.: the public data sets ds1, ds2 and ds3 are deployed in public cloud data center dc1, ds6 is deployed in private cloud data center dc2, and the private data sets ds4 and ds5 are each deployed in their associated data centers. The resulting layout incurs 4 data transfers totaling 27GB, with a cross-data-center transfer time of about 1953 seconds.
FIG. 3 shows the optimal data layout scheme: the public data sets ds1 and ds2 are deployed in public cloud data center dc1, while ds3 and ds6 are deployed in private cloud data center dc3. The resulting layout incurs 5 data transfers totaling 30GB, with a cross-data-center transfer time of about 1023 seconds. Although this layout exceeds that of Lijun et al. in both transfer count and transfer volume, its cross-data-center transfer time is clearly superior, mainly because this scheme comprehensively considers the influence of the transmission bandwidth between data centers.
Traditional matrix partition models or load-balancing models based on breaking data dependencies place data with high mutual dependency in the same data center as far as possible, which effectively reduces the amount of data transferred between data centers, but they do not comprehensively consider the layout impact caused by bandwidth differences between data centers. Therefore, aiming at these shortcomings of traditional data layout models, a GA-DPSO-based data layout strategy is designed that incorporates a differentiated-bandwidth mechanism and adaptively places different data sets according to factors such as bandwidth and data center capacity limits, effectively reducing the transmission delay of scientific workflow data layout in a hybrid cloud environment.
2 GA-DPSO-based data layout strategy
For a data layout scheme S = (DS, DC, Map, Ttotal), the core goal of this work is to find the optimal mapping Map of the data set DS to the data centers DC so that the cross-data-center transmission time Ttotal is minimized. Finding the optimal mapping of DS to DC is an NP-hard problem, and the bandwidth differences between data centers in a hybrid cloud environment must be taken into account. To compress the scale of the scientific workflow data, a preprocessing operation is first performed, which improves the execution efficiency of the subsequent data layout strategy. To avoid the premature convergence that particle swarm optimization suffers when solving NP-hard problems, the GA-DPSO algorithm is proposed, which improves the diversity of population evolution and optimizes the layout transmission delay of scientific workflow data. The following introduces, in order, the scientific workflow preprocessing and the genetic-algorithm-operator-based adaptive discrete particle swarm optimization data layout strategy.
2.1 Scientific workflow preprocessing
Algorithm 1. Merging adjacent data sets with only one dependent task
procedure preProcess(G(T, E, DS))
1: record the out-degree and in-degree of all tasks and data sets of the scientific workflow G
2: search for a 'one-way data cut edge' eij
3: if a 'one-way data cut edge' eij exists and dsi and dsj are not both private data, delete eij and merge dsi and dsj into a new data set dsk
4: repeat step 2 until no 'one-way data cut edge' exists
end procedure
Algorithm 1 presents the pseudocode of the preprocessing procedure, which merges adjacent data sets that share only one dependent task, based on the structural features of the scientific workflow itself. A 'one-way data cut edge' is defined as follows: for two data sets dsi and dsj, the out-degree of dsi is 1, the in-degree of dsj is 1, and there is only one related task between the two data sets; the structure is shown in FIG. 3. When the scientific workflow has a 'one-way data cut edge' and dsi and dsj are not both private data, dsi and dsj can be merged and placed together, as shown in FIG. 5. For scientific workflows with a large number of one-way data cut edges, such as the Epigenomics workflow, preprocessing can greatly reduce the number of data sets and thus improve the execution efficiency of the subsequent data layout algorithm. FIG. 6 shows the structural change of the Epigenomics workflow before and after preprocessing; the number of data sets is compressed by more than 30%.
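Under the assumption that the one-way data cut edges have already been identified from the out-degrees and in-degrees, the merging step of Algorithm 1 can be sketched with a small union-find over data set ids (the data structures here are illustrative, not the patent's):

```python
def preprocess(ds_ids, cut_edges, private):
    """Merge data sets joined by a 'one-way data cut edge' (Algorithm 1,
    steps 3-4), skipping a merge only when BOTH data sets are private.

    ds_ids    : iterable of data set ids
    cut_edges : set of (i, j) pairs, each a one-way data cut edge ds_i -> ds_j
    private   : set of ids of private (fixed-storage) data sets
    Returns the list of merged data set groups.
    """
    parent = {d: d for d in ds_ids}          # union-find forest

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]    # path halving
            d = parent[d]
        return d

    for i, j in list(cut_edges):
        ri, rj = find(i), find(j)
        if ri != rj and not (i in private and j in private):
            parent[rj] = ri                  # merge ds_j into ds_i's group

    groups = {}
    for d in ds_ids:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())
```

On the FIG. 1 example, where ds5 and ds6 share a one-way data cut edge, this reduces the 6 data sets to 5 groups, matching the encoding dimension used in section 2.2.1.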
Property 1. The scientific workflow preprocessing strategy can compress the number of scientific workflow data sets and improve algorithm execution efficiency, but it may affect the final data layout result.
FIG. 5 has shown an example of compressing the number of scientific workflow data sets. Section 2.2.1 presents a discrete encoding whose dimension equals the number of data sets, so reducing the number of data sets improves the execution efficiency of the algorithm. The merged placement of ds5 and ds6 illustrated in FIG. 5 means that ds5 and ds6 always reside in the same data center. If the remaining capacity of some private cloud data center can only hold ds5 or ds6 alone, the data layout result with preprocessing will differ from the result without preprocessing.
2.2 genetic algorithm operator-based adaptive discrete particle swarm optimization data layout strategy
The PSO algorithm was proposed by Eberhart and Kennedy in 1995; it is a population-based stochastic optimization algorithm inspired by the social behavior of bird flocks. The particle is the central concept of PSO: each particle represents a candidate solution of the problem, and particles move and are iteratively updated in the problem space to obtain better particles. A particle's movement update mainly adjusts its velocity and position, as shown in equations (9) and (10).
$$v_i^{t+1}=w\cdot v_i^{t}+c_1 r_1\big(pBest_i^{t}-X_i^{t}\big)+c_2 r_2\big(gBest^{t}-X_i^{t}\big)\qquad(9)$$

$$X_i^{t+1}=X_i^{t}+v_i^{t+1}\qquad(10)$$

v_i^t and X_i^t respectively represent the velocity and the position of the i-th particle in the t-th iteration; to keep particle updates within the problem solution space, a maximum particle velocity Vmax is defined to limit the particle velocity. A particle's velocity update is influenced by its own state, its own historical best position, and the population's historical best position. The inertia weight w directly influences the convergence of the algorithm and adjusts the particles' ability to search the solution space. pBest_i^t and gBest^t respectively represent the historical best position of particle i and the historical best position of the population after the t-th iteration. c1 and c2 are cognitive factors, representing the cognitive learning ability towards the particle's own historical best position and the population's historical best position, respectively. r1 and r2 are two random factors in the range (0,1), which increase search randomness during iteration and improve population diversity. In addition, to determine whether a particle occupies a good or a bad position in the problem space, a fitness function needs to be defined and evaluated.
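Equations (9) and (10) are the standard continuous PSO update; a minimal sketch, with an illustrative clamp of each velocity component to Vmax:

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    """One continuous PSO update per equations (9) and (10)."""
    r1, r2 = random.random(), random.random()
    new_v, new_x = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        vel = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        vel = max(-vmax, min(vmax, vel))   # limit |velocity| to Vmax
        new_v.append(vel)
        new_x.append(xi + vel)             # equation (10)
    return new_x, new_v
```

When the particle already sits at both pBest and gBest, the cognitive terms vanish and only the inertia term w * v remains, which is exactly the behavior the inertia weight discussion below relies on.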
The traditional PSO algorithm solves continuous problems, whereas the layout of data sets onto data centers is a discrete problem, requiring a new problem encoding and a new fitness evaluation function. To address the premature-convergence problem of traditional PSO, a new particle update strategy is also needed. In addition, the setting of algorithm parameters directly affects the number of iterations and the search capability of the algorithm. The GA-DPSO data layout optimization algorithm proposed here is described in detail below in terms of problem encoding, fitness function, particle update strategy, and parameter settings.
2.2.1 problem coding
A good problem encoding strategy can effectively improve algorithm efficiency and search capability. Problem encoding mainly follows three basic principles: completeness, non-redundancy, and soundness.

Definition 1 (completeness). Every feasible solution in the problem space has a corresponding encoded particle in the encoding space.

Definition 2 (non-redundancy). Each candidate solution in the problem space corresponds to exactly one encoded particle in the encoding space.

Definition 3 (soundness). Every encoded particle in the encoding space corresponds to a candidate solution in the problem space.
It is challenging to construct a problem encoding that satisfies all three principles simultaneously. We use discrete encoding to construct n-dimensional candidate solution particles. One particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment; the position X_i^t of particle i in the t-th iteration is shown in equation (11).

$$X_i^t=\big(x_{i1}^t,x_{i2}^t,\ldots,x_{in}^t\big)\qquad(11)$$

Each particle has n dimensions, where n represents the number of data sets after the preprocessing operation. x_{ik}^t represents the storage location of the k-th data set in the t-th iteration; its value is a data center number, i.e. x_{ik}^t ∈ {1, 2, ..., |DC|}.
Note that for a private data set the storage location is fixed regardless of iterative updates; for example, data sets ds4 and ds5 in FIG. 1 can only be fixedly stored in dc2 and dc3, respectively. FIG. 7 shows the problem encoding corresponding to the data layout of FIG. 3 for the scientific workflow of FIG. 1: the preprocessing operation compresses the data sets from 6 to 5, merging ds5 and ds6 into one data set, and both are stored in dc3.
Property 2. The discrete encoding strategy satisfies the non-redundancy and completeness principles, but does not satisfy the soundness principle.
Each data set is finally stored in a corresponding data center with a corresponding data center number, and the final storage position of a data set can only be one data center. A data layout scheme of the scientific workflow thus corresponds to one n-dimensional particle whose dimension values are the corresponding data center numbers; since one layout scheme corresponds to exactly one encoded particle, the non-redundancy principle is satisfied. A non-private data set may be stored in any data center, so its encoding dimension may take any data center number; since the encoding value of each data set is the number of the data center designated to store it, every layout scheme has a corresponding encoded particle, and the completeness principle is satisfied. However, some encoded particles do not correspond to realistic candidate solutions in the problem space. For example, if the data set storage locations in FIG. 7 are (1,2,2,2,2), all data sets except ds1 are stored in dc2; their total data volume reaches 24GB, exceeding the 20GB storage capacity of dc2, making that data layout scheme infeasible, so the soundness principle is not satisfied.
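The soundness check behind Property 2, summing the data placed in each data center against its capacity, can be sketched as follows; the sizes and 20GB private capacities come from the FIG. 1 example, while the function and variable names are illustrative:

```python
def is_feasible(particle, dsize, capacity):
    """particle[k] = data center number storing data set k+1.
    capacity[c] = storage limit in GB of data center c (None = unlimited
    public cloud). Returns True iff no data center is overloaded."""
    used = {}
    for k, dc in enumerate(particle):
        used[dc] = used.get(dc, 0) + dsize[k]
    return all(capacity.get(dc) is None or used[dc] <= capacity[dc]
               for dc in used)

dsize = [3, 5, 3, 3, 13]             # after preprocessing: ds5+ds6 merged (5+8 GB)
capacity = {1: None, 2: 20, 3: 20}   # dc1 public; dc2, dc3 private with 20 GB
```

The encoding (1,1,3,2,3) from FIG. 7 passes the check, while (1,2,2,2,2) places 24GB in dc2 and fails, matching the counterexample in the text.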
2.2.2 fitness function
The fitness function evaluates the quality of particles; generally, a particle with a smaller fitness value is better. Since the objective here is to reduce the cross-data-center transmission time of the scientific workflow data layout, and a smaller transmission time means a better particle, the fitness value could be defined directly as the data transmission time of the layout scheme corresponding to the particle. However, since the problem encoding does not satisfy the soundness principle (the data sets placed in some data center may exceed its capacity), the fitness function must be defined case by case.
Definition 4 (feasible solution particle). The data layout scheme corresponding to the encoded particle satisfies the data center capacity limits: the data sets stored in no data center exceed that data center's capacity.

Definition 5 (infeasible solution particle). The data layout scheme corresponding to the encoded particle violates a data center capacity limit: the data sets stored in some data center exceed that data center's capacity.
The fitness function values of two encoded particles are compared in three different cases.
Case 1: both encoded particles are feasible solution particles. The particle with the shorter cross-data-center transmission time is selected, and the fitness function is defined as follows:

$$fitness\big(X_i^t\big)=T_{total}\big(X_i^t\big)\qquad(12)$$
case 2: and the two encoding particles are both infeasible solution particles, the encoding particles with shorter data transmission time across the data center are selected, the infeasible solution particles are likely to become feasible solution particles through later particle updating operation, the encoding particles with shorter data transmission time are more likely to keep shorter data transmission time, and the fitness function definition is consistent with the formula (12).
Case 3: one encoding particle is an infeasible solution particle and one encoding particle is a feasible solution particle, the feasible solution particles are selected without question, and the fitness function is defined as follows:
Figure GDA0003022072380000132
2.2.3 particle update strategy
As shown in equation (9), traditional PSO includes three core parts: inertia, individual cognition, and social cognition. Traditional PSO performs random search over a continuous space, enlarges the search space only slowly and locally, is prone to premature convergence, and falls into local optima. To enhance the search capability of PSO, apply it to the discrete problem, let it explore a wider solution space, and avoid premature convergence, the algorithm introduces the crossover and mutation operators of the genetic algorithm. The improved update operation of equation (9) for particle i at iteration t is as follows:

$$X_i^{t}=c_2\oplus C_g\Big(c_1\oplus C_p\big(w\oplus M_u\big(X_i^{t-1}\big),pBest_i^{t-1}\big),gBest^{t-1}\Big)\qquad(14)$$

where Cg() and Cp() represent crossover operators of the genetic algorithm and Mu() represents the mutation operator of the genetic algorithm.
For the individual cognition part and the social cognition part, the corresponding parts of equation (9) are updated using the idea of the genetic algorithm's crossover operator; the update operations are shown in equations (16) and (17).
$$B_i^{t}=c_1\oplus C_p\big(A_i^{t},pBest_i^{t-1}\big)=\begin{cases}C_p\big(A_i^{t},pBest_i^{t-1}\big), & r_1<c_1\\ A_i^{t}, & r_1\ge c_1\end{cases}\qquad(16)$$

$$X_i^{t}=c_2\oplus C_g\big(B_i^{t},gBest^{t-1}\big)=\begin{cases}C_g\big(B_i^{t},gBest^{t-1}\big), & r_2<c_2\\ B_i^{t}, & r_2\ge c_2\end{cases}\qquad(17)$$

r1 and r2 are random factors in the range (0,1). Cp() (or Cg()) randomly selects two dimensions of the encoded particle and exchanges the values between those dimensions with pBest_i^{t-1} (or gBest^{t-1}). FIG. 8 shows the crossover operation of the individual (social) cognition part: two crossover positions ind1 and ind2 of the encoded particle are randomly selected, and the values of the old particle between dimensions ind1 and ind2 are replaced by the values of pBest (gBest) over the same interval, forming a new particle.
Property 3: the crossover operation may change an encoded particle from a feasible solution into an infeasible one, and vice versa.

The encoded particle (1,1,3,2,3) of FIG. 7 is a feasible solution. Assume the pBest particle is encoded as (2,3,2,2,3) and the randomly generated crossover positions are 1 and 2; the new encoded particle formed after crossover is (2,3,3,2,3). The new particle places ds2, ds3, ds5 and ds6 in dc3; their total data volume is 21GB while the data center capacity of dc3 is only 20GB, so the new encoded particle is an infeasible solution. Similarly, crossing the infeasible solution particle (2,3,3,2,3) with the pBest particle (2,2,1,2,3) at positions 1 and 2 generates a new feasible solution particle (2,2,3,2,3).
For the inertia part, the corresponding part of equation (9) is updated using the idea of the genetic algorithm's mutation operator; the update operation is shown in equation (15).

$$A_i^{t}=w\oplus M_u\big(X_i^{t-1}\big)=\begin{cases}M_u\big(X_i^{t-1}\big), & r_3<w\\ X_i^{t-1}, & r_3\ge w\end{cases}\qquad(15)$$
r3 is a random factor in the range (0,1). Mu() selects one dimension of the encoded particle in a supervised random manner and randomly changes its value within the valid value range. 'Supervised randomness' means the random selection is restricted to a certain range of dimensions, with two main cases.

Case 1: the encoded particle is a feasible solution particle, and the selected dimension excludes the dimensions where private data sets are located; since a private data set is fixedly stored, its storage location cannot be changed.

Case 2: the encoded particle is an infeasible solution particle, and the selected dimension is one of the dimensions encoding an overloaded data center. The data layout scheme corresponding to an infeasible solution particle may have several overloaded data centers; a dimension corresponding to one overloaded data center is randomly selected for mutation, so the infeasible solution particle may mutate into a feasible one.
The encoded particle of FIG. 7 belongs to case 1. In FIG. 9, a dimension ind1 is randomly selected from all dimensions except the fourth and fifth (the dimensions corresponding to ds4 and ds5), and the mutation operator updates the value at ind1 from 3 to 2.
Property 4: the mutation operation may change an encoded particle from a feasible solution into an infeasible one, and vice versa.

The encoded particle (1,2,3,2,3) is a feasible solution. If the 2nd dimension is randomly selected for mutation, the new encoded particle (1,3,3,2,3) is an infeasible solution: it places ds2, ds3, ds5 and ds6 in dc3, whose total data volume of 21GB exceeds the 20GB capacity of the dc3 data center. Similarly, if the mutated position is 2, a new feasible solution particle (1,1,3,2,3) may also be generated.
2.2.4 mapping of particles to data layout results
Algorithm 2. Mapping of encoded particles to data layout results
procedure particleToLayout(G(T, E, DS), DC, X)
1: initialize the current storage dc_cur(i) of each data center to 0, and the cross-data-center transmission time to 0
2: for each dimension i of the encoded particle X
3:   place data set ds_i in data center dc_X[i] and add dsize_i to dc_cur(X[i])
4:   if dc_cur(X[i]) exceeds the capacity of private cloud data center dc_X[i], mark X as an infeasible solution and return
5: for each task t_j of the scientific workflow, scanned in order
6:   find the set of data centers DC_j holding the input data sets IDS_j of t_j
7:   for each candidate data center dc_k, compute the input data transfer time Transfer_jk
8:   place t_j in the data center with the minimum input data transfer time and accumulate that time into T_total
9: return T_total and the corresponding data layout scheme
end procedure
Algorithm 2 is the pseudocode for mapping an encoded particle to a data layout result. The inputs of the algorithm include the scientific workflow G = (T, E, DS), the hybrid-cloud data centers DC, and the encoded particle X. First, the initial storage dc_cur(i) of each data center and the cross-data-center transmission time are set to 0 (line 1). After initialization, the particle dimensions are scanned in order, each data set is assigned to the data center given by its dimension value, and the current storage dc_cur(X[i]) of each data center is updated accordingly. If the storage of some private cloud data center exceeds its capacity, the encoded particle is proved to be an infeasible solution, and the procedure stops and returns (lines 2-7). When the encoded particle is a feasible solution, the layout of the data sets over the data centers is obtained, and the cross-data-center transmission time must be further calculated: the scientific workflow tasks are scanned in order; for task t_j, all data centers DC_j holding its input data sets IDS_j are found, and the input data transfer time Transfer_jk of placing the task in candidate data center dc_k is computed. The data center with the minimum input data transfer time is selected to place the task, and the corresponding transfer times are accumulated into the final cross-data-center transmission time T_total. Finally, T_total and the corresponding data layout scheme are output.
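Algorithm 2 can be sketched as follows. This is a simplification: the fixed execution location of tasks that read private data is not enforced here, and the data structures (nested bandwidth dict, 1-based data set ids) are illustrative:

```python
def particle_to_layout(particle, dsize, capacity, tasks, band):
    """Check feasibility of the encoded particle, then greedily place each
    task in the data center minimizing its input-data transfer time.

    tasks : one list of input data set ids (1-based) per task
    Returns (T_total, task_placements), or (None, None) if infeasible.
    """
    used = {}
    for k, dc in enumerate(particle):          # lines 2-4 of Algorithm 2
        used[dc] = used.get(dc, 0) + dsize[k]
        if capacity.get(dc) is not None and used[dc] > capacity[dc]:
            return None, None                  # infeasible encoded particle
    t_total, placements = 0.0, []
    for ids in tasks:                          # lines 5-8 of Algorithm 2
        best_dc, best_t = None, float("inf")
        for dc in capacity:                    # candidate execution center
            t = sum(dsize[d - 1] / band[particle[d - 1]][dc]
                    for d in ids if particle[d - 1] != dc)
            if t < best_t:
                best_dc, best_t = dc, t
        placements.append(best_dc)
        t_total += best_t
    return t_total, placements
```

When all of a task's inputs already sit in one data center, the greedy step places the task there at zero transfer cost, which is the behavior the example layouts in FIGS. 2 and 3 exploit.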
2.2.5 parameter settings
The inertia weight factor w in equation (9) determines the velocity variation, which has a direct effect on the search capability and convergence of the PSO algorithm. When w is large, the global search capability of the algorithm is strong and convergence is slow; conversely, the local search capability is strong and convergence is fast. Equation (18) is a classical inertia weight adjustment mechanism: in the early stage of the run, the particles emphasize global search capability and a wider problem solution space; as the number of iterations increases and the search deepens, the particles emphasize local search capability and convergence. The value of the inertia weight factor w in equation (18) decreases linearly with the number of iterations, where wmax and wmin are respectively the maximum and minimum values of w set at initialization, and iters_max and iters_cur are respectively the maximum iteration number set at initialization and the current iteration number.

$$w=w_{max}-\big(w_{max}-w_{min}\big)\cdot\frac{iters_{cur}}{iters_{max}}\qquad(18)$$
The inertia weight factor of equation (18) is adjusted in a linearly decreasing manner based on the iteration count, which cannot satisfy the nonlinear data layout problem considered here well; an inertia weight factor that adaptively adjusts the search capability according to the quality of the current particle is therefore needed. The new inertia weight adjustment mechanism adapts according to the degree of difference between the current particle and the globally optimal particle, as shown in equations (19) and (20).

$$w=w_{min}+\big(w_{max}-w_{min}\big)\cdot\frac{div\big(X^{t-1},gBest^{t-1}\big)}{n}\qquad(19)$$

$$div\big(X^{t-1},gBest^{t-1}\big)=\sum_{k=1}^{n}dif\!f\big(x_k^{t-1},gBest_k^{t-1}\big),\quad dif\!f(a,b)=\begin{cases}1, & a\ne b\\ 0, & a=b\end{cases}\qquad(20)$$

Wherein div(X^{t-1}, gBest^{t-1}) is the number of dimensions on which the current particle X^{t-1} and the globally optimal particle gBest^{t-1} take different values. When div(X^{t-1}) is large, the current particle X^{t-1} differs greatly from gBest^{t-1} and the search range needs to be enlarged, so the weight w should be increased to make the particles search the problem solution space over a wider range and avoid falling into a local optimum too early; otherwise, the search range is narrowed and the weight w is decreased, accelerating the convergence process within a small range so that an optimized solution is found more quickly.
In addition, the self-cognition factor c1 and the population-cognition factor c2 are adjusted in linearly decreasing and increasing manners, respectively; equations (21) and (22) give the update mechanisms of c1 and c2.

$$c_1=c_1^{start}-\big(c_1^{start}-c_1^{end}\big)\cdot\frac{iters_{cur}}{iters_{max}}\qquad(21)$$

$$c_2=c_2^{start}+\big(c_2^{end}-c_2^{start}\big)\cdot\frac{iters_{cur}}{iters_{max}}\qquad(22)$$

Wherein c_1^{start} and c_1^{end} are respectively the set initial and final values of the self-cognition factor c1, and c_2^{start} and c_2^{end} are respectively the set initial and final values of the population-cognition factor c2.
By adopting the above technical scheme, the characteristics of data layout in a hybrid cloud environment are considered, the dependency relationships among scientific workflow data are combined, and the influence on transmission delay of factors such as the bandwidth between cloud data centers and the number and capacity of private cloud data centers is taken into account. By introducing the crossover and mutation operators of the genetic algorithm, the premature-convergence problem of the particle swarm optimization algorithm is avoided, the diversity of population evolution is improved, the data transmission delay is effectively compressed, and the transmission delay of scientific workflow data in a hybrid cloud environment is effectively reduced.
To compress the scale of the scientific workflow data, the method first performs a preprocessing operation, improving the execution efficiency of the subsequent data layout strategy; it avoids the premature convergence that particle swarm optimization suffers when solving this NP-hard problem, improves the diversity of population evolution, and optimizes the layout transmission delay of scientific workflow data.

Claims (10)

1. A time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment, characterized by comprising the following steps:
step 1: constructing a data layout scheme model based on scientific workflow in a hybrid cloud environment;
the definition of the whole data layout scheme is S ═ S (DS, DC, Map, T)total) Wherein Map ═ Ui=1,2,...,|DS|{<dci,dsk,dcj> "represents the mapping relationship, mapping, of the data set DS to the data center set DC<dci,dsk,dcj>Representing a data set dskFrom the source data centre dciTransmitting to a target data center dcj,TtotalRepresents the total time overhead incurred by data transmission across data centers during the data placement process;
step 2: preprocessing a scientific workflow, and merging adjacent data sets with only one related task;
and step 3: initializing the population size, the maximum iteration times, the inertia weight factor and the cognitive factor, and generating an initial population at random in a supervision mode; initializing self history optimal particles of the first generation particles and initial population global optimal particles;
and 4, step 4: constructing n-dimensional candidate solution particles by adopting a discrete coding mode for the preprocessed data set;
one particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment; the position X_i^t of particle i in the t-th iteration is shown in equation (11):

$$X_i^t=\big(x_{i1}^t,x_{i2}^t,\ldots,x_{in}^t\big)\qquad(11)$$

each particle has n dimensions, where n represents the number of data sets after the preprocessing operation; x_{ik}^t indicates the storage location of the k-th data set in the t-th iteration, and its value is a data center number, i.e. x_{ik}^t ∈ {1, 2, ..., |DC|};
And 5: mapping the data layout result and the candidate solution particles to obtain cross-data center transmission time and a corresponding data layout scheme;
step 6: calculating the fitness of each encoding particle, setting each particle as a self historical optimal particle, and selecting a feasible solution particle with the minimum fitness value as a population global optimal particle;
and 7: updating the particles based on the particle updating formula, and recalculating the fitness of each updated particle;
and 8: updating the self history optimal particle of the particle;
if the fitness value of the updated particle is smaller than the self historical optimal value, setting the updated particle as the self historical optimal particle; otherwise, jumping to step 10;
and step 9: updating global optimal particles of the population;
if the fitness value of the updated particle is smaller than that of the population global optimal particle, setting the updated particle as the population global optimal particle;
step 10: checking whether an algorithm termination condition reaching the maximum iteration number is met, and ending when the algorithm termination condition is met; otherwise, go to step 7.
2. The time delay optimization-oriented scientific workflow data layout method of the hybrid cloud environment according to claim 1, characterized in that: step 1TtotalThe calculating method of (2):
step 1-1, mapping<dci,dsk,dcj>Representing a data set dskFrom the source data centre dciTransmitting to a target data center dcjData transmission time T oftransferAs shown in equation (6):
Figure FDA0003022072370000021
wherein dskRepresenting a data set, dciRepresenting origin data center, dcjIndicating delivery to a target data center, dci、dcjAll belong to a data center set DC; dsizekRepresenting a data set dskSize, bandijRepresenting data centres dciAnd a data center dcjA bandwidth value of a network bandwidth in between;
Step 1-2: the time overhead T_total caused by data transmission across data centers during the data layout process is calculated as follows:
T_total = Σ_i Σ_j Σ_k e_ijk · T_transfer(<dc_i, ds_k, dc_j>)
wherein e_ijk ∈ {0,1} indicates whether data set ds_k is transmitted from source data center dc_i to target data center dc_j during the data layout process; if so, e_ijk is 1, otherwise it is 0.
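Equations (6) and the T_total sum can be sketched directly; here the set of e_ijk = 1 transfers is assumed to be given as (i, j, k) triples, with `dsize` and `band` as hypothetical lookup tables:

```python
def transfer_time(dsize_k, band_ij):
    """Equation (6): time to move data set ds_k over the link dc_i -> dc_j."""
    return dsize_k / band_ij

def total_transfer_time(transfers, dsize, band):
    """T_total: sum of transfer times over every cross-data-center move.
    transfers: iterable of (i, j, k) triples; same-center moves cost nothing."""
    return sum(transfer_time(dsize[k], band[i][j])
               for i, j, k in transfers if i != j)
```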
3. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 2, characterized in that: the data set in step 1-1 is ds_k = <dsize_k, gt_k, lc_k, flc_k>, where dsize_k is the data set size, gt_k represents the task generating data set ds_k, lc_k represents the storage location of data set ds_k, and flc_k represents the final layout position of data set ds_k; gt_k and lc_k are respectively defined as follows:
gt_k = null,          if ds_k ∈ DS_ini
gt_k = Task(ds_k),    if ds_k ∈ DS_gen

lc_k = fix(ds_k),     if ds_k ∈ DS_fix
lc_k = arbitrary,     if ds_k ∈ DS_flex
wherein DS_ini represents the initial data sets, DS_gen the generated data sets, DS_fix the private data sets that must be stored at fixed secure locations, and DS_flex the non-private data sets that may be stored anywhere; the private data sets DS_fix are stored only in the private cloud data centers DC_pri; Task(ds_k) represents the task generating data set ds_k, and fix(ds_k) denotes the number of the private cloud data center designated to store the private data set.
4. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that the specific steps of step 2 are as follows:
Step 2-1: recording the out-degree and in-degree of all tasks and data sets of the scientific workflow G;
Step 2-2: searching for a 'one-way data cut edge' e_ij; a 'one-way data cut edge' refers to two data sets ds_i and ds_j such that the out-degree of ds_i is 1, the in-degree of ds_j is 1, and only one related task exists between the two data sets;
Step 2-3: when a 'one-way data cut edge' e_ij exists and ds_i and ds_j are not both private data, deleting e_ij, merging ds_i and ds_j into a new data set ds_k, and returning to step 2-2; ending when no 'one-way data cut edge' exists.
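The merge loop of steps 2-1 to 2-3 can be sketched on a toy data set graph; `out_deg`, `in_deg`, `edge_tasks`, and `private` are hypothetical structures recording the degrees, the tasks relating each pair of data sets, and the private data sets (degree bookkeeping after a merge is omitted for brevity):

```python
def merge_one_way_cut_edges(sizes, out_deg, in_deg, edge_tasks, private):
    """Repeatedly merge ds_i and ds_j along a 'one-way data cut edge' e_ij:
    out-degree of ds_i is 1, in-degree of ds_j is 1, exactly one related task,
    and ds_i, ds_j are not both private."""
    merged = dict(sizes)
    changed = True
    while changed:                                    # step 2-3 loops back to step 2-2
        changed = False
        for (i, j), tasks in list(edge_tasks.items()):
            if (i in merged and j in merged
                    and out_deg[i] == 1 and in_deg[j] == 1
                    and len(tasks) == 1
                    and not (i in private and j in private)):
                merged[i] = merged.pop(j) + merged[i]   # fold ds_j into ds_i as a new data set
                del edge_tasks[(i, j)]                  # delete the cut edge e_ij
                changed = True
                break
    return merged
```

The sketch terminates when no candidate edge remains, mirroring the claim's stopping condition.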
5. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that: in step 3, the adjustment mechanism of the inertia weight factor w performs adaptive adjustment according to the degree of difference between the current particle and the global optimal particle:
div(X^{t-1}, gBest^{t-1}) = |{ i : x_i^{t-1} ≠ gBest_i^{t-1} }|

w = w_min + (w_max − w_min) · div(X^{t-1}, gBest^{t-1}) / n,  where n is the number of quantiles of a particle
wherein w_max and w_min respectively represent the upper and lower limits of the value range of w, and div(X^{t-1}, gBest^{t-1}) denotes the number of quantiles at which the current particle position X^{t-1} and the global optimal particle gBest^{t-1} take different values.
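A sketch of the adaptive adjustment; since the claim only defines div in words, the linear scaling of w by the fraction of differing quantiles is an assumption made explicit here:

```python
def div(x, gbest):
    """Number of quantiles where the particle and the global best differ."""
    return sum(1 for a, b in zip(x, gbest) if a != b)

def inertia_weight(x, gbest, w_min=0.4, w_max=0.9):
    """Assumed linear mapping: w grows as the particle drifts away from gBest,
    widening the search; w shrinks as it converges, refining locally."""
    return w_min + (w_max - w_min) * div(x, gbest) / len(x)
```

The default w_min/w_max bounds are illustrative, not values taken from the patent.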
6. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that the fitness of the particles in step 6 is calculated as follows:
when the two encoded particles are particles of the same type, the encoded particle with the shorter cross-data-center transmission time is selected, and the fitness function is defined as follows:
fitness(X_i) = T_total(X_i)
when the two encoded particles are a combination of different types, i.e. a feasible solution particle and an infeasible solution particle, the fitness function is defined as follows:
fitness(X_i) = T_total(X_i),  if Σ_j u_ij · dsize_j ≤ capacity_i for every dc_i ∈ DC
fitness(X_i) = +∞,            otherwise
wherein capacity_i represents the storage capacity of data center dc_i, and u_ij ∈ {0,1} indicates whether data set ds_j is stored in data center dc_i; if so, u_ij is 1, otherwise it is 0.
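A sketch of the comparison rule: feasible particles always beat infeasible ones, and ties of type are broken by T_total. Representing infeasibility as an infinite penalty is an assumption of this sketch; `placement[k]` plays the role of the flattened u_ij indicator:

```python
def is_feasible(placement, sizes, capacity):
    """Feasible: for every data center dc_i, the stored data fits capacity_i.
    placement[k] = the data center holding data set k."""
    load = {}
    for k, dc in placement.items():
        load[dc] = load.get(dc, 0) + sizes[k]
    return all(load.get(dc, 0) <= cap for dc, cap in capacity.items())

def fitness(placement, sizes, capacity, t_total):
    """Penalize infeasible particles past any feasible transmission time."""
    penalty = 0 if is_feasible(placement, sizes, capacity) else float("inf")
    return t_total(placement) + penalty
```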
7. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that the update formula for particle i in step 7 is as follows:
X_i^t = c_2 ⊗ Cg( c_1 ⊗ Cp( w ⊗ Mu(X_i^{t-1}), pBest_i^{t-1} ), gBest^{t-1} )
wherein c_1 and c_2 respectively represent the individual cognition factor and the global cognition factor of the particle, i.e. the degree to which the particle learns from other individuals and from the population-optimal individual; Cg() and Cp() represent crossover operators of the genetic algorithm, and Mu() a mutation operator of the genetic algorithm; pBest_i^{t-1} and gBest^{t-1} respectively represent the individual optimal position of the particle after t-1 iterations and the global optimal position of the population; X_i^t represents the position of particle i at time t, and X_i^{t-1} the position of particle i at time t-1.
8. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 7, characterized in that: the particle update formula is decomposed into three core parts, inertia cognition, individual cognition, and social cognition; then:
(1) Combining the standard PSO algorithm with the mutation operation of the genetic algorithm, the inertia part A_i^t of particle i at time t is given by:

A_i^t = Mu(X_i^{t-1}),  if r_3 < w
A_i^t = X_i^{t-1},      otherwise
wherein r_3 is a random factor with value range (0, 1); w is the inertia weight factor, used to adjust the search capability of the particles over the solution space; Mu() randomly selects a quantile of the encoded particle under supervision and randomly mutates its value, the mutated value satisfying the corresponding value range; X_i^{t-1} represents the position of particle i at time t-1.
(2) Combining the standard PSO algorithm with the crossover operation of the genetic algorithm, the individual cognition part B_i^t and the global cognition part X_i^t of particle i at time t are respectively given by:

B_i^t = Cp(A_i^t, pBest_i^{t-1}),  if r_1 < c_1
B_i^t = A_i^t,                     otherwise

X_i^t = Cg(B_i^t, gBest^{t-1}),  if r_2 < c_2
X_i^t = B_i^t,                   otherwise
wherein c_1 is the individual cognition factor and c_2 the global cognition factor; pBest_i^{t-1} and gBest^{t-1} respectively represent the individual optimal position of the particle after t-1 iterations and the global optimal position of the population; Cp() and Cg() are crossover operators of the genetic algorithm: each randomly selects two quantiles of a particle and crosses the values at those quantiles with pBest_i^{t-1} or gBest^{t-1}, respectively; r_1 and r_2 are random variables with value range [0, 1], used to enhance randomness in the iterative search process.
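The three cognition parts compose into one particle update. A sketch with deliberately simplified operators: Mu() mutates a single random quantile, and Cp()/Cg() copy a random segment from pBest/gBest rather than exchanging two individual quantiles as the claim describes:

```python
import random

def mutate(x, n_dc):
    """Mu(): pick a random quantile and assign it a random valid data center number."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] = random.randrange(n_dc)
    return y

def crossover(x, guide):
    """Cp()/Cg() stand-in: copy a random contiguous segment from the guide particle."""
    a, b = sorted(random.sample(range(len(x)), 2))
    return x[:a] + guide[a:b + 1] + x[b + 1:]

def update_particle(x, pbest, gbest, w, c1, c2, n_dc):
    a = mutate(x, n_dc) if random.random() < w else list(x)    # inertia cognition
    b = crossover(a, pbest) if random.random() < c1 else a     # individual cognition
    return crossover(b, gbest) if random.random() < c2 else b  # social cognition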
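The three cognition parts compose into one particle update. A sketch with deliberately simplified operators: Mu() mutates a single random quantile, and Cp()/Cg() copy a random segment from pBest/gBest rather than exchanging two individual quantiles as the claim describes:

```python
import random

def mutate(x, n_dc):
    """Mu(): pick a random quantile and assign it a random valid data center number."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] = random.randrange(n_dc)
    return y

def crossover(x, guide):
    """Cp()/Cg() stand-in: copy a random contiguous segment from the guide particle."""
    a, b = sorted(random.sample(range(len(x)), 2))
    return x[:a] + guide[a:b + 1] + x[b + 1:]

def update_particle(x, pbest, gbest, w, c1, c2, n_dc):
    a = mutate(x, n_dc) if random.random() < w else list(x)    # inertia cognition
    b = crossover(a, pbest) if random.random() < c1 else a     # individual cognition
    return crossover(b, gbest) if random.random() < c2 else b  # social cognition
```

With w = c1 = c2 = 0 the particle is returned unchanged, matching the "otherwise" branches of the decomposition.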
9. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1 or 7, characterized in that: the individual cognition factor c_1 and the global cognition factor c_2 are set in a linearly increasing and decreasing manner; equations (21) and (22) give the update mechanisms of c_1 and c_2 respectively:
Figure FDA0003022072370000049
Figure FDA00030220723700000410
wherein c_1^ini and c_1^fin are respectively the set initial and final values of the individual cognition factor c_1, c_2^ini and c_2^fin are respectively the set initial and final values of the global cognition factor c_2, iters_cur represents the current iteration number, and iters_max the maximum iteration number set at initialization.
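The linear schedules of equations (21) and (22) reduce to one interpolation helper; the concrete initial/final values below are illustrative, not taken from the patent:

```python
def linear_factor(start, end, iters_cur, iters_max):
    """Equations (21)/(22): linearly interpolate a cognition factor."""
    return start + (end - start) * iters_cur / iters_max

# Illustrative schedules: c1 decreasing (rely less on own history over time),
# c2 increasing (learn more from gBest over time).
c1 = lambda t, T: linear_factor(2.5, 0.5, t, T)
c2 = lambda t, T: linear_factor(0.5, 2.5, t, T)
```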
10. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1 or 7, characterized in that step 1 includes the following two cases:
Case 1: if the encoded particle is a feasible solution particle, the selected quantiles do not include the quantiles where the private data sets are located;
Case 2: if the encoded particle is an infeasible solution particle, the selected quantile is the quantile corresponding to the encoding of an overloaded data center.
CN201810700970.0A 2018-08-24 2018-08-24 Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment Active CN108989098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810700970.0A CN108989098B (en) 2018-08-24 2018-08-24 Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment


Publications (2)

Publication Number Publication Date
CN108989098A CN108989098A (en) 2018-12-11
CN108989098B true CN108989098B (en) 2021-06-01

Family

ID=64539632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810700970.0A Active CN108989098B (en) 2018-08-24 2018-08-24 Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment

Country Status (1)

Country Link
CN (1) CN108989098B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033076B (en) * 2019-04-19 2022-08-05 福州大学 Workflow data layout method for cost optimization in mixed cloud environment
CN113411369B (en) * 2020-03-26 2022-05-31 山东管理学院 Cloud service resource collaborative optimization scheduling method, system, medium and equipment
CN111209091B (en) * 2020-04-22 2020-07-21 南京南软科技有限公司 Scheduling method of Spark task containing private data in mixed cloud environment
CN112256926B (en) * 2020-10-21 2022-10-04 西安电子科技大学 Method for storing scientific workflow data set in cloud environment
CN112492032B (en) * 2020-11-30 2022-09-23 杭州电子科技大学 Workflow cooperative scheduling method under mobile edge environment
CN112579987B (en) * 2020-12-04 2022-09-13 河南大学 Migration deployment method and operation identity verification method of remote sensing program in hybrid cloud
CN112632615B (en) * 2020-12-30 2023-10-31 福州大学 Scientific workflow data layout method based on hybrid cloud environment
CN116955354A (en) * 2023-06-30 2023-10-27 国家电网有限公司大数据中心 Identification analysis method and device for energy digital networking

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102567851A (en) * 2011-12-29 2012-07-11 武汉理工大学 Safely-sensed scientific workflow data layout method under cloud computing environment
CN105554873A (en) * 2015-11-10 2016-05-04 胡燕祝 Wireless sensor network positioning algorithm based on PSO-GA-RBF-HOP
CN108170529A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center load predicting method based on shot and long term memory network
CN108182109A (en) * 2017-12-28 2018-06-19 福州大学 Workflow schedule and data distributing method under a kind of cloud environment

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN104461728B (en) * 2013-09-18 2019-06-14 Sap欧洲公司 Computer system, medium and the method for migration event management and running


Non-Patent Citations (1)

Title
A data placement strategy for scientific workflow in hybrid cloud; Zhanghui Liu; IEEE; 2018-07-07; full text *


Similar Documents

Publication Publication Date Title
CN108989098B (en) Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment
Karthikeyan et al. A hybrid discrete firefly algorithm for solving multi-objective flexible job shop scheduling problems
Trivedi et al. Hybridizing genetic algorithm with differential evolution for solving the unit commitment scheduling problem
CN110033076B (en) Workflow data layout method for cost optimization in mixed cloud environment
Senouci et al. Use of genetic algorithms in resource scheduling of construction projects
Prayogo et al. Optimization model for construction project resource leveling using a novel modified symbiotic organisms search
Parveen et al. Review on job-shop and flow-shop scheduling using multi criteria decision making
Yan et al. A hybrid metaheuristic algorithm for the multi-objective location-routing problem in the early post-disaster stage.
CN110809275B (en) Micro cloud node placement method based on wireless metropolitan area network
Xu et al. Towards heuristic web services composition using immune algorithm
Fan et al. DNN deployment, task offloading, and resource allocation for joint task inference in IIoT
CN116050540B (en) Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling
Kechmane et al. A hybrid particle swarm optimization algorithm for the capacitated location routing problem
WO2022216490A1 (en) Intelligent scheduling using a prediction model
CN111885551B (en) Selection and allocation mechanism of high-influence users in multi-mobile social network based on edge cloud collaborative mode
Wen et al. A multi-objective optimization method for emergency medical resources allocation
Zaman et al. Evolutionary algorithm for project scheduling under irregular resource changes
CN116128247A (en) Resource allocation optimization method and system for production equipment before scheduling
CN113821323B (en) Offline job task scheduling algorithm for mixed deployment data center scene
Wang et al. Multiobjective optimization algorithm with objective-wise learning for continuous multiobjective problems
CN113220437B (en) Workflow multi-target scheduling method and device
CN112632615B (en) Scientific workflow data layout method based on hybrid cloud environment
Zhang et al. A bi-level fuzzy random model for multi-mode resource-constrained project scheduling problem of photovoltaic power plant
CN113642808A (en) Dynamic scheduling method for cloud manufacturing resource change
CN116260730B (en) Geographic information service evolution particle swarm optimization method in multi-edge computing node

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant