CN116050235A

CN116050235A - Workflow data layout method in a cloud-edge environment and storage medium

Info

Publication number: CN116050235A
Application number: CN202310176231.7A
Authority: CN
Inventors: 张舜民; 林兵; 郑裕恒; 吴克涛
Original assignee: Fujian Zhenshi Information Technology Co ltd; Fujian Normal University
Current assignee: Fujian Zhenshi Information Technology Co ltd; Fujian Normal University
Priority date: 2023-02-28
Filing date: 2023-02-28
Publication date: 2023-05-02

Abstract

The invention discloses a method and a storage medium for workflow data layout in a cloud-edge environment. The cloud-edge environment is represented mathematically and, based on the copy generation cost and the data transmission cost, the data layout problem is modeled as a 0-1 integer programming problem that aims to minimize the total time delay, yielding a mathematical problem model. The model is solved with a nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators, in which the crossover and mutation operators of the genetic algorithm are introduced into the particle swarm algorithm and the inertia weight is adaptively adjusted according to the difference between each particle and the global best particle. Workflow data are then laid out according to the solving result, which effectively reduces the time delay. Introducing the crossover and mutation operators strengthens the search capability of the particle swarm algorithm and avoids premature convergence, and adaptively adjusting the inertia weight according to the difference between the current particle and the global best particle makes the optimization process more efficient.

Description

Workflow data layout method in a cloud-edge environment and storage medium

Technical Field

The invention relates to the technical field of workflow data layout, in particular to a method and a storage medium for workflow data layout in a cloud edge environment.

Background

A workflow model is an effective way of describing a business process and consists of a plurality of interrelated tasks. Workflows are commonly used in astronomy, physics, bioinformatics and other scientific fields. As a data-intensive application, a scientific workflow places stringent demands on the computing power and storage capacity of the environment in which it is deployed.

Cloud computing has strong storage and computing capabilities, provides personalized services for users, and guarantees the resource supply of a scientific workflow. However, running a scientific workflow involves large-scale data transmission, and a remotely deployed cloud can cause serious data transmission delay. Edge computing moves computation to the edge of the network, close to the user, which reduces data transmission delay and allows the user's private data to be stored locally. Edge computing resources are limited, however, and cannot store all the data needed and generated while the scientific workflow executes. Combining cloud computing with edge computing provides a safe and efficient way to deploy scientific workflows.

Because private data must stay at fixed locations, a large amount of data is transmitted during the execution of a scientific workflow, which causes serious time delay. As storage costs fall, data copies are frequently used in cloud computing and edge computing, and accessing a nearby copy reduces the number of data transmissions. However, laying out data copies in a cloud-edge environment poses many challenges: generating, transmitting and storing a copy each incur overhead, so suitable data must be selected, a suitable number of copies must be generated, and the layout position of each copy is difficult to choose.

Therefore, how to lay out data copies so as to reduce the time delay is an important problem.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method and a storage medium for workflow data layout in a cloud-edge environment that can effectively reduce the time delay.

In order to solve the technical problems, the invention adopts the following technical scheme:

a method for workflow data layout in a cloud-edge environment comprises the following steps:

S1, carrying out mathematical representation on the cloud-edge environment, and modeling the data layout problem as a 0-1 integer programming problem based on the copy generation cost and the data transmission cost, with the aim of minimizing the total time delay, to obtain a mathematical problem model;

S2, adopting a nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators, introducing the crossover operator and the mutation operator of the genetic algorithm into the particle swarm algorithm, and adaptively adjusting the inertia weight according to the difference between each particle and the global best particle, so as to solve the mathematical problem model;

S3, carrying out workflow data layout according to the solving result.

In order to solve the technical problems, the invention adopts another technical scheme that:

a storage medium having stored thereon a computer program which when executed performs the steps of a method of workflow data layout in a cloud-edge environment as described above.

The invention has the following beneficial effects: according to the method and the storage medium for workflow data layout in a cloud-edge environment, the data copy layout is modeled as a 0-1 integer programming problem with the aim of minimizing the total time delay, and a nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators is adopted to solve the data layout problem, which effectively reduces the time delay; the crossover and mutation operators of the genetic algorithm are introduced into the particle swarm algorithm, so that the search capability of the particle swarm algorithm is enhanced and premature convergence is avoided; and the inertia weight is adaptively adjusted according to the difference between the current particle and the global best particle, so that the optimization process is more efficient.

Drawings

Fig. 1 is a schematic diagram of a scientific workflow example of a method for workflow data layout in a cloud-edge environment according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an example of a workflow data layout in a cloud-edge environment;

FIG. 3 is a schematic diagram of an example one-dimensional encoding of a data layout of a method for workflow data layout in a cloud-edge environment according to an embodiment of the present invention;

fig. 4 is a schematic diagram of two-dimensional encoding example of a data layout of a method for workflow data layout in a cloud-edge environment according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a mutation operator example of a method for workflow data layout in a cloud-edge environment according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a crossover operator example of a method for workflow data layout in a cloud-edge environment according to an embodiment of the present invention;

fig. 7 is a flowchart of a method for workflow data layout in a cloud-edge environment according to an embodiment of the present invention.

Detailed Description

In order to describe the technical contents, the achieved objects and effects of the present invention in detail, the following description will be made with reference to the embodiments in conjunction with the accompanying drawings.

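Referring to FIG. 1, FIG. 2 and FIG. 4 to FIG. 7, a method for workflow data layout in a cloud-edge environment includes the steps of:

S1, carrying out mathematical representation on the cloud-edge environment, and modeling the data layout problem as a 0-1 integer programming problem based on the copy generation cost and the data transmission cost, with the aim of minimizing the total time delay, to obtain a mathematical problem model;

S2, adopting a nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators, introducing the crossover operator and the mutation operator of the genetic algorithm into the particle swarm algorithm, and adaptively adjusting the inertia weight according to the difference between each particle and the global best particle, so as to solve the mathematical problem model;

S3, carrying out workflow data layout according to the solving result.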

From the above description, the beneficial effects of the invention are as follows: the data copy layout is modeled as a 0-1 integer programming problem with the aim of minimizing the total time delay, and a nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators is adopted to solve the data layout problem, which effectively reduces the time delay; the crossover and mutation operators of the genetic algorithm are introduced into the particle swarm algorithm, so that the search capability of the particle swarm algorithm is enhanced and premature convergence is avoided; and the inertia weight is adaptively adjusted according to the difference between the current particle and the global best particle, so that the optimization process is more efficient.

Further, in step S1, the mathematical representation of the cloud-edge environment is specifically:

the cloud-edge environment is expressed as:

$S = \{S_{cld}, S_{edg}\}$;

wherein the cloud $S_{cld}$ comprises $j$ data centers, denoted as:

$S_{cld} = \{s_1, s_2, \dots, s_j\}$;

the edge $S_{edg}$ comprises $k$ data centers, denoted as:

$S_{edg} = \{s_{j+1}, s_{j+2}, \dots, s_{j+k}\}$;

each data center $s_i$ is expressed as:

$s_i = \langle c_i, \gamma_i, a_i \rangle$;

wherein $c_i$ represents its storage capacity; $\gamma_i \in \{0, 1\}$ represents the data center type, $\gamma_i = 0$ indicating a cloud data center, which can store only public data, and $\gamma_i = 1$ indicating an edge data center, which can store public data as well as private data whose storage location is fixed; and $a_i$ represents the speed at which the data center replicates data;

the network bandwidth between data centers is expressed by $b_{ij}$, which represents the bandwidth between data center $s_i$ and data center $s_j$;

the scientific workflow is expressed as:

$G = (V, E, D)$;

wherein $V$ represents the set of tasks in the scientific workflow:

$V = \{v_1, v_2, \dots, v_w\}$;

$E$ represents the set of task dependencies $e_{ij}$ in the scientific workflow;

$D$ represents the set of data copy sets:

$D = \{d_1, d_2, \dots, d_m\}$;

each task $v_i$ is associated with a dataset pair $\langle D_i, D_o \rangle$, where $D_i$ is its input dataset and $D_o$ is its output dataset, each consisting of one or more data items; a dependency $e_{ij} \in E$ indicates that task $v_j$ is a successor of task $v_i$ and can be executed only after task $v_i$ completes, and otherwise task $v_j$ does not depend on task $v_i$; each data copy set $d_i$ comprises the copies of the $i$-th data item, $d_{ij}$ denotes the $j$-th copy of the $i$-th data item, and $d_{i1}$ denotes the $i$-th original data item; each data item carries the attributes $\langle z_{i1}, n_{i1}, f_{i1}, l_{i1} \rangle$, where $z_{i1}$ is the data size; $n_{i1}$ is the number of copies of the data, an integer greater than 0, with $n_{i1} = 1$ indicating that the data has no other copies; $f_{i1}$ is the task that generates data $d_{i1}$, recorded as 0 if the data is initial data; and $l_{i1}$ records the storage location of data $d_{ij}$: if $d_{ij}$ is private data, $l_{i1}$ records the data center to which it belongs, and if it is public data, $l_{i1}$ is 0.

From the above description, the cloud edge environment is mathematically represented through the above steps.
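As a non-limiting illustration, the representation above can be mirrored by a few plain data structures. The sketch below is an assumption about one convenient encoding and not part of the claimed method; the class and field names (`DataCenter`, `DataItem`, `Workflow`, `fixed_at`, and so on) are hypothetical, and the three data centers instantiated at the end borrow the capacities and replication speed used later in the first embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

CLOUD, EDGE = 0, 1  # values of the data center type gamma_i


@dataclass
class DataCenter:
    capacity: float    # c_i, storage capacity
    kind: int          # gamma_i: 0 = cloud (public data only), 1 = edge
    copy_speed: float  # a_i, speed at which the data center replicates data


@dataclass
class DataItem:
    size: float        # z_i1, data size
    n_copies: int = 1  # n_i1 >= 1; 1 means the data has no other copies
    producer: int = 0  # f_i1: generating task, 0 for initial data
    fixed_at: int = 0  # l_i1: owning data center if private, 0 if public

    @property
    def is_private(self) -> bool:
        return self.fixed_at != 0


@dataclass
class Workflow:
    tasks: List[int]             # V
    deps: List[Tuple[int, int]]  # E, (i, j) means v_j runs after v_i
    data: Dict[int, DataItem]    # D, keyed by data index
    inputs: Dict[int, List[int]] = field(default_factory=dict)   # D_i per task
    outputs: Dict[int, List[int]] = field(default_factory=dict)  # D_o per task


# One cloud data center with unlimited storage and two 25 GB edge data centers.
centers = {
    1: DataCenter(capacity=float("inf"), kind=CLOUD, copy_speed=800.0),
    2: DataCenter(capacity=25.0, kind=EDGE, copy_speed=800.0),
    3: DataCenter(capacity=25.0, kind=EDGE, copy_speed=800.0),
}
```

Keeping the privacy flag as a derived property of `fixed_at` follows the convention in the text that a public data item records 0 as its location.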

Further, in step S1, the modeling of the mathematical problem model is specifically:

the copy overhead $t_{copy}$ of data $d_{i1}$ at data center $s_k$ is:

$t_{copy} = \dfrac{z_{i1}}{a_k}$;

wherein $z_{i1}$ is the size of data $d_{i1}$ and $a_k$ is the speed at which data center $s_k$ copies data;

the transmission overhead $t_{tran}$ of transmitting data $d_{ij}$ from data center $s_k$ to data center $s_l$ is:

$t_{tran} = \dfrac{z_{i1}}{b_{kl}}$;

wherein $b_{kl}$ is the bandwidth between data center $s_k$ and data center $s_l$; if the copy is generated and laid out on the current data center, there is no transmission overhead;

the data layout is expressed as $\{S, D, Y, T_{total}\}$, where $S$ is the set of data centers, $D$ is the data set, $Y$ is the set of layout positions of the data, and every data copy $d_{ij} \in D$ corresponds to a unique data center;

$T_{total}$ is the total time delay corresponding to the data layout scheme, namely the sum of the data replication time $T_{copy}$ and the data transmission time $T_{tran}$:

$T_{total} = T_{copy} + T_{tran}$;

the data replication time $T_{copy}$ is computed from the layout position of each data $d_{i1}$ and the number of copies $n_{i1}$ of data $d_i$;

the data transmission time $T_{tran}$ is computed from the indicator $h(i, j, k, l) \in \{0, 1\}$, where $h(i, j, k, l) = 1$ indicates that the $l$-th copy $d_{kl}$ of data $k$ is transmitted from data center $s_i$ to data center $s_j$, and $h(i, j, k, l) = 0$ otherwise;

the target of the data layout strategy is to minimize $T_{total}$, and is expressed with the indicator $\beta(i, j, k) \in \{0, 1\}$, where $\beta(i, j, k) = 1$ indicates that the $k$-th copy of data $j$ is stored on data center $i$.

From the above description, it can be seen that, through the above steps, a mathematical problem model of a data layout strategy is obtained that aims to minimize the total delay.
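As a non-limiting illustration of the two overheads and their sum, the sketch below assumes that the overhead of generating one copy is the data size divided by the replication speed, and that the overhead of one transmission is the data size divided by the bandwidth, which is what the definitions of $z_{i1}$, $a_k$ and $b_{kl}$ suggest. The function names, the event lists and the reading of "M/s" as megabytes per second are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

Bandwidth = Dict[Tuple[int, int], float]  # (k, l) -> b_kl


def copy_overhead(size: float, copy_speed: float) -> float:
    """t_copy: time for a data center to generate one copy of a data item."""
    return size / copy_speed


def transfer_overhead(size: float, src: int, dst: int, bandwidth: Bandwidth) -> float:
    """t_tran: time to move one copy from data center src to data center dst.

    A copy laid out on the data center that holds it needs no transmission.
    """
    if src == dst:
        return 0.0
    return size / bandwidth[(src, dst)]


def total_delay(
    copy_events: Iterable[Tuple[float, float]],
    transfer_events: Iterable[Tuple[float, int, int]],
    bandwidth: Bandwidth,
) -> float:
    """T_total = T_copy + T_tran, summed over the events a layout implies.

    copy_events holds (size, copy_speed) pairs; transfer_events holds
    (size, src, dst) triples.
    """
    t_copy = sum(copy_overhead(z, a) for z, a in copy_events)
    t_tran = sum(transfer_overhead(z, k, l, bandwidth) for z, k, l in transfer_events)
    return t_copy + t_tran


# A 10 GB item (10 * 1024 MB), copied once at 800 MB/s and sent once over a
# 100 MB/s link between data centers 2 and 3.
bandwidth = {(2, 3): 100.0, (3, 2): 100.0}
size_mb = 10 * 1024
delay = total_delay([(size_mb, 800.0)], [(size_mb, 2, 3)], bandwidth)
```

How the copy and transmission events are derived from a layout depends on the task placement, which this sketch leaves to the caller.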

Further, step S2 includes the steps of:

encoding the data layout strategy with a two-dimensional array to construct candidate particles:

the data layout scheme $X_i^t$ of particle $i$ at the $t$-th iteration is as follows:

$X_i^t = (x_{i1}^t, x_{i2}^t, \dots, x_{im}^t)$;

each bit $x_{ij}^t$ represents the storage locations of the copy set of data $j$ for the $i$-th particle in the $t$-th iteration, as a vector with one entry $q_k$ per data center;

wherein $q_k \in \{0, 1\}$; $q_k = 1$ indicates that a copy of data $j$ is laid out on data center $k$, and otherwise no copy of data $j$ is laid out on data center $k$; the number of entries with $q_k = 1$ in $x_{ij}^t$ is the number of copies of data $j$.

From the above description, two problems should be considered with the use of data replicas: (1) How to represent different copies of the data, (2) how to represent the storage locations of the copies of the data; the above steps solve both problems, giving attention to completeness and non-redundancy.
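As a non-limiting illustration of the two-dimensional encoding, the sketch below stores a particle as a list of rows, one row per data item and one 0/1 entry per data center. The helper names and the 0-based indices are assumptions made for the example.

```python
from typing import List

# particle[j][k] == 1 means a copy of data j is laid out on data center k.
Particle = List[List[int]]


def copy_count(particle: Particle, j: int) -> int:
    """Number of copies of data j, i.e. the number of entries equal to 1."""
    return sum(particle[j])


def locations(particle: Particle, j: int) -> List[int]:
    """Data centers that hold a copy of data j."""
    return [k for k, q in enumerate(particle[j]) if q == 1]


# Three data items laid out over three data centers.
x: Particle = [
    [0, 1, 0],  # data 0: a single copy on data center 1
    [0, 1, 1],  # data 1: two copies, on data centers 1 and 2
    [1, 0, 0],  # data 2: a single copy on data center 0
]
```

Because each row records only which data centers hold a copy, two copies of the same data cannot be swapped to produce a second encoding of the same layout, and the copy count is free to change from one iteration to the next.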

Further, solving the mathematical problem model according to the nonlinear inertial weight discrete particle swarm optimization algorithm based on the genetic algorithm operator comprises the following steps:

analyzing the scientific workflow, and performing topological ordering on the tasks to obtain a task queue capable of being sequentially executed;

initializing the maximum capacity of each data center and generating an initial population according to the private dataset, wherein in the initial population the private data are laid out on their corresponding data centers, and the public data are laid out randomly without generating other copies;

simulating the data layout process and judging whether each particle is a feasible solution; if so, calculating the total time delay, and if not, recording the infeasible dataset;

setting each individual in the initial population as its own individual historical best, setting the particle with the best fitness in the initial population as the population historical best, and calculating the fitness of the particles;

iterating the population: mutating the population according to the inertia weight factor $w$, crossing the population with the individual historical best according to the acceleration factor $\alpha_1$, crossing the population with the population historical best according to the acceleration factor $\alpha_2$, calculating the fitness of the new population, and updating the global information;

and outputting the total time delay of the population historical best when the iteration ends.

From the above description, it can be seen that the data copy layout strategy based on NPSO-GA is realized according to the above steps.
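As a non-limiting illustration of the first step, the topological ordering of the tasks can be obtained with Kahn's algorithm. The source does not say which ordering algorithm is used, so this choice is an assumption, as are the function name and the representation of a dependency $e_{ij}$ as the pair `(i, j)`.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def topological_order(tasks: Iterable[int], deps: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the tasks in an order in which they can be executed sequentially.

    deps holds pairs (i, j) meaning that task j can start only after task i.
    """
    tasks = list(tasks)
    indegree: Dict[int, int] = {v: 0 for v in tasks}
    successors: Dict[int, List[int]] = {v: [] for v in tasks}
    for pred, succ in deps:
        successors[pred].append(succ)
        indegree[succ] += 1

    # Tasks without unfinished predecessors are ready to run.
    ready = deque(v for v in tasks if indegree[v] == 0)
    order: List[int] = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for nxt in successors[v]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(tasks):
        raise ValueError("workflow contains a dependency cycle")
    return order


queue = topological_order([1, 2, 3, 4], [(1, 2), (1, 3), (2, 4), (3, 4)])
print(queue)  # [1, 2, 3, 4]
```

The cycle check is a safeguard only, since a scientific workflow is expected to be acyclic.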

Further, the calculating of the fitness includes:

establishing a fitness function based on the comparison of the fitness values $F$ of the two types of particles:

if both compared particles are feasible solutions, the particle with the lower total time delay has the better fitness, and the fitness function is defined as:

$F = T_{total}$;

if both compared particles are infeasible solutions, the particle whose infeasible dataset $D_{inf}$ is smaller has the better fitness, since more of its data are laid out in feasible locations and it becomes a feasible solution more easily in subsequent iterations; the fitness function is:

$F = |D_{inf}|$;

if a feasible solution particle is compared with an infeasible solution particle, the feasible solution is selected.

From the above description, it can be seen that, since the encoding of the data layout strategy of the present invention is not robust, infeasible solution particles can be generated, and different fitness functions therefore need to be defined for different situations.
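As a non-limiting illustration, the three comparison cases can be folded into a single predicate that decides which of two particles is better. The `Evaluation` record and the function name `better` are hypothetical; a particle is treated as feasible exactly when its infeasible dataset is empty.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Evaluation:
    total_delay: float          # T_total, meaningful only for a feasible particle
    infeasible: FrozenSet[int]  # D_inf, empty for a feasible particle

    @property
    def feasible(self) -> bool:
        return not self.infeasible


def better(a: Evaluation, b: Evaluation) -> bool:
    """Return True when particle a has strictly better fitness than particle b."""
    if a.feasible and b.feasible:
        return a.total_delay < b.total_delay          # case 1: F = T_total
    if not a.feasible and not b.feasible:
        return len(a.infeasible) < len(b.infeasible)  # case 2: F = |D_inf|
    return a.feasible                                 # case 3: feasible wins


fast = Evaluation(total_delay=5427.0, infeasible=frozenset())
slow = Evaluation(total_delay=6144.0, infeasible=frozenset())
broken = Evaluation(total_delay=0.0, infeasible=frozenset({2, 6}))
print(better(fast, slow), better(broken, fast))  # True False
```

Comparing particles pairwise avoids having to put total time delay and dataset size on a common numeric scale.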

Further, the data layout process includes the steps of:

initializing a task position list, which records the execution position of every task, and an overrun flag, which records whether any data center exceeds its capacity limit during task execution;

calculating the capacity status of each data center after the initial dataset has been laid out, traversing the task queue, calculating the execution position of each task and recording it in the task position list;

when a task generates an output dataset, temporarily storing the input dataset and the output dataset of the task on the data center that executes it and judging whether that data center exceeds its capacity limit at this moment, then laying out the output data of the task on the data centers designated by the layout scheme and updating the capacity of those data centers;

if a data center exceeds its capacity limit while the tasks are executed, recording the data laid out on that data center in the infeasible dataset $D_{inf}$; otherwise, calculating and recording the total time delay.

As is apparent from the above description, the data layout process is realized through the above steps.
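As a non-limiting illustration, a reduced form of this simulation is sketched below. It tracks only the storage that the laid-out data occupy on each data center and collects the infeasible dataset; the temporary staging of task inputs, the task position list and the delay calculation are deliberately omitted. The function and parameter names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def simulate_layout(
    order: Iterable[int],
    outputs: Dict[int, List[int]],
    initial: Iterable[int],
    layout: Dict[int, List[int]],
    size: Dict[int, float],
    capacity: Dict[int, float],
) -> Set[int]:
    """Return D_inf, the data laid out on data centers that exceed capacity.

    order is the task queue, outputs maps a task to the data it generates,
    initial lists the initial data, and layout maps each data item to the
    data centers that hold a copy of it.
    """
    used = {dc: 0.0 for dc in capacity}
    placed: Dict[int, Set[int]] = {dc: set() for dc in capacity}
    overrun: Set[int] = set()  # data centers that went over their limit

    def place(d: int) -> None:
        for dc in layout[d]:
            used[dc] += size[d]
            placed[dc].add(d)
            if used[dc] > capacity[dc]:
                overrun.add(dc)

    for d in initial:       # lay out the initial dataset first
        place(d)
    for task in order:      # then the outputs, in execution order
        for d in outputs.get(task, []):
            place(d)

    infeasible: Set[int] = set()
    for dc in overrun:
        infeasible |= placed[dc]
    return infeasible
```

An empty result means that the particle satisfies the capacity constraint, so its total time delay can then be calculated.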

Further, in the nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators in step S2, introducing the crossover and mutation operators of the genetic algorithm into the particle swarm algorithm comprises the steps of:

iterating the velocity and position of the particles:

the $t$-th update strategy for the $i$-th particle is:

wherein $C_g$ and $C_p$ are crossover operators, $M_u$ is the mutation operator, $p_i^t$ is the individual historical best of particle $i$ at the $t$-th iteration, $g^t$ is the population historical best at the $t$-th iteration, and $\alpha_1$, $\alpha_2$ and $w$, each between 0 and 1, are the two acceleration factors and the inertia weight factor, respectively;

replacing the inertia part of the particle swarm algorithm with the mutation operator of the genetic algorithm:

generating a random number $r_w$ between 0 and 1; if it is smaller than the inertia weight factor $w$, the particle undergoes mutation:

acquiring the infeasible dataset $D_{inf}$ of particle $X_i$, and obtaining the mutation position from the infeasible dataset $D_{inf}$ and the private dataset $D_{fix}$:

if $D_{inf}$ is empty, selecting the bit of a data item that is not in $D_{fix}$; if $D_{inf}$ is not empty, selecting the bit of a public data item in $D_{inf}$;

counting the number of copies of the data corresponding to the position to be mutated of particle $X_i$, wherein

$X_i[muIndex][j] = 1$

indicates that the data corresponding to the position to be mutated of particle $X_i$ has a copy on data center $j$;

updating the copy number copyCount by increasing or decreasing it probabilistically from its original value, while ensuring that at least one copy exists and that copyCount does not exceed the number of data centers;

generating, for particle $X_i$, a data copy layout scheme with copyCount copies at the position to be mutated;

the individual cognition and social cognition parts of the particle swarm algorithm are replaced with the crossover operator of the genetic algorithm, wherein $C_p$ represents the crossover of the particle with its individual historical best, and $C_g$ represents the crossover of the particle with the population historical best.

From the above description, by introducing the crossover and mutation operators of the genetic algorithm into the particle swarm algorithm, the searching capability of the particle swarm algorithm is enhanced, and premature convergence is avoided.
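As a non-limiting illustration, one update of a particle can be written as a mutation gated by $w$ followed by two crossovers gated by $\alpha_1$ and $\alpha_2$. The sketch assumes this order of composition and takes the mutation and crossover operators as parameters; the function name and signature are hypothetical.

```python
import random
from typing import Callable, List

Particle = List[List[int]]


def update_particle(
    x: Particle,
    p_best: Particle,
    g_best: Particle,
    w: float,
    alpha1: float,
    alpha2: float,
    mutate: Callable[[Particle], Particle],
    crossover: Callable[[Particle, Particle], Particle],
    rng=random,
) -> Particle:
    """Apply one NPSO-GA style update to particle x and return the new particle."""
    new_x = [row[:] for row in x]       # never modify the caller's particle
    if rng.random() < w:                # inertia part -> mutation M_u
        new_x = mutate(new_x)
    if rng.random() <= alpha1:          # individual cognition -> crossover C_p
        new_x = crossover(new_x, p_best)
    if rng.random() <= alpha2:          # social cognition -> crossover C_g
        new_x = crossover(new_x, g_best)
    return new_x
```

Passing the operators in as callables keeps the composition independent of how mutation and crossover are implemented, and `rng` only needs to provide a `random()` method.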

Further, in the nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators in step S2, adaptively adjusting the inertia weight according to the difference between each particle and the global best particle includes the steps of:

adopting a strategy of nonlinearly adjusting the inertia weight, in which the inertia weight is adjusted based on the degree of difference between the current particle and the population best particle;

adjusting the acceleration factors $\alpha_1$ and $\alpha_2$ with a linear variation strategy.

From the above description, it can be seen that through the above steps, the inertia weight is adaptively adjusted according to the difference between the current particle and the global particle, so that the optimizing process is more efficient.
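As a non-limiting illustration, the sketch below shows one way such a parameter schedule could look. Every formula in it is an assumption chosen for the example, since the corresponding expressions are not reproduced in this text: the difference degree is taken as the fraction of data items whose layout differs from the population best, the inertia weight uses an exponential map between hypothetical bounds `w_min` and `w_max`, and the acceleration factors change linearly between hypothetical start and end values.

```python
import math
from typing import List

Particle = List[List[int]]


def difference_degree(x: Particle, g_best: Particle) -> float:
    """Assumed measure: share of data items laid out differently from g_best."""
    differing = sum(1 for a, b in zip(x, g_best) if a != b)
    return differing / len(x)


def inertia_weight(d: float, w_max: float = 0.9, w_min: float = 0.4) -> float:
    """Assumed nonlinear map from the difference degree d in [0, 1] to w.

    A particle identical to the best (d = 0) gets w_min, which favors local
    search; a very different particle (d near 1) gets close to w_max.
    """
    return w_max - (w_max - w_min) * math.exp(d / (d - 1.01))


def linear_factor(start: float, end: float, t: int, t_max: int) -> float:
    """Linear change of an acceleration factor from start to end over t_max steps."""
    return start + (end - start) * t / t_max


x = [[0, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
g = [[0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]]
w = inertia_weight(difference_degree(x, g))  # two of four rows differ, d = 0.5
```

The bounds 0.9 and 0.4 are placeholders; any values between 0 and 1 with `w_max` above `w_min` keep the weight within the range stated for $w$.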

The method and the storage medium for workflow data layout in a cloud-edge environment are suitable for laying out workflow data in a cloud-edge environment.

Referring to fig. 1 to 7, a first embodiment of the present invention is as follows:

In step S1, the mathematical representation of the cloud-edge environment is specifically:

the cloud-edge environment is expressed as:

$S = \{S_{cld}, S_{edg}\}$;

wherein the cloud $S_{cld}$ comprises $j$ data centers, denoted as:

$S_{cld} = \{s_1, s_2, \dots, s_j\}$;

the edge $S_{edg}$ comprises $k$ data centers, denoted as:

$S_{edg} = \{s_{j+1}, s_{j+2}, \dots, s_{j+k}\}$;

each data center $s_i$ is expressed as:

$s_i = \langle c_i, \gamma_i, a_i \rangle$;

the network bandwidth between data center $s_i$ and data center $s_j$ is denoted $b_{ij}$;

the scientific workflow is expressed as:

$G = (V, E, D)$;

wherein $V$ represents the set of tasks in the scientific workflow:

$V = \{v_1, v_2, \dots, v_w\}$;

$E$ represents the set of task dependencies $e_{ij}$ in the scientific workflow;

$D$ represents the set of data copy sets:

$D = \{d_1, d_2, \dots, d_m\}$;

A task is a unit of computation that can be executed in a data center; tasks take datasets as input and generate new datasets, in a certain order of execution. Each task $v_i$ is associated with a dataset pair $\langle D_i, D_o \rangle$, where $D_i$ is its input dataset and $D_o$ is its output dataset, each consisting of one or more data items. A dependency $e_{ij} \in E$ indicates that task $v_j$ is a successor of task $v_i$ and can be executed only after task $v_i$ completes; otherwise task $v_j$ does not depend on task $v_i$. Each data copy set $d_i$ comprises the copies of the $i$-th data item, $d_{ij}$ denotes the $j$-th copy of the $i$-th data item, and $d_{i1}$ denotes the $i$-th original data item. Each data item carries the attributes $\langle z_{i1}, n_{i1}, f_{i1}, l_{i1} \rangle$, where $z_{i1}$ is the data size; $n_{i1}$ is the number of copies of the data, an integer greater than 0, with $n_{i1} = 1$ indicating that the data has no other copies; $f_{i1}$ is the task that generates data $d_{i1}$, recorded as 0 if the data is initial data; and $l_{i1}$ records the storage location of data $d_{ij}$: if $d_{ij}$ is private data, $l_{i1}$ records the data center to which it belongs, and if it is public data, $l_{i1}$ is 0.

Different copies can be laid out on different data centers so as to shorten the data transmission delay; if the data is private data, no copy can be generated. Using data copies creates additional overhead, namely the copy generation overhead $t_{copy}$ and the copy transmission overhead $t_{tran}$, and also occupies storage resources of the data centers. The copy overhead $t_{copy}$ of data $d_{i1}$ at data center $s_k$ is:

$t_{copy} = \dfrac{z_{i1}}{a_k}$;

wherein $z_{i1}$ is the size of data $d_{i1}$ and $a_k$ is the speed at which data center $s_k$ copies data;

the transmission overhead $t_{tran}$ of transmitting data $d_{ij}$ from data center $s_k$ to data center $s_l$ is:

$t_{tran} = \dfrac{z_{i1}}{b_{kl}}$;

wherein $b_{kl}$ is the bandwidth between data center $s_k$ and data center $s_l$; if the copy is generated and laid out on the current data center, there is no transmission overhead.

In this embodiment, copying all public data would generate a large amount of overhead, so the number of copies of each data item in the present invention is dynamic and depends on the number of times the data is used as task input. FIG. 1 illustrates the data replication model of the present invention, which selectively replicates data, trading the overhead of generating a copy for a saving in transmission overhead and thereby reducing the total time delay.

Before a task in the scientific workflow is executed, all input data required by the task must be transmitted to the data center that executes it. Because the data volume in a scientific workflow is huge, the task scheduling time is far smaller than the data transmission time and is therefore ignored. $T_{total}$ is the total time delay corresponding to the data layout scheme, namely the sum of the data replication time $T_{copy}$ and the data transmission time $T_{tran}$:

$T_{total} = T_{copy} + T_{tran}$;

The data replication time $T_{copy}$, the data transmission time $T_{tran}$ and the target of the data layout strategy are expressed as described above.

In this embodiment, FIG. 2 shows data layouts for the scientific workflow of FIG. 1. The scientific workflow contains the task set $V = \{v_1, v_2, v_3, v_4, v_5, v_6, v_7\}$ and the dataset $D = \{d_1, d_2, d_3, d_4, d_5, d_6, d_7\}$, with data sizes {6GB,10GB,4GB,3GB, 5GB,11GB}, wherein the dataset is divided into the public dataset $D_{flex} = \{d_2, d_6, d_7\}$ and the private dataset $D_{fix} = \{d_1, d_3, d_4, d_5\}$. The data centers comprise two edge data centers, each with a capacity of 25GB, and one cloud data center with unlimited storage space. The bandwidths $\{b_{12}, b_{13}, b_{23}\}$ between the data centers are set to {10M/s,20M/s,100M/s} respectively, and the data replication speed of the data centers is set to 800M/s. The private data $d_1, d_3$ are laid out on edge data center 2, and the private data $d_4, d_5$ are laid out on edge data center 3. Since every task involves private data as input or output, tasks $v_1, v_2, v_3, v_6$ are executed on edge data center 2 and tasks $v_4, v_5, v_7$ are executed on edge data center 3.

FIG. 2a and FIG. 2b are two layout schemes that do not use data copies; they differ in that scheme a lays out data $d_2$ on data center 2, while scheme b lays out data $d_2$ on data center 3. Both schemes transmit data $d_2$ across data centers twice and data $d_7$ across data centers once, causing a delay of about 6144 s. FIG. 2c is a data layout scheme of the present invention that uses a dynamic number of copies: when $v_2$ generates data $d_2$, the data is copied once and the copy is transmitted to data center 3, so that only one copy and one cross-data-center transmission of $d_2$ and one cross-data-center transmission of $d_7$ are needed, causing a delay of about 5427 s. In addition, copying all public data would cause unnecessary time overhead and could even exceed the capacity limit of the edge data centers. The invention trades copy generation overhead for transmission overhead where capacity permits, and reduces the total time delay by using data copies judiciously.

S2, adopting a nonlinear inertial weight discrete particle swarm optimization algorithm based on a genetic algorithm operator, introducing a crossover operator and a mutation operator of the genetic algorithm into the particle swarm algorithm, and adaptively adjusting the inertial weight according to the difference between particles and global particles so as to solve the mathematical problem model.

The overall goal of the data layout strategy is to find a mapping from the dataset $D$ to the data centers $S$ that minimizes the total time delay subject to the data center capacities. This embodiment provides a data copy layout strategy based on a nonlinear inertia weight discrete particle swarm optimization algorithm with genetic algorithm operators (Nonlinear inertial weight discrete Particle Swarm Optimization algorithm based on Genetic Algorithm's operators, NPSO-GA), which takes the cost of generating data copies into account, selectively copies data according to task requirements, and determines the layout positions of the data.

Problem coding:

Two problems should be considered when using data copies: (1) how to represent different copies of the data, and (2) how to represent the storage locations of the copies. The problem encoding should take completeness, non-redundancy and robustness into account as far as possible.

FIG. 3 shows a conventional encoding with a static number of copies, in which the same number of copies is generated for every public data item (2 in the figure) and a one-dimensional array represents the data layout scheme of the scientific workflow in the cloud-edge environment, each bit representing the layout position of one data copy. This encoding has completeness, since every candidate solution of the problem space can be encoded as a particle, but it has neither non-redundancy nor robustness. For example, particle $X_1 = (2,2,3,2,3,3,2,3,1,1)$ and particle $X_2 = (2,3,2,2,3,3,2,3,1,1)$ correspond to the same solution in the problem space: in both, data $d_2$ is copied once and its two copies are laid out on data center 2 and data center 3, respectively. In addition, this encoding requires the number of copies to be fixed in advance, so the number of data copies cannot be adjusted according to how often the data is used.
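As a non-limiting illustration of this redundancy, the two particles can be decoded into the set of data centers holding each data item. The slot order used below (one slot for each private data item and two consecutive slots for each of the public data items $d_2$, $d_6$, $d_7$) is an assumption about how the ten positions of FIG. 3 map to data; the `SLOTS` table and the `decode` function are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Set

# Assumed slot order of the one-dimensional particle: d1, d2, d2, d3, d4, d5,
# d6, d6, d7, d7 (two slots for each public data item).
SLOTS: List[int] = [1, 2, 2, 3, 4, 5, 6, 6, 7, 7]


def decode(particle: Sequence[int]) -> Dict[int, FrozenSet[int]]:
    """Map each data index to the set of data centers that hold a copy of it."""
    layout: Dict[int, Set[int]] = {}
    for data, dc in zip(SLOTS, particle):
        layout.setdefault(data, set()).add(dc)
    return {d: frozenset(s) for d, s in layout.items()}


x1 = (2, 2, 3, 2, 3, 3, 2, 3, 1, 1)
x2 = (2, 3, 2, 2, 3, 3, 2, 3, 1, 1)
print(decode(x1) == decode(x2))  # True
```

The two particles differ only in the order of the two slots of $d_2$, so they decode to the same layout, which is the redundancy the one-dimensional encoding suffers from.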

In this embodiment, a new coding scheme is proposed, and a two-dimensional array is used to construct candidate solution particles.

Step S2 includes the following:

the data layout scheme $X_i^t$ of particle $i$ at the $t$-th iteration is as follows:

$X_i^t = (x_{i1}^t, x_{i2}^t, \dots, x_{im}^t)$;

each bit $x_{ij}^t$ represents the storage locations of the copy set of data $j$ for the $i$-th particle in the $t$-th iteration, as a binary vector with one entry per data center.

Such an encoding has completeness and non-redundancy, and the number and locations of the copies can vary as the particle iterates. The data layout of FIG. 2c corresponds to the encoding shown in FIG. 4 (assuming the number of data centers is 3).

Fitness function:

The invention aims to reduce the total time delay of the data layout of the scientific workflow: the lower the total time delay, the higher the particle quality. However, the encoding of the present invention is not robust and can produce infeasible solution particles. A solution can be infeasible for two reasons, privacy disclosure and violation of the capacity constraint, so different fitness functions need to be defined for different situations. Privacy disclosure means that at least one private data item is copied or laid out on a data center other than its corresponding one; violation of the capacity constraint means that at least one edge data center stores more data than its capacity allows. The infeasible dataset $D_{inf}$ describes the data that cause a particle to be an infeasible solution. The comparison of the fitness values $F$ of feasible and infeasible particles is divided into three cases.

If both compared particles are feasible solutions, the particle with the lower total time delay is better:

$F = T_{total}$;

If both compared particles are infeasible solutions, the particle with the smaller infeasible dataset is better:

$F = |D_{inf}|$;

If a feasible solution particle is compared with an infeasible solution particle, the feasible solution is selected.

Particle update strategy:

PSO (particle swarm optimization) uses a particle to represent each solution in the search space. The velocity of a particle determines the direction and distance of its flight, and the optimal solution is obtained by continuously iterating the velocity and position of the particles:
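For reference, the canonical continuous PSO update is usually written as below. This is the textbook form, given here as an assumption about what the iteration looks like rather than as the exact formula of this embodiment; $c_1$, $c_2$ (learning factors), $r_1$, $r_2$ (random numbers in [0, 1]), $p_i^t$ (individual historical best) and $g^t$ (population historical best) follow common usage.

```latex
\begin{aligned}
V_i^{t+1} &= w\,V_i^{t} + c_1 r_1\,\bigl(p_i^{t} - X_i^{t}\bigr) + c_2 r_2\,\bigl(g^{t} - X_i^{t}\bigr),\\
X_i^{t+1} &= X_i^{t} + V_i^{t+1}.
\end{aligned}
```

The three terms of the velocity update correspond to inertia, individual cognition and social cognition, which is the structure that a discrete variant replaces with mutation and crossover operations.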

In this embodiment, the NPSO-GA used is an improvement on the PSO algorithm. The update strategy for the i-th particle at the t-th iteration in NPSO-GA is as follows:

A random number $r_w$ between 0 and 1 is generated; if it is smaller than the inertia weight factor w, the particle undergoes the mutation process $M_u$, as shown in Algorithm 1:

In Algorithm 1, the infeasible data set $D_{inf}$ of particle $X_i$ is first acquired, and the mutation position is obtained from the infeasible data set $D_{inf}$ and the privacy data set $D_{fix}$:

$X_i[muIndex][j] = 1$;

A data copy layout scheme whose number of copies equals the copy count is then generated for particle $X_i$ at the position to be mutated.

The overall mutation process changes not only the layout position of the data but also the number of copies. FIG. 5 shows an example of the mutation process.
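A minimal Python sketch of such a mutation follows. Algorithm 1 is not reproduced in the text, so the selection rule used here (never mutate private data, prefer public data that appear in $D_{inf}$, then draw a random copy count and random data centers) is an assumption, and the function name `mutate` is hypothetical.

```python
import random
from typing import List, Set


def mutate(particle: List[List[int]], infeasible: Set[int],
           private: Set[int], rng: random.Random) -> List[List[int]]:
    """Return a mutated copy of a 0/1 layout matrix (rows: data, columns: centers)."""
    num_centers = len(particle[0])
    mutated = [row[:] for row in particle]
    public = [d for d in range(len(particle)) if d not in private]
    if not public:
        return mutated  # nothing may be moved
    # Assumption: prefer rows that made the particle infeasible, else any public row.
    candidates = [d for d in public if d in infeasible] or public
    mu_index = rng.choice(candidates)
    # Draw a new copy count and that many distinct data centers.
    count = rng.randint(1, num_centers)
    centers = set(rng.sample(range(num_centers), count))
    mutated[mu_index] = [1 if j in centers else 0 for j in range(num_centers)]
    return mutated


rng = random.Random(0)
original = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
result = mutate(original, infeasible={1}, private={0}, rng=rng)
print(result)
```

Only one row is regenerated per call, so both the positions and the number of copies of a single data item can change while the rest of the layout is preserved.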

wherein one operator represents the crossover of the particle with its individual historical best, and the other represents the crossover of the particle with the population historical best.

Crossover of a particle with its individual historical best (or with the population historical best): after the mutation operation, a random number $r_1$ ($r_2$) between 0 and 1 is generated. If it is less than or equal to the acceleration factor $\alpha_1$ ($\alpha_2$), two bits of the particle are randomly selected, the segment between them is taken as the crossover interval, and the segment of the particle within this interval is replaced by the corresponding segment of p (or g). FIG. 6 shows an example of the crossover process.
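The crossover step can be sketched in Python as follows, treating each row of the two-dimensional particle as one bit. Treating the interval as inclusive of both selected bits is an assumption, and the function names are hypothetical.

```python
import random
from typing import List

Particle = List[List[int]]


def crossover(particle: Particle, guide: Particle, rng: random.Random) -> Particle:
    """Replace a random contiguous segment of rows with the guide's rows."""
    a, b = sorted(rng.sample(range(len(particle)), 2))  # two distinct bits
    child = [row[:] for row in particle]
    child[a:b + 1] = [row[:] for row in guide[a:b + 1]]  # inclusive interval (assumption)
    return child


def maybe_crossover(particle: Particle, guide: Particle,
                    alpha: float, rng: random.Random) -> Particle:
    """Apply crossover only if a random number in [0, 1) is <= alpha."""
    if rng.random() <= alpha:
        return crossover(particle, guide, rng)
    return [row[:] for row in particle]


rng = random.Random(1)
particle = [[1, 0], [1, 0], [1, 0], [1, 0]]
guide = [[0, 1], [0, 1], [0, 1], [0, 1]]  # stands for p or g
child = crossover(particle, guide, rng)
print(child)
```

Calling `maybe_crossover` once with the individual historical best and $\alpha_1$, and once with the population historical best and $\alpha_2$, mirrors the two crossover operations described above.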

Parameter updating:

A larger inertia weight factor w favors global search and helps the particles escape local extrema, while a smaller w favors local search and lets the algorithm converge quickly to the optimal solution. To balance search speed and search precision, the invention adjusts the inertia weight w nonlinearly:

wherein the difference term represents the degree of difference between the current particle and the population-optimal particle. When this value is large, the current particle differs greatly from the population optimum and the inertia weight should be increased to perform a global search; otherwise, the inertia weight should be reduced to perform a local search, so that the algorithm converges quickly to the optimal solution.

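The following Python sketch shows one possible nonlinear adjustment with the behavior described above. The exact formula of the embodiment is not reproduced in the text, so the exponential form, the bounds `w_min` and `w_max`, and the row-wise difference measure are all assumptions made for illustration.

```python
import math
from typing import List


def difference_degree(particle: List[List[int]], global_best: List[List[int]]) -> float:
    """Fraction of data items whose layout differs from the global best (in [0, 1])."""
    differing = sum(1 for a, b in zip(particle, global_best) if a != b)
    return differing / len(particle)


def inertia_weight(d: float, w_min: float = 0.4, w_max: float = 0.9) -> float:
    """Nonlinear weight that rises from w_min at d = 0 towards w_max at d = 1."""
    # exp(d / (d - 1.01)) equals 1 at d = 0 and is close to 0 at d = 1.
    return w_max - (w_max - w_min) * math.exp(d / (d - 1.01))


best = [[1, 0], [0, 1], [1, 1], [1, 0]]
current = [[1, 0], [1, 1], [1, 1], [0, 1]]
d = difference_degree(current, best)
print(d, inertia_weight(d))
```

With this form, a particle identical to the population optimum gets the smallest weight and therefore mutates least often, while a very different particle gets a weight close to `w_max`.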
As the number of iterations increases, $\alpha_1$ continuously decreases and $\alpha_2$ continuously increases. In the early stage of iteration, the larger acceleration factor $\alpha_1$ and the smaller acceleration factor $\alpha_2$ make the particles search for local optima within a smaller range, so that the search is finer. In the later stage of iteration, the smaller $\alpha_1$ and the larger $\alpha_2$ improve the global cooperation among particles and help them escape local optima.
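A schedule with this behavior can be sketched in Python as below. A linear schedule is assumed here because the formula of the embodiment is not reproduced in the text; the start and end values are hypothetical.

```python
from typing import Tuple


def acceleration_factors(t: int, max_iter: int,
                         a1_start: float = 0.9, a1_end: float = 0.2,
                         a2_start: float = 0.4, a2_end: float = 0.9) -> Tuple[float, float]:
    """Return (alpha1, alpha2) at iteration t: alpha1 decreases, alpha2 increases."""
    progress = t / max_iter  # 0 at the first iteration, 1 at the last
    alpha1 = a1_start - (a1_start - a1_end) * progress
    alpha2 = a2_start + (a2_end - a2_start) * progress
    return alpha1, alpha2


for t in (0, 50, 100):
    print(t, acceleration_factors(t, 100))
```

Early iterations therefore favor crossover with the individual historical best, and late iterations favor crossover with the population historical best.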

Data copy layout strategy overview:

Algorithm 2 introduces the overall flow of the data copy layout strategy, which is based on the flow of the traditional PSO algorithm:

In Algorithm 2, the system is first initialized (lines 1-5). The scientific workflow is parsed and its tasks are topologically sorted to obtain a task queue that can be executed in sequence (line 1);

The maximum capacity of each data center is initialized (line 2), and an initial population is generated according to the privacy data set, in which private data are laid out on their corresponding data centers and public data are laid out randomly without generating additional copies (line 3);

In this embodiment, the data layout process is simulated through the DataPlacement() function to determine whether each particle is a feasible solution; if so, its total delay is calculated, and if not, its infeasible data set is recorded (line 4);

The fitness of the particles is calculated, every individual in the initial population is set as its own individual historical best, and the particle with the best fitness in the initial population is set as the population historical best (line 5);

The population is then iterated (lines 6-12): the population is mutated according to the inertia weight factor w (line 8), crossed with the individual historical best according to the acceleration factor $\alpha_1$ and with the population historical best according to the acceleration factor $\alpha_2$ (line 9), and the fitness of the new population is calculated and the global information is updated (lines 10-11);

At the end of the iteration, the total delay of the population historical best is output (line 12).
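The first initialization step, topologically sorting the tasks into a queue that can be executed in sequence, can be sketched with Kahn's algorithm. Kahn's algorithm is a standard choice assumed here; the text does not state which sorting method is used, and the task names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def topological_queue(tasks: Sequence[str],
                      edges: Sequence[Tuple[str, str]]) -> List[str]:
    """Order tasks so that every predecessor precedes its successors."""
    indegree: Dict[str, int] = {task: 0 for task in tasks}
    successors: Dict[str, List[str]] = {task: [] for task in tasks}
    for pred, succ in edges:  # edge (v_i, v_j): v_j runs after v_i
        successors[pred].append(succ)
        indegree[succ] += 1

    ready = deque(task for task in tasks if indegree[task] == 0)
    queue: List[str] = []
    while ready:
        task = ready.popleft()
        queue.append(task)
        for succ in successors[task]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(queue) != len(indegree):
        raise ValueError("workflow contains a dependency cycle")
    return queue


tasks = ["v1", "v2", "v3", "v4"]
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]
print(topological_queue(tasks, edges))  # ['v1', 'v2', 'v3', 'v4']
```

The resulting queue is what the data layout simulation traverses when it computes task execution positions.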

The data layout process comprises the following steps:

Algorithm 3 gives the data layout process of the encoded particles and records the fitness of the particles.

In this embodiment, the data layout process function DataPlacement() returns the fitness information of the population: for feasible solution particles it records the total delay, and for infeasible particles it records the infeasible data set $D_{inf}$.

The data layout process comprises the steps of:

initializing a task position list (taskLocList), which records the execution positions of all tasks, and an overflow flag (flagOverflow), which records whether any data center exceeds its capacity limit during task execution (lines 1-4);

calculating the capacity of each data center after the initial data sets have been laid out (lines 5-7), then calculating the capacity of each data center during task execution by traversing the task queue (lines 8-17), computing the execution position of each task and recording it in the task position list (lines 9-13);

when a task generates an output data set, temporarily storing the input data set and the output data set of the task on the data center and judging whether the data center exceeds its capacity limit at that moment (lines 14-15), then laying out the output data of the task on the data centers designated for it and updating the capacity of those data centers (line 16);

if a data center exceeds its capacity limit during task execution, recording the data laid out on that data center in the infeasible data set $D_{inf}$ (lines 18-19); otherwise, calculating and recording the total delay (lines 20-22), which includes the data replication delay and the data transmission delay.
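The capacity check at the end of this process can be illustrated with a simplified Python sketch. It only considers the static layout of a 0/1 particle matrix and ignores the temporary storage of task inputs and outputs during execution, which is a simplifying assumption; the function name `overflowing_data` is hypothetical.

```python
from typing import List, Sequence, Set


def overflowing_data(particle: List[List[int]], sizes: Sequence[float],
                     capacities: Sequence[float]) -> Set[int]:
    """Return the data items stored on any data center whose capacity is exceeded."""
    num_centers = len(capacities)
    used = [0.0] * num_centers
    for d, row in enumerate(particle):
        for j in range(num_centers):
            if row[j] == 1:
                used[j] += sizes[d]  # every copy occupies the full data size
    overflowing = {j for j in range(num_centers) if used[j] > capacities[j]}
    # D_inf: all data laid out on a data center that exceeds its capacity.
    return {d for d, row in enumerate(particle)
            if any(row[j] == 1 for j in overflowing)}


layout = [[1, 0], [1, 1], [0, 1]]
sizes = [5.0, 4.0, 3.0]
print(overflowing_data(layout, sizes, capacities=[8.0, 10.0]))   # {0, 1}
print(overflowing_data(layout, sizes, capacities=[10.0, 10.0]))  # set()
```

An empty result corresponds to the feasible branch, in which the total delay would then be calculated.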

S3: carrying out the workflow data layout according to the solving result.

The second embodiment of the invention is as follows:

a storage medium having stored thereon a computer program for workflow data layout in a cloud-edge environment, characterized in that the computer program when executed performs the steps of a method for workflow data layout in a cloud-edge environment according to any of the preceding claims 1-9.

In summary, the workflow data layout method and storage medium for a cloud-edge environment provided by the invention adaptively generate data copies to optimize the transmission delay during the operation of a scientific workflow, while taking into account factors such as transmission bandwidth, data copy generation cost, data center capacity and private data. The data copy layout is modeled as a 0-1 integer programming problem whose objective is to minimize the total delay. Copies are generated for frequently used data according to the topological structure of the scientific workflow, trading the cost of generating copies for a lower transmission cost and thereby reducing the total delay. A nonlinear inertia weight discrete particle swarm optimization algorithm based on genetic algorithm operators is provided to solve the data layout problem. The crossover and mutation operators of the genetic algorithm are introduced into the particle swarm algorithm, which strengthens its search capability and avoids premature convergence, and the inertia weight is adaptively adjusted according to the difference between the current particle and the global particle, which makes the optimization process more efficient.

The core purpose of the invention is to minimize the delay during the execution of a scientific workflow while satisfying the data privacy constraints and the storage capacity limits of the data centers.

The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent changes made by the specification and drawings of the present invention, or direct or indirect application in the relevant art, are included in the scope of the present invention.

Claims

1. The method for arranging the workflow data in the cloud side environment is characterized by comprising the following steps:

and S3, carrying out workflow data layout according to the solving result.

2. The method for workflow data layout in a cloud-edge environment according to claim 1, wherein the mathematical representation of the cloud-edge environment in step S1 is specifically:

the cloud-edge environment is expressed as:

$S = \{S_{cld}, S_{edg}\}$;

wherein the cloud $S_{cld}$ comprises j data centers, denoted as:

$S_{cld} = \{s_1, s_2, \ldots, s_j\}$;

the edge $S_{edg}$ comprises k data centers, denoted as:

$S_{edg} = \{s_{j+1}, s_{j+2}, \ldots, s_{j+k}\}$;

each data center $s_i$ is expressed as:

$s_i = \langle c_i, \gamma_i, a_i \rangle$;

the network bandwidth between data centers is expressed as:

the scientific workflow is expressed as:

$G = (V, E, D)$;

wherein V represents the set of tasks in the scientific workflow:

$V = \{v_1, v_2, \ldots, v_w\}$;

E represents the set of task dependencies in the scientific workflow:

D represents the set of data copy sets:

$D = \{d_1, d_2, \ldots, d_m\}$;

the data sets related to each task $v_i$ are represented as $\langle D_i, D_o \rangle$, where $D_i$ represents its input data set and $D_o$ represents its output data set, and the input data set and the output data set each consist of one or more data; the inter-task dependency $e_{ij} \in E$ indicates that task $v_j$ is a successor of task $v_i$ and can be executed only after task $v_i$ is completed, otherwise task $v_j$ has no dependence on task $v_i$; each data copy set $d_i$ comprises several copies of the i-th data, $d_{ij}$ represents the j-th copy of the i-th data, and $d_{i1}$ represents the i-th original data; each data copy contains the attributes $\langle z_{i1}, n_{i1}, f_{i1}, l_{i1} \rangle$, where $z_{i1}$ represents the data size; $n_{i1}$ represents the number of copies of the data, which is an integer greater than 0, and $n_{i1} = 1$ indicates that the data has no other copies; $f_{i1}$ identifies the generator of data $d_{i1}$, and if the data is initial data, $f_{i1}$ is recorded as 0; $l_{i1}$ is recorded for data $d_{ij}$ as follows: if data $d_{ij}$ is private data, $l_{i1}$ records the data center to which it belongs, and if it is public data, $l_{i1}$ is 0.

3. The method for workflow data layout in a cloud-edge environment according to claim 2, wherein the modeling of the mathematical problem model in step S3 is specifically:

$T_{total}$ is the total delay corresponding to the data layout scheme, which is the sum of the data replication time $T_{copy}$ and the data transmission time $T_{tran}$:

$T_{total} = T_{copy} + T_{tran}$;

the data replication time $T_{copy}$ is expressed as:

wherein,

the data transmission time $T_{tran}$ is expressed as:

the targets of the data layout strategy are expressed as:

4. The method for workflow data layout in cloud-edge environment as recited in claim 1, wherein said step S3 comprises the steps of:

the data layout scheme of particle i at the t-th iteration is as follows:

each bit

5. The method of workflow data placement in a cloud-edge environment of claim 4, wherein solving the mathematical problem model according to the genetic algorithm operator based nonlinear inertial weight discrete particle swarm optimization algorithm comprises the steps of:

6. The method of workflow data placement in a cloud-edge environment of claim 5, wherein the fitness calculation comprises:

$F = T_{total}$;

$F = |D_{inf}|$;

7. The method for workflow data placement in a cloud-edge environment of claim 5, wherein the data placement process comprises the steps of:

8. The method for workflow data layout in a cloud environment as claimed in claim 7, wherein in the nonlinear inertial weight discrete particle swarm optimization algorithm based on genetic algorithm operator in step S3, introducing crossover and mutation operators of genetic algorithm in the particle swarm algorithm comprises the steps of:

Iterating the velocity and position of the particles:

the update strategy for the i-th particle at the t-th iteration is:

counting the number of copies of the data corresponding to the position to be mutated of particle $X_i$, wherein

$X_i[muIndex][j] = 1$

indicates that the data corresponding to the position to be mutated of particle $X_i$ has a copy on data center j;

wherein one operator represents the crossover of the particle with its individual historical best, and the other represents the crossover of the particle with the population historical best.

9. The method for workflow data layout in a cloud environment according to claim 5, wherein in the nonlinear inertial weight discrete particle swarm optimization algorithm based on genetic algorithm operator in step S3, the self-adaptive adjustment of the inertial weight according to the difference between the particles and the global particles comprises the steps of:

a strategy of non-linearly adjusting the inertial weight is adopted, adjusting inertial weights based on the degree of difference of the current particle and the global particle:

10. A storage medium having stored thereon a computer program for workflow data layout in a cloud-edge environment, characterized in that the computer program when executed performs the steps of a method for workflow data layout in a cloud-edge environment according to any of the preceding claims 1-9.