CN108989098B - Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment - Google Patents


Info

Publication number
CN108989098B
Authority
CN
China
Legal status: Active
Application number
CN201810700970.0A
Other languages
Chinese (zh)
Other versions
CN108989098A (en)
Inventor
林兵
项滔
卢宇
黄志高
陈星
郭文忠
蔡飞雄
Current Assignee: Fujian Normal University
Original Assignee: Fujian Normal University
Application filed by Fujian Normal University filed Critical Fujian Normal University
Priority to CN201810700970.0A priority Critical patent/CN108989098B/en
Publication of CN108989098A publication Critical patent/CN108989098A/en
Application granted granted Critical
Publication of CN108989098B publication Critical patent/CN108989098B/en

Classifications

    • H04L41/0823: Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H04L41/083: Configuration setting for increasing network speed
    • H04L41/142: Network analysis or design using statistical or mathematical methods
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention discloses a time-delay-optimization-oriented scientific workflow data layout method for a hybrid cloud environment. The method accounts for the data layout characteristics of a hybrid cloud, the dependency relationships among scientific workflow data, and the influence on transmission delay of factors such as the bandwidth between cloud data centers and the number and capacity of private cloud data centers. A preprocessing operation is first performed on the data layout strategy, improving the execution efficiency of the later-stage data layout strategy. By introducing the crossover and mutation operators of the genetic algorithm, the method avoids the premature convergence problem of the particle swarm optimization algorithm, improves the diversity of population evolution, effectively compresses the data transmission delay, and effectively reduces the scientific workflow data transmission delay in the hybrid cloud environment. The invention improves the execution efficiency of the data layout strategy and optimizes the transmission delay of the scientific workflow data layout.

Description

Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment
Technical Field
The invention relates to a scientific workflow data layout method in the field of parallel and distributed high-performance computing, in particular to a time-delay-optimization-oriented scientific workflow data layout method in a hybrid cloud environment.
Background
Scientific workflow systems are data-intensive applications that have been widely used in research fields such as astronomy, high-energy physics and bioinformatics. Scientific workflow applications are data-driven: complex data dependencies exist among the computing task nodes, and the processed data sets can reach TB or even PB scale. These data sets include the existing raw input data sets as well as the intermediate and final data sets generated during processing and analysis. Because scientific workflow applications are structurally complex and handle large data volumes, they place strict requirements on the computing power and data storage of the deployment environment. Traditional distributed environments such as grids are usually built for the study of one specific scientific application and share resources poorly with one another, so deploying scientific workflows in such environments causes serious resource waste.
Cloud computing virtualizes geographically distributed resources into a resource pool through virtualization technology and serves end users in a pay-as-you-go manner; it is efficient, flexible and customizable, providing an economical solution for scientific workflow deployment. A hybrid cloud computing environment typically includes one public cloud and multiple private clouds: the public cloud guarantees resource supply and maintains service quality when the scientific workflow load fluctuates severely, while the private clouds safeguard the security of the scientific workflow's private data. As big data grows in importance in scientific applications, scientific workflow data layout in the hybrid cloud environment has become a research hotspot. In emergency management applications, a large number of concurrent instances exist, and the time delay requirements on the scientific workflow data layout are strict. However, because private scientific workflow data are stored in fixed data centers, application execution incurs a large amount of cross-data-center transmission; moving TB- or even PB-scale data sets over the limited network bandwidth between data centers creates a severe transmission delay.
Therefore, researching a reasonable scientific workflow data layout scheme in the hybrid cloud environment is very important, specifically in the following respects: (1) The scientific workflow application structure has complex dependencies and large data volumes; in a hybrid cloud, multi-data-center environment, a reasonable data layout scheme should ensure high cohesion within a single data center and low coupling between data centers, thereby reducing the time overhead of cross-data-center transmission. (2) For security, private data are required to be stored in specific private cloud data centers, yet the limited capacity of those centers forces cross-data-center transmission; under limited transmission bandwidth and fixed storage of private data, and taking bandwidth differences into account, optimizing the data transmission delay is a key challenge of scientific workflow data layout. (3) An effective data layout scheme should also make effective use of data center resources while compressing the data transmission delay.
Existing work on scientific workflow data layout is mainly based on clustering methods and intelligent methods. Clustering methods mainly pursue load-balanced layout across multiple data centers and use data center resources effectively. In a hybrid cloud environment, however, a scientific workflow with private data needs a layout with high cohesion within each data center and low coupling between data centers to guarantee low transmission delay, and traditional load-balancing-based clustering cannot meet this requirement. Traditional intelligent methods are mainly data layout strategies based on genetic algorithms; they focus chiefly on load balancing and easily fall into local optima. Existing research mainly optimizes the number of cross-data-center transmissions and the amount of data transferred, with little work on compressing the data transmission delay itself, and it rarely considers the differences in transmission bandwidth between data centers. Hence, for the time-delay-optimization-oriented scientific workflow data layout problem in the hybrid cloud environment, current research has not yet produced a complete and effective solution.
Disclosure of Invention
The invention aims to provide a time delay optimization-oriented scientific workflow data layout method in a mixed cloud environment.
The technical scheme adopted by the invention is as follows:
A time-delay-optimization-oriented scientific workflow data layout method in a hybrid cloud environment comprises the following steps:
step 1: constructing a data layout scheme model based on scientific workflow in a hybrid cloud environment;
the definition of the whole data layout scheme is S ═ S (DS, DC, Map, T)total) Wherein Map ═ Ui=1,2,...,|DS|{<dci,dsk,dcj> "represents the mapping of the data set DS to the data center set DC, TtotalRepresents the total time overhead incurred by data transmission across data centers during the data placement process; the time delay optimized scientific workflow data layout problem in the hybrid cloud environment is formally expressed as formula (8),
Figure GDA0003022072380000021
wherein u isij{0,1} represents a data set dsjWhether or not to be stored in data centre dciIf yes, uijIs 1, otherwise is 0; t istotalRepresenting the time overhead incurred by data transmission across data centers during the data placement process. In the data layout process, data is continuously transmitted and migrated, so that capacity limitation judgment is carried out on a certain private cloud data center when new data are placed in the certain private cloud data center. The core idea is to pursue the total time overhead TtotalAt a minimum, while meeting the storage capacity limitations of each data center.
Step 2: preprocess the scientific workflow, merging adjacent data sets that are related to only one task, thereby reducing the number of data sets and improving the execution efficiency of the data layout algorithm;
Step 3: initialize the population size, the maximum number of iterations, the inertia weight factor and the cognition factors, and generate the initial population by supervised randomization; initialize each first-generation particle's own historical best particle and the initial population's global best particle; note that the quantile value of a private data set is the number of its designated fixed data center;
Step 4: construct n-dimensional candidate-solution particles for the preprocessed data set using a discrete encoding;
A particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment; the position X_i^t of particle i at the t-th iteration is shown in equation (11):

X_i^t = {x_i1^t, x_i2^t, ..., x_in^t}   (11)

Each particle has n quantiles, where n is the number of data sets after the preprocessing operation; x_ik^t denotes the storage location of the k-th data set at the t-th iteration, and its value is a data center number, i.e. x_ik^t ∈ {1, 2, ..., |DC|}.
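As a hedged illustration of this discrete encoding, the following sketch (the helper name `init_particle` and its parameters are illustrative, not from the patent) builds an n-dimensional particle whose k-th quantile holds the number of the data center storing the k-th data set, with the quantiles of private data sets pinned to their designated centers:

```python
import random

def init_particle(n_datasets, n_centers, fixed):
    """Build one discrete-encoded particle: position[k] is the number
    (1..n_centers) of the data center storing data set k. `fixed` maps
    the index of each private data set to its designated private cloud
    data center number; those quantiles are pinned."""
    position = [random.randint(1, n_centers) for _ in range(n_datasets)]
    for k, dc in fixed.items():
        position[k] = dc  # private data sets keep their fixed center
    return position

particle = init_particle(n_datasets=5, n_centers=3, fixed={0: 2, 3: 1})
```

The `fixed` map realizes the rule from step 3 that a private data set's quantile value is the number of its designated fixed data center.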
And 4, step 4: mapping the data layout result and the candidate solution particles to obtain cross-data center transmission time and a corresponding data layout scheme;
and 5: calculating the fitness of each encoding particle, setting each particle as a self historical optimal particle, and selecting a feasible solution particle with the minimum fitness value as a population global optimal particle;
step 6: updating the particles based on the particle updating formula, and recalculating the fitness of each updated particle;
and 7: updating the self history optimal particle of the particle;
if the fitness value of the updated particle is smaller than the self historical optimal value, setting the updated particle as the self historical optimal particle; otherwise, jumping to step 9;
and 8: updating global optimal particles of the population;
if the fitness value of the updated particle is smaller than that of the population global optimal particle, setting the updated particle as the population global optimal particle;
and step 9: checking whether an algorithm termination condition reaching the maximum iteration number is met, and ending when the algorithm termination condition is met; otherwise, go to step 6.
Further, the method for calculating T_total in step 1 is as follows:
Step 1-1: a mapping <dc_i, ds_k, dc_j> represents the transfer of data set ds_k from source data center dc_i to target data center dc_j; the data transmission time T_transfer is shown in equation (6):

T_transfer(<dc_i, ds_k, dc_j>) = dsize_k / band_ij   (6)

where ds_k is a data set, dc_i is the source data center, dc_j is the target data center, and both dc_i and dc_j belong to the data center set DC; dsize_k is the size of data set ds_k, and band_ij is the bandwidth value of the network link between data centers dc_i and dc_j;
Step 1-2: the time overhead T_total incurred by cross-data-center transmission during data layout is calculated as follows:

T_total = Σ_{i=1}^{|DC|} Σ_{j=1}^{|DC|} Σ_{k=1}^{|DS|} e_ijk · T_transfer(<dc_i, ds_k, dc_j>)   (7)

where e_ijk ∈ {0,1} indicates whether data set ds_k is transferred from source data center dc_i to target data center dc_j during data layout; e_ijk = 1 if so, 0 otherwise.
Further, in step 1-1, a data set ds_k = <dsize_k, gt_k, lc_k, flc_k>, where dsize_k is the data set size, gt_k is the task that generates data set ds_k, lc_k is the storage location of ds_k, and flc_k is the final layout position of ds_k. gt_k and lc_k are defined as follows:

gt_k = Task(ds_k) if ds_k ∈ DS_gen; gt_k = null if ds_k ∈ DS_ini   (4)
lc_k = fix(ds_k) if ds_k ∈ DS_fix; lc_k = flexible if ds_k ∈ DS_flex   (5)

where DS_ini is the initial data set and DS_gen is the generated data set: the initial data sets are the original inputs of the scientific workflow, while the generated data sets are intermediate data sets produced during the execution of the scientific workflow, and these data sets are often the input data sets of other tasks; Task(ds_k) denotes the task that generates ds_k. By storage location, data sets are divided into DS_fix, the fixed-storage (private) data sets, and DS_flex, the freely stored (non-private) data sets. Private data sets DS_fix can only be stored in a private cloud data center DC_pri, and fix(ds_k) denotes the number of the private cloud data center designated to store the private data set.
Further, in step 1, the data center set DC = {DC_pub, DC_pri}, where DC_pub is the public cloud and DC_pri is the private cloud, each composed of several data centers;
The data center dc_k numbered k in the data center set DC is represented as follows:

dc_k = <capacity_k, type_k>   (1)

where capacity_k is the storage capacity of data center dc_k; the data sets stored on a data center cannot exceed its storage capacity. type_k ∈ {0,1} denotes the cloud service provider to which dc_k belongs: when type_k = 0, dc_k belongs to the public cloud and can store only non-private data; when type_k = 1, dc_k belongs to the private cloud and can store both private and non-private data.
Further, the specific steps of step 2 are as follows:
Step 2-1: record the out-degree and in-degree of all tasks and data sets of the scientific workflow G;
Step 2-2: search for a "one-way data cut edge" e_ij;
Step 2-3: when a "one-way data cut edge" e_ij exists and ds_i and ds_j are not both private data, delete e_ij, merge ds_i and ds_j into a new data set ds_k, and return to step 2-2; terminate when no "one-way data cut edge" remains.
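The merging loop of steps 2-1 to 2-3 can be sketched as follows; this is a simplified reading in which data sets related to only one task, and to the same task, are treated as joined by a "one-way data cut edge" and merged, while a pair of private data sets is left unmerged (the function and data-structure names are hypothetical):

```python
def merge_single_task_datasets(related_tasks, private):
    """Simplified sketch of the step-2 preprocessing: data sets that are
    each related to exactly one task, and to the *same* task, are merged
    into one group, unless two or more of them are private (private data
    sets are fixedly stored and must stay separate). `related_tasks`
    maps data set name -> set of task names; `private` is the set of
    private data set names. Returns the list of merged groups."""
    groups = {}
    merged = []
    for ds, tasks in related_tasks.items():
        if len(tasks) == 1:
            groups.setdefault(next(iter(tasks)), []).append(ds)
        else:
            merged.append([ds])  # related to several tasks: keep as-is
    for task, members in groups.items():
        if len([d for d in members if d in private]) >= 2:
            merged.extend([d] for d in members)  # cannot merge privates
        else:
            merged.append(members)
    return merged
```

Using `{'d1': {'t1'}, 'd2': {'t1'}, 'd3': {'t1', 't2'}}` with no private data sets, `d1` and `d2` collapse into one group while `d3` stays alone, mirroring how preprocessing shrinks the particle dimension n.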
Further, in step 3, the adjustment mechanism of the inertia weight factor w adapts according to the degree of difference between the current particle and the global best particle:

div(X^{t-1}, gBest^{t-1}) = (number of quantiles on which X^{t-1} and gBest^{t-1} take different values) / n   (9)
w = w_min + (w_max − w_min) · div(X^{t-1}, gBest^{t-1})   (10)

where div(X^{t-1}, gBest^{t-1}) measures the quantiles on which the current particle X^{t-1} and the global best particle gBest^{t-1} take different values.
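A minimal sketch of this adaptive mechanism, assuming div(...) is the fraction of differing quantiles and that w interpolates linearly between illustrative bounds 0.4 and 0.9 (both the linear form and the bounds are assumptions, not values taken from the patent):

```python
def diversity(position, gbest):
    """Fraction of quantiles on which the particle and the global best
    particle take different values (the div(...) measure; dividing by n
    is an assumed normalization)."""
    n = len(position)
    return sum(1 for a, b in zip(position, gbest) if a != b) / n

def inertia_weight(position, gbest, w_min=0.4, w_max=0.9):
    """Hedged sketch: w grows with the particle/global-best difference,
    enlarging the search range when the particle is far from gBest and
    shrinking it to speed convergence when it is close."""
    return w_min + (w_max - w_min) * diversity(position, gbest)
```

When the particle equals gBest, w falls to w_min and the search contracts; when every quantile differs, w rises to w_max and the search widens, matching the behavior described for div above.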
Further, the fitness of the particles in step 6 is calculated as follows:
If the two encoded particles are of the same type, the encoded particle with the shorter cross-data-center transmission time is selected, and the fitness function is defined as:

F(X_i^t) = T_total(X_i^t)   (12)

If the two encoded particles are of different types, i.e. a combination of a feasible-solution particle and an infeasible-solution particle, the fitness function is defined so that the feasible-solution particle is always preferred, by penalizing the capacity violation:

F(X_i^t) = T_total(X_i^t) + Σ_{i=1}^{|DC|} max(0, Σ_{j=1}^{|DS|} u_ij · dsize_j − capacity_i)   (13)

where capacity_i is the storage capacity of data center dc_i, and u_ij ∈ {0,1} indicates whether data set ds_j is stored in data center dc_i; u_ij = 1 if so, 0 otherwise.
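The fitness rule can be sketched as below; the overload-penalty form and the penalty constant are assumptions chosen so that any feasible-solution particle always beats any infeasible one, as the text requires (helper names are illustrative):

```python
def overload(position, sizes, capacities):
    """Total storage-capacity violation of the layout encoded by
    `position`, where position[k] is the (0-based, for simplicity)
    index of the data center storing data set k."""
    used = [0.0] * len(capacities)
    for k, dc in enumerate(position):
        used[dc] += sizes[k]
    return sum(max(0.0, u - c) for u, c in zip(used, capacities))

def fitness(position, sizes, capacities, transfer_time, penalty=1e9):
    """Hedged sketch of the fitness above: feasible particles are ranked
    by cross-data-center transfer time alone; infeasible ones add a large
    penalty (the constant is an assumption) proportional to the capacity
    violation, so any feasible particle wins the comparison."""
    over = overload(position, sizes, capacities)
    t = transfer_time(position)
    return t if over == 0 else t + penalty * over
```

Here `transfer_time` stands in for the T_total computation of equation (7); smaller fitness is better throughout.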
Further, the update formula for updating particle i in step 7 is as follows:

X_i^t = c_2 ⊕ Cg(c_1 ⊕ Cp(w ⊕ Mu(X_i^{t-1}), pBest_i^{t-1}), gBest^{t-1})   (14)

where Cg() and Cp() represent the crossover operators of the genetic algorithm and Mu() represents the mutation operator of the genetic algorithm; ⊕ denotes applying the preceding operator with the probability given by its left operand (w, c_1 or c_2); pBest_i^{t-1} and gBest^{t-1} represent, respectively, the particle's individual best position after several iterations and the population's global best position; X_i^t is the position of particle i at time t, and X_i^{t-1} is the position of particle i at time t-1.
Further, the particle update formula is decomposed into three core parts, inertia cognition, individual cognition and social cognition; then:
(1) Combining the standard PSO algorithm with the mutation operation of the genetic algorithm, the inertia part A_i^t of particle i at time t is given by formula (15):

A_i^t = Mu(X_i^{t-1}), if r_3 < w;  A_i^t = X_i^{t-1}, otherwise   (15)

where r_3 is a random factor with value range (0,1); w is the inertia weight factor, used to adjust the search capability of the particles over the solution space; Mu() performs supervised random selection of a quantile in the encoded particle and randomly mutates the value of that quantile, the new value satisfying the corresponding value range; X_i^{t-1} is the position of particle i at time t-1.
(2) Combining the standard PSO algorithm with the crossover operation of the genetic algorithm, the individual cognition part B_i^t and the global cognition part X_i^t of particle i at time t are given by formulas (16) and (17), respectively:

B_i^t = Cp(A_i^t, pBest_i^{t-1}), if r_1 < c_1;  B_i^t = A_i^t, otherwise   (16)
X_i^t = Cg(B_i^t, gBest^{t-1}), if r_2 < c_2;  X_i^t = B_i^t, otherwise   (17)

where c_1 is the individual cognition factor and c_2 is the global cognition factor; pBest_i^{t-1} and gBest^{t-1} represent, respectively, the particle's individual best position after several iterations and the population's global best position; Cp() and Cg() represent crossover operations that randomly select two quantiles of a particle and cross the values on the same quantiles with pBest_i^{t-1} or gBest^{t-1}; r_1 and r_2 are random variables with value range [0,1], used to enhance randomness in the iterative search process.
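A hedged sketch of the three-part update, assuming operators that fire with probabilities w, c_1 and c_2, a two-point segment crossover, and pinned private quantiles (all helper names are illustrative, and the segment form of the crossover is an interpretation of "randomly select two quantiles"):

```python
import random

def mutate(position, n_centers, pinned):
    """Mu(): supervised mutation: pick a non-pinned quantile and give it
    a new random data center number (pinned = private quantiles)."""
    free = [k for k in range(len(position)) if k not in pinned]
    child = list(position)
    k = random.choice(free)
    child[k] = random.randint(1, n_centers)
    return child

def crossover(position, guide, pinned):
    """Cp()/Cg(): pick two cut quantiles and copy the guide particle's
    (pBest or gBest) values on that segment, skipping pinned quantiles."""
    i, j = sorted(random.sample(range(len(position)), 2))
    child = list(position)
    for k in range(i, j + 1):
        if k not in pinned:
            child[k] = guide[k]
    return child

def update_particle(x, pbest, gbest, w, c1, c2, n_centers, pinned):
    """One update mirroring X_i^t = c2 (+) Cg(c1 (+) Cp(w (+) Mu(X),
    pBest), gBest): each operator fires with its probability."""
    if random.random() < w:
        x = mutate(x, n_centers, pinned)
    if random.random() < c1:
        x = crossover(x, pbest, pinned)
    if random.random() < c2:
        x = crossover(x, gbest, pinned)
    return x
```

Because both operators skip the pinned quantiles, private data sets never leave their designated private cloud data centers during the search.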
Further, the individual cognition factor c_1 and the global cognition factor c_2 are set to vary linearly, decreasing and increasing respectively; formulas (21) and (22) give the update mechanisms of c_1 and c_2:

c_1 = c_1^start − (c_1^start − c_1^end) · t / T   (21)
c_2 = c_2^start + (c_2^end − c_2^start) · t / T   (22)

where c_1^start and c_1^end are the set initial and final values of the self-cognition factor c_1, c_2^start and c_2^end are the set initial and final values of the population cognition factor c_2, t is the current iteration number and T is the maximum number of iterations. When div(X^{t-1}) is large, the current particle X^{t-1} differs greatly from gBest^{t-1} and the search range needs to be enlarged, so the weight w should be increased to make the particles search the problem solution over a wider range and avoid falling into a local optimum too early; otherwise the search range is narrowed and the weight w is reduced, accelerating the convergence process within a small range and finding an optimized solution more quickly.
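The linear schedules can be sketched as follows; the start and end values used here are illustrative assumptions, not constants from the patent:

```python
def cognition_factors(t, t_max, c1_start=0.9, c1_end=0.2,
                      c2_start=0.4, c2_end=0.9):
    """Linearly decrease the individual factor c1 and increase the
    global factor c2 over the run, as in formulas (21)-(22): early
    iterations favor each particle's own best, later iterations pull
    the swarm toward the global best."""
    frac = t / t_max
    c1 = c1_start - (c1_start - c1_end) * frac
    c2 = c2_start + (c2_end - c2_start) * frac
    return c1, c2
```

At t = 0 the pair is (c1_start, c2_start); at t = t_max it is (c1_end, c2_end), so exploration gradually gives way to exploitation.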
Further, supervised randomization covers the following two cases:
Case 1: if the encoded particle is a feasible-solution particle, the selected quantiles exclude the quantiles where the private data sets are located; since the private data sets are fixedly stored, their storage locations cannot be changed.
Case 2: if the encoded particle is an infeasible-solution particle, the selected quantile is one corresponding to an overloaded data center's code. The data layout scheme corresponding to an infeasible-solution particle may have several overloaded data centers; a quantile corresponding to one randomly chosen overloaded data center's code is selected for the mutation operation, so that the infeasible-solution particle may mutate into a feasible-solution particle.
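The two supervision cases can be sketched as a quantile-selection routine (the helper names and the 0-based data center indexing are illustrative):

```python
import random

def pick_mutation_quantile(position, sizes, capacities, private_quantiles):
    """Supervised selection of the quantile to mutate (cases 1 and 2):
    for a feasible particle any non-private quantile may be chosen; for
    an infeasible particle, a quantile currently placed in one randomly
    chosen overloaded data center is selected instead."""
    used = [0.0] * len(capacities)
    for k, dc in enumerate(position):
        used[dc] += sizes[k]
    overloaded = [dc for dc, u in enumerate(used) if u > capacities[dc]]
    if overloaded:                       # case 2: infeasible particle
        dc = random.choice(overloaded)
        candidates = [k for k, d in enumerate(position)
                      if d == dc and k not in private_quantiles]
    else:                                # case 1: feasible particle
        candidates = [k for k in range(len(position))
                      if k not in private_quantiles]
    return random.choice(candidates)
```

Steering the mutation toward an overloaded center is what gives an infeasible-solution particle a chance of mutating back into a feasible one.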
By adopting the technical scheme, the data layout characteristics in a mixed cloud environment are considered, the dependence relationship among scientific workflow data is combined, and the influence of factors such as bandwidth among cloud data centers, the number and capacity of private cloud data centers and the like on transmission delay is considered; by introducing the crossover operator and the mutation operator of the genetic algorithm, the problem of premature convergence of the particle swarm optimization algorithm is avoided, the diversity of population evolution is improved, the data transmission delay is effectively compressed, and the scientific workflow data transmission delay in the mixed cloud environment is effectively reduced.
In order to compress the scale of the scientific workflow data, the method first preprocesses the scientific workflow data, improving the execution efficiency of the later-stage data layout strategy; it avoids the premature convergence problem of the particle swarm optimization algorithms used in the prior art to solve this NP-hard problem, improves the diversity of population evolution, and optimizes the transmission delay of the scientific workflow data layout.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a scientific workflow diagram of the present invention;
FIG. 2 is one example of a data layout for a scientific workflow of the present invention;
FIG. 3 is a second example of a data layout for a scientific workflow of the present invention;
FIG. 4 is a flow chart of the algorithm of the present invention;
FIG. 5 is a schematic diagram of compressing a "one-way data cut edge" in the preprocessing process of the present invention;
FIG. 6 is a structure of Epigenomics workflow before and after pretreatment according to the present invention;
FIG. 7 is a diagram of an example of data layout particle encoding according to the present invention;
FIG. 8 is a cross-operator graph of the individual cognitive factors and global cognitive factors of the present invention;
FIG. 9 is a diagram of the mutation operator of the inertia part of the present invention.
Detailed Description
As shown in FIG. 1 to FIG. 7, the present invention discloses a time-delay-optimization-oriented scientific workflow data layout method in a hybrid cloud environment; the invention is described in detail below with reference to the accompanying drawings.
1 problem definition and analysis
This section defines the relevant concepts of the time-delay-optimization-oriented scientific workflow data layout problem in the hybrid cloud environment and analyzes the problem with examples. The problem definition mainly covers the hybrid cloud environment, the scientific workflow and the data layout scheme.
1.1 problem definition
The hybrid cloud DC = {DC_pub, DC_pri} mainly consists of a public cloud and a private cloud, each composed of several data centers. The public cloud DC_pub = {dc_1, dc_2, ..., dc_n} consists of n data centers, and the private cloud DC_pri = {dc_1, dc_2, ..., dc_m} consists of m data centers. Since the focus here is the data layout problem, only the storage capacity of each data center is considered and its computing capacity is ignored. The data center dc_i numbered i is represented as follows:

dc_i = <capacity_i, type_i>   (1)

where capacity_i is the storage capacity of data center dc_i; the data sets stored on a data center cannot exceed its storage capacity. type_i ∈ {0,1} denotes the cloud service provider to which dc_i belongs: when type_i = 0, dc_i belongs to the public cloud and can store only non-private data; when type_i = 1, dc_i belongs to the private cloud and can store both private and non-private data. In addition, the bandwidth between the various data centers is represented as follows:

B = {b_ij | dc_i, dc_j ∈ DC, i ≠ j}   (2)
b_ij = <band_ij, type_i, type_j>   (3)

where, for every pair dc_i, dc_j ∈ DC with i ≠ j, b_ij represents the network bandwidth between data centers dc_i and dc_j, and band_ij is its bandwidth value. It is assumed here that bandwidth values between data centers are known and do not fluctuate.
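A minimal sketch of the data center and bandwidth model of equations (1)-(3); the class and helper names are illustrative, and the bandwidth table below is example data, not values from the patent:

```python
from dataclasses import dataclass

PUBLIC, PRIVATE = 0, 1

@dataclass
class DataCenter:
    """dc_i = <capacity_i, type_i>: storage capacity plus a type flag,
    0 for a public cloud center (non-private data only), 1 for a
    private cloud center (private and non-private data)."""
    capacity: float
    type: int

    def can_store_private(self):
        return self.type == PRIVATE

bandwidth = {  # band_ij: known, non-fluctuating link bandwidths
    (0, 1): 100.0,
    (0, 2): 50.0,
    (1, 2): 150.0,
}

def band(i, j):
    """Symmetric lookup of band_ij between data centers i and j."""
    return bandwidth[(min(i, j), max(i, j))]
```

Storing the links with sorted index pairs keeps the bandwidth symmetric, matching the assumption that band_ij does not depend on transfer direction.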
A scientific workflow is represented by a directed acyclic graph G = (T, E, DS), where T = {t_1, t_2, ..., t_r} denotes the set of r task nodes, E = {e_12, e_13, ..., e_ij} represents the data dependencies between tasks, and DS = {ds_1, ds_2, ..., ds_n} denotes the set of all data of the scientific workflow.
Each data dependency edge e_ij = (t_i, t_j) indicates that task t_i and task t_j have a data dependency relationship, where task t_i is the direct predecessor (parent) node of task t_j, and task t_j is the direct successor (child) node of task t_i. In the scientific workflow scheduling process, a task can be executed only after all of its predecessor nodes have finished executing. In a given directed acyclic graph representing a scientific workflow, a task without predecessor nodes is called an "entry task"; similarly, a task without successor nodes is called an "exit task".
For a certain subtask t_i = <IDS_i, ODS_i>, its set of input data is IDS_i and its set of output data is ODS_i. The correspondence between tasks and data is many-to-many: one datum may be used by multiple tasks, and one task may require multiple input data when executed.
For a certain data set ds_k = <dsize_k, gt_k, lc_k, flc_k>, dsize_k is the data set size, gt_k is the task that generates data set ds_k, lc_k is the storage location of ds_k, and flc_k is the final layout position of ds_k. gt_k and lc_k are defined as follows:

gt_k = Task(ds_k) if ds_k ∈ DS_gen; gt_k = null if ds_k ∈ DS_ini   (4)
lc_k = fix(ds_k) if ds_k ∈ DS_fix; lc_k = flexible if ds_k ∈ DS_flex   (5)

By source, data sets are divided into the initial data set DS_ini and the generated data set DS_gen: the initial data sets are the original inputs of the scientific workflow, while the generated data sets are intermediate data sets produced during the execution of the scientific workflow, and these data sets are often the input data sets of other tasks; Task(ds_k) denotes the task that generates ds_k. By storage location, data sets are divided into fixed-storage (private) data sets DS_fix and freely stored (non-private) data sets DS_flex. Private data sets can only be stored in a private cloud data center DC_pri, and fix(ds_k) denotes the number of the private cloud data center designated to store the private data set ds_k.
The purpose of data layout is to minimize data transfer time while meeting task execution requirements. Any task execution needs to satisfy two conditions: (1) the task is scheduled to a data center for execution; (2) the input data sets required by the task are already in that data center. Since the time to schedule a task to a data center is much shorter than the time to transfer data to that data center, this work focuses on data layout rather than task scheduling: each task is simply scheduled to the data center with the smallest transfer-time overhead. The whole data layout scheme is defined as S = (DS, DC, Map, Ttotal), where Map = ∪i=1,2,...,|DS| {<dci, dsk, dcj>} denotes the mapping of the data set DS to the data centers DC. A mapping <dci, dsk, dcj> represents transferring data set dsk from source data center dci to target data center dcj; the data transmission time of this process is given by equation (6). Ttotal, the total time overhead caused by cross-data-center data transmission during data layout, is given by equation (7).

$$T_{transfer}(ds_k,dc_i,dc_j)=\frac{dsize_k}{band_{ij}}\qquad(6)$$

$$T_{total}=\sum_{i=1}^{|DC|}\sum_{j=1}^{|DC|}\sum_{k=1}^{|DS|}e_{ijk}\cdot\frac{dsize_k}{band_{ij}}\qquad(7)$$
Wherein eijk ∈ {0,1} indicates whether data set dsk is transferred from source data center dci to target data center dcj during the data layout process; eijk is 1 if such a transfer occurs, and 0 otherwise.
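Equations (6) and (7) can be sketched in Python; the `Mapping` type and the variable names below are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    """One entry <dc_i, ds_k, dc_j> of Map: move data set ds from src to dst."""
    src: int   # source data center number
    ds: int    # data set number
    dst: int   # target data center number

def transfer_time(dsize, band, m):
    """Equation (6): time to move data set m.ds from m.src to m.dst."""
    if m.src == m.dst:
        return 0.0                       # no cross-data-center transfer
    return dsize[m.ds] / band[m.src][m.dst]

def total_time(dsize, band, mappings):
    """Equation (7): total cross-data-center transmission time."""
    return sum(transfer_time(dsize, band, m) for m in mappings)

# Sizes and bandwidths of the FIG. 1 example (GB and GB/s; 10/20/150 MB/s).
dsize = {1: 3, 2: 5, 3: 3, 4: 3, 5: 5, 6: 8}
band = {1: {2: 0.010, 3: 0.020}, 2: {1: 0.010, 3: 0.150},
        3: {1: 0.020, 2: 0.150}}
```

For instance, moving ds3 (3GB) from dc1 to dc2 at 10MB/s takes 300 seconds, which is why the bandwidth term dominates the layout quality in the example below.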
Based on the above definitions, the delay-optimization-oriented scientific workflow data layout problem in a hybrid cloud environment can be formally expressed as equation (8); its core idea is to minimize the total time overhead Ttotal while meeting the storage capacity limit of each data center.

$$\min T_{total}\quad s.t.\;\sum_{j=1}^{|DS|}u_{ij}\cdot dsize_j\le capacity_i,\;\forall\, dc_i\in DC_{pri}\qquad(8)$$
Wherein uij ∈ {0,1} indicates whether data set dsj is stored in data center dci; uij is 1 if so, and 0 otherwise. During the data layout process, data is continuously transmitted and migrated, so the capacity limit of a private cloud data center must be checked whenever new data is placed in it.
1.2 problem analysis
FIG. 1 is an example of a scientific workflow consisting of 5 tasks {t1, t2, t3, t4, t5}, 5 original input data sets {ds1, ds2, ds3, ds4, ds5} and 1 intermediate data set ds6. The sizes of the 6 data sets {dsize1, dsize2, dsize3, dsize4, dsize5, dsize6} are {3GB, 5GB, 3GB, 3GB, 5GB, 8GB}, respectively, where ds4 is a private data set and must be stored in data center dc2. The input data sets of task t4 are {ds3, ds4, ds6}; since ds4 is private data that must be fixedly stored in dc2, t4 must also be executed in dc2. Likewise, ds5 is a private data set that must be stored in dc3, so t5 must also be executed in dc3. FIGS. 2 and 3 show two data layout schemes. dc1 is a public cloud data center with unlimited storage capacity, while dc2 and dc3 are two private cloud data centers with a storage capacity of 20GB each. The bandwidth between private cloud data centers is roughly 10 times the bandwidth from the public cloud data center to a private cloud data center, so the bandwidths between the 3 data centers {band12, band13, band23} are assumed to be {10M/s, 20M/s, 150M/s}, respectively.
FIG. 2 shows the data layout scheme generated by the dependency-matrix partition model of Lijun et al.: the public data sets ds1, ds2 and ds3 are deployed in public cloud data center dc1, ds6 is deployed in private cloud data center dc2, and the private data sets ds4 and ds5 are each deployed in their associated data centers. The resulting layout incurs 4 data transfers totaling 27GB, with a cross-data-center transfer time of about 1953 seconds.
FIG. 3 shows the optimal data layout scheme: the public data sets ds1 and ds2 are deployed in public cloud data center dc1, while ds3 and ds6 are deployed in private cloud data center dc3. The resulting layout incurs 5 data transfers totaling 30GB, with a cross-data-center transfer time of about 1023 seconds. Although this layout exceeds that of Lijun et al. in both transfer count and transfer volume, its cross-data-center transfer time is clearly superior, mainly because this scheme comprehensively considers the influence of the transmission bandwidth between data centers.
Traditional matrix partition models or load-balancing models based on breaking data dependencies place data with high mutual dependency in the same data center as far as possible, which effectively reduces the amount of data transferred between data centers, but they do not comprehensively consider the layout impact caused by bandwidth differences between data centers. Therefore, aiming at these shortcomings of traditional data layout models, a GA-DPSO-based data layout strategy is designed that incorporates a differentiated-bandwidth mechanism and adaptively places different data sets according to factors such as bandwidth and data center capacity limits, effectively reducing the transmission delay of scientific workflow data layout in a hybrid cloud environment.
2 GA-DPSO-based data layout strategy
For a data layout scheme S = (DS, DC, Map, Ttotal), the core goal of this work is to find the optimal mapping Map of the data set DS to the data centers DC so that the cross-data-center transmission time Ttotal is minimized. Finding the optimal mapping of DS to DC is an NP-hard problem, and the bandwidth differences between data centers in a hybrid cloud environment must be taken into account. To compress the scale of the scientific workflow data, a preprocessing operation is first performed, which improves the execution efficiency of the subsequent data layout strategy. To avoid the premature convergence that particle swarm optimization suffers when solving NP-hard problems, the GA-DPSO algorithm is proposed, which improves the diversity of population evolution and optimizes the layout transmission delay of scientific workflow data. The following introduces, in order, the scientific workflow preprocessing and the genetic-algorithm-operator-based adaptive discrete particle swarm optimization data layout strategy.
2.1 Scientific workflow preprocessing
Algorithm 1. Merging adjacent data sets with only one dependent task
procedure preProcess(G(T, E, DS))
1: record the out-degree and in-degree of all tasks and data sets of the scientific workflow G
2: search for a 'one-way data cut edge' eij
3: if a 'one-way data cut edge' eij exists and dsi and dsj are not both private data, delete eij and merge dsi and dsj into a new data set dsk
4: repeat step 2 until no 'one-way data cut edge' exists
end procedure
Algorithm 1 presents the pseudocode of the preprocessing procedure, which merges adjacent data sets that share only one dependent task, based on the structural features of the scientific workflow itself. A 'one-way data cut edge' is defined as follows: for two data sets dsi and dsj, the out-degree of dsi is 1, the in-degree of dsj is 1, and there is only one related task between the two data sets; the structure is shown in FIG. 3. When the scientific workflow has a 'one-way data cut edge' and dsi and dsj are not both private data, dsi and dsj can be merged and placed together, as shown in FIG. 5. For scientific workflows with a large number of one-way data cut edges, such as the Epigenomics workflow, preprocessing can greatly reduce the number of data sets and thus improve the execution efficiency of the subsequent data layout algorithm. FIG. 6 shows the structural change of the Epigenomics workflow before and after preprocessing; the number of data sets is compressed by more than 30%.
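Under the assumption that the one-way data cut edges have already been identified from the out-degrees and in-degrees, the merging step of Algorithm 1 can be sketched with a small union-find over data set ids (the data structures here are illustrative, not the patent's):

```python
def preprocess(ds_ids, cut_edges, private):
    """Merge data sets joined by a 'one-way data cut edge' (Algorithm 1,
    steps 3-4), skipping a merge only when BOTH data sets are private.

    ds_ids    : iterable of data set ids
    cut_edges : set of (i, j) pairs, each a one-way data cut edge ds_i -> ds_j
    private   : set of ids of private (fixed-storage) data sets
    Returns the list of merged data set groups.
    """
    parent = {d: d for d in ds_ids}          # union-find forest

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]    # path halving
            d = parent[d]
        return d

    for i, j in list(cut_edges):
        ri, rj = find(i), find(j)
        if ri != rj and not (i in private and j in private):
            parent[rj] = ri                  # merge ds_j into ds_i's group

    groups = {}
    for d in ds_ids:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())
```

On the FIG. 1 example, where ds5 and ds6 share a one-way data cut edge, this reduces the 6 data sets to 5 groups, matching the encoding dimension used in section 2.2.1.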
Property 1. The scientific workflow preprocessing strategy can compress the number of scientific workflow data sets and improve algorithm execution efficiency, but it may affect the final data layout result.
FIG. 5 has shown an example of compressing the number of scientific workflow data sets. Section 2.2.1 presents a discrete encoding whose dimension equals the number of data sets, so reducing the number of data sets improves the execution efficiency of the algorithm. The merged placement of ds5 and ds6 illustrated in FIG. 5 means that ds5 and ds6 always reside in the same data center. If the remaining capacity of some private cloud data center can only hold ds5 or ds6 alone, the data layout result with preprocessing will differ from the result without preprocessing.
2.2 genetic algorithm operator-based adaptive discrete particle swarm optimization data layout strategy
The PSO algorithm was proposed by Eberhart and Kennedy in 1995; it is a population-based stochastic optimization algorithm inspired by the social behavior of bird flocks. The particle is the central concept of PSO: each particle represents a candidate solution of the problem, and particles move and are iteratively updated in the problem space to obtain better particles. A particle's movement update mainly adjusts its velocity and position, as shown in equations (9) and (10).
$$v_i^{t+1}=w\cdot v_i^{t}+c_1 r_1\big(pBest_i^{t}-X_i^{t}\big)+c_2 r_2\big(gBest^{t}-X_i^{t}\big)\qquad(9)$$

$$X_i^{t+1}=X_i^{t}+v_i^{t+1}\qquad(10)$$

v_i^t and X_i^t respectively represent the velocity and the position of the i-th particle in the t-th iteration; to keep particle updates within the problem solution space, a maximum particle velocity Vmax is defined to limit the particle velocity. A particle's velocity update is influenced by its own state, its own historical best position, and the population's historical best position. The inertia weight w directly influences the convergence of the algorithm and adjusts the particles' ability to search the solution space. pBest_i^t and gBest^t respectively represent the historical best position of particle i and the historical best position of the population after the t-th iteration. c1 and c2 are cognitive factors, representing the cognitive learning ability towards the particle's own historical best position and the population's historical best position, respectively. r1 and r2 are two random factors in the range (0,1), which increase search randomness during iteration and improve population diversity. In addition, to determine whether a particle occupies a good or a bad position in the problem space, a fitness function needs to be defined and evaluated.
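Equations (9) and (10) are the standard continuous PSO update; a minimal sketch, with an illustrative clamp of each velocity component to Vmax:

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    """One continuous PSO update per equations (9) and (10)."""
    r1, r2 = random.random(), random.random()
    new_v, new_x = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        vel = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        vel = max(-vmax, min(vmax, vel))   # limit |velocity| to Vmax
        new_v.append(vel)
        new_x.append(xi + vel)             # equation (10)
    return new_x, new_v
```

When the particle already sits at both pBest and gBest, the cognitive terms vanish and only the inertia term w * v remains, which is exactly the behavior the inertia weight discussion below relies on.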
The traditional PSO algorithm solves continuous problems, whereas the layout of data sets onto data centers is a discrete problem, requiring a new problem encoding and a new fitness evaluation function. To address the premature-convergence problem of traditional PSO, a new particle update strategy is also needed. In addition, the setting of algorithm parameters directly affects the number of iterations and the search capability of the algorithm. The GA-DPSO data layout optimization algorithm proposed here is described in detail below in terms of problem encoding, fitness function, particle update strategy, and parameter settings.
2.2.1 problem coding
A good problem encoding strategy can effectively improve algorithm efficiency and search capability. Problem encoding mainly follows three basic principles: completeness, non-redundancy, and soundness.

Definition 1 (completeness). Every feasible solution in the problem space has a corresponding encoded particle in the encoding space.

Definition 2 (non-redundancy). Each candidate solution in the problem space corresponds to exactly one encoded particle in the encoding space.

Definition 3 (soundness). Every encoded particle in the encoding space corresponds to a candidate solution in the problem space.
It is challenging to construct a problem encoding that satisfies all three principles simultaneously. We use discrete encoding to construct n-dimensional candidate solution particles. One particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment; the position X_i^t of particle i in the t-th iteration is shown in equation (11).

$$X_i^t=\big(x_{i1}^t,x_{i2}^t,\ldots,x_{in}^t\big)\qquad(11)$$

Each particle has n dimensions, where n represents the number of data sets after the preprocessing operation. x_{ik}^t represents the storage location of the k-th data set in the t-th iteration; its value is a data center number, i.e. x_{ik}^t ∈ {1, 2, ..., |DC|}.
Note that for a private data set the storage location is fixed regardless of iterative updates; for example, data sets ds4 and ds5 in FIG. 1 can only be fixedly stored in dc2 and dc3, respectively. FIG. 7 shows the problem encoding corresponding to the data layout of FIG. 3 for the scientific workflow of FIG. 1: the preprocessing operation compresses the data sets from 6 to 5, merging ds5 and ds6 into one data set, and both are stored in dc3.
Property 2. The discrete encoding strategy satisfies the non-redundancy and completeness principles, but does not satisfy the soundness principle.
Each data set is finally stored in a corresponding data center with a corresponding data center number, and the final storage position of a data set can only be one data center. A data layout scheme of the scientific workflow thus corresponds to one n-dimensional particle whose dimension values are the corresponding data center numbers; since one layout scheme corresponds to exactly one encoded particle, the non-redundancy principle is satisfied. A non-private data set may be stored in any data center, so its encoding dimension may take any data center number; since the encoding value of each data set is the number of the data center designated to store it, every layout scheme has a corresponding encoded particle, and the completeness principle is satisfied. However, some encoded particles do not correspond to realistic candidate solutions in the problem space. For example, if the data set storage locations in FIG. 7 are (1,2,2,2,2), all data sets except ds1 are stored in dc2; their total data volume reaches 24GB, exceeding the 20GB storage capacity of dc2, making that data layout scheme infeasible, so the soundness principle is not satisfied.
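The soundness check behind Property 2, summing the data placed in each data center against its capacity, can be sketched as follows; the sizes and 20GB private capacities come from the FIG. 1 example, while the function and variable names are illustrative:

```python
def is_feasible(particle, dsize, capacity):
    """particle[k] = data center number storing data set k+1.
    capacity[c] = storage limit in GB of data center c (None = unlimited
    public cloud). Returns True iff no data center is overloaded."""
    used = {}
    for k, dc in enumerate(particle):
        used[dc] = used.get(dc, 0) + dsize[k]
    return all(capacity.get(dc) is None or used[dc] <= capacity[dc]
               for dc in used)

dsize = [3, 5, 3, 3, 13]             # after preprocessing: ds5+ds6 merged (5+8 GB)
capacity = {1: None, 2: 20, 3: 20}   # dc1 public; dc2, dc3 private with 20 GB
```

The encoding (1,1,3,2,3) from FIG. 7 passes the check, while (1,2,2,2,2) places 24GB in dc2 and fails, matching the counterexample in the text.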
2.2.2 fitness function
The fitness function evaluates the quality of particles; generally, a particle with a smaller fitness value is better. Since the objective here is to reduce the cross-data-center transmission time of the scientific workflow data layout, and a smaller transmission time means a better particle, the fitness value could be defined directly as the data transmission time of the layout scheme corresponding to the particle. However, since the problem encoding does not satisfy the soundness principle (the data sets placed in some data center may exceed its capacity), the fitness function must be defined case by case.
Definition 4 (feasible solution particle). The data layout scheme corresponding to the encoded particle satisfies the data center capacity limits: the data sets stored in no data center exceed that data center's capacity.

Definition 5 (infeasible solution particle). The data layout scheme corresponding to the encoded particle violates a data center capacity limit: the data sets stored in some data center exceed that data center's capacity.
The fitness function values of two encoded particles are compared in three different cases.
Case 1: both encoded particles are feasible solution particles. The particle with the shorter cross-data-center transmission time is selected, and the fitness function is defined as follows:

$$fitness\big(X_i^t\big)=T_{total}\big(X_i^t\big)\qquad(12)$$
case 2: and the two encoding particles are both infeasible solution particles, the encoding particles with shorter data transmission time across the data center are selected, the infeasible solution particles are likely to become feasible solution particles through later particle updating operation, the encoding particles with shorter data transmission time are more likely to keep shorter data transmission time, and the fitness function definition is consistent with the formula (12).
Case 3: one encoding particle is an infeasible solution particle and one encoding particle is a feasible solution particle, the feasible solution particles are selected without question, and the fitness function is defined as follows:
Figure GDA0003022072380000132
2.2.3 particle update strategy
As shown in equation (9), traditional PSO includes three core parts: inertia, individual cognition, and social cognition. Traditional PSO performs random search over a continuous space, enlarges the search space only slowly and locally, is prone to premature convergence, and falls into local optima. To enhance the search capability of PSO, apply it to the discrete problem, let it explore a wider solution space, and avoid premature convergence, the algorithm introduces the crossover and mutation operators of the genetic algorithm. The improved update operation of equation (9) for particle i at iteration t is as follows:

$$X_i^{t}=c_2\oplus C_g\Big(c_1\oplus C_p\big(w\oplus M_u\big(X_i^{t-1}\big),pBest_i^{t-1}\big),gBest^{t-1}\Big)\qquad(14)$$

where Cg() and Cp() represent crossover operators of the genetic algorithm and Mu() represents the mutation operator of the genetic algorithm.
For the individual cognition part and the social cognition part, the corresponding parts of equation (9) are updated using the idea of the genetic algorithm's crossover operator; the update operations are shown in equations (16) and (17).
$$B_i^{t}=c_1\oplus C_p\big(A_i^{t},pBest_i^{t-1}\big)=\begin{cases}C_p\big(A_i^{t},pBest_i^{t-1}\big), & r_1<c_1\\ A_i^{t}, & r_1\ge c_1\end{cases}\qquad(16)$$

$$X_i^{t}=c_2\oplus C_g\big(B_i^{t},gBest^{t-1}\big)=\begin{cases}C_g\big(B_i^{t},gBest^{t-1}\big), & r_2<c_2\\ B_i^{t}, & r_2\ge c_2\end{cases}\qquad(17)$$

r1 and r2 are random factors in the range (0,1). Cp() (or Cg()) randomly selects two dimensions of the encoded particle and exchanges the values between those dimensions with pBest_i^{t-1} (or gBest^{t-1}). FIG. 8 shows the crossover operation of the individual (social) cognition part: two crossover positions ind1 and ind2 of the encoded particle are randomly selected, and the values of the old particle between dimensions ind1 and ind2 are replaced by the values of pBest (gBest) over the same interval, forming a new particle.
Property 3: the crossover operation may change an encoded particle from a feasible solution into an infeasible one, and vice versa.

The encoded particle (1,1,3,2,3) of FIG. 7 is a feasible solution. Assume the pBest particle is encoded as (2,3,2,2,3) and the randomly generated crossover positions are 1 and 2; the new encoded particle formed after crossover is (2,3,3,2,3). The new particle places ds2, ds3, ds5 and ds6 in dc3; their total data volume is 21GB while the data center capacity of dc3 is only 20GB, so the new encoded particle is an infeasible solution. Similarly, crossing the infeasible solution particle (2,3,3,2,3) with the pBest particle (2,2,1,2,3) at positions 1 and 2 generates a new feasible solution particle (2,2,3,2,3).
For the inertia part, the corresponding part of equation (9) is updated using the idea of the genetic algorithm's mutation operator; the update operation is shown in equation (15).

$$A_i^{t}=w\oplus M_u\big(X_i^{t-1}\big)=\begin{cases}M_u\big(X_i^{t-1}\big), & r_3<w\\ X_i^{t-1}, & r_3\ge w\end{cases}\qquad(15)$$
r3 is a random factor in the range (0,1). Mu() selects one dimension of the encoded particle in a supervised random manner and randomly changes its value within the valid value range. 'Supervised randomness' means the random selection is restricted to a certain range of dimensions, with two main cases.

Case 1: the encoded particle is a feasible solution particle, and the selected dimension excludes the dimensions where private data sets are located; since a private data set is fixedly stored, its storage location cannot be changed.

Case 2: the encoded particle is an infeasible solution particle, and the selected dimension is one of the dimensions encoding an overloaded data center. The data layout scheme corresponding to an infeasible solution particle may have several overloaded data centers; a dimension corresponding to one overloaded data center is randomly selected for mutation, so the infeasible solution particle may mutate into a feasible one.
The encoded particle of FIG. 7 belongs to case 1. In FIG. 9, a dimension ind1 is randomly selected from all dimensions except the fourth and fifth (the dimensions corresponding to ds4 and ds5), and the mutation operator updates the value at ind1 from 3 to 2.
Property 4: the mutation operation may change an encoded particle from a feasible solution into an infeasible one, and vice versa.

The encoded particle (1,2,3,2,3) is a feasible solution. If the 2nd dimension is randomly selected for mutation, the new encoded particle (1,3,3,2,3) is an infeasible solution: it places ds2, ds3, ds5 and ds6 in dc3, whose total data volume of 21GB exceeds the 20GB capacity of the dc3 data center. Similarly, if the mutated position is 2, a new feasible solution particle (1,1,3,2,3) may also be generated.
2.2.4 mapping of particles to data layout results
Algorithm 2. Mapping of encoded particles to data layout results
procedure particleToLayout(G(T, E, DS), DC, X)
1: initialize the current storage dc_cur(i) of each data center to 0, and the cross-data-center transmission time to 0
2: for each dimension i of the encoded particle X
3:   place data set ds_i in data center dc_X[i] and add dsize_i to dc_cur(X[i])
4:   if dc_cur(X[i]) exceeds the capacity of private cloud data center dc_X[i], mark X as an infeasible solution and return
5: for each task t_j of the scientific workflow, scanned in order
6:   find the set of data centers DC_j holding the input data sets IDS_j of t_j
7:   for each candidate data center dc_k, compute the input data transfer time Transfer_jk
8:   place t_j in the data center with the minimum input data transfer time and accumulate that time into T_total
9: return T_total and the corresponding data layout scheme
end procedure
Algorithm 2 is the pseudocode for mapping an encoded particle to a data layout result. The inputs of the algorithm include the scientific workflow G = (T, E, DS), the hybrid-cloud data centers DC, and the encoded particle X. First, the initial storage dc_cur(i) of each data center and the cross-data-center transmission time are set to 0 (line 1). After initialization, the particle dimensions are scanned in order, each data set is assigned to the data center given by its dimension value, and the current storage dc_cur(X[i]) of each data center is updated accordingly. If the storage of some private cloud data center exceeds its capacity, the encoded particle is proved to be an infeasible solution, and the procedure stops and returns (lines 2-7). When the encoded particle is a feasible solution, the layout of the data sets over the data centers is obtained, and the cross-data-center transmission time must be further calculated: the scientific workflow tasks are scanned in order; for task t_j, all data centers DC_j holding its input data sets IDS_j are found, and the input data transfer time Transfer_jk of placing the task in candidate data center dc_k is computed. The data center with the minimum input data transfer time is selected to place the task, and the corresponding transfer times are accumulated into the final cross-data-center transmission time T_total. Finally, T_total and the corresponding data layout scheme are output.
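Algorithm 2 can be sketched as follows. This is a simplification: the fixed execution location of tasks that read private data is not enforced here, and the data structures (nested bandwidth dict, 1-based data set ids) are illustrative:

```python
def particle_to_layout(particle, dsize, capacity, tasks, band):
    """Check feasibility of the encoded particle, then greedily place each
    task in the data center minimizing its input-data transfer time.

    tasks : one list of input data set ids (1-based) per task
    Returns (T_total, task_placements), or (None, None) if infeasible.
    """
    used = {}
    for k, dc in enumerate(particle):          # lines 2-4 of Algorithm 2
        used[dc] = used.get(dc, 0) + dsize[k]
        if capacity.get(dc) is not None and used[dc] > capacity[dc]:
            return None, None                  # infeasible encoded particle
    t_total, placements = 0.0, []
    for ids in tasks:                          # lines 5-8 of Algorithm 2
        best_dc, best_t = None, float("inf")
        for dc in capacity:                    # candidate execution center
            t = sum(dsize[d - 1] / band[particle[d - 1]][dc]
                    for d in ids if particle[d - 1] != dc)
            if t < best_t:
                best_dc, best_t = dc, t
        placements.append(best_dc)
        t_total += best_t
    return t_total, placements
```

When all of a task's inputs already sit in one data center, the greedy step places the task there at zero transfer cost, which is the behavior the example layouts in FIGS. 2 and 3 exploit.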
2.2.5 parameter settings
The inertia weight factor w in equation (9) determines the velocity variation, which has a direct effect on the search capability and convergence of the PSO algorithm. When w is large, the global search capability of the algorithm is strong and convergence is slow; conversely, the local search capability is strong and convergence is fast. Equation (18) is a classical inertia weight adjustment mechanism: in the early stage of the run, the particles emphasize global search capability and a wider problem solution space; as the number of iterations increases and the search deepens, the particles emphasize local search capability and convergence. The value of the inertia weight factor w in equation (18) decreases linearly with the number of iterations, where wmax and wmin are respectively the maximum and minimum values of w set at initialization, and iters_max and iters_cur are respectively the maximum iteration number set at initialization and the current iteration number.

$$w=w_{max}-\big(w_{max}-w_{min}\big)\cdot\frac{iters_{cur}}{iters_{max}}\qquad(18)$$
The inertia weight factor of equation (18) is adjusted in a linearly decreasing manner based on the iteration count, which cannot satisfy the nonlinear data layout problem considered here well; an inertia weight factor that adaptively adjusts the search capability according to the quality of the current particle is therefore needed. The new inertia weight adjustment mechanism adapts according to the degree of difference between the current particle and the globally optimal particle, as shown in equations (19) and (20).

$$w=w_{min}+\big(w_{max}-w_{min}\big)\cdot\frac{div\big(X^{t-1},gBest^{t-1}\big)}{n}\qquad(19)$$

$$div\big(X^{t-1},gBest^{t-1}\big)=\sum_{k=1}^{n}dif\!f\big(x_k^{t-1},gBest_k^{t-1}\big),\quad dif\!f(a,b)=\begin{cases}1, & a\ne b\\ 0, & a=b\end{cases}\qquad(20)$$

Wherein div(X^{t-1}, gBest^{t-1}) is the number of dimensions on which the current particle X^{t-1} and the globally optimal particle gBest^{t-1} take different values. When div(X^{t-1}) is large, the current particle X^{t-1} differs greatly from gBest^{t-1} and the search range needs to be enlarged, so the weight w should be increased to make the particles search the problem solution space over a wider range and avoid falling into a local optimum too early; otherwise, the search range is narrowed and the weight w is decreased, accelerating the convergence process within a small range so that an optimized solution is found more quickly.
In addition, the self-cognition factor c1 and the population-cognition factor c2 are adjusted in linearly decreasing and increasing manners, respectively; equations (21) and (22) give the update mechanisms of c1 and c2.

$$c_1=c_1^{start}-\big(c_1^{start}-c_1^{end}\big)\cdot\frac{iters_{cur}}{iters_{max}}\qquad(21)$$

$$c_2=c_2^{start}+\big(c_2^{end}-c_2^{start}\big)\cdot\frac{iters_{cur}}{iters_{max}}\qquad(22)$$

Wherein c_1^{start} and c_1^{end} are respectively the set initial and final values of the self-cognition factor c1, and c_2^{start} and c_2^{end} are respectively the set initial and final values of the population-cognition factor c2.
By adopting the above technical scheme, the characteristics of data layout in a hybrid cloud environment are considered, the dependency relationships among scientific workflow data are combined, and the influence on transmission delay of factors such as the bandwidth between cloud data centers and the number and capacity of private cloud data centers is taken into account. By introducing the crossover and mutation operators of the genetic algorithm, the premature-convergence problem of the particle swarm optimization algorithm is avoided, the diversity of population evolution is improved, the data transmission delay is effectively compressed, and the transmission delay of scientific workflow data in a hybrid cloud environment is effectively reduced.
To compress the scale of the scientific workflow data, the method first performs a preprocessing operation, improving the execution efficiency of the subsequent data layout strategy; it avoids the premature convergence that particle swarm optimization suffers when solving this NP-hard problem, improves the diversity of population evolution, and optimizes the layout transmission delay of scientific workflow data.

Claims (10)

1. A time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment, characterized by comprising the following steps:
step 1: constructing a data layout scheme model based on scientific workflow in a hybrid cloud environment;
the definition of the whole data layout scheme is S ═ S (DS, DC, Map, T)total) Wherein Map ═ Ui=1,2,...,|DS|{<dci,dsk,dcj> "represents the mapping relationship, mapping, of the data set DS to the data center set DC<dci,dsk,dcj>Representing a data set dskFrom the source data centre dciTransmitting to a target data center dcj,TtotalRepresents the total time overhead incurred by data transmission across data centers during the data placement process;
step 2: preprocessing a scientific workflow, and merging adjacent data sets with only one related task;
and step 3: initializing the population size, the maximum iteration times, the inertia weight factor and the cognitive factor, and generating an initial population at random in a supervision mode; initializing self history optimal particles of the first generation particles and initial population global optimal particles;
and 4, step 4: constructing n-dimensional candidate solution particles by adopting a discrete coding mode for the preprocessed data set;
one particle represents one data layout scheme of the scientific workflow in the hybrid cloud environment; the position X_i^t of particle i in the t-th iteration is shown in equation (11):

$$X_i^t=\big(x_{i1}^t,x_{i2}^t,\ldots,x_{in}^t\big)\qquad(11)$$

each particle has n dimensions, where n represents the number of data sets after the preprocessing operation; x_{ik}^t indicates the storage location of the k-th data set in the t-th iteration, and its value is a data center number, i.e. x_{ik}^t ∈ {1, 2, ..., |DC|};
And 5: mapping the data layout result and the candidate solution particles to obtain cross-data center transmission time and a corresponding data layout scheme;
step 6: calculating the fitness of each encoding particle, setting each particle as a self historical optimal particle, and selecting a feasible solution particle with the minimum fitness value as a population global optimal particle;
and 7: updating the particles based on the particle updating formula, and recalculating the fitness of each updated particle;
and 8: updating the self history optimal particle of the particle;
if the fitness value of the updated particle is smaller than the self historical optimal value, setting the updated particle as the self historical optimal particle; otherwise, jumping to step 10;
and step 9: updating global optimal particles of the population;
if the fitness value of the updated particle is smaller than that of the population global optimal particle, setting the updated particle as the population global optimal particle;
step 10: checking whether an algorithm termination condition reaching the maximum iteration number is met, and ending when the algorithm termination condition is met; otherwise, go to step 7.
2. The time delay optimization-oriented scientific workflow data layout method of the hybrid cloud environment according to claim 1, characterized in that: step 1TtotalThe calculating method of (2):
step 1-1, mapping<dci,dsk,dcj>Representing a data set dskFrom the source data centre dciTransmitting to a target data center dcjData transmission time T oftransferAs shown in equation (6):
Figure FDA0003022072370000021
wherein dskRepresenting a data set, dciRepresenting origin data center, dcjIndicating delivery to a target data center, dci、dcjAll belong to a data center set DC; dsizekRepresenting a data set dskSize, bandijRepresenting data centres dciAnd a data center dcjA bandwidth value of a network bandwidth in between;
Step 1-2: the time overhead T_total caused by data transmission across data centers during the data layout process is calculated as follows:
T_total = Σ_i Σ_j Σ_k e_ijk · T_transfer(<dc_i, ds_k, dc_j>)
wherein e_ijk ∈ {0,1} indicates whether data set ds_k is transmitted from source data center dc_i to target data center dc_j during the data layout process; if so, e_ijk is 1, otherwise it is 0.
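Equations (6) and the T_total sum can be sketched directly; here the set of e_ijk = 1 transfers is assumed to be given as (i, j, k) triples, with `dsize` and `band` as hypothetical lookup tables:

```python
def transfer_time(dsize_k, band_ij):
    """Equation (6): time to move data set ds_k over the link dc_i -> dc_j."""
    return dsize_k / band_ij

def total_transfer_time(transfers, dsize, band):
    """T_total: sum of transfer times over every cross-data-center move.
    transfers: iterable of (i, j, k) triples; same-center moves cost nothing."""
    return sum(transfer_time(dsize[k], band[i][j])
               for i, j, k in transfers if i != j)
```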
3. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 2, characterized in that: the data set in step 1-1 is ds_k = <dsize_k, gt_k, lc_k, flc_k>, where dsize_k is the data set size, gt_k represents the task generating data set ds_k, lc_k represents the storage location of data set ds_k, and flc_k represents the final layout position of data set ds_k; gt_k and lc_k are respectively defined as follows:
gt_k = null,          if ds_k ∈ DS_ini
gt_k = Task(ds_k),    if ds_k ∈ DS_gen

lc_k = fix(ds_k),     if ds_k ∈ DS_fix
lc_k = arbitrary,     if ds_k ∈ DS_flex
wherein DS_ini represents the initial data sets, DS_gen the generated data sets, DS_fix the private data sets that must be stored at fixed secure locations, and DS_flex the non-private data sets that may be stored anywhere; the private data sets DS_fix are stored only in the private cloud data centers DC_pri; Task(ds_k) represents the task generating data set ds_k, and fix(ds_k) denotes the number of the private cloud data center designated to store the private data set.
4. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that the specific steps of step 2 are as follows:
Step 2-1: recording the out-degree and in-degree of all tasks and data sets of the scientific workflow G;
Step 2-2: searching for a 'one-way data cut edge' e_ij; a 'one-way data cut edge' refers to two data sets ds_i and ds_j such that the out-degree of ds_i is 1, the in-degree of ds_j is 1, and only one related task exists between the two data sets;
Step 2-3: when a 'one-way data cut edge' e_ij exists and ds_i and ds_j are not both private data, deleting e_ij, merging ds_i and ds_j into a new data set ds_k, and returning to step 2-2; ending when no 'one-way data cut edge' exists.
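The merge loop of steps 2-1 to 2-3 can be sketched on a toy data set graph; `out_deg`, `in_deg`, `edge_tasks`, and `private` are hypothetical structures recording the degrees, the tasks relating each pair of data sets, and the private data sets (degree bookkeeping after a merge is omitted for brevity):

```python
def merge_one_way_cut_edges(sizes, out_deg, in_deg, edge_tasks, private):
    """Repeatedly merge ds_i and ds_j along a 'one-way data cut edge' e_ij:
    out-degree of ds_i is 1, in-degree of ds_j is 1, exactly one related task,
    and ds_i, ds_j are not both private."""
    merged = dict(sizes)
    changed = True
    while changed:                                    # step 2-3 loops back to step 2-2
        changed = False
        for (i, j), tasks in list(edge_tasks.items()):
            if (i in merged and j in merged
                    and out_deg[i] == 1 and in_deg[j] == 1
                    and len(tasks) == 1
                    and not (i in private and j in private)):
                merged[i] = merged.pop(j) + merged[i]   # fold ds_j into ds_i as a new data set
                del edge_tasks[(i, j)]                  # delete the cut edge e_ij
                changed = True
                break
    return merged
```

The sketch terminates when no candidate edge remains, mirroring the claim's stopping condition.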
5. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that: in step 3, the adjustment mechanism of the inertia weight factor w performs adaptive adjustment according to the degree of difference between the current particle and the global optimal particle:
div(X^{t-1}, gBest^{t-1}) = |{ i : x_i^{t-1} ≠ gBest_i^{t-1} }|

w = w_min + (w_max − w_min) · div(X^{t-1}, gBest^{t-1}) / n,  where n is the number of quantiles of a particle
wherein w_max and w_min respectively represent the upper and lower limits of the value range of w, and div(X^{t-1}, gBest^{t-1}) denotes the number of quantiles at which the current particle position X^{t-1} and the global optimal particle gBest^{t-1} take different values.
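A sketch of the adaptive adjustment; since the claim only defines div in words, the linear scaling of w by the fraction of differing quantiles is an assumption made explicit here:

```python
def div(x, gbest):
    """Number of quantiles where the particle and the global best differ."""
    return sum(1 for a, b in zip(x, gbest) if a != b)

def inertia_weight(x, gbest, w_min=0.4, w_max=0.9):
    """Assumed linear mapping: w grows as the particle drifts away from gBest,
    widening the search; w shrinks as it converges, refining locally."""
    return w_min + (w_max - w_min) * div(x, gbest) / len(x)
```

The default w_min/w_max bounds are illustrative, not values taken from the patent.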
6. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that the fitness of the particles in step 6 is calculated as follows:
when the two encoded particles are particles of the same type, the encoded particle with the shorter cross-data-center transmission time is selected, and the fitness function is defined as follows:
fitness(X_i) = T_total(X_i)
when the two encoded particles are a combination of different types, i.e. a feasible solution particle and an infeasible solution particle, the fitness function is defined as follows:
fitness(X_i) = T_total(X_i),  if Σ_j u_ij · dsize_j ≤ capacity_i for every dc_i ∈ DC
fitness(X_i) = +∞,            otherwise
wherein capacity_i represents the storage capacity of data center dc_i, and u_ij ∈ {0,1} indicates whether data set ds_j is stored in data center dc_i; if so, u_ij is 1, otherwise it is 0.
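A sketch of the comparison rule: feasible particles always beat infeasible ones, and ties of type are broken by T_total. Representing infeasibility as an infinite penalty is an assumption of this sketch; `placement[k]` plays the role of the flattened u_ij indicator:

```python
def is_feasible(placement, sizes, capacity):
    """Feasible: for every data center dc_i, the stored data fits capacity_i.
    placement[k] = the data center holding data set k."""
    load = {}
    for k, dc in placement.items():
        load[dc] = load.get(dc, 0) + sizes[k]
    return all(load.get(dc, 0) <= cap for dc, cap in capacity.items())

def fitness(placement, sizes, capacity, t_total):
    """Penalize infeasible particles past any feasible transmission time."""
    penalty = 0 if is_feasible(placement, sizes, capacity) else float("inf")
    return t_total(placement) + penalty
```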
7. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1, characterized in that the update formula for particle i in step 7 is as follows:
X_i^t = c_2 ⊗ Cg( c_1 ⊗ Cp( w ⊗ Mu(X_i^{t-1}), pBest_i^{t-1} ), gBest^{t-1} )
wherein c_1 and c_2 respectively represent the individual cognition factor and the global cognition factor of the particle, i.e. the degree to which the particle learns from other individuals and from the population-optimal individual; Cg() and Cp() represent crossover operators of the genetic algorithm, and Mu() a mutation operator of the genetic algorithm; pBest_i^{t-1} and gBest^{t-1} respectively represent the individual optimal position of the particle after t-1 iterations and the global optimal position of the population; X_i^t represents the position of particle i at time t, and X_i^{t-1} the position of particle i at time t-1.
8. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 7, characterized in that: the particle update formula is decomposed into three core parts, inertia cognition, individual cognition, and social cognition; then:
(1) Combining the standard PSO algorithm with the mutation operation of the genetic algorithm, the inertia part A_i^t of particle i at time t is given by:

A_i^t = Mu(X_i^{t-1}),  if r_3 < w
A_i^t = X_i^{t-1},      otherwise
wherein r_3 is a random factor with value range (0, 1); w is the inertia weight factor, used to adjust the search capability of the particles over the solution space; Mu() randomly selects a quantile of the encoded particle under supervision and randomly mutates its value, the mutated value satisfying the corresponding value range; X_i^{t-1} represents the position of particle i at time t-1.
(2) Combining the standard PSO algorithm with the crossover operation of the genetic algorithm, the individual cognition part B_i^t and the global cognition part X_i^t of particle i at time t are respectively given by:

B_i^t = Cp(A_i^t, pBest_i^{t-1}),  if r_1 < c_1
B_i^t = A_i^t,                     otherwise

X_i^t = Cg(B_i^t, gBest^{t-1}),  if r_2 < c_2
X_i^t = B_i^t,                   otherwise
wherein c_1 is the individual cognition factor and c_2 the global cognition factor; pBest_i^{t-1} and gBest^{t-1} respectively represent the individual optimal position of the particle after t-1 iterations and the global optimal position of the population; Cp() and Cg() are crossover operators of the genetic algorithm: each randomly selects two quantiles of a particle and crosses the values at those quantiles with pBest_i^{t-1} or gBest^{t-1}, respectively; r_1 and r_2 are random variables with value range [0, 1], used to enhance randomness in the iterative search process.
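The three cognition parts compose into one particle update. A sketch with deliberately simplified operators: Mu() mutates a single random quantile, and Cp()/Cg() copy a random segment from pBest/gBest rather than exchanging two individual quantiles as the claim describes:

```python
import random

def mutate(x, n_dc):
    """Mu(): pick a random quantile and assign it a random valid data center number."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] = random.randrange(n_dc)
    return y

def crossover(x, guide):
    """Cp()/Cg() stand-in: copy a random contiguous segment from the guide particle."""
    a, b = sorted(random.sample(range(len(x)), 2))
    return x[:a] + guide[a:b + 1] + x[b + 1:]

def update_particle(x, pbest, gbest, w, c1, c2, n_dc):
    a = mutate(x, n_dc) if random.random() < w else list(x)    # inertia cognition
    b = crossover(a, pbest) if random.random() < c1 else a     # individual cognition
    return crossover(b, gbest) if random.random() < c2 else b  # social cognition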
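The three cognition parts compose into one particle update. A sketch with deliberately simplified operators: Mu() mutates a single random quantile, and Cp()/Cg() copy a random segment from pBest/gBest rather than exchanging two individual quantiles as the claim describes:

```python
import random

def mutate(x, n_dc):
    """Mu(): pick a random quantile and assign it a random valid data center number."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] = random.randrange(n_dc)
    return y

def crossover(x, guide):
    """Cp()/Cg() stand-in: copy a random contiguous segment from the guide particle."""
    a, b = sorted(random.sample(range(len(x)), 2))
    return x[:a] + guide[a:b + 1] + x[b + 1:]

def update_particle(x, pbest, gbest, w, c1, c2, n_dc):
    a = mutate(x, n_dc) if random.random() < w else list(x)    # inertia cognition
    b = crossover(a, pbest) if random.random() < c1 else a     # individual cognition
    return crossover(b, gbest) if random.random() < c2 else b  # social cognition
```

With w = c1 = c2 = 0 the particle is returned unchanged, matching the "otherwise" branches of the decomposition.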
9. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1 or 7, characterized in that: the individual cognition factor c_1 and the global cognition factor c_2 are set in a linearly increasing and decreasing manner; equations (21) and (22) give the update mechanisms of c_1 and c_2 respectively:
Figure FDA0003022072370000049
Figure FDA00030220723700000410
wherein c_1^ini and c_1^fin are respectively the set initial and final values of the individual cognition factor c_1, c_2^ini and c_2^fin are respectively the set initial and final values of the global cognition factor c_2, iters_cur represents the current iteration number, and iters_max the maximum iteration number set at initialization.
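The linear schedules of equations (21) and (22) reduce to one interpolation helper; the concrete initial/final values below are illustrative, not taken from the patent:

```python
def linear_factor(start, end, iters_cur, iters_max):
    """Equations (21)/(22): linearly interpolate a cognition factor."""
    return start + (end - start) * iters_cur / iters_max

# Illustrative schedules: c1 decreasing (rely less on own history over time),
# c2 increasing (learn more from gBest over time).
c1 = lambda t, T: linear_factor(2.5, 0.5, t, T)
c2 = lambda t, T: linear_factor(0.5, 2.5, t, T)
```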
10. The time delay optimization-oriented scientific workflow data layout method in a hybrid cloud environment according to claim 1 or 7, characterized in that step 1 includes the following two cases:
Case 1: if the encoded particle is a feasible solution particle, the selected quantiles do not include the quantiles where the private data sets are located;
Case 2: if the encoded particle is an infeasible solution particle, the selected quantile is the quantile corresponding to the encoding of an overloaded data center.
CN201810700970.0A 2018-08-24 2018-08-24 Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment Active CN108989098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810700970.0A CN108989098B (en) 2018-08-24 2018-08-24 Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment


Publications (2)

Publication Number Publication Date
CN108989098A CN108989098A (en) 2018-12-11
CN108989098B true CN108989098B (en) 2021-06-01

Family

ID=64539632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810700970.0A Active CN108989098B (en) 2018-08-24 2018-08-24 Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment

Country Status (1)

Country Link
CN (1) CN108989098B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033076B (en) * 2019-04-19 2022-08-05 福州大学 Workflow data layout method for cost optimization in mixed cloud environment
CN113411369B (en) * 2020-03-26 2022-05-31 山东管理学院 Cloud service resource collaborative optimization scheduling method, system, medium and equipment
CN111209091B (en) * 2020-04-22 2020-07-21 南京南软科技有限公司 Scheduling method of Spark task containing private data in mixed cloud environment
CN112256926B (en) * 2020-10-21 2022-10-04 西安电子科技大学 Method for storing scientific workflow data set in cloud environment
CN112492032B (en) * 2020-11-30 2022-09-23 杭州电子科技大学 Workflow cooperative scheduling method under mobile edge environment
CN112579987B (en) * 2020-12-04 2022-09-13 河南大学 Migration deployment method and operation identity verification method of remote sensing program in hybrid cloud
CN112632615B (en) * 2020-12-30 2023-10-31 福州大学 Scientific workflow data layout method based on hybrid cloud environment
CN116955354A (en) * 2023-06-30 2023-10-27 国家电网有限公司大数据中心 Identification analysis method and device for energy digital networking

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102567851A (en) * 2011-12-29 2012-07-11 武汉理工大学 Safely-sensed scientific workflow data layout method under cloud computing environment
CN105554873A (en) * 2015-11-10 2016-05-04 胡燕祝 Wireless sensor network positioning algorithm based on PSO-GA-RBF-HOP
CN108170529A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center load predicting method based on shot and long term memory network
CN108182109A (en) * 2017-12-28 2018-06-19 福州大学 Workflow schedule and data distributing method under a kind of cloud environment

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN104461728B (en) * 2013-09-18 2019-06-14 Sap欧洲公司 Computer system, medium and the method for migration event management and running


Non-Patent Citations (1)

Title
A data placement strategy for scientific workflow in hybrid cloud; Zhanghui Liu; IEEE; 2018-07-07; full text *


Similar Documents

Publication Publication Date Title
CN108989098B (en) Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment
Karthikeyan et al. A hybrid discrete firefly algorithm for solving multi-objective flexible job shop scheduling problems
Trivedi et al. Hybridizing genetic algorithm with differential evolution for solving the unit commitment scheduling problem
CN110033076B (en) Workflow data layout method for cost optimization in mixed cloud environment
Senouci et al. Use of genetic algorithms in resource scheduling of construction projects
Prayogo et al. Optimization model for construction project resource leveling using a novel modified symbiotic organisms search
Parveen et al. Review on job-shop and flow-shop scheduling using multi criteria decision making
Yan et al. A hybrid metaheuristic algorithm for the multi-objective location-routing problem in the early post-disaster stage.
CN110809275B (en) Micro cloud node placement method based on wireless metropolitan area network
Xu et al. Towards heuristic web services composition using immune algorithm
Fan et al. DNN deployment, task offloading, and resource allocation for joint task inference in IIoT
CN116050540B (en) Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling
Kechmane et al. A hybrid particle swarm optimization algorithm for the capacitated location routing problem
WO2022216490A1 (en) Intelligent scheduling using a prediction model
CN111885551B (en) Selection and allocation mechanism of high-influence users in multi-mobile social network based on edge cloud collaborative mode
Wen et al. A multi-objective optimization method for emergency medical resources allocation
Zaman et al. Evolutionary algorithm for project scheduling under irregular resource changes
CN116128247A (en) Resource allocation optimization method and system for production equipment before scheduling
CN113821323B (en) Offline job task scheduling algorithm for mixed deployment data center scene
Wang et al. Multiobjective optimization algorithm with objective-wise learning for continuous multiobjective problems
CN113220437B (en) Workflow multi-target scheduling method and device
CN112632615B (en) Scientific workflow data layout method based on hybrid cloud environment
Zhang et al. A bi-level fuzzy random model for multi-mode resource-constrained project scheduling problem of photovoltaic power plant
CN113642808A (en) Dynamic scheduling method for cloud manufacturing resource change
CN116260730B (en) Geographic information service evolution particle swarm optimization method in multi-edge computing node

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant