CN112181623B - Cross-cloud remote sensing application program scheduling method and application - Google Patents

Cross-cloud remote sensing application program scheduling method and application

Info

Publication number
CN112181623B
Authority
CN
China
Prior art keywords
data
cloud
remote sensing
dtg
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011060026.7A
Other languages
Chinese (zh)
Other versions
CN112181623A (en)
Inventor
黄震春
甘霖
赵文来
刘英博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202011060026.7A
Publication of CN112181623A
Application granted
Publication of CN112181623B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Abstract

A computer-implemented cross-cloud remote sensing application scheduling method comprises the following steps: converting the cross-cloud remote sensing application into a data transformation graph (DTG) defined by the four-tuple <D, D_T, T, E>, where D is the universal set of data items that have been used, D_T is a subset of D containing the solution results of the scientific problem, T is a set of data conversion elements, and E is the set of dependency relationships between data items and data conversion elements; and determining an allocation scheme of the data nodes and data conversion nodes among different cloud data centers, where the execution time of the application is optimized by modifying the DTG or by modifying the allocation of the data nodes and data conversion nodes in the DTG among the different cloud data centers. The DTG model, based on a directed acyclic graph, presents the dependency and conversion relationships among data items, so that the remote sensing analysis model can be modeled and optimized accordingly, helping the remote sensing analysis application achieve substantial optimization of execution time and cost.

Description

Cross-cloud remote sensing application program scheduling method and application
Technical Field
The present invention generally relates to cross-cloud remote sensing application scheduling techniques.
Background
A remote sensing application is a typical scientific workflow application: it obtains valuable data processing results by analyzing data collected from remote sensing satellites, comprises a number of data analysis tasks with sequential dependency relationships, and is accompanied by massive amounts of data to be processed. Cloud computing, as a way of providing computing resources to users on a pay-as-you-go basis, is well suited to executing scientific applications whose computing needs cannot be met by local resources. To make full use of computing resources distributed across different cloud data centers, extensive research has been conducted on cross-cloud environments, including federated clouds and multi-clouds. Task scheduling is the process of reasonably distributing the various tasks to different cloud computing resources. The performance of the scheduling algorithm directly determines the performance of the application, its economic cost, and other quality-of-service (QoS) indicators, and QoS is an indicator that a cloud provider must meet in order to satisfy its service level agreement (SLA), so scheduling optimization is very important. A commonly used scientific workflow model in a cloud environment is the directed acyclic graph (DAG), in which vertices represent tasks and edges represent the data transfer relationship between two tasks. At present, with the continuous growth of remote sensing data volume and problem scale, higher requirements are placed on the performance and scalability of remote sensing analysis models. However, the DAG model does not intuitively display the dependency and conversion relationships between data items and therefore cannot support the corresponding analysis and optimization needed to improve the performance of the analysis model; meanwhile, research on scientific workflow scheduling in cross-cloud environments is insufficient for remote sensing applications, and QoS parameter optimization is inadequate.
Disclosure of Invention
The present invention has been made in view of the above circumstances.
According to one aspect of the invention, a computer-implemented cross-cloud remote sensing application scheduling method is provided, comprising the following steps: converting the cross-cloud remote sensing application into a data transformation graph (DTG), where the DTG is defined by the four-tuple <D, D_T, T, E>, D is the universal set of data items that have been used, D_T is a subset of D containing the solution results of the scientific problem, T is a set of data conversion elements that consume input data items and generate output data items, and E is the set of dependency relationships between data items and data conversion elements, represented as follows:
Application = <D, D_T, T, E>
D_T = {d_i | d_i is one of the result data items}
D = {d_i | d_i is one of the used data items}
T = {t_i | t_i is one of the data conversion nodes}
E = {d_i → t_j | d_i is one of the input data items of t_j}
  ∪ {t_j → d_i | d_i is one of the output data items of t_j}
dep = {d_i → d_j | d_j depends on d_i};
and determining an allocation scheme of the data nodes and data conversion nodes among the different cloud data centers, wherein the execution time of the application is optimized by modifying the data transformation graph DTG or by modifying the allocation of the data nodes and data conversion nodes in the DTG among the different cloud data centers.
Optionally, reducing the execution time of the application may include: reducing the execution time of the application by "short-circuiting" unnecessary data conversion nodes in the data transformation graph DTG.
Optionally, reducing the execution time of the application may include: adjusting the data volume of the data nodes while ensuring that the operation result remains correct.
Optionally, adjusting the data volume of the data nodes while ensuring that the operation result remains correct includes: when the remote sensing data are transmitted, transmitting only the bands marked as required by the analysis model corresponding to the data conversion node.
Optionally, reducing the execution time of the application may include: storing mutually correlated data in the data transformation graph DTG in the same data center in the cloud.
Optionally, in the remote sensing analysis application, the data are packed by divided region, year, and date, and the data in the same package are stored in the same data node.
Optionally, reducing the execution time of the application may include: selecting high-performance computing nodes in the same data center for data conversion.
Optionally, reducing the execution time of the application includes: storing intermediate data in the virtual machine main memory instead of on the hard disk, and transmitting intermediate data between virtual machine main memories in real time.
Optionally, optimizing the execution time of the application by modifying the data transformation graph DTG or by modifying the allocation of the data nodes and data conversion nodes in the DTG among the different cloud data centers includes: determining, using a genetic algorithm, the allocation of the data nodes and data conversion nodes to different data centers.
Optionally, determining, using a genetic algorithm, the allocation of the data nodes and data conversion nodes to different data centers includes: initializing sub-populations and iteration information, wherein each node in the data transformation graph DTG workflow is represented as a subtask, the subtasks are randomly mapped to cross-cloud providers, and the mapping is binary-encoded and represented as a chromosome; in the genetic algorithm, a gene represents a subtask and an allele represents a cloud provider; evaluating fitness, namely computing, subject to the constraints, the maximum length of all paths between the data nodes corresponding to an individual as its fitness; obtaining the next-generation population through mating, crossover, mutation, selection, and replication; and determining whether a termination condition is reached, terminating if so, and otherwise returning to the fitness evaluation to continue iterating.
Optionally, in the cross-cloud remote sensing application scheduling method, the remote sensing analysis workflows belonging to the same region and year are treated as one subtask.
Optionally, the constraints include user-specified quality-of-service constraints and a distance constraint between the provider location and the data storage location of the computing task.
Optionally, the execution time of the application is the maximum of the lengths of all paths between data nodes, where the length of a path p between two data nodes is the sum of the execution times of the data conversion nodes on the path p connecting the two data nodes.
According to another aspect of the present invention, there is provided a computing device as a local cloud provider, comprising: a processor; and a memory having stored thereon a computer program that, when executed by the processor, performs the cross-cloud remote sensing application scheduling method of any of the above.
According to another aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the cross-cloud remote sensing application scheduling method of any one of the above.
A computing device serving as a local cloud provider according to any of the above comprises: a processor; and a memory having stored thereon a computer program which, when executed by the processor, can perform the following steps: receiving the computing requirements of a user's application and converting the scientific workflow into a data transformation graph; parsing the scientific workflow and translating the execution requirements of the program required by the user, including the content to be executed, a description of the task inputs including remote data files, output information about the tasks, the expected quality-of-service parameters, and the service requirements; monitoring and evaluating the network environment, maintaining the state of cloud services by periodically checking the availability of known cloud services and discovering newly available services, while monitoring the execution state of jobs so that job results can be returned to the user when a job completes; matching workflow tasks to cloud providers hierarchically, computing a set of mappings for the application, and determining the allocation scheme of the data nodes and data conversion nodes among the different cloud data centers, wherein the execution time of the application is optimized by modifying the data transformation graph DTG or by modifying the allocation of the data nodes and data conversion nodes among the different cloud data centers; and executing each mapping in the computed set of mappings until the tasks are successfully assigned to cloud providers.
The cross-cloud remote sensing application scheduling method and its application provided by the embodiments of the invention exploit the characteristics of remote sensing analysis applications: a large data volume, a large number of independent sub-scientific workflows, and a per-sub-workflow computation amount that can be supported by a single virtual machine. On this basis, a data transformation graph (DTG) model based on a directed acyclic graph is proposed that presents the dependency and conversion relationships among data items, so that the remote sensing analysis model can be modeled and optimized accordingly, helping the remote sensing analysis application achieve substantial optimization of execution time and cost.
Based on the directed-acyclic-graph DTG model, the embodiments of the invention design a parallel genetic scientific workflow scheduling algorithm for cross-cloud environments. By optimizing the data and their transmission, the data transfer time and cost during scientific workflow execution are reduced; through parallelization and genetic-algorithm optimization, the scientific workflow obtains a reasonable resource allocation, so that the makespan and cost of the application are smaller.
Drawings
Fig. 1 shows an overall flowchart of a computer-implemented cross-cloud remote sensing application scheduling method according to an embodiment of the present invention.
Fig. 2 shows an overall flowchart of a method 200 for implementing DTG model-based cloud remote sensing application scheduling by a genetic algorithm according to an embodiment of the present invention.
FIG. 3 illustrates an exemplary overall cross-cloud computing environment oriented scientific workflow scheduling diagram according to an embodiment of the present invention.
FIG. 4 illustrates a scientific workflow DAG graph corresponding to a remote sensing analysis application, according to an embodiment of the invention.
FIG. 5 illustrates the result of converting the DAG graph of the remote sensing analysis application into a DTG graph, according to an embodiment of the invention.
FIG. 6 shows a refined DTG-graph representation of the workflow execution for the first day of the first year of the first region of the remote sensing analysis scientific workflow, according to an embodiment of the invention.
Fig. 7 illustrates pseudo code of a parallel scientific workflow scheduling method based on a genetic algorithm in a cross-cloud environment according to an embodiment of the present invention.
Detailed Description
The invention is described below with reference to specific embodiments.
Fig. 1 shows an overall flowchart of a computer-implemented cross-cloud remote sensing application scheduling method according to an embodiment of the present invention.
As shown in fig. 1, in step S110, the cross-cloud remote sensing application is converted into a data transformation graph DTG, where the DTG is defined by the four-tuple <D, D_T, T, E>: D is the universal set of data items that have been used, D_T is a subset of D containing the solution results of the scientific problem, T is a set of data conversion elements that consume input data items and generate output data items, and E is the set of dependency relationships between data items and data conversion elements, represented as follows:
Application = <D, D_T, T, E>
D_T = {d_i | d_i is one of the result data items}
D = {d_i | d_i is one of the used data items}
T = {t_i | t_i is one of the data conversion nodes}
E = {d_i → t_j | d_i is one of the input data items of t_j}
  ∪ {t_j → d_i | d_i is one of the output data items of t_j}
dep = {d_i → d_j | d_j depends on d_i};
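As an illustration only, the four-tuple above can be carried directly as a data structure. The following is a minimal Python sketch of the DTG, in which the class name, field names, and the dep() helper are illustrative choices of this sketch rather than names defined in the patent:

    from dataclasses import dataclass, field

    @dataclass
    class DTG:
        data_items: set = field(default_factory=set)    # D: all data items that have been used
        result_items: set = field(default_factory=set)  # D_T: result data items (subset of D)
        conversions: set = field(default_factory=set)   # T: data conversion nodes
        edges: set = field(default_factory=set)         # E: (d_i, t_j) inputs and (t_j, d_i) outputs

        def dep(self):
            """Derive dep = {(d_i, d_j) | d_j depends on d_i} from the edges E."""
            inputs, outputs = {}, {}
            for src, dst in self.edges:
                if dst in self.conversions:
                    inputs.setdefault(dst, set()).add(src)    # d_i -> t_j
                elif src in self.conversions:
                    outputs.setdefault(src, set()).add(dst)   # t_j -> d_i
            return {(d_i, d_j)
                    for t in self.conversions
                    for d_i in inputs.get(t, set())
                    for d_j in outputs.get(t, set())}

For example, the Tndwi step of the workflow described later would appear as a conversion node in T whose input edge comes from the MODIS band data item and whose output edge goes to the NDWI data item.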
In the data transformation graph model, the application is represented as a graph model that describes "what the result is" rather than "how the result is obtained".
The DTG associated with an application has the following properties:
Property 1: an application will terminate within finite time when its DTG has the following characteristics:
the DTG is a connected graph with a finite number of vertices;
all T nodes in the DTG can complete within a finite time;
the DTG is a directed acyclic graph.
Property 2: suppose f(t_i) is the execution time of data conversion node t_i, and let the length of a path p between data vertices be LP = Σ f(t_i), where the t_i are the data conversion nodes on the path p connecting the data nodes. If p_1, p_2, …, p_n are all the paths connecting data nodes d_i and d_j, the distance between d_i and d_j is defined as g(d_i, d_j) = max(LP). The execution time of the application is max({g | g = g(d_i, d_j), d_i ∈ D_O, d_j ∈ D_T}), i.e., the maximum length of all paths between data nodes in the DTG.
In step S120, an allocation scheme of each data node and data conversion node between different data centers of the cloud is determined, wherein the execution time of the application program is optimized by modifying the data conversion graph DTG or modifying the allocation scheme of each data node and data conversion node in the DTG between different data centers of the cloud.
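Property 2 above can be evaluated directly on a DTG. The following is a minimal sketch, assuming the graph is a directed acyclic graph (Property 1) encoded as a successor dictionary; the function and variable names are illustrative, not from the patent:

    from functools import lru_cache

    def application_execution_time(successors, exec_time):
        """successors: dict mapping each node to a list of its successor nodes.
        exec_time: dict mapping each data conversion node t_i to f(t_i); data nodes cost 0.
        Returns max(LP), the maximum path length between data nodes in the DTG."""
        @lru_cache(maxsize=None)
        def longest_from(node):
            cost = exec_time.get(node, 0)              # data nodes contribute nothing
            succ = successors.get(node, [])
            return cost + (max(longest_from(s) for s in succ) if succ else 0)
        return max(longest_from(n) for n in successors)

    # Example: d0 -> t1 -> d1 -> t2 -> d2 with f(t1) = 5 and f(t2) = 3 gives max(LP) = 8.
    g = {"d0": ["t1"], "t1": ["d1"], "d1": ["t2"], "t2": ["d2"], "d2": []}
    print(application_execution_time(g, {"t1": 5, "t2": 3}))  # prints 8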
The above Property 2 of an application and its associated DTG provides a way to optimize the running speed (time) of the remote sensing application using the DTG: max(LP) is reduced by modifying the DTG or by modifying the allocation of the data nodes and data conversion nodes in the DTG among the different cloud data centers, thereby optimizing the running speed of the remote sensing application.
According to embodiments of the invention, the following schemes are provided for optimizing the execution time of the application by modifying the data transformation graph DTG or by modifying the allocation of the data nodes and data conversion nodes in the DTG among the different cloud data centers:
Optimization scheme 1: max(LP) is reduced by "short-circuiting" some unnecessary data conversion nodes in the DTG. For example, some remote sensing scientific workflows contain fragments in which intermediate data are written to a file in one step and then read back from that file in the next step. Short-circuiting these fragments reduces max(LP) and increases the running speed of the remote sensing application (a code sketch of this transformation is given after scheme 2 below).
Optimization scheme 2: the execution time of a data conversion node is typically related to the data volume of its predecessor data nodes, especially for data conversion nodes whose main task is transmission. In the DTG, max(LP) can therefore be reduced by adjusting the data volume of the data nodes while ensuring that the operation result remains correct. For example, when remote sensing data are transmitted, only the bands marked as "required" by the analysis model corresponding to the data conversion node are transmitted, which automatically reduces the running time of some critical data conversion nodes in the DTG.
In addition, the allocation of the data nodes and data conversion nodes to different cloud data centers changes the running times of the data conversion nodes in the DTG, and therefore the execution time max(LP) of the application. For example:
Optimization scheme 3: store mutually correlated data in the DTG in the same cloud data center, which reduces data transmission and facilitates parallel processing.
Optimization scheme 4: for data conversion, select high-performance computing nodes in the same data center whenever possible, so that the workflow obtains wider bandwidth and higher processing performance while keeping data migration low; in a parallel processing environment this greatly reduces both the data transfer time and the data conversion time.
In fact, the optimal cloud remote sensing application scheduling scheme can be obtained by exhaustively distributing the data nodes and the data conversion nodes to different data centers. However, the algorithmic complexity of such exhaustive scheduling is clearly exponential and not practical.
Therefore, in the following preferred embodiment, genetic algorithm is introduced to realize cloud remote sensing application scheduling based on the DTG model.
Fig. 2 shows an overall flowchart of a method 200 for implementing DTG model-based cloud remote sensing application scheduling by a genetic algorithm according to an embodiment of the present invention.
In step S210, sub-populations and iteration information are initialized, where each node (data node and data conversion node) in the DTG workflow of the data conversion graph is represented as a sub-task, the sub-tasks and cloud-crossing providers are mapped randomly, binary coding is adopted, and are represented as chromosomes, and in the genetic algorithm, genes represent sub-tasks, and alleles represent cloud providers.
In other words, the mapping relationship between a task and a cloud provider constitutes an individual, and the set of all individuals constitutes chromosome c (solution), i.e., a sub-population; the number of individuals of the sub-population can be set as s, and the maximum iteration number n of the genetic algorithm is set, wherein n is a positive integer greater than or equal to 1.
In one example, a subtask is set to execute on a user-defined virtual machine (stored on the virtual machine for the data node), and thus can also be represented as a mapping between the virtual machine and the cloud provider.
In one example, the remote sensing analysis workflows belonging to the same region and year are treated as one subtask, where the remote sensing analysis workflow belonging to the same region and year resides on one data node.
In step S220, fitness evaluation is performed, specifically, the maximum length of all paths between the data nodes corresponding to the individual is calculated as its fitness based on the constraint condition.
The fitness of an individual is evaluated according to the modeled fitness function to determine its chance of being inherited. The fitness function is designed to evaluate whether a mapping satisfies the user-specified quality-of-service constraints within a certain range, so the user constraint attributes must be compared with the actual attribute values of the cloud provider. In addition, the different weights of the quality-of-service attributes also have a large influence; in a cross-cloud environment, large data transfers can seriously degrade scheduling performance, so the distance between the provider location and the data storage location of a computing task should be as small as possible, and the customized distance constraint can be given the largest weight.
An example of a fitness function is given below.
1) The variables are defined as follows: the number of user-specified quality-of-service constraints is s, defined as Q = {Q_1, …, Q_s}; the corresponding attribute values of the actual cloud provider are B = {B_1, …, B_s}; the user-defined weight of each quality-of-service attribute can be computed as w_j = Q_j / |Q|; a parameter dis = k (k a constant) is added to Q as a distance constraint, and the distance from the actual cloud provider to the location of the initial data node is added to B.
2) Calculation formulas: to check the consistency of the user constraints with the actual values of the cloud provider, a set of inequality constraints QC = {QC_1, …, QC_s, QC_dis} is constructed. For attributes where larger values are better, such as RAM and storage, QC_j = |Q_j| - |B_j| ≤ 0; for attributes where smaller values are better, such as storage cost and distance, QC_j = |B_j| - |Q_j| ≤ 0. The difference between the allele attributes of a gene g and the user constraints is defined as D_g [the defining equation is given as an image in the original publication]. The value of each item of D_g is calculated from the inequality formulas introduced by QC, with the set range -1 ≤ D_g ≤ 1, i.e., the value of each item D_g^j lies between -1 and 1. If the condition on D_g^j [given as an image in the original publication] holds, the value j is considered to satisfy the constraints of gene g. The fitness function can be modeled as a maximization problem, so the fitness function of gene g is defined as L = max(LP), i.e., the maximum length of all paths between data nodes in the DTG.
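The exact formulas for D_g and for the per-attribute satisfaction test are given only as images in the original publication, so the following Python sketch substitutes one plausible normalization as a stated assumption; the attribute set, the normalization, and the function names are illustrative, not the patent's own:

    LARGER_IS_BETTER = {"ram", "core", "mips", "storage", "inter-bandwidth", "intra-bandwidth"}

    def constraint_gaps(user_q, provider_b):
        """QC_j = |Q_j| - |B_j| for larger-is-better attributes, |B_j| - |Q_j| otherwise,
        scaled here (as an assumption) into D_g^j in [-1, 1]."""
        gaps = {}
        for attr, q in user_q.items():
            b = provider_b[attr]
            qc = abs(q) - abs(b) if attr in LARGER_IS_BETTER else abs(b) - abs(q)
            denom = max(abs(q), abs(b)) or 1.0
            gaps[attr] = max(-1.0, min(1.0, qc / denom))
        return gaps

    def satisfies_constraints(user_q, provider_b):
        # A mapping of one gene (subtask) to one allele (provider) is accepted
        # here when every normalized gap is non-positive.
        return all(g <= 0 for g in constraint_gaps(user_q, provider_b).values())

    # dis = k is carried in Q and the actual provider-to-data distance in B (made-up values).
    q = {"ram": 16, "storage": 500, "storage_cost": 0.05, "dis": 100}
    b = {"ram": 32, "storage": 800, "storage_cost": 0.03, "dis": 40}
    print(satisfies_constraints(q, b))  # prints True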
In step S230, a next generation population is obtained based on mating, crossing, mutation, selection, and replication.
For example, the mating method is selective mating: individuals with higher fitness are retained in the next-generation chromosome with higher probability.
For crossover, for example, single-point crossover is used: individuals are randomly paired, a crossover point is randomly set for each pair and parts of the chromosomes are exchanged; a rate R is set, and s × R single-point crossovers are performed per generation.
For mutation, for example, random mutation is used: each gene of each individual is mutated with probability P.
For selection and replication, for example, the fraction E of genes with the highest fitness is retained, and the remainder of the population is filled by cloning the best individuals until the population size is reached.
In step S240, it is determined whether a termination condition is reached, and if so, termination is performed, otherwise, the iteration is continued back to the fitness evaluation S220.
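A condensed sketch of the loop of steps S210 to S240, assuming a chromosome encoded as a list of provider indices (one per subtask) and a user-supplied fitness function such as the constrained max(LP) evaluation described above. The parameter values and helper names are illustrative assumptions:

    import random

    def genetic_schedule(num_tasks, num_providers, fitness, s=40, n=100,
                         crossover_rate=0.6, mutation_prob=0.02, elite_frac=0.1):
        """fitness(chromosome) -> score to maximize; chromosome[i] is the provider of subtask i."""
        pop = [[random.randrange(num_providers) for _ in range(num_tasks)] for _ in range(s)]
        for _ in range(n):                                   # terminate after n generations
            pop.sort(key=fitness, reverse=True)
            children = pop[:max(1, int(elite_frac * s))]     # keep the elite individuals
            while len(children) < s:
                p1, p2 = random.choices(pop[: s // 2], k=2)  # fitter half mates more often
                child = p1[:]
                if random.random() < crossover_rate:         # single-point crossover
                    point = random.randrange(1, num_tasks)
                    child = p1[:point] + p2[point:]
                for i in range(num_tasks):                   # random mutation
                    if random.random() < mutation_prob:
                        child[i] = random.randrange(num_providers)
                children.append(child)
            pop = children
        return max(pop, key=fitness)

In the patent's setting, the fitness would reject mappings that violate the quality-of-service or distance constraints and otherwise score the mapping by L = max(LP) as defined in step S220.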
For ease of understanding, FIG. 3 illustrates an exemplary overall cross-cloud computing environment oriented scientific workflow scheduling diagram according to embodiments of the present invention.
As shown in FIG. 3, the cross-cloud environment structure includes:
a user submission module: used to submit the application's computing requirements to the cross-cloud computing environment. The computation steps of the scientific workflow are specified in a graphical interface, and the predecessor and successor task nodes of each task and the storage locations of the input data required by each task of the application are determined in directed-acyclic-graph format. The user also specifies, for each task node, the required hardware quality-of-service indicators of the environment, such as virtual machine memory, hard disk storage, MIPS, number of cores, geographic location, intra-provider data transmission bandwidth, and inter-provider data transmission bandwidth, as well as billing quality-of-service indicators such as virtual machine memory cost and storage cost.
A user agent module: the user agent is connected with the user submitting interface, and each user has a set of agent components to ensure that the application programs are not influenced mutually. For scientific workflows, the agent module comprises four components and can complete the process from parsing the workflow to distributing tasks to the cloud provider virtual machines.
The user agent module includes the following components:
1) A workflow parser: providing an access link between the user application program interface and the agent and translating the execution requirements of the user application program, including content to be executed, a description of task input including the remote data file, output information regarding the task and desired quality of service parameters; service requirements including service location, service type, etc.;
2) Monitoring and evaluating the network environment: maintaining a status of the cloud service including a hardware description of cloud provider resources, network-related characteristics of a connected cloud provider, a usage status of the cloud provider resources, and the like, by periodically checking availability of known cloud services and discovering new services available, while monitoring an execution status of the job so as to return a job result to the user when the job is completed;
3) Workflow mapping: the component matches workflow tasks to cloud providers in a hierarchy, computing a set of valid mappings for the application. For this reason, cloud provider computing center information, application program characteristics and user-defined quality of service attributes need to be considered, and the quality of service parameters specified by the user need to be satisfied within an acceptable range, and the allocation does not cause overload of cloud provider nodes;
4) The workflow scheduler: when one or more mapping plans have been computed, the plans are sent to the scheduler component. The scheduler executes one plan at a time to allocate the virtual machines to cloud providers; after allocation succeeds, the task execution instructions are sent in order to each cloud provider for execution. When virtual machine allocation fails, the allocated virtual machines are released and allocation is retried according to the next plan; if all plans fail, the user is notified that the mapping was unsuccessful;
a cloud provider module: the adapter components in the cloud providers provide a deployment environment for applications in the cross-cloud environment, and the cloud providers can communicate with each other through the adapters regardless of whether they are willing to communicate with each other directly. The component exports cloud services into the cross-cloud environment by integrating basic resource management functions such as scheduling, allocation, virtualization, dynamic monitoring, and database access and transmission.
FIG. 4 shows the scientific workflow DAG corresponding to the remote sensing analysis application according to an embodiment of the invention, where each node represents a specific computation task and each edge represents the precedence and data flow direction between computation tasks. The scientific workflow has three computation task nodes: NDWI is the normalized difference of surface reflectance bands 2 and 5 of the 8-day composite MODIS (Moderate Resolution Imaging Spectroradiometer) product, i.e., NDWI = (ρ2 - ρ5)/(ρ2 + ρ5); AvgNDWI is the average of all NDWI values of the same region within a given time range; and AWI, short for abnormal water index, indicates the drought degree of the region within the given time range and is computed as AWI_i = NDWI_i - AvgNDWI, where NDWI_i uses the corresponding 500-meter state value. This is a typical data-intensive and compute-intensive scientific workflow. According to the monitored region, the remote sensing analysis scientific workflow is divided by year into multiple data blocks, with no data dependency between the data blocks.
FIG. 5 shows the result of converting the DAG graph of the remote sensing analysis application into a DTG graph according to an embodiment of the invention. Because the DTG graph emphasizes the data, it shows the input and output data of the scientific workflow, and the input and output data of the intermediate data conversion processes are shown as well.
To ensure orderly execution among the computation steps of the scientific workflow, a workflow execution flow description language based on the DTG (data transformation graph) is shown in Table 1. In this model, "workflow" represents a complete scientific workflow, with three kinds of sub-elements: "task", "data", and "dependency". "task" represents a specific data conversion node in the workflow, i.e., a computational processing step applied to data, and contains the processing instructions; "data" represents a specific data node in the workflow, where "input_data" is an initial input data node, "result_data" is a result data node, and the rest are intermediate process data nodes; "dependency" represents a dependency relationship between data nodes, where "in_data_task" below it denotes one of the input data nodes of a data conversion node and "out_data_task" denotes one of its output data nodes. The instruction of a data conversion node can be executed only after all of its input data node information has been generated.
TABLE 1 (the table itself is provided as an image in the original publication)
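Because Table 1 itself appears only as an image, the following is a hedged plain-data sketch, written as a Python dictionary rather than in the original description language, of the structure the surrounding text describes. The element names follow the text ("task", "data", "dependency", "input_data", "result_data", "in_data_task", "out_data_task"); the concrete task and data names are invented for illustration:

    workflow = {
        "task": [  # data conversion nodes, each carrying its processing instruction
            {"name": "Tndwi", "instruction": "compute NDWI from bands 2 and 5"},
            {"name": "Tavg",  "instruction": "average NDWI over the time range"},
            {"name": "Tawi",  "instruction": "compute AWI = NDWI_i - AvgNDWI"},
        ],
        "data": [  # data nodes
            {"name": "input_data",  "role": "initial input data node"},
            {"name": "ndwi_data",   "role": "intermediate process data node"},
            {"name": "avg_data",    "role": "intermediate process data node"},
            {"name": "result_data", "role": "result data node (AWI files)"},
        ],
        "dependency": [  # in_data_task: inputs of a conversion node; out_data_task: its outputs
            {"task": "Tndwi", "in_data_task": ["input_data"],            "out_data_task": ["ndwi_data"]},
            {"task": "Tavg",  "in_data_task": ["ndwi_data"],             "out_data_task": ["avg_data"]},
            {"task": "Tawi",  "in_data_task": ["ndwi_data", "avg_data"], "out_data_task": ["result_data"]},
        ],
    }
    # A conversion node's instruction may execute only after all of its in_data_task nodes exist.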
FIG. 6 shows a refined DTG-graph representation of the workflow execution for the first day of the first year of the first region of the remote sensing analysis scientific workflow according to an embodiment of the invention. Because the DTG graph shows both the data nodes and the data conversion nodes, the data conversion processes of the three data conversion nodes Tndwi, Tavg, and Tawi can be further refined. In Tndwi, the data are copied and migrated by a transmission task from the storage center to the hard disk of the computing center, generating data nodes stored on the hard disk; then a read task reads the band 2, band 5, and 500-meter state values from the data, generating data nodes stored in the virtual machine main memory; a compute task then performs the data calculation, with the result still stored in data nodes in the virtual machine main memory; and finally a write task stores the data uniformly on the hard disk as a data file. After the data on the hard disk have been collected, the Tavg process starts, which needs to execute three data conversion steps: a read task, a compute task, and a write task. In the Tawi process, the three data conversion tasks are repeated, except that only the 500-meter state-value data are kept in memory the whole time, so the complex read and write processes are not needed; finally, a write task produces the AWI file as the final result, and after all AWI files have been generated they are sent together to the user side.
For Tndwi, the execution time of the data conversion node is the sum of the execution times of the transmission task, the read task, the compute task, and the write task; for Tavg and Tawi, it is the sum of the read, compute, and write task times. For a transmission task, the execution time depends on the data transfer bandwidth and the data volume, while for read, compute, and write tasks, the time depends on the I/O bandwidth and the processing power of the computing node. Therefore, the directions for optimizing task execution time include reducing the amount of input data of the application, increasing the transmission bandwidth from the storage center to the computing center, increasing the I/O bandwidth, and increasing the processing power of the computing nodes.
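A small sketch of the execution-time model just described, with the transfer time driven by data volume and bandwidth and the read, compute, and write times driven by I/O bandwidth and node processing power. All parameter names and the example numbers are assumptions for illustration:

    def conversion_node_time(data_gb, bandwidth_gbps=1.0, io_gbps=1.0,
                             compute_gflop=0.0, node_gflops=1.0, has_transfer=True):
        """Approximate execution time (seconds) of one data conversion node:
        transfer + read + compute + write for Tndwi; read + compute + write for Tavg and Tawi."""
        gbits = data_gb * 8
        transfer = gbits / bandwidth_gbps if has_transfer else 0.0
        read = write = gbits / io_gbps
        compute = compute_gflop / node_gflops
        return transfer + read + compute + write

    # Example: 10 GB of band data over a 1 Gbit/s storage-to-compute link, 4 Gbit/s local
    # I/O, and a 50 GFLOP computation on a 25 GFLOPS node: 80 + 20 + 2 + 20 = 122 seconds.
    print(conversion_node_time(10, bandwidth_gbps=1, io_gbps=4,
                               compute_gflop=50, node_gflops=25))

This quantity is the f(t_i) that enters max(LP), which is why the four optimizations below each act on one of its terms.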
Combining the optimization analysis of the refined DTG model, the cross-cloud remote sensing application scheduling method provided by the embodiments of the invention offers the following optimizations of the task execution time of the remote sensing application:
Optimization 1: the larger the data volume and the smaller the transmission bandwidth, the more time data transmission takes; to reduce the transmission time, the amount of transmitted data can be reduced or the transmission bandwidth can be increased. The remote sensing analysis application used in this embodiment uses data from only three bands, so the data items required by the user application can be analyzed in the workflow parser, selection scripts for the relevant data items can be generated automatically, and the selection scripts can be sent to the storage center in the data transmission task to obtain the corresponding data, thereby reducing the amount of data transmitted.
Optimization 2: for an application suitable for parallel processing, the related data are organized and distributed in a manner suitable for parallelization, and data processed together are stored as close to each other as possible, reducing the time cost of transmitting data from the storage nodes to the processing nodes over the network. In the remote sensing analysis application used by the invention, the data are packed by divided region, year, and date, and the data in the same package are stored in the same storage center, which facilitates parallel processing.
Optimization 3: for transmission bandwidth, the intra-cloud bandwidth is far higher than the cross-cloud bandwidth, and the local-area-network bandwidth is far higher than the wide-area-network bandwidth; meanwhile, using computing nodes with strong processing power reduces the data processing and conversion time. In the cross-cloud remote sensing application scheduling method, high-performance computing nodes are selected for data conversion in a computing center near the data storage center whenever possible, e.g., in the same cloud or local area network, so that the workflow obtains wider bandwidth and higher processing performance while keeping data migration low; in a parallel processing environment this greatly reduces both data transfer time and data conversion time. The specific allocation process can be implemented using the parallel genetic scientific workflow scheduling algorithm designed according to an embodiment of the invention, which is described later.
Optimization 4: the I/O processing of the remote sensing analysis application takes a lot of time; there are a large number of read and write tasks in the DTG model, many of them caused by I/O operations on intermediate process data, and the time taken by these I/O operations is reduced by storing the intermediate data in the virtual machine main memory instead of on the hard disk. In the cross-cloud environment used in this embodiment, because the data of the whole application are split into multiple data blocks that can be executed in parallel, and the data volume gradually decreases as the data conversion process progresses, real-time data transmission between virtual machine main memories is feasible.
The scheduling flow of the optimized DTG-graph-based remote sensing analysis application in a cross-cloud environment in this embodiment is set forth below:
In the user submission module, the user needs to customize the scientific workflow process and set the quality-of-service constraints. The user defines a scientific workflow process conforming to a DAG (directed acyclic graph), sets each task node and the instructions in the nodes, sets the dependency relationships between tasks, and binds the initial input files of the scientific workflow to each task of the first layer, where the input files carry information about their storage locations in the cross-cloud environment. The user sets the quality-of-service constraints; as shown in Table 2, the performance range of the virtual machines is determined according to the characteristics of the tasks: "vm_number" represents the number of virtual machines required by a task node, and "vm" represents one of the virtual machines and has five sub-elements, "ram", "core", "mips", "storage", and "location", which belong to the hardware quality-of-service indicators of the cloud provider. In addition, the user also needs to set limits on the performance and billing conditions of the provider, including "inter-bandwidth", "intra-bandwidth", "memory_cost", and "storage_cost", which belong to the software quality-of-service indicators of the cloud provider and have a great influence on the completion time and cost of the scientific workflow. The user does not specify a particular cloud provider: hundreds of cloud providers can be connected in the cross-cloud environment and the user cannot determine the most suitable one, so the mapping from virtual machines to cloud providers is completed by the user agent.
TABLE 2
vm: virtual machine required by the user
vm_number: number of user-defined virtual machines
ram: virtual machine memory required by the user
core: number of virtual machine cores required by the user
mips (per core): instruction execution speed per core required by the user
storage: virtual machine storage capacity required by the user
location: geographical location of the virtual machine required by the user
inter-bandwidth: data transmission bandwidth between different cloud providers required by the user
intra-bandwidth: data transmission bandwidth inside a cloud provider required by the user
inter_communicate_cost: data transmission cost between different cloud providers required by the user
memory_cost (per GB): memory usage cost acceptable to the user (billed hourly)
storage_cost (per GB): storage usage cost acceptable to the user (billed hourly)
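As an illustration only, the quality-of-service constraint of Table 2 can be carried as a simple mapping that the user agent compares against each candidate provider. The concrete values are invented; the field names follow the table:

    qos_constraint = {
        "vm_number": 4,                  # virtual machines required by the task node
        "vm": {
            "ram": 16,                   # GB of memory per virtual machine
            "core": 8,
            "mips": 2500,                # instruction execution speed per core
            "storage": 500,              # GB of storage per virtual machine
            "location": "Beijing",
        },
        "inter-bandwidth": 1,            # Gbit/s between different cloud providers
        "intra-bandwidth": 10,           # Gbit/s inside one cloud provider
        "inter_communicate_cost": 0.08,  # cost per GB transferred between providers
        "memory_cost": 0.05,             # cost per GB of memory, billed hourly
        "storage_cost": 0.02,            # cost per GB of storage, billed hourly
    }

A mapping is acceptable only if the provider's actual attributes satisfy these constraints within the range checked by the fitness function described below.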
The workflow parser of the user agent module is responsible for converting the scientific workflow submitted by the user into a DTG data flow: the initial input files and their locations are converted into "input_data" data nodes of the DTG graph, the output result files are converted into "result_data" data nodes, the other intermediate data nodes are created separately, the sequential relationships between the data nodes are set in "dependency", and then the conversion nodes are arranged between the data nodes according to the input and output relationships between the files and tasks in the DAG graph. In the DTG graph produced by the workflow parser, except for the initial input data nodes, the remaining data nodes contain only format descriptions and no data, while the data conversion nodes contain the instruction sets of the tasks in the original DAG graph.
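A minimal sketch of the conversion the workflow parser performs, under assumed input structures; the function name and dictionary layout are illustrative, not the patent's:

    def dag_to_dtg(dag_tasks, initial_inputs, result_files):
        """dag_tasks: dict task_name -> {"inputs": [...], "outputs": [...], "instruction": str}.
        Returns (data_nodes, conversion_nodes, edges) in the DTG sense: every file becomes a
        data node, every task becomes a conversion node, and edges follow the DAG's
        input/output relations. Only initial inputs carry data; other data nodes are
        format-only placeholders until they are generated."""
        data_nodes = {}
        for t in dag_tasks.values():
            for f in t["inputs"] + t["outputs"]:
                kind = ("input_data" if f in initial_inputs
                        else "result_data" if f in result_files
                        else "data")
                data_nodes[f] = kind
        conversion_nodes = {name: t["instruction"] for name, t in dag_tasks.items()}
        edges = {(f, name) for name, t in dag_tasks.items() for f in t["inputs"]}
        edges |= {(name, f) for name, t in dag_tasks.items() for f in t["outputs"]}
        return data_nodes, conversion_nodes, edges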
The workflow mapping module maps the tasks submitted by the user to suitable providers through an algorithm. For the remote sensing analysis application, based on the cross-cloud environment set forth in this embodiment and the DTG (data transformation graph) proposed by this embodiment, a parallel scientific workflow scheduling method based on a genetic algorithm is provided. In this embodiment, a distributed sub-population model is adopted to map in parallel the sub-scientific workflows whose initial input data are located at different positions; a genetic algorithm is used within each sub-scientific workflow, and the remote sensing analysis workflows belonging to the same region and year are treated as one subtask and mapped to a cloud provider while the user quality-of-service constraints are taken into account.
The invention provides a parallel scientific workflow scheduling method based on a genetic algorithm in a cross-cloud environment; the pseudo code of the method is shown in FIG. 7. The method specifically comprises the following steps:
step 1, initializing a sub-population and iteration information:
each node (data node or data conversion node) in the DTG workflow is represented as a subtask, and the subtasks are randomly mapped to cross-cloud providers; here, a subtask is set to execute on a user-defined virtual machine, so the mapping can also be represented as a mapping between virtual machines and cloud providers. Binary coding is adopted and expressed as chromosomes: in the genetic algorithm, a gene represents a subtask and an allele represents a cloud provider; the mapping relation between one task and one cloud provider forms an individual, and the set of all individuals forms a chromosome c (a solution), i.e., a sub-population. The number of individuals in a sub-population is set to s, and the maximum number of iterations of the genetic algorithm is set to n, where n is a positive integer greater than or equal to 1;
step 2, fitness evaluation:
according to the fitness function of the model, the fitness of an individual is evaluated to determine its chance of being inherited; the fitness function is designed to evaluate whether a mapping satisfies the user-specified quality-of-service constraints within a certain range, so the user constraint attributes must be compared with the actual attribute values of the cloud provider. In addition, the different weights of the quality-of-service attributes also have a large influence; in a cross-cloud environment, large data transfers can seriously degrade scheduling performance, so the distance between the provider location and the data storage location of a computing task should be as small as possible, and the customized distance constraint is given the largest weight;
1) The variables are defined as follows: the number of user-specified quality-of-service constraints is s, defined as Q = {Q_1, …, Q_s}; the corresponding attribute values of the actual cloud provider are B = {B_1, …, B_s}; the user-defined weight of each quality-of-service attribute can be computed as w_j = Q_j / |Q|; a parameter dis = k (k a constant) is added to Q as a distance constraint, and the distance from the actual cloud provider to the location of the initial data node is added to B;
2) Calculation formulas: to check the consistency of the user constraints with the actual values of the cloud provider, a set of inequality constraints QC = {QC_1, …, QC_s, QC_dis} is constructed. For attributes where larger values are better, such as ram and storage, QC_j = |Q_j| - |B_j| ≤ 0; for attributes where smaller values are better, such as storage_cost and distance, QC_j = |B_j| - |Q_j| ≤ 0. The difference between the allele attributes of a gene g and the user constraints is defined as D_g [the defining equation is given as an image in the original publication]. The value of each item of D_g is calculated from the inequality formulas introduced by QC, with the set range -1 ≤ D_g ≤ 1, i.e., the value of each item D_g^j lies between -1 and 1. If the condition on D_g^j [given as an image in the original publication] holds, the value j is considered to satisfy the constraints of gene g. The fitness function is usually modeled as a maximization problem, so the fitness function of gene g is defined as L = max(LP), i.e., the maximum length of all paths between data nodes in the DTG;
And 3, selecting mating:
individuals with higher fitness are retained in the next generation chromosome with higher probability;
step 4, single-point crossing:
individuals are randomly paired; for each pair a crossover point position is randomly set and parts of the chromosomes are exchanged; a rate R is set, and s × R single-point crossovers are performed per generation;
step 5, random mutation:
mutating each gene of each individual with a probability of P;
step 6, selecting a new generation of elite:
the fraction E of genes with the highest fitness is selected, and the remainder of the population is filled by cloning the best individuals until the population size is reached;
and 7, evaluating termination conditions:
the iteration count is incremented by 1 and it is checked whether the maximum number of iterations has been reached; if not, execution returns to step 3; if the maximum number of iterations has been reached, the mapping relationship is returned and execution ends.
The cross-cloud adapter located at a cloud provider receives the subtasks and virtual machine configuration requirements sent by the workflow scheduler. If the requirements can be fulfilled, it sends an acceptance command to the cross-cloud environment monitor and then deploys the virtual machines in the local cloud environment. Following the DTG (data transformation graph) model, it obtains the input data files from the data storage center, then establishes a task execution queue and a task waiting queue, and executes the next layer of tasks after all data nodes of each layer have been generated;
the embodiment of the invention combines the large data volume of the remote sensing analysis application program, has a large number of independent sub-scientific workflows, designs the data conversion diagram mainly aiming at the data under the condition that the unit sub-scientific workflow calculation amount can be supported by a single virtual machine, designs the parallel genetic scientific workflow scheduling algorithm in the cross-cloud environment according to the data conversion diagram, reduces the data transmission time and cost during the execution of the scientific workflow by optimizing the data and the data transmission, and helps the remote sensing analysis application program to realize the large-scale optimization of the execution time and the cost by optimizing the parallel optimization and the genetic algorithm in a specific case.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1. A computer-implemented cross-cloud remote sensing application scheduling method comprises the following steps:
converting the cross-cloud remote sensing application into a data transformation graph DTG, wherein the data transformation graph DTG is defined by the four-tuple <D, D_T, T, E>, where D is a universal set of data items that have been used, D_T is a subset of D containing the solution results of the scientific problem, T is a set of data conversion elements for consuming input data items and generating output data items, and E is a set of dependency relationships between data items and data conversion elements, represented as follows:
Application = <D, D_T, T, E>
D_T = {d_i | d_i is one of the result data items}
D = {d_i | d_i is one of the used data items}
T = {t_i | t_i is one of the data conversion nodes}
E = {d_i → t_j | d_i is one of the input data items of t_j}
  ∪ {t_j → d_i | d_i is one of the output data items of t_j}
dep = {d_i → d_j | d_j depends on d_i};
and determining an allocation scheme of the data nodes and data conversion nodes among different cloud data centers, wherein the execution time of the application is optimized by modifying the data transformation graph DTG or by modifying the allocation of the data nodes and data conversion nodes in the DTG among the different cloud data centers.
2. The cross-cloud remote sensing application scheduling method of claim 1, wherein reducing the execution time of the application comprises:
reducing the execution time of the application by "short-circuiting" unnecessary data conversion nodes in the data transformation graph DTG.
3. The cross-cloud remote sensing application scheduling method of claim 1, wherein reducing the execution time of the application comprises:
adjusting the data volume of the data nodes while ensuring that the operation result remains correct.
4. The cross-cloud remote sensing application scheduling method of claim 3, wherein adjusting the data volume of the data nodes while ensuring that the operation result remains correct comprises:
when the remote sensing data are transmitted, transmitting only the bands marked as required by the analysis model corresponding to the data conversion node.
5. The cross-cloud remote sensing application scheduling method of claim 1, wherein reducing the execution time of the application comprises:
storing mutually correlated data in the data transformation graph DTG in the same data center in the cloud.
6. The cross-cloud remote sensing application scheduling method of claim 5, wherein in the remote sensing analysis application, data are packed by divided region, year, and date, and data in the same package are stored in the same data node.
7. The cross-cloud remote sensing application scheduling method of claim 1, wherein reducing the execution time of the application comprises:
selecting high-performance computing nodes in the same data center for data conversion.
8. A computing device as a local cloud provider, comprising:
a processor; and
a memory having stored thereon a computer program that, when executed by a processor, performs the cross-cloud remote sensing application scheduling method of any of claims 1 to 7.
9. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the cross-cloud remote sensing application scheduling method of any one of claims 1 to 7.
10. A computing device as a local cloud provider, comprising:
a processor; and
a memory having stored thereon a computer program operable when executed by the processor to perform the steps of:
receiving the computing requirements of a user's application and converting the scientific workflow into a data transformation graph;
parsing the scientific workflow and translating the execution requirements of the user's program, including the content to be executed, a description of the task inputs including remote data files, output information about the tasks, the expected quality-of-service parameters, and the service requirements;
monitoring and evaluating the network environment, maintaining the state of cloud services by periodically checking the availability of known cloud services and discovering newly available services, while monitoring the execution state of jobs so that job results can be returned to the user when a job completes;
matching workflow tasks to cloud providers hierarchically, computing a set of mappings for the application, and determining an allocation scheme of the data nodes and data conversion nodes among different cloud data centers, wherein the execution time of the application is optimized by modifying the data transformation graph DTG or by modifying the allocation of the data nodes and data conversion nodes in the DTG among the different cloud data centers; and
executing each mapping in the computed set of mappings until the tasks are successfully assigned to cloud providers.
CN202011060026.7A 2020-09-30 2020-09-30 Cross-cloud remote sensing application program scheduling method and application Active CN112181623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060026.7A CN112181623B (en) 2020-09-30 2020-09-30 Cross-cloud remote sensing application program scheduling method and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011060026.7A CN112181623B (en) 2020-09-30 2020-09-30 Cross-cloud remote sensing application program scheduling method and application

Publications (2)

Publication Number Publication Date
CN112181623A CN112181623A (en) 2021-01-05
CN112181623B true CN112181623B (en) 2022-10-25

Family

ID=73946242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060026.7A Active CN112181623B (en) 2020-09-30 2020-09-30 Cross-cloud remote sensing application program scheduling method and application

Country Status (1)

Country Link
CN (1) CN112181623B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182109B (en) * 2017-12-28 2021-08-31 福州大学 Workflow scheduling and data distribution method in cloud environment
CN109254836B (en) * 2018-07-23 2022-06-14 湖南农业大学 Deadline constraint cost optimization scheduling method for priority dependent tasks of cloud computing system
CN109684088B (en) * 2018-12-17 2023-04-07 南京理工大学 Remote sensing big data rapid processing task scheduling method based on cloud platform resource constraint

Also Published As

Publication number Publication date
CN112181623A (en) 2021-01-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant