CN112256926B - Method for storing scientific workflow data set in cloud environment - Google Patents

Info

Publication number
CN112256926B
Authority
CN
China
Prior art keywords
data set
storage
cost
strategy
dependency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011133768.8A
Other languages
Chinese (zh)
Other versions
CN112256926A (en)
Inventor
范磊
席雪雯
王思尧
刘西洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202011133768.8A priority Critical patent/CN112256926B/en
Publication of CN112256926A publication Critical patent/CN112256926A/en
Application granted granted Critical
Publication of CN112256926B publication Critical patent/CN112256926B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In this method for storing a scientific workflow data set in a cloud environment, the data sets generated by executing scientific workflow tasks are acquired, and a dependency graph is built from the dependencies among them. A plurality of storage strategies is determined from the different storage states of the data sets in the dependency graph, and the storage cost of each storage strategy is calculated. The calculation cost of generating a target intermediate data set under each storage strategy is calculated from the dependencies of the intermediate data sets lying between the initial data set and the target intermediate data set in the dependency graph. For each storage strategy, the total cost is calculated from its storage cost and calculation cost; the storage strategy with the minimum total cost is taken as the optimal storage strategy, and the data sets are stored according to the storage states that this strategy assigns. Embodiments of the invention can thereby reduce the cost of storing scientific workflow data sets in a cloud environment.

Description

Method for storing scientific workflow data set in cloud environment
Technical Field
The invention belongs to the field of cloud storage, and particularly relates to a storage method of a scientific workflow data set in a cloud environment.
Background
A scientific workflow system is a data-intensive application. Running a scientific workflow task usually generates a large number of intermediate data sets (intermediate results) and non-intermediate data sets with complex dependencies among them. The intermediate data sets are often indispensable for scientific research, yet they are very large, so managing the data sets of a scientific workflow requires balancing the storage of intermediate data sets against their deletion.
A cloud environment is billed on a pay-as-you-go basis, so running a scientific workflow there incurs a cost. Different data sets and different data set management strategies incur different costs, and the strategy with the optimal cost has to be found among all management strategies. Storing an intermediate data set consumes storage resources and incurs the corresponding storage cost, so intermediate data sets are often selectively deleted. Once an intermediate data set has been deleted, it must be regenerated before it can be reused; regeneration consumes computing resources and incurs the corresponding computing cost, and it depends on the predecessor data sets. How to regenerate a deleted data set at lower cost, and how to decide whether a data set should be stored at all, is therefore a technical problem to be solved.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a method for storing a scientific workflow data set in a cloud environment. The technical problem to be solved by the invention is realized by the following technical scheme:
the embodiment of the invention provides a method for storing a scientific workflow data set in a cloud environment, which comprises the following steps:
acquiring the data sets generated when the current scientific workflow executes its tasks, wherein the data sets comprise an initial data set and intermediate data sets;
establishing a dependency relationship graph based on the dependency relationship between the data sets;
determining a plurality of storage policies based on different storage states of the data set in the dependency graph;
calculating the storage cost corresponding to each storage strategy;
acquiring a target intermediate data set to be regenerated;
calculating a calculation cost for generating the target intermediate data set under each storage policy based on a dependency relationship of an intermediate data set between a starting data set and the target intermediate data set in the dependency relationship graph;
calculating the total cost of each storage strategy based on the storage cost corresponding to the storage strategy and the calculation cost;
determining a storage strategy with the minimum total cost as an optimal storage strategy;
storing the data set according to the storage state of the data set corresponding to the optimal storage strategy;
wherein the storage state comprises: stored and not stored.
Optionally, the step of establishing a dependency graph based on the dependency between the data sets includes:
taking each task executed by the scientific workflow as a node of a preset directed acyclic graph, wherein each task comprises an input data set and an output data set;
and, for each current node from the first node to the last node of the directed acyclic graph, taking the current data set as the input of the current node, taking the intermediate data set generated from the current data set as the output of the current node, and taking the execution time of the current task as the weight of the connection between the current data set and the intermediate data set generated from it, so as to obtain the dependency graph.
Optionally, the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph includes:
on each path from the initial data set to the last data set in the dependency graph, combining the different storage states of the data sets on that path into storage strategies according to the dependency order of the data sets in the dependency graph.
Optionally, the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph includes:
converting each data set into a binary digit according to its storage state;
and arranging the binary digits of the data sets according to the dependency order of the data sets to obtain a plurality of storage strategies expressed as binary strings.
Optionally, the step of calculating the storage cost corresponding to each storage policy includes:
for each storage strategy, calculating the storage cost corresponding to the storage strategy by using a storage cost calculation formula;
the storage cost calculation formula is as follows:
$\mathrm{StoreCost}(d_i, t) = P_s \cdot D_i \cdot t$
wherein $\mathrm{StoreCost}(d_i, t)$ represents the storage cost, $P_s$ represents the cost of storage resources in cloud computing, $D_i$ represents the size of the stored data set $d_i$, and $t$ represents the statistical time interval.
Optionally, the step of calculating a calculation cost of generating the target intermediate data set under each storage policy based on a dependency relationship of an intermediate data set between the starting data set and the target intermediate data set in the dependency relationship graph includes:
calculating the calculation cost of generating the target intermediate data set under each storage policy by using a first generation cost calculation formula, based on the dependency relationship of the intermediate data sets between the initial data set and the target intermediate data set in the dependency relationship graph;
the first generation cost calculation formula is:
$\mathrm{ComputCost}(d_i, t) = R(d_i) \cdot f_i$
wherein $\mathrm{ComputCost}(d_i, t)$ represents the calculation cost, with the set of data sets that are not stored given as
Figure BDA0002735999920000042
$D_d$ denotes the set of data sets in the deleted state, $R(d_i)$ represents the calculation cost of regenerating the deleted data set $d_i$, $f_i$ represents the access frequency of the deleted data set, $i$ is the index of the data set in the scientific workflow task, and $t$ is the statistical time interval.
Optionally, before the step of calculating a calculation cost for generating the target intermediate data set under each storage policy based on the dependency relationship between the starting data set and the target intermediate data set in the dependency relationship graph, the storage method of the first aspect further includes:
calculating a calculation cost of generating a predecessor data set of the target intermediate data set under each storage policy using a second generation cost calculation formula;
wherein the second generation cost calculation formula is:
Figure BDA0002735999920000041
where $P_c$ represents the cost of computing resources in cloud computing, $T_i$ represents the generation time of data set $d_i$, $\mathrm{preSet}_i$ represents the set of predecessor data sets of the deleted data set $d_i$, $R(d_j)$ represents the generation cost of the predecessor data set $d_j$ of the deleted data set $d_i$, $x_j$ represents the storage state of the data set at the $j$-th position in the set $X$ of storage states, and $j$ is the index of a data set among the predecessor data sets of $d_i$; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means that data set $d_i$ is in the stored state and $x_i = 0$ means that it is in the deleted state; the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data sets $D_s$ and the deleted data sets $D_d$, so that $D = D_s \cup D_d$.
Optionally, the step of calculating, for each storage policy, a total cost of the storage policy based on the storage cost and the calculation cost corresponding to the storage policy includes:
for each storage strategy, calculating the total cost of the storage strategy by using a total cost calculation formula based on the storage cost and the calculation cost corresponding to the storage strategy;
wherein, the total cost calculation formula is as follows:
Figure BDA0002735999920000051
wherein $\mathrm{TotalCost}(D, X, t)$ represents the total cost; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means that data set $d_i$ is in the stored state and $x_i = 0$ means that it is in the deleted state; the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data sets $D_s$ and the deleted data sets $D_d$, so that $D = D_s \cup D_d$.
Optionally, the step of determining the storage policy with the smallest total cost as the optimal storage policy includes:
determining a storage strategy with the minimum total cost by using a genetic algorithm;
and determining the storage strategy with the minimum total cost as the optimal storage strategy.
Optionally, the step of determining a storage policy with a minimum total cost by using a genetic algorithm includes:
acquiring a population of a genetic algorithm;
initializing the population and then encoding it to obtain a plurality of chromosome individuals, wherein the number of bits of each chromosome individual equals the total number of data sets, and each chromosome individual corresponds to a storage strategy expressed as a binary string;
taking the total cost as the fitness of the chromosome individuals, with a lower total cost being fitter, repeatedly applying the operators of the genetic algorithm to each chromosome individual to obtain the chromosome individual with the minimum total cost, and generating new chromosome individuals that are added to the population, until a cutoff condition is reached;
and when the cutoff condition is reached, determining the storage strategy corresponding to the chromosome individual with the minimum total cost as the storage strategy with the minimum total cost.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a schematic flowchart of a method for storing a scientific workflow data set in a cloud environment according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a dependency graph provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a linear scientific workflow task provided by an embodiment of the present invention;
FIG. 4 is a flow chart of calculation of the generation cost of the regenerated data set according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of a subgraph decomposition-reorganization process of a regenerated data set according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
The scientific workflow management system in the cloud environment manages the data sets generated during execution: some data sets are stored and others are deleted, and a deleted data set is regenerated when it is needed again, so the generation cost of regenerating data sets needs to be calculated.
Example one
As shown in fig. 1, a method for storing a scientific workflow data set in a cloud environment according to an embodiment of the present invention includes:
S1, acquiring the data sets generated when the current scientific workflow executes its tasks;
wherein the data sets comprise an initial data set and intermediate data sets;
S2, establishing a dependency relationship graph based on the dependency relationship among the data sets;
The dependency relationship is the relationship between a data set and the data sets it depends on when it is generated.
Referring to fig. 2, fig. 2 shows the intermediate data sets produced by executing a nonlinear scientific workflow task containing 15 data sets. The intermediate data set dependency graph, which is a directed acyclic graph, is obtained from the dependencies among the data sets. In fig. 2, $d_i$ denotes the $i$-th data set and arrows denote dependencies between data sets: the arrow from $d_0$ to $d_1$ means that $d_0$ generates $d_1$; the arrows from $d_1$ to $d_2$ and $d_3$ mean that $d_1$ generates data sets $d_2$ and $d_3$; and the arrows from $d_5$ and $d_6$ to $d_7$ mean that $d_5$ and $d_6$ jointly generate data set $d_7$.
Scientific workflow tasks of different complexity consume different resource costs. Referring to fig. 3, fig. 3 is the dependency graph established for a linear scientific workflow task. In a linear scientific workflow task, each data set has at most one predecessor data set and at most one successor data set, and the structure is simple: a data set to be regenerated may be preceded by one or more unstored predecessor data sets, but no multi-branch unstored structure exists. In a nonlinear scientific workflow task, a data set may have several predecessors and successors, the structure is complex, and multi-branch unstored structures exist, so calculating the generation cost is more complex than for a linear task structure.
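The distinction between linear and nonlinear task structures can be checked mechanically. The following is a minimal Python sketch, assuming the dependency graph is given as a mapping from each data set to the data sets generated from it; the mapping layout and the name `is_linear` are illustrative assumptions, not part of the embodiment.

```python
from collections import defaultdict

def is_linear(successors):
    """Return True if every data set has at most one predecessor and one successor."""
    pred_count = defaultdict(int)
    for src, dsts in successors.items():
        if len(dsts) > 1:          # more than one successor means branching
            return False
        for dst in dsts:
            pred_count[dst] += 1
    return all(count <= 1 for count in pred_count.values())

# Hypothetical linear chain d0 -> d1 -> d2 -> d3.
linear = {"d0": ["d1"], "d1": ["d2"], "d2": ["d3"], "d3": []}
# Fragment described for fig. 2: d1 generates d2 and d3; d5 and d6 jointly generate d7.
nonlinear = {"d0": ["d1"], "d1": ["d2", "d3"], "d5": ["d7"], "d6": ["d7"]}
```

Here `is_linear(linear)` holds, while `is_linear(nonlinear)` fails because `d1` has two successors.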
S3, determining a plurality of storage strategies based on different storage states of the data set in the dependency graph;
wherein the storage state comprises: stored and not stored.
S4, calculating the storage cost corresponding to each storage strategy;
S5, acquiring a target intermediate data set to be regenerated.
It is to be understood that the target intermediate data set is an intermediate data set that needs to be regenerated; an intermediate data set is any data set other than the initial data set, and the initial data set is always in the stored state.
It will be appreciated that when a target intermediate data set needs to be generated, the position of the target intermediate data set in the dependency graph needs to be determined, in order to determine the predecessor data sets of the target intermediate data set.
S6, calculating the calculation cost for generating the target intermediate data set under each storage strategy based on the dependency relationship of the intermediate data set between the initial data set and the target intermediate data set in the dependency relationship graph;
S7, calculating the total cost of each storage strategy based on the storage cost corresponding to the storage strategy and the calculation cost;
S8, determining the storage strategy with the minimum total cost as an optimal storage strategy;
S9, storing the data set according to the storage state of the data set corresponding to the optimal storage strategy.
Referring to fig. 4 and fig. 5, fig. 4 is a flowchart of regenerating a target intermediate data set. Taking a small workflow of 10 data sets as an example, the data sets and tasks of the whole workflow are connected: the circular nodes represent tasks, the data sets are the inputs and outputs of the tasks, the initial data set must be stored, and the remaining data sets in the workflow are stored selectively.
Example two
As an alternative embodiment of the present invention, the step S2 includes:
step a: taking each task executed by the scientific workflow as a node of a preset directed acyclic graph, wherein each task comprises an input data set and an output data set;
step b: for each current node from the first node to the last node of the directed acyclic graph, taking the current data set as the input of the current node, taking the intermediate data set generated from the current data set as the output of the current node, and taking the execution time of the current task as the weight of the connection between the current data set and the intermediate data set generated from it, so as to obtain the dependency graph.
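Steps a and b can be illustrated with a short Python sketch. It assumes each task is described by its input data sets, its output data set and its execution time; the `Task` record, the function name `build_dependency_graph` and the sample values are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    inputs: tuple      # names of the input data sets
    output: str        # name of the data set the task generates
    exec_time: float   # execution time, used as the edge weight

def build_dependency_graph(tasks):
    """Return (edges, predecessors).

    edges maps (input, output) to the execution time of the generating task;
    predecessors maps each data set to the list of data sets it depends on.
    """
    edges = {}
    predecessors = {}
    for task in tasks:
        predecessors.setdefault(task.output, [])
        for src in task.inputs:
            predecessors.setdefault(src, [])
            edges[(src, task.output)] = task.exec_time
            predecessors[task.output].append(src)
    return edges, predecessors

# Hypothetical tasks: d0 generates d1, and d1 generates d2 and d3.
tasks = [
    Task(inputs=("d0",), output="d1", exec_time=2.0),
    Task(inputs=("d1",), output="d2", exec_time=1.5),
    Task(inputs=("d1",), output="d3", exec_time=3.0),
]
edges, predecessors = build_dependency_graph(tasks)
```

Keeping a predecessor list per data set makes it straightforward to walk backwards from a deleted data set when its regeneration cost is needed.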
Example three
As an alternative embodiment of the present invention, the step S3 includes:
on each path from the initial data set to the last data set in the dependency graph, combining the different storage states of the data sets on that path into storage strategies according to the dependency order of the data sets in the dependency graph.
Example four
As an alternative embodiment of the present invention, the step S3 includes:
step a: converting each data set into a binary digit according to its storage state;
step b: arranging the binary digits of the data sets according to the dependency order of the data sets to obtain a plurality of storage strategies expressed as binary strings.
For a data set dependency graph having $n$ data sets (including the initial data set), there are $2^{n-1}$ storage policies. Each storage policy is represented as a binary string, i.e., a 0-1 vector; for example, a storage policy for the data set dependency graph in S1 is represented as X = 100000000011101.
Referring to fig. 2, a stored data set is represented by the binary digit 1, whereas an unstored (deleted) data set is represented by the binary digit 0.
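The count of $2^{n-1}$ follows because the initial data set is always stored, so only the remaining $n-1$ bits vary. A minimal Python sketch of the enumeration, assuming the first bit of the string belongs to the initial data set (the function name `enumerate_policies` is illustrative):

```python
from itertools import product

def enumerate_policies(n):
    """Yield every storage policy for n data sets as a 0/1 string.

    The first bit belongs to the initial data set, which is always stored,
    so only the remaining n - 1 bits vary, giving 2 ** (n - 1) policies.
    """
    for bits in product("01", repeat=n - 1):
        yield "1" + "".join(bits)

policies = list(enumerate_policies(4))
```

For four data sets this yields eight policies. Exhaustive enumeration grows exponentially with the number of data sets, so it is only practical for small workflows.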
Example five
As an alternative embodiment of the present invention, the step S4 includes:
for each storage strategy, calculating the storage cost corresponding to the storage strategy by using a storage cost calculation formula;
the storage cost calculation formula is as follows:
$\mathrm{StoreCost}(d_i, t) = P_s \cdot D_i \cdot t$
wherein $\mathrm{StoreCost}(d_i, t)$ represents the storage cost, $D_s$ represents the set of stored data sets, $P_s$ represents the cost of storage resources in cloud computing, $D_i$ represents the size of the stored data set $d_i$, and $t$ represents the statistical time interval.
It can be understood that the main factors influencing the storage cost are time and data set file size. The sizes of the intermediate data sets in a scientific workflow and the storage charging mode of the cloud environment are fixed, so for a fixed statistical time the storage cost of each intermediate data set depends only on its file size and the charging mode.
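The storage cost formula can be evaluated directly for a whole storage policy by summing over the stored data sets. A minimal Python sketch; the units (GB, days), the price and the sizes are hypothetical values chosen only for illustration.

```python
def store_cost(price_per_gb_per_day, size_gb, days):
    """StoreCost(d_i, t) = Ps * Di * t for one stored data set."""
    return price_per_gb_per_day * size_gb * days

def policy_store_cost(policy, sizes_gb, price_per_gb_per_day, days):
    """Sum the storage cost over the data sets whose policy bit is '1'."""
    return sum(
        store_cost(price_per_gb_per_day, size, days)
        for bit, size in zip(policy, sizes_gb)
        if bit == "1"
    )

# Hypothetical figures: 0.5 cost units per GB per day over a 30-day interval.
sizes_gb = [10.0, 4.0, 20.0]
cost = policy_store_cost("101", sizes_gb, price_per_gb_per_day=0.5, days=30)
```

With the policy `"101"`, only the first and third data sets contribute, giving 150 + 300 = 450 cost units.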
Example six
Calculating the calculation cost of generating the target intermediate data set under each storage policy by using a first generation cost calculation formula, based on the dependency relationship of the intermediate data sets between the initial data set and the target intermediate data set in the dependency relationship graph;
the first generation cost calculation formula is:
$\mathrm{ComputCost}(d_i, t) = R(d_i) \cdot f_i$
wherein $\mathrm{ComputCost}(d_i, t)$ represents the calculation cost, with the set of data sets that are not stored given as
Figure BDA0002735999920000111
$D_d$ denotes the set of data sets in the deleted state, $R(d_i)$ represents the calculation cost of regenerating the deleted data set $d_i$, $f_i$ represents the access frequency of the deleted data set, $i$ is the index of the data set in the scientific workflow task, and $t$ is the statistical time interval.
Example seven
As an optional embodiment of the present invention, before the step of S6, the method for storing a scientific workflow data set in a cloud environment according to an embodiment of the present invention further includes:
calculating a calculation cost of generating a predecessor data set of the target intermediate data set under each storage policy using a second generation cost calculation formula;
wherein the second generation cost calculation formula is:
Figure BDA0002735999920000121
where $P_c$ represents the cost of computing resources in cloud computing, $T_i$ represents the generation time of data set $d_i$, $\mathrm{preSet}_i$ represents the set of predecessor data sets of the deleted data set $d_i$, $R(d_j)$ represents the generation cost of the predecessor data set $d_j$ of the deleted data set $d_i$, $x_j$ represents the storage state of the data set at the $j$-th position in the set $X$ of storage states, and $j$ is the index of a data set among the predecessor data sets of $d_i$; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means that data set $d_i$ is in the stored state and $x_i = 0$ means that it is in the deleted state; the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data sets $D_s$ and the deleted data sets $D_d$, so that $D = D_s \cup D_d$.
The main factors influencing the calculation cost are the access frequency of the data sets, the generation time, and the generation strategy. When a workflow task has many data sets and is large in scale, the total service cost consists of storage and calculation, and the calculation cost is directly proportional to the access frequency, so the access frequency of the data sets is the main factor influencing the calculation cost. In addition, to manage the scientific workflow data sets, a storage strategy must be determined under which the operating cost of the scientific workflow system is lowest, and a generation strategy for the intermediate data sets must be determined as well.
When calculating the cost of regenerating data sets under a storage strategy, the scientific workflow task is decomposed to obtain the generation subgraph of every deleted data set, each generation subgraph is evaluated to obtain the generation cost of the corresponding deleted data set, and the decomposed subgraphs are then recombined to obtain the final generation cost.
Referring to fig. 5, fig. 5 is a flowchart of subgraph decomposition and recombination for regenerating a data set. Taking the generation subgraph of a deleted data set as an example, the decomposition and recombination of the generation subgraph of the 9th data set among the 15 data sets is described in detail below.
In fig. 5, for the data sets from $d_0$ to $d_9$, the regeneration process is decomposed in turn, according to the dependencies and the atomic tasks of the data sets, into the $d_1$-$d_3$ subgraph, the $d_2$-$d_5$ subgraph, the $d_3$-$d_6$ subgraph, the $d_4$-$d_8$ subgraph, the $d_5$-$d_7$ subgraph and the $d_7$-$d_9$ subgraph. The atomic tasks are then recombined in reverse order: the predecessor data sets $d_7$ and $d_8$ of data set $d_9$ are determined first, and so on, until all the data sets needed to generate $d_9$ have been determined.
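The reverse recombination can be pictured as a backward walk from the target data set that stops at stored data sets. The following Python sketch is one possible reading, not the procedure of fig. 5 itself: the predecessor map loosely follows the subgraphs named above, and the set of stored data sets is a hypothetical assumption.

```python
def regeneration_set(target, predecessors, stored):
    """Return, in dependency order, the data sets that must be regenerated
    to obtain a deleted `target`: the target plus every unstored ancestor
    reachable without passing through a stored data set."""
    order = []
    seen = set()

    def visit(d):
        if d in seen or d in stored:
            return
        seen.add(d)
        for p in predecessors.get(d, []):
            visit(p)
        order.append(d)   # appended only after all of its predecessors

    visit(target)
    return order

# Hypothetical fragment loosely following the subgraphs named for fig. 5.
predecessors = {
    "d1": ["d0"], "d3": ["d1"], "d5": ["d2"], "d6": ["d3"],
    "d7": ["d5", "d6"], "d8": ["d4"], "d9": ["d7", "d8"],
}
stored = {"d0", "d2", "d3", "d4"}   # assumed storage states
needed = regeneration_set("d9", predecessors, stored)
```

Under these assumed storage states, regenerating `d9` requires rebuilding `d5`, `d6`, `d7` and `d8` first, in an order that respects the dependencies.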
Given the data set dependency graph DPG of a scientific workflow task and a storage strategy X, a binary bit of "1" in the storage strategy indicates that the data set is stored, and its storage cost is calculated with the storage cost calculation formula and output. Otherwise the data set is deleted and needs to be regenerated. When a data set is regenerated, the storage states of its predecessor data sets are checked according to the dependencies: if a predecessor data set is stored, only the generation cost of the data set itself is calculated; if the predecessor data set is also in the deleted state, the generation cost of the predecessor data set is calculated in addition to the generation cost of the data set itself.
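The rule in the preceding paragraph suggests a recursive computation. The second generation cost formula is available only as an image, so the Python sketch below assumes one form that is consistent with the symbol definitions, namely $R(d_i) = P_c \cdot T_i + \sum_{d_j \in \mathrm{preSet}_i} (1 - x_j) \cdot R(d_j)$; this form, the function name and all sample values are assumptions.

```python
from functools import lru_cache

def make_generation_cost(pc, gen_time, predecessors, policy):
    """Build R(d) under the assumed recursion
    R(d_i) = Pc * T_i + sum over predecessors d_j of (1 - x_j) * R(d_j)."""

    @lru_cache(maxsize=None)
    def r(d):
        cost = pc * gen_time[d]
        for p in predecessors.get(d, ()):
            if policy[p] == 0:        # a deleted predecessor must be regenerated too
                cost += r(p)
        return cost

    return r

# Hypothetical chain d0 -> d1 -> d2 -> d3 with d0 and d1 stored.
gen_time = {"d0": 0.0, "d1": 2.0, "d2": 3.0, "d3": 4.0}
predecessors = {"d1": ["d0"], "d2": ["d1"], "d3": ["d2"]}
policy = {"d0": 1, "d1": 1, "d2": 0, "d3": 0}
r = make_generation_cost(pc=0.5, gen_time=gen_time,
                         predecessors=predecessors, policy=policy)
comput_cost_d3 = r("d3") * 3   # ComputCost = R(d_3) * f_3, assuming f_3 = 3 accesses
```

On a linear chain this recursion matches the rule exactly. On a nonlinear graph, a deleted ancestor shared by two branches would be counted once per branch, so a fuller implementation would need to deduplicate shared ancestors.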
Example eight
As an alternative embodiment of the present invention, the step of calculating, for each storage policy, a total cost of the storage policy based on the storage cost and the calculation cost corresponding to the storage policy includes:
for each storage strategy, calculating the total cost of the storage strategy by using a total cost calculation formula based on the storage cost and the calculation cost corresponding to the storage strategy;
wherein, the total cost calculation formula is as follows:
Figure BDA0002735999920000131
wherein $\mathrm{TotalCost}(D, X, t)$ represents the total cost; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means that data set $d_i$ is in the stored state and $x_i = 0$ means that it is in the deleted state; the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data sets $D_s$ and the deleted data sets $D_d$, so that $D = D_s \cup D_d$.
For ease of calculation, the total cost calculation formula for the storage strategy described above can also be converted into:
Figure BDA0002735999920000141
where n denotes the total number of data sets in a scientific workflow task.
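The two formula images are not reproduced in this text. Based only on the surrounding definitions (a storage cost for data sets in Ds, a calculation cost for data sets in Dd, and the bits x_i), one plausible reading of the two forms is sketched below; it is a reconstruction under those assumptions, not the published formula.

```latex
% Assumed form 1: sum over the stored set and the deleted set
\mathrm{TotalCost}(D, X, t)
  = \sum_{d_i \in D_s} \mathrm{StoreCost}(d_i, t)
  + \sum_{d_i \in D_d} \mathrm{ComputCost}(d_i, t)

% Assumed form 2: the same sum written over all n data sets using the bits x_i
\mathrm{TotalCost}(D, X, t)
  = \sum_{i=1}^{n} \Bigl[ x_i \cdot \mathrm{StoreCost}(d_i, t)
  + (1 - x_i) \cdot \mathrm{ComputCost}(d_i, t) \Bigr]
```

Under this reading the two forms agree term by term, since x_i = 1 selects the storage cost and x_i = 0 selects the calculation cost.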
Example nine
As an alternative embodiment of the present invention, the step of determining the storage policy with the minimum total cost as the optimal storage policy includes:
step a: determining a storage strategy with the minimum total cost by using a genetic algorithm;
step b: and determining the storage strategy with the minimum total cost as the optimal storage strategy.
The embodiment of the invention adopts a genetic algorithm to search for the optimal storage strategy. For scientific workflow tasks with complex data sets, it can obtain the lowest storage cost among algorithms that handle the same kind of problem, which improves the stability and accuracy of the scientific workflow system.
Example ten
As an alternative embodiment of the present invention, the step of determining a storage strategy with the minimum total cost by using a genetic algorithm comprises:
step a: acquiring a population of a genetic algorithm;
step b: initializing the population and then encoding it to obtain a plurality of chromosome individuals, wherein the number of bits of each chromosome individual equals the total number of data sets, and each chromosome individual corresponds to a storage strategy in the form of a binary string;
step c: taking the total cost as the fitness of the chromosome individuals, with a lower total cost being better, repeatedly applying the calculation operators to the chromosome individuals to obtain the chromosome individual with the minimum total cost, and generating new chromosome individuals to be added to the population, until a cutoff condition is reached;
step d: when the cutoff condition is reached, determining the storage strategy corresponding to the chromosome individual with the minimum total cost as the storage strategy with the minimum total cost.
It can be understood that each individual (chromosome) in the population is itself a storage strategy, i.e. a binary string, so the algorithm can omit the encoding and decoding process.
Figure BDA0002735999920000151
where X_i represents a storage strategy. Some nodes (data sets) are fixed as stored, i.e. the corresponding bits in the storage strategy are always 1. These nodes do not take part in the operator changes, so the length of the binary string of each individual in the population is the total number of nodes minus the number of nodes fixed as stored.
Taking the total cost as the fitness, F(0) = {f_0, f_1, f_2, ..., f_n}, where f_i denotes the fitness corresponding to the i-th storage strategy, f_i = ζ(D, X_i, T, L). The whole process is solved with a classical genetic algorithm.
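The handling of nodes that are always stored can be sketched as follows. This is an illustrative helper, not part of the patent: the name `expand_strategy` and the use of zero-based positions are assumptions. It rebuilds a full n-bit storage strategy from a shortened individual whose bits cover only the nodes that are free to change.

```python
def expand_strategy(individual, always_stored, n):
    """Rebuild a full n-bit storage strategy from a shortened individual.

    individual:    bits for the nodes that the operators may change
    always_stored: zero-based positions whose bit is fixed to 1
    n:             total number of data sets
    """
    if len(individual) != n - len(always_stored):
        raise ValueError("individual length must be n minus the fixed nodes")
    free_bits = iter(individual)
    # Fixed positions get 1; every other position takes the next free bit.
    return [1 if i in always_stored else next(free_bits) for i in range(n)]


# Hypothetical example: 5 data sets, positions 0 and 3 are always stored.
print(expand_strategy([0, 1, 0], {0, 3}, 5))
```

Keeping the fixed bits out of the individual means crossover and mutation can never turn a mandatory data set into a deleted one.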
The calculation operators are divided into a crossover operator, a mutation operator and a selection operator. The crossover operator applies single-point crossover to the codes of two individuals in the population and produces a new code, which becomes an individual of the new generation. The mutation operator randomly mutates individuals in the population: if an individual is mutated, one bit of its code is randomly flipped from 0 to 1 or from 1 to 0. The selection operator uses fitness as the criterion, selects the lower-cost individuals through a roulette strategy, and randomly generates new individuals to replenish the population. The iteration stops when the minimum cost remains unchanged or the number of iterations reaches an upper limit.
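The operators described above can be combined into a small search loop. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name, the parameter defaults, the inverse-cost roulette weights and the retention of the best individual between generations are all choices made for illustration, and `total_cost` is any caller-supplied function that maps a bit tuple to a cost.

```python
import random


def genetic_search(n_bits, total_cost, pop_size=20, max_iter=200,
                   patience=30, p_mut=0.1, seed=0):
    """Search for a low-cost binary storage strategy (illustrative sketch)."""
    rng = random.Random(seed)

    def new_individual():
        return tuple(rng.randint(0, 1) for _ in range(n_bits))

    population = [new_individual() for _ in range(pop_size)]
    best = min(population, key=total_cost)
    stale = 0
    for _ in range(max_iter):
        costs = [total_cost(ind) for ind in population]
        worst = max(costs)
        # Roulette wheel for a minimisation problem: cheaper individuals
        # receive larger weights (assumed weighting scheme).
        weights = [worst - c + 1e-9 for c in costs]
        next_gen = [best]  # keeping the best individual is an assumption
        for _ in range(pop_size // 2):
            a, b = rng.choices(population, weights=weights, k=2)
            # Single-point crossover produces one new code.
            cut = rng.randrange(1, n_bits) if n_bits > 1 else 0
            child = list(a[:cut] + b[cut:])
            # Mutation flips one randomly chosen bit.
            if rng.random() < p_mut:
                pos = rng.randrange(n_bits)
                child[pos] = 1 - child[pos]
            next_gen.append(tuple(child))
        # Randomly generated individuals replenish the population.
        while len(next_gen) < pop_size:
            next_gen.append(new_individual())
        population = next_gen
        candidate = min(population, key=total_cost)
        if total_cost(candidate) < total_cost(best):
            best, stale = candidate, 0
        else:
            stale += 1
            if stale >= patience:  # minimum cost unchanged: stop
                break
    return best, total_cost(best)


# Hypothetical toy cost: number of bits that differ from a target pattern.
target = (1, 0, 1, 1)
best, cost = genetic_search(4, lambda ind: sum(a != b for a, b in zip(ind, target)))
print(best, cost)
```

The two stopping conditions in the text map to `patience` (minimum cost unchanged) and `max_iter` (iteration upper limit).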
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly stated or limited, a first feature being "above" or "below" a second feature means that the first and second features are in direct contact, or that they are not in direct contact but are in contact through another feature between them. Also, the first feature being "on," "above" and "over" the second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. The first feature being "beneath," "under" and "below" the second feature includes the first feature being directly beneath or obliquely beneath the second feature, or merely indicates that the first feature is at a lower level than the second feature.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples described in this specification can be combined by those skilled in the art.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (7)

1. A method for storing a scientific workflow data set in a cloud environment is characterized by comprising the following steps:
acquiring the data sets generated when the scientific workflow executes a task at the current time, wherein the data sets comprise an initial data set and intermediate data sets;
establishing a dependency relationship graph based on the dependency relationship among the data sets;
determining a plurality of storage policies based on different storage states of the data set in the dependency graph;
calculating the storage cost corresponding to each storage strategy;
acquiring a target intermediate data set to be regenerated;
calculating a calculation cost for generating the target intermediate data set under each storage policy based on a dependency relationship of intermediate data sets between a starting data set and the target intermediate data set in the dependency relationship graph;
calculating the total cost of each storage strategy based on the storage cost corresponding to the storage strategy and the calculation cost;
determining a storage strategy with the minimum total cost as an optimal storage strategy;
storing the data set according to the storage state of the data set corresponding to the optimal storage strategy;
wherein the storage state comprises: stored and not stored;
the step of calculating the storage cost corresponding to each storage policy includes:
aiming at each storage strategy, calculating the storage cost corresponding to the storage strategy by using a storage cost calculation formula;
wherein, the storage cost calculation formula is as follows:
StoreCost(d_i, t) = Ps·Di·t
wherein StoreCost(d_i, t) represents the storage cost, Ps represents the price of storage resources in cloud computing, Di represents the size of the stored data set, and t represents a statistical time interval;
the step of calculating a calculation cost for generating the target intermediate data set under each storage policy based on the dependency relationship of the intermediate data set between the starting data set and the target intermediate data set in the dependency relationship graph includes:
calculating a calculation cost for generating the target intermediate data set under each storage policy using a first generation cost calculation formula based on a dependency relationship of an intermediate data set between a starting data set to the target intermediate data set in the dependency relationship graph;
the first cost of generation calculation formula is:
ComputCost(d_i, t) = R(d_i)·f_i
wherein ComputCost(d_i, t) represents the calculation cost, and the set of data sets that are not stored is
Figure FDA0003798839890000022
Dd denotes the set of data sets in the deleted state, R(d_i) represents the calculation cost of regenerating the deleted data set d_i, f_i represents the access frequency of the deleted data set, i represents the subscript of the data set in the scientific workflow task, and t represents a statistical time interval;
the step of calculating the total cost of each storage policy based on the storage cost and the calculation cost corresponding to the storage policy includes:
for each storage strategy, calculating the total cost of the storage strategy by using a total cost calculation formula based on the storage cost and the calculation cost corresponding to the storage strategy;
wherein, the total cost calculation formula is as follows:
Figure FDA0003798839890000021
wherein TotalCost(D, X, t) represents the total cost, X = {x_1, x_2, …, x_n}, x_i = 1 indicates that data set d_i is in the stored state, x_i = 0 indicates that data set d_i is in the deleted state, and the total data set D = {d_1, d_2, …, d_i, …, d_n} is divided into the stored data set Ds and the deleted data set Dd, so that D = Ds ∪ Dd.
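The cost formulas recited in claim 1 can be sketched in Python as follows. The sketch assumes that the total cost adds the storage cost of each stored data set and the calculation cost of each deleted data set, and it takes the regeneration costs R(d_i) as given inputs even though they depend on the storage strategy in practice. All function and parameter names are illustrative.

```python
def store_cost(ps, size, t):
    """StoreCost(d_i, t) = Ps * Di * t."""
    return ps * size * t


def comput_cost(regen_cost, freq):
    """ComputCost(d_i, t) = R(d_i) * f_i."""
    return regen_cost * freq


def total_cost(x, sizes, regen_costs, freqs, ps, t):
    """Assumed total: storage cost where x_i = 1, calculation cost where x_i = 0."""
    return sum(
        store_cost(ps, sizes[i], t) if x[i] == 1
        else comput_cost(regen_costs[i], freqs[i])
        for i in range(len(x))
    )


# Hypothetical values: two data sets, the first stored and the second deleted.
print(total_cost([1, 0], sizes=[10, 20], regen_costs=[5, 7],
                 freqs=[2, 3], ps=0.5, t=4))
```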
2. The storage method according to claim 1, wherein the step of building a dependency graph based on dependencies between the data sets comprises:
taking each task executed by the scientific workflow as a node of a preset directed acyclic graph, wherein each task comprises an input data set and an output data set;
and, for any current node from the first node to the last node of the directed acyclic graph, taking the current data set as the input of the current node, taking the intermediate data set generated from the current data set as the output of the current node, and taking the execution time of the current task as the connection weight between the current data set and the intermediate data set generated from it, to obtain the dependency graph.
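A minimal sketch of how such a dependency graph could be assembled is given below. It assumes each task is supplied as a hypothetical (input data set, output data set, execution time) triple and uses the standard-library `graphlib` module (Python 3.9 or later) to obtain a dependency order and to reject cyclic input. The data layout is an assumption made for illustration.

```python
from graphlib import TopologicalSorter


def build_dependency_graph(tasks):
    """Build a weighted dependency graph from (input, output, exec_time) triples.

    Returns (graph, order), where graph[a][b] is the execution time of the
    task that generates data set b from data set a, and order lists the data
    sets so that every data set appears after its predecessors.
    """
    graph = {}   # data set -> {successor data set: execution time}
    preds = {}   # data set -> set of predecessor data sets
    for inp, out, exec_time in tasks:
        graph.setdefault(inp, {})[out] = exec_time
        graph.setdefault(out, {})
        preds.setdefault(inp, set())
        preds.setdefault(out, set()).add(inp)
    # static_order() raises graphlib.CycleError if the input is not acyclic.
    order = list(TopologicalSorter(preds).static_order())
    return graph, order


# Hypothetical chain of two tasks: d0 -> d1 -> d2.
graph, order = build_dependency_graph([("d0", "d1", 2.0), ("d1", "d2", 3.5)])
print(graph)
print(order)
```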
3. The storage method according to claim 1, wherein the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph comprises:
and in each path from the initial data set to the last data set in the dependency graph, different storage states of the data sets on each path in the dependency graph are combined into a storage strategy according to the dependency sequence of the data sets in the dependency graph.
4. The storage method according to claim 3, wherein the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph comprises:
converting the data set into binary numbers according to the storage state of the data set;
and arranging the binary numbers converted from each data set according to the dependency sequence of the data sets to obtain a plurality of storage strategies converted into binary strings.
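The conversion of storage states into binary strings can be sketched as an exhaustive enumeration. This is an illustrative example only: the function name is hypothetical, and it simply lists every combination of stored (1) and deleted (0) states for data sets given in dependency order.

```python
from itertools import product


def storage_strategies(datasets_in_dependency_order):
    """List every storage strategy as a binary string, one bit per data set."""
    n = len(datasets_in_dependency_order)
    return ["".join(map(str, bits)) for bits in product((0, 1), repeat=n)]


# Hypothetical example with two data sets.
print(storage_strategies(["d1", "d2"]))
```

The number of strategies is 2^n for n data sets, which is why an exhaustive listing is only practical for small workflows and a search method such as a genetic algorithm is used for larger ones.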
5. The storage method according to claim 1, wherein before the step of calculating a calculation cost for generating the target intermediate data set under each storage policy based on a dependency relationship of an intermediate data set between a starting data set to the target intermediate data set in the dependency relationship graph, the storage method further comprises:
calculating a calculation cost of generating a predecessor data set of the target intermediate data set under each storage policy using a second generation cost calculation formula;
wherein the second generation cost calculation formula is:
Figure FDA0003798839890000041
where Pc represents the price of computing resources in cloud computing, T_i represents the generation time of data set d_i, preset_i represents the predecessor data sets of the deleted data set d_i, R(d_j) represents the generation cost of the predecessor data set d_j of the deleted data set d_i, x_j represents the storage state of the data set at the j-th position in the set X of storage states, j represents the subscript of the j-th data set among the predecessor data sets of data set d_i, X = {x_1, x_2, …, x_n}, x_i = 1 indicates that data set d_i is in the stored state, x_i = 0 indicates that data set d_i is in the deleted state, and the total data set D = {d_1, d_2, …, d_i, …, d_n} is divided into the stored data set Ds and the deleted data set Dd, so that D = Ds ∪ Dd.
6. The storage method according to claim 1, wherein the step of determining the storage policy with the minimum total cost as the optimal storage policy comprises:
determining a storage strategy with the minimum cost by using a genetic algorithm;
and determining the storage strategy with the minimum total cost as the optimal storage strategy.
7. The storage method according to claim 6, wherein the step of determining a storage strategy with a minimum total cost using a genetic algorithm comprises:
acquiring a population of a genetic algorithm;
initializing the population and then encoding it to obtain a plurality of chromosome individuals, wherein the number of bits of each chromosome individual equals the total number of data sets, and each chromosome individual corresponds to a storage strategy in the form of a binary string;
taking the total cost as the fitness of the chromosome individuals, with a lower total cost being better, repeatedly applying the calculation operators to the chromosome individuals to obtain the chromosome individual with the minimum total cost, and generating new chromosome individuals to be added to the population, until a cutoff condition is reached;
and when the cutoff condition is reached, determining the storage strategy corresponding to the chromosome individual with the minimum total cost as the storage strategy with the minimum total cost.
CN202011133768.8A 2020-10-21 2020-10-21 Method for storing scientific workflow data set in cloud environment Active CN112256926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011133768.8A CN112256926B (en) 2020-10-21 2020-10-21 Method for storing scientific workflow data set in cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011133768.8A CN112256926B (en) 2020-10-21 2020-10-21 Method for storing scientific workflow data set in cloud environment

Publications (2)

Publication Number Publication Date
CN112256926A CN112256926A (en) 2021-01-22
CN112256926B true CN112256926B (en) 2022-10-04

Family

ID=74263351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011133768.8A Active CN112256926B (en) 2020-10-21 2020-10-21 Method for storing scientific workflow data set in cloud environment

Country Status (1)

Country Link
CN (1) CN112256926B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236578A (en) * 2010-05-07 2011-11-09 微软公司 Distributed workflow execution
CN110033076A (en) * 2019-04-19 2019-07-19 福州大学 Mix the Work stream data layout method below cloud environment to cost optimization

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8515910B1 (en) * 2010-08-26 2013-08-20 Amazon Technologies, Inc. Data set capture management with forecasting
US8856483B1 (en) * 2010-09-21 2014-10-07 Amazon Technologies, Inc. Virtual data storage service with sparse provisioning
CN106161599A (en) * 2016-06-24 2016-11-23 电子科技大学 A kind of method reducing cloud storage overall overhead when there is data dependence relation
CN108182109B (en) * 2017-12-28 2021-08-31 福州大学 Workflow scheduling and data distribution method in cloud environment
CN108989098B (en) * 2018-08-24 2021-06-01 福建师范大学 Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment
CN109840154B (en) * 2019-01-08 2022-10-14 南京邮电大学 Task dependency-based computing migration method in mobile cloud environment
CN111008152B (en) * 2019-12-26 2022-10-11 中国人民解放军国防科技大学 Kernel module compatibility influence domain analysis method, system and medium based on function dependency graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimized scheduling method for data-analysis-oriented cloud workflows; Ma Zitai; China Master's Theses Full-text Database; 20200115; pp. I138-57 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant