CN112256926B - Method for storing scientific workflow data set in cloud environment - Google Patents
- Publication number
- CN112256926B, CN202011133768.8A
- Authority
- CN
- China
- Prior art keywords
- data set
- storage
- cost
- strategy
- dependency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Genetics & Genomics (AREA)
- Physiology (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The method for storing a scientific workflow data set in a cloud environment acquires the data sets generated by executing a scientific workflow task and builds a dependency graph from the dependencies among them. A plurality of storage strategies are determined from the different storage states of the data sets in the dependency graph, and the storage cost of each storage strategy is calculated. The calculation cost of generating a target intermediate data set under each storage strategy is calculated based on the dependencies of the intermediate data sets between the initial data set and the target intermediate data set in the dependency graph. For each storage strategy, the total cost is calculated from its storage cost and calculation cost; the storage strategy with the minimum total cost is taken as the optimal storage strategy, and the data sets are stored according to the storage states it specifies. Embodiments of the invention can thereby reduce the cost of storing scientific workflow data sets in a cloud environment.
Description
Technical Field
The invention belongs to the field of cloud storage, and particularly relates to a storage method of a scientific workflow data set in a cloud environment.
Background
A scientific workflow system is a data-intensive application. Running a scientific workflow task usually generates a large number of intermediate data sets (intermediate results) and non-intermediate data sets with complex dependencies. The intermediate data sets are often indispensable for scientific research, but their volume is huge, so managing a scientific workflow requires balancing which intermediate data sets to store and which to delete.
In a cloud environment, resources are billed on a pay-as-you-go basis, so running a scientific workflow incurs cost. Different data sets and different data set management strategies incur different costs, and the strategy with the optimal cost must be found among all management strategies. Storing an intermediate data set consumes storage resources and incurs a corresponding storage cost, so intermediate data sets are often selectively deleted. A deleted intermediate data set must be regenerated when it is needed again, which consumes computing resources and incurs a corresponding computing cost, and the regeneration depends on the predecessor data sets. How to regenerate deleted data sets at lower cost, and how to decide whether a data set should be stored, are therefore technical problems to be solved.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a method for storing a scientific workflow data set in a cloud environment. The technical problem to be solved by the invention is realized by the following technical scheme:
the embodiment of the invention provides a method for storing a scientific workflow data set in a cloud environment, which comprises the following steps:
acquiring the data sets generated when the scientific workflow executes tasks before the current time, wherein the data sets comprise an initial data set and intermediate data sets;
establishing a dependency relationship graph based on the dependency relationship between the data sets;
determining a plurality of storage policies based on different storage states of the data set in the dependency graph;
calculating the storage cost corresponding to each storage strategy;
acquiring a target intermediate data set to be regenerated;
calculating a calculation cost for generating the target intermediate data set under each storage policy based on a dependency relationship of an intermediate data set between a starting data set and the target intermediate data set in the dependency relationship graph;
calculating the total cost of each storage strategy based on the storage cost corresponding to the storage strategy and the calculation cost;
determining a storage strategy with the minimum total cost as an optimal storage strategy;
storing the data set according to the storage state of the data set corresponding to the optimal storage strategy;
wherein the storage state comprises: stored and not stored.
Optionally, the step of establishing a dependency graph based on the dependency between the data sets includes:
taking each task executed by the scientific workflow as a node of a preset directed acyclic graph, wherein each task comprises an input data set and an output data set;
and, for each current node from the first node to the last node of the directed acyclic graph, taking the current data set as the input of the current node, taking the intermediate data set generated from the current data set as the output of the current node, and taking the execution time of the current task as the weight of the connection between the current data set and the intermediate data set generated from it, to obtain the dependency graph.
Optionally, the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph includes:
for each path from the initial data set to the last data set in the dependency graph, combining the storage states of the data sets on the path, in the order of their dependencies in the dependency graph, into a storage strategy.
Optionally, the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph includes:
converting the data set into binary numbers according to the storage state of the data set;
and arranging the binary numbers converted by each data set according to the dependency sequence of the data sets to obtain a plurality of storage strategies converted into binary strings.
Optionally, the step of calculating the storage cost corresponding to each storage policy includes:
aiming at each storage strategy, calculating the storage cost corresponding to the storage strategy by using a storage cost calculation formula;
the storage cost calculation formula is as follows:
$$\mathrm{StoreCost}(d_i, t) = P_s \cdot D_i \cdot t$$
wherein $\mathrm{StoreCost}(d_i, t)$ represents the storage cost, $P_s$ represents the cost of storage resources in cloud computing, $D_i$ represents the size of the stored data set $d_i$, and $t$ represents the statistical time interval.
Optionally, the step of calculating a calculation cost of generating the target intermediate data set under each storage policy based on a dependency relationship of an intermediate data set between the starting data set and the target intermediate data set in the dependency relationship graph includes:
calculating a calculation cost for generating the target intermediate data set under each storage policy using a first generation cost calculation formula based on a dependency relationship of an intermediate data set between a starting data set to the target intermediate data set in the dependency relationship graph;
the first cost of generation calculation formula is:
$$\mathrm{ComputCost}(d_i, t) = R(d_i) \cdot f_i$$
wherein $\mathrm{ComputCost}(d_i, t)$ represents the calculation cost, $D_d$ denotes the set of data sets that are not stored (the deleted state), $R(d_i)$ represents the calculation cost of regenerating the deleted data set $d_i$, $f_i$ represents the access frequency of the deleted data set $d_i$, $i$ is the index of the data set in the scientific workflow task, and $t$ is the statistical time interval.
Optionally, before the step of calculating a calculation cost for generating the target intermediate data set under each storage policy based on the dependency relationship between the starting data set and the target intermediate data set in the dependency relationship graph, the storage method of the first aspect further includes:
calculating a calculation cost of generating a predecessor data set of the target intermediate data set under each storage policy using a second generation cost calculation formula;
wherein the second generation cost calculation formula is:
where $P_c$ represents the cost of computing resources in cloud computing; $T_i$ represents the generation time of data set $d_i$; $\mathrm{preset}_i$ represents the predecessor data sets of the deleted data set $d_i$; $R(d_j)$ represents the generation cost of a predecessor data set $d_j$ of the deleted data set $d_i$; $x_j$ represents the storage state at the $j$-th position in the set $X$ of storage states, where $j$ is the index of a data set among the predecessor data sets of $d_i$; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means data set $d_i$ is in the stored state and $x_i = 0$ means it is in the deleted state; and the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data set $D_s$ and the deleted data set $D_d$, with $D = D_s \cup D_d$.
Optionally, the step of calculating, for each storage policy, a total cost of the storage policy based on the storage cost and the calculation cost corresponding to the storage policy includes:
for each storage strategy, calculating the total cost of the storage strategy by using a total cost calculation formula based on the storage cost and the calculation cost corresponding to the storage strategy;
wherein, the total cost calculation formula is as follows:
wherein $\mathrm{TotalCost}(D, X, t)$ represents the total cost; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means data set $d_i$ is in the stored state and $x_i = 0$ means it is in the deleted state; and the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data set $D_s$ and the deleted data set $D_d$, with $D = D_s \cup D_d$.
Optionally, the step of determining the storage policy with the smallest total cost as the optimal storage policy includes:
determining a storage strategy with the minimum total cost by using a genetic algorithm;
and determining the storage strategy with the minimum total cost as the optimal storage strategy.
Optionally, the step of determining a storage policy with a minimum total cost by using a genetic algorithm includes:
acquiring a population of a genetic algorithm;
initializing the population and then encoding it to obtain a plurality of chromosome individuals, wherein the number of bits of each chromosome individual equals the total number of data sets, and each chromosome individual corresponds to a storage strategy expressed as a binary string;
taking the total cost as the fitness of the chromosome individuals, with a lower total cost being fitter, and repeatedly applying the calculation operators to the chromosome individuals to obtain the chromosome individual with the minimum total cost and to generate new chromosome individuals that are added to the population, until a cutoff condition is reached;
and, when the cutoff condition is reached, determining the storage strategy corresponding to the chromosome individual with the minimum total cost as the storage strategy with the minimum total cost.
With the method provided by the embodiment of the invention, the storage cost and the calculation cost of regenerating the target intermediate data set are evaluated for every storage strategy derived from the dependency graph, and the data sets are stored according to the storage strategy with the minimum total cost, so that the cost of storing scientific workflow data sets in a cloud environment can be reduced.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a schematic flowchart of a method for storing a scientific workflow data set in a cloud environment according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a dependency graph provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a linear scientific workflow task provided by an embodiment of the present invention;
FIG. 4 is a flow chart of calculation of the generation cost of the regenerated data set according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of a subgraph decomposition-reorganization process of a regenerated data set according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
A scientific workflow management system in a cloud environment manages the data sets generated during execution: some data sets are stored and others are deleted. A deleted data set is regenerated when it is needed again, so the cost of regenerating data sets must be calculated.
Example one
As shown in fig. 1, a method for storing a scientific workflow data set in a cloud environment according to an embodiment of the present invention includes:
S1, acquiring the data sets generated when the scientific workflow executes tasks before the current time;
wherein the data set comprises a starting data set and an intermediate data set;
S2, establishing a dependency relationship graph based on the dependency relationship among the data sets;
the dependency relationship is the relationship between a data set and the data sets it depends on when it is generated.
Referring to fig. 2, fig. 2 shows the intermediate data sets produced by executing a nonlinear scientific workflow task containing 15 data sets. An intermediate data set dependency graph, which is a directed acyclic graph, is obtained from the dependencies among the data sets. In fig. 2, $d_i$ represents the $i$-th data set and the arrows represent dependencies between data sets: $d_0$ pointing to $d_1$ means that $d_0$ generates $d_1$; $d_1$ pointing to $d_2$ and $d_3$ means that $d_1$ generates $d_2$ and $d_3$; and $d_5$ and $d_6$ pointing to $d_7$ means that $d_5$ and $d_6$ jointly generate $d_7$.
It can be understood that scientific workflow tasks of different complexity consume different resource costs. Referring to fig. 3, fig. 3 is the dependency graph established for a linear scientific workflow task. In a linear scientific workflow task, each data set has at most one predecessor and one successor, so the structure is simple: a data set to be regenerated may have one or more predecessor data sets in the unstored state, but no multi-branch unstored structure exists. In a nonlinear scientific workflow task, a data set may have several predecessors and successors, the structure is complex, and multi-branch unstored structures exist, so calculating the generation cost is more complicated than for a linear task structure.
S3, determining a plurality of storage strategies based on different storage states of the data set in the dependency graph;
wherein the storage state comprises: stored and not stored.
S4, calculating the storage cost corresponding to each storage strategy;
S5, acquiring a target intermediate data set to be regenerated.
It is to be understood that the target intermediate data set refers to an intermediate data set that needs to be regenerated, the intermediate data set is a data set other than the initial data set, and the storage state of the initial data set is always stored.
It will be appreciated that when a target intermediate data set needs to be generated, the position of the target intermediate data set in the dependency graph needs to be determined, in order to determine the predecessor data sets of the target intermediate data set.
S6, calculating the calculation cost for generating the target intermediate data set under each storage strategy based on the dependency relationship of the intermediate data set between the initial data set and the target intermediate data set in the dependency relationship graph;
S7, calculating the total cost of each storage strategy based on the storage cost corresponding to the storage strategy and the calculation cost;
S8, determining the storage strategy with the minimum total cost as an optimal storage strategy;
S9, storing the data set according to the storage state of the data set corresponding to the optimal storage strategy.
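Steps S3 to S8 can be illustrated with a minimal Python sketch for a small linear workflow, where every candidate strategy is simply enumerated. This is an illustration under stated assumptions, not the patented implementation: the function names, the sample sizes, generation times, access counts and prices are all hypothetical, the access frequency is treated as the number of accesses within the interval t, and exhaustive enumeration is used only because the example has four data sets.

```python
from itertools import product

def total_cost(policy, sizes, gen_times, freqs, ps, pc, t):
    """Total cost of one storage policy on a linear chain d0 -> d1 -> ...

    policy[i] is 1 (stored) or 0 (deleted); all inputs are hypothetical.
    """
    total = 0.0
    regen = 0.0  # regeneration cost accumulated since the last stored data set
    for x, size, gen_time, freq in zip(policy, sizes, gen_times, freqs):
        if x == 1:
            total += ps * size * t      # storage cost: Ps * Di * t
            regen = 0.0                 # a stored data set cuts the chain
        else:
            regen += pc * gen_time      # own cost plus deleted predecessors
            total += regen * freq       # calculation cost: R(di) * fi
    return total

def best_policy(sizes, gen_times, freqs, ps, pc, t):
    """Enumerate all policies with the initial data set stored; return the cheapest."""
    n = len(sizes)
    candidates = ((1,) + bits for bits in product((0, 1), repeat=n - 1))
    return min(candidates,
               key=lambda p: total_cost(p, sizes, gen_times, freqs, ps, pc, t))

# Hypothetical workload: sizes in GB, generation times in hours,
# accesses per interval, prices per GB-day and per hour, 30-day interval.
sizes = [100, 50, 1, 200]
gen_times = [0, 1, 10, 2]
freqs = [0, 1, 5, 1]
best = best_policy(sizes, gen_times, freqs, ps=0.01, pc=1.0, t=30)
print(best, total_cost(best, sizes, gen_times, freqs, 0.01, 1.0, 30))
```

In this sample the small but expensive-to-compute data set is kept while the large, cheap-to-regenerate ones are deleted, which is the trade-off the method is built around.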
Referring to fig. 4 and fig. 5, fig. 4 is a flowchart of regenerating a target intermediate data set. Taking a small workflow of 10 data sets as an example, the data sets and tasks of the whole workflow are connected: the circular nodes represent tasks, the data sets are the inputs and outputs of the tasks, the initial data set must be stored, and the remaining data sets in the workflow are selectively stored.
Example two
As an alternative embodiment of the present invention, the step S2 includes:
step a: taking each task executed by the scientific workflow as a node of a preset directed acyclic graph, wherein each task comprises an input data set and an output data set;
step b: for each current node from the first node to the last node of the directed acyclic graph, taking the current data set as the input of the current node, taking the intermediate data set generated from the current data set as the output of the current node, and taking the execution time of the current task as the weight of the connection between the current data set and the intermediate data set generated from it, to obtain the dependency graph.
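Steps a and b can be sketched as follows. This is a minimal sketch, assuming each task is described by a hypothetical `(inputs, outputs, exec_time)` tuple; the task list and the data set names are invented for illustration and do not reproduce the graph of fig. 2.

```python
from collections import defaultdict

def build_dependency_graph(tasks):
    """Build a data set dependency graph from workflow tasks.

    tasks: list of (inputs, outputs, exec_time) tuples (hypothetical format).
    Returns {source: {target: weight}}, where the weight of an edge is the
    execution time of the task that produces the target from the source.
    """
    graph = defaultdict(dict)
    for inputs, outputs, exec_time in tasks:
        for src in inputs:
            for dst in outputs:
                graph[src][dst] = exec_time
        for dst in outputs:
            graph.setdefault(dst, {})  # make sure sink data sets appear as nodes
    return dict(graph)

def predecessors(graph, node):
    """Return the data sets that the given data set directly depends on."""
    return sorted(src for src, targets in graph.items() if node in targets)

# Hypothetical workflow: d0 -> d1 -> {d2, d3} -> d4
tasks = [
    (["d0"], ["d1"], 2.0),
    (["d1"], ["d2", "d3"], 1.5),
    (["d2", "d3"], ["d4"], 3.0),
]
graph = build_dependency_graph(tasks)
print(predecessors(graph, "d4"))
```

An adjacency mapping keyed by data set keeps the predecessor lookup needed when a deleted data set has to be regenerated straightforward.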
Example three
As an alternative embodiment of the present invention, the step S3 includes:
for each path from the initial data set to the last data set in the dependency graph, combining the storage states of the data sets on the path, in the order of their dependencies in the dependency graph, into a storage strategy.
Example four
As an alternative embodiment of the present invention, the step S3 includes:
step a: converting the data set into binary number according to the storage state of the data set;
step b: and arranging the binary numbers converted by each data set according to the dependency sequence of the data sets to obtain a plurality of storage strategies converted into binary strings.
For a data set dependency graph having n data sets (including the initial data set), there are $2^{n-1}$ storage policies, each represented as a binary string, i.e., a 0-1 vector. For example, a storage policy for the dependency graph of the data sets acquired in S1 is represented as X = 100000000011101.
Referring to fig. 2, a stored data set is represented by the binary digit 1, whereas an unstored (deleted) data set is represented by the binary digit 0.
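The binary encoding can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes that the leftmost bit corresponds to the initial data set, which is always stored; the text does not state the bit order explicitly.

```python
from itertools import product

def enumerate_policies(n):
    """Yield every storage policy for n data sets as a 0/1 string.

    Assumption: the first character is the initial data set, fixed to 1,
    which leaves 2**(n - 1) combinations for the remaining data sets.
    """
    for bits in product("01", repeat=n - 1):
        yield "1" + "".join(bits)

def stored_indices(policy):
    """Return the positions of the data sets marked as stored (bit 1)."""
    return [i for i, bit in enumerate(policy) if bit == "1"]

policies = list(enumerate_policies(4))
print(len(policies))                      # 8 policies for 4 data sets
print(stored_indices("100000000011101"))  # positions stored in the sample policy
```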
Example five
As an alternative embodiment of the present invention, the step S4 includes:
aiming at each storage strategy, calculating the storage cost corresponding to the storage strategy by using a storage cost calculation formula;
the storage cost calculation formula is as follows:
$$\mathrm{StoreCost}(d_i, t) = P_s \cdot D_i \cdot t$$
wherein $\mathrm{StoreCost}(d_i, t)$ represents the storage cost, $D_s$ represents the stored data set, $P_s$ represents the cost of storage resources in cloud computing, $D_i$ represents the size of the stored data set $d_i$, and $t$ represents the statistical time interval.
It can be understood that the main factors affecting the storage cost are time and data set file size. Because the sizes of the intermediate data sets in a scientific workflow and the storage billing model of the cloud environment are fixed, the storage cost of each intermediate data set depends only on its file size and the billing model once the statistical time is fixed.
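As a worked illustration of the storage cost formula, the sketch below sums Ps·Di·t over the data sets that a policy marks as stored. The price, sizes and interval are hypothetical values chosen only for the example, and the function names are not taken from the patent.

```python
def store_cost(price, size, interval):
    """Storage cost of one data set: Ps * Di * t."""
    return price * size * interval

def policy_store_cost(policy, sizes, price, interval):
    """Sum the storage cost over the data sets marked '1' in the policy."""
    return sum(
        store_cost(price, size, interval)
        for bit, size in zip(policy, sizes)
        if bit == "1"
    )

# Hypothetical values: sizes in GB, price per GB per day, 30-day interval.
sizes = [10, 20, 5, 40]
cost = policy_store_cost("1010", sizes, price=0.01, interval=30)
print(cost)  # only the first and third data sets (15 GB in total) are billed
```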
Example six
Calculating a calculation cost for generating the target intermediate data set under each storage policy using a first generation cost calculation formula based on a dependency relationship of an intermediate data set between a starting data set to the target intermediate data set in the dependency relationship graph;
the first cost of generation calculation formula is:
$$\mathrm{ComputCost}(d_i, t) = R(d_i) \cdot f_i$$
wherein $\mathrm{ComputCost}(d_i, t)$ represents the calculation cost, $D_d$ denotes the set of data sets that are not stored (the deleted state), $R(d_i)$ represents the calculation cost of regenerating the deleted data set $d_i$, $f_i$ represents the access frequency of the deleted data set $d_i$, $i$ is the index of the data set in the scientific workflow task, and $t$ is the statistical time interval.
Example seven
As an optional embodiment of the present invention, before the step of S6, the method for storing a scientific workflow data set in a cloud environment according to an embodiment of the present invention further includes:
calculating a calculation cost of generating a predecessor data set of the target intermediate data set under each storage policy using a second generation cost calculation formula;
wherein the second generation cost calculation formula is:
where $P_c$ represents the cost of computing resources in cloud computing; $T_i$ represents the generation time of data set $d_i$; $\mathrm{preset}_i$ represents the predecessor data sets of the deleted data set $d_i$; $R(d_j)$ represents the generation cost of a predecessor data set $d_j$ of the deleted data set $d_i$; $x_j$ represents the storage state at the $j$-th position in the set $X$ of storage states, where $j$ is the index of a data set among the predecessor data sets of $d_i$; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means data set $d_i$ is in the stored state and $x_i = 0$ means it is in the deleted state; and the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data set $D_s$ and the deleted data set $D_d$, with $D = D_s \cup D_d$.
The main factors affecting the calculation cost are the data set access frequency, the generation time, and the generation strategy. When a workflow task has many data sets and a large scale, the total service cost consists of storage and calculation, and the calculation cost is proportional to the access frequency, so the access frequency is the main factor affecting the calculation cost. In addition, to manage the scientific workflow data sets, a storage strategy must be determined under which the operating cost of the scientific workflow system is lowest, and a generation strategy for the intermediate data sets must also be determined.
When calculating the cost of regenerating data sets under a storage strategy, the scientific workflow task is decomposed to obtain the generation subgraph of every deleted data set; each generation subgraph is evaluated to obtain the generation cost of the corresponding deleted data set; and the decomposed subgraphs are then recombined to obtain the final generation cost.
Referring to fig. 5, fig. 5 is a flow chart of the subgraph decomposition-reorganization used to regenerate a data set. Taking the generation subgraph of a deleted data set as an example, the decomposition and reorganization of the generation subgraph of the 9th data set among the 15 data sets is described in detail below.
In fig. 5, for the data sets from $d_0$ to $d_9$, the regeneration process is decomposed in sequence, according to the dependencies and the atomic tasks of the data sets, into the $d_1$-$d_3$ subgraph, the $d_2$-$d_5$ subgraph, the $d_3$-$d_6$ subgraph, the $d_4$-$d_8$ subgraph, the $d_5$-$d_7$ subgraph and the $d_7$-$d_9$ subgraph. The atomic tasks are then recombined in reverse order: the predecessor data sets $d_7$ and $d_8$ of $d_9$ are determined first, and so on, until all data sets needed to generate $d_9$ have been determined.
Given the data set dependency graph DPG and a storage strategy X of a scientific workflow task, a binary bit of '1' in the storage strategy indicates that the data set is stored, and its storage cost is calculated with the storage cost formula. Otherwise the data set is deleted and must be regenerated. When a data set is regenerated, the storage states of its predecessor data sets are checked according to the dependencies: if a predecessor is stored, only the generation cost of the data set itself is calculated; if a predecessor is also in the deleted state, the generation cost of that predecessor is calculated in addition to the generation cost of the data set.
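The second generation cost formula itself does not appear in this text. One form that is consistent with the symbol definitions above, offered here as an assumption rather than as the exact formula of the patent, is the recursion:

```latex
$$R(d_i) = P_c \cdot T_i + \sum_{d_j \in \mathrm{preset}_i} (1 - x_j) \cdot R(d_j)$$
```

Under this reading, the factor $(1 - x_j)$ removes the contribution of every stored predecessor ($x_j = 1$), so only deleted predecessors add their own regeneration cost to that of $d_i$.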
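The predecessor check described above can be sketched as a recursive function over the dependency graph. This is a minimal sketch under assumptions: the data structures (`preds`, `gen_time`, `policy`) and the sample values are hypothetical, and a deleted ancestor shared by several branches is counted once per branch, which may differ from the subgraph recombination of fig. 5.

```python
from functools import lru_cache

def regeneration_costs(preds, gen_time, policy, price_per_hour):
    """Return {index: regeneration cost} for every deleted data set.

    preds:    {i: [indices of direct predecessors]} (hypothetical format)
    gen_time: {i: generation time of data set i, in hours}
    policy:   list of 0/1 storage states, policy[i] == 1 means stored
    """
    @lru_cache(maxsize=None)
    def regen(i):
        cost = price_per_hour * gen_time[i]  # cost of generating d_i itself
        for j in preds[i]:
            if policy[j] == 0:               # deleted predecessor: regenerate it too
                cost += regen(j)
        return cost

    return {i: regen(i) for i in preds if policy[i] == 0}

# Hypothetical graph: d0 -> d1 -> {d2, d3} -> d4, with d0 and d2 stored.
preds = {0: [], 1: [0], 2: [1], 3: [1], 4: [2, 3]}
gen_time = {0: 0.0, 1: 2.0, 2: 1.0, 3: 3.0, 4: 4.0}
policy = [1, 0, 1, 0, 0]
costs = regeneration_costs(preds, gen_time, policy, price_per_hour=1.0)
print(costs)  # {1: 2.0, 3: 5.0, 4: 9.0}
```

Here d4 costs 9.0 because its stored predecessor d2 contributes nothing, while the deleted d3 adds its own cost and that of the deleted d1.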
Example eight
As an alternative embodiment of the present invention, the step of calculating, for each storage policy, a total cost of the storage policy based on the storage cost and the calculation cost corresponding to the storage policy includes:
for each storage strategy, calculating the total cost of the storage strategy by using a total cost calculation formula based on the storage cost and the calculation cost corresponding to the storage strategy;
wherein, the total cost calculation formula is as follows:
wherein $\mathrm{TotalCost}(D, X, t)$ represents the total cost; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ means data set $d_i$ is in the stored state and $x_i = 0$ means it is in the deleted state; and the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data set $D_s$ and the deleted data set $D_d$, with $D = D_s \cup D_d$.
For ease of calculation, the total cost calculation formula for the storage strategy described above can also be converted into:
where $n$ denotes the total number of data sets in a scientific workflow task.
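Neither form of the total cost formula appears in this text. Forms consistent with the definitions of StoreCost, ComputCost, $D_s$, $D_d$ and $x_i$ given in this document, offered as an assumption rather than as the exact formulas of the patent, are the set-based sum and its equivalent indexed sum:

```latex
$$\mathrm{TotalCost}(D, X, t) = \sum_{d_i \in D_s} \mathrm{StoreCost}(d_i, t) + \sum_{d_i \in D_d} \mathrm{ComputCost}(d_i, t)$$

$$\mathrm{TotalCost}(D, X, t) = \sum_{i=1}^{n} \Big[ x_i \cdot \mathrm{StoreCost}(d_i, t) + (1 - x_i) \cdot \mathrm{ComputCost}(d_i, t) \Big]$$
```

The second form replaces membership in $D_s$ or $D_d$ with the bit $x_i$, so the cost of a policy can be evaluated in a single pass over the binary string.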
Example nine
As an alternative embodiment of the present invention, the step of determining the storage policy with the minimum total cost as the optimal storage policy includes:
step a: determining a storage strategy with the minimum total cost by using a genetic algorithm;
step b: and determining the storage strategy with the minimum total cost as the optimal storage strategy.
The embodiment of the invention uses a genetic algorithm to search for the optimal storage strategy. For scientific workflow tasks with complex data sets, it can obtain the best storage cost among algorithms that address the same kind of problem, and it improves the stability and accuracy of the scientific workflow system.
Example ten
As an alternative embodiment of the present invention, the step of determining a storage strategy with the minimum total cost by using a genetic algorithm comprises:
step a: acquiring a population of a genetic algorithm;
step b: initializing the population and then encoding it to obtain a plurality of chromosome individuals, wherein the number of bits of each chromosome individual is the same as the total number of data sets, and each chromosome individual corresponds to a storage strategy expressed as a binary string;
step c: taking the minimum total cost as the fitness of the chromosome individuals, repeatedly applying the calculation operators to each chromosome individual to obtain the chromosome individual with the minimum total cost, and generating new chromosome individuals to be added to the population, until a cutoff condition is reached;
step d: when the cutoff condition is reached, determining the storage strategy corresponding to the chromosome individual with the minimum total cost as the storage strategy with the minimum total cost.
It can be understood that each individual (chromosome) in the population is in essence a storage strategy, i.e. a binary string, so the algorithm can omit the encoding and decoding process. Here $X_i$ denotes a storage strategy. Some nodes (data sets) are determined to be stored, i.e. the corresponding positions in the storage strategy are always 1; these nodes do not take part in the changes made by the calculation operators, and the length of the binary string of each individual in the population is the total number of nodes minus the number of nodes that are always stored.
The total cost is taken as the fitness, $F(0) = \{f_0, f_1, f_2, \ldots, f_n\}$, where $f_i$ denotes the fitness corresponding to the $i$-th storage strategy, $f_i = \zeta(D, X_i, T, L)$. The whole process is solved by a classical genetic algorithm.
The calculation operators are divided into a crossover operator, a mutation operator and a selection operator. The crossover operator uses single-point crossover to cross the codes of two individuals in the population, generating a new code that serves as an individual in the next generation of the population. The mutation operator randomly mutates individuals in the population; if an individual is mutated, one randomly chosen bit in its code is changed from 0 to 1 or from 1 to 0. The selection operator takes fitness as the criterion and selects individuals with lower cost through a roulette strategy, and new individuals are randomly generated to replenish the population. The iteration stops when the minimum cost remains unchanged or the number of iterations reaches an upper limit.
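The operators and the stopping condition described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the parameter values (`pop_size`, `patience`, `mutation_rate`, `n_random`), the inversion of cost into roulette weights and the retention of the best individual in each generation are all assumptions, and the toy cost function stands in for the fitness $\zeta$.

```python
import random


def expand(free_bits, always_stored, n):
    """Rebuild the full strategy: always-stored positions are 1, the rest come from the chromosome."""
    it = iter(free_bits)
    return [1 if i in always_stored else next(it) for i in range(n)]


def roulette_select(population, costs, rng):
    """Roulette selection; a lower cost receives a proportionally larger slice."""
    worst = max(costs)
    weights = [worst - c + 1e-9 for c in costs]  # invert, since cost is minimised
    return rng.choices(population, weights=weights, k=1)[0]


def single_point_crossover(a, b, rng):
    """Join the head of a and the tail of b at a random cut point."""
    if len(a) < 2:
        return list(a)
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def mutate(x, rate, rng):
    """With probability rate, flip one randomly chosen bit."""
    x = list(x)
    if x and rng.random() < rate:
        i = rng.randrange(len(x))
        x[i] = 1 - x[i]
    return x


def genetic_search(n_free, cost, pop_size=20, max_iter=200, patience=30,
                   mutation_rate=0.2, n_random=2, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_free)] for _ in range(pop_size)]
    best_x, best_c, stale = None, float("inf"), 0
    for _ in range(max_iter):  # upper limit on the number of iterations
        costs = [cost(x) for x in population]
        i = min(range(pop_size), key=costs.__getitem__)
        if costs[i] < best_c:
            best_x, best_c, stale = list(population[i]), costs[i], 0
        else:
            stale += 1
        if stale >= patience:  # minimum cost has remained unchanged
            break
        children = [list(best_x)]  # keep the best individual (an assumption)
        while len(children) < pop_size - n_random:
            a = roulette_select(population, costs, rng)
            b = roulette_select(population, costs, rng)
            children.append(mutate(single_point_crossover(a, b, rng), mutation_rate, rng))
        # Randomly generated individuals replenish the population.
        children += [[rng.randint(0, 1) for _ in range(n_free)] for _ in range(n_random)]
        population = children
    return best_x, best_c


# Hypothetical example: four data sets, of which d_1 (index 0) is always stored.
store = [150.0, 60.0, 120.0, 40.0]
regen = [10.0, 20.0, 300.0, 90.0]
always_stored = {0}
n = len(store)


def cost(free_bits):
    x = expand(free_bits, always_stored, n)
    return sum(xi * s + (1 - xi) * r for xi, s, r in zip(x, store, regen))


best_free, best_cost = genetic_search(n - len(always_stored), cost)
best_strategy = expand(best_free, always_stored, n)
```

Because the always-stored nodes are excluded from the chromosome, the operators act only on the remaining bits, and `expand` restores the full binary string before the cost is evaluated.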
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly stated or limited, a first feature being "above" or "below" a second feature may mean that the first and second features are in direct contact, or that the first and second features are not in direct contact but are in contact with each other via another feature therebetween. Also, the first feature being "on," "above" and "over" the second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. The first feature being "beneath," "under" and "below" the second feature includes the first feature being directly beneath or obliquely beneath the second feature, or simply indicates that the first feature is at a lower level than the second feature.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples described in this specification can be joined and combined by those skilled in the art.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (7)
1. A method for storing a scientific workflow data set in a cloud environment is characterized by comprising the following steps:
acquiring a data set generated when the scientific workflow executed tasks before the current time, wherein the data set comprises an initial data set and an intermediate data set;
establishing a dependency relationship graph based on the dependency relationship among the data sets;
determining a plurality of storage policies based on different storage states of the data set in the dependency graph;
calculating the storage cost corresponding to each storage strategy;
acquiring a target intermediate data set to be regenerated;
calculating a calculation cost for generating the target intermediate data set under each storage policy based on a dependency relationship of intermediate data sets between a starting data set and the target intermediate data set in the dependency relationship graph;
calculating the total cost of each storage strategy based on the storage cost corresponding to the storage strategy and the calculation cost;
determining a storage strategy with the minimum total cost as an optimal storage strategy;
storing the data set according to the storage state of the data set corresponding to the optimal storage strategy;
wherein the storage state comprises: stored and not stored;
the step of calculating the storage cost corresponding to each storage policy includes:
aiming at each storage strategy, calculating the storage cost corresponding to the storage strategy by using a storage cost calculation formula;
wherein, the storage cost calculation formula is as follows:
$StoreCost(d_i, t) = Ps \cdot D_i \cdot t$
wherein $StoreCost(d_i, t)$ represents the storage cost, $Ps$ represents the cost of storage resources in cloud computing, $D_i$ represents the size of the stored data set $d_i$, and $t$ represents a statistical time interval;
the step of calculating a calculation cost for generating the target intermediate data set under each storage policy based on the dependency relationship of the intermediate data set between the starting data set and the target intermediate data set in the dependency relationship graph includes:
calculating a calculation cost for generating the target intermediate data set under each storage policy using a first generation cost calculation formula based on a dependency relationship of an intermediate data set between a starting data set to the target intermediate data set in the dependency relationship graph;
the first cost of generation calculation formula is:
$ComputCost(d_i, t) = R(d_i) \cdot f_i$
wherein $ComputCost(d_i, t)$ represents the calculation cost, the set of data sets that are not stored is denoted $Dd$, i.e. the data sets in the deleted state, $R(d_i)$ represents the calculation cost of regenerating the deleted data set $d_i$, $f_i$ represents the access frequency of the deleted data set, $i$ represents the subscript of the data set in a scientific workflow task, and $t$ represents a statistical time interval;
the step of calculating the total cost of each storage policy based on the storage cost and the calculation cost corresponding to the storage policy includes:
for each storage strategy, calculating the total cost of the storage strategy by using a total cost calculation formula based on the storage cost and the calculation cost corresponding to the storage strategy;
wherein, the total cost calculation formula is as follows:
wherein $TotalCost(D, X, t)$ represents the total cost; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ indicates that data set $d_i$ is in the stored state and $x_i = 0$ indicates that data set $d_i$ is in the deleted state; the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data set $Ds$ and the deleted data set $Dd$, so that $D = Ds \cup Dd$.
2. The storage method according to claim 1, wherein the step of building a dependency graph based on dependencies between the data sets comprises:
taking each task executed by the scientific workflow as a node of a preset directed acyclic graph, wherein each task comprises an input data set and an output data set;
and, for any current node from the first node to the last node of the directed acyclic graph, taking the current data set as the input of the current node, taking an intermediate data set generated in dependence on the current data set as the output of the current node, and taking the execution time of the current task as the connection weight between the current data set and the intermediate data set generated in dependence on it, to obtain a dependency graph.
3. The storage method according to claim 1, wherein the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph comprises:
and in each path from the initial data set to the last data set in the dependency graph, different storage states of the data sets on each path in the dependency graph are combined into a storage strategy according to the dependency sequence of the data sets in the dependency graph.
4. The storage method according to claim 3, wherein the step of determining a plurality of storage policies based on different storage states of the data set in the dependency graph comprises:
converting the data set into binary numbers according to the storage state of the data set;
and arranging the binary numbers converted from each data set according to the dependency sequence of the data sets to obtain a plurality of storage strategies converted into binary strings.
5. The storage method according to claim 1, wherein before the step of calculating a calculation cost for generating the target intermediate data set under each storage policy based on a dependency relationship of an intermediate data set between a starting data set to the target intermediate data set in the dependency relationship graph, the storage method further comprises:
calculating a calculation cost of generating a predecessor data set of the target intermediate data set under each storage policy using a second generation cost calculation formula;
wherein the second generation cost calculation formula is:
where $Pc$ represents the cost of computing resources in cloud computing, $T_i$ represents the generation time of data set $d_i$, $preset_i$ represents the set of predecessor data sets of the deleted data set $d_i$, $R(d_j)$ represents the generation cost of the predecessor data set $d_j$ of the deleted data set $d_i$, $x_j$ represents the storage state of the data set at the $j$-th position in the set $X$ of storage states of the data sets, and $j$ represents the subscript of the $j$-th data set among the predecessor data sets of data set $d_i$; $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = 1$ indicates that data set $d_i$ is in the stored state and $x_i = 0$ indicates that data set $d_i$ is in the deleted state; the total data set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$ is divided into the stored data set $Ds$ and the deleted data set $Dd$, so that $D = Ds \cup Dd$.
6. The storage method according to claim 1, wherein the step of determining the storage policy with the minimum total cost as the optimal storage policy comprises:
determining a storage strategy with the minimum total cost by using a genetic algorithm;
and determining the storage strategy with the minimum total cost as the optimal storage strategy.
7. The storage method according to claim 6, wherein the step of determining a storage strategy with a minimum total cost using a genetic algorithm comprises:
acquiring a population of a genetic algorithm;
initializing the population and then encoding it to obtain a plurality of chromosome individuals, wherein the number of bits of each chromosome individual is the same as the total number of data sets, and each chromosome individual corresponds to a storage strategy expressed as a binary string;
taking the minimum total cost as the fitness of the chromosome individuals, repeatedly applying the calculation operators to each chromosome individual to obtain the chromosome individual with the minimum total cost, and generating new chromosome individuals to be added to the population, until a cutoff condition is reached;
and when the cutoff condition is reached, determining the storage strategy corresponding to the chromosome individual with the minimum total cost as the storage strategy with the minimum total cost.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011133768.8A CN112256926B (en) | 2020-10-21 | 2020-10-21 | Method for storing scientific workflow data set in cloud environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112256926A CN112256926A (en) | 2021-01-22 |
CN112256926B CN112256926B (en) | 2022-10-04 |
Family
ID=74263351
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011133768.8A Active CN112256926B (en) | 2020-10-21 | 2020-10-21 | Method for storing scientific workflow data set in cloud environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112256926B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |