CN112328380A - Task scheduling method and device based on heterogeneous computing - Google Patents


Info

Publication number
CN112328380A
Authority
CN
China
Prior art keywords
processor
task
node
population
task scheduling
Prior art date
Legal status
Pending
Application number
CN202011245253.7A
Other languages
Chinese (zh)
Inventor
邹承明
史梦园
Current Assignee
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202011245253.7A priority Critical patent/CN112328380A/en
Publication of CN112328380A publication Critical patent/CN112328380A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/12 Computing arrangements based on biological models using genetic models
    • G06N 3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/54 Indexing scheme relating to G06F9/54
    • G06F 2209/548 Queue


Abstract

The invention relates to a task scheduling method based on heterogeneous computing, which comprises the following steps: establishing a DAG model of a task to be scheduled, and establishing a job queue according to the DAG model; establishing a network topology graph among heterogeneous processors, and allocating an initial task amount to each processor according to its calculation speed; randomly distributing the tasks in the job queue to each processor according to the initial task amount, randomly assigning an initial voltage to each processor, and constructing an initial scheduling list; generating an initial population according to the initial scheduling list, initializing the parameters of a genetic algorithm, and having each processor execute the genetic algorithm in parallel to iteratively update the population and obtain an optimal population; and acquiring the task scheduling list corresponding to the optimal population as the optimal task scheduling list. The method can obtain the global optimal solution of task scheduling and fully exploit the advantages of heterogeneous computing.

Description

Task scheduling method and device based on heterogeneous computing
Technical Field
The present invention relates to the field of computing task scheduling technologies, and in particular, to a task scheduling method and apparatus based on heterogeneous computing, and a computer storage medium.
Background
With the rapid development of computer technology, chips have gone through round after round of performance improvement. However, with the explosive growth of the internet and the popularization of information, and the recent rise of fields with high demands on computing performance such as machine learning, deep learning, artificial intelligence and industrial simulation, computing performance bottlenecks have appeared, such as low parallelism, insufficient bandwidth and high latency. Different types of processors differ in computational performance and characteristics. For example: a CPU is composed of arithmetic logic units, register units and control units, but about 70% of its transistors are used to build caches and parts of the control logic, so relatively few are devoted to logic operations and more to control. A GPU is designed as a coprocessor and is suited to large numbers of compute-intensive, highly threaded parallel processing tasks. An FPGA is reprogrammable and low-power, offers a high degree of parallelism, and achieves it mainly through two techniques: concurrency and pipelining. To improve computing performance, different types of processors can be combined into a heterogeneous computing system, so that the strengths of the various processors compensate for one another's weaknesses. At present, heterogeneous computing usually adopts the CPU + GPU and CPU + FPGA modes. The CPU + GPU mode cannot be programmed flexibly and increases energy cost. Compared with the FPGA, the GPU's memory-interface bandwidth is far better; in addition, the computing capability of an FPGA basic unit is limited:
to achieve reconfigurability, the FPGA contains a large number of extremely fine-grained basic units, but the computing power of each unit (relying mainly on LUT lookup tables) is far lower than that of the ALU modules in a CPU or GPU.
When a computing task is executed between heterogeneous processors, the task needs to be reasonably scheduled, so that the task is guaranteed to be completed smoothly, and the scheduling length of task scheduling is minimized. The task scheduling method adopted at present has the following problems:
1. List-scheduling techniques first calculate the priority of each task, then sort the tasks by priority, and finally schedule the tasks to suitable processors in priority order. This scheduling approach has a significant drawback: it does not take energy consumption into account.
2. Heuristic scheduling algorithms, such as task scheduling based on genetic algorithms, simulated annealing or ant colony algorithms, easily drive the result into a local optimum, so a suitable improvement must be found. Although some researchers have studied parallel genetic algorithms, those algorithms have a fatal defect in the parallel stage: sub-populations are distributed uniformly across the processors, even though computing characteristics, computing speed, transmission time and power consumption differ from processor to processor. A uniform-distribution rule is clearly unsuitable for heterogeneous systems and cannot fully exploit the advantages of heterogeneous computing.
3. The technologies currently adopted to reduce power consumption during task scheduling are mainly DPM and DVFS. DPM works by switching an idle component to a low-power mode, or shutting it down, to reduce power consumption. However, DPM reduces the processing speed of the CPU, so it generally carries the constraint that energy consumption must be reduced while the QoS of the system is guaranteed. DVFS, dynamic voltage and frequency scaling, dynamically adjusts the operating frequency and voltage of a chip according to the varying computing demands of the application it runs, thereby saving energy. Under DVFS, power consumption is modeled as coming from two sources: the dynamic power consumed when CMOS circuits switch and the static power lost to CMOS leakage. In practical application scenarios, however, part of the energy consumption also comes from transmission losses, and from the sleep state: although energy drains more slowly in sleep than in the active state, small tasks often leave processors sleeping, so reducing sleep energy consumption is also an effective way to reach a low-energy goal.
In summary, the existing task scheduling method has defects, and a new heterogeneous computing task scheduling method is urgently needed.
Disclosure of Invention
In view of the above, it is necessary to provide a task scheduling method and device based on heterogeneous computing, so as to solve the problem that heterogeneous-computing task scheduling optimization easily falls into a local optimum and cannot exploit the advantages of heterogeneous computing to the greatest extent.
The invention provides a task scheduling method based on heterogeneous computing, which comprises the following steps:
establishing a DAG model of the task to be scheduled, and establishing a job queue according to the DAG model;
establishing a network topological graph among heterogeneous processors, and distributing initial task amount for each processor according to the calculation speed of each processor;
randomly distributing tasks in the job queue to each processor according to the initial task amount, randomly distributing initial voltage to each processor, and constructing an initial scheduling list;
generating an initial population according to the initial scheduling list, initializing parameters of a genetic algorithm, and executing the genetic algorithm in parallel by each processor to perform population iteration updating to obtain an optimal population;
and acquiring a task scheduling list corresponding to the optimal population as an optimal task scheduling list.
Further, establishing a DAG model of the task to be scheduled, specifically:
establishing the DAG model by taking the jobs contained in the task to be scheduled as nodes of the DAG model, taking the job execution time as the node attribute, establishing directed edges between the nodes according to the execution-order dependencies between the jobs, and taking the communication traffic between the jobs as the directed-edge attribute.
Further, establishing a job queue according to the DAG model specifically comprises:
copying all nodes in the DAG model to obtain a node set;
screening out the nodes with in-degree 0: if there is only one node with in-degree 0, putting it directly into the job queue; if there are several nodes with in-degree 0, further screening out the node with the smallest job size; if only one node has the smallest job size, putting it directly into the job queue; and if several nodes share the smallest job size, randomly selecting one of them to put into the job queue;
deleting the enqueued node from the node set, and deleting the dependency relationships between the enqueued node and its successor nodes; judging whether the current node set is empty: if so, outputting the job queue; if not, returning to the previous step to enqueue the next node.
Further, establishing a network topology graph among the heterogeneous processors specifically includes:
taking each processor as a node of the network topology graph, taking the execution speed of the processor as the node attribute, establishing an undirected edge between two nodes according to whether the processors can communicate with each other, and taking the communication speed between the processors as the undirected-edge attribute, so as to establish the network topology graph.
Further, allocating an initial task amount to each processor according to the calculation speed of each processor specifically includes:
according to the calculation speed of each processor, calculating the hyper-parameter of each processor:
θ_a = W(p_a) / Σ_{b=1}^{d} W(p_b);
wherein θ_a is the hyper-parameter of the a-th processor, W(p_a) is the computation rate of the a-th processor, W(p_b) is the computation rate of the b-th processor, 1 ≤ a ≤ d, b = 1, …, d, and d is the number of processors;
allocating an initial task amount to each processor according to the hyper-parameters:
Num(p_a) = θ_a * M;
wherein Num(p_a) is the initial amount of tasks allocated to the a-th processor, and M is the total number of tasks.
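As a minimal sketch, the proportional allocation above can be written as follows. Rounding to whole tasks is an assumption; the patent leaves Num(p_a) = θ_a * M without specifying how fractional task counts are handled.

```python
def allocate_initial_tasks(rates, total_tasks):
    """Split total_tasks among processors in proportion to their compute rates.

    theta_a = W(p_a) / sum_b W(p_b);  Num(p_a) = theta_a * M
    """
    total_rate = sum(rates)
    thetas = [r / total_rate for r in rates]
    # Rounding to integers is an assumption not made explicit in the text.
    return [round(t * total_tasks) for t in thetas]

# Example: three processors with computation rates 4, 2, 2 and M = 16 tasks.
print(allocate_initial_tasks([4, 2, 2], 16))  # -> [8, 4, 4]
```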
Further, each processor executes a genetic algorithm in parallel to perform population iteration updating to obtain an optimal population, specifically:
establishing an optimization model by taking the minimum energy consumption as an objective function and taking the task execution time as a constraint condition;
evaluating the fitness of the current population according to the objective function, and carrying out genetic operations of a selection operator, a crossover operator and a mutation operator on the current population in parallel by each processor; each processor exchanges information to realize population updating;
and judging whether an iteration termination condition is reached, if so, outputting the current population as an optimal population, and otherwise, turning to the previous step for next iteration.
Further, an optimization model is established by taking the minimum energy consumption as an objective function and taking the task execution time as a constraint condition, and specifically comprises the following steps:
obtaining the total energy consumption:
EC_total = EC_dynamic + EC_static + EC_trans + EC_sleep;
wherein EC_total is the total energy consumption, EC_dynamic the dynamic energy consumption, EC_static the static energy consumption, EC_trans the transmission energy consumption, and EC_sleep the sleep energy consumption;
establishing an objective function by taking the minimum total energy consumption as a target:
E_min = min(EC_total);
wherein E_min represents the objective function, and min() takes the minimum value;
sequentially solving the earliest start time and the earliest end time of each job node:
Start(t_i, p_a) = max_{t_j ∈ pred(t_i)} ( End(t_j, p_b) + C_ab );
End(t_i, p_a) = Start(t_i, p_a) + W(t_i) / W(p_a);
wherein Start(t_i, p_a) represents the earliest start time of job node t_i on processor p_a, pred(t_i) is the set of direct predecessor nodes t_j of t_i (each executing on its processor p_b), End(t_j, p_b) represents the earliest end time of job node t_j on processor p_b, and C_ab represents the communication time between the a-th processor and the b-th processor; End(t_i, p_a) represents the earliest end time of job node t_i on processor p_a, W(t_i) represents the job size of job node t_i, and W(p_a) represents the execution rate of the processor;
C_ab = PL_ab / B_ab;
wherein C_ab represents the communication time between the a-th and b-th processors, PL_ab represents the communication load (the amount of data to be transferred) between the a-th and b-th processors, and B_ab represents the communication bandwidth between the a-th and b-th processors;
taking the last executed job node as the exit node, constraining the earliest end time of the exit node, and establishing the constraint condition:
End(t_exit) < deadline;
wherein t_exit represents the exit job node, End(t_exit) represents the earliest end time of the exit job node t_exit, and deadline represents the latest completion time.
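A small worked sketch of the timing recursion, using the symbols Start, End, C_ab, PL_ab, B_ab from above. All numeric values are assumed, and the form C_ab = PL_ab / B_ab is itself a reconstruction, since the original formula image is not reproduced here.

```python
def comm_time(pl, bw):
    # Assumed reconstruction: transfer time = communication load / bandwidth.
    return pl / bw

def earliest_times(job_size, proc_rate, pred_ends_with_comm):
    """Start(t_i,p_a) = max over predecessors of (End(t_j,p_b) + C_ab);
    End(t_i,p_a) = Start(t_i,p_a) + W(t_i)/W(p_a)."""
    start = max(pred_ends_with_comm, default=0.0)  # entry node starts at 0
    end = start + job_size / proc_rate
    return start, end

# t0 (size 4) on p0 (rate 2): entry node, so Start = 0 and End = 2.
s0, e0 = earliest_times(4, 2, [])
# t1 (size 6) on p1 (rate 3), predecessor t0 on p0; PL_01 = 8, B_01 = 4.
s1, e1 = earliest_times(6, 3, [e0 + comm_time(8, 4)])
print(s0, e0, s1, e1)   # -> 0.0 2.0 4.0 6.0
deadline = 10
assert e1 < deadline    # the End(t_exit) < deadline constraint holds
```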
Further, the genetic operation of a mutation operator is performed on the current population, specifically:
respectively calculating the ratio of the fitness of each individual of the current population to the average fitness of the population;
judging whether the ratio is greater than 1, if so, reducing the variation rate of the corresponding individual, otherwise, increasing the variation rate of the corresponding individual;
and performing genetic operation of a mutation operator on the current population according to the regulated mutation rate.
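The adaptive rule above can be sketched as follows. The base rate and the step by which the rate is raised or lowered are assumptions; the patent only states the direction of the adjustment.

```python
def adapt_mutation_rates(fitnesses, base_rate=0.10, step=0.02):
    """Lower the mutation rate for above-average individuals, raise it otherwise.

    base_rate and step are assumed values; the text specifies only that the
    rate decreases when fitness / average > 1 and increases otherwise.
    """
    avg = sum(fitnesses) / len(fitnesses)
    rates = []
    for f in fitnesses:
        if f / avg > 1:                                  # fitter than average
            rates.append(round(max(base_rate - step, 0.0), 6))
        else:                                            # at or below average
            rates.append(round(base_rate + step, 6))
    return rates

print(adapt_mutation_rates([3.0, 1.0, 2.0]))  # avg = 2.0 -> [0.08, 0.12, 0.12]
```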
The invention also provides a task scheduling device based on heterogeneous computing, which comprises a processor and a memory, wherein the memory is stored with a computer program, and the computer program is executed by the processor to realize the task scheduling method based on heterogeneous computing.
The invention also provides a computer storage medium, on which a computer program is stored, which, when executed by a processor, implements the heterogeneous computing-based task scheduling method.
Advantageous effects: before task scheduling, a DAG model is first established so as to split the task and build the job queue, and then the processor network topology graph is established. To avoid uniform task distribution when the processors execute the genetic algorithm in parallel, the task amount is distributed to each processor according to its speed, exploiting the strengths of the heterogeneous system's different processors to a greater extent. An overall scheduling mode is then adopted: the voltage is assigned at the same time as the job nodes are allocated to the processors, and a genetic algorithm performs the task allocation to form an optimal task scheduling list. Once the optimal task scheduling list is determined, the scheduling length and the energy consumption are determined as well. The invention provides a real-time task scheduling method on a heterogeneous computing platform which, unlike existing common heterogeneous computing platforms, makes full use of the available computing resources and improves task scheduling performance through task allocation based on computation rate.
Drawings
FIG. 1 is a flowchart of a task scheduling method based on heterogeneous computing according to a first embodiment of the present invention;
FIG. 2 is a DAG model diagram of a task scheduling method based on heterogeneous computing according to a first embodiment of the present invention;
FIG. 3 is a network topology diagram of a task scheduling method based on heterogeneous computing according to a first embodiment of the present invention;
FIG. 4 is a comparison diagram of the runtime of the first embodiment of the task scheduling method based on heterogeneous computing according to the present invention on different computing platforms;
FIG. 5 is a comparison graph of the running times of a genetic algorithm NPGA and an existing parallel genetic algorithm NGA in the first embodiment of the task scheduling method based on heterogeneous computing according to the present invention;
FIG. 6 is a comparison graph of the energy consumption of the genetic algorithm NPGA and an existing genetic algorithm GA in the first embodiment of the task scheduling method based on heterogeneous computing according to the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Example 1
As shown in fig. 1, embodiment 1 of the present invention provides a task scheduling method based on heterogeneous computing, including the following steps:
s1, establishing a DAG model of the task to be scheduled, and establishing a job queue according to the DAG model;
s2, establishing a network topological graph among the heterogeneous processors, and distributing initial task amount for each processor according to the calculation speed of each processor;
s3, randomly distributing the tasks in the job queue to each processor according to the initial task amount, randomly distributing initial voltage to each processor, and constructing an initial scheduling list;
s4, generating an initial population according to the initial scheduling list, initializing genetic algorithm parameters, and executing a genetic algorithm in parallel by each processor to perform population iteration updating to obtain an optimal population;
and S5, acquiring a task scheduling list corresponding to the optimal population as an optimal task scheduling list.
Before the scheduling method provided by the embodiment is operated, a heterogeneous computing system needs to be built to realize effective communication among different processors, and the method for building the heterogeneous system is as follows:
A1. In this embodiment, the heterogeneous computing platform is a heterogeneous system built from CPU + GPU + FPGA on the OpenCL framework; the CPU serves as the host of the heterogeneous system, and the GPU + FPGA serve as its computing devices;
A2. Host code and device code are developed according to the OpenCL framework;
A3. The CPU host executes the host code developed on the OpenCL framework and handles the information communication between the CPU host and the GPU + FPGA computing devices. In addition, the CPU host is responsible for scheduling the computing tasks;
A4. The device code written in step A2 is placed on the computing devices for execution;
A5. After the CPU + GPU + FPGA heterogeneous system detects a computing task, a job queue is generated, jobs are then distributed to the computing nodes, and several jobs form a work-group;
A6. During execution of a computing task, if a large amount of data with low coupling needs to undergo the same operation, the OpenCL framework splits the data and sends the parts to multiple work-items that execute the same command, achieving data parallelism;
A7. Each work-item runs independently; all work-items in the same work-group share the data in local memory, but their operations are independent and do not affect one another, achieving task parallelism;
A8. After the data-parallel and task-parallel execution, the input is converted into a computed output according to the defined function, and the result is written into the device's local memory;
A9. The computation result is returned to the host, which reads the output buffer to obtain the result from the GPU + FPGA computing devices;
A10. When the computation is complete, the resources are released to await the next task.
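Steps A6-A8 describe OpenCL-style data parallelism. The following plain-Python simulation (not actual OpenCL code; the strided chunking scheme is an assumption) illustrates splitting low-coupling data across work-items that each execute the same command and write into a shared output buffer:

```python
from concurrent.futures import ThreadPoolExecutor

def run_data_parallel(data, op, num_items=4):
    """Apply the same operation op to independent chunks in parallel."""
    # Strided split: work-item i gets elements i, i+num_items, i+2*num_items, ...
    chunks = [data[i::num_items] for i in range(num_items)]
    with ThreadPoolExecutor(max_workers=num_items) as pool:
        results = list(pool.map(lambda c: [op(x) for x in c], chunks))
    # Interleave the per-item results back into original order,
    # simulating the shared output buffer of A8.
    out = [None] * len(data)
    for i, chunk_res in enumerate(results):
        out[i::num_items] = chunk_res
    return out

print(run_data_parallel(list(range(8)), lambda x: x * x))
# -> [0, 1, 4, 9, 16, 25, 36, 49]
```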
After the heterogeneous system is built, task scheduling can be carried out. A scheduled task is formed by combining several jobs, and execution-order dependencies exist between the jobs, so a DAG model is established in order to build the job queue. Second, the processor network topology graph is established. To avoid uniform task allocation when the processors execute the genetic algorithm in parallel, this embodiment allocates the task amount to each processor according to its speed, exploiting the strengths of the heterogeneous system's different processors to a greater extent. Overall scheduling is then used: the assignment of a job node to a processor is done at the same time as the voltage is assigned. A genetic algorithm then distributes the tasks to form the optimal task scheduling list. Once the optimal task scheduling list is determined, the scheduling length and the energy consumption are determined as well.
The invention provides a real-time task scheduling method on a heterogeneous computing platform, which is different from the existing common heterogeneous computing platform, fully utilizes the existing computing resources and improves the task scheduling performance by task allocation based on the computing rate.
Preferably, the establishing of the DAG model of the task to be scheduled specifically includes:
establishing the DAG model by taking the jobs contained in the task to be scheduled as nodes of the DAG model, taking the job execution time as the node attribute, establishing directed edges between the nodes according to the execution-order dependencies between the jobs, and taking the communication traffic between the jobs as the directed-edge attribute.
A task is formed by combining a plurality of jobs, and the jobs have a dependency relationship of execution sequence. This relationship is represented by a DAG model, i.e., a directed acyclic graph.
The DAG model specifically comprises:
Task = (T, W, E);
wherein Task is the DAG model; T is the node set, T = {t_0, t_1, …, t_{n-1}}, where t_i represents the i-th job of the task to be scheduled, i = 0, 1, …, n-1, and n is the number of jobs; W is the node execution-time set, W = {w_0, w_1, …, w_{n-1}}, where w_i represents the worst expected execution time of the i-th job; E is the dependency (directed-edge) set, E = {[e_00, e_01, …, e_0,n-1], [e_10, e_11, …, e_1,n-1], …, [e_{n-1,0}, e_{n-1,1}, …, e_{n-1,n-1}]}, where e_ij represents the dependency relationship and the traffic between the i-th job and the j-th job, j = 0, 1, …, n-1. e_ij ≥ 0 means that job t_j depends on job t_i, i.e. t_j can execute only after t_i has finished, and the traffic from t_i to t_j has size e_ij; e_ij = -1 means that job t_j does not depend on job t_i and there is no traffic.
Specifically, the DAG model graph created in this embodiment is shown in fig. 2. The DAG model in fig. 2 consists of 6 nodes, drawn as circles, each representing one job. The value in a node has two parts: the upper value is the job number (i.e. i or j), assigned by hand in order to tell the jobs apart, and the lower value is the size w_i of the job. Two nodes connected by a directed edge indicate a dependency between the jobs, and the arrow gives the direction of the dependency, i.e. the jobs execute in order. For example: job 1 can start, at the earliest, only after job 0 has completed, and job 5 can start only after both job 3 and job 4 have completed. The values on the directed edges represent the traffic.
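The fig. 2 relationships can be encoded in the Task = (T, W, E) convention above. The edge set below follows the successors named in the worked example later in the text; the traffic values are assumed placeholders, since fig. 2 itself is not reproduced here.

```python
n = 6
# E[i][j] >= 0: job t_j depends on t_i, with traffic E[i][j]; -1: no dependency.
E = [[-1] * n for _ in range(n)]
# Edges of the fig. 2 example; the traffic values are assumed placeholders.
for (i, j, traffic) in [(0, 1, 2), (0, 2, 3), (0, 3, 1),
                        (1, 4, 2), (2, 4, 4), (3, 5, 2), (4, 5, 3)]:
    E[i][j] = traffic

def predecessors(E, j):
    """Jobs that must finish before t_j may start."""
    return [i for i in range(len(E)) if E[i][j] >= 0]

print(predecessors(E, 5))  # -> [3, 4]: jobs 3 and 4 must complete before job 5
print(predecessors(E, 0))  # -> []: job 0 is the entry node
```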
Preferably, the job queue is established according to the DAG model, specifically:
copying all nodes in the DAG model to obtain a node set;
screening out the nodes with in-degree 0: if there is only one node with in-degree 0, putting it directly into the job queue; if there are several nodes with in-degree 0, further screening out the node with the smallest job size; if only one node has the smallest job size, putting it directly into the job queue; and if several nodes share the smallest job size, randomly selecting one of them to put into the job queue;
deleting the enqueued node from the node set, and deleting the dependency relationships between the enqueued node and its successor nodes; judging whether the current node set is empty: if so, outputting the job queue; if not, returning to the previous step to enqueue the next node.
Because the DAG has a dependency order, the job nodes must execute in sequence. When task scheduling is performed, the job-node set T is first copied and named T_copy; a node with in-degree 0 is found and put into the job queue TQ, and the enqueued job node is deleted from the T_copy set. Next, the dependencies between that node and its connected successor nodes are removed, i.e. the corresponding e_ij values are set to -1. If there are several nodes with in-degree 0, the job with the smallest job size is selected; if several jobs have the same size, one is selected at random. Finally, these steps are repeated to form the job queue, processors and voltages are then randomly assigned to each task, and a genetic algorithm is used to optimize the result.
Taking fig. 2 as an example, the execution steps are:
finding a node with the degree of income of 0 in the T _ copy operation set to obtain T0,t0In the entry queue TQ, TQ is { t }0}. Deleting T in T _ copy0Node, T _ copy ═ T1,t2,t3,t4,t5}. Will be compared with t0Is directly succeeding node t1、t2、t3Communication amount e of01、e02、e03Is set to-1.
Finding out node T with 0 degree of income from nodes of T _ copy1、t2t 33, according to the principle of selecting the task with the minimum size and randomly selecting if the task with the same size, t2Performing enqueue operation, wherein TQ is equal to { t0,t2},T_copy={t1,t3,t4,t5Will be with t2Is directly succeeding node t4Communication amount e of24Is set to-1.
Finding out the point with the degree of income of 0 from the nodes of T _ copy as T1、t3According to the principle of selecting the minimum task size, t3Performing enqueue operation, wherein TQ is equal to { t0,t2,t3},T_copy={t1,t4,t5Will be with t3Is directly succeeding node t5Communication amount e of35Is set to-1.
Finding out the point with the degree of income of 0 from the nodes of T _ copy as T1,t1Performing enqueue operation, wherein TQ is equal to { t0,t2,t3,t1},T_copy={t4,t5Will be with t1Is directly succeeding node t4Communication amount e of14Is set to-1.
Finding out the point with the degree of income of 0 from the nodes of T _ copy as T4,t4Performing enqueue operation, wherein TQ is equal to { t0,t2,t3,t1,t4},T_copy={t5Will be with t4Is directly succeeding node t5Communication amount e of45Is set to-1.
Finding out the point with the degree of income of 0 from the nodes of T _ copy as T5,t5Performing enqueue operation, wherein TQ is equal to { t0,t2,t3,t1,t4,t5And f, ending the operation.
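The enqueue procedure above can be sketched as follows. This is a minimal illustration, assuming the DAG of fig. 2; the edge weights and the job sizes not stated in the text are invented for the example:

```python
import random

def build_job_queue(sizes, edges):
    """Enqueue jobs in topological order: repeatedly pick an in-degree-0 node,
    preferring the smallest job size and breaking remaining ties at random."""
    t_copy = set(sizes)
    e = dict(edges)                      # e[(i, j)] = communication amount e_ij
    tq = []
    while t_copy:
        ready = [n for n in t_copy
                 if not any(j == n and w != -1 for (i, j), w in e.items())]
        smallest = min(sizes[n] for n in ready)
        node = random.choice(sorted(n for n in ready if sizes[n] == smallest))
        tq.append(node)
        t_copy.remove(node)
        for (i, j) in e:                 # sever dependencies to direct successors
            if i == node:
                e[(i, j)] = -1
    return tq

# DAG of fig. 2; the sizes are illustrative, chosen so that t2 < t3 < t1.
sizes = {0: 1, 1: 4, 2: 2, 3: 3, 4: 5, 5: 6}
edges = {(0, 1): 3, (0, 2): 2, (0, 3): 4,
         (1, 4): 2, (2, 4): 1, (3, 5): 3, (4, 5): 2}
queue = build_job_queue(sizes, edges)    # reproduces TQ = {t0, t2, t3, t1, t4, t5}
```

With these assumed sizes the run is deterministic and yields exactly the order of the worked example.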
Preferably, the establishing of the network topology graph among the heterogeneous processors specifically includes:
the method comprises the steps of taking each processor as a node of a network topological graph, taking the execution speed of the processor as a node attribute, establishing a non-directional edge between the nodes according to whether communication between the processors is available, and taking the communication speed between the processors as a non-directional edge attribute to establish the network topological graph.
The network topology specifically comprises:
Net=(P,V,B);
where Net is the network topology, P is the processor set, P = {p0, p1, …, pd-1}, pa denotes the a-th processor, a = 0, 1, …, d-1, and d is the number of processors; V is the set of processor execution speeds, V = {v0, v1, …, vd-1}, where va denotes the execution speed of the a-th processor; B = {[b00, b01, …, b0,d-1], [b10, b11, …, b1,d-1], …, [bd-1,0, bd-1,1, …, bd-1,d-1]}, where bab denotes the communication relationship and communication bandwidth between the a-th and b-th processors: bab ≥ 0 means the two processors can communicate with each other and the communication bandwidth is bab, while bab = -1 means they cannot communicate.
The processor network topology of this embodiment is shown in fig. 3, which has 4 processor nodes, each drawn as a circle. The value in each node has two parts: the upper half is the processor number, assigned arbitrarily to distinguish the processors, and the lower half is the execution speed of that processor. An edge connecting pa and pb means that processors pa and pb can communicate bidirectionally, and the value on the edge is the communication speed, i.e. the communication bandwidth, between them. In fig. 3, p0 can communicate with every processor, so p0 may act as the host. There is no direct edge between p1 and p2, which means they cannot communicate. The communication speed between p0 and p3 is 2. The execution speeds also differ between processors: for example, p0's speed is 2 while p3's is 5, meaning that in the same time p3 can perform more computation than p0.
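The Net = (P, V, B) model can be represented directly as an adjacency matrix. The sketch below follows the description of fig. 3; only v0 = 2, v3 = 5, b03 = 2, p0 reaching every processor, and the missing p1-p2 link are from the text, while all other speeds and bandwidths are assumptions:

```python
d = 4                                     # number of processors p0..p3
V = [2, 3, 4, 5]                          # execution speeds; only v0 = 2 and v3 = 5 are from the text
NO_LINK = -1                              # b_ab = -1: processors a and b cannot communicate
B = [
    [0, 1, 1, 2],                         # b_03 = 2 as in fig. 3; other bandwidths are assumed
    [1, 0, NO_LINK, NO_LINK],
    [1, NO_LINK, 0, NO_LINK],
    [2, NO_LINK, NO_LINK, 0],
]

def can_communicate(a, b):
    """b_ab >= 0 means p_a and p_b can exchange data at bandwidth b_ab."""
    return B[a][b] >= 0

# A processor that can reach every other one may act as the host (p0 here).
hosts = [a for a in range(d) if all(can_communicate(a, b) for b in range(d))]
```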
Preferably, the allocating the initial task amount to each processor according to the calculation speed of each processor specifically includes:
according to the calculation speed of each processor, calculating the hyper-parameter of each processor:
θa = W(pa) / (W(p1) + W(p2) + … + W(pd));
where θa is the hyper-parameter of the a-th processor, W(pa) is the computation rate of the a-th processor, W(pb) is the computation rate of the b-th processor, 1 ≤ a ≤ d, b = 1, …, d, and d is the number of processors;
allocating initial task amount to each processor according to the hyper-parameters:
Num(pa)=θa*M;
where Num(pa) is the initial amount of tasks allocated to the a-th processor and M is the total number of tasks.
The improvement of the invention lies in that, when performing the genetic evolution operation, tasks are not distributed evenly to the processors; instead, a hyper-parameter θa is introduced to control the distribution of the task amount.
Specifically, Num(pa) is calculated from θa and rounded to the nearest integer, and the tasks are then distributed in descending order of θa. If

Σ a=1..d Num(pa) ≠ M,

the last processor is instead assigned

M − Σ a=1..d−1 Num(pa)

tasks.
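A minimal sketch of this proportional allocation, with the last processor absorbing the rounding remainder (the compute rates used are illustrative, not from the patent):

```python
def allocate(rates, M):
    """Split M tasks across processors in proportion to compute rate W(p_a);
    the last processor absorbs the rounding remainder so the totals match."""
    total = sum(rates)
    theta = [w / total for w in rates]            # hyper-parameters theta_a
    nums = [round(t * M) for t in theta]
    nums[-1] = M - sum(nums[:-1])                 # M - sum_{a=1}^{d-1} Num(p_a)
    return nums

# Illustrative rates, not taken from the patent.
nums = allocate([2, 3, 4, 5], 10)
```

Note that the ordering of the distribution (descending θa) is omitted here; only the amounts are computed.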
After the task allocation amounts are determined, a processor and a voltage are randomly assigned to each job. The finally formed initial scheduling list is S = {(p1,v1), (p1,v2), (p2,v2), (p3,v0), (p3,v1), (p2,v3)}. The specific scheduling process is shown in table 1:
TABLE 1 task scheduling flow sheet
Preferably, each processor executes the genetic algorithm in parallel to perform population iteration updating to obtain an optimal population, specifically:
establishing an optimization model by taking the minimum energy consumption as an objective function and taking the task execution time as a constraint condition;
evaluating the fitness of the current population according to the objective function, and carrying out genetic operations of a selection operator, a crossover operator and a mutation operator on the current population in parallel by each processor; each processor exchanges information to realize population updating;
and judging whether an iteration termination condition is reached, if so, outputting the current population as an optimal population, and otherwise, turning to the previous step for next iteration.
Firstly, an optimization model is established with the minimum task-scheduling energy consumption as the objective, together with a constraint condition; task scheduling adopts an integral scheduling mode in which a voltage is provided at the same time as each task node is allocated to a processor.
The novel parallel genetic algorithm model provided by this embodiment is named NPGA. Each parameter of NPGA is initialized, the best individual is used as the exchange object, and information is exchanged at every evolution generation. To facilitate this exchange, the model is first defined as NPGA = (P, C, F, N, NGA), where P is the set of processors; C is the content exchanged between processors, i.e. the rule by which individuals in a subgroup are exchanged (for example, exchanging the best individual of the subgroup, or a randomly selected individual); F is the frequency of information exchange; N is the amount of information exchanged each time, i.e. the number of exchanged individuals; and NGA is the genetic algorithm run on each processor.
The information exchange is specifically as follows: when the pre-specified information exchange time, i.e. the exchange frequency F, is reached, each processor sends the information C to be exchanged to the other processors while also receiving exchanged information from them. Each processor then replaces one or more of its own individuals with the received exchange information according to a given rule; the number replaced depends on N. Population iteration continues according to these steps until the iteration termination condition is met.
The novel genetic algorithm NGA is as follows. Define a genetic generation counter t and initialize it. Define a chromosome, which must satisfy uniqueness; the scheduling list S is chosen as the chromosome. The initial population is Pl(t), with t initialized to 0. Define the maximum number of generations T, the population size M, the initial crossover rate Pc and the mutation rate Pm.
While t < T, define a parameter l initialized to l = 1, calculate Num(pa), and execute the following steps in parallel in a loop until the termination condition is reached: evaluate the fitness of population Pl(t); apply the selection operator, the crossover operator and the mutation operator to Pl(t). If the defined information-exchange frequency F is reached, exchange information according to the steps above to obtain the offspring population Pl(t+1) = N[Pl(t), C1, C2, …, Ck], where the Ck are the exchanged contents and the size of k depends on the number of exchanges N. The counter t is then incremented.
Specifically, after the initial scheduling list is formed, the fitness values are normalized and the first M individuals are selected proportionally to form the sub-populations. Taking the DAG graph as an example for further analysis: the genetic generation is initialized to 0, and the initial population is S = {(p1,v1), (p1,v2), (p2,v2), (p3,v0), (p3,v1), (p2,v3)}. The maximum number of generations is set to 500 and the population size to 6, so each processor receives a sub-population of size 2; while the generation count is below 500, the NGA algorithm is executed in parallel on each processor. The fitness obtained for the initial population is shown in table 2:
TABLE 2 initialization fitness table
After the selection operation, single-point crossover is performed on the result with a crossover rate of 0.8, and mutation is carried out according to the mutation rate formula. If the exchange frequency is reached, information is exchanged to obtain the next-generation population Pl(t+1) = N[Pl(t), C1, C2, …, Ck], until the 500-iteration limit is reached.
The NPGA algorithm provided in this example is described as follows:
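The NPGA listing appears only as an image in the original publication. As a rough sketch under stated assumptions (the schedule encoding, the toy energy function standing in for EC_total, and every parameter value are illustrative, not from the patent), the island model the text describes, i.e. per-processor NGA generations with periodic exchange of the best individual, could look like:

```python
import random
random.seed(1)

TASKS, PROCS, VOLTS = 6, 3, 4

def energy(s):
    """Stand-in for F(S) = EC_total; always > 0. The real model sums the
    dynamic, static, transmission and sleep energy terms."""
    return 1 + sum(p + v for p, v in s)

def fitness(s):
    return 1.0 / (1.0 + energy(s))       # f(S) = 1 / (1 + F(S)), in (0, 1)

def random_schedule():
    return [(random.randrange(PROCS), random.randrange(VOLTS)) for _ in range(TASKS)]

def nga_generation(pop, pc=0.8, pm=0.1):
    """One NGA step: fitness-proportional selection, single-point crossover, mutation."""
    weights = [fitness(s) for s in pop]
    nxt = []
    while len(nxt) < len(pop):
        a, b = random.choices(pop, weights=weights, k=2)
        if random.random() < pc:                         # single-point crossover
            cut = random.randrange(1, TASKS)
            a = a[:cut] + b[cut:]
        a = [(random.randrange(PROCS), random.randrange(VOLTS))
             if random.random() < pm else gene for gene in a]
        nxt.append(a)
    return nxt

# Island model: one sub-population per processor; every F generations the best
# individual (exchange content C) migrates and replaces a neighbour's worst.
F_EXCHANGE, T_MAX = 5, 50
islands = [[random_schedule() for _ in range(6)] for _ in range(PROCS)]
for t in range(T_MAX):
    islands = [nga_generation(pop) for pop in islands]   # parallel in real NPGA
    if (t + 1) % F_EXCHANGE == 0:
        bests = [max(pop, key=fitness) for pop in islands]
        for i, pop in enumerate(islands):
            pop[pop.index(min(pop, key=fitness))] = bests[(i + 1) % PROCS]
best = max((s for pop in islands for s in pop), key=fitness)
```

Here the islands run sequentially for simplicity; in NPGA each island would run on its own processor.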
preferably, the minimum energy consumption is used as an objective function, the task execution time is used as a constraint condition, and an optimization model is established, specifically:
the total energy consumption is obtained:
ECtotal=ECdynamic+ECstatic+ECtrans+ECsleep
where ECtotal is the total energy consumption, ECdynamic the dynamic energy consumption, ECstatic the static energy consumption, ECtrans the transmission energy consumption, and ECsleep the sleep energy consumption;
establishing an objective function by taking the minimum total energy consumption as a target:
Emin=min(ECtotal);
where Emin denotes the objective function and min() takes the minimum value;
sequentially solving the earliest execution time and the earliest ending time of each operation node:
Start(ti, pa) = max{ End(tj, pb) + Cab : tj ∈ pre(ti) };
End(ti,pa)=Start(ti,pa)+W(ti)/W(pa);
where Start(ti, pa) denotes the earliest start time of job node ti on processor pa, pre(ti) is the set of direct predecessors tj of ti, End(tj, pb) denotes the earliest end time of job node tj on processor pb, and Cab denotes the communication time between the a-th and b-th processors; End(ti, pa) denotes the earliest end time of job node ti on processor pa, W(ti) denotes the job size of node ti, and W(pa) denotes the execution rate of the processor;
the communication time between processors:

Cab = eij / Bab;
where Cab denotes the communication time between the a-th and b-th processors, PLab the communication rate between them, and Bab the communication bandwidth between them;
taking the last executed job node as the exit node, the earliest end time of the exit node is constrained to establish the constraint condition:
End(texit)<deadline;
where texit denotes the exit job node, End(texit) its earliest end time, and deadline the latest completion time.
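The start/end recurrences and the deadline constraint can be computed by one pass over the topologically ordered job queue. A small sketch, assuming Cab = eij / Bab and an entirely invented instance (sizes, rates, bandwidths, assignment and deadline are not from the patent):

```python
def schedule_times(order, assign, sizes, rates, edges, bw):
    """Earliest start/end per the formulas above:
    Start(ti,pa) = max over direct predecessors tj (on pb) of End(tj,pb) + C_ab,
    End(ti,pa)   = Start(ti,pa) + W(ti) / W(pa),
    with C_ab taken as e_ij / B_ab (zero when both jobs share a processor)."""
    start, end = {}, {}
    for ti in order:                      # 'order' is the topological job queue
        pa = assign[ti]
        s = 0.0
        for (tj, tk), e in edges.items():
            if tk == ti:
                pb = assign[tj]
                comm = 0.0 if pa == pb else e / bw[pb][pa]
                s = max(s, end[tj] + comm)
        start[ti] = s
        end[ti] = s + sizes[ti] / rates[pa]
    return start, end

# Tiny illustrative instance (all values assumed).
sizes = {0: 2, 1: 4, 2: 2}
edges = {(0, 1): 4, (0, 2): 2}
rates = [2, 4]
bw = [[0, 2], [2, 0]]
assign = {0: 0, 1: 1, 2: 0}
start, end = schedule_times([0, 1, 2], assign, sizes, rates, edges, bw)
feasible = max(end.values()) < 10        # constraint End(t_exit) < deadline
```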
In this embodiment, DVFS is optimized by also considering the energy consumed during transmission and in the sleep state, achieving a better low-energy-consumption effect. In the transition stage after the previous task has finished and the next has not yet started, the system voltage is chosen by comparing the total energy consumption of keeping the previous task's voltage with that of the sleep state, and the lower-energy option is selected as the voltage for the transition stage.
Specifically, the dynamic power calculation formula is as follows:
Pa,m,k = C * va,m² * fa,k;
the dynamic energy consumption calculation formula is as follows:
ECdynamic=Pa,m,k*t;
where C is the capacitance and Pa,m,k is the dynamic power of processor pa at voltage va,m and frequency fa,k; ECdynamic is the dynamic energy consumption, and t denotes the corresponding time.
The static power calculation formula is as follows:
Pa,s=Ia*va,s
the static energy consumption calculation formula is as follows:
ECstatic=Pa,s*t
where ECstatic is the static energy consumption, Pa,s the static power, Ia the reverse-bias junction current, va,s the voltage assigned to the task, and t the corresponding time.
The transmission energy consumption calculation formula is as follows:
ECtrans = Pa,s * comm(ti,tj) / B(ti,tj);
where ECtrans is the transmission energy consumption, Pa,s the communication power, B(ti,tj) the communication bandwidth between the processor of job ti and the processor of task tj, and comm(ti,tj) the amount of communication data from job ti to task tj.
The sleep energy consumption calculation formula is as follows:
ECsleep=Pa,sleep*t
where ECsleep is the sleep energy consumption, Pa,sleep the sleep power, and t the corresponding time.
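The four energy terms combine into EC_total as defined above. A minimal numeric sketch follows; the C * v² * f dynamic-power form is the usual DVFS expression and is an assumption for the image-only equation, and every numeric value is illustrative:

```python
def ec_dynamic(C, v, f, t):
    """EC_dynamic = P_{a,m,k} * t, with dynamic power assumed as C * v^2 * f."""
    return C * v ** 2 * f * t

def ec_static(I, v, t):
    return I * v * t                     # P_{a,s} = I_a * v_{a,s}

def ec_trans(p_comm, data, bandwidth):
    return p_comm * data / bandwidth     # power * (comm amount / bandwidth)

def ec_sleep(p_sleep, t):
    return p_sleep * t

# Illustrative numbers only: 1 nF effective capacitance, 1.2 V, 1 GHz, etc.
total = (ec_dynamic(1e-9, 1.2, 1e9, 2.0) + ec_static(1e-3, 1.2, 2.0)
         + ec_trans(0.5, 8.0, 4.0) + ec_sleep(0.01, 1.0))
```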
Preferably, the genetic operation of the mutation operator is performed on the current population, specifically:
respectively calculating the ratio of the fitness of each individual of the current population to the average fitness of the population;
judging whether the ratio is greater than 1, if so, reducing the variation rate of the corresponding individual, otherwise, increasing the variation rate of the corresponding individual;
and performing genetic operation of a mutation operator on the current population according to the regulated mutation rate.
Because the present invention solves the minimized objective function problem and the objective function value is greater than 0, the fitness function is chosen to be:
f(S) = 1 / (1 + F(S));
where F(S) is the objective function value, f(S) is the fitness value, and the range of f(S) is (0, 1).
Specifically, the operations of the three genetic operators are as follows:
Selection operator operation: since the objective function value F(S) is greater than 0, as F(S) increases, f(S) decreases; that is, the higher the energy consumption, the lower the fitness and the more likely the individual is eliminated. Conversely, as F(S) decreases, f(S) increases; in other words, the lower the energy consumption, the higher the fitness and the greater the probability of being selected. Because the fitness lies in (0, 1), the fitness values are normalized and the first M individuals are selected proportionally to form the sub-population.
Crossover operator operation: the crossover operator helps pass the chromosome segments of excellent individuals to the offspring; at the same time, it generally performs a global search and can explore unknown regions of the search space.
Mutation operator operation: the mutation operator gives the genetic algorithm a local random-search capability. After the crossover operation, the result of the genetic algorithm approaches the optimal solution, and adding mutation at this point can accelerate convergence to it; mutation also increases the diversity of the population. However, the mutation value adopted should differ at different times, so the invention provides an adaptive calculation method for the mutation operator. If the ratio of an individual's fitness to the average fitness is greater than or equal to 1, the individual tends toward high quality and its mutation rate should be reduced, lowering the mutation probability and preserving its good genes; if the ratio is less than 1, the individual tends toward poor quality and its mutation rate is increased, which helps generate new individuals and raises the chance of improvement, so that the whole population improves gradually. The formula is:
Pm = Pm0 * favg / f;

where Pm is the adjusted mutation rate, Pm0 the initial mutation rate, f the individual's fitness, and favg the average population fitness.
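The adaptive rule can be sketched directly from the qualitative description: individuals above the population average mutate less, those below mutate more. The closed form in the patent is an image, so the inverse scaling (and the base/cap values) here is an assumption:

```python
def adaptive_pm(f, f_avg, base=0.1, cap=0.5):
    """Scale the mutation rate with f_avg / f: if f / f_avg >= 1 the rate drops
    below 'base'; if f / f_avg < 1 it rises, capped at 'cap'."""
    if f <= 0:
        return cap
    return min(cap, base * f_avg / f)

pop_fitness = [0.8, 0.5, 0.2]            # illustrative fitness values
f_avg = sum(pop_fitness) / len(pop_fitness)
rates = [adaptive_pm(f, f_avg) for f in pop_fitness]
```

With the values above, the fittest individual gets the smallest rate and the weakest the largest, matching the stated behaviour.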
This embodiment improves the genetic-algorithm-based scheduling mode, which easily falls into a local optimum, by improving the mutation operator, and thereby realizes real-time task scheduling on a heterogeneous computing platform.
In order to verify the effect of the present invention, the present invention is compared with the prior art from different aspects, and the following specific description is provided:
Fig. 4 shows the runtime of the method on different platforms. As can be seen from fig. 4, at the beginning the advantage of the heterogeneous computing system in data processing is not well reflected: when the number of tasks is small, running on a single multi-core CPU avoids the communication cost between different types of processors and therefore takes less time, whereas running across multiple platforms incurs more inter-processor communication cost and takes longer. As the number of tasks grows, however, the heterogeneous computing platform gradually shows its advantage and far outperforms the single multi-core CPU in running speed. When the task amount is small, the advantages of the FPGA cannot be exploited, so the running time of CPU + GPU is essentially the same as that of CPU + GPU + FPGA; as the task amount increases, the advantage of CPU + GPU + FPGA is gradually revealed.
Fig. 5 compares the running time of the NGA and NPGA algorithms, i.e. the novel genetic algorithm provided by the invention executed in parallel versus non-parallel. As can be seen from fig. 5, as the number of tasks increases, the operating efficiency of the proposed NPGA improves greatly over the serial NGA algorithm. This advantage rests on the rapid development of computer technology: parallel techniques shorten the processing time of the same task, providing a solid guarantee for large-scale computing scenarios. The novel parallel genetic algorithm NPGA is built on the improved NGA algorithm, which redesigns the mutation operator of the traditional GA so that it adapts to the current conditions, providing a guarantee for escaping local optima; the new fitness function also fits the subject of the invention and, compared with the fitness calculation of the traditional GA, increases rigor and goodness of fit, which helps in searching for the global optimum.
Fig. 6 compares the energy consumption of the traditional GA algorithm and the NPGA algorithm of the invention. In terms of energy saving, the parallel genetic algorithm provided by the invention saves energy relative to the traditional genetic algorithm, as follows from the energy-consumption formula: total energy consumption is proportional to running time, and when the number of tasks is large, task scheduling based on the NPGA algorithm takes less time than the traditional GA algorithm, so the energy consumed is correspondingly reduced and the goal of low energy consumption is achieved.
By comparing the running time of different task counts on different execution platforms, and the running time of different task counts under different task scheduling algorithms on the CPU + GPU + FPGA platform, the following conclusions are finally obtained:
1) With a small task amount, the running time of the CPU + GPU + FPGA heterogeneous computing platform is longer than that of the multi-core CPU, because scheduling on the heterogeneous platform carries communication cost. With a large task amount, however, the heterogeneous platform runs faster than the multi-core CPU, because as tasks increase the heterogeneous platform can fully utilize its computing power and the extra communication time is far outweighed by the computation time saved relative to the multi-core CPU.
2) With a small task amount, the running times of the CPU + GPU + FPGA and CPU + GPU heterogeneous computing platforms are almost equal; with a large task amount, the execution time of CPU + GPU + FPGA is better than that of CPU + GPU.
3) Compared with the traditional GA algorithm, the parallelism added in the proposed NPGA algorithm increases the operation speed and shortens the running time, laying a foundation for achieving low energy consumption.
4) As the running time shortens, the goal of low energy consumption follows from the energy-consumption formula. On top of the NPGA algorithm, the adaptive mutation operator and the improved fitness calculation function alleviate the local-optimum problem and provide a new way of approaching the global optimum.
In this embodiment, a CPU + GPU + FPGA heterogeneous computing system is first built with OpenCL. The CPU serves as the host, responsible for distributing and scheduling tasks and collecting the results of the GPU + FPGA computing devices; the GPU and FPGA serve as computing devices responsible for processing the jobs. After the heterogeneous computing system is built, the job execution order is derived from the dependency relationships of the DAG task graph, starting from the job nodes with an in-degree of 0. The task node set is copied to form the T_copy set, an empty queue is constructed, each node with an in-degree of 0 is enqueued and deleted from T_copy, and the communication amount of the directed edges between that node and its direct successors is set to -1. If several nodes have an in-degree of 0, a shortest-job-first rule is followed, and if several such jobs have the same size, one is selected at random; this finally forms the task queue. In the choice of scheduling algorithm, the important differences of heterogeneous computing in computing power, communication time and so on are fully considered, and the novel parallel genetic algorithm NPGA is proposed to solve the problem of the traditional parallel genetic algorithm: there, the number of tasks processed by each processor is fixed, and heterogeneous computing power, communication cost and the like are not considered. The NPGA provided by the invention improves on this by distributing the number of tasks according to the computing power of each processor.
On the basis of the novel parallel genetic algorithm NPGA, each processor runs the novel genetic algorithm NGA. The classic genetic algorithm easily falls into a local optimum; the invention remedies this defect with a novel fitness calculation method and an adaptive mutation-operator probability calculation, which effectively overcome the traditional genetic algorithm's tendency toward local optima. According to the job queue and the number of jobs assigned to each processor, a scheduling queue is generated at random in an integral scheduling mode. The task is modeled from the known conditions: the objective function is the minimum energy consumption, and the constraint is that the latest execution end time be less than the final deadline. On each processor, the population is divided into M individuals according to the number of distributed tasks, and the fitness evaluation, selection operator, crossover operator and mutation operator of the genetic algorithm are executed in parallel. After each generation, every processor receives the exchange information sent by the other processors, which carries the best sub-population information of those processors; at the same time, each processor sends its own best-individual information to the others, achieving information sharing. The genetic iteration is repeated until the maximum number of generations is reached, yielding the optimal scheduling list S. According to this scheduling list, the minimum energy consumption and the optimal distribution are obtained while meeting the real-time requirement.
Example 2
Embodiment 2 of the present invention provides a task scheduling apparatus based on heterogeneous computing, including a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the task scheduling method based on heterogeneous computing provided in embodiment 1 is implemented.
The task scheduling device based on heterogeneous computing provided in the embodiments of the present invention is used to implement the task scheduling method based on heterogeneous computing, and therefore, the task scheduling device based on heterogeneous computing also has the technical effect, and is not described herein again.
The method provided by the invention can be applied to a heterogeneous system of any form, i.e. CPU + GPU, CPU + FPGA, or CPU + GPU + FPGA. In this embodiment, the CPU + GPU + FPGA heterogeneous system is selected. This heterogeneous computing structure makes full use of the respective advantages of the CPU, GPU and FPGA: the CPU is responsible for logic-control operations, the GPU and FPGA are responsible for parallelized accelerated processing, and in addition the FPGA has low power consumption and supports customizable reprogramming, greatly improving flexibility and performance. However, the prior art rarely uses such a heterogeneous system containing more than two types of processors: although combining heterogeneous processors brings performance improvement, the disadvantage is equally obvious, since switching between different processors increases energy consumption. To address the high energy consumption that accompanies the performance gain, an optimized scheduling algorithm is urgently needed; power is reduced by lowering the processor core frequency, extending the working time, and reducing processor idle time.
Specifically, before task scheduling is carried out, a heterogeneous computing system of CPU + GPU + FPGA is built according to OpenCL. The CPU is used as a host machine and is responsible for distributing and scheduling tasks and counting results of the GPU + FPGA computing equipment. And the GPU and the FPGA are used as computing equipment and are responsible for processing the operation. The specific implementation steps for developing the CPU + GPU + FPGA heterogeneous system by adopting the OpenCL framework are as follows:
firstly, available equipment is obtained, namely, the equipment without tasks at present is obtained, and according to the scheduling principle, the most appropriate equipment is selected from the available equipment to complete loading and initialization.
A context environment is created, the role of the context being responsible for managing the device.
Command queues are created, one for each device, thereby ensuring independence.
And sending the commands to the command queue by using the context according to the context environment created in the step, and executing the commands by the equipment corresponding to the command queue according to the order.
The context environment is used for creating and managing a device cache which is used for storing data to be processed by the program, and one or more devices managed by the context environment can share the data in the device cache.
Because the host machine plays the central, brain-like role in the whole heterogeneous system, its data is written into the device cache so that the other devices can smoothly receive the data it shares.
The source program file is acquired in preparation for later task execution.
And creating a device program, and writing the device program by the OpenCL framework, wherein the part of the program can run on a corresponding device.
And acquiring the corresponding parameter configuration of the equipment program, and initializing the parameters for the smooth execution of the subsequent task.
The index space, workgroup, and working instance also initialize their parameters in preparation for smooth execution of subsequent computational tasks.
And the preparation work is completely finished, and the equipment executes the task according to the equipment program.
After the computing equipment completes the computation, the computation result is written into the local memory, and meanwhile, the computation result is returned to the host machine.
And the host machine receives a result returned by the computing equipment, reads the output cache, acquires the Simon, namely the completion of the current computing task of the computing result, releases resources and waits for the next task.
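The host/device flow described above can be sketched in plain Python, with ordinary queues standing in for OpenCL command queues (the class and function names below are illustrative stand-ins, not the OpenCL API):

```python
from queue import Queue

class Device:
    """Stand-in for an OpenCL compute device (GPU or FPGA)."""
    def __init__(self, name):
        self.name = name
        self.commands = Queue()   # one command queue per device
        self.busy = False

class Context:
    """Stand-in for an OpenCL context: manages devices and a shared cache."""
    def __init__(self, devices):
        self.devices = devices
        self.cache = {}           # device cache shared by all managed devices

def dispatch(ctx, task_id, data):
    """Host side: pick an idle device and enqueue a command for it."""
    idle = [d for d in ctx.devices if not d.busy]
    dev = idle[0]                 # "most suitable" reduced to "first idle" here
    ctx.cache[task_id] = data     # host writes its data into the device cache
    dev.commands.put(task_id)
    dev.busy = True
    return dev

def execute(ctx, dev, kernel):
    """Device side: pop the next command in order, run it, return the result."""
    task_id = dev.commands.get()
    result = kernel(ctx.cache[task_id])
    dev.busy = False              # release resources, wait for the next task
    return result                 # returned to the host

ctx = Context([Device("GPU"), Device("FPGA")])
dev = dispatch(ctx, "t0", [1, 2, 3])
print(execute(ctx, dev, sum))     # prints 6
```

In a real OpenCL host program these roles are played by clCreateContext, clCreateCommandQueue, clCreateBuffer, and the clEnqueue* calls; the sketch only mirrors the control flow of the steps above.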
Example 3
Embodiment 3 of the present invention provides a computer storage medium on which a computer program is stored; when executed by a processor, the computer program implements the heterogeneous-computing-based task scheduling method provided in embodiment 1.
The computer storage medium provided by this embodiment of the invention is used to implement the task scheduling method based on heterogeneous computing, and therefore has the technical effects of that method; the details are not repeated herein.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A task scheduling method based on heterogeneous computing is characterized by comprising the following steps:
establishing a DAG model of a task to be scheduled, and establishing an operation queue according to the DAG model;
establishing a network topological graph among heterogeneous processors, and distributing initial task amount for each processor according to the calculation speed of each processor;
randomly distributing tasks in the job queue to each processor according to the initial task amount, randomly distributing initial voltage to each processor, and constructing an initial scheduling list;
generating an initial population according to the initial scheduling list, initializing parameters of a genetic algorithm, and executing the genetic algorithm in parallel by each processor to perform population iteration updating to obtain an optimal population;
and acquiring a task scheduling list corresponding to the optimal population as an optimal task scheduling list.
2. The task scheduling method based on heterogeneous computing according to claim 1, wherein the establishing of the DAG model of the task to be scheduled specifically comprises:
establishing the DAG model by taking the operations contained in the task to be scheduled as nodes of the DAG model, taking the operation execution time as the node attribute, establishing directed edges between the nodes according to the execution-order dependency relationships between the operations, and taking the communication traffic between the operations as the directed-edge attribute.
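As an illustrative sketch (the jobs, times, and traffic values below are made up for demonstration, not from the patent), the DAG model of claim 2 can be held in two plain dictionaries:

```python
# node attribute: operation -> execution time (made-up values)
exec_time = {"t1": 4.0, "t2": 2.5, "t3": 3.0, "t4": 1.5}

# directed-edge attribute: (predecessor, successor) -> communication traffic
comm = {("t1", "t2"): 10, ("t1", "t3"): 6, ("t2", "t4"): 8, ("t3", "t4"): 4}

def successors(node):
    """Operations that depend on `node` finishing first."""
    return [v for (u, v) in comm if u == node]

def in_degree(node):
    """Number of operations that must finish before `node` can start."""
    return sum(1 for (u, v) in comm if v == node)

print(successors("t1"))   # ['t2', 't3']
print(in_degree("t4"))    # 2
```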
3. The task scheduling method based on heterogeneous computing according to claim 2, wherein the job queue is established according to the DAG model, specifically:
copying all nodes in the DAG model to obtain a node set;
screening out the nodes with an in-degree of 0; if there is only one node with an in-degree of 0, putting it directly into the job queue; if there are multiple nodes with an in-degree of 0, further screening out the node with the smallest job size; if there is only one node with the smallest job size, putting it directly into the job queue; and if there are multiple nodes with the smallest job size, randomly selecting one of them to put into the job queue;
deleting the enqueued node from the node set, and deleting the dependency relationships of its successor nodes on the enqueued node; judging whether the current node set is empty; if so, outputting the job queue, and if not, returning to the previous step to enqueue the next node.
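The enqueue procedure of claim 3 can be sketched as follows (the data values are hypothetical; where the claim chooses randomly among nodes of equal smallest job size, this sketch simply takes the first minimum):

```python
def build_job_queue(nodes, edges, job_size):
    """Repeatedly enqueue a node with no remaining predecessors, breaking
    ties by the smallest job size, then delete the node and the
    dependencies it imposes on its successors."""
    remaining = set(nodes)          # copy of all DAG nodes
    deps = set(edges)               # remaining dependency edges
    queue = []
    while remaining:
        ready = [n for n in remaining if all(v != n for (u, v) in deps)]
        nxt = min(ready, key=lambda n: job_size[n])   # smallest job size wins
        queue.append(nxt)
        remaining.remove(nxt)
        deps = {(u, v) for (u, v) in deps if u != nxt}
    return queue

nodes = ["t1", "t2", "t3", "t4"]
edges = [("t1", "t2"), ("t1", "t3"), ("t2", "t4"), ("t3", "t4")]
size = {"t1": 5, "t2": 1, "t3": 2, "t4": 3}
print(build_job_queue(nodes, edges, size))   # ['t1', 't2', 't3', 't4']
```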
4. The task scheduling method based on heterogeneous computing according to claim 1, wherein the network topology graph between heterogeneous processors is established, specifically:
establishing the network topology graph by taking each processor as a node of the graph, taking the execution speed of the processor as the node attribute, establishing undirected edges between the nodes according to whether the processors can communicate with each other, and taking the communication speed between the processors as the undirected-edge attribute.
5. The task scheduling method based on heterogeneous computing according to claim 1, wherein the allocating of the initial task amount to each processor according to the computing speed of each processor specifically comprises:
according to the calculation speed of each processor, calculating the hyper-parameter of each processor:
θ_a = W(p_a) / ∑_{b=1}^{d} W(p_b);
wherein θ_a is the hyper-parameter of the a-th processor, W(p_a) is the computation rate of the a-th processor, W(p_b) is the computation rate of the b-th processor, 1 ≤ a ≤ d, b = 1, …, d, and d is the number of processors;
allocating initial task amount to each processor according to the hyper-parameters:
Num(p_a) = θ_a * M;
wherein Num(p_a) is the initial task amount allocated to the a-th processor and M is the total number of tasks.
6. The task scheduling method based on heterogeneous computing according to claim 1, wherein each processor executes a genetic algorithm in parallel to perform population iteration updating to obtain an optimal population, specifically:
establishing an optimization model by taking the minimum energy consumption as an objective function and taking the task execution time as a constraint condition;
evaluating the fitness of the current population according to the objective function, and carrying out genetic operations of a selection operator, a crossover operator and a mutation operator on the current population in parallel by each processor; each processor exchanges information to realize population updating;
and judging whether an iteration termination condition is reached, if so, outputting the current population as an optimal population, and otherwise, turning to the previous step for next iteration.
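A serial, single-process sketch of the iteration in claim 6 is given below; two sub-populations stand in for the per-processor populations, the fitness function and all parameter values are illustrative, and the periodic copy of the best individual models the information exchange between processors:

```python
import random

def evolve(fitness, pops, generations=50, migrate_every=10, pm=0.1):
    """Each sub-population applies selection, crossover, and mutation;
    every few generations the global best individual is copied into
    every sub-population (the information exchange of claim 6)."""
    for g in range(generations):
        for i, pop in enumerate(pops):
            pop.sort(key=fitness)                   # lower fitness = less energy
            parents = pop[: len(pop) // 2]          # selection operator (elitist)
            children = []
            while len(parents) + len(children) < len(pop):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, len(a))   # one-point crossover operator
                child = a[:cut] + b[cut:]
                if random.random() < pm:            # mutation operator
                    j = random.randrange(len(child))
                    child[j] = random.randint(0, 3) # reassign job to a processor
                children.append(child)
            pops[i] = parents + children
        if g % migrate_every == 0:                  # population update by exchange
            best = min((min(p, key=fitness) for p in pops), key=fitness)
            for p in pops:
                p[-1] = best[:]
    return min((min(p, key=fitness) for p in pops), key=fitness)

random.seed(0)
# toy encoding: an individual assigns each of 6 jobs to one of 4 processors,
# and the stand-in "energy" is simply the sum of the assigned processor indices
pops = [[[random.randint(0, 3) for _ in range(6)] for _ in range(8)] for _ in range(2)]
print(sum(evolve(sum, pops)))
```

Because the sorted better half is always carried over, the best individual found so far is never lost between generations.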
7. The task scheduling method based on heterogeneous computing according to claim 6, wherein an optimization model is established with a minimum energy consumption as an objective function and a task execution time as a constraint condition, and specifically comprises:
and (3) obtaining total energy consumption:
ECtotal=ECdynamic+ECstatic+ECtrans+ECsleep
wherein, ECtotalFor total energy consumption, ECdynamicFor dynamic energy consumption, ECstaticFor static energy consumption, ECtransFor transmission of energy consumption, ECsleepEnergy consumption for sleep;
establishing an objective function by taking the minimum total energy consumption as a target:
E_min = min(EC_total);
wherein E_min represents the objective function and min() represents taking the minimum value;
sequentially calculating the earliest execution time and the earliest ending time of each operation node, taking the operation node executed last as an exit node, constraining the earliest ending time of the exit node, and establishing a constraint condition:
End(t_exit) < deadline;
wherein t_exit represents the exit job node, End(t_exit) represents the earliest ending time of the exit job node t_exit, and deadline represents the latest completion deadline.
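The energy model and deadline constraint of claim 7 reduce to the following sketch (all numeric values are made up for demonstration):

```python
def total_energy(dynamic, static, trans, sleep):
    """EC_total = EC_dynamic + EC_static + EC_trans + EC_sleep (claim 7)."""
    return dynamic + static + trans + sleep

def meets_deadline(end_times, deadline):
    """Constraint End(t_exit) < deadline: the exit node is the job that
    finishes last, so its earliest ending time is the largest end time."""
    return max(end_times) < deadline

print(total_energy(12.0, 3.0, 1.5, 0.5))    # 17.0
print(meets_deadline([4.0, 7.5, 9.0], 10.0))  # True
```

A candidate schedule is then evaluated by its total energy, and discarded if its exit node misses the deadline.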
8. The task scheduling method based on heterogeneous computing according to claim 6, wherein the genetic operation of a mutation operator is performed on the current population, specifically:
respectively calculating the ratio of the fitness of each individual of the current population to the average fitness of the population;
judging whether the ratio is greater than 1; if so, reducing the mutation rate of the corresponding individual, and otherwise increasing the mutation rate of the corresponding individual;
and performing genetic operation of a mutation operator on the current population according to the regulated mutation rate.
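The adaptive mutation rate of claim 8 can be sketched as below; the base rate and scaling factor are illustrative choices, not values from the patent:

```python
def adapted_mutation_rates(fitnesses, base_rate=0.1, factor=0.5):
    """Claim 8 sketch: individuals whose fitness-to-average ratio exceeds 1
    get a reduced mutation rate, the rest an increased one."""
    avg = sum(fitnesses) / len(fitnesses)
    return [base_rate * factor if f / avg > 1 else base_rate / factor
            for f in fitnesses]

# fitnesses 2.0, 1.0, 3.0 -> average 2.0; only the third exceeds it
print(adapted_mutation_rates([2.0, 1.0, 3.0]))   # [0.2, 0.2, 0.05]
```

This keeps strong individuals stable while pushing weak ones to explore, which is the usual rationale for fitness-adaptive mutation.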
9. A task scheduling apparatus based on heterogeneous computing, comprising a processor and a memory, wherein the memory stores a computer program, and the computer program, when executed by the processor, implements the task scheduling method based on heterogeneous computing according to any one of claims 1 to 8.
10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the heterogeneous computing based task scheduling method according to any one of claims 1 to 8.
CN202011245253.7A 2020-11-10 2020-11-10 Task scheduling method and device based on heterogeneous computing Pending CN112328380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011245253.7A CN112328380A (en) 2020-11-10 2020-11-10 Task scheduling method and device based on heterogeneous computing

Publications (1)

Publication Number Publication Date
CN112328380A true CN112328380A (en) 2021-02-05

Family

ID=74317869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011245253.7A Pending CN112328380A (en) 2020-11-10 2020-11-10 Task scheduling method and device based on heterogeneous computing

Country Status (1)

Country Link
CN (1) CN112328380A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102866912A (en) * 2012-10-16 2013-01-09 首都师范大学 Single-instruction-set heterogeneous multi-core system static task scheduling method
CN106250650A (en) * 2016-08-15 2016-12-21 北京理工大学 The resource allocation and optimization method of model in high flux emulation
CN107133091A (en) * 2017-05-08 2017-09-05 武汉轻工大学 The cloud workflow task dispatching method being classified based on top-down task
CN108416465A (en) * 2018-01-31 2018-08-17 杭州电子科技大学 A kind of Workflow optimization method under mobile cloud environment
CN108762927A (en) * 2018-05-29 2018-11-06 武汉轻工大学 The multiple target method for scheduling task of mobile cloud computing
CN108829501A (en) * 2018-05-18 2018-11-16 天津科技大学 A kind of batch processing scientific workflow task scheduling algorithm based on improved adaptive GA-IAGA
CN109960576A (en) * 2019-03-29 2019-07-02 北京工业大学 A kind of low energy consumption task scheduling strategy towards CPU-GPU isomery
CN110908782A (en) * 2019-11-01 2020-03-24 湖北省楚天云有限公司 Genetic algorithm optimization-based packaging type distributed job task scheduling method and system
CN111061569A (en) * 2019-12-18 2020-04-24 北京工业大学 Heterogeneous multi-core processor task allocation and scheduling strategy based on genetic algorithm
CN111209095A (en) * 2019-08-20 2020-05-29 杭州电子科技大学 Pruning method based on tree search in DAG parallel task scheduling

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127167B (en) * 2021-03-18 2023-11-03 国家卫星气象中心(国家空间天气监测预警中心) Heterogeneous resource intelligent parallel scheduling method based on improved genetic algorithm
CN113127167A (en) * 2021-03-18 2021-07-16 国家卫星气象中心(国家空间天气监测预警中心) Heterogeneous resource intelligent parallel scheduling method based on improved genetic algorithm
CN113327442A (en) * 2021-04-30 2021-08-31 广州中国科学院软件应用技术研究所 Cooperative control system and method based on end cloud fusion
WO2022236834A1 (en) * 2021-05-14 2022-11-17 Alipay (Hangzhou) Information Technology Co., Ltd. Method and system for scheduling tasks
CN113448736A (en) * 2021-07-22 2021-09-28 东南大学 Task mapping method for approximate computation task on multi-core heterogeneous processing platform based on energy and QoS joint optimization
CN113448736B (en) * 2021-07-22 2024-03-19 东南大学 Task mapping method based on energy and QoS joint optimization for approximate calculation task on multi-core heterogeneous processing platform
CN113568730A (en) * 2021-08-03 2021-10-29 北京八分量信息科技有限公司 Constraint scheduling method and device for heterogeneous tasks and related products
CN114020476A (en) * 2021-12-30 2022-02-08 荣耀终端有限公司 Job processing method, device and medium
CN114020476B (en) * 2021-12-30 2022-06-03 荣耀终端有限公司 Job processing method, device and medium
CN114092073A (en) * 2022-01-21 2022-02-25 苏州浪潮智能科技有限公司 Method, system and device for converting undirected weighted data graph into DAG task graph
CN114092073B (en) * 2022-01-21 2022-04-22 苏州浪潮智能科技有限公司 Method, system and device for converting undirected weighted data graph into DAG task graph
CN115237582B (en) * 2022-09-22 2022-12-09 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
CN115237582A (en) * 2022-09-22 2022-10-25 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
CN117056089A (en) * 2023-10-11 2023-11-14 创瑞技术有限公司 Service dynamic allocation system and method
CN117056089B (en) * 2023-10-11 2024-02-06 创瑞技术有限公司 Service dynamic allocation system and method
CN117453379A (en) * 2023-12-25 2024-01-26 麒麟软件有限公司 Scheduling method and system for AOE network computing tasks in Linux system
CN117453379B (en) * 2023-12-25 2024-04-05 麒麟软件有限公司 Scheduling method and system for AOE network computing tasks in Linux system
CN117556893A (en) * 2024-01-12 2024-02-13 芯动微电子科技(武汉)有限公司 GPU operation GEMM optimization method and device based on parallel genetic algorithm
CN117556893B (en) * 2024-01-12 2024-05-03 芯动微电子科技(武汉)有限公司 GPU operation GEMM optimization method and device based on parallel genetic algorithm
CN117891584A (en) * 2024-03-15 2024-04-16 福建顶点软件股份有限公司 Task parallelism scheduling method, medium and device based on DAG grouping
CN117891584B (en) * 2024-03-15 2024-05-14 福建顶点软件股份有限公司 Task parallelism scheduling method, medium and device based on DAG grouping

Similar Documents

Publication Publication Date Title
CN112328380A (en) Task scheduling method and device based on heterogeneous computing
Abed-Alguni et al. Distributed Grey Wolf Optimizer for scheduling of workflow applications in cloud environments
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
Ahmed et al. Using differential evolution and Moth–Flame optimization for scientific workflow scheduling in fog computing
CN110717574B (en) Neural network operation method and device and heterogeneous intelligent chip
Tantalaki et al. Pipeline-based linear scheduling of big data streams in the cloud
Rabiee et al. Job scheduling in grid computing with cuckoo optimization algorithm
CN115330189A (en) Workflow optimization scheduling method based on improved moth flame algorithm
Wu et al. Adaptive DAG tasks scheduling with deep reinforcement learning
Shafique et al. Minority-game-based resource allocation for run-time reconfigurable multi-core processors
Asghari et al. Combined use of coral reefs optimization and reinforcement learning for improving resource utilization and load balancing in cloud environments
Mojab et al. iCATS: Scheduling big data workflows in the cloud using cultural algorithms
Ma et al. Adaptive stochastic gradient descent for deep learning on heterogeneous cpu+ gpu architectures
CN115016938A (en) Calculation graph automatic partitioning method based on reinforcement learning
Zhou et al. Deep reinforcement learning-based algorithms selectors for the resource scheduling in hierarchical cloud computing
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
Chraibi et al. Makespan optimisation in cloudlet scheduling with improved DQN algorithm in cloud computing
Öz et al. Scalable parallel implementation of migrating birds optimization for the multi-objective task allocation problem
Pérez et al. Parallel/distributed implementation of cellular training for generative adversarial neural networks
Muthu et al. Optimized scheduling and resource allocation using evolutionary algorithms in cloud environment
CN117579701A (en) Mobile edge network computing and unloading method and system
CN116841710A (en) Task scheduling method, task scheduling system and computer storage medium
Kumar et al. EAEFA: An Efficient Energy-Aware Task Scheduling in Cloud Environment
Singh Hybrid genetic, variable neighbourhood search and particle swarm optimisation-based job scheduling for cloud computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination