CN112463346B - Heuristic processor partitioning method, system and storage medium for DAG task based on partition scheduling - Google Patents


Info

Publication number: CN112463346B (application CN202011631493.0A; earlier publication CN112463346A)
Authority: CN (China)
Prior art keywords: task, processor, time, subtasks, ready
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 张伟哲, 吴毓龙, 何慧, 方滨兴
Original and current assignee: Shenzhen Graduate School Harbin Institute of Technology
Application filed by Shenzhen Graduate School Harbin Institute of Technology; priority to CN202011631493.0A; published as CN112463346A, application granted and published as CN112463346B.

Classifications

    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues (under G Physics; G06F Electric digital data processing; G06F9/48 Program initiating, program switching)
    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals (under G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU])

Abstract

The invention provides a heuristic processor partitioning method, system and storage medium for DAG tasks based on partitioned scheduling. It first derives a response-time analysis for DAG tasks under a partitioned fixed-priority scheduling algorithm; based on the intuition of this analysis, the invention proposes a Greedy Parallel Execution Cluster (GPEC) processor allocation strategy, which takes into account the topology of DAG tasks and the self-interference among subtasks within a task. The invention has the following beneficial effects: the GPEC strategy considers the influence of the internal topology and self-interference of DAG tasks. In addition, the invention ports a real-time system to an embedded board and evaluates the performance of the GPEC strategy on a real platform. Compared with two state-of-the-art processor allocation strategies in the experiments, the GPEC strategy reduces average WCRT by up to 35.59% and improves the schedulable ratio of DAG task sets by up to 76%.

Description

Heuristic processor partitioning method, system and storage medium for DAG task based on partition scheduling
Technical Field
The invention relates to the technical field of data processing, and in particular to a heuristic processor partitioning method, system and storage medium for DAG tasks based on partitioned scheduling.
Background
With the increasing number of processors and the strict requirement of completing a large amount of computation before a deadline, more and more applications are migrated to embedded multiprocessor platforms [1], [2] on mobile terminals and edge clouds of different types to be executed in parallel. These parallel programs can typically be represented with a Directed Acyclic Graph (DAG) task model, where a DAG task is composed of subtasks and edges connecting the subtasks [3]. Subtasks represent sequential computations, and edges represent dependencies between the connected subtasks.
Fig. 1 shows a real-time obstacle avoidance application for an autonomous vehicle. In this scenario, there is an obstacle in front of vehicle A, while vehicle B is traveling in the left lane of vehicle A. To avoid the obstacle, the vehicle must safely plan a route based on information received from its body sensors and a roadside server. This application can thus be mapped to a DAG task with the following 7 subtasks. V_{i,0} represents a target recognition operation, in which car A recognizes the obstacle using information given by its front sensor. V_{i,1} is a switch to a change-of-travel-route mode of operation, which triggers the next three operations: slowing the car (V_{i,2}), obtaining information from the roadside server to determine that the road ahead is safe (V_{i,3}), and checking whether the lanes of both vehicles are safe using information from the side sensors (V_{i,4}). Finally, V_{i,5} and V_{i,6} are the operations that perform the steering controller's lane change and resume the normal driving mode, respectively. Due to the dependencies between these operations, V_{i,5} cannot be executed until V_{i,3} and V_{i,4} are complete. Otherwise, vehicle A might choose to turn left to avoid the obstacle, but such a choice could result in vehicle A colliding with vehicle B.
For parallel tasks on a multiprocessor platform, real-time scheduling algorithms fall into three types: global scheduling, partitioned scheduling, and federated scheduling. Under global scheduling [3]-[5], subtasks can be executed on any processor and can be migrated during execution. Federated scheduling and its variants [6]-[9] assign each task to a group of processors, and the subtasks of that task can be executed on any of the assigned processors. In contrast, under partitioned scheduling [10]-[14], each subtask is assigned to one processor and will always execute on that processor (it cannot be migrated to other processors). Compared with global and federated scheduling, the advantage of partitioned scheduling is that it has no subtask migration cost and better isolation between processors, and it is widely used in industry.
The problem of real-time scheduling of parallel Directed Acyclic Graph (DAG) tasks has been a subject of extensive research in recent years. However, it remains unclear how to allocate the subtasks of a DAG task so as to reduce the worst-case response time and improve schedulability under partitioned fixed-priority scheduling.
Disclosure of Invention
The invention provides a heuristic processor partitioning method for DAG tasks based on partitioned scheduling, which comprises a heuristic processor allocation procedure for PEC structures, consisting of the following steps:

Step 1: initialize the remaining utilization of each processor, U_a(i) ← 1, i = 1, ..., m; initialize the Ready queue Ready to be empty; initialize the processor allocation policy to null, θ ← ∅;

Step 2: calculate the tolerable latest start time of each subtask according to formula (4) and save it into a table LST;

Step 3: from the first task to the last task, check whether a PEC structure P_i^k can be derived from task τ_i; if some P_i^k is obtained, execute step 4, otherwise end the process. A set of subtasks that have one and only one identical predecessor subtask is called a PEC structure, i.e., a parallel execution cluster structure; P_i^k denotes the kth PEC structure of τ_i, where k ∈ [0, π_i] and π_i represents the number of PEC structures in task τ_i;

Step 4: add all the subtasks in P_i^k into the Ready queue Ready;

Step 5: sort the subtasks in the Ready queue Ready in non-descending order of LST;

Step 6: allocate processors from the first subtask to the last subtask in Ready.

LST(V_{i,j}) represents the tolerable latest start time of subtask V_{i,j}: if V_{i,j} starts later than LST(V_{i,j}), task τ_i must be unschedulable. LST(V_{i,j}) can be calculated by formula (4), where C_i represents the total worst-case execution time of all subtasks, D_i denotes the deadline of the task, C_{i,j} represents the WCET of V_{i,j} (WCET is the worst-case execution time), and F(τ_i) denotes the set of terminating subtasks:

LST(V_{i,j}) = D_i − max_{V_{i,f} ∈ F(τ_i)} len(λ(V_{i,j} → V_{i,f}))    (4)

where len(λ(V_{i,j} → V_{i,f})) is the sum of the WCETs of the subtasks on a path from V_{i,j} to a terminating subtask V_{i,f}, including C_{i,j} itself.
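The LST table built in step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes formula (4) takes the reconstructed form above (deadline minus the longest remaining WCET path), and the example DAG, its edges, and its WCET values are hypothetical, chosen only so that the WCETs sum to C_i = 22 as in the later Fig. 2 example.

```python
from functools import lru_cache

def compute_lst_table(succ, wcet, deadline):
    """Tolerable latest start time per subtask (sketch of formula (4)).

    succ: dict subtask -> list of successor subtasks
    wcet: dict subtask -> WCET C_ij
    deadline: task deadline D_i
    Assumes LST(v) = D_i minus the longest WCET path starting at v
    (our reading of formula (4); the original image is unreadable).
    """
    @lru_cache(maxsize=None)
    def longest_from(v):
        # WCET of v plus the longest path through any successor
        return wcet[v] + max((longest_from(s) for s in succ[v]), default=0)

    return {v: deadline - longest_from(v) for v in succ}

# Hypothetical DAG consistent with the Fig. 2 facts used later
# (source V1; sinks V2, V3, V7; a path V1 -> V4 -> V6 -> V7).
succ = {1: [2, 3, 4, 5], 2: [], 3: [], 4: [6], 5: [6], 6: [7], 7: []}
wcet = {1: 2, 2: 3, 3: 4, 4: 5, 5: 3, 6: 2, 7: 3}   # sums to 22
lst = compute_lst_table(succ, wcet, deadline=80)
```

A subtask deeper on a path always has a later (or equal) LST than its predecessors, which is why step 5 sorts the Ready queue in non-descending LST order: subtasks whose start can least afford to slip come first.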
As a further improvement of the present invention, step 6 comprises:

Step 61: allocate the subtask currently to be assigned to a processor p* according to the Worst-Fit algorithm;

Step 62: update the processor's remaining utilization U_a(p*);

Step 63: update the processor allocation policy θ_{p*}.
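Steps 61 and 62 can be sketched as a plain Worst-Fit pass; the function name and the utilization figures below are hypothetical, and the policy-table update of step 63 is reduced to returning the chosen processor.

```python
def worst_fit(remaining_util, demand):
    """Worst-Fit assignment (sketch of steps 61 and 62).

    remaining_util: dict processor -> remaining utilization U_a(p)
    demand: utilization demanded by the subtask to be placed
    Returns the chosen processor p*, or None if no processor fits.
    """
    # Candidates that can still accommodate the subtask
    fitting = {p: u for p, u in remaining_util.items() if u >= demand}
    if not fitting:
        return None
    # Worst-Fit prefers the emptiest processor, spreading the load
    p_star = max(fitting, key=fitting.get)
    remaining_util[p_star] -= demand  # step 62: update U_a(p*)
    return p_star

util = {1: 0.6, 2: 0.9, 3: 0.3}
chosen = worst_fit(util, 0.4)  # processor 2 has the most headroom
```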
As a further improvement of the invention, the heuristic processor partitioning method comprises the following steps:

Step S1: initialize the processor allocation policy to null, θ ← ∅; initialize a PST; initialize the Ready queue Ready. The PST is a table of potential self-interference, which stores the potential self-interference subtasks of each subtask of τ_i;

Step S2: initially allocate processors for the PEC structures according to the heuristic processor allocation procedure for PEC structures;

Step S3: update the latest end times and the PST of the subtasks already allocated to processors according to the allocation result of step S2;

Step S4: add to Ready every subtask all of whose predecessor subtasks have been allocated to processors;

Step S5: judge whether Ready is empty; if not, allocate processors for the subtasks in Ready; if Ready is empty, the allocation is complete and the routine exits.
As a further improvement of the present invention, in step S5, if Ready is not empty, allocating processors for the subtasks in Ready comprises:

Step S51: sort the subtasks in Ready in non-descending order of LST;

Step S52: allocate a processor for the first subtask V_{i,j} of the sorted Ready;

Step S53: update the PST;

Step S54: add to Ready any new subtasks that satisfy the condition of step S4, and repeat step S5.
As a further improvement of the present invention, step S52 comprises:

Step S521: for each of the m processors, compute the latest end time LET_k(V_{i,j}) that results from allocating the subtask to processor k, where k ∈ [1, m] and LET_k(V_{i,j}) represents the latest end time of subtask V_{i,j} on processor k;

Step S522: compute k* = argmin_{k ∈ [1, m]} LET_k(V_{i,j});

Step S523: if 2 or more processors attain the minimum, allocate the subtask, according to the Worst-Fit algorithm, to the one among them with the maximum remaining utilization.
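Steps S521 to S523 amount to a min-latest-end-time selection with a Worst-Fit tie-break. A minimal sketch, assuming LET_k names the per-processor latest end time as in the reconstruction above; the numeric values are hypothetical.

```python
def pick_processor(latest_end, remaining_util):
    """Choose a processor for one subtask (sketch of steps S521-S523).

    latest_end: dict processor k -> LET_k(V_ij) if the subtask ran there
    remaining_util: dict processor k -> remaining utilization U_a(k)
    """
    best = min(latest_end.values())                    # step S522
    tied = [k for k, t in latest_end.items() if t == best]
    # step S523: among ties, Worst-Fit picks the emptiest processor
    return max(tied, key=lambda k: remaining_util[k])

let = {1: 40, 2: 35, 3: 35}      # processors 2 and 3 tie on LET
util = {1: 0.2, 2: 0.1, 3: 0.5}  # processor 3 has more headroom
k_star = pick_processor(let, util)
```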
The invention also provides a real-time system, wherein the real-time system runs on an embedded development board, and the steps of the heuristic processor partitioning method run on the real-time system.
As a further improvement of the invention, the embedded development board is a Raspberry Pi 3 Model B+, and the real-time system is RT-Thread v4.0.2, where RT-Thread is an open-source real-time operating system.
As a further improvement of the present invention, the file paths and source code of the real-time system are modified as follows:

set the value of the macro RT_TICK_PER_SECOND at line 18 of the rtconfig.h file under the path "/bsp/raspberry-pi/raspi3-32/rtconfig.h" to 100;

change the variable "cntfrq" at line 57 of the board.c file under the path "/bsp/raspberry-pi/raspi3-32/driver/board.c" from 35000 to 10000;

the value of the variable cntfrq, which represents a counter that obtains its clock from an external crystal, is set to 1000;

the CPU is required to perform 1,500,000 auto-increment operations; the execution time of these 1,500,000 increment operations, measured against the system tick as the unit of time, is taken as the worst-case execution time of a task in the experiments.
As a further improvement of the invention, the scheduler of the real-time system performs the following steps:

Step one: initialize the release queue Ω to be empty, initialize the current system time t_current ← 0, and initialize the task-set schedulable flag Flag ← TRUE;

Step two: initialize the vector t_next ← (0, ..., 0), where for i = 1, ..., n each element t_next[i] stores the time of the next release of the corresponding task;

Step three: at system start, add all tasks into the release queue Ω in non-descending order of their next release times t_next[i];

Step four: if Flag is TRUE, start scheduling the task set and execute step five; otherwise the task set is not schedulable, and the program exits;

Step five: if Ω ≠ ∅, execute step six; otherwise, suspend and wait to be awakened when a task is added to Ω;

Step six: obtain the current system time, t_current ← GetSystemTick();

Step seven: obtain the task to be released next, τ_x ← GetFirstElement(Ω);

Step eight: if t_current < t_next[x], the time at which τ_x needs to be released has not yet arrived, so sleep for t_next[x] − t_current time; otherwise, execute step nine;

Step nine: release task τ_x, judge whether the task set is schedulable according to whether the last instance of τ_x has finished, and modify the value of Flag accordingly;

Step ten: if Flag is TRUE, set t_next[x] ← t_next[x] + T_x, add τ_x into Ω again according to the rule of step three, and repeatedly execute step four.
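The release loop of steps one to ten can be approximated by an event-driven simulation in which the release queue Ω is a min-heap ordered by next release time; GetSystemTick, the actual sleeping, and the schedulability check are abstracted away, and the task names and periods are hypothetical.

```python
import heapq

def simulate_releases(periods, horizon):
    """Event-driven task release loop (sketch of scheduler steps 1-10).

    Keeps the release queue ordered by each task's next release time
    t_next, pops the earliest entry, 'releases' it, and re-inserts it
    one period later, mirroring steps three, seven, nine and ten.
    periods: dict task -> period T_x; horizon: simulated time bound.
    Returns the list of (time, task) release events.
    """
    omega = [(0, x) for x in sorted(periods)]  # step three: all due at t=0
    heapq.heapify(omega)
    events = []
    while omega and omega[0][0] < horizon:
        t_next, x = heapq.heappop(omega)               # step seven
        events.append((t_next, x))                     # step nine
        heapq.heappush(omega, (t_next + periods[x], x))  # step ten
    return events

ev = simulate_releases({'a': 10, 'b': 15}, horizon=30)
```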
The present invention also provides a computer-readable storage medium having stored thereon a computer program configured to implement, when invoked by a processor, the steps of the heuristic processor partitioning method of the present invention.
The invention has the following beneficial effects: the invention provides a heuristic processor allocation strategy for DAG tasks under partitioned fixed-priority scheduling, the GPEC strategy, which takes into account the internal topology of DAG tasks and the influence of self-interference. In addition, the invention ports a real-time system to an embedded board and evaluates the performance of the GPEC strategy on a real platform. Compared with two state-of-the-art processor allocation strategies in the experiments, the GPEC strategy reduces average WCRT by up to 35.59% and improves the schedulable ratio of DAG task sets by up to 76%.
Drawings
Fig. 1 is a schematic diagram illustrating obstacle avoidance operation;
FIG. 2 is a schematic diagram of a DAG task;
FIG. 3 is a graph of worst case response time versus utilization;
FIG. 4 is a diagram of schedulable rate versus utilization for a DAG task set;
FIG. 5 is a graph of worst case percent response time reduction as a function of utilization;
FIG. 6 is a graph of percentage increase in schedulable rate of a set of tasks versus utilization.
Detailed Description
The invention discloses a heuristic processor partitioning method for DAG tasks based on partitioned scheduling, which studies the processor allocation problem of partitioned fixed-priority scheduling of parallel DAG tasks on a multiprocessor. Since this problem has been shown to be NP-hard [15], an optimal processor allocation strategy cannot be expected to be found in polynomial time. Thus, different heuristic processor allocation algorithms have been proposed to reduce the task Worst-Case Response Time (WCRT) [16]. However, existing work does not take into account the topology of the DAG task and the effects of self-interference, resulting in pessimistic analysis or long response times for the task. For example, if V_{i,2} and V_{i,3} in FIG. 1 are both allocated to processor 2, they interfere with each other; if they are distributed over different processors, they can run in parallel.
To address this problem, the invention proposes a novel processor allocation strategy that exploits the topology of the DAG task, considers the influence of self-interference among subtasks of the same task, and constructs an embedded real-time platform to verify the performance of the strategy. Specifically, the present invention makes the following main contributions:

1. The worst-case response time analysis of parallel DAG tasks under fixed-priority partitioned scheduling is derived, analyzing the interference from high-priority tasks and the self-interference among subtasks of the same task.

2. Using the topology of DAG tasks, the invention first defines a Parallel Execution Cluster (PEC) structure and designs a processor allocation strategy that attempts to allocate subtasks belonging to the same PEC structure to different processors.

3. Based on the intuition of the WCRT analysis, the invention improves the algorithm to further reduce self-interference and improve task schedulability, yielding the Greedy Parallel Execution Cluster (GPEC) processor allocation strategy.

4. The invention also ports an open-source real-time operating system to an embedded development board and rewrites its task-release scheduler so as to execute real-time DAG tasks in an event-driven manner.

Finally, a large number of empirical experiments on synthetic tasks are carried out on the platform; the evaluation results show that, compared with the two baseline algorithms, the proposed GPEC algorithm reduces WCRT and significantly improves the schedulability of the task set.
The present invention is described in detail below:
the related work is as follows:
the partition scheduling and parallel tasks most relevant to the present invention will be described below. Parallel task scheduling on multiple processors has been extensively studied over the last few years. Scholars propose different parallel task models. A task in the synchronous task model is composed of a series of computing segments, wherein each segment has any number of parallel subtasks. A subtask in a segment can only be executed after all subtasks of the previous segment have been completed. In contrast, Directed Acyclic Graph (DAG) tasks allow for a more general parallel structure, where subtasks may have arbitrary dependencies, as long as there are no dependency loops. In the orthogonal dimension, different types of scheduling algorithms are proposed for parallel real-time tasks, which we will briefly introduce in the following.
Global scheduling of parallel tasks: Saifullah et al demonstrated that decomposed DAG tasks can be scheduled with global earliest-deadline-first scheduling with a speed-up ratio of 4 [20]. Bonifaci et al demonstrated speed-up ratios of global earliest-deadline-first scheduling and global deadline-monotonic scheduling for DAG tasks of 2−1/m and 3−1/m, respectively, where m is the number of processors in the system [3].
Partitioned scheduling of parallel synchronous tasks: Lakshmanan et al proposed a task-decomposition partitioned fixed-priority scheduling algorithm for constrained synchronous tasks whose subtasks have the same length within the same segment [14], and demonstrated a speed-up ratio of 3.42. Based on a similar decomposition idea, Saifullah et al developed a partitioned scheduling algorithm for unrestricted synchronous tasks and demonstrated a speed-up ratio of 5 [5]. They also generalized the results to DAG tasks with unit-size subtasks. However, directly applying this result by converting subtasks of non-unit size into nodes of unit size may cause a subtask to migrate from one processor to another, since unit-size nodes belonging to the same subtask may be allocated to different processors.
Partitioned scheduling of parallel DAG tasks: unlike synchronous tasks, which synchronize subtasks after each segment, DAG tasks have a more complex topology and are more difficult to analyze. Fonseca et al proposed a response time analysis method for DAG tasks under partitioned scheduling that converts DAG tasks into self-suspending tasks [10]. Due to the complexity of the analysis, most existing work focuses on different heuristics for partitioning the subtasks of a DAG task onto processors. To our knowledge, existing processor allocation strategies for DAG tasks under partitioned scheduling include the dagP algorithm proposed by Herrmann et al [12] and the MACRO algorithm proposed by Özkaya et al [13]. Specifically, the dagP algorithm allocates subtasks to processors in three phases. First, the topology of the DAG task is roughly divided into convex sets [21]. The subtasks are then initially allocated to processors by calculating the cost of switching a subtask from one processor to another. Finally, the partition result computed in the second phase is refined to obtain the final allocation. In contrast, the MACRO algorithm uses the BL-EST algorithm to compute a weight for each subtask [22], thereby assigning the subtask to a processor. Similar to dagP, after the initial allocation, MACRO attempts to move subtasks from one processor to another according to a predetermined priority and calculates the cost of each move to optimize the allocation.
Federated scheduling of parallel tasks: Li et al proposed a federated scheduling strategy that allocates high-utilization tasks to sets of dedicated cores, with the remaining low-utilization tasks sharing the remaining cores [6]. In addition, they demonstrated that G-EDF and G-RM have speed-up ratios of (3 + √5)/2 and 2 + √3, respectively. [9] integrates instruction cache sharing into federated scheduling and improves schedulability by reducing the number of processors used by high-utilization tasks.
Secondly, a system model:
the system of the invention is composed of a task set composed of n preemptible real-time tasks, wherein gamma is { tau ═ tau1,...,τnEach of which is a Directed Acyclic Graph (DAG) task. These tasks execute P ═ P on a multi-core platform with m identical processors1,...,pm}. Per DAG task τi=(Vi,Ei,Ci,Ti,Di,fi) There are 6 parameters. Wherein ViRepresenting a set of subtasks (nodes), EiRepresenting a set of edges (inter-subtask dependencies), CiRepresenting the total worst case response time (WCET), T, of all subtasksiIndicating the period of the task, DiIndicates the deadline of the task (D)i≤Ti),fiIndicating the priority of the task.
Each DAG task τ_i consists of β_i subtasks, which are partitioned onto different processors and executed based on the partitioned schedule. If the response time (the time interval from task release to task completion) is greater than its deadline, the task is said to be unschedulable. Further, a task set is said to be unschedulable if it contains an unschedulable task; otherwise it is schedulable. Subtask V_{i,j} denotes the jth subtask of task τ_i; V_{i,j} has 2 parameters <C_{i,j}, P_{i,j}>, where C_{i,j} represents the WCET of V_{i,j} and P_{i,j} represents the processor to which V_{i,j} is allocated.
We use pr(τ_i) to denote the set of processors used by DAG task τ_i, where |pr(τ_i)| ≤ m. The deadline of each subtask is inherited from its task. We use e(V_{i,j}, V_{i,k}) ∈ E_i to denote an edge pointing from V_{i,j} to V_{i,k}, which means that V_{i,k} can begin execution only when V_{i,j} has completed.
When a DAG task is released, all its subtasks are released at the same time, but not all of them are ready, because of the dependencies described above. For convenience of analysis, we use U_i = C_i/(T_i · m) to denote the utilization of task τ_i. The utilization cannot be greater than 1 in any case; otherwise the task set is not schedulable.
There are n distinct priorities in the system, corresponding one-to-one with the n DAG tasks. We use f_i = j to denote that the priority of task τ_i is j. We specify that the smaller the priority value of a task, the higher its priority. In other words, τ_i has higher priority than τ_j if and only if x < y, where f_i = x and f_j = y. In the Deadline Monotonic (DM) priority assignment algorithm used in the present invention, priorities are assigned according to task deadlines, i.e., the shorter the deadline, the higher the priority. The processor always selects the highest-priority ready task in the current system to execute.
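The DM priority assignment described above can be sketched in a few lines; the task names and deadlines below are hypothetical.

```python
def deadline_monotonic(deadlines):
    """Deadline Monotonic priority assignment (sketch): the shorter the
    deadline, the higher the priority. Priority 1 is the highest,
    matching the convention that a smaller f_i means higher priority.

    deadlines: dict task -> relative deadline D_i
    Returns dict task -> priority value f_i.
    """
    order = sorted(deadlines, key=lambda t: deadlines[t])
    return {task: rank + 1 for rank, task in enumerate(order)}

f = deadline_monotonic({'t1': 80, 't2': 40, 't3': 100})
```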
Definition 1: if there is an edge e(V_{i,j}, V_{i,k}) ∈ E_i, we call V_{i,j} a predecessor subtask of V_{i,k}, and V_{i,k} a successor subtask of V_{i,j}.
Definition 2: if a subtask has no predecessor subtask, it is called a source subtask, denoted S_i. Similarly, a subtask is called a terminating subtask if it has no successor subtask; the set of terminating subtasks is denoted F(τ_i).
For a task τ_i with only one source subtask, S_i = V_{i,1}. A task with multiple source subtasks can easily be converted into a task with only one source subtask (for convenience of the following analysis) by adding a predecessor subtask V_{i,0} with WCET 0 (i.e., C_{i,0} = 0) before these source subtasks, so that S_i = V_{i,0}.
Definition 3: we use parent(V_{i,j}) to denote the set of all predecessor subtasks of V_{i,j}. Similarly, the set of all successor subtasks of V_{i,j} is denoted child(V_{i,j}).

From Definition 3 and the description of edges above, it is readily seen that V_{i,j} cannot begin execution until all subtasks in parent(V_{i,j}) have completed.
Definition 4: we use λ_{i,k} to denote the kth path of task τ_i. A path λ_{i,k} = {V_{i,s} → ... → V_{i,f}} is a consecutive set of subtasks ending at a terminating subtask V_{i,f} ∈ F(τ_i). We use λ_i = {λ_{i,1}, ..., λ_{i,γ_i}} to denote the set of all paths of task τ_i, where γ_i is the number of all paths.
Definition 5: V_{i,j} is called an indirect predecessor subtask of V_{i,k} (and V_{i,k} an indirect successor subtask of V_{i,j}) if the following two conditions are satisfied simultaneously:

1) there is no edge e(V_{i,j}, V_{i,k}) ∈ E_i pointing from V_{i,j} to V_{i,k};

2) there is a path that first passes through V_{i,j} and then passes through V_{i,k}.
For example, as shown in FIG. 2, a DAG task τ_i is composed of 7 subtasks. τ_i is distributed over two processors, i.e., P = {P_1, P_2}, m = 2, with C_i = 22 and T_i = D_i = 80. The utilization is U_i = 22/(80 · 2) = 0.1375. V_{i,1} is the source subtask; V_{i,2}, V_{i,3} and V_{i,7} are the terminating subtasks. V_{i,1} is an indirect predecessor subtask of V_{i,6}, because the two are not directly connected and there is a path {V_{i,1} → V_{i,4} → V_{i,6} → V_{i,7}} that first passes through V_{i,1} and then through V_{i,6}. τ_i has 4 paths in total, i.e., γ_i = 4.
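Definition 4 and the path count γ_i can be checked mechanically. The edge set below is our assumption, since Fig. 2 itself is not reproduced in the text; it is chosen only to be consistent with the stated facts (source V_{i,1}; terminating subtasks V_{i,2}, V_{i,3}, V_{i,7}; the path V_{i,1} → V_{i,4} → V_{i,6} → V_{i,7}; γ_i = 4).

```python
def all_paths(succ, source):
    """Enumerate all source-to-sink paths lambda_ik (Definition 4 sketch).

    succ: dict subtask -> successor list. A path ends at a subtask
    with no successors, i.e. a terminating subtask in F(tau_i).
    """
    if not succ[source]:
        return [[source]]
    paths = []
    for nxt in succ[source]:
        for tail in all_paths(succ, nxt):
            paths.append([source] + tail)
    return paths

# Hypothetical Fig. 2 edges consistent with the facts stated above.
succ = {1: [2, 3, 4, 5], 2: [], 3: [], 4: [6], 5: [6], 6: [7], 7: []}
paths = all_paths(succ, 1)
```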
Response time analysis:
response time analysis in real-time systems is one way to determine whether a set of tasks is schedulable. The intuition derived from the analysis may help us to develop a good processor allocation strategy that enables the DAG task set to be scheduled. Fonseca et al propose an analysis method for computing WCRT by DAG task under partition scheduling [10]. They demonstrated tauiIs equal to the largest of the WCRTs of all paths, can be calculated by equation (1), where (R (λ)i,k) Can be calculated by the formula (2).
Figure GDA0003144714910000096
Figure GDA0003144714910000091
R(λi,k) The calculation of (c) is divided into 3 parts. Wherein len (lambda)i,k) Represents a path λi,kCan use
Figure GDA0003144714910000092
Figure GDA0003144714910000093
And (4) calculating.
Figure GDA0003144714910000094
And
Figure GDA0003144714910000095
respectively represent lambdai,kSelf-interference (self-interference) and interference from high priority task nodes (high-interference).
The impact of high-priority task interference is the workload of the high-priority DAG tasks. Note that once the priorities of the DAG tasks are assigned, the high-priority interference of each path is determined. The invention therefore mainly studies the influence of self-interference on DAG tasks.
Each DAG task has a unique priority, without loss of generality. Since all subtasks from the same DAG task share the same priority, they cannot preempt each other, which creates self-interference. Furthermore, subtasks on the same path cannot interfere with each other because of the dependencies between them. Therefore, a subtask that generates self-interference on λ_{i,k} cannot itself belong to λ_{i,k}. We use self(V_{i,j}) to denote the set of subtasks that generate self-interference on V_{i,j}.
Theorem 1: subtask V_{i,k} generates self-interference on subtask V_{i,j} if and only if the following two conditions are satisfied simultaneously:

1) the two subtasks are allocated to the same processor, i.e., P_{i,j} = P_{i,k};

2) V_{i,j} is neither a direct nor an indirect predecessor subtask of V_{i,k}, and vice versa.
And (3) proving that: in a partitioned real-time system, a subtask can only run on one processor as long as it is assigned to that processor. Furthermore, executing one sub-task on one processor does not interfere with the execution of some sub-tasks on other processors, and vice versa. Considering the first condition if Pi,j≠Pi,kThe two subtasks never interfere with each other's execution, i.e. Vi,kCan not be aligned with Vi,jThe performing of (2) generates self-interference. Without loss of generality, we assume Vi,jIs Vi,kIs (indirectly) a preceding sub-task. Then Vi,kAt Vi,jDo not start until execution is complete, then Vi,kAlso will not be aligned with Vi,jSelf-interference is generated. In summary, if Vi,kWill pair subtask Vi,jSelf-interference is generated, the above two conditions must be satisfied at the same time.
Corollary 1: let self(λ_{i,k}) be the set of subtasks that generate self-interference on path λ_{i,k}, where

self(λ_{i,k}) = (∪_{V_{i,j} ∈ λ_{i,k}} self(V_{i,j})) \ λ_{i,k}

Then I^{self}(λ_{i,k}) can be calculated by formula (3):

I^{self}(λ_{i,k}) = Σ_{V_{i,j} ∈ self(λ_{i,k})} C_{i,j}    (3)
And (3) proving that: consider 3 subtasks Vi,a,Vi,bAnd Vi,cIn which V isi,aAnd Vi,bBelonging to path λi,kAnd Vi,cWhile belonging to self (V)i,a) And self (V)i,b). Task τ is known from the description in chapter threeiIs released within its period and only one sub-task instance, i.e. V, is releasedi,cMaximum pairs of Vi,aAnd Vi,bOne of the two sub-tasks generates self-interference. So that λ is the worst casei,kThe received self-interference is equal to the sum of the WCET of all the subtasks that would generate self-interference for that path.
Processor partitioning:

In this section, we propose a processor allocation strategy under partitioned scheduling that takes into account both the topology and the effects of self-interference. By Theorem 1 and formula (3), it is easy to see that once all subtasks have been assigned to processors, the worst-case self-interference of each path is determined. Our intuition is to minimize the self-interference of each subtask so as to improve the performance of the processor allocation strategy (reduce the WCRT of DAG tasks and increase the probability that the DAG task set is schedulable).
4.1 heuristic distribution method based on DAG task topological structure
According to Theorem 1, if we want to improve the performance of a processor allocation strategy by reducing self-interference among subtasks, we can only do so by allocating subtasks that have a potential self-interference relationship with each other to different processors. In other words, we ensure that the first condition of Theorem 1 is not satisfied. We cannot break the second condition of Theorem 1 by changing the dependencies between subtasks, because the topology of a DAG task is an inherent property that we cannot change.
Definition 6: A set of subtasks that share one and only one identical predecessor subtask is referred to as a Parallel Execution Cluster (PEC) structure. In addition, every PEC structure contains at least two subtasks. We use π_i to represent the number of PEC structures in task τ_i, and use
Figure GDA0003144714910000111
to denote the k-th PEC structure of τ_i, where k ∈ [0, π_i].
For example, only one PEC structure is present in FIG. 2
Figure GDA0003144714910000112
V_{i,3} and V_{i,6} do not form a PEC structure, because V_{i,6} has two predecessor subtasks, V_{i,4} and V_{i,5}.
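Definition 6 can be sketched mechanically: group the subtasks that have exactly one predecessor by that predecessor, and keep only groups of at least two. This is an illustrative Python sketch with our own assumed predecessor-map representation, not the patent's data structure.

```python
from collections import defaultdict

def pec_structures(preds):
    """Extract PEC structures (Definition 6, sketched): group subtasks
    that have exactly one predecessor by that predecessor; keep only
    groups with at least two members."""
    groups = defaultdict(list)
    for v, ps in preds.items():
        if len(ps) == 1:              # one and only one identical predecessor
            groups[ps[0]].append(v)
    return [sorted(g) for g in groups.values() if len(g) >= 2]
```

A subtask with two predecessors (like V_{i,6} in the example) never enters any group, mirroring the remark above.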
Theorem 2:
Figure GDA0003144714910000113
if the subtasks in it are allocated to the same processor, they will inevitably interfere with each other, i.e., they will generate self-interference on the paths they lie on.
Proof: Consider a PEC structure
Figure GDA0003144714910000114
where V_{i,a} is their common predecessor subtask. Because
Figure GDA0003144714910000115
all subtasks in it inherit from the same predecessor subtask, there is no dependency among them. In other words, no path passes simultaneously through
Figure GDA0003144714910000116
any two of its subtasks. As soon as V_{i,a} has completed execution,
Figure GDA0003144714910000117
all subtasks in it may start executing. According to Theorem 1, if these subtasks are allocated to the same processor, they will generate self-interference on one another.
Let LST(V_{i,j}) denote the latest start time that subtask V_{i,j} can tolerate. If the start time of V_{i,j} is later than LST(V_{i,j}), then task τ_i is certainly unschedulable. LST(V_{i,j}) can be calculated by formula (4).
Figure GDA0003144714910000118
Obviously, the smaller the value of LST(V_{i,j}), the earlier the subtask should be executed, so we use this order as our processor-assignment order. Furthermore, we use LST(τ_i) to represent a vector of β_i elements, each element corresponding one-to-one to the tolerable latest start time of a subtask of task τ_i, i.e.,
Figure GDA0003144714910000119
According to Theorem 2, we should assign the subtasks in
Figure GDA00031447149100001110
to different processors so as to reduce self-interference within the task. The heuristic processor assignment steps for the PEC structure (Algorithm 1) are described as follows:
step 1: initializing the remaining utilization of each processor, U_a(i) ← 1, i = 1, ..., m; initializing the Ready queue Ready to be empty; initializing the processor allocation policy to null
Figure GDA00031447149100001111
Step 2: calculating the tolerable latest starting time of each subtask according to the formula (4) and saving the latest starting time into a table LST;
step 3: from the first task to the last task, checking whether the PEC structure of task τ_i,
Figure GDA0003144714910000121
can be derived; if
Figure GDA0003144714910000122
is obtained, executing step 4; otherwise, ending the process;
step 4: adding all the subtasks in
Figure GDA0003144714910000123
into the Ready queue Ready;
step 5: sorting the subtasks in the Ready queue Ready in non-descending order of LST;
step 6: allocating processors from the first subtask to the last subtask in Ready.
The step 6 comprises the following steps:
step 61: assigning the subtask currently to be allocated to processor p* according to the Worst-Fit algorithm;
step 62: updating the processor remaining utilization U_a(p*);
step 63: updating the processor allocation policy θ_{p*}.
The data that the above steps need to store long-term are U_a(i), Ready, LST and θ_k, which grow with the number of subtasks and the number of processors. Therefore, the space complexity of the above algorithm is O(max{β_i, m}). Each subtask belongs to at most one PEC structure; otherwise the subtask would have at least two predecessor subtasks, contradicting Definition 6. In the worst case, Algorithm 1 performs β_i loop iterations, so its time complexity is O(β_i).
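The Worst-Fit placement of steps 4 through 6 above can be sketched as follows. This is an illustrative Python sketch under our own assumptions about data shapes (LST and utilization dictionaries, a remaining-utilization list), not the patent's implementation.

```python
def worst_fit(remaining):
    """Worst-Fit: pick the processor with the largest remaining utilization."""
    return max(range(len(remaining)), key=lambda p: remaining[p])

def assign_pec(pec, lst, util, remaining, assignment):
    """Steps 4-6 of Algorithm 1, sketched: sort the PEC's subtasks in
    non-descending LST order, then place each on the Worst-Fit
    processor, which tends to spread one PEC's members apart."""
    for v in sorted(pec, key=lambda s: lst[s]):   # step 5
        p = worst_fit(remaining)                  # step 61
        assignment[v] = p                         # step 63
        remaining[p] -= util[v]                   # step 62
    return assignment
```

With two PEC members and two fresh processors, Worst-Fit sends the second member to the other processor, which is exactly the separation Theorem 2 asks for.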
4.2 Heuristic allocation algorithm based on the self-interference cost function (GPEC strategy)
In this section, we propose a heuristic processor allocation algorithm that takes into account the effects of self-interference. The heuristic algorithm aims to reduce the WCRT of the DAG task by reducing the self-interference of each subtask as much as possible, so that the schedulable probability of the DAG task is improved.
As can be seen from Theorem 1, once the topology of a DAG task is determined, the potential self-interference tasks of all its subtasks are determined. For subtask V_{i,j}, its potential self-interference tasks can generate self-interference on V_{i,j} if they are assigned to the same processor as V_{i,j}. We use a potential self-interference table (PST) to store the potential self-interference subtasks of each subtask of τ_i. The PST contains β_i elements in total; each element is a set of subtasks, namely the potential self-interference subtasks of the corresponding subtask of τ_i. We use
Figure GDA0003144714910000124
to represent the worst-case self-interference of subtask V_{i,j}, which can be calculated by equation (5).
Figure GDA0003144714910000125
Each subtask V_{i,j} of τ_i has an earliest start time
Figure GDA0003144714910000126
And a latest end time
Figure GDA0003144714910000131
Since there are dependencies within a task, subtask V_{i,j} cannot start earlier than
Figure GDA0003144714910000132
.
Figure GDA0003144714910000133
depends on the maximum of the latest end times of its predecessor subtasks. Moreover, the source node, having no predecessor subtask to constrain it, can start executing immediately, i.e.,
Figure GDA0003144714910000134
The earliest start time and the latest end time of the other subtasks can be calculated by equation (6) and equation (7), respectively.
Figure GDA0003144714910000135
Figure GDA0003144714910000136
We use
Figure GDA0003144714910000137
as the cost function guiding the allocation of subtasks to processors. Our goal is to reduce the latest end time of each subtask as much as possible. Therefore, we use
Figure GDA0003144714910000138
to assign each subtask to a processor. For each subtask, we calculate m latest-end-time values,
Figure GDA0003144714910000139
namely the latest end times that would result from assigning the subtask to each of the m processors. We select the processor where the minimum value is attained as the subtask's processor. If the minimum is attained at two or more values (i.e., two or more of the latest end times are all equal to the minimum), the processor is chosen among them according to the Worst-Fit algorithm. We combine the heuristic processor allocation steps based on the PEC structure (Algorithm 1) with the above method to obtain Algorithm 2, described in detail as follows:
step S1: initializing the processor allocation policy to null
Figure GDA00031447149100001310
initialize the PST, and initialize the Ready queue Ready;
step S2: initially allocating processors for the PECs according to the heuristic processor allocation steps for the PEC structure (Algorithm 1);
step S3: updating the latest end times and the PST of the subtasks that have been allocated processors, according to the allocation result of step S2;
step S4: adding to Ready the subtasks all of whose predecessor subtasks have been allocated processors;
step S5: judging whether Ready is empty; if not, allocating a processor for the subtasks in Ready; if Ready is empty, the allocation is complete and the routine exits.
In step S5, if Ready is not empty, allocating a processor to the subtasks in Ready comprises:
step S51: sorting the subtasks in Ready in non-descending order of LST;
step S52: allocating a processor to the first subtask V_{i,j} of the sorted Ready;
step S53: updating the PST;
step S54: adding to Ready any new subtask satisfying the condition of step S4, and repeating step S5.
The step S52 includes:
step S521: calculating, for assigning the subtask to each of the m different processors,
Figure GDA0003144714910000141
where k ∈ [1, m],
Figure GDA0003144714910000142
representing the latest end time of subtask V_{i,j};
step S522: calculating k, where
Figure GDA0003144714910000143
Step S523: and if there are 2 or more than 2 k, distributing the subtask to the k with the maximum residual utilization rate according to the Worst-Fit algorithm.
Because Algorithm 2 stores only four parameters, namely the PST, θ_i, LST and U_a, the space complexity of Algorithm 2 is
Figure GDA0003144714910000144
From step S5 and step S52, the time complexity of Algorithm 2 is O(m·β_i).
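The per-subtask processor choice of steps S521 through S523 can be sketched as follows. Because equations (5) through (7) sit in figures, the latest-end-time evaluation is abstracted here as a caller-supplied function; everything else (names, shapes) is our own illustrative assumption in Python.

```python
def gpec_choose(v, m, end_time_if, remaining):
    """Steps S521-S523, sketched: `end_time_if(v, k)` stands in for the
    latest end time of v if assigned to processor k (eqs. (5)-(7));
    take the processor with the minimum value and break ties among the
    minima by Worst-Fit (largest remaining utilization)."""
    costs = [end_time_if(v, k) for k in range(m)]
    best = [k for k in range(m) if costs[k] == min(costs)]
    # a unique minimum wins outright; otherwise Worst-Fit among the ties
    return best[0] if len(best) == 1 else max(best, key=lambda k: remaining[k])
```

If processors 1 and 2 tie on the latest end time, the one with more remaining utilization is chosen, exactly as step S523 prescribes.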
V. Experiments:
in this section, the validity of the processor allocation strategy proposed by the present invention is verified by experiments performed on a real embedded device. We first present a method for generating DAG task sets based on the UUnifast algorithm [23]. The multiprocessor platform used by the present invention and the real-time system we adapted are then presented. Next, we rewrite the task release scheduler of the real-time system (hereinafter simply the scheduler) to enable the system to support event-driven computing tasks [24]. On the same DAG task sets, we compare the task-set execution in the real-time system under our processor allocation policy with that under the current state-of-the-art MACRO and dagP policies. Finally, the effectiveness of the three processor allocation strategies is analyzed and evaluated according to the execution results. In addition, the priority assignment method of the present invention is the DM method, i.e., the smaller the deadline, the higher the priority of the DAG task.
5.1 Generation of DAG task sets
We generated the set of tasks for the experiment according to the following parameters.
U: representing the utilization of a set of tasks
N: representing the number of DAG tasks in a task set
·βi: representing DAG tasks τiNumber of neutron tasks
M: representing the number of processors
P: probability factor representing per-DAG task topology generation variations
[C_min, C_max]: respectively representing the lower and upper bounds of the worst-case execution time of each subtask.
The total utilization for each processor allocation policy is generated from 0.4 to 0.9 in steps of 0.1. For each exact utilization, we generate 100 DAG task sets and use their average to characterize the DAG tasks at that utilization. We specify that the number of tasks in each task set and the number of subtasks in each task are both 10, i.e., n = β_i = 10. The worst-case execution time of each subtask is randomly generated from 1 to 5, i.e., C_min = 1, C_max = 5. After the utilization of each task is generated, the period of the task can be obtained according to the following formula:
Figure GDA0003144714910000151
The topology of each task is generated by randomly adding edges between the subtasks according to the probability p. Herein we fix p = 0.15, and use a matrix A with β_i rows and β_i columns to store the topology, i.e., A(x, y) = 1 means that there is an edge from V_{i,x} to V_{i,y}.
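The task-set generation just described can be sketched as follows: UUnifast [23] draws per-task utilizations summing to the target, and the topology is built by adding forward edges with probability p. This is an illustrative Python sketch; the function names and the forward-edge convention (which guarantees acyclicity) are our own assumptions.

```python
import random

def uunifast(n, total_u, rng=None):
    """UUnifast [23]: draw n per-task utilizations that sum to total_u."""
    rng = rng or random.Random(0)
    utils, remaining = [], total_u
    for i in range(1, n):
        nxt = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - nxt)
        remaining = nxt
    utils.append(remaining)
    return utils

def random_topology(beta, p, rng=None):
    """Topology as in the text: A(x, y) = 1 adds an edge V_x -> V_y with
    probability p; edges only point forward (x < y), so the graph is
    acyclic by construction."""
    rng = rng or random.Random(0)
    a = [[0] * beta for _ in range(beta)]
    for x in range(beta):
        for y in range(x + 1, beta):
            if rng.random() < p:
                a[x][y] = 1
    return a
```

In the experiments above, beta = 10 and p = 0.15; the generated utilizations are then converted to periods by the formula given in the figure.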
5.2 Experimental platform, real-time System
The embedded development board used in the invention is the Raspberry Pi 3 Model B+ [25]. The development board has a quad-core 1.4 GHz 64-bit processor based on the Cortex-A53 architecture, i.e., m = 4. In addition, it has dual-band wireless LAN, Bluetooth 4.2/BLE, faster Ethernet, and Power-over-Ethernet support (with a separate PoE HAT), which give it excellent scalability.
The real-time system chosen in the present invention is RT-Thread [26]. RT-Thread is an open-source real-time operating system that has been licensed under Apache License Version 2.0 starting from v3.1.1. In addition, RT-Thread supports preemptive scheduling. We used version v4.0.2 of RT-Thread as the experimental system; it supports Symmetric Multiprocessing (SMP) scheduling and the hardware drivers of the platform used by the present invention.
Since the official port of RT-Thread to the Raspberry Pi platform is rather crude, we found some errors when reading the source code, and did considerable work to correct the known ones. To make it easier for a technician to reproduce our experiments, we list the modified file paths and source-code changes as follows.
We set the value of the macro RT_TICK_PER_SECOND at line 18 of the rtconfig.h file under the path "./bsp/raspberry-pi/raspi3-32/rtconfig.h" to 100.
We changed the variable "cntfrq" at line 57 of the board.c file under the path "./bsp/raspberry-pi/raspi3-32/driver/board.c" from 35000 to 10000.
The macro definition RT_TICK_PER_SECOND represents how many system ticks are executed in one second. The system tick is the minimum unit of time for all programs executing on the system, that is, it is the atomic time of the system. We set RT_TICK_PER_SECOND to 100, which means 100 ticks per second, with 1 tick equal to 10 milliseconds. Therefore, if the WCET of a subtask is equal to 4, the subtask executes for 4 ticks (40 milliseconds). The variable cntfrq represents a counter that obtains its clock from an external crystal; its value is directly related to the accuracy of the system clock. Based on our extensive experimentation and observation, setting it to 1000 provides an accurate system clock.
The features of a preemptive real-time system would be violated if sleep mode or suspend mode were used to simulate the execution of a subtask, since both modes cause the executing subtask to relinquish the processor. To avoid this problem, we require the CPU to perform a certain number of auto-increment operations. According to our test results, performing 1,500,000 auto-increment operations takes exactly 1 system tick. In the following experiments, the time the CPU takes to execute 1,500,000 auto-increment operations serves as the unit of task execution time (1 tick). That is, a task whose worst-case execution time is 5 in our experiments actually executes the 1,500,000 auto-increment operations 5 times, rather than occupying the CPU for 5 arbitrary time units.
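The tick arithmetic above (100 ticks per second, so 10 ms per tick, with 1,500,000 auto-increments calibrated to one tick) can be captured in a small sketch. The constant and function names are our own; only the numbers come from the text.

```python
RT_TICK_PER_SECOND = 100           # as set in the modified rtconfig.h
INCREMENTS_PER_TICK = 1_500_000    # calibrated auto-increment count per tick

def subtask_budget(wcet_ticks):
    """Translate a subtask WCET given in system ticks into the busy-wait
    iteration count and the wall-clock milliseconds it represents."""
    ms_per_tick = 1000 // RT_TICK_PER_SECOND
    return wcet_ticks * INCREMENTS_PER_TICK, wcet_ticks * ms_per_tick
```

So a subtask with WCET 4 busy-waits through 6,000,000 increments, i.e., 40 ms, and a WCET-5 task through 7,500,000 increments, i.e., 50 ms.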
The scheduler is the thread with the highest priority in the system (the priority of the scheduler thread is zero), and it is initialized and started at system start-up. For an event-driven real-time system, the response time of a DAG task is the interval from its release to its completion. To satisfy this condition, we rewrote the scheduler of the real-time system. The detailed steps of the new scheduler (Algorithm 3) are described as follows.
Step 1: Initialize the release queue Ω to be empty, initialize the current system time t_current ← 0, and initialize the task-set schedulable flag Flag ← TRUE;
Step 2: Initialize the vector
Figure GDA0003144714910000161
where each element stores the time of the next release of the corresponding task;
Step 3: At system start, all tasks are, in non-descending order of
Figure GDA0003144714910000162
added into the release queue Ω;
Step 4: If Flag is TRUE, start scheduling the task set and execute Step 5; otherwise the task set cannot be scheduled, and the program exits;
Step 5: If not
Figure GDA0003144714910000163
execute Step 6; otherwise suspend, to be awakened when a task is added into Ω;
Step 6: Obtain the current system time t_current ← GetSystemTick();
Step seven: acquiring task tau to be released last timex←GetFirstElement(Ω);
Step eight: if it is not
Figure GDA0003144714910000164
it means that the time at which τ_x must be released has not yet arrived; sleep for
Figure GDA0003144714910000165
time; otherwise, execute Step 9;
Step 9: Release task τ_x, judge whether the task set is schedulable according to whether the previous instance of τ_x has finished, and modify the value of Flag accordingly;
Step 10: If Flag is TRUE,
Figure GDA0003144714910000166
and add τ_x into Ω again according to the rule of Step 3, and repeat from Step 4.
The dependencies between subtasks are enforced by the event-set structure provided by the system; the event set is an inter-thread communication mechanism provided by the RT-Thread real-time system.
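The release loop of Algorithm 3 above can be sketched as a simulation. The real scheduler runs as the highest-priority RT-Thread thread with sleeping and deadline checks; this Python sketch, under our own naming assumptions, reproduces only the release-queue logic with a min-heap standing in for Ω.

```python
import heapq

def release_trace(periods, horizon):
    """Algorithm 3, sketched as a simulation: the release queue Omega
    becomes a min-heap keyed by next release time; each pop releases one
    task instance and re-inserts the task one period later (Step 10),
    until the time horizon is reached. Deadline checking is omitted."""
    omega = [(0, i) for i in range(len(periods))]   # Step 3: initial releases
    heapq.heapify(omega)
    trace = []
    while omega and omega[0][0] < horizon:
        t, i = heapq.heappop(omega)                 # Steps 7-8: earliest release
        trace.append((t, i))                        # Step 9: release tau_i at t
        heapq.heappush(omega, (t + periods[i], i))  # Step 10: next instance
    return trace
```

For two tasks with periods 2 and 3 simulated up to time 6, the trace interleaves their releases in time order, mirroring the non-descending ordering of Ω.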
5.3 Experimental results and analysis
Because DAG tasks cannot be executed on a physical machine for an infinitely long time, as in theoretical analysis, for each DAG task set at each utilization we release all DAG tasks simultaneously and execute them for 30,000 system ticks (5 minutes in our system). We then observe the state of each DAG task set and save its response-time results.
After starting the real-time system, we first run 1,000 ticks idle to avoid interference from the system, and then run another 1,000 ticks as system warm-up. Furthermore, before experimenting on any DAG task set, we set the system idle for 500 ticks to eliminate any possible interference of the previous DAG task set on the next one. The number 500 is chosen as the idle time because it is exactly equal to the time needed to execute all subtasks of the previous DAG task set once, in order, under worst-case conditions.
FIG. 3 shows the average WCRT of the DAG task sets at different utilizations, where MACRO, dagP, and GPEC denote the MACRO, dagP, and GPEC processor allocation policies, respectively. The average WCRT under all three processor allocation strategies increases as utilization increases. In addition, at the same utilization, the task sets processed by the three processor allocation strategies are identical. At a utilization of 0.9, the average WCRT of the DAG task sets allocated by the MACRO policy has no corresponding value, because no DAG task set is schedulable under that processor allocation policy. Overall, the average WCRT of the DAG task sets under the GPEC strategy is smaller than under the other two strategies.
We define the schedulable rate of DAG task sets at a given utilization as the proportion of schedulable DAG task sets among the 100 DAG task sets we create. FIG. 4 illustrates the schedulable rates of the DAG task sets at different utilizations. In general, the schedulable rates of all three processor allocation strategies decrease as utilization increases. For each particular utilization, the schedulable rate of the DAG task sets based on the GPEC policy is greater than that of the other two policies. Moreover, the schedulable rate of the DAG task sets based on the MACRO strategy is greater than that based on the dagP strategy.
FIG. 5 shows the percentage reduction in average WCRT at different utilizations, where the legends GPEC-MACRO and GPEC-dagP represent the WCRT reduction of GPEC versus MACRO and of GPEC versus dagP, respectively. The average WCRT of the DAG task sets allocated by the GPEC strategy is smaller than the values of the other two strategies. The maximum percentage reduction in average WCRT of GPEC compared with dagP is 35.59%, at a utilization of 0.7. Similarly, at a utilization of 0.6, GPEC reduces the average WCRT by up to 28.59% compared with MACRO. Furthermore, the outlier (negative value) at a utilization of 0.9 is due to the large response times of some task sets that are schedulable under the GPEC policy but not under the dagP policy. For example, Table 1 shows the WCRT obtained by the two strategies for 3 task sets with a utilization of 0.9. Although the GPEC policy schedules more task sets than the dagP policy, it is inferior to the dagP policy in average WCRT because the number of task sets schedulable by dagP is too small.
TABLE 1 partial task set response time Table
Figure GDA0003144714910000181
FIG. 6 shows the percentage increase in schedulable rate using the GPEC policy compared with the MACRO policy and the dagP policy at different utilizations. As the utilization increases, both comparison results first increase and then decrease. Compared with the MACRO and dagP strategies, the GPEC strategy increases the schedulable rate the most at a utilization of 0.8, by 76% and 72%, respectively.
In the invention, we first derive a response-time analysis of DAG tasks under a partitioned fixed-priority scheduling algorithm. Based on the intuition from this analysis, we propose a Greedy Parallel Execution Cluster (GPEC) processor allocation strategy that takes into account the topology of DAG tasks and the self-interference among subtasks within a task. In addition, an open-source real-time operating system is ported to an embedded development board, and experiments are carried out on it to evaluate the performance of the proposed GPEC strategy. Experimental results show that, compared with existing processor allocation strategies, GPEC can reduce the average worst-case response time of tasks by up to 35.59% and improve the schedulable rate of task sets by up to 76%.
VI. References:
[1]N.Abbas,Y.Zhang,A.Taherkordi,and T.Skeie,“Mobile edge computing:A survey,”IEEE Internet of Things Journal,vol.5,no.1,pp.450–465,Feb2018.
[2]S.Abedi,N.Gandhi,H.M.Demoulin,Y.Li,Y.Wu,and L.T.X.Phan,“Rtnf:Predictable latency for network function virtualization,”in 2019IEEE Real-Time and Embedded Technology and Applications Symposium(RTAS).IEEE,2019,pp.368–379.
[3]V.Bonifaci,A.Marchetti-Spaccamela,S.Stiller,and A.Wiese,“Feasibility analysis in the sporadic dag task model,”in 2013 25th Euromicro Conference on Real-Time Systems,2013,pp.225–233.
[4]H.S.Chwa,J.Lee,J.Lee,K.Phan,A.Easwaran,and I.Shin,“Global edf schedulability analysis for parallel tasks on multi-core platforms,”IEEE Transactions on Parallel and Distributed Systems,vol.28,no.5,pp.1331–1345,2017.
[5]A.Saifullah,J.Li,K.Agrawal,C.Lu,and C.Gill,“Multi-core real-time scheduling for generalized parallel task models,”Real-Time Systems,vol.49,no.4,pp.404–435,2013.
[6]J.Li,J.J.Chen,K.Agrawal,C.Lu,C.Gill,and A.Saifullah,“Analysis of federated and global scheduling for parallel real-time tasks,”in 2014 26th Euromicro Conference on Real-Time Systems.IEEE,2014,pp.85–96.
[7]X.Jiang,N.Guan,X.Long,and W.Yi,“Semi-federated scheduling of parallel real-time tasks on multiprocessors,”in 2017 IEEE Real-Time Systems Symposium(RTSS).IEEE,2017,pp.80–91.
[8]S.Baruah,“Federated scheduling of sporadic dag task systems,”in 2015 IEEE International Parallel and Distributed Processing Symposium.IEEE,2015,pp.179–186.
[9]C.Tessler,V.P.Modekurthy,N.Fisher,and A.Saifullah,“Bringing inter-thread cache benefits to federated scheduling,”in 2020 IEEE Real-Time and Embedded Technology and Applications Symposium(RTAS).IEEE,2020,pp.281–295.
[10]J.Fonseca,G.Nelissen,V.Nelis,and L.M.Pinho,“Response time analysis of sporadic dag tasks under partitioned scheduling,”in 2016 11th IEEE Symposium on Industrial Embedded Systems(SIES).IEEE,2016,pp.1–10.
[11]D.Casini,A.Biondi,G.Nelissen,and G.Buttazzo,“Partitioned fixedpriority scheduling of parallel tasks without preemptions,”in 2018 IEEE Real-Time Systems Symposium(RTSS).IEEE,2018,pp.421–433.
[12] J. Herrmann, J. Kho, B. Uçar, K. Kaya, and Ü. V. Çatalyürek, “Acyclic partitioning of large directed acyclic graphs,” in 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 2017, pp. 371–380.
[13] M. Y. Özkaya, A. Benoit, B. Uçar, J. Herrmann, and Ü. V. Çatalyürek, “A scalable clustering-based task scheduler for homogeneous processors using dag partitioning,” in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2019, pp. 155–165.
[14]K.Lakshmanan,S.Kato,and R.Rajkumar,“Scheduling parallel realtime tasks on multi-core processors,”in 2010 31st IEEE Real-Time Systems Symposium,2010,pp.259–268.
[15]M.Xu,L.T.X.Phan,H.-Y.Choi,Y.Lin,H.Li,C.Lu,and I.Lee,“Holistic resource allocation for multicore real-time systems,”in 2019 IEEE Real-Time and Embedded Technology and Applications Symposium(RTAS).IEEE,2019,pp.345–356.
[16]M.Fan and G.Quan,“Harmonic-fit partitioned scheduling for fixedpriority real-time tasks on the multiprocessor platform,”in 2011 IFIP 9th International Conference on Embedded and Ubiquitous Computing.IEEE,2011,pp.27–32.
[17]N.Fisher,S.Baruah,and T.P.Baker,“The partitioned scheduling of sporadic tasks according to static-priorities,”in 18th Euromicro Conference on Real-Time Systems(ECRTS’06).IEEE,2006,pp.10–pp.
[18]R.M.Pathan and J.Jonsson,“Load regulating algorithm for staticpriority task scheduling on multiprocessors,”in 2010 IEEE International Symposium on Parallel&Distributed Processing(IPDPS).IEEE,2010,pp.1–12.
[19]F.Fauberteau,S.Midonnet,and L.George,“Allowance-fit:a partitioning algorithm for temporal robustness of hard real-time systems upon multiprocessors,”in 2009 IEEE Conference on Emerging Technologies&Factory Automation.IEEE,2009,pp.1–4.
[20]A.Saifullah,D.Ferry,J.Li,K.Agrawal,C.Lu,and C.D.Gill,“Parallel real-time scheduling of dags,”IEEE Transactions on Parallel and Distributed Systems,vol.25,no.12,pp.3242–3252,2014.
[21]N.Fauzia,V.Elango,M.Ravishankar,J.Ramanujam,F.Rastello,A.Rountev,L.-N.Pouchet,and P.Sadayappan,“Beyond reuse distance analysis:Dynamic analysis for characterization of data locality potential,”ACM Transactions on Architecture and Code Optimization(TACO),vol.10,no.4,pp.1–29,2013.
[22]H.Wang and O.Sinnen,“List-scheduling versus cluster-scheduling,”IEEE Transactions on Parallel and Distributed Systems,vol.29,no.8,pp.1736–1749,2018.
[23]E.Bini and G.C.Buttazzo,“Measuring the performance of schedulability tests,”Real-Time Systems,vol.30,no.1-2,pp.129–154,2005.
[24] S. Chakraborty, T. Erlebach, S. Künzli, and L. Thiele, “Schedulability of event-driven code blocks in real-time embedded systems,” in Proceedings of the 39th Annual Design Automation Conference, 2002, pp. 616–621.
[25] “Raspberry Pi 3 Model B+,” https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/.
[26]“Rt-thread system,”https://github.com/RT-Thread/rt-thread.
the foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A heuristic processor partitioning method for DAG tasks based on partitioned scheduling, comprising a heuristic processor allocation step for a PEC structure, the heuristic processor allocation step of the PEC structure comprising:
step 1: initializing the remaining utilization of each processor, U_a(i) ← 1, i = 1, ..., m; initializing the Ready queue Ready to be empty; initializing the processor allocation policy to null
Figure FDA0003144714900000011
Step 2: calculating the tolerable latest starting time of each subtask according to the formula (4) and saving the latest starting time into a table LST;
step 3: from the first task to the last task, checking whether the PEC structure of task τ_i,
Figure FDA0003144714900000012
can be derived; if
Figure FDA0003144714900000013
is obtained, executing step 4; otherwise, ending the process; a set of subtasks having one and only one identical predecessor subtask is called a PEC structure, i.e., a parallel execution cluster structure; using
Figure FDA0003144714900000014
to denote the k-th PEC structure of τ_i, where k ∈ [0, π_i], and π_i represents the number of PEC structures in task τ_i;
step 4: adding all the subtasks in
Figure FDA0003144714900000015
into the Ready queue Ready;
step 5: sorting the subtasks in the Ready queue Ready in non-descending order of LST;
step 6: allocating processors from the first subtask to the last subtask in Ready;
Figure FDA0003144714900000016
LST(V_{i,j}) represents the tolerable latest start time of subtask V_{i,j}; if the start time of V_{i,j} is later than LST(V_{i,j}), task τ_i is certainly unschedulable; LST(V_{i,j}) can be calculated by formula (4), where C_i represents the total worst-case execution time of all subtasks, D_i represents the deadline of the task, C_{i,j} represents the WCET of V_{i,j}, WCET being the worst-case execution time, and F(τ_i) denotes the set of terminating subtasks.
2. The heuristic processor partitioning method of claim 1, wherein the step 6 comprises:
step 61: assigning the subtask currently to be allocated to processor p* according to the Worst-Fit algorithm;
step 62: updating the processor remaining utilization U_a(p*);
step 63: updating the processor allocation policy θ_{p*}.
3. The heuristic processor partitioning method of any of claims 1-2, wherein the heuristic processor partitioning method comprises the steps of:
step S1: initializing the processor allocation policy to null
Figure FDA0003144714900000021
initializing the PST, and initializing the Ready queue Ready, the PST being a potential self-interference table storing the potential self-interference subtasks of each subtask of τ_i;
step S2: initially allocating processors for the PECs according to the heuristic processor allocation step of the PEC structure;
step S3: updating the latest end times and the PST of the subtasks that have been allocated processors, according to the allocation result of step S2;
step S4: adding to Ready the subtasks all of whose predecessor subtasks have been allocated processors;
step S5: judging whether Ready is empty; if not, allocating a processor for the subtasks in Ready; if Ready is empty, the allocation is complete and the routine exits.
4. The heuristic processor partitioning method of claim 3, wherein, in step S5, if Ready is not empty, allocating processors for the subtasks in Ready comprises:
step S51: sorting the subtasks in Ready in non-descending order of LST;
step S52: allocating a processor to the first subtask V_{i,j} of the sorted Ready;
step S53: updating the PST;
step S54: adding to Ready any new subtask satisfying the condition of step S4, and repeating step S5.
5. The heuristic processor partitioning method of claim 4, wherein the step S52 comprises:
step S521: calculating, for assigning the subtask to each of the m different processors,
Figure FDA0003144714900000022
where k ∈ [1, m],
Figure FDA0003144714900000023
representing the latest end time of subtask V_{i,j};
step S522: calculating k, where
Figure FDA0003144714900000024
Step S523: and if there are 2 or more than 2 k, distributing the subtask to the k with the maximum residual utilization rate according to the Worst-Fit algorithm.
6. A real-time system, wherein the real-time system runs on an embedded development board, and the steps in the heuristic processor partitioning method of any of claims 3 to 5 run on the real-time system.
7. The real-time system of claim 6, wherein the embedded development board is a Raspberry Pi 3 Model B+, the real-time system is version v4.0.2 of RT-Thread, and RT-Thread is an open-source real-time operating system.
8. The real-time system of claim 7, wherein the file paths and source code of the real-time system are modified as follows:
setting the value of the macro RT_TICK_PER_SECOND in the rtconfig.h file under the path "/bsp/raspberry-pi/raspi3-32/rtconfig.h" to 100;
changing the variable "cntfrq" in the board.c file under the path "/bsp/raspberry-pi/raspi3-32/driver/board.c" from 35000 to 10000;
the variable cntfrq, which represents a counter that obtains its clock from an external crystal oscillator, is set to 1000;
the CPU is required to perform 1,500,000 increment operations; with 1 tick of system time as the unit, the execution time of 1,500,000 increment operations is taken as the worst-case execution time of the tasks executed in the experiment.
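The 1,500,000-increment workload unit above can be mimicked with a simple busy loop. The sketch below is Python for illustration (the on-board code would be C on RT-Thread), and `INCREMENTS_PER_TICK` and `burn_ticks` are hypothetical names introduced here:

```python
# One synthetic "tick" of execution time is defined as the time the CPU
# needs to perform 1,500,000 increment operations, per the experimental
# setup described in claim 8.
INCREMENTS_PER_TICK = 1_500_000

def burn_ticks(n_ticks):
    """Busy-loop for roughly n_ticks worth of simulated execution time
    by counting increment operations; returns the final counter value."""
    counter = 0
    for _ in range(n_ticks * INCREMENTS_PER_TICK):
        counter += 1
    return counter
```

A task with a worst-case execution time of c ticks would then call `burn_ticks(c)` as its body when measuring the schedule on the board.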
9. A real-time system according to any one of claims 6-8, characterized in that the scheduler of the real-time system performs the steps of:
the method comprises the following steps:
step one: initializing the release queue Ω to empty, initializing the current system time t_current ← 0, and initializing the task-set schedulable flag Flag ← TRUE;
step two: initializing a vector t^next = (t_1^next, t_2^next, ..., t_n^next), wherein each element t_i^next, i ∈ [1, n], stores the next release time of the corresponding task τ_i;
step three: at system start, adding all tasks to the release queue Ω in non-descending order of t_i^next;
step four: if Flag is TRUE, starting to schedule the task set and executing step five; otherwise, the task set is unschedulable and the program exits;
step five: if Ω is not empty, executing step six; otherwise, suspending itself and waiting to be woken when a task is added to Ω;
step six: obtaining the current system time t_current ← GetSystemTick();
step seven: acquiring the task to be released next: τ_x ← GetFirstElement(Ω);
step eight: if t_current < t_x^next, the time at which τ_x needs to be released has not yet arrived, so sleeping for (t_x^next − t_current); otherwise, executing step nine;
step nine: releasing the task τ_x, judging whether the task set is schedulable according to whether the previous instance of τ_x has finished, and modifying the value of Flag accordingly;
step ten: if Flag is TRUE, updating t_x^next to the next release time of τ_x, re-adding τ_x to Ω according to the rule of step three, and repeatedly executing step four.
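The steps above amount to a release loop over a queue ordered by next release time. A minimal Python sketch follows, assuming periodic tasks, simulating GetSystemTick with the release times themselves, and omitting the step-nine schedulability check (Flag simply stays TRUE); `run_scheduler` and its parameters are illustrative names, not from the patent:

```python
import heapq

def run_scheduler(tasks, horizon):
    """Release periodic tasks up to time `horizon`.

    tasks: list of (name, period) pairs, all first released at time 0.
    Returns the chronological list of (release_time, name) events.
    """
    # steps one-three: queue ordered by next release time (t_i^next = 0)
    omega = [(0, name, period) for name, period in tasks]
    heapq.heapify(omega)
    releases = []
    flag = True                    # step-nine check omitted: stays TRUE
    while flag and omega:          # step four
        t_next, name, period = heapq.heappop(omega)   # step seven
        if t_next > horizon:       # stand-in for sleeping past the horizon
            break
        releases.append((t_next, name))               # step nine: release
        # step ten: re-insert with the updated next release time
        heapq.heappush(omega, (t_next + period, name, period))
    return releases
```

On the real system the loop would instead sleep until t_x^next (step eight) and update Flag from the previous instance's completion status.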
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program which, when invoked by a processor, implements the steps of the heuristic processor partitioning method of any one of claims 1-5.
CN202011631493.0A 2020-12-31 2020-12-31 Heuristic processor partitioning method, system and storage medium for DAG task based on partition scheduling Active CN112463346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011631493.0A CN112463346B (en) 2020-12-31 2020-12-31 Heuristic processor partitioning method, system and storage medium for DAG task based on partition scheduling

Publications (2)

Publication Number Publication Date
CN112463346A (en) 2021-03-09
CN112463346B (en) 2021-10-15

Family

ID=74802788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011631493.0A Active CN112463346B (en) 2020-12-31 2020-12-31 Heuristic processor partitioning method, system and storage medium for DAG task based on partition scheduling

Country Status (1)

Country Link
CN (1) CN112463346B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880083A (en) * 2022-03-24 2022-08-09 哈尔滨工业大学(深圳) Optimization method of logic complexity of DAG task execution and storage medium
CN115544321B (en) * 2022-11-28 2023-03-21 厦门渊亭信息科技有限公司 Method and device for realizing graph database storage and storage medium
CN116739319B (en) * 2023-08-15 2023-10-13 中国兵器装备集团兵器装备研究所 Method and system for improving task execution time safety of intelligent terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035819A (en) * 2014-06-27 2014-09-10 清华大学深圳研究生院 Scientific workflow scheduling method and device
CN106991006A (en) * 2017-03-30 2017-07-28 浙江天正信息科技有限公司 Support the cloud workflow task clustering method relied on and the time balances
US10002029B1 (en) * 2016-02-05 2018-06-19 Sas Institute Inc. Automated transfer of neural network definitions among federated areas
CN110362394A (en) * 2019-07-22 2019-10-22 北京明略软件系统有限公司 Task processing method and device, storage medium, electronic device
CN111061569A (en) * 2019-12-18 2020-04-24 北京工业大学 Heterogeneous multi-core processor task allocation and scheduling strategy based on genetic algorithm
CN111176817A (en) * 2019-12-30 2020-05-19 哈尔滨工业大学 Method for analyzing interference between DAG (directed acyclic graph) real-time tasks on multi-core processor based on partition scheduling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445926A (en) * 2018-11-09 2019-03-08 杭州玳数科技有限公司 Data task dispatching method and data task dispatch system


Similar Documents

Publication Publication Date Title
CN112463346B (en) Heuristic processor partitioning method, system and storage medium for DAG task based on partition scheduling
Saifullah et al. Multi-core real-time scheduling for generalized parallel task models
Kato et al. Semi-partitioned fixed-priority scheduling on multiprocessors
Ueter et al. Reservation-based federated scheduling for parallel real-time tasks
Chen et al. Adaptive multiple-workflow scheduling with task rearrangement
Lee et al. Orchestrating multiple data-parallel kernels on multiple devices
Guan et al. Exact schedulability analysis for static-priority global multiprocessor scheduling using model-checking
Kim et al. Segment-fixed priority scheduling for self-suspending real-time tasks
CN111176817A (en) Method for analyzing interference between DAG (directed acyclic graph) real-time tasks on multi-core processor based on partition scheduling
Suzuki et al. Real-time ros extension on transparent cpu/gpu coordination mechanism
Guan et al. DAG-fluid: A real-time scheduling algorithm for DAGs
Roy et al. SLAQA: Quality-level aware scheduling of task graphs on heterogeneous distributed systems
Zahaf et al. A c-dag task model for scheduling complex real-time tasks on heterogeneous platforms: preemption matters
Akram et al. Efficient task allocation for real-time partitioned scheduling on multi-core systems
Socci et al. Time-triggered mixed-critical scheduler on single and multi-processor platforms
Jiang et al. Suspension-based locking protocols for parallel real-time tasks
Voronov et al. AI meets real-time: Addressing real-world complexities in graph response-time analysis
Cho et al. Conditionally optimal parallelization of real-time DAG tasks for global EDF
Saranya et al. Dynamic partitioning based scheduling of real-time tasks in multicore processors
Shi et al. Multiprocessor synchronization of periodic real-time tasks using dependency graphs
Maia et al. Scheduling parallel real-time tasks using a fixed-priority work-stealing algorithm on multiprocessors
Tran et al. Efficient contention-aware scheduling of SDF graphs on shared multi-bank memory
Wu et al. TDTA: Topology-based Real-Time DAG Task Allocation on Identical Multiprocessor Platforms
Ruaro et al. Dynamic real-time scheduler for large-scale MPSoCs
Nemati et al. Efficiently migrating real-time systems to multi-cores

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant