CN109918195B - Resource scheduling method for many-core system processor based on thermal perception dynamic task migration - Google Patents


Info

Publication number
CN109918195B
CN109918195B (application CN201910049800.5A)
Authority
CN
China
Prior art keywords
application
processor
area
task
migration
Prior art date
Legal status
Active
Application number
CN201910049800.5A
Other languages
Chinese (zh)
Other versions
CN109918195A (en)
Inventor
文生雁
王小航
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910049800.5A priority Critical patent/CN109918195B/en
Publication of CN109918195A publication Critical patent/CN109918195A/en
Application granted granted Critical
Publication of CN109918195B publication Critical patent/CN109918195B/en

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)

Abstract

The invention discloses a resource scheduling method for a many-core system processor based on thermal-aware dynamic task migration, comprising the following steps: step one, detect whether the waiting queue is empty; if not, map the waiting applications and proceed to step two. Step two, detect whether the arrival queue is empty; if not, proceed to step three. Step three, detect whether any application is running in the system; if not, use the models to estimate each application's execution time and waiting time under different bubble counts, and search for the optimal bubble allocation with a branch-and-bound algorithm; otherwise the same search is performed over the discrete idle regions. Step five is the application mapping stage and step six the application running stage. The method exploits the black silicon phenomenon: it adapts to application arrival queues of different lengths according to the arrival rate and the computation-sensitive or communication-sensitive character of each application, maintains the task operating frequency, responds to a variable application arrival rate, effectively improves system throughput, and improves system performance.

Description

Resource scheduling method for many-core system processor based on thermal perception dynamic task migration
Technical Field
The invention relates to the field of communication technology, and in particular to a resource scheduling method for a many-core system processor based on thermal-aware dynamic task migration.
Background
The many-core chip is one of the main processor components in fields such as cloud computing and mobile computing, and many-core processors are widely used in servers and data centers. In the development of computer systems, many-core chips are becoming an increasingly important platform. As applications demand more computation, the integration density and performance of many-core chips keep improving; with them, chip power density and temperature rise rapidly, and temperature has become an important factor limiting chip performance. Prolonged excessive temperature harms the reliability and lifetime of a many-core chip. Because of limited heat-dissipation capacity, and to ensure safe operation of the system, a power constraint is usually imposed on the chip system. To meet the power constraint while maintaining high-speed computing performance, a portion of the processors on the chip have to be kept shut down; this is known as the dark silicon ("black silicon") phenomenon.
The idle, inactive processors within the system-on-chip are commonly referred to as bubbles. Because bubbles dissipate heat well, some work places them around active processors so that those processors can run at a higher frequency and improve computational performance. This approach still belongs to static task-to-core mapping: since task positions are fixed, hot spots can still occur even at an unchanged operating frequency. Other methods use dynamic task migration to reduce hot spots, moving threads or tasks away from an overheated core to other cores once its temperature rises above a set threshold. One class of such methods picks the globally coldest processor, or a random idle processor, as the migration target, which can greatly increase the communication distance; when application tasks communicate heavily, this obviously incurs excessive communication cost. Another class migrates a task to a processor adjacent to the overheated one each time; after several such migrations, an application may leave behind a discontinuous idle area when it departs, so newly arrived or waiting applications cannot be mapped contiguously onto idle processors. Both kinds of methods can leave inter-task communication crossing multiple cores that run other applications, causing communication collisions between applications and reducing communication efficiency.
Because user requests are diverse and complex, server systems are expected to cope with varying workloads and respond in as short a time as possible. One task-migration method overclocks every processor running tasks and migrates all tasks to another contiguous idle processor region whenever a temperature threshold is reached. Although the inter-task communication distance stays unchanged, this method suits only lightly loaded systems: when more applications arrive and more processor resources are needed, such a low-utilization resource scheduling method brings overly long application waiting times, offsets the performance gain from overclocking, and increases response time.
The invention patent with publication number CN201310059705 discloses a core resource allocation method, apparatus, and many-core system. It mainly combines scattered core partitions according to the thread count (the number of idle cores required) of a user process, so that the resulting contiguous core partition can be allocated to the user process with optimized communication cost. The method selects a reference core partition and slave core partitions from at least two scattered core partitions according to partition migration cost, minimizing the total migration cost, and then migrates the idle cores of the slave partitions so that they merge with the idle cores of the reference partition into one contiguous partition. That patent consolidates on-chip processor resources mainly from the perspective of defragmentation, enabling arriving applications to be mapped contiguously. Its drawback is that it considers neither the power and temperature constraints of the system nor possible temperature peaks and non-uniform temperature distribution, so the system risks overheating.
None of the above task-migration methods takes the utilization of bubbles into account. When the black silicon phenomenon exists, designing dynamic processor resource scheduling for a many-core system that simultaneously respects the chip temperature constraint and the load is the key to maintaining its high performance.
Disclosure of Invention
The invention discloses a processor resource scheduling method under the black silicon phenomenon, implemented in a task-scheduling simulation system of a two-dimensional network-on-chip many-core system. It avoids thermal risk and accounts for communication cost while keeping processors at a higher operating frequency, thereby improving system throughput.
Therefore, the invention provides a resource scheduling method for a many-core system processor based on thermal perception dynamic task migration, which comprises the following steps:
step one, detect whether the waiting queue is empty; if not, map each application in the waiting queue and proceed to step two; if it is empty, end resource scheduling, wait for the start of the next clock cycle, and perform step one again;
step two, detect whether the arrival queue is empty; if not, proceed to step three; if it is empty, no new application has arrived in this clock cycle and no processor resource scheduling is needed, so end resource scheduling and perform step one when the next clock cycle starts;
step three, detect whether any application is running in the system. If none is running, i.e. all N×N processor resources of the system are available, use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under different bubble counts; take these as inputs to the cost function of a branch-and-bound search, and find the bubble allocation that minimizes the total response time of the applications. If an application is running, proceed to step four;
step four: if an application is running, the occupied application regions divide the system's available processors into a set of discrete idle regions. Use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under the selectable bubble counts; take these as inputs to the cost function of a branch-and-bound search, and find the optimal bubble allocation that minimizes the total response time.
step five, the application mapping stage: select an idle region with a first-fit heuristic, then select the application's mapping mode and perform the mapping; the mapping modes comprise a square mapping mode and a communication-priority mapping mode;
step six, the application running stage: a migration mode is selected for each application according to its mapping mode and its own computation-to-communication ratio (CCR).
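As a rough sketch of how steps one through four branch, the per-cycle decision can be written as a small pure function (all names — `schedule_step`, the boolean arguments, the action strings — are hypothetical illustrations, not from the patent):

```python
def schedule_step(wait_empty, arrivals_empty, any_running):
    """Return the actions the scheduler takes this clock cycle (sketch)."""
    actions = []
    if not wait_empty:
        actions.append("map_waiting")            # step one: map waiting apps
    if arrivals_empty:
        actions.append("idle")                   # step two: nothing arrived,
        return actions                           # end scheduling this cycle
    if not any_running:
        actions.append("allocate_whole_chip")    # step three: all N*N cores free
    else:
        actions.append("allocate_free_regions")  # step four: discrete idle regions
    return actions
```

Steps five and six (mapping and running stages) would then act on the chosen allocation.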
Further, the bubble-performance model computes, for each application, the execution time corresponding to different bubble counts. It is a polynomial regression model over the application's execution time Π_i, the number of bubbles b_i contained in the application region, the application's critical-path hop count h_i (the hop count of the longest weighted path), the average task computation time c_i, and the average inter-task communication time t_i:

Π_i = Σ_{k=1}^{n1} α_k·b_i^k + Σ_{k=1}^{n2} β_k·h_i^k + Σ_{k=1}^{n3} γ_k·c_i^k + Σ_{k=1}^{n4} θ_k·t_i^k

where n1 to n4 are polynomial orders and α_k, β_k, γ_k, θ_k are the model's fitting coefficients, obtained by the maximum-likelihood method; b_i^k is the k-th power of the bubble count of application i, h_i^k the k-th power of its critical-path hop count, c_i^k the k-th power of its average task computation time, and t_i^k the k-th power of its average inter-task communication time.
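A minimal numeric sketch of evaluating this polynomial model; the coefficient lists below are illustrative placeholders, whereas the patent obtains them by maximum-likelihood fitting:

```python
def execution_time(b, h, c, t, alpha, beta, gamma, theta):
    """Bubble-performance model (sketch): sum of four fitted polynomials in
    bubble count b, critical-path hops h, average computation time c, and
    average inter-task communication time t (coefficients are illustrative)."""
    poly = lambda coeffs, x: sum(a * x**k for k, a in enumerate(coeffs, start=1))
    return poly(alpha, b) + poly(beta, h) + poly(gamma, c) + poly(theta, t)
```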
Further, given the region size of the current application (the sum of its bubble count and task count), the waiting-time model computes the waiting time of each application corresponding to different bubble counts and the current arrival rate. The model is a polynomial regression. Denote the region size of the current application by R, the total number of processors in the system by |T|, the average task count of the mapped applications by |A_i|, and the waiting time of the current application by η_i. η_i is modeled by the following variables: the average bubble-to-task ratio r of the mapped applications, the average execution time e of the mapped applications, and the application arrival rate λ:

η_i = a_0 + Σ_{j=1}^{z} (δ_j·r^j + ε_j·e^j + μ_j·λ^j)

where a_0 is a constant term, z is the polynomial order, r^j, e^j, λ^j are the j-th powers of the average bubble-to-task ratio of the mapped applications, their average execution time, and the current application arrival rate, and δ_j, ε_j, μ_j are the fitting coefficients of the corresponding terms, obtained by maximum-likelihood regression.
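The waiting-time model can be evaluated the same way; again the constant `a0` and coefficient lists are illustrative stand-ins for the maximum-likelihood fit:

```python
def waiting_time(r, e, lam, a0, delta, eps, mu):
    """Waiting-time model (sketch): constant term plus fitted polynomials in
    the mapped apps' average bubble-to-task ratio r, their average execution
    time e, and the application arrival rate lam."""
    term = lambda coeffs, x: sum(c * x**j for j, c in enumerate(coeffs, start=1))
    return a0 + term(delta, r) + term(eps, e) + term(mu, lam)
```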
Further, the number of application tasks is the number of processors occupied by the application's mapping. An application is a collection of small tasks, each running a different instruction portion of the application, so the application executes in parallel. Each task is mapped to one processor core, so the number of application tasks equals the number of processors the running application occupies.
Further, the selectable bubble count is the difference between the total number of processors in each idle region and the number of application tasks.
Further, step three is implemented by a global manager: when subsequent applications arrive, the global manager counts all available processor regions in the current system, estimates for each arriving application the execution time and waiting time corresponding to each available region, and finds the minimum-cost correspondence between applications and available regions with a branch-and-bound algorithm.
Further, the first-fit algorithm is implemented as follows: search for idle processors in the system from left to right and from top to bottom, and judge whether each idle processor can serve as the starting point of the application region. It qualifies as a starting point if the product of the number of idle processors to its right in the same row and the number of idle processors below it in the same column is greater than or equal to the area of the application region. When the first processor that can serve as the starting point is found, it becomes the top-left corner of the idle region allocated to the application; if no such processor is found, the application is put into the waiting queue. After the region is selected, the application is mapped within it in the selected mode. The mapping mode is selected as follows: when the application's bubble count and task count satisfy the ratio 1:1, the square mapping mode is selected; otherwise, the communication-priority mapping mode is selected.
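The starting-point test of the first-fit search can be sketched as follows; the counting of idle cells to the right and below is one reading of the condition in the text, and `first_fit_start` with a boolean grid is a hypothetical interface:

```python
def first_fit_start(grid, area):
    """Scan left-to-right, top-to-bottom for the first idle cell whose
    (idle run to the right) * (idle run downward) >= area (sketch)."""
    rows, cols = len(grid), len(grid[0])
    for y in range(rows):
        for x in range(cols):
            if not grid[y][x]:
                continue
            right = 0
            while x + right < cols and grid[y][x + right]:
                right += 1
            down = 0
            while y + down < rows and grid[y + down][x]:
                down += 1
            if right * down >= area:
                return (y, x)   # top-left corner of the allocated region
    return None                 # no start found: application waits
```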
Further, there are three migration modes: the square migration mode, the in-region coldest-core migration mode, and the in-region coldest-neighbor-core migration mode. According to the application's mapping mode and its computation-to-communication ratio (the maximum bubble count of each application must not exceed its task count), the migration mode is selected as follows:
if the bubble count equals the task count, the square migration mode is selected for the application, keeping the communication distance unchanged during migration, and the processors run overclocked;
if the bubble count does not equal the task count, the communication-priority mapping was selected for the application. The CCR threshold is 2h/(h−1), where h is the hop count of the application's critical path, i.e. of the longest weighted path; in-region coldest-core migration is selected for applications whose CCR exceeds the threshold, and in-region coldest-neighbor-core migration for applications whose CCR is below it.
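The selection rule reduces to a few comparisons; this sketch assumes h > 1 so the threshold 2h/(h−1) is defined (the function and mode names are hypothetical):

```python
def choose_migration_mode(bubbles, tasks, ccr, h):
    """Pick a migration mode from bubble/task counts and CCR (sketch)."""
    if bubbles == tasks:
        return "square"                      # distance preserved, overclocked
    threshold = 2 * h / (h - 1)              # CCR threshold from the text
    if ccr > threshold:
        return "coldest_core_in_region"      # computation-sensitive app
    return "coldest_neighbor_in_region"      # communication-sensitive app
```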
Further, the implementation process of the three migration modes is as follows:
(1) The square migration mode requires square mapping and an application region at least twice the task count. When invoked, all tasks are unbound from their currently mapped processors and remapped, in order of task serial number (ID), into the other idle half of the application region;
(2) The in-region coldest-core migration mode: when invoked, the task obtained by hot-spot detection is located on an overheated processor; the processor core with the lowest temperature is then searched within the application region. If the coldest core found satisfies the migration condition of this mode, namely it is a bubble (an idle processor with no mapped task), the task is unbound from the overheated (original) processor and mapped to it. If the migration condition cannot be met, i.e. the coldest core is not a bubble, the overheated processor is down-clocked;
(3) The in-region coldest-neighbor-core migration mode: when invoked, the task ID obtained by hot-spot detection locates the overheated processor, and the lowest-temperature core is searched among the 8 cores adjacent to it. If the core found satisfies the migration condition of this mode, namely a task is mapped on it and its temperature is not higher than two-thirds of the threshold temperature, or it is a bubble (an idle processor), task migration is executed: if a task is already on that core, the tasks on it and on the overheated processor are exchanged (a double unbind-remap); if the core is idle, a single unbind-remap is executed. If the migration condition cannot be met, i.e. the neighbor core's temperature is higher than two-thirds of the threshold temperature, the overheated processor is down-clocked.
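The decision inside mode (3) can be sketched as a pure function of the chosen neighbor's state; the two-thirds rule follows the text, while the function name and action strings are illustrative:

```python
def neighbor_migration_action(neighbor_temp, neighbor_has_task, threshold_temp):
    """In-region coldest-neighbor-core decision (sketch)."""
    if neighbor_has_task and neighbor_temp <= threshold_temp * 2 / 3:
        return "swap"      # double unbind-remap: exchange the two tasks
    if not neighbor_has_task:
        return "move"      # single unbind-remap onto the bubble
    return "throttle"      # condition unmet: down-clock the hot processor
```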
Further, the branch-and-bound algorithm works as follows: allocate a different number of bubbles to each application and, at each node, compute the current total response time as the cost function. The total response time σ is the maximum, over the applications whose bubbles have been allocated at that node, of the sum of waiting time and execution time:

σ = max{η_i + Π_i}, i ∈ {0, 1, ..., n}

where η_i is the waiting time of application i and Π_i its execution time. At each step the branch whose lower bound grows slowest is expanded first, and nodes whose cost, i.e. response time, is higher than that of sibling nodes are pruned; the result is the bubble allocation with the shortest total response time, i.e. the division of application regions.
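A compact depth-first sketch of this search: it minimizes σ = max(η_i + Π_i) over candidate bubble counts and prunes any branch whose partial σ already meets the best found. The `(name, cost_fn)` input shape is an assumed illustration, not the patent's data structure:

```python
def allocate_bubbles(apps, options):
    """apps: list of (name, cost_fn) with cost_fn(bubbles) -> eta + Pi;
    options: per-app candidate bubble counts. Returns the allocation
    minimizing sigma = max_i (eta_i + Pi_i) (branch-and-bound sketch)."""
    best, best_sigma = None, float("inf")

    def search(i, chosen, sigma):
        nonlocal best, best_sigma
        if sigma >= best_sigma:          # bound: prune dominated branches
            return
        if i == len(apps):
            best, best_sigma = list(chosen), sigma
            return
        name, cost = apps[i]
        for b in options[i]:
            chosen.append((name, b))
            search(i + 1, chosen, max(sigma, cost(b)))
            chosen.pop()

    search(0, [], 0.0)
    return best, best_sigma
```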
Further, after an application arrives, its execution time under different bubble counts is estimated with the bubble-performance model. The bubble count of each application must not exceed its task count. When the bubble count equals the task count, the execution time under all migration modes is estimated. When the bubble count is less than the task count, the execution time of the computation-friendly in-region coldest-core migration mode is estimated for applications whose CCR is above the threshold, and that of the communication-friendly coldest-neighbor-core migration mode for applications whose CCR is below the threshold.
Further, if the number of processors in the region exceeds the number of application tasks, the frequency of the active processors in the region is computed from the ratio of bubbles to tasks in the mapped region, so that the processors can run overclocked.
Further, during the application running stage the system performs hot-spot detection in every control interval. When a hot spot occurs, task migration is performed within the corresponding application region according to the selected migration mode.
Further, the communication-priority mapping mode maps the node with the largest communication weight first, onto a processor core near the geometric center of the application region; the nodes connected to it are then mapped in turn onto the available processor cores at the shortest Manhattan distance. Next, a mapped parent node with unmapped children is selected and its connected nodes are mapped in turn onto the available cores at the shortest Manhattan distance, and so on until all tasks of the application are mapped onto processors.
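A greedy sketch of this placement, assuming a task graph given as an adjacency dict and a list of region cells (these interfaces are illustrative, not the patent's):

```python
def comm_priority_map(adj, weights, cells):
    """Map the heaviest-communication task near the region's geometric
    center, then place each connected task on the free core at the
    shortest Manhattan distance from its parent (sketch)."""
    free = list(cells)
    cy = sum(y for y, _ in cells) / len(cells)
    cx = sum(x for _, x in cells) / len(cells)

    def nearest(ref):
        return min(free, key=lambda c: abs(c[0] - ref[0]) + abs(c[1] - ref[1]))

    start = max(weights, key=weights.get)     # largest communication weight
    placement = {start: nearest((cy, cx))}
    free.remove(placement[start])
    frontier = [start]
    while frontier:                           # breadth-first over the task graph
        parent = frontier.pop(0)
        for child in adj[parent]:
            if child not in placement:
                placement[child] = nearest(placement[parent])
                free.remove(placement[child])
                frontier.append(child)
    return placement
```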
Further, in the square mapping mode the side length is the square root of the number of application tasks rounded down, and all tasks are mapped contiguously within a rectangular region.
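The square layout can be sketched in a few lines; `square_map` and its origin parameter are illustrative:

```python
import math

def square_map(num_tasks, origin=(0, 0)):
    """Square mapping mode (sketch): side length is floor(sqrt(num_tasks));
    tasks fill the rectangle contiguously, row by row."""
    side = math.isqrt(num_tasks)         # floor of the square root
    oy, ox = origin
    return [(oy + i // side, ox + i % side) for i in range(num_tasks)]
```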
A control interval equal to the sampling interval is selected, and in each control interval the system checks whether any processor temperature is above the given temperature threshold. If not, no task migration is performed. If a processor core exceeds the threshold, the task mapped on the hot spot is traced to its application, and task migration is carried out in the selected mode within the event-class instance bound to that application. Hot-spot detection returns the overheated application ID and its overheated task ID.
The method establishes the application's bubble-performance model and waiting-time model, designs three migration modes, and extends a two-dimensional network-on-chip many-core simulation system. In resource allocation, the black silicon phenomenon is exploited to obtain better computing performance. In task migration, a migration mode is selected for each application according to its mapping mode and its own computation-to-communication ratio (CCR). A higher CCR generally means that computational performance contributes more to the application's overall performance; conversely, communication performance matters more in overall performance.
Compared with the prior art, the invention has the following beneficial effects:
1. In this scheduling method, bubbles serve not only for heat dissipation but also as migration targets for tasks on overheated processors.
2. The system dynamically responds to the current application arrival rate: when few applications arrive, it fully exploits the black silicon phenomenon to improve application running performance; when many arrive, it raises system utilization and avoids overly long application waiting times. Overall, compared with other conventional task-migration methods, system throughput is improved and the average response time of applications is reduced.
Drawings
FIG. 1 is a block diagram of an expanded simulation system of the present invention.
FIG. 2 is a diagram showing an example of an algorithm for distributing bubbles in a branch boundary according to the present invention.
FIG. 3 is a flow chart of a method for scheduling processor resources according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
In this resource scheduling method for a many-core system processor based on thermal-aware dynamic task migration, the CCR threshold of application 1 is 2.6, that of application 2 is 2.6, and that of application 3 is 2.4. The system is a 5×5 grid of 25 processor cores; the cores' clock frequency is 1 GHz, so the clock period is 1 nanosecond.
As shown in fig. 3, the method comprises the following steps:
step one, detect whether the waiting queue is empty; if not, map each application in the waiting queue and proceed to step two; if it is empty, end resource scheduling and wait for the next clock cycle;
step two, detect whether the arrival queue is empty; if not, proceed to step three; if it is empty, no new application has arrived in this clock cycle and no processor resource scheduling is needed, so end resource scheduling and perform step one when the next clock cycle starts.
step three, detect whether any application is running in the system. If none is running, i.e. all 5×5 = 25 processor resources of the system are available, use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under different bubble counts, take these as inputs to the cost function of a branch-and-bound search, and find the optimal bubble allocation; if an application is running, proceed to step four.
Step four: if an application is running, the occupied application regions divide the system's available processors into a set of discrete idle regions. Use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under the selectable bubble counts (the difference between the total number of processors in each idle region and the number of application tasks), take these as inputs to the cost function of a branch-and-bound search, and find the optimal bubble allocation that minimizes the total response time.
step five, the application mapping stage: select an idle region with a first-fit heuristic, then select the application's mapping mode and perform the mapping; the mapping modes comprise a square mapping mode and a communication-priority mapping mode;
step six, the application running stage: a migration mode is selected for each application according to its mapping mode and its own computation-to-communication ratio (CCR).
Further, the bubble-performance model computes, for each application, the execution time corresponding to different bubble counts. It is a polynomial regression model over the application's execution time Π_i, the number of bubbles b_i contained in the application region, the application's critical-path hop count h_i (the hop count of the longest weighted path), the average task computation time c_i, and the average inter-task communication time t_i:

Π_i = Σ_{k=1}^{n1} α_k·b_i^k + Σ_{k=1}^{n2} β_k·h_i^k + Σ_{k=1}^{n3} γ_k·c_i^k + Σ_{k=1}^{n4} θ_k·t_i^k

where n1 to n4 are polynomial orders and α_k, β_k, γ_k, θ_k are the model coefficients, obtained by the maximum-likelihood method; b_i^k is the k-th power of the bubble count of application i, h_i^k the k-th power of its critical-path hop count, c_i^k the k-th power of its average task computation time, and t_i^k the k-th power of its average inter-task communication time.
Further, given the region size of the current application (the sum of its bubble count and task count), the waiting-time model computes the waiting time of each application corresponding to different bubble counts and the current arrival rate. The model is a polynomial regression. Denote the region size of the current application by R, the total number of processors in the system by |T|, the average task count of the mapped applications by |A_i|, and the waiting time of the current application by η_i. η_i is modeled by the following variables: the average bubble-to-task ratio r of the mapped applications, the average execution time e of the mapped applications, and the application arrival rate λ:

η_i = a_0 + Σ_{j=1}^{z} (δ_j·r^j + ε_j·e^j + μ_j·λ^j)

where a_0 is a constant term, z is the polynomial order, r^j, e^j, λ^j are the j-th powers of the average bubble-to-task ratio of the mapped applications, their average execution time, and the current application arrival rate, and δ_j, ε_j, μ_j are the fitting coefficients of the corresponding terms, obtained by maximum-likelihood regression.
Further, step three is implemented by a global manager: when subsequent applications arrive, the global manager counts all available processor regions in the current system, estimates for each arriving application the execution time and waiting time corresponding to each available region, and finds the minimum-cost correspondence between applications and available regions with a branch-and-bound algorithm.
Further, the first-fit algorithm is implemented as follows: scan the system for idle processors from left to right and top to bottom, and test whether each idle processor can serve as the starting point of an application area; it qualifies if the product of the number of idle processors to its right in the same row and the number of idle processors below it in the same column is at least the area of the application region. The first processor that qualifies becomes the top-left corner of the idle region allocated to the application; if no such processor is found, the application is placed in the waiting queue. Once the region is selected, the application is mapped into it in the selected mapping mode.

The mapping mode is selected as follows: when the application's bubble count and task count satisfy a 1:1 ratio, the square mapping mode is selected; when they do not, the communication-priority mapping mode is selected.
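The first-fit scan described above can be sketched as follows; the boolean-grid representation and function name are assumptions for illustration:

```python
def first_fit_start(idle, area):
    """Scan idle[y][x] (True = idle processor) left-to-right, top-to-bottom for
    the first start point whose (idle run to the right) x (idle run downward)
    product covers `area`; returns its (row, col), or None."""
    rows, cols = len(idle), len(idle[0])
    for y in range(rows):
        for x in range(cols):
            if not idle[y][x]:
                continue
            right = 0                                  # idle processors to the right (inclusive)
            while x + right < cols and idle[y][x + right]:
                right += 1
            down = 0                                   # idle processors below (inclusive)
            while y + down < rows and idle[y + down][x]:
                down += 1
            if right * down >= area:
                return (y, x)                          # region's top-left corner
    return None                                        # no start point: waiting queue
```

For example, on a fully idle 4x4 grid a request of area 14 is satisfied at (0, 0); if the grid cannot cover the area, the caller queues the application.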
Further, there are three migration modes: square migration, in-region coldest-core migration, and in-region coldest-neighbour-core migration. The migration mode is selected according to the application's mapping mode and its computation-to-communication ratio (CCR), and the bubble count of each application must never exceed its task count. The selection proceeds as follows:

if the bubble count equals the task count, square migration is selected so that communication distances remain unchanged during migration, and the processors run in an over-clocked mode;

if the bubble count does not equal the task count, communication-priority mapping is selected for the application: in-region coldest-core migration is chosen for applications whose CCR exceeds the threshold, and in-region coldest-neighbour-core migration for applications whose CCR is below the threshold.
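The selection rule can be summarised in a small sketch; the function and mode names are illustrative, and the threshold 2h/(h-1) is the one given in the claims:

```python
def choose_migration_mode(bubbles, tasks, ccr, h):
    """bubbles/tasks: counts for the application; ccr: computation-to-communication
    ratio; h: critical-path hop count (assumed h > 1)."""
    if bubbles == tasks:
        return "square"                  # square mapping: whole-region migration
    threshold = 2 * h / (h - 1)          # CCR threshold from claim 1
    if ccr > threshold:
        return "region-coldest-core"     # computation-dominated: distance matters less
    return "coldest-neighbor-core"       # communication-dominated: stay adjacent
```

With h = 3 the threshold is 3.0, so the embodiment's application 3 (CCR 3.2) would select in-region coldest-core migration.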
Further, the three migration modes are implemented as follows:

(1) Square migration requires square mapping and an application area at least twice the task count. When invoked, all tasks are unbound from their currently mapped processors and remapped, in order, into another idle part of the application area;

(2) In-region coldest-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest processor core in the application area is then found. If that core satisfies the migration condition, i.e. it is a bubble with no task mapped on it, the task is unbound from the overheated processor and remapped onto it; if the condition cannot be met, i.e. the coldest core is not a bubble, the overheated processor is down-clocked instead;

(3) In-region coldest-neighbour-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest core among its 8 adjacent neighbours is then found. If that core satisfies the migration condition, i.e. it either holds a task but its temperature is no higher than two thirds of the threshold temperature, or it is a bubble, it is selected as the migration target. If the target already holds a task, the tasks on the two processors are exchanged by a double unbind-remap: the overheated processor is unbound from its task X, the target processor is unbound from its task Y, task X is remapped onto the target processor, and task Y onto the overheated processor. If the target is an idle processor, a single unbind-remap is executed. If the migration condition cannot be met, i.e. the neighbour's temperature exceeds two thirds of the threshold temperature, the overheated processor is down-clocked.
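A minimal sketch of mode (2), in-region coldest-core migration, assuming simple dictionaries for core temperatures and core-to-task mappings (these structures and names are illustrative):

```python
def migrate_coldest_core(hot_core, region, temps, mapping):
    """region: cores of the application area; temps: core -> temperature;
    mapping: core -> task or None (None marks a bubble)."""
    coldest = min(region, key=lambda c: temps[c])
    if mapping[coldest] is None:                # migration condition: coldest core is a bubble
        mapping[coldest] = mapping[hot_core]    # remap the overheated task onto it
        mapping[hot_core] = None                # unbind from the overheated core
        return ("migrated", coldest)
    return ("down-clock", hot_core)             # condition unmet: reduce frequency instead
```

If the coldest core already holds a task, the sketch falls back to down-clocking the overheated core, mirroring the text above.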
Further, after an application arrives, its execution time under different bubble counts is estimated with the bubble-performance model. The bubble count of each application must never exceed its task count. When the bubble count equals the task count, execution times are estimated for all migration modes. When the bubble count is smaller than the task count, execution time is estimated under the computation-friendly in-region coldest-core migration mode for applications whose CCR exceeds the threshold, and under the communication-friendly coldest-neighbour-core migration mode for applications whose CCR is below the threshold.
Further, if the region contains more processors than the application has tasks, the frequency of the active processors in the region is raised according to the ratio of bubbles to tasks in the mapped region, so that those processors can run in an over-clocked mode.
Further, during the application run phase, the system performs hotspot detection in every control interval. When a hotspot occurs, task migration is carried out within the corresponding application area according to the selected migration mode.
Further, the communication-priority mapping mode maps the node with the largest communication weight, as the preferred task, onto a processor core near the geometric centre of the application area, and then maps the nodes connected to it, in turn, onto the available processor cores at the shortest Manhattan distance. Next, a mapped parent node of a still-unmapped task is selected, and the nodes connected to it are mapped in turn onto the available cores at the shortest Manhattan distance; and so on, until all tasks of the application are mapped onto processors.
Further, in the square mapping mode, the side length is the square root of the application's task count rounded down, and all tasks are mapped contiguously within a rectangular area.
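A sketch of the square mapping rule, assuming a top-left origin for the allocated region (function name and coordinate convention are illustrative):

```python
import math

def square_map(num_tasks, origin=(0, 0)):
    """Map tasks 0..num_tasks-1 row by row into a near-square rectangle whose
    row width is floor(sqrt(num_tasks)), as described in the text."""
    side = math.isqrt(num_tasks)        # side length: floor of the square root
    y0, x0 = origin
    placement = {}
    for t in range(num_tasks):
        placement[t] = (y0 + t // side, x0 + t % side)  # fill rows contiguously
    return placement
```

With 7 tasks the side is 2, so the tasks occupy a 2-wide rectangle of 4 rows (the last row only partly filled).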
The control interval is chosen equal to the sampling interval, and in each control interval the system checks whether any processor temperature exceeds the given temperature threshold of 80 °C. If not, no task migration is performed. If a processor core exceeds the threshold, the task mapped on the hotspot is located to its application, and task migration is carried out in the selected mode within the event-class instance bound to that application. Hotspot detection yields the overheated application ID and its overheated task ID.
The invention is implemented in a simulation system; see Fig. 1. The two-dimensional many-core task-scheduling simulation system comprises an application generator, an event-driven simulated multi-core system, and the HotSpot temperature simulator, a temperature simulation model based on a resistance-capacitance equivalent circuit developed at the University of Virginia and noted for being both accurate and fast; the latest release, HotSpot 6.0, is used. The application generator randomly generates a task graph, i.e. it creates the simulated application's task count, the computation amount of each task (range 100-800), the communication topology between tasks, and the communication volume of each edge (range 50-500). The task graph is expressed as a weighted directed acyclic graph containing the computation amount of each task and the communication dependencies between tasks.
The event-driven simulated multi-core system simulates the binding between individual task instances and simulated processors, the computation of tasks, and the communication between tasks. The processor resource-scheduling algorithm covers assigning bubbles to applications, task mapping, and simulated task-execution migration. Task mapping is implemented as a single-task-to-single-processor binding. When an application finishes running, the processors occupied by its tasks are released. During simulated execution, a task's computation speed equals the operating frequency of its processor (the operating frequency of each simulated processor in the system is adjustable), and the communication speed between each pair of tasks equals the processor's routing frequency divided by the Manhattan distance between the two processors.
In the task mapping stage, two alternative mapping methods are provided: square mapping and communication-priority mapping.

Square mapping simply maps all tasks of an application contiguously, in order, within an approximately square area occupying half of the overall application area (the allocated free contiguous processor region).

Communication-priority mapping first creates three dynamic arrays, MAP, MET and UNM, holding respectively the mapped tasks, the tasks connected to (i.e. communicating with) mapped tasks, and the remaining tasks in neither of the first two queues. Nodes represent simulated processors. At initialization, the approximate geometric centre of the application area (the allocated free contiguous processor region) is chosen as the first node, the task with the highest accumulated traffic is chosen as the preferred task and mapped onto that node, and the node is placed in the MAP array. All tasks connected to the preferred task are placed, in turn, into the MET array, and the remaining tasks into the UNM array. For the first task in MET, its parent task and the corresponding node in MAP are found, and nodes are searched in order of Manhattan distance 1, 2, 3, … from that node until the first available node is found; the MET task is mapped onto it and moved from MET into MAP, and the tasks connected to it that are still in UNM are moved from UNM into MET. These steps are repeated for each task in MET until the size of the MAP array equals the application's task count and the mapping is complete.
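The MAP/MET/UNM procedure can be sketched as follows. The adjacency and weight structures are assumptions for illustration, ties in Manhattan distance are broken by list order, and MET neighbours are taken in sorted order to keep the sketch deterministic:

```python
from collections import deque

def comm_priority_map(graph, weight, region):
    """graph: task -> set of communicating tasks (symmetric); weight: task ->
    accumulated traffic; region: list of (y, x) core coordinates.
    Returns task -> core placement."""
    free = list(region)
    cy = sum(y for y, _ in region) / len(region)
    cx = sum(x for _, x in region) / len(region)
    # first node: the core nearest the region's approximate geometric centre
    centre = min(free, key=lambda c: abs(c[0] - cy) + abs(c[1] - cx))
    first = max(graph, key=lambda t: weight[t])    # preferred task: highest traffic
    placed = {first: centre}                       # the MAP set
    free.remove(centre)
    met = deque(sorted(graph[first]))              # the MET queue
    seen = {first} | set(met)                      # everything else is UNM
    while met:
        task = met.popleft()
        parent = next(p for p in graph[task] if p in placed)  # mapped parent in MAP
        py, px = placed[parent]
        # first available core at the shortest Manhattan distance from the parent
        node = min(free, key=lambda c: abs(c[0] - py) + abs(c[1] - px))
        placed[task] = node
        free.remove(node)
        for nxt in sorted(graph[task]):            # move neighbours from UNM to MET
            if nxt not in seen:
                met.append(nxt)
                seen.add(nxt)
    return placed
```

On a 3-task chain in a 2x2 region, the heaviest task lands at the centre-most core and its neighbours at distance 1.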
According to the single-processor power model, the power of every processor in the many-core system is computed each microsecond and recorded as the power trace at the current instant. A fixed period is taken as the sampling interval; every sampling interval the HotSpot temperature simulator is invoked to simulate the system temperature, taking the power traces accumulated over the elapsed running time as input, and HotSpot returns the instantaneous simulated temperature of every processor core in the system at that moment. When a processor's temperature exceeds the set temperature threshold, the application corresponding to the hotspot executes task migration.
In the task migration stage, three selectable migration modes are provided: square migration, in-region coldest-core migration, and in-region coldest-neighbour-core migration.

The square migration mode is based on square mapping: the tasks mapped into one rectangle are migrated as a whole into another rectangular area within the application region.

The in-region coldest-core migration mode searches the application area for the processor core with the lowest temperature and, once found and the migration condition is met, migrates the overheated task onto that coldest core.

The in-region coldest-neighbour-core migration mode searches the 8 cores adjacent to the overheated processor for the core with the lowest temperature; if a temperature condition is met, the tasks on that core and the overheated processor are exchanged, and otherwise the frequency of the overheated processor is appropriately reduced.
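A minimal sketch of the coldest-neighbour rule, under the same assumed data structures as the coldest-core sketch (dictionaries keyed by (y, x) core coordinates; names are illustrative):

```python
def migrate_coldest_neighbor(hot, temps, mapping, threshold):
    """Pick the coolest of the 8 neighbours of `hot`; move onto it if it is a
    bubble, swap tasks if it is mapped but below 2/3 of the threshold
    temperature, otherwise down-clock the overheated core."""
    y, x = hot
    neighbors = [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0) and (y + dy, x + dx) in temps]
    target = min(neighbors, key=lambda c: temps[c])
    if mapping.get(target) is None:                 # bubble: single unbind-remap
        mapping[target], mapping[hot] = mapping[hot], None
        return ("moved", target)
    if temps[target] <= threshold * 2 / 3:          # mapped but cool enough: swap
        mapping[target], mapping[hot] = mapping[hot], mapping[target]
        return ("swapped", target)
    return ("down-clock", hot)                      # condition unmet
```

The swap branch is the double unbind-remap described in the detailed steps; the move branch is the single one.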
Figure 2 shows an example in which three applications first arrive at the system and are assigned different bubble counts by the branch-and-bound algorithm to obtain the shortest total response time.
Application 1 contains seven tasks; the bubble-performance model gives execution times of 80, 76, 62, 51 and 40 for bubble counts 0, 1, 3, 5 and 7, respectively (for simplicity of presentation, only bubble counts that form a regular area are considered, and the longest side of an application area is smaller than the side of the system).

Application 2 contains five tasks; the bubble-performance model gives execution times of 300, 275, 250, 217 and 150 for bubble counts 0, 1, 3, 4 and 5.

Application 3 contains eight tasks; the bubble-performance model gives execution times of 195, 184, 166, 147, 125 and 99 for bubble counts 0, 1, 2, 4, 6 and 8.
In the embodiment, b1, b2 and b3 denote the bubbles allocated to applications 1, 2 and 3, respectively. By the nature of the branch-and-bound method, at each expansion the branch whose lower bound (total response time) grows slowest is expanded preferentially, and upper-level branches whose lower bound exceeds the lower bound of a lower-level branch are pruned (for example, the first branch of the third level has a lower bound of 195, and all second-level branches whose lower bounds exceed 195 are pruned). The branch at the bottom level with the shortest total response time is the optimal solution; in this embodiment the optimal total response time is 165, and all branches with lower bounds greater than the optimal solution are pruned. The bubbles corresponding to the optimal solution are allocated to the three applications; here they are {b1 = 7, b2 = 5, b3 = 6}. As shown in Fig. 2, the ancestor node of the optimal-solution node is b1 = 7, i.e. 7 bubbles are allocated to application 1; its parent node is b2 = 5, i.e. 5 bubbles are allocated to application 2; and the node itself corresponds to b3 = 6, i.e. 6 bubbles are allocated to application 3.
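The pruning logic can be illustrated with a toy branch-and-bound over per-application bubble options. Each option below carries a hypothetical response time (the figures in Fig. 2 additionally include waiting times, which this sketch folds into a single number), and the cost of a complete assignment is the maximum response time:

```python
def branch_and_bound(options):
    """options[i] is a list of (bubbles, response_time) choices for application i.
    Finds the choice per application minimising the maximum response time,
    pruning any partial branch whose lower bound already meets the incumbent."""
    best = {"cost": float("inf"), "pick": None}

    def expand(i, picks, bound):
        if bound >= best["cost"]:              # the lower bound can only grow: prune
            return
        if i == len(options):                  # complete assignment: new incumbent
            best["cost"], best["pick"] = bound, tuple(picks)
            return
        # expand cheaper (slower-growing) branches first
        for bubbles, resp in sorted(options[i], key=lambda o: o[1]):
            picks.append(bubbles)
            expand(i + 1, picks, max(bound, resp))
            picks.pop()

    expand(0, [], 0)
    return best["pick"], best["cost"]
```

With toy options drawn from the embodiment's execution times (waiting times omitted), allocating the largest bubble counts minimises the bound.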
The application-area division and resource scheduling proceed as follows: the area of application 1 must map 7 tasks and contains 7 bubbles, so a processor region of size 14 is allocated to it, and the first-fit algorithm finds a 3×5 contiguous idle region for it (one processor remains unoccupied); the area of application 2 must map 5 tasks and contains 5 bubbles, so a processor region of size 10 is allocated to it, and the first-fit algorithm finds a 2×5 contiguous idle region for it; the area of application 3 must map 8 tasks and contains 6 bubbles, but the system has no sufficiently large free region, so application 3 waits in the queue.

For application 1, the task count equals the bubble count, so its tasks are mapped into the region in square mapping mode, square migration is selected, and application 1 starts running; the same holds for application 2, whose task count also equals its bubble count.

Application 1 finishes first, and the processor cores of its application area are released. The area of application 3 must map 8 tasks and contains 6 bubbles; it is allocated a contiguous free processor region of size 14, for which the first-fit algorithm finds a 3×5 idle region.

For application 3, the bubble count is smaller than the task count, so its tasks are mapped into the region in communication-priority mode. Its CCR is computed to be 3.2, which exceeds the threshold, so in-region coldest-core migration is selected for it, and application 3 starts running.
The foregoing is a detailed description of the present invention in connection with specific embodiments, but the invention is not to be construed as limited to these embodiments. Those of ordinary skill in the art may make various adaptations, modifications, substitutions and/or variations of these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (4)

1. A many-core system processor resource scheduling method based on thermally aware dynamic task migration, characterized by comprising the following steps:

step one: detect whether the waiting queue is empty; if not empty, map each application in the waiting queue and proceed to step two; if empty, end resource scheduling, wait for the next clock cycle to start, and return to step one;

step two: detect whether the arrival queue is empty; if not empty, proceed to step three; if empty, no new application arrives in this clock cycle and no processor resource scheduling is needed, so end resource scheduling, wait for the next clock cycle to start, and return to step one;

step three: detect whether no application is running in the many-core system; if none is running, use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under different bubble counts, and search for the bubble allocation with the shortest total application response time with a branch-and-bound algorithm, the execution and waiting times serving as inputs to the cost function of the branch and bound; if an application is running, proceed to step four;
the bubble-performance model is used to compute the execution time of each application under different bubble counts and is a polynomial regression model; let the execution time of the application be π_i, the number of bubbles contained in the application area be b_i, the hop count of the application's critical path, i.e. the hop count of the path with the largest weighted length, be h_i, the average task computation time in the application be c_i, and the average inter-task communication time in the application be t_i; the bubble-performance model is:

$$\pi_i = \sum_{k=1}^{n_1}\alpha_k b_i^k + \sum_{k=1}^{n_2}\beta_k h_i^k + \sum_{k=1}^{n_3}\gamma_k c_i^k + \sum_{k=1}^{n_4}\theta_k t_i^k$$

where n_1 to n_4 are polynomial orders and α_k, β_k, γ_k, θ_k are the fitting coefficients of the model, obtained by the maximum-likelihood method; b_i^k is the k-th power of application i's bubble count, h_i^k is the k-th power of application i's critical-path hop count, c_i^k is the k-th power of application i's average task computation time, and t_i^k is the k-th power of application i's average pairwise inter-task communication time;
the waiting-time model takes as input the size of the current application's area, i.e. the sum of its bubble count and task count, and is used to compute the waiting time of each application under different bubble counts and the current arrival rate; the model is a polynomial regression model; denote the area of the current application by R, the total number of processors in the system by |T|, and the average task count of the mapped applications by |A_i|; let the waiting time of the current application be η_i; η_i is modelled from the following variables: the average bubble-to-task ratio r of the mapped applications, the average execution time e of the mapped applications, and the application arrival rate λ; the waiting-time model is:

$$\eta_i = a_0 + \sum_{j=1}^{z}\left(\delta_j r^j + \varepsilon_j e^j + \mu_j \lambda^j\right)$$

where a_0 is a constant term, z is the polynomial order, r^j, e^j and λ^j are the j-th powers (j from 1 to z) of the average bubble-to-task ratio of the mapped applications, the average execution time of the mapped applications, and the current application arrival rate, respectively, and δ_j, ε_j and μ_j are the fitting coefficients of the corresponding terms, obtained by maximum-likelihood regression;
step four: if the application runs, the available processor of the system is divided into a group of discontinuous idle available areas by the occupied application area, the execution time and the waiting time under the selectable bubble number are estimated for each application by using a bubble-performance model and a waiting time model respectively, the execution time and the waiting time are used as the calculation input of a cost function in a branch limit, and the optimal bubble distribution result is searched by a branch limit algorithm;
fifth, the mapping stage is applied: selecting an idle area by adopting a first-time adaptive heuristic algorithm, and then selecting an applied mapping mode for mapping; the first adaptation heuristic algorithm is as follows: searching idle processors from left to right and from top to bottom in the system, judging whether the idle processors can be used as a starting point of an application area, wherein the condition that the starting point can be used as the starting point is that the product of the number of idle processors on the right side of the same row and the number of idle processors under the same row is larger than or equal to the area of the application area; when a first processor which can be used as the starting point of the application area is found, the processor is used as the idle area at the leftmost upper corner to be distributed to the application, and if the processor which can be used as the starting point of the application area is not found, the application is put into a waiting queue; after the region is selected, mapping the application in the selected mode in the region; the mode of selecting the mapping mode of the application is as follows: when the number of bubbles and tasks applied satisfies 1:1, selecting a square mapping mode by application; when the number of bubbles and tasks applied does not satisfy 1:1, applying a mapping mode for selecting communication priority;
step six, application operation phase: selecting a migration mode for different types of applications according to the mapping mode of the application and the calculation amount-traffic ratio (Computation communication rate, CCR) of the application as selection basis; the migration modes comprise three migration modes, namely: square migration mode, coldest core migration mode in area and coldest neighbor core migration mode in area;
the three migration modes are implemented as follows:

(1) square migration requires square mapping and an application area at least twice the task count; when invoked, all tasks are unbound from their currently mapped processors and remapped, in order of task number, into an idle part of the application area;

(2) in-region coldest-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest processor core in the application area is then found; if that core satisfies the migration condition of this mode, i.e. it is a bubble, that is, an idle processor with no mapped task, the task is unbound from the overheated (original) processor and remapped onto it; if the migration condition cannot be met, i.e. the coldest core is not a bubble, the overheated processor is down-clocked;

(3) in-region coldest-neighbour-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest core among its 8 adjacent neighbours is then found; if that core satisfies the migration condition of this mode, i.e. it holds a task but its temperature is no higher than two thirds of the threshold temperature, or it is a bubble, the task migration is executed; if the neighbour already holds a task, the tasks on the two processors are exchanged by a double unbind-remap; if the neighbour is idle, a single unbind-remap is executed; if the migration condition cannot be met, i.e. the neighbour's temperature exceeds two thirds of the threshold temperature, the overheated processor is down-clocked;
in step six, the migration mode is selected according to the application's mapping mode and its computation-to-communication ratio, and the bubble count of each application must never exceed its task count; the selection conditions are:

if the bubble count equals the task count, square migration is selected so that the communication distance stays unchanged during migration, and the processors run in an over-clocked mode;

if the bubble count does not equal the task count, in-region coldest-core migration is selected for applications whose CCR exceeds the threshold, and in-region coldest-neighbour-core migration for applications whose CCR is below the threshold, where the threshold is 2h/(h-1) and h is the application's critical-path hop count, i.e. the hop count of the path with the largest weighted length.
2. The processor resource scheduling method according to claim 1, wherein step three is implemented by a global manager: when subsequent applications arrive, the global manager counts all available processor areas in the current system, estimates the execution time and waiting time of each newly arrived application in each available area, and finds the minimum-cost application-to-area correspondence with a branch-and-bound algorithm.
3. The processor resource scheduling method according to claim 1, wherein the branch-and-bound algorithm is: allocate a different number of bubbles to each application, computing the current total response time at each node as the cost function; the total response time σ is the maximum, over the applications whose bubbles have been allocated at the node, of the sum of waiting time and execution time:

$$\sigma = \max\{\eta_i + \pi_i\},\quad i \in \{0, 1, \ldots, n\}$$

where η_i is the waiting time of application i and π_i is its execution time; at each step the branch whose lower bound grows slowest is preferentially expanded, upper-level nodes whose cost, i.e. response time, is higher than that of lower-level nodes are pruned, and the final result is the bubble allocation with the shortest total response time, i.e. the application-region division result.
4. The processor resource scheduling method according to claim 1, wherein the selectable bubble count is the difference between the total processor count of each free region and the application's task count, the application's task count being equal to the number of processors on which the application runs.
CN201910049800.5A 2019-01-18 2019-01-18 Resource scheduling method for many-core system processor based on thermal perception dynamic task migration Active CN109918195B (en)

Publications (2)

Publication Number Publication Date
CN109918195A CN109918195A (en) 2019-06-21
CN109918195B (en) 2023-06-20

Family

ID=66960500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910049800.5A Active CN109918195B (en) 2019-01-18 2019-01-18 Resource scheduling method for many-core system processor based on thermal perception dynamic task migration

Country Status (1)

Country Link
CN (1) CN109918195B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445154B (en) * 2019-08-27 2021-09-17 无锡江南计算技术研究所 Multi-stage processing method for heterogeneous many-core processor temperature alarm
CN110794949A (en) * 2019-09-27 2020-02-14 苏州浪潮智能科技有限公司 Power consumption reduction method and system for automatically allocating computing resources based on component temperature
CN114039980B (en) * 2021-11-08 2023-06-16 欧亚高科数字技术有限公司 Low-delay container migration path selection method and system for edge collaborative computing
CN113867973B (en) * 2021-12-06 2022-02-25 腾讯科技(深圳)有限公司 Resource allocation method and device
WO2024009747A1 (en) * 2022-07-08 2024-01-11 ソニーグループ株式会社 Information processing device, information processing method, and program

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102473161A (en) * 2009-08-18 2012-05-23 国际商业机器公司 Decentralized load distribution to reduce power and/or cooling cost in event-driven system
CN107193656A (en) * 2017-05-17 2017-09-22 深圳先进技术研究院 Method for managing resource, terminal device and the computer-readable recording medium of multiple nucleus system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8887165B2 (en) * 2010-02-19 2014-11-11 Nec Corporation Real time system task configuration optimization system for multi-core processors, and method and program
JP5946068B2 (en) * 2013-12-17 2016-07-05 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Computation method, computation apparatus, computer system, and program for evaluating response performance in a computer system capable of operating a plurality of arithmetic processing units on a computation core
US20180107965A1 (en) * 2016-10-13 2018-04-19 General Electric Company Methods and systems related to allocating field engineering resources for power plant maintenance

Non-Patent Citations (2)

Title
Wang Xiao-Hang, et al. "Energy Efficient Run-Time Incremental Mapping for 3-D Networks-On-Chip." Journal of Computer Science and Technology, vol. 28, no. 1, Jan. 2013, pp. 54-71. *
Lu Guoming, et al. "Research on Collaborative Resource Allocation in Data Grids." Systems Engineering and Electronics, vol. 28, no. 1, Jan. 2006, pp. 110-114. *

Also Published As

Publication number Publication date
CN109918195A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109918195B (en) Resource scheduling method for many-core system processor based on thermal perception dynamic task migration
Grandl et al. Multi-resource packing for cluster schedulers
Chen et al. MapReduce scheduling for deadline-constrained jobs in heterogeneous cloud computing systems
Safari et al. PL-DVFS: combining Power-aware List-based scheduling algorithm with DVFS technique for real-time tasks in Cloud Computing
Mohapatra et al. A comparison of four popular heuristics for load balancing of virtual machines in cloud computing
Stavrinides et al. Scheduling multiple task graphs in heterogeneous distributed real-time systems by exploiting schedule holes with bin packing techniques
US20130191612A1 (en) Interference-driven resource management for gpu-based heterogeneous clusters
Guzek et al. HEROS: Energy-efficient load balancing for heterogeneous data centers
Jiang et al. Scheduling concurrent workflows in HPC cloud through exploiting schedule gaps
CN110362388B (en) Resource scheduling method and device
Pascual et al. Towards a greener cloud infrastructure management using optimized placement policies
CN102609303B Slow-task dispatching method and device for a MapReduce system
Stavrinides et al. Energy-aware scheduling of real-time workflow applications in clouds utilizing DVFS and approximate computations
Wang et al. An adaptive model-free resource and power management approach for multi-tier cloud environments
Singh et al. Run-time mapping of multiple communicating tasks on MPSoC platforms
Li et al. On runtime communication and thermal-aware application mapping and defragmentation in 3D NoC systems
CN109062682B (en) Resource scheduling method and system for cloud computing platform
Pascual et al. Effects of topology-aware allocation policies on scheduling performance
Than et al. Energy-saving resource allocation in cloud data centers
Meng et al. Communication and cooling aware job allocation in data centers for communication-intensive workloads
Wang et al. Exploiting dark cores for performance optimization via patterning for many-core chips in the dark silicon era
Hussin et al. Efficient energy management using adaptive reinforcement learning-based scheduling in large-scale distributed systems
Kaushik et al. Run-time computation and communication aware mapping heuristic for NoC-based heterogeneous MPSoC platforms
Singh et al. Value and energy optimizing dynamic resource allocation in many-core HPC systems
Li et al. On runtime communication- and thermal-aware application mapping in 3D NoC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant