CN109918195B - Resource scheduling method for many-core system processor based on thermal perception dynamic task migration - Google Patents


Info

Publication number
CN109918195B
CN109918195B (application CN201910049800.5A)
Authority
CN
China
Prior art keywords
application
processor
area
task
migration
Prior art date
Legal status
Active
Application number
CN201910049800.5A
Other languages
Chinese (zh)
Other versions
CN109918195A (en)
Inventor
文生雁
王小航
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910049800.5A priority Critical patent/CN109918195B/en
Publication of CN109918195A publication Critical patent/CN109918195A/en
Application granted granted Critical
Publication of CN109918195B publication Critical patent/CN109918195B/en

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)

Abstract

The invention discloses a resource scheduling method for a many-core system processor based on thermal-aware dynamic task migration, comprising the following steps: step one, detect whether the waiting queue is empty; if not, map the waiting applications and proceed to step two. Step two, detect whether the arrival queue is empty; if not, proceed to step three. Step three, detect whether any application is running in the system; if not, use the models to estimate each application's execution time and waiting time under different bubble counts, and search for the optimal bubble allocation with a branch-and-bound algorithm; otherwise the same search is performed over the discrete idle regions. Step five is the application mapping stage and step six the application running stage. The method exploits the black silicon phenomenon: it adapts to application arrival queues of different lengths according to the arrival rate and the computation-sensitive or communication-sensitive character of each application, maintains the task operating frequency, responds to a variable application arrival rate, effectively improves system throughput, and improves system performance.

Description

Resource scheduling method for many-core system processor based on thermal perception dynamic task migration
Technical Field
The invention relates to the field of communication technology, and in particular to a resource scheduling method for a many-core system processor based on thermal-aware dynamic task migration.
Background
The many-core chip is one of the main processor components in fields such as cloud computing and mobile computing, and many-core processors are widely used in servers and data centers. In the development of computer systems, many-core chips are becoming an increasingly important platform. As applications demand more computation, the integration density and performance of many-core chips keep improving; with them, chip power density and temperature rise rapidly, and temperature has become an important factor limiting chip performance. Prolonged excessive temperature harms the reliability and lifetime of a many-core chip. Because of limited heat-dissipation capacity, and to ensure safe operation of the system, a power constraint is usually imposed on the chip system. To meet the power constraint while maintaining high-speed computing performance, a portion of the processors on the chip have to be kept shut down; this is known as the dark silicon ("black silicon") phenomenon.
The idle, inactive processors within the system-on-chip are commonly referred to as bubbles. Because bubbles dissipate heat well, some work places them around active processors so that those processors can run at a higher frequency and improve computational performance. This approach still belongs to static task-to-core mapping: since task positions are fixed, hot spots can still occur even at an unchanged operating frequency. Other methods use dynamic task migration to reduce hot spots, moving threads or tasks away from an overheated core to other cores once its temperature rises above a set threshold. One class of such methods picks the globally coldest processor, or a random idle processor, as the migration target, which can greatly increase the communication distance; when application tasks communicate heavily, this obviously incurs excessive communication cost. Another class migrates a task to a processor adjacent to the overheated one each time; after several such migrations, an application may leave behind a discontinuous idle area when it departs, so newly arrived or waiting applications cannot be mapped contiguously onto idle processors. Both kinds of methods can leave inter-task communication crossing multiple cores that run other applications, causing communication collisions between applications and reducing communication efficiency.
Because user requests are diverse and complex, server systems are expected to cope with varying workloads and respond in as short a time as possible. One task-migration method overclocks every processor running tasks and migrates all tasks to another contiguous idle processor region whenever a temperature threshold is reached. Although the inter-task communication distance stays unchanged, this method suits only lightly loaded systems: when more applications arrive and more processor resources are needed, such a low-utilization resource scheduling method brings overly long application waiting times, offsets the performance gain from overclocking, and increases response time.
The invention patent with publication number CN201310059705 discloses a core resource allocation method, apparatus, and many-core system. It mainly combines scattered core partitions according to the thread count (the number of idle cores required) of a user process, so that the resulting contiguous core partition can be allocated to the user process with optimized communication cost. The method selects a reference core partition and slave core partitions from at least two scattered core partitions according to partition migration cost, minimizing the total migration cost, and then migrates the idle cores of the slave partitions so that they merge with the idle cores of the reference partition into one contiguous partition. That patent consolidates on-chip processor resources mainly from the perspective of defragmentation, enabling arriving applications to be mapped contiguously. Its drawback is that it considers neither the power and temperature constraints of the system nor possible temperature peaks and non-uniform temperature distribution, so the system risks overheating.
None of the above task-migration methods takes the utilization of bubbles into account. When the black silicon phenomenon exists, designing dynamic processor resource scheduling for a many-core system that simultaneously respects the chip temperature constraint and the load is the key to maintaining its high performance.
Disclosure of Invention
The invention discloses a processor resource scheduling method under the black silicon phenomenon, implemented in a task-scheduling simulation system of a two-dimensional network-on-chip many-core system. It avoids thermal risk and accounts for communication cost while keeping processors at a higher operating frequency, thereby improving system throughput.
Therefore, the invention provides a resource scheduling method for a many-core system processor based on thermal perception dynamic task migration, which comprises the following steps:
step one, detect whether the waiting queue is empty; if not, map each application in the waiting queue and proceed to step two; if it is empty, end resource scheduling, wait for the start of the next clock cycle, and perform step one again;
step two, detect whether the arrival queue is empty; if not, proceed to step three; if it is empty, no new application has arrived in this clock cycle and no processor resource scheduling is needed, so end resource scheduling and perform step one when the next clock cycle starts;
step three, detect whether any application is running in the system. If none is running, i.e. all N×N processor resources of the system are available, use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under different bubble counts; take these as inputs to the cost function of a branch-and-bound search, and find the bubble allocation that minimizes the total response time of the applications. If an application is running, proceed to step four;
step four: if an application is running, the occupied application regions divide the system's available processors into a set of discrete idle regions. Use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under the selectable bubble counts; take these as inputs to the cost function of a branch-and-bound search, and find the optimal bubble allocation that minimizes the total response time.
step five, the application mapping stage: select an idle region with a first-fit heuristic, then select the application's mapping mode and perform the mapping; the mapping modes comprise a square mapping mode and a communication-priority mapping mode;
step six, the application running stage: a migration mode is selected for each application according to its mapping mode and its own computation-to-communication ratio (CCR).
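As a rough sketch of how steps one through four branch, the per-cycle decision can be written as a small pure function (all names — `schedule_step`, the boolean arguments, the action strings — are hypothetical illustrations, not from the patent):

```python
def schedule_step(wait_empty, arrivals_empty, any_running):
    """Return the actions the scheduler takes this clock cycle (sketch)."""
    actions = []
    if not wait_empty:
        actions.append("map_waiting")            # step one: map waiting apps
    if arrivals_empty:
        actions.append("idle")                   # step two: nothing arrived,
        return actions                           # end scheduling this cycle
    if not any_running:
        actions.append("allocate_whole_chip")    # step three: all N*N cores free
    else:
        actions.append("allocate_free_regions")  # step four: discrete idle regions
    return actions
```

Steps five and six (mapping and running stages) would then act on the chosen allocation.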
Further, the bubble-performance model computes, for each application, the execution time corresponding to different bubble counts. It is a polynomial regression model over the application's execution time Π_i, the number of bubbles b_i contained in the application region, the application's critical-path hop count h_i (the hop count of the longest weighted path), the average task computation time c_i, and the average inter-task communication time t_i:

Π_i = Σ_{k=1}^{n1} α_k·b_i^k + Σ_{k=1}^{n2} β_k·h_i^k + Σ_{k=1}^{n3} γ_k·c_i^k + Σ_{k=1}^{n4} θ_k·t_i^k

where n1 to n4 are polynomial orders and α_k, β_k, γ_k, θ_k are the model's fitting coefficients, obtained by the maximum-likelihood method; b_i^k is the k-th power of the bubble count of application i, h_i^k the k-th power of its critical-path hop count, c_i^k the k-th power of its average task computation time, and t_i^k the k-th power of its average inter-task communication time.
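A minimal numeric sketch of evaluating this polynomial model; the coefficient lists below are illustrative placeholders, whereas the patent obtains them by maximum-likelihood fitting:

```python
def execution_time(b, h, c, t, alpha, beta, gamma, theta):
    """Bubble-performance model (sketch): sum of four fitted polynomials in
    bubble count b, critical-path hops h, average computation time c, and
    average inter-task communication time t (coefficients are illustrative)."""
    poly = lambda coeffs, x: sum(a * x**k for k, a in enumerate(coeffs, start=1))
    return poly(alpha, b) + poly(beta, h) + poly(gamma, c) + poly(theta, t)
```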
Further, given the region size of the current application (the sum of its bubble count and task count), the waiting-time model computes the waiting time of each application corresponding to different bubble counts and the current arrival rate. The model is a polynomial regression. Denote the region size of the current application by R, the total number of processors in the system by |T|, the average task count of the mapped applications by |A_i|, and the waiting time of the current application by η_i. η_i is modeled by the following variables: the average bubble-to-task ratio r of the mapped applications, the average execution time e of the mapped applications, and the application arrival rate λ:

η_i = a_0 + Σ_{j=1}^{z} (δ_j·r^j + ε_j·e^j + μ_j·λ^j)

where a_0 is a constant term, z is the polynomial order, r^j, e^j, λ^j are the j-th powers of the average bubble-to-task ratio of the mapped applications, their average execution time, and the current application arrival rate, and δ_j, ε_j, μ_j are the fitting coefficients of the corresponding terms, obtained by maximum-likelihood regression.
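The waiting-time model can be evaluated the same way; again the constant `a0` and coefficient lists are illustrative stand-ins for the maximum-likelihood fit:

```python
def waiting_time(r, e, lam, a0, delta, eps, mu):
    """Waiting-time model (sketch): constant term plus fitted polynomials in
    the mapped apps' average bubble-to-task ratio r, their average execution
    time e, and the application arrival rate lam."""
    term = lambda coeffs, x: sum(c * x**j for j, c in enumerate(coeffs, start=1))
    return a0 + term(delta, r) + term(eps, e) + term(mu, lam)
```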
Further, the number of application tasks is the number of processors occupied by the application's mapping. An application is a collection of small tasks, each running a different instruction portion of the application, so the application executes in parallel. Each task is mapped to one processor core, so the number of application tasks equals the number of processors the running application occupies.
Further, the selectable bubble count is the difference between the total number of processors in each idle region and the number of application tasks.
Further, step three is implemented by a global manager: when subsequent applications arrive, the global manager counts all available processor regions in the current system, estimates for each arriving application the execution time and waiting time corresponding to each available region, and finds the minimum-cost correspondence between applications and available regions with a branch-and-bound algorithm.
Further, the first-fit algorithm is implemented as follows: search for idle processors in the system from left to right and from top to bottom, and judge whether each idle processor can serve as the starting point of the application region. It qualifies as a starting point if the product of the number of idle processors to its right in the same row and the number of idle processors below it in the same column is greater than or equal to the area of the application region. When the first processor that can serve as the starting point is found, it becomes the top-left corner of the idle region allocated to the application; if no such processor is found, the application is put into the waiting queue. After the region is selected, the application is mapped within it in the selected mode. The mapping mode is selected as follows: when the application's bubble count and task count satisfy the ratio 1:1, the square mapping mode is selected; otherwise, the communication-priority mapping mode is selected.
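The starting-point test of the first-fit search can be sketched as follows; the counting of idle cells to the right and below is one reading of the condition in the text, and `first_fit_start` with a boolean grid is a hypothetical interface:

```python
def first_fit_start(grid, area):
    """Scan left-to-right, top-to-bottom for the first idle cell whose
    (idle run to the right) * (idle run downward) >= area (sketch)."""
    rows, cols = len(grid), len(grid[0])
    for y in range(rows):
        for x in range(cols):
            if not grid[y][x]:
                continue
            right = 0
            while x + right < cols and grid[y][x + right]:
                right += 1
            down = 0
            while y + down < rows and grid[y + down][x]:
                down += 1
            if right * down >= area:
                return (y, x)   # top-left corner of the allocated region
    return None                 # no start found: application waits
```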
Further, there are three migration modes: the square migration mode, the in-region coldest-core migration mode, and the in-region coldest-neighbor-core migration mode. According to the application's mapping mode and its computation-to-communication ratio (the maximum bubble count of each application must not exceed its task count), the migration mode is selected as follows:
if the bubble count equals the task count, the square migration mode is selected for the application, keeping the communication distance unchanged during migration, and the processors run overclocked;
if the bubble count does not equal the task count, the communication-priority mapping was selected for the application. The CCR threshold is 2h/(h−1), where h is the hop count of the application's critical path, i.e. of the longest weighted path; in-region coldest-core migration is selected for applications whose CCR exceeds the threshold, and in-region coldest-neighbor-core migration for applications whose CCR is below it.
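The selection rule reduces to a few comparisons; this sketch assumes h > 1 so the threshold 2h/(h−1) is defined (the function and mode names are hypothetical):

```python
def choose_migration_mode(bubbles, tasks, ccr, h):
    """Pick a migration mode from bubble/task counts and CCR (sketch)."""
    if bubbles == tasks:
        return "square"                      # distance preserved, overclocked
    threshold = 2 * h / (h - 1)              # CCR threshold from the text
    if ccr > threshold:
        return "coldest_core_in_region"      # computation-sensitive app
    return "coldest_neighbor_in_region"      # communication-sensitive app
```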
Further, the implementation process of the three migration modes is as follows:
(1) The square migration mode requires square mapping and an application region at least twice the task count. When invoked, all tasks are unbound from their currently mapped processors and remapped, in order of task serial number (ID), into the other idle half of the application region;
(2) The in-region coldest-core migration mode: when invoked, the task obtained by hot-spot detection is located on an overheated processor; the processor core with the lowest temperature is then searched within the application region. If the coldest core found satisfies the migration condition of this mode, namely it is a bubble (an idle processor with no mapped task), the task is unbound from the overheated (original) processor and mapped to it. If the migration condition cannot be met, i.e. the coldest core is not a bubble, the overheated processor is down-clocked;
(3) The in-region coldest-neighbor-core migration mode: when invoked, the task ID obtained by hot-spot detection locates the overheated processor, and the lowest-temperature core is searched among the 8 cores adjacent to it. If the core found satisfies the migration condition of this mode, namely a task is mapped on it and its temperature is not higher than two-thirds of the threshold temperature, or it is a bubble (an idle processor), task migration is executed: if a task is already on that core, the tasks on it and on the overheated processor are exchanged (a double unbind-remap); if the core is idle, a single unbind-remap is executed. If the migration condition cannot be met, i.e. the neighbor core's temperature is higher than two-thirds of the threshold temperature, the overheated processor is down-clocked.
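The decision inside mode (3) can be sketched as a pure function of the chosen neighbor's state; the two-thirds rule follows the text, while the function name and action strings are illustrative:

```python
def neighbor_migration_action(neighbor_temp, neighbor_has_task, threshold_temp):
    """In-region coldest-neighbor-core decision (sketch)."""
    if neighbor_has_task and neighbor_temp <= threshold_temp * 2 / 3:
        return "swap"      # double unbind-remap: exchange the two tasks
    if not neighbor_has_task:
        return "move"      # single unbind-remap onto the bubble
    return "throttle"      # condition unmet: down-clock the hot processor
```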
Further, the branch-and-bound algorithm works as follows: allocate a different number of bubbles to each application and, at each node, compute the current total response time as the cost function. The total response time σ is the maximum, over the applications whose bubbles have been allocated at that node, of the sum of waiting time and execution time:

σ = max{η_i + Π_i}, i ∈ {0, 1, ..., n}

where η_i is the waiting time of application i and Π_i its execution time. At each step the branch whose lower bound grows slowest is expanded first, and nodes whose cost, i.e. response time, is higher than that of sibling nodes are pruned; the result is the bubble allocation with the shortest total response time, i.e. the division of application regions.
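A compact depth-first sketch of this search: it minimizes σ = max(η_i + Π_i) over candidate bubble counts and prunes any branch whose partial σ already meets the best found. The `(name, cost_fn)` input shape is an assumed illustration, not the patent's data structure:

```python
def allocate_bubbles(apps, options):
    """apps: list of (name, cost_fn) with cost_fn(bubbles) -> eta + Pi;
    options: per-app candidate bubble counts. Returns the allocation
    minimizing sigma = max_i (eta_i + Pi_i) (branch-and-bound sketch)."""
    best, best_sigma = None, float("inf")

    def search(i, chosen, sigma):
        nonlocal best, best_sigma
        if sigma >= best_sigma:          # bound: prune dominated branches
            return
        if i == len(apps):
            best, best_sigma = list(chosen), sigma
            return
        name, cost = apps[i]
        for b in options[i]:
            chosen.append((name, b))
            search(i + 1, chosen, max(sigma, cost(b)))
            chosen.pop()

    search(0, [], 0.0)
    return best, best_sigma
```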
Further, after an application arrives, its execution time under different bubble counts is estimated with the bubble-performance model. The bubble count of each application must not exceed its task count. When the bubble count equals the task count, the execution time under all migration modes is estimated. When the bubble count is less than the task count, the execution time of the computation-friendly in-region coldest-core migration mode is estimated for applications whose CCR is above the threshold, and that of the communication-friendly coldest-neighbor-core migration mode for applications whose CCR is below the threshold.
Further, if the number of processors in the region exceeds the number of application tasks, the frequency of the active processors in the region is computed from the ratio of bubbles to tasks in the mapped region, so that the processors can run overclocked.
Further, during the application running stage the system performs hot-spot detection in every control interval. When a hot spot occurs, task migration is performed within the corresponding application region according to the selected migration mode.
Further, the communication-priority mapping mode maps the node with the largest communication weight first, onto a processor core near the geometric center of the application region; the nodes connected to it are then mapped in turn onto the available processor cores at the shortest Manhattan distance. Next, a mapped parent node with unmapped children is selected and its connected nodes are mapped in turn onto the available cores at the shortest Manhattan distance, and so on until all tasks of the application are mapped onto processors.
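A greedy sketch of this placement, assuming a task graph given as an adjacency dict and a list of region cells (these interfaces are illustrative, not the patent's):

```python
def comm_priority_map(adj, weights, cells):
    """Map the heaviest-communication task near the region's geometric
    center, then place each connected task on the free core at the
    shortest Manhattan distance from its parent (sketch)."""
    free = list(cells)
    cy = sum(y for y, _ in cells) / len(cells)
    cx = sum(x for _, x in cells) / len(cells)

    def nearest(ref):
        return min(free, key=lambda c: abs(c[0] - ref[0]) + abs(c[1] - ref[1]))

    start = max(weights, key=weights.get)     # largest communication weight
    placement = {start: nearest((cy, cx))}
    free.remove(placement[start])
    frontier = [start]
    while frontier:                           # breadth-first over the task graph
        parent = frontier.pop(0)
        for child in adj[parent]:
            if child not in placement:
                placement[child] = nearest(placement[parent])
                free.remove(placement[child])
                frontier.append(child)
    return placement
```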
Further, in the square mapping mode the side length is the square root of the number of application tasks rounded down, and all tasks are mapped contiguously within a rectangular region.
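The square layout can be sketched in a few lines; `square_map` and its origin parameter are illustrative:

```python
import math

def square_map(num_tasks, origin=(0, 0)):
    """Square mapping mode (sketch): side length is floor(sqrt(num_tasks));
    tasks fill the rectangle contiguously, row by row."""
    side = math.isqrt(num_tasks)         # floor of the square root
    oy, ox = origin
    return [(oy + i // side, ox + i % side) for i in range(num_tasks)]
```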
A control interval equal to the sampling interval is selected, and in each control interval the system checks whether any processor temperature is above the given temperature threshold. If not, no task migration is performed. If a processor core exceeds the threshold, the task mapped on the hot spot is traced to its application, and task migration is carried out in the selected mode within the event-class instance bound to that application. Hot-spot detection returns the overheated application ID and its overheated task ID.
The method establishes the application's bubble-performance model and waiting-time model, designs three migration modes, and extends a two-dimensional network-on-chip many-core simulation system. In resource allocation, the black silicon phenomenon is exploited to obtain better computing performance. In task migration, a migration mode is selected for each application according to its mapping mode and its own computation-to-communication ratio (CCR). A higher CCR generally means that computational performance contributes more to the application's overall performance; conversely, communication performance matters more in overall performance.
Compared with the prior art, the invention has the following beneficial effects:
1. In this scheduling method, bubbles serve not only for heat dissipation but also as migration targets for tasks on overheated processors.
2. The system dynamically responds to the current application arrival rate: when few applications arrive, it fully exploits the black silicon phenomenon to improve application running performance; when many arrive, it raises system utilization and avoids overly long application waiting times. Overall, compared with other conventional task-migration methods, system throughput is improved and the average response time of applications is reduced.
Drawings
FIG. 1 is a block diagram of an expanded simulation system of the present invention.
FIG. 2 is a diagram showing an example of an algorithm for distributing bubbles in a branch boundary according to the present invention.
FIG. 3 is a flow chart of a method for scheduling processor resources according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
In this resource scheduling method for a many-core system processor based on thermal-aware dynamic task migration, the CCR threshold of application 1 is 2.6, that of application 2 is 2.6, and that of application 3 is 2.4. The system is a 5×5 grid of 25 processor cores; the cores' clock frequency is 1 GHz, so the clock period is 1 nanosecond.
As shown in fig. 3, the method comprises the following steps:
step one, detect whether the waiting queue is empty; if not, map each application in the waiting queue and proceed to step two; if it is empty, end resource scheduling and wait for the next clock cycle;
step two, detect whether the arrival queue is empty; if not, proceed to step three; if it is empty, no new application has arrived in this clock cycle and no processor resource scheduling is needed, so end resource scheduling and perform step one when the next clock cycle starts.
step three, detect whether any application is running in the system. If none is running, i.e. all 5×5 = 25 processor resources of the system are available, use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under different bubble counts, take these as inputs to the cost function of a branch-and-bound search, and find the optimal bubble allocation; if an application is running, proceed to step four.
Step four: if an application is running, the occupied application regions divide the system's available processors into a set of discrete idle regions. Use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under the selectable bubble counts (the difference between the total number of processors in each idle region and the number of application tasks), take these as inputs to the cost function of a branch-and-bound search, and find the optimal bubble allocation that minimizes the total response time.
step five, the application mapping stage: select an idle region with a first-fit heuristic, then select the application's mapping mode and perform the mapping; the mapping modes comprise a square mapping mode and a communication-priority mapping mode;
step six, the application running stage: a migration mode is selected for each application according to its mapping mode and its own computation-to-communication ratio (CCR).
Further, the bubble-performance model computes, for each application, the execution time corresponding to different bubble counts. It is a polynomial regression model over the application's execution time Π_i, the number of bubbles b_i contained in the application region, the application's critical-path hop count h_i (the hop count of the longest weighted path), the average task computation time c_i, and the average inter-task communication time t_i:

Π_i = Σ_{k=1}^{n1} α_k·b_i^k + Σ_{k=1}^{n2} β_k·h_i^k + Σ_{k=1}^{n3} γ_k·c_i^k + Σ_{k=1}^{n4} θ_k·t_i^k

where n1 to n4 are polynomial orders and α_k, β_k, γ_k, θ_k are the model coefficients, obtained by the maximum-likelihood method; b_i^k is the k-th power of the bubble count of application i, h_i^k the k-th power of its critical-path hop count, c_i^k the k-th power of its average task computation time, and t_i^k the k-th power of its average inter-task communication time.
Further, given the region size of the current application (the sum of its bubble count and task count), the waiting-time model computes the waiting time of each application corresponding to different bubble counts and the current arrival rate. The model is a polynomial regression. Denote the region size of the current application by R, the total number of processors in the system by |T|, the average task count of the mapped applications by |A_i|, and the waiting time of the current application by η_i. η_i is modeled by the following variables: the average bubble-to-task ratio r of the mapped applications, the average execution time e of the mapped applications, and the application arrival rate λ:

η_i = a_0 + Σ_{j=1}^{z} (δ_j·r^j + ε_j·e^j + μ_j·λ^j)

where a_0 is a constant term, z is the polynomial order, r^j, e^j, λ^j are the j-th powers of the average bubble-to-task ratio of the mapped applications, their average execution time, and the current application arrival rate, and δ_j, ε_j, μ_j are the fitting coefficients of the corresponding terms, obtained by maximum-likelihood regression.
Further, step three is implemented by a global manager: when subsequent applications arrive, the global manager counts all available processor regions in the current system, estimates for each arriving application the execution time and waiting time corresponding to each available region, and finds the minimum-cost correspondence between applications and available regions with a branch-and-bound algorithm.
Further, the first-fit algorithm is implemented as follows: scan the system for idle processors from left to right and top to bottom, and test whether each idle processor can serve as the starting point of an application area; it qualifies if the product of the number of idle processors to its right in the same row and the number of idle processors below it in the same column is at least the area of the application region. The first processor that qualifies becomes the top-left corner of the idle region allocated to the application; if no such processor is found, the application is placed in the waiting queue. Once the region is selected, the application is mapped into it in the selected mapping mode.

The mapping mode is selected as follows: when the application's bubble count and task count satisfy a 1:1 ratio, the square mapping mode is selected; when they do not, the communication-priority mapping mode is selected.
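The first-fit scan described above can be sketched as follows; the boolean-grid representation and function name are assumptions for illustration:

```python
def first_fit_start(idle, area):
    """Scan idle[y][x] (True = idle processor) left-to-right, top-to-bottom for
    the first start point whose (idle run to the right) x (idle run downward)
    product covers `area`; returns its (row, col), or None."""
    rows, cols = len(idle), len(idle[0])
    for y in range(rows):
        for x in range(cols):
            if not idle[y][x]:
                continue
            right = 0                                  # idle processors to the right (inclusive)
            while x + right < cols and idle[y][x + right]:
                right += 1
            down = 0                                   # idle processors below (inclusive)
            while y + down < rows and idle[y + down][x]:
                down += 1
            if right * down >= area:
                return (y, x)                          # region's top-left corner
    return None                                        # no start point: waiting queue
```

For example, on a fully idle 4x4 grid a request of area 14 is satisfied at (0, 0); if the grid cannot cover the area, the caller queues the application.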
Further, there are three migration modes: square migration, in-region coldest-core migration, and in-region coldest-neighbour-core migration. The migration mode is selected according to the application's mapping mode and its computation-to-communication ratio (CCR), and the bubble count of each application must never exceed its task count. The selection proceeds as follows:

if the bubble count equals the task count, square migration is selected so that communication distances remain unchanged during migration, and the processors run in an over-clocked mode;

if the bubble count does not equal the task count, communication-priority mapping is selected for the application: in-region coldest-core migration is chosen for applications whose CCR exceeds the threshold, and in-region coldest-neighbour-core migration for applications whose CCR is below the threshold.
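The selection rule can be summarised in a small sketch; the function and mode names are illustrative, and the threshold 2h/(h-1) is the one given in the claims:

```python
def choose_migration_mode(bubbles, tasks, ccr, h):
    """bubbles/tasks: counts for the application; ccr: computation-to-communication
    ratio; h: critical-path hop count (assumed h > 1)."""
    if bubbles == tasks:
        return "square"                  # square mapping: whole-region migration
    threshold = 2 * h / (h - 1)          # CCR threshold from claim 1
    if ccr > threshold:
        return "region-coldest-core"     # computation-dominated: distance matters less
    return "coldest-neighbor-core"       # communication-dominated: stay adjacent
```

With h = 3 the threshold is 3.0, so the embodiment's application 3 (CCR 3.2) would select in-region coldest-core migration.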
Further, the three migration modes are implemented as follows:

(1) Square migration requires square mapping and an application area at least twice the task count. When invoked, all tasks are unbound from their currently mapped processors and remapped, in order, into another idle part of the application area;

(2) In-region coldest-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest processor core in the application area is then found. If that core satisfies the migration condition, i.e. it is a bubble with no task mapped on it, the task is unbound from the overheated processor and remapped onto it; if the condition cannot be met, i.e. the coldest core is not a bubble, the overheated processor is down-clocked instead;

(3) In-region coldest-neighbour-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest core among its 8 adjacent neighbours is then found. If that core satisfies the migration condition, i.e. it either holds a task but its temperature is no higher than two thirds of the threshold temperature, or it is a bubble, it is selected as the migration target. If the target already holds a task, the tasks on the two processors are exchanged by a double unbind-remap: the overheated processor is unbound from its task X, the target processor is unbound from its task Y, task X is remapped onto the target processor, and task Y onto the overheated processor. If the target is an idle processor, a single unbind-remap is executed. If the migration condition cannot be met, i.e. the neighbour's temperature exceeds two thirds of the threshold temperature, the overheated processor is down-clocked.
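A minimal sketch of mode (2), in-region coldest-core migration, assuming simple dictionaries for core temperatures and core-to-task mappings (these structures and names are illustrative):

```python
def migrate_coldest_core(hot_core, region, temps, mapping):
    """region: cores of the application area; temps: core -> temperature;
    mapping: core -> task or None (None marks a bubble)."""
    coldest = min(region, key=lambda c: temps[c])
    if mapping[coldest] is None:                # migration condition: coldest core is a bubble
        mapping[coldest] = mapping[hot_core]    # remap the overheated task onto it
        mapping[hot_core] = None                # unbind from the overheated core
        return ("migrated", coldest)
    return ("down-clock", hot_core)             # condition unmet: reduce frequency instead
```

If the coldest core already holds a task, the sketch falls back to down-clocking the overheated core, mirroring the text above.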
Further, after an application arrives, its execution time under different bubble counts is estimated with the bubble-performance model. The bubble count of each application must never exceed its task count. When the bubble count equals the task count, execution times are estimated for all migration modes. When the bubble count is smaller than the task count, execution time is estimated under the computation-friendly in-region coldest-core migration mode for applications whose CCR exceeds the threshold, and under the communication-friendly coldest-neighbour-core migration mode for applications whose CCR is below the threshold.
Further, if the region contains more processors than the application has tasks, the frequency of the active processors in the region is raised according to the ratio of bubbles to tasks in the mapped region, so that those processors can run in an over-clocked mode.
Further, during the application run phase, the system performs hotspot detection in every control interval. When a hotspot occurs, task migration is carried out within the corresponding application area according to the selected migration mode.
Further, the communication-priority mapping mode maps the node with the largest communication weight, as the preferred task, onto a processor core near the geometric centre of the application area, and then maps the nodes connected to it, in turn, onto the available processor cores at the shortest Manhattan distance. Next, a mapped parent node of a still-unmapped task is selected, and the nodes connected to it are mapped in turn onto the available cores at the shortest Manhattan distance; and so on, until all tasks of the application are mapped onto processors.
Further, in the square mapping mode, the side length is the square root of the application's task count rounded down, and all tasks are mapped contiguously within a rectangular area.
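A sketch of the square mapping rule, assuming a top-left origin for the allocated region (function name and coordinate convention are illustrative):

```python
import math

def square_map(num_tasks, origin=(0, 0)):
    """Map tasks 0..num_tasks-1 row by row into a near-square rectangle whose
    row width is floor(sqrt(num_tasks)), as described in the text."""
    side = math.isqrt(num_tasks)        # side length: floor of the square root
    y0, x0 = origin
    placement = {}
    for t in range(num_tasks):
        placement[t] = (y0 + t // side, x0 + t % side)  # fill rows contiguously
    return placement
```

With 7 tasks the side is 2, so the tasks occupy a 2-wide rectangle of 4 rows (the last row only partly filled).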
The control interval is chosen equal to the sampling interval, and in each control interval the system checks whether any processor temperature exceeds the given temperature threshold of 80 °C. If not, no task migration is performed. If a processor core exceeds the threshold, the task mapped on the hotspot is located to its application, and task migration is carried out in the selected mode within the event-class instance bound to that application. Hotspot detection yields the overheated application ID and its overheated task ID.
The invention is implemented in a simulation system; see Fig. 1. The two-dimensional many-core task-scheduling simulation system comprises an application generator, an event-driven simulated multi-core system, and the HotSpot temperature simulator, a temperature simulation model based on a resistance-capacitance equivalent circuit developed at the University of Virginia and noted for being both accurate and fast; the latest release, HotSpot 6.0, is used. The application generator randomly generates a task graph, i.e. it creates the simulated application's task count, the computation amount of each task (range 100-800), the communication topology between tasks, and the communication volume of each edge (range 50-500). The task graph is expressed as a weighted directed acyclic graph containing the computation amount of each task and the communication dependencies between tasks.
The event-driven simulated multi-core system simulates the binding between individual task instances and simulated processors, the computation of tasks, and the communication between tasks. The processor resource-scheduling algorithm covers assigning bubbles to applications, task mapping, and simulated task-execution migration. Task mapping is implemented as a single-task-to-single-processor binding. When an application finishes running, the processors occupied by its tasks are released. During simulated execution, a task's computation speed equals the operating frequency of its processor (the operating frequency of each simulated processor in the system is adjustable), and the communication speed between each pair of tasks equals the processor's routing frequency divided by the Manhattan distance between the two processors.
In the task mapping stage, two alternative mapping methods are provided: square mapping and communication-priority mapping.

Square mapping simply maps all tasks of an application contiguously, in order, within an approximately square area occupying half of the overall application area (the allocated free contiguous processor region).

Communication-priority mapping first creates three dynamic arrays, MAP, MET and UNM, holding respectively the mapped tasks, the tasks connected to (i.e. communicating with) mapped tasks, and the remaining tasks in neither of the first two queues. Nodes represent simulated processors. At initialization, the approximate geometric centre of the application area (the allocated free contiguous processor region) is chosen as the first node, the task with the highest accumulated traffic is chosen as the preferred task and mapped onto that node, and the node is placed in the MAP array. All tasks connected to the preferred task are placed, in turn, into the MET array, and the remaining tasks into the UNM array. For the first task in MET, its parent task and the corresponding node in MAP are found, and nodes are searched in order of Manhattan distance 1, 2, 3, … from that node until the first available node is found; the MET task is mapped onto it and moved from MET into MAP, and the tasks connected to it that are still in UNM are moved from UNM into MET. These steps are repeated for each task in MET until the size of the MAP array equals the application's task count and the mapping is complete.
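The MAP/MET/UNM procedure can be sketched as follows. The adjacency and weight structures are assumptions for illustration, ties in Manhattan distance are broken by list order, and MET neighbours are taken in sorted order to keep the sketch deterministic:

```python
from collections import deque

def comm_priority_map(graph, weight, region):
    """graph: task -> set of communicating tasks (symmetric); weight: task ->
    accumulated traffic; region: list of (y, x) core coordinates.
    Returns task -> core placement."""
    free = list(region)
    cy = sum(y for y, _ in region) / len(region)
    cx = sum(x for _, x in region) / len(region)
    # first node: the core nearest the region's approximate geometric centre
    centre = min(free, key=lambda c: abs(c[0] - cy) + abs(c[1] - cx))
    first = max(graph, key=lambda t: weight[t])    # preferred task: highest traffic
    placed = {first: centre}                       # the MAP set
    free.remove(centre)
    met = deque(sorted(graph[first]))              # the MET queue
    seen = {first} | set(met)                      # everything else is UNM
    while met:
        task = met.popleft()
        parent = next(p for p in graph[task] if p in placed)  # mapped parent in MAP
        py, px = placed[parent]
        # first available core at the shortest Manhattan distance from the parent
        node = min(free, key=lambda c: abs(c[0] - py) + abs(c[1] - px))
        placed[task] = node
        free.remove(node)
        for nxt in sorted(graph[task]):            # move neighbours from UNM to MET
            if nxt not in seen:
                met.append(nxt)
                seen.add(nxt)
    return placed
```

On a 3-task chain in a 2x2 region, the heaviest task lands at the centre-most core and its neighbours at distance 1.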
According to the single-processor power model, the power of every processor in the many-core system is computed each microsecond and recorded as the power trace at the current instant. A fixed period is taken as the sampling interval; every sampling interval the HotSpot temperature simulator is invoked to simulate the system temperature, taking the power traces accumulated over the elapsed running time as input, and HotSpot returns the instantaneous simulated temperature of every processor core in the system at that moment. When a processor's temperature exceeds the set temperature threshold, the application corresponding to the hotspot executes task migration.
In the task migration stage, three selectable migration modes are provided: square migration, in-region coldest-core migration, and in-region coldest-neighbour-core migration.

The square migration mode is based on square mapping: the tasks mapped into one rectangle are migrated as a whole into another rectangular area within the application region.

The in-region coldest-core migration mode searches the application area for the processor core with the lowest temperature and, once found and the migration condition is met, migrates the overheated task onto that coldest core.

The in-region coldest-neighbour-core migration mode searches the 8 cores adjacent to the overheated processor for the core with the lowest temperature; if a temperature condition is met, the tasks on that core and the overheated processor are exchanged, and otherwise the frequency of the overheated processor is appropriately reduced.
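A minimal sketch of the coldest-neighbour rule, under the same assumed data structures as the coldest-core sketch (dictionaries keyed by (y, x) core coordinates; names are illustrative):

```python
def migrate_coldest_neighbor(hot, temps, mapping, threshold):
    """Pick the coolest of the 8 neighbours of `hot`; move onto it if it is a
    bubble, swap tasks if it is mapped but below 2/3 of the threshold
    temperature, otherwise down-clock the overheated core."""
    y, x = hot
    neighbors = [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0) and (y + dy, x + dx) in temps]
    target = min(neighbors, key=lambda c: temps[c])
    if mapping.get(target) is None:                 # bubble: single unbind-remap
        mapping[target], mapping[hot] = mapping[hot], None
        return ("moved", target)
    if temps[target] <= threshold * 2 / 3:          # mapped but cool enough: swap
        mapping[target], mapping[hot] = mapping[hot], mapping[target]
        return ("swapped", target)
    return ("down-clock", hot)                      # condition unmet
```

The swap branch is the double unbind-remap described in the detailed steps; the move branch is the single one.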
Figure 2 shows an example in which three applications first arrive at the system and are assigned different bubble counts by the branch-and-bound algorithm to obtain the shortest total response time.
Application 1 contains seven tasks; the bubble-performance model gives execution times of 80, 76, 62, 51 and 40 for bubble counts 0, 1, 3, 5 and 7, respectively (for simplicity of presentation, only bubble counts that form a regular area are considered, and the longest side of an application area is smaller than the side of the system).

Application 2 contains five tasks; the bubble-performance model gives execution times of 300, 275, 250, 217 and 150 for bubble counts 0, 1, 3, 4 and 5.

Application 3 contains eight tasks; the bubble-performance model gives execution times of 195, 184, 166, 147, 125 and 99 for bubble counts 0, 1, 2, 4, 6 and 8.
In the embodiment, b1, b2 and b3 denote the bubbles allocated to applications 1, 2 and 3, respectively. By the nature of the branch-and-bound method, at each expansion the branch whose lower bound (total response time) grows slowest is expanded preferentially, and upper-level branches whose lower bound exceeds the lower bound of a lower-level branch are pruned (for example, the first branch of the third level has a lower bound of 195, and all second-level branches whose lower bounds exceed 195 are pruned). The branch at the bottom level with the shortest total response time is the optimal solution; in this embodiment the optimal total response time is 165, and all branches with lower bounds greater than the optimal solution are pruned. The bubbles corresponding to the optimal solution are allocated to the three applications; here they are {b1 = 7, b2 = 5, b3 = 6}. As shown in Fig. 2, the ancestor node of the optimal-solution node is b1 = 7, i.e. 7 bubbles are allocated to application 1; its parent node is b2 = 5, i.e. 5 bubbles are allocated to application 2; and the node itself corresponds to b3 = 6, i.e. 6 bubbles are allocated to application 3.
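The pruning logic can be illustrated with a toy branch-and-bound over per-application bubble options. Each option below carries a hypothetical response time (the figures in Fig. 2 additionally include waiting times, which this sketch folds into a single number), and the cost of a complete assignment is the maximum response time:

```python
def branch_and_bound(options):
    """options[i] is a list of (bubbles, response_time) choices for application i.
    Finds the choice per application minimising the maximum response time,
    pruning any partial branch whose lower bound already meets the incumbent."""
    best = {"cost": float("inf"), "pick": None}

    def expand(i, picks, bound):
        if bound >= best["cost"]:              # the lower bound can only grow: prune
            return
        if i == len(options):                  # complete assignment: new incumbent
            best["cost"], best["pick"] = bound, tuple(picks)
            return
        # expand cheaper (slower-growing) branches first
        for bubbles, resp in sorted(options[i], key=lambda o: o[1]):
            picks.append(bubbles)
            expand(i + 1, picks, max(bound, resp))
            picks.pop()

    expand(0, [], 0)
    return best["pick"], best["cost"]
```

With toy options drawn from the embodiment's execution times (waiting times omitted), allocating the largest bubble counts minimises the bound.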
The application-area division and resource scheduling proceed as follows: the area of application 1 must map 7 tasks and contains 7 bubbles, so a processor region of size 14 is allocated to it, and the first-fit algorithm finds a 3×5 contiguous idle region for it (one processor remains unoccupied); the area of application 2 must map 5 tasks and contains 5 bubbles, so a processor region of size 10 is allocated to it, and the first-fit algorithm finds a 2×5 contiguous idle region for it; the area of application 3 must map 8 tasks and contains 6 bubbles, but the system has no sufficiently large free region, so application 3 waits in the queue.

For application 1, the task count equals the bubble count, so its tasks are mapped into the region in square mapping mode, square migration is selected, and application 1 starts running; the same holds for application 2, whose task count also equals its bubble count.

Application 1 finishes first, and the processor cores of its application area are released. The area of application 3 must map 8 tasks and contains 6 bubbles; it is allocated a contiguous free processor region of size 14, for which the first-fit algorithm finds a 3×5 idle region.

For application 3, the bubble count is smaller than the task count, so its tasks are mapped into the region in communication-priority mode. Its CCR is computed to be 3.2, which exceeds the threshold, so in-region coldest-core migration is selected for it, and application 3 starts running.
The foregoing is a detailed description of the present invention in connection with specific embodiments, but the invention is not to be construed as limited to these embodiments. Those of ordinary skill in the art may make various adaptations, modifications, substitutions and/or variations of these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (4)

1. A many-core system processor resource scheduling method based on thermally aware dynamic task migration, characterized by comprising the following steps:

step one: detect whether the waiting queue is empty; if not empty, map each application in the waiting queue and proceed to step two; if empty, end resource scheduling, wait for the next clock cycle to start, and return to step one;

step two: detect whether the arrival queue is empty; if not empty, proceed to step three; if empty, no new application arrives in this clock cycle and no processor resource scheduling is needed, so end resource scheduling, wait for the next clock cycle to start, and return to step one;

step three: detect whether no application is running in the many-core system; if none is running, use the bubble-performance model and the waiting-time model to estimate, for each application, the execution time and waiting time under different bubble counts, and search for the bubble allocation with the shortest total application response time with a branch-and-bound algorithm, the execution and waiting times serving as inputs to the cost function of the branch and bound; if an application is running, proceed to step four;
the bubble-performance model is used to compute the execution time of each application under different bubble counts and is a polynomial regression model; let the execution time of the application be π_i, the number of bubbles contained in the application area be b_i, the hop count of the application's critical path, i.e. the hop count of the path with the largest weighted length, be h_i, the average task computation time in the application be c_i, and the average inter-task communication time in the application be t_i; the bubble-performance model is:

$$\pi_i = \sum_{k=1}^{n_1}\alpha_k b_i^k + \sum_{k=1}^{n_2}\beta_k h_i^k + \sum_{k=1}^{n_3}\gamma_k c_i^k + \sum_{k=1}^{n_4}\theta_k t_i^k$$

where n_1 to n_4 are polynomial orders and α_k, β_k, γ_k, θ_k are the fitting coefficients of the model, obtained by the maximum-likelihood method; b_i^k is the k-th power of application i's bubble count, h_i^k is the k-th power of application i's critical-path hop count, c_i^k is the k-th power of application i's average task computation time, and t_i^k is the k-th power of application i's average pairwise inter-task communication time;
the waiting-time model takes as input the size of the current application's area, i.e. the sum of its bubble count and task count, and is used to compute the waiting time of each application under different bubble counts and the current arrival rate; the model is a polynomial regression model; denote the area of the current application by R, the total number of processors in the system by |T|, and the average task count of the mapped applications by |A_i|; let the waiting time of the current application be η_i; η_i is modelled from the following variables: the average bubble-to-task ratio r of the mapped applications, the average execution time e of the mapped applications, and the application arrival rate λ; the waiting-time model is:

$$\eta_i = a_0 + \sum_{j=1}^{z}\left(\delta_j r^j + \varepsilon_j e^j + \mu_j \lambda^j\right)$$

where a_0 is a constant term, z is the polynomial order, r^j, e^j and λ^j are the j-th powers (j from 1 to z) of the average bubble-to-task ratio of the mapped applications, the average execution time of the mapped applications, and the current application arrival rate, respectively, and δ_j, ε_j and μ_j are the fitting coefficients of the corresponding terms, obtained by maximum-likelihood regression;
step four: if the application runs, the available processor of the system is divided into a group of discontinuous idle available areas by the occupied application area, the execution time and the waiting time under the selectable bubble number are estimated for each application by using a bubble-performance model and a waiting time model respectively, the execution time and the waiting time are used as the calculation input of a cost function in a branch limit, and the optimal bubble distribution result is searched by a branch limit algorithm;
fifth, the mapping stage is applied: selecting an idle area by adopting a first-time adaptive heuristic algorithm, and then selecting an applied mapping mode for mapping; the first adaptation heuristic algorithm is as follows: searching idle processors from left to right and from top to bottom in the system, judging whether the idle processors can be used as a starting point of an application area, wherein the condition that the starting point can be used as the starting point is that the product of the number of idle processors on the right side of the same row and the number of idle processors under the same row is larger than or equal to the area of the application area; when a first processor which can be used as the starting point of the application area is found, the processor is used as the idle area at the leftmost upper corner to be distributed to the application, and if the processor which can be used as the starting point of the application area is not found, the application is put into a waiting queue; after the region is selected, mapping the application in the selected mode in the region; the mode of selecting the mapping mode of the application is as follows: when the number of bubbles and tasks applied satisfies 1:1, selecting a square mapping mode by application; when the number of bubbles and tasks applied does not satisfy 1:1, applying a mapping mode for selecting communication priority;
step six, application operation phase: selecting a migration mode for different types of applications according to the mapping mode of the application and the calculation amount-traffic ratio (Computation communication rate, CCR) of the application as selection basis; the migration modes comprise three migration modes, namely: square migration mode, coldest core migration mode in area and coldest neighbor core migration mode in area;
the three migration modes are implemented as follows:

(1) square migration requires square mapping and an application area at least twice the task count; when invoked, all tasks are unbound from their currently mapped processors and remapped, in order of task number, into an idle part of the application area;

(2) in-region coldest-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest processor core in the application area is then found; if that core satisfies the migration condition of this mode, i.e. it is a bubble, that is, an idle processor with no mapped task, the task is unbound from the overheated (original) processor and remapped onto it; if the migration condition cannot be met, i.e. the coldest core is not a bubble, the overheated processor is down-clocked;

(3) in-region coldest-neighbour-core migration: when invoked, the task ID obtained from hotspot detection locates the overheated processor, and the coolest core among its 8 adjacent neighbours is then found; if that core satisfies the migration condition of this mode, i.e. it holds a task but its temperature is no higher than two thirds of the threshold temperature, or it is a bubble, the task migration is executed; if the neighbour already holds a task, the tasks on the two processors are exchanged by a double unbind-remap; if the neighbour is idle, a single unbind-remap is executed; if the migration condition cannot be met, i.e. the neighbour's temperature exceeds two thirds of the threshold temperature, the overheated processor is down-clocked;
in step six, the migration mode is selected according to the application's mapping mode and its computation-to-communication ratio, and the bubble count of each application must never exceed its task count; the selection conditions are:

if the bubble count equals the task count, square migration is selected so that the communication distance stays unchanged during migration, and the processors run in an over-clocked mode;

if the bubble count does not equal the task count, in-region coldest-core migration is selected for applications whose CCR exceeds the threshold, and in-region coldest-neighbour-core migration for applications whose CCR is below the threshold, where the threshold is 2h/(h-1) and h is the application's critical-path hop count, i.e. the hop count of the path with the largest weighted length.
2. The processor resource scheduling method according to claim 1, wherein step three is implemented by a global manager: when subsequent applications arrive, the global manager counts all available processor areas in the current system, estimates the execution time and waiting time of each newly arrived application in each available area, and finds the minimum-cost application-to-area correspondence with a branch-and-bound algorithm.
3. The processor resource scheduling method according to claim 1, wherein the branch-and-bound algorithm is: allocate a different number of bubbles to each application, computing the current total response time at each node as the cost function; the total response time σ is the maximum, over the applications whose bubbles have been allocated at the node, of the sum of waiting time and execution time:

$$\sigma = \max\{\eta_i + \pi_i\},\quad i \in \{0, 1, \ldots, n\}$$

where η_i is the waiting time of application i and π_i is its execution time; at each step the branch whose lower bound grows slowest is preferentially expanded, upper-level nodes whose cost, i.e. response time, is higher than that of lower-level nodes are pruned, and the final result is the bubble allocation with the shortest total response time, i.e. the application-region division result.
4. The processor resource scheduling method according to claim 1, wherein the selectable bubble count is the difference between the total processor count of each free region and the application's task count, the application's task count being equal to the number of processors on which the application runs.
CN201910049800.5A 2019-01-18 2019-01-18 Resource scheduling method for many-core system processor based on thermal perception dynamic task migration Active CN109918195B (en)

Publications (2)

Publication Number Publication Date
CN109918195A CN109918195A (en) 2019-06-21
CN109918195B (en) 2023-06-20

Family

ID=66960500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910049800.5A Active CN109918195B (en) 2019-01-18 2019-01-18 Resource scheduling method for many-core system processor based on thermal perception dynamic task migration

Country Status (1)

Country Link
CN (1) CN109918195B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445154B (en) * 2019-08-27 2021-09-17 无锡江南计算技术研究所 Multi-stage processing method for heterogeneous many-core processor temperature alarm
CN110794949A (en) * 2019-09-27 2020-02-14 苏州浪潮智能科技有限公司 Power consumption reduction method and system for automatically allocating computing resources based on component temperature
CN114039980B (en) * 2021-11-08 2023-06-16 欧亚高科数字技术有限公司 Low-delay container migration path selection method and system for edge collaborative computing
CN113867973B (en) * 2021-12-06 2022-02-25 腾讯科技(深圳)有限公司 Resource allocation method and device
WO2024009747A1 (en) * 2022-07-08 2024-01-11 ソニーグループ株式会社 Information processing device, information processing method, and program

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102473161A (en) * 2009-08-18 2012-05-23 国际商业机器公司 Decentralized load distribution to reduce power and/or cooling cost in event-driven system
CN107193656A (en) * 2017-05-17 2017-09-22 深圳先进技术研究院 Method for managing resource, terminal device and the computer-readable recording medium of multiple nucleus system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8887165B2 (en) * 2010-02-19 2014-11-11 Nec Corporation Real time system task configuration optimization system for multi-core processors, and method and program
JP5946068B2 (en) * 2013-12-17 2016-07-05 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Computation method, computation apparatus, computer system, and program for evaluating response performance in a computer system capable of operating a plurality of arithmetic processing units on a computation core
US20180107965A1 (en) * 2016-10-13 2018-04-19 General Electric Company Methods and systems related to allocating field engineering resources for power plant maintenance

Non-Patent Citations (2)

Title
Wang Xiao-Hang, et al. "Energy Efficient Run-Time Incremental Mapping for 3-D Networks-On-Chip." Journal of Computer Science and Technology, vol. 28, no. 1, Jan. 2013, pp. 54-71. *
Lu Guoming, et al. "Research on Collaborative Resource Allocation in Data Grids." Systems Engineering and Electronics, vol. 28, no. 1, Jan. 2006, pp. 110-114. *

Also Published As

Publication number Publication date
CN109918195A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109918195B (en) Resource scheduling method for many-core system processor based on thermal perception dynamic task migration
Grandl et al. Multi-resource packing for cluster schedulers
Chen et al. MapReduce scheduling for deadline-constrained jobs in heterogeneous cloud computing systems
Safari et al. PL-DVFS: combining Power-aware List-based scheduling algorithm with DVFS technique for real-time tasks in Cloud Computing
Mohapatra et al. A comparison of four popular heuristics for load balancing of virtual machines in cloud computing
Stavrinides et al. Scheduling multiple task graphs in heterogeneous distributed real-time systems by exploiting schedule holes with bin packing techniques
US20130191612A1 (en) Interference-driven resource management for gpu-based heterogeneous clusters
Guzek et al. HEROS: Energy-efficient load balancing for heterogeneous data centers
Jiang et al. Scheduling concurrent workflows in HPC cloud through exploiting schedule gaps
CN110362388B (en) Resource scheduling method and device
Pascual et al. Towards a greener cloud infrastructure management using optimized placement policies
CN102609303B Slow-task dispatching method and device for a MapReduce system
Stavrinides et al. Energy-aware scheduling of real-time workflow applications in clouds utilizing DVFS and approximate computations
Wang et al. An adaptive model-free resource and power management approach for multi-tier cloud environments
Singh et al. Run-time mapping of multiple communicating tasks on MPSoC platforms
Li et al. On runtime communication and thermal-aware application mapping and defragmentation in 3D NoC systems
CN109062682B (en) Resource scheduling method and system for cloud computing platform
Pascual et al. Effects of topology-aware allocation policies on scheduling performance
Than et al. Energy-saving resource allocation in cloud data centers
Meng et al. Communication and cooling aware job allocation in data centers for communication-intensive workloads
Wang et al. Exploiting dark cores for performance optimization via patterning for many-core chips in the dark silicon era
Hussin et al. Efficient energy management using adaptive reinforcement learning-based scheduling in large-scale distributed systems
Kaushik et al. Run-time computation and communication aware mapping heuristic for NoC-based heterogeneous MPSoC platforms
Singh et al. Value and energy optimizing dynamic resource allocation in many-core HPC systems
Li et al. On runtime communication- and thermal-aware application mapping in 3D NoC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant