CN116450339A - GPU resource scheduling method - Google Patents

GPU resource scheduling method

Info

Publication number
CN116450339A
CN116450339A
Authority
CN
China
Prior art keywords
resource
task
distributed
task group
gpu
Prior art date
Legal status
Pending
Application number
CN202210022477.4A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Muxi Integrated Circuit Shanghai Co ltd
Original Assignee
Muxi Integrated Circuit Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Muxi Integrated Circuit Shanghai Co ltd filed Critical Muxi Integrated Circuit Shanghai Co ltd
Priority to CN202210022477.4A
Publication of CN116450339A
Legal status: Pending


Classifications

    • G06F 9/5044: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals, considering hardware capabilities
    • G06F 9/5038: Allocation of resources to service a request, the resource being a machine, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06T 1/20: Processor architectures; processor configuration, e.g. pipelining
    • G06F 2209/5017: Indexing scheme relating to resource allocation; task decomposition
    • G06F 2209/5021: Indexing scheme relating to resource allocation; priority
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multi Processors (AREA)

Abstract

The invention relates to a GPU resource scheduling method, which comprises the following steps: step C1, obtaining the task group to be distributed corresponding to each current task channel, reading the resource demand information of each task group to be distributed, and obtaining the current remaining resource information of each execution module in the GPU; step C2, matching the resource demand information of each task group to be distributed against the current remaining resource information of all execution modules, and, if matching fails for all current task groups to be distributed, incrementing the matching-failure round counter (G = G + 1) and judging whether G exceeds a preset threshold; if so, executing step C3, otherwise returning to step C1; and step C3, reading the independence flag of each task group to be distributed and, if any task groups to be distributed carry the independence flag, splitting at least one of them into a plurality of subtask groups. The invention improves the task distribution efficiency and the resource scheduling efficiency of the GPU.

Description

GPU resource scheduling method
Technical Field
The invention relates to the technical field of computers, in particular to a GPU resource scheduling method.
Background
A graphics processing unit (GPU), also known as a display core, vision processor, or display chip, is designed for computationally intensive, highly parallel workloads. While the GPU executes tasks, an unbalanced allocation of any one resource wastes GPU resources and thus reduces GPU resource utilization and computing performance. Therefore, during GPU operation, every GPU resource should be scheduled as evenly as possible, so that each resource, and hence the GPU as a whole, stays in a balanced state, improving the resource utilization and computing performance of the GPU.
However, in the prior art it is still difficult to achieve balanced scheduling of GPU resources when the GPU executes tasks, especially complex computing tasks: allocating resources usually takes a great deal of time, and the allocation result cannot guarantee resource balance, so reliability is poor. How to provide an efficient and reliable GPU resource-balancing scheduling technique that allocates appropriate GPU resources to multiple task groups, improves task-processing efficiency, and raises GPU resource utilization and computing performance is therefore a technical problem to be solved.
Disclosure of Invention
The invention aims to provide a GPU resource scheduling method, which improves task distribution efficiency and GPU resource scheduling efficiency.
According to the present invention, there is provided a GPU resource scheduling method, including:
step C1, acquiring the task group to be distributed corresponding to each current task channel, reading the resource demand information of each task group to be distributed, and acquiring the current remaining resource information of each execution module in the GPU;
step C2, matching the resource demand information of each task group to be distributed against the current remaining resource information of all execution modules; if matching fails for all current task groups to be distributed, incrementing the matching-failure round counter (G = G + 1) and judging whether G exceeds a preset threshold; if so, executing step C3, otherwise returning to step C1;
step C3, reading the independence flag of each task group to be distributed and, if any task groups to be distributed carry the independence flag, splitting at least one of them into a plurality of subtask groups.
Compared with the prior art, the invention has obvious advantages and beneficial effects. By means of the above technical scheme, the GPU resource scheduling method provided by the invention achieves considerable technical progress and practicality, has broad industrial value, and offers at least the following advantage:
when the waves in a currently blocked task group to be processed are mutually independent, the invention quickly relieves the blockage by splitting the task group to be distributed, improving the task distribution efficiency and the resource scheduling efficiency of the GPU.
The foregoing is only an overview of the technical solution of the present invention. In order to make the technical means of the invention clearer and implementable in accordance with the description, and to make the above and other objects, features, and advantages of the invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a schematic diagram of prior-art multi-task channels issuing task groups to a GPU;
FIG. 2 is a flowchart of a GPU resource scheduling method according to Embodiment 1;
FIG. 3 is a flowchart of a GPU resource scheduling method according to Embodiment 2;
FIG. 4 is a flowchart of a GPU resource scheduling method according to Embodiment 3;
FIG. 5 is a flowchart of a method for obtaining a GPU maximum contiguous resource block according to Embodiment 4;
FIG. 6 is a flowchart of a method for obtaining a GPU maximum contiguous resource block according to Embodiment 5;
FIG. 7 is a flowchart of a method for obtaining a GPU maximum contiguous resource block according to Embodiment 6;
FIG. 8 is a flowchart of a method for acquiring a GPU maximum contiguous resource block based on time-division multiplexing according to Embodiment 7;
FIG. 9 is a flowchart of a GPU resource scheduling method according to Embodiment 8;
FIG. 10 is a flowchart of a GPU resource scheduling method according to Embodiment 9;
FIG. 11 is a flowchart of a GPU resource scheduling method according to Embodiment 10.
Detailed Description
In order to further explain the technical means adopted by the present invention to achieve its intended purposes, and their effects, the specific implementation of the GPU resource scheduling method according to the present invention and its effects are described in detail below with reference to the accompanying drawings and preferred embodiments.
As shown in FIG. 1, in the conventional scenario where the upper-layer software connected to a GPU distributes tasks to it, tasks are typically issued to the GPU through a plurality of task channels W_1-W_R. The task channels are independent of one another and may issue different task packages to the GPU. Each task package comprises different task groups (work groups, WG for short); a task package corresponds to a single process, and the same task channel can receive task packages issued by different processes. Each task group comprises m thread bundles (waves), where m ranges from 1 to M, M being the maximum number of waves in a task group, and every wave contains the same number of threads.
A conventional GPU structure generally includes P execution modules (denoted AP in FIG. 1), each of which includes Q execution units (denoted PEU in FIG. 1). When multiple task groups are issued, they need to be distributed to the execution modules as evenly as possible. Specifically, a suitable execution module must be selected for each task group, and the waves of a task group must be distributed as evenly as possible across the Q execution units of the selected execution module, so as to keep GPU resources balanced.
In the existing GPU resource scheduling technology, task groups are usually tagged with priorities. When the GPU receives R channels of task groups, it first searches for an execution module that can be allocated to the task group with the first priority and allocates that task group to it, then searches for an execution module for the task group with the second priority, and so on, where the first priority is higher than the second priority.
However, the existing GPU resource scheduling techniques have at least the following problems. First, a resource search and match must be performed for every task group of every task channel; for each resource that must be contiguous, the prior art either searches in hardware over a number of clock cycles or searches in software, so search efficiency is very low and a large amount of GPU resources is consumed. Second, when a high-priority task group cannot be matched with a suitable execution module, the task groups of all task channels are blocked, and the blockage is not relieved until some execution module releases enough resources to satisfy the high-priority task group, seriously affecting task distribution and processing efficiency. In view of these problems, the present invention provides the following embodiments.
Embodiment 1
A GPU resource scheduling method, as shown in FIG. 2, comprises:
Step A1, acquiring a task group to be distributed corresponding to each current task channel, and reading resource demand information of each task group to be distributed;
It should be noted that each task group carries demand information for various GPU resources, including a resource identifier and the corresponding resource quantity. Reading this information directly in hardware is prior art and is not described in detail here.
Step A2, obtaining current residual resource information of each execution module in the current GPU;
it is understood that the current remaining resource information of the execution module includes the number of remaining resources currently corresponding to each resource in the execution module.
Step A3, matching the resource demand information of each task group to be distributed with the current residual resource information of all execution modules respectively, and adding the task group to be distributed into a candidate task group set if the current residual resource information of at least one execution module is matched with the resource demand information of the task group to be distributed;
it should be noted that, when the current remaining amount of each resource of the execution module is greater than or equal to the corresponding resource requirement of the task group to be distributed, the matching is indicated.
And A4, selecting one task group to be distributed with the highest priority from the candidate task group set as a target task group, selecting one target execution module from execution modules matched with the target task group, and distributing the target task group to the target execution module.
Each task group carries priority identification information, and the corresponding priority can be obtained by reading the priority identification information corresponding to the task group.
In this embodiment, by matching multiple task groups in parallel and synchronously, all task groups meeting the requirements can be screened out directly, and the one with the highest priority is then selected from the candidates for allocation. It can be understood that even if the currently highest-priority task group requires more resources than any execution module can currently provide, a lower-priority task group whose resource requirements can be met is allocated quickly instead. This avoids the prior-art situation in which tasks are blocked whenever the high-priority group cannot be matched, improves the efficiency of task allocation, and saves GPU computation and power.
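As an illustration of this parallel screening, the following Python sketch models steps A3-A4 in software. The data structures and names (TaskGroup, ExecModule, demand, remaining) are assumptions made for readability; the patent's hardware performs all of these comparisons in parallel rather than in a loop.

```python
from dataclasses import dataclass

@dataclass
class TaskGroup:
    channel: int
    priority: int              # lower value = higher priority (assumption)
    demand: dict[str, int]     # resource identifier -> required amount

@dataclass
class ExecModule:
    remaining: dict[str, int]  # resource identifier -> currently free amount

def schedule_round(groups: list[TaskGroup], modules: list[ExecModule]):
    """Steps A3-A4 sketch: screen every group against every module, pick by priority."""
    candidates = []
    for g in groups:                        # hardware matches all groups at once
        matched = [m for m in modules
                   if all(m.remaining.get(r, 0) >= need
                          for r, need in g.demand.items())]
        if matched:                         # at least one module satisfies the demand
            candidates.append((g, matched))
    if not candidates:
        return None                         # nothing can be placed this round
    target, matched = min(candidates, key=lambda c: c[0].priority)
    return target, matched[0]               # distribute target to a matched module
```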
As an embodiment, in step A2 the remaining resource information includes the maximum contiguous remaining resource count of each first-type resource and the maximum remaining resource count of each second-type resource, where first-type resources are resources in the execution module with a contiguous-allocation requirement and second-type resources are resources without a contiguous-allocation requirement. It can be understood that GPU resources include multiple kinds of first-type resources as well as multiple kinds of second-type resources.
For example, resources in the GPU include scalar general-purpose registers (Scalar General Purpose Registers, S-GPRs for short) and vector general-purpose registers (Vector General Purpose Registers, V-GPRs for short) distributed among the execution units, allocatable wave-slot resources in the execution units, memory modules accessible to all execution units within an execution module, and so forth. S-GPRs and V-GPRs are first-type resources with a contiguous-allocation requirement. Allocatable wave slots are few in number and involve only wave-identifier mapping, so they belong to the second type of resources without a contiguous-allocation requirement. The maximum contiguous resource count of each first-type resource can be computed by the GPU hardware implementation, or calculated and obtained by software query and reading.
In this embodiment, instead of searching the remaining resources separately for each task group, the resources in the execution module are divided into first-type and second-type resources, and the maximum contiguous remaining count of each first-type resource and the maximum remaining count of each second-type resource are obtained; a single computation then suffices to match the remaining resource information of all execution modules against every task group synchronously.
As an embodiment, each task channel is provided with a corresponding first-in-first-out (FIFO) queue; a task channel stores received task groups into its FIFO one by one in the order received, and the task group to be distributed is the task group at the head of that channel's FIFO.
It should be noted that by providing a FIFO queue in each task channel, the task groups of each channel are guaranteed to execute in the order issued by the upper-layer software, without reordering.
As an embodiment, the GPU chip generally distributes and processes the multi-channel task groups continuously; specifically, after step A4 the method further includes:
step A5, judging whether the task group to be distributed corresponding to each task channel is empty; if so, ending the flow, otherwise returning to step A1.
As one embodiment, each execution module includes Q execution units, and step A3 includes:
step A31, dividing the task group to be distributed into Q wave groups, and acquiring the resource demand information corresponding to each wave group together with the shared resource demand of the task group to be distributed on the execution module as a whole;
step A32, matching the resource demand information of the Q wave groups against the remaining resource information of the Q execution units, matching the shared resource demand against the shared remaining resource information of the execution module, and, if both match, adding the task group to be distributed to the candidate task group set.
It can be understood that matching succeeds when the resource demand of every wave group is less than or equal to the remaining resources of the corresponding execution unit and the shared resource demand of the Q wave groups is less than or equal to the shared remaining resources of the execution module.
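As an illustration of steps A31-A32, the sketch below matches one task group against one execution module for a single resource type; the round-robin split of the waves into Q wave groups and the parameter names are assumptions made for readability, not mandated by the patent.

```python
def match_one_module(wave_demands: list[int], shared_demand: int,
                     unit_remaining: list[int], shared_remaining: int,
                     q: int) -> bool:
    """Split the waves into Q wave groups and match each against one execution unit;
    the shared demand is matched against the module-wide shared remainder."""
    wave_groups = [wave_demands[i::q] for i in range(q)]   # balanced round-robin split
    per_group_demand = [sum(g) for g in wave_groups]
    units_ok = all(need <= have
                   for need, have in zip(per_group_demand, unit_remaining))
    return units_ok and shared_demand <= shared_remaining
```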
The candidate task group set may contain one or more task groups with the highest priority, so a further processing mechanism needs to be provided to ensure balanced task allocation across the multiple task channels.
In one embodiment, in step A4, the target task group is determined as follows:
Step A41, if the candidate task group set contains only one task group to be distributed with the highest priority, determining that task group as the target task group.
Step A42, if the candidate task group set contains multiple task groups to be distributed with the highest priority, judging whether any of the corresponding task channels is not yet marked with the selected flag for that priority:
if so, randomly selecting one task group to be distributed from those whose channels are unmarked as the target task group, and marking the selected flag for that priority in the selected group's task channel;
if not, randomly selecting one task group from the multiple highest-priority task groups as the target task group, keeping the selected flag for that priority in its task channel, and clearing the selected flag for that priority in the task channels of the other tied task groups.
Through step A42, when task groups of the same priority repeatedly tie while meeting the resource allocation requirements, balanced task allocation across the multiple task channels is guaranteed.
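A software sketch of the tie-breaking rule in step A42 follows; the selected_marks dictionary is an assumed stand-in for the per-channel, per-priority selected flag bits, and the task group objects are the illustrative structure from the sketch above.

```python
import random

def break_tie(tied: list, selected_marks: dict):
    """tied: task groups sharing the highest priority;
    selected_marks: (channel, priority) -> True if that channel was already chosen."""
    unmarked = [g for g in tied
                if not selected_marks.get((g.channel, g.priority), False)]
    if unmarked:                       # some tied channel has not been chosen yet
        winner = random.choice(unmarked)
        selected_marks[(winner.channel, winner.priority)] = True
    else:                              # all were chosen once: keep winner, clear others
        winner = random.choice(tied)
        for g in tied:
            selected_marks[(g.channel, g.priority)] = (g is winner)
    return winner
```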
As an embodiment, in step A4, distributing the target task group to the target execution module includes:
step A43, obtaining the starting allocation address of each first-type resource in the target execution module;
step A44, allocating the corresponding first-type resources to the corresponding tasks in the target task group according to the starting allocation address and the allocation quantity;
step A45, allocating the corresponding second-type resources to the corresponding tasks in the target task group.
It should be noted that resources with a contiguity requirement can also be allocated discontinuously, and such allocation methods are mostly adopted in the prior art; but then the target placement address must be searched for repeatedly, and allocation efficiency is low. In addition, while executing the task the execution unit must perform multiple address lookups, so the information-interaction burden is large and execution efficiency is low. In Embodiment 1, by obtaining the maximum contiguous remaining count of the first-type resources, the target task is allocated a contiguous address space; each storage location need not be addressed separately, and rapid allocation requires only the starting address and the target quantity, improving task allocation efficiency. It should also be noted that the technical details described in the other embodiments apply to the corresponding steps of this embodiment and are not repeated here.
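Because step A2 already guarantees a sufficiently large contiguous run, steps A43-A44 reduce to marking a run of blocks busy from a starting address. A minimal bitmap sketch follows, assuming bit value 1 means free, consistent with the flag convention discussed in Embodiment 4:

```python
def allocate_contiguous(bitmap: int, start: int, count: int) -> int:
    """Mark `count` contiguous blocks busy starting at bit `start` (1 = free)."""
    run = ((1 << count) - 1) << start
    if bitmap & run != run:
        raise ValueError("run is not entirely free")
    return bitmap & ~run        # clear the bits: the blocks are now occupied
```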
Embodiment 2
Embodiment 1 avoids the blockage that occurs when the remaining resources of the execution modules cannot satisfy a high-priority task group. However, if high-priority task groups keep matching suitable execution modules over a long period, low-priority task groups may never be allocated, and the low-priority task channels remain blocked for a long time. To solve this problem, the invention further provides Embodiment 2.
A GPU resource scheduling method, as shown in FIG. 3, includes:
Step B1, acquiring the priorities {P_1, P_2, ... P_S} of the task groups received by the GPU, where P_s is the s-th priority, s ranges from 1 to S, and S is the total number of task-group priorities;
each task group carries a priority identifier when issued, and the priority information corresponding to the task group can be obtained based on that identifier. The priorities P_1, P_2, ... P_S may be set to decrease in turn, and the value range of S is determined according to the specific application requirements; for example, S may range from 2 to 8, and specifically S may be 4.
Step B2, dividing the preset clock-cycle count into Y time windows {T_1, T_2, ... T_Y}, where T_y is the y-th time window;
the numbers of cycles corresponding to T_1, T_2, ... T_Y may be set to decrease in turn. The value range of Y is determined according to the specific application requirements; for example, Y may range from 2 to 8, and specifically Y may be 4. Preferably, the value of S equals the value of Y, with P_s corresponding one-to-one with T_s.
Step B3, setting a clock-cycle counter on the GPU that counts cyclically within the preset clock-cycle range, and, when the value of the counter lies within time window T_s, adjusting P_s to the highest priority;
preferably, when P_s is adjusted to the highest priority, the remaining priorities keep their original relative order.
Step B4, allocating GPU resources to each task group based on the current priority ordering.
In this way, the priority ordering of the task groups is adjusted dynamically: different time windows are allocated to different priorities, the time window of a high priority is larger than that of a low priority, and GPU resources are allocated to each task group based on the current priority ordering produced by this priority-adjustment scheme.
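An illustrative sketch of the rotation in steps B1-B4 follows, assuming S = Y with window lengths decreasing as the original priority falls; the function and variable names are assumptions.

```python
def priority_order(cycle_counter: int, windows: list[int]) -> list[int]:
    """windows[s] = cycle length of time window T_{s+1} assigned to priority P_{s+1}.
    Returns priority indices, highest first, for the current counter value."""
    t = cycle_counter % sum(windows)       # counter cycles over the preset range
    promoted = 0
    while t >= windows[promoted]:          # locate the window containing t
        t -= windows[promoted]
        promoted += 1
    # the promoted priority goes first; the rest keep their original relative order
    return [promoted] + [s for s in range(len(windows)) if s != promoted]

# e.g. four priorities with windows of 8, 4, 2, 1 cycles:
# priority_order(9, [8, 4, 2, 1]) -> [1, 0, 2, 3], so P_2 is temporarily highest
```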
As an embodiment, the step B4 includes:
step B41, acquiring a task group to be distributed corresponding to each current task channel, and reading the resource demand information of each task group to be distributed;
Step B42, obtaining current residual resource information of each execution module in the current GPU;
step B43, constructing a candidate task group set based on the resource demand information and the current residual resource information of each task group to be distributed;
and B44, reading the current priority order, selecting one task group to be distributed with the highest current priority from the candidate task group set according to the current priority order as a target task group, selecting one target execution module from execution modules matched with the target task group, and distributing the target task group to the target execution module.
It should be noted that the technical details described in the foregoing and other embodiments apply to the corresponding steps of this embodiment and are not repeated here.
In this embodiment, the priority ordering of the task groups is adjusted dynamically and time windows of different lengths are allocated to the different priorities. This avoids the static-priority situation in which high-priority task groups keep matching suitable execution modules over a long period, so that low-priority task groups cannot be allocated and the low-priority task channels stay blocked, thereby improving the task distribution efficiency and the resource scheduling efficiency of the GPU.
Embodiment 3
In general, the waves in a task group are interrelated and therefore must be allocated to the same execution module for processing; in some cases, however, the waves in a task group are mutually independent and can be divided among different execution modules. If a task group with mutually independent waves requires a large amount of resources and no execution module can meet its resource demand for a long time, the task group cannot be distributed and remains blocked. For this scenario, the invention further provides Embodiment 3.
A GPU resource scheduling method, as shown in FIG. 4, comprises:
Step C1, acquiring a task group to be distributed corresponding to each current task channel, reading resource demand information of each task group to be distributed, and acquiring current residual resource information of each execution module in the current GPU;
step C2, matching the resource demand information of each task group to be distributed against the current remaining resource information of all execution modules; if matching fails for all current task groups to be distributed, incrementing the matching-failure round counter (G = G + 1) and judging whether G exceeds a preset threshold; if so, executing step C3, otherwise returning to step C1;
The preset frequency threshold is set according to specific application requirements, for example, the frequency threshold may be set to 256.
Step C3, reading the independence flag of each task group to be distributed and, if any task groups to be distributed carry the independence flag, splitting at least one of them into a plurality of subtask groups.
A task group to be distributed marked with the independence flag has mutually independent waves, while one without the flag has interrelated waves. Specifically, an independence flag bit may be provided, where '1' indicates that the waves in the task group are mutually independent and '0' indicates that they are interrelated.
The independence flag can be attached to a task group directly by the upper-layer software when the group is issued. It should be noted that in this embodiment G can exceed the preset threshold in at least two cases. In the first case, the task groups to be distributed of all task channels fail to find execution modules meeting their resource demands for G consecutive rounds. In the second case, the tasks of the other task channels have been distributed, while the task group to be distributed of at least one task channel fails to find a suitable execution module for G consecutive rounds. In either case, if the execution units release resources slowly, or the resource demand of the task group to be distributed is too large, multiple task channels may be blocked for a long time, seriously affecting the task distribution efficiency and the resource scheduling efficiency of the GPU.
In one embodiment, in step C2, if the current remaining resource information of at least one execution module matches the resource demand information of a task group to be distributed, the task group is added to the candidate task group set and step C4 is performed:
Step C4, selecting the task group to be distributed with the highest priority from the candidate task group set as the target task group, selecting one target execution module from the execution modules matched with it, distributing the target task group to the target execution module, and returning to step C1.
As an embodiment, step C1 is preceded by a step C0 of setting the initial value of G. Preferably, the initial value of G is set to 0, which is convenient for counting.
As an embodiment, in step C3, splitting at least one task group to be distributed that carries the independence flag into a plurality of subtask groups includes: splitting at least one independence-flagged task group so that each of its waves becomes its own subtask group.
It should be noted that taking each wave of an independence-flagged task group as a subtask group is convenient to implement: the execution logic is simple, no additional grouping is needed, the processing flow is simplified, and processing efficiency improves. After splitting, at least one of the current task groups to be distributed contains only one wave, which greatly increases the probability of a successful resource match and effectively relieves the task-channel blockage.
As an embodiment, step C3 further includes: setting the priority of each subtask group to the lowest priority.
It should be noted that by setting the priority of each subtask group to the lowest priority, task groups to be distributed that can still meet their resource demands as a whole in the unsplit state are distributed first, on the premise that the task-channel blockage is relieved, which improves GPU resource utilization.
As an embodiment, after step C3 the method further includes: taking each subtask group in turn as the task group to be processed of the corresponding task channel, resetting G to its initial value, and returning to step C1. It should be noted that the technical details described in the other embodiments apply to the corresponding steps of this embodiment and are not repeated here.
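A sketch of the splitting path of step C3 follows; BlockedGroup and its fields are assumptions standing in for the task group structure, its per-wave demands, and the independence flag bit.

```python
from dataclasses import dataclass

@dataclass
class BlockedGroup:
    channel: int
    priority: int
    waves: list[dict]          # per-wave resource demand (assumed representation)
    independent: bool          # True when the independence flag bit is '1'

def split_blocked(group: BlockedGroup, lowest_priority: int) -> list[BlockedGroup]:
    """Step C3 sketch: split an independence-flagged group one wave per subtask group,
    demoting each subtask group to the lowest priority."""
    if not group.independent:
        return [group]                        # interrelated waves: cannot be split
    return [BlockedGroup(group.channel, lowest_priority, [w], True)
            for w in group.waves]             # each wave becomes its own subtask group
```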
In Embodiment 3, when the waves in a currently blocked task group to be processed are mutually independent, the blockage is quickly relieved by splitting the task group to be distributed, improving the task distribution efficiency and the resource scheduling efficiency of the GPU.
Embodiment 4
In the prior art, the maximum contiguous resource count of each first-type resource is generally read through a GPU hardware implementation, or obtained by software reading. A software search must proceed bit by bit because of its time complexity, and the number of clock cycles needed each time an available resource is sought is uncontrollable, so search efficiency is extremely low; a hardware implementation that searches over several clock cycles likewise suffers from an uncontrollable and potentially large cycle count. On this basis, Embodiment 4 provides a method for acquiring the maximum contiguous resource block of a GPU, comprising the following steps:
Step D1, reading the current resource state sequence S_0 = {d_1, d_2, ... d_N} of the resource to be checked, where d_n is the state flag of the n-th resource block of the resource to be checked, n ranges from 1 to N, and N is the total number of resource blocks of the resource to be checked;
step D2, acquiring in parallel the state sequences S_1, S_2, ... S_{N-1}, where S_i is S_0 shifted i bits in a preset direction with the i consecutive bits at the tail along that direction set to the occupied flag, and i ranges from 0 to N-1; shifting i bits in the preset direction means shifting i bits left or shifting i bits right;
step D3, acquiring in parallel the result SA_i of performing a bitwise AND operation or a bitwise OR operation over S_0 through S_i;
step D4, performing a self-OR operation, or a self-AND operation followed by negation, on each SA_i, and determining the current maximum contiguous resource block count of the resource to be checked.
The current resource state sequence can be read directly from hardware using the prior art. The state flags comprise an occupied flag and an unoccupied flag. If the occupied flag is 0 and the unoccupied flag is 1, the bitwise AND operation is performed in step D3 and the self-OR operation in step D4; alternatively, if the occupied flag is 1 and the unoccupied flag is 0, the bitwise OR operation is performed in step D3 and the self-AND-then-negate operation in step D4.
Performing the bitwise AND operation (or bitwise OR operation) over S_0 through S_i means: first performing the operation on S_0 and S_1 to obtain S_{0-1}; then on S_{0-1} and S_2 to obtain S_{0-2}; then on S_{0-2} and S_3 to obtain S_{0-3}; and so on, until S_{0-(i-1)} is combined with S_i to obtain S_{0-i}, which is the SA_i. A bitwise AND of two sequences performs an AND on the values at the same position of the two sequences and takes the result as the value at that position, finally yielding a new sequence. For example, the bitwise AND of the sequences 0 0 1 0 0 1 and 1 0 1 1 0 1 is 0 0 1 0 0 1; those skilled in the art will recognize that the bitwise OR and self-AND-then-negate operations follow the same logic and are not enumerated here. As an embodiment, step D4 includes:
step D41, performing a self-OR operation, or a self-AND operation followed by negation, on each SA_i to obtain SAR_i;
step D42, generating the first sequence under test {SAR_0, SAR_1, ... SAR_{N-1}} from all the SAR_i;
step D43, determining the current maximum contiguous resource block count of the resource to be checked based on {SAR_0, SAR_1, ... SAR_{N-1}}.
It will be appreciated that the SAR_i obtained by performing the self-OR operation, or the self-AND-then-negate operation, on each SA_i has the value 0 or 1.
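The shift-and-reduce pipeline of steps D1-D4 can be checked with a short software model. The sketch below represents the N-bit state sequence S_0 as a Python integer with the unoccupied flag 1 and the occupied flag 0 (the bitwise-AND / self-OR branch); the hardware computes every SA_i in parallel within one clock, whereas this model iterates.

```python
def max_contiguous(state: int, n: int) -> int:
    """Model of steps D2-D4: state is the N-bit sequence S_0, bit 1 = free block.
    Returns the current maximum number of contiguous free blocks."""
    sa = state                       # SA_0 = S_0
    best = 0
    for i in range(n):
        if sa == 0:                  # self-OR of SA_i is 0: no free run of length i+1
            break
        best = i + 1                 # self-OR of SA_i is 1: a run of i+1 exists
        sa &= state >> (i + 1)       # SA_{i+1} = SA_i & S_{i+1}; shifted-in bits are 0
    return best

assert max_contiguous(0b00111101, 8) == 4   # longest run of 1s is four bits
```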
As an embodiment, steps D2, D3, and D4 are executed within the same clock cycle, with a corresponding set of hardware units configured for each step; this completes the work within one cycle, improving task distribution and resource scheduling efficiency, and, when the operating-frequency requirement of the GPU chip is not high, reduces the GPU chip area and power consumption. Executing steps D2, D3, and D4 in the same clock cycle means that they transfer information within the same clock cycle, so the hardware units may be implemented with combinational logic; the specific implementation of combinational logic is an existing technique and is not described here.
As an embodiment, steps D2, D3, and D4 may instead be executed serially, each occupying a preset clock cycle, with a set of hardware units provided for each step; this helps improve the operating frequency and execution performance of the GPU chip. In this case, steps D2, D3, and D4 transfer information across three consecutive clock cycles, which may be implemented with registers; the technical details of register-based implementation are prior art and are not described here.
As an embodiment, the occupied flag is 0 and the unoccupied flag is 1, the bitwise AND operation is performed in step D3, and the self-OR operation in step D4; alternatively, the occupied flag is 1 and the unoccupied flag is 0, the bitwise OR operation is performed in step D3, and the self-AND-then-negate operation in step D4. It can be appreciated that once the preset direction, the occupied flag, and the unoccupied flag are determined, the preconfigured first mapping table is likewise set based on that configuration.
As an embodiment, step D43 includes:
step D431, comparing {SAR_0, SAR_1, ... SAR_{N-1}} against a preconfigured first mapping table and outputting the current maximum contiguous resource block count of the resource to be checked, where the first mapping table stores the mapping relation between the first sequence under test and the maximum contiguous resource block count.
As an embodiment, step D43 includes:
step D432, in {SAR_0, SAR_1, ... SAR_{N-1}}, reading forward starting from SAR_{N-1} and determining the first SAR_i equal to 1, whose index i is denoted i';
step D433, determining i' + 1 as the current maximum contiguous resource block count of the resource to be checked.
It should be noted that the technical details described in the foregoing and other embodiments apply to the corresponding steps of this embodiment and are not repeated here.
In Embodiment 4, based on the current resource state sequence of the resource to be checked, the current maximum contiguous resource block count is obtained quickly and accurately within one or a few controllable clock cycles, using only simple hardware operations combined with the preconfigured first mapping table, which improves the efficiency of resource scheduling.
Embodiment 5
The method of Embodiment 4 is best suited to a small total number of resource blocks, for example 8 or 16. When the total number of resource blocks to be checked is large, for example 128, a large amount of hardware must be laid out, such as many registers or circuit lines and many AND or OR gates, and the area and power consumption of the GPU chip increase. On this basis, Embodiment 5 is further proposed on top of Embodiment 4.
A method for acquiring the maximum contiguous resource block of a GPU comprises the following steps:
Step E1, reading the current resource state sequence S_0 = {d_1, d_2, ... d_N} of the resource to be checked, where d_n is the state flag of the n-th resource block, n ranges from 1 to N, and N is the total number of resource blocks of the resource to be checked;
step E2, splitting S_0 equally into Z groups of resource state sequences {U_1, U_2, ... U_Z}, where U_z is the z-th group, U_z = {d_{N(z-1)/Z+1}, d_{N(z-1)/Z+2}, ... d_{Nz/Z}}, z ranges from 1 to Z, Z is less than N, and N is divisible by Z; then performing an AND operation (or an OR operation) across the bits of each U_z in {U_1, U_2, ... U_Z} to generate the state sequence to be processed F_0 = {UA_1, UA_2, ... UA_Z}, where UA_z is the AND (or OR) result corresponding to U_z;
by the method of S 0 The equal segmentation is Z groups, so that the sequence can be shortened, the calculation amount of subsequent shift, AND operation and self OR operation or self AND negation operation is greatly reduced, and the corresponding hardware layout quantity is reduced, thereby reducing the area and the power consumption of the GPU.
Preferably, N is an integer multiple of 4 and Z is N/4.
Step E3, determining the current maximum contiguous resource block count of the resource to be checked based on F_0.
As an embodiment, step E3 includes:
step E31, acquiring in parallel the state sequences F_1, F_2, ... F_{Z-1}, where F_j is F_0 shifted j bits in a preset direction with the j bits at the tail along that direction set to the occupied flag, and j ranges from 0 to Z-1; shifting j bits in the preset direction means shifting j bits left or shifting j bits right;
step E32, acquiring in parallel the result FA_j of performing a bitwise AND operation or a bitwise OR operation over F_0 through F_j;
step E33, performing a self-OR operation, or a self-AND operation followed by negation, on each FA_j, and determining the current maximum contiguous resource block count of the resource to be checked.
As an embodiment, steps E31, E32, and E33 are executed within the same clock cycle, with a corresponding set of hardware units configured for each step, so that task distribution and resource scheduling efficiency is improved within one cycle; when the operating-frequency requirement of the GPU chip is not high, this reduces the GPU chip area and power consumption. Executing steps E31, E32, and E33 in the same clock cycle means that they transfer information within the same clock cycle, so the hardware units may be implemented with combinational logic; the specific implementation of combinational logic is an existing technique and is not described here.
As an example, if the occupied flag is 0 and the unoccupied flag is 1, each U_z in {U_1, U_2, ... U_Z} undergoes an AND operation in step E2, the bitwise AND operation is performed in step E32, and the self-OR operation in step E33; alternatively, if the occupied flag is 1 and the unoccupied flag is 0, each U_z undergoes an OR operation, the bitwise OR operation is performed in step E32, and the self-AND-then-negate operation in step E33.
As an embodiment, steps E31, E32, and E33 are executed serially, each occupying a preset clock cycle, and the three steps multiplex one set of hardware units, which helps reduce the area and power consumption of the GPU chip.
As an embodiment, steps E31, E32, and E33 may also be executed serially, each occupying a preset clock cycle, with a set of hardware units provided for each step, which helps improve the operating frequency and execution performance of the GPU chip. In this case, steps E31, E32, and E33 transfer information across three consecutive clock cycles, which may be implemented with registers; the technical details of register-based implementation are prior art and are not described here. As one embodiment, step E33 includes:
step E331, performing a self-OR operation, or a self-AND operation followed by negation, on each FA_j to obtain FAR_j;
step E332, generating the sequence under test {FAR_0, FAR_1, ... FAR_{Z-1}} from all the FAR_j;
step E333, determining the current maximum contiguous resource block count of the resource to be checked based on {FAR_0, FAR_1, ... FAR_{Z-1}}.
As an embodiment, step E333 includes:
step E3333, comparing {FAR_0, FAR_1, ... FAR_{Z-1}} against a preconfigured second mapping table and outputting the current maximum contiguous resource block count of the resource to be checked, where the second mapping table stores the mapping relation between the self-OR sequence and the maximum contiguous resource block count.
It will be appreciated that the second mapping table is likewise set based on the configuration once the preset direction, the occupied flag, and the unoccupied flag are determined.
As an embodiment, step E333 includes:
step E3331, in {FAR_0, FAR_1, ... FAR_{Z-1}}, reading forward starting from the bit FAR_{Z-1} and determining the first FAR_j equal to 1, then setting j' = j + 1;
step E3332, determining the current maximum contiguous resource block count X of the resource to be checked based on j':
X = j' * (N/Z).
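A software model of Embodiment 5 under the same flag convention (bit 1 = free): each group of N/Z blocks is reduced by AND to one bit UA_z, the Embodiment 4 search (the max_contiguous model sketched above) runs on the Z-bit sequence F_0, and the result is scaled per X = j' * (N/Z). The names are assumptions.

```python
def max_contiguous_grouped(state: int, n: int, z: int) -> int:
    """Model of steps E1-E3: returns X = j' * (N/Z), a multiple of the group size."""
    g = n // z                                # group size N/Z (Z must divide N)
    f0 = 0
    for k in range(z):                        # UA_k: AND-reduce each group to 1 bit
        group_bits = (state >> (k * g)) & ((1 << g) - 1)
        if group_bits == (1 << g) - 1:        # all g blocks in the group are free
            f0 |= 1 << k
    return max_contiguous(f0, z) * g          # reuse the Embodiment 4 model

# With 128 blocks grouped by 4, a 20-block free run aligned to a group boundary
# reports 20, while a misaligned one may report 16: the multiple-of-N/Z granularity.
```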
In Embodiment 5, the current resource state sequence of the resource to be checked is split into Z groups, shortening the state sequence from N bits to Z bits, which greatly reduces the subsequent computation and reduces the area and power consumption of the GPU chip. The specific value of Z is determined by the specific application scenario. Although Embodiment 5 cannot report every possible maximum contiguous block count, it can be understood that when the resource demand lies between (j-1)*(N/Z) and j*(N/Z), a maximum block of j*(N/Z) or more satisfies it; some matches may be missed, but since a GPU generally includes multiple execution modules each equipped with the same hardware resources, the influence of grouping on the matching result is almost negligible when matching against many copies of the same resource. This is illustrated with the data of a concrete example below:
take 128 resource blocks divided into groups of 4, 32 groups in total. The occupied flag of each resource block is 0 and the unoccupied flag is 1. After division into 32 groups, the 4 status flag bits of each group are reduced by a logical AND to 1 bit: 1 indicates the group is available, 0 indicates it is unavailable. Assuming resource availability is completely randomly distributed, the probabilities of 1 and 0 are each 1/2.
When the task group to be distributed demands a large amount of resources, several resource blocks or groups must be contiguous and all 1 whether or not grouping is used, so the disadvantage of grouping is not obvious. When the demand is small, say 4 resource blocks, then without grouping it suffices to find a run of at least four consecutive 1s, and the starting position need not be divisible by 4; with grouping, at least four consecutive 1s are required and the starting position must be divisible by 4, so intuitively the probability of satisfying the resource condition is lower. However, because the GPU usually contains many copies of the same resource, the final matching effect is essentially unchanged. Analyzing concretely with a resource demand of 4: in this example, assume the GPU comprises 16 execution modules, each containing 4 execution units, and assume the waves of the task group to be distributed are to be spread over the four execution units of one execution module.
For one execution unit in one execution module, the probability that a given group does not satisfy the demand is 1 - 1/16 = 15/16, and the probability that none of the 32 groups satisfies it is (15/16)^32 = 0.127. In the worst case, if the current task group to be distributed has more than 4 waves, the task group cannot be placed as long as any one of the 4 execution units of an execution module lacks the resources; that probability is 1 - (1 - 0.127)^4 = 0.418.
It can be seen that with only one execution module, the grouping of Embodiment 5 noticeably affects the search for resources for the task group to be distributed, but as the number of execution modules grows the influence drops markedly. For example, with the commonly used 16 execution modules, the probability that no execution module can accommodate the task group is 0.418^16 = 9e-7, which is approximately 0; that is, the probability that all 16 execution modules fail to find the resources is close to 0, so the influence of grouping on a resource demand of 4 can be ignored.
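The example figures can be reproduced in a few lines, assuming fully random, independent block states as the text does:

```python
p_group = (1 / 2) ** 4                        # a 4-block group entirely free: 1/16
p_unit_fail = (1 - p_group) ** 32             # no free group among 32: (15/16)^32 ~ 0.127
p_module_fail = 1 - (1 - p_unit_fail) ** 4    # some unit of 4 lacks a group: ~ 0.418
p_all_fail = p_module_fail ** 16              # all 16 modules fail: ~ 9e-7
print(p_unit_fail, p_module_fail, p_all_fail)
```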
It should be noted that the technical details described in the foregoing and other embodiments apply to the corresponding steps of this embodiment and are not repeated here.
Embodiment 6
Embodiment 4 increases the area and power consumption of the GPU chip when the total number of resources to be checked is large. Embodiment 5 reduces, on that basis, the large amount of hardware to be laid out and thus the area and power consumption of the GPU chip, but the maximum count it finally determines can only be a multiple of N/Z, which is a limitation. On this basis, Embodiment 6 is proposed.
A method for acquiring the maximum contiguous resource block of a GPU comprises the following steps:
Step F1, reading the current resource state sequence S_0 = {d_1, d_2, ... d_N} of the resource to be checked, where d_n is the state flag of the n-th resource block, n ranges from 1 to N, and N is the total number of resource blocks of the resource to be checked;
step F2, acquiring in parallel the state sequences S_1, S_2, ... S_{N-1}, where S_i is S_0 shifted i bits in a preset direction with the i consecutive bits at the tail along that direction set to the occupied flag, and i ranges from 0 to N-1; shifting i bits in the preset direction means shifting i bits left or shifting i bits right;
step F3, sampling W k values {k0, k1, ... k(W-1)} from 0 to N-1 and acquiring in parallel the result SA_k of performing a bitwise AND operation or a bitwise OR operation over S_0 through S_k;
step F4, determining the current maximum contiguous resource block count of the resource to be checked based on the SA_k.
As an embodiment, step F4 includes:
step F41, performing a self-OR operation, or a self-AND operation followed by negation, on each SA_k to obtain SAR_k;
step F42, generating the sample sequence under test {SAR_k0, SAR_k1, ... SAR_k(W-1)} from all the SAR_k;
step F43, determining the current maximum contiguous resource block count of the resource to be checked based on {SAR_k0, SAR_k1, ... SAR_k(W-1)}.
As an example, the state flags comprise an occupied flag and an unoccupied flag. If the occupied flag is 0 and the unoccupied flag is 1, the bitwise AND operation is performed in step F3 and the self-OR operation in step F41; alternatively, if the occupied flag is 1 and the unoccupied flag is 0, the bitwise OR operation is performed in step F3 and the self-AND-then-negate operation in step F41.
As an example, k(w+1) - k(w) is equal to or greater than k(w) - k(w-1), where k(w) is the (w+1)-th k value in {k0, k1, …, k(W-1)} and w ranges from 0 to W-1. Preferably, k(w+1) - k(w) is an integer power of 2.
It should be noted that requiring k(w+1) - k(w) ≥ k(w) - k(w-1) lets the sampling step grow gradually: smaller k values are sampled more densely, so the smaller, more likely maximum-continuous-resource-block counts are hit; larger k values are sampled more sparsely, which greatly reduces the amount of AND computation and the number of AND gates and similar hardware to be arranged, while the accuracy of the computed result is still ensured.
As an embodiment, steps F2, F3, and F4 are executed within the same clock cycle, with one set of corresponding hardware units configured for them, so that task distribution and resource scheduling can complete within one cycle; when the operating-frequency requirement of the GPU chip is not high, this reduces the GPU chip area and power consumption. Executing steps F2, F3, and F4 in the same clock cycle means their information propagates within one clock cycle, so the hardware unit can be implemented with combinational logic; specific implementations of combinational logic are existing techniques and are not described here.
As an embodiment, steps F2, F3, and F4 may instead be executed serially, each occupying a preset clock cycle, with a separate set of hardware units for each step, which is beneficial to the operating frequency and execution performance of the GPU chip. In this case steps F2, F3, and F4 transfer information across three consecutive clock cycles, which can be implemented with registers; the technical details of register-based implementations are prior art and are not described here.
As an embodiment, in the step F3, the sampling from 0 to N-1 includes:
Step F31: first remove preset exclusion values from 0 to N-1, then sample to obtain the W k-values, where the preset exclusion values include values for which the hit probability of the number of continuous resources required by task groups is less than or equal to a preset probability threshold.
The preset exclusion values may specifically include the prime numbers in 0 to N-1.
As an embodiment, the step F43 includes:
Step F431: compare {SAR_k0, SAR_k1, …, SAR_k(W-1)} against a pre-configured third mapping table and output the current maximum number of continuous resource blocks of the resource to be checked; the third mapping table stores the mapping relation between sample sequences and maximum continuous resource block counts.
It will be appreciated that the third mapping table is likewise configured after the preset direction, the occupied identification, and the unoccupied identification have been determined.
As an embodiment, the step F43 includes:
Step F432: in {SAR_k0, SAR_k1, …, SAR_k(W-1)}, reading forward from SAR_k(W-1), determine the k value of the first SAR_k equal to 1 and denote it k'.
Step F433: determine k' + 1 as the current maximum number of continuous resource blocks of the resource to be checked.
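As a minimal software sketch rather than the hardware implementation, the following Python code emulates steps F1 through F433 under the convention occupied = 0 / unoccupied = 1 (bitwise AND in step F3, self-OR in step F41); the function and parameter names are illustrative:

```python
def max_contiguous_blocks(s0, n, ks):
    """s0: n-bit resource state word, bit = 1 means unoccupied (free).
    ks: sampled k values from 0..n-1 (step F3), ascending.
    Returns the determined maximum number of continuous free blocks."""
    mask = (1 << n) - 1
    s0 &= mask
    sar = {}
    for k in ks:
        # SA_k = S_0 & S_1 & ... & S_k, where S_i is S_0 shifted right by i bits
        # (vacated high bits read as 0, i.e. the occupied identification)
        sa = s0
        for i in range(1, k + 1):
            sa &= s0 >> i
        sar[k] = 1 if sa != 0 else 0   # F41: self-OR (OR-reduce over all bits)
    # F432/F433: scan back from the largest sampled k for the first SAR_k = 1
    for k in sorted(ks, reverse=True):
        if sar[k] == 1:
            return k + 1
    return 0
```

For example, max_contiguous_blocks(0b00111100, 8, [0, 1, 3, 7]) returns 4, because SA_3 is non-zero (bits 2 to 5 are free) while SA_7 reduces to 0. As the note on sampling step size above implies, the result can undershoot the true maximum when it falls between sampled k values.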
Compared with the fourth embodiment, Embodiment Six reduces the number of AND operations through k-value sampling, and in turn the number of subsequent self-OR operations, so less hardware needs to be arranged and the area and power consumption of the GPU chip are reduced. Compared with the fifth embodiment, the finally determined maximum resource number is not limited to multiples of N/Z, and the sampling can be tailored to specific application requirements, giving greater flexibility.
It should be noted that the related technical details in the foregoing and other embodiments remain applicable to the related steps of this embodiment and are not repeated here.
Embodiment Seven
One execution module of a GPU generally includes multiple first-type resources (resources with continuous-allocation requirements), and their resource-block counts may differ. On the one hand, if a separate set of hardware units for acquiring the maximum continuous resource block were provided for each first-type resource, the GPU's hardware units would be complex and occupy a large area. On the other hand, after the GPU allocates a task group to be distributed, a certain time is required to execute it, so the acquisition hardware need not serve every resource in every cycle. Embodiment Seven is therefore provided: a set of sharable hardware units for acquiring the maximum continuous resource block can be arranged in each execution module for time-division multiplexing, which satisfies the computing requirement of the maximum continuous resource block while reducing the amount of hardware to be arranged on the GPU, thereby reducing GPU area and power consumption.
The method for acquiring the maximum continuous resource block of the GPU based on time division multiplexing comprises the following steps:
Step G1: set the number C of time periods required for each round of maximum-continuous-resource-block acquisition, configure at least one resource type identifier for each time period, and initialize c = 1.
The total number of resource blocks of all resource types corresponding to each time period is less than or equal to A, where A is greater than or equal to the largest resource-block count among all resource types.
Step G2: obtain the current resource state sequence corresponding to each resource type identifier of the c-th time period, divide the shared hardware unit into Rc groups, where Rc is the number of resource type identifiers corresponding to the c-th time period, and store each current resource state sequence into the corresponding shared hardware unit group.
Step G3: within the c-th time period, execute the maximum-continuous-resource-block acquisition operation on the Rc current resource state sequences in parallel based on the shared hardware unit, obtaining the maximum resource block count corresponding to each resource type identifier of the c-th time period.
Step G4: judge whether c equals C; if so, end this round of maximum-continuous-resource-block acquisition; otherwise set c = c + 1 and return to step G2.
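A minimal sketch of the per-round control flow of steps G1-G4, assuming two helper routines that stand in for the hardware (state_of and acquire_max_runs are illustrative names, not part of the original):

```python
def acquisition_round(period_types, state_of, acquire_max_runs):
    """One round of steps G1-G4. period_types[c-1] lists the resource type
    identifiers configured for time period c; state_of(t) returns the current
    resource state sequence of type t; acquire_max_runs(seqs) performs the
    step-G3 parallel acquisition on the shared hardware unit."""
    results = {}
    for c, types in enumerate(period_types, start=1):  # G4: c = 1 .. C
        seqs = [state_of(t) for t in types]            # G2: load the Rc sequences
        maxima = acquire_max_runs(seqs)                # G3: parallel acquisition
        results.update(zip(types, maxima))
    return results
```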
As an embodiment, if Rc equals 1, then in step G3 there is only one corresponding current resource state sequence in the shared hardware unit; it is treated as the S_0 of any one of the fourth, fifth, or sixth embodiments and the corresponding maximum-continuous-resource-block acquisition operation is executed. That is, when only one current resource state sequence exists in the shared hardware unit, the maximum continuous resource block count is acquired for only one resource type in that time period; when multiple current resource state sequences exist, the maximum continuous resource block counts are acquired for multiple resource types simultaneously in the same time period based on the same shared hardware unit. A time period may comprise one or more clock cycles.
As an embodiment, if Rc is greater than or equal to 2, the step G3 includes:
Step G31: divide the shared hardware unit into Rc groups and store each current resource state sequence into the corresponding group in turn, obtaining the sequence D_0 = {Q_1, Q_2, …, Q_Rc}, where Q_r is the current resource state sequence in the r-th group and r ranges from 1 to Rc.
Step G32: obtain D_1, D_2, …, D_(E-1) in parallel, where D_e is the state sequence generated by shifting D_0 by e bits in the preset direction while setting the e consecutive bits at the tail of each Q_r along the preset direction to the occupied identification; e ranges from 0 to E-1, and E is the maximum bit width of the shared hardware unit.
Here, shifting by e bits in the preset direction means shifting the e bits either left or right.
Step G33: obtain in parallel the result DA_e of performing a bitwise AND operation (or bitwise OR operation) over D_0 through D_e, DA_e = {DQ_e1, DQ_e2, …, DQ_eRc}, where DQ_er is the bitwise AND (or bitwise OR) result corresponding to the r-th group.
Step G34: perform a self-OR operation (or self-AND-then-negate operation) on each DQ_er to determine the current maximum continuous resource block count of the resource type corresponding to the r-th group.
As an embodiment, the step G34 includes:
Step G341: perform a self-OR operation (or self-AND-then-negate operation) on each DQ_er to obtain DQR_er.
Step G342: generate the first sequence to be tested {DQR_0r, DQR_1r, …, DQR_(E-1)r} corresponding to the r-th group from all the DQR_er.
Step G343: determine the current maximum continuous resource block count of the resource type corresponding to the r-th group based on {DQR_0r, DQR_1r, …, DQR_(E-1)r}.
The step G343 includes:
Step G3431: in {DQR_0r, DQR_1r, …, DQR_(E-1)r}, reading forward from DQR_(E-1)r, determine the e value of the first DQR_er equal to 1 and denote it e'.
Step G3432: determine e' + 1 as the current maximum continuous resource block count of the resource type corresponding to the r-th group.
As an example, the status identifications include an occupied identification and an unoccupied identification. Either the occupied identification is 0 and the unoccupied identification is 1, the bitwise AND operation is executed in step G33 and the self-OR operation in steps G34 and G341; or the occupied identification is 1 and the unoccupied identification is 0, the bitwise OR operation is executed in step G33 and the self-AND-then-negate operation in steps G34 and G341.
As an embodiment, the step G343 includes:
Step G3433: judge whether the bit count of {DQR_0r, DQR_1r, …, DQR_(E-1)r} is less than E; if so, pad occupied identifications after DQR_(E-1)r so that the total bit count of the padded sequence is E, and take the padded sequence as the second sequence to be tested.
Step G3434: compare the second sequence to be tested against a pre-configured fourth mapping table and output the current maximum continuous resource block count of the resource type corresponding to the r-th group; the fourth mapping table stores the mapping relation between second sequences to be tested and maximum continuous resource block counts, the bit count of the second sequence to be tested being E.
Embodiment Seven realizes time-division multiplexing of the hardware unit that acquires the maximum continuous resource block within an execution module, and can acquire the corresponding maximum continuous resources for multiple groups of resources within the same time period. This reduces the number of maximum-continuous-resource-block hardware units that must be arranged in the execution module, reducing GPU area and power consumption.
The following is further illustrated by a specific example:
For convenience of description, take three resource types multiplexing the maximum-continuous-resource-block hardware unit as an example. In this embodiment, the first resource has 128 resource blocks in total, the second resource has 64 blocks, and the third resource has 48 blocks. To meet the sharing requirement, the shared hardware unit is 128 bits wide; 1 represents the unoccupied identification and 0 the occupied identification. I_0 denotes the current resource state sequence of the current time period; it is stored into the corresponding shared hardware unit group and divided into a high half I_H and a low half I_L, each occupying half, i.e. 64 bits.
In this embodiment, in the first time period the shared hardware unit searches the maximum continuous resource block count for the first resource. Block 1 produces the 128 right-shifted versions of the current sequence (shifts of 0 to 127 bits). Block 2 performs a logical AND of the current sequence with its 1-bit right shift, with its 2-bit right shift, and so on up to its 127-bit right shift. Each AND result is directly self-ORed; reading back from the last self-OR result, the sequence number of the first result equal to 1, plus 1, gives the maximum resource block count of the first resource. To facilitate multiplexing the shared hardware unit for the second and third resources within the same time period, the first-resource search can also proceed as follows: blocks 3 and 4 compute the self-OR of block 2's results separately over the high 64 bits and the low 64 bits, denoted S_Hi and S_Li. If S_Li | S_Hi = 1 (i.e. at least one of S_Li and S_Hi is 1), the maximum continuous free space is not smaller than i + 1; scanning from S_L127 | S_H127 backward, the first i + 1 whose result is 1 is the corresponding maximum continuous free space size.
In the second time period, the shared hardware unit searches the maximum continuous free space size for the second and third resources simultaneously. Block 5 produces the 128 right-shifted versions of the current sequence (shifts of 0 to 127 bits); the masks M may be prepared in advance to keep shifted bits from crossing the group boundary. M(i) denotes a 128-bit sequence that is all 1s except that the i-th bit is 0; for example, M(0) = 128'hFFFF_FFFF_FFFF_FFFF_FFFF_FFFF_FFFF_FFFE, M(0,1) = 128'hFFFF_FFFF_FFFF_FFFF_FFFF_FFFF_FFFF_FFFC, and so on. Blocks 2, 3, and 4 then perform the corresponding operations exactly as in the first time period. Finally S_Li and S_Hi are checked separately: if S_Li = 1, the maximum continuous free space of the second resource is not smaller than i + 1, and scanning from S_L63 backward, the first i + 1 whose result is 1 is the second resource's maximum continuous free space size; likewise, if S_Hi = 1, the maximum continuous free space of the third resource is not smaller than i + 1, and scanning from S_H63 backward, the first i + 1 whose result is 1 is the third resource's maximum continuous free space size.
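As a minimal software emulation of this second time period, the sketch below packs two 64-bit state sequences (1 = unoccupied) into one 128-bit word, masks shifted bits so nothing leaks across the group boundary (the role of the M masks), AND-accumulates, and OR-reduces each group's slice; all names are illustrative:

```python
def packed_max_free(groups, width=64):
    """groups: per-group state words (1 = free), each `width` bits wide, packed
    low-to-high into one shared word. Emulates steps G31-G34."""
    gmask = (1 << width) - 1
    d0 = 0
    for r, q in enumerate(groups):
        d0 |= (q & gmask) << (r * width)          # G31: pack into D_0

    def boundary_mask(e):
        # keep only the low (width - e) bits of every group: shifted-in bits
        # and bits leaked from the neighbouring group read as occupied (0)
        keep = (1 << (width - e)) - 1
        m = 0
        for r in range(len(groups)):
            m |= keep << (r * width)
        return m

    best = [0] * len(groups)
    acc = d0                                      # DA_0 = D_0
    for e in range(width):
        if e > 0:
            acc &= (d0 >> e) & boundary_mask(e)   # G33: DA_e = D_0 & ... & D_e
        for r in range(len(groups)):
            if (acc >> (r * width)) & gmask:      # G34: self-OR of DQ_er is 1
                best[r] = e + 1                   # a free run of e + 1 exists
    return best
```

For instance, packed_max_free([0x0F0F0F0F0000FFFF, 0x000000000000FF00]) reports [16, 8]. A routine like this could also serve as the acquire_max_runs helper assumed in the earlier control-flow sketch.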
It should be noted that the related technical details in the foregoing and other embodiments remain applicable to the related steps of this embodiment and are not repeated here.
Embodiment Eight
Balance is a basic requirement of GPU resource allocation. An execution module continuously processes multiple task groups; how to ensure that each task group is distributed across the multiple execution units as evenly as possible, and how to ensure that multiple execution modules remain balanced overall while continuously processing multiple task groups, are the keys to keeping GPU resource allocation balanced, improving GPU resource utilization, and reducing power consumption. Embodiment Eight provides a GPU resource scheduling method to solve this problem.
A GPU resource scheduling method, comprising:
Step H1, acquiring a task group to be distributed, and reading the number of tasks to be distributed, wherein the tasks to be distributed are tasks which need to be distributed in a balanced manner to execution units of an execution module, and the execution module comprises Q execution units;
step H2, determining an initial allocation combined sequence corresponding to the number of tasks to be allocated based on a task number segmentation table, wherein the task number segmentation table is used for storing the mapping relation between the number of tasks to be allocated and the initial allocation combined sequence;
wherein the initial allocation combination is the balanced allocation combination for the case in which the execution module has not yet been allocated any tasks.
Step H3, acquiring preset pointer information in the execution module, determining a cyclic shift number Su based on the pointer information, and circularly shifting the initial allocation combination sequence to a preset direction Su to obtain target allocation combination information;
and step H4, matching GPU resources of the execution module based on the target allocation combination information corresponding to the task group to be distributed.
It should be noted that if tasks were always allocated according to the initial allocation combination, the allocated task amount of some execution units would inevitably be larger than that of others, leaving task allocation and resource scheduling unbalanced. The present application therefore sets pointer information to record each execution module's previous round of task allocation and, on that basis, adjusts the present round's target allocation combination information, ensuring that task allocation and resource scheduling remain as balanced as possible for each execution module over multiple rounds of task allocation and execution.
As an embodiment, the method further includes a step H10 of constructing the task number segmentation table, including:
Step H101: set sequence-number identifiers 0 to Q-1 for the Q execution units along the preset direction, initialize the task number WX = 1, and initialize the initial allocation combination {qx_0, qx_1, …, qx_(Q-1)} with every position 0, where qx_t is the number of tasks allocated to the t-th execution unit and t ranges from 0 to Q-1.
Step H102: obtain the quotient Wy and remainder Wz of WX divided by Q; if t < Wz, set qx_t = Wy + 1; if t ≥ Wz, set qx_t = Wy; generate the corresponding initial allocation combination {qx_0, qx_1, …, qx_(Q-1)} from all the qx_t.
Step H103: judge whether WX equals Q × L, where L is the maximum number of tasks each execution unit can execute; if so, generate the task number segmentation table from the mapping relation between every WX and its corresponding initial allocation combination; otherwise set WX = WX + 1 and return to step H102.
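A minimal sketch of this table construction under the stated rules (names are illustrative):

```python
def initial_combo(wx, q):
    """Step H102: split WX tasks over Q execution units -- quotient Wy,
    remainder Wz; the first Wz units get Wy + 1 tasks, the rest get Wy."""
    wy, wz = divmod(wx, q)
    return [wy + 1 if t < wz else wy for t in range(q)]

def build_segmentation_table(q, l):
    """Steps H101-H103: map every task count WX in 1..Q*L to its initial
    allocation combination {qx_0, ..., qx_(Q-1)}."""
    return {wx: initial_combo(wx, q) for wx in range(1, q * l + 1)}
```

For example, build_segmentation_table(4, 2)[6] yields [2, 2, 1, 1]: six tasks over four execution units give quotient Wy = 1 and remainder Wz = 2, so the first two units each receive one extra task.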
The task number segmentation table is constructed through steps H101-H103, from which the initial allocation combination for each execution module is obtained; it is subsequently adjusted in combination with the pointer information to obtain the target allocation combination. It should be noted that the hardware resource layout of every execution module is identical, so multiple execution modules may share the same task number segmentation table; to keep multiple task channels executing in parallel, each task channel may be given its own task number segmentation table, while the pointer information of each execution module is stored in that execution module.
As an embodiment, the cyclic shift number Su is determined from the execution unit identifier t_1' pointed to by the previous round's pointer and the previous round's task number WX_2'.
The preset pointer is a tail pointer; in the present round it points to the t-th execution unit, where t is the remainder of (t_1' + WX_2') divided by Q, with t = t_1' when the remainder is 0. The tail pointer initially points to the 0-th execution unit, and in step H3, Su = t + 1.
As one embodiment, the preset pointer is a head pointer; in the present round it points to the t-th execution unit, where t is the remainder of (t_1' + 1 + WX_2') divided by Q, with t = t_1' when the remainder is 0. The head pointer initially points to the 0-th execution unit, and in step H3, Su = t.
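A minimal sketch of the tail-pointer variant of steps H2-H3, reusing initial_combo from the sketch above; the rotation is shown as a right rotation purely for illustration, since the preset direction is left open:

```python
def target_combo(wx, q, t1_prev, wx_prev):
    """Tail-pointer variant: the new tail position t is (t1' + WX2') mod Q,
    Su = t + 1; rotate the initial allocation combination by Su positions."""
    t = (t1_prev + wx_prev) % q
    su = (t + 1) % q
    combo = initial_combo(wx, q)
    return combo[-su:] + combo[:-su] if su else combo
```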
As an embodiment, the method further includes a step H5: if the remaining resources of the execution module match the target allocation combination information of the task group to be distributed, and the execution module is selected as the target execution module and executes the task group to be distributed, update the preset pointer to point to the t-th execution unit. That is, after the target allocation combination information is obtained, resource-matching judgment and selection of the execution module are still required, and only the execution module finally selected as the target execution module for this round's task allocation and execution needs to update its pointer information.
It should be noted that the related technical details in the foregoing and other embodiments remain applicable to the related steps of this embodiment and are not repeated here.
Embodiment Eight ensures that each task group is distributed across the execution units as evenly as possible and that multiple execution modules remain balanced overall while continuously processing multiple task groups, thereby balancing GPU resource allocation, improving GPU resource utilization, and reducing power consumption.
Embodiment Nine
A GPU architecture generally includes P execution modules, and for the same task group to be distributed there may be multiple target execution modules whose remaining resources match its resource requirements. In the prior art, a round-robin scheduling algorithm is generally adopted to select the target execution module from the multiple qualifying execution modules, but this does not take the remaining resources of each execution module into account, so the balance of GPU resource allocation cannot be guaranteed. On this basis, the present invention proposes Embodiment Nine.
A GPU resource scheduling method, comprising:
Step I1, acquiring a candidate execution module list {AP_1, AP_2, …, AP_F}, where AP_f is the f-th candidate execution module, f ranges from 1 to F, F is the total number of candidate execution modules, and a candidate execution module is an execution module whose current remaining resource information matches the resource requirement information of the target task group;
The GPU resources include first-type resources, which have continuous-allocation requirements, and second-type resources, which do not. The current remaining resource information includes the maximum continuous remaining resource count for each first-type resource and the maximum remaining resource count for each second-type resource.
Step I2, obtaining the current remaining resource quantity R_h of the h-th resource in AP_f and the pre-stored weight a_h of the h-th resource, where H is the total number of resource types in the execution module and h ranges from 1 to H;
Step I3, based on R_h and a_h, obtaining the total weight Ta_f of AP_f's current remaining resources: Ta_f = Σ_(h=1..H) a_h × R_h;
Step I4, taking the f value of the largest Ta_f as fx, determining the fx-th candidate execution module as the target execution module, and distributing the target task group to the target execution module.
The larger the total weight of current remaining resources, the more resources the execution module currently has free; determining that module as the target execution module balances the resource utilization across the GPU's execution modules and reduces power consumption.
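Steps I2-I4 amount to a weighted argmax over the candidates; a minimal sketch, with illustrative data shapes:

```python
def pick_target_module(candidates, weights):
    """candidates: list of dicts mapping resource type h -> remaining quantity R_h.
    weights: dict mapping resource type h -> pre-stored weight a_h.
    Returns the index of the module with the largest total remaining weight
    Ta_f = sum_h a_h * R_h (steps I2-I4)."""
    totals = [sum(weights[h] * r for h, r in c.items()) for c in candidates]
    return max(range(len(totals)), key=totals.__getitem__)
```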
As an embodiment, the method further comprises a step I10 of obtaining the weight a_h corresponding to each resource type in the execution module, specifically including:
Step I101: issue MA test tasks that require only the h-th resource for execution to the execution module under test, execute them, and obtain the power consumption value ax_h corresponding to the h-th resource.
It should be noted that, by configuring a corresponding upper-layer test program, the MA test tasks requiring only the h-th resource can be issued to the GPU; this is realized directly with existing techniques and is not described here.
Step I102: set the weight value of each resource based on its power consumption value so that the proportional relation of the a_h equals the proportional relation of the ax_h across all resources, and store all a_h in each execution module.
As an embodiment, the GPU includes P execution modules, and the step I10 further includes:
Step I100: randomly select one of the P execution modules as the execution module under test and shut down the other execution modules.
It should be noted that, since the hardware resource configuration of every execution module is identical, the weights a_h can be determined by testing only one execution module.
As an embodiment, step I2 of acquiring the current remaining resource quantity R_h of the h-th resource in AP_f includes:
Step I21: read the current remaining resource quantity R_h from the preset counting unit corresponding to each h-th resource.
As an embodiment, a preset counting unit for each h-th resource is arranged in the execution module and stores the current remaining resource quantity R_h; R_h is initialized to the total count of the h-th resource, decremented by 1 each time an h-th resource is allocated, and incremented by 1 each time an h-th resource is released.
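The preset counting unit behaves like a simple up/down counter; a minimal sketch (illustrative, not the hardware design):

```python
class ResourceCounter:
    """Preset counting unit for the h-th resource (step I21): tracks R_h."""
    def __init__(self, total):
        self.remaining = total      # R_h initialized to the total block count
    def allocate(self):
        self.remaining -= 1         # one h-th resource allocated
    def release(self):
        self.remaining += 1         # one h-th resource released
```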
As an embodiment, in step I2, when the h-th resource is a first-type resource requiring acquisition of the maximum continuous resource block, the maximum continuous resource block count can be obtained by any one of the methods of the fourth, fifth, or sixth embodiments and used as the corresponding R_h value; details are not repeated here.
Embodiment Nine selects the target execution module based on each candidate's current remaining resources, so all execution modules stay in a steady state, the generated power consumption is relatively uniform, GPU resources are balanced as much as possible, and resource waste is avoided.
It should be noted that the related technical details in the foregoing and other embodiments remain applicable to the related steps of this embodiment and are not repeated here.
Embodiment Ten
After a target task group is allocated to the corresponding execution units in the target execution module, the execution units invoke the corresponding execution instructions to execute it. Embodiment Ten proposes a method for selecting the target execution module based on the execution history of those instructions; compared with Embodiment Nine it captures the current resource usage state of each execution module at a finer granularity, further improving the balance of GPU resource scheduling.
A GPU resource scheduling method, comprising:
Step J1, acquiring a candidate execution module list {AP_1, AP_2, …, AP_F}, where AP_f is the f-th candidate execution module, f ranges from 1 to F, and F is the total number of candidate execution modules;
wherein a candidate execution module is an execution module whose current remaining resource information matches the resource requirement information of the target task group.
Step J2, within the current preset NX historical clock cycles, obtaining the number of times C_s that AP_f executed the s-th instruction, and obtaining the pre-stored weight B_s of the s-th instruction;
Step J3, based on C_s and B_s, obtaining AP_f's total power consumption Tb_f over the preset NX historical clock cycles: Tb_f = Σ_s B_s × C_s;
Step J4, taking the f value of the smallest Tb_f as fx, determining the fx-th candidate execution module as the target execution module, and distributing the target task group to the target execution module.
It should be noted that the smaller an execution module's total power consumption Tb_f over the most recent NX clock cycles, the more resources it currently has free; determining the module with the minimum Tb_f as the target execution module therefore improves GPU resource utilization and reduces power consumption.
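Steps J2-J4 mirror Embodiment Nine's selection, but as a weighted argmin over recent instruction counts; a minimal sketch with illustrative data shapes:

```python
def pick_target_module_by_history(candidates, instr_weights):
    """candidates: list of dicts mapping instruction type s -> count C_s
    executed in the last NX clock cycles. instr_weights: s -> weight B_s.
    Returns the index of the module with the smallest estimated power
    Tb_f = sum_s B_s * C_s (steps J2-J4)."""
    totals = [sum(instr_weights[s] * c for s, c in m.items()) for m in candidates]
    return min(range(len(totals)), key=totals.__getitem__)
```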
As an embodiment, the instructions include data transfer instructions, read-storage instructions, write-storage instructions, matrix operation instructions, comparison instructions, jump instructions, and other instruction categories; matrix operation instructions may be further classified by matrix size, and each instruction category may in turn include multiple different execution instructions. In step J2, counting the executions of each instruction over the NX historical clock cycles improves the accuracy of the resource prediction and thus the balance of GPU resource allocation.
As an embodiment, in the step J1, obtaining the candidate execution module list includes:
Step J11, obtaining resource demand information of a task group to be distributed;
step J12, obtaining current residual resource information of each execution module in the current GPU;
Step J13, matching the resource demand information of the task group to be distributed with the current remaining resource information of all the execution modules respectively, and adding the successfully matched execution modules to the candidate execution module list.
As an embodiment, the method further comprises a step J10 of obtaining the weight B_s corresponding to each instruction category in the execution module, specifically including:
Step J101: issue test tasks that only need to call the s-th instruction NA times to the execution module under test, execute them, and obtain the power consumption value BX_s of executing the s-th instruction NA times.
It should be noted that such test tasks can be issued to the GPU by configuring a corresponding upper-layer test program; this is realized directly with existing techniques and is not described here.
Step J102: set the weight value of each instruction category based on its power consumption value so that the proportional relation of the B_s equals the proportional relation of the BX_s across all instructions, and store all B_s in each execution module.
As one example, NX is an integer power of 2; a larger NX improves statistical balance but lowers prediction accuracy. That is, the larger the NX value, the better the balance of the statistical result, but also the longer the time from the earliest counted point to the present, and hence the lower the prediction accuracy. NX may therefore be set according to the specific application requirements, for example to 1024.
As an embodiment, step J10 further comprises:
Step J100: randomly select one of the P execution modules as the execution module under test and shut down the other execution modules.
It should be noted that, since the hardware resource configuration of every execution module is identical, the weights B_s can be determined by testing only one execution module.
Embodiment Ten selects the target execution module based on the historical resource usage of each candidate execution module; compared with Embodiment Nine its statistics are finer-grained, so all execution modules stay in a steady state and heat evenly, GPU resources are balanced as much as possible, and resource waste is avoided. It can be understood that Embodiments Nine and Ten may also be combined, with corresponding weights assigned to each, to select the target execution module comprehensively; details are not repeated here.
It should be noted that the related technical details in the foregoing and other embodiments remain applicable to the related steps of this embodiment and are not repeated here.
It should be noted that in the embodiments of the present invention some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes steps as a sequential process, unless otherwise specified the numbering of steps does not limit their execution order, and some of the steps may be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the steps may be rearranged; a process may be terminated when its operations are completed, but may also have additional steps not included in the figures. A process may correspond to a method, a function, a procedure, a subroutine, and the like.
The present invention is not limited to the above embodiments; any modifications, equivalents, and improvements made without departing from the spirit and principles of the invention fall within its scope of protection.

Claims (8)

1. A GPU resource scheduling method, characterized by comprising:
Step C1, acquiring a task group to be distributed corresponding to each current task channel, reading resource demand information of each task group to be distributed, and acquiring current residual resource information of each execution module in the current GPU;
Step C2, matching the resource demand information of each task group to be distributed with the current residual resource information of all execution modules respectively; if the matching of all current task groups to be distributed fails, incrementing a matching-failure round counter G = G + 1 and judging whether G exceeds a preset times threshold; if so, executing step C3, otherwise returning to execute step C1;
Step C3, reading independent identification information of each task group to be distributed, and if there is at least one task group to be distributed marked with an independent identification, splitting at least one such task group into a plurality of subtask groups.
2. The method of claim 1, wherein
in the step C3, if there is at least one execution module whose current residual resource information matches the resource demand information of a task group to be distributed, the task group to be distributed is added to a candidate task group set and step C4 is executed;
and step C4, selecting the task group to be distributed with the highest priority from the candidate task group set as a target task group, selecting one target execution module from the execution modules matched with the target task group, distributing the target task group to the target execution module, and returning to execute step C1.
3. The method of claim 1, wherein
the step C1 is preceded by a step C0 of setting an initial value of G.
4. The method of claim 3, wherein
the initial value of G is set to 0.
5. The method of claim 1, wherein
after the execution of the step C3 is finished, the method further comprises: taking each subtask group in turn as the task group to be distributed of the corresponding task channel, setting G to the initial value, and returning to execute step C1.
6. The method of claim 1, wherein
in the step C3, the splitting of the at least one task group to be distributed that is marked with the independent identification into a plurality of subtask groups comprises:
taking each wave in the at least one task group to be distributed that is marked with the independent identification as a subtask group.
7. The method of claim 1, wherein
the step C3 further includes: the priority of each subtask group is set to the lowest priority.
8. The method of claim 1, wherein
the preset times threshold is set to 256.