WO2016202153A1 - GPU resource allocation method and system - Google Patents

GPU resource allocation method and system

Info

Publication number
WO2016202153A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
warp
distributed
logic controller
running
Prior art date
Application number
PCT/CN2016/083314
Other languages
English (en)
French (fr)
Inventor
展旭升
王聪
包云岗
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2016202153A1
Priority to US15/844,333 (US10614542B2)

Links

Images

Classifications

    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G06F9/5038 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F9/505 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the load
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G09G5/001 Arbitration of resources in a display system, e.g. control of access to frame buffer by video controller and/or main processor
    • G09G2352/00 Parallel handling of streams of display data
    • G09G2360/06 Use of more than one graphics processor to process data before displaying to one or more screens
    • G09G2360/08 Power processing, i.e. workload management for processors involved in display operations, such as CPUs or GPUs

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a method and system for allocating GPU resources.
  • GPU: Graphics Processing Unit
  • Kernel programs requesting access to the GPU are generally serialized, accessing the GPU one by one in the order in which their requests are sent.
  • A kernel program waiting to access the GPU must wait until the kernel program currently accessing the GPU finishes running and releases the GPU.
  • SM: Streaming Multiprocessor
  • Embodiments of the present invention provide a method and system for allocating GPU resources, which can solve the problem that a high-priority kernel program cannot be timely responded.
  • A method for allocating graphics processor (GPU) resources is provided, where the method is applied to a GPU resource allocation system, the system includes a global logic controller and at least two streaming multiprocessors (SMs) capable of communicating with the global logic controller, and the method comprises:
  • determining, by the global logic controller, a kernel program to be distributed from a kernel status register table, where the kernel status register table includes the priority of each uncompleted kernel program and the number of undistributed thread blocks (blocks) in each uncompleted kernel program, and the kernel program to be distributed is the kernel program with the highest priority whose number of undistributed blocks in the kernel status register table is not zero;
  • searching, by the global logic controller, an SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of resources remaining in each SM; and
  • when the global logic controller does not find an SM capable of running at least one complete block, searching the SM status register table for a first SM, the first SM being an SM capable of running at least one warp.
  • the method further includes:
  • when the global logic controller finds an SM capable of running at least one complete block, determining a first quantity, the first quantity being the number of blocks that the SM capable of running a complete block can actually run;
  • when the number of undistributed blocks in the kernel program to be distributed is greater than the first quantity, distributing the first quantity of blocks in the kernel program to be distributed to the SM capable of running at least one complete block; and
  • when the number of undistributed blocks in the kernel program to be distributed is less than or equal to the first quantity, distributing all undistributed blocks in the kernel program to be distributed to the SM capable of running at least one complete block.
  • after the global logic controller distributes one block of the kernel program to be distributed to the first SM,
  • the method further includes:
  • the first SM logic controller determines a block with the highest priority from a block status register table, where the first SM logic controller is the SM logic controller in the first SM, and the block status register table includes the priority of each block distributed to the first SM;
  • the first SM logic controller searches for a currently idle hardware warp; and
  • when the first SM logic controller determines that the idle hardware warp is capable of running a warp and no higher-priority block has been received, the first SM logic controller distributes one warp of the highest-priority block to the idle hardware warp and updates the block status register table.
  • the SM status register table includes the number of remaining registers in each SM, the number of remaining hardware warps, and the remaining shared storage space; and
  • the first SM is an SM in which the number of remaining registers is greater than the number of registers required to run one warp, the number of remaining hardware warps is greater than the number of hardware warps required to run one warp, and the remaining shared storage space is larger than the shared storage space required to run one warp.
  • the method further includes:
  • when the first SM logic controller determines that a warp has finished running, notifying the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared storage space of the first SM in the SM status register table.
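The two-level search described in the method above (prefer an SM that can run a whole block, otherwise fall back to a "first SM" that can run at least one warp) can be sketched as follows. This is an illustrative sketch, not the patented implementation; the table layouts, field names, and helper functions are all assumptions.

```python
# Illustrative sketch of the global logic controller's two-level search.
# Convention used here: a lower "priority" number means higher priority.

def pick_kernel(kernel_table):
    """Highest-priority kernel program whose undistributed-block count is not zero."""
    candidates = [k for k in kernel_table if k["undistributed_blocks"] > 0]
    return min(candidates, key=lambda k: k["priority"]) if candidates else None

def find_sm(sm_table, regs, hw_warps, shmem):
    """First SM whose remaining resources exceed the given requirements."""
    for sm in sm_table:
        if (sm["regs"] > regs and sm["hw_warps"] > hw_warps
                and sm["shmem"] > shmem):
            return sm
    return None

def dispatch(kernel_table, sm_table, block_need, warp_need):
    kernel = pick_kernel(kernel_table)
    if kernel is None:
        return None
    # Step 1: look for an SM able to run at least one complete block.
    sm = find_sm(sm_table, *block_need)
    if sm is not None:
        return ("block", kernel["name"], sm["name"])
    # Step 2: fall back to a first SM able to run at least one warp.
    sm = find_sm(sm_table, *warp_need)
    if sm is not None:
        return ("warp", kernel["name"], sm["name"])
    return None  # retry later, after running warps release resources
```

Because the warp-level requirement is strictly smaller than the block-level one, step 2 succeeds in situations where step 1 cannot, which is exactly how the method avoids leaving a high-priority kernel waiting.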
  • An embodiment of the present invention provides a system for allocating graphics processor (GPU) resources, where the system includes a global logic controller and at least two streaming multiprocessors (SMs) capable of communicating with the global logic controller;
  • the global logic controller includes: a first determining unit, a first searching unit, and a first distributing unit;
  • the first determining unit is configured to determine a kernel program to be distributed from the kernel status register table, where the kernel status register table includes the priority of each uncompleted kernel program and the number of undistributed blocks in each uncompleted kernel program;
  • the kernel program to be distributed is the kernel program with the highest priority whose number of undistributed blocks in the kernel status register table is not zero;
  • the first searching unit is configured to search, from the SM status register table, an SM capable of running at least one complete block, where the SM status register table is configured to store a remaining amount of resources in each SM;
  • when no SM capable of running at least one complete block is found, the first SM is searched for from the SM status register table, where the first SM is an SM capable of running at least one warp;
  • the first distribution unit is configured to: when the first SM is found, distribute a block of the kernel program to be distributed to the first SM;
  • the first SM is configured to run the block of the kernel program to be distributed that is distributed by the first distribution unit.
  • the first determining unit is further configured to: when the first searching unit finds an SM capable of running at least one complete block, determine the first quantity, where The first quantity is the number of blocks that the SM capable of running a complete block can actually run;
  • the first distribution unit is further configured to: when the number of undistributed blocks in the kernel program to be distributed is greater than the first quantity, distribute the first quantity of blocks in the kernel program to be distributed to the SM capable of running at least one complete block; and when the number of undistributed blocks in the kernel program to be distributed is less than or equal to the first quantity, distribute all undistributed blocks in the kernel program to be distributed to the SM capable of running at least one complete block;
  • the SM capable of running at least one complete block is configured to run a block in the to-be-distributed kernel program distributed by the first distribution unit.
  • the first SM includes:
  • a second determining unit, configured to determine a block with the highest priority from the block status register table, where the first SM logic controller is the SM logic controller in the first SM, and the block status register table includes the priority of each block distributed to the first SM;
  • a second searching unit, configured to find a currently idle hardware warp; and
  • a second distribution unit, configured to: when it is determined that the idle hardware warp is capable of running a warp and no higher-priority block has been received, distribute one warp of the highest-priority block to the idle hardware warp and update the block status register table.
  • the SM status register table includes the number of remaining registers in each SM, the number of remaining hardware warps, and the remaining shared storage space; and
  • the first SM is an SM in which the number of remaining registers is greater than the number of registers required to run one warp, the number of remaining hardware warps is greater than the number of hardware warps required to run one warp, and the remaining shared storage space is larger than the shared storage space required to run one warp.
  • the first SM further includes: a notification unit;
  • the notification unit is configured to: when it is determined that a warp has finished running, notify the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared storage space of the first SM in the SM status register table.
  • According to the embodiments of the present invention, the global logic controller determines the kernel program to be distributed from the kernel status register table and searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for a first SM capable of running at least one warp and distributes a block of the kernel program to be distributed to the first SM.
  • In the prior art, a block of a high-priority kernel program can be distributed to an SM only when an SM has enough free resources to run a complete block, so the high-priority kernel program is not responded to in time.
  • In the embodiments of the present invention, by contrast, a first SM capable of running at least one warp is searched for. Because a warp is smaller than a block, a warp finishes running faster than a block, so it is easier to find an SM that can run at least one warp. Once the first SM is found, a block of the kernel program to be distributed can be distributed to it without waiting for a whole block of a low-priority kernel program to finish running, which improves the response speed of the high-priority kernel program.
  • FIG. 1 is a schematic diagram of a logical structure of a GPU resource allocation system according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for allocating GPU resources according to an embodiment of the present invention
  • FIG. 3 is a flowchart of another method for allocating GPU resources according to an embodiment of the present disclosure
  • FIG. 4 is a flowchart of another method for allocating GPU resources according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a logical structure of another GPU resource allocation system according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a logical structure of another GPU resource allocation system according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a logical structure of a device for allocating GPU resources according to an embodiment of the present disclosure
  • FIG. 8 is a schematic diagram of a logical structure of another apparatus for allocating GPU resources according to an embodiment of the present invention.
  • the present invention is applied to a distribution system for GPU resources.
  • the system includes a global scheduler 101 and at least two SMs 102 capable of communicating with the global scheduler 101.
  • the global scheduler 101 includes a global logic controller 1011, a kernel status register table 1012, and an SM status register table 1013.
  • the SM 102 includes an SM logic controller 1021 and a block status register table 1022.
  • the global scheduler 101 is configured to distribute the kernel program to the SM 102 for operation.
  • the global logic controller 1011 is configured to distribute the kernel program to the SM 102 according to the kernel status register table 1012 and the SM status register table 1013 with a block granularity or a warp granularity.
  • A kernel program is a program that can run on a GPU; a kernel program includes at least two thread blocks (blocks), and one block includes at least two warps, where a warp is a group of 32 GPU threads.
  • The kernel status register table 1012 is used to store information about each uncompleted kernel program.
  • the kernel program information includes the priority of the kernel program, the number of registers required to run the kernel program, the shared storage space required to run the kernel program, and the number of blocks that have not been distributed in the kernel program.
  • the SM status register table 1013 is used to store the current remaining amount of resources of each SM 102.
  • the current remaining amount of resources of each SM 102 includes the remaining number of registers, the number of remaining hardware warps, and the remaining shared storage space.
  • the SM 102 is configured to run a kernel program distributed by the global scheduler 101.
  • the SM logic controller 1021 is configured to distribute the warp in the block to the hardware warp according to the block status register table.
  • The block status register table 1022 is used to store the running status of each block.
  • The running status of a block includes the priority of the block, the number of the kernel program to which the block belongs, the number of the block within that kernel program, the number of registers and the shared storage space required by the portion of the block that has not yet run, and the number of undistributed warps in the block.
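The three register tables described above might be modeled as follows. The field names are drawn from the description; the concrete layout, types, and units are assumptions for illustration, not the patent's actual hardware format.

```python
# Illustrative data layout for the three register tables described above.
from dataclasses import dataclass

@dataclass
class KernelStatus:            # one row of the kernel status register table 1012
    priority: int              # priority of the kernel program
    regs_needed: int           # registers required to run the kernel program
    shmem_needed: int          # shared storage space required to run it
    undistributed_blocks: int  # blocks in the kernel not yet distributed

@dataclass
class SMStatus:                # one row of the SM status register table 1013
    regs_left: int             # remaining registers
    hw_warps_left: int         # remaining hardware warps
    shmem_left: int            # remaining shared storage space

@dataclass
class BlockStatus:             # one row of the block status register table 1022
    priority: int              # priority (inherited from the owning kernel)
    kernel_id: int             # number of the kernel the block belongs to
    block_id: int              # number of the block within the kernel
    regs_needed: int           # registers needed by the not-yet-run portion
    shmem_needed: int          # shared storage needed by that portion
    undistributed_warps: int   # warps in the block not yet distributed
```

The split matches the two scheduling levels: the global logic controller reads tables 1012 and 1013, while each SM logic controller reads its own table 1022.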
  • the embodiment of the present invention provides a method for allocating a GPU resource.
  • the method is applied to the GPU resource allocation system shown in FIG. 1. As shown in FIG. 2, the method includes:
  • The global logic controller determines a kernel program to be distributed from the kernel status register table.
  • the kernel status register table includes the priority of each kernel program that is not completed and the number of undistributed blocks in each kernel program that is not completed.
  • The kernel program to be distributed is the kernel program with the highest priority in the kernel status register table whose number of undistributed blocks is not zero.
  • the global logic controller looks up the SM status register table for an SM capable of running at least one complete block, and the SM status register table is used to store the remaining amount of resources in each SM.
  • the SM status register table specifically includes the number of remaining registers of each SM, the number of remaining hardware warps, and the remaining shared storage space.
  • An SM capable of running at least one complete block is an SM in which the number of remaining registers is greater than the number of registers required to run one block, the number of remaining hardware warps is greater than the number of hardware warps required to run one block, and the remaining shared storage space is larger than the shared storage space required to run one block.
  • When the global logic controller does not find an SM capable of running at least one complete block, it looks up the first SM from the SM status register table, where the first SM is an SM capable of running at least one warp.
  • The first SM is an SM in which the number of remaining registers is greater than the number of registers required to run one warp, the number of remaining hardware warps is greater than the number of hardware warps required to run one warp, and the remaining shared storage space is larger than the shared storage space required to run one warp.
  • For example, suppose the global logic controller cannot find an SM capable of running at least one complete block, and running one warp requires only 6 KB of registers. An SM with 12 KB of registers remaining can then run two warps; that is, the global logic controller can find the first SM.
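The arithmetic in this example can be written out as a small helper. The 6 KB-per-warp and 12 KB-remaining figures come from the example itself; the shared-memory and hardware-warp limits are illustrative additions, since each of the three resources in the SM status register table can independently cap the warp count.

```python
# How many warps still fit in an SM's remaining resources?
# Mirrors the example: 6 KB of registers per warp, 12 KB remaining -> 2 warps.

def warps_that_fit(regs_left_kb, hw_warps_left, shmem_left_kb,
                   regs_per_warp_kb, shmem_per_warp_kb):
    """Number of warps the SM can still accept, limited by each resource."""
    by_regs = regs_left_kb // regs_per_warp_kb
    by_shmem = shmem_left_kb // shmem_per_warp_kb
    return min(by_regs, hw_warps_left, by_shmem)
```

An SM qualifies as the "first SM" exactly when this count is at least one.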
  • The first SM then runs the warps in the block one by one.
  • When the global logic controller does not find the first SM either, it returns to step 201 and executes again. After a warp of a running low-priority kernel program completes and releases its resources, the global logic controller can find the first SM.
  • In the GPU resource allocation method provided by this embodiment of the present invention, the global logic controller determines the kernel program to be distributed from the kernel status register table and searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for the first SM capable of running at least one warp and distributes a block of the kernel program to be distributed to the first SM.
  • In the prior art, a block of a high-priority kernel program can be distributed to an SM only when an SM has enough free resources to run a complete block, so the high-priority kernel program is not responded to in time.
  • In this embodiment of the present invention, by contrast, a first SM capable of running at least one warp is searched for. Because a warp is smaller than a block, a warp finishes running faster than a block, so it is easier to find an SM that can run at least one warp. Once the first SM is found, a block of the kernel program to be distributed can be distributed to it without waiting for a whole block of a low-priority kernel program to finish running, which improves the response speed of the high-priority kernel program.
  • After the global logic controller searches the SM status register table for an SM capable of running at least one complete block in step 202, if such an SM is found, the following steps 205 to 207 are performed.
  • When the global logic controller finds an SM capable of running at least one complete block, it determines the first quantity, where the first quantity is the number of blocks that the SM capable of running a complete block can actually run.
  • The first quantity is determined by the global logic controller from the entry of the SM capable of running at least one complete block in the SM status register table.
  • the global logic controller can calculate the number of blocks that the SM can actually run based on the remaining resources of the SM stored in the SM status register table and the amount of resources required to run a block.
  • When the number of blocks in the kernel program to be distributed is greater than the first quantity, the remaining resources of the found SM are not enough to run all the blocks in the kernel program to be distributed, so the first quantity of blocks is distributed to the SM first; after those blocks finish running and the resources in the SM are released, the remaining blocks of the kernel program to be distributed are distributed to the SM.
  • In the GPU resource allocation method provided by this embodiment of the present invention, the global logic controller determines the first quantity when it finds an SM capable of running at least one block. When the number of undistributed blocks in the kernel program to be distributed is greater than the first quantity, the first quantity of blocks in the kernel program to be distributed is distributed to the SM capable of running at least one complete block; when the number of undistributed blocks is less than or equal to the first quantity, all undistributed blocks in the kernel program to be distributed are distributed to that SM. Whenever an SM capable of running at least one block is found, as many blocks of the highest-priority kernel program as possible are distributed to it, so the highest-priority kernel program obtains a timely response and its response speed is improved.
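The distribution rule in steps 205 to 207 amounts to sending the minimum of the two counts and keeping the remainder for after resources are released. A minimal sketch (the function name and return convention are illustrative):

```python
# Steps 205-207 sketch: send as many blocks as the SM can actually run
# (the "first quantity"), and keep the remainder for after the SM's
# resources are released.

def distribute_blocks(undistributed, first_quantity):
    """Return (blocks sent now, blocks left for a later round)."""
    sent = min(undistributed, first_quantity)
    return sent, undistributed - sent
```

The leftover count is what later rounds of step 201 see as the kernel's undistributed-block count in the kernel status register table.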
  • In step 204, after the global logic controller finds the first SM and distributes a block of the kernel program to the first SM, the first SM logic controller distributes warps by the method shown in FIG. 4, which includes:
  • the first SM logic controller determines a block with the highest priority from the block status register table.
  • The first SM logic controller is the SM logic controller in the first SM, and the block status register table includes the priority of each block distributed to the first SM.
  • the global logic controller is connected to at least two SMs.
  • After the global logic controller distributes a block of the highest-priority kernel program to the first SM, the first SM logic controller needs to distribute the warps in that block to hardware warps to run.
  • The first SM logic controller therefore determines the block with the highest priority from the block status register table and first runs the warps of that highest-priority block.
  • The priority of a block stored in the block status register table is the priority of the kernel program to which the block belongs; blocks of the same kernel program have the same priority.
  • the first SM logic controller searches for a current idle hardware warp.
  • When the first SM logic controller finds an idle hardware warp, the following step 403 is performed; when it does not find an idle hardware warp, the search is repeated until an idle hardware warp is found, and then step 403 is performed.
  • When no hardware warp is idle because a low-priority kernel program is running in the first SM, the first SM logic controller waits for a warp of the low-priority kernel program to finish; the hardware warp then returns to the idle state, the first SM logic controller can find the idle hardware warp, and a warp of the high-priority kernel program can occupy it.
  • When the first SM logic controller determines that the idle hardware warp can run a warp and no higher-priority block has been received, it distributes one warp of the highest-priority block to the idle hardware warp and updates the block status register table.
  • The method for determining whether the idle hardware warp can run a warp is to determine whether the number of remaining registers in the first SM is sufficient to run one warp. If it is sufficient and the first SM has not received a higher-priority block, one warp of the block with the highest priority at that moment is distributed to the found idle hardware warp; if it is not sufficient, the first SM logic controller continues to wait until a running warp finishes and the number of registers becomes sufficient to run one warp, and then distributes one warp to the idle hardware warp.
  • In the GPU resource allocation method provided by this embodiment of the present invention, the first SM logic controller first searches for an idle hardware warp. When an idle hardware warp is found and the first SM can run a warp, one warp of the highest-priority block is distributed to the hardware warp to run, without waiting until the first SM has enough resources to run an entire block before distributing the whole block to hardware warps. This reduces waiting time and improves the response speed of the high-priority kernel program.
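The per-SM loop of steps 401 to 403 can be sketched as follows. This is an illustrative sketch with hypothetical names; it models the block status register table as a list of rows and treats "lower number = higher priority", as in the earlier sketches.

```python
# Sketch of the first SM logic controller (steps 401-403): pick the
# highest-priority block, check for an idle hardware warp and enough
# registers, and issue one warp at a time.

def issue_one_warp(block_table, idle_hw_warps, regs_left, regs_per_warp):
    """Issue one warp of the highest-priority block.

    Returns the (kernel_id, block_id) whose warp was issued, or None when
    the controller must keep waiting for a running warp to finish."""
    runnable = [b for b in block_table if b["undistributed_warps"] > 0]
    if not runnable or idle_hw_warps == 0 or regs_left < regs_per_warp:
        return None  # step 402 repeats until resources free up
    best = min(runnable, key=lambda b: b["priority"])  # 0 = highest priority
    best["undistributed_warps"] -= 1  # update the block status register table
    return (best["kernel_id"], best["block_id"])
```

Each successful call corresponds to one pass through step 403: one warp occupies one idle hardware warp, and the block status register table is updated in place.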
  • When the first SM logic controller determines that a warp has finished running, it notifies the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared storage space of the first SM in the SM status register table.
  • In the GPU resource allocation method provided by this embodiment of the present invention, when the first SM logic controller determines that a running warp has completed, it notifies the global logic controller to update the SM status register table, so that the global logic controller can distribute blocks of high-priority kernel programs to SMs in time according to the latest remaining resources of each SM and the number of blocks not yet distributed. This increases the resource utilization of the SMs while speeding up the response of high-priority kernel programs.
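The update that this notification triggers might look like the following. The per-warp resource figures and field names are assumptions; the point is only that the freed resources are added back to the SM's row so the global logic controller sees current remaining amounts.

```python
# When a warp finishes, the first SM logic controller notifies the global
# logic controller, which adds the freed resources back to that SM's row
# in the SM status register table.

def on_warp_complete(sm_row, regs_per_warp, shmem_per_warp):
    """Return the SM's row after crediting one finished warp's resources."""
    sm_row["regs"] += regs_per_warp
    sm_row["hw_warps"] += 1          # the hardware warp becomes idle again
    sm_row["shmem"] += shmem_per_warp
    return sm_row
```

After this update, the next pass of the global search may succeed where the previous one failed, which is how waiting high-priority blocks eventually get distributed.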
  • the embodiment of the present invention further provides a GPU resource allocation system.
  • The system includes a global logic controller 501 and at least two streaming multiprocessors (SMs) capable of communicating with the global logic controller 501; the global logic controller includes a first determining unit 5011, a first searching unit 5012, and a first distribution unit 5013.
  • An SM in the system may be an SM 502 capable of running at least one complete block, or a first SM 503, where the first SM 503 is an SM capable of running at least one warp.
  • The first determining unit 5011 is configured to determine a kernel program to be distributed; the kernel status register table includes the priority of each uncompleted kernel program and the number of undistributed blocks in each uncompleted kernel program, and the kernel program to be distributed is the kernel program with the highest priority in the kernel status register table whose number of undistributed blocks is not zero.
  • The first searching unit 5012 is configured to search the SM status register table for an SM capable of running at least one complete block, where the SM status register table is used to store the amount of resources remaining in each SM; when no SM capable of running at least one complete block is found, the first SM 503 is searched for from the SM status register table.
  • the first distribution unit 5013 is configured to distribute the block in the kernel program to be distributed to the first SM 503 when the first SM 503 is found.
  • the first SM 503 is configured to run a block in the kernel program to be distributed distributed by the first distribution unit 5013.
  • the first determining unit 5011 is further configured to: when the first searching unit 5012 finds an SM capable of running at least one complete block, determine the first quantity, where the first quantity is capable of running a complete block. The number of blocks that the SM can actually run;
  • The first distribution unit 5013 is further configured to: when the number of undistributed blocks in the kernel program to be distributed is greater than the first quantity, distribute the first quantity of blocks in the kernel program to be distributed to the SM capable of running at least one complete block; and when the number of undistributed blocks in the kernel program to be distributed is less than or equal to the first quantity, distribute all undistributed blocks in the kernel program to be distributed to the SM 502 capable of running at least one complete block.
  • An SM 502 capable of running at least one complete block for running a block in the kernel program to be distributed distributed by the first distribution unit 5013.
  • the first SM 503 includes a second determining unit 5031, a second searching unit 5032, a second distributing unit 5033, and a notifying unit 5034.
  • the second determining unit 5031, the second searching unit 5032, and the second distributing unit 5033 are specifically located in the first SM logic controller in the first SM 503.
  • The second determining unit 5031 is located in the logic controller of the first SM 503 and is configured to determine the block with the highest priority from the block status register table, where the block status register table includes the priority of each block distributed to the first SM 503.
  • the second search unit 5032 is located in the first SM503 logic controller for finding the current idle hardware warp.
  • The second distribution unit 5033 is located in the logic controller of the first SM 503 and is configured to: when it is determined that the idle hardware warp can run a warp and no higher-priority block has been received, distribute one warp of the highest-priority block to the idle hardware warp and update the block status register table.
  • the SM502 that can run at least one complete block has the same structure as the first SM 503, and is not described in the embodiment of the present invention.
  • the SM status register table includes the number of remaining registers per SM, the number of remaining hardware warps, and the remaining shared memory space.
  • The first SM 503 is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
  • The second determining unit 5031 is further configured to determine whether any warp has finished running.
  • The notifying unit 5034 is located in the first SM logic controller and is configured to notify the global logic controller 501 to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM 503 in the SM status register table when the second determining unit 5031 determines that a warp has finished running.
  • In the GPU resource allocation system provided by this embodiment of the present invention, the first determining unit in the global logic controller determines the to-be-distributed kernel program from the kernel status register table, and the first searching unit searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for a first SM capable of running at least one warp, and the first distributing unit distributes a block of the to-be-distributed kernel program to the first SM. In the prior art, a block of a high-priority kernel can be distributed to an SM only after an idle SM appears in the GPU, so the high-priority kernel program is not responded to in time. In this embodiment, by contrast, a first SM capable of running at least one warp is searched for instead. Because a warp is smaller than a block, a warp finishes running sooner than a block, so it is easier to find an SM capable of running at least one warp; once one is found, a block of the to-be-distributed kernel program can be distributed to the first SM without waiting for a low-priority kernel program to finish running a block, which improves the response speed of high-priority kernel programs.
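The two-level lookup described above (first an SM that can hold a complete block, then a fallback to an SM that can hold a single warp) can be sketched in Python. `SMState`, its field names, and the "needed" tuples are illustrative assumptions for this sketch, not structures defined by the patent:

```python
from dataclasses import dataclass

@dataclass
class SMState:
    """One row of the SM status register table (illustrative fields)."""
    sm_id: int
    free_regs: int       # remaining registers
    free_hw_warps: int   # remaining hardware warps
    free_shared: int     # remaining shared memory

def can_run(sm, regs, hw_warps, shared):
    """True if every remaining resource exceeds the requirement."""
    return (sm.free_regs > regs and
            sm.free_hw_warps > hw_warps and
            sm.free_shared > shared)

def find_target_sm(table, block_need, warp_need):
    """First look for an SM that can run a complete block; if none is
    found, fall back to a 'first SM' that can run at least one warp."""
    for sm in table:
        if can_run(sm, *block_need):
            return sm, "block"
    for sm in table:
        if can_run(sm, *warp_need):
            return sm, "warp"
    return None, None
```

With a table in which no SM can hold a full block but one can hold a warp, `find_target_sm` returns that SM at warp granularity instead of waiting for an idle SM.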
  • An embodiment of the present invention further provides a GPU resource allocation apparatus.
  • As shown in FIG. 7, the apparatus includes a global logic controller and at least two SMs capable of communicating with the global logic controller.
  • An SM may be an SM capable of running at least one complete block or a first SM.
  • the global logic controller can include a memory 71, a transceiver 72, a processor 73, and a bus 74, wherein the memory 71, the transceiver 72, and the processor 73 are communicatively coupled by a bus 74.
  • the memory 71 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 71 can store an operating system and other applications.
  • When the technical solutions provided by the embodiments of the present invention are implemented by software or firmware, the program code for implementing them is stored in the memory 71 and executed by the processor 73.
  • the transceiver 72 is used for communication between the device and other devices or communication networks such as, but not limited to, Ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), and the like.
  • the processor 73 can be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs.
  • Bus 74 may include a path for communicating information between various components of the device, such as memory 71, transceiver 72, and processor 73.
  • It should be noted that although FIG. 7 shows only the memory 71, the transceiver 72, the processor 73, and the bus 74, in a specific implementation, as a person skilled in the art will appreciate, the terminal also includes other components necessary for normal operation and, depending on specific needs, may also include hardware devices implementing other functions.
  • The processor 73 in the apparatus is coupled to the memory 71 and the transceiver 72 and is configured to control the execution of program instructions; it is specifically configured to determine the to-be-distributed kernel program.
  • The kernel status register table includes the priority of each kernel program whose execution is not complete and the number of undistributed blocks in each such kernel program.
  • The to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero. The processor 73 searches the SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of remaining resources in each SM; when no SM capable of running at least one complete thread block (block) is found, it searches the SM status register table for a first SM, where the first SM is an SM capable of running at least one warp.
  • the transceiver 72 is configured to distribute the block in the kernel program to be distributed to the first SM when the first SM is found.
  • the memory 71 is configured to store a kernel status register table and an SM status register table.
  • The processor 73 is further configured to determine a first quantity when an SM capable of running at least one complete block is found, where the first quantity is the number of blocks that the SM capable of running a complete block can actually run.
  • The transceiver 72 is further configured to: when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, distribute the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and when the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, distribute all the undistributed blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block.
  • The first SM includes a memory 81, a transceiver 82, a processor 83, and a bus 84, where the memory 81, the transceiver 82, and the processor 83 are communicatively connected through the bus 84.
  • the memory 81 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 81 can store an operating system and other applications.
  • When the technical solutions provided by the embodiments of the present invention are implemented by software or firmware, the program code for implementing them is stored in the memory 81 and executed by the processor 83.
  • the transceiver 82 is used for communication between the device and other devices or communication networks such as, but not limited to, Ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), and the like.
  • the processor 83 can be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs.
  • Bus 84 may include a path for communicating information between various components of the device, such as memory 81, transceiver 82, and processor 83.
  • It should be noted that although FIG. 8 shows only the memory 81, the transceiver 82, the processor 83, and the bus 84, in a specific implementation, as a person skilled in the art will appreciate, the terminal also includes other components necessary for normal operation and, depending on specific needs, may also include hardware devices implementing other functions.
  • The processor 83 in the apparatus is coupled to the memory 81 and the transceiver 82 and is configured to control the execution of program instructions; it is specifically configured to determine the highest-priority block from the block status register table, where the first SM logic controller is the SM logic controller in the first SM and the block status register table includes the priority of each block distributed to the first SM, and to search for a currently idle hardware warp.
  • The transceiver 82 is configured to: when it is determined that the idle hardware warp can run a warp and no higher-priority block has been received, distribute a warp of the highest-priority block to the idle hardware warp and update the block status register table.
  • the SM status register table includes the number of remaining registers per SM, the number of remaining hardware warps, and the remaining shared memory space.
  • The first SM is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
  • The transceiver 82 is further configured to notify the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM in the SM status register table when it is determined that a warp has finished running.
  • In the GPU resource allocation apparatus provided by this embodiment of the present invention, the processor determines the to-be-distributed kernel program from the kernel status register table and searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for a first SM capable of running at least one warp, and the transceiver distributes a block of the to-be-distributed kernel program to the first SM. In the prior art, a block of a high-priority kernel can be distributed to an SM only after an idle SM appears in the GPU, so the high-priority kernel program is not responded to in time. In this embodiment, by contrast, a first SM capable of running at least one warp is searched for instead. Because a warp is smaller than a block, a warp finishes running sooner than a block, so it is easier to find an SM capable of running at least one warp; once one is found, a block of the to-be-distributed kernel program can be distributed to the first SM without waiting for a low-priority kernel program to finish running a block, which improves the response speed of high-priority kernel programs.
  • In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners.
  • The apparatus embodiments described above are merely illustrative.
  • The division into modules or units is merely a logical function division; there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • The technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium.
  • The software product includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Hardware Design (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Logic Circuits (AREA)

Abstract

A GPU resource allocation method and system, relating to the field of computer technology, which can solve the problem that high-priority kernel programs are not responded to in time. A global logic controller determines a to-be-distributed kernel program from a kernel status register table (201); the global logic controller searches an SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of remaining resources in each SM (202); when the global logic controller does not find an SM capable of running at least one complete block, it searches the SM status register table for a first SM, where the first SM is an SM capable of running at least one warp (203); when the global logic controller finds the first SM, it distributes a block of the to-be-distributed kernel program to the first SM (204). Suitable for use in GPU resource allocation.

Description

GPU Resource Allocation Method and System
Technical Field
The present invention relates to the field of computer technology, and in particular, to a GPU resource allocation method and system.
Background
With the development of general-purpose GPU (Graphics Processing Unit) technology, GPUs can process not only graphics workloads but also certain types of general-purpose programs. Currently, when multiple different kernel programs need to access a GPU, the kernel programs requesting access are generally serialized and access the GPU one by one in the order in which their requests were sent. If a long-latency kernel program is occupying the GPU and a higher-priority kernel program needs to access the GPU, the higher-priority kernel program can access the GPU only after the kernel programs currently accessing the GPU and those waiting to access it finish running and release the SM (Stream Multiprocessor) resources in the GPU. As a result, the higher-priority kernel program is not responded to in time, which degrades service quality.
To prevent a long-latency kernel program from monopolizing the SM resources in the GPU for a long time, when a high-priority kernel program needs to access the GPU, an idle SM can be searched for; when an idle SM is found, the high-priority kernel program is distributed to that idle SM for execution.
However, if there is no idle SM in the GPU, the high-priority kernel program cannot start running until an idle SM appears in the GPU, so the high-priority kernel program is still not responded to in time.
Summary
Embodiments of the present invention provide a GPU resource allocation method and system, which can solve the problem that high-priority kernel programs are not responded to in time.
To achieve the foregoing objective, the embodiments of the present invention adopt the following technical solutions:
According to a first aspect, an embodiment of the present invention provides a graphics processing unit (GPU) resource allocation method. The method is applied to a GPU resource allocation system that includes a global logic controller and at least two stream multiprocessors (SMs) capable of communicating with the global logic controller. The method includes:
determining, by the global logic controller, a to-be-distributed kernel program from a kernel status register table, where the kernel status register table includes the priority of each kernel program whose execution is not complete and the number of undistributed thread blocks (blocks) in each such kernel program, and the to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero;
searching, by the global logic controller, an SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of remaining resources in each SM;
when the global logic controller does not find an SM capable of running at least one complete block, searching the SM status register table for a first SM, where the first SM is an SM capable of running at least one warp; and
when the global logic controller finds the first SM, distributing a block of the to-be-distributed kernel program to the first SM.
In a first possible implementation, with reference to the first aspect, after the global logic controller searches the SM status register table for an SM capable of running at least one complete block, the method further includes:
when the global logic controller finds an SM capable of running at least one complete block, determining a first quantity, where the first quantity is the number of blocks that the SM capable of running a complete block can actually run;
when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, distributing the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and
when the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, distributing all the undistributed blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block.
In a second possible implementation, with reference to the first possible implementation of the first aspect, after the global logic controller distributes a block of the to-be-distributed kernel program to the first SM, the method further includes:
determining, by a first SM logic controller, the highest-priority block from a block status register table, where the first SM logic controller is the SM logic controller in the first SM and the block status register table includes the priority of each block distributed to the first SM;
searching, by the first SM logic controller, for a currently idle hardware warp; and
when the first SM logic controller determines that the idle hardware warp can run a warp and no higher-priority block has been received, distributing a warp of the highest-priority block to the idle hardware warp and updating the block status register table.
In a third possible implementation, with reference to the first aspect or any one of the foregoing possible implementations of the first aspect, the SM status register table includes the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of each SM, and the first SM is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
In a fourth possible implementation, with reference to the third possible implementation of the first aspect, after the first SM logic controller determines that the idle hardware warp can run a warp and no higher-priority block has been received and distributes a warp of the highest-priority block to the hardware warp, the method further includes:
when the first SM logic controller determines that a warp has finished running, notifying the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM in the SM status register table.
According to a second aspect, an embodiment of the present invention provides a graphics processing unit (GPU) resource allocation system, where the system includes a global logic controller and at least two stream multiprocessors (SMs) capable of communicating with the global logic controller; the global logic controller includes a first determining unit, a first searching unit, and a first distributing unit.
The first determining unit is configured to determine a to-be-distributed kernel program, where the kernel status register table includes the priority of each kernel program whose execution is not complete and the number of undistributed blocks in each such kernel program, and the to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero.
The first searching unit is configured to search an SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of remaining resources in each SM, and, when no SM capable of running at least one complete thread block (block) is found, search the SM status register table for a first SM, where the first SM is an SM capable of running at least one warp.
The first distributing unit is configured to distribute a block of the to-be-distributed kernel program to the first SM when the first SM is found.
The first SM is configured to run the block of the to-be-distributed kernel program distributed by the first distributing unit.
In a first possible implementation, with reference to the second aspect, the first determining unit is further configured to determine a first quantity when the first searching unit finds an SM capable of running at least one complete block, where the first quantity is the number of blocks that the SM capable of running a complete block can actually run.
The first distributing unit is further configured to: when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, distribute the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and when the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, distribute all the undistributed blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block.
The SM capable of running at least one complete block is configured to run the blocks of the to-be-distributed kernel program distributed by the first distributing unit.
In a second possible implementation, with reference to the first possible implementation of the second aspect, the first SM includes:
a second determining unit, configured to determine the highest-priority block from a block status register table, where the first SM logic controller is the SM logic controller in the first SM and the block status register table includes the priority of each block distributed to the first SM;
a second searching unit, configured to search for a currently idle hardware warp; and
a second distributing unit, configured to: when it is determined that the idle hardware warp can run a warp and no higher-priority block has been received, distribute a warp of the highest-priority block to the idle hardware warp and update the block status register table.
In a third possible implementation, with reference to the second aspect or any one of the foregoing possible implementations of the second aspect, the SM status register table includes the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of each SM, and the first SM is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
In a fourth possible implementation, with reference to the third possible implementation, the first SM further includes a notifying unit.
The notifying unit is configured to notify the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM in the SM status register table when it is determined that a warp has finished running.
In the GPU resource allocation method and system provided by the embodiments of the present invention, the global logic controller determines the to-be-distributed kernel program from the kernel status register table and searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for a first SM capable of running at least one warp and distributes a block of the to-be-distributed kernel program to the first SM. In the prior art, a block of a high-priority kernel can be distributed to an SM only after an idle SM appears in the GPU, so the high-priority kernel program is not responded to in time. In the embodiments of the present invention, by contrast, when no SM capable of running at least one block is found, the system does not wait for other kernel programs to release resources but instead searches for a first SM capable of running at least one warp. Because a warp is smaller than a block, a warp finishes running sooner than a block, so it is easier to find an SM capable of running at least one warp; once one is found, a block of the to-be-distributed kernel program can be distributed to the first SM without waiting for a low-priority kernel program to finish running a block, which improves the response speed of high-priority kernel programs.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the logical structure of a GPU resource allocation system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a GPU resource allocation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another GPU resource allocation method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another GPU resource allocation method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the logical structure of another GPU resource allocation system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the logical structure of another GPU resource allocation system according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the logical structure of a GPU resource allocation apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the logical structure of another GPU resource allocation apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The embodiments of the present invention are applied to a GPU resource allocation system. As shown in FIG. 1, the system includes a global scheduler 101 and at least two SMs 102 capable of communicating with the global scheduler 101.
The global scheduler 101 includes a global logic controller 1011, a kernel status register table 1012, and an SM status register table 1013.
The SM 102 includes an SM logic controller 1021 and a block status register table 1022.
The global scheduler 101 is configured to distribute kernel programs to the SMs 102 for execution.
The global logic controller 1011 is configured to distribute kernel programs to the SMs 102 at block granularity or warp granularity according to the kernel status register table 1012 and the SM status register table 1013.
It should be noted that, in the embodiments of the present invention, a kernel program is a program that can run on a GPU; a kernel program includes at least two blocks (thread blocks), a block includes at least two warps, and a warp is a group of 32 GPU threads.
The kernel status register table 1012 is configured to store information about each kernel program whose execution is not complete.
The kernel program information includes the priority of the kernel program, the number of registers required to run the kernel program, the shared memory space required to run the kernel program, and the number of blocks in the kernel program that have not yet been distributed.
The SM status register table 1013 is configured to store the current amount of remaining resources of each SM 102.
The current amount of remaining resources of each SM 102 includes the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space.
The SM 102 is configured to run the kernel programs distributed by the global scheduler 101.
The SM logic controller 1021 is configured to distribute the warps in a block to hardware warps for execution according to the block status register table.
The block status register table 1022 is configured to store the running state of each block.
The running state of a block includes the priority of the block, the identifier of the kernel to which the block belongs, the identifier of the block within the kernel, the number of registers and the shared memory space required by the part of the block that has not yet run, and the number of warps in the block that have not yet been distributed.
To speed up the response of high-priority kernel programs, an embodiment of the present invention provides a GPU resource allocation method applied to the GPU resource allocation system shown in FIG. 1. As shown in FIG. 2, the method includes:
201. The global logic controller determines a to-be-distributed kernel program from the kernel status register table.
The kernel status register table includes the priority of each kernel program whose execution is not complete and the number of undistributed blocks in each such kernel program. The to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero.
202. The global logic controller searches the SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of remaining resources in each SM.
Specifically, the SM status register table includes the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of each SM. An SM capable of running at least one complete block is an SM whose number of remaining registers is greater than the number of registers required to run a block, whose number of remaining hardware warps is greater than the number of hardware warps required to run a block, and whose remaining shared memory space is greater than the shared memory space required to run a block.
For example, if running a block requires 36 KB of registers but only 20 KB of registers remain in an SM, that SM cannot run the block.
203. When the global logic controller does not find an SM capable of running at least one complete block, it searches the SM status register table for a first SM, where the first SM is an SM capable of running at least one warp.
It can be understood that the first SM is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
It should be noted that when running a block requires 36 KB of registers but the SM with the most remaining resources has only 12 KB of registers left, the global logic controller cannot find an SM capable of running at least one block; running a warp, however, requires only 6 KB of registers, so the SM with 12 KB of remaining registers can run two warps, and the global logic controller can find the first SM.
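The register arithmetic in the preceding example (36 KB of registers per block, 6 KB per warp, 12 KB free on the best SM) can be written out directly. The helper below is an illustrative sketch; the constants come from the example and everything else is assumed:

```python
REGS_PER_BLOCK = 36 * 1024   # registers needed by one block (example figure)
REGS_PER_WARP = 6 * 1024     # registers needed by one warp (example figure)

def warps_that_fit(free_regs):
    """How many warps fit in an SM's remaining registers."""
    return free_regs // REGS_PER_WARP

# An SM with 12 KB of free registers cannot hold a block (36 KB needed)
# but can hold two warps (6 KB each), so warp-granularity dispatch succeeds.
free = 12 * 1024
assert free < REGS_PER_BLOCK
assert warps_that_fit(free) == 2
```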
204. When the global logic controller finds the first SM, it distributes a block of the to-be-distributed kernel program to the first SM.
If the remaining resources of the first SM can run only one warp, then after a block of the to-be-distributed kernel program is distributed to the first SM, the first SM runs the warps in the block one by one.
It is worth noting that when the global logic controller does not find the first SM, it returns to step 201 and starts again; after a warp of a running low-priority kernel finishes, the global logic controller will be able to find the first SM.
In the GPU resource allocation method provided by this embodiment of the present invention, the global logic controller determines the to-be-distributed kernel program from the kernel status register table and searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for a first SM capable of running at least one warp and distributes a block of the to-be-distributed kernel program to the first SM. In the prior art, a block of a high-priority kernel can be distributed to an SM only after an idle SM appears in the GPU, so the high-priority kernel program is not responded to in time. In this embodiment, by contrast, when no SM capable of running at least one block is found, the method does not wait for other kernel programs to release resources but instead searches for a first SM capable of running at least one warp. Because a warp is smaller than a block, a warp finishes running sooner than a block, so it is easier to find an SM capable of running at least one warp; once one is found, a block of the to-be-distributed kernel program can be distributed to the first SM without waiting for a low-priority kernel program to finish running a block, which improves the response speed of high-priority kernel programs.
As a supplement to the foregoing embodiment, in another implementation provided by an embodiment of the present invention, as shown in FIG. 3, after step 202 in which the global logic controller searches the SM status register table for an SM capable of running at least one complete block, if an SM capable of running at least one complete block is found, the following steps 205 to 207 are performed.
205. When the global logic controller finds an SM capable of running at least one complete block, it determines a first quantity, where the first quantity is the number of blocks that the SM capable of running a complete block can actually run.
The first quantity is determined by the global logic controller from the SM status register table entry of the SM capable of running at least one complete block. The global logic controller can calculate the number of blocks the SM can actually run from the amount of remaining resources of the SM stored in the SM status register table and the amount of resources required to run one block.
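The first-quantity calculation just described can be sketched as the minimum of the three per-resource capacities, since whichever resource runs out first limits how many blocks actually fit. The function and its parameter names are assumptions for illustration, not part of the patent:

```python
def first_quantity(free_regs, free_hw_warps, free_shared,
                   regs_per_block, warps_per_block, shared_per_block):
    """Number of blocks an SM can actually run, limited by whichever
    resource (registers, hardware warps, shared memory) is scarcest."""
    return min(free_regs // regs_per_block,
               free_hw_warps // warps_per_block,
               free_shared // shared_per_block)

# e.g. 96 KB registers / 36 KB per block -> 2, 24 warps / 8 per block -> 3,
# 48 KB shared / 16 KB per block -> 3; registers are the bottleneck, so 2.
assert first_quantity(96*1024, 24, 48*1024, 36*1024, 8, 16*1024) == 2
```

If the kernel has more undistributed blocks than this quantity, only the first quantity of blocks is distributed now and the rest wait for resources to be released.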
206. When the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, the global logic controller distributes the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block.
It is worth noting that when the number of blocks in the to-be-distributed kernel program is greater than the first quantity, the remaining resources of the found SM are insufficient to run all the blocks of the to-be-distributed kernel program. Therefore, the first quantity of blocks are distributed to the SM first; after some blocks finish running and release resources in the SM, the remaining blocks of the to-be-distributed kernel are distributed to the SM.
207. When the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, the global logic controller distributes all the undistributed blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block.
It is worth noting that after the global logic controller distributes blocks to an SM in steps 204, 206, and 207, the number of undistributed blocks of the to-be-distributed kernel program in the kernel status register table must be updated.
In the GPU resource allocation method provided by this embodiment of the present invention, when the global logic controller finds an SM capable of running at least one block, it determines the first quantity; when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, it distributes the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and when the number of blocks in the to-be-distributed kernel program is less than or equal to the first quantity, it distributes all the undistributed blocks of the to-be-distributed kernel program to that SM. When an SM capable of running at least one block can be found, as many blocks as possible of the highest-priority kernel are distributed to that SM, so the highest-priority kernel is responded to in time, which improves the response speed of high-priority kernel programs.
After the global logic controller distributes a block to an SM, the SM needs to distribute the warps in the block for execution in a reasonable manner. Therefore, another embodiment of the present invention provides a method by which, after step 204 in which the global logic controller finds the first SM and distributes a block of the to-be-distributed kernel program to the first SM, the first SM logic controller distributes warps. As shown in FIG. 4, the method includes:
401. The first SM logic controller determines the highest-priority block from the block status register table, where the first SM logic controller is the SM logic controller in the first SM and the block status register table includes the priority of each block distributed to the first SM.
With reference to the GPU resource allocation system shown in FIG. 1, the global logic controller is connected to at least two SMs. After the global logic controller distributes a block of the highest-priority kernel program to the first SM, the first SM logic controller in the first SM distributes the warps of the block to hardware warps for execution.
Because blocks of other kernels may still be running in the first SM, or blocks of other kernel programs may be waiting to run, the first SM logic controller needs to determine the highest-priority block from the block status register table and run the warps of the highest-priority block first.
It is worth noting that the priority of a block stored in the block status register table is the priority of the kernel to which the block belongs; blocks of the same kernel have the same priority.
402. The first SM logic controller searches for a currently idle hardware warp.
It should be noted that when the first SM logic controller finds an idle hardware warp, the following step 403 is performed; when no idle hardware warp is found, the search is repeated until an idle hardware warp is found, and then step 403 is performed.
Because low-priority kernel programs are still running in the first SM, once a warp of a low-priority kernel program finishes, a hardware warp returns to the idle state; at that point the first SM logic controller can find an idle hardware warp, and a warp of the high-priority kernel program can occupy it.
403. When the first SM logic controller determines that the idle hardware warp can run a warp and no higher-priority block has been received, it distributes a warp of the highest-priority block to the idle hardware warp and updates the block status register table.
Whether the idle hardware warp can run a warp is determined as follows: determine whether the number of registers in the first SM is sufficient to run a warp; if it is sufficient and the first SM has not received a higher-priority block, distribute a warp of the current highest-priority block to the found idle hardware warp; if it is not sufficient, keep waiting until a warp finishes running and the number of registers is sufficient to run a warp, and then distribute a warp to the idle hardware warp.
It is worth noting that after a warp of the highest-priority block is distributed to the idle hardware warp, it is also necessary to determine whether all warps of that block have been distributed; if so, the foregoing steps 401 to 403 are performed again; if not, the foregoing steps 402 to 403 are performed again.
In the GPU resource allocation method provided by this embodiment of the present invention, the first SM logic controller first searches for an idle hardware warp; when an idle hardware warp is found and the first SM can run a warp, a warp of the highest-priority block is distributed to the hardware warp for execution, without waiting until the first SM has enough resources to run the whole block before distributing the whole block to hardware warps. This reduces waiting time and improves the response speed of high-priority kernel programs.
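Steps 401 to 403 amount to a dispatch loop inside the SM logic controller. The sketch below is a simplified software model of that loop; the dictionary fields and the greedy highest-priority selection are illustrative assumptions, not hardware specified by the patent:

```python
def dispatch_warps(block_table, idle_hw_warps, free_regs, regs_per_warp):
    """Distribute warps of the highest-priority block to idle hardware
    warps while registers last; returns the dispatch order by block id."""
    dispatched = []
    pending = [b for b in block_table if b["warps_left"] > 0]
    while pending and idle_hw_warps > 0 and free_regs >= regs_per_warp:
        # 401: pick the highest-priority block with undistributed warps
        top = max(pending, key=lambda b: b["priority"])
        # 402/403: give one of its warps to an idle hardware warp
        top["warps_left"] -= 1
        idle_hw_warps -= 1
        free_regs -= regs_per_warp
        dispatched.append(top["block_id"])
        pending = [b for b in block_table if b["warps_left"] > 0]
    return dispatched

blocks = [{"block_id": "A", "priority": 2, "warps_left": 2},
          {"block_id": "B", "priority": 1, "warps_left": 2}]
# Two idle hardware warps and registers for three warps: both warps of
# the higher-priority block A go first; block B keeps waiting.
assert dispatch_warps(blocks, 2, 18*1024, 6*1024) == ["A", "A"]
```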
To reduce idle resources in the SMs and improve SM resource utilization, thereby speeding up the response of high-priority kernel programs, in another implementation provided by an embodiment of the present invention, when the first SM logic controller determines that a warp has finished running, it notifies the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM in the SM status register table.
It can be understood that when a warp finishes running, the registers, hardware warp, and shared memory required to run it are released. Therefore, the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM in the SM status register table must be updated in real time so that the global logic controller can promptly dispatch blocks to the SM.
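The bookkeeping this notification triggers can be modeled as releasing one warp's share of each resource back into the SM's row of the SM status register table. The field names below are assumptions for illustration:

```python
def on_warp_complete(sm_row, regs_per_warp, shared_per_warp):
    """Return the SM status register row with one warp's resources freed,
    so the global logic controller sees the SM's real remaining capacity."""
    sm_row = dict(sm_row)  # keep the original row unmodified
    sm_row["free_regs"] += regs_per_warp
    sm_row["free_hw_warps"] += 1
    sm_row["free_shared"] += shared_per_warp
    return sm_row

row = {"free_regs": 6*1024, "free_hw_warps": 0, "free_shared": 2*1024}
updated = on_warp_complete(row, 6*1024, 2*1024)
assert updated == {"free_regs": 12*1024, "free_hw_warps": 1,
                   "free_shared": 4*1024}
```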
In the GPU resource allocation method provided by this embodiment of the present invention, when the first SM logic controller determines that a warp has finished running, it notifies the global logic controller to update the SM status register table, so that the global logic controller can promptly dispatch blocks of high-priority kernels to the SM according to the latest amount of remaining SM resources and the number of blocks not yet distributed. This improves SM resource utilization and speeds up the response of high-priority kernel programs.
With reference to the GPU resource allocation methods shown in FIG. 2 to FIG. 4, an embodiment of the present invention further provides a GPU resource allocation system. As shown in FIG. 5, the system includes a global logic controller 501 and at least two stream multiprocessors (SMs) capable of communicating with the global logic controller 501; the global logic controller includes a first determining unit 5011, a first searching unit 5012, and a first distributing unit 5013.
It should be noted that an SM in the system may be an SM 502 capable of running at least one complete block or a first SM 503.
The first SM 503 is an SM capable of running at least one warp.
The first determining unit 5011 is configured to determine a to-be-distributed kernel program, where the kernel status register table includes the priority of each kernel program whose execution is not complete and the number of undistributed blocks in each such kernel program, and the to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero.
The first searching unit 5012 is configured to search the SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of remaining resources in each SM, and, when no SM capable of running at least one complete thread block (block) is found, search the SM status register table for the first SM 503.
The first distributing unit 5013 is configured to distribute a block of the to-be-distributed kernel program to the first SM 503 when the first SM 503 is found.
The first SM 503 is configured to run the block of the to-be-distributed kernel program distributed by the first distributing unit 5013.
In another embodiment of the present invention, the first determining unit 5011 is further configured to determine a first quantity when the first searching unit 5012 finds an SM capable of running at least one complete block, where the first quantity is the number of blocks that the SM capable of running a complete block can actually run.
The first distributing unit 5013 is further configured to: when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, distribute the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and when the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, distribute all the undistributed blocks of the to-be-distributed kernel program to the SM 502 capable of running at least one complete block.
The SM 502 capable of running at least one complete block is configured to run the blocks of the to-be-distributed kernel program distributed by the first distributing unit 5013.
In another embodiment of the present invention, as shown in FIG. 6, the first SM 503 includes a second determining unit 5031, a second searching unit 5032, a second distributing unit 5033, and a notifying unit 5034.
It should be noted that the second determining unit 5031, the second searching unit 5032, and the second distributing unit 5033 are located in the first SM logic controller in the first SM 503.
The second determining unit 5031 is located in the first SM logic controller and is configured to determine the highest-priority block from the block status register table, where the first SM logic controller is the SM logic controller in the first SM 503 and the block status register table includes the priority of each block distributed to the first SM 503.
The second searching unit 5032 is located in the first SM logic controller and is configured to search for a currently idle hardware warp.
The second distributing unit 5033 is located in the first SM logic controller and is configured to: when it is determined that the idle hardware warp can run a warp and no higher-priority block has been received, distribute a warp of the highest-priority block to the idle hardware warp and update the block status register table.
It should be noted that the SM 502 capable of running at least one complete block has the same structure as the first SM 503; details are not repeated in this embodiment of the present invention.
It is worth noting that the SM status register table includes the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of each SM, and that the first SM 503 is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
The second determining unit 5031 is further configured to determine whether any warp has finished running.
The notifying unit 5034 is located in the first SM logic controller and is configured to notify the global logic controller 501 to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM 503 in the SM status register table when the second determining unit 5031 determines that a warp has finished running.
In the GPU resource allocation system provided by this embodiment of the present invention, the first determining unit in the global logic controller determines the to-be-distributed kernel program from the kernel status register table, and the first searching unit searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for a first SM capable of running at least one warp, and the first distributing unit distributes a block of the to-be-distributed kernel program to the first SM. In the prior art, a block of a high-priority kernel can be distributed to an SM only after an idle SM appears in the GPU, so the high-priority kernel program is not responded to in time. In this embodiment, by contrast, when no SM capable of running at least one block is found, the system does not wait for other kernel programs to release resources but instead searches for a first SM capable of running at least one warp. Because a warp is smaller than a block, a warp finishes running sooner than a block, so it is easier to find an SM capable of running at least one warp; once one is found, a block of the to-be-distributed kernel program can be distributed to the first SM without waiting for a low-priority kernel program to finish running a block, which improves the response speed of high-priority kernel programs.
An embodiment of the present invention further provides a GPU resource allocation apparatus. As shown in FIG. 7, the apparatus includes a global logic controller and at least two SMs capable of communicating with the global logic controller, where an SM may be an SM capable of running at least one complete block or a first SM. The global logic controller may include a memory 71, a transceiver 72, a processor 73, and a bus 74, where the memory 71, the transceiver 72, and the processor 73 are communicatively connected through the bus 74.
The memory 71 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 71 may store an operating system and other applications. When the technical solutions provided by the embodiments of the present invention are implemented by software or firmware, the program code for implementing them is stored in the memory 71 and executed by the processor 73.
The transceiver 72 is used for communication between the apparatus and other devices or communication networks (such as, but not limited to, Ethernet, a radio access network (RAN), and a wireless local area network (WLAN)).
The processor 73 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present invention.
The bus 74 may include a path for transferring information between the components of the apparatus (for example, the memory 71, the transceiver 72, and the processor 73).
It should be noted that although the hardware shown in FIG. 7 includes only the memory 71, the transceiver 72, the processor 73, and the bus 74, in a specific implementation, a person skilled in the art should understand that the terminal also includes other components necessary for normal operation and, depending on specific needs, may also include hardware devices implementing other functions.
Specifically, when the global logic controller shown in FIG. 7 is used to implement the system shown in the embodiment of FIG. 5, the processor 73 in the apparatus is coupled to the memory 71 and the transceiver 72 and is configured to control the execution of program instructions; it is specifically configured to determine a to-be-distributed kernel program, where the kernel status register table includes the priority of each kernel program whose execution is not complete and the number of undistributed blocks in each such kernel program, and the to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero; to search an SM status register table for an SM capable of running at least one complete block, where the SM status register table stores the amount of remaining resources in each SM; and, when no SM capable of running at least one complete thread block (block) is found, to search the SM status register table for a first SM, where the first SM is an SM capable of running at least one warp.
The transceiver 72 is configured to distribute a block of the to-be-distributed kernel program to the first SM when the first SM is found.
The memory 71 is configured to store the kernel status register table and the SM status register table.
The processor 73 is further configured to determine a first quantity when an SM capable of running at least one complete block is found, where the first quantity is the number of blocks that the SM capable of running a complete block can actually run.
The transceiver 72 is further configured to: when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, distribute the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and when the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, distribute all the undistributed blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block.
In another embodiment of the present invention, as shown in FIG. 8, the first SM includes a memory 81, a transceiver 82, a processor 83, and a bus 84, where the memory 81, the transceiver 82, and the processor 83 are communicatively connected through the bus 84.
The memory 81 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 81 may store an operating system and other applications. When the technical solutions provided by the embodiments of the present invention are implemented by software or firmware, the program code for implementing them is stored in the memory 81 and executed by the processor 83.
The transceiver 82 is used for communication between the apparatus and other devices or communication networks (such as, but not limited to, Ethernet, a radio access network (RAN), and a wireless local area network (WLAN)).
The processor 83 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present invention.
The bus 84 may include a path for transferring information between the components of the apparatus (for example, the memory 81, the transceiver 82, and the processor 83).
It should be noted that although the hardware shown in FIG. 8 includes only the memory 81, the transceiver 82, the processor 83, and the bus 84, in a specific implementation, a person skilled in the art should understand that the terminal also includes other components necessary for normal operation and, depending on specific needs, may also include hardware devices implementing other functions.
Specifically, when the first SM shown in FIG. 8 is used to implement the system shown in the embodiments of FIG. 5 and FIG. 6, the processor 83 in the apparatus is coupled to the memory 81 and the transceiver 82 and is configured to control the execution of program instructions; it is specifically configured to determine the highest-priority block from the block status register table, where the first SM logic controller is the SM logic controller in the first SM and the block status register table includes the priority of each block distributed to the first SM, and to search for a currently idle hardware warp.
The transceiver 82 is configured to: when it is determined that the idle hardware warp can run a warp and no higher-priority block has been received, distribute a warp of the highest-priority block to the idle hardware warp and update the block status register table.
It is worth noting that the SM status register table includes the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of each SM, and that the first SM is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
The transceiver 82 is further configured to notify the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM in the SM status register table when it is determined that a warp has finished running.
In the GPU resource allocation apparatus provided by this embodiment of the present invention, the processor determines the to-be-distributed kernel program from the kernel status register table and searches the SM status register table for an SM capable of running at least one complete block; when no SM capable of running at least one block is found, it continues to search for a first SM capable of running at least one warp, and the transceiver distributes a block of the to-be-distributed kernel program to the first SM. In the prior art, a block of a high-priority kernel can be distributed to an SM only after an idle SM appears in the GPU, so the high-priority kernel program is not responded to in time. In this embodiment, by contrast, when no SM capable of running at least one block is found, the apparatus does not wait for other kernel programs to release resources but instead searches for a first SM capable of running at least one warp. Because a warp is smaller than a block, a warp finishes running sooner than a block, so it is easier to find an SM capable of running at least one warp; once one is found, a block of the to-be-distributed kernel program can be distributed to the first SM without waiting for a low-priority kernel program to finish running a block, which improves the response speed of high-priority kernel programs.
A person skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the foregoing functional modules is used as an example. In practical applications, the foregoing functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or some of the functions described above. For the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into modules or units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

  1. A graphics processing unit (GPU) resource allocation method, wherein the method is applied to a GPU resource allocation system, the system comprises a global logic controller and at least two stream multiprocessors (SMs) capable of communicating with the global logic controller, and the method comprises:
    determining, by the global logic controller, a to-be-distributed kernel program from a kernel status register table, wherein the kernel status register table comprises the priority of each kernel program whose execution is not complete and the number of undistributed thread blocks (blocks) in each such kernel program, and the to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero;
    searching, by the global logic controller, an SM status register table for an SM capable of running at least one complete block, wherein the SM status register table is used to store the amount of remaining resources in each SM;
    when the global logic controller does not find an SM capable of running at least one complete block, searching the SM status register table for a first SM, wherein the first SM is an SM capable of running at least one warp; and
    when the global logic controller finds the first SM, distributing a block of the to-be-distributed kernel program to the first SM.
  2. The GPU resource allocation method according to claim 1, wherein after the global logic controller searches the SM status register table for an SM capable of running at least one complete block, the method further comprises:
    when the global logic controller finds an SM capable of running at least one complete block, determining a first quantity, wherein the first quantity is the number of blocks that the SM capable of running a complete block can actually run;
    when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, distributing the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and
    when the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, distributing all the undistributed blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block.
  3. The GPU resource allocation method according to claim 2, wherein after the global logic controller distributes a block of the to-be-distributed kernel program to the first SM, the method further comprises:
    determining, by a first SM logic controller, the highest-priority block from a block status register table, wherein the first SM logic controller is the SM logic controller in the first SM, and the block status register table comprises the priority of each block distributed to the first SM;
    searching, by the first SM logic controller, for a currently idle hardware warp; and
    when the first SM logic controller determines that the idle hardware warp can run a warp and no higher-priority block has been received, distributing a warp of the highest-priority block to the idle hardware warp and updating the block status register table.
  4. The GPU resource allocation method according to any one of claims 1 to 3, wherein the SM status register table comprises the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of each SM, and the first SM is an SM whose number of remaining registers is greater than the number of registers required to run a warp, whose number of remaining hardware warps is greater than the number of hardware warps required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
  5. The GPU resource allocation method according to claim 4, wherein after the first SM logic controller determines that the idle hardware warp can run a warp and no higher-priority block has been received and distributes a warp of the highest-priority block to the hardware warp, the method further comprises:
    when the first SM logic controller determines that a warp has finished running, notifying the global logic controller to update the number of remaining registers, the number of remaining hardware warps, and the remaining shared memory space of the first SM in the SM status register table.
  6. A graphics processing unit (GPU) resource allocation system, wherein the system comprises a global logic controller and at least two stream multiprocessors (SMs) capable of communicating with the global logic controller, and the global logic controller comprises a first determining unit, a first searching unit, and a first distributing unit;
    the first determining unit is configured to determine a to-be-distributed kernel program, wherein the kernel status register table comprises the priority of each kernel program whose execution is not complete and the number of undistributed thread blocks (blocks) in each such kernel program, and the to-be-distributed kernel program is the kernel program in the kernel status register table that has the highest priority and whose number of undistributed blocks is not zero;
    the first searching unit is configured to search an SM status register table for an SM capable of running at least one complete block, wherein the SM status register table is used to store the amount of remaining resources in each SM, and, when no SM capable of running at least one complete block is found, search the SM status register table for a first SM, wherein the first SM is an SM capable of running at least one warp;
    the first distributing unit is configured to distribute a block of the to-be-distributed kernel program to the first SM when the first SM is found; and
    the first SM is configured to run the block of the to-be-distributed kernel program distributed by the first distributing unit.
  7. The GPU resource allocation system according to claim 6, wherein
    the first determining unit is further configured to determine a first quantity when the first searching unit finds an SM capable of running at least one complete block, wherein the first quantity is the number of blocks that the SM capable of running a complete block can actually run;
    the first distributing unit is further configured to: when the number of undistributed blocks in the to-be-distributed kernel program is greater than the first quantity, distribute the first quantity of blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and when the number of undistributed blocks in the to-be-distributed kernel program is less than or equal to the first quantity, distribute all the undistributed blocks of the to-be-distributed kernel program to the SM capable of running at least one complete block; and
    the SM capable of running at least one complete block is configured to run the blocks of the to-be-distributed kernel program distributed by the first distributing unit.
  8. The GPU resource allocation system according to claim 7, wherein the first SM comprises:
    a second determining unit, configured to determine a highest-priority block from a block state register table, where the block state register table includes the priority of each block distributed to the first SM;
    a second searching unit, configured to search for a currently idle hardware warp; and
    a second distribution unit, configured to: when it is determined that the idle hardware warp can run a warp and no higher-priority block has been received, distribute a warp of the highest-priority block to the idle hardware warp, and update the block state register table.
  9. The GPU resource allocation system according to any one of claims 6 to 8, wherein the SM state register table includes a remaining register quantity, a remaining hardware warp quantity, and remaining shared memory space of each SM, and the first SM is an SM whose remaining register quantity is greater than the register quantity required to run a warp, whose remaining hardware warp quantity is greater than the hardware warp quantity required to run a warp, and whose remaining shared memory space is greater than the shared memory space required to run a warp.
  10. The GPU resource allocation system according to claim 9, wherein the first SM further comprises a notification unit; and
    the notification unit is configured to: when it is determined that a warp has finished running, notify the global logic controller to update the remaining register quantity, the remaining hardware warp quantity, and the remaining shared memory space of the first SM in the SM state register table.
PCT/CN2016/083314 2015-06-19 2016-05-25 GPU resource allocation method and system WO2016202153A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/844,333 US10614542B2 (en) 2015-06-19 2017-12-15 High granularity level GPU resource allocation method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510345468.9 2015-06-19
CN201510345468.9A CN106325995B (zh) 2015-06-19 2015-06-19 GPU resource allocation method and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/844,333 Continuation US10614542B2 (en) 2015-06-19 2017-12-15 High granularity level GPU resource allocation method and system

Publications (1)

Publication Number Publication Date
WO2016202153A1 (zh)

Family

ID=57544822

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/083314 WO2016202153A1 (zh) 2015-06-19 2016-05-25 一种gpu资源的分配方法及系统

Country Status (3)

Country Link
US (1) US10614542B2 (zh)
CN (1) CN106325995B (zh)
WO (1) WO2016202153A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445565A (zh) * 2018-11-08 2019-03-08 Beihang University GPU quality-of-service guarantee method based on exclusive use and reservation of streaming multiprocessor cores

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
US10445850B2 (en) * 2015-08-26 2019-10-15 Intel Corporation Technologies for offloading network packet processing to a GPU
US10474468B2 (en) * 2017-02-22 2019-11-12 Advanced Micro Devices, Inc. Indicating instruction scheduling mode for processing wavefront portions
US10620994B2 (en) 2017-05-30 2020-04-14 Advanced Micro Devices, Inc. Continuation analysis tasks for GPU task scheduling
US10798162B2 (en) * 2017-08-28 2020-10-06 Texas Instruments Incorporated Cluster system with fail-safe fallback mechanism

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102508820A (zh) * 2011-11-25 2012-06-20 National University of Defense Technology GPU-based method for eliminating data dependence in parallel solving of cloud equations
CN103336718A (zh) * 2013-07-04 2013-10-02 Beihang University GPU thread scheduling optimization method
US8643656B2 (en) * 2010-09-30 2014-02-04 Nec Laboratories America, Inc. Energy-aware task consolidation on graphics processing unit (GPU)
CN103729167A (zh) * 2012-10-12 2014-04-16 Nvidia Corporation Techniques for improving performance in a multithreaded processing unit
US20140173611A1 (en) * 2012-12-13 2014-06-19 Nvidia Corporation System and method for launching data parallel and task parallel application threads and graphics processing unit incorporating the same
US20140204098A1 (en) * 2013-01-18 2014-07-24 Nvidia Corporation System, method, and computer program product for graphics processing unit (gpu) demand paging

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10282227B2 (en) * 2014-11-18 2019-05-07 Intel Corporation Efficient preemption for graphics processors

Non-Patent Citations (2)

Title
BURTSCHER, M. ET AL.: "A Quantitative Study of Irregular Programs on GPUs", IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2012), 2012, pages 141-151, XP032297761 *
YU, YONG ET AL.: "Thread Mapping Model from CUDA to Heterogeneous Many-core Architecture", COMPUTER ENGINEERING, vol. 38, no. 9, 31 May 2012 (2012-05-31), pages 282-284 and 287, XP055337743 *

Also Published As

Publication number Publication date
CN106325995A (zh) 2017-01-11
US10614542B2 (en) 2020-04-07
CN106325995B (zh) 2019-10-22
US20180108109A1 (en) 2018-04-19

Similar Documents

Publication Publication Date Title
EP3073374B1 (en) Thread creation method, service request processing method and related device
WO2016202153A1 (zh) GPU resource allocation method and system
US8478926B1 (en) Co-processing acceleration method, apparatus, and system
US8209690B2 (en) System and method for thread handling in multithreaded parallel computing of nested threads
US9479449B2 (en) Workload partitioning among heterogeneous processing nodes
US20130212594A1 (en) Method of optimizing performance of hierarchical multi-core processor and multi-core processor system for performing the method
CN109726005B (zh) 用于管理资源的方法、服务器系统和计算机可读介质
US10613902B2 (en) GPU resource allocation method and system
US20140331235A1 (en) Resource allocation apparatus and method
CN109564528B (zh) 分布式计算中计算资源分配的系统和方法
US20180225155A1 (en) Workload optimization system
JP6622715B2 (ja) 共有ハードウェアリソースを使用したクラスタプロセッサコアにおけるハードウェアスレッドの動的負荷分散、ならびに関連する回路、方法、およびコンピュータ可読媒体
US20120297216A1 (en) Dynamically selecting active polling or timed waits
US9507633B2 (en) Scheduling method and system
US10831539B2 (en) Hardware thread switching for scheduling policy in a processor
US9311142B2 (en) Controlling memory access conflict of threads on multi-core processor with set of highest priority processor cores based on a threshold value of issued-instruction efficiency
US11940915B2 (en) Cache allocation method and device, storage medium, and electronic device
US20160253216A1 (en) Ordering schemes for network and storage i/o requests for minimizing workload idle time and inter-workload interference
US9547576B2 (en) Multi-core processor system and control method
US9384050B2 (en) Scheduling method and scheduling system for multi-core processor system
US20160210171A1 (en) Scheduling in job execution
CN103823712A (zh) Data stream processing method and apparatus for a multi-CPU virtual machine system
CN107402807A (zh) Method, system, and processor for effectively improving multitask execution efficiency in a computer system
CN112486638A (zh) Method, apparatus, device, and storage medium for executing processing tasks
US9405470B2 (en) Data processing system and data processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16810894

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16810894

Country of ref document: EP

Kind code of ref document: A1