CN114168301A - Thread scheduling method, processor and electronic device - Google Patents

Thread scheduling method, processor and electronic device

Info

Publication number
CN114168301A
Authority
CN
China
Prior art keywords
thread
group
resource
thread group
shared memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111565212.0A
Other languages
Chinese (zh)
Inventor
王陶然
潘于
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202111565212.0A priority Critical
Publication of CN114168301A publication Critical
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5022 Mechanisms to release resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

A thread scheduling method, a processor, and an electronic device are provided. The thread scheduling method includes performing an early-release operation on the shared memory resource of a first thread group. The early-release operation includes: in a case where the thread subgroup currently running in the processing unit is determined to be the last thread subgroup of the first thread group, releasing the releasable shared memory resource occupied by the first thread group. With this method, the shared memory resource occupied by a thread group can be released early, while the thread group is still running, so that the freed resource can be allocated to other thread groups. Resources can thus be allocated ahead of time, which improves execution efficiency and shortens the time needed to complete the overall task.

Description

Thread scheduling method, processor and electronic device
Technical Field
Embodiments of the present disclosure relate to a thread scheduling method, a processor, and an electronic device.
Background
A Graphics Processing Unit (GPU), also called a display core, visual processor, or display chip, is a microprocessor dedicated to image- and graphics-related computation on personal computers, workstations, game consoles, and some mobile devices (e.g., tablet computers and smartphones). A GPU converts and drives the display information required by a computer system and provides line-scanning signals to the display, thereby controlling the display to render correctly. The GPU is an important component connecting the display and the computer motherboard (which includes, for example, the central processing unit), and is also one of the important devices for realizing human-machine interaction.
Disclosure of Invention
At least one embodiment of the present disclosure provides a thread scheduling method that includes performing an early-release operation on the shared memory resource of a first thread group. The early-release operation includes: in a case where the thread subgroup currently running in a processing unit is determined to be the last thread subgroup of the first thread group, releasing the releasable shared memory resource occupied by the first thread group.
For example, the thread scheduling method provided in at least one embodiment of the present disclosure further includes: allocating the freed shared memory resource previously occupied by the first thread group to a second thread group that is currently making a resource allocation request.
For example, in the thread scheduling method provided by at least one embodiment of the present disclosure, the last thread subgroup of the first thread group is still running while the freed shared memory resource previously occupied by the first thread group is allocated to the second thread group.
For example, the thread scheduling method provided in at least one embodiment of the present disclosure further includes: performing the early-release operation on the shared memory resource of the first thread group in response to a shared-memory-resource early-release instruction for the first thread group.
For example, in the thread scheduling method provided by at least one embodiment of the present disclosure, responding to the shared-memory-resource early-release instruction for the first thread group includes: acquiring state information of the thread subgroup currently running in the processing unit, and determining, based on the state information, whether that thread subgroup is the last thread subgroup of the first thread group.
For example, in the thread scheduling method provided by at least one embodiment of the present disclosure, the state information of the thread subgroup includes indication information and identification information; the indication information indicates whether the thread subgroup is the last thread subgroup of the first thread group, and the identification information indicates the resource identifier corresponding to the thread subgroup.
For example, in the thread scheduling method provided in at least one embodiment of the present disclosure, releasing the releasable shared memory resource occupied by the first thread group includes: obtaining, according to the identification information, the information of the shared memory resource to be released, and releasing the releasable shared memory resource occupied by the first thread group according to that information.
For example, in the thread scheduling method provided in at least one embodiment of the present disclosure, releasing the releasable shared memory resource occupied by the first thread group according to the information of the shared memory resource includes: obtaining the mask table corresponding to the shared memory resource from the resource mask matrix, and updating the state of the shared memory resource in the mask table to an available state.
For example, in the thread scheduling method provided in at least one embodiment of the present disclosure, the information of the shared memory resource includes the address and size of the shared memory resource.
For example, the thread scheduling method provided in at least one embodiment of the present disclosure further includes, after the early-release operation of the shared memory resource is performed: performing an end operation of the first thread group and releasing the private storage resource occupied by the first thread group.
For example, in the thread scheduling method provided in at least one embodiment of the present disclosure, performing the end operation of the first thread group includes: ending execution of the first thread group in response to a thread group end instruction for the first thread group.
At least one embodiment of the present disclosure provides a processor including: a processing unit configured to execute at least one thread group, where each thread group includes at least one thread subgroup; and a resource manager configured to release the releasable shared memory resource occupied by a thread group in a case where the thread subgroup currently running in the processing unit is determined to be the last thread subgroup of that thread group.
For example, in a processor provided in at least one embodiment of the present disclosure, the processing unit includes a control unit configured to receive the shared-memory-resource early-release instruction and to provide the resource manager with the state information of the thread subgroup currently running in the processing unit together with the early-release instruction. The resource manager is further configured to release the releasable shared memory resource occupied by the thread group in a case where, according to the state information and the early-release instruction, the thread subgroup currently running in the processing unit is determined to be the last thread subgroup of that thread group.
For example, in a processor provided in at least one embodiment of the present disclosure, the processing unit further includes a plurality of vector processing units and a shared memory; each vector processing unit includes a vector register and a scalar register, which are provided to each thread subgroup as private storage resources; and the shared memory is provided to the thread groups as the shared memory resource.
At least one embodiment of the present disclosure provides an electronic device including a processor provided in at least one embodiment of the present disclosure.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments are briefly introduced below; obviously, the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.
FIG. 1 is a schematic diagram of a shader processing unit;
FIG. 2A is a schematic diagram of a vector register mask matrix;
FIG. 2B is a schematic diagram of a scalar register mask matrix;
FIG. 2C is a schematic diagram of a shared memory resource mask matrix;
FIG. 3A is a timing diagram illustrating thread row (TP) allocation and kernel program execution;
FIG. 3B is a timing diagram illustrating thread row (TP) allocation and kernel program execution provided by at least one embodiment of the present disclosure;
FIG. 4A is a flow diagram illustrating a method of thread scheduling provided by at least one embodiment of the present disclosure;
FIG. 4B depicts a flowchart of an exemplary thread scheduling method;
FIG. 5 illustrates a schematic diagram of a processor provided by at least one embodiment of the present disclosure;
FIG. 6A is a schematic view of an electronic device according to at least one embodiment of the present disclosure; and
FIG. 6B is a schematic view of another electronic device according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art from the described embodiments without inventive effort fall within the protection scope of the present disclosure.
Unless otherwise defined, technical or scientific terms used herein have the ordinary meaning understood by a person of ordinary skill in the art to which this disclosure belongs. The words "first", "second", and similar terms used in the present disclosure do not denote any order, quantity, or importance, but are merely used to distinguish different components. Words such as "comprising" or "including" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not restricted to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like are used only to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of some known functions and components have been omitted from the present disclosure.
Fig. 1 is a schematic diagram of a Shader Processing Unit (SPU) 100.
As shown in fig. 1, the SPU 100 is used to execute tasks such as GPU kernel programs and may include a plurality of vector processing units (VEUs) 101, a Thread Cluster Shared Memory (TCSM) 102, a Shader Control Unit (SCU) 103, a Shader Resource Manager (SRM) (not shown), and a Command Interface module (CI) (not shown). Note that fig. 1 schematically shows 4 VEUs 101, but this does not limit the embodiments of the present disclosure: there may be any number of VEUs 101, for example 2, 3, or 5, as actual needs dictate.
In the parallel execution of kernel programs, a computing task is generally performed by many threads. A thread is the smallest execution unit in GPU computation; one thread completes one smallest logical operation. A Thread Cluster (TC) is the smallest thread group unit issued by the CI and may contain at most 2048 threads; a Thread Row (TP) is the smallest thread group unit handled by the SRM and may contain at most 64 threads. GPU kernel programs execute in the VEUs 101. Each VEU 101 includes Vector Registers (VRs) and Scalar Registers (SRs). Each VR consists of 64 32-bit registers, each individual 32-bit register being used by one thread of a TP, and a TP may request multiple VRs according to its own needs. Each SR is a single 32-bit register used by all threads of a TP, and a TP may likewise request multiple SRs. Different TPs belonging to the same TC can share data through the TCSM, which serves as a cache for the TC.
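To make the hierarchy concrete, the following is a minimal sketch, not taken from the patent, of how the TC/TP structure and per-TP resource requests described above might be modeled; all type and field names are illustrative assumptions.

```cpp
// Sketch of the thread hierarchy: a Thread Cluster (TC) is split into
// Thread Rows (TPs) of up to 64 threads, each TP carrying the VR/SR
// counts it will request from the SRM.
#include <cstdint>
#include <vector>

struct ResourceRequest {
    uint32_t numVRs;     // vector registers: one 32-bit lane per thread in the TP
    uint32_t numSRs;     // scalar registers: each shared by all threads in the TP
    uint32_t tcsmBytes;  // TCSM space requested for the whole TC
};

struct ThreadRow {                // TP: smallest unit handled by the SRM
    uint32_t threadCount;         // at most 64
    ResourceRequest request;
    bool isLastInCluster;         // later used to match a TCSM early release
};

struct ThreadCluster {            // TC: smallest unit issued by the CI
    std::vector<ThreadRow> rows;  // at most 2048 threads in total
};

// Split a cluster of totalThreads threads into TPs of up to 64 threads,
// marking the final TP.
ThreadCluster splitCluster(uint32_t totalThreads, ResourceRequest perRow) {
    ThreadCluster tc;
    while (totalThreads > 0) {
        uint32_t n = totalThreads < 64 ? totalThreads : 64;
        tc.rows.push_back({n, perRow, false});
        totalThreads -= n;
    }
    if (!tc.rows.empty()) tc.rows.back().isLastInCluster = true;
    return tc;
}
```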
The SCU 103, as the control unit of the SPU 100, is responsible for tasks such as instruction fetch and decode while a GPU kernel program executes. The CI unpacks the issued tasks and dispatches the TCs and the related information obtained by unpacking to different SRMs for splitting. The SRM splits the TCs sent by the CI into individual TPs, assigns each TP to a different VEU, and allocates the corresponding hardware resources for each TP. A GPU kernel program is fetched and decoded through the SCU; the SRM obtains TPs by splitting the TCs issued by the CI, and each TP is executed on a VEU in the SPU as the smallest scheduling unit. Only after the SRM has reserved the hardware resources belonging to a TP on a selected SPU and VEU that satisfy the TP's hardware resource requirements, and has informed the SCU that hardware resource allocation is complete, can the TP's kernel program be fetched from memory, decoded, and executed by the SCU.
The hardware resources include private storage resources and shared storage resources. The private storage resources belonging to one TP include the VRs, SRs, and so on; the shared storage resources of all TPs belonging to the same TC include the TCSM and so on. For each VEU, the SRM stores a private mask table recording the usage of the private storage resources belonging to that VEU. The mask tables recording the usage of the VEUs' private storage resources form a mask matrix: FIG. 2A shows the mask matrix for the VRs among the private storage resources, and FIG. 2B shows the mask matrix for the SRs. As shown in FIGS. 2A and 2B, the VEUs numbered VEU_0 through VEU_N-1 correspond to N VR mask tables and N SR mask tables; the N VR mask tables form the VR mask matrix and the N SR mask tables form the SR mask matrix, where N ≥ 2 is an integer. Similarly, each SPU has a private mask table recording the usage of the shared storage resources belonging to that SPU, and these mask tables likewise form a mask matrix; FIG. 2C shows the mask matrix for the TCSM among the shared storage resources. As shown in FIG. 2C, the SPUs numbered SPU_0 through SPU_M-1 correspond to M TCSM mask tables, which form the TCSM mask matrix, where M ≥ 2 is an integer.
In FIGS. 2A, 2B, and 2C, the ordinate denotes the SPU/VEU number; for example, an ordinate of VEU_0 means that the row of the mask matrix labeled VEU_0 stores the mask table of the private storage resources of the VEU numbered VEU_0. When a TP's resource request is granted by the SRM, each hardware resource allocated to the TP is marked as in-use in the corresponding resource mask table of the VEU assigned to that TP; that is, the state of the corresponding resource is marked as in-use in the mask table, so that subsequent TPs cannot use the occupied resource, avoiding resource conflicts and double allocation. For example, in one example, assuming 100 VRs per VEU, if the first TP is allocated 20 VRs, subsequent TPs can only be allocated from the remaining 80 VRs.
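As an illustration of this mask-table bookkeeping, the following minimal sketch, under the assumption of 100 VRs per VEU as in the example above, marks allocated registers as in-use so that later requests see only what remains. The first-fit contiguous allocation policy and all names are assumptions for illustration, not the patent's design.

```cpp
#include <bitset>
#include <optional>

constexpr int kVRsPerVEU = 100;
using VRMaskTable = std::bitset<kVRsPerVEU>;  // 1 = in use, 0 = available

// First-fit search for `count` contiguous free VRs: returns the base
// index and marks the span in-use, or std::nullopt if this VEU cannot
// satisfy the request.
std::optional<int> allocateVRs(VRMaskTable& mask, int count) {
    for (int base = 0; base + count <= kVRsPerVEU; ++base) {
        bool spanFree = true;
        for (int i = 0; i < count; ++i) {
            if (mask[base + i]) { spanFree = false; break; }
        }
        if (spanFree) {
            for (int i = 0; i < count; ++i) mask[base + i] = true;
            return base;
        }
    }
    return std::nullopt;
}
```

With this bookkeeping, a first TP that takes 20 VRs leaves exactly 80 marked available, so a later request for more than 80 VRs on the same VEU fails and must wait for a release.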
When a TP asks the SRM to allocate hardware resources, the SRM traverses the mask matrices of all the hardware resources (e.g., the VR mask matrix, the SR mask matrix, and the TCSM mask matrix) and grants the TP's request only if every hardware resource can satisfy the TP's requirements. When a TP's hardware resource request is granted, the SRM designates, for the TP, a VEU on an SPU that meets its hardware resource requirements to execute the kernel program, and selects the numbers of the shared storage resource on the SPU and of the private storage resources on the VEU allocated to the TP. At the same time, the SRM stores the attributes of the hardware resources used by the TP (their addresses and sizes) in a memory.
A shared storage resource is a hardware resource belonging to one SPU and is shared by all TPs within a TC, so all TPs of that TC must be assigned to the same SPU to ensure the shared storage resource can be used. When searching for VEUs that satisfy the conditions, the requirements of subsequent TPs of the same TC must be considered, not only those of the TP currently being allocated, in case the subsequent TPs cannot be allocated enough free resources. From the SRM's point of view, the first TP of a TC has the largest hardware resource requirement: as long as an available VEU and SPU can meet the requirements of the first TP, all TPs of the entire TC can be allocated in turn; otherwise, the split of the entire TC is suspended until TPs allocated for a previous TC complete execution and release enough hardware resources for allocation to continue.
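A minimal sketch of this admission rule follows: a TC may begin splitting only if some SPU can hold its TCSM and one of that SPU's VEUs can satisfy the first (largest) TP. Free-resource counters stand in for the mask-matrix traversal, and all names are illustrative assumptions.

```cpp
#include <vector>

// Returns true if some (SPU, VEU) pair can admit the TC: the TCSM must
// fit on a single SPU (all TPs of the TC land there), and at least one
// VEU on that SPU must satisfy the first TP's VR requirement.
bool canAdmitCluster(const std::vector<int>& freeVRsPerVEU,   // indexed by spu * veusPerSPU + veu
                     const std::vector<int>& freeTCSMPerSPU,  // indexed by spu
                     int firstTPVRs, int tcsmBytes, int veusPerSPU) {
    for (int spu = 0; spu < static_cast<int>(freeTCSMPerSPU.size()); ++spu) {
        if (freeTCSMPerSPU[spu] < tcsmBytes) continue;  // shared resource must fit here
        for (int veu = 0; veu < veusPerSPU; ++veu) {
            if (freeVRsPerVEU[spu * veusPerSPU + veu] >= firstTPVRs)
                return true;  // first TP fits, so the whole TC can be split in turn
        }
    }
    return false;  // suspend the split until earlier TPs release resources
}
```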
When the SCU fetches an instruction from memory, decodes it, and determines from the decoding result that the instruction is the last instruction of the kernel program (i.e., the END instruction), the SCU notifies the SRM to release all resources occupied by the TP once the END instruction has executed. After obtaining the SPU number, VEU number, and TP number of the TP from the SCU, the SRM uses these three numbers to index the memory storing the attributes of the hardware resources used by the TP, obtains those attributes, and then releases the hardware resources and updates their mask matrices accordingly (i.e., marks the hardware resources as unused in the corresponding mask tables of the resource mask matrices). For the shared storage resource, the SRM releases it according to the attributes stored in memory only when the SCU notifies the SRM that the TP that just completed is the last TP of the entire TC. The condition under which all TPs of a TC can be allocated consecutively is that every kind of hardware resource required by the first TP is satisfied; if any resource fails the condition, none of the TPs of the entire TC can be allocated.
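The release path on END might look like the following minimal sketch: the (SPU, VEU, TP) numbers index a store of per-TP attributes, private spans are always cleared in the mask tables, and the TCSM span is cleared only for the last TP of the TC. Sizes, key layout, and names are assumptions for illustration.

```cpp
#include <bitset>
#include <map>
#include <tuple>

constexpr int kVRsPerVEU = 100;    // assumed, as in the earlier example
constexpr int kTCSMUnits = 4096;   // assumed allocation units per SPU

struct TPAttributes {
    int vrBase, vrCount;      // private VR span on the VEU
    int tcsmBase, tcsmUnits;  // shared TCSM span on the SPU
    bool lastInCluster;
};

using TPKey = std::tuple<int, int, int>;  // (SPU number, VEU number, TP number)

void onEndInstruction(std::map<TPKey, TPAttributes>& attrStore,
                      std::bitset<kVRsPerVEU>& vrMask,
                      std::bitset<kTCSMUnits>& tcsmMask,
                      const TPKey& key) {
    TPAttributes a = attrStore.at(key);  // attributes stored at allocation time
    // Private resources are released whenever a TP finishes.
    for (int i = 0; i < a.vrCount; ++i) vrMask[a.vrBase + i] = false;
    // The shared TCSM is released only by the last TP of the TC
    // (unless an early release already freed it; see below).
    if (a.lastInCluster) {
        for (int i = 0; i < a.tcsmUnits; ++i) tcsmMask[a.tcsmBase + i] = false;
    }
    attrStore.erase(key);
}
```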
In the normal execution mode, the private storage resources and the shared storage resources are released together only after the last TP of the TC finishes executing the kernel program. This greatly reduces task parallelism and extends the total time needed to execute all tasks.
At least one embodiment of the present disclosure provides a thread scheduling method, including: performing an early-release operation on the shared storage resource of a first thread group. The early-release operation includes: in a case where the thread subgroup currently running in the processing unit is determined to be the last thread subgroup of the first thread group, releasing the releasable shared storage resource occupied by the first thread group.
At least one embodiment of the present disclosure also provides a processor applying the above thread scheduling method and an electronic device including the processor.
The thread scheduling method provided by embodiments of the present disclosure can release the shared storage resource occupied by a thread group early, while the thread group is still running, so that the freed resource can be allocated to other thread groups and resources can be allocated ahead of time, thereby improving execution efficiency and shortening the time needed to complete the overall task. For example, in some examples, since the last TP of a TC no longer needs to share data with other TPs while executing the kernel program, the shared storage resource can be released early once its data has been loaded, and the freed resource can be allocated to other thread groups, improving task parallelism.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
FIG. 3A shows a timing diagram of TP allocation and kernel program execution.
As shown in fig. 3A, the abscissa represents time and the ordinate represents the task being performed. In this example there are 2 TCs in total (TC0 and TC1); TC0 contains 3 TPs (TC0_TP0, TC0_TP1, and TC0_TP2) and TC1 contains 3 TPs (TC1_TP0, TC1_TP1, and TC1_TP2), each of which executes its own kernel program. First, TC0_TP0 is created and allocated; when its creation and allocation are complete, TC0_TP0 starts executing the kernel program while TC0_TP1 is created and allocated. When the creation and allocation of TC0_TP1 are complete, TC0_TP1 starts executing the kernel program while TC0_TP2 is created and allocated; after the creation and allocation of TC0_TP2 are complete, TC0_TP2 starts executing the kernel program. For the shared storage resource TCSM, the data of the TCSM is loaded at the beginning of the kernel program executed by the last TP of TC0 (i.e., TC0_TP2); only after TC0_TP2 finishes executing the kernel program is the TCSM resource released, at which point TC1_TP0 is created and allocated. The timing with which the 3 TPs of TC1 execute their tasks is similar to that of the 3 TPs of TC0 and is not repeated here.
In fig. 3A, the data of the TCSM has already been loaded at the beginning of the kernel program executed by TC0_TP2. As noted above, the last TP of a TC does not need to share data with other TPs while executing the kernel program, so the TCSM can be released as soon as the data in the TCSM has been loaded.
With the thread scheduling method provided by embodiments of the present disclosure, the TCSM is released as soon as the TCSM data has been loaded, which yields the new timing diagram shown in fig. 3B.
Fig. 3B illustrates a timing diagram of TP allocation and kernel program execution provided in at least one embodiment of the present disclosure.
As shown in fig. 3B, TC0_TP0 is first created and allocated; when its creation and allocation are complete, TC0_TP0 starts executing the kernel program while TC0_TP1 is created and allocated. When the creation and allocation of TC0_TP1 are complete, TC0_TP1 starts executing the kernel program while the last TP of TC0 (TC0_TP2) is created and allocated; after the creation and allocation of TC0_TP2 are complete, TC0_TP2 starts executing the kernel program. The loading of the TCSM data completes at the beginning of TC0_TP2's execution of the kernel program, and the TCSM resource is released immediately after the loading completes. For example, the TCSM resource can be used for early allocation of the next TC after it is released. Thus, upon completion of the TCSM resource release, the first TP of TC1 (i.e., TC1_TP0) can begin being created and allocated; the subsequent process is similar to fig. 3A and is not repeated here.
Comparing the timing diagrams of figs. 3A and 3B shows that, by releasing the no-longer-needed TCSM resource early, the next TC can be allocated ahead of time, shortening the time needed to complete the whole task; the saved time is marked in the figure. The thread scheduling method corresponding to fig. 3B is further described below.
Fig. 4A illustrates a flowchart of a thread scheduling method according to at least one embodiment of the present disclosure.
As shown in fig. 4A, the thread scheduling method includes steps S401 to S403.
Step S401: in response to a shared-storage-resource early-release instruction for the first thread group, performing the early-release operation on the shared storage resource of the first thread group.
For example, the shared storage resource may be the TCSM described above, and the shared-storage-resource early-release instruction (which may be denoted TCSM_EARLY_RELEASE, for example) may be fetched from memory. For example, the memory may be a dynamic random access memory (DRAM) or another type of memory; the embodiments are not limited in this respect. The shared storage resource is not limited to the TCSM and may be any other type of storage resource: any resource shared within the same TC can serve as the shared storage resource.
In some embodiments of the present disclosure, in step S401, responding to the shared-storage-resource early-release instruction for the first thread group may include: acquiring state information of the thread subgroup currently running in the processing unit, and determining, based on the state information, whether the thread subgroup currently running in the processing unit is the last thread subgroup of the first thread group.
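A minimal sketch of this check follows, with field names that are assumptions standing in for the indication and identification information described below.

```cpp
// State information of the running thread subgroup.
struct SubgroupState {
    bool isLastInGroup;  // indication information
    int spu, veu, tp;    // identification information (resource identifier)
};

// The early-release instruction takes effect only for the last subgroup
// of the first thread group; for any other subgroup it is ignored.
bool shouldReleaseEarly(const SubgroupState& state) {
    return state.isLastInGroup;
}
```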
For example, the processing unit may be the SPU 100 shown in fig. 1; the first thread group may be TC0 shown in figs. 3A and 3B, whose thread subgroups are TC0_TP0, TC0_TP1, and TC0_TP2, with TC0_TP2 being the last; and the first thread group may run on one VEU 101 in the SPU 100.
For example, the state information of the thread subgroup includes indication information and identification information, the indication information indicates whether the thread subgroup is the last thread subgroup in the first thread group, and the identification information indicates the resource identifier corresponding to the thread subgroup.
For example, the identification information includes the numbers of the SPU and the VEU on which the thread subgroup resides, together with the number of the thread subgroup.
After step S401 is performed, step S402 may be performed.
Step S402: performing the early-release operation on the shared storage resource of the first thread group.
In some embodiments of the present disclosure, step S402 may include: in a case where the thread subgroup currently running in the processing unit is determined to be the last thread subgroup of the first thread group, releasing the releasable shared storage resource occupied by the first thread group.
Here, the shared storage resource occupied by the first thread group being "releasable" means that, after the last thread subgroup of the first thread group has finished loading the shared storage resource, the shared storage resource occupied by the first thread group is in a releasable state.
In some embodiments of the present disclosure, releasing the releasable shared storage resource occupied by the first thread group may include: obtaining, according to the identification information, the information of the shared storage resource to be released, and releasing the releasable shared storage resource occupied by the first thread group according to that information.
For example, the information of the shared storage resource includes an address and a size of the shared storage resource (e.g., TCSM).
For example, the address and size of the shared storage resource (e.g., the TCSM) may be obtained by using the numbers of the SPU and the VEU on which the thread subgroup resides, together with the number of the thread subgroup, to index the memory in which the allocated attributes of the thread subgroup are stored.
For example, releasing the releasable shared storage resource occupied by the first thread group according to the information of the shared storage resource includes: obtaining the mask table corresponding to the shared storage resource from the resource mask matrix, and updating the state of the shared storage resource in the mask table to an available state.
For example, the resource mask matrix may be the TCSM mask matrix shown in fig. 2C: the mask table corresponding to the shared storage resource is fetched from the TCSM mask matrix using the SPU's number, and the TCSM space to be released is set to an available state in that mask table according to the address and size of the TCSM.
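The mask update of step S402 might look like the following minimal sketch: the SPU number selects that SPU's row of the TCSM mask matrix, and the span given by the stored address and size is marked available. The allocation granularity and all names are assumptions for illustration.

```cpp
#include <bitset>
#include <vector>

constexpr int kTCSMUnits = 4096;                // assumed units per SPU
using TCSMMaskTable = std::bitset<kTCSMUnits>;  // 1 = in use, 0 = available

// Release a TCSM span [base, base + size) on the given SPU by updating
// its mask table in the TCSM mask matrix to the available state.
void releaseTCSM(std::vector<TCSMMaskTable>& tcsmMaskMatrix,
                 int spuNumber, int base, int size) {
    TCSMMaskTable& table = tcsmMaskMatrix[spuNumber];
    for (int i = 0; i < size; ++i) {
        table[base + i] = false;  // immediately allocatable by the next TC
    }
}
```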
After performing step S402, step S403 may be performed.
Step S403: the released shared memory resources previously occupied by the first thread group and releasable are allocated to a second thread group currently making a resource allocation request.
In some embodiments of the present disclosure, the last thread sub-group in the first thread group is still running while the freed shared memory resources previously occupied by the first thread group and releasable are allocated to the second thread group.
For example, the first thread group may be the TC0 shown in fig. 3A and 3B, and the second thread group may be the TC1 shown in fig. 3A and 3B. As shown in fig. 3B, after the last thread sub-group (TC0_ TP2) in the first thread group finishes loading data of the shared storage resource TCSM, the TCSM resource is released, and at this time, the shared storage resource TCSM previously occupied by TC0 is allocated to TC1, and at this time, the last thread sub-group TC0_ TP2 in the first thread group is still executing the kernel program.
In some embodiments of the present disclosure, after the shared storage resource is released early, the thread scheduling method may further include: performing an end operation of the first thread group and releasing the private storage resource occupied by the first thread group.
At this point only the private storage resource needs to be released, not the shared storage resource, because the shared storage resource (e.g., the TCSM resource) originally belonging to the first thread group has already been released, so releasing it again is unnecessary. For example, the previously freed shared storage resource may already have been allocated to a subsequent thread group.
In some embodiments of the present disclosure, performing the end operation of the first thread group includes: ending execution of the first thread group in response to a thread group end instruction for the first thread group.
For example, the thread group end instruction (e.g., the END instruction) is an instruction of the kernel assembly program, and its instruction format may be the same as that of the shared-storage-resource early-release instruction (TCSM_EARLY_RELEASE) described above, so as to reduce changes to the hardware design. Of course, the embodiments of the present disclosure are not limited to this: the thread group end instruction and the early-release instruction may also adopt different instruction formats to increase design flexibility, as determined by actual requirements.
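For illustration only, a decode stage sharing one format between the two instructions might look like the sketch below; the field layout and opcode values are invented, since the patent does not specify an encoding.

```cpp
#include <cstdint>

enum class Opcode : uint8_t { TCSM_EARLY_RELEASE = 0x1, END = 0x2 };

// Both instructions carry the same identification fields; only the
// opcode decides whether the SRM frees the TCSM early or releases the
// remaining resources at END.
struct DecodedInstr {
    Opcode op;
    uint8_t spu, veu, tp;
};

DecodedInstr decode(uint32_t raw) {
    return {
        static_cast<Opcode>(raw >> 24),            // assumed opcode field
        static_cast<uint8_t>((raw >> 16) & 0xFF),  // SPU number
        static_cast<uint8_t>((raw >> 8) & 0xFF),   // VEU number
        static_cast<uint8_t>(raw & 0xFF),          // TP number
    };
}
```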
Fig. 4B is a flowchart illustrating an exemplary thread scheduling method according to some embodiments of the present disclosure.
As shown in FIG. 4B, the SCU first fetches an instruction from memory (e.g., DRAM) and decodes it.
Then the SCU determines, from the decoding result, whether the fetched instruction is the shared-storage-resource early-release instruction (the TCSM_EARLY_RELEASE instruction). If decoding identifies a TCSM_EARLY_RELEASE instruction, the SCU sends TCSM release information to the SRM. At the same time, the SCU also sends to the SRM the numbers of the SPU and the VEU on which the TP resides and the number of the TP; all of this is packed into one set of information data and transmitted to the SRM over a predefined bus protocol.
Next, the SRM receives the TCSM release information sent by the SCU.
Then the SRM unpacks the information data and determines whether the TP currently executing the kernel program is the last TP of its TC. If it is not the last TP, the TCSM release information sent by the SCU is ignored. If the TP currently executing the kernel program is determined to be the last TP of the associated TC, the SRM uses the SPU, VEU, and TP numbers obtained from the information data to index the memory in which the allocated TP attributes are stored, thereby obtaining the address and size of the TCSM. The SRM then updates the TCSM mask matrix: it uses the SPU's number to fetch, from the TCSM mask matrix, the mask table corresponding to the TCSM occupied by that SPU, and sets the TCSM space to be released to an available state in that mask table.
At this point, since the TP has not yet finished executing the kernel program, the remaining hardware resources occupied by the TP cannot be released. However, if the next TC to be allocated was blocked only because its TCSM requirement could not be met, the TCSM released early may satisfy that requirement, so that the TC can be allocated ahead of time.
After the shared storage resource has been released early, the first thread group is ended and the private storage resources occupied by it are released. That is, if the instruction fetched by the SCU is decoded as an END instruction, the SCU sends resource release information; after receiving it, the SRM uses the SPU, VEU, and TP numbers obtained from the information data to index the memory storing the allocated TP attributes, thereby obtaining the addresses and sizes of the hardware resources other than the TCSM. The SRM then updates the mask matrices of those hardware resources: using the VEU and TP numbers, it fetches the corresponding mask tables from the mask matrices of the other hardware resources (e.g., the VR mask matrix and the SR mask matrix) and sets the space to be released to an available state in those tables. Since the TCSM has already been released early, only the mask matrices of the hardware resources other than the TCSM need to be updated at this point.
In the thread scheduling method provided by embodiments of the present disclosure, resource release proceeds in two steps: the shared storage resource is released early, and the private storage resources are released afterwards. Releasing resources that are no longer needed ahead of time allows the next group of TPs to be allocated ahead of time, shortening the total time to complete the entire task. Because the shared storage resource is released early, the next TC can be allocated earlier than in the conventional mode, and its allocation time overlaps with the previous TC's kernel execution time, improving task parallelism and shortening the total execution time of the whole task.
It should be noted that, in the embodiment of the present disclosure, the thread scheduling method may include more or fewer steps, and is not limited to the steps described above, and the execution order of each step is not limited, which may be determined according to actual needs.
Fig. 5 illustrates a schematic diagram of a processor 500 provided by at least one embodiment of the present disclosure.
As shown in fig. 5, processor 500 includes a processing unit 501 and a resource manager 502.
The processing unit 501 is configured to execute at least one thread group, where a thread group includes at least one thread subgroup. The processing unit 501 includes a control unit 503, a plurality of vector processing units 504, and a shared memory 505. The control unit 503 is configured to receive the shared-memory-resource early-release instruction and to provide the resource manager 502 with the state information of the thread subgroup currently running in the processing unit 501 together with the early-release instruction. Each vector processing unit 504 includes VRs and SRs, which are provided to the thread subgroups as private storage resources. The shared memory 505 is provided to the thread groups as the shared memory resource.
The resource manager 502 is configured to release the releasable shared memory resource occupied by a thread group in a case where the thread subgroup currently running in the processing unit 501 is determined to be the last thread subgroup of that thread group. The resource manager 502 is further configured to make this determination according to the state information of the thread subgroup running in the processing unit 501 and the shared-memory-resource early-release instruction.
The processing unit 501 is, for example, the SPU 100 shown in fig. 1; the resource manager 502 is, for example, the SRM described above; the control unit 503 is, for example, the SCU 103 shown in fig. 1; the vector processing units 504 are, for example, the VEUs 101 shown in fig. 1; and the shared memory 505 is, for example, the TCSM 102 shown in fig. 1. For a detailed description of each unit or module, reference may be made to the foregoing description, which is not repeated here. For example, the processor 500 may be any type of processor, such as a CPU or a GPU, and may include further units and modules to implement its processing and computing functions. For the technical effects of the processor 500, reference may be made to the above description of the thread scheduling method, which is likewise not repeated here.
An embodiment of a thread scheduling method provided by at least one embodiment of the present disclosure is briefly described below with reference to fig. 5.
The resource manager 502 is configured to split a thread group into individual thread subgroups and assign each thread subgroup to a different vector processing unit 504, while allocating the corresponding hardware resources for each thread subgroup. A GPU kernel program is fetched and decoded by the control unit 503 and then executed, with the thread subgroup as the smallest unit, on the vector processing units 504 of the processing unit 501 under the management of the resource manager 502. The control unit 503 determines, from the decoding result, whether the fetched instruction is the shared-memory-resource early-release instruction; if so, it sends shared-memory-resource release information to the resource manager 502. At the same time, the control unit 503 also sends to the resource manager 502 the numbers of the processing unit 501 and of the vector processing unit 504 on which the thread subgroup resides, together with the number of the thread subgroup, all packed into one set of information data.
Next, the resource manager 502 receives the shared-memory-resource release information sent by the control unit 503. The resource manager 502 then unpacks the information data and determines whether the thread subgroup currently executing the kernel program is the last thread subgroup of its thread group. If it is, the resource manager 502 uses the numbers of the processing unit 501, the vector processing unit 504, and the thread subgroup obtained from the information data to index the memory, thereby obtaining the address and size of the shared memory resource, and finally updates the TCSM mask matrix. After the shared memory resource has been released early, the thread group is ended and the private storage resources it occupies (e.g., the VR and SR resources) are released.
Fig. 6A is a schematic structural diagram of an electronic device 600 according to at least one embodiment of the present disclosure.
As shown in fig. 6A, the electronic device 600 includes the processor 500 shown in fig. 5. The electronic device shown in fig. 6A is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
Fig. 6B is a schematic structural diagram of another electronic device 700 according to at least one embodiment of the present disclosure. The electronic device 700 in embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., car navigation terminals), as well as stationary terminals such as digital TVs and desktop computers. The electronic device 700 shown in fig. 6B is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
For example, as shown in fig. 6B, in some examples the electronic device 700 includes a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from the storage device 608 into a random access memory (RAM) 603. For example, the processing device 601 may be the processor 500 described above. The RAM 603 also stores various programs and data necessary for the operation of the computer system. The processing device 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
For example, the following components may be connected to the I/O interface 605: an input device 606 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 607 such as a liquid crystal display (LCD), speaker, or vibrator; a storage device 608 such as a tape or hard disk; and a communication device 609 including a network interface card such as a LAN card or modem. The communication device 609 allows the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data, performing communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory is mounted on the drive 610 as necessary, so that a computer program read from it can be installed into the storage device 608 as needed. Although fig. 6B shows an electronic device 700 including various means, it should be understood that not all of the illustrated means need be implemented or included; more or fewer means may alternatively be implemented or included.
For example, the electronic device 700 may further include a peripheral interface (not shown in the figure) and the like. The peripheral interface may be any type of interface, such as a USB interface or a Lightning interface. The communication device 609 may communicate with networks, such as the Internet, intranets, and/or wireless networks such as cellular telephone networks, wireless local area networks (LANs), and/or metropolitan area networks (MANs), and with other devices via wireless communication. The wireless communication may use any of a number of communication standards, protocols, and technologies, including but not limited to the Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), wideband code division multiple access (W-CDMA), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), voice over Internet protocol (VoIP), Wi-MAX, protocols for email, instant messaging, and/or short message service (SMS), or any other suitable communication protocol.
For example, the electronic device 700 may be any device such as a mobile phone, a tablet computer, a notebook computer, an electronic book, a game console, a television, a digital photo frame, a navigator, or any combination of electronic devices and hardware, and the embodiment of the disclosure is not limited thereto.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device.
For the present disclosure, the following points are further noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; other structures may follow common designs.
(2) In the absence of conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The above is only a description of specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A thread scheduling method, comprising:
performing an early-release operation on a shared storage resource of a first thread group, wherein the early-release operation comprises:
in a case where a thread subgroup currently running in a processing unit is determined to be a last thread subgroup of the first thread group, releasing a releasable shared storage resource occupied by the first thread group.
2. The thread scheduling method of claim 1, further comprising:
allocating the freed shared storage resource previously occupied by the first thread group to a second thread group currently making a resource allocation request.
3. The thread scheduling method of claim 2, wherein the last thread subgroup of the first thread group is still running while the freed shared storage resource previously occupied by the first thread group is allocated to the second thread group.
4. The thread scheduling method of claim 1, further comprising:
performing the early-release operation on the shared storage resource of the first thread group in response to a shared-storage-resource early-release instruction for the first thread group.
5. The thread scheduling method of claim 4, wherein responding to the shared-storage-resource early-release instruction for the first thread group comprises:
acquiring state information of the thread subgroup currently running in the processing unit; and
determining, based on the state information, whether the thread subgroup currently running in the processing unit is the last thread subgroup of the first thread group.
6. The thread scheduling method of claim 5, wherein
the state information of the thread subgroup comprises indication information and identification information,
the indication information indicates whether the thread subgroup is the last thread subgroup of the first thread group, and
the identification information indicates a resource identifier corresponding to the thread subgroup.
7. The thread scheduling method of claim 6, wherein releasing the releasable shared storage resource occupied by the first thread group comprises:
obtaining, according to the identification information, information of the shared storage resource to be released, and releasing the releasable shared storage resource occupied by the first thread group according to the information of the shared storage resource.
8. The thread scheduling method according to claim 7, wherein releasing the releasable shared storage resource occupied by the first thread group according to the information of the shared storage resource comprises:
obtaining a mask table corresponding to the shared storage resource from a resource mask matrix, and updating the state of the shared storage resource in the mask table to an available state.
9. The thread scheduling method of claim 7, wherein
the information of the shared storage resource comprises an address and a size of the shared storage resource.
10. The thread scheduling method according to any one of claims 1-9, further comprising, after performing the early-release operation on the shared storage resource:
performing an end operation of the first thread group, and releasing a private storage resource occupied by the first thread group.
11. The thread scheduling method of claim 10, wherein performing the end operation of the first thread group comprises:
ending execution of the first thread group in response to a thread group end instruction for the first thread group.
12. A processor, comprising:
a processing unit configured to execute at least one thread group, wherein the thread group comprises at least one thread subgroup; and
a resource manager configured to release a releasable shared storage resource occupied by the thread group in a case where a thread subgroup currently running in the processing unit is determined to be a last thread subgroup of the thread group.
13. The processor of claim 12, wherein the processing unit comprises:
a control unit configured to receive a shared-storage-resource early-release instruction and to provide state information of the thread subgroup currently running in the processing unit, together with the shared-storage-resource early-release instruction, to the resource manager,
wherein the resource manager is further configured to release the releasable shared storage resource occupied by the thread group in a case where, according to the state information and the shared-storage-resource early-release instruction, the thread subgroup currently running in the processing unit is determined to be the last thread subgroup of the thread group.
14. The processor of claim 12, wherein
the processing unit further comprises a plurality of vector processing units and a shared memory;
each vector processing unit comprises a vector register and a scalar register, which are provided to each thread subgroup as private storage resources; and
the shared memory is provided to the thread group as the shared storage resource.
15. An electronic device comprising a processor according to any of claims 12-14.
CN202111565212.0A 2021-12-20 2021-12-20 Thread scheduling method, processor and electronic device Pending CN114168301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565212.0A CN114168301A (en) 2021-12-20 2021-12-20 Thread scheduling method, processor and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111565212.0A CN114168301A (en) 2021-12-20 2021-12-20 Thread scheduling method, processor and electronic device

Publications (1)

Publication Number Publication Date
CN114168301A 2022-03-11

Family

ID=80487584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111565212.0A Pending CN114168301A (en) 2021-12-20 2021-12-20 Thread scheduling method, processor and electronic device

Country Status (1)

Country Link
CN (1) CN114168301A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115016918A (en) * 2022-03-31 2022-09-06 中国科学院计算技术研究所 Data processing method for computing equipment of data flow architecture
CN115016918B (en) * 2022-03-31 2024-06-11 中国科学院计算技术研究所 Data processing method for computing device of data flow architecture
JP7448585B2 (en) 2022-05-31 2024-03-12 トヨタ自動車株式会社 Information processing device, information processing method, and information processing program
CN117009054A (en) * 2023-07-27 2023-11-07 北京登临科技有限公司 SIMT device, thread group dynamic construction method and processor


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination