CN114153500A - Instruction scheduling method, instruction scheduling device, processor and storage medium - Google Patents


Info

Publication number
CN114153500A
Authority
CN
China
Prior art keywords
instruction
thread
thread bundle
fetch request
request
Prior art date
Legal status
Pending
Application number
CN202111462823.2A
Other languages
Chinese (zh)
Inventor
喻琛
左航
潘于
Current Assignee
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202111462823.2A priority Critical patent/CN114153500A/en
Publication of CN114153500A publication Critical patent/CN114153500A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

An instruction scheduling method, an instruction scheduling apparatus, a processor and a storage medium. The instruction scheduling method includes: selecting a first instruction fetch request, initiated by a first thread bundle for a first instruction address, and performing the instruction fetch operation for the first instruction address; receiving first instruction data corresponding to the first fetch request and returned from the first instruction address; and, in response to a second fetch request initiated by a second thread bundle for fetching the same first instruction address, broadcasting the first instruction data in a first clock cycle to the write address of the instruction data access area of the first thread bundle and the write address of the instruction data access area of the second thread bundle. The instruction scheduling method can reduce the computing unit's instruction-fetch accesses to the instruction cache (or other cache systems, such as a multi-level cache), lowering the fetch bandwidth consumed on those caches and, in turn, relieving the bandwidth pressure on the data cache (or other cache systems, such as a multi-level cache) that supplies the data needed for instruction execution.

Description

Instruction scheduling method, instruction scheduling device, processor and storage medium
Technical Field
Embodiments of the present disclosure relate to an instruction scheduling method, an instruction scheduling apparatus, a processor, and a storage medium.
Background
A general-purpose computing graphics processing unit (GPGPU) is a type of GPU oriented toward general-purpose computing rather than graphics rendering. It contains a large number of computing units, such as streaming multiprocessors (SMs), which can run independently, so the GPGPU offers a high degree of parallelism.
FIG. 1 shows an architecture diagram of a general-purpose graphics processor (GPGPU). In parallel computing, a computing task is typically performed by multiple threads that share an instruction stream. Before the threads are executed in the general-purpose graphics processor (also called a parallel computing processor), a thread block scheduling device divides them into multiple thread blocks; different computing tasks correspond to different thread blocks, and the thread blocks belonging to the same computing task execute the same kernel (the GPGPU-executable program code), only the data operated on by each thread within a thread block differing. The thread blocks are then distributed to the respective compute units (CUs), e.g., streaming multiprocessors (SMs), by a thread block distribution device; different thread blocks may be executed in the same compute unit or in different compute units, but all threads of one thread block must be allocated to the same compute unit for execution. Meanwhile, a thread block is split into minimum execution units called thread bundles (or simply bundles), each containing a fixed number of threads (or fewer than that number), for example 32 threads. When multiple thread blocks execute in the same compute unit, the thread bundles in that compute unit may come from the same thread block or from different thread blocks, and all threads in the same thread bundle execute in SIMD (Single Instruction Multiple Data) fashion.
For example, each compute unit shown in FIG. 1 includes an instruction scheduling device (also called a thread scheduling/dispatching module) and multiple compute cores, where the instruction scheduling device may include an instruction data access area, e.g., a random access memory (RAM) for temporarily storing instruction data, and each compute core includes a register file. The instruction cache and the data cache corresponding to each compute unit may be L1 caches outside the compute unit; for example, a small number of compute units may share the same instruction cache and the same data cache, and the instruction caches and data caches of different compute units may also share a next-level cache. The instruction scheduling device performs fetching, decoding, scheduling, dispatching and related functions for the thread bundles running on the compute unit, so that the unit's compute cores, e.g., stream processors (SPs), run the thread bundles. For example, each compute core includes an arithmetic logic unit (ALU), a floating-point compute unit, and the like. Depending on the number of compute cores in a compute unit, the thread bundles of one thread block may execute simultaneously or in a time-shared manner. The threads of a thread bundle execute the same instruction, and the result produced by the instruction is written back to the corresponding registers of the thread bundle.
For example, as shown in FIG. 2, the pipeline operation of the GPGPU includes: instruction fetch (IF), decode (ID), execute (EX), memory access (MEM) and write-back (WB, i.e., updating the result produced by the executed instruction into the register). Each thread bundle has a program counter (PC) that records the address of the next instruction to be executed by the bundle, i.e., the fetch address; the value in the program counter indicates the location of the next instruction in main memory. When an instruction is fetched, the program counter is incremented automatically, and after the instruction has executed and its result has been written back, the next instruction address is taken from the program counter. Instruction fetch is the first stage of the pipeline; the fetched instruction data is fed to the subsequent stages for processing, which realizes the operation of the whole compute unit.
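The per-bundle program counter behavior described above can be sketched as follows. This is an illustrative model, not the patent's implementation; the 8-byte instruction width and the class name are our assumptions.

```python
# Hypothetical sketch: a per-thread-bundle program counter that records the
# next fetch address and auto-increments on each fetch.
INSTR_SIZE = 8  # assumed instruction width in bytes (for illustration only)

class WarpPC:
    """Program counter for one thread bundle (warp)."""
    def __init__(self, start_addr=0):
        self.pc = start_addr

    def fetch_addr(self):
        """Return the current fetch address and auto-increment the PC."""
        addr = self.pc
        self.pc += INSTR_SIZE
        return addr

pc = WarpPC(start_addr=0x100)
addrs = [pc.fetch_addr() for _ in range(3)]
print(addrs)  # three consecutive fetch addresses
```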
For example, as shown in FIG. 3, when the GPGPU includes computing units 0 to N, the instruction scheduling device in each compute unit issues the fetch request (including an instruction address) of a certain thread bundle to the instruction cache. If instruction data corresponding to the instruction address exists in the instruction cache, the shared cache need not be accessed; if not, the shared cache is accessed, or the request is further forwarded to the unified cache for the fetch operation, and so on. The data needed by an instruction's computation is first sought in the data cache, and if it is absent there, the shared cache is accessed, or further the unified cache. The fetch operations and the data-fetch operations of the thread bundles therefore compete for bandwidth when accessing the shared cache.
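The lookup order just described (per-unit instruction cache, then shared cache, then unified cache) can be modeled as below. This is our illustration under simplifying assumptions: caches are dictionaries, fills happen on every miss, and the unified cache is assumed to always hold the address.

```python
# Minimal model of the fetch lookup order: instruction cache -> shared
# cache -> unified (next-level) cache. Not the patent's design.
def fetch(addr, icache, shared, unified):
    """Return (level_hit, data) for a fetch at `addr`, filling lower levels."""
    if addr in icache:
        return "icache", icache[addr]
    if addr in shared:
        icache[addr] = shared[addr]          # fill the instruction cache on return
        return "shared", shared[addr]
    data = unified[addr]                     # assumed to always reside here
    shared[addr] = data
    icache[addr] = data
    return "unified", data

icache, shared = {}, {}
unified = {0x40: "add r1, r2, r3"}
r1 = fetch(0x40, icache, shared, unified)   # misses down to the unified cache
r2 = fetch(0x40, icache, shared, unified)   # now hits the instruction cache
print(r1, r2)
```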
Disclosure of Invention
At least one embodiment of the present disclosure provides an instruction scheduling method, including: selecting a first instruction fetching request initiated by a first thread bundle and aiming at a first instruction address, and carrying out instruction fetching operation on the first instruction address; receiving first instruction data corresponding to the first instruction fetch request returned from the first instruction address; in response to a second instruction fetch request initiated by a second thread bundle for fetching the first instruction address, broadcasting the first instruction data in a first clock cycle and sending the first instruction data to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle.
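The broadcast step above can be sketched as a small simulation. This is our reading under simplifying assumptions: each pending request carries a (bundle id, instruction address, write address) triple, and every pending request for the returned address is served in the same cycle. The names are ours, not the patent's.

```python
# Sketch of the claimed broadcast: when instruction data returns for one
# fetch request, write it into the instruction data access area of every
# thread bundle waiting on the same instruction address.
def broadcast_return(returned_addr, data, pending, access_areas):
    """Serve every pending request for `returned_addr`; return served ids."""
    served, still_pending = [], []
    for bundle_id, instr_addr, write_addr in pending:
        if instr_addr == returned_addr:
            access_areas[bundle_id][write_addr] = data  # one write per bundle, same cycle
            served.append(bundle_id)
        else:
            still_pending.append((bundle_id, instr_addr, write_addr))
    pending[:] = still_pending
    return served

pending = [(0, 0x40, 2), (1, 0x40, 5), (2, 0x80, 0)]  # bundles 0 and 1 want 0x40
areas = {0: {}, 1: {}, 2: {}}
served = broadcast_return(0x40, "mul r0, r1", pending, areas)
print(served, pending)
```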
For example, some embodiments of the present disclosure provide an instruction scheduling method further including:
canceling the second instruction fetch request.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, in the case that the second instruction fetch request is a candidate and is selected in the first clock cycle, the instruction fetch operation corresponding to the second instruction fetch request is cancelled.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, the second instruction fetch request is a candidate in the first clock cycle and is ignored.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, the first instruction fetch request and the second instruction fetch request are both candidates in a second clock cycle; the first instruction fetch request is selected and the second instruction fetch request is ignored, where the first clock cycle is after the second clock cycle.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, the second fetch request is ignored in one or more intermediate clock cycles between the second clock cycle and the first clock cycle.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, in the second clock cycle, a third instruction fetch request initiated by a third thread bundle for fetching the first instruction address is a candidate but is ignored; in a third clock cycle between the second clock cycle and the first clock cycle, the third fetch request is a candidate and is selected for the fetch operation for the first instruction address.
For example, some embodiments of the present disclosure provide an instruction scheduling method, in which the first thread bundle and the second thread bundle belong to the same thread block or belong to different thread blocks.
For example, some embodiments of the present disclosure provide an instruction scheduling method further including:
recording attribute information of each instruction fetching request, wherein the attribute information comprises a thread block number of a thread block to which a thread bundle initiating the instruction fetching request belongs and a thread bundle number of the thread bundle initiating the instruction fetching request in the thread block to which the thread bundle belongs.
For example, some embodiments of the present disclosure provide an instruction scheduling method further including: in response to the returned first instruction data, acquiring the thread block number and the thread bundle number corresponding to the first thread bundle according to the attribute information of the first instruction fetch request; and determining the second instruction fetch request based on the thread block number and thread bundle number corresponding to the first thread bundle and the thread block numbers and thread bundle numbers included in the attribute information of each of the candidate instruction fetch requests.
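The determination step can be sketched as a scan over the candidate requests' attribute records. This is an illustrative reading, not the patent's circuit: each record carries the block number, bundle number and program counter, and a candidate from a different thread bundle whose recorded fetch address equals the first request's address is a broadcast target. Field names are our assumptions.

```python
# Sketch: find the candidate fetch requests that should receive the
# broadcast of the first request's returned instruction data.
def find_broadcast_targets(first_req, candidates):
    """Return candidates from other bundles fetching the same address."""
    targets = []
    for req in candidates:
        same_bundle = (req["block_id"] == first_req["block_id"]
                       and req["bundle_id"] == first_req["bundle_id"])
        if not same_bundle and req["pc"] == first_req["pc"]:
            targets.append(req)
    return targets

first = {"block_id": 1, "bundle_id": 0, "pc": 0x40}
cands = [
    {"block_id": 1, "bundle_id": 1, "pc": 0x40},  # same block, same address -> target
    {"block_id": 2, "bundle_id": 0, "pc": 0x40},  # other block, same address -> target
    {"block_id": 1, "bundle_id": 2, "pc": 0x80},  # different address -> not a target
]
targets = find_broadcast_targets(first, cands)
print(len(targets))
```

Note that, consistent with the description, bundles of a different thread block may also qualify, since the match is on the fetch address rather than the block number.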
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, determining the second instruction fetch request based on the thread block number and thread bundle number corresponding to the first thread bundle and the thread block numbers and thread bundle numbers included in the attribute information of each of the candidate instruction fetch requests includes: generating a broadcast mask, where the broadcast mask includes information bits corresponding to the second fetch request, and the broadcast mask is used to broadcast the first instruction data in the first clock cycle to the write address of the instruction data access area of the second thread bundle.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, the broadcast mask further includes information bits corresponding to the first instruction fetch request, and the broadcast mask is used to broadcast the first instruction data in the first clock cycle to the write address of the instruction data access area of the first thread bundle and the write address of the instruction data access area of the second thread bundle.
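A broadcast mask of this kind can be sketched as one bit per candidate request, set when that request should receive the returned instruction data; the mask then gates which instruction data access areas are written in the broadcast cycle. The bit layout and helper names below are our construction, not the patent's.

```python
# Sketch: build a broadcast mask over candidate fetch requests and use it
# to gate the writes into the per-bundle instruction data access areas.
def make_broadcast_mask(match_flags):
    """match_flags[i] is True if candidate i targets the returned address."""
    mask = 0
    for i, hit in enumerate(match_flags):
        if hit:
            mask |= (1 << i)
    return mask

def masked_write(mask, write_addrs, data, areas):
    """Write `data` to candidate i's area at write_addrs[i] if bit i is set."""
    for i, addr in enumerate(write_addrs):
        if mask & (1 << i):
            areas[i][addr] = data

mask = make_broadcast_mask([True, True, False])  # bits for first and second requests
areas = [{}, {}, {}]
masked_write(mask, [2, 5, 7], "ld r4, [r5]", areas)
print(bin(mask), areas)
```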
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, the attribute information of each instruction fetch request further includes a write address of an instruction data access area, and the write address of the instruction data access area of the first instruction fetch request is obtained through the attribute information of the first instruction fetch request; and acquiring the write address of the instruction data access area of the second instruction fetch request through the attribute information of the second instruction fetch request.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, the attribute information of each instruction fetch request further includes a program counter, and the first instruction address corresponding to the first instruction fetch request is obtained from the program counter in the attribute information of the first instruction fetch request.
For example, in the instruction scheduling method provided by some embodiments of the present disclosure, the attribute information of each instruction fetch request further includes the state of the instruction data access area, where the state of the instruction data access area indicates whether the instruction data access area of the thread bundle that initiated the instruction fetch request is full.
At least some embodiments of the present disclosure also provide an instruction scheduling apparatus, including: an instruction fetch arbitration unit configured to select a first instruction fetch request, initiated by a first thread bundle for a first instruction address, and perform the instruction fetch operation for the first instruction address; and an instruction pre-processing unit configured to receive first instruction data corresponding to the first fetch request and returned from the first instruction address, and, in response to a second fetch request initiated by a second thread bundle for fetching the first instruction address, broadcast the first instruction data in a first clock cycle to the write address of the instruction data access area of the first thread bundle and the write address of the instruction data access area of the second thread bundle.
For example, in the instruction scheduling apparatus provided in some embodiments of the present disclosure, the instruction preprocessing unit is further configured to cancel the second instruction fetch request.
For example, in the instruction scheduling apparatus provided by some embodiments of the present disclosure, the instruction pre-processing unit is further configured to cancel the instruction fetch operation corresponding to the second fetch request in the case that the second fetch request is a candidate and is selected in the first clock cycle.
For example, in the instruction scheduling apparatus provided by some embodiments of the present disclosure, the instruction pre-processing unit is further configured to ignore the second instruction fetch request in the case that the second instruction fetch request is a candidate in the first clock cycle.
For example, in the instruction scheduling apparatus provided in some embodiments of the present disclosure, the instruction preprocessing unit is further configured to select the first instruction fetch request and ignore the second instruction fetch request in a case where the first instruction fetch request and the second instruction fetch request are candidates in a second clock cycle, where the first clock cycle is subsequent to the second clock cycle.
For example, in the instruction scheduling apparatus provided in some embodiments of the present disclosure, the instruction preprocessing unit is further configured to ignore the second instruction fetch request in one or more intermediate operation cycles between the second clock cycle and the first clock cycle.
For example, in the instruction scheduling apparatus provided by some embodiments of the present disclosure, the instruction fetch arbitration unit is further configured to:
in the second clock cycle, in the case that a third instruction fetch request initiated by a third thread bundle for fetching the first instruction address is a candidate, ignore the third fetch request; and
in a third clock cycle between the second clock cycle and the first clock cycle, in the case that the third fetch request is a candidate, select the third fetch request to perform the instruction fetch operation for the first instruction address.
For example, in the instruction scheduling apparatus provided in some embodiments of the present disclosure, the first thread bundle and the second thread bundle belong to the same thread block or belong to different thread blocks.
For example, some embodiments of the present disclosure provide an instruction scheduling apparatus, further comprising an instruction broadcast determining unit configured to record attribute information of each instruction fetch request,
the attribute information includes the thread block number of the thread block to which the thread bundle initiating the instruction fetch request belongs and the thread bundle number of the thread bundle initiating the instruction fetch request in the thread block to which the thread bundle belongs.
For example, in an instruction scheduling apparatus provided in some embodiments of the present disclosure, the instruction broadcast determining unit is further configured to:
acquiring the returned first instruction data from the instruction preprocessing unit, and acquiring a thread block number corresponding to the first thread bundle and a thread bundle number corresponding to the first thread bundle according to the attribute information of the first instruction fetching request; and
and determining the second instruction fetching request based on the thread block number corresponding to the first thread bundle, the thread bundle number corresponding to the first thread bundle and the thread block number and the thread bundle number included in the attribute information of each of the candidate plurality of instruction fetching requests.
For example, in the instruction scheduling apparatus provided by some embodiments of the present disclosure, the instruction broadcast determining unit is further configured to generate a broadcast mask based on the thread block number and thread bundle number corresponding to the first thread bundle and the thread block numbers and thread bundle numbers included in the attribute information of each of the candidate instruction fetch requests, where the broadcast mask includes information bits corresponding to the second instruction fetch request;
the instruction pre-processing unit uses the broadcast mask to broadcast the first instruction data in the first clock cycle to the write address of the instruction data access area of the second thread bundle.
For example, in the instruction scheduling apparatus provided in some embodiments of the present disclosure, the broadcast mask further includes information bits corresponding to the first instruction fetch request, and the instruction pre-processing unit is further configured to broadcast, in a first clock cycle, the first instruction data to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle using the broadcast mask.
For example, in the instruction scheduling apparatus provided in some embodiments of the present disclosure, the attribute information of each instruction fetch request further includes a write address of the instruction data access area,
the instruction pre-processing unit is further configured to:
obtain the write address of the instruction data access area of the first thread bundle through the attribute information of the first instruction fetch request, and
obtain the write address of the instruction data access area of the second thread bundle through the attribute information of the second instruction fetch request.
For example, in some embodiments of the present disclosure, an instruction scheduling apparatus is provided, where the attribute information of each instruction fetch request further includes a program counter,
the instruction pre-processing unit is further configured to obtain the first instruction address corresponding to the first instruction fetch request from the program counter in the attribute information of the first instruction fetch request.
For example, in the instruction scheduling apparatus provided by some embodiments of the present disclosure, the attribute information of each instruction fetch request further includes the state of the instruction data access area,
where the state of the instruction data access area indicates whether the instruction data access area of the thread bundle that initiated the instruction fetch request is full.
At least one embodiment of the present disclosure further provides a processor, which includes at least one computing unit, where the computing unit includes the instruction scheduling apparatus provided in any of the above embodiments.
At least one embodiment of the present disclosure further provides an instruction scheduling apparatus, including: a memory for non-transitory storage of computer-executable instructions; and a processor for executing the computer-executable instructions, wherein when the computer-executable instructions are executed by the processor, the instruction scheduling method provided by any embodiment of the disclosure is executed.
At least one embodiment of the present disclosure further provides a non-transitory storage medium that stores non-transitory computer-executable instructions, wherein when the computer-executable instructions are executed by a computer, the instruction scheduling method provided in any embodiment of the present disclosure is performed.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a schematic diagram of a General Purpose Graphics Processor (GPGPU);
FIG. 2 is a flow diagram of a pipeline operation of a General Purpose Graphics Processor (GPGPU);
FIG. 3 is a schematic diagram of an instruction fetch flow of a computing unit;
FIG. 4 is a diagram illustrating an instruction fetch performed by an instruction dispatch device;
FIG. 5 is a diagram illustrating the operation cycles of different thread bundles in the same thread block on a computing unit;
FIG. 6 is a flowchart illustrating a method for scheduling instructions according to an embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating an instruction scheduling method applied to an instruction scheduling apparatus in a compute unit according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating an instruction broadcast mode during operation according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of another instruction broadcast mode during operation according to an embodiment of the present disclosure;
FIG. 10 is a schematic block diagram of an instruction scheduling apparatus according to an embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of another instruction scheduling apparatus according to an embodiment of the present disclosure;
fig. 12 is a schematic diagram of a non-transitory storage medium according to an embodiment of the disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The present disclosure is illustrated by the following specific examples. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components have been omitted. When any component of an embodiment of the present disclosure appears in more than one drawing, that component is represented by the same or a similar reference numeral in each drawing.
For example, when instructions are fetched by the instruction scheduling device in a GPGPU, instructions may execute faster than the data they need can be updated. Whenever data required for instruction execution is absent from the data cache, it must be fetched from the shared cache, and if it is absent from every cache level, it must be fetched from an external storage space (e.g., main memory), so the path for fetching such data is long and the latency is high. The compute unit's demand for data bandwidth is therefore much greater than its demand for instruction bandwidth, resulting in greater data bandwidth pressure.
For example, since there is only one interface between the compute cores and the instruction cache, the instruction scheduling device can fetch instructions for only one thread bundle per clock cycle while the compute cores run the thread bundles simultaneously or in a time-shared manner. As shown in FIG. 4, the instruction scheduling device includes an instruction fetch arbitration unit and an instruction pre-processing unit. The thread bundles to be run or running on the compute unit (which may belong to the same thread block or to different thread blocks) send their fetch requests to the instruction fetch arbitration unit. In each clock cycle the arbitration unit picks one thread bundle according to a selection rule and performs the fetch operation for it; the fetch requests of the other thread bundles are blocked until those bundles are selected in turn. For example, the fetch arbitration unit selects the fetch request of thread bundle 1 in thread block 1, which has the most urgent demand for instructions. The arbitration unit sends the fetch request initiated by the selected thread bundle (e.g., thread bundle 1 in thread block 1) to the instruction cache (or another cache level, such as the shared cache) to perform the fetch operation at the instruction address. The instruction pre-processing unit then receives the instruction data returned by the instruction cache (e.g., the instruction data for thread bundle 1 in thread block 1) and determines the valid portion of the returned data, for example which thread bundle the data was fetched for (here thread bundle 1 in thread block 1), the state of that bundle's instruction data access area, the write address in the instruction data access area, and so on.
Finally, the instruction data is written to the write address of the thread bundle's instruction data access area, completing the instruction fetch for that thread bundle (for example, thread bundle 1 in thread block 1); the subsequent decoding, execution, memory access, and write-back then proceed, completing the thread bundle's run on the corresponding computing unit.
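The single-interface fetch path described above can be sketched as follows. This is an illustrative model only; the function name `arbitrate` and the round-robin policy are assumptions, since the text leaves the actual selection rule open.

```python
from collections import deque

def arbitrate(pending):
    """Serve at most one pending fetch request per clock cycle.

    Round-robin is assumed here for illustration; the patent only
    states that one thread bundle is picked per cycle by some rule,
    while the remaining requests stay blocked.
    """
    if not pending:
        return None
    return pending.popleft()  # the rest remain blocked this cycle

# Three thread bundles contend for the single instruction-cache interface:
pending = deque(["warp0", "warp1", "warp2"])
served = [arbitrate(pending) for _ in range(3)]  # one bundle per cycle
```

Each call models one clock cycle, so three cycles are needed before all three bundles have been served, which is the serialization the broadcast mechanism below is designed to avoid.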
For example, fig. 5 shows a schematic cycle diagram of 4 thread bundles (thread bundle 0 to thread bundle 3) belonging to the same thread block, running from start to finish on the computing unit. Although the 4 thread bundles run continuously through the pipeline, the thread bundle that starts first does not necessarily finish first, because the data required by each thread bundle is updated at a different speed; that is, the overall running time of each thread bundle is not necessarily the same. Therefore, different thread bundles may access the same instruction address of the instruction cache in the same clock cycle, or in different clock cycles that are only a small number of cycles apart. Since the kernel instruction code executed by the 4 thread bundles belonging to the same thread block is the same, the instruction data fetched by one of the thread bundles can be broadcast to the remaining thread bundles that are fetching, or are about to fetch, the same instruction data.
In addition, a parallel computing task on the CPU side may be divided among multiple thread blocks that complete it together. Because the difference in overall running time between thread bundles belonging to different thread blocks is larger than that between thread bundles belonging to the same thread block, the probability that two thread bundles from different thread blocks access the same instruction address in the instruction cache in the same clock cycle is relatively low, as is the probability that thread bundles from different thread blocks run in synchrony; nevertheless, instruction fetch broadcast between thread bundles of different thread blocks remains possible.
At least some embodiments of the present disclosure provide an instruction scheduling method, including: selecting a first instruction fetch request initiated by a first thread bundle and corresponding to a first instruction address, and performing an instruction fetch operation corresponding to the first instruction address; receiving first instruction data corresponding to the first instruction fetch request returned from the first instruction address; and, in response to a second instruction fetch request initiated by a second thread bundle to fetch the first instruction address, broadcasting the first instruction data, in a first clock cycle, to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle.
Some embodiments of the present disclosure also provide an instruction scheduling apparatus corresponding to the instruction scheduling method, including: the instruction fetch arbitration unit is configured to select a first instruction fetch request initiated by a first thread bundle and corresponding to a first instruction address, and perform instruction fetch operation corresponding to the first instruction address; the instruction pre-processing unit is configured to receive first instruction data which is returned from a first instruction address and corresponds to a first instruction fetching request, and is configured to broadcast the first instruction data to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of a second thread bundle in a first clock cycle in response to a second instruction fetching request which is initiated by the second thread bundle and used for fetching the first instruction address.
Some embodiments of the present disclosure also provide a non-transitory storage medium corresponding to the above instruction scheduling method, the storage medium non-transitory storing computer readable instructions, wherein when the computer readable instructions are executed by a computer, the instruction scheduling method provided by the above embodiments of the present disclosure is performed.
In the instruction scheduling method provided by the foregoing embodiments of the present disclosure, when the first instruction data returned from the first instruction address is received, it is sent in a broadcast manner to the write address of the instruction data access area of every thread bundle that has a pending instruction fetch request for the first instruction address. Accesses to the instruction cache (or other cache levels) during the computing unit's instruction fetch operations are thereby effectively reduced, i.e., the access bandwidth of the instruction cache (or other cache levels) is reduced, which in turn reduces the access bandwidth consumed in fetching, from the data cache (or other cache levels), the data required for instruction execution.
Some embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are given by way of illustration and explanation only, and are not intended to limit the present disclosure.
Fig. 6 is a flowchart of an instruction scheduling method according to some embodiments of the present disclosure. Fig. 7 is a schematic diagram of an instruction scheduling method applied to an instruction scheduling apparatus in a compute unit according to an embodiment of the present disclosure.
As shown in fig. 7, the instruction scheduling apparatus includes an instruction fetch arbitration unit, an instruction preprocessing unit, and an instruction broadcast determination unit. In each clock cycle, the instruction fetch arbitration unit arbitrates among the instruction fetch requests initiated by the thread bundles that need to run on the computing unit, and selects the instruction fetch request of one thread bundle for the instruction fetch operation. The instruction preprocessing unit receives and analyzes, in each clock cycle, the instruction data returned by the instruction fetch operation corresponding to a thread bundle's instruction fetch request (for example, the instruction data may comprise a number of executable instructions totaling 8 double words, a double word being 4 bytes), and broadcasts the instruction data to the write addresses of the instruction data access areas of one or more thread bundles in the same clock cycle in which the returned instruction data is received. The instruction broadcast determination unit determines the one or more thread bundles to which the instruction data needs to be broadcast.
For example, as shown in fig. 6, the instruction scheduling method includes the following steps S100 to S300, and the instruction scheduling method according to the embodiment of the disclosure will be described below with reference to the instruction scheduling apparatus shown in fig. 7.
Step S100: select a first instruction fetch request, initiated by a first thread bundle and targeting a first instruction address, and perform the instruction fetch operation on the first instruction address.
For example, FIG. 7 shows thread block 0 to thread block M running on a computing unit, where each thread block includes thread bundle 0 to thread bundle N (N and M are both integers greater than or equal to 0); one or more thread blocks may run on the computing unit, and each thread block may include one or more thread bundles. In each clock cycle, multiple thread bundles initiate corresponding instruction fetch requests, which enter a candidate state and await selection by the instruction fetch arbitration unit. For example, the attribute information of each instruction fetch request includes: the thread block number (e.g., thread block 0) of the thread block to which the thread bundle initiating the request belongs, the thread bundle number (e.g., thread bundle 0) of that thread bundle within its thread block, the state of the instruction data access area, the write address of the instruction data access area, the program counter, and so on; the attribute information of an instruction fetch request may further include a single-thread-bundle identifier. The instruction data access area is usually a RAM (random access memory) for temporarily holding instruction data; for example, the returned instruction data may contain a mixture of double-word instructions (dword, 32 bits, 4 bytes) and quad-word instructions (qword, 64 bits, 8 bytes). The multiple instructions are not all issued to the compute core at once, but are issued in sequential order, so the instruction data access area must buffer one or more returned instruction data items. The state of the instruction data access area includes whether the instruction data access area corresponding to the thread bundle initiating the instruction fetch request is not yet full.
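The attribute information carried by a candidate instruction fetch request, as enumerated above, can be modeled as a record. The field names in this sketch are assumptions chosen for readability; the patent specifies only the contents, not the encoding.

```python
from dataclasses import dataclass

@dataclass
class FetchRequest:
    """Illustrative model of one candidate instruction fetch request."""
    thread_block_id: int        # number of the issuing bundle's thread block
    warp_id: int                # thread bundle number within that block
    pc: int                     # program counter (the fetch's instruction address)
    access_area_not_full: bool  # state of the instruction data access area
    write_addr: int             # write address in the instruction data access area
    single_warp: bool = False   # set when the block contains only this bundle

# e.g. thread bundle 1 in thread block 1 requesting instruction address 0:
req = FetchRequest(thread_block_id=1, warp_id=1, pc=0,
                   access_area_not_full=True, write_addr=0x40)
```

When `single_warp` is set, the returned instruction data needs no broadcast, since no other bundle in the block can share it.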
A single-thread-bundle identifier indicates that the thread block to which the requesting thread bundle belongs contains only one thread bundle, namely the thread bundle that initiated the instruction fetch request; for a request in this state it can be determined that the returned instruction data does not need to be broadcast.
It should be noted that, as described above, the "first thread bundle" taken as the object of description may be any thread bundle in thread block 0 to thread block M; the "first instruction fetch request" is the instruction fetch request initiated by that thread bundle; and the "first instruction address" is the instruction address targeted by that instruction fetch request.
For example, the instruction fetch arbitration unit picks, as the first instruction fetch request, the request initiated by thread bundle 1 in thread block 1 (the first thread bundle). The attribute information of the first instruction fetch request includes: a first thread block number (e.g., thread block 1), a first thread bundle number (e.g., thread bundle 1), the first instruction address recorded by the program counter, the state of the first thread bundle's instruction data access area and its write address, and so on. Among the candidate instruction fetch requests of the multiple thread bundles, the instruction fetch arbitration unit selects the first instruction fetch request of the first thread bundle after arbitration and sends it to the instruction cache, which judges whether the instruction data targeted by the request is present at the first instruction address. If so, the fetch is performed from the first instruction address; if not, the first instruction fetch request is sent to the shared cache or a further cache level, and the fetch is performed from the first instruction address there.
Step S200: first instruction data corresponding to a first instruction fetch request returned from a first instruction address is received.
For example, after several clock cycles, in a first clock cycle taken as the object of description, the instruction preprocessing unit receives the first instruction data corresponding to the first instruction fetch request returned from the first instruction address, and analyzes the received first instruction data to obtain analysis result information. For example, the analysis result information may include attribute information of the first instruction fetch request for the first instruction data: the number of the thread block to which the initiating thread bundle belongs, the number of that thread bundle within its thread block, and whether the broadcast function needs to be enabled for the first instruction data.
The instruction preprocessing unit then sends the analysis result information to the instruction broadcast determination unit. In each clock cycle, the instruction fetch requests initiated by the candidate thread bundles are updated to the instruction broadcast determination unit, which records the attribute information of each candidate instruction fetch request. The attribute information is retained until, for example, the instruction data corresponding to the request has been sent to the instruction data access area of the corresponding thread bundle, or until the corresponding instruction data has been executed or retired (retirement). According to the analysis result information and the attribute information of the currently recorded candidate instruction fetch requests, the instruction broadcast determination unit determines the instruction data access areas of the thread bundles whose candidate instruction fetch requests target the first instruction address (i.e., correspond to the first instruction data) and to which the first instruction data therefore needs to be broadcast.
In at least some examples, an exemplary process of determining which candidate thread bundles' instruction data access areas the first instruction data needs to be broadcast to may include the following steps:
In step S210, the thread block number corresponding to the first thread bundle (e.g., thread block 1) and the thread bundle number corresponding to the first thread bundle (e.g., thread bundle 1 in thread block 1) are obtained from the attribute information of the first instruction fetch request corresponding to the first instruction data. It should be noted that if this attribute information includes a single-thread-bundle identifier, i.e., the thread block to which the first thread bundle belongs contains only the first thread bundle, the broadcast function does not need to be enabled for the first instruction data; if the single-thread-bundle identifier is not included, the broadcast function is enabled for the first instruction data accordingly. The obtained thread block number and thread bundle number corresponding to the first thread bundle, together with whether the broadcast function needs to be enabled for the first instruction data, are then sent to the instruction broadcast determination unit.
In step S220, the instruction broadcast determination unit determines the write address of the instruction data access area to which the first instruction data should be broadcast for the second thread bundle (e.g., one or more of thread bundles 2 to N in thread block 1), based on the thread block number corresponding to the first thread bundle (e.g., thread block 1), the thread bundle number corresponding to the first thread bundle (e.g., thread bundle 1 in thread block 1), and the thread block numbers and thread bundle numbers (e.g., thread bundles 2 to N in thread block 1) included in the attribute information of the candidate instruction fetch requests in the first clock cycle. The instruction broadcast determination unit sends information about the instruction data access areas of the candidate thread bundles to which the first instruction data needs to be broadcast (for example, the attribute information of the first instruction fetch request of the first thread bundle and of the second instruction fetch request of the second thread bundle) to the instruction preprocessing unit, instructing it to broadcast the first instruction data, in the first clock cycle, to the write address of the instruction data access area of the first thread bundle and the write address of the instruction data access area of the second thread bundle.
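The matching step S220 performs can be sketched as a simple filter over the recorded candidate requests. This is an assumption-laden sketch: candidates are modeled as (block, bundle, pc) triples, and cross-block matches on an equal program counter are allowed, as the text says is possible though less common.

```python
def broadcast_targets(origin, candidates):
    """Return the (block, bundle) pairs whose candidate fetch requests
    target the same instruction address as the returned data.

    origin:     (block, bundle, pc) of the request whose data returned
    candidates: list of (block, bundle, pc) currently recorded
    The originating bundle is excluded; it receives the data anyway.
    """
    origin_block, origin_bundle, origin_pc = origin
    return [(block, bundle) for (block, bundle, pc) in candidates
            if pc == origin_pc and (block, bundle) != (origin_block, origin_bundle)]

# Data for thread bundle 1 in thread block 1 returns from address 0;
# thread bundle 2 waits on address 0, thread bundle 3 on address 32:
targets = broadcast_targets((1, 1, 0),
                            [(1, 1, 0), (1, 2, 0), (1, 3, 32)])
```

Only thread bundle 2 qualifies here; thread bundle 3's request targets a different instruction address and must go through arbitration normally.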
It should be noted that, in the present disclosure, the "second thread bundle" may be any thread bundle in thread block 0 to thread block M other than the "first thread bundle"; in some examples there may be one or more second thread bundles (one of them is taken as the object of description).
For example, in different examples, the first thread bundle and the second thread bundle belong to the same thread block or to different thread blocks. As described above, since the kernel instruction code executed by multiple thread bundles belonging to the same thread block is the same, the instruction fetch addresses targeted by their fetch requests within a similar time period are the same; moreover, in some less common cases, thread bundles of different thread blocks may also share the same instruction fetch address.
Step S300: in response to a second instruction fetch request initiated by the second thread bundle to fetch the first instruction address, the first instruction data is broadcast in a first clock cycle to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle.
For example, the instruction preprocessing unit obtains the broadcast targets of the first instruction data (e.g., the first thread bundle and the second thread bundle; in other examples, one or more further thread bundles that initiated fetches of the first instruction address may also be included as second thread bundles). It obtains the write address of the first thread bundle's instruction data access area from the attribute information of the first instruction fetch request, and the write address of the second thread bundle's instruction data access area from the attribute information of the second instruction fetch request; likewise, it obtains the state of the first thread bundle's instruction data access area from the attribute information of the first instruction fetch request, and the state of the second thread bundle's instruction data access area from the attribute information of the second instruction fetch request. It then judges whether each of these instruction data access areas is occupied or free; if both are free, the received first instruction data is broadcast, in the first clock cycle, to the write address of the instruction data access area of the first thread bundle and the write address of the instruction data access area of the second thread bundle.
If the second thread bundle has been selected by the instruction fetch arbitration unit but its request has not been issued in the first clock cycle, the instruction scheduling method of the embodiments of the present disclosure may further include the following step S400:
step S400: the second fetch request is cancelled.
For example, since the instruction preprocessing unit broadcasts the first instruction data returned from the first instruction address to the write address of the second thread bundle's instruction data access area in the first clock cycle, there is no need to respond to the instruction fetch request initiated by the second thread bundle for the first instruction address with a separate instruction fetch operation; the instruction preprocessing unit therefore cancels the second instruction fetch request.
For example, in at least one example, the instruction broadcast determination unit generates a broadcast mask based on the thread block number and thread bundle number corresponding to the first thread bundle, and on the thread block numbers and thread bundle numbers included in the attribute information of each candidate instruction fetch request. For example, the broadcast mask includes an information bit corresponding to the second instruction fetch request and an information bit corresponding to the first instruction fetch request. The instruction broadcast determination unit sends the broadcast mask, together with the write addresses of the instruction data access areas corresponding to the broadcast mask, to the instruction preprocessing unit. Using the broadcast mask, the instruction preprocessing unit broadcasts the first instruction data, in the first clock cycle, to the write address of the instruction data access area of the first thread bundle and the write address of the instruction data access area of the second thread bundle.
For example, the m thread bundles running on a computing unit may be recorded virtually as thread bundle slots (thread warp slots); the m thread bundles may belong to the same thread block or to different thread blocks. For example, the thread bundle slots form a one-dimensional array of m bits, each bit representing a thread bundle belonging to some thread block. For example, if the maximum number of thread bundles running on the computing unit is 10, then m is 10. The instruction broadcast determination unit determines the second instruction fetch request, i.e., the second thread bundle to which the instruction data needs to be broadcast (possibly together with further thread bundles satisfying the instruction data broadcast condition), based on the thread block number and thread bundle number corresponding to the first thread bundle and those in the attribute information of each candidate instruction fetch request; the first thread bundle also needs to receive the instruction data. The broadcast mask is therefore used to indicate the second thread bundle and the first thread bundle (and possibly more thread bundles satisfying the instruction data broadcast condition) to which the instruction data needs to be broadcast.
For example, the broadcast mask is a one-dimensional array with m bits in total, in one-to-one correspondence with the m thread bundle slots. For example, the broadcast mask may be written as inst_broadcast_mask[m], where m is the width of the thread bundle slot array. Whether a bit in the broadcast mask is set to 1 indicates whether the thread bundle corresponding to that bit needs to receive the broadcast instruction data; that is, the bits set to 1 include the information bit of the first instruction fetch request and the information bit of the second instruction fetch request. For example, if the broadcast mask is 0011000000, the third bit (set to 1) is the information bit of the first instruction fetch request, indicating for example that the first thread bundle is thread bundle 1 in thread block 1; the fourth bit (set to 1) is the information bit of the second instruction fetch request, indicating for example that the second thread bundle is thread bundle 2 in thread block 1.
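Building the m-bit broadcast mask can be sketched as follows; the slot-to-bundle assignment is an assumption, since the text only requires one bit per running thread bundle, with set bits marking the recipients (the originating first thread bundle included).

```python
def make_broadcast_mask(m, slots_to_broadcast):
    """Return an m-character bit string: '1' marks a thread bundle slot
    whose instruction data access area must receive the broadcast."""
    mask = ['0'] * m
    for slot in slots_to_broadcast:
        mask[slot] = '1'
    return ''.join(mask)

# Reproducing the example above: the first thread bundle occupies slot 2
# and the second thread bundle slot 3, out of m = 10 slots.
mask = make_broadcast_mask(10, [2, 3])
```

With slots 2 and 3 set, the result matches the mask 0011000000 given in the text.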
For example, in at least some examples, the manner in which the first instruction data returned from the first instruction address is broadcast, in the first clock cycle, to the write address of the instruction data access area of the first thread bundle and the write address of the instruction data access area of the second thread bundle may be as follows.
For example, after the instruction preprocessing unit receives, in the first clock cycle, the first instruction data corresponding to the first instruction fetch request returned from the first instruction address and determines the second instruction fetch request initiated by the second thread bundle for the first instruction address: if the second instruction fetch request is a candidate and is selected in the first clock cycle, the instruction preprocessing unit cancels the instruction fetch operation corresponding to the second instruction fetch request and simultaneously broadcasts the first instruction data to the write address of the second thread bundle's instruction data access area.
Or, if the second instruction fetch request initiated by the second thread bundle is a candidate but is ignored in the first clock cycle, the instruction preprocessing unit likewise cancels the instruction fetch operation corresponding to the second instruction fetch request and simultaneously broadcasts the first instruction data to the write address of the second thread bundle's instruction data access area.
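Both cases above reduce to the same action: when the first instruction data returns, write it to the access area of every matching pending request and cancel those requests, whether they were selected or ignored that cycle. A minimal sketch under those assumptions (the dictionary fields and function name are illustrative):

```python
def dispatch_broadcast(data, origin_pc, requests, memory):
    """Broadcast returned instruction data to every pending fetch request
    for the same instruction address whose access area is not full,
    and report which thread bundles no longer need a separate fetch."""
    cancelled = []
    for req in requests:
        if req["pc"] == origin_pc and req["not_full"]:
            memory[req["write_addr"]] = data   # broadcast write
            cancelled.append(req["warp"])      # request is cancelled
    return cancelled

# Thread bundle 1 waits on address 0, thread bundle 2 on address 32;
# data for address 0 returns:
mem = {}
reqs = [{"warp": 1, "pc": 0, "not_full": True, "write_addr": 0x10},
        {"warp": 2, "pc": 32, "not_full": True, "write_addr": 0x20}]
done = dispatch_broadcast("instr@0", 0, reqs, mem)
```

Thread bundle 1's request is satisfied by the broadcast and cancelled; thread bundle 2's request for address 32 is untouched and stays with the arbitration unit.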
For example, in at least some examples, based on the two broadcast modes described above, different cases can arise. In one case, the first instruction fetch request and the second instruction fetch request are both candidates in a second clock cycle (the first clock cycle comes after the second clock cycle); the first instruction fetch request is selected while the second is ignored, and the second instruction fetch request remains ignored throughout the one or more intermediate operating cycles between the second clock cycle and the first clock cycle. In another case, in the second clock cycle, a third instruction fetch request initiated by a third thread bundle for the first instruction address is a candidate but is ignored, and in a third clock cycle between the second clock cycle and the first clock cycle, the third instruction fetch request is a candidate and is selected to perform the instruction fetch operation on the first instruction address.
In the exemplary case of the broadcast mode shown in fig. 8, thread bundles 0 to 3 in the same thread block all have instruction fetch requests for instruction address 0 (IF, PC = 0, an example of the first instruction address) in clock cycle T × N, and the instruction fetch requests of thread bundles 0 to 3 are all in the candidate state.
In clock cycle T × N, the instruction fetch request of thread bundle 0 is selected by the instruction fetch arbitration unit (shown in bold, likewise below) to perform the instruction fetch operation for instruction address 0, while the instruction fetch requests of the other thread bundles are ignored (shown in light, likewise below); for example, the instruction fetch request of thread bundle 3 is a candidate but ignored, and likewise the instruction fetch requests of thread bundles 1 and 2 are candidates but ignored.

In clock cycle T × N + 1, the instruction fetch request of thread bundle 1 is selected by the instruction fetch arbitration unit to perform the instruction fetch operation for instruction address 0, while the instruction fetch requests of the other thread bundles are in the candidate state but are all ignored; for example, the instruction fetch requests of thread bundle 2 and thread bundle 3 are candidates but ignored.

In clock cycle T × N + 2, the instruction fetch request of thread bundle 2 is selected by the instruction fetch arbitration unit to perform the instruction fetch operation for instruction address 0, while the instruction fetch request of thread bundle 3 is in the candidate state but ignored.
In clock cycle T × N + 3, the instruction data of thread bundle 0 is returned from instruction address 0 (Instr Rtn, PC = 0). At this time, the instruction fetch request of thread bundle 3 for instruction address 0 is still in the candidate state, so the instruction data of thread bundle 0 is simultaneously broadcast to the write address of the instruction data access area of thread bundle 3, whose request was a candidate but ignored in the preceding clock cycles T × N to T × N + 2. This is equivalent to having responded to thread bundle 3's instruction fetch request for instruction address 0 (hence, in the figure, thread bundle 3 is also marked with Instr Rtn, PC = 0), and the instruction fetch operation of thread bundle 3 is cancelled.

Thread bundle 1 and thread bundle 2, on the other hand, whose instruction fetch operations were already performed in clock cycles T × N to T × N + 2, wait for the instruction data fetched by those operations.
For example, in clock cycle T × N + 4, the instruction data of thread bundle 1 is returned from instruction address 0. At this time, an instruction fetch request of thread bundle 5 for instruction address 0 is a candidate and is selected by the instruction fetch arbitration unit; the instruction data of thread bundle 1 can therefore be broadcast to thread bundle 5, and the instruction fetch operation of thread bundle 5's request is cancelled.

Similarly, in clock cycle T × N + 5, the instruction data of thread bundle 2 is returned from instruction address 0. At this time, an instruction fetch request of thread bundle 4 for instruction address 0 is a candidate and is selected by the instruction fetch arbitration unit; the instruction data of thread bundle 2 can therefore be broadcast to thread bundle 4, and the instruction fetch operation of thread bundle 4's request is cancelled.
It should be noted that, in the case illustrated above, examples of the first clock cycle include clock cycles T × N + 3 and T × N + 5; correspondingly, an example of the second clock cycle is clock cycle T × N, examples of the intermediate operating cycles include clock cycles T × N + 1 and T × N + 2, and an example of the third clock cycle is clock cycle T × N + 1. In this case, examples of the first instruction fetch request include thread bundle 0's request for instruction address 0 and thread bundle 2's request for instruction address 0; correspondingly, examples of the second instruction fetch request include thread bundle 3's, thread bundle 4's, and thread bundle 5's requests for instruction address 0; and an example of the third instruction fetch request is thread bundle 1's request for instruction address 0.
In addition, the exemplary case of the broadcast mode shown in fig. 8 also illustrates the handling of multiple instruction fetch requests from thread bundle 0 to thread bundle 5 for instruction address 32 (IF, PC = 32, another example of the first instruction address) in different instruction cycles; this is similar to the handling of the requests for instruction address 0 described above and is not repeated here.
In the exemplary case of the broadcast mode shown in fig. 9, thread bundles 0 to 3 in the same thread block all have instruction fetch requests for instruction address 0 (IF, PC = 0, an example of the first instruction address) in clock cycle T × N, and the instruction fetch requests of thread bundles 0 to 3 are all in the candidate state.
In clock cycle T × N, the instruction fetch request of thread bundle 0 is selected by the instruction fetch arbitration unit and the instruction fetch operation is performed for instruction address 0, while the candidate instruction fetch requests of the other thread bundles, for example those of thread bundles 1 to 3, are ignored.

In clock cycles T × N + 1 and T × N + 2, the candidate instruction fetch requests of thread bundles 1 to 3 are again ignored.

When the instruction data of thread bundle 0 returns from instruction address 0 in clock cycle T × N + 3, it is simultaneously broadcast to the instruction data access area write addresses of thread bundles 1 to 3, which had been waiting; this substantially responds to the instruction fetch requests of thread bundles 1 to 3.
At clock cycle T×N+4, thread bundle 5 also initiates an instruction fetch request for instruction address 0. Since the instruction data of thread bundle 0 has already been broadcast, the instruction fetch request of thread bundle 5 for instruction address 0 is not ignored at this moment; instead, instruction address 0 is fetched again, and the fetch-and-broadcast flow previously performed for thread bundle 0 is repeated.
At clock cycles T×N+5 and T×N+6, thread bundle 4 also initiates an instruction fetch request for instruction address 0, but the request is ignored.
At clock cycle T×N+7, the instruction data of thread bundle 5 returns from instruction address 0 and is then broadcast to the instruction cache write address of thread bundle 4, which has been waiting, so that the instruction fetch request of thread bundle 4 is effectively responded to.
It is noted that, in the illustrated case, examples of the first clock cycle include clock cycles T×N+3, T×N+4, and T×N+7; correspondingly, an example of the second clock cycle is clock cycle T×N, and examples of the intermediate operation cycle include clock cycles T×N+1 and T×N+2. Examples of the first instruction fetch request may include the instruction fetch request of thread bundle 0 for instruction address 0 and the instruction fetch request of thread bundle 5 for instruction address 0; correspondingly, examples of the second instruction fetch request include the instruction fetch requests of thread bundles 1 to 3 for instruction address 0 and the instruction fetch request of thread bundle 4 for instruction address 0.
In addition, in the exemplary case of the broadcast mode shown in fig. 9, the processing of the multiple instruction fetch requests of thread bundles 0 to 5 for instruction address 32 in different clock cycles is also shown; this is similar to the instruction fetch requests for instruction address 0 described above and is not repeated here.
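The cycle-by-cycle behavior walked through above — select one request, ignore same-address candidates while the fetch is in flight, then answer them all by broadcast when the data returns — can be sketched as a small simulation. This is an illustrative model only, not the patented hardware; names such as `FETCH_LATENCY` and `simulate`, and the assumed three-cycle fetch latency (chosen to match the Fig. 9 walkthrough), are not taken from the disclosure.

```python
FETCH_LATENCY = 3  # assumed cycles between fetch issue and data return

def simulate(requests, n_cycles):
    """requests: dict mapping cycle -> list of (warp_id, pc) fetch requests."""
    pending = []      # candidate (warp_id, pc) requests, not yet served
    in_flight = None  # (warp_id, pc, return_cycle) of the selected fetch
    served = {}       # warp_id -> cycle its instruction data was written
    for cycle in range(n_cycles):
        pending.extend(requests.get(cycle, []))
        # Data return: write the selected warp's buffer and broadcast to
        # every waiting warp that requested the same instruction address.
        if in_flight and in_flight[2] == cycle:
            wid0, pc, _ = in_flight
            served[wid0] = cycle
            for wid, rpc in list(pending):
                if rpc == pc:
                    served[wid] = cycle
                    pending.remove((wid, rpc))
            in_flight = None
        # Arbitration: pick one candidate; same-address candidates that
        # remain in `pending` are simply ignored this cycle.
        if in_flight is None and pending:
            wid, pc = pending.pop(0)
            in_flight = (wid, pc, cycle + FETCH_LATENCY)
    return served
```

Replaying the Fig. 9 scenario (warps 0 to 3 request PC 0 at cycle 0, warp 5 at cycle 4, warp 4 at cycle 5) reproduces the described timing: warps 0 to 3 are served together at cycle 3, and warps 5 and 4 together at cycle 7.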
At least some embodiments of the present disclosure also provide an instruction scheduling apparatus, for example for a parallel processor. The parallel processor is, for example, a General Purpose Graphics Processor (GPGPU), and embodiments of the present disclosure are not limited in this respect.
Fig. 10 is a schematic block diagram of an instruction scheduling apparatus according to some embodiments of the present disclosure. As shown in fig. 10, the instruction scheduling apparatus 100 includes an instruction fetch arbitration unit 110, an instruction preprocessing unit 120, and an instruction broadcast determination unit 130.
The instruction fetch arbitration unit 110 is configured to select a first instruction fetch request for a first instruction address initiated by a first thread bundle and perform an instruction fetch operation for the first instruction address.
The instruction preprocessing unit 120 is configured to receive first instruction data corresponding to the first instruction fetch request returned from the first instruction address, and, in response to a second instruction fetch request initiated by a second thread bundle to fetch the first instruction address, to broadcast the first instruction data in a first clock cycle to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle.
For example, instruction pre-processing unit 120 is also configured to cancel the second instruction fetch request.
For example, the instruction preprocessing unit 120 is further configured to cancel the fetch operation corresponding to the second fetch request if the second fetch request is candidate and selected in the first clock cycle.
For example, instruction pre-processing unit 120 is further configured to ignore the second instruction fetch request if the second instruction fetch request is a candidate in the first clock cycle.
For example, the instruction pre-processing unit 120 is further configured to select the first fetch request and ignore the second fetch request in a case where the first fetch request and the second fetch request are candidates in a second clock cycle, wherein the first clock cycle is subsequent to the second clock cycle.
For example, the instruction pre-processing unit 120 is further configured to ignore the second fetch request in one or more intermediate operation cycles between the second clock cycle and the first clock cycle.
For example, the instruction pre-processing unit 120 is further configured to, in a case where a third instruction fetch request initiated for a third thread bundle to fetch the first instruction address is a candidate in the second clock cycle, ignore the third instruction fetch request; and in a third clock cycle between the second clock cycle and the first clock cycle, in the case that a third instruction fetching request is in a candidate state, selecting the third instruction fetching request to carry out the instruction fetching operation of the first instruction address.
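The select, ignore, and broadcast rules enumerated above reduce to a small decision table for a candidate request that targets the same instruction address as an earlier fetch. The function and state names below are illustrative assumptions, not taken from the disclosure:

```python
def handle_same_address_request(state):
    """Decide how a candidate request for an instruction address is treated,
    given the state of the earlier fetch for that same address.

    state is one of:
      'in_flight' - the first fetch was selected but its data has not returned
      'returning' - the first clock cycle, when the fetched data comes back
      'idle'      - no matching fetch is pending
    """
    return {
        "in_flight": "ignore",     # intermediate cycles: stay candidate, ignored
        "returning": "broadcast",  # answered by the broadcast of the first data
        "idle": "select",          # fetch normally (as thread bundle 5 does
                                   # after the first broadcast has completed)
    }[state]
```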
For example, the first thread bundle and the second thread bundle belong to the same thread block or belong to different thread blocks.
For example, the instruction broadcast determining unit 130 is configured to record attribute information of each instruction fetch request, where the attribute information includes a thread block number of a thread block to which a thread bundle initiating the instruction fetch request belongs and a thread bundle number of the thread bundle initiating the instruction fetch request in the belonging thread block.
For example, the instruction broadcast determination unit 130 is further configured to: acquire the returned first instruction data from the instruction preprocessing unit 120, and acquire the thread block number corresponding to the first thread bundle and the thread bundle number corresponding to the first thread bundle according to the attribute information of the first instruction fetch request; and determine the second instruction fetch request based on the thread block number and thread bundle number corresponding to the first thread bundle and the thread block number and thread bundle number included in the attribute information of each of the candidate instruction fetch requests.
For example, the instruction broadcast determination unit 130 is further configured to generate a broadcast mask based on the thread block number and thread bundle number corresponding to the first thread bundle and the thread block number and thread bundle number included in the attribute information of each of the candidate instruction fetch requests, wherein the broadcast mask includes information bits corresponding to the second instruction fetch request; the instruction preprocessing unit 120 is further configured to use the broadcast mask to broadcast the first instruction data to the write address of the instruction data access area of the second thread bundle in the first clock cycle.
For example, the broadcast mask further includes information bits corresponding to a first instruction fetch request, and the instruction pre-processing unit 120 is further configured to broadcast, in a first clock cycle, the first instruction data to a write address of the instruction data access area of the first thread bundle and a write address of the instruction data access area of the second thread bundle using the broadcast mask.
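A broadcast mask of the kind described above can be sketched as a bit vector with one bit per candidate request, set when the candidate's program counter (and, for intra-block broadcast, thread block number) matches the returning fetch. The dictionary keys and the `same_block_only` switch are assumptions for illustration, not the patented encoding:

```python
def broadcast_mask(first_req, candidates, same_block_only=True):
    """Return a mask with bit i set when candidates[i] can be satisfied by
    the returning instruction data of first_req."""
    mask = 0
    for i, req in enumerate(candidates):
        if req["pc"] != first_req["pc"]:
            continue  # different instruction address: cannot be broadcast to
        if same_block_only and req["block"] != first_req["block"]:
            continue  # restrict the broadcast to the same thread block
        mask |= 1 << i
    return mask
```

With the restriction lifted (`same_block_only=False`), the mask also covers same-address candidates from other thread blocks, matching the statement above that the first and second thread bundles may belong to the same thread block or to different thread blocks.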
For example, the attribute information of each instruction fetch request further includes a write address of the instruction data access area, and the instruction preprocessing unit 120 is further configured to acquire the write address of the instruction data access area of the first instruction fetch request through the attribute information of the first instruction fetch request, and acquire the write address of the instruction data access area of the second instruction fetch request through the attribute information of the second instruction fetch request.
For example, the attribute information of each instruction fetch request further includes a program counter, and the instruction preprocessing unit 120 is further configured to obtain a first instruction fetch address corresponding to the first instruction fetch request through the program counter of the attribute information of the first instruction fetch request.
For example, the attribute information of each instruction fetch request further includes the state of the instruction data access area, which indicates whether the instruction data access area of the thread bundle that initiated the instruction fetch request is not full.
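Taken together, the attribute information described above — thread block number, thread bundle number, instruction-buffer write address, program counter, and buffer state — could be modeled as a simple per-request record; the field names below are illustrative assumptions, not from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class FetchRequestAttr:
    block_id: int          # thread block number of the requesting thread bundle
    warp_id: int           # thread bundle number within its thread block
    write_addr: int        # write address in the bundle's instruction data area
    pc: int                # program counter (instruction address) of the fetch
    buffer_not_full: bool  # state of the instruction data access area
```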
For example, each of the instruction fetch arbitration unit 110, the instruction preprocessing unit 120, and the instruction broadcast determination unit 130 may be implemented by hardware, firmware, or software.
Thus, for example, a processor may execute code and programs to implement some or all of the functionality of the various modules as described above. For another example, each of the instruction fetch arbitration unit 110, the instruction preprocessing unit 120, and the instruction broadcast determination unit 130 may be a hardware device, and is configured to implement part or all of the functions of each of the above modules.
It should be noted that the instruction scheduling apparatus 100 may be used to implement the foregoing instruction scheduling method. For example, the instruction fetch arbitration unit 110 may be configured to implement step S100 in the foregoing instruction scheduling method, and specific implementation processes and details may refer to the related description of step S100, which is not repeated herein. For example, the instruction preprocessing unit 120 may be configured to implement the steps S200 and S300 in the foregoing instruction scheduling method, and specific implementation processes and details may refer to the related descriptions of the steps S200 and S300, which are not repeated herein. For example, the instruction preprocessing unit 120 may also be configured to implement step S400 in the foregoing instruction scheduling method, and for specific implementation processes and details, reference may be made to relevant descriptions of step S400, which is not repeated herein. For example, the instruction broadcast determining unit 130 may also be configured to implement step S210 and step S220 in the foregoing instruction scheduling method, and specific implementation processes and details may refer to the related descriptions of step S210 and step S220, which are not repeated herein.
Fig. 11 is a schematic block diagram of another instruction scheduling apparatus according to some embodiments of the present disclosure. For example, as shown in fig. 11, the instruction scheduling apparatus 500 includes a memory 510 and a processor 520. The instruction scheduling apparatus may be used, for example, in a parallel processor, such as a General Purpose Graphics Processor (GPGPU), and the embodiments of the present disclosure are not limited thereto.
For example, the memory 510 is used for non-transitory storage of computer-executable instructions, and the processor 520 is used for executing the computer-executable instructions, and the computer-executable instructions are executed by the processor 520 to perform the instruction scheduling method provided by any embodiment of the disclosure.
For example, the memory 510 and the processor 520 may be in direct or indirect communication with each other. For example, in some examples, as shown in fig. 11, the instruction scheduling apparatus 500 may further include a system bus 530, and the memory 510 and the processor 520 may communicate with each other through the system bus 530; for example, the processor 520 may access the memory 510 through the system bus 530. For example, in other examples, components such as the memory 510 and the processor 520 may communicate over a Network on Chip (NoC) connection.
For example, the processor 520 may control other components in the instruction scheduling apparatus to perform desired functions. The processor 520 may be a device with data processing capability and/or program execution capability, such as a Central Processing Unit (CPU), Tensor Processing Unit (TPU), Network Processor (NP), or Graphics Processing Unit (GPU), and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so forth.
For example, the memory 510 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory, and the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like.
For example, one or more computer instructions may be stored on memory 510 and executed by processor 520 to implement various functions. Various applications and various data, such as instruction scheduling code and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, some computer instructions stored by memory 510, when executed by processor 520, may perform one or more steps in accordance with the instruction scheduling methods described above.
For example, as shown in fig. 11, the instruction scheduling apparatus 500 may further include an input interface 540 that allows an external device to communicate with the instruction scheduling apparatus 500. For example, the input interface 540 may be used to receive instructions from an external computer device, from a user, and the like. The instruction scheduling apparatus 500 may also include an output interface 550 that connects the instruction scheduling apparatus 500 with one or more external devices. For example, the instruction scheduling apparatus 500 may display an image or the like through the output interface 550.
For example, for detailed descriptions of the processing procedure of the instruction scheduling method according to the above embodiment of the present disclosure, reference may be made to the related descriptions in the above embodiment of the instruction scheduling method, and repeated descriptions are omitted.
It should be noted that the instruction scheduling apparatus provided in the embodiments of the present disclosure is illustrative and not restrictive, and the instruction scheduling apparatus may further include other conventional components or structures according to practical application needs, for example, in order to implement the necessary functions of the instruction scheduling apparatus, a person skilled in the art may set other conventional components or structures according to a specific application scenario, and the embodiments of the present disclosure are not limited thereto.
For technical effects of the instruction scheduling apparatus provided in the embodiments of the present disclosure, reference may be made to corresponding descriptions about the instruction scheduling method in the foregoing embodiments, and details are not repeated here.
At least some embodiments of the present disclosure also provide a processor including at least one computing unit, where the computing unit includes the instruction scheduling apparatus as provided in any of the foregoing embodiments. For example, the processor is a general-purpose graphics processor, which includes a plurality of computing units, each of which includes an instruction scheduling device, and may further include a plurality of computing cores, a register file, an instruction cache, a data cache, and the like. Each compute core includes an Arithmetic Logic Unit (ALU), a floating point compute unit, and the like. For example, when multiple thread blocks execute in the same compute unit, the thread bundles in the compute unit may come from the same thread block or different thread blocks, and all threads in the same thread bundle may execute in SIMD fashion.
At least some embodiments of the present disclosure also provide a non-transitory storage medium. Fig. 12 is a schematic diagram of a non-transitory storage medium according to some embodiments of the present disclosure. For example, as shown in fig. 12, the storage medium 600 non-temporarily stores computer-executable instructions 601, and when the non-transitory computer-executable instructions 601 are executed by a computer (including a processor), the instruction scheduling method provided by any embodiment of the disclosure can be executed.
For example, one or more computer instructions may be stored on the storage medium 600. Some of the computer instructions stored on the storage medium 600 may be, for example, instructions for implementing one or more steps of the instruction scheduling method described above.
For example, the storage medium may include a storage component of a tablet computer, a hard disk of a personal computer, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a compact disc read only memory (CD-ROM), a flash memory, or any combination of the above storage media, as well as other suitable storage media. For example, the storage medium 600 may include the memory 510 in the instruction scheduling apparatus 500 described above.
For technical effects of the storage medium provided by the embodiments of the present disclosure, reference may be made to corresponding descriptions about the instruction scheduling method in the foregoing embodiments, and details are not described herein again.
For the present disclosure, there are the following points to be explained:
(1) in the drawings of the embodiments of the present disclosure, only the structures related to the embodiments of the present disclosure are referred to, and other structures may refer to general designs.
(2) Features of the disclosure in the same embodiment and in different embodiments may be combined with each other without conflict.
The above is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (30)

1. An instruction scheduling method, comprising:
selecting a first instruction fetching request initiated by a first thread bundle and aiming at a first instruction address, and carrying out instruction fetching operation on the first instruction address;
receiving first instruction data corresponding to the first instruction fetch request returned from the first instruction address;
in response to a second instruction fetch request initiated by a second thread bundle for fetching the first instruction address, broadcasting the first instruction data in a first clock cycle and sending the first instruction data to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle.
2. The method of claim 1, further comprising:
canceling the second instruction fetch request.
3. The method of claim 1, wherein, if the second fetch request is candidate and selected in the first clock cycle, cancelling a fetch operation corresponding to the second fetch request.
4. The method of claim 3, wherein the second fetch request is candidate in the first clock cycle and is ignored.
5. The method of claim 1, wherein the first and second fetch requests are candidates in a second clock cycle, the first fetch request is selected, the second fetch request is ignored,
wherein the first clock cycle is subsequent to the second clock cycle.
6. The method of claim 5, wherein the second fetch request is ignored during one or more intermediate operating cycles between the second clock cycle and the first clock cycle.
7. The method of claim 5, wherein in the second clock cycle, a third instruction fetch request initiated by a third thread bundle to fetch the first instruction address is a candidate but is ignored;
in a third clock cycle between the second clock cycle and the first clock cycle, the third fetch request is candidate and selected for a fetch operation for the first instruction address.
8. The method of claim 1, wherein the first thread bundle and the second thread bundle belong to a same thread block or belong to different thread blocks.
9. The method of claim 1, further comprising:
recording attribute information of each instruction fetching request, wherein the attribute information comprises a thread block number of a thread block to which a thread bundle initiating the instruction fetching request belongs and a thread bundle number of the thread bundle initiating the instruction fetching request in the thread block to which the thread bundle belongs.
10. The method of claim 9, further comprising:
responding to the returned first instruction data, and acquiring a thread block number corresponding to the first thread bundle and a thread bundle number corresponding to the first thread bundle according to the attribute information of the first instruction fetching request;
and determining the second instruction fetching request based on the thread block number corresponding to the first thread bundle, the thread bundle number corresponding to the first thread bundle and the thread block number and the thread bundle number included in the attribute information of each of the candidate plurality of instruction fetching requests.
11. The method of claim 10, wherein determining the second fetch request based on the thread block number corresponding to the first thread bundle and the thread bundle number corresponding to the first thread bundle and the thread block number and the thread bundle number included in the attribute information of each of the candidate plurality of fetch requests comprises:
generating a broadcast mask, wherein the broadcast mask includes information bits corresponding to the second fetch request;
wherein the broadcast mask is used to broadcast the first instruction data in the first clock cycle to a write address of an instruction data access area sent to the second thread bundle.
12. The method of claim 11, wherein the broadcast mask further includes information bits corresponding to the first fetch request,
broadcasting the first instruction data to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle in the first clock cycle using the broadcast mask.
13. The method of claim 10, wherein the attribute information of each instruction fetch request further includes a write address of an instruction data access area,
acquiring a write address of an instruction data access area of the first instruction fetching request through attribute information of the first instruction fetching request;
and acquiring the write address of the instruction data access area of the second instruction fetch request through the attribute information of the second instruction fetch request.
14. The method of claim 10, wherein the attribute information of each fetch request further includes a program counter,
and acquiring a first instruction fetching address corresponding to the first instruction fetching request through a program counter of the attribute information of the first instruction fetching request.
15. The method of claim 10, wherein the attribute information of each fetch request further comprises a status of an instruction data access area, the status of the instruction data access area comprising whether an instruction data access area of a thread bundle originating the fetch request is not full.
16. An instruction scheduling apparatus comprising:
the instruction fetch arbitration unit is configured to select a first instruction fetch request initiated by a first thread bundle and corresponding to a first instruction address, and perform instruction fetch operation corresponding to the first instruction address;
an instruction preprocessing unit configured to receive first instruction data corresponding to the first instruction fetch request returned from the first instruction address, and configured to, in response to a second instruction fetch request initiated by a second thread bundle to fetch the first instruction address, broadcast the first instruction data in a first clock cycle to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle.
17. The instruction scheduling apparatus according to claim 16, wherein the instruction pre-processing unit is further configured to cancel the second instruction fetch request.
18. The instruction scheduling apparatus according to claim 16, wherein the instruction preprocessing unit is further configured to cancel an instruction fetch operation corresponding to the second instruction fetch request if the second instruction fetch request is candidate and selected in the first clock cycle.
19. The instruction scheduling apparatus according to claim 18, wherein the instruction pre-processing unit is further configured to ignore the second fetch request if the second fetch request is in a candidate in the first clock cycle.
20. The instruction scheduling apparatus of claim 16, wherein the instruction pre-processing unit is further configured to, for a case where the first fetch request and the second fetch request are candidates in a second clock cycle, select the first fetch request, ignore the second fetch request,
wherein the first clock cycle is subsequent to the second clock cycle.
21. The instruction scheduling apparatus of claim 20, wherein the instruction pre-processing unit is further configured to ignore the second fetch request in one or more intermediate operation cycles between the second clock cycle and the first clock cycle.
22. The apparatus of claim 20, wherein the instruction pre-processing unit is further configured to,
in the second clock cycle, in the case that a third instruction fetching request for fetching the first instruction address, which is initiated by a third thread bundle, is a candidate, ignoring the third instruction fetching request;
in a third clock cycle between the second clock cycle and the first clock cycle, in the case that the third instruction fetch request is a candidate, the third instruction fetch request is selected to perform the instruction fetch operation for the first instruction address.
23. The instruction scheduling apparatus according to claim 16, further comprising an instruction broadcast decision unit, wherein the instruction broadcast decision unit is configured to record attribute information of each instruction fetch request,
the attribute information includes the thread block number of the thread block to which the thread bundle initiating the instruction fetch request belongs and the thread bundle number of the thread bundle initiating the instruction fetch request in the thread block to which the thread bundle belongs.
24. The instruction scheduling apparatus of claim 23, wherein the instruction broadcast determination unit is further configured to:
acquiring the returned first instruction data from the instruction preprocessing unit, and acquiring a thread block number corresponding to the first thread bundle and a thread bundle number corresponding to the first thread bundle according to the attribute information of the first instruction fetching request; and
and determining the second instruction fetching request based on the thread block number corresponding to the first thread bundle, the thread bundle number corresponding to the first thread bundle and the thread block number and the thread bundle number included in the attribute information of each of the candidate plurality of instruction fetching requests.
25. The instruction scheduling apparatus according to claim 24, wherein the instruction broadcast determining unit is further configured to generate a broadcast mask based on the thread block number corresponding to the first thread bundle and the thread bundle number corresponding to the first thread bundle, and the thread block number and the thread bundle number included in the attribute information of each of the candidate plurality of instruction fetch requests, wherein the broadcast mask includes information bits corresponding to the second instruction fetch request;
the instruction pre-processing unit is further configured to broadcast, in a first clock cycle, the first instruction data to a write address of an instruction data access area sent to the second thread bundle using the broadcast mask.
26. The instruction scheduling apparatus of claim 25, wherein the broadcast mask further comprises information bits corresponding to the first fetch request,
the instruction pre-processing unit is further configured to broadcast, in a first clock cycle, the first instruction data to a write address of an instruction data access area of the first thread bundle and a write address of an instruction data access area of the second thread bundle using the broadcast mask.
27. The instruction scheduling apparatus of claim 24, wherein the attribute information of each instruction fetch request further includes a write address of an instruction data access area,
the instruction pre-processing unit is further configured to:
obtaining a write address of an instruction data access area of the first thread bundle through attribute information of the first instruction fetch request, an
And acquiring the write address of the instruction data access area of the second thread bundle through the attribute information of the second instruction fetching request.
28. A processor comprising at least one computational unit, wherein the computational unit comprises an instruction scheduling apparatus according to any one of claims 16-27.
29. An instruction scheduling apparatus comprising:
a memory for non-transitory storage of computer-executable instructions; and
a processor for executing the computer-executable instructions,
wherein the computer-executable instructions, when executed by the processor, perform the instruction scheduling method of any of claims 1-15.
30. A non-transitory storage medium that non-transitory stores computer-executable instructions, wherein the computer-executable instructions, when executed by a computer, perform the instruction scheduling method of any one of claims 1-15.
CN202111462823.2A 2021-12-01 2021-12-01 Instruction scheduling method, instruction scheduling device, processor and storage medium Pending CN114153500A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111462823.2A CN114153500A (en) 2021-12-01 2021-12-01 Instruction scheduling method, instruction scheduling device, processor and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111462823.2A CN114153500A (en) 2021-12-01 2021-12-01 Instruction scheduling method, instruction scheduling device, processor and storage medium

Publications (1)

Publication Number Publication Date
CN114153500A true CN114153500A (en) 2022-03-08

Family

ID=80455998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111462823.2A Pending CN114153500A (en) 2021-12-01 2021-12-01 Instruction scheduling method, instruction scheduling device, processor and storage medium

Country Status (1)

Country Link
CN (1) CN114153500A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115760543A (en) * 2022-11-10 2023-03-07 格兰菲智能科技有限公司 Thread processing method, device, equipment and storage medium for rasterizer ordered view
CN115760543B (en) * 2022-11-10 2024-02-13 格兰菲智能科技有限公司 Thread processing method, device, equipment and storage medium for rasterizer ordered view
CN116360708A (en) * 2023-05-26 2023-06-30 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium
CN116360708B (en) * 2023-05-26 2023-08-11 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium
CN116643698A (en) * 2023-05-26 2023-08-25 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium
CN116643698B (en) * 2023-05-26 2024-03-29 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109522254B (en) Arithmetic device and method
US11294815B2 (en) Multiple multithreaded processors with shared data cache
CN114153500A (en) Instruction scheduling method, instruction scheduling device, processor and storage medium
EP1555610B1 (en) Out of order instruction dispatch in a multithreaded microprocessor
US10007527B2 (en) Uniform load processing for parallel thread sub-sets
US10558460B2 (en) General purpose register allocation in streaming processor
US20130298133A1 (en) Technique for computational nested parallelism
US20080059966A1 (en) Dependent instruction thread scheduling
CN110308982B (en) Shared memory multiplexing method and device
JP2014106715A (en) Control program of processing unit, control method of processing unit, and processing unit
EP3398065B1 (en) Data driven scheduler on multiple computing cores
US20220121444A1 (en) Apparatus and method for configuring cooperative warps in vector computing system
CN114895965A (en) Method and apparatus for out-of-order pipeline execution implementing static mapping of workloads
JPH09138778A (en) Device and method using semaphore buffer for semaphore instruction
CN113721987B (en) Instruction execution method and instruction execution device
US10152329B2 (en) Pre-scheduled replays of divergent operations
CN113900712A (en) Instruction processing method, instruction processing apparatus, and storage medium
US20230236878A1 (en) Efficiently launching tasks on a processor
WO2022161013A1 (en) Processor apparatus and instruction execution method therefor, and computing device
US11609785B2 (en) Matrix data broadcast architecture
CN112463218B (en) Instruction emission control method and circuit, data processing method and circuit
US10877926B2 (en) Method and system for partial wavefront merger
US20130166887A1 (en) Data processing apparatus and data processing method
CN112214443B (en) Secondary unloading device and method arranged in graphic processor
US20220129312A1 (en) Method for matrix data broadcast in parallel processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination