CN114610394B - Instruction scheduling method, processing circuit and electronic equipment - Google Patents


Info

Publication number
CN114610394B
Authority
CN
China
Prior art keywords
instruction
target
target instruction
resource
status indicator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210247863.3A
Other languages
Chinese (zh)
Other versions
CN114610394A (en)
Inventor
王磊
常亮
许飞翔
侯红朝
姚飞
仇小钢
Current Assignee
Hexaflake Nanjing Information Technology Co Ltd
Original Assignee
Hexaflake Nanjing Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hexaflake Nanjing Information Technology Co Ltd
Priority to CN202210247863.3A
Publication of CN114610394A
Priority to PCT/CN2022/107512 (published as WO2023173642A1)
Application granted
Publication of CN114610394B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3854Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3856Reordering of instructions, e.g. using queues or age tags

Abstract

A method, processing circuit, electronic device, computer-readable storage medium, and computer program product for instruction scheduling are described herein. The method proposed herein comprises: determining a status indicator associated with a target instruction, the status indicator indicating a status of a resource associated with the target instruction; determining whether the target instruction is ready based on the status indicator and the type of the target instruction; in response to determining that the target instruction is ready, executing a target phase of the target instruction, wherein the target phase is determined based on the type of the target instruction; and updating the status indicator in response to completion of execution of the access operation for the resource by the target instruction. In this manner, status indicators can be utilized to efficiently manage data dependencies between instructions, thereby improving system performance and reducing circuit complexity.

Description

Instruction scheduling method, processing circuit and electronic equipment
Technical Field
Embodiments of the present disclosure relate generally to the field of electronics and, more particularly, to a method, processing circuit, electronic device, computer-readable storage medium, and computer program product for instruction scheduling.
Background
Some processing units (e.g., AI GPUs) employ a load-store architecture: instructions other than memory-access instructions use operands in registers. These instructions read data from a register file (RF), feed it into the execution unit for computation, and finally write the results back to the register file. To enable reading and writing multiple data items per clock cycle, the register file is typically multiported and divided into multiple banks. The register file is typically small, with a read speed close to that of the execution unit and a fixed delay, and is typically placed beside the execution unit. Each thread of the AI GPU has its own register file and fixed execution units.
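The load-store pattern described above can be sketched as a small software model: only load/store instructions touch memory, while arithmetic instructions read and write registers only. This is an illustrative sketch; the function and variable names are hypothetical, not from this disclosure.

```python
def load(rf, dst, memory, addr):
    # Memory-access instructions are the only ones that touch memory:
    # they move data between memory and the register file.
    rf[dst] = memory[addr]

def execute_add(rf, dst, a, b):
    # A non-memory instruction reads its operands from the register
    # file, computes, and writes the result back to the register file.
    rf[dst] = rf[a] + rf[b]
```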
To simplify the hardware implementation, some approaches require that read instructions of the same kind be executed in order, which exacerbates the delay caused by reading data.
Disclosure of Invention
Embodiments of the present disclosure provide a scheme for instruction scheduling.
In a first aspect, a method for instruction scheduling is provided. The method includes: determining a status indicator associated with a target instruction, the status indicator indicating a status of a resource associated with the target instruction; determining whether the target instruction is ready based on the status indicator and the type of the target instruction; in response to determining that the target instruction is ready, executing a target phase of the target instruction, wherein the target phase is determined based on the type of the target instruction; and updating the status indicator in response to completion of execution of the access operation for the resource by the target instruction.
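A minimal software model of this first-aspect method might look as follows. It is an illustrative sketch, assuming one status indicator per resource with the claim language's two values (a "first value" meaning the data has been consumed, a "second value" meaning it can be consumed); all names are hypothetical.

```python
CONSUMED = 0  # "first value": data in the resource has been consumed
READY = 1     # "second value": data in the resource can be consumed

class Instr:
    def __init__(self, kind, resource):
        self.kind = kind          # "produce" or "consume"
        self.resource = resource  # e.g. a register name

def is_ready(instr, tokens):
    """A producer is ready when the old data has been consumed;
    a consumer is ready when new data is available."""
    token = tokens[instr.resource]
    if instr.kind == "produce":
        return token == CONSUMED
    return token == READY

def run_target_phase(instr, tokens):
    # Target phase: write-back for a producer, issue for a consumer.
    # After the access to the resource completes, flip the indicator.
    assert is_ready(instr, tokens)
    if instr.kind == "produce":
        tokens[instr.resource] = READY
    else:
        tokens[instr.resource] = CONSUMED
```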
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of target instruction includes: in response to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value, the first value indicating that data in the resource has been consumed; and determining that the target instruction is ready in response to the status indicator being the first value.
In some embodiments, updating the status indicator includes: in response to completion of execution of the access operation for the resource by the target instruction, the status indicator is updated to a second value indicating that data in the resource is capable of being consumed.
In some embodiments, executing the target phase of the target instruction includes: executing a write-back phase of the target instruction.
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of target instruction includes: in response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value, the second value indicating that data in the resource is capable of being consumed; and determining that the target instruction is ready in response to the status indicator being a second value.
In some embodiments, executing the target phase of the target instruction includes: issuing the target instruction.
In some embodiments, updating the status indicator includes: in response to completion of execution of the access operation for the resource by the target instruction, the status indicator is updated to a first value, the first value indicating that data in the resource has been consumed.
In some embodiments, the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and executing the target phase of the target instruction includes: during execution of the second instruction, the first instruction is issued, the second instruction indicating a data production operation for a second resource, the first resource being different from the second resource.
In some embodiments, the target instruction is a third instruction, the resource is a third resource, the third instruction indicates a data consumption operation for the third resource, and executing the target phase of the target instruction includes: issuing the third instruction in response to a fourth instruction setting the status indicator to the first value, the fourth instruction indicating a data production operation for the third resource, the fourth instruction being issued prior to the third instruction; and the method further comprises: in response to the third instruction updating the status indicator to the second value, causing a fifth instruction to be executed, the fifth instruction indicating a data production operation for the third resource, the fifth instruction being issued prior to the third instruction and later than the fourth instruction.
In some embodiments, issuing the target instruction in response to determining that the target instruction is ready comprises: in response to determining that the target instruction is ready, determining whether the number of instructions that have been issued but not yet completed is less than a threshold; and issuing the target instruction in response to determining that the number is less than the threshold.
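The issue-count threshold in this embodiment behaves like a small counter of in-flight instructions. The sketch below is an illustrative model, not the disclosed hardware; the class and method names are hypothetical.

```python
class IssueWindow:
    """Tracks instructions that have been issued but not completed,
    and refuses to issue once the threshold is reached."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.in_flight = 0

    def try_issue(self, ready):
        # Issue only if the instruction is ready AND the number of
        # issued-but-uncompleted instructions is below the threshold.
        if ready and self.in_flight < self.threshold:
            self.in_flight += 1
            return True
        return False

    def complete(self):
        # An in-flight instruction finished, freeing a window slot.
        assert self.in_flight > 0
        self.in_flight -= 1
```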
In some embodiments, determining the status indicator for the resource associated with the target instruction includes: issuing the target instruction as a memory load instruction; and determining, in a first phase in which the target instruction is a memory load instruction, the status indicator for the resource associated with the target instruction; and executing the target phase of the target instruction in response to determining that the target instruction is ready comprises: in response to determining that the target instruction is ready, reissuing the target instruction as an operation instruction.
In some embodiments, determining whether the target instruction is ready comprises: it is determined whether the first memory load instruction sets the status indicator to a second value.
In some embodiments, the method further comprises: after issuing the target instruction, a second memory load instruction associated with the status indicator is issued without confirming whether the status indicator is the first value.
In some embodiments, the resource may include at least one of: registers, memory addresses, queues, or processor resources.
In a second aspect of the present disclosure, a processing circuit is provided that includes an on-chip memory, a stream processor, and a processing engine. The processing circuitry is configured to perform any of the methods of the first aspect and implementations thereof.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises processing circuitry configured to perform any of the methods of the first aspect and implementations thereof.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer readable storage medium stores instructions that, when executed by the processing circuitry, cause the processing circuitry to perform any of the methods of the first aspect and implementations thereof.
In a fifth aspect of the present disclosure, a computer program product is provided. The computer program product comprises instructions which, when executed by the processing circuit, cause the processing circuit to perform any of the methods of the first aspect and implementations thereof.
It will be appreciated that the processing circuitry of the second aspect, the electronic device of the third aspect, the computer storage medium of the fourth aspect, or the computer program product of the fifth aspect provided above may be used to perform the method provided in the first aspect. Accordingly, the explanations regarding the first aspect apply equally to the second, third, fourth, and fifth aspects. For the advantages achieved by the second, third, fourth, and fifth aspects, reference is made to the advantages of the corresponding methods; they are not repeated here.
It should be understood that this Summary is not intended to identify key or essential features of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure, as illustrated in the accompanying drawings, wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a schematic block diagram of a processing circuit according to some embodiments of the present disclosure;
FIG. 3 illustrates a schematic block diagram of a three-dimensional tensor according to some embodiments of the present disclosure;
FIG. 4 illustrates an instruction scheduling process according to some embodiments of the present disclosure;
FIG. 5 illustrates an instruction scheduling process according to further embodiments of the present disclosure;
FIG. 6 illustrates an instruction scheduling process according to further embodiments of the present disclosure;
FIG. 7 illustrates an instruction scheduling process according to further embodiments of the present disclosure;
FIG. 8 illustrates a flowchart of an example process of a stream processing method according to some embodiments of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein mean open-ended inclusion, i.e., "including but not limited to." The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on." The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment." The term "another embodiment" means "at least one additional embodiment." The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
Processors typically resolve register dependencies between instructions in one of two ways, i.e., ensure that an instruction correctly uses the result register written by a previous instruction.
In some conventional schemes, if the execution duration of a preceding instruction is constant in all cases, the processor or software can always schedule the execution of a subsequent dependent instruction to begin after that duration has elapsed.
In other conventional schemes, if the execution time of an instruction is indeterminate, for example when the duration of each memory-access instruction is not constant, the hardware records the output register of each instruction and tracks its completion, while also decoding the input register(s) used by each instruction and comparing them against the unfinished output registers to find dependencies.
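The hardware tracking described in this conventional scheme behaves like a scoreboard. The following is a minimal, hypothetical sketch (not the disclosure's mechanism): pending output registers are recorded at issue, and an instruction stalls while any register it touches is still being written.

```python
class Scoreboard:
    """Minimal scoreboard model: records pending output registers and
    stalls instructions whose inputs (or outputs) are still pending."""
    def __init__(self):
        self.pending = set()

    def issue(self, outputs, inputs):
        # Stall on a read-after-write hazard (an input is still being
        # written) or a write-after-write hazard (same output pending).
        if any(r in self.pending for r in inputs + outputs):
            return False
        self.pending.update(outputs)
        return True

    def complete(self, outputs):
        # The write finished; dependent instructions may now issue.
        self.pending.difference_update(outputs)
```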
According to embodiments of the present disclosure, resource dependencies among instructions can be effectively resolved through status indicators of resources (such as registers, memory addresses, queues, or processor units), thereby improving the efficiency of instruction scheduling, improving system performance, and reducing the complexity of the circuit implementation.
Example Environment
FIG. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented. The example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer. In one embodiment, the example environment 100 includes, for example, a Central Processing Unit (CPU) 20, a system memory 10, a north bridge/memory bridge 30, an accelerator subsystem 40, a device memory 50, and a south bridge/Input-Output (IO) bridge 60. The system memory 10 may be, for example, a volatile memory such as a Dynamic Random Access Memory (DRAM). The north bridge/memory bridge 30 integrates, for example, a memory controller and a PCIe controller; it is responsible for data exchange between the CPU 20 and the high-speed interfaces, and bridges the CPU 20 and the south bridge/IO bridge 60. The south bridge/IO bridge 60 serves the low-speed interfaces of the computer, such as a Serial Advanced Technology Attachment (SATA) controller. The accelerator subsystem 40 may include, for example, devices or chips such as Graphics Processing Units (GPUs) and Artificial Intelligence (AI) accelerators for accelerating the processing of graphics, video, and the like. In this disclosure, the accelerator subsystem 40 may also be referred to as a "processing circuit."
With continued reference to FIG. 1, the device memory 50 may be, for example, a volatile memory such as DRAM that is located external to the accelerator subsystem 40. In this disclosure, the device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of the accelerator subsystem 40. In contrast, the accelerator subsystem 40 also has volatile memory within its chip, such as a level one (L1) cache and optionally a level two (L2) cache, which may be collectively referred to as "on-chip memory."
It should be appreciated that while one example environment 100 in which embodiments of the present disclosure may be implemented is shown in FIG. 1, the present disclosure is not limited thereto. Embodiments of the present disclosure may also be used in other application environments having an accelerator subsystem such as a GPU, for example ARM and RISC-V architectures.
FIG. 2 shows a schematic block diagram of a processing circuit 200 according to one embodiment of the present disclosure. The processing circuit 200 may be, for example, one particular implementation of the chip of the accelerator subsystem 40 of FIG. 1, such as a GPU chip. In one embodiment, the processing circuit 200 includes a Stream Processor (SP) 210, a page table device 220, a Processing Engine (PE) unit 230, a Direct Memory Access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
The processing circuit 200 is controlled by a host device such as the CPU 20 and receives instructions from the CPU 20. The SP 210 analyzes instructions from the CPU 20 and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing. The page table device 220 is used to manage the on-chip virtual storage of the processing circuit 200. In this disclosure, the L2 cache 250 and off-chip memory, such as the device memory 50 in FIG. 1, constitute a virtual storage system. The page table device 220 is maintained jointly by the SP 210, the PE unit 230, and the DMA controller 240.
The PE unit 230 includes a plurality of processing engines (PEs) PE_1, PE_2, …, PE_N, where N represents an integer greater than 1. Each PE in the PE unit 230 may be a Single Instruction Multiple Thread (SIMT) device. In a PE, each thread may have its own register file, and all threads of each PE also share a unified register file. Multiple PEs may perform the same or different processing tasks in parallel, and may perform in parallel the address translation and access to target data in memory described below, thereby reducing processing time. It is appreciated that the target elements processed by the multiple PEs are not identical, and that the segments, pages, and cache lines in which the target elements reside, as well as the attributes, sizes, and dimensional ordering of the elements, may differ, as described in more detail below.
Each thread may exchange thread-level data between its own register file and the memory subsystem. Each thread has its own arithmetic logic execution unit and uses its own memory address, which employs a typical register access architecture (load-store architecture). Each execution unit includes a floating point/fixed point unit that supports multiple data types and an arithmetic logic unit.
Most instructions perform arithmetic and logical operations, such as addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, and NOT. Operands come from registers. Memory read/write instructions provide data exchange between registers and on-chip/off-chip memory. In general, all execution units in a PE execute the same instruction synchronously. By using predicate registers, some of the execution units may be masked, thereby implementing the function of branch instructions.
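The predicate-masking behavior described above can be sketched as a simple lane model: every lane receives the same instruction, but masked lanes do not commit a result. This is an illustrative sketch with hypothetical names, not the disclosed hardware.

```python
def masked_execute(predicate, values, op):
    # All lanes execute the same instruction in lockstep; a lane whose
    # predicate bit is False keeps its old value instead of committing
    # the result, which is how branches are emulated in SIMT.
    return [op(v) if p else v for p, v in zip(predicate, values)]
```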
In one embodiment, the processing circuit 200 of FIG. 2 may, for example, perform the following: 1) constructing page table entry contents and an initial state; 2) moving data from off-chip memory, such as the device memory 50 in FIG. 1, to on-chip memory, such as the L2 cache 250; 3) starting and executing a program; 4) defining each segment and describing the tensors and stored attributes; 5) when program execution is completed, writing the data of the execution result into off-chip memory.
It will be appreciated that in the disclosed embodiments, the data processed by the processing circuit 200 is primarily multidimensional tensors. For example, in one embodiment, a tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the tensor's size may differ across these dimensions. In other embodiments, the tensor may be a one-dimensional, two-dimensional, three-dimensional, or higher-dimensional tensor, which is not limiting of the present disclosure.
Further, in embodiments of the present disclosure, tensors may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and other custom element types, which is also not limiting of the present disclosure. Tensors are addressed in basic units of elements. For example, if the element type is int8, the addressing base unit is one byte; if the element type is int16, the addressing base unit is two bytes, and so on.
In some cases, the amount of data contained in a tensor may be large, while the capacity of the L2 cache 250 is limited, so the tensor cannot be loaded in its entirety into the on-chip L2 cache 250. In some embodiments of the present disclosure, to facilitate parallel processing of a tensor, the tensor may be divided into at least one segment. In the case where the tensor comprises only one segment, the tensor is that segment. In the case where the tensor comprises a plurality of segments, each segment is part of the tensor. The CPU 20 may specify by instruction which PE handles each part of a segment.
Tensor storage structure
FIG. 3 shows a schematic block diagram of a three-dimensional tensor 300 according to one embodiment of the present disclosure. The three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3. The CPU 20 may specify that the tensor elements of segment S1 are to be processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. In addition, the CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1 to PE_4. In embodiments of the present disclosure, each segment may have a different size, so a programmer may flexibly configure segments based on design needs. In practice, the division into pages may be implemented in any one or more dimensions, and the numbers of pages divided in different dimensions are independent of one another.
In one embodiment, the tensor data may be stored in an on-chip cache, such as the L2 cache 250. But because the capacity of on-chip high-speed memory is small, when the tensor is large the programmer may divide the tensor into multiple segments, each describing a portion of the tensor. The kernel may be launched multiple times; each time, one segment of the tensor is moved in advance from off-chip memory to on-chip memory by the DMA controller 240 and used for the kernel operation. After multiple kernel launches, all segments contained in the tensor have been processed, and the whole operation process is finished. When the on-chip cache is sufficient to accommodate all the tensors the kernel needs to access, a tensor needs only one segment description, and the kernel needs to be launched only once.
Further, in some embodiments of the present disclosure, at least one page may also be set within a segment to further subdivide the tensor. For example, the first segment S1 has four pages P[1], P[2], P[3], and P[4], while the second segment S2 has only one page. In embodiments of the present disclosure, the number of pages in each segment may be different, so a programmer may flexibly configure the size of the pages within a segment based on design needs. For example, pages are sized so that a page fits in the L2 cache 250 in its entirety.
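The constraint that a page should fit on-chip suggests a simple sizing calculation. The sketch below is illustrative only: the function name and the byte budgets are hypothetical parameters, not values from this disclosure.

```python
import math

def pages_needed(segment_elems, elem_bytes, page_budget_bytes):
    """How many pages a segment must be split into if each page has to
    fit within the given on-chip byte budget (illustrative model)."""
    elems_per_page = page_budget_bytes // elem_bytes
    return math.ceil(segment_elems / elems_per_page)
```

For instance, a segment of 1024 int16 elements under a hypothetical 512-byte page budget would need four pages.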
As described above, when addressing a tensor, the smallest addressed unit is an element. A page may generally include a plurality of elements. The page in which the target element is located is referred to herein as a "target element page." In some embodiments of the present disclosure, a page may include a plurality of cache lines. If the target element page is located in the L2 cache 250 and a PE reads the target element via the L1 cache 260, the L2 cache 250 needs to transfer, in its entirety, a small physically contiguous portion of its data that includes the target element to the L1 cache 260. This small portion of data is also referred to as cache-line data, and this caching mechanism is based on the principle of spatial locality. While it may take only a few clock cycles for a PE to read data from the L1 cache 260, it may take tens or even hundreds of clock cycles for the L1 cache 260 to read data from the L2 cache 250. Accordingly, it is desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250. Although the smallest unit of data transferred from the L2 cache 250 to the L1 cache 260 is described herein as a "cache line," in the present disclosure this portion of data is not necessarily arranged in rows or columns: the data within a "cache line" may be distributed across multiple dimensions, and the size of the data in each dimension is not limited to 1. The PEs perform parallel processing on the data within a segment, and the allocation of PEs is spread out in the logical address space of the data, independent of the physical storage structure of the segment, as described in detail below.
In FIG. 3, a first set of cache lines in the first page P[1] is designated for processing by PE_1, and a second set of cache lines is designated for processing by PE_2. Although the tensor is shown here as being processed sequentially by multiple PEs in order, it is to be understood that the processing of tensor data is independent of the order of the PEs, which is not limiting of the present disclosure. For example, the portion of tensor data denoted PE_2 in FIG. 3 may be processed by PE_M, where M denotes any integer not greater than N.
Example scheduling procedure one
As will be described in detail below, the processing circuit 200 may manage dependencies between different resources by using status indicators. For convenience of description, the mechanism of the status indicator will be described below with a register as an example of a resource. It should be appreciated that other suitable types of resources may also be used, examples of which include, but are not limited to: memory addresses, queues, processor units, and the like.
In some embodiments, processing circuitry 200 may use a status indicator (also referred to as a "token") to manage data dependencies. A token is a state value that may be used to indicate the state of data in a corresponding resource (e.g., register).
Unlike traditional hardware-based register states, the token provides developers with a software-based state management strategy: an implementation does not need to access the corresponding hardware state through a register identifier, but can instead resolve the data dependency problem through flexible token management.
Illustratively, if the token is a first value (e.g., 1), it indicates that the data in its corresponding register is ready and has not been used or consumed. Conversely, if the token is a second value (e.g., 0), it indicates that the data in its corresponding register is not ready.
In some embodiments, the processing circuit 200 may determine whether an instruction may proceed based on the token value of a register and the type of the instruction. Specifically, if the instruction is a data-consuming instruction, i.e., it indicates a data consumption operation on the data in a register, the processing circuit 200 may determine whether the token corresponding to the register is 1. If so, the corresponding stage of the instruction may be executed. Otherwise, if the token is 0, the instruction must wait before the corresponding stage can be executed.
In some embodiments, this stage is determined based on the type of instruction. For example, if the instruction is a data consuming instruction, the stage may be, for example, an issue stage of the instruction. That is, the data consuming instruction may be issued only if the token is 1.
In some embodiments, if the instruction is a data-producing instruction, i.e., it indicates a data production operation on the data in a register, the processing circuit may determine whether the token corresponding to the register is 0. If so, the corresponding stage of the instruction may be executed; otherwise, if the token is 1, the instruction needs to wait.
In some embodiments, this stage is determined based on the type of the instruction. If the instruction is a data-producing instruction, the stage may be, for example, the write-back stage of the instruction. That is, the data-producing instruction may be issued first, with the check performed during the write-back stage: the instruction waits until the token is 0 before it can perform the data write-back.
In some embodiments, the processing circuit 200 may also update the token corresponding to a register upon completion of an instruction's access operation to that register. Illustratively, if the instruction is a data-consuming instruction, the token may be set to 0 after the data has been read from the register into the execution unit. As another example, if the instruction is a data-producing instruction, the token may be set to 1 after the write-back of the data to the register is completed.
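The gating and update rules of this section can be summarized in one small function: a consumer is gated at issue on token == 1, a producer at write-back on token == 0, and the access that completes flips the token. This is an illustrative model with hypothetical names.

```python
def gate_and_update(kind, token):
    """Return (stage the instruction may enter, token value afterwards).
    'wait' means the token blocks the instruction at its gated stage."""
    if kind == "consume":
        # A data-consuming instruction is gated at issue: it may issue
        # only when token == 1, and clears the token to 0 afterwards.
        return ("issue", 0) if token == 1 else ("wait", token)
    # A data-producing instruction issues immediately but is gated at
    # write-back: it writes back only when token == 0, then sets it to 1.
    return ("write_back", 1) if token == 0 else ("wait", token)
```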
FIG. 4 illustrates an example instruction scheduling process 400 according to some embodiments of the disclosure. In FIG. 4, "i" represents the issue point of an instruction, "s" represents the set point (i.e., set to 1) of the corresponding token, and "c" represents the clear point (i.e., set to 0) of the token.
As shown in FIG. 4, after the instruction Load RF[0], MemA completes the data load, the corresponding token is set to 1. In some embodiments, a developer may indicate in an instruction that a particular operation is to be performed on a particular token. For example, a developer may write the instruction Load RF[0], MemA (no clear, check and set token1), i.e., the instruction has no clear operation and needs to check "token1" and set "token1" to 1 after completion.

Accordingly, upon detecting that the token is set to 1, the instruction Add RF[x], RF[0] may be issued; upon completing the data transfer to the execution unit, the token is cleared to 0. For example, the developer may write the instruction Add RF[x], RF[0] (check and clear token1, no set), i.e., the instruction has no set operation and needs to check "token1" and clear "token1" to 0 after completion.

Further, the instruction Load RF[0], MemB may, for example, be issued first to perform the memory access operation. Subsequently, the instruction may perform its data write-back stage after waiting for the token to become 0, and set the token to 1 after the data load is completed. For example, a developer may write the instruction Load RF[0], MemB (no clear, check and set token1), i.e., the instruction has no clear operation and needs to check "token1" and set "token1" to 1 after completion.

Similarly, the instruction Add RF[y], RF[0] may be issued upon detecting that the token is 1, and clear the token to 0 upon completing the data transfer. For example, the developer may write the instruction Add RF[y], RF[0] (check and clear token1, no set), i.e., the instruction has no set operation and needs to check "token1" and clear "token1" to 0 after completion.

Further, the instruction Load RF[0], MemC may be issued to perform the data access operation, and wait for the token to become 0 before performing the data write-back stage. After the data load is completed, the instruction may set the token to 1. For example, a developer may write the instruction Load RF[0], MemC (no clear, check and set token1), i.e., the instruction has no clear operation and needs to check "token1" and set "token1" to 1 after completion.

Similarly, the instruction Add RF[z], RF[0] may be issued upon detecting that the token is 1, and clear the token to 0 upon completing the data transfer. For example, the developer may write the instruction Add RF[z], RF[0] (check and clear token1, no set), i.e., the instruction has no set operation and needs to check "token1" and clear "token1" to 0 after completion.
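The alternating Load/Add sequence of process 400 can be traced with a short simulation. This is a hypothetical Python sketch (the helper name and trace strings are illustrative, not from the patent):

```python
def run_process_400():
    """Trace the alternating Load/Add sequence gated by a single token."""
    token = 0
    trace = []
    program = [("Load", "MemA"), ("Add", "RF[x]"),
               ("Load", "MemB"), ("Add", "RF[y]"),
               ("Load", "MemC"), ("Add", "RF[z]")]
    for op, operand in program:
        if op == "Load":
            assert token == 0          # write-back must wait for token == 0
            token = 1                  # "s" point: set after the load completes
            trace.append(f"Load {operand}: token=1")
        else:
            assert token == 1          # Add issues only once token == 1
            token = 0                  # "c" point: clear after data moves out
            trace.append(f"Add {operand}: token=0")
    return trace
```

Running it yields six trace entries, one per instruction, with the token alternating 1, 0, 1, 0, 1, 0 exactly as in FIG. 4.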
Based on this approach, embodiments of the present disclosure are able to utilize status indicators (i.e., token) to effectively manage resource or data dependencies among instructions, thereby improving the efficiency of instruction scheduling and reducing the complexity of the system.
Example scheduling procedure two
In some embodiments, processing circuitry 200 may further enhance the execution efficiency of instructions by issuing read instructions in advance. Illustratively, each read instruction may use a different token, with its result written to a different register. Each Add instruction thus checks a different token and uses different register operands.
In some embodiments, multiple read instructions may be required to complete in order. Thus, the processing circuitry may cause only the last read instruction to update the token, and the first Add instruction to be issued after checking that token. In this way, the processing circuitry can reduce system complexity without adding excessive latency, since the completion times of the multiple read instructions are close together.
FIG. 5 illustrates an example instruction scheduling process 500 according to some embodiments of the disclosure. As shown in FIG. 5, the instructions Load RF[0], MemA; Load RF[1], MemB; and Load RF[2], MemC may be issued in sequence and executed in sequence. Accordingly, the instruction Add RF[x], RF[0] may check the token corresponding to RF[0] and issue after the token is 1. The instruction Add RF[y], RF[1] may check the token corresponding to RF[1] and issue after the token is 1. The instruction Add RF[z], RF[2] may check the token corresponding to RF[2] and issue after the token is 1.
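Because each Add depends only on its own token, no Add has to wait for an unrelated Load. The following hypothetical Python sketch (function name illustrative) shows that an Add becomes ready as soon as its own Load finishes, in whatever order the Loads complete:

```python
def adds_ready_after(completion_order):
    """completion_order: indices of the Loads in the order they finish."""
    tokens = [0, 0, 0]          # token[i] guards register RF[i]
    ready = []
    for i in completion_order:
        tokens[i] = 1           # Load i writes RF[i] and sets token[i]
        # The Add that reads RF[i] checks only token[i], so it can
        # issue immediately, regardless of the other Loads.
        if tokens[i] == 1:
            ready.append(i)
    return ready
```

Even if the Loads complete out of order (say 2, 0, 1), each corresponding Add is unblocked as soon as its own token is set.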
Based on this manner, embodiments of the present disclosure may enable multiple read instructions to be executed in parallel, thereby improving the loading efficiency of the system.
Example scheduling procedure three
In some embodiments, processing circuitry 200 may resolve data dependencies between multiple instructions through a single token. In the example scheduling process above, processing circuitry 200 may need to maintain three registers and three corresponding tokens. To further reduce overhead, the processing circuitry 200 may instead schedule execution of the above instructions using a single token.
FIG. 6 illustrates an example instruction scheduling process 600 according to some embodiments of the disclosure. As shown in FIG. 6, after the first read instruction writes RF[0], token[0] is set to 1, and the next read instruction cannot write RF[0]. It must wait for token[0] to be reset to 0, and only an Add instruction can clear token[0] to 0. An Add instruction checks token[0] when it issues; when token[0] becomes 1, the Add instruction may be issued and executed, resetting token[0] to 0 after reading RF[0].
Specifically, the scheduling process of process 600 is as follows. Initially, the value of token[0] is 0. Three read instructions may be issued in sequence to perform memory access operations, reading data from the three memory addresses A, B, and C and returning the data in sequence to the write-back queue of RF[0]. As shown in FIG. 6, the three read instructions may be issued in sequence, independent of the completion of the other read instructions.
Further, the first Add instruction waits at the issue stage until token[0] is set to 1. The first read instruction retrieves its data and is placed first in the write-back queue of RF[0]; it checks that the value of token[0] is 0, writes RF[0], and sets token[0] to 1.
At this point, the second read instruction may have read its data from the MemB address and be queued in the second position of the write-back queue of RF[0]. After the first read instruction has written RF[0], the second read instruction moves to the first position, while the third read instruction may have returned and be queued in the second position.
Alternatively, the second read instruction may be in the first position of the fetch queue and the third read instruction in the second position of the fetch queue.
Further, the second read instruction checks that token[0] is 1 (set by the first read instruction) and waits for it to become 0.
Then the first Add instruction detects that token[0] is 1, issues and executes, and clears token[0] to 0. The second Add instruction waits at the issue stage for token[0] to become 1. After token[0] changes from 1 to 0, the data of the second read instruction is written back to RF[0], which sets token[0] to 1.
The second Add instruction may then be issued and executed, after which it clears token[0] to 0. At this point, the third read instruction should have read its data from the MemC address and be waiting to write back RF[0]. After token[0] changes from 1 to 0, the third read instruction may write back RF[0] and set token[0] to 1. The third Add instruction is then issued and executed, and then resets token[0] to 0.
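The single-token scheme above relies on the write-back queue keeping the three Load results in order, with the token ping-ponging between producer and consumer. A hypothetical Python sketch of process 600 (names illustrative, hardware queues modeled with a `deque`):

```python
from collections import deque

def run_process_600():
    """Single token, single register: results are consumed strictly in order."""
    token = 0
    writeback = deque(["MemA", "MemB", "MemC"])  # Load results, returned in order
    consumed = []
    while writeback:
        assert token == 0        # a queued result may write RF[0] only while token == 0
        data = writeback.popleft()
        token = 1                # write-back to RF[0] sets token[0]
        # The waiting Add sees token[0] == 1, issues, reads RF[0],
        # and clears token[0], unblocking the next queued write-back.
        consumed.append(data)
        token = 0
    return consumed
```

The Adds therefore consume MemA, MemB, and MemC in exactly that order, with only one register and one token in play.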
In this manner, embodiments of the present disclosure are able to efficiently manage data dependencies between multiple instructions with a single token, and to reduce the number of registers used.
In some embodiments, processing circuitry 200 may also add resources at the instruction issue stage to limit the number of instructions issued, to address possible deadlock issues. In particular, processing circuitry 200 may add an instruction issue queue in addition to the write-back queue and the fetch queue, where the length of the queue is the maximum number of read instructions allowed to issue. This may ensure that the read instructions are not blocked.
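The issue-queue bound can be modeled as a simple in-flight counter. The sketch below is a hypothetical Python illustration (class and method names are not from the patent):

```python
class IssueLimiter:
    """Caps the number of read instructions in flight, modeling the issue queue."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # length of the instruction issue queue
        self.in_flight = 0

    def try_issue(self):
        # Refuse to issue when the queue is full; this prevents deadlock
        # between queued write-backs and consumers that have not yet issued.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def on_complete(self):
        # A read instruction finished its write-back; free one slot.
        self.in_flight -= 1
```

With a limit of 2, two reads may issue back-to-back, a third is held, and it becomes issuable again once one of the first two completes.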
It will be appreciated that, while adding an instruction issue queue requires certain resources, it is more cost-effective than increasing the number of registers, especially in a single-instruction multiple-thread processor.
In some embodiments, processing circuitry 200 may also solve the problem of possible instruction blocking by looking ahead at instructions. Although instruction look-ahead requires additional resources, the amount of resources required is reasonable compared to the reduction in register usage. By using a moderate number of registers combined with a look-ahead mechanism, embodiments of the present disclosure can effectively hide DRAM storage latency and greatly improve performance.
Example scheduling procedure four
In some embodiments, processing circuitry 200 may utilize tokens to implement a compound instruction combining memory access and computation, to further improve execution efficiency. In some embodiments, processing circuitry 200 may allow, for example, a compute instruction (e.g., the mm instruction shown in FIG. 7) to be used in conjunction with a read instruction.
In some embodiments, unlike a normal data consumption instruction, the compute instruction may be issued twice and have two distinct execution stages. In the first stage, the compute instruction may be issued as a normal memory load instruction. In the second stage, after detecting that the corresponding token is set to 1, the compute instruction may be issued as a data operation instruction, and after the data is transferred to the execution unit, the corresponding token is set to 0.
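The two-stage lifecycle can be sketched in Python. This is a hypothetical illustration of the issue-twice behavior (the function name and stage strings are not from the patent):

```python
def mm_two_stage(token):
    """Return the stages an mm instruction goes through, given the token value."""
    # Stage 1: the mm instruction issues as a normal memory load
    # instruction, without checking the token.
    stages = ["issued as memory load instruction"]
    if token == 1:
        # Stage 2: the token is set, so the instruction is reissued as a
        # data operation to consume the register's data.
        stages.append("reissued as data operation")
        token = 0  # cleared after the data is transferred to the execution unit
    return stages, token
```

With token == 1 the instruction completes both stages and clears the token; with token == 0 it stalls after the first stage until a Load sets the token.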
FIG. 7 illustrates an example instruction scheduling process 700 according to some embodiments of the disclosure. As shown in FIG. 7, the instructions Load RF[0], MemA through mm RF[z], RF[0] may be issued sequentially, wherein the three mm instructions are issued as normal memory load instructions.

As shown in FIG. 7, the instruction Load RF[0], MemA sets the token to 1 upon completion, at which point the instruction mm RF[x], RF[0] may be reissued as a data operation instruction to consume the data stored in the register. After the data is transferred to the execution unit, the corresponding token may be cleared to 0.

After the token is cleared to 0, the instruction Load RF[0], MemB may execute, and when it completes the data load, the token may be set to 1. Further, the instruction mm RF[y], RF[0] may be reissued as a data operation instruction to consume the data stored in the register. After the data is transferred to the execution unit, the corresponding token may be cleared to 0.

Similarly, after the token is again cleared to 0, the instruction Load RF[0], MemC may execute, and when it completes the data load, the token may be set to 1. Further, the instruction mm RF[z], RF[0] may be reissued as a data operation instruction to consume the data stored in the register. After the data is transferred to the execution unit, the corresponding token may be cleared to 0.
In some embodiments, the mm instruction may be, for example, an instruction to perform a matrix multiplication operation.
In this way, embodiments of the present disclosure are able to automatically coordinate read and operation instructions so that programs are not affected by storage latency, thereby minimizing the number of registers used to hide read-data latency.
Example procedure for instruction scheduling
Fig. 8 illustrates a flow chart of an instruction scheduling method 800 according to some embodiments of the present disclosure. In one embodiment, the method 800 may be implemented, for example, by the processing circuit 200 (or accelerator subsystem 40) such as a GPU, and thus the various aspects described above with respect to fig. 1-3 may be selectively applied to the method 800.
At block 810, processing circuitry 200 determines a status indicator associated with the target instruction, the status indicator being used to indicate a status of a resource associated with the target instruction. At block 820, processing circuit 200 determines whether the target instruction is ready based on the status indicator and the type of target instruction. In response to determining that the target instruction is ready at block 820, the method 800 proceeds to block 830 where the processing circuit 200 executes a target phase of the target instruction, the target phase being determined based on the type of target instruction. At block 840, processing circuitry 200 determines whether the access operation for the resource of the target instruction is complete. In response to determining that the access operation execution is complete at block 840, method 800 proceeds to block 850 where processing circuit 200 updates the status indicator at block 850.
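The control flow of blocks 810-850 can be condensed into a single decision function. The following hypothetical Python sketch assumes the token conventions defined earlier (1 = data ready to consume, 0 = data consumed); the function name and string labels are illustrative:

```python
def schedule_step(instr_type, token):
    """One pass through blocks 810-850 for a single target instruction.

    instr_type: "consume" (e.g. an Add) or "produce" (e.g. a Load write-back).
    Returns (executed_target_stage, new_token).
    """
    # Block 820: readiness depends on the status indicator and instruction type.
    ready = (token == 1) if instr_type == "consume" else (token == 0)
    if not ready:
        return False, token          # wait; status indicator unchanged
    # Block 830: execute the target stage (issue for consumers,
    # write-back for producers).
    # Blocks 840-850: the access operation completes; update the indicator.
    new_token = 0 if instr_type == "consume" else 1
    return True, new_token
```

A consumer with token 1 executes and clears the token; a producer with token 0 executes and sets it; all other combinations wait.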
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of target instruction includes: in response to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value, the first value indicating that data in the resource has been consumed; and determining that the target instruction is ready in response to the status indicator being the first value.
In some embodiments, updating the status indicator includes: in response to completion of execution of the access operation for the resource by the target instruction, the status indicator is updated to a second value indicating that data in the resource is capable of being consumed.
In some embodiments, the target stage of executing the target instruction includes: executing a write-back stage of the target instruction.
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of target instruction includes: in response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value, the second value indicating that data in the resource is capable of being consumed; and determining that the target instruction is ready in response to the status indicator being a second value.
In some embodiments, the target stage of executing the target instruction includes: issuing the target instruction.
In some embodiments, updating the status indicator includes: in response to completion of execution of the access operation for the resource by the target instruction, the status indicator is updated to a first value, the first value indicating that data in the resource has been consumed.
In some embodiments, the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and executing the target phase of the target instruction includes: during execution of the second instruction, the first instruction is issued, the second instruction indicating a data production operation for a second resource, the first resource being different from the second resource.
In some embodiments, the target instruction is a third instruction, the resource is a third resource, the third instruction indicates a data consumption operation for the third resource, and the target stage of executing the target instruction includes: issuing a third instruction in response to the fourth instruction setting the target indicator to the first value, the fourth instruction indicating a data production operation for the third resource, the fourth instruction being issued prior to the third instruction; and the method further comprises: in response to the third instruction updating the target indicator to the second value, causing a fifth instruction to be executed, the fifth instruction indicating a data production operation for the third resource, the fifth instruction being issued prior to the third instruction and later than the fourth instruction.
In some embodiments, issuing the target instruction in response to determining that the target instruction is ready comprises: in response to determining that the target instruction is ready, determining whether a number of instructions that have been issued and that have not been completed is less than a threshold; and issuing a target instruction in response to determining that the number is less than the threshold.
In some embodiments, determining the status indicator for the resource associated with the target instruction includes: issuing the target instruction as a memory load instruction; and determining, in a first stage in which the target instruction is a memory load instruction, the status indicator for the resource associated with the target instruction; and the target phase of executing the target instruction in response to determining that the target instruction is ready comprises: in response to determining that the target instruction is ready, the target instruction is reissued as an operation instruction.
In some embodiments, determining whether the target instruction is ready comprises: it is determined whether the first memory load instruction sets the status indicator to a second value.
In some embodiments, the method further comprises: after issuing the target instruction, a second memory load instruction associated with the status indicator is issued without confirming whether the status indicator is the first value.
The present disclosure may be a method, a processing circuit, an electronic device, a computer storage medium, and/or a computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable Compact Disk Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is illustrative, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (16)

1. A method of instruction scheduling, the method comprising:
determining a status indicator associated with a target instruction, the status indicator being for indicating a status of a resource associated with the target instruction;
determining whether the target instruction is ready based on the status indicator and a type of the target instruction;
in response to determining that the target instruction is ready, executing a target phase of the target instruction, the target phase determined based on the type; and
updating the status indicator in response to completion of execution of the access operation for the resource by the target instruction;
Wherein the target instruction is a third instruction, the resource is a third resource, the third instruction indicates a data consumption operation for the third resource, and a target stage of executing the target instruction includes:
issuing the third instruction in response to a fourth instruction setting the target indicator to a first value, the fourth instruction indicating a data production operation for the third resource, the fourth instruction being issued prior to the third instruction; and
In response to the third instruction updating the target indicator to a second value, causing a fifth instruction to be executed, the fifth instruction indicating a data production operation for the third resource, the fifth instruction being issued prior to the third instruction and later than the fourth instruction.
2. The method of claim 1, wherein determining whether the target instruction is ready based on the status indicator and a type of the target instruction comprises:
in response to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value indicating that data in the resource has been consumed; and
Responsive to the status indicator being the first value, the target instruction is determined to be ready.
3. The method of claim 2, wherein executing the target phase of the target instruction comprises: executing a write-back stage of the target instruction.
4. The method of claim 2, wherein updating the status indicator comprises:
in response to completion of execution of the access operation for the resource by the target instruction, the status indicator is updated to a second value indicating that data in the resource is available for consumption.
5. The method of claim 1, wherein determining whether the target instruction is ready based on the status indicator and a type of the target instruction comprises:
in response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value indicating that data in the resource can be consumed; and
responsive to the status indicator being the second value, the target instruction is determined to be ready.
6. The method of claim 5, wherein executing a target phase of the target instruction comprises: issuing the target instruction.
7. The method of claim 5, wherein updating the status indicator comprises:
in response to completion of execution of an access operation for the resource by the target instruction, the status indicator is updated to a first value indicating that data in the resource has been consumed.
8. The method of claim 1, wherein the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and executing a target phase of the target instruction comprises:
the first instruction is issued during execution of a second instruction, the second instruction indicating a data production operation for a second resource, the first resource being different from the second resource.
9. The method of claim 1, wherein issuing the target instruction in response to determining that the target instruction is ready comprises:
in response to determining that the target instruction is ready, determining whether a number of instructions that have been issued and outstanding is less than a threshold; and
the target instruction is issued in response to determining that the number is less than the threshold.
10. The method of claim 1, wherein determining a status indicator for a resource associated with a target instruction comprises: issuing the target instruction as a memory load instruction; and determining, in a first stage of the target instruction as the memory load instruction, the status indicator of the resource associated with the target instruction; and
executing the target phase of the target instruction in response to determining that the target instruction is ready comprises: in response to determining that the target instruction is ready, reissuing the target instruction as an operation instruction.
11. The method of claim 10, wherein determining whether the target instruction is ready comprises:
a determination is made as to whether the first memory load instruction set the status indicator to a second value.
12. The method of claim 10, further comprising:
after issuing the target instruction, a second memory load instruction associated with the status indicator is issued without confirming whether the status indicator is a first value.
13. The method of any of claims 1 to 12, wherein the resources comprise at least one of: registers, memory addresses, queues, or processor resources.
14. A processing circuit comprising an on-chip memory, a stream processor and a processing engine, wherein the processing circuit is configured to perform the method of any of claims 1 to 13.
15. An electronic device comprising an off-chip memory and a processing circuit, wherein the processing circuit is configured to perform the method of any of claims 1 to 13.
16. A computer readable storage medium having stored thereon one or more computer instructions, wherein the one or more computer instructions are executed by a processing circuit to implement the method of any of claims 1 to 13.
CN202210247863.3A 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment Active CN114610394B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210247863.3A CN114610394B (en) 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment
PCT/CN2022/107512 WO2023173642A1 (en) 2022-03-14 2022-07-22 Instruction scheduling method, processing circuit and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210247863.3A CN114610394B (en) 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment

Publications (2)

Publication Number Publication Date
CN114610394A CN114610394A (en) 2022-06-10
CN114610394B true CN114610394B (en) 2023-12-22

Family

ID=81863471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210247863.3A Active CN114610394B (en) 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment

Country Status (2)

Country Link
CN (1) CN114610394B (en)
WO (1) WO2023173642A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610394B (en) * 2022-03-14 2023-12-22 Hexaflake (Nanjing) Information Technology Co., Ltd. Instruction scheduling method, processing circuit and electronic equipment
CN114996205B (en) * 2022-07-21 2022-12-06 Zhejiang Lab On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system

Citations (10)

Publication number Priority date Publication date Assignee Title
CN101038567A (en) * 2006-03-16 2007-09-19 International Business Machines Corporation Method, system, apparatus for performing cacheline polling operation
CN101395573A (en) * 2006-02-28 2009-03-25 MIPS Technologies, Inc. Distributive scoreboard scheduling in an out-of-order processor
CN102782672A (en) * 2010-02-01 2012-11-14 Philippe Manet A tile-based processor architecture model for high-efficiency embedded homogeneous multicore platforms
CN103502906A (en) * 2011-03-30 2014-01-08 Symbol Technologies, Inc. Dynamic allocation of processor cores running an operating system
CN109074260A (en) * 2016-04-28 2018-12-21 Microsoft Technology Licensing, LLC Out-of-order block-based processor and instruction scheduler
CN110520851A (en) * 2017-04-10 2019-11-29 Arm Limited Cache-based communication between execution threads of a data processing system
CN111815104A (en) * 2020-05-18 2020-10-23 Shenzhen First Response Information Technology Co., Ltd. Method and equipment for scheduling emergency response resources
CN112136303A (en) * 2018-05-24 2020-12-25 International Business Machines Corporation Secure delegation of refresh tokens for time-consuming operations
CN113282338A (en) * 2020-02-20 2021-08-20 Intel Corporation Concurrent workload scheduling with multi-level dependencies
CN113874906A (en) * 2020-03-20 2021-12-31 Nvidia Corporation Programming model for resource-constrained scheduling

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US8261046B2 (en) * 2006-10-27 2012-09-04 Intel Corporation Access of register files of other threads using synchronization
US9483313B2 (en) * 2010-10-19 2016-11-01 Microsoft Technology Licensing, Llc Availability management for reference data services
EP3602313B1 (en) * 2017-12-04 2022-02-02 Google LLC Synchronized processing of data using a system-on-chip
CN111090464B (en) * 2018-10-23 2023-09-22 Huawei Technologies Co., Ltd. Data stream processing method and related equipment
US11157528B2 (en) * 2019-04-17 2021-10-26 International Business Machines Corporation Dependency-driven workflow management
US11182207B2 (en) * 2019-06-24 2021-11-23 Nvidia Corporation Pre-fetching task descriptors of dependent tasks
US11119772B2 (en) * 2019-12-06 2021-09-14 International Business Machines Corporation Check pointing of accumulator register results in a microprocessor
CN114610394B (en) * 2022-03-14 2023-12-22 Hexaflake (Nanjing) Information Technology Co., Ltd. Instruction scheduling method, processing circuit and electronic equipment

Patent Citations (10)

Publication number Priority date Publication date Assignee Title
CN101395573A (en) * 2006-02-28 2009-03-25 MIPS Technologies, Inc. Distributive scoreboard scheduling in an out-of-order processor
CN101038567A (en) * 2006-03-16 2007-09-19 International Business Machines Corporation Method, system, apparatus for performing cacheline polling operation
CN102782672A (en) * 2010-02-01 2012-11-14 Philippe Manet A tile-based processor architecture model for high-efficiency embedded homogeneous multicore platforms
CN103502906A (en) * 2011-03-30 2014-01-08 Symbol Technologies, Inc. Dynamic allocation of processor cores running an operating system
CN109074260A (en) * 2016-04-28 2018-12-21 Microsoft Technology Licensing, LLC Out-of-order block-based processor and instruction scheduler
CN110520851A (en) * 2017-04-10 2019-11-29 Arm Limited Cache-based communication between execution threads of a data processing system
CN112136303A (en) * 2018-05-24 2020-12-25 International Business Machines Corporation Secure delegation of refresh tokens for time-consuming operations
CN113282338A (en) * 2020-02-20 2021-08-20 Intel Corporation Concurrent workload scheduling with multi-level dependencies
CN113874906A (en) * 2020-03-20 2021-12-31 Nvidia Corporation Programming model for resource-constrained scheduling
CN111815104A (en) * 2020-05-18 2020-10-23 Shenzhen First Response Information Technology Co., Ltd. Method and equipment for scheduling emergency response resources

Non-Patent Citations (2)

Title
Pipeline-stall instruction replacement method based on the Sunway 1621 function library; Wu Fan et al.; Computer Systems & Applications; Vol. 30, No. 7; pp. 165-170 *
Low-power instruction cache strategy based on way-access traces; Leng Bing; Yan Xiaolang; Meng Jianyi; Ge Haitong; Transducer and Microsystem Technologies (No. 9); pp. 14-17 *

Also Published As

Publication number Publication date
CN114610394A (en) 2022-06-10
WO2023173642A1 (en) 2023-09-21

Similar Documents

Publication Publication Date Title
US7877585B1 (en) Structured programming control flow in a SIMD architecture
US7937567B1 (en) Methods for scalably exploiting parallelism in a parallel processing system
US8615646B2 (en) Unanimous branch instructions in a parallel thread processor
US20080109795A1 (en) C/c++ language extensions for general-purpose graphics processing unit
CN114610394B (en) Instruction scheduling method, processing circuit and electronic equipment
GB2493607A (en) Eliminating redundant instruction processing in an SIMT processor
US8572355B2 (en) Support for non-local returns in parallel thread SIMD engine
US8615770B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
US7613912B2 (en) System and method for simulating hardware interrupts
US9513923B2 (en) System and method for context migration across CPU threads
US20120047353A1 (en) System and Method Providing Run-Time Parallelization of Computer Software Accommodating Data Dependencies
CN114218153B (en) Method, medium, program product, system, and apparatus for storage management
CN114218152B (en) Stream processing method, processing circuit and electronic equipment
CN113961506B (en) Accelerator and electronic device
CN114035847B (en) Method and apparatus for parallel execution of kernel programs
CN114201444B (en) Method, medium, program product, system, and apparatus for storage management
US20230236878A1 (en) Efficiently launching tasks on a processor
CN114510271B (en) Method and apparatus for loading data in a single instruction multithreaded computing system
CN117501254A (en) Providing atomicity for complex operations using near-memory computation
KR102644951B1 (en) Arithmetic Logic Unit Register Sequencing
US20220391216A1 (en) Graphics processing
JP2023552789A (en) Software-based instruction scoreboard for arithmetic logic unit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40069196

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant