CN114610394A - Instruction scheduling method, processing circuit and electronic equipment


Info

Publication number
CN114610394A
Authority
CN
China
Prior art keywords
instruction
target
target instruction
resource
status indicator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210247863.3A
Other languages
Chinese (zh)
Other versions
CN114610394B (en)
Inventor
王磊
常亮
许飞翔
侯红朝
姚飞
仇小钢
Current Assignee
Hexaflake Nanjing Information Technology Co Ltd
Original Assignee
Hexaflake Nanjing Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hexaflake Nanjing Information Technology Co Ltd filed Critical Hexaflake Nanjing Information Technology Co Ltd
Priority to CN202210247863.3A priority Critical patent/CN114610394B/en
Publication of CN114610394A publication Critical patent/CN114610394A/en
Priority to PCT/CN2022/107512 priority patent/WO2023173642A1/en
Application granted granted Critical
Publication of CN114610394B publication Critical patent/CN114610394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3854 Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3856 Reordering of instructions, e.g. using queues or age tags

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

A method, processing circuit, electronic device, computer-readable storage medium, and computer program product for instruction scheduling are described herein. The method proposed herein comprises: determining a status indicator associated with the target instruction, the status indicator indicating a status of a resource associated with the target instruction; determining whether the target instruction is ready based on the status indicator and the type of the target instruction; in response to determining that the target instruction is ready, executing a target phase of the target instruction, wherein the target phase is determined based on a type of the target instruction; and updating the status indicator in response to completion of execution of an access operation to the register by the target instruction. In this way, data dependencies between instructions can be efficiently managed using status indicators, thereby improving system performance and reducing circuit complexity.

Description

Instruction scheduling method, processing circuit and electronic equipment
Technical Field
Embodiments of the present disclosure relate generally to the field of electronics, and more particularly, to a method, processing circuit, electronic device, computer-readable storage medium, and computer program product for instruction scheduling.
Background
Some processing units (e.g., AIGPUs) employ a load-store architecture, in which instructions other than memory access instructions use operands in registers. These instructions read data from a register file (RF), feed it into the execution units for computation, and finally write the results back to the register file. To enable multiple data items to be read and written per clock cycle, the register file is typically multi-ported and divided into multiple blocks. Register files are typically small, can be read at speeds close to those of the execution units with fixed latency, and are usually placed beside the execution units. Each thread of the AIGPU has its own register file and fixed execution units.
To simplify hardware implementation, some approaches force read instructions of the same type to execute in order, which exacerbates the latency caused by reading data.
Disclosure of Invention
Embodiments of the present disclosure provide a scheme for instruction scheduling.
In a first aspect, a method for instruction scheduling is provided. The method includes determining a status indicator associated with a target instruction, the status indicator indicating a status of a resource associated with the target instruction; determining whether the target instruction is ready based on the status indicator and the type of the target instruction; in response to determining that the target instruction is ready, executing a target phase of the target instruction, wherein the target phase is determined based on a type of the target instruction; and updating the status indicator in response to completion of execution of the access operation for the resource by the target instruction.
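The readiness check and post-access update described above can be sketched in Python. This is a minimal illustration of the claimed logic, not the patent's implementation; the concrete encodings (0/1) and all names are assumptions:

```python
# Token values: the "first value" indicates the resource's data has been
# consumed; the "second value" indicates the data can be consumed.
CONSUMED = 0  # first value (assumed encoding)
READY = 1     # second value (assumed encoding)

def is_ready(token: int, instr_type: str) -> bool:
    """Readiness depends on both the status indicator and the instruction
    type: a producer waits for the old data to be consumed; a consumer
    waits for the data to be ready."""
    if instr_type == "produce":
        return token == CONSUMED
    if instr_type == "consume":
        return token == READY
    raise ValueError(f"unknown instruction type: {instr_type}")

def complete_access(instr_type: str) -> int:
    """Token value after the instruction's access to the resource completes:
    a producer marks the data ready; a consumer marks it consumed."""
    return READY if instr_type == "produce" else CONSUMED
```

For a producer the checked stage would be the write-back stage, and for a consumer the issue stage, matching the embodiments described below.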
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of the target instruction comprises: in response to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value, the first value indicating that data in the resource has been consumed; and responsive to the status indicator being a first value, determining that the target instruction is ready.
In some embodiments, updating the status indicator comprises: in response to completion of execution of an access operation for the resource by the target instruction, the status indicator is updated to a second value indicating that data in the resource can be consumed.
In some embodiments, executing the target phase of the target instruction comprises: a write back stage of the target instruction is executed.
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of the target instruction comprises: in response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value, the second value indicating that data in the resource can be consumed; and responsive to the status indicator being the second value, determining that the target instruction is ready.
In some embodiments, executing the target phase of the target instruction comprises: issuing the target instruction.
In some embodiments, updating the status indicator comprises: in response to completion of execution of an access operation of the target instruction for the resource, the status indicator is updated to a first value indicating that data in the resource has been consumed.
In some embodiments, the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and executing the target phase of the target instruction comprises: during execution of a second instruction, the first instruction is issued, the second instruction indicating a data production operation for a second resource, the first resource being different from the second resource.
In some embodiments, the target instruction is a third instruction, the resource is a third resource, the third instruction indicates a data consumption operation for the third resource, and executing the target phase of the target instruction comprises: issuing the third instruction in response to a fourth instruction setting the status indicator to the first value, the fourth instruction indicating a data production operation for the third resource, the fourth instruction being issued prior to the third instruction; and the method further comprises: in response to the third instruction updating the status indicator to the second value, causing a fifth instruction to be executed, the fifth instruction indicating a data production operation for the third resource, the fifth instruction being issued prior to the third instruction and later than the fourth instruction.
In some embodiments, issuing the target instruction in response to determining that the target instruction is ready comprises: in response to determining that the target instruction is ready, determining whether a number of instructions that have been issued and outstanding is less than a threshold; and issuing the target instruction in response to determining that the number is less than the threshold.
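The issued-and-outstanding limit can be illustrated with a small sketch; the window model and names are assumptions for illustration:

```python
def try_issue(ready: bool, in_flight: int, threshold: int) -> bool:
    """Issue a ready instruction only while the number of issued-but-
    incomplete instructions is below the threshold."""
    return ready and in_flight < threshold

def drain(ready_instrs, threshold):
    """Issue ready instructions in order until the in-flight window fills
    (no completions are modeled in this sketch)."""
    issued, in_flight = [], 0
    for instr in ready_instrs:
        if not try_issue(True, in_flight, threshold):
            break
        issued.append(instr)
        in_flight += 1
    return issued
```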
In some embodiments, determining the status indicator of the resource associated with the target instruction comprises: issuing the target instruction as a memory load instruction; and determining, in a first stage of the target instruction as a memory load instruction, a status indicator of a resource associated with the target instruction; and executing the target stage of the target instruction in response to determining that the target instruction is ready comprises: re-issuing the target instruction as an arithmetic instruction in response to determining that the target instruction is ready.
In some embodiments, determining whether the target instruction is ready comprises: it is determined whether the first memory load instruction sets the status indicator to a second value.
In some embodiments, the method further comprises: after issuing the target instruction, a second memory load instruction associated with the status indicator is issued without determining whether the status indicator is the first value.
In some embodiments, the resource may include at least one of: a register, a memory address, a queue, or a processor resource.
In a second aspect of the disclosure, a processing circuit is provided that includes an on-chip memory, a stream processor, and a processing engine. The processing circuitry is configured to perform any of the methods of the first aspect and its implementations.
In a third aspect of the disclosure, an electronic device is provided. The electronic device comprises processing circuitry configured to perform any of the methods of the first aspect and its implementations.
In a fourth aspect of the disclosure, a computer-readable storage medium is provided. The computer readable storage medium stores instructions that, when executed by the processing circuitry, cause the processing circuitry to perform any of the methods of the first aspect and its implementations.
In a fifth aspect of the disclosure, a computer program product is provided. The computer program product comprises instructions which, when executed by the processing circuitry, cause the processing circuitry to perform any of the methods of the first aspect and its implementations.
It will be appreciated that the processing circuitry of the second aspect, the electronic device of the third aspect, the computer-readable storage medium of the fourth aspect, and the computer program product of the fifth aspect provided above are all adapted to perform the method provided by the first aspect. Therefore, explanations and illustrations with respect to the first aspect apply equally to the second, third, fourth, and fifth aspects. In addition, for the beneficial effects achieved by the second through fifth aspects, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.
It should be understood that what is described in this summary section is not intended to limit key or critical features of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic block diagram of a processing circuit in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a schematic block diagram of a three-dimensional tensor, in accordance with some embodiments of the present disclosure;
FIG. 4 illustrates an instruction scheduling process according to some embodiments of the present disclosure;
FIG. 5 illustrates an instruction scheduling process according to further embodiments of the present disclosure;
FIG. 6 illustrates an instruction scheduling process according to further embodiments of the present disclosure;
FIG. 7 illustrates an instruction scheduling process according to further embodiments of the present disclosure;
FIG. 8 illustrates a flow chart of an example process of a stream processing method according to some embodiments of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
Processors typically resolve register dependencies between instructions, i.e., ensure that an instruction correctly uses the result register written by a preceding instruction, in one of two ways.
In some conventional schemes, if the execution duration of a preceding instruction is constant in all cases, the processor or software can always arrange for a dependent later instruction to begin execution after that duration has elapsed.
In other conventional schemes, if the execution time of an instruction is uncertain, for example because the duration of each execution of a memory access instruction varies, the hardware records the output register of each instruction and tracks its completion time; at the same time, the hardware decodes the input register(s) used by each instruction and compares them against the outstanding output registers to find the dependency relationships.
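Such hardware tracking behaves like a scoreboard. A rough Python sketch of the idea, with assumed names and a simplified hazard check (only read-after-write against pending outputs):

```python
class Scoreboard:
    """Tracks output registers of issued-but-incomplete instructions and
    blocks any instruction whose inputs are still being produced."""

    def __init__(self):
        self.pending_outputs = set()

    def issue(self, instr) -> bool:
        """instr = (inputs, output). Returns True if the instruction can
        issue; False if an input depends on an unfinished instruction."""
        inputs, output = instr
        if any(r in self.pending_outputs for r in inputs):
            return False  # RAW hazard: an input is still in flight
        self.pending_outputs.add(output)
        return True

    def complete(self, output):
        """Called when an instruction finishes writing its output."""
        self.pending_outputs.discard(output)
```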
According to embodiments of the present disclosure, resource dependencies between instructions can be effectively resolved through a status indicator of the resource (such as a register, a memory address, a queue, or a processor unit), thereby improving instruction scheduling efficiency, improving system performance, and reducing circuit implementation complexity.
Example Environment
Fig. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. The example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer. In one embodiment, the example environment 100 includes, for example, a Central Processing Unit (CPU) 20, a system memory 10, a north bridge/memory bridge 30, an accelerator subsystem 40, a device memory 50, and a south bridge/Input Output (IO) bridge 60. System memory 10 may be, for example, a volatile memory such as a Dynamic Random Access Memory (DRAM). The north bridge/memory bridge 30 integrates, for example, a memory controller, a PCIe controller, and the like, is responsible for data exchange between the CPU 20 and the high-speed interfaces, and bridges the CPU 20 and the south bridge/IO bridge 60. The south bridge/IO bridge 60 is used for the low-speed interfaces of a computer, such as a Serial Advanced Technology Attachment (SATA) controller. The accelerator subsystem 40 may include, for example, devices or chips for accelerated processing of data such as graphics and video, such as Graphics Processing Units (GPUs) and Artificial Intelligence (AI) accelerators. In this disclosure, accelerator subsystem 40 may also be referred to as "processing circuitry".
With continued reference to FIG. 1, the device memory 50 may be, for example, a volatile memory such as DRAM located external to the accelerator subsystem 40. In this disclosure, device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of accelerator subsystem 40. In contrast, the accelerator subsystem 40 also has volatile memory internal to its chip, such as a level one (L1) cache and optionally a level two (L2) cache, which may be collectively referred to as "on-chip memory".
It should be appreciated that while one example environment 100 in which embodiments of the present disclosure can be implemented is illustrated in FIG. 1, the present disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in some application environments, such as ARM architectures and RISC-V architectures, with accelerator subsystems, such as GPUs.
Fig. 2 shows a schematic block diagram of a processing circuit 200 according to an embodiment of the present disclosure. Processing circuit 200 may be, for example, a specific implementation of the chip of accelerator subsystem 40 in FIG. 1. The processing circuit 200 is, for example, a processing circuit chip such as a GPU. In one embodiment, processing circuit 200 includes a Stream Processor (SP) 210, a page table device 220, a Processing Engine (PE) unit 230, a Direct Memory Access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
The processing circuit 200 is controlled by a host device such as the CPU 20 and receives instructions from the CPU 20. SP 210 analyzes the instructions from CPU 20 and assigns the analyzed operations to PE unit 230, page table device 220, and DMA controller 240 for processing. The page table device 220 is used to manage the on-chip virtual storage of the processing circuit 200. In the present disclosure, the L2 cache 250 and off-chip memory, such as device memory 50 in FIG. 1, constitute a virtual storage system. Page table device 220 is jointly maintained by SP 210, PE unit 230, and DMA controller 240.
PE unit 230 includes a plurality of Processing Engines (PEs) PE_1, PE_2, …, PE_N, where N represents an integer greater than 1. Each PE in PE unit 230 may be a Single Instruction Multiple Thread (SIMT) device. In a PE, each thread may have its own register file, and all threads of each PE also share a unified register file (uniform register file). Multiple PEs may perform the same or different processing tasks in parallel, and the address translation and access to target data in memory described below may be performed in parallel, thereby reducing processing time. It is understood that the target elements processed by the multiple PEs are not the same, and the segment, page, cache line, and the attributes, size, and dimension ordering of the elements may differ, as described in detail below.
Each thread may exchange thread-level data between its register file and the memory subsystem. Each thread has its own arithmetic logic execution unit and uses its own memory address, following a typical load-store architecture. Each execution unit includes a floating point/fixed point unit supporting multiple data types, and an arithmetic logic unit.
Most instructions perform arithmetic and logical operations, such as addition, subtraction, multiplication, division of floating point and fixed point numbers, or logical AND, OR, NOT, etc. The operands come from registers. Memory read and write instructions may provide for data exchange between registers and on/off-chip memory. In general, all execution units in a PE may execute the same instruction synchronously. By using predicate (predicate) registers, portions of the execution units may be masked, thereby implementing the functionality of the branch instruction.
In one embodiment, the processing circuit 200 of fig. 2 may, for example, perform the following operations: 1) building page table entry content and an initial state; 2) data on off-chip memory, such as device memory 50 in FIG. 1, is carried to on-chip memory, such as L2 cache 250; 3) starting and executing a program; 4) defining each segment and describing the tensor and the stored attributes; 5) and when the program execution is completed, writing the data of the execution result into the off-chip memory.
It is to be appreciated that in the disclosed embodiment, the data processed by the processing circuit 200 is primarily directed to a multidimensional tensor. For example, in one embodiment, the tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the tensor may differ in size in the dimensions. In other embodiments, the tensor can be a one-dimensional, two-dimensional, three-dimensional, or more-dimensional tensor, which is not limited by this disclosure.
Furthermore, in embodiments of the present disclosure, tensors may internally support other custom element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and others, which the present disclosure does not limit. Tensor addressing is performed in basic units of elements. For example, if the element type is int8, the addressing unit is a byte. As another example, if the element type is int16, the addressing unit is a double byte, and so on.
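Element-unit addressing can be illustrated as follows; the table of element sizes and the function name are assumptions for illustration, mapping an element index to its byte offset:

```python
# Bytes per element for the supported element types (illustrative table).
ELEMENT_SIZE = {
    "uint8": 1, "int8": 1,
    "bfloat16": 2, "float16": 2, "uint16": 2, "int16": 2,
    "float32": 4, "int32": 4, "uint32": 4,
}

def byte_offset(element_index: int, dtype: str) -> int:
    """Byte offset of an element when addressing is in element units:
    element index times the element's size in bytes."""
    return element_index * ELEMENT_SIZE[dtype]
```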
In some cases, the tensor may contain a large amount of data, and the L2 cache 250 has a limited capacity, so the tensor cannot be loaded in its entirety into the on-chip L2 cache 250. In some embodiments of the present disclosure, to facilitate parallel processing of the tensor, the tensor can be divided into at least one segment. In case the tensor comprises only one segment, the tensor is a segment. And in the case of a tensor comprising a plurality of segments, the segments are part of the tensor. The CPU20 can specify by instruction which PE each part of the segment is processed by.
Storage structure of tensor
Figure 3 illustrates a schematic block diagram of a three-dimensional tensor 300 according to one embodiment of the present disclosure. The three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3. The CPU 20 may specify that the tensor elements of the segment S1 are processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. Further, the CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1 through PE_4. In embodiments of the present disclosure, each segment may have different dimensions, so a programmer may flexibly configure segments based on design needs. In practice, the division of pages may be implemented in any one or more dimensions, and the numbers of pages divided in each dimension are independent of each other.
In one embodiment, the tensor data may be stored in an on-chip high speed memory, such as the L2 cache 250. However, due to the small capacity of the high speed memory on chip, at larger tensor scales, a programmer may divide the tensor into segments, each segment describing a portion of the tensor. The kernel (kernel) can be started multiple times, and each time, a segment of the tensor is moved from off-chip storage to on-chip storage in advance by the DMA controller 240, and is used for kernel operation. After the kernel is started for multiple times, all the sections contained in the tensor are processed, and the whole operation process is finished. When the high-speed memory on the chip is enough to accommodate all the tensors to be accessed by the kernel, one tensor only needs one segment description, and the kernel only needs to be started once.
Further, in some embodiments of the present disclosure, within a segment, at least one page may also be set to further subdivide the tensor. For example, the first segment S1 has 4 pages P[1], P[2], P[3], and P[4]. The second segment S2 has only one page. In embodiments of the present disclosure, the number of pages in each segment may be different, so a programmer may flexibly configure the size of the pages within a segment based on design needs. For example, pages are configured to fit into the L2 cache 250 in their entirety.
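Since pages are divided independently in each dimension, a segment's page count can be sketched as below; the shapes and function names are illustrative assumptions:

```python
from math import ceil

def pages_per_dim(segment_shape, page_shape):
    """Number of pages along each dimension; each dimension is divided
    independently of the others."""
    return [ceil(s / p) for s, p in zip(segment_shape, page_shape)]

def total_pages(segment_shape, page_shape):
    """Total number of pages in a segment: the product of the per-dimension
    page counts."""
    n = 1
    for k in pages_per_dim(segment_shape, page_shape):
        n *= k
    return n
```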
As described above, when the tensor is addressed, the smallest addressing unit is an element. A page may typically include multiple elements. The page on which the target element is located is referred to herein as the "target element page". In some embodiments of the present disclosure, a page may include multiple cache lines. Although the target element page may be located in the L2 cache 250, if a PE reads the target element via the L1 cache 260, the L2 cache 250 needs to transfer a small portion of physically contiguous data in the L2 cache 250, including the target element, in its entirety to the L1 cache 260. This small portion of data is also called cache line data, and this caching mechanism is based on the principle of spatial locality. A PE needs only a few clock cycles to read data from the L1 cache 260, while the L1 cache 260 may require tens or even hundreds of clock cycles to read data from the L2 cache 250. Therefore, it is desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250. Although the minimum unit of data transferred from the L2 cache 250 to the L1 cache 260 is described herein as a "cache line", in this disclosure this portion of data is not necessarily arranged in rows or columns: the data within a "cache line" may be distributed over multiple dimensions, and the size of the data in each dimension is not limited to 1. The PEs perform parallel processing on the data in a segment, and the allocation of PEs is expanded in the logical address space of the data, independent of the physical storage structure of the segment, as described in detail below.
In FIG. 3, a first set of cache lines in first page P[1] is designated for processing by PE_1, and a second set of cache lines is designated for processing by PE_2. Although the tensors are shown here as being processed sequentially by multiple PEs in order, it is understood that the processing of tensor data is independent of the order of the PEs, which is not limited by this disclosure. For example, PE_2 in FIG. 3 indicates that the partial tensor data can be processed by PE_M, where M represents any integer not greater than N.
Example scheduling procedure one
As will be described in detail below, the processing circuit 200 may manage dependencies between different resources by using status indicators. For convenience of description, the mechanism of the status indicator will be described below with a register as an example of a resource. It should be understood that other suitable types of resources may also be used, examples of which include, but are not limited to: memory addresses, queues, processor units, and the like.
In some embodiments, processing circuit 200 may use a status indicator (also referred to as a "token") to manage data dependencies. A token is a state value that may be used to indicate the state of data in a corresponding resource (e.g., register).
Unlike traditional hardware-based register states, tokens provide a software-based state management policy for developers: an implementation does not need to access corresponding hardware state via register identifiers, and data dependency problems can instead be solved through flexible token management.
Illustratively, if token is a first value (e.g., 1), this indicates that the data in its corresponding register is ready and has not been used or consumed. Conversely, if token is a second value (e.g., 0), this indicates that the data in its corresponding register is not ready.
In some embodiments, processing circuit 200 may determine whether an instruction may proceed based on the token value of the register and the type of the instruction. In particular, if the instruction is a data consuming instruction, i.e., it indicates a data consumption operation on the data in a register, processing circuit 200 may determine whether the token corresponding to the register is 1. If so, the instruction may execute the corresponding stage. Otherwise, if the token is 0, the instruction must wait before executing the corresponding stage.
In some embodiments, the stage is determined based on the type of instruction. Illustratively, if the instruction is a data consuming instruction, the stage may be, for example, an issue stage of the instruction. That is, the data consumption instruction may be issued only if token is 1.
In some embodiments, if the instruction is a data producing instruction, i.e., it indicates a data production operation on the data in a register, the processing circuitry may determine whether the token corresponding to the register is 0. If so, the instruction may proceed; otherwise, if the token is 1, the instruction must wait.
In some embodiments, the stage is determined based on the type of the instruction. If the instruction is a data producing instruction, the stage may be, for example, the write-back stage of the instruction. That is, the data producing instruction may be issued first, with the token checked for 0 in the write-back stage: the instruction waits for the token to become 0 before it can perform the data write-back.
In some embodiments, the processing circuit 200 may further update the token corresponding to the register when the instruction's access operation to the register completes. Illustratively, if the instruction is a data consuming instruction, the token may be set to 0 upon completion of dispatching the data from the register into the execution unit. In another example, if the instruction is a data producing instruction, the token may be set to 1 upon completion of the write-back of the data to the register.
FIG. 4 illustrates an example instruction scheduling process 400 according to some embodiments of the present disclosure. In FIG. 4, "i" represents the issue point of an instruction, "s" represents the set point (i.e., set to 1) of the corresponding token, and "c" represents the clear point (i.e., set to 0) of the token.
As shown in FIG. 4, after the instruction Load RF[0], MemA completes the data load, the corresponding token is set to 1. In some embodiments, the developer may indicate in an instruction that a particular operation is to be performed on a particular token. For example, a developer may write the instruction Load RF[0], MemA (no clear, check and set token1), i.e., the instruction has no clear operation, needs to check "token1", and sets "token1" to 1 after completion.
Accordingly, upon detecting that the token is set to 1, the instruction Add RF[x], RF[0] may be issued, and once its data has entered the execution unit, the token is cleared to 0. In some embodiments, a developer may indicate in an instruction that a particular operation is to be performed on a particular token. For example, a developer may write the instruction Add RF[x], RF[0] (check and clear token1, no set), i.e., the instruction has no set operation, needs to check "token1", and clears "token1" to 0 after completion.
Further, the instruction Load RF[0], MemB may be issued first, for example, to perform its memory access operation. The instruction then waits for the token to be cleared to 0 before performing its data write-back stage, and sets the token to 1 after completing the data load. In some embodiments, a developer may indicate in an instruction that a particular operation is to be performed on a particular token. For example, a developer may write the instruction Load RF[0], MemB (no clear, check and set token1), i.e., the instruction has no clear operation, needs to check "token1", and sets "token1" to 1 after completion.
Similarly, the instruction Add RF[y], RF[0] may be issued upon detecting that the token is 1, and once its data has entered the execution unit, the token is cleared to 0. In some embodiments, a developer may indicate in an instruction that a particular operation is to be performed on a particular token. For example, a developer may write the instruction Add RF[y], RF[0] (check and clear token1, no set), i.e., the instruction has no set operation, needs to check "token1", and clears "token1" to 0 after completion.
Further, the instruction Load RF[0], MemC may be issued to perform its memory access operation, and waits for the token to be cleared to 0 before performing its data write-back stage. After completing the data load, the instruction sets the token to 1. In some embodiments, a developer may indicate in an instruction that a particular operation is to be performed on a particular token. For example, a developer may write the instruction Load RF[0], MemC (no clear, check and set token1), i.e., the instruction has no clear operation, needs to check "token1", and sets "token1" to 1 after completion.
Similarly, the instruction Add RF[z], RF[0] may be issued upon detecting that the token is 1, and once its data has entered the execution unit, the token is cleared to 0. In some embodiments, a developer may indicate in an instruction that a particular operation is to be performed on a particular token. For example, a developer may write the instruction Add RF[z], RF[0] (check and clear token1, no set), i.e., the instruction has no set operation, needs to check "token1", and clears "token1" to 0 after completion.
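The alternating Load/Add schedule of FIG. 4 can be modeled in a few lines. This is a hypothetical Python simulation under the single-token rules above (a load's write back is gated on token == 0, an Add's issue on token == 1); the instruction strings and function name are illustrative:

```python
# Hypothetical simulation of the FIG. 4 ping-pong schedule on one token.

def run_schedule(instructions):
    """Run (kind, name) pairs against a single token; a load requires
    token == 0 before write back, an Add requires token == 1 before issue."""
    token = 0
    trace = []
    for kind, name in instructions:
        if kind == "load":
            assert token == 0          # write back gated on cleared token
            token = 1                  # "s": set point after the data load
        else:  # "add"
            assert token == 1          # issue gated on set token
            token = 0                  # "c": clear point once data enters the EU
        trace.append((name, token))
    return trace

trace = run_schedule([
    ("load", "Load RF[0], MemA"), ("add", "Add RF[x], RF[0]"),
    ("load", "Load RF[0], MemB"), ("add", "Add RF[y], RF[0]"),
    ("load", "Load RF[0], MemC"), ("add", "Add RF[z], RF[0]"),
])
```

Any reordering that breaks the strict load/Add alternation trips an assertion, which is exactly the dependency the token enforces.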
In this way, the embodiments of the present disclosure can effectively manage the resource or data dependency between instructions by using the status indicator (i.e., token), thereby improving the efficiency of instruction scheduling and reducing the complexity of the system.
Example scheduling procedure two
In some embodiments, processing circuit 200 may further improve the efficiency of instruction execution by moving read instructions earlier. Illustratively, each read instruction may use a different token and write its result to a different register. Each Add instruction therefore checks a different token and takes its operands from different registers.
In some embodiments, multiple read instructions may be required to complete in order. In that case, the processing circuitry may have only the last read instruction update the token, with the first Add instruction checking that token before it issues. In this manner, the processing circuit may reduce system complexity without adding excessive latency, since the multiple read instructions complete execution at close points in time.
FIG. 5 illustrates an example instruction scheduling process 500 according to some embodiments of the present disclosure. As shown in FIG. 5, the instructions Load RF[0], MemA; Load RF[1], MemB; and Load RF[2], MemC may be issued in sequence and executed in sequence. Accordingly, the instruction Add RF[x], RF[0] may examine the token corresponding to RF[0] and issue after that token is 1. The instruction Add RF[y], RF[1] may examine the token corresponding to RF[1] and issue after that token is 1. The instruction Add RF[z], RF[2] may examine the token corresponding to RF[2] and issue after that token is 1.
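A minimal sketch of the per-register token scheme of FIG. 5, assuming one token per register; the dictionary and function names are illustrative, not from the disclosure:

```python
# Illustrative per-register tokens: each Load/Add pair uses its own register
# and token, so unrelated loads and Adds never block one another.

tokens = {"RF[0]": 0, "RF[1]": 0, "RF[2]": 0}

def load_complete(reg):
    """A read instruction (producer) sets its own register's token on write back."""
    tokens[reg] = 1

def try_issue_add(reg):
    """An Add (consumer) checks only its own register's token before issuing."""
    if tokens[reg] == 1:
        tokens[reg] = 0
        return True
    return False

# Loads may complete out of order without blocking unrelated Adds.
load_complete("RF[1]")
assert not try_issue_add("RF[0]")   # RF[0] not yet loaded
assert try_issue_add("RF[1]")       # RF[1] ready independently
```

Because each pair is independent, the three loads can be in flight concurrently, which is the parallelism claimed for process 500.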
In this way, the embodiment of the present disclosure may enable a plurality of read instructions to be executed in parallel, thereby improving the loading efficiency of the system.
Example scheduling procedure three
In some embodiments, processing circuit 200 may resolve data dependencies between multiple instructions through a single token. In the example scheduling process above, the processing circuit 200 needs to maintain three registers and, correspondingly, three tokens. To further reduce overhead, processing circuit 200 may instead schedule the execution of the above instructions using a single token.
FIG. 6 illustrates an example instruction scheduling process 600 according to some embodiments of the disclosure. As shown in FIG. 6, after the first read instruction writes RF[0], token[0] is set to 1, and the next read instruction cannot yet write RF[0]. It must wait for token[0] to be cleared to 0, which only the next Add instruction can do: the Add instruction checks token[0] at issue, and when token[0] becomes 1 the Add instruction may be issued and executed, clearing token[0] to 0 once it has read RF[0].
Specifically, the scheduling process of process 600 is as follows. Initially, token[0] has a value of 0. Three read instructions may be issued in order to perform memory access operations, reading data from the three memory addresses A, B, and C and returning the data, in order, to the RF[0] write-back queue. As shown in FIG. 6, the three read instructions may be issued in sequence without depending on the completion of one another.
Further, the first Add instruction waits in the issue stage until token[0] is set to 1. The first read instruction retrieves its data and takes the first position in the RF[0] write-back queue; it checks that the value of token[0] is 0, then writes RF[0] and sets token[0] to 1.
At this point, the second read instruction may have already read data from the MemB address and be queued at the second position of the RF[0] write-back queue; after the first read instruction writes RF[0], the second read instruction moves to the first position of the queue, while the third read instruction may have returned and be queued at the second position.
Alternatively, the second read instruction may be at the first position of the fetch queue and the third read instruction at the second position of the fetch queue.
Further, the second read instruction checks token[0], finds it to be 1 (set by the first read instruction), and waits for it to become 0.
Then the first Add instruction detects that token[0] is 1, issues and executes, and then clears token[0] to 0. The second Add instruction waits at the issue stage for token[0] to be 1; after token[0] changes from 1 to 0, the data of the second read instruction is written back to RF[0] and token[0] is set to 1.
The second Add instruction can then be issued and executed, after which token[0] is cleared to 0. At this point, the third read instruction should have read data from the MemC address and be awaiting write back to RF[0]. When token[0] changes from 1 to 0, the third read instruction may write back RF[0] and set token[0] to 1. The third Add instruction is then issued and executed, and token[0] is cleared to 0.
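The walkthrough above can be condensed into a small event-loop simulation. This is an illustrative Python sketch, assuming an in-order write-back queue for RF[0] and the single-token rules of process 600; it is a model, not the patented circuit, and the names are assumptions:

```python
# Illustrative event loop for process 600: one token guards RF[0], returned
# load data queues in order for write back, and each Add clears the token so
# the next queued load may write back.

from collections import deque

def schedule(n_pairs):
    """Interleave n_pairs loads (already returned, queued in order) with
    n_pairs Adds through a single token; return the Add execution order."""
    token = 0
    writeback_q = deque(range(n_pairs))   # load results, in return order
    pending_adds = list(range(n_pairs))
    executed = []
    while pending_adds or writeback_q:
        if token == 0 and writeback_q:
            writeback_q.popleft()         # head load writes RF[0]
            token = 1                     # set point
        elif token == 1 and pending_adds:
            executed.append(pending_adds.pop(0))  # Add issues, reads RF[0]
            token = 0                     # clear point
    return executed

assert schedule(3) == [0, 1, 2]  # Adds execute strictly in order
```

The strict alternation of write back and Add is what lets a single token (and a single register) serialize all three dependency chains.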
In this manner, embodiments of the present disclosure are able to efficiently manage data dependencies between multiple instructions with a single token and reduce the number of registers used.
In some embodiments, processing circuit 200 may also add resources at the instruction issue stage to limit the number of issued instructions and thereby resolve possible deadlocks. In particular, processing circuit 200 may add an instruction issue queue in addition to the write-back queue and the fetch queue, where the length of the queue is the maximum number of read instructions allowed to issue; this ensures that issued read instructions are not stalled.
It will be appreciated that adding an instruction issue queue, while consuming some resources, is more cost effective than increasing the number of registers, especially in single-instruction multiple-thread processors.
In some embodiments, processing circuit 200 may also resolve possible instruction blocking by looking ahead at instructions. Although instruction look-ahead requires additional resources, the amount of resources it requires is reasonable compared to the register usage it saves. By using a modest number of registers combined with a look-ahead mechanism, the embodiments of the present disclosure can effectively hide DRAM access latency and greatly improve performance.
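The issue-limiting idea can be sketched as a bounded counter standing in for the instruction issue queue. This is a hypothetical Python model; the class and method names are assumptions, not from the disclosure:

```python
# Hypothetical issue limiter: a read instruction issues only while the number
# of issued-but-incomplete reads is below the issue-queue length, bounding
# in-flight reads so none can stall mid-pipeline (the deadlock condition).

class IssueLimiter:
    def __init__(self, max_inflight):
        self.max_inflight = max_inflight   # length of the issue queue
        self.inflight = 0

    def try_issue(self):
        """Admit a read only if a queue slot is free; otherwise stall at issue."""
        if self.inflight < self.max_inflight:
            self.inflight += 1
            return True
        return False

    def complete(self):
        """Free a slot when a read finishes its write back."""
        self.inflight -= 1

lim = IssueLimiter(max_inflight=2)
assert lim.try_issue() and lim.try_issue()
assert not lim.try_issue()   # third read must wait at the issue stage
lim.complete()
assert lim.try_issue()
```

Stalling at the issue stage rather than deeper in the pipeline is what keeps already-issued reads from being blocked.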
Example scheduling procedure four
In some embodiments, processing circuitry 200 may utilize tokens to implement compound instructions that combine a memory access with a computation, thereby further improving execution efficiency. In some embodiments, processing circuitry 200 may, for example, allow a compute instruction (e.g., the mm instruction shown in FIG. 7) to be used in combination with a read instruction.
In some embodiments, unlike an ordinary data consumption instruction, the compute instruction may be issued twice and have two different execution phases. In the first phase, the compute instruction may be issued as a normal memory access instruction. In the second phase, upon detecting that the corresponding token is set to 1, the compute instruction may be issued as a data operation instruction, and once the data is transferred to the execution unit, the corresponding token is cleared to 0.
FIG. 7 illustrates an example instruction scheduling process 700 according to some embodiments of the present disclosure. As shown in FIG. 7, the instructions Load RF[0], MemA through mm RF[z], RF[0] may be issued in sequence, wherein the three mm instructions are first issued as normal memory access instructions.
As shown in FIG. 7, after the instruction Load RF[0], MemA completes, the token is set to 1; at this time, the instruction mm RF[x], RF[0] may be reissued as a data operation instruction to consume the data stored in the register. Further, once the data is transferred to an execution unit, the corresponding token may be cleared to 0.
After the token is cleared to 0, the instruction Load RF[0], MemB may execute, and when it completes the data load, the token may be set to 1. Further, the instruction mm RF[y], RF[0] may be reissued as a data operation instruction to consume the data stored in the register. Once the data is transferred to the execution unit, the corresponding token may be cleared to 0.
Similarly, after the token is again cleared to 0, the instruction Load RF[0], MemC may execute, and when it completes the data load, the token may be set to 1. Further, the instruction mm RF[z], RF[0] may be reissued as a data operation instruction to consume the data stored in the register. Once the data is transferred to the execution unit, the corresponding token may be cleared to 0.
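The two-phase behavior of the compound mm instruction can be modeled as a small state function. This is an illustrative Python sketch of the scheme described above (first issue as a memory access, second issue as a compute operation gated on the token); the function and the action strings are assumptions:

```python
# Illustrative two-phase compound instruction: phase 1 issues as a memory
# access; phase 2 re-issues as a data operation only once the token is 1,
# clearing the token as the data is handed to the execution unit.

def compound_mm(token, phase):
    """Return (action, new_token, next_phase) for one scheduling step."""
    if phase == 1:
        return ("issue_as_load", token, 2)       # first issue: memory access
    if token == 1:                               # second issue gated on token
        return ("issue_as_compute", 0, None)     # data to EU clears the token
    return ("wait", token, 2)                    # data not yet written back

action, token, phase = compound_mm(token=0, phase=1)
assert action == "issue_as_load"
action, token, phase = compound_mm(token=1, phase=2)
assert action == "issue_as_compute" and token == 0
```

The first issue never consults the token, which is why the three mm instructions in FIG. 7 can all be dispatched up front.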
In some embodiments, the mm instruction may be, for example, an instruction to perform a matrix multiplication operation.
In this way, embodiments of the present disclosure can self-pace data reads and arithmetic instructions so that programs are not affected by memory latency, thereby minimizing the number of registers needed to hide read-data latency.
Example procedure for instruction scheduling
FIG. 8 illustrates a flow diagram of an instruction scheduling method 800 according to some embodiments of the disclosure. In one embodiment, method 800 may be implemented by processing circuit 200 (or accelerator subsystem 40), such as a GPU, and thus the various aspects described above with respect to FIGS. 1-3 may be selectively applied to method 800.
At block 810, the processing circuit 200 determines a status indicator associated with a target instruction, the status indicator indicating a status of a resource associated with the target instruction. At block 820, the processing circuit 200 determines whether the target instruction is ready based on the status indicator and the type of the target instruction. In response to determining that the target instruction is ready at block 820, the method 800 proceeds to block 830, where the processing circuit 200 executes a target phase of the target instruction, the target phase being determined based on the type of the target instruction. At block 840, processing circuit 200 determines whether the access operation of the target instruction for the resource has completed. In response to determining at block 840 that the access operation has completed, the method 800 proceeds to block 850, where the processing circuit 200 updates the status indicator.
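Blocks 810 through 850 can be condensed into a compact sketch. This is an illustrative Python model of method 800 under the token rules described earlier; the field and function names are assumptions, not from the disclosure:

```python
# Illustrative one-step scheduler for method 800: read the token (810),
# decide readiness by instruction type (820), execute the target phase (830),
# and flip the token on completion of the access (840/850).

def schedule_step(instr, tokens):
    token = tokens[instr["resource"]]          # block 810: read the indicator
    if instr["type"] == "consume":             # block 820: readiness check
        ready = (token == 1)                   # consumer needs data available
    else:
        ready = (token == 0)                   # producer needs data consumed
    if not ready:
        return "wait"
    # block 830: execute the target phase (issue for consumers, write back
    # for producers); blocks 840/850: the completed access flips the token.
    tokens[instr["resource"]] = 0 if instr["type"] == "consume" else 1
    return "executed"

tokens = {"RF[0]": 0}
assert schedule_step({"type": "consume", "resource": "RF[0]"}, tokens) == "wait"
assert schedule_step({"type": "produce", "resource": "RF[0]"}, tokens) == "executed"
assert tokens["RF[0]"] == 1
```

Each embodiment below specializes one branch of this sketch (which value means ready, and which phase is the target phase).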
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of the target instruction comprises: in response to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value, the first value indicating that data in the resource has been consumed; and responsive to the status indicator being a first value, determining that the target instruction is ready.
In some embodiments, updating the status indicator comprises: in response to completion of execution of an access operation for the resource by the target instruction, the status indicator is updated to a second value indicating that data in the resource can be consumed.
In some embodiments, executing the target phase of the target instruction comprises: a write back stage of the target instruction is executed.
In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of the target instruction comprises: in response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value, the second value indicating that data in the resource can be consumed; and responsive to the status indicator being the second value, determining that the target instruction is ready.
In some embodiments, executing the target phase of the target instruction includes: issuing the target instruction.
In some embodiments, updating the status indicator comprises: in response to completion of execution of an access operation of the target instruction for the resource, the status indicator is updated to a first value indicating that data in the resource has been consumed.
In some embodiments, the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and executing the target phase of the target instruction comprises: during execution of a second instruction, the first instruction is issued, the second instruction indicating a data production operation for a second resource, the first resource being different from the second resource.
In some embodiments, the target instruction is a third instruction, the resource is a third resource, the third instruction indicates a data consumption operation for the third resource, and executing the target stage of the target instruction includes: issuing the third instruction in response to a fourth instruction setting the target indicator to the first value, the fourth instruction indicating a data production operation for the third resource, the fourth instruction being issued prior to the third instruction; and the method further comprises: updating the target indicator to a second value in response to the third instruction, causing a fifth instruction to be executed, the fifth instruction indicating a data production operation for the third resource, the fifth instruction being issued prior to the third instruction and later than the fourth instruction.
In some embodiments, issuing the target instruction in response to determining that the target instruction is ready comprises: in response to determining that the target instruction is ready, determining whether a number of instructions that have been issued and outstanding is less than a threshold; and issuing the target instruction in response to determining that the number is less than the threshold.
In some embodiments, determining the status indicator of the resource associated with the target instruction comprises: issuing the target instruction as a memory load instruction; and determining, in a first stage in which the target instruction acts as a memory load instruction, the status indicator of the resource associated with the target instruction; and executing the target stage of the target instruction in response to determining that the target instruction is ready comprises: reissuing the target instruction as an arithmetic instruction in response to determining that the target instruction is ready.
In some embodiments, determining whether the target instruction is ready comprises: it is determined whether the first memory load instruction sets the status indicator to a second value.
In some embodiments, the method further comprises: after issuing the target instruction, a second memory load instruction associated with the status indicator is issued without confirming whether the status indicator is the first value.
The present disclosure may be a method, processing circuit, electronic device, computer storage medium, and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuitry may execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (18)

1. A method of instruction scheduling, the method comprising:
determining a status indicator associated with a target instruction, the status indicator indicating a status of a resource associated with the target instruction;
determining whether the target instruction is ready based on the status indicator and the type of the target instruction;
in response to determining that the target instruction is ready, executing a target phase of the target instruction, the target phase determined based on the type; and
updating the status indicator in response to completion of execution of an access operation of the target instruction for the resource.
2. The method of claim 1, wherein determining whether the target instruction is ready based on the status indicator and a type of the target instruction comprises:
in response to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value indicating that data in the resource has been consumed; and
in response to the status indicator being the first value, determining that the target instruction is ready.
3. The method of claim 2, wherein executing a target phase of the target instruction comprises: a write back stage of the target instruction is executed.
4. The method of claim 2, wherein updating the status indicator comprises:
in response to completion of execution of an access operation of the target instruction for the resource, updating the status indicator to a second value indicating that data in the resource can be consumed.
5. The method of claim 1, wherein determining whether the target instruction is ready based on the status indicator and a type of the target instruction comprises:
in response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value indicating that data in the resource can be consumed; and
in response to the status indicator being the second value, determining that the target instruction is ready.
6. The method of claim 5, wherein executing a target phase of the target instruction comprises: issuing the target instruction.
7. The method of claim 5, wherein updating the status indicator comprises:
in response to completion of execution of an access operation of the target instruction for the resource, updating the status indicator to a first value indicating that data in the resource has been consumed.
8. The method of claim 1, wherein the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and executing a target phase of the target instruction comprises:
issuing the first instruction during execution of a second instruction, the second instruction indicating a data production operation for a second resource, the first resource being different from the second resource.
9. The method of claim 1, wherein the target instruction is a third instruction, the resource is a third resource, the third instruction indicates a data consumption operation for the third resource,
executing the target phase of the target instruction includes: issuing a third instruction in response to a fourth instruction setting the target indicator to a first value, the fourth instruction indicating a data production operation for the third resource, the fourth instruction issued prior to the third instruction; and is
The method further comprises the following steps: updating the target indicator to a second value in response to the third instruction, causing a fifth instruction to be executed, the fifth instruction indicating a data production operation for the third resource, the fifth instruction issued prior to the third instruction and later than the fourth instruction.
10. The method of claim 1, wherein issuing the target instruction in response to determining that the target instruction is ready comprises:
in response to determining that the target instruction is ready, determining whether a number of instructions that have been issued and outstanding is less than a threshold; and
issuing the target instruction in response to determining that the number is less than the threshold.
11. The method of claim 1, wherein determining a status indicator of a resource associated with a target instruction comprises: issuing the target instruction as a memory load instruction; and determining the status indicator of the resource associated with the target instruction at a first stage in which the target instruction is the memory load instruction; and
executing the target stage of the target instruction in response to determining that the target instruction is ready comprises: reissuing the target instruction as an arithmetic instruction in response to determining that the target instruction is ready.
12. The method of claim 11, wherein determining whether the target instruction is ready comprises:
it is determined whether the first memory load instruction sets the status indicator to a second value.
13. The method of claim 11, further comprising:
after issuing the target instruction, issuing a second memory load instruction associated with the status indicator without confirming whether the status indicator is a first value.
14. The method of any of claims 1 to 13, wherein the resources comprise at least one of: registers, memory addresses, queues, or processor resources.
15. A processing circuit comprising an on-chip memory, a stream processor, and a processing engine, wherein the processing circuit is configured to perform the method of any of claims 1-14.
16. An electronic device comprising an off-chip memory and a processing circuit, wherein the processing circuit is configured to perform the method of any of claims 1-14.
17. A computer readable storage medium having stored thereon one or more computer instructions, wherein the one or more computer instructions, when executed by a processing circuit, implement the method of any of claims 1-14.
18. A computer program product comprising computer executable instructions, wherein the computer executable instructions, when executed by processing circuitry, implement the method of any of claims 1 to 14.
CN202210247863.3A 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment Active CN114610394B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210247863.3A CN114610394B (en) 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment
PCT/CN2022/107512 WO2023173642A1 (en) 2022-03-14 2022-07-22 Instruction scheduling method, processing circuit and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210247863.3A CN114610394B (en) 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment

Publications (2)

Publication Number Publication Date
CN114610394A (en) 2022-06-10
CN114610394B (en) 2023-12-22

Family

ID=81863471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210247863.3A Active CN114610394B (en) 2022-03-14 2022-03-14 Instruction scheduling method, processing circuit and electronic equipment

Country Status (2)

Country Link
CN (1) CN114610394B (en)
WO (1) WO2023173642A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023173642A1 (en) * 2022-03-14 2023-09-21 海飞科(南京)信息技术有限公司 Instruction scheduling method, processing circuit and electronic device
CN114996205A (en) * 2022-07-21 2022-09-02 之江实验室 On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system
CN114996205B (en) * 2022-07-21 2022-12-06 之江实验室 On-chip data scheduling controller and method for auxiliary 3D architecture near memory computing system

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101038567A (en) * 2006-03-16 2007-09-19 国际商业机器公司 Method, system, apparatus for performing cacheline polling operation
CN101395573A (en) * 2006-02-28 2009-03-25 Mips技术公司 Distributive scoreboard scheduling in an out-of order processor
US20120096093A1 (en) * 2010-10-19 2012-04-19 Microsoft Corporation Availability management for reference data services
CN102782672A (en) * 2010-02-01 2012-11-14 菲利普·马内 A tile-based processor architecture model for high efficiency embedded homogneous multicore platforms
CN103502906A (en) * 2011-03-30 2014-01-08 讯宝科技公司 Dynamic allocation of processor cores running an operating system
CN109074260A (en) * 2016-04-28 2018-12-21 微软技术许可有限责任公司 Out-of-order block-based processor and instruction scheduler
CN110520851A (en) * 2017-04-10 2019-11-29 Arm有限公司 The communication based on caching between the execution thread of data processing system
US20200193554A1 (en) * 2017-12-04 2020-06-18 Google Llc Synchronized data chaining using on-chip cache
US20200334276A1 (en) * 2019-04-17 2020-10-22 International Business Machines Corporation Dependency-driven workflow management
CN111815104A (en) * 2020-05-18 2020-10-23 深圳市第一反应信息科技有限公司 Method and equipment for scheduling emergency response resources
CN112136303A (en) * 2018-05-24 2020-12-25 国际商业机器公司 Secure delegation of refresh tokens for time-consuming operations
CN113282338A (en) * 2020-02-20 2021-08-20 英特尔公司 Concurrent workload scheduling with multi-level dependencies
CN113874906A (en) * 2020-03-20 2021-12-31 辉达公司 Programming model for resource-constrained scheduling

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE112006004005T5 (en) * 2006-10-27 2009-06-10 Intel Corporation, Santa Clara Communication between multiple execution sequences in a processor
CN111090464B (en) * 2018-10-23 2023-09-22 华为技术有限公司 Data stream processing method and related equipment
US11182207B2 (en) * 2019-06-24 2021-11-23 Nvidia Corporation Pre-fetching task descriptors of dependent tasks
US11119772B2 (en) * 2019-12-06 2021-09-14 International Business Machines Corporation Check pointing of accumulator register results in a microprocessor
CN114610394B (en) * 2022-03-14 2023-12-22 海飞科(南京)信息技术有限公司 Instruction scheduling method, processing circuit and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冷冰; 严晓浪; 孟建熠; 葛海通: "Low-power strategy for instruction caches based on way-access traces", Transducer and Microsystem Technologies (传感器与微系统), no. 09, pages 14-17 *
吴凡 et al.: "A pipeline-stall instruction replacement method based on the Sunway 1621 function library", Computer Systems & Applications (计算机系统应用), vol. 30, no. 7, pages 165-170 *

Also Published As

Publication number Publication date
CN114610394B (en) 2023-12-22
WO2023173642A1 (en) 2023-09-21

Similar Documents

Publication Publication Date Title
KR102496402B1 (en) User-level fork and join processors, methods, systems, and instructions
US8639882B2 (en) Methods and apparatus for source operand collector caching
EP2542973B1 (en) Gpu support for garbage collection
US20080109795A1 (en) C/c++ language extensions for general-purpose graphics processing unit
US20130145124A1 (en) System and method for performing shaped memory access operations
GB2493607A (en) Eliminating redundant instruction processing in an SIMT processor
WO2023173642A1 (en) Instruction scheduling method, processing circuit and electronic device
CN114667508B (en) Method and system for retrieving data for accelerator
US8572355B2 (en) Support for non-local returns in parallel thread SIMD engine
US11947821B2 (en) Methods and systems for managing an accelerator's primary storage unit
US20120047353A1 (en) System and Method Providing Run-Time Parallelization of Computer Software Accommodating Data Dependencies
WO2023103392A1 (en) Method and apparatus for storage management, medium, program product, and system
WO2015017129A1 (en) Multi-threaded gpu pipeline
CN114341805A (en) Pure function language neural network accelerator system and structure
CN114489798A (en) Method and electronic device for determining an out-of-range state of a tensor element
CN114218152B (en) Stream processing method, processing circuit and electronic equipment
CN113961506B (en) Accelerator and electronic device
CN114201444B (en) Method, medium, program product, system, and apparatus for storage management
US20230236878A1 (en) Efficiently launching tasks on a processor
CN114510271B (en) Method and apparatus for loading data in a single instruction multithreaded computing system
CN117501254A (en) Providing atomicity for complex operations using near-memory computation
CN114035847B (en) Method and apparatus for parallel execution of kernel programs
US10114650B2 (en) Pessimistic dependency handling based on storage regions
JP7092783B2 (en) Individual tracking of pending loads and stores
KR102644951B1 (en) Arithmetic Logic Unit Register Sequencing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40069196
Country of ref document: HK
GR01 Patent grant