WO2022121273A1

WO2022121273A1 - SIMT instruction processing method and device

Info

Publication number: WO2022121273A1
Application number: PCT/CN2021/100808
Authority: WO
Inventors: 周俊; 王文强; 夏晓旭
Original assignee: 上海阵量智能科技有限公司
Priority date: 2020-12-11
Filing date: 2021-06-18
Publication date: 2022-06-16
Also published as: CN114625421A; JP2023509813A

Abstract

Embodiments of the present invention provide a single instruction multiple threads (SIMT) instruction processing method and device. The device comprises a scalar processing unit and a vector processing unit, wherein the scalar processing unit is configured to perform a scalar operation according to a scalar-type SIMT instruction, and the vector processing unit is configured to perform a vector operation according to a vector-type SIMT instruction. The embodiments of the present invention can improve processing efficiency.

Description

SIMT instruction processing method and device

Cross-reference statement

The present invention claims the priority of the Chinese Patent Application No. 202011452846.0 filed with the Chinese Patent Office on December 11, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of computer technology, and in particular to a single instruction multiple threads (SIMT) instruction processing method and device.

Background

In parallel computing, the SIMT architecture offers greater flexibility and higher efficiency than the simultaneous multithreading (SMT) architecture, and can achieve higher throughput by running a large number of threads in parallel. The SIMT architecture is therefore widely used in high-performance processors.

Parallel operations involve a large number of scalar operations, such as base-address operations, that operate on only a single thread. How to improve the processing efficiency of instructions in this setting is a problem to be solved.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide a SIMT instruction processing method and device, which are used to improve processing efficiency.

A first aspect provides a SIMT instruction processing device, including a scalar processing unit and a vector processing unit, wherein:

The scalar processing unit is used to perform a scalar operation according to a SIMT instruction of a scalar type;

The vector processing unit is configured to perform vector operations according to a SIMT instruction of a vector type.

In the SIMT instruction processing device provided in the embodiment of the present invention, the scalar processing unit can perform scalar operations for SIMT instructions of the scalar type, and the vector processing unit can perform vector operations for SIMT instructions of the vector type. Because vector operations and scalar operations are handled by different processing units, scalar operations do not affect vector operations, and the processing efficiency of vector operations can therefore be improved. In addition, since vector operations and scalar operations do not affect each other and can be performed at the same time, the overall processing efficiency of instructions can be improved.

As a possible implementation manner, the apparatus further includes a scalar register group for storing scalar data and a vector register group for storing vector data, wherein:

The scalar register group is respectively coupled to the scalar processing unit and the vector processing unit, and the vector register group is coupled to the vector processing unit.

The SIMT instruction processing device provided by the embodiment of the present invention includes a scalar register group and a vector register group. The information stored in a register in the vector register group can be accessed only by the corresponding thread, whereas the information stored in the scalar register group is shared by multiple threads, that is, it can be accessed by multiple threads. Since one register in the scalar register group can correspond to multiple threads, the number of registers can be reduced. In addition, since the information stored in a scalar register can be shared by multiple threads, repeated storage of the same information can be avoided and the amount of information stored in registers is reduced, thereby saving storage resources.

As a possible implementation manner, the device further includes a crossbar module, the crossbar module includes a plurality of crossbars, wherein:

the scalar processing unit is connected with the scalar register group through the crossbar module;

The vector processing unit is respectively connected with the scalar register group and the vector register group through the crossbar module.

In the SIMT instruction processing device provided by the embodiment of the present invention, the scalar register group and the scalar processing unit are connected through a crossbar module, which can ensure that the scalar processing unit can access all registers in the scalar register group. The vector processing unit is connected to the scalar register group and the vector register group respectively through a crossbar module, which can ensure that the vector processing unit can access the scalar register group and all registers in the vector register group.

As a possible implementation manner, the device further includes a control unit, wherein:

the control unit is respectively coupled to the scalar processing unit and the vector processing unit;

The control unit is configured to determine the type of the SIMT instruction, and based on the type of the SIMT instruction, send the SIMT instruction to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.

In the SIMT instruction processing device provided by the embodiment of the present invention, the control unit can distribute different types of SIMT instructions to different processing units, so that vector operations and scalar operations are processed separately by different processing units. Scalar operations therefore do not affect vector operations, and the processing efficiency of vector operations can be improved.

As a possible implementation manner, the control unit is configured to determine the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field or an indicator.

In the SIMT instruction processing apparatus provided by the embodiment of the present invention, the control unit may determine the type of the SIMT instruction according to the indication information carried by the SIMT instruction. When the indication information is the destination address, the destination address not only points to the storage address of the operation result but also serves to determine the type of the SIMT instruction. The SIMT instruction therefore does not need to carry additional information to indicate its type, which reduces the information carried by the SIMT instruction, thereby improving the transmission efficiency of the instruction and saving transmission resources.

As a possible implementation manner, the apparatus further includes a scalar scheduling unit and a vector scheduling unit, wherein:

The scalar scheduling unit is coupled to the scalar processing unit, and the vector scheduling unit is coupled to the vector processing unit;

the scalar scheduling unit, configured to schedule SIMT instructions of a scalar type to the scalar processing unit;

The vector scheduling unit is configured to schedule SIMT instructions of vector type to the vector processing unit.

In the SIMT instruction processing apparatus provided by the embodiment of the present invention, the scheduling unit can schedule the corresponding SIMT instructions according to the situation of the processing unit, so that the SIMT instructions can be executed in an orderly manner.

As a possible implementation manner, when multiple threads execute the same task in parallel, the multiple threads correspond to the same base address and to different offset addresses, wherein the scalar processing unit is configured to perform a scalar operation on the data corresponding to the base address to obtain a first operation result, and the vector processing unit is configured to perform a vector operation on the data corresponding to the offset addresses to obtain a second operation result.

As a possible implementation manner, the scalar processing unit is configured to acquire a first SIMT instruction, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;

The vector processing unit is configured to acquire a second SIMT instruction, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and to perform an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.

A second aspect provides a SIMT instruction processing method, which is applied to an apparatus for processing SIMT instructions, the apparatus including a scalar processing unit and a vector processing unit, and the method including:

performing a scalar operation by the scalar processing unit according to a SIMT instruction of a scalar type;

performing a vector operation by the vector processing unit according to a SIMT instruction of a vector type.

As a possible implementation manner, the apparatus further includes a scalar register group and a vector register group, and the method further includes:

storing scalar data through the scalar register group;

storing vector data through the vector register group.

As a possible implementation manner, the apparatus further includes a control unit, and the method further includes:

The type of the SIMT instruction is determined by the control unit, and based on the type of the SIMT instruction, the SIMT instruction is sent to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.

As a possible implementation manner, the method further includes:

The control unit determines the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field or an indicator.

As a possible implementation manner, the apparatus further includes a scalar scheduling unit and a vector scheduling unit, and the method further includes:

Scheduling the scalar type SIMT instruction to the scalar processing unit by the scalar scheduling unit;

The vector-type SIMT instruction is scheduled to the vector processing unit by the vector scheduling unit.

As a possible implementation manner, the method further includes:

Obtain the first SIMT instruction through the scalar processing unit, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;

Obtain a second SIMT instruction through the vector processing unit, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and perform, by the vector processing unit, an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.

A third aspect provides a system-on-a-chip, where the system-on-a-chip integrates the device provided by the first aspect or any possible implementation manner of the first aspect. The system-on-chip can be composed of a SIMT instruction processing device, and can also include a SIMT instruction processing device and other discrete devices.

A fourth aspect provides an electronic device, including the SIMT instruction processing apparatus provided by the first aspect or any possible implementation manner of the first aspect, and a discrete device coupled to the SIMT instruction processing apparatus.

Description of drawings

FIG. 1 is a schematic structural diagram of a SIMT instruction processing device provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention;

FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention.

Detailed description

Embodiments of the present invention provide a SIMT instruction processing method and device, which are used to improve processing efficiency. Each of them will be described in detail below.

In order to better understand the SIMT instruction processing method and device provided by the embodiments of the present invention, the application scenarios to which the embodiments are applicable are described first. Parallel operations involve a large number of scalar operations, such as base-address operations, that operate on only a single thread. In the single instruction multiple data (SIMD) architecture, a scalar coprocessor is often used to process scalar operations to improve the processing efficiency of instructions. In the SIMT architecture, however, instruction scheduling is more difficult. One possible approach in the SIMT architecture is to issue every SIMT instruction through the same instruction port, regardless of whether it is a scalar instruction or a vector instruction, and to use a vector processor to perform the operation. For example, when the SIMT instruction is of the scalar type, some of the processing units in the vector processor can be used to perform the operation. However, because some processing units in the vector processor are then occupied by scalar-type instructions, fewer processing units remain available for vector-type instructions, and the processing efficiency of vector instructions is reduced.

Please refer to FIG. 1. FIG. 1 is a schematic structural diagram of a SIMT instruction processing apparatus provided by an embodiment of the present invention. As shown in FIG. 1, the SIMT instruction processing apparatus may include a scalar processing unit 11 and a vector processing unit 12.

The scalar processing unit 11 is configured to perform a scalar operation according to a SIMT instruction of a scalar type.

The vector processing unit 12 is configured to perform vector operations according to the SIMT instructions of the vector type.

When the type of the SIMT instruction is scalar, that is, when the SIMT instruction is a scalar instruction, the scalar processing unit 11 may perform the operation specified by the SIMT instruction, that is, perform a scalar operation. When the type of the SIMT instruction is vector, that is, when the SIMT instruction is a vector instruction, the vector processing unit 12 may perform the operation specified by the SIMT instruction, that is, perform a vector operation. A scalar is a quantity that has only magnitude and no direction. Scalar operations may include one or more of multiplication, addition, subtraction, division, and the like. A vector is a quantity that has both magnitude and direction. Vector operations may include one or more of multiplication, addition, subtraction, division, dot product, cross product, and the like.

The scalar processing unit 11 may include one or more first processing units. When the scalar processing unit 11 includes multiple first processing units and the SIMT instructions are scalar instructions, each first processing unit can process one SIMT instruction per cycle, and one scalar instruction corresponds to one thread group, so that multiple scalar instructions can be executed in parallel, that is, scalar operations for multiple threads can be performed in parallel. The vector processing unit 12 may include one or more second processing units. When the vector processing unit 12 includes multiple second processing units and the SIMT instructions are vector instructions, each second processing unit can process one SIMT instruction per cycle, so that multiple vector instructions can be executed in parallel. In addition, the number of threads corresponding to one vector instruction is the same as the number of threads processed by a second processing unit, so that multiple threads can be executed in parallel within one second processing unit. The number of first processing units included in the scalar processing unit 11 and the number of second processing units included in the vector processing unit 12 may be the same or different. A first processing unit included in the scalar processing unit 11 may be an arithmetic and logic unit (ALU) or another unit, which is not limited herein. A second processing unit included in the vector processing unit 12 may be an ALU, a special function unit (SFU), a load store unit (LSU), or another unit, which is not limited herein.
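As a rough illustration of the split between the two units, the following Python sketch models a scalar operation that is evaluated once for a thread group and a vector operation that is evaluated once per thread. The names `OPS`, `scalar_execute` and `vector_execute`, and the group width of four threads, are assumptions made only for this illustration and do not appear in the embodiment.

```python
import operator

# Hypothetical operation table; the embodiment only names the operation kinds.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}


def scalar_execute(op, a, b):
    """Scalar operation: one result shared by the whole thread group."""
    return OPS[op](a, b)


def vector_execute(op, a_lanes, b_lanes):
    """Vector operation: one result per thread (lane)."""
    if len(a_lanes) != len(b_lanes):
        raise ValueError("operand vectors must have the same number of lanes")
    return [OPS[op](a, b) for a, b in zip(a_lanes, b_lanes)]


# One scalar instruction yields a single value for the thread group.
scalar_result = scalar_execute("mul", 3, 4)

# One vector instruction yields a value for each of the 4 assumed threads.
vector_result = vector_execute("add", [0, 1, 2, 3], [10, 10, 10, 10])
```

In this toy model the scalar path does one multiplication regardless of how many threads are in the group, which is the saving the embodiment attributes to handling scalar instructions in a separate unit.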

Please refer to FIG. 2. FIG. 2 is a schematic structural diagram of another SIMT instruction processing apparatus provided by an embodiment of the present invention. The SIMT instruction processing apparatus shown in FIG. 2 is obtained by optimizing the SIMT instruction processing apparatus shown in FIG. 1.

In one embodiment, the SIMT instruction processing apparatus may further include a scalar register group 13 for storing scalar data and a vector register group 14 for storing vector data.

The scalar register group 13 is coupled to the scalar processing unit 11 and to the vector processing unit 12, and the vector register group 14 is coupled to the vector processing unit 12.

The scalar register group 13 and the vector register group 14 may be two independent register groups, and each may include multiple groups of registers. A SIMT instruction can carry a source address and an operation type. After receiving a SIMT instruction, the scalar processing unit 11 may first obtain the operand from the register corresponding to the source address in the scalar register group 13, and then perform a scalar operation on the obtained operand according to the operation type. The source address carried by a SIMT instruction received by the scalar processing unit 11 corresponds to a register in the scalar register group 13. It can be understood that each source address corresponds to one register in the scalar register group 13, and a register in the scalar register group 13 can be accessed by its corresponding threads. The threads corresponding to a register in the scalar register group 13 may be the threads of the warp (thread bundle) to which the register belongs. The source address may include one address or multiple addresses, that is, a scalar instruction may have one operand or multiple operands, which is not limited herein.

After the vector processing unit 12 receives a SIMT instruction, if the source address of the SIMT instruction points to the scalar register group 13, the operand can be obtained from the register corresponding to the source address in the scalar register group 13, and a vector operation can then be performed on the obtained operand according to the operation type. If the source address of the SIMT instruction points to the vector register group 14, the operand can be obtained from the register corresponding to the source address in the vector register group 14, and a vector operation can then be performed on the obtained operand according to the operation type. If the source address of the SIMT instruction points to both the scalar register group 13 and the vector register group 14, operands can be obtained from the register corresponding to the source address in the scalar register group 13 and from the register corresponding to the source address in the vector register group 14, and a vector operation can then be performed on the obtained operands according to the operation type. In one case, when the source address of a SIMT instruction received by the vector processing unit 12 points to the vector register group 14, the source address carried by the SIMT instruction corresponds to multiple registers in the vector register group 14. It can be understood that each address may correspond to multiple registers in the vector register group 14, and each register in the vector register group 14 can be accessed only by its corresponding thread. The number of registers in the vector register group 14 is the same as the number of second processing units included in a group of vector processing units. In another case, when the source address of the SIMT instruction points to the scalar register group 13, one SIMT instruction can carry multiple source addresses, each source address corresponds to one register in the scalar register group 13, and the multiple source addresses correspond to multiple registers in the scalar register group 13.
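The operand-fetch behaviour described above can be sketched in Python as follows. This is a minimal model under stated assumptions: the `("s", index)` / `("v", index)` source-address encoding, the `RegisterGroups` class, the helper names and the four-thread group are all hypothetical and are not specified by the embodiment.

```python
from dataclasses import dataclass, field

NUM_THREADS = 4  # assumed number of threads per thread group


@dataclass
class RegisterGroups:
    # Scalar group: one value per register, shared by all threads.
    scalar: dict = field(default_factory=dict)
    # Vector group: one value per thread for each register.
    vector: dict = field(default_factory=dict)


def fetch_operand(regs, source):
    """Return per-thread operand values for a (group, index) source address."""
    group, index = source
    if group == "s":
        # A scalar register is visible to every thread, so broadcast it.
        return [regs.scalar[index]] * NUM_THREADS
    if group == "v":
        return list(regs.vector[index])
    raise ValueError(f"unknown register group: {group!r}")


def vector_add(regs, sources):
    """Vector add whose two sources may point to either register group."""
    a, b = (fetch_operand(regs, s) for s in sources)
    return [x + y for x, y in zip(a, b)]


regs = RegisterGroups(scalar={0: 100}, vector={0: [0, 4, 8, 12]})

# One source in the scalar group, one in the vector group.
mixed = vector_add(regs, [("s", 0), ("v", 0)])
```

Here the scalar value is stored once and read by all four threads, while each entry of the vector register is read only by its own thread.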

In one embodiment, the SIMT instruction processing apparatus may further include a crossbar module 15, and the crossbar module 15 may include multiple crossbars.

The scalar processing unit 11 is connected to the scalar register group 13 through the crossbar module 15.

The vector processing unit 12 is connected to the scalar register group 13 and the vector register group 14 respectively through the crossbar module 15.

The crossbar module 15 can ensure that the scalar processing unit 11 can access all registers in the scalar register group 13 and that the vector processing unit 12 can access all registers in the scalar register group 13 and the vector register group 14. The crossbar module 15 may include multiple crossbars. For example, the crossbar module 15 may include two crossbars: one crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13, and the other crossbar may be coupled to the vector processing unit 12, the scalar register group 13 and the vector register group 14. For another example, the crossbar module 15 may include three crossbars: the first crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13, the second crossbar may be coupled to the vector processing unit 12 and the scalar register group 13, and the third crossbar may be coupled to the vector processing unit 12 and the vector register group 14. When the crossbar module 15 includes three crossbars, the scalar processing unit 11 can obtain the operand from the register corresponding to the source address in the scalar register group 13 through the first crossbar. Specifically, the scalar processing unit 11 sends a read instruction to the first crossbar, the first crossbar forwards the read instruction to the scalar register group 13, the scalar register group 13 sends the operand in the register corresponding to the source address to the first crossbar, and the first crossbar forwards the operand to the scalar processing unit 11. The other access paths are similar and are not repeated here.
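The read path through the three-crossbar arrangement can be pictured with the short Python sketch below. It is only a behavioural model: the `RegisterGroup` and `Crossbar` classes, the requester labels and the `log` list are hypothetical names introduced for illustration, and a real crossbar is a hardware switch rather than a method call.

```python
class RegisterGroup:
    """Toy register group addressed by integer register index."""

    def __init__(self, values):
        self.values = dict(values)

    def read(self, address):
        return self.values[address]


class Crossbar:
    """Forwards read instructions from a processing unit to a register group."""

    def __init__(self, name, register_group):
        self.name = name
        self.register_group = register_group
        self.log = []  # records forwarded requests, for illustration only

    def read(self, requester, address):
        self.log.append((requester, address))
        # Forward the read instruction, then hand the operand back.
        return self.register_group.read(address)


scalar_regs = RegisterGroup({0: 7, 1: 9})
vector_regs = RegisterGroup({0: [1, 2, 3, 4]})

# Three-crossbar arrangement from the example in the text.
first = Crossbar("first", scalar_regs)    # scalar unit to scalar registers
second = Crossbar("second", scalar_regs)  # vector unit to scalar registers
third = Crossbar("third", vector_regs)    # vector unit to vector registers

operand_for_scalar_unit = first.read("scalar_unit", 1)
operand_for_vector_unit = third.read("vector_unit", 0)
```

Because the first and second crossbars both front the same scalar register group, either processing unit can reach every scalar register without sharing a path with the other.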

In one embodiment, the SIMT instruction processing apparatus may further include a control unit 16.

The control unit 16 is coupled to the scalar processing unit 11 and the vector processing unit 12, respectively.

The control unit 16 is configured to determine the type of the SIMT instruction, the type including a scalar or a vector, and send the SIMT instruction to the scalar processing unit 11 or the vector processing unit 12 based on the type of the SIMT instruction.

In one embodiment, the control unit 16 is configured to determine the type of the SIMT instruction according to the destination address carried by the SIMT instruction.

After receiving the SIMT instruction, the control unit 16 may first determine the type of the SIMT instruction. If the type is scalar, that is, the SIMT instruction is a scalar instruction, the control unit 16 may send the SIMT instruction to the scalar processing unit 11. If the type is vector, that is, the SIMT instruction is a vector instruction, the control unit 16 may send the SIMT instruction to the vector processing unit 12.

The SIMT instruction may also carry indication information, and the indication information may indicate the type of the SIMT instruction. After receiving the SIMT instruction, the control unit 16 can determine the type of the SIMT instruction according to the indication information.

The indication information can be the destination address. After the control unit 16 receives the SIMT instruction, it can first identify whether the destination address is the address of a register in the scalar register group 13 or the address of a register in the vector register group 14, that is, whether the destination address points to the scalar register group 13 or to the vector register group 14. When the destination address is the address of a register in the scalar register group 13, that is, when the destination address points to the scalar register group 13, the control unit 16 may send the SIMT instruction to the scalar processing unit 11. When the destination address is the address of a register in the vector register group 14, that is, when the destination address points to the vector register group 14, the control unit 16 may send the SIMT instruction to the vector processing unit 12.
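A minimal Python sketch of destination-address dispatch is given below. The address ranges, the `SimtInstruction` fields and the function names are assumptions chosen for illustration; the embodiment does not specify any address map or instruction encoding.

```python
from dataclasses import dataclass

# Assumed address map; the embodiment does not define one.
SCALAR_REG_RANGE = range(0, 32)   # hypothetical scalar register addresses
VECTOR_REG_RANGE = range(32, 96)  # hypothetical vector register addresses


@dataclass(frozen=True)
class SimtInstruction:
    op: str
    dest: int
    sources: tuple


def classify(instr):
    """Infer the instruction type from the register group the destination is in."""
    if instr.dest in SCALAR_REG_RANGE:
        return "scalar"
    if instr.dest in VECTOR_REG_RANGE:
        return "vector"
    raise ValueError(f"destination {instr.dest} is outside both register groups")


def dispatch(instr, scalar_queue, vector_queue):
    """Send the instruction towards the matching processing unit."""
    target = scalar_queue if classify(instr) == "scalar" else vector_queue
    target.append(instr)


scalar_queue, vector_queue = [], []
dispatch(SimtInstruction("add", dest=3, sources=(1, 2)), scalar_queue, vector_queue)
dispatch(SimtInstruction("mul", dest=40, sources=(3, 33)), scalar_queue, vector_queue)
```

Note that the second instruction reads a scalar source (address 3) but writes a vector destination, so under this scheme it is treated as a vector instruction, and no separate type field is needed.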

The indication information can also be an indication bit or a flag bit. When the indication bit or flag bit has a first value, it can indicate that the SIMT instruction is a vector instruction, and when the indication bit or flag bit has a second value, it can indicate that the SIMT instruction is a scalar instruction.

The indication information can also be an indication field. When the indication field is in the first state, it can indicate that the SIMT instruction is a vector instruction, and when the indication field is in the second state, it can indicate that the SIMT instruction is a scalar instruction.

The indication information may also be an indicator. When the indicator is in the third state, it may indicate that the SIMT instruction is a vector instruction, and if the indicator is in the fourth state, it may indicate that the SIMT instruction is a scalar instruction.

The type of the SIMT instruction may also be indicated in other ways. For example, if the SIMT instruction carries a specified flag field, identifier, or the like, this may indicate that the SIMT instruction is a scalar instruction, and if the SIMT instruction does not carry such information, this may indicate that the SIMT instruction is a vector instruction, or vice versa.

It should be understood that the above explanation of the indication information is only exemplary, and does not constitute a limitation on the indication information.
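For the indication-bit variant, decoding could look like the following Python sketch. The 32-bit instruction word, the bit position, and the choice of 1 as the first value (vector) and 0 as the second value (scalar) are all assumptions for illustration; the embodiment leaves the encoding open.

```python
WORD_BITS = 32    # assumed instruction word width
TYPE_BIT = 31     # hypothetical position of the indication bit
VECTOR_VALUE = 1  # assumed "first value"; 0 is then the "second value"


def instruction_type(word):
    """Decode the assumed indication bit of an instruction word."""
    if not 0 <= word < 2 ** WORD_BITS:
        raise ValueError("instruction word does not fit in 32 bits")
    bit = (word >> TYPE_BIT) & 1
    return "vector" if bit == VECTOR_VALUE else "scalar"


vector_word = (1 << TYPE_BIT) | 0x1234  # indication bit set
scalar_word = 0x1234                    # indication bit clear

decoded = [instruction_type(vector_word), instruction_type(scalar_word)]
```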

In addition, when the SIMT instruction carries the destination address, after the scalar processing unit 11 or the vector processing unit 12 completes the operation, the operation result can be stored in the register corresponding to the destination address, so that the result can subsequently be retrieved directly by the destination address.

In one embodiment, the SIMT instruction processing apparatus may further include a scalar scheduling unit 17 and a vector scheduling unit 18.

The scalar scheduling unit 17 is coupled to the scalar processing unit 11, and the vector scheduling unit 18 is coupled to the vector processing unit 12.

The scalar scheduling unit 17 is configured to schedule SIMT instructions of the scalar type to the scalar processing unit 11.

The vector scheduling unit 18 is configured to schedule SIMT instructions of the vector type to the vector processing unit 12.

When there is no idle processing unit in the scalar processing unit 11 or the vector processing unit 12, a SIMT instruction sent by the control unit 16 to the scalar processing unit 11 or the vector processing unit 12 cannot be processed. Therefore, the control unit 16 can send scalar-type SIMT instructions to the scalar scheduling unit 17, so that the scalar scheduling unit 17 schedules the scalar instructions uniformly, and can send vector-type SIMT instructions to the vector scheduling unit 18, so that the vector scheduling unit 18 schedules the vector instructions uniformly. Scheduling may follow the first-in, first-out principle, or may be based on priority, that is, the higher the priority, the earlier the instruction is executed; it may also be based on resource occupancy or on other principles, which is not limited herein.
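Two of the scheduling principles mentioned above, first-in first-out and priority-based, can be sketched in Python as follows. The class names, the `unit_idle` flag and the numeric priorities are hypothetical; the sketch only shows that instructions wait in the scheduler until a processing unit is idle.

```python
import heapq
from collections import deque
from itertools import count


class FifoScheduler:
    """Issues instructions in arrival order."""

    def __init__(self):
        self._queue = deque()

    def submit(self, instr):
        self._queue.append(instr)

    def issue(self, unit_idle):
        """Issue the oldest instruction only if a processing unit is idle."""
        if unit_idle and self._queue:
            return self._queue.popleft()
        return None


class PriorityScheduler:
    """Issues the highest-priority instruction first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps equal priorities in FIFO order

    def submit(self, instr, priority):
        # heapq is a min-heap, so negate: higher priority issues first.
        heapq.heappush(self._heap, (-priority, next(self._order), instr))

    def issue(self, unit_idle):
        if unit_idle and self._heap:
            return heapq.heappop(self._heap)[2]
        return None


scalar_scheduler = FifoScheduler()
for name in ("s0", "s1"):
    scalar_scheduler.submit(name)

vector_scheduler = PriorityScheduler()
vector_scheduler.submit("v_low", priority=1)
vector_scheduler.submit("v_high", priority=5)

stalled = scalar_scheduler.issue(unit_idle=False)  # no idle unit: nothing issues
first_scalar = scalar_scheduler.issue(unit_idle=True)
first_vector = vector_scheduler.issue(unit_idle=True)
```

Keeping one scheduler per instruction type means a backlog of vector instructions does not delay the issue of scalar instructions, and vice versa.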

In one embodiment, when multiple threads execute the same task in parallel, the multiple threads correspond to the same base address and to different offset addresses. The scalar processing unit 11 is configured to operate on the data of the base address to obtain the first operation result, and the vector processing unit 12 is configured to operate on the data of the offset addresses to obtain the second operation result.

The registers in the scalar register group 13 store data corresponding to the base addresses of multiple threads. The scalar processing unit 11 may operate on the data corresponding to the base addresses of the multiple threads, and store the obtained first operation result in the scalar register group 13 for subsequent use. The registers in the vector register group 14 store data corresponding to the offset addresses of the multiple threads. The vector processing unit 12 may operate on the data corresponding to the offset addresses of the multiple threads, and store the obtained second operation result in the vector register group 14 for subsequent use.

For example, after acquiring a SIMT instruction of the scalar type, the scalar processing unit 11 can perform a scalar operation according to the SIMT instruction to obtain the first operation result, and then store the first operation result in the scalar register group 13. After receiving a SIMT instruction of the vector type, the vector processing unit 12 may perform a vector operation according to the SIMT instruction to obtain the second operation result, and then store the second operation result in the vector register group 14. When a SIMT instruction of the vector type received by the vector processing unit 12 carries the storage address of the first operation result and the storage address of the second operation result, the vector processing unit 12 can obtain the first operation result from the storage address of the first operation result, obtain the second operation result from the storage address of the second operation result, and perform a vector operation on the first operation result and the second operation result to obtain a third operation result. When the first operation result is the data operation result corresponding to the base address and the second operation result is the data operation result corresponding to the offset address, the third operation result is the data operation result corresponding to the base address + offset address.
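The base-plus-offset split described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware of the embodiment: the function names, the register names `s0`, `v0` and `v1`, and the choice of multiplication as each unit's "operation" are assumptions made for the example. The sketch shows one value being computed once for all threads, one value being computed per thread, and a final vector operation combining the two.

```python
# Illustrative model only; names and operations are assumptions, not from the source.
WARP_SIZE = 32  # threads per warp, as in the FIG. 3 example


def scalar_op(base_data, stride):
    """Scalar unit: one computation shared by all threads (first operation result)."""
    return base_data * stride


def vector_op(offset_data, elem_size):
    """Vector unit: one computation per thread (second operation result)."""
    return [off * elem_size for off in offset_data]


def combine(first, second):
    """Vector unit: broadcast the scalar result and add it lane by lane
    (third operation result, i.e. base address + offset address)."""
    return [first + s for s in second]


scalar_regs = {}  # stands in for the scalar register group 13
vector_regs = {}  # stands in for the vector register group 14

scalar_regs["s0"] = scalar_op(base_data=0x100, stride=16)           # computed once
vector_regs["v0"] = vector_op(list(range(WARP_SIZE)), elem_size=4)  # computed per thread
vector_regs["v1"] = combine(scalar_regs["s0"], vector_regs["v0"])
```

In this model the base-address work is done once rather than 32 times, which is the saving the scalar processing unit is intended to provide.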

In one embodiment, the scalar processing unit 11 is configured to obtain a first SIMT instruction carrying a base address, perform an operation based on data corresponding to the base address to obtain a first operation result, and store the first operation result in the scalar register group 13.

The vector processing unit 12 is configured to obtain a second SIMT instruction carrying the offset address, perform an operation based on the data corresponding to the offset address to obtain a second operation result, and store the second operation result in the vector register group 14; and to perform an operation on the first operation result stored in the scalar register group 13 and the second operation result stored in the vector register group 14 to obtain a third operation result as the task processing result.

After acquiring a SIMT instruction of the scalar type, the scalar processing unit 11 may perform a scalar operation according to the SIMT instruction to obtain the first operation result, and then store the first operation result in the scalar register group 13. When the vector processing unit 12 receives a SIMT instruction carrying the storage address of the first operation result and the offset address, the vector processing unit 12 may first obtain data from the registers in the vector register group 14 corresponding to the offset address and perform a vector operation to obtain the second operation result, then obtain the first operation result from the storage address of the first operation result, and perform a vector operation on the first operation result and the second operation result to obtain the third operation result.

The working principle of the SIMT instruction processing apparatus is described below by way of example. Please refer to FIG. 3. FIG. 3 is a schematic structural diagram of another SIMT instruction processing apparatus provided by an embodiment of the present invention. In FIG. 3, it is assumed that the SIMT instruction processing device can support up to 2048 threads. If the threads are organized into warps of 32 threads each, there are 64 warps in total, and the 64 warps can be divided into 8 banks (groups). It is assumed that the scalar processing unit includes 4 first processing units, and the vector processing unit includes 4 second processing units, each of which supports 32 threads. The scalar register group can include 8 banks, and each bank can include 128 scalar registers. The vector register group can include 8 banks, and each bank can include 128 32-thread vector registers. All registers in a bank can be shared by the warps in that bank. Alternatively, the registers in a bank can be divided among the warps, in which case the registers assigned to a warp can be shared only by the threads in that warp.
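The sizing in the FIG. 3 example can be checked with a short calculation. The constants below come from the example; the warp-to-bank mapping (`warp_id % NUM_BANKS`) and the even split of a bank's registers among its warps are assumptions made for illustration, since the embodiment does not fix either choice.

```python
# Numbers are from the FIG. 3 example; the mapping and the even split are assumptions.
MAX_THREADS = 2048
WARP_SIZE = 32
NUM_BANKS = 8
SCALAR_REGS_PER_BANK = 128
VECTOR_REGS_PER_BANK = 128

num_warps = MAX_THREADS // WARP_SIZE     # 64 warps
warps_per_bank = num_warps // NUM_BANKS  # 8 warps share each bank


def bank_of(warp_id):
    """Hypothetical mapping: interleave warps across banks so that
    warp parity equals bank parity."""
    return warp_id % NUM_BANKS


def private_regs_per_warp(regs_per_bank):
    """Registers each warp would get if a bank were split evenly by warp."""
    return regs_per_bank // warps_per_bank


total_scalar_regs = NUM_BANKS * SCALAR_REGS_PER_BANK
total_vector_lanes = NUM_BANKS * VECTOR_REGS_PER_BANK * WARP_SIZE
```

Under these assumptions each warp would own 16 scalar and 16 vector registers when a bank is partitioned, and all 128 registers of its bank when the bank is shared.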

Each processing unit requires at least one operand for processing, so there can be an 8x8 crossbar between the scalar processing unit and the scalar register group, which ensures that the scalar processing unit can access the scalar registers of all banks in the scalar register group. There can be an 8x8 32-thread crossbar between the vector processing unit and the vector register group, and an 8x8 crossbar between the vector processing unit and the scalar register group, so that the vector processing unit can access the registers of all banks in the scalar register group and the vector register group.

In each clock cycle, the scalar processing unit can receive 4 SIMT instructions from the scalar scheduling unit, and the vector processing unit can receive 4 SIMT instructions from the vector scheduling unit. Since only one processing unit can access a given bank at a time, in order to ensure that multiple processing units access different banks at the same time, the SIMT instructions scheduled by the scalar scheduling unit and the vector scheduling unit in two consecutive clock cycles have opposite parity.

For example, at time 0, the control unit may schedule, through the scalar/vector scheduling unit, SIMT instructions that access bank0, bank2, bank4 and bank6 respectively, and the first/second processing units may read operand 0 from their respective even-numbered banks. At time 1, the control unit may schedule, through the scalar/vector scheduling unit, SIMT instructions that access bank1, bank3, bank5 and bank7 respectively, and the first/second processing units may read operand 1 from their respective even-numbered banks and operand 0 from their respective odd-numbered banks. At time 2, the first/second processing units may read operand 2 from their respective odd-numbered banks. Therefore, as long as the scheduling interleaves parity at adjacent moments and the scheduled banks do not conflict at the same moment, maximum access efficiency without crossbar conflicts can be guaranteed. For example, at time 1 above, the 8 operand read interfaces of the 4 first/second processing units access bank0 to bank7 respectively.

It can be seen that a processing unit can read two operands from the same bank in two consecutive beats, that is, two cycles. Because the instructions issued in two consecutive beats have opposite warp parity, and therefore opposite bank parity, no conflict arises, and up to 8 processing unit read interfaces can access 8 banks at the same time.
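The parity-interleaved schedule can be checked with a small simulation. This is a sketch under stated assumptions rather than a model of the actual scheduler: it assumes each instruction reads two operands (operand 0 and operand 1) from one bank in two consecutive cycles, that four instructions are issued every cycle, and that even banks are targeted on even cycles and odd banks on odd cycles. The function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 8
UNITS = 4     # first/second processing units per processing unit
OPERANDS = 2  # assumption: operand 0 and operand 1, read in consecutive cycles


def issue_banks(cycle):
    """Banks targeted by the instructions issued in this cycle:
    even banks on even cycles, odd banks on odd cycles."""
    parity = cycle % 2
    return [b for b in range(NUM_BANKS) if b % 2 == parity]


def bank_reads(cycle):
    """All (bank, operand) reads in flight during `cycle`: operand k of an
    instruction issued in cycle (cycle - k)."""
    reads = []
    for k in range(OPERANDS):
        issued = cycle - k
        if issued >= 0:
            reads += [(bank, k) for bank in issue_banks(issued)]
    return reads


def has_conflict(cycle):
    """True if any bank is read more than once in the same cycle."""
    counts = Counter(bank for bank, _ in bank_reads(cycle))
    return any(n > 1 for n in counts.values())
```

Calling `bank_reads(1)` lists eight reads that touch eight distinct banks, matching the time 1 case described above, and `has_conflict` stays false for every cycle because the two groups of instructions in flight always have opposite bank parity.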

Please refer to FIG. 4. FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention. The SIMT instruction processing method may be applied to the SIMT instruction processing apparatus shown in FIG. 1 to FIG. 3. As shown in FIG. 4, the SIMT instruction processing method may include the following steps.

401. Determine the type of the SIMT instruction by the control unit, and send the SIMT instruction to the scalar scheduling unit or the vector scheduling unit based on the type of the SIMT instruction.

The type of the SIMT instruction is determined by the control unit according to the destination address carried by the SIMT instruction.

402. Schedule the SIMT instruction of the scalar type to the scalar processing unit by using the scalar scheduling unit, and schedule the SIMT instruction of the vector type to the vector processing unit by using the vector scheduling unit.

403. Perform a scalar operation according to a SIMT instruction of a scalar type by a scalar processing unit, and perform a vector operation according to a SIMT instruction of a vector type by the vector processing unit.

404. Store scalar data through a scalar register set, and store vector data through a vector register set.
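Steps 401 and 402 can be sketched as a simple dispatcher. The sketch assumes a hypothetical encoding in which the destination address is a register name beginning with `s` (scalar register group) or `v` (vector register group); the embodiment only states that the type is determined from the destination address, so the encoding, the class name `SimtInstruction`, and the use of queues for the scheduling units are all illustrative.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SimtInstruction:
    opcode: str
    dest: str  # hypothetical encoding: "s<n>" = scalar register, "v<n>" = vector register


def instruction_type(inst):
    """Step 401: classify by destination address (the encoding is an assumption)."""
    if inst.dest.startswith("s"):
        return "scalar"
    if inst.dest.startswith("v"):
        return "vector"
    raise ValueError(f"unknown destination {inst.dest!r}")


scalar_queue = deque()  # stands in for the scalar scheduling unit
vector_queue = deque()  # stands in for the vector scheduling unit


def dispatch(inst):
    """Step 401: send the instruction to the matching scheduling unit."""
    queue = scalar_queue if instruction_type(inst) == "scalar" else vector_queue
    queue.append(inst)


for inst in [SimtInstruction("mul", "s0"),
             SimtInstruction("mul", "v0"),
             SimtInstruction("add", "v1")]:
    dispatch(inst)
```

Each queue would then be drained by its own scheduling unit toward the scalar or vector processing unit, as in step 402.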

Optionally, obtain the first SIMT instruction through the scalar processing unit, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain the first operation result, and store the first operation result in the scalar register group; obtain the second SIMT instruction through the vector processing unit, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain the second operation result, and store the second operation result in the vector register group; and perform, by the vector processing unit, an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain the third operation result as the task processing result.

It should be noted that, for the relevant functions of the specific processes in the SIMT instruction processing method described in the embodiments of the present invention, reference may be made to the relevant descriptions in the embodiments of the SIMT instruction processing apparatus described in FIG. 1 to FIG. 3, which are not repeated here.

It can be understood that the SIMT instruction processing method may be a combination of all or part of the steps in step 401 to step 404, which is not limited herein.

In some embodiments, a system-on-chip is provided, and the system-on-chip may include the SIMT instruction processing apparatus provided in the above embodiments. The system-on-chip may consist of the SIMT instruction processing apparatus alone, or may include the SIMT instruction processing apparatus and other discrete devices.

In some embodiments, an electronic device is provided, including the SIMT instruction processing apparatus provided in the above-mentioned embodiments and a discrete device coupled to the SIMT instruction processing apparatus.

The embodiments of the present invention have been described in detail above, and specific examples are used to illustrate the principles and implementations of the present invention. The descriptions of the above embodiments are only intended to help understand the methods and core ideas of the present invention. At the same time, persons of ordinary skill in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. To sum up, the contents of this specification should not be construed as limiting the present invention.

Claims

A single-instruction multi-thread SIMT instruction processing device, comprising a scalar processing unit and a vector processing unit, wherein:

The scalar processing unit is used to perform a scalar operation according to a SIMT instruction of a scalar type;

The vector processing unit is configured to perform vector operations according to a SIMT instruction of a vector type.
The apparatus of claim 1, wherein the apparatus further comprises a control unit, wherein:

the control unit is respectively coupled to the scalar processing unit and the vector processing unit;

The control unit is configured to determine the type of the SIMT instruction, and based on the type of the SIMT instruction, send the SIMT instruction to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.
The device according to claim 2, wherein the control unit is configured to determine the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field, or an indicator.
The apparatus according to any one of claims 1-3, wherein the apparatus further comprises a scalar scheduling unit and a vector scheduling unit, wherein:

The scalar scheduling unit is coupled to the scalar processing unit, and is configured to schedule SIMT instructions of a scalar type to the scalar processing unit;

The vector scheduling unit is coupled to the vector processing unit, and is configured to schedule SIMT instructions of vector type to the vector processing unit.
The device according to any one of claims 1-4, wherein, in the case where multiple threads execute the same task in parallel, the multiple threads correspond to the same base address and correspond to different offset addresses, wherein:

the scalar processing unit is configured to perform a scalar operation on the data corresponding to the base address to obtain a first operation result; and

the vector processing unit is configured to perform a vector operation on the data corresponding to the offset address to obtain a second operation result.
The apparatus according to any one of claims 1-5, wherein the apparatus further comprises a scalar register group for storing scalar data and a vector register group for storing vector data, wherein:

the scalar register group is respectively coupled to the scalar processing unit and the vector processing unit,

The vector register set is coupled to the vector processing unit.
The apparatus of claim 6, wherein the apparatus further comprises a crossbar module, the crossbar module comprising a plurality of crossbars, wherein:

the scalar processing unit is connected with the scalar register group through the crossbar module;

The vector processing unit is respectively connected with the scalar register group and the vector register group through the crossbar module.
The apparatus according to claim 6 or 7, wherein the scalar processing unit is configured to obtain a first SIMT instruction, perform an operation based on data corresponding to a base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;

The vector processing unit is configured to obtain a second SIMT instruction, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and to perform an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
A single-instruction multi-thread SIMT instruction processing method, applied to a SIMT instruction processing device, the device comprising a scalar processing unit and a vector processing unit, including:

Perform scalar operations by the scalar processing unit according to the SIMT instruction of the scalar type;

The vector operation is performed by the vector processing unit according to the SIMT instruction of the vector type.
The method according to claim 9, wherein the device further comprises a control unit, the method further comprising:

The type of the SIMT instruction is determined by the control unit, and based on the type of the SIMT instruction, the SIMT instruction is sent to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.
The method of claim 10, wherein the method further comprises:

The control unit determines the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field or an indicator.
The method according to any one of claims 9-11, wherein the apparatus further comprises a scalar scheduling unit and a vector scheduling unit, and the method further comprises:

Scheduling the scalar type SIMT instruction to the scalar processing unit by the scalar scheduling unit;

The vector-type SIMT instruction is scheduled to the vector processing unit by the vector scheduling unit.
The method according to any one of claims 9-12, wherein the device further comprises a scalar register group and a vector register group, and the method further comprises:

storing scalar data through the scalar register bank;

Vector data is stored through the vector register bank.
The method of claim 13, wherein the method further comprises:

Obtain a first SIMT instruction through the scalar processing unit, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;

Obtain a second SIMT instruction through the vector processing unit, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and

A third operation result is obtained by performing an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group by the vector processing unit.
A system-on-a-chip integrated with the single-instruction multi-thread SIMT instruction processing device according to any one of claims 1-8.
An electronic device, comprising the single-instruction multi-thread SIMT instruction processing apparatus according to any one of claims 1-8.