WO2022121273A1 - Simt instruction processing method and device - Google Patents
- Publication number
- WO2022121273A1 (PCT/CN2021/100808)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- scalar
- vector
- processing unit
- instruction
- simt
Classifications
- G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/3851: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
- G06F9/3885: Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3888: Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
Definitions
- the present invention relates to the field of computer technology, and in particular to a method and device for processing single instruction multiple threads (SIMT) instructions.
- SIMT: single instruction multiple threads
- in parallel computing, the SIMT architecture has greater flexibility and higher efficiency than the simultaneous multithreading (SMT) architecture, and can achieve higher throughput by running a large number of threads in parallel. Therefore, the SIMT architecture is widely used in high-performance processors.
- SMT: simultaneous multithreading
- Embodiments of the present invention provide a SIMT instruction processing method and device, which are used to improve processing efficiency.
- a first aspect provides a SIMT instruction processing device, including a scalar processing unit and a vector processing unit, wherein:
- the scalar processing unit is configured to perform a scalar operation according to a SIMT instruction of a scalar type;
- the vector processing unit is configured to perform a vector operation according to a SIMT instruction of a vector type.
- the scalar processing unit can perform scalar operations for SIMT instructions of scalar type, and the vector processing unit can perform vector operations for SIMT instructions of vector type. Because vector operations and scalar operations are processed separately by different processing units, the scalar operation does not affect the vector operation. Therefore, the processing efficiency of the vector operation can be improved, and the overall processing efficiency of instructions can be improved.
- the apparatus further includes a scalar register set for storing scalar data and a vector register set for storing vector data, wherein:
- the scalar register group is respectively coupled to the scalar processing unit and the vector processing unit, and the vector register group is coupled to the vector processing unit.
- the SIMT instruction processing device includes a scalar register group and a vector register group.
- the information stored in a register in the vector register group can only be accessed by the corresponding thread, while the information stored in the scalar register group is shared by multiple threads, that is, it can be accessed by multiple threads. Since one register in the scalar register group can correspond to multiple threads, the number of registers can be reduced. In addition, since the information stored in a scalar register can be shared by multiple threads, repeated storage of the same information is avoided and the amount of information stored in registers is reduced, thereby saving storage resources.
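- as a rough illustration of the storage saving described above, the following sketch counts the register entries needed to hold one value that is identical for every thread of a thread group. The helper names and the example sizes are assumptions for illustration only and are not part of the embodiments.

```python
# Illustrative only: the helper names and the example sizes are assumptions.
def private_copies(num_groups: int, threads_per_group: int) -> int:
    """Register entries needed when every thread stores its own copy."""
    return num_groups * threads_per_group


def shared_copies(num_groups: int) -> int:
    """Register entries needed when each thread group shares one scalar register."""
    return num_groups


if __name__ == "__main__":
    # Four thread groups of 32 threads each, all holding the same value.
    print(private_copies(4, 32))  # 128
    print(shared_copies(4))       # 4
```

- under these assumed sizes, sharing one scalar register per thread group stores the value 4 times instead of 128 times.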
- the device further includes a crossbar module, the crossbar module includes a plurality of crossbars, wherein:
- the scalar processing unit is connected with the scalar register group through the crossbar module;
- the vector processing unit is respectively connected with the scalar register group and the vector register group through the crossbar module.
- the scalar register group and the scalar processing unit are connected through a crossbar module, which can ensure that the scalar processing unit can access all registers in the scalar register group.
- the vector processing unit is connected to the scalar register group and the vector register group respectively through a crossbar module, which can ensure that the vector processing unit can access the scalar register group and all registers in the vector register group.
- the device further includes a control unit, wherein:
- the control unit is respectively coupled to the scalar processing unit and the vector processing unit;
- the control unit is configured to determine the type of the SIMT instruction, and based on the type of the SIMT instruction, send the SIMT instruction to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.
- the control unit can distribute different types of SIMT instructions to different processing units for processing, so that vector operations and scalar operations are processed separately by different processing units and scalar operations do not affect vector operations. Therefore, the processing efficiency of vector operations can be improved.
- the control unit is configured to determine the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field or an indicator.
- the control unit may determine the type of the SIMT instruction according to the indication information carried by the SIMT instruction. When the indication information is the destination address, the destination address not only points to the storage address of the operation result but also determines the type of the SIMT instruction. The SIMT instruction therefore does not need to carry additional information to indicate its type, which reduces the information carried by the SIMT instruction, thereby improving the transmission efficiency of the instruction and saving transmission resources.
- the apparatus further includes a scalar scheduling unit and a vector scheduling unit, wherein:
- the scalar scheduling unit is coupled to the scalar processing unit, and the vector scheduling unit is coupled to the vector processing unit;
- the scalar scheduling unit is configured to schedule SIMT instructions of a scalar type to the scalar processing unit;
- the vector scheduling unit is configured to schedule SIMT instructions of vector type to the vector processing unit.
- the scheduling unit can schedule the corresponding SIMT instructions according to the situation of the processing unit, so that the SIMT instructions can be executed in an orderly manner.
- the multiple threads correspond to the same base address and to different offset addresses;
- the scalar processing unit is configured to perform a scalar operation on the data corresponding to the base address to obtain a first operation result;
- the vector processing unit is configured to perform a vector operation on the data corresponding to the offset address to obtain a second operation result.
- the scalar processing unit is configured to acquire a first SIMT instruction, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;
- the vector processing unit is configured to acquire a second SIMT instruction, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and to perform an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
- a second aspect provides a SIMT instruction processing method, which is applied to an apparatus for processing SIMT instructions.
- the apparatus includes a scalar processing unit and a vector processing unit, and the method includes:
- the vector operation is performed by the vector processing unit according to the SIMT instruction of the vector type.
- the apparatus further includes a scalar register group and a vector register group
- the method further includes:
- Vector data is stored through the vector register bank.
- the apparatus further includes a control unit, and the method further includes:
- the type of the SIMT instruction is determined by the control unit, and based on the type of the SIMT instruction, the SIMT instruction is sent to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.
- the method further includes:
- the control unit determines the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field or an indicator.
- the apparatus further includes a scalar scheduling unit and a vector scheduling unit, and the method further includes:
- the vector-type SIMT instruction is scheduled to the vector processing unit by the vector scheduling unit.
- the method further includes:
- a second SIMT instruction is obtained through the vector processing unit, an operation is performed based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and the second operation result is stored in the vector register group; and the vector processing unit performs an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
- a third aspect provides a system-on-a-chip, where the system-on-a-chip integrates the device provided by the first aspect or any possible implementation manner of the first aspect.
- the system-on-chip can be composed of a SIMT instruction processing device, and can also include a SIMT instruction processing device and other discrete devices.
- a fourth aspect provides an electronic device, including the SIMT instruction processing apparatus provided by the first aspect or any possible implementation manner of the first aspect, and a discrete device coupled to the SIMT instruction processing apparatus.
- FIG. 1 is a schematic structural diagram of a SIMT instruction processing device provided by an embodiment of the present invention.
- FIG. 2 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention.
- FIG. 3 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention.
- FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention.
- Embodiments of the present invention provide a SIMT instruction processing method and device, which are used to improve processing efficiency. Each of them will be described in detail below.
- SIMD: single instruction multiple data
- scalar coprocessors are often used to process scalar operations to improve the processing efficiency of instructions.
- instruction scheduling is more difficult.
- a possible implementation manner is that, in the SIMT architecture, regardless of whether the SIMT instruction is a scalar instruction or a vector instruction, it is sent down through the same instruction port, and a vector processor is used to perform operations.
- when the SIMT instruction is of scalar type, some of the processing units in the vector processor can be used to perform the operation.
- because some processing units in the vector processor are then used to process scalar-type instructions, fewer processing units in the vector processor are available for vector-type instructions, and the processing efficiency of vector instructions is reduced.
- FIG. 1 is a schematic structural diagram of a SIMT instruction processing apparatus provided by an embodiment of the present invention.
- the SIMT instruction processing apparatus may include a scalar processing unit 11 and a vector processing unit 12 .
- the scalar processing unit 11 is configured to perform a scalar operation according to a SIMT instruction of a scalar type.
- the vector processing unit 12 is configured to perform vector operations according to the SIMT instructions of the vector type.
- when the type of the SIMT instruction is scalar, the scalar processing unit 11 may perform an operation on the SIMT instruction, that is, perform a scalar operation.
- when the type of the SIMT instruction is vector, the vector processing unit 12 may perform an operation on the SIMT instruction, that is, perform a vector operation.
- a scalar is a non-vector quantity, that is, a quantity that has only magnitude and no direction. Scalar operations may include one or more of multiplication, addition, subtraction, division, and the like.
- a vector refers to a quantity that has magnitude and direction.
- Vector operations may include one or more of multiplication, addition, subtraction, division, dot product, cross product, and the like.
- the scalar processing unit 11 may comprise one or more first processing units.
- each first processing unit can process one SIMT instruction in each cycle, and one scalar instruction corresponds to one thread group, so that multiple scalar instructions can be executed in parallel, that is, scalar operations for multiple threads can be performed in parallel.
- Vector processing unit 12 may include one or more second processing units. When the vector processing unit 12 includes multiple second processing units, if the SIMT instruction is a vector instruction, each second processing unit can process one SIMT instruction per cycle, so that the parallel execution of multiple vector instructions can be realized.
- the number of threads corresponding to one vector instruction is the same as the number of threads processed by the second processing unit, so that parallel execution of multiple threads can be implemented in one second processing unit.
- the number of first processing units included in the scalar processing unit 11 and the number of second processing units included in the vector processing unit 12 may be the same or different.
- the first processing unit included in the scalar processing unit 11 may be an arithmetic logic unit (ALU), or may be another unit, which is not limited herein.
- the second processing unit included in the vector processing unit 12 may be an ALU, a special function unit (SFU), a load/store unit (LSU), or another unit, which is not limited herein.
- FIG. 2 is a schematic structural diagram of another SIMT instruction processing apparatus provided by an embodiment of the present invention. Wherein, the SIMT instruction processing apparatus shown in FIG. 2 is obtained by optimizing the SIMT instruction processing apparatus shown in FIG. 1 .
- the SIMT instruction processing apparatus may further include a scalar register group 13 for storing scalar data and a vector register group 14 for storing vector data.
- the scalar register group 13 is respectively coupled to the scalar processing unit 11 and the vector processing unit 12
- the vector register group 14 is coupled to the vector processing unit 12 .
- the scalar register set 13 and the vector register set 14 may be two independent register sets. Both the scalar register set 13 and the vector register set 14 may include multiple sets of registers.
- SIMT instructions can carry source addresses and operation types. After receiving the SIMT instruction, the scalar processing unit 11 may first obtain the operand from the register corresponding to the source address in the scalar register group 13, and then perform scalar operation on the obtained operand according to the operation type. The source address carried by the SIMT instruction received by the scalar processing unit 11 corresponds to a register in the scalar register group 13 .
- each source address corresponds to a register in the scalar register group 13, and the registers in the scalar register group 13 can be accessed by the corresponding thread.
- the threads corresponding to a register in the scalar register group 13 may be the threads of the warp (thread group) to which the register belongs.
- the source address may include one address or multiple addresses, that is, the operand of the scalar instruction may be one or multiple, which is not limited herein.
- after the vector processing unit 12 receives the SIMT instruction, in the case that the source address of the SIMT instruction points to the scalar register group 13, the operand can be obtained from the register corresponding to the source address in the scalar register group 13, and then a vector operation can be performed on the obtained operand according to the operation type.
- in the case that the source address of the SIMT instruction points to the vector register group 14, the operand can be obtained from the register corresponding to the source address in the vector register group 14, and then a vector operation can be performed on the obtained operand according to the operation type.
- in the case that the source addresses of the SIMT instruction point to both the scalar register group 13 and the vector register group 14, one operand can be obtained from the register corresponding to the source address in the scalar register group 13 and another operand from the register corresponding to the source address in the vector register group 14, and then a vector operation can be performed on the obtained operands according to the operation type.
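- the operand fetch described above can be sketched as follows. This is a minimal model under stated assumptions: the register names, the thread count of 4, and the choice of addition as the vector operation are all hypothetical and only serve to show that a scalar operand is shared by every thread while a vector operand is private to each thread.

```python
# Illustrative model only; all names and sizes here are assumptions.
from typing import Dict, List

THREADS = 4  # assumed number of threads handled by one second processing unit

# Hypothetical register files: one shared value per scalar register,
# one value per thread for each vector register.
scalar_regs: Dict[str, int] = {"s0": 10}
vector_regs: Dict[str, List[int]] = {"v0": [0, 1, 2, 3]}


def fetch_operand(src: str) -> List[int]:
    """Return one operand value per thread for the given source address."""
    if src in scalar_regs:
        # A scalar register is shared, so every thread sees the same value.
        return [scalar_regs[src]] * THREADS
    if src in vector_regs:
        # A vector register holds a private value for each thread.
        return list(vector_regs[src])
    raise KeyError(f"unknown source address: {src}")


def vector_add(src_a: str, src_b: str) -> List[int]:
    """Perform a per-thread addition on two fetched operands."""
    a, b = fetch_operand(src_a), fetch_operand(src_b)
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    print(vector_add("s0", "v0"))  # [10, 11, 12, 13]
```

- the same `vector_add` call covers all three cases (scalar sources only, vector sources only, or one of each), because the difference is confined to where each operand is fetched from.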
- in the case that the source address of a SIMT instruction received by the vector processing unit 12 points to the vector register group 14, the source address carried by that SIMT instruction corresponds to multiple registers in the vector register group 14. It can be understood that each address may correspond to multiple registers in the vector register group 14, and each register in the vector register group 14 can only be accessed by the corresponding thread.
- the number of registers in the vector register bank 14 is the same as the number of second processing units included in a set of vector processing units.
- one SIMT instruction can carry multiple source addresses; each source address corresponds to one register in the scalar register group 13, so multiple source addresses correspond to multiple registers in the scalar register group 13.
- the SIMT instruction processing apparatus may further include a crossbar module 15, and the crossbar module 15 may include multiple crossbars.
- the scalar processing unit 11 is connected to the scalar register set 13 through the crossbar module 15 .
- the vector processing unit 12 is connected to the scalar register set 13 and the vector register set 14 respectively through the crossbar module 15 .
- the crossbar module 15 can ensure that the scalar processing unit 11 can access all registers in the scalar register set 13 and the vector processing unit 12 can access all the registers in the scalar register set 13 and the vector register set 14.
- the crossbar module 15 may include two crossbars, one crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13, respectively, and the other crossbar may be coupled to the vector processing unit 12, the scalar register group 13 and the vector register group 14, respectively.
- the crossbar module 15 may include three crossbars, the first crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13 respectively, the second crossbar may be coupled to the vector processing unit 12 and the scalar register group 13 respectively, and the third crossbar may be coupled to the vector processing unit 12 and the vector register group 14 respectively.
- the scalar processing unit 11 can obtain the operand from the register corresponding to the source address in the scalar register group 13 through the first crossbar.
- the first crossbar forwards the read instruction to the scalar register group 13, the scalar register group 13 sends the operand in the register corresponding to the source address to the first crossbar, and the first crossbar forwards the operand to the scalar processing unit 11. The other access paths are similar and are not repeated here.
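- the read path through the first crossbar can be modelled with a short sketch. The class and method names below are hypothetical, and the model only captures the forwarding order (processing unit to crossbar, crossbar to register group, and back); it does not attempt to model arbitration or timing.

```python
# Illustrative model only; class and method names are hypothetical.
from typing import Dict


class ScalarRegisterGroup:
    def __init__(self, values: Dict[int, int]) -> None:
        self._values = dict(values)

    def read(self, address: int) -> int:
        """Return the operand held in the register at `address`."""
        return self._values[address]


class Crossbar:
    """Forwards read requests to a register group and returns the operand."""

    def __init__(self, register_group: ScalarRegisterGroup) -> None:
        self._register_group = register_group
        self.forwarded_reads = 0  # counts requests passed through

    def forward_read(self, address: int) -> int:
        self.forwarded_reads += 1
        # The request goes to the register group; the operand comes back.
        return self._register_group.read(address)


class ScalarProcessingUnit:
    def __init__(self, crossbar: Crossbar) -> None:
        self._crossbar = crossbar

    def load_operand(self, source_address: int) -> int:
        # The unit reaches the register group only through the crossbar.
        return self._crossbar.forward_read(source_address)


registers = ScalarRegisterGroup({0: 7, 1: 42})
first_crossbar = Crossbar(registers)
unit = ScalarProcessingUnit(first_crossbar)
```

- calling `unit.load_operand(1)` in this model returns the value held at source address 1 and increments the crossbar's forwarded-read counter by one.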
- the SIMT instruction processing apparatus may further include a control unit 16 .
- the control unit 16 is coupled to the scalar processing unit 11 and the vector processing unit 12, respectively.
- the control unit 16 is configured to determine the type of the SIMT instruction, the type including a scalar or a vector, and send the SIMT instruction to the scalar processing unit 11 or the vector processing unit 12 based on the type of the SIMT instruction.
- the control unit 16 is configured to determine the type of the SIMT instruction according to the destination address carried by the SIMT instruction.
- the control unit 16 may first determine the type of the SIMT instruction. In the case that the type is scalar, that is, the SIMT instruction is a scalar instruction, the control unit 16 may send the SIMT instruction to the scalar processing unit 11 . In the case that the type is a vector, that is, the SIMT instruction is a vector instruction, the control unit 16 may send the SIMT instruction to the vector processing unit 12 .
- the SIMT instruction may also carry indication information, and the indication information may indicate the type of the SIMT instruction.
- after receiving the SIMT instruction, the control unit 16 can determine the type of the SIMT instruction according to the indication information.
- the indication information can be the destination address.
- after the control unit 16 receives the SIMT instruction, it can first identify whether the destination address is the address of a register in the scalar register group 13 or the address of a register in the vector register group 14, that is, identify whether the destination address points to the scalar register group 13 or to the vector register group 14. When the destination address is the address of a register in the scalar register group 13, that is, when the destination address points to the scalar register group 13, the control unit 16 may send the SIMT instruction to the scalar processing unit 11. When the destination address is the address of a register in the vector register group 14, that is, when the destination address points to the vector register group 14, the control unit 16 may send the SIMT instruction to the vector processing unit 12.
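- the destination-address check described above can be sketched as follows. The address ranges assigned to each register group are assumptions chosen purely for illustration; the embodiment does not specify how the two groups are laid out in the address space.

```python
# Illustrative only: the address ranges and function names are assumptions.
SCALAR_REGISTER_ADDRESSES = range(0, 64)    # hypothetical scalar register addresses
VECTOR_REGISTER_ADDRESSES = range(64, 320)  # hypothetical vector register addresses


def instruction_type(destination_address: int) -> str:
    """Classify an instruction by the register group its destination points to."""
    if destination_address in SCALAR_REGISTER_ADDRESSES:
        return "scalar"
    if destination_address in VECTOR_REGISTER_ADDRESSES:
        return "vector"
    raise ValueError(f"destination address {destination_address} is not mapped")


def dispatch(destination_address: int) -> str:
    """Name the unit that the control unit would send the instruction to."""
    if instruction_type(destination_address) == "scalar":
        return "scalar processing unit 11"
    return "vector processing unit 12"
```

- no extra type field is consulted in this sketch: the destination address alone selects the processing unit.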
- the indication information can also be an indication bit or a flag bit.
- when the indication bit or flag bit has a first value, it can indicate that the SIMT instruction is a vector instruction, and when the indication bit or flag bit has a second value, it can indicate that the SIMT instruction is a scalar instruction.
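- the indication-bit variant can be sketched in a few lines. The 32-bit instruction width, the bit position, and the assignment of 0 to vector and 1 to scalar are all assumptions; the embodiment only states that two distinct values distinguish the two types.

```python
# Illustrative only: the bit position and the value meanings are assumptions.
TYPE_BIT = 31      # hypothetical position of the indication bit
VECTOR_VALUE = 0   # assumed "first value": vector instruction
SCALAR_VALUE = 1   # assumed "second value": scalar instruction


def decode_type(instruction_word: int) -> str:
    """Read the indication bit of a 32-bit instruction word."""
    bit = (instruction_word >> TYPE_BIT) & 1
    return "scalar" if bit == SCALAR_VALUE else "vector"
```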
- the indication information can also be an indication field.
- when the indication field is in the first state, it can indicate that the SIMT instruction is a vector instruction, and when the indication field is in the second state, it can indicate that the SIMT instruction is a scalar instruction.
- the indication information may also be an indicator.
- when the indicator is in the third state, it may indicate that the SIMT instruction is a vector instruction, and when the indicator is in the fourth state, it may indicate that the SIMT instruction is a scalar instruction.
- the SIMT instruction may also indicate the type of SIMT instruction in other ways.
- for example, if the SIMT instruction carries specific information, it may indicate that the SIMT instruction is a scalar instruction, and if the SIMT instruction does not carry such information, it may indicate that the SIMT instruction is a vector instruction, or vice versa.
- the operation result can be stored in the register corresponding to the destination address, so that subsequent calls can be made directly according to the destination address.
- the SIMT instruction processing apparatus may further include a scalar scheduling unit 17 and a vector scheduling unit 18 .
- the scalar scheduling unit 17 is coupled to the scalar processing unit 11
- the vector scheduling unit 18 is coupled to the vector processing unit 12 .
- the scalar scheduling unit 17 is configured to schedule SIMT instructions of scalar type to the scalar processing unit 11 .
- the vector scheduling unit 18 is used for scheduling the SIMT instruction of the vector type to the vector processing unit 12 .
- when there is no idle processing unit in the scalar processing unit 11 or the vector processing unit 12, a SIMT instruction sent by the control unit 16 directly to the scalar processing unit 11 or the vector processing unit 12 cannot be processed. Therefore, the control unit 16 can send scalar-type SIMT instructions to the scalar scheduling unit 17 so that the scalar scheduling unit 17 can schedule the scalar instructions uniformly, and can send vector-type SIMT instructions to the vector scheduling unit 18 so that the vector scheduling unit 18 can schedule the vector instructions uniformly.
- the scheduling may follow the first-in, first-out principle, or may follow priority, that is, the higher the priority, the earlier the instruction is executed. It may also be based on resource occupancy or on other principles, which is not limited herein.
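- two of the scheduling principles mentioned above, first-in first-out and priority, can be sketched with standard Python containers. The class names and the string instruction labels are hypothetical, and breaking priority ties by arrival order is an assumption of this sketch rather than something the embodiment specifies.

```python
# Illustrative only: the queue classes and instruction labels are hypothetical.
import heapq
from collections import deque
from itertools import count
from typing import Deque, List, Tuple


class FifoScheduler:
    """Issues instructions in arrival order."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def submit(self, instruction: str) -> None:
        self._queue.append(instruction)

    def issue(self) -> str:
        return self._queue.popleft()


class PriorityScheduler:
    """Issues the highest-priority instruction first; ties keep arrival order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = count()

    def submit(self, instruction: str, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._arrival), instruction))

    def issue(self) -> str:
        return heapq.heappop(self._heap)[2]
```

- either class could stand in for the scalar scheduling unit 17 or the vector scheduling unit 18 in a simulation, since both expose the same `submit` and `issue` steps apart from the priority argument.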
- the multiple threads correspond to the same base address and to different offset addresses;
- the scalar processing unit 11 is configured to operate on the data of the base address to obtain the first operation result;
- the vector processing unit 12 is configured to operate on the data of the offset address to obtain the second operation result.
- the registers in the scalar register group 13 store data corresponding to the base addresses of multiple threads.
- the scalar processing unit 11 may calculate the data corresponding to the base addresses of the multiple threads, and store the obtained first operation result in the scalar register group 13 for subsequent calls.
- the registers in the vector register group 14 store data corresponding to the offset addresses of the multiple threads.
- the vector processing unit 12 may calculate the data corresponding to the offset addresses of the multiple threads, and store the obtained second operation result in the vector register group 14 for subsequent calling.
- the scalar processing unit 11 can perform a scalar operation according to the SIMT instruction to obtain the first operation result, and then store the first operation result in the scalar register group 13.
- the vector processing unit 12 may perform a vector operation according to the SIMT instruction to obtain a second operation result, and then store the second operation result in the vector register group 14 .
- the SIMT instruction carries the storage address of the first operation result and the storage address of the second operation result. The vector processing unit 12 can obtain the first operation result from the storage address of the first operation result, obtain the second operation result from the storage address of the second operation result, and perform a vector operation on the first operation result and the second operation result to obtain the third operation result.
- the third operation result is the data operation result corresponding to the base address+offset address.
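- the split between base-address work and offset-address work can be illustrated as follows. The concrete operations (multiplying by 4, then adding) and the input values are assumptions for illustration; the embodiment only states that the scalar unit produces one shared first result, the vector unit produces a per-thread second result, and the vector unit combines them into the third result.

```python
# Illustrative only: the operations and values are assumptions, not from the text.
from typing import List


def scalar_unit(base_data: int) -> int:
    """First operation result: computed once and shared by all threads."""
    return base_data * 4  # hypothetical scalar operation


def vector_unit(offset_data: List[int]) -> List[int]:
    """Second operation result: one value per thread."""
    return [offset * 4 for offset in offset_data]  # hypothetical vector operation


def combine(first: int, second: List[int]) -> List[int]:
    """Third operation result: both results combined per thread."""
    return [first + value for value in second]


first_result = scalar_unit(256)             # would be stored in the scalar register group
second_result = vector_unit([0, 1, 2, 3])   # would be stored in the vector register group
third_result = combine(first_result, second_result)
```

- in this sketch the base-address work is done once for the whole thread group, and only the offset work and the final combination are done per thread.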
- the scalar processing unit 11 is configured to obtain a first SIMT instruction carrying a base address, perform an operation based on data corresponding to the base address to obtain a first operation result, and store the first operation result in the scalar register group 13.
- the vector processing unit 12 is configured to obtain the second SIMT instruction carrying the offset address, perform an operation based on the data corresponding to the offset address to obtain the second operation result, and store the second operation result in the vector register group 14; and to operate on the first operation result stored in the scalar register group 13 and the second operation result stored in the vector register group 14 to obtain the third operation result as the task processing result.
- the scalar processing unit 11 may perform a scalar operation according to the SIMT instruction to obtain a first operation result, and then store the first operation result in the scalar register group 13.
- the vector processing unit 12 may first obtain data from the registers in the vector register group 14 corresponding to the offset address and perform a vector operation to obtain the second operation result. It may then obtain the first operation result from the storage address of the first operation result and perform a vector operation on the first operation result and the second operation result to obtain the third operation result.
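The base-plus-offset flow described above can be illustrated with a minimal Python sketch. It is not the patented hardware: the function names, the register names `s0` and `v0`, and the multiply-by-4 operation are hypothetical choices made only to show one value computed once by a scalar unit, one value per thread computed by a vector unit, and a final vector step that combines them.

```python
from typing import Dict, List

WARP_SIZE = 32  # threads per warp, matching the 32-thread organisation in the text


def scalar_op(base_data: int) -> int:
    """Scalar unit: a single computation shared by all threads (first operation result)."""
    return base_data * 4  # hypothetical operation, e.g. scaling an index to a byte address


def vector_op(offset_data: List[int]) -> List[int]:
    """Vector unit: one computation per thread (second operation result)."""
    return [o * 4 for o in offset_data]  # same hypothetical operation, applied per lane


def combine(first: int, second: List[int]) -> List[int]:
    """Vector unit: broadcast the scalar result and add it lane by lane (third result)."""
    return [first + s for s in second]


# Dictionaries standing in for scalar register group 13 and vector register group 14.
scalar_regs: Dict[str, int] = {}
vector_regs: Dict[str, List[int]] = {}

scalar_regs["s0"] = scalar_op(256)                      # computed once for the warp
vector_regs["v0"] = vector_op(list(range(WARP_SIZE)))   # computed once per thread
third = combine(scalar_regs["s0"], vector_regs["v0"])   # base + offset for every thread
```

In this sketch the shared part of the computation runs once rather than 32 times, which is the saving the scalar/vector split is aiming at.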
- FIG. 3 is a schematic structural diagram of another SIMT instruction processing apparatus provided by an embodiment of the present invention.
- the SIMT instruction processing device can support up to 2048 threads. If every 32 threads are organized into one warp, there are 64 warps in total, and the 64 warps can be divided into 8 banks (groups).
- the scalar processing unit includes 4 first processing units, and the vector processing unit includes 4 second processing units, each of which supports 32 threads.
- the scalar register group can include 8 banks, and each bank can include 128 scalar registers.
- the vector register group can include 8 banks, and each bank can include 128 32-thread vector registers. All registers in a bank can be shared by the warps in that bank. Alternatively, the registers in a bank can be divided by warp, in which case the registers assigned to a warp can be shared only by the threads in that warp.
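The arithmetic behind this organisation can be checked with a short Python sketch. The thread, warp, bank, and register counts come from the text; the round-robin mapping `bank = warp % 8` and the equal per-warp split of a bank's registers are assumptions made for illustration, since the text does not state how warps are assigned to banks or how a bank is partitioned.

```python
MAX_THREADS = 2048     # maximum supported threads (from the text)
WARP_SIZE = 32         # threads per warp (from the text)
NUM_BANKS = 8          # banks per register group (from the text)
REGS_PER_BANK = 128    # registers per bank (from the text)

NUM_WARPS = MAX_THREADS // WARP_SIZE       # 64 warps
WARPS_PER_BANK = NUM_WARPS // NUM_BANKS    # 8 warps share each bank


def locate(thread_id):
    """Return (warp, lane, bank) for a global thread id."""
    if not 0 <= thread_id < MAX_THREADS:
        raise ValueError("thread id out of range")
    warp = thread_id // WARP_SIZE
    lane = thread_id % WARP_SIZE
    bank = warp % NUM_BANKS  # assumption: warps are assigned to banks round-robin
    return warp, lane, bank


def regs_per_warp_if_partitioned():
    """Registers available to one warp if a bank is split equally (an assumption)."""
    return REGS_PER_BANK // WARPS_PER_BANK
```

Under the round-robin assumption, even-numbered warps land in even-numbered banks, which is consistent with the text treating warp parity and bank parity as equivalent.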
- Each processing unit requires at least one operand for processing, so there can be an 8x8 crossbar between the scalar processing unit and the scalar register group, which ensures that the scalar processing unit can access the scalar registers of all banks in the scalar register group.
- the scalar processing unit can receive 4 SIMT instructions from the scalar scheduling unit, and the vector processing unit can receive 4 SIMT instructions from the vector scheduling unit. Since only one processing unit can access a given bank at a time, and in order to ensure that multiple processing units access different banks at the same time, the SIMT instructions scheduled by the scalar scheduling unit and the vector scheduling unit in two consecutive clock cycles have opposite parity.
- the control unit may schedule SIMT instructions accessing respective bank0, bank2, bank4 and bank6 through the scalar/vector scheduling unit, and the first/second processing unit may read operand 0 of the respective even-numbered banks.
- the control unit may schedule SIMT instructions accessing respective bank1, bank3, bank5, and bank7 through the scalar/vector scheduling unit, and the first/second processing unit may read operand 1 of the respective even-numbered banks and operand 0 of the respective odd-numbered banks.
- the first/second processing unit may read operand 2 of the respective odd bank.
- if the scheduling ensures that the bank parity alternates between adjacent moments and that the scheduled banks do not conflict at the same moment, the maximum access efficiency without crossbar conflicts can be guaranteed.
- the 8 operand read interfaces of the 4 first/second processing units respectively access the respective bank0-bank7.
- the processing unit can read two operands from the same bank in two consecutive beats, that is, two cycles. Because the instructions issued in two consecutive beats have opposite warp parity, that is, opposite bank parity, there is no conflict, and it can be guaranteed that up to 8 processing unit read interfaces can access the 8 banks at the same time.
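The parity-interleaved schedule can be checked with a small Python simulation. This is a simplified model and not the actual scheduler: it assumes that four instructions issue per cycle, that each instruction reads exactly two operands from its own bank in consecutive cycles, and that a bank serves at most one read per cycle. The `interleave` flag is a hypothetical switch added only to show what happens when the parity rule is dropped.

```python
from collections import defaultdict

UNITS = 4      # processing units on the scalar (or vector) side, from the text
OPERANDS = 2   # assumption: two operands per instruction, read one per cycle


def banks_for_cycle(cycle, interleave=True):
    """Banks targeted by the instructions issued in a cycle.

    With interleaving, even cycles use banks 0, 2, 4, 6 and odd cycles use
    banks 1, 3, 5, 7. Without it, every cycle uses the even banks.
    """
    parity = cycle % 2 if interleave else 0
    return [parity + 2 * unit for unit in range(UNITS)]


def simulate(num_cycles, interleave=True):
    """Return {cycle: [bank, ...]} listing every operand read in that cycle."""
    reads = defaultdict(list)
    for issue_cycle in range(num_cycles):
        for bank in banks_for_cycle(issue_cycle, interleave):
            for k in range(OPERANDS):
                # Operand k is read k cycles after the instruction issues.
                reads[issue_cycle + k].append(bank)
    return reads


def conflict_free(reads):
    """True if no bank is read more than once in any single cycle."""
    return all(len(banks) == len(set(banks)) for banks in reads.values())


interleaved = simulate(6)
not_interleaved = simulate(6, interleave=False)
```

In the interleaved run, a steady-state cycle carries eight reads that hit eight distinct banks. In the non-interleaved run, operand 1 of the previous instruction and operand 0 of the new one target the same bank, so the check fails.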
- FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention.
- the SIMT instruction processing method may be applied to the SIMT instruction processing apparatus shown in FIG. 1 to FIG. 3.
- the SIMT instruction processing method may include the following steps.
- the type of the SIMT instruction is determined by the control unit according to the destination address carried by the SIMT instruction.
- the first SIMT instruction is obtained through the scalar processing unit, an operation is performed based on the data corresponding to the base address carried by the first SIMT instruction to obtain the first operation result, and the first operation result is stored in the scalar register group; the second SIMT instruction is obtained through the vector processing unit, an operation is performed based on the data corresponding to the offset address carried by the second SIMT instruction to obtain the second operation result, and the second operation result is stored in the vector register group; and the vector processing unit operates on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain the third operation result as the task processing result.
- the SIMT instruction processing method may be a combination of all or some of steps 401 to 404, which is not limited herein.
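The type-determination step of the method can be sketched in Python as follows. The text says only that the control unit determines the instruction type from the destination address; the `Instr` record, its field names, and the two address ranges below are hypothetical and chosen purely for illustration.

```python
from collections import namedtuple

# Hypothetical instruction record; the field names are illustrative only.
Instr = namedtuple("Instr", ["opcode", "dest", "src"])

# Assumption: destination addresses 0-127 name scalar registers and
# 128-255 name vector registers. The text does not define an encoding.
SCALAR_DEST = range(0, 128)
VECTOR_DEST = range(128, 256)


def classify(instr):
    """Control unit: derive the instruction type from its destination address."""
    if instr.dest in SCALAR_DEST:
        return "scalar"
    if instr.dest in VECTOR_DEST:
        return "vector"
    raise ValueError("destination address outside both register groups")


def dispatch(program):
    """Route each instruction to the queue of the matching scheduling unit."""
    queues = {"scalar": [], "vector": []}
    for instr in program:
        queues[classify(instr)].append(instr)
    return queues


program = [
    Instr("mul", 3, (1, 2)),      # writes a scalar register: scalar type
    Instr("add", 130, (128, 3)),  # writes a vector register: vector type
    Instr("mov", 5, (4,)),        # writes a scalar register: scalar type
]
queues = dispatch(program)
```

Per the claims, the indication could equally be an indication bit, an indication field, or an indicator; with such an encoding only `classify` would need to change.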
- a system-on-chip is provided, and the system-on-chip may include the SIMT instruction processing apparatus provided in the above embodiments.
- the system-on-chip may consist of the SIMT instruction processing apparatus alone, or may include the SIMT instruction processing apparatus and other discrete devices.
- an electronic device is provided, including the SIMT instruction processing apparatus provided in the above-mentioned embodiments and a discrete device coupled to the SIMT instruction processing apparatus.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Multimedia (AREA)
- Advance Control (AREA)
- Complex Calculations (AREA)
- Executing Machine-Instructions (AREA)
Claims (16)
- A single-instruction multi-thread (SIMT) instruction processing apparatus, comprising a scalar processing unit and a vector processing unit, wherein: the scalar processing unit is configured to perform a scalar operation according to a SIMT instruction of a scalar type; and the vector processing unit is configured to perform a vector operation according to a SIMT instruction of a vector type.
- The apparatus according to claim 1, wherein the apparatus further comprises a control unit, wherein: the control unit is coupled to the scalar processing unit and to the vector processing unit respectively; and the control unit is configured to determine the type of the SIMT instruction and, based on the type of the SIMT instruction, send the SIMT instruction to the scalar processing unit or the vector processing unit, wherein the type includes scalar or vector.
- The apparatus according to claim 2, wherein the control unit is configured to determine the type of the SIMT instruction according to indication information carried by the SIMT instruction, wherein the indication information includes a destination address, an indication bit, an indication field, or an indicator.
- The apparatus according to any one of claims 1-3, wherein the apparatus further comprises a scalar scheduling unit and a vector scheduling unit, wherein: the scalar scheduling unit is coupled to the scalar processing unit and is configured to schedule SIMT instructions of the scalar type to the scalar processing unit; and the vector scheduling unit is coupled to the vector processing unit and is configured to schedule SIMT instructions of the vector type to the vector processing unit.
- The apparatus according to any one of claims 1-4, wherein, in the case where multiple threads execute the same task in parallel, the multiple threads correspond to the same base address and to different offset addresses, wherein the scalar processing unit is configured to perform a scalar operation on the data corresponding to the base address to obtain a first operation result, and the vector processing unit is configured to perform a vector operation on the data corresponding to the offset addresses to obtain a second operation result.
- The apparatus according to any one of claims 1-5, wherein the apparatus further comprises a scalar register group for storing scalar data and a vector register group for storing vector data, wherein: the scalar register group is coupled to the scalar processing unit and to the vector processing unit respectively; and the vector register group is coupled to the vector processing unit.
- The apparatus according to claim 6, wherein the apparatus further comprises a crossbar module, the crossbar module comprising a plurality of crossbars, wherein: the scalar processing unit is connected to the scalar register group through the crossbar module; and the vector processing unit is connected to the scalar register group and to the vector register group respectively through the crossbar module.
- The apparatus according to claim 6 or 7, wherein the scalar processing unit is configured to obtain a first SIMT instruction, perform an operation based on data corresponding to a base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group; and the vector processing unit is configured to obtain a second SIMT instruction, perform an operation based on data corresponding to an offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and to perform an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
- A single-instruction multi-thread (SIMT) instruction processing method, applied to a SIMT instruction processing apparatus comprising a scalar processing unit and a vector processing unit, the method comprising: performing, by the scalar processing unit, a scalar operation according to a SIMT instruction of a scalar type; and performing, by the vector processing unit, a vector operation according to a SIMT instruction of a vector type.
- The method according to claim 9, wherein the apparatus further comprises a control unit, and the method further comprises: determining, by the control unit, the type of the SIMT instruction and, based on the type of the SIMT instruction, sending the SIMT instruction to the scalar processing unit or the vector processing unit, wherein the type includes scalar or vector.
- The method according to claim 10, wherein the method further comprises: determining, by the control unit, the type of the SIMT instruction according to indication information carried by the SIMT instruction, wherein the indication information includes a destination address, an indication bit, an indication field, or an indicator.
- The method according to any one of claims 9-11, wherein the apparatus further comprises a scalar scheduling unit and a vector scheduling unit, and the method further comprises: scheduling, by the scalar scheduling unit, SIMT instructions of the scalar type to the scalar processing unit; and scheduling, by the vector scheduling unit, SIMT instructions of the vector type to the vector processing unit.
- The method according to any one of claims 9-12, wherein the apparatus further comprises a scalar register group and a vector register group, and the method further comprises: storing scalar data in the scalar register group; and storing vector data in the vector register group.
- The method according to claim 13, wherein the method further comprises: obtaining, by the scalar processing unit, a first SIMT instruction, performing an operation based on data corresponding to a base address carried by the first SIMT instruction to obtain a first operation result, and storing the first operation result in the scalar register group; obtaining, by the vector processing unit, a second SIMT instruction, performing an operation based on data corresponding to an offset address carried by the second SIMT instruction to obtain a second operation result, and storing the second operation result in the vector register group; and performing, by the vector processing unit, an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
- A system-on-chip integrating the single-instruction multi-thread (SIMT) instruction processing apparatus according to any one of claims 1-8.
- An electronic device, comprising the single-instruction multi-thread (SIMT) instruction processing apparatus according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022523849A JP2023509813A (en) | 2020-12-11 | 2021-06-18 | SIMT command processing method and device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011452846.0 | 2020-12-11 | ||
CN202011452846.0A CN114625421A (en) | 2020-12-11 | 2020-12-11 | SIMT instruction processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022121273A1 (en) | 2022-06-16 |
Family
ID=81895766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/100808 WO2022121273A1 (en) | 2020-12-11 | 2021-06-18 | Simt instruction processing method and device |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP2023509813A (en) |
CN (1) | CN114625421A (en) |
WO (1) | WO2022121273A1 (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5423051A (en) * | 1992-09-24 | 1995-06-06 | International Business Machines Corporation | Execution unit with an integrated vector operation capability |
US20130042090A1 (en) * | 2011-08-12 | 2013-02-14 | Ronny M. KRASHINSKY | Temporal simt execution optimization |
CN104699465A (en) * | 2015-03-26 | 2015-06-10 | 中国人民解放军国防科学技术大学 | Vector access and storage device supporting SIMT in vector processor and control method |
US20160188531A1 (en) * | 2014-12-24 | 2016-06-30 | Samsung Electronics Co., Ltd. | Operation processing apparatus and method |
CN106257411A (en) * | 2015-06-17 | 2016-12-28 | 联发科技股份有限公司 | Single instruction multithread calculating system and method thereof |
CN111240745A (en) * | 2019-02-20 | 2020-06-05 | 上海天数智芯半导体有限公司 | Enhanced scalar vector dual pipeline architecture for interleaved execution |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170371654A1 (en) * | 2016-06-23 | 2017-12-28 | Advanced Micro Devices, Inc. | System and method for using virtual vector register files |
US10776311B2 (en) * | 2017-03-14 | 2020-09-15 | Azurengine Technologies Zhuhai Inc. | Circular reconfiguration for a reconfigurable parallel processor using a plurality of chained memory ports |
-
2020
- 2020-12-11 CN CN202011452846.0A patent/CN114625421A/en active Pending
-
2021
- 2021-06-18 JP JP2022523849A patent/JP2023509813A/en active Pending
- 2021-06-18 WO PCT/CN2021/100808 patent/WO2022121273A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN114625421A (en) | 2022-06-14 |
JP2023509813A (en) | 2023-03-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9672035B2 (en) | Data processing apparatus and method for performing vector processing | |
US10768989B2 (en) | Virtual vector processing | |
US7418576B1 (en) | Prioritized issuing of operation dedicated execution unit tagged instructions from multiple different type threads performing different set of operations | |
US9207995B2 (en) | Mechanism to speed-up multithreaded execution by register file write port reallocation | |
US9092429B2 (en) | DMA vector buffer | |
US8539211B2 (en) | Allocating registers for loop variables in a multi-threaded processor | |
US10268519B2 (en) | Scheduling method and processing device for thread groups execution in a computing system | |
US9286114B2 (en) | System and method for launching data parallel and task parallel application threads and graphics processing unit incorporating the same | |
Chen et al. | Improving GPGPU performance via cache locality aware thread block scheduling | |
US20110119468A1 (en) | Mechanism of supporting sub-communicator collectives with o(64) counters as opposed to one counter for each sub-communicator | |
JP2017045151A (en) | Arithmetic processing device and control method of arithmetic processing device | |
TW201543357A (en) | Detecting data dependencies of instructions associated with threads in a simultaneous multithreading scheme | |
US20220220644A1 (en) | Warp scheduling method and stream multiprocessor using the same | |
WO2022121273A1 (en) | Simt instruction processing method and device | |
WO2021111272A1 (en) | Processor unit for multiply and accumulate operations | |
US20100011195A1 (en) | Processor | |
US8055883B2 (en) | Pipe scheduling for pipelines based on destination register number | |
WO2022161013A1 (en) | Processor apparatus and instruction execution method therefor, and computing device | |
CN112463218B (en) | Instruction emission control method and circuit, data processing method and circuit | |
US20220197696A1 (en) | Condensed command packet for high throughput and low overhead kernel launch | |
WO2022121090A1 (en) | Processor supporting high-throughput multi-precision multiplication | |
US20130262819A1 (en) | Single cycle compare and select operations | |
US8683181B2 (en) | Processor and method for distributing load among plural pipeline units | |
JP5630798B1 (en) | Processor and method | |
WO2022141321A1 (en) | Dsp and parallel computing method therefor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| ENP | Entry into the national phase | Ref document number: 2022523849; Country of ref document: JP; Kind code of ref document: A |
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21901985; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21901985; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 231123) |