WO2022121273A1 - SIMT instruction processing method and device - Google Patents

SIMT instruction processing method and device

Info

Publication number
WO2022121273A1
Authority
WO
WIPO (PCT)
Prior art keywords
scalar
vector
processing unit
instruction
simt
Prior art date
Application number
PCT/CN2021/100808
Other languages
French (fr)
Chinese (zh)
Inventor
周俊
王文强
夏晓旭
Original Assignee
上海阵量智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海阵量智能科技有限公司
Priority to JP2022523849A (JP2023509813A)
Publication of WO2022121273A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3888 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel

Definitions

  • The present invention relates to the field of computer technology, and in particular to a method and device for processing single instruction multiple threads (SIMT) instructions.
  • SIMT: single instruction multiple threads
  • In parallel computing, the SIMT architecture offers greater flexibility and higher efficiency than the simultaneous multithreading (SMT) architecture, and can achieve higher throughput by running a large number of threads in parallel. The SIMT architecture is therefore widely used in high-performance processors.
  • SMT: simultaneous multithreading
  • Embodiments of the present invention provide a SIMT instruction processing method and device, which are used to improve processing efficiency.
  • a first aspect provides a SIMT instruction processing device, including a scalar processing unit and a vector processing unit, wherein:
  • the scalar processing unit is used to perform a scalar operation according to a SIMT instruction of a scalar type
  • the vector processing unit is configured to perform vector operations according to a SIMT instruction of a vector type.
  • The scalar processing unit performs scalar operations for SIMT instructions of the scalar type, and the vector processing unit performs vector operations for SIMT instructions of the vector type. Because vector operations and scalar operations are handled by different processing units, scalar operations do not affect vector operations, so the processing efficiency of vector operations can be improved.
  • As a result, the overall processing efficiency of instructions can be improved.
  • the apparatus further includes a scalar register set for storing scalar data and a vector register set for storing vector data, wherein:
  • the scalar register group is respectively coupled to the scalar processing unit and the vector processing unit, and the vector register group is coupled to the vector processing unit.
  • the SIMT instruction processing device includes a scalar register group and a vector register group.
  • The information stored in a register of the vector register group can be accessed only by the corresponding thread, while the information stored in the scalar register group is shared by, and can be accessed by, multiple threads. Since one register in the scalar register group can correspond to multiple threads, the number of registers can be reduced. In addition, since the information stored in a scalar register is shared by multiple threads, the same information does not need to be stored repeatedly, which reduces the amount of information held in registers and saves storage resources.
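The register saving described above can be illustrated with a small count. This is only a sketch: the thread-group size of 32 and the helper name `registers_needed` are assumptions for illustration and do not come from the publication.

```python
WARP_SIZE = 32  # assumed number of threads per thread group (hypothetical)


def registers_needed(num_values: int, shared: bool, warp_size: int = WARP_SIZE) -> int:
    """Count the register slots needed to hold `num_values` values for one thread group.

    A shared (scalar) register holds one copy for all threads;
    a per-thread (vector) register needs one copy for each thread.
    """
    return num_values if shared else num_values * warp_size


# A value that is identical for every thread, such as a common base address.
uniform_in_vector = registers_needed(1, shared=False)  # one copy per thread
uniform_in_scalar = registers_needed(1, shared=True)   # one copy per thread group
```

Under these assumptions, keeping a thread-uniform value in the scalar register group takes one slot instead of one slot per thread.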
  • the device further includes a crossbar module, the crossbar module includes a plurality of crossbars, wherein:
  • the scalar processing unit is connected with the scalar register group through the crossbar module;
  • the vector processing unit is respectively connected with the scalar register group and the vector register group through the crossbar module.
  • the scalar register group and the scalar processing unit are connected through a crossbar module, which can ensure that the scalar processing unit can access all registers in the scalar register group.
  • the vector processing unit is connected to the scalar register group and the vector register group respectively through a crossbar module, which can ensure that the vector processing unit can access the scalar register group and all registers in the vector register group.
  • the device further includes a control unit, wherein:
  • the control unit is coupled to the scalar processing unit and to the vector processing unit;
  • the control unit is configured to determine the type of the SIMT instruction, and based on the type of the SIMT instruction, send the SIMT instruction to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.
  • The control unit can distribute different types of SIMT instructions to different processing units, so that vector operations and scalar operations are processed separately and scalar operations do not affect vector operations. Therefore, the processing efficiency of vector operations can be improved.
  • the control unit is configured to determine the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field or an indicator.
  • The control unit may determine the type of the SIMT instruction according to the indication information carried by the SIMT instruction. When the indication information is the destination address, the destination address not only points to the storage address of the operation result but also determines the type of the SIMT instruction. The SIMT instruction therefore does not need to carry additional information to indicate its type, which reduces the information carried by the SIMT instruction, improves the transmission efficiency of the instruction and saves transmission resources.
  • the apparatus further includes a scalar scheduling unit and a vector scheduling unit, wherein:
  • the scalar scheduling unit is coupled to the scalar processing unit, and the vector scheduling unit is coupled to the vector processing unit;
  • the scalar scheduling unit is configured to schedule SIMT instructions of a scalar type to the scalar processing unit
  • the vector scheduling unit is configured to schedule SIMT instructions of vector type to the vector processing unit.
  • the scheduling unit can schedule the corresponding SIMT instructions according to the situation of the processing unit, so that the SIMT instructions can be executed in an orderly manner.
  • the multiple threads correspond to the same base address and to different offset addresses
  • the scalar processing unit is configured to perform a scalar operation on the data corresponding to the base address to obtain a first operation result
  • the vector processing unit is configured to perform a vector operation on the data corresponding to the offset address to obtain a second operation result.
  • the scalar processing unit is configured to acquire a first SIMT instruction, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;
  • the vector processing unit is configured to acquire a second SIMT instruction, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and to perform an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
  • a second aspect provides a SIMT instruction processing method, which is applied to an apparatus for processing SIMT instructions.
  • the apparatus includes a scalar processing unit and a vector processing unit, and the method includes:
  • the vector operation is performed by the vector processing unit according to the SIMT instruction of the vector type.
  • the apparatus further includes a scalar register group and a vector register group
  • the method further includes:
  • Vector data is stored through the vector register bank.
  • the apparatus further includes a control unit, and the method further includes:
  • the type of the SIMT instruction is determined by the control unit, and based on the type of the SIMT instruction, the SIMT instruction is sent to the scalar processing unit or the vector processing unit; wherein the type includes scalar or vector.
  • the method further includes:
  • the control unit determines the type of the SIMT instruction according to the indication information carried by the SIMT instruction; wherein the indication information includes a destination address, an indication bit, an indication field or an indicator.
  • the apparatus further includes a scalar scheduling unit and a vector scheduling unit, and the method further includes:
  • the vector-type SIMT instruction is scheduled to the vector processing unit by the vector scheduling unit.
  • the method further includes:
  • A second SIMT instruction is obtained by the vector processing unit, an operation is performed based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and the second operation result is stored in the vector register group; and the vector processing unit performs an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
  • a third aspect provides a system-on-a-chip, where the system-on-a-chip integrates the device provided by the first aspect or any possible implementation manner of the first aspect.
  • the system-on-chip can be composed of a SIMT instruction processing device, and can also include a SIMT instruction processing device and other discrete devices.
  • a fourth aspect provides an electronic device, including the SIMT instruction processing apparatus provided by the first aspect or any possible implementation manner of the first aspect, and a discrete device coupled to the SIMT instruction processing apparatus.
  • FIG. 1 is a schematic structural diagram of a SIMT instruction processing device provided by an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention.
  • Embodiments of the present invention provide a SIMT instruction processing method and device, which are used to improve processing efficiency. Each of them will be described in detail below.
  • SIMD: Single Instruction Multiple Data
  • scalar coprocessors are often used to process scalar operations to improve the processing efficiency of instructions.
  • instruction scheduling is more difficult.
  • A possible implementation manner is that, in the SIMT architecture, a SIMT instruction is issued through the same instruction port regardless of whether it is a scalar instruction or a vector instruction, and a vector processor is used to perform the operation.
  • When the SIMT instruction is of scalar type, some of the processing units in the vector processor can be used to perform the operation.
  • Because some processing units in the vector processor are then occupied by scalar-type instructions, fewer processing units remain for vector-type instructions, and the processing efficiency of vector instructions is reduced.
  • FIG. 1 is a schematic structural diagram of a SIMT instruction processing apparatus provided by an embodiment of the present invention.
  • the SIMT instruction processing apparatus may include a scalar processing unit 11 and a vector processing unit 12 .
  • the scalar processing unit 11 is configured to perform a scalar operation according to a SIMT instruction of a scalar type.
  • the vector processing unit 12 is configured to perform vector operations according to the SIMT instructions of the vector type.
  • When the type of the SIMT instruction is scalar, the scalar processing unit 11 may perform an operation on the SIMT instruction, that is, perform a scalar operation.
  • When the type of the SIMT instruction is vector, the vector processing unit 12 may perform an operation on the SIMT instruction, that is, perform a vector operation.
  • A scalar is a quantity that has only magnitude and no direction. Scalar operations can be one or more of multiplication, addition, subtraction, division, and the like.
  • a vector refers to a quantity that has magnitude and direction.
  • Vector operations may include one or more of multiplication, addition, subtraction, division, dot product, cross product, and the like.
  • the scalar processing unit 11 may comprise one or more first processing units.
  • Each first processing unit can process one SIMT instruction in each cycle, and one scalar instruction corresponds to one thread group, so that multiple scalar instructions can be executed in parallel, that is, scalar operations for multiple threads can be performed in parallel.
  • Vector processing unit 12 may include one or more second processing units. When the vector processing unit 12 includes multiple second processing units, if the SIMT instruction is a vector instruction, each second processing unit can process one SIMT instruction per cycle, so that the parallel execution of multiple vector instructions can be realized.
  • the number of threads corresponding to one vector instruction is the same as the number of threads processed by the second processing unit, so that parallel execution of multiple threads can be implemented in one second processing unit.
  • the number of first processing units included in the scalar processing unit 11 and the number of second processing units included in the vector processing unit 12 may be the same or different.
  • The first processing unit included in the scalar processing unit 11 may be an arithmetic logic unit (ALU), or may be another unit, which is not limited herein.
  • The second processing unit included in the vector processing unit 12 may be an ALU, a special function unit (SFU), a load store unit (LSU), or another unit, which is not limited herein.
  • FIG. 2 is a schematic structural diagram of another SIMT instruction processing apparatus provided by an embodiment of the present invention. Wherein, the SIMT instruction processing apparatus shown in FIG. 2 is obtained by optimizing the SIMT instruction processing apparatus shown in FIG. 1 .
  • the SIMT instruction processing apparatus may further include a scalar register group 13 for storing scalar data and a vector register group 14 for storing vector data.
  • the scalar register group 13 is respectively coupled to the scalar processing unit 11 and the vector processing unit 12
  • the vector register group 14 is coupled to the vector processing unit 12 .
  • the scalar register set 13 and the vector register set 14 may be two independent register sets. Both the scalar register set 13 and the vector register set 14 may include multiple sets of registers.
  • SIMT instructions can carry source addresses and operation types. After receiving the SIMT instruction, the scalar processing unit 11 may first obtain the operand from the register corresponding to the source address in the scalar register group 13, and then perform scalar operation on the obtained operand according to the operation type. The source address carried by the SIMT instruction received by the scalar processing unit 11 corresponds to a register in the scalar register group 13 .
  • each source address corresponds to a register in the scalar register group 13, and the registers in the scalar register group 13 can be accessed by the corresponding thread.
  • The threads corresponding to a register in the scalar register group 13 may be the threads of the warp (thread group) to which the register belongs.
  • the source address may include one address or multiple addresses, that is, the operand of the scalar instruction may be one or multiple, which is not limited herein.
  • After the vector processing unit 12 receives the SIMT instruction, in the case that the source address of the SIMT instruction points to the scalar register group 13, the operand can be obtained from the register corresponding to the source address in the scalar register group 13, and a vector operation can then be performed on the obtained operand according to the operation type.
  • In the case that the source address points to the vector register group 14, the operand can be obtained from the register corresponding to the source address in the vector register group 14, and a vector operation can then be performed on the obtained operand according to the operation type.
  • In the case that the source addresses point to both register groups, one operand can be obtained from the register corresponding to the source address in the scalar register group 13 and another operand from the register corresponding to the source address in the vector register group 14, and a vector operation can then be performed on the obtained operands according to the operation type.
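The operand fetch described above, in which the vector processing unit reads from either register group, can be sketched as follows. All names here (`RegisterFiles`, `fetch_operand`, `vector_execute`), the four-thread group size, and the `("s", index)` / `("v", index)` source encoding are assumptions made for illustration; the publication does not specify an encoding.

```python
from dataclasses import dataclass
from typing import List, Tuple

WARP_SIZE = 4  # assumed thread-group size, kept small for readability


@dataclass
class RegisterFiles:
    scalar: List[int]        # one value per register, shared by all threads
    vector: List[List[int]]  # one value per register for each thread


def fetch_operand(regs: RegisterFiles, source: Tuple[str, int]) -> List[int]:
    """Return one operand value per thread for a source of the form ("s"|"v", index)."""
    bank, index = source
    if bank == "s":
        # A scalar register is shared, so every thread sees the same value.
        return [regs.scalar[index]] * WARP_SIZE
    if bank == "v":
        return list(regs.vector[index])
    raise ValueError(f"unknown register group: {bank!r}")


def vector_execute(regs: RegisterFiles, op: str,
                   src_a: Tuple[str, int], src_b: Tuple[str, int]) -> List[int]:
    """Apply a two-operand vector operation across all threads."""
    a = fetch_operand(regs, src_a)
    b = fetch_operand(regs, src_b)
    if op == "add":
        return [x + y for x, y in zip(a, b)]
    if op == "mul":
        return [x * y for x, y in zip(a, b)]
    raise ValueError(f"unsupported operation type: {op!r}")


regs = RegisterFiles(scalar=[10], vector=[[1, 2, 3, 4]])
mixed = vector_execute(regs, "add", ("s", 0), ("v", 0))  # one scalar and one vector source
```

In this sketch the scalar operand is broadcast to every thread before the per-thread operation, which is one plausible reading of how a shared register feeds a vector operation.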
  • In the case that the source address of a SIMT instruction received by the vector processing unit 12 points to the vector register group 14, the source address carried by that SIMT instruction corresponds to multiple registers in the vector register group 14. It can be understood that each address may correspond to multiple registers in the vector register group 14, and each register in the vector register group 14 can be accessed only by the corresponding thread.
  • the number of registers in the vector register bank 14 is the same as the number of second processing units included in a set of vector processing units.
  • One SIMT instruction can carry multiple source addresses; each source address corresponds to one register in the scalar register group 13, so multiple source addresses correspond to multiple registers in the scalar register group 13.
  • the SIMT instruction processing apparatus may further include a crossbar module 15, and the crossbar module 15 may include multiple crossbars.
  • the scalar processing unit 11 is connected to the scalar register set 13 through the crossbar module 15 .
  • the vector processing unit 12 is connected to the scalar register set 13 and the vector register set 14 respectively through the crossbar module 15 .
  • the crossbar module 15 can ensure that the scalar processing unit 11 can access all registers in the scalar register set 13 and the vector processing unit 12 can access all the registers in the scalar register set 13 and the vector register set 14.
  • For example, the crossbar module 15 may include two crossbars: one crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13, and the other crossbar may be coupled to the vector processing unit 12, the scalar register group 13 and the vector register group 14.
  • Alternatively, the crossbar module 15 may include three crossbars: the first crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13, the second crossbar may be coupled to the vector processing unit 12 and the scalar register group 13, and the third crossbar may be coupled to the vector processing unit 12 and the vector register group 14.
  • The scalar processing unit 11 can obtain the operand from the register corresponding to the source address in the scalar register group 13 through the first crossbar.
  • The first crossbar forwards the read instruction from the scalar processing unit 11 to the scalar register group 13, the scalar register group 13 sends the operand in the register corresponding to the source address to the first crossbar, and the first crossbar forwards the operand to the scalar processing unit 11. The other cases are similar and are not repeated here.
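The read path through a crossbar can be modelled in software roughly as below. This is a behavioural sketch only: the class names, the `attach` and `forward_read` methods, and the register contents are hypothetical, and a hardware crossbar is a switching fabric rather than a dictionary lookup.

```python
class ScalarRegisterGroup:
    """Holds scalar register values addressed by index."""

    def __init__(self, values):
        self._values = list(values)

    def read(self, address: int) -> int:
        return self._values[address]


class Crossbar:
    """Forwards read requests to an attached register group and returns the operand."""

    def __init__(self):
        self._groups = {}

    def attach(self, name: str, group) -> None:
        self._groups[name] = group

    def forward_read(self, group_name: str, address: int) -> int:
        # Forward the read to the target register group, then hand the
        # operand it returns back to the requesting processing unit.
        return self._groups[group_name].read(address)


class ScalarProcessingUnit:
    """Reads its operands only through the crossbar it is connected to."""

    def __init__(self, crossbar: Crossbar):
        self._crossbar = crossbar

    def load_operand(self, source_address: int) -> int:
        return self._crossbar.forward_read("scalar", source_address)


first_crossbar = Crossbar()
first_crossbar.attach("scalar", ScalarRegisterGroup([10, 20, 30, 40]))
spu = ScalarProcessingUnit(first_crossbar)
operand = spu.load_operand(2)
```

Because every request goes through the crossbar, the processing unit can reach any register of the attached group, which is the property the text attributes to the crossbar module.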
  • the SIMT instruction processing apparatus may further include a control unit 16 .
  • the control unit 16 is coupled to the scalar processing unit 11 and the vector processing unit 12, respectively.
  • the control unit 16 is configured to determine the type of the SIMT instruction, the type including a scalar or a vector, and send the SIMT instruction to the scalar processing unit 11 or the vector processing unit 12 based on the type of the SIMT instruction.
  • the control unit 16 is configured to determine the type of the SIMT instruction according to the destination address carried by the SIMT instruction.
  • the control unit 16 may first determine the type of the SIMT instruction. In the case that the type is scalar, that is, the SIMT instruction is a scalar instruction, the control unit 16 may send the SIMT instruction to the scalar processing unit 11 . In the case that the type is a vector, that is, the SIMT instruction is a vector instruction, the control unit 16 may send the SIMT instruction to the vector processing unit 12 .
  • the SIMT instruction may also carry indication information, and the indication information may indicate the type of the SIMT instruction.
  • After receiving the SIMT instruction, the control unit 16 can determine the type of the SIMT instruction according to the indication information.
  • the indication information can be the destination address.
  • After the control unit 16 receives the SIMT instruction, it can first identify whether the destination address is the address of a register in the scalar register group 13 or the address of a register in the vector register group 14, that is, whether the destination address points to the scalar register group 13 or to the vector register group 14. When the destination address points to the scalar register group 13, the control unit 16 may send the SIMT instruction to the scalar processing unit 11. When the destination address points to the vector register group 14, the control unit 16 may send the SIMT instruction to the vector processing unit 12.
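Dispatch by destination address can be sketched as follows. The address ranges (0 to 63 for the scalar register group, 64 to 127 for the vector register group), the `Instruction` tuple and the function names are assumptions for illustration; the publication does not define an address map.

```python
from collections import namedtuple

# Hypothetical instruction format: operation type, destination address, source addresses.
Instruction = namedtuple("Instruction", ["opcode", "dest", "sources"])

# Assumed register address map (not taken from the publication).
SCALAR_REGISTER_ADDRESSES = range(0, 64)
VECTOR_REGISTER_ADDRESSES = range(64, 128)


def instruction_type(instr: Instruction) -> str:
    """Classify an instruction by the register group its destination address points to."""
    if instr.dest in SCALAR_REGISTER_ADDRESSES:
        return "scalar"
    if instr.dest in VECTOR_REGISTER_ADDRESSES:
        return "vector"
    raise ValueError(f"destination address {instr.dest} is outside both register groups")


def dispatch(instr: Instruction, scalar_unit_queue: list, vector_unit_queue: list) -> str:
    """Send the instruction to the queue of the matching processing unit."""
    kind = instruction_type(instr)
    target = scalar_unit_queue if kind == "scalar" else vector_unit_queue
    target.append(instr)
    return kind


scalar_queue, vector_queue = [], []
first = dispatch(Instruction("add", 3, (1, 2)), scalar_queue, vector_queue)
second = dispatch(Instruction("mul", 70, (3, 65)), scalar_queue, vector_queue)
```

Note that the second instruction reads a scalar source (address 3) but writes a vector destination, so it is classified as a vector instruction; only the destination decides the type in this scheme.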
  • The indication information can also be an indication bit or a flag bit. When the indication bit or flag bit has a first value, it can indicate that the SIMT instruction is a vector instruction, and when the indication bit or flag bit has a second value, it can indicate that the SIMT instruction is a scalar instruction.
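A minimal sketch of the indication-bit variant is shown below. The 32-bit instruction word, the choice of bit 31 as the indication bit, and the mapping of 1 to "vector" and 0 to "scalar" are all assumptions; the text only states that a first value indicates a vector instruction and a second value a scalar instruction.

```python
TYPE_BIT = 31  # assumed position of the indication bit in a 32-bit instruction word


def type_from_indication_bit(word: int) -> str:
    """Return "vector" when the indication bit is set and "scalar" otherwise."""
    return "vector" if (word >> TYPE_BIT) & 1 else "scalar"


vector_word = 0x8000_0001  # indication bit set
scalar_word = 0x0000_0001  # indication bit clear
```

Compared with decoding the destination address, this costs one bit of encoding space but lets the control unit classify an instruction without knowing the register address map.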
  • The indication information can also be an indication field. When the indication field is in the first state, it can indicate that the SIMT instruction is a vector instruction, and when the indication field is in the second state, it can indicate that the SIMT instruction is a scalar instruction.
  • The indication information may also be an indicator. When the indicator is in the third state, it may indicate that the SIMT instruction is a vector instruction, and when the indicator is in the fourth state, it may indicate that the SIMT instruction is a scalar instruction.
  • The SIMT instruction may also indicate its type in other ways. For example, if the SIMT instruction carries certain information, this may indicate that the SIMT instruction is a scalar instruction, and if the SIMT instruction does not carry such information, this may indicate that the SIMT instruction is a vector instruction, or vice versa.
  • the operation result can be stored in the register corresponding to the destination address, so that subsequent calls can be made directly according to the destination address.
  • the SIMT instruction processing apparatus may further include a scalar scheduling unit 17 and a vector scheduling unit 18 .
  • the scalar scheduling unit 17 is coupled to the scalar processing unit 11
  • the vector scheduling unit 18 is coupled to the vector processing unit 12 .
  • the scalar scheduling unit 17 is configured to schedule SIMT instructions of scalar type to the scalar processing unit 11 .
  • the vector scheduling unit 18 is used for scheduling the SIMT instruction of the vector type to the vector processing unit 12 .
  • When there is no idle processing unit in the scalar processing unit 11 or the vector processing unit 12, a SIMT instruction sent by the control unit 16 to that unit cannot be processed. Therefore, the control unit 16 can send scalar-type SIMT instructions to the scalar scheduling unit 17, so that the scalar scheduling unit 17 schedules the scalar instructions uniformly, and can send vector-type SIMT instructions to the vector scheduling unit 18, so that the vector scheduling unit 18 schedules the vector instructions uniformly.
  • Scheduling can follow a first-in, first-out principle, or a priority principle in which instructions with higher priority are executed first. It can also be based on resource occupancy or on other principles, which is not limited herein.
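The first-in, first-out and priority policies mentioned above can be sketched with standard Python containers. The class names and the `submit` / `pop_next` interface are hypothetical, and breaking priority ties in arrival order is an added assumption rather than something the text specifies.

```python
import heapq
from collections import deque
from itertools import count


class FifoScheduler:
    """Issues instructions in arrival order."""

    def __init__(self):
        self._queue = deque()

    def submit(self, instr, priority: int = 0) -> None:
        self._queue.append(instr)  # priority is ignored under FIFO

    def pop_next(self):
        return self._queue.popleft() if self._queue else None


class PriorityScheduler:
    """Issues the highest-priority instruction first; ties keep arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()

    def submit(self, instr, priority: int = 0) -> None:
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), instr))

    def pop_next(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


fifo = FifoScheduler()
prio = PriorityScheduler()
for name, level in [("a", 1), ("b", 5), ("c", 5)]:
    fifo.submit(name, level)
    prio.submit(name, level)

fifo_order = [fifo.pop_next() for _ in range(3)]
prio_order = [prio.pop_next() for _ in range(3)]
```

A scalar scheduling unit and a vector scheduling unit could each hold one such queue and release an instruction only when its processing unit has an idle slot; that gating is omitted here.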
  • the multiple threads correspond to the same base address and to different offset addresses
  • the scalar processing unit 11 is configured to operate on the data of the base address to obtain the first operation result
  • the vector processing unit 12 is configured to operate on the data of the offset address to obtain the second operation result.
  • the registers in the scalar register group 13 store data corresponding to the base addresses of multiple threads.
  • the scalar processing unit 11 may calculate the data corresponding to the base addresses of the multiple threads, and store the obtained first operation result in the scalar register group 13 for subsequent calls.
  • the registers in the vector register group 14 store data corresponding to the offset addresses of the multiple threads.
  • the vector processing unit 12 may calculate the data corresponding to the offset addresses of the multiple threads, and store the obtained second operation result in the vector register group 14 for subsequent calling.
  • the scalar processing unit 11 can perform a scalar operation according to the SIMT instruction to obtain the first operation result, and then store the first operation result in the scalar register group 13.
  • the vector processing unit 12 may perform a vector operation according to the SIMT instruction to obtain a second operation result, and then store the second operation result in the vector register group 14 .
  • the SIMT instruction carries the storage address of the first operation result and the storage address of the second operation result, and the vector processing unit 12 can obtain the first operation result from the storage address of the first operation result, obtain the second operation result from the storage address of the second operation result, and perform a vector operation on the first operation result and the second operation result to obtain the third operation result.
  • the third operation result is the data operation result corresponding to the base address+offset address.
  • the scalar processing unit 11 is configured to obtain a first SIMT instruction carrying a base address, perform an operation based on data corresponding to the base address to obtain a first operation result, and store the first operation result in the scalar register group 13 .
  • the vector processing unit 12 is used to obtain the second SIMT instruction carrying the offset address, perform an operation based on the data corresponding to the offset address to obtain the second operation result, and store the second operation result in the vector register group 14; and to operate on the first operation result stored in the scalar register group 13 and the second operation result stored in the vector register group 14 to obtain the third operation result as the task processing result.
  • the scalar processing unit 11 may perform a scalar operation according to the SIMT instruction to obtain a first operation result, and then store the first operation result in the scalar register group 13 .
  • the vector processing unit 12 may first obtain data from the registers in the vector register group 14 corresponding to the offset address and perform a vector operation to obtain the second operation result. After the second operation result is obtained, the first operation result is obtained from the storage address of the first operation result, and the third operation result is obtained by performing a vector operation on the first operation result and the second operation result.
  • FIG. 3 is a schematic structural diagram of another SIMT instruction processing apparatus provided by an embodiment of the present invention.
  • the SIMT instruction processing device can support up to 2048 threads. If organized as 32 threads per warp, there are 64 warps in total, and the 64 warps can be divided into 8 banks (groups).
  • the scalar processing unit includes 4 first processing units, and the vector processing unit includes 4 second processing units, each of which supports 32 threads.
  • a scalar register bank can include 8 banks, and each bank can include 128 scalar registers.
  • the vector register bank can include 8 banks, and each bank can include 128 32-thread vector registers. All registers in a bank can be shared by warps in this bank, and registers in a bank can also be divided according to warp, and the registers corresponding to each warp can only be shared by threads in this warp.
  • Each processing unit requires at least one operand for processing, so there can be an 8x8 crossbar between the scalar processing unit and the scalar register bank to ensure that the scalar processing unit can access the scalar registers of all banks in the scalar register bank.
  • the scalar processing unit can receive 4 SIMT instructions from the scalar scheduling unit, and the vector processing unit can receive 4 SIMT instructions from the vector scheduling unit. Since only one processing unit can access a given bank at a time, to ensure that multiple processing units access different banks at the same time, the SIMT instructions scheduled by the scalar scheduling unit and the vector scheduling unit in two consecutive clock cycles have opposite parity.
  • the control unit may schedule SIMT instructions accessing respective bank0, bank2, bank4 and bank6 through the scalar/vector scheduling unit, and the first/second processing unit may read operand 0 of the respective even-numbered banks.
  • the control unit may schedule SIMT instructions accessing respective bank1, bank3, bank5, and bank7 through the scalar/vector scheduling unit, and the first/second processing unit may read operand 1 of the respective even-numbered banks and operand 0 of the respective odd-numbered banks.
  • the first/second processing unit may read operand 2 of the respective odd bank.
  • As long as the scheduling ensures that parity alternates between adjacent moments and that the banks scheduled at the same moment do not conflict, maximum access efficiency without crossbar conflicts can be guaranteed.
  • the 8 operand read interfaces of the 4 first/second processing units respectively access the respective bank0-bank7.
  • the processing unit can read two operands from the same bank in two consecutive beats, that is, two cycles. Because the instructions issued in two consecutive beats have opposite warp parity, and therefore opposite bank parity, there is no conflict, and up to 8 processing-unit read interfaces can access the 8 banks at the same time.
  • FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention.
  • the SIMT instruction processing method may be applied to the SIMT instruction processing apparatus shown in FIG. 1 to FIG. 3 .
  • the SIMT instruction processing method may include the following steps.
  • the type of the SIMT instruction is determined by the control unit according to the destination address carried by the SIMT instruction.
  • obtain the first SIMT instruction through the scalar processing unit, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain the first operation result, and store the first operation result in the scalar register group; obtain the second SIMT instruction through the vector processing unit, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain the second operation result, and store the second operation result in the vector register group; and operate, through the vector processing unit, on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain the third operation result as the task processing result.
  • the SIMT instruction processing method may be a combination of all or some of step 401 to step 404, which is not limited herein.
  • a system-on-chip is provided, and the system-on-chip may include the SIMT instruction processing apparatus provided in the above embodiments.
  • the system-on-chip can be composed of a SIMT instruction processing device, and can also include a SIMT instruction processing device and other discrete devices.
  • an electronic device including the SIMT instruction processing apparatus provided in the above-mentioned embodiments and a discrete device coupled to the SIMT instruction processing apparatus.
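
The bank-parity rule described above for the embodiment of FIG. 3 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the mapping of a warp to a bank (`warp_id % 8`), the function names, and the two-operand read pattern (operand 0 in the issue cycle, operand 1 in the next cycle, both from the same bank) are illustrative choices, not details confirmed by the text.

```python
from collections import defaultdict

NUM_BANKS = 8          # from the text: 8 register banks
ISSUE_WIDTH = 4        # from the text: 4 instructions received per cycle


def bank_of(warp_id: int) -> int:
    """Assumed warp-to-bank mapping; the text does not specify one."""
    return warp_id % NUM_BANKS


def schedule(pending_warps, num_cycles):
    """Issue warps so that the bank parity alternates every cycle.

    Returns {cycle: [(warp_id, bank, operand_index), ...]} listing every
    register-bank read. Each issued instruction reads operand 0 in its
    issue cycle and operand 1 in the following cycle from the same bank.
    """
    pending = list(pending_warps)
    reads = defaultdict(list)
    for cycle in range(num_cycles):
        parity = cycle % 2
        used_banks = set()
        issued = []
        for warp in pending:
            bank = bank_of(warp)
            # Only banks of the current parity, and each bank at most once.
            if bank % 2 == parity and bank not in used_banks:
                used_banks.add(bank)
                issued.append(warp)
                if len(issued) == ISSUE_WIDTH:
                    break
        for warp in issued:
            pending.remove(warp)
            reads[cycle].append((warp, bank_of(warp), 0))
            reads[cycle + 1].append((warp, bank_of(warp), 1))
    return dict(reads)


def has_conflict(reads) -> bool:
    """Return True if any bank is read more than once in the same cycle."""
    for accesses in reads.values():
        banks = [bank for _, bank, _ in accesses]
        if len(banks) != len(set(banks)):
            return True
    return False


reads = schedule(range(16), num_cycles=4)
print(has_conflict(reads))
```

In this model, from the second cycle onward four instructions read operand 1 from banks of one parity while four newly issued instructions read operand 0 from banks of the other parity, so all 8 read interfaces are busy without two of them touching the same bank.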

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

Provided in the embodiments of the present invention are a single instruction multiple threads (SIMT) instruction processing method and device. The device comprises a scalar processing unit and a vector processing unit, wherein the scalar processing unit is configured to perform scalar operation according to a scalar-type SIMT instruction; and the vector processing unit is configured to perform vector operation according to a vector-type SIMT instruction. According to the embodiments of the present invention, the processing efficiency can be improved.

Description

SIMT instruction processing method and device
Cross-Reference Statement
The present invention claims priority to Chinese Patent Application No. 202011452846.0, filed with the Chinese Patent Office on December 11, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer technology, and in particular to a single instruction multiple threads (SIMT) instruction processing method and device.
Background
In parallel computing, the SIMT architecture offers greater flexibility and higher efficiency than the simultaneous multithreading (SMT) architecture, and can achieve higher throughput by running a large number of threads in parallel. The SIMT architecture is therefore widely used in high-performance processors.
In parallel computing, there are a large number of scalar operations that operate only on a single thread, for example on a base address. How to improve instruction processing efficiency is a problem to be solved.
Summary of the Invention
Embodiments of the present invention provide a SIMT instruction processing method and device for improving processing efficiency.
A first aspect provides a SIMT instruction processing device, including a scalar processing unit and a vector processing unit, wherein:
the scalar processing unit is configured to perform scalar operations according to scalar-type SIMT instructions;
the vector processing unit is configured to perform vector operations according to vector-type SIMT instructions.
In the SIMT instruction processing device provided by the embodiments of the present invention, the scalar processing unit can perform scalar operations for scalar-type SIMT instructions, and the vector processing unit can perform vector operations for vector-type SIMT instructions. Because vector operations and scalar operations are handled by different processing units, scalar operations do not affect vector operations, which improves the processing efficiency of vector operations. In addition, because vector operations and scalar operations do not affect each other and can be performed at the same time, the overall instruction processing efficiency can be improved.
As a possible implementation, the device further includes a scalar register group for storing scalar data and a vector register group for storing vector data, wherein:
the scalar register group is coupled to the scalar processing unit and to the vector processing unit, and the vector register group is coupled to the vector processing unit.
The SIMT instruction processing device provided by the embodiments of the present invention includes a scalar register group and a vector register group. Information stored in a register of the vector register group can be accessed only by the corresponding thread, whereas information stored in the scalar register group is shared by multiple threads and can be accessed by multiple threads. Because one register in the scalar register group can correspond to multiple threads, the number of registers can be reduced. In addition, because the information stored in a scalar register can be shared by multiple threads, the same information need not be stored repeatedly, which reduces the amount of information stored in registers and saves storage resources.
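
The storage saving described above can be made concrete with a short calculation. The sketch below assumes 32 threads per warp and a 4-byte register word; both values and the function names are illustrative assumptions, not figures stated in this paragraph.

```python
THREADS_PER_WARP = 32  # assumed warp size
WORD_BYTES = 4         # assumed register width; not stated in the text


def per_thread_storage(num_values: int) -> int:
    """Bytes needed when every thread keeps its own copy of each value."""
    return num_values * THREADS_PER_WARP * WORD_BYTES


def shared_storage(num_values: int) -> int:
    """Bytes needed when each value is stored once in a shared scalar register."""
    return num_values * WORD_BYTES


# Ten warp-wide constants (for example base addresses) stored both ways.
print(per_thread_storage(10), shared_storage(10))
```

Under these assumptions a value that is identical across a warp takes 1/32 of the register storage when it is held once in a scalar register.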
As a possible implementation, the device further includes a crossbar module, the crossbar module including multiple crossbars, wherein:
the scalar processing unit is connected to the scalar register group through the crossbar module;
the vector processing unit is connected to the scalar register group and to the vector register group through the crossbar module.
In the SIMT instruction processing device provided by the embodiments of the present invention, the scalar register group and the scalar processing unit are connected through the crossbar module, which ensures that the scalar processing unit can access all registers in the scalar register group. The vector processing unit is connected to the scalar register group and to the vector register group through the crossbar module, which ensures that the vector processing unit can access all registers in both the scalar register group and the vector register group.
As a possible implementation, the device further includes a control unit, wherein:
the control unit is coupled to the scalar processing unit and to the vector processing unit;
the control unit is configured to determine the type of a SIMT instruction and, based on that type, send the SIMT instruction to the scalar processing unit or to the vector processing unit, where the type is scalar or vector.
In the SIMT instruction processing device provided by the embodiments of the present invention, the control unit can distribute SIMT instructions of different types to different processing units, so that vector operations and scalar operations are handled separately by different processing units. Scalar operations therefore do not affect vector operations, which improves the processing efficiency of vector operations.
As a possible implementation, the control unit is configured to determine the type of the SIMT instruction according to indication information carried by the SIMT instruction, where the indication information includes a destination address, an indication bit, an indication field, or an indicator.
In the SIMT instruction processing device provided by the embodiments of the present invention, the control unit can determine the type of a SIMT instruction according to the indication information carried by the instruction. When the indication information is the destination address, the destination address not only points to the storage address of the operation result but also determines the type of the SIMT instruction. The SIMT instruction therefore does not need to carry additional dedicated information to indicate its type, which reduces the information carried by the instruction, improves instruction transmission efficiency, and saves transmission resources.
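
The idea of inferring the instruction type from the destination address can be sketched as follows. The address map (scalar registers at 0 to 1023, vector registers at 1024 to 2047), the `SimtInstruction` fields, and the function names are hypothetical and introduced only for illustration; the text does not define an address layout or instruction encoding.

```python
from dataclasses import dataclass
from enum import Enum


class InstrType(Enum):
    SCALAR = "scalar"
    VECTOR = "vector"


# Hypothetical address map (not from the source).
SCALAR_REGS = range(0, 1024)
VECTOR_REGS = range(1024, 2048)


@dataclass(frozen=True)
class SimtInstruction:
    opcode: str
    dest: int            # destination register address
    sources: tuple = ()  # source register addresses


def classify(instr: SimtInstruction) -> InstrType:
    """Derive the instruction type from the destination address alone."""
    if instr.dest in SCALAR_REGS:
        return InstrType.SCALAR
    if instr.dest in VECTOR_REGS:
        return InstrType.VECTOR
    raise ValueError(f"destination {instr.dest} is outside both register groups")


def dispatch(instr: SimtInstruction, scalar_queue: list, vector_queue: list) -> InstrType:
    """Model of the control unit: route the instruction by its type."""
    kind = classify(instr)
    (scalar_queue if kind is InstrType.SCALAR else vector_queue).append(instr)
    return kind


scalar_q, vector_q = [], []
dispatch(SimtInstruction("add", dest=5, sources=(1, 2)), scalar_q, vector_q)
dispatch(SimtInstruction("mul", dest=1030, sources=(1, 1025)), scalar_q, vector_q)
print(len(scalar_q), len(vector_q))
```

Note that the second instruction reads one scalar source and one vector source but is still routed by its destination, which matches the statement that no extra type field is needed when the destination address serves as the indication information.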
As a possible implementation, the device further includes a scalar scheduling unit and a vector scheduling unit, wherein:
the scalar scheduling unit is coupled to the scalar processing unit, and the vector scheduling unit is coupled to the vector processing unit;
the scalar scheduling unit is configured to schedule scalar-type SIMT instructions to the scalar processing unit;
the vector scheduling unit is configured to schedule vector-type SIMT instructions to the vector processing unit.
In the SIMT instruction processing device provided by the embodiments of the present invention, each scheduling unit can schedule the corresponding SIMT instructions according to the state of its processing unit, so that SIMT instructions are executed in an orderly manner.
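
A scheduling unit of this kind can be sketched as a queue that releases instructions only when processing units are idle. The sketch below assumes a first-in, first-out policy, which is only one of the policies the document allows; the class and method names are hypothetical.

```python
from collections import deque


class FifoSchedulingUnit:
    """Holds instructions until the attached processing unit has idle slots."""

    def __init__(self, num_units: int):
        self.num_units = num_units  # processing units behind this scheduler
        self.queue = deque()

    def submit(self, instr) -> None:
        """Called by the control unit for every instruction of this type."""
        self.queue.append(instr)

    def issue(self, idle_units: int) -> list:
        """Release at most `idle_units` instructions, oldest first."""
        count = min(idle_units, self.num_units, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]


scheduler = FifoSchedulingUnit(num_units=4)
for name in ["i0", "i1", "i2", "i3", "i4", "i5"]:
    scheduler.submit(name)

first_batch = scheduler.issue(idle_units=4)   # four idle units
stalled = scheduler.issue(idle_units=0)       # nothing idle, nothing issued
second_batch = scheduler.issue(idle_units=4)  # remaining instructions
print(first_batch, stalled, second_batch)
```

A priority-based or resource-based policy could be substituted by replacing the deque with a structure ordered by the chosen criterion.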
As a possible implementation, when multiple threads execute the same task in parallel, the multiple threads correspond to the same base address and to different offset addresses. The scalar processing unit is configured to perform a scalar operation on the data corresponding to the base address to obtain a first operation result, and the vector processing unit is configured to perform a vector operation on the data corresponding to the offset addresses to obtain a second operation result.
As a possible implementation, the scalar processing unit is configured to obtain a first SIMT instruction, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;
the vector processing unit is configured to obtain a second SIMT instruction, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and to operate on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
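
The split between the base-address computation and the offset computation can be illustrated with a small model. The concrete operations (a block base plus a block offset for the scalar part, a per-lane element offset for the vector part, and an addition to combine them) are hypothetical examples chosen for illustration; the text does not fix which operations are performed.

```python
THREADS = 32  # assumed number of threads executing the task in parallel


def scalar_unit(block_base: int, block_index: int, block_bytes: int) -> int:
    """First operation result: computed once and shared by all threads."""
    return block_base + block_index * block_bytes


def vector_unit_offsets(element_bytes: int) -> list:
    """Second operation result: one offset per thread."""
    return [lane * element_bytes for lane in range(THREADS)]


def vector_unit_combine(first: int, second: list) -> list:
    """Third operation result: per-thread combination of base and offset."""
    return [first + offset for offset in second]


first = scalar_unit(block_base=0x1000, block_index=2, block_bytes=128)
second = vector_unit_offsets(element_bytes=4)
third = vector_unit_combine(first, second)
print(third[0], third[-1])
```

In this model the base computation runs once rather than once per thread, which is the source of the efficiency gain the embodiment describes.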
A second aspect provides a SIMT instruction processing method applied to a SIMT instruction processing device that includes a scalar processing unit and a vector processing unit, the method including:
performing, by the scalar processing unit, scalar operations according to scalar-type SIMT instructions;
performing, by the vector processing unit, vector operations according to vector-type SIMT instructions.
As a possible implementation, the device further includes a scalar register group and a vector register group, and the method further includes:
storing scalar data in the scalar register group;
storing vector data in the vector register group.
As a possible implementation, the device further includes a control unit, and the method further includes:
determining, by the control unit, the type of the SIMT instruction and, based on that type, sending the SIMT instruction to the scalar processing unit or to the vector processing unit, where the type is scalar or vector.
As a possible implementation, the method further includes:
determining, by the control unit, the type of the SIMT instruction according to indication information carried by the SIMT instruction, where the indication information includes a destination address, an indication bit, an indication field, or an indicator.
As a possible implementation, the device further includes a scalar scheduling unit and a vector scheduling unit, and the method further includes:
scheduling, by the scalar scheduling unit, scalar-type SIMT instructions to the scalar processing unit;
scheduling, by the vector scheduling unit, vector-type SIMT instructions to the vector processing unit.
As a possible implementation, the method further includes:
obtaining, by the scalar processing unit, a first SIMT instruction, performing an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and storing the first operation result in the scalar register group;
obtaining, by the vector processing unit, a second SIMT instruction, performing an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and storing the second operation result in the vector register group; and operating, by the vector processing unit, on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
A third aspect provides a system-on-chip that integrates the device provided by the first aspect or by any possible implementation of the first aspect. The system-on-chip may consist of the SIMT instruction processing device alone, or may include the SIMT instruction processing device and other discrete devices.
A fourth aspect provides an electronic device, including the SIMT instruction processing device provided by the first aspect or by any possible implementation of the first aspect, and a discrete device coupled to the SIMT instruction processing device.
Description of the Drawings
FIG. 1 is a schematic structural diagram of a SIMT instruction processing device provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of yet another SIMT instruction processing device provided by an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention.
Detailed Description
Embodiments of the present invention provide a SIMT instruction processing method and device for improving processing efficiency. Each is described in detail below.
To better understand the SIMT instruction processing method and device provided by the embodiments of the present invention, the applicable application scenario is described first. In parallel computing, there are a large number of scalar operations that operate only on a single thread, for example on a base address. In the SIMD (Single Instruction Multiple Data) architecture, a scalar coprocessor is often used to process scalar operations in order to improve instruction processing efficiency. In the SIMT architecture, however, instruction scheduling is more difficult. One possible way to address this is that, in the SIMT architecture, SIMT instructions are issued through the same instruction port regardless of whether they are scalar or vector instructions, and a vector processor performs the operations. For example, when a SIMT instruction is of scalar type, some of the processing units in the vector processor can be used to perform the operation. However, because some processing units in the vector processor are then occupied by scalar-type instructions, fewer processing units remain for vector-type instructions, which reduces the processing efficiency of vector instructions.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a SIMT instruction processing device provided by an embodiment of the present invention. As shown in FIG. 1, the SIMT instruction processing device may include a scalar processing unit 11 and a vector processing unit 12.
The scalar processing unit 11 is configured to perform scalar operations according to scalar-type SIMT instructions.
The vector processing unit 12 is configured to perform vector operations according to vector-type SIMT instructions.
When the type of a SIMT instruction is scalar, that is, when the SIMT instruction is a scalar instruction, the scalar processing unit 11 can execute the SIMT instruction, that is, perform a scalar operation. When the type of a SIMT instruction is vector, that is, when the SIMT instruction is a vector instruction, the vector processing unit 12 can execute the SIMT instruction, that is, perform a vector operation. A scalar is a quantity that is not a vector, that is, a quantity with only magnitude and no direction. A scalar operation can be one or more of multiplication, addition, subtraction, division, and the like. A vector is a quantity with both magnitude and direction. A vector operation can include one or more of multiplication, addition, subtraction, division, dot product, cross product, and the like.
The scalar processing unit 11 may include one or more first processing units. When the scalar processing unit 11 includes multiple first processing units and the SIMT instructions are scalar instructions, each first processing unit can process one SIMT instruction per cycle, and one scalar instruction corresponds to one thread group, so that multiple scalar instructions can run in parallel, that is, scalar operations for multiple threads can run in parallel. The vector processing unit 12 may include one or more second processing units. When the vector processing unit 12 includes multiple second processing units and the SIMT instructions are vector instructions, each second processing unit can process one SIMT instruction per cycle, so that multiple vector instructions can run in parallel. In addition, the number of threads corresponding to one vector instruction is the same as the number of threads processed by a second processing unit, so that multiple threads can execute in parallel within one second processing unit. The number of first processing units included in the scalar processing unit 11 and the number of second processing units included in the vector processing unit 12 may be the same or different. A first processing unit included in the scalar processing unit 11 may be an arithmetic logic unit (ALU) or another unit, which is not limited herein. A second processing unit included in the vector processing unit 12 may be an ALU, a special function unit (SFU), a load/store unit (LSU), or another unit, which is not limited herein.
Referring to FIG. 2, FIG. 2 is a schematic structural diagram of another SIMT instruction processing device provided by an embodiment of the present invention. The SIMT instruction processing device shown in FIG. 2 is obtained by optimizing the SIMT instruction processing device shown in FIG. 1.
In one embodiment, the SIMT instruction processing device may further include a scalar register group 13 for storing scalar data and a vector register group 14 for storing vector data.
The scalar register group 13 is coupled to the scalar processing unit 11 and to the vector processing unit 12, and the vector register group 14 is coupled to the vector processing unit 12.
The scalar register group 13 and the vector register group 14 may be two independent register groups, and each may include multiple groups of registers. A SIMT instruction can carry a source address and an operation type. After receiving a SIMT instruction, the scalar processing unit 11 can first obtain the operand from the register in the scalar register group 13 that corresponds to the source address, and then perform a scalar operation on the obtained operand according to the operation type. The source address carried by a SIMT instruction received by the scalar processing unit 11 corresponds to one register in the scalar register group 13. In other words, each source address corresponds to one register in the scalar register group 13, and a register in the scalar register group 13 can be accessed by its corresponding threads, which may be the threads of the warp to which the register belongs. The source address may include one address or multiple addresses, that is, a scalar instruction may have one operand or multiple operands, which is not limited herein.
After the vector processing unit 12 receives a SIMT instruction, where the source address of the SIMT instruction points to the scalar register group 13, the vector processing unit 12 may first obtain the operand from the register corresponding to the source address in the scalar register group 13, and then perform a vector operation on the obtained operand according to the operation type. Where the source address of the SIMT instruction points to the vector register group 14, the vector processing unit 12 may first obtain the operand from the register corresponding to the source address in the vector register group 14, and then perform a vector operation on the obtained operand according to the operation type. Where the source addresses of the SIMT instruction point to both the scalar register group 13 and the vector register group 14, the vector processing unit 12 may obtain operands from the corresponding registers in the scalar register group 13 and from the corresponding registers in the vector register group 14, and then perform a vector operation on the obtained operands according to the operation type. Where the source address of a SIMT instruction received by the vector processing unit 12 points to the vector register group 14, in one case the source address carried by the SIMT instruction corresponds to multiple registers in the vector register group 14. That is, each address may correspond to multiple registers in the vector register group 14, and each register in the vector register group 14 can be accessed only by its corresponding thread. The number of registers in the vector register group 14 is the same as the number of second processing units included in one group of vector processing units. In another case, where the source address of the SIMT instruction points to the scalar register group 13, one SIMT instruction may carry multiple source addresses, each source address corresponding to one register in the scalar register group 13, so that the multiple source addresses correspond to multiple registers in the scalar register group 13.
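As a non-limiting illustration, the operand fetch of the vector processing unit can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the `('s', index)` / `('v', index)` source encoding, the class and function names, the register counts, and the choice to broadcast a scalar operand to every lane are all hypothetical and are not taken from the embodiment.

```python
# Illustrative model only; names, sizes and the source encoding are assumptions.
WARP_SIZE = 32  # threads per warp (assumed)


class RegisterGroups:
    def __init__(self, num_scalar, num_vector):
        # One value per scalar register, shared by all threads of the warp.
        self.scalar = [0] * num_scalar
        # One value per thread (lane) for each vector register.
        self.vector = [[0] * WARP_SIZE for _ in range(num_vector)]

    def read(self, src):
        """Fetch one operand; src is a hypothetical ('s', i) or ('v', i) pair."""
        kind, index = src
        if kind == "s":
            # Assumption: a scalar operand is broadcast to every lane.
            return [self.scalar[index]] * WARP_SIZE
        if kind == "v":
            return list(self.vector[index])
        raise ValueError(f"unknown register group: {kind!r}")


def vector_execute(regs, op, sources):
    """Apply op lane by lane to operands drawn from either register group."""
    operands = [regs.read(src) for src in sources]
    return [op(*lane_values) for lane_values in zip(*operands)]


regs = RegisterGroups(num_scalar=4, num_vector=4)
regs.scalar[0] = 100
regs.vector[1] = list(range(WARP_SIZE))
# Mixed case: one source in the scalar group, one in the vector group.
result = vector_execute(regs, lambda a, b: a + b, [("s", 0), ("v", 1)])
```

In this model the mixed case (one scalar source, one vector source) needs no special handling, because every operand is expanded to one value per lane before the operation is applied.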
In one embodiment, the SIMT instruction processing apparatus may further include a crossbar module 15, and the crossbar module 15 may include multiple crossbars.
The scalar processing unit 11 is connected to the scalar register group 13 through the crossbar module 15.
The vector processing unit 12 is connected to the scalar register group 13 and to the vector register group 14 through the crossbar module 15.
The crossbar module 15 ensures that the scalar processing unit 11 can access all registers in the scalar register group 13, and that the vector processing unit 12 can access all registers in both the scalar register group 13 and the vector register group 14. The crossbar module 15 may include multiple crossbars. For example, the crossbar module 15 may include two crossbars: one crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13, and the other crossbar may be coupled to the vector processing unit 12, the scalar register group 13 and the vector register group 14. As another example, the crossbar module 15 may include three crossbars: the first crossbar may be coupled to the scalar processing unit 11 and the scalar register group 13, the second crossbar may be coupled to the vector processing unit 12 and the scalar register group 13, and the third crossbar may be coupled to the vector processing unit 12 and the vector register group 14. When the crossbar module 15 includes three crossbars, the scalar processing unit 11 may obtain the operand from the register corresponding to the source address in the scalar register group 13 through the first crossbar. This can be understood as follows: the scalar processing unit 11 sends a read instruction carrying the source address to the first crossbar; the first crossbar forwards the read instruction to the scalar register group 13; the scalar register group 13 sends the operand held in the register corresponding to the source address to the first crossbar; and the first crossbar forwards the operand to the scalar processing unit 11. The other cases are similar and are not repeated here.
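The read round trip through the first crossbar can be sketched as follows. The class names, the dictionary-backed register group and the request log are hypothetical and serve only to make the forwarding order visible; a hardware crossbar would of course route many requests in parallel.

```python
class ScalarRegisterGroup:
    """Hypothetical stand-in for scalar register group 13."""

    def __init__(self, values):
        self.values = dict(values)  # source address -> stored operand

    def read(self, address):
        return self.values[address]


class Crossbar:
    """Forwards a read request to a register group and returns the operand."""

    def __init__(self, register_group):
        self.register_group = register_group
        self.log = []  # (requester, address) pairs, in arrival order

    def forward_read(self, requester, address):
        self.log.append((requester, address))
        # Forward the request, then hand the operand back to the requester.
        return self.register_group.read(address)


scalar_regs = ScalarRegisterGroup({0x10: 7, 0x11: 9})
first_crossbar = Crossbar(scalar_regs)
operand = first_crossbar.forward_read("scalar_unit_11", 0x10)
```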
In one embodiment, the SIMT instruction processing apparatus may further include a control unit 16.
The control unit 16 is coupled to both the scalar processing unit 11 and the vector processing unit 12.
The control unit 16 is configured to determine the type of a SIMT instruction, the type being scalar or vector, and to send the SIMT instruction to the scalar processing unit 11 or the vector processing unit 12 based on that type.
In one embodiment, the control unit 16 is configured to determine the type of the SIMT instruction according to the destination address carried by the SIMT instruction.
After receiving a SIMT instruction, the control unit 16 may first determine its type. Where the type is scalar, that is, the SIMT instruction is a scalar instruction, the control unit 16 may send the SIMT instruction to the scalar processing unit 11. Where the type is vector, that is, the SIMT instruction is a vector instruction, the control unit 16 may send the SIMT instruction to the vector processing unit 12.
The SIMT instruction may also carry indication information that indicates the type of the SIMT instruction. After receiving the SIMT instruction, the control unit 16 may determine its type according to the indication information.
The indication information may be a destination address. After receiving the SIMT instruction, the control unit 16 may first identify whether the destination address is the address of a register in the scalar register group 13 or the address of a register in the vector register group 14, that is, whether the destination address points to the scalar register group 13 or to the vector register group 14. When the destination address points to the scalar register group 13, the control unit 16 may send the SIMT instruction to the scalar processing unit 11. When the destination address points to the vector register group 14, the control unit 16 may send the SIMT instruction to the vector processing unit 12.
The indication information may also be an indication bit or flag bit. When the bit has a first value, it may indicate that the SIMT instruction is a vector instruction; when the bit has a second value, it may indicate that the SIMT instruction is a scalar instruction.
The indication information may also be an indication field. When the field is in a first state, it may indicate that the SIMT instruction is a vector instruction; when the field is in a second state, it may indicate that the SIMT instruction is a scalar instruction.
The indication information may also be an indicator. When the indicator is in a third state, it may indicate that the SIMT instruction is a vector instruction; when the indicator is in a fourth state, it may indicate that the SIMT instruction is a scalar instruction.
The type of the SIMT instruction may also be indicated in other ways. For example, a SIMT instruction that carries a specified flag field or identifier may be a scalar instruction, and a SIMT instruction that does not carry such information may be a vector instruction, or vice versa.
It should be understood that the above explanation of the indication information is only exemplary and does not limit the indication information.
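Two of the options above (a destination address and an indication bit) can be sketched as a small classification routine. The address map, the field names and the convention that a set bit means "vector" are assumptions made for illustration only; the embodiment does not fix any of them.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical address map: scalar registers occupy [0, 1024),
# vector registers occupy [1024, 2048).
SCALAR_BASE, VECTOR_BASE, VECTOR_END = 0, 1024, 2048


@dataclass
class SimtInstruction:
    opcode: str
    dest: int
    # Optional indication bit; None means the instruction does not carry one.
    vector_flag: Optional[bool] = None


def instruction_type(instr):
    """Return 'scalar' or 'vector' for an instruction."""
    if instr.vector_flag is not None:
        # Assumed convention: True is the "first value" (vector).
        return "vector" if instr.vector_flag else "scalar"
    if SCALAR_BASE <= instr.dest < VECTOR_BASE:
        return "scalar"
    if VECTOR_BASE <= instr.dest < VECTOR_END:
        return "vector"
    raise ValueError(f"destination {instr.dest} is outside both register groups")


def dispatch(instr, scalar_inbox, vector_inbox):
    """Route the instruction to the inbox of the matching processing unit."""
    target = scalar_inbox if instruction_type(instr) == "scalar" else vector_inbox
    target.append(instr)


scalar_inbox, vector_inbox = [], []
dispatch(SimtInstruction("add", dest=5), scalar_inbox, vector_inbox)
dispatch(SimtInstruction("add", dest=1500), scalar_inbox, vector_inbox)
dispatch(SimtInstruction("mul", dest=5, vector_flag=True), scalar_inbox, vector_inbox)
```

In this sketch the indication bit, when present, takes precedence over the destination address. That ordering is a design choice of the example, not something the description requires.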
In addition, where the SIMT instruction carries a destination address, the scalar processing unit 11 and the vector processing unit 12 may each, after completing an operation, store the operation result in the register corresponding to the destination address, so that the result can subsequently be retrieved directly by that address.
In one embodiment, the SIMT instruction processing apparatus may further include a scalar scheduling unit 17 and a vector scheduling unit 18.
The scalar scheduling unit 17 is coupled to the scalar processing unit 11, and the vector scheduling unit 18 is coupled to the vector processing unit 12.
The scalar scheduling unit 17 is configured to schedule scalar-type SIMT instructions to the scalar processing unit 11.
The vector scheduling unit 18 is configured to schedule vector-type SIMT instructions to the vector processing unit 12.
If the control unit 16 sends a SIMT instruction to the scalar processing unit 11 or the vector processing unit 12 while that unit has no idle processing unit, the instruction cannot be processed. Therefore, the control unit 16 may send scalar-type SIMT instructions to the scalar scheduling unit 17, so that the scalar scheduling unit 17 can schedule scalar instructions in a unified manner, and may send vector-type SIMT instructions to the vector scheduling unit 18, so that the vector scheduling unit 18 can schedule vector instructions in a unified manner. Scheduling may follow a first-in, first-out principle, or a priority principle in which an instruction with a higher priority is executed earlier; it may also be based on resource occupancy or on other principles, which is not limited herein.
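The first two scheduling principles named above can be sketched with standard-library queues. The class names and the `free_units` parameter (the number of currently idle processing units) are hypothetical; the sketch only shows that instructions wait in the scheduling unit until a processing unit is free.

```python
import heapq
from collections import deque


class FifoScheduler:
    """First-in, first-out: instructions issue in arrival order."""

    def __init__(self):
        self.queue = deque()

    def submit(self, instr):
        self.queue.append(instr)

    def issue(self, free_units):
        issued = []
        while self.queue and len(issued) < free_units:
            issued.append(self.queue.popleft())
        return issued


class PriorityScheduler:
    """Higher priority issues first; ties keep arrival order."""

    def __init__(self):
        self.heap = []
        self.counter = 0  # arrival index, used as a tie-breaker

    def submit(self, instr, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self.heap, (-priority, self.counter, instr))
        self.counter += 1

    def issue(self, free_units):
        issued = []
        while self.heap and len(issued) < free_units:
            issued.append(heapq.heappop(self.heap)[2])
        return issued


fifo = FifoScheduler()
for name in ("a", "b", "c"):
    fifo.submit(name)
fifo_first = fifo.issue(free_units=2)   # two idle units this cycle
fifo_stalled = fifo.issue(free_units=0)  # no idle unit: nothing issues

prio = PriorityScheduler()
prio.submit("low", priority=1)
prio.submit("high", priority=5)
prio.submit("mid", priority=3)
prio_first = prio.issue(free_units=2)
```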
In one embodiment, where multiple threads execute the same task in parallel, the threads correspond to the same base address and to different offset addresses. The scalar processing unit 11 is configured to operate on the data of the base address to obtain a first operation result, and the vector processing unit 12 is configured to operate on the data of the offset addresses to obtain a second operation result.
Registers in the scalar register group 13 store the data corresponding to the base address of the multiple threads. The scalar processing unit 11 may operate on the data corresponding to the base address of the multiple threads and store the resulting first operation result in the scalar register group 13 for later use. Registers in the vector register group 14 store the data corresponding to the offset addresses of the multiple threads. The vector processing unit 12 may operate on the data corresponding to the offset addresses of the multiple threads and store the resulting second operation result in the vector register group 14 for later use. For example, after obtaining a scalar-type SIMT instruction, the scalar processing unit 11 may perform a scalar operation according to the instruction to obtain the first operation result, and then store the first operation result in the scalar register group 13. After receiving a vector-type SIMT instruction, the vector processing unit 12 may perform a vector operation according to the instruction to obtain the second operation result, and then store the second operation result in the vector register group 14. When the vector processing unit 12 receives a vector-type SIMT instruction that carries the storage address of the first operation result and the storage address of the second operation result, it may obtain the first operation result and the second operation result from their respective storage addresses and perform a vector operation on them to obtain a third operation result. When the first operation result is the result for the data corresponding to the base address and the second operation result is the result for the data corresponding to the offset addresses, the third operation result is the result for the data corresponding to the base address plus the offset addresses.
In one embodiment, the scalar processing unit 11 is configured to obtain a first SIMT instruction carrying the base address, perform an operation based on the data corresponding to the base address to obtain the first operation result, and store the first operation result in the scalar register group 13.
The vector processing unit 12 is configured to obtain a second SIMT instruction carrying the offset address, perform an operation based on the data corresponding to the offset address to obtain the second operation result, and store the second operation result in the vector register group 14; and to operate on the first operation result stored in the scalar register group 13 and the second operation result stored in the vector register group 14 to obtain the third operation result as the task processing result.
After obtaining a scalar-type SIMT instruction, the scalar processing unit 11 may perform a scalar operation according to the instruction to obtain the first operation result, and then store the first operation result in the scalar register group 13. When the vector processing unit 12 receives a SIMT instruction carrying the storage address of the first operation result and the offset address, it may first obtain data from the register in the vector register group 14 corresponding to the offset address and perform a vector operation to obtain the second operation result, then obtain the first operation result from its storage address, and finally perform a vector operation on the first operation result and the second operation result to obtain the third operation result.
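One way to picture the base/offset split is per-thread address generation, where the part common to all threads is computed once by the scalar unit and only the per-thread part is computed per lane. Reading the "operations" as address arithmetic is an interpretation made for this sketch, and the function names, the element size and the row layout are hypothetical.

```python
WARP_SIZE = 32    # threads per warp (assumed)
ELEMENT_SIZE = 4  # bytes per element (assumed)


def scalar_unit_base(array_start, row, row_stride):
    """First operation result: identical for every thread, computed once."""
    return array_start + row * row_stride


def vector_unit_offsets(thread_ids):
    """Second operation result: one value per thread."""
    return [tid * ELEMENT_SIZE for tid in thread_ids]


def vector_unit_combine(base, offsets):
    """Third operation result: the scalar base combined with each lane."""
    return [base + offset for offset in offsets]


# The scalar result would be kept in the scalar register group,
# the per-thread results in the vector register group.
base = scalar_unit_base(array_start=0x1000, row=2, row_stride=128)
offsets = vector_unit_offsets(range(WARP_SIZE))
addresses = vector_unit_combine(base, offsets)
```

Under these assumptions the base is evaluated once instead of 32 times, which is the saving the scalar path is meant to provide.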
The working principle of the SIMT instruction processing apparatus is described below using an example. Referring to FIG. 3, FIG. 3 is a schematic structural diagram of yet another SIMT instruction processing apparatus provided by an embodiment of the present invention. FIG. 3 assumes that the SIMT instruction processing apparatus supports at most 2048 threads. Organized as 32 threads per warp, this gives 64 warps in total, and the 64 warps may be divided into 8 banks (groups). Assume that the scalar processing unit includes 4 first processing units and the vector processing unit includes 4 second processing units, each second processing unit supporting 32 threads. The scalar register group may include 8 banks, and each bank may include 128 scalar registers. The vector register group may include 8 banks, and each bank may include 128 vector registers, each 32 threads wide. All registers in a bank may be shared by the warps in that bank; alternatively, the registers in a bank may be partitioned by warp, in which case the registers assigned to a warp can be shared only by the threads in that warp.
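The figures in this example follow from simple arithmetic, reproduced below. The constants are those stated in the example; the modulo mapping from warp to bank and the even split of a bank's registers among its warps are assumptions made for illustration.

```python
MAX_THREADS = 2048
WARP_SIZE = 32
NUM_BANKS = 8
REGS_PER_BANK = 128

num_warps = MAX_THREADS // WARP_SIZE           # warps in total
warps_per_bank = num_warps // NUM_BANKS        # warps sharing one bank
total_scalar_regs = NUM_BANKS * REGS_PER_BANK  # scalar registers overall

# If a bank is partitioned evenly by warp (assumed split):
regs_per_warp = REGS_PER_BANK // warps_per_bank


def bank_of(warp_id):
    """Assumed mapping: warps are interleaved across banks by modulo."""
    return warp_id % NUM_BANKS
```

With an even number of banks, the modulo mapping keeps the parity of the bank equal to the parity of the warp, which is consistent with the parity-based scheduling described for this example.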
Each processing unit needs at least one operand for processing. Therefore, there may be an 8x8 crossbar between the scalar processing unit and the scalar register group, which ensures that the scalar processing unit can access the scalar registers of all banks in the scalar register group. There may be an 8x8, 32-thread-wide crossbar between the vector processing unit and the vector register group, and an 8x8 crossbar between the vector processing unit and the scalar register group, which ensures that the vector processing unit can access the registers of all banks in both the scalar register group and the vector register group.
In each clock cycle, the scalar processing unit may receive 4 SIMT instructions from the scalar scheduling unit, and the vector processing unit may receive 4 SIMT instructions from the vector scheduling unit. Because a bank can be accessed by only one processing unit at a time, the SIMT instructions scheduled by the scalar scheduling unit and the vector scheduling unit in two consecutive clock cycles may have opposite parity, which ensures that the processing units access different banks at any given time.
For example, at time 0, the control unit may, through the scalar/vector scheduling units, schedule SIMT instructions that access bank0, bank2, bank4 and bank6 of the respective register groups, and the first/second processing units may read operand 0 from the respective even-numbered banks. At time 1, the control unit may, through the scalar/vector scheduling units, schedule SIMT instructions that access bank1, bank3, bank5 and bank7 of the respective register groups, and the first/second processing units may read operand 1 from the respective even-numbered banks and operand 0 from the respective odd-numbered banks. At time 2, the first/second processing units may read operand 2 from the respective odd-numbered banks. Therefore, as long as the scheduling keeps the parity interleaved between adjacent times and the banks scheduled at the same time do not conflict, the maximum access efficiency can be achieved without crossbar conflicts. For example, at time 1 above, the 8 operand read interfaces of the 4 first/second processing units access bank0 through bank7 of the respective register groups.
It can be seen that a processing unit can read two operands from the same bank in two consecutive beats, that is, two cycles. Because the warp parity of the instructions in two consecutive beats is opposite, that is, the bank parity is opposite, no conflict arises, and at most 8 processing-unit read interfaces access the 8 banks at the same time.
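The conflict-free property can be checked with a small simulation. The model is an assumption made for this sketch: four instructions issue per cycle, each targets a distinct bank of the scheduled parity, and each reads its first operand in the issue cycle and its second operand in the following cycle. The function names and the `alternate` switch are hypothetical.

```python
NUM_BANKS = 8


def banks_issued_at(cycle, alternate=True):
    """Banks targeted by the four instructions issued in this cycle."""
    parity = cycle % 2 if alternate else 0
    return [bank for bank in range(NUM_BANKS) if bank % 2 == parity]


def bank_reads(num_cycles, alternate=True):
    """Map each cycle to the (bank, operand_index) reads performed in it."""
    reads = {}
    for cycle in range(num_cycles):
        for bank in banks_issued_at(cycle, alternate):
            # First operand in the issue cycle, second one a cycle later.
            reads.setdefault(cycle, []).append((bank, 0))
            reads.setdefault(cycle + 1, []).append((bank, 1))
    return reads


def has_conflict(reads):
    """True if any bank is read more than once in the same cycle."""
    for accesses in reads.values():
        banks = [bank for bank, _ in accesses]
        if len(banks) != len(set(banks)):
            return True
    return False


interleaved = bank_reads(4)                    # even, odd, even, odd
same_parity = bank_reads(4, alternate=False)   # even banks every cycle
```

With alternating parity, every steady-state cycle uses all 8 banks exactly once (4 first-operand reads and 4 second-operand reads). Issuing the same parity in consecutive cycles makes the second-operand reads collide with the next cycle's first-operand reads.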
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a SIMT instruction processing method provided by an embodiment of the present invention. The SIMT instruction processing method may be applied to the SIMT instruction processing apparatus shown in FIG. 1 to FIG. 3. As shown in FIG. 4, the SIMT instruction processing method may include the following steps.
401. Determine the type of a SIMT instruction by the control unit, and send the SIMT instruction to the scalar scheduling unit or the vector scheduling unit based on the type of the SIMT instruction.
The control unit determines the type of the SIMT instruction according to the destination address carried by the SIMT instruction.
402. Schedule scalar-type SIMT instructions to the scalar processing unit by the scalar scheduling unit, and schedule vector-type SIMT instructions to the vector processing unit by the vector scheduling unit.
403. Perform a scalar operation by the scalar processing unit according to a scalar-type SIMT instruction, and perform a vector operation by the vector processing unit according to a vector-type SIMT instruction.
404. Store scalar data in the scalar register group, and store vector data in the vector register group.
Optionally, a first SIMT instruction is obtained by the scalar processing unit, an operation is performed based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and the first operation result is stored in the scalar register group; a second SIMT instruction is obtained by the vector processing unit, an operation is performed based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and the second operation result is stored in the vector register group; and the vector processing unit operates on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result as the task processing result.
It should be noted that, for the functions involved in the specific flow of the SIMT instruction processing method described in the embodiments of the present invention, reference may be made to the related descriptions in the embodiments of the SIMT instruction processing apparatus described with reference to FIG. 1 to FIG. 3, which are not repeated here.
It can be understood that the SIMT instruction processing method may be a combination of all or some of steps 401 to 404, which is not limited herein.
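Steps 401 to 404 can be strung together in a compact end-to-end sketch. The dictionary-based instruction format, the explicit `type` field and the function name `run` are hypothetical, and the model ignores timing and processing-unit availability. It only shows the order of classification, scheduling, execution and write-back.

```python
from collections import deque


def run(instructions):
    """Process a list of instructions and return both register groups."""
    scalar_queue, vector_queue = deque(), deque()
    scalar_regs, vector_regs = {}, {}

    # Step 401: classify each instruction and hand it to a scheduling queue.
    for instr in instructions:
        queue = scalar_queue if instr["type"] == "scalar" else vector_queue
        queue.append(instr)

    # Steps 402-404 (scalar path): schedule, execute, store the result.
    while scalar_queue:
        instr = scalar_queue.popleft()
        scalar_regs[instr["dest"]] = instr["op"](*instr["operands"])

    # Steps 402-404 (vector path): the operation is applied lane by lane.
    while vector_queue:
        instr = vector_queue.popleft()
        lanes = zip(*instr["operands"])
        vector_regs[instr["dest"]] = [instr["op"](*lane) for lane in lanes]

    return scalar_regs, vector_regs


program = [
    {"type": "scalar", "dest": "s0", "op": lambda a, b: a * b,
     "operands": (6, 7)},
    {"type": "vector", "dest": "v0", "op": lambda a, b: a + b,
     "operands": ([1, 2, 3], [10, 20, 30])},
]
scalar_regs, vector_regs = run(program)
```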
Some embodiments provide a system-on-chip, which may include the SIMT instruction processing apparatus provided in the above embodiments. The system-on-chip may consist of the SIMT instruction processing apparatus alone, or may include the SIMT instruction processing apparatus and other discrete devices.
Some embodiments provide an electronic device, including the SIMT instruction processing apparatus provided in the above embodiments and a discrete device coupled to the SIMT instruction processing apparatus.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, based on the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (16)

  1. A single-instruction multi-thread (SIMT) instruction processing apparatus, comprising a scalar processing unit and a vector processing unit, wherein:
    the scalar processing unit is configured to perform a scalar operation according to a scalar-type SIMT instruction;
    the vector processing unit is configured to perform a vector operation according to a vector-type SIMT instruction.
  2. The apparatus according to claim 1, wherein the apparatus further comprises a control unit, wherein:
    the control unit is coupled to both the scalar processing unit and the vector processing unit;
    the control unit is configured to determine the type of the SIMT instruction and, based on the type of the SIMT instruction, send the SIMT instruction to the scalar processing unit or the vector processing unit, wherein the type comprises scalar or vector.
  3. The apparatus according to claim 2, wherein the control unit is configured to determine the type of the SIMT instruction according to indication information carried by the SIMT instruction, wherein the indication information comprises a destination address, an indication bit, an indication field or an indicator.
  4. The apparatus according to any one of claims 1-3, wherein the apparatus further comprises a scalar scheduling unit and a vector scheduling unit, wherein:
    the scalar scheduling unit is coupled to the scalar processing unit and is configured to schedule scalar-type SIMT instructions to the scalar processing unit;
    the vector scheduling unit is coupled to the vector processing unit and is configured to schedule vector-type SIMT instructions to the vector processing unit.
  5. The apparatus according to any one of claims 1-4, wherein, where multiple threads execute the same task in parallel, the multiple threads correspond to the same base address and to different offset addresses, wherein
    the scalar processing unit is configured to perform a scalar operation on the data corresponding to the base address to obtain a first operation result,
    the vector processing unit is configured to perform a vector operation on the data corresponding to the offset addresses to obtain a second operation result.
  6. The apparatus according to any one of claims 1-5, wherein the apparatus further comprises a scalar register group for storing scalar data and a vector register group for storing vector data, wherein:
    the scalar register group is coupled to both the scalar processing unit and the vector processing unit,
    the vector register group is coupled to the vector processing unit.
  7. The apparatus according to claim 6, wherein the apparatus further comprises a crossbar module, the crossbar module comprising multiple crossbars, wherein:
    the scalar processing unit is connected to the scalar register group through the crossbar module;
    the vector processing unit is connected to the scalar register group and to the vector register group through the crossbar module.
  8. The apparatus according to claim 6 or 7, wherein the scalar processing unit is configured to obtain a first SIMT instruction, perform an operation based on the data corresponding to the base address carried by the first SIMT instruction to obtain a first operation result, and store the first operation result in the scalar register group;
    the vector processing unit is configured to obtain a second SIMT instruction, perform an operation based on the data corresponding to the offset address carried by the second SIMT instruction to obtain a second operation result, and store the second operation result in the vector register group; and to operate on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
  9. A single-instruction multi-thread (SIMT) instruction processing method, applied to a SIMT instruction processing apparatus comprising a scalar processing unit and a vector processing unit, the method comprising:
    performing, by the scalar processing unit, a scalar operation according to a SIMT instruction of a scalar type; and
    performing, by the vector processing unit, a vector operation according to a SIMT instruction of a vector type.
  10. The method according to claim 9, wherein the apparatus further comprises a control unit, the method further comprising:
    determining, by the control unit, a type of the SIMT instruction, and sending the SIMT instruction to the scalar processing unit or the vector processing unit based on the type of the SIMT instruction, wherein the type is scalar or vector.
  11. The method according to claim 10, wherein the method further comprises:
    determining, by the control unit, the type of the SIMT instruction according to indication information carried by the SIMT instruction, wherein the indication information comprises a destination address, an indication bit, an indication field, or an indicator.
  12. The method according to any one of claims 9-11, wherein the apparatus further comprises a scalar scheduling unit and a vector scheduling unit, the method further comprising:
    scheduling, by the scalar scheduling unit, the SIMT instruction of the scalar type to the scalar processing unit; and
    scheduling, by the vector scheduling unit, the SIMT instruction of the vector type to the vector processing unit.
  13. The method according to any one of claims 9-12, wherein the apparatus further comprises a scalar register group and a vector register group, the method further comprising:
    storing scalar data in the scalar register group; and
    storing vector data in the vector register group.
  14. The method according to claim 13, wherein the method further comprises:
    obtaining, by the scalar processing unit, a first SIMT instruction, performing an operation based on data corresponding to a base address carried by the first SIMT instruction to obtain a first operation result, and storing the first operation result in the scalar register group;
    obtaining, by the vector processing unit, a second SIMT instruction, performing an operation based on data corresponding to an offset address carried by the second SIMT instruction to obtain a second operation result, and storing the second operation result in the vector register group; and
    performing, by the vector processing unit, an operation on the first operation result stored in the scalar register group and the second operation result stored in the vector register group to obtain a third operation result.
  15. A system-on-chip integrating the single-instruction multi-thread (SIMT) instruction processing apparatus according to any one of claims 1-8.
  16. An electronic device, comprising the single-instruction multi-thread (SIMT) instruction processing apparatus according to any one of claims 1-8.
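
The data flow recited in claims 10, 11 and 14 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: the names `SimtInstruction`, `SimtDevice`, `dispatch` and `combine` are hypothetical, the `kind` field stands in for the indication information of claim 11, the per-unit "operation" is assumed to simply record its operand, and the third operation is assumed to be an addition of the base to each per-thread offset. The claims themselves leave all of these operations unspecified.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class SimtInstruction:
    # Hypothetical encoding: "kind" plays the role of the indication
    # information in claim 11 (the claims do not fix its form).
    kind: str                          # "scalar" or "vector"
    dest: str                          # destination register name
    operand: Union[int, List[int]]     # one value, or one value per thread


@dataclass
class SimtDevice:
    # Scalar register group: one value shared by all threads.
    scalar_regs: Dict[str, int] = field(default_factory=dict)
    # Vector register group: one value per thread.
    vector_regs: Dict[str, List[int]] = field(default_factory=dict)

    def dispatch(self, instr: SimtInstruction) -> None:
        """Control-unit step: route the instruction by its type."""
        if instr.kind == "scalar":
            self._run_scalar(instr)
        elif instr.kind == "vector":
            self._run_vector(instr)
        else:
            raise ValueError(f"unknown instruction type: {instr.kind!r}")

    def _run_scalar(self, instr: SimtInstruction) -> None:
        # Assumed operation: record the base address once for all threads.
        self.scalar_regs[instr.dest] = int(instr.operand)

    def _run_vector(self, instr: SimtInstruction) -> None:
        # Assumed operation: record one offset per thread.
        self.vector_regs[instr.dest] = list(instr.operand)

    def combine(self, scalar_reg: str, vector_reg: str) -> List[int]:
        """Vector-unit step: read both register groups and combine them.

        Addition is an assumption made for this illustration.
        """
        base = self.scalar_regs[scalar_reg]
        return [base + offset for offset in self.vector_regs[vector_reg]]


device = SimtDevice()
device.dispatch(SimtInstruction("scalar", "s0", 0x1000))         # first result
device.dispatch(SimtInstruction("vector", "v0", [0, 4, 8, 12]))  # second result
addresses = device.combine("s0", "v0")                           # third result
print(addresses)  # [4096, 4100, 4104, 4108]
```

In this model the base address is computed and stored once rather than once per thread, and only the vector path touches per-thread data, which is the division of work the claims assign to the scalar and vector processing units.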
PCT/CN2021/100808 2020-12-11 2021-06-18 Simt instruction processing method and device WO2022121273A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022523849A JP2023509813A (en) 2020-12-11 2021-06-18 SIMT command processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011452846.0 2020-12-11
CN202011452846.0A CN114625421A (en) 2020-12-11 2020-12-11 SIMT instruction processing method and device

Publications (1)

Publication Number Publication Date
WO2022121273A1 (en) 2022-06-16

Family

ID=81895766

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100808 WO2022121273A1 (en) 2020-12-11 2021-06-18 Simt instruction processing method and device

Country Status (3)

Country Link
JP (1) JP2023509813A (en)
CN (1) CN114625421A (en)
WO (1) WO2022121273A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5423051A (en) * 1992-09-24 1995-06-06 International Business Machines Corporation Execution unit with an integrated vector operation capability
US20130042090A1 (en) * 2011-08-12 2013-02-14 Ronny M. KRASHINSKY Temporal simt execution optimization
CN104699465A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Vector access and storage device supporting SIMT in vector processor and control method
US20160188531A1 (en) * 2014-12-24 2016-06-30 Samsung Electronics Co., Ltd. Operation processing apparatus and method
CN106257411A (en) * 2015-06-17 2016-12-28 联发科技股份有限公司 Single instrction multithread calculating system and method thereof
CN111240745A (en) * 2019-02-20 2020-06-05 上海天数智芯半导体有限公司 Enhanced scalar vector dual pipeline architecture for interleaved execution

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170371654A1 (en) * 2016-06-23 2017-12-28 Advanced Micro Devices, Inc. System and method for using virtual vector register files
US10776311B2 (en) * 2017-03-14 2020-09-15 Azurengine Technologies Zhuhai Inc. Circular reconfiguration for a reconfigurable parallel processor using a plurality of chained memory ports

Also Published As

Publication number Publication date
CN114625421A (en) 2022-06-14
JP2023509813A (en) 2023-03-10

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022523849

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901985

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901985

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 231123)