CN112579164A - SIMT conditional branch processing device and method - Google Patents

Publication number
CN112579164A
Authority
CN
China
Prior art keywords
instruction
component
warp
stack
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011404147.9A
Other languages
Chinese (zh)
Other versions
CN112579164B (en)
Inventor
任向隆
田泽
张骏
韩立敏
许宏杰
牛少平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Xiangteng Microelectronics Technology Co Ltd
Original Assignee
Xian Xiangteng Microelectronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Xiangteng Microelectronics Technology Co Ltd
Priority to CN202011404147.9A
Publication of CN112579164A
Application granted
Publication of CN112579164B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3005 Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30058 Conditional branch instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3005 Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30069 Instruction skipping instructions, e.g. SKIP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Abstract

The present invention relates to a SIMT conditional branch processing device and method. The device comprises branch processing device hardware, instructions, and a programming model; the branch processing device hardware is connected to the instructions and to the programming model, and the instructions and the programming model are connected to each other, wherein: the branch processing device hardware is the hardware carrier that executes external user code and thus has a relationship with the external user code; the instructions provide the external user code with an interface for using the device and thus have a relationship with the external user code; the programming model provides the external user code with the constraints and limitations on how instruction sequences may be used and thus has a relationship with the external user code. The invention provides a low-cost, easy-to-implement scheme for conditional branching in a SIMT processor and is suitable for both fine-grained and coarse-grained SIMT multithreaded processors.

Description

SIMT conditional branch processing device and method
Technical Field
The invention belongs to the technical field of processor and graphics processor design, and in particular relates to a SIMT conditional branch processing device and method.
Background
In a coarse-grained multithreaded processor, when the current thread encounters a long-latency event, the processor switches to another thread so that processor resources stay in use and the stall caused by the long-latency event is hidden. However, coarse-grained multithreading switches threads only when a long-latency event occurs, so the switching frequency is low, and the thread context is generally saved by the operating system. Saving and loading the thread context on each switch therefore incurs overhead, during which the processor's computing resources are idle.
A fine-grained multithreaded processor also hides the stall caused by long-latency events by switching threads. Unlike coarse-grained multithreading, fine-grained multithreading switches threads at a fixed period, so the switching frequency is high. To reduce the cost of saving and loading thread contexts on each switch, fine-grained multithreading keeps multiple thread contexts in hardware, so thread switching can be done with zero overhead.
A warp is a set consisting of the data to be processed, the program that processes the data, and the result data produced by the processing. Multiple warps can execute in a time-shared manner on the same hardware without interfering with one another. Each warp needs its own context to record its state, so warps must be supported by corresponding hardware.
Single Instruction Multiple Thread (SIMT) uses a single instruction to control the execution of multiple threads, i.e., multiple threads execute the same instruction at the same time. Applied to processor design, SIMT saves instruction-fetch logic, lets more transistors be used for computation, and so improves the computing capability of the processor. In graphics computation, a large number of vertices and pixels undergo the same operations, data parallelism is extremely high, and SIMT is a good fit. In non-graphics computation, different threads may follow different execution paths, and SIMT becomes less efficient.
In a SIMT processor, when all threads in a thread group follow the same execution path, the processor achieves its full efficiency and performance. If the threads in a thread group take different paths at a conditional branch because their data differ, the branch paths must be executed one after another.
Most existing SIMT processors use a branch synchronization stack to manage the diverging and reconverging threads of each branch, which is complex to implement and costly in hardware.
Disclosure of Invention
To solve the above problems in the background art, the present invention provides a SIMT conditional branch processing device and method that offer a low-cost, easy-to-implement solution for conditional branching in a SIMT processor and are suitable for both fine-grained and coarse-grained SIMT multithreaded processors.
The technical solution of the invention is as follows: a SIMT conditional branch processing device, characterized in that it comprises branch processing device hardware, instructions, and a programming model; the branch processing device hardware is connected to the instructions and to the programming model, and the instructions and the programming model are connected to each other, wherein:
the branch processing device hardware is the hardware carrier that executes external user code and thus has a relationship with the external user code; the instructions provide the external user code with an interface for using the device and thus have a relationship with the external user code; the programming model provides the external user code with the constraints and limitations on how instruction sequences may be used and thus has a relationship with the external user code;
the branch processing device hardware is operated through the instructions and therefore depends on the instructions; the hardware is implemented on the premise of the programming model and therefore depends on the programming model; the instructions are executed by the branch processing device hardware and therefore depend on the hardware; use of the instructions must follow the programming model, so the instructions depend on the programming model; the programming model is realized through sequences of instructions and therefore depends on the instructions; the programming model is embodied in code that executes on the branch processing device hardware and therefore depends on the hardware.
Preferably, the branch processing device hardware comprises a conditional branch component, a Mask stack control component, an absolute jump component, a multi-site (warp) ThreadMask stack, a multi-site (warp) PC stack, a current ThreadMask, a warp scheduling component, an instruction fetch warp scheduling component, a PC updating component, and a current PC, wherein:
the conditional branch component executes conditional branch instructions and, from the execution result, generates the assertion Mask information of all threads, an indication of whether the results of all threads diverge, and the branch address information; the assertion Mask is generated by the following rule: if the branch condition is satisfied for a thread in the warp, the assertion Mask of that thread is valid, otherwise it is invalid;
the Mask stack control component executes Mask stack control instructions and generates the corresponding control actions, namely push, pop, and negate; the negate operation inverts the assertion Masks of all threads in the warp bit by bit, so that valid becomes invalid and invalid becomes valid;
the absolute jump component executes absolute jump instructions, generates the control actions (push and pop) and data for the multi-site (warp) PC stack, and generates current PC information;
the multi-site (warp) ThreadMask stack has a plurality of hardware sites (contexts), each of which stores the Mask information of all threads of one warp, and it outputs the ThreadMask information of a warp according to the warp information output by the warp scheduling component;
the multi-site (warp) PC stack has a plurality of hardware sites (contexts), each of which stores the PC information of one warp, and it outputs the PC information of a warp according to the warp information output by the instruction fetch warp scheduling component;
the current ThreadMask holds the ThreadMask information of all threads of the current warp;
the warp scheduling component selects, from a plurality of warps, one warp that is ready to run, according to information from the external scoreboard unit;
the instruction fetch warp scheduling component selects, from a plurality of warps, one warp for which instructions are to be fetched, according to the instruction cache state information of the external IF component;
the PC updating component updates the PC of the current warp according to the condition test results of all threads given by the conditional branch component, the data related to the absolute jump instruction given by the absolute jump component, and the current PC given by the current PC; the update rule is: if the condition test results of all threads do not diverge and the test condition is satisfied, the PC is updated to the jump address sent by the conditional branch component; if the condition test results diverge or the test condition is not satisfied, the PC is updated to PC + 1 as given by the current PC; if the absolute jump sent by the absolute jump component is valid, the PC is updated, according to the type of the absolute jump, to either the jump address sent by the absolute jump component or the PC at the top of the multi-site (warp) PC stack for that warp; otherwise, the PC is updated to PC + 1 as given by the current PC;
the current PC holds the PC information of the current warp.
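The assertion Mask rule and the divergence indication produced by the conditional branch component can be sketched as follows. This is a minimal Python illustration rather than the hardware itself: the 4-thread warp, the one-bit-per-thread integer encoding, the restriction of the test to threads active in the current ThreadMask, and the names `assertion_mask` and `is_divergent` are all assumptions introduced here.

```python
def assertion_mask(cond_results, current_mask):
    """Return the assertion Mask as an integer bit field.

    Bit i is valid (1) when thread i is active in current_mask and its
    branch condition is satisfied; otherwise it is invalid (0).
    """
    mask = 0
    for tid, satisfied in enumerate(cond_results):
        if satisfied and (current_mask >> tid) & 1:
            mask |= 1 << tid
    return mask


def is_divergent(new_mask, current_mask):
    """Active threads diverge when some, but not all, satisfy the condition."""
    return new_mask != 0 and new_mask != current_mask


# Hypothetical 4-thread warp with every thread active.
current = 0b1111
mask = assertion_mask([True, False, True, False], current)
print(bin(mask), is_divergent(mask, current))
```

When every active thread agrees (the new mask equals the current mask or is zero), the warp does not diverge and a single path is followed.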
Preferably, the conditional branch component is connected to the external ID component and receives from it the control signals and data of conditional branch instructions; the Mask stack control component is connected to the external ID component and receives from it the control signals of Mask stack control instructions; the absolute jump component is connected to the external ID component and receives from it the control signals and data of absolute jump instructions; the current PC is connected to the external IF component and outputs to it the PC information of the selected warp; the instruction fetch warp scheduling component is connected to the external IF component and receives from it the empty/full information of the multi-warp instruction cache; the warp scheduling component is connected to the external scoreboard unit and receives from it the dependency information of the multi-warp instructions (data dependencies, control dependencies, functional unit conflicts, write-back path conflicts, and the like) and the status of instructions executing on the functional units; the current ThreadMask is connected to the external Ex component and outputs to it the Mask information for the instruction of the current warp; the warp scheduling component is connected to the external ID component and outputs to it the id of the winning current warp.
Preferably, the conditional branch component is connected to the PC updating component and outputs to it the condition test results of all threads of the current warp and the branch address information; the Mask stack control component is connected to the multi-site (warp) ThreadMask stack and outputs to it the control action information; the absolute jump component is connected to the PC updating component and outputs to it the data related to the absolute jump instruction; the absolute jump component is connected to the multi-site (warp) PC stack and outputs to it the stack control actions and data; the multi-site (warp) ThreadMask stack is connected to the warp scheduling component and receives from it the id of the winning current warp; the current ThreadMask is connected to the multi-site (warp) ThreadMask stack and receives from it the Mask information of all threads of the current warp; the multi-site (warp) PC stack is connected to the PC updating component and receives from it the updated next PC of the current warp; the multi-site (warp) PC stack is connected to the instruction fetch warp scheduling component and receives from it the id of the winning fetch warp; the multi-site (warp) PC stack is connected to the current PC and outputs to it the PC information of the fetch warp; the current ThreadMask is connected to the conditional branch component and outputs to it the Mask information of all threads of the current warp; the warp scheduling component is connected to the multi-site (warp) PC stack and outputs to it the id of the winning current warp; the current PC is connected to the PC updating component and outputs to it the PC information of the current warp; the current ThreadMask is connected to the Mask stack control component and outputs to it the Mask information of all threads of the current warp.
Preferably, the instructions include conditional branch instructions, absolute jump instructions, and assertion (predicate) operation instructions.
Preferably, the conditional branch instructions are subdivided into two types: an instruction that jumps to the address specified by the immediate when the condition is satisfied, and an instruction that jumps to the address specified by the immediate when the condition is not satisfied; the instruction format of a conditional branch instruction comprises a mnemonic, a test condition, and an immediate jump address;
the absolute jump instructions are subdivided into a jump instruction, a push-and-jump instruction, and a pop-and-jump instruction; the jump instruction jumps to the address specified by the immediate; the push-and-jump instruction saves the address of the next instruction on the stack and jumps to the address specified by the immediate; the pop-and-jump instruction pops an address from the stack and jumps to that address; the instruction formats of the jump instruction and the push-and-jump instruction comprise a mnemonic and an immediate jump address; the instruction format of the pop-and-jump instruction contains only a mnemonic;
the assertion operation instructions are subdivided into an assertion push instruction, an assertion pop instruction, and an assertion negate instruction; the assertion push instruction pushes the current assertion Mask onto the assertion stack; the assertion pop instruction pops the assertion Mask at the top of the assertion stack and makes it the current assertion Mask; the assertion negate instruction inverts the current assertion Mask, performs a bitwise AND of the result with the assertion Mask at the top of the assertion stack, and makes the result the current assertion Mask.
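The three assertion operation instructions are enough to express an if/else over a divergent warp, which the following Python sketch illustrates. It is a software model under stated assumptions, not the device: the class `MaskStack`, the helper `run_if_else`, the bit-field encoding, and the step in which the conditional branch narrows the current Mask to the threads that satisfy the condition are hypothetical choices made for this example.

```python
class MaskStack:
    """Software model of one warp's assertion stack (hypothetical names)."""

    def __init__(self, initial_mask, width):
        self.current = initial_mask        # current assertion Mask
        self.stack = []                    # saved Masks
        self.full = (1 << width) - 1       # all-ones mask for this warp width

    def push(self):
        # Assertion push: save the current Mask on the stack.
        self.stack.append(self.current)

    def pop(self):
        # Assertion pop: the stack top becomes the current Mask.
        self.current = self.stack.pop()

    def negate(self):
        # Assertion negate: invert the current Mask, then AND with the stack top.
        self.current = (~self.current & self.full) & self.stack[-1]


def run_if_else(values, threshold):
    """Per thread: if v > threshold then v * 2 else v + 1."""
    width = len(values)
    ms = MaskStack((1 << width) - 1, width)
    out = list(values)
    ms.push()                                        # save the outer Mask
    cond = sum(1 << t for t, v in enumerate(values) if v > threshold)
    ms.current &= cond                               # assumed effect of the branch
    for t in range(width):                           # "if" path
        if (ms.current >> t) & 1:
            out[t] = values[t] * 2
    ms.negate()                                      # switch to the other threads
    for t in range(width):                           # "else" path
        if (ms.current >> t) & 1:
            out[t] = values[t] + 1
    ms.pop()                                         # restore the outer Mask
    return out, ms.current
```

Because negate is ANDed with the saved Mask at the stack top, threads that were already inactive before the branch stay inactive on the else path, which is what allows conditionals to nest.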
Preferably, the programming model includes assertion rules, flow control rules, and PC update rules.
Preferably, the assertion rules include: every instruction in the instruction set has an attribute stating whether it is controlled by the assertion Mask, wherein: absolute jump instructions and assertion operation instructions are not controlled by the assertion Mask and are executed whether or not the assertion Mask is valid; conditional branch instructions are controlled by the assertion Mask and are executed when the assertion Mask is valid and not executed when it is invalid;
the flow control rules include: except to implement loop control such as while and for, the jump instruction is not allowed inside a conditional branch; the push-and-jump and pop-and-jump instructions may appear inside a conditional branch; the push-and-jump and pop-and-jump instructions are used as a pair to implement function calls, and function calls may be nested; of the two path addresses of a conditional branch, one is always PC + 1 and the other must be greater than the current PC and may not be smaller, i.e., the immediate carried in a conditional branch instruction is greater than the address of the current instruction; absolute jump instructions support immediate addressing; absolute jump instructions may also support register indirect addressing, but the programmer must ensure that the register holds the same value in all threads; the user program must follow the flow control rules, otherwise its execution flow may not be as expected;
the PC update rules include: in the designed instruction set, the only instructions that can change the program execution flow are absolute jump instructions and conditional branch instructions; the PC update strategy is as follows: when an absolute jump instruction is encountered, the PC always jumps, regardless of whether the instruction is inside a conditional branch; when a conditional branch instruction is encountered, if the active threads do not diverge and the test condition is satisfied, the PC jumps; when a conditional branch instruction is encountered, if the active threads diverge, or do not diverge but the test condition is not satisfied, the PC is updated to PC + 1; when any other instruction in the instruction set is encountered, the PC is updated to PC + 1.
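The PC update strategy, together with the flow control rule that a conditional branch target must lie after the current instruction, can be sketched as a single function. This is a minimal Python illustration; the function name `next_pc`, the string tags for instruction kinds, and the keyword parameters are hypothetical and chosen only for this example.

```python
def next_pc(pc, kind, *, target=None, divergent=False,
            condition_met=False, stack_top=None):
    """Return the next PC of a warp for one instruction.

    kind is one of "jump", "push_jump", "pop_jump", "cond_branch";
    any other value stands for an ordinary instruction.
    """
    if kind in ("jump", "push_jump"):
        return target                 # absolute jump: always taken
    if kind == "pop_jump":
        return stack_top              # return to the address on the PC stack
    if kind == "cond_branch":
        if target <= pc:
            # Flow control rule: the immediate must exceed the current address.
            raise ValueError("conditional branch target must be greater than PC")
        if not divergent and condition_met:
            return target             # uniform and satisfied: jump
        return pc + 1                 # divergent or not satisfied: fall through
    return pc + 1                     # every other instruction
```

For example, `next_pc(10, "cond_branch", target=20, divergent=True, condition_met=True)` falls through to 11: a divergent warp always continues at PC + 1, and the assertion Mask then decides which threads take effect on each path.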
A method implemented with the SIMT conditional branch processing device described above, characterized in that the method comprises the following steps:
1) the branch processing device hardware determines the type of the instruction input from outside and, according to the type, executes the conditional branch flow, the Mask stack control flow, or the absolute jump flow;
2) the conditional branch flow is as follows:
2.1) the conditional branch component performs the condition test according to the input opcode and operands and the input current ThreadMask information;
2.2) according to the result of the condition test in step 2.1), the conditional branch component sends the condition test results of all threads and the updated PC value to the PC updating component;
2.3) according to the result of the condition test in step 2.1), the conditional branch component generates new Mask information and sends it to the multi-site (warp) ThreadMask stack; the multi-site (warp) ThreadMask stack selects one of its warp sites according to the current warp information input by the warp scheduling component and outputs the Mask values of all threads of that warp to the current ThreadMask; the current ThreadMask outputs the received Mask information of all threads of the current warp to the external Ex component, the conditional branch component, and the Mask stack control component, respectively;
3) the Mask stack control flow is as follows: according to the opcode input by the ID component and the input Masks of all threads of the current warp, the Mask stack control component notifies the multi-site (warp) ThreadMask stack of the action to perform (push, pop, or negate) and, for a push, of the Mask values of all threads;
4) the absolute jump flow is as follows: according to the opcode and operand input by the ID component, the absolute jump component sends the operand carried by the absolute jump instruction to the PC updating component and notifies the multi-site (warp) PC stack of the operation to perform (push or pop); the multi-site (warp) PC stack obtains the updated PC value from the PC updating component and performs the operation specified by the absolute jump component.
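The absolute jump flow in step 4) relies on a PC stack with one hardware site per warp, so that nested calls in one warp do not disturb another. A minimal Python sketch follows. It models only the return-address behaviour of the push-and-jump and pop-and-jump instructions, leaves out the fetch-side PC storage, and uses a hypothetical class name `MultiSitePCStack` with hypothetical method names.

```python
class MultiSitePCStack:
    """One independent return-address stack per warp (illustrative model)."""

    def __init__(self, num_warps):
        # Each site is the private stack of one warp.
        self.sites = [[] for _ in range(num_warps)]

    def push_and_jump(self, warp, pc, target):
        # Save the address of the next instruction, then jump to the target.
        self.sites[warp].append(pc + 1)
        return target

    def pop_and_jump(self, warp):
        # Pop the saved address and jump back to it.
        return self.sites[warp].pop()


# Two warps interleave calls; warp 0 nests a second call inside the first.
stack = MultiSitePCStack(num_warps=2)
pc0 = stack.push_and_jump(0, pc=5, target=100)     # warp 0 calls f
pc1 = stack.push_and_jump(1, pc=9, target=200)     # warp 1 calls g
pc0 = stack.push_and_jump(0, pc=pc0, target=300)   # warp 0: f calls h
returns0 = [stack.pop_and_jump(0), stack.pop_and_jump(0)]
return1 = stack.pop_and_jump(1)
print(returns0, return1)
```

Keeping the sites separate is what lets the scheduler switch between warps at any instruction without saving or restoring return addresses.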
The invention provides a SIMT conditional branch processing device and method that handle the conditional branches of a SIMT processor in a simplified way and, through multiple hardware sites (contexts), support branch processing for both fine-grained and coarse-grained multithreading. The invention therefore has the following advantages:
1) it implements branch processing for a SIMT processor;
2) it supports branch processing for a fine-grained SIMT multithreaded processor;
3) it also supports branch processing for a coarse-grained SIMT multithreaded processor;
4) it supports multiple hardware sites and zero-overhead fast switching for both fine-grained and coarse-grained multithreading;
5) it has a simple structure, is easy to implement, and is highly extensible.
Drawings
FIG. 1 is a composition and relationship diagram of the present invention;
FIG. 2 is a diagram of the hardware components and connections of the device of the present invention;
FIG. 3 is a flow chart of the method of the present invention.
Detailed Description
The invention provides a SIMT conditional branch processing device comprising branch processing device hardware, instructions, and a programming model; the branch processing device hardware is connected to the instructions and to the programming model, and the instructions and the programming model are connected to each other, wherein:
the branch processing device hardware is the hardware carrier that executes external user code and thus has a relationship with the external user code; the instructions provide the external user code with an interface for using the device and thus have a relationship with the external user code; the programming model provides the external user code with the constraints and limitations on how instruction sequences may be used and thus has a relationship with the external user code;
the branch processing device hardware is operated through the instructions and therefore depends on the instructions; the hardware is implemented on the premise of the programming model and therefore depends on the programming model; the instructions are executed by the branch processing device hardware and therefore depend on the hardware; use of the instructions must follow the programming model, so the instructions depend on the programming model; the programming model is realized through sequences of instructions and therefore depends on the instructions; the programming model is embodied in code that executes on the branch processing device hardware and therefore depends on the hardware.
The branch processing device hardware comprises a conditional branch component, a Mask stack control component, an absolute jump component, a multi-site (warp) ThreadMask stack, a multi-site (warp) PC stack, a current ThreadMask, a warp scheduling component, an instruction fetch warp scheduling component, a PC updating component, and a current PC, wherein:
the conditional branch component executes conditional branch instructions and, from the execution result, generates the assertion Mask information of all threads, an indication of whether the results of all threads diverge, and the branch address information; the assertion Mask is generated by the following rule: if the branch condition is satisfied for a thread in the warp, the assertion Mask of that thread is valid, otherwise it is invalid;
the Mask stack control component executes Mask stack control instructions and generates the corresponding control actions, namely push, pop, and negate; the negate operation inverts the assertion Masks of all threads in the warp bit by bit, so that valid becomes invalid and invalid becomes valid;
the absolute jump component executes absolute jump instructions, generates the control actions (push and pop) and data for the multi-site (warp) PC stack, and generates current PC information;
the multi-site (warp) ThreadMask stack has a plurality of hardware sites (contexts), each of which stores the Mask information of all threads of one warp, and it outputs the ThreadMask information of a warp according to the warp information output by the warp scheduling component;
the multi-site (warp) PC stack has a plurality of hardware sites (contexts), each of which stores the PC information of one warp, and it outputs the PC information of a warp according to the warp information output by the instruction fetch warp scheduling component;
the current ThreadMask holds the ThreadMask information of all threads of the current warp;
the warp scheduling component selects, from a plurality of warps, one warp that is ready to run, according to information from the external scoreboard unit;
the instruction fetch warp scheduling component selects, from a plurality of warps, one warp for which instructions are to be fetched, according to the instruction cache state information of the external IF component;
the PC updating component updates the PC of the current warp according to the condition test results of all threads given by the conditional branch component, the data related to the absolute jump instruction given by the absolute jump component, and the current PC given by the current PC; the update rule is: if the condition test results of all threads do not diverge and the test condition is satisfied, the PC is updated to the jump address sent by the conditional branch component; if the condition test results diverge or the test condition is not satisfied, the PC is updated to PC + 1 as given by the current PC; if the absolute jump sent by the absolute jump component is valid, the PC is updated, according to the type of the absolute jump, to either the jump address sent by the absolute jump component or the PC at the top of the multi-site (warp) PC stack for that warp; otherwise, the PC is updated to PC + 1 as given by the current PC;
the current PC holds the PC information of the current warp.
The conditional branch component has a connection from the external ID component for receiving the control signals and data of the conditional branch instruction; the Mask stack control component has a connection from the external ID component for receiving the control signal of a Mask stack control instruction; the absolute jump component has a connection from the external ID component for receiving the control signals and data of the absolute jump instruction; the current PC has a connection to the external IF component for outputting to it the PC information of the selected warp; the instruction fetch warp scheduling component has a connection from the external IF component for receiving the empty/full information of the multi-warp instruction cache; the warp scheduling component has a connection from the external scoreboard unit for receiving information on the data dependencies, control dependencies, functional-component dependencies, write-back path dependencies and the like of the multi-warp instructions, as well as state information on instruction execution in the functional components; the current ThreadMask has a connection to the external Ex component for outputting to it the Mask information for the instruction of the current warp; the warp scheduling component has a connection to the external ID component for outputting to it the id information of the winning current warp.
The conditional branch component is connected to the PC updating component for outputting to it the conditional test results and the branch address information of all threads of the current warp; the Mask stack control component is connected to the multi-site (warp) ThreadMask stack for outputting control action information to it; the absolute jump component is connected to the PC updating component for outputting to it the data information related to the absolute jump instruction; the absolute jump component is connected to the multi-site (warp) PC stack for outputting to it the stack control action and data information; the multi-site (warp) ThreadMask stack is connected to the warp scheduling component for receiving from it the id information of the winning current warp; the current ThreadMask is connected to the multi-site (warp) ThreadMask stack for receiving from it the Mask information of all threads of the current warp; the multi-site (warp) PC stack is connected to the PC updating component for receiving from it the updated next PC of the current warp; the multi-site (warp) PC stack is connected to the instruction fetch warp scheduling component for receiving from it the id information of the winning instruction fetch warp; the multi-site (warp) PC stack is connected to the current PC for outputting to it the PC information of the instruction fetch warp; the current ThreadMask is connected to the conditional branch component for outputting to it the Mask information of all threads of the current warp; the warp scheduling component is connected to the multi-site (warp) PC stack for outputting to it the id information of the winning current warp; the current PC is connected to the PC updating component for outputting to it the PC information of the current warp; the current ThreadMask is connected to the Mask stack control component for outputting to it the Mask information of all threads of the current warp.
The instructions include conditional branch instructions, absolute jump instructions, and assertion operation instructions.
The conditional branch instruction is subdivided into two types, a jump-if-condition-satisfied instruction and a jump-if-condition-not-satisfied instruction, used respectively for jumping to the address specified by the immediate when the condition is satisfied and for jumping to the address specified by the immediate when the condition is not satisfied; the instruction format of the conditional branch instruction comprises a mnemonic, a test condition and an immediate jump address;
the absolute jump instruction is subdivided into a jump instruction, a push-and-jump instruction and a pop-and-jump instruction; the jump instruction is used for jumping to the address specified by the immediate; the push-and-jump instruction is used for saving the address of the next instruction to the stack and jumping to the address specified by the immediate; the pop-and-jump instruction is used for popping an address from the stack and jumping to that address; the instruction format of the jump instruction and of the push-and-jump instruction comprises a mnemonic and an immediate jump address; the instruction format of the pop-and-jump instruction contains only a mnemonic;
the assertion operation instruction is subdivided into an assertion push instruction, an assertion pop instruction and an assertion negation instruction; the assertion push instruction is used for pushing the current assertion Mask onto the assertion stack; the assertion pop instruction is used for popping the assertion Mask at the top of the assertion stack and making it the current assertion Mask; the assertion negation instruction is used for negating the current assertion Mask, performing a bitwise AND of the result with the assertion Mask at the top of the assertion stack, and making the outcome the current assertion Mask.
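The three assertion operation instructions can be illustrated with a minimal Python sketch. It assumes that a Mask is held as an integer with one bit per thread (1 meaning valid) and that a warp starts with every thread valid; the class name `AssertionMaskStack` and its method names are hypothetical and are not taken from the specification.

```python
class AssertionMaskStack:
    """Software model of one warp's assertion Mask stack (illustrative only)."""

    def __init__(self, num_threads):
        # Assumption: all threads of the warp are valid at start.
        self.full = (1 << num_threads) - 1
        self.current = self.full
        self.stack = []

    def push(self):
        # Assertion push: save the current Mask on the stack.
        self.stack.append(self.current)

    def pop(self):
        # Assertion pop: the stack-top Mask becomes the current Mask.
        self.current = self.stack.pop()

    def invert(self):
        # Assertion negation: negate the current Mask bitwise, then AND it
        # with the Mask at the top of the stack (the top is not removed).
        self.current = (~self.current & self.full) & self.stack[-1]


# Nested use with 4 threads: the outer region runs threads 0-1,
# the inner region narrows this to thread 0.
m = AssertionMaskStack(4)
m.push()              # save 0b1111
m.current = 0b0011    # stand-in for a Mask produced by a conditional branch
m.push()              # save 0b0011
m.current = 0b0001    # stand-in for an inner conditional branch
m.invert()            # complementary inner path, limited to the outer Mask
```

The AND with the stack top is what keeps the negated Mask inside the enclosing region: without it, the negation of `0b0001` would re-enable threads 2 and 3, which the outer branch had switched off.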
The programming model includes assertion rules, flow control rules, and PC update rules.
The assertion rules include: each instruction has an attribute indicating whether it is controlled by the assertion Mask, wherein: the absolute jump instruction and the assertion operation instruction are not controlled by the assertion Mask and are executed regardless of whether the assertion Mask is valid; the conditional branch instruction is controlled by the assertion Mask, being executed when the assertion Mask is valid and not executed when it is invalid;
the flow control rules include: the jump instruction is not allowed inside a conditional branch except to implement loop control such as while and for; the push-and-jump instruction and the pop-and-jump instruction may appear inside a conditional branch; the push-and-jump instruction and the pop-and-jump instruction are used as a pair to implement function calls, and function calls may be nested; of the two path addresses of a conditional branch, one is fixed at PC + 1 and the other must be greater than the current PC, that is, the immediate carried in the conditional branch instruction must be greater than the address of the current instruction; the absolute jump instruction supports immediate addressing; the absolute jump instruction may also support register indirect addressing, but the programmer must ensure that the register holds the same value in all threads; the user program must comply with the flow control rules, otherwise the execution flow of the program may not be as expected;
the PC update rule includes: in the designed instruction set, the only instructions capable of changing the program execution flow are the absolute jump instruction and the conditional branch instruction; the PC update strategy is as follows: when an absolute jump instruction is encountered, the PC always jumps, regardless of whether the instruction is located inside a conditional branch; when a conditional branch instruction is encountered, if the conditional branches of all active threads do not diverge and the test condition is satisfied, the PC jumps; when a conditional branch instruction is encountered, if the conditional branches of the active threads diverge, or do not diverge but the test condition is not satisfied, the PC is updated to PC + 1; when any other instruction in the instruction set is encountered, the PC is updated to PC + 1.
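The PC update strategy can be written as a single selection function. The following Python sketch is an illustration only: the function name `next_pc`, the string labels for the instruction kinds and the keyword parameters are hypothetical, and addresses are assumed to be plain integers.

```python
def next_pc(pc, kind, *, diverged=False, condition_holds=False,
            target=None, stack_top=None):
    """Return the next PC of a warp under the PC update rule (illustrative).

    kind is one of "jump", "push_jump", "pop_jump", "cond_branch", "other".
    """
    if kind in ("jump", "push_jump"):
        # Absolute jump with an immediate: the PC always jumps.
        return target
    if kind == "pop_jump":
        # Absolute jump to the address popped from the warp's PC stack.
        return stack_top
    if kind == "cond_branch":
        # Jump only when the active threads agree and the condition holds.
        if not diverged and condition_holds:
            return target
        return pc + 1
    # Every other instruction falls through.
    return pc + 1
```

For example, `next_pc(10, "cond_branch", diverged=True, condition_holds=True, target=20)` falls through to 11, because a divergent branch always continues at PC + 1 and leaves the choice of executing threads to the Mask.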
The invention also provides a SIMT conditional branching processing method, which comprises the following steps:
1) the branch processing device hardware judges the type of the instruction input from the outside; respectively executing a conditional branch flow, a Mask stack control flow and an absolute jump flow according to the corresponding types;
2) conditional branch flow:
2.1) the conditional branch component carries out conditional test according to the input operation code and operand and the input current ThreadMask information;
2.2) the conditional branch component informs the PC updating component of the conditional test results of all threads and of the updated PC value, according to the result of the conditional test in step 2.1);
2.3) the conditional branch component generates new Mask information according to the result of the conditional test in step 2.1) and informs the multi-site (warp) ThreadMask stack; the multi-site (warp) ThreadMask stack selects one of the plurality of warp sites according to the current warp information input by the warp scheduling component and outputs the Mask values of all threads of that warp to the current ThreadMask; the current ThreadMask holds the received Mask information of all threads of the current warp and outputs it to the external Ex component, the conditional branch component and the Mask stack control component respectively;
3) the Mask stack control flow is as follows: according to the operation code input by the ID component and the input Masks of all threads of the current warp, the Mask stack control component informs the multi-site (warp) ThreadMask stack of the action to be executed (push, pop or negation) and, for a push, of the Mask values of all threads;
4) the absolute jump flow is as follows: according to the operation code and operand input by the ID component, the absolute jump component informs the PC updating component of the operand carried by the absolute jump instruction and informs the multi-site (warp) PC stack of the operation to be carried out (push or pop); the multi-site (warp) PC stack acquires the updated PC value from the PC updating component and executes the corresponding operation according to the operation input by the absolute jump component.
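The absolute jump flow of step 4) relies on one PC stack per warp site. A minimal Python sketch of that structure follows, assuming that the saved address is PC + 1 (the "next instruction address" of the push-and-jump instruction) and that stack depth is unbounded; the class name `MultiSitePCStack` and its method names are hypothetical.

```python
class MultiSitePCStack:
    """One return-address stack per warp site (illustrative model)."""

    def __init__(self, num_warps):
        self.sites = [[] for _ in range(num_warps)]

    def push_and_jump(self, warp, pc, target):
        # Save the address of the next instruction, then jump to target.
        self.sites[warp].append(pc + 1)
        return target

    def pop_and_jump(self, warp):
        # Pop the saved address of this warp and jump to it.
        return self.sites[warp].pop()


# Nested calls in warp 0; warp 1 keeps its own, untouched site.
pcs = MultiSitePCStack(num_warps=2)
pc = pcs.push_and_jump(0, pc=4, target=100)    # call from address 4
pc = pcs.push_and_jump(0, pc=102, target=200)  # nested call from address 102
pc = pcs.pop_and_jump(0)                       # inner return
pc = pcs.pop_and_jump(0)                       # outer return
```

Keeping a separate site per warp means that a warp switch never disturbs the call chain of another warp, which is why the stack is indexed by the warp information supplied by the schedulers.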
The technical solution of the present invention is further described in detail with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, a SIMT conditional branch processing apparatus of an embodiment of the present invention includes branch processing apparatus hardware 101, instructions 102 and a programming model 103.
The branch processing device hardware 101 is a hardware carrier of the SIMT branch processing device, and is represented in the form of hardware; the instruction 102 is not a hardware device, but is a device use interface provided for a user, and is represented in the form of an instruction; the programming model 103 is not a hardware device, but rather provides the user with interface usage rules in the form of constraints and limitations on the use of instruction sequences.
The branch processing apparatus hardware 101 is a hardware carrier of external user code execution, having a relationship 132 from the external user code; the instructions 102 provide a usage interface of the apparatus for an external user code, having a relationship 130 to the external user code; the programming model 103 provides external user code with constraints and limits on the instruction sequence in use, with a relationship 131 to the external user code;
the branch processing apparatus hardware 101 operates with instructions 102, i.e. has dependencies 134 from the instructions 102; the implementation of the branch processing apparatus hardware 101 is premised on the programming model 103, i.e. has a dependency 137 on the programming model 103; the instructions 102 are executed by the branch processing apparatus hardware 101, i.e. have dependencies 136 from the branch processing apparatus hardware 101; the use of the instructions 102 must adhere to the programming model 103, i.e., have dependencies 135 from the programming model 103; the implementation of the programming model 103 depends on the sequence of instructions 102, i.e. has dependencies 133 from the instructions 102; the programming model 103 is embodied in code executing on the branch processing apparatus hardware 101, i.e. the programming model 103 has dependencies 138 from the branch processing apparatus hardware 101.
Referring to fig. 2, the branch processing apparatus hardware 101 of the present invention is composed of a conditional branch unit 201, a Mask stack control unit 202, an absolute jump unit 203, a multi-site (Warp) ThreadMask stack 204, a multi-site (Warp) PC stack 205, a current ThreadMask206, a Warp scheduling unit 207, an instruction fetch Warp scheduling unit 208, a PC updating unit 209, and a current PC 210;
the conditional branch component 201 is used for executing the conditional branch instruction and, according to the execution result, generating the assertion Mask information of all threads, an indication of whether the results of all threads diverge, and the branch address information, wherein the assertion Mask is generated by the following rule: if the condition of the branch instruction is met for a given thread in the warp, the assertion Mask bit corresponding to that thread is valid; otherwise it is invalid;
the Mask stack control component 202 is used for executing a Mask stack control instruction and generating the corresponding control action, wherein the control actions comprise push, pop and negation, and the negation operation follows this rule: the assertion Masks of all threads in the warp are negated bitwise, so that valid bits become invalid and invalid bits become valid;
the absolute jump component 203 is used for executing an absolute jump instruction, generating the control actions and data of the multi-site (warp) PC stack 205, wherein the actions comprise push and pop, and generating current PC information;
the multi-site (warp) ThreadMask stack 204 has a plurality of hardware sites, each site being used for storing the Mask information of all threads of one warp, and can output the ThreadMask information of a warp according to the warp information output by the warp scheduling component 207;
the multi-site (warp) PC stack 205 has a plurality of hardware sites, each site being used for storing the PC information of one warp, and can output the PC information of a warp according to the warp information output by the instruction fetch warp scheduling component 208;
the current ThreadMask 206 is used for holding the ThreadMask information of all threads of the current warp;
the warp scheduling component 207 is used for selecting, from a plurality of warps, one warp that is ready to run according to the information of the external scoreboard unit;
the instruction fetch warp scheduling component 208 is used for selecting, from a plurality of warps, one warp whose instructions are to be fetched according to the state information of the cache in the external IF component;
the PC updating component 209 is used for updating the PC of the current warp according to the conditional test results of all threads given by the conditional branch component 201, the data information related to the absolute jump instruction given by the absolute jump component 203, and the current PC given by the current PC 210, the specific update rule being: if the conditional test results of all threads given by the conditional branch component 201 do not diverge and the test condition is satisfied, the PC is updated to the jump address sent by the conditional branch component 201; if the conditional test results of all threads given by the conditional branch component 201 diverge or the test condition is not satisfied, the PC is updated to PC + 1 as given by the current PC 210; if the absolute jump sent by the absolute jump component 203 is valid, the PC is updated, according to the type of the absolute jump, either to the jump address sent by the absolute jump component 203 or to the stack-top PC of the site of the multi-site (warp) PC stack 205 corresponding to the warp; in all other cases, the PC is updated to PC + 1 as given by the current PC 210;
the current PC 210 is used for holding the PC information of the current warp;
the conditional branch component 201 has a connection 230 from the external ID component for receiving from it the control signals and data of the conditional branch instruction; the Mask stack control component 202 has a connection 231 from the external ID component for receiving from it the control signal of a Mask stack control instruction; the absolute jump component 203 has a connection 232 from the external ID component for receiving from it the control signals and data of the absolute jump instruction; the current PC 210 has a connection 233 to the external IF component for outputting to it the PC information of the selected warp; the instruction fetch warp scheduling component 208 has a connection 234 from the external IF component for receiving from it the empty/full information of the multi-warp instruction cache; the warp scheduling component 207 has a connection 235 from the external scoreboard unit for receiving from it information on the data dependencies, control dependencies, functional-component dependencies, write-back path dependencies and the like of the multi-warp instructions, as well as state information on instruction execution in the functional components; the current ThreadMask 206 has a connection 236 to the external Ex component for outputting to it the Mask information for the instruction of the current warp; the warp scheduling component 207 has a connection 237 to the external ID component for outputting to it the id information of the winning current warp;
the conditional branch component 201 has a connection 248 to the PC updating component 209 for outputting to it the conditional test results and branch address information of all threads of the current warp; the Mask stack control component 202 has a connection 249 to the multi-site (warp) ThreadMask stack 204 for outputting control action information to it; the absolute jump component 203 has a connection 250 to the PC updating component 209 for outputting to it the data information related to the absolute jump instruction; the absolute jump component 203 has a connection 251 to the multi-site (warp) PC stack 205 for outputting to it the stack control action and data information; the multi-site (warp) ThreadMask stack 204 has a connection 252 from the warp scheduling component 207 for receiving from it the id information of the winning current warp; the current ThreadMask 206 has a connection 253 from the multi-site (warp) ThreadMask stack 204 for receiving from it the Mask information of all threads of the current warp; the multi-site (warp) PC stack 205 has a connection 254 from the PC updating component 209 for receiving from it the updated next PC of the current warp; the multi-site (warp) PC stack 205 has a connection 255 from the instruction fetch warp scheduling component 208 for receiving from it the id information of the winning instruction fetch warp; the multi-site (warp) PC stack 205 has a connection 256 to the current PC 210 for outputting to it the PC information of the instruction fetch warp; the current ThreadMask 206 has a connection 257 to the conditional branch component 201 for outputting to it the Mask information of all threads of the current warp; the warp scheduling component 207 has a connection 258 to the multi-site (warp) PC stack 205 for outputting to it the id information of the winning current warp; the current PC 210 has a connection 259 to the PC updating component 209 for outputting to it the PC information of the current warp; the current ThreadMask 206 has a connection 260 to the Mask stack control component 202 for outputting to it the Mask information of all threads of the current warp.
Wherein: PC stands for Program Counter.
Referring to fig. 3, the steps of the SIMT conditional branch processing method according to the embodiment of the present invention are as follows:
the branch processing device hardware 101 has the following working procedures: determining a type of an externally input instruction; respectively executing a conditional branch flow, a Mask stack control flow and an absolute jump flow according to the corresponding types;
the conditional branch flow is as follows: the conditional branch component 201 performs the conditional test according to the operation code and operand input through connection 230 and the current ThreadMask information input through connection 257; the conditional branch component 201 informs the PC updating component 209, through connection 248, of the conditional test results of all threads and of the updated PC value according to the result of the conditional test; the conditional branch component 201 generates new Mask information according to the result of the conditional test and informs the multi-site (warp) ThreadMask stack 204 through connection 247; the multi-site (warp) ThreadMask stack 204 selects one of the plurality of warp sites according to the current warp information input through connection 252 and outputs the Mask values of all threads of that warp through connection 253; the current ThreadMask 206 holds the Mask information of all threads of the current warp received through connection 253 and outputs it to the external Ex component, the conditional branch component 201 and the Mask stack control component 202 through connection 236, connection 257 and connection 260 respectively;
the Mask stack control flow is as follows: according to the operation code input through connection 231 and the Masks of all threads of the current warp input through connection 260, the Mask stack control component 202 informs the multi-site (warp) ThreadMask stack 204, through connection 249, of the action to be executed (push, pop or negation) and, for a push, of the Mask values of all threads;
the absolute jump flow is as follows: according to the operation code and operand input through connection 232, the absolute jump component 203 informs the PC updating component 209, through connection 250, of the operand carried by the absolute jump instruction and informs the multi-site (warp) PC stack 205, through connection 251, of the operation to be performed (push or pop); the multi-site (warp) PC stack 205 acquires the updated PC value through connection 254 and performs the corresponding operation according to the operation input through connection 251.
The instructions 102 include: a conditional branch instruction 102C, an absolute jump instruction 102J and an assertion operation instruction 102M;
the conditional branch instruction 102C is subdivided into two types, a jump-if-condition-satisfied instruction BEQZ and a jump-if-condition-not-satisfied instruction BNEZ, used respectively for jumping to the address specified by the immediate when the condition is satisfied and for jumping to the address specified by the immediate when the condition is not satisfied; the instruction format of the conditional branch instruction 102C is Opcode RS, #imm, where Opcode is a mnemonic, RS is a source register, and #imm is an immediate;
the absolute jump instruction 102J is subdivided into a jump instruction J, a push-and-jump instruction JS and a pop-and-jump instruction JR; the jump instruction J is used for jumping to the address specified by the immediate; the push-and-jump instruction JS is used for saving the address of the next instruction to the stack and jumping to the address specified by the immediate; the pop-and-jump instruction JR is used for popping an address from the stack and jumping to that address; the instruction format of the jump instruction J and of the push-and-jump instruction JS is Opcode #imm, and the instruction format of the pop-and-jump instruction JR contains only Opcode, where Opcode is a mnemonic and #imm is an immediate;
the assertion operation instruction 102M is subdivided into an assertion push instruction PUSH, an assertion pop instruction POP and an assertion negation instruction INV; the assertion push instruction PUSH is used for pushing the current assertion Mask onto the assertion stack; the assertion pop instruction POP is used for popping the assertion Mask at the top of the assertion stack and making it the current assertion Mask; the assertion negation instruction INV is used for negating the current assertion Mask, performing a bitwise AND of the result with the assertion Mask at the top of the assertion stack, and making the outcome the current assertion Mask.
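The three instruction formats (Opcode RS, #imm; Opcode #imm; Opcode alone) can be told apart with a small textual decoder. The Python sketch below is illustrative: the exact assembly spelling (whitespace, decimal immediates, register names such as R1) is an assumption, since the specification fixes only the mnemonics and the operand kinds, and the function name `decode` is hypothetical.

```python
import re

# Assumed textual syntax; only the mnemonics come from the specification.
_COND = re.compile(r"^(BEQZ|BNEZ)\s+(\w+)\s*,\s*#(\d+)$")
_JUMP = re.compile(r"^(JS|J)\s+#(\d+)$")


def decode(text):
    """Classify one instruction as 102C, 102J or 102M and extract its operands."""
    text = text.strip()
    m = _COND.match(text)
    if m:
        return {"class": "102C", "opcode": m.group(1),
                "rs": m.group(2), "imm": int(m.group(3))}
    m = _JUMP.match(text)
    if m:
        return {"class": "102J", "opcode": m.group(1), "imm": int(m.group(2))}
    if text == "JR":
        # Pop-and-jump carries no operand: the target comes from the PC stack.
        return {"class": "102J", "opcode": "JR"}
    if text in ("PUSH", "POP", "INV"):
        return {"class": "102M", "opcode": text}
    raise ValueError(f"not a branch-processing instruction: {text!r}")
```

The class returned by such a decoder corresponds to the type decision at the start of the working procedure, which routes the instruction to the conditional branch flow, the absolute jump flow or the Mask stack control flow.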
The programming model 103 comprises an assertion rule 103M, a flow control rule 103F and a PC update rule 103P;
the assertion rule 103M includes: each of the instructions 102 has an attribute indicating whether it is controlled by the assertion Mask, wherein: the absolute jump instruction 102J and the assertion operation instruction 102M are not controlled by the assertion Mask and are executed regardless of whether the assertion Mask is valid; the conditional branch instruction 102C is controlled by the assertion Mask, being executed when the assertion Mask is valid and not executed when it is invalid;
the flow control rule 103F includes: the jump instruction J is not allowed inside a conditional branch except to implement loop control such as while and for; the push-and-jump instruction JS and the pop-and-jump instruction JR may appear inside a conditional branch; the push-and-jump instruction JS and the pop-and-jump instruction JR are used as a pair to implement function calls, and function calls may be nested; of the two path addresses of a conditional branch, one is fixed at PC + 1 and the other must be greater than the current PC, that is, the immediate carried in the conditional branch instruction 102C must be greater than the address of the current instruction; the absolute jump instruction 102J supports immediate addressing; the absolute jump instruction 102J may also support register indirect addressing, but the programmer must ensure that the register holds the same value in all threads; the user program must comply with the flow control rule 103F, otherwise the execution flow of the program may not be as expected;
the PC update rule 103P includes: in the designed instruction set, the only instructions capable of changing the program execution flow are the absolute jump instruction 102J and the conditional branch instruction 102C; the PC update strategy is as follows: when an absolute jump instruction 102J is encountered, the PC always jumps, regardless of whether the instruction is located inside a conditional branch; when a conditional branch instruction 102C is encountered, if the conditional branches of all active threads do not diverge and the test condition is satisfied, the PC jumps; when a conditional branch instruction 102C is encountered, if the conditional branches of the active threads diverge, or do not diverge but the test condition is not satisfied, the PC is updated to PC + 1; when any instruction in the instruction set other than the instructions 102 is encountered, the PC is updated to PC + 1.
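Because the hardware assumes that user code obeys the flow control rule 103F, a toolchain could check the forward-only constraint on conditional branches before a program is run. The following Python sketch shows one such check; it assumes a program is a list of `(opcode, target)` pairs indexed by instruction address, and the function name and the non-branch opcode `ADD` are hypothetical.

```python
CONDITIONAL_BRANCHES = {"BEQZ", "BNEZ"}


def backward_conditional_branches(program):
    """Return the addresses of conditional branches that violate rule 103F.

    A conditional branch is valid only if its immediate target is strictly
    greater than its own address.
    """
    violations = []
    for pc, (opcode, target) in enumerate(program):
        if opcode in CONDITIONAL_BRANCHES and target <= pc:
            violations.append(pc)
    return violations


program = [
    ("BEQZ", 3),    # address 0: forward target, allowed
    ("ADD", None),  # address 1: hypothetical non-branch instruction
    ("BNEZ", 1),    # address 2: backward target, violates the rule
    ("J", 0),       # address 3: backward absolute jump, allowed for loops
]
bad = backward_conditional_branches(program)
```

Only address 2 is reported here. The backward absolute jump at address 3 is not flagged, since the rule restricts the targets of conditional branches and leaves loops to the jump instruction J.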
For the purposes of the present invention: threads diverge when the condition assertions of the threads are not all the same at a conditional branch, and threads do not diverge when the condition assertions of all threads are the same; a PC jump refers to the PC being updated to a value other than PC + 1.
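The Mask generation rule of the conditional branch component and the definition of divergence can be combined in a short Python sketch. It assumes that Masks are integers with one bit per thread and that threads which are already invalid are ignored and stay invalid, a reading suggested by the rule's reference to active threads; the function name `evaluate_branch` is hypothetical.

```python
def evaluate_branch(active_mask, condition_results, num_threads):
    """Return (new_mask, diverged) for one warp at a conditional branch.

    active_mask: bit t is 1 if thread t is currently valid.
    condition_results: per-thread booleans, True where the condition is met.
    """
    new_mask = 0
    for t in range(num_threads):
        # A bit is valid only for an active thread whose condition is met.
        if (active_mask >> t) & 1 and condition_results[t]:
            new_mask |= 1 << t
    # Divergence: some active threads meet the condition and some do not.
    diverged = new_mask != 0 and new_mask != active_mask
    return new_mask, diverged


mask, diverged = evaluate_branch(0b1111, [True, False, True, False], 4)
```

In this call threads 0 and 2 meet the condition and threads 1 and 3 do not, so the warp diverges and, under the PC update rule, continues at PC + 1.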
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. An SIMT conditional branch processing apparatus, comprising: the SIMT conditional branch processing apparatus comprises branch processing apparatus hardware, instructions and a programming model; the branch processing apparatus hardware is connected with instructions and programming models respectively, the instructions and programming models are connected, wherein:
the branch processing device hardware is a hardware carrier for executing external user codes and has a relationship from the external user codes; the instructions provide an external user code with a use interface of the apparatus, having a relationship to the external user code; the programming model provides external user code with constraints and limitations of instruction sequences in use, with a relationship to the external user code;
the branch processing apparatus hardware operates by instructions, having dependencies from the instructions; the hardware of the branch processing device is realized on the premise of a programming model and has a dependency relationship on the programming model; the instructions are executed by branch processing apparatus hardware, having dependencies from the branch processing apparatus hardware; the use of the instructions must adhere to the programming model, with dependencies from the programming model; the implementation of the programming model depends on the sequence of instruction components, with dependencies from the instructions; the programming model is embodied in code that executes on the branch processing apparatus hardware, the programming model having dependencies from the branch processing apparatus hardware.
2. The SIMT conditional branch processing apparatus according to claim 1, wherein: the branch processing device hardware comprises a conditional branch component, a Mask stack control component, an absolute jump component, a multi-field ThreadMask stack, a multi-field PC stack, a current ThreadMask, a Warp scheduling component, an instruction fetch Warp scheduling component, a PC updating component and a current PC; wherein
The conditional branch component is used for executing the conditional branch instruction and, according to the execution result, generating the assertion Mask information of all threads, information on whether the results of the threads diverge, and branch address information, wherein the assertion Mask is generated by the following rule: if the condition of the branch instruction is satisfied for a given thread in the warp, the assertion Mask corresponding to that thread is valid; otherwise, the assertion Mask corresponding to that thread is invalid; the Mask stack control component is used for executing a Mask stack control instruction and generating the corresponding control actions, the control actions comprising push, pop and negation, wherein the negation operation inverts the assertion Masks of all threads in the warp bit by bit, so that valid Masks become invalid and invalid Masks become valid; the absolute jump component is used for executing an absolute jump instruction, generating the control actions and data of the multi-site PC stack, the actions comprising push and pop, and generating current PC information; the multi-site ThreadMask stack has a plurality of hardware sites, each site being used for storing the Mask information of all threads of one warp, and can output the ThreadMask information of a warp according to the warp information output by the warp scheduling component; the multi-site PC stack has a plurality of hardware sites, each site being used for storing the PC information of one warp, and can output the PC information of a warp according to the warp information output by the instruction fetch warp scheduling component; the current ThreadMask is used for holding the ThreadMask information of all threads of the current warp; the warp scheduling component is used for selecting, from a plurality of warps, one warp that is ready to run according to the information of the external scoreboard unit; the instruction fetch warp scheduling component is used for selecting one warp to be fetched from a plurality of warps according to the state information of the cache in the external IF component; the PC updating component is used for updating the PC of the current warp according to the condition test results of all threads given by the conditional branch component, the data information related to the absolute jump instruction given by the absolute jump component, and the current PC value given by the current PC, and the specific updating rule is as follows: if the condition test results of all threads given by the conditional branch component do not diverge and the test condition is satisfied, the PC is updated to the jump address sent by the conditional branch component; if the condition test results of all threads given by the conditional branch component diverge or the test condition is not satisfied, the PC is updated to PC + 1 sent by the current PC; if the absolute jump sent by the absolute jump component is valid, either the jump address sent by the absolute jump component or the stack-top PC of the corresponding warp in the multi-site PC stack is selected according to the type of the absolute jump; otherwise, the PC is updated to PC + 1 given by the current PC; the current PC is used for holding the PC information of the current warp.
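The assertion Mask generation rule and the bitwise negation described in this claim can be sketched in software as follows. This is an illustrative model only, not the claimed hardware: the function names, the use of one integer bit per thread, and the gating of the test result by the current ThreadMask are assumptions of the sketch.

```python
def evaluate_conditional_branch(active_mask, cond_bits, warp_size=32):
    """Return (assertion_mask, diverged) for one warp.

    Bit i of each integer corresponds to thread i of the warp.
    active_mask: current ThreadMask of the warp (assumed to gate the test).
    cond_bits:   per-thread result of the branch condition test.
    """
    full = (1 << warp_size) - 1
    active = active_mask & full
    # A thread's assertion bit is valid only if its condition is satisfied.
    assertion_mask = cond_bits & active
    # Active threads whose condition is not satisfied.
    failed = active & ~assertion_mask
    # The warp diverges when some active threads pass and others fail.
    diverged = assertion_mask != 0 and failed != 0
    return assertion_mask, diverged


def negate_mask(mask, warp_size=32):
    """Bitwise negation: valid bits become invalid and invalid bits valid."""
    return ~mask & ((1 << warp_size) - 1)


# Example with a hypothetical 4-thread warp: threads 0 and 2 pass the test.
mask, diverged = evaluate_conditional_branch(0b1111, 0b0101, warp_size=4)
```

In this model a warp whose active threads all pass, or all fail, reports no divergence, which is the case in which the PC updating component can act on the test result for the whole warp.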
3. The SIMT conditional branch processing apparatus according to claim 2, wherein: the conditional branch component is connected with the external ID component and is used for receiving control signals and data of the conditional branch instruction from the ID component; the Mask stack control component is connected with the external ID component and is used for receiving the control signal of the Mask stack control instruction from the ID component; the absolute jump component is connected with the external ID component and is used for receiving control signals and data of the absolute jump instruction from the ID component; the current PC is connected with the external IF component and is used for outputting the PC information of the selected warp to the IF component; the instruction fetch warp scheduling component is connected with the external IF component and is used for receiving the empty/full information of the multi-warp instruction cache from the IF component; the warp scheduling component is connected with the external scoreboard unit and is used for receiving from it information on the data dependencies, control dependencies, functional unit dependencies, write-back path dependencies and the like of the multi-warp instructions, together with the state information of instruction execution on the functional units; the current ThreadMask is connected with the external Ex component and is used for outputting the Mask information of the instruction corresponding to the current warp to the Ex component; the warp scheduling component is connected with the external ID component and is used for outputting the id information of the winning current warp to the ID component.
4. The SIMT conditional branch processing apparatus according to claim 3, wherein: the conditional branch component is connected with the PC updating component and is used for outputting the condition test results and the branch address information of all threads of the current warp to the PC updating component; the Mask stack control component is connected with the multi-site ThreadMask stack and is used for outputting control action information to the multi-site ThreadMask stack; the absolute jump component is connected with the PC updating component and is used for outputting data information related to the absolute jump instruction to the PC updating component; the absolute jump component is connected with the multi-site PC stack and is used for outputting the control actions and data information of the stack to the multi-site PC stack; the multi-site ThreadMask stack is connected with the warp scheduling component and is used for receiving the id information of the winning current warp from the warp scheduling component; the current ThreadMask is connected with the multi-site ThreadMask stack and is used for receiving the Mask information of all threads of the current warp from the multi-site ThreadMask stack; the multi-site PC stack is connected with the PC updating component and is used for receiving the updated next PC of the current warp from the PC updating component; the multi-site PC stack is connected with the instruction fetch warp scheduling component and is used for receiving the id information of the winning fetch warp from the instruction fetch warp scheduling component; the multi-site PC stack is connected with the current PC and is used for outputting the PC information of the fetch warp to the current PC; the current ThreadMask is connected with the conditional branch component and is used for outputting the Mask information of all threads of the current warp to the conditional branch component; the warp scheduling component is connected with the multi-site PC stack and is used for outputting the id information of the winning current warp to the multi-site PC stack; the current PC is connected with the PC updating component and is used for outputting the PC information of the current warp to the PC updating component; and the current ThreadMask is connected with the Mask stack control component and is used for outputting the Mask information of all threads of the current warp to the Mask stack control component.
5. The SIMT conditional branch processing apparatus according to claim 4, wherein: the instructions include conditional branch instructions, absolute jump instructions, and assertion operation instructions.
6. The SIMT conditional branch processing apparatus according to claim 5, wherein: the conditional branch instruction is subdivided into two types, a jump-on-condition-satisfied instruction and a jump-on-condition-not-satisfied instruction, which are respectively used for jumping to the address specified by an immediate when the condition is satisfied and for jumping to the address specified by the immediate when the condition is not satisfied; the instruction format of the conditional branch instruction comprises a mnemonic, a test condition and an immediate jump address;
the absolute jump instruction is subdivided into a jump instruction, a push and jump instruction, and a pop and jump instruction; the jump instruction is used for jumping to the address specified by an immediate; the push and jump instruction is used for storing the address of the next instruction on the stack and jumping to the address specified by the immediate; the pop and jump instruction is used for popping an address from the stack and jumping to that address; the instruction formats of the jump instruction and the push and jump instruction comprise a mnemonic and an immediate jump address; the instruction format of the pop and jump instruction comprises only a mnemonic;
the assertion operation instruction is subdivided into an assertion push instruction, an assertion pop instruction and an assertion negation instruction; the assertion push instruction is used for pushing the current assertion Mask onto the assertion stack; the assertion pop instruction is used for popping the assertion Mask at the top of the assertion stack and taking it as the current assertion Mask; and the assertion negation instruction is used for negating the current assertion Mask, performing a bitwise AND of the negated Mask with the assertion Mask at the top of the assertion stack, and taking the result as the current assertion Mask.
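The three assertion operation instructions can be illustrated with a small software model of a multi-site ThreadMask stack. The sketch below is an assumption-laden illustration rather than the claimed circuit: the class name `ThreadMaskSites`, its method names, and the `apply_test` helper (which narrows the current Mask with a condition test result) are hypothetical and do not appear in the source.

```python
class ThreadMaskSites:
    """Per-warp assertion stacks plus a current assertion Mask per warp."""

    def __init__(self, num_warps, warp_size=32):
        self.full = (1 << warp_size) - 1
        self.current = [self.full] * num_warps      # all threads start valid
        self.stacks = [[] for _ in range(num_warps)]

    def push(self, warp):
        # Assertion push: save the current assertion Mask on the stack.
        self.stacks[warp].append(self.current[warp])

    def pop(self, warp):
        # Assertion pop: the stack top becomes the current assertion Mask.
        self.current[warp] = self.stacks[warp].pop()

    def negate(self, warp):
        # Assertion negation: NOT(current) AND stack top, so threads that
        # were already invalid in the enclosing scope stay invalid.
        top = self.stacks[warp][-1]
        self.current[warp] = ~self.current[warp] & top & self.full

    def apply_test(self, warp, cond_bits):
        # Hypothetical helper: narrow the current Mask by a test result.
        self.current[warp] &= cond_bits


# if/else on warp 0 of a hypothetical 4-thread warp:
sites = ThreadMaskSites(num_warps=2, warp_size=4)
sites.push(0)                 # save the Mask of the enclosing scope
sites.apply_test(0, 0b0011)   # threads 0 and 1 take the "if" path
if_mask = sites.current[0]
sites.negate(0)               # remaining threads take the "else" path
else_mask = sites.current[0]
sites.pop(0)                  # restore the enclosing Mask after the branch
```

The AND with the stack top is what keeps nested branches correct: inside an inner branch, negation can only re-enable threads that were valid in the enclosing branch.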
7. The SIMT conditional branch processing apparatus according to claim 6, wherein: the programming model includes assertion rules, flow control rules, and PC update rules.
8. The SIMT conditional branch processing apparatus according to claim 7, wherein: the assertion rule includes: each instruction in the instruction set has an attribute indicating whether it is controlled by the assertion Mask, wherein: the absolute jump instruction and the assertion operation instruction are not controlled by the assertion Mask and are executed regardless of whether the assertion Mask is valid; the conditional branch instruction is controlled by the assertion Mask and is executed when the assertion Mask is valid and not executed when it is invalid;
the flow control rule comprises: the jump instruction is not allowed inside a conditional branch, except where it implements loop control such as while and for; the push and jump instruction and the pop and jump instruction may appear inside a conditional branch; the push and jump instruction and the pop and jump instruction are used as a pair to implement function calls, and function calls may be nested; of the two path addresses of a conditional branch, one is fixed at PC + 1 and the other must be larger than the current PC and is not allowed to be smaller than it, that is, the immediate carried in the conditional branch instruction is larger than the address of the current instruction; the absolute jump instruction supports immediate addressing; the absolute jump instruction can also support register indirect addressing, but the programmer must ensure that the value in the register is the same in all threads; the user program must conform to the flow control rules, otherwise the execution flow of the program may not be as expected;
the PC update rule includes: in the designed instruction set, the only instructions capable of changing the program execution flow are absolute jump instructions and conditional branch instructions; the PC update strategy is as follows: when an absolute jump instruction is encountered, the PC always jumps, regardless of whether the instruction is located inside a conditional branch; when a conditional branch instruction is encountered, if the conditional branches of all active threads do not diverge and the test condition is satisfied, the PC jumps; when a conditional branch instruction is encountered, if the conditional branches of the active threads diverge, or do not diverge but the test condition is not satisfied, the PC is updated to PC + 1; when any other instruction in the instruction set is encountered, the PC is updated to PC + 1.
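The PC update rule, together with the forward-only constraint on conditional branch targets from the flow control rule, can be written as a short decision function. This is a sketch under stated assumptions: the function name `next_pc`, the string tags for the instruction kind, and raising `ValueError` on a backward conditional target are illustrative choices and are not taken from the source.

```python
def next_pc(pc, kind, *, target=None, diverged=False, condition_met=False):
    """Return the next PC of the current warp.

    kind: "absolute_jump", "conditional_branch", or any other tag.
    target: jump address (for a pop and jump, the popped stack-top PC).
    diverged / condition_met: condition test summary for the active threads.
    """
    if kind == "absolute_jump":
        # An absolute jump always changes the PC, even inside a branch.
        return target
    if kind == "conditional_branch":
        # Flow control rule: the immediate must exceed the current address.
        if target is None or target <= pc:
            raise ValueError("conditional branch target must be > current PC")
        if not diverged and condition_met:
            return target          # the whole warp takes the jump
        return pc + 1              # divergence, or condition not satisfied
    return pc + 1                  # every other instruction


# A uniform, satisfied branch at address 10 jumps; a diverged one falls through.
taken = next_pc(10, "conditional_branch", target=20, condition_met=True)
fallthrough = next_pc(10, "conditional_branch", target=20,
                      diverged=True, condition_met=True)
```

Falling through to PC + 1 on divergence means both paths of a diverged branch are walked sequentially, with the assertion Mask deciding which threads actually execute each instruction.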
9. A method of implementing the SIMT conditional branch processing apparatus according to claim 1, comprising the following steps:
1) the hardware of the branch processing device determines the type of the externally input instruction and executes the conditional branch flow, the Mask stack control flow or the absolute jump flow according to that type;
2) conditional branch flow:
2.1) the conditional branch component performs the condition test according to the input operation code and operand and the input current ThreadMask information;
2.2) the conditional branch component informs the PC updating component of the condition test results of all threads and of the updated PC value according to the condition test of step 2.1);
2.3) the conditional branch component generates new Mask information according to the condition test of step 2.1) and informs the multi-site ThreadMask stack; the multi-site ThreadMask stack selects one of the plurality of warp sites according to the current warp information input by the warp scheduling component and outputs the Mask values of all threads of that warp to the current ThreadMask; the current ThreadMask holds the received Mask information of all threads of the current warp and outputs it to the external Ex component, the conditional branch component and the Mask stack control component respectively;
3) the Mask stack control flow is as follows: according to the operation code input by the ID component and the Masks of all threads of the current warp, the Mask stack control component informs the multi-site ThreadMask stack of the action to be executed, namely push, pop or negation, and of the Mask values of all threads when pushing;
4) the absolute jump flow is as follows: according to the operation code and the operand input by the ID component, the absolute jump component informs the PC updating component of the operand carried by the absolute jump instruction and informs the multi-site PC stack of the operation to be carried out, namely push or pop; the multi-site PC stack obtains the updated PC value from the PC updating component and executes the corresponding operation according to the operation input by the absolute jump component.
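The absolute jump flow of step 4) can be modelled with one PC stack per warp site. The following sketch is illustrative only: the class name `PCSites` and the mnemonics `JMP`, `PUSH_JMP` and `POP_JMP` are hypothetical stand-ins for the jump, push and jump, and pop and jump instructions, whose actual encodings are not given in this section.

```python
class PCSites:
    """One PC and one PC stack per warp site (software model)."""

    def __init__(self, num_warps):
        self.pc = [0] * num_warps
        self.stacks = [[] for _ in range(num_warps)]

    def absolute_jump(self, warp, op, target=None):
        if op == "JMP":
            # Jump: go to the address given by the immediate.
            self.pc[warp] = target
        elif op == "PUSH_JMP":
            # Push and jump: save the next instruction address, then jump.
            self.stacks[warp].append(self.pc[warp] + 1)
            self.pc[warp] = target
        elif op == "POP_JMP":
            # Pop and jump: return to the address popped from the stack.
            self.pc[warp] = self.stacks[warp].pop()
        else:
            raise ValueError(f"unknown absolute jump type: {op}")
        return self.pc[warp]


# Nested function calls on warp 0; warp 1 keeps its own independent state.
sites = PCSites(num_warps=2)
sites.pc[0] = 5
sites.absolute_jump(0, "PUSH_JMP", 100)   # call: return address 6 is saved
sites.absolute_jump(0, "PUSH_JMP", 200)   # nested call: 101 is saved
inner_return = sites.absolute_jump(0, "POP_JMP")
outer_return = sites.absolute_jump(0, "POP_JMP")
```

Keeping a separate stack per site is what lets the scheduler switch between warps in the middle of a call chain without saving or restoring return addresses.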
CN202011404147.9A 2020-12-05 2020-12-05 SIMT conditional branch processing device and method Active CN112579164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011404147.9A CN112579164B (en) 2020-12-05 2020-12-05 SIMT conditional branch processing device and method

Publications (2)

Publication Number Publication Date
CN112579164A true CN112579164A (en) 2021-03-30
CN112579164B CN112579164B (en) 2022-10-25

Family

ID=75127105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011404147.9A Active CN112579164B (en) 2020-12-05 2020-12-05 SIMT conditional branch processing device and method

Country Status (1)

Country Link
CN (1) CN112579164B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114253821A (en) * 2022-03-01 2022-03-29 西安芯瞳半导体技术有限公司 Method and device for analyzing GPU performance and computer storage medium
CN114253821B (en) * 2022-03-01 2022-05-27 西安芯瞳半导体技术有限公司 Method and device for analyzing GPU performance and computer storage medium

Also Published As

Publication number Publication date
CN112579164B (en) 2022-10-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant