CN114911528B - Branch instruction processing method, processor, chip, board card, equipment and medium - Google Patents
Branch instruction processing method, processor, chip, board card, equipment and medium
- Publication number: CN114911528B (application CN202210613728.6A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G06F9/3885 — Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units
- G06F9/355 — Indexed addressing
- G06F9/3804 — Instruction prefetching for branches, e.g. hedging, branch folding
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The embodiments of the present disclosure provide a branch instruction processing method, a SIMT processor, a chip, a board card, a device, and a storage medium. The SIMT processor includes a branch instruction processing unit and a stack. The branch instruction processing unit is configured to: when at least two of a plurality of threads are to execute different branches of a branch instruction, determine a first address parameter of a convergence instruction and a second address parameter of the first instruction of the later-executed branch; push an entry comprising the first address parameter and the second address parameter onto the top of the stack; and, while executing the branch instruction or an instruction in the earlier-executed branch, when the third address parameter of the next instruction to be executed matches the first address parameter at the stack top, update the third address parameter to the second address parameter at the stack top. This enables efficient processing of branch instructions.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a branch instruction processing method, a SIMT processor, a chip, a board, a device, and a storage medium.
Background
Single instruction, multiple threads (Single Instruction Multi Thread, SIMT) uses a single instruction to control the execution of multiple threads, i.e., multiple threads execute the same instruction at the same time. Applying SIMT to processor design saves instruction-fetch logic resources and frees more transistors for computation, improving the processor's arithmetic capability. In graphics computation, for example, large numbers of vertices and pixels undergo the same operation, so data parallelism is extremely high and SIMT is well suited.
In a SIMT processor, when all threads in a thread cluster follow the same execution path, the processor achieves its full efficiency and performance. If the threads in a thread cluster encounter a conditional branch, differences in per-thread data cause their execution paths to diverge: for some threads in the cluster, evaluation of the condition indicated by the branch instruction is true and those threads must take the branch transfer, while for the remaining threads the evaluation is false and no transfer is taken. In this situation the SIMT processor suffers low execution efficiency. Unlike a pipelined processor, a SIMT processor does not perform branch prediction, so the branch instruction processing methods used in pipelined processors cannot be applied to a SIMT processor.
Therefore, there is a need for a more efficient branch instruction processing mechanism for SIMT processors.
Disclosure of Invention
The present disclosure provides a branch instruction processing method, a SIMT processor, a chip, a board card, an apparatus, and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a SIMT processor including a branch instruction processing unit and a stack. Multiple threads in the SIMT processor can synchronously execute instructions in the same instruction sequence, wherein the instruction sequence includes at least a branch instruction and a convergence instruction, the branch instruction indicates multiple executable branches, and the multiple branches converge at the convergence instruction;
the branch instruction processing unit is configured to:
determining, in the case where at least two of the plurality of threads are to execute different branches of the branch instruction, a first address parameter of the convergence instruction and a second address parameter of the first instruction of the later-executed branch among the different branches;
pushing an entry comprising the first address parameter and the second address parameter to a top of the stack;
in the process of executing the branch instruction or an instruction in the earlier-executed branch, in the case where the third address parameter of the next instruction to be executed matches the first address parameter at the stack top, updating the third address parameter to the second address parameter at the stack top, so that the first instruction of the later-executed branch is determined as the next instruction.
Optionally, the branch instruction processing unit is further configured to: in the process of executing instructions of the later-executed branch, in the case where the third address parameter of the next instruction matches the first address parameter at the stack top, pop the entry comprising the first address parameter and the second address parameter from the top of the stack.
Optionally, the plurality of branches includes a branch and a sequential execution branch; the first instruction on the sequentially executed branch is the instruction immediately following the branch instruction in the instruction sequence, and the branch is other branches of the plurality of branches except the sequentially executed branch; the address parameter of the first instruction on the sequential execution branch is obtained by increasing the address parameter of the branch instruction; the address parameter of the first instruction on the branch instruction is specified by the branch instruction and is discontinuous with the address parameter of the branch instruction.
Optionally, the branch instruction processing unit is further configured to: obtain, according to each thread's judgment result for the condition of the branch instruction, a first flag bit of each thread corresponding to the branch instruction, wherein the first flag bit of a thread corresponding to the branch instruction indicates the branch that the thread will execute next after execution of the branch instruction.
Optionally, the plurality of branches includes a branch and a sequential execution branch; the first instruction on the sequentially executed branch is the instruction immediately following the branch instruction in the instruction sequence, and the branch is other branches of the plurality of branches except the sequentially executed branch; the first flag bit of one thread corresponding to the branch instruction is the same as the first flag bit of the corresponding branch instruction, and the first flag bit of one thread corresponding to the sequential branch instruction is obtained by inverting the first flag bit of the thread corresponding to the branch instruction.
Optionally, the entry further includes the first flag bit of each thread corresponding to the later-executed branch.
Optionally, each thread further corresponds to a second flag bit indicating whether the thread is running. The entry further includes, for each thread, the first flag bit corresponding to the later-executed branch and the second flag bit; or the entry further includes the second flag bit of each thread, wherein the first flag bit of a thread corresponding to the later-executed branch is determined based on the second flag bit of the thread and the first flag bit of the thread corresponding to the earlier-executed branch.
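As an illustrative sketch only (the function name and boolean encoding are assumptions for illustration, not taken from the patent), the relation between the two flag bits described above can be modeled as follows:

```python
# Hypothetical model of the two per-thread flag bits. Assumption: the first
# flag bit is True when a thread belongs to a given branch, and the second
# flag bit is True when the thread is running.
def later_branch_first_flags(earlier_first_flags, second_flags):
    """A thread belongs to the later-executed branch only if it is running
    (second flag bit set) and did not belong to the earlier-executed branch."""
    return [running and not on_earlier
            for on_earlier, running in zip(earlier_first_flags, second_flags)]

# Four threads: threads 0 and 2 take the earlier-executed branch,
# thread 3 is not running at all.
earlier = [True, False, True, False]
running = [True, True, True, False]
print(later_branch_first_flags(earlier, running))  # [False, True, False, False]
```

Only thread 1 ends up flagged for the later-executed branch: it is running and did not take the earlier-executed branch.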
Optionally, the number of branch instruction processing units and the number of stacks are each greater than 1, each branch instruction processing unit corresponding to one thread cluster in the SIMT processor and each stack corresponding to one thread cluster in the SIMT processor, wherein multiple threads in the same thread cluster can synchronously execute instructions in the same instruction sequence.
Optionally, the SIMT processor further includes one or more shared memories, each shared memory corresponding to at least two thread clusters, different thread clusters corresponding to different shared memories;
The branch instruction processing unit is further configured to: in the case where the stack overflows, transfer specified entries in the stack to the shared memory corresponding to the thread cluster; the specified entries are determined according to the write pointer of the stack, and their push time is earlier than the push time of the entries that are not transferred.
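The overflow handling can be sketched in software as follows; the list-based layout, the capacity value, and the function name are all assumptions for illustration, not the hardware implementation:

```python
# Illustrative model of moving the oldest entries (those pushed earliest,
# i.e. nearest the bottom of the stack) out to shared memory on overflow.
def spill_on_overflow(stack, shared_mem, capacity):
    """If the stack holds more than `capacity` entries, transfer the
    earliest-pushed entries to shared memory, oldest first."""
    while len(stack) > capacity:
        shared_mem.append(stack.pop(0))  # index 0 = bottom = oldest entry
    return stack, shared_mem

stack = [("c1", "b1"), ("c2", "b2"), ("c3", "b3")]  # bottom ... top
shared = []
spill_on_overflow(stack, shared, capacity=2)
print(stack)   # [('c2', 'b2'), ('c3', 'b3')]
print(shared)  # [('c1', 'b1')]
```

The entries most recently pushed stay in the stack, matching the claim that the transferred entries were pushed earlier than those that remain.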
Optionally, the branch instruction processing unit is further configured to: returning the stack; the storage position of the target item in the stack is adjacent to or different from the original storage position of the appointed item in the stack by a preset number of storage units, and each storage unit is used for storing one item.
Optionally, the plurality of branches includes a branch and a sequential execution branch; the first instruction on the sequentially executed branch is the instruction immediately following the branch instruction in the instruction sequence, and the branch is other branches of the plurality of branches except the sequentially executed branch; the branch instruction processing unit is further configured to: under the condition that the threads are used for executing the sequential execution branches indicated by the branch instructions, the address parameters of the branch instructions are increased to obtain the address parameters of the next instruction to be executed; and taking the address parameter of the first instruction on the branch as the address parameter of the next instruction to be executed under the condition that the plurality of threads are used for executing the branch instruction to indicate the branch.
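Using the PC numbering of FIG. 1 (branch instruction at PC1, fall-through at PC2, branch-transfer target at PC5), the two cases above can be sketched as follows; the PC step of 1 and the function name are assumptions for illustration:

```python
# Sketch of next-PC selection for the two kinds of branch described above.
def next_pc(branch_pc, take_transfer, target_pc, pc_step=1):
    if take_transfer:
        # Branch-transfer branch: the target is specified by the branch
        # instruction itself and is discontinuous with the branch's own PC.
        return target_pc
    # Sequential-execution branch: increment the branch instruction's PC.
    return branch_pc + pc_step

print(next_pc(1, False, 5))  # 2 -> instruction b1 at PC2
print(next_pc(1, True, 5))   # 5 -> instruction a1 at PC5
```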
Optionally, the branch instruction processing unit is further configured to: while executing an instruction in the earlier-executed branch or the later-executed branch, in the case where the current instruction is itself a branch instruction and at least two threads are to execute different branches of it, perform the steps of determining the first and second address parameters and pushing them onto the stack, so as to process branch instructions nested within the earlier-executed or later-executed branch.
Optionally, the later-executed branch among the different branches is determined according to a preset branch execution order.
According to a second aspect of embodiments of the present disclosure, there is provided a branch instruction processing method applied to a SIMT processor including a branch instruction processing unit and a stack, multiple threads in the SIMT processor being capable of synchronously executing instructions in the same instruction sequence, the instruction sequence including at least a branch instruction and a convergence instruction, the branch instruction indicating multiple executable branches that converge at the convergence instruction; the method includes the following steps:
determining, in the case where at least two of the plurality of threads are to execute different branches of the branch instruction, a first address parameter of the convergence instruction and a second address parameter of the first instruction of the later-executed branch among the different branches;
pushing an entry comprising the first address parameter and the second address parameter to a top of the stack;
in the process of executing the branch instruction or an instruction in the earlier-executed branch, in the case where the third address parameter of the next instruction to be executed matches the first address parameter at the stack top, updating the third address parameter to the second address parameter at the stack top, so that the first instruction of the later-executed branch is determined as the next instruction.
According to a third aspect of embodiments of the present disclosure, there is provided a chip comprising the SIMT processor of any of the first aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a board comprising a chip as described in the third aspect.
According to a fifth aspect of an embodiment of the present disclosure, there is provided an electronic device including the SIMT processor of any one of the first aspect, or the chip of the third aspect, or the board card of the fourth aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed, implement the method mentioned in the second aspect above.
The embodiments of the present disclosure propose a mechanism for efficiently processing branch instructions in a SIMT processor, in which a corresponding stack and branch instruction processing unit are provided for a plurality of threads capable of synchronously executing the same instruction sequence. In the case where at least two of the plurality of threads are to execute different branches of a branch instruction, the branch instruction processing unit determines a first address parameter of the convergence instruction and a second address parameter of the first instruction of the later-executed branch, and pushes an entry comprising the first and second address parameters onto the top of the stack. The key information of the later-executed branch is thus saved in the stack, and only the first and second address parameters need to be pushed for one branch instruction, with no other address parameters required, so multithreaded branch control is accomplished with little stack space. Moreover, during execution of the branch instruction or of instructions in the earlier-executed branch, when the third address parameter of the next instruction to be executed matches the first address parameter at the stack top, the third address parameter is updated to the second address parameter at the stack top, so that the later-executed branch is executed after the branch instruction (when there is no earlier-executed branch) or after the earlier-executed branch completes, achieving orderly and efficient processing of the different branches of the branch instruction.
The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a flow chart illustrating execution of one instruction sequence in an embodiment of the present disclosure.
Fig. 2 is a block diagram of a SIMT processor shown in an embodiment of the present disclosure.
Fig. 3 is a block diagram of a second SIMT processor shown in an embodiment of the present disclosure.
Fig. 4 is a block diagram of a third SIMT processor shown in an embodiment of the present disclosure.
Fig. 5 is a block diagram of a fourth SIMT processor shown in an embodiment of the present disclosure.
Fig. 6 is a block diagram of a fifth SIMT processor shown in an embodiment of the present disclosure.
FIG. 7 is a schematic diagram illustrating the movement of entries in a stack to shared memory according to an embodiment of the present disclosure.
FIG. 8 is a schematic diagram illustrating writing an entry back to a stack from shared memory according to an embodiment of the present disclosure.
Fig. 9 is a flow chart illustrating a method of processing a branch instruction according to an embodiment of the disclosure.
Fig. 10 is a flow diagram illustrating a second branch instruction processing method according to an embodiment of the present disclosure.
Fig. 11 is a schematic structural view of a board card according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may, depending on the context, be interpreted as "when", "upon", or "in response to determining".
In order to better understand the technical solutions in the embodiments of the present disclosure and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Single instruction, multiple threads (Single Instruction Multi Thread, SIMT) uses a single instruction to control the execution of multiple threads, i.e., multiple threads execute the same instruction at the same time. Applying SIMT to processor design saves instruction-fetch logic resources and frees more transistors for computation, improving the processor's arithmetic capability. In graphics computation, for example, large numbers of vertices and pixels undergo the same operation, so data parallelism is extremely high and SIMT is well suited. In non-graphics computation, however, the execution paths of different threads may differ (a branch instruction can make the execution paths of threads executing the same instruction diverge), reducing SIMT efficiency.
Branch instructions (Branch Instruction) occur frequently in programs: on average, one of every 9 instructions in C is a branch instruction, which gives programs their diversity of behavior. Branch instructions include, for example, if statements, if-else statements, while loops, and for loops. A branch instruction typically corresponds to a specified condition: if a thread's evaluation of the condition is true (the condition is satisfied), the thread must take the branch transfer; if the thread's evaluation of the condition is false (the condition is not satisfied), no transfer is taken and the immediately following instruction is executed sequentially.
In one example, referring to FIG. 1, consider an instruction sequence such as:
PC1: if (PC0: condition) {
PC5: instruction a1;
PC6: instruction a2;
PC7: instruction a3; }
else{
PC2: instruction b1;
PC3: instruction b2;
PC4: instruction b3 (jmp, a jump instruction); }
PC8: instruction c1;
PC9: instruction c2.
Here PC stands for "program counter"; the PC value of an instruction indicates the memory address of the instruction. For example, in the instruction sequence above, PC0 indicates the memory address of the condition, PC1 indicates the memory address of the branch instruction (the if-else instruction), and PC2 indicates the memory address of instruction b1. The PC value of the next instruction can be determined while the current instruction is executing; for example, while executing instruction b1, the PC value of the next instruction b2 is obtained by incrementing the PC value of instruction b1, so that instruction b2 can then be fetched and executed based on that PC value.
Referring to FIG. 2, the SIMT processor includes an instruction fetch unit 10, a plurality of threads 20 (e.g., n threads, n being an integer greater than 1) capable of synchronously executing the same instruction sequence, and a memory 30. The instruction sequence may be stored in the memory 30, and the instruction fetch unit 10 can read instructions from the memory 30 based on the memory address indicated by the PC value, for synchronous execution by the plurality of threads 20.
In sequential execution, the SIMT processor increments the PC value of the current instruction to obtain the PC value of the next instruction to be executed, in preparation for fetching it. For example, if instructions a1, a2, and a3 above are executed sequentially, then while instruction a1 executes, PC6 is obtained by incrementing PC5, and the instruction fetch unit reads instruction a2 from memory based on the memory address indicated by PC6.
When a branch instruction is encountered and a branch transfer is required, the memory address (PC value) of the next instruction is specified by the branch instruction rather than obtained by sequential increment. For example, when executing the branch instruction above in a SIMT processor with 4 threads, suppose two threads evaluate the condition as true and must take the branch transfer; the branch instruction specifies the address of the next instruction after the transfer, namely PC5. The other two threads evaluate the condition as false, so no transfer is needed; the PC of their next instruction, PC2, is obtained by sequentially incrementing PC1, and instruction b1 indicated by PC2 is the instruction immediately following the branch instruction in the instruction sequence.
The branch instruction indicates at most two executable branches. In the embodiments of the present disclosure, for ease of distinction, the execution path that requires no branch transfer is called the sequential-execution branch (①: PC2→PC3→PC4 in FIG. 1), and the execution path that does require a branch transfer is called the branch-transfer branch (②: PC5→PC6→PC7 in FIG. 1). The first instruction on the sequential-execution branch is the instruction immediately following the branch instruction in the instruction sequence, and the branch-transfer branch comprises the branches other than the sequential-execution branch. The two branches finally converge at convergence instruction c1, whose PC value is also specified by the branch instruction; the convergence instruction is the instruction at the convergence point of the two branches. That is, when the branch instruction is processed, the PC value of the first instruction on the branch-transfer branch and the PC value of the convergence instruction can both be determined. However, since the PC value of the first instruction on the branch-transfer branch is not obtained by sequential increment, without some supporting mechanism it cannot be recovered later during instruction execution, which would leave the branch-transfer branch unexecuted and cause erroneous instruction execution.
Embodiments of the present disclosure propose a mechanism for efficiently processing branch instructions in a SIMT processor, in which a corresponding stack and branch instruction processing unit are provided for a plurality of threads capable of synchronously executing the same instruction sequence. In the case where at least two of the plurality of threads are to execute different branches of a branch instruction, the branch instruction processing unit determines a first address parameter of the convergence instruction and a second address parameter of the first instruction of the later-executed branch, and pushes an entry comprising the first and second address parameters onto the top of the stack. The key information of the later-executed branch is thus saved in the stack, and only the first and second address parameters need to be pushed for one branch instruction, with no other address parameters required, so multithreaded branch control is accomplished with little stack space. Moreover, during execution of the branch instruction or of instructions in the earlier-executed branch, when the third address parameter of the next instruction to be executed matches the first address parameter at the stack top, the third address parameter is updated to the second address parameter at the stack top, so that the later-executed branch is executed after the branch instruction (when there is no earlier-executed branch) or after the earlier-executed branch completes, achieving orderly and efficient processing of the different branches of the branch instruction.
In some embodiments, referring to FIG. 3, the SIMT processor includes a branch instruction processing unit 40 and a stack 50. Multiple threads in the SIMT processor can synchronously execute instructions in the same instruction sequence (such as the instruction sequence of FIG. 1), wherein the instruction sequence includes at least a branch instruction and a convergence instruction, the branch instruction indicates multiple executable branches, and the multiple executable branches converge at the convergence instruction.
The branch instruction processing unit 40 is configured to:
determining, in the case that at least two of the plurality of threads are to execute different branches of the branch instruction, a first address parameter of the converging instruction and a second address parameter of the first instruction of the later execution branch among the different branches;
pushing an entry comprising the first address parameter and the second address parameter onto the top of the stack 50;
while executing the branch instruction or an instruction of the previous execution branch, updating, in the case that the third address parameter of the next instruction to be executed matches the first address parameter at the stack top, the third address parameter to the second address parameter at the stack top, so that the first instruction of the later execution branch is determined as the next instruction.
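The three operations above can be illustrated with a small Python model. All names here (`BranchStack`, `push_divergence`, `filter_next_pc`) are hypothetical and chosen only for illustration; the actual unit 40 and stack 50 are hardware, so this is a sketch of the stack discipline, not an implementation.

```python
# Toy Python model of the described stack discipline; all names are
# hypothetical illustrations, not taken from this disclosure.
class BranchStack:
    def __init__(self):
        self.entries = []  # each entry: (first_addr, second_addr)

    def push_divergence(self, converge_pc, later_branch_pc):
        # First address parameter = PC of the converging instruction;
        # second address parameter = PC of the first instruction of the
        # later execution branch. Nothing else is pushed per branch.
        self.entries.append((converge_pc, later_branch_pc))

    def filter_next_pc(self, third_pc):
        # While the branch instruction or the previous execution branch is
        # running: if the prefetched third address parameter matches the
        # converging PC at the stack top, redirect into the later branch.
        if self.entries and third_pc == self.entries[-1][0]:
            return self.entries[-1][1]
        return third_pc

stack = BranchStack()
stack.push_divergence(converge_pc=8, later_branch_pc=5)  # fig. 1: {PC5, PC8}
# filter_next_pc(3) keeps PC3 (previous branch still running), while
# filter_next_pc(8) redirects to PC5, the later execution branch
```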
In this embodiment, the key information of the later execution branch is saved in the stack 50 when at least two of the plurality of threads are to execute different branches of the branch instruction, and the later execution branch can be executed, based on the information saved in the stack 50, immediately after the branch instruction (when there is no previous execution branch) or after the previous execution branch completes, thereby achieving orderly and efficient processing of the different branches of the branch instruction.
In some embodiments, referring to fig. 4, the SIMT processor further includes an instruction fetch unit 10 and a memory 30. The instruction sequence may be stored in the memory 30, and the instruction fetch unit 10 is capable of reading an instruction from the memory 30 based on the storage address indicated by an address parameter (such as a PC value). The branch instruction processing unit 40 is connected to the instruction fetch unit 10 in the SIMT processor and is configured to determine whether an instruction fetched by the instruction fetch unit 10 is a branch instruction; the branch instruction processing unit 40 is further coupled to the stack 50 for performing the associated push operations in the event that at least two of the plurality of threads are to execute different branches of a branch instruction.
In some embodiments, the address parameter of an instruction indicates the storage address of that instruction; illustratively, the address parameter includes a PC value (program counter value). In some embodiments, the branch instruction processing unit 40 obtains an instruction to be executed from the instruction fetch unit 10 and first determines whether it is a branch instruction; if so, it then determines whether the plurality of threads diverge at the branch instruction. The branch instruction indicates two or more executable branches, which include a transfer branch (such as ② : PC5→PC6→PC7 in fig. 1) and a sequential execution branch (such as ① : PC2→PC3→PC4 in fig. 1). The first instruction of the sequential execution branch is the instruction immediately following the branch instruction in the instruction sequence, and a transfer branch is any branch other than the sequential execution branch. The address parameter of the first instruction of the sequential execution branch is obtained by incrementing the address parameter of the branch instruction, whereas the address parameter of the first instruction of a transfer branch is specified by the branch instruction and is not contiguous with the address parameter of the branch instruction.
It should be noted that, when the branch instruction processing unit 40 obtains an instruction to be executed from the instruction fetch unit 10, it does not yet know whether that instruction is a branch instruction; therefore, the branch instruction processing unit 40 performs the branch-instruction determination step for every instruction it obtains from the instruction fetch unit 10.
If the instruction to be executed is a branch instruction, there are three possible situations at this time:
In the first case, all of the threads are to execute the sequential execution branch indicated by the branch instruction; the address parameter of the branch instruction is then incremented to obtain the address parameter of the next instruction to be executed. For example, in the embodiment of fig. 1, if every thread synchronously executing the instruction sequence evaluates the condition as false, the sequential execution branch ① is executed, and the address parameter (PC2) of its first instruction is the increment of PC1.
In the second case, all of the threads are to execute the transfer branch indicated by the branch instruction; the address parameter of the first instruction of the transfer branch is then taken as the address parameter of the next instruction to be executed. For example, in the embodiment of fig. 1, if every thread synchronously executing the instruction sequence evaluates the condition as true, the transfer branch ② is executed, and the address parameter (PC5) of its first instruction is specified by the branch instruction; illustratively, it may be determined from the address code field of the branch instruction.
In the third case, at least two of the threads are to execute different branches of the branch instruction, that is, some threads execute the sequential execution branch while others execute the transfer branch; in this case the above-mentioned push operation and the subsequent address parameter update operation are performed.
In some embodiments, the branch instruction processing unit 40 may obtain, according to each thread's evaluation of the condition of the branch instruction, a first flag bit of that thread corresponding to the branch instruction, where the first flag bit indicates the branch the thread executes after the branch instruction completes. For example, when the flag bit is a first preset value, the thread does not take the transfer branch (i.e., it executes the sequential execution branch); when the flag bit is a second preset value, the thread executes the transfer branch; the first preset value and the second preset value are different. In one example, one of the first preset value and the second preset value is 1 and the other is 0.
Then, corresponding to the three cases: if the first flag bits of all threads corresponding to the branch instruction are the first preset value, all threads execute the sequential execution branch indicated by the branch instruction; if the first flag bits of all threads are the second preset value, all threads execute the transfer branch indicated by the branch instruction; and if at least two threads have different first flag bits corresponding to the branch instruction, at least two of the threads execute different branches of the branch instruction.
In one example, assume a first flag bit of "1" indicates that a thread executes the transfer branch and a first flag bit of "0" indicates that it executes the sequential execution branch. In the embodiment shown in fig. 1, assume there are 4 threads (A, B, C, D); after the condition indicated by PC0 is evaluated, threads A, B, and C evaluate it as true, so their first flag bits are "1", while thread D evaluates it as false, so its first flag bit is "0". At least two threads thus have first flag bits that differ, and it can be seen from the combined first flag bits "1110" that threads A, B, and C execute the transfer branch while thread D executes the sequential execution branch.
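This derivation can be sketched in a few lines of Python; the helper names (`first_flag_bits`, `diverges`) are hypothetical and used only to illustrate how the per-thread condition results map to the flag bits and the three cases.

```python
# Hypothetical sketch: deriving the first flag bits of fig. 1's four
# threads from their evaluation of the branch condition at PC0.
def first_flag_bits(condition_results):
    # "1": thread takes the transfer branch; "0": sequential execution
    return ''.join('1' if taken else '0' for taken in condition_results)

def diverges(flag_bits):
    # The third case: at least two threads have different first flag bits.
    return '0' in flag_bits and '1' in flag_bits

bits = first_flag_bits([True, True, True, False])  # threads A, B, C, D
# bits == "1110": A, B, C take the transfer branch, D the sequential one,
# and diverges(bits) is True, so the push operation is performed
```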
In some embodiments, the first flag bit of each thread is used to distinguish the branch executed by that thread, so that when at least two of the threads execute different branches of the branch instruction, each thread can be controlled to execute the instructions of its corresponding branch. When all threads execute the same branch of the branch instruction, there is no need to distinguish the branches executed by the threads via the first flag bit.
In some embodiments, when the plurality of threads are to execute different branches of a branch instruction (i.e., some of the threads execute the transfer branch while the others execute the sequential execution branch), the branch instruction processing unit 40 obtains the first address parameter of the converging instruction and the second address parameter of the first instruction of the later execution branch among the different branches. In the embodiment depicted in fig. 1, for example, if the later execution branch is the transfer branch, the PC5 of its first instruction and the PC8 of the converging instruction are obtained.
Further, the branch instruction processing unit 40 may determine, from the first flag bit of each thread corresponding to the branch instruction, the first flag bits of the threads corresponding to the transfer branch and to the sequential execution branch, respectively. The first flag bit of a thread corresponding to the transfer branch is the same as its first flag bit corresponding to the branch instruction, while its first flag bit corresponding to the sequential execution branch is obtained by inverting the first flag bit corresponding to the branch instruction.
In one example, taking the first flag bits "1110" of the threads corresponding to the branch instruction, the first flag bits of the threads corresponding to the transfer branch are also "1110"; for the transfer branch shown in fig. 1, "1110" indicates that threads A, B, and C execute instructions a1, a2, and a3 on that branch. The first flag bits of the threads corresponding to the sequential execution branch are "0001"; for the sequential execution branch in the embodiment of fig. 1, "0001" indicates that thread D executes instructions b1, b2, and b3 on that branch.
In some embodiments, for the case where at least two of the threads execute different branches of the branch instruction, a branch precedence order may be preset so that the previous execution branch and the later execution branch among the different branches are determined. The address parameter of the first instruction of the previous execution branch is used to fetch the next instruction to be executed, while the second address parameter of the first instruction of the later execution branch and the first address parameter of the converging instruction are pushed onto the top of the stack 50, so that the key information of the later execution branch is saved; only the first and second address parameters need to be pushed for one branch instruction, with no other address parameters required, so that multithreaded branch control can be achieved with little stack 50 space. Illustratively, the second address parameter of the first instruction of the later execution branch and the first address parameter of the converging instruction may be pushed onto the top of the stack 50 in the form of an entry.
In some exemplary embodiments, the preset branch precedence order is either a first order or a second order; the first order indicates that the instructions of the sequential execution branch are executed first, followed by the instructions of the transfer branch, while the second order indicates that the instructions of the transfer branch are executed first, followed by the instructions of the sequential execution branch. Either order may be selected according to the actual application scenario, and this embodiment imposes no limitation. For example, in the embodiment shown in fig. 1, the PC2 of the sequential execution branch, the PC5 of the transfer branch, and the PC8 of the converging instruction are obtained; assuming the preset order is the first order, the entry to be pushed onto the top of the stack 50 is {PC5, PC8}.
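A minimal sketch of entry formation under the preset order (the function and parameter names are illustrative assumptions, not part of this disclosure):

```python
# Hypothetical sketch: choosing the later execution branch from the preset
# precedence order and forming the {second, first} stack entry for fig. 1.
def make_entry(order, sequential_pc, transfer_pc, converge_pc):
    # first order: the sequential branch runs first, so the transfer
    # branch is the later execution branch whose first-instruction PC
    # becomes the second address parameter of the entry
    later_pc = transfer_pc if order == "first" else sequential_pc
    return (later_pc, converge_pc)

entry = make_entry("first", sequential_pc=2, transfer_pc=5, converge_pc=8)
# entry == (5, 8), i.e. the {PC5, PC8} entry pushed onto the stack top
```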
In one example, for a sequence of instructions such as the following:
PC1: if (PC0: condition) {
PC3: instruction a1;
PC4: instruction a2;
PC5: instruction a3; }
PC2: instruction c1.
It can be seen that the first instruction of the sequential execution branch is instruction c1, which is also the converging instruction; that is, the first instruction of the sequential execution branch may itself be the converging instruction. If execution follows the first order, then while the branch instruction is being processed, the third address parameter of the next instruction to be executed (i.e., the address parameter of the first instruction of the sequential execution branch) is compared with the first address parameter of the converging instruction at the stack top; since they match, the third address parameter is updated to the second address parameter at the stack top, so that the first instruction of the later execution branch is determined as the next instruction.
In some embodiments, after the next instruction to be executed is fetched using the address parameter of the first instruction of the previous execution branch, the relevant threads may be controlled to execute the instructions of the previous execution branch according to the first flag bit of each thread corresponding to that branch. This first flag bit may be determined from the first flag bit of each thread corresponding to the branch instruction and the preset branch precedence order; for example, if the preset order is the first order, the first flag bit of each thread corresponding to the previous execution branch (the sequential execution branch) is obtained by inverting that thread's first flag bit corresponding to the branch instruction.
In some embodiments, in addition to the two address parameters (the first and second address parameters) stored in the stack 50, the first flag bit of each thread corresponding to the later execution branch is also saved, since it indicates which threads need to execute the instructions of the later execution branch. Illustratively, the entry also includes the first flag bit of each thread corresponding to the later execution branch, so that the threads that need to execute the later execution branch can subsequently be identified from the flag bits in the stack 50. These first flag bits may be determined from the first flag bit of each thread corresponding to the branch instruction and the preset branch precedence order; for example, if the preset order is the first order, the first flag bit of each thread corresponding to the later execution branch (the transfer branch) is the same as its first flag bit corresponding to the branch instruction.
In some embodiments, when at least two of the threads are to execute different branches of the branch instruction, the instructions of the previous execution branch are executed first according to the preset branch precedence order, and the address parameter of each instruction of the previous execution branch is determined by the branch instruction processing unit 40 while the preceding instruction is processed; for example, the address parameter of the first instruction of the previous execution branch is determined during execution of the branch instruction, and the address parameter of the second instruction is determined during execution of the first instruction.
While executing the branch instruction or an instruction of the previous execution branch, the branch instruction processing unit 40 may prefetch the third address parameter of the next instruction to be executed from the currently executed instruction. Typically, the branch instruction processing unit 40 increments the address parameter of the currently executed instruction to obtain the third address parameter; in addition, when the currently executed instruction is a jump (jmp) instruction, the branch instruction processing unit 40 may determine the third address parameter from the jump instruction.
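The two ways of prefetching the third address parameter can be sketched as follows; the opcode encoding and function name are hypothetical stand-ins for the hardware behavior.

```python
# Hypothetical sketch of third-address-parameter prefetch: increment the
# current PC by default, or take the target of a jump (jmp) instruction.
def third_address(current_pc, opcode, jump_target=None):
    if opcode == "jmp":
        return jump_target      # target specified by the jump instruction
    return current_pc + 1       # default: increment the current PC

# fig. 1: an ordinary instruction at PC3 falls through to PC4, while the
# jump instruction b3 at PC4 yields its target PC8
```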
After initially determining the third address parameter of the next instruction to be executed, the branch instruction processing unit 40 may compare it with the first address parameter of the converging instruction at the top of the stack 50, in order to determine whether the branch instruction is immediately followed by the converging instruction or whether the previous execution branch has completed. If they match, the converging instruction immediately follows the branch instruction or the instructions of the previous execution branch have all been executed, and the instructions of the later execution branch need to be executed; the branch instruction processing unit 40 then updates the third address parameter to the second address parameter read from the top of the stack 50, so that the first instruction of the later execution branch is determined as the next instruction and the later execution branch is executed. If they do not match, a previous execution branch follows the branch instruction or the instructions of the previous execution branch have not all been executed, and the determined third address parameter is kept unchanged so as to continue fetching the next instruction of the previous execution branch.
In one example, taking the embodiment shown in fig. 1 and assuming the branch precedence order is the first order (that is, the previous execution branch is the sequential execution branch), the entry pushed onto the top of the stack 50 includes at least {PC5, PC8}. Assume the currently executed instruction is instruction b3 indicated by PC4, and b3 is a jump (jmp) instruction; the third address parameter determined from the jump instruction is PC8, which matches the first address parameter PC8 at the top of the stack 50, indicating that the instructions of the previous execution branch have been executed. The branch instruction processing unit 40 therefore updates the third address parameter from PC8 to the PC5 in the stack top, so that the later execution branch is executed after the instructions of the previous execution branch complete.
In some embodiments, after the third address parameter of the next instruction to be executed is updated to the second address parameter read from the top of the stack 50, the threads in the SIMT processor begin executing the later execution branch. In addition, as mentioned above, the first flag bit of each thread corresponding to the later execution branch is pushed onto the top of the stack 50 along with the two address parameters. When the threads begin executing the later execution branch, these first flag bits need to be read from the top of the stack 50 to determine which threads execute the later execution branch.
In one example, taking the embodiment shown in fig. 1 and assuming the branch precedence order is the first order (that is, the later execution branch is the transfer branch), it is known from the above example that the first flag bits corresponding to the later execution branch are also "1110", indicating that threads A, B, and C need to execute the instructions of the transfer branch while thread D does not.
While executing the instructions of the later execution branch, the branch instruction processing unit 40 may determine the third address parameter of the next instruction to be executed from the currently executed instruction. Typically, the branch instruction processing unit 40 increments the address parameter of the currently executed instruction to obtain the third address parameter; in addition, when the currently executed instruction is a jump (jmp) instruction, the branch instruction processing unit 40 may determine the third address parameter from the jump instruction.
After initially determining the third address parameter of the next instruction to be executed, the branch instruction processing unit 40 may compare it with the first address parameter of the converging instruction at the top of the stack 50 in order to determine whether the later execution branch has completed. If they match, the instructions of the later execution branch have all been executed, and the branch instruction processing unit 40 may pop the entry, which includes at least the first address parameter and the second address parameter (and, illustratively, the first flag bit of each thread corresponding to the later execution branch), from the top of the stack 50. If they do not match, the instructions of the later execution branch have not all been executed, and the determined third address parameter is kept unchanged so as to continue fetching the next instruction of the later execution branch.
In one example, taking the embodiment shown in fig. 1 and assuming the branch precedence order is the first order (that is, the later execution branch is the transfer branch), the entry pushed onto the top of the stack 50 includes at least {PC5, PC8}. Assume the currently executed instruction is instruction a3 indicated by PC7; incrementing the address parameter of a3 gives the third address parameter PC8, which matches the first address parameter PC8 of the converging instruction at the top of the stack 50, indicating that the instructions of the later execution branch have been executed. The branch instruction processing unit 40 then pops the entry from the top of the stack 50, since both branches corresponding to the branch instruction have been executed.
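The pop step in this example can be sketched as follows (hypothetical names; the entry is shown as a (first, second) address pair):

```python
# Hypothetical sketch of the pop step: while the later execution branch
# runs, the entry stays on the stack; when the prefetched third address
# parameter reaches the converging PC, both branches are done, so the
# entry is popped and execution falls through to the converging PC.
def finish_later_branch(stack, third_pc):
    if stack and third_pc == stack[-1][0]:  # matches converging PC at top
        stack.pop()                         # both branches are executed
    return third_pc                         # continue at converging instr.

stack = [(8, 5)]                        # fig. 1 entry holding PC8 and PC5
pc = finish_later_branch(stack, 7 + 1)  # a3 at PC7 increments to PC8
# pc == 8 and the stack is now empty: all threads meet at PC8
```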
In some embodiments, if an instruction executed within the previous execution branch or the later execution branch is itself a branch instruction and at least two threads are to execute different branches of it, the steps of determining the first and second address parameters and pushing them onto the stack 50 may be performed in the same way as described above, so as to process a branch instruction nested in the previous or later execution branch. Owing to the first-in, last-out storage of the stack 50, the nested branch instruction is fully processed before processing of the outer branch instruction (i.e., the original branch instruction indicating the previous and later execution branches) resumes, so that branch nesting is handled efficiently.
In one example, for a sequence of instructions such as the following:
PC1: if (PC0: condition) {
PC8: instruction a1;
PC9: instruction a2;
PC10: instruction a3; }
else {
PC2: instruction b1;
PC4: if (PC3: condition) {
PC6: instruction d1;
PC7: instruction d2 (jump jmp instruction);
}
PC5: instruction b3 (jump jmp instruction); }
PC11: instruction c1;
PC12: instruction c2.
When the instruction indicated by PC1 is determined to be a branch instruction and at least two of the threads are to execute different branches of it, the first address parameter PC11 of the converging instruction and the second address parameter PC8 of the first instruction of the later execution branch (assumed to be the transfer branch) are determined, and {PC8, PC11} is pushed onto the top of the stack 50. While executing the instructions of the previous execution branch (the sequential execution branch), assume the currently executed instruction is the one indicated by PC4. When this instruction is determined to be a branch instruction and at least two of the threads are to execute different branches of it, the first address parameter PC5 of its converging instruction and the second address parameter PC6 of the first instruction of its later execution branch (assumed to be the transfer branch) are determined, and {PC6, PC5} is pushed onto the top of the stack 50, which, owing to its first-in, last-out storage, now holds the nested entry above the outer entry. Incrementing PC4 gives the third address parameter PC5 of the next instruction to be executed, which matches the first address parameter at the stack top, so the third address parameter is updated to the PC6 in the stack top and the branch instruction processing unit 40 fetches the instruction indicated by PC6 for execution. After instruction d1 is executed, the third address parameter is incremented to PC7, which differs from the first address parameter PC5 at the stack top, so execution continues with instruction d2. Instruction d2 is a jump instruction whose target PC5 matches the first address parameter at the stack top, so the entry {PC6, PC5} is popped and execution continues at the converging instruction b3 indicated by PC5, completing the nested branch instruction. Instruction b3 is in turn a jump instruction whose target PC11 matches the first address parameter of the outer entry now at the stack top, so the third address parameter is updated to the PC8 in the stack top and instructions a1, a2, and a3 of the outer later execution branch are executed. Finally, incrementing PC10 gives PC11, which matches the first address parameter at the stack top, so the outer entry is popped and the threads converge at instruction c1 indicated by PC11.
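The nested example can be reproduced with a toy interpreter. The instruction encoding and all names are hypothetical; for simplicity this model pops an entry at the same moment it redirects, which yields the same instruction trace as the update-then-pop discipline described above but omits the flag-bit bookkeeping.

```python
# Toy interpreter for the nested example. Hypothetical encoding:
# ("branch", (converge_pc, later_pc)) pushes an entry, ("jmp", t) jumps,
# ("op", None) falls through; condition evaluation is folded into "op".
program = {
    1:  ("branch", (11, 8)),  # outer if: converge PC11, later branch at PC8
    2:  ("op", None),         # b1
    3:  ("op", None),         # nested condition evaluation
    4:  ("branch", (5, 6)),   # nested if: converge PC5, later branch at PC6
    5:  ("jmp", 11),          # b3, jumps over the outer if-body
    6:  ("op", None),         # d1
    7:  ("jmp", 5),           # d2, jumps to the nested converging point
    8:  ("op", None),         # a1
    9:  ("op", None),         # a2
    10: ("op", None),         # a3
    11: ("op", None),         # c1 (outer converging instruction)
    12: ("op", None),         # c2
}

def run(start=1, end=12):
    stack, trace, pc = [], [], start
    while pc != end:
        trace.append(pc)
        kind, arg = program[pc]
        if kind == "branch":
            stack.append(arg)   # push the two address parameters
            nxt = pc + 1        # previous execution branch runs first
        elif kind == "jmp":
            nxt = arg
        else:
            nxt = pc + 1
        if stack and nxt == stack[-1][0]:
            nxt = stack[-1][1]  # redirect into the later execution branch
            stack.pop()         # simplified: pop together with the redirect
        pc = nxt
    trace.append(end)
    return trace

# run() visits PC1,2,3,4,6,7,5,8,9,10,11,12: the nested branch completes
# before the outer branch, as in the walkthrough above.
```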
In some embodiments, each thread also corresponds to a second flag bit used to indicate whether the thread is running. For example, when the flag bit is a first preset value, the thread does not run the instruction sequence; when the flag bit is a second preset value, the thread runs the instruction sequence; the first preset value and the second preset value are different. In one example, one of the first preset value and the second preset value is 1 and the other is 0.
After the previous execution branch and the later execution branch have been executed, the threads gather at the converging instruction and then synchronously execute the converging instruction and the subsequent instructions; at that point, the second flag bit is needed to indicate which threads are running and thus which threads execute the converging instruction and the subsequent instructions. The second flag bit therefore also needs to be pushed onto the top of the stack 50: when at least two of the threads are determined to execute different branches of the branch instruction, an entry including the first address parameter, the second address parameter, the first flag bit of each thread corresponding to the later execution branch, and the second flag bit may be pushed onto the top of the stack 50, the second flag bit indicating the threads that need to execute the converging instruction and the subsequent instructions after the later execution branch completes. In other embodiments, it is considered that the first flag bit of each thread corresponding to the later execution branch can be determined from the thread's second flag bit and its first flag bit corresponding to the previous execution branch: since the second flag bit indicates whether a thread is running and the first flag bit corresponding to the previous execution branch indicates whether the thread executes that branch, the threads that are not running and the threads that execute the previous execution branch can be excluded from the plurality of threads, and the remaining threads are those that need to execute the later execution branch, i.e., the first flag bit of each thread corresponding to the later execution branch is determined.
Therefore, to save storage space in the stack 50, the first flag bit of each thread corresponding to the later execution branch may be omitted, i.e., the entry pushed onto the top of the stack 50 includes {the first address parameter, the second address parameter, and the second flag bit of each thread}. After the third address parameter of the next instruction is updated to the second address parameter in the stack top so that the first instruction of the later execution branch is determined as the next instruction, the first flag bit of each thread corresponding to the later execution branch may be determined from the thread's second flag bit in the stack top and its first flag bit corresponding to the previous execution branch; the latter is already known to the branch instruction processing unit 40, since the branch instruction processing unit 40 is still executing instructions of the previous execution branch at that moment.
In one possible implementation, the first flag bit of a thread corresponding to the later execution branch is obtained by inverting the thread's first flag bit corresponding to the previous execution branch and then performing a bitwise AND of the inverted result with the thread's second flag bit.
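With Python bitmasks this computation looks as follows for the four-thread example; the names are hypothetical illustrations of the invert-then-AND step.

```python
# Hypothetical sketch: recover the later execution branch's first flag
# bits from the second flag bits (running threads) and the previous
# execution branch's first flag bits.
NUM_THREADS = 4
MASK = (1 << NUM_THREADS) - 1   # 0b1111

def later_branch_flags(second_flags, prev_branch_flags):
    # invert the previous branch's flags within the thread mask,
    # then AND with the second flag bits of the running threads
    return (~prev_branch_flags & MASK) & second_flags

flags = later_branch_flags(0b1111, 0b0001)
# flags == 0b1110: threads A, B, C execute the later (transfer) branch
```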
In one example, for the second flag bit, assume that "1" indicates the thread runs the instruction sequence and "0" indicates it does not. Assume there are 4 threads (A, B, C, D), all of which are running, so the second flag bits are "1111". For the first flag bit, assume that "1" indicates the thread executes the transfer branch and "0" indicates it executes the sequential execution branch. After the condition indicated by PC0 is evaluated, threads A, B, and C evaluate it as true, so their first flag bits corresponding to the branch instruction are "1", while thread D evaluates it as false, so its first flag bit is "0"; the first flag bits of the threads corresponding to the branch instruction are thus "1110". Provided the preset branch precedence order is the first order, the later execution branch is the transfer branch, and the first flag bits of the threads corresponding to the previous execution branch are "0001"; inverting them gives "1110", and ANDing that with the second flag bits "1111" gives "1110" as the first flag bits corresponding to the later execution branch.
It should be noted that the "multiple threads" mentioned above can execute each instruction in the same instruction sequence synchronously.
In one embodiment, to extend the functionality of the SIMT processor, as shown in fig. 5, the number of branch instruction processing units 40 and the number of stacks 50 are each greater than 1. The SIMT processor may include n thread clusters, where n is an integer greater than 1, each thread cluster including a plurality of threads, and the threads in the same thread cluster are capable of synchronously executing instructions in the same instruction sequence. Each branch instruction processing unit 40 corresponds to one thread cluster in the SIMT processor, and each stack 50 corresponds to one thread cluster in the SIMT processor; a branch instruction processing unit 40 performs the above-described processing when the threads in its corresponding thread cluster diverge while executing a branch instruction.
For example, referring to fig. 5, the SIMT processor further includes an instruction execution unit 60, which may be invoked by the threads to execute relevant instructions and store the execution results of the instructions into the memory 30. In one example, the instruction execution unit 60 includes an adder, a subtractor, a multiplier, a divider, an integration operation circuit, a logarithm operation circuit, and the like; the specific operation circuits and their number can be set according to the practical application scenario, which is not limited in this embodiment.
Illustratively, in order to better execute the instruction sequence corresponding to each thread cluster, the SIMT processor further includes a fetch unit corresponding to each thread cluster, where the fetch unit is configured to obtain instructions of the instruction sequence corresponding to the thread cluster and send them to the branch instruction processing unit 40.
In some embodiments, considering that the stack 50 may overflow when branch instructions are nested to multiple levels, one or more shared memories 70 are provided in the SIMT processor; fig. 6, for example, illustrates n thread clusters corresponding to one shared memory 70. Each shared memory 70 corresponds to at least two thread clusters, and the thread clusters corresponding to different shared memories 70 are different. In one example, multiple thread clusters may be divided into two groups, with one shared memory 70 for each group.
The stack 50 is associated with stack pointers, including a read pointer pointing to the top of the stack 50 and a write pointer = read pointer + 1. In the event of overflow of the stack 50, the branch instruction processing unit 40 may dump at least one specified entry in the stack 50 into the shared memory 70 corresponding to the thread cluster. The specified entry is determined from the write pointer, and the push time of the at least one specified entry is earlier than the push times of the other entries that are not dumped.
In one example, referring to fig. 7, which shows a stack 50 having 8 layers (8 storage units), each layer (each storage unit) is used to store one entry, and reg0 to reg7 indicate the storage addresses of the layers. With 8 entries stored in the stack 50, the read pointer = 7 points to the top of the stack 50 (i.e., reg7) and the write pointer = 8 points to reg0; it can thus be determined that the stack 50 would overflow and cannot accept another push, since pushing the next entry would overwrite entry 1 stored in reg0. Therefore, to preserve entry 1 and ensure that the stack 50 can continue pushing, the at least one entry to be dumped, i.e., entry 1 stored in reg0, may be determined according to the write pointer and stored into the shared memory 70. Of course, to avoid multiple dump operations, at least two entries may be dumped at a time; for example, the 4 entries stored in reg0 to reg3 may be dumped to the shared memory 70 at once, reducing high-frequency read/write operations on the shared memory 70 and thereby improving processing efficiency. Note that the push times of the 4 entries stored in reg0 to reg3 are earlier than the push times of the other entries that are not dumped.
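As a rough software sketch of this overflow handling (the class, its method names, and the list-based model are invented for illustration and only approximate the hardware behavior described above):

```python
class DivergenceStack:
    """Toy model of the 8-layer stack 50 with batched dump-to-shared-memory
    on overflow; names and structure are hypothetical."""

    def __init__(self, depth=8, dump_count=4):
        self.depth = depth
        self.dump_count = dump_count
        self.entries = []       # index 0 = earliest push time
        self.shared_mem = []    # stands in for shared memory 70

    def push(self, entry):
        if len(self.entries) == self.depth:
            # Overflow: dump the oldest entries in one batch, reducing
            # high-frequency read/write operations on the shared memory.
            self.shared_mem.extend(self.entries[:self.dump_count])
            del self.entries[:self.dump_count]
        self.entries.append(entry)

stack = DivergenceStack()
for n in range(1, 10):          # pushing a 9th entry triggers the dump
    stack.push(n)
print(stack.shared_mem)         # prints [1, 2, 3, 4]
print(stack.entries)            # prints [5, 6, 7, 8, 9]
```

Dumping four entries per overflow, rather than one, mirrors the batching rationale in the text: fewer, larger transfers to the shared memory 70.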
In some embodiments, after the target entry is popped from the stack 50, the at least one specified entry that was dumped into the shared memory 70 is written back into the stack 50. Considering deployment differences among SIMT processors, the number of clock cycles required to read the at least one specified entry from the shared memory 70 varies: in one case, the at least one specified entry can be written back to the stack 50 in the clock cycle following the read request to the shared memory 70; in another case, after the read request is issued, at least one further clock cycle must elapse before the write-back completes. To avoid the situation in which all entries in the stack 50 have been popped while the at least one specified entry stored in the shared memory 70 has not yet been written back, causing the branch instruction processing unit 40 to read an empty stack 50, the storage position of the target entry in the stack 50 may, depending on the actual application scenario, be adjacent to the original storage position of the specified entry in the stack 50 or differ from it by a preset number of storage units; the specific value of the preset number can be determined from the number of clock cycles needed for the write-back, so as to ensure efficient execution of the instruction sequence. In one example, referring to fig. 8, suppose that under the timing requirements of the SIMT processor, the clock cycle following a read request to the shared memory 70 moves the at least one specified entry onto the read data bus, and the clock cycle after that writes it back to the stack 50; that is, 1 clock cycle is needed after the read request before the write-back completes, so the preset number is 1. For instance, in the embodiment shown in fig. 7, the entries dumped to the shared memory 70 are entries 1 to 4, whose original storage addresses in the stack 50 are reg0 to reg3. Since the storage position of entry 6 in the stack 50 is the preset number of storage units away from the original storage positions of the dumped entries, the write-back of entries 1 to 4 can be initiated when entry 6 is popped; by the time entry 5 is popped from the stack 50, entries 1 to 4 are already back in the stack 50, so the branch instruction processing unit 40 is not left waiting and the processing efficiency of the instruction sequence is preserved.
It will be understood that the solutions described in the above embodiments may be combined without conflict, and are not exemplified in the embodiments of the present disclosure.
Accordingly, referring to fig. 9, an embodiment of the present disclosure further provides a branch instruction processing method applied to a SIMT processor including a branch instruction processing unit and a stack, where multiple threads in the SIMT processor can synchronously execute instructions in a same instruction sequence, the instruction sequence includes at least a branch instruction and a converging instruction, the branch instruction indicates multiple executable branches, and the multiple branches converge at the converging instruction. The method is performed by the branch instruction processing unit and comprises:
in step S101, in the case that at least two of the plurality of threads are used to execute different branches of the branch instruction, a first address parameter of the converging instruction and a second address parameter of a first instruction of a later-executed branch of the different branches are determined.
In step S102, an entry including the first address parameter and the second address parameter is pushed onto the top of the stack.
In step S103, in the process of executing the branch instruction or the instruction in the previous execution branch, in the case that the third address parameter of the next instruction to be executed is consistent with the first address parameter in the stack top, the third address parameter is updated to the second address parameter in the stack top, so that the first instruction of the subsequent execution branch is determined as the next instruction.
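Steps S101 to S103 can be sketched as follows (a minimal model under assumed names; an entry is a pair whose elements play the roles of the first and second address parameters):

```python
def on_divergent_branch(stack, converge_pc, later_branch_pc):
    # S101/S102: determine the two address parameters and push the entry.
    stack.append((converge_pc, later_branch_pc))

def resolve_next_pc(stack, next_pc):
    # S103: while the branch instruction or the earlier-executed branch is
    # running, redirect to the later-executed branch once the next PC
    # matches the first address parameter at the stack top.
    if stack and next_pc == stack[-1][0]:
        return stack[-1][1]    # second address parameter at the stack top
    return next_pc

stack = []
on_divergent_branch(stack, converge_pc=100, later_branch_pc=40)
print(resolve_next_pc(stack, 37))    # prints 37 (still in the earlier branch)
print(resolve_next_pc(stack, 100))   # prints 40 (redirected to the later branch)
```

The numeric PC values here are arbitrary; the point is only that the redirect fires exactly when the next PC would reach the converging instruction.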
In some embodiments, the method further comprises: in the process of executing the instructions of the later-executed branch, in the case that the third address parameter of the next instruction is consistent with the first address parameter at the stack top, popping the entry comprising the first address parameter and the second address parameter from the stack top.
In some embodiments, the plurality of branches includes a jump branch and a sequential-execution branch; the first instruction on the sequential-execution branch is the instruction immediately following the branch instruction in the instruction sequence, and the jump branch is a branch of the plurality of branches other than the sequential-execution branch. The address parameter of the first instruction on the sequential-execution branch is obtained by incrementing the address parameter of the branch instruction; the address parameter of the first instruction on the jump branch is specified by the branch instruction and is discontinuous with the address parameter of the branch instruction.
In some embodiments, the method further comprises: obtaining, according to each thread's evaluation of the judgment condition of the branch instruction, a first flag bit of each thread corresponding to the branch instruction, wherein the first flag bit of a thread corresponding to the branch instruction is used to indicate the next branch executed by the thread after execution of the branch instruction.
In some embodiments, the plurality of branches includes a jump branch and a sequential-execution branch; the first instruction on the sequential-execution branch is the instruction immediately following the branch instruction in the instruction sequence, and the jump branch is a branch of the plurality of branches other than the sequential-execution branch. The first flag bit of a thread corresponding to the jump branch is the same as the first flag bit of the thread corresponding to the branch instruction, and the first flag bit of a thread corresponding to the sequential-execution branch is obtained by inverting the first flag bit of the thread corresponding to the branch instruction.
In some embodiments, the entry further includes a first flag bit for each thread corresponding to the later execution branch.
In some embodiments, each thread also corresponds to a second flag bit for indicating whether the thread is running; the entry further includes a first flag bit and the second flag bit for each thread corresponding to the later execution branch; or the entry also includes a second flag bit for each thread; wherein a first flag bit of a thread corresponding to the later execution branch is determined based on a second flag bit of the thread and a first flag bit of the thread corresponding to the earlier execution branch.
In some embodiments, the number of branch instruction processing units and the number of stacks are each greater than 1, each branch instruction processing unit corresponding to one thread cluster in the SIMT processor, each stack corresponding to one thread cluster in the SIMT processor; wherein, a plurality of threads in the same thread cluster can synchronously execute each instruction in the same instruction sequence.
In some embodiments, the SIMT processor further includes one or more shared memories, each shared memory corresponding to at least two thread clusters, and the thread clusters corresponding to different shared memories being different. The method further comprises: before pushing the entry comprising the first address parameter and the second address parameter onto the top of the stack, in the case of overflow of the stack, dumping a specified entry in the stack to the shared memory corresponding to the thread cluster; the specified entry is determined according to the write pointer of the stack, and the push time of the specified entry is earlier than the push times of the other entries that are not dumped.
In some embodiments, the method further comprises: after the target entry is popped from the stack, writing the specified entry back into the stack; the storage position of the target entry in the stack is adjacent to, or differs by a preset number of storage units from, the original storage position of the specified entry in the stack, and each storage unit is used for storing one entry.
In some embodiments, the plurality of branches includes a jump branch and a sequential-execution branch; the first instruction on the sequential-execution branch is the instruction immediately following the branch instruction in the instruction sequence, and the jump branch is a branch of the plurality of branches other than the sequential-execution branch. The method further comprises: in the case that the plurality of threads are all used to execute the sequential-execution branch indicated by the branch instruction, incrementing the address parameter of the branch instruction to obtain the address parameter of the next instruction to be executed; and in the case that the plurality of threads are all used to execute the jump branch indicated by the branch instruction, taking the address parameter of the first instruction on the jump branch as the address parameter of the next instruction to be executed.
In some embodiments, the method further comprises: in the process of executing instructions in the earlier-executed branch or the later-executed branch, in the case that the current instruction is a branch instruction and at least two threads are used to execute different branches of that branch instruction, performing the steps of determining the first and second address parameters and pushing them onto the stack, so as to process the branch instruction nested in the earlier-executed branch or the later-executed branch.
In some embodiments, the later execution branch of the different branches is determined according to a preset branch execution sequence.
In some exemplary embodiments, referring to FIG. 10, FIG. 10 shows another flow diagram of a branch instruction processing method. The method may be performed by a branch instruction processing unit in the SIMT processor described above.
In step S201, it is determined whether the currently executed instruction is a branch instruction; if not, go to step S202, if yes, go to step S203;
In step S202, the PC value of the currently executed instruction is incremented to obtain the PC value of the next instruction to be executed; after the next instruction is obtained according to the PC value of the next instruction, step S201 is repeatedly executed, that is, each obtained instruction needs to determine whether it is a branch instruction.
In step S203, determining a branch to be executed according to a first flag bit of each thread corresponding to the branch instruction;
In step S204, if the plurality of threads are all used for executing the sequential-execution branch indicated by the branch instruction, the PC value of the branch instruction is incremented to obtain the PC value of the next instruction to be executed; the PC value of each subsequent instruction is likewise obtained by incrementing the PC value of the preceding instruction;
In step S205, if the plurality of threads are all used for executing the jump branch indicated by the branch instruction, the PC value of the first instruction located on the jump branch is taken as the PC value of the next instruction to be executed; the PC value of each subsequent instruction is then obtained by incrementing the PC value of the preceding instruction;
In step S206, if at least two of the plurality of threads are configured to execute different branches of the branch instruction (i.e. the first flag bits of at least two threads are different), determining whether the stack overflows, if yes, executing step S207, and if not, executing step S208;
In step S207, at least one designated entry in the stack is dumped to the shared memory corresponding to the thread cluster; at which point the stack makes room to continue pushing new entries.
In step S208, determining a first PC value of the converging instruction, a second PC value of a first instruction of the later execution branch in the different branches, and a second flag bit corresponding to each thread; pushing an entry comprising the first PC value, the second PC value and a second flag bit corresponding to each thread to the stack top of the stack;
In step S209, in the process of executing the branch instruction or the instructions in the earlier-executed branch, it is determined whether the third PC value of the next instruction to be executed is consistent with the first PC value at the stack top; if yes, step S210 is executed; if not, the third PC value remains unchanged, the next instruction is obtained using the third PC value, and step S209 is repeated;
In step S210, the third PC value is updated to the second PC value at the stack top, and the first flag bit of each thread corresponding to the later-executed branch is determined according to the second flag bit of each thread at the stack top and the first flag bit of the thread corresponding to the earlier-executed branch;
In step S211, in the process of executing the instruction of the later execution branch, it is determined whether the third PC value of the next instruction to be executed is consistent with the first PC value of the stack top; if yes, go to step S212; if not, the third PC value remains unchanged, the next instruction is acquired by using the third PC value, and step S211 is repeated;
in step S212, an entry including the first PC value, the second PC value, and a second flag bit corresponding to each thread is popped from the stack top of the stack;
In step S213, the converging instruction is executed, at which point all threads converge.
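The flow of steps S201 to S213 for a single divergent branch can be sketched as follows (a toy single-stepping model; the instruction encoding and all names are assumptions, and the per-thread flag bookkeeping of steps S203, S208 and S210 is omitted):

```python
def simulate(instrs):
    """Single-step a toy program through steps S201-S213 and return the
    trace of executed PCs.  Hypothetical encoding (not from the patent):
    ('op', name) ordinary instruction, ('jmp', target) unconditional jump,
    ('converge',) the converging instruction, and
    ('branch', jump_pc, converge_pc, diverges) a branch instruction.
    A stack entry [converge_pc, jump_pc, redirected] holds the first and
    second PC values plus a marker for whether the redirect happened."""
    pc, stack, trace = 0, [], []
    while pc < len(instrs):
        ins = instrs[pc]
        trace.append(pc)
        if ins[0] == 'branch':
            _, jump_pc, converge_pc, diverges = ins
            if diverges:
                stack.append([converge_pc, jump_pc, False])  # S206-S208
                next_pc = pc + 1      # the earlier-executed branch runs first
            else:
                next_pc = jump_pc     # S205: all threads take the jump branch
        elif ins[0] == 'jmp':
            next_pc = ins[1]
        else:
            next_pc = pc + 1          # S202/S204: increment the PC
        # S209/S211: compare the next PC with the first PC at the stack top.
        if stack and next_pc == stack[-1][0]:
            if not stack[-1][2]:
                stack[-1][2] = True
                next_pc = stack[-1][1]  # S210: redirect to the later branch
            else:
                stack.pop()             # S212: pop; converge executes next
        pc = next_pc
    return trace

# branch@0 diverges; sequential body @1-2 ends with a jump to the converging
# instruction @5; jump-branch body @3-4 falls through to it.
prog = [('branch', 3, 5, True), ('op', 'a'), ('jmp', 5),
        ('op', 'c'), ('op', 'd'), ('converge',), ('op', 'end')]
print(simulate(prog))  # prints [0, 1, 2, 3, 4, 5, 6]
```

In this sketch the sequential body ending in a jump to the converging instruction is what makes its next PC match the first PC value at the stack top, triggering the S210 redirect; the jump-branch body then falls through to the same PC, triggering the S212 pop.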
It will be understood that the solutions described in the above embodiments may be combined without conflict, and are not exemplified in the embodiments of the present disclosure.
Correspondingly, the embodiment of the disclosure also provides a chip comprising the SIMT processor.
Correspondingly, the embodiment of the disclosure also provides a board card comprising the chip.
Referring to fig. 11, an exemplary board card is provided, where the board card includes a package structure that encapsulates at least one chip 100, and may further include other components, including but not limited to: a memory 102 and an interface device 104.
The memory 102 is connected to the chip in the chip package structure through a bus for storing data. The memory 102 may include multiple sets of memory cells 106, for example DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory). Each set of memory cells 106 is connected to the chip 100 via a bus.
The interface device 104 is electrically connected to a chip within the chip package structure. The interface device 104 is used to enable data transfer between the chip and an external device 108 (e.g., terminal, server, camera, etc.). In one embodiment, the interface device 104 may include a PCIE interface, a network interface, or other interfaces, which is not limited by the disclosure.
Correspondingly, the embodiment of the disclosure also provides electronic equipment, which comprises the SIMT processor, the chip or the board card. The specific functions of the SIMT processor, the chip and the board may be described with reference to the above embodiments, and will not be described herein.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any of the previous embodiments.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the disclosed embodiments may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in essence or a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present disclosure.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the functions of the modules may be implemented in the same piece or pieces of software and/or hardware when implementing embodiments of the present disclosure. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing is merely a specific implementation of the embodiments of this disclosure, and it should be noted that, for a person skilled in the art, several improvements and modifications may be made without departing from the principles of the embodiments of this disclosure, which should also be considered as the protection scope of the embodiments of this disclosure.
Claims (18)
1. A SIMT processor comprising a branch instruction processing unit and a stack; multiple threads in the SIMT processor can synchronously execute instructions in the same instruction sequence, wherein the instruction sequence at least comprises a branch instruction and a convergence instruction, the branch instruction indicates multiple executable branches, and the multiple branches are gathered at the convergence instruction;
the branch instruction processing unit is configured to:
determining a first address parameter of a converging instruction and a second address parameter of a first instruction of a later executing branch of the different branches in the case that at least two of the plurality of threads are used to execute different branches of the branch instruction;
pushing an entry comprising the first address parameter and the second address parameter to a top of the stack;
In the process of executing the branch instruction or the instruction in the previous execution branch, in the case that the third address parameter of the next instruction to be executed is consistent with the first address parameter in the stack top, updating the third address parameter to the second address parameter in the stack top so that the first instruction of the subsequent execution branch is determined to be the next instruction.
2. The SIMT processor of claim 1, wherein said branch instruction processing unit is further configured to:
in the process of executing the instructions of the later-executed branch, in the case that the third address parameter of the next instruction is consistent with the first address parameter at the stack top, popping the entry comprising the first address parameter and the second address parameter from the stack top.
3. The SIMT processor according to claim 1 or 2, wherein said plurality of branches includes a jump branch and a sequential-execution branch; the first instruction on the sequential-execution branch is the instruction immediately following the branch instruction in the instruction sequence, and the jump branch is a branch of the plurality of branches other than the sequential-execution branch;
The address parameter of the first instruction on the sequential-execution branch is obtained by incrementing the address parameter of the branch instruction;
The address parameter of the first instruction on the jump branch is specified by the branch instruction and is discontinuous with the address parameter of the branch instruction.
4. A SIMT processor according to any of claims 1-3, wherein said branch instruction processing unit is further adapted to:
obtaining, according to each thread's evaluation of the judgment condition of the branch instruction, a first flag bit of each thread corresponding to the branch instruction, wherein the first flag bit of a thread corresponding to the branch instruction is used to indicate the next branch executed by the thread after execution of the branch instruction.
5. The SIMT processor of claim 4, wherein said plurality of branches includes a jump branch and a sequential-execution branch; the first instruction on the sequential-execution branch is the instruction immediately following the branch instruction in the instruction sequence, and the jump branch is a branch of the plurality of branches other than the sequential-execution branch;
The first flag bit of a thread corresponding to the jump branch is the same as the first flag bit of the thread corresponding to the branch instruction, and the first flag bit of a thread corresponding to the sequential-execution branch is obtained by inverting the first flag bit of the thread corresponding to the branch instruction.
6. The SIMT processor of claim 5, wherein said entries further include a first flag bit for each thread corresponding to said later execution branch.
7. The SIMT processor of claim 5, wherein each thread further corresponds to a second flag bit indicating whether said thread is running;
The entry further includes a first flag bit and the second flag bit for each thread corresponding to the later execution branch; or alternatively
The entry further includes a second flag bit for each thread; wherein a first flag bit of a thread corresponding to the later execution branch is determined based on a second flag bit of the thread and a first flag bit of the thread corresponding to the earlier execution branch.
8. The SIMT processor according to any of claims 1-7, wherein the number of branch instruction processing units and the number of stacks are each greater than 1, each branch instruction processing unit corresponding to one thread cluster in said SIMT processor, and each stack corresponding to one thread cluster in said SIMT processor;
Wherein, a plurality of threads in the same thread cluster can synchronously execute each instruction in the same instruction sequence.
9. The SIMT processor of claim 8, further comprising one or more shared memories, each shared memory corresponding to at least two thread clusters, different shared memories corresponding to different thread clusters;
The branch instruction processing unit is further configured to:
in the case of overflow of the stack, dumping a specified entry in the stack to the shared memory corresponding to the thread cluster; the specified entry is determined according to the write pointer of the stack, and the push time of the specified entry is earlier than the push times of the other entries that are not dumped.
10. The SIMT processor of claim 9, wherein said branch instruction processing unit is further configured to:
after the target entry is popped from the stack, writing the specified entry back into the stack; the storage position of the target entry in the stack is adjacent to, or differs by a preset number of storage units from, the original storage position of the specified entry in the stack, and each storage unit is used for storing one entry.
11. The SIMT processor according to any of claims 1 to 10, wherein said plurality of branches includes a jump branch and a sequential-execution branch; the first instruction on the sequential-execution branch is the instruction immediately following the branch instruction in the instruction sequence, and the jump branch is a branch of the plurality of branches other than the sequential-execution branch;
The branch instruction processing unit is further configured to:
in the case that the plurality of threads are all used to execute the sequential-execution branch indicated by the branch instruction, incrementing the address parameter of the branch instruction to obtain the address parameter of the next instruction to be executed;
and in the case that the plurality of threads are all used to execute the jump branch indicated by the branch instruction, taking the address parameter of the first instruction on the jump branch as the address parameter of the next instruction to be executed.
12. The SIMT processor according to any of claims 1 to 11, wherein said branch instruction processing unit is further configured to:
in the process of executing instructions in the earlier-executed branch or the later-executed branch, in the case that the current instruction is a branch instruction and at least two threads are used to execute different branches of that branch instruction, performing the steps of determining the first and second address parameters and pushing them onto the stack, so as to process the branch instruction nested in the earlier-executed branch or the later-executed branch.
13. The SIMT processor according to any of claims 1 to 12, wherein the later execution branch of said different branches is determined according to a preset branch ordering.
14. A branch instruction processing method, applied to a SIMT processor including a branch instruction processing unit and a stack, wherein a plurality of threads in the SIMT processor are capable of synchronously executing instructions in a same instruction sequence, wherein the instruction sequence includes at least a branch instruction and a converging instruction, the branch instruction indicating a plurality of executable branches, the plurality of branches converging at the converging instruction; the method comprises the following steps:
determining, in the case that at least two of the plurality of threads are to execute different branches of the branch instruction, a first address parameter of the converging instruction and a second address parameter of a first instruction of a later execution branch of the different branches;
pushing an entry comprising the first address parameter and the second address parameter to a top of the stack;
in the process of executing the branch instruction or an instruction in the earlier execution branch, in the case that a third address parameter of the next instruction to be executed is consistent with the first address parameter at the stack top, updating the third address parameter to the second address parameter at the stack top, so that the first instruction of the later execution branch is determined as the next instruction to be executed.
15. A chip comprising a SIMT processor as claimed in any of claims 1 to 13.
16. A board comprising a package structure encapsulating at least one chip according to claim 15.
17. An electronic device comprising a SIMT processor as claimed in any one of claims 1 to 13, or a chip as claimed in claim 15, or a board as claimed in claim 16.
18. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the steps of the method of claim 14.
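The reconvergence scheme recited in claims 11 to 14 can be sketched as a small interpreter. The following Python model is an illustrative assumption, not the patented implementation: the instruction encoding, the helper names (`DivergenceStack`, `run`), and the choice of the jump branch as the later execution branch are all invented for this sketch.

```python
class DivergenceStack:
    """Holds (reconverge_pc, later_branch_pc) entries, i.e. claim 14's
    first and second address parameters."""

    def __init__(self):
        self.entries = []

    def push(self, reconverge_pc, later_branch_pc):
        self.entries.append((reconverge_pc, later_branch_pc))

    def redirect(self, next_pc):
        # Claim 14: when the address of the next instruction to be
        # executed matches the first address parameter at the stack top,
        # replace it with the second address parameter so the later
        # execution branch runs before the converging instruction.
        if self.entries and self.entries[-1][0] == next_pc:
            return self.entries.pop()[1]
        return next_pc


def run(program, n_threads=4):
    """Interpret a tiny instruction list SIMT-style: every thread shares
    one program counter, and the branch condition is evaluated per
    thread id."""
    stack, pc, trace = DivergenceStack(), 0, []
    while pc < len(program):
        op = program[pc]
        trace.append(pc)
        if op[0] == "branch":  # ("branch", jump_target, reconverge_pc, cond)
            _, target, reconv, cond = op
            taken = [cond(t) for t in range(n_threads)]
            if all(taken):                  # uniform: every thread jumps
                pc = stack.redirect(target)
            elif not any(taken):            # uniform: all fall through
                pc = stack.redirect(pc + 1)  # claim 11: increment the PC
            else:                           # divergent: sequential branch
                stack.push(reconv, target)  # first, jump branch later
                pc = stack.redirect(pc + 1)
        elif op[0] == "jump":               # ("jump", target)
            pc = stack.redirect(op[1])
        else:                               # ("op", name)
            pc = stack.redirect(pc + 1)
    return trace


# if/else over the thread id: pcs 1-2 form the sequential branch,
# pcs 3-4 the jump branch, and pc 5 the converging instruction.
prog = [
    ("branch", 3, 5, lambda t: t % 2 == 0),  # even threads take the jump
    ("op", "seq_a"),
    ("jump", 5),        # end of sequential branch: head for the converge pc
    ("op", "jump_a"),
    ("op", "jump_b"),
    ("op", "converge"),
]
print(run(prog))  # both branches run before pc 5: [0, 1, 2, 3, 4, 5]
```

On a divergent branch the model pushes the converging instruction's address and the jump target, runs the fall-through path first, and lets `redirect` swap the program counter to the pending path the moment it would reach the reconvergence point, which is the behavior claim 14 describes.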
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210613728.6A CN114911528B (en) | 2022-05-31 | 2022-05-31 | Branch instruction processing method, processor, chip, board card, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114911528A (en) | 2022-08-16
CN114911528B (en) | 2024-09-13
Family
ID=82770305
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210613728.6A Active CN114911528B (en) | 2022-05-31 | 2022-05-31 | Branch instruction processing method, processor, chip, board card, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114911528B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118349193B (en) * | 2024-06-18 | 2024-10-18 | 北京辉羲智能科技有限公司 | Stack storage circuit, stack pushing and pulling method and device based on stack storage circuit |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101807145A (en) * | 2010-04-16 | 2010-08-18 | 浙江大学 | Hardware realization method of stack-type branch predictor |
CN104919418A (en) * | 2013-01-15 | 2015-09-16 | 国际商业机器公司 | Confidence threshold-based opposing branch path execution for branch prediction |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9898288B2 (en) * | 2015-12-29 | 2018-02-20 | Mediatek Inc. | Methods and systems for managing an instruction sequence with a divergent control flow in a SIMT architecture |
CN112579164B (en) * | 2020-12-05 | 2022-10-25 | 西安翔腾微电子科技有限公司 | SIMT conditional branch processing device and method |
- 2022-05-31: CN application CN202210613728.6A granted as patent CN114911528B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN114911528A (en) | 2022-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US4507728A (en) | Data processing system for parallel processing of different instructions | |
US11900113B2 (en) | Data flow processing method and related device | |
EP2842041B1 (en) | Data processing system and method for operating a data processing system | |
US9317339B2 (en) | Systems and methods for implementing work stealing using a configurable separation of stealable and non-stealable work items | |
US20230084523A1 (en) | Data Processing Method and Device, and Storage Medium | |
CN101652747A (en) | Parallel dll tree initialization | |
CN114911528B (en) | Branch instruction processing method, processor, chip, board card, equipment and medium | |
CN112395093A (en) | Multithreading data processing method and device, electronic equipment and readable storage medium | |
US5978896A (en) | Method and system for increased instruction dispatch efficiency in a superscalar processor system | |
JP2022522363A (en) | Ring buffer update processing | |
US9841994B2 (en) | Implementation of multi-tasking on a digital signal processor with a hardware stack | |
CN118012628A (en) | Data processing method, device and storage medium | |
US8327122B2 (en) | Method and system for providing context switch using multiple register file | |
US9158545B2 (en) | Looking ahead bytecode stream to generate and update prediction information in branch target buffer for branching from the end of preceding bytecode handler to the beginning of current bytecode handler | |
CN112631510B (en) | Stack area expansion method, device and hardware platform | |
WO2021098257A1 (en) | Service processing method based on heterogeneous computing platform | |
US20210311773A1 (en) | Efficient Condition Variables via Delegated Condition Evaluation | |
CN116204124B (en) | Data processing method and system based on conflict lock and electronic equipment | |
CN115905040B (en) | Counter processing method, graphics processor, device and storage medium | |
CN112882753A (en) | Program running method and device | |
US7124259B2 (en) | Methods and apparatus for indexed register access | |
EP1537480A2 (en) | Method and apparatus for handling nested interrupts | |
CN115794260B (en) | Simple dynamic loading method for DSP software library | |
US20240296153A1 (en) | Metadata updating | |
CN113419871B (en) | Object processing method based on synchronous groove and related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||