CN114896079B - Instruction execution method, processor and electronic device - Google Patents

Instruction execution method, processor and electronic device

Info

Publication number
CN114896079B
CN114896079B CN202210590406.4A
Authority
CN
China
Prior art keywords
instruction
thread group
group
thread
execute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210590406.4A
Other languages
Chinese (zh)
Other versions
CN114896079A (en)
Inventor
Request not to publish name
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202210590406.4A, patent CN114896079B
Priority to CN202311451535.6A, patent CN117472600A
Publication of CN114896079A
Application granted
Publication of CN114896079B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

An instruction execution method, a processor, and an electronic device. The instruction execution method is used for a plurality of working subgroups in a working group, and each working subgroup corresponds to a parent thread group. The parent thread group includes a plurality of dependent thread groups, and each dependent thread group is configured to execute at least one dependency instruction group. Each dependency instruction group includes a synchronization barrier instruction whose synchronization range is the plurality of dependent thread groups included in the same parent thread group, and each dependent thread group includes a first thread group and a second thread group. The method includes: executing each dependency instruction group; and releasing the second thread group in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction. According to the instruction execution method, the synchronization range of the synchronization barrier instruction is limited to the plurality of child thread groups that correspond to the same parent thread group and run on the same execution unit. This narrows the synchronization range of the synchronization barrier instruction and thereby reduces the delay in the instruction execution process.

Description

Instruction execution method, processor and electronic device
Technical Field
Embodiments of the present disclosure relate to an instruction execution method, a processor, and an electronic device.
Background
In current computing devices, data processing integrated circuits such as central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), and general-purpose graphics processing units (General-Purpose computing on Graphics Processing Units, GPGPU) may execute programs to perform various functions, such as convolutional neural network (Convolutional Neural Network, CNN) operations and artificial intelligence (Artificial Intelligence, AI) operations. Efficient multithreading performance is becoming increasingly important to current computing devices and applications.
Disclosure of Invention
At least one embodiment of the present disclosure provides an instruction execution method. The instruction execution method is used for a plurality of working subgroups in a working group. Each working subgroup of the plurality of working subgroups corresponds to a parent thread group, the parent thread group includes a plurality of dependent thread groups, the plurality of dependent thread groups are used to execute tasks of the working subgroup corresponding to the parent thread group, and the plurality of dependent thread groups included in the same parent thread group run on the same execution unit. Each dependent thread group of the plurality of dependent thread groups is configured to execute at least one dependency instruction group, each dependency instruction group of the at least one dependency instruction group includes a synchronization barrier instruction, and the synchronization range of the synchronization barrier instruction is the plurality of dependent thread groups included in the same parent thread group. Each dependent thread group includes a first thread group and a second thread group, and the first thread group and the second thread group are each configured to execute the synchronization barrier instruction in each dependency instruction group. The method comprises: executing each dependency instruction group; and releasing the second thread group in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction.
For example, the instruction execution method provided in at least one embodiment of the present disclosure further includes: causing the second thread group to enter a wait mode in response to the first thread group not having completed execution of the synchronization barrier instruction or the second thread group not having completed execution of the synchronization barrier instruction.
For example, in an instruction execution method provided in at least one embodiment of the present disclosure, each dependency instruction group further includes a producer instruction group and a consumer instruction group, the producer instruction group includes at least one producer instruction, the consumer instruction group includes at least one consumer instruction, the first thread group is configured to execute the producer instruction group to generate an execution result, and the second thread group is configured to execute the consumer instruction group to use the execution result.
For example, in an instruction execution method provided by at least one embodiment of the present disclosure, the execution results are received and stored using a shared memory space, so that the second thread group uses the execution results in executing the consumer instruction group.
For example, in an instruction execution method provided in at least one embodiment of the present disclosure, the execution unit includes at least one counter, and executing each dependency instruction group includes: causing the first thread group and the second thread group to each execute the synchronization barrier instruction to change the count value of the counter.
For example, in an instruction execution method provided in at least one embodiment of the present disclosure, causing the first thread group and the second thread group to execute the synchronization barrier instruction to change the count value of the counter includes: initializing the count value of the counter to a preset initial value; causing the first thread group to execute the producer instruction group; causing the first thread group to execute the synchronization barrier instruction to increment the count value of the counter by a step value; and causing the second thread group to execute the synchronization barrier instruction to increment the count value of the counter by the step value.
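The counter protocol in the steps above can be traced in a minimal single-threaded sketch. This is an illustration only: the step value, the threshold, and the producer workload (`sum(range(4))`) are assumptions for the example, not values taken from the disclosure.

```python
# Trace of the counter-based barrier: initialize the counter, run the
# producer instruction group, then let each thread group execute the
# synchronization barrier instruction by adding the step value.
STEP = 1                 # assumed step value
THRESHOLD = 2            # preset threshold: both thread groups must arrive
count = 0                # counter initialized to a preset initial value

result = sum(range(4))   # stand-in for the producer instruction group
count += STEP            # first thread group executes the barrier instruction
count += STEP            # second thread group executes the barrier instruction

released = count >= THRESHOLD  # condition for releasing the second thread group
print(released, result)        # True 6
```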
For example, in an instruction execution method provided in at least one embodiment of the present disclosure, releasing the second thread group in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction includes: causing the second thread group to execute the consumer instruction group in response to the count value of the counter reaching a preset threshold.
For example, in an instruction execution method provided in at least one embodiment of the present disclosure, causing the first thread group and the second thread group to execute the synchronization barrier instruction to change the count value of the counter includes: initializing the count value of the counter to a preset initial value; causing the second thread group to execute the synchronization barrier instruction to increment the count value of the counter by a step value; causing the second thread group to enter a wait mode in response to the count value of the counter not reaching a preset threshold; causing the first thread group to execute the producer instruction group; and causing the first thread group to execute the synchronization barrier instruction to increment the count value of the counter by the step value.
For example, in an instruction execution method provided in at least one embodiment of the present disclosure, releasing the second thread group in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction includes: causing the second thread group to execute the consumer instruction group in response to the count value of the counter reaching the preset threshold.
For example, in an instruction execution method provided by at least one embodiment of the present disclosure, the producer instruction group includes a first producer instruction and a second producer instruction, and causing the first thread group and the second thread group to each execute the synchronization barrier instruction to change the count value of the counter includes: initializing the count value of the counter to a preset initial value; causing the first thread group to execute the first producer instruction; causing the second thread group to execute the synchronization barrier instruction to increment the count value of the counter by a step value; causing the second thread group to enter a wait mode in response to the count value of the counter not reaching a preset threshold; causing the first thread group to execute the second producer instruction; and causing the first thread group to execute the synchronization barrier instruction to increment the count value of the counter by the step value.
For example, in an instruction execution method provided by at least one embodiment of the present disclosure, the consumer instruction group includes a first consumer instruction, and releasing the second thread group in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction includes: causing the second thread group to execute the first consumer instruction in response to the count value of the counter reaching the preset threshold.
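The interleaving described in the paragraphs above, where the second thread group arrives at the barrier between the two producer instructions and must wait, can be sketched as a trace. The trace labels and numeric values are illustrative assumptions, not names from the disclosure:

```python
# Single-threaded trace of the interleaving: the second thread group's
# barrier arrival lands between the first and second producer instructions.
STEP, THRESHOLD = 1, 2
count = 0
trace = []

trace.append("first producer instruction")      # first thread group
count += STEP                                   # second thread group executes the barrier
if count < THRESHOLD:
    trace.append("second thread group enters wait mode")
trace.append("second producer instruction")     # first thread group
count += STEP                                   # first thread group executes the barrier
if count >= THRESHOLD:
    trace.append("second thread group released: first consumer instruction")
print(trace)
```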
At least one embodiment of the present disclosure provides a processor comprising: at least one computing unit, each of the at least one computing unit configured to execute at least one working group, each of the at least one computing unit including at least one execution unit, each of the at least one execution unit configured to execute at least one working subgroup, each working subgroup corresponding to a parent thread group, the parent thread group including a plurality of dependent thread groups for executing tasks of the working subgroup corresponding to the corresponding parent thread group, the plurality of dependent thread groups included in the same parent thread group being executed on the same execution unit, each of the plurality of dependent thread groups being configured to execute at least one dependency instruction group, each of the at least one dependency instruction group including a synchronization barrier instruction whose synchronization range is the plurality of dependent thread groups included in the same parent thread group, each dependent thread group including a first thread group and a second thread group each configured to execute the synchronization barrier instruction, and the second thread group being released in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction.
For example, in a processor provided in at least one embodiment of the present disclosure, each execution unit includes at least one counter, and each of the dependent thread groups is further configured to cause the first thread group and the second thread group to execute the synchronization barrier instruction to change a count value of the counter.
For example, in a processor provided in at least one embodiment of the present disclosure, each dependency instruction group further includes a producer instruction group and a consumer instruction group, the producer instruction group including at least one producer instruction and the consumer instruction group including at least one consumer instruction, the first thread group being configured to execute the producer instruction group and the second thread group being configured to execute the consumer instruction group.
For example, in a processor provided by at least one embodiment of the present disclosure, the processor further includes a shared memory space, wherein the shared memory space is configured to receive and store an execution result generated by the first thread group executing the producer instruction group, for use by the second thread group in executing the consumer instruction group.
At least one embodiment of the present disclosure also provides an electronic device including a processor provided in any one of the embodiments of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1A is a schematic diagram of a thread scheduling grid;
FIG. 1B is a schematic diagram of a thread group execution process;
FIG. 1C is a schematic diagram of another thread group execution;
FIG. 2A is a diagram illustrating the execution of an instruction in a dependency relationship;
FIG. 2B is a diagram illustrating a synchronization barrier command synchronization range;
FIG. 3 is a schematic diagram of a dependent thread group execution provided in at least one embodiment of the present disclosure;
FIG. 4 is a flow chart of an instruction execution method according to at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a synchronization barrier instruction synchronization range provided in at least one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a dependency instruction set execution process in a dependency thread set according to at least one embodiment of the present disclosure;
FIG. 7 is an exemplary flowchart of another example of step S10 in FIG. 4;
FIG. 8 is an exemplary flowchart of yet another example of step S10 in FIG. 4;
FIG. 9 is an exemplary flowchart of still another example of step S10 in FIG. 4;
FIG. 10 is a schematic block diagram of a processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of another processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 12 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure; and
FIG. 13 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The present disclosure is illustrated by the following several specific examples. Detailed descriptions of known functions and known components may be omitted for the sake of clarity and conciseness in the following description of the embodiments of the present disclosure. When any element of an embodiment of the present disclosure appears in more than one drawing, the element is identified by the same or similar reference numeral in each drawing.
FIG. 1A is a schematic diagram of a thread scheduling grid. For example, as shown in FIG. 1A, when a processor allocates computing tasks, a task may be represented by a scheduling grid (grid). A scheduling grid may be divided into a plurality of work groups (work groups), and a work group can be allocated only to one hardware Computing Unit (CU) for computation; that is, one computing unit executes one work group of a program in one pass. One work group may in turn be divided into a plurality of work subgroups (subgroups). Each work subgroup is assigned to an Execution Unit (EU) in the computing unit (CU) for computation, i.e., an execution unit executes one work subgroup of a program in one pass. Each work subgroup includes multiple threads (threads), the smallest granularity of execution in the processor. Threads in the same work group may be grouped by scheduling unit and then scheduled to hardware for execution group by group. This scheduling unit is called a thread group (warp).
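The hierarchy just described (grid → work group → work subgroup → threads grouped into warps) can be sketched as a small data model. This is an illustration only: the warp size of 32 and the example thread counts are assumptions, since the disclosure does not fix them.

```python
from dataclasses import dataclass, field
from typing import List

WARP_SIZE = 32  # assumed scheduling-unit size; not specified by the disclosure

@dataclass
class WorkSubgroup:
    num_threads: int
    def num_warps(self) -> int:
        # threads are scheduled to hardware warp by warp, rounding up
        return -(-self.num_threads // WARP_SIZE)

@dataclass
class WorkGroup:                  # runs on one computing unit (CU)
    subgroups: List[WorkSubgroup] = field(default_factory=list)

@dataclass
class Grid:                       # a task represented as a scheduling grid
    work_groups: List[WorkGroup] = field(default_factory=list)

grid = Grid([WorkGroup([WorkSubgroup(64), WorkSubgroup(40)])])
warps = sum(sg.num_warps() for wg in grid.work_groups for sg in wg.subgroups)
print(warps)  # 64 threads -> 2 warps, 40 threads -> 2 warps: 4
```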
FIG. 1B is a schematic diagram of a thread group execution process. For example, as shown in FIG. 1B, the scheduling grid includes work groups, which include work subgroups. A subset of work can be mapped onto a thread group. The thread groups run on one execution unit, i.e., each execution unit may perform computations in one process on one of the thread groups.
FIG. 1C is a schematic diagram of another thread group execution process. For example, as shown in FIG. 1C, the scheduling grid includes work groups, which include work subgroups, as in FIG. 1B. Unlike in FIG. 1B, a work subgroup can also be mapped onto a virtual parent thread group (virtual parent warp). One virtual parent thread group corresponds to a plurality of child thread groups (child warps). The plurality of child thread groups corresponding to each parent thread group run on the same execution unit and together complete the tasks assigned to the parent thread group. That is, each execution unit may, in one pass, perform computations for the multiple child thread groups corresponding to one parent thread group.
FIG. 2A is a diagram illustrating execution of instructions in a dependency relationship. Large-scale producer-consumer dependencies can exist between child thread groups. For example, as shown in FIG. 2A, a plurality of child thread groups, such as those in FIG. 1C, includes both producer child thread groups and consumer child thread groups. Each producer child thread group executes a plurality of producer instructions, and each consumer child thread group executes a plurality of consumer instructions. Producer instructions and consumer instructions are two different instruction categories that synchronize on a shared resource. For a set of dependencies, a producer child thread group executes a set of producer instructions to generate a set of execution results and fills the execution results into the corresponding slots (slots) of a shared memory space; the consumer child thread group executes a set of consumer instructions to use the execution results in the corresponding slots of the shared memory space. For example, as shown in FIG. 2A, for slot k in shared memory, the producer child thread group executes producer instruction 2k and producer instruction 2k+1 in order to generate execution result k, and fills execution result k into slot k; consumer child thread group k executes consumer instruction k after slot k is filled, thereby using execution result k in slot k.
For example, as shown in fig. 2A, in the parallel processing, the consumer sub-thread group needs to wait for the corresponding slot to be filled with the execution result (i.e., after the producer sub-thread group finishes executing the corresponding producer instruction and obtains the execution result), and then the consumer instruction can be executed to use the execution result. Thus, as shown in FIG. 2A, the work group barrier instruction bar.wg is used in the parallel processing described above to achieve mutual coordination in the execution of each group of producer instructions-consumer instructions. The workgroup barrier instruction bar.wg is one of the synchronous barrier instructions (barrier). For example, by setting a synchronization barrier instruction in a set of instruction execution processes executing in parallel, each process may wait at the barrier until all processes in the set have executed the synchronization barrier instruction. Unless all processes execute the synchronization barrier instruction, no process can continue to execute across the barrier, thereby implementing the synchronization function through the synchronization barrier instruction.
For example, as shown in FIG. 2A, for the workgroup barrier instruction bar.wg, one implementation is to integrate a counter in the corresponding computing unit. For example, for slot k, after the producer child thread group executes producer instruction 2k and producer instruction 2k+1 in order to obtain execution result k and fill slot k, the producer child thread group executes the workgroup barrier instruction bar.wgid(k) to increment the count value of the counter by one step value (e.g., increment by 1). The consumer child thread group first executes the workgroup barrier instruction bar.wgid(k) to increment the count value of the counter by the step value (e.g., increment by 1); if the count value does not reach a predetermined value, the consumer child thread group enters a wait mode without executing consumer instruction k; if the count value reaches the predetermined value (i.e., the producer and consumer child thread groups have each finished executing the workgroup barrier instruction bar.wgid(k)), the consumer child thread group is released (i.e., exits the wait mode) to execute consumer instruction k and use execution result k in slot k. Thus, the waiting and release of the consumer child thread group is controlled by the workgroup barrier instruction bar.wg, which implements a synchronization function in the execution of producer-consumer dependent instructions.
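The counter-based workgroup barrier just described can be approximated with ordinary threads. This is a sketch under stated assumptions: `CounterBarrier`, the slot values, and the per-slot threshold of 2 are illustrative stand-ins for the hardware mechanism, not part of the disclosure.

```python
import threading

class CounterBarrier:
    """Per-slot counter: a waiter is released once the count reaches the threshold."""
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.count = 0
        self.cond = threading.Condition()

    def arrive(self) -> None:
        # analogous to executing bar.wg: add one step value to the counter
        with self.cond:
            self.count += 1
            self.cond.notify_all()

    def wait_released(self) -> None:
        # analogous to wait mode: block until the count reaches the threshold
        with self.cond:
            self.cond.wait_for(lambda: self.count >= self.threshold)

slots = [None, None]                       # shared-memory slots
barriers = [CounterBarrier(2), CounterBarrier(2)]
results = {}

def producer():
    for k in range(2):
        slots[k] = 10 * k + 1              # producer instructions fill slot k
        barriers[k].arrive()               # then signal via the barrier

def consumer(k):
    barriers[k].arrive()                   # consumer's own barrier arrival
    barriers[k].wait_released()            # wait until the producer has also arrived
    results[k] = slots[k]                  # consumer instruction k uses slot k

threads = [threading.Thread(target=producer)]
threads += [threading.Thread(target=consumer, args=(k,)) for k in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items()))  # [(0, 1), (1, 11)]
```

Note that the consumer can never read an unfilled slot: its release requires the producer's arrival, which happens only after the slot is written.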
FIG. 2B is a diagram illustrating the synchronization range of a synchronization barrier instruction. For example, as shown in FIG. 2B, when the synchronization barrier instruction is, for example, the workgroup barrier instruction bar.wg in FIG. 2A, the synchronization range of the synchronization barrier instruction is one work group executed in one computing unit (for example, the illustrated computing unit CU 0). One computing unit CU 0 includes a plurality of execution units (e.g., EU 0, EU 1, ..., EU n-1, where n is a positive integer), each having a plurality of child thread groups running on it. Thus, as shown in FIG. 2B, the synchronization range of the workgroup barrier instruction bar.wg is large, resulting in a large delay in the instruction execution process.
At least one embodiment of the present disclosure provides an instruction execution method. The instruction execution method is used for a plurality of working subgroups in a working group, and each working subgroup of the plurality of working subgroups corresponds to a parent thread group. The parent thread group comprises a plurality of dependent thread groups, and the plurality of dependent thread groups are used for running tasks of the working subgroup corresponding to the corresponding parent thread group. The plurality of dependent thread groups included in the same parent thread group run on the same execution unit. Each dependent thread group is configured to execute at least one dependency instruction group. Each dependency instruction group includes a synchronization barrier instruction whose synchronization range is the plurality of dependent thread groups included in the same parent thread group. Each dependent thread group includes a first thread group and a second thread group, each configured to execute the synchronization barrier instruction in each dependency instruction group. The instruction execution method comprises the following steps: executing each dependency instruction group; and releasing the second thread group in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction.
Embodiments of the present disclosure also provide a processor or an electronic device corresponding to performing the above-described instruction execution method.
According to the instruction execution method, the processor, and the electronic device provided by at least one embodiment of the present disclosure, the synchronization range of the synchronization barrier instruction is limited to the plurality of child thread groups running on the same execution unit. This narrows the synchronization range of the synchronization barrier instruction and thereby reduces the delay in the instruction execution process.
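To make the claimed benefit concrete, the following back-of-the-envelope comparison counts how many thread groups a barrier must synchronize under the two scopes. The hardware shape (4 EUs per CU, 3 parent thread groups per EU, 2 child thread groups per parent) is purely an assumed example, not taken from the disclosure.

```python
NUM_EUS_PER_CU = 4           # assumed number of execution units per computing unit
PARENTS_PER_EU = 3           # assumed parent thread groups per execution unit
CHILD_WARPS_PER_PARENT = 2   # assumed child thread groups per parent

# bar.wg scope: every child thread group of every parent on every EU of the CU
workgroup_scope = NUM_EUS_PER_CU * PARENTS_PER_EU * CHILD_WARPS_PER_PARENT
# narrowed scope: only the child thread groups of one parent on one EU
parent_scope = CHILD_WARPS_PER_PARENT
print(workgroup_scope, parent_scope)  # 24 2
```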
Hereinafter, at least one embodiment of the present disclosure will be described in detail with reference to the accompanying drawings. It should be noted that the same reference numerals in different drawings will be used to refer to the same elements already described.
FIG. 3 is a schematic diagram of a dependent thread group execution process provided in at least one embodiment of the present disclosure.
For example, as shown in FIG. 3, the work groups in the scheduling grid include work subgroups. A work subgroup may be mapped onto a virtual parent thread group. A virtual parent thread group includes a plurality of dependent thread groups (dependent thread group 0, dependent thread group 1, ..., dependent thread group n-1, where n is a positive integer), and the plurality of dependent thread groups are used to perform the tasks of the work subgroup corresponding to the respective parent thread group. For example, in some examples, each dependent thread group includes two child thread groups for implementing producer-consumer dependencies; the two child thread groups are of two types, producer and consumer, and execute producer instructions and consumer instructions, respectively.
For example, as shown in FIG. 3, the multiple dependent thread groups run on the same execution unit, i.e., each execution unit may perform, in one process, the computations of the multiple dependent thread groups corresponding to one parent thread group. Here, the parent thread group is virtual rather than an actually existing thread group; it is used to describe the correspondence between the dependent thread groups and the working subgroup, and does not represent the thread group configuration at actual run time.
Fig. 4 is a flow chart illustrating an instruction execution method according to at least one embodiment of the present disclosure.
For example, the instruction execution method provided in FIG. 4 is used for the working subgroup in FIG. 3. For example, each dependent thread group included in the parent thread group corresponding to the working subgroup is configured to execute at least one dependency instruction group. For example, as shown in FIG. 4, the instruction execution method includes the following steps S10 and S20.
Step S10: executing each dependency instruction set;
step S20: releasing the second thread group in response to both the first thread group and the second thread group having finished executing the synchronization barrier instruction.
For example, in step S10, each dependency instruction group includes a synchronization barrier instruction whose synchronization range is a plurality of dependency thread groups included in the same parent thread group.
For example, in step S20, each dependent thread group includes a first thread group and a second thread group, and the first thread group and the second thread group each execute the synchronization barrier instruction. The second thread group is released in response to both the first thread group and the second thread group having finished executing the synchronization barrier instruction.
For example, the first thread group and the second thread group may be different types of sub-thread groups running on the execution unit, one of them for executing producer instructions and the other for executing consumer instructions. The first thread group and the second thread group together form a dependent thread group; a dependent thread group is not a particular sub-thread group running on the execution unit. For example, the plurality of sub-thread groups running on an execution unit may be divided into subgroups of two, each such subgroup being referred to as a dependent thread group. For example, in some examples, 2n sub-thread groups run on the execution unit and are divided into n subgroups for implementing producer-consumer dependencies, each subgroup containing 2 sub-thread groups and being referred to as a dependent thread group (e.g., dependent thread group 0, dependent thread group 1, … …, dependent thread group n-1 in FIG. 3), where n is a positive integer. Of the 2 sub-thread groups included in each subgroup, one is the first thread group (for executing producer instructions) and the other is the second thread group (for executing consumer instructions). Of course, this division is merely exemplary rather than limiting; the actual division may be made as desired, and embodiments of the present disclosure are not limited in this regard.
For example, in an embodiment of the present disclosure, the instruction execution method further includes: causing the second thread group to enter a wait mode in response to either the first thread group or the second thread group not having finished executing the synchronization barrier instruction.
For example, in the process of the first thread group and the second thread group respectively executing the synchronization barrier instruction, if the first thread group has not finished executing the synchronization barrier instruction or the second thread group has not finished executing the synchronization barrier instruction, the second thread group enters the wait mode; and if both the first thread group and the second thread group have finished executing the synchronization barrier instruction, the second thread group is released. For example, in the above process, the first thread group may execute instructions without waiting, i.e., the instruction execution process of the first thread group may be unaffected by the synchronization barrier instruction; alternatively, the first thread group may be set to a wait mode based on the synchronization barrier instruction according to process requirements, which is not limited by the embodiments of the present disclosure.
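The wait/release rule described above can be sketched as a minimal single-threaded model. This is an illustrative sketch only, not the patented hardware implementation: the class and method names (VpwBarrier, arrive, second_group_released) are assumptions introduced for clarity, and the counter behavior follows the two-arrival threshold described in the embodiments.

```python
class VpwBarrier:
    """Models one synchronization barrier shared by a first (producer)
    thread group and a second (consumer) thread group on the same
    execution unit. Names are illustrative, not from the disclosure."""

    THRESHOLD = 2  # both sub-thread groups of the dependent thread group must arrive

    def __init__(self):
        # counter initialized to a preset initial value (here, 0)
        self.count = 0

    def arrive(self):
        """Either thread group finishing the barrier instruction
        increments the counter by a step value of 1."""
        self.count += 1

    def second_group_released(self):
        """The second thread group is released only once both thread
        groups have finished executing the barrier instruction;
        otherwise it stays in wait mode."""
        return self.count >= self.THRESHOLD
```

In this sketch, the second thread group polls `second_group_released()` before running its consumer instructions; in hardware the release would instead be signaled by a barrier management unit.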
Fig. 5 is a schematic diagram of a synchronization range of a synchronization barrier instruction according to at least one embodiment of the present disclosure.
For example, as shown in fig. 5, at least one embodiment of the present disclosure provides a synchronization barrier instruction bar.vpw for the instruction execution method of fig. 4, whose synchronization range is the N dependent thread groups (dependent thread group 0, dependent thread group 1, … …, dependent thread group N-1) included in the same parent thread group, where N is a positive integer. The N dependent thread groups included in the parent thread group run on the same execution unit EU 0 and are used to execute tasks of the working sub-group corresponding to the parent thread group. By contrast, for the work group barrier instruction bar.wg of FIG. 2B, the synchronization range is one work group executed in one computing unit (e.g., CU 0 of FIG. 2B), and one computing unit includes multiple execution units (EU 0, EU 1, … …, EU n-1, where n is a positive integer). Therefore, compared to the work group barrier instruction bar.wg, the synchronization barrier instruction bar.vpw provided in at least one embodiment of the present disclosure limits the synchronization scope of the synchronization barrier instruction to a plurality of sub-thread groups corresponding to the same parent thread group running on the same execution unit, thereby reducing the synchronization scope of the synchronization barrier instruction and reducing the delay existing in the instruction execution process.
For example, in an embodiment of the present disclosure, each dependency instruction set further includes a producer instruction set and a consumer instruction set. The first thread group is configured to execute a producer instruction group to generate an execution result, and the second thread group is configured to execute a consumer instruction group to use the execution result. The set of producer instructions includes at least one producer instruction and the set of consumer instructions includes at least one consumer instruction.
For example, at least one embodiment of the present disclosure provides for the instruction execution method further to receive and store execution results using the shared memory space such that the second thread group uses the execution results in executing the consumer instruction group.
For example, for each dependent thread group, the first thread group executes at least one producer instruction in the set of producer instructions to generate a set of execution results, and fills the set of execution results into slots corresponding to the shared memory space; the second thread group executes at least one consumer instruction in the consumer instruction group to use the execution result in the slot corresponding to the shared memory space.
FIG. 6 is a schematic diagram of a dependency instruction set execution process in a dependency thread set according to at least one embodiment of the present disclosure.
For example, as shown in FIG. 6, the dependent thread group 100 includes a first thread group 110 and a second thread group 120. The first thread group 110 is configured to execute at least one producer instruction group (producer instruction group 110-0, producer instruction group 110-1, … …, producer instruction group 110-(n-1)), and the second thread group is configured to execute at least one consumer instruction group (consumer instruction group 120-0, consumer instruction group 120-1, … …, consumer instruction group 120-(n-1)), where n is a positive integer. For example, in the first thread group 110, instructions are executed in the order of the dashed arrows; in the second thread group 120, instructions are also executed in the order of the dashed arrows.
For example, each producer instruction group includes at least one producer instruction, and each consumer instruction group includes at least one consumer instruction. For example, as shown in FIG. 6, producer instruction group 110-k includes 2 producer instructions (producer instruction 2k and producer instruction 2k+1) and consumer instruction group 120-k includes 1 consumer instruction (consumer instruction k), where k is an integer satisfying 0 ≤ k ≤ n-1. For example, the dependent thread group 100 in FIG. 6 is configured to execute n dependency instruction groups, and the producer instruction group 110-k and the consumer instruction group 120-k described above constitute the k-th dependency instruction group.
For example, as shown in FIG. 6, the n dependent instruction groups correspond to n slots in the shared memory space. For example, for the kth dependency instruction group, the first thread group 110 sequentially executes producer instructions 2k and 2k+1 of the producer instruction group 110-k, thereby generating a kth group of execution results, and populating the kth group of execution results into slot k; the second thread group 120 executes consumer instruction k in consumer instruction group 120-k, thereby using the kth set of execution results in slot k.
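The slot-based exchange described above can be sketched as follows. This is a simplified single-threaded sketch under assumed names (run_dependent_thread_group, and the arithmetic standing in for producer work are illustrative, not from the disclosure); it shows only the data flow, with the barrier release collapsed to a comment.

```python
def run_dependent_thread_group(n):
    """Sketch of one dependent thread group executing n dependency
    instruction groups against n slots in a shared memory space."""
    slots = [None] * n   # n slots in the shared memory space
    consumed = []
    for k in range(n):
        # first thread group: producer instructions 2k and 2k+1
        # generate the k-th group of execution results
        # (the arithmetic below is illustrative stand-in work)
        partial = k * 2          # producer instruction 2k
        result = partial + 1     # producer instruction 2k+1
        slots[k] = result        # fill the k-th result into slot k
        # synchronization barrier bar.vpw id(k) would release the
        # second thread group here
        consumed.append(slots[k])  # consumer instruction k uses slot k
    return consumed
```

In the actual method the two thread groups run concurrently on the same execution unit, and the barrier — not program order — guarantees slot k is filled before it is read.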
For example, as shown in FIG. 6, each dependency instruction group also includes a synchronization barrier instruction bar.vpw that spans multiple dependency thread groups included in the same parent thread group. For example, each synchronization barrier instruction bar.vpw has an id corresponding to the dependency instruction group. For example, the kth dependency instruction group includes the synchronization barrier instruction bar.vpw id (k). For example, during execution of n dependency instruction groups by the dependent thread group 100, the first thread group 110 and the second thread group 120 execute n synchronization barrier instructions bar.vpw id0, bar.vpw id1, … …, bar.vpw id (n-1), respectively.
Fig. 7 is an exemplary flowchart of another example of step S10 in fig. 4.
For example, the instruction execution method provided in FIG. 4 corresponds to multiple dependent thread groups running on the same execution unit, and the execution unit includes a counter. For example, each counter corresponds to a synchronization barrier instruction whose synchronization range is the plurality of dependent thread groups included in the same parent thread group. For example, step S10 in the instruction execution method shown in fig. 4 includes causing the first thread group and the second thread group to each execute the synchronization barrier instruction to change the count value of the counter. Further, as shown in fig. 7, this step S10 includes the following steps S1101 to S1104.
Step S1101: initializing the count value of a counter to a preset initial value;
step S1102: causing the first thread group to execute a producer instruction group;
step S1103: causing the first thread group to execute a synchronization barrier instruction to increment a count value of the counter by a step value;
step S1104: the second thread group is caused to execute the synchronization barrier instruction to increment the count value of the counter by a step value.
For example, the execution of the dependent instruction group in fig. 6 is taken as an example, and the execution of the kth dependent instruction group is described. For example, for step S1101, the count value of the counter in the execution unit is first initialized to a preset initial value (e.g., the preset initial value is 0). For example, in the first thread group 110, for step S1102, the first thread group 110 sequentially executes the producer instruction 2k and the producer instruction 2k+1 in the producer instruction group 110-k to obtain a kth group of execution results, and fills the slot k; for step S1103, the first thread group 110 continues to execute the synchronization barrier instruction bar.vpw id (k), thereby increasing the count value of the counter by a step value (for example, a step value of 1). For example, in the second thread group 120, for step S1104, the second thread group 120 executes the synchronization barrier instruction bar.vpw id (k), thereby increasing the count value of the counter by a step value (for example, a step value of 1).
For example, for the consumer instruction group 120-k, if the count value has not reached a preset threshold (e.g., the preset threshold is 2), i.e., the first thread group 110 has not finished executing the synchronization barrier instruction bar.vpw id (k) or the second thread group 120 has not finished executing the synchronization barrier instruction bar.vpw id (k), the second thread group 120 enters the wait mode without executing the consumer instruction group 120-k; if the count value reaches the preset threshold (e.g., the preset threshold is 2), i.e., both the first thread group 110 and the second thread group 120 have finished executing the synchronization barrier instruction bar.vpw id (k), the second thread group 120 is released, i.e., the second thread group 120 executes consumer instruction k in the consumer instruction group 120-k to use the k-th group of execution results in slot k. Thus, by controlling the waiting and release of the second thread group 120 through the synchronization barrier instruction bar.vpw, the synchronization function in the producer-consumer dependency instruction execution process is achieved.
In at least one embodiment of the present disclosure, the example instruction execution method of fig. 7, for example, implements the synchronization function of the synchronization barrier instruction bar.vpw by integrating a counter corresponding to the synchronization barrier instruction bar.vpw separately in the execution unit. The size of the counter in the execution unit is smaller (e.g., the size of the counter integrated in the calculation unit is about 5 bits and the size of the counter integrated in the execution unit is about 3 bits) than, for example, the counter integrated in the calculation unit in fig. 2A, thereby reducing the memory space occupied in hardware. Meanwhile, by integrating the counter in the execution unit independently, the synchronous range of the synchronous barrier instruction is limited in the same execution unit in the hardware level, so that the synchronous range of the synchronous barrier instruction is reduced, and the delay existing in the instruction execution process is reduced.
For example, in some examples, taking the dependency instruction group execution process shown in fig. 6 as an example, the second thread group 120 needs to either enter the wait mode or be released from the wait mode, according to whether the count value of the counter reaches the preset threshold, when executing the synchronization barrier instruction bar.vpw.wait bar_id. For example, bar_id in the synchronization barrier instruction bar.vpw.wait bar_id represents a unique id corresponding to the dependency instruction group. For example, child_warp_count represents the number of sub-thread groups that have finished executing the synchronization barrier instruction, and its count range is the multiple sub-thread groups corresponding to the same parent thread group running on the same execution unit. For example, as shown in fig. 6, in one dependent thread group, when neither the first thread group 110 nor the second thread group 120 has finished executing the synchronization barrier instruction, child_warp_count=0; when only one of them has finished executing the synchronization barrier instruction, child_warp_count=1; and when both have finished, child_warp_count=2. For example, the synchronization barrier instruction bar.vpw.wait bar_id has the following behaviors in execution:
(1) Behavior 1: notifying the barrier management unit that the second thread group 120 has arrived (corresponding to the count value of the counter being incremented by 1), and if child_warp_count is not equal to 2, the second thread group 120 enters a waiting mode;
(2) Behavior 2: if child_warp_count=2, then all waiting child thread groups (e.g., second thread group 120) are released.
For example, the first thread group 110 does not need to enter a wait mode when executing the synchronization barrier instruction bar.vpw; therefore, the synchronization barrier instruction executed by the first thread group 110 is named bar.vpw.pass bar_id. For example, bar_id in the synchronization barrier instruction bar.vpw.pass bar_id represents a unique id corresponding to the dependency instruction group. For example, the synchronization barrier instruction bar.vpw.pass bar_id has the following behaviors in execution:
(1) Behavior 1: notifying the barrier management unit that the first thread group 110 has arrived (corresponding to the count value of the counter being incremented by 1), and that the first thread group 110 can continue executing subsequent instructions without waiting;
(2) Behavior 2: if child_warp_count=2, then all waiting child thread groups (e.g., second thread group 120) are released.
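The two behaviors of bar.vpw.wait and bar.vpw.pass described above can be sketched together. This is a behavioral model only, not the hardware barrier management unit: the BarrierManager class, its method names, and the returned release lists are assumptions introduced for illustration; the threshold of 2 follows the child_warp_count semantics described above.

```python
class BarrierManager:
    """Illustrative model of the barrier management unit for one
    dependent thread group (one producer group plus one consumer group)."""

    def __init__(self):
        self.child_warp_count = 0  # sub-thread groups that finished the barrier
        self.waiting = []          # thread groups parked in wait mode

    def vpw_pass(self, group):
        """bar.vpw.pass: notify arrival; this group never waits.
        Returns the list of thread groups released by this arrival."""
        self.child_warp_count += 1
        return self._maybe_release()

    def vpw_wait(self, group):
        """bar.vpw.wait: notify arrival; park this group in wait mode
        if child_warp_count has not reached 2."""
        self.child_warp_count += 1
        if self.child_warp_count != 2:
            self.waiting.append(group)
            return []
        return self._maybe_release()

    def _maybe_release(self):
        # Behavior 2 of both instructions: release all waiting groups
        # once child_warp_count == 2
        if self.child_warp_count == 2:
            released, self.waiting = self.waiting, []
            return released
        return []
```

For instance, if the consumer group issues bar.vpw.wait first, it is parked; the producer group's later bar.vpw.pass then releases it while the producer itself continues without waiting.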
In at least one embodiment of the present disclosure, the synchronization barrier instruction bar.vpw uses the count value child_warp_count to limit the synchronization scope of the synchronization barrier instruction to the plurality of sub-thread groups corresponding to the same parent thread group running on the same execution unit, so as to reduce the synchronization scope of the synchronization barrier instruction, thereby reducing the delay existing in the instruction execution process.
Fig. 8 is an exemplary flowchart of still another example of step S10 in fig. 4, and fig. 9 is an exemplary flowchart of still another example of step S10 in fig. 4.
For example, the process of the first thread group executing the producer instruction group and the process of the second thread group executing the consumer instruction group are independent of each other and have no fixed precedence relationship in timing. For example, as shown in FIG. 6, for the kth dependent instruction group, instruction execution may be in the chronological order of producer instruction 2k, producer instruction 2k+1, bar.vpw id (k) (executed by first thread group 110), bar.vpw id (k) (executed by second thread group 120), consumer instruction k; instruction execution may also be in the chronological order of bar.vpw id (k) (executed by the second thread group 120), producer instruction 2k, producer instruction 2k+1, bar.vpw id (k) (executed by the first thread group 110), consumer instruction k; instruction execution may also be in the chronological order of producer instructions 2k, bar.vpw id (k) (executed by the second thread group 120), producer instructions 2k+1, bar.vpw id (k) (executed by the first thread group 110), consumer instructions k; or may follow other execution time sequences, to which embodiments of the present disclosure are not limited. For example, fig. 8 and 9 show two other instruction execution time sequences different from fig. 7. It should be noted that, the instruction execution method provided in at least one embodiment of the present disclosure is not limited to the instruction execution time sequences in fig. 7, 8 and 9, and the specific execution sequence may be adjusted according to actual requirements.
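The order-independence described above — the consumer instruction runs only after both barrier arrivals, regardless of interleaving — can be sketched with a small check. This is an illustrative sketch: the function name, the event labels, and the list-of-events encoding of a schedule are assumptions introduced here, not part of the disclosure.

```python
def consumer_release_index(schedule):
    """Given a schedule (a list of event labels), return the index at
    which the second thread group is released, i.e. the point where
    both 'bar_first' (first thread group finishes bar.vpw) and
    'bar_second' (second thread group finishes bar.vpw) have occurred.
    Returns None if the release condition is never met."""
    count = 0  # models the counter in the execution unit, initialized to 0
    for i, event in enumerate(schedule):
        if event in ("bar_first", "bar_second"):
            count += 1          # each arrival adds the step value 1
            if count == 2:      # preset threshold: both groups arrived
                return i
    return None
```

Applying this to the interleavings listed above shows the release point is the same whether the producer's or the consumer's barrier arrival comes first.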
For example, as shown in fig. 8, the instruction execution method such as in fig. 4 is also implemented by integrating a counter in an execution unit such as in fig. 7. Further, as shown in fig. 8, for example, step S10 in fig. 4 includes the following steps S1201 to S1205.
Step S1201: initializing the count value of a counter to a preset initial value;
step S1202: causing the second thread group to execute the synchronization barrier instruction to increment the count value of the counter by a step value;
step S1203: responsive to the count value of the counter not reaching a preset threshold, causing the second thread group to enter a wait mode;
step S1204: causing the first thread group to execute a producer instruction group;
step S1205: the first thread group is caused to execute a synchronization barrier instruction to increment a count value of the counter by a step value.
For example, as shown in fig. 8, the execution of the dependency instruction group in fig. 6 is again taken as an example to describe the execution of the k-th dependency instruction group. For example, for step S1201, the count value of the counter in the execution unit is first initialized to a preset initial value (e.g., the preset initial value is 0). For step S1202, the second thread group 120 executes the synchronization barrier instruction bar.vpw id (k), thereby increasing the count value of the counter by a step value (for example, a step value of 1). For step S1203, in response to the count value not reaching the preset threshold (e.g., the preset threshold is 2), the second thread group 120 enters the wait mode without executing the consumer instruction group 120-k. For step S1204, the first thread group 110 sequentially executes the producer instruction 2k and the producer instruction 2k+1 in the producer instruction group 110-k to obtain the k-th group of execution results and fill slot k. For step S1205, the first thread group 110 continues to execute the synchronization barrier instruction bar.vpw id (k), thereby increasing the count value of the counter by a step value (e.g., a step value of 1). At this time, in response to the count value reaching the preset threshold (e.g., the preset threshold is 2), the second thread group 120 is released, i.e., the second thread group 120 executes consumer instruction k in consumer instruction group 120-k to use the k-th group of execution results in slot k.
For example, as shown in fig. 9, the instruction execution method such as in fig. 4 is also implemented by integrating a counter in an execution unit such as in fig. 7. For example, the set of producer instructions includes a first producer instruction and a second producer instruction, and the set of consumer instructions includes a first consumer instruction. Further, as shown in fig. 9, for example, step S10 in fig. 4 includes the following steps S1301 to S1306.
Step S1301: initializing the count value of a counter to a preset initial value;
step S1302: causing the first thread group to execute a first producer instruction;
step S1303: causing the second thread group to execute the synchronization barrier instruction to increment the count value of the counter by a step value;
step S1304: responsive to the count value of the counter not reaching a preset threshold, causing the second thread group to enter a wait mode;
step S1305: causing the first thread group to execute a second producer instruction;
step S1306: the first thread group is caused to execute a synchronization barrier instruction to increment the count value of the counter by a step value.
For example, as shown in fig. 9, the execution of the dependent instruction group in fig. 6 is also taken as an example, and the execution of the kth dependent instruction group is described. For example, for step S1301, the count value of the counter in the execution unit is first initialized to a preset initial value (e.g., the preset initial value is 0). For step S1302, the first thread group 110 executes a first producer instruction (e.g., producer instruction 2k in FIG. 6). For step S1303, the second thread group 120 executes the synchronization barrier instruction bar.vpw id (k), thereby increasing the count value of the counter by a step value (for example, a step value of 1). For step S1304, the second thread group 120 enters a wait mode without executing the consumer instruction group 120-k in response to the count value not reaching the preset threshold (e.g., the preset threshold is 2). For step S1305, the first thread group 110 executes a second producer instruction (e.g., producer instruction 2k+1 in FIG. 6). For step S1306, the first thread group 110 continues to execute the synchronization barrier instruction bar.vpw id (k), thereby increasing the count value of the counter by a step value (e.g., a step value of 1). At this time, in response to the count value reaching a preset threshold (e.g., a preset threshold of 2), the second thread group 120 is released, i.e., the second thread group 120 executes the first consumer instruction (e.g., consumer instruction k in fig. 6) to use the kth group execution result in slot k.
For example, the set of producer instructions may further comprise n producer instructions, where n is a positive integer; the first producer instruction may be a first producer instruction of the n producer instructions, and the second producer instruction may be a last producer instruction of the n producer instructions; embodiments of the present disclosure are not limited in this regard.
For example, the consumer instruction group may also include m consumer instructions, where m is a positive integer; the first consumer instruction may be the first of the m consumer instructions; embodiments of the present disclosure are not limited in this regard.
Fig. 10 is a schematic block diagram of a processor provided in accordance with at least one embodiment of the present disclosure.
For example, as shown in fig. 10, the processor 200 includes at least one computing unit 210, each computing unit 210 configured to execute at least one work group. Each computing unit 210 includes at least one execution unit 211, each execution unit 211 configured to execute at least one subset of work. For example, each working subgroup corresponds to a parent thread group that includes multiple dependent thread groups. For example, the multiple dependent thread groups are used to run tasks of a working subgroup corresponding to a corresponding parent thread group, and multiple dependent thread groups included in the same parent thread group run on the same execution unit 211.
For example, each dependent thread group is configured to execute at least one dependency instruction group, each dependency instruction group including, for example, the synchronization barrier instruction bar.vpw. For example, each dependent thread group includes a first thread group and a second thread group configured to execute the synchronization barrier instruction bar.vpw. For example, each dependent thread group is further configured to release the second thread group in response to both the first thread group and the second thread group having finished executing the synchronization barrier instruction.
For example, the processor 200 is configured to perform the instruction execution method shown in fig. 4: any one of the plurality of dependent thread groups running on the same execution unit 211 executes each dependency instruction group, and releases the second thread group in response to both the first thread group and the second thread group having finished executing the synchronization barrier instruction.
For example, each execution unit 211 includes at least one counter 2101. For example, each counter 2101 corresponds to a set of synchronization barrier instructions that have a synchronization range of multiple dependent thread groups included in the same parent thread group. Each dependent thread group is further configured to cause the first thread group and the second thread group to execute a synchronization barrier instruction to change the count value of the counter 2101, respectively. At this time, the processor 200, when executing the instruction execution method shown in fig. 4, further includes causing the first thread group and the second thread group to execute the synchronization barrier instruction to change the count value of the counter 2101, respectively.
For example, each dependency instruction set also includes a producer instruction set and a consumer instruction set. The set of producer instructions includes at least one producer instruction and the set of consumer instructions includes at least one consumer instruction. For example, the first thread group is configured to execute a producer instruction group to generate an execution result, and the second thread group is configured to execute a consumer instruction group to use the execution result.
For example, processor 200 also includes shared memory space 220. The shared memory space 220 is configured to receive and store execution results generated by the first thread group executing the producer instruction group for use by the second thread group in executing the consumer instruction group. At this point, the processor 200 receives and stores the execution results using the shared memory space 220 such that the second thread group uses the execution results in executing the consumer instruction group.
Fig. 11 is a schematic block diagram of another processor provided in accordance with at least one embodiment of the present disclosure.
For example, in contrast to FIG. 10, each execution unit 211 in FIG. 11 includes a shared memory space 2102. The shared memory space 2102 is also configured to receive and store execution results generated by the first thread group executing the producer instruction group for use by the second thread group in executing the consumer instruction group. The processor 200 receives and stores execution results using the shared memory space 2102 such that the second thread group uses the execution results in executing the consumer instruction group.
Since other structures and functions of the processor 200 in fig. 11 are the same as those in fig. 10, details thereof are not repeated here for brevity; reference is made to the above description of fig. 10.
For example, as shown in fig. 11, integrating the shared memory space 2102 within each execution unit 211 may increase the efficiency of data transfer, and the size of the shared memory space may be smaller than, for example, integrating the shared memory space in the processor in fig. 10, thereby reducing the memory space occupied in hardware. Meanwhile, the speed of data transmission is improved and the delay existing in the instruction execution process is reduced by independently integrating the shared storage space in the execution unit.
For example, the shared memory space may be integrated in the processor as shown in fig. 10, may be integrated inside each execution unit as shown in fig. 11, or may be disposed at other locations, and embodiments of the present disclosure do not limit the specific location of the shared memory space. As another example, the shared memory space may be a register, in particular, a general purpose register (General Purpose Register, GPR); the shared memory space may also be other data structures that may implement the function of receiving and storing execution results generated by the first thread group executing the producer instruction group; embodiments of the present disclosure are not limited to a particular form of shared memory space.
It should be noted that, in the embodiments of the present disclosure, specific functions and technical effects of the processor 200 may refer to the description of the method for executing instructions provided in at least one embodiment of the present disclosure, which is not repeated herein.
Fig. 12 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure.
For example, as shown in fig. 12, the electronic device 300 includes a processor 200, where the processor 200 is a processor provided in any embodiment of the disclosure, such as the processor 200 shown in fig. 10 or fig. 11.
For example, the electronic device 300 may be a DDR digital system, or any device such as a mobile phone, a tablet computer, a notebook computer, an e-book reader, a game console, a television, a digital photo frame, or a navigator, or any combination of electronic devices and hardware; the embodiments of the present disclosure are not limited in this respect.
It should be noted that, for clarity and brevity, not all of the constituent elements of the electronic device 300 are described in the embodiments of the present disclosure. To realize the necessary functions of the electronic device, those skilled in the art may provide or arrange other constituent elements not shown, such as a communication unit (e.g., a network communication unit) or an input/output unit (e.g., a keyboard, a speaker, etc.), as needed; the embodiments of the present disclosure are not limited in this respect. For the related description and technical effects of the electronic device 300, reference may be made to those of the processor provided in the embodiments of the present disclosure, which are not repeated here.
Fig. 13 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure.
For example, as shown in FIG. 13, the electronic device 400 is suitable for implementing the instruction execution method provided by the embodiments of the present disclosure. The electronic device 400 may be a terminal device, a server, or the like. It should be noted that the electronic device 400 shown in FIG. 13 is only an example and does not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
For example, as shown in FIG. 13, the electronic device 400 may include a processing device 41 (e.g., a central processing unit, a graphics processor, etc.), which includes, for example, a processor according to any embodiment of the present disclosure and can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 42 or a program loaded from a storage device 48 into a random access memory (RAM) 43. The RAM 43 also stores various programs and data required for the operation of the electronic device 400. The processing device 41, the ROM 42, and the RAM 43 are connected to one another via a bus 44, and an input/output (I/O) interface 45 is also connected to the bus 44. In general, the following devices may be connected to the I/O interface 45: input devices 46 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 47 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 48 including, for example, a magnetic tape and a hard disk; and a communication device 49, which may allow the electronic device 400 to communicate wirelessly or by wire with other electronic devices to exchange data.
While FIG. 13 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required; the electronic device 400 may alternatively be implemented with more or fewer devices.
For detailed description and technical effects of the electronic device 300/400, reference may be made to the above related description of the instruction execution method, which is not repeated here.
With respect to the present disclosure, the following points should be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in these embodiments; for other structures, reference may be made to common designs.
(2) Without conflict, features of the same embodiment and of different embodiments of the present disclosure may be combined with each other.
The foregoing is merely specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art could readily conceive of changes or substitutions within the technical scope of the present disclosure, all of which shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (16)

1. An instruction execution method for a plurality of working subgroups in a working group, wherein each working subgroup of the plurality of working subgroups corresponds to a parent thread group, the parent thread group comprises a plurality of dependent thread groups, the plurality of dependent thread groups are configured to run tasks of the working subgroup corresponding to the respective parent thread group, and the plurality of dependent thread groups comprised in the same parent thread group run on the same execution unit,
each of the plurality of dependent thread groups is configured to execute at least one dependency instruction group, each of the at least one dependency instruction group comprising a synchronization barrier instruction whose synchronization range is the plurality of dependent thread groups comprised in the same parent thread group, and
each dependent thread group comprises a first thread group and a second thread group, the first thread group and the second thread group being configured to execute the synchronization barrier instruction in each dependency instruction group,
the method comprising:
executing said each dependency instruction group; and
in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction, causing the second thread group to exit a wait mode.
2. The instruction execution method of claim 1, further comprising:
the second thread group enters the wait mode in response to either the first thread group not completing execution of the synchronization barrier instruction or the second thread group not completing execution of the synchronization barrier instruction.
3. The instruction execution method of claim 1, wherein each dependency instruction group further comprises a producer instruction group and a consumer instruction group, the producer instruction group comprising at least one producer instruction, the consumer instruction group comprising at least one consumer instruction, the first thread group being configured to execute the producer instruction group to generate an execution result, and the second thread group being configured to execute the consumer instruction group to use the execution result.
4. The instruction execution method according to claim 3, wherein,
the execution results are received and stored using a shared memory space such that the second thread group uses the execution results in executing the consumer instruction group.
5. The instruction execution method of claim 3, wherein said execution unit comprises at least one counter,
said executing said each dependency instruction group comprises:
causing the first thread group and the second thread group to execute the synchronization barrier instruction, respectively, to change the count value of the counter.
6. The instruction execution method of claim 5, wherein the causing the first thread group and the second thread group to execute the synchronization barrier instruction, respectively, to change the count value of the counter comprises:
initializing the count value of the counter to a preset initial value;
causing the first thread group to execute the producer instruction group;
causing the first thread group to execute the synchronization barrier instruction to increment a count value of the counter by a step value;
the second thread group is caused to execute the synchronization barrier instruction to increment a count value of the counter by the step value.
7. The instruction execution method of claim 6, wherein the second thread group exiting a wait mode in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction comprises:
and in response to the count value of the counter reaching a preset threshold, enabling the second thread group to exit the waiting mode and executing the consumer instruction group.
8. The instruction execution method of claim 5, wherein the causing the first thread group and the second thread group to execute the synchronization barrier instruction, respectively, to change the count value of the counter comprises:
initializing the count value of the counter to a preset initial value;
causing the second thread group to execute the synchronization barrier instruction to increment a count value of the counter by a step value;
responsive to the count value of the counter not reaching a preset threshold, causing the second thread group to enter the wait mode;
causing the first thread group to execute the producer instruction group;
the first thread group is caused to execute the synchronization barrier instruction to increment a count value of the counter by the step value.
9. The instruction execution method of claim 8, wherein the second thread group exiting a wait mode in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction comprises:
and in response to the count value of the counter reaching the preset threshold, enabling the second thread group to exit the waiting mode and executing the consumer instruction group.
10. The instruction execution method of claim 5, wherein the producer instruction group comprises a first producer instruction and a second producer instruction,
the causing the first thread group and the second thread group to execute the synchronization barrier instruction, respectively, to change the count value of the counter includes:
initializing the count value of the counter to a preset initial value;
causing the first thread group to execute the first producer instruction;
causing the second thread group to execute the synchronization barrier instruction to increment a count value of the counter by a step value;
responsive to the count value of the counter not reaching a preset threshold, causing the second thread group to enter the wait mode;
causing the first thread group to execute the second producer instruction;
The first thread group is caused to execute the synchronization barrier instruction to increment a count value of the counter by the step value.
11. The instruction execution method of claim 10, wherein the consumer instruction group comprises a first consumer instruction,
the second thread group exiting the wait mode in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction comprises:
and in response to the count value of the counter reaching the preset threshold, enabling the second thread group to execute the first consumer instruction.
12. A processor, comprising:
at least one computing unit, each of the at least one computing unit being configured to execute at least one working group,
each computing unit comprising at least one execution unit, each of the at least one execution unit being configured to execute at least one working subgroup,
wherein each working subgroup of the at least one working subgroup corresponds to a parent thread group, the parent thread group comprises a plurality of dependent thread groups for running tasks of the working subgroup corresponding to the respective parent thread group, and the plurality of dependent thread groups comprised in the same parent thread group run on the same execution unit,
each of the plurality of dependent thread groups is configured to execute at least one dependency instruction group, each of the at least one dependency instruction group comprising a synchronization barrier instruction whose synchronization range is the plurality of dependent thread groups comprised in the same parent thread group,
each dependent thread group comprises a first thread group and a second thread group, the first thread group and the second thread group being configured to execute the synchronization barrier instruction in each dependency instruction group, and
each dependent thread group is further configured to cause the second thread group to exit a wait mode in response to the first thread group completing execution of the synchronization barrier instruction and the second thread group completing execution of the synchronization barrier instruction.
13. The processor of claim 12, wherein each execution unit comprises at least one counter,
each of the dependent thread groups is further configured to cause the first thread group and the second thread group to execute the synchronization barrier instruction to change a count value of the counter, respectively.
14. The processor of claim 12, wherein each dependency instruction group further comprises a producer instruction group and a consumer instruction group, the producer instruction group comprising at least one producer instruction and the consumer instruction group comprising at least one consumer instruction,
wherein the first thread group is configured to execute the producer instruction group and the second thread group is configured to execute the consumer instruction group.
15. The processor of claim 14, further comprising a shared memory space,
wherein the shared memory space is configured to receive and store execution results generated by the first thread group executing the producer instruction group for use by the second thread group in executing the consumer instruction group.
16. An electronic device comprising a processor as claimed in any of claims 12-15.
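The counter-based barrier protocol recited in claims 5–11 can be simulated in a few lines of software. The following is a hedged sketch, not the claimed hardware: the preset initial value, step value, and threshold are chosen arbitrarily for illustration, and a Python condition variable plays the role of the wait mode.

```python
import threading

# Hedged software sketch of the counter-based barrier in claims 5-11 (not
# the claimed hardware). The counter starts at a preset initial value; each
# thread group's barrier instruction adds a step value; a thread group that
# arrives before the counter reaches the preset threshold enters wait mode
# and exits it once the threshold is reached.
INITIAL_VALUE, STEP_VALUE, THRESHOLD = 0, 1, 2   # arbitrary illustrative values

class CounterBarrier:
    def __init__(self):
        self.count = INITIAL_VALUE
        self.cond = threading.Condition()

    def barrier_instruction(self):
        with self.cond:
            self.count += STEP_VALUE      # execute the synchronization barrier instruction
            if self.count >= THRESHOLD:
                self.cond.notify_all()    # threshold reached: release any waiting group
            while self.count < THRESHOLD:
                self.cond.wait()          # wait mode

log = []
cb = CounterBarrier()

def first_thread_group():
    log.append("produce")                 # producer instruction group
    cb.barrier_instruction()

def second_thread_group():
    cb.barrier_instruction()              # may enter wait mode first, as in claim 8
    log.append("consume")                 # consumer instruction group after release

t2 = threading.Thread(target=second_thread_group); t2.start()
t1 = threading.Thread(target=first_thread_group); t1.start()
t1.join(); t2.join()
print(log)  # ['produce', 'consume']
```

Because the consumer step can run only after the counter reaches the threshold, which in turn requires the producer step to have finished, "produce" always precedes "consume" regardless of thread start order.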
CN202210590406.4A 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device Active CN114896079B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210590406.4A CN114896079B (en) 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device
CN202311451535.6A CN117472600A (en) 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210590406.4A CN114896079B (en) 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202311451535.6A Division CN117472600A (en) 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device

Publications (2)

Publication Number Publication Date
CN114896079A CN114896079A (en) 2022-08-12
CN114896079B true CN114896079B (en) 2023-11-24

Family

ID=82725791

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311451535.6A Pending CN117472600A (en) 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device
CN202210590406.4A Active CN114896079B (en) 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202311451535.6A Pending CN117472600A (en) 2022-05-26 2022-05-26 Instruction execution method, processor and electronic device

Country Status (1)

Country Link
CN (2) CN117472600A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103917959A * 2011-11-03 2014-07-09 Advanced Micro Devices, Inc. Method and system for workitem synchronization
CN107548490A * 2014-12-26 2018-01-05 Intel Corporation Humidometer in parallel computation
CN110647404A * 2018-06-27 2020-01-03 Intel Corporation System, apparatus and method for barrier synchronization in a multithreaded processor
CN112214243A * 2020-10-21 2021-01-12 Shanghai Biren Intelligent Technology Co., Ltd. Apparatus and method for configuring cooperative thread bundle in vector computing system
CN112749019A * 2019-10-29 2021-05-04 Nvidia Corporation High performance synchronization mechanism for coordinating operations on a computer system
CN113721987A * 2021-09-02 2021-11-30 Hygon Information Technology Co., Ltd. Instruction execution method and instruction execution device
CN113760495A * 2020-06-03 2021-12-07 Intel Corporation Hierarchical thread scheduling

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10824481B2 (en) * 2018-11-13 2020-11-03 International Business Machines Corporation Partial synchronization between compute tasks based on threshold specification in a computing system
US11080051B2 (en) * 2019-10-29 2021-08-03 Nvidia Corporation Techniques for efficiently transferring data to a processor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Conor Hetland. Paths to Fast Barrier Synchronization on the Node. HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019, 109-120. *
Yang Rui. Research on Communication Optimization Technology for Distributed Machine Learning Frameworks. China Master's Theses Full-text Database, Information Science and Technology. 2021, No. 04, full text. *

Also Published As

Publication number Publication date
CN117472600A (en) 2024-01-30
CN114896079A (en) 2022-08-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Patentee after: Shanghai Bi Ren Technology Co.,Ltd.

Country or region after: China

Address before: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Patentee before: Shanghai Biren Intelligent Technology Co., Ltd.

Country or region before: China
