CN115129369A

CN115129369A - Command distribution method, command distributor, chip and electronic device

Info

Publication number: CN115129369A
Application number: CN202110323622.8A
Authority: CN
Inventors: 王文强; 夏晓旭
Original assignee: Shanghai Power Tensors Intelligent Technology Co Ltd
Current assignee: Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2022-09-30
Also published as: WO2022198955A1

Abstract

The disclosure provides a command distribution method, a command distributor, a chip and an electronic device. The command distribution method comprises: determining, from a plurality of register groups, a plurality of first target register groups corresponding to a current processing cycle, wherein the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle; determining, from the thread groups respectively corresponding to the plurality of first target register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle; and distributing the commands corresponding to the determined target threads to the corresponding arithmetic units. In the embodiments of the disclosure, each thread group is accessed by at most one arithmetic unit in each processing cycle, so that an arithmetic unit receiving a command can directly access the corresponding register group without arbitration to obtain the operands required by the command, which improves both command distribution efficiency and command processing efficiency.

Description

Command distribution method, command distributor, chip and electronic device

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a command distribution method, a command distributor, a chip, and an electronic device.

Background

The structure of a command processing device such as a central processing unit or a graphics processor generally includes a controller, a command distributor connected to the controller, and a plurality of arithmetic units connected to the command distributor. The controller receives commands from the host, performs preliminary processing on them, and sends them to the command distributor, which distributes the commands to different arithmetic units for execution. With the continuous increase of compute-intensive tasks, hardware multithreading, a technology that can effectively improve parallel computing capacity, is widely applied in fields such as image processing, neural networks and data processing. Hardware multithreading improves computation speed by increasing the number of arithmetic units, keeping more threads executing in parallel, increasing the capacity of the register file that stores command operands, adopting external memory with higher bandwidth, and the like.

The current command distribution mode has the problem of low distribution efficiency.

Disclosure of Invention

The embodiment of the disclosure at least provides a command distribution method, a command distributor, a chip and an electronic device.

In a first aspect, an embodiment of the present disclosure provides a command distribution method, including: determining, from a plurality of register groups, a plurality of first target register groups corresponding to a current processing cycle; wherein the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle;

determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from thread groups respectively corresponding to the plurality of first target register groups; and distributing the commands corresponding to the determined target threads to the corresponding arithmetic units.

In one possible implementation, the determining, from the plurality of register sets, a plurality of first target register sets corresponding to a current processing cycle includes: determining a register group with an odd number in the plurality of register groups as the first target register group when the current processing cycle is an odd cycle; and determining the register group with the even number in the plurality of register groups as the first target register group when the current processing cycle is an even cycle.

In a possible embodiment, the method further comprises: and determining the grouping number of the registers according to the number of the operands of the arithmetic unit with the largest number of required operands, and dividing the registers into the plurality of register groups.

In one possible embodiment, the determining, from the thread groups corresponding to the first target register groups, target threads corresponding to the first target register groups in the current processing cycle includes: and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the thread groups respectively corresponding to the plurality of first target register groups based on the determined command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.

In one possible embodiment, the determining, based on the determined command execution state information of the respective threads in the thread groups respectively corresponding to the plurality of first target register groups, of the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the thread groups respectively corresponding to the plurality of first target register groups includes: determining, from the thread groups respectively corresponding to the first target register groups, a plurality of candidate threads whose command execution state information indicates a ready state; and determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle.

In one possible implementation, the determining, from the multiple candidate threads, target threads respectively corresponding to the multiple first target register sets in a current processing cycle includes: and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads.

In one possible implementation, the determining, from the multiple candidate threads, target threads respectively corresponding to the multiple first target register sets in a current processing cycle includes: and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupation states of the arithmetic units corresponding to the commands to be distributed.

In a possible implementation, in response to a multi-operand command to be distributed, which has more than one operand, existing among the commands to be distributed corresponding to the target threads determined for the current processing cycle, in each processing cycle from the current processing cycle to a target processing cycle, the first target register group corresponding to the multi-operand command to be distributed supplies a corresponding operand to the arithmetic unit corresponding to that command; wherein the difference between the cycle number of the target processing cycle and that of the current processing cycle is equal to the number of operands of the multi-operand command minus one.

In a possible embodiment, the method further comprises: in response to the multi-operand to-be-distributed command with two operands existing in the to-be-distributed command corresponding to the target thread determined for the current processing cycle, for each single-operand to-be-distributed command existing in the to-be-distributed command corresponding to the target thread determined for the current processing cycle, another single-operand to-be-distributed command in a ready state is determined for the first target register group where the single-operand to-be-distributed command is located in the next processing cycle of the current processing cycle.

In a possible implementation, the method further comprises: in response to a multi-operand command to be distributed, which has more than one operand, existing among the commands to be distributed corresponding to the target threads determined for the current processing cycle, determining the maximum number of operands, that is, the number of operands of the multi-operand command to be distributed that has the most operands; for each other command to be distributed whose number of operands is less than the maximum number of operands among the commands to be distributed corresponding to the target threads determined for the current processing cycle, in each processing cycle from the next processing cycle after the current processing cycle up to the processing cycle before the first target register group is scheduled again, in response to the first target register group in which that other command to be distributed is located being idle, determining, from the thread group corresponding to the first target register group, a command to be distributed in a ready state for the first target register group; wherein the number of operands of the command to be distributed in the ready state is not more than the number of processing cycles from the processing cycle of that command to the processing cycle in which the first target register group is scheduled again.

In a possible embodiment, the method further comprises: acquiring feedback information generated by the arithmetic unit after the arithmetic unit executes the command; and generating command execution state information corresponding to the thread to which the executed command belongs on the basis of the feedback information.

In a possible implementation, the method further comprises: and grouping the threads currently executed on the basis of the number of the register groups and the number of the threads currently executed to obtain thread groups respectively corresponding to each register group.

In a second aspect, an embodiment of the present disclosure provides a command distributor, including: a scheduler, and a distribution interface;

the scheduler is used for determining, from a plurality of register groups, a plurality of first target register groups corresponding to the current processing cycle, wherein the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle; and for determining, from the thread groups respectively corresponding to the plurality of first target register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;

and the distribution interface is used for distributing the determined commands corresponding to the target threads to the corresponding arithmetic units.

In one possible implementation, the scheduler, when determining a plurality of first target register sets corresponding to a current processing cycle from among the plurality of register sets, is configured to:

determining a register group numbered as an odd number among the plurality of register groups as the first target register group when the current processing cycle is an odd cycle;

and determining the register group with the even number in the plurality of register groups as the first target register group when the current processing cycle is an even cycle.

In a possible implementation, the scheduler is further configured to:

and determining the grouping number of the registers according to the number of the operands of the arithmetic unit with the largest number of required operands, and dividing the registers into the plurality of register groups.

In one possible embodiment, the scheduler, when determining, from among the thread groups corresponding to the first target register groups, target threads corresponding to the first target register groups in the current processing cycle, is configured to:

and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the thread groups respectively corresponding to the plurality of first target register groups based on the determined command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.

In one possible embodiment, the scheduler, when determining, from the thread groups respectively corresponding to the plurality of first target register groups, a target thread respectively corresponding to the plurality of first target register groups in a current processing cycle based on the determined command execution states of the respective threads in the thread groups respectively corresponding to the plurality of first target register groups, is configured to:

determining, from the thread groups respectively corresponding to the first target register groups, a plurality of candidate threads whose command execution state information indicates a ready state;

and determining target threads respectively corresponding to the plurality of first target register sets in the current processing cycle from the plurality of candidate threads.

In one possible implementation, the scheduler, when determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:

determine the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads; or

determine the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupation states of the arithmetic units corresponding to the commands to be distributed; or

determine the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the plurality of candidate threads and the types of the arithmetic units.

In a possible implementation, the scheduler is further configured to:

in response to the fact that a multi-operand command to be distributed with more than one operand exists in the command to be distributed corresponding to the target thread determined for the current processing period, from the current processing period to each processing period of the target processing period, a first target register group corresponding to the multi-operand command to be distributed distributes a corresponding operand to an arithmetic unit corresponding to the command to be distributed respectively;

wherein the difference between the number of cycles of the target processing cycle and the current processing cycle is equal to the number of the multiple operands reduced by one.

In a possible implementation, the scheduler is further configured to:

in response to the multi-operand to-be-distributed command with two operands existing in the to-be-distributed command corresponding to the target thread determined for the current processing cycle, for each single-operand to-be-distributed command existing in the to-be-distributed command corresponding to the target thread determined for the current processing cycle, another single-operand to-be-distributed command in a ready state is determined for the first target register group where the single-operand to-be-distributed command is located in the next processing cycle of the current processing cycle.

In a possible implementation, the scheduler is further configured to:

in response to a multi-operand command to be distributed with more than one operand in the command to be distributed corresponding to the target thread determined for the current processing cycle, determining the operand quantity of the multi-operand command to be distributed with the largest operand quantity in the multi-operand command to be distributed;

for each other command to be distributed, the number of operands of which is less than the maximum number of operands, in the command to be distributed corresponding to the target thread determined for the current processing cycle, from the next processing cycle of the current processing cycle to each processing cycle before the processing cycle in which the first target register group is scheduled again, in response to the first target register group in which the other command to be distributed is idle, determining a command to be distributed in a ready state for the first target register group from the thread group corresponding to the first target register group;

wherein the number of operands of the command to be distributed in the ready state is not more than the number of processing cycles from the processing cycle of that command to the processing cycle in which the first target register group is scheduled again.

In a possible implementation, the scheduler is further configured to: acquiring feedback information generated by the arithmetic unit after executing the command;

and generating command execution state information corresponding to the thread to which the executed command belongs on the basis of the feedback information.

In a possible implementation, the scheduler is further configured to:

and grouping the threads currently being executed based on the number of the register groups and the number of the threads currently being executed to obtain the thread groups respectively corresponding to each register group.

In a third aspect, an embodiment of the present disclosure further provides a chip, including: a controller, a command distributor, and an arithmetic unit;

the controller is used for acquiring commands corresponding to a plurality of threads respectively and sending the commands to the command distributor;

the command distributor is configured to distribute the command to the arithmetic unit based on the command distribution method according to any one of the first aspect;

the arithmetic unit is used for reading an operand from a target register group corresponding to the command based on the command distributed by the command distributor and executing the command based on the operand.

In a fourth aspect, an embodiment of the present disclosure further provides an electronic device, including the chip of the third aspect.

In a fifth aspect, the disclosed embodiments further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the command distribution method according to any one of the above first aspects.

For the description of the effects of the command distributor, the chip and the electronic device, reference is made to the description of the command distribution method, and details are not repeated here.

In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for use in the embodiments will be briefly described below, and the drawings herein incorporated in and forming a part of the specification illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It is appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope, for those skilled in the art will be able to derive additional related drawings therefrom without the benefit of the inventive faculty.

FIG. 1 is a flow chart illustrating a command distribution method provided by an embodiment of the present disclosure;

FIG. 2 illustrates a specific example of a command distribution apparatus provided by an embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating a specific example of command distribution by a command distribution apparatus according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a command distributor provided in an embodiment of the present disclosure;

FIG. 5 shows a schematic structural diagram of a chip provided in an embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.

Research shows that a command processing device comprises a controller, a command distribution unit and a plurality of arithmetic units. The controller acquires a command stream, performs preliminary processing on the commands in the stream, and transmits them to the command distribution unit. After the command distribution unit distributes the commands to the arithmetic units, the arithmetic units executing different commands obtain read permission for the registers through arbitration; once read permission is obtained, an arithmetic unit reads the operands required by its command from the registers and then executes the command based on the operands read. Arbitrating read permission for the register file introduces a certain delay, which reduces the throughput of the arithmetic units, lowers command distribution efficiency, and lowers the processing efficiency of each individual command.

In addition, after a command is distributed to an arithmetic unit, if the operands required to execute the command are not ready, the arithmetic unit switches to executing a command corresponding to another thread. This requires the command distribution unit to distribute a new command to the arithmetic unit, so the commands distributed to an arithmetic unit may include commands that cannot be executed immediately (they must wait until their operands are ready). As a result, command distribution efficiency is reduced and command processing efficiency is low.

Based on the above research, the present disclosure provides a command distribution method in which the registers in a register file are divided into a plurality of register groups, with different register groups corresponding to different thread groups. In each processing cycle, a plurality of first target register groups are determined, target threads corresponding to the first target register groups in the current processing cycle are determined from the thread groups respectively corresponding to the first target register groups, and the commands corresponding to the determined target threads are distributed to the corresponding arithmetic units. As a result, each thread group is accessed by at most one arithmetic unit in each processing cycle, so an arithmetic unit that receives a command can directly access the corresponding register group without arbitration to obtain the operands required by the command, which improves both command distribution efficiency and command processing efficiency.

The above-mentioned drawbacks are the results of the inventor after practical and careful study, and therefore, the discovery process of the above-mentioned problems and the solutions proposed by the present disclosure to the above-mentioned problems should be the contribution of the inventor in the process of the present disclosure.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.

To facilitate understanding of the present embodiment, a command distribution method disclosed in the embodiments of the present disclosure is first described in detail. The execution body of the command distribution method provided in the embodiments of the present disclosure is generally a command processing device such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or an Artificial Intelligence (AI) chip.

In the embodiments of the present disclosure, the operands of a command are the data that need to be read from the external memory when the command is executed. For example, if the command is to multiply data A by data B, the corresponding command operands are data A and data B; as another example, if the command is to perform a convolution operation on a feature map M to be processed using a convolution kernel F, the corresponding command operands are the feature map M and the convolution kernel F.

The following describes a command distribution method provided in the embodiment of the present disclosure.

Referring to fig. 1, a flowchart of a command distribution method provided in the embodiment of the present disclosure is shown, where the method includes steps S101 to S103, where:

S101: determining, from a plurality of register groups, a plurality of first target register groups corresponding to the current processing cycle; wherein the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle;

S102: determining, from the thread groups respectively corresponding to the plurality of first target register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;

S103: distributing the commands corresponding to the determined target threads to the corresponding arithmetic units.

The embodiments of the disclosure divide the registers into a plurality of register groups. In each processing cycle, a plurality of first target register groups corresponding to the current processing cycle are determined from the plurality of register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined from the thread groups respectively corresponding to the plurality of first target register groups, and the commands corresponding to the determined target threads are distributed to the corresponding arithmetic units. In each processing cycle, each thread group is thus accessed by at most one arithmetic unit, so that the arithmetic units receiving the commands can directly access the corresponding register groups without arbitration to obtain the operands required by the commands, which improves both command distribution efficiency and command processing efficiency.
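The flow of S101 to S103 within one processing cycle can be sketched as follows. This is only an illustrative outline: the function `dispatch_cycle`, its parameters, and the two placeholder policies in the usage lines are hypothetical names introduced here, not part of the disclosure, and the concrete selection rules are left as pluggable callables.

```python
def dispatch_cycle(cycle, select_groups, select_thread, thread_groups, next_command):
    """Run S101 to S103 for one processing cycle; returns {register_group: command}."""
    dispatched = {}
    for group in select_groups(cycle):                    # S101: first target register groups
        thread = select_thread(thread_groups[group])      # S102: target thread of that group
        if thread is not None:
            dispatched[group] = next_command[thread]      # S103: command handed to its unit
    return dispatched


# Usage with placeholder policies (assumptions): alternate between two halves
# of four register groups, and always take the first thread of a thread group.
thread_groups = {0: ["t0", "t4"], 1: ["t1", "t5"], 2: ["t2", "t6"], 3: ["t3", "t7"]}
next_command = {t: f"cmd_{t}" for group in thread_groups.values() for t in group}
by_parity = lambda cycle: [g for g in thread_groups if g % 2 == cycle % 2]
first_thread = lambda threads: threads[0] if threads else None

print(dispatch_cycle(0, by_parity, first_thread, thread_groups, next_command))
print(dispatch_cycle(1, by_parity, first_thread, thread_groups, next_command))
```

Because each register group contributes at most one command per cycle, the returned mapping never holds two commands that read the same register group.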

The following describes each of S101 to S103 in detail.

In S101 above, for example, the registers may be divided into a plurality of register groups in advance, and the threads that currently issue commands in the host are likewise divided into a plurality of thread groups, each register group corresponding to one thread group. For each thread group, the operands of the commands generated by each thread in the thread group are stored in the register group corresponding to that thread group.

Here, for example, the threads currently being executed may be grouped based on the number of register groups and the number of threads currently being executed, so as to obtain thread groups corresponding to the respective register groups.
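The text does not fix a particular grouping rule, so the following is a minimal sketch under one assumed rule: threads are dealt out to register groups in turn, which keeps the thread groups as even in size as possible. The function name `group_threads` is hypothetical.

```python
def group_threads(thread_ids, num_register_groups):
    """Split the currently executing threads into one thread group per register group.

    Assumed rule: the k-th thread goes to register group k mod num_register_groups.
    """
    groups = [[] for _ in range(num_register_groups)]
    for index, thread_id in enumerate(thread_ids):
        groups[index % num_register_groups].append(thread_id)
    return groups


# Eight threads over four register groups: two threads per thread group.
print(group_threads(list(range(8)), 4))
```

Other rules, such as contiguous blocks of thread identifiers, would fit the description equally well; the only requirement stated is that each register group ends up with one corresponding thread group.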

When determining the plurality of first target register groups corresponding to the current processing cycle from the plurality of register groups, for example, the plurality of register groups may be divided into at least two groupings, each grouping including a plurality of register groups. In each processing cycle, the register groups included in one of the groupings are determined as the first target register groups corresponding to that processing cycle. Over a plurality of processing cycles, the register groups in different groupings are determined in turn as the target register groups corresponding to the respective processing cycles.

For example, the register groups may be numbered, and the correspondence between register group numbers and processing cycle numbers may be predetermined. For example, in even-numbered processing cycles, the even-numbered register groups are determined as the target register groups corresponding to those cycles; in odd-numbered processing cycles, the odd-numbered register groups are determined as the target register groups corresponding to those cycles.
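The odd/even correspondence can be illustrated with a short sketch. The function name `first_target_groups` and the choice to number both register groups and processing cycles from 0 are assumptions made for the example.

```python
def first_target_groups(cycle, num_groups):
    """Return the register group numbers targeted in the given processing cycle.

    Even cycles select even-numbered groups; odd cycles select odd-numbered groups.
    """
    return [g for g in range(num_groups) if g % 2 == cycle % 2]


for cycle in range(4):
    print(cycle, first_target_groups(cycle, 4))
```

With this rule the groups selected in any cycle are disjoint from those selected in the immediately preceding cycle, which matches the requirement that the first target register groups differ from the most recently determined ones.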

Here, as an example, the number of register groups may be related to the number of operands required by the commands to be executed. For example, if each command to be executed requires at most 2 operands, the number of register groups is 2; if each command to be executed requires at most 3 operands, the number of register groups is 3. In general, assume that each command to be executed requires at most n operands, so that the number of register groups is n. In the i-th processing cycle, if the 1st of the n register groups is taken as the target register group, an arithmetic unit that receives a command requiring n operands in the i-th processing cycle reads one operand from the corresponding first target register group in that cycle, and then reads the remaining n-1 operands from the same first target register group in the subsequent (i+1)-th to (i+n-1)-th processing cycles. In those (i+1)-th to (i+n-1)-th processing cycles, the 2nd to n-th register groups are respectively taken as the target register groups, so that conflicts caused by arithmetic units accessing the same register group are avoided.
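The conflict-free property described above can be checked with a small simulation. The sketch below assumes that the n register groups are targeted in strict rotation, one per cycle, and that every dispatched command needs the full n operands, read one per cycle; the function name `operand_read_schedule` and the 0-based numbering are hypothetical.

```python
def operand_read_schedule(n, num_cycles):
    """Map each cycle to {register_group: issue_cycle of the command reading it}."""
    reads = {cycle: {} for cycle in range(num_cycles)}
    for issue_cycle in range(num_cycles):
        group = issue_cycle % n  # register group targeted in this cycle (rotation assumed)
        # The n-operand command issued now reads one operand per cycle from
        # `group`, in cycles issue_cycle .. issue_cycle + n - 1.
        for cycle in range(issue_cycle, min(issue_cycle + n, num_cycles)):
            # A conflict would mean two commands reading one group in the same cycle.
            assert group not in reads[cycle], "register group read twice in one cycle"
            reads[cycle][group] = issue_cycle
    return reads


schedule = operand_read_schedule(n=3, num_cycles=6)
for cycle, by_group in schedule.items():
    print(cycle, by_group)
```

A register group is targeted again only n cycles after it was last targeted, which is exactly when the previous command has finished reading its n operands, so the internal assertion never fires.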

In step S102, after determining the first target register groups corresponding to the current processing cycle, for example, a plurality of target threads may be determined from the thread groups corresponding to the first target register groups in at least one of the following manners:

(1) For each first target register group, an order is determined in a round-robin manner for the threads in the thread group corresponding to that first target register group, and, following that order, different threads in the thread group are determined as the target thread of the first target register group in different processing cycles.

(2) For each first target register group, the threads in the corresponding thread group that have a command in the current processing cycle are taken as candidate threads, and the candidate thread with the highest priority is taken as the target thread of the first target register group in the current processing cycle.

(3) The target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined based on the command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.
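Manners (1) and (2) can be sketched as below. The `Thread` record, the `RoundRobin` class, and the convention that a larger `priority` value means higher priority are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    tid: int
    priority: int = 0         # assumption: larger value = higher priority
    has_command: bool = True  # thread has a command in the current cycle


@dataclass
class RoundRobin:
    """Manner (1): visit the threads of one thread group in a fixed cycle."""
    threads: list
    next_index: int = 0

    def pick(self) -> Thread:
        thread = self.threads[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.threads)
        return thread


def pick_by_priority(threads) -> Optional[Thread]:
    """Manner (2): highest-priority thread among those holding a command."""
    candidates = [t for t in threads if t.has_command]
    return max(candidates, key=lambda t: t.priority, default=None)


if __name__ == "__main__":
    group = [Thread(0, 1), Thread(1, 5, has_command=False), Thread(2, 3)]
    rr = RoundRobin(group)
    print([rr.pick().tid for _ in range(4)])
    print(pick_by_priority(group))
```

Round-robin needs only a cursor per thread group, while the priority manner needs to know which threads currently hold a command.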

Here, for example, for each first target register group, a plurality of candidate threads whose command execution state information indicates a ready state may be determined from the thread group corresponding to that first target register group; the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are then determined from the plurality of candidate threads.

Here, the command execution state information corresponding to any thread includes, for example: whether the command most recently distributed to an arithmetic unit by the thread has finished executing, and/or whether the operands required by the command that the thread currently has to distribute have been prepared.

The operands corresponding to a command being prepared means, for example, that the data generated by other commands on which the command depends has already been stored in the corresponding registers, and/or that the operands that need to be read from external memory have already been stored in the corresponding registers.

If the command most recently distributed to an arithmetic unit by a thread has finished executing, and/or the operands required by the command that the thread currently has to distribute have been prepared, the command execution state information corresponding to the thread is considered to indicate a ready state, and the thread can be taken as a target thread.
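A minimal sketch of the ready-state test follows. Because the text allows the two conditions to be combined with "and/or", the `require_both` switch is an assumption that lets either reading be expressed; `ExecState`, `is_ready` and `candidate_threads` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ExecState:
    last_command_done: bool  # most recently distributed command has finished
    operands_prepared: bool  # operands of the next command are in registers


def is_ready(state, require_both=True):
    """Ready if both conditions hold, or if either holds when require_both is False."""
    if require_both:
        return state.last_command_done and state.operands_prepared
    return state.last_command_done or state.operands_prepared


def candidate_threads(states, require_both=True):
    """Return the ids of the threads whose state is ready."""
    return [tid for tid, s in states.items() if is_ready(s, require_both)]


if __name__ == "__main__":
    states = {
        0: ExecState(True, True),
        1: ExecState(True, False),
        2: ExecState(False, False),
    }
    print(candidate_threads(states))
    print(candidate_threads(states, require_both=False))
```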

In this case, in another embodiment of the present disclosure, the method further includes: acquiring feedback information generated by the arithmetic unit after the arithmetic unit executes the command; and generating command execution state information corresponding to the thread to which the executed command belongs on the basis of the feedback information.

Thus, the command distributor can know the execution condition of the command of each arithmetic unit in real time.
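One way to picture the feedback path is sketched below. The feedback record layout (`{"thread": ..., "status": ...}`) and the `ExecutionTracker` class are hypothetical; the disclosure does not specify the format of the feedback information.

```python
class ExecutionTracker:
    """Tracks, per thread, whether its most recently distributed command is done."""

    def __init__(self, thread_ids):
        # Before anything is distributed, no thread has a command in flight.
        self.done = {tid: True for tid in thread_ids}

    def on_distribute(self, thread_id):
        # A command of this thread has just been sent to an arithmetic unit.
        self.done[thread_id] = False

    def on_feedback(self, feedback):
        # `feedback` is an assumed record such as {"thread": 1, "status": "done"}.
        if feedback["status"] == "done":
            self.done[feedback["thread"]] = True

    def ready_threads(self):
        return sorted(tid for tid, done in self.done.items() if done)


if __name__ == "__main__":
    tracker = ExecutionTracker([0, 1, 2])
    tracker.on_distribute(1)
    print(tracker.ready_threads())
    tracker.on_feedback({"thread": 1, "status": "done"})
    print(tracker.ready_threads())
```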

In one possible embodiment, in manner (3) above, more than one thread of a given first target register group may satisfy the requirement of (3); in that case, the target thread may be determined from among the threads satisfying the requirement of (3) according to the priorities respectively corresponding to those threads, or in a round-robin manner.

It should be noted here that there may be a case where no target thread exists in a certain first target register set in a certain processing cycle, that is, the number of determined target threads is less than the number of first target register sets.

When the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined from the plurality of candidate threads, any one of the following ① to ③ may be employed, for example:

①: The target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads.

②: The target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupation states of the arithmetic units corresponding to the commands to be distributed.

Here, the occupation state of an arithmetic unit includes, for example: a command of some target thread has already been distributed to the arithmetic unit in the current processing cycle; and/or the number of commands that the arithmetic unit received in historical processing cycles and has not yet finished executing has reached a preset number.

Illustratively, the following processes are performed in order of priority from high to low:

First, at least one command to be distributed with the highest priority is determined according to the priorities of the commands to be distributed respectively corresponding to the candidate threads, and whether that command can be distributed to the corresponding arithmetic unit is determined based on the occupation state of that arithmetic unit. If the command can be distributed to the corresponding arithmetic unit, the candidate thread corresponding to it is determined as a target thread. If the command cannot be distributed to the corresponding arithmetic unit, the candidate thread corresponding to it is not taken as a target thread.

Then, at least one command to be distributed with the next highest priority is determined from the candidate threads, and whether that command can be distributed to the corresponding arithmetic unit is determined based on the occupation state of the arithmetic unit corresponding to it.

……

Finally, at least one command to be distributed with the lowest priority is determined from the candidate threads, and whether that command can be distributed to the corresponding arithmetic unit is determined based on the occupation state of the arithmetic unit corresponding to it.

Based on the above process, target threads respectively corresponding to the plurality of first target register sets in the current processing cycle are determined from the plurality of candidate threads.
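The priority-ordered process can be sketched as a single pass over the candidates. The `Pending` record, the representation of occupied arithmetic units as a set of names, and the rule that each register group contributes at most one target thread per cycle are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    thread: int
    bank: int      # register group of the thread
    unit: str      # arithmetic unit the command must be distributed to
    priority: int  # assumption: larger value = higher priority


def select_targets(candidates, occupied_units=()):
    """Visit commands from highest to lowest priority.

    A command is accepted only if its arithmetic unit is not occupied
    and its register group has no target thread yet in this cycle.
    """
    busy = set(occupied_units)
    used_banks = set()
    targets = []
    for cmd in sorted(candidates, key=lambda c: c.priority, reverse=True):
        if cmd.unit in busy or cmd.bank in used_banks:
            continue  # cannot be distributed, so not a target thread
        targets.append(cmd)
        busy.add(cmd.unit)
        used_banks.add(cmd.bank)
    return targets


if __name__ == "__main__":
    pending = [
        Pending(0, 0, "ALU0", 5),
        Pending(1, 2, "ALU0", 4),
        Pending(2, 2, "ST", 3),
        Pending(3, 4, "LD", 1),
    ]
    print([c.thread for c in select_targets(pending, occupied_units={"LD"})])
```

Sorting once and skipping blocked commands is equivalent to repeating the "highest remaining priority" step until the lowest priority has been examined.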

③: and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the plurality of candidate threads and the type of the arithmetic unit.

Here, the types of arithmetic units are different, and the types of commands that can be processed are also different.

For example, an arithmetic and logic unit can process arithmetic commands; a write address arithmetic unit can process write address commands; a read address arithmetic unit can process read address commands; and a transcendental function arithmetic unit can process transcendental function commands.

When the target threads are determined, a plurality of target threads whose commands match the types of the arithmetic units are determined from the candidate threads according to the types of the commands to be distributed corresponding to the candidate threads, and the current command to be distributed of each target thread is then distributed to the arithmetic unit of the matching type.
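Type matching can be sketched as a lookup from command type to a free arithmetic unit of that type. The unit names and the type labels (`"arith"`, `"store"`, `"load"`, `"tfu"`) in `UNIT_ACCEPTS` are illustrative assumptions.

```python
# Assumed table: arithmetic unit name -> command type it can process.
UNIT_ACCEPTS = {
    "ALU0": "arith",
    "ALU1": "arith",
    "ST": "store",
    "LD": "load",
    "TFU": "tfu",
}


def match_units(pending, units=UNIT_ACCEPTS):
    """Assign each pending command to a free unit of the matching type.

    pending: list of (thread_id, command_type) in selection order.
    Returns {unit_name: thread_id}; commands with no free unit are skipped.
    """
    free = dict(units)
    assignment = {}
    for thread_id, command_type in pending:
        unit = next((u for u, t in free.items() if t == command_type), None)
        if unit is not None:
            assignment[unit] = thread_id
            del free[unit]
    return assignment


if __name__ == "__main__":
    pending = [(3, "arith"), (7, "load"), (9, "arith"), (11, "arith"), (12, "store")]
    print(match_units(pending))
```

In this example the third arithmetic command finds no free unit and is left for a later cycle.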

In another embodiment of the disclosure, for some commands, the number of operands required in executing the command may vary.

After a command to be distributed corresponding to a target thread is distributed to an arithmetic unit, the arithmetic unit needs at least one cycle to read the operands corresponding to the command from the corresponding register group; the number of cycles needed to read the operands is the same as the number of operands of the command.

Furthermore, in response to the commands to be distributed corresponding to the target threads determined for the current processing cycle including a multi-operand command to be distributed whose operand count is greater than one, in each processing cycle from the current processing cycle to the target processing cycle, the first target register group corresponding to the multi-operand command distributes one corresponding operand to the arithmetic unit corresponding to that command.

For commands to be distributed that have fewer operands, the arithmetic units can read the operands corresponding to different commands to be distributed from the same target register group over a plurality of cycles.

Furthermore, in response to the commands to be distributed corresponding to the target threads determined for the current processing cycle including multi-operand commands to be distributed whose operand count is greater than one, the operand count of the multi-operand command having the largest number of operands is determined;

for each other command to be distributed whose operand count is less than that largest operand count, among the commands to be distributed corresponding to the target threads determined for the current processing cycle: in each processing cycle from the processing cycle following the current processing cycle up to the processing cycle before the first target register group is scheduled again, in response to the first target register group of that other command being idle, a command to be distributed that is in a ready state is determined for that first target register group from the thread group corresponding to it.

For example, in response to the commands to be distributed corresponding to the target threads determined for the current processing cycle including a multi-operand command with two operands, then for each single-operand command among those commands, another single-operand command to be distributed that is in a ready state is determined, in the processing cycle following the current processing cycle, for the first target register group of that single-operand command.

In this way, while one arithmetic unit reads the operands of a multi-operand command to be distributed from the corresponding first target register group over a plurality of processing cycles, other arithmetic units can read the operands of different single-operand commands to be distributed from the first target register group of the single-operand command over those same processing cycles, which improves the efficiency of data reading while avoiding read conflicts on the same target register group.
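The backfilling of an idle register group can be sketched as follows for the two-group (even/odd) case. The function name `backfill_next_cycle`, the queue layout, and the rule that a register group is scheduled again after `num_groups` cycles are assumptions for the example.

```python
def backfill_next_cycle(dispatched, waiting, num_groups=2):
    """Pick extra ready commands for register groups that would sit idle.

    dispatched: {bank: operand_count} for commands issued in the current cycle.
    waiting:    {bank: [(thread_id, operand_count), ...]} ready commands per bank.
    Returns {bank: (thread_id, operand_count)} to issue in the next cycle.
    """
    extra = {}
    if max(dispatched.values(), default=0) < 2:
        return extra  # no multi-operand command in this cycle
    for bank, operands in dispatched.items():
        spare = num_groups - operands  # cycles left before the bank is rescheduled
        if spare < 1:
            continue  # bank is still being read by its multi-operand command
        queue = waiting.get(bank, [])
        for i, (thread_id, need) in enumerate(queue):
            if need <= spare:
                extra[bank] = queue.pop(i)
                break
    return extra


if __name__ == "__main__":
    dispatched = {0: 2, 2: 2, 4: 1}
    waiting = {0: [(1, 1)], 4: [(33, 2), (35, 1)]}
    print(backfill_next_cycle(dispatched, waiting))
```

Only the register group that hosted a single-operand command receives a second command; the groups still being read are left alone.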

In one embodiment, the number N of groups into which the registers are divided may be determined according to the operand count of the arithmetic unit that requires the largest number of operands, so that the N register groups are scheduled one per cycle over N consecutive processing cycles. After the ith register group is scheduled in the ith processing cycle, a further N-1 processing cycles must pass before the ith register group is scheduled again. If the command corresponding to the register group scheduled in the ith processing cycle requires exactly N operands, the ith register group distributes the N operands to the corresponding arithmetic unit over N cycles, one per cycle. If that command requires fewer than N operands, commands whose operand counts fit the number of cycles remaining before the register group is next scheduled can be scheduled flexibly, which improves data reading efficiency.

Referring to fig. 2 and fig. 3, an embodiment of the present disclosure further provides a command distribution apparatus and a specific example of command distribution using the apparatus. In this example, the apparatus includes a command distributor and 5 arithmetic units connected to the command distributor, where the 5 arithmetic units are:

Two arithmetic and logic units (Arithmetic and Logic Unit, ALU); the instructions they process require two operands.

One write address arithmetic unit (Store Unit, ST); the instructions it processes require two operands.

One read address arithmetic unit (LD); the instructions it processes require one operand.

One transcendental function unit (TFU); the instructions it processes require one operand.

There are 64 threads (threads 0 to 63), 8 register groups (Banks 0 to 7) and 5 arithmetic units. Each register group is allocated 8 threads.
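The allocation of 64 threads to 8 Banks, 8 threads each, can be sketched as below. The contiguous mapping (threads 0 to 7 on Bank0, threads 8 to 15 on Bank1, and so on) is an assumption; the text states only the counts, not which threads go to which Bank.

```python
NUM_THREADS = 64
NUM_BANKS = 8
THREADS_PER_BANK = NUM_THREADS // NUM_BANKS  # 8 threads per Bank


def bank_of(thread_id):
    """Bank that holds the registers of `thread_id` (assumed contiguous mapping)."""
    return thread_id // THREADS_PER_BANK


def threads_of(bank):
    """Thread group allocated to `bank` under the same assumed mapping."""
    start = bank * THREADS_PER_BANK
    return list(range(start, start + THREADS_PER_BANK))


if __name__ == "__main__":
    print(bank_of(0), bank_of(63))
    print(threads_of(4))
```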

Each Bank has only one group of read paths. Within one processing cycle, different arithmetic units accessing the same Bank conflict with one another, whereas different arithmetic units accessing different Banks do not.

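For instructions with two operands, the operands need to be read from the same Bank over two cycles.

In even cycles, the register groups numbered 0, 2, 4 and 6 are taken as the first target register groups; in odd cycles, the register groups numbered 1, 3, 5 and 7 are taken as the first target register groups.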

Taking an even cycle as an example, the valid ALU, ST, LD and TFU instructions with the highest priority are selected from the 8 threads assigned to each even-numbered Bank.

From these even-numbered Banks, the two ALU instructions with the highest priority are selected and distributed first.

With the Banks of the selected ALU instructions occupied, an ST instruction is selected from the remaining Banks and distributed.

With the Banks of the ALU and ST instructions occupied, an LD instruction is selected from the remaining Banks and distributed.

Because ALU and ST instructions have two operands and must read from the same Bank again in the next cycle, the Banks of the ALU and ST instructions remain occupied; in the next processing cycle, a TFU instruction is selected from the remaining Banks and distributed.

A two-operand instruction distributed in an even cycle must continue to read from the same even-numbered Bank in the following odd cycle, but no Bank conflict arises, because only instructions of odd-numbered Banks are distributed in that following cycle. The same holds for scheduling in odd cycles.
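The selection order for one even cycle can be sketched as below. The `offers` structure (for each Bank, the priority of its best valid instruction of each kind) and the tie-breaking by Bank number are assumptions; the sample data mirrors the arrangement used in the fig. 3 example.

```python
def dispatch_even_cycle(offers):
    """Select Banks for ALU0, ALU1, ST and LD now, and for TFU in the next cycle.

    offers: {bank: {kind: priority}}, the best valid instruction of each kind
    ("ALU", "ST", "LD", "TFU") that the Bank's thread group can supply.
    Returns ({bank: unit}, tfu_bank).
    """
    occupied = {}  # bank -> unit that receives an instruction from it this cycle

    def best(kind, banks):
        cands = [(offers[b][kind], b) for b in banks if kind in offers[b]]
        return max(cands)[1] if cands else None

    def free_banks():
        return [b for b in offers if b not in occupied]

    for unit in ("ALU0", "ALU1"):  # the two best ALU instructions go first
        bank = best("ALU", free_banks())
        if bank is not None:
            occupied[bank] = unit
    for unit in ("ST", "LD"):      # then ST, then LD, from the remaining Banks
        bank = best(unit, free_banks())
        if bank is not None:
            occupied[bank] = unit
    # Banks used by ALU and ST are read again in the next cycle.
    still_reading = {b for b, u in occupied.items() if u != "LD"}
    tfu_bank = best("TFU", [b for b in offers if b not in still_reading])
    return occupied, tfu_bank


if __name__ == "__main__":
    offers = {
        0: {"ALU": 9},
        2: {"ST": 5, "ALU": 1},
        4: {"LD": 4, "TFU": 3},
        6: {"ALU": 8},
    }
    print(dispatch_even_cycle(offers))
```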

As shown in fig. 3:

A: In the 0th processing cycle, the determined Banks are Bank0, Bank2, Bank4 and Bank6.

The command determined for Bank0 is an ALU command, and the arithmetic unit that reads operands from Bank0 is ALU0; the ALU command is distributed to ALU0 in the 0th processing cycle. In the 0th and 1st processing cycles, ALU0 reads the first operand ALU0_R0 and the second operand ALU0_R1 from Bank0, respectively.

The command determined for Bank2 is an ST command, which is distributed to the ST unit in the 0th processing cycle. The arithmetic unit that reads operands from Bank2 is ST; in the 0th and 1st processing cycles, ST reads the first operand ST_R0 and the second operand ST_R1 from Bank2, respectively.

The commands determined for Bank4 are an LD command and a TFU command, and the arithmetic units that read operands from Bank4 are LD and TFU. In the 0th processing cycle, the LD command is distributed to the arithmetic unit LD, which reads the operand corresponding to the LD command from Bank4; in the 1st processing cycle, the TFU command is distributed to the arithmetic unit TFU, which reads the operand corresponding to the TFU command from Bank4.

The command determined for Bank6 is an ALU command, which is distributed to ALU1 in the 0th processing cycle. The arithmetic unit that reads operands from Bank6 is ALU1; in the 0th and 1st processing cycles, ALU1 reads the first operand ALU1_R0 and the second operand ALU1_R1 from Bank6, respectively.

B: In the 1st processing cycle, the determined Banks are Bank1, Bank3, Bank5 and Bank7.

The command determined for Bank1 is an ST command, which is distributed to the ST unit in the 1st processing cycle. The arithmetic unit that reads operands from Bank1 is ST; in the 1st and 2nd processing cycles, ST reads the first operand ST_R0 and the second operand ST_R1 from Bank1, respectively.

The command determined for Bank3 is an ALU command, and the arithmetic unit that reads operands from Bank3 is ALU0; the ALU command is distributed to ALU0 in the 1st processing cycle. In the 1st and 2nd processing cycles, ALU0 reads the first operand ALU0_R0 and the second operand ALU0_R1 from Bank3, respectively.

The command determined for Bank5 is an ALU command, which is distributed to ALU1 in the 1st processing cycle. The arithmetic unit that reads operands from Bank5 is ALU1; in the 1st and 2nd processing cycles, ALU1 reads the first operand ALU1_R0 and the second operand ALU1_R1 from Bank5, respectively.

The commands determined for Bank7 are an LD command and a TFU command, and the arithmetic units that read operands from Bank7 are LD and TFU. In the 1st processing cycle, the LD command is distributed to the arithmetic unit LD, which reads the operand corresponding to the LD command from Bank7; in the 2nd processing cycle, the TFU command is distributed to the arithmetic unit TFU, which reads the operand corresponding to the TFU command from Bank7.

Commands are distributed in the same manner in the subsequent processing cycles, up to the 8th processing cycle. In this way, each register group is guaranteed to be accessed by only one arithmetic unit within any one processing cycle, which avoids the data conflicts caused by a plurality of arithmetic units accessing the same register group in the same processing cycle, and improves the efficiency of command distribution and, in turn, of command processing.
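The one-reader-per-Bank property of the walkthrough can be checked mechanically. The `DISPATCHES` table below is transcribed from the description of processing cycles 0 to 2, reading each two-operand command over its dispatch cycle and the following cycle; the tuple layout and the helper `bank_reads` are assumptions for the example.

```python
from collections import Counter

# (dispatch cycle, Bank, arithmetic unit, number of operands)
DISPATCHES = [
    (0, 0, "ALU0", 2), (0, 2, "ST", 2), (0, 4, "LD", 1), (1, 4, "TFU", 1),
    (0, 6, "ALU1", 2),
    (1, 1, "ST", 2), (1, 3, "ALU0", 2), (1, 5, "ALU1", 2),
    (1, 7, "LD", 1), (2, 7, "TFU", 1),
]


def bank_reads(dispatches):
    """Count how many commands read each (cycle, Bank) pair.

    A command reads one operand per cycle, starting in its dispatch cycle.
    """
    reads = Counter()
    for cycle, bank, _unit, operands in dispatches:
        for k in range(operands):
            reads[(cycle + k, bank)] += 1
    return reads


if __name__ == "__main__":
    reads = bank_reads(DISPATCHES)
    print(max(reads.values()))  # 1 means no Bank is read twice in a cycle
```

Adding a dispatch that targets a Bank still being read, for example an LD command on Bank0 in cycle 1, raises the count for that pair to 2.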

It will be understood by those of skill in the art that in the above method of the present embodiment, the order of writing the steps does not imply a strict order of execution and does not impose any limitations on the implementation, as the order of execution of the steps should be determined by their function and possibly inherent logic.

Based on the same inventive concept, the embodiments of the present disclosure further provide a command distributor corresponding to the command distribution method. Because the principle by which the command distributor solves the problem is similar to that of the command distribution method in the embodiments of the present disclosure, the implementation of the command distributor may refer to the implementation of the method, and details that would be repeated are omitted.

Referring to fig. 4, a schematic diagram of a command distributor provided in an embodiment of the present disclosure is shown, where the command distributor includes: a scheduler 41 and a distribution interface 42;

the scheduler 41 is configured to determine a plurality of first target register groups corresponding to a current processing cycle from the plurality of register groups, wherein the first target register groups are different from the second target register groups determined in at least one most recent historical processing cycle; and to determine target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the thread groups respectively corresponding to the plurality of first target register groups;

the distribution interface 42 is configured to distribute the commands corresponding to the determined target threads to the corresponding arithmetic units.
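The division of work between the scheduler 41 and the distribution interface 42 can be sketched as two cooperating objects. The class names, the round-robin thread choice, and the `unit_of` callback that maps a thread to its arithmetic unit are assumptions; the disclosure does not prescribe this structure.

```python
class Scheduler:
    """Chooses the first target register groups and one target thread for each."""

    def __init__(self, thread_groups, num_groups=2):
        self.thread_groups = thread_groups  # {bank: [thread ids]}
        self.num_groups = num_groups
        self.cursor = {bank: 0 for bank in thread_groups}

    def target_threads(self, cycle):
        picks = {}
        for bank, group in self.thread_groups.items():
            if bank % self.num_groups != cycle % self.num_groups:
                continue  # bank is not a first target register group this cycle
            picks[bank] = group[self.cursor[bank] % len(group)]  # round-robin
            self.cursor[bank] += 1
        return picks


class DistributionInterface:
    """Records which command goes to which arithmetic unit."""

    def __init__(self):
        self.sent = []

    def distribute(self, cycle, picks, unit_of):
        for bank, thread_id in picks.items():
            self.sent.append((cycle, unit_of(thread_id), bank, thread_id))


class CommandDistributor:
    def __init__(self, scheduler, interface, unit_of):
        self.scheduler = scheduler
        self.interface = interface
        self.unit_of = unit_of

    def step(self, cycle):
        picks = self.scheduler.target_threads(cycle)
        self.interface.distribute(cycle, picks, self.unit_of)
        return picks


if __name__ == "__main__":
    scheduler = Scheduler({0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]})
    distributor = CommandDistributor(scheduler, DistributionInterface(), lambda tid: "ALU0")
    for cycle in range(3):
        print(cycle, distributor.step(cycle))
```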

In one possible embodiment, when determining the plurality of first target register sets corresponding to the current processing cycle from the plurality of register sets, the scheduler 41 is configured to:

determining a register group with an odd number in the plurality of register groups as the first target register group when the current processing cycle is an odd cycle;

In a possible implementation, the scheduler 41 is further configured to:

In one possible embodiment, the scheduler 41, when determining the target threads respectively corresponding to the plurality of first target register sets in the current processing cycle from the thread sets respectively corresponding to the plurality of first target register sets, is configured to:

In one possible embodiment, the scheduler 41, when determining, from the thread groups respectively corresponding to the plurality of first target register groups, a target thread respectively corresponding to the plurality of first target register groups in a current processing cycle based on the determined command execution states of the respective threads in the thread groups respectively corresponding to the plurality of first target register groups, is configured to:

In one possible implementation, the scheduler 41, when determining, from the multiple candidate threads, a target thread corresponding to each of the multiple first target register sets in the current processing cycle, is configured to:

In one possible embodiment, the scheduler 41, when determining the target threads respectively corresponding to the plurality of first target register sets in the current processing cycle from among the plurality of candidate threads, is configured to:

and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the plurality of candidate threads and the type of the arithmetic unit.

In a possible implementation, the scheduler 41 is further configured to:

in response to a multi-operand to-be-distributed command with more than one operand in the to-be-distributed command corresponding to the target thread determined for the current processing cycle, determining the operand number of the multi-operand to-be-distributed command with the largest operand number in the multi-operand to-be-distributed command;

In a possible implementation, the scheduler 41 is further configured to: acquiring feedback information generated by the arithmetic unit after executing the command;

In a possible implementation, the scheduler 41 is further configured to:

The description of the processing flow of each module in the command distributor and the interaction flow between each module may refer to the related description in the above method embodiments, and will not be described in detail here.

In addition, the command distributor provided by the embodiment of the present disclosure may be a chip capable of implementing the command distribution method provided by the embodiment of the present disclosure.

An embodiment of the present disclosure further provides a chip, as shown in fig. 5, including: a controller 51, a command distributor 52, and an arithmetic unit 53;

the controller 51 is configured to obtain commands corresponding to a plurality of threads, and send the commands to the command distributor 52;

the command distributor 52 is configured to distribute the command to the arithmetic unit 53 based on a command distribution method provided in any embodiment of the present disclosure;

the arithmetic unit 53 is configured to read an operand from a first target register group corresponding to the command, and execute the command based on the operand.

The specific process of executing the command by the command execution device may refer to the steps of the command distribution method described in the embodiments of the present disclosure, and details are not described here.

The embodiment of the disclosure also provides an electronic device comprising the chip provided by any one of the embodiments of the disclosure.

The embodiments of the present disclosure further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the command distribution method in the foregoing method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.

The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the command distribution method in the foregoing method embodiments, which may be referred to specifically in the foregoing method embodiments, and are not described herein again.

The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in the embodiments of the present disclosure may be integrated into one arithmetic unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-transitory computer-readable storage medium executable by a processor. Based on such understanding, the technical solutions of the present disclosure in essence, or the part thereof that contributes over the prior art, or a part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several commands for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate the technical solutions of the present disclosure rather than to limit them, and the scope of protection of the present disclosure is not limited thereto. Although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and they should be construed as being included in the scope of protection of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. A method for distributing commands, comprising:

determining a plurality of first target register groups corresponding to the current processing cycle from a plurality of register groups; wherein the first target register groups are different from the second target register groups determined in at least one most recent historical processing cycle;

determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from thread groups respectively corresponding to the plurality of first target register groups;

and distributing the commands corresponding to the determined target threads to the corresponding arithmetic units.

2. The command distribution method of claim 1, wherein the determining a plurality of first target register sets corresponding to the current processing cycle from the plurality of register sets comprises:

3. The command distribution method according to claim 1, further comprising:

and determining the grouping number of the registers according to the operand number of the arithmetic unit with the largest required operand number, and dividing the registers into the plurality of register groups.

4. The command distribution method according to any one of claims 1 to 3, wherein the determining, from the thread groups respectively corresponding to the first target register groups, target threads respectively corresponding to the first target register groups in a current processing cycle comprises:

5. The method according to claim 4, wherein the determining, based on the determined command execution states of the respective threads in the thread groups corresponding to the first target register groups, target threads corresponding to the first target register groups in the current processing cycle from the thread groups corresponding to the first target register groups, comprises:

6. The method according to claim 5, wherein the determining, from among the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle comprises:

7. The method according to claim 5, wherein the determining, from among the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle comprises:

8. The method according to claim 5, wherein the determining, from among the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle comprises:

and determining target threads respectively corresponding to the first target register groups in the current processing cycle from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the candidate threads and the types of the arithmetic units.
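The type-based selection of claim 8 can be sketched as follows. The sketch assumes that each arithmetic unit executes exactly one command type and accepts at most one command per processing cycle, and that candidates are examined greedily in order. None of these details is stated in the claim, and the names `find_unit` and `select_by_unit_type` are hypothetical.

```python
def find_unit(free_units, cmd_type):
    """Return the name of a free unit that executes `cmd_type`, or None."""
    for unit, unit_type in free_units.items():
        if unit_type == cmd_type:
            return unit
    return None


def select_by_unit_type(candidates, unit_types):
    """Choose one target thread per register group by command type.

    `candidates` maps a register group to a list of (thread, command_type).
    `unit_types` maps an arithmetic unit name to the command type it runs.
    Assumption: a unit takes at most one command per cycle.
    """
    free_units = dict(unit_types)
    chosen = {}
    for group, threads in candidates.items():
        for thread, cmd_type in threads:
            unit = find_unit(free_units, cmd_type)
            if unit is not None:
                chosen[group] = (thread, unit)
                del free_units[unit]  # unit is now busy for this cycle
                break
    return chosen


candidates = {0: [("t0", "mul"), ("t1", "add")],
              1: [("t2", "mul"), ("t3", "add")]}
units = {"alu0": "add", "mul0": "mul"}
targets = select_by_unit_type(candidates, units)
```

In this example group 0 takes the only multiplier with thread `t0`, so group 1 skips its multiply candidate `t2` and issues the add command of thread `t3` instead.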

9. The command distribution method according to any one of claims 1 to 8, further comprising:

10. The command distribution method according to any one of claims 1 to 9, further comprising:

in response to a multi-operand command to be distributed that has two operands existing among the commands to be distributed corresponding to the target threads determined for the current processing cycle, determining, for each single-operand command to be distributed among those commands, another single-operand command to be distributed that is in a ready state, for the first target register group in which the single-operand command to be distributed is located, in the next processing cycle after the current processing cycle.

11. The command distribution method according to any one of claims 1 to 10, further comprising:

12. The command distribution method according to claim 4, further comprising: acquiring feedback information generated by the arithmetic unit after executing the command;

13. The command distribution method according to any one of claims 1 to 12, further comprising:

14. A command distributor, comprising: a scheduler, and a distribution interface;

the scheduler is configured to: determine a plurality of first target register groups corresponding to the current processing cycle from a plurality of register groups, wherein the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle; and determine, from thread groups respectively corresponding to the plurality of first target register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;

15. The command distributor of claim 14, wherein the scheduler, when determining the plurality of first target register groups corresponding to the current processing cycle from among the plurality of register groups, is configured to:

16. The command distributor of claim 14, wherein the scheduler is further configured to:

17. The command distributor of any of claims 14-16, wherein the scheduler, when determining, from the thread groups corresponding to the first target register groups, target threads corresponding to the first target register groups, respectively, in a current processing cycle, is configured to:

18. The command distributor according to claim 17, wherein the scheduler, when determining, from the thread groups respectively corresponding to the plurality of first target register groups, a target thread respectively corresponding to the plurality of first target register groups in a current processing cycle based on the determined command execution states of the respective threads in the thread groups respectively corresponding to the plurality of first target register groups, is configured to:

determine, from the thread groups respectively corresponding to the plurality of first target register groups, a plurality of candidate threads whose command execution state information indicates a ready state;
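The ready-state filtering step above can be pictured with the short sketch below. The string value `"ready"`, the dictionary layout, and the name `ready_threads` are assumptions for illustration; the claim does not specify how command execution state information is encoded.

```python
def ready_threads(thread_groups, states, target_groups):
    """Return, per target register group, the threads whose state is ready.

    `thread_groups` maps a register group to its threads; `states` maps a
    thread to its command execution state (hypothetical string encoding).
    """
    return {
        g: [t for t in thread_groups[g] if states.get(t) == "ready"]
        for g in target_groups
    }


thread_groups = {0: ["t0", "t1"], 1: ["t2", "t3"], 2: ["t4"]}
states = {"t0": "blocked", "t1": "ready", "t2": "ready",
          "t3": "ready", "t4": "ready"}
ready = ready_threads(thread_groups, states, target_groups=[0, 1])
```

Only the thread groups of the register groups targeted in the current cycle are examined, so thread `t4` of group 2 is not considered here even though it is ready.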

19. The command distributor of claim 18, wherein the scheduler, when determining, from among the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:

20. The command distributor of claim 18, wherein the scheduler, when determining, from among the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:

21. The command distributor of claim 18, wherein the scheduler, when determining, from among the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:

22. The command distributor according to any of claims 14-21, wherein the scheduler is further configured to:

in response to a multi-operand command to be distributed, having more than one operand, existing among the commands to be distributed corresponding to the target threads determined for the current processing cycle, cause the first target register group corresponding to the multi-operand command to be distributed to supply one corresponding operand to the arithmetic unit corresponding to that command in each processing cycle from the current processing cycle to a target processing cycle;
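The per-cycle operand delivery described above can be sketched as a simple schedule. The sketch assumes that the target processing cycle is the current cycle plus the operand count minus one, which the claim text does not state explicitly, and the name `operand_schedule` is hypothetical.

```python
def operand_schedule(current_cycle, operands):
    """Map each processing cycle to the single operand supplied in it.

    Assumption: one operand per cycle, starting at `current_cycle`, so the
    target cycle is current_cycle + len(operands) - 1.
    """
    return {current_cycle + i: op for i, op in enumerate(operands)}


schedule = operand_schedule(5, ["a", "b", "c"])
target_cycle = max(schedule)
```

Under this assumption, a three-operand command selected in cycle 5 receives its operands in cycles 5, 6 and 7, and the register group is read at most once per cycle.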

23. The command distributor according to any of claims 14-22, wherein the scheduler is further configured to:

24. The command distributor according to any of claims 14-23, wherein said scheduler is further configured to:

25. The command distributor of claim 17, wherein the scheduler is further configured to: acquiring feedback information generated by the arithmetic unit after the arithmetic unit executes the command;

26. The command distributor according to any of claims 14-25, wherein the scheduler is further configured to:

27. A chip, comprising: a controller, a command distributor, and an arithmetic unit;

the command distributor is configured to distribute commands to the arithmetic unit based on the command distribution method of any one of claims 1 to 13;

28. An electronic device comprising the chip of claim 27.

29. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the command distribution method according to any one of claims 1 to 13.