CN115061731A - Shuffle circuit and method, and chip and integrated circuit device

Info

Publication number: CN115061731A
Application number: CN202210717989.2A
Authority: CN
Inventors: 张春焱; 李凯; 于冰; 张钰勃
Original assignee: Moore Threads Technology Co Ltd
Current assignee: Moore Threads Technology Co Ltd
Priority date: 2022-06-23
Filing date: 2022-06-23
Publication date: 2022-09-16
Anticipated expiration: 2042-06-23
Also published as: CN115061731B

Abstract

The present disclosure provides a shuffle circuit comprising a control circuit, an input selector, a shuffler, and an output selector, wherein the control circuit is configured to: divide m threads into n thread groups according to the maximum number k of threads that the shuffler can process in parallel, each thread group including k threads; generate data correspondence information, which defines from the operation data of which thread group or groups the result data of each thread group is obtained; and send the data correspondence information to the input selector and the output selector, wherein k, m, and n are integers greater than or equal to 1. The control circuit may generate the data correspondence information based on the SIMD mode, or based on a result data index flag and an operation data index flag. The present disclosure also provides a data shuffling method usable with the shuffle circuit, and also relates to a chip including the shuffle circuit and an integrated circuit device including the chip.

Description

Shuffle circuit and method, and chip and integrated circuit device

Technical Field

The present disclosure relates to the field of electrical technology, in particular to a shuffle circuit and a data shuffle method suitable for multiple SIMD modes, and also to a chip comprising the shuffle circuit and an integrated circuit device comprising the chip.

Background

The shuffle operation is a data processing method that redistributes data among multiple threads, enabling data sharing and data reordering between threads, and is therefore widely used in Application Programming Interfaces (APIs) such as DX, CUDA, or VULKAN. Currently, different chips are based on different SIMD architectures, for example, a SIMD32 architecture, a SIMD64 architecture, or a SIMD128 architecture. For compatibility, these SIMD architectures often need to support multiple APIs, and thus need to accommodate multiple SIMD modes. However, a shuffle circuit usually adopts a fixed implementation based on the current SIMD architecture, so it cannot deliver satisfactory processing results when supporting the different SIMD modes of different APIs.

Disclosure of Invention

According to a first aspect of the present disclosure, there is provided a shuffle circuit comprising a control circuit, an input selector, a shuffler, and an output selector, wherein: the control circuit is configured to: divide m threads into n thread groups according to the maximum number k of threads that the shuffler can process in parallel, wherein each thread group comprises k threads; generate data correspondence information; and send the data correspondence information to the input selector and the output selector, wherein the data correspondence information defines from the operation data of which one or more thread groups the result data of each thread group is obtained, and k, m, and n are integers greater than or equal to 1; the input selector is configured to: select the operation data of one or more corresponding thread groups from the n thread groups according to the received data correspondence information, and sequentially send the operation data to the shuffler in a predetermined order of the thread groups; the shuffler is configured to: sequentially receive the operation data of the one or more corresponding thread groups from the input selector, perform a shuffle operation on the received k operation data of each corresponding thread group, and output j shuffle output data, wherein j is an integer greater than or equal to 0 and less than or equal to k; and the output selector is configured to: sequentially receive, from the shuffler, the shuffle output data obtained from the operation data of the one or more corresponding thread groups, and generate result data for each thread group based on the shuffle output data according to the received data correspondence information.

According to some exemplary embodiments of the disclosure, the control circuit is further configured to: generate the data correspondence information based on a SIMD mode.

According to some exemplary embodiments of the present disclosure, in the shuffle circuit according to the first aspect of the present disclosure, the value of m is 128, the value of k is 32, the value of n is 4, and the SIMD mode is one of a SIMD32 mode, a SIMD64 mode, and a SIMD128 mode.

According to some exemplary embodiments of the present disclosure, when the SIMD mode is the SIMD32 mode, the data correspondence information includes: obtaining the result data of the first thread group from the operation data of the first thread group; obtaining the result data of the second thread group from the operation data of the second thread group; obtaining the result data of the third thread group from the operation data of the third thread group; and obtaining the result data of the fourth thread group from the operation data of the fourth thread group.

According to some exemplary embodiments of the present disclosure, when the SIMD mode is the SIMD64 mode, the data correspondence information includes: obtaining the result data of the first thread group from the operation data of the first and second thread groups; obtaining the result data of the second thread group from the operation data of the first and second thread groups; obtaining the result data of the third thread group from the operation data of the third and fourth thread groups; and obtaining the result data of the fourth thread group from the operation data of the third and fourth thread groups.

According to some exemplary embodiments of the present disclosure, when the SIMD mode is the SIMD128 mode, the data correspondence information includes: obtaining the result data of the first thread group from the operation data of the first, second, third, and fourth thread groups; obtaining the result data of the second thread group from the operation data of the first, second, third, and fourth thread groups; obtaining the result data of the third thread group from the operation data of the first, second, third, and fourth thread groups; and obtaining the result data of the fourth thread group from the operation data of the first, second, third, and fourth thread groups.
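The group-level correspondence listed above for the three SIMD modes can be sketched in software as a small lookup. This is only an illustrative model, not the circuit itself: the function name `correspondence_for_simd_mode` and its parameters are hypothetical, and thread groups are numbered from 0 (so the "first thread group" is group 0).

```python
def correspondence_for_simd_mode(simd_width, m=128, k=32):
    """Return, for each result thread group, the operation thread groups it reads.

    Hypothetical software model: simd_width is 32, 64, or 128 in the
    embodiment above; m is the thread count, k the shuffler width.
    Thread groups are numbered from 0.
    """
    if m % k != 0 or simd_width % k != 0 or m % simd_width != 0:
        raise ValueError("m, k and simd_width must divide evenly")
    n = m // k                             # number of thread groups
    groups_per_block = simd_width // k     # thread groups per SIMD block
    table = []
    for result_group in range(n):
        # First thread group of the SIMD block containing this result group.
        start = (result_group // groups_per_block) * groups_per_block
        table.append(list(range(start, start + groups_per_block)))
    return table


if __name__ == "__main__":
    for width in (32, 64, 128):
        print(width, correspondence_for_simd_mode(width))
```

With m = 128 and k = 32, the SIMD64 case yields groups 0 and 1 for the first two result groups and groups 2 and 3 for the last two, matching the listing above.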

According to some exemplary embodiments of the present disclosure, the control circuit comprises a result data index flag generator and an operation data index flag generator, wherein: the result data index flag generator is configured to: generate an n-bit result data index flag according to the validity of the m threads, wherein each bit of the result data index flag corresponds to the group of result data of one of the n thread groups; the operation data index flag generator is configured to: calculate the operation data index corresponding to each result data so as to generate an n-bit operation data index flag for the group of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to the group of operation data of one of the n thread groups; and the control circuit is configured to: generate the data correspondence information based on the result data index flag and the operation data index flag.

According to some exemplary embodiments of the disclosure, the control circuit is further configured to: for each valid result data group, obtain the result data from the group of operation data of the thread group corresponding to each bit whose value is 1 in the corresponding operation data index flag.

According to some exemplary embodiments of the present disclosure, in the shuffle circuit whose control circuit includes the result data index flag generator and the operation data index flag generator, m has a value of 128, k has a value of 32, and n has a value of 4.
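As a rough illustration of the two index flags, the sketch below builds them from per-thread inputs. Several details are assumptions not stated in the text: the inputs are modeled as a per-thread validity list `valid` and a per-thread source lane list `src_index` (the "operation data index" of each result data), and bit g of each flag is taken to stand for thread group g, counted from the least significant bit.

```python
def build_index_flags(valid, src_index, k=32):
    """Build an n-bit result flag and one n-bit operation flag per result group.

    Hypothetical model: valid[t] says whether thread t is valid, and
    src_index[t] is the operation-data lane that result t reads from.
    Bit g (from the LSB) of a flag stands for thread group g.
    """
    m = len(valid)
    if m % k != 0 or len(src_index) != m:
        raise ValueError("inputs must have the same length, a multiple of k")
    n = m // k
    result_flag = 0
    op_flags = [0] * n
    for t in range(m):
        if not valid[t]:
            continue                      # invalid threads contribute nothing
        result_group = t // k             # group that owns result t
        op_group = src_index[t] // k      # group that owns its source lane
        result_flag |= 1 << result_group
        op_flags[result_group] |= 1 << op_group
    return result_flag, op_flags


if __name__ == "__main__":
    # Threads 0..63 are valid and swap halves; threads 64..127 are invalid.
    valid = [t < 64 for t in range(128)]
    src = [(t + 32) % 64 if t < 64 else t for t in range(128)]
    flag, ops = build_index_flags(valid, src)
    print(format(flag, "04b"), [format(f, "04b") for f in ops])
```

In this usage example only the first two result groups are marked valid, and each one needs exactly one operation data group, so only two shuffle cycles would be required.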

According to a second aspect of the present disclosure, there is provided a data shuffling method comprising: dividing m threads into n thread groups according to the maximum number k of threads that can be processed in parallel, wherein each thread group comprises k threads, and k, m, and n are integers greater than or equal to 1; generating data correspondence information, which defines from the operation data of which one or more thread groups the result data of each thread group is obtained; selecting the operation data of one or more corresponding thread groups from the n thread groups according to the data correspondence information; performing a shuffle operation on the k operation data of each corresponding thread group and outputting j shuffle output data, wherein j is an integer greater than or equal to 0 and less than or equal to k; and generating result data for each thread group according to the data correspondence information, based on the shuffle output data obtained from the operation data of the one or more corresponding thread groups.

According to some exemplary embodiments of the present disclosure, the generating of the data correspondence information includes: generating the data correspondence information based on a SIMD mode.

According to some exemplary embodiments of the present disclosure, in the data shuffling method according to the second aspect of the present disclosure, the value of m is 128, the value of k is 32, the value of n is 4, and the SIMD mode is one of a SIMD32 mode, a SIMD64 mode, and a SIMD128 mode.

According to some exemplary embodiments of the present disclosure, the generating of the data correspondence information based on the SIMD mode includes: when the SIMD mode is the SIMD32 mode, the data correspondence information includes: obtaining the result data of the first thread group from the operation data of the first thread group; obtaining the result data of the second thread group from the operation data of the second thread group; obtaining the result data of the third thread group from the operation data of the third thread group; and obtaining the result data of the fourth thread group from the operation data of the fourth thread group.

According to some exemplary embodiments of the present disclosure, the generating of the data correspondence information based on the SIMD mode includes: when the SIMD mode is the SIMD64 mode, the data correspondence information includes: obtaining the result data of the first thread group from the operation data of the first and second thread groups; obtaining the result data of the second thread group from the operation data of the first and second thread groups; obtaining the result data of the third thread group from the operation data of the third and fourth thread groups; and obtaining the result data of the fourth thread group from the operation data of the third and fourth thread groups.

According to some exemplary embodiments of the present disclosure, the generating of the data correspondence information based on the SIMD mode includes: when the SIMD mode is the SIMD128 mode, the data correspondence information includes: obtaining the result data of the first thread group from the operation data of the first, second, third, and fourth thread groups; obtaining the result data of the second thread group from the operation data of the first, second, third, and fourth thread groups; obtaining the result data of the third thread group from the operation data of the first, second, third, and fourth thread groups; and obtaining the result data of the fourth thread group from the operation data of the first, second, third, and fourth thread groups.

According to some exemplary embodiments of the present disclosure, the generating of the data correspondence information includes: generating an n-bit result data index flag according to the validity of the m threads, wherein each bit of the result data index flag corresponds to the group of result data of one of the n thread groups; calculating the operation data index corresponding to each result data so as to generate an n-bit operation data index flag for the group of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to the group of operation data of one of the n thread groups; and generating the data correspondence information based on the result data index flag and the operation data index flag.

According to some exemplary embodiments of the present disclosure, the generating of the data correspondence information based on the result data index flag and the operation data index flag includes: determining the group of result data of each thread group corresponding to a bit whose value is 1 in the result data index flag to be a valid result data group; and, for each valid result data group, obtaining the result data from the operation data of each thread group corresponding to a bit whose value is 1 in the corresponding operation data index flag.
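The two steps just described (pick the valid result data groups, then read off their source groups) amount to decoding bit flags. A minimal sketch follows, assuming the flags are plain integers in which bit g, counted from the least significant bit, stands for thread group g; the function name and the dictionary output format are hypothetical.

```python
def correspondence_from_flags(result_flag, op_flags, n=4):
    """Decode index flags into {valid result group: [operation groups]}.

    Assumed encoding: bit g (from the LSB) of result_flag marks result
    group g as valid; bit g of op_flags[r] marks operation group g as a
    source for result group r. Groups are numbered from 0.
    """
    if len(op_flags) != n:
        raise ValueError("expected one operation data index flag per group")
    table = {}
    for result_group in range(n):
        if (result_flag >> result_group) & 1:          # valid result group
            table[result_group] = [
                op_group
                for op_group in range(n)
                if (op_flags[result_group] >> op_group) & 1
            ]
    return table


if __name__ == "__main__":
    # Result groups 0 and 1 are valid; each reads from the other's data.
    print(correspondence_from_flags(0b0011, [0b0010, 0b0001, 0, 0]))
```

Result groups whose bit is 0 do not appear in the table at all, so no shuffle cycle would be scheduled for them.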

According to a third aspect of the present disclosure, there is provided a SIMD architecture based chip comprising a shuffle circuit provided according to the first aspect of the present disclosure and exemplary embodiments thereof.

According to some exemplary embodiments of the disclosure, the chip is a GPU chip.

According to a fourth aspect of the present disclosure, there is provided an integrated circuit device comprising at least one chip provided according to the third aspect of the present disclosure and exemplary embodiments thereof.

Drawings

So that the features, characteristics, and advantages of the present disclosure recited above can be understood in detail, a more particular description of embodiments of the present disclosure, briefly summarized above, is given below with reference to the appended drawings, in which:

FIG. 1 schematically illustrates a shuffle circuit of the prior art;

FIG. 2 schematically illustrates, in block diagram form, the structure of a shuffle circuit in accordance with one exemplary embodiment of the present disclosure;

FIG. 3 schematically illustrates, in block diagram form, the structure of a shuffle circuit in accordance with another exemplary embodiment of the present disclosure;

FIGS. 4a, 4b, and 4c schematically illustrate the correspondence between operation data groups and result data groups in different SIMD modes for the shuffle circuit shown in FIG. 3;

FIG. 5 schematically illustrates, in block diagram form, the structure of a shuffle circuit in accordance with another exemplary embodiment of the present disclosure;

FIGS. 6a, 6b, 6c, and 6d schematically illustrate the operation of the shuffle circuit shown in FIG. 5 for one SIMD mode;

FIGS. 7a, 7b, 7c, and 7d schematically illustrate the operation of the shuffle circuit shown in FIG. 5 for another SIMD mode;

FIG. 8 schematically illustrates, in flow chart form, a method of data shuffling in accordance with one exemplary embodiment of the present disclosure;

FIG. 9 illustrates, in flow diagram form, details of the data shuffling method illustrated in FIG. 8;

FIG. 10 shows, in flow chart form, details of the data shuffling method shown in FIG. 9;

FIG. 11 schematically illustrates, in block diagram form, a structure of a chip in accordance with one exemplary embodiment of the present disclosure; and

FIG. 12 schematically illustrates, in block diagram form, the structure of an integrated circuit device in accordance with one exemplary embodiment of the present disclosure.

It is to be understood that the matter shown in the figures is merely schematic and thus it is not necessarily drawn to scale. Further, throughout the drawings, the same or similar features are indicated by the same or similar reference numerals.

Detailed Description

The following description provides specific details of various exemplary embodiments of the present disclosure so that those skilled in the art can fully understand and implement the technical solutions according to the present disclosure.

First, some terms referred to in the embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art:

Single Instruction Multiple Data (SIMD): in the present disclosure, the term means that one instruction can process a plurality of data at the same time. SIMD can thus fetch all the operation data at once and operate on it, which makes it particularly suitable for data-intensive applications.

Shuffle (Shuffle) operation: in the present disclosure, the term refers to an operation that causes data of a plurality of parallel threads to be redistributed according to a predetermined manner, which enables data sharing and data reordering among threads.

Shuffle cycle: in the present disclosure, the term refers to the time a shuffler requires to receive operation data and perform a shuffle operation on as many threads as it can process in parallel. For example, in the current state of the art, when a high clock frequency is pursued, the maximum number of threads that the shuffler can process in parallel is 32, and its shuffle cycle is therefore the time required to receive the operation data of 32 threads and perform the shuffle operation. When a high frequency is not pursued, the maximum number of threads that the shuffler can process in parallel may also be 64 or 128, and its shuffle cycle is accordingly the time required to receive the operation data of 64 or 128 threads and perform the shuffle operation.

SIMD architecture: the term refers to an architecture that can process the data of multiple threads in parallel in a SIMD fashion. In the present disclosure, SIMD architectures are distinguished by the maximum number of threads they can process in parallel. For example, a SIMD128 architecture refers to an architecture that can process the data of up to 128 threads in parallel in a SIMD fashion. Thus, it should be understood that in this disclosure, a shuffle circuit based on a SIMD128 architecture refers to a shuffle circuit that can perform a shuffle operation on data from up to 128 threads.

SIMD mode: in the present disclosure, the term refers to the correspondence between result data and operation data when a shuffle circuit based on a certain SIMD architecture performs a shuffle operation on data. In particular, when a shuffle circuit based on a certain SIMD architecture runs in a certain SIMD mode, each set of result data whose size corresponds to that SIMD mode is obtained only from the set of operation data of the same size that covers the same thread numbers.

For example, in the present disclosure, a shuffle circuit that is based on a SIMD128 architecture and employs the SIMD32 mode can acquire operation data from 128 parallel threads, and its result data and operation data have the following correspondence: the 0th to 31st result data are obtained only from the 0th to 31st operation data, the 32nd to 63rd result data are obtained only from the 32nd to 63rd operation data, the 64th to 95th result data are obtained only from the 64th to 95th operation data, and the 96th to 127th result data are obtained only from the 96th to 127th operation data. Similarly, a shuffle circuit that is based on a SIMD128 architecture and employs the SIMD64 mode can acquire operation data from 128 parallel threads, and its result data and operation data have the following correspondence: the 0th to 63rd result data are obtained only from the 0th to 63rd operation data, and the 64th to 127th result data are obtained only from the 64th to 127th operation data. Likewise, a shuffle circuit that is based on a SIMD128 architecture and employs the SIMD128 mode can acquire operation data from 128 parallel threads, and its result data and operation data have the following correspondence: the 0th to 127th result data are obtained only from the 0th to 127th operation data.
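The index ranges in this example follow one rule: a result data item may only draw on operation data inside its own aligned block of SIMD-mode width. A short sketch of that rule is given below; the helper name `source_range` is hypothetical and the indices are the 0-based thread numbers used above.

```python
def source_range(result_index, simd_width, m=128):
    """Return (first, last) operation-data indices a result may draw from.

    Hypothetical helper: simd_width is the SIMD mode (32, 64, or 128)
    and m is the number of threads of the SIMD architecture.
    """
    if not 0 <= result_index < m or m % simd_width != 0:
        raise ValueError("index out of range or width does not divide m")
    first = (result_index // simd_width) * simd_width   # start of aligned block
    return first, first + simd_width - 1


if __name__ == "__main__":
    for width in (32, 64, 128):
        print(width, source_range(40, width), source_range(100, width))
```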

Because the shuffler in the shuffle circuit can process only a limited maximum number of threads in parallel, the correspondence between result data and operation data in each SIMD mode must also be considered with the operation data and the result data each grouped according to that maximum number. Again taking as an example a shuffle circuit that is based on a SIMD128 architecture and employs the SIMD32 mode, if the maximum number of threads that its shuffler can process in parallel is 32, then the operation data that it acquires from 128 parallel threads is divided into 4 groups of 32 operation data each, and correspondingly its result data is also divided into 4 groups of 32 result data each, with the following correspondence between the result data and the operation data: the first group of result data is obtained only from the first group of operation data, the second group of result data is obtained only from the second group of operation data, the third group of result data is obtained only from the third group of operation data, and the fourth group of result data is obtained only from the fourth group of operation data. The cases of the SIMD64 and SIMD128 modes can be considered similarly.

Furthermore, it will be appreciated that for a shuffle circuit based on a certain SIMD architecture, it may actually operate in a number of different SIMD modes.

Referring to FIG. 1, a shuffle circuit of the prior art is schematically shown. As shown in FIG. 1, the shuffle circuit 10 is a shuffle circuit based on a SIMD128 architecture, and therefore it can acquire 128 operation data, i.e., operation data 0 to operation data 127, from 128 parallel threads, and correspondingly output 128 result data, i.e., result data 0 to result data 127. The shuffle circuit 10 includes an input selector 12, a shuffler 13, and an output selector 14. The maximum number of threads that the shuffler 13 can process in parallel is 32, and therefore the input selector 12 divides the 128 operation data into 4 operation data groups, namely: a first operation data group 11-1 including operation data 0 to operation data 31, a second operation data group 11-2 including operation data 32 to operation data 63, a third operation data group 11-3 including operation data 64 to operation data 95, and a fourth operation data group 11-4 including operation data 96 to operation data 127. The input selector 12 selects the corresponding operation data groups from these operation data groups and sends them to the shuffler 13 in sequence. The shuffler 13 sequentially receives the corresponding operation data groups, performs a shuffle operation on the 32 operation data in each received operation data group, and outputs j shuffle output data, where j is an integer and 0 ≤ j ≤ 32. The output selector 14 receives, in turn from the shuffler 13, the shuffle output data obtained from each corresponding operation data group to generate 128 result data, wherein the 128 result data are likewise divided into 4 result data groups, namely: a first result data group 15-1 comprising result data 0 to result data 31, a second result data group 15-2 comprising result data 32 to result data 63, a third result data group 15-3 comprising result data 64 to result data 95, and a fourth result data group 15-4 comprising result data 96 to result data 127.

Therefore, in order to generate the result data of one result data group, the shuffle circuit 10 needs to traverse all four operation data groups. For example, in order to generate the result data included in the first result data group 15-1, the shuffle circuit 10 needs to perform the shuffle operation on all four operation data groups, and thus 4 shuffle cycles are required. Similarly, to generate the result data included in the second, third, and fourth result data groups 15-2, 15-3, and 15-4, the shuffle circuit 10 needs to perform a shuffle operation on the operation data of all four operation data groups for each of these result data groups. It follows that the shuffle circuit 10 requires 16 shuffle cycles in order to generate the result data of the first, second, third, and fourth result data groups 15-1, 15-2, 15-3, and 15-4. However, in some SIMD modes, there is a specific correspondence between the result data of each result data group and the operation data of the operation data groups. For example, in the SIMD32 mode, the result data of the first result data group 15-1 is obtained only from the operation data of the first operation data group 11-1, the result data of the second result data group 15-2 is obtained only from the operation data of the second operation data group 11-2, the result data of the third result data group 15-3 is obtained only from the operation data of the third operation data group 11-3, and the result data of the fourth result data group 15-4 is obtained only from the operation data of the fourth operation data group 11-4. Thus, for the SIMD32 mode, only 4 of the 16 shuffle cycles that the shuffle circuit 10 spends to generate the result data of the first, second, third, and fourth result data groups 15-1, 15-2, 15-3, and 15-4 are valid; the other 12 shuffle cycles generate no result data and are therefore invalid. Consequently, the shuffle circuit 10 based on the SIMD128 architecture wastes computational resources and processes data inefficiently when operating in the SIMD32 mode.
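The cycle counts in this discussion can be reproduced with a few lines of arithmetic. The sketch below assumes, as the text describes, that one shuffle cycle is spent per pair of result data group and operation data group sent through the shuffler; the SIMD64 and SIMD128 figures are derived from that assumption rather than stated in the text, and the function name is hypothetical.

```python
def shuffle_cycles(simd_width, m=128, k=32):
    """Return (cycles spent when traversing every group, cycles that are valid).

    Assumes one shuffle cycle per (result group, operation group) pair.
    """
    if m % k != 0 or simd_width % k != 0 or m % simd_width != 0:
        raise ValueError("m, k and simd_width must divide evenly")
    n = m // k                              # thread groups (4 in the example)
    traverse_all = n * n                    # every result group visits all n groups
    valid = n * (simd_width // k)           # only groups in the same SIMD block matter
    return traverse_all, valid


if __name__ == "__main__":
    for width in (32, 64, 128):
        spent, valid = shuffle_cycles(width)
        print(f"SIMD{width}: {spent} cycles spent, {valid} valid, {spent - valid} wasted")
```

For the SIMD32 mode this gives 16 cycles spent and 4 valid, i.e., the 12 invalid cycles noted above.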

Referring to FIG. 2, the structure of a shuffle circuit in accordance with one exemplary embodiment of the present disclosure is schematically illustrated in block diagram form. As shown in FIG. 2, the shuffle circuit 100 includes an input selector 120, a shuffler 130, an output selector 140, and a control circuit 160. The shuffle circuit 100 can acquire m operation data from m parallel threads, where m is an integer greater than or equal to 1, and correspondingly output m result data. Accordingly, the shuffle circuit 100 is a shuffle circuit based on a SIMD(m) architecture that can operate in different SIMD modes.

The control circuit 160 divides the m threads into n thread groups, each thread group comprising k threads, according to the maximum number k of threads that the shuffler 130 can process in parallel, where k, m, and n are integers greater than or equal to 1. Accordingly, the operation data of the m threads is divided into n operation data groups corresponding to the n thread groups, namely: a first operation data group 110-1, a second operation data group 110-2, ..., an (n-1)th operation data group 110-(n-1), and an nth operation data group 110-n, wherein each operation data group comprises k operation data. Likewise, the result data of the m threads is divided into n result data groups corresponding to the n thread groups, namely: a first result data group 150-1, a second result data group 150-2, ..., an (n-1)th result data group 150-(n-1), and an nth result data group 150-n, wherein each result data group comprises k result data. Based on the n operation data groups and the n result data groups of the n thread groups, the control circuit 160 may generate data correspondence information and send it to the input selector 120 and the output selector 140. The data correspondence information defines from the operation data of which thread group or groups the result data of each thread group is obtained. In other words, the data correspondence information defines from the operation data of which operation data group or groups the result data in each result data group is obtained.
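The grouping step can be pictured as slicing the thread indices into consecutive runs of k. The sketch below is an illustrative model only; it assumes that m is an exact multiple of k (so that n = m / k) and that groups are formed from consecutive thread numbers, as in the prior-art example of FIG. 1. The function name is hypothetical.

```python
def split_into_thread_groups(m, k):
    """Divide thread indices 0..m-1 into n = m // k consecutive groups of k.

    Assumes m is a multiple of k and groups are contiguous index ranges.
    """
    if k < 1 or m < 1 or m % k != 0:
        raise ValueError("m must be a positive multiple of k")
    return [range(g * k, (g + 1) * k) for g in range(m // k)]


if __name__ == "__main__":
    groups = split_into_thread_groups(128, 32)
    for g, threads in enumerate(groups, start=1):
        print(f"thread group {g}: threads {threads[0]}..{threads[-1]}")
```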

The input selector 120 selects the operation data of one or more corresponding thread groups from the n thread groups based on the received data correspondence information, and sequentially sends the operation data to the shuffler 130 in a predetermined order of the thread groups. It should be understood that, in this disclosure, "in a predetermined order of the thread groups" means the following. On the one hand, the input selector 120 sends the operation data group corresponding to one thread group to the shuffler 130 and, after the shuffler 130 finishes processing that operation data group, sends the operation data group corresponding to the next thread group to the shuffler 130. On the other hand, the input selector 120 sequentially sends all operation data groups corresponding to one result data group to the shuffler 130 in the manner just described and, after the shuffler 130 has processed those operation data groups in sequence, sequentially sends all operation data groups corresponding to the next result data group to the shuffler 130 in the same manner.

The shuffler 130 sequentially receives the operation data of the one or more corresponding thread groups from the input selector 120, performs a shuffle operation on the received k operation data of each corresponding thread group, and outputs j shuffle output data, where j is an integer and 0 ≤ j ≤ k. The output selector 140 sequentially receives, from the shuffler 130, the shuffle output data obtained from the operation data of the one or more corresponding thread groups, and generates result data for each thread group based on the shuffle output data according to the received data correspondence information. It should be understood that, because the input selector 120 selects the operation data of one or more corresponding thread groups from the n thread groups based on the received data correspondence information and sequentially sends the operation data to the shuffler 130 in a predetermined order of the thread groups, the output selector 140 correspondingly receives the shuffle output data from the shuffler 130 in sequence and, according to the data correspondence information, generates from it the group of result data of the corresponding thread group among the n thread groups. For example, the output selector 140 sequentially generates the result data of each of the first result data group 150-1, the second result data group 150-2, ..., the (n-1)th result data group 150-(n-1), and the nth result data group 150-n shown in FIG. 2.
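The data flow through the input selector, the shuffler, and the output selector can be mimicked by a small software model. The sketch below is a behavioral illustration under stated assumptions, not a description of the hardware: the shuffle itself is modeled as a gather in which result lane t reads operation lane `src_index[t]`, the correspondence table is supplied by the caller, and all names are hypothetical.

```python
def run_shuffle(op_data, src_index, correspondence, k):
    """Behavioral model: returns (result data, number of shuffle cycles used).

    correspondence[r] lists the operation thread groups feeding result
    group r (0-based). The shuffle is assumed to be a gather by src_index.
    """
    m = len(op_data)
    n = m // k
    result = [None] * m
    cycles = 0
    for rg in range(n):                              # one result group at a time
        for og in correspondence[rg]:                # input selector: listed groups only
            chunk = op_data[og * k:(og + 1) * k]     # k operation data sent to the shuffler
            cycles += 1                              # one shuffle cycle per group sent
            # Shuffler: emit j outputs (0 <= j <= k), one for every result lane
            # of this result group whose source lies in the current chunk.
            outputs = {
                t: chunk[src_index[t] - og * k]
                for t in range(rg * k, (rg + 1) * k)
                if og * k <= src_index[t] < (og + 1) * k
            }
            # Output selector: merge the shuffle outputs into the result group.
            for t, value in outputs.items():
                result[t] = value
    return result, cycles


if __name__ == "__main__":
    # Scaled-down example: m = 8 threads, k = 2, SIMD width 4 (two groups per block).
    data = [10, 11, 12, 13, 14, 15, 16, 17]
    src = [3, 2, 1, 0, 7, 6, 5, 4]                   # reverse within each block of 4
    table = [[0, 1], [0, 1], [2, 3], [2, 3]]
    print(run_shuffle(data, src, table, k=2))
```

Running the same inputs with a table that lists all four groups for every result group produces the same result data but uses 16 cycles instead of 8, which is the saving the correspondence information is meant to provide.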

It should be understood that, because the control circuit 160 of the shuffle circuit 100 shown in FIG. 2 generates data correspondence information defining from the operation data of which thread group or groups the result data of each thread group is obtained, the shuffle circuit 100 can ensure that every shuffle cycle it spends is used to generate result data and is therefore a valid shuffle cycle, thereby eliminating the waste of computational resources and improving processing efficiency. Accordingly, the shuffle circuit 100 shown in FIG. 2 can operate efficiently in different SIMD modes, which improves its compatibility with different SIMD modes.

In some exemplary embodiments of the present disclosure, the control circuit 160 can generate the data correspondence information based on the SIMD mode in which the shuffle circuit actually operates. In other exemplary embodiments of the present disclosure, the control circuit 160 can generate the data correspondence information based on a result data index flag corresponding to each result data group and an operation data index flag corresponding to each operation data group. These exemplary embodiments of the present disclosure are described in detail below in turn.

Referring to FIG. 3, the structure of a shuffle circuit in accordance with another exemplary embodiment of the present disclosure is schematically illustrated in block diagram form. As shown in FIG. 3, the shuffle circuit 200 includes an input selector 220, a shuffler 230, an output selector 240, and a SIMD mode control circuit 260. The shuffle circuit 200 is capable of acquiring 128 operation data from 128 parallel threads, and thus the shuffle circuit 200 is a shuffle circuit based on the SIMD128 architecture. The maximum number of threads that the shuffler 230 can process in parallel is 32. Accordingly, the SIMD mode control circuit 260 divides the 128 threads into 4 thread groups, each thread group including 32 threads. In other words, for the shuffle circuit 200, the maximum number k of threads that the shuffler can process in parallel, the number m of threads, and the number n of thread groups described above have the values k = 32, m = 128, and n = 4. Thus, the operation data of the 128 threads is divided into four operation data groups corresponding to the four thread groups, namely: a first operation data group 210-1, a second operation data group 210-2, a third operation data group 210-3, and a fourth operation data group 210-4. Correspondingly, the result data of the 128 threads is divided into four result data groups corresponding to the four thread groups, namely: a first result data group 250-1, a second result data group 250-2, a third result data group 250-3, and a fourth result data group 250-4. The SIMD mode control circuit 260 is capable of generating the data correspondence information based on a SIMD mode, the SIMD mode being one of a SIMD32 mode, a SIMD64 mode, and a SIMD128 mode.

Referring to fig. 4a, 4b, 4c, and in combination with fig. 3, fig. 4a, 4b, 4c schematically illustrate the correspondence between operation data sets and result data sets, respectively, when the shuffle circuit 200 illustrated in fig. 3 operates in different SIMD modes.

FIG. 4a schematically shows the correspondence between the operation data groups and the result data groups when the shuffle circuit 200 operates in the SIMD32 mode. As shown in FIG. 4a, when the SIMD mode is the SIMD32 mode, the data correspondence information generated by the SIMD mode control circuit 260 includes: the result data of the first result data group 250-1 is obtained only from the operation data of the first operation data group 210-1; the result data of the second result data group 250-2 is obtained only from the operation data of the second operation data group 210-2; the result data of the third result data group 250-3 is obtained only from the operation data of the third operation data group 210-3; and the result data of the fourth result data group 250-4 is obtained only from the operation data of the fourth operation data group 210-4. Accordingly, the shuffle circuit 200 requires 4 shuffle cycles to generate the result data for the 128 threads when operating in the SIMD32 mode.

FIG. 4b schematically shows the correspondence between the operation data groups and the result data groups when the shuffle circuit 200 operates in the SIMD64 mode. As shown in FIG. 4b, when the SIMD mode is the SIMD64 mode, the data correspondence information generated by the SIMD mode control circuit 260 includes: the result data of the first result data group 250-1 is obtained only from the operation data of the first operation data group 210-1 and the second operation data group 210-2; the result data of the second result data group 250-2 is obtained only from the operation data of the first operation data group 210-1 and the second operation data group 210-2; the result data of the third result data group 250-3 is obtained only from the operation data of the third operation data group 210-3 and the fourth operation data group 210-4; and the result data of the fourth result data group 250-4 is obtained only from the operation data of the third operation data group 210-3 and the fourth operation data group 210-4. Each of the four result data groups thus requires two shuffle cycles, so the shuffle circuit 200 requires 8 shuffle cycles to generate the result data for the 128 threads when operating in the SIMD64 mode.

FIG. 4c schematically shows the correspondence between the operation data groups and the result data groups when the shuffle circuit 200 operates in the SIMD128 mode. As shown in FIG. 4c, when the SIMD mode is the SIMD128 mode, the data correspondence information generated by the SIMD mode control circuit 260 includes: the result data of the first result data group 250-1 is obtained from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4; the result data of the second result data group 250-2 is obtained from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4; the result data of the third result data group 250-3 is obtained from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4; and the result data of the fourth result data group 250-4 is obtained from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4. Each of the four result data groups thus requires four shuffle cycles, so the shuffle circuit 200 requires 16 shuffle cycles to generate the result data for the 128 threads when operating in the SIMD128 mode.
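The three correspondences of FIGS. 4a, 4b and 4c follow one rule: a result data group draws on exactly those operation data groups that lie in the same SIMD-mode-sized block of threads. The Python sketch below illustrates that rule under stated assumptions. The function name `simd_mode_correspondence` and the 0-based group numbering are introduced here for illustration and do not appear in the disclosure, and the sketch assumes that m and the SIMD width are both multiples of k.

```python
from typing import Dict, List, Tuple


def simd_mode_correspondence(
    m: int, k: int, simd_width: int
) -> Tuple[Dict[int, List[int]], int]:
    """Return (correspondence, shuffle_cycles) for one SIMD mode.

    m          : number of threads (128 in FIG. 3)
    k          : threads the shuffler processes in parallel (32 in FIG. 3)
    simd_width : 32, 64 or 128 for the SIMD32, SIMD64 or SIMD128 mode
    Groups are numbered from 0, so group 0 is the "first" group.
    """
    assert m % k == 0 and simd_width % k == 0 and m % simd_width == 0
    n = m // k                          # number of thread groups
    groups_per_block = simd_width // k  # operation groups feeding one result group
    correspondence: Dict[int, List[int]] = {}
    for dst in range(n):
        first = (dst // groups_per_block) * groups_per_block
        correspondence[dst] = list(range(first, first + groups_per_block))
    # One shuffle cycle per (result group, operation group) pair.
    cycles = sum(len(src) for src in correspondence.values())
    return correspondence, cycles
```

For m = 128 and k = 32 this yields 4, 8 and 16 shuffle cycles for the SIMD32, SIMD64 and SIMD128 modes respectively, matching the counts stated above.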

With continued reference to FIG. 3, the structures and functions of the input selector 220, the shuffler 230, and the output selector 240 included in the shuffle circuit 200 are the same as or similar to those of the input selector 120, the shuffler 130, and the output selector 140, respectively, included in the shuffle circuit 100 shown in FIG. 2, and therefore will not be described again here. It should be understood that, in the shuffle circuit 200 shown in FIG. 3, the SIMD mode control circuit 260 generates, according to the SIMD mode employed, data correspondence information defining from which operation data of which thread group or groups the result data of each thread group is obtained. The shuffle circuit 200 thereby performs the shuffle operation in a fixed number of shuffle cycles for each SIMD mode, which, compared with shuffle circuits in the prior art, reduces the number of shuffle cycles, eliminates waste of computational resources, and improves processing efficiency.

Referring to FIG. 5, the structure of a shuffle circuit in accordance with another exemplary embodiment of the present disclosure is schematically illustrated in block diagram form. As shown in FIG. 5, the shuffle circuit 300 includes an input selector 320, a shuffler 330, an output selector 340, and an index flag control circuit 360. The shuffle circuit 300 is capable of acquiring 128 operation data from 128 parallel threads, and therefore the shuffle circuit 300 is a shuffle circuit based on the SIMD128 architecture. The maximum number of threads that the shuffler 330 can process in parallel is 32. Accordingly, the index flag control circuit 360 divides the 128 threads into 4 thread groups, each thread group including 32 threads. In other words, for the shuffle circuit 300, the maximum number k of threads that the shuffler can process in parallel, the number m of threads, and the number n of thread groups described above have the values k = 32, m = 128, and n = 4. Thus, the operation data of the 128 threads is divided into four operation data groups corresponding to the four thread groups, namely: a first operation data group 310-1, a second operation data group 310-2, a third operation data group 310-3, and a fourth operation data group 310-4. Correspondingly, the result data of the 128 threads is divided into four result data groups corresponding to the four thread groups, namely: a first result data group 350-1, a second result data group 350-2, a third result data group 350-3, and a fourth result data group 350-4.

The index flag control circuit 360 includes a result data index flag generator 361 and an operation data index flag generator 362. The result data index flag generator 361 generates a 4-bit result data index flag dst_grp_mask according to the validity of the 128 threads, wherein each bit of the result data index flag dst_grp_mask corresponds to the set of result data of one of the 4 thread groups, that is, to one of the four result data groups shown in FIG. 5. The operation data index flag generator 362 calculates the operation data index corresponding to each result data in order to generate, for the set of result data of each of the 4 thread groups (i.e., for each of the four result data groups shown in FIG. 5), a 4-bit operation data index flag src_grp_mask, wherein each bit of the operation data index flag src_grp_mask corresponds to the set of operation data of one of the 4 thread groups, that is, to one of the four operation data groups shown in FIG. 5. As a non-limiting example, the operation data index may be calculated according to a formula corresponding to the shuffle operation. The operation data index indicates from which operation data a given result data is derived, and therefore the distribution of the result data with respect to the operation data can be obtained from the operation data index. The index flag control circuit 360 can thereby generate, based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, data correspondence information that defines from which operation data of which thread group or groups the result data of each thread group is obtained. In one exemplary embodiment, the index flag control circuit 360 may determine the set of result data of a thread group corresponding to a bit having a value of 1 in the result data index flag dst_grp_mask to be a valid result data group, and, for each valid result data group, obtain the result data from the operation data of the thread groups corresponding to the bits having a value of 1 in the corresponding operation data index flag src_grp_mask.
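The way the two index flags could be derived is sketched below in Python. The disclosure does not specify the exact rule, so the sketch makes two labeled assumptions: a result data group is treated as valid when at least one of its threads is valid, and a bit of src_grp_mask is set when the operation data index of at least one valid thread in the result group falls into the corresponding operation data group. The function name `make_masks` and the per-thread list `src_index` are hypothetical. Flags are returned as bit strings whose first character is the "first bit" in the sense of the description.

```python
from typing import List, Sequence, Tuple


def make_masks(
    valid: Sequence[bool], src_index: Sequence[int], k: int
) -> Tuple[str, List[str]]:
    """Return (dst_grp_mask, [src_grp_mask for each result group]).

    valid[t]     : whether thread t is valid
    src_index[t] : assumed operation data index of thread t, i.e. the thread
                   whose operation data becomes the result data of thread t
    k            : number of threads per thread group
    """
    n = len(valid) // k
    dst_bits: List[str] = []
    src_masks: List[str] = []
    for g in range(n):
        # Assumption: only valid threads contribute to the flags.
        active = [t for t in range(g * k, (g + 1) * k) if valid[t]]
        dst_bits.append("1" if active else "0")
        # Operation groups that hold the sources of this result group.
        needed = {src_index[t] // k for t in active}
        src_masks.append("".join("1" if s in needed else "0" for s in range(n)))
    return "".join(dst_bits), src_masks
```

Under these assumptions, a rotation confined to each group of 32 threads gives the one-hot flags "1000", "0100", "0010" and "0001", while a rotation by one position across all 128 threads gives "1100", "0110", "0011" and "1001".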

Further, it should be understood that the configuration and function of the input selector 320, the shuffler 330, and the output selector 340 included in the shuffle circuit 300 are the same as or similar to those of the input selector 120, the shuffler 130, and the output selector 140 included in the shuffle circuit 100 shown in fig. 2, respectively, and thus, will not be described in detail herein.

Referring to FIGS. 6a, 6b, 6c and 6d in combination with FIG. 5, FIGS. 6a, 6b, 6c and 6d together schematically illustrate the operation of the shuffle circuit 300 shown in FIG. 5 for one SIMD mode.

FIG. 6a schematically shows the process by which the shuffle circuit 300 obtains the result data of the first result data group 350-1. As shown in FIG. 6a, if all 128 threads are active, the result data index flag dst_grp_mask is "1111". Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the first bit of dst_grp_mask (the bit marked with a bold box in FIG. 6a), namely the first result data group 350-1. Because src_grp_mask is "1111", the operation data of the first operation data group 310-1, which corresponds to the first bit of src_grp_mask, is sent to the shuffler 330 first to generate result data in the first result data group 350-1. The operation data index flag processing section 370 then changes src_grp_mask to "0111"; the first bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the first result data group 350-1 from the first operation data group 310-1. With src_grp_mask at "0111", the operation data of the second operation data group 310-2, which corresponds to the second bit, is next sent to the shuffler 330 to generate result data in the first result data group 350-1. The operation data index flag processing section 370 then changes src_grp_mask to "0011", which means that the shuffle circuit 300 has obtained the result data in the first result data group 350-1 from the second operation data group 310-2. With src_grp_mask at "0011", the operation data of the third operation data group 310-3, which corresponds to the third bit, is then sent to the shuffler 330 to generate result data in the first result data group 350-1. The operation data index flag processing section 370 then changes src_grp_mask to "0001", which means that the shuffle circuit 300 has obtained the result data in the first result data group 350-1 from the third operation data group 310-3. With src_grp_mask at "0001", the operation data of the fourth operation data group 310-4, which corresponds to the fourth bit, is finally sent to the shuffler 330 to generate result data in the first result data group 350-1. The operation data index flag processing section 370 then changes src_grp_mask to "0000", which means that the shuffle circuit 300 has performed the shuffle operation on the operation data of all corresponding operation data groups and no further operation data needs to be acquired. It can be seen that, during the operation shown in FIG. 6a, 4 shuffle cycles are required to generate the result data in the first result data group 350-1.
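The bit-by-bit consumption of src_grp_mask walked through above can be traced with a few lines of Python. This is only an illustrative model of the behavior attributed to the operation data index flag processing section 370. The function name `trace_src_mask` is hypothetical, and the flag is represented as a bit string whose first character is the first bit.

```python
from typing import List, Tuple


def trace_src_mask(src_grp_mask: str) -> List[Tuple[int, str]]:
    """Return one (operation_group_number, flag_after_clearing) per shuffle cycle.

    Set bits are served from the first bit to the last; each served bit is
    cleared, and processing of the result group ends when the flag is all zeros.
    Operation groups are numbered from 1, as in the description.
    """
    bits = list(src_grp_mask)
    steps: List[Tuple[int, str]] = []
    for pos, bit in enumerate(bits):
        if bit == "1":
            bits[pos] = "0"  # this operation group has now been shuffled
            steps.append((pos + 1, "".join(bits)))
    return steps


# FIG. 6a: the flag "1111" is consumed in 4 shuffle cycles.
for group, flag in trace_src_mask("1111"):
    print(f"operation data group {group} -> src_grp_mask = {flag}")
```

The number of entries returned equals the number of shuffle cycles spent on that result data group: four for the flag "1111", and one for a one-hot flag such as "0100".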

FIG. 6b schematically shows the process by which the shuffle circuit 300 obtains the result data of the second result data group 350-2. As shown in FIG. 6b, the result data index flag dst_grp_mask has become "0111", which means that the shuffle circuit 300 has generated the result data in the first result data group 350-1. Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the second bit of dst_grp_mask (the bit marked with a bold box in FIG. 6b), namely the second result data group 350-2. Because src_grp_mask is "1111", the operation data of the first operation data group 310-1, which corresponds to the first bit of src_grp_mask, is sent to the shuffler 330 first to generate result data in the second result data group 350-2. The operation data index flag processing section 370 then changes src_grp_mask to "0111"; the first bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the second result data group 350-2 from the first operation data group 310-1. With src_grp_mask at "0111", the operation data of the second operation data group 310-2, which corresponds to the second bit, is next sent to the shuffler 330 to generate result data in the second result data group 350-2. The operation data index flag processing section 370 then changes src_grp_mask to "0011", which means that the shuffle circuit 300 has obtained the result data in the second result data group 350-2 from the second operation data group 310-2. With src_grp_mask at "0011", the operation data of the third operation data group 310-3, which corresponds to the third bit, is then sent to the shuffler 330 to generate result data in the second result data group 350-2. The operation data index flag processing section 370 then changes src_grp_mask to "0001", which means that the shuffle circuit 300 has obtained the result data in the second result data group 350-2 from the third operation data group 310-3. With src_grp_mask at "0001", the operation data of the fourth operation data group 310-4, which corresponds to the fourth bit, is finally sent to the shuffler 330 to generate result data in the second result data group 350-2. The operation data index flag processing section 370 then changes src_grp_mask to "0000", which means that the shuffle circuit 300 has performed the shuffle operation on the operation data of all corresponding operation data groups and no further operation data needs to be acquired. It can be seen that, during the operation shown in FIG. 6b, 4 shuffle cycles are likewise required to generate the result data in the second result data group 350-2.

FIG. 6c schematically shows the process by which the shuffle circuit 300 obtains the result data of the third result data group 350-3. As shown in FIG. 6c, the result data index flag dst_grp_mask has become "0011", which means that the shuffle circuit 300 has generated the result data in the first result data group 350-1 and the second result data group 350-2. Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the third bit of dst_grp_mask (the bit marked with a bold box in FIG. 6c), namely the third result data group 350-3. Because src_grp_mask is "1111", the operation data of the first operation data group 310-1, which corresponds to the first bit of src_grp_mask, is sent to the shuffler 330 first to generate result data in the third result data group 350-3. The operation data index flag processing section 370 then changes src_grp_mask to "0111"; the first bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the third result data group 350-3 from the first operation data group 310-1. With src_grp_mask at "0111", the operation data of the second operation data group 310-2, which corresponds to the second bit, is next sent to the shuffler 330 to generate result data in the third result data group 350-3. The operation data index flag processing section 370 then changes src_grp_mask to "0011", which means that the shuffle circuit 300 has obtained the result data in the third result data group 350-3 from the second operation data group 310-2. With src_grp_mask at "0011", the operation data of the third operation data group 310-3, which corresponds to the third bit, is then sent to the shuffler 330 to generate result data in the third result data group 350-3. The operation data index flag processing section 370 then changes src_grp_mask to "0001", which means that the shuffle circuit 300 has obtained the result data in the third result data group 350-3 from the third operation data group 310-3. With src_grp_mask at "0001", the operation data of the fourth operation data group 310-4, which corresponds to the fourth bit, is finally sent to the shuffler 330 to generate result data in the third result data group 350-3. The operation data index flag processing section 370 then changes src_grp_mask to "0000", which means that the shuffle circuit 300 has performed the shuffle operation on the operation data of all corresponding operation data groups and no further operation data needs to be acquired. It can be seen that, during the operation shown in FIG. 6c, 4 shuffle cycles are likewise required to generate the result data in the third result data group 350-3.

FIG. 6d schematically shows the process by which the shuffle circuit 300 obtains the result data of the fourth result data group 350-4. As shown in FIG. 6d, the result data index flag dst_grp_mask has become "0001", which means that the shuffle circuit 300 has generated the result data in the first result data group 350-1, the second result data group 350-2, and the third result data group 350-3. Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the fourth bit of dst_grp_mask (the bit marked with a bold box in FIG. 6d), namely the fourth result data group 350-4. Because src_grp_mask is "1111", the operation data of the first operation data group 310-1, which corresponds to the first bit of src_grp_mask, is sent to the shuffler 330 first to generate result data in the fourth result data group 350-4. The operation data index flag processing section 370 then changes src_grp_mask to "0111"; the first bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the fourth result data group 350-4 from the first operation data group 310-1. With src_grp_mask at "0111", the operation data of the second operation data group 310-2, which corresponds to the second bit, is next sent to the shuffler 330 to generate result data in the fourth result data group 350-4. The operation data index flag processing section 370 then changes src_grp_mask to "0011", which means that the shuffle circuit 300 has obtained the result data in the fourth result data group 350-4 from the second operation data group 310-2. With src_grp_mask at "0011", the operation data of the third operation data group 310-3, which corresponds to the third bit, is then sent to the shuffler 330 to generate result data in the fourth result data group 350-4. The operation data index flag processing section 370 then changes src_grp_mask to "0001", which means that the shuffle circuit 300 has obtained the result data in the fourth result data group 350-4 from the third operation data group 310-3. With src_grp_mask at "0001", the operation data of the fourth operation data group 310-4, which corresponds to the fourth bit, is finally sent to the shuffler 330 to generate result data in the fourth result data group 350-4. The operation data index flag processing section 370 then changes src_grp_mask to "0000", which means that the shuffle circuit 300 has performed the shuffle operation on the operation data of all corresponding operation data groups and no further operation data needs to be acquired. It can be seen that, during the operation shown in FIG. 6d, 4 shuffle cycles are likewise required to generate the result data in the fourth result data group 350-4.

It can be seen that in the operation process schematically illustrated in figures 6a, 6b, 6c and 6d, 16 shuffle cycles are required for the shuffle circuit 300 to generate 128 result data for 128 threads, and that generating the result data for each result data set requires traversing all four operation data sets. Thus, during the operation schematically illustrated in figures 6a, 6b, 6c and 6d, the shuffle circuit 300 based on the SIMD128 architecture operates in SIMD128 mode.

Referring to FIGS. 7a, 7b, 7c and 7d in combination with FIG. 5, FIGS. 7a, 7b, 7c and 7d together schematically illustrate the operation of the shuffle circuit 300 shown in FIG. 5 for another SIMD mode.

FIG. 7a schematically shows the process by which the shuffle circuit 300 obtains the result data of the first result data group 350-1. As shown in FIG. 7a, if all 128 threads are active, the result data index flag dst_grp_mask is "1111". Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the first bit of dst_grp_mask (the bit marked with a bold box in FIG. 7a), namely the first result data group 350-1. Because src_grp_mask is "1000", only the operation data of the first operation data group 310-1, which corresponds to the first bit of src_grp_mask, is sent to the shuffler 330 to generate the result data in the first result data group 350-1. The operation data index flag processing section 370 then changes src_grp_mask to "0000". The first bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the first result data group 350-1 from the first operation data group 310-1, and the flag as a whole being "0000" means that the shuffle operation has been performed on the operation data of all corresponding operation data groups, so that no result data in the first result data group 350-1 needs to be obtained from any other operation data group. Thus, during the operation shown in FIG. 7a, 1 shuffle cycle is required to generate the result data in the first result data group 350-1.

FIG. 7b schematically shows the process by which the shuffle circuit 300 obtains the result data of the second result data group 350-2. As shown in FIG. 7b, the result data index flag dst_grp_mask has become "0111", which means that the shuffle circuit 300 has generated the result data in the first result data group 350-1. Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the second bit of dst_grp_mask (the bit marked with a bold box in FIG. 7b), namely the second result data group 350-2. Because src_grp_mask is "0100", only the operation data of the second operation data group 310-2, which corresponds to the second bit of src_grp_mask, is sent to the shuffler 330 to generate the result data in the second result data group 350-2. The operation data index flag processing section 370 then changes src_grp_mask to "0000". The second bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the second result data group 350-2 from the second operation data group 310-2, and the flag as a whole being "0000" means that the shuffle operation has been performed on the operation data of all corresponding operation data groups, so that no result data in the second result data group 350-2 needs to be obtained from any other operation data group. Thus, during the operation shown in FIG. 7b, 1 shuffle cycle is required to generate the result data in the second result data group 350-2.

FIG. 7c schematically shows the process by which the shuffle circuit 300 obtains the result data of the third result data group 350-3. As shown in FIG. 7c, the result data index flag dst_grp_mask has become "0011", which means that the shuffle circuit 300 has generated the result data in the first result data group 350-1 and the second result data group 350-2. Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the third bit of dst_grp_mask (the bit marked with a bold box in FIG. 7c), namely the third result data group 350-3. Because src_grp_mask is "0010", only the operation data of the third operation data group 310-3, which corresponds to the third bit of src_grp_mask, is sent to the shuffler 330 to generate the result data in the third result data group 350-3. The operation data index flag processing section 370 then changes src_grp_mask to "0000". The third bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the third result data group 350-3 from the third operation data group 310-3, and the flag as a whole being "0000" means that the shuffle operation has been performed on the operation data of all corresponding operation data groups, so that no result data in the third result data group 350-3 needs to be obtained from any other operation data group. Thus, during the operation shown in FIG. 7c, 1 shuffle cycle is required to generate the result data in the third result data group 350-3.

FIG. 7d schematically shows the process by which the shuffle circuit 300 obtains the result data of the fourth result data group 350-4. As shown in FIG. 7d, the result data index flag dst_grp_mask has become "0001", which means that the shuffle circuit 300 has generated the result data in the first result data group 350-1, the second result data group 350-2, and the third result data group 350-3. Based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffle circuit 300 therefore generates the result data of the result data group corresponding to the fourth bit of dst_grp_mask (the bit marked with a bold box in FIG. 7d), namely the fourth result data group 350-4. Because src_grp_mask is "0001", only the operation data of the fourth operation data group 310-4, which corresponds to the fourth bit of src_grp_mask, is sent to the shuffler 330 to generate the result data in the fourth result data group 350-4. The operation data index flag processing section 370 then changes src_grp_mask to "0000". The fourth bit becoming "0" means that the shuffle circuit 300 has obtained the result data in the fourth result data group 350-4 from the fourth operation data group 310-4, and the flag as a whole being "0000" means that the shuffle operation has been performed on the operation data of all corresponding operation data groups, so that no result data in the fourth result data group 350-4 needs to be obtained from any other operation data group. It can be seen that, during the operation shown in FIG. 7d, 1 shuffle cycle is required to generate the result data in the fourth result data group 350-4.

It can be seen that, in the operation schematically illustrated in Figs. 7a, 7b, 7c and 7d, the shuffle circuit 300 requires 4 shuffle cycles to generate the 128 result data for the 128 threads: the result data of the first result data group 350-1 are obtained from the operation data of the first operation data group 310-1, the result data of the second result data group 350-2 from the operation data of the second operation data group 310-2, the result data of the third result data group 350-3 from the operation data of the third operation data group 310-3, and the result data of the fourth result data group 350-4 from the operation data of the fourth operation data group 310-4. Thus, in this operation, the shuffle circuit 300 based on the SIMD128 architecture operates in SIMD32 mode.
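The flag handling walked through for Figs. 7a to 7d can be mirrored in a few lines of software. The following is only an illustrative sketch, not the circuit itself: the function name `trace_shuffle_cycles` and the list-of-bits representation (index 0 standing for the first, leftmost bit of a flag) are assumptions made for the example, and the src_grp_mask values used for the first and second result data groups are inferred from the SIMD32 pattern rather than quoted from the text.

```python
def trace_shuffle_cycles(dst_grp_mask, src_grp_masks):
    """Return one (result_group, operation_group) pair per shuffle cycle.

    dst_grp_mask:  list of n bits; index 0 is the first result data group.
    src_grp_masks: list of n bit lists, one src_grp_mask per result data group.
    Both layouts are assumptions of this sketch (0-based group indices).
    """
    dst = list(dst_grp_mask)
    cycles = []
    for result_group in range(len(dst)):
        if not dst[result_group]:
            continue  # bit is 0: no result data wanted, no cycle is spent
        src = list(src_grp_masks[result_group])
        for op_group in range(len(src)):
            if src[op_group]:
                cycles.append((result_group, op_group))  # one shuffle cycle
                src[op_group] = 0  # flag processing clears the served bit
        dst[result_group] = 0  # all needed operation groups have been read
    return cycles


# SIMD32-style case: every result data group reads only its own operation data group.
simd32 = trace_shuffle_cycles(
    [1, 1, 1, 1],
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
)
print(len(simd32), simd32)
```

Each entry of the returned list stands for one valid shuffle cycle, so the length of the list is the cycle count for the given flags.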

It should be understood that, in the shuffle circuit 300 shown in Fig. 5, the index flag control circuit 360 includes a result data index flag generator 361 and an operation data index flag generator 362, and generates, based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, data correspondence information that defines from the operation data of which thread group or groups the result data of each thread group is obtained. In this manner, every shuffle cycle that the shuffle circuit 300 spends to generate the result data is a valid shuffle cycle, which removes invalid shuffle cycles, eliminates waste of computational resources, and improves processing efficiency.

Furthermore, generating the data correspondence information based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask enables the shuffle circuit 300 to determine the number of valid shuffle cycles more flexibly, and thereby to better accommodate different SIMD modes. Taking as an example a shuffle circuit 300 that is based on a SIMD128 architecture and performs the shuffle operation on 32 threads at a time, the circuit can process the SIMD128 mode in 1 to 16 shuffle cycles, the SIMD64 mode in 1 to 8 shuffle cycles, and the SIMD32 mode in 1 to 4 shuffle cycles, the actual number in each case depending on the actual conditions. Even for some SIMD modes of lower granularity, the shuffle circuit 300 is able to complete the shuffle operation within 1 to 4 shuffle cycles.
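The cycle ranges quoted above are consistent with a simple count, given here as a hedged reading rather than a formula stated in the disclosure: each of the n = m/k result data groups can depend on at most ceil(w/k) operation data groups in a SIMD mode of width w, and at least one cycle is needed whenever any result data are produced. The helper name `cycle_bounds` is hypothetical.

```python
def cycle_bounds(simd_width, m=128, k=32):
    """Return (min_cycles, max_cycles) for one shuffle over m threads.

    Assumed model: a result data group never reads operation data from
    outside its own SIMD window of `simd_width` threads, and the shuffler
    handles one operation data group of k threads per cycle.
    """
    n = m // k                                # number of thread groups
    groups_per_window = -(-simd_width // k)   # ceil(simd_width / k)
    return 1, n * groups_per_window


for width in (128, 64, 32, 16):
    print(width, cycle_bounds(width))
```

Under this model the upper bound is reached only when every result data group is valid and needs every operation data group in its window; sparser validity or more local access patterns fall anywhere between the two bounds.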

It should also be understood that the index flag control circuit 360 in the shuffle circuit 300 shown in Fig. 5 may also be applied in the shuffle circuit 100 shown in Fig. 2, for example in place of the control circuit 160. In this case, the result data index flag generator 361 generates an n-bit result data index flag dst_grp_mask according to the validity of the m threads, wherein each bit of the result data index flag dst_grp_mask corresponds to the group of result data of one of the n thread groups; the operation data index flag generator 362 calculates the operation data index corresponding to each result data so as to generate, for the group of result data of each of the n thread groups, an n-bit operation data index flag src_grp_mask, wherein each bit of the operation data index flag src_grp_mask corresponds to the group of operation data of one of the n thread groups; and the index flag control circuit 360 generates the data correspondence information based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask.
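How the two generators could derive their flags can be sketched in software as follows. This is an assumption-laden illustration: the per-thread inputs `valid` (thread validity) and `src_index` (the index of the thread whose operation data a given thread wants) are hypothetical names, and the rule that a dst_grp_mask bit is set when at least one thread of the group is valid is one plausible reading of "according to the validity of the m threads".

```python
def make_group_masks(valid, src_index, k=32):
    """Build dst_grp_mask and one src_grp_mask per result data group.

    valid[t]     -- True if thread t is valid (assumed input).
    src_index[t] -- thread whose operation data thread t reads (assumed input).
    Bits are stored as lists; index 0 is the first thread group.
    """
    m = len(valid)
    n = m // k
    dst_grp_mask = [0] * n
    src_grp_masks = [[0] * n for _ in range(n)]
    for t in range(m):
        if not valid[t]:
            continue  # invalid threads contribute to neither flag
        group = t // k
        dst_grp_mask[group] = 1                       # result group has work
        src_grp_masks[group][src_index[t] // k] = 1   # operation group needed
    return dst_grp_mask, src_grp_masks


# 128 threads, only the first 64 valid; groups 0 and 1 swap data lane by lane.
valid = [t < 64 for t in range(128)]
src_index = [t ^ 32 for t in range(128)]
dst_mask, src_masks = make_group_masks(valid, src_index)
print(dst_mask, src_masks[0], src_masks[1])
```

In this example only two shuffle cycles would be needed, one per valid result data group, because each valid group reads from exactly one operation data group.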

Referring to Fig. 8, a data shuffling method according to one exemplary embodiment of the present disclosure is schematically illustrated in flow chart form. As shown in Fig. 8, the data shuffling method 500 includes steps 510, 520, 530, 540, and 550:

at step 510, dividing m threads into n thread groups according to the maximum number k of threads capable of being processed in parallel, wherein each thread group comprises k threads, and k, m and n are integers greater than or equal to 1;

at step 520, generating data correspondence information, the data correspondence information defining from the operation data of which thread group or groups the result data of each thread group is obtained;

at step 530, selecting the operation data of one or more corresponding thread groups from the n thread groups according to the data correspondence information;

at step 540, performing a shuffle operation on the k operation data of each corresponding thread group and outputting j shuffle output data, where j is an integer and 0 ≤ j ≤ k;

at step 550, generating result data for each thread group, in accordance with the data correspondence information, based on the shuffle output data obtained from the operation data of the one or more corresponding thread groups.
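Steps 510 to 550 can be modelled end to end in software. The sketch below is a behavioural illustration under stated assumptions, not the claimed hardware: the function name `shuffle_method`, the inputs `src_index` and `valid`, and the choice to visit result groups and operation groups in ascending order are all hypothetical. Each pass of the inner loop over an operation data group stands for one shuffle cycle.

```python
def shuffle_method(values, src_index, valid, k=32):
    """Software model of steps 510-550; returns (result, cycles).

    values[t]    -- operation data of thread t.
    src_index[t] -- thread whose operation data thread t wants (assumed input).
    valid[t]     -- True if thread t takes part (assumed input).
    """
    m = len(values)
    n = m // k                                     # step 510: n groups of k
    result = [None] * m
    cycles = 0
    for group in range(n):
        lanes = [t for t in range(group * k, (group + 1) * k) if valid[t]]
        # step 520: operation data groups this result group depends on
        needed = sorted({src_index[t] // k for t in lanes})
        for op_group in needed:
            # step 530: select the k operation data of one thread group
            op_data = values[op_group * k:(op_group + 1) * k]
            # step 540: the shuffler emits j outputs, 0 <= j <= k
            for t in lanes:
                if src_index[t] // k == op_group:
                    # step 550: route each output to its result thread
                    result[t] = op_data[src_index[t] % k]
            cycles += 1
    return result, cycles


values = [10 * t for t in range(128)]
all_valid = [True] * 128
rotated, rotate_cycles = shuffle_method(
    values, [(t + 16) % 128 for t in range(128)], all_valid
)
print(rotate_cycles, rotated[:3])
```

In the rotate-by-16 usage example every result data group straddles two operation data groups, so the model spends two cycles per group; a pattern in which each group reads only from one group would spend one.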

The data shuffling method 500 shown in Fig. 8 generates data correspondence information defining from the operation data of which thread group or groups the result data of each thread group is obtained. It can therefore ensure that every shuffle cycle is used to generate result data, i.e., is a valid shuffle cycle, thereby eliminating waste of computational resources and improving processing efficiency.

According to an exemplary embodiment, step 520 of the data shuffling method 500 may further comprise: generating the data correspondence information based on a SIMD mode. Further, as a non-limiting example, m has a value of 128, k has a value of 32, n has a value of 4, and the SIMD mode is one of a SIMD32 mode, a SIMD64 mode, and a SIMD128 mode. Thus, the data correspondence information generated based on the SIMD mode may include the correspondence between the operation data groups and the result data groups in the different SIMD modes (e.g., the SIMD32 mode, the SIMD64 mode, or the SIMD128 mode) as previously illustrated in Figs. 4a, 4b, and 4c. The data shuffling method 500 thus completes the shuffle operation in a fixed number of shuffle cycles for each SIMD mode, which, compared to data shuffling methods in the prior art, reduces the number of shuffle cycles, eliminates waste of computational resources, and improves processing efficiency.
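The mode-based correspondence can be written out as a small lookup. The sketch below is illustrative only; the function name `correspondence_for_mode` is hypothetical and group indices are 0-based, so index 0 denotes the first thread group. The mapping it produces follows the per-mode correspondences recited in claims 4 to 6.

```python
def correspondence_for_mode(simd_width, m=128, k=32):
    """Map each result thread group to the operation thread groups it reads.

    Assumes simd_width is a multiple of k and divides m, as with the
    SIMD32, SIMD64 and SIMD128 modes for m = 128 and k = 32.
    """
    n = m // k
    per_window = simd_width // k   # thread groups inside one SIMD window
    table = {}
    for group in range(n):
        first = (group // per_window) * per_window
        table[group] = list(range(first, first + per_window))
    return table


for width in (32, 64, 128):
    print(width, correspondence_for_mode(width))
```

With a fixed table of this kind the cycle count per mode is constant: 4 cycles in SIMD32 mode, 8 in SIMD64 mode, and 16 in SIMD128 mode for m = 128 and k = 32.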

Referring to Fig. 9, details of the data shuffling method shown in Fig. 8 are shown in flow chart form. As shown in Fig. 9, step 520 of the data shuffling method 500 shown in Fig. 8 further includes steps 521, 522, and 523:

in step 521, according to the validity of the m threads, generating a result data index flag with n bits, where each bit in the result data index flag corresponds to a set of result data of one thread group in the n thread groups;

at step 522, calculating an operation data index corresponding to each result data to generate an operation data index flag with n bits for a set of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to a set of operation data of one of the n thread groups;

in step 523, the data correspondence information is generated based on the result data index flag and the operation data index flag.

Referring to Fig. 10, details of the data shuffling method shown in Fig. 9 are shown in flow chart form. As shown in Fig. 10, step 523 shown in Fig. 9 further includes steps 523a and 523b:

in step 523a, the group of result data of each thread group corresponding to a bit whose value is 1 in the result data index flag is determined to be a valid result data group;

in step 523b, for each valid result data group, result data is obtained from the operation data of the thread group corresponding to each bit whose value is 1 in the corresponding operation data index flag.

As has been described in detail previously, through the steps shown in fig. 9 and 10, the data shuffling method 500 not only enables the shuffle cycles taken to generate all the result data to be valid shuffle cycles, thereby removing invalid shuffle cycles, eliminating waste of computation resources, improving processing efficiency, but also enables more flexible determination of the number of valid shuffle cycles, thereby better accommodating different SIMD modes.

Referring to Fig. 11, the structure of a chip according to an exemplary embodiment of the present disclosure is schematically illustrated in block diagram form. As shown in Fig. 11, the chip 600 includes a shuffle circuit 610, where the shuffle circuit 610 may be any of the shuffle circuits 100, 200, and 300 shown in Figs. 2, 3, and 5 of the present disclosure. It should be understood that the chip 600 may be any suitable kind of chip including, but not limited to, a GPU chip, a CPU chip, and the like.

Referring to fig. 12, the structure of an integrated circuit device according to one exemplary embodiment of the present disclosure is schematically illustrated in block diagram form. As shown in fig. 12, the integrated circuit device 700 includes the chip 600 shown in fig. 11. It should be understood that the integrated circuit device 700 may be any suitable kind of integrated circuit device including, but not limited to, an integrated graphics card, a stand-alone graphics card, an image processing device, and the like.

It should be understood that the shuffle circuits 100, 200, and 300 provided according to the exemplary embodiments shown in Figs. 2, 3 and 5 of the present disclosure may each be implemented in the form of any suitable hardware circuit. These hardware circuits may be implemented using any suitable technique or combination of techniques known in the art, including, by way of non-limiting example: discrete logic circuits with logic gates for implementing logic functions on data signals, application specific integrated circuits with appropriate combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.

It should also be understood that all or some of the steps of the data shuffling methods provided in accordance with the exemplary embodiments illustrated in Figs. 8, 9 and 10 of the present disclosure may be implemented not only by the shuffle circuits 100, 200, and 300 illustrated in Figs. 2, 3 and 5, but also by a list of executable instructions. The list of executable instructions can be embodied in any suitable computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

The terminology used in the present disclosure is for the purpose of describing embodiments in the present disclosure only and is not intended to be limiting of the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this disclosure, specify the presence of stated features but do not preclude the presence or addition of one or more other features. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various features, these features should not be limited by these terms. These terms are only used to distinguish one feature from another.

Unless otherwise defined, all terms (including technical and scientific terms) used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

In the description of the present disclosure, the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this disclosure, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and various embodiments or examples, and features of different embodiments or examples, described in this disclosure can be combined by one skilled in the art provided they do not contradict one another.

It should be understood that the various steps of the methods shown in the flowcharts or otherwise described herein are merely exemplary and are not meant to imply that the steps of the methods shown or described must be performed in accordance with the steps shown or described. Rather, various steps of the methods shown in the flowcharts or otherwise described herein may be performed in a different order than presented in the present disclosure or may be performed concurrently. Further, the methods shown in the flowcharts or otherwise described herein may include other additional steps as desired.

Although the present disclosure has been described in detail in connection with some exemplary embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims.

Claims

1. A shuffle circuit comprising a control circuit, an input selector, a shuffler, and an output selector, wherein:

the control circuit is configured to: dividing m threads into n thread groups according to the maximum number k of threads which can be processed in parallel by the shuffler, wherein each thread group comprises k threads, generating data correspondence information, and sending the data correspondence information to the input selector and the output selector, wherein the data correspondence information defines from which operation data of which thread group or groups the result data of each thread group is obtained respectively, and k, m and n are integers greater than or equal to 1;

the input selector is configured to: selecting one or more corresponding thread groups of operation data from the n thread groups according to the received data correspondence information, and sequentially sending the operation data to the shuffler according to a predetermined order of the thread groups;

the shuffler is configured to: receiving operation data of the one or more corresponding thread groups in sequence from the input selector, performing shuffle operation on the received k operation data of each corresponding thread group and outputting j shuffle output data, wherein j is an integer and is greater than or equal to 0 and less than or equal to k;

the output selector is configured to: shuffle output data obtained from operation data of the one or more corresponding thread groups is sequentially received from the shuffler, and result data for each thread group is generated based on the shuffle output data according to the received data correspondence information.

2. The shuffle circuit of claim 1, wherein the control circuit is further configured to: generating the data correspondence information based on a SIMD mode.

3. The shuffle circuit of claim 2, wherein a value of m is 128, a value of k is 32, a value of n is 4, and the SIMD mode is one of a SIMD32 mode, a SIMD64 mode, and a SIMD128 mode.

4. A shuffle circuit as claimed in claim 3, wherein when the SIMD mode is a SIMD32 mode, the data correspondence information comprises:

obtaining result data of the first thread group from operation data of the first thread group;

obtaining result data of the second thread group from the operation data of the second thread group;

obtaining result data of the third thread group from the operation data of the third thread group;

result data for the fourth thread group is obtained from the operation data for the fourth thread group.

5. The shuffle circuit of claim 3, wherein when the SIMD mode is a SIMD64 mode, the data correspondence information comprises:

obtaining result data of the first thread group from operation data of the first and second thread groups;

obtaining result data of a second thread group from operation data of the first and second thread groups;

obtaining result data of the third thread group from the operation data of the third and fourth thread groups;

result data for the fourth thread group is obtained from the operation data for the third and fourth thread groups.

6. A shuffle circuit as claimed in claim 3, wherein, when the SIMD mode is a SIMD128 mode, the data correspondence information comprises:

obtaining result data of the first thread group from the operation data of the first, second, third and fourth thread groups;

obtaining result data of the second thread group from the operation data of the first, second, third and fourth thread groups;

obtaining result data of a third thread group from the operation data of the first, second, third and fourth thread groups;

result data for the fourth thread group is obtained from the operation data for the first, second, third, and fourth thread groups.

7. The shuffle circuit of claim 1, wherein the control circuit comprises a result data index flag generator and an operation data index flag generator, and wherein:

the result data index flag generator is configured to: generating an n-bit result data index flag according to the validity of the m threads, wherein each bit of the result data index flag corresponds to a group of result data of one thread group in the n thread groups;

the operational data index flag generator is configured to: calculating an operation data index corresponding to each result data to generate an n-bit operation data index flag for a set of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to a set of operation data of one of the n thread groups;

the control circuit is configured to: generating the data correspondence information based on the result data index flag and the operation data index flag.

8. The shuffle circuit of claim 7, wherein the control circuit is further configured to: for each valid result data group, obtaining the result data from the group of operation data of the thread group corresponding to the bit with the value of 1 in the corresponding operation data index flag.

9. The shuffle circuit of claim 8, wherein the value of m is 128, the value of k is 32, and the value of n is 4.

10. A data shuffling method comprising:

dividing m threads into n thread groups according to the maximum number k of threads capable of being processed in parallel, wherein each thread group comprises k threads, and k, m and n are integers greater than or equal to 1;

generating data correspondence information which defines from the operation data of which thread group or groups the result data of each thread group is obtained;

selecting the operation data of one or more corresponding thread groups from the n thread groups according to the data correspondence information;

performing shuffle operation on the k operation data of each corresponding thread group and outputting j shuffle output data, wherein j is an integer and is more than or equal to 0 and less than or equal to k;

generating result data for each thread group according to the data correspondence information based on shuffle output data obtained from the operation data of the one or more corresponding thread groups.

11. The data shuffling method as claimed in claim 10, wherein said generating data correspondence information includes: generating the data correspondence information based on a SIMD mode.

12. The data shuffling method of claim 11, wherein m has a value of 128, k has a value of 32, n has a value of 4, and the SIMD mode is one of a SIMD32 mode, a SIMD64 mode, and a SIMD128 mode.

13. The data shuffling method of claim 12, wherein the generating the data correspondence information based on a SIMD mode comprises: when the SIMD mode is a SIMD32 mode, the data correspondence information includes:

14. The data shuffling method of claim 12, wherein the generating the data correspondence information based on a SIMD mode comprises: when the SIMD mode is a SIMD64 mode, the data correspondence information includes:

obtaining result data of the third thread group from the operation data of the third and the fourth thread groups;

15. The data shuffling method of claim 12, wherein the generating the data correspondence information based on a SIMD mode comprises: when the SIMD mode is a SIMD128 mode, the data correspondence information comprises:

16. The data shuffling method as claimed in claim 10, wherein said generating data correspondence information includes:

generating an n-bit result data index flag according to the validity of the m threads, wherein each bit of the result data index flag corresponds to a group of result data of one thread group in the n thread groups;

calculating an operation data index corresponding to each result data to generate an n-bit operation data index flag for a set of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to a set of operation data of one of the n thread groups;

and generating the data correspondence information based on the result data index flag and the operation data index flag.

17. The data shuffling method as claimed in claim 16, wherein said generating the data correspondence information based on the result data index flag and the operation data index flag comprises:

determining the group of result data of each thread group corresponding to a bit whose value is 1 in the result data index flag as a valid result data group;

for each valid result data group, result data is obtained from the operation data of the thread group corresponding to each bit whose value is 1 in the corresponding operation data index flag.

18. A SIMD architecture based chip comprising a shuffle circuit as claimed in any one of claims 1 to 9.

19. The chip of claim 18, wherein the chip is a GPU chip.

20. An integrated circuit device comprising at least one chip according to claim 18 or 19.