CN114546328A - Method, apparatus and medium for implementing data arrangement - Google Patents

Method, apparatus and medium for implementing data arrangement

Info

Publication number
CN114546328A
Authority
CN
China
Prior art keywords
data
register
mapping
threads
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210194323.3A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Biren Intelligent Technology Co Ltd
Original Assignee
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202210194323.3A priority Critical patent/CN114546328A/en
Publication of CN114546328A publication Critical patent/CN114546328A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/22 Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
    • G06F7/24 Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013 Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers

Abstract

Embodiments of the present disclosure relate to a method, apparatus, and medium for implementing data arrangement, the method including: acquiring first auxiliary data and second auxiliary data based on the thread number and the data arrangement requirement of the thread group; shifting the data in the first register, the first auxiliary data, the data in the second register and the second auxiliary data to a transfer register; extracting data in the transfer register based on the address of the transfer register so as to obtain first data, second data, third data and fourth data; determining mapping parameters based on the thread number and the data arrangement requirement of the thread group; and mapping the third data to the first register and the fourth data to the second register based on the mapping parameter, thereby achieving data arrangement in the registers. In this way, data arrangement among a plurality of registers can be achieved more quickly without calculating shared memory addresses.

Description

Method, apparatus and medium for implementing data arrangement
Technical Field
Embodiments of the present disclosure relate generally to the field of processors, and more particularly, to a method, computing device, and computer-readable storage medium for implementing data arrangement.
Background
In general-purpose computing on graphics processing units (GPGPU), Fast Fourier Transform (FFT) operations in which the input is real and the output is complex (R2C) are common. Such an operation requires permuting (e.g., parity rearranging) the original real input sequence so that it can be recombined into a new complex sequence. This involves arranging data held in different registers across the threads of a thread group (warp), and the permutation operation relies on data interaction between the threads.
The existing way of data interaction between threads is mainly through shared memory. The system calculates the corresponding shared memory address for each thread's data according to an index, writes the data from the registers into shared memory with a store instruction (store), and finally reads the data back into the registers with a load instruction (load). Because of scheduling uncertainty between threads, all threads must complete the write operation before any read operation can be performed, which requires an additional synchronization operation, and writes to the same location can only be processed serially. This increases the read latency of the instructions and wastes the data bandwidth of the instruction memory. In addition, the index calculation for shared memory is complicated and requires extra computation to convert the index into the corresponding address.
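As an illustration of the conventional flow criticized above, the following is a minimal CUDA sketch of the shared-memory approach applied to the parity rearrangement discussed later in this disclosure. The kernel name, buffer layout and index calculation are assumptions made for the example only and are not taken from this disclosure.

#include <cstdio>
#include <cuda_runtime.h>

// Conventional approach (sketch): each of the 32 threads of a warp holds two
// register values (elements i and i+32 of a 64-element sequence). The parity
// rearrangement is done by computing shared-memory indices, storing the
// register data, synchronizing, and loading the permuted data back.
__global__ void permute_via_shared(const int* in, int* out_odd, int* out_even)
{
    __shared__ int buf[64];
    int tid = threadIdx.x;                // 0..31, a single warp

    int a = in[tid];                      // "first register"  : values 1..32
    int b = in[tid + 32];                 // "second register" : values 33..64

    buf[tid]      = a;                    // store both values to shared memory
    buf[tid + 32] = b;
    __syncthreads();                      // the extra synchronization that the
                                          // disclosure identifies as overhead

    out_odd[tid]  = buf[2 * tid];         // read back: 1, 3, 5, ..., 63
    out_even[tid] = buf[2 * tid + 1];     //            2, 4, 6, ..., 64
}

int main()
{
    int h_in[64];
    for (int i = 0; i < 64; ++i) h_in[i] = i + 1;

    int *d_in, *d_odd, *d_even;
    cudaMalloc(&d_in, 64 * sizeof(int));
    cudaMalloc(&d_odd, 32 * sizeof(int));
    cudaMalloc(&d_even, 32 * sizeof(int));
    cudaMemcpy(d_in, h_in, 64 * sizeof(int), cudaMemcpyHostToDevice);

    permute_via_shared<<<1, 32>>>(d_in, d_odd, d_even);

    int h_odd[32], h_even[32];
    cudaMemcpy(h_odd, d_odd, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_even, d_even, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("odd starts  %d %d %d\n", h_odd[0], h_odd[1], h_odd[2]);
    printf("even starts %d %d %d\n", h_even[0], h_even[1], h_even[2]);

    cudaFree(d_in); cudaFree(d_odd); cudaFree(d_even);
    return 0;
}

The point of the sketch is the structure (index calculation, store, synchronization, load), which is exactly the sequence whose latency and extra computation the present disclosure seeks to avoid.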
In summary, the conventional scheme for realizing data arrangement has the following disadvantages: it adds the complicated index-to-address calculation for the shared memory, and the long latency of the shared memory read and write operations degrades program performance.
Disclosure of Invention
In view of the above problems, the present disclosure provides a method, a computing device, and a computer-readable storage medium for implementing data arrangement capable of implementing data arrangement among a plurality of registers more quickly without calculating a shared memory address.
According to a first aspect of the present disclosure, there is provided a method for implementing data arrangement in a register, comprising: acquiring first auxiliary data and second auxiliary data based on the thread number and the data arrangement requirement of the thread group; shifting the data in the first register, the first auxiliary data, the data in the second register and the second auxiliary data to a transfer register; extracting data in the transfer register based on the address of the transfer register so as to obtain first data, second data, third data and fourth data; determining mapping parameters based on the thread number and the data arrangement requirement of the thread group; and mapping the third data to the first register and the fourth data to the second register based on the mapping parameter, thereby achieving data arrangement in the registers.
According to a second aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the disclosure.
In a third aspect of the present disclosure, a non-transitory computer readable storage medium is provided having stored thereon computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.
In one embodiment, the method further comprises: writing first data, second data, third data, and fourth data to the first register, the second register, the third register, and the fourth register, respectively.
In one embodiment, mapping the third data to the first register and mapping the fourth data to the second register comprises: applying a mask to the third register and the fourth register; and mapping data in the third register to which the mask is applied to the first register every mapping parameter bit and mapping data in the fourth register to which the mask is applied to the second register every mapping parameter bit, the mapping parameter bits being determined based on a mapping parameter.
In one embodiment, mapping the third data to the first register and mapping the fourth data to the second register comprises: respectively mapping data of current mapping parameter bits of the third register and the fourth register to current mapping parameter bits of a fifth register and a sixth register; mapping data of the third register and the fourth register to next mapping parameter bits of a fifth register and a sixth register respectively, wherein data of last mapping parameter bits of the third register and the fourth register are not mapped; applying a mask to the fifth register and the sixth register; and mapping data in the fifth register to which the mask is applied to the first register and mapping data in the sixth register to which the mask is applied to the second register.
In one embodiment, determining the mapping parameters based on the number of threads of the thread group and the data arrangement requirement comprises: determining the mapping parameter to be one eighth of the number of threads of the thread group.
In one embodiment, performing the extraction on the data in the staging register comprises: constructing a sequence based on the number of threads of the thread group, wherein each element in the constructed sequence corresponds to data in the staging register; combining the data of the transfer register into a plurality of sets of data columns; and performing decimation on the combined set composed of a plurality of the data columns.
In one embodiment, performing the decimation on the combined set of a plurality of the data columns comprises: determining coordinate values of elements in the plurality of arrays, wherein the coordinate values comprise abscissas and ordinates corresponding to the elements; and performing extraction on the set of the combined plurality of the number sequences in an order in which an abscissa is even and an ordinate is even, an abscissa is even and an ordinate is odd, an abscissa is odd and an ordinate is even, an abscissa is odd and an ordinate is odd, based on the determined coordinate values.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements.
Fig. 1 shows a schematic diagram of a system 100 for implementing a method for implementing data arrangement according to an embodiment of the invention.
FIG. 2 shows a flow diagram of a method 200 for implementing data alignment in a register, according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of the transfer register after shifting according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of the data of the transfer register composed as a set of a plurality of arrays according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of the extracted first data, second data, third data, and fourth data according to an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of a mapping result according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a third register, a fourth register, and a fifth register, a sixth register mapping, according to an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of mapping data in the fifth and sixth registers to the first and second registers according to an embodiment of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, data interaction among threads is mainly achieved through shared memory. The system calculates the corresponding shared memory address for each thread's data according to an index, writes the data from the registers into shared memory with a store instruction (store), and finally reads the data back into the registers with a load instruction (load). Because of scheduling uncertainty between threads, all threads must complete the write operation before any read operation can be performed, which requires an additional synchronization operation, and writes to the same location can only be processed serially. This increases the read latency of the instructions and wastes the data bandwidth of the instruction memory. In addition, the index calculation for shared memory is complicated and requires extra computation to convert the index into the corresponding address. The extra index and address calculations for shared memory are therefore costly and slow, and the long latency of shared memory read and write operations degrades program performance.
To at least partially address one or more of the above-mentioned problems, as well as other potential problems, example embodiments of the present disclosure provide a scheme for implementing data arrangement, in which the arrangement of data stored in two registers, for example the consecutive values 1-64, may be achieved by using two additional registers in a thread group and a transfer register. Permutation may refer to reordering consecutive data, e.g., the consecutive data 1-64 stored in two registers of the 32 threads in one thread group (warp), into an odd column 1, 3, 5, …, 63 and an even column 2, 4, 6, …, 64. Finally, the two registers store the corresponding odd column and even column.
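As a concrete host-side reference for what this parity rearrangement produces (a sketch under the assumptions just stated; the function and array names are illustrative and not part of the disclosure):

#include <cassert>

// Reference of the target arrangement: 64 consecutive values held as two
// 32-element "registers" (r4 = 1..32, r5 = 33..64) are rearranged so that r4
// ends up with the odd column and r5 with the even column.
void parity_rearrange_reference(int r4[32], int r5[32])
{
    int seq[64];
    for (int i = 0; i < 32; ++i) { seq[i] = r4[i]; seq[i + 32] = r5[i]; }
    for (int i = 0; i < 32; ++i) {
        r4[i] = seq[2 * i];       // odd column  1, 3, 5, ..., 63
        r5[i] = seq[2 * i + 1];   // even column 2, 4, 6, ..., 64
    }
}

int main()
{
    int r4[32], r5[32];
    for (int i = 0; i < 32; ++i) { r4[i] = i + 1; r5[i] = i + 33; }
    parity_rearrange_reference(r4, r5);
    assert(r4[0] == 1 && r4[1] == 3 && r4[31] == 63);
    assert(r5[0] == 2 && r5[1] == 4 && r5[31] == 64);
    return 0;
}

The remainder of the description shows how the same result is obtained step by step with a transfer register, auxiliary data, decimation and masked mapping, instead of the temporary 64-element buffer used in this reference.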
Fig. 1 shows a schematic diagram of a system 100 for implementing a method for implementing data arrangement according to an embodiment of the invention. As shown in fig. 1, the system 100 includes a computing device 110, a data storage device 130, and a network 140. The computing device 110 and the data storage device 130 may exchange data via the network 140 (e.g., the internet).
The data storage device 130 may, for example, store a plurality of different types of data, e.g., consecutive numbers to be arranged. The data storage device 130 may also send the stored data to the computing device 110. The data storage device may be a one-stop storage and computing structure running on one or more computer nodes for providing high-concurrency, high-throughput query services, and may include special-purpose processing units such as GPUs, FPGAs and ASICs as well as general-purpose processing units such as CPUs.
The computing device 110 is used, for example, to obtain data from the data storage device 130 and to perform the permutation on the received data. The rearranged odd and even columns may be stored in two registers within the same warp. Computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, ASICs, and the like, as well as general purpose processing units such as a CPU. Additionally, one or more virtual machines may also be running on each computing device 110. In some embodiments, the computing device 110 and the data storage device 130 may be integrated or may be separate from each other. In some embodiments, computing device 110 includes, for example, an acquisition module 112, a shift module 114, a decimation module 116, a determination module 118, and a mapping module 120.
An obtaining module 112, where the obtaining module 112 is configured to obtain the first auxiliary data and the second auxiliary data based on the thread number and the data arrangement requirement of the thread group.
A shift module 114, the shift module 114 configured to shift the data in the first register, the first auxiliary data, the data in the second register, and the second auxiliary data to the transfer register.
An extraction module 116, where the extraction module 116 is configured to perform extraction on the data in the transfer register based on the address of the transfer register, so as to obtain the first data, the second data, the third data, and the fourth data.
A determination module 118, the determination module 118 configured to determine the mapping parameters based on the thread count and the data arrangement requirements of the thread groups.
A mapping module 120, the mapping module 120 configured to map the third data to the first register and the fourth data to the second register based on the mapping parameter, thereby enabling data arrangement in a register.
FIG. 2 shows a flow diagram of a method 200 for implementing data arrangement in a register, according to an embodiment of the disclosure. The method 200 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 900 shown in FIG. 9. It should be understood that method 200 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the present disclosure is not limited in this respect.
In the context of the present application, the registers may refer to the registers of a thread group (warp), e.g. a register group comprising a first register and a second register. The registers of the group may be located in the same thread or in different threads.
In one embodiment, for example, a thread group (warp) may include 8, 16, 32, 64 threads. For brevity, the present disclosure will be presented in terms of 32 threads. Note that thread groups with other thread numbers are also applicable to the technical solution described in this disclosure.
At step 202, the computing device 110 may obtain first assistance data and second assistance data based on the thread count and data alignment requirements of the thread group.
In one embodiment, the number of bits of the first auxiliary data and the second auxiliary data may be the same as the number of threads of the thread group. For example, if the number of threads in the thread group is 32, the first auxiliary data and the second auxiliary data may be 32-bit data. For simplicity, the present application uses 32 threads as an illustration. As described above, the first and second registers corresponding to the 32 threads hold the data 1, 2, 3, …, 32 and 33, 34, 35, …, 64, respectively. In the context of this application, a data arrangement requirement may refer to exchanging, inserting, rearranging, etc., the data corresponding to the threads. For brevity, the context of this application is described in terms of a parity rearrangement, 1 bit apart, of the data corresponding to the 32 threads in the two registers. For example, the data are permuted such that, after the permutation, the first register holds the odd column (1, 3, 5, 7, …, 63) and the second register holds the even column (2, 4, 6, 8, …, 64).
In one embodiment, the first auxiliary data and the second auxiliary data may be any data used to fill a register. For example, the first auxiliary data and the second auxiliary data may be 32 consecutive values, e.g. 1, 2, 3, …, 32, or 32 identical values, e.g. 32 zeros. For brevity, the present application uses 32 consecutive 0s as the first auxiliary data and the second auxiliary data.
At step 204, the computing device 110 may shift the data in the first register, the first auxiliary data, the data in the second register, and the second auxiliary data to the staging register.
In one embodiment, the computing device 110 may shift the data 1, 2, 3, …, 32 in the first register R4, the first auxiliary data (32 zeros), the data 33, 34, 35, …, 64 in the second register R5, and the second auxiliary data (32 zeros) to the transfer register X0. Fig. 3 shows a schematic diagram of the transfer register after shifting according to an embodiment of the present disclosure. The shifted transfer register X0 is decimated in the subsequent step, thereby realizing the data arrangement. The transfer register X0 is a special register whose size is 4 times that of the first register, so it can store the data of four such registers.
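A minimal host-side sketch of this shifting step, modelling the 32-wide registers and the 4x-wide transfer register X0 as plain arrays (the array modelling and the function name are assumptions made for illustration):

#include <cstring>

// Model of step 204: the transfer register X0 is four times as wide as a
// 32-thread register and receives, in order, the data of the first register,
// the first auxiliary data (32 zeros), the data of the second register, and
// the second auxiliary data (32 zeros).
void shift_to_transfer(const int r4[32], const int r5[32], int x0[128])
{
    int aux[32] = {0};                            // first/second auxiliary data
    std::memcpy(x0 +  0, r4,  32 * sizeof(int));  // data of the first register
    std::memcpy(x0 + 32, aux, 32 * sizeof(int));  // first auxiliary data
    std::memcpy(x0 + 64, r5,  32 * sizeof(int));  // data of the second register
    std::memcpy(x0 + 96, aux, 32 * sizeof(int));  // second auxiliary data
}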
At step 206, the computing device 110 may perform decimation on the data in the transfer register based on the address of the transfer register X0 to obtain the first data, the second data, the third data, and the fourth data.
In one embodiment, performing the decimation on the data in the transfer register X0 may include: constructing a number column based on the number of threads (e.g., 32) of the thread group, where each element in the constructed number column corresponds to data in the transfer register.
Specifically, the computing device 110 may construct number columns based on the 32-bit data of the first register and the second register. Each number column is constructed as a 4 × 8 array, thus dividing a 32-bit register into 4 groups of 8. For example, the data shifted to the first row of the transfer register X0 may be divided into 1, 2, 3, …, 8 and 9, 10, 11, …, 16 and 17, 18, 19, …, 24 and 25, 26, 27, …, 32. The data shifted to the second row of the transfer register may be divided into four groups of eight 0s. The data shifted to the third row of the transfer register may be divided into 33, 34, 35, …, 40 and 41, 42, 43, …, 48 and 49, 50, 51, …, 56 and 57, 58, 59, …, 64. The data shifted to the fourth row of the transfer register may likewise be divided into four groups of eight 0s.
Subsequently, in one embodiment, the computing device 110 may combine the data of the transfer register into a set of a plurality of the number columns, for example a set of 4 × 8 number columns. Fig. 4 shows a schematic diagram of the data of the transfer register composed as a set of a plurality of arrays according to an embodiment of the present disclosure. As shown in fig. 4, the data of the transfer register X0 can be represented by a set of 4 × 8 arrays. For example, for the data of the first row in the transfer register X0, 4 data columns of 8 elements each, i.e., a 4 × 8 data column of 1, 2, 3, …, 8 and 9, 10, 11, …, 16 and 17, 18, 19, …, 24 and 25, 26, 27, …, 32, can be constructed. For the data of the second row in the transfer register X0, a 4 × 8 data column of 0s can be constructed, and so on. Based on the constructed data columns and the addresses of the transfer register, the data columns may be combined into a set of, for example, four 4 × 8 data columns, where the 4 × 8 data column composed of the data of the first row is located at the upper left of the set, the 4 × 8 data column composed of the data of the second row is located at the upper right, the 4 × 8 data column composed of the data of the third row is located at the lower left, and the 4 × 8 data column composed of the data of the fourth row is located at the lower right.
The upper-left part of the set composed of the plurality of number columns is the 4 × 8 column that originally belonged to the first register (1, 2, 3, …, 8; 9, 10, 11, …, 16; 17, 18, 19, …, 24; 25, 26, 27, …, 32), the upper-right part is the 4 × 8 column that originally belonged to the first auxiliary data (four groups of eight 0s), the lower-left part is the 4 × 8 column that originally belonged to the second register (33, 34, 35, …, 40; 41, 42, 43, …, 48; 49, 50, 51, …, 56; 57, 58, 59, …, 64), and the lower-right part is the 4 × 8 column that originally belonged to the second auxiliary data (four groups of eight 0s).
The set may be represented in a coordinate system, where the coordinate values comprise the abscissa and the ordinate corresponding to each data element. Take the top-left point as the origin, with the y-axis pointing to the right and the x-axis pointing downwards. Starting from the origin, in the first 2 × 2 matrix, the coordinate value corresponding to the first data 1 is (0, 0), the coordinate value corresponding to the data 2 is (0, 1), the coordinate value corresponding to the data 9 is (1, 0), and the coordinate value corresponding to the data 10 is (1, 1); the remaining data follow by analogy.
Based on the determined coordinate values, computing device 110 may perform decimation on the combined set of multiple columns of the data. Specifically, the decimation is performed on the set of the plurality of column combinations in the order of the even abscissa and the even ordinate, the even abscissa and the odd ordinate, the odd abscissa and the even ordinate, the odd abscissa, and the odd ordinate. The extracted data may be four types of data, which are first data having an even abscissa and an even ordinate, second data having an even abscissa and an odd ordinate, third data having an odd abscissa and an even ordinate, and fourth data having an odd abscissa and an odd ordinate, respectively.
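The decimation step can be modelled on the host as follows. The 8 × 16 layout of the set and the coordinate convention follow the description above; the row-major reading order within each parity class, the array modelling and the names are assumptions made for illustration:

// Model of step 206: the transfer register X0 (128 elements, laid out as in
// the previous sketch) is viewed as a set of four 4x8 number columns arranged
// 2x2, i.e. an 8x16 grid, and elements are decimated by the parity of their
// (x, y) coordinates (x downwards, y to the right, origin at the top left).
void extract_from_transfer(const int x0[128],
                           int first[32], int second[32],
                           int third[32], int fourth[32])
{
    // Rebuild the set: upper-left = first register, upper-right = first
    // auxiliary data, lower-left = second register, lower-right = second
    // auxiliary data (each a 4x8 array read row by row).
    int set[8][16];
    for (int x = 0; x < 8; ++x)
        for (int y = 0; y < 16; ++y) {
            int block_row = x / 4, block_col = y / 8;   // which 4x8 block
            int src = (block_row * 2 + block_col) * 32  // 0, 32, 64 or 96
                      + (x % 4) * 8 + (y % 8);
            set[x][y] = x0[src];
        }

    // Decimate by coordinate parity, reading each class in row-major order.
    int i0 = 0, i1 = 0, i2 = 0, i3 = 0;
    for (int x = 0; x < 8; ++x)
        for (int y = 0; y < 16; ++y) {
            if      (x % 2 == 0 && y % 2 == 0) first [i0++] = set[x][y];
            else if (x % 2 == 0)               second[i1++] = set[x][y];
            else if (y % 2 == 0)               third [i2++] = set[x][y];
            else                               fourth[i3++] = set[x][y];
        }
    // With the inputs of the running example this yields, e.g.,
    // third = 9,11,13,15, 0,0,0,0, 25,27,29,31, 0,0,0,0, 41,... as in Fig. 5.
}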
In one embodiment, the computing device 110 may also write the extracted first data, second data, third data, and fourth data to the first register R4, the second register R5, the third register R6, and the fourth register R7, respectively. The third register R6 and the fourth register R7 will be used for mapping data in subsequent steps.
Fig. 5 shows a schematic diagram of the extracted first data, second data, third data, and fourth data according to an embodiment of the present disclosure. As shown in fig. 5, the extracted first data, second data, third data, and fourth data are written to the first register R4, the second register R5, the third register R6, and the fourth register R7, respectively.
At step 208, the computing device 110 may determine mapping parameters based on the thread count and data arrangement requirements of the thread groups.
In one embodiment, the mapping parameter is determined to be one eighth of the number of threads of a thread group. In the case where the number of threads of the thread group is 32, the mapping parameter k may be 4.
At step 210, computing device 110 may map the third data to the first register and the fourth data to the second register based on the mapping parameter, thereby enabling data arrangement in the registers.
In one embodiment, the computing device 110 may apply a mask to the third register R6 and the fourth register R7. For example, the mask may be 0xf0f0f0f0, i.e., 11110000111100001111000011110000. With such a mask applied, the data are mapped once every 4 threads, i.e., threads at positions where the applied mask bit is 1 are mapped to, and threads at positions where the applied mask bit is 0 are not mapped to.
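A small sketch of how such a mask can be derived from the mapping parameter, assuming that the least significant bit of the mask corresponds to thread T0 (an assumption that is consistent with threads T4-T7 being the first mapped threads in the example below):

// For a 32-thread group the mapping parameter k is 32/8 = 4, and a mask with
// alternating groups of k zeros and k ones selects the threads that receive
// mapped data (0xf0f0f0f0 when the number of threads is 32).
unsigned make_mapping_mask(int num_threads)
{
    int k = num_threads / 8;                 // mapping parameter
    unsigned mask = 0;
    for (int bit = 0; bit < num_threads; ++bit)
        if ((bit / k) % 2 == 1)              // every other group of k threads
            mask |= 1u << bit;
    return mask;                             // 0xf0f0f0f0 for num_threads == 32
}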
In one embodiment, the computing device 110 maps data in the third register R6 with the mask applied to the first register R4 every other mapping parameter bit (i.e., 4 bits) and maps data in the fourth register R7 with the mask applied to the second register R5 every other mapping parameter bit (i.e., 4 bits), where the mapping parameter bits are determined based on the mapping parameters. For example, if the mapping parameter is 4, the mapping parameter bit is 4 bits.
For example, based on the applied mask (0xf0f0f0f0), the computing device 110 may map the data (9, 11, 13, 15) in the third register R6 corresponding to threads T0-T3 to threads T4-T7 of the first register R4 and override the 0s of threads T4-T7. By analogy, the computing device 110 may map the data (25, 27, 29, 31) in the third register R6 corresponding to threads T8-T11 to threads T12-T15 of the first register R4 and override the 0s of threads T12-T15, map the data (41, 43, 45, 47) in the third register R6 corresponding to threads T16-T19 to threads T20-T23 of the first register R4 and override the 0s of threads T20-T23, and map the data (57, 59, 61, 63) in the third register R6 corresponding to threads T24-T27 to threads T28-T31 of the first register R4 and override the 0s of threads T28-T31.
Similarly, the computing device 110 may map the data (10, 12, 14, 16) in the fourth register R7 corresponding to threads T0-T3 to threads T4-T7 of the second register R5 and override the 0s of threads T4-T7, map the data (26, 28, 30, 32) in the fourth register R7 corresponding to threads T8-T11 to threads T12-T15 of the second register R5 and override the 0s of threads T12-T15, map the data (42, 44, 46, 48) in the fourth register R7 corresponding to threads T16-T19 to threads T20-T23 of the second register R5 and override the 0s of threads T20-T23, and map the data (58, 60, 62, 64) in the fourth register R7 corresponding to threads T24-T27 to threads T28-T31 of the second register R5 and override the 0s of threads T28-T31.
Fig. 6 shows a schematic diagram of a mapping result according to an embodiment of the present disclosure. As shown in fig. 6, the first register R4 stores data 1, 3, 5, 7, 9 … and the second register R5 stores data 2, 4, 6, 8, 10 …, thereby realizing rearrangement of data between the two registers, i.e., parity rearrangement.
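A host-side sketch of this masked mapping step (the loop-based modelling and the names are assumptions for illustration; in the disclosure this corresponds to a masked register mapping rather than a per-thread loop):

// Model of step 210 (first variant): the data in R6/R7 at the threads selected
// by the mask are taken from the position one mapping-parameter group (k = 4
// threads) earlier and written into R4/R5, overwriting the zeros that came
// from the auxiliary data.
void masked_map(unsigned mask, const int r6[32], const int r7[32],
                int r4[32], int r5[32], int k /* mapping parameter, here 4 */)
{
    for (int t = 0; t < 32; ++t)
        if (mask & (1u << t)) {      // destination thread selected by the mask
            r4[t] = r6[t - k];       // e.g. threads T4-T7 receive 9, 11, 13, 15
            r5[t] = r7[t - k];       // e.g. threads T4-T7 receive 10, 12, 14, 16
        }
    // For the mask 0xf0f0f0f0 every selected t is at least k, so t - k >= 0.
    // Afterwards R4 holds 1, 3, 5, 7, 9, ... and R5 holds 2, 4, 6, 8, 10, ...
}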
By employing the above approach, the computing device 110 can quickly implement the arrangement of data with additional registers. Again this does not require the computation of the address of the shared memory and can take advantage of the data computed in the previous method, thereby reducing the power consumption of the arrangement.
In another embodiment, the computing device 110 may instead map the data of the current mapping parameter bits of the third register R6 and the fourth register R7 directly to the current mapping parameter bits of a fifth register and a sixth register. For example, the computing device 110 may map the first mapping parameter bits, i.e., the first 4 bits, of the third register R6 and the fourth register R7 directly to the first mapping parameter bits, i.e., the first 4 bits, of the fifth register R8 and the sixth register R9. For example, the data 9, 11, 13, 15 of the third register R6 corresponding to the threads T0-T3 are mapped directly to the fifth register R8, and the data 10, 12, 14, 16 of the fourth register R7 corresponding to the threads T0-T3 are mapped directly to the sixth register R9.
Subsequently, the computing device 110 may map the data of the third register R6 and the fourth register R7 to the next mapping parameter bits of the fifth register R8 and the sixth register R9, i.e., shifted by 4 bits. Since the first 4 bits of the fifth register R8 and the sixth register R9 have already received the first 4 bits of the third register R6 and the fourth register R7, the data of the last mapping parameter bits of the third register R6 and the fourth register R7 are not mapped, that is, the data of the third register and the fourth register corresponding to the last 4 threads (threads T28-T31) are not mapped to the fifth and sixth registers. Fig. 7 shows a schematic diagram of the mapping of the third and fourth registers R6, R7 to the fifth and sixth registers R8, R9 according to an embodiment of the present disclosure. As shown in FIG. 7, for example, the computing device 110 maps the data of the third register R6 corresponding to the threads T0-T27 to the fifth register R8. Meanwhile, the computing device 110 maps the data of the fourth register R7 corresponding to the threads T0-T27 to the sixth register R9.
Subsequently, the computing device 110 may apply a mask to the fifth register R8 and the sixth register R9. As described above, the mask may be 0xf0f0f0f0. With such a mask applied, the data are mapped once every 4 threads, i.e., threads at positions where the applied mask bit is 1 are mapped to, and threads at positions where the applied mask bit is 0 are not mapped to.
The computing apparatus 110 maps the data in the fifth register R8 to which the mask is applied to the first register R4 every other mapping parameter bit (i.e., 4 bits) and maps the data in the sixth register R9 to which the mask is applied to the second register R5 every other mapping parameter bit (i.e., 4 bits).
Fig. 8 shows a schematic diagram of mapping data in the fifth and sixth registers R8 and R9 to the first and second registers R4 and R5 according to an embodiment of the present disclosure. As shown in FIG. 8, data of the fifth register R8 and the sixth register R9 corresponding to the threads T4-T7, T12-T15, T20-T23 and T28-T31 are mapped to the first register R4 and the second register R5 respectively. Fig. 6 shows a mapping result according to an embodiment of the present disclosure. As shown in fig. 6, the first register R4 stores data 1, 3, 5, 7, 9 … and the second register R5 stores data 2, 4, 6, 8, 10 …, thereby realizing rearrangement of data between the two registers, for example, parity rearrangement. In this way, the data rearrangement is also achieved when two more registers are used.
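A corresponding host-side sketch of this second variant using the fifth and sixth registers (again, the loop-based modelling and the names are assumptions made for illustration):

// Model of step 210 (second variant): R6/R7 are copied into R8/R9 with the
// current mapping-parameter bits kept in place and the remaining data shifted
// by one group of k threads (the last k entries of R6/R7 are not mapped);
// the same masked mapping then writes R8/R9 into R4/R5.
void masked_map_via_extra_registers(unsigned mask,
                                    const int r6[32], const int r7[32],
                                    int r4[32], int r5[32],
                                    int k /* mapping parameter, here 4 */)
{
    int r8[32] = {0}, r9[32] = {0};          // fifth and sixth registers
    for (int t = 0; t < k; ++t) {            // current mapping parameter bits
        r8[t] = r6[t];
        r9[t] = r7[t];
    }
    for (int t = 0; t + k < 32; ++t) {       // shift to the next k bits; the
        r8[t + k] = r6[t];                   // last k entries of R6/R7 are
        r9[t + k] = r7[t];                   // not mapped
    }
    for (int t = 0; t < 32; ++t)
        if (mask & (1u << t)) {              // masked threads take R8/R9 data
            r4[t] = r8[t];
            r5[t] = r9[t];
        }
    // The result is the same parity rearrangement: R4 = 1, 3, 5, ... and
    // R5 = 2, 4, 6, ...
}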
By using the technical means, the data arrangement between the two registers can be realized quickly, and only two additional registers and one transfer register are needed. At the same time, the address of the shared memory does not need to be calculated, thereby reducing the power consumption of the arrangement.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. For example, computing device 110 as shown in fig. 1 may be implemented by electronic device 900. As shown, electronic device 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM)902 or loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the random access memory 903, various programs and data required for the operation of the electronic device 900 can also be stored. The central processing unit 901, the read only memory 902 and the random access memory 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
A number of components in the electronic device 900 are connected to the input/output interface 905, including: an input unit 906 such as a keyboard, a mouse, a microphone, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The various processes and processes described above, such as the method 200, may be performed by the central processing unit 901. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, some or all of the computer program may be loaded and/or installed onto device 900 via read only memory 902 and/or communications unit 909. When the computer program is loaded into the random access memory 903 and executed by the central processing unit 901, one or more of the actions of the method 200 described above may be performed.
The present disclosure relates to methods, apparatuses, systems, electronic devices, computer-readable storage media and/or computer program products. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge computing devices. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute the computer-readable program instructions and thereby implement aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. A method for implementing data alignment in a register, comprising:
acquiring first auxiliary data and second auxiliary data based on the thread number and the data arrangement requirement of the thread group;
shifting the data in the first register, the first auxiliary data, the data in the second register and the second auxiliary data to a transfer register;
extracting data in the transfer register based on the address of the transfer register so as to obtain first data, second data, third data and fourth data;
determining mapping parameters based on the thread number and the data arrangement requirement of the thread group; and
mapping the third data to the first register and mapping the fourth data to the second register based on the mapping parameter, thereby achieving data arrangement in a register.
2. The method of claim 1, further comprising:
writing first data, second data, third data, and fourth data to the first register, the second register, the third register, and the fourth register, respectively.
3. The method of claim 2, mapping the third data to the first register and mapping the fourth data to the second register comprises:
applying a mask to the third register and the fourth register; and
mapping data in the third register to which the mask is applied to the first register every mapping parameter bit and mapping data in the fourth register to which the mask is applied to the second register every mapping parameter bit, the mapping parameter bits being determined based on a mapping parameter.
4. The method of claim 2, mapping the third data to the first register and mapping the fourth data to the second register comprises:
respectively mapping data of current mapping parameter bits of the third register and the fourth register to current mapping parameter bits of a fifth register and a sixth register;
mapping data of the third register and the fourth register to next mapping parameter bits of a fifth register and a sixth register respectively, wherein data of last mapping parameter bits of the third register and the fourth register are not mapped;
applying a mask to the fifth register and the sixth register; and
data in the fifth register to which the mask is applied is mapped to the first register and data in the sixth register to which the mask is applied is mapped to the second register.
5. The method of claim 1, wherein determining mapping parameters based on the number of threads of a thread group and data arrangement requirements comprises:
the mapping parameter is determined to be one eighth of the number of threads of the thread group.
6. The method of claim 1, wherein performing decimation on data in the staging register comprises:
constructing a sequence based on the number of threads of the thread group, wherein each element in the constructed sequence corresponds to data in the staging register;
combining the data of the transfer register into a plurality of sets of data columns; and
performing decimation on the combined set of columns of data.
7. The method of claim 2, wherein performing decimation on the combined set of a plurality of the columns of data comprises:
determining coordinate values of elements in the plurality of arrays, wherein the coordinate values comprise abscissas and ordinates corresponding to the elements; and
based on the determined coordinate values, performing decimation on the set of the combined plurality of the number columns in an order in which an abscissa is even and an ordinate is even, an abscissa is even and an ordinate is odd, an abscissa is odd and an ordinate is even, an abscissa is odd and an ordinate is odd.
8. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
9. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202210194323.3A 2022-03-01 2022-03-01 Method, apparatus and medium for implementing data arrangement Pending CN114546328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210194323.3A CN114546328A (en) 2022-03-01 2022-03-01 Method, apparatus and medium for implementing data arrangement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210194323.3A CN114546328A (en) 2022-03-01 2022-03-01 Method, apparatus and medium for implementing data arrangement

Publications (1)

Publication Number Publication Date
CN114546328A true CN114546328A (en) 2022-05-27

Family

ID=81661354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210194323.3A Pending CN114546328A (en) 2022-03-01 2022-03-01 Method, apparatus and medium for implementing data arrangement

Country Status (1)

Country Link
CN (1) CN114546328A (en)

Similar Documents

Publication Publication Date Title
US11639964B2 (en) Method, apparatus and storage medium for testing chip, and chip thereof
US11126353B2 (en) Method and apparatus for data copy
US11442779B2 (en) Method, device and computer program product for determining resource amount for dedicated processing resources
CN111090628A (en) Data processing method and device, storage medium and electronic equipment
AU2016367801B2 (en) Method and apparatus for generating random character string
US8693615B2 (en) RAM-based event counters using transposition
CN116431099B (en) Data processing method, multi-input-output queue circuit and storage medium
US20200319922A1 (en) Method, electronic device and computer program product for processing task
CN102968491A (en) Data distributing method and device
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
KR20210086936A (en) Data output method, data acquisition method, device, and electronic equipment
CN114546328A (en) Method, apparatus and medium for implementing data arrangement
CN113448770A (en) Method, electronic device and computer program product for recovering data
US20200233580A1 (en) Method, apparatus, and computer program product for managing storage system
CN114546329B (en) Method, apparatus and medium for implementing data parity rearrangement
US11557047B2 (en) Method and apparatus for image processing and computer storage medium
CN114637697A (en) Data stream processing device, processing method, chip and electronic equipment
JP4391464B2 (en) Device for storing binary tree structure information and device for storing heap structure information
WO2018139344A1 (en) Information processing system, information processing device, peripheral device, data tansfer method, and non-transitory storage medium storing data transfer program
US11755522B1 (en) Method, electronic device, and computer program product for implementing blockchain system on switch
CN116360858B (en) Data processing method, graphic processor, electronic device and storage medium
US20160179859A1 (en) Wall encoding and decoding
CN112532361B (en) Physical resource block allocation method, device, equipment and storage medium
CN113688089B (en) Data processing method, computing system and computer storage medium
CN115586921A (en) Method, circuit, chip and electronic device for realizing multi-bit register atomic operation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Applicant after: Shanghai Bi Ren Technology Co.,Ltd.

Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China