WO2022266842A1

WO2022266842A1 - Multi-thread data processing method and apparatus

Info

Publication number: WO2022266842A1
Application number: PCT/CN2021/101533
Authority: WO
Inventors: 陈水挺; 杨伟光; 吴任初
Original assignee: 华为技术有限公司
Priority date: 2021-06-22
Filing date: 2021-06-22
Publication date: 2022-12-29
Also published as: CN117561501A

Abstract

Disclosed in the present application are a multi-thread data processing method and apparatus, which are used for solving the problem of cross-thread computing being complicated and having large overheads. The method comprises: acquiring a first operation instruction, wherein the first operation instruction comprises the following parameters: a first operation code, which is used for indicating a data transfer mode between N threads, N being an integer greater than or equal to 2, a first source operand, which is used for indicating first source data of the N threads, and a second source operand, which is used for determining a thread offset corresponding to the data transfer mode; and transferring the first source data of the N threads according to the first operation instruction, so as to obtain the transferred first data on each of the N threads.

Description

A multi-thread data processing method and device

technical field

The present application relates to the technical field of parallel computing, and in particular to a multi-thread data processing method and device.

Background technique

With the increase of application program's demand for data processing capabilities, parallel computing processors, such as single instruction multiple data (single instruction multiple data, SIMD) processors, are introduced into computer systems. More and more parallel computing programs require cross-thread computing, involving data exchange between threads.

The traditional parallel processor solutions are divided into software solutions and hardware solutions. Among them, the software solution is to use the shared on-chip storage, store the data in the shared on-chip storage, then modify the thread address and then grab the data back to the core register to realize the exchange of data between threads. The software solution involves frequent memory access operations, resulting in inefficient execution and higher power consumption. The hardware solution is generally through a complex cross bar. For example, the data of each output thread of the cross network can come from any input thread, so as to achieve the ability of thread data exchange. However, hardware solutions require higher hardware costs.

Contents of the invention

The present application provides a multi-thread data processing method and device, which can improve execution performance and realize cross-thread operations involved in parallel computing at a lower hardware cost.

In a first aspect, an embodiment of the present application provides a multi-threaded data processing method, the method including: acquiring a first operation instruction. The first operation instruction includes the following parameters: a first operation code, the first operation code is used to indicate the data transfer mode between N threads, and N is an integer greater than or equal to 2; the first source operand, the The first source operand is used to indicate the first source data of the N threads; the second source operand is used to determine the thread offset corresponding to the data movement mode. Moving the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.

In the embodiment of the present application, the efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than the cross bar of the cross network, and does not require frequent memory access, and can realize high-performance parallelism with lower hardware or signaling overhead Accelerated processing of applications operating across threads in a computing processor.

In an optional implementation manner, the data moving method is the first moving method, and the moving the first source data of the N threads according to the first operation instruction includes: The first source data of the thread ₁ is moved to the thread numbered i; wherein, the numbering of N threads is 0～(N-1), i takes 0～(N-1), and I ₁ is (i+ SRC1) the value of the remainder of N; wherein, SRC1 represents the second source operand, and SRC1 is a positive integer. Through such a design, efficient cross-thread operations, that is, circular transfer operations between multi-threaded data, can be realized at low hardware cost, which can effectively accelerate the reduction algorithm of parallel computing.

In an optional implementation manner, the data moving method is the second moving method, and the moving the first source data of the N threads according to the first operation instruction includes: The first source data of the thread of ₂ is moved to the thread numbered i; wherein, the numbering of N threads is 0～(N-1), and i takes 0～(N-1); I ₂ is i and SRC1 XOR value of , SRC1 represents the second source operand, and SRC1 is a positive integer. Through such a design, efficient cross-thread operations, that is, cross-movement operations between multi-threaded data, can be realized at a low hardware cost, which can effectively accelerate differential calculations in graphics processing.

In an optional implementation manner, the data moving method is a third offset method, and the moving the first source data of the N threads according to the first operation instruction includes: The first source data of the thread of I ₃ is moved to the thread numbered as i; wherein, the numbering of N threads is 0～(N-1), and i takes 0～(N-1); the value of I ₃ for

SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that can divide N. Through such a design, efficient cross-thread operations, that is, one-to-many transfer operations between multi-threaded data, can be realized at a low hardware cost, which can effectively accelerate differential calculations in graphics processing.

In an optional implementation manner, the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type; the method further includes:

For a first thread among the N threads, based on the first source data of the first thread and the moved first data on the first thread, an operation operation corresponding to the operation type is performed.

In an optional implementation manner, each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation. Through such a design, the data that does not need to be calculated can be removed and the calculation cost can be reduced.

In an optional implementation manner, the first data moved on the first thread comes from the second thread among the N threads; the thread flag associated with the first thread indicates that the first The first source data of the thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.

In an optional implementation manner, the first operation instruction further includes a destination operand, where the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.

In an optional implementation manner, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads, and the method further includes: obtaining The second operation instruction, the second operation instruction includes the following parameters: the first operation code; the second source operand; the third source operand, the third source operand indicates the N threads The second source data, the second source data of the N threads come from the remaining N continuous threads in the parallel computing processor; perform the second source data of the N threads according to the second operation instruction moving to obtain the moved second data on each of the N threads.

In an optional implementation manner, the method further includes: exchanging the moved first data on the third thread with the second data moved on the third thread; wherein, the three threads are The thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1), and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0, and r is less than (N -SRC1%N). Through such a design, the efficient cross-thread operation of a higher parallel computing processor such as SMID width can be realized at a lower hardware cost, which can effectively accelerate the reduction algorithm of parallel computing.

In a second aspect, an embodiment of the present application provides a multi-threaded data processing device, which includes: an instruction acquisition module, configured to acquire a first operation instruction, and the first operation instruction includes the following parameters: the first operation code, the The first operation code is used to indicate the data movement mode between N threads, and N is an integer greater than or equal to 2; the first source operand is used to indicate the first source operand of the N threads. Source data; a second source operand, the second source operand is used to determine the thread offset corresponding to the data movement mode; a processing module is used to process the N threads according to the first operation instruction The first source data is moved to obtain the moved first data on each of the N threads.

In an optional implementation manner, the data transfer method is a first transfer method, and the processing module is specifically configured to: transfer the first source data of the thread numbered I ₊₁ to the thread numbered i ; Wherein, the numbering of N threads is 0～(N-1), and i is taken over 0～(N-1), and I ₁ is (i+SRC1) the value that N gets remainder; Wherein, SRC1 represents described first Two source operands, SRC1 is a positive integer.

In an optional implementation manner, the data transfer method is the second transfer method, and the processing module is specifically used to: transfer the first source data of the thread numbered ₁₂ to the thread numbered i ; Wherein, the numbering of N threads is 0～(N-1), and i takes 0～(N-1); I ₂ is the XOR value of i and SRC1, and SRC1 represents the second source operand, and SRC1 is a positive integer.

In an optional implementation manner, the data moving method is a third offset method, and the processing module is specifically used to: move the first source data of the thread numbered _I3 to the thread numbered i Among them, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); the value of I ₃ is

SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that can divide N.

In an optional implementation manner, the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type; the processing module is further configured to: for the N threads The first thread in the first thread executes an operation corresponding to the operation type based on the first source data of the first thread and the first data moved on the first thread.

In an optional implementation manner, each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.

In an optional implementation manner, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads; the instruction acquisition module further For acquiring a second operation instruction, the second operation instruction includes the following parameters: the first operation code; the second source operand; a third source operand, the third source operand indicating the N The second source data of the threads, the second source data of the N threads are from the remaining N continuous threads in the parallel computing processor; The second source data of the N threads are moved to obtain the moved second data on each of the N threads.

In an optional implementation manner, the processing module is further configured to: exchange the moved first data on the third thread with the second data moved on the third thread; wherein, the The three threads are threads numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1), and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0, and r Less than (N-SRC1%N).

In a third aspect, the present application provides a communication device, including a processor, the processor is coupled to a memory, the memory is used to store computer programs or instructions, and the processor is used to execute the computer programs or instructions to perform Various implementation methods of any one of the above-mentioned first aspect to the fourth aspect. The memory may be located within the device or external to the device. The number of the processors is one or more.

In a fourth aspect, the present application also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the computer-readable storage medium is run on a computer, the computer executes the above-mentioned first aspect and each executable method of the first aspect. The method described in Selected Implementations.

In a fifth aspect, the present application further provides a computer program product including instructions, which, when run on a computer, cause the computer to execute the method described in the above first aspect and each possible implementation manner of the first aspect.

In the sixth aspect, the present application also provides a computer chip, the chip is connected to the memory, and the chip is used to read and execute the software program stored in the memory, and execute the above-mentioned first aspect and each possible function of the first aspect. The method described in Selected Implementations.

In addition, for the beneficial effects of the second aspect to the sixth aspect, reference may be made to the beneficial effects shown in the first aspect and each optional implementation manner of the first aspect.

Description of drawings

FIG. 1 is a schematic diagram of a circuit structure coupled by a crossover network;

Fig. 2 is a schematic diagram of internal element shifting of a thread;

FIG. 3 is a schematic diagram of a SIMD parallel computing processor system architecture provided by an embodiment of the present application;

Fig. 4 is a schematic diagram of a cycle transfer provided by the embodiment of the present application;

FIG. 5 is a schematic structural diagram of a CROSSDOWN cross-thread processing unit provided by an embodiment of the present application;

FIG. 6 is one of the schematic diagrams of thread flag bits provided by the embodiment of the present application;

Fig. 7 is a schematic diagram of cross transfer provided by the embodiment of the present application;

FIG. 8 is a schematic structural diagram of a CROSS QUAD BUTTERFLY cross-thread processing unit provided by the embodiment of the present application;

FIG. 9 is one of the schematic diagrams of thread flag bits provided by the embodiment of the present application;

FIG. 10 is a schematic diagram of one-to-many transfer provided by the embodiment of the present application;

FIG. 11 is a schematic structural diagram of a CROSS QUAD-BROADCAST cross-thread processing unit provided by the embodiment of the present application;

FIG. 12 is one of the schematic diagrams of thread flag bits provided by the embodiment of the present application;

FIG. 13 is one of the schematic diagrams of the cross-thread data processing flow provided by the embodiment of the present application;

FIG. 14 is a schematic diagram of a source data source provided by the embodiment of the present application;

Fig. 15 is another schematic diagram of circular transfer provided by the embodiment of the present application;

Fig. 16 is a schematic diagram of data replacement provided by the embodiment of the present application;

FIG. 17 is one of the schematic diagrams of the cross-thread data processing flow provided by the embodiment of the present application;

FIG. 18 is one of the schematic diagrams of the cross-thread data processing flow provided by the embodiment of the present application;

FIG. 19 is one of the schematic flowcharts of the multi-threaded data processing method provided by the embodiment of the present application;

FIG. 20 is a schematic structural diagram of a multi-threaded data processing device provided by an embodiment of the present application;

FIG. 21 is a schematic structural diagram of a communication device provided by an embodiment of the present application.

detailed description

The following firstly introduces related technologies for processing thread data in parallel.

Related technology one:

The thread data is read from the core registers by software, and the thread data is stored in memory such as shared on-chip storage. Modify the thread address of the data, and fetch the data back to the kernel register according to the modified thread address. The same thread address corresponds to the original data read from the kernel register, and the data fetched back to the kernel register, so as to realize the exchange of data between threads. Such a method involves frequent memory access operations, resulting in low execution efficiency and high power consumption.

Related technology two:

Refer to Figure 1 to illustrate a circuit structure coupled by a crossover network. The circuit is divided into four quadrants by dotted lines, and each quadrant contains one vector processor (Execution Pipelines) and two crossover networks for performing cross-thread data movement operations. (cross bar) chip. Among them, the four quadrants are respectively recorded as the first quadrant, the second quadrant, the third quadrant and the fourth quadrant. The first quadrant comprises vector processor 455, cross network chip 410A (or claims, cross bar410A), cross network chip 410B; The second quadrant comprises vector processor 460, cross network chip 420A, cross network chip 420B; The third quadrant comprises vector The processor 465, the cross network chip 430A, and the cross network chip 430B; the fourth quadrant includes the vector processor 470, the cross network chip 440A, and the cross network chip 440B.

The coupling between the output channels of cross bar 410A, cross bar 410B, cross bar 420A, cross bar 420B, cross bar 430A, cross bar 430B, cross bar 440A and cross bar 440B and various vector processors can be achieved with a small number of threads The combination of cross bar to achieve cross-thread operation with a large number of threads. Among them, the coupling relationship between the cross bar and the vector processor is shown in Table 1 below.

Table 1

向量处理器vector processor	可使用的cross bar Available cross bar
455455	410A,420A,430B,440A410A, 420A, 430B, 440A
460460	410B,420B,430A,440B410B, 420B, 430A, 440B
465465	410B,420B,430A,440B410B, 420B, 430A, 440B
470470	410A,420A,430B,440A410A, 420A, 430B, 440A

Each of the aforementioned cross bars has 8 input channels and 8 output channels, that is, an 8×8 cross network. The combination of 4 cross bars can realize 16 input channels and 16 output channels. One cross-thread operation instruction can control the replacement of 16 channels. For the cross-thread operation of 32 threads, two back-to-back cross-thread operation instructions can be used, that is, two cross-thread operation instructions continuous in time to perform 32×32 replacement. Among them, two cross-thread operation instructions are recorded as the first replacement instruction and the second replacement instruction. The first replacement instruction controls the combined cross network to input 16 thread data to perform the replacement operation and then output it, and write it back to the vector register file. The second replacement instruction The control-combined cross network inputs 16 thread data and outputs them after replacement operation. The output of the first permutation instruction will be read and the output of the second permutation instruction will be combined with the output of the first permutation instruction to produce the final result of the 32x32 permutation.

In the second technique above, although the cross bar sharing design is used to reduce the number of cross bars, there is still a large hardware cost. And each cross bar is shared by two vector processors, and can only be used by one of the vector processors at a time. Therefore, when a vector processor is in use, if another vector processor also needs to use the same cross bar, it will cause processor blockage. For the cross-thread operation of 32 threads, two back-to-back cross-thread instructions need to be used. The first instruction needs to be written into the register and then read out, which will consume additional power consumption.

Related technology three:

Set the vector reduction instruction (VADDREDUCEPS) to perform a shift operation on the data elements in each thread in the multi-thread to achieve the reduction calculation inside the same thread. As shown in Figure 2, 310 is a vector register containing 4 threads, and each thread contains 4 elements. After the vector reduction instruction is executed, the data in each thread is shifted to the right by the bit width of 1 element unit , the rightmost element in each thread is not shifted, and is added, subtracted or multiplied with the shifted element, the leftmost element in each thread is filled with 0, and the shift operation will not cross the thread boundary. As shown in Figure 2, after the aforementioned shift operation, 310 is changed to 320, the details are as follows:

{A15,A14,A13,A12}->{0,A15,A14,A13+A12}

{A11,A10,A9,A8}->{0,A11,A10,A9+A8}

{A7,A6,A5,A4}->{0,A7,A6,A5+A4}

{A3,A2,A1,A0}->{0,A3,A2,A1+A0}

In the third technique above, using the vector reduction instruction can only perform shift operations on the data in each thread, and does not involve real cross-thread operations. Although the reduction calculation can be realized, the efficiency is low. Moreover, it is only applicable to processors with fewer threads. For SIMD processors, since the number of threads in SIMD processors is large and the bit width of registers in threads is small, this technology cannot operate across threads and has poor applicability. In addition, this technique can only do partial reduction calculations, and cannot be applied to differential calculations in graphics.

Based on this, the embodiments of the present application provide a multi-threaded data processing method and device, which can improve execution performance, realize cross-thread operations involved in parallel computing at a lower hardware cost, and effectively accelerate data processing of parallel computing. For example, the multi-threaded data processing method provided in the embodiment of the present application can be applied to reduction algorithms in parallel computing, differential computing in graphics processing, and the like.

The multiple mentioned below in the embodiments of the present application refers to two or more. "And/or" describes the association relationship of associated objects, indicating that there may be three types of relationships, for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, and B exists independently. The character "/" generally indicates that the contextual objects are an "or" relationship. In addition, it should be understood that although the terms first, second, etc. may be used to describe various data in the embodiments of the present invention, these data should not be limited to these terms. These terms are only used to distinguish data from one another. "At least one" means one or more. At least two means two or more. "At least one", "any one" or other similar expressions refer to any combination of these items, including any combination of single item(s) or plural item(s). For example, a, b, c can be single or multiple.

The terms "including" and "having" mentioned in the following description of the embodiments of the present application and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes other unlisted steps or units, or optionally also includes Other steps or elements inherent to the process, method, product or apparatus are included. It should be noted that, in the embodiments of the present application, words such as "exemplary" or "for example" are used as examples, illustrations or descriptions. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete manner.

The embodiments of the present application will be further introduced below in conjunction with the accompanying drawings.

Referring to FIG. 3 , it shows a schematic diagram of a SIMD parallel computing processor system architecture. The multi-thread data processing method provided by the embodiment of the present application can be applied to the SMID parallel computing processor system. SIMD parallel computing processor systems can be deployed in devices such as personal computers, laptops, smart phones, smart set-top boxes, in-vehicle smart systems, smart wearable devices, and more. The SIMD parallel computing processor system is mainly used to process applications with a large amount of data, input the compiled binary instruction code, and the corresponding data to be processed, and finally output the data processed by the program to the external storage. A typical example is a graphics processing unit (GPU), which inputs a large amount of 3D model vertex data and the rendering program instruction code compiled by the compiler, and finally outputs the rendered data to the video memory.

The SIMD parallel computing processor system mainly includes one or more processor cores, and one SIMD processor core is schematically shown in FIG. 3 . Each processor core contains multiple arithmetic logic units (arithmetic logic unit, ALU), general purpose register (GPR) units, and instruction processing related units such as instruction scheduler, instruction decoder, one of the source operand collection units or Multiple. The processing functions of the main modules are as follows:

The instruction scheduler is used to read the instruction code compiled by the compiler from the memory, and distribute the instruction code according to the degree of idleness of the arithmetic logic unit (ALU) and the degree of resource usage. Wherein, the instruction encoding is an encoding in a binary format, and the instruction encoding; optionally, the instruction encoding may also be referred to as an operation instruction. An instruction encoding may contain one or more of the following parameters: one or more opcodes used to indicate the behavior of the instruction encoding; source operands, used to indicate the source data required by the opcode, such as the The source can be register address encoding or immediate number encoding; the destination operand, used to indicate the storage location of the result after the instruction opcode is executed, can be register address encoding. The embodiment of the present application will describe the instruction encoding in detail in the following content.

A general-purpose register (GPR) unit is used to store data corresponding to operands involved in instruction calculation, such as data corresponding to source operands and data corresponding to destination operands. Optionally, the general purpose register unit (GPR) uses static random access memory (SRAM). The initial data may come from external storage, corresponding to the multithreading of the parallel computing processor, and the initial data may be the data of the multithreading of the SIDM processor core.

The instruction decoder is configured to receive and parse the instruction code, and instruct the general purpose register unit (GPR) to prepare for reading the source data according to the instruction code.

The source operand collector is used to receive multiple source data returned by the general-purpose register, and based on the multiple source data returned by the general-purpose register, perform a cross-thread data movement operation and then output the data to the arithmetic logic unit. Specifically, a set number of threads are deployed in the source operand collector, and the source operand collector can use multiple source data returned by the general register as the source data of the aforementioned set number of threads, one thread corresponds to one source data, Perform data movement operations among the set number of threads. In the embodiment of the present application, the source operand collector may also output multiple source data to the ALU; or, the ALU may also directly receive multiple source data returned by the general-purpose register.

Arithmetic logic unit (ALU), including multi-stage pipelines, can complete instruction calculations of various types of operations, such as floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer Type addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical XOR operation XOR, logical AND Floating-point, integer and logical operations such as AND, logical or OR. The data to be calculated (also called operand) and the instruction code indicating the type of operation are input to the arithmetic logic unit ALU to complete the instruction calculation of the relevant operation type. In a SIMD parallel computing processor system, each SIMD processor core can contain multiple ALUs to achieve high computing throughput. Wherein, an independent 1-bit flag can be set for each ALU unit, and the value of the flag indicates whether the ALU unit participates in instruction calculation. For example, if the flag bit is 1, it means that the ALU participates in the instruction calculation, and if the flag bit is 0, it means that the ALU does not participate in the instruction calculation, and there is no need for clock inversion, which can save power consumption.

The above-mentioned system provided by the embodiment of the present application does not need to use a complex cross network or access storage to obtain data, execute a single instruction code, read data from a general-purpose register at one time, complete cross-thread data movement and calculation, and can improve cross-thread operations. execution performance.

Further, refer to Table 2 below to illustrate a format of an instruction encoding, and the instruction encoding may specifically include the following parameters.

Table 2

first opcode

second opcode

destination operand

source operand 1

source operand 2

The first operation code is used to indicate the data transfer mode among the set number of threads deployed in the source operand collector, and the data transfer mode includes one or more types, which can be defined according to actual requirements. Operation types include floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed Integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical XOR operation XOR, logical AND operation AND, logical OR operation and other floating point, integer and logical operations. Wherein, the first operation code may also be called a main operation code, and the second operation code may also be called a secondary operation code.

Optionally, the data migration mode may include the following types: circular migration, cross migration, and one-to-many migration. The second opcode is used to indicate the operation type. Loop moving can be understood as moving the data of each thread according to the same thread offset and the same thread number sorting direction (such as the direction from high thread to low thread); cross moving can be understood as data between two threads Mutual exchange; one-to-many transfer, also known as diffusion transfer, can be understood as moving the data of one thread to other or multiple threads including this thread. Optionally, different values of the first operation code can be used to indicate different data transfer methods, for example, the first operation code can be CROSS-DOWN, and CROSS-DOWN can be used to indicate circular transfer, or the first operation code can be CROSS-QUAD -BUTTERFLY, use CROSS-QUAD-BUTTERFLY to indicate cross transfer; or, the first opcode can be CROSS-QUAD-BROADCAST, use CROSS-QUAD-BROADCAST to indicate one-to-many transfer.

When defining, the aforementioned cyclic transfer, crossover transfer, and one-to-many transfer can also be replaced by other names, as long as they can be identified so that the source operand collector can determine which transfer operation to perform according to the first operation code, The embodiments of the present application do not limit this. Exemplarily, the first data transfer method, the second data transfer method, and the third data transfer method can be used to distinguish the types of the above data transfer methods. For example, the first data transfer method indicates circular transfer, and the second data transfer method indicates cross transfer. , the third data transfer mode indicates one-to-many transfer.

Source operand 1 is used to indicate the source data of the set number of threads; wherein, the source data of the set number of threads can come from parallel computing processors such as SIMD processors, the sources of different threads in the aforementioned set number The data comes from different threads in the SIMD processor. The set number of deployment threads in the aforementioned source operand collector can be consistent with the number of threads of a parallel computing processor such as a SIMD processor, for example, both are N, and N is an integer greater than or equal to 2; or, the aforementioned source operand collector The set number of deployment threads in can also be less than the number of threads of parallel computing processors such as SIMD processors. For example, the set number of deployment threads in the source operand collector is N, and the number of threads of parallel computing processors such as SIMD processors is 2N. When a general-purpose register or a special-purpose register is used to store data of multiple threads of a parallel computing processor, the source operand 1 may specifically be a general-purpose register address or a special-purpose register address. The source operand 2 is used to determine the thread offset corresponding to the data movement mode, and the source operation data 2 can be an immediate value set according to actual computing requirements. The destination operand is used to indicate the storage location of the operation result, specifically, it may be a general-purpose register address or a special-purpose register address.

Specifically, the instruction encoder can obtain the first operand, the second operand, the destination operand, the source operand 1 and the source operand 2 from the instruction encoding according to the format of the instruction encoding. Instruct the general-purpose register to prepare corresponding source data according to the first operand, and the general-purpose register returns the source data of the aforementioned set number of threads to the source operand collector, and the source operand collector can encode the source data of the set number of threads according to the instruction code The data is moved to obtain the moved data on each thread. The source operand collector can send the source data and the moved data on some or all threads in the set number to the arithmetic logic unit, and the arithmetic logic unit can execute the second operation code in parallel (simultaneously) for some or all threads The indicated operation type gets the corresponding operation result, which is stored according to the destination operand.

In the following, schemes 1 to 4 are combined to describe in detail the schemes of cross-thread data movement and calculation under different data movement methods.

Option One:

The source operand collector deploys the same number of threads as the parallel computing processors, such as N threads. One instruction code can be used to realize the circular transfer of data among N threads.

Note that the instruction is coded as the first operation instruction, and the first operation instruction may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, the source operand 1 in the first operation instruction ) and the second source operand (ie, source operand 2 in the first operation instruction). Exemplarily, the first operation code is CROSS-DOWN, which indicates that the data transfer method between N threads in this solution is circular transfer or the first data transfer method; the second operation code indicates the operation type, such as floating-point addition FADD; the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address; the first source operand is the general-purpose register R1, and the first source data of N threads is indicated by the general-purpose register address, and the general-purpose register R1 is The initial data is the data of N threads of the parallel computing processor; the second source operand can be an immediate value, and the thread offset corresponding to the first data movement method is the immediate value. It should be noted that the thread offset here The amount can be understood as the degree of thread crossing involved in moving data. For example, if the immediate value is 2, there are 2 threads between the thread where a certain data is moved and the thread where the data is moved. Optionally, the expression of the first operation instruction may be recorded as: CROSSDOWN.FADD R0, R1, 2.

The source operand collector moves the first source data of N threads according to the first operation instruction, specifically, it can be implemented with reference to the following manner: the first source data of the thread numbered I ₊₁ is moved to the thread numbered i; Wherein, the numbering of N threads is 0～(N-1), i is taken over 0～(N-1), and I ₁ is the value of (i+SRC1) taking remainder of N; wherein, SRC1 represents the second Source operand, SRC1 is a positive integer. The data moved on the thread numbered i can be represented by SRC0[i], and the result after the cycle transfer satisfies the expression: SRC0[i]=SRC0[(i+SRC1)%N], i∈[0,N -1].

Both the source operand collector deployment and the parallel computing processor have 32 threads, and SRC1 is an example of 2. As shown in FIG. 4 , a schematic diagram of circular transfer, the first source data of the thread corresponding to the end of the arrow is moved to the thread corresponding to the head of the arrow. The data moved on the thread numbered 0 is the first source data on the thread numbered 2; the data moved on the thread numbered 2 is the first source data on the thread numbered 4; The data moved on the thread is the first source data on the thread numbered 27; the data moved on the thread numbered 30 is the first source data on the thread numbered 0, and so on. This embodiment of the present application will not describe it again.

In an optional embodiment, a CROSSDOWN cross-thread processing unit can be deployed in the source operation data collector, such as the CROSSDOWN cross-thread processing unit structure shown in Figure 5, which can be implemented by using multiple selectors MUX Cyclic transfer operation. Assuming that the data bit width in each thread of N threads is M bits, then a cascade circuit can be constructed by log2(N) binary selectors with a bit width of 2*M*N bits to perform cascaded data selection . The input of the first selector in the cascade circuit is generated based on the first source data of N threads: the first source data of each thread in the N threads is copied to double the bit width (2M) as the selector The first input, remember that the first source data of the thread is SRC0. In Figure 5, 2{SRC0,SRC0} represents the data copied to double the bit width; the data copied to the double bit width is shifted to the right by M bits as the second input. Use the bit 0 converted from SRC1 to binary as the selection bit, and select one output from the two inputs. For example, if bit 0 of SRC1 is 0, select the aforementioned data output that only copies twice the bit width; if bit 0 of SRC1 If it is 1, then select the data output that is double the bit width copied and shifted to the right; vice versa. One of the inputs to the i-th stage selector thereafter comes from the output of the selector at the previous stage, and the other input is the data shifted to the right by (i+1)*M bits from the output of the selector at the previous stage. The bit i converted from SRC1 to binary is used as the selection bit. For example, if the bit i of SRC1 is 0, then select the aforementioned data output that only copies twice the bit width; if the bit i of SRC1 is 1, then select to copy twice the bit Wide and right-shifted data output; vice versa, but it should be noted that the definition of the value of bit i in each stage of selector is the same. The selector of the last stage uses the bit log(N)-1 converted from SRC1 into binary as the selection bit, and the output data is sent to the arithmetic logic unit ALU as an operand, and the ALU is indicated according to the second operation code of the aforementioned first operation instruction Operation type, which can calculate the operands before and after cross-thread movement, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical and, logical or, Logical XOR etc.

Specifically, the arithmetic logic unit ALU may, for the first thread among the N threads, perform the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread. arithmetic operations. Wherein, the first thread may include some or all of the N threads.

Optionally, a thread flag bit may be configured for each of the N threads, and the thread flag bit is used to indicate whether the first source data of the thread participates in the calculation operation. Specifically, when the thread flag bit takes a first value, the thread flag bit is used to indicate that the source data of the thread participates in an operation; or, when the thread flag bit takes a second value, the thread flag bit is used for Indicates that the source data of the thread does not participate in the calculation operation. Wherein, the first value may be 1, and the second value may be 0. Setting the thread flag bit marks the threads that do not need to be calculated, which can save calculation power consumption in this way.

The arithmetic logic unit ALU can determine whether the data before and after the movement on the thread participates in the calculation according to the thread flag bit. Specifically, for the thread numbered i in N (referred to as thread i), the thread flag bit and number of the thread i can be ( (i+src1)%N) The thread flag bit of the thread determines whether the data before and after moving on the thread i participates in the operation. It can also be understood as after the data is moved, update the thread flag of the thread i according to the original thread flag of the thread i and the source of the moved data, that is, the thread flag of the thread numbered ((i+src1)%N) . The pseudocode can be expressed as: new_lanemask[i]=lanemask[(i+src1)%N]&lanemask[i]. Wherein, lanemask[i] represents the value of the original thread flag of thread i, and lanemask[(i+src1)%N] is the value of the flag of the thread whose number is ((i+src1)%N). When both lanemask[(i+src1)%N] and lanemask[i] are 1, the updated thread flag new_lanemask[i] of thread i is 1, which means that the data before and after moving on the thread participates in the operation.

Taking the first data moved on the first thread from the second thread among the N threads as an example, the data before and after the move on the first thread needs to meet the following conditions to participate in the operation: the data associated with the first thread The thread flag bit indicates that the first source data of the first thread participates in an operation operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in an operation operation. As shown in Figure 6, a schematic diagram of a thread flag bit, the thread 1 before the move is filled with black, and the original thread flag bit of thread 28 is 0, and the cross-thread data move operation is the updated thread 1, thread 1 after the circular move. 26, thread 28, and thread 31 have thread flags of 0. Data before and after migration on thread 1, thread 26, thread 28, and thread 31 does not participate in the calculation.

The solution 1 uses a single instruction to move data circularly across threads, which can be applied to reduction calculations in parallel computing. This solution one can also be used to realize multi-threaded data accumulation and multiplication operations, such as constructing multi-level instructions, each level of instruction uses the aforementioned first operation instruction, and the output result of each level of instruction can be used as the input of the next level of instruction, finally making Move multi-threaded data to the same thread to realize accumulation, multiplication and other operations.

Option II

The source operand collector deploys the same number of threads as the parallel computing processors, such as N threads. One instruction code can be used to realize the cross movement of data of N threads.

Note that the instruction is coded as the first operation instruction, and the first operation instruction may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, the source operand 1 in the first operation instruction ) and the second source operand (ie, source operand 2 in the first operation instruction). Exemplarily, the first operation code is CROSS QUAD BUTTERFLY, indicating that the data transfer method between N threads in the second solution is cross transfer or the second data transfer method; the second operation code indicates the operation type, such as floating-point addition FADD; the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address; the first source operand is the general-purpose register R1, and the first source data of N threads is indicated by the general-purpose register address, and the general-purpose register R1 is The initial data is the data of N threads of the parallel computing processor; the second source operand may be an immediate value, such as 2. Optionally, the expression of the first operation instruction may be recorded as: CROSS QUAD BUTTERFLY.FADD R0,R1,2.

The source operand collector moves the first source data of N threads according to the first operation instruction, specifically, it can be implemented with reference to the following manner: the first source data of the thread that is numbered ₁₂ is moved to the thread that is numbered i; Wherein, the numbering of N threads is 0～(N-1), and i takes 0～(N-1); I ₂ is the XOR value of i and SRC1, and SRC1 represents the second source operand, and SRC1 is positive integer. The moved data on the thread numbered i can be represented by SRC0[i], and the result after cyclic moving satisfies the expression: SRC0[i]=SRC0[i^SRC1], i∈[0,N-1].

Both the source operand collector deployment and the parallel computing processor have 32 threads, and SRC1 is an example of 2. As shown in FIG. 7 , a cross-offset schematic diagram, the first source data of the thread corresponding to the end of the arrow is moved to the thread corresponding to the head of the arrow. The data moved on the thread numbered 0 is the first source data on the thread numbered 2; the data moved on the thread numbered 2 is the first source data on the thread numbered 0; The data moved on the thread is the first source data on the thread numbered 31; the data moved on the thread numbered 31 is the first source data on the thread numbered 29, and so on. This embodiment of the present application will not describe it again. The CROSS QUAD BUTTERFLY of the second scheme can make N threads, such as 32 threads, grouped into a group of 4 consecutive threads to form a QUAD, and realize the data exchange between two threads in each QUAD.

In an optional implementation, the CROSS QUAD BUTTERFLY cross-thread processing unit can be deployed in the source operation data collector, and this unit uses multiple four-selector MUX to realize the circular transfer operation. Assuming that the data bit width in each of the N threads is M bits, parallel data selection can be performed through N four selectors with a bit width of M bits. Figure 8 shows the i-th four-selector MUX in the cross-thread processing unit of CROSS QUAD BUTTERFLY. The input of the i-th four-selector MUX is the first source data of the four threads of the QUAD to which the thread belongs. The numbers of the aforementioned four threads are respectively: i%4, (i%4)+1, (i%4)+2 and (i%4)+3. The i-th selector uses the XOR result of SRC1 and i as the selection bit, selects one of the four inputs and outputs it to the arithmetic logic unit ALU, and the ALU can perform operations according to the operation type indicated by the second operation code of the aforementioned first operation instruction. Calculate the operands before and after moving across threads, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.

The arithmetic logic unit ALU can determine whether the data before and after the movement on the thread participates in the calculation according to the thread flag bit. Specifically, for the thread numbered i in N (referred to as thread i), the thread flag bit and number of the thread i can be ( The thread flag bit of the thread of i^SRC1) determines whether the data before and after moving on the thread i participates in the operation. It can also be understood as after the data is moved, the thread flag of thread i is updated according to the original thread flag of the thread i and the thread flag of the thread numbered (i^SRC1) which is the source of the moved data. The pseudocode can be expressed as: new_lanemask[i]=lanemask[i^SRC1]&lanemask[i]. Wherein, lanemask[i] represents the value of the original thread flag of thread i, and lanemask[i^SRC1] is numbered as the value of the flag of the thread of (i^SRC1). When both lanemask[i^SRC1] and lanemask[i] are 1, the updated thread flag new_lanemask[i] of thread i is 1, which means that the data before and after moving on the thread participates in the operation.

Taking the first data moved on the first thread from the second thread among the N threads as an example, the data before and after the move on the first thread needs to meet the following conditions to participate in the operation: the data associated with the first thread The thread flag bit indicates that the first source data of the first thread participates in an operation operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in an operation operation. As shown in Figure 9, a schematic diagram of a thread flag bit, the thread 1 before the move is filled with black, and the original thread flag bit of thread 28 is 0. After the cross-thread data move operation, that is, the cross-movement of the price difference, the updated thread 1, The thread flag bits of thread 3, thread 28, and thread 30 are all 0. Data before and after migration on thread 1, thread 3, thread 28, and thread 30 does not participate in the calculation.

The second solution uses a single instruction to achieve cross-thread data transfer, and can lock the data exchange between threads within a small range of QUAD, which can be applied to difference calculations in image processing, such as the difference between two pixels that are located close to each other. Pixel comparison etc.

third solution

Note that the instruction is coded as the first operation instruction, and the first operation instruction may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, the source operand 1 in the first operation instruction ) and the second source operand (ie, source operand 2 in the first operation instruction). Exemplarily, the first operation code is CROSS QUAD-BROADCAST, indicating that the data transfer method between N threads in the third scheme is one-to-many transfer or diffusion transfer, or the third data transfer method; the second operation The code indicates the operation type, such as floating-point addition FADD; the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address; the first source operand is the general-purpose register R1, and the first source operand of the N threads is indicated by the general-purpose register address One source data, the initial data in the general register R1 is the data of N threads of the parallel computing processor; the second source operand can be an immediate value, such as 2. Optionally, the expression of the first operation instruction may be recorded as: CROSS QUAD-BROADCAST.FADD R0,R1,2.

The source operand collector moves the first source data of N threads according to the first operation instruction, specifically, it can be implemented with reference to the following manner: the first source data of the thread numbered ₁₃ is moved to the thread numbered i; Wherein, the numbering of N threads is 0～(N-1), and i takes 0～(N-1); the value of I ₃ is

SRC1 represents the second source operand, SRC1 is a positive integer, n is a positive integer capable of dividing N,

Indicates rounding down. The data moved on the thread numbered i can be represented by SRC0[i], and the result after cyclic moving satisfies the expression:

i∈[0,N-1]. Optionally, n is 4.

Both the source operand collector deployment and the parallel computing processor have 32 threads, the SRC1 is 2, and n is 4 examples. As shown in FIG. 10 , a schematic diagram of one-to-many offset, the first source data of the thread corresponding to the end of the arrow is moved to the thread corresponding to the head of the arrow. The first source data of the thread numbered 2 is moved to the thread numbered 0, the thread numbered 1, the thread numbered 2, and the thread numbered 3. The data moved on the thread numbered 0 is the first source data on the thread numbered 2; the data moved on the thread numbered 1 is the first source data on the thread numbered 2; The data moved on the thread is still the first source data on the thread numbered 2; the data moved on the thread numbered 3 is the first source data on the thread numbered 2, and so on. This embodiment of the present application will not describe it again. The CROSS QUAD-BROADCAST of the third scheme can make N threads, such as 32 threads, grouped into a group of 4 consecutive threads to form a QUAD, and then for each thread, select the thread number in the QUAD to which it belongs as SRC1 The first source data of the thread is moved to this thread.

In an optional implementation manner, the CROSS QUAD-BROADCAST cross-thread processing unit can be deployed in the source operation data collector, and this unit uses a plurality of four selectors MUX to realize the circular transfer operation. Assuming that the data bit width in each of the N threads is M bits, parallel data selection can be performed through N four selectors with a bit width of M bits. Figure 11 shows the i-th four-selector MUX in the cross-thread processing unit of CROSS QUAD-BROADCAST, the input of the i-th four-selector MUX is the first source of the four threads of the QUAD to which the thread belongs Data, the numbers of the aforementioned four threads are respectively: i%4, (i%4)+1, (i%4)+2 and (i%4)+3. The i-th selector uses SRC1 as the selection bit, and selects one of the four inputs to output to the arithmetic logic unit ALU. According to the operation type indicated by the second operation code of the aforementioned first operation instruction, the ALU can perform cross-thread transfer before and after Operand calculations, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.

The arithmetic logic unit ALU can determine whether the data before and after the movement on the thread participates in the calculation according to the thread flag bit. Specifically, for the thread numbered i in N (referred to as thread i), the thread flag bit and number of the thread i can be ( (i+src1)%4) The thread flag bit of the thread determines whether the data before and after moving on the thread i participates in the operation. It can also be understood as after the data is moved, update the thread flag of the thread i according to the original thread flag of the thread i and the source of the moved data, that is, the thread flag of the thread numbered ((i+src1)%4) . The pseudocode can be expressed as: new_lanemask[i]=lanemask[(i+src1)%4]&lanemask[i]. Wherein, lanemask[i] represents the value of the original thread flag of thread i, and lanemask[(i+src1)%4] is the value of the flag of the thread whose number is ((i+src1)%4). When both lanemask[(i+src1)%4] and lanemask[i] are 1, the updated thread flag new_lanemask[i] of thread i is 1, which means that the data before and after moving on the thread participates in the operation.

Taking the first data moved on the first thread from the second thread among the N threads as an example, the data before and after the move on the first thread needs to meet the following conditions to participate in the operation: the data associated with the first thread The thread flag bit indicates that the first source data of the first thread participates in an operation operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in an operation operation. As shown in Figure 12, a schematic diagram of a thread flag bit, the thread 1 before the move is filled in black, and the original thread flag bit of thread 30 is 0. After the cross-thread data move operation, that is, the price difference cross move, the updated thread 1, The thread flag bits of threads 28-31 are all 0. The data before and after the migration on thread 1 and threads 28-31 does not participate in the calculation.

The third solution uses a single instruction to achieve cross-thread data transfer, and can lock the data of a certain thread in a small range of QUAD, which can be applied to the difference calculation in image processing, such as four adjacent pixels in position, Smoothing based on one of the pixels, etc.

It can be understood that the above schemes 1 to 3 provided in the embodiment of the present application can be implemented independently or in combination. For example, plan 1 and plan 3 are combined and implemented, and the calculation results of each thread in plan 1 are used as the source data of the corresponding thread in plan 3 to perform one-to-many transfer operations.

Corresponding to the embodiments of the above schemes 1 to 3, referring to FIG. 13 , the embodiment of the present application also provides a cross-thread data processing flow, which can be executed cooperatively by various units in a parallel computing processor. Mainly include the following steps.

(1) The instruction scheduler inputs instructions.

The parallel computing program or graphics rendering program is compiled into a binary instruction code by a compiler and then configured into the SIMD processor. The instruction code is used as the input of the instruction scheduler, and the data to be processed is configured into the storage through the software and is processed before the instruction starts to be issued. Initialized into the register as the data input of the register module.

(2) The instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, destination operand, etc.) and operation codes, such as the first operation code and the second operation code.

(3) After the instruction decoder parses the source operand, it sends a source operand read request to the general register, and the general register returns the data corresponding to the source operand to the collector.

(4) The source operand collector judges whether the first opcode is a CROSS type instruction. If not, perform step (5) to send the data to the downstream ALU for calculation. If it is a CROSS type instruction, after judging whether it is a CROSS DOWN instruction, a CROSS QUAD BUTTERFLY instruction or a CROSS QUAD BROADCAST instruction, and processing it through the corresponding processing unit such as performing a cross-thread data movement operation, then perform step (5) to transfer the data Send it to the downstream ALU for calculation.

(5) The ALU performs corresponding calculations according to the second operation code, and the result is sent to the next module for processing.

Option four:

The source operand collector deploys fewer threads than the number of parallel computing processors. For example, the source operand collector deploys N threads, while the number of parallel computing processing threads is 2N. Two instruction codes can be used to realize the circular transfer of data among 2N threads of parallel computing processors.

The codes of the two instructions can be recorded as the first operation instruction and the second operation instruction. The first operation instruction can refer to the definition of Scheme 1. The source data sources indicated by the source operand 1 in the first operation instruction and the second operation instruction are different. . The first source data of the N threads indicated by the first source operand in the first operation instruction comes from N consecutive threads of the parallel computing processor. In order to distinguish the source operand 1 in the second operation instruction as the third source operand, the second source operation data indicates (in the source operation data collector) the second source data of N threads, and the N threads The second source data comes from the remaining N consecutive threads in the parallel computing processor. In addition, the second operation instruction may also include other parameters that are the same as those of the first operation instruction, such as a first operation code, a second operation code, a destination operand, and a second source operand.

In this scheme four, the source operand collector moves the first source data of N threads according to the first operation instruction, and the specific implementation manner of moving the second source data of N threads according to the second operation instruction can refer to Solution 1 is carried out, which will not be introduced in this embodiment of the present application. Exemplarily, the source operand collector can deploy 32 threads, the parallel computing processor includes 64 threads, and the SRC1 is 2. In the instruction scheduler stage, the time when the first operation instruction is issued is ahead of the time when the second operation instruction is issued; note that the time sequence interval between the two instructions is m, and when m is 1, it means that the first operation instruction and the second operation instruction are Two instructions sent back to back. Each instruction processes N threads. Figure 14 shows a schematic diagram of source data sources. The N threads of the source operand collector are numbered from 0 to 31. The first source data of the N threads indicated by the first operation instruction comes from Threads numbered 32-63 in the parallel computing processor, wherein the first source data of thread 0 in the source operation data collector comes from thread 32 in the parallel computing processor, and the first source data of thread 1 in the source operation data collector The source data comes from the thread 33 in the parallel computing processor, and by analogy, the first source data of the thread 31 in the source operation data collector comes from the thread 63 in the parallel computing processor; the indicated N threads of the second operation instruction The second source data comes from threads numbered 0-31 in the parallel computing processor, wherein the second source data of thread 0 in the source operation data collector comes from thread 0 in the parallel computing processor, and the thread 0 in the source operation data collector The second source data of thread 1 comes from thread 1 in the parallel computing processor, and so on, the second source data of thread 31 in the source operation data collector comes from thread 31 in the parallel computing processor.

Further on the basis of FIG. 14 , as shown in FIG. 15 , the embodiment of the present application provides another schematic diagram of circular transfer. The source operation data collector first obtains the first operation instruction, and after the source operation data collector executes the circular movement of cross-thread data according to the first operation instruction, in the source operation data collector: the data moved on thread 0 is on thread 2 the first source data on thread 1; the data moved on thread 1 is the first source data on thread 3; ... the data moved on thread 30 is the first source data on thread 0; the data moved on thread 31 is First source data on thread 1. The source operation data collector inputs the migration result corresponding to the first operation instruction, recorded as the first data of N threads, to the ALU. Then the source operation data collector obtains the second operation instruction, and after the source operation data collector executes the circular transfer of cross-thread data according to the second operation instruction, in the source operation data collector: the data moved on thread 0 is on thread 2 the second source data on thread 1; the data moved on thread 1 is the second source data on thread 3; ... the data moved on thread 30 is the second source data on thread 0; the data moved on thread 31 is Second source data on thread 1. The source operation data collector inputs the moving result corresponding to the second operation instruction, recorded as the second data of N threads, to the ALU.

In the processing stage in the ALU, the first operation instruction arrives earlier than the second operation instruction, assuming that the second operation instruction arrives at the stage I of the ALU, and the first operation instruction arrives at the stage I+m of the ALU, I is Arbitrary stages in the ALU. Then the arithmetic logic unit ALU can exchange the first data after moving on the third thread and the second data after moving on the third thread; wherein, the third thread is numbered r among the N threads in the source operation data collector the rout. The value of r can be determined as follows: if SRC1 is less than N, r is greater than or equal to (N-SRC1), and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0, and r is less than (N- SRC1%N).

Taking SRC1 as 2 as an example, refer to FIG. 16 for a schematic diagram of data replacement. On the basis of FIG. 15 , it shows that when the first operation instruction arrives at stage 0 of the ALU, the first data and the thread after being moved on the thread 30 are shown. Replace the second data moved on the thread 31; and replace the first data moved on the thread 31 with the second data moved on the thread 31, so far realize the circular movement operation of the data between the threads of the parallel computing processor 64 .

Further, the ALU implements the corresponding calculation operation based on the result of circular transfer of data between threads of the parallel computing processor 64 according to the second operation code in the first operation instruction/second operation instruction. The specific computing operations may be performed according to actual requirements, which is not limited in this embodiment of the present application.

Of course, it should be noted that in this solution four, a thread flag bit can also be configured for each of the N threads deployed by the source operand collector, and the thread flag bit is used to indicate whether the first source data of the thread participates in the calculation operation . The specific implementation manner can be carried out with reference to the manner in Solution 1, which will not be repeated in this embodiment of the present application. As an example, in FIG. 16 , it is filled with black to indicate that the data before and after migration on the

thread numbers

30 and 31 does not participate in the calculation operation.

In the fourth solution, fewer threads in the source operand collector are combined with ALU data exchange processing to realize efficient cross-thread data circulation with higher SIDM width, which can be applied to reduction calculation in parallel computing. This solution four can also be used to realize multi-threaded data accumulation and multiplication operations, such as constructing multi-level instructions, each level of instruction uses the aforementioned first operation instruction, and the output result of each level of instruction can be used as the input of the next level of instruction, finally making Move multi-threaded data to the same thread to realize accumulation, multiplication and other operations.

Corresponding to the fourth solution above, referring to FIG. 17 , the embodiment of the present application provides a cross-thread data processing flow, which can be executed cooperatively by various units in a parallel computing processor. Mainly include the following steps.

(1) The instruction scheduler inputs instructions.

(4) The source operand collector judges whether the first opcode is a CROSS type instruction. If not, then execute step (10) to send the data to the next module. If so, when it is determined to be a CROSS DOWN instruction, step (5) is performed.

(5) The source operand collector performs CROSS DOWN data processing such as circular transfer operations.

(6) judge whether it is the second instruction of CROSS DOWN (i.e. the aforementioned second operation instruction) in ALU stage I. If not, then execute step (10) to send the data to the next module; if yes, then instruct step (7).

(7) The ALU judges whether the value of SRC1 is smaller than N; if yes, execute step (8); if not, execute step (9).

(8) Exchange data between threads whose thread numbers are greater than or equal to N-SRC1 to threads whose thread numbers are less than N in ALU stage I and ALU stage I+m.

(9) Exchange data between threads whose thread numbers are greater than or equal to 0 and whose thread numbers are less than (N−SRC1%N) in ALU stage I and ALU stage I+m.

(10) Next module processing.

Further, in order to distinguish which one of the implementation schemes 1 to 4, after the instruction is input, it can first be judged whether the number of threads in the SIMD is twice the number of threads in the source operand collector. Specifically, refer to FIG. 18 to illustrate a cross-thread data processing flow, which mainly includes the following steps.

(1) The instruction scheduler inputs instructions.

(2) The instruction scheduler judges whether it is SIMD 2N mode, that is, whether the number of threads of SIMD is twice the number of threads in the source operand collector. If (3) is executed, the instruction is issued only once; if (4) is not executed, the instruction is issued twice.

(3) The instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, destination operand, etc.) and operation codes, such as the first operation code and the second operation code.

(4) After the instruction decoder parses the source operand, it sends a source operand read request to the general register, and the general register returns the data corresponding to the source operand to the collector.

(5) The source operand collector executes the moving operation of the first source data between the N threads indicated by the first operation instruction according to the above schemes 1 to 4. In Fig. 18, "Scheme 1, Plan 2, Plan 3, and Plan 4 for carrying out SIMD N" are indicated.

(6) judge whether it is the CROSS DOWN instruction of SIMD 2N by ALU, or judge whether to receive the second instruction of CROSS DOWN (i.e. the aforementioned second operation instruction). If not, execute step (8) to send the data to the next module; if yes, execute step (7).

(7) Perform data exchange at the ALU stage according to scheme four.

For example, in the ALU stage I and the ALU stage I+m, the data between the threads whose thread number is greater than or equal to N-SRC1 to the thread number less than N is exchanged. Or, exchange data between threads whose thread numbers are greater than or equal to 0 and whose thread numbers are less than (N−SRC1%N) in ALU stage I and ALU stage I+m.

(8) Next module processing.

Based on the same idea, an embodiment of the present application provides a multi-thread data processing method, as shown in FIG. 19 . The method mainly includes the following processes.

S1901. Acquire a first operation instruction. The first operation instruction includes the following parameters: a first operation code, the first operation code is used to indicate the data transfer mode between N threads, and N is an integer greater than or equal to 2; the first source operand, the The first source operand is used to indicate the first source data of the N threads; the second source operand is used to determine the thread offset corresponding to the data movement mode;

S1902. Move the first source data of the N threads according to the first operation instruction, to obtain the moved first data on each of the N threads.

In an optional implementation manner, the data moving method is the first moving method, and the moving the first source data of the N threads according to the first operation instruction includes: The first source data of the thread ₁ is moved to the thread numbered i; wherein, the numbering of N threads is 0～(N-1), i takes 0～(N-1), and I ₁ is (i+ SRC1) the value of the remainder of N; wherein, SRC1 represents the second source operand, and SRC1 is a positive integer.

In an optional implementation manner, the data moving method is a second moving method, and the moving the first source data of the N threads according to the first operation instruction includes:

Move the first source data of the thread that is numbered I ₂ to the thread that is numbered i; wherein, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); I ₂ is the XOR value of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer.

In an optional implementation manner, the data moving method is a third offset method, and the moving the first source data of the N threads according to the first operation instruction includes:

Move the first source data of the thread that is numbered _I3 to the thread that is numbered i; wherein, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); _I3 The value is

In an optional implementation manner, the first operation instruction further includes a destination operand, and the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.

In an optional implementation manner, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads, and the method further includes:

Acquiring a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; a third source operand, the third source operand indicating the N threads The second source data of the N threads comes from the remaining N continuous threads in the parallel computing processor;

Moving the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.

In an optional embodiment, the method also includes:

Exchange the first data moved on the third thread with the second data moved on the third thread; wherein, the three threads are threads numbered r among the N threads; if SRC1 is less than N , r is greater than or equal to (N-SRC1), and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0, and r is less than (N-SRC1%N).

Based on the same idea, as shown in FIG. 20, the embodiment of the present application also provides a multi-thread data processing device 2000, the multi-thread data processing device includes:

The instruction acquisition module 2001 is configured to acquire a first operation instruction, the first operation instruction includes the following parameters: a first operation code, and the first operation code is used to indicate a data transfer mode between N threads, where N is greater than Or an integer equal to 2; the first source operand, the first source operand is used to indicate the first source data of the N threads; the second source operand, the second source operand is used to determine the The thread offset corresponding to the above data movement mode.

The processing module 2002 is configured to move the first source data of the N threads according to the first operation instruction, and obtain the moved first data on each of the N threads.

In an optional implementation manner, the data transfer method is the first transfer method, and the processing module 2002 is specifically configured to: transfer the first source data of the thread numbered I ₊₁ to the thread numbered i Among them; wherein, the numbering of N threads is 0～(N-1), i takes 0～(N-1), and I ₁ is the value that (i+SRC1) gets remainder to N; Wherein, SRC1 represents described The second source operand, SRC1, is a positive integer.

In an optional implementation manner, the data transfer method is the second transfer method, and the processing module 2002 is specifically configured to: transfer the first source data of the thread numbered ₁₂ to the thread numbered i Among them; wherein, the numbering of N threads is 0～(N-1), and i takes 0～(N-1); I ₂ is the XOR value of i and SRC1, and SRC1 represents the second source operand, SRC1 is a positive integer.

In an optional implementation manner, the data moving method is a third offset method, and the processing module 2002 is specifically configured to: move the first source data of the thread numbered _I3 to the thread numbered i Among the threads; wherein, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); the value of I ₃ is

In an optional implementation manner, the first operation instruction further includes a second operation code, and the second operation code is used to indicate the operation type; the processing module 2002 is further configured to: for the N A first thread among the threads executes an operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.

In an optional implementation manner, each thread in the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.

In an optional implementation manner, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads; the instruction acquisition module 2001, It is also used to obtain a second operation instruction, the second operation instruction includes the following parameters: the first operation code; the second source operand; the third source operand, the third source operand indicates the The second source data of N threads, the second source data of the N threads comes from the remaining N continuous threads in the parallel computing processor; the processing module 2002 is further configured to perform the second operation according to the The instruction moves the second source data of the N threads to obtain the moved second data on each of the N threads.

In an optional implementation manner, the processing module 2002 is further configured to: exchange the moved first data on the third thread with the second data moved on the third thread; wherein, the The three threads are the threads numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1), and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0, and r is less than (N-SRC1%N).

Based on the same technical concept, this application also provides a communication device 2100 . The communication device 2100 may be a chip or a chip system. Optionally, in the embodiment of the present application, the system-on-a-chip may be composed of chips, or may include chips and other discrete devices.

The communication device 2100 may include at least one processor 2110, and the processor 2110 is coupled to a memory. Optionally, the memory may be located within the device, the memory may be integrated with the processor, or the memory may be located outside the device. For example, the communication device 2100 may further include at least one memory 2120 . The memory 2120 stores necessary computer programs, configuration information, computer programs or instructions and/or data for implementing any of the above embodiments; the processor 2110 may execute the computer programs stored in the memory 2120 to complete the method in any of the above embodiments.

The coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules. Processor 2110 may cooperate with memory 2120 . The specific connection medium between the processor 2110 and the memory 2120 is not limited in this embodiment of the present application.

The communication device 2100 may further include a communication interface 2130, and the communication device 2100 may perform information exchange with other devices through the communication interface 2130. Exemplarily, the communication interface 2130 may be a transceiver, a circuit, a bus, a module or other types of communication interfaces. When the communication device 2100 is a chip-like device or circuit, the communication interface 2130 in the device 2100 can also be an input and output circuit, which can input information (or call it, receive information) and output information (or call it, send information), The processor is an integrated processor or a microprocessor or an integrated circuit or a logic circuit, and the processor can determine output information according to input information.

Optionally, referring to FIG. 21 , the communication interface 2130 , the processing module 2110 and the memory 2120 are connected to each other through a bus 2140 . The bus 2140 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in Fig. 21, but it does not mean that there is only one bus or one type of bus.

In this embodiment of the application, the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or Execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.

In the embodiment of the present application, the memory may be a non-volatile memory, such as a hard disk (hard disk drive, HDD) or a solid-state drive (solid-state drive, SSD), etc., and may also be a volatile memory (volatile memory), such as Random-access memory (RAM). A memory is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory in the embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.

Based on the above embodiments, an embodiment of the present application further provides a computer program, which, when the computer program is run on a computer, causes the computer to execute the above multi-threaded data processing method.

Based on the above embodiments, the embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a computer, the computer executes the method described in the above-mentioned method embodiments. Provides multi-threaded data processing methods. Wherein, the storage medium may be any available medium that can be accessed by a computer. By way of example but not limitation: computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or may be used to carry or store information in the form of instructions or data structures desired program code and any other medium that can be accessed by a computer.

Based on the above embodiments, the embodiment of the present application also provides a computer chip, the chip is connected to the memory, and the chip is used to read and execute the software program stored in the memory, so as to realize the multiple functions provided in the above method embodiments. Thread data processing method.

Based on the above embodiments, an embodiment of the present application provides a chip system, the chip system includes a processor, configured to support a computer device to implement the functions of the multi-threaded data processing method in the above method embodiments. In a possible design, the chip system further includes a memory, and the memory is used to store necessary programs and data of the computer device. The system-on-a-chip may consist of chips, or may include chips and other discrete devices.

The technical solutions provided by the embodiments of the present application may be fully or partially implemented by software, hardware, firmware or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present invention will be generated in whole or in part. The computer may be a general computer, a special computer, a computer network, a network device, a terminal device or other programmable devices. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from a website, computer, server or data center Transmission to another website site, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (digital video disc, DVD)), or a semiconductor medium.

In the embodiments of the present application, on the premise that there is no logical contradiction, the various embodiments may refer to each other, for example, the methods and/or terms between the method embodiments may refer to each other, such as the functions and/or terms between the device embodiments Or terms may refer to each other, for example, functions and/or terms between the apparatus embodiment and the method embodiment may refer to each other.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the present application. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and a combination of procedures and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions may be provided to a general purpose computer, special purpose computer, embedded processor, or processor of other programmable data processing equipment to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing equipment produce a An apparatus for realizing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instructions The device realizes the function specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, thereby The instructions provide steps for implementing the functions specified in the flow chart or blocks of the flowchart and/or the block or blocks of the block diagrams.

Apparently, those skilled in the art can make various changes and modifications to the present application without departing from the scope of the present application. In this way, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include these modifications and variations.

Claims

A multi-thread data processing method is characterized in that, comprising:

Acquire a first operation instruction, the first operation instruction includes the following parameters: a first operation code, the first operation code is used to indicate the data transfer mode between N threads, and N is an integer greater than or equal to 2; A source operand, the first source operand is used to indicate the first source data of the N threads; a second source operand, the second source operand is used to determine the thread corresponding to the data movement mode Offset;

Moving the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.
The method according to claim 1, wherein the data moving method is a first moving method, and the moving the first source data of the N threads according to the first operation instruction includes:

Move the first source data of the thread numbered I 1 to the thread numbered i; wherein, the numbers of N threads are 0～(N-1), i takes 0～(N-1), I 1 (i+SRC1) is the value of the remainder of N; wherein, SRC1 represents the second source operand, and SRC1 is a positive integer.
The method according to claim 1, wherein the data moving method is a second moving method, and the moving the first source data of the N threads according to the first operation instruction includes:

Move the first source data of the thread that is numbered I 2 to the thread that is numbered i; wherein, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); I 2 is the XOR value of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer.
The method according to claim 1, wherein the data moving method is a third offset method, and the moving the first source data of the N threads according to the first operation instruction includes:

Move the first source data of the thread that is numbered I3 to the thread that is numbered i; wherein, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); I3 The value is
SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that can divide N.
The method according to any one of claims 1-4, wherein the first operation instruction further includes a second operation code, and the second operation code is used to indicate the operation type; the method further includes:

For a first thread among the N threads, based on the first source data of the first thread and the moved first data on the first thread, an operation operation corresponding to the operation type is performed.
The method according to claim 5, wherein each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
The method according to claim 6, wherein the first data moved on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates the The first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
The method according to any one of claims 5-7, wherein the first operation instruction further includes a destination operand, and the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
The method according to claim 2, wherein the first source data of the N threads are from N continuous threads of a parallel computing processor, and the parallel computing processor includes 2N threads, and the method further comprises include:

Acquiring a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; a third source operand, the third source operand indicating the N threads The second source data of the N threads comes from the remaining N continuous threads in the parallel computing processor;

Moving the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.
The method of claim 8, further comprising:

Exchange the first data moved on the third thread with the second data moved on the third thread; wherein, the three threads are threads numbered r among the N threads; if SRC1 is less than N , r is greater than or equal to (N-SRC1), and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0, and r is less than (N-SRC1%N).
A multi-thread data processing device is characterized in that it comprises:

An instruction acquisition module, configured to acquire a first operation instruction, the first operation instruction includes the following parameters: a first operation code, the first operation code is used to indicate a data transfer mode between N threads, and N is greater than or An integer equal to 2; a first source operand, the first source operand is used to indicate the first source data of the N threads; a second source operand, the second source operand is used to determine the The thread offset corresponding to the data movement method;

A processing module, configured to move the first source data of the N threads according to the first operation instruction, to obtain the moved first data on each of the N threads.
The device according to claim 11, wherein the data transfer method is a first transfer method, and the processing module is specifically configured to: transfer the first source data of the thread numbered I +1 to the numbered i Among the threads; wherein, the numbers of N threads are 0～(N-1), i is taken over 0～(N-1), and I 1 is the value of (i+SRC1) taking the remainder of N; wherein, SRC1 represents The second source operand, SRC1 is a positive integer.
The device according to claim 11, wherein the data transfer mode is the second transfer mode, and the processing module is specifically used to: move the first source data of the thread numbered I+ 2 to the numbered i Among the threads; wherein, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); I 2 is the XOR value of i and SRC1, and SRC1 represents the second source operation number, SRC1 is a positive integer.
The device according to claim 11, wherein the data transfer method is a third offset method, and the processing module is specifically used for:

Move the first source data of the thread that is numbered I3 to the thread that is numbered i; wherein, the numbers of N threads are 0～(N-1), and i takes 0～(N-1); I3 The value is
SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that can divide N.
The device according to any one of claims 11-14, wherein the first operation instruction further includes a second operation code, and the second operation code is used to indicate the operation type; the processing module also uses At:

For a first thread among the N threads, based on the first source data of the first thread and the moved first data on the first thread, an operation operation corresponding to the operation type is performed.
The device according to claim 15, wherein each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
The device according to claim 16, wherein the first data moved on the first thread comes from the second thread among the N threads; the thread flag bit associated with the first thread indicates the The first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
The device according to any one of claims 15-17, wherein the first operation instruction further includes a destination operand, and the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
The device according to claim 12, wherein the first source data of the N threads is from N continuous threads of a parallel computing processor, and the parallel computing processor includes 2N threads;

The instruction obtaining module is also used to obtain a second operation instruction, the second operation instruction includes the following parameters: the first operation code; the second source operand; the third source operand, the third The source operand indicates the second source data of the N threads, and the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor;

The processing module is further configured to move the second source data of the N threads according to the second operation instruction, to obtain the moved second data on each of the N threads.
The device according to claim 18, wherein the processing module is further configured to:

Exchange the first data moved on the third thread with the second data moved on the third thread; wherein, the three threads are threads numbered r among the N threads; if SRC1 is less than N , r is greater than or equal to (N-SRC1), and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0, and r is less than (N-SRC1%N).
A communication device, characterized in that it includes a processor, the processor is coupled to a memory, the memory is used to store computer programs or instructions, and the processor is used to execute the computer programs or instructions to implement claim 1 to the method described in any one of 10.
A computer-readable storage medium is characterized by comprising instructions, which, when run on a computer, cause the computer to execute the method according to any one of claims 1 to 10.
A computer program product, characterized in that, when it is run on a computer, it causes the computer to execute the method described in any one of claims 1 to 10.