WO2022266842A1 - Multi-thread data processing method and apparatus - Google Patents

Multi-thread data processing method and apparatus

Info

Publication number: WO2022266842A1
Authority: WIPO (PCT)
Prior art keywords: thread, data, threads, source, src1
Application number: PCT/CN2021/101533
Other languages: French (fr), Chinese (zh)
Inventors: 陈水挺, 杨伟光, 吴任初
Original Assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2021/101533 priority Critical patent/WO2022266842A1/en
Priority to CN202180099704.7A priority patent/CN117561501A/en
Publication of WO2022266842A1 publication Critical patent/WO2022266842A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of parallel computing, and in particular to a multi-thread data processing method and device.
  • SIMD: single instruction, multiple data.
  • the traditional parallel processor solutions are divided into software solutions and hardware solutions.
  • The software solution uses shared on-chip storage: the data is stored in the shared on-chip storage, the thread address is then modified, and the data is fetched back into the core registers to realize the exchange of data between threads.
  • The software solution therefore involves frequent memory access operations, resulting in low execution efficiency and high power consumption.
  • The hardware solution generally relies on a complex cross bar; for example, the data of each output thread of the cross network can come from any input thread, thereby enabling thread data exchange.
  • However, hardware solutions incur higher hardware costs.
  • the present application provides a multi-thread data processing method and device, which can improve execution performance and realize cross-thread operations involved in parallel computing at a lower hardware cost.
  • an embodiment of the present application provides a multi-threaded data processing method, the method including: acquiring a first operation instruction.
  • The first operation instruction includes the following parameters: a first operation code, used to indicate the data movement mode between N threads, where N is an integer greater than or equal to 2; a first source operand, used to indicate the first source data of the N threads; and a second source operand, used to determine the thread offset corresponding to the data movement mode. The method further includes moving the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.
  • In this way, efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a cross-bar cross network and does not require frequent memory access, so that cross-thread operations of applications on a high-performance parallel computing processor can be accelerated with lower hardware or signaling overhead.
  • the data moving method is the first moving method
  • the moving of the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I1 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N; SRC1 represents the second source operand, and SRC1 is a positive integer.
  • the data moving method is the second moving method
  • the moving of the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I2 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I2 is the XOR value of i and SRC1; SRC1 represents the second source operand, and SRC1 is a positive integer.
  • the data moving method is a third moving method
  • the moving of the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I3 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 = ⌊i/n⌋ × n + SRC1; SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
  • the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type; the method further includes: for a first thread among the N threads, performing an operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  • each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
  • the first data moved on the first thread comes from the second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
  • the first operation instruction further includes a destination operand, where the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
  • the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads; the method further includes: acquiring a second operation instruction, where the second operation instruction includes the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor; and moving the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.
  • the method further includes: exchanging the moved first data on a third thread with the moved second data on the third thread; wherein the third thread is a thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and less than (N-SRC1%N).
  • an embodiment of the present application provides a multi-threaded data processing device, which includes: an instruction acquisition module, configured to acquire a first operation instruction, where the first operation instruction includes the following parameters: a first operation code, used to indicate the data movement mode between N threads, where N is an integer greater than or equal to 2; a first source operand, used to indicate the first source data of the N threads; and a second source operand, used to determine the thread offset corresponding to the data movement mode; and a processing module, configured to move the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.
  • In this way, efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a cross-bar cross network and does not require frequent memory access, so that cross-thread operations of applications on a high-performance parallel computing processor can be accelerated with lower hardware or signaling overhead.
  • the data transfer method is a first transfer method
  • the processing module is specifically configured to: transfer the first source data of the thread numbered I1 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N; SRC1 represents the second source operand, and SRC1 is a positive integer.
  • the data transfer method is the second transfer method, and the processing module is specifically configured to: transfer the first source data of the thread numbered I2 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I2 is the XOR value of i and SRC1; SRC1 represents the second source operand, and SRC1 is a positive integer.
  • the data moving method is a third moving method, and the processing module is specifically configured to: move the first source data of the thread numbered I3 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 = ⌊i/n⌋ × n + SRC1; SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
  • the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type;
  • the processing module is further configured to: for a first thread among the N threads, perform an operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  • each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
  • the first data moved on the first thread comes from the second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
  • the first operation instruction further includes a destination operand, where the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
  • the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads;
  • the instruction acquisition module is further configured to acquire a second operation instruction, where the second operation instruction includes the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor;
  • the processing module is further configured to move the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.
  • the processing module is further configured to: exchange the moved first data on a third thread with the moved second data on the third thread; wherein the third thread is a thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and less than (N-SRC1%N).
  • the present application provides a communication device, including a processor, where the processor is coupled to a memory, the memory is configured to store computer programs or instructions, and the processor is configured to execute the computer programs or instructions to perform the method in any one of the implementations of the foregoing first aspect to fourth aspect.
  • the memory may be located within the device or external to the device.
  • the number of the processors is one or more.
  • the present application also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium; when the instructions are run on a computer, the computer is caused to execute the method described in the foregoing first aspect and each possible implementation of the first aspect.
  • the present application further provides a computer program product including instructions, which, when run on a computer, cause the computer to execute the method described in the above first aspect and each possible implementation manner of the first aspect.
  • the present application also provides a computer chip, where the chip is connected to a memory, and the chip is configured to read and execute a software program stored in the memory to perform the method in the foregoing first aspect and each possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a circuit structure coupled by a crossover network
  • Fig. 2 is a schematic diagram of internal element shifting of a thread
  • FIG. 3 is a schematic diagram of a SIMD parallel computing processor system architecture provided by an embodiment of the present application.
  • Fig. 4 is a schematic diagram of a cycle transfer provided by the embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a CROSSDOWN cross-thread processing unit provided by an embodiment of the present application.
  • FIG. 6 is one of the schematic diagrams of thread flag bits provided by the embodiment of the present application.
  • Fig. 7 is a schematic diagram of cross transfer provided by the embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a CROSS QUAD BUTTERFLY cross-thread processing unit provided by the embodiment of the present application.
  • FIG. 9 is one of the schematic diagrams of thread flag bits provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of one-to-many transfer provided by the embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a CROSS QUAD-BROADCAST cross-thread processing unit provided by the embodiment of the present application.
  • FIG. 12 is one of the schematic diagrams of thread flag bits provided by the embodiment of the present application.
  • FIG. 13 is one of the schematic diagrams of the cross-thread data processing flow provided by the embodiment of the present application.
  • FIG. 14 is a schematic diagram of a source data source provided by the embodiment of the present application.
  • Fig. 15 is another schematic diagram of circular transfer provided by the embodiment of the present application.
  • Fig. 16 is a schematic diagram of data replacement provided by the embodiment of the present application.
  • FIG. 17 is one of the schematic diagrams of the cross-thread data processing flow provided by the embodiment of the present application.
  • FIG. 18 is one of the schematic diagrams of the cross-thread data processing flow provided by the embodiment of the present application.
  • FIG. 19 is one of the schematic flowcharts of the multi-threaded data processing method provided by the embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a multi-threaded data processing device provided by an embodiment of the present application.
  • FIG. 21 is a schematic structural diagram of a communication device provided by an embodiment of the present application.
  • In the software solution, the thread data is read from the core registers by software and stored in a memory such as shared on-chip storage; the thread address of the data is modified, and the data is fetched back into the core registers according to the modified thread address. The same thread address thus corresponds to both the original data read from the core register and the data fetched back into the core register, thereby realizing the exchange of data between threads.
  • Such a method involves frequent memory access operations, resulting in low execution efficiency and high power consumption.
  • each quadrant contains one vector processor (execution pipelines) and two cross network (cross bar) chips for performing cross-thread data movement operations.
  • the four quadrants are respectively recorded as the first quadrant, the second quadrant, the third quadrant and the fourth quadrant.
  • the first quadrant comprises the vector processor 455, the cross network chip 410A (also referred to as cross bar 410A), and the cross network chip 410B;
  • the second quadrant comprises the vector processor 460, the cross network chip 420A, and the cross network chip 420B;
  • the third quadrant comprises the vector processor 465, the cross network chip 430A, and the cross network chip 430B;
  • the fourth quadrant includes the vector processor 470, the cross network chip 440A, and the cross network chip 440B.
  • cross bar 410A, cross bar 410B, cross bar 420A, cross bar 420B, cross bar 430A, cross bar 430B, cross bar 440A, cross bar 440B and each vector processor can implement cross-thread operations with a small number of threads; combinations of cross bars achieve cross-thread operations with a large number of threads, as shown in Table 1 below.
  • Table 1 (vector processor: available cross bars):
    455: 410A, 420A, 430B, 440A
    460: 410B, 420B, 430A, 440B
    465: 410B, 420B, 430A, 440B
    470: 410A, 420A, 430B, 440A
  • Each of the aforementioned cross bars has 8 input channels and 8 output channels, that is, an 8 × 8 cross network.
  • the combination of 4 cross bars can realize 16 input channels and 16 output channels.
  • One cross-thread operation instruction can control the replacement of 16 channels.
  • two back-to-back cross-thread operation instructions can be used, that is, two cross-thread operation instructions that are consecutive in time, to perform a 32 × 32 replacement.
  • two cross-thread operation instructions are recorded as the first replacement instruction and the second replacement instruction.
  • the first replacement instruction controls the combined cross network to input 16 thread data to perform the replacement operation and then output it, and write it back to the vector register file.
  • the second replacement instruction controls the combined cross network to input 16 thread data and output them after the replacement operation.
  • the output of the first replacement instruction is then read back and combined with the output of the second replacement instruction to produce the final result of the 32 × 32 replacement.
  • each cross bar is shared by two vector processors and can only be used by one of them at a time; therefore, while one vector processor is using a cross bar, if the other vector processor also needs the same cross bar, the processor is blocked.
  • In addition, two back-to-back cross-thread instructions need to be used, and the output of the first instruction must be written into the register file and then read out again, which consumes additional power.
  • Take the vector reduction instruction VADDREDUCEPS as an example.
  • 310 is a vector register containing 4 threads, and each thread contains 4 elements.
  • When the instruction is executed, the data in each thread is shifted to the right by the bit width of one element; the rightmost element in each thread is not shifted and is added to, subtracted from, or multiplied with the element shifted into its position; the leftmost element position in each thread is filled with 0; and the shift operation does not cross the thread boundary.
  • In this way, 310 is changed to 320, as sketched below.
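  • As a rough, non-authoritative illustration of the intra-thread shift-and-add behavior described above (assuming 4 elements per thread, addition as the operation, and made-up element values):

```python
# Minimal sketch of the intra-thread shift-and-add of a vector reduction
# instruction: each thread is handled independently, with no cross-thread movement.
def vector_reduce_step(threads):
    result = []
    for elems in threads:              # e.g. elems = [e0, e1, e2, e3]
        shifted = [0] + elems[:-1]     # shift right by one element, fill leftmost with 0
        result.append([a + b for a, b in zip(elems, shifted)])
    return result

assert vector_reduce_step([[1, 2, 3, 4]]) == [[1, 3, 5, 7]]
```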
  • Using the vector reduction instruction can only perform shift operations on the data within each thread and does not involve real cross-thread operations; although reduction calculation can be realized, the efficiency is low. Moreover, it is only applicable to processors with few threads: for SIMD processors, since the number of threads is large and the register bit width within a thread is small, this technique cannot operate across threads and has poor applicability. In addition, this technique can only perform partial reduction calculations and cannot be applied to differential calculations in graphics.
  • the embodiments of the present application provide a multi-threaded data processing method and device, which can improve execution performance, realize cross-thread operations involved in parallel computing at a lower hardware cost, and effectively accelerate data processing of parallel computing.
  • the multi-threaded data processing method provided in the embodiment of the present application can be applied to reduction algorithms in parallel computing, differential computing in graphics processing, and the like.
  • FIG. 3 it shows a schematic diagram of a SIMD parallel computing processor system architecture.
  • the multi-thread data processing method provided by the embodiment of the present application can be applied to the SIMD parallel computing processor system.
  • SIMD parallel computing processor systems can be deployed in devices such as personal computers, laptops, smart phones, smart set-top boxes, in-vehicle smart systems, smart wearable devices, and more.
  • the SIMD parallel computing processor system is mainly used to process applications with a large amount of data, input the compiled binary instruction code, and the corresponding data to be processed, and finally output the data processed by the program to the external storage.
  • a typical example is a graphics processing unit (GPU), which inputs a large amount of 3D model vertex data and the rendering program instruction code compiled by the compiler, and finally outputs the rendered data to the video memory.
  • the SIMD parallel computing processor system mainly includes one or more processor cores, and one SIMD processor core is schematically shown in FIG. 3 .
  • Each processor core contains one or more of the following: arithmetic logic units (arithmetic logic unit, ALU), general-purpose register (general purpose register, GPR) units, and instruction-processing related units such as an instruction scheduler, an instruction decoder, and a source operand collector.
  • the instruction scheduler is used to read the instruction code compiled by the compiler from the memory, and distribute the instruction code according to the degree of idleness of the arithmetic logic unit (ALU) and the degree of resource usage.
  • the instruction encoding is an encoding in binary format; optionally, the instruction encoding may also be referred to as an operation instruction.
  • An instruction encoding may contain one or more of the following parameters: one or more operation codes, used to indicate the behavior of the instruction encoding; source operands, used to indicate the source data required by the operation codes, which can be register address encodings or immediate encodings; and a destination operand, used to indicate the storage location of the result after the instruction is executed, which can be a register address encoding.
  • the embodiment of the present application will describe the instruction encoding in detail in the following content.
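  • As a rough illustration only (the field names below are illustrative and do not reflect the patent's binary layout), the parameters listed above can be pictured as follows:

```python
# Hedged sketch of the instruction-encoding parameters; names are illustrative.
from dataclasses import dataclass

@dataclass
class OperationInstruction:
    opcode1: str          # first operation code, e.g. "CROSSDOWN": data movement mode between N threads
    opcode2: str          # second operation code, e.g. "FADD": operation type executed by the ALU
    dst: str              # destination operand, e.g. general-purpose register address "R0"
    src_operand_1: str    # source operand 1: register holding the source data of the N threads, e.g. "R1"
    src_operand_2: int    # source operand 2: immediate value, referred to as SRC1 in the text

# Matches the expression CROSSDOWN.FADD R0, R1, 2 used later in the text.
inst = OperationInstruction("CROSSDOWN", "FADD", "R0", "R1", 2)
```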
  • a general-purpose register (GPR) unit is used to store data corresponding to operands involved in instruction calculation, such as data corresponding to source operands and data corresponding to destination operands.
  • the general purpose register unit (GPR) uses static random access memory (SRAM).
  • the initial data may come from external storage and correspond to the multiple threads of the parallel computing processor; that is, the initial data may be the data of the multiple threads of the SIMD processor core.
  • the instruction decoder is configured to receive and parse the instruction code, and instruct the general purpose register unit (GPR) to prepare for reading the source data according to the instruction code.
  • The source operand collector is used to receive multiple source data returned by the general-purpose register and, based on these source data, perform a cross-thread data movement operation before outputting the data to the arithmetic logic unit. Specifically, a set number of threads are deployed in the source operand collector; the source operand collector can use the multiple source data returned by the general-purpose register as the source data of the aforementioned set number of threads, one thread corresponding to one source data, and perform data movement operations among the set number of threads. In the embodiment of the present application, the source operand collector may also output the multiple source data directly to the ALU; or the ALU may directly receive the multiple source data returned by the general-purpose register.
  • The arithmetic logic unit, which includes multi-stage pipelines, can complete instruction calculations of various operation types, such as floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical XOR, logical AND, logical OR, and other floating-point, integer, and logical operations.
  • each SIMD processor core can contain multiple ALUs to achieve high computing throughput.
  • an independent 1-bit flag can be set for each ALU unit, and the value of the flag indicates whether the ALU unit participates in instruction calculation. For example, if the flag bit is 1, the ALU participates in the instruction calculation; if the flag bit is 0, the ALU does not participate in the instruction calculation and no clock toggling is needed, which saves power.
  • The above system provided by the embodiment of the present application does not need to use a complex cross network or access storage to obtain data; by executing a single instruction code, data is read from the general-purpose register at one time and the cross-thread data movement and calculation are completed, which improves the execution performance of cross-thread operations.
  • the instruction encoding may specifically include the following parameters.
  • the first operation code is used to indicate the data transfer mode among the set number of threads deployed in the source operand collector; the data transfer mode includes one or more types, which can be defined according to actual requirements. The operation types indicated by the second operation code include floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical XOR, logical AND, logical OR, and other floating-point, integer, and logical operations.
  • the first operation code may also be called a main operation code
  • the second operation code may also be called a secondary operation code.
  • the data transfer mode may include the following types: circular transfer, cross transfer, and one-to-many transfer.
  • the second opcode is used to indicate the operation type.
  • Circular transfer can be understood as moving the data of each thread according to the same thread offset and the same thread-number direction (such as from higher-numbered threads to lower-numbered threads); cross transfer can be understood as the mutual exchange of data between two threads; one-to-many transfer, also known as diffusion transfer, can be understood as moving the data of one thread to one or more threads, possibly including the thread itself.
  • the first operation code can be CROSS-DOWN, where CROSS-DOWN is used to indicate circular transfer; or the first operation code can be CROSS-QUAD-BUTTERFLY, where CROSS-QUAD-BUTTERFLY is used to indicate cross transfer; or the first operation code can be CROSS-QUAD-BROADCAST, where CROSS-QUAD-BROADCAST is used to indicate one-to-many transfer.
  • the aforementioned circular transfer, cross transfer, and one-to-many transfer can also be given other names, as long as they can be identified so that the source operand collector can determine which transfer operation to perform according to the first operation code; the embodiments of the present application do not limit this.
  • the first data transfer method, the second data transfer method, and the third data transfer method can be used to distinguish the types of the above data transfer methods.
  • the first data transfer method indicates circular transfer
  • the second data transfer method indicates cross transfer.
  • the third data transfer mode indicates one-to-many transfer.
  • Source operand 1 is used to indicate the source data of the set number of threads; the source data of the set number of threads can come from a parallel computing processor such as a SIMD processor, and the source data of different threads in the aforementioned set number comes from different threads in the SIMD processor.
  • The set number of threads deployed in the aforementioned source operand collector can be consistent with the number of threads of a parallel computing processor such as a SIMD processor, for example both are N, where N is an integer greater than or equal to 2; alternatively, the set number of threads deployed in the source operand collector can be less than the number of threads of the parallel computing processor, for example the set number of threads deployed in the source operand collector is N while the number of threads of the parallel computing processor such as a SIMD processor is 2N.
  • the source operand 1 may specifically be a general-purpose register address or a special-purpose register address.
  • the source operand 2 is used to determine the thread offset corresponding to the data movement mode, and source operand 2 can be an immediate value set according to actual computing requirements.
  • the destination operand is used to indicate the storage location of the operation result, specifically, it may be a general-purpose register address or a special-purpose register address.
  • The instruction decoder can obtain the first operation code, the second operation code, the destination operand, source operand 1, and source operand 2 from the instruction encoding according to the format of the instruction encoding, and instruct the general-purpose register to prepare the corresponding source data; the general-purpose register returns the source data of the aforementioned set number of threads to the source operand collector, and the source operand collector can move the source data of the set number of threads according to the instruction encoding to obtain the moved data on each thread.
  • The source operand collector can send the source data and the moved data on some or all of the set number of threads to the arithmetic logic unit; the arithmetic logic unit can execute, for some or all threads in parallel (simultaneously), the operation type indicated by the second operation code to obtain the corresponding operation results, which are stored according to the destination operand.
  • The following Schemes 1 to 4 describe in detail the cross-thread data movement and calculation under different data movement methods.
  • the source operand collector deploys the same number of threads as the parallel computing processors, such as N threads.
  • One instruction code can be used to realize the circular transfer of data among N threads.
  • The instruction encoding is recorded as the first operation instruction, and the first operation instruction may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, source operand 1 in the first operation instruction), and the second source operand (that is, source operand 2 in the first operation instruction).
  • the first operation code is CROSS-DOWN, which indicates that the data transfer method between N threads in this solution is circular transfer or the first data transfer method;
  • the second operation code indicates the operation type, such as floating-point addition FADD;
  • the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address;
  • the first source operand is the general-purpose register R1, the first source data of the N threads is indicated by the general-purpose register address, and the initial data in the general-purpose register R1 is the data of the N threads of the parallel computing processor;
  • the second source operand can be an immediate value, and the thread offset corresponding to the first data movement method is this immediate value. The thread offset here can be understood as the degree of thread crossing involved in moving the data; for example, if the immediate value is 2, the thread from which a piece of data is moved and the thread to which it is moved are 2 thread numbers apart.
  • the expression of the first operation instruction may be recorded as: CROSSDOWN.FADD R0, R1, 2.
  • The source operand collector moves the first source data of the N threads according to the first operation instruction; specifically, this can be implemented as follows: the first source data of the thread numbered I1 is moved to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N; SRC1 represents the second source operand, and SRC1 is a positive integer.
  • Take as an example that both the source operand collector and the parallel computing processor have 32 threads and that SRC1 is 2.
  • FIG. 4 is a schematic diagram of circular transfer; the first source data of the thread at the tail of an arrow is moved to the thread at the head of the arrow.
  • the data moved on the thread numbered 0 is the first source data on the thread numbered 2;
  • the data moved on the thread numbered 2 is the first source data on the thread numbered 4;
  • the data moved on the thread numbered 25 is the first source data on the thread numbered 27;
  • the data moved on the thread numbered 30 is the first source data on the thread numbered 0, and so on.
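  • The circular move rule can be pictured with the following minimal sketch (Python, with the thread number used as the data value; this illustrates the I1 = (i+SRC1) mod N rule and is not the patent's implementation):

```python
def crossdown_move(src0, src1):
    """Circular move: thread i receives the source data of thread (i + src1) % N."""
    n = len(src0)
    return [src0[(i + src1) % n] for i in range(n)]

# FIG. 4 example: 32 threads, SRC1 = 2; the thread number itself is used as data.
src0 = list(range(32))
moved = crossdown_move(src0, 2)
assert moved[0] == src0[2]     # thread 0 receives the data of thread 2
assert moved[25] == src0[27]   # thread 25 receives the data of thread 27
assert moved[30] == src0[0]    # thread 30 receives the data of thread 0
```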
  • A CROSSDOWN cross-thread processing unit can be deployed in the source operand collector, such as the CROSSDOWN cross-thread processing unit structure shown in FIG. 5, which uses multiple selectors (MUX) to implement the circular transfer operation. Assuming that the data bit width in each of the N threads is M bits, a cascade circuit can be constructed from log2(N) two-way selectors with a bit width of 2*M*N bits to perform cascaded data selection.
  • The input of the first selector in the cascade circuit is generated based on the first source data of the N threads: the first source data of each of the N threads is copied to double the bit width (2M bits) as the selector's first input. Denoting the first source data of a thread as SRC0, {SRC0, SRC0} represents the data copied to double bit width; this double-bit-width data shifted to the right by M bits serves as the second input. Bit 0 of SRC1 is used as the selection bit: if bit 0 of SRC1 is 0, the double-bit-width copy is selected for output; if bit 0 of SRC1 is 1, the double-bit-width copy shifted to the right is selected for output (or vice versa).
  • One of the inputs to the i-th stage selector thereafter comes from the output of the selector at the previous stage, and the other input is the data shifted to the right by (i+1)*M bits from the output of the selector at the previous stage.
  • Bit i of the binary representation of SRC1 is used as the selection bit. For example, if bit i of SRC1 is 0, the data output that only copies to double the bit width is selected; if bit i of SRC1 is 1, the data output that is copied to double the bit width and shifted to the right is selected (or vice versa); it should be noted that the meaning of the selection-bit value is the same in every selector stage.
  • The selector of the last stage uses bit log2(N)-1 of the binary representation of SRC1 as the selection bit, and its output data is sent to the arithmetic logic unit ALU as an operand; according to the operation type indicated by the second operation code of the aforementioned first operation instruction, the ALU can perform calculations on the operands before and after the cross-thread movement, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.
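  • The cascaded selection can be sketched behaviourally as follows; it assumes that stage k's alternative input is the data rotated by 2^k threads so that the selected stages compose into a rotation by SRC1 (this per-stage shift amount is an assumption, not a statement about the patent's circuit):

```python
# Behavioural sketch of a log2(N)-stage cascade of two-way selectors: stage k
# either passes its input through or rotates it by 2**k threads, selected by
# bit k of SRC1, so the composed result is a rotation by SRC1 % N threads.
def crossdown_cascade(src0, src1):
    n = len(src0)
    data = list(src0)
    stage = 0
    while (1 << stage) < n:                     # log2(N) stages
        if (src1 >> stage) & 1:                 # bit `stage` of SRC1 is the select bit
            shift = 1 << stage
            data = data[shift:] + data[:shift]  # rotate toward lower thread numbers
        stage += 1
    return data

assert crossdown_cascade(list(range(32)), 2) == [(i + 2) % 32 for i in range(32)]
```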
  • the arithmetic logic unit ALU may, for a first thread among the N threads, perform the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  • the first thread may include some or all of the N threads.
  • a thread flag bit may be configured for each of the N threads, and the thread flag bit is used to indicate whether the first source data of the thread participates in the calculation operation. Specifically, when the thread flag bit takes a first value, the thread flag bit is used to indicate that the source data of the thread participates in an operation; or, when the thread flag bit takes a second value, the thread flag bit is used for Indicates that the source data of the thread does not participate in the calculation operation. Wherein, the first value may be 1, and the second value may be 0. Setting the thread flag bit marks the threads that do not need to be calculated, which can save calculation power consumption in this way.
  • The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data before and after the movement on a thread participates in the calculation. Specifically, for the thread numbered i among the N threads (referred to as thread i), whether the data before and after the movement on thread i participates in the operation can be determined from the thread flag bit of thread i and the thread flag bit of the thread numbered ((i+SRC1)%N). This can also be understood as follows: after the data is moved, the thread flag of thread i is updated according to the original thread flag of thread i and the thread flag of the source of the moved data, that is, the thread numbered ((i+SRC1)%N).
  • If lanemask[i] represents the value of the original thread flag of thread i and lanemask[(i+SRC1)%N] represents the value of the flag of the thread numbered ((i+SRC1)%N), the updated thread flag of thread i is new_lanemask[i] = lanemask[i] & lanemask[(i+SRC1)%N]; new_lanemask[i] being 1 means that the data before and after the movement on thread i participates in the operation.
  • That is, the data before and after the move on the first thread participates in the operation only if the following conditions are met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation.
  • FIG. 6 is a schematic diagram of thread flag bits. Before the move, the original thread flag bits of thread 1 and thread 28 are 0 (filled in black). After the cross-thread data move operation, that is, the circular move, the updated thread flag bits of thread 1, thread 26, thread 28, and thread 31 are all 0, and the data before and after the move on thread 1, thread 26, thread 28, and thread 31 does not participate in the calculation.
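  • A minimal sketch of this flag update (an illustration consistent with the rule above, not the patent's hardware):

```python
# Thread-flag update for the circular move: thread i participates only if both
# its own flag and the flag of the source thread (i + SRC1) % N are 1.
def update_lanemask_crossdown(lanemask, src1):
    n = len(lanemask)
    return [lanemask[i] & lanemask[(i + src1) % n] for i in range(n)]

# FIG. 6 example: 32 threads, SRC1 = 2, original flags of threads 1 and 28 are 0.
lanemask = [1] * 32
lanemask[1] = lanemask[28] = 0
new_lanemask = update_lanemask_crossdown(lanemask, 2)
assert [i for i, m in enumerate(new_lanemask) if m == 0] == [1, 26, 28, 31]
```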
  • Scheme 1 uses a single instruction to move data circularly across threads, which can be applied to reduction calculations in parallel computing.
  • Scheme 1 can also be used to realize multi-thread accumulation and multiplication operations, for example by constructing multiple levels of instructions: each level uses the aforementioned first operation instruction, the output of each level is used as the input of the next level, and finally the multi-thread data is gathered onto the same thread to realize accumulation, multiplication, and other operations, as sketched after this item.
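  • One possible multi-level construction (an assumption consistent with the description above, not the only one) uses log2(N) CROSSDOWN.FADD levels with offsets 1, 2, 4, ...; after the last level every thread, including any designated result thread, holds the full sum:

```python
# Hedged sketch: repeated circular moves plus FADD accumulate the data of all
# threads; offsets 1, 2, 4, ... double the accumulated window at each level.
def reduce_sum_all_threads(src0):
    n = len(src0)
    data = list(src0)
    offset = 1
    while offset < n:
        moved = [data[(i + offset) % n] for i in range(n)]   # CROSSDOWN move
        data = [a + b for a, b in zip(data, moved)]          # .FADD step
        offset *= 2
    return data

assert reduce_sum_all_threads(list(range(32))) == [sum(range(32))] * 32
```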
  • the source operand collector deploys the same number of threads as the parallel computing processors, such as N threads.
  • One instruction code can be used to realize the cross movement of data of N threads.
  • The instruction encoding is recorded as the first operation instruction, and the first operation instruction may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, source operand 1 in the first operation instruction), and the second source operand (that is, source operand 2 in the first operation instruction).
  • the first operation code is CROSS QUAD BUTTERFLY, indicating that the data transfer method between N threads in the second solution is cross transfer or the second data transfer method;
  • the second operation code indicates the operation type, such as floating-point addition FADD;
  • the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address;
  • the first source operand is the general-purpose register R1, the first source data of the N threads is indicated by the general-purpose register address, and the initial data in the general-purpose register R1 is the data of the N threads of the parallel computing processor;
  • the second source operand may be an immediate value, such as 2.
  • the expression of the first operation instruction may be recorded as: CROSS QUAD BUTTERFLY.FADD R0,R1,2.
  • The source operand collector moves the first source data of the N threads according to the first operation instruction; specifically, this can be implemented as follows: the first source data of the thread numbered I2 is moved to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I2 is the XOR value of i and SRC1; SRC1 represents the second source operand, and SRC1 is a positive integer.
  • Take as an example that both the source operand collector and the parallel computing processor have 32 threads and that SRC1 is 2.
  • As shown in FIG. 7, the data moved on the thread numbered 0 is the first source data on the thread numbered 2, and the data moved on the thread numbered 2 is the first source data on the thread numbered 0;
  • the data moved on the thread numbered 29 is the first source data on the thread numbered 31, and the data moved on the thread numbered 31 is the first source data on the thread numbered 29, and so on.
  • The CROSS QUAD BUTTERFLY of Scheme 2 groups the N threads, such as 32 threads, into QUADs of 4 consecutive threads each and realizes the pairwise exchange of data between threads within each QUAD, as sketched below.
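  • A minimal sketch of the XOR-based cross move I2 = i XOR SRC1 (an illustration, with the thread number used as the data value):

```python
def cross_quad_butterfly_move(src0, src1):
    """Cross move: thread i receives the source data of thread i XOR src1."""
    return [src0[i ^ src1] for i in range(len(src0))]

# 32 threads, SRC1 = 2: within each QUAD of 4 consecutive threads,
# pairs of threads exchange their data.
src0 = list(range(32))
moved = cross_quad_butterfly_move(src0, 2)
assert moved[0] == src0[2] and moved[2] == src0[0]
assert moved[29] == src0[31] and moved[31] == src0[29]
```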
  • The CROSS QUAD BUTTERFLY cross-thread processing unit can be deployed in the source operand collector; this unit uses multiple four-way selectors (MUX) to realize the cross transfer operation. Assuming that the data bit width in each of the N threads is M bits, parallel data selection can be performed through N four-way selectors with a bit width of M bits.
  • Figure 8 shows the i-th four-selector MUX in the cross-thread processing unit of CROSS QUAD BUTTERFLY. The input of the i-th four-selector MUX is the first source data of the four threads of the QUAD to which the thread belongs.
  • the numbers of the aforementioned four threads are respectively ⌊i/4⌋×4, ⌊i/4⌋×4+1, ⌊i/4⌋×4+2, and ⌊i/4⌋×4+3.
  • The i-th selector uses the XOR result of SRC1 and i as the selection bit, selects one of the four inputs, and outputs it to the arithmetic logic unit ALU; according to the operation type indicated by the second operation code of the aforementioned first operation instruction, the ALU can perform calculations on the operands before and after the cross-thread movement, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.
  • the arithmetic logic unit ALU may, for a first thread among the N threads, perform the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  • the first thread may include some or all of the N threads.
  • a thread flag bit may be configured for each of the N threads, and the thread flag bit is used to indicate whether the first source data of the thread participates in the calculation operation. Specifically, when the thread flag bit takes a first value, the thread flag bit is used to indicate that the source data of the thread participates in an operation; or, when the thread flag bit takes a second value, the thread flag bit is used for Indicates that the source data of the thread does not participate in the calculation operation. Wherein, the first value may be 1, and the second value may be 0. Setting the thread flag bit marks the threads that do not need to be calculated, which can save calculation power consumption in this way.
  • The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data before and after the movement on a thread participates in the calculation. Specifically, for the thread numbered i among the N threads (referred to as thread i), whether the data before and after the movement on thread i participates in the operation can be determined from the thread flag bit of thread i and the thread flag bit of the thread numbered (i ⊕ SRC1). This can also be understood as follows: after the data is moved, the thread flag of thread i is updated according to the original thread flag of thread i and the thread flag of the source of the moved data, that is, the thread numbered (i ⊕ SRC1).
  • If lanemask[i] represents the value of the original thread flag of thread i and lanemask[i ⊕ SRC1] represents the value of the flag of the thread numbered (i ⊕ SRC1), the updated thread flag of thread i is new_lanemask[i] = lanemask[i] & lanemask[i ⊕ SRC1]; new_lanemask[i] being 1 means that the data before and after the movement on thread i participates in the operation.
  • That is, the data before and after the move on the first thread participates in the operation only if the following conditions are met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation.
  • FIG. 9 is a schematic diagram of thread flag bits. Before the move, the original thread flag bits of thread 1 and thread 28 are 0 (filled in black). After the cross-thread data move operation, that is, the cross move, the updated thread flag bits of thread 1, thread 3, thread 28, and thread 30 are all 0, and the data before and after the move on thread 1, thread 3, thread 28, and thread 30 does not participate in the calculation.
  • Scheme 2 uses a single instruction to achieve cross-thread data transfer and can confine the data exchange between threads within the small range of a QUAD, which can be applied to difference calculations in image processing, such as the comparison of two pixels that are located close to each other.
  • the source operand collector deploys the same number of threads as the parallel computing processors, such as N threads.
  • One instruction code can be used to realize the cross movement of data of N threads.
  • The instruction encoding is recorded as the first operation instruction, and the first operation instruction may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, source operand 1 in the first operation instruction), and the second source operand (that is, source operand 2 in the first operation instruction).
  • the first operation code is CROSS QUAD-BROADCAST, indicating that the data transfer method between N threads in the third scheme is one-to-many transfer or diffusion transfer, or the third data transfer method;
  • the second operation The code indicates the operation type, such as floating-point addition FADD;
  • the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address;
  • the first source operand is the general-purpose register R1, the first source data of the N threads is indicated by the general-purpose register address, and the initial data in the general-purpose register R1 is the data of the N threads of the parallel computing processor;
  • the second source operand can be an immediate value, such as 2.
  • the expression of the first operation instruction may be recorded as: CROSS QUAD-BROADCAST.FADD R0,R1,2.
  • The source operand collector moves the first source data of the N threads according to the first operation instruction; specifically, this can be implemented as follows: the first source data of the thread numbered I3 is moved to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 = ⌊i/n⌋ × n + SRC1; SRC1 represents the second source operand, SRC1 is a positive integer, n is a positive integer that divides N, and ⌊ ⌋ indicates rounding down.
  • If the first source data on the thread numbered i is denoted SRC0[i], the data moved onto the thread numbered i after the one-to-many move is SRC0[⌊i/n⌋ × n + SRC1], for i ∈ [0, N-1].
  • Take as an example that both the source operand collector and the parallel computing processor have 32 threads, SRC1 is 2, and n is 4.
  • FIG. 10 is a schematic diagram of one-to-many transfer; the first source data of the thread at the tail of an arrow is moved to the thread at the head of the arrow.
  • the first source data of the thread numbered 2 is moved to the thread numbered 0, the thread numbered 1, the thread numbered 2, and the thread numbered 3.
  • the data moved on the thread numbered 0 is the first source data on the thread numbered 2; the data moved on the thread numbered 1 is the first source data on the thread numbered 2; the data moved on the thread numbered 2 is still the first source data on the thread numbered 2; the data moved on the thread numbered 3 is the first source data on the thread numbered 2, and so on.
  • The CROSS QUAD-BROADCAST of Scheme 3 groups the N threads, such as 32 threads, into QUADs of 4 consecutive threads each, and then, for each thread, the first source data of the thread whose number within the QUAD is SRC1 is moved to this thread, as sketched below.
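  • A minimal sketch of the one-to-many move (an illustration assuming SRC1 is smaller than the group size n, with the thread number used as the data value):

```python
def cross_quad_broadcast_move(src0, src1, n=4):
    """One-to-many move: thread i receives the data of thread (i // n) * n + src1."""
    return [src0[(i // n) * n + src1] for i in range(len(src0))]

# 32 threads, SRC1 = 2, n = 4: every thread of a QUAD receives the first
# source data of the thread numbered 2 within that QUAD.
src0 = list(range(32))
moved = cross_quad_broadcast_move(src0, 2)
assert moved[0] == moved[1] == moved[2] == moved[3] == src0[2]
assert moved[4] == moved[5] == moved[6] == moved[7] == src0[6]
```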
  • The CROSS QUAD-BROADCAST cross-thread processing unit can be deployed in the source operand collector; this unit uses multiple four-way selectors (MUX) to realize the one-to-many transfer operation. Assuming that the data bit width in each of the N threads is M bits, parallel data selection can be performed through N four-way selectors with a bit width of M bits.
  • FIG. 11 shows the i-th four-way selector MUX in the CROSS QUAD-BROADCAST cross-thread processing unit; the input of the i-th four-way selector MUX is the first source data of the four threads of the QUAD to which the thread belongs, and the numbers of the aforementioned four threads are respectively ⌊i/4⌋×4, ⌊i/4⌋×4+1, ⌊i/4⌋×4+2, and ⌊i/4⌋×4+3.
  • The i-th selector uses SRC1 as the selection bit and selects one of the four inputs to output to the arithmetic logic unit ALU; the ALU can then perform calculations on the operands before and after the cross-thread movement, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.
  • the arithmetic logic unit ALU may, for a first thread among the N threads, perform the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  • the first thread may include some or all of the N threads.
  • a thread flag bit may be configured for each of the N threads, and the thread flag bit is used to indicate whether the first source data of the thread participates in the calculation operation. Specifically, when the thread flag bit takes a first value, the thread flag bit is used to indicate that the source data of the thread participates in an operation; or, when the thread flag bit takes a second value, the thread flag bit is used for Indicates that the source data of the thread does not participate in the calculation operation. Wherein, the first value may be 1, and the second value may be 0. Setting the thread flag bit marks the threads that do not need to be calculated, which can save calculation power consumption in this way.
  • The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data before and after the movement on a thread participates in the calculation. Specifically, for the thread numbered i among the N threads (referred to as thread i), whether the data before and after the movement on thread i participates in the operation can be determined from the thread flag bit of thread i and the thread flag bit of the thread numbered (⌊i/4⌋×4 + SRC1). This can also be understood as follows: after the data is moved, the thread flag of thread i is updated according to the original thread flag of thread i and the thread flag of the source of the moved data, that is, the thread numbered (⌊i/4⌋×4 + SRC1).
  • If lanemask[i] represents the value of the original thread flag of thread i and lanemask[⌊i/4⌋×4 + SRC1] represents the value of the flag of the thread numbered (⌊i/4⌋×4 + SRC1), the updated thread flag of thread i is new_lanemask[i] = lanemask[i] & lanemask[⌊i/4⌋×4 + SRC1]; new_lanemask[i] being 1 means that the data before and after the movement on thread i participates in the operation.
  • That is, the data before and after the move on the first thread participates in the operation only if the following conditions are met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation.
  • Figure 12 a schematic diagram of a thread flag bit, the thread 1 before the move is filled in black, and the original thread flag bit of thread 30 is 0.
  • the updated thread 1 After the cross-thread data move operation, that is, the price difference cross move, the updated thread 1,
  • the thread flag bits of threads 28-31 are all 0. The data before and after the migration on thread 1 and threads 28-31 does not participate in the calculation.
  • the third solution uses a single instruction to achieve cross-thread data transfer, and can lock the data of a certain thread in a small range of QUAD, which can be applied to the difference calculation in image processing, such as four adjacent pixels in position, Smoothing based on one of the pixels, etc.
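  • The sketch below is an assumed illustration of the image-processing use case mentioned above, not a shader taken from the patent; the pixel values and the subtraction step are invented for the example.

```python
# Illustrative sketch: each QUAD holds 4 adjacent pixels, the value of the
# SRC1-selected pixel is broadcast to the whole QUAD, and each thread then
# subtracts it to obtain a per-pixel difference (assumed usage).
def quad_difference(pixels, src1):
    out = []
    for i, p in enumerate(pixels):
        anchor = pixels[(i // 4) * 4 + (src1 % 4)]   # moved (broadcast) first data
        out.append(p - anchor)                       # ALU subtraction of pre/post-move data
    return out

print(quad_difference([10, 12, 11, 9, 50, 52, 49, 51], 0))
# [0, 2, 1, -1, 0, 2, -1, 1]
```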
  • scheme 1 and scheme 3 can also be combined: the calculation result of each thread in scheme 1 is used as the source data of the corresponding thread in scheme 3 to perform the one-to-many transfer operation.
  • the embodiment of the present application also provides a cross-thread data processing flow, which can be executed cooperatively by various units in a parallel computing processor. Mainly include the following steps.
  • the instruction scheduler inputs instructions.
  • the parallel computing program or graphics rendering program is compiled into a binary instruction code by a compiler and then configured into the SIMD processor.
  • the instruction code is used as the input of the instruction scheduler; the data to be processed is configured into storage by software and, before instructions start to be issued, is initialized into the registers as the data input of the register module.
  • the instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, destination operand, etc.) and operation codes, such as the first operation code and the second operation code.
  • after the instruction decoder parses the source operands, it sends a source-operand read request to the general-purpose register, and the general-purpose register returns the data corresponding to the source operands to the collector.
  • the source operand collector judges whether the first opcode is a CROSS-type instruction. If not, step (5) is performed to send the data to the downstream ALU for calculation. If it is a CROSS-type instruction, it is further judged whether it is a CROSS DOWN, CROSS QUAD BUTTERFLY or CROSS QUAD BROADCAST instruction, the instruction is processed by the corresponding processing unit (e.g., the cross-thread data move operation is performed), and then step (5) is performed to send the data to the downstream ALU for calculation.
  • the ALU performs corresponding calculations according to the second operation code, and the result is sent to the next module for processing.
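  • As a non-authoritative aid, the following Python sketch mirrors the dispatch flow described above; the opcode strings and the `alu` callback are assumptions made for the example, and the move formulas follow the three schemes defined elsewhere in this description.

```python
# Illustrative control-flow sketch of steps (3)-(5): the collector inspects the
# first opcode and, for CROSS-type instructions, runs the matching cross-thread
# move before handing data to the ALU.
def collect_and_dispatch(first_opcode, src_data, src1, alu):
    n = len(src_data)
    if first_opcode == "CROSS_DOWN":
        moved = [src_data[(i + src1) % n] for i in range(n)]          # circular move
    elif first_opcode == "CROSS_QUAD_BUTTERFLY":
        moved = [src_data[i ^ src1] for i in range(n)]                # XOR (butterfly) move
    elif first_opcode == "CROSS_QUAD_BROADCAST":
        moved = [src_data[(i // 4) * 4 + src1 % 4] for i in range(n)] # per-QUAD broadcast
    else:
        moved = None                      # non-CROSS instruction: no move is performed
    return alu(src_data, moved)           # step (5): downstream ALU computes per the second opcode
```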
  • in the fourth scheme, the source operand collector deploys fewer threads than the parallel computing processor. For example, the source operand collector deploys N threads while the parallel computing processor has 2N threads. Two instruction codes can then be used to realize the circular transfer of data among the 2N threads of the parallel computing processor.
  • the codes of the two instructions can be recorded as the first operation instruction and the second operation instruction.
  • the first operation instruction can refer to the definition of Scheme 1.
  • the source data indicated by source operand 1 differs between the first operation instruction and the second operation instruction.
  • the first source data of the N threads indicated by the first source operand in the first operation instruction comes from N consecutive threads of the parallel computing processor.
  • the source operand in the second operation instruction indicates the second source data of the N threads (in the source operand collector), and the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor.
  • the second operation instruction may also include other parameters that are the same as those of the first operation instruction, such as a first operation code, a second operation code, a destination operand, and a second source operand.
  • for the specific implementation of the source operand collector moving the first source data of the N threads according to the first operation instruction, and moving the second source data of the N threads according to the second operation instruction, reference can be made to scheme 1, which is not repeated in this embodiment of the present application.
  • the source operand collector can deploy 32 threads
  • the parallel computing processor includes 64 threads
  • the SRC1 is 2.
  • the time when the first operation instruction is issued is ahead of the time when the second operation instruction is issued; denote the timing interval between the two instructions as m; when m is 1, the first operation instruction and the second operation instruction are two instructions issued back to back.
  • Each instruction processes N threads.
  • Figure 14 shows a schematic diagram of source data sources.
  • the N threads of the source operand collector are numbered from 0 to 31.
  • the first source data of the N threads indicated by the first operation instruction comes from the threads numbered 32-63 in the parallel computing processor: the first source data of thread 0 in the source operand collector comes from thread 32 in the parallel computing processor, the first source data of thread 1 comes from thread 33, and so on, until the first source data of thread 31 comes from thread 63. The second source data of the N threads indicated by the second operation instruction comes from the threads numbered 0-31 in the parallel computing processor: the second source data of thread 0 in the source operand collector comes from thread 0 in the parallel computing processor, the second source data of thread 1 comes from thread 1, and so on, until the second source data of thread 31 comes from thread 31.
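  • For illustration only (the data values are invented), the source mapping just described can be written down as the following Python check.

```python
# Illustrative sketch of the SIMD-2N source mapping: the first operation
# instruction feeds collector threads 0..N-1 from processor threads N..2N-1,
# and the second instruction feeds them from processor threads 0..N-1.
N = 32
processor_data = list(range(64))                          # data of processor threads 0..63
first_src  = [processor_data[N + i] for i in range(N)]    # sources of the first operation instruction
second_src = [processor_data[i] for i in range(N)]        # sources of the second operation instruction
assert first_src[0] == processor_data[32] and first_src[31] == processor_data[63]
assert second_src[0] == processor_data[0] and second_src[31] == processor_data[31]
```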
  • the embodiment of the present application provides another schematic diagram of circular transfer.
  • the source operand collector first obtains the first operation instruction; after the source operand collector executes the circular move of cross-thread data according to the first operation instruction, in the source operand collector the data moved onto thread 0 is the first source data of thread 2, the data moved onto thread 1 is the first source data of thread 3, ..., the data moved onto thread 30 is the first source data of thread 0, and the data moved onto thread 31 is the first source data of thread 1.
  • the source operation data collector inputs the migration result corresponding to the first operation instruction, recorded as the first data of N threads, to the ALU.
  • the source operand collector then obtains the second operation instruction; after the source operand collector executes the circular move of cross-thread data according to the second operation instruction, in the source operand collector the data moved onto thread 0 is the second source data of thread 2, the data moved onto thread 1 is the second source data of thread 3, ..., the data moved onto thread 30 is the second source data of thread 0, and the data moved onto thread 31 is the second source data of thread 1.
  • the source operation data collector inputs the moving result corresponding to the second operation instruction, recorded as the second data of N threads, to the ALU.
  • the first operation instruction arrives earlier than the second operation instruction; assume that the second operation instruction arrives at stage I of the ALU while the first operation instruction has reached stage I+m of the ALU, I being an arbitrary stage of the ALU. The arithmetic logic unit (ALU) can then exchange the moved first data on a third thread with the moved second data on that third thread, where the third thread is the thread numbered r among the N threads in the source operand collector.
  • r can be determined as follows (see the range sketch below): if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
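  • The following small Python helper, provided only as an illustration of the rule above, computes the set of thread numbers r that take part in the exchange.

```python
# Illustrative helper: collector thread numbers r whose moved first data and
# moved second data are exchanged in the ALU.
def exchange_threads(src1, n):
    if src1 < n:
        return range(n - src1, n)          # r in [N-SRC1, N)
    return range(0, n - src1 % n)          # r in [0, N-SRC1%N)

print(list(exchange_threads(2, 32)))       # [30, 31] for the SRC1=2, N=32 example
```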
  • FIG. 16 is a schematic diagram of data replacement.
  • with reference to FIG. 15, when the first operation instruction arrives at stage 0 of the ALU, the moved first data on thread 30 is exchanged with the moved second data on thread 30, and the moved first data on thread 31 is exchanged with the moved second data on thread 31; this realizes the circular move of data among the 64 threads of the parallel computing processor.
  • the ALU then performs the corresponding calculation on the result of the circular transfer of data among the 64 threads of the parallel computing processor, according to the second operation code in the first operation instruction/second operation instruction (see the end-to-end sketch after the next item).
  • the specific computing operations may be performed according to actual requirements, which is not limited in this embodiment of the present application.
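  • As a non-authoritative software check of the scheme described above (the data values are invented and the pipeline timing is not modeled), the following Python sketch combines the two CROSS DOWN moves with the ALU exchange and verifies that the result equals a circular move over all 64 processor threads.

```python
# Illustrative end-to-end model of scheme 4: two CROSS DOWN moves over N=32
# collector threads plus the ALU exchange reproduce a circular move over the
# 64 processor threads.
N, SRC1 = 32, 2
data = list(range(64))                                     # processor threads 0..63
first_src  = data[N:]                                      # sources of the first operation instruction
second_src = data[:N]                                      # sources of the second operation instruction
moved1 = [first_src[(i + SRC1) % N] for i in range(N)]     # circular move, instruction 1
moved2 = [second_src[(i + SRC1) % N] for i in range(N)]    # circular move, instruction 2
for r in range(N - SRC1, N):                               # exchange on threads r = 30, 31
    moved1[r], moved2[r] = moved2[r], moved1[r]
result = moved2 + moved1                                    # threads 0..31 then 32..63
assert result == [data[(i + SRC1) % 64] for i in range(64)] # 64-thread circular move
```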
  • a thread flag bit can also be configured for each of the N threads deployed by the source operand collector, and the thread flag bit is used to indicate whether the first source data of the thread participates in the calculation operation .
  • the specific implementation can be carried out with reference to scheme 1 and is not repeated in this embodiment of the present application. As an example, in FIG. 16, black filling indicates that the data before and after the move on the threads numbered 30 and 31 does not participate in the calculation operation.
  • in the fourth solution, the smaller number of threads in the source operand collector is combined with ALU data-exchange processing to realize efficient cross-thread data circulation at a larger SIMD width, which can be applied to reduction calculations in parallel computing.
  • solution four can also be used to realize multi-thread accumulation and multiplication operations, for example by constructing multiple levels of instructions: each level uses the aforementioned first operation instruction, the output of each level serves as the input of the next level, and the multi-thread data is finally moved onto the same thread to realize accumulation, multiplication and other operations (a reduction sketch follows this item).
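  • The sketch below is one assumed instantiation of the multi-level construction just mentioned: the particular offsets (N/2, N/4, ...) are chosen for the example and are not dictated by the filing.

```python
# Illustrative sketch: log2(N) CROSS DOWN + add steps accumulate the data of all
# N threads onto every thread (the same idea can place the full sum onto one
# chosen thread).
def reduce_sum(data):
    n = len(data)                      # n assumed to be a power of two
    offset = n // 2
    while offset:
        moved = [data[(i + offset) % n] for i in range(n)]   # CROSS DOWN by "offset"
        data = [data[i] + moved[i] for i in range(n)]        # ALU add of pre/post-move data
        offset //= 2
    return data

print(reduce_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # every thread ends up with 36
```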
  • the embodiment of the present application provides a cross-thread data processing flow, which can be executed cooperatively by various units in a parallel computing processor. Mainly include the following steps.
  • the instruction scheduler inputs instructions.
  • the parallel computing program or graphics rendering program is compiled into a binary instruction code by a compiler and then configured into the SIMD processor.
  • the instruction code is used as the input of the instruction scheduler; the data to be processed is configured into storage by software and, before instructions start to be issued, is initialized into the registers as the data input of the register module.
  • the instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, destination operand, etc.) and operation codes, such as the first operation code and the second operation code.
  • after the instruction decoder parses the source operands, it sends a source-operand read request to the general-purpose register, and the general-purpose register returns the data corresponding to the source operands to the collector.
  • the source operand collector judges whether the first opcode is a CROSS-type instruction. If not, step (10) is executed to send the data to the next module. If so, when it is determined to be a CROSS DOWN instruction, step (5) is performed.
  • the source operand collector performs CROSS DOWN data processing such as circular transfer operations.
  • in ALU stage I, it is judged whether the instruction is the second CROSS DOWN instruction (i.e., the aforementioned second operation instruction). If not, step (10) is executed to send the data to the next module; if yes, step (7) is executed.
  • the ALU judges whether the value of SRC1 is smaller than N; if yes, execute step (8); if not, execute step (9).
  • FIG. 18 illustrates a cross-thread data processing flow, which mainly includes the following steps.
  • the instruction scheduler inputs instructions.
  • the parallel computing program or graphics rendering program is compiled into a binary instruction code by a compiler and then configured into the SIMD processor.
  • the instruction code is used as the input of the instruction scheduler; the data to be processed is configured into storage by software and, before instructions start to be issued, is initialized into the registers as the data input of the register module.
  • the instruction scheduler judges whether the mode is SIMD 2N, that is, whether the number of SIMD threads is twice the number of threads in the source operand collector. If not, step (3) is executed and the instruction is issued only once; if yes, step (4) is executed and the instruction is issued twice.
  • the instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, destination operand, etc.) and operation codes, such as the first operation code and the second operation code.
  • after the instruction decoder parses the source operands, it sends a source-operand read request to the general-purpose register, and the general-purpose register returns the data corresponding to the source operands to the collector.
  • the source operand collector executes the moving operation of the first source data between the N threads indicated by the first operation instruction according to the above schemes 1 to 4.
  • that is, the data move for SIMD N is carried out according to scheme 1, scheme 2, scheme 3 or scheme 4 above.
  • in step (6), the ALU judges whether the instruction is a CROSS DOWN instruction in SIMD 2N mode, or judges whether the second CROSS DOWN instruction (i.e., the aforementioned second operation instruction) has been received. If not, step (8) is executed to send the data to the next module; if yes, step (7) is executed.
  • for the threads whose numbers are greater than or equal to (N-SRC1) and less than N, the data in ALU stage I and ALU stage I+m is exchanged.
  • for the threads whose numbers are greater than or equal to 0 and less than (N-SRC1%N), the data in ALU stage I and ALU stage I+m is exchanged.
  • an embodiment of the present application provides a multi-thread data processing method, as shown in FIG. 19 .
  • the method mainly includes the following processes.
  • the first operation instruction includes the following parameters: a first operation code, the first operation code is used to indicate the data transfer mode between N threads, and N is an integer greater than or equal to 2; the first source operand, the The first source operand is used to indicate the first source data of the N threads; the second source operand is used to determine the thread offset corresponding to the data movement mode;
  • the efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a cross-bar network and does not require frequent memory accesses, so that accelerated processing of cross-thread operations in a high-performance parallel computing processor can be realized with lower hardware or signaling overhead.
  • when the data moving method is the first moving method, moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I1 to the thread numbered i; the N threads are numbered 0 to (N-1), i takes each value from 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N; SRC1 represents the second source operand and is a positive integer (see the sketch below).
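  • A minimal, non-authoritative Python sketch of the first moving method, with invented example data:

```python
# Thread i receives the first source data of thread I1 = (i + SRC1) % N,
# i.e. a circular move across the N threads.
def cross_down(src, src1):
    n = len(src)
    return [src[(i + src1) % n] for i in range(n)]

print(cross_down([0, 1, 2, 3, 4, 5, 6, 7], 2))   # [2, 3, 4, 5, 6, 7, 0, 1]
```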
  • when the data moving method is the second moving method, moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I2 to the thread numbered i; the N threads are numbered 0 to (N-1), i takes each value from 0 to (N-1), and I2 is the bitwise XOR of i and SRC1; SRC1 represents the second source operand and is a positive integer (see the sketch below).
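  • A minimal, non-authoritative Python sketch of the second moving method, with invented example data:

```python
# Thread i receives the first source data of thread I2 = i XOR SRC1
# (a butterfly exchange pattern).
def cross_butterfly(src, src1):
    return [src[i ^ src1] for i in range(len(src))]

print(cross_butterfly([0, 1, 2, 3, 4, 5, 6, 7], 1))   # [1, 0, 3, 2, 5, 4, 7, 6]
```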
  • when the data moving method is the third offset method, moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I3 to the thread numbered i; the N threads are numbered 0 to (N-1) and i takes each value from 0 to (N-1); I3 is the thread, within the group of n consecutive threads to which thread i belongs, that is selected by SRC1; SRC1 represents the second source operand and is a positive integer, and n is a positive integer that divides N (see the sketch below).
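  • The exact closed form of I3 is given by a formula image in the original filing; the Python sketch below is an assumed equivalent reading in which the N threads are split into groups of n consecutive threads and SRC1 selects the source thread inside each group.

```python
# Assumed reading of the third moving method: per-group (size n) broadcast.
def cross_broadcast(src, src1, n):
    return [src[(i // n) * n + src1 % n] for i in range(len(src))]

print(cross_broadcast([0, 1, 2, 3, 4, 5, 6, 7], 1, 4))   # [1, 1, 1, 1, 5, 5, 5, 5]
```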
  • the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type; the method further includes: for a first thread among the N threads, performing the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  • each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
  • the first data moved onto the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
  • the first operation instruction further includes a destination operand, and the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
  • the first source data of the N threads comes from N consecutive threads of a parallel computing processor, the parallel computing processor includes 2N threads, and the method further includes: acquiring a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor.
  • the method also includes: moving the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads; and exchanging the moved first data on a third thread with the moved second data on the third thread, where the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
  • the embodiment of the present application also provides a multi-thread data processing device 2000, the multi-thread data processing device includes:
  • the instruction acquisition module 2001 is configured to acquire a first operation instruction, the first operation instruction including the following parameters: a first operation code, the first operation code being used to indicate a data transfer mode between N threads, where N is an integer greater than or equal to 2; a first source operand, the first source operand being used to indicate the first source data of the N threads; and a second source operand, the second source operand being used to determine the thread offset corresponding to the data movement mode.
  • the processing module 2002 is configured to move the first source data of the N threads according to the first operation instruction, and obtain the moved first data on each of the N threads.
  • the efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a cross-bar network and does not require frequent memory accesses, so that accelerated processing of cross-thread operations in a high-performance parallel computing processor can be realized with lower hardware or signaling overhead.
  • when the data transfer method is the first transfer method, the processing module 2002 is specifically configured to: move the first source data of the thread numbered I1 to the thread numbered i; the N threads are numbered 0 to (N-1), i takes each value from 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N; SRC1 represents the second source operand and is a positive integer.
  • when the data transfer method is the second transfer method, the processing module 2002 is specifically configured to: move the first source data of the thread numbered I2 to the thread numbered i; the N threads are numbered 0 to (N-1) and i takes each value from 0 to (N-1); I2 is the bitwise XOR of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer.
  • when the data moving method is the third offset method, the processing module 2002 is specifically configured to: move the first source data of the thread numbered I3 to the thread numbered i; the N threads are numbered 0 to (N-1) and i takes each value from 0 to (N-1); I3 is the thread, within the group of n consecutive threads to which thread i belongs, that is selected by SRC1; SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
  • the first operation instruction further includes a second operation code, the second operation code being used to indicate the operation type; the processing module 2002 is further configured to: for a first thread among the N threads, execute the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  • each thread in the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
  • the first data moved onto the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
  • the first operation instruction further includes a destination operand, and the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
  • the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads;
  • the instruction acquisition module 2001 is also configured to obtain a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor.
  • the processing module 2002 is further configured to move the second source data of the N threads according to the second operation instruction, to obtain the moved second data on each of the N threads.
  • the processing module 2002 is further configured to: exchange the moved first data on a third thread with the moved second data on the third thread; the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
  • the communication device 2100 may be a chip or a chip system.
  • the system-on-a-chip may be composed of chips, or may include chips and other discrete devices.
  • the communication device 2100 may include at least one processor 2110, and the processor 2110 is coupled to a memory.
  • the memory may be located within the device, the memory may be integrated with the processor, or the memory may be located outside the device.
  • the communication device 2100 may further include at least one memory 2120 .
  • the memory 2120 stores the computer programs or instructions, configuration information and/or data necessary for implementing any of the above embodiments; the processor 2110 may execute the computer programs stored in the memory 2120 to complete the method in any of the above embodiments.
  • the coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
  • Processor 2110 may cooperate with memory 2120 .
  • the specific connection medium between the processor 2110 and the memory 2120 is not limited in this embodiment of the present application.
  • the communication device 2100 may further include a communication interface 2130, and the communication device 2100 may perform information exchange with other devices through the communication interface 2130.
  • the communication interface 2130 may be a transceiver, a circuit, a bus, a module or other types of communication interfaces.
  • the communication interface 2130 in the device 2100 may also be an input/output circuit, which can input information (that is, receive information) and output information (that is, send information); the processor may be an integrated processor, a microprocessor, an integrated circuit or a logic circuit, and the processor can determine the output information according to the input information.
  • the communication interface 2130, the processor 2110 and the memory 2120 are connected to each other through a bus 2140.
  • the bus 2140 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in Fig. 21, but it does not mean that there is only one bus or one type of bus.
  • the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
  • the memory may be a non-volatile memory, such as a hard disk (hard disk drive, HDD) or a solid-state drive (solid-state drive, SSD), etc., and may also be a volatile memory (volatile memory), such as Random-access memory (RAM).
  • the memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory in the embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.
  • an embodiment of the present application further provides a computer program, which, when the computer program is run on a computer, causes the computer to execute the above multi-threaded data processing method.
  • the embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a computer, the computer executes the method described in the above-mentioned method embodiments.
  • the storage medium may be any available medium that can be accessed by a computer.
  • computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the embodiment of the present application also provides a computer chip; the chip is connected to a memory, and the chip is used to read and execute the software program stored in the memory, so as to implement the multi-thread data processing method provided in the above method embodiments.
  • an embodiment of the present application provides a chip system
  • the chip system includes a processor, configured to support a computer device to implement the functions of the multi-threaded data processing method in the above method embodiments.
  • the chip system further includes a memory, and the memory is used to store necessary programs and data of the computer device.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • the technical solutions provided by the embodiments of the present application may be fully or partially implemented by software, hardware, firmware or any combination thereof.
  • When implemented using software, the technical solutions may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general computer, a special computer, a computer network, a network device, a terminal device or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center in a wired manner (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (digital video disc, DVD)), or a semiconductor medium.
  • the various embodiments may refer to each other; for example, the methods and/or terms between the method embodiments may refer to each other, the functions and/or terms between the apparatus embodiments may refer to each other, and the functions and/or terms between the apparatus embodiments and the method embodiments may refer to each other.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction means, and the instruction means implements the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

Abstract

Disclosed in the present application are a multi-thread data processing method and apparatus, which are used for solving the problem of cross-thread computing being complicated and having large overheads. The method comprises: acquiring a first operation instruction, wherein the first operation instruction comprises the following parameters: a first operation code, which is used for indicating a data transfer mode between N threads, N being an integer greater than or equal to 2, a first source operand, which is used for indicating first source data of the N threads, and a second source operand, which is used for determining a thread offset corresponding to the data transfer mode; and transferring the first source data of the N threads according to the first operation instruction, so as to obtain the transferred first data on each of the N threads.

Description

A multi-thread data processing method and device

Technical field

The present application relates to the technical field of parallel computing, and in particular to a multi-thread data processing method and device.

Background technique

With the increase of application programs' demand for data processing capability, parallel computing processors, such as single instruction multiple data (SIMD) processors, are introduced into computer systems. More and more parallel computing programs require cross-thread computing, which involves data exchange between threads.

Traditional parallel processor solutions are divided into software solutions and hardware solutions. The software solution uses shared on-chip storage: the data is stored into the shared on-chip storage, the thread address is modified, and the data is then fetched back into the core registers, so as to realize the exchange of data between threads. The software solution involves frequent memory-access operations, resulting in low execution efficiency and high power consumption. The hardware solution generally uses a complex cross network (cross bar), in which the data of each output thread can come from any input thread, so as to achieve thread data exchange. However, the hardware solution requires a higher hardware cost.
Contents of the invention

The present application provides a multi-thread data processing method and device, which can improve execution performance and realize the cross-thread operations involved in parallel computing at a lower hardware cost.

In a first aspect, an embodiment of the present application provides a multi-thread data processing method, the method including: acquiring a first operation instruction, where the first operation instruction includes the following parameters: a first operation code, the first operation code being used to indicate a data movement mode between N threads, N being an integer greater than or equal to 2; a first source operand, the first source operand being used to indicate the first source data of the N threads; and a second source operand, the second source operand being used to determine the thread offset corresponding to the data movement mode; and moving the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.

In the embodiments of the present application, the efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a cross-bar network and does not require frequent memory accesses, so that accelerated processing of cross-thread operations in a high-performance parallel computing processor can be realized with lower hardware or signaling overhead.
In an optional implementation manner, the data moving method is the first moving method, and moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I1 to the thread numbered i, where the N threads are numbered 0 to (N-1), i takes each value from 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N; SRC1 represents the second source operand and is a positive integer. Through such a design, efficient cross-thread operation, namely a circular move between multi-thread data, is realized at low hardware cost, which can effectively accelerate reduction algorithms in parallel computing.

In an optional implementation manner, the data moving method is the second moving method, and moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I2 to the thread numbered i, where the N threads are numbered 0 to (N-1) and i takes each value from 0 to (N-1); I2 is the bitwise XOR of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer. Through such a design, efficient cross-thread operation, namely a cross move between multi-thread data, is realized at low hardware cost, which can effectively accelerate differential calculations in graphics processing.
In an optional implementation manner, the data moving method is a third offset method, and moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I3 to the thread numbered i, where the N threads are numbered 0 to (N-1) and i takes each value from 0 to (N-1); the value of I3 is given by the formula shown in image PCTCN2021101533-appb-000001 of the original filing; SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that can divide N. Through such a design, efficient cross-thread operation, namely a one-to-many move between multi-thread data, is realized at low hardware cost, which can effectively accelerate differential calculations in graphics processing.
In an optional implementation manner, the first operation instruction further includes a second operation code, the second operation code being used to indicate an operation type; the method further includes: for a first thread among the N threads, performing the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
In an optional implementation manner, each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in the operation. Through such a design, data that does not need to be calculated can be excluded, reducing computation overhead.

In an optional implementation manner, the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation.

In an optional implementation manner, the first operation instruction further includes a destination operand, and the destination operand is used to indicate the storage location of the operation result corresponding to the first thread.

In an optional implementation manner, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, the parallel computing processor includes 2N threads, and the method further includes: acquiring a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor; and moving the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.

In an optional implementation manner, the method further includes: exchanging the moved first data on a third thread with the moved second data on the third thread, where the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N). Through such a design, efficient cross-thread operation at a larger parallel computing processor (e.g., SIMD) width is realized at an even lower hardware cost, which can effectively accelerate reduction algorithms in parallel computing.
In a second aspect, an embodiment of the present application provides a multi-thread data processing device, the device including: an instruction acquisition module, configured to acquire a first operation instruction, the first operation instruction including the following parameters: a first operation code, the first operation code being used to indicate a data movement mode between N threads, N being an integer greater than or equal to 2; a first source operand, the first source operand being used to indicate the first source data of the N threads; and a second source operand, the second source operand being used to determine the thread offset corresponding to the data movement mode; and a processing module, configured to move the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.

In the embodiments of the present application, the efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a cross-bar network and does not require frequent memory accesses, so that accelerated processing of cross-thread operations in a high-performance parallel computing processor can be realized with lower hardware or signaling overhead.

In an optional implementation manner, the data moving method is the first moving method, and the processing module is specifically configured to: move the first source data of the thread numbered I1 to the thread numbered i, where the N threads are numbered 0 to (N-1), i takes each value from 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N; SRC1 represents the second source operand and is a positive integer.

In an optional implementation manner, the data moving method is the second moving method, and the processing module is specifically configured to: move the first source data of the thread numbered I2 to the thread numbered i, where the N threads are numbered 0 to (N-1) and i takes each value from 0 to (N-1); I2 is the bitwise XOR of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer.
In an optional implementation manner, the data moving method is a third offset method, and the processing module is specifically configured to: move the first source data of the thread numbered I3 to the thread numbered i, where the N threads are numbered 0 to (N-1) and i takes each value from 0 to (N-1); the value of I3 is given by the formula shown in image PCTCN2021101533-appb-000002 of the original filing; SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that can divide N.
In an optional implementation manner, the first operation instruction further includes a second operation code, the second operation code being used to indicate an operation type; the processing module is further configured to: for a first thread among the N threads, perform the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.

In an optional implementation manner, each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in the operation.

In an optional implementation manner, the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation.

In an optional implementation manner, the first operation instruction further includes a destination operand, and the destination operand is used to indicate the storage location of the operation result corresponding to the first thread.

In an optional implementation manner, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads; the instruction acquisition module is further configured to acquire a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor; the processing module is further configured to move the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.

In an optional implementation manner, the processing module is further configured to: exchange the moved first data on a third thread with the moved second data on the third thread, where the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
In a third aspect, the present application provides a communication device, including a processor, the processor being coupled to a memory, the memory being used to store computer programs or instructions, and the processor being used to execute the computer programs or instructions to perform the various implementation methods of any one of the above-mentioned first aspect to fourth aspect. The memory may be located inside the device or outside the device. There may be one or more processors.

In a fourth aspect, the present application also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer is caused to execute the method described in the above first aspect and each optional implementation manner of the first aspect.

In a fifth aspect, the present application also provides a computer program product containing instructions, which, when run on a computer, causes the computer to execute the method described in the above first aspect and each possible implementation manner of the first aspect.

In a sixth aspect, the present application also provides a computer chip, where the chip is connected to a memory, and the chip is used to read and execute the software program stored in the memory, to execute the method described in the above first aspect and each optional implementation manner of the first aspect.

In addition, for the beneficial effects of the second aspect to the sixth aspect, reference may be made to the beneficial effects shown in the first aspect and each optional implementation manner of the first aspect.
Description of drawings

FIG. 1 is a schematic diagram of a circuit structure coupled by a cross network;
FIG. 2 is a schematic diagram of element shifting within a thread;
FIG. 3 is a schematic diagram of a SIMD parallel computing processor system architecture provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of circular transfer provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a CROSS DOWN cross-thread processing unit provided by an embodiment of the present application;
FIG. 6 is one of the schematic diagrams of thread flag bits provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of cross transfer provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a CROSS QUAD BUTTERFLY cross-thread processing unit provided by an embodiment of the present application;
FIG. 9 is one of the schematic diagrams of thread flag bits provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of one-to-many transfer provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a CROSS QUAD-BROADCAST cross-thread processing unit provided by an embodiment of the present application;
FIG. 12 is one of the schematic diagrams of thread flag bits provided by an embodiment of the present application;
FIG. 13 is one of the schematic diagrams of a cross-thread data processing flow provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of source data sources provided by an embodiment of the present application;
FIG. 15 is another schematic diagram of circular transfer provided by an embodiment of the present application;
FIG. 16 is a schematic diagram of data replacement provided by an embodiment of the present application;
FIG. 17 is one of the schematic diagrams of a cross-thread data processing flow provided by an embodiment of the present application;
FIG. 18 is one of the schematic diagrams of a cross-thread data processing flow provided by an embodiment of the present application;
FIG. 19 is one of the schematic flowcharts of the multi-thread data processing method provided by an embodiment of the present application;
FIG. 20 is a schematic structural diagram of a multi-thread data processing device provided by an embodiment of the present application;
FIG. 21 is a schematic structural diagram of a communication device provided by an embodiment of the present application.
Detailed description

The following first introduces related technologies for processing thread data in parallel.

Related technology one:

In a software manner, thread data is read from the core registers and stored into a memory, for example a shared on-chip storage. The thread address of the data is modified, and the data is fetched back into the core registers according to the modified thread address, so that the same thread address corresponds to the original data read from the core registers as well as the data fetched back into the core registers, thereby realizing data exchange between threads. Such a manner involves frequent memory-access operations, resulting in low execution efficiency and high power consumption.

Related technology two:
Referring to FIG. 1, which illustrates a circuit structure coupled by a cross network, the circuit is divided into 4 quadrants by dotted lines, and each quadrant contains 1 vector processor (execution pipelines) and 2 cross network (cross bar) chips used to perform cross-thread data movement operations. The 4 quadrants are respectively recorded as the first quadrant, the second quadrant, the third quadrant and the fourth quadrant. The first quadrant contains vector processor 455, cross network chip 410A (also called cross bar 410A) and cross network chip 410B; the second quadrant contains vector processor 460, cross network chip 420A and cross network chip 420B; the third quadrant contains vector processor 465, cross network chip 430A and cross network chip 430B; the fourth quadrant contains vector processor 470, cross network chip 440A and cross network chip 440B.

This coupling of the output channels of cross bar 410A, cross bar 410B, cross bar 420A, cross bar 420B, cross bar 430A, cross bar 430B, cross bar 440A and cross bar 440B with the various vector processors allows a combination of cross bars with a smaller thread count to achieve cross-thread operations with a larger thread count. The coupling relationship between the cross bars and the vector processors is shown in Table 1 below.
Table 1
Vector processor    Available cross bars
455                 410A, 420A, 430B, 440A
460                 410B, 420B, 430A, 440B
465                 410B, 420B, 430A, 440B
470                 410A, 420A, 430B, 440A
Each of the aforementioned crossbars has 8 input channels and 8 output channels, that is, it is an 8x8 crossbar network. A combination of 4 crossbars can therefore provide 16 input channels and 16 output channels, and a single cross-thread operation instruction can control the permutation of 16 channels. For a cross-thread operation over 32 threads, two back-to-back cross-thread operation instructions, that is, two cross-thread operation instructions that are consecutive in time, can be used to perform a 32x32 permutation. The two cross-thread operation instructions are denoted as a first permutation instruction and a second permutation instruction. The first permutation instruction controls the combined crossbar network to take 16 threads of data as input, perform the permutation operation, output the result and write it back to the vector register file; the second permutation instruction controls the combined crossbar network to take 16 threads of data as input, perform the permutation operation and output the result. The output of the first permutation instruction is then read back and merged with the output of the second permutation instruction to generate the final result of the 32x32 permutation.

In the above related technology 2, although the crossbar-sharing design reduces the number of crossbars, the hardware cost is still considerable. Moreover, each crossbar is shared by two vector processors and can only be used by one of them at a time, so when one vector processor is using a crossbar and the other vector processor also needs the same crossbar, the processor stalls. In addition, a cross-thread operation over 32 threads requires two back-to-back cross-thread instructions to cooperate, and the result of the first instruction has to be written into a register and read out again, which consumes additional power.
Related technology 3:

A vector reduction instruction (VADDREDUCEPS) is defined to perform a shift operation on the data elements within each thread of a multi-thread, so as to achieve a reduction calculation inside the same thread. As illustrated in FIG. 2, 310 is a vector register containing 4 threads, each thread containing 4 elements. After the vector reduction instruction is executed, the data in each thread is shifted right by the bit width of one element unit; the rightmost element in each thread is not shifted and is added to, subtracted from or multiplied with the element shifted onto it; the leftmost element in each thread is padded with 0; and the shift operation does not cross thread boundaries. As illustrated in FIG. 2, after the aforementioned shift operation, 310 is changed into 320, specifically as follows:
{A15,A14,A13,A12}->{0,A15,A14,A13+A12}
{A11,A10,A9,A8}->{0,A11,A10,A9+A8}
{A7,A6,A5,A4}->{0,A7,A6,A5+A4}
{A3,A2,A1,A0}->{0,A3,A2,A1+A0}
In the above related technology 3, the vector reduction instruction can only shift data within each thread and does not involve a true cross-thread operation. Although it can realize a reduction calculation, its efficiency is low. Moreover, it is only applicable to processors with a small number of threads: for a SIMD processor, since the number of threads is large and the register bit width within a thread is small, this technology cannot operate across threads and therefore has poor applicability. In addition, it can only perform partial reduction calculations and cannot be applied to differential calculations in graphics.
Based on this, embodiments of this application provide a multi-thread data processing method and apparatus, which can improve execution performance, realize the cross-thread operations involved in parallel computing at a lower hardware cost, and effectively accelerate the data processing of parallel computing. For example, the multi-thread data processing method provided in the embodiments of this application is applicable to reduction algorithms in parallel computing, differential calculations in graphics processing, and the like.
"A plurality of" mentioned below in the embodiments of this application means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the following three cases: only A exists, both A and B exist, and only B exists. The character "/" generally indicates an "or" relationship between the associated objects. In addition, it should be understood that although the terms first, second and the like may be used to describe various data in the embodiments of the present invention, the data should not be limited by these terms; the terms are only used to distinguish the data from one another. "At least one" means one or more, and "at least two" means two or more. "At least one", "any one" or similar expressions refer to any combination of the items, including any combination of a single item or a plurality of items. For example, a, b and c may each be singular or plural.

The terms "include" and "have" mentioned in the following description of the embodiments of this application, and any variants thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes other steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product or device. It should be noted that in the embodiments of this application, words such as "exemplary" or "for example" are used to represent an example, an illustration or a description. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of such words is intended to present a related concept in a specific manner.
The embodiments of this application are further described below with reference to the accompanying drawings.

FIG. 3 is a schematic diagram of a SIMD parallel computing processor system architecture. The multi-thread data processing method provided in the embodiments of this application can be applied to this SIMD parallel computing processor system. The SIMD parallel computing processor system can be deployed in devices such as personal computers, laptops, smart phones, smart set-top boxes, in-vehicle smart systems and smart wearable devices. The SIMD parallel computing processor system is mainly used to process applications with large amounts of data: it takes compiled binary instruction code and the corresponding data to be processed as input, and finally outputs the data processed by the program to external storage. A typical example is a graphics processing unit (GPU), which takes a large amount of three-dimensional model vertex data and compiler-compiled rendering program instruction code as input, and finally outputs the rendered data to the video memory.
The SIMD parallel computing processor system mainly includes one or more processor cores, and one SIMD processor core is shown schematically in FIG. 3. Each processor core includes multiple arithmetic logic units (arithmetic logic unit, ALU), a general-purpose register (GPR) unit, and one or more instruction-processing related units such as an instruction scheduler, an instruction decoder and a source operand collector. The processing functions of the main modules are as follows:

Instruction scheduler: reads the instruction encodings compiled by the compiler from memory and distributes them according to the idleness and resource usage of the arithmetic logic units (ALUs). The instruction encoding is an encoding in binary format; optionally, the instruction encoding may also be referred to as an operation instruction. An instruction encoding may contain one or more of the following parameters: one or more opcodes used to indicate the behavior of the instruction encoding; source operands, used to indicate the source data required by the opcodes, where the source of the data may be expressed as a register address encoding or an immediate encoding; and a destination operand, used to indicate the storage location of the result after the instruction opcode is executed, which may be a register address encoding. The instruction encoding is described in detail later in the embodiments of this application.

General-purpose register (GPR) unit: stores the data corresponding to the operands involved in instruction calculation, such as the data corresponding to the source operands and the data corresponding to the destination operand. Optionally, the general-purpose register unit uses static random access memory (SRAM). The initial data may come from external storage and correspond to the multiple threads of the parallel computing processor; that is, the initial data may be the multi-thread data of the SIMD processor core.

Instruction decoder: receives and parses the instruction encoding, and according to the instruction encoding instructs the general-purpose register (GPR) unit to prepare the reading of the source data.

Source operand collector: receives the multiple source data returned by the general-purpose register and, based on these source data, performs the cross-thread data movement operation and then outputs the data to the arithmetic logic units. Specifically, a set number of threads are deployed in the source operand collector; the source operand collector can treat the multiple source data returned by the general-purpose register as the source data of the aforementioned set number of threads, with one thread corresponding to one source data, and perform the data movement operation among the set number of threads. In the embodiments of this application, the source operand collector may also output the multiple source data to the arithmetic logic units; alternatively, the arithmetic logic units may directly receive the multiple source data returned by the general-purpose register.

Arithmetic logic unit (ALU): contains a multi-stage pipeline and can complete instruction calculations of various operation types, for example floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical exclusive-OR XOR, logical AND, logical OR, and other floating-point, integer and logical operations. The data to be operated on (also called operands) and the instruction code indicating the operation type are input to the ALU to complete the instruction calculation of the corresponding operation type. In the SIMD parallel computing processor system, each SIMD processor core may contain multiple ALUs to achieve high computing throughput. An independent 1-bit flag may be set for each ALU, with the value of the flag indicating whether the ALU participates in the instruction calculation. For example, a flag value of 1 indicates that the ALU participates in the instruction calculation, and a flag value of 0 indicates that the ALU does not participate in the instruction calculation, so no clock toggling is needed and power consumption can be saved.

The above system provided in the embodiments of this application needs neither a complex crossbar network nor memory accesses to obtain data: a single instruction encoding is executed, data is read from the general-purpose register once, and the cross-thread data movement and calculation are completed, which improves the execution performance of cross-thread operations.
Further, Table 2 below illustrates a format of the instruction encoding. The instruction encoding may specifically include the following parameters.

Table 2
First opcode    Second opcode    Destination operand    Source operand 1    Source operand 2
The first opcode is used to indicate the data movement mode among the set number of threads deployed in the source operand collector. The data movement mode includes one or more modes, which can be defined according to actual requirements. The second opcode is used to indicate the operation type. The operation types include floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical exclusive-OR XOR, logical AND, logical OR, and other floating-point, integer and logical operations. The first opcode may also be called the primary opcode, and the second opcode may also be called the secondary opcode.

Optionally, the data movement mode may include the following types: circular move, butterfly (cross) move and one-to-many move. A circular move can be understood as moving the data of every thread by the same thread offset and in the same thread-number ordering direction (for example, from higher-numbered threads toward lower-numbered threads); a butterfly move can be understood as two threads exchanging their data with each other; a one-to-many move, which may also be called a broadcast (diffusion) move, can be understood as moving the data of one thread to several other threads, or to several threads including that thread itself. Optionally, different values of the first opcode may represent the different data movement modes. For example, the first opcode may be CROSS-DOWN, where CROSS-DOWN indicates the circular move; or the first opcode may be CROSS-QUAD-BUTTERFLY, where CROSS-QUAD-BUTTERFLY indicates the butterfly move; or the first opcode may be CROSS-QUAD-BROADCAST, where CROSS-QUAD-BROADCAST indicates the one-to-many move.

When being defined, the aforementioned circular move, butterfly move and one-to-many move may also be given other names, as long as they can be recognized so that the source operand collector can determine from the first opcode which move operation to perform; this is not limited in the embodiments of this application. Exemplarily, a first data movement mode, a second data movement mode and a third data movement mode may be used to distinguish the above types, for example the first data movement mode indicating the circular move, the second data movement mode indicating the butterfly move, and the third data movement mode indicating the one-to-many move.

Source operand 1 is used to indicate the source data of the set number of threads. The source data of the set number of threads may come from the multiple threads of a parallel computing processor such as a SIMD processor, with the source data of different threads among the set number coming from different threads of the SIMD processor. The set number of threads deployed in the source operand collector may be equal to the number of threads of the parallel computing processor such as the SIMD processor, for example both being N, where N is an integer greater than or equal to 2; alternatively, the set number of threads deployed in the source operand collector may be smaller than the number of threads of the parallel computing processor, for example the set number being N while the SIMD processor has 2N threads. When a general-purpose register or a special-purpose register is used to store the multi-thread data of the parallel computing processor, source operand 1 may specifically be a general-purpose register address or a special-purpose register address. Source operand 2 is used to determine the thread offset corresponding to the data movement mode, and may be an immediate value set according to the actual computing requirement. The destination operand is used to indicate the storage location of the operation result, and may specifically be a general-purpose register address or a special-purpose register address.

Specifically, the instruction decoder can parse the first opcode, the second opcode, the destination operand, source operand 1 and source operand 2 from the instruction encoding according to this format, and instruct the general-purpose register to prepare the corresponding source data. The general-purpose register returns the source data of the aforementioned set number of threads to the source operand collector, and the source operand collector moves the source data of the set number of threads according to the instruction encoding to obtain the moved data on each thread. The source operand collector can send the source data and the moved data of some or all of the set number of threads to the arithmetic logic units, and the arithmetic logic units can then, for some or all of the threads in parallel (simultaneously), perform the operation type indicated by the second opcode to obtain the corresponding operation results, which are stored according to the destination operand.
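As an illustration of how the fields of Table 2 might be laid out and parsed in software, the following is a minimal C sketch. The field widths, the bit positions and the type names are assumptions made only for this example; they are not specified by the table above.

    #include <stdint.h>

    /* Hypothetical field widths; the actual encoding widths are not specified here. */
    typedef struct {
        uint8_t first_opcode;   /* data movement mode, e.g. CROSS-DOWN */
        uint8_t second_opcode;  /* operation type, e.g. FADD */
        uint8_t dst_reg;        /* destination operand: register address */
        uint8_t src0_reg;       /* source operand 1: register address holding the N threads' data */
        uint8_t src1_imm;       /* source operand 2: immediate used as the thread offset */
    } cross_inst_t;

    /* Example decode of a packed word laid out as five 8-bit fields (an assumption). */
    static cross_inst_t decode(uint64_t word) {
        cross_inst_t inst;
        inst.first_opcode  = (uint8_t)(word >> 32);
        inst.second_opcode = (uint8_t)(word >> 24);
        inst.dst_reg       = (uint8_t)(word >> 16);
        inst.src0_reg      = (uint8_t)(word >> 8);
        inst.src1_imm      = (uint8_t)(word);
        return inst;
    }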
With reference to Scheme 1 to Scheme 4 below, the cross-thread data movement and computation under the different data movement modes are described in detail.
Scheme 1:

The source operand collector deploys the same number of threads as the parallel computing processor, for example N threads. A single instruction encoding can be used to realize a circular move of data among the N threads.

Denote this instruction encoding as the first operation instruction. The first operation instruction may include the following parameters: a first opcode, a second opcode, a destination operand, a first source operand (that is, source operand 1 in the first operation instruction) and a second source operand (that is, source operand 2 in the first operation instruction). Exemplarily, the first opcode is CROSS-DOWN, indicating that the data movement mode among the N threads in Scheme 1 is the circular move, also called the first data movement mode; the second opcode indicates the operation type, for example floating-point addition FADD; the destination operand is general-purpose register R0, whose register address indicates the storage location of the operation result; the first source operand is general-purpose register R1, whose register address indicates the first source data of the N threads, the initial data in R1 being the data of the N threads of the parallel computing processor; and the second source operand may be an immediate value, the thread offset corresponding to the first data movement mode being that immediate value. It should be noted that the thread offset here can be understood as the thread distance involved in moving the data: for example, if the immediate value is 2, the thread where a piece of data resides before the move and the thread where it resides after the move are 2 threads apart. Optionally, the expression of this first operation instruction may be written as: CROSSDOWN.FADD R0,R1,2.
The source operand collector moves the first source data of the N threads according to the first operation instruction. Specifically, this can be implemented as follows: the first source data of the thread numbered I_1 is moved to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I_1 is the remainder of (i+SRC1) divided by N, SRC1 denoting the second source operand, which is a positive integer. Denoting the moved data on the thread numbered i by SRC0[i], the result of the circular move satisfies the expression: SRC0[i]=SRC0[(i+SRC1)%N], i∈[0,N-1].

Take as an example that both the source operand collector and the parallel computing processor have 32 threads and SRC1 is 2. In the circular move illustrated in FIG. 4, the first source data of the thread at the tail of an arrow is moved to the thread at the head of the arrow. The moved data on the thread numbered 0 is the first source data of the thread numbered 2; the moved data on the thread numbered 2 is the first source data of the thread numbered 4; the moved data on the thread numbered 25 is the first source data of the thread numbered 27; the moved data on the thread numbered 30 is the first source data of the thread numbered 0; and so on. Details are not repeated in this embodiment of this application.
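The circular-move rule SRC0[i]=SRC0[(i+SRC1)%N] can be illustrated with a small C sketch that models the source operand collector in software; the function name and the use of a float array are choices made only for the example.

    #include <stdio.h>

    #define N 32  /* number of threads, matching the FIG. 4 example */

    /* moved[i] receives the first source data of thread (i + src1) % N. */
    static void cross_down(const float src0[N], int src1, float moved[N]) {
        for (int i = 0; i < N; i++) {
            moved[i] = src0[(i + src1) % N];
        }
    }

    int main(void) {
        float src0[N], moved[N];
        for (int i = 0; i < N; i++) src0[i] = (float)i;  /* thread i initially holds the value i */
        cross_down(src0, 2, moved);
        /* Prints 2.0 and 0.0: thread 0 now sees thread 2's data, thread 30 sees thread 0's data. */
        printf("%.1f %.1f\n", moved[0], moved[30]);
        return 0;
    }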
In an optional implementation, a CROSSDOWN cross-thread processing unit may be deployed in the source operand collector. In the CROSSDOWN cross-thread processing unit structure illustrated in FIG. 5, the unit uses multiple two-to-one selectors (MUX) to implement the circular move operation. Assuming that the data bit width in each of the N threads is M bits, a cascade circuit can be constructed from log2(N) two-to-one selectors each with a bit width of 2*M*N bits to perform cascaded data selection. The input of the first selector in the cascade is generated from the first source data of the N threads: the first source data of each of the N threads is duplicated to twice the bit width (2M) as the first input of the selector; denoting the first source data of a thread as SRC0, the data duplicated to twice the bit width is represented as 2{SRC0,SRC0} in FIG. 5. The duplicated data shifted right by M bits is used as the second input. Bit 0 of the binary representation of SRC1 is used as the select bit to choose one of the two inputs as the output: for example, if bit 0 of SRC1 is 0, the output that was only duplicated is selected, and if bit 0 of SRC1 is 1, the output that was duplicated and shifted right is selected; or vice versa. For each subsequent stage i, one input of the stage-i selector comes from the output of the previous stage, and the other input is the output of the previous stage shifted right by (i+1)*M bits. Bit i of the binary representation of SRC1 is used as the select bit: for example, if bit i of SRC1 is 0 the unshifted input is selected, and if bit i of SRC1 is 1 the right-shifted input is selected; or vice versa, provided that the meaning of the select bit value is defined consistently for every stage. The selector of the last stage uses bit log2(N)-1 of the binary representation of SRC1 as the select bit, and its output data is sent to the arithmetic logic unit ALU as an operand. According to the operation type indicated by the second opcode of the aforementioned first operation instruction, the ALU can compute on the operands before and after the cross-thread move, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR and logical exclusive-OR.
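A software analogue of the cascaded selection can be sketched as follows, assuming that each stage k conditionally rotates the thread data by 2^k positions when bit k of SRC1 is set; with that per-stage amount the log2(N) stages together realize the overall rotation by SRC1. The exact per-stage shift amounts of the hardware cascade may differ from this sketch.

    #include <string.h>

    #define N 32  /* number of threads; a power of two so that log2(N) stages suffice */

    /* Rotate src0 by src1 thread positions using log2(N) conditional stages,
     * so that out[i] ends up equal to src0[(i + src1) % N]. */
    static void cross_down_staged(const float src0[N], int src1, float out[N]) {
        float cur[N], next[N];
        memcpy(cur, src0, sizeof(cur));
        for (int k = 0, step = 1; step < N; k++, step <<= 1) {
            if (src1 & (1 << k)) {
                for (int i = 0; i < N; i++) next[i] = cur[(i + step) % N];  /* take the shifted input */
            } else {
                memcpy(next, cur, sizeof(next));                            /* take the unshifted input */
            }
            memcpy(cur, next, sizeof(cur));
        }
        memcpy(out, cur, sizeof(cur));
    }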
Specifically, for a first thread among the N threads, the arithmetic logic unit ALU may perform the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread. The first thread may include some or all of the N threads.

Optionally, a thread flag bit may be configured for each of the N threads, where the thread flag bit is used to indicate whether the first source data of the thread participates in the operation. Specifically, when the thread flag bit takes a first value, it indicates that the source data of the thread participates in the operation; or, when the thread flag bit takes a second value, it indicates that the source data of the thread does not participate in the operation. The first value may be 1 and the second value may be 0. Setting thread flag bits to mark the threads that do not need to be calculated saves computing power.

The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data on a thread before and after the move participates in the operation. Specifically, for the thread numbered i among the N threads (thread i for short), whether the data on thread i before and after the move participates in the operation can be determined according to the thread flag bit of thread i and the thread flag bit of the thread numbered ((i+src1)%N). This can also be understood as follows: after the data move, the thread flag bit of thread i is updated according to the original thread flag bit of thread i and the thread flag bit of the source of the moved data, that is, the thread numbered ((i+src1)%N). The pseudocode can be expressed as: new_lanemask[i]=lanemask[(i+src1)%N]&lanemask[i], where lanemask[i] denotes the value of the original thread flag bit of thread i, and lanemask[(i+src1)%N] denotes the value of the flag bit of the thread numbered ((i+src1)%N). When both lanemask[(i+src1)%N] and lanemask[i] are 1, the updated thread flag bit new_lanemask[i] of thread i is 1, indicating that the data on the thread before and after the move participates in the operation.

Taking the case where the moved first data on the first thread comes from a second thread among the N threads as an example, for the data on the first thread before and after the move to participate in the operation, the following conditions need to be met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation. In the thread flag bit illustration of FIG. 6, the threads whose original flag bits are 0 before the move, thread 1 and thread 28, are shown filled in black. After the cross-thread data move, that is, the circular move, the updated thread flag bits of thread 1, thread 26, thread 28 and thread 31 are all 0, and the data before and after the move on thread 1, thread 26, thread 28 and thread 31 does not participate in the operation.
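The flag-bit update rule new_lanemask[i]=lanemask[(i+src1)%N]&lanemask[i] can be checked against the FIG. 6 example with a short C sketch; the array initialization simply encodes "all flags 1 except threads 1 and 28".

    #include <stdio.h>

    #define N 32

    int main(void) {
        unsigned char lanemask[N], new_lanemask[N];
        int src1 = 2;
        for (int i = 0; i < N; i++) lanemask[i] = 1;
        lanemask[1] = 0;   /* threads whose source data must not participate */
        lanemask[28] = 0;
        for (int i = 0; i < N; i++) {
            new_lanemask[i] = lanemask[(i + src1) % N] & lanemask[i];
        }
        /* Prints the threads whose updated flag is 0: 1 26 28 31, as in FIG. 6. */
        for (int i = 0; i < N; i++) {
            if (!new_lanemask[i]) printf("%d ", i);
        }
        printf("\n");
        return 0;
    }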
Scheme 1 realizes cross-thread circular data movement with a single instruction and can be applied to reduction calculations in parallel computing. Scheme 1 can also be used to implement operations such as multi-thread data accumulation and cumulative multiplication, for example by constructing multiple levels of instructions in which each level uses the aforementioned first operation instruction and the output result of each level serves as the input of the next level, so that the data of the multiple threads is eventually gathered onto the same thread to complete the accumulation, cumulative multiplication or similar operation, as sketched below.
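As one way such a multi-level construction could look, the sketch below repeats the circular move with doubling offsets and adds the moved value each time; after log2(N) levels every thread holds the sum of all N values. This is only an illustration of the reduction idea under the CROSSDOWN.FADD semantics described above, not a prescribed instruction sequence.

    #define N 32  /* number of threads, assumed to be a power of two */

    /* data[i] is thread i's value; on return every thread holds the total sum. */
    static void reduce_sum_crossdown(float data[N]) {
        float moved[N];
        for (int offset = 1; offset < N; offset <<= 1) {   /* one CROSSDOWN.FADD per level */
            for (int i = 0; i < N; i++) moved[i] = data[(i + offset) % N];
            for (int i = 0; i < N; i++) data[i] += moved[i];
        }
    }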
Scheme 2:

The source operand collector deploys the same number of threads as the parallel computing processor, for example N threads. A single instruction encoding can be used to realize a butterfly (cross) move of data among the N threads.

Denote this instruction encoding as the first operation instruction. The first operation instruction may include the following parameters: a first opcode, a second opcode, a destination operand, a first source operand (that is, source operand 1 in the first operation instruction) and a second source operand (that is, source operand 2 in the first operation instruction). Exemplarily, the first opcode is CROSS QUAD BUTTERFLY, indicating that the data movement mode among the N threads in Scheme 2 is the butterfly move, also called the second data movement mode; the second opcode indicates the operation type, for example floating-point addition FADD; the destination operand is general-purpose register R0, whose register address indicates the storage location of the operation result; the first source operand is general-purpose register R1, whose register address indicates the first source data of the N threads, the initial data in R1 being the data of the N threads of the parallel computing processor; and the second source operand may be an immediate value, for example 2. Optionally, the expression of this first operation instruction may be written as: CROSS QUAD BUTTERFLY.FADD R0,R1,2.
The source operand collector moves the first source data of the N threads according to the first operation instruction. Specifically, this can be implemented as follows: the first source data of the thread numbered I_2 is moved to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I_2 is the exclusive-OR of i and SRC1, SRC1 denoting the second source operand, which is a positive integer. Denoting the moved data on the thread numbered i by SRC0[i], the result of the move satisfies the expression: SRC0[i]=SRC0[i^SRC1], i∈[0,N-1].

Take as an example that both the source operand collector and the parallel computing processor have 32 threads and SRC1 is 2. In the butterfly move illustrated in FIG. 7, the first source data of the thread at the tail of an arrow is moved to the thread at the head of the arrow. The moved data on the thread numbered 0 is the first source data of the thread numbered 2; the moved data on the thread numbered 2 is the first source data of the thread numbered 0; the moved data on the thread numbered 29 is the first source data of the thread numbered 31; the moved data on the thread numbered 31 is the first source data of the thread numbered 29; and so on. Details are not repeated in this embodiment of this application. The CROSS QUAD BUTTERFLY of Scheme 2 divides the N threads, for example 32 threads, into groups of 4 consecutive threads, each group forming a QUAD, and realizes the exchange of data between pairs of threads within each QUAD.
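The butterfly rule SRC0[i]=SRC0[i^SRC1] can likewise be modelled in software; the sketch below is an illustration only and uses the same array convention as the earlier sketches.

    #define N 32

    /* moved[i] receives the first source data of thread i XOR src1;
     * with src1 = 2, threads 0<->2 and 1<->3 exchange data inside each QUAD. */
    static void cross_quad_butterfly(const float src0[N], int src1, float moved[N]) {
        for (int i = 0; i < N; i++) {
            moved[i] = src0[i ^ src1];
        }
    }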
In an optional implementation, a CROSS QUAD BUTTERFLY cross-thread processing unit may be deployed in the source operand collector. This unit uses multiple four-to-one selectors (MUX) to implement the butterfly move operation. Assuming that the data bit width in each of the N threads is M bits, parallel data selection can be performed by N four-to-one selectors each with a bit width of M bits. FIG. 8 illustrates the i-th four-to-one selector MUX in the CROSS QUAD BUTTERFLY cross-thread processing unit. The inputs of the i-th selector are the first source data of the four threads of the QUAD to which thread i belongs, that is, the threads numbered 4*⌊i/4⌋, 4*⌊i/4⌋+1, 4*⌊i/4⌋+2 and 4*⌊i/4⌋+3. The i-th selector uses the exclusive-OR of SRC1 and i as the select signal, chooses one of the four inputs and outputs it to the arithmetic logic unit ALU. According to the operation type indicated by the second opcode of the aforementioned first operation instruction, the ALU can compute on the operands before and after the cross-thread move, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR and logical exclusive-OR.
Specifically, for a first thread among the N threads, the arithmetic logic unit ALU may perform the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread. The first thread may include some or all of the N threads.

Optionally, a thread flag bit may be configured for each of the N threads, where the thread flag bit is used to indicate whether the first source data of the thread participates in the operation. Specifically, when the thread flag bit takes a first value, it indicates that the source data of the thread participates in the operation; or, when the thread flag bit takes a second value, it indicates that the source data of the thread does not participate in the operation. The first value may be 1 and the second value may be 0. Setting thread flag bits to mark the threads that do not need to be calculated saves computing power.

The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data on a thread before and after the move participates in the operation. Specifically, for the thread numbered i among the N threads (thread i for short), whether the data on thread i before and after the move participates in the operation can be determined according to the thread flag bit of thread i and the thread flag bit of the thread numbered (i^SRC1). This can also be understood as follows: after the data move, the thread flag bit of thread i is updated according to the original thread flag bit of thread i and the thread flag bit of the source of the moved data, that is, the thread numbered (i^SRC1). The pseudocode can be expressed as: new_lanemask[i]=lanemask[i^SRC1]&lanemask[i], where lanemask[i] denotes the value of the original thread flag bit of thread i, and lanemask[i^SRC1] denotes the value of the flag bit of the thread numbered (i^SRC1). When both lanemask[i^SRC1] and lanemask[i] are 1, the updated thread flag bit new_lanemask[i] of thread i is 1, indicating that the data on the thread before and after the move participates in the operation.

Taking the case where the moved first data on the first thread comes from a second thread among the N threads as an example, for the data on the first thread before and after the move to participate in the operation, the following conditions need to be met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation. In the thread flag bit illustration of FIG. 9, the threads whose original flag bits are 0 before the move, thread 1 and thread 28, are shown filled in black. After the cross-thread data move, that is, the butterfly move, the updated thread flag bits of thread 1, thread 3, thread 28 and thread 30 are all 0, and the data before and after the move on thread 1, thread 3, thread 28 and thread 30 does not participate in the operation.
Scheme 2 realizes a cross-thread butterfly data move with a single instruction, and the data exchange can be confined to threads within the small range of a QUAD. It can be applied to differential calculations in image processing, for example comparing two pixels that are close to each other in position.
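For example, a difference between the two horizontally neighbouring pixels of a quad could be expressed with the butterfly move followed by a subtraction. The pixel-to-thread mapping assumed here (the 4 threads of a QUAD holding a 2x2 pixel block, with horizontal neighbours differing in the lowest thread bit) is an illustrative assumption, not something fixed by the scheme.

    #define N 32

    /* Each thread holds one pixel value; threads are grouped 4 per QUAD.
     * dx[i] approximates the horizontal difference: the value of the horizontally
     * neighbouring pixel (thread i ^ 1) minus the thread's own value. */
    static void quad_diff_x(const float pixel[N], float dx[N]) {
        for (int i = 0; i < N; i++) {
            float neighbour = pixel[i ^ 1];   /* CROSS QUAD BUTTERFLY with SRC1 = 1 */
            dx[i] = neighbour - pixel[i];
        }
    }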
Scheme 3:

The source operand collector deploys the same number of threads as the parallel computing processor, for example N threads. A single instruction encoding can be used to realize a one-to-many move of data among the N threads.

Denote this instruction encoding as the first operation instruction. The first operation instruction may include the following parameters: a first opcode, a second opcode, a destination operand, a first source operand (that is, source operand 1 in the first operation instruction) and a second source operand (that is, source operand 2 in the first operation instruction). Exemplarily, the first opcode is CROSS QUAD-BROADCAST, indicating that the data movement mode among the N threads in Scheme 3 is the one-to-many move, also called the broadcast (diffusion) move or the third data movement mode; the second opcode indicates the operation type, for example floating-point addition FADD; the destination operand is general-purpose register R0, whose register address indicates the storage location of the operation result; the first source operand is general-purpose register R1, whose register address indicates the first source data of the N threads, the initial data in R1 being the data of the N threads of the parallel computing processor; and the second source operand may be an immediate value, for example 2. Optionally, the expression of this first operation instruction may be written as: CROSS QUAD-BROADCAST.FADD R0,R1,2.
The source operand collector moves the first source data of the N threads according to the first operation instruction. Specifically, this can be implemented as follows: the first source data of the thread numbered I_3 is moved to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I_3 = n*⌊i/n⌋+SRC1, SRC1 denoting the second source operand, which is a positive integer, n being a positive integer that divides N, and ⌊⌋ denoting rounding down. Denoting the moved data on the thread numbered i by SRC0[i], the result of the move satisfies the expression: SRC0[i]=SRC0[n*⌊i/n⌋+SRC1], i∈[0,N-1]. Optionally, n is 4.
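The one-to-many rule SRC0[i]=SRC0[n*⌊i/n⌋+SRC1] with n=4 can be modelled as follows; the sketch reproduces the FIG. 10 behaviour in which thread 2's data is diffused to threads 0 to 3 of its QUAD.

    #define N 32
    #define QUAD 4

    /* moved[i] receives the first source data of the thread at offset src1
     * within the QUAD that thread i belongs to. */
    static void cross_quad_broadcast(const float src0[N], int src1, float moved[N]) {
        for (int i = 0; i < N; i++) {
            int quad_base = (i / QUAD) * QUAD;       /* n * floor(i / n) */
            moved[i] = src0[quad_base + src1];
        }
    }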
Take as an example that both the source operand collector and the parallel computing processor have 32 threads, SRC1 is 2 and n is 4. In the one-to-many move illustrated in FIG. 10, the first source data of the thread at the tail of an arrow is moved to the thread at the head of the arrow. The first source data of the thread numbered 2 is moved to the thread numbered 0, the thread numbered 1, the thread numbered 2 and the thread numbered 3. The moved data on the thread numbered 0 is the first source data of the thread numbered 2; the moved data on the thread numbered 1 is the first source data of the thread numbered 2; the moved data on the thread numbered 2 is still the first source data of the thread numbered 2; the moved data on the thread numbered 3 is the first source data of the thread numbered 2; and so on. Details are not repeated in this embodiment of this application. The CROSS QUAD-BROADCAST of Scheme 3 divides the N threads, for example 32 threads, into groups of 4 consecutive threads, each group forming a QUAD, and then, for each thread, the first source data of the thread at offset SRC1 within the QUAD to which it belongs is moved to that thread.

In an optional implementation, a CROSS QUAD-BROADCAST cross-thread processing unit may be deployed in the source operand collector. This unit uses multiple four-to-one selectors (MUX) to implement the one-to-many move operation. Assuming that the data bit width in each of the N threads is M bits, parallel data selection can be performed by N four-to-one selectors each with a bit width of M bits. FIG. 11 illustrates the i-th four-to-one selector MUX in the CROSS QUAD-BROADCAST cross-thread processing unit. The inputs of the i-th selector are the first source data of the four threads of the QUAD to which thread i belongs, that is, the threads numbered 4*⌊i/4⌋, 4*⌊i/4⌋+1, 4*⌊i/4⌋+2 and 4*⌊i/4⌋+3. The i-th selector uses SRC1 as the select signal, chooses one of the four inputs and outputs it to the arithmetic logic unit ALU. According to the operation type indicated by the second opcode of the aforementioned first operation instruction, the ALU can compute on the operands before and after the cross-thread move, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR and logical exclusive-OR.
Specifically, for a first thread among the N threads, the arithmetic logic unit ALU may perform the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread. The first thread may include some or all of the N threads.

Optionally, a thread flag bit may be configured for each of the N threads, where the thread flag bit is used to indicate whether the first source data of the thread participates in the operation. Specifically, when the thread flag bit takes a first value, it indicates that the source data of the thread participates in the operation; or, when the thread flag bit takes a second value, it indicates that the source data of the thread does not participate in the operation. The first value may be 1 and the second value may be 0. Setting thread flag bits to mark the threads that do not need to be calculated saves computing power.

The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data on a thread before and after the move participates in the operation. Specifically, for the thread numbered i among the N threads (thread i for short), whether the data on thread i before and after the move participates in the operation can be determined according to the thread flag bit of thread i and the thread flag bit of the thread numbered ((i+src1)%4). This can also be understood as follows: after the data move, the thread flag bit of thread i is updated according to the original thread flag bit of thread i and the thread flag bit of the source of the moved data, that is, the thread numbered ((i+src1)%4). The pseudocode can be expressed as: new_lanemask[i]=lanemask[(i+src1)%4]&lanemask[i], where lanemask[i] denotes the value of the original thread flag bit of thread i, and lanemask[(i+src1)%4] denotes the value of the flag bit of the thread numbered ((i+src1)%4). When both lanemask[(i+src1)%4] and lanemask[i] are 1, the updated thread flag bit new_lanemask[i] of thread i is 1, indicating that the data on the thread before and after the move participates in the operation.

Taking the case where the moved first data on the first thread comes from a second thread among the N threads as an example, for the data on the first thread before and after the move to participate in the operation, the following conditions need to be met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation. In the thread flag bit illustration of FIG. 12, the threads whose original flag bits are 0 before the move, thread 1 and thread 30, are shown filled in black. After the cross-thread data move, that is, the one-to-many move, the updated thread flag bits of thread 1 and threads 28-31 are all 0, and the data before and after the move on thread 1 and threads 28-31 does not participate in the operation.

Scheme 3 realizes a cross-thread one-to-many data move with a single instruction, and the diffusion of one thread's data can be confined to the small range of a QUAD. It can be applied to differential calculations in image processing, for example smoothing four positionally adjacent pixels based on one of them, as sketched below.
可以理解的是,本申请实施例提供的上述方案一至方案三可以独立实施,也可以结合在一起实施。例如将方案一和方案三结合在一起实施,以方案一各个线程的运算结果作为 方案三中对应线程的源数据,进行一对多搬移操作。It can be understood that the above schemes 1 to 3 provided in the embodiment of the present application can be implemented independently or in combination. For example, plan 1 and plan 3 are combined and implemented, and the calculation results of each thread in plan 1 are used as the source data of the corresponding thread in plan 3 to perform one-to-many transfer operations.
对应上述方案一至方案三的实施例,参见图13,本申请实施例还提一种跨线程数据处理流程,该流程可由并行计算处理器中的各个单元协同执行。主要的包括如下步骤。Corresponding to the embodiments of the above schemes 1 to 3, referring to FIG. 13 , the embodiment of the present application also provides a cross-thread data processing flow, which can be executed cooperatively by various units in a parallel computing processor. Mainly include the following steps.
(1) The instruction scheduler receives the input instruction.
The parallel computing program or graphics rendering program is compiled by a compiler into binary instruction encodings and configured into the SIMD processor. The instruction encodings serve as the input of the instruction scheduler. The data to be processed is configured into storage by software and initialized into the registers before the instruction begins to issue, serving as the data input of the register module.
(2) The instruction decoder parses the instruction encoding to obtain the operands (such as source operand 1, source operand 2 and the destination operand) and the opcodes, such as the first opcode and the second opcode.
(3) After parsing out the source operands, the instruction decoder issues source operand read requests to the general-purpose registers, and the general-purpose registers return the data corresponding to the source operands to the collector.
(4) The source operand collector judges whether the first opcode is a CROSS-type instruction. If not, step (5) is executed to send the data to the downstream ALU for computation. If it is a CROSS-type instruction, it further judges whether it is a CROSS DOWN instruction, a CROSS QUAD BUTTERFLY instruction or a CROSS QUAD BROADCAST instruction, processes it with the corresponding processing unit, for example performing the cross-thread data move operation, and then executes step (5) to send the data to the downstream ALU for computation.
(5) The ALU performs the corresponding computation according to the second opcode, and the result is sent to the next module for processing.
Scheme four:
The source operand collector deploys fewer threads than the parallel computing processor. For example, the source operand collector deploys N threads while the parallel computing processor has 2N threads. Two instruction encodings can then be used to implement a circular move of data among the 2N threads of the parallel computing processor.
The two instruction encodings are denoted the first operation instruction and the second operation instruction. The first operation instruction may follow the definition in scheme one. The source data indicated by source operand 1 differs between the first operation instruction and the second operation instruction: the first source data of the N threads indicated by the first source operand in the first operation instruction comes from N consecutive threads of the parallel computing processor. To distinguish them, source operand 1 in the second operation instruction is denoted the third source operand; the third source operand indicates the second source data of the N threads (in the source operand collector), and the second source data of the N threads comes from the remaining N consecutive threads of the parallel computing processor. In addition, the second operation instruction may include the remaining parameters that are the same as those of the first operation instruction, such as the first opcode, the second opcode, the destination operand and the second source operand.
In scheme four, the specific implementation in which the source operand collector moves the first source data of the N threads according to the first operation instruction, and moves the second source data of the N threads according to the second operation instruction, can follow scheme one and is not described again here. As an example, the source operand collector may deploy 32 threads, the parallel computing processor includes 64 threads, and SRC1 is 2. At the instruction scheduler stage, the first operation instruction is issued earlier than the second operation instruction; denote the issue-timing interval between the two instructions as m. When m is 1, the first operation instruction and the second operation instruction are issued back to back. Each instruction processes N threads. Fig. 14 is a schematic diagram of the source data: the N threads of the source operand collector are numbered 0 to 31. The first source data of the N threads indicated by the first operation instruction comes from threads 32 to 63 of the parallel computing processor, where the first source data of thread 0 in the source operand collector comes from thread 32 of the parallel computing processor, the first source data of thread 1 comes from thread 33, and so on, and the first source data of thread 31 comes from thread 63. The second source data of the N threads indicated by the second operation instruction comes from threads 0 to 31 of the parallel computing processor, where the second source data of thread 0 in the source operand collector comes from thread 0 of the parallel computing processor, the second source data of thread 1 comes from thread 1, and so on, and the second source data of thread 31 comes from thread 31.
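A minimal C sketch of the source-data gathering shown in Fig. 14, assuming N = 32 collector threads, 2N = 64 processor threads and 32-bit data; the array names and function name are illustrative only.

```c
#include <stdint.h>

#define N 32   /* threads in the source operand collector (illustrative) */

/* Gather the per-instruction source data from the 2N processor threads as in
 * Fig. 14: the first operation instruction reads processor threads N..2N-1,
 * the second operation instruction reads processor threads 0..N-1. */
static void gather_sources(const uint32_t proc[2 * N],
                           uint32_t first_src[N], uint32_t second_src[N])
{
    for (unsigned j = 0; j < N; j++) {
        first_src[j]  = proc[N + j];  /* collector thread j <- processor thread N+j */
        second_src[j] = proc[j];      /* collector thread j <- processor thread j   */
    }
}
```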
Further, on the basis of Fig. 14 and as shown in Fig. 15, an embodiment of this application provides another schematic diagram of the circular move. The source operand collector first obtains the first operation instruction. After the source operand collector performs the circular cross-thread data move according to the first operation instruction, in the source operand collector: the data moved onto thread 0 is the first source data of thread 2; the data moved onto thread 1 is the first source data of thread 3; ...; the data moved onto thread 30 is the first source data of thread 0; and the data moved onto thread 31 is the first source data of thread 1. The source operand collector feeds the move result corresponding to the first operation instruction, denoted the first data of the N threads, to the arithmetic logic unit. The source operand collector then obtains the second operation instruction. After it performs the circular cross-thread data move according to the second operation instruction, in the source operand collector: the data moved onto thread 0 is the second source data of thread 2; the data moved onto thread 1 is the second source data of thread 3; ...; the data moved onto thread 30 is the second source data of thread 0; and the data moved onto thread 31 is the second source data of thread 1. The source operand collector feeds the move result corresponding to the second operation instruction, denoted the second data of the N threads, to the ALU.
In the processing stage in the ALU, the first operation instruction arrives earlier than the second operation instruction. Assume the second operation instruction reaches stage I of the ALU and the first operation instruction reaches stage I+m, where I is any stage of the ALU. The ALU can then exchange the moved first data and the moved second data on a third thread, where the third thread is the thread numbered r among the N threads of the source operand collector. The value of r is determined as follows: if SRC1 is less than N, r is greater than or equal to (N-SRC1) and less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and less than (N-SRC1%N).
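A minimal C sketch of this ALU-stage exchange, directly transcribing the range rule for r stated above; the function name, array-based representation and 32-bit data type are illustrative assumptions.

```c
#include <stdint.h>

/* Exchange, between the results of the two CROSS DOWN instructions, the data
 * of the threads whose numbers fall in the range given in the text:
 *   if SRC1 <  N:  N - SRC1 <= r < N
 *   if SRC1 >= N:  0 <= r < N - (SRC1 % N)
 * first[] holds the moved first data, second[] the moved second data. */
static void alu_stage_swap(uint32_t *first, uint32_t *second,
                           unsigned n, unsigned src1)
{
    unsigned lo = (src1 < n) ? (n - src1) : 0;
    unsigned hi = (src1 < n) ? n : (n - src1 % n);

    for (unsigned r = lo; r < hi; r++) {
        uint32_t tmp = first[r];
        first[r] = second[r];
        second[r] = tmp;
    }
}
```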
Taking SRC1 equal to 2 as an example, Fig. 16 is a schematic diagram of the data exchange. On the basis of Fig. 15, it shows that when the first operation instruction reaches stage 0 of the ALU, the moved first data on thread 30 is exchanged with the moved second data on thread 30, and the moved first data on thread 31 is exchanged with the moved second data on thread 31, thereby completing the circular move of data among the 64 threads of the parallel computing processor.
Further, the ALU then performs the corresponding arithmetic operation according to the second opcode in the first operation instruction/second operation instruction, based on the result of the circular move of data among the 64 threads of the parallel computing processor. The specific operation can be chosen according to actual requirements and is not limited in this embodiment of the application.
It should also be noted that in scheme four, a thread flag bit may likewise be configured for each of the N threads deployed by the source operand collector, the thread flag bit indicating whether the first source data of the thread participates in the arithmetic operation. The specific implementation can follow scheme one and is not repeated here. As an example, in Fig. 16, black filling indicates that the data on threads 30 and 31 before and after the move does not participate in the arithmetic operation.
In scheme four, the smaller number of threads in the source operand collector, combined with the data exchange processing in the ALU, implements efficient cross-thread circular data moves at a larger SIMD width, which can be applied to reduction computations in parallel computing. Scheme four can also be used to implement operations such as multi-thread accumulation and cumulative multiplication, for example by constructing multiple levels of instructions, each level using the aforementioned first operation instruction, with the output of each level serving as the input of the next level, so that the data of multiple threads is eventually moved onto the same thread for accumulation, cumulative multiplication and similar operations.
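A minimal host-side C sketch of the multi-level accumulation idea, simulating each level as a first-move-mode rotation followed by an add. The power-of-two offsets, the choice of summation and the fixed thread count are illustrative assumptions; this is one possible arrangement, not the only one the text allows.

```c
#include <stdint.h>

#define N 32  /* number of threads handled per instruction (illustrative) */

/* Simulate one "move then add" level: every thread i receives the value of
 * thread (i + offset) % N (the first move mode) and adds it to its own value. */
static void cross_down_add(uint32_t val[N], unsigned offset)
{
    uint32_t moved[N];
    for (unsigned i = 0; i < N; i++)
        moved[i] = val[(i + offset) % N];
    for (unsigned i = 0; i < N; i++)
        val[i] += moved[i];
}

/* Multi-level reduction: after log2(N) levels with offsets 1, 2, 4, ...,
 * every thread (including any chosen target thread) holds the sum of all
 * N original values. */
static void reduce_sum(uint32_t val[N])
{
    for (unsigned offset = 1; offset < N; offset <<= 1)
        cross_down_add(val, offset);
}
```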
Corresponding to scheme four above, and referring to Fig. 17, an embodiment of this application provides a cross-thread data processing flow that can be executed cooperatively by the units of a parallel computing processor. It mainly includes the following steps.
(1) The instruction scheduler receives the input instruction.
The parallel computing program or graphics rendering program is compiled by a compiler into binary instruction encodings and configured into the SIMD processor. The instruction encodings serve as the input of the instruction scheduler. The data to be processed is configured into storage by software and initialized into the registers before the instruction begins to issue, serving as the data input of the register module.
(2) The instruction decoder parses the instruction encoding to obtain the operands (such as source operand 1, source operand 2 and the destination operand) and the opcodes, such as the first opcode and the second opcode.
(3) After parsing out the source operands, the instruction decoder issues source operand read requests to the general-purpose registers, and the general-purpose registers return the data corresponding to the source operands to the collector.
(4) The source operand collector judges whether the first opcode is a CROSS-type instruction. If not, step (10) is executed to send the data to the next module. If it is, and it is determined to be a CROSS DOWN instruction, step (5) is executed.
(5) The source operand collector performs the CROSS DOWN data processing, that is, the circular move operation.
(6) At ALU stage I, it is judged whether this is the second CROSS DOWN instruction (the aforementioned second operation instruction). If not, step (10) is executed to send the data to the next module; if so, step (7) is executed.
(7) The ALU judges whether the value of SRC1 is less than N; if so, step (8) is executed; if not, step (9) is executed.
(8) Between ALU stage I and ALU stage I+m, the data of the threads whose numbers are greater than or equal to N-SRC1 and less than N is exchanged.
(9) Between ALU stage I and ALU stage I+m, the data of the threads whose numbers are greater than or equal to 0 and less than (N-SRC1%N) is exchanged.
(10) The next module processes the data.
Further, to distinguish which of schemes one to four is to be implemented, after the instruction is input it may first be judged whether the number of SIMD threads is twice the number of threads in the source operand collector. Specifically, Fig. 18 shows a cross-thread data processing flow that mainly includes the following steps.
(1) The instruction scheduler receives the input instruction.
The parallel computing program or graphics rendering program is compiled by a compiler into binary instruction encodings and configured into the SIMD processor. The instruction encodings serve as the input of the instruction scheduler. The data to be processed is configured into storage by software and initialized into the registers before the instruction begins to issue, serving as the data input of the register module.
(2) The instruction scheduler judges whether the SIMD 2N mode applies, that is, whether the number of SIMD threads is twice the number of threads in the source operand collector. If so, step (3) is executed and the instruction is issued only once; if not, step (4) is executed and the instruction is issued twice.
(3) The instruction decoder parses the instruction encoding to obtain the operands (such as source operand 1, source operand 2 and the destination operand) and the opcodes, such as the first opcode and the second opcode.
(4) After parsing out the source operands, the instruction decoder issues source operand read requests to the general-purpose registers, and the general-purpose registers return the data corresponding to the source operands to the collector.
(5) The source operand collector performs, according to schemes one to four above, the move operation of the first source data among the N threads indicated by the first operation instruction. This is indicated in Fig. 18 as "perform scheme one, scheme two, scheme three or scheme four for SIMD N".
(6) The ALU judges whether the instruction is a CROSS DOWN instruction in SIMD 2N mode, or whether the second CROSS DOWN instruction (the aforementioned second operation instruction) has been received. If not, step (8) is executed to send the data to the next module; if so, step (7) is executed.
(7) The data exchange at the ALU stage is performed according to scheme four.
For example, between ALU stage I and ALU stage I+m, the data of the threads whose numbers are greater than or equal to N-SRC1 and less than N is exchanged; or the data of the threads whose numbers are greater than or equal to 0 and less than (N-SRC1%N) is exchanged.
(8) The next module processes the data.
Based on the same concept, an embodiment of this application provides a multi-thread data processing method, as shown in Fig. 19. The method mainly includes the following procedure.
S1901: obtain a first operation instruction. The first operation instruction includes the following parameters: a first opcode, where the first opcode is used to indicate a data move mode among N threads, and N is an integer greater than or equal to 2; a first source operand, where the first source operand is used to indicate the first source data of the N threads; and a second source operand, where the second source operand is used to determine the thread offset corresponding to the data move mode.
S1902: move the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.
In the embodiments of this application, efficient cross-thread operation of a parallel computing processor is implemented with a single instruction. Compared with a crossbar network this is simpler, and frequent memory accesses are not required, so accelerated processing of cross-thread operations in a high-performance parallel computing processor can be achieved with low hardware or signaling overhead.
In an optional implementation, the data move mode is a first move mode, and moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I1 to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the value of (i+SRC1) modulo N, where SRC1 denotes the second source operand and SRC1 is a positive integer.
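A minimal C sketch of the first move mode; the fixed thread count, array-based representation and function name are illustrative assumptions.

```c
#include <stdint.h>

#define N 32  /* number of threads (illustrative) */

/* First move mode (circular move): thread i receives the first source data of
 * the thread numbered I1 = (i + SRC1) % N. */
static void move_mode_one(const uint32_t src[N], uint32_t dst[N], unsigned src1)
{
    for (unsigned i = 0; i < N; i++)
        dst[i] = src[(i + src1) % N];
}
```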
In an optional implementation, the data move mode is a second move mode, and moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I2 to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), I2 is the exclusive-OR value of i and SRC1, SRC1 denotes the second source operand, and SRC1 is a positive integer.
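A minimal C sketch of the second move mode; the fixed thread count, the function name and the modulo guard on the index are illustrative assumptions (the text itself assumes SRC1 keeps the index within range).

```c
#include <stdint.h>

#define N 32  /* number of threads (illustrative) */

/* Second move mode (butterfly exchange): thread i receives the first source
 * data of the thread numbered I2 = i XOR SRC1. */
static void move_mode_two(const uint32_t src[N], uint32_t dst[N], unsigned src1)
{
    for (unsigned i = 0; i < N; i++) {
        unsigned i2 = (i ^ src1) % N;  /* modulo added only as a bounds guard */
        dst[i] = src[i2];
    }
}
```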
In an optional implementation, the data move mode is a third offset mode, and moving the first source data of the N threads according to the first operation instruction includes:
moving the first source data of the thread numbered I3 to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 takes the value
Figure PCTCN2021101533-appb-000007
where SRC1 denotes the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
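The exact expression for I3 appears only as an image (Figure PCTCN2021101533-appb-000007) and is not reproduced in the extracted text, so the index formula in the sketch below is an assumption: a group-broadcast form that is consistent with the QUAD broadcast behaviour described for scheme three (n = 4), but not confirmed by the source. The function name and fixed thread count are likewise illustrative.

```c
#include <stdint.h>

#define N 32  /* number of threads (illustrative) */

/* Third offset mode sketch (group broadcast). ASSUMED index formula:
 * I3 = (i / n) * n + (SRC1 % n), i.e. every thread in a group of n threads
 * receives the data of thread SRC1 of that group. */
static void move_mode_three(const uint32_t src[N], uint32_t dst[N],
                            unsigned src1, unsigned n)
{
    for (unsigned i = 0; i < N; i++) {
        unsigned i3 = (i / n) * n + (src1 % n);  /* assumed, see lead-in note */
        dst[i] = src[i3];
    }
}
```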
In an optional implementation, the first operation instruction further includes a second opcode, and the second opcode is used to indicate an operation type; the method further includes:
for a first thread among the N threads, performing the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
In an optional implementation, each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in the arithmetic operation.
In an optional implementation, the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the arithmetic operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the arithmetic operation.
In an optional implementation, the first operation instruction further includes a destination operand, and the destination operand is used to indicate the storage location of the operation result corresponding to the first thread.
In an optional implementation, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, the parallel computing processor includes 2N threads, and the method further includes:
obtaining a second operation instruction, where the second operation instruction includes the following parameters: the first opcode; the second source operand; and a third source operand, where the third source operand indicates the second source data of the N threads, and the second source data of the N threads comes from the remaining N consecutive threads of the parallel computing processor; and
moving the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.
In an optional implementation, the method further includes:
exchanging the moved first data on a third thread with the moved second data on the third thread, where the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
Based on the same concept, as shown in Fig. 20, an embodiment of this application further provides a multi-thread data processing apparatus 2000. The multi-thread data processing apparatus includes:
an instruction obtaining module 2001, configured to obtain a first operation instruction, where the first operation instruction includes the following parameters: a first opcode, where the first opcode is used to indicate a data move mode among N threads, and N is an integer greater than or equal to 2; a first source operand, where the first source operand is used to indicate the first source data of the N threads; and a second source operand, where the second source operand is used to determine the thread offset corresponding to the data move mode; and
a processing module 2002, configured to move the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.
In the embodiments of this application, efficient cross-thread operation of a parallel computing processor is implemented with a single instruction. Compared with a crossbar network this is simpler, and frequent memory accesses are not required, so accelerated processing of cross-thread operations in a high-performance parallel computing processor can be achieved with low hardware or signaling overhead.
In an optional implementation, the data move mode is the first move mode, and the processing module 2002 is specifically configured to: move the first source data of the thread numbered I1 to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the value of (i+SRC1) modulo N, where SRC1 denotes the second source operand and SRC1 is a positive integer.
In an optional implementation, the data move mode is the second move mode, and the processing module 2002 is specifically configured to: move the first source data of the thread numbered I2 to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), I2 is the exclusive-OR value of i and SRC1, SRC1 denotes the second source operand, and SRC1 is a positive integer.
In an optional implementation, the data move mode is the third offset mode, and the processing module 2002 is specifically configured to: move the first source data of the thread numbered I3 to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 takes the value
Figure PCTCN2021101533-appb-000008
where SRC1 denotes the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
In an optional implementation, the first operation instruction further includes a second opcode, and the second opcode is used to indicate an operation type; the processing module 2002 is further configured to: for a first thread among the N threads, perform the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
In an optional implementation, each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in the arithmetic operation.
In an optional implementation, the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the arithmetic operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the arithmetic operation.
In an optional implementation, the first operation instruction further includes a destination operand, and the destination operand is used to indicate the storage location of the operation result corresponding to the first thread.
In an optional implementation, the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads; the instruction obtaining module 2001 is further configured to obtain a second operation instruction, where the second operation instruction includes the following parameters: the first opcode; the second source operand; and a third source operand, where the third source operand indicates the second source data of the N threads, and the second source data of the N threads comes from the remaining N consecutive threads of the parallel computing processor; and the processing module 2002 is further configured to move the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.
In an optional implementation, the processing module 2002 is further configured to: exchange the moved first data on a third thread with the moved second data on the third thread, where the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
Based on the same technical concept, this application further provides a communication apparatus 2100. The communication apparatus 2100 may be a chip or a chip system. Optionally, in the embodiments of this application the chip system may consist of chips, or may include chips and other discrete components.
The communication apparatus 2100 may include at least one processor 2110 coupled to a memory. Optionally, the memory may be located inside the apparatus, may be integrated with the processor, or may be located outside the apparatus. For example, the communication apparatus 2100 may further include at least one memory 2120. The memory 2120 stores the computer programs, configuration information, instructions and/or data necessary to implement any of the above embodiments; the processor 2110 may execute the computer programs stored in the memory 2120 to complete the method in any of the above embodiments.
The coupling in the embodiments of this application is an indirect coupling or a communication connection between apparatuses, units or modules, which may be in electrical, mechanical or other forms and is used for information exchange between apparatuses, units or modules. The processor 2110 may cooperate with the memory 2120. The specific connection medium between the processor 2110 and the memory 2120 is not limited in the embodiments of this application.
The communication apparatus 2100 may further include a communication interface 2130, through which the communication apparatus 2100 may exchange information with other devices. For example, the communication interface 2130 may be a transceiver, a circuit, a bus, a module or another type of communication interface. When the communication apparatus 2100 is a chip-type apparatus or circuit, the communication interface 2130 in the apparatus 2100 may also be an input/output circuit that can input information (that is, receive information) and output information (that is, send information); the processor is an integrated processor, a microprocessor, an integrated circuit or a logic circuit, and the processor can determine the output information according to the input information.
Optionally, referring to Fig. 21, the communication interface 2130, the processor 2110 and the memory 2120 are interconnected through a bus 2140. The bus 2140 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in Fig. 21, but this does not mean that there is only one bus or one type of bus.
In the embodiments of this application, the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
In the embodiments of this application, the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory such as a random-access memory (RAM). The memory is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory in the embodiments of this application may also be a circuit or any other apparatus capable of implementing a storage function, and is used to store program instructions and/or data.
Based on the above embodiments, an embodiment of this application further provides a computer program which, when run on a computer, causes the computer to execute the above multi-thread data processing method.
Based on the above embodiments, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, causes the computer to execute the multi-thread data processing method provided in the above method embodiments. The storage medium may be any available medium that the computer can access. By way of example and not limitation, the computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
Based on the above embodiments, an embodiment of this application further provides a computer chip connected to a memory, where the chip is used to read and execute a software program stored in the memory to implement the multi-thread data processing method provided in the above method embodiments.
Based on the above embodiments, an embodiment of this application provides a chip system, which includes a processor and is configured to support a computer apparatus in implementing the functions of the multi-thread data processing method in the above method embodiments. In a possible design, the chip system further includes a memory, and the memory is used to store the programs and data necessary for the computer apparatus. The chip system may consist of chips, or may include chips and other discrete components.
The technical solutions provided in the embodiments of this application may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a terminal device or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that the computer can access, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, digital video discs (DVD)), semiconductor media, or the like.
In the embodiments of this application, provided there is no logical contradiction, the embodiments may refer to one another; for example, methods and/or terms in the method embodiments may refer to one another, functions and/or terms in the apparatus embodiments may refer to one another, and functions and/or terms between the apparatus embodiments and the method embodiments may refer to one another.
This application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to this application. It should be understood that each procedure and/or block in the flowcharts and/or block diagrams, and combinations of procedures and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more procedures of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more procedures of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more procedures of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to this application without departing from the scope of this application. If these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include these changes and modifications.

Claims (23)

1. A multi-thread data processing method, comprising:
obtaining a first operation instruction, wherein the first operation instruction comprises the following parameters: a first opcode, wherein the first opcode is used to indicate a data move mode among N threads, and N is an integer greater than or equal to 2; a first source operand, wherein the first source operand is used to indicate first source data of the N threads; and a second source operand, wherein the second source operand is used to determine a thread offset corresponding to the data move mode; and
moving the first source data of the N threads according to the first operation instruction to obtain moved first data on each of the N threads.
2. The method according to claim 1, wherein the data move mode is a first move mode, and moving the first source data of the N threads according to the first operation instruction comprises:
moving the first source data of the thread numbered I1 to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the value of (i+SRC1) modulo N, wherein SRC1 denotes the second source operand and SRC1 is a positive integer.
3. The method according to claim 1, wherein the data move mode is a second move mode, and moving the first source data of the N threads according to the first operation instruction comprises:
moving the first source data of the thread numbered I2 to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), I2 is the exclusive-OR value of i and SRC1, SRC1 denotes the second source operand, and SRC1 is a positive integer.
4. The method according to claim 1, wherein the data move mode is a third offset mode, and moving the first source data of the N threads according to the first operation instruction comprises:
moving the first source data of the thread numbered I3 to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 takes the value
Figure PCTCN2021101533-appb-100001
wherein SRC1 denotes the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
5. The method according to any one of claims 1 to 4, wherein the first operation instruction further comprises a second opcode, and the second opcode is used to indicate an operation type; and the method further comprises:
for a first thread among the N threads, performing an arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
6. The method according to claim 5, wherein each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in the arithmetic operation.
7. The method according to claim 6, wherein the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the arithmetic operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the arithmetic operation.
8. The method according to any one of claims 5 to 7, wherein the first operation instruction further comprises a destination operand, and the destination operand is used to indicate a storage location of an operation result corresponding to the first thread.
9. The method according to claim 2, wherein the first source data of the N threads comes from N consecutive threads of a parallel computing processor, the parallel computing processor comprises 2N threads, and the method further comprises:
obtaining a second operation instruction, wherein the second operation instruction comprises the following parameters: the first opcode; the second source operand; and a third source operand, wherein the third source operand indicates second source data of the N threads, and the second source data of the N threads comes from the remaining N consecutive threads of the parallel computing processor; and
moving the second source data of the N threads according to the second operation instruction to obtain moved second data on each of the N threads.
10. The method according to claim 8, wherein the method further comprises:
exchanging the moved first data on a third thread with the moved second data on the third thread, wherein the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; and if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
11. A multi-thread data processing apparatus, comprising:
an instruction obtaining module, configured to obtain a first operation instruction, wherein the first operation instruction comprises the following parameters: a first opcode, wherein the first opcode is used to indicate a data move mode among N threads, and N is an integer greater than or equal to 2; a first source operand, wherein the first source operand is used to indicate first source data of the N threads; and a second source operand, wherein the second source operand is used to determine a thread offset corresponding to the data move mode; and
a processing module, configured to move the first source data of the N threads according to the first operation instruction to obtain moved first data on each of the N threads.
12. The apparatus according to claim 11, wherein the data move mode is a first move mode, and the processing module is specifically configured to: move the first source data of the thread numbered I1 to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the value of (i+SRC1) modulo N, wherein SRC1 denotes the second source operand and SRC1 is a positive integer.
13. The apparatus according to claim 11, wherein the data move mode is a second move mode, and the processing module is specifically configured to: move the first source data of the thread numbered I2 to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), I2 is the exclusive-OR value of i and SRC1, SRC1 denotes the second source operand, and SRC1 is a positive integer.
14. The apparatus according to claim 11, wherein the data move mode is a third offset mode, and the processing module is specifically configured to:
move the first source data of the thread numbered I3 to the thread numbered i, wherein the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 takes the value
Figure PCTCN2021101533-appb-100002
wherein SRC1 denotes the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
  15. The device according to any one of claims 11-14, wherein the first operation instruction further comprises a second operation code, the second operation code is used to indicate an operation type, and the processing module is further configured to:
    for a first thread among the N threads, perform the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
  16. The device according to claim 15, wherein each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in the operation.
  17. The device according to claim 16, wherein the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the operation.
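A hedged Python sketch of how claims 15-17 could compose: addition stands in for the operation type selected by the second operation code, the move is the first transfer mode of claim 12, and the behaviour when a flag is cleared (the thread simply keeps its own source data) is an assumption made here, since the claims only state when the source data participates.

    def flagged_combine(src, flags, src1):
        """Sketch of claims 15-17: thread i combines its own first source data
        with the data moved onto it only when its own flag and the flag of the
        sending thread are both set; addition stands in for the operation type."""
        n = len(src)
        moved = [src[(i + src1) % n] for i in range(n)]    # first transfer mode
        result = []
        for i in range(n):
            sender = (i + src1) % n                        # the "second thread" of claim 17
            if flags[i] and flags[sender]:
                result.append(src[i] + moved[i])
            else:
                result.append(src[i])                      # assumption: keep own source data
        return result

    print(flagged_combine([1, 2, 3, 4], [True, True, False, True], 1))
    # -> [3, 2, 3, 5]  (threads 1 and 2 skip the add because thread 2's flag is clear)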
  18. The device according to any one of claims 15-17, wherein the first operation instruction further comprises a destination operand, and the destination operand is used to indicate a storage location of the operation result corresponding to the first thread.
  19. The device according to claim 12, wherein the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor comprises 2N threads;
    the instruction acquisition module is further configured to acquire a second operation instruction, wherein the second operation instruction comprises the following parameters: the first operation code; the second source operand; and a third source operand, wherein the third source operand indicates second source data of the N threads, and the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor; and
    the processing module is further configured to move the second source data of the N threads according to the second operation instruction, to obtain moved second data on each of the N threads.
  20. The device according to claim 18, wherein the processing module is further configured to:
    exchange the moved first data on a third thread with the moved second data on the third thread, wherein the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; and if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N-SRC1%N).
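As a hedged Python sketch of why the exchange rule in claims 10 and 20 has the stated bounds on r (invented names; 0 <= SRC1 < 2N is assumed): with 2N threads but an N-wide move, two first-mode moves plus the selective exchange reproduce a rotation across all 2N threads. The swapped positions are exactly the threads whose intra-half index wrapped around, which is what the two ranges of r describe.

    def move_across_2n(data, src1):
        """Sketch of claims 9-10 / 19-20: two N-thread moves plus an exchange
        emulate a rotation by src1 across all 2N threads."""
        n = len(data) // 2
        lower, upper = data[:n], data[n:]
        # First operation instruction: first source data from the N lower threads.
        first = [lower[(i + src1) % n] for i in range(n)]
        # Second operation instruction: second source data from the remaining N threads.
        second = [upper[(i + src1) % n] for i in range(n)]
        # Exchange the moved first and second data on the threads numbered r.
        for r in range(n):
            if (src1 < n and n - src1 <= r < n) or (src1 >= n and r < n - src1 % n):
                first[r], second[r] = second[r], first[r]
        return first + second

    data = list(range(16))                                 # 2N = 16 threads
    assert move_across_2n(data, 5) == [(i + 5) % 16 for i in range(16)]
    assert move_across_2n(data, 11) == [(i + 11) % 16 for i in range(16)]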
  21. A communication device, comprising a processor, wherein the processor is coupled to a memory, the memory is configured to store a computer program or instructions, and the processor is configured to execute the computer program or instructions to perform the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 10.
  23. A computer program product which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 10.
PCT/CN2021/101533 2021-06-22 2021-06-22 Multi-thread data processing method and apparatus WO2022266842A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/101533 WO2022266842A1 (en) 2021-06-22 2021-06-22 Multi-thread data processing method and apparatus
CN202180099704.7A CN117561501A (en) 2021-06-22 2021-06-22 Multithread data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/101533 WO2022266842A1 (en) 2021-06-22 2021-06-22 Multi-thread data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2022266842A1

Family

ID=84543861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101533 WO2022266842A1 (en) 2021-06-22 2021-06-22 Multi-thread data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN117561501A (en)
WO (1) WO2022266842A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105094749A (en) * 2009-12-22 2015-11-25 英特尔公司 Synchronizing simd vectors
CN105302749A (en) * 2015-10-29 2016-02-03 中国人民解放军国防科学技术大学 Single-instruction multi-thread mode oriented method for DMA transmission in GPDSP
US10761741B1 (en) * 2016-04-07 2020-09-01 Beijing Baidu Netcome Science and Technology Co., Ltd. Method and system for managing and sharing data using smart pointers
US20180157598A1 (en) * 2016-12-05 2018-06-07 Intel Corporation Apparatuses, methods, and systems to share translation lookaside buffer entries

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN YILE LELE: "Hystrix of Spring Cloud passes data across threads", CSDN BLOG, CSDN, CN, CN, XP009542253, Retrieved from the Internet <URL:https://blog.csdn.net/myle69/article/details/83512576> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117389731A (en) * 2023-10-20 2024-01-12 上海芯高峰微电子有限公司 Data processing method and device, chip, device and storage medium
CN117389731B (en) * 2023-10-20 2024-04-02 上海芯高峰微电子有限公司 Data processing method and device, chip, device and storage medium

Also Published As

Publication number Publication date
CN117561501A (en) 2024-02-13

Similar Documents

Publication Publication Date Title
US7042466B1 (en) Efficient clip-testing in graphics acceleration
CN112099852A (en) Variable format, variable sparse matrix multiply instruction
CN109062608B (en) Vectorized read and write mask update instructions for recursive computation on independent data
CN109690475A (en) Hardware accelerator and method for transfer operation
TWI517038B (en) Instruction for element offset calculation in a multi-dimensional array
US20120254592A1 (en) Systems, apparatuses, and methods for expanding a memory source into a destination register and compressing a source register into a destination memory location
EP3623941B1 (en) Systems and methods for performing instructions specifying ternary tile logic operations
CN110659068A (en) Apparatus and method for tensor permutation engine
CN117724763A (en) Apparatus, method and system for matrix operation accelerator instruction
CN113791820B (en) bit matrix multiplication
TWI603262B (en) Packed finite impulse response (fir) filter processors, methods, systems, and instructions
CN104011664A (en) Super multiply ADD (super MADD) instruction with three scalar terms
CN115454501A (en) Method and apparatus for performing reduction operations on multiple data element values
US20210166156A1 (en) Data processing system and data processing method
CN112148251A (en) System and method for skipping meaningless matrix operations
WO2022266842A1 (en) Multi-thread data processing method and apparatus
CN108292228B (en) Systems, devices, and methods for channel-based step-by-step collection
JPWO2016024508A1 (en) Multiprocessor device
US7769981B2 (en) Row of floating point accumulators coupled to respective PEs in uppermost row of PE array for performing addition operation
TW201721580A (en) Multi-functional execution lane for image processor
CN109328334B (en) Systems, apparatus, and methods for cumulative summation
CN102012802A (en) Vector processor-oriented data exchange method and device
CN116257208A (en) Method and apparatus for separable convolution filter operation on matrix multiplication array
US6895424B2 (en) Method and circuit for alignment of floating point significants in a SIMD array MPP
US20050055394A1 (en) Method and system for high performance, multiple-precision multiply-and-add operation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946344

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE