WO2019157743A1 - Thread processing method and graphics processor - Google Patents

Thread processing method and graphics processor

Info

Publication number
WO2019157743A1
WO2019157743A1 · PCT/CN2018/076885 · CN2018076885W
Authority
WO
WIPO (PCT)
Prior art keywords
thread
processor
value
data
threads
Prior art date
Application number
PCT/CN2018/076885
Other languages
English (en)
French (fr)
Inventor
林焕鑫
王卓立
马军超
单东方
沈伟锋
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN201880089527.2A (CN111712793B)
Priority to PCT/CN2018/076885
Publication of WO2019157743A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers

Definitions

  • Embodiments of the present invention relate to the field of data processing, and in particular, to a thread processing method and a graphics processor.
  • When a graphics processing unit (GPU) receives a control instruction compiled from kernel code, it creates a task and creates a large number of threads for the task to process in parallel. For example, the GPU creates workgroups based on the kernel code; each workgroup includes multiple warps, and one warp includes 64 threads. In the Open Computing Language (OpenCL), a warp is called a wavefront (wave); in the Compute Unified Device Architecture (CUDA), it is called a warp.
  • OpenCL: Open Computing Language
  • wave: wavefront
  • CUDA: Compute Unified Device Architecture
  • One stream multiprocessor includes multiple stream processors; each stream processor can run one thread, and the stream multiprocessor runs multiple warps.
  • The threads in a warp are bound together to run on a stream multiprocessor and execute the same instruction at each moment.
  • Some of the warps may be in an active state, and some of the warps may be in a wait state.
  • When an active warp finishes running on the stream multiprocessor, the stream multiprocessor immediately executes a waiting warp.
  • For example, if the stream multiprocessor can run two warps at the same time, two warps are active and the other two warps are in a wait state. When either of the active warps finishes running, the stream multiprocessor immediately runs a waiting warp.
  • For example, consider the following kernel code:
  • Each thread of a workgroup executes the kernel code, where A[tid] is the data to be processed, tid is the thread identifier, and each thread has a unique tid.
  • The probability that A[tid] < 5 occurs is less than the probability that A[tid] ≥ 5. If A[tid] < 5 occurs, code A is executed (the first branch statement); otherwise code B is executed (the second branch statement).
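The kernel body itself is not reproduced in this excerpt, so the following is a hedged Python sketch of the behavior described: a per-thread branch on A[tid], plus a check for the lockstep-warp divergence discussed next. The names `kernel` and `warp_diverges` and the sample values are illustrative assumptions, not the original code:

```python
def kernel(A, tid):
    """One GPU thread's view: A[tid] is this thread's pending data."""
    if A[tid] < 5:        # rare judgment condition -> first branch (code A)
        return "code_A"
    return "code_B"       # common case -> second branch (code B)

def warp_diverges(A, warp_threads):
    """A warp runs in lockstep, so mixed branch outcomes force the
    stream multiprocessor to execute both branches serially."""
    outcomes = {kernel(A, tid) for tid in warp_threads}
    return len(outcomes) > 1

A = [9, 9, 3, 9]                     # thread 2 alone satisfies A[tid] < 5
print(warp_diverges(A, range(4)))    # this warp diverges
```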
  • the stream multiprocessor needs to execute the first branch statement and the second branch statement serially for the thread bundle, thereby reducing the degree of parallelism and the efficiency of execution.
  • Thread-data remapping (TDR) is an existing software technique that resolves branch divergence.
  • TDR changes the arrangement of the pending data allocated to the threads so that pending data satisfying the same judgment condition is allocated to the same warp. The threads in a warp then all obtain the same judgment result, which eliminates the branch divergence.
  • The threads that need to execute the first branch statement are all placed in the same warp, and the threads that need to execute the second branch statement are placed in other warps, which increases parallelism: the stream multiprocessor no longer needs to execute the first branch statement and the second branch statement serially within the same warp.
  • TDR is performed before the thread executes the branch statement.
  • If only a few threads in the workgroup need to execute the first branch statement, performing TDR is pointless.
  • The stream multiprocessor still needs to execute the first branch statement and the second branch statement serially for some warp.
  • The TDR then has no effect, and the time and computing resources spent performing it are wasted.
  • Embodiments of the invention provide a thread processing method and a graphics processor.
  • Thread-data remapping is performed only when the number of threads that need to execute the first branch statement is greater than a threshold, which saves time and computing resources.
  • In a first aspect, the present application provides a thread processing method, which is applied to a graphics processor.
  • The graphics processor is configured to process M warps, each warp including N threads, and further includes at least one warp processor. A first warp processor of the at least one warp processor includes an integer multiple of N thread processors, including a first thread processor configured to run one of the N threads to process pending data that satisfies the judgment condition of the first branch statement or of the second branch statement. The graphics processor is provided with a counter.
  • The first thread processor acquires first pending data, determines that the first pending data satisfies the judgment condition of the first branch statement, and increases the value in the counter by one step.
  • the first thread processor determines the number of threads in the M*N threads that need to run the first branch statement according to the value of the counter.
  • The first thread processor performs thread synchronization and thread-data remapping if the confirmed number is greater than a threshold.
  • By determining the number of threads that need to run the first branch statement and performing thread synchronization and thread-data remapping only when the confirmed number is greater than the threshold, the method avoids performing them when the number is less than or equal to the threshold, which improves the running speed and efficiency of the GPU.
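The gating above can be sketched as follows (serialized here; on the GPU the count is accumulated with atomic adds across threads). `THRESHOLD`, the function names, and the `< 5` condition are illustrative assumptions:

```python
THRESHOLD = 2   # illustrative; the text allows 1, or any value in [2, 5]

def count_first_branch_threads(data):
    """Each thread whose pending data satisfies the first-branch condition
    increases a shared counter by one step (an atomic add on the GPU)."""
    counter = 0
    for tid in range(len(data)):      # stands in for M*N parallel threads
        if data[tid] < 5:             # first-branch judgment condition
            counter += 1
    return counter

def needs_remapping(data):
    """Gate the expensive synchronization + remapping on the confirmed
    count; below the threshold the TDR cost would be wasted."""
    return count_first_branch_threads(data) > THRESHOLD

print(needs_remapping([9, 9, 9, 3]))        # only 1 such thread: skip TDR
print(needs_remapping([3, 1, 9, 2, 0, 9]))  # 4 such threads: remap
```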
  • the step size can be any positive integer, for example, can be set to 1.
  • The graphics processor is further configured with a flag bit whose value is initially set to a first flag value, the first flag value indicating that remapping is not to be performed. The method further includes the following steps: the first thread processor reads the flag bit before determining that the number is greater than the threshold; after determining that the number is greater than the threshold and before performing the thread synchronization, the first thread processor sets the flag bit to a second flag value, the second flag value indicating that remapping needs to be performed.
  • the first thread processor that finds that the value of the counter exceeds the threshold sets the flag bit to the second flag value.
  • A thread that sees the second flag value performs the remapping directly instead of judging again whether the value of the counter exceeds the threshold, which avoids threads reaching different conclusions.
  • The method further includes the following step: the first thread processor clears the value in the counter.
  • The counter is cleared to zero so that other threads only need to see that the flag bit holds the second flag value to confirm that thread-data remapping needs to be performed, without having to determine from the counter value whether thread-data remapping is needed.
  • The first warp processor further includes a second thread processor configured to run one of the N threads to process pending data that satisfies the first branch statement or the second branch statement. The method further includes the following steps: the second thread processor reads the flag bit and performs thread synchronization and thread-data remapping when it confirms that the value of the flag bit is the second flag value.
  • The second thread processor determines, according to the value of the counter, the number of threads among the M*N threads that need to run the first branch statement, and performs thread synchronization and thread-data remapping if the confirmed number is greater than the threshold.
  • Other threads can determine directly from the flag bit whether remapping is required, which avoids the problem of individual threads performing thread synchronization separately.
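A minimal sketch of this flag-bit protocol, with the first/second flag values modeled as 0/1 and a one-element list standing in for the shared flag bit (all names are illustrative assumptions):

```python
FLAG_NO_REMAP, FLAG_REMAP = 0, 1   # first / second flag values

def decide_remap(flag, counter, threshold):
    """A thread that already sees the second flag value joins the
    remapping directly; otherwise it consults the counter, and the first
    thread to confirm counter > threshold publishes the decision by
    raising the flag, so every later thread reaches the same conclusion."""
    if flag[0] == FLAG_REMAP:
        return True                 # follow the earlier thread's decision
    if counter > threshold:
        flag[0] = FLAG_REMAP        # set the second flag value
        return True
    return False

flag = [FLAG_NO_REMAP]              # shared flag bit (mutable cell)
print(decide_remap(flag, counter=3, threshold=2))  # first thread decides
print(decide_remap(flag, counter=0, threshold=2))  # others just follow
```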
  • The first thread processor is configured to run a first thread of the N threads to process the pending data that satisfies the judgment condition of the first branch statement
  • the second thread processor is configured to run a second thread of the N threads to process the data to be processed that satisfies the judgment condition of the second branch statement
  • The graphics processor is further provided with a one-dimensional array, a first variable, and a second variable.
  • The first thread processor performs thread-data remapping as follows: the first thread processor reads the value of the second variable, writes the thread identifier of the first thread into the one-dimensional array at the position whose subscript is the value of the second variable, decreases the value of the second variable by one, and performs thread synchronization.
  • The second thread processor reads the value of the first variable, writes the thread identifier of the second thread into the one-dimensional array at the position whose subscript is the value of the first variable, increases the value of the first variable by one, and performs thread synchronization.
  • After the thread synchronization ends, the first thread processor reads the value in the one-dimensional array at the subscript given by the thread identifier of the first thread and uses the read value as the updated thread identifier of the first thread generated by the thread-data remapping. Likewise, the second thread processor reads the value in the one-dimensional array at the subscript given by the thread identifier of the second thread and uses the read value as the updated thread identifier of the second thread generated by the thread-data remapping.
  • The remapping itself only involves the index (pointer) transformation of the one-dimensional array, which avoids touching the pending data directly during the remapping and can effectively improve the running speed.
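The one-dimensional-array mechanism above can be sketched in Python (serialized for clarity; on the GPU each thread performs its own write, synchronizes at a barrier, then reads). The function name, the `< 5` condition, and the sample data are illustrative assumptions:

```python
def thread_data_remap(data, num_threads):
    """Sketch of the one-dimensional-array remapping described above.
    Threads whose pending data satisfies the first-branch condition write
    their IDs from the back of the array (second variable, decremented);
    the rest write from the front (first variable, incremented)."""
    remap = [None] * num_threads       # the one-dimensional array
    first_var = 0                      # next free slot from the front
    second_var = num_threads - 1       # next free slot from the back
    for tid in range(num_threads):     # each GPU thread does its own write
        if data[tid] < 5:              # first-branch thread
            remap[second_var] = tid
            second_var -= 1
        else:                          # second-branch thread
            remap[first_var] = tid
            first_var += 1
    # --- thread synchronization (barrier) would happen here ---
    # Each thread then reads remap[tid] as its updated thread identifier,
    # so contiguous runs of threads (whole warps) adopt same-branch data.
    return remap

data = [9, 3, 9, 1, 9, 9, 2, 9]   # elements < 5 take the first branch
print(thread_data_remap(data, 8)) # → [0, 2, 4, 5, 7, 6, 3, 1]
```

Note how the updated identifiers of first-branch data owners (1, 3, 6) end up in adjacent trailing slots, so the threads at those positions form same-branch warps.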
  • After performing thread synchronization and before performing thread-data remapping, the first thread processor records the first pending data in an index table with the thread identifier of the first thread as the index, where the thread identifier of the first thread corresponds one to one with the first pending data, and the index table records the one-to-one correspondence between thread identifiers and pending data for the M*N threads. Further, after the first thread processor performs thread-data remapping, the method includes the following steps: the first thread processor uses the updated thread identifier of the first thread, generated by the thread-data remapping, as an index to read from the index table the third pending data corresponding to the updated thread identifier. The first thread processor executes the first branch statement when the third pending data satisfies the judgment condition of the first branch statement, and executes the second branch statement when the third pending data satisfies the judgment condition of the second branch statement.
  • All threads save their pending data in the index table under their own thread identifiers. After obtaining the updated thread identifier assigned by the remapping, each thread uses the updated thread identifier as an index to obtain the pending data corresponding to that identifier from the index table, and determines from that pending data whether to execute the first branch statement or the second branch statement. This implements data exchange between threads and guarantees that the kernel code runs correctly.
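The index-table exchange can be sketched as follows; the dictionary stands in for the shared-memory index table, and the `remap` vector is a hypothetical remapping result, not taken from the patent:

```python
def exchange_via_index_table(data, remap):
    """Every thread first records its own pending data under its own
    thread ID; after remapping, each thread fetches the pending data
    recorded under its updated thread ID and processes that instead."""
    index_table = {tid: data[tid] for tid in range(len(data))}  # pre-remap
    return [index_table[remap[tid]] for tid in range(len(data))]

data = [9, 3, 9, 1]
remap = [0, 2, 3, 1]                          # hypothetical updated IDs
print(exchange_via_index_table(data, remap))  # → [9, 9, 1, 3]
```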
  • the threshold is one.
  • the threshold is a positive integer greater than or equal to 2 and less than or equal to 5.
  • the probability that the first thread processor executes the first branch statement is less than the probability that the first thread processor executes the second branch statement.
  • The threads that need to execute the first branch statement are concentrated in one or more warps by the remapping, so that as many warps as possible avoid executing the first branch statement and the second branch statement serially.
  • the counter and the index table are disposed in a shared memory of the graphics processor.
  • the data to be processed is set in a global memory of the graphics processor.
  • The first thread processor increases the value in the counter by one step by means of an atomic add, and decreases it by means of an atomic subtract.
  • the data to be processed is sent to a global memory of the graphics processor via a central processor coupled to the graphics processor.
  • In a second aspect, the present application provides a thread processing method, which is applied to a graphics processor.
  • The graphics processor is configured to process M warps, each warp including N threads, and further includes at least one warp processor. A first warp processor of the at least one warp processor includes an integer multiple of N thread processors, including a first thread processor. The first thread processor runs a loop statement to run one of the N threads in a loop to process pending data that satisfies the judgment condition of the first branch statement or of the second branch statement. The graphics processor is provided with a counter. The method includes the following steps: in the first loop, the first thread processor acquires first pending data, determines that the first pending data satisfies the judgment condition of the first branch statement, and increases the value in the counter by one step.
  • the first thread processor determines the number of threads in the M*N threads that need to run the first branch statement according to the value of the counter.
  • The first thread processor performs thread synchronization and clears the value in the counter if the confirmed number is greater than the threshold.
  • the first thread processor performs thread data remapping.
  • If the confirmed number is greater than the threshold, the first thread processor performs thread synchronization, clears the value in the counter, and performs thread-data remapping, so that the counter value the first thread processor reads in the next loop is not affected by the first loop.
  • Since thread synchronization and thread-data remapping are performed only when the confirmed number is greater than the threshold, they are avoided when the number is less than or equal to the threshold.
  • Whether thread-data remapping is needed can thus be decided dynamically in each loop, useless thread-data remapping is avoided, and the running speed and operating efficiency of the GPU are improved.
  • The method further includes the following steps: the first thread processor acquires second pending data that needs to be processed in the second loop, determines that the second pending data satisfies the judgment condition of the second branch statement, and decreases the value in the counter by one step.
  • Decreasing the value in the counter by one step offsets the earlier increase by one step, which avoids interfering with other threads and with the next loop.
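A minimal sketch of this counter bookkeeping, assuming a step size of 1 and the illustrative `< 5` condition from the earlier example:

```python
def update_counter(counter, value, step=1):
    """A thread adds `step` (an atomic add on the GPU) when its pending
    data takes the first branch, and subtracts `step` (an atomic subtract)
    when it takes the second, so an increment from an earlier loop
    iteration is cancelled rather than accumulating across loops."""
    if value < 5:                  # first-branch judgment condition
        return counter + step
    return counter - step

counter = 0
counter = update_counter(counter, 3)  # loop 1: first branch  -> counter 1
counter = update_counter(counter, 9)  # loop 2: second branch -> counter 0
print(counter)  # → 0: the loop-1 increment no longer lingers
```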
  • The graphics processor is further configured with a flag bit whose value is initially set to a first flag value, the first flag value indicating that remapping is not to be performed. The method further includes the step of the first thread processor reading the flag bit before determining that the number is greater than the threshold. Moreover, after determining that the number is greater than the threshold and before performing the thread synchronization, the first thread processor sets the flag bit to a second flag value, the second flag value indicating that remapping needs to be performed.
  • the first thread processor that finds that the value of the counter exceeds the threshold sets the flag bit to the second flag value.
  • A thread that sees the second flag value performs the remapping directly instead of judging again whether the value of the counter exceeds the threshold, which avoids threads reaching different conclusions.
  • the first thread bundle processor includes a second thread processor, and the second thread processor is configured to run one of the N threads
  • The method further includes: the second thread processor reads the flag bit and performs thread synchronization and thread-data remapping when it confirms that the value of the flag bit is the second flag value; when the value of the flag bit is the first flag value, the second thread processor determines, according to the value of the counter, the number of threads among the M*N threads that need to run the first branch statement, and performs thread synchronization and thread-data remapping if the confirmed number is greater than the threshold.
  • the first thread processor is configured to run the first thread of the N threads to process the determining condition that satisfies the first branch statement Data to be processed
  • the second thread processor is configured to run a second thread of the N threads to process the data to be processed that satisfies the judgment condition of the second branch statement
  • The graphics processor is further provided with a one-dimensional array, a first variable, and a second variable.
  • The first thread processor performs thread-data remapping as follows: the first thread processor reads the value of the second variable, writes the thread identifier of the first thread into the one-dimensional array at the position whose subscript is the value of the second variable, decreases the value of the second variable by one, and performs thread synchronization.
  • The second thread processor reads the value of the first variable, writes the thread identifier of the second thread into the one-dimensional array at the position whose subscript is the value of the first variable, increases the value of the first variable by one, and performs thread synchronization.
  • After the thread synchronization ends, the first thread processor reads the value in the one-dimensional array at the subscript given by the thread identifier of the first thread and uses the read value as the updated thread identifier of the first thread generated by the thread-data remapping. Likewise, the second thread processor reads the value in the one-dimensional array at the subscript given by the thread identifier of the second thread and uses the read value as the updated thread identifier of the second thread generated by the thread-data remapping.
  • The remapping itself only involves the index (pointer) transformation of the one-dimensional array, which avoids touching the pending data directly during the remapping and can effectively improve the running speed.
  • When the first thread processor runs the first thread, after the first thread processor performs thread synchronization and before it performs thread-data remapping, the method further includes the following steps: the first thread processor records the first pending data and the first loop variable in the index table with the thread identifier of the first thread as the index, where the thread identifier of the first thread corresponds one to one with the first pending data.
  • After performing thread-data remapping, the method further includes the following steps: the first thread processor uses the updated thread identifier of the first thread, generated by the thread-data remapping, as an index to read from the index table the third pending data corresponding to the updated thread identifier; the first thread processor executes the first branch statement when the third pending data satisfies the judgment condition of the first branch statement, and executes the second branch statement when the third pending data satisfies the judgment condition of the second branch statement.
  • All threads save their pending data in the index table under their own thread identifiers. After obtaining the updated thread identifier assigned by the remapping, each thread uses the updated thread identifier as an index to obtain the pending data corresponding to that identifier from the index table, and determines from that pending data whether to execute the first branch statement or the second branch statement. The pending data can thus be exchanged between threads, and the kernel code is guaranteed to run correctly.
  • a loop variable of each thread is also recorded in the graphics processor, and the loop variable is used to indicate the sequence number of the loop in which the thread is currently located.
  • The index table records the correspondence among the loop variable of the first thread, the thread identifier of the first thread, and the pending data of the first thread in the loop indicated by the loop variable.
  • After performing thread-data remapping, the first thread processor uses the updated thread identifier of the first thread, generated by the remapping, as an index to read from the index table the loop variable corresponding to the updated thread identifier.
  • After executing the first branch statement or the second branch statement, the first thread processor adds one to the loop variable corresponding to the updated thread identifier of the first thread to obtain an updated loop variable. When the updated loop variable does not meet the loop condition specified by the loop statement, the first thread ends; when the updated loop variable meets the loop condition specified by the loop statement, the second loop of the first thread runs.
  • Loop variables can be exchanged between threads, so that threads in the same warp can be in different loops.
  • the threshold is 1.
  • the threshold is a positive integer greater than or equal to 2 and less than or equal to 5.
  • the probability that the first thread processor executes the first branch statement is less than the probability that the first thread processor executes the second branch statement.
  • The threads that need to execute the first branch statement are concentrated in one or more warps by the remapping, so that as many warps as possible avoid executing the first branch statement and the second branch statement serially.
  • the counter and the index table are disposed in a shared memory of the graphics processor.
  • the data to be processed is set in a global memory of the graphics processor.
  • The first thread processor increases the value in the counter by one step by means of an atomic add, and decreases it by means of an atomic subtract.
  • the data to be processed is sent to the global memory of the graphics processor via a central processor coupled to the graphics processor.
  • In a third aspect, the present application provides a graphics processor configured to process M warps, each warp including N threads. The graphics processor further includes at least one warp processor.
  • A first warp processor of the at least one warp processor includes an integer multiple of N thread processors
  • the first warp processor includes a first thread processor
  • the first thread processor is configured to run one of the N threads to process pending data that satisfies the judgment condition of the first branch statement or of the second branch statement
  • A counter is set in the graphics processor. The first thread processor is configured to acquire first pending data, determine that the first pending data satisfies the judgment condition of the first branch statement, and increase the value in the counter by one step.
  • the first thread processor is configured to determine, according to the value of the counter, the number of threads in the M*N threads that need to run the first branch statement.
  • The first thread processor is configured to perform thread synchronization and thread-data remapping if the confirmed number is greater than a threshold.
  • The third aspect and any implementation of the third aspect are the apparatus counterparts of the first aspect and any implementation of the first aspect; the descriptions of the first aspect and any of its implementations apply to the third aspect and any of its implementations, and are not repeated here.
  • In a fourth aspect, the present application provides a graphics processor configured to process M warps, each warp including N threads. The graphics processor further includes at least one warp processor.
  • A first warp processor of the at least one warp processor includes an integer multiple of N thread processors
  • the first warp processor includes a first thread processor
  • the first thread processor runs a loop statement to run one of the N threads in a loop to process pending data that satisfies the judgment condition of the first branch statement or of the second branch statement
  • The graphics processor is provided with a counter. In the first loop, the first thread processor acquires the first pending data to be processed, determines that the first pending data satisfies the judgment condition of the first branch statement, and increases the value in the counter by one step.
  • the first thread processor is configured to determine, according to the value of the counter, the number of threads in the M*N threads that need to run the first branch statement.
  • The first thread processor is configured to perform thread synchronization and clear the value in the counter if the confirmed number is greater than a threshold.
  • the first thread processor is configured to perform thread data remapping.
  • The fourth aspect and any implementation of the fourth aspect are the apparatus counterparts of the second aspect and any implementation of the second aspect; the descriptions of the second aspect and any of its implementations apply to the fourth aspect and any of its implementations, and are not repeated here.
  • In a fifth aspect, the present application provides a thread processing method applied to a graphics processor.
  • The graphics processor is configured to process M warps, each warp including N threads, and the method includes the following steps: detecting the number of threads among the M*N threads that need to run the first branch statement, and performing thread-data remapping on the M*N threads if the confirmed number is greater than a threshold.
  • Thread synchronization and thread-data remapping are performed only when the confirmed number is greater than the threshold and are avoided when the number is less than or equal to the threshold, which improves the running speed and operating efficiency of the GPU.
  • The M*N threads are each configured with a thread identifier and pending data, the pending data corresponding one to one with the thread identifiers. Performing thread-data remapping on the M*N threads includes: obtaining the pending data of the M*N threads.
  • When the pending data of any thread among the M*N threads satisfies the judgment condition of the first branch statement, the pending data satisfying the judgment condition of the first branch statement is mapped sequentially onto one group of adjacent threads among the M*N threads.
  • The pending data satisfying the judgment condition of the second branch statement is mapped sequentially onto another group of adjacent threads among the M*N threads.
  • the thread identifiers of the M*N threads are respectively updated to the thread identifiers corresponding to the mapped data to be processed.
  • In this way, the threads that need to execute the first branch statement are concentrated in one or more warps, and the threads that need to execute the second branch statement are concentrated in one or more other warps, so that as many warps as possible avoid executing the first branch statement and the second branch statement serially.
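The data-side effect of the mapping steps above can be sketched as a stable partition of the pending data; the `< 5` condition and the sample values are illustrative assumptions:

```python
def remap_pending_data(data):
    """Pending data that satisfies the first-branch condition is packed
    onto one contiguous run of threads and the rest onto another, so each
    warp sees (as far as possible) only one branch outcome."""
    first = [x for x in data if x < 5]     # first-branch pending data
    second = [x for x in data if x >= 5]   # second-branch pending data
    return first + second                  # adjacent same-branch runs

print(remap_pending_data([9, 3, 9, 1, 2, 9]))  # → [3, 1, 2, 9, 9, 9]
```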
  • the method before performing thread data remapping on the M*N threads, the method further includes: controlling M*N The thread records its own pending data to the index table with its own thread identifier as an index. After performing thread data remapping on the M*N threads, the method further includes: controlling the M*N threads to obtain the to-be-processed data corresponding to the updated thread identifier in the index table by using the updated thread identifier as an index.
  • All threads save their pending data in the index table under their own thread identifiers. After obtaining the updated thread identifier assigned by the remapping, each thread uses the updated thread identifier as an index to obtain the pending data corresponding to that identifier from the index table, and determines from that pending data whether to execute the first branch statement or the second branch statement. This implements data exchange between threads and guarantees that the kernel code runs correctly.
  • When the M*N threads each run a loop statement, detecting the number of threads among the M*N threads that need to run the first branch statement includes: acquiring the pending data that the M*N threads need to process, and increasing the value of the counter by one step whenever the pending data of any of the M*N threads satisfies the judgment condition of the first branch statement.
  • Before thread-data remapping is performed on the M*N threads, the M*N threads are controlled to record their own pending data and loop variables in the index table with their own thread identifiers as indexes. After thread-data remapping is performed on the M*N threads, the M*N threads are controlled to obtain, with their updated thread identifiers as indexes, the pending data and loop variables corresponding to the updated thread identifiers from the index table.
• performing thread data remapping on the M*N threads includes the following sub-steps: controlling the thread that confirms the value of the counter is greater than the threshold to record its own pending data and loop variables to the index table with its own thread identifier as an index, set the flag bit to the second flag value, and perform thread synchronization.
• Controlling the other threads, which detect that the flag bit is the second flag value, to record their own pending data and loop variables to the index table with their own thread identifiers as indexes, and to perform thread synchronization.
  • Control M*N threads to perform thread data remapping to obtain updated thread identities.
  • the thread processor that finds that the value of the counter exceeds the threshold sets the flag bit to the second flag value.
• a thread that detects the second flag value performs the remapping directly, instead of judging whether the value of the counter exceeds the threshold, which avoids inconsistent conclusions. With this design, as long as one of the M*N threads decides to execute thread synchronization, the other threads must follow, avoiding the problem of individual threads performing thread synchronization separately.
  • the method further includes: setting the flag bit to the first flag value, and Set the value of the counter to 0.
• the present application provides a graphics processor, where the graphics processor includes a warp processor, the warp processor being configured to process M warps, each warp including N threads, and to detect the number of threads among the M*N threads that need to run the first branch statement.
  • the warp processor is configured to perform thread data remapping on M*N threads if the number of acknowledgments is greater than a threshold.
• the sixth aspect or any implementation manner of the sixth aspect is a device implementation corresponding to the fifth aspect or any implementation manner of the fifth aspect; the description of the corresponding method aspect or any implementation manner thereof is applicable here, and details are not described again.
  • FIG. 1 is a schematic diagram showing a connection relationship between a graphics processor and a central processing unit according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a work group being delivered to a first warp processor according to an embodiment of the present invention
  • FIG. 3 is a flow chart of thread data remapping in accordance with an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing a data flow direction of thread data remapping according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of time before and after time of thread data remapping according to an embodiment of the present invention
  • FIG. 6 is another data flow diagram of thread data remapping according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of another front and back time overhead of thread data remapping according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of a thread processing method according to an embodiment of the present invention.
  • FIG. 9 is another flow chart of a thread processing method according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of time before and after time of thread data remapping according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of another front and back time overhead of thread data remapping according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of thread data remapping in accordance with an embodiment of the present invention.
  • FIG. 13 is another flowchart of a thread processing method according to an embodiment of the present invention.
  • FIG. 14 is another flowchart of a thread processing method according to an embodiment of the present invention.
  • Figure 15 is a block diagram showing the structure of a heterogeneous system according to an embodiment of the present invention.
  • a Graphics Processing Unit is a microprocessor that performs image computing on devices such as personal computers, workstations, game consoles, and some mobile devices (such as tablets, smartphones, etc.).
• the purpose of the GPU is to convert and drive the display information required by the computer system, and to provide line scan signals to the display so that it displays correctly.
  • GPU has been widely used in general computing because of its increasingly powerful computing power.
  • a large number of programs in different fields are accelerated by GPUs, such as traditional computationally intensive scientific computing, file systems, network systems, database systems, and cloud computing.
• as shown in FIG. 1, which is a schematic diagram of the connection relationship between a graphics processor and a central processing unit according to an embodiment of the present invention, the GPU includes a global memory, a scheduler, and a plurality of warp processors.
  • the global memory stores host code, kernel code, and pending data received from the CPU.
• the scheduler is configured to set up a work group according to the host code; the work group includes M warps, and each warp includes N threads. The scheduler selects an idle warp processor and delivers the M warps to that warp processor, which is configured to process the M*N threads of the work group, where M ≥ 1, N ≥ 1.
  • Each warp includes a predetermined number of threads, and a warp processor can run the predetermined number of threads at the same time.
  • the warp processor may have another name in different products.
  • AMD Advanced Micro Devices
  • CU Computing Unit
  • NVIDIA NVIDIA Corporation
  • SM Stream Multiprocessor
• Each warp processor includes a shared memory, N thread processors, and N private memories; each thread processor runs one thread at a time, and the private memory stores the pending data and process data involved while the thread runs, where the process data includes calculation results, loop variables, and intermediate values involved in the calculation process.
  • the number of thread processors included in one thread bundle processor is an integer multiple of N.
• Each thread processor is provided with its own private memory, and a thread processor can only access its own private memory; the private memory can be, for example, a register group or a memory. Thread processors in the same warp processor can access the shared memory within that warp processor, but cannot access the shared memory within a different warp processor.
• for example, the first thread processor 211, the second thread processor 212, the third thread processor 213, and the fourth thread processor 214 can access the shared memory 210, but cannot access the shared memory 220.
  • the first thread processor 211 can access the private memory 215 of the first thread processor 211 but cannot access the private memory 216 of the second thread processor 212.
  • the thread processor in each of the thread bundle processors can access the global memory.
  • both the first thread processor 211 and the fifth thread processor 221 can access the global memory 24.
  • a GPU program can be divided into two parts: host-side code and kernel code as described above.
  • a code editor running on the CPU can edit the GPU program, set the data to be processed, and the GPU program can be compiled into GPU executable code in binary format by a compiler running on the CPU.
  • the CPU sends the pending data and the compiled GPU program to the global memory of the GPU.
  • the GPU's scheduler reads the host-side code in global memory to initialize the kernel code context to create a workgroup.
• the GPU scheduler allocates the pending data to the threads in the work group and notifies each thread in the work group to execute the kernel code to process its respective pending data.
  • the data to be processed may be generated by the GPU by executing kernel code.
• the initialization of the kernel code context may allocate the pending data set by the host-side code to the threads as an array indexed by thread identifier.
• the kernel code describes the behavior of a thread: the thread corresponding to a thread identifier reads its pending data from the array and performs the behavior according to that data.
• alternatively, the kernel code describes the behavior of a thread that generates its own pending data according to the kernel code and performs the behavior according to that data.
• a warp is a collection of threads.
  • the GPU acts as a coprocessor, and when the kernel code is called from the CPU, a large number of threads are created by the scheduler. These threads are organized hierarchically.
  • Some thread bundles form a workgroup.
• the kernel code is correspondingly assigned one or more work groups, each work group includes M warps, and each warp includes N threads, where M controls the number of warps the GPU creates for the work group after the CPU executes the host-side code to initialize the kernel code context, and the number of threads N is the number of threads included in any one of the warps created for the work group.
  • the specific value of M can be set in the host code, or preset by the GPU as a fixed value.
• the value range of M can be 4 ≤ M ≤ 32.
• alternatively, the value of M can range over M ≥ 1.
  • FIG. 2 is a schematic illustration of a workgroup placement to a first warp processor in accordance with an embodiment of the present invention.
• the first warp processor 21 is provided with only four thread processors 211 to 214, and the first warp processor 21 can only run one warp at a time, for example, the warp 11.
  • the warp processor runs the warp 11
  • the first thread processor 211 runs the thread 1
  • the second thread processor 212 runs the thread 2
  • the third thread processor 213 runs the thread 3
• the fourth thread processor 214 runs thread 4.
• then the warp processor 21 runs the warp 12 or the warp 13; if the warp processor 21 runs the warp 12, the first thread processor 211 runs thread 5, the second thread processor 212 runs thread 6, the third thread processor 213 runs thread 7, and the fourth thread processor 214 runs thread 8.
  • the warp processor 21 can run the warp 13, when the warp processor 21 runs the warp 13, the first thread processor 211 runs the thread 9, the second thread processor 212 runs the thread 10, and the third thread processor 213 runs thread 11, and fourth thread processor 214 runs thread 12.
• the order in which the first warp processor 21 runs the warps of the work group is determined by the scheduler 23, which causes the first warp processor 21 to preferentially run a warp without read delay, where read delay refers to the delay a thread processor incurs when reading pending data from the global memory 24.
• for example, the scheduler 23 first controls the first warp processor 21 to run the warp 11; in the process, a delay occurs when the first thread processor 211 reads the pending data A[0] from the global memory 24 to the shared memory 210, at which time the scheduler 23 can notify the first warp processor 21 to stop running the warp 11 and instead run the warp 12 or the warp 13 to avoid waiting out the delay.
• meanwhile, the scheduler 23 can notify the memory controller (not shown) to read the pending data A[0] from the global memory 24 into the shared memory 210. Therefore, after the first warp processor 21 runs the warp 12 or the warp 13, it can continue to run the warp 11 without waiting for the pending data A[0], under the control of the scheduler 23, thereby masking the delay.
• in practice, the warp processor 21 may include a greater number of thread processors and may therefore run more than one warp at the same time; in the embodiment of the present invention, however, the warp processor 21 is set to include four thread processors so that it runs one warp at a time.
  • Branch divergence is a common cause of performance loss in GPU computing.
• since the thread identification (TID) of each thread and its pending data are different, different judgment results are obtained when a branch is encountered.
  • TID thread identification
• due to the uniformity of execution instructions, the warp will serially execute all the branches that its member threads need to execute; this is called the branch divergence problem.
• Each branch is executed by all threads, but the results of the unrelated threads' runs are discarded, which reduces the parallelism and efficiency of execution. For example, a single-layer branch can reduce efficiency to 50%, and multiple nested branches in a loop statement can cause an exponential slowdown.
• a loop statement can include a for statement, whose general form is for (initialization; condition; update) statement.
• the loop statement may also include a while statement, where the while statement does not define a loop variable, that is, the number of loops is not limited in advance; the general form of the while statement is while (condition) statement.
  • loop statement can also include other statements, such as the select statement, which will not be described here.
  • a branch statement can include an if statement:
  • the if statement is a select statement, and the if statement is used to implement the selection of two branches.
• the general form of the if statement is if (conditional judgment) conditional execution statement 1 else conditional execution statement 2.
  • Conditional execution statement 1 and conditional execution statement 2 can be a simple statement, or a compound statement, or another if statement (ie, another one or more embedded if statements in an if statement) .
  • the conditional judgment can also be called an expression, and the conditional execution statement can also be called a statement.
  • conditional judgment formula is set such that the number of times the first branch statement needs to be executed is smaller than the number of times the second branch statement needs to be executed.
• the conditional judgment can be set, for example, by comparing the thread's pending data temp with a value produced by the rand() function, where the rand() function is a pseudo-random function that can be used to generate any value between 0 and 1, and temp is allocated to each thread of the work group as pending data; the temp assigned to each thread in the kernel code is not necessarily the same.
• an if statement can also include one or more further if statements; this is called nesting of if statements, whose general form is if (condition 1) { if (condition 2) statement; }.
  • the branch statement also includes a switch statement, which is not described here.
  • Synchronization is required for data exchange between threads of the same workgroup, and the GPU provides a software interface.
• a thread can implement thread synchronization by calling the barrier() function. For a thread that calls the barrier function, unless the other threads in the same work group also execute the barrier function, the thread is blocked from executing the kernel code after the barrier function; after the thread processor executes the barrier function, it sets a pause point for the thread.
• the pause point records the location of the next statement after the barrier function in the kernel code; the location is recorded in the thread's private memory, and the thread is suspended.
• after the other threads in the same work group execute the barrier function, the thread processor reads the pause point from the private memory, so that the statement after the barrier function can be executed and the thread continues to run.
  • the condition that the thread synchronization ends is that the M*N threads in a work group execute the barrier function.
  • Thread data remapping is a software technology that resolves branch divergence. By adjusting the mapping relationship between threads and data to be processed, the judgment conditions obtained by threads in the same warp are consistent, thereby eliminating branch divergence.
  • FIG. 3 is a flowchart of thread data remapping according to an embodiment of the present invention
  • FIG. 4 is a data flow direction of thread data remapping according to an embodiment of the present invention.
• the thread data remapping is executed by the first warp processor 21 and includes the following steps:
  • Step S1041 Acquire data to be processed.
• in the example, the pending data is the array A[tid] = {13, 6, 0, 1, 2, 7, 8, 9, 10, 11, 3, 12}; the array A[tid] is sent by the CPU 30 to the global memory 24 of the GPU and is read by the first warp processor 21 into the shared memory 210. A thread processor running a thread can read the pending data A[tid] from the shared memory 210 according to the thread identifier tid of the thread.
• for example, when the first thread processor 211 runs thread 1, it can read the pending data 13 from the array in the shared memory 210 according to the thread identifier 0 of thread 1.
  • the data to be processed has a one-to-one correspondence with the thread identifier of the thread in which the data to be processed is located.
  • the data to be processed 13 corresponds to the thread identifier 0
  • the data to be processed 6 corresponds to the thread identifier 1.
  • Step S1042 determining a judgment condition of which branch the data to be processed satisfies.
  • the judgment condition is a judgment condition of the branch statement, such as A[tid] ⁇ 5 in the code segment above is a judgment condition of the first branch statement, and A[tid] ⁇ 5 is a judgment condition of the second branch statement, That is, when the condition is satisfied, the first branch statement is executed, and when the condition is not satisfied, the second branch statement is executed.
• Step S1043 The pending data that satisfies the judgment condition of the second branch statement is sequentially mapped to an adjacent portion of the M*N threads.
• for example, the pending data 13, 6, 7, 8, 9, 10, 11, 12 satisfying the judgment condition of the second branch statement is allocated to a plurality of adjacent threads of the M warps starting from the first thread, namely threads 1 through 8.
• Step S1044 The pending data that satisfies the judgment condition of the first branch statement is sequentially mapped to the other, adjacent portion of the M*N threads.
  • the to-be-processed data 0, 1, 2, 3 satisfying the judgment condition of the first branch statement is allocated to the plurality of threads 9 to 12 of the M warps starting from the last thread and adjacent.
  • Step S1045 Update the thread identifiers of the M*N threads to the thread identifiers corresponding to the mapped data to be processed.
• for example, the thread identifier 0 of thread 1 is mapped to the thread identifier 0 of the pending data 13, the thread identifier 1 of thread 2 is mapped to the thread identifier 1 of the pending data 6, the thread identifier 2 of thread 3 is modified to the thread identifier 5 of the pending data 7, the thread identifier 3 of thread 4 is modified to the thread identifier 6 of the pending data 8, the thread identifier 4 of thread 5 is modified to the thread identifier 7 of the pending data 9, the thread identifier 5 of thread 6 is modified to the thread identifier 8 of the pending data 10, the thread identifier 6 of thread 7 is modified to the thread identifier 9 of the pending data 11, and the thread identifier 7 of thread 8 is modified to the thread identifier 11 of the pending data 12.
• similarly, the thread identifier 11 of thread 12 is modified to the thread identifier 2 of the pending data 0.
  • FIG. 5 is a schematic diagram of time and time overhead of thread data remapping according to an embodiment of the present invention.
• the upper half of FIG. 5 shows the time required for the first warp processor 21 to run the warps 11 to 13 when thread data remapping is not performed; assume that a thread takes time T2 to execute the first branch statement and time T1 to execute the second branch statement.
• the lower half of FIG. 5 shows the time required for the warp processor 21 to run the warps 11 to 13 after thread data remapping; because the pending data of the threads that execute the first branch statement is gathered into the same warp 13, the threads of warps 11 and 12 only need to execute the second branch statement, so warp 11 requires time T1, warp 12 requires time T1, and warp 13 requires time T2.
  • FIG. 6 is another data flow diagram of thread data remapping according to an embodiment of the present invention. In the example of FIG. 6, only Thread 3 needs to execute the first branch statement.
  • FIG. 7 is a schematic diagram of another front and back time overhead of thread data remapping according to an embodiment of the present invention.
• the warp 13 still needs to serially execute the first branch statement and the second branch statement. Therefore, in the example of FIG. 7, when the number of threads that need to execute the first branch statement is 1, it takes 3T1+T2 regardless of whether thread data remapping is performed. In this case, performing thread data remapping is ineffective: it wastes the limited computing resources of the first warp processor 21 and causes useless time overhead.
• in addition, thread synchronization is often required before thread data remapping, and each thread synchronization causes the M*N threads to pause, which greatly affects the running speed and running efficiency of the GPU.
• in such cases, thread data remapping does not effectively solve the branch divergence problem, but instead introduces thread synchronization and causes delay.
• an embodiment of the present invention provides a thread processing method applied to a graphics processor, where the graphics processor is configured to process M warps, each warp includes N threads, the graphics processor further includes at least one warp processor, a first warp processor of the at least one warp processor includes a number of thread processors that is an integer multiple of N, the first warp processor includes a first thread processor, the first thread processor is configured to run one of the N threads, the thread is configured to process pending data that satisfies the judgment condition of the first branch statement or the judgment condition of the second branch statement, and a counter is set in the graphics processor; the method includes:
• the first thread processor obtains first pending data to be processed, determines that the first pending data satisfies the judgment condition of the first branch statement, and adds one step to the value in the counter;
  • the first thread processor determines, according to the value of the counter, the number of threads in the M*N threads that need to run the first branch statement;
• the first thread processor performs thread synchronization and thread data remapping if the confirmed number is greater than a threshold.
• in this design, the number of threads that need to run the first branch statement is determined by the first thread processor, and thread synchronization and thread data remapping are performed only when the confirmed number is greater than the threshold, rather than also being performed when the number is less than or equal to the threshold, which can improve the running speed and operating efficiency of the GPU.
• FIG. 8 is a flowchart of a thread processing method according to an embodiment of the present invention; the method is applied to each thread processor of the first warp processor 21. It is worth noting that the global memory 24 stores the pending data A[tid], a one-dimensional array set by the CPU and sent to the GPU, which uses the thread identifier tid as a subscript.
  • the work group includes the warps 11, 12, 13, and the warps 11, 12, 13 are dispatched by the dispatcher 23 to the first warp processor 21.
  • the probability that the value of A[tid] is less than 5 is less than the probability that the value of A[tid] is greater than or equal to 5.
  • the value of A[tid] is less than 5 is the judgment condition of the first branch statement
• the value of A[tid] being greater than or equal to 5 is the judgment condition of the second branch statement; that is, the number of times a thread executes the first branch statement is less than the number of times it executes the second branch statement.
• A[tid] = [13, 6, 0, 1, 2, 7, 8, 9, 10, 11, 3, 12].
  • the initial value of the counter is declared to be 0 in the shared memory 210, and the initial value of the flag is the first flag value.
  • the initial value of the first flag value can be set to zero.
• the setting of the counter's initial value to 0 and of the flag bit's initial value to the first flag value can be executed by any thread processor selected from the first warp processor 21, during initialization after the scheduler 23 establishes the work group and before the warps 11, 12, 13 of the work group are placed in the first warp processor 21.
• each thread processor of the first warp processor 21 can access the counter when executing the method shown in FIG. 8 to run a thread; specifically, each thread processor of the first warp processor 21 can read the value of the counter when a thread is running, and can modify the value of the counter.
  • the value of the counter can be read atomically, and the value of the counter can be atomically increased by one step.
• atomic read means that when a thread processor reads the counter value during the running of a thread, other thread processors located in the same warp processor cannot read the counter value until the previous thread processor has finished reading; that is, only one of the M*N threads of the same work group is allowed to read the counter value at the same time.
  • atomic plus one step means that only one of the M*N threads of the same working group is allowed to add a step to the value of the counter at the same time.
  • the step size can be any positive integer, for example, can be set to 1.
  • the thread processing method includes the following steps:
  • Step S401 The process starts.
  • Step S402 It is determined whether the data to be processed that needs to be processed satisfies the judgment condition of the first branch statement, and if yes, step S403 is performed, and if no, step S404 is performed.
  • Step S403 adding a step to the value of the counter.
• when the first thread processor 211, the second thread processor 212, the third thread processor 213, and the fourth thread processor 214 are running threads, each can atomically add one step to the value of the counter in the shared memory 210; that is, the counter can accumulate the number of threads in the warps 11, 12, 13 of the same work group that need to execute the first branch statement.
  • Step S404 determining whether the value of the flag bit is the second flag value, if yes, executing step S408, and if no, executing step S405.
  • the second flag value is a value different from the first flag value.
• for example, when the first flag value is 0, the second flag value can be 1; when the first flag value is 1, the second flag value can be 0.
  • Step S405 Read the value of the counter.
• the thread processor atomically reads the value of the counter. Since the method is applied to each thread processor in the same warp processor 21, between step S403 and this step the counter has accumulated the number of threads that need to run the first branch statement among the threads of the warps 11, 12, and 13 executed by the first thread processor 211, the second thread processor 212, the third thread processor 213, and the fourth thread processor 214.
  • Step S406 It is determined whether the value of the counter is greater than a threshold. If yes, step S409 is performed, and if no, step S413 is performed.
  • the threshold value is 1 as an example.
  • the threshold may be set according to actual needs, such as a positive integer greater than or equal to 2 and less than or equal to 5.
  • Step S407 The value of the flag bit is set to a second flag value, and the second flag value is used to indicate that remapping needs to be performed.
  • the first thread processor that finds that the value of the counter exceeds the threshold sets the flag bit to the second flag value.
• after the flag bit is set to the second flag value, threads that detect it jump to step S408 and perform thread synchronization directly, instead of judging whether the value of the counter exceeds the threshold, which avoids inconsistent conclusions. With this design, as long as one of the M*N threads decides to perform thread synchronization, the other threads must follow, avoiding the problem of individual threads performing thread synchronization separately.
• this matters because in step S405 the counter value read by a thread processor is the value read from the shared memory 210 when the atomic read instruction executes; there is a time interval between issuing the instruction and obtaining the value, and other threads may atomically add one to the counter during this interval, so the count values read by different thread processors running their respective threads may differ, which could otherwise cause individual threads to perform thread synchronization separately while other threads do not.
  • Step S408 Perform thread synchronization.
• thread synchronization can be implemented by calling the barrier() function; for a thread that calls the barrier function, unless the other threads in the same work group also execute the barrier function, the thread is blocked from executing the kernel code after the barrier function. After the thread processor executes the barrier function, it sets a pause point for the thread, which records the location of the next statement after the barrier function in the kernel code; the location is recorded in the thread's private memory, and the thread is halted.
• after the other threads in the same work group execute the barrier function, the thread processor reads the pause point from the private memory, so that the statement after the barrier function can be executed and the thread continues to run.
• the condition for the thread synchronization to end is that the barrier function has been called by the corresponding thread processors during the execution of all of threads 1 to 12.
  • Step S409 After the thread synchronization ends, the value of the counter is set to 0, the value of the flag bit is set to the first flag value, and the data to be processed of the first thread is recorded in the index table with the thread identifier as an index.
  • the index table is set in the shared memory 210, and the threads 1 to 12 can access the index table.
• The thread identifier is the identifier of the thread that the thread processor is running; for example, the thread processor can obtain the thread identifier by calling the get_global_id() function.
• For example, thread 1 records its thread identifier 0 in the index table in correspondence with its to-be-processed data 13.
  • the first thread can be any of threads 1 through 12. Therefore, after running this method on threads 1 through 12, the resulting index table is as shown in Table 1:
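• Steps S409 and S410 can be illustrated with a short Python sketch; apart from the value 13 (thread 1's data) and the value 7 (the data fetched under updated identifier 5) named in the text, the data values below are invented for the example:

```python
# Host-side sketch of steps S409-S410: each thread records its
# to-be-processed data in a shared index table keyed by its own thread
# identifier; after remapping, a thread fetches data using its updated
# identifier instead of its original one.
pending = [13, 6, 4, 1, 8, 7, 2, 9, 11, 3, 5, 10]  # hypothetical A[tid] values

# Step S409: build the index table (thread identifier -> to-be-processed data).
index_table = {tid: pending[tid] for tid in range(12)}
assert index_table[0] == 13      # thread 1 recorded its data 13

# Step S410: thread 3's identifier is remapped from 2 to 5, so it now
# fetches the data recorded under identifier 5.
updated_tid = 5
assert index_table[updated_tid] == 7
```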
• Step S410 Perform thread data remapping, obtain the updated thread identifier generated by the thread data remapping, and obtain the to-be-processed data corresponding to the updated thread identifier from the index table according to the updated thread identifier.
• For example, after running thread 3 and performing thread data remapping, the third thread processor 213 obtains the updated thread identifier 5 generated by the thread data remapping, and obtains the to-be-processed data 7 corresponding to the updated thread identifier 5 from the index table.
• Step S411 Determine whether the data to be processed satisfies the judgment condition of the first branch statement or the judgment condition of the second branch statement. If the judgment condition of the first branch statement is satisfied, step S412 is performed; if the judgment condition of the second branch statement is satisfied, step S413 is performed.
• The judgment condition of the first branch statement may be, for example, A[tid] < 5, and the judgment condition of the second branch statement may be, for example, A[tid] ≥ 5.
  • Step S412 executing the first branch statement.
  • Step S413 Execute the second branch statement.
  • Step S414 The process ends.
  • the end of the process means that the thread processor ends the currently running thread.
• For example, the first thread processor 211 ends thread 1. Note that after the first thread processor 211 ends thread 1, once the other threads of warp 11, where thread 1 is located, have finished, it turns to run threads of other warps, such as thread 5 or thread 9.
• The embodiment of the present invention further provides a thread processing method to avoid useless thread data remapping in a multi-loop scenario.
• The thread processing method is applied to a graphics processor. The graphics processor is configured to process M warps, each warp including N threads, and each thread running a loop statement. The graphics processor further includes at least one warp processor; a first warp processor among the at least one warp processor includes an integer multiple of N thread processors; the first warp processor includes a first thread processor; the first thread processor is configured to run, in a loop, one of the N threads to process to-be-processed data that satisfies the judgment condition of the first branch statement or the judgment condition of the second branch statement; and a counter is set in the graphics processor. The method includes:
• The first thread processor obtains the first to-be-processed data to be processed in the first loop, determines that the first to-be-processed data satisfies the judgment condition of the first branch statement, and increases the value in the counter by one step;
• the first thread processor determines, according to the value of the counter, the number of threads among the M*N threads that need to run the first branch statement;
• upon confirming that the number is greater than the threshold, the first thread processor performs thread synchronization and clears the value in the counter;
• the first thread processor performs thread data remapping.
• When the first thread processor confirms that the number is greater than the threshold, it performs thread synchronization, clears the value in the counter, and performs thread data remapping, so that the counter value it reads in the next loop is not affected by the first loop. Since thread synchronization and thread data remapping are performed only when the number is greater than the threshold, performing them when the number is less than or equal to the threshold is avoided. Whether thread data remapping is needed can thus be determined dynamically in each loop, avoiding useless thread data remapping and improving the running speed and efficiency of the GPU.
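• The decision rule above, namely counting the threads that would take the first branch and remapping only when the count exceeds the threshold, can be sketched as follows (the function name and data values are assumptions; on the GPU the counter is updated atomically by the threads themselves, while this host-side Python counts sequentially):

```python
# Sketch of the remapping decision: the counter's final value equals the
# number of threads whose data satisfies the first branch condition A[tid] < 5.
def should_remap(data, threshold):
    count = sum(1 for x in data if x < 5)  # plays the role of the counter
    return count > threshold

# Only one thread takes the first branch: remapping would be useless.
assert should_remap([1, 6, 7, 8, 9, 9, 8, 7, 6, 5, 9, 8], threshold=1) is False
# Four threads take the first branch: remapping is worthwhile.
assert should_remap([1, 2, 3, 4, 9, 9, 8, 7, 6, 5, 9, 8], threshold=1) is True
```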
• FIG. 9 is another flowchart of a thread processing method according to an embodiment of the present invention. The method is applied to each thread processor of the first warp processor 21. It should be noted that, in this embodiment, it is assumed that the global memory 24 stores the to-be-processed data A[i, tid], an array set by the CPU and sent to the GPU, in which the thread identifier tid is used as the row subscript and the loop variable i is used as the column subscript; the loop variable i indicates the sequence number of the loop in which the thread is currently located.
  • the work group includes the warps 11, 12, 13 and the warps 11, 12, 13 are dispatched by the dispatcher 23 to the first warp processor 21. Moreover, the probability that the thread executes the first branch statement is less than the probability that the thread executes the second branch statement.
  • the CPU can also set the pending data to A[tid,i].
• Thread 6 needs to process its corresponding element of A; for example, the to-be-processed data that thread 12 needs to process is A[10,11].
• The A[i, tid] array is set such that, when i is a fixed value, the probability that the value of A[i, tid] is less than 5 is smaller than the probability that it is greater than or equal to 5; that is, in each loop of the work group, the number of threads executing the first branch statement is smaller than the number executing the second branch statement.
  • the initial value of the counter is declared to be 0 in the shared memory 210, and the initial value of the flag is the first flag value.
  • the first flag value can be zero.
• The setting of the counter's initial value to 0 and of the flag's initial value to the first flag value can be performed during initialization: the scheduler 23 can, when establishing the work group and before placing the warps 11, 12, 13 of the work group in the first warp processor 21, select any thread processor from the first warp processor 21 to execute the initialization.
• Each thread processor of the first warp processor 21 can access the counter when running a thread; specifically, when running a thread, each thread processor of the first warp processor 21 can read the value of the counter and modify the value of the counter.
• The value of the counter can be read atomically, and the value of the counter can be atomically increased by one step.
• An atomic read means that while one thread processor is reading the counter value during the running of its thread, other thread processors located in the same warp processor cannot read the counter value; only after the previous thread processor has finished reading can the other thread processors in the same warp processor read the counter value. That is, at any given time only one of the M*N threads of the same work group is allowed to read the counter value.
• Atomically adding one step means that at any given time only one of the M*N threads of the same work group is allowed to add one step to the value of the counter.
• The step size can be any positive integer; for example, it can be set to 1.
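• These atomic-access rules can be mimicked on the host with a lock (a Python analogy under assumed names; the patent's counter lives in the GPU's shared memory and is updated by hardware atomics):

```python
import threading

# Host-side sketch of the atomic counter semantics: a lock makes the
# read-modify-write of the counter atomic, so at any moment only one
# "thread" of the work group reads or updates it.
STEP = 1                      # the step size; any positive integer works
counter = 0
counter_lock = threading.Lock()

def atomic_add_step():
    global counter
    with counter_lock:        # atomic add-one-step
        counter += STEP

def atomic_read():
    with counter_lock:        # atomic read
        return counter

workers = [threading.Thread(target=atomic_add_step) for _ in range(12)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# All 12 increments survive; without the lock some could be lost.
assert atomic_read() == 12 * STEP
```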
• The thread processing method includes the following steps:
  • Step S501 The process starts.
  • Step S502 It is judged whether the loop is over, if yes, step S503 is performed, and if no, step S504 is performed.
• The initial value of i is 0, and each time the thread completes a loop it adds 1 to i.
• When the value of i is incremented to 1000, the thread jumps out of the loop statement and the loop ends.
  • Step 503 The process ends.
  • the end of the process means that the thread processor ends the currently running thread.
• For example, the first thread processor 211 ends thread 1. Note that after the first thread processor 211 ends thread 1, once the other threads of warp 11, where thread 1 is located, have finished, it turns to run threads of other warps, such as thread 5 or thread 9.
• Step S504 It is determined whether the to-be-processed data satisfies the judgment condition of the first branch statement; if yes, step S505 is performed, and if no, step S506 is performed.
• The global memory 24 records the to-be-processed data A[i, tid]; the thread processor can obtain A[i, tid] from the global memory 24 according to the thread identifier and the value of the loop variable of the thread it is running.
• For example, the judgment condition of the first branch statement is A[i, tid] < 5.
• Step S505 The value of the counter is atomically increased by one step.
• While running their respective threads, the first thread processor 211, the second thread processor 212, the third thread processor 213, and the fourth thread processor 214 can each increase the value of the counter in the shared memory 210 by one step; that is, the counter accumulates the number of threads in the warps 11, 12, 13 of the same work group that need to execute the first branch statement.
  • Step S506 determining whether the value of the flag bit is the second flag value, if yes, executing step S510, and if no, executing step S507.
• The second flag value is a value different from the first flag value, and is used to indicate that remapping needs to be performed. In some examples the first flag value is 0 and the second flag value is 1; in other examples the first flag value is 1 and the second flag value is 0.
  • Step S507 Read the value of the counter.
• The thread processor atomically reads the value of the counter. Since the method is applied to each thread processor in the same warp processor 21, between step S505 and this step the counter has accumulated the number of threads that need to run the first branch statement among the threads of the warps 11, 12, and 13 executed by the first thread processor 211, the second thread processor 212, the third thread processor 213, and the fourth thread processor 214.
  • Step S508 It is judged whether the value of the counter is greater than the threshold. If yes, step S509 is performed, and if no, step S513 is performed.
• For example, the threshold is 1; alternatively, the threshold may be a positive integer between 2 and 5.
  • Step S509 setting the value of the flag bit to the second flag value, where the second flag value is used to indicate that remapping needs to be performed.
• The thread processor that first finds that the value of the counter exceeds the threshold sets the flag bit to the second flag value.
• When another thread processor reads the flag as the second flag value in step S506, it directly performs thread synchronization and no longer determines whether the value of the counter exceeds the threshold (step S508), thereby avoiding reaching a different conclusion.
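• This flag-bit protocol can be sketched sequentially as follows (the function name and the simulated read values are assumptions; in hardware the counter reads race with the increments, which is exactly why the flag is needed):

```python
# Sketch of the flag protocol: the first thread to see the counter above
# the threshold sets the flag, and every later thread follows the flag
# instead of re-reading the counter, so later threads reach the same decision.
FIRST_FLAG, SECOND_FLAG = 0, 1

def decide(reads, threshold):
    """reads[k] is the counter value thread k happens to read (values may
    differ between threads because of the read/increment race). Returns
    each thread's decision in the order the threads reach step S506."""
    flag = FIRST_FLAG
    decisions = []
    for value in reads:
        if flag == SECOND_FLAG:      # S506: flag already set -> follow it
            decisions.append("sync")
            continue
        if value > threshold:        # S508/S509: first finder sets the flag
            flag = SECOND_FLAG
            decisions.append("sync")
        else:
            decisions.append("no-sync")
    return decisions

# Thread 0 races ahead and reads a small value; thread 1 sees the counter
# above the threshold and sets the flag; threads 2 and 3 then follow the
# flag even though the counter values they read are stale.
assert decide([1, 3, 0, 0], threshold=1) == ["no-sync", "sync", "sync", "sync"]
```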
• In step S507, the value of the counter read by a thread processor while running its thread is obtained by executing an atomic read instruction against the shared memory 210. However, there is a time interval between issuing the instruction and obtaining the value, and during this interval other threads may, in their respective loops, atomically increment the counter. Different thread processors may therefore read different counter values, so that individual threads perform thread synchronization while other threads do not.
  • Step S510 Perform thread synchronization.
• Thread synchronization can be implemented by calling the barrier() function. A thread that calls barrier() is blocked from executing the kernel code after the barrier() call until all other threads in the same work group have also executed barrier(). After the thread processor executes barrier(), it sets a pause point for the thread, which records the position in the kernel code of the statement following barrier(); the position is recorded in the thread's private memory, and the thread is suspended.
• After all other threads in the same work group have executed barrier(), the thread processor reads the pause point from the private memory and can then execute the statement following barrier(), so that the thread continues to run.
• The condition for thread synchronization to end is that threads 1 to 12 have all called the barrier() function while being executed by their respective thread processors.
  • Step S511 After the thread synchronization ends, the value of the counter is set to 0, the flag bit is set to the first flag value, and the data to be processed and the loop variable of the first thread are recorded in the index table with the thread identifier as an index.
  • the index table is set in the shared memory 210, and the threads 1 to 12 can access the index table.
  • the first thread can be any of threads 1 through 12. Therefore, after running the method in threads 1 to 12, the generated index table is as shown in Table 2:
  • the thread identifier, the data to be processed, and the loop variable have a one-to-one correspondence.
• Step S512 Perform thread data remapping, obtain the updated thread identifier generated by the thread data remapping, and obtain the to-be-processed data corresponding to the updated thread identifier from the index table according to the updated thread identifier.
• For example, the tid of thread 3 is updated from 2 to 5, so after running thread 3 and performing thread data remapping, the third thread processor 213 obtains the updated thread identifier 5 generated by the thread data remapping, and obtains the to-be-processed data 7 corresponding to the updated thread identifier 5 from the index table.
  • Step S513 The value of the counter is decremented by one step.
  • the step size can be any positive integer.
  • the step size is set to 1, and the value of the counter is decremented by one step, specifically by performing an atomic decrement operation on the value of the counter.
• When the data to be processed does not satisfy the judgment condition of the first branch statement, the value of the counter is decremented by one step, cancelling the operation of adding one step in step S505 and thereby avoiding interference with the judgment of other threads.
  • Step S514 determining whether the data to be processed satisfies the judgment condition of the first branch statement or the judgment condition of the second branch statement. If the judgment condition of the first branch statement is satisfied, step S515 is performed, and if the judgment condition of the second branch statement is satisfied, Step S516 is performed.
  • Step S515 Execute the first branch statement.
  • Step S516 Execute the second branch statement.
• Step S517 Add one to the loop variable and jump to step S502.
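• The counter bookkeeping across loops (the increment of step S505, the clearing of step S511 on remapping, and the rollback of step S513 when no remapping occurs) can be sketched single-threadedly as follows (the helper name and data values are assumptions made for the example):

```python
# Sketch of the per-loop counter bookkeeping. columns[i] holds the work
# group's data for loop i; the function returns, per loop, whether
# remapping ran, and checks that the counter returns to 0 each time.
def run_loops(columns, threshold):
    counter = 0
    outcomes = []
    for data in columns:
        hits = [x for x in data if x < 5]   # threads taking the first branch
        counter += len(hits)                # each hit thread performs S505
        if counter > threshold:
            counter = 0                     # S511: cleared after synchronization
            outcomes.append(True)           # S512: remapping performed
        else:
            counter -= len(hits)            # S513: each hit thread rolls back
            outcomes.append(False)
        assert counter == 0                 # the next loop starts unaffected
    return outcomes

# Loop 1: one hit, no remap; loop 2: three hits, remap; loop 3: no hits.
assert run_loops([[9, 9, 1, 9], [1, 2, 3, 9], [9, 9, 9, 9]], threshold=1) == [False, True, False]
```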
• FIG. 11 is another schematic diagram of the time overhead before and after thread data remapping according to an embodiment of the present invention.
• In the upper and middle portions of FIG. 11, each thread adds 1 to its own loop variable.
  • FIG. 12 is a schematic diagram of thread data remapping according to an embodiment of the present invention.
• The thread processor performs thread synchronization, clears the value in the counter, and performs thread data remapping when it confirms that the number is greater than the threshold, so that the counter value read by the first thread processor in the next loop is not affected by the first loop. Since thread synchronization and thread data remapping are performed only when the number is greater than the threshold, performing them when the number is less than or equal to the threshold is avoided. Whether thread data remapping is needed can thus be determined dynamically in each loop, avoiding useless thread data remapping and improving the running speed and efficiency of the GPU.
• The thread data remapping involved in FIG. 4, FIG. 6, and FIG. 12 will be further clarified from the perspective of a single thread. The shared memory 210 of the graphics processor is provided with a one-dimensional array id_pood[], a first variable H, and a second variable R, wherein the length of the one-dimensional array is M*N, the initial value of the first variable H is 0, and the initial value of the second variable R is M*N-1. Thread data remapping can be performed in the following manner:
• Step 1 When confirming that the data to be processed satisfies the judgment condition of the second branch statement, read the value of the first variable H, write the thread identifier into the one-dimensional array id_pood[] at the position whose subscript is the value of the first variable H, increment the value of the first variable H by one, and perform thread synchronization; when confirming that the first to-be-processed data of the first thread satisfies the judgment condition of the first branch statement, read the value of the second variable R, write the thread identifier into the one-dimensional array id_pood[] at the position whose subscript is the value of the second variable R, decrement the value of the second variable R by one, and perform thread synchronization.
• Step 2 After the thread synchronization ends, read the value in the one-dimensional array id_pood[] at the position whose subscript is the thread identifier of the first thread, and use the value read as the updated thread identifier generated by the thread data remapping.
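• Steps 1 and 2 above can be simulated sequentially in Python (a sketch under assumed names; on the GPU each thread performs its own write, and H and R are updated atomically). Threads whose data satisfies the second branch condition (greater than or equal to 5) fill the array from the front via H; threads whose data satisfies the first branch condition (less than 5) fill it from the back via R:

```python
# Sequential sketch of the two-step remapping: a simple loop stands in
# for the M*N threads writing into the shared one-dimensional array.
def remap(data):
    n = len(data)
    id_pool = [None] * n        # the one-dimensional array id_pood[] of the text
    H, R = 0, n - 1             # first variable H and second variable R
    for tid, x in enumerate(data):
        if x < 5:               # first branch condition
            id_pool[R] = tid    # write own identifier at subscript R
            R -= 1
        else:                   # second branch condition
            id_pool[H] = tid    # write own identifier at subscript H
            H += 1
    # After synchronization, each thread reads id_pool[tid] as its updated
    # identifier, so identifiers of same-branch threads end up contiguous.
    return id_pool

data = [13, 6, 8, 1, 9, 2, 7, 11, 3, 5, 10, 4]
pool = remap(data)
assert all(data[t] >= 5 for t in pool[:8])  # second-branch threads grouped first
assert all(data[t] < 5 for t in pool[8:])   # first-branch threads grouped last
```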
• The execution bodies of the above method are the first thread processor 211, the second thread processor 212, the third thread processor 213, and the fourth thread processor 214, respectively.
• The end of thread synchronization means that all threads in the work group, such as threads 1 to 12, have performed thread synchronization.
• For example, the first thread processor 211 determines that the to-be-processed data 13 of thread 1 satisfies the judgment condition of the second branch statement (greater than or equal to 5), reads the value 0 of the first variable H, writes the thread identifier 0 of thread 1 to id_pood[0], and atomically increments the value of the first variable H by 1, making it 1.
• The second thread processor 212 determines that the to-be-processed data 6 of thread 2 satisfies the judgment condition of the second branch statement (greater than or equal to 5), reads the value 1 of the first variable H, writes the thread identifier 1 of thread 2 to id_pood[1], and atomically increments the value of the first variable H by 1, making it 2.
• The fourth thread processor 214 determines that the to-be-processed data 1 of thread 4 satisfies the judgment condition of the first branch statement (less than 5), reads the value 10 of the second variable R, writes the thread identifier 3 of thread 4 to id_pood[10], and atomically decrements the value of the second variable R by 1, making it 9.
• When the first thread processor 211 runs thread 5 and thread 9, the second thread processor 212 runs thread 6 and thread 10, the third thread processor 213 runs thread 7 and thread 11, and the fourth thread processor 214 runs thread 8 and thread 12, respectively, a similar process is performed, and id_pood[] is obtained as follows:
  • thread data remapping can also be performed in a similar manner, and details are not described herein.
• The embodiment of the present invention further provides a thread processing method applied to a warp processor of a graphics processor, where the graphics processor includes M warps, each warp includes N threads, and at least one thread among the M*N threads of the M warps needs to run the first branch statement. The method includes:
  • thread data remapping is performed on the M*N threads.
  • FIG. 13 is another flowchart of a thread processing method according to an embodiment of the present invention.
• The method shown in FIG. 13 is applied to the first warp processor 21. The difference from FIG. 8 is that this embodiment is described with the first warp processor 21 as the execution body. The method includes:
  • Step S101 Detect the number of threads in the M*N threads that need to run the first branch statement.
• For example, the judgment condition of the first branch statement is A[tid] < 5.
  • Step S102 It is determined whether the quantity is greater than a threshold. If yes, step S104 is performed, and if no, step S103 is performed.
  • the threshold can be, for example, one.
  • Step S103 No thread data remapping is performed.
  • Step S104 Perform thread data remapping.
  • Step S105 Run M*N threads and execute the first branch statement or the second branch statement according to the to-be-processed data.
• Since the step of threshold determination is introduced in step S102, invalid thread data remapping can be filtered out, thereby saving computing resources of the graphics processor and reducing unnecessary time overhead.
• After remapping, the threads that execute the first branch statement can be grouped into one or more warps, and the threads of such a warp need only execute the first branch statement.
• The threshold may also be set to an empirical value, whose value can be determined by experiments so as to achieve a balance between time overhead and computing resources.
• For example, see the following kernel code:
  • each thread needs to perform branch judgment 1000 times. At this time, invalid thread data remapping will cause more time overhead and waste more computing resources.
  • another embodiment of the present invention further provides a thread processing method running in a loop to solve the above technical problem.
  • FIG. 14 is another flowchart of a thread processing method according to an embodiment of the present invention.
• The method runs on the first warp processor 21. The difference from the embodiment shown in FIG. 10 is that this embodiment is described with the first warp processor 21 as the execution body. The method includes:
  • the method specifically includes the following steps:
  • Step S201 Acquire the to-be-processed data that M*N threads need to process, and increase the value of the counter by one if the pending data of any thread satisfies the judgment condition of the first branch statement.
• For example, FIG. 10 is another schematic diagram of the time overhead before and after thread data remapping according to an embodiment of the present invention, in which warp 11 is currently in the 12th loop.
  • Step S202 Read the value of the counter.
  • Step S203 It is determined whether the value of the counter is greater than a threshold. If yes, step S204 is performed, and if no, step S213 is performed.
• Step S204 Control the thread that confirms that the count variable is greater than the threshold to record its own to-be-processed data and loop variable in the index table with its own thread identifier as an index, set the flag bit to the second flag value, and perform thread synchronization.
• Step S205 Control the other threads, which detect that the flag bit is the second flag value, to record their own to-be-processed data and loop variables in the index table with their own thread identifiers as indexes, and perform thread synchronization.
  • Step S206 Control M*N threads to perform thread data remapping to obtain updated thread identifiers.
• After thread data remapping, the thread identifier of thread 1 is 0, that of thread 2 is 1, that of thread 3 is 5, that of thread 4 is 6, that of thread 5 is 7, that of thread 6 is 8, that of thread 7 is 9, that of thread 8 is 11, that of thread 9 is 10, that of thread 10 is 4, that of thread 11 is 3, and that of thread 12 is 2.
  • Step S207 Set the count variable to 0, and set the flag bit to the first flag value.
  • Step S208 Control M*N threads to obtain the data to be processed and the loop variable in the index table with the updated thread identifier as an index.
  • Step S209 Control M*N threads to execute the first branch statement if the respective to-be-processed data satisfies the determination condition of the first branch statement, and if the judgment condition of the second branch statement is satisfied, execute the second branch. Statement.
  • threads 1 through 8 each execute a second branch statement
  • threads 9 through 12 each execute a first branch statement.
• Step S210 Control the M*N threads to determine whether to enter the next loop according to their respective loop variables; if yes, execute step S212, and if no, execute step S211.
  • the loop variable i can be incremented by one, and if the i after the addition is less than 1000, the next loop is entered, and if not, the loop is ended.
  • Step S211 The control thread exits the loop.
  • Step S212 The control thread enters the next loop.
  • Step S213 setting the value of the counter to 0, and acquiring the data to be processed and the loop variable of the M*N threads.
  • Step S214 Control M*N threads to execute the first branch statement if the respective to-be-processed data satisfies the determination condition of the first branch statement, and execute the second branch statement if the judgment condition of the second branch statement is satisfied .
  • Step S215 Control M*N threads to judge whether to enter the next loop according to the respective loop variables, if yes, execute step S217, and if no, execute step S216.
  • Step S216 The control thread exits the loop.
  • Step S217 The control thread enters the next loop.
• After entering the next loop, step S201 is performed again.
• The details are as described above with reference to FIG. 11 and FIG. 12 from the perspective of a thread processor, and are not repeated herein.
• The following kernel code is written in the C language, and the embodiment of the invention can be understood in conjunction with the kernel code; the comment after each line of code describes the function of that code:
• The above kernel code is only one example in the embodiment of the present invention; it can be edited by a code editor on the CPU side, and the thread runs the kernel code to implement the method described in FIG.
  • FIG. 15 is a schematic structural diagram of a device of a heterogeneous system according to an embodiment of the present invention.
  • the heterogeneous system includes a central processing unit 30 and a graphics processor 20.
  • the central processing unit 30 includes a host code 301.
  • Kernel code 302 can be, for example:
  • the code editor 300 can set the branch processing code in the kernel code 302 to form a new kernel code.
  • code editor 300 can add branch processing code to kernel code 302, for example:
• The code editor 300 sends the host code 301 and the kernel code 302 to the compiler 304, which produces binary kernel code and binary host code.
  • the CPU transmits the binary kernel code, the host code, and the to-be-processed data A[i, tid] to the graphics processor 20.
  • the embodiment of the invention further provides a graphics processor, which includes a first thread processor, and the first thread processor is configured to execute the method shown in FIG. 8 or FIG.
  • An embodiment of the present invention further provides a graphics processor including a first warp processor, and the first warp processor is configured to perform the method shown in FIG. 13 or FIG.
• Any of the device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they can be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
• The connection relationships between the modules indicate that communication connections exist between them, which can specifically be implemented as one or more communication buses or signal lines.
• The present invention can be implemented by means of software plus the necessary general-purpose hardware, and of course also by dedicated hardware such as a dedicated CPU, dedicated memory, dedicated components, and so on.
• In general, functions performed by computer programs can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be various, such as analog circuits, digital circuits, or dedicated circuits.
• However, in most cases a software program implementation is the better implementation.
• The technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc, including a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present invention.

Abstract

A method applied to a graphics processor, the method including the following steps: a first thread processor obtains first to-be-processed data that needs to be processed, determines that the first to-be-processed data satisfies a first branch statement, and increases a value in a counter by one step; the first thread processor determines, according to the value of the counter, the number of threads among M*N threads that need to run the first branch statement; upon confirming that the number is greater than a threshold, the first thread processor performs thread synchronization and thread data remapping. Thread data remapping is used only when the number of threads taking the first branch is relatively large, which saves time and computing resources.

Description

Thread processing method and graphics processor

Technical Field

Embodiments of the present invention relate to the field of data processing, and in particular to a thread processing method and a graphics processor.
Background

When a graphics processing unit (GPU) receives a control instruction compiled from kernel code, it creates a task and creates a large number of threads for that task for parallel processing. For example, the GPU creates a workgroup based on the kernel code; each workgroup includes multiple warps, and one warp includes 64 threads. In the Open Computing Language (OpenCL), a warp is called a Wavefront (wave for short); in the Compute Unified Device Architecture (CUDA), it is called a warp.
The multiple warps of one workgroup are assigned to one streaming multiprocessor of the GPU; a streaming multiprocessor includes multiple streaming processors, each streaming processor can run one thread, and the streaming multiprocessor runs multiple warps.

Specifically, the threads in one warp are bound together to run on one streaming multiprocessor, executing a uniform instruction at every moment. Among the multiple warps of a workgroup, some warps may be active while others are waiting; as soon as an active warp finishes running on the streaming multiprocessor, the streaming multiprocessor immediately executes a waiting warp. For example, suppose a streaming multiprocessor includes 64 streaming processors, a workgroup includes 4 warps, and a warp includes 32 threads; then the streaming multiprocessor can run two warps at the same time, those two warps being active and the other two waiting, and as soon as either active warp finishes running, the streaming multiprocessor immediately runs a waiting warp.
When threads in the same warp encounter a branch and the judgment conditions are not uniform, the warp must, because of the uniformity of instructions, serially execute the branches corresponding to its member threads; this is called the branch divergence problem.

For example, see the following kernel code:
Figure PCTCN2018076885-appb-000001
Each thread of a workgroup executes this kernel code, where A[tid] is the data to be processed and tid is the thread identifier; each thread has a unique tid. In this kernel code, assume that, over the multiple threads, the probability of A[tid]<5 occurring is smaller than the probability of A[tid]≥5 occurring. However, once A[tid]<5 occurs, code A must be executed (the first branch statement is executed); otherwise code B is executed (that is, the second branch statement is executed).

In the same warp, the threads that determine A[tid]<5 need to execute the first branch statement, and the threads that determine A[tid]≥5 execute the second branch statement (or first execute the second branch statement and then the first branch statement); the streaming multiprocessor must serially execute the first branch statement and the second branch statement for that warp, which reduces parallelism and execution efficiency.
Thread-Data Remapping (TDR) is an existing software technique for resolving branch divergence. TDR changes the arrangement of the to-be-processed data assigned to all threads so that data with the same judgment condition is assigned to the same warp; the threads in the same warp then obtain consistent judgment conditions, thereby eliminating branch divergence.

For example, by redistributing the to-be-processed data through TDR, the threads that need to execute the first branch statement can all be placed in the same warp, and the threads that need to execute the second branch statement are placed in other warps; the streaming multiprocessor therefore need not serially execute the first branch statement and the second branch statement within the same warp.
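An illustrative Python sketch of this idea follows (the warp size, the data values, and the use of a stable sort as the remapping are assumptions made for the example, not the patented mechanism):

```python
# Sketch of how TDR removes divergence: a warp diverges when its threads
# disagree on the branch condition A[tid] < 5; grouping same-condition
# data into the same warp removes all divergence.
WARP = 4  # assume 4 threads per warp for the sketch

def diverged_warps(data):
    """Count warps whose threads disagree on the condition A[tid] < 5."""
    count = 0
    for w in range(0, len(data), WARP):
        flags = [x < 5 for x in data[w:w + WARP]]
        if any(flags) and not all(flags):
            count += 1
    return count

data = [1, 6, 7, 2, 8, 3, 9, 6, 4, 7, 8, 9]   # mixed: every warp diverges
assert diverged_warps(data) == 3

remapped = sorted(data, key=lambda x: x >= 5)  # TDR: group same-condition data
assert diverged_warps(remapped) == 0           # no warp diverges any more
```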
To solve the branch divergence problem, the prior art performs TDR before the threads execute the branch statements. However, if the number of threads in the workgroup that need to execute the first branch statement is small, performing TDR is pointless.

For example, suppose the entire workgroup has only one thread that needs to execute the first branch statement and all the other threads execute the second branch statement. Then, no matter into which warp the to-be-processed data of the thread that needs to execute the first branch statement is concentrated, the streaming multiprocessor must still serially execute the first branch statement and the second branch statement for that warp.

Therefore, when the number of threads in the workgroup that need to execute the first branch statement is small, performing TDR has no effect and merely wastes the time and computing resources spent on TDR.
发明内容
本发明实施例提供了一种线程处理方法和图形处理器,只有在需执行第一分支语句的线程的数量多于阈值的情况下,才使用线程数据重映射,可节约时间和运算资源。
第一方面,本申请提供一种线程处理方法,该方法应用于图形处理器,具体的,图形处理器用于处理M个线程束,每个线程束包括N个线程,图形处理器还包括至少一个线程束处理器,至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,第一线程束处理器包括第一线程处理器,第一线程处理器用以运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,图形处理器中设置有计数器,该方法包括如下的步骤:第一线程处理器获取需要处理的第一待处理数据,确定第一待处理数据满足第一分支语句,将计数器中的数值加一步长。第一线程处理器根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量。第一线程处理器在确认数量大于阈值的情况下,执行线程同步以及线程数据重映射。
其中,第一线程处理器通过判断运行第一分支语句的线程的数量,在确认数量大于阈值的情况下,才执行线程同步以及线程数据重映射,可避免在该数量小于或等于阈值的情况下也执行线程同步以及线程数据重映射,可提高GPU的运行速度和运行效率。其中,步长可以是任意正整数,举例而言,可设置为1。
在第一方面的第一种可能的实现方式中,图形处理器还设置有标志位,标志位的值设置为第一标志值,第一标志值用于指示不执行重映射,该方法还包括如下步骤:第一线程处理器在确定数量大于阈值的情况之前,读取标志位。且,第一线程处理器在确定数量大于阈值的情况之后及执行线程同步之前,将第一标志值设置为第二标志值,第二标志值用于指示需要执行重映射。
第一个发现计数器的数值超过阈值的线程处理器将标志位设置为第二标志值,其他线程处理器运行其他线程时,读取到第二标志值被设置就直接执行重映射,而不再去判断计数器的数值是否超过阈值,避免得出不一样的结论。采用这个设计后,只要M*N个线程中的一个线程作出执行线程同步的决定,其他线程必定跟随,规避了个别线程单独执行线程同步的问题。
根据第一方面或第一方面的第一种可能的实现方式,在第二种可能的实现方式中,第一线程处理器在执行线程同步之后且执行线程数据重映射之前,该方法还包括以下步骤:第一线程处理器将计数器中的数值清零。
在本线程已经确定需要执行线程数据重映射的情况下,将计数器清零,使得其他线程仅需根据标志位设置为第二标志值即可确认需要执行线程数据重映射,而无需再根据计数器的数值判断是否需要执行线程数据重映射。
根据第一方面、第一方面的第一种可能的实现方式以及第一方面的第二种可能的实现方式中的任一者,在第三种可能的实现方式中,第一线程束处理器包括第二线程处理器,第二线程处理器用以运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,该方法还包括如下步骤:第二线程处理器读取标志位,在确认标志位的值为第二标志值时,执行线程同步以及线程数据重映射。第二线程处理器在确认标志位的值为第一标志值时,根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量,在确认数量大于阈值的情况下,执行线程同步以及线程数据重映射。
其他线程可根据标志位直接判断是否需进行重映射,可规避个别线程单独执行线程同步的问题。
根据第一方面的第三种可能的实现方式,在第四种可能的实现方式中,第一线程处理器用以运行N个线程中的第一线程以处理满足第一分支语句的判断条件的待处理数据,第二线程处理器用以运行N个线程中的第二线程以处理满足第二分支语句的判断条件的待处理数据,图形处理器还设置有一维数组、第一变量以及第二变量,其中一维数组的长度是M*N,第一变量的初始值是0,第二变量的初始值是M*N-1,第一线程处理器执行线程数据重映射,包括:第一线程处理器读取第二变量的数值,并将第一线程的线程标识写入一维数组中以第二变量的数值作为下标的位置,将第二变量的数值减一,并执行线程同步。第二线程处理器读取第一变量的数值,并将第二线程的线程标识写入一维数组中以第一变量的数值作为下标的位置,将第一变量的数值加一,并执行线程同步。第一线程处理器在线程同步结束后,读取一维数组中以第一线程的线程标识作为下标的位置上的数值,并将读取的数值作为线程数据重映射产生的第一线程的更新的线程标识。第二线程处理器在线程同步结束后,读取一维数组中以第二线程的线程标识作为下标的位置上的数值,并将读取的数值作为线程数据重映射产生的第二线程的更新的线程标识。
该重映射的实现方式仅涉及一维数组的指针变换,避免了在重映射过程中直接调用待处理数据,可有效提高运算速度。
根据第一方面的第四种可能的实现方式,在第五种可能的实现方式中,第一线程处理器在执行线程同步之后且执行线程数据重映射之前,该方法还包括:第一线程处理器以第一线程的线程标识作为索引将第一待处理数据记录在索引表,其中第一线程的线程标识与第一待处理数据具有一一对应关系,索引表记录有M*N个线程的线程标识与待处理数据之间的一一对应关系。进一步地,第一线程处理器在执行线程数据重映射之后,该方法还包括以下步骤:第一线程处理器以执行线程数据重映射后产生的第一线程的更新的线程标识作为索引在索引表中读取与第一线程的更新的线程标识对应的第三待处理数据。第一线程处理器在第三待处理数据满足第一分支语句的判断条件时,执行第一分支语句,第一线程处理器在第三待处理数据满足第二分支语句的判断条件时,执行第二分支语句。
所有线程以自身的线程标识作为索引将待处理数据保存在索引表,并在获得重映射分配的更新的线程标识之后,以更新的线程标识作为索引从索引表获取更新的线程标识对应的待处理数据,并根据该待处理数据判断是执行第一分支语句抑或是第二分支语句,可实现线程之间的数据交换,并保证内核代码的正常运行。
在第一方面的一种可能的实现方式中,阈值为1。
在第一方面的一种可能的实现方式中,阈值为大于等于2且小于等于5的正整数。
在第一方面的一种可能的实现方式中,第一线程处理器执行第一分支语句的概率小于第一线程处理器执行第二分支语句的概率。
由于执行第一分支语句的概率较小,因此通过重映射将需执行第一分支语句的线程尽量集中在一个或多个线程束中,使得尽量多的线程束避免串行地执行第一分支语句和第二分支语句。
在第一方面的一种可能的实现方式中,计数器、索引表设置在图形处理器的共享存储器。
在第一方面的一种可能的实现方式中,待处理数据设置在图形处理器的全局存储器。
在第一方面的一种可能的实现方式中,第一线程处理器通过对计数器中的数值进行原子加一以实现加一步长的操作,第一线程处理器通过对计数器中的数值进行原子减一以实现减一步长的操作。
在第一方面的一种可能的实现方式中,待处理数据经由与图形处理器连接的中央处理器发送到图形处理器的全局存储器。
第二方面,本申请提供一种线程处理方法,该方法应用于图形处理器,具体的,图形处理器用于处理M个线程束,每个线程束包括N个线程,图形处理器还包括至少一个线程束处理器,至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,第一线程束处理器包括第一线程处理器,第一线程处理器运行循环语句,用以在一个循环中运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,图形处理器中设置有计数器,该方法包括以下步骤:第一线程处理器在第一循环中获取需要处理的第一待处理数据,确定第一待处理数据满足第一分支语句,将计数器中的数值加一步长。第一线程处理器根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量。第一线程处理器在确认数量大于阈值的情况下,执行线程同步并将计数器中的数值清零。第一线程处理器执行线程数据重映射。
综上,第一线程处理器在确认数量大于阈值的情况下,执行线程同步并将计数器中的数值清零,并执行线程数据重映射,使得第一线程处理器在下一循环读取到的计数器的数值不会受到第一循环的影响,并且,由于在确认数量大于阈值的情况下,才执行线程同步以及线程数据重映射,因此可避免在该数量小于或等于阈值的情况下也执行线程同步以及线程数据重映射,可在每一循环中动态地判断是否有执行线程数据重映射的需要,可避免进行无用的线程数据重映射,可提高GPU的运行速度和运行效率。
在第二方面的第一种可能的实现方式中,该方法还包括以下步骤:第一线程处理器获取在第二循环中需要处理的第二待处理数据,确定第二待处理数据满足第二分支语句的判断条件,将计数器中的数值减一步长。
将计数器中的数值减一步长,可与加一步长的操作抵消,可避免对其他线程的判断造成干扰,并且可避免对下一循环进行干扰。
结合第二方面或第二方面的第一种可能的实现方式,在第二种可能的实现方式中,图形处理器还设置有标志位,标志位的值设置为第一标志值,第一标志值用于指示不执行重映射,该方法还包括以下步骤:第一线程处理器在确定数量大于阈值的情况之前,读取标志位。且,第一线程处理器在确定数量大于阈值的情况之后及执行线程同步之前,将第一标志值设置为第二标志值,第二标志值用于指示需要执行重映射。
第一个发现计数器的数值超过阈值的线程处理器将标志位设置为第二标志值,其他线程处理器运行其他线程时,读取到第二标志值被设置就直接执行重映射,而不再去判断计数器的数值是否超过阈值,避免得出不一样的结论。采用这个设计后,只要M*N个线程中的一个线程作出执行线程同步的决定,其他线程必定跟随,规避了个别线程单独执行线程同步的问题。
结合第二方面的第二种可能的实现方式,在第三种可能的实现方式中,第一线程束处理器包括第二线程处理器,第二线程处理器用以运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,方法还包括:第二线程处理器读取标志位,在确认标志位的值为第二标志值时,执行线程同步以及线程数据重映射;第二线程处理器在确认标志位的值为第一标志值时,根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量,在确认数量大于阈值的情况下,执行线程同步以及线程数据重映射。
由于其他线程可根据标志位直接判断是否需进行重映射,可规避个别线程单独执行线程同步的问题。
根据第二方面的第三种可能的实现方式,在第四种可能的实现方式中,第一线程处理器用以运行N个线程中的第一线程以处理满足第一分支语句的判断条件的待处理数据,第二线程处理器用以运行N个线程中的第二线程以处理满足第二分支语句的判断条件的待处理数据,图形处理器还设置有一维数组、第一变量以及第二变量,其中一维数组的长度是M*N,第一变量的初始值是0,第二变量的初始值是M*N-1,第一线程处理器执行线程数据重映射,包括:第一线程处理器读取第二变量的数值,并将第一线程的线程标识写入一维数组中以第二变量的数值作为下标的位置,将第二变量的数值减一,并执行线程同步。第二线程处理器读取第一变量的数值,并将第二线程的线程标识写入一维数组中以第一变量的数值作为下标的位置,将第一变量的数值加一,并执行线程同步。第一线程处理器在线程同步结束后,读取一维数组中以第一线程的线程标识作为下标的位置上的数值,并将读取的数值作为线程数据重映射产生的第一线程的更新的线程标识。第二线程处理器在线程同步结束后,读取一维数组中以第二线程的线程标识作为下标的位置上的数值,并将读取的数值作为线程数据重映射产生的第二线程的更新的线程标识。
该重映射的实现方式仅涉及一维数组的指针变换,避免了在重映射过程中直接调用待处理数据,可有效提高运算速度。
结合第二方面的前四种可能的实现方式中的任一者,在第五种可能的实现方式中,第一线程处理器运行第一线程,第一线程处理器在执行线程同步之后且执行线程数据重映射之前,该方法还包括以下步骤:第一线程处理器以第一线程的线程标识作为索引将第一待处理数据和第一循环变量记录在索引表,其中,第一线程的线程标识与第一待处理数据具有一一对应关系。第一线程处理器在执行线程数据重映射之后,该方法还包括以下步骤:第一线程处理器以执行线程数据重映射后产生的第一线程的更新的线程标识作为索引在索引表中读取与第一线程的更新的线程标识对应的第三待处理数据。第一线程处理器在第三待处理数据满足第一分支语句的判断条件时,执行第一分支语句,第一线程处理器在第三待处理数据满足第二分支语句的判断条件时,执行第二分支语句。
所有线程以自身的线程标识作为索引将待处理数据保存在索引表,并在获得重映射分配的更新的线程标识之后,以更新的线程标识作为索引从索引表获取更新的线程标识对应的待处理数据,并根据该待处理数据判断是执行第一分支语句抑或是第二分支语句,可在线程之间交换待处理数据,并保证内核代码的正常运行。
结合第二方面的第五种可能的实现方式,在第六种可能的实现方式中,图形处理器中还记录有每个线程的循环变量,循环变量用于指示线程当前所在的循环的序号,索引表中记录有第一线程的循环变量与第一线程的线程标识、第一线程在循环变量所指示的循环中的待处理数据的对应关系,第一线程处理器在执行线程数据重映射之后,该方法还包括:第一线程处理器以执行线程数据重映射后产生的第一线程的更新的线程标识作为索引在索引表中读取与第一线程的更新的线程标识对应的循环变量。第一线程处理器在执行第一分支语句或第二分支语句之后,将第一线程的更新的线程标识对应的循环变量加一以获取更新的循环变量,且在更新的循环变量不符合循环语句规定的循环条件时,结束第一线程,在更新的循环变量符合循环语句规定的循环条件时,运行第一线程的第二循环。
综上,可实现循环变量在线程之间交换,使得同一线程束可存在处于不同循环的线程。
在第二方面的一种可能的实现方式中,阈值为1。
在第二方面的一种可能的实现方式中,阈值为大于等于2且小于等于5的正整数。
在第二方面的一种可能的实现方式中,第一线程处理器执行第一分支语句的概率小于第一线程处理器执行第二分支语句的概率。
由于执行第一分支语句的概率较小,因此通过重映射将需执行第一分支语句的线程尽量集中在一个或多个线程束中,使得尽量多的线程束避免串行地执行第一分支语句和第二分支语句。
在第二方面的一种可能的实现方式中,计数器、索引表设置在图形处理器的共享存储器。
在第二方面的一种可能的实现方式中,待处理数据设置在图形处理器的全局存储器。
在第二方面的一种可能的实现方式中,第一线程处理器通过对计数器中的数值进行原子加一以实现加一步长的操作,第一线程处理器通过对计数器中的数值进行原子减一以实现减一步长的操作。
在第二方面的一种可能的实现方式中,待处理数据经由与图形处理器连接的中央处理器发送到图形处理器的全局存储器。
第三方面,本申请提供一种图形处理器,图形处理器用于处理M个线程束,每个线程束包括N个线程,图形处理器还包括至少一个线程束处理器,至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,第一线程束处理器包括第一线程处理器,第一线程处理器用以运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,图形处理器中设置有计数器,其中,第一线程处理器,用于获取需要处理的第一待处理数据,确定第一待处理数据满足第一分支语句,将计数器中的数值加一步长。第一线程处理器,用于根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量。第一线程处理器,用于在确认数量大于阈值的情况下,执行线程同步以及线程数据重映射。
第三方面或第三方面任意一种实现方式是第一方面或第一方面任意一种实现方式对应的装置实现,第一方面或第一方面任意一种实现方式中的描述适用于第三方面或第三方面任意一种实现方式,在此不再赘述。
第四方面,本申请提供一种图形处理器,图形处理器用于处理M个线程束,每个线程束包括N个线程,图形处理器还包括至少一个线程束处理器,至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,第一线程束处理器包括第一线程处理器,第一线程处理器运行循环语句,用以在一个循环中运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,图形处理器中设置有计数器,其中,第一线程处理器,用于在第一循环中获取需要处理的第一待处理数据,确定第一待处理数据满足第一分支语句,将计数器中的数值加一步长。第一线程处理器,用于根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量。第一线程处理器,用于在确认数量大于阈值的情况下,执行线程同步并将计数器中的数值清零。第一线程处理器,用于执行线程数据重映射。
第四方面或第四方面任意一种实现方式是第二方面或第二方面任意一种实现方式对应的装置实现,第二方面或第二方面任意一种实现方式中的描述适用于第四方面或第四方面任意一种实现方式,在此不再赘述。
第五方面,本申请提供一种线程处理方法,该方法应用于图形处理器,具体的,图形处理器用于处理M个线程束,每个线程束包括N个线程,该方法包括以下步骤:检测M*N个线程中需运行第一分支语句的线程的数量。在确认数量大于阈值的情况下,对M*N个线程进行线程数据重映射。
其中,在确认数量大于阈值的情况下,才执行线程同步以及线程数据重映射,可避免在该数量小于或等于阈值的情况下也执行线程同步以及线程数据重映射,可提高GPU的运行速度和运行效率。
在第五方面的第一种可能的实现方式中,M*N个线程分别设置有线程标识和待处理数据,待处理数据与线程标识具有一一对应关系,对M*N个线程进行线程数据重映射,包括:获取M*N个线程的待处理数据。在M*N个线程的任一线程的待处理数据满足第一分支语句的判断条件的情况下,将满足第一分支语句的判断条件的待处理数据依次映射至M*N个线程中相邻的部分线程。在M*N个线程的任一线程的待处理数据满足第二分支语句的判断条件的情况下,将满足第二分支语句的判断条件的待处理数据依次映射至M*N个线程中相邻的部分线程。将M*N个线程的线程标识分别更新为所映射的待处理数据对应的线程标识。
因此通过重映射将需执行第一分支语句的线程尽量集中在一个或多个线程束中,通过重映射将需执行第二分支语句的线程尽量集中在一个或多个线程束中,使得尽量多的线程束避免串行地执行第一分支语句和第二分支语句。
结合第五方面和第五方面的第一种可能的实现方式,在第二种可能的实现方式中,对M*N个线程进行线程数据重映射之前,该方法还包括:控制M*N个线程以自身的线程标识作为索引记录自身的待处理数据至索引表。对M*N个线程进行线程数据重映射之后,该方法还包括:控制M*N个线程以更新的线程标识作为索引在索引表获取与更新的线程标识对应的待处理数据。
所有线程以自身的线程标识作为索引将待处理数据保存在索引表,并在获得重映射分配的更新的线程标识之后,以更新的线程标识作为索引从索引表获取更新的线程标识对应的待处理数据,并根据该待处理数据判断是执行第一分支语句抑或是第二分支语句,可实现线程之间的数据交换,并保证内核代码的正常运行。
结合第五方面的第一种可能的实现方式,在第三种可能的实现方式中,M*N个线程分别运行循环语句,检测M*N个线程中需运行第一分支语句的线程的数量,包括:获取M*N个线程需要处理的待处理数据,在M*N个线程中的任一线程的待处理数据满足第一分支的判断条件的情况下,将计数器的数值加一。
结合第五方面的第三种可能的实现方式,在第四种可能的实现方式中,对M*N个线程进行线程数据重映射之前,控制M*N个线程以自身的线程标识作为索引记录自身的待处理数据和循环变量至索引表。且在对M*N个线程进行线程数据重映射之后,控制M*N个线程以更新的线程标识作为索引在索引表获取与更新的线程标识对应的待处理数据和循环变量。
通过将循环变量存储在索引表,可以实现循环变量在线程之间的交换。
结合第五方面的第三种可能的实现方式,在第五种可能的实现方式中,对M*N个线程进行线程数据重映射,还包括以下子步骤:控制确认计数器的数值大于阈值的线程以自身的线程标识作为索引记录自身的待处理数据和循环变量到索引表,将标志位设置为第二标志值,执行线程同步。控制其他线程在检测到标志位为第二标志值的情况下,以自身的线程标识作为索引记录自身的待处理数据和循环变量到索引表,执行线程同步。控制M*N个线程执行线程数据重映射以获取更新的线程标识。
发现计数器的数值超过阈值的线程处理器将标志位设置为第二标志值,其他线程处理器运行其他线程时,读取到第二标志值被设置就直接执行重映射,而不再去判断计数器的数值是否超过阈值,避免得出不一样的结论。采用这个设计后,只要M*N个线程中的一个线程作出执行线程同步的决定,其他线程必定跟随,规避了个别线程单独执行线程同步的问题。
在执行线程同步的步骤之后,结合第五方面的第四或第五种可能的实现方式,在第六种可能的实现方式中,该方法还包括:将标志位设置为第一标志值,并将计数器的数值设置为0。
将计数器中的数值清零,线程在下一循环读取到的计数器的数值不会受到本循环的影响。
第六方面,本申请提供一种图形处理器,该图形处理器包括线程束处理器,该线程束处理器用于处理M个线程束,每个线程束包括N个线程,该线程束处理器,用于检测M*N个线程中需运行第一分支语句的线程的数量。该线程束处理器,用于在确认数量大于阈值的情况下,对M*N个线程进行线程数据重映射。
第六方面或第六方面任意一种实现方式是第五方面或第五方面任意一种实现方式对应的装置实现,第五方面或第五方面任意一种实现方式中的描述适用于第六方面或第六方面任意一种实现方式,在此不再赘述。
附图说明
图1是根据本发明实施例的图形处理器和中央处理器的连接关系示意图;
图2是根据本发明实施例的工作组投放至第一线程束处理器的示意图;
图3是根据本发明实施例的线程数据重映射的流程图;
图4是根据本发明实施例的线程数据重映射的数据流向示意图;
图5是根据本发明实施例的线程数据重映射的前后时间开销示意图;
图6是根据本发明实施例的线程数据重映射的另一数据流向图;
图7是根据本发明实施例的线程数据重映射的另一前后时间开销示意图;
图8是根据本发明实施例的线程处理方法的流程图;
图9是根据本发明实施例的线程处理方法的另一流程图;
图10是根据本发明实施例的线程数据重映射的前后时间开销示意图;
图11是根据本发明实施例的线程数据重映射的另一前后时间开销示意图;
图12是根据本发明实施例的线程数据重映射的原理图;
图13是根据本发明实施例的线程处理方法的另一流程图;
图14是根据本发明实施例的线程处理方法的另一流程图;
图15是根据本发明实施例的异构系统的装置结构示意图。
具体实施方式
为了方便理解本发明的各实施例,下面先对本发明各实施例涉及到的一些技术术语进行介绍,后文的各实施例可以参考下面的技术术语介绍。
1、图形处理器
图形处理器(Graphics Processing Unit,GPU)是一种在个人电脑、工作站、游戏机和一些移动设备(如平板电脑、智能手机等)等设备上进行图像运算工作的微处理器。GPU的用途是将计算机系统所需要的显示信息进行转换驱动,并向显示器提供行扫描信号,控制显示器的正确显示。
GPU作为一个大规模并行计算元件,因其日益强大的计算能力,已经被广泛应用于通用计算中。不同领域中的大量程序都用GPU进行加速,如传统计算密集型的科学计算、 文件系统、网络系统、数据库系统和云计算等。
如图1所示,GPU包括全局存储器、调度器以及多个线程束处理器,其中图1是根据本发明实施例的图形处理器和中央处理器的连接关系示意图。
全局存储器存储有接收自CPU的主机端代码(host code)、内核代码(kernel code)以及待处理数据。
调度器用于根据主机代码设置工作组(workgroup),该工作组包括M个线程束,每个线程束包括N个线程,调度器选择一个空闲的线程束处理器,并将M个线程束投放到该线程束处理器,该线程束处理器用于处理该工作组的M*N个线程,其中M≥1,N≥1。
每个线程束包括预定数量的线程,一个线程束处理器可同一时间运行该预定数量的线程。在不同的产品中,所述线程束处理器可能有别的名称。比如,美国超微半导体公司(Advanced Micro Devices,AMD)会将该线程束处理器称作计算单元(Computing Unit,CU),英伟达公司(NVIDIA Corporation,NVIDIA)会将线程束处理器称作流多处理器(Stream Multiprocessor,SM)。
每个线程束处理器包括共享存储器、N个线程处理器以及N个私有存储器,每个线程处理器用于在同一时刻运行一个线程,私有存储器用于存储线程运行过程中涉及的待处理数据以及过程数据,其中过程数据包括计算结果、循环变量、以及计算过程中涉及的中间值等。
其中,一个线程束处理器包括的线程处理器的数量为N的整数倍。
每个线程处理器分别对应设置有一个私有存储器,线程处理器只能访问与自身对应的私有存储器,其中私有存储器可例如为寄存器组或内存,同一个线程束处理器内的线程处理器均可访问同一个线程束处理器内的共享存储器,但不能访问不同线程束处理器内的共享存储器。举例而言,在图1所示的图形处理器20中,第一线程处理器211、第二线程处理器212、第三线程处理器213以及第四线程处理器214均可访问共享存储器210,但不能访问共享存储器220。第一线程处理器211可访问第一线程处理器211的私有存储器215,但不能访问第二线程处理器212的私有存储器216。
进一步,每个线程束处理器中的线程处理器均可访问全局存储器,举例而言,在图1中,第一线程处理器211和第五线程处理器221均可访问全局存储器24。
2、GPU程序
一个GPU程序可以分为两个部分:如上所述的主机端代码和内核代码。在CPU上运行的代码编辑器可编辑GPU程序,设置待处理数据,GPU程序可通过在CPU上运行的编译器编译成二进制格式的GPU可执行代码。CPU发送待处理数据和编译好的GPU程序至GPU的全局存储器。
GPU的调度器读取全局存储器中的主机端代码对内核代码上下文进行初始化以创建工作组。GPU的调度器将待处理数据分配至工作组中的线程,并通知工作组中的每一线程执行内核代码以处理各自分配到的待处理数据。
在另外一些示例中,待处理数据可由GPU通过执行内核代码产生。
在主机端代码设置好的待处理数据的情况下,内核代码上下文的初始化可将主机端代码设置好的待处理数据以数组的方式分配至线程,其中该数组以线程标识为下标,一个GPU 的内核代码描述的是一个线程的行为,对应线程标识的线程可读取该数组中的待处理数据,并根据待处理数据执行该行为。在另一些示例中,在主机端代码没有设置好的待处理数据的情况下,内核代码描述的是一个线程的行为,该线程可根据内核代码产生待处理数据,并根据待处理数据执行该行为。
3、线程束
线程束是由线程组成的集合体。GPU作为协处理器,在从CPU接收到内核代码启动调用时,会通过调度器创建大量的线程。这些线程会分层地组织到一起。举例而言,一个线程束可包括32个线程,即N=32,又或者,一个线程束可包括64个线程或其他数量的线程,即N=64。
线程束中的线程在线程束处理器上执行时,绑定在一起执行,每一时刻都执行统一的指令,但处理不同的待处理数据。一些线程束组成一个工作组。内核代码对应分配有一个或多个工作组,每一工作组包括M个线程束,每一线程束包括N个线程,其中,M为CPU执行主机端代码进行内核代码上下文的初始化后控制GPU创建的工作组所包括的线程束的数量,每个线程束所包括的线程的数量N为所创建的该工作组的任意一个线程束所包括的线程的数量。
M的具体数值可在主机端代码中设置,或由GPU预先设定为一个固定数值,在一些示例中,M的取值范围可以为4≤M≤32,在另一些示例中,M的取值范围可以为1≤M。
N与GPU的一个线程束处理器所包括的线程处理器的数量E之间通常具有倍数关系,E=n*N,n为大于或等于1的正整数。
可结合图2进行参考,图2是根据本发明实施例的工作组投放至第一线程束处理器的示意图。在图2的示例中,调度器23创建的工作组包括3个线程束11、12、13,即M=3,线程束11包括线程1至4,线程束12包括线程5至8,线程束13包括线程9至12,即N=4。
并且,在图2中,第一线程束处理器21仅设置有4个第一线程处理器211至214,而第一线程束处理器21在同一时刻仅可运行一个线程束,例如线程束11,当线程束处理器运行线程束11时,第一线程处理器211运行线程1,第二线程处理器212运行线程2,第三线程处理器213运行线程3,第四线程处理器214运行线程4。在线程束11的所有线程运行完毕之后,线程束处理器21运行线程束12或线程束13,其中,若线程束处理器21运行线程束12,第一线程处理器211运行线程5,第二线程处理器212运行线程6,第三线程处理器213运行线程7,第四线程处理器214运行线程8。之后,线程束处理器21可运行线程束13,在线程束处理器21运行线程束13时,第一线程处理器211运行线程9,第二线程处理器212运行线程10,第三线程处理器213运行线程11,第四线程处理器214运行线程12。
其中,第一线程束处理器21运行工作组的线程束的先后顺序由调度器23决定,调度器23会让第一线程束处理器21优先运行没有发生读延迟的线程束,其中读延迟是指线程处理器在从全局存储器24读取待处理数据时产生的延迟。
举例而言,在图2中,若调度器23首先控制第一线程束处理器21运行线程束11,而在此过程中,第一线程处理器211从全局存储器24读取待处理数据A[0]至共享存储器210时产生延迟,此时调度器23可通知第一线程束处理器21停止运行线程束11,转而运行线程束12或线程束13,从而避免等待该延迟。
值得注意的是,第一线程束处理器21运行线程束12或线程束13的过程中,调度器23可通知存储器控制器(memory controller,图未示出)将待处理数据A[0]从全局存储器24读取至共享存储器210中。因此,在第一线程束处理器21运行线程束12或线程束13完毕之后,可在调度器23的控制下无需等待待处理数据A[0],继续运行线程束11,从而实现掩盖延迟的功能。
值得注意的是,在实际应用中,线程束处理器21可包括更多数量的线程处理器,因此,线程束处理器21在同一时刻可运行不止1个线程束,而在本发明实施例中,为了便于说明,将线程束处理器21设置为包括4个线程处理器,使得线程束处理器21在同一时刻运行一个线程束。
4、分支分歧
分支分歧是GPU计算中常见的导致性能损耗的因素。
每个线程的线程标识(thread identification,TID)以及读取到的待处理数据都不尽相同,因此遭遇分支的时候会得出不同的判断条件。当同一个线程束中的线程需要执行不同的分支时,由于执行指令的统一性,该线程束将串行执行成员线程需要执行的所有分支,这被称为分支分歧问题。每条分支都是全部线程一起执行,但是无关的线程运行的结果会被舍弃,这降低了并行度以及执行的效率。例如,单层的分支可以将效率降至50%,而循环语句中的多层嵌套分支更可造成指数级增长的减速。
5、循环语句
举例而言,循环语句可包括for语句:
for语句的一般形式如下:
“for(循环变量的初始值;循环变量的范围;循环变量加1)”
循环变量可例如为i,初始值设置为0,循环变量的范围可限定为小于1000次,即i<1000,并且,循环变量在每个循环过程进行加一操作i++,即i=i+1,使得循环变量在每个循环执行时进行自加1的操作。
其中,for(i=0;i<1000;i++)语句1,表示需要执行语句1一千次。
在另外一些示例中,循环语句也可包括while语句,其中while语句并没有限定循环变量,即其循环的次数没有限定,while语句的一般形式如下:
“while(待处理数据是否满足判断条件)”
当待处理数据满足判断条件时,继续循环,当待处理不满足判断条件时,则退出循环。
值得注意的是循环语句还可以包括其他语句,如select语句,于此不作赘述。
6、分支语句
举例而言,分支语句可包括if语句:
if语句是选择语句,if语句用于实现两个分支的选择。if语句的一般形式如下:
“if(条件判断式)第一分支语句
   [else第二分支语句]”
其中,方括号内的部分(即else子句)为可选的,即可以有,也可以没有。第一分支语句和第二分支语句可以是一个简单的语句,也可以是一个复合语句,还可以是另一个if语句(即在一个if语句中又包括另一个或多个内嵌的if语句)。条件判断式也可以称之为表达式(expression),分支语句也可以称之为语句(statement)。
在本发明实施例中,条件判断式设置为使得需执行第一分支语句的次数小于需执行第二分支语句的次数。
举例而言,条件判断式可设置为
（该条件判断式在原公开文本中以图片Figure PCTCN2018076885-appb-000002给出,未能以文本形式提取）
其中,temp=rand()*1000。
rand()函数为伪随机函数,可用于产生0至1之间的任意数值,temp作为待处理数据分配给工作组的每个线程,每个线程分到的temp不相同,在该内核代码中,temp<5发生的概率小于temp>=5发生的概率。但temp<5一旦发生,则要执行代码A从而执行第一分支语句,反之则执行代码B从而执行第二分支语句。
在if语句中又包括一个或多个if语句,称为if语句的嵌套,其一般形式如下:
（该一般形式在原公开文本中以图片Figure PCTCN2018076885-appb-000003给出,未能以文本形式提取）
在其他示例中,分支语句还包括switch语句,于此不作赘述。
7、线程同步
同一工作组的线程间进行数据交换时需要进行同步,而GPU提供了软件接口,举例而言线程可通过调用障碍函数barrier()实现线程同步,对于一个调用了障碍函数的线程来说,除非同一个工作组内其他线程都执行了障碍函数,否则该线程将被阻止执行障碍函数之后的内核代码,且在线程处理器执行了障碍函数之后,线程处理器为该线程设置停顿点(break point),该停顿点记录了障碍函数的下一语句在内核代码中的位置,将该位置记录在该线程的私有存储器中,并暂停运行该线程。
当同一个工作组内其他线程都执行了障碍函数之后,线程处理器从私有存储器中读取停顿点,从而可执行障碍函数的下一语句,使得线程继续运行。
因此,针对执行了障碍函数的线程而言,线程同步结束的条件是一个工作组内M*N个线程都执行了障碍函数。
8、线程数据重映射
线程数据重映射是解决分支分歧的软件技术,其通过调整线程与待处理数据间的映射关系,使同一线程束中的线程得到的判断条件一致,从而消除分支分歧。
为了进一步清楚说明,请结合图3和图4一并参考,图3是根据本发明实施例的线程数据重映射的流程图,而图4是根据本发明实施例的线程数据重映射的数据流向示意图,在本发明实施例中,假设图2所示的3个线程束11、12、13由调度器23投放至第一线程束处理器21,因此图3所示的方法由第一线程束处理器21执行,线程数据重映射包括以下步骤:
步骤S1041:获取待处理数据。
举例而言,如图4所示,线程1至12的线程标识tid分别为0至11,待处理数据为数组A[tid]={13,6,0,1,2,7,8,9,10,11,3,12}中的数据,数组A[tid]由CPU30发送至GPU的全局存储器24,并被第一线程束处理器21读取至共享存储器210,线程处理器在运行线程时可根据该线程的线程标识tid从共享存储器210读取到待处理数据A[tid],第一线程处理器211运行线程1时,可根据线程1的线程标识0,在共享存储器210的数组A[tid]读取待处理数据A[0]=13。
待处理数据与待处理数据所在线程的线程标识具有一一对应关系,例如待处理数据13对应线程标识0,待处理数据6对应线程标识1。
步骤S1042:判断待处理数据满足何种分支的判断条件。
举例而言,判断条件为分支语句的判断条件,如上文的代码段中的A[tid]<5为第一分支语句的判断条件,A[tid]≥5为第二分支语句的判断条件,即在满足该条件时,执行第一分支语句,在不满足该条件时,执行第二分支语句。
结合图4,A[2]=0,A[3]=1,A[4]=2,A[10]=3,均小于5,因此,运行线程3、线程4、线程5以及线程11时需执行第一分支语句,运行其余线程时需执行第二分支语句。
步骤S1043:将满足第二分支语句的判断条件的待处理数据依次映射至M*N个线程中相邻的部分线程。
举例而言,如图4所示,将满足第二分支语句的判断条件的待处理数据13,6,7,8,9,10,11,12分配至M个线程束中从第一个线程开始且相邻的多个线程1至8。
步骤S1044:将满足第一分支语句的判断条件的待处理数据依次映射至M*N个线程中相邻的其他部分线程。
举例而言,将满足第一分支语句的判断条件的待处理数据0,1,2,3分配至M个线程束中从最后一个线程开始且相邻的多个线程9至12。
步骤S1045:将M*N个线程的线程标识分别更新为所映射的待处理数据对应的线程标识。
对应地,在图4中,将线程1的线程标识0修改为待处理数据13的线程标识0,将线程2的线程标识1修改为待处理数据6的线程标识1,将线程3的线程标识2修改为待处理数据7的线程标识5,将线程4的线程标识3修改为待处理数据8的线程标识6,将线程5的线程标识4修改为待处理数据9的线程标识7,将线程6的线程标识5修改为待处理数据10的线程标识8,将线程7的线程标识6修改为待处理数据11的线程标识9,将线程8的线程标识7修改为待处理数据12的线程标识11,将线程9的线程标识8修改为待处理数据3的线程标识10,将线程10的线程标识9修改为待处理数据2的线程标识4,将线程11的线程标识10修改为待处理数据1的线程标识3,将线程12的线程标识11修改为待处理数据0的线程标识2。
为了进一步清楚说明线程数据重映射,以下请结合图5,图5是根据本发明实施例的线程数据重映射的前后时间开销示意图。
图5的上半部分示出的是未做线程数据重映射时第一线程束处理器21分别运行线程束11至13所需的时间,于此假设线程执行第一分支语句的时间为T2,线程执行第二分支语句的时间为T1,由于每个线程束均存在执行不同分支的线程,因此每个线程束均需要串行地执行第一分支语句和第二分支语句,所需总时间为T1+T2,第一线程束处理器21运行三个线程束共需时间t1=3T1+3T2。
图5的下半部分示出的是线程数据重映射之后线程束处理器21分别运行线程束11至13所需的时间,由于引起线程执行第一分支语句的待处理数据均分配到同一线程束13中,而线程束11和12的线程均只需执行第二分支语句,因此线程束11所需时间为T1,线程束12所需时间为T1,线程束13所需时间为T2,第一线程束处理器21运行三个线程束共需时间t2=2T1+T2。
综上可知,运行线程数据重映射之后,可节约t1-t2=T1+2T2的时间,可以预见到,当线程束数量不止3个,分支数量不止2个时,所能节约的时间更多。
但是,并不是每种情况下都适合做线程数据重映射,具体可参见图6,图6是根据本发明实施例的线程数据重映射的另一数据流向图,在图6的示例中,只有线程3需执行第一分支语句。
并请结合图7进行参考,图7是根据本发明实施例的线程数据重映射的另一前后时间开销示意图。
如图7所示,由于只有1个线程束11具有需执行第一分支语句的一个线程3,因此,即便将线程3的待处理数据0转移到最后一个线程12,线程束13也需要串行执行第一分支语句和第二分支语句,因此,在图7的示例中,当需执行第一分支语句的线程的数量为1时,无论执行线程数据重映射与否,均需时3T1+T2。故在这种情况下,进行线程数据重映射是没有效果的,反而会浪费第一线程束处理器21有限的运算资源,并造成无用的时间开销。
值得注意的是,在另外一些示例中,当需执行第一分支语句的线程的数量可以为2、3、4、5等正整数时,进行线程数据重映射也会在一定程度上浪费线程束处理器21有限的运算资源,并造成无用的时间开销。
另外,进行线程数据重映射之前,往往需要进行线程同步,每次线程同步都会造成M*N个线程的暂停,如此一来将会极大地影响GPU的运行速度和运行效率,并且,在需要执行第一分支语句的线程的数量为1或其他较小的数值如为2、3、4以及5时,进行线程数据重映射并不能有效解决分支分歧的问题,反而引入了线程同步,造成延时。
有鉴于此,本发明实施例提供了一种线程处理方法,该方法应用于图形处理器,图形处理器用于处理M个线程束,每个线程束包括N个线程,图形处理器还包括至少一个线程束处理器,至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,第一线程束处理器包括第一线程处理器,第一线程处理器用以运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,图形处理器中设置有计数器,该方法包括:
第一线程处理器获取需要处理的第一待处理数据,确定第一待处理数据满足第一分支语句,将计数器中的数值加一步长;
第一线程处理器根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量;
第一线程处理器在确认数量大于阈值的情况下,执行线程同步以及线程数据重映射。
本发明实施例通过第一线程处理器判断运行第一分支语句的线程的数量,在确认数量大于阈值的情况下,才执行线程同步以及线程数据重映射,可避免在该数量小于或等于阈值的情况下也执行线程同步以及线程数据重映射,可提高GPU的运行速度和运行效率。
为了对上述方法作出清楚说明,具体可参见图8,图8是根据本发明实施例的线程处理方法的流程图,该方法应用于第一线程束处理器21的每一线程处理器,值得注意的是,在本实施例中,假设全局存储器24存储了待处理数据A[tid],A[tid]是由CPU设置好并发送至GPU的一维数组,其以线程的标识tid作为下标,工作组包括线程束11、12、13,且线程束11、12、13被调度器23投放到第一线程束处理器21。
并且,在本发明实施例中,A[tid]的数值小于5的概率小于A[tid]的数值大于或等于5的概率。
其中,A[tid]的数值小于5为第一分支语句的判断条件,A[tid]的数值大于或等于5为第二分支语句的判断条件,即线程执行第一分支语句的次数小于线程执行第二分支语句的次数。
举例而言,A[tid]=[13,6,0,1,2,7,8,9,10,11,3,12]。
在一些示例中,A[tid]可在CPU端根据temp=rand()*1000生成,其中rand()是伪随机数,其会产生0-1之间的任意数值。
值得注意的是,在图8所示的方法被线程处理器运行之前,需先在共享存储器210声明计数器的初始数值为0,标志位的初始数值为第一标志值。
举例而言,第一标志值的初始数值可设置为0。
其中,在共享存储器210声明计数器的初始数值为0,标志位的初始数值为第一标志值的步骤可由调度器23在初始化过程中,在建立工作组并将工作组的线程束11、12、13投放至第一线程束处理器21之前,从第一线程束处理器21选择任一线程处理器执行。
并且,计数器设置在共享存储器210中,第一线程束处理器21的每一线程处理器在执行图8所示的方法以运行线程时,均可访问计数器,具体而言,第一线程束处理器21的每一线程处理器在运行线程时,均可读取计数器的数值,并可修改计数器的数值。
值得注意的是,在本实施例中,第一线程束处理器21的线程处理器在运行线程时,可原子读取计数器的数值,并可对计数器的数值进行原子加一步长。
其中,“原子读取”表示一个线程处理器在运行线程过程中读取计数器的数值时,位于同一线程束处理器的其他线程处理器不能读取计数器的数值,只有前面的线程处理器读取完毕之后,位于同一线程束处理器的其他线程处理器才读取计数器的数值,即,同一时间仅允许同一工作组的M*N个线程中的一个线程读取计数器的数值。
同理,“原子加一步长”是指同一时间仅允许同一工作组的M*N个线程中的一个线程对计数器的数值加一步长。
其中,步长可以是任意正整数,举例而言,可设置为1。
该线程处理方法包括以下步骤:
步骤S401:流程开始。
步骤S402:判断需要处理的待处理数据是否满足第一分支语句的判断条件,如果是,执行步骤S403,如果否,执行步骤S404。
举例而言,第一线程处理器211可根据正在第一线程处理器211上运行的线程1的线程标识tid=0从全局存储器24读取需要处理的待处理数据A[0]=13。
步骤S403:对计数器的数值加一步长。
由于本方法应用于同一线程束处理器21中的每一线程处理器,因此,第一线程处理器211、第二线程处理器212、第三线程处理器213以及第四线程处理器214在运行各自的线程时,在各自需要处理的待处理数据满足第一分支语句的判断条件时,可分别对处于共享存储器210的计数器的数值加一步长,即,计数器可累计同一工作组的线程束11、12以及13中需执行第一分支语句的线程的数量。
步骤S404:判断标志位的值是否为第二标志值,如果是执行步骤S408,如果否,执行步骤S405。
第二标志值是与第一标志值不同的数值,当第一标志值是0时,第二标志值是1,当第一标志值是1时,第二标志值可为0。
步骤S405:读取计数器的数值。
具体而言,在本步骤中,线程处理器原子读取计数器的数值,由于本方法应用于同一线程束处理器21中的每一线程处理器,因此,在步骤S403至本步骤之间,可累计第一线程处理器211、第二线程处理器212、第三线程处理器213以及第四线程处理器214分别运行线程束11、线程束12、以及线程束13的线程时,需运行第一分支语句的线程的数量。
步骤S406:判断计数器的数值是否大于阈值,如果是,执行步骤S407,如果否,执行步骤S411。
此处以阈值为1为例。如上所述,具体可根据实际需要设置该阈值,比如设置为大于等于2且小于等于5的正整数。
步骤S407:将标志位的值设置为第二标志值,第二标志值用于指示需要执行重映射。
在本步骤中,第一个发现计数器的数值超过阈值的线程处理器将标志位设置为第二标志值,其他线程处理器运行其他线程时,发现标志位被设置为第二标志值就会跳至步骤S408,直接执行线程同步,而不再去判断计数器的数值是否超过阈值,避免得出不一样的结论。采用这个设计后,只要M*N个线程中的一个线程作出执行线程同步的决定,其他线程必定跟随,规避了个别线程单独执行线程同步的问题。
其中,个别线程单独执行线程同步的问题是因为以下原因引起的:在步骤S405中,线程处理器运行线程时读取到的计数器的数值是该线程处理器执行原子读取的指令之后从共享存储器210读取到的数值,然而从执行指令到读取数值之间会存在时间间隔,而这个间隔中其他线程可能对计数器的数值进行原子加一的操作,因此不同线程处理器运行各自线程读取到的计数值可能存在差异,从而使得个别线程单独执行线程同步,而其他线程不执行线程同步。
步骤S408:执行线程同步。
正如以上所述,线程处理器运行线程时可通过调用障碍函数barrier()实现线程同步,对于一个调用了障碍函数的线程来说,除非同一个工作组内其他线程都执行了障碍函数,否则该线程将被阻止执行障碍函数之后的内核代码,且在线程处理器执行了障碍函数之后,线程处理器为该线程设置停顿点(break point),该停顿点记录了障碍函数的下一语句在内核代码中的位置,将该位置记录在该线程的私有存储器中,并暂停运行该线程。
当同一个工作组内其他线程都执行了障碍函数之后,线程处理器从私有存储器中读取停顿点,从而可执行障碍函数的下一语句,使得线程继续运行。
因此,在图2所示的第一线程束处理器21中,只要任一线程处理器在运行线程1至12中任一者时调用了障碍函数,则线程同步结束的条件是线程1至12在被各自对应的线程处理器运行过程中都调用了障碍函数。
步骤S409:在线程同步结束后,将计数器的数值设置为0,将标志位的值设置为第一标志值,以线程标识为索引将第一线程的待处理数据记录在索引表。
其中,索引表设置在共享存储器210中,线程1至12均可访问索引表,线程标识为线程处理器正在运行的线程的线程标识,举例而言,线程处理器可通过调用get_global_id()函数获得线程标识。
举例而言,假设第一线程为线程1,其待处理数据如图4所示为A[0]=13,则线程1在索引表记录线程标识为0对应待处理数据13。
由于本方法应用于同一线程束处理器21中的每一线程处理器,因此,第一线程可为线程1至12中的任一者。故在线程1至12均运行完本方法之后,产生的索引表如表1所示:
线程标识(索引) 待处理数据
0 13
1 6
2 0
3 1
4 2
5 7
6 8
7 9
8 10
9 11
10 3
11 12
在表1中,线程标识与待处理数据之间具有一一对应关系。
步骤S410:执行线程数据重映射,获取线程数据重映射产生的更新的线程标识,根据更新的线程标识从索引表获取与更新的线程标识对应的待处理数据。
其中,线程数据重映射的原理可参见图3及其对应描述,于此不作赘述。
在本步骤中,举例而言,根据图4,线程数据重映射之后,线程3的tid从2更新为5,因此,第三线程处理器213在运行线程3并执行线程数据重映射之后,获取线程数据重映射产生的更新的线程标识5,根据更新的线程标识5从索引表获取与更新的线程标识5对应的待处理数据7。
步骤S411:判断待处理数据满足第一分支语句的判断条件还是第二分支语句的判断条件,如果满足第一分支语句的判断条件,则执行步骤S412,如果满足第二分支语句的判断条件,则执行步骤S413。
举例而言,第一分支语句的判断条件可例如为A[tid]<5,第二分支语句的判断条件可例如为A[tid]≥5。
步骤S412:执行第一分支语句。
步骤S413:执行第二分支语句。
步骤S414:流程结束。
在本步骤中,流程结束是指线程处理器结束当前运行的线程,举例而言,第一线程处理器211结束线程1,值得注意的是,第一线程处理器211可在结束线程1后,在线程1所在的线程束11的其他线程均结束时,转而运行其他线程束的线程,如线程5或线程9。
在线程处理器运行每一线程以实现本方法之后,由于每个线程在运行过程中均对计数器的数值与阈值进行比较,在计数器的数值不大于阈值时,不执行线程同步和线程数据重映射,因此可避免出现图7所示的情况,从而防止因无效的线程数据重映射造成延时。
由于GPU并行运算的场景中,涉及多循环的场景较多,因此本发明实施例进一步提供一种线程处理方法,以在多循环的场景实现无用的线程数据重映射的识别。具体而言,该线程处理方法应用于图形处理器,图形处理器用于处理M个线程束,每个线程束包括N个线程,每个线程运行循环语句,图形处理器还包括至少一个线程束处理器,至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,第一线程束处理器包括第一线程处理器,第一线程处理器用以在一个循环中运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,图形处理器中设置有计数器,该方法包括:
第一线程处理器在第一循环中获取需要处理的第一待处理数据,确定第一待处理数据满足第一分支语句,将计数器中的数值加一步长;
第一线程处理器根据计数器的数值确定M*N个线程中需运行第一分支语句的线程的数量;
第一线程处理器在确认数量大于阈值的情况下,执行线程同步并将计数器中的数值清零;
第一线程处理器执行线程数据重映射。
本发明实施例通过第一线程处理器在确认数量大于阈值的情况下,执行线程同步并将计数器中的数值清零,并执行线程数据重映射,使得第一线程处理器在下一循环读取到的计数器的数值不会受到第一循环的影响,并且,由于在确认数量大于阈值的情况下,才执行线程同步以及线程数据重映射,因此可避免在该数量小于或等于阈值的情况下也执行线程同步以及线程数据重映射,可在每一循环中动态地判断是否有执行线程数据重映射的需要,可避免进行无用的线程数据重映射,可提高GPU的运行速度和运行效率。
为了对上述方法作出清楚说明,具体可参见图9,图9是根据本发明实施例的线程处理方法的另一流程图,该方法应用于第一线程束处理器21的每一线程处理器,值得注意的是,在本实施例中,假设全局存储器24存储了待处理数据A[i,tid],A[i,tid]是由CPU设置好并发送至GPU的二维数组,其以线程标识tid作为行下标,以循环变量i作为列下标,循环变量i用于指示线程当前所在的循环的序号。工作组包括线程束11、12、13,且线程束11、12、13被调度器23投放到第一线程束处理器21。并且,线程执行第一分支语句的概率小于线程执行第二分支语句的概率。
在另一些示例中,CPU也可以将待处理数据设置为A[tid,i]。
在本实施例中,以循环语句for(i=0;i<1000;i++)为例进行说明,且tid为0至11,因此A[i,tid]具有1000行,12列。由于二维数组涉及的数值较多,为便于说明,仅示出A[i,tid]之部分,具体而言,在本实施例中,假设线程束11在当前时刻运行至i=11的循环,线程束12在当前时刻运行至i=8的循环,线程束13在当前时刻运行至i=10的循环。
并且,假设在当前时刻,线程1需要处理的待处理数据是A[11,0]=13,线程2需要处理的待处理数据是A[11,1]=6,线程3需要处理的待处理数据是A[11,2]=0,线程4需要处理的待处理数据是A[11,3]=1,线程5需要处理的待处理数据是A[8,4]=2,线程6需要处理的待处理数据是A[8,5]=7,线程7需要处理的待处理数据是A[8,6]=8,线程8需要处理的待处理数据是A[8,7]=9,线程9需要处理的待处理数据是A[10,8]=10,线程10需要处理的待处理数据是A[10,9]=11,线程11需要处理的待处理数据是A[10,10]=3,线程12需要处理的待处理数据是A[10,11]=12。具体可结合图10的上半部分进行参考,其中,图10是根据本发明实施例的线程数据重映射的前后时间开销示意图。
值得注意的是,在本实施例中,A[i,tid]数组设置为在i为固定值时A[i,tid]的数值小于5的概率小于A[i,tid]的数值大于或等于5的概率,即在工作组的每次循环中,线程执行第一分支语句的次数小于执行第二分支语句的次数。
在一些示例中,A[i,tid]可由CPU根据temp=rand()*1000生成,其中rand()是伪随机数,其会产生0-1之间的任意数值。
并且,与上一实施例类似,在图9所示的方法被线程处理器运行之前,需先在共享存储器210声明计数器的初始数值为0,标志位的初始数值为第一标志值。
举例而言,第一标志值可为0。
其中,在共享存储器210声明计数器的初始数值为0,标志位的初始数值为第一标志值的步骤可由调度器23在初始化过程中,在建立工作组并将工作组的线程束11、12、13投放至第一线程束处理器21之前,从第一线程束处理器21选择任一线程处理器执行。
并且,计数器设置在共享存储器210中,第一线程束处理器21的每一线程处理器在运行线程时,均可访问计数器,具体而言,第一线程束处理器21的每一线程处理器在运行线程时,均可读取计数器的数值,并可修改计数器的数值。
值得注意的是,在本实施例中,第一线程束处理器21的线程处理器在运行线程时,可原子读取计数器的数值,并可对计数器的数值进行原子加一步长。
其中,“原子读取”表示一个线程处理器在运行线程过程中读取计数器的数值时,位于同一线程束处理器的其他线程处理器不能读取计数器的数值,只有前面的线程处理器读取完毕之后,位于同一线程束处理器的其他线程处理器才读取计数器的数值,即,同一时间仅允许同一工作组的M*N个线程中的一个线程读取计数器的数值。
同理,“原子加一步长”是指同一时间仅允许同一工作组的M*N个线程中的一个线程对计数器的数值加一步长。
其中,步长可以是任意正整数,举例而言,可设置为1。
该线程处理方法包括以下步骤:
步骤S501:流程开始。
步骤S502:判断循环是否结束,如果是,执行步骤S503,如果否,执行步骤S504。
可选地,在本实施例中,循环语句例如为for(i=0;i<1000;i++),其中,i为循环变量,循环变量用于指示线程已完成的循环的次数。
其中,i的初始值是0,且线程在完成一次循环时,对i加1,当i的值累加至1000时,线程可跳出循环语句,此时循环结束。
步骤S503:流程结束。
在本步骤中,流程结束是指线程处理器结束当前运行的线程,举例而言,第一线程处理器211结束线程1,值得注意的是,第一线程处理器211可在结束线程1后,在线程1所在的线程束11的其他线程均结束时,转而运行其他线程束的线程,如线程5或线程9。
步骤S504:判断需要处理的待处理数据是否满足第一分支语句的判断条件,如果是,执行步骤S505,如果否,执行步骤S506。
其中,全局存储器24记录有需要处理的待处理数据,需要处理的待处理数据为A[i,tid],线程处理器可根据在线程处理器上运行的线程的线程标识和循环变量的数值,从全局存储器24获取A[i,tid]。
举例而言,第一分支语句的判断条件为A[i,tid]<5。
步骤S505:对计数器的数值加一步长。
由于本方法应用于同一线程束处理器21中的每一线程处理器,因此,第一线程处理器211、第二线程处理器212、第三线程处理器213以及第四线程处理器214在运行各自的线程时,在各自需要处理的待处理数据满足第一分支语句的判断条件时,可分别对处于共享存储器210的计数器的数值加一步长,即,计数器可累计同一工作组的线程束11、12以及13中需执行第一分支语句的线程的数量。
步骤S506:判断标志位的值是否为第二标志值,如果是执行步骤S510,如果否,执行步骤S507。
第二标志值是与第一标志值不同的数值,第二标志值用于指示需要执行重映射,在一些示例中,第一标志值是0,第二标志值是1,在另一些示例中,第一标志值是1,第二标志值是0。
步骤S507:读取计数器的数值。
具体而言,在本步骤中,线程处理器原子读取计数器的数值,由于本方法应用于同一线程束处理器21中的每一线程处理器,因此,在步骤S505至本步骤之间,可累计第一线程处理器211、第二线程处理器212、第三线程处理器213以及第四线程处理器214分别运行线程束11、线程束12、以及线程束13的线程时,需运行第一分支语句的线程的数量。
步骤S508:判断计数器的数值是否大于阈值,如果是,执行步骤S509,如果否,执行步骤S513。
在本实施例中,阈值为1。
在另外一些示例中,阈值可为2-5之间的正整数。
步骤S509:将标志位的值设置为第二标志值,第二标志值用于指示需要执行重映射。
在本步骤中,第一个发现计数器的数值超过阈值的线程处理器将标志位设置为第二标志值,其他线程处理器运行其他线程时,在步骤S506读取到标志位设置为第二标志值时会跳至步骤S510,直接执行线程同步,而不再去判断计数器的数值是否超过阈值(步骤S508),避免得出不一样的结论。采用这个设计后,只要M*N个线程中的一个线程作出执行线程同步的决定,其他线程必定跟随,规避了个别线程单独执行线程同步的问题。
其中,个别线程单独执行线程同步的问题是因为以下原因引起的:在步骤S507中,线程处理器运行线程时读取到的计数器的数值是该线程处理器执行原子读取的指令之后从共享存储器210读取到的数值,然而从执行指令到读取数值之间会存在时间间隔,而这个间隔中其他线程在各自的循环中可能对计数器的数值进行原子加一的操作,因此不同线程处理器运行各自线程读取到的计数值可能存在差异,从而使得个别线程单独执行线程同步,而其他线程不执行线程同步。
步骤S510:执行线程同步。
正如以上所述,线程处理器运行线程时可通过调用障碍函数barrier()实现线程同步,对于一个调用了障碍函数的线程来说,除非同一个工作组内其他线程都执行了障碍函数,否则该线程将被阻止执行障碍函数之后的内核代码,且在线程处理器执行了障碍函数之后,线程处理器为该线程设置停顿点(break point),该停顿点记录了障碍函数的下一语句在内核代码中的位置,将该位置记录在该线程的私有存储器中,并暂停运行该线程。
当同一个工作组内其他线程都执行了障碍函数之后,线程处理器从私有存储器中读取停顿点,从而可执行障碍函数的下一语句,使得线程继续运行。
因此,针对图2所示的第一线程束处理器21,只要任一线程处理器在运行线程1至12中任一者时调用了障碍函数,则线程同步结束的条件是线程1至12在被各自的线程处理器运行过程中都调用了障碍函数。
步骤S511:在线程同步结束后,将计数器的数值设置为0,将标志位设置为第一标志值,以线程标识为索引将第一线程的待处理数据和循环变量记录在索引表。
其中,索引表设置在共享存储器210中,线程1至12均可访问索引表。
由于本方法应用于同一线程束处理器21中的每一线程处理器,因此,第一线程可为线程1至12中的任一者。故在线程1至12均运行完本方法之后,产生的索引表如表2所示:
线程标识(索引) 待处理数据 循环变量
0 13 11
1 6 11
2 0 11
3 1 11
4 2 8
5 7 8
6 8 8
7 9 8
8 10 10
9 11 10
10 3 10
11 12 10
其中,线程标识、待处理数据以及循环变量三者之间具有一一对应关系。
步骤S512:执行线程数据重映射,获取线程数据重映射产生的更新的线程标识,根据更新的线程标识从索引表获取与更新的线程标识对应的待处理数据。
其中,线程数据重映射的原理可参见图3及其对应描述。
举例而言,请参见图10的下半部分,线程数据重映射之后,线程3的tid从2更新为5,因此,第三线程处理器213在运行线程3并执行线程数据重映射之后,获取线程数据重映射产生的更新的线程标识5,根据更新的线程标识5从索引表获取与更新的线程标识5对应的待处理数据7。
其他线程亦可根据类似方式获取待处理数据,具体可参见图10,于此不作赘述。
步骤S513:对计数器的数值减一步长。
举例而言,步长可为任意正整数,在本实施例中,步长设置为1,且对计数器的数值减一步长具体为对计数器的数值进行原子减一操作。
在本步骤中,在待处理数据不满足第一分支语句的判断条件时对计数器的数值减一步长,与步骤S505中加一步长的操作抵消,可避免对其他线程的判断造成干扰。
步骤S514:判断待处理数据满足第一分支语句的判断条件还是第二分支语句的判断条件,如果满足第一分支语句的判断条件,则执行步骤S515,如果满足第二分支语句的判断条件,则执行步骤S516。
步骤S515:执行第一分支语句。
步骤S516:执行第二分支语句。
步骤S517:对循环变量加一,并跳转至步骤S502。
举例而言,并请结合图11,图11是根据本发明实施例的线程数据重映射的另一前后时间开销示意图,图11的上部和中部示出对各个线程分别对自身的循环变量加1进入下一循环的过程,在每一线程分别执行本步骤之后,线程1的i=12,tid=0,线程2的i=12,tid=1,线程3的i=9,tid=5,线程4的i=9,tid=6,线程5的i=9,tid=7,线程6的i=11,tid=8,线程7的i=11,tid=9,线程8的i=11,tid=11,线程9的i=11,tid=10,线程10的i=9,tid=4,线程11的i=12,tid=3,线程12的i=12,tid=2。
并且,在第一线程处理器211运行线程1并进入线程1的下一循环时,从全局存储器24读取A[12,0]=100,第二线程处理器212运行线程2并进入线程2的下一循环时,从全局存储器24读取A[12,1]=2,第三线程处理器213运行线程3并进入线程3的下一循环时,从全局存储器24读取A[9,5]=4,第四线程处理器214运行线程4并进入线程4的下一循环时,从全局存储器24读取A[9,6]=101。
在第一线程处理器211运行线程5并进入线程5的下一循环时,从全局存储器24读取A[11,9]=666,第二线程处理器212运行线程6并进入线程6的下一循环时,从全局存储器24读取A[11,11]=410,第三线程处理器213运行线程7并进入线程7的下一循环时,从全局存储器24读取A[11,10]=510,第四线程处理器214运行线程8并进入线程8的下一循环时,从全局存储器24读取A[11,11]=410。
在第一线程处理器211运行线程9并进入线程9的下一循环时,从全局存储器24读取A[11,10]=510,第二线程处理器212运行线程10并进入线程10的下一循环时,从全局存储器24读取A[9,4]=777,第三线程处理器213运行线程11并进入线程11的下一循环时,从全局存储器24读取A[12,3]=63,第四线程处理器214运行线程12并进入线程12的下一循环时,从全局存储器24读取A[12,2]=1。
各线程处理器根据当前循环需要处理的待处理数据执行图9中步骤S504及其之后的步骤,值得注意的是,在本轮循环中,由于计数器的数值是3,大于阈值1,因此会进行图11下半部分所示的线程数据重映射,其中,线程数据重映射的具体方式可参见图12,于此不作赘述,其中图12是根据本发明实施例的线程数据重映射的原理图。
综上,线程处理器在确认数量大于阈值的情况下,执行线程同步并将计数器中的数值清零,并执行线程数据重映射,使得第一线程处理器在下一循环读取到的计数器的数值不会受到第一循环的影响,并且,由于在确认数量大于阈值的情况下,才执行线程同步以及线程数据重映射,因此可避免在该数量小于或等于阈值的情况下也执行线程同步以及线程数据重映射,可在每一循环中动态地判断是否有执行线程数据重映射的需要,可避免进行无用的线程数据重映射,可提高GPU的运行速度和运行效率。
下文将从单个线程的角度进一步清楚说明图4、图6以及图12涉及的线程数据重映射的原理,其中,图形处理器的共享存储器210设置有一维数组id_pood[]、第一变量H以及第二变量R,其中一维数组的长度是M*N,第一变量H的初始值是0,第二变量R的初始值是M*N-1,可通过以下方式执行线程数据重映射:
步骤1:在确认待处理数据满足第二分支语句的判断条件时,读取第一变量H的数值,并将线程标识写入一维数组id_pood[]中以第一变量H的数值作为下标的位置,将第一变量H的数值加一,并执行线程同步;在确认第一线程的第一待处理数据满足第一分支语句的判断条件时,读取第二变量R的数值,并将线程标识写入一维数组id_pood[]中以第二变量的数值作为下标的位置,将第二变量R的数值减一,并执行线程同步;
步骤2:在线程同步结束后,读取一维数组id_pood[]中以第一线程的线程标识作为下标的位置上的数值,并将读取的数值作为线程数据重映射产生的更新的线程标识。
值得注意的是,上述方法的执行主体分别是第一线程处理器211、第二线程处理器212、第三线程处理器213和第四线程处理器214。
并且,线程同步结束是指工作组内的所有线程,如线程1至12,均执行了线程同步。
举例而言,在图4所示的场景中,第一线程处理器211在运行线程1时,判断到线程1的待处理数据13满足第二分支语句的判断条件(大于或等于5),读取第一变量H的数值0,将线程1的线程标识0写入id_pood[0],并将第一变量H的数值原子加1,使之变成1。
第二线程处理器212在运行线程2时,判断到线程2的待处理数据6满足第二分支语句的判断条件(大于或等于5),读取第一变量H的数值1,将线程2的线程标识1写入id_pood[1],并将第一变量H的数值原子加1,使之变成2。
第三线程处理器213在运行线程3时,判断到线程3的待处理数据0满足第一分支语句的判断条件(小于5),读取第二变量R的数值11(M=3,N=4,M*N-1=12-1=11),将线程3的线程标识2写入id_pood[11],并将第二变量R的数值原子减1,使之变成10。
第四线程处理器214在运行线程4时,判断到线程4的待处理数据1满足第一分支语句的判断条件(小于5),读取第二变量R的数值10,将线程4的线程标识3写入id_pood[10],并将第二变量R的数值原子减1,使之变成9。
同理,第一线程处理器211分别运行线程5和线程9,第二线程处理器212分别运行线程6和线程10,第三线程处理器213分别运行线程7和线程11,第四线程处理器214分别运行线程8和线程12时,作出类似处理,得到id_pood[]如下:
0 1 5 6 7 8 9 11 10 4 3 2
第一线程处理器211线程同步结束时,以线程1的线程标识0作为下标读取id_pood[0]=0,以线程5的线程标识4作为下标读取id_pood[4]=7,以线程9的线程标识8作为下标读取id_pood[8]=10。
第二线程处理器212线程同步结束时,以线程2的线程标识1作为下标读取id_pood[1]=1,以线程6的线程标识5作为下标读取id_pood[5]=8,以线程10的线程标识9作为下标读取id_pood[9]=4。
第三线程处理器213线程同步结束时,以线程3的线程标识2作为下标读取id_pood[2]=5,以线程7的线程标识6作为下标读取id_pood[6]=9,以线程11的线程标识10作为下标读取id_pood[10]=3。
第四线程处理器214线程同步结束时,以线程4的线程标识3作为下标读取id_pood[3]=6,以线程8的线程标识7作为下标读取id_pood[7]=11,以线程12的线程标识11作为下标读取id_pood[11]=2。
针对图6以及图12所示的场景,也可以类似的方式进行线程数据重映射,于此不作赘述。
本发明实施例进一步提供一种线程处理方法,该方法应用于图形处理器的线程束处理器,图形处理器包括M个线程束,每个线程束包括N个线程,M个线程束的M*N个线程中存在至少一个线程需运行第一分支语句,该方法包括:
检测M*N个线程中需运行第一分支语句的线程的数量;
在确认数量大于阈值的情况下,对M*N个线程进行线程数据重映射。
为了清楚说明,以下结合图13进行详细说明,图13是根据本发明实施例的线程处理方法的另一流程图,图13所示的方法应用于第一线程束处理器21,与图8所示的实施例相比,区别在于本实施例以第一线程束处理器21作为执行主体进行描述,该方法包括:
步骤S101:检测M*N个线程中需运行第一分支语句的线程的数量。
举例而言,针对图2所示,M=3,N=4。
针对以下代码:
（该代码在原公开文本中以图片Figure PCTCN2018076885-appb-000004给出,未能以文本形式提取）
第一分支语句的判断条件为A[tid]<5。
步骤S102:判断该数量是否大于阈值,如果是,执行步骤S104,如果否,执行步骤S103。
阈值可例如为1。
步骤S103:不进行线程数据重映射。
步骤S104:进行线程数据重映射。
步骤S105:运行M*N个线程并根据待处理数据执行第一分支语句或第二分支语句。
在本发明实施例中,由于在步骤S102中引入阈值判断的步骤,因此可以过滤无效的线程数据重映射,从而节约图形处理器的运算资源和减少不必要的时间开销。
值得注意的是,当阈值=1时,可杜绝在只有1个线程需运行第一分支语句而进行无效的线程数据重映射的情况发生,但是,在实际应用中,由于线程数量较多,也可以将阈值设置为其他数值,例如阈值=N,或N的整数倍,可以将执行第一分支语句的线程集中设置在一个或多个线程束中,该些线程束的线程只需执行第一分支语句。或者,在另一些示例中,阈值也可以设置为经验值,通过实验可以对阈值进行取值,来达到时间开销与运算资源之间的平衡。
在上述实施例中,只对执行一个分支语句的情况进行说明,但是,在实际应用中,分支语句往往嵌套在循环语句中,因此分支语句需要循环执行多次,此时,若每层循环均执行线程数据重映射,将极大地浪费运算资源,并造成较高的时间开销。
举例而言,可参见以下的内核代码:
（该内核代码在原公开文本中以图片Figure PCTCN2018076885-appb-000005给出,未能以文本形式提取）
由于for循环的存在,每个线程需执行分支判断1000次,此时无效的线程数据重映射会造成较多的时间开销和浪费较多的运算资源。
为此,本发明另一实施例进一步提供一种运行在循环中的线程处理方法,以解决上述技术问题。
以下请参见图14,图14是根据本发明实施例的线程处理方法的另一流程图,该方法运行于第一线程束处理器21,与图9所示的实施例相比,区别在于本实施例以第一线程束处理器21作为执行主体进行描述,该方法包括:
如图14所示,该方法具体包括以下步骤:
步骤S201:获取M*N个线程需要处理的待处理数据,在任一线程的待处理数据满足第一分支语句的判断条件的情况下,将计数器的数值加一。
并请结合图10一并参考,图10是根据本发明实施例的线程数据重映射的前后时间开销示意图。图10的上半部分示出,线程束11当前处于第12个循环,循环变量i=11,线程束12当前处于第9个循环,循环变量i=8,线程束13当前处于第11个循环,循环变量i=10。
根据图10,线程3、4、5、11的待处理数据满足第一分支语句的判断条件(小于5),因此计数器的数值进行4次加一的操作,使得计数器的数值为4。
步骤S202:读取计数器的数值。
步骤S203:判断计数器的数值是否大于阈值,如果是,执行步骤S204,如果否,执行步骤S213。
步骤S204:控制确认计数器的数值大于阈值的线程以自身的线程标识作为索引记录自身的待处理数据和循环变量到索引表,将标志位设置为第二标志值,执行线程同步。
步骤S205:控制其他线程在检测到标志位为第二标志值的情况下,以自身的线程标识作为索引记录自身的待处理数据和循环变量到索引表,执行线程同步。
步骤S206:控制M*N个线程执行线程数据重映射以获取更新的线程标识。
本步骤的线程数据重映射的具体过程可参见图10的下半部分所示,经线程数据重映射之后,线程1的线程标识为0,线程2的线程标识为1,线程3的线程标识为5,线程4的线程标识为6,线程5的线程标识为7,线程6的线程标识为8,线程7的线程标识为9,线程8的线程标识为11,线程9的线程标识为10,线程10的线程标识为4,线程11的线程标识为3,线程12的线程标识为2。
步骤S207:将计数器的数值设置为0,将标志位设置为第一标志值。
在本步骤中,通过对标志位和计数器的数值清零,可确保本循环的数值不会对下一循环造成影响。
步骤S208:控制M*N个线程以更新的线程标识作为索引在索引表获取待处理数据和循环变量。
步骤S209:控制M*N个线程在各自的待处理数据满足第一分支语句的判断条件的情况下,执行第一分支语句,满足第二分支语句的判断条件的情况下,执行第二分支语句。
根据图10,经线程数据重映射之后,线程1至8均执行第二分支语句,线程9至12均执行第一分支语句。
步骤S210:控制M*N个线程根据各自的循环变量判断是否进入下一循环,如果是,执行步骤S212,如果否,执行步骤S211。
具体而言,针对循环语句for,可将循环变量i进行加一,在加一之后的i小于1000的情况下,进入下一循环,反之,则结束循环。
步骤S211:控制线程退出循环。
步骤S212:控制线程进入下一循环。
步骤S213:将计数器的数值设置为0,获取M*N个线程的待处理数据和循环变量。
步骤S214:控制M*N个线程在各自的待处理数据满足第一分支语句的判断条件的情况下,执行第一分支语句,满足第二分支语句的判断条件的情况下,执行第二分支语句。
步骤S215:控制M*N个线程根据各自的循环变量判断是否进入下一循环,如果是,执行步骤S217,如果否,执行步骤S216。
步骤S216:控制线程退出循环。
步骤S217:控制线程进入下一循环。
在图10的示例中,由于执行线程数据重映射后的各线程的循环变量在进行加一操作之后均小于1000,因此进入下一循环,执行步骤S201。
具体可结合图11和12进行参考,由于前述以线程处理器的角度对图11和图12进行过详细介绍,于此不做赘述。
示例地,为帮助理解,以下将列出内核代码的一种具体形式,其中,该内核代码以C语言编写,可结合该内核代码对本发明实施例进行理解,而在每句代码后的注释则说明了该代码的功能:
（该内核代码在原公开文本中以图片Figure PCTCN2018076885-appb-000006及Figure PCTCN2018076885-appb-000007给出,未能以文本形式提取）
以上内核代码仅为本发明实施例的一个示例,可由CPU端的代码编辑器编辑,线程运行该内核代码,可实现上述实施例所述的方法。
可参见图15,图15是根据本发明实施例的异构系统的装置结构示意图,如图15所示,异构系统包括中央处理器30和图形处理器20,中央处理器30包括主机代码301、内核代码302、编译器304以及运行时刻库307,其中,主机代码301和内核代码302设置在代码编辑器300上。
内核代码302举例而言可为:
（该内核代码在原公开文本中以图片Figure PCTCN2018076885-appb-000008给出,未能以文本形式提取）
其中,代码编辑器300可将分支处理代码设置在内核代码302中,形成新的内核代码。
举例而言,代码编辑器300可将分支处理代码加入到内核代码302中,例如:将
Figure PCTCN2018076885-appb-000009
Figure PCTCN2018076885-appb-000010
加入到以下内核代码之前,
Figure PCTCN2018076885-appb-000011
形成新的内核代码。
其中,主机代码301中设置了待处理数据A[i,tid],例如为A[i,tid]=rand()*1000。
代码编辑器300将主机代码301和内核代码302发送至编译器304,编译器304编译产生二进制的内核代码和主机代码。
CPU将二进制的内核代码、主机代码以及待处理数据A[i,tid]发送至图形处理器20。
本发明实施例进一步提供一种图形处理器,其包括第一线程处理器,第一线程处理器用于执行图8或图9所示的方法。
本发明实施例进一步提供一种图形处理器,其包括第一线程束处理器,第一线程束处理器用于执行图13或图14所示的方法。
需说明的是,以上描述的任意装置实施例都仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本发明提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本发明可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本发明而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘,U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等,包括若干命令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述的方法。
所属领域的技术人员可以清楚地了解到,上述描述的系统、装置或单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (38)

  1. 一种线程处理方法,其特征在于,所述方法应用于图形处理器,所述图形处理器用于处理M个线程束,每个所述线程束包括N个线程,所述图形处理器还包括至少一个线程束处理器,所述至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,所述第一线程束处理器包括第一线程处理器,所述第一线程处理器用以运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,所述图形处理器中设置有计数器,所述方法包括:
    所述第一线程处理器获取需要处理的第一待处理数据,确定所述第一待处理数据满足第一分支语句,将所述计数器中的数值加一步长;
    所述第一线程处理器根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量;
    所述第一线程处理器在确认所述数量大于阈值的情况下,执行线程同步以及线程数据重映射。
  2. 根据权利要求1所述的方法,其特征在于,所述图形处理器还设置有标志位,所述标志位的值设置为第一标志值,所述第一标志值用于指示不执行重映射,所述方法包括:
    所述第一线程处理器在确定所述数量大于阈值的情况之前,读取所述标志位;且,
    所述第一线程处理器在确定所述数量大于阈值的情况之后及执行线程同步之前,将所述第一标志值设置为第二标志值,所述第二标志值用于指示需要执行重映射。
  3. 根据权利要求2所述的方法,其特征在于,所述第一线程处理器在执行所述线程同步之后且执行所述线程数据重映射之前,所述方法还包括:
    所述第一线程处理器将所述计数器中的数值清零。
  4. 根据权利要求2或3所述的方法,其特征在于,所述第一线程束处理器包括第二线程处理器,所述第二线程处理器用以运行N个线程中的一个以处理满足第一分支语句或满足第二分支语句的待处理数据,所述方法还包括:
    第二线程处理器读取所述标志位,在确认所述标志位的值为所述第二标志值时,执行线程同步以及线程数据重映射;
    所述第二线程处理器在确认所述标志位的值为所述第一标志值时,根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量,在确认所述数量大于阈值的情况下,执行线程同步以及线程数据重映射。
  5. 根据权利要求4所述的方法,其特征在于,所述第一线程处理器用以运行N个线程中的第一线程以处理满足第一分支语句的判断条件的待处理数据,所述第二线程处理器用以运行N个线程中的第二线程以处理满足第二分支语句的判断条件的待处理数据,所述图形处理器还设置有一维数组、第一变量以及第二变量,其中所述一维数组的长度是M*N,所述第一变量的初始值是0,所述第二变量的初始值是M*N-1,所述第一线程处理器执行所述线程数据重映射,包括:
    所述第一线程处理器读取所述第二变量的数值,并将所述第一线程的线程标识写入所述一维数组中以所述第二变量的数值作为下标的位置,将所述第二变量的数值减一,并执行所述线程同步;
    所述第二线程处理器读取所述第一变量的数值,并将所述第二线程的线程标识写入所述一维数组中以所述第一变量的数值作为下标的位置,将所述第一变量的数值加一,并执行所述线程同步;
    所述第一线程处理器在所述线程同步结束后,读取所述一维数组中以所述第一线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第一线程的更新的线程标识;
    所述第二线程处理器在所述线程同步结束后,读取所述一维数组中以所述第二线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第二线程的更新的线程标识。
  6. 根据权利要求5所述的方法,其特征在于,所述第一线程处理器在执行所述线程同步之后且执行所述线程数据重映射之前,所述方法还包括:
    所述第一线程处理器以所述第一线程的线程标识作为索引将所述第一待处理数据记录在索引表,其中所述第一线程的线程标识与所述第一待处理数据具有一一对应关系,所述索引表记录有所述M*N个线程的线程标识与待处理数据之间的一一对应关系;
    所述第一线程处理器在执行所述线程数据重映射之后,所述方法还包括:
    所述第一线程处理器以执行所述线程数据重映射后产生的所述第一线程的更新的线程标识作为索引在所述索引表中读取与所述第一线程的更新的线程标识对应的第三待处理数据;
    所述第一线程处理器在所述第三待处理数据满足第一分支语句的判断条件时,执行所述第一分支语句,所述第一线程处理器在所述第三待处理数据满足第二分支语句的判断条件时,执行所述第二分支语句。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述阈值为1。
  8. 根据权利要求1至6任一项所述的方法,其特征在于,所述阈值为大于等于2且小于等于5的正整数。
  9. 根据权利要求1至8任一项所述的方法,其特征在于,所述第一线程处理器执行所述第一分支语句的概率小于所述第一线程处理器执行所述第二分支语句的概率。
  10. 一种线程处理方法,其特征在于,所述方法应用于图形处理器,所述图形处理器用于处理M个线程束,每个所述线程束包括N个线程,所述图形处理器还包括至少一个线程束处理器,所述至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,所述第一线程束处理器包括第一线程处理器,所述第一线程处理器运行循环语句,用以在一个循环中运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,所述图形处理器中设置有计数器,所述方法包括:
    所述第一线程处理器在第一循环中获取需要处理的第一待处理数据,确定所述第一待处理数据满足第一分支语句,将所述计数器中的数值加一步长;
    所述第一线程处理器根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量;
    所述第一线程处理器在确认所述数量大于阈值的情况下,执行线程同步并将所述计数器中的数值清零;
    所述第一线程处理器执行线程数据重映射。
  11. 根据权利要求10所述的方法,其特征在于,所述方法还包括:
    所述第一线程处理器获取在第二循环中需要处理的第二待处理数据,确定所述第二待处理数据满足所述第二分支语句的判断条件,将所述计数器中的数值减一步长。
  12. 根据权利要求10或11所述的方法,其特征在于,所述图形处理器还设置有标志位,所述标志位的值设置为第一标志值,所述第一标志值用于指示不执行重映射,所述方法包括:
    所述第一线程处理器在确定所述数量大于阈值的情况之前,读取所述标志位;且,
    所述第一线程处理器在确定所述数量大于阈值的情况之后及执行线程同步之前,将所述第一标志值设置为第二标志值,所述第二标志值用于指示需要执行重映射。
  13. 根据权利要求12所述的方法,其特征在于,所述第一线程束处理器包括第二线程处理器,所述第二线程处理器用以运行N个线程中的一个以处理满足第一分支语句或满足第二分支语句的待处理数据,所述方法还包括:
    第二线程处理器读取所述标志位,在确认所述标志位的值为所述第二标志值时,执行线程同步以及线程数据重映射;
    第二线程处理器在确认所述标志位的值为所述第一标志值时,根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量,在确认所述数量大于阈值的情况下,执行线程同步以及线程数据重映射。
  14. 根据权利要求13所述的方法,其特征在于,所述第一线程处理器用以运行N个线程中的第一线程以处理满足第一分支语句的判断条件的待处理数据,所述第二线程处理器用以运行N个线程中的第二线程以处理满足第二分支语句的判断条件的待处理数据,所述图形处理器还设置有一维数组、第一变量以及第二变量,其中所述一维数组的长度是M*N,所述第一变量的初始值是0,所述第二变量的初始值是M*N-1,所述第一线程处理器执行所述线程数据重映射,包括:
    所述第一线程处理器读取所述第二变量的数值,并将所述第一线程的线程标识写入所述一维数组中以所述第二变量的数值作为下标的位置,将所述第二变量的数值减一,并执行所述线程同步;
    所述第二线程处理器读取所述第一变量的数值,并将所述第二线程的线程标识写入所述一维数组中以所述第一变量的数值作为下标的位置,将所述第一变量的数值加一,并执行所述线程同步;
    所述第一线程处理器在所述线程同步结束后,读取所述一维数组中以所述第一线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第一线程的更新的线程标识;
    所述第二线程处理器在所述线程同步结束后,读取所述一维数组中以所述第二线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第二线程的更新的线程标识。
  15. 根据权利要求14所述的方法,其特征在于,所述第一线程处理器在执行所述线程同步之后且执行所述线程数据重映射之前,所述方法还包括:
    所述第一线程处理器以所述第一线程的线程标识作为索引将所述第一待处理数据和第一循环变量记录在索引表,其中,所述第一线程的线程标识与所述第一待处理数据具有一一对应关系;
    所述第一线程处理器在执行所述线程数据重映射之后,所述方法还包括:
    所述第一线程处理器以执行所述线程数据重映射后产生的所述第一线程的更新的线程标识作为索引在所述索引表中读取与所述第一线程的更新的线程标识对应的第三待处理数据;
    所述第一线程处理器在所述第三待处理数据满足第一分支语句的判断条件时,执行所述第一分支语句,所述第一线程处理器在所述第三待处理数据满足第二分支语句的判断条件时,执行所述第二分支语句。
  16. 根据权利要求15所述的方法,其特征在于,所述图形处理器中还记录有每个线程的循环变量,所述循环变量用于指示线程当前所在的循环的序号,所述索引表中记录有所述第一线程的循环变量与所述第一线程的线程标识、第一线程在所述循环变量所指示的循环中的待处理数据的对应关系,所述第一线程处理器在执行所述线程数据重映射之后,所述方法还包括:
    所述第一线程处理器以执行所述线程数据重映射后产生的所述第一线程的更新的线程标识作为索引在所述索引表中读取与所述第一线程的更新的线程标识对应的循环变量;
    所述第一线程处理器在执行所述第一分支语句或所述第二分支语句之后,将所述第一线程的更新的线程标识对应的循环变量加一以获取更新的循环变量,且在所述更新的循环变量不符合所述循环语句规定的循环条件时,结束所述第一线程,在所述更新的循环变量符合所述循环语句规定的循环条件时,运行所述第一线程的第二循环。
  17. 根据权利要求10至16任一项所述的方法,其特征在于,所述阈值为1。
  18. 根据权利要求10至16任一项所述的方法,其特征在于,所述阈值为大于或等于2且小于或等于5的正整数。
  19. 根据权利要求10至18任一项所述的方法,其特征在于,所述第一线程处理器执行所述第一分支语句的概率小于所述第一线程处理器执行所述第二分支语句的概率。
  20. 一种图形处理器,所述图形处理器用于处理M个线程束,每个所述线程束包括N个线程,所述图形处理器还包括至少一个线程束处理器,所述至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,所述第一线程束处理器包括第一线程处理器,所述第一线程处理器用以运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,所述图形处理器中设置有计数器,其中,
    所述第一线程处理器,用于获取需要处理的第一待处理数据,确定所述第一待处理数据满足第一分支语句,将所述计数器中的数值加一步长;
    所述第一线程处理器,用于根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量;
    所述第一线程处理器,用于在确认所述数量大于阈值的情况下,执行线程同步以及线程数据重映射。
  21. 根据权利要求20所述的图形处理器,其特征在于,所述图形处理器还设置有标志位,所述标志位的值设置为第一标志值,所述第一标志值用于指示不执行重映射,其中,
    所述第一线程处理器,用于在确定所述数量大于阈值的情况之前,读取所述标志位;且,
    所述第一线程处理器,用于在确定所述数量大于阈值的情况之后及执行线程同步之前,将所述第一标志值设置为第二标志值,所述第二标志值用于指示需要执行重映射。
  22. 根据权利要求21所述的图形处理器,其特征在于,所述第一线程处理器,还用于在执行所述线程同步之后且执行所述线程数据重映射之前,将所述计数器中的数值清零。
  23. 根据权利要求21或22所述的图形处理器,其特征在于,所述第一线程束处理器包括第二线程处理器,所述第二线程处理器用以运行N个线程中的一个以处理满足第一分支语句或满足第二分支语句的待处理数据,
    第二线程处理器,用于读取所述标志位,在确认所述标志位的值为所述第二标志值时,执行线程同步以及线程数据重映射;
    第二线程处理器,用于在确认所述标志位的值为所述第一标志值时,根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量,在确认所述数量大于阈值的情况下,执行线程同步以及线程数据重映射。
  24. 根据权利要求23所述的图形处理器,其特征在于,所述第一线程处理器用以运行N个线程中的第一线程以处理满足第一分支语句的判断条件的待处理数据,所述第二线程处理器用以运行N个线程中的第二线程以处理满足第二分支语句的判断条件的待处理数据,所述图形处理器还设置有一维数组、第一变量以及第二变量,其中所述一维数组的长度是M*N,所述第一变量的初始值是0,所述第二变量的初始值是M*N-1,所述第一线程处理器执行所述线程数据重映射,
    所述第一线程处理器,用于读取所述第二变量的数值,并将所述第一线程的线程标识写入所述一维数组中以所述第二变量的数值作为下标的位置,将所述第二变量的数值减一,并执行所述线程同步;
    所述第二线程处理器,用于读取所述第一变量的数值,并将所述第二线程的线程标识写入所述一维数组中以所述第一变量的数值作为下标的位置,将所述第一变量的数值加一,并执行所述线程同步;
    所述第一线程处理器,用于在所述线程同步结束后,读取所述一维数组中以所述第一线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第一线程的更新的线程标识;
    所述第二线程处理器,用于在所述线程同步结束后,读取所述一维数组中以所述第二线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第二线程的更新的线程标识。
  25. 根据权利要求24所述的图形处理器,其特征在于,所述第一线程处理器在执行所述线程同步之后且执行所述线程数据重映射之前,
    所述第一线程处理器,用于以所述第一线程的线程标识作为索引将所述第一待处理数据记录在索引表,其中所述第一线程的线程标识与所述第一待处理数据具有一一对应关系,所述索引表记录有所述M*N个线程的线程标识与待处理数据之间的一一对应关系;
    所述第一线程处理器,用于在执行所述线程数据重映射之后,以执行所述线程数据重映射后产生的所述第一线程的更新的线程标识作为索引在所述索引表中读取与所述第一线程的更新的线程标识对应的第三待处理数据;
    所述第一线程处理器,用于在所述第三待处理数据满足第一分支语句的判断条件时,执行所述第一分支语句,在所述第三待处理数据满足第二分支语句的判断条件时,执行所述第二分支语句。
  26. 根据权利要求20至25任一项所述的图形处理器,其特征在于,所述阈值为1。
  27. 根据权利要求20至25任一项所述的图形处理器,其特征在于,所述阈值为大于等于2且小于等于5的正整数。
  28. 根据权利要求20至27任一项所述的图形处理器,其特征在于,所述第一线程处理器执行所述第一分支语句的概率小于所述第一线程处理器执行所述第二分支语句的概率。
  29. 一种图形处理器,所述图形处理器用于处理M个线程束,每个所述线程束包括N个线程,所述图形处理器还包括至少一个线程束处理器,所述至少一个线程束处理器中的第一线程束处理器包括N的整数倍个线程处理器,所述第一线程束处理器包括第一线程处理器,所述第一线程处理器运行循环语句,用以在一个循环中运行N个线程中的一个以处理满足第一分支语句的判断条件或满足第二分支语句的判断条件的待处理数据,所述图形处理器中设置有计数器,其中,
    所述第一线程处理器,用于在第一循环中获取需要处理的第一待处理数据,确定所述第一待处理数据满足第一分支语句,将所述计数器中的数值加一步长;
    所述第一线程处理器,用于根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量;
    所述第一线程处理器,用于在确认所述数量大于阈值的情况下,执行线程同步并将所述计数器中的数值清零;
    所述第一线程处理器,用于执行线程数据重映射。
  30. 根据权利要求29所述的图形处理器,其特征在于,
    所述第一线程处理器,用于获取在所述第一线程的第二循环中需要处理的第二待处理数据,确定所述第二待处理数据满足所述第二分支语句的判断条件,将所述计数器中的数值减一步长。
  31. 根据权利要求29或30所述的图形处理器,其特征在于,所述图形处理器还设置有标志位,所述标志位的值设置为第一标志值,所述第一标志值用于指示不执行重映射,
    所述第一线程处理器,用于在确定所述数量大于阈值的情况之前,读取所述标志位;
    所述第一线程处理器,用于在确定所述数量大于阈值的情况之后及执行线程同步之前,将所述第一标志值设置为第二标志值,所述第二标志值用于指示需要执行重映射。
  32. 根据权利要求31所述的图形处理器,其特征在于,所述第一线程束处理器包括第二线程处理器,所述第二线程处理器用以运行N个线程中的一个以处理满足第一分支语句或满足第二分支语句的待处理数据,
    第二线程处理器,用于读取所述标志位,在确认所述标志位的值为所述第二标志值时,执行线程同步以及线程数据重映射;
    第二线程处理器,用于在确认所述标志位的值为所述第一标志值时,根据所述计数器的数值确定所述M*N个线程中需运行第一分支语句的线程的数量,在确认所述数量大于阈值的情况下,执行线程同步以及线程数据重映射。
  33. 根据权利要求32所述的图形处理器,其特征在于,所述第一线程处理器用以运行N个线程中的第一线程以处理满足第一分支语句的判断条件的待处理数据,所述第二线程处理器用以运行N个线程中的第二线程以处理满足第二分支语句的判断条件的待处理数据,所述图形处理器还设置有一维数组、第一变量以及第二变量,其中所述一维数组的长度是M*N,所述第一变量的初始值是0,所述第二变量的初始值是M*N-1,所述第一线程处理器执行所述线程数据重映射,
    所述第一线程处理器,用于读取所述第二变量的数值,并将所述第一线程的线程标识写入所述一维数组中以所述第二变量的数值作为下标的位置,将所述第二变量的数值减一,并执行所述线程同步;
    所述第二线程处理器,用于读取所述第一变量的数值,并将所述第二线程的线程标识写入所述一维数组中以所述第一变量的数值作为下标的位置,将所述第一变量的数值加一,并执行所述线程同步;
    所述第一线程处理器,用于在所述线程同步结束后,读取所述一维数组中以所述第一线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第一线程的更新的线程标识;
    所述第二线程处理器,用于在所述线程同步结束后,读取所述一维数组中以所述第二线程的线程标识作为下标的位置上的数值,并将读取的所述数值作为所述线程数据重映射产生的所述第二线程的更新的线程标识。
  34. 根据权利要求33所述的图形处理器,其特征在于,所述第一线程处理器运行第一线程,
    所述第一线程处理器,用于在执行所述线程同步之后且执行所述线程数据重映射之前,以所述第一线程的线程标识作为索引将所述第一待处理数据记录在索引表,其中,所述第一线程的线程标识与所述第一待处理数据具有一一对应关系;
    所述第一线程处理器,用于在执行所述线程数据重映射之后,以执行所述线程数据重映射后产生的所述第一线程的更新的线程标识作为索引在所述索引表中读取与所述第一线程的更新的线程标识对应的第三待处理数据;
    所述第一线程处理器,用于在所述第三待处理数据满足第一分支语句的判断条件时,执行所述第一分支语句,所述第一线程处理器在所述第三待处理数据满足第二分支语句的判断条件时,执行所述第二分支语句。
  35. 根据权利要求34所述的图形处理器,其特征在于,所述图形处理器中还记录有每个线程的循环变量,所述循环变量用于指示线程当前所在的循环的序号,所述索引表中记录有所述第一线程的循环变量与所述第一线程的线程标识、第一线程在所述循环变量所指示的循环中的待处理数据的对应关系,
    所述第一线程处理器,用于在执行所述线程数据重映射之后,以执行所述线程数据重映射后产生的所述第一线程的更新的线程标识作为索引在所述索引表中读取与所述第一线程的更新的线程标识对应的循环变量;
    所述第一线程处理器,用于在执行所述第一分支语句或所述第二分支语句之后,将所述第一线程的更新的线程标识对应的循环变量加一以获取更新的循环变量,且在所述更新的循环变量不符合所述循环语句规定的循环条件时,结束所述第一线程,在所述更新的循环变量符合所述循环语句规定的循环条件时,运行所述第一线程的第二循环。
  36. 根据权利要求29至35任一项所述的图形处理器,其特征在于,所述阈值为1。
  37. 根据权利要求29至35任一项所述的图形处理器,其特征在于,所述阈值为大于等于2且小于等于5的正整数。
  38. 根据权利要求29至37任一项所述的图形处理器,其特征在于,所述第一线程处理器执行所述第一分支语句的概率小于所述第一线程处理器执行所述第二分支语句的概率。
PCT/CN2018/076885 2018-02-14 2018-02-14 线程处理方法和图形处理器 WO2019157743A1 (zh)
