CN118819868B - A GPGPU-based thread reorganization method, device and medium - Google Patents


Info

Publication number
CN118819868B
CN118819868B
Authority
CN
China
Prior art keywords
thread
current
execution
bundle
active
Prior art date
Legal status
Active
Application number
CN202411303411.8A
Other languages
Chinese (zh)
Other versions
CN118819868A (en)
Inventor
许桂龙
赵鑫鑫
魏朝飞
姜凯
Current Assignee
Yuanqixin Shandong Semiconductor Technology Co ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202411303411.8A priority Critical patent/CN118819868B/en
Publication of CN118819868A publication Critical patent/CN118819868A/en
Application granted granted Critical
Publication of CN118819868B publication Critical patent/CN118819868B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of this specification disclose a GPGPU-based thread reorganization method, device and medium in the technical field of thread branch management. The method comprises: monitoring a plurality of thread bundles (warps) executing in the current processing unit and collecting, for the current clock cycle, the current thread execution information of each thread bundle; monitoring execution performance based on the current execution performance information in that thread execution information and the previously acquired historical execution performance information of the previous clock cycle, so as to determine the adjusted active-thread threshold data for the current clock cycle; setting an active-thread state flag for each thread bundle based on the adjusted active-thread threshold data and each thread bundle's current active-thread information; and reorganizing the threads of a plurality of designated thread bundles according to each thread bundle's active-thread state flag and current thread-bundle execution PC information. By applying different reorganization modes to thread bundles with different numbers of active threads, reorganization overhead is reduced.

Description

GPGPU-based thread reorganization method, device and medium
Technical Field
The present disclosure relates to the technical field of thread branch management, and in particular to a GPGPU-based thread reorganization method, device and medium.
Background
A General-Purpose Graphics Processing Unit (GPGPU) generates thread branches during execution, and these branches waste computing resources and thereby degrade program performance. At present, two hardware approaches exist to optimize divergent control flow on a GPU: multipath alternate execution and reorganization. The multipath alternate execution method offers only a limited improvement in thread-level parallelism; it is essentially a refinement of the stack-based reconvergence mechanism, and the choice of reconvergence opportunities still lacks a good solution. A thread-bundle (warp) reorganization strategy, by contrast, can improve thread-level parallelism well, but not every reorganization is effective, and the hardware overhead of reorganization is not negligible.
For warps with a small number of active threads, reorganization can significantly improve parallel efficiency and reduce the number of warps. However, for warps with a large number of active threads, reorganization may destroy the continuity of data between threads; in particular, when the threads originally accessed memory contiguously, reorganization may increase the number of memory access requests, cancelling out the performance advantage of the GPGPU's memory-access coalescing mechanism and incurring extra reorganization overhead. Prior-art thread reorganization policies do not consider these differences between warps with different numbers of active threads: they apply the same reorganization mode to all warps, risk breaking inter-thread data continuity, and add unnecessary reorganization cost.
Disclosure of Invention
One or more embodiments of the present disclosure provide a GPGPU-based thread reorganization method, device and medium, which solve the technical problems that conventional thread reorganization strategies ignore the differences between warps with different numbers of active threads, apply the same reorganization mode to all of them, risk destroying the continuity of data between threads, and add extra reorganization overhead.
One or more embodiments of the present disclosure adopt the following technical solutions:
One or more embodiments of the present disclosure provide a GPGPU-based thread reorganization method. The method includes: monitoring a plurality of thread bundles executing in a current processing unit, and collecting the current thread execution information of each thread bundle in the current clock cycle, where the current thread execution information includes current execution performance information, current thread-bundle execution PC information, and current active-thread information; monitoring execution performance according to the current execution performance information and the previously acquired historical execution performance information corresponding to the previous clock cycle, so as to determine the currently adjusted active-thread threshold data for the current clock cycle, where the execution performance information includes the number of memory accesses during execution and the number of execution-channel conflicts; and setting an active-thread state flag for each thread bundle based on the currently adjusted active-thread threshold data and each thread bundle's current active-thread information, so as to reorganize the threads of a plurality of designated thread bundles according to each thread bundle's active-thread state flag and the current thread-bundle execution PC information.
One or more embodiments of the present specification provide a GPGPU-based thread reorganization apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
One or more embodiments of the present specification provide a non-volatile computer storage medium storing computer-executable instructions configured to perform the above-described method.
The technical solution adopted by the embodiments of this specification achieves at least the following beneficial effects. By dynamically monitoring the active-thread state of each thread bundle and performing targeted thread reorganization, the solution effectively addresses the problem that conventional reorganization strategies ignore the differences between warps with different numbers of active threads. Dynamically adjusting the active-thread threshold data and setting each warp's active-thread state flag according to its current active-thread information allows more accurate control over which warps need reorganization; the targeted reorganization strategy reduces unnecessary reorganization operations and therefore lowers extra reorganization cost. Because the active-thread threshold data and the reorganization itself are adjusted dynamically according to execution performance information in the current clock cycle, different workloads and scenarios can be handled flexibly: whether the task load is high or low, the system can improve performance by optimizing warp active-thread states and the reorganization strategy. By jointly considering execution performance information and active-thread information, performance degradation caused by blind reorganization is avoided, inter-thread data continuity is preserved, and system efficiency and stability are improved.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
Fig. 1 is a schematic flow chart of a thread reorganizing method based on a GPGPU according to an embodiment of the present disclosure;
FIG. 2 is a self-adjusting thread flow framework diagram provided in an embodiment of the present disclosure;
FIG. 3 is a functional schematic of a performance monitoring module according to an embodiment of the present disclosure;
FIG. 4 is a functional schematic diagram of a self-adjusting thread reorganizing unit according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a thread reorganizing device based on a GPGPU according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present disclosure.
The embodiments of the present disclosure provide a GPGPU-based thread reorganization method. It should be noted that the execution subject in these embodiments may be a server, or any device having data processing capability. Fig. 1 is a flow chart of the GPGPU-based thread reorganization method provided by an embodiment of the present disclosure; as shown in Fig. 1, the method mainly includes the following steps:
Step S101, monitoring a plurality of thread bundles executed in a current processing unit, and collecting current thread execution information corresponding to each thread bundle in a current clock cycle.
In one embodiment of the present disclosure, a plurality of thread bundles executing in the current processing unit are monitored, and the current thread execution information of each thread bundle in the current clock cycle is collected, where the current thread execution information includes current execution performance information, current thread-bundle execution PC information, and current active-thread information. The current processing unit refers to a Streaming Multiprocessor (SM), i.e. an SM processing unit. The current execution performance information includes the number of memory accesses during execution and the number of execution-channel conflicts. The access count is the number of times the thread bundle accesses memory during execution; since memory access speed is far lower than processor execution speed, excessive memory operations force the processor to wait for memory responses and reduce overall performance. The execution-channel conflict count is the number of conflicts caused by resource contention (e.g. for the register file or execution units) during the thread bundle's execution; such conflicts can suspend or delay the thread bundle. The access count can be recorded in real time by the hardware performance counters with which a GPGPU is usually equipped, and the channel conflict count can be detected by a dedicated performance analysis tool. The PC in "thread-bundle execution PC" stands for Program Counter, which stores the address of the instruction being executed.
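The per-cycle, per-bundle information described above can be sketched as a simple record. The patent specifies no data structures, so this is a minimal, hypothetical Python illustration; all type and field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class WarpExecInfo:
    """Hypothetical per-cycle record of one thread bundle's execution info."""
    warp_id: int
    pc: int                  # program counter of the bundle's current instruction
    active_mask: int         # bitmask of active threads in the bundle
    mem_accesses: int        # memory access count during execution this cycle
    channel_conflicts: int   # execution-channel (resource contention) conflicts

    @property
    def active_count(self) -> int:
        # population count of the active-thread mask
        return bin(self.active_mask).count("1")
```

A monitoring pass would collect one such record per warp per clock cycle; the active-thread mask is folded into a count for the threshold comparison in step S103.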
Step S102, monitoring the execution performance according to the current execution performance information in the current thread execution information and the previously acquired historical execution performance information corresponding to the previous clock cycle, so as to determine the current adjustment active thread threshold data corresponding to the current clock cycle.
The execution performance information includes the number of memory accesses during execution and the number of execution-channel conflicts.
In one embodiment of the present disclosure, while monitoring the thread bundles executing in each clock cycle, the thread execution information of each thread bundle in each cycle is stored, so that the historical execution performance information corresponding to the clock cycle preceding the current one is available. Execution performance is monitored through the current execution performance information and the historical execution performance information of the previous clock cycle, and the currently adjusted active-thread threshold data for the current clock cycle is determined from the change in thread-bundle execution performance.
This specifically includes: determining, from the current execution performance information and historical execution performance information of each thread bundle, the current execution performance evaluation index for the current clock cycle and the historical execution performance evaluation index for the previous clock cycle, respectively; determining, from these two indices, the execution performance change type of the current clock cycle relative to the previous one, where the change type is either performance improvement or performance degradation; and acquiring the historical active-thread threshold data of the previous clock cycle, adjusting it based on the execution performance change type, and thereby determining the currently adjusted active-thread threshold data for the current clock cycle.
In one embodiment of the present disclosure, the current execution performance evaluation index for the current clock cycle and the historical execution performance evaluation index for the previous clock cycle are determined from the current and historical execution performance information of each thread bundle. When the current average execution performance evaluation index is smaller than the historical average index, the execution performance change type is determined to be performance improvement. Note that if the previous clock cycle is not the first one, its historical active-thread threshold data is obtained from the cycle before it; for the first clock cycle, initial active-thread threshold data may be set according to actual requirements, for example to half the maximum number of threads a thread bundle can hold. In each round of execution, the historical active-thread threshold data of the previous round is adjusted based on the execution performance change type, yielding the currently adjusted active-thread threshold data for the current clock cycle.
Determining the two evaluation indices specifically includes: for each thread bundle, summing the current memory access count and the current execution-channel conflict count in its current execution performance information to obtain that bundle's first execution performance evaluation index; averaging the first indices over all thread bundles to obtain the current execution performance evaluation index for the current clock cycle; and, analogously, summing the historical memory access count and historical channel conflict count in the historical execution performance information to obtain a plurality of second historical execution performance evaluation indices, then averaging them to obtain the historical execution performance evaluation index for the previous clock cycle.
In one embodiment of the present disclosure, for each thread bundle the total number of memory accesses by all its threads in the current clock cycle is counted, together with the number of conflicts caused by threads in the bundle attempting to access the same resource (e.g. register file, shared memory, or global memory) at the same time. Adding these two counts gives the bundle's first execution performance evaluation index. Averaging the first indices of all thread bundles gives the current execution performance evaluation index of the current clock cycle, which represents the average execution efficiency of all thread bundles in that cycle.
The historical execution performance information is processed by the same steps to obtain the historical execution performance evaluation index of the previous clock cycle: for each thread bundle, the historical access count and historical channel conflict count are added to give its second historical execution performance evaluation index, and the mean of these second indices over the previous clock cycle gives the historical execution performance evaluation index, which represents the average execution efficiency of all thread bundles in the previous cycle.
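The evaluation index described above (per-bundle sum of access and conflict counts, averaged over all bundles) can be sketched in a few lines. This is an illustrative assumption about the arithmetic only; the function name and list-based interface are not from the patent:

```python
def perf_eval_index(access_counts, conflict_counts):
    """Execution performance evaluation index for one clock cycle.

    access_counts[i] / conflict_counts[i]: memory access count and
    execution-channel conflict count of thread bundle i in that cycle.
    Per-bundle first index = accesses + conflicts; overall index = mean.
    """
    per_bundle = [a + c for a, c in zip(access_counts, conflict_counts)]
    return sum(per_bundle) / len(per_bundle)
```

The same function applies unchanged to the stored historical counts of the previous cycle, yielding the historical evaluation index; a lower value indicates better average execution efficiency.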
Adjusting the historical active-thread threshold data based on the execution performance change type to determine the currently adjusted active-thread threshold data specifically includes: determining a performance change index from the current and historical execution performance evaluation indices; determining the performance change rate of the current clock cycle from the performance change index and the historical evaluation index; determining a threshold adjustment amount based on the performance change rate and the historical active-thread threshold data; determining, from the execution performance change type, the rounding mode and the threshold adjustment mode for that adjustment amount, where the rounding mode is either rounding up or rounding down and the threshold adjustment mode is either increasing or decreasing the threshold; determining the threshold adjustment step from the rounding mode; and adjusting the historical active-thread threshold data by that step to determine the currently adjusted active-thread threshold data for the current clock cycle.
In one embodiment of the present specification, the difference between the current and historical execution performance evaluation indices is computed, and its absolute value is the performance change index. The ratio of the performance change index to the historical evaluation index gives the performance change rate. From the performance change rate and the historical active-thread threshold data, the threshold adjustment amount can be calculated. Note that this amount may be non-integer, in which case it is rounded in the direction dictated by the execution performance change type. For example, when the change type is performance degradation, the amount may be rounded up, ensuring that subtracting the resulting step from the historical active-thread threshold data yields a relatively small new value. When the change type is performance improvement, the amount may be rounded down, providing a safety margin. Rounding the adjustment amount in the selected mode yields the threshold adjustment step, i.e. the amount by which the threshold changes. Further, also based on the execution performance change type, a threshold increase or a threshold decrease is selected; this is the adjustment direction, i.e. whether the value grows or shrinks.
According to the selected adjustment direction and the calculated step, the historical active-thread threshold data is adjusted to obtain the currently adjusted active-thread threshold data for the current clock cycle: the threshold adjustment step is added to or subtracted from the historical active-thread threshold data.
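The threshold update described in the two paragraphs above can be sketched as one function. The pairing of adjustment directions with rounding modes follows the worked description (degradation: round up and decrease; improvement: round down and increase); the function name, minimum-threshold floor, and the assumption that improvement raises the threshold are illustrative:

```python
import math

def adjust_active_thread_threshold(cur_index, hist_index, hist_threshold,
                                   min_threshold=1):
    """Sketch of per-cycle active-thread threshold adjustment.

    Lower evaluation index = better performance (fewer accesses/conflicts).
    """
    change = abs(cur_index - hist_index)                 # performance change index
    rate = change / hist_index if hist_index else 0.0    # performance change rate
    raw_amount = rate * hist_threshold                   # threshold adjustment amount
    if cur_index < hist_index:
        # performance improvement: round down (safety margin), raise threshold
        step = math.floor(raw_amount)
        new_threshold = hist_threshold + step
    else:
        # performance degradation: round up, lower threshold
        step = math.ceil(raw_amount)
        new_threshold = hist_threshold - step
    return max(new_threshold, min_threshold)
```

For example, with a historical threshold of 16 and a 25 % change in the evaluation index, the step is 4, so the new threshold is 20 on improvement or 12 on degradation.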
Step S103, setting an active-thread state flag for each thread bundle based on the currently adjusted active-thread threshold data and each bundle's current active-thread information, so as to perform thread reorganization on a plurality of designated thread bundles using each bundle's state flag and the current thread-bundle execution PC information.
In one embodiment of the present disclosure, the current active-thread information of each thread bundle is compared with the currently adjusted active-thread threshold data, and each bundle's active-thread state flag is set according to the comparison result: thread bundles below the threshold are marked low-active, and thread bundles not below it are marked high-active. Thread reorganization is then performed on a plurality of designated thread bundles using each bundle's state flag and current thread-bundle execution PC information.
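The flag-setting rule above is a single threshold comparison per bundle; a minimal sketch, with an illustrative function name and boolean flags standing in for the patent's state identifiers:

```python
def warp_state_flags(active_counts, threshold):
    """Per-bundle active-thread state flag.

    True  -> low-active  (active threads below threshold; reorg candidate)
    False -> high-active (active threads not below threshold)
    """
    return [n < threshold for n in active_counts]
```

Only low-active bundles go on to the PC-matching and screening steps below; high-active bundles are left intact to preserve their memory-access continuity.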
This specifically includes: screening out, according to each thread bundle's active-thread state flag and current execution PC information, a plurality of first thread bundles meeting preset requirements, where each first thread bundle's state flag is low-active and all first thread bundles share the same current execution PC; determining the number of first thread bundles to be reorganized and acquiring a pre-generated thread-bundle reorganization count threshold; and, when the number to be reorganized exceeds that threshold, determining among the first thread bundles, according to each one's specified execution performance evaluation element, a plurality of designated thread bundles meeting the preset requirements, on which thread reorganization is performed.
In one embodiment of the present disclosure, all thread bundles are traversed using their active-thread state flags and current execution PC information, and those whose flag is low-active and whose current PC matches are collected as the first thread bundles. Their number is counted, and the pre-generated thread-bundle reorganization count threshold is acquired; this threshold caps the number of bundles reorganized in each round and is usually a positive integer not less than 2. Without such a cap, each reorganization could involve a large number of bundles, so the threshold ensures the participating bundles do not exceed it. When the number of bundles to be reorganized does not exceed the threshold, all first thread bundles are reorganized. When it does exceed the threshold, the first thread bundles must be screened: a plurality of designated thread bundles meeting the preset requirements is selected according to each first bundle's specified execution performance evaluation element, and reorganization is performed on them, their number being equal to the reorganization count threshold.
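The traversal above (collect low-active bundles, group them by matching PC) can be sketched as follows; the tuple-based input format and the rule that a group needs at least two bundles to be mergeable are illustrative assumptions:

```python
from collections import defaultdict

def first_warp_groups(warps, active_threshold):
    """Group reorganization candidates by shared program counter.

    warps: iterable of (warp_id, pc, active_count) tuples, one per bundle.
    Returns {pc: [warp_id, ...]} for groups of low-active bundles that
    share the same current execution PC (at least two per group).
    """
    groups = defaultdict(list)
    for warp_id, pc, n_active in warps:
        if n_active < active_threshold:   # low-active state flag
            groups[pc].append(warp_id)
    # a bundle with no PC-matching partner cannot be reorganized
    return {pc: ids for pc, ids in groups.items() if len(ids) >= 2}
```

Each resulting group is then either reorganized wholesale (if within the reorganization count threshold) or screened down by the evaluation element described next.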
Generating the thread-bundle reorganization count threshold specifically includes: acquiring the historical reorganization count threshold of the previous time period; determining the current memory access count of each thread bundle in the current time period and, from these, the current average access count; determining the historical average access count of the previous time period from its per-bundle historical access counts; and, when the current average exceeds the historical average, decreasing the historical reorganization count threshold by one unit to obtain the new threshold.
In one embodiment of the present disclosure, the historical thread bundle reorganization number threshold corresponding to the previous time period is acquired, the current execution process memory access count corresponding to each thread bundle in the current time period is determined, and the current average memory access count in the current time period is determined based on the plurality of current execution process memory access counts. The historical average memory access count in the previous time period is computed in the same way, i.e., it is determined from the historical execution process memory access counts corresponding to the previous time period. If the current average memory access count is greater than the historical average, one is subtracted from the historical thread bundle reorganization number threshold to obtain the thread bundle reorganization number threshold, with a minimum value of 2. It should be noted that the initial thread bundle reorganization number threshold may be set to half the maximum number of thread bundles that an SM can accommodate. If the current average memory access count is not greater than the historical average, the historical thread bundle reorganization number threshold is adopted as the thread bundle reorganization number threshold.
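The threshold update described above reduces to a few lines. This is a minimal sketch under the stated rule (decrement on rising average memory access count, floor of 2); the function name and parameter names are assumptions for illustration:

```python
def update_reorg_count_threshold(prev_threshold, current_accesses,
                                 history_accesses, min_threshold=2):
    """Shrink the reorganization-count threshold by one when the average
    memory access count rose versus the previous period (floor: 2);
    otherwise keep the previous threshold unchanged."""
    cur_avg = sum(current_accesses) / len(current_accesses)
    hist_avg = sum(history_accesses) / len(history_accesses)
    if cur_avg > hist_avg:
        return max(prev_threshold - 1, min_threshold)
    return prev_threshold

print(update_reorg_count_threshold(4, [10, 12], [8, 9]))  # 3
```

An initial value of half the SM's maximum resident warp count, as suggested above, would be passed in as `prev_threshold` on the first period.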
Determining, among the plurality of first thread bundles, a plurality of specified thread bundles meeting preset requirements according to the specified execution performance evaluation element of each first thread bundle so as to carry out thread reorganization specifically comprises: accumulating the current execution process memory access count and the current execution channel conflict count in the current execution performance information corresponding to each first thread bundle to determine the specified execution performance evaluation element of each first thread bundle; sorting the plurality of first thread bundles from large to small according to the specified execution performance evaluation element of each first thread bundle; and sequentially selecting a corresponding number of specified thread bundles according to the thread bundle reorganization number threshold so as to carry out thread reorganization.
In one embodiment of the present disclosure, the current execution process memory access count and the current execution channel conflict count in the current execution performance information corresponding to each first thread bundle are accumulated to determine the specified execution performance evaluation element of each first thread bundle; this element may also be obtained from the first execution performance evaluation index of each thread bundle computed in the preceding steps. All first thread bundles are then sorted by their accumulated specified execution performance evaluation elements. The sort order is from large to small, because thread bundles with higher memory access and conflict counts are more likely to be performance bottlenecks and should be prioritized for reorganization. A specified number of thread bundles are then selected in sorted order as the specified thread bundles, where the specified number equals the previously determined thread bundle reorganization number threshold; the resulting specified thread bundles undergo the subsequent thread reorganization operation.
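The ranking and top-k selection above can be sketched as follows. The tuple layout `(warp_id, mem_accesses, lane_conflicts)` and the function name are illustrative assumptions:

```python
def select_specified_warps(first_warps, reorg_count_threshold):
    """first_warps: list of (warp_id, mem_accesses, lane_conflicts) tuples.
    Rank by the accumulated evaluation element (accesses + conflicts),
    largest first, and take the top `reorg_count_threshold` warps."""
    ranked = sorted(first_warps,
                    key=lambda w: w[1] + w[2],  # specified evaluation element
                    reverse=True)               # large to small
    return [w[0] for w in ranked[:reorg_count_threshold]]

# warp 1 has the largest summed metric (12), then warp 0 (6), then warp 2 (4)
print(select_specified_warps([(0, 5, 1), (1, 9, 3), (2, 2, 2)], 2))  # [1, 0]
```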
The method further comprises: acquiring a PC-Warp lookup table corresponding to the current processing unit, wherein the PC-Warp lookup table comprises a thread bundle identifier entry, a thread bundle execution PC entry and an active thread state identifier entry; updating the active thread state identifier of each thread bundle and the current thread bundle execution PC information in the PC-Warp lookup table to determine an updated PC-Warp lookup table; and determining a plurality of specified thread bundles according to the updated PC-Warp lookup table so as to carry out thread reorganization on the plurality of specified thread bundles.
In one embodiment of the present disclosure, a plurality of specified thread bundles may be thread-reorganized using the active thread state identifier of each thread bundle and the current thread bundle execution PC information in the form of a PC-Warp lookup table, described below by way of example. First, the PC-Warp lookup table corresponding to the current processing unit is acquired. The PC-Warp lookup table comprises a thread bundle identifier entry, a thread bundle execution PC entry and an active thread state identifier entry: the thread bundle identifier entry stores the thread bundle ID, the thread bundle execution PC entry stores the execution PC of the thread bundle, and the active thread state identifier entry stores the active thread state of the thread bundle. In addition to the foregoing entries, extra entries may be added to store parameters related to the execution performance evaluation index of each thread bundle. Entry updates are then performed in the PC-Warp lookup table according to the active thread state identifier of each thread bundle and the current thread bundle execution PC information, an updated PC-Warp lookup table is determined, and a plurality of specified thread bundles are determined from the updated lookup table so as to carry out thread reorganization on them. The entries of the updated PC-Warp lookup table are checked: when the number of active threads in a thread bundle is lower than the threshold value, i.e., it corresponds to the low active state, the thread bundles that share the same PC in the PC-Warp lookup table and are likewise in the low active state are reorganized and sent into the Warp pool to await scheduling.
When the number of active threads is not lower than the threshold value, i.e., the active thread state identifier is the high active state, the thread bundle is judged not to need reorganization and is placed directly into the Warp pool to await scheduling.
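The lookup-table logic of the two paragraphs above can be sketched as follows. The dictionary-based table and the function names are illustrative assumptions; a hardware table would of course be fixed-width entries, not Python dicts:

```python
def build_pc_warp_table(warps, active_threshold):
    """warps: list of (warp_id, pc, active_thread_count).
    Each entry holds the warp ID, the PC it executes, and its active
    thread state (low if the active count falls below the threshold)."""
    return [{"warp_id": wid, "pc": pc,
             "state": "low" if active < active_threshold else "high"}
            for wid, pc, active in warps]

def partition_for_reorg(table):
    """Low-active warps sharing a PC are reorganization candidates;
    everything else goes straight to the Warp pool to await scheduling."""
    reorg_groups, warp_pool = {}, []
    for entry in table:
        if entry["state"] == "low":
            reorg_groups.setdefault(entry["pc"], []).append(entry["warp_id"])
        else:
            warp_pool.append(entry["warp_id"])
    # a lone low-active warp at a PC cannot be merged with anything
    for pc in list(reorg_groups):
        if len(reorg_groups[pc]) < 2:
            warp_pool.extend(reorg_groups.pop(pc))
    return reorg_groups, warp_pool
```

Usage: with `warps = [(0, 100, 3), (1, 100, 2), (2, 200, 30), (3, 100, 28)]` and a threshold of 16, warps 0 and 1 form a reorganization group at PC 100 while warps 2 and 3 go to the pool.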
According to the above technical scheme, dynamically monitoring and adjusting the active thread state of each thread bundle and carrying out targeted thread reorganization effectively solves the problem that traditional reorganization strategies ignore the differences between thread bundles with different numbers of active threads. By dynamically adjusting the active thread threshold data and setting the active thread state identifier according to the current active thread information of each thread bundle, the system can control more precisely which thread bundles need to be reorganized. The targeted reorganization strategy reduces unnecessary reorganization operations and thereby reduces the extra reorganization overhead. Because the active thread threshold data and the thread reorganization can be adjusted dynamically according to the execution performance information in the current clock cycle, different workloads and scenarios can be handled more flexibly: whether a high-load or a low-load task is being processed, the system can improve performance by optimizing the active thread states and the reorganization strategy of the thread bundles. By comprehensively considering the execution performance information and the active thread information, the performance degradation or system instability caused by blind reorganization is avoided, data continuity is protected, reorganization overhead is reduced, and the flexibility and stability of the system are improved.
In one embodiment of the present disclosure, in order to improve execution performance when a program encounters a branch, reduce the overhead of thread bundle reorganization and improve the efficiency of thread reorganization, different reorganization modes are adopted for thread bundles with different numbers of active threads. A performance monitoring unit is provided to monitor the performance of the currently executed thread reorganization mode and dynamically adjust the thread reorganization conditions according to the execution performance, and a self-adjusting thread reorganization unit is provided to reorganize the thread bundles that meet the reorganization conditions.
Fig. 2 is a thread self-tuning pipeline framework provided in an embodiment of the present disclosure. As shown in Fig. 2, the self-tuning framework includes a thread scheduler, a pipeline unit, a register file unit, a parallel execution unit, a performance monitoring module, a self-adjusting thread reorganizing unit, and a cache storage unit. The thread scheduler schedules thread bundles for execution using a specific scheduling strategy according to the obtained thread bundle information. The pipeline unit includes functions such as instruction fetch, decode, scoreboard and instruction issue, and realizes pipelined execution of instructions; the register file unit is used for storing the operand information required for the execution of each thread. In Fig. 2, the parallel execution unit is illustratively shown as an Arithmetic Logic Unit (ALU) for performing specific operations on operands based on the instruction execution information. The performance monitoring module collects information during thread execution and sends the aggregated information to the self-adjusting thread reorganizing unit. The self-adjusting thread reorganizing unit automatically adjusts the thread bundle reorganization mode according to the information fed back by the performance monitoring module, thereby improving program execution performance. The cache storage unit stores the data used by instruction execution, which can be loaded through a Load/Store Unit (LSU) module.
First, the thread scheduler schedules instruction fetches according to the thread bundles fed back by the self-adjusting thread reorganizing unit; the instruction corresponding to a thread bundle is fetched from the instruction unit and sent to the pipeline unit for execution, where it is decoded and checked for conflicts by the scoreboard, and the thread bundles that meet the requirements and have obtained their operands from the register file are sent to the execution units at the back end. After a thread bundle instruction is executed, the performance monitoring module collects the memory access count and the execution channel conflict count during execution so as to dynamically adjust the threshold value. The adjusted threshold value is sent to the self-adjusting thread reorganizing unit, which screens the thread bundles that need reorganizing according to the threshold; the reorganized thread bundles, together with those that do not meet the reorganization conditions, are placed into the Warp pool for the scheduler to select.
Fig. 3 is a functional schematic diagram of the performance monitoring module provided in an embodiment of the present disclosure. As shown in Fig. 3, the performance monitoring module collects and accumulates the memory access count and the execution channel conflict count during execution, then adjusts the active thread threshold value upward (increases it). After the thread bundles of the next pipeline round are reorganized, whether performance has improved is judged by the sum of the memory access count and the execution channel conflict count: when the summed count decreases, performance has improved; when it increases, performance has degraded. The active thread threshold value continues to be adjusted upward while performance improves and is adjusted downward when performance degrades, so that the active thread threshold value is adjusted dynamically in a loop; after each adjustment it is sent to the self-adjusting thread reorganizing unit. In addition to adjusting the active thread threshold value, the memory access count during execution is collected and compared with the historical memory access count to judge whether it has increased; if so, the thread bundle number threshold is decreased, and if not, the previous thread bundle number threshold is kept unchanged. The resulting thread bundle number threshold is sent to the self-adjusting thread reorganizing unit.
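The active-thread-threshold loop above is essentially a hill-climbing search on the summed access-plus-conflict count. A minimal sketch, assuming a step size of 1 and clamp bounds `lo`/`hi` (neither is specified in the text):

```python
def adjust_active_thread_threshold(threshold, prev_metric, cur_metric,
                                   direction, lo=1, hi=32):
    """Hill-climbing adjustment of the active thread threshold.
    prev_metric/cur_metric: summed memory access + channel conflict counts
    for consecutive pipeline rounds.  A falling sum means performance
    improved, so keep moving in the current direction; a rising sum means
    performance degraded, so reverse direction.  Bounds are assumed clamps."""
    if cur_metric > prev_metric:      # performance degraded: reverse course
        direction = -direction
    threshold = min(hi, max(lo, threshold + direction))
    return threshold, direction

# sum fell 100 -> 90: keep raising the threshold
print(adjust_active_thread_threshold(8, 100, 90, +1))  # (9, 1)
# sum rose 90 -> 95: reverse and lower it
print(adjust_active_thread_threshold(9, 90, 95, +1))   # (8, -1)
```

The returned threshold would be forwarded to the self-adjusting thread reorganizing unit after each round, matching the loop described in the text.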
Fig. 4 is a functional schematic diagram of the self-adjusting thread reorganizing unit according to an embodiment of the present disclosure. As shown in Fig. 4, the module generates a PC-Warp lookup table at each clock cycle, where the lookup table includes the ID of each Warp, the PC it executes and its number of active threads. The entries of the lookup table are checked and a condition-1 judgment comparing the number of active threads with the active thread threshold value is performed: when the number of active threads in a Warp is not lower than the threshold, it is judged that reorganization is not needed, and the Warp directly enters the Warp pool to await scheduling. When the number of active threads in a Warp is lower than the threshold, the number of thread bundles whose active thread counts are below the threshold is determined and compared with the thread bundle number threshold. If it is not greater, all the qualifying thread bundles with the same PC in the PC-Warp lookup table are reorganized; if it is greater, among the thread bundles that satisfy condition 1 and share the same PC, those with higher memory access and conflict counts are preferentially selected for reorganization, up to the thread bundle number threshold. After reorganization, the thread bundles are sent into the Warp pool to await scheduling.
Through the above technical scheme, the threshold value can be adaptively and dynamically adjusted according to program execution performance after each instruction execution, so that program blocks with different functions can be accommodated, the overhead of thread reorganization is reduced, and the scheme offers higher flexibility and better branch execution performance.
The embodiment of the specification also provides a thread reorganizing device based on the GPGPU, as shown in fig. 5, wherein the device comprises at least one processor and a memory in communication connection with the at least one processor, and the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method.
The present description also provides a non-transitory computer storage medium storing computer-executable instructions configured to perform the above-described method.
In this specification, each embodiment is described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the descriptions of the apparatus, device and non-volatile computer storage medium embodiments are relatively brief since they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the method embodiments.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The devices and media provided in the embodiments of the present disclosure are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely one or more embodiments of the present description and is not intended to limit the present description. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present description, is intended to be included within the scope of the claims of the present description.

Claims (9)

1. A GPGPU-based thread reorganization method, the method comprising:
monitoring a plurality of thread bundles executed in a current processing unit, and collecting current thread execution information corresponding to each thread bundle in a current clock cycle, wherein the current thread execution information comprises current execution performance information, current thread bundle execution PC information and current active thread information;
monitoring the execution performance according to the current execution performance information in the current thread execution information and the previously acquired historical execution performance information corresponding to the previous clock cycle to determine the current active thread threshold adjustment data corresponding to the current clock cycle, wherein the execution performance information comprises the access times and the execution channel conflict times of an execution process;
setting an active thread state identifier of each thread bundle based on the current adjustment active thread threshold data and current active thread information corresponding to each thread bundle, so as to carry out thread reorganization on a plurality of appointed thread bundles through the active thread state identifier of each thread bundle and the current thread bundle execution PC information;
and carrying out thread reorganization on a plurality of appointed thread bundles through the active thread state identification of each thread bundle and the current thread bundle execution PC information, wherein the thread reorganization specifically comprises the following steps:
Screening among the thread bundles according to the active thread state identification of each thread bundle and the current thread bundle execution PC information to determine a plurality of first thread bundles meeting preset requirements, wherein the active thread state identification of the first thread bundles is in a low active state, and the current thread bundle execution PC information corresponding to the plurality of first thread bundles is the same;
Determining the number of thread bundles to be recombined of the plurality of first thread bundles, and acquiring a pre-generated thread bundle recombination number threshold;
When the number of the thread bundles to be recombined is larger than the thread bundle recombination number threshold, determining a plurality of specified thread bundles meeting preset requirements in a plurality of first thread bundles according to specified execution performance evaluation factors of each first thread bundle so as to carry out thread recombination, wherein the number of the plurality of specified thread bundles is the thread bundle recombination number threshold.
2. The method of claim 1, wherein monitoring the execution performance according to current execution performance information in the current thread execution information and previously acquired historical execution performance information corresponding to a previous clock cycle to determine current active thread threshold adjustment data corresponding to the current clock cycle, specifically comprises:
According to the current execution performance information and the historical execution performance information corresponding to each thread bundle, determining a current execution performance evaluation index corresponding to the current clock cycle and a historical execution performance evaluation index corresponding to the previous clock cycle respectively;
Determining an execution performance change type of the current clock cycle relative to the previous clock cycle through the current execution performance evaluation index and the historical execution performance evaluation index, wherein the execution performance change type comprises any one of performance improvement and performance reduction;
And acquiring historical active thread threshold data corresponding to the previous clock cycle, adjusting the historical active thread threshold data based on the execution performance change type, and determining current adjustment active thread threshold data corresponding to the current clock cycle.
3. The method according to claim 2, wherein determining the current execution performance evaluation index corresponding to the current clock cycle and the historical execution performance evaluation index corresponding to the previous clock cycle according to the current execution performance information and the historical execution performance information corresponding to each thread bundle, respectively, specifically includes:
Accumulating the current execution process access times and the current execution channel conflict times in the current execution performance information corresponding to each thread bundle, and determining a first execution performance evaluation index of each thread bundle;
averaging the first execution performance evaluation indexes corresponding to each thread bundle, and determining the current execution performance evaluation index corresponding to the current clock cycle;
And accumulating the historical execution process access times and the historical execution channel conflict times in each piece of historical execution performance information to determine a plurality of second historical execution performance evaluation indexes, so as to perform average processing, and determining the historical execution performance evaluation indexes corresponding to the last clock cycle.
4. The GPGPU-based thread reorganization method of claim 2, wherein the historical active thread threshold data is adjusted based on the execution performance change type, and the current adjustment active thread threshold data corresponding to the current clock cycle is determined, which specifically includes:
determining a performance change index through the current execution performance evaluation index and the historical execution performance evaluation index;
Determining a performance change rate of the current clock cycle according to the performance change index and the historical execution performance evaluation index, and determining a threshold adjustment amount based on the performance change rate and the historical active thread threshold data;
Determining a rounding mode and a threshold adjustment mode corresponding to the threshold adjustment amount according to the execution performance change type, wherein the rounding mode comprises any one of rounding up and rounding down, and the threshold adjustment mode comprises any one of threshold up and threshold down;
rounding the threshold adjustment amount according to the rounding mode, and determining a threshold adjustment step length;
And according to the threshold adjustment mode, on the basis of the historical active thread threshold data, adjusting according to the threshold adjustment step length, and determining the current active thread threshold adjustment data corresponding to the current clock period.
5. The method for thread reorganization based on GPGPU according to claim 1, wherein obtaining a pre-generated thread bundle reorganization number threshold value specifically comprises:
acquiring a history thread bundle reorganization quantity threshold corresponding to the previous time period;
Determining the current execution process access times corresponding to each thread bundle in a current time period, so as to determine the current average access times in the current time period based on a plurality of current execution process access times;
determining historical average memory access times in the previous time period according to the plurality of historical execution process memory access times corresponding to the previous time period;
And under the condition that the current average visit times is larger than the historical average visit times, carrying out unit quantity reduction on the basis of the historical thread bundle reorganization quantity threshold value, and determining the thread bundle reorganization quantity threshold value.
6. The GPGPU-based thread reorganization method of claim 1, wherein determining, among the plurality of first thread bundles, a plurality of specified thread bundles meeting a preset requirement according to a specified execution performance evaluation element of each of the first thread bundles, so as to reorganize threads, comprises:
Accumulating the current execution process access times and the current execution channel conflict times in the current execution performance information corresponding to each first thread bundle, and determining a specified execution performance evaluation element of each first thread bundle;
And sequencing the plurality of first thread bundles according to the appointed execution performance evaluation factors of each first thread bundle from large to small, and sequentially determining a corresponding number of appointed thread bundles according to the thread bundle recombination number threshold value so as to carry out thread recombination.
7. The GPGPU-based thread reorganization method of claim 1, wherein thread reorganization is performed on a plurality of specified thread bundles by using an active thread state identifier of each thread bundle and the current thread bundle execution PC information, specifically including:
Acquiring a PC-Warp lookup table corresponding to the current processing unit, wherein the PC-Warp lookup table comprises a thread bundle identification table entry, a thread bundle execution PC table entry and an active thread state identification table entry;
and carrying out table entry updating in the PC-Warp lookup table by the active thread state identification of each thread bundle and the current thread bundle execution PC information, determining an updated PC-Warp lookup table, and determining a plurality of appointed thread bundles according to the updated PC-Warp lookup table so as to carry out thread reorganization on the plurality of appointed thread bundles.
8. A GPGPU-based thread reorganization apparatus, the apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
9. A non-transitory computer storage medium storing computer-executable instructions, the instructions being configured to perform the method of any one of claims 1-7.
CN202411303411.8A 2024-09-19 2024-09-19 A GPGPU-based thread reorganization method, device and medium Active CN118819868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411303411.8A CN118819868B (en) 2024-09-19 2024-09-19 A GPGPU-based thread reorganization method, device and medium

Publications (2)

Publication Number Publication Date
CN118819868A CN118819868A (en) 2024-10-22
CN118819868B true CN118819868B (en) 2025-01-21

Family

ID=93075405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411303411.8A Active CN118819868B (en) 2024-09-19 2024-09-19 A GPGPU-based thread reorganization method, device and medium

Country Status (1)

Country Link
CN (1) CN118819868B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119597487B (en) * 2024-12-09 2025-10-17 北京机电工程研究所 Dual mapping method and system for GPU parallel computing data
CN119415241B (en) * 2025-01-07 2025-06-24 山东浪潮科学研究院有限公司 Task scheduling system and method based on thread bundle reorganization
CN119536957A (en) * 2025-01-23 2025-02-28 山东浪潮科学研究院有限公司 Dynamic global warp reconstruction system, warp scheduling method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577524A (en) * 2017-08-04 2018-01-12 东华理工大学 The GPGPU thread scheduling methods of non-memory access priority of task
CN110968345A (en) * 2018-09-29 2020-04-07 英特尔公司 Architecture and method for data parallel Single Program Multiple Data (SPMD) execution

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019157743A1 (en) * 2018-02-14 2019-08-22 华为技术有限公司 Thread processing method and graphics processor
US11422849B2 (en) * 2019-08-22 2022-08-23 Intel Corporation Technology for dynamically grouping threads for energy efficiency
CN112131008B (en) * 2020-09-28 2024-04-19 芯瞳半导体技术(山东)有限公司 A method for scheduling thread bundle warps, a processor and a computer storage medium
CN112653639B (en) * 2020-12-21 2022-10-14 北京华环电子股份有限公司 IPv6 message fragment recombination method based on multi-thread interactive processing
TWI768649B (en) * 2021-01-08 2022-06-21 國立成功大學 Warp scheduling method and stream multiprocessor using the same
CN118210612A (en) * 2024-04-03 2024-06-18 山东浪潮科学研究院有限公司 A system and method for GPU thread warp allocation and scheduling
CN118171612B (en) * 2024-05-14 2024-11-08 北京壁仞科技开发有限公司 Method, device, storage medium and program product for optimizing instruction cache
CN118606116B (en) * 2024-06-21 2025-11-04 中山大学 A GPGPU Adaptive Error Detection Method and System Based on Warp-Level Branch Divergence
CN118377536B (en) * 2024-06-24 2024-10-01 山东浪潮科学研究院有限公司 Dynamic scheduling system for register resources


Also Published As

Publication number Publication date
CN118819868A (en) 2024-10-22

Similar Documents

Publication Publication Date Title
CN118819868B (en) A GPGPU-based thread reorganization method, device and medium
CN114580653B (en) Machine learning computation optimization methods and compilers
US8219993B2 (en) Frequency scaling of processing unit based on aggregate thread CPI metric
US9652243B2 (en) Predicting out-of-order instruction level parallelism of threads in a multi-threaded processor
JP5413853B2 (en) Thread de-emphasis method and device for multi-threaded processor
CN110308982B (en) Shared memory multiplexing method and device
EP3523720B1 (en) Task scheduling
TWI564807B (en) Scheduling method and processing device using the same
US12493490B2 (en) Sub-idle thread priority class
CN110275765B (en) Data parallel job scheduling method based on branch DAG dependency
Liu et al. Supporting soft real-time parallel applications on multicore processors
CN118916178A (en) Thread bundle scheduling method, device and medium for GPU
Anantpur et al. PRO: Progress aware GPU warp scheduling algorithm
KR101177059B1 (en) Method for dynamically assigned of parallel control module
CN112860395B (en) Multitask scheduling method for GPU
JP7671742B2 (en) Shared resource allocation in multithreaded microprocessors.
AU2020262300B2 (en) Techniques for increasing the isolation of workloads within a multiprocessor instance
CN110928649A (en) Resource scheduling method and device
CN116069480B (en) Processor and computing device
CN118819862A (en) A GPU-based resource optimization allocation method, system, device and medium
US9436503B2 (en) Concurrency control mechanisms for highly multi-threaded systems
CN117492965A (en) A thread group scheduling method, general graphics processor and storage medium
CN114356534B (en) Processing unit task scheduling method and device
CN117591242A (en) Compiling optimization method, system, storage medium and terminal based on bottom virtual machine
CN119537038B (en) A GPGPU stream multiprocessor resource saving scheduling method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250721

Address after: 5th Floor, Building S01, 1036 Inspur Road, Shunhua Road Sub-district, China (Shandong) Pilot Free Trade Zone, Jinan City, Shandong Province, 250000

Patentee after: Yuanqixin (Shandong) Semiconductor Technology Co.,Ltd.

Country or region after: China

Address before: 250101, Building S02, 1036 Inspur Road, High-tech Zone, Jinan City, Shandong Province

Patentee before: Shandong Inspur Scientific Research Institute Co.,Ltd.

Country or region before: China