CN115344306A - CUDA multithreading method, and related system, device, medium, and program

CUDA multithreading method, and related system, device, medium, and program

Info

Publication number
CN115344306A
Authority
CN
China
Prior art keywords
thread
dimensional index
index
configuration information
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210807694.4A
Other languages
Chinese (zh)
Inventor
雷宇
李原
朱建斌
付尧
永田敏雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Core Power Technology Co ltd
Original Assignee
Zhuhai Core Power Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Core Power Technology Co ltd filed Critical Zhuhai Core Power Technology Co ltd
Priority to CN202210807694.4A priority Critical patent/CN115344306A/en
Publication of CN115344306A publication Critical patent/CN115344306A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Complex Calculations (AREA)
  • Image Generation (AREA)

Abstract

The application provides a CUDA multithreading method, and a related system, device, medium, and program. The method includes: acquiring configuration information corresponding to a kernel function; generating a three-dimensional index of a thread according to the configuration information when no target historical configuration information matching the configuration information exists in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information when the target historical configuration information exists in the historical configuration information; and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in the memory. The embodiments of the application help improve the efficiency of multithreaded parallel processing in CUDA.

Description

CUDA multithreading method, and related system, device, medium, and program
Technical Field
The present application relates to the field of computer technologies, and in particular, to a CUDA multithread processing method and related system, device, medium, and program.
Background
CUDA (Compute Unified Device Architecture) is a computing platform introduced by the graphics card vendor NVIDIA; it uses C as its programming language and provides extensive capabilities for developing high-performance computing instructions. Computation in CUDA is inseparable from kernel functions (kernels) and threads: one kernel function corresponds to one thread grid (grid), one thread grid comprises a plurality of thread blocks (thread blocks), and one thread block comprises a plurality of threads. Before a kernel function executes its threads, the thread indexes generally need to be generated. With low-complexity hardware, a large number of clock cycles are needed to generate all the indexes in one thread block, and the larger the delay in generating the indexes, the larger the execution delay of the kernel function, which affects the efficiency of parallel processing in CUDA.
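For orientation, a minimal CUDA kernel shows where these per-thread indexes come into play; the kernel name and buffer layout below are illustrative assumptions, not part of the application:

```cuda
// Illustrative kernel: each thread reads its three-dimensional index within its
// thread block and uses it to address its element. Generating these indexes
// cheaply is what the application is concerned with.
__global__ void scaleBuffer(float* data, float factor, int Dx, int Dy)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    int z = threadIdx.z;
    int n = x + y * Dx + z * Dx * Dy;  // linearization, cf. equation (1) in the description
    data[n] *= factor;
}
```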
Disclosure of Invention
In view of the above problem, the present application provides a CUDA multithreading method, a system, a related device, a medium, and a program, which help improve the efficiency of parallel processing in CUDA.
In order to achieve the above object, a first aspect of the embodiments of the present application provides a CUDA multithreading method applied to an index generator, where the method includes:
acquiring configuration information corresponding to the kernel function;
generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information;
compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory;
acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information;
and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory.
With reference to the first aspect, in a possible implementation manner, the generating the three-dimensional index of the thread according to the configuration information includes:
obtaining a three-dimensional index of any one thread in the thread block according to the dimension information of the thread block; and
iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate the three-dimensional indexes of the threads in the thread block.
With reference to the first aspect, in a possible implementation manner, iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional indexes of the threads in the thread block includes:
iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and the any one thread satisfies the iteration step length;
iterating the three-dimensional index of the first target thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread satisfies the iteration step length; and
repeatedly performing the operation of iterating the three-dimensional index of a target thread in the thread block by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread whose interval from the target thread satisfies the iteration step length, so as to obtain the three-dimensional indexes of all threads in the thread block, wherein the target thread is a thread in the thread block whose three-dimensional index has been generated.
With reference to the first aspect, in a possible implementation manner, the dimension information of the thread block includes dimension information of the thread block in an x direction and dimension information of the thread block in a y direction, the three-dimensional index of the first target thread includes an index of the first target thread in the x direction, an index of the first target thread in the y direction, and an index of the first target thread in the z direction, and the three-dimensional index of any one thread is iterated by using the dimension information of the thread block, the iteration step, and the three-dimensional index of the iteration step, so as to generate the three-dimensional index of the first target thread in the thread block, including:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction, the iteration step length and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the x direction;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the y direction and the index of the first target thread in the z direction.
With reference to the first aspect, in a possible implementation manner, the configuration information includes the position of the highest bit equal to 1 of the generated three-dimensional index in the x direction, the position of the highest bit equal to 1 in the y direction, and the position of the highest bit equal to 1 in the z direction, and the compressing and packaging of the generated three-dimensional index according to the configuration information includes:
compressing and packaging the x-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the x direction;
compressing and packaging the y-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the y direction; and
compressing and packaging the z-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the z direction.
With reference to the first aspect, in a possible implementation manner, the kernel function is a frequently-called kernel function, the configuration information includes a dedicated memory address of the kernel function, and the storing, in the memory, the three-dimensional index after being compressed and packed includes:
and storing the compressed and packaged three-dimensional index in a special memory address of the kernel function.
With reference to the first aspect, in a possible implementation manner, the kernel function is a kernel function that is called infrequently, the configuration information includes a common memory address, and the step of storing the compressed and packed three-dimensional index in the memory includes:
and storing the compressed and packed three-dimensional index in a common memory address.
With reference to the first aspect, in a possible implementation manner, generating a three-dimensional index of a thread according to configuration information includes:
and generating three-dimensional indexes of a preset number of threads in each clock cycle according to the configuration information.
A second aspect of an embodiment of the present application provides an index generator, which includes a configuration filter, an iterator coupled to the configuration filter, and a format packing module coupled to the iterator, wherein:
the configuration filter is configured to acquire configuration information corresponding to the kernel function;
the iterator is configured to generate a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information;
the format packing module is configured to compress and pack the generated three-dimensional index according to the configuration information and store the compressed and packed three-dimensional index in a memory;
the configuration filter is further configured to acquire a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information;
and the format packing module is also configured to compress and pack the historical three-dimensional index according to the target historical configuration information and store the compressed and packed historical three-dimensional index in the memory.
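As a software model of this structure (purely illustrative; the application describes a hardware index generator, and the class and member names below are assumptions), the three modules and the history-matching flow could be sketched as:

```cuda
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Assumed, simplified shape of the configuration information the filter receives.
struct Config {
    uint64_t address;        // memory address for the packed indexes
    uint16_t Dx, Dy, Dz;     // thread-block dimensions
    uint8_t  mx, my, mz;     // highest-bit positions used for packing
};

// Illustrative model: a configuration filter with a history cache, an iterator
// that produces the per-thread indexes, and a format packer.
class IndexGenerator {
public:
    // Returns the packed indexes that would be written to memory at cfg.address.
    const std::vector<uint16_t>& process(const Config& cfg) {
        auto key = std::make_tuple(cfg.address, cfg.Dx, cfg.Dy, cfg.Dz);
        auto hit = history_.find(key);
        if (hit != history_.end())
            return hit->second;                       // reuse historical indexes
        return history_[key] = generateAndPack(cfg);  // generate, pack, remember
    }

private:
    std::vector<uint16_t> generateAndPack(const Config& cfg) const {
        std::vector<uint16_t> packed;
        // For this sketch the iterator is stood in for by a closed-form loop; the
        // embodiments replace this with the compare/add iteration of Fig. 4.
        for (uint32_t n = 0; n < uint32_t(cfg.Dx) * cfg.Dy * cfg.Dz; ++n) {
            uint16_t x = n % cfg.Dx;
            uint16_t y = (n / cfg.Dx) % cfg.Dy;
            uint16_t z = n / (uint32_t(cfg.Dx) * cfg.Dy);
            packed.push_back(uint16_t(x | (y << cfg.mx) | (z << (cfg.mx + cfg.my))));
        }
        return packed;
    }

    std::map<std::tuple<uint64_t, uint16_t, uint16_t, uint16_t>,
             std::vector<uint16_t>> history_;         // history cache
};
```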
With reference to the second aspect, in a possible implementation manner, the configuration information includes dimension information of the thread block and a three-dimensional index of a preset iteration step, and the iterator is specifically configured to:
obtaining a three-dimensional index of any one thread in the thread block according to the dimension information of the thread block; and
iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate the three-dimensional indexes of the threads in the thread block.
With reference to the second aspect, in a possible implementation manner, the iterator is specifically configured to:
iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and the any one thread satisfies the iteration step length;
iterating the three-dimensional index of the first target thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread satisfies the iteration step length; and
repeatedly performing the operation of iterating the three-dimensional index of a target thread in the thread block by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread whose interval from the target thread satisfies the iteration step length, so as to obtain the three-dimensional indexes of all threads in the thread block, wherein the target thread is a thread in the thread block whose three-dimensional index has been generated.
With reference to the second aspect, in a possible implementation manner, the dimension information of the thread block includes dimension information of the thread block in an x direction and dimension information of the thread block in a y direction, the three-dimensional index of the first target thread includes an index of the first target thread in the x direction, an index of the first target thread in the y direction, and an index of the first target thread in the z direction, and the iterator is specifically configured to:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction, the iteration step length and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the x direction;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the y direction and the index of the first target thread in the z direction.
With reference to the second aspect, in a possible implementation manner, the configuration information includes a position of the generated three-dimensional index with a highest bit of 1 in an x direction, a position of the generated three-dimensional index with a highest bit of 1 in a y direction, and a position of the generated three-dimensional index with a highest bit of 1 in a z direction, and the format packing module is specifically configured to:
compressing and packaging the x-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the x direction;
compressing and packaging the y-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the y direction; and
compressing and packaging the z-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the z direction.
With reference to the second aspect, in a possible implementation manner, the kernel function is a frequently-called kernel function, the configuration information includes a dedicated memory address of the kernel function, and the format packing module is specifically configured to:
and storing the compressed and packaged three-dimensional index in a special memory address of the kernel function.
With reference to the second aspect, in a possible implementation manner, the kernel function is a kernel function that is not frequently called, the configuration information includes a common memory address, and the format packing module is specifically configured to:
and storing the compressed and packed three-dimensional index in a common memory address.
With reference to the second aspect, in a possible implementation manner, the iterator is specifically configured to:
and generating three-dimensional indexes of a preset number of threads in each clock cycle according to the configuration information.
A third aspect of the embodiments of the present application provides a CUDA multithreading system, including an index generator, a memory, and a Graphics Processing Unit (GPU), where the GPU includes a scheduler and a GPU computational core;
the index generator is configured to acquire configuration information corresponding to the kernel function; generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory;
the memory is configured to store the three-dimensional index after the index generator compresses and packs or the historical three-dimensional index after the index generator compresses and packs;
the scheduler is configured to schedule the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index and the instruction of the kernel function to a GPU (graphics processing Unit) computing core;
and the GPU computing core is configured to read the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index from the memory, decompress it, and execute the instruction with the decompressed three-dimensional index or the decompressed historical three-dimensional index.
With reference to the third aspect, in a possible implementation manner, the configuration information includes dimension information of the thread block and a three-dimensional index of a preset iteration step, and the index generator is specifically configured to:
obtaining a three-dimensional index of any one thread in the thread block according to the dimension information of the thread block; and
iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate the three-dimensional indexes of the threads in the thread block.
With reference to the third aspect, in a possible implementation manner, the index generator is specifically configured to:
iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and the any one thread satisfies the iteration step length;
iterating the three-dimensional index of the first target thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread satisfies the iteration step length; and
repeatedly performing the operation of iterating the three-dimensional index of a target thread in the thread block by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread whose interval from the target thread satisfies the iteration step length, so as to obtain the three-dimensional indexes of all threads in the thread block, wherein the target thread is a thread in the thread block whose three-dimensional index has been generated.
With reference to the third aspect, in a possible implementation manner, the dimension information of the thread block includes dimension information of the thread block in an x direction and dimension information of the thread block in a y direction, the three-dimensional index of the first target thread includes an index of the first target thread in the x direction, an index of the first target thread in the y direction, and an index of the first target thread in the z direction, and the index generator is specifically configured to:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction, the iteration step length and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the x direction;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the y direction and the index of the first target thread in the z direction.
With reference to the third aspect, in a possible implementation manner, the configuration information includes a position of the generated three-dimensional index with a highest order of 1 in the x direction, a position of the generated three-dimensional index with a highest order of 1 in the y direction, and a position of the generated three-dimensional index with a highest order of 1 in the z direction, and the index generator is specifically configured to:
compressing and packaging the x-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the x direction;
compressing and packaging the y-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the y direction; and
compressing and packaging the z-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the z direction.
With reference to the third aspect, in a possible implementation manner, the kernel function is a frequently-called kernel function, the configuration information includes a dedicated memory address of the kernel function, and the index generator is specifically configured to:
and storing the compressed and packaged three-dimensional index in a special memory address of the kernel function.
With reference to the third aspect, in a possible implementation manner, the kernel function is a kernel function with infrequent calls, the configuration information includes a common memory address, and the index generator is specifically configured to:
and storing the compressed and packed three-dimensional index in a common memory address.
With reference to the third aspect, in a possible implementation manner, the index generator is specifically configured to:
and generating three-dimensional indexes of a preset number of threads in each clock cycle according to the configuration information.
With reference to the third aspect, in one possible implementation, the operating mode of the scheduler is an incremental mode, where the incremental mode includes that the scheduling of the three-dimensional index of the thread is incremented in the time dimension and the scheduling of the instruction is incremented in the GPU compute core dimension.
With reference to the third aspect, in a possible implementation manner, the GPU computing core is specifically configured to:
and decompressing the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index by adopting the position with the highest bit of 1 in the x direction, the position with the highest bit of 1 in the y direction and the position with the highest bit of 1 in the z direction.
A fourth aspect of embodiments of the present application provides an electronic device, which includes an input device and an output device, and further includes a processor adapted to implement one or more instructions; and a computer readable storage medium, said computer readable storage medium storing one or more instructions adapted to be loaded by said processor and to perform the steps of the method according to the first aspect as described above.
A fifth aspect of embodiments of the present application provides a computer-readable storage medium having stored thereon one or more instructions adapted to be loaded by a processor and to perform the steps of the method according to the first aspect.
A sixth aspect of embodiments herein provides a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform the steps as in the first aspect above.
The above scheme of the present application includes at least the following beneficial effects:
in the embodiment of the application, the configuration information corresponding to the kernel function is obtained; generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory. Therefore, the index generator generates the three-dimensional index of the thread with lower time delay, time overhead caused by generating the three-dimensional index by a complex algorithm is reduced, the generation speed of the index is coordinated with the execution speed of the kernel function, the condition that the kernel function (or GPU computing core) needs to wait for the generation of the index is avoided, the execution efficiency of the kernel function is improved, and the efficiency of parallel processing in the CUDA is improved. In addition, when the historical configuration information contains the target historical configuration information matched with the configuration information of the kernel function, the index generator does not need to generate a new three-dimensional index, and can directly multiplex the historical three-dimensional index, so that the time overhead caused by generating the new three-dimensional index is saved, the execution time delay of the kernel function is favorably reduced, and the efficiency of parallel processing in the CUDA is favorably improved. Meanwhile, the index generator can generate a new three-dimensional index with lower time delay without executing a complex algorithm, thereby reducing the complexity of hardware and being beneficial to saving the cost of the hardware.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a thread grid and thread blocks according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a thread block and threads in the thread block according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a CUDA multithreading method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another CUDA multithreading method according to an embodiment of the present application;
fig. 5 is a schematic format diagram of a compressed and packaged three-dimensional index according to an embodiment of the present application;
FIG. 6 is a diagram illustrating a kernel function and corresponding memory addresses provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of an index generator according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a CUDA multithreading system according to an embodiment of the present application;
fig. 9 is a schematic diagram of incremental scheduling provided in an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as appearing in the specification, claims and drawings of this application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Furthermore, the terms "first," "second," and "third," etc. are used to distinguish between different objects and are not used to describe a particular order.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The related art of the present application will be described below with reference to the accompanying drawings. The CUDA language is a general parallel computing programming language that describes the behavior of a single thread by writing kernel functions. One kernel function corresponds to one thread grid, one thread grid comprises a plurality of thread blocks, and one thread block is in turn made up of a plurality of threads. The relationship between the thread grid and the thread blocks can be as shown in FIG. 1, where (G_x, G_y, G_z) represents the dimension information of the thread grid in the x, y, and z directions, and (D_x, D_y, D_z) represents the dimension information of a thread block in the x, y, and z directions. The relationship between a thread block and its threads may be as shown in FIG. 2, where each black square in FIG. 2 represents one thread in the thread block, and (x, y, z) represents the index of that thread in the thread block in the x, y, and z directions. Before a kernel function executes, the three-dimensional indexes of the threads are generally generated in advance and written into registers for the GPU to call and execute.
It should be understood that the three-dimensional index (x_n, y_n, z_n) of the n-th thread in a thread block and the dimension information (D_x, D_y, D_z) of the thread block satisfy the following relationship:

n = x_n + y_n·D_x + z_n·D_x·D_y    (1)

Transforming equation (1) gives:

x_n = n % D_x    (2)

y_n = ⌊n / D_x⌋ % D_y    (3)

z_n = ⌊n / (D_x·D_y)⌋    (4)

where "%" is the modulo operation and ⌊·⌋ denotes rounding down. As can be seen, computing the three-dimensional index (x_n, y_n, z_n) through equations (2), (3), and (4) requires division and modulo operations, which are relatively complex to implement at the hardware level. This results in a relatively high delay for generating the indexes, and ultimately affects the execution efficiency of the kernel function and the efficiency of parallel processing in CUDA.
Based on this, the embodiments of the present application provide a CUDA multithreading method to solve the problem that the delay for generating the three-dimensional thread indexes in existing CUDA multithreaded parallel processing is high, to generate the three-dimensional indexes with a low-complexity hardware architecture and low delay, and to improve the execution efficiency of the kernel function, thereby improving the processing efficiency of the hardware (such as a chip). Specifically, the present application uses the configuration information of the kernel function to iterate a known three-dimensional index (x_n, y_n, z_n) to generate a new three-dimensional index. Through this iterative method, the original division and modulo operations are replaced by simple compare and add/subtract calculations, which reduces the complexity of the algorithm and thus the complexity of the hardware.
Referring to fig. 3, fig. 3 is a schematic flow chart of a CUDA multithreading method according to an embodiment of the present application, where the CUDA multithreading method is applicable to an index generator, and as shown in fig. 3, includes steps 301 to 303:
301: and acquiring configuration information corresponding to the kernel function.
In the embodiment of the present application, the configuration information of each kernel function may be issued to the index generator, where the configuration information may include a memory address, dimension information of a thread block, and a three-dimensional index of a preset iteration step. Further, the configuration information may further include compression format information of the generated three-dimensional index, where the compression format information is used to perform compression packing on the three-dimensional index. Specifically, the compression format information may be a position of the three-dimensional index with a highest bit in the x direction of 1, a position with a highest bit in the y direction of 1, and a position with a highest bit in the z direction of 1, where the position with the highest bit in the x direction of 1 is used to indicate a number of bits occupied by the index in the x direction in the generated three-dimensional index, the position with the highest bit in the y direction of 1 is used to indicate a number of bits occupied by the index in the y direction in the generated three-dimensional index, and the position with the highest bit in the z direction of 1 is used to indicate a number of bits occupied by the index in the z direction in the generated three-dimensional index.
302: and executing a preset iterative algorithm according to the configuration information to generate a three-dimensional index of the thread in the thread block.
In the embodiment of the application, after the configuration information corresponding to the kernel function is acquired from the configuration queue, it is matched against the historical configuration information in a history cache. If no target historical configuration information matching the configuration information exists in the historical configuration information, the iterative algorithm is executed, the three-dimensional indexes of the threads are generated based on the configuration information, and the configuration information is placed into the history cache. For example, the acquired configuration information may be (memory address Address, dimension information D_x of the thread block in the x direction, dimension information D_y of the thread block in the y direction, dimension information D_z of the thread block in the z direction, index x_c of the iteration step in the x direction, index y_c of the iteration step in the y direction, index z_c of the iteration step in the z direction). Since the three-dimensional index (x_c, y_c, z_c) of the iteration step is calculated from the dimension information (D_x, D_y, D_z) of the thread block and the iteration step c, only the four parameters (Address, D_x, D_y, D_z) need to be compared when matching against the historical configuration information.
In the embodiment of the application, based on the relationship, namely equation (1), between the three-dimensional index of a thread in the thread block and the dimension information (D_x, D_y, D_z) of the thread block, the three-dimensional index of any one thread in the thread block can be obtained by transforming equation (1). For example, for the 1st thread in the thread block, its three-dimensional index can be obtained according to equations (2), (3), and (4):

x_1 = 1 % D_x

y_1 = ⌊1 / D_x⌋ % D_y

z_1 = ⌊1 / (D_x·D_y)⌋
Assuming that the any one thread is the n-th thread and the interval between the m-th thread and the n-th thread satisfies the iteration step c, the following relationship exists:

c = m - n    (5)

where the iteration step c is a fixed constant. When the three-dimensional index of the n-th thread is known, the dimension information (D_x, D_y, D_z) of the thread block, the three-dimensional index (x_c, y_c, z_c) of the iteration step c, and the iteration step c can be used to iterate the three-dimensional index of the n-th thread and generate the three-dimensional index of the m-th thread. For example, the three-dimensional index of the (1+c)-th thread can be generated by iterating the three-dimensional index of the 1st thread, the three-dimensional index of the (2+c)-th thread can be generated by iterating the three-dimensional index of the 2nd thread, and the three-dimensional index of the (m+c)-th thread can be generated by iterating the three-dimensional index of the m-th thread. In this way, by iterating the three-dimensional indexes of known threads, the three-dimensional indexes of all threads in the thread block can be generated.
303: and compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory.
In this embodiment, since the configuration information includes the position of the generated three-dimensional index with the highest bit of 1 in the x direction, the position of the generated three-dimensional index with the highest bit of 1 in the y direction, and the position of the generated three-dimensional index with the highest bit of 1 in the z direction, the indexes in the corresponding directions in the generated three-dimensional indexes can be compressed and packaged according to the position of the generated three-dimensional index with the highest bit of 1 in each direction of x, y, and z, and the compressed and packaged three-dimensional index is obtained.
It should be understood that the three-dimensional index of a thread is typically stored in a register, but for most kernel functions, the three-dimensional index is used to compute offsets of the thread's input and output data, and is used infrequently. Compared with expensive register resources, the three-dimensional index after compression and packaging is stored in the memory, so that the performance of parallel computing is not influenced, and the cost of the register resources is saved.
It can be seen that, in the embodiment of the present application, configuration information corresponding to a kernel function is obtained; executing a preset iterative algorithm according to the configuration information to generate a three-dimensional index of a thread in the thread block; and compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory. Therefore, the three-dimensional index of the thread is generated by adopting a simpler iterative algorithm, and the generation time delay of the three-dimensional index is favorably reduced, so that the execution efficiency of the kernel function is improved, and the parallel processing efficiency in the CUDA is favorably improved.
Referring to fig. 4, fig. 4 is a schematic flowchart of another CUDA multithreading method provided in the embodiment of the present application, where the CUDA multithreading method is applicable to an index generator, as shown in fig. 4, and includes steps 401 to 405:
401: acquiring configuration information corresponding to the kernel function;
in this embodiment, the configuration information may include a memory address, dimension information of a thread block, and a three-dimensional index of a preset iteration step. Further, the configuration information may further include a position where the generated three-dimensional index has a highest order of 1 in the x direction, a position where the generated three-dimensional index has a highest order of 1 in the y direction, and a position where the generated three-dimensional index has a highest order of 1 in the z direction.
402: generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information;
In this embodiment of the present application, for example, generating the three-dimensional index of a thread according to the configuration information includes:
obtaining the three-dimensional index of any one thread in the thread block according to the dimension information of the thread block. Specifically, the three-dimensional index of any one thread in the thread block can be obtained by transforming equation (1).
Then, the three-dimensional index of the any one thread is iterated by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate the three-dimensional indexes of the threads in the thread block.
Illustratively, iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional indexes of the threads in the thread block includes:
iterating the three-dimensional index of the any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and the any one thread satisfies the iteration step length;
iterating the three-dimensional index of the first target thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length, to generate a three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread satisfies the iteration step length; and
repeatedly performing the operation of iterating the three-dimensional index of a target thread in the thread block by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread whose interval from the target thread satisfies the iteration step length, so as to obtain the three-dimensional indexes of all threads in the thread block, wherein the target thread is a thread in the thread block whose three-dimensional index has been generated.
Specifically, assume that the any one thread is the n-th thread and the first target thread is the m-th thread. When the three-dimensional index (x_n, y_n, z_n) of the n-th thread is known, the dimension information D_x of the thread block in the x direction, the dimension information D_y in the y direction, the iteration step c, and the three-dimensional index (x_c, y_c, z_c) of the iteration step are used to iterate the three-dimensional index (x_n, y_n, z_n) of the n-th thread and obtain the index x_m of the m-th thread in the x direction. According to equations (1), (2), and (5):

x_m = m % D_x
    = (n + c) % D_x
    = (x_n + y_n·D_x + z_n·D_x·D_y + x_c + y_c·D_x + z_c·D_x·D_y) % D_x
    = (x_n + x_c) % D_x    (6)

It should be understood that x_n and x_c are both less than D_x, so equation (6) can be simplified to:

x_m = x_n + x_c,          if x_n + x_c < D_x
x_m = x_n + x_c - D_x,    otherwise    (7)

Comparing equation (2) with equation (7), the index x_m of the m-th thread in the x direction can be obtained by the iterative algorithm using only addition, subtraction, and comparison operations, without complex division and modulo operations.
Specifically, the dimension information D_x of the thread block in the x direction, the dimension information D_y in the y direction, and the three-dimensional index (x_c, y_c, z_c) of the iteration step c are used to iterate the three-dimensional index (x_n, y_n, z_n) of the n-th thread and obtain the index y_m of the m-th thread in the y direction. According to equations (1), (3), and (5):

y_m = ⌊m / D_x⌋ % D_y
    = (y_n + y_c + ⌊(x_n + x_c) / D_x⌋) % D_y    (8)

where ⌊·⌋ denotes rounding down. It should be understood that y_n and y_c are both less than D_y, so equation (8) can be simplified to:

y_m = y_n + y_c + cx,          if y_n + y_c + cx < D_y
y_m = y_n + y_c + cx - D_y,    otherwise    (9)

where cx = ⌊(x_n + x_c) / D_x⌋, i.e., cx is 1 if x_n + x_c ≥ D_x and 0 otherwise (the carry from the x direction). Similarly, comparing equation (3) with equation (9), the index y_m of the m-th thread in the y direction can be obtained using only addition, subtraction, and comparison operations.
Specifically, the dimension information D_x of the thread block in the x direction, the dimension information D_y in the y direction, and the three-dimensional index (x_c, y_c, z_c) of the iteration step c are used to iterate the three-dimensional index (x_n, y_n, z_n) of the n-th thread and obtain the index z_m of the m-th thread in the z direction. According to equations (1), (4), and (5):

z_m = ⌊m / (D_x·D_y)⌋
    = z_n + z_c + ⌊(x_n + x_c + (y_n + y_c)·D_x) / (D_x·D_y)⌋    (10)

It should be understood that, since a thread block has three dimensions, when D_y ≥ 2 the carry term satisfies

⌊(x_n + x_c + (y_n + y_c)·D_x) / (D_x·D_y)⌋ = cy,

where cy is 1 if y_n + y_c + cx ≥ D_y and 0 otherwise (the carry from the y direction), so equation (10) can be reduced to:

z_m = z_n + z_c + cy    (11)

Similarly, comparing equation (4) with equation (11), the index z_m of the m-th thread in the z direction can be obtained using only addition, subtraction, and comparison operations.
It should be understood that when the three-dimensional index (x_m, y_m, z_m) of the m-th thread is known, the dimension information (D_x, D_y, D_z) of the thread block, the iteration step c, and the three-dimensional index (x_c, y_c, z_c) of the iteration step can be used to iterate the m-th thread and obtain the three-dimensional index of the second target thread whose interval from the m-th thread satisfies the iteration step c, namely the three-dimensional index of the (m+c)-th thread. In this embodiment, for any thread in the thread block whose three-dimensional index is known, the three-dimensional index of the thread whose interval from that thread satisfies the iteration step c can be obtained through the above iterative algorithm, and by repeating this operation the three-dimensional indexes of all threads in the thread block can be generated.
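To make the iteration step concrete, the following host-side C++ sketch implements the compare and add/subtract updates of equations (7), (9), and (11); the struct and function names are illustrative assumptions, and the hardware iterator in the application may differ:

```cuda
#include <cstdint>
#include <vector>

struct ThreadIndex3D {
    uint16_t x, y, z;
};

// One iteration: given the index of thread n, produce the index of thread n + c.
// Only comparisons and additions/subtractions are used, per equations (7), (9), (11).
ThreadIndex3D iterateIndex(const ThreadIndex3D& idxN,   // (x_n, y_n, z_n)
                           const ThreadIndex3D& idxC,   // (x_c, y_c, z_c), index of step c
                           uint16_t Dx, uint16_t Dy) {
    ThreadIndex3D out;

    uint32_t sx = idxN.x + idxC.x;                       // x_n + x_c
    uint16_t cx = (sx >= Dx) ? 1 : 0;                    // carry out of the x direction
    out.x = static_cast<uint16_t>(cx ? sx - Dx : sx);    // equation (7)

    uint32_t sy = idxN.y + idxC.y + cx;                  // y_n + y_c + cx
    uint16_t cy = (sy >= Dy) ? 1 : 0;                    // carry out of the y direction
    out.y = static_cast<uint16_t>(cy ? sy - Dy : sy);    // equation (9)

    out.z = static_cast<uint16_t>(idxN.z + idxC.z + cy); // equation (11)
    return out;
}

// Usage sketch: starting from c known seed indexes (threads 0..c-1), generate the
// indexes of all Dx*Dy*Dz threads, one new index per known index, as in the embodiments.
std::vector<ThreadIndex3D> generateAll(const std::vector<ThreadIndex3D>& seeds,
                                       const ThreadIndex3D& idxC,
                                       uint16_t Dx, uint16_t Dy, uint32_t total) {
    std::vector<ThreadIndex3D> all(seeds);
    for (uint32_t i = 0; all.size() < total; ++i)
        all.push_back(iterateIndex(all[i], idxC, Dx, Dy));  // index of thread i + c
    return all;
}
```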
403: and compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory.
Illustratively, the compression and packaging of the generated three-dimensional index according to the configuration information includes:
compressing and packaging the x-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the x direction;
compressing and packaging the y-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the y direction; and
compressing and packaging the z-direction index in the generated three-dimensional index according to the position of the highest bit equal to 1 in the z direction.
It should be understood that for the three-dimensional index of the thread (x, y, z), the range satisfies:
0 ≤ x < D_x

0 ≤ y < D_y

0 ≤ z < D_z
dimension (D) of thread blocks in three directions x ,D y ,D z ) Are all smaller than 64K, when the bit width of the register is 16 bits, 3 registers of 16-bits are needed for storing the three-dimensional index (x, y, z) of one thread, and 3D is needed for storing all the three-dimensional indexes in 1 thread block x D y D z 16-bit registers, and therefore, direct storage with uncompressed (x, y, z), would result in a very expensive register overhead.
Define the maximum number of threads in a thread block as S; then:

D_x·D_y·D_z < S

Based on the dimension information D_x of the thread block in the x direction, the position mx of the highest bit equal to 1 of the generated index in the x direction can be defined. For example, 4 is 100 in binary, so the position of the highest bit equal to 1 is bit 2. Similarly, based on the dimension information D_y of the thread block in the y direction, the position my of the highest bit equal to 1 of the generated index in the y direction can be defined, and based on the dimension information D_z of the thread block in the z direction, the position mz of the highest bit equal to 1 of the generated index in the z direction can be defined.

Define ms as the position of the highest bit of S equal to 1; then:

2^mx · 2^my · 2^mz ≤ x·y·z < D_x·D_y·D_z < S < 2^(ms+1)

2^(mx+my+mz) < 2^(ms+1)

mx + my + mz < ms + 1

mx + my + mz ≤ ms    (12)

It can be seen from equation (12) that the generated three-dimensional index (x, y, z) can be stored in ms bits. In general, S is smaller than 64K and ms = 16, that is, the compressed and packaged three-dimensional index (x, y, z) can be stored in one 16-bit register. Fig. 5 shows the format of the three-dimensional index (x, y, z) after the indexes in the three directions are compressed and packaged according to (mx, my, mz), with the remaining bits unused; the compressed and packaged three-dimensional index is stored into the memory according to the address in the configuration information.
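A software sketch of this packing scheme, and of the matching unpacking that a GPU computing core could perform, might look as follows; the field order and helper names are assumptions for illustration, not the exact format of Fig. 5:

```cuda
#include <cstdint>

// Pack (x, y, z) into one 16-bit word using the bit widths mx, my, mz derived
// from (D_x, D_y, D_z); requires mx + my + mz <= 16 per equation (12).
// Assumed layout: x in the lowest mx bits, then y, then z.
uint16_t packIndex(uint16_t x, uint16_t y, uint16_t z,
                   unsigned mx, unsigned my, unsigned /*mz*/) {
    return static_cast<uint16_t>(x | (y << mx) | (z << (mx + my)));
}

// Unpack, as a GPU computing core would before executing the kernel instructions,
// using the same (mx, my, mz) carried in the configuration information.
void unpackIndex(uint16_t packed, unsigned mx, unsigned my, unsigned mz,
                 uint16_t& x, uint16_t& y, uint16_t& z) {
    x = packed & ((1u << mx) - 1);
    y = (packed >> mx) & ((1u << my) - 1);
    z = (packed >> (mx + my)) & ((1u << mz) - 1);
}
```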
404: and acquiring a history three-dimensional index corresponding to the target history configuration information when the target history configuration information exists in the history configuration information.
In the embodiment of the present application, if target historical configuration information matching the configuration information corresponding to the kernel function exists in the historical configuration information, for example, if the four parameters (Address, D_x, D_y, D_z) are consistent, the historical three-dimensional index corresponding to the target historical configuration information can be obtained from the history cache and reused, without executing the iterative algorithm.
405: and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory.
In this embodiment of the application, the target historical configuration information also includes the position where the highest bit of the historical three-dimensional index in the x direction is 1, the position where the highest bit in the y direction is 1, and the position where the highest bit in the z direction is 1, that is, it indicates the number of bits occupied by the historical three-dimensional index in the three directions. The indexes in the three directions are compressed and packaged according to these three positions to obtain the compressed and packaged historical three-dimensional index, and the compressed and packaged historical three-dimensional index is stored in the memory according to the address in the target historical configuration information.
For example, when the storage space of the memory is greater than or equal to a certain threshold, the user may configure a dedicated memory address for a frequently-called kernel function; that is, the configuration information includes the dedicated memory address of the kernel function. As shown in fig. 6, if kernel functions 2 and 3 are frequently-called kernel functions, the three-dimensional index executed by kernel function 2 has the dedicated memory Address2, and the three-dimensional index executed by kernel function 3 has the dedicated memory Address3. In this case, storing the compressed and packaged three-dimensional index in the memory includes:
storing the compressed and packaged three-dimensional index in the dedicated memory address of the kernel function.
Similarly, storing the compressed and packaged historical three-dimensional index in the memory includes:
storing the compressed and packaged historical three-dimensional index in the dedicated memory address.
In this embodiment, for a frequently called kernel function, the three-dimensional index to be executed is stored in the dedicated memory address, and the three-dimensional index of the thread does not need to be regenerated for the kernel function, which is beneficial to reducing the execution delay of the kernel function.
For example, the user may configure a common memory address for kernel functions that are called infrequently; that is, the configuration information includes the common memory address of the infrequently-called kernel function. As shown in fig. 6, kernel functions 1, 4, and 5 are infrequently-called kernel functions, and the three-dimensional indexes they execute share the common memory Address1. In this case, storing the compressed and packaged three-dimensional index in the memory includes:
and storing the compressed and packed three-dimensional index in a common memory address.
Similarly, storing the compressed and packaged historical three-dimensional index in the memory includes:
storing the compressed and packaged historical three-dimensional index in the common memory address.
In this embodiment, for a kernel function that is called infrequently, the three-dimensional index to be executed is stored in the common memory address; even if the three-dimensional index of the thread needs to be regenerated, the performance of the hardware is not affected because the kernel function is called infrequently.
For example, when the storage space of the memory is less than a certain threshold, the user may configure the common memory address for both infrequently-called and frequently-called kernel functions, so as to save memory overhead.
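The address-selection policy described in the preceding paragraphs can be summarized in a small host-side sketch. The structure fields, the threshold parameter, and the function name below are illustrative assumptions rather than an interface defined by this application.

    #include <cstdint>

    struct KernelConfig {
        uint64_t dedicated_address;   // valid only for frequently-called kernels
        uint64_t common_address;      // shared by infrequently-called kernels
        bool     frequently_called;
    };

    // Choose where the packed indexes of a kernel function are stored.
    uint64_t select_index_address(const KernelConfig& cfg,
                                  uint64_t free_memory,
                                  uint64_t threshold) {
        // When memory is scarce, every kernel falls back to the common address.
        if (free_memory < threshold) return cfg.common_address;
        // Otherwise a frequently-called kernel keeps a dedicated address, so its
        // indexes never have to be regenerated.
        return cfg.frequently_called ? cfg.dedicated_address : cfg.common_address;
    }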
Illustratively, generating a three-dimensional index of threads according to configuration information includes:
and generating three-dimensional indexes of a preset number of threads in each clock cycle (cycle) according to the configuration information.
It will be appreciated that if only the three-dimensional index of 1 thread is generated per clock cycle, the hardware complexity is low, but D_x·D_y·D_z clock cycles are needed before all the indexes in the thread block are generated, and the larger the thread block is, the larger the execution delay of the kernel function is. If the three-dimensional indexes of all D_x·D_y·D_z threads are generated in every clock cycle, the kernel function can run after 1 clock cycle, but the hardware complexity is very high, and the larger the thread block is, the higher the hardware complexity is.
In the embodiment of the present application, the three-dimensional indexes of a preset number (warp size) of threads are generated per clock cycle; for example, the warp size can be taken as 32. In the next clock cycle, the GPU can schedule the warp-size three-dimensional indexes generated in the previous clock cycle for the kernel function to execute. In this way, the hardware complexity of generating the three-dimensional indexes is low, and the order in which the three-dimensional indexes are generated keeps a specific relationship with the order in which the GPU schedules them (such as increasing in the time dimension), which is beneficial to reducing the execution delay of the kernel function.
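As a rough numerical illustration of this trade-off (the thread-block shape and warp size below are assumptions, not values fixed by this application):

    #include <cstdio>

    int main() {
        const unsigned Dx = 32, Dy = 64, Dz = 1;        // assumed thread-block shape
        const unsigned threads = Dx * Dy * Dz;          // 2048 threads
        const unsigned warp_size = 32;                  // indexes generated per cycle

        // One index per cycle: lowest hardware complexity, longest latency.
        printf("1 per cycle   : %u cycles\n", threads);
        // warp_size indexes per cycle: the compromise described above.
        printf("%u per cycle  : %u cycles\n", warp_size,
               (threads + warp_size - 1) / warp_size);  // 64 cycles
        // All indexes per cycle: 1 cycle, but hardware cost grows with the block.
        printf("all per cycle : 1 cycle\n");
        return 0;
    }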
It can be seen that, in the embodiment of the present application, configuration information corresponding to a kernel function is obtained; generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory. Therefore, the index generator generates the three-dimensional index of the thread with lower time delay, time overhead caused by generating the three-dimensional index by a complex algorithm is reduced, the generation speed of the index is coordinated with the execution speed of the kernel function, the condition that the kernel function (or GPU computing core) needs to wait for the generation of the index is avoided, the execution efficiency of the kernel function is improved, and the efficiency of parallel processing in the CUDA is improved. In addition, when the historical configuration information contains the target historical configuration information matched with the configuration information of the kernel function, the index generator does not need to generate a new three-dimensional index, and can directly multiplex the historical three-dimensional index, so that the time overhead caused by generating the new three-dimensional index is saved, the execution time delay of the kernel function is favorably reduced, and the efficiency of parallel processing in the CUDA is favorably improved. Meanwhile, the index generator can generate a new three-dimensional index with lower time delay without executing a complex algorithm, thereby reducing the complexity of hardware and being beneficial to saving the cost of the hardware.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an index generator according to an embodiment of the present application, and as shown in fig. 7, the index generator includes a configuration filter 701, an iterator 702 coupled to the configuration filter 701, and a format packing module 703 coupled to the iterator 702, where:
a configuration filter 701 configured to obtain configuration information corresponding to the kernel function;
an iterator 702 configured to generate a three-dimensional index of the thread according to the configuration information in a case where there is no target historical configuration information matching the configuration information in the historical configuration information;
a format packing module 703 configured to compress and pack the generated three-dimensional index according to the configuration information, and store the compressed and packed three-dimensional index in a memory;
the configuration filter 701 is further configured to acquire a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information;
the format packing module 703 is further configured to perform compression packing on the history three-dimensional index according to the target history configuration information, and store the history three-dimensional index after compression packing in the memory.
In this embodiment, the configuration information may include a memory address, the dimension information of the thread block, the three-dimensional index of a preset iteration step, and the position mx of the highest bit that is 1 in the x direction, the position my of the highest bit that is 1 in the y direction, and the position mz of the highest bit that is 1 in the z direction of the generated three-dimensional index. After obtaining the configuration information from the configuration queue, the configuration filter 701 matches it against the historical configuration information in the history cache. If the historical configuration information does not contain target historical configuration information matching the configuration information, the iterator 702 is started and generates the three-dimensional index of the thread according to the configuration information; if the historical configuration information contains target historical configuration information matching the configuration information, the iterator 702 is not started, and the configuration filter 701 obtains the historical three-dimensional index corresponding to the target historical configuration information so as to reuse the historical three-dimensional index.
The workflow of the configuration filter 701 is described below with an example; a short sketch of the matching logic follows the example:
1. when the index generator is in an initial state, the history cache is empty;
2. The configuration filter 701 obtains configuration information 1 (Address1, D_x = 32, D_y = 64, D_z = 1) from the configuration queue;
3. Because the history cache is empty and the matching of the configuration information 1 fails, the configuration filter 701 starts the iterator 702 to generate a three-dimensional index, and the configuration information 1 is put into the history cache;
4. The configuration filter 701 obtains configuration information 2 (Address2, D_x = 100, D_y = 2, D_z = 2) from the configuration queue;
5. The configuration filter 701 matches the configuration information 2 with historical configuration information in the historical cache, the configuration information 2 fails to be matched, the configuration filter 701 starts the iterator 702 to generate a three-dimensional index, and the configuration information 2 is placed in the historical cache;
6. The configuration filter 701 obtains configuration information 3 (Address1, D_x = 32, D_y = 64, D_z = 1) from the configuration queue;
7. The configuration filter 701 matches the configuration information 3 with the historical configuration information in the historical cache, the configuration information 3 is successfully matched with the configuration information 1, and the configuration filter 701 does not start the iterator 702 for the configuration information 3.
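The matching behaviour in the example above can be modelled by a short host-side sketch; the container choice, the structure layout, and the function name are assumptions made for illustration, since the application describes a hardware configuration filter rather than a software interface.

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct ConfigInfo {
        uint64_t address;
        uint32_t dx, dy, dz;
        bool operator==(const ConfigInfo& o) const {
            return address == o.address && dx == o.dx && dy == o.dy && dz == o.dz;
        }
    };

    class ConfigFilter {
    public:
        // Returns the cached configuration (whose three-dimensional indexes can be
        // reused) when all four parameters match; otherwise records the new
        // configuration and reports that the iterator must be started.
        std::optional<ConfigInfo> match_or_record(const ConfigInfo& cfg) {
            for (const ConfigInfo& h : history_)
                if (h == cfg) return h;        // hit: reuse the historical index
            history_.push_back(cfg);           // miss: the iterator will be started
            return std::nullopt;
        }
    private:
        std::vector<ConfigInfo> history_;      // models the history cache
    };

With the three configurations of the example, the first two calls miss and start the iterator 702, while the third call hits the entry recorded for configuration information 1.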
Illustratively, the iterator 702 is specifically configured to:
obtaining a three-dimensional index of any thread in the thread blocks according to the dimension information of the thread blocks;
and iterating the three-dimensional index of any one thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of the thread in the thread block.
In this embodiment of the present application, the iterator 702 may obtain the three-dimensional index of any one thread in the thread block by transforming the formula (1).
Illustratively, the iterator 702 is specifically configured to:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional indexes of the threads in the thread block, wherein the generation comprises the following steps:
iterating the three-dimensional index of any one thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and any one thread meets the iteration step length;
iterating the three-dimensional index of the first target thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread meets the iteration step length;
and repeatedly executing the operation of iterating the three-dimensional index of the target thread in the thread block by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread of which the interval with the target thread meets the iteration step length, so as to obtain the three-dimensional indexes of the threads in the thread block, wherein the target thread is a thread in the thread block for which the three-dimensional index has been generated.
Illustratively, the iterator 702 is specifically configured to:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction, the iteration step length and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the x direction;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the y direction and the index of the first target thread in the z direction.
In this embodiment of the present application, assume that said any one thread is the n-th thread and the first target thread is the m-th thread. The iterator 702 iterates the three-dimensional index (x_n, y_n, z_n) of the n-th thread by using the dimension information D_x of the thread block in the x direction, the dimension information D_y in the y direction, the iteration step c, and the three-dimensional index (x_c, y_c, z_c) of the iteration step, to obtain the index x_m of the m-th thread in the x direction.
The iterator 702 iterates the three-dimensional index (x_n, y_n, z_n) of the n-th thread by using the dimension information D_x of the thread block in the x direction, the dimension information D_y in the y direction, and the three-dimensional index (x_c, y_c, z_c) of the iteration step c, to obtain the index y_m of the m-th thread in the y direction.
The iterator 702 iterates the three-dimensional index (x_n, y_n, z_n) of the n-th thread by using the dimension information D_x of the thread block in the x direction, the dimension information D_y in the y direction, and the three-dimensional index (x_c, y_c, z_c) of the iteration step c, to obtain the index z_m of the m-th thread in the z direction.
When the iterator 702 knows the three-dimensional index (x_m, y_m, z_m) of the m-th thread, it uses the dimension information (D_x, D_y, D_z) of the thread block, the iteration step c, and the three-dimensional index (x_c, y_c, z_c) of the iteration step to iterate the m-th thread, and obtains the three-dimensional index of the second target thread whose interval from the m-th thread satisfies the iteration step c, that is, the three-dimensional index of the (m+c)-th thread. In this embodiment, the iterator 702 repeatedly iterates the known three-dimensional indexes to generate the three-dimensional indexes of all threads, and all the operations used in the iteration process are addition, subtraction, and comparison operations, so the complexity is low.
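The following C++ sketch shows one way such an iteration can be realized with only additions, subtractions, and comparisons. It assumes that the three-dimensional index (x_c, y_c, z_c) of the iteration step is the component-wise offset corresponding to c threads and that carries propagate from x to y to z; the exact update rules of this application are given by its formulas and may differ in detail.

    #include <cstdint>

    struct Index3 { uint32_t x, y, z; };

    // Advance idx by the iteration step inside a (Dx, Dy, Dz) thread block.
    // Only addition, subtraction, and comparison are used.
    Index3 iterate(Index3 idx, Index3 step, uint32_t Dx, uint32_t Dy) {
        Index3 r;
        uint32_t carry_y = 0, carry_z = 0;
        r.x = idx.x + step.x;
        if (r.x >= Dx) { r.x -= Dx; carry_y = 1; }   // wrap in x, carry into y
        r.y = idx.y + step.y + carry_y;
        if (r.y >= Dy) { r.y -= Dy; carry_z = 1; }   // wrap in y, carry into z
        r.z = idx.z + step.z + carry_z;
        return r;
    }

    // Starting from the index of thread n and applying iterate() repeatedly
    // yields the indexes of threads n + c, n + 2c, and so on within the block.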
Illustratively, the format packing module 703 is specifically configured to:
compressing and packaging the x-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the x direction;
compressing and packaging the y-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the y direction; and,
compressing and packaging the z-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the z direction.
It should be understood that the position mx with the highest order of 1 in the x direction in the configuration information is used to indicate the number of bits occupied by the index in the x direction in the generated three-dimensional index, the position my with the highest order of 1 in the y direction is used to indicate the number of bits occupied by the index in the y direction in the generated three-dimensional index, and the position mz with the highest order of 1 in the z direction is used to indicate the number of bits occupied by the index in the z direction in the generated three-dimensional index. The format packing module 703 compresses and packs the three-dimensional indexes in three directions according to (mx, my, mz), and stores the compressed and packed three-dimensional indexes in the memory according to the addresses in the configuration information.
Illustratively, the kernel function is a frequently called kernel function, the configuration information includes a dedicated memory address of the kernel function, and the format packing module 703 is specifically configured to:
and storing the compressed and packaged three-dimensional index in the dedicated memory address of the kernel function.
In this embodiment, for a frequently called kernel function, the format packing module 703 stores the three-dimensional index to be executed in a dedicated memory address, and thus, the three-dimensional index of the thread does not need to be regenerated for the kernel function, which is beneficial to reducing the execution delay of the kernel function.
Illustratively, the kernel function is a kernel function that is called infrequently, the configuration information includes a common memory address, and the format packing module 703 is specifically configured to:
and storing the compressed and packed three-dimensional index in a common memory address.
In this embodiment of the present application, since the kernel function is not frequently called, the format packing module 703 stores the three-dimensional index to be executed in a common memory address, which does not greatly affect the performance of hardware.
Illustratively, the iterator 702 is specifically configured to:
and generating three-dimensional indexes of a preset number of threads in each clock cycle according to the configuration information.
In this embodiment of the present application, the iterator 702 generates three-dimensional indexes of a preset number (warp size) of threads in each clock cycle, and a specific relationship is maintained between a sequence of generating the three-dimensional indexes and a sequence of scheduling the GPU, which is beneficial to reducing execution delay of the kernel function.
It can be seen that, in the index generator shown in fig. 7, the configuration information corresponding to the kernel function is obtained; generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory. Therefore, the index generator generates the three-dimensional index of the thread with lower time delay, time overhead caused by generating the three-dimensional index by a complex algorithm is reduced, the generation speed of the index is coordinated with the execution speed of the kernel function, the condition that the kernel function (or GPU computing core) needs to wait for the generation of the index is avoided, the execution efficiency of the kernel function is improved, and the efficiency of parallel processing in the CUDA is improved. In addition, when the historical configuration information contains the target historical configuration information matched with the configuration information of the kernel function, the index generator does not need to generate a new three-dimensional index, and can directly multiplex the historical three-dimensional index, so that the time overhead caused by generating the new three-dimensional index is saved, the execution time delay of the kernel function is favorably reduced, and the efficiency of parallel processing in the CUDA is favorably improved. Meanwhile, the index generator can generate a new three-dimensional index with lower time delay without executing a complex algorithm, thereby reducing the complexity of hardware and being beneficial to saving the cost of the hardware.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a CUDA multithreading system according to an embodiment of the present disclosure, as shown in fig. 8, the CUDA multithreading system includes an index generator 801, a memory 802, and a Graphics Processor (GPU) 803, where the GPU803 includes a scheduler 8031 and a GPU compute core 8032; wherein:
an index generator 801 configured to obtain configuration information corresponding to a kernel function; generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; performing compression and packaging on the generated three-dimensional index according to the configuration information, and storing the three-dimensional index after compression and packaging in a memory 802; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory 802;
a memory 802 configured to store the three-dimensional index compressed and packaged by the index generator, or the compressed and packaged historical three-dimensional index;
the scheduler 8031 is configured to schedule the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index and the instruction of the kernel function to the GPU computation core 8032;
the GPU computation core 8032 is configured to read the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index from the memory 802, perform a decompression operation, and execute the instruction with the three-dimensional index obtained after decompression or the historical three-dimensional index obtained after decompression.
In a possible implementation, the configuration information includes dimension information of the thread block and a three-dimensional index of a preset iteration step, and the index generator 801 is specifically configured to:
obtaining a three-dimensional index of any thread in the thread blocks according to the dimension information of the thread blocks;
and iterating the three-dimensional index of any one thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of the thread in the thread block.
In one possible implementation, the index generator 801 is specifically configured to:
iterating the three-dimensional index of any thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and any thread meets the iteration step length;
iterating the three-dimensional index of the first target thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread meets the iteration step length;
and repeatedly executing the operation of iterating the three-dimensional index of the target thread in the thread block by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread of which the interval with the target thread meets the iteration step length, so as to obtain the three-dimensional indexes of the threads in the thread block, wherein the target thread is a thread in the thread block for which the three-dimensional index has been generated.
In one possible implementation, the dimension information of the thread block includes dimension information of the thread block in an x direction and dimension information of the thread block in a y direction, the three-dimensional index of the first target thread includes an index of the first target thread in the x direction, an index of the first target thread in the y direction, and an index of the first target thread in the z direction, and the index generator 801 is specifically configured to:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction, the iteration step length and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the x direction;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the y direction and the index of the first target thread in the z direction.
In a possible embodiment, the configuration information includes a position of the generated three-dimensional index with the highest bit being 1 in the x direction, a position with the highest bit being 1 in the y direction, and a position with the highest bit being 1 in the z direction, and the index generator 801 is specifically configured to:
compressing and packaging the x-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the x direction;
compressing and packaging the y-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the y direction; and,
compressing and packaging the z-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the z direction.
In one possible implementation, the kernel function is a frequently called kernel function, the configuration information includes a dedicated memory address of the kernel function, and the index generator 801 is specifically configured to:
and storing the compressed and packaged three-dimensional index in the dedicated memory address of the kernel function.
In one possible implementation, the kernel function is a kernel function with infrequent calls, the configuration information includes a common memory address, and the index generator 801 is specifically configured to:
and storing the compressed and packed three-dimensional index in a common memory address.
In one possible implementation, the index generator 801 is specifically configured to:
and generating three-dimensional indexes of a preset number of threads in each clock cycle according to the configuration information.
The working flow of the index generator 801 has been described in the embodiments shown in fig. 4 to fig. 7, and can achieve the same or similar beneficial effects, and is not described herein again.
In one possible implementation, the mode of operation of the scheduler 8031 is an incremental mode that includes scheduling of three-dimensional indices of threads being incremented in the time dimension and scheduling of instructions being incremented in the GPU compute core 8032 dimension.
It should be understood that the scheduling of threads in CUDA is performed in units of thread bundles (warps), for example scheduling warp-size threads at a time, but the GPU is non-incremental in scheduling thread bundles because, conventionally, there is no specific relationship between the order in which three-dimensional indexes are generated and the order in which they are scheduled.
In this embodiment, the index generator 801 generates the three-dimensional index of one thread bundle every clock cycle, and the scheduler 8031 may then schedule the instructions of the kernel function and the thread bundles in an incremental mode, which is specifically embodied in that the scheduling of the three-dimensional indexes of the thread bundles increases in the time dimension, and the scheduling of the instructions of the kernel function increases in the GPU computing core 8032 dimension. As shown in fig. 9, the index generator 801 generates the index of thread bundle 0 in clock cycle 0; it generates the index of thread bundle 1 in clock cycle 1, and the index of thread bundle 0 is stored in the memory 802; it generates the index of thread bundle 2 in clock cycle 2, and the indexes of thread bundles 0 and 1 are stored in the memory 802; it generates the index of thread bundle 3 in clock cycle 3, and the indexes of thread bundles 0, 1, and 2 are stored in the memory 802; it generates the index of thread bundle 4 in clock cycle 4, and the indexes of thread bundles 0, 1, 2, and 3 are stored in the memory 802.
It should be appreciated that the kernel function typically includes a plurality of instructions that are dispatched by an instruction dispatcher on the GPU, and the dispatched instructions are scheduled by the scheduler 8031. Assuming that the GPU compute core 8032 includes compute core 0, compute core 1, compute core 2, and compute core 3, and the instructions of the kernel function include instruction 0, instruction 1, instruction 2, and instruction 3, the scheduler 8031 is specifically configured to schedule as follows: in clock cycle 1, the index of thread bundle 0 and instruction 0 are dispatched to compute core 0 for processing; in clock cycle 2, the index of thread bundle 1 and instruction 0 are dispatched to compute core 0, and the index of thread bundle 0 and instruction 1 are dispatched to compute core 1; in clock cycle 3, the index of thread bundle 2 and instruction 0 are dispatched to compute core 0, the index of thread bundle 1 and instruction 1 are dispatched to compute core 1, and the index of thread bundle 0 and instruction 2 are dispatched to compute core 2; in clock cycle 4, the index of thread bundle 3 and instruction 0 are dispatched to compute core 0, the index of thread bundle 2 and instruction 1 are dispatched to compute core 1, the index of thread bundle 1 and instruction 2 are dispatched to compute core 2, and the index of thread bundle 0 and instruction 3 are dispatched to compute core 3.
In this embodiment, as can be seen from fig. 9, the scheduler 8031 schedules the three-dimensional indexes of the threads incrementally in the time dimension, and schedules the kernel function instructions incrementally in the GPU computing core 8032 dimension, such a scheduling mode has a delay of only one clock cycle, the index generator only needs to output the three-dimensional indexes of a preset number (warp size) of threads per clock cycle, and the GPU computing core 8032 can process the three-dimensional indexes in the next clock cycle, so that the delay is significantly reduced.
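The diagonal pattern above (thread bundle w executes instruction i on compute core i in clock cycle w + i + 1) can be reproduced by a small sketch; this closed form is inferred from the example of fig. 9 and is an assumption about the figure rather than a statement of the claims.

    #include <cstdio>

    int main() {
        const int num_warps = 4, num_instructions = 4;   // matches the example above
        for (int cycle = 1; cycle <= num_warps + num_instructions - 1; ++cycle) {
            printf("clock cycle %d:", cycle);
            for (int instr = 0; instr < num_instructions; ++instr) {
                int warp = cycle - 1 - instr;            // bundle generated in cycle 'warp'
                if (warp >= 0 && warp < num_warps)
                    printf("  warp %d + instruction %d -> core %d;", warp, instr, instr);
            }
            printf("\n");
        }
        return 0;
    }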
In one possible implementation, the GPU compute core 8032 is specifically configured to:
and decompressing the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index by adopting the position with the highest bit of 1 in the x direction, the position with the highest bit of 1 in the y direction and the position with the highest bit of 1 in the z direction.
In the embodiment of the present application, the (mx, my, mz) of the generated three-dimensional index are also given to the GPU 803 as configuration information; for example, the index generator 801 may send them to the GPU 803, or they may be configured to the GPU 803 by the user. The GPU computation core 8032, under the scheduling of the scheduler 8031, reads the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index from the memory 802, decompresses it by using (mx, my, mz), and then executes the instruction of the kernel function scheduled by the scheduler 8031 with the decompressed three-dimensional index or the decompressed historical three-dimensional index.
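A decompression counterpart to the packing sketch given earlier could look as follows; the field order and the function name are the same illustrative assumptions, with (mx, my, mz) taken from the configuration information provided to the GPU.

    #include <cstdint>

    struct Index3 { uint32_t x, y, z; };

    // Unpack a compressed index using the bit widths (mx, my, mz); the field
    // order (x lowest, then y, then z) mirrors the assumed pack format.
    Index3 unpack_index(uint32_t packed, unsigned mx, unsigned my, unsigned mz) {
        Index3 idx;
        idx.x = packed & ((1u << mx) - 1u);
        idx.y = (packed >> mx) & ((1u << my) - 1u);
        idx.z = (packed >> (mx + my)) & ((1u << mz) - 1u);
        return idx;
    }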
It can be seen that, in the CUDA multithreading system shown in fig. 8, the index generator 801 obtains the configuration information corresponding to the kernel function; generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory; the three-dimensional index compressed and packaged by the index generator or the historical three-dimensional index compressed and packaged is stored in the memory 802; the three-dimensional index after compression and packaging or the historical three-dimensional index after compression and packaging and the instruction of the kernel function are dispatched to a GPU computing core 8032 through a dispatcher 8031; the GPU computing core 8032 reads the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index from the memory, performs decompression operation, and executes the instruction and the decompressed three-dimensional index or the decompressed historical three-dimensional index. Therefore, the index generator 801 generates the three-dimensional index of the thread with a relatively low time delay, which is beneficial to reducing the time overhead caused by generating the three-dimensional index by a complex algorithm and keeping the generation speed of the index coordinated with the execution speed of the kernel function, thereby being beneficial to avoiding the condition that the kernel function (or GPU computation core) needs to wait for the generation of the index, improving the execution efficiency of the kernel function and further being beneficial to improving the efficiency of parallel processing in the CUDA. In addition, when the historical configuration information contains the target historical configuration information matched with the configuration information of the kernel function, the index generator 801 does not need to generate a new three-dimensional index, and can directly multiplex the historical three-dimensional index, so that the time overhead caused by generating the new three-dimensional index is saved, the execution delay of the kernel function is favorably reduced, and the efficiency of parallel processing in the CUDA is favorably improved. Meanwhile, the index generator 801 can generate a new three-dimensional index with a low time delay without executing a complex algorithm, so that the hardware complexity is reduced, and the hardware cost is saved. The scheduling of the scheduler 8031 is beneficial to ensuring that the GPU computing core 8032 executes the instructions of the kernel function and the corresponding three-dimensional index with low time delay, and is also beneficial to improving the efficiency of parallel processing in the CUDA.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and as shown in fig. 10, the electronic device at least includes a processor 1001, an input device 1002, an output device 1003, and a computer-readable storage medium 1004. The processor 1001, the input device 1002, the output device 1003, and the computer-readable storage medium 1004 within the electronic device may be connected by a bus or other means.
A computer-readable storage medium 1004 may be stored in the memory of the electronic device, the computer-readable storage medium 1004 being used for storing a computer program comprising program instructions, the processor 1001 being used for executing the program instructions stored by the computer-readable storage medium 1004. The processor 1001 (or CPU) is a computing core and a control core of the electronic device, and is adapted to implement one or more instructions, and in particular, is adapted to load and execute the one or more instructions so as to implement a corresponding method flow or a corresponding function.
In one embodiment, the processor 1001 of the electronic device provided in the embodiment of the present application may be configured to perform a series of CUDA multithreading processes:
acquiring configuration information corresponding to the kernel function;
generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information;
compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory;
acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information;
and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory.
In another embodiment, the configuration information includes dimension information of the thread block and a three-dimensional index of a preset iteration step, and the processor 1001 performs generating the three-dimensional index of the thread according to the configuration information, including:
obtaining a three-dimensional index of any thread in the thread blocks according to the dimension information of the thread blocks;
and iterating the three-dimensional index of any one thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of the thread in the thread block.
In another embodiment, the processor 1001 performs the iterating of the three-dimensional index of any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread in the thread block, which includes:
iterating the three-dimensional index of any one thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and any one thread meets the iteration step length;
iterating the three-dimensional index of the first target thread by adopting the dimension information, the iteration step length and the three-dimensional index of the iteration step length of the thread block to generate the three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread meets the iteration step length;
and repeatedly executing the operation of iterating the three-dimensional index of the target thread in the thread block by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread of which the interval with the target thread meets the iteration step length, so as to obtain the three-dimensional indexes of the threads in the thread block, wherein the target thread is a thread in the thread block for which the three-dimensional index has been generated.
In another embodiment, the dimension information of the thread block includes dimension information of the thread block in the x direction and dimension information of the thread block in the y direction, and the three-dimensional index of the first target thread includes the index of the first target thread in the x direction, the index of the first target thread in the y direction, and the index of the first target thread in the z direction; the processor 1001 performs the iterating of the three-dimensional index of any one thread by using the dimension information of the thread block, the iteration step length, and the three-dimensional index of the iteration step length to generate the three-dimensional index of the first target thread in the thread block, which includes:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction, the iteration step length and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the x direction;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the thread block in the y direction and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the y direction and the index of the first target thread in the z direction.
In another embodiment, the configuration information includes a position of the generated three-dimensional index with highest bit 1 in x direction, a position of the generated three-dimensional index with highest bit 1 in y direction, and a position of the generated three-dimensional index with highest bit 1 in z direction, and the processor 1001 performs compression and packaging on the generated three-dimensional index according to the configuration information, including:
compressing and packaging the x-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the x direction;
compressing and packaging the y-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the y direction; and,
compressing and packaging the z-direction index in the generated three-dimensional index according to the position of the highest bit that is 1 in the z direction.
In another embodiment, the kernel function is a frequently called kernel function, the configuration information includes a dedicated memory address of the kernel function, and the processor 1001 executes the storing of the compressed and packed three-dimensional index in the memory, including:
and storing the compressed and packaged three-dimensional index in the dedicated memory address of the kernel function.
In another embodiment, the kernel function is a kernel function with infrequent calls, the configuration information includes a common memory address, and the processor 1001 performs the storing of the compressed and packed three-dimensional index in the memory, including:
and storing the compressed and packed three-dimensional index in a common memory address.
In yet another embodiment, the processor 1001 performs generating a three-dimensional index of threads according to configuration information, including:
and generating three-dimensional indexes of a preset number of threads in each clock cycle according to the configuration information.
It can be seen that, in the electronic device shown in fig. 10, the configuration information corresponding to the kernel function is obtained; generating a three-dimensional index of the thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory. Therefore, the index generator generates the three-dimensional index of the thread with lower time delay, the time overhead brought by generating the three-dimensional index by a complex algorithm is favorably reduced, and the generation speed of the index is kept coordinated with the execution speed of the kernel function, so that the condition that the kernel function (or GPU (graphics processing unit) computation core) needs to wait for the generation of the index is favorably avoided, the execution efficiency of the kernel function is improved, and the efficiency of parallel processing in the CUDA is favorably improved. In addition, when the historical configuration information contains the target historical configuration information matched with the configuration information of the kernel function, the index generator does not need to generate a new three-dimensional index, and can directly multiplex the historical three-dimensional index, so that the time overhead caused by generating the new three-dimensional index is saved, the execution time delay of the kernel function is favorably reduced, and the efficiency of parallel processing in the CUDA is favorably improved. Meanwhile, the index generator can generate a new three-dimensional index with lower time delay without executing a complex algorithm, thereby reducing the complexity of hardware and being beneficial to saving the cost of the hardware.
Illustratively, the electronic device may be a computer, a notebook computer, a tablet computer, a server, or the like. The electronic device may include, but is not limited to, the processor 1001, the input device 1002, the output device 1003, and the computer-readable storage medium 1004, and may further include a memory, a power supply, an application client module, and the like. The input device 1002 may be a keyboard, a touch screen, a radio frequency receiver, etc., and the output device 1003 may be a speaker, a display, a radio frequency transmitter, etc. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of an electronic device and does not limit the electronic device, which may include more or fewer components than those shown, or combine some components, or have different components.
It should be noted that, since the steps in the CUDA multithreading method are implemented when the processor 1001 of the electronic device executes the computer program, the embodiments of the CUDA multithreading method are all applicable to the electronic device, and all can achieve the same or similar beneficial effects.
An embodiment of the present application also provides a computer-readable storage medium, which is a memory device in an information processing device or an information transmitting device or an information receiving device, and is used for storing programs and data. It is understood that the computer readable storage medium herein can include both a built-in storage medium in the terminal and, of course, an extended storage medium supported by the terminal. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; optionally, at least one computer readable storage medium located remotely from the aforementioned processor is also possible. In one embodiment, one or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to perform the corresponding steps in the above-described method for CUDA multithreading.
Embodiments of the present application also provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the steps in the CUDA multithreading method as described above. The computer program product may be a software installation package.
The foregoing embodiments have been described in detail, and specific examples are used herein to explain the principles and implementations of the present application, where the above description of the embodiments is only intended to help understand the method and its core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A CUDA multithreading method, the method comprising:
acquiring configuration information corresponding to the kernel function;
generating a three-dimensional index of a thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information;
compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory;
acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information;
and compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory.
2. The method of claim 1, wherein the configuration information includes dimension information of a thread block and a three-dimensional index of a preset iteration step, and the generating the three-dimensional index of the thread according to the configuration information includes:
obtaining a three-dimensional index of any thread in the thread blocks according to the dimension information of the thread blocks;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread in the thread block.
3. The method of claim 2, wherein iterating the three-dimensional index of any one thread using the dimension information of the thread block, the iteration step, and the three-dimensional index of the iteration step to generate the three-dimensional index of the thread in the thread block comprises:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional index of a first target thread in the thread block, wherein the interval between the first target thread and any one thread meets the iteration step length;
iterating the three-dimensional index of the first target thread by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate a three-dimensional index of a second target thread in the thread block, wherein the interval between the second target thread and the first target thread meets the iteration step length;
and repeatedly performing the step of iterating the three-dimensional index of the target thread in the thread block by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread whose interval from the target thread meets the iteration step length, so as to obtain the three-dimensional index of the thread in the thread block, wherein the target thread is a thread in the thread block for which a three-dimensional index has been generated.
4. The method of claim 3, wherein the dimension information of the thread block comprises dimension information of the thread block in an x direction and dimension information of the thread block in a y direction, the three-dimensional index of the first target thread comprises an index of the first target thread in the x direction, an index of the first target thread in the y direction and an index of the first target thread in the z direction, and the iterating the three-dimensional index of any one thread by using the dimension information of the thread block, the iteration step and the three-dimensional index of the iteration step to generate the three-dimensional index of the first target thread in the thread block comprises:
iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the y direction, the iteration step length and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the x direction;
and iterating the three-dimensional index of any one thread by adopting the dimension information of the thread block in the x direction, the dimension information of the y direction and the three-dimensional index of the iteration step length to obtain the index of the first target thread in the y direction and the index of the first target thread in the z direction.
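Claims 2-4 can be read as a strided walk over the threads of a block: decompose the preset iteration step into a three-dimensional offset, then repeatedly add that offset to the current thread's (x, y, z) index, carrying overflow from x into y and from y into z using only the block's x and y dimensions. The C++ sketch below follows that reading; the function names, the Idx3 type, and the interpretation of the step as the stride between parallel index-generation lanes are assumptions, not the claim wording.

```cpp
// Hypothetical sketch of the iteration in claims 2-4 (illustrative only).
#include <cstdint>
#include <vector>

struct Idx3 { uint32_t x, y, z; };            // three-dimensional thread index

// Decompose a linear quantity (here the preset iteration step) into the
// "three-dimensional index of the iteration step" for a dimX x dimY x dimZ block.
Idx3 toIdx3(uint32_t linear, uint32_t dimX, uint32_t dimY) {
    return { linear % dimX, (linear / dimX) % dimY, linear / (dimX * dimY) };
}

// One iteration: current target-thread index plus the 3-D step, with carry
// propagation in the x and then the y direction (the arithmetic of claim 4).
Idx3 nextTarget(Idx3 cur, Idx3 step3, uint32_t dimX, uint32_t dimY) {
    Idx3 n{ cur.x + step3.x, cur.y + step3.y, cur.z + step3.z };
    if (n.x >= dimX) { n.x -= dimX; n.y += 1; }   // carry x -> y
    if (n.y >= dimY) { n.y -= dimY; n.z += 1; }   // carry y -> z
    return n;
}

// Claim-3 loop: starting from one thread, keep iterating until the block is
// exhausted; with `step` such lanes (starting offsets 0..step-1) every thread
// in the block receives an index exactly once.
std::vector<Idx3> indicesForLane(uint32_t firstLinear, uint32_t step,
                                 uint32_t dimX, uint32_t dimY, uint32_t dimZ) {
    const uint32_t total = dimX * dimY * dimZ;
    const Idx3 step3 = toIdx3(step, dimX, dimY);
    std::vector<Idx3> out;
    Idx3 cur = toIdx3(firstLinear, dimX, dimY);   // index of the starting thread
    for (uint32_t t = firstLinear; t < total; t += step) {
        out.push_back(cur);
        cur = nextTarget(cur, step3, dimX, dimY);
    }
    return out;
}
```

For example, in a 4×2×2 block with an iteration step of 3, the lane starting at thread (0, 0, 0) produces (0,0,0), (3,0,0), (2,1,0), (1,0,1), (0,1,1), (3,1,1); the lanes starting at threads 1 and 2 cover the remaining thread indexes of the block.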
5. An index generator comprising a configuration filter, an iterator coupled to the configuration filter, and a format packing module coupled to the iterator; wherein:
the configuration filter is configured to obtain configuration information corresponding to the kernel function;
the iterator is configured to generate a three-dimensional index of a thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information;
the format packing module is configured to compress and pack the generated three-dimensional index according to the configuration information and store the compressed and packed three-dimensional index in a memory;
the configuration filter is further configured to acquire a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information;
and the format packing module is also configured to compress and pack the historical three-dimensional index according to the target historical configuration information and store the compressed and packed historical three-dimensional index in a memory.
6. The index generator of claim 5, wherein the configuration information comprises dimension information of a thread block and a three-dimensional index of a preset iteration step, and the iterator is specifically configured to:
obtain a three-dimensional index of any thread in the thread block according to the dimension information of the thread block;
and iterate the three-dimensional index of any one thread by adopting the dimension information of the thread block, the iteration step length and the three-dimensional index of the iteration step length to generate the three-dimensional index of the thread in the thread block.
7. A CUDA multithreading system comprising an index generator, a memory, and a graphics processing unit (GPU), the GPU comprising a scheduler and a GPU computing core;
the index generator is configured to obtain configuration information corresponding to the kernel function; generating a three-dimensional index of a thread according to the configuration information under the condition that target historical configuration information matched with the configuration information does not exist in the historical configuration information; compressing and packaging the generated three-dimensional index according to the configuration information, and storing the compressed and packaged three-dimensional index in a memory; acquiring a historical three-dimensional index corresponding to the target historical configuration information under the condition that the target historical configuration information exists in the historical configuration information; compressing and packaging the historical three-dimensional index according to the target historical configuration information, and storing the compressed and packaged historical three-dimensional index in a memory;
the memory is configured to store the three-dimensional index after the index generator compresses and packs, or the historical three-dimensional index after the index generator compresses and packs;
the scheduler is configured to schedule the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index and the instruction of the kernel function to the GPU computing core;
and the GPU computing core is configured to read the compressed and packaged three-dimensional index or the compressed and packaged historical three-dimensional index from the memory, perform a decompression operation, and execute the instruction with the decompressed three-dimensional index or the decompressed historical three-dimensional index.
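The claims do not fix a concrete compression format, so the following is only one plausible scheme, stated as an assumption: because a CUDA thread block has at most 1024 threads, the x and y components of a thread index are below 1024 and the z component below 64, so each (x, y, z) triple fits in a single 32-bit word at 10 bits per component. The producer side (index generator) would write such words to memory, and the consumer side (GPU computing core) would decompress them before executing the kernel instructions with the recovered indexes.

```cpp
// One possible fixed-width pack/unpack scheme for 3-D thread indexes
// (an illustrative assumption, not the claimed format).
#include <cstdint>

struct Idx3 { uint32_t x, y, z; };

// Producer side: pack one 3-D index into a 32-bit word (10 bits per component).
inline uint32_t pack(Idx3 i) {
    return (i.x & 0x3FF) | ((i.y & 0x3FF) << 10) | ((i.z & 0x3FF) << 20);
}

// Consumer side: decompress a word read from memory back into (x, y, z).
inline Idx3 unpack(uint32_t w) {
    return { w & 0x3FF, (w >> 10) & 0x3FF, (w >> 20) & 0x3FF };
}
```

A fixed 4-byte encoding of this kind keeps the per-thread index small, which is the sort of property that makes storing pre-generated indexes in memory practical.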
8. An electronic device comprising an input device and an output device, characterized in that it further comprises:
a processor adapted to implement one or more instructions; and
a computer-readable storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to perform the method of any one of claims 1 to 4.
9. A computer-readable storage medium, characterized in that,
the computer-readable storage medium has stored thereon one or more instructions adapted to be loaded by a processor and to perform the method of any one of claims 1 to 4.
10. A computer program product, characterized in that,
the computer program product comprises one or more instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1 to 4.
CN202210807694.4A 2021-10-18 2021-10-18 CUDA multithreading method, and related system, device, medium, and program Pending CN115344306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210807694.4A CN115344306A (en) 2021-10-18 2021-10-18 CUDA multithreading method, and related system, device, medium, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111212768.1A CN114020333B (en) 2021-10-18 2021-10-18 CUDA multithreading processing method, system and related equipment
CN202210807694.4A CN115344306A (en) 2021-10-18 2021-10-18 CUDA multithreading method, and related system, device, medium, and program

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202111212768.1A Division CN114020333B (en) 2021-10-18 2021-10-18 CUDA multithreading processing method, system and related equipment

Publications (1)

Publication Number Publication Date
CN115344306A true CN115344306A (en) 2022-11-15

Family

ID=80056566

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111212768.1A Active CN114020333B (en) 2021-10-18 2021-10-18 CUDA multithreading processing method, system and related equipment
CN202210807694.4A Pending CN115344306A (en) 2021-10-18 2021-10-18 CUDA multithreading method, and related system, device, medium, and program

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202111212768.1A Active CN114020333B (en) 2021-10-18 2021-10-18 CUDA multithreading processing method, system and related equipment

Country Status (1)

Country Link
CN (2) CN114020333B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860341B (en) * 2022-05-19 2023-09-22 北京百度网讯科技有限公司 Thread configuration method, device, apparatus and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8533697B2 (en) * 2007-02-14 2013-09-10 The Mathworks, Inc. Graphical processing unit (GPU) arrays providing high computational capabilities in a computing environment
CN102880509B (en) * 2012-09-17 2014-09-24 北京大学 Compute unified device architecture (CUDA) based grid digital elevation model (DEM) neighborhood analysis system and method
CN105261066B (en) * 2015-10-20 2017-10-03 华中师范大学 A kind of three-dimensional geographic information system real-time rendering multithreading distribution and control method
CN106023060A (en) * 2016-05-12 2016-10-12 清华大学深圳研究生院 Method of GPUs for generating parallel computing computer generated holograms (CGH) on the basis of CUDA
CN106844022A (en) * 2016-12-23 2017-06-13 中国石油天然气集团公司 A kind of method and system of data processing
CN108519611B (en) * 2018-03-01 2022-06-21 上海交通大学 Beidou B1C/B1I dual-frequency parallel multi-channel cooperative capturing method based on GPU
US11861761B2 (en) * 2019-11-15 2024-01-02 Intel Corporation Graphics processing unit processing and caching improvements

Also Published As

Publication number Publication date
CN114020333A (en) 2022-02-08
CN114020333B (en) 2022-05-31

Similar Documents

Publication Publication Date Title
US9304898B2 (en) Hardware-based array compression
TW201937416A (en) Scheduling neural network processing
CN114723033B (en) Data processing method, data processing device, AI chip, electronic device and storage medium
KR20120055352A (en) Apparatus and method for booting based on a snapshot image
US9218201B2 (en) Multicore system and activating method
KR20140117578A (en) Multithreaded computing
CN107680144B (en) WebP file conversion method and device
US11550586B2 (en) Method and tensor traversal engine for strided memory access during execution of neural networks
CN114020333B (en) CUDA multithreading processing method, system and related equipment
CN115576699B (en) Data processing method, device, AI chip, electronic equipment and storage medium
CN111491169B (en) Digital image compression method, device, equipment and medium
US11500962B1 (en) Emulating fine-grained sparsity in a systolic array
CN113296788B (en) Instruction scheduling method, device, equipment and storage medium
CN115756794A (en) Task scheduling execution method, and method and device for generating task scheduling execution instruction
CN116257208A (en) Method and apparatus for separable convolution filter operation on matrix multiplication array
US11803736B1 (en) Fine-grained sparsity computations in systolic array
Desnos et al. Pre-and post-scheduling memory allocation strategies on MPSoCs
Antao et al. Exploiting SIMD extensions for linear image processing with OpenCL
KR102467544B1 (en) arithmetic device and its operating method
CN113031988B (en) Application program updating method, device, equipment and storage medium
Ara et al. Throughput-Buffering Trade-Off Analysis for Scenario-Aware Dataflow Models
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN116804915B (en) Data interaction method, processor, device and medium based on memory
CN110879744B (en) Method and system for executing computation graph by multiple threads
CN117473212B (en) GPU acceleration method, device, equipment and storage medium of NTT algorithm

Legal Events

Date Code Title Description
PB01 Publication