CN115934102A - Dynamic allocation method and device of general register, computer equipment and storage medium - Google Patents


Info

Publication number
CN115934102A
Authority
CN
China
Prior art keywords
registers
general
register
machine instruction
general registers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211705663.4A
Other languages
Chinese (zh)
Other versions
CN115934102B (en
Inventor
董丙元
常竹林
董有臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Granfei Intelligent Technology Co.,Ltd.
Original Assignee
Glenfly Tech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Glenfly Tech Co Ltd
Priority to CN202211705663.4A
Publication of CN115934102A
Application granted
Publication of CN115934102B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The application relates to a dynamic allocation method and device of a general register, computer equipment and a storage medium. The method comprises the following steps: if the machine instruction program meets the register repartitioning condition in the execution process, determining the actual use number of general registers in the execution process of the machine instruction program; if a register overflow instruction is used in the execution process of the machine instruction program, adjusting the maximum distributable number of the general registers so as to enable the register overflow instruction not to be used in the execution process of the machine instruction program; and resetting the size of the working group in the compiler according to the actual using number of the general registers and the adjusted maximum allocable number of the general registers. By adopting the method, the problem that the number of the starting threads of the GPU processor and the general register resource are contradictory can be solved, the size of the working group of the compiler can be directly re-divided without recompiling, and therefore the compiling time of the compiler is reduced.

Description

Dynamic allocation method and device of general register, computer equipment and storage medium
Technical Field
The present application relates to the field of general register allocation technologies, and in particular, to a method and an apparatus for dynamically allocating general registers, a computer device, and a storage medium.
Background
OpenCL (Open Computing Language) is an open, royalty-free parallel programming standard for heterogeneous computing systems. An OpenCL program mainly includes two parts, host code and device code: the host code is compiled by a C/C++ compiler into a program executable on the host side, and the device code is compiled by an OpenCL C compiler into executable object code for a specific device. In general, an OpenCL compiler refers to the compiler tool used to compile device-side Kernel code.
In an OpenCL compiler, the conventional register allocation method sets the maximum available number of registers to a fixed value and performs register allocation only once. For complex routines, the conventional method can run out of registers. The usual remedy is to temporarily store variables held in registers to memory and read them back from memory when they are needed; however, reading and writing temporary data in memory takes a long time and greatly reduces device-side performance. Moreover, for a specific OpenCL Kernel program, an unreasonable workgroup size setting prevents hardware resources from being fully utilized, so the execution efficiency of the device-side hardware platform cannot be fully realized.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a method, an apparatus, a computer device, a computer readable storage medium, and a computer program product for dynamically allocating the maximum available number of general purpose registers, so that the number of registers used in a single thread and the number of threads started by a computing unit corresponding to the thread are more reasonable, and the execution efficiency of device-side hardware in executing parallel computing code is improved.
In a first aspect, the present application provides a method for dynamically allocating general registers, the method including:
converting the source code into a machine instruction program through a compiler;
if the machine instruction program meets the register repartitioning condition in the execution process, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program;
if a register overflow instruction is used in the execution process of the machine instruction program, adjusting the maximum distributable number of the general registers so as to enable the register overflow instruction not to be used in the execution process of the machine instruction program;
and resetting the size of the work group in the compiler according to the actual using number of the general registers and the adjusted maximum distributable number of the general registers.
In one embodiment, the method further comprises:
if the machine instruction program does not meet the register repartitioning condition in the execution process, setting the maximum allocable number of the general registers as a fixed value, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program.
In one embodiment, the register repartitioning condition is that none of the following occurs: the workgroup index information participates in operations other than the global workgroup index, a local memory statement is used, or a synchronization statement is used.
In one embodiment, adjusting the maximum allocable number of general purpose registers comprises:
determining a first theoretical value of the number of clocks required by the execution of the general register overflow instruction and a second theoretical value of the number of clocks required by the execution of the machine instruction program;
and if the ratio of the first theoretical value to the second theoretical value is greater than the preset value, increasing the maximum allocable number of the general registers by a fixed step value, returning to execute the step of performing register allocation operation according to the maximum allocable number of the general registers and determining the actual use number of the general registers in the execution process of the machine instruction program until the ratio of the first theoretical value to the second theoretical value is less than the preset value.
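The adjustment described in this embodiment can be sketched as follows, for illustration only. The function `allocate` stands in for the compiler's register allocation pass and is hypothetical, as are the toy cost model, the fixed step of 8, the preset value of 0.1 and the `hard_limit` guard; none of these values comes from the application.

```python
from typing import Callable, Tuple


def adjust_max_allocable(
    m_max: int,
    step: int,
    preset: float,
    allocate: Callable[[int], Tuple[int, int, int]],
    hard_limit: int,
) -> Tuple[int, int]:
    """Raise m_max by a fixed step until spill clocks / program clocks < preset.

    `allocate(m_max)` is a hypothetical stand-in for one register allocation
    pass; it returns (m_total, spill_clocks, program_clocks).
    """
    while True:
        m_total, spill_clocks, program_clocks = allocate(m_max)
        ratio = spill_clocks / program_clocks
        # Stop when the spill share is below the preset value, or when the
        # assumed hardware limit leaves no room for another step.
        if ratio < preset or m_max + step > hard_limit:
            return m_max, m_total
        m_max += step


def toy_allocate(m_max: int) -> Tuple[int, int, int]:
    """Toy cost model (assumption): the kernel wants 40 registers and
    every register it cannot get costs 50 extra clocks of spill traffic."""
    needed = 40
    spilled = max(0, needed - m_max)
    m_total = min(needed, m_max)
    return m_total, spilled * 50, 1000 + spilled * 50


if __name__ == "__main__":
    print(adjust_max_allocable(16, 8, 0.1, toy_allocate, 128))
```

The loop re-runs allocation after every increase, mirroring the "return to the register allocation step" wording of the embodiment; the `hard_limit` check is only a safeguard added so that the sketch always terminates.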
In one embodiment, the method further comprises:
if the processing unit in the GPU processor is judged not to meet the thread starting condition according to the maximum allocable number of the general registers and the actual using number of the general registers, the maximum allocable number of the general registers is reset through the compiler, and the step of converting the source code into the machine instruction program through the compiler is returned to be executed until the processing unit meets the thread starting condition.
In one embodiment, the thread starting condition includes: the ratio of the maximum allocable number of general registers to the actual number of general registers used is greater than the ratio of the workgroup size to the thread granularity.
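As a minimal sketch, the thread starting condition can be written as a single comparison. The function name and the numeric values in the usage lines are illustrative assumptions; integer cross-multiplication is used instead of division only to avoid floating-point rounding.

```python
def meets_thread_start_condition(
    m_max: int, m_total: int, workgroup_size: int, thread_granularity: int
) -> bool:
    """True when m_max / m_total > workgroup_size / thread_granularity.

    All four arguments are assumed to be positive integers, so the two
    ratios can be compared by cross-multiplying.
    """
    return m_max * thread_granularity > workgroup_size * m_total


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the application).
    print(meets_thread_start_condition(256, 32, 128, 32))
    print(meets_thread_start_condition(256, 32, 256, 32))
```

Because the comparison is strict, a workgroup that needs exactly as many threads as the registers allow does not satisfy the condition in this sketch.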
In one embodiment, resizing the workgroup based on the actual number of general purpose registers used and the adjusted maximum allocable number of general purpose registers comprises:
determining the thread number range which can be started by a processing unit in the GPU processor according to the actual using number of the general registers and the adjusted maximum distributable number of the general registers;
taking any number in the thread number range as the target starting thread number of the processing unit;
determining a plurality of first primitives according to the number of target starting threads, and taking any one of the plurality of first primitives as a first target primitive; the first primitive comprises three dimensions, and the product of the numerical values of the three dimensions is equal to the number of target starting threads;
determining a plurality of second primitives according to the thread granularity, and taking any one of the plurality of second primitives as a second target primitive; the second primitive comprises three dimensions, and the product of the values of the three dimensions is equal to the thread granularity;
and determining the numerical product of the corresponding dimensions of the first target primitive and the second target primitive as the target size of the workgroup in different dimensions.
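The last three steps above (first primitives, second primitives, and the per-dimension product) can be illustrated with the following sketch. The helper names are hypothetical, "primitive" is modelled simply as an ordered triple of positive integers, and the target thread count of 4 and thread granularity of 32 are assumed example values.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]


def factor_triples(n: int) -> List[Triple]:
    """All ordered triples (x, y, z) of positive integers with x * y * z == n."""
    triples = []
    for x in range(1, n + 1):
        if n % x:
            continue
        rest = n // x
        for y in range(1, rest + 1):
            if rest % y == 0:
                triples.append((x, y, rest // y))
    return triples


def workgroup_size(first: Triple, second: Triple) -> Triple:
    """Multiply corresponding dimensions of the two target primitives."""
    return tuple(a * b for a, b in zip(first, second))


if __name__ == "__main__":
    first_candidates = factor_triples(4)    # target starting thread count (assumed)
    second_candidates = factor_triples(32)  # thread granularity (assumed)
    first_target = (2, 2, 1)
    second_target = (32, 1, 1)
    assert first_target in first_candidates
    assert second_target in second_candidates
    print(workgroup_size(first_target, second_target))
```

Whichever pair of triples is chosen, the total workgroup size equals the target starting thread count multiplied by the thread granularity.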
In a second aspect, the present application further provides a device for dynamically allocating general registers. The device comprises:
the conversion module is used for converting the source code into a machine instruction program through a compiler;
the allocation module is used for performing register allocation operation according to the maximum allocable number of the general registers and determining the actual using number of the general registers in the execution process of the machine instruction program if the register re-division condition is met in the execution process of the machine instruction program;
the adjusting module is used for adjusting the maximum distributable number of the general registers if the register overflow instruction is used in the executing process of the machine instruction program, so that the register overflow instruction is not used in the executing process of the machine instruction program;
and the working group dividing module is used for resetting the size of the working group in the compiler according to the actual using number of the general registers and the adjusted maximum distributable number of the general registers.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
converting the source code into a machine instruction program through a compiler;
if the machine instruction program meets the register repartitioning condition in the execution process, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program;
if a register overflow instruction is used in the execution process of the machine instruction program, adjusting the maximum distributable number of the general registers so as to enable the register overflow instruction not to be used in the execution process of the machine instruction program;
and resetting the size of the work group in the compiler according to the actual using number of the general registers and the adjusted maximum distributable number of the general registers.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
converting the source code into a machine instruction program through a compiler;
if the machine instruction program meets the register repartitioning condition in the execution process, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program;
if a register overflow instruction is used in the execution process of the machine instruction program, adjusting the maximum distributable number of the general registers so as to enable the register overflow instruction not to be used in the execution process of the machine instruction program;
and resetting the size of the working group in the compiler according to the actual using number of the general registers and the adjusted maximum allocable number of the general registers.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
converting the source code into a machine instruction program through a compiler;
if the machine instruction program meets the register repartitioning condition in the execution process, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program;
if a register overflow instruction is used in the execution process of the machine instruction program, adjusting the maximum distributable number of the general registers so as to enable the register overflow instruction not to be used in the execution process of the machine instruction program;
and resetting the size of the working group in the compiler according to the actual using number of the general registers and the adjusted maximum allocable number of the general registers.
According to the above method, apparatus, computer device and storage medium for dynamically allocating general registers, for a machine instruction program in which the workgroup index information does not participate in operations other than the global workgroup index and in which neither local memory statements nor synchronization statements are used, the maximum allocable number of general registers is dynamically allocated according to whether register overflow instructions are used during execution of the machine instruction program. This balances the number of hardware threads started in one processing unit against the number of general registers used in one hardware thread, resolves the conflict between the number of threads started by the GPU processor and its general register resources, and improves the execution efficiency of the GPU processor. Meanwhile, for such a machine instruction program, the workgroup size of the compiler can be re-divided directly without recompiling, which effectively reduces the compile time of the compiler.
Drawings
FIG. 1 is a diagram of an embodiment of a dynamic general register allocation method;
FIG. 2 is a diagram illustrating a mapping relationship between the OpenCL memory model and a GPU processor according to an embodiment;
FIG. 3 is a flowchart illustrating a method for dynamic allocation of general purpose registers according to one embodiment;
FIG. 4 is a flow diagram illustrating the compiler and the dynamic allocation of general purpose registers according to another embodiment;
FIG. 5 is a flow diagram illustrating a process for adjusting the maximum allocable number of general purpose registers, according to one embodiment;
FIG. 6 is a schematic flow chart diagram that illustrates the generation of a target program in one embodiment;
FIG. 7 is a flow diagram illustrating resizing of workgroups in one embodiment;
FIG. 8 is a flow diagram of compiler execution in one embodiment;
FIG. 9 is a block diagram showing the structure of a general register dynamic allocation apparatus according to an embodiment;
FIG. 10 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for dynamically allocating the general registers provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein, the terminal 102 converts the source code into a machine instruction program through the compiler 104; if the machine instruction program meets the register repartitioning condition in the execution process, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program; if a register overflow instruction is used in the execution process of the machine instruction program, adjusting the maximum distributable number of the general registers so as to enable the register overflow instruction not to be used in the execution process of the machine instruction program; and resetting the size of the working group in the compiler according to the actual using number of the general registers and the adjusted maximum allocable number of the general registers. The compiler 104 converts the machine instruction program into an object program, issues the object program to the GPU processor 106, and the GPU processor 106 executes a task corresponding to the object program. The terminal 102 may be, but is not limited to, various personal computers, laptops, smartphones, tablets, internet of things devices, and portable wearable devices including the compiler 104 and the GPU processor 106, among others.
As shown in fig. 2, the memory model of the OpenCL compiler includes a global memory, a constant memory, a local memory and a private memory. These four memories represent different memory regions, and the memory space used is determined by the OpenCL Kernel source code. A kernel may use the keywords global, constant, local, or private; each keyword specifies the memory region in which the data of a variable is created. At the level of the OpenCL memory model, these memory regions are logically disjoint; making data in one region available to another region, or moving data between regions, is controlled by the Kernel source code. Each memory region has its own performance characteristics, and read performance differs greatly between these levels of memory. The memory regions are described as follows:
Global memory is visible to all work items in the executing kernel. Data can be transferred between the host and the device only by storing it in the global memory; in the present invention, this data is stored in the Video Memory of the GPU. The keyword is global or __global; for example, global int *a in the Kernel source code indicates that the data pointed to by the pointer a is placed in the global memory. A work item refers to the innermost operation in a loop, and the work items that access the same processing resource form a work group. A processing resource capable of supporting a work group is called a processing unit, and each work group executes on a single processing unit. Because work items need to be synchronized, they must be divided into work groups: work items can be synchronized only if they belong to the same work group.
Constant memory is not designed for every kind of read-only data; rather, it allows all work items in the computation space to access the same data at the same time. The values stored in constant memory typically do not change. The Kernel source code uses the keyword constant or __constant to map the corresponding data to constant memory. Generally, an OpenCL compiler maps constants to the Constant Buffer or the Video Memory of the GPU processor according to the constant memory used in the Kernel source code; the GPU processor reads data stored in the Constant Buffer more quickly.
Local memory stores data that can be shared only among the work items of the same work group. Local memory is mapped to a Vertex Buffer (VB) or shared memory (Share Memory) in the GPU processor, which has shorter access latency and higher transmission bandwidth than global memory. When clSetKernelArg() is called to set local memory, only the size needs to be passed and no pointer is required; the storage space for the local memory is allocated on the device side at run time. In the OpenCL kernel, the keyword local or __local is used to declare local memory.
Private memory can be accessed only by the work item itself. In practice, private variables usually correspond to general registers; when there are not enough general registers for the private data, data overflow (register spilling) occurs, and the overflowed data is usually stored in the global memory. It should be noted that, in this embodiment, variables declared with the private or __private keyword in the Kernel source code are stored in the global memory, which corresponds to the Video Memory of the GPU processor on the right side of fig. 2.
Fig. 2 mainly shows the OpenCL memory model and how the memories of different levels in the Kernel source code map onto the GPU processor. The present embodiment adopts the GPU memory architecture on the right side of fig. 2. In this GPU architecture, one work group (Work Group) in a Kernel can be assigned only to a single processing unit (PE), so the maximum size of one work group in a GPU of this architecture is n × SIMD32 or n × SIMD64, where n is the maximum number of executable threads in the processing unit (PE).
In one embodiment, as shown in fig. 3, a method for dynamically allocating general registers is provided, which is described by taking the method as an example for being applied to the terminal in fig. 1, and includes the following steps:
in step 302, the source code is converted into a machine instruction program by a compiler.
The compiler refers to a compiler tool for compiling device-side Kernel codes. The source code may be device side Kernel code. The machine instruction program refers to an instruction which can be directly recognized and executed by a CPU processor, and the representation form of the machine instruction program is binary code.
Optionally, the terminal starts a process through a driver to call the OpenCL compiler. The OpenCL compiler converts the source code into an intermediate representation program, which is the internal representation generated after the compiler scans the source code and which represents the semantic and syntactic structure of the source code. Instruction lowering and instruction selection are then performed on the intermediate representation program to convert it into a machine instruction program.
Step 304, if the machine instruction program meets the register repartitioning condition in the execution process, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program.
The general register (CRF) is a high-speed storage component with limited capacity in the GPGPU. It serves as temporary storage for each executing thread and records the original data and results of instruction execution. A computing unit or computing core usually has only a few tens of kilobytes of general register space (2 × 512 bit × M), which can also be expressed as only M general registers, while the number N of hardware threads (Wave) that the computing unit can start is dynamically variable from 1 to N_max (typically 1 to 32 threads can be started). The general register space available to each thread is then M/N; since M is a fixed value, the more threads are started, the fewer general registers each thread can use. For a given piece of Kernel code, how the OpenCL compiler allocates general registers so that the number m of registers actually used in a single thread and the number n of threads started by the computing unit are reasonably balanced is a key factor in improving hardware execution efficiency.
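The M/N trade-off can be made concrete with a short calculation. This is a sketch only: M = 256 is an assumed register count, rounding down to whole registers is an assumption, and the function names are hypothetical.

```python
def registers_per_thread(total_registers: int, active_threads: int) -> int:
    """General registers available to each thread (M / N, rounded down)."""
    return total_registers // active_threads


def max_threads(total_registers: int, registers_used: int, n_max: int) -> int:
    """Largest thread count whose combined register use fits in the unit,
    capped at the hardware maximum n_max."""
    return min(n_max, total_registers // registers_used)


if __name__ == "__main__":
    M = 256     # assumed number of general registers per computing unit
    N_MAX = 32  # upper end of the typical 1 to 32 thread range
    for n in (1, 8, 32):
        print(n, registers_per_thread(M, n))
    # A kernel that uses 20 registers per thread limits the thread count.
    print(max_threads(M, 20, N_MAX))
```

Read in one direction, the calculation gives the per-thread register budget for a chosen thread count; read in the other, it gives the thread count that a kernel's register use permits.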
General register allocation is a method of increasing program execution speed by assigning as many program variables as possible to registers. After general register allocation, each allocated or used general register has a unique number; by finding the largest general register number in the machine instruction program, that number can be converted into the actual number m_total of general registers used.
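A minimal sketch of this counting step is given below. The textual instruction format, the `r<number>` register naming and the zero-based numbering (so that the count is the largest number plus one) are all assumptions made for illustration; the application does not specify how the largest number is converted into m_total.

```python
import re
from typing import Iterable

# Hypothetical register syntax: r0, r1, r2, ...
REGISTER_PATTERN = re.compile(r"\br(\d+)\b")


def actual_register_count(program: Iterable[str]) -> int:
    """Return m_total, assuming registers are numbered from zero.

    Scans every instruction for register operands and converts the largest
    register number found into a count (largest number + 1).
    """
    numbers = [
        int(match)
        for instruction in program
        for match in REGISTER_PATTERN.findall(instruction)
    ]
    return max(numbers) + 1 if numbers else 0


if __name__ == "__main__":
    program = ["mul r2, r0, r1", "add r5, r2, r0", "st [r3], r5"]
    print(actual_register_count(program))
```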
In some embodiments, if the machine instruction program does not meet the register repartitioning condition during execution, the maximum allocable number m_max of general registers is set to a fixed value, a register allocation operation is performed according to m_max, and the actual number of general registers used during execution of the machine instruction program is determined. The register repartitioning condition is that none of the following occurs: the workgroup index information participates in operations other than the global workgroup index, a local memory statement is used, or a synchronization statement is used.
Optionally, as shown in fig. 4, the terminal analyzes the machine instruction program corresponding to the Kernel source code. When the machine instruction program meets the register repartitioning condition during execution, that is, none of the following occurs: use of a local memory statement (local keyword), use of a synchronization statement (barrier() or sync()), or participation of the workgroup index information in operations other than the global workgroup index, the initialization work before general register allocation is carried out. First, the maximum allocable number m_max of general registers is set for the first time, and a copy of the machine instructions is made as a backup. A register allocation operation is then performed according to the initialized m_max, the largest general register number in the machine instruction program is found, and that number is converted into the actual number m_total of general registers used.
If the machine instruction program does not meet the register repartitioning condition during execution, that is, at least one of the following occurs: use of a local memory statement (local keyword), use of a synchronization statement (barrier() or sync()), or participation of the workgroup index information in operations other than the global workgroup index, the maximum allocable number m_max of general registers is set to a fixed value and a single register allocation operation is performed according to m_max. A status bit is set to mark whether a local memory statement or a synchronization statement is used in the machine instruction program, or whether the workgroup index information participates in operations other than the global workgroup index. Whether the workgroup needs to be reset is judged according to this status bit: the status bit is true if the workgroup needs to be reset and false otherwise. The actual number m_total of general registers used during execution of the machine instruction program is then determined.
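The two branches above turn on a single predicate, which can be sketched as follows. The `KernelFeatures` record and its field names are hypothetical; in practice the three flags would come from the compiler's analysis of the machine instruction program.

```python
from dataclasses import dataclass


@dataclass
class KernelFeatures:
    """Hypothetical summary of what the compiler found in a kernel."""
    uses_local_memory: bool     # a local memory statement (local keyword)
    uses_synchronization: bool  # a synchronization statement such as barrier()
    index_used_elsewhere: bool  # workgroup index information used in other
                                # operations than the global workgroup index


def meets_repartition_condition(features: KernelFeatures) -> bool:
    """True only when none of the three disqualifying features occurs."""
    return not (
        features.uses_local_memory
        or features.uses_synchronization
        or features.index_used_elsewhere
    )


if __name__ == "__main__":
    simple = KernelFeatures(False, False, False)
    with_barrier = KernelFeatures(False, True, False)
    print(meets_repartition_condition(simple))
    print(meets_repartition_condition(with_barrier))
```

A kernel for which the predicate is true takes the dynamic path (first setting of m_max plus a backup copy); any other kernel takes the fixed-value path with a single allocation.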
In step 306, if the register overflow instruction is used during the execution of the machine instruction program, the maximum allocable number of the general registers is adjusted, so that the register overflow instruction is not used during the execution of the machine instruction program.
After the compiler converts the OpenCL Kernel source code into a machine instruction program, the instructions of that program are executed on the GPU in units of threads. Particular attention must therefore be paid to the number of general registers used in the compiler-generated machine instruction program and to the number of threads that can be started in the GPU processor. If at most n_max threads can be started in the GPU processor, the minimum number of general registers that one thread can use is m_max/n_max. These figures have to be given to the compiler designers in advance, and the compiler must fix the upper limit of available physical general registers at m_max/n_max in the register allocation stage. If the algorithm in the Kernel source code is complex, m_max/n_max general registers cannot meet the demand for allocable registers, and the register spill (Spill) technique is used: temporary variables held in general registers that are not needed for the moment are stored to the global memory, and they are read back from the global memory into general registers only when they are needed again. Although this guarantees that the largest number of threads can be started, each thread spends a large amount of time on the extra global memory reads and writes, so the time required to execute all instructions after the threads are started increases greatly.
For different Kernel source codes, it becomes important for a compiler to be able to dynamically adjust the number of available general purpose registers when they are allocated. Meanwhile, data stored locally (local memory) according to the OpenCL standard can only be shared among different workitems in the same workgroup, and it can be known theoretically that the size of the workgroup can be reset in a Kernel source code that does not use local memory statements or synchronous statements, so that the relationship between the size of the workgroup and the number of threads started in one processing unit (PE) and the number of general purpose registers used in one thread can be adjusted more flexibly.
Alternatively, as shown in fig. 4, if a register overflow instruction is used during the execution of the machine instruction program, the maximum allocable number m of the general-purpose register is shown max Cannot meet the requirement of allocable register and adjust initializationUsing maximum allocable number m of registers max The amount is then adjusted according to the maximum allocable number m of the general registers max Performing register allocation, judging whether a register overflow instruction is used in the execution process of the machine instruction program again, if the register overflow instruction is used in the execution process of the machine instruction program, adjusting the maximum allocable number m of the general registers again on the basis of the first adjustment max Returning to execute and carrying out register allocation according to the adjusted maximum allocable number of the general registers until a register overflow instruction is not used in the execution process of the machine instruction program, setting a state bit for marking whether a local memory statement or a synchronous statement is used in the machine instruction program or working group index information participates in other operations except the global working group index, judging whether the working group needs to be reset according to the state bit, and if the working group needs to be set, setting the working group as true; false if no work group needs to be set. Determining the actual number of general purpose registers used during the execution of a program of machine instructions mtotal
Step 308, resetting the size of the working group in the compiler according to the actual using number of the general registers and the adjusted maximum distributable number of the general registers.
By analyzing the memory mode used in the program, the workgroup size can be re-divided when no local memory statements or synchronous statements are used, or when the workgroup index information participates in operations other than the global workgroup index. Each hardware thread can thus use the general registers reasonably, and recompilation is reduced.
Optionally, the terminal may calculate, from the actual number m_total of general purpose registers used and the adjusted maximum allocable number m_max of the general registers, the maximum number of threads that can be started in a processing element (PE) at that time, and reset the workgroup size in the compiler according to that maximum number of threads.
In the above method for dynamically allocating general registers, for a machine instruction program that meets any one of the following conditions (the workgroup index information participates in operations other than the global workgroup index, local memory statements are not used, synchronous statements are not used), the maximum allocable number m_max of general purpose registers is dynamically allocated according to whether register overflow instructions are used during the execution of the machine instruction program. This balances the relationship between the number of hardware threads started in one processing unit and the number of general registers used in one hardware thread, resolves the conflict between the number of threads started by the GPU processor and the general register resources, and improves the execution efficiency of the GPU processor. Meanwhile, for such a machine instruction program, the workgroup size of the compiler can be re-divided directly without recompilation, which effectively reduces the compile time of the compiler.
In one embodiment, the general purpose registers are allocated in a single pass of a single compilation flow; that is, the maximum allocable number m_max of general purpose registers is set in the compiler, and at most m_max general registers can be allocated in the general register allocation stage. If the Kernel source code logic is very complex and m_max general registers are not enough to meet the allocation demand, the basic blocks of machine instructions are split (Split) to reduce general register usage. If splitting still cannot meet the demand, the general register spill (Spill) technique is used: variables that are temporarily unused are first stored in memory, and are read back from memory into the general registers when they are about to be used. Therefore, if a register overflow instruction is used during the execution of the machine instruction program, the maximum allocable number m_max of the general registers needs to be adjusted dynamically so that no register overflow instruction is used during the execution of the machine instruction program. Specifically, as shown in fig. 5, adjusting the maximum allocable number of the general registers includes:
Step 502, determining a first theoretical value of the number of clocks required for execution of the general register overflow instructions and a second theoretical value of the number of clocks required for completion of execution of the machine instruction program.
Optionally, the terminal determines, according to the operation speed and clock frequency of the GPU processor, a first theoretical value C_spill of the number of clocks required for execution of the general register overflow instructions and a second theoretical value C_total of the number of clocks required for completion of execution of the machine instruction program.
Step 504, if the ratio of the first theoretical value to the second theoretical value is greater than the preset value, the maximum allocable number of the general purpose register is increased by a fixed step value, the step of performing register allocation operation according to the maximum allocable number of the general purpose register and determining the actual use number of the general purpose register in the execution process of the machine instruction program is returned to be executed until the ratio of the first theoretical value to the second theoretical value is less than the preset value.
Optionally, if the ratio of the first theoretical value C_spill to the second theoretical value C_total is greater than the preset value, the maximum number of allocable registers needs to be increased to reduce register overflow operations. The terminal takes the sum of a fixed step value m_step and the current maximum number of allocable registers m_max as the new maximum number of allocable registers, i.e. m_max = m_max + m_step, and returns to the step of performing the register allocation operation according to the maximum allocable number of the general registers and determining the actual number of general registers used during the execution of the machine instruction program. If the ratio of C_spill to C_total is less than the preset value, the maximum number of allocable registers does not need to be increased.
In this embodiment, whether the current maximum number of allocable registers meets the demand is determined from the ratio between the first theoretical value of the number of clocks required for executing the general register overflow instructions and the second theoretical value of the number of clocks required for completing the execution of the machine instruction program. When the maximum number of allocable registers needs to be increased to reduce register overflow operations, the maximum allocable number m_max of the general registers is increased by a fixed step value and the program is returned to the compiler for recompilation, until the ratio of the first theoretical value to the second theoretical value is less than the preset value, that is, until m_max no longer needs to be increased. Registers are thus allocated dynamically over multiple passes, and the maximum number of registers available during register allocation is set reasonably by analyzing the memory mode used in the machine instruction program, the dependencies between the generated instructions, and the time required to execute each instruction. Temporary variables held in memory are thereby reduced or eliminated, which can greatly reduce the time spent executing the program.
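The adjustment loop of steps 502 and 504 can be sketched as follows. This is only an illustration under stated assumptions: `allocate` is a hypothetical stand-in for the compiler's register allocation pass, `hw_limit` is an invented guard so that the sketch always terminates, the toy cost model is made up, and treating a ratio equal to the preset value as acceptable is also an assumption. The numbers 64, 21 and 79 echo the worked example given later in this description.

```python
from typing import Callable, Tuple

# (m_total, c_spill, c_total) as reported by a register allocation pass
AllocResult = Tuple[int, int, int]


def adjust_max_registers(
    m_max: int,
    m_step: int,
    preset: float,
    hw_limit: int,
    allocate: Callable[[int], AllocResult],
) -> Tuple[int, int]:
    """Raise m_max by m_step until C_spill / C_total no longer exceeds preset.

    `allocate` and `hw_limit` are hypothetical; the source only describes
    the ratio test and the fixed step increase.
    """
    while True:
        m_total, c_spill, c_total = allocate(m_max)
        if c_spill / c_total <= preset:
            return m_max, m_total
        if m_max + m_step > hw_limit:
            raise RuntimeError("cannot raise m_max any further")
        m_max += m_step


def toy_allocate(m_max: int) -> AllocResult:
    """Invented cost model: the kernel wants 79 registers, each spill costs 100 clocks."""
    needed = 79
    spilled = max(0, needed - m_max)
    c_spill = spilled * 100
    c_total = 2000 + c_spill
    return min(needed, m_max), c_spill, c_total


if __name__ == "__main__":
    print(adjust_max_registers(64, 21, 0.05, 256, toy_allocate))
```

With the toy model, the first pass at 64 registers spills heavily, so the limit is raised once to 85, at which point all 79 required registers fit.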
In one embodiment, as is known from the OpenCL programming standard and GPU processor characteristics, the larger the workgroup, the greater the number of general purpose registers needed to start it. For a workgroup of size 0x400 (hexadecimal, i.e. 1024 work-items), calculation shows that the GPU processor theoretically requires 32 threads to start one workgroup. The workgroup size therefore directly affects the maximum number of threads that a processing unit can start and the maximum allocable number m_max of general purpose registers required by a processing unit. As shown in FIG. 6, the status bit of the above embodiments determines whether the workgroup needs to be re-divided. If re-division is needed, the workgroup size is reset according to the actual number m_total of general purpose registers used and the adjusted maximum allocable number m_max of general purpose registers, and a target program is generated through the instruction emission flow, where the target program is a binary file that can be recognized and processed by the GPU processor; the target program and the status bit true are returned to the compiler. If re-division is not needed, the target program is generated through the instruction emission flow, and the target program and the status bit false are returned to the compiler.
As shown in fig. 7, resetting the size of the workgroup according to the actual number of general registers used and the adjusted maximum allocable number of general registers includes:
Step 702, determining the range of the number of threads that can be started by the processing unit in the GPU processor according to the actual number of general registers used and the adjusted maximum allocable number of general registers.
Optionally, the maximum number of threads n_max that the processing unit can start is determined from the adjusted maximum allocable number m_max of general purpose registers and the actual number m_total of general registers used; the number of threads that a processing unit in a GPU processor can start therefore ranges over [1-n_max].
Step 704, taking any number in the thread number range as the target number of threads to be started by the processing unit.
Step 706, determining a plurality of first primitives according to the number of target starting threads, and taking any one of the plurality of first primitives as a first target primitive; the first primitive includes three dimensions, the product of the values of which is equal to the target number of threads to start.
Wherein, the first primitive refers to a combination representing the sizes of the workgroup in its different dimensions, generally the x, y and z directions. Let the target number of started threads be denoted N and the first primitive be denoted (N1, N2, N3); then N = N1 × N2 × N3.
For example, if the thread number range is [1-12] and the target number of started threads is 12, then (N1, N2, N3) in the first primitive may be any ordering of the values 12, 1 and 1, or any ordering of the values 2, 2 and 3.
Step 708, determining a plurality of second primitives according to the thread granularity, and taking any one of the plurality of second primitives as a second target primitive; the second primitive includes three dimensions, the product of the values of the three dimensions being equal to the thread granularity.
Where the thread granularity reflects the amount of computation of a thread and is typically the constant 32. The second primitive is denoted (x_cst, y_cst, z_cst), so that 32 = x_cst × y_cst × z_cst. The values (x_cst, y_cst, z_cst) may be any ordering of 32, 1 and 1; of 1, 4 and 8; of 2, 2 and 8; of 1, 2 and 16; or of 2, 4 and 4. Any one of these second primitives may serve as the second target primitive.
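The candidate first and second primitives described above are simply the three-dimensional factorizations of the target thread count and of the thread granularity. A minimal sketch that enumerates them is given below; the function name `factor_triples` is illustrative and does not come from the source.

```python
from typing import List, Tuple


def factor_triples(n: int) -> List[Tuple[int, int, int]]:
    """Return every ordered triple (a, b, c) of positive integers with a * b * c == n."""
    triples = []
    for a in range(1, n + 1):
        if n % a:
            continue
        rest = n // a
        for b in range(1, rest + 1):
            if rest % b == 0:
                triples.append((a, b, rest // b))
    return triples


if __name__ == "__main__":
    first_primitives = factor_triples(12)   # target number of started threads
    second_primitives = factor_triples(32)  # thread granularity
    print(first_primitives)
    print(second_primitives)
```

Any element of the first list can serve as the first target primitive and any element of the second list as the second target primitive.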
Step 710, determining the numerical product of the first target primitive and the second target primitive on the corresponding dimension as the target size of the workgroup in different dimensions.
Wherein, the strategy for repartitioning the workgroup is as follows: repartitioned workgroup size = {x_cst × N1, y_cst × N2, z_cst × N3}, where N = N1 × N2 × N3 and N is the target number of started threads. In this embodiment, the first primitive is (12, 1, 1) and the second primitive is (32, 1, 1), so the size of the re-divided workgroup is {32 × 12, 0x1, 0x1}, and the size range of the re-divided workgroup is {32 × N, 0x1, 0x1} with N in [1-12]. It should be noted that 0x1 here is hexadecimal notation for the value 1, not the number 0 multiplied by 1.
In this embodiment, the range of the number of threads that a processing unit in the GPU processor can start is determined from the actual number m_total of general purpose registers used and the adjusted maximum allocable number m_max of general purpose registers. A plurality of first primitives is determined according to the thread number range, a plurality of second primitives is determined according to the thread granularity, any one of the first primitives and any one of the second primitives are taken as the first target primitive and the second target primitive, and the size of the re-divided workgroup is determined from the product of the corresponding positions of the two target primitives. Determining the workgroup size in this way balances the relationship between the workgroup size and the number of general registers the workgroup requires, ensures that each hardware thread uses the general registers reasonably, and reduces recompilation.
In one embodiment, a status bit is set according to whether the machine instruction program meets the register repartitioning condition of the above embodiments, with true indicating that the workgroup needs to be reset and false indicating that it does not. In the above embodiment, the target program and the status bit are returned to the compiler, and the compiler executes different processing flows according to the status bit. As shown in fig. 8, if the compiler determines from the status bit that the workgroup needs to be reset, it adjusts the current workgroup size to the re-divided workgroup size calculated in the foregoing embodiment and requests the GPU processor to execute the task corresponding to the target program. If the compiler determines from the status bit that the workgroup does not need to be reset, it judges, according to the maximum allocable number m_max of general purpose registers and the actual number m_total of general purpose registers used, whether the processing unit in the GPU processor meets the thread starting condition. If the processing unit does not meet the thread starting condition, the maximum allocable number of the general registers is reset through the compiler, and the method returns to the step of converting the source code into the machine instruction program through the compiler, until the processing unit meets the thread starting condition. If the processing unit meets the thread starting condition, the GPU processor is requested to execute the task corresponding to the target program.
Wherein, the thread starting condition is that the ratio of the maximum allocable number m_max of general registers to the actual number m_total of general registers used is greater than the ratio of the workgroup size to the thread granularity, i.e.:
m_max / m_total > (workgroup size / 32).
It should be noted that the workgroup size can be computed in two modes, an aligned mode and a non-aligned mode; which mode is used depends on how the compiler designer implements it. The size of the initial workgroup is denoted workgroup = {x_ori, y_ori, z_ori}. The workgroup size in the aligned mode is:
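The repartitioning rule {x_cst × N1, y_cst × N2, z_cst × N3} can be written as a short helper. This is a sketch, not the patented implementation; the function name and the validation checks are illustrative additions.

```python
from math import prod
from typing import Tuple

Triple = Tuple[int, int, int]


def repartition_workgroup(first: Triple, second: Triple,
                          target_threads: int, granularity: int = 32) -> Triple:
    """Multiply the first and second target primitives dimension by dimension."""
    if prod(first) != target_threads:
        raise ValueError("first primitive must multiply to the target thread count")
    if prod(second) != granularity:
        raise ValueError("second primitive must multiply to the thread granularity")
    return tuple(c * n for c, n in zip(second, first))


if __name__ == "__main__":
    # The combination used in this embodiment: (12, 1, 1) with (32, 1, 1).
    print(repartition_workgroup((12, 1, 1), (32, 1, 1), target_threads=12))
    # Another valid combination for the same thread count.
    print(repartition_workgroup((2, 2, 3), (2, 4, 4), target_threads=12))
```

Whatever pair of primitives is chosen, the total number of work-items is the target thread count multiplied by the granularity, here 12 × 32 = 384.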
workgroup size = (((x_ori + x_cst - 1) / x_cst) * x_cst) * (((y_ori + y_cst - 1) / y_cst) * y_cst) * (((z_ori + z_cst - 1) / z_cst) * z_cst)
The workgroup size in the non-aligned mode is:
workgroup size = ((x_ori * y_ori * z_ori + 32 - 1) / 32) * 32.
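The two size formulas and the thread starting condition can be sketched as below. Two assumptions are made: the division in the size formulas is integer division (so that each quantity is rounded up to a multiple of its granularity), and the numeric values in the usage part are illustrative rather than taken from the source.

```python
from typing import Tuple

Triple = Tuple[int, int, int]


def aligned_size(ori: Triple, cst: Triple) -> int:
    """Aligned mode: round each dimension up to a multiple of its granularity component."""
    size = 1
    for o, c in zip(ori, cst):
        size *= ((o + c - 1) // c) * c
    return size


def non_aligned_size(ori: Triple, granularity: int = 32) -> int:
    """Non-aligned mode: round the total work-item count up to a multiple of the granularity."""
    x, y, z = ori
    return (x * y * z + granularity - 1) // granularity * granularity


def meets_thread_start_condition(m_max: int, m_total: int,
                                 workgroup_size: int, granularity: int = 32) -> bool:
    """Thread starting condition as stated in the text: m_max / m_total > size / 32."""
    return m_max / m_total > workgroup_size / granularity


if __name__ == "__main__":
    ori = (10, 3, 1)                       # illustrative initial workgroup
    print(aligned_size(ori, (8, 4, 1)))    # each dimension padded separately
    print(non_aligned_size(ori))           # total padded to a multiple of 32
    print(meets_thread_start_condition(1024, 79, 384))
```

The aligned mode pads every dimension separately and therefore never yields a smaller size than the non-aligned mode for the same initial workgroup.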
It should be noted that when the workgroup does not need to be re-divided, a machine instruction program that meets neither the register repartitioning condition nor the thread starting condition may be unable to start threads with the current workgroup size. In that case, the maximum number of registers usable for compiling the Kernel is reset and the Kernel is recompiled, or a message that the current workgroup exceeds the maximum workgroup range is issued and execution exits.
In this embodiment, when the workgroup does not need to be re-divided, it is further determined whether the processing unit in the GPU processor meets the thread starting condition. When it does not, the maximum number of registers usable for compiling the Kernel is reset and the Kernel is recompiled, or a message that the current workgroup exceeds the maximum workgroup range is issued and execution exits, so that the execution efficiency of the GPU processor is ensured.
In one embodiment, the Kernel source code includes host-side code and device-side kernel code. The host-side code contains the global workgroup size and the local workgroup size set in the program. The local workgroup size is set by the program developer, and local workgroups of different sizes affect the execution time of the kernel program when the device side executes it. The kernel program computes the arctangent function (atan) of the data in memory buffer a in the processing unit and stores the result in memory buffer b in the processing unit. The kernel program uses no local memory or synchronous statement operations. When the general registers are allocated, the maximum allocable number m_max of general registers is first set to 64, and registers are then allocated according to m_max. However, because the Taylor-expansion algorithm used for the arctangent function is fairly complex, and because the double type occupies twice the general register space of a single-precision floating point number in the SIMD instruction data mode of the GPU processor, a register overflow instruction occurs in general register allocation when m_max is 64. At this point m_max is updated as m_max = m_max + m_step; this embodiment takes the step value m_step = 21, so the updated maximum allocable number m_max is 85. Register allocation continues with m_max = 85, and the actual number m_total of general registers used is found to be 79, with no register overflow instruction. The status bit is set to true to mark that the machine instruction program uses no local memory statement or synchronous statement, or that the workgroup index information participates in operations other than the global workgroup index.
The status bit shows that the machine instruction program contains no shared memory or synchronous statements, so it can be judged that the workgroup needs to be re-divided. From the actual number m_total of general registers used, the maximum number of threads that can be started in a processing element (PE) at that time is calculated as 4 × m_max / m_total (4 is a fixed constant); that is, when 79 general purpose registers are used, the maximum number of threads n_max that can be started is 4 × 256 / 79 = 12 (rounded down). The workgroup can be re-divided within the range {32 × N, 0x1, 0x1}, with N in [1-12]. The machine instruction program is then converted into the target program through the instruction emission flow, the workgroup repartitioning state is set to true, and the target program, this state, and the thread range [1-N] for workgroup repartitioning are sent to the compiler.
A driver in the compiler receives the compiled kernel binary file, together with the workgroup repartitioning state true and the thread range [1-N] for workgroup repartitioning calculated by the compiler; it can then reset the workgroup and call the clEnqueueNDRangeKernel API to start and execute the kernel program. If the workgroup size in the host-side code is initially set to [0x400, 0x1, 0x1], the workgroup has 0x400 work-items, and, as is known from the OpenCL programming standard and GPU processor characteristics, the larger the workgroup, the greater the number of general purpose registers needed to start it. For a workgroup of size 0x400, calculation shows that the GPU processor of the present invention theoretically needs 32 threads to start one workgroup. Each thread can then be given only 32 general register resources, whereas at least 79 allocable general registers are needed to avoid register overflow and thereby improve program execution performance. Meanwhile, according to the workgroup repartitioning state and the repartitioning thread range sent by the compiler, the driver selects the size of the re-divided workgroup and then requests the GPU processor to execute the task corresponding to the target program.
For the case where the workgroup cannot be re-divided, that is, the machine instruction program does not meet the register repartitioning condition during execution, it is further judged whether the thread starting condition is met. If it is met, there are enough hardware resources to start the threads, and the GPU processor is requested to execute the task corresponding to the target program. If it is not met, the threads cannot be started with the current workgroup size; the maximum allocable number m_max is then reset and a secondary compilation is performed, or a message that the current local workgroup exceeds the maximum local workgroup range is issued and execution exits.
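The arithmetic of this example can be checked with a few lines of code. The constants 4, 256 and 32 are the ones quoted in the text. Reading 4 × 256 as the register budget shared by the threads of one processing element is an interpretation made for this sketch, not something the source states.

```python
FIXED_CONSTANT = 4        # "4 is a fixed constant" in the text
REGISTER_COUNT = 256      # the value multiplied by 4 in the text's calculation
THREAD_GRANULARITY = 32   # work-items handled by one thread


def max_startable_threads(m_total: int) -> int:
    """n_max = 4 x 256 / m_total, rounded down."""
    return FIXED_CONSTANT * REGISTER_COUNT // m_total


def threads_for_workgroup(work_items: int) -> int:
    """Threads needed to start one workgroup of the given size."""
    return (work_items + THREAD_GRANULARITY - 1) // THREAD_GRANULARITY


def registers_per_thread(threads: int) -> int:
    """Registers available to each thread under the assumed shared budget."""
    return FIXED_CONSTANT * REGISTER_COUNT // threads


if __name__ == "__main__":
    print(max_startable_threads(79))        # threads startable with 79 registers each
    print(threads_for_workgroup(0x400))     # threads needed for the original workgroup
    print(registers_per_thread(32))         # registers per thread for that workgroup
```

Under this reading, a 0x400 workgroup leaves each of its 32 threads with 32 registers, well short of the 79 required, while limiting the workgroup to 12 threads leaves 85 registers per thread, which coincides with the adjusted m_max of the example.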
In one embodiment, detailed steps of a method for dynamically allocating general purpose registers are provided, comprising:
Step 1, converting a source code into a machine instruction program through a compiler.
Step 2, judging whether the machine instruction program meets the register repartitioning condition in the execution process, wherein the register repartitioning condition comprises any one of the following: the workgroup index information participates in operations other than the global workgroup index, local memory statements are not used, and synchronous statements are not used; if not, executing step 3; if yes, executing step 4.
Step 3, setting the maximum allocable number of the general registers as a fixed value, performing register allocation operation according to the maximum allocable number of the general registers, determining the actual use number of the general registers in the execution process of the machine instruction program, and executing the step 7.
Step 4, performing register allocation operation according to the maximum allocable number of the general registers, determining the actual use number of the general registers in the execution process of the machine instruction program, and executing the step 5.
Step 5, judging whether a register overflow instruction is used in the execution process of the machine instruction program, and if the register overflow instruction is used, executing step 6; if the register overflow instruction is not used, step 7 is performed.
Step 6, determining a first theoretical value of the number of clocks required by the execution of the overflow instruction of the general register and a second theoretical value of the number of clocks required by the execution of the program of the machine instruction; if the ratio of the first theoretical value to the second theoretical value is larger than the preset value, increasing the maximum distributable number of the general register by a fixed step value, returning to execute the step of performing register distribution operation according to the maximum distributable number of the general register and determining the actual using number of the general register in the execution process of the machine instruction program until the ratio of the first theoretical value to the second theoretical value is smaller than the preset value, so that the register overflow instruction is not used in the execution process of the machine instruction program, and executing step 7.
Step 7, setting a status bit for marking whether a local memory statement or a synchronous statement is used in a machine instruction program or whether the information of the workgroup index participates in other operations except the global workgroup index, and executing step 8; when the register repartitioning condition is not met in the execution process of the machine instruction program, the state bit is false; when the machine instruction program meets the register repartitioning condition in the execution process, the state bit is true.
Step 8, judging whether the work groups need to be divided again according to the state bits; if the status bit is true, execute step 9; if the status bit is false, step 10 is performed.
Step 9, determining the thread number range which can be started by a processing unit in the GPU processor according to the actual using number of the general registers and the adjusted maximum distributable number of the general registers; taking any number in the thread number range as the target starting thread number of the processing unit; determining a plurality of first primitives according to the number of target starting threads, and taking any one of the plurality of first primitives as a first target primitive; the first primitive comprises three dimensions, and the product of the numerical values of the three dimensions is equal to the number of target starting threads; determining a plurality of second primitives according to the thread granularity, and taking any one of the plurality of second primitives as a second target primitive; the second primitive comprises three dimensions, and the product of the numerical values of the three dimensions is equal to the thread granularity; determining the numerical product of the corresponding dimensions of the first target primitive and the second target primitive as the target size of the workgroup in different dimensions, and executing the step 10.
Step 10, converting the machine instruction program into a target program through an instruction emission flow, and sending the target program, the state bit and the thread number range which can be started by the processing unit to a driver of the compiler.
Step 11, the driver determines whether the size of the working group needs to be subdivided according to the state bit, and if the state bit is true, step 12 is executed; if the status bit is false, go to step 13.
Step 12, the compiler adjusts the size of the current workgroup to the calculated size of the repartitioned workgroup, and step 15 is executed.
Step 13, the compiler judges whether the processing unit in the GPU processor meets the thread starting condition according to the maximum allocable number of the general registers and the actual using number of the general registers, and if the processing unit does not meet the thread starting condition, the compiler executes step 14; if the thread start condition is satisfied, step 15 is executed.
Step 14, resetting the maximum allocable number of the general registers through the compiler, returning to execute the step of converting the source code into the machine instruction program through the compiler until the processing unit meets the thread starting condition, and executing step 15.
Step 15, requesting the GPU processor to execute the task corresponding to the target program.
In this embodiment, for a specific GPU architecture, the number of general registers can be dynamically allocated for programs in which the workgroup index information participates in operations other than the global workgroup index and which do not use local storage or synchronous statements, which balances the conflict between the number of threads started by the GPU processor and the general register resources and improves the execution efficiency of the hardware. For a specific GPU processor architecture, the workgroup size set by the user can also be reasonably reset for such programs, so that it better suits the current GPU processor architecture. Secondary compilation can be avoided for programs that do not use local storage or synchronous statements.
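The driver-side decision of steps 11 to 15 can be summarized in a short sketch. It assumes a hypothetical `recompile` callback that stands in for steps 14 and 1 (resetting the maximum allocable number and recompiling) and returns new values of m_max and m_total; the `max_recompiles` bound and the error raised when it is exhausted are also illustrative choices.

```python
from typing import Callable, Optional, Tuple

Size = Tuple[int, int, int]


def choose_dispatch_size(
    status_bit: bool,
    current: Size,
    repartitioned: Optional[Size],
    m_max: int,
    m_total: int,
    recompile: Callable[[], Tuple[int, int]],
    granularity: int = 32,
    max_recompiles: int = 8,
) -> Size:
    """Return the workgroup size with which the task is submitted in step 15."""
    if status_bit:                                   # steps 11 and 12
        if repartitioned is None:
            raise ValueError("status bit is true but no repartitioned size was given")
        return repartitioned

    work_items = current[0] * current[1] * current[2]
    attempts = 0
    # Step 13: thread starting condition; step 14: reset m_max and recompile.
    while not (m_max / m_total > work_items / granularity):
        if attempts == max_recompiles:
            raise RuntimeError("current workgroup exceeds the maximum workgroup range")
        m_max, m_total = recompile()
        attempts += 1
    return current


if __name__ == "__main__":
    print(choose_dispatch_size(True, (0x400, 1, 1), (384, 1, 1), 85, 79,
                               recompile=lambda: (85, 79)))
```

When the status bit is true the repartitioned size is used directly and no recompilation occurs, which is the compile-time saving the description emphasizes.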
It should be understood that, although the steps in the flowcharts related to the embodiments as described above are sequentially displayed as indicated by arrows, the steps are not necessarily performed sequentially as indicated by the arrows. The steps are not limited to being performed in the exact order illustrated and, unless explicitly stated herein, may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the execution order of the steps or stages is not necessarily sequential, but may be rotated or alternated with other steps or at least a part of the steps or stages in other steps.
Based on the same inventive concept, an embodiment of the present application further provides a dynamic allocation apparatus for general registers, which implements the dynamic allocation method for general registers described above. The apparatus solves the problem in a manner similar to that described for the method, so for the specific limitations in the one or more apparatus embodiments below, reference may be made to the limitations of the dynamic allocation method above, which are not repeated here.
In one embodiment, as shown in fig. 9, there is provided a dynamic allocation apparatus for general registers, comprising:
a conversion module 100, configured to convert source code into a machine instruction program through a compiler;
an allocation module 200, configured to, if the machine instruction program meets the register repartitioning condition during execution, perform a register allocation operation according to the maximum allocable number of general registers and determine the actual number of general registers used during execution of the machine instruction program;
an adjusting module 300, configured to, if a register overflow instruction is used during execution of the machine instruction program, adjust the maximum allocable number of general registers so that no register overflow instruction is used during execution of the machine instruction program;
a workgroup dividing module 400, configured to reset the size of the workgroup in the compiler according to the actual number of general registers used and the adjusted maximum allocable number of general registers.
In one embodiment, the allocation module 200 is further configured to, if the machine instruction program does not meet the register repartitioning condition during execution, set the maximum allocable number of general registers to a fixed value, perform the register allocation operation according to that maximum allocable number, and determine the actual number of general registers used during execution of the machine instruction program.
In one embodiment, the adjusting module 300 is further configured to determine a first theoretical value, which is the number of clocks required to execute the general register overflow (spill) instructions, and a second theoretical value, which is the number of clocks required to complete execution of the machine instruction program;
and, if the ratio of the first theoretical value to the second theoretical value is greater than a preset value, to increase the maximum allocable number of general registers by a fixed step value and return to the step of performing the register allocation operation according to the maximum allocable number of general registers and determining the actual number of general registers used during execution of the machine instruction program, until the ratio of the first theoretical value to the second theoretical value is less than the preset value.
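The adjustment loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `allocate` is a hypothetical stand-in for the compiler's register allocator that returns the actual register count together with the two theoretical clock counts, `hard_limit` is an assumed upper bound (the text does not specify one) that keeps the loop from running forever, and `toy_allocate` is an invented cost model used only to exercise the loop. The text leaves the case of an exactly equal ratio open; the sketch stops there because the increase is only triggered by a ratio greater than the preset value.

```python
from typing import Callable, Tuple

# allocate(max_allocable) -> (actual_used, spill_clocks, total_clocks)
Allocator = Callable[[int], Tuple[int, float, float]]


def adjust_max_allocable(allocate: Allocator, max_allocable: int, step: int,
                         preset_ratio: float, hard_limit: int) -> Tuple[int, int]:
    """Raise the register limit by a fixed step while spill cost is too high.

    Returns (max_allocable, actual_used) for the last allocation performed.
    `hard_limit` is an assumption of this sketch, not part of the text.
    """
    while True:
        actual_used, spill_clocks, total_clocks = allocate(max_allocable)
        # Ratio of the first theoretical value to the second theoretical value.
        ratio = spill_clocks / total_clocks if total_clocks else 0.0
        if ratio <= preset_ratio:
            return max_allocable, actual_used
        if max_allocable + step > hard_limit:
            # Cannot grow further; give up with the current limit.
            return max_allocable, actual_used
        max_allocable += step


def toy_allocate(limit: int) -> Tuple[int, float, float]:
    """Invented model: the program wants 40 registers; each missing one costs 10 clocks."""
    needed = 40
    spill_clocks = max(0, needed - limit) * 10.0
    return min(limit, needed), spill_clocks, 1000.0 + spill_clocks


if __name__ == "__main__":
    print(adjust_max_allocable(toy_allocate, 16, 8, 0.05, 64))
```

Starting from a deliberately small limit and growing it in fixed steps means the loop settles on the first limit at which spill cost becomes acceptable, which leaves as many registers as possible free for other threads.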
In one embodiment, the workgroup dividing module 400 is further configured to, if it is determined from the maximum allocable number of general registers and the actual number of general registers used that the processing unit in the GPU processor does not satisfy the thread starting condition, reset the maximum allocable number of general registers through the compiler and return to the step of converting the source code into the machine instruction program through the compiler, until the processing unit satisfies the thread starting condition.
The thread starting condition is that the ratio of the maximum allocable number of general registers to the actual number of general registers used is greater than the ratio of the workgroup size to the thread granularity.
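As a rough illustration, the check and the recompile loop can be written as below, using the comparison stated in claim 6 (the register ratio must be greater than the ratio of workgroup size to thread granularity). The names are hypothetical: `compile_program` stands in for a full compiler run that returns the actual register count for a given limit, and `candidate_limits` is an assumed reset policy, since the text does not say how the new limit is chosen.

```python
from typing import Callable, Iterable, Optional, Tuple


def meets_thread_start_condition(max_allocable: int, actual_used: int,
                                 workgroup_size: int, thread_granularity: int) -> bool:
    """True if max_allocable / actual_used > workgroup_size / thread_granularity."""
    if min(max_allocable, actual_used, workgroup_size, thread_granularity) <= 0:
        raise ValueError("all quantities must be positive")
    # Cross-multiplied form of the inequality, exact for positive integers.
    return max_allocable * thread_granularity > workgroup_size * actual_used


def compile_until_startable(compile_program: Callable[[int], int],
                            candidate_limits: Iterable[int],
                            workgroup_size: int,
                            thread_granularity: int) -> Optional[Tuple[int, int]]:
    """Recompile with each candidate limit until the condition holds.

    Returns (max_allocable, actual_used), or None if no candidate works.
    """
    for max_allocable in candidate_limits:
        actual_used = compile_program(max_allocable)  # hypothetical compiler run
        if meets_thread_start_condition(max_allocable, actual_used,
                                        workgroup_size, thread_granularity):
            return max_allocable, actual_used
    return None


if __name__ == "__main__":
    # Invented compiler model: the program never needs more than 40 registers.
    print(compile_until_startable(lambda limit: min(limit, 40),
                                  [32, 64, 128], 64, 32))
```

Cross-multiplying avoids floating-point rounding when the two ratios are close, which matters because the condition is a strict inequality.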
In one embodiment, the workgroup dividing module 400 is further configured to determine the range of thread numbers that the processing unit in the GPU processor can start, according to the actual number of general registers used and the adjusted maximum allocable number of general registers;
taking any number in the thread number range as the target starting thread number of the processing unit;
determining a plurality of first primitives according to the number of target starting threads, and taking any one of the plurality of first primitives as a first target primitive; the first primitive comprises three dimensions, and the product of the numerical values of the three dimensions is equal to the number of target starting threads;
determining a plurality of second primitives according to the thread granularity, and taking any one of the plurality of second primitives as a second target primitive; the second primitive comprises three dimensions, and the product of the values of the three dimensions is equal to the thread granularity;
and determining the numerical product of the corresponding dimensions of the first target primitive and the second target primitive as the target size of the workgroup in different dimensions.
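The workgroup sizing steps above can be sketched as follows, treating each "primitive" as a triple of positive integers. This is only an illustration: the function names are hypothetical, the target starting thread number is taken as an input because the text does not give the formula for the thread number range, and the example values (4 and 32) are invented.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]


def factor_triples(n: int) -> List[Triple]:
    """All ordered triples (x, y, z) of positive integers with x * y * z == n."""
    if n <= 0:
        raise ValueError("n must be positive")
    triples = []
    for x in range(1, n + 1):
        if n % x:
            continue
        rest = n // x
        for y in range(1, rest + 1):
            if rest % y == 0:
                triples.append((x, y, rest // y))
    return triples


def workgroup_dims(first_target: Triple, second_target: Triple) -> Triple:
    """Multiply the two chosen triples dimension by dimension."""
    return (first_target[0] * second_target[0],
            first_target[1] * second_target[1],
            first_target[2] * second_target[2])


if __name__ == "__main__":
    first_candidates = factor_triples(4)    # from the target starting thread number
    second_candidates = factor_triples(32)  # from the thread granularity
    # The text allows any candidate to be chosen; here two are picked by hand.
    print(workgroup_dims((2, 2, 1), (8, 4, 1)))
```

Whichever triples are chosen, the total workgroup size equals the target starting thread number multiplied by the thread granularity, so the choice only affects the shape of the workgroup, not its size.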
The modules in the above dynamic allocation apparatus for general registers may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal and whose internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected by a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a dynamic allocation method for general registers. The display unit of the computer device is used for forming a visible picture and may be a display screen, a projection device, or a virtual reality imaging device. The display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps in the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, databases, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors referred to in the embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of this specification as long as it involves no contradiction.
The above embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method for dynamic allocation of general purpose registers, the method comprising:
converting the source code into a machine instruction program through a compiler;
if the machine instruction program meets the register repartitioning condition during execution, performing a register allocation operation according to the maximum allocable number of general registers, and determining the actual number of general registers used during execution of the machine instruction program;
if a register overflow instruction is used during execution of the machine instruction program, adjusting the maximum allocable number of general registers so that no register overflow instruction is used during execution of the machine instruction program;
and resetting the size of the workgroup in the compiler according to the actual number of general registers used and the adjusted maximum allocable number of general registers.
2. The method of claim 1, further comprising:
and if the machine instruction program does not meet the register repartitioning condition in the execution process, setting the maximum allocable number of the general registers as a fixed value, performing register allocation operation according to the maximum allocable number of the general registers, and determining the actual use number of the general registers in the execution process of the machine instruction program.
3. The method of claim 1 or 2, wherein the register repartitioning condition comprises: the workgroup index information participates in operations other than the global workgroup index, and does not use any condition of the local memory statement or the synchronous statement.
4. The method of claim 1, wherein said adjusting the maximum allocable number of said general purpose registers comprises:
determining a first theoretical value of the number of clocks required to execute the general register overflow instructions and a second theoretical value of the number of clocks required to complete execution of the machine instruction program;
if the ratio of the first theoretical value to the second theoretical value is greater than a preset value, increasing the maximum allocable number of general registers by a fixed step value, and returning to the step of performing the register allocation operation according to the maximum allocable number of general registers and determining the actual number of general registers used during execution of the machine instruction program, until the ratio of the first theoretical value to the second theoretical value is less than the preset value.
5. The method of claim 2, further comprising:
if it is determined, according to the maximum allocable number of general registers and the actual number of general registers used, that the processing unit in the GPU processor does not meet the thread starting condition, resetting the maximum allocable number of general registers through the compiler, and returning to the step of converting the source code into the machine instruction program through the compiler, until the processing unit meets the thread starting condition.
6. The method of claim 5, wherein the thread starting condition comprises: the ratio of the maximum allocable number of general registers to the actual number of general registers used being greater than the ratio of the size of the workgroup to the thread granularity.
7. The method of claim 1, wherein said resetting the size of the workgroup in the compiler according to the actual number of general registers used and the adjusted maximum allocable number of general registers comprises:
determining the range of thread numbers that a processing unit in the GPU processor can start, according to the actual number of general registers used and the adjusted maximum allocable number of general registers;
taking any number in the thread number range as the target starting thread number of the processing unit;
determining a plurality of first primitives according to the number of the target starting threads, and taking any one of the plurality of first primitives as a first target primitive; the first primitive comprises three dimensions, and the product of the numerical values of the three dimensions is equal to the number of the target starting threads;
determining a plurality of second primitives according to the thread granularity, and taking any one of the plurality of second primitives as a second target primitive; the second primitive comprises three dimensions, the product of the values of the three dimensions being equal to the thread granularity;
and determining the products of the values of the first target primitive and the second target primitive in corresponding dimensions as the target sizes of the workgroup in the respective dimensions.
8. An apparatus for dynamic allocation of general purpose registers, the apparatus comprising:
a conversion module, configured to convert source code into a machine instruction program through a compiler;
an allocation module, configured to, if the machine instruction program meets the register repartitioning condition during execution, perform a register allocation operation according to the maximum allocable number of general registers and determine the actual number of general registers used during execution of the machine instruction program;
an adjusting module, configured to, if a register overflow instruction is used during execution of the machine instruction program, adjust the maximum allocable number of general registers so that no register overflow instruction is used during execution of the machine instruction program;
and a workgroup dividing module, configured to reset the size of the workgroup in the compiler according to the actual number of general registers used and the adjusted maximum allocable number of general registers.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202211705663.4A 2022-12-29 2022-12-29 Dynamic allocation method and device for general registers, computer equipment and storage medium Active CN115934102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211705663.4A CN115934102B (en) 2022-12-29 2022-12-29 Dynamic allocation method and device for general registers, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211705663.4A CN115934102B (en) 2022-12-29 2022-12-29 Dynamic allocation method and device for general registers, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115934102A true CN115934102A (en) 2023-04-07
CN115934102B CN115934102B (en) 2023-12-12

Family

ID=86699083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211705663.4A Active CN115934102B (en) 2022-12-29 2022-12-29 Dynamic allocation method and device for general registers, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115934102B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971437A (en) * 2024-03-26 2024-05-03 摩尔线程智能科技(北京)有限责任公司 Task allocation method, circuit, device, medium, and program
CN118113486A (en) * 2024-04-30 2024-05-31 北京麟卓信息科技有限公司 Quick test method for maximum available register number of GPU threads

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1662904A (en) * 2002-06-26 2005-08-31 国际商业机器公司 Digital signal processor with cascaded SIMD organization
CN101014933A (en) * 2004-07-13 2007-08-08 辉达公司 Simulating multiported memories using lower port count memories
CN102640132A (en) * 2009-09-28 2012-08-15 辉达公司 Efficient predicated execution for parallel processors
CN107851004A (en) * 2015-08-17 2018-03-27 高通股份有限公司 For the register spilling management of general register (GPR)
CN110032395A (en) * 2017-11-14 2019-07-19 辉达公司 For improving the unified register file of resource utilization
CN110187977A (en) * 2018-02-23 2019-08-30 英特尔公司 System and method for reducing register block conflict based on software prompt and hardware thread switching
WO2021051022A1 (en) * 2019-09-11 2021-03-18 Redpine Signals Inc. Multi-threaded processor with thread granularity
CN112947998A (en) * 2019-11-26 2021-06-11 Arm有限公司 Register provide opcode instruction
CN114924748A (en) * 2022-05-31 2022-08-19 上海阵量智能科技有限公司 Compiling method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
C.-C. Hsiao et al.: "An Adaptive Thread Scheduling Mechanism With Low-Power Register File for Mobile GPUs", IEEE Transactions on Multimedia, vol. 16, no. 01, pages 60-67, XP011533902, DOI: 10.1109/TMM.2013.2281584 *
邱亚琼 et al.: "Optimization algorithm for spill handling in DSP register allocation based on two classes of registers serving as caches for each other", 《计算机科学》 (Computer Science), vol. 46, no. 06, pages 196-200 *

Also Published As

Publication number Publication date
CN115934102B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
US9354944B2 (en) Mapping processing logic having data-parallel threads across processors
US11243816B2 (en) Program execution on heterogeneous platform
US8145625B2 (en) Methods and systems for optimizing data accesses
CN110008009B (en) Binding constants at runtime to improve resource utilization
US8806513B2 (en) Application programming interfaces for data parallel computing on multiple processors
US9720726B2 (en) Multi-dimensional thread grouping for multiple processors
JP5939524B2 (en) Subbuffer object
US9779469B2 (en) Register spill management for general purpose registers (GPRs)
US9411715B2 (en) System, method, and computer program product for optimizing the management of thread stack memory
CN109154886B (en) Method and apparatus for handling data
KR101609079B1 (en) Instruction culling in graphics processing unit
WO2014190315A1 (en) Graphics processing using dynamic resources
US20140215192A1 (en) Heap data management for limited local memory(llm) multi-core processors
US11500828B1 (en) Method and device for constructing database model with ID-based data indexing-enabled data accessing
CN103996216A (en) Power efficient attribute handling for tessellation and geometry shaders
CN115934102B (en) Dynamic allocation method and device for general registers, computer equipment and storage medium
US10599638B2 (en) System and method for identifying maximal independent sets in parallel
US12020065B2 (en) Hierarchical processor selection
CN103870247A (en) Technique for saving and restoring thread group operating state
JP2019204348A (en) Information processing device, control method therefor, and program
CN111699506B (en) Instruction processing
CN114995908A (en) Method and device for determining multi-core starting configuration, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 200135, 11th Floor, Building 3, No. 889 Bibo Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai

Patentee after: Granfei Intelligent Technology Co.,Ltd.

Country or region after: China

Address before: 200135 Room 201, No. 2557, Jinke Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Patentee before: Gryfield Intelligent Technology Co.,Ltd.

Country or region before: China