CN116578343B - Instruction compiling method and device, graphic processing device, storage medium and terminal equipment - Google Patents



Publication number
CN116578343B
Authority
CN
China
Prior art keywords
instruction
register
registers
virtual
instructions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310840716.1A
Other languages
Chinese (zh)
Other versions
CN116578343A (en)
Inventor
陈林锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Li Computing Technology Shanghai Co ltd
Original Assignee
Li Computing Technology Shanghai Co ltd
Nanjing Lisuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Li Computing Technology Shanghai Co ltd, Nanjing Lisuan Technology Co ltd filed Critical Li Computing Technology Shanghai Co ltd
Priority to CN202310840716.1A priority Critical patent/CN116578343B/en
Publication of CN116578343A publication Critical patent/CN116578343A/en
Application granted granted Critical
Publication of CN116578343B publication Critical patent/CN116578343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides an instruction compiling method and device, a graphic processing device, a storage medium and a terminal device, wherein the instruction compiling method comprises the following steps: generating a first instruction, wherein the first instruction comprises a plurality of copying instructions and a plurality of first operation instructions, the copying instructions indicate a first source virtual register and a first destination virtual register, the first destination virtual register indicated in the copying instructions is a plurality of virtual sub-registers in the same virtual vector register respectively, and addresses of the plurality of virtual sub-registers are continuous; and generating a second instruction according to the first instruction, wherein the second instruction comprises a plurality of first operation instructions, destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in the same physical vector register, and the plurality of physical scalar registers correspond to the plurality of virtual sub-registers. The application can optimize the execution instruction in the graphic processing device, reduce the instruction quantity, save the physical register resource and improve the operation efficiency.

Description

Instruction compiling method and device, graphic processing device, storage medium and terminal equipment
Technical Field
The present application relates to the field of graphics processing technologies, and in particular, to a method and apparatus for compiling instructions, a graphics processing apparatus, a storage medium, and a terminal device.
Background
As three-dimensional games render scenes that are increasingly realistic and visually rich, shader design becomes more complex. Shaders typically use a large number of texture samples and other arithmetic operations to enhance the picture.
In the prior art, graphic data is generally loaded and stored using coordinates: one-dimensional x, two-dimensional xy, or three-dimensional xyz coordinates, depending on whether the image is one-, two-, or three-dimensional. These coordinates are placed in a vector register made up of a plurality of consecutive scalar registers, each of which is a component (x-component, y-component, z-component), or sub-register, of the vector register. Two-dimensional and three-dimensional dot-product operations also require vector registers to store operands. For example, when texture sampling code is generated, different arithmetic logic unit (ALU) instructions are typically generated to compute the component coordinates of the texture coordinates, which are then copied into a vector register by copy instructions (MOV). Because of encoding space constraints, instructions that take vector registers as operands, such as texture sample instructions and image load instructions, typically encode the vector register as the first scalar register (the head address) plus the number of registers. The instruction sequence is as follows:
FMUL R0, R1, R2;
FADD R8, R7, R6;
FADD R10, R11, R12;
MOV R16, R0;
MOV R17, R8;
MOV R18, R10;
SMP R10, R16.xyz,t[0], s[0];
where FMUL, FADD, MOV, and SMP respectively denote multiplication, addition, copying, and texture sampling, and R0, R1, ..., R18 denote registers. t[0] represents a texture unit description register used in texture sampling, and s[0] represents a sample description register.
However, in the prior art, because the component registers of the texture coordinates are defined by different instructions, register allocation assigns them different, non-contiguous scalar registers, such as R0, R8 and R10, and additional copy instructions are required to form the vector register. The conventional compiler optimization of copy propagation cannot eliminate these redundant copy instructions because the registers are required to be contiguous. When a shader contains many texture samples, a large number of copy instructions are introduced, which seriously affects the running efficiency of the shader. The additional copy instructions also require additional registers, so the code executes less efficiently and occupies more register resources.
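The contiguity constraint described above can be illustrated with a minimal Python sketch. It assumes registers are identified by their integer index and that a vector operand is encoded as a head register plus a count, as the background states; the function names `vector_operand` and `is_contiguous` are hypothetical and chosen only for illustration.

```python
def vector_operand(head, count):
    """Scalar registers addressed by a head-plus-count vector operand."""
    return [head + i for i in range(count)]


def is_contiguous(regs):
    """True if the register indices are consecutive and ascending."""
    return all(b - a == 1 for a, b in zip(regs, regs[1:]))


# Destinations of the three coordinate components in the sequence above.
component_regs = [0, 8, 10]  # R0, R8, R10

# R0, R8, R10 cannot be named by a single head-plus-count operand,
# which is why the MOV instructions into R16..R18 are needed.
print(is_contiguous(component_regs))
print(vector_operand(16, 3))
```

With the registers from the background example, the first check fails for R0, R8, R10, while the operand R16.xyz expands to the consecutive registers R16, R17, R18.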
Disclosure of Invention
The application can optimize the execution instruction in the graphic processing device, reduce the number of instructions and improve the operation efficiency.
In order to achieve the above purpose, the present application provides the following technical solutions:
in a first aspect, the present application provides an instruction compiling method, the instruction compiling method including: generating a first instruction, wherein the first instruction comprises a plurality of copy instructions and a plurality of first operation instructions, the copy instructions indicate a first source virtual register and a first destination virtual register, the first destination virtual registers indicated in the copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and addresses of the plurality of virtual sub-registers are continuous; generating a second instruction according to the first instruction, wherein the second instruction comprises the plurality of first operation instructions, destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in the same physical vector register, and the plurality of physical scalar registers correspond to the plurality of virtual sub-registers.
Optionally, the generating the second instruction according to the first instruction includes: updating the first source virtual register into the first destination virtual register in a predefined instruction corresponding to the first source virtual register, and deleting the copy instruction to obtain a third instruction;
and allocating a physical register for the virtual register indicated in the third instruction to obtain the second instruction.
Optionally, the allocating a physical register for the virtual register indicated in the third instruction includes: and allocating a corresponding physical vector register for the virtual vector register indicated in the third instruction, and allocating a corresponding physical scalar register for the virtual sub-register indicated in the third instruction.
Optionally, the updating the first source virtual register to the first destination virtual register includes: traversing the first instruction to determine an instruction pair comprising the copy instruction and the predefined instruction, the predefined instruction of the instruction pair corresponding to the first source virtual register indicated in the copy instruction of the instruction pair; and updating the first source virtual register indicated in the predefined instruction in the instruction pair to the first destination virtual register in the copy instruction in the instruction pair.
Optionally, the predefined instruction is a second operation instruction, where the second operation instruction is used to instruct to perform an operation, and store an operation result in the first source virtual register.
Optionally, the second operation instruction includes a second source virtual register and the first source virtual register, and the second operation instruction is used for indicating to perform an operation on data in the second source virtual register.
Optionally, the first operation instruction is configured to instruct to perform an operation, and store an operation result in the destination register.
Optionally, the second instruction includes a third operation instruction, where the third operation instruction is configured to perform an operation on data in the physical vector register.
In a second aspect, the present application also discloses a graphics processing apparatus, the graphics processing apparatus including: a scheduling executor for receiving the second instruction; and the operation unit is used for executing the second instruction.
In a third aspect, the present application also discloses an instruction compiling apparatus, including: the first instruction generation module is used for generating a first instruction, the first instruction comprises a plurality of copy instructions and a plurality of first operation instructions, the copy instructions indicate a first source virtual register and a first destination virtual register, the first destination virtual registers indicated in the copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and the addresses of the plurality of virtual sub-registers are continuous; and the second instruction generating module is used for generating a second instruction according to the first instruction, the second instruction comprises the plurality of first operation instructions, destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in the same physical vector register, and the plurality of physical scalar registers correspond to the plurality of virtual sub-registers.
In a fourth aspect, the present application provides a terminal device, which is characterized by comprising the graphics processing apparatus according to the second aspect.
In a fifth aspect, the present application provides a computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor performs the steps of the instruction compilation method.
Compared with the prior art, the technical scheme of the application has the following beneficial effects:
In order to optimize instructions, in the technical scheme of the application, a first instruction is generated, the first instruction comprises a plurality of copy instructions and a plurality of first operation instructions, the copy instructions indicate a first source virtual register and a first destination virtual register, the first destination virtual registers indicated in the copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and addresses of the plurality of virtual sub-registers are continuous; and a second instruction is generated according to the first instruction, wherein the second instruction comprises the plurality of first operation instructions, destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in the same physical vector register, and the plurality of physical scalar registers correspond to the plurality of virtual sub-registers. Because the destination registers in the first operation instructions are a plurality of continuous physical scalar registers, they can form a vector register, so no additional copy instructions are needed to copy data between different registers; the number of instructions in the second instruction is reduced, and the optimization of the instructions is realized. In addition, reducing the number of instructions reduces the number of instructions to be executed by the subsequent graphics processing device and improves the computing efficiency of the graphics processing device.
Further, in the predefined instruction corresponding to the first source virtual register, the application updates the first source virtual register to the first destination virtual register and deletes the copy instruction to obtain a third instruction; a physical register is then allocated for each virtual register indicated in the third instruction to obtain the second instruction. In the application, the first destination virtual registers indicated in the plurality of copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and the addresses of the plurality of virtual sub-registers in the virtual vector register are continuous. Therefore, by updating the first source virtual register to the first destination virtual register, the vector register can be formed without copy instructions in the third instruction and the second instruction, and the number of instructions in the second instruction is reduced while still meeting the subsequent task's requirement for contiguous data storage addresses.
Further, by introducing the virtual vector register and the virtual sub-register, the instruction optimization is realized before the physical register is allocated, the physical register is prevented from being occupied, the waste of the physical register is avoided, and the physical register resource is saved.
Drawings
FIG. 1 is a flow chart of an instruction compiling method according to an embodiment of the present application;
FIG. 2 is a flow chart of another method for compiling instructions according to an embodiment of the present application;
FIG. 3 is a block diagram of a graphics processing apparatus according to an embodiment of the present application;
FIG. 4 is a block diagram of an instruction compiling apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another instruction compiling apparatus according to an embodiment of the present application.
Detailed Description
As described in the background, in the prior art, because the component registers of the texture coordinates are defined by different instructions, register allocation assigns them different, non-contiguous scalar registers, such as R0, R8 and R10, and additional copy instructions are necessary to form the vector register. The conventional compiler optimization of copy propagation cannot eliminate these redundant copy instructions because the registers are required to be contiguous. When a shader contains many texture samples, a large number of copy instructions are introduced, which seriously affects the running efficiency of the shader.
In the technical scheme of the application, the destination registers in the plurality of first operation instructions are a plurality of continuous physical scalar registers, so they can form a vector register, and no additional copy instructions are needed to copy data between different registers. This reduces the number of instructions in the second instruction and realizes the optimization of the instructions. In addition, reducing the number of instructions reduces the number of instructions to be executed by the subsequent graphics processing device and improves the computing efficiency of the graphics processing device.
Further, in the present application, the first destination virtual registers indicated in the plurality of copy instructions are a plurality of virtual sub-registers in the same virtual vector register, and the addresses of the plurality of virtual sub-registers in the virtual vector register are continuous. Therefore, by updating the first source virtual register to the first destination virtual register, the vector register can be formed without copy instructions in the third instruction and the second instruction, and the number of instructions in the second instruction is reduced while still meeting the subsequent task's requirement for contiguous data storage addresses.
Before describing embodiments of the present application, some terms related to the embodiments of the present application will be explained.
The physical register in the embodiment of the application refers to a register corresponding to an actual hardware entity.
The virtual register refers to an abstract register which is irrelevant to a hardware platform and is used for storing basic data types without quantity limitation, and has a virtual storage address. In particular, the virtual register may be a number of operands in the source file during compilation. The concept of "variable" will be mapped into a "virtual register" in the process of converting from a higher level intermediate representation (IR: intermediate representation) to a lower level IR. The number of virtual registers is unlimited, and when registers are allocated, virtual registers in an instruction need to be mapped to actual physical registers, and virtual registers can be mapped to a limited number of physical registers.
The scalar register according to the embodiment of the application refers to a register for completing scalar calculation.
The vector register according to the embodiment of the present application refers to a register for special purpose, which is wider than a scalar register, and which is composed of a plurality of scalar registers having consecutive memory addresses.
In order that the above objects, features and advantages of the application will be readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In embodiment 1, referring to fig. 1, the instruction compiling method may be executed by a compiler, that is, the compiler executes each step of the instruction compiling method to generate and send the second instruction. It will of course be appreciated that the instruction compilation method may be performed by any other suitable entity.
Specifically, the instruction compiling method specifically may include the following steps:
Step 101: generating a first instruction, wherein the first instruction comprises a plurality of copy instructions and a plurality of first operation instructions, the copy instructions indicate a first source virtual register and a first destination virtual register, the first destination virtual registers indicated in the copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and addresses of the plurality of virtual sub-registers are continuous.
Step 102: generating a second instruction according to the first instruction, wherein the second instruction comprises the plurality of first operation instructions, the destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in the same physical vector register, and the plurality of physical scalar registers correspond to the plurality of virtual sub-registers.
It will be appreciated that in particular implementations, each of the steps of the method described above may be implemented in a software program running on a processor integrated within a chip or chip module. The method may also be implemented by combining software with hardware, and the application is not limited.
The embodiment of the application can be used for scenes requiring logic operation on data and requiring continuous data storage addresses, such as texture sampling scenes, image loading scenes, image storage scenes and three-dimensional dot product scenes.
In an embodiment of the present application, the first operation instruction may be an arithmetic logic unit (arithmetic and logic unit, ALU) instruction, such as a multiply operation instruction, an add operation instruction, or the like. Specifically, in the texture sampling scene, the first operation instruction further includes a sampling instruction.
Taking a texture sampling scenario as an example, the second instruction is as follows:
FMUL R16, R1, R2;
FADD R17, R7, R6;
FADD R18, R11, R12;
SMP R10, R16.xyz, t[0], s[0];
where the first operation instruction FMUL indicates that the data in the physical scalar registers R1 and R2 is multiplied and the result is stored to the physical scalar register R16; the first of the two FADD instructions indicates that the data in the physical scalar registers R7 and R6 is added and the result is stored in the physical scalar register R17; and the second FADD instruction indicates that the data in the physical scalar registers R11 and R12 is added and the result is stored in the physical scalar register R18. SMP represents a third operation instruction, namely a sampling instruction, which uses the 3 consecutive scalar registers R16-R18 to form vector register V3R16; it indicates that sampling is performed using the data in registers R16-R18, t[0], and s[0], and that the result is stored in physical scalar register R10. t[0] represents a texture unit description register used in texture sampling, and s[0] represents a sample description register.
Taking the image loading scenario as an example, the second instruction is as follows:
FMUL R16, R1, R2;
FADD R17, R7, R6;
FADD R18, R11, R12;
LD R10, R16.xyz, u[0];
where LD represents a third operation instruction, namely an image load instruction, which uses the 3 consecutive scalar registers R16-R18 to form vector register V3R16; it indicates that a load is performed using the data in registers R16-R18 and u[0], and that the result is stored into physical scalar register R10. u[0] represents an image description register used at the time of image loading.
Taking the image storage scenario as an example, the second instruction is as follows:
FMUL R16, R1, R2;
FADD R17, R7, R6;
FADD R18, R11, R12;
ST R10, R16.xyz, u[0];
where ST represents a third operation instruction, namely an image store instruction, which uses the 3 consecutive scalar registers R16-R18 to form vector register V3R16; it indicates that the data in registers R16-R18 and u[0] is assembled and that the result is stored to physical scalar register R10.
Taking a three-dimensional dot product scene as an example, the second instruction is as follows:
FMUL R16, R1, R2;
FADD R17, R7, R6;
FADD R18, R11, R12;
DOT.rpt2 R10, R16.xyz;
where DOT.rpt2 represents a third operation instruction, namely a three-dimensional dot product instruction, which uses the 3 consecutive scalar registers R16-R18 to form vector register V3R16; it indicates that a three-dimensional dot product is performed on the data in registers R16-R18 and that the result is stored to physical scalar register R10, e.g. R10' = R16' × R16' + R17' × R17' + R18' × R18', where R10' represents the data stored in register R10, and R16', R17', and R18' represent the data stored in registers R16, R17, and R18, respectively.
In this embodiment, the destination registers of the first operation instructions FMUL, FADD, and FADD are the physical scalar registers R16, R17 and R18, respectively, and the addresses of the physical scalar registers R16, R17 and R18 are consecutive, so they form one physical vector register. No additional copy instructions are needed in the second instruction to copy data between different registers, so the number of instructions in the second instruction is reduced and the optimization of the instructions is realized.
Embodiment 2 referring to fig. 2, fig. 2 shows a flow of another instruction compiling method. Specifically, the instruction compiling method may include the steps of:
step 201: a first instruction is generated.
Step 202: updating the first source virtual register into a first destination virtual register in a predefined instruction corresponding to the first source virtual register, and deleting a copy instruction to obtain a third instruction;
step 203: and allocating a physical register for the virtual register indicated in the third instruction to obtain a second instruction.
In order to make the destination registers indicated in the plurality of first operation instructions in the second instruction be a plurality of physical scalar registers in the same physical vector register, the compiler may be caused to perform steps 201 to 203 described above.
The physical vector registers, physical scalar registers, virtual vector registers, and virtual sub-registers (which may also be referred to as virtual scalar registers) may be defined prior to generating the second instruction. Specifically, the physical registers are defined as physical vector registers and physical scalar registers, where the physical scalar registers are sub-registers of the physical vector registers. A physical vector register consists of consecutive physical scalar registers; for example, physical vector register V4R4 consists of the 4 physical scalar registers R4, R5, R6, and R7. Similarly, virtual registers are defined as virtual vector registers and virtual sub-registers. A virtual vector register is composed of virtual sub-registers, which are sub-registers of the virtual vector register. Each sub-register of a virtual vector register may be represented by the identification of the virtual vector register together with the identification of the sub-register, e.g., %0:f32v4crf:sub0, where %0 denotes the virtual vector register, f32 denotes 32-bit floating point, v4 denotes that the virtual vector register comprises 4 virtual sub-registers, crf denotes the general purpose register class, and sub0 denotes the sub-register of the virtual vector register.
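The naming convention for physical vector registers can be sketched as follows. This is a minimal illustration that assumes the pattern suggested by the examples V4R4 and V3R16, namely `V<count>R<head>`; the function name `expand_vector_register` is hypothetical.

```python
import re


def expand_vector_register(name):
    """Expand a name such as 'V4R4' into its consecutive scalar registers.

    Assumes the pattern V<count>R<head>, inferred from the examples
    V4R4 and V3R16 in the text.
    """
    match = re.fullmatch(r"V(\d+)R(\d+)", name)
    if match is None:
        raise ValueError(f"not a vector register name: {name!r}")
    count, head = int(match.group(1)), int(match.group(2))
    return [f"R{head + i}" for i in range(count)]


print(expand_vector_register("V4R4"))   # R4, R5, R6, R7
print(expand_vector_register("V3R16"))  # R16, R17, R18
```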
The embodiment of the application can generate the first instruction according to the existing algorithm, but the difference between the first instruction and the instruction generated in the prior art is that the register identifications carried in the first instruction are all identifications of virtual registers. Specifically, the source registers and destination registers indicated in the plurality of copy instructions in the first instruction are virtual registers. Accordingly, the register indicated by the predefined instruction in the first instruction is also a virtual register. The predefined instruction corresponding to the first source virtual register refers to a first operation instruction including the first source register, and the first source register is a destination register in the first operation instruction.
Taking a texture sampling scenario as an example, the first instruction is as follows:
%8:f32crf = FMUL %0:f32crf, %1:f32crf;
%9:f32crf = FADD %2:f32crf, %3:f32crf;
%10:f32crf = FADD %4:f32crf, %5:f32crf;
%11.sub0:f32v3crf = COPY %8:f32crf;
%11.sub1:f32v3crf = COPY %9:f32crf;
%11.sub2:f32v3crf = COPY %10:f32crf;
%12:f32crf = Smp_F32_V3 %11:f32v3crf, 1, 0;
where %0, %1, ..., %5, %8, %9, %10, and %12 are virtual sub-registers, and %11 is a virtual vector register comprising three virtual sub-registers: %11.sub0, %11.sub1, and %11.sub2. %8, %9, and %10 are the first source virtual registers, and %11.sub0, %11.sub1, and %11.sub2 are the first destination virtual registers. FMUL is the predefined instruction for the first source virtual register %8, and the FADD instructions are the predefined instructions for the first source virtual registers %9 and %10. Smp represents a sample instruction.
The first source virtual registers in the predefined instructions FMUL and FADD are updated to the first destination virtual registers, and the copy instructions are deleted. That is, the first source virtual register %8 is updated to the first destination virtual register %11.sub0; the first source virtual register %9 is updated to the first destination virtual register %11.sub1; and the first source virtual register %10 is updated to the first destination virtual register %11.sub2.
The third instruction obtained is as follows:
%11.sub0:f32crf = FMUL %0:f32crf, %1:f32crf;
%11.sub1:f32crf = FADD %2:f32crf, %3:f32crf;
%11.sub2:f32crf = FADD %4:f32crf, %5:f32crf;
%12:f32crf = Smp_F32_V3 %11:f32v3crf, 0, 0;
and running a register allocation algorithm on the third instruction, and allocating a physical register for the virtual register to obtain a second instruction, wherein the second instruction is as follows:
FMUL R16, R1, R2;
FADD R17, R7, R6;
FADD R18, R11, R12;
SMP R10, R16.xyz,t[0], s[0]。
Here, virtual register %0 maps to physical register R1, virtual register %1 maps to physical register R2, virtual register %2 maps to physical register R7, virtual register %3 maps to physical register R6, virtual register %4 maps to physical register R11, virtual register %5 maps to physical register R12, and virtual registers %11.sub0, %11.sub1, and %11.sub2 map to physical registers R16, R17, and R18, respectively.
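The substitution of physical registers for virtual registers can be illustrated with a minimal Python sketch. It is not the allocator of the application: the function name lower and the MAPPING table are hypothetical, the table is simply read off the operands of the second instruction listed above, and only the three arithmetic lines are handled (the sampling line, with its vector operand and immediate operands, is left out).

```python
import re

# Hypothetical table, read off the second instruction listing above.
MAPPING = {
    "%0": "R1", "%1": "R2", "%2": "R7", "%3": "R6",
    "%4": "R11", "%5": "R12",
    "%11.sub0": "R16", "%11.sub1": "R17", "%11.sub2": "R18",
}

THIRD = [
    "%11.sub0:f32crf = FMUL %0:f32crf, %1:f32crf;",
    "%11.sub1:f32crf = FADD %2:f32crf, %3:f32crf;",
    "%11.sub2:f32crf = FADD %4:f32crf, %5:f32crf;",
]


def lower(line, mapping):
    """Rewrite one 'dest = OP src, src;' line using physical register names."""
    line = re.sub(r":\w+", "", line)            # drop type suffixes such as :f32crf
    dest, rhs = [s.strip() for s in line.rstrip(";").split("=")]
    op, *operands = rhs.replace(",", " ").split()
    regs = [mapping[dest]] + [mapping[o] for o in operands]
    return f"{op} {', '.join(regs)};"


lowered = [lower(line, MAPPING) for line in THIRD]
for text in lowered:
    print(text)
```

Because the destination sub-registers %11.sub0 to %11.sub2 already map to R16 to R18, the lowered lines write directly into the registers that the sampling instruction reads.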
It should be noted that the above embodiment takes 32-bit floating-point virtual registers as an example. In practical applications, the virtual registers may be of any type and any bit width, which is not limited in this application.
Specifically, the register allocation algorithm maps virtual registers to actual physical registers. It may be a graph coloring algorithm, a linear scan algorithm based on register lifetimes, a variant of these, and the like, which is not limited in this application.
Further, when physical registers are allocated for the virtual registers, a corresponding physical vector register is allocated for each virtual vector register indicated in the third instruction, and a corresponding physical scalar register is allocated for each virtual sub-register indicated in the third instruction.
For example, physical scalar registers are allocated for virtual sub-registers %0, %1, …, %5, respectively, and physical vector register V3R16, i.e., physical scalar registers R16-R18, is allocated for virtual vector register %11.
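The requirement that a virtual vector register receives a run of consecutive physical scalar registers can be sketched as follows. This is only an illustration under stated assumptions: allocate is a hypothetical name, the strategy is a simple bump allocator with no liveness analysis (so registers are never reused, unlike graph coloring or linear scan), and the resulting register numbers therefore differ from the R16-R18 of the example above. The V<width>R<n> naming imitates V3R16 from the text.

```python
def allocate(scalars, vectors, first_free=0):
    """Assign one physical register per scalar and a contiguous run per vector.

    scalars: list of virtual scalar register names.
    vectors: dict mapping a virtual vector register name to its lane count.
    """
    mapping = {}
    nxt = first_free
    for name in scalars:
        mapping[name] = f"R{nxt}"
        nxt += 1
    for name, width in vectors.items():
        # Lanes sub0..sub(width-1) occupy consecutive physical registers.
        for lane in range(width):
            mapping[f"{name}.sub{lane}"] = f"R{nxt + lane}"
        mapping[name] = f"V{width}R{nxt}"
        nxt += width
    return mapping


m = allocate(["%0", "%1", "%2", "%3", "%4", "%5", "%12"], {"%11": 3})
print(m["%11"], m["%11.sub0"], m["%11.sub1"], m["%11.sub2"])
```

The point of the sketch is only the contiguity: whatever base register is chosen for %11, its three sub-registers land in adjacent physical scalar registers.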
Compared with the instructions in the prior art, the second instruction in the embodiment of the application contains no additional copy instructions (MOV), so the number of instructions is reduced; in the texture sampling example, it falls from 7 to 4.
In one particular embodiment, when the first instruction is updated to the third instruction, the first instruction may be traversed to determine instruction pairs. Each instruction pair includes a copy instruction and a predefined instruction, where the predefined instruction of the pair corresponds to the first source virtual register indicated in the copy instruction of the pair. The first source virtual register indicated in the predefined instruction of the pair is then updated to the first destination virtual register in the copy instruction of the pair.
Taking the first instruction above as an example, there are three instruction pairs: predefined instruction %8:f32crf = FMUL %0:f32crf, %1:f32crf and copy instruction %11.sub0:f32v3crf = COPY %8:f32crf; predefined instruction %9:f32crf = FADD %2:f32crf, %3:f32crf and copy instruction %11.sub1:f32v3crf = COPY %9:f32crf; and predefined instruction %10:f32crf = FADD %4:f32crf, %5:f32crf and copy instruction %11.sub2:f32v3crf = COPY %10:f32crf.
The first source virtual register %8 in the first instruction pair is updated to the first destination virtual register %11.sub0; the first source virtual register %9 in the second instruction pair is updated to the first destination virtual register %11.sub1; and the first source virtual register %10 in the third instruction pair is updated to the first destination virtual register %11.sub2.
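The traversal and update just described can be sketched in Python. This is a minimal illustration, not the implementation of the application: the Instr record and the function coalesce_copies are hypothetical names, and the sketch assumes that every COPY in the input is one of the copy instructions described above and that each copied source register is defined exactly once and has no other use.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    dest: str      # destination virtual register, e.g. "%11.sub0"
    op: str        # opcode, e.g. "FMUL", "FADD", "COPY"
    srcs: tuple    # source operands


def coalesce_copies(first):
    """Fold each COPY into the instruction that defines the COPY's source."""
    # Pair up: copied source register -> destination sub-register.
    rename = {i.srcs[0]: i.dest for i in first if i.op == "COPY"}
    third = []
    for i in first:
        if i.op == "COPY":
            continue  # the copy instruction is deleted
        # A predefined instruction now writes the sub-register directly.
        third.append(replace(i, dest=rename.get(i.dest, i.dest)))
    return third


first = [
    Instr("%8", "FMUL", ("%0", "%1")),
    Instr("%9", "FADD", ("%2", "%3")),
    Instr("%10", "FADD", ("%4", "%5")),
    Instr("%11.sub0", "COPY", ("%8",)),
    Instr("%11.sub1", "COPY", ("%9",)),
    Instr("%11.sub2", "COPY", ("%10",)),
    Instr("%12", "Smp_F32_V3", ("%11", "0", "0")),
]
third = coalesce_copies(first)
for i in third:
    print(i.dest, "=", i.op, ", ".join(i.srcs))
```

A production pass would also have to check that the source register is not read elsewhere before rewriting its definition; that check is omitted here for brevity.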
It should be noted that the first operation instruction, the second operation instruction, and the third operation instruction may be any other executable instructions, which is not limited in this aspect of the present application.
Embodiment 3: Referring to fig. 3, the present application also discloses a graphics processing apparatus 30. The graphics processing apparatus 30 includes a scheduling executor 301 and an operation unit 302. The scheduling executor 301 is configured to receive a second instruction, and the operation unit 302 is configured to execute the second instruction.
In this embodiment, the compiler 40 may execute the steps of the foregoing instruction compiling method to generate the second instruction. The compiler 40 sends the second instruction to the scheduling executor 301 in the graphics processing apparatus 30. The scheduling executor 301 receives and parses the second instruction, and forwards the parsed second instruction to the operation unit 302. The operation unit 302 executes the second instruction to complete the corresponding operation.
Specifically, the operation unit 302 may be a shader, such as a hull shader, a domain shader, etc.
Taking the texture sampling scenario as an example, the operation unit 302 is a shader, which executes the first operation instructions FMUL, FADD, and FADD and the third operation instruction SMP in the second instruction. After the operation unit 302 executes the second instruction, the result of the texture sampling is stored in the register R10.
In the prior art, the operation unit 302 needs to execute 7 instructions, whereas the operation unit in the embodiment of the application only needs to execute 4. When the shader has a relatively large number of coordinates requiring texture sampling, a large number of copy instructions can be eliminated, which improves the operation efficiency of the operation unit 302.
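The 7-versus-4 comparison can be checked with a small Python sketch that executes both instruction sequences on a dictionary standing in for the register file. Several things here are assumptions made for illustration: the prior-art sequence is not listed in this section, so the temporaries R20-R22 and the placement of the MOV instructions are hypothetical; the input values are arbitrary; and the SMP instruction is counted but not modelled.

```python
def run(program, inputs):
    """Execute a list of (op, dst, *src) tuples; return (registers, count)."""
    regs = dict(inputs)
    for op, dst, *src in program:
        if op == "FMUL":
            regs[dst] = regs[src[0]] * regs[src[1]]
        elif op == "FADD":
            regs[dst] = regs[src[0]] + regs[src[1]]
        elif op == "MOV":
            regs[dst] = regs[src[0]]
        elif op == "SMP":
            pass  # texture sampling itself is not modelled in this sketch
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs, len(program)


# Hypothetical prior-art form: compute into temporaries, then copy into R16-R18.
prior_art = [
    ("FMUL", "R20", "R1", "R2"),
    ("FADD", "R21", "R7", "R6"),
    ("FADD", "R22", "R11", "R12"),
    ("MOV", "R16", "R20"),
    ("MOV", "R17", "R21"),
    ("MOV", "R18", "R22"),
    ("SMP", "R10", "R16"),
]
# Second instruction from the text: results are written to R16-R18 directly.
optimized = [
    ("FMUL", "R16", "R1", "R2"),
    ("FADD", "R17", "R7", "R6"),
    ("FADD", "R18", "R11", "R12"),
    ("SMP", "R10", "R16"),
]

inputs = {"R1": 2.0, "R2": 3.0, "R7": 1.0, "R6": 4.0, "R11": 0.5, "R12": 0.25}
regs_prior, n_prior = run(prior_art, inputs)
regs_opt, n_opt = run(optimized, inputs)
print(n_prior, n_opt)
print([regs_opt[r] for r in ("R16", "R17", "R18")])
```

Both sequences leave the same values in R16-R18, which is what the sampling instruction consumes; only the instruction count differs.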
For more specific implementation manners of the embodiments of the present application, please refer to the foregoing embodiments, and the details are not repeated here.
Referring to fig. 4, fig. 4 shows an instruction compiling apparatus. The instruction compiling device includes:
the first instruction generating module 401 is configured to generate a first instruction.
The second instruction generating module 402 is configured to generate a second instruction, where the second instruction includes a plurality of first operation instructions, and destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in a same physical vector register.
According to the embodiment of the application, no additional copy instructions are needed to copy data between different registers, so the number of instructions in the second instruction is reduced and the instructions are optimized. In addition, reducing the number of instructions reduces the number of instructions to be executed by the subsequent graphics processing device and improves the computing efficiency of the graphics processing device.
Further, referring to fig. 5, the second instruction generating module 402 may include:
an updating unit 501, configured to update the first source virtual register to the first destination virtual register in the predefined instruction corresponding to the first source virtual register, and delete the copy instruction to obtain a third instruction;
a register allocation unit 502, configured to allocate a physical register to the virtual register indicated in the third instruction, so as to obtain the second instruction.
According to the embodiment of the application, the first destination virtual registers indicated in the plurality of copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and the addresses of the plurality of virtual sub-registers in the virtual vector register are continuous. Therefore, once each first source virtual register is updated to its first destination virtual register, the vector register can be formed in the third instruction and the second instruction without copy instructions, which reduces the number of instructions in the second instruction while still meeting the subsequent task's requirement for continuous data storage addresses.
Each module/unit included in the apparatuses and products described in the above embodiments may be a software module/unit, a hardware module/unit, or partly a software module/unit and partly a hardware module/unit. For example, for each device or product applied to or integrated in a chip, each module/unit it contains may be implemented in hardware such as a circuit; alternatively, at least some of the modules/units may be implemented as a software program that runs on a processor integrated inside the chip, and the remaining modules/units (if any) may be implemented in hardware such as a circuit. For each device or product applied to or integrated in a chip module, each module/unit it contains may be implemented in hardware such as a circuit, and different modules/units may be located in the same component (such as a chip or a circuit module) or in different components of the chip module; alternatively, at least some of the modules/units may be implemented as a software program that runs on a processor integrated in the chip module, and the remaining modules/units (if any) may be implemented in hardware such as a circuit. For each device or product applied to or integrated in a terminal device, each module/unit it contains may be implemented in hardware such as a circuit, and different modules/units may be located in the same component (such as a chip or a circuit module) or in different components of the terminal device; alternatively, at least some of the modules/units may be implemented as a software program that runs on a processor integrated in the terminal device, and the remaining modules/units (if any) may be implemented in hardware such as a circuit.
The embodiment of the application also discloses a storage medium, which is a computer-readable storage medium storing a computer program; when run, the computer program can execute the steps of the method shown in fig. 1. The storage medium may include read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like. The storage medium may also include non-volatile memory or non-transitory memory, and the like.
The embodiment of the application also discloses a terminal device, which comprises the graphics processing device; alternatively, the terminal device comprises a memory and a processor, the memory storing a computer program executable on the processor, the processor executing the steps of the instruction compiling method described above when the computer program is executed.
The term "plurality" as used in the embodiments of the present application means two or more.
The terms "first", "second", and the like in the embodiments of the present application are used only to illustrate and distinguish the described objects. They imply no order, do not limit the number of devices in the embodiments of the present application, and should not be construed as limiting the embodiments of the present application in any way.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus and system may be implemented in other manners. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only one logic function division, and other division modes can be adopted in actual implementation; for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may be physically included separately, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform part of the steps of the method according to the embodiments of the present application.
Although the present application is disclosed above, the present application is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the application, and the scope of protection of the application shall therefore be defined by the appended claims.

Claims (11)

1. A method of compiling instructions, comprising:
generating a first instruction, wherein the first instruction comprises a plurality of copy instructions and a plurality of first operation instructions, the copy instructions indicate a first source virtual register and a first destination virtual register, the first destination virtual registers indicated in the copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and addresses of the plurality of virtual sub-registers are continuous;
generating a second instruction according to the first instruction, wherein the second instruction comprises the plurality of first operation instructions, destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in the same physical vector register, and the plurality of physical scalar registers correspond to the plurality of virtual sub-registers;
wherein generating a second instruction according to the first instruction comprises:
updating the first source virtual register into the first destination virtual register in a predefined instruction corresponding to the first source virtual register, and deleting the copy instruction to obtain a third instruction;
and allocating a physical register for the virtual register indicated in the third instruction to obtain the second instruction.
2. The method of claim 1, wherein said allocating physical registers for virtual registers indicated in said third instruction comprises:
and allocating a corresponding physical vector register for the virtual vector register indicated in the third instruction, and allocating a corresponding physical scalar register for the virtual sub-register indicated in the third instruction.
3. The method of claim 1, wherein the updating the first source virtual register to the first destination virtual register comprises:
traversing the first instruction to determine an instruction pair comprising the copy instruction and the predefined instruction, the predefined instruction of the instruction pair corresponding to a first source virtual register indicated in the copy instruction of the instruction pair;
updating a first source virtual register indicated in a predefined instruction in the instruction pair to a first destination virtual register in a copy instruction in the instruction pair.
4. The instruction compilation method according to claim 1, wherein the predefined instruction is a second operation instruction, the second operation instruction being configured to instruct execution of an operation and store an operation result in the first source virtual register.
5. The instruction compilation method of claim 4 wherein the second operation instruction includes a second source virtual register and the first source virtual register, the second operation instruction to instruct performing an operation on data in the second source virtual register.
6. The instruction compiling method according to claim 1, wherein the first operation instruction is for instructing execution of an operation and storing an operation result in the destination register.
7. The instruction compilation method of claim 1 wherein the second instruction includes a third operation instruction for performing operations on data in the physical vector registers.
8. A graphics processing apparatus based on the instruction compiling method according to any one of claims 1 to 7, comprising:
a scheduling executor for receiving the second instruction;
and the operation unit is used for executing the second instruction.
9. An instruction compiling apparatus, comprising:
the first instruction generation module is used for generating a first instruction, the first instruction comprises a plurality of copy instructions and a plurality of first operation instructions, the copy instructions indicate a first source virtual register and a first destination virtual register, the first destination virtual registers indicated in the copy instructions are respectively a plurality of virtual sub-registers in the same virtual vector register, and the addresses of the plurality of virtual sub-registers are continuous;
a second instruction generating module, configured to generate a second instruction according to the first instruction, where the second instruction includes the plurality of first operation instructions, destination registers indicated in the plurality of first operation instructions are a plurality of physical scalar registers in a same physical vector register, and the plurality of physical scalar registers correspond to the plurality of virtual sub-registers;
the second instruction generating module updates the first source virtual register into the first destination virtual register in a predefined instruction corresponding to the first source virtual register, and deletes the copy instruction to obtain a third instruction; and the second instruction generating module allocates a physical register for the virtual register indicated in the third instruction to obtain the second instruction.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor performs the steps of the instruction compiling method of any one of claims 1 to 7.
11. A terminal device comprising a memory and a processor, said memory having stored thereon a computer program, characterized in that said processor performs the steps of the instruction compiling method according to any one of claims 1 to 7 or comprises the graphics processing apparatus according to claim 8.
CN202310840716.1A 2023-07-10 2023-07-10 Instruction compiling method and device, graphic processing device, storage medium and terminal equipment Active CN116578343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310840716.1A CN116578343B (en) 2023-07-10 2023-07-10 Instruction compiling method and device, graphic processing device, storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310840716.1A CN116578343B (en) 2023-07-10 2023-07-10 Instruction compiling method and device, graphic processing device, storage medium and terminal equipment

Publications (2)

Publication Number Publication Date
CN116578343A (en) 2023-08-11
CN116578343B (en) 2023-11-21

Family

ID=87536187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310840716.1A Active CN116578343B (en) 2023-07-10 2023-07-10 Instruction compiling method and device, graphic processing device, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN116578343B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10908911B2 (en) * 2017-08-18 2021-02-02 International Business Machines Corporation Predicting and storing a predicted target address in a plurality of selected locations

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5909572A (en) * 1996-12-02 1999-06-01 Compaq Computer Corp. System and method for conditionally moving an operand from a source register to a destination register
CN102662720A (en) * 2012-03-12 2012-09-12 天津国芯科技有限公司 Optimization method of compiler of multi-issue embedded processor
CN104583957A (en) * 2012-06-15 2015-04-29 索夫特机械公司 Reordered speculative instruction sequences with a disambiguation-free out of order load store queue
CN103793201A (en) * 2012-10-30 2014-05-14 英特尔公司 Instruction and logic to provide vector compress and rotate functionality
CN103810111A (en) * 2012-11-08 2014-05-21 国际商业机器公司 Address Generation In An Active Memory Device
CN109032672A (en) * 2018-07-19 2018-12-18 江苏华存电子科技有限公司 Low latency instruction scheduler and filtering conjecture access method
CN113485716A (en) * 2021-09-03 2021-10-08 支付宝(杭州)信息技术有限公司 Program compiling method and device for preventing memory boundary crossing
CN116302099A (en) * 2022-12-23 2023-06-23 海光信息技术股份有限公司 Method, processor, device, medium for loading data into vector registers
CN116302103A (en) * 2023-05-18 2023-06-23 南京砺算科技有限公司 Instruction compiling method and device, graphic processing unit and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IGC:The Open Source Intel Graphics Compiler;Anupama Chandrasekhar等;《IEEE》;254-265 *
高速数据采集系统中的PCI中断处理机制和DMA编程;王孝国 等;《军事通信技术》;第25卷(第1期);44-47 *

Also Published As

Publication number Publication date
CN116578343A (en) 2023-08-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240521

Address after: 201207 Pudong New Area, Shanghai, China (Shanghai) free trade trial area, No. 3, 1 1, Fang Chun road.

Patentee after: Li Computing Technology (Shanghai) Co.,Ltd.

Country or region after: China

Address before: Room 2794, Hatching Building, No. 99 Tuanjie Road, Nanjing Area, Nanjing (Jiangsu) Pilot Free Trade Zone, Jiangsu Province, 210031

Patentee before: Nanjing Lisuan Technology Co.,Ltd.

Country or region before: China

Patentee before: Li Computing Technology (Shanghai) Co.,Ltd.