WO2022141484A1 - GPU, SPPU and task processing method - Google Patents

GPU, SPPU and task processing method

Info

Publication number
WO2022141484A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
calculation
processing unit
processor
sppu
Prior art date
Application number
PCT/CN2020/142342
Other languages
English (en)
French (fr)
Inventor
肖潇
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2020/142342 priority Critical patent/WO2022141484A1/zh
Priority to CN202080108253.4A priority patent/CN116964661A/zh
Publication of WO2022141484A1 publication Critical patent/WO2022141484A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators

Definitions

  • the embodiments of the present application relate to processor technologies, and in particular, to a GPU, SPPU, and a task processing method.
  • a graphics processing unit is a programmable processor that performs functions such as parallel computing and graphics processing by executing a programmable shader (Shader) program.
  • different from the central processing unit (CPU), the GPU adopts a parallel computing architecture based on single instruction multiple thread (SIMT): when executing a Shader program, threads in the same thread cluster execute the same Shader instruction in the same time cycle.
  • in Shader programs, many typical scenarios involve calculations between constants. For example, a constant B is calculated from a constant A; B can be pre-calculated before the Shader program is executed, and then all Shader programs start the calculation with B as the starting point.
  • this process is called constant folding (CF), and CF can bring huge performance and power-consumption gains to GPUs. Therefore, improving the performance of CF processing is the key to improving the GPU's performance and power-consumption benefits.
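  • to make the idea concrete, here is a minimal C sketch of CF, assuming a hypothetical shader loop; the names (shader_body, N_THREADS) are illustrative only and are not defined by the patent:

```c
/* Minimal sketch of constant folding (CF): compute B = log(A) once,
 * before any "shader thread" runs, instead of once per thread. */
#include <math.h>
#include <stdio.h>

#define N_THREADS 1024

/* Without CF, every thread would recompute log(A) itself. */
static float shader_body(float b, int tid) {
    return b * (float)tid;  /* per-thread work starting from B */
}

int main(void) {
    const float A = 2.718281828f;
    const float B = logf(A);  /* CF: pre-calculated once */
    float acc = 0.0f;
    for (int tid = 0; tid < N_THREADS; ++tid)
        acc += shader_body(B, tid);  /* all threads start from B */
    printf("checksum: %f\n", acc);
    return 0;
}
```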
  • Embodiments of the present application provide a GPU, an SPPU, and a task processing method, so as to improve the performance of the GPU.
  • in a first aspect, the present application provides a graphics processing unit (GPU), comprising: a first processor and a second processor; wherein the first processor and the second processor are connected; the first processor is configured to process constant folding CF tasks; the second processor is configured to process non-CF tasks.
  • the first processor is configured to process CF tasks and the second processor is configured to process other tasks than CF tasks.
  • the CF task in the Shader program is completed by the coprocessor (the first processor), so that the CF task can be released from the main processor (the second processor) of the GPU; this obviously reduces the load of the second processor, enabling the second processor to process other tasks with more ample computing resources, thereby improving the performance of the GPU.
  • both the first processor and the second processor may be any general-purpose processor, for example, a micro control unit (micro control unit, MCU).
  • for example, when the above GPU is used as a system-on-chip (SoC) GPU in a mobile device, where power-consumption requirements are strict, the first processor can be a low-performance MCU; on the premise of meeting the mobile device's low-power requirements, this reduces the load of the second processor and improves the performance of the GPU.
  • as another example, when the above GPU is used in the GPU of another electronic device whose power-consumption requirements are less demanding, the first processor can be a high-performance microprocessor, which, compared with a low-performance MCU, can further improve the processing efficiency of CF tasks.
  • in a possible implementation, the GPU further includes a GPU task controller connected to the first processor and the second processor respectively.
  • the GPU task controller is configured to send the task to be processed to the first processor or the second processor; the first processor is configured to process CF tasks, and the second processor is configured to process other tasks. That is, after receiving the to-be-processed task, the GPU task controller may first determine whether it is a CF task or a non-CF task. When determining that the task is a CF task, the GPU task controller may send it to the first processor for processing; when determining that the task is not a CF task, the GPU task controller may send it to the second processor for processing.
  • the above-mentioned GPU task controller may be a general-purpose programmable processor, a dedicated control circuit for GPU hardware, or the like.
  • through the GPU task controller, any task to be processed can be analyzed and pre-judged. This judgment distinguishes each task at the very beginning of the Shader program and hands it over to the corresponding processor for processing, without increasing the scheduling complexity of the processor; moreover, even if there are many CF tasks to be processed, they are handled by the independent first processor, so the load of the second processor is not increased.
  • the first processor is a Shader preprocessing unit SPPU; the SPPU includes a Shader preprocessing unit controller SPPU_CTRL and a programmable processing unit connected to each other.
  • the above programmable processing unit may be an MCU or a digital signal processor (DSP). It should be noted that the programmable processing unit may also be any other programmable processing unit, which is not specifically limited in this embodiment of the present application.
  • the SPPU is located at the front end of the entire GPU pipeline and can support all Shader types in any GPU, for example, vertex shaders (VS, Vertex Shader), fragment shaders (FS, Fragment Shader), and general-purpose compute shaders (CS, Compute Shader).
  • SPPU_CTRL is configured to receive a constant folding CF task, and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform calculation corresponding to the CF task according to the first calculation instruction.
  • for example, the CF task is A = logB, where A and B are both constants. After SPPU_CTRL receives the CF task, it notifies the programmable processing unit of the CF task, and the programmable processing unit performs the log calculation.
  • the programmable processing unit in the embodiment of the present application can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, etc., and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals.
  • optionally, the above special function calculations can be completed by software algorithms.
  • for example, these special functions can be stored in memory in the form of tables; combined with the calculation instruction, the programmable processing unit can complete the corresponding calculation by looking up the table, without needing to design a dedicated hardware computing-acceleration co-processing unit for these functions.
  • the SPPU further includes a special function unit SFU connected with the programmable processing unit.
  • the SFU is used as a coprocessor unit of the programmable processing unit to complete the hardware acceleration of the above-mentioned special function calculation.
  • SPPU_CTRL is configured to receive a CF task and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform a first calculation corresponding to the CF task according to the first calculation instruction; when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction.
  • the programmable processing unit can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations.
  • if, while processing the first calculation based on the first calculation instruction, the programmable processing unit detects that the second calculation is to be performed, it can send the second calculation instruction to the SFU and hand the second calculation over to the SFU for processing; if it does not detect that the second calculation is to be performed, the programmable processing unit completes the first calculation itself.
  • the above-mentioned special function calculation completed by a software algorithm is replaced by a hardware (SFU) implementation, which can improve the calculation speed.
  • the SPPU further includes a TCM and a DMA; wherein the TCM is connected to the programmable processing unit; the DMA is connected to the TCM and the SPPU_CTRL respectively.
  • the TCM is used as the memory in the SPPU to store data and instructions
  • the DMA is responsible for all data accesses between the SPPU and the GPU's memory.
  • SPPU_CTRL is configured to receive the CF task, obtain the indication information corresponding to the CF task, and send the indication information to the DMA;
  • the DMA is configured to obtain the data corresponding to the indication information and store it in the TCM;
  • SPPU_CTRL is configured to send a first calculation instruction to the programmable processing unit; the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task;
  • when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU;
  • the SFU is configured to perform the second calculation according to the second calculation instruction and send the calculation result of the second calculation to the programmable processing unit;
  • the programmable processing unit is configured to send the calculation result of the first calculation and the calculation result of the second calculation to the TCM.
  • the GPU may obtain information required for processing tasks from a memory
  • the memory may be any type of memory, for example, double data rate (DDR) synchronous dynamic random access memory (SDRAM), referred to as DDR for short.
  • multiple buffers can be set up in the DDR, including a buffer for storing descriptors of pending tasks, an instruction buffer for storing instructions, and a constant buffer for storing constants.
  • a buffer area for storing other information may also be set in the DDR, which is not specifically limited in this embodiment of the present application.
  • the above embodiment illustrates the connection relationship between the GPU and the DDR only as an example; this structure is not a limitation, and in other embodiments the GPU can also be connected to the DDR in the same way.
  • the present application provides a Shader preprocessing unit SPPU, including: a Shader preprocessing unit controller SPPU_CTRL and a programmable processing unit connected to each other.
  • the SPPU is located at the forefront of the entire GPU pipeline, and can support all Shader types in any GPU.
  • SPPU_CTRL is configured to receive the constant folding CF task, and send a first calculation instruction to the MCU according to the CF task; the MCU is configured to perform calculation corresponding to the CF task according to the first calculation instruction.
  • for example, the CF task is A = logB, where A and B are both constants. After SPPU_CTRL receives the CF task, it notifies the MCU of the CF task, and the MCU performs the log calculation.
  • the MCU can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, etc., and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals.
  • the special function calculations can be completed by software algorithms. For example, these special functions can be stored in memory in the form of tables; combined with the calculation instructions, the MCU can complete the corresponding calculation by looking up the table.
  • in a possible implementation, the SPPU further includes a special function unit SFU connected to the programmable processing unit.
  • the SFU is used as a co-processor unit of the DSP to complete the hardware acceleration of the above-mentioned special function calculation.
  • SPPU_CTRL is configured to receive the CF task and send a first calculation instruction to the DSP according to the CF task;
  • the DSP is configured to perform the first calculation corresponding to the CF task according to the first calculation instruction; when the DSP detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU;
  • the SFU is configured to perform the second calculation according to the second calculation instruction.
  • the DSP can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations.
  • if, while processing the first calculation based on the first calculation instruction, the DSP detects that the second calculation is to be performed, it can send the second calculation instruction to the SFU and hand the second calculation over to the SFU for processing; if it does not detect that the second calculation is to be performed, the DSP completes the first calculation itself.
  • the special function calculation completed by the software algorithm in the above-mentioned embodiment is replaced by the hardware (SFU) implementation, which can improve the calculation speed.
  • in a possible implementation, the SPPU further includes a TCM and a DMA; the TCM is connected to the programmable processing unit, and the DMA is connected to the TCM and the SPPU_CTRL respectively.
  • the TCM is used as the memory in the SPPU to store data and instructions
  • the DMA is responsible for all data accesses between the SPPU and the GPU's memory.
  • SPPU_CTRL is configured to receive the CF task, obtain the indication information corresponding to the CF task, and send the indication information to the DMA;
  • the DMA is configured to obtain the data corresponding to the indication information and store it in the TCM;
  • SPPU_CTRL is configured to send a first calculation instruction to the programmable processing unit; the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task;
  • when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU;
  • the SFU is configured to perform the second calculation according to the second calculation instruction and send the calculation result of the second calculation to the programmable processing unit;
  • the programmable processing unit is configured to send the calculation result of the first calculation and the calculation result of the second calculation to the TCM.
  • in a third aspect, the present application provides a task processing method, comprising: receiving a task to be processed; when the task to be processed is a constant folding CF task, controlling a first processor to process the task; and when the task to be processed is not a CF task, controlling a second processor to process the task.
  • the first processor may be the SPPU in the above-mentioned embodiments, or a general-purpose processor, which is used as a co-processor of the GPU;
  • the second processor may be an MCU or other programmable processing unit, which is used as a GPU main processor.
  • the first processor is configured to process CF tasks and the second processor is configured to process other tasks than CF tasks.
  • the CF task in the Shader program is completed by the coprocessor (the first processor), so that the CF task can be released from the main processor (the second processor) of the GPU; this obviously reduces the load of the second processor, enabling the second processor to process other tasks with more ample computing resources, thereby improving the performance of the GPU.
  • the GPU can analyze any task to be processed and make a pre-judgment. The judgment can distinguish each task at the very beginning of the Shader program and hand it over to the corresponding processor for processing. It does not increase the scheduling complexity of the processor, and even if there are many CF tasks to be processed, they are handled by an independent first processor, and the load of the second processor is not increased.
  • FIG. 1 is an exemplary schematic diagram of CF processing in the related art
  • FIG. 2 is an exemplary structural diagram of a GPU according to an embodiment of the present application.
  • FIG. 3 is an exemplary structural diagram of a GPU according to an embodiment of the present application.
  • FIG. 4 is an exemplary structural diagram of a GPU according to an embodiment of the present application.
  • FIG. 5 is an exemplary structural diagram of a GPU according to an embodiment of the present application.
  • FIG. 6 is an exemplary structural diagram of a GPU according to an embodiment of the present application.
  • FIG. 7 is an exemplary structural diagram of an SPPU according to an embodiment of the present application.
  • FIG. 8 is an exemplary structural diagram of an SPPU according to an embodiment of the present application.
  • FIG. 9 is an exemplary structural diagram of an SPPU according to an embodiment of the present application.
  • FIG. 10 is an exemplary flowchart of a task processing method according to an embodiment of the present application.
  • "at least one (item)" refers to one or more, and "a plurality of" refers to two or more.
  • "and/or" is used to describe the relationship between associated objects and indicates that three relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character "/" generally indicates that the associated objects are in an "or" relationship.
  • "at least one of the following item(s)" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items.
  • for example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
  • FIG. 1 is an exemplary schematic diagram of CF processing in the related art.
  • before the Shader programs are executed, an additional process is embedded: a preprocessing Shader program is started, the CF processing is completed by this preprocessing Shader program, and then all Shader programs start calculating with the CF processing result as the starting point. This can save threads (Thread) in the processor and bring huge performance and power-consumption benefits to the processor.
  • however, although the CF processing is completed by a separate preprocessing Shader program, all the Shader programs still jointly consume the GPU's computing and storage resources, so the goal of reducing the processor's computing load is not achieved.
  • moreover, if there are very many Shader programs, each of which needs to be preceded by a preprocessing Shader program to handle its corresponding CF task, the number of preprocessing Shader programs will also be very large; this not only fails to improve processor performance but instead increases the processor's computing load and reduces GPU performance.
  • the additional preprocessing Shader program will also increase the control complexity of the processor, resulting in complicated scheduling methods and reducing the performance of the processor.
  • an embodiment of the present application provides a GPU, which can reduce the load of the GPU and improve the performance of the GPU without increasing the scheduling complexity.
  • the above-mentioned GPU can be integrated on any electronic device and used as a processor of a system on chip (SOC), and the electronic device can be, for example, a mobile phone, a vehicle-mounted terminal, a computer, and the like.
  • FIG. 2 is an exemplary structural diagram of a GPU according to an embodiment of the present application. As shown in FIG. 2 , the GPU includes a first processor and a second processor, wherein the first processor and the second processor are connected.
  • the first processor is configured to process CF tasks and the second processor is configured to process other tasks than CF tasks.
  • the CF task in the Shader program is completed by the coprocessor (the first processor), so that the CF task can be released from the main processor (the second processor) of the GPU; this obviously reduces the load of the second processor, enabling the second processor to process other tasks with more ample computing resources, thereby improving the performance of the GPU.
  • both the first processor and the second processor may be any general-purpose processor, for example, a micro control unit (micro control unit, MCU).
  • for example, when the above GPU is used as a system-on-chip (SoC) GPU in a mobile device, where power-consumption requirements are strict, the first processor can be a low-performance MCU; on the premise of meeting the mobile device's low-power requirements, this reduces the load of the second processor and improves the performance of the GPU.
  • as another example, when the above GPU is used in the GPU of another electronic device whose power-consumption requirements are less demanding, the first processor can be a high-performance microprocessor, which, compared with a low-performance MCU, can further improve the processing efficiency of CF tasks.
  • FIG. 3 is an exemplary structural diagram of a GPU according to an embodiment of the present application. As shown in FIG. 3 , on the basis of the structure of the GPU shown in FIG. 2 , the GPU further includes: Connected GPU task controller.
  • the GPU task controller is configured to send the task to be processed to the first processor or the second processor; the first processor is configured to process CF tasks, and the second processor is configured to process other tasks. That is, after receiving the to-be-processed task, the GPU task controller may first determine whether it is a CF task or a non-CF task. When determining that the task is a CF task, the GPU task controller may send it to the first processor for processing; when determining that the task is not a CF task, the GPU task controller may send it to the second processor for processing.
  • the above-mentioned GPU task controller may be a general-purpose programmable processor, a dedicated control circuit for GPU hardware, or the like.
  • through the GPU task controller, any task to be processed can be analyzed and pre-judged. This judgment distinguishes each task at the very beginning of the Shader program and hands it over to the corresponding processor for processing, without increasing the scheduling complexity of the processor; moreover, even if there are many CF tasks to be processed, they are handled by the independent first processor, so the load of the second processor is not increased.
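  • as a rough illustration of the dispatch just described, the following C sketch models the GPU task controller's pre-judgment; the task structure, the is_cf flag, and the two dispatch targets are names invented for illustration, not interfaces defined by the patent:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical task descriptor; the patent does not define its fields. */
typedef struct {
    int  id;
    bool is_cf;  /* pre-judged at the very start of the Shader program */
} Task;

static void sppu_process(const Task *t)      { printf("SPPU (first processor): CF task %d\n", t->id); }
static void main_proc_process(const Task *t) { printf("second processor: non-CF task %d\n", t->id); }

/* GPU task controller: route each task without extra scheduling state. */
static void task_controller_dispatch(const Task *t) {
    if (t->is_cf)
        sppu_process(t);       /* CF tasks go to the independent first processor */
    else
        main_proc_process(t);  /* everything else stays on the main processor */
}

int main(void) {
    Task tasks[] = { {0, true}, {1, false}, {2, true} };
    for (int i = 0; i < 3; ++i)
        task_controller_dispatch(&tasks[i]);
    return 0;
}
```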
  • FIG. 4 is an exemplary structural diagram of a GPU according to an embodiment of the present application.
  • the GPU includes a GPU task controller, a first processor, and a second processor, where the first processor and the second processor are respectively connected to the GPU task controller;
  • the first processor is a Shader preprocessor unit (SPPU);
  • the SPPU includes an interconnected Shader preprocessor unit controller (SPPU_CTRL) and a programmable processing unit.
  • the above programmable processing unit may be an MCU or a digital signal processor (DSP). It should be noted that the programmable processing unit may also be any other programmable processing unit, which is not specifically limited in this embodiment of the present application.
  • the SPPU is located at the front end of the entire GPU pipeline and can support all Shader types in any GPU, for example, vertex shaders (VS, Vertex Shader), fragment shaders (FS, Fragment Shader), and general-purpose compute shaders (CS, Compute Shader).
  • SPPU_CTRL is configured to receive a constant folding CF task, and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform calculation corresponding to the CF task according to the first calculation instruction.
  • for example, the CF task is A = logB, where A and B are both constants. After SPPU_CTRL receives the CF task, it notifies the programmable processing unit of the CF task, and the programmable processing unit performs the log calculation.
  • the programmable processing unit in the embodiment of the present application can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, etc., and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals.
  • the above-mentioned special function calculation can be completed by software algorithm.
  • for example, these special functions can be stored in memory in the form of tables; combined with the calculation instruction, the programmable processing unit can complete the corresponding calculation by looking up the table, without needing to design a dedicated hardware computing-acceleration co-processing unit for these functions.
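  • the table-driven approach can be pictured with the minimal sketch below, which assumes a log2 table with linear interpolation over [1, 2); the table size and precision are illustrative choices, not the patent's actual layout:

```c
#include <stdio.h>
#include <math.h>

#define TABLE_BITS 8
#define TABLE_SIZE (1 << TABLE_BITS)

/* Table of log2(1 + i/TABLE_SIZE), "fixed" in memory; filled at startup
 * here for brevity, but it could equally be a compile-time constant. */
static float log2_table[TABLE_SIZE + 1];

static void init_table(void) {
    for (int i = 0; i <= TABLE_SIZE; ++i)
        log2_table[i] = log2f(1.0f + (float)i / TABLE_SIZE);
}

/* Software special-function evaluation by table lookup + interpolation,
 * for x in [1, 2); a full version would first reduce the range. */
static float table_log2(float x) {
    float pos  = (x - 1.0f) * TABLE_SIZE;
    int   i    = (int)pos;
    float frac = pos - (float)i;
    return log2_table[i] + frac * (log2_table[i + 1] - log2_table[i]);
}

int main(void) {
    init_table();
    printf("table_log2(1.5) = %f (libm: %f)\n", table_log2(1.5f), log2f(1.5f));
    return 0;
}
```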
  • FIG. 5 is an exemplary structural diagram of a GPU according to an embodiment of the application.
  • the SPPU further includes: a special function unit (SFU) connected to the programmable processing unit.
  • the SFU is used as a coprocessor unit of the programmable processing unit to complete the hardware acceleration of the above-mentioned special function calculation.
  • SPPU_CTRL is configured to receive a CF task and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform a first calculation corresponding to the CF task according to the first calculation instruction; when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction.
  • the programmable processing unit can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations.
  • if, while processing the first calculation based on the first calculation instruction, the programmable processing unit detects that the second calculation is to be performed, it can send the second calculation instruction to the SFU and hand the second calculation over to the SFU for processing; if it does not detect that the second calculation is to be performed, the programmable processing unit completes the first calculation itself.
  • the special function calculation completed by the software algorithm in the embodiment shown in FIG. 4 is replaced by the hardware (SFU) implementation, which can improve the calculation speed.
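  • a rough sketch of this first/second-calculation split might look like the following; the instruction encoding, the sfu_execute call, and the detection test are all invented for illustration and are not the patent's actual interfaces:

```c
#include <stdio.h>
#include <math.h>

/* Hypothetical instruction kinds: "first calculations" run on the
 * programmable processing unit, "second calculations" are offloaded. */
typedef enum { OP_ADD, OP_MUL, OP_LOG, OP_RSQRT } OpCode;

typedef struct { OpCode op; float a, b; } Instr;

/* SFU coprocessor: hardware-accelerated special functions (modeled in C). */
static float sfu_execute(const Instr *in) {
    switch (in->op) {
    case OP_LOG:   return logf(in->a);
    case OP_RSQRT: return 1.0f / sqrtf(in->a);
    default:       return 0.0f;  /* not an SFU operation */
    }
}

/* Programmable processing unit: runs the first calculation and, when it
 * detects a second (special-function) calculation, hands it to the SFU. */
static float ppu_execute(const Instr *in) {
    switch (in->op) {
    case OP_ADD: return in->a + in->b;
    case OP_MUL: return in->a * in->b;
    default:     return sfu_execute(in);  /* offload the second calculation */
    }
}

int main(void) {
    Instr prog[] = { {OP_MUL, 3.0f, 4.0f}, {OP_LOG, 2.0f, 0.0f} };
    for (int i = 0; i < 2; ++i)
        printf("result[%d] = %f\n", i, ppu_execute(&prog[i]));
    return 0;
}
```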
  • FIG. 6 is an exemplary structural diagram of a GPU according to an embodiment of the present application.
  • the SPPU further includes: a tightly coupled memory (TCM) and a direct memory access (DMA) unit; the TCM is connected to the programmable processing unit, and the DMA is connected to the TCM and the SPPU_CTRL respectively.
  • the TCM is used as the memory in the SPPU to store data and instructions
  • the DMA is responsible for all data accesses between the SPPU and the GPU's memory.
  • SPPU_CTRL is configured to receive the CF task, obtain the indication information corresponding to the CF task, and send the indication information to the DMA;
  • the DMA is configured to obtain the data corresponding to the indication information and store it in the TCM;
  • SPPU_CTRL is configured to send a first calculation instruction to the programmable processing unit; the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task;
  • when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU;
  • the SFU is configured to perform the second calculation according to the second calculation instruction and send the calculation result of the second calculation to the programmable processing unit;
  • the programmable processing unit is configured to send the calculation result of the first calculation and the calculation result of the second calculation to the TCM.
  • the GPU can obtain the information required for processing tasks from a memory, which may be any kind of memory, for example, double data rate (DDR) synchronous dynamic random access memory (SDRAM), referred to as DDR for short.
  • multiple buffers can be set up in the DDR, including a buffer for storing descriptors of pending tasks, an instruction buffer for storing instructions, and a constant buffer for storing constants.
  • a buffer area for storing other information may also be set in the DDR, which is not specifically limited in this embodiment of the present application.
  • FIG. 6 in the embodiment of the present application illustrates the connection relationship between the GPU and the DDR only as an example; the structure shown in FIG. 6 is not a limitation, and in other embodiments the GPU can also be connected to the DDR in the same way.
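  • the buffer arrangement described above can be pictured with the sketch below; the descriptor fields mirror the ones the text says SPPU_CTRL parses (instruction-buffer address, constant-buffer address, and sizes), but the exact binary layout is an assumption made only for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the task descriptor stored in DDR; the patent
 * names the fields SPPU_CTRL extracts but not their binary format. */
typedef struct {
    uint64_t instr_buf_addr;  /* address of the instruction buffer in DDR */
    uint32_t instr_buf_size;  /* size of the instruction buffer, bytes    */
    uint64_t const_buf_addr;  /* address of the constant buffer in DDR    */
    uint32_t const_buf_size;  /* size of the constant buffer, bytes       */
} CfDescriptor;

/* Hypothetical DMA configuration derived from a parsed descriptor. */
typedef struct {
    uint64_t src;  /* DDR source address */
    uint64_t dst;  /* TCM destination    */
    uint32_t len;  /* transfer length    */
} DmaConfig;

/* SPPU_CTRL-style conversion: descriptor -> two DMA transfers into TCM. */
static void descriptor_to_dma(const CfDescriptor *d, uint64_t tcm_base,
                              DmaConfig out[2]) {
    out[0] = (DmaConfig){ d->instr_buf_addr, tcm_base, d->instr_buf_size };
    out[1] = (DmaConfig){ d->const_buf_addr,
                          tcm_base + d->instr_buf_size, d->const_buf_size };
}

int main(void) {
    CfDescriptor d = { 0x80001000ULL, 512, 0x80002000ULL, 256 };
    DmaConfig cfg[2];
    descriptor_to_dma(&d, 0x0ULL, cfg);
    printf("constants land at TCM offset %llu\n", (unsigned long long)cfg[1].dst);
    return 0;
}
```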
  • FIG. 7 is an exemplary structural diagram of an SPPU according to an embodiment of the present application.
  • the SPPU includes an SPPU_CTRL and a programmable processing unit (in this embodiment of the present application, the programmable processing unit is an MCU) that are connected to each other.
  • the SPPU is located at the forefront of the entire GPU pipeline, and can support all Shader types in any GPU.
  • SPPU_CTRL is configured to receive the constant folding CF task, and send a first calculation instruction to the MCU according to the CF task; the MCU is configured to perform calculation corresponding to the CF task according to the first calculation instruction.
  • for example, the CF task is A = logB, where A and B are both constants. After SPPU_CTRL receives the CF task, it notifies the MCU of the CF task, and the MCU performs the log calculation.
  • the MCU can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, etc., and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals.
  • the special function calculations can be completed by software algorithms. For example, these special functions can be stored in memory in the form of tables; combined with the calculation instructions, the MCU can complete the corresponding calculation by looking up the table.
  • FIG. 8 is an exemplary structural diagram of the SPPU according to an embodiment of the present application. As shown in FIG. 8, on the basis of the structure shown in FIG. 7, the SPPU further includes: an SFU connected to the programmable processing unit (in this embodiment of the present application, the programmable processing unit is a DSP).
  • the SFU is used as a co-processor unit of the DSP to complete the hardware acceleration of the above-mentioned special function calculation.
  • SPPU_CTRL is configured to receive the CF task and send a first calculation instruction to the DSP according to the CF task;
  • the DSP is configured to perform the first calculation corresponding to the CF task according to the first calculation instruction; when the DSP detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU;
  • the SFU is configured to perform the second calculation according to the second calculation instruction.
  • the DSP can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations.
  • if, while processing the first calculation based on the first calculation instruction, the DSP detects that the second calculation is to be performed, it can send the second calculation instruction to the SFU and hand the second calculation over to the SFU for processing; if it does not detect that the second calculation is to be performed, the DSP completes the first calculation itself.
  • the embodiment of the present application replaces the special function calculation completed by the software algorithm in the embodiment shown in FIG. 7 to be implemented by hardware (SFU), which can improve the calculation speed.
  • FIG. 9 is an exemplary structural diagram of an SPPU according to an embodiment of the present application.
  • the SPPU further includes: a TCM and a DMA, where the TCM is connected to the programmable processing unit (in this embodiment of the present application, the programmable processing unit is an MCU), and the DMA is connected to the TCM and the SPPU_CTRL respectively.
  • in addition, SPPU_CTRL and the MCU communicate through a mailbox (MailBox); SPPU_CTRL and the DMA communicate through a DMA control interface; the GPU is also provided with a bus controller to implement communication between the GPU and the DDR.
  • based on the above GPU, the processing flow of a CF task can include the following steps:
  • 1. The GPU task controller delivers the CF task to SPPU_CTRL through an advanced peripheral bus (APB) interface and starts the SPPU. (The APB bus protocol is one of the bus structures of the advanced microcontroller bus architecture (AMBA), also known as the on-chip bus protocol, proposed by ARM; it is a standard on-chip bus structure.)
  • 2. SPPU_CTRL reads the descriptor related to the CF task from the DDR through the bus controller.
  • 3. From the parsed descriptor, SPPU_CTRL obtains the address of the instruction buffer, the address of the constant buffer, and the size information related to the CF task, converts them into DMA configuration parameters, and starts the DMA to move the data in the aforementioned buffers into the TCM.
  • 4. After the DMA completes moving all the data, SPPU_CTRL notifies the MCU through the MailBox to start calculating.
  • 5. The MCU executes the instructions prepared in advance in the TCM and performs the corresponding function calculations on the constants in the TCM; when a special function calculation instruction is reached, the MCU notifies the SFU to complete the special function calculation.
  • 6. The MCU stores all the calculation results in the TCM.
  • 7. After the MCU completes all calculations, the result of the CF task is stored in the DDR through the DMA; this result becomes the input data for other tasks of the GPU.
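  • read end to end, the seven steps amount to the control flow sketched below; every type and helper here is a stand-in invented for illustration, since the patent defines the flow itself, not these interfaces:

```c
#include <stdio.h>

/* All types and helpers below are hypothetical stand-ins. */
typedef struct { int instr_addr, const_addr, size; } Descriptor;
typedef struct { int id; Descriptor desc; } CfTask;

static Descriptor read_descriptor(const CfTask *t) {        /* step 2 */
    printf("SPPU_CTRL: read descriptor of task %d from DDR\n", t->id);
    return t->desc;
}
static void dma_move_buffers_to_tcm(const Descriptor *d) {  /* step 3 */
    printf("DMA: move %d bytes of instructions/constants to TCM\n", d->size);
}
static void mailbox_notify_mcu(void) {                      /* step 4 */
    printf("SPPU_CTRL: MailBox -> MCU, start calculating\n");
}
static void mcu_run_instructions(void) {                    /* steps 5-6 */
    printf("MCU: execute TCM instructions (SFU for special functions), store results in TCM\n");
}
static void dma_store_results_to_ddr(void) {                /* step 7 */
    printf("DMA: store CF results to DDR for other GPU tasks\n");
}

/* SPPU_CTRL's view of one CF task, i.e. steps 2-7 of the flow above;
 * step 1 (APB delivery by the GPU task controller) happens before this. */
static void sppu_ctrl_handle_cf_task(const CfTask *t) {
    Descriptor d = read_descriptor(t);
    dma_move_buffers_to_tcm(&d);
    mailbox_notify_mcu();
    mcu_run_instructions();
    dma_store_results_to_ddr();
}

int main(void) {
    CfTask t = { 7, { 0x1000, 0x2000, 256 } };
    sppu_ctrl_handle_cf_task(&t);
    return 0;
}
```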
  • FIG. 10 is an exemplary flowchart of a task processing method according to an embodiment of the present application. As shown in FIG. 10 , the method in the embodiment of the present application may be executed by the GPU in the above-mentioned embodiment, and the method may include:
  • Step 1001: Receive a task to be processed.
  • the GPU receives tasks to be processed, and the tasks to be processed can be implemented by a Shader program.
  • Step 1002a: When the to-be-processed task is a CF task, control the first processor to process it.
  • Step 1002b: When the to-be-processed task is not a CF task, control the second processor to process it.
  • step 1002a and step 1002b are alternatives: the GPU determines which of the two steps to execute according to the task to be processed.
  • the first processor may be the SPPU in the above-mentioned embodiments, or a general-purpose processor, which is used as a co-processor of the GPU;
  • the second processor may be an MCU or other programmable processing unit, which is used as a GPU main processor.
  • the first processor is configured to process CF tasks and the second processor is configured to process other tasks than CF tasks.
  • the CF task in the Shader program is completed by the coprocessor (the first processor), so that the CF task can be released from the main processor (the second processor) of the GPU; this obviously reduces the load of the second processor, enabling the second processor to process other tasks with more ample computing resources, thereby improving the performance of the GPU.
  • the GPU can analyze any task to be processed and make a pre-judgment. The judgment can distinguish each task at the very beginning of the Shader program and hand it over to the corresponding processor for processing. The scheduling complexity of the processor will not be increased, and even if there are many CF tasks to be processed, they are handled by the independent first processor, and the load of the second processor will not be increased.
  • each step of the above method embodiments may be completed by a hardware integrated logic circuit in a processor or an instruction in the form of software.
  • the processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed in the embodiments of the present application may be directly embodied as executed by a hardware encoding processor, or executed by a combination of hardware and software modules in the encoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the memory mentioned in the above embodiments may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically programmable Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • by way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • the disclosed systems, devices and methods may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application essentially, or the parts contributing to the prior art, or parts of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Advance Control (AREA)

Abstract

A GPU, an SPPU, and a task processing method. The GPU includes a first processor and a second processor, where the first processor and the second processor are connected; the first processor is configured to process constant folding (CF) tasks; the second processor is configured to process non-CF tasks. The above method can improve the performance of the GPU.

Description

GPU, SPPU and task processing method
Technical Field
Embodiments of the present application relate to processor technologies, and in particular, to a GPU, an SPPU, and a task processing method.
Background
A graphics processing unit (GPU) is a programmable processor that completes functions such as parallel computing and graphics processing by executing programmable shader (Shader) programs. Different from a central processing unit (CPU), a GPU adopts a parallel computing architecture based on single instruction multiple thread (SIMT): when executing a Shader program, threads in the same thread cluster execute the same Shader instruction in the same time cycle.
In Shader programs, many typical scenarios involve calculations between constants. For example, a constant B is calculated from a constant A; B can be pre-calculated before the Shader program is executed, and then all Shader programs start calculating with B as the starting point. This process is called constant folding (CF), and CF can bring huge performance and power-consumption benefits to the GPU. Therefore, how to improve the performance of CF processing is the key to improving the GPU's performance and power-consumption benefits.
Summary
Embodiments of the present application provide a GPU, an SPPU, and a task processing method, so as to improve the performance of the GPU.
In a first aspect, the present application provides a graphics processing unit (GPU), including a first processor and a second processor, where the first processor and the second processor are connected; the first processor is configured to process constant folding (CF) tasks; the second processor is configured to process non-CF tasks.
The first processor is configured to process CF tasks, and the second processor is configured to process tasks other than CF tasks. In the embodiments of the present application, the CF tasks in Shader programs are completed by a coprocessor (the first processor), so that the CF tasks are released from the GPU's main processor (the second processor). This clearly reduces the load of the second processor, allowing the second processor to process other tasks with more ample computing resources, thereby improving the performance of the GPU.
Optionally, both the first processor and the second processor may be any general-purpose processor, for example, a micro control unit (MCU).
For example, when the above GPU is applied as a system-on-chip (SoC) GPU in a mobile device, where power-consumption requirements are strict, the first processor can be a low-performance MCU; on the premise of meeting the mobile device's low-power requirements, this reduces the load of the second processor and improves the performance of the GPU. As another example, when the above GPU is applied in the GPU of another electronic device whose power-consumption requirements are less demanding than those of a mobile device, the first processor can be a high-performance microprocessor, which, compared with a low-performance MCU, can further improve the processing efficiency of CF tasks.
In a possible implementation, the GPU further includes a GPU task controller connected to the first processor and the second processor respectively.
In the embodiments of the present application, the GPU task controller is configured to send a task to be processed to the first processor or the second processor; the first processor is configured to process CF tasks, and the second processor is configured to process other tasks. That is, after receiving a task to be processed, the GPU task controller can first determine whether the task is a CF task or a non-CF task. When determining that the task is a CF task, the GPU task controller can send it to the first processor for processing; when determining that the task is not a CF task, the GPU task controller can send it to the second processor for processing.
Optionally, the above GPU task controller may be a general-purpose programmable processor, a dedicated GPU hardware control circuit, or the like. Through the GPU task controller, any task to be processed can be analyzed and pre-judged. This judgment distinguishes each task at the very beginning of the Shader program and hands it over to the corresponding processor for processing, without increasing the scheduling complexity of the processor; moreover, even if there are many CF tasks to be processed, they are handled by the independent first processor and do not increase the load of the second processor.
In a possible implementation, the first processor is a Shader pre-processor unit (SPPU); the SPPU includes a Shader pre-processor unit controller (SPPU_CTRL) and a programmable processing unit that are connected to each other.
Optionally, the above programmable processing unit may be an MCU or a digital signal processor (DSP). It should be noted that the programmable processing unit may also be any other programmable processing unit, which is not specifically limited in the embodiments of the present application.
In the embodiments of the present application, the SPPU is located at the very front end of the entire GPU pipeline and can support all Shader types in any GPU, for example, vertex shaders (VS, Vertex Shader), fragment shaders (FS, Fragment Shader), and general-purpose compute shaders (CS, Compute Shader). SPPU_CTRL is configured to receive a constant folding CF task and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform the calculation corresponding to the CF task according to the first calculation instruction.
For example, the CF task is A = logB, where A and B are both constants. After receiving the CF task, SPPU_CTRL notifies the programmable processing unit of the CF task, and the programmable processing unit performs the log calculation.
In the embodiments of the present application, the programmable processing unit can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, and the like, and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals.
Optionally, the above special function calculations can be completed by software algorithms. For example, these special functions can be stored in memory in the form of tables; combined with the calculation instructions, the programmable processing unit can complete the corresponding calculation by looking up the tables, without needing to design a dedicated hardware computing-acceleration co-processing unit for these functions.
In a possible implementation, the SPPU further includes a special function unit (SFU) connected to the programmable processing unit.
In the embodiments of the present application, the SFU serves as a coprocessor unit of the programmable processing unit and completes the hardware acceleration of the above special function calculations. SPPU_CTRL is configured to receive a CF task and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform a first calculation corresponding to the CF task according to the first calculation instruction; when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction.
In the embodiments of the present application, the programmable processing unit can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations. If, while processing the first calculation based on the first calculation instruction, the programmable processing unit detects that a second calculation is to be performed, it can send a second calculation instruction to the SFU and hand the second calculation over to the SFU; if no second calculation is detected during the processing of the first calculation, the programmable processing unit completes the first calculation itself. The embodiments of the present application replace the special function calculations completed by software algorithms with a hardware (SFU) implementation, which can improve the calculation speed.
In a possible implementation, the SPPU further includes a TCM and a DMA, where the TCM is connected to the programmable processing unit, and the DMA is connected to the TCM and the SPPU_CTRL respectively.
In the embodiments of the present application, the TCM serves as the memory in the SPPU and stores data and instructions, and the DMA is responsible for all data accesses between the SPPU and the GPU's memory. SPPU_CTRL is configured to receive the CF task, obtain the indication information corresponding to the CF task, and send the indication information to the DMA; the DMA is configured to obtain the data corresponding to the indication information and store it in the TCM; SPPU_CTRL is configured to send a first calculation instruction to the programmable processing unit; the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task; when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction and send the calculation result of the second calculation to the programmable processing unit; the programmable processing unit is configured to send the calculation result of the first calculation and the calculation result of the second calculation to the TCM.
In a possible implementation, in the above embodiments, the GPU can obtain the information required for processing tasks from a memory, which can be any kind of memory, for example, double data rate (DDR) synchronous dynamic random access memory (SDRAM), referred to as DDR for short. Multiple buffers can be set up in the DDR, including a buffer for storing descriptors of tasks to be processed, an instruction buffer for storing instructions, and a constant buffer for storing constants. It should be noted that buffers for storing other information can also be set up in the DDR, which is not specifically limited in the embodiments of the present application. It should also be noted that the above embodiments illustrate the connection relationship between the GPU and the DDR only as an example, and this structure is not a limitation; in other embodiments, the GPU can also be connected to the DDR in the same way.
In a second aspect, the present application provides a Shader pre-processor unit (SPPU), including a Shader pre-processor unit controller (SPPU_CTRL) and a programmable processing unit that are connected to each other.
In the embodiments of the present application, the SPPU is located at the very front end of the entire GPU pipeline and can support all Shader types in any GPU. SPPU_CTRL is configured to receive a constant folding CF task and send a first calculation instruction to the MCU according to the CF task; the MCU is configured to perform the calculation corresponding to the CF task according to the first calculation instruction.
For example, the CF task is A = logB, where A and B are both constants. After receiving the CF task, SPPU_CTRL notifies the MCU of the CF task, and the MCU performs the log calculation.
In the embodiments of the present application, the MCU can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, and the like, and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals. The special function calculations can be completed by software algorithms; for example, these special functions can be stored in memory in the form of tables, and combined with the calculation instructions, the MCU can complete the corresponding calculation by looking up the tables.
In a possible implementation, the SPPU further includes a special function unit (SFU) connected to the programmable processing unit.
In the embodiments of the present application, the SFU serves as a coprocessor unit of the DSP and completes the hardware acceleration of the above special function calculations. SPPU_CTRL is configured to receive the CF task and send a first calculation instruction to the DSP according to the CF task; the DSP is configured to perform a first calculation corresponding to the CF task according to the first calculation instruction; when the DSP detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction.
In the embodiments of the present application, the DSP can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations. If, while processing the first calculation based on the first calculation instruction, the DSP detects that a second calculation is to be performed, it can send a second calculation instruction to the SFU and hand the second calculation over to the SFU; if no second calculation is detected during the processing of the first calculation, the DSP completes the first calculation itself. The embodiments of the present application replace the special function calculations completed by software algorithms in the above embodiments with a hardware (SFU) implementation, which can improve the calculation speed.
In a possible implementation, the SPPU further includes a TCM and a DMA, where the TCM is connected to the programmable processing unit, and the DMA is connected to the TCM and the SPPU_CTRL respectively.
In the embodiments of the present application, the TCM serves as the memory in the SPPU and stores data and instructions, and the DMA is responsible for all data accesses between the SPPU and the GPU's memory. SPPU_CTRL is configured to receive the CF task, obtain the indication information corresponding to the CF task, and send the indication information to the DMA; the DMA is configured to obtain the data corresponding to the indication information and store it in the TCM; SPPU_CTRL is configured to send a first calculation instruction to the programmable processing unit; the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task; when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction and send the calculation result of the second calculation to the programmable processing unit; the programmable processing unit is configured to send the calculation result of the first calculation and the calculation result of the second calculation to the TCM.
In a third aspect, the present application provides a task processing method, including: receiving a task to be processed; when the task to be processed is a constant folding CF task, controlling a first processor to process the task; and when the task to be processed is not a CF task, controlling a second processor to process the task.
In the embodiments of the present application, the first processor may be the SPPU in the above embodiments, or a general-purpose processor, serving as a coprocessor of the GPU; the second processor may be an MCU or another programmable processing unit, serving as the main processor of the GPU.
The first processor is configured to process CF tasks, and the second processor is configured to process tasks other than CF tasks. In the embodiments of the present application, the CF tasks in Shader programs are completed by the coprocessor (the first processor), so that the CF tasks are released from the GPU's main processor (the second processor). This clearly reduces the load of the second processor, allowing the second processor to process other tasks with more ample computing resources, thereby improving the performance of the GPU.
In a possible implementation, the GPU can analyze any task to be processed and make a pre-judgment. This judgment distinguishes each task at the very beginning of the Shader program and hands it over to the corresponding processor for processing, without increasing the scheduling complexity of the processor; moreover, even if there are many CF tasks to be processed, they are handled by the independent first processor and do not increase the load of the second processor.
Brief Description of the Drawings
FIG. 1 is an exemplary schematic diagram of CF processing in the related art;
FIG. 2 is an exemplary structural diagram of a GPU according to an embodiment of the present application;
FIG. 3 is an exemplary structural diagram of a GPU according to an embodiment of the present application;
FIG. 4 is an exemplary structural diagram of a GPU according to an embodiment of the present application;
FIG. 5 is an exemplary structural diagram of a GPU according to an embodiment of the present application;
FIG. 6 is an exemplary structural diagram of a GPU according to an embodiment of the present application;
FIG. 7 is an exemplary structural diagram of an SPPU according to an embodiment of the present application;
FIG. 8 is an exemplary structural diagram of an SPPU according to an embodiment of the present application;
FIG. 9 is an exemplary structural diagram of an SPPU according to an embodiment of the present application;
FIG. 10 is an exemplary flowchart of a task processing method according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the embodiments of the present application.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of the embodiments of the present application are used only to distinguish descriptions and cannot be understood as indicating or implying relative importance or order. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion, for example, the inclusion of a series of steps or units. A method, system, product, or device is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or are inherent to the process, method, product, or device.
It should be understood that in the embodiments of the present application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following item(s)" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
FIG. 1 is an exemplary schematic diagram of CF processing in the related art. As shown in FIG. 1, before the Shader programs are executed, an additional process is embedded: a preprocessing Shader program is started, the CF processing is completed by this preprocessing Shader program, and then all Shader programs start calculating with the CF processing result as the starting point. This can save threads (Thread) in the processor and bring huge performance and power-consumption benefits to the processor. However, although the CF processing is completed by a separate preprocessing Shader program, all the Shader programs still jointly consume the GPU's computing and storage resources, so the goal of reducing the processor's computing load is not achieved. Moreover, if there are very many Shader programs, each of which needs to be preceded by a preprocessing Shader program to handle its corresponding CF task, the number of preprocessing Shader programs will also be very large; this not only fails to improve processor performance but instead increases the processor's computing load and reduces GPU performance. In addition, the additional preprocessing Shader programs also increase the control complexity of the processor, resulting in complicated scheduling methods and reduced processor performance.
Based on the above problems, an embodiment of the present application provides a GPU that can reduce the load of the GPU and improve its performance without increasing scheduling complexity. The above GPU can be integrated on any electronic device and used as a processor of a system on chip (SoC); the electronic device can be, for example, a mobile phone, a vehicle-mounted terminal, or a computer.
FIG. 2 is an exemplary structural diagram of a GPU according to an embodiment of the present application. As shown in FIG. 2, the GPU includes a first processor and a second processor, where the first processor and the second processor are connected.
The first processor is configured to process CF tasks, and the second processor is configured to process tasks other than CF tasks. In the embodiments of the present application, the CF tasks in Shader programs are completed by the coprocessor (the first processor), so that the CF tasks are released from the GPU's main processor (the second processor). This clearly reduces the load of the second processor, allowing the second processor to process other tasks with more ample computing resources, thereby improving the performance of the GPU.
Optionally, both the first processor and the second processor may be any general-purpose processor, for example, a micro control unit (MCU).
For example, when the above GPU is applied as a system-on-chip (SoC) GPU in a mobile device, where power-consumption requirements are strict, the first processor can be a low-performance MCU; on the premise of meeting the mobile device's low-power requirements, this reduces the load of the second processor and improves the performance of the GPU. As another example, when the above GPU is applied in the GPU of another electronic device whose power-consumption requirements are less demanding than those of a mobile device, the first processor can be a high-performance microprocessor, which, compared with a low-performance MCU, can further improve the processing efficiency of CF tasks.
FIG. 3 is an exemplary structural diagram of a GPU according to an embodiment of the present application. As shown in FIG. 3, on the basis of the GPU structure shown in FIG. 2, the GPU further includes a GPU task controller connected to the first processor and the second processor respectively.
In the embodiments of the present application, the GPU task controller is configured to send a task to be processed to the first processor or the second processor; the first processor is configured to process CF tasks, and the second processor is configured to process other tasks. That is, after receiving a task to be processed, the GPU task controller can first determine whether the task is a CF task or a non-CF task. When determining that the task is a CF task, the GPU task controller can send it to the first processor for processing; when determining that the task is not a CF task, the GPU task controller can send it to the second processor for processing.
Optionally, the above GPU task controller may be a general-purpose programmable processor, a dedicated GPU hardware control circuit, or the like. Through the GPU task controller, any task to be processed can be analyzed and pre-judged. This judgment distinguishes each task at the very beginning of the Shader program and hands it over to the corresponding processor for processing, without increasing the scheduling complexity of the processor; moreover, even if there are many CF tasks to be processed, they are handled by the independent first processor and do not increase the load of the second processor.
FIG. 4 is an exemplary structural diagram of a GPU according to an embodiment of the present application. As shown in FIG. 4, the GPU includes a GPU task controller, a first processor, and a second processor, where the first processor and the second processor are respectively connected to the GPU task controller; the first processor is a Shader pre-processor unit (SPPU); the SPPU includes a Shader pre-processor unit controller (SPPU_CTRL) and a programmable processing unit that are connected to each other.
Optionally, the above programmable processing unit may be an MCU or a digital signal processor (DSP). It should be noted that the programmable processing unit may also be any other programmable processing unit, which is not specifically limited in the embodiments of the present application.
In the embodiments of the present application, the SPPU is located at the very front end of the entire GPU pipeline and can support all Shader types in any GPU, for example, vertex shaders (VS, Vertex Shader), fragment shaders (FS, Fragment Shader), and general-purpose compute shaders (CS, Compute Shader). SPPU_CTRL is configured to receive a constant folding CF task and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform the calculation corresponding to the CF task according to the first calculation instruction.
For example, the CF task is A = logB, where A and B are both constants. After receiving the CF task, SPPU_CTRL notifies the programmable processing unit of the CF task, and the programmable processing unit performs the log calculation.
In the embodiments of the present application, the programmable processing unit can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, and the like, and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals.
Optionally, the above special function calculations can be completed by software algorithms. For example, these special functions can be stored in memory in the form of tables; combined with the calculation instructions, the programmable processing unit can complete the corresponding calculation by looking up the tables, without needing to design a dedicated hardware computing-acceleration co-processing unit for these functions.
FIG. 5 is an exemplary structural diagram of a GPU according to an embodiment of the present application. As shown in FIG. 5, on the basis of the GPU structure shown in FIG. 4, the SPPU further includes a special function unit (SFU) connected to the programmable processing unit.
In the embodiments of the present application, the SFU serves as a coprocessor unit of the programmable processing unit and completes the hardware acceleration of the above special function calculations. SPPU_CTRL is configured to receive a CF task and send a first calculation instruction to the programmable processing unit according to the CF task; the programmable processing unit is configured to perform a first calculation corresponding to the CF task according to the first calculation instruction; when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction.
In the embodiments of the present application, the programmable processing unit can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations. If, while processing the first calculation based on the first calculation instruction, the programmable processing unit detects that a second calculation is to be performed, it can send a second calculation instruction to the SFU and hand the second calculation over to the SFU; if no second calculation is detected during the processing of the first calculation, the programmable processing unit completes the first calculation itself. The embodiments of the present application replace the special function calculations completed by software algorithms in the embodiment shown in FIG. 4 with a hardware (SFU) implementation, which can improve the calculation speed.
FIG. 6 is an exemplary structural diagram of a GPU according to an embodiment of the present application. As shown in FIG. 6, on the basis of the GPU structure shown in FIG. 5, the SPPU further includes a tightly coupled memory (TCM) and a direct memory access (DMA) unit, where the TCM is connected to the programmable processing unit, and the DMA is connected to the TCM and the SPPU_CTRL respectively.
In the embodiments of the present application, the TCM serves as the memory in the SPPU and stores data and instructions, and the DMA is responsible for all data accesses between the SPPU and the GPU's memory. SPPU_CTRL is configured to receive the CF task, obtain the indication information corresponding to the CF task, and send the indication information to the DMA; the DMA is configured to obtain the data corresponding to the indication information and store it in the TCM; SPPU_CTRL is configured to send a first calculation instruction to the programmable processing unit; the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task; when the programmable processing unit detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction and send the calculation result of the second calculation to the programmable processing unit; the programmable processing unit is configured to send the calculation result of the first calculation and the calculation result of the second calculation to the TCM.
In a possible implementation, in the embodiments shown in FIG. 2 to FIG. 6 above, the GPU can obtain the information required for processing tasks from a memory, which can be any kind of memory, for example, double data rate (DDR) synchronous dynamic random access memory (SDRAM), referred to as DDR for short. Multiple buffers can be set up in the DDR, including a buffer for storing descriptors of tasks to be processed, an instruction buffer for storing instructions, and a constant buffer for storing constants. It should be noted that buffers for storing other information can also be set up in the DDR, which is not specifically limited in the embodiments of the present application. It should also be noted that FIG. 6 in the embodiments of the present application illustrates the connection relationship between the GPU and the DDR only as an example, and the structure shown in FIG. 6 is not a limitation; in other embodiments, the GPU can also be connected to the DDR in the same way.
FIG. 7 is an exemplary structural diagram of an SPPU according to an embodiment of the present application. As shown in FIG. 7, the SPPU includes an SPPU_CTRL and a programmable processing unit (in this embodiment of the present application, the programmable processing unit is an MCU) that are connected to each other.
In the embodiments of the present application, the SPPU is located at the very front end of the entire GPU pipeline and can support all Shader types in any GPU. SPPU_CTRL is configured to receive a constant folding CF task and send a first calculation instruction to the MCU according to the CF task; the MCU is configured to perform the calculation corresponding to the CF task according to the first calculation instruction.
For example, the CF task is A = logB, where A and B are both constants. After receiving the CF task, SPPU_CTRL notifies the MCU of the CF task, and the MCU performs the log calculation.
In the embodiments of the present application, the MCU can handle conditional judgments on floating-point numbers, loop control, floating-point calculations, and the like, and can also handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals. The special function calculations can be completed by software algorithms; for example, these special functions can be stored in memory in the form of tables, and combined with the calculation instructions, the MCU can complete the corresponding calculation by looking up the tables.
FIG. 8 is an exemplary structural diagram of an SPPU according to an embodiment of the present application. As shown in FIG. 8, on the basis of the structure shown in FIG. 7, the SPPU further includes an SFU connected to the programmable processing unit (in this embodiment of the present application, the programmable processing unit is a DSP).
In the embodiments of the present application, the SFU serves as a coprocessor unit of the DSP and completes the hardware acceleration of the above special function calculations. SPPU_CTRL is configured to receive the CF task and send a first calculation instruction to the DSP according to the CF task; the DSP is configured to perform a first calculation corresponding to the CF task according to the first calculation instruction; when the DSP detects that a second calculation corresponding to the CF task is to be performed, it sends a second calculation instruction to the SFU; the SFU is configured to perform the second calculation according to the second calculation instruction.
In the embodiments of the present application, the DSP can handle first calculations such as conditional judgments on floating-point numbers, loop control, and floating-point calculations, while the SFU can handle special function calculations, such as computing logarithms, squares, square roots, trigonometric functions, and reciprocals, as second calculations. If, while processing the first calculation based on the first calculation instruction, the DSP detects that a second calculation is to be performed, it can send a second calculation instruction to the SFU and hand the second calculation over to the SFU; if no second calculation is detected during the processing of the first calculation, the DSP completes the first calculation itself. The embodiments of the present application replace the special function calculations completed by software algorithms in the embodiment shown in FIG. 7 with a hardware (SFU) implementation, which can improve the calculation speed.
FIG. 9 is an exemplary structural diagram of an SPPU according to an embodiment of the present application. As shown in FIG. 9, on the basis of the structure shown in FIG. 8, the SPPU further includes a TCM and a DMA, where the TCM is connected to the programmable processing unit (in this embodiment of the present application, the programmable processing unit is an MCU), and the DMA is connected to the TCM and the SPPU_CTRL respectively. In addition, SPPU_CTRL and the MCU communicate through a mailbox (MailBox); SPPU_CTRL and the DMA communicate through a DMA control interface; the GPU is also provided with a bus controller to implement communication between the GPU and the DDR.
Based on the above GPU, the processing flow of a CF task can include:
1. The GPU task controller delivers the CF task to SPPU_CTRL through an advanced peripheral bus (APB) interface and starts the SPPU. (The APB bus protocol is one of the bus structures of the advanced microcontroller bus architecture (AMBA), also known as the on-chip bus protocol, proposed by ARM; it is a standard on-chip bus structure.)
2. SPPU_CTRL reads the descriptor related to the CF task from the DDR through the bus controller.
3. From the parsed descriptor, SPPU_CTRL obtains the address of the instruction buffer, the address of the constant buffer, and the size information related to the CF task, converts them into DMA configuration parameters, and starts the DMA to move the data in the aforementioned buffers into the TCM.
4. After the DMA completes moving all the data, SPPU_CTRL notifies the MCU through the MailBox to start calculating.
5. The MCU executes the instructions prepared in advance in the TCM and performs the corresponding function calculations on the constants in the TCM; when a special function calculation instruction is reached, the MCU notifies the SFU to complete the special function calculation.
6. The MCU stores all the calculation results in the TCM.
7. After the MCU completes all calculations, the result of the CF task is stored in the DDR through the DMA; this result becomes the input data for other tasks of the GPU.
FIG. 10 is an exemplary flowchart of a task processing method according to an embodiment of the present application. As shown in FIG. 10, the method of this embodiment can be executed by the GPU in the above embodiments, and the method can include:
Step 1001: Receive a task to be processed.
The GPU receives the task to be processed, and the task to be processed can be implemented through a Shader program.
Step 1002a: When the task to be processed is a CF task, control the first processor to process the task.
Step 1002b: When the task to be processed is not a CF task, control the second processor to process the task.
Steps 1002a and 1002b are mutually exclusive alternatives; the GPU determines, according to the task to be processed, which of the two steps to execute.
In this embodiment of the present application, the first processor can be the SPPU in the above embodiments, or a general-purpose processor, serving as the coprocessor of the GPU; the second processor can be an MCU or another programmable processing unit, serving as the main processor of the GPU.
The first processor is configured to process CF tasks, and the second processor is configured to process tasks other than CF tasks. In this embodiment of the present application, the CF tasks in the Shader program are completed by the coprocessor (the first processor), which releases the CF tasks from the main processor (the second processor) of the GPU. This clearly reduces the load on the second processor, so that the second processor can process other tasks with more ample computing resources, thereby improving the performance of the GPU.
In one possible implementation, the GPU can analyze any task to be processed and make an advance judgment. This judgment can distinguish the tasks at the very beginning of the Shader program and hand each task to the corresponding processor, so it does not raise the processors' scheduling complexity; and even if many CF tasks need to be processed, they are handled by the independent first processor and do not add to the load of the second processor. A minimal sketch of this dispatch is given below.
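The following sketch shows the CF/non-CF dispatch of steps 1002a/1002b under the same assumptions as the earlier sketches; gpu_task_t, main_processor_submit, and the classification flag are hypothetical names, not the disclosed interface.

```c
#include <stdint.h>

typedef enum { TASK_CF, TASK_OTHER } task_kind_t;

typedef struct {
    task_kind_t kind;      /* judged at the very start of the Shader program */
    uint64_t    desc_addr; /* descriptor address of the task in DDR          */
} gpu_task_t;

/* Stand-ins for the two processors' entry points. */
extern void sppu_ctrl_handle_cf_task(uint64_t desc_addr); /* first processor  */
extern void main_processor_submit(const gpu_task_t *t);   /* second processor */

/* Step 1001 has already classified the task; steps 1002a/1002b pick
 * the processor, so CF work never lands on the main processor. */
void gpu_task_controller_dispatch(const gpu_task_t *t) {
    if (t->kind == TASK_CF) {
        sppu_ctrl_handle_cf_task(t->desc_addr);  /* step 1002a */
    } else {
        main_processor_submit(t);                /* step 1002b */
    }
}
```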
During implementation, the steps of the above method embodiments can be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor can be a microprocessor, or the processor can be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application can be directly embodied as being executed by a hardware coding processor, or executed by a combination of hardware and software modules in a coding processor. The software module can be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The memory mentioned in each of the above embodiments can be a volatile memory or a non-volatile memory, or can include both volatile and non-volatile memories. The non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory can be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, for example, static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM). It should be noted that the memories of the systems and methods described herein are intended to include, but are not limited to, these and any other suitable types of memory.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art can use different methods to implement the described functions for each particular application, but such implementations should not be considered as going beyond the scope of the embodiments of the present application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference can be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there can be other division manners in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed can be indirect couplings or communication connections through some interfaces, apparatuses, or units, and can be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application can be integrated into one processing unit, or each unit can exist alone physically, or two or more units can be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above descriptions are merely specific implementations of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the embodiments of the present application shall fall within the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (11)

  1. A graphics processing unit (GPU), comprising: a first processor and a second processor; wherein the first processor and the second processor are connected;
    the first processor is configured to process constant folding (CF) tasks; and
    the second processor is configured to process non-CF tasks.
  2. The GPU according to claim 1, further comprising:
    a GPU task controller connected to the first processor and to the second processor, respectively, and configured to:
    when determining that a task to be processed is the CF task, send the task to be processed to the first processor; and
    when determining that the task to be processed is not the CF task, send the task to be processed to the second processor.
  3. The GPU according to claim 1 or 2, wherein the first processor is a Shader pre-processing unit (SPPU); the SPPU comprises a Shader pre-processing unit controller (SPPU_CTRL) and a programmable processing unit that are connected to each other;
    the SPPU_CTRL is configured to receive the CF task and send a first calculation instruction to the programmable processing unit according to the CF task; and
    the programmable processing unit is configured to perform, according to the first calculation instruction, a calculation corresponding to the CF task.
  4. The GPU according to claim 3, wherein the SPPU further comprises a special function unit (SFU) connected to the programmable processing unit;
    the programmable processing unit is configured to send a second calculation instruction to the SFU when detecting that a second calculation corresponding to the CF task is to be performed; and
    the SFU is configured to perform the second calculation according to the second calculation instruction.
  5. The GPU according to claim 4, wherein the SPPU further comprises a tightly coupled memory (TCM) and a direct memory access (DMA) unit; the TCM is connected to the programmable processing unit; and the DMA is connected to the TCM and to the SPPU_CTRL, respectively;
    the SPPU_CTRL is configured to receive the CF task, obtain indication information corresponding to the CF task, and send the indication information to the DMA;
    the DMA is configured to obtain data corresponding to the indication information and store the data in the TCM;
    the SPPU_CTRL is configured to send the first calculation instruction to the programmable processing unit;
    the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task, and to send the second calculation instruction to the SFU when detecting that the second calculation corresponding to the CF task is to be performed;
    the SFU is configured to perform the second calculation according to the second calculation instruction and send a calculation result of the second calculation to the programmable processing unit; and
    the programmable processing unit is configured to send a calculation result of the first calculation and the calculation result of the second calculation to the TCM.
  6. A Shader pre-processing unit (SPPU), comprising: a Shader pre-processing unit controller (SPPU_CTRL) and a programmable processing unit that are connected to each other, wherein
    the SPPU_CTRL is configured to receive a constant folding (CF) task and send a first calculation instruction to the programmable processing unit according to the CF task; and
    the programmable processing unit is configured to perform, according to the first calculation instruction, a calculation corresponding to the CF task.
  7. The SPPU according to claim 6, further comprising: a special function unit (SFU) connected to the programmable processing unit, wherein
    the programmable processing unit is configured to send a second calculation instruction to the SFU when detecting that a second calculation corresponding to the CF task is to be performed; and
    the SFU is configured to perform the second calculation according to the second calculation instruction.
  8. The SPPU according to claim 7, further comprising: a tightly coupled memory (TCM) and a direct memory access (DMA) unit, wherein the TCM is connected to the programmable processing unit, and the DMA is connected to the TCM and to the SPPU_CTRL, respectively;
    the SPPU_CTRL is configured to receive the CF task, obtain indication information corresponding to the CF task, and send the indication information to the DMA;
    the DMA is configured to obtain data corresponding to the indication information and store the data in the TCM;
    the SPPU_CTRL is configured to send the first calculation instruction to the programmable processing unit;
    the programmable processing unit is configured to obtain the data from the TCM according to the first calculation instruction and perform a first calculation corresponding to the CF task, and to send the second calculation instruction to the SFU when detecting that the second calculation corresponding to the CF task is to be performed;
    the SFU is configured to perform the second calculation according to the second calculation instruction and send a calculation result of the second calculation to the programmable processing unit; and
    the programmable processing unit is configured to send a calculation result of the first calculation and the calculation result of the second calculation to the TCM.
  9. A task processing method, comprising:
    receiving a task to be processed;
    when the task to be processed is a constant folding (CF) task, controlling a first processor to process the task to be processed; and
    when the task to be processed is not the CF task, controlling a second processor to process the task to be processed.
  10. The method according to claim 9, further comprising, after the receiving a task to be processed:
    judging whether the task to be processed is the CF task.
  11. The method according to claim 9 or 10, wherein the first processor is a Shader pre-processing unit (SPPU), and the SPPU comprises a Shader pre-processing unit controller (SPPU_CTRL) and a programmable processing unit that are connected to each other.
PCT/CN2020/142342 2020-12-31 2020-12-31 GPU, SPPU and task processing method WO2022141484A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/142342 WO2022141484A1 (zh) 2020-12-31 2020-12-31 GPU, SPPU and task processing method
CN202080108253.4A CN116964661A (zh) 2020-12-31 2020-12-31 GPU, SPPU and task processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/142342 WO2022141484A1 (zh) 2020-12-31 2020-12-31 GPU, SPPU and task processing method

Publications (1)

Publication Number Publication Date
WO2022141484A1 true WO2022141484A1 (zh) 2022-07-07

Family

ID=82258913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/142342 WO2022141484A1 (zh) 2020-12-31 2020-12-31 GPU, SPPU and task processing method

Country Status (2)

Country Link
CN (1) CN116964661A (zh)
WO (1) WO2022141484A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271584A (zh) * 2008-04-11 2008-09-24 威盛电子股份有限公司 Method and system for constant buffering of a compute core of a programmable graphics processing unit
CN102918488A (zh) * 2010-04-29 2013-02-06 苹果公司 Systems and methods for hot-plug GPU power control
CN106774782A (zh) * 2015-11-24 2017-05-31 中兴通讯股份有限公司 Interface display method, apparatus and terminal
US20190066256A1 (en) * 2015-12-18 2019-02-28 Intel Corporation Specialized code paths in gpu processing
CN112132936A (zh) * 2020-09-22 2020-12-25 上海米哈游天命科技有限公司 Picture rendering method and apparatus, computer device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116559789A (zh) * 2023-07-07 2023-08-08 成都泰格微电子研究所有限责任公司 Signal processing method for a radar control system
CN116559789B (zh) * 2023-07-07 2023-09-19 成都泰格微电子研究所有限责任公司 Signal processing method for a radar control system

Also Published As

Publication number Publication date
CN116964661A (zh) 2023-10-27

Similar Documents

Publication Publication Date Title
KR102368970B1 (ko) 2022-03-03 Intelligent high-bandwidth memory device
US8004533B2 (en) Graphics input command stream scheduling method and apparatus
US9842083B2 (en) Using completion queues for RDMA event detection
US11321256B2 (en) Persistent kernel for graphics processing unit direct memory access network packet processing
US20160283111A1 (en) Read operations in memory devices
US10909655B2 (en) Direct memory access for graphics processing unit packet processing
JP2015079542A (ja) 2015-04-23 Interrupt distribution scheme
US20180247164A1 (en) Image processing method and image processing apparatus
US9632958B2 (en) System for migrating stash transactions
US9715392B2 (en) Multiple clustered very long instruction word processing core
WO2022089592A1 (zh) 2022-05-05 Graphics rendering method and related device
US20180227249A1 (en) Adjusting buffer size for network interface controller
US20170024138A1 (en) Memory management
US20150268985A1 (en) Low Latency Data Delivery
WO2022141484A1 (zh) Gpu、sppu和任务处理方法
US20170212852A1 (en) Method and accelerator unit for interrupt handling
WO2020252763A1 (en) Adaptive pipeline selection for accelerating memory copy operations
WO2022011841A1 (zh) 2022-01-20 Method and apparatus for implementing clusters in a GPGPU, terminal, and medium
US11237994B2 (en) Interrupt controller for controlling interrupts based on priorities of interrupts
US8677028B2 (en) Interrupt-based command processing
US20170161081A1 (en) Apparatuses for enqueuing kernels on a device-side
US10185604B2 (en) Methods and apparatus for software chaining of co-processor commands before submission to a command queue
US20220414014A1 (en) Technology for early abort of compression acceleration
WO2023115529A1 (zh) 2023-06-29 Data processing method in a chip, and chip
TW202111561A (zh) 2021-03-16 Information processing device and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967823

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080108253.4

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967823

Country of ref document: EP

Kind code of ref document: A1