WO2023022722A1 - Apparatus and method of block-based graphics processing in a system-on-a-chip - Google Patents

Info

Publication number
WO2023022722A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
graphics processing
gpu
npu
dsp
Prior art date
Application number
PCT/US2021/046739
Other languages
French (fr)
Inventor
Jing Wu
Chen Li
Yu DAI
Hongyu Sun
Original Assignee
Zeku, Inc.
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku, Inc., Innopeak Technology, Inc. filed Critical Zeku, Inc.
Priority to PCT/US2021/046739 priority Critical patent/WO2023022722A1/en
Publication of WO2023022722A1 publication Critical patent/WO2023022722A1/en

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G5/36 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of a graphic pattern, e.g. using an all-points-addressable [APA] memory
    • G09G5/363 Graphics controllers
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G5/001 Arbitration of resources in a display system, e.g. control of access to frame buffer by video controller and/or main processor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G06F3/1407 General aspects irrespective of display type, e.g. determination of decimal point position, display with fixed or driving decimal point, suppression of non-significant zeros
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G2360/00 Aspects of the architecture of display systems
    • G09G2360/06 Use of more than one graphics processor to process data before displaying to one or more screens
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G2360/00 Aspects of the architecture of display systems
    • G09G2360/08 Power processing, i.e. workload management for processors involved in display operations, such as CPUs or GPUs
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G2360/00 Aspects of the architecture of display systems
    • G09G2360/12 Frame memory handling
    • G09G2360/121 Frame memory handling using a cache memory

Definitions

  • Embodiments of the present disclosure relate to an apparatus and method for graphics processing.
  • a system-on-a-chip (SoC) is an integrated circuit that integrates different subsystems or processors, each having a different function, in a computing system or other electronic device.
  • Graphics processors of an SoC may be configured to generate and process frames for output to a display device.
  • NPU: neural processing unit
  • DSP: digital signal processor
  • GPU: graphics processing unit
  • the SoC may output the frame to a display device.
  • an SoC may include a GPU configured to generate a set of blocks of a frame by first graphics processing.
  • the SoC may also include an NPU or DSP configured to initiate second graphics processing of the set of blocks of the frame prior to an entirety of the frame being generated by the first graphics processing.
  • an SoC may include a GPU configured to generate a first block of a frame by first graphics processing.
  • the SoC may include a GPU configured to send the first block to a storage unit (e.g., a buffer, a portion of a system cache, main double data rate (DDR) memory, etc.).
  • the SoC may include a GPU configured to output, to an NPU or DSP, a signal indicating that the first block is ready for second graphics processing by the NPU or DSP.
  • the signal may be output prior to generating a second block of the frame by the first processing.
  • an SoC may include an NPU or DSP configured to obtain, from a GPU, a signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing.
  • the NPU or DSP may obtain the first block from a storage unit.
  • the NPU or DSP may perform the second graphics processing of the first block.
  • the second graphics processing of the first block by the NPU or DSP may begin prior to a completion of the first graphics processing of the frame by the GPU.
  • a method of graphics processing by a GPU may include generating a first block of a frame by first graphics processing.
  • the method may include sending the first block to a storage unit.
  • the method may include outputting a first signal to an NPU or DSP.
  • the first signal may indicate that the first block is ready for second graphics processing.
  • the first signal may include first header transaction information of the first block.
  • the method may include generating a second block of the frame by the first graphics processing.
  • the first graphics processing of the second block by the GPU may be performed concurrent to the second graphics processing of the first block by the NPU or DSP.
  • a method of graphics processing by an NPU or DSP may include obtaining, from a GPU, a first signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing.
  • the method may include obtaining the first block from a storage unit.
  • the method may include performing second graphics processing of the first block.
  • the second graphics processing of the first block by the NPU or DSP may occur prior to a completion of the first graphics processing of the frame by the GPU.
  • FIG. 1 illustrates a block diagram of an exemplary SoC, in accordance with certain aspects of the disclosure.
  • FIG. 2 illustrates a block diagram of a frame that is made up of blocks that may be generated and processed for output to a display device, in accordance with certain aspects of the disclosure.
  • FIG. 3A illustrates a first timing diagram of a graphics processing technique.
  • FIG. 3B illustrates a second timing diagram associated with an exemplary graphics processing technique, in accordance with certain aspects of the disclosure.
  • FIG. 4A illustrates a flow diagram of a first exemplary graphics processing technique implemented by various components of the exemplary SoC of FIG. 1, in accordance with certain aspects of the disclosure.
  • FIG. 4B illustrates a flow diagram of a second exemplary graphics processing technique implemented by various components of the exemplary SoC of FIG. 1, in accordance with certain aspects of the disclosure.
  • FIG. 5A illustrates a flow chart of a first exemplary graphics processing technique, according to embodiments of the disclosure.
  • FIG. 5B illustrates a flow chart of a second exemplary graphics processing technique, according to embodiments of the disclosure.
  • references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc. indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • terminology may be understood at least in part from usage in context.
  • the term “one or more” as used herein, depending at least in part upon context may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense.
  • terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
  • the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
  • an SoC is an integrated circuit that integrates subsystems, each having a different function, in a computing system or other electronic device.
  • the subsystems integrated by an SoC may include, without limitation, one or more of the following: central processing units (CPUs), graphical processing units (GPUs), digital signal processors (DSPs), neural processing units (NPUs), microcontrollers, microprocessors, multiprocessors, other types of cores, a memory unit, read-only memory (ROM), random-access memory (RAM), clock signal generators, input/output (I/O) interfaces, analog interfaces, voltage regulators and power management circuits, an advanced peripheral unit(s), wireless communication unit(s) (e.g., Wi-Fi module, cellular module, 5G new radio (NR) module, Bluetooth® module, etc.), or coprocessors, just to name a few.
  • FIG. 2 illustrates a diagram 200 of a frame 202 that may be generated by a GPU and an NPU or DSP of an SoC.
  • a frame 202 may be segmented into blocks 204 that are arranged in rows and columns.
  • the size of a block 204 may be configurable. For example, a single block may include a tile, multiple tiles, a bin, or multiple bins.
  • frame 202 includes m rows and n columns for a total of m*n blocks. However, more or fewer rows, columns, blocks per row, and/or blocks per column may be included in frame 202 without departing from the scope of the present disclosure.
  • FIG. 3A illustrates a timing diagram 300 of the traditional graphics processing technique of an SoC.
  • the graphics processing may include generating (at 302a) each block 204 of frame 202. Once the entire frame 202 has been generated, data processing (at 304a) may begin.
  • a GPU may perform first graphics processing to generate frame 202.
  • a storage device may be used to maintain the blocks until the entire frame 202 is generated.
  • an NPU or DSP may be used to perform second data processing of the blocks of the frame, until frame 202 is ready for output to a display device. Due to its relatively large size, frame 202 may be stored in a double data rate (DDR) system memory rather than temporary storage, such as a buffer or cache.
  • the NPU or DSP reads the frame data from system memory. Memory transactions to and from DDR memory consume more power than those to and from a small-footprint memory storage, e.g., an internal buffer or cache. Furthermore, since the data processing (e.g., second graphics processing) by the NPU or DSP occurs only once the entire frame has been generated by the GPU, the latency from the beginning of frame generation until the frame is ready for output is longer than if second graphics processing occurred in tandem with first graphics processing once a block 204 is generated.
  • for example, as shown in FIG. 3A, the total latency from the time the first block is generated by a GPU until the last block has been processed by an NPU or DSP includes the duration of the first graphics processing time for m*n blocks plus the second graphics processing time for the m*n blocks. Consequently, the traditional method of nonconcurrent frame data processing by the GPU and NPU or DSP is undesirable in terms of latency, efficiency, and power consumption.
  • the present disclosure provides a frame data processing technique that enables concurrent first graphics processing by a GPU and second graphics processing by an NPU or DSP, as shown in the exemplary timing diagram 301 of FIG. 3B.
  • the GPU may signal to the NPU or DSP when a block is ready for second graphics processing. More specifically, the GPU may write the generated block to the storage unit (e.g., a buffer, a portion of system cache, DDR memory, etc.) and send a signal, which indicates the block is ready for second graphics processing, to the NPU or DSP.
  • the NPU or DSP may read the block from the storage unit based on header information included in the transaction.
  • the same process may be performed after the generation of each block such that the total latency of the exemplary graphics processing technique is considerably reduced, as shown in FIG. 3B.
  • the total processing time may include the duration of the first graphics processing of the entire frame plus the duration of the second graphics processing for a single block.
  • the storage footprint and power consumption may be reduced since individual blocks may be written, read, and deleted from the smaller storage (e.g., a buffer, a portion of system cache, etc.), as compared to maintaining an entire frame which may be stored in the DDR system memory in the conventional technique described in connection with FIG. 3A. Additional details of the present graphics processing techniques are described below in connection with FIGs. 1, 4A, 4B, 5A, and 5B.
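The latency claims of the two timing diagrams can be sketched numerically. The following Python sketch is illustrative only and not part of the disclosure; the 4x4 block grid and the per-block times t1 (first graphics processing) and t2 (second graphics processing) are assumed values:

```python
def sequential_latency(m, n, t1, t2):
    # Conventional flow (FIG. 3A): all m*n blocks are generated first,
    # then all m*n blocks undergo second graphics processing.
    return m * n * t1 + m * n * t2

def pipelined_latency(m, n, t1, t2):
    # Concurrent flow (FIG. 3B): second processing of block k overlaps
    # generation of block k+1, so only the last block's second processing
    # falls outside the generation window. This estimate assumes the
    # per-block second processing is no slower than the first processing.
    assert t2 <= t1, "overlap estimate assumes t2 does not exceed t1"
    return m * n * t1 + t2

m, n = 4, 4          # hypothetical 4x4 block grid
t1, t2 = 2.0, 1.5    # hypothetical per-block times (ms)
print(sequential_latency(m, n, t1, t2))  # 56.0
print(pipelined_latency(m, n, t1, t2))   # 33.5
```

Under these assumed numbers the concurrent flow finishes in roughly 60% of the sequential latency, consistent with the savings suggested by FIG. 3B.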
  • FIG. 1 illustrates a block diagram of an exemplary SoC 100, according to some embodiments of the present disclosure.
  • SoC 100 may be applied or integrated into various systems and apparatus capable of high-speed data processing, such as computers and wireless communication devices.
  • SoC 100 may be part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic device having high-speed data processing capability.
  • SoC 100 may serve as a graphics processor that imports data and instructions from main memory 102 and/or an external memory (not shown), executes instructions to perform various mathematical and logical calculations on the data to generate and process a graphics frame, and exports the graphics frame for display on a display device coupled to SoC 100.
  • SoC designs may integrate one or more components for computation and processing on an integrated-circuit (IC) substrate. Such SoC designs may also be referred to as “platforms.” For applications where chip size matters, such as smartphones and wearable gadgets, an SoC is an ideal design choice because of its compact area. It has the further advantage of low power consumption.
  • main memory 102, host central processing unit (CPU) 104, system bus 106, GPU 108, storage unit 110, and NPU/DSP 112 are integrated into SoC 100. It is understood that in some examples, main memory 102, host central processing unit (CPU) 104, system bus 106, GPU 108, storage unit 110, and/or NPU/DSP 112 may not be integrated on the same chip, but instead on separate chips.
  • GPU 108 may include any suitable specialized graphics processor configured to perform first graphics processing.
  • first graphics processing may include the generation of blocks of a frame, e.g., such as blocks 204 and frame 202 of FIG. 2.
  • NPU/DSP 112 may include any suitable specialized processor configured to perform second graphics processing, e.g., an NPU or DSP.
  • second graphics processing may include any type of graphics processing performed once a block and/or frame is generated and prior to its display on a display device.
  • Storage unit 110 may include any suitable storage device configured to maintain generated blocks of a frame.
  • storage unit 110 may include an internal buffer, a portion of system cache, or DDR memory.
  • Storage unit 110 may be configured to maintain the processed blocks on a temporary basis.
  • storage unit 110 may be configured to maintain two blocks at a time. When a first block is read from storage unit 110 by NPU/DSP 112, the first block may then be deleted or removed from storage unit 110. Then, a new block may be written to storage unit 110 by GPU 108.
  • GPU 108 may send a signal indicating to NPU/DSP 112 that a block is ready for second graphics processing.
  • the signal may include header transaction information associated with the newly written block.
  • the header transaction information may include block number, block data format, block size, and the start address of the block in storage unit 110.
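As a minimal sketch of the header transaction information the signal might carry, the structure below groups the four fields listed above; the type names and field names are assumptions for illustration, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HeaderTransaction:
    """Illustrative header transaction accompanying a newly written block."""
    block_number: int      # index of the block within the frame
    data_format: str       # e.g., "RGBA8888" (hypothetical format tag)
    block_size: int        # size of the block in bytes
    start_address: int     # start address of the block in the storage unit

# Example: a 64x64 block of 4-byte pixels written at offset 0
hdr = HeaderTransaction(block_number=1, data_format="RGBA8888",
                        block_size=64 * 64 * 4, start_address=0x0)
print(hdr.block_size)  # 16384
```

With such a record, the NPU/DSP has everything it needs to locate and interpret the block in the storage unit without waiting for the rest of the frame.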
  • the signal may be sent via direct connection between GPU 108 and NPU/DSP 112. In some other embodiments, the signal may be sent to storage unit 110 and retrieved by NPU/DSP 112.
  • storage unit 110 may include a header transaction region and a data region (as shown in FIG. 4B).
  • Header transaction region may be configured to maintain header transaction information that is written by GPU 108 when an associated block is written to the data region.
  • NPU/DSP 112 may periodically (at regular or irregular intervals) check the header region of storage unit 110 as a mailbox to see if header transaction information is available. When available, NPU/DSP 112 may read the associated block from the data region and begin second graphics processing thereon.
  • GPU 108 may increment a counter that keeps track of how many blocks have been written to storage unit 110, which may prevent storage overflow.
  • NPU/DSP 112 may send a signal to GPU 108 upon completion of second graphics processing of a block.
  • the signal may include header transaction information for the block that was processed by NPU/DSP 112.
  • GPU 108 may decrement the counter. In this way, GPU 108 may keep track of the number of blocks currently written to storage unit 110.
  • GPU 108 may wait until a signal is received from NPU/DSP 112 indicating that at least one block has been read (and removed) from storage unit 110. Then, the new block may be written to the data region of storage unit 110.
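The counter-based flow control described above can be simulated deterministically. The sketch below is illustrative only: it assumes a storage capacity of two blocks, and models the GPU write (counter increment) and the NPU/DSP read plus completion signal (counter decrement) as alternating steps:

```python
from collections import deque

CAPACITY = 2  # assumed maximum number of blocks the storage unit can hold

def run_pipeline(total_blocks):
    storage_unit = deque()   # data region: blocks awaiting second processing
    counter = 0              # GPU-side count of blocks in the storage unit
    peak = 0                 # highest occupancy observed (for verification)
    written = processed = 0
    while processed < total_blocks:
        if written < total_blocks and counter < CAPACITY:
            storage_unit.append(written)   # GPU writes a new block
            counter += 1                   # GPU increments the counter
            written += 1
        else:
            storage_unit.popleft()         # NPU/DSP reads (and removes) a block
            counter -= 1                   # completion signal decrements counter
            processed += 1
        peak = max(peak, counter)
    return processed, peak

done, peak = run_pipeline(16)
print(done, peak)  # 16 2
```

The peak occupancy never exceeds the capacity, which is the overflow-prevention property the counter is meant to provide.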
  • GPU 108 and NPU/DSP 112 may each include, among others, one or more processing cores (a.k.a. “cores”).
  • a processing core may include one or more functional units that perform various data operations associated with first and/or second graphics processing.
  • processing core may include an arithmetic logic unit (ALU) that performs arithmetic and bitwise operations on data (also known as “operand”), such as addition, subtraction, increment, decrement, AND, OR, Exclusive-OR, etc.
  • Processing core may also include a floating-point unit (FPU) that performs similar arithmetic operations but on a type of operands (e.g., floating-point numbers) different from those operated by the ALU (e.g., binary numbers).
  • the operations may be addition, subtraction, multiplication, etc.
  • Another way of categorizing the functional units may be based on whether the data processed by the functional unit is a scalar or a vector.
  • processing cores may include scalar function units (SFUs) for handling scalar operations and vector function units (VFUs) for handling vector operations.
  • when GPU 108 and/or NPU/DSP 112 includes multiple processing cores, each processing core may carry out regular mode data and instruction operations in serial or in parallel.
  • This multi-core processor design can effectively enhance the processing speed of GPU 108 and NPU/DSP 112 and improve their performance.
  • Other components may be included in SoC 100 as well, such as components that act as an interface between NPU/DSP 112 and a display device coupled to SoC 100. Additional details of the process flow performed by SoC 100 to generate a frame 202 with the reduced latency/power consumption and increased efficiency, as described above in connection with FIG. 3B, are provided below in connection with FIGs. 4A, 4B, 5A, and 5B.
  • FIG. 4A illustrates a flow diagram 400 of a first exemplary graphics processing technique that may be performed by SoC 100 when a direct connection is provided between GPU 108 and NPU/DSP 112.
  • FIG. 4B illustrates a flow diagram 401 of a second exemplary graphics processing technique that may be performed by SoC 100 when a direct connection is not provided between GPU 108 and NPU/DSP 112.
  • FIGs. 4A and 4B will be described together along with reference to FIGs. 2 and 3B.
  • GPU 108 may generate (at 302b) blocks 204 of frame 202 by first graphics processing. As shown in the example of FIG. 3B, blocks 204 may be generated serially by GPU 108. However, blocks 204 may be generated in parallel when GPU 108 can provide the throughput. In either case, a block may be written (at 402) to storage unit 110 once generated.
  • header transaction information 404 may be signaled to NPU/DSP 112. As shown in FIG. 4A, in some embodiments, header transaction information 404 may be signaled via a direct connection between GPU 108 and NPU/DSP 112.
  • header transaction information 404 may be signaled via an indirect connection between GPU 108 and NPU/DSP 112, as shown in FIG. 4B.
  • GPU 108 may send the header transaction to the header region 410 of storage unit 110.
  • NPU/DSP 112 may access header region 410 to check whether a header transaction has been written thereto.
  • header transaction information 404 may include, e.g., block number, block data format, block size, and/or start address of a newly generated block 204 in storage unit 110.
  • NPU/DSP 112 may read (at 406) the generated block from the data region 412 (depicted in FIG. 4B) and begin second graphics processing (at 304b). Thus, NPU/DSP 112 may perform (at 304b) second graphics processing while the GPU 108 generates (at 302b) a second block by first graphics processing. In other words, first and second graphics processing of different blocks may be performed concurrently, which enables exemplary SoC 100 to generate and process an entire frame for display with reduced latency/power and increased efficiency (as depicted in FIG. 3B), as compared to conventional techniques (as depicted in FIG. 3A).
  • the present graphics processing techniques may be performed for tile-based rendering (TBR) or tile-based deferred rendering (TBDR), as the aforementioned blocks 204 can be a single tile or a group of tiles.
  • the header transaction information can be transferred to NPU/DSP 112 (e.g., a downstream pipeline coprocessor NPU or DSP), which may begin second graphics processing of the tile.
  • the present graphics processing techniques may be applied to various fields, e.g., such as super resolution, image recognition, etc.
  • some graphics processing techniques may be performed for full frame or adjacent blocks, which may be accomplished using the techniques of FIGs. 4A and 4B through appropriate header transaction signaling and storage allocation.
  • GPU 108 may send the header transaction information to NPU/DSP 112 when a sufficient number of small blocks have been generated, or when the entire frame has been generated by GPU 108.
  • FIG. 5A illustrates a flowchart of a first exemplary method 500 for graphics processing, according to embodiments of the disclosure.
  • First exemplary method 500 may be performed by a GPU, e.g., such as GPU 108.
  • First exemplary method 500 may include operations 502-516. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5A.
  • the GPU may generate a first block of a frame by first graphics processing.
  • the GPU may generate (at 302b) a first block 204 (e.g., block 1 in FIG. 2).
  • the GPU may send the first block to a storage unit and increment a counter.
  • GPU 108 may send the generated first block to storage unit 110.
  • GPU 108 may send the generated first block to data region 412, as shown in FIG. 4B.
  • GPU 108 may increment a counter to keep track of how many blocks are temporarily maintained by storage unit 110 and awaiting second graphics processing by NPU/DSP 112. Because storage unit 110 may have limited storage capacity, using a counter may prevent GPU 108 from causing storage/buffer overflow by sending blocks that the storage unit 110 cannot simultaneously maintain.
  • the GPU may output a first signal to an NPU or DSP.
  • the first signal may indicate to the NPU or DSP that the first block is generated and available for second graphics processing.
  • the NPU or DSP may begin the second graphics processing of the first block concurrent with the GPU generating a second block by the first graphics processing.
  • the first signal may include header transaction information 404 that is sent via a direct connection between GPU 108 and NPU/DSP 112.
  • the header transaction information 404 may be sent to a header region 410 of storage unit 110 that is checked by NPU/DSP 112.
  • the GPU may determine whether all blocks of the frame have been rendered. For example, referring to FIGs. 2, 4A and 4B, GPU 108 may determine whether all m*n blocks 204 of frame 202 have been rendered each time a block is written (at 402) to storage unit 110. Upon determining that all blocks 204 have been generated, the operation is complete for the frame. On the other hand, when all m*n blocks 204 have not been rendered, the operations may move to 510.
  • the GPU may determine whether the counter is greater than or equal to a threshold.
  • the threshold may be set based on the maximum number of blocks that can be maintained by storage unit 110. Referring to FIGs. 1, 4A, and 4B, for example, when GPU 108 generates a new block (e.g., block 2) for writing to storage unit 110, GPU 108 checks whether the counter has reached the threshold (e.g., 2). A counter value of 2 indicates storage unit 110 cannot currently maintain any additional blocks. When the counter is less than the threshold, the operations may return to 502 and GPU 108 may generate the next block 204. Otherwise, when the counter is greater than or equal to the threshold, the operation may move to 512.
  • the GPU waits until the counter value is less than the threshold. For example, referring to FIGs. 1, 4A, and 4B, GPU 108 may wait until a signal is received from NPU/DSP 112 indicating that at least one block has been read (and removed) from storage unit 110. Then, the new block may be written to the data region 412 of storage unit 110.
  • operations 514 and 516 may be performed by the GPU.
  • the GPU may receive a second signal from the NPU or DSP.
  • the second signal may indicate that the second graphics processing of the first block is complete.
  • NPU/DSP 112 may send a second signal (header transaction information 404) to GPU 108 upon completion of second graphics processing of a block.
  • the signal may include header transaction information of the block processed by NPU/DSP 112.
  • the second signal may be sent via direct connection, as in FIG. 4A.
  • the second signal may be sent to the header region 410 of storage unit 110.
  • GPU 108 may check header region 410 to determine whether a second signal is waiting for retrieval.
  • the GPU may decrement the counter upon receipt of the second signal. For example, referring to FIGs. 1, 4A, and 4B, when the signal is received, GPU 108 may decrement the counter. In this way, GPU 108 may keep track of the number of blocks written to storage unit 110.
  • FIG. 5B illustrates a flowchart of a second exemplary method 501 for graphics processing, according to embodiments of the disclosure.
  • Second exemplary method 501 may be performed by an NPU or DSP, e.g., such as NPU/DSP 112.
  • Second exemplary method 501 may include operations 522-528. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5B.
  • the NPU or DSP may obtain a first signal indicating a first block of a frame has been generated and is ready for second graphics processing.
  • the first signal may include a header transaction information 404 that is sent via a direct connection between GPU 108 and NPU/DSP 112.
  • the header transaction information 404 may be sent to a header region 410 of storage unit 110 that is checked by NPU/DSP 112.
  • the NPU or DSP may obtain the first block from the storage unit.
  • NPU/DSP 112 may read (at 406) the first block from storage unit 110.
  • the NPU or DSP may perform second graphics processing of the first block.
  • NPU/DSP 112 may perform (at 304b) second graphics processing of the first block. Because the first signal indicates the first block is ready for second graphics processing, the second graphics processing may proceed concurrent with the first graphics processing, thereby reducing latency and improving efficiency.
  • the NPU or DSP may send a second signal to the GPU when the second graphics processing of the first block is complete.
  • NPU/DSP 112 may send a second signal (header transaction information 404) to GPU 108 upon completion of second graphics processing of a block.
  • the signal may include header transaction information of the block that was processed by NPU/DSP 112.
  • the second signal may be sent via direct connection, as in FIG. 4A.
  • the second signal may be sent to the header region 410 of storage unit 110.
  • GPU 108 may check header region 410 to determine whether a second signal is waiting for retrieval.
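The NPU/DSP-side loop of method 501 can be sketched for the indirect-connection case, where a shared header region serves as the mailbox. All names below (header_region, data_region, process_block) are illustrative assumptions, and the doubling step stands in for whatever second graphics processing (e.g., super resolution) is applied:

```python
def process_block(block):
    # Stand-in for second graphics processing of one block.
    return [v * 2 for v in block]

def npu_dsp_loop(header_region, data_region, completion_signals):
    """Drain the header-region mailbox: read, process, and acknowledge blocks."""
    processed = []
    while header_region:                       # obtain first signal (header info)
        hdr = header_region.pop(0)             # header identifies the ready block
        block = data_region.pop(hdr["start_address"])  # obtain block from storage
        processed.append(process_block(block))         # second graphics processing
        completion_signals.append(hdr)         # second signal back to the GPU
    return processed

# Two blocks already written by the GPU, with their header transactions
headers = [{"block_number": 0, "start_address": 0},
           {"block_number": 1, "start_address": 1}]
data = {0: [1, 2], 1: [3, 4]}
done = []
out = npu_dsp_loop(headers, data, done)
print(out, len(done))  # [[2, 4], [6, 8]] 2
```

Note that each processed block is removed from the data region and acknowledged, which is what lets the GPU-side counter decrement and keep writing new blocks.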
  • the functions described herein may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium.
  • Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as SoC 100 of FIG. 1.
  • such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, hard disk drive (HDD) or other magnetic storage devices, flash drive, solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer.
  • Disk and disc, as used herein, include CD, laser disc, optical disc, DVD, and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • an SoC is provided.
  • the SoC may include a GPU configured to generate a set of blocks of a frame by first graphics processing.
  • the SoC may also include an NPU or DSP configured to initiate second graphics processing of the set of blocks of the frame prior to an entirety of the frame being generated by the first graphics processing.
  • the GPU may be configured to generate the set of blocks of the frame by the first graphics processing by generating a first block of the frame by the first graphics processing. In some embodiments, the GPU may be configured to generate the set of blocks of the frame by the first graphics processing by generating a second block of the frame by the first graphics processing.
  • the GPU may be further configured to send the first block to a storage unit. In some embodiments, the GPU may be further configured to output a first signal indicating that the first block is ready for the second graphics processing. In some embodiments, the first signal may be output prior to the generating of the second block.
  • the first signal may include one or more of header transaction information, a block number, a block data format, block size, or a start address of the first block in the storage unit.
  • the first signal may be output via a direct connection with the NPU or DSP.
  • the first signal may be output to a header transaction region of the storage unit.
  • the storage unit may include a buffer, a cache, or a DDR memory.
  • the NPU or DSP may be configured to obtain the first signal indicating that the first block is ready for the second graphics processing. In some embodiments, the NPU or DSP may be configured to obtain the first block from the storage unit. In some embodiments, the NPU or DSP may be configured to perform the second graphics processing of the first block. In some embodiments, the second graphics processing of the first block may be concurrent with the first graphics processing of the second block.
  • the NPU or DSP may be further configured to output a second signal indicating that the second graphics processing of the first block is complete.
  • the second signal may be output via a direct connection with the GPU.
  • the second signal may be output to a header transaction region of the storage unit.
  • the GPU may be further configured to obtain the second signal indicating that the second graphics processing of the first block is complete. In some embodiments, the GPU may be further configured to increment a counter based on a block generation.
  • the GPU may be further configured to, in response to the counter meeting a threshold, delay sending the second block to the storage unit. In some embodiments, the GPU may be further configured to, in response to the counter not meeting the threshold, send the second block to the storage unit.
  • an SoC may include a GPU configured to generate a first block of a frame by first graphics processing.
  • the SoC may include a GPU configured to send the first block to a storage unit.
  • the SoC may include a GPU configured to output, to an NPU or DSP, a signal indicating that the first block is ready for second graphics processing by the NPU or DSP.
  • the signal may be output prior to generating a second block of the frame by the first graphics processing.
  • the SoC may include the NPU or DSP configured to in response to receiving the signal output by the GPU, perform second graphics processing of the first block.
  • the GPU may be further configured to generate the second block of the frame by the first graphics processing.
  • the second graphics processing of the first block by the NPU or DSP may be performed concurrent to the first graphics processing of the second block by the GPU.
  • the signal may be output to the NPU or DSP via a direct connection. In some embodiments, the signal may be output to the NPU or DSP via the storage unit as a mailbox.
  • an SoC may include an NPU or DSP configured to obtain, from a GPU, a signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing.
  • the NPU or DSP may be configured to obtain the first block from a storage unit.
  • the NPU or DSP may be configured to perform the second graphics processing of the first block.
  • the second graphics processing of the first block by the NPU or DSP may begin prior to a completion of the first graphics processing of the frame.
  • the SoC may further include a GPU configured to generate a second block by the first graphics processing.
  • the second graphics processing of the first block by the NPU or DSP may be performed concurrent to the first graphics processing of the second block by the GPU.
  • a method of graphics processing by a GPU may include generating a first block of a frame by first graphics processing.
  • the method may include sending the first block to a storage unit.
  • the method may include outputting a first signal to an NPU or DSP.
  • the first signal may indicate that the first block is ready for second graphics processing.
  • the first signal may include first header transaction information of the first block.
  • the method may include generating a second block of the frame by the first graphics processing.
  • the first graphics processing of the second block by the GPU may be performed concurrent to the second graphics processing of the first block by the NPU or DSP.
  • a method of graphics processing by an NPU or DSP may include obtaining, from a GPU, a first signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing.
  • the method may include obtaining the first block from a storage unit.
  • the method may include performing second graphics processing of the first block.
  • the second graphics processing of the first block may occur prior to a completion of the first graphics processing of the frame by the GPU.

Abstract

According to one aspect of the disclosure, a system-on-a-chip (SoC) is provided. The SoC may include a graphics processing unit (GPU) configured to generate a set of blocks of a frame by first graphics processing. The SoC may also include a neural processing unit (NPU) or digital signal processor (DSP) configured to initiate second graphics processing of the set of blocks of the frame prior to an entirety of the frame being generated by the first graphics processing. With this technique, the latency from frame generation by the GPU (first graphics processing) to data post-processing (second graphics processing) by the NPU or DSP is significantly reduced, thereby improving performance. The smaller granularity of the exchanged memory footprint also reduces power consumption, thereby improving power efficiency.

Description

APPARATUS AND METHOD OF BLOCK-BASED GRAPHICS PROCESSING IN A SYSTEM-ON-A-CHIP
BACKGROUND
[0001] Embodiments of the present disclosure relate to an apparatus and method for graphics processing.
[0002] A system-on-a-chip (SoC) is an integrated circuit that integrates different subsystems or processors, each having a different function, in a computing system or other electronic device. Graphics processors of an SoC may be configured to generate and process frames for output to a display device. For example, a neural processing unit (NPU) or a digital signal processor (DSP) may process a frame generated by a graphics processing unit (GPU). Once processed, the SoC may output the frame to a display device.
SUMMARY
[0003] Embodiments of apparatus and method for graphics processing are disclosed herein.
[0004] According to one aspect of the disclosure, an SoC is provided. The SoC may include a GPU configured to generate a set of blocks of a frame by first graphics processing. The SoC may also include an NPU or DSP configured to initiate second graphics processing of the set of blocks of the frame prior to an entirety of the frame being generated by the first graphics processing.
[0005] According to another aspect of the disclosure, an SoC is provided. The SoC may include a GPU configured to generate a first block of a frame by first graphics processing. The SoC may include a GPU configured to send the first block to a storage unit (e.g., a buffer, a portion of a system cache, main double data rate (DDR) memory, etc.). The SoC may include a GPU configured to output, to an NPU or DSP, a signal indicating that the first block is ready for second graphics processing by the NPU or DSP. In some embodiments, the signal may be output prior to generating a second block of the frame by the first graphics processing.
[0006] According to yet another aspect of the disclosure, an SoC is provided. The SoC may include an NPU or DSP configured to obtain, from a GPU, a signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing. The NPU or DSP may obtain the first block from a storage unit. The NPU or DSP may perform the second graphics processing of the first block. In some embodiments, the second graphics processing of the first block by the NPU or DSP may begin prior to a completion of the first graphics processing of the frame by the GPU.
[0007] According to a further aspect of the disclosure, a method of graphics processing by a GPU is disclosed. The method may include generating a first block of a frame by first graphics processing. The method may include sending the first block to a storage unit. The method may include outputting a first signal to an NPU or DSP. The first signal may indicate that the first block is ready for second graphics processing. The first signal may include first header transaction information of the first block. The method may include generating a second block of the frame by the first graphics processing. The first graphics processing of the second block by the GPU may be performed concurrent to the second graphics processing of the first block by the NPU or DSP.
[0008] According to still another aspect of the disclosure, a method of graphics processing by an NPU or DSP is disclosed. The method may include obtaining, from a GPU, a first signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing. The method may include obtaining the first block from a storage unit. The method may include performing second graphics processing of the first block. The second graphics processing of the first block by the NPU or DSP may occur prior to a completion of the first graphics processing of the frame by the GPU.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.
[0010] FIG. 1 illustrates a block diagram of an exemplary SoC, in accordance with certain aspects of the disclosure.
[0011] FIG. 2 illustrates a block diagram of a frame that is made up of blocks that may be generated and processed for output to a display device, in accordance with certain aspects of the disclosure.
[0012] FIG. 3A illustrates a first timing diagram of a graphics processing technique.
[0013] FIG. 3B illustrates a second timing diagram associated with an exemplary graphics processing technique, in accordance with certain aspects of the disclosure.
[0014] FIG. 4A illustrates a flow diagram of a first exemplary graphics processing technique implemented by various components of the exemplary SoC of FIG. 1, in accordance with certain aspects of the disclosure.
[0015] FIG. 4B illustrates a flow diagram of a second exemplary graphics processing technique implemented by various components of the exemplary SoC of FIG. 1, in accordance with certain aspects of the disclosure.
[0016] FIG. 5A illustrates a flow chart of a first exemplary graphics processing technique, according to embodiments of the disclosure.
[0017] FIG. 5B illustrates a flow chart of a second exemplary graphics processing technique, according to embodiments of the disclosure.
[0018] Embodiments of the present disclosure will be described with reference to the accompanying drawings.
DETAILED DESCRIPTION
[0019] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.
[0020] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
[0021] In general, terminology may be understood at least in part from usage in context. For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
[0022] Various aspects of configurable power management will now be described with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, units, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.
[0023] As mentioned above, an SoC is an integrated circuit that integrates subsystems, each having a different function, in a computing system or other electronic device. The subsystems integrated by an SoC may include, without limitation, one or more of the following: central processing units (CPUs), graphical processing units (GPUs), digital signal processors (DSPs), neural processing units (NPUs), microcontrollers, microprocessors, multiprocessors, other types of cores, a memory unit, read-only memory (ROM), random-access memory (RAM), clock signal generators, input/output (I/O) interfaces, analog interfaces, voltage regulators and power management circuits, an advanced peripheral unit(s), wireless communication unit(s) (e.g., Wi-Fi module, cellular module, 5G new radio (NR) module, Bluetooth® module, etc.), or coprocessors, just to name a few.
[0024] FIG. 2 illustrates a diagram 200 of a frame 202 that may be generated by a GPU and an NPU or DSP of an SoC. As shown, a frame 202 may be segmented into blocks 204 that are arranged in rows and columns. The size of a block 204 may be configurable. For example, a single block may include a tile, multiple tiles, a bin, or multiple bins. In the example depicted in FIG. 2, frame 202 includes m rows and n columns for a total of m*n blocks. However, more or fewer rows, columns, blocks per row, and/or blocks per column may be included in frame 202 without departing from the scope of the present disclosure.
[0025] FIG. 3A illustrates a timing diagram 300 of the traditional graphics processing technique of an SoC. The graphics processing may include generating (at 302a) each block 204 of frame 202. Once the entire frame 202 has been generated, data processing (at 304a) may begin. A GPU may perform first graphics processing to generate frame 202. A storage device may be used to maintain the blocks until the entire frame 202 is generated. Then, an NPU or DSP may be used to perform second data processing of the blocks of the frame, until frame 202 is ready for output to a display device. Due to its relatively large size, frame 202 may be stored in a double data rate (DDR) system memory rather than temporary storage, such as a buffer or cache. To process frame data, the NPU or DSP reads the frame data from system memory. Memory transactions to and from DDR memory consume more power than those to and from a small-footprint memory storage, e.g., an internal buffer or cache. Furthermore, since the data processing (e.g., second graphics processing) by the NPU or DSP occurs only once the entire frame has been generated by the GPU, the latency from the beginning of frame generation until the frame is ready for output is longer than if second graphics processing occurs in tandem with first graphics processing once a block 204 is generated. For example, as shown in FIG. 3A, when frame 202 includes m*n blocks, the total latency from the time the first block is generated by a GPU until the last block has been processed by an NPU or DSP includes the duration of the first graphics processing time for m*n blocks plus the second graphics processing time for the m*n blocks. Consequently, the traditional method of non-concurrent frame data processing by the GPU and NPU or DSP is undesirable in terms of latency, efficiency, and power consumption.
[0026] Thus, there exists an unmet need for a frame processing technique in which a GPU generates frame data (e.g., blocks) and an NPU or DSP performs additional graphics processing in a way that overcomes the latency, efficiency, and power consumption challenges of traditional techniques.
[0027] To overcome these and other challenges, the present disclosure provides a frame data processing technique that enables concurrent first graphics processing by a GPU and second graphics processing by an NPU or DSP, as shown in the exemplary timing diagram 301 of FIG. 3B. Using the exemplary technique disclosed herein, the GPU may signal to the NPU or DSP when a block is ready for second graphics processing. More specifically, the GPU may write the generated block to the storage unit (e.g., a buffer, a portion of system cache, DDR memory, etc.) and send a signal, which indicates the block is ready for second graphics processing, to the NPU or DSP. Upon receipt of the signal, the NPU or DSP may read the block from the storage unit based on header information included in the transaction. The same process may be performed after the generation of each block such that the total latency of the exemplary graphics processing technique is considerably reduced, as shown in FIG. 3B. For example, the total processing time may include the duration of the first graphics processing of the entire frame plus the duration of the second graphics processing for a single block. Thus, using the present technique, a significant reduction in latency (e.g., up to (m*n−1) times the per-block second graphics processing duration) can be achieved, thus improving the graphics processing efficiency. Moreover, because the second graphics processing is performed as blocks are generated, the storage footprint and power consumption may be reduced since individual blocks may be written, read, and deleted from the smaller storage (e.g., a buffer, a portion of system cache, etc.), as compared to maintaining an entire frame which may be stored in the DDR system memory in the conventional technique described in connection with FIG. 3A. Additional details of the present graphics processing techniques are described below in connection with FIGs. 1, 4A, 4B, 5A, and 5B.
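The latency arithmetic above can be illustrated with a small numeric sketch. The per-block durations below are hypothetical placeholders chosen for illustration, not values from the disclosure; the sketch assumes the per-block second graphics processing time does not exceed the per-block first graphics processing time, so the NPU or DSP never becomes the bottleneck in the pipelined case.

```python
# Hypothetical per-block processing times (arbitrary units).
M, N = 4, 4      # frame of m*n = 16 blocks (see FIG. 2)
T_GEN = 2        # first graphics processing time per block (GPU)
T_POST = 1       # second graphics processing time per block (NPU/DSP)

blocks = M * N

# Traditional flow (FIG. 3A): second processing starts only after the
# entire frame has been generated.
sequential_latency = blocks * T_GEN + blocks * T_POST

# Pipelined flow (FIG. 3B): second processing overlaps first processing,
# so only the final block's post-processing adds to the critical path.
pipelined_latency = blocks * T_GEN + T_POST

# Savings of (m*n - 1) per-block second-processing durations.
savings = sequential_latency - pipelined_latency
assert savings == (blocks - 1) * T_POST

print(sequential_latency, pipelined_latency, savings)  # 48 33 15
```

With these placeholder numbers, pipelining removes 15 of the 16 per-block post-processing intervals from the critical path, matching the (m*n−1) reduction stated above.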
[0028] FIG. 1 illustrates a block diagram of an exemplary SoC 100, according to some embodiments of the present disclosure. SoC 100 may be applied or integrated into various systems and apparatus capable of high-speed data processing, such as computers and wireless communication devices. For example, SoC 100 may be part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic devices having high-speed data processing capability. Using a display device as an example, SoC 100 may serve as a graphics processor that imports data and instructions from main memory 102 and/or an external memory (not shown), executes instructions to perform various mathematical and logical calculations on the data to generate and process a graphics frame, and exports the graphics frame for display on a display device coupled to SoC 100.
[0029] SoC designs may integrate one or more components for computation and processing on an integrated-circuit (IC) substrate. Such SoC designs may also be referred to as “platforms.” For applications where chip size matters, such as smartphones and wearable gadgets, SoC design is an ideal design choice because of its compact area. It further has the advantage of small power consumption. In some embodiments, as shown in FIG. 1, main memory 102, host central processing unit (CPU) 104, system bus 106, GPU 108, storage unit 110, and NPU/DSP 112 are integrated into SoC 100. It is understood that in some examples, main memory 102, host central processing unit (CPU) 104, system bus 106, GPU 108, storage unit 110, and/or NPU/DSP 112 may not be integrated on the same chip, but instead on separate chips.
[0030] GPU 108 may include any suitable specialized graphics processor configured to perform first graphics processing. As used herein, “first graphics processing” may include the generation of blocks of a frame, e.g., such as blocks 204 and frame 202 of FIG. 2. NPU/DSP 112 may include any suitable specialized processor configured to perform second graphics processing, e.g., an NPU or DSP. As used herein, “second graphics processing” may include any type of graphics processing performed once a block and/or frame is generated and prior to its display on a display device.
[0031] Storage unit 110 may include any suitable storage device configured to maintain generated blocks of a frame. For example, storage unit 110 may include an internal buffer, a portion of system cache, or DDR memory. Storage unit 110 may be configured to maintain the processed blocks on a temporary basis. By way of example and not limitation, storage unit 110 may be configured to maintain two blocks at a time. When a first block is read from storage unit 110 by NPU/DSP 112, the first block may then be deleted or removed from storage unit 110. Then, a new block may be written to storage unit 110 by GPU 108.
[0032] Each time a block is written to storage unit 110, GPU 108 may send a signal indicating to NPU/DSP 112 that a block is ready for second graphics processing. The signal may include header transaction information associated with the newly written block. For example, the header transaction information may include block number, block data format, block size, and the start address of the block in storage unit 110. In some embodiments, the signal may be sent via direct connection between GPU 108 and NPU/DSP 112. In some other embodiments, the signal may be sent to storage unit 110 and retrieved by NPU/DSP 112.
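As an illustrative software model of the signal described in paragraph [0032] (the field names and the example format string are assumptions for illustration; the disclosure lists only the kinds of information carried), the header transaction information might be represented as:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HeaderTransaction:
    """Signal emitted by the GPU when a newly written block is ready for
    second graphics processing: block number, block data format, block
    size, and the start address of the block in the storage unit."""
    block_number: int
    data_format: str      # e.g., "RGBA8888" (hypothetical format name)
    block_size: int       # size of the block in bytes
    start_address: int    # offset of the block within the storage unit

# The GPU would emit one such record per generated block, e.g. for a
# hypothetical 64x64 block with 4 bytes per pixel:
hdr = HeaderTransaction(block_number=1, data_format="RGBA8888",
                        block_size=64 * 64 * 4, start_address=0x0)
print(hdr.block_size)  # 16384
```

Making the record immutable (`frozen=True`) mirrors the fact that the header describes a block that has already been written and should not change in flight.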
[0033] In some embodiments, storage unit 110 may include a header transaction region and a data region (as shown in FIG. 4B). Header transaction region may be configured to maintain header transaction information that is written by GPU 108 when an associated block is written to the data region. NPU/DSP 112 may periodically (at regular or irregular intervals) check the header region of storage unit 110 as a mailbox to see if header transaction information is available. When available, NPU/DSP 112 may read the associated block from the data region and begin second graphics processing thereon.
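A minimal software model of the mailbox-style exchange of paragraph [0033] follows; the class and method names are assumptions, and a simple queue stands in for the header transaction region that NPU/DSP 112 polls:

```python
from collections import deque

class StorageUnit:
    """Toy storage unit with a header transaction region and a data region."""
    def __init__(self):
        self.header_region = deque()   # mailbox checked by the NPU/DSP
        self.data_region = {}          # start_address -> block payload

    def write_block(self, addr, payload, header):
        """GPU side: write the block, then post its header transaction."""
        self.data_region[addr] = payload
        self.header_region.append(header)

    def poll_header(self):
        """NPU/DSP side: check the mailbox; None if no header is waiting."""
        return self.header_region.popleft() if self.header_region else None

storage = StorageUnit()
storage.write_block(0x0, b"block-1-pixels", {"block": 1, "addr": 0x0})

hdr = storage.poll_header()                   # NPU/DSP finds a pending header
block = storage.data_region.pop(hdr["addr"])  # read and remove the block
print(block)  # b'block-1-pixels'
```

Popping the block from the data region after reading models the write/read/delete cycle described in paragraph [0031].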
[0034] In some embodiments, each time a block is written to storage unit 110, GPU 108 may increment a counter that keeps track of how many blocks have been written to storage unit 110, which may prevent storage overflow. In some embodiments, NPU/DSP 112 may send a signal to GPU 108 upon completion of second graphics processing of a block. The signal may include header transaction information for the block that was processed by NPU/DSP 112. When such a signal is received, GPU 108 may decrement the counter. In this way, GPU 108 may keep track of the number of blocks currently written to storage unit 110. If, for example, GPU 108 generates a new block for writing to storage unit 110, and the counter indicates that the data region of storage unit 110 cannot hold an additional block, GPU 108 may wait until a signal is received from NPU/DSP 112 indicating that at least one block has been read (and removed) from storage unit 110. Then, the new block may be written to the data region of storage unit 110.
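The counter-based flow control of paragraph [0034] can be sketched as follows. The capacity of two blocks follows the example in paragraph [0031]; the class and method names are hypothetical:

```python
class BlockFlowControl:
    """Tracks how many blocks currently sit in the storage unit so the
    GPU never writes a block that the storage unit cannot hold."""
    def __init__(self, capacity=2):   # e.g., storage holds two blocks at a time
        self.capacity = capacity
        self.counter = 0              # blocks written but not yet consumed

    def can_write(self):
        return self.counter < self.capacity

    def on_block_written(self):       # GPU wrote a block: increment
        assert self.can_write(), "GPU must wait for an NPU/DSP completion signal"
        self.counter += 1

    def on_completion_signal(self):   # NPU/DSP finished a block: decrement
        self.counter -= 1

fc = BlockFlowControl(capacity=2)
fc.on_block_written()                 # block 1 written to storage
fc.on_block_written()                 # block 2 written to storage
assert not fc.can_write()             # GPU must now delay block 3
fc.on_completion_signal()             # NPU/DSP read and processed a block
assert fc.can_write()                 # block 3 may now be written
```

The increment/decrement pair gives the GPU an exact count of outstanding blocks without the GPU and NPU/DSP ever sharing more state than the completion signals themselves.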
[0035] Although not shown, in some embodiments, GPU 108 and NPU/DSP 112 may each include, among others, one or more processing cores (a.k.a. “cores”). A processing core may include one or more functional units that perform various data operations associated with first and/or second graphics processing. For example, processing core may include an arithmetic logic unit (ALU) that performs arithmetic and bitwise operations on data (also known as “operand”), such as addition, subtraction, increment, decrement, AND, OR, Exclusive-OR, etc. Processing core may also include a floating-point unit (FPU) that performs similar arithmetic operations but on a type of operands (e.g., floating-point numbers) different from those operated by the ALU (e.g., binary numbers). The operations may be addition, subtraction, multiplication, etc. Another way of categorizing the functional units may be based on whether the data processed by the functional unit is a scalar or a vector. For example, processing cores may include scalar function units (SFUs) for handling scalar operations and vector function units (VFUs) for handling vector operations. It is understood that in the case that GPU 108 and/or NPU/DSP 112 includes multiple processing cores, each processing core may carry out regular mode data and instruction operations in serial or in parallel. This multi-core processor design can effectively enhance the processing speed of GPU 108 and NPU/DSP 112 and improve their performance.
[0036] It is understood that additional components, although not shown in FIG. 1, may be included in SoC 100 as well, such as components that act as an interface between NPU/DSP 112 and a display device coupled to SoC 100. Additional details of the process flow performed by SoC 100 to generate a frame 202 with the reduced latency/power consumption and increased efficiency, as described above in connection with FIG. 3B, are provided below in connection with FIGs. 4A, 4B, 5A, and 5B.
[0037] FIG. 4A illustrates a flow diagram 400 of a first exemplary graphics processing technique that may be performed by SoC 100 when a direct connection is provided between GPU 108 and NPU/DSP 112. FIG. 4B illustrates a flow diagram 401 of a second exemplary graphics processing technique that may be performed by SoC 100 when a direct connection is not provided between GPU 108 and NPU/DSP 112. FIGs. 4A and 4B will be described together, with reference to FIGs. 2 and 3B.
[0038] Referring to FIGs. 4A and 4B, GPU 108 may generate (at 302b) blocks 204 of frame 202 by first graphics processing. As shown in the example of FIG. 3B, blocks 204 may be generated serially by GPU 108. However, blocks 204 may be generated in parallel when GPU 108 can provide the throughput. In either case, a block may be written (at 402) to storage unit 110 once generated. To reduce processing latency and power consumption, and to increase efficiency, header transaction information 404 may be signaled to NPU/DSP 112. As shown in FIG. 4A, in some embodiments, header transaction information 404 may be signaled via a direct connection between GPU 108 and NPU/DSP 112. However, in some embodiments, header transaction information 404 may be signaled via an indirect connection between GPU 108 and NPU/DSP 112, as shown in FIG. 4B. Referring to FIG. 4B, GPU 108 may send the header transaction to the header region 410 of storage unit 110. NPU/DSP 112 may access header region 410 to check whether a header transaction has been written thereto. In either case, header transaction information 404 may include, e.g., block number, block data format, block size, and/or start address of a newly generated block 204 in storage unit 110.
[0039] However, in both the direct connection and indirect connection embodiments, there is a tradeoff between interface simplicity and the number of memory access operations. For example, while reducing the number of memory transactions by implementing a direct connection, the embodiment depicted in FIG. 4A uses an additional interface between the two processors. On the other hand, while eliminating the interface between the two processors, the embodiment depicted in FIG. 4B employs additional memory access operations in writing/reading header transaction information to/from storage unit 110 by GPU 108 and NPU/DSP 112, respectively.
[0040] In either embodiment, once the header transaction information is received, NPU/DSP 112 may read (at 406) the generated block from the data region 412 (depicted in FIG. 4B) and begin second graphics processing (at 304b). Thus, NPU/DSP 112 may perform (at 304b) second graphics processing while the GPU 108 generates (at 302b) a second block by first graphics processing. In other words, first and second graphics processing of different blocks may be performed concurrently, which enables exemplary SoC 100 to generate and process an entire frame for display with reduced latency/power and increased efficiency (as depicted in FIG. 3B), as compared to conventional techniques (as depicted in FIG. 3A).
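Putting the pieces together, the concurrent producer/consumer behavior can be modeled in software with two threads and a bounded queue standing in for the storage unit. This is an illustrative analogy of the block-level pipelining, not the hardware implementation; all names and the two-slot capacity are assumptions drawn from the examples above:

```python
import queue
import threading

M, N = 2, 3                        # small m*n frame for illustration
storage = queue.Queue(maxsize=2)   # bounded, like a two-block storage unit
processed = []

def gpu_first_processing():
    """Producer: generate blocks serially and write each to storage."""
    for block_no in range(1, M * N + 1):
        block = f"block-{block_no}"     # first graphics processing
        storage.put(block)              # blocks (waits) when storage is full
    storage.put(None)                   # end-of-frame marker

def npu_second_processing():
    """Consumer: post-process each block as soon as it is ready."""
    while True:
        block = storage.get()           # wait for a ready block
        if block is None:
            break
        processed.append(block + "-post")   # second graphics processing

t1 = threading.Thread(target=gpu_first_processing)
t2 = threading.Thread(target=npu_second_processing)
t1.start(); t2.start()
t1.join(); t2.join()

assert processed == [f"block-{i}-post" for i in range(1, M * N + 1)]
```

The bounded queue's blocking `put` plays the role of the counter-based back-pressure: the producer stalls exactly when the storage unit is full, and resumes as soon as the consumer drains a block.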
[0041] The present graphics processing techniques may be performed for tile-based rendering (TBR) or tile-based deferred rendering (TBDR), as the aforementioned blocks 204 can be a single tile or a group of tiles. Once the tile or tiles are ready, the header transaction information can be transferred to NPU/DSP 112 (e.g., a downstream pipeline coprocessor NPU or DSP), which may begin second graphics processing of the tile. Moreover, the present graphics processing techniques may be applied to various fields, e.g., such as super resolution, image recognition, etc.
[0042] Still further, some graphics processing techniques may be performed for full frame or adjacent blocks, which may be accomplished using the techniques of FIGs. 4A and 4B through appropriate header transaction signaling and storage allocation. For example, GPU 108 may send the header transaction information to NPU/DSP 112 when a sufficient number of small blocks have been generated, or the entire frame has been generated by first graphics processing.
[0043] FIG. 5A illustrates a flowchart of a first exemplary method 500 for graphics processing, according to embodiments of the disclosure. First exemplary method 500 may be performed by a GPU, e.g., such as GPU 108. First exemplary method 500 may include operations 502-516. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5A.
[0044] Referring to FIG. 5A, at 502, the GPU may generate a first block of a frame by first graphics processing. For example, referring to FIGs. 2, 3B, 4A, and 4B, GPU 108 may generate (at 302b) a first block 204 (e.g., block 1 in FIG. 2).
[0045] At 504, the GPU may send the first block to a storage unit and increment a counter. For example, referring to FIGs. 1, 2, 3B, 4A, and 4B, GPU 108 may send the generated first block to storage unit 110. In some embodiments, GPU 108 may send the generated first block to data region 412, as shown in FIG. 4B. When the first block is sent to storage unit 110, GPU 108 may increment a counter to keep track of how many blocks are temporarily maintained by storage unit 110 and awaiting second graphics processing by NPU/DSP 112. Because storage unit 110 may have limited storage capacity, using a counter may prevent GPU 108 from causing storage/buffer overflow by sending blocks that the storage unit 110 cannot simultaneously maintain.
[0046] At 506, the GPU may output a first signal to an NPU or DSP. The first signal may indicate to the NPU or DSP that the first block is generated and available for second graphics processing. By outputting the signal to the NPU or DSP, the NPU or DSP may begin the second graphics processing of the first block concurrent with the GPU generating a second block by the first graphics processing. For example, referring to FIG. 4A, the first signal may include header transaction information 404 that is sent via a direct connection between GPU 108 and NPU/DSP 112. In some embodiments, as shown in FIG. 4B, the header transaction information 404 may be sent to a header region 410 of storage unit 110 that is checked by NPU/DSP 112.
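As a sketch, the header transaction information might be packaged as a small record. The field names below follow the examples given later in the disclosure (a block number, a block data format, a block size, and a start address in the storage unit), but the concrete layout, types, and values are assumptions, not a defined wire format:

```python
from dataclasses import dataclass

@dataclass
class HeaderTransaction:
    # Fields mirror the examples in the disclosure; names/types are illustrative.
    block_number: int
    data_format: str    # e.g., "RGBA8888" (hypothetical format tag)
    block_size: int     # size of the block in bytes
    start_address: int  # offset of the block within the data region

# Announcing a 64x64 4-byte-per-pixel block written at offset 0:
hdr = HeaderTransaction(block_number=1, data_format="RGBA8888",
                        block_size=64 * 64 * 4, start_address=0x0)
print(hdr.block_size)  # 16384
```

Whether this record travels over a direct connection (FIG. 4A) or is posted to the header region of the storage unit (FIG. 4B), the consumer only needs these fields to locate and interpret the block.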
[0047] At 508, the GPU may determine whether all blocks of the frame have been rendered. For example, referring to FIGs. 2, 4A and 4B, GPU 108 may determine whether all m*n blocks 204 of frame 202 have been rendered each time a block is written (at 402) to storage unit 110. Upon determining that all blocks 204 have been generated, the operation is complete for the frame. On the other hand, when all m*n blocks 204 have not been rendered, the operations may move to 510.
[0048] At 510, the GPU may determine whether the counter is greater than or equal to a threshold. The threshold may be set based on the maximum number of blocks that can be maintained by storage unit 110. Referring to FIGs. 1, 4A, and 4B, for example, when GPU 108 generates a new block (e.g., block 2) for writing to storage unit 110, GPU 108 checks whether the counter has reached the threshold (e.g., a threshold of 2). A counter value equal to the threshold indicates storage unit 110 cannot currently maintain any additional blocks. When the counter is less than the threshold, the operations may return to 502 and GPU 108 may generate the next block 204. Otherwise, when the counter is greater than or equal to the threshold, the operation may move to 512.
[0049] At 512, the GPU waits until the counter value is less than the threshold. For example, referring to FIGs. 1, 4A, and 4B, GPU 108 may wait until a signal is received from NPU/DSP 112 indicating that at least one block has been read (and removed) from storage unit 110. Then, the new block may be written to the data region 412 of storage unit 110.
[0050] Concurrent with the operations of 502-512, operations 514 and 516 may be performed by the GPU. For example, at 514, the GPU may receive a second signal from the NPU or DSP. The second signal may indicate that the second graphics processing of the first block is complete. For example, referring to FIGs. 1, 4A, and 4B, NPU/DSP 112 may send a second signal (header transaction information 404) to GPU 108 upon completion of second graphics processing of a block. The signal may include header transaction information of the block processed by NPU/DSP 112. The second signal may be sent via direct connection, as in FIG. 4A. In some embodiments, as seen in FIG. 4B, the second signal may be sent to the header region 410 of storage unit 110. Here, GPU 108 may check header region 410 to determine whether a second signal is waiting for retrieval.
[0051] At 516, the GPU may decrement the counter upon receipt of the second signal. For example, referring to FIGs. 1, 4A, and 4B, when the signal is received, GPU 108 may decrement the counter. In this way, GPU 108 may keep track of the number of blocks written to storage unit 110.
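Operations 502-516 amount to a producer-side flow-control loop. The single-threaded Python sketch below models that loop; the completion signals that would arrive asynchronously from the NPU/DSP are approximated by draining one pending block whenever the counter reaches the threshold (the function name, the drain model, and the trace format are all invented for illustration):

```python
from collections import deque

def gpu_producer(num_blocks, threshold):
    counter = 0        # blocks currently held by the storage unit
    storage = deque()  # stand-in for the data region of the storage unit
    trace = []
    for block in range(num_blocks):    # 502: generate block by 1st processing
        while counter >= threshold:    # 510/512: storage unit is full, wait
            storage.popleft()          # modeled NPU/DSP read (at 406)
            counter -= 1               # 514/516: second signal, decrement
            trace.append("drain")
        storage.append(block)          # 504: send block to storage unit...
        counter += 1                   #      ...and increment the counter
        trace.append(f"send {block}")  # 506: first signal to NPU/DSP
    return trace

print(gpu_producer(4, 2))
# ['send 0', 'send 1', 'drain', 'send 2', 'drain', 'send 3']
```

In a real SoC the drain step would be triggered by the second signal from NPU/DSP 112 rather than synchronously; the property of interest is that the counter never exceeds the threshold, so the storage unit cannot overflow.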
[0052] FIG. 5B illustrates a flowchart of a second exemplary method 501 for graphics processing, according to embodiments of the disclosure. Second exemplary method 501 may be performed by an NPU or DSP, e.g., such as NPU/DSP 112. Second exemplary method 501 may include operations 522-528. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5B.
[0053] Referring to FIG. 5B, at 522, the NPU or DSP may obtain a first signal indicating a first block of a frame has been generated and is ready for second graphics processing. For example, referring to FIG. 4A, the first signal may include header transaction information 404 that is sent via a direct connection between GPU 108 and NPU/DSP 112. In some embodiments, as shown in FIG. 4B, the header transaction information 404 may be sent to a header region 410 of storage unit 110 that is checked by NPU/DSP 112.
[0054] At 524, the NPU or DSP may obtain the first block from the storage unit. For example, referring to FIGs. 4A and 4B, NPU/DSP 112 may read (at 406) the first block from storage unit 110.
[0055] At 526, the NPU or DSP may perform second graphics processing of the first block. For example, referring to FIGs. 4A and 4B, NPU/DSP 112 may perform (at 304b) second graphics processing of the first block. Because the first signal indicates the first block is ready for second graphics processing, the second graphics processing may proceed concurrent with the first graphics processing, thereby reducing latency and improving efficiency.
[0056] At 528, the NPU or DSP may send a second signal to the GPU when the second graphics processing of the first block is complete. For example, referring to FIGs. 1, 4A, and 4B, NPU/DSP 112 may send a second signal (header transaction information 404) to GPU 108 upon completion of second graphics processing of a block. The signal may include header transaction information of the block that was processed by NPU/DSP 112. The second signal may be sent via direct connection, as in FIG. 4A. In some embodiments, as seen in FIG. 4B, the second signal may be sent to the header region 410 of storage unit 110. Here, GPU 108 may check header region 410 to determine whether a second signal is waiting for retrieval.
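The consumer side (operations 522-528) can be sketched as a mailbox-polling loop in the style of FIG. 4B. The dict-based header and data regions, and the pixel-doubling stand-in for "second graphics processing," are placeholders invented for illustration, not the actual NPU/DSP operations:

```python
def npu_consumer(header_region, data_region):
    completed = []
    while header_region:                           # poll the mailbox
        hdr = header_region.pop(0)                 # 522: obtain first signal
        addr = hdr["start_address"]
        block = data_region[addr]                  # 524: read block (at 406)
        # 526: placeholder second graphics processing (double each pixel)
        data_region[addr] = [px * 2 for px in block]
        completed.append(hdr["block_number"])      # 528: second signal to GPU
    return completed

# Two announced blocks waiting in the header region:
headers = [{"block_number": 1, "start_address": 0},
           {"block_number": 2, "start_address": 1}]
data = {0: [1, 2, 3], 1: [4, 5, 6]}
print(npu_consumer(headers, data))  # [1, 2]
print(data[0])                      # [2, 4, 6]
```

The returned list of block numbers plays the role of the second signals: one per processed block, letting the GPU decrement its counter (at 516).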
[0057] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as SoC 100 of FIG. 1. By way of example, and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, hard disk drive (HDD) or other magnetic storage devices, flash drive, solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, include CD, laser disc, optical disc, DVD, and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
[0058] According to one aspect of the disclosure, an SoC is provided. The SoC may include a GPU configured to generate a set of blocks of a frame by first graphics processing. The SoC may also include an NPU or DSP configured to initiate second graphics processing of the set of blocks of the frame prior to an entirety of the frame being generated by the first graphics processing.
[0059] In some embodiments, the GPU may be configured to generate the set of blocks of the frame by the first graphics processing by generating a first block of the frame by the first graphics processing. In some embodiments, the GPU may be configured to generate the set of blocks of the frame by the first graphics processing by generating a second block of the frame by the first graphics processing.
[0060] In some embodiments, the GPU may be further configured to send the first block to a storage unit. In some embodiments, the GPU may be further configured to output a first signal indicating that the first block is ready for the second graphics processing. In some embodiments, the first signal may be output prior to the generating of the second block.
[0061] In some embodiments, the first signal may include one or more of header transaction information, a block number, a block data format, block size, or a start address of the first block in the storage unit.
[0062] In some embodiments, the first signal may be output via a direct connection with the NPU or DSP.
[0063] In some embodiments, the first signal may be output to a header transaction region of the storage unit.
[0064] In some embodiments, the storage unit may include a buffer, a cache, or a DDR memory.
[0065] In some embodiments, the NPU or DSP may be configured to obtain the first signal indicating that the first block is ready for the second graphics processing. In some embodiments, the NPU or DSP may be configured to obtain the first block from the storage unit. In some embodiments, the NPU or DSP may be configured to perform the second graphics processing of the first block. In some embodiments, the second graphics processing of the first block may be concurrent with the first graphics processing of the second block.
[0066] In some embodiments, the NPU or DSP may be further configured to output a second signal indicating that the second graphics processing of the first block is complete.
[0067] In some embodiments, the second signal may be output via a direct connection with the GPU.
[0068] In some embodiments, the second signal may be output to a header transaction region of the storage unit.
[0069] In some embodiments, the GPU may be further configured to obtain the second signal indicating that the second graphics processing of the first block is complete. In some embodiments, the GPU may be further configured to increment a counter based on a block generation.
[0070] In some embodiments, the GPU may be further configured to, in response to the counter meeting a threshold, delay sending the second block to the storage unit. In some embodiments, the GPU may be further configured to, in response to the counter not meeting the threshold, send the second block to the storage unit.
[0071] According to another aspect of the disclosure, an SoC is provided. The SoC may include a GPU configured to generate a first block of a frame by first graphics processing. The SoC may include a GPU configured to send the first block to a storage unit. The SoC may include a GPU configured to output, to an NPU or DSP, a signal indicating that the first block is ready for second graphics processing by the NPU or DSP. In some embodiments, the signal may be output prior to generating a second block of the frame by the first graphics processing.
[0072] In some embodiments, the SoC may include the NPU or DSP configured to, in response to receiving the signal output by the GPU, perform second graphics processing of the first block. In some embodiments, the GPU may be further configured to generate the second block of the frame by the first graphics processing. In some embodiments, the second graphics processing of the first block by the NPU or DSP may be performed concurrent to the first graphics processing of the second block by the GPU.
[0073] In some embodiments, the GPU is further configured to generate the second block of the frame by the first graphics processing.
[0074] In some embodiments, the signal may be output to the NPU or DSP via a direct connection. In some embodiments, the signal may be output to the NPU or DSP via the storage unit as a mailbox.
[0075] According to yet another aspect of the disclosure, an SoC is provided. The SoC may include an NPU or DSP configured to obtain, from a GPU, a signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing. The NPU or DSP may be configured to obtain the first block from a storage unit. The NPU or DSP may be configured to perform the second graphics processing of the first block. In some embodiments, the second graphics processing of the first block by the NPU or DSP may begin prior to a completion of the first graphics processing of the frame.
[0076] In some embodiments, the SoC may further include a GPU configured to generate a second block by the first graphics processing. In some embodiments, the second graphics processing of the first block by the NPU or DSP may be performed concurrent to the first graphics processing of the second block by the GPU.
[0077] According to a further aspect of the disclosure, a method of graphics processing by a GPU is disclosed. The method may include generating a first block of a frame by first graphics processing. The method may include sending the first block to a storage unit. The method may include outputting a first signal to an NPU or DSP. The first signal may indicate that the first block is ready for second graphics processing. The first signal may include first header transaction information of the first block. The method may include generating a second block of the frame by the first graphics processing. The first graphics processing of the second block by the GPU may be performed concurrent to the second graphics processing of the first block by the NPU or DSP.
[0078] According to still another aspect of the disclosure, a method of graphics processing by an NPU or DSP is disclosed. The method may include obtaining, from a GPU, a first signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing. The method may include obtaining the first block from a storage unit. The method may include performing second graphics processing of the first block. The second graphics processing of the first block may occur prior to a completion of the first graphics processing of the frame by the GPU.
[0079] The foregoing description of the embodiments will so reveal the general nature of the present disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
[0080] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
[0081] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.
[0082] Various functional blocks, modules, and steps are disclosed above. The arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be re-ordered or combined in different ways than in the examples provided above. Likewise, certain embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.
[0083] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A system-on-chip (SoC), comprising: a graphics processing unit (GPU) configured to generate a set of blocks of a frame by first graphics processing; and a neural processing unit (NPU) or digital signal processor (DSP) configured to initiate second graphics processing of the set of blocks of the frame prior to an entirety of the frame being generated by the first graphics processing.
2. The SoC of claim 1, wherein the GPU is configured to generate the set of blocks of the frame by the first graphics processing by: generating a first block of the frame by the first graphics processing; and generating a second block of the frame by the first graphics processing.
3. The SoC of claim 2, wherein the GPU is further configured to: send the first block to a storage unit; and output a first signal indicating that the first block is ready for the second graphics processing, wherein the first signal is output prior to the generating of the second block.
4. The SoC of claim 3, wherein the first signal includes one or more of header transaction information, a block number, a block data format, block size, or a start address of the first block in the storage unit.
5. The SoC of claim 3, wherein the first signal is output via a direct connection with the NPU or DSP.
6. The SoC of claim 3, wherein the first signal is output to a header transaction region of the storage unit.
7. The SoC of claim 3, wherein the storage unit comprises a buffer, a cache, or a double data rate (DDR) memory.
8. The SoC of claim 3, wherein the NPU or DSP is configured to: obtain the first signal indicating that the first block is ready for the second graphics processing; obtain the first block from the storage unit; and perform the second graphics processing of the first block, wherein the second graphics processing of the first block is concurrent with the first graphics processing of the second block.
9. The SoC of claim 8, wherein the NPU or DSP is further configured to: output a second signal indicating that the second graphics processing of the first block is complete.
10. The SoC of claim 9, wherein the second signal is output via a direct connection with the GPU.
11. The SoC of claim 9, wherein the second signal is output to a header transaction region of the storage unit.
12. The SoC of claim 9, wherein the GPU is further configured to: obtain the second signal indicating that the second graphics processing of the first block is complete; and decrement a counter.
13. The SoC of claim 12, wherein the GPU is further configured to: in response to the counter meeting a threshold, delay sending the second block to the storage unit; or in response to the counter not meeting the threshold, send the second block to the storage unit; and increment a counter.
14. A system-on-chip (SoC), comprising: a graphics processing unit (GPU) configured to: generate a first block of a frame by first graphics processing; send the first block to a storage unit; and output, to a neural processing unit (NPU) or digital signal processor (DSP), a signal indicating that the first block is ready for second graphics processing by the NPU or DSP, wherein the signal is output prior to generating a second block of the frame by the first graphics processing.
15. The SoC of claim 14, further comprising: the NPU or DSP configured to: in response to receiving the signal output by the GPU, perform second graphics processing of the first block, wherein the GPU is further configured to generate the second block of the frame by the first graphics processing, and wherein the second graphics processing of the first block by the NPU or DSP is performed concurrent to the first graphics processing of the second block by the GPU.
16. The SoC of claim 14, wherein: the signal is output to the NPU or DSP via a direct connection, or the signal is output to the NPU or DSP via the storage unit.
17. A system-on-chip (SoC), comprising: a neural processing unit (NPU) or digital signal processor (DSP) configured to: obtain, from a graphics processing unit (GPU), a signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing; obtain the first block from a storage unit; and perform the second graphics processing of the first block, wherein the second graphics processing of the first block by the NPU or DSP begins prior to a completion of the first graphics processing of the frame.
18. The SoC of claim 17, further comprising: the GPU configured to generate a second block by the first graphics processing, wherein the second graphics processing of the first block by the NPU or DSP is performed concurrent to the first graphics processing of the second block by the GPU.
19. A method of graphics processing by a graphics processing unit (GPU), comprising: generating a first block of a frame by first graphics processing; sending the first block to a storage unit; outputting a first signal to a neural processing unit (NPU) or digital signal processor (DSP), the first signal indicating that the first block is ready for second graphics processing, and the first signal including first header transaction information of the first block; and generating a second block of the frame by the first graphics processing, wherein the first graphics processing of the second block by the GPU is performed concurrent to the second graphics processing of the first block by the NPU or DSP.
20. A method of graphics processing by a neural processing unit (NPU) or digital signal processor (DSP), comprising: obtaining, from a graphics processing unit (GPU), a signal indicating a first block of a frame generated by first graphics processing is ready for second graphics processing; obtaining the first block from a storage unit; and performing second graphics processing of the first block, wherein the second graphics processing of the first block occurs prior to a completion of the first graphics processing of the frame by the GPU.
PCT/US2021/046739 2021-08-19 2021-08-19 Apparatus and method of block-based graphics processing in a system-on-a-chip WO2023022722A1 (en)