US20210191729A1 - System and method for high throughput in multiple computations - Google Patents

System and method for high throughput in multiple computations

Info

Publication number
US20210191729A1
US20210191729A1 (application US 17/167,077)
Authority
US
United States
Prior art keywords
unit
data
graphical data
register file
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US17/167,077
Inventor
Shahar HANIA
Hanan ZELTZER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rail Vision Ltd
Original Assignee
Rail Vision Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rail Vision Ltd
Priority to US 17/167,077
Assigned to RAIL VISION LTD. Assignors: HANIA, Shahar; ZELTZER, Hanan
Publication of US20210191729A1
Status: Abandoned


Classifications

    • G06F 9/3851: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution, from multiple instruction streams, e.g. multistreaming
    • G06T 1/20: Processor architectures; processor configuration, e.g. pipelining
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 8/441: Register allocation; assignment of physical memory space to logical memory space
    • G06F 9/30043: LOAD or STORE instructions; clear instruction
    • G06F 9/30098: Register arrangements
    • G06F 9/3877: Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06T 1/60: Memory management
    • G06F 12/0806: Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 13/30: Handling requests for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), with priority control
    • G06F 13/4022: Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G06F 2212/621: Coherency control relating to peripheral accessing, e.g. from DMA or I/O device


Abstract

A device, circuit and method are configured to enhance the throughput of processing of vast amounts of data such as a video stream. In some embodiments, frequently used data blocks are stored in a fast RAM of the processor. In another embodiment, a received stream of data is divided into a plurality of data portions that are streamed concurrently to streaming multiprocessors of a graphics processing unit (GPU) and processed concurrently before the entire stream is loaded.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 16/642,026, filed on Feb. 26, 2020, which is a National Phase Application of PCT International Application No. PCT/IL2018/050965, International Filing Date Aug. 30, 2018, published as WO 2019/043710 on Mar. 7, 2019 and entitled "System and Method for High Throughput in Multiple Computations", claiming the benefit of U.S. Provisional Patent Application No. 62/552,475, filed Aug. 31, 2017, which is hereby incorporated by reference.
  • BACKGROUND OF THE INVENTION
  • Computing systems that are adapted to handle/process vast amounts of graphical data and computations typically comprise, aside from a central processing unit (CPU), a multiple processing unit (MPU) such as a GPU, GPGPU, DSP, SIMD-based processing unit or VLIW-based processing unit adapted and designated for handling and processing the required data. Such computing systems' structure is well known in the art. This structure typically splits the computing tasks between the CPU and the MPU so that the heavy computations are assigned to the MPU, leaving the rest of the computation tasks for the CPU to handle.
  • However, this well-known structure suffers from low efficiency where large amounts of graphical data are involved, due to the large amount of handling resources required for managing the back-and-forth transfer of the graphical data between the CPU and the MPU. In some cases, the net time usable for data computations in a CPU-MPU computing structure may be as low as 5% or less. For example, in an Nvidia® Compute Unified Device Architecture (CUDA) parallel computing platform and application programming interface model, typical time portions spent on graphical data handling may be 49% for transferring the graphical data from the CPU environment to the GPU environment (e.g. to CUDA memory), 47% for transferring the graphical data from the GPU environment back to the CPU environment (from CUDA memory) and no more than 4% for graphical computations. Such very low graphical computation efficiency stems from the common architectures defining the way graphical data is transferred between the processors.
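  • For illustration only (this sketch is not part of the original disclosure), the time breakdown described above can be measured on a conventional CPU-GPU path with CUDA events; the kernel name `process` and the buffer size are arbitrary placeholders. On transfer-bound workloads the two copy percentages dominate the kernel percentage, which is the imbalance the embodiments below address.

```cuda
// Illustrative measurement sketch: time the host-to-device copy, the kernel,
// and the device-to-host copy separately and report each as a share of total.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(float* buf, size_t n) {        // stand-in for real graphical work
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 24;                           // ~64 MB of floats
    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void**)&h, n * sizeof(float));      // pinned host buffer
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // CPU -> GPU
    cudaEventRecord(t1);
    process<<<(unsigned)((n + 255) / 256), 256>>>(d, n);           // compute
    cudaEventRecord(t2);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // GPU -> CPU
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2d = 0, krn = 0, d2h = 0;
    cudaEventElapsedTime(&h2d, t0, t1);                 // milliseconds
    cudaEventElapsedTime(&krn, t1, t2);
    cudaEventElapsedTime(&d2h, t2, t3);
    float total = h2d + krn + d2h;
    printf("H2D %.1f%%  kernel %.1f%%  D2H %.1f%%\n",
           100 * h2d / total, 100 * krn / total, 100 * d2h / total);

    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```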
  • There is a need to enable a substantial rise in MPU efficiency, that is, a substantial rise in the time portion assigned to graphical calculations.
  • SUMMARY OF THE INVENTION
  • A method for enhancing graphical data throughput exchanged between a graphical data source and a graphical processing unit (GPU) via a streaming multiprocessor unit is disclosed. The GPU may comprise a processing core unit (PCU), a register file unit, multiple cache units, a shared memory unit, a unified cache unit and an interface cache unit. The method may comprise transferring a stream of graphical data via the interface cache unit, the multiple cache units and the unified cache unit to the register file unit, transferring a second stream of graphical data from the register file unit to the processing core unit, and storing and receiving frequently used portions of data in the shared memory unit, via the register file unit.
  • In some embodiments the register file unit is configured to direct data processed by the PCU to the shared memory unit as long as it is capable of receiving more data, based on the level of frequent use of that data.
  • In some embodiments the level of frequent use is determined by the PCU.
  • A streaming multiprocessor unit for enhancing throughput of processing of data is disclosed, comprising a processing core unit (PCU) configured to process graphical data, a register file unit configured to provide graphical data from the PCU and to receive and temporarily store processed graphical data from the PCU, multiple cache units configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit, a shared memory unit configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit, a unified cache unit configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit, and an interface cache unit configured to receive graphical data for graphical processing at a high pace, to provide the graphical data to at least one of the shared memory unit and the unified cache unit, to receive processed graphical data from the unified cache unit, and to provide the processed graphical data to external processing units.
  • In some embodiments at least some of the graphical data elements are stored, before and/or after processing by the PCU in the shared memory unit, based on a priority figure that is associated with the probability of their close call by the PCU.
  • In some embodiments the priority figure is higher as the probability is higher.
  • A circuit for handling unprocessed data is disclosed comprising a data stream divider unit (DSDU) and a graphics processing unit (GPU). The DSDU comprising an array comprising plurality of first-in-first-out (FIFO) registers, configured to receive a stream of data and to divide it into portions of data and to pass each of the portions of data through one of the plurality of FIFO registers and a first advanced extensible interface (AXI) unit configured to receive the data portions. The GPU comprising a second advanced extensible interface (AXI) unit configured to receive data portions from the first AXI unit and a plurality of streaming multiprocessors (SM) configured to receive each data portion from a respective FIFO register, and to process the received data portion.
  • In some embodiments a specific FIFO register in the DSDU is connected to an assigned SM in the GPU via an assigned first AXI unit in the DSDU and an assigned second AXI unit in the GPU.
  • In some embodiments each of the FIFO registers in the DSDU is connected to an assigned SM in the GPU via a first common AXI unit in the DSDU and a common AXI unit in the GPU.
  • A method for efficiently processing large amount of data is disclosed comprising receiving a stream of unprocessed data, dividing the stream to a plurality of data portions, passing each data portion via a specific FIFO register in a data stream divider unit (DSDU), and transferring the data portion from the specific FIFO register to an assigned streaming multiprocessor (SM) in graphics processor unit (GPU) for processing.
  • In some embodiments the data portions are transferred via a first specific advanced extensible interface (AXI) unit in the DSDU and a second specific advanced extensible interface (AXI) unit in the GPU.
  • In some embodiments a data portion received from a specific FIFO register is transferred to the assigned SM in the GPU via an assigned first AXI unit in the DSDU and an assigned second AXI unit in the GPU.
  • In some embodiments each of the data portion received from FIFO registers in the DSDU is transferred to the assigned SM in the GPU via a common first AXI unit in the DSDU and a common second AXI unit in the GPU.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
  • FIG. 1 schematically illustrates data flow in computing unit using a GPU;
  • FIG. 2 is a schematic block diagram of a typical streaming multiprocessor (SM) in a GPU unit;
  • FIG. 3A is a schematic block diagram depicting an unprocessed data (UPD) handling unit (UPDHU) 300, structured and operative according to embodiments of the present invention, and
  • FIGS. 3B and 3C are schematic block diagrams of two different embodiments of a UPDHU, such as UPDHU 300 of FIG. 3A.
  • It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
  • DETAILED DESCRIPTION OF THE INVENTION
  • In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
  • The bottleneck of CPU-GPU mutual operation in known computing systems lies, mostly, in the data transfer channels used for directing graphical-related data by the CPU to the GPU and for receiving the processed graphical data back from the GPU. Typically, the CPU and the GPU processors operate and communicate in standard computing environments.
  • Reference is made to FIG. 1, schematically illustrating data flow in computing unit 100 using a GPU. Computing unit 100 comprises CPU 111, CPU dynamic RAM (DRAM) 111A, and computing unit peripheral controlling unit (such as a main board chipset) 112. Unit 100 further comprises GPU unit 150, communicating data with the CPU via unit 112.
  • GPU unit 150 typically comprises GPU DRAM unit 154, interfacing data between unit 112 and the GPU processors, GPU cache units 156 (such as L2 cache units) that are adapted to cache data for the GPU processing units, and GPU processing units 158 (such as streaming multiprocessors/SMs).
  • The flow of graphical data that enters processing unit 100 and is intended to be processed by GPU 150 is described by data flow (DF) arrows. First Data flow—DF1 depicts the flow of data into computing unit 100, where CPU 111 directs the flow—DF2—via peripheral controlling unit (PCU) 112, to DRAM 111A and back from it—DF3—via PCU 112—DF4—to GPU 150. At GPU 150 the flow of the data passes through DRAM unit 154 and through cache units 156 to the plurality of streaming multiprocessors (SMs) units 158 where graphical processing takes place.
  • It is a target of methods and structures according to the present invention to eliminate as many data flow bottlenecks as possible.
  • Reference is made now to FIG. 2, which is a schematic block diagram of a typical streaming multiprocessor (SM) 200 in a GPU unit. SM 200 may comprise processing core unit 210 (sometimes called a Compute Unified Device Architecture (CUDA) core), register file 220 to mediate data between core 210 and cache units 230 (constant cache) and 250 (unified cache), and with shared memory 240. Data inbound towards SM 200 and outbound from it is exchanged with the GPU cache unit 256 (such as cache units 156 (L2)) of FIG. 1. When graphical processing is carried out in known methods, the GPU unit will wait until the entire amount of data to be processed is loaded onto the memory units of its several SM 200 units before graphical processing commences.
  • One way of reducing data transfer time is to minimize redundant data transfers. For example, intermediate results calculated by core 210 may be stored in register file 220 instead of storing them in the DRAM. Further, shared memory 240 may be used for storing data that is frequently used within SM 200, instead of circulating it outbound, as is commonly done. In some embodiments the level of frequency of use is determined by the PCU. Still further, constant memory units and/or cache memory units may be defined in SM 200.
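  • As a hedged illustration of keeping frequently used data inside the SM (the kernel name, tile size and radius are assumptions, not taken from this disclosure), a CUDA kernel may stage a reused tile in __shared__ memory so that each input element is fetched from DRAM only once and all further accesses are served on-chip:

```cuda
// Sketch: a 1-D stencil in which every input element is reused by several
// neighbouring threads. Staging the tile plus its halo in __shared__ memory
// keeps the frequently used values inside the SM instead of re-reading them
// from DRAM.
#define RADIUS 3
#define TILE   256      // must equal blockDim.x

__global__ void stencil_shared(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    int lid = threadIdx.x + RADIUS;                    // position inside the tile

    tile[lid] = (gid < n) ? in[gid] : 0.0f;            // one global read per element
    if (threadIdx.x < RADIUS) {                        // load the left/right halos
        int left  = gid - RADIUS;
        int right = gid + TILE;
        tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[lid + TILE]   = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();                                   // tile is now resident on-chip

    if (gid < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)        // 2*RADIUS+1 reuses per element,
            acc += tile[lid + k];                      // all served from shared memory
        out[gid] = acc;
    }
}
```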
  • According to further embodiments of the present invention, the data flow bottleneck between the CPU computing environment and the GPU computing environment may be reduced or eliminated by replacing the CPU with a specifically structured computing unit for all handling of graphical-related data.
  • Reference is made now to FIG. 3A, which is a schematic block diagram depicting an unprocessed data (UPD) handling unit (UPDHU) 300, structured and operative according to embodiments of the present invention, and to FIGS. 3B and 3C, which are schematic block diagrams of two different embodiments, 350 and 380, of a UPDHU, such as UPDHU 300 of FIG. 3A. The term ‘unprocessed data’ as used hereinafter relates to a large stream of data that is about to be processed and that typically requires large computation capacity, such as a fast stream of graphical data (e.g. received from a 4K video camera) that needs to be processed in virtually “real time” (i.e. with as small a latency as possible). The architecture of UPDHU 300 depicted in FIG. 3A is designed to overcome the inherent bottleneck of data stream flow typical of known CPU-GPU architectures, where an incoming stream of acquired data is first handled by the CPU, then temporarily stored in the CPU memory and/or RAM associated with the CPU, then transferred, again as a data stream (e.g. over a Peripheral Component Interconnect Express (PCIe) bus), to the GPU and again handled by the GPU processor before it is sent to the multiple streaming processors that are part of the GPU. The example described herein with regard to FIG. 3A demonstrates use of a field programmable gate array (FPGA) programmed to operate according to the advanced extensible interface (AXI); however, it would be apparent to those skilled in the art that the method of operation described herein may be embodied using other computing units that are adapted to interface with a respective GPU and to transfer large amounts of graphical-related data at high throughput. According to embodiments of the invention, data stream divider unit (DSDU) 304 may be embodied, for example, using an FPGA that is programmed to receive a large amount of streamed UPD, e.g. a video stream from a camera, to distribute it into a plurality of smaller streams and to transfer the streams to the SMs of a GPU. The FPGA and the GPU may further be programmed to operate so that the GPU begins processing the graphical data transferred to it as soon as at least one SM of the plurality of SMs of the GPU is fully loaded. In most cases the fully loaded SMs hold an amount of data that is smaller than the full data file; therefore the processing by the GPU will begin, according to this embodiment, much earlier compared to commonly known embodiments where the processing begins only after the entire data file has been loaded to the GPU.
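  • Standard CUDA does not expose the FPGA/AXI-to-SM path described above; purely as an approximation of the idea of starting to process before the entire file is loaded, the following sketch chunks the input and overlaps asynchronous copies with per-chunk kernel launches on independent CUDA streams. Function and kernel names are illustrative assumptions.

```cuda
// Approximation only: each portion is processed as soon as it arrives,
// instead of waiting for the full buffer to be transferred.
#include <cuda_runtime.h>

__global__ void process_portion(float* portion, size_t len);   // per-portion work (defined elsewhere)

void stream_and_process(const float* pinned_host, float* device,
                        size_t total, size_t chunk, int n_streams) {
    cudaStream_t* streams = new cudaStream_t[n_streams];
    for (int i = 0; i < n_streams; ++i) cudaStreamCreate(&streams[i]);

    int s = 0;
    for (size_t off = 0; off < total; off += chunk, s = (s + 1) % n_streams) {
        size_t len = (off + chunk <= total) ? chunk : total - off;
        // Copy one portion asynchronously ...
        cudaMemcpyAsync(device + off, pinned_host + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        // ... and queue its kernel on the same stream: it runs as soon as the
        // portion lands, while later portions are still in flight elsewhere.
        process_portion<<<(unsigned)((len + 255) / 256), 256, 0, streams[s]>>>(device + off, len);
    }
    for (int i = 0; i < n_streams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
    delete[] streams;
}
```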
  • In an exemplary embodiment UPDHU 300 comprises a Multi Streamer unit (MSU) 310 that may comprise a DSDU 304 comprising a plurality of first-in-first-out (FIFO) registers/storage units in array 304A (the FIFO units are not shown separately), of which one FIFO unit may be assigned to each of the SMs 318 of GPU 320. In some embodiments the UPD stream received by DSDU 304 may be partitioned into multiple data units, which may be transferred to GPU 320 via FIFO units 304A and broadcast to the GPU over an interface unit, such as an AXI interface, such that the data unit in each FIFO 304A is transferred to the associated SM 318, thereby enabling, for example, single instruction, multiple data (SIMD) computing. When each (even a single) SM 318 of GPU 320 is loaded with the respective portion of the unprocessed data received from the associated FIFO 304A unit over an AXI interface, GPU 320 may start processing, without having to wait until the entire UPD file is loaded.
  • MSU 310 may comprise unprocessed data interface unit 302, configured to receive long streams of graphical data. The large amount of unprocessed data received via interface unit 302 may be partitioned into a plurality of smaller data units, each to be transferred via an assigned FIFO unit in FIFO array 304A and then, over AXI channel 315, via GPU AXI interface 316, to the assigned SM 318 of GPU 320.
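  • For illustration only, a host-side software model of the stream division performed by DSDU 304 (in the disclosure the divider is an FPGA) may look like the sketch below, which cuts an incoming stream into fixed-size portions and pushes each portion into the FIFO assigned to the next SM in round-robin order. Class and method names are assumptions.

```cuda
// Host-side software model of the round-robin FIFO distribution only.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

struct Portion { std::vector<uint8_t> bytes; };

class DataStreamDivider {
public:
    DataStreamDivider(std::size_t num_sms, std::size_t portion_size)
        : fifos_(num_sms), portion_size_(portion_size) {}

    // Consume a chunk of the incoming stream and distribute it over the FIFOs.
    void push_stream(const uint8_t* data, std::size_t len) {
        for (std::size_t off = 0; off < len; off += portion_size_) {
            std::size_t n = std::min(portion_size_, len - off);
            Portion p;
            p.bytes.assign(data + off, data + off + n);
            fifos_[next_].push(std::move(p));
            next_ = (next_ + 1) % fifos_.size();        // round-robin over the SMs
        }
    }

    // Pop the next portion destined for a given SM (empty if none is waiting).
    Portion pop_for_sm(std::size_t sm) {
        if (fifos_[sm].empty()) return {};
        Portion p = std::move(fifos_[sm].front());
        fifos_[sm].pop();
        return p;
    }

private:
    std::vector<std::queue<Portion>> fifos_;
    std::size_t portion_size_;
    std::size_t next_ = 0;
};
```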
  • Data units that were processed by the respective SM of SMs 318 may then be transferred back, over the AXI connection, to the MSU. As seen, the large overhead that is typical of CPU-GPU architectures is saved in the embodiments described above.
  • FIGS. 3B and 3C depict schematic block diagrams of two optional architectures embodying MSU 310 of FIG. 3A, according to embodiments of the present invention. FIG. 3B depicts MSU 350, which comprises FIU 356 and GPU 358. FIU 356 may comprise a plurality of FIFO units (collectively named 356A)—FIFO0, FIFO1 . . . FIFOn. Each FIFO unit may be in active communication with an assigned FPGA AXI (F-AXI) unit—F-AXI0, F-AXI1 . . . F-AXIn (collectively named 356B). Each of the separate F-AXI units may be in direct connection with an assigned GPU AXI (G-AXI) unit—G-AXI0, G-AXI1 . . . G-AXIn. Each of the G-AXI interface units may be in active connection with, and may provide data to, an assigned SM—SM0, SM1 . . . SMn.
  • According to yet another embodiment, as depicted in FIG. 3C, MSU 380 comprises FIU 386 and GPU 388. FIU 386 may comprise a plurality of FIFO units (collectively named 386A)—FIFO0, FIFO1 . . . FIFOn. Each FIFO unit may be in active communication with an FPGA AXI (F-AXI) unit that may be configured to administer the data streams from the plurality of FIFO units and combine them into a single AXI stream. The AXI stream may be transmitted to AXI interface unit 388A of GPU 388 and may then be divided among the respective SM units—SM0, SM1 . . . SMn.
  • The architecture depicted in FIG. 3B may provide faster overall performance but may require a larger number of pins (for an integrated circuit (IC) embodying the described circuit) and a larger number of wires/conduits. The architecture depicted in FIG. 3C may provide relatively slower overall performance but may require a smaller number of pins (for an IC embodying the described circuit) and a smaller number of wires/conduits.
  • The above described devices, structures and methods may accelerate the processing of large amounts of unprocessed data, compared to known architectures and methods. For example, in known embodiments there is a need to transfer the whole image before the process/algorithm can start on the GPU. If the image size is 1 GB and the theoretical throughput of the PCIe bus transferring data to the GPU is 32 GB/s, the latency would be 1 GB/(32 GB/s) = 1/32 s ≈ 31.25 ms. In contrast, with the FPGA according to embodiments of the invention it is only necessary to fully load all SM units. For example, in the Tesla P100 GPU there are 56 SM units, and in each SM there are 64 cores that support 32 bit (in single precision mode) or 32 cores that support 64 bit (extended precision mode); thus the data size for a fully loaded GPU (same result for single or extended precision modes) is 56*32*64 = 114688 bits = 14.336 Mbytes. The FPGA-to-GPU AXI stream theoretical throughput is 896 MB/s (for 56 lanes), so the latency is 14.336 MB/(896 MB/s) = 14.336/896 s = 16 ms, which is roughly half the latency.
  • While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims (6)

1. A method for enhancing graphical data throughput exchanged between a graphical data source and a graphical processing unit (GPU) via a streaming multiprocessor unit that comprises a processing core unit (PCU), a register file unit, multiple cache units, shared memory unit, unified cache unit and interface cache unit, the method comprising:
transferring a stream of graphical data via the interface cache unit and via the multiple cache units and via the unified cache unit to the register file unit;
transferring a second stream of graphical data from the register file unit to the processing core unit (PCU); and
storing and receiving frequently used portions of data in the shared memory unit, via the register file unit.
2. The method of claim 1 wherein the register file unit is configured to direct data processed by the PCU to the shared memory unit as long as it is capable of receiving more data, based on the level of frequent use of that data.
3. The method of claim 2 wherein the level of frequent use is determined by the PCU.
4. A streaming multiprocessor unit for enhancing graphical data throughput comprising:
a processing core unit (PCU) configured to process graphical data;
a register file unit, configured to provide graphical data from the PCU and to receive and temporarily store processed graphical data from the PCU;
multiple cache units, configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit;
shared memory unit configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit;
unified cache unit configured to provide graphical data from the register file unit and to receive and temporarily store processed graphical data from the register file unit; and
interface cache unit, configured to receive graphical data for graphical processing at a high pace, to provide the graphical data to at least one of shared memory unit and unified cache unit, to receive processed graphical data from the unified cache unit, and to provide the processed graphical data to external processing units.
5. The streaming multiprocessor unit of claim 4 wherein at least some of the graphical data elements are stored, before and/or after processing by the PCU, in the shared memory unit, based on a priority figure that is associated with the probability of their close call by the PCU.
6. The streaming multiprocessor unit of claim 5 wherein the priority figure is higher as the probability is higher.
US17/167,077 2017-08-31 2021-02-03 System and method for high throughput in multiple computations Abandoned US20210191729A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/167,077 US20210191729A1 (en) 2017-08-31 2021-02-03 System and method for high throughput in multiple computations

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762552475P 2017-08-31 2017-08-31
PCT/IL2018/050965 WO2019043710A1 (en) 2017-08-31 2018-08-30 System and method for high throughput in multiple computations
US202016642026A 2020-02-26 2020-02-26
US17/167,077 US20210191729A1 (en) 2017-08-31 2021-02-03 System and method for high throughput in multiple computations

Related Parent Applications (2)

Application Number Title Priority Date Filing Date
PCT/IL2018/050965 Continuation WO2019043710A1 (en) 2017-08-31 2018-08-30 System and method for high throughput in multiple computations
US16/642,026 Continuation US10942746B2 (en) 2017-08-31 2018-08-30 System and method for high throughput in multiple computations

Publications (1)

Publication Number Publication Date
US20210191729A1 (en) 2021-06-24

Family

ID=65527276

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/642,026 Active US10942746B2 (en) 2017-08-31 2018-08-30 System and method for high throughput in multiple computations
US17/167,077 Abandoned US20210191729A1 (en) 2017-08-31 2021-02-03 System and method for high throughput in multiple computations

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/642,026 Active US10942746B2 (en) 2017-08-31 2018-08-30 System and method for high throughput in multiple computations

Country Status (4)

Country Link
US (2) US10942746B2 (en)
EP (1) EP3676710A4 (en)
JP (2) JP2020532795A (en)
WO (1) WO2019043710A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11204819B2 (en) * 2018-12-21 2021-12-21 Samsung Electronics Co., Ltd. System and method for offloading application functions to a device
US20220147320A1 (en) * 2020-11-09 2022-05-12 Vizzio Technologies Pte Ltd Highly parallel processing system

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0258153A (en) * 1988-08-24 1990-02-27 Mitsubishi Electric Corp Information processor
JP2001134490A (en) * 1999-11-01 2001-05-18 Fujitsu Ltd Method for controlling cache memory and computer for realizing the method
JP4043225B2 (en) * 2001-12-10 2008-02-06 株式会社ルネサステクノロジ Cache apparatus and method
US7812844B2 (en) * 2004-01-28 2010-10-12 Lucid Information Technology, Ltd. PC-based computing system employing a silicon chip having a routing unit and a control unit for parallelizing multiple GPU-driven pipeline cores according to the object division mode of parallel operation during the running of a graphics application
JP4904802B2 (en) * 2005-02-01 2012-03-28 セイコーエプソン株式会社 Cache memory and processor
JP4837305B2 (en) * 2005-05-10 2011-12-14 ルネサスエレクトロニクス株式会社 Microprocessor and control method of microprocessor
GB0723536D0 (en) * 2007-11-30 2008-01-09 Imagination Tech Ltd Multi-core geometry processing in a tile based rendering system
US8400458B2 (en) * 2009-09-09 2013-03-19 Hewlett-Packard Development Company, L.P. Method and system for blocking data on a GPU
US9639479B2 (en) * 2009-09-23 2017-05-02 Nvidia Corporation Instructions for managing a parallel cache hierarchy
JP5487882B2 (en) * 2009-10-27 2014-05-14 セイコーエプソン株式会社 Image processing apparatus and image processing method
JP5648465B2 (en) * 2010-12-17 2015-01-07 富士通セミコンダクター株式会社 Graphics processor
JP5659817B2 (en) * 2011-01-21 2015-01-28 ソニー株式会社 Interconnect equipment
US9092267B2 (en) 2011-06-20 2015-07-28 Qualcomm Incorporated Memory sharing in graphics processing unit
US8954599B2 (en) * 2011-10-28 2015-02-10 Hewlett-Packard Development Company, L.P. Data stream operations
KR101511972B1 (en) * 2011-12-23 2015-04-15 인텔 코포레이션 Methods and apparatus for efficient communication between caches in hierarchical caching design
US9720829B2 (en) * 2011-12-29 2017-08-01 Intel Corporation Online learning based algorithms to increase retention and reuse of GPU-generated dynamic surfaces in outer-level caches
US9244683B2 (en) * 2013-02-26 2016-01-26 Nvidia Corporation System, method, and computer program product for implementing large integer operations on a graphics processing unit
US9086813B2 (en) * 2013-03-15 2015-07-21 Qualcomm Incorporated Method and apparatus to save and restore system memory management unit (MMU) contexts

Also Published As

Publication number Publication date
EP3676710A4 (en) 2021-07-28
US10942746B2 (en) 2021-03-09
EP3676710A1 (en) 2020-07-08
WO2019043710A1 (en) 2019-03-07
US20200183698A1 (en) 2020-06-11
JP2020532795A (en) 2020-11-12
JP2023078204A (en) 2023-06-06


Legal Events

Date Code Title Description
AS Assignment

Owner name: RAIL VISION LTD, ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HANIA, SHAHAR;ZELTZER, HANAN;SIGNING DATES FROM 20200301 TO 20200303;REEL/FRAME:055557/0848

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION