EP4229572A1 - Parallel processing architecture with background loads - Google Patents

Parallel processing architecture with background loads

Info

Publication number
EP4229572A1
EP4229572A1
Authority
EP
European Patent Office
Prior art keywords
array
compute elements
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21881045.5A
Other languages
English (en)
French (fr)
Inventor
Peter Foley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ascenium Inc
Original Assignee
Ascenium Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ascenium Inc filed Critical Ascenium Inc
Publication of EP4229572A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023 Two dimensional arrays, e.g. mesh, torus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1642 Handling requests for interconnection or transfer for access to memory bus based on arbitration with request queuing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/453 Data distribution

Definitions

  • the other elements to which the CEs can be coupled can include storage elements such as scratchpad memories; multiplier units; address generator units for generating load (LD) and store (ST) addresses; load queues; and so on.
  • the compiler to which each compute element is known can include a general-purpose compiler such as a C, C++, or Python compiler; a hardware-oriented compiler such as a VHDL or Verilog compiler; a compiler written for the array of compute elements; and so on.
  • the coupling of each CE to its neighboring CEs enables communication between or among neighboring CEs and the like.
  • Fig. 2 is a flow diagram for data tagging.
  • tasks can be processed on an array of compute elements.
  • the tasks can include general operations such as arithmetic, vector, or matrix operations; operations based on applications such as neural network or deep learning operations; and so on.
  • in order for the tasks to be processed correctly, the tasks must be scheduled on the array of compute elements, and the data that the tasks will operate on must be accessed.
  • the data can be provided to the tasks by using background loads.
  • the background loads can transfer data to compute elements from load queues, from a memory system, from local or remote storage, etc. Since the data that is loaded can be intended for one or more compute elements within the array of compute elements, the data can be tagged.
  • the data tagging enables a parallel processing architecture with background loads.
  • a two-dimensional (2D) array of compute elements is accessed, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Operation of the array of compute elements is paused, wherein the pausing occurs while a memory system continues operation.
  • a bus coupling the array of compute elements to the memory system is repurposed for operation during the pausing. Data is transferred from the memory system to the array of compute elements, using the bus that was repurposed. An end-to-end sketch of this pause-repurpose-transfer sequence appears after this list.
  • a system block diagram 300 for a highly parallel architecture with a shallow pipeline is shown.
  • the system block diagram can include a compute element array 310.
  • the compute element array 310 can be based on compute elements, where the compute elements can include processors, central processing units (CPUs), graphics processing units (GPUs), coprocessors, and so on.
  • the compute elements can be based on processing cores configured within chips such as application specific integrated circuits (ASICs), processing cores programmed into programmable chips such as field programmable gate arrays (FPGAs), and so on.
  • the compute elements can comprise a homogeneous array of compute elements.
  • the system block diagram 300 can include translation and look-aside buffers such as translation and look-aside buffers 312 and 338.
  • the memory systems can be free running and can continue to operate while the array is paused. Because multicycle latency can occur due to control signal transport, which results in additional “dead time”, it can be beneficial to allow the memory system to “reach into” the array and deliver load data to appropriate scratchpad memories while the array is paused. This mechanism can operate such that the array state is known, as far as the compiler is concerned. When array operation resumes after a pause, new load data will have arrived at a scratchpad, as required for the compiler to maintain the statically scheduled model.
  • Wall time, which can include system clock ticks, system processing cycles, and the like, can occur continuously. That is, while compiler time can suspend during a pause of the array, wall time can proceed. Using this technique, background loads can appear to occur during a single, virtual compiler cycle, while the actual accessing of load queues, a memory system, etc., can be performed under wall time. A minimal clock sketch after this list illustrates this distinction.
  • the accesses can also be associated with a second or further column such as column 7 530.
  • the accesses that originate within column 7 can include access 4 532 and access 5 534.
  • the accesses 4 and 5 can also be offset.
  • the accesses can be performed.
  • the accesses to load queues, the memory system, etc. can be performed based on wall time. Since compiler time suspends while the array is paused, as opposed to wall time, which never stops, the accesses occur within one virtual compiler clock tick or cycle. When the accesses are complete, the array can be resumed, and compiler time can continue.
  • Fig. 6 shows virtual single cycle load latency.
  • An array of compute elements can be known to a compiler, where the compiler can generate or compile code for the compute elements.
  • the compiler can also direct communications to or from, between, and among compute elements, where the communications are used for data transfers.
  • the data that is transferred can include one or more operands.
  • the compiler can pause the compute elements, resume the compute elements, and the like. Since data can be transferred between a memory system and the compute elements of the array while the compute elements within the array are paused, and since pausing the compute elements can comprise a single compiler time step, the data transfers can appear to the compiler to have taken place within as few as one compiler time step.
  • a virtual single cycle load latency enables a parallel processing architecture with background loads.
  • a two-dimensional (2D) array of compute elements is accessed, where each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements. Operation of the array of compute elements is paused, and a bus coupling the array of compute elements to the memory system is repurposed. The repurposing couples one or more compute elements in the array of compute elements to the memory system, and a memory system operation is enabled during the pausing. Data is transferred from the memory system to the array of compute elements, using the bus that was repurposed.
  • Fig. 7 illustrates logic for controlling background loads.
  • a background load can be used to transfer or load data from a memory system into an array of compute elements for processing by the compute elements.
  • a background load can occur while the array of compute elements is paused. Background loads enable a parallel processing architecture.
  • a two-dimensional (2D) array of compute elements is accessed, where each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Operation of the array of compute elements is paused, wherein the pausing occurs while a memory system continues operation.
  • a background load can be based on or controlled by a data “packet” 710.
  • the packet can include data, where the data can be available on a bus.
  • the data can include 64-bit data and can be available on a bus such as a column data bus.
  • the packet can further include a target ID 712.
  • the target ID can include a 4-bit target ID, where the target ID can be associated with a target row of compute elements within an array of compute elements.
  • the packet can also include one or more control signals.
  • a control signal can include a background load data valid signal 714.
  • the data available on the 64-bit column data bus can be stored in one or more scratchpad memories. A bit-level sketch of one possible packet encoding appears after this list.
  • Fig. 8 is a system diagram for a parallel processing architecture with background loads.
  • the parallel processing architecture with background loads enables task processing.
  • the system 800 can include one or more processors 810, which are attached to a memory 812 which stores instructions.
  • the system 800 can further include a display 814 coupled to the one or more processors 810 for displaying data; intermediate steps; control words; control words implementing Very Long Instruction Word (VLIW) functionality; topologies including systolic, vector, cyclic, spatial, streaming, or VLIW topologies; and so on.
  • the compute elements can include compute elements within one or more integrated circuits or chips, compute elements or cores configured within one or more programmable chips such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs), processors configured as a mesh, standalone processors, etc.
  • the system 800 can include one or more scratchpad memories 820.
  • the one or more scratchpad memories 820 can be used to store data, control words, intermediate results, microcode, and so on.
  • the scratchpad memory can be used for data transfer.
  • the data from the memory system is transferred to a scratchpad memory in one or more compute elements within the two-dimensional array.
  • a scratchpad memory can comprise a small, local, easily accessible memory available to a compute element.
  • the scratchpad memory provides operand storage. Since a scratchpad memory is associated with a particular compute element, the compute element for which the contents of the scratchpad memory are intended can be identified. Further embodiments include tagging the data before it is transferred.
  • the tagging can include a flag, an address, a code, and so on. In embodiments, the tagging can guide the transferring to a particular compute element within the array of compute elements. The tagging can be based on a location within the array. In embodiments, the tagging can include a target row location within the array of compute elements. The tagging can further include a target column location within the array of compute elements. A small routing sketch after this list illustrates row and column tagging.
  • the scratchpad memory can be accessible to one or more compute elements. In embodiments, the scratchpad memory can include a dual read, single write (2R1W) scratchpad memory. That is, the 2R1W scratchpad memory can enable two contemporaneous read operations and one write operation without the read and write operations interfering with one another; a sketch modeling this port behavior appears after this list.
  • Communication between and among compute elements can be accomplished using a bus such as an industry standard bus, an on-chip bus such as a ring bus, a network such as a computer network, etc.
  • the ring bus is implemented as a distributed multiplexor (MUX).
  • the ring bus can be used to support various communication geometries within the array of compute elements such as a Manhattan communication geometry.
  • the bus can include a ring bus along a row or column of the array of compute elements.
  • the system 800 can include a pausing component 840.
  • the pausing component 840 can include control and functions for pausing operation of the array of compute elements, wherein the pausing occurs while a memory system continues operation.
  • the pausing operation can occur due to waiting for data such as operands to be processed by the compute elements.
  • the pausing operation can be necessitated by an exception.
  • An exception can include an arithmetic exception, waiting for data, waiting for an acknowledgement that data has been received, and the like.
  • An exception can occur due to a data cache “miss”, where data needed for a computation by a compute element is neither available within a scratchpad associated with that compute element nor available in the data cache, which necessitates seeking the data from the memory system.
  • the pausing operation can be necessitated by data congestion. That is, one or more buses within the array of compute elements can become congested while trying to move data between the memory system and the compute elements, between or among compute elements, etc.
  • the data congestion can be due to access jitter.
  • the data congestion can be due to a cache miss.
  • the pausing operation of the array of compute elements can include storing a state of the compute elements within the array. Other components within the array of compute elements can continue operation during the pausing.
  • the bus can continue operation during the pausing.
  • the bus operation can include transferring data to one or more compute elements within the array of compute elements. The data can be transferred from the memory system to one or more compute elements.
  • the system 800 can include a repurposing component 850.
  • the repurposing component 850 can include control logic and functions for repurposing a bus coupling the array of compute elements to the memory system for operation during the pausing.
  • the repurposing of the bus can include placing the bus into a “pass through” mode in which the bus can continue operation during the pausing. Pass-through mode may include saving the state currently on the bus to allow background load data to pass, and then restoring that saved data when the array resumes from the pause; this save-and-restore behavior is sketched after this list.
  • a bus in a pass-through mode can be used for passing data between the memory system and one or more scratchpad memories, one or more queues, and so on. Further embodiments include load queues coupled between the memory system and the bus.
  • the load queue buffers can be filled and emptied during a pause of the array of compute elements.
  • the load queues can be emptied of the data that was buffered before a resume occurs.
  • the data can be tagged before it is transferred between the memory system and the array of compute elements.
  • the tagging can guide the transferring to a particular compute element within the array of compute elements.
  • the tagging can serve as a compute element address, an identifier, and the like.
  • the pausing, the repurposing, and the transferring can comprise a background data load.
  • a background data load can be used to provide data such as operands to one or more compute elements before other data arrives at the compute elements.
  • the background data load can be used to anticipate outcomes of a branch or other control transfer operation.
  • the system 800 can include a computer program product embodied in a computer readable medium for task processing, the computer program product comprising code which causes one or more processors to perform operations of: accessing a two-dimensional (2D) array of compute elements, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements; pausing operation of the array of compute elements, wherein the pausing occurs while a memory system continues operation; repurposing a bus coupling the array of compute elements, wherein the repurposing couples one or more compute elements in the array of compute elements to the memory system, and wherein a memory system operation is enabled during the pausing; and transferring data from the memory system to the array of compute elements, using the bus that was repurposed.
  • Each of the above methods may be executed on one or more processors on one or more computer systems.
  • Embodiments may include various forms of distributed computing, client/server computing, and cloud-based computing.
  • the depicted steps or boxes contained in this disclosure’s flow charts are solely illustrative and explanatory. The steps may be modified, omitted, repeated, or reordered without departing from the scope of this disclosure. Further, each step may contain one or more sub-steps. While the foregoing drawings and description set forth functional aspects of the disclosed systems, no particular implementation or arrangement of software and/or hardware should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. All such arrangements of software and/or hardware are intended to fall within the scope of this disclosure.
  • The block diagrams and flowchart illustrations depict methods, apparatus, systems, and computer program products.
  • the elements and combinations of elements in the block diagrams and flow diagrams show functions, steps, or groups of steps of the methods, apparatus, systems, computer program products and/or computer-implemented methods. Any and all such functions — generally referred to herein as a “circuit,” “module,” or “system” — may be implemented by computer program instructions, by special-purpose hardware-based computer systems, by combinations of special purpose hardware and computer instructions, by combinations of general-purpose hardware and computer instructions, and so on.
  • a programmable apparatus which executes any of the above-mentioned computer program products or computer-implemented methods may include one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors, programmable devices, programmable gate arrays, programmable array logic, memory devices, application specific integrated circuits, or the like. Each may be suitably employed or configured to process computer program instructions, execute computer logic, store computer data, and so on.
  • a computer may include a computer program product from a computer-readable storage medium and that this medium may be internal or external, removable and replaceable, or fixed.
  • a computer may include a Basic Input/Output System (BIOS), firmware, an operating system, a database, or the like that may include, interface with, or support the software and hardware described herein.
  • Embodiments of the present invention are limited to neither conventional computer applications nor the programmable apparatus that run them. To illustrate: the embodiments of the presently claimed invention could include an optical computer, quantum computer, analog computer, or the like.
  • a computer program may be loaded onto a computer to produce a particular machine that may perform any and all of the depicted functions. This particular machine provides a means for carrying out any and all of the depicted functions.
  • any combination of one or more computer readable media may be utilized including but not limited to: a non-transitory computer readable medium for storage; an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor computer readable storage medium or any suitable combination of the foregoing; a portable computer diskette; a hard disk; a random access memory (RAM); a read-only memory (ROM), an erasable programmable read-only memory (EPROM, Flash, MRAM, FeRAM, or phase change memory); an optical fiber; a portable compact disc; an optical storage device; a magnetic storage device; or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • computer program instructions may include computer executable code.
  • languages for expressing computer program instructions may include without limitation C, C++, Java, JavaScript™, ActionScript™, assembly language, Lisp, Perl, Tcl, Python, Ruby, hardware description languages, database programming languages, functional programming languages, imperative programming languages, and so on.
  • computer program instructions may be stored, compiled, or interpreted to run on a computer, a programmable data processing apparatus, a heterogeneous combination of processors or processor architectures, and so on.
  • embodiments of the present invention may take the form of web-based computer software, which includes client/server software, software-as-a-service, peer-to-peer software, or the like.
  • a computer may enable execution of computer program instructions including multiple programs or threads.
  • the multiple programs or threads may be processed approximately simultaneously to enhance utilization of the processor and to facilitate substantially simultaneous functions.
  • any and all methods, program codes, program instructions, and the like described herein may be implemented in one or more threads which may in turn spawn other threads, which may themselves have priorities associated with them.
  • a computer may process these threads based on priority or other order.
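
The sketches referenced in the bullets above follow. All are written in Python and are illustrative assumptions layered on the patent's prose, not the patented implementation. This first sketch models the compiler-time versus wall-time distinction: the ArrayClock class and its fields are invented names, and the point is only that wall ticks accumulate during a pause while compiler ticks do not, so a background load spanning many wall ticks can appear to the compiler as a single virtual cycle.

    # Illustrative model: wall time advances on every tick, while compiler
    # time is suspended during a pause. All names here are assumptions.
    class ArrayClock:
        def __init__(self):
            self.wall_ticks = 0
            self.compiler_ticks = 0
            self.paused = False

        def tick(self):
            self.wall_ticks += 1          # wall time never stops
            if not self.paused:
                self.compiler_ticks += 1  # compiler time suspends while paused

    clk = ArrayClock()
    clk.tick()                  # normal operation: both clocks advance
    clk.paused = True
    for _ in range(12):         # background loads drain queues under wall time
        clk.tick()
    clk.paused = False
    clk.tick()                  # resume: load data arrived in one compiler step
    print(clk.wall_ticks, clk.compiler_ticks)   # 14 wall ticks, 2 compiler ticks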
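
The Fig. 7 discussion names three packet fields: 64 bits of column-bus data, a 4-bit target ID, and a background-load-data-valid signal. One possible bit-level encoding is sketched below; the field ordering and the pack/unpack helpers are assumptions for illustration only.

    # Hypothetical packing of the background-load packet fields named above.
    DATA_BITS = 64       # data on the column data bus
    TARGET_ID_BITS = 4   # selects a target row of compute elements

    def pack_packet(data: int, target_id: int, valid: bool) -> int:
        """Pack (data, target_id, valid) into one integer, data in the low bits."""
        assert 0 <= data < (1 << DATA_BITS)
        assert 0 <= target_id < (1 << TARGET_ID_BITS)
        return (data
                | (target_id << DATA_BITS)
                | (int(valid) << (DATA_BITS + TARGET_ID_BITS)))

    def unpack_packet(packet: int):
        """Recover (data, target_id, valid) from a packed packet."""
        data = packet & ((1 << DATA_BITS) - 1)
        target_id = (packet >> DATA_BITS) & ((1 << TARGET_ID_BITS) - 1)
        valid = bool(packet >> (DATA_BITS + TARGET_ID_BITS))
        return data, target_id, valid

    p = pack_packet(data=0xDEADBEEF, target_id=5, valid=True)
    assert unpack_packet(p) == (0xDEADBEEF, 5, True)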
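
The tagging bullets describe a target row and target column guiding each transfer to a particular compute element. The sketch below routes tagged values into per-element scratchpads; the dictionary-backed array, its dimensions, and the (row, column) tag format are assumptions.

    # Illustrative routing of tagged background-load data into per-CE scratchpads.
    from collections import defaultdict

    ROWS, COLS = 4, 4                 # assumed array dimensions
    scratchpads = defaultdict(list)   # (row, col) -> operands delivered so far

    def background_load(tagged_data):
        """Deliver (tag, value) pairs; each tag guides its value to one CE."""
        for (row, col), value in tagged_data:
            assert 0 <= row < ROWS and 0 <= col < COLS, "tag must name a real CE"
            scratchpads[(row, col)].append(value)

    background_load([((0, 3), 42), ((2, 1), 7)])
    assert scratchpads[(0, 3)] == [42] and scratchpads[(2, 1)] == [7]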
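
The dual read, single write (2R1W) scratchpad mentioned above can be modeled as a memory that services two reads and one write per cycle without interference. The cycle() interface and the read-before-write ordering below are assumptions chosen only to make the non-interference property concrete.

    # Sketch of 2R1W port behavior: two reads and one write per cycle.
    class Scratchpad2R1W:
        def __init__(self, size: int):
            self.mem = [0] * size

        def cycle(self, read_a: int, read_b: int, write=None):
            """Service two reads and at most one write in a single cycle.
            Reads return pre-write contents, so a same-address read and
            write do not interfere with one another."""
            out_a, out_b = self.mem[read_a], self.mem[read_b]
            if write is not None:
                addr, value = write
                self.mem[addr] = value
            return out_a, out_b

    sp = Scratchpad2R1W(16)
    a, b = sp.cycle(read_a=3, read_b=3, write=(3, 99))
    assert (a, b) == (0, 0) and sp.mem[3] == 99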
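
The pass-through mode described for the repurposing component saves whatever state is on the bus, lets background-load data pass while the array is paused, and restores the saved state on resume. The Bus class below is a hypothetical rendering of that save-and-restore behavior.

    # Sketch of bus repurposing via a save/pass/restore sequence.
    class Bus:
        def __init__(self):
            self.value = None            # data currently on the bus
            self._saved = None

        def enter_pass_through(self):
            self._saved = self.value     # save in-flight array state

        def pass_background(self, data, deliver):
            self.value = data            # bus carries background-load data
            deliver(data)                # e.g., write into a scratchpad

        def exit_pass_through(self):
            self.value = self._saved     # restore so the array resumes cleanly
            self._saved = None

    bus = Bus()
    bus.value = "operand-in-flight"
    bus.enter_pass_through()
    bus.pass_background("load-data", deliver=lambda d: None)
    bus.exit_pass_through()
    assert bus.value == "operand-in-flight"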
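
Finally, the pause-repurpose-transfer-resume sequence referenced earlier can be composed from the ArrayClock and Bus sketches above, with a plain list standing in for a load queue. Again, this is an assumption-laden illustration of the claimed flow, not the patented implementation.

    # End-to-end sketch: pause, repurpose the bus, drain the load queue under
    # wall time, then restore and resume in a single virtual compiler cycle.
    def background_load_cycle(clk, bus, load_queue, deliver):
        clk.paused = True                # compiler time suspends
        bus.enter_pass_through()         # repurpose the bus during the pause
        while load_queue:                # load queues are emptied before resume
            bus.pass_background(load_queue.pop(0), deliver)
            clk.tick()                   # wall time elapses; compiler time does not
        bus.exit_pass_through()
        clk.paused = False               # on resume, data is already in place

    clk2, bus2, delivered = ArrayClock(), Bus(), []
    background_load_cycle(clk2, bus2, ["op0", "op1"], delivered.append)
    assert delivered == ["op0", "op1"] and clk2.compiler_ticks == 0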

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
EP21881045.5A 2020-10-15 2021-10-14 Parallel processing architecture with background loads Pending EP4229572A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063091947P 2020-10-15 2020-10-15
PCT/US2021/054889 WO2022081784A1 (en) 2020-10-15 2021-10-14 Parallel processing architecture with background loads

Publications (1)

Publication Number Publication Date
EP4229572A1 (de) 2023-08-23

Family

ID=81208770

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21881045.5A Pending EP4229572A1 (de) Parallel processing architecture with background loads

Country Status (3)

Country Link
EP (1) EP4229572A1 (de)
KR (1) KR20230087553A (de)
WO (1) WO2022081784A1 (de)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU3829500A (en) * 1999-04-09 2000-11-14 Clearspeed Technology Limited Parallel data processing apparatus
EP2996035A1 (de) * 2008-10-15 2016-03-16 Hyperion Core, Inc. Data processing device
US9329834B2 (en) * 2012-01-10 2016-05-03 Intel Corporation Intelligent parametric scratchpad memory architecture
US11029949B2 (en) * 2015-10-08 2021-06-08 Shanghai Zhaoxin Semiconductor Co., Ltd. Neural network unit
US11347477B2 (en) * 2019-09-27 2022-05-31 Intel Corporation Compute in/near memory (CIM) circuit architecture for unified matrix-matrix and matrix-vector computations

Also Published As

Publication number Publication date
KR20230087553A (ko) 2023-06-16
WO2022081784A1 (en) 2022-04-21

Similar Documents

Publication Publication Date Title
US20220107812A1 (en) Highly parallel processing architecture using dual branch execution
US20220075627A1 (en) Highly parallel processing architecture with shallow pipeline
EP4384902A1 (de) Parallel processing architecture with distributed register files
WO2022055792A1 (en) Highly parallel processing architecture with shallow pipeline
US20220075740A1 (en) Parallel processing architecture with background loads
EP4229572A1 (de) Parallel processing architecture with background loads
EP4244726A1 (de) Highly parallel processing architecture with a compiler
US20230350713A1 (en) Parallel processing architecture with countdown tagging
US20220291957A1 (en) Parallel processing architecture with distributed register files
US20220308872A1 (en) Parallel processing architecture using distributed register files
US20230273818A1 (en) Highly parallel processing architecture with out-of-order resolution
US20230031902A1 (en) Load latency amelioration using bunch buffers
US20240168802A1 (en) Parallel processing with hazard detection and store probes
US20240070076A1 (en) Parallel processing using hazard detection and mitigation
US20220374286A1 (en) Parallel processing architecture for atomic operations
US20220214885A1 (en) Parallel processing architecture using speculative encoding
US20230409328A1 (en) Parallel processing architecture with memory block transfers
US20230342152A1 (en) Parallel processing architecture with split control word caches
US20240264974A1 (en) Parallel processing hazard mitigation avoidance
US20240078182A1 (en) Parallel processing with switch block execution
US20230221931A1 (en) Autonomous compute element operation using buffers
US20230281014A1 (en) Parallel processing of multiple loops with loads and stores
WO2024015318A1 (en) Parallel processing architecture with countdown tagging
WO2022251272A1 (en) Parallel processing architecture with distributed register files
US20240193009A1 (en) Parallel processing architecture for branch path suppression

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20230404

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)