EP4211567A1 - Highly parallel processing architecture with shallow pipeline

Highly parallel processing architecture with shallow pipeline

Info

Publication number
EP4211567A1
Authority
EP
European Patent Office
Prior art keywords
array
compute elements
compute
control
cache
Prior art date
Legal status
Pending
Application number
EP21867391.1A
Other languages
German (de)
English (en)
French (fr)
Other versions
EP4211567A4 (en)
Inventor
Peter Foley
Current Assignee
Ascenium Inc
Original Assignee
Ascenium Inc
Priority date
Filing date
Publication date
Application filed by Ascenium Inc filed Critical Ascenium Inc
Publication of EP4211567A1
Publication of EP4211567A4

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/3017: Runtime instruction translation, e.g. macros
    • G06F 9/30178: Runtime instruction translation of compressed or encrypted instructions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F 15/8007: Single instruction multiple data [SIMD] multiprocessors
    • G06F 15/8023: Two dimensional arrays, e.g. mesh, torus
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
    • G06F 9/30149: Instruction analysis of variable length instructions

Definitions

  • This application relates generally to task processing and more particularly to a highly parallel processing architecture with shallow pipeline.
  • Vast resources are expended annually to support the data processing requirements of organizations.
  • the data must be collected, stored, analyzed, processed, preserved, protected, backed up, and so on.
  • Some organizations continue to support their data handling and processing needs “in-house” by building, supporting, and maintaining their own datacenters. In-house processing can be the preferred approach for asset management, security, and other reasons.
  • Other organizations have taken advantage of now-common cloud-based computational facilities. These latter data handling and processing facilities, which can include multiple datacenters distributed across large geographic areas, provide computation, data collection, data storage, and other needs “as a service”. These services enable data processing and handling access for even small organizations that would otherwise be unable to equip, staff, and maintain their own datacenters. Whether supported in-house or contracted with cloud-based services, the organizations operate based on data processing.
  • Many and varied data collection techniques are used to collect data from a wide and diverse range of individuals.
  • the individuals typically include clients, purchasers, patients, test subjects, citizens, students, and volunteers. At times the individuals are willing participants, while at other times they are unwitting subjects or even victims of data collection.
  • Often used data collection strategies include “opt-in” techniques, where an individual signs up, registers, creates a user ID or account, or otherwise willingly and actively agrees to participate in the data collection.
  • Other techniques are legislative, such as a government requiring citizens to obtain a registration number and to use that number while interacting with government agencies, law enforcement, or emergency services, among others.
  • Additional data collection techniques are more subtle or intentionally hidden, such as tracking purchase histories, website visits, button clicks, and menu choices. Irrespective of the techniques used for the data collection, the collected data is highly valuable to the organizations that collected it. However collected, the rapid processing of this data remains critical.
  • the job processing typically includes running payroll or billing tasks, analyzing research data, assigning student grades, and so on.
  • the job processing can also include training a processing network such as a neural network for machine learning.
  • These jobs are highly complex and are composed of many tasks.
  • the tasks can include loading and storing various datasets, accessing processing components and systems, executing data processing, and so on.
  • the tasks themselves are typically based on subtasks which themselves can be complex.
  • the subtasks can be used to handle specific jobs such as loading or reading data from storage, performing computations and other manipulations on the data, storing or writing the results data back to storage, handling intersubtask communication such as data transfer and control, and so on.
  • 2D arrays of elements can be used for the processing of the tasks and subtasks.
  • the 2D arrays include compute elements, multiplier elements, registers, caches, queues, controllers, decompressors, arithmetic logic units (ALUs), storage elements, and other components which can communicate among themselves.
  • These arrays of elements are configured and operated by providing control to the array of elements on a cycle-by-cycle basis. The control of the 2D array is accomplished by providing control words generated by a compiler.
  • the control includes a stream of control words, where the control words can include wide, variable length, microcode control words generated by the compiler.
  • the control words are used to configure the array and to control the flow or transfer of data and the processing of the tasks and subtasks.
  • the arrays can be configured in a topology which is best suited to the task processing.
  • the topologies into which the arrays can be configured include a systolic, a vector, a cyclic, a spatial, a streaming, or a Very Long Instruction Word (VLIW) topology, among others.
  • the topologies can include a topology that enables machine learning functionality.
  • the highly parallel processing architecture is based on a two-dimensional (2D) array of compute elements.
  • the compute elements can comprise CPUs, GPUs, processor cores, compute engine cores, and so on.
  • the compute elements can further include elements that support the compute elements, such as storage elements, switching elements, caches, memories, and the like.
  • the compute elements within the 2D array are controlled by providing control on a cycle-by-cycle basis.
  • the control is accomplished by providing one or more control words.
  • the control words can be provided as a stream of control words.
  • the control words include variable length, microcode control words that can be generated by a compiler, an assembler, etc.
  • control words can be compressed.
  • the provided control words can be loaded into a cache memory, where the cache memory can be shared by more than one compute element.
  • single control words can be provided to more than one compute element. That is, a control word can be distributed to elements across a row or a column of the array of compute elements. A control word can be distributed across the entire array.
  • the control words can also be used to selectively enable and disable compute elements that are not required for a given processing task. Selectively disabling compute elements can simplify data transfers within the array, reduce power consumption by the array, etc.
  • the control words can be decompressed to enable control of one or more compute elements.
  • the compute elements can include a single compute element, a row of compute elements, a column of compute elements, an array of compute elements, etc. Having configured compute elements within the 2D array, a compiled task can be executed.
  • the decompressed control words can control the execution of the task, associated subtasks, and so on.
  • the decompressed control words can further enable parallel processing within the 2D array.
  • a processor-implemented method for task processing comprising: accessing a two-dimensional (2D) array of compute elements, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements; providing control for the array of compute elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler; decompressing the control words to enable control on a per element basis; and executing a compiled task on the array of compute elements, wherein the executing is based on the control words that were decompressed.
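  • The four claimed steps can be illustrated with a minimal Python sketch. The class and function names below are hypothetical, chosen only to mirror the claim language, and the decompressor and per-cycle execution are reduced to placeholders rather than the patent's actual mechanisms.

        # Hypothetical sketch of the claimed flow: access the 2D array, provide
        # control on a cycle-by-cycle basis, decompress per element, execute.
        class ComputeElementArray:
            def __init__(self, rows, cols):
                # Every element position is statically known, as the compiler requires.
                self.rows, self.cols = rows, cols
                self.state = [[0] * cols for _ in range(rows)]

        def run_compiled_task(array, control_word_stream, decompress, apply_controls):
            for compressed_word in control_word_stream:    # one control word per cycle
                per_element = decompress(compressed_word)  # per-element control signals
                apply_controls(array, per_element)         # execute this cycle's work

        array = ComputeElementArray(rows=4, cols=4)
        run_compiled_task(array, [b"ccw0", b"ccw1"],
                          decompress=lambda w: {"raw": w},   # placeholder decompressor
                          apply_controls=lambda a, c: None)  # placeholder execution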
  • the compute elements within the array of compute elements can have identical functionality such as word length, number and size of scratchpad memory elements, depth of register files, processing rates, etc.
  • Embodiments include storing relevant portions of the control word within a cache associated with the array of compute elements.
  • the cache can be based on a dual read, single write (2R1W) cache.
  • the 2R1W cache enables two reads or fetches from the cache and one write or store to the cache to occur substantially simultaneously.
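  • As a rough behavioral model of the 2R1W property just described, the following Python sketch services up to two reads and one write per call; only the port counts come from the text, while the interface itself is an assumption.

        # Behavioral sketch of a dual read, single write (2R1W) store: each
        # call models one cycle servicing up to two reads and one write.
        class TwoReadOneWriteCache:
            def __init__(self):
                self.lines = {}

            def cycle(self, read_addrs=(), write=None):
                assert len(read_addrs) <= 2, "only two read ports per cycle"
                values = [self.lines.get(addr) for addr in read_addrs]  # both reads
                if write is not None:                                   # single write
                    addr, data = write
                    self.lines[addr] = data
                return values

        cache = TwoReadOneWriteCache()
        cache.cycle(write=(0x40, "control word A"))
        print(cache.cycle(read_addrs=(0x40, 0x80)))  # ['control word A', None]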
  • the cache can include a hierarchical cache comprising multiple levels of cache storage such as L1, L2, and L3 cache levels. The cache enables high speed, local access to the portions of the control words used to control the compute elements and to other associated elements within the array.
  • the decompressing can occur cycle-by-cycle out of the cache, thus providing control on a cycle-by-cycle basis to the elements of the 2D array.
  • decompressing of a single control word can occur over multiple cycles. The multiple cycles can accommodate control word straddle over a cache line fetch boundary.
  • the control words that are provided enable parallel execution of tasks.
  • the tasks can include substantially similar tasks that process different datasets (e.g., SIMD), two or more tasks that are independent of one another, and so on.
  • simultaneous execution of two or more potential compiled task outcomes can be provided, where the two or more potential compiled task outcomes comprise a computation result or a routing control.
  • the computational result can include a result of an arithmetic operation, a logical operation, and so on.
  • the routing control can include a conditional branch, an unconditional branch, and the like. Since the outcome of the operation or a conditional branch is not known a priori, the possible execution paths that can be taken can be executed in parallel.
  • the two or more potential compiled outcomes can be controlled by the same control word. When the correct outcome of the operation or the branch decision is determined, then processing of the correct outcome is continued while processing of any alternative outcome is halted.
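  • A minimal sketch of this both-paths-then-halt behavior appears below; the names are illustrative, since the patent does not prescribe a software interface for it.

        # Sketch: both potential outcomes launch under the same control word;
        # once the condition resolves, the incorrect path is halted.
        class SpeculativePath:
            def __init__(self, name):
                self.name, self.halted = name, False

            def halt(self):
                self.halted = True

        def resolve_branch(condition, taken, not_taken):
            # Both paths are assumed to be executing already, in parallel.
            winner, loser = (taken, not_taken) if condition else (not_taken, taken)
            loser.halt()        # processing of the alternative outcome stops
            return winner       # processing of the correct outcome continues

        kept = resolve_branch(True, SpeculativePath("taken"),
                              SpeculativePath("fall-through"))
        print(kept.name)        # taken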
  • Fig. 1 is a flow diagram for a highly parallel processing architecture with a shallow pipeline.
  • Fig. 2 is a flow diagram for task scheduling.
  • Fig. 3 shows a system block diagram for a highly parallel architecture with a shallow pipeline.
  • Fig. 4 illustrates compute element array detail.
  • Fig. 5 shows array row control decode.
  • Fig. 6 illustrates example encoding for a single control word row.
  • Fig. 7 shows example compressed control word sizes.
  • Fig. 8 is a table showing example decompressed control word fields.
  • Fig. 9 is a system diagram for task processing using a highly parallel processing architecture.
  • the tasks that are processed can perform a variety of operations including arithmetic operations, shift operations, logical operations including Boolean operations, vector or matrix operations, and the like.
  • the tasks can include a plurality of subtasks.
  • the subtasks can be processed based on precedence, priority, coding order, amount of parallelization, data flow, data availability, compute element availability, communication channel availability, and so on.
  • the data manipulations are performed on a two-dimensional array of compute elements.
  • the compute elements which can include CPUs, GPUs, ASICs, FPGAs, cores, and other processing components, can be coupled to local storage, which can include cache storage.
  • the cache, which can include a hierarchical cache, can be used for storing relevant portions of a control word, where the control word controls the compute element. Both compressed and decompressed control words can be stored in a cache; however, storing decompressed control words in a cache is generally much less efficient.
  • the compute elements can also be coupled to data cache, which can also be hierarchical, either directly or through queues, busses, and so on.
  • the tasks, subtasks, etc. are compiled by a compiler.
  • the compiler can include a general-purpose compiler, a hardware description-based compiler, a compiler written or “tuned” for the array of compute elements, a constraint-based compiler, a satisfiability-based compiler (SAT solver), and so on.
  • the compiler generates a stream of wide, variable length, microcode control words. The length of a microcode control word can be adjusted by compressing the control word, by recognizing that a compute element is unneeded by a task so that control bits within that control word are not required for that compute element, etc.
  • the control words can be used to route data, to set up operations to be performed by the compute elements, to idle individual compute elements or rows and/or columns of compute elements, etc.
  • the compiled microcode control words associated with the compute elements are distributed to the compute elements, and the processing task is executed. In order to accelerate the execution of tasks, the executing can include providing simultaneous execution of two or more potential compiled task outcomes.
  • a task can include a control word containing a branch. Since the outcome of the branch may not be known a priori to execution of the control word containing a branch, all possible control sequences that could be executed based on the branch can be simultaneously executed in the array. Then, when the branch outcome becomes known, the correct sequence of computations can be used, and the incorrect sequences of computations (e.g., the path not taken by the branch) can be ignored and/or flushed.
  • a highly parallel architecture with a shallow pipeline enables task processing.
  • a two-dimensional (2D) array of compute elements is accessed.
  • the compute elements can include compute elements, processors, or cores within an integrated circuit; processors or cores within an application specific integrated circuit (ASIC); cores programmed within a programmable device such as a field programmable gate array (FPGA); and so on.
  • Each compute element within the 2D array of compute elements is known to a compiler.
  • the compiler which can include a general-purpose compiler, a hardware-oriented compiler, or a compiler specific to the compute elements, can compile code for each of the compute elements.
  • Each compute element is coupled to its neighboring compute elements within the array of compute elements. The coupling of the compute elements enables data communication between and among compute elements.
  • Control for the array of compute elements is provided on a cycle-by-cycle basis.
  • the cycle can include a clock cycle, a data cycle, a processing cycle, etc.
  • the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler.
  • the microcode control word lengths can vary based on the type of control, compression, simplification such as identifying that a compute element is unneeded, etc.
  • the control words, which can include compressed control words, are decoded on a per element basis within the compute element array.
  • the control word can be decompressed to a level of fine control granularity, where each compute element (whether an integer compute element, floating point compute element, address generation compute element, write buffer element, read buffer element, etc.) is individually and uniquely controlled.
  • Each compressed control word is decompressed to allow control on a per element basis.
  • the decoding can be dependent on whether a given compute element is needed for processing a task or subtask; whether the compute element has a specific control word associated with it or the compute element receives a repeated control word (e.g., a control word used for two or more compute elements); and the like.
  • a compiled task is executed on the array of compute elements, based on the decompressing. The execution can be accomplished by executing a plurality of subtasks associated with the compiled task.
  • Fig. 1 is a flow diagram for a highly parallel processing architecture with a shallow pipeline.
  • Clusters of compute elements (CEs), such as CEs assembled within a 2D array of CEs, can be configured to process a variety of tasks.
  • the tasks can be based on a plurality of subtasks.
  • the tasks can accomplish a variety of processing objectives such as data manipulation, application processing, and so on.
  • the tasks can operate on a variety of data types including integer, real (floating point), and character data types; vectors and matrices; etc.
  • Control is provided to the array of compute elements based on microcode control words generated by a compiler. The control words enable or idle various compute elements; provide data; route results between or among CEs, caches, and storage; and the like.
  • control words which were compressed to reduce storage requirements, are decompressed on a per compute element basis. Because a control word spans the entire array, decompression is across the entire array on a per compute element basis. The decompressing enables execution of a compiled task on the array of compute elements.
  • the flow 100 includes accessing a two-dimensional (2D) array 110 of compute elements, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements.
  • the compute elements can be based on a variety of types of processors.
  • the compute elements or CEs can include central processing units (CPUs), graphics processing units (GPUs), processors or processing cores within application specific integrated circuits (ASICs), processing cores programmed within field programmable gate arrays (FPGAs), and so on.
  • compute elements within the array of compute elements have identical functionality.
  • the compute elements can be configured in a topology, where the topology can be built into the array, programmed or configured within the array, etc.
  • the array of compute elements is configured by the control word to implement one or more of a systolic, a Single Instruction Multiple Data (SIMD), a Multiple Instruction Multiple Data (MIMD), a vector, a cyclic, a spatial, a streaming, or a Very Long Instruction Word (VLIW) topology.
  • the compute elements can further include a topology suited to machine learning computation.
  • the compute elements can be coupled to other elements within the array of CEs.
  • the coupling of the compute elements can enable one or more topologies.
  • the other elements to which the CEs can be coupled can include storage elements such as one or more levels of cache storage; multiplier units; address generator units for generating load (LD) and store (ST) addresses; various queues; and so on.
  • the compiler to which each compute element is known can include a general-purpose compiler such as a C, C++, or Python compiler; a hardware-oriented compiler such as a VHDL or Verilog compiler; a compiler written for the array of compute elements; and so on.
  • the coupling of each CE to its neighboring CEs enables sharing of elements such as cache elements or multiplier elements; communication between or among neighboring CEs; and the like.
  • column busses can facilitate sharing between CEs and multiplier units and/or data cache elements.
  • the flow 100 includes providing control 120 for the array of compute elements on a cycle-by-cycle basis.
  • the control can be provided in the form of a control word, where the control word can be provided by the compiler.
  • the control word can be used to configure one or more CEs, to enable data to flow to or from the CE, to configure the CE to perform an operation, and so on.
  • one or more of the CEs can be controlled, while other CEs are unneeded by the particular task.
  • a CE that is unneeded can be marked as unneeded so that the data, control bits, etc. are neither carried in the control word nor sent to the CE after decompression.
  • the unneeded compute element can be controlled by a single bit.
  • a single bit can control an entire row of CEs by being decompressed into idle signals for each CE in the row.
  • the single bit can be set for “unneeded”, reset for “needed”, or set for a similar usage of the bit to indicate when a particular CE is unneeded by a task.
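  • How a single bit might be expanded during decompression into idle signals for a whole row can be sketched as follows; the row width and encoding are illustrative assumptions, not the patent's actual bit layout.

        # Expand one "row unneeded" bit into idle signals for every CE in a row.
        ROW_WIDTH = 8                            # assumed number of CEs per row

        def expand_row_control(row_idle_bit, per_ce_bits=None):
            if row_idle_bit:                     # one bit idles the whole row
                return ["idle"] * ROW_WIDTH
            # Otherwise use the per-element bits carried in the control word.
            return ["active" if bit else "idle" for bit in per_ce_bits]

        print(expand_row_control(1))                            # entire row idled
        print(expand_row_control(0, [1, 1, 0, 0, 1, 0, 0, 0]))  # mixed row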
  • the control is enabled by a stream of wide, variable length, microcode control words 122 generated by the compiler.
  • the microcode control words can vary in length based on the operations of the CEs controlled by the control word, compression of the control word, and so on.
  • a control word can be compressed by encoding fields or “bunches” of bits within the control word.
  • the compiled task can include multiple programming loop instances circulating within the array of compute elements.
  • the multiple programming loop instances can be used to accomplish parallelization of operations performed by the task.
  • the compiled task can include machine learning functionality.
  • the machine learning can be accomplished by configuring the compute elements within the array.
  • the machine learning functionality can include neural network implementation. The machine learning can be based on deep learning.
  • the flow 100 further includes storing relevant portions 130 of the control word within a cache associated with the array of compute elements.
  • the cache can be closely associated with the array of compute elements in order to provide fast, local storage for control words, data, intermediate results, and so on.
  • the cache can include a hierarchical cache.
  • a hierarchical cache can include a hierarchy of levels of cache such as cache level 1 (L1), cache level 2 (L2), cache level 3 (L3), and so on.
  • each successive level of cache can be larger and slower than the preceding level of cache. That is, L1 can be smaller and faster than L2, L2 can be larger and slower than L1 and smaller and faster than L3, and so on.
  • the one or more levels of cache provide faster access to control words, data, intermediate results, and so on, than a main storage accessible to the array of CEs.
  • L1, L2, and L3 caches can be four-way set associative.
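  • For readers less familiar with set associativity, this sketch shows how a four-way set-associative lookup splits an address into a set index and a tag; the line size and set count are assumptions made only for illustration.

        # Index/tag split for a 4-way set-associative cache (sizes assumed).
        LINE_BYTES = 64                       # assumed cache line size
        NUM_SETS = 128                        # assumed set count; 4 ways per set

        def locate(addr):
            line = addr // LINE_BYTES         # which cache line holds the address
            set_index = line % NUM_SETS       # which set to search
            tag = line // NUM_SETS            # compared against the 4 ways in the set
            return set_index, tag

        print(locate(0x1F40))                 # (125, 0) for this example address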
  • the cache, such as the L2 cache, comprises a dual read, single write (2R1W) cache. In a 2R1W cache, two read or load accesses to the cache and one write or store can occur at substantially the same time.
  • the cache can be used for other purposes.
  • the cache can enable the control word to be distributed across a row of the array of compute elements. The control word can be distributed from the cache across one or more CEs in the row of the array of the CEs.
  • the distribution across a row of the array of compute elements can be accomplished in one cycle.
  • the 2R1W cache supports simultaneous fetch of potential branch paths for the compiled task (discussed below).
  • the initial parts of different branch paths can be simultaneously instantiated in consecutive control words.
  • the flow 100 includes decompressing the control words 140 on a per element basis. Recall that within a given row of compute elements within the array of compute elements, one or more CEs may be unneeded by a given task or subtask.
  • control words that are distributed 142 per element can include control words that enable a CE to access data, perform an operation, generate data, etc.
  • In the case of unneeded CEs, if any, the control word only needs to provide a “not needed” bit for each such CE, and if all compute elements in a row are not needed, then only one bit is needed for that entire row to indicate the row is idle.
  • the decompressing can be performed on the control words stored in the cache.
  • the decompressing occurs cycle-by-cycle out of the cache.
  • the cycle-by-cycle decompressing can include decompressing a control word for a row of CEs, control words for each CE, control words shared by more than one CE, etc.
  • decompressing of a single control word can occur over multiple cycles.
  • the multiple cycles can include accessing a control word in the cache, decompressing a code word per CE, transmitting the decompressed code words to the CEs, etc.
  • the multiple cycles can accommodate control word straddle over a cache line fetch boundary. Since a control word can be of variable length, it can be long enough to straddle the cache line fetch boundary. Accessing such control words can require multiple cycles.
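  • The straddle case can be pictured with a short sketch: a variable-length control word that crosses a cache line fetch boundary needs one fetch per line it touches. The line size and word lengths below are illustrative.

        # Fetch cycles for a variable-length control word: one per cache line
        # touched, so a word straddling a line boundary costs two fetches.
        LINE_BYTES = 64                       # assumed cache line size

        def fetch_cycles(start_byte, word_bytes):
            first_line = start_byte // LINE_BYTES
            last_line = (start_byte + word_bytes - 1) // LINE_BYTES
            return last_line - first_line + 1

        print(fetch_cycles(0, 48))            # 1 cycle: fits within one line
        print(fetch_cycles(40, 48))           # 2 cycles: straddles the boundary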
  • the accessing, the providing, and the decompressing comprise a superstatic processor architecture.
  • a superstatic processor architecture can include various components such as input and output components, a main memory, and a CPU that includes a control unit and a processor.
  • the processor can further include registers and combinational logic.
  • the flow 100 can include providing 144 control information.
  • the control information can be provided by the compiler, downloaded from a library of control information, uploaded by a user, and so on.
  • the providing control information can include data handling.
  • the flow 100 includes ordering data retiring 146.
  • the data retiring can occur when data such as input or intermediate data is no longer required by a task or subtask.
  • Data retiring can also occur due to a cache miss. That is, when data is sought for processing by a task and that data is not located within the cache, a higher level of cache, or in a queue to load data into the cache, then a cache miss occurs.
  • the cache miss can cause the data within the cache to be “retired”, flushed, or written back, and new data to be accessed within a higher-level cache or from main storage.
  • Data retirement can be based on latency.
  • a task can require a multiplication operation which can be performed on a multiplier element.
  • the data required by the multiplier element must be available within an amount of time, and the product generated by the multiplier element must also be generated within an amount of time subsequent to data availability.
  • resources such as the multiplier element must be “consumed” by performing a multiplication, or “retired” because the multiplication did not occur within a window of time.
  • the flow 100 includes executing a compiled task 150 on the array of compute elements, based on the decompressing.
  • the task and any subtasks associated with the task can be executed on the CEs within the array.
  • the executing can include reading or loading data, processing data, writing or storing data, and so on.
  • the executing is based on the control word.
  • the executing can occur during a single cycle or can extend over multiple cycles.
  • the flow 100 further includes providing simultaneous execution 160 of two or more potential compiled task outcomes.
  • a task can include a decision point, where the decision point can be based on data, a result, a condition, and so on. The decision point can generate the two or more potential compiled task outcomes.
  • the two or more potential compiled task outcomes comprise a computation result or a routing control.
  • a compiled task outcome can include executing one sequence of control words based on a condition; executing a second sequence of control words based on a different, negative, or unmet condition; and so on.
  • the two or more potential compiled outcomes can be controlled by the same control word.
  • the code sequences associated with the potential compiled task outcomes can be fetched, and the execution of the code sequences, where a sequence is a succession of control words, can be initiated. Then, when the correct or true outcome is determined, the sequence of control words associated with the correct outcome proceeds, while execution of the incorrect outcome is halted.
  • the two or more potential compiled outcomes are executed on spatially separate compute elements within the array of compute elements.
  • the spatially separate compute elements can reduce or eliminate resource contention within the array of CEs.
  • steps in the flow 100 may be changed in order, repeated, omitted, or the like without departing from the disclosed concepts.
  • Various embodiments of the flow 100 can be included in a computer program product embodied in a computer readable medium that includes code executable by one or more processors.
  • Fig. 2 is a flow diagram for task scheduling.
  • tasks can be processed on an array of compute elements.
  • the task can include general operations such as arithmetic, vector, or matrix operations; operations based on applications such as neural network or deep learning operations; and so on.
  • In order for the tasks to be processed correctly, the tasks must be scheduled on the array of compute elements. Scheduling the tasks can be performed to maximize task processing throughput, to ensure that a task that generates data for a second task is processed prior to processing of the second task, and so on.
  • the task scheduling enables a highly parallel processing architecture with a shallow pipeline.
  • a two-dimensional (2D) array of compute elements is accessed, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements.
  • Control for the array of compute elements is provided on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler.
  • the control words are decompressed (in parallel, not sequentially) on a per element basis.
  • a compiled task is executed on the array of compute elements, based on the decompressing.
  • the flow 200 includes compiling tasks 210 for execution on a two-dimensional array of compute elements. Recall that each of the compute elements within the array is known to the compiler, so that the compiler can generate, if needed, a bunch for each of the compute elements.
  • the compiler can include a general-purpose compiler such as a C, C++, or Python compiler; a hardware-oriented compiler such as a VHDL or Verilog compiler; etc.
  • the compiler enables the array of compute elements to act as a software-defined processor.
  • the compiled task can determine 212 an unneeded compute element within a row of compute elements in the array of compute elements.
  • a row of compute elements within an array of compute elements can include a number of compute elements, where the number of compute elements can include 2, 4, 8, 16, etc. compute elements.
  • a compiled task can be executed in one or more compute elements. If fewer than the full complement of compute elements within a row is required for execution of a task, then the unneeded compute elements can be marked as unneeded.
  • the flow 200 includes using compression 214 to reduce the size of control words generated by the compiler.
  • the compression can be used to increase functional density of the control words, where the increase in functional density, also known as an increase in information density, enables a reduction in storage requirements for the control words.
  • the compression can include lossless compression.
  • the unneeded compute element or idle row/column can be controlled by a single bit 216 in the control word. Setting the bit indicating that the compute element is unneeded for a given task can further improve compression since further information, such as control information for the unneeded compute element, can be eliminated from the control word.
  • the compiled task includes a spatial allocation 218 of subtasks on one or more compute elements within the array of compute elements.
  • a given task can comprise a plurality of subtasks.
  • the subtasks can be distributed across the array of compute elements based on compute element availability, task precedence, task order, and the like.
  • Spatial allocation of subtasks can include allocating subtasks to unused processing elements within a row or a column of the array.
  • the spatial allocation provides for an idle compute element row and/or column 220 in the array of compute elements. That is, instead of simply assigning a subtask to a random compute element, the subtasks can be assigned to unused compute elements within rows or columns that already include assigned compute elements.
  • unused compute elements can be “accumulated” or collected into columns and rows, and the columns and rows can be marked as unneeded.
  • the providing for idle compute element rows and/or columns further enables compression of compiled control words by eliminating the need for control words for the unneeded rows and/or columns.
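  • The compression payoff of such accumulation can be sketched as follows: once allocation packs subtasks so that whole rows or columns carry no work, each of those rows or columns collapses to a single bit. The occupancy grid and helper below are illustrative.

        # Find rows and columns with no assigned subtasks; each such row or
        # column can be marked unneeded with a single bit in the control word.
        def idle_rows_and_cols(occupancy):    # occupancy[r][c]: True if CE is used
            idle_rows = [r for r, row in enumerate(occupancy) if not any(row)]
            columns = list(zip(*occupancy))   # transpose to scan columns
            idle_cols = [c for c, col in enumerate(columns) if not any(col)]
            return idle_rows, idle_cols

        grid = [[True, False, False],
                [False, False, False],        # row 1 is entirely unused
                [True, False, False]]
        print(idle_rows_and_cols(grid))       # ([1], [1, 2])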
  • the compiled task schedules computation 230 on the array of compute elements.
  • the scheduling of computation on the array of compute elements can be dependent on the tasks and subtasks that are being scheduled.
  • the scheduling can be based on task precedence or priority, compute element availability, data availability, and so on.
  • the scheduling can be based on system management of the array of compute elements.
  • the computation that is scheduled includes compute element placement, results routing, and computation wave-front propagation within the array of compute elements.
  • the scheduling can further be based on power consumption, heat dissipation, processing speed, and the like.
  • the flow 200 can include determining routing and scheduling 240 within the array of compute elements.
  • the determining routing and scheduling can be based on choosing the shortest communications paths between and among compute elements; organizing data within one or more levels of cache accessible to the compute elements; minimizing access to storage beyond the one or more levels of cache; and so on.
  • the computation wavefront can include routing through an element without that element actually manipulating the data passing through it. For example, an arithmetic logic unit (ALU) can allow routed information to pass through untouched. Likewise, a ringbus structure for interelement communication can allow routed information to pass through untouched.
  • the computation wavefront can include data that has been temporarily “parked”, that is, stored for later use, within a memory element of a compute array system.
  • the temporary parking can occur within a ringbus register, a local memory element, a compute element memory, and so on.
  • Various steps in the flow 200 may be changed in order, repeated, omitted, or the like without departing from the disclosed concepts.
  • Various embodiments of the flow 200 can be included in a computer program product embodied in a computer readable medium that includes code executable by one or more processors.
  • Fig. 3 shows a system block diagram for a highly parallel architecture with a shallow pipeline.
  • the shallow pipeline primarily refers to the pipeline for the compressed control word fetch and decompress functions disclosed herein.
  • the highly parallel architecture can comprise components including compute elements, processing elements, buffers, one or more levels of cache storage, system management, arithmetic logic units, multipliers, and so on.
  • the various components can be used to accomplish task processing, where the task processing is associated with program execution, job processing, etc.
  • the task processing is enabled using a parallel processing architecture with a shallow pipeline.
  • a two-dimensional (2D) array of compute elements is accessed, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements.
  • Control for the array of compute elements is provided on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler.
  • the control words are decompressed on a per element basis.
  • there may be global control information in a control word that is not associated with any given control element such as next compressed control word (CCW) fetch address, control information for queues and other elements, information for hazard detection logic, etc.
  • a compiled task is executed on the array of compute elements, based on the decompressing.
  • a system block diagram 300 for a highly parallel architecture with a shallow pipeline is shown.
  • the system block diagram can include a compute element array 310.
  • the compute element array 310 can be based on compute elements, where the compute elements can include processors, central processing units (CPUs), graphics processing units (GPUs), coprocessors, and so on.
  • the compute elements can be based on processing cores configured within chips such as application specific integrated circuits (ASICs), processing cores programmed into programmable chips such as field programmable gate arrays (FPGAs), and so on.
  • the compute elements can comprise a homogeneous array of compute elements.
  • the system block diagram 300 can include translation and look-aside buffers such as translation and look-aside buffers 312 and 338.
  • the translation and look-aside buffers can comprise memory caches, where the memory caches can be used to reduce storage access times.
  • the system block diagram can include logic for load and access order and selection.
  • the logic for load and access order and selection can include logic 314 and logic 340.
  • Logic 314 and 340 can accomplish load and access order and selection for the lower data block (316, 318, and 320) and the upper data block (342, 344, and 346), respectively. This layout technique can double access bandwidth, reduce interconnect complexity, and so on.
  • Logic 340 can be coupled to compute element array 310 through the queues, address generators, and multiplier units 347 component. In the same way, logic 314 can be coupled to compute element array 310 through the queues, address generators, and multiplier units 317 component.
  • the system block diagram can include access queues.
  • the access queues can include access queues 316 and 342.
  • the access queues can be used to queue requests to access caches, storage, and so on, for storing data and loading data.
  • the system block diagram can include level 1 (L1) data caches such as L1 caches 318 and 344.
  • the L1 caches can be used to store blocks of data such as data to be processed together, data to be processed sequentially, and so on.
  • the L1 cache can include a small, fast memory that is quickly accessible by the compute elements and other components.
  • the system block diagram can include level 2 (L2) data caches.
  • the L2 caches can include L2 caches 320 and 346.
  • the L2 caches can include larger, slower storage in comparison to the L1 caches.
  • the L2 caches can store “next up” data, results such as intermediate results, and so on.
  • the L1 and L2 caches can further be coupled to level 3 (L3) caches.
  • the L3 caches can include L3 caches 322 and 348.
  • the L3 caches can be larger than the L2 and L1 caches and can include slower storage. Accessing data from L3 caches is still faster than accessing main storage.
  • the L1, L2, and L3 caches can include 4-way set associative caches.
  • the block diagram 300 can include a system management buffer 324.
  • the system management buffer can be used to store system management codes or control words that can be used to control the array 310 of compute elements.
  • the system management buffer can be employed for holding opcodes, codes, routines, functions, etc. which can be used for exception or error handling, management of the parallel architecture for processing tasks, and so on.
  • the system management buffer can be coupled to a decompressor 326.
  • the decompressor can be used to decompress system management compressed control words (CCWs) from system management compressed control word buffer 328 and can store the decompressed system management control words in the system management buffer 324.
  • the compressed system management control words can require less storage than the uncompressed control words.
  • the system management CCW component 328 can also include a spill buffer.
  • the spill buffer can comprise a large static random-access memory (SRAM) which can be used to support multiple nested levels of exceptions.
  • the compute elements within the array of compute elements can be controlled by a control unit such as control unit 330. While the compiler, through the control word, controls the individual elements, the control unit can pause the array to ensure that new control words are not driven into the array.
  • the control unit can receive a decompressed control word from a decompressor 332.
  • the decompressor can decompress a control word (discussed below) to enable or idle rows or columns of compute elements, to enable or idle individual compute elements, to transmit control words to individual compute elements, etc.
  • the decompressor can be coupled to a compressed control word store such as compressed control word cache 1 (CCWC1) 334.
  • CCWC1 can include a cache such as an LI cache that includes one or more compressed control words.
  • CCWC1 can be coupled to a further compressed control word store such as compressed control word cache 2 (CCWC2) 336.
  • CCWC2 can be used as an L2 cache for compressed control words.
  • CCWC2 can be larger and slower than CCWC1.
  • CCWC1 and CCWC2 can include 4-way set associativity.
  • the CCWC1 cache can contain decompressed control words, in which case it could be designated as DCWC1.
  • decompressor 332 can be coupled between CCWC1 334 (now DCWC1) and CCWC2 336.
  • Fig. 4 illustrates compute element array detail 400.
  • a compute element array can be coupled to components which enable the compute elements to process one or more tasks, subtasks, and so on. The components can access and provide data, perform specific high-speed operations, and the like.
  • the compute element array and its associated components enable a parallel processing architecture with a shallow pipeline.
  • the compute element array 410 can perform a variety of processing tasks, where the processing tasks can include operations such as arithmetic, vector, or matrix operations; audio and video processing operations; neural network operations; etc.
  • Each compute element of the compute element array 410 can contain one or more scratchpad memory elements 411.
  • the scratchpad memory elements can be an integral part of a compute element.
  • the scratchpad memory elements can function as a level 0 (L0) cache for an individual compute element.
  • the scratchpad memory elements can function as register files for each individual CE.
  • the compiler can organize a plurality of CE register files as a larger, many-ported register file.
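  • One way to picture the compiler treating many per-CE register files as a single larger, many-ported register file is the mapping below; the register depth and the address split are assumptions made only for illustration.

        # Illustrative view: the compiler numbers registers globally, and each
        # global register maps to a (compute element, local register) pair.
        REGS_PER_CE = 16                      # assumed per-CE register file depth

        def global_to_local(global_reg):
            ce_index = global_reg // REGS_PER_CE
            local_reg = global_reg % REGS_PER_CE
            return ce_index, local_reg

        print(global_to_local(37))            # global register 37 is CE 2, slot 5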
  • the compute elements can be coupled to multiplier units such as lower multiplier units 412 and upper multiplier units 414.
  • the multiplier units can be used to perform high-speed multiplications associated with general processing tasks, multiplications associated with neural networks such as deep learning networks, multiplications associated with vector operations, and the like.
  • the compute elements can be coupled to load queues such as load queues 416 and load queues 418.
  • the load queues can be coupled to the LI data caches as discussed previously.
  • the load queues can be used to load storage access requests from the compute elements.
  • the load queues can track expected load latencies and can notify a control unit if a load latency exceeds a threshold. Notification of the control unit can be used to signal that a load may not arrive within an expected timeframe.
  • the load queues can further be used to pause the array of compute elements.
  • the load queues can send a pause request to the control unit that will pause the entire array, while individual elements can be idled under control of the control word.
  • When an element is not explicitly controlled, it can be placed in the idle (or low power) state. No operation is performed, but ring buses can continue to operate in a “pass thru” mode to allow the rest of the array to operate properly.
  • If a compute element is used just to route data unchanged through its ALU, it is still considered active.
  • the memory systems can be free running and can continue to operate while the array is paused. Because multicycle latency can occur due to control signal transport, which results in additional “dead time”, it can be beneficial to allow the memory system to “reach into” the array and deliver load data to appropriate scratchpad memories while the array is paused. This mechanism can operate such that the array state is known, as far as the compiler is concerned. When array operation resumes after a pause, new load data will have arrived at a scratchpad, as required for the compiler to maintain the statically scheduled model.
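  • This pause behavior can be modeled with a short sketch: control words stop advancing while the array is paused, but the free-running memory system still delivers in-flight load data into scratchpads, so the statically scheduled state holds when operation resumes. Everything here is an illustrative model rather than the patent's mechanism.

        # Model of a paused array: the control-word program counter only
        # advances on unpaused cycles, yet in-flight loads still land in the
        # scratchpad so the compiler's statically scheduled state is preserved.
        def simulate(control_words, inflight_loads, scratchpad, paused_cycles):
            cycle, pc = 0, 0
            while pc < len(control_words):
                for ld in (x for x in inflight_loads if x["arrives"] == cycle):
                    scratchpad[ld["dest"]] = ld["data"]  # memory "reaches into" array
                if cycle not in paused_cycles:           # array advances only when
                    pc += 1                              # not paused
                cycle += 1
            return scratchpad

        print(simulate(["cw0", "cw1"],
                       [{"arrives": 1, "dest": "r0", "data": 42}],
                       {}, paused_cycles={1}))           # {'r0': 42}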
  • Fig. 5 shows array row control decode.
  • a control word such as a compressed control word can be decompressed and decoded.
  • the decoded control word can be used to provide control to compute elements within a row or a column of an array of compute elements.
  • the array row control decode enables a highly parallel processing architecture with a shallow pipeline.
  • a two-dimensional (2D) array of compute elements is accessed, where each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements.
  • Control for the array of compute elements is provided on a cycle-by-cycle basis, where the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler.
  • the control words are decompressed on a per element basis, and a compiled task is executed on the array of compute elements, based on the decompressing.
  • the row decode can include a row valid field V 510.
  • a control word can be associated with an element valid (EV) bit 514.
  • an idle bit can be transmitted, or the previous control word can be resent to a given compute or other element within the array.
  • the various functions that can be performed based on row valid V, repeat R, and element valid (EV) are shown 516.
  • the various functions can include transmitting idle bits to all elements, transmitting an idle bit for a given element, transmitting a unique control word, and transmitting a repeated control word for a given element.
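  • A sketch of that decode selection, treating V, R, and EV as simple flags, is given below; the mapping is a plausible reading of the figure rather than its exact table.

        # Choose the control action for one element from row valid (V),
        # repeat (R), and element valid (EV) bits. Mapping is illustrative.
        def decode_element(v, r, ev, unique_word, previous_word):
            if not v:
                return "idle"              # row invalid: idle bits to all elements
            if not ev:
                return "idle"              # this element is unneeded this cycle
            if r:
                return previous_word       # repeated control word for the element
            return unique_word             # unique control word for the element

        print(decode_element(1, 0, 1, "add r1,r2", "mul r3,r4"))  # add r1,r2
        print(decode_element(1, 1, 1, "add r1,r2", "mul r3,r4"))  # mul r3,r4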
  • Fig. 6 illustrates example encoding for a single control word row 600.
  • Elements such as compute elements within a row of compute elements can be controlled such that some or all of the compute elements can be enabled for processing a task.
  • the determination of whether a given compute element is active can be based on a bit, such as an element valid (EV) bit, associated with each compute element.
  • all of the compute elements within a row of the array of compute elements can remain idle.
  • the row of compute elements can remain idle due to pending data, pending processing tasks, and so on.
  • the idle compute element row can be controlled by a single bit in the control word.
  • the single control bit can include a leading control bit.
  • a column of compute elements within the array of compute elements can be idle, and the idle compute element column can be controlled by a single bit in the control word.
  • Control word encoding for a single compute element row enables a highly parallel processing architecture with a shallow pipeline.
  • a two-dimensional (2D) array of compute elements is accessed.
  • Control for the array of compute elements is provided on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by a compiler. The control words are decompressed on a per element basis, and a compiled task is executed on the array of compute elements.
  • the encoding can include a single bit 610 which can be used to indicate whether a given row of compute elements is idle. Similarly, a single bit can be included to indicate that a given column of compute elements is idle or not (not shown).
  • the encoding can include bits such as element valid (EV) bits associated with each compute element within the row or column of compute elements.
  • the example encoding can indicate that two compute elements within the row of compute elements are active, while other compute elements within the row remain idle.
  • the example encoding for a single compute element row can include fields or “bunches” for compute element control word bits. Two example fields are shown, field 620 and field 622.
  • the control word bunches can include control bits for a type of element, where the type of element can include a compute element, a multiply element, and so on.
  • Fig. 7 shows example compressed control word sizes.
  • Control words which are used to control compute elements within an array of compute elements, can be generated by a compiler.
  • the generated control words can be compressed in order to reduce storage requirements associated with the compiled control words.
  • the compressed control words can be decompressed, and the decompressed control words can be used to control the compute elements within the array of compute elements.
  • Compressed control words enable a highly parallel processing architecture with a shallow pipeline.
  • a 2D array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array. Control for the array of compute elements is provided on a cycle-by-cycle basis.
  • the control words are decompressed on a per element basis, and a compiled task is executed on the array of compute elements.
  • Control words provided to control the array of compute elements can be compressed 700.
  • the amount of compression that can be achieved for a control word can be compared to a baseline, such as an x86 instruction.
  • the control words are compressed in order to reduce requirements for the array of compute elements with regard to storage of the control words.
  • the example compressed control word (CCW) can include a “pause” 710. The pause discontinues operation of the array of compute elements (CEs) and no operations are performed while in a pause. A pause can be used to handle stalls that can occur due to cache misses when accessing data to be processed by compute elements.
  • the CCW can control a number of rows of CEs 712 within the array.
  • the CCW can control a number of CEs 714, where the CEs can include CEs within a row.
  • the CEs can be controlled by a processing element valid (EV) bit. Controlling more rows of CEs at a time achieves an economy of scale with respect to EV bits of the CCW.
  • the CCW can control the number of multiply elements (MEs) 716 and whether upper multiply elements (MEs) 718 are used. In the example, the number of MEs can include 32 MEs.
  • the CCW can control a number of address generator units (AGUs) 720. Increasing numbers of AGUs can be associated with an increasing number of compute elements.
  • the CCW can control upper AGUs 722 and lower AGUs (not shown).
  • the CCW can control a number of load operations (LD) 724 and a number of store (ST) 726 operations.
  • the numbers of LD and ST operations can be dependent on the types of tasks being processed on the CEs.
  • the size of a compressed control word can vary 728.
  • the control word can include a control word within a plurality of control words, where the control words comprise a stream of wide, variable length, microcode control words generated by the compiler.
  • the size in bits of a CCW can vary based on the numbers of CEs, MEs, AGUs, and LD and ST operations performed by compute elements within the array of compute elements.
  • the amount of compression 730 that can be achieved for a control word with respect to a baseline such as an x86 instruction depends on the number of CEs, MEs, AGUs, data operations, etc. associated with a given CCW.
  • the amount of compression or compression factor may be reduced based on the complexity of the control performed by the CCW.
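  • As a hedged arithmetic illustration of that trend (all constants below are invented, not taken from the figure), a compressed control word's size can be modeled as a fixed overhead plus a per-active-element cost, so the compression factor relative to the full decompressed width falls as more elements need explicit control.

        # Illustrative model: CCW size grows with actively controlled elements,
        # so the compression factor shrinks as control complexity rises.
        FULL_WIDTH_BITS = 1024                # assumed decompressed control width
        HEADER_BITS = 16                      # assumed fixed CCW overhead
        BITS_PER_ACTIVE_ELEMENT = 12          # assumed per-element "bunch" size

        def compression_factor(active_elements):
            ccw_bits = HEADER_BITS + BITS_PER_ACTIVE_ELEMENT * active_elements
            return FULL_WIDTH_BITS / ccw_bits

        print(round(compression_factor(4), 1))   # few active elements: 16.0x
        print(round(compression_factor(64), 1))  # many active elements: 1.3x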
  • Fig. 8 is a table showing example decompressed control word fields.
  • control can be provided to an array of compute elements.
  • the control of the array is enabled by a stream of microcode control words, where the microcode control words can be generated by a compiler.
  • the microcode control word which comprises a plurality of fields, can be stored in a compressed format to reduce storage requirements.
  • the compressed control word can be decompressed in order to enable control of one or more compute elements within the array of compute elements.
  • the fields of the decompressed control word enable a highly parallel processing architecture with a shallow pipeline.
  • a two-dimensional (2D) array of compute elements is accessed, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements.
  • Control for the array of compute elements is provided on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler.
  • the control words are decompressed on a per element basis, such that the control word, once decompressed, can control an entire array of compute elements (or any subset of compute elements) on a cycle-by-cycle basis.
  • a compiled task is executed on the array of compute elements, based on the decompressing.
  • a table 800 depicting control word fields for a decompressed control word is shown.
  • the decompressed control word comprises fields 810. While 20 fields are shown, other numbers of fields can be included in the decompressed control word. The number of fields can be based on a number of compute elements within an array, processing capabilities of the compute elements, compiler capabilities, requirements of processing tasks, and so on.
  • Each field within the decompressed control word can be assigned a purpose or function 812. The function of a field can include providing, controlling, etc., commands, data, addresses, and so on.
  • the one or more fields within the decompressed control word can include spare bits.
  • Each field within the decompressed control word can include a size 814.
  • the size can be based on a number of bits, although other bit groupings can be specified, such as nibbles, bytes, and the like.
  • Comments 816 can also be associated with fields within the decompressed control word. The comments further explain the purpose, function, etc., of a given field.
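  • To make the table concrete, the sketch below shows one way such a field layout might be tabulated in Python. The five placeholder entries stand in for the figure's 20 fields; their names, purposes, and sizes are illustrative assumptions, not the actual field definitions.

```python
# Placeholder layout for a decompressed control word; each entry mirrors
# the table's columns: field name, purpose/function, size, comment.
DCW_FIELDS = [
    ("ce_valid", "per-element enable",        16, "one EV bit per controlled CE"),
    ("opcode",   "compute operation select",   8, "per-element command"),
    ("agu_ctl",  "address generator control",  6, "upper and lower AGUs"),
    ("ld_st",    "load/store counts",          4, "task dependent"),
    ("spare",    "reserved",                   2, "spare bits"),
]

def dcw_size_bits(fields=DCW_FIELDS) -> int:
    """Total decompressed control word width in bits."""
    return sum(size for _, _, size, _ in fields)
```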
  • Fig. 9 is a system diagram for task processing.
  • the task processing is performed using a highly parallel processing architecture with a shallow pipeline.
  • the system 900 can include one or more processors 910, which are attached to a memory 912 which stores instructions.
  • the system 900 can further include a display 914 coupled to the one or more processors 910 for displaying data; intermediate steps; control words; control words implementing Very Long Instruction Word (VLIW) functionality; topologies including systolic, vector, cyclic, spatial, streaming, or VLIW topologies; and so on.
  • one or more processors 910 are coupled to the memory 912, wherein the one or more processors, when executing the instructions which are stored, are configured to: access a two-dimensional (2D) array of compute elements, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements; provide control for the array of compute elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler; decompress the control words on a per element basis; and execute a compiled task on the array of compute elements, based on the decompressing.
  • the compute elements can include compute elements within one or more integrated circuits or chips; compute elements or cores configured within one or more programmable chips such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs); processors configured as a mesh; standalone processors; etc.
  • the system 900 can include a cache 920.
  • the cache 920 can be used to store data, control words, intermediate results, microcode, and so on.
  • the cache can comprise a small, local, easily accessible memory available to one or more compute elements. Embodiments include storing relevant portions of a control word within the cache associated with the array of compute elements.
  • the cache can be accessible to one or more compute elements.
  • the cache comprises a dual read, single write (2R1W) cache. That is, the 2R1W cache can enable two contemporaneous read operations and one write operation without the read and write operations interfering with one another.
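  • A behavioral Python sketch of such a 2R1W cache appears below; tag checks, replacement policy, and port arbitration are omitted, and the interface is an assumption rather than the patent's design.

```python
class Cache2R1W:
    """Dual-read, single-write (2R1W) cache model: two reads and one
    write are serviced in the same cycle without interfering."""
    def __init__(self):
        self.lines = {}

    def cycle(self, read_addr_a, read_addr_b, write_addr=None, write_data=None):
        # Both reads observe pre-write contents, mimicking independent ports.
        a = self.lines.get(read_addr_a)
        b = self.lines.get(read_addr_b)
        if write_addr is not None:
            self.lines[write_addr] = write_data
        return a, b
```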
  • the system 900 can include an accessing component 930.
  • the accessing component 930 can include control logic and functions for accessing a two-dimensional (2D) array of compute elements, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements.
  • a compute element can include one or more processors, processor cores, processor macros, and so on.
  • Each compute element can include an amount of local storage. The local storage may be accessible to one or more compute elements.
  • Each compute element can communicate with neighbors, where the neighbors can include nearest neighbors or more remote “neighbors”. Communication between and among compute elements can be accomplished using a bus such as an industry standard bus, a ringbus, a network such as a computer network, etc.
  • the ringbus is implemented as a distributed multiplexor (MUX).
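  • The sketch below illustrates one way a ringbus could act as a distributed multiplexor: each node's MUX either forwards the incoming value or injects local data, and the outputs rotate to the next node. The data structures are illustrative assumptions, not the patent's implementation.

```python
def ringbus_step(nodes):
    """One cycle of a ringbus modeled as a distributed MUX. `nodes` is a
    list of dicts with 'incoming', 'inject', and 'local' keys (assumed)."""
    n = len(nodes)
    # Per-node MUX: select local data when injecting, else pass through.
    outgoing = [node["local"] if node["inject"] else node["incoming"]
                for node in nodes]
    # Rotate: each node's output becomes the next node's input.
    for i, node in enumerate(nodes):
        node["incoming"] = outgoing[(i - 1) % n]
```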
  • the 2R1W cache can support simultaneous fetch of potential branch paths for the compiled task. Since the branch path taken by a branch control word can be data dependent and is therefore not known a priori, control words associated with more than one branch path can be fetched prior to execution of the branch control word. As discussed previously, initial parts of both branch paths can be instantiated in a succession of control words. When the correct branch path is determined, the computations associated with the untaken branch can be flushed and/or ignored.
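  • Building on the Cache2R1W sketch above, the following shows how both potential branch paths might be fetched through the two read ports and the untaken path discarded at resolution; all function and variable names are hypothetical.

```python
def fetch_branch_paths(cache, taken_pc, not_taken_pc):
    """Fetch control words for both potential branch paths in one cycle
    using the two read ports of a 2R1W cache (sketch)."""
    taken_cw, not_taken_cw = cache.cycle(taken_pc, not_taken_pc)
    return {"taken": taken_cw, "not_taken": not_taken_cw}

def resolve_branch(paths, branch_taken: bool):
    # Keep the correct path's control word; flush/ignore the other.
    return paths["taken"] if branch_taken else paths["not_taken"]
```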
  • the system 900 can include a providing component 940.
  • the providing component 940 can include control and functions for providing control for the array of compute elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler.
  • the control of the array of compute elements can include configuring the array to perform various compute operations.
  • the compute operations can enable audio or video processing, artificial intelligence processing, deep learning, and the like.
  • the microcode control words can include opcode fields, data fields, compute array configuration fields, etc.
  • the compiler that generates the microcode can include a general-purpose compiler, a parallelizing compiler, a compiler optimized for the array of compute elements, a compiler specialized to perform one or more processing tasks, and so on.
  • the providing control can implement one or more topologies such as processing topologies within the array of compute elements.
  • the array of compute elements can include a systolic, a vector, a cyclic, a spatial, a streaming, or a Very Long Instruction Word (VLIW) topology.
  • Other topologies can include a neural network topology.
  • a control word that can be associated with one or more compute elements within the array need not be stored by a single compute element.
  • the cache 920 enables the control word to be distributed across a row of the array of compute elements.
  • the system 900 can include a decompressing component 950.
  • the decompressing component 950 can include control logic and functions for decompressing the control words on a per element basis, where each control word can comprise a plurality of compute element control groups, or bunches.
  • One or more control words can be stored in a compressed format within a memory such as the cache.
  • the compression of the control words can reduce storage requirements, complexity of decoding components, and so on.
  • a substantially similar decompression technique can be used to decompress control words for each compute element, or more than one decompression technique can be used.
  • the compression of the control words can be based on compute cycles associated with the array of compute elements. In embodiments, the decompressing can occur cycle-by-cycle out of the cache.
  • the decompressing of control words for one or more compute elements can occur cycle-by-cycle. In other embodiments, decompressing of a single control word can occur over multiple cycles.
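  • A minimal sketch of such a cycle-by-cycle decompress-and-drive loop follows; the `decompress` callback and the per-element `apply` interface are assumptions, not the patent's mechanism.

```python
def run(compressed_stream, array, decompress):
    """Each cycle: read one compressed control word out of the cache,
    decompress it on a per-element basis, and drive the array.
    `decompress` maps a CCW to {element_id: control_bits} (assumed)."""
    for cycle, ccw in enumerate(compressed_stream):
        per_element = decompress(ccw)
        for element_id, bits in per_element.items():
            array[element_id].apply(bits, cycle)  # drive this CE this cycle
```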
  • the system 900 can include an executing component 960.
  • the executing component 960 can include control logic and functions for executing a compiled task on the array of compute elements, based on the decompressing.
  • the compiled task, which can be one of many tasks associated with a processing job, can be executed on one or more compute elements within the array of compute elements. In embodiments, the executing of the compiled task can be distributed across compute elements in order to parallelize the execution.
  • the executing of the compiled task can include executing the tasks for processing multiple datasets (e.g., single instruction multiple data or SIMD execution).
  • Embodiments can include providing simultaneous execution of two or more potential compiled task outcomes.
  • the two or more potential compiled task outcomes can be based on one or more branch paths, data, etc.
  • the executing can be based on one or more control words.
  • the same control word can be executed on a given cycle across the array of compute elements.
  • the executing tasks can be performed by compute elements located throughout the array of compute elements.
  • the two or more potential compiled outcomes can be executed on spatially separate compute elements within the array of compute elements. Using spatially separate compute elements can enable reduced storage, bus, and network contention; reduced power dissipation by the compute elements; etc.
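  • One possible spatial partitioning is sketched below, running the two potential outcomes on disjoint halves of a group of compute elements; the half-and-half split and the `execute` interface are assumptions for illustration.

```python
def execute_both_outcomes(elements, path_a_ops, path_b_ops):
    """Run two potential compiled-task outcomes on spatially separate
    compute elements; the untaken result is ignored at resolution."""
    half = len(elements) // 2
    results_a = [elements[i].execute(op)
                 for i, op in zip(range(half), path_a_ops)]
    results_b = [elements[half + i].execute(op)
                 for i, op in zip(range(half), path_b_ops)]
    return results_a, results_b
```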
  • the system 900 can include a computer program product embodied in a computer readable medium for task processing, the computer program product comprising code which causes one or more processors to perform operations of: accessing a two-dimensional (2D) array of compute elements, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements; providing control for the array of compute elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length, microcode control words generated by the compiler; decompressing the control words on a per element basis; and executing a compiled task on the array of compute elements, based on the decompressing.
  • Each of the above methods may be executed on one or more processors on one or more computer systems.
  • Embodiments may include various forms of distributed computing, client/server computing, and cloud-based computing.
  • the depicted steps or boxes contained in this disclosure’s flow charts are solely illustrative and explanatory. The steps may be modified, omitted, repeated, or reordered without departing from the scope of this disclosure. Further, each step may contain one or more sub-steps. While the foregoing drawings and description set forth functional aspects of the disclosed systems, no particular implementation or arrangement of software and/or hardware should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. All such arrangements of software and/or hardware are intended to fall within the scope of this disclosure.
  • The block diagrams and flowchart illustrations depict methods, apparatus, systems, and computer program products.
  • the elements and combinations of elements in the block diagrams and flow diagrams show functions, steps, or groups of steps of the methods, apparatus, systems, computer program products and/or computer-implemented methods. Any and all such functions — generally referred to herein as a “circuit,” “module,” or “system” — may be implemented by computer program instructions, by special-purpose hardware-based computer systems, by combinations of special purpose hardware and computer instructions, by combinations of general purpose hardware and computer instructions, and so on.
  • a programmable apparatus which executes any of the above-mentioned computer program products or computer-implemented methods may include one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors, programmable devices, programmable gate arrays, programmable array logic, memory devices, application specific integrated circuits, or the like. Each may be suitably employed or configured to process computer program instructions, execute computer logic, store computer data, and so on.
  • a computer may include a computer program product from a computer-readable storage medium, and this medium may be internal or external, removable and replaceable, or fixed.
  • a computer may include a Basic Input/Output System (BIOS), firmware, an operating system, a database, or the like that may include, interface with, or support the software and hardware described herein.
  • Embodiments of the present invention are limited neither to conventional computer applications nor to the programmable apparatus that run them.
  • the embodiments of the presently claimed invention could include an optical computer, quantum computer, analog computer, or the like.
  • a computer program may be loaded onto a computer to produce a particular machine that may perform any and all of the depicted functions. This particular machine provides a means for carrying out any and all of the depicted functions.
  • any combination of one or more computer readable media may be utilized including but not limited to: a non-transitory computer readable medium for storage; an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor computer readable storage medium or any suitable combination of the foregoing; a portable computer diskette; a hard disk; a random access memory (RAM); a read-only memory (ROM); an erasable programmable read-only memory (EPROM, Flash, MRAM, FeRAM, or phase change memory); an optical fiber; a portable compact disc; an optical storage device; a magnetic storage device; or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • computer program instructions may include computer executable code.
  • languages for expressing computer program instructions may include without limitation C, C++, Java, JavaScript™, ActionScript™, assembly language, Lisp, Perl, Tcl, Python, Ruby, hardware description languages, database programming languages, functional programming languages, imperative programming languages, and so on.
  • computer program instructions may be stored, compiled, or interpreted to run on a computer, a programmable data processing apparatus, a heterogeneous combination of processors or processor architectures, and so on.
  • embodiments of the present invention may take the form of web-based computer software, which includes client/server software, software-as-a-service, peer-to-peer software, or the like.
  • a computer may enable execution of computer program instructions including multiple programs or threads.
  • the multiple programs or threads may be processed approximately simultaneously to enhance utilization of the processor and to facilitate substantially simultaneous functions.
  • any and all methods, program codes, program instructions, and the like described herein may be implemented in one or more threads which may in turn spawn other threads, which may themselves have priorities associated with them.
  • a computer may process these threads based on priority or other order.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
EP21867391.1A 2020-09-09 2021-09-03 SHALLOW PIPELINED HIGHLY PARALLEL PROCESSING ARCHITECTURE Pending EP4211567A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063075849P 2020-09-09 2020-09-09
PCT/US2021/048964 WO2022055792A1 (en) 2020-09-09 2021-09-03 Highly parallel processing architecture with shallow pipeline

Publications (2)

Publication Number Publication Date
EP4211567A1 true EP4211567A1 (en) 2023-07-19
EP4211567A4 EP4211567A4 (en) 2024-10-09

Family

ID=80629976

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21867391.1A Pending EP4211567A4 (en) 2020-09-09 2021-09-03 SHALLOW PIPELINED HIGHLY PARALLEL PROCESSING ARCHITECTURE

Country Status (3)

Country Link
EP (1) EP4211567A4 (ko)
KR (1) KR20230082621A (ko)
WO (1) WO2022055792A1 (ko)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024044150A1 (en) * 2022-08-23 2024-02-29 Ascenium, Inc. Parallel processing architecture with bin packing

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4586130A (en) * 1983-10-03 1986-04-29 Digital Equipment Corporation Central processing unit for a digital computer
US5764994A (en) * 1996-09-16 1998-06-09 International Business Machines Corporation Method and system for compressing compiled microcode to be executed within a data processing system
GB2399901B (en) * 2003-03-27 2005-12-28 Micron Technology Inc System and method for encoding processing element commands in an active memory device
US9348792B2 (en) * 2012-05-11 2016-05-24 Samsung Electronics Co., Ltd. Coarse-grained reconfigurable processor and code decompression method thereof
KR102056730B1 (ko) * 2013-04-22 2019-12-17 삼성전자주식회사 Vliw 프로세서를 위한 명령어 압축 장치 및 방법과, 명령어 인출 장치 및 방법
EP3005078A2 (en) * 2013-05-24 2016-04-13 Coherent Logix Incorporated Memory-network processor with programmable optimizations
US9524242B2 (en) * 2014-01-28 2016-12-20 Stmicroelectronics International N.V. Cache memory system with simultaneous read-write in single cycle
US10380009B2 (en) * 2015-02-27 2019-08-13 Walmart Apollo, Llc Code usage map
WO2018217222A1 (en) * 2017-05-26 2018-11-29 The Charles Stark Draper Laboratory, Inc. Machine intelligence and learning for graphic chip accessibility and execution

Also Published As

Publication number Publication date
EP4211567A4 (en) 2024-10-09
KR20230082621A (ko) 2023-06-08
WO2022055792A1 (en) 2022-03-17

Similar Documents

Publication Publication Date Title
US20220075651A1 (en) Highly parallel processing architecture with compiler
US20220075627A1 (en) Highly parallel processing architecture with shallow pipeline
US20220107812A1 (en) Highly parallel processing architecture using dual branch execution
EP4211567A1 (en) Highly parallel processing architecture with shallow pipeline
US20220075740A1 (en) Parallel processing architecture with background loads
EP4384902A1 (en) Parallel processing architecture using distributed register files
WO2022104176A1 (en) Highly parallel processing architecture with compiler
US20220308872A1 (en) Parallel processing architecture using distributed register files
US20220291957A1 (en) Parallel processing architecture with distributed register files
US20220374286A1 (en) Parallel processing architecture for atomic operations
US20230031902A1 (en) Load latency amelioration using bunch buffers
US20220214885A1 (en) Parallel processing architecture using speculative encoding
US20230350713A1 (en) Parallel processing architecture with countdown tagging
US20230342152A1 (en) Parallel processing architecture with split control word caches
US20230273818A1 (en) Highly parallel processing architecture with out-of-order resolution
WO2022081784A1 (en) Parallel processing architecture with background loads
US20240078182A1 (en) Parallel processing with switch block execution
US20230221931A1 (en) Autonomous compute element operation using buffers
US20240070076A1 (en) Parallel processing using hazard detection and mitigation
US20230409328A1 (en) Parallel processing architecture with memory block transfers
US20240168802A1 (en) Parallel processing with hazard detection and store probes
WO2022251272A1 (en) Parallel processing architecture with distributed register files
EP4315045A1 (en) Parallel processing architecture using speculative encoding
WO2023014759A1 (en) Parallel processing architecture for atomic operations
WO2023064230A1 (en) Load latency amelioration using bunch buffers

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20230404

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20240909

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 9/38 20180101ALI20240903BHEP

Ipc: G06F 9/28 20060101ALI20240903BHEP

Ipc: G06F 9/30 20180101ALI20240903BHEP

Ipc: G06F 15/80 20060101AFI20240903BHEP