US20220179823A1 - Reconfigurable reduced instruction set computer processor architecture with fractured cores - Google Patents

Reconfigurable reduced instruction set computer processor architecture with fractured cores Download PDF

Info

Publication number
US20220179823A1
US20220179823A1 US17/681,163 US202217681163A US2022179823A1 US 20220179823 A1 US20220179823 A1 US 20220179823A1 US 202217681163 A US202217681163 A US 202217681163A US 2022179823 A1 US2022179823 A1 US 2022179823A1
Authority
US
United States
Prior art keywords
data
cores
core
arithmetic logic
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/681,163
Inventor
Paul L. Master
Frederick Furtek
Martin Alan Franz II
Raymond J. Andraka PE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cornami Inc
Original Assignee
Cornami Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cornami Inc filed Critical Cornami Inc
Priority to US17/681,163 priority Critical patent/US20220179823A1/en
Publication of US20220179823A1 publication Critical patent/US20220179823A1/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181Instruction operation extension or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations

Definitions

  • the present disclosure relates to systems and methods for a reconfigurable reduced instruction set computer processor architecture.
  • CISC processors are based on a processor design where single instructions can execute several low-level operations (such as a load from memory, an arithmetic operation, and a memory store) or are capable of multi-step operations or addressing modes within single instructions.
  • CISC processors are characterized by having many clock cycles per instruction, a slow overall clock due to the large amount of circuitry required to implement each complex instruction, and a single control thread, and are thus characterized as being control-centric.
  • control-centric refers to a processor that relies primarily on reading and executing instructions for its processing and moving of data. In most applications, moving data is the most resource intensive operation.
  • RISC Reduced Instruction Set Computing
  • a RISC processor is one whose instruction set architecture has a set of attributes that allows it to have much simpler circuitry required to implement its instructions and thus a lower cycles per instruction than a complex instruction set computer.
  • a processor that has a small set of simple and general instructions running faster, rather than a large set of complex and specialized instructions running slower is generally more efficient.
  • RISC processors are characterized by having relatively few clock cycles per instruction, a fast clock, a single control thread, and are characterized as being control centric.
  • Due to the requirement that processors must run very large instruction code bases, RISC processors have been optimized with multiple levels of memory caches that are backed up by even larger Double Data Rate (DDR) DRAM memory.
  • the smaller memory caches are faster from a clock cycle access point of view than the large DRAM.
  • since code exhibits "locality of reference", that is, a high probability that the next instruction required to be executed in the code base is relatively nearby (as defined by its address), the DRAM holds the majority of the executable code, and the specific code to be executed is loaded from the DRAM into the memory caches with a high probability that the next instruction to be accessed will be available in the cache. While this multiple-level cache system is excellent in terms of speeding up the execution of large code bases, it fails when moving large amounts of data.
  • Modern RISC processor designs consist of multiple levels of caches. This allows flexibility of instruction flow for large executable code bases but is not efficient for large amounts of data. Moving data in and out of caches is relatively slow, requires extra circuitry to maintain cache coherency across all the levels of caches and memory, and consumes a large amount of energy. This "penalty" is acceptable when a group of instructions is brought in from DRAM and executed multiple times from a cache, but it is highly inefficient for data movement. Data that needs to be processed only once must still go through the cache overhead (extra power dissipation, extra circuitry which equates to slower clock speeds, and multiple copies in multiple caches).
  • Multi-core designs of processors and GPUs replicate the caches per individual processor core and only serve to exacerbate the performance and power dissipation penalty of using these legacy architectures to solve problems that require vast amounts of data movement. Recent developments in computing technology, such as Artificial Intelligence (AI), Deep Learning (DL), Machine Learning (ML), Machine Intelligence (MI), and Neural Networks (NN), therefore require enormous amounts of computing resources, both in terms of the number of processor cores, whose aggregate performance is measured in TeraOperations (trillions of operations) or TeraFLOPS (trillions of floating point operations) per second, and in terms of power dissipation, measured in the hundreds of watts.
  • AI Artificial Intelligence
  • DL Deep Learning
  • ML Machine Learning
  • MI Machine Intelligence
  • NN Neural Networks
  • SEGNET, a neural network architecture for semantic pixel-wise segmentation, requires that all data processed in each layer of the neural network be moved through memory caches in a conventional processor.
  • NVIDIA's Drive PX 2™ is used in Tesla vehicles to power the Autopilot feature using Tesla Vision™ image processing technology.
  • The computer is comparable in computing power to about 150 MacBook Pros™ and has been reported to consume 250 W of power and require liquid cooling. See AnandTech, NVIDIA Announces DRIVE PX 2—Pascal Power For Self-Driving Cars, Ryan Smith, Jan. 5, 2016; https://www.anandtech.com/show/9903/nvidia-announces-drive-px-2-pascal-power-for-selfdriving-cars.
  • the system may include one or more hardware processors configured by machine-readable instructions.
  • a RISC processor may define a primary processing core, and include one or more processing elements (e.g. ALU unit(s), Integer Multiplier unit(s), Integer Multiplier-Accumulator unit(s), Divider unit(s), Floating Point ALU unit(s), Floating Point Multiplier unit(s), FP Multiplier-Accumulator unit(s), Integer Vector unit(s), Floating Point Vector unit(s), integer SIMD (single instruction, multiple data) unit(s), Bit Encryption/Decryption unit(s)).
  • Each primary processing core includes a main memory and at least one cache memory or local memory interfacing to a Network-On-Chip.
  • Each RISC core is configurable as either RISC mode or streaming mode via a machine-readable and writeable configuration bit.
  • each processor block becomes an individually accessible secondary, i.e. “fractured” core.
  • Each fractured core has at least one arithmetic "processor block" and is capable of reading from and writing to the at least one cache or local memory in a data-centric mode via interfaces to a Network-on-Chip.
  • a node wrapper associated with each of the plurality of fractured cores is configured to allow data to stream out of the corresponding fractal core into the main memory and other ones of the plurality of fractal cores, and to allow data from the main memory and other fractal cores to stream into the corresponding core, in a streaming mode.
  • the node wrapper may include access memory associated with each fractured core and a load/unload matrix associated with each fractured core.
  • the processor(s) may be configured with a partition logic module configured to individually configure each of the fractured cores to operate in the streaming mode (data-centric) or the control-centric mode.
  • Another aspect relates to a method for reconfiguring a reduced instruction set computer processor architecture. The method includes providing a primary processing core consisting of a RISC processor, each primary processing core comprising a main memory, at least one cache memory, and a plurality of secondary processing cores, each secondary processing core having at least one arithmetic logic unit, and providing a node wrapper associated with each of the plurality of secondary cores, the node wrapper comprising access memory associated with each secondary core and a load/unload matrix associated with each secondary core.
  • the architecture is operated in a manner in which, for at least one core, data is read from and written to the at least one cache memory in a control-centric mode, and the cores are selectively partitioned to operate in a streaming mode wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores, and data from the main memory and other secondary cores streams into the corresponding core.
  • FIG. 1 is a schematic illustration of a processor architecture in accordance with one or more implementations.
  • FIG. 2 a is a schematic illustration of a single RISC processor and related hardware showing the data streams of both control mode and streaming mode.
  • FIG. 2 b is a schematic illustration of a processor architecture showing that the core modes can be dynamically and flexibly configured.
  • FIG. 3 is a flow chart of a pipeline of the computer processor architecture in a streaming mode, in accordance with one or more implementations.
  • FIG. 4 is a schematic diagram of a secondary core in a streaming mode, in accordance with one or more implementations.
  • FIG. 5 is a schematic diagram of specific topology of a secondary core, in accordance with one or more implementations.
  • FIG. 6 is a flow chart of a method for configuring an architecture, in accordance with one or more implementations.
  • FIG. 7 is a schematic diagram of a SegNet architecture.
  • FIG. 8 is a flow chart of a data stream of a portion of the SegNet implementation.
  • FIG. 9 is a schematic diagram of a compression data structure.
  • FIG. 10 is a flowchart of an implementation of XEncoder.
  • FIG. 11 is a flowchart of an implementation of ZMac.
  • the inventors have developed an architecture and methodology that allows processor cores, such as known RISC processors, to be leveraged for increased computing power.
  • the processor cores, referred to as "primary cores" herein, are segregated into control logic and simple processing elements, such as arithmetic logic units.
  • a node wrapper allows the architecture to be configurable into a streaming mode ("fractured mode") in which pipelines are defined and data is streamed directly to the execution units/processing elements as "secondary cores".
  • Applicant refers to secondary cores using the tradename "Fractal Cores™."
  • the processor control logic need not be used.
  • the secondary cores are addressed individually and there is reduced need for data to be stored in temporary storage as the data is streamed from point to point in the pipelines.
  • the architecture is extensible across chips, boards and racks.
  • FIG. 1 illustrates an example of a computing architecture.
  • architecture 102 includes multiple primary processing cores 108 a , 108 b . . . 108 n .
  • Each main processing core 108 can include a corresponding node wrapper 110 a , 110 b . . . 110 n (only some of which are labeled 110 in FIG. 1, for clarity) as described in greater detail below.
  • Each primary processing core 108 may be defined by a RISC processor, such as the Altera NIOS™ processor.
  • each primary processing core 108 may include a corresponding main memory 112 a , 112 b . . . 112 n (only some of which are labeled in FIG. 1, for clarity) that includes multiple cache memories.
  • the node wrappers 110 can include access memory associated with each secondary core, and a load/unload matrix associated with each secondary core.
  • Each primary processing core 108 can also include a set of processing units 114 a , 114 b . . . 114 n , such as arithmetic logic units (ALUs), which separately or collectively can define a secondary processing core as described in detail below.
  • ALUs arithmetic logic units
  • a “wrapper” is generally known as hardware or software that contains (“wraps around”) other hardware, data or software, so that the contained elements can exist in a newer system.
  • the wrapper provides a new interface to an existing element.
  • the node wrappers provide a configurable interface that can be configured to allow execution in a conventional control-centric mode or in a streaming mode, or fractured mode, that is described below.
  • In a conventional control-centric mode ("RISC mode"), the architecture uses the core control logic to control data flow and operates in a manner wherein data is read from and written to the cache memory and processed by a primary core in accordance with control logic.
  • secondary cores 114 may be selectively "fractured" to operate in a fractured mode, as part of a pipeline, wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores, and data from the main memory and other secondary cores streams into the corresponding core, as described in greater detail below.
  • a rectangular partition can be created from a result matrix y using single precision floating point arithmetic.
  • the node wrappers 110 may be configured with partition logic and an input state machine for transferring data from memory to the processing element, and each arithmetic logic unit has an output that is associated with an output memory. The output memory may be updated throughout processing with the latest sum as it is computed.
  • Arithmetic logic units 114 of the RISC processor can be used as streaming secondary cores in the streaming mode.
  • Each node wrapper 110 can be configured to define multiple hardware streams, i.e. pipelines, to be allocated to specific ones of the cores.
  • FIG. 2 a illustrates the two possible modes of operation, RISC mode and fractured mode, of the architecture.
  • RISC Processor 208 includes two processing elements, ALU 1 and ALU 2 .
  • Node Wrapper 210 includes two secondary node wrappers NW 0 and NW 1 .
  • Memory 212 includes secondary memories M 0 and M 1 .
  • NOC Network on a Chip
  • the streams are indicated by the dashed lines.
  • node wrapper 210 is used as secondary node wrappers NW 0 and NW 1 and memory 212 is used as secondary memories M 0 and M 1 to define two data streams in this example.
  • the RISC processor can have any number of processing elements and data streams can be configured as needed. Note that, in this example, the RISC mode includes 4 data streams and a relatively large memory, while the Fractured mode includes 2 data streams and a relatively small memory.
  • some cores of the architecture can be configured to operate in the RISC mode while some are configured to operate in the fractured mode, as needed by any specific application at any specific time.
  • core modes can be configured dynamically, in real-time, during execution.
  • all cores are configured as primary cores (RISC mode).
  • some cores are configured as primary cores and some cores are configured as secondary cores (fractured mode).
  • the configuration can take any form as required by the specific application at the specific time.
  • the various interconnections are configured by the node wrappers using a Network On Chip (NOC).
  • NOC Network On Chip
  • the NOC is a 2-layer NOC of L0 switches interconnected to an L1 switch via 64-bit lanes.
  • the NOC also has an overlay network that interconnects all the secondary cores in a linear manner, as shown by the red arrows in FIG. 1 .
  • the switches are “crosspoint” switches, i.e. a collection of switches arranged in a matrix configuration. Each switch can have multiple input and output lines that form a crossed pattern of interconnecting lines between which a connection may be established by closing a switch located at each intersection, the elements of the matrix.
  • PCIe PCI Express
  • PCIe provides a switched architecture of channels that can be combined in x2, x4, x8, x16 and x32 configurations, creating a parallel interface of independently controlled “lanes.”
  • the architecture may be formed on a single chip.
  • Each cache memory may be a nodal memory including multiple small memories.
  • each core may have multiple arithmetic logic units.
  • the arithmetic logic units may include at least one of integer multipliers, integer multiplier accumulators, integer dividers, floating point multipliers, floating point multiplier accumulators, floating point dividers.
  • the arithmetic logic units may be single instruction multiple data units.
  • an architecture can be made up of 500 primary processor cores 108 each having 16 processing elements. In the streaming mode, up to 8000 secondary cores 114 can be addressed individually. This allows for performance of massive mathematical operations, as is needed in Artificial Intelligence applications.
  • the primary cores and secondary cores can be dynamically mixed to implement new algorithms.
  • FIG. 3 illustrates a simple data stream pipeline which connects 4 arithmetic logic units 302 , 304 , 306 , and 308 in series so that an input from source 301 is processed into an output 309 .
  • the ALUs are examples of the processing elements described above that define the secondary cores.
  • the pipeline is defined by setting the L0 and L1 switches in the NOC described above.
  • the NOC can be configured in any manner to define any data stream pipeline(s).
  • the appropriate node wrapper(s) 110 can execute code to configure the NOC.
  • class source: public threadModule { // code to run on a RISC core
        outputStream<int> outStrm;
        void code( ); // pointer to the RISC code
     }; // sends data to output
     class pipeline: public streamModule { // code to run on a Fractured core
        inputStream<int> inStrm;
        outputStream<int> outStrm;
        void code( ); // pointer to the operation the Fractured core will perform
     }; // process data from input and send to output
     class sink: public threadModule { // code to run on a RISC core
        inputStream<int> inStrm;
        void code( ); // pointer to the RISC code
     }; // receives data from input
  • code( ) can point to the source code below:
  • class pipelineTest: public streamModule {
        source src;
        pipeline pipe;
        sink snk;
        public:
        pipelineTest( ) // Constructor
        {
           src >> pipe >> pipe >> pipe >> pipe >> snk; // Connect modules
           end( ); // Housekeeping
        }
     };
  • FIG. 4 illustrates a top-level diagram of an example of a secondary core 400 defined by processing elements.
  • the pipeline configuration requires a number of clock cycles for a value to be read out of Y memory, added to the new product, and returned to Y memory before that element can be accessed again.
  • a product that arrives before the Y memory element is ready to be read is shunted to the T-FIFO for later accumulation.
  • Memory hazard logic (not shown) can be used to determine whether the Y memory location for a new product has been used recently; this determination controls steering of the data in the design.
  • the pre-loaded X mem holds the partition of the X (right) matrix applicable to the partition of the Y (result) matrix performed by this Small Core.
  • the applicable partition of the A (left) matrix is streamed into the PE in compressed form (non-zero elements only, accompanied by row/column info).
  • the Y mem accumulates the products as the matrix is computed.
  • the implementation can also include a peer-to-peer connection between adjacent processing elements 114 in a ring intended to permit dividing the processing load for particular Y-elements between two or more processing elements, which is useful to make the design scalable to larger matrices without a significant loss of performance.
  • FIG. 5 illustrates a specific topology of secondary cores 500 .
  • the design includes a test scaffold built around the processing element ring that allows the test matrices to be initially stored in a central memory store, automatically partitioned and delivered to the processing elements, and run through the processing elements with the option of continuously repeating the test matrices (for power measurement); the result partitions are then collected, reassembled into the full output matrix, and returned to the central memory, where the result may be accessed easily using the memory initialization and dump tools.
  • Each processing element 114 in FIG. 5 is associated on the input side with a node input memory, partitioning logic and an input state machine for transferring data from the local memory to the processing element.
  • each processing element 114 is associated with an output memory that is updated throughout the process with the latest sum for each Y element as it is computed.
  • the accumulated data in the output memory is transferred back to the central access memory via combiners that either pass data from the previous processing element 114 , or replace input with data from the local processing element 114 to reconstruct the full matrix as the matrix is scanned by row and column.
  • the programming and data information in the central access memory includes a setup word for each processing element 114 that contains partition information for the processing element 114 . That setup word configures the partition logic at each processing element 114 to only use data with rows and columns associated with the processing element's partition. Both the pre-load X matrix data and the streaming A matrix data arrive over the same path and use the same partition setup to select data out of the data stream from the central memory. Selected data at each processing element 114 gets written into the node input memory and held until the access manager completes transferring data and starts the processing. When processing starts, the processing uses only the data that has been transferred into the node memories, and stops when the end of the data has been reached. If the repeat bit is set in the start word, the pointer into the node input memory is reset to 0 when the end of the buffered data is reached, allowing the data to repeat indefinitely. This allows power measurements to be made.
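  • As a rough illustration of that selection step, the following C++ sketch models a setup word as simple row/column ranges and filters the broadcast stream into a node input memory; the SetupWord and StreamElem layouts are assumptions for illustration only, not the actual setup-word format.
  • // Illustrative software model of the per-PE partition filter (not the actual hardware).
     #include <cstdint>
     #include <vector>

     // Assumed setup word: the row/column window owned by this processing element.
     struct SetupWord {
        uint32_t rowLo, rowHi;  // inclusive row range of this PE's partition
        uint32_t colLo, colHi;  // inclusive column range of this PE's partition
        bool repeat;            // repeat bit: rewind node memory at end of data
     };

     // One element of the stream broadcast from the central access memory.
     struct StreamElem { uint32_t row; uint32_t col; float value; };

     // Keep only elements belonging to this PE's partition and buffer them in
     // the node input memory; processing later reads only from this buffer.
     std::vector<StreamElem> loadNodeMemory(const SetupWord& sw,
                                            const std::vector<StreamElem>& stream) {
        std::vector<StreamElem> nodeMem;
        for (const StreamElem& e : stream) {
           if (e.row >= sw.rowLo && e.row <= sw.rowHi &&
               e.col >= sw.colLo && e.col <= sw.colHi)
              nodeMem.push_back(e);
        }
        return nodeMem;
     }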
  • FIG. 6 illustrates a method 600 for reconfiguring a reduced instruction set computer processor architecture, in accordance with one or more implementations.
  • the operations of method 600 presented below are intended to be illustrative. In some implementations, method 600 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 600 are illustrated in FIG. 6 and described below is not intended to be limiting.
  • An operation 602 may include providing configuration code to one or more node wrappers.
  • An operation 604 may include executing the configuration code to set the interconnections of the NOC in a manner which creates at least one pipeline.
  • An operation 606 may include operating the architecture in a streaming mode wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores, and data from the main memory and other secondary cores streams into the corresponding core.
  • FIGS. 7 and 8 illustrate a specific example of the architecture applied to a SegNet topology.
  • SegNet is a fully convolutional neural network (CNN) architecture for semantic pixel-wise segmentation.
  • This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer.
  • the architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network.
  • the role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification.
  • the SegNet decoder upsamples its lower resolution input feature map(s).
  • the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample.
  • the upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps.
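  • A small C++ sketch of this pooling-index mechanism, assuming a single channel, a 2×2 max pool with stride 2, and row-major storage (this reflects standard SegNet behavior rather than text from this disclosure): the encoder records the flat index of each maximum, and the decoder scatters pooled values back to those indices, producing the sparse upsampled map mentioned above.
  • // Max pooling with index capture, and the matching unpooling (illustrative).
     #include <cstddef>
     #include <vector>

     // 2x2, stride-2 max pool that records the flat input index of each maximum.
     void maxPool2x2(const std::vector<float>& in, std::size_t W, std::size_t H,
                     std::vector<float>& out, std::vector<std::size_t>& idx) {
        out.clear();
        idx.clear();
        for (std::size_t y = 0; y + 1 < H; y += 2) {
           for (std::size_t x = 0; x + 1 < W; x += 2) {
              std::size_t best = y * W + x;
              for (std::size_t dy = 0; dy < 2; ++dy)
                 for (std::size_t dx = 0; dx < 2; ++dx) {
                    std::size_t i = (y + dy) * W + (x + dx);
                    if (in[i] > in[best]) best = i;
                 }
              out.push_back(in[best]);
              idx.push_back(best);
           }
        }
     }

     // Non-linear upsampling: place each pooled value at its recorded position;
     // every other position stays zero, which is the sparsity noted above.
     std::vector<float> unpool(const std::vector<float>& pooled,
                               const std::vector<std::size_t>& idx,
                               std::size_t W, std::size_t H) {
        std::vector<float> up(W * H, 0.0f);
        for (std::size_t k = 0; k < pooled.size(); ++k) up[idx[k]] = pooled[k];
        return up;
     }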
  • a SegNet Topology 700 includes encoder 710 and decoder 720 .
  • the three-dimensional CNN topology can be transformed into an equivalent one-dimensional topology using the techniques disclosed herein.
  • SegNet Layer 1 712 can be transformed into the 77-stage fractured core pipeline 800 shown in FIG. 8 .
  • the stages illustrated in FIG. 8 perform the following operations:
  • the embodiments facilitate more efficient data compression.
  • Neural Networks, by their very definition, contain a high degree of sparsity; for the SegNet CNN, over 3⁄ the computations involve a zero element.
  • having an architecture that can automatically eliminate the excess data movements for zero data, and the redundant multiply by zero for both random and non-random sparsity would result in higher performance and lower power dissipation.
  • Data which is not moved results in a bandwidth reduction and a power savings.
  • Multiplications that do not need to be performed also save power dissipation as well as allowing the multiplier to be utilized for data which is non-zero.
  • the highest bandwidth and computation load in terms of multiply-accumulates occurs in the DataStreams exiting the "Reorder" modules 801 , which feed the "Convolve" modules 802 .
  • Automatically compressing the data leaving the reorder module 801 reduces the bandwidth required to feed the convolve modules as well as reducing the maximum number of MACs (multiply-accumulates) that each convolver performs.
  • the input to a convolver, 802 consists of a 3-dimensional data structure (Width ⁇ Height ⁇ Channel).
  • Convolution is defined as multiplying and summing (accumulating) each element of the W ⁇ H ⁇ C against a Kernel Weight data structure also consisting of (Width ⁇ Height ⁇ Channel).
  • the data input into the convolver exhibits two types of sparsity—random zeros interspersed in the W ⁇ H ⁇ C data structure and short “bursts” of zeros across consecutive (W+1) ⁇ (H+1) ⁇ C data elements.
  • the compressed data structure that is sent from the Reorder Modules to the Convolver modules is detailed in FIG. 9 . For every 32 values, one Bitmask value 901 is sent, followed by any non-zero data values 902 . Each bit position in the bitmask indicates whether there is valid data or zero data in that position.
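  • A minimal C++ model of this bitmask compression, assuming 32-bit float data values and the 32-element group size described above (the CompressedBlock type and xencode name are illustrative, not the patent's actual circuitry):
  • // XEncoder-style compression model: one 32-bit mask per group of 32 values.
     #include <algorithm>
     #include <cstdint>
     #include <vector>

     struct CompressedBlock {
        uint32_t bitmask;            // bit i set => value i of the group is non-zero
        std::vector<float> nonzero;  // only the non-zero values, in order
     };

     std::vector<CompressedBlock> xencode(const std::vector<float>& in) {
        std::vector<CompressedBlock> out;
        for (std::size_t base = 0; base < in.size(); base += 32) {
           CompressedBlock blk{0u, {}};
           std::size_t n = std::min<std::size_t>(32, in.size() - base);
           for (std::size_t i = 0; i < n; ++i) {
              if (in[base + i] != 0.0f) {
                 blk.bitmask |= (1u << i);          // mark this position as valid data
                 blk.nonzero.push_back(in[base + i]);
              }
           }
           out.push_back(blk);
        }
        return out;
     }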
  • FIG. 10 is the flow chart for the circuitry which resides in the reorder module 801 and performs the compression.
  • FIG. 11 is the flow chart for the circuitry which resides in the convolver 802 and performs the de-compression.
  • tracking the bit position which is non-zero is critical, since the convolution operation must multiply the non-zero data with the correct kernel weight; hence a counter ( FIG. 11 , step 1 and step 5 ) must be maintained.
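  • A matching decompression sketch for the convolver side, reusing the CompressedBlock type from the sketch above: a position counter advances for every bit of the mask so that each non-zero value is multiplied by the kernel weight at the same flat position, while zero positions are skipped and cost no multiply. This is an illustrative model of the counter described for FIG. 11, assuming the kernel weights are flattened to the length of the uncompressed data stream.
  • // ZMac-style decompression and multiply-accumulate model (illustrative).
     float zmac(const std::vector<CompressedBlock>& blocks,
                const std::vector<float>& weights) {
        float acc = 0.0f;
        std::size_t pos = 0;               // absolute position counter (FIG. 11, steps 1 and 5)
        for (const CompressedBlock& blk : blocks) {
           std::size_t k = 0;              // index into this block's non-zero values
           for (int i = 0; i < 32 && pos < weights.size(); ++i, ++pos) {
              if (blk.bitmask & (1u << i))
                 acc += blk.nonzero[k++] * weights[pos];  // pair value with correct weight
           }
        }
        return acc;
     }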
  • the advantage is as follows: given a SegNet Reorder/Convolution of width 7, height 7, and 64 channels, an approach with no compression will send 3136 (7×7×64) values from the reorder module 801 to each convolver 802 , where 3136 Multiply Accumulations will be performed. With a 50% chance of zero values, the described circuitry will send 98 BitMasks and only 1568 data values. This results in a bandwidth savings of almost 50% and a 50% reduction in multiply-accumulates across 64 individual convolvers.
  • a simpler compression scheme, such as the addition of an extra bit to each data value to indicate "non-zero" data plus the addition of several bits to indicate a "count" of zero values, can also be used to perform compression, at the penalty of increasing the bit width of the bus carrying the data values.
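  • The simpler scheme can be modeled as tagging each transmitted word with an extra flag bit plus a small zero-run count, at the cost of a wider bus; the field widths below are illustrative assumptions.
  • // Zero-run tagging model of the simpler compression scheme (illustrative).
     #include <cstdint>
     #include <vector>

     struct TaggedWord {
        bool isData;      // extra bit: true => 'value' holds non-zero data
        uint8_t zeroRun;  // several extra bits: count of skipped zero values
        float value;
     };

     std::vector<TaggedWord> tagEncode(const std::vector<float>& in) {
        std::vector<TaggedWord> out;
        uint8_t run = 0;
        for (float v : in) {
           if (v == 0.0f) {
              if (++run == 255) { out.push_back({false, run, 0.0f}); run = 0; }
           } else {
              if (run) { out.push_back({false, run, 0.0f}); run = 0; }
              out.push_back({true, 0, v});
           }
        }
        if (run) out.push_back({false, run, 0.0f});
        return out;
     }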
  • the embodiments disclosed herein can be used in connection with various computing platforms.
  • the platforms may include electronic storage, one or more processors, and/or other components.
  • Computing platforms may include communication lines, or ports to enable the exchange of information with a network and/or other computing platforms.
  • the computing platforms may include a plurality of hardware, software, and/or firmware components operating together to provide the functionality attributed herein.
  • Electronic storage may comprise non-transitory storage media that electronically stores information.

Abstract

Systems and methods for reconfiguring a reduced instruction set computer processor architecture are disclosed. Exemplary implementations may: provide a primary processing core consisting of a RISC processor; provide a node wrapper associated with each of the plurality of secondary cores, the node wrapper comprising access memory associated with each secondary core, and a load/unload matrix associated with each secondary core; operate the architecture in a manner in which, for at least one core, data is read from and written to the at least one cache memory in a control-centric mode; the secondary cores are selectively partitioned to operate in a streaming mode wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores.

Description

  • The present application is a continuation of U.S. Ser. No. 15/970,915, filed May 4, 2018, the entire content of which is incorporated herein by reference.
  • FIELD OF THE DISCLOSURE
  • The present disclosure relates to systems and methods for a reconfigurable reduced instruction set computer processor architecture.
  • BACKGROUND
  • Computing needs have changed drastically over the last several years. Since the 1980s, computer processor design has been focused on optimizing processors to execute computer code of enormous sizes. For example, Microsoft Office, a popular productivity suite, has been estimated to have tens of millions of lines of code. Yet, the data size that these massive code bases manipulate is comparatively small. Again using Office as an example, a Word document of several megabytes is all that is being manipulated by the code base in most cases. Other applications, such as graphics processing, while generating a massive amount of data, have the same lopsided characteristic of a large code base manipulating a relatively small working set of data. Thus, the design of conventional graphics processors has been based on techniques similar to processors for more code-intensive applications.
  • Complex Instruction Set Computing (CISC) processors are based on a processor design where single instructions can execute several low-level operations (such as a load from memory, an arithmetic operation, and a memory store) or are capable of multi-step operations or addressing modes within single instructions. CISC processors are characterized by having many clock cycles per instruction, a slow overall clock due to the large amount of circuitry required to implement each complex instruction, and a single control thread, and are thus characterized as being control-centric. The term "control-centric", as used herein, refers to a processor that relies primarily on reading and executing instructions for its processing and moving of data. In most applications, moving data is the most resource-intensive operation.
  • More recently, Reduced Instruction Set Computing (RISC) processors have become popular. A RISC processor is one whose instruction set architecture has a set of attributes that allows it to have much simpler circuitry required to implement its instructions and thus a lower cycles per instruction than a complex instruction set computer. A processor that has a small set of simple and general instructions running faster, rather than a large set of complex and specialized instructions running slower is generally more efficient. RISC processors are characterized by having relatively few clock cycles per instruction, a fast clock, a single control thread, and are characterized as being control centric.
  • Due to the requirement that processors must run very large instruction code bases, RISC processors have been optimized with multiple levels of memory caches that are backed up by even larger Double Data Rate (DDR) DRAM memory. The smaller memory caches are faster from a clock cycle access point of view than the large DRAM. Since code exhibits "locality of reference", that is, a high probability that the next instruction required to be executed in the code base is relatively nearby (as defined by its address), the DRAM holds the majority of the executable code, and the specific code to be executed is loaded from the DRAM into the memory caches with a high probability that the next instruction to be accessed will be available in the cache. While this multiple-level cache system is excellent in terms of speeding up the execution of large code bases, it fails when moving large amounts of data.
  • Modern RISC processor designs consist of multiple levels of caches. This allows flexibility of instruction flow for large executable code bases but is not efficient for large amounts of data. Moving data in and out of caches is relatively slow, requires extra circuitry to maintain cache coherency across all the levels of caches and memory, and consumes a large amount of energy. This "penalty" is acceptable when a group of instructions is brought in from DRAM and executed multiple times from a cache, but it is highly inefficient for data movement. Data that needs to be processed only once must still go through the cache overhead (extra power dissipation, extra circuitry which equates to slower clock speeds, and multiple copies in multiple caches).
  • This data movement penalty is characteristic of modern processor architectures, including graphics processor units (GPUs). Multi-core designs of processors and GPUs replicate the caches per individual processor core and only serve to exacerbate the performance and power dissipation penalty of using these legacy architectures to solve problems that require vast amounts of data movement. Recent developments in computing technology, such as Artificial Intelligence (AI), Deep Learning (DL), Machine Learning (ML), Machine Intelligence (MI), and Neural Networks (NN), therefore require enormous amounts of computing resources, both in terms of the number of processor cores, whose aggregate performance is measured in TeraOperations (trillions of operations) or TeraFLOPS (trillions of floating point operations) per second, and in terms of power dissipation, measured in the hundreds of watts. These modern DL, ML, MI and NN algorithms have the characteristic of requiring massive amounts of data movement with very small code bases, and are characterized as data-centric. For example, SEGNET, a neural network architecture for semantic pixel-wise segmentation, requires that all data processed in each layer of the neural network be moved through memory caches in a conventional processor.
  • Current software programmable processor designs have not provided processors that are efficient in supporting AI applications, such as image recognition required for autonomous vehicles. For example, NVIDIA's Drive PX 2™ is used in Tesla vehicles to power the Autopilot feature using Tesla Vision™ image processing technology. The computer is comparable in computing power to about 150 MacBook Pros™ and has been reported to consume 250 W of power and require liquid cooling. See AnandTech, NVIDIA Announces DRIVE PX 2—Pascal Power For Self-Driving Cars, Ryan Smith, Jan. 5, 2016; https://www.anandtech.com/show/9903/nvidia-announces-drive-px-2-pascal-power-for-selfdriving-cars.
  • Other algorithm specific processor designs have been focused on AI applications, and other data-intensive applications, however, such designs have resulted in processors that are application specific and inflexible. Further, software configurable processors based on FPGA (Field Programmable Gate Arrays) are well-known. While such processors are more flexible than conventional processors, they still do not provide the efficiency and flexibility required for modern data-centric applications.
  • SUMMARY
  • One aspect of the present disclosure relates to a system configured for using a multi-core reduced instruction set computer processor architecture. The system may include one or more hardware processors configured by machine-readable instructions. A RISC processor may define a primary processing core, and include one or more processing elements (e.g. ALU unit(s), Integer Multiplier unit(s), Integer Multiplier-Accumulator unit(s), Divider unit(s), Floating Point ALU unit(s), Floating Point Multiplier unit(s), FP Multiplier-Accumulator unit(s), Integer Vector unit(s), Floating Point Vector unit(s), integer SIMD (single instruction, multiple data) unit(s), Bit Encryption/Decryption unit(s)). Each primary processing core includes a main memory and at least one cache memory or local memory interfacing to a Network-On-Chip. Each RISC core is configurable as either RISC mode or streaming mode via a machine-readable and writeable configuration bit. In the streaming mode, each processor block becomes an individually accessible secondary, i.e. "fractured" core. Each fractured core has at least one arithmetic "processor block" and is capable of reading from and writing to the at least one cache or local memory in a data-centric mode via interfaces to a Network-on-Chip. A node wrapper associated with each of the plurality of fractured cores is configured to allow data to stream out of the corresponding fractal core into the main memory and other ones of the plurality of fractal cores, and to allow data from the main memory and other fractal cores to stream into the corresponding core, in a streaming mode. The node wrapper may include access memory associated with each fractured core and a load/unload matrix associated with each fractured core. The processor(s) may be configured with a partition logic module configured to individually configure each of the fractured cores to operate in the streaming mode (data-centric) or the control-centric mode.
  • Another aspect relates to a method for reconfiguring a reduced instruction set computer processor architecture. The method includes providing a primary processing core consisting of a RISC processor, each primary processing core comprising a main memory, at least one cache memory, and a plurality of secondary processing cores, each secondary processing core having at least one arithmetic logic unit, and providing a node wrapper associated with each of the plurality of secondary cores, the node wrapper comprising access memory associated with each secondary core and a load/unload matrix associated with each secondary core. The architecture is operated in a manner in which, for at least one core, data is read from and written to the at least one cache memory in a control-centric mode, and the cores are selectively partitioned to operate in a streaming mode wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores, and data from the main memory and other secondary cores streams into the corresponding core.
  • These and other features, and characteristics of the present technology, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular form of “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic illustration of a processor architecture in accordance with one or more implementations.
  • FIG. 2a is a schematic illustration of a single RISC processor and related hardware showing the data streams of both control mode and streaming mode.
  • FIG. 2b is a schematic illustration of a processor architecture showing that the core modes can be dynamically and flexibly configured.
  • FIG. 3 is a flow chart of a pipeline of the computer processor architecture in a streaming mode, in accordance with one or more implementations.
  • FIG. 4 is a schematic diagram of a secondary core in a streaming mode, in accordance with one or more implementations.
  • FIG. 5 is a schematic diagram of specific topology of a secondary core, in accordance with one or more implementations.
  • FIG. 6 is a flow chart of a method for configuring an architecture, in accordance with one or more implementations.
  • FIG. 7 is a schematic diagram of a SegNet architecture.
  • FIG. 8 is a flow chart of a data stream of a portion of the SegNet implementation.
  • FIG. 9 is a schematic diagram of a compression data structure.
  • FIG. 10 is a flowchart of an implementation of XEncoder.
  • FIG. 11 is a flowchart of an implementation of ZMac.
  • DETAILED DESCRIPTION
  • The inventors have developed an architecture and methodology that allows processor cores, such as known RISC processors, to be leveraged for increased computing power. The processor cores, referred to as "primary cores" herein, are segregated into control logic and simple processing elements, such as arithmetic logic units. A node wrapper allows the architecture to be configurable into a streaming mode ("fractured mode") in which pipelines are defined and data is streamed directly to the execution units/processing elements as "secondary cores". Applicant refers to secondary cores using the tradename "Fractal Cores™." In a streaming mode, the processor control logic need not be used. The secondary cores are addressed individually and there is reduced need for data to be stored in temporary storage as the data is streamed from point to point in the pipelines. The architecture is extensible across chips, boards and racks.
  • FIG. 1 illustrates an example of a computing architecture. As illustrated in FIG. 1, architecture 102 includes multiple primary processing cores 108 a, 108 b . . . 108 n. Each main processing core 108 can include a corresponding node wrapper 110 a, 110 b . . . 110 n (only some of which are labeled 110 in FIG. 1, for clarity) as described in greater detail below. Each primary processing core 108 may be defined by a RISC processor, such as the Altera NIOS™ processor. By way of non-limiting example, each primary processing core 108 may include a corresponding main memory 112 a, 112 b . . . 112 n (only some of which are labeled in FIG. 1, for clarity) that includes multiple cache memories. The node wrappers 110 can include access memory associated with each secondary core, and a load/unload matrix associated with each secondary core. Each primary processing core 108 can also include a set of processing units 114 a, 114 b . . . 114 n, such as arithmetic logic units (ALUs), which separately or collectively can define a secondary processing core as described in detail below.
  • A “wrapper” is generally known as hardware or software that contains (“wraps around”) other hardware, data or software, so that the contained elements can exist in a newer system. The wrapper provides a new interface to an existing element. In embodiments, the node wrappers provide a configurable interface that can be configured to allow execution in a conventional control-centric mode or in a streaming mode, or fractured mode, that is described below.
  • In a conventional control-centric mode (“RISC mode”), the architecture uses the core control logic to control data flow and operates in a manner wherein data is read from and written to the cache memory and processed by a primary core in accordance with control logic. However, secondary cores 114 may be selectively “fractured” to operate in a fractured mode, as part of a pipeline, wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores and data streams from the main memory and other secondary cores to stream into the corresponding core, as described in greater detail below. As an example, a rectangular partition can be created from a result matrix y using single precision floating point arithmetic.
  • The node wrappers 110 may be configured with partition logic and an input state machine for transferring data from memory to the processing element, and each arithmetic logic unit has an output that is associated with an output memory. The output memory may be updated throughout processing with the latest sum as it is computed. Arithmetic logic units 114 of the RISC processor can be used as streaming secondary cores in the streaming mode. Each node wrapper 110 can be configured to define multiple hardware streams, i.e. pipelines, to be allocated to specific ones of the cores.
  • FIG. 2a illustrates the two possible modes of operation, RISC mode and fractured mode, of the architecture. As illustrated in FIG. 2a, RISC Processor 208 includes two processing elements, ALU1 and ALU2. Node Wrapper 210 includes two secondary node wrappers NW0 and NW1. Memory 212 includes secondary memories M0 and M1. In the RISC mode, the data streams indicated by the solid lines stream from a Network on a Chip (NOC), such as a PCIe bus, to memory 112 for processing by RISC processor 208. In the fractured mode, the streams are indicated by the dashed lines. In the fractured mode, node wrapper 210 is used as secondary node wrappers NW0 and NW1 and memory 212 is used as secondary memories M0 and M1 to define two data streams in this example. One data stream passes through ALU1 and one passes through ALU2, with ALU1 and ALU2 each defining a secondary core. Of course, the RISC processor can have any number of processing elements and data streams can be configured as needed. Note that, in this example, the RISC mode includes 4 data streams and a relatively large memory, while the Fractured mode includes 2 data streams and a relatively small memory.
  • As illustrated schematically in FIG. 2b , some cores of the architecture can be configured to operate in the RISC mode while some are configured to operate in the fractured mode, as needed by any specific application at any specific time. Further, core modes can be configured dynamically, in real-time, during execution. On the left in FIG. 2b , all cores are configured as primary cores (RISC mode). On the right in FIG. 2b some cores are configured as primary cores and some cores are configured as secondary cores (fractured mode). The configuration can take any form as required by the specific application at the specific time. Some examples include:
      • 112 RISC cores/1,480 Fractured Core (FC) cores: 896 RISC cores/12K FC cores per 1U server, 36K RISC cores/474K FC cores per Rack
      • 480 RISC cores/7,420 FC cores: 4K RISC cores/60K FC cores per 1U server, 154K RISC cores/2.4M FC cores per Rack
      • 8196 RISC cores/131,136 FC cores: 66K RISC cores/1M FC cores per 1U server, 2.6M RISC cores/42M FC cores per Rack
  • Referring to FIG. 1, the various interconnections are configured by the node wrappers using a Network On Chip (NOC). In this example, the NOC is a 2-layer NOC of L0 switches interconnected to an L1 switch via 64-bit lanes. The NOC also has an overlay network that interconnects all the secondary cores in a linear manner, as shown by the red arrows in FIG. 1. In this example, the switches are "crosspoint" switches, i.e. a collection of switches arranged in a matrix configuration. Each switch can have multiple input and output lines that form a crossed pattern of interconnecting lines between which a connection may be established by closing a switch located at each intersection, the elements of the matrix. In this example, a PCI Express (PCIe) bus interface is used. PCIe provides a switched architecture of channels that can be combined in x2, x4, x8, x16 and x32 configurations, creating a parallel interface of independently controlled "lanes."
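  • One hedged way to picture a node wrapper programming such a crosspoint switch is as a matrix of closed connections from input lanes to output lanes; the Crosspoint class and lane numbering below are purely illustrative, since the actual L0/L1 switch programming interface is not described here.
  • // Illustrative crosspoint switch model: closing matrix[in][out] connects lanes.
     #include <vector>

     class Crosspoint {
     public:
        Crosspoint(int inputs, int outputs)
           : matrix(inputs, std::vector<bool>(outputs, false)) {}
        void close(int in, int out) { matrix[in][out] = true; }       // make a connection
        bool connected(int in, int out) const { return matrix[in][out]; }
     private:
        std::vector<std::vector<bool>> matrix;
     };

     // Hypothetical routing of the FIG. 3 pipeline through one switch:
     // source -> ALU -> ALU -> ALU -> ALU -> sink (lane numbers are arbitrary).
     Crosspoint configurePipeline() {
        Crosspoint l0(6, 6);
        l0.close(0, 1);
        l0.close(1, 2);
        l0.close(2, 3);
        l0.close(3, 4);
        l0.close(4, 5);
        return l0;
     }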
  • In some implementations, the architecture may be formed on a single chip. Each cache memory may be a nodal memory including multiple small memories. In some implementations, each core may have multiple arithmetic logic units. In some implementations, by way of non-limiting example, the arithmetic logic units may include at least one of integer multipliers, integer multiplier accumulators, integer dividers, floating point multipliers, floating point multiplier accumulators, floating point dividers. In some implementations, the arithmetic logic units may be single instruction multiple data units. As a simple example, an architecture can be made up of 500 primary processor cores 108 each having 16 processing elements. In the streaming mode, up to 8000 secondary cores 114 can be addressed individually. This allows for performance of massive mathematical operations, as is needed in Artificial Intelligence applications. The primary cores and secondary cores can be dynamically mixed to implement new algorithms.
  • The process and mechanism for configuring the architecture is described below. As noted above, the fractured mode is accomplished by defining one or more pipelines of streaming data between the secondary cores. FIG. 3 illustrates a simple data stream pipeline which connects 4 arithmetic logic units 302, 304, 306, and 308 in series so that an input from source 301 is processed into an output 309. The ALUs are examples of the processing elements described above that define the secondary cores. The pipeline is defined by setting the L0 and L1 switches in the NOC described above. Of course, the NOC can be configured in any manner to define any data stream pipeline(s). The appropriate node wrapper(s) 110 can execute code to configure the NOC. As an example, the pipeline of FIG. 3 can be configured by execution of the C++ code objects set forth below. Note that the keyword "threadModule" indicates to the tooling that the code to be executed will run on a RISC core, with the keyword "streamModule" indicating that the code to be executed will run on a Fractured Core.
  • class source: public threadModule { // code to run on a RISC core
       outputStream<int> outStrm;
       void code( );  // pointer to the RISC code
    }; // sends data to output
    class pipeline: public streamModule { // code to run on a Fractured core
       inputStream<int> inStrm;
       outputStream<int> outStrm;
       void code( );  // pointer to the operation the Fractured core
       will perform
    }; // process data from input and send to output
    class sink: public threadModule { // code to run on a RISC core
       inputStream<int> inStrm;
       void code( );  // pointer to the RISC code
    }; // receives data from input
  • In the objects above, “code()” can point to the source code below:
  • // Example of code which can be run on a RISC core
    void source::code() {
       int x;
       for (x = 0; x < 1000; ++x) { // Put 1000 ints into outStrm
          printf(“Generating Data %d\n”, x);
          outStrm << x; // TruStream put
       }
    }
    // Example of code which can be run on a Fractured Core
    void pipeline::code() {
       int x;
       int sum = 0;
       inStrm >> x; // get data from input stream
       sum += x * 3; // perform some computation
       outStrm << sum; // TruStream put, send data to output stream
    }
    // Example of code which can be run on a RISC core
    void sink::code() {
       int x;
       for (x = 0; x < 1000; ++x) {
          inStrm >> x; // get data from input stream
          printf(“Received Data %d\n”, x);
       }
    }
  • The code below serves to connect the topology of the pipeline of FIG. 3, where the source and sink run on a RISC core and 4 Fractured Cores perform a MAC (multiplication with accumulation); an illustrative sketch of the supporting stream classes follows the connection code:
  • class pipelineTest: public streamModule {
       source src;
       pipeline pipe;
       sink snk;
       public:
       pipelineTest() // Constructor
       {
         src >> pipe >> pipe >> pipe >> pipe >> snk; // Connect modules
         end(); // Housekeeping
       }
    };
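  • The objects above rely on base classes and stream operators supplied by the tool chain. Purely as an aid to reading the example, the following is a minimal host-side sketch of what such scaffolding might look like; the real threadModule, streamModule, inputStream, outputStream, and end() facilities are provided by the vendor tooling (and configure the NOC), so everything below is an assumption rather than the actual API. Module-to-module connection (the “>>” chaining in the constructor above) is likewise left to the tooling and omitted here.
    // Illustrative host-side stand-ins only; not the vendor API.
    #include <deque>

    template <typename T>
    struct stream {                      // stand-in for a hardware FIFO / NOC lane
       std::deque<T> fifo;
       void put(const T& v) { fifo.push_back(v); }
       bool get(T& v) {
          if (fifo.empty()) return false;
          v = fifo.front();
          fifo.pop_front();
          return true;
       }
    };
    template <typename T> using inputStream  = stream<T>;
    template <typename T> using outputStream = stream<T>;

    // “put” and “get” operators used in the code() bodies above.
    template <typename T>
    outputStream<T>& operator<<(outputStream<T>& s, const T& v) { s.put(v); return s; }
    template <typename T>
    inputStream<T>& operator>>(inputStream<T>& s, T& v) { s.get(v); return s; }

    struct module { virtual void code() = 0; virtual ~module() = default; };
    struct threadModule : module { };    // code() runs on a RISC (primary) core
    struct streamModule : module { };    // code() runs on a fractured (secondary) core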
  • FIG. 4 illustrates a top-level diagram of an example of a secondary core 400 defined by processing elements. The pipeline configuration requires a number of clock cycles for a value to be read out of Y memory, added to the new product, and returned to Y memory before that element can be accessed again. A product that arrives before the Y memory element is ready to be read is shunted to the T-FIFO for later accumulation. Memory hazard logic (not shown) can be used to determine whether the Y memory location for a new product has been used recently; this determination controls the steering of data in the design. The pre-loaded X mem holds the partition of the X (right) matrix applicable to the partition of the Y (result) matrix computed by this Small Core. The applicable partition of the A (left) matrix is streamed into the PE in compressed form (non-zero elements only, accompanied by row/column information). The Y mem accumulates the products as the matrix is computed. The implementation can also include a peer-to-peer connection between adjacent processing elements 114 in a ring, intended to permit dividing the processing load for particular Y elements between two or more processing elements; this is useful for making the design scalable to larger matrices without a significant loss of performance.
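  • The steering decision described above can be summarized behaviorally. The sketch below is an assumption made for illustration (the names Product, AccumulatePath, tFifo, and the fixed latency kLatency are invented); it shows only the choice between accumulating a new product into Y memory immediately and parking it in the T-FIFO until the target Y location can safely be read again.
    // Behavioral sketch of the Y-memory accumulate path with hazard steering.
    // All names and the fixed read-add-write latency are illustrative assumptions.
    #include <cstddef>
    #include <cstdint>
    #include <queue>
    #include <vector>

    struct Product { std::uint32_t yIndex; float value; };

    class AccumulatePath {
    public:
       explicit AccumulatePath(std::size_t ySize)
          : yMem(ySize, 0.0f), lastWrite(ySize, -100) {}

       // Called once per clock with an optional new product.
       void clock(const Product* p) {
          ++cycle;
          // Retry a parked product first if its Y location is now safe to read.
          if (!tFifo.empty() && ready(tFifo.front().yIndex)) {
             accumulate(tFifo.front());
             tFifo.pop();
          }
          if (p) {
             if (ready(p->yIndex)) accumulate(*p); // read-modify-write Y memory
             else tFifo.push(*p);                  // hazard: park in the T-FIFO
          }
       }

    private:
       static constexpr long kLatency = 4;         // assumed pipeline latency in cycles

       bool ready(std::uint32_t idx) const { return cycle - lastWrite[idx] >= kLatency; }

       void accumulate(const Product& p) {
          yMem[p.yIndex] += p.value;               // add the product to the partial Y sum
          lastWrite[p.yIndex] = cycle;
       }

       std::vector<float> yMem;                    // partition of the Y (result) matrix
       std::vector<long>  lastWrite;               // cycle of last write per Y element
       std::queue<Product> tFifo;                  // products awaiting a free Y location
       long cycle = 0;
    };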
  • FIG. 5 illustrates a specific topology of secondary cores 500. The design includes a test scaffold built around the processing element ring. The scaffold allows the test matrices to be initially stored in a central memory store, automatically partitioned and delivered to the processing elements, and run through the processing elements, with the option of continuously repeating the test matrices (for power measurement). The result partitions are then collected, reassembled into the full output matrix, and returned to the central memory, where the result may be accessed easily using the memory initialization and dump tools.
  • Each processing element 114 in FIG. 5 is associated on the input side with a node input memory, partitioning logic, and an input state machine for transferring data from the local memory to the processing element. On the output side, each processing element 114 is associated with an output memory that is updated throughout the process with the latest sum for each Y element as it is computed. At the completion of the matrix processing, the accumulated data in the output memory is transferred back to the central access memory via combiners that either pass data from the previous processing element 114 or replace that input with data from the local processing element 114, reconstructing the full matrix as the matrix is scanned by row and column.
  • The programming and data information in the central access memory includes a setup word for each processing element 114 that contains partition information for the processing element 114. That setup word configures the partition logic at each processing element 114 to use only data with rows and columns associated with the processing element's partition. Both the pre-load X matrix data and the streaming A matrix data arrive over the same path and use the same partition setup to select data out of the data stream from the central memory. Selected data at each processing element 114 is written into the node input memory and held until the access manager completes transferring data and starts the processing. When processing starts, the processing uses only the data that has been transferred into the node memories, and stops when the end of the data has been reached. If the repeat bit is set in the start word, the pointer into the node input memory is reset to 0 when the end of the buffered data is reached, allowing the buffered data to repeat indefinitely. This allows power measurements to be made.
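  • The repeat behavior can be illustrated with a short sketch. The structure and field names below (NodeInput, rdPtr, repeat) are assumptions made for illustration; the sketch shows only the read pointer wrapping back to 0 when the end of the buffered data is reached and the repeat bit is set.
    // Sketch of node-input sequencing: consume the buffered data once, or
    // repeat it indefinitely when the repeat bit of the start word is set.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct NodeInput {
       std::vector<std::uint32_t> mem; // node input memory, filled during the transfer phase
       std::size_t rdPtr = 0;          // read pointer into the buffered data
       bool repeat = false;            // repeat bit taken from the start word

       // Returns false when the end of the data is reached and repeat is off.
       bool next(std::uint32_t& word) {
          if (rdPtr == mem.size()) {
             if (!repeat || mem.empty()) return false;
             rdPtr = 0;                // reset the pointer: repeat the buffered data
          }
          word = mem[rdPtr++];
          return true;
       }
    };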
  • FIG. 6 illustrates a method 600 for reconfiguring a reduced instruction set computer processor architecture, in accordance with one or more implementations. The operations of method 600 presented below are intended to be illustrative. In some implementations, method 600 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 600 are illustrated in FIG. 6 and described below is not intended to be limiting.
  • An operation 602 may include providing configuration code to one or more node wrappers. An operation 604 may include executing the configuration code to set the interconnections of the NOC in a manner which creates at least one pipeline. An operation 606 may include operating the architecture in the streaming mode, wherein data streams out of a corresponding secondary core into the main memory and other ones of the plurality of secondary cores, and data streams from the main memory and the other secondary cores into the corresponding secondary core, the architecture being selectively operable in the streaming mode or the control-centric mode.
  • FIGS. 7 and 8 illustrate a specific example of the architecture applied to a SegNet topology. As noted above, SegNet is a fully convolutional neural network (CNN) architecture for semantic pixel-wise segmentation. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low-resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The SegNet decoder upsamples its lower-resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps.
  • As illustrated in FIG. 7, a SegNet Topology 700 includes an encoder 710 and a decoder 720. The three-dimensional CNN topology can be transformed into an equivalent one-dimensional topology using the techniques disclosed herein. SegNet Layer 1 712 can be transformed into the 77-stage fractured core pipeline 800 shown in FIG. 8. The stages illustrated in FIG. 8 perform the following operations (an illustrative sketch of one such stage appears after this list):
      • pad (Top), pad (Bottom), pad (Left), and pad (Right) add zero-padding around the image. Do not require memory.
      • The reorder stages convert the row-based video stream into a window-based stream. Access on-die SRAM.
      • The 64 convolve stages perform a convolution for each of the 64 filters (kernels). Access on-die SRAM.
      • The batch-normalization stage performs batch normalization. Accesses on-die SRAM.
      • The ReLU stage implements the Rectified Linear Unit (ReLU) activation function. Does not require memory.
      • The three pooling stages perform max pooling. Access on-die SRAM.
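  • As an illustration of how one of these stages maps onto a fractured core, the sketch below expresses the ReLU stage in the same streamModule style as the earlier pipeline example. The body is an assumption (only the pass-through/clamp behavior of ReLU is shown) and it presumes the inputStream/outputStream scaffolding discussed earlier; it is not the disclosed implementation.
    // ReLU stage as a stream module: pass positive values, clamp negatives to zero.
    // Illustrative only; relies on the streamModule scaffolding sketched earlier.
    class reluStage : public streamModule { // code to run on a Fractured core
       inputStream<float> inStrm;
       outputStream<float> outStrm;
       void code();
    };

    void reluStage::code() {
       float x;
       inStrm >> x;                        // get one value from the input stream
       outStrm << (x > 0.0f ? x : 0.0f);   // ReLU: max(x, 0)
    }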
  • The embodiments facilitate more efficient data compression. Neural networks, by their very definition, contain a high degree of sparsity; for the SegNet CNN, over 3× the computations involve a zero element. Clearly, an architecture that can automatically eliminate the excess data movements for zero data, and the redundant multiplications by zero, for both random and non-random sparsity, results in higher performance and lower power dissipation. Data that is not moved yields a bandwidth reduction and a power savings. Multiplications that do not need to be performed also save power dissipation, and they free the multiplier to be used for data which is non-zero. The highest bandwidth and computation load, in terms of multiply accumulates, occurs in the DataStreams exiting the “Reorder” modules 801, which feed the “Convolve” modules 802. Automatically compressing the data leaving the reorder module 801 reduces the bandwidth required to feed the convolve modules as well as reducing the maximum number of MACs (multiply accumulates) that each convolve module performs.
    Several zero-compression schemes are possible; what is illustrated is a scheme that takes into account the nature of convolutional neural networks. The input to a convolver 802 consists of a 3-dimensional data structure (Width×Height×Channel). Convolution is defined as multiplying and summing (accumulating) each element of the W×H×C structure against a kernel weight data structure also consisting of (Width×Height×Channel). The data input into the convolver exhibits two types of sparsity: random zeros interspersed in the W×H×C data structure, and short “bursts” of zeros across consecutive (W+1)×(H+1)×C data elements. The compressed data structure that is sent from the reorder modules to the convolver modules is detailed in FIG. 9. For every 32 values, one bitmask value 901 is sent, followed by any non-zero data values 902. Each bit position in the bitmask indicates whether there is valid data or zero data in that position. In the case where there is no zero data, 901 will be all zeros, followed by 32 data values 902. In the other extreme, where there are 32 zero data values, 901 will be all “1”s and no data values 902 will follow. In the case where there is a mixture of non-zero data values and zero data values, the bitmask 901 will indicate this and only the non-zero data values will follow in 902. FIG. 10 is the flow chart for the circuitry, residing in the reorder module 801, that performs the compression. FIG. 11 is the flow chart for the circuitry, residing in the convolver 802, that performs the decompression. Note that the bit position of each non-zero value is critical, since the convolution operation must multiply the non-zero data with the correct kernel weight; hence a counter (FIG. 11, step 1 and step 5) must be maintained. The advantage is as follows: given a SegNet reorder/convolution of width 7, height 7, and 64 channels, an approach with no compression will send 3136 (7×7×64) values from the reorder module 801 to each convolver 802, where 3136 multiply accumulations will be performed. With a 50% chance of zero values, the described circuitry will send 98 bitmasks and only 1568 data values. This results in a bandwidth savings of almost 50% and a 50% reduction in multiply accumulates across the 64 individual convolvers.
    Alternatively, a simpler compression scheme, such as the addition of an extra bit to each data value to indicate “non-zero” data, plus the addition of several bits to indicate a “count” of zero values, can also be used to perform compression, at the penalty of increasing the bit width of the bus carrying the data values.
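  • For clarity, the 32-value bitmask scheme of FIGS. 9-11 can also be expressed in software form. The sketch below is an illustration under the convention stated above (a set bit marks a zero element; only non-zero values are transmitted), with invented function and variable names; the counter pos mirrors the position counter of FIG. 11 so that each recovered value is multiplied against the correct kernel weight.
    // Sketch of the 32-value zero-compression scheme described above.
    // A set bit in the bitmask marks a zero element; only non-zero values are sent.
    #include <cstdint>
    #include <vector>

    struct CompressedBlock {
       std::uint32_t bitmask = 0;     // element 901: bit i == 1 means value i is zero
       std::vector<float> values;     // element 902: the non-zero values, in order
    };

    // Reorder-module side (FIG. 10): compress one group of up to 32 values.
    CompressedBlock compress32(const float* data, int count /* <= 32 */) {
       CompressedBlock blk;
       for (int i = 0; i < count; ++i) {
          if (data[i] == 0.0f) blk.bitmask |= (1u << i);  // mark the zero position
          else blk.values.push_back(data[i]);             // transmit only non-zero data
       }
       return blk;
    }

    // Convolver side (FIG. 11): walk the bitmask with a position counter so each
    // non-zero value is multiplied against the correct kernel weight.
    float decompressAndMac32(const CompressedBlock& blk, const float* weights, int count) {
       float acc = 0.0f;
       int v = 0;                                    // index into the transmitted values
       for (int pos = 0; pos < count; ++pos) {       // pos is the position counter
          if (blk.bitmask & (1u << pos)) continue;   // zero element: skip the multiply
          acc += blk.values[v++] * weights[pos];     // multiply-accumulate non-zero data
       }
       return acc;
    }
    On the numbers given above (7×7×64 = 3136 inputs per convolution and roughly half the values zero), such a scheme sends 98 bitmask words and about 1568 data values, and each convolver performs roughly half the multiply accumulates.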
  • The embodiments disclosed herein can be used in connection with various computing platforms. The platforms may include electronic storage, one or more processors, and/or other components. Computing platforms may include communication lines, or ports to enable the exchange of information with a network and/or other computing platforms. The computing platforms may include a plurality of hardware, software, and/or firmware components operating together to provide the functionality attributed herein. Electronic storage may comprise non-transitory storage media that electronically stores information.
  • Although the present technology has been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementations, it is to be understood that such detail is solely for that purpose and that the technology is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present technology contemplates that, to the extent possible, one or more features of any implementation can be combined with one or more features of any other implementation.

Claims (2)

What is claimed is:
1. A reduced instruction set computer processor architecture comprising:
multiple RISC processors each defining a primary processing core in a control-centric mode, each primary processing core comprising:
a main memory;
at least one cache memory;
at least one arithmetic logic unit capable of reading from and writing to the at least one cache memory in a control-centric mode;
a node wrapper associated with each of the primary cores, the node wrapper being operable to define a plurality of secondary cores by configuring network connections in a manner that defines at least one pipeline to allow data to stream out of arithmetic logic units into the main memory and other ones of the plurality of arithmetic logic units in a streaming mode, the node wrapper comprising:
access memory associated with each arithmetic logic unit;
at least one load/unload matrix associated with each arithmetic logic unit; and
a partitioning logic module configured to individually configure each of the primary cores to operate in the streaming mode or the control-centric mode.
2. A method for reconfiguring a reduced instruction set computer processor architecture, the method comprising:
providing a plurality of primary processing cores defined by RISC processors, each primary processing core comprising a main memory, at least one cache memory, and a plurality of arithmetic logic units;
providing a node wrapper associated with each primary core, the node wrapper comprising access memory associated with each arithmetic logic unit, and a load/unload matrix associated with each arithmetic logic unit;
operating the architecture in a manner in which, for at least one primary core, data is read from and written to the at least one cache memory in a control-centric mode; and
selectively configuring at least one primary core to operate in a streaming mode wherein data streams out of corresponding arithmetic logic units into the main memory and other ones of the plurality of arithmetic logic units.
US17/681,163 2018-05-04 2022-02-25 Reconfigurable reduced instruction set computer processor architecture with fractured cores Pending US20220179823A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/681,163 US20220179823A1 (en) 2018-05-04 2022-02-25 Reconfigurable reduced instruction set computer processor architecture with fractured cores

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/970,915 US11294851B2 (en) 2018-05-04 2018-05-04 Reconfigurable reduced instruction set computer processor architecture with fractured cores
US17/681,163 US20220179823A1 (en) 2018-05-04 2022-02-25 Reconfigurable reduced instruction set computer processor architecture with fractured cores

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/970,915 Continuation US11294851B2 (en) 2018-05-04 2018-05-04 Reconfigurable reduced instruction set computer processor architecture with fractured cores

Publications (1)

Publication Number Publication Date
US20220179823A1 true US20220179823A1 (en) 2022-06-09

Family

ID=68383926

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/970,915 Active 2038-05-12 US11294851B2 (en) 2018-05-04 2018-05-04 Reconfigurable reduced instruction set computer processor architecture with fractured cores
US17/681,163 Pending US20220179823A1 (en) 2018-05-04 2022-02-25 Reconfigurable reduced instruction set computer processor architecture with fractured cores

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/970,915 Active 2038-05-12 US11294851B2 (en) 2018-05-04 2018-05-04 Reconfigurable reduced instruction set computer processor architecture with fractured cores

Country Status (1)

Country Link
US (2) US11294851B2 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11062232B2 (en) * 2018-08-01 2021-07-13 International Business Machines Corporation Determining sectors of a track to stage into cache using a machine learning module
US11080622B2 (en) * 2018-08-01 2021-08-03 International Business Machines Corporation Determining sectors of a track to stage into cache by training a machine learning module
US11164067B2 (en) * 2018-08-29 2021-11-02 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for implementing a multi-resolution neural network for use with imaging intensive applications including medical imaging
EP4011030A4 (en) * 2019-08-07 2023-12-27 Cornami, Inc. Configuring a reduced instruction set computer processor architecture to execute a fully homomorphic encryption algorithm
US20240104360A1 (en) * 2020-12-02 2024-03-28 Alibaba Group Holding Limited Neural network near memory processing

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180308202A1 (en) * 2017-04-24 2018-10-25 Intel Corporation Coordination and increased utilization of graphics processors during inference

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7194598B2 (en) * 2004-01-26 2007-03-20 Nvidia Corporation System and method using embedded microprocessor as a node in an adaptable computing machine

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180308202A1 (en) * 2017-04-24 2018-10-25 Intel Corporation Coordination and increased utilization of graphics processors during inference

Also Published As

Publication number Publication date
US20190340152A1 (en) 2019-11-07
US11294851B2 (en) 2022-04-05


Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general
Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
STPP Information on status: patent application and granting procedure in general
Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general
Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER