CN111542826A - Digital architecture supporting analog coprocessors - Google Patents

Digital architecture supporting analog coprocessors

Info

Publication number
CN111542826A
Authority
CN
China
Prior art keywords
vmm
memristor
processor
matrix
data
Prior art date
Legal status
Pending
Application number
CN201880084392.0A
Other languages
Chinese (zh)
Inventor
J. Gupta
N. Athreyas
A. Mathew
Current Assignee
Spero Devices Inc
Original Assignee
Spero Devices Inc
Priority date
Filing date
Publication date
Application filed by Spero Devices Inc filed Critical Spero Devices Inc
Publication of CN111542826A

Classifications

    • G06N 3/065: Physical realisation of neural networks using electronic means; analogue means
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06F 9/30181: Instruction operation extension or modification
    • G06F 9/30014: Arithmetic instructions with variable precision
    • G06F 9/30018: Bit or string instructions
    • G06F 9/30087: Synchronisation or serialisation instructions
    • G06F 9/3853: Instruction issuing of compound instructions
    • G06F 9/3877: Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F 9/3897: Parallel functional units controlled in tandem (e.g. multiplier-accumulator) with adaptable data path
    • G06F 15/00: Digital computers in general; data processing equipment in general
    • G06F 17/13: Complex mathematical operations; differential equations
    • G06F 17/141: Complex mathematical operations; discrete Fourier transforms
    • G06F 17/16: Complex mathematical operations; matrix or vector computation
    • G06F 30/367: Circuit design at the analogue level; design verification, e.g. simulation (SPICE)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Neurology (AREA)
  • Discrete Mathematics (AREA)
  • Operations Research (AREA)
  • Logic Circuits (AREA)
  • Complex Calculations (AREA)

Abstract

A co-processor is configured to perform Vector Matrix Multiplication (VMM) to solve computational problems such as Partial Differential Equations (PDEs). An analog Discrete Fourier Transform (DFT) may be implemented as a VMM of the input signals with the Fourier basis functions using an analog crossbar array. The spectral PDE solution method can be implemented as an alternative to large-scale discretized finite difference methods, while exploiting the inherent parallelism provided by the crossbar array to solve linear and nonlinear PDEs. A digital controller interfaces with the crossbar array to direct write and read operations to the crossbar array.

Description

Digital architecture supporting analog coprocessors
RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application No. 62/611,870, filed on December 29, 2017. The entire teachings of the above application are incorporated herein by reference.
Background
A memristor is a device that acts as a resistive switch capable of maintaining an internal resistance state based on the history of applied voltages and currents. Memristors can store and process information and offer performance characteristics beyond those of conventional integrated circuit technology. An important class of memristive devices is the two-terminal resistance switch based on ionic motion, built from a simple conductor-insulator-conductor thin-film stack. For large-scale applications, memristor devices can be used in the form of crossbar arrays.
Disclosure of Invention
In an example embodiment, a circuit includes a Vector Matrix Multiplication (VMM) processor and a controller. The VMM processor may be configured to perform floating-point VMM operations, the VMM processor including at least one memristor network having an array of analog memristor devices arranged in a crossbar structure. The controller interfaces with the VMM processor and may be configured to: a) retrieving read data from a memory; b) determining a type of matrix multiplication to be performed based on the read data; c) generating an input matrix having a format specific to the type of matrix multiplication to be performed; d) determining a computational precision of the floating-point VMM operation and parsing a sign data field, an exponent data field, and a mantissa data field from the floating-point elements of the input matrix; and e) sending the input matrix data to the VMM processor.
In response to the type being a general matrix-matrix (GEMM) multiplication, the controller may be further configured to: 1) generating the input matrix according to a row-by-row order of the read data, and 2) applying the input matrix to an input of the VMM processor, wherein the VMM processor is further configured to apply the input matrix to the at least one memristor network. The input of the VMM processor may include a VMM signal processing chain comprising digital logic blocks that perform a set of sequential functions to prepare floating-point data for a VMM, the functions including at least one of data formatting, exponent normalization/denormalization, and memristor network mapping/demapping. In response to the type being two-dimensional (2D) convolution and correlation, the controller may be further configured to: 1) generating the input matrix in an overlapping order of the read data, the overlapping order representing the shifts of a convolution operation, and 2) applying the input matrix to an input of the VMM processor, wherein the VMM processor is further configured to apply the input matrix to the at least one memristor network.
The controller may be further configured to: a) identifying an exponent of an extremum of the floating-point numbers retrieved from the read data; b) determining normalized exponents for other values of the floating-point numbers from that exponent, the other values being values other than the extremum; c) modifying the other values by replacing their respective exponents with the respective normalized exponents; and d) converting the resulting data from the at least one memristor network into floating-point values by a denormalization process based on the normalized exponents.
The controller may be further configured to: a) identifying a matrix to be stored into at least one memristor network; b) defining mapping coefficients for the matrix based on 1) high and low conductance states of the at least one memristor network and 2) highest and lowest values of the matrix; c) defining a mapping relating elements of the matrix to conductance values of the at least one memristor network based on the mapping coefficients; d) causing the VMM processor to store the matrix to the at least one memristor network in accordance with the mapping; and e) converting the resulting data from the at least one memristor network into numerical matrix values by an inverse mapping process based on the mapping.
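For illustration only (not part of the original disclosure), the mapping and inverse-mapping steps described above can be sketched as a simple linear scaling between matrix values and a usable conductance window; the function names, conductance range, and matrix values below are assumptions:

    import numpy as np

    def map_to_conductance(matrix, g_low, g_high):
        """Linearly map matrix values onto the usable conductance range
        [g_low, g_high] of a memristor crossbar (illustrative only)."""
        m_min, m_max = matrix.min(), matrix.max()
        scale = (g_high - g_low) / (m_max - m_min)   # mapping coefficient
        offset = g_low - m_min * scale               # mapping coefficient
        return matrix * scale + offset, (scale, offset)

    def unmap_from_conductance(g_values, coeffs):
        """Inverse mapping: recover numerical matrix values from conductances."""
        scale, offset = coeffs
        return (g_values - offset) / scale

    # usage: map a weight matrix into a 10 uS to 100 uS conductance window
    weights = np.random.randn(32, 32)
    g, coeffs = map_to_conductance(weights, 10e-6, 100e-6)
    assert np.allclose(weights, unmap_from_conductance(g, coeffs))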
The controller may be further configured to: a) receiving, from an instruction cache of a host processor, a plurality of instructions based on the VMM operation to be performed, each instruction of the plurality of instructions specifying a configuration of a single row of the at least one memristor network; and b) causing the VMM processor to execute the plurality of instructions in parallel via the at least one memristor network. The controller may be further configured to forward the plurality of instructions to the VMM processor as Very Long Instruction Word (VLIW) instructions.
The controller may be further configured to: a) identifying, from the read data, a column vector to be written to the at least one memristor network; b) for each of the column vectors, 1) generating an identifier representing a layer of a hierarchy of the at least one memristor network, and 2) generating a flag bit indicating whether to update a value corresponding to the column vector; and c) storing the column vectors and the corresponding identifiers and flag bits into the memory.
The controller may be further configured to: a) identifying, from the read data, matrix column vectors to be written to the at least one memristor network; b) performing a Gather operation on the matrix column vectors as follows: i. the matrix column vectors are stored in a set of sub-banks of an SRAM memory, and ii. the matrix column vectors are read from the SRAM memory into a request queue, routed through a sub-bank address/data crossbar to a designated gather register, and finally accessed by the VMM processor from the associated gather register; c) mapping the matrix column vectors contained in a gather register to conductance values of a crossbar of the at least one memristor network of the VMM processor; and d) determining memristor weight values based on the mapping to program the at least one memristor network of the VMM processor. The controller may be further configured to: a) reading a voltage output from the crossbar; b) mapping the voltage output to digital matrix values; and c) storing the digital matrix values in memory by a Scatter operation in the following manner: i. the VMM processor writes the values into the associated scatter registers, and ii. the values are routed through the sub-bank address/data crossbar to a designated request queue and finally written into the desired sub-bank of the SRAM memory.
The controller may be further configured to: a) retrieving, from the read data, vector input data values to be written to the at least one memristor network; b) performing a Gather operation on the vector input data in the following manner: i. the vector input data is read from the SRAM memory into a request queue, routed through the sub-bank address/data crossbar to the designated gather register, and finally accessed by the VMM processor from the associated gather register; c) mapping the vector input data values to a crossbar of the at least one memristor network in the VMM processor; and d) determining memristor voltages based on the mapping to program the at least one memristor network of the VMM processor.
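For illustration only (not part of the original disclosure), the Gather/Scatter flow above can be modeled as striping column vectors across SRAM sub-banks and collecting them into a gather register for the VMM processor; the bank count, register width, and striping scheme are assumptions, and the request queue and crossbar routing are abstracted away:

    import numpy as np

    NUM_BANKS = 8

    def gather(sub_banks, column_ids):
        """Collect the requested column vectors from their sub-banks into one
        gather register (routing details are abstracted away)."""
        return np.stack([sub_banks[c % NUM_BANKS][c // NUM_BANKS] for c in column_ids])

    def scatter(sub_banks, column_ids, results):
        """Write VMM results back to the sub-bank each column came from."""
        for c, vec in zip(column_ids, results):
            sub_banks[c % NUM_BANKS][c // NUM_BANKS] = vec

    # usage: 32 columns of length 32 striped across 8 sub-banks
    sub_banks = [[np.zeros(32) for _ in range(4)] for _ in range(NUM_BANKS)]
    cols = gather(sub_banks, column_ids=[0, 5, 9])
    scatter(sub_banks, [0, 5, 9], cols + 1.0)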
The controller may be further configured to: a) identifying a custom instruction from the read data, the custom instruction defining an operation associated with a VMM; and b) causing the VMM processor to configure the at least one memristor network in accordance with the custom instruction. The custom instructions may include:
a) load/store instructions for: 1) programming input values into rows of a memristor crossbar array and multiplicative weight values into the at least one memristor network within the VMM processor, and 2) storing VMM output values from the at least one memristor network within the VMM processor into memory;
b) VMM instructions for: 1) define parameters including VMM floating-point precision, 2) format VMM data and map the VMM data into the at least one memristor network within the VMM processor, and 3) facilitate greater I/O bandwidth by leveraging VLIW processing to amortize control overhead per operation;
c) a bit manipulation instruction that defines at least one of an extraction, an insertion, a shift, a rotation, and a test of bits within a floating-point register of the VMM processor, wherein the instructions that manipulate the mantissa, exponent, and sign bits are performed as part of the larger processing of the VMM signal processing chain; and/or
d) A transactional memory instruction that defines I/O efficiency and scatter/gather instructions, and further defines an atomic operation of the custom instruction to facilitate coordinating reading/writing values from/to the at least one memristor network of the VMM processor.
The controller may also be configured to interface with a neural network system on a chip (SoC), the controller configured to: a) include a pair of digital signal processors, at least one of which is used for digital architecture functions such as the VMM signal processing chain, memory management, non-linear operations, custom instruction processing, and calibration/compensation algorithms; b) interface to the neural network SoC such that: i. the SoC is tasked with a neural network inference workload defined by a neural network model descriptor and comprising a set of kernel functions to be run on the VMM processor, and ii. the kernel functions of the model descriptor are compiled into custom instructions passed by the neural network SoC over a high-speed interconnect to the set of digital signal processors; and c) receive and process the instructions with the set of digital signal processors to cause the VMM processor to execute VMM functions.
The VMM processor and controller may be configured in a system-on-chip having a multilayer stack of a plurality of 2D Integrated Circuit (IC) layers, respective ones of the plurality of layers comprising a subset of the at least one memristor network, the respective ones of the plurality of layers being linked by through-silicon vias (TSVs).
In another embodiment, a circuit may include coprocessor circuitry including one or more VMM cores configured to perform VMM operations and support circuitry. Each of the VMM cores may include: a) at least one array of VMM circuits, each of the VMM circuits configured to compute, for the VMM operation, a respective product of a subset of T bits out of a total of N bits, each of the VMM circuits comprising: i) a signal generator configured to generate a programming signal based on at least one coefficient of the VMM operation; ii) a memristor network having an array of analog memristor devices arranged in a crossbar structure; iii) a read/write control circuit configured to selectively enable read and write operations at the memristor network; iv) a memristor control circuit configured to selectively enable selection of the analog memristor devices, the memristor control circuit including a column switch multiplexer, a row switch multiplexer, and an address encoder; v) a write circuit configured to set at least one resistance value within the network based on the programming signal, the write circuit comprising a voltage driver; vi) a read input circuit configured to apply at least one input signal to the memristor network, the input signal corresponding to a vector, the read input circuit comprising a voltage driver; and vii) a readout circuit configured to read at least one current value at the memristor network and generate an output signal based on the at least one current value. A read circuit array may be configured to convert at least one input vector into an analog signal to be applied to the memristor network. A write circuit array may convert at least one setting signal into an analog setting signal to be applied to the memristor network based on the multiplication coefficients. An ADC array may convert at least one VMM analog output from the memristor network into a digital value. A shift register array may be configured to format the digital values of the ADC array. An adder array may be configured to add outputs from the array of memristor networks, each of the adders performing a subset of the VMM operations associated with the multiplication coefficients. A combiner may be configured to combine the output signals of respective ones of the adder arrays to generate a combined output signal, the output signal of each adder array representing one of the respective products, the combiner configured to assemble the respective products into a combined output, the combined output representing a solution at the floating-point precision of the VMM operation.
In another embodiment, a circuit provides analog co-processing via Vector Matrix Multiplication (VMM). The circuitry may include a signal generator, a memristor network, and supporting input/output (I/O) circuitry. The signal generator generates a programming signal based on at least one coefficient of the VMM. The memristor network includes a memristor array. The read/write control circuitry may be configured to selectively enable read operations and write operations at the memristor network. The memristor control circuitry may be configured to selectively enable selection of the memristors, wherein the memristor control circuitry may include one or more of a column switch multiplexer, a row switch multiplexer, and an address encoder. The write circuit may be configured to set at least one resistance value within the network based on the programming signal, wherein the write circuit may include a voltage converter/driver. The read input circuit may be configured to apply at least one input signal to the memristor network, the input signal corresponding to a vector, wherein the read input circuit may include a voltage converter/driver. The readout circuit may be configured to read at least one current value at the memristor network and generate an output signal based on the at least one current value.
In further embodiments, the memristor network may include a plurality of memristors arranged in a voltage divider structure. The memristor network may also include an array of circuit elements, each of the circuit elements including a memristor in series with a transistor configured to selectively allow current to pass through the respective memristor.
In yet another embodiment, the programming signal may be based on at least one Discrete Fourier Transform (DFT) coefficient. The memristor network may include a plurality of sub-arrays of memristors, a first sub-array of the plurality of sub-arrays applied to a real part of the DFT coefficients, and a second sub-array of the plurality of sub-arrays applied to an imaginary part of the DFT coefficients. The input signal may have a voltage value that is a function of the input vector intended for multiplication. The sense circuit may be further configured to generate an output signal as a result of a VMM function of the vector and the programmed resistance values of the memristor network.
In yet another embodiment, the sensing circuit may be further configured to detect current at a plurality of nodes of the memristor network, wherein the output signal is a function of the current. The readout circuit may further include an analog-to-digital converter (ADC) configured to output a digital value representative of the output signal. The write circuit may be further configured to generate at least one analog set signal based on a multiplication coefficient based on the programming signal to set the at least one resistance value, wherein the at least one analog signal is applied to the memristor network. A digital-to-analog converter (DAC) may be configured to generate the at least one analog setting signal based on the programming signal.
In yet another embodiment, a digital-to-analog converter (DAC) may be configured to generate the at least one input signal based on the vector. The sensing circuit may further include a transimpedance amplifier configured to convert the output current into a voltage value, the output signal including the voltage value. The read circuit may be further configured to generate at least one analog input signal to be multiplied by at least one resistance value of the memristor network to which the at least one analog input signal is applied.
Further embodiments include a coprocessor circuit comprising an array of Vector Matrix Multiplication (VMM) circuits and supporting I/O circuitry. The array of VMM circuits may include one or more of the features described above, including a signal generator configured to generate a programming signal based on at least one coefficient of the VMM, and a memristor network. Further, the read DAC array may be configured to convert at least one input vector into an analog signal to be applied to the memristor network. The write DAC array may be configured to convert at least one setting signal to an analog setting signal to be applied to the memristor network based on the multiplication coefficient. The ADC array may be configured to convert at least one VMM analog output from the memristor network to a digital value. The shift register array may be configured to format the digital values of the ADC array. An array of adders may be configured to add outputs from the array of memristor networks, each of the adders to perform a subset of VMM operations associated with the coefficients. The combiner may be configured to combine the output signals of respective ones of the adder arrays to generate a combined output signal.
In further embodiments, the processor may be configured to generate the programming signals for respective ones of the VMM circuits based on a mathematical operation. The mathematical operation may include an operation of solving at least one Partial Differential Equation (PDE). The mathematical operations may also include at least one N-bit fixed-point computation, the VMM circuitry configuring a plurality of respective memristors to represent a subset of T bits out of the total N bits. The mathematical operations may also include at least one N-bit floating point calculation, the VMM circuitry configuring a plurality of respective memristors to represent a subset of T bits out of the total N bits.
In further embodiments, the at least one coefficient of the VMM may correspond to a Discrete Fourier Transform (DFT). The array may be configured to process a 2D DFT by applying the at least one coefficient of the VMM corresponding to the first 1D DFT to a first subset of the array and applying the output of the first subset as an input to a second subset of the array as a second 1D DFT. The at least one coefficient of the VMM may correspond to a Discrete Fourier Transform (DFT) to solve the partial differential equation through a spectral method. Further, the at least one coefficient of the VMM may correspond to a Discrete Fourier Transform (DFT) to perform range-doppler signal processing.
In yet another embodiment, the at least one coefficient of the VMM may correspond to a convolution coefficient to perform inference in a convolutional neural network. The at least one coefficient of the VMM may correspond to a Green's function representation to solve a partial differential equation. The combiner may also be configured to interface with a peripheral component interconnect express (PCIe) host processor.
In yet another embodiment, the at least one coefficient of the VMM may correspond to a lattice Green's function representation to solve a partial differential equation. The at least one coefficient of the VMM may correspond to an energy minimization optimization problem solved by a conjugate gradient method. The conjugate gradient method may be configured to solve partial differential equations. The conjugate gradient method may be configured to perform a back propagation algorithm within a neural network.
Additional embodiments may include a method of performing a VMM operation. The programming signal may be generated based on at least one coefficient of the VMM. Read and write operations are selectively enabled at a memristor network of an array having memristors. Selectively enabling selection of the memristor. At least one resistance value within the network may be set based on the programming signal. At least one input signal may be applied to the memristor network, the input signal corresponding to a vector. At least one current value at the memristor network may be read and an output signal may be generated based on the at least one current value.
Example embodiments provide an analog co-processor configured to solve Partial Differential Equations (PDEs). Furthermore, an analog Discrete Fourier Transform (DFT) may be implemented by using an analog crossbar array to perform a Vector Matrix Multiplication (VMM) of the input signals with the Fourier basis functions. The spectral PDE solution method can be implemented as an alternative to large-scale discretized finite difference methods, while exploiting the inherent parallelism provided by the crossbar array to solve linear and nonlinear PDEs. The analog crossbar array may be implemented as a hybrid CMOS-memristor solution, i.e., including a combination of CMOS and memristors.
Drawings
The foregoing will be apparent from the following more particular description of example embodiments as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.
Fig. 1 is a circuit diagram of a transistor/memristor crossbar array that may be implemented in one embodiment.
Fig. 2 is a circuit diagram of a memristor array.
FIG. 3 is a block diagram of a system including a coprocessor in one embodiment.
FIG. 4 is a block diagram of an array of Vector Matrix Multiplication (VMM) engines that may be implemented in the coprocessor of FIG. 3.
FIG. 5 is a block diagram of a VMM engine in one embodiment.
FIG. 6 is a block diagram of a peripheral component interconnect express (PCIe) architecture in which embodiments may be implemented.
FIG. 7 is a block diagram of an H-tree architecture in one embodiment.
Fig. 8A-8B are block diagrams of a computing system in one embodiment.
FIG. 9 is a block diagram of a computing system in another embodiment.
FIG. 10 is a flow diagram that illustrates threading in one embodiment.
FIG. 11 is a block diagram of a representation of data associated with a matrix column vector in one embodiment.
FIG. 12 is a block diagram of a computing system interfacing with networked devices in one embodiment.
Fig. 13A to 13D are block diagrams illustrating a system configured in a stacked structure in example embodiments.
Detailed Description
The following is a description of example embodiments.
Fig. 1 is a circuit diagram of a transistor/memristor crossbar network 100, which may be implemented in one embodiment. The network 100 includes a transistor/memristor array 150 (also referred to as a crossbar or crossbar array) that includes a plurality of cells (also referred to as devices) arranged in rows and columns, including the cell 140. Each cell includes a memristor 144 connected in series with a transistor 142, where the transistor 142 selectively passes current through the respective memristor 144. The gate of the transistor 142 may be connected to a transistor control circuit 125 for controlling this current. In addition, a row select circuit 110 and a column select circuit 112 selectively cause current to flow through the cells of a given row and a given column. Together, the transistor control circuit 125, row select circuit 110, and column select circuit 112 enable current to be applied to one or more selected cells in the array 150, while preventing current from being applied to unselected cells.
Memristor crossbar arrays (such as the array 150 of the network 100) may provide a number of beneficial features, such as high scalability, fast switching speed, non-volatility, a large resistance ratio, non-destructive read, 3D stackability, high CMOS compatibility, and manufacturability. However, this architecture also presents several application-related challenges. With respect to Vector Matrix Multiplication (VMM) in particular, achieving high device isolation and obtaining acceptable analog behavior in a crossbar array are substantial issues.
The operation of each memristor device in the memristor crossbar array may affect the operation of other devices in its vicinity. For example, a crossbar array may exhibit a phenomenon known as "sneak path current," which is the sum of the currents flowing through unselected memristor devices. This phenomenon is reduced by using a selection device, which can be connected to each column or row of a crossbar array to drive the memristive switch. The total current that the selection device can drive is determined by its channel width. However, the current-voltage relationship of a two-terminal selection device may be highly nonlinear, violating Ohm's law. Therefore, although transistor size limits the achievable memristor array density, a three-terminal selector (e.g., the transistor 142) may be used to mitigate the sneak path current problem and provide acceptable analog behavior. As shown in Fig. 1, a transistor in series with a memristor device at each cross-point may be referred to as a 1T1M (or 1T1R) architecture. By controlling the current compliance of a memristor during switching, the resistance value of an individual cell 140 of the array 150 may be set to any target value between the High Resistance State (HRS) and the Low Resistance State (LRS), which is referred to as analog behavior. In the 1T1M configuration, control of the current compliance can be easily achieved by setting the gate voltage of the transistor to different levels. This enables programming of analog values in the memristor 144.
Fig. 2 is a circuit diagram of a memristor array 250, which memristor array 250 may include one or more features of the network 100 described above. In particular, the array 250 may include memristors (e.g., memristor 244) arranged along a plurality of rows ($V_1^1$ to $V_N^1$) and a plurality of columns ($V_1^0$ to $V_M^0$). Each memristor may also be configured with a selection device (e.g., a transistor) to form a cell such as the cell 140 described above.
Memristor crossbar arrays, such as the array 250, may enable matrix-related computations and, owing to the highly parallel computational model, efficient use of electrical signals, and laws of physics in the hardware implementation, may achieve more than a 100-fold increase in computational speed compared to graphics processing units ("GPUs") or other accelerators. The low operating energy (<pJ) of the memristor devices further reduces power consumption. Vector and matrix computations are performed by the memristor crossbar array as follows. As shown in Fig. 2, an input voltage $V_I$ corresponding to the input vector is applied along the rows of an N × M array that has been programmed according to an N × M matrix input 210. The output current is collected at each column by measuring the output voltage $V_O$. At each column, each input voltage is weighted by the corresponding memristor conductance $G_{i,j}$ (resistance $1/G_{i,j}$), and the weighted sum appears at the output voltage. Thus, the relationship between the input and output voltages can be represented in the form of a vector matrix multiplication, $V_O = -V_I G R_s$ (with the negative feedback of the op-amp), where G is the N × M matrix determined by the conductances of the memristor crossbar array.
By using good-quality switching materials (such as TaOx and HfOx) and manufacturing processes, device variability may be reduced. Feedback circuitry readily available to the VMM application may be used to switch the cells to their target values. To provide a VMM operation, a small voltage may be applied as an input on the rows of the array 250, and the output current or voltage on the columns may be measured. For example, the output current at each column may be read by the converter circuit 220 and converted to a corresponding voltage. The applied voltage may remain below the effective switching threshold voltage of the memristor devices and therefore does not cause any significant resistance change in adjacent memristor devices. This operation, which may be referred to as a "memristor read," may be repeated with effectively unlimited endurance cycles and low error rates or inaccuracies. The more challenging, but less frequent, VMM operation is mapping a matrix onto the memristor crossbar array, which requires programming (writing) resistance values into the memristor devices of the crossbar array.
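The crossbar relation above can be checked numerically. The following sketch (for illustration only; an idealized model that ignores wire resistance, device nonlinearity, and op-amp details, with assumed function and parameter names) treats each column output as the conductance-weighted sum of the row input voltages:

    import numpy as np

    def crossbar_vmm(v_in, G, r_s=1.0):
        """Idealized N x M memristor crossbar: each column current is the sum of
        row voltages weighted by the device conductances; a transimpedance stage
        with feedback resistance r_s gives V_O = -V_I . G . r_s."""
        i_col = v_in @ G          # column currents by Kirchhoff's current law
        return -i_col * r_s       # op-amp negative feedback gives the sign flip

    # usage: a 4 x 3 conductance matrix multiplying a 4-element input vector
    G = np.random.uniform(10e-6, 100e-6, size=(4, 3))   # siemens
    v_in = np.array([0.1, 0.2, -0.05, 0.3])             # volts, below switching threshold
    print(crossbar_vmm(v_in, G, r_s=1e4))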
In the embodiments described below, in addition to the fixed-point computations described above, the analog coprocessor also supports vector matrix multiplication in a floating-point format. Various floating-point representations have been used over the years, but since the 1990s the most common floating-point representation has been the one defined by the IEEE 754 standard.
The exponent of a 16-bit floating-point number is a 5-bit unsigned integer from 0 to 31. The exponent value actually used is the stored exponent minus a bias. The bias for the 16-bit floating-point representation is 15, so the effective exponent ranges from -15 to 16. The true significand consists of the 10 fraction bits to the right of the binary point, with an implicit leading bit. Only 10 fraction bits are stored in memory, but the total precision is 11 bits, corresponding to about 3.311 decimal digits.
The exponent of a 32-bit floating-point number is an 8-bit unsigned integer from 0 to 255. The exponent value actually used is the stored exponent minus a bias. The bias for the 32-bit floating-point representation is 127, so the effective exponent ranges from -127 to 128. The true significand consists of the 23 fraction bits to the right of the binary point, with an implicit leading bit. Only 23 fraction bits are stored in memory, but the total precision is 24 bits, corresponding to about 7.22 decimal digits.
The binary representation of a 16-bit floating-point number may be given by:

$$(-1)^{b_{15}} \times 2^{(b_{14}b_{13}b_{12}b_{11}b_{10})_2 - 15} \times (1.b_9 b_8 \ldots b_0)_2$$

This expression yields the decimal value:

$$(-1)^{b_{15}} \times 2^{e - 15} \times \left(1 + \sum_{i=1}^{10} b_{10-i}\, 2^{-i}\right)$$

where $e$ denotes the biased exponent $(b_{14}b_{13}b_{12}b_{11}b_{10})_2$. The minimum normalized positive value that can be represented using the 16-bit floating-point representation is $2^{-14} = 6.1 \times 10^{-5}$, and the maximum value is $(2 - 2^{-10}) \times 2^{15} = 65504$. The minimum normalized positive value that can be represented using the 32-bit floating-point representation is $2^{-126} = 1.18 \times 10^{-38}$.
With respect to floating-point addition, two floating-point numbers X and Y may be added. The significands of X and Y are represented as $X_s$ and $Y_s$, respectively, and the exponent portions of X and Y are represented as $X_e$ and $Y_e$, respectively. The floating-point numbers may be added as follows: (a) convert the two numbers into scientific notation by explicitly representing the leading "1"; (b) to add the numbers, the exponents must be made equal, which is done by shifting the radix point of the mantissa; (c) add the two mantissas; (d) adjust the result and represent it as a floating-point number.
With respect to floating-point multiplication, two floating-point numbers X and Y may be multiplied. The significands of X and Y are represented as $X_s$ and $Y_s$, respectively, and the exponent portions of X and Y are represented as $X_e$ and $Y_e$, respectively. The product of X and Y is then given by:

$$X \times Y = (X_s \times Y_s) \times 2^{X_e + Y_e}$$
in the embodiments described below, floating point numbers may be processed by normalizing the exponent (converting the exponent to a fixed point value by aligning the mantissa). Normalizing the exponent requires bit shifting and padding, which is a direct function of the difference between the maximum and minimum exponent values being processed. In some applications, the value may be up to 278 bits for single precision floating point calculations. To circumvent this problem, the elements of the columns of the VMM array may be aligned. This arrangement makes use of the fact that: the difference between the indices of adjacent elements is significantly less than the extrema. The same normalization process may be followed for vector inputs used to multiply the matrix values. The normalized exponent values for the columns of the crossbar array may be stored for use during the denormalization process that converts the multiplied and accumulated results back to floating point precision.
FIG. 3 is a block diagram of a coprocessor 300 in one embodiment, which may be used to perform computations such as N-bit floating-point computations. The coprocessor 300 may be referred to as an analog coprocessor in that it implements analog circuitry (e.g., one or more memristor networks) to perform the computations described below. The data required for a computation may be received from a digital data bus (e.g., a PCIe bus) at a normalization block 305 for the exponents, which normalizes a block of data by making the exponents of the data block the same. These normalized values may be stored to the on-chip digital memory 306, which the processor 310 accesses via an N-channel bus. The processor 310 interfaces with the VMM core 320. The VMM core 320 may operate as the computational core of the coprocessor and may comprise a P × P array of VMM engines. For clarity, two channels 322a, 322n of such an array are shown. A given channel 322n may include a write digital-to-analog converter (DAC) 330 and a read DAC 340, a P × P VMM engine array 350, an analog-to-digital converter (ADC) array 360, a shift register array 362, and an adder array 364. Individual VMM engines (e.g., of the VMM engine array 350) may include an M × M memristor crossbar array (e.g., the arrays 150, 250 described above), as well as corresponding read and write circuitry and row and column multiplexers for addressing the memristors for read and write operations. An example VMM engine is described in more detail below with reference to FIG. 5.
Fig. 4 is a block diagram of an array 420 of Vector Matrix Multiplication (VMM) engines (e.g., VMM engine 470) that may be implemented in the VMM core 320 of the coprocessor 300 of FIG. 3. Each of the VMM engines 470 may be connected to a respective DAC and ADC to form a respective one of the cells 460a through 460n of the array.
Fig. 5 is a block diagram of a VMM engine 500. The VMM engine 500 may be implemented as an engine of the VMM engine array 350 of the VMM core 320 of the coprocessor 300 of FIG. 3, and may be implemented as a VMM engine of the array 420 of FIG. 4. The VMM engine 500 may include a memristor network 550 having an array (1,1 through M,M) of memristor cells, each including a series-connected switch and memristor, and the VMM engine 500 may include one or more features of the memristor network 100 described above with reference to FIG. 1 and the array described with reference to FIG. 2. The VMM engine also includes circuitry for programming (also referred to as "write operations") the memristor network 550 by setting the resistance values of the memristors in the network 550, and circuitry for applying input signals to the programmed memristor network 550 and detecting the resulting currents and/or voltages (also referred to as "read operations").
In particular, the read/write control circuit 510 may selectively enable read operations and write operations. For a write operation, the voltage converter/driver 512 may receive a programming signal ("write signal") and generate a corresponding signal (e.g., a voltage value) for setting a resistance value of the memristor network 550. The column switch multiplexers 516 and the row switch multiplexers 518 enable selection of one or more of the memristor cells of the array 550. The encoder circuit 515 converts the address signals to signals indicative of a subset of the memristor cells for selection by the multiplexers 516, 518. For a read operation, the voltage converter/driver 570 may receive a "read signal" indicative of a vector and apply a corresponding set of input signals to the memristor network 550. The sense circuit 580 may then receive and combine the resulting currents through the memristor network 550, and the transimpedance amplifier 590 may receive the resulting currents and generate an output signal having a voltage value based on the currents.
The operation of the co-processor and corresponding VMM engine in the example embodiment is described below with reference to fig. 3-5.
Write operation
Referring again to FIG. 3, to enable the coprocessor 300 to perform computations, the respective VMM engines first perform a write operation that sets the resistance values of the memristors of the VMM engines. In particular, the write DAC array 330 may write matrix values into the memristors of the VMM engine array 350. Since each VMM engine 500 is M × M in size, the number of write DACs per VMM engine may also be M. Each write DAC 330 writes T bits into each memristor, and an entire row may be written in a single clock cycle. The write DAC array 330 may utilize the write circuits of the respective VMM engines 500 (FIG. 5) to write into the memristors of the engine memristor array 550.

Because the exponents have been normalized (e.g., by the normalization block 305), only the mantissas may need to be processed by the coprocessor 300. An N-bit floating-point number has M mantissa bits. Thus, P VMM engines in each row may be required to process the M bits at T bits per cell. For example, a first memristor of VMM engine "1" (e.g., engine 500) may store the T most significant bits, and a first memristor of VMM engine "P" may store the T least significant bits. The other (P-1) rows of the VMM engines may have the same values written to them. Each write DAC 330 may have a T-bit input from each channel of the processor 310. The T-bit digital data may be converted to a single analog value that is stored as a memristor conductance state. Each row of the VMM engine 500 may be written in a single clock cycle. For example, given a memristor write time of 4 ns, an M × M crossbar may require M × 4 ns to write all values. Write operations may be considered overhead because the VMM operation cannot begin until all memristors have been written. To avoid this overhead, an interleaving approach may be used, in which two VMM cores operate in an interleaved fashion.
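For illustration only (not part of the original disclosure), the mantissa bit-slicing implied above can be sketched as follows, assuming the M mantissa bits are split into P = M/T slices of T bits, most significant slice first:

    def slice_mantissa(mantissa, m_bits=10, t_bits=5):
        """Split an M-bit mantissa into P = M/T slices of T bits each,
        MSB slice first, so each slice fits one memristor cell."""
        p = m_bits // t_bits
        mask = (1 << t_bits) - 1
        return [(mantissa >> (t_bits * (p - 1 - i))) & mask for i in range(p)]

    def unslice_mantissa(slices, t_bits=5):
        """Recombine T-bit slices into the original M-bit mantissa."""
        value = 0
        for s in slices:
            value = (value << t_bits) | s
        return value

    # usage: a 10-bit mantissa stored as two 5-bit conductance levels
    m = 0b1011001110
    assert unslice_mantissa(slice_mantissa(m)) == m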
Read operation
Once all of the memristors of the network 550 have been written, a read operation (the VMM operation) may be performed using the VMM engines of the VMM core 320. As shown in FIG. 3, each VMM engine array 350 has its own read DAC array 340. The width of each DAC of the array 340 may be T bits, and a T-bit value is input into each row of the memristor crossbar array. Since the size of the VMM engine is M × M, the number of read DACs in each array is also M. The read and write operations may be sequential, and thus the same M channels may be used for both read and write operations. The read DACs 340 may be used to input vectors into the memristor crossbar arrays of the VMM engine array 350. The read DACs provide the input vector to be multiplied by the matrix values stored in the memristor crossbar array. Since the input vector is also an N-bit floating-point number, the mantissa width is M bits. As shown, each DAC 340 has a T-bit input. Thus, the computation requires P = M/T columns of VMM engines to process the M mantissa bits. The T-bit digital data may be converted to a single analog value. The columns of VMM engines may receive the same input vector through the read circuitry 570 (FIG. 5). The resulting VMM outputs along each column may be read using the sense circuits 580.
As shown in FIG. 3, column-parallel ADCs 360 may be used to digitize the analog output of each VMM engine 500. Each ADC may be T bits wide, and each VMM engine 500 may have M ADCs. The analog coprocessor 300 may process the M-bit mantissas of N-bit floating-point numbers, but each memristor may only store T bits of information. Thus, the M-bit mantissa may be decomposed into M/T slices of T bits. After the VMM operations are performed and the outputs are digitized, these bits may be shifted to their correct bit positions. To this end, a shift register 362 may be implemented at the output of each ADC 360. Each shift register array 362 may apply a different set of bit shifts. After the bits have been shifted, the output of each column of the VMM engines 500 may be added to the corresponding column by the adder array 364. Each adder of the array 364 may have P inputs from the P VMM columns in each row. Similarly, the columns of the other VMM engines may also be added together using the adders of the array 364. Each row of the VMM core 320 may have a set of P adders of the adder array 364. The outputs of the adders 364 (each T bits in length) may be combined using the combiner 370 to form an M-bit mantissa output. These outputs may be stored in the on-chip digital memory 306 and sent over the digital bus.
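As an illustration of the shift-and-add recombination (not the exact datapath; slice widths and ordering are assumptions consistent with the bit-slicing sketch above):

    def recombine_partial_products(partial_sums, t_bits=5):
        """Combine the digitized column outputs of the T-bit slices into one
        mantissa product: each slice result is shifted to its bit position
        (MSB slice first) and the shifted values are accumulated."""
        p = len(partial_sums)
        total = 0
        for i, s in enumerate(partial_sums):
            total += s << (t_bits * (p - 1 - i))   # shift register + adder stage
        return total

    # usage: multiply a 10-bit value by a weight, slice by slice
    weight = 7
    slices = [0b10110, 0b01110]                    # MSB and LSB 5-bit slices of 0b1011001110
    partials = [s * weight for s in slices]        # what each crossbar column computes
    assert recombine_partial_products(partials) == 0b1011001110 * weight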
FIG. 6 is a block diagram of a peripheral component interconnect express (PCIe) architecture in which embodiments may be implemented. PCIe is a serial point-to-point interconnect topology used to connect peripheral devices in computing and communication platforms. A PCIe "lane" includes two simplex interconnect links between two PCIe devices, one in each direction. The PCIe standard provides the flexibility to increase throughput by adding lanes between devices. For example, PCIe version 3.0 allows simplex lanes to be added in increments of 1, 2, 4, 8, 12, 16, and 32. The number of lanes (denoted by "L") may affect the size and performance of a coprocessor interfacing with a device under the PCIe architecture.
As illustrated in FIG. 6, a CPU chipset 640 and a PCIe host processor 650 are communicatively coupled via a PCIe bus 690. The PCIe host processor 650 may in turn communicate with the coprocessor 620, which may include features comparable to those of the coprocessor 300 described above. In operation, the CPU chipset 640 may transmit data to the coprocessor 620. The data payload may be created at the device core of the chipset 640 and sent through three layers that attach overhead (a transaction layer, a data link layer, and a physical layer) to ensure that packets are reliably delivered to the corresponding layers in the PCIe processor 650 in the correct order. The constructed packets may then be presented to a memory controller of the chipset 640 that is physically connected to the PCIe bus 690 and determines the lane for packet insertion. As packets flow from the physical layer to the transaction layer in the PCIe controller 650, the layers strip their overhead and the reliable data payload is delivered to the device core in the correct order. The PCIe processor 650 deconstructs the packets and passes the payload to the coprocessor 620.
Similarly, the PCIe host processor constructs packets from data payloads created by the coprocessor 620 and sends the packets, reliably and in the correct order, to the device core in the CPU chipset 640. The functions of the PCIe processor may be implemented, for example, in a Xilinx FPGA that communicates with the CPU chipset 640 and controls and programs the coprocessor 620. The PCIe protocol specification may be implemented to control and program the coprocessor 620. In addition, "out-of-band" control and programming of the coprocessor 620 may be performed by a software program, such as MATLAB, residing, for example, on a host workstation.
FIG. 7 is a block diagram of an H-tree architecture 700 in one embodiment. As described above, a P × P parallel array of M × M crossbars can perform floating-point operations. For single-precision calculations, P is 6 and M is 32. Such a configuration (36 sets of 32 × 32 crossbars at 4 bits/cell for 32-bit floating point, or 4 sets of 32 × 32 crossbars at 5 bits/cell for 16-bit floating point) may be referred to as a "crossbar core" or a "floating-point core." Using an interface with 32 lanes, such as a PCIe 4.0 interface, enables a full 32-element input vector to be fed to a floating-point core of the coprocessor with a bidirectional bandwidth of 64 GB/s. PCIe values can be aggregated into input vectors and mapped to specific crossbars for computation, which can be done on an FPGA.
The H-tree architecture 700 may include a plurality of crossbar cores 720a through 720n connected to a common interface in an H-tree configuration. An H-tree network that uses the streaming buffer 790 to feed eight such cores with an H-tree bandwidth of 64 GB/s would produce VMM computation at 1 PB/s, or about 250 TFLOPs single precision (16 GT/s of 32-bit inputs × 8 floating-point cores × 2016 floating-point computations per core). The output of each floating-point core may then be inverse-mapped back to its corresponding input vector components and aggregated into an output vector for return to the sending device, which may also be done on the FPGA. The crossbar cores 720a through 720n of the H-tree architecture 700 may be configured to interface directly with the PCIe bus, thereby avoiding the bandwidth asymmetries that may occur in memory-embedded processor architectures that attempt to create internal device bandwidth on the TB/s scale.
The above-described example embodiments provide an analog coprocessor for applications requiring high computational speed and low power consumption. The coprocessor is capable of solving partial differential equations that arise in scientific simulations of complex systems. Current PDE solution methods in scientific simulation are inefficient and often intractable due to limitations associated with discrete variable encoding and serial processing. The above-described embodiments implement a set of PDE solution procedures by invoking vector matrix multiplication of input signals and multiplication coefficients using the analog behavior of CMOS-memristor crossbar arrays.
Example embodiments may replace large-scale discretized finite difference methods by implementing spectral and Green's function solution methods, while exploiting the inherent parallelism of the crossbar array, to solve linear and nonlinear PDEs. In the spectral approach, the PDE can be converted into the Fourier domain using the analog crossbar architecture. Once represented in the Fourier domain, VMM and integration can be performed using the crossbar to arrive at a solution to the PDE by applying the Inverse Discrete Fourier Transform (IDFT) and the convolution theorem. Thus, the partial differential equation of interest can be mapped to an analog crossbar Discrete Fourier Transform ("DFT") architecture, resulting in a much simpler Fourier representation that speeds up the PDE solution far beyond previous approaches. Linear PDEs with far-reaching applicability, such as the wave and heat equations, can be solved in this way. Nonlinear PDEs can be solved by linearization followed by the Fourier-domain representation; the Navier-Stokes equations for incompressible fluid flow are an example application.
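As a minimal illustration of the spectral approach, the sketch below solves the 1D periodic heat equation by transforming to the Fourier domain, applying the exact decay factor, and transforming back. Here the DFT/IDFT are performed with numpy's FFT, whereas the analog coprocessor would realize them as crossbar VMM operations; the problem size and coefficients are arbitrary choices for illustration.

```python
import numpy as np

# Spectral solution of the 1D periodic heat equation u_t = alpha * u_xx (illustrative
# only; the analog coprocessor would realize the DFT/IDFT as crossbar VMMs).
N, L, alpha, t = 128, 2 * np.pi, 0.1, 1.0
x = np.linspace(0.0, L, N, endpoint=False)
u0 = np.exp(-10 * (x - np.pi) ** 2)          # initial temperature profile

k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi   # spatial frequencies
u_hat = np.fft.fft(u0)                       # forward DFT (one VMM on the crossbar)
u_hat *= np.exp(-alpha * k ** 2 * t)         # exact decay in the Fourier domain
u_t = np.fft.ifft(u_hat).real                # inverse DFT back to physical space

print(u_t.max())                             # the peak temperature has diffused
```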
Another method of solving PDEs is the Green's function method. The Green's function method may not be practical for solving PDEs on traditional computers. However, the above-described implementation of circuits such as the memristor-CMOS crossbar provides a means for such PDE solutions. A PDE has a formal mathematical solution involving the Green's function:
T(x) = ∫_Ω G(x, x′) S(x′) dx′
where G(x, x′) is the Green's function. Each PDE has a specific Green's function. If the integral over the domain is approximated on a discrete grid of N grid cells, computing the solution T(x) in one cell requires summing over all (N−1) other cells, and solving for all unknown values (in all cells) requires on the order of N² operations. This is computationally expensive relative to methods that can find the solution in O(N) time. For this reason, the Green's function has rarely been used as a primary solution technique for PDEs in previous approaches.
Once discretized, the PDE generates the matrix problem Ax = b, where A is an N × N sparse matrix with only O(N) non-zero entries, b is a source vector of N known values, and x is the unknown solution vector (with N entries). The discrete Green's function A⁻¹ is a full matrix, requiring on the order of N² operations to multiply by the source, and the solution is obtained via x = A⁻¹b.
The above example embodiments make the Green's function method viable again because the matrix multiplication is performed in the analog domain. As a result, all N² operations of the full vector-matrix multiplication (VMM) may be performed in a single cycle on the memristor crossbar array.
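A minimal sketch of this idea, assuming a 1D Poisson problem with Dirichlet boundaries as the example PDE: once the discrete Green's function G = A⁻¹ is formed, the solve reduces to a dense matrix-vector product, which is exactly the operation the crossbar performs in one analog cycle.

```python
import numpy as np

# Green's function solve for a discretized 1D Poisson problem (illustrative).
# Digitally the dense product G @ b costs O(N^2); on the memristor crossbar the same
# product is a single analog VMM.
N = 64
h = 1.0 / (N + 1)
A = (np.diag(2.0 * np.ones(N)) - np.diag(np.ones(N - 1), 1)
     - np.diag(np.ones(N - 1), -1)) / h**2    # discrete operator, O(N) nonzeros
G = np.linalg.inv(A)                           # dense discrete Green's function

b = np.ones(N)                                 # source vector
x = G @ b                                      # N^2 multiply-accumulates: one crossbar VMM

print(np.allclose(A @ x, b))                   # True: x solves the discretized PDE
```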
These embodiments are also applicable to Convolutional Neural Network (CNN) based image recognition and to Doppler filtering. CNNs are an increasingly popular machine learning tool for object detection and recognition tasks. However, state-of-the-art embedded digital solutions require significant power to perform typical CNN tasks (e.g., tasks on the AlexNet benchmark), and so fail to achieve real-time video operation or to meet mobile device power budgets. Previous PDE solution techniques have likewise relied on GPUs and face similar problems in computational efficiency. Further applications include range-Doppler signal processing in radar systems. In swarm intelligence and related applications that may require real-time operation at power consumption under 1 W, signal processing must fit within a greatly reduced size, weight, and power (SWaP) envelope on these platforms. The above example embodiments can implement these applications under such constraints by utilizing features such as the analog crossbar VMM.
Example embodiments may be implemented in CMOS, in emerging nanotechnologies such as memristors, or in hybrid technologies combining CMOS and memristors. Such an implementation provides many advantages for analog DFT because the analog crossbar processor described above can program the DFT coefficients and perform the vector matrix multiplication with the input signal. In particular, an analog processor as described above can implement an analog discrete Fourier transform of over 1024 points with sufficient parallelization of base-size crossbar arrays, improving computational speed and power consumption by 2 to 3 orders of magnitude compared to digital systems.
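The sketch below illustrates the DFT-as-VMM idea in software: the N × N DFT coefficient matrix is fixed, so it could be programmed into a crossbar once, after which each transform is a single vector-matrix multiplication. A physical crossbar stores real, non-negative conductances, so the complex coefficients would in practice be split into real/imaginary (and positive/negative) components; that detail is omitted here.

```python
import numpy as np

# DFT coefficients as a fixed matrix: each transform then becomes one VMM.
N = 32
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)   # DFT coefficient matrix (would be the
                                               # programmed crossbar weights)
x = np.random.randn(N)                         # input signal (applied as read voltages)
X_vmm = F @ x                                  # one vector-matrix multiplication
print(np.allclose(X_vmm, np.fft.fft(x)))       # True: matches the FFT result
```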
Computing system incorporating coprocessor
FIG. 8A is a block diagram of a computing system 800 in one embodiment; the computing system 800 may be used to perform computations, such as N-bit floating point computations. System 800 includes a VMM processor having a VMM engine 820, which may be configured to include one or more VMM engines (such as VMM engine 500 described above with reference to fig. 5). VMM engine 820 may be configured in one or more of the various architectures described above, such as array 420 of fig. 4 or H-tree architecture 700 of fig. 7. VMM engine 820 may also be configured as a VMM core (such as VMM core 320 of fig. 3) including a structured array of VMM engines and supporting circuitry. In any of the configurations described above, VMM engine 820 may be referred to as a VMM processor. VMM engine 820 interfaces with digital controller 830.
Fig. 8B illustrates the digital controller 830 in more detail. The digital controller 830 may include an interconnect 805, a memory 806, and a processor 810 that may incorporate the features of the components 305, 306, 310 described above with reference to fig. 3. Accordingly, digital controller 830 and VMM engine 820 may be configured as coprocessors such as coprocessor 300 described above with reference to fig. 3. To perform a computation such as an N-bit floating point computation, the required data may be received from a digital data bus (e.g., via PCIe bus 690 of PCIe processor 650 as shown in fig. 6) to interconnect 805, which interconnect 805 forwards the data to processor 810. The processor 810 may normalize the data blocks to generate normalized values. These normalized values may be stored to a memory 806 accessed by the processor 810. Processor 810 interfaces with VMM engine 820 to control VMM engine 820 to perform read and write operations as described above with reference to fig. 5. The results of calculations following these operations may be processed by digital controller 830 and reported to a host device (not shown) requesting the results (such as CPU chipset 640 of fig. 6).
Accordingly, system 800 may provide for performing computationally intensive operations (such as VMM and multiply-accumulate (MAC) operations) in the analog domain using VMM engine 820. System 800 may also provide a VMM signal processing chain for handling data into and out of VMM engine 820, enabling floating point computations, and management, command, and control techniques for data movement to and from memory 806 through an Instruction Set Architecture (ISA). For example, these solutions may be implemented in firmware 842 of controller 830 as an extension to the RISC-V open source ISA. Controller 830 may translate computational instructions for VMM engine 820 from function calls to a software Application Programming Interface (API) 844 (including compatibility with Matlab, OpenCL, SuiteSparse, and/or other software applications).
Fig. 9 illustrates computing system 800 in greater detail, in a configuration coupled to a host processor 990. The interconnect 805 may be communicatively coupled to the host processor 990 via a system bus (e.g., a PCIe bus) and receive an instruction set 970 from the host processor 990. Controller 830 may implement a VMM signal processing chain to process the instructions 970 and generate corresponding commands for writing to and reading from VMM engine 820 (e.g., VMM engines 822a-822b). The VMM signal processing chain of controller 830 may include a data formatting block 830, a crossbar mapping/inverse mapping block 832, a normalization/denormalization block 834, a Memory Management Unit (MMU) 836, and a Special Function Unit (SFU) 838. Each of these blocks is described in more detail below.
Digital controller 830 may implement a number of features for interfacing with host processor 990 and VMM engine 820 including:
a) memory 806 may manage the flow of data into and out of the crossbar architecture of VMM engine 820 to maximize I/O bandwidth to the co-processors.
b) Digital controller 830 may handle a complete VMM signal processing chain (including floating point computations). The chain may include dedicated digital blocks for data formatting to enable computation of general matrix-matrix multiplication, vector-matrix multiplication, and 2D convolution/correlation operations in the analog domain, as well as mapping/inverse mapping of data on crossbar arrays, and normalization/de-normalization of exponents associated with floating point computations.
c) ISA extensions may be optimized to execute VMMs in memristor crossbar arrays and integrated into host processor 990.
d) Digital controller 830 may also perform additional calculations beyond those described in (b) to minimize memory accesses and improve computational efficiency. The type of digital computation is determined by the various applications implemented in the analog coprocessor and may be categorized as a special function to be processed by special function units in the digital engine.
VMM signal processing chain
Controller 830 may implement a VMM signal processing chain to support the memristor-CMOS crossbar arrays of VMM engine 820 in accelerating VMM operations for various applications. The chain may include the following blocks:
data formatting block 830: many applications that will run on the computing system 800 perform general matrix-matrix multiplication operations or 2D convolution/correlation operations. In the case of matrix-matrix multiplication, controller 830 may cause VMM engine 820 to write one of the matrices to its memristor-CMOS crossbar array as a corresponding conductance value, and the other matrix to be multiplied row-by-row using the written memristor-CMOS crossbar array, such that each matrix row is the input vector for the vector-matrix multiplication. In the case of 2D convolution, additional shift and add operations and matrix multiplication may be required. Here, processor 806 may use the processor to read out the second matrix from memory 806 in an overlapping order to represent the shifts required for the convolution operation before being applied to the memristor crossbar array of VMM engine 820. The VMM outputs at the ends of the various memristor crossbar array columns of the VMM engine 820 may then be summed to obtain the final 2D convolution result. An application layer, which may be implemented by processor 810 or another component, may distinguish which of the two main categories of matrix multiplication operations an analog coprocessor will handle in a given instance.
Normalization and denormalization block 834: Different applications implemented by the system 800 may require different levels of precision and dynamic range. To accommodate this, a floating point format may be used. The system 800 may represent a floating point value as a number (the mantissa) scaled by a power of two (the exponent), with a leading sign bit. The system 800 may process floating point numbers by normalizing exponents and aligning mantissas. This normalization is more efficient when performed on a set of values.
To perform this normalization and alignment, the controller 830 may first identify the exponent Emax of the extremum. The mantissas of the other values may then be shifted by (Emax − E − bias) bits to normalize those values to the shared exponent Emax, where E is the exponent of each individual value and the 'bias' term is a constant for a particular bit precision (e.g., for 32-bit floating point numbers, the bias is 127). In the worst case, the normalization factor may be as large as 278 bits for single precision floating point calculations. To circumvent this problem, controller 830 may align elements within the individual units of VMM engine 820. This solution exploits the fact that the difference between the exponents of adjacent elements is significantly smaller than the difference between the extrema. The normalized exponent values of the various units of the analog coprocessor may be stored to memory 806 for use during the denormalization process that converts the multiply-and-accumulate results back to floating point precision. Controller 830 may implement digital logic blocks, as part of the VMM signal processing chain, to perform the normalization and denormalization processes required to process floating point numbers.
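The following sketch illustrates the block-exponent normalization and denormalization described above in simplified form; the bit widths, the handling of the bias term, and the per-unit grouping are simplified relative to the hardware.

```python
import numpy as np

# Minimal sketch of block exponent normalization and denormalization.
def normalize_block(values):
    m, e = np.frexp(values)              # value = m * 2**e with |m| in [0.5, 1)
    e_max = e.max()                      # shared exponent for the block
    aligned = m * np.exp2(e - e_max)     # shift mantissas to the shared exponent
    return aligned, e_max

def denormalize(result, e_max_a, e_max_b):
    return result * np.exp2(e_max_a + e_max_b)   # restore exponents after the MAC

a = np.array([3.0e2, -1.5e1, 4.2e0])
b = np.array([2.0e-3, 7.5e-2, 1.0e-1])
ma, ea = normalize_block(a)
mb, eb = normalize_block(b)
dot = denormalize(np.dot(ma, mb), ea, eb)        # MAC performed on aligned mantissas
print(np.isclose(dot, np.dot(a, b)))             # True
```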
Mapping and inverse mapping block 832: The first step in performing the VMM operation is to store the matrix values as the conductances of the memristors in the memristor crossbar array. A memristor has a high conductance state (Ymax) and a low conductance state (Ymin). Let the matrix to be mapped into the memristor crossbar array be A. The highest and lowest values of the matrix are first identified as Amax and Amin. Two linear mapping coefficients are then defined as:
a = (Ymax − Ymin) / (Amax − Amin)
and b = Ymax − a·Amax
Using these two coefficients, the elements of matrix A are mapped to memristor conductance values as Y = a·A + b. The mapping also applies to negative numbers. After converting the elements of the matrix into memristor conductance values, these values are further converted into the write voltages of the memristors; these write voltages are used to store the conductance values in the memristors. The input vector values are likewise linearly mapped to the memristor read voltages. After VMM engine 820 completes the VMM operation, the output voltages must be inverse mapped to the actual values. For an input vector x and an output vector V, the inverse mapping operation is implemented as:
(A·x)j = (Vj − b·Σi xi) / a
mapping and inverse mapping block 832 may perform the mapping and inverse mapping processes described above as part of the VMM signal processing chain.
Memory management
System 800 may implement a variety of solutions to manage memory 806 to ensure sufficient memory bandwidth to complete VMM operations. The solutions that the system can implement include the following:
Integral threading: Fig. 10 is a flowchart illustrating a threading process 1000 equivalent to the threading process implemented in a Graphics Processing Unit (GPU). As shown, a streaming multiprocessor (SM) instruction scheduler 1012 groups thread blocks into "warps," in which individual instructions are scheduled for execution on the streaming multiprocessor.
Referring again to fig. 9, the system 800 can provide integral threading by enabling concurrent processing on a per-row/column basis. The analog coprocessor can assign each row of the matrix to a set of horizontal and vertical rows/columns in a floating-point engine, similar to the way each core in a CPU assigns a separate thread to each row. For example, a subset of the instructions 970 (I14, I60, I4) may be assigned to respective rows of VMM engine 822a, while another subset of the instructions (I15, I61, I5) may be assigned to respective rows of VMM engine 822b, as shown in fig. 9. The rows of the matrix may be operated on in parallel by all of the crossbars in engines 822a-822b, similar to the way each streaming multiprocessor in a GPU performs work on each row using one SIMD (single instruction multiple data) unit. A program may be executed on the crossbar array of VMM engine 820 such that VMM computations execute concurrently within the crossbar array using only instruction-level parallelism, without the need to use threads explicitly. This solution simplifies thread management and simplifies VMM handling relative to the solutions implemented in digital systems. A multi-threaded (MT) issue unit may not be required, and the instruction and data caches may reside on the host processor 990, with a high-rate interconnect between the host processor 990 and the system 800. Thus, after setting the corresponding memristor crossbar array columns to the desired multiplication weight values, the system 800 may execute the instructions 970 in parallel by applying the vector inputs to the memristor crossbar array rows within VMM engine 820 through physical computation. Groups of instructions may be bundled into a Very Long Instruction Word (VLIW) instruction for more efficient processing.
Data representation to facilitate VMM updates: FIG. 11 illustrates a data structure 1100 that can be implemented by the system 800 to represent data. System 800 accelerates VMM operations and may require high data throughput to maximize use of the crossbar cores of VMM engine 820. The data structure 1100 can be updated easily to maintain high throughput while minimizing expensive write operations to the crossbar cores. To accomplish this, a column vector may be encoded and stored in the memory 806 in a format that includes a flag bit 1102 and a unique identifier 1104. In this data structure 1100, the data representation may encode a value corresponding to, for example, "not updated, column 1 - unit 1 - core 1 - floating point engine 1". If the column vector has changed and needs to be updated, its flag bit 1102 may be set, and an interrupt is triggered to the host processor 990 to write the new value into the crossbar core of the corresponding floating point engine 822a.
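A hypothetical sketch of such a column-vector entry is shown below; the field names are illustrative rather than taken from the patent, but they capture the flag bit 1102 and the hierarchical identifier 1104.

```python
from dataclasses import dataclass, field

# Hypothetical column-vector descriptor modeled on data structure 1100: a flag bit
# indicating whether the crossbar copy is stale, plus an identifier naming where the
# column lives in the floating-point-engine hierarchy. Field names are illustrative.
@dataclass
class ColumnVectorEntry:
    needs_update: bool        # flag bit 1102: set when the stored value has changed
    engine: int               # unique identifier 1104 encodes the hierarchy:
    core: int                 #   floating point engine -> core -> unit -> column
    unit: int
    column: int
    values: list = field(default_factory=list)   # the column vector itself

entry = ColumnVectorEntry(False, engine=1, core=1, unit=1, column=1,
                          values=[0.5, -1.25, 3.0])
entry.values[0] = 0.75        # the vector changes...
entry.needs_update = True     # ...so the flag is set and the host is interrupted
                              # to rewrite the corresponding crossbar column
```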
Scatter/gather vector support: Host processor 990 may be used as a companion to system 800 in a heterogeneous computing environment, with scatter/gather transaction support to improve throughput. Scatter/gather vector transactions may be used for I/O to the crossbar arrays within VMM engine 820, where values are sent to the crossbar array DACs. Scatter/gather vector transactions can improve throughput by accessing non-contiguous locations of memory 806 more efficiently than conventional memory access methods. In a gather operation, the matrix column vectors may be stored across a set of sub-banks of the SRAM memory by DMA access from the DRAM. The matrix column vectors are read from the SRAM memory into a request queue, routed through the sub-bank address/data crossbar to the designated gather registers, and finally accessed by the VMM core from the associated gather registers. In a scatter operation, the VMM core writes values to the associated scatter registers, the values are then routed through the sub-bank address/data crossbar to the specified request queues, and the data are finally written into the desired sub-banks of the SRAM memory. The output may be written to the DRAM memory by DMA access.
The following table summarizes example write and read processes for both the crossbar column vectors (memristor multiplication weights for the VMM) and the vector input data.
(Table: write and read flows for the crossbar column vectors and the vector input data, following the gather and scatter operations described above.)
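The toy model below mirrors the gather and scatter staging described above; the SRAM sub-banks, request queue, and gather/scatter registers are plain Python structures standing in for the hardware blocks, and the 4-way banking and vector sizes are illustrative assumptions.

```python
import numpy as np

# Toy model of gather/scatter staging between banked SRAM and the VMM core.
SUBBANKS = 4

def gather(sram, column_id):
    request_queue = [bank[column_id] for bank in sram]    # read sub-banks into queue
    gather_regs = np.concatenate(request_queue)           # route to gather registers
    return gather_regs                                    # VMM core reads from here

def scatter(sram, column_id, result):
    scatter_regs = np.array_split(result, SUBBANKS)       # VMM core fills registers
    for bank, chunk in zip(sram, scatter_regs):           # route back through crossbar
        bank[column_id] = chunk                           # write desired sub-banks

# One matrix column striped across the sub-banks by an earlier DMA from DRAM:
sram = [{0: np.random.randn(8)} for _ in range(SUBBANKS)]
weights = gather(sram, column_id=0)                       # 32 weight values gathered
scatter(sram, column_id=0, result=weights * 0.5)          # updated values scattered back
```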
Instruction set architecture and integration
Custom instructions for the VMM: As the digital blocks are fabricated, the data formatting process can be extended by creating user-defined instructions that originate from the CPU of the host device and run on the analog coprocessor. Each major class of matrix operation may have custom user instructions directed to the analog coprocessor. The custom instructions may be issued as an extension to an existing instruction set, with the ability to run on the host device. This allows for custom configuration of the internal functional IP blocks.
A representative set of custom instructions for executing a VMM in the analog coprocessor includes:
(a) Custom load/store: Custom load instructions are used to program input values into the memristor crossbar array rows and to program multiplier weight values into the memristor crossbar array columns within the analog coprocessor units. Custom store instructions are used to store VMM output values from the memristor crossbar array columns within the analog coprocessor units to memory. The custom load and store instructions operate between a set of dedicated registers for the VMM and memory (VMM instruction and data caches and shared memory space).
(b) Vector matrix multiplication: Independent instructions for integer and floating point precision are defined at different precision levels (e.g., 32-, 64-, and 128-bit integer multiplication and 16-, 32-, 64-, and 128-bit floating point computation). The analog coprocessor architecture focuses on floating point computation, and the particular configuration of cores and units in a floating point engine defines the computational precision of that engine, as previously described. Therefore, instructions of a particular multiply type and precision must be directed only to the floating point engines that support that computation. Vector matrix multiply instructions (as opposed to traditional multiply instructions) facilitate greater I/O bandwidth into the memristor crossbar arrays within the analog coprocessor units by amortizing the control overhead of each operation.
(c) Bit manipulation instructions: The analog coprocessor performs the VMM by multiplying the floating-point mantissas within its units' memristor crossbar arrays while normalizing the floating-point exponents and adjusting the floating-point sign bits. This requires instructions capable of extracting, inserting, shifting, rotating, and testing individual bits within the floating point registers of the analog coprocessor. The instructions for manipulating the mantissa, exponent, and sign bits execute within the larger processing of the VMM signal processing chain.
(d) Transactional memory: Transactional memory instructions allow load and store instructions to run atomically. When the analog coprocessor loads and stores values for VMM processing, it typically performs concurrent reads and writes of shared data among many units running in parallel. This requires extensive coordination between the crossbar array cells, which is facilitated by atomic loads and stores.
As an illustrative example, the following pseudo-code defines the instruction processing order in the VMM signal processing chain for translating the computation y = Ax into instructions (an executable sketch of this sequence follows the list):
a) Data_Format(R0): Custom instruction for identifying the operation type as a matrix multiplication or a 2D convolution/correlation operation. Register R0 is updated with a corresponding flag (e.g., 1 to indicate matrix multiplication).
b) Load(R1, A): The contents of matrix A are loaded into register R1. This may be performed as a sequence of loads, depending on the maximum register size.
c) Load(R2, x): The contents of vector x are loaded into register R2. Again, this may be performed as a sequence of loads, depending on the maximum register size.
d) Normalize(R3, R4, R1, R2): Custom instruction for normalizing the contents of registers R1 and R2 and returning the values to registers R3 and R4, respectively. This high-level instruction is composed of a subset of lower-level instructions, such as compare and bit-shift operations.
e) Map matrix(R5, R3): Custom instruction for mapping the contents of register R3 to conductance values and storing them in register R5. This high-level instruction is composed of a subset of lower-level instructions, such as multiply and divide.
f) Map vector(R6, R4): Custom instruction for mapping the contents of register R4 to input vector values and storing them in register R6. This high-level instruction is composed of a subset of lower-level instructions, such as multiply and divide.
g) Load custom({Engine 1-Core 1-Cell 1}, R5): Custom instruction for writing the conductance values of R5 to a particular floating point engine unit, Engine 1-Core 1-Cell 1. This may be performed as a sequence of custom loads, depending on the maximum register size.
h) VMM({Engine 1-Core 1-Cell 1}, R6): Custom instruction for multiplying the voltage values of R6 with the conductance values of Engine 1-Core 1-Cell 1.
i) Store custom(R7, {Engine 1-Core 1-Cell 1}): Custom instruction for storing the output of the VMM operation. After the corresponding inverse mapping and denormalization instructions, the output is finally returned to the caller.
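The sketch below walks through the same y = Ax sequence in software, with each custom instruction replaced by a small numerical step (normalization, conductance mapping, the VMM itself, inverse mapping, and denormalization); the conductance range is an illustrative assumption, and the explicit read-voltage mapping of the input vector is omitted for brevity.

```python
import numpy as np

# End-to-end software sketch of the y = Ax instruction sequence above.
Y_min, Y_max = 1e-6, 1e-3                        # assumed conductance range (S)

def normalize(block):                            # Normalize: shared exponent per block
    m, e = np.frexp(block)
    e_max = e.max()
    return m * np.exp2(e - e_max), e_max

A = np.random.randn(4, 4)                        # Load(R1, A)
x = np.random.randn(4)                           # Load(R2, x)
A_n, eA = normalize(A)                           # Normalize(R3, R4, R1, R2)
x_n, ex = normalize(x)

a = (Y_max - Y_min) / (A_n.max() - A_n.min())    # Map matrix(R5, R3)
b = Y_max - a * A_n.max()
G = a * A_n + b                                  # Load custom: conductances written
V = G @ x_n                                      # VMM({Engine 1-Core 1-Cell 1}, R6)

y_n = (V - b * x_n.sum()) / a                    # inverse mapping of the output
y = y_n * np.exp2(eA + ex)                       # denormalize, then Store custom(R7, ...)
print(np.allclose(y, A @ x))                     # True
```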
Host device integration: Fig. 12 is a block diagram of a system 1200 in which the computing system 800 is integrated with a host device 1220. System 800 may be integrated within host device 1220 (e.g., into a common system on a chip (SoC)) to form an end-to-end solution in which VMM acceleration functions are defined in software cores 1242 and ultimately run on computing system 800. In this example, the system 1200 is configured for a neural network application.
The host (edge) device 1220 may contain an application stack tasked with performing neural network inference, receiving neural network model descriptors 1208 that have been trained in the cloud/data center 1210. The neural network descriptor 1208 may be compiled into a standard language, such as C/C++, using a custom neural network compiler, and then further compiled into instructions to run on a host Digital Signal Processor (DSP) running a real-time operating system (RTOS) and a neural network runtime environment. The software core 1242, bound to a neural network function library, may define a set of VMM functions to be accelerated on the edge device containing the DSP. VMM data associated with these accelerated functions may be accessed directly via the integrated DMA. Custom instructions from the host DSP may be passed to the floating point engines of system 800 (collectively referred to as an analog Matrix Multiplication Unit (MMU)) to perform vector-matrix multiplication. These instructions can be passed from the host DSP, over a system bus via dedicated interconnects, to a DSP co-located with the analog MMU whose sole task is to move data into and out of the analog MMU crossbar arrays (the I/O DSP). As previously described, memristor write updates and memory management tasks may be initiated by a separate base DSP. The base DSP can implement the VMM signal processing chain logic blocks through custom RTL, as well as the calibration and compensation algorithms and nonlinear operations that support neural network inference.
Because the system 800 acts as an in-memory processing device, storing the multiplication weights directly in the memristor devices, data movement between memory and the crossbar arrays where the VMM occurs is minimized. One goal is for the input data for MAC operations involving the VMM to be moved into and out of the crossbar arrays of system 800 efficiently, and this task is managed by a dedicated I/O processor, which may be a configurable DSP. The I/O processor may support the previously defined custom instruction set (including custom load and store instructions) to move data into and out of the analog matrix multiplication unit. It may also have a high throughput interface (e.g., high speed queues) to provide enough I/O bandwidth for the analog matrix multiplication unit to utilize its computational power.
3D die stack with I/O and base processors: Figs. 13A to 13D are block diagrams illustrating systems configured in stacked structures in example embodiments. A key consideration in edge computing devices is space efficiency, since the footprint of the Integrated Circuit (IC) should be kept as small as possible. A 3D system on a chip (SoC) provides a means to reduce IC footprint and maximize throughput between the compute units and memory units on the SoC.
System 800 may be implemented in a 3D SoC in a variety of different configurations. As shown in the various example structures in figs. 13A-13D, a 3D stack of single 2D IC layers can be formed and connected using through-silicon vias, each layer containing the I/O and base processors and a number of analog engines (e.g., 4) containing memristor crossbar arrays of given dimensions (e.g., 128 × 128, 256 × 256, 512 × 512). The vias connect the various layers to each other and to the host DSP. A representative size for an individual 2D layer in the stack may be 6 mm × 6.25 mm, and a 3D stack may be formed from a plurality (e.g., 4) of such 2D layers.
While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of embodiments encompassed by the appended claims.

Claims (19)

1. A circuit, the circuit comprising:
vector Matrix Multiplication (VMM) processors configured to perform floating-point VMM operations, each floating-point VMM processor including at least one memristor network having an array of simulated memristor devices arranged in a crossbar structure; and
a controller configured to interface with the VMM processor, the controller configured to:
a) retrieving the read data from the memory;
b) determining a type of matrix multiplication to be performed based on the read data;
c) generating an input matrix having a format specific to the type of matrix multiplication to be performed;
d) determining a computational precision of the floating-point VMM operation and resolving a sign data field, an exponent data field, and a mantissa data field from floating-point elements of an input matrix, an
e) Sending input matrix data to the VMM processor.
2. The circuit of claim 1, wherein, in response to the type being a general matrix-matrix (GEMM) multiplication, the controller is further configured to: 1) generating the input matrix according to a row-by-row order of the read data, and 2) applying the input matrix to an input of the VMM processor, wherein the VMM processor is further configured to apply the input matrix to the at least one memristor network.
3. The circuitry of claim 2, wherein the input of the VMM processor comprises a VMM signal processing chain including digital logic blocks performing a set of sequential functions to prepare floating point data for a VMM, the functions including at least one of data formatting, exponent normalization/denormalization, and memristor network mapping/demapping.
4. The circuit of claim 1, wherein, in response to the type being two-dimensional (2D) convolution and correlation, the controller is further configured to: 1) generating the input matrix in an overlapping order of the read data, the overlapping order representing a shift of a convolution operation, and 2) applying the input matrix to an input of the VMM processor, wherein the VMM processor is further configured to apply the input matrix to the at least one memristor network.
5. The circuit of claim 1, wherein the controller is further configured to:
a) identifying an exponent of an extremum of a floating point number retrieved from the read data;
b) determining a normalized exponent for other values of the floating point number from the exponent, the other values being other than the extremum;
c) modifying the other values by replacing the respective indices with the respective normalized indices; and
d) converting result data from the at least one memristor network to a floating point value by a denormalization process based on the normalized exponent.
6. The circuit of claim 1, wherein the controller is further configured to:
a) identifying a matrix to be stored into at least one memristor network;
b) defining mapping coefficients for the matrix based on 1) high and low conductance states of the at least one memristor network and 2) highest and lowest values of the matrix;
c) defining a mapping relating elements of the matrix to conductance values of the at least one memristor network based on the mapping coefficients;
d) causing the VMM processor to store the matrix to the at least one memristor network in accordance with the mapping; and
e) converting the result data from the at least one memristor network to a numerical matrix value by an inverse mapping process based on the mapping.
7. The circuit of claim 1, wherein the controller is further configured to:
a) receiving, from an instruction cache of a host processor, a plurality of instructions based on the VMM operation to be performed, each instruction of the plurality of instructions specifying a configuration of a single row of the at least one memristor network; and is
b) Causing the VMM processor to execute the plurality of instructions in parallel via the at least one memristor network.
8. The circuitry of claim 7, wherein the controller is further configured to forward the plurality of instructions to the VMM processor as Very Long Instruction Word (VLIW) instructions.
9. The circuit of claim 1, wherein the controller is further configured to:
a) identifying, from the read data, a column vector to be written to the at least one memristor network;
b) for each of the column vectors, 1) generating an identifier representing a layer of a hierarchy of the at least one memristor network, and 2) generating a flag bit indicating whether to update a value corresponding to the column vector; and is
c) Storing the column vectors and corresponding identifiers and flag bits into the memory.
10. The circuit of claim 1, wherein the controller is further configured to:
a) identifying, from the read data, a matrix column vector to be written to the at least one memristor network;
b) performing an aggregation operation on the matrix column vectors in the following manner:
i. storing a matrix column vector in a set of sub-banks of the SRAM memory, and
ii. reading the matrix column vectors from the SRAM memory into a request queue, routing them through the sub-bank address/data crossbar to designated aggregation registers, and accessing them by the VMM processor from the associated aggregation registers;
c) mapping the matrix column vector included in an aggregation register to conductance values of a crossbar of the at least one memristor network of the VMM processor; and
d) determining a memristor weight value based on the mapping to program the at least one memristor network of the VMM processor.
11. The circuit of claim 1, wherein the controller is further configured to:
a) reading a voltage output from the crossbar;
b) mapping the voltage output to a digital matrix value; and
c) storing the digital matrix values into a memory by a scatter operation in the following manner:
i. the VMM processor writes the values into the associated scatter registers, then routes these values through sub-bank address/data crossbars to the specified request queues, and writes the data into the desired sub-banks of SRAM memory,
ii. writing the output to the DRAM memory through DMA access.
12. The circuit of claim 1, wherein the controller is further configured to:
a) retrieving vector input data values from the read data to be written to the at least one memristor network;
b) performing an aggregation operation on the vector input data in the following manner:
i. storing vector input data in a set of sub-banks of an SRAM memory, and
ii. reading the vector input data from the SRAM memory into the request queue, routing it through the sub-bank address/data crossbar to the designated gather register, and finally accessing it by the VMM processor from the associated gather register;
c) mapping the vector input data values to a crossbar of the at least one memristor network in the VMM processor; and
d) determining a memristor voltage based on the mapping to program the at least one memristor network of the VMM processor.
13. The circuit of claim 1, wherein the controller is further configured to:
a) identifying a custom instruction from the read data, the custom instruction defining an operation associated with a VMM; and
b) causing the VMM processor to configure the at least one memristor network in accordance with the custom instruction.
14. The circuitry of claim 13, wherein the custom instruction comprises a load/store instruction to: 1) programming input values into memristor crossbar array rows and multiplicative weight values into the at least one memristor network within the VMM processor, and 2) storing VMM output values from the at least one memristor network within the VMM processor into memory.
15. The circuitry of claim 13, wherein the custom instruction comprises a VMM instruction to: 1) define parameters including VMM floating-point precision, 2) format VMM data and map the VMM data into the at least one memristor network within the VMM processor, and 3) facilitate greater I/O bandwidth by leveraging VLIW processing to amortize control overhead per operation.
16. The circuitry of claim 13, wherein the custom instruction comprises a bit manipulation instruction that defines at least one of a fetch, an insert, a shift, a rotate, and a test of individual bits within a floating point register of the VMM processor, wherein the instruction to manipulate mantissa, exponent, and sign bits is executed within a larger process of a VMM signal processing chain.
17. The circuitry of claim 13, wherein the custom instruction comprises a transactional memory instruction that defines an I/O efficiency and scatter/gather instruction, and further defines an atomic operation of the custom instruction to facilitate coordinating reading/writing values from/to the at least one memristor network of the VMM processor.
18. The circuit of claim 1, wherein the controller is further configured to interface with a neural network system on a chip (SoC), the controller configured to:
a) a pair of digital signal processors is included such that:
i. one digital signal processor is used only for I/O of the at least one memristor network of the VMM processor,
a second digital signal processor for digital architecture functions such as the VMM signal processing chain, memory management, non-linear operations, custom instruction processing, and calibration/compensation algorithms,
b) interfacing to the neural network system on a chip such that:
i. the task of the SoC is a neural network inference workload defined by a neural network model descriptor, and includes a set of kernel functions to be run on a VMM processor,
the kernel function of the model descriptor is compiled into a custom instruction to be passed by the neural network system on a chip over a high speed interconnect to the set of digital signal processors,
c) instructions are received and processed by the set of digital signal processors to cause the VMM processor to execute VMM functions.
19. The circuit of claim 1, wherein the VMM processor and the controller are configured in a system-on-chip having a multi-layer stack of a plurality of 2D Integrated Circuit (IC) layers, individual ones of the plurality of layers comprising a subset of the at least one memristor network, individual ones of the plurality of layers linked by through-silicon vias (TSVs).
CN201880084392.0A 2017-12-29 2018-12-28 Digital architecture supporting analog coprocessors Pending CN111542826A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762611870P 2017-12-29 2017-12-29
US62/611,870 2017-12-29
PCT/US2018/067889 WO2019133829A1 (en) 2017-12-29 2018-12-28 Digital architecture supporting analog co-processor

Publications (1)

Publication Number Publication Date
CN111542826A true CN111542826A (en) 2020-08-14

Family

ID=67058373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880084392.0A Pending CN111542826A (en) 2017-12-29 2018-12-28 Digital architecture supporting analog coprocessors

Country Status (3)

Country Link
US (1) US10867239B2 (en)
CN (1) CN111542826A (en)
WO (1) WO2019133829A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506468A (en) * 2020-12-09 2021-03-16 上海交通大学 RISC-V general processor supporting high throughput multi-precision multiplication
CN114840809A (en) * 2022-05-24 2022-08-02 深圳市畅娱时空网络科技有限公司 Fixed point number-based accurate physical system synchronization method
CN115454507A (en) * 2022-11-10 2022-12-09 统信软件技术有限公司 Method and device for parallel execution of multiple tasks, computing device and readable storage medium
CN115617717A (en) * 2022-11-21 2023-01-17 上海亿铸智能科技有限公司 Coprocessor design method based on memristor
CN115905791A (en) * 2022-11-25 2023-04-04 湖南胜云光电科技有限公司 Digital signal processing system

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10970080B2 (en) 2018-02-08 2021-04-06 Marvell Asia Pte, Ltd. Systems and methods for programmable hardware architecture for machine learning
US11875245B1 (en) * 2018-02-22 2024-01-16 United States Of America As Represented By The Secretary Of The Air Force System, apparatus and method for spiking neural network
US10622087B2 (en) * 2018-03-01 2020-04-14 Hewlett Packard Enterprise Development Lp Integrated characterization vehicles for non-volatile memory cells
US10496374B2 (en) * 2018-03-22 2019-12-03 Hewlett Packard Enterprise Development Lp Crossbar array operations using ALU modified signals
EP3557485B1 (en) * 2018-04-19 2021-05-26 Aimotive Kft. Method for accelerating operations and accelerator apparatus
WO2019212466A1 (en) * 2018-04-30 2019-11-07 Hewlett Packard Enterprise Development Lp Resistive and digital processing cores
US10929779B1 (en) 2018-05-22 2021-02-23 Marvell Asia Pte, Ltd. Architecture to support synchronization between core and inference engine for machine learning
US10929760B1 (en) 2018-05-22 2021-02-23 Marvell Asia Pte, Ltd. Architecture for table-based mathematical operations for inference acceleration in machine learning
US10997510B1 (en) 2018-05-22 2021-05-04 Marvell Asia Pte, Ltd. Architecture to support tanh and sigmoid operations for inference acceleration in machine learning
US10891136B1 (en) 2018-05-22 2021-01-12 Marvell Asia Pte, Ltd. Data transmission between memory and on chip memory of inference engine for machine learning via a single data gathering instruction
US11016801B1 (en) 2018-05-22 2021-05-25 Marvell Asia Pte, Ltd. Architecture to support color scheme-based synchronization for machine learning
US10929778B1 (en) 2018-05-22 2021-02-23 Marvell Asia Pte, Ltd. Address interleaving for machine learning
US10440341B1 (en) * 2018-06-07 2019-10-08 Micron Technology, Inc. Image processor formed in an array of memory cells
US10528643B1 (en) * 2018-08-01 2020-01-07 Sandisk Technologies Llc Vector-matrix multiplication using non-volatile memory cells
CN108763163B (en) * 2018-08-02 2023-10-20 北京知存科技有限公司 Analog vector-matrix multiplication circuit
US10489483B1 (en) * 2018-09-21 2019-11-26 National Technology & Engineering Solutions Of Sandia, Llc Circuit arrangement and technique for setting matrix values in three-terminal memory cells
US11494625B2 (en) * 2018-10-03 2022-11-08 Maxim Integrated Products, Inc. Systems and methods for energy-efficient analog matrix multiplication for machine learning processes
US11184446B2 (en) 2018-12-05 2021-11-23 Micron Technology, Inc. Methods and apparatus for incentivizing participation in fog networks
US10802994B1 (en) * 2019-01-08 2020-10-13 Tetramem Inc. Caliberating a plurality of conductances of columns in memristor crossbar based computing
US11256778B2 (en) 2019-02-14 2022-02-22 Micron Technology, Inc. Methods and apparatus for checking the results of characterized memory searches
US11042715B2 (en) * 2019-04-11 2021-06-22 International Business Machines Corporation Electronic system for performing a multiplication of a matrix and vector
US10867655B1 (en) 2019-07-08 2020-12-15 Micron Technology, Inc. Methods and apparatus for dynamically adjusting performance of partitioned memory
US11449577B2 (en) 2019-11-20 2022-09-20 Micron Technology, Inc. Methods and apparatus for performing video processing matrix operations within a memory array
US11853385B2 (en) 2019-12-05 2023-12-26 Micron Technology, Inc. Methods and apparatus for performing diversity matrix operations within a memory array
CN111125616B (en) * 2019-12-09 2021-11-19 华中科技大学 Two-dimensional discrete Fourier transform operation circuit and operation method
CN111242293B (en) * 2020-01-13 2023-07-18 腾讯科技(深圳)有限公司 Processing component, data processing method and electronic equipment
WO2021150952A1 (en) * 2020-01-23 2021-07-29 Spero Devices, Inc. Data flow architecture for processing with memory computation modules
US11539370B2 (en) * 2020-02-23 2022-12-27 Tetramem Inc. Analog to analog quantizer in crossbar array circuits for in-memory computing
CN113497763A (en) * 2020-03-19 2021-10-12 华为技术有限公司 Route searching device and method and data forwarding equipment
US11604913B2 (en) * 2020-04-13 2023-03-14 Sync Computing Corp. Optimization processing unit having subunits that are programmably and partially connected
US11520855B2 (en) 2020-05-15 2022-12-06 International Business Machines Corportation Matrix sketching using analog crossbar architectures
US11562240B2 (en) 2020-05-27 2023-01-24 International Business Machines Corporation Efficient tile mapping for row-by-row convolutional neural network mapping for analog artificial intelligence network inference
KR20210154502A (en) * 2020-06-12 2021-12-21 삼성전자주식회사 Neural network apparatus performing floating point operation and operating method of the same
CN113867788A (en) * 2020-06-30 2021-12-31 上海寒武纪信息科技有限公司 Computing device, chip, board card, electronic equipment and computing method
US20220012586A1 (en) * 2020-07-13 2022-01-13 Macronix International Co., Ltd. Input mapping to reduce non-ideal effect of compute-in-memory
US11966785B2 (en) * 2020-07-30 2024-04-23 Arm Limited Hardware resource configuration for processing system
US11200948B1 (en) * 2020-08-27 2021-12-14 Hewlett Packard Enterprise Development Lp System for a flexible conductance crossbar
US11790033B2 (en) 2020-09-16 2023-10-17 International Business Machines Corporation Accelerated Quasi-Newton methods on analog crossbar hardware
KR20220045357A (en) * 2020-10-05 2022-04-12 삼성전자주식회사 Electronic device and method for controlling electronic device
US11599360B2 (en) 2020-12-14 2023-03-07 Cognitive Science & Solutions, Inc. AI synaptic coprocessor
CN113449256B (en) * 2021-07-13 2023-08-18 湖南大学 Memristor-based programmable FFT method and circuit structure thereof
US20230031841A1 (en) * 2021-08-02 2023-02-02 Qualcomm Incorporated Folding column adder architecture for digital compute in memory
CN115840527A (en) * 2021-09-22 2023-03-24 河北大学 Circuit system applied to memristor array weight modulation and image recognition
WO2023102722A1 (en) * 2021-12-07 2023-06-15 Intel Corporation Interleaved data loading system to overlap computation and data storing for operations

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929842A (en) * 2012-10-23 2013-02-13 南京航空航天大学 Design method for field programmable analog array (FPAA)-based reconfigurable vector-matrix multiplier
US20170052857A1 (en) * 2015-08-20 2017-02-23 Qsigma, Inc. Simultaneous Multi-Processor Apparatus Applicable to Acheiving Exascale Performance for Algorithms and Program Systems

Family Cites Families (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100919236B1 (en) * 2007-05-22 2009-09-30 한국전자통신연구원 A method for 3D Graphic Geometric Transformation using Parallel Processor
US8352847B2 (en) * 2009-12-02 2013-01-08 Lsi Corporation Matrix vector multiplication for error-correction encoding and the like
US9152827B2 (en) 2012-12-19 2015-10-06 The United States Of America As Represented By The Secretary Of The Air Force Apparatus for performing matrix vector multiplication approximation using crossbar arrays of resistive memory devices
US9754203B2 (en) 2013-03-24 2017-09-05 Technion Research And Development Foundation Ltd. Analog multiplier using a memristive device and method for implemening Hebbian learning rules using memrisor arrays
US9715655B2 (en) 2013-12-18 2017-07-25 The United States Of America As Represented By The Secretary Of The Air Force Method and apparatus for performing close-loop programming of resistive memory devices in crossbar array based hardware circuits and systems
KR20170075741A (en) * 2014-10-29 2017-07-03 휴렛 팩커드 엔터프라이즈 디벨롭먼트 엘피 Memristive dot product engine for vector processing
KR20170078633A (en) * 2014-10-30 2017-07-07 휴렛 팩커드 엔터프라이즈 디벨롭먼트 엘피 Double bias memristive dot product engine for vector processing
US10996959B2 (en) 2015-01-08 2021-05-04 Technion Research And Development Foundation Ltd. Hybrid processor
EP3268969B1 (en) * 2015-04-16 2019-06-05 Hewlett-Packard Enterprise Development LP Resistive memory arrays for performing multiply-accumulate operations
US9824753B2 (en) 2015-10-21 2017-11-21 Technische Universiteit Delft Computing device for “big data” applications using memristors
US10901939B2 (en) 2015-10-30 2021-01-26 International Business Machines Corporation Computer architecture with resistive processing units
US10061748B2 (en) * 2015-12-11 2018-08-28 Sap Se Adaptive tile matrix representation and multiplication
WO2017105460A1 (en) * 2015-12-17 2017-06-22 Hewlett Packard Enterprise Development Lp Improved computational accuracy in a crossbar array
WO2017105517A1 (en) * 2015-12-18 2017-06-22 Hewlett Packard Enterprise Development Lp Memristor crossbar arrays to activate processors
WO2017127086A1 (en) * 2016-01-21 2017-07-27 Hewlett Packard Enterprise Development Lp Analog sub-matrix computing from input matrixes
WO2017131632A1 (en) * 2016-01-26 2017-08-03 Hewlett Packard Enterprise Development Lp Memristive arrays with offset elements
WO2017131653A1 (en) * 2016-01-27 2017-08-03 Hewlett Packard Enterprise Development Lp In situ transposition
EP3414702A1 (en) * 2016-02-08 2018-12-19 Spero Devices, Inc. Analog co-processor
US9785615B1 (en) * 2016-06-24 2017-10-10 Hewlett Packard Enterprise Development Lp Memristive computation of a vector cross product
US20180341642A1 (en) * 2016-07-17 2018-11-29 Gsi Technology Inc. Natural language processing with knn
US9646243B1 (en) * 2016-09-12 2017-05-09 International Business Machines Corporation Convolutional neural networks using resistive processing unit array
US10346347B2 (en) * 2016-10-03 2019-07-09 The Regents Of The University Of Michigan Field-programmable crossbar array for reconfigurable computing
US10241971B2 (en) * 2016-12-15 2019-03-26 Hewlett Packard Enterprise Development Lp Hierarchical computations on sparse matrix rows via a memristor array
US11663450B2 (en) * 2017-02-28 2023-05-30 Microsoft Technology Licensing, Llc Neural network processing with chained instructions
US10171084B2 (en) * 2017-04-24 2019-01-01 The Regents Of The University Of Michigan Sparse coding with Memristor networks
US10055383B1 (en) * 2017-04-28 2018-08-21 Hewlett Packard Enterprise Development Lp Matrix circuits
US10496335B2 (en) * 2017-06-30 2019-12-03 Intel Corporation Method and apparatus for performing multi-object transformations on a storage device
US10482929B2 (en) * 2017-07-13 2019-11-19 Qualcomm Incorporated Non-volative (NV) memory (NVM) matrix circuits employing NVM matrix circuits for performing matrix computations
US10460817B2 (en) * 2017-07-13 2019-10-29 Qualcomm Incorporated Multiple (multi-) level cell (MLC) non-volatile (NV) memory (NVM) matrix circuits for performing matrix computations with multi-bit input vectors
US10210138B2 (en) * 2017-07-19 2019-02-19 International Business Machines Corporation Higher accuracy of non-volatile memory-based vector multiplication
US10127494B1 (en) * 2017-08-02 2018-11-13 Google Llc Neural network crossbar stack
US10949766B2 (en) * 2017-10-15 2021-03-16 Gsi Technology Inc. Precise exponent and exact softmax computation
US10372787B2 (en) * 2017-12-12 2019-08-06 Facebook, Inc. Hardware accelerator pre-configured with coefficients for matrix-transform operations
US10283190B1 (en) * 2017-12-18 2019-05-07 Qualcomm Incorporated Transpose non-volatile (NV) memory (NVM) bit cells and related data arrays configured for row and column, transpose access operations
US10242737B1 (en) * 2018-02-13 2019-03-26 Macronix International Co., Ltd. Device structure for neuromorphic computing system
US10929503B2 (en) * 2018-12-21 2021-02-23 Intel Corporation Apparatus and method for a masked multiply instruction to support neural network pruning operations

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929842A (en) * 2012-10-23 2013-02-13 南京航空航天大学 Design method for field programmable analog array (FPAA)-based reconfigurable vector-matrix multiplier
US20170052857A1 (en) * 2015-08-20 2017-02-23 Qsigma, Inc. Simultaneous Multi-Processor Apparatus Applicable to Acheiving Exascale Performance for Algorithms and Program Systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡海兵;吕征宇;钱照明;: "浮点协处理器设计及其在电力电子数字控制平台中的应用", 中国电机工程学报, no. 03 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506468A (en) * 2020-12-09 2021-03-16 上海交通大学 RISC-V general processor supporting high throughput multi-precision multiplication
CN114840809A (en) * 2022-05-24 2022-08-02 深圳市畅娱时空网络科技有限公司 Fixed point number-based accurate physical system synchronization method
CN115454507A (en) * 2022-11-10 2022-12-09 统信软件技术有限公司 Method and device for parallel execution of multiple tasks, computing device and readable storage medium
CN115454507B (en) * 2022-11-10 2023-03-24 统信软件技术有限公司 Method and device for parallel execution of multiple tasks, computing device and readable storage medium
CN115617717A (en) * 2022-11-21 2023-01-17 上海亿铸智能科技有限公司 Coprocessor design method based on memristor
CN115617717B (en) * 2022-11-21 2023-05-12 苏州亿铸智能科技有限公司 Memristor-based coprocessor design method
CN115905791A (en) * 2022-11-25 2023-04-04 湖南胜云光电科技有限公司 Digital signal processing system
CN115905791B (en) * 2022-11-25 2023-08-04 湖南胜云光电科技有限公司 Digital signal processing system

Also Published As

Publication number Publication date
WO2019133829A1 (en) 2019-07-04
US10867239B2 (en) 2020-12-15
US20190205741A1 (en) 2019-07-04

Similar Documents

Publication Publication Date Title
CN111542826A (en) Digital architecture supporting analog coprocessors
CN108780492B (en) Analog coprocessor
US11625584B2 (en) Reconfigurable memory compression techniques for deep neural networks
Angizi et al. MRIMA: An MRAM-based in-memory accelerator
WO2018106526A1 (en) Block floating point for neural network implementations
JP2019537793A (en) Neural network calculation tile
Ranjan et al. X-mann: A crossbar based architecture for memory augmented neural networks
Jain et al. Neural network accelerator design with resistive crossbars: Opportunities and challenges
Fei et al. XB-SIM∗: A simulation framework for modeling and exploration of ReRAM-based CNN acceleration design
US11429310B2 (en) Adjustable function-in-memory computation system
Liu et al. A simulation framework for memristor-based heterogeneous computing architectures
US11443014B1 (en) Sparse matrix multiplier in hardware and a reconfigurable data processor including same
US20210255861A1 (en) Arithmetic logic unit
Chen et al. BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs
US11537859B2 (en) Flexible precision neural inference processing unit
Liu et al. FPRA: A fine-grained parallel RRAM architecture
Moura et al. Scalable and Energy-Efficient NN Acceleration with GPU-ReRAM Architecture
Lei et al. FPGA implementation of an exact dot product and its application in variable-precision floating-point arithmetic
Dey et al. An application specific processor architecture with 3D integration for recurrent neural networks
US20240143541A1 (en) Compute in-memory architecture for continuous on-chip learning
Xie Pin: A general-purpose computer architecture based on memristive devices
US11941371B2 (en) Bit string accumulation
US20230289398A1 (en) Efficient Matrix Multiply and Add with a Group of Warps
de Moura et al. Check for updates Scalable and Energy-Efficient NN Acceleration with GPU-ReRAM Architecture
Thasnimol et al. A Hardware Accelerator Implementation of Multilayer Perceptron

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination