WO2023015167A1 - Digital compute in memory - Google Patents

Digital compute in memory

Info

Publication number
WO2023015167A1
Authority
WO
WIPO (PCT)
Prior art keywords
lines
coupled
word
compute
transistor
Prior art date
Application number
PCT/US2022/074399
Other languages
French (fr)
Inventor
Zhongze Wang
Mustafa Badaroglu
Original Assignee
Qualcomm Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/816,285 external-priority patent/US12019905B2/en
Application filed by Qualcomm Incorporated filed Critical Qualcomm Incorporated
Priority to KR1020247003285A priority Critical patent/KR20240038721A/en
Priority to CN202280051713.3A priority patent/CN117751407A/en
Priority to EP22758105.5A priority patent/EP4381503A1/en
Publication of WO2023015167A1 publication Critical patent/WO2023015167A1/en

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1006Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C11/00Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/54Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802Special implementations
    • G06F2207/4818Threshold devices
    • G06F2207/4824Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C11/00Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/21Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C11/34Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C11/40Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C11/41Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
    • G11C11/413Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction
    • G11C11/417Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction for memory cells of the field-effect type
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C11/00Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/21Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C11/34Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C11/40Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C11/41Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
    • G11C11/413Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction
    • G11C11/417Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction for memory cells of the field-effect type
    • G11C11/419Read-write [R-W] circuits

Definitions

  • aspects of the present disclosure relate to performing machine learning tasks, and in particular, to computation-in-memory architectures.
  • Machine learning is generally the process of producing a trained model (e.g., an artificial neural network, a tree, or other structures), which represents a generalized fit to a set of training data that is known a priori. Applying the trained model to new data produces inferences, which may be used to gain insights into the new data. In some cases, applying the model to the new data is described as “running an inference” on the new data.
  • machine learning accelerators may be used to enhance a processing system’s capacity to process machine learning model data.
  • Such hardware demands space and power, which is not always available on the processing device.
  • “edge processing” devices such as mobile devices, always-on devices, Internet of Things (IoT) devices, and the like, typically have to balance processing capabilities with power and packaging constraints.
  • accelerators may move data across common data busses, which can cause significant power usage and introduce latency into other processes sharing the data bus. Consequently, other aspects of a processing system are being considered for processing machine learning model data.
  • Memory devices are one example of another aspect of a processing system that may be leveraged for performing processing of machine learning model data through so-called computation-in-memory (CIM) processes.
  • CIM arrays were developed to implement a node of a neural network framework without data transfer bottlenecks. A data transfer bottleneck is avoided by storing weight data within each cell of a CIM array and also performing a multiply operation within each cell.
  • Neural networks are a form of artificial intelligence relied on for a high level of accuracy, so the CIM array may be expected to generate accurate results.
  • Conventional CIM processes perform computation using analog signals, which may be more susceptible to inaccuracy in computation results, adversely impacting neural network computations. Accordingly, systems and methods are needed for performing computation-in-memory with increased accuracy. Additional design goals for CIM may include flexibility and scalability.
  • Certain aspects provide apparatus and techniques for performing machine learning tasks, and in particular, computation-in-memory architectures.
  • the circuit generally includes a plurality of memory cells on each of multiple bit-lines of a memory, the plurality of memory cells being configured to store multiple bits representing weights of a neural network.
  • the plurality of memory cells on each of the multiple bit-lines may be on different word-lines of the memory.
  • the circuit also includes a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
  • Another aspect provides a method for in-memory computation.
  • the method generally includes: storing, in a plurality of memory cells on each of multiple bit-lines of a memory, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the memory; and accumulating, via each accumulator of a plurality of accumulators, output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
  • the apparatus generally includes: means for storing, in a plurality of memory cells on each of multiple bit-lines of the means for storing, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the means for storing; and means for accumulating output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
  • the circuit generally includes multiple bit-lines; multiple word-lines; an array of compute-in-memory cells, wherein each compute-in-memory cell is coupled to one of the bit-lines and to one of the word-lines and is configured to store a weight bit of a neural network; and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
  • the method generally includes: performing computations, in at least a portion of an array of compute-in-memory cells, on a weight and an activation input for a neural network, each compute-in-memory cell being coupled to one of multiple bit-lines and to one of multiple word-lines and being configured to store a bit of the weight for the neural network; and accumulating, via each accumulator of a plurality of accumulators, output signals from two or more of the compute-in-memory cells coupled to a respective one of the multiple bit-lines.
  • processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.
  • FIGS. 1A-1D depict examples of various types of neural networks, which may be implemented by aspects of the present disclosure.
  • FIG. 2 depicts an example of a traditional convolution operation, which may be implemented by aspects of the present disclosure.
  • FIGS. 3A and 3B depict examples of depthwise separable convolution operations, which may be implemented by aspects of the present disclosure.
  • FIG. 4 illustrates an example memory cell implemented as an eight-transistor (8T) static random access memory (SRAM) cell for a compute-in-memory (CIM) circuit.
  • 8T eight-transistor
  • SRAM static random access memory
  • CIM compute-in-memory
  • FIG. 5 is a block diagram of an example circuit for digital CIM, in accordance with certain aspects of the present disclosure.
  • FIGS. 6A and 6B are flow diagrams illustrating example operations for in-memory computation, in accordance with certain aspects of the present disclosure.
  • FIG. 7 is a block diagram illustrating an example electronic device having a neural network configured to perform in-memory computation operations, in accordance with certain aspects of the present disclosure.
  • aspects of the present disclosure provide apparatus, methods, processing systems, and computer-readable mediums for performing computation in memory (CIM) to handle data-intensive processing, such as implementing machine learning models.
  • Some aspects provide techniques for performing digital CIM using accumulators, each accumulator accumulating output signals on a respective one of multiple bit-lines of memory after multiple activation cycles.
  • one of the word-lines may be activated.
  • the word-lines may be sequentially activated, and the accumulators may concurrently perform accumulation to provide an accumulation result after two or more of the word-lines are sequentially activated.
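  • The sequential activation and per-bit-line accumulation described above can be sketched in software. This is only an illustrative model, not the disclosed circuit: the function name `accumulate_bitlines`, the 1-bit activation per word-line, the AND-style cell output, and the unit weighting of every cycle are all assumptions made for the example.

```python
def accumulate_bitlines(weight_bits, activations):
    """Model one accumulator per bit-line over sequential word-line activations.

    weight_bits[r][c] is the bit stored on word-line r, bit-line c.
    activations[r] is the (assumed 1-bit) input applied when word-line r is active.
    """
    num_bitlines = len(weight_bits[0])
    accumulators = [0] * num_bitlines
    # Activate one word-line per cycle, in order.
    for row, activation in zip(weight_bits, activations):
        # All bit-line accumulators update concurrently within a cycle.
        for col, weight in enumerate(row):
            accumulators[col] += weight & activation
    return accumulators


# Hypothetical 3 word-line x 3 bit-line array.
weights = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
acts = [1, 0, 1]
result = accumulate_bitlines(weights, acts)
print(result)
```

  • Each entry of `result` corresponds to one bit-line and is available only after all word-lines in the sequence have been activated.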
  • CIM-based machine learning (ML)/artificial intelligence (AI) may be used for a wide variety of tasks, including image processing (e.g., still images and video), audio processing, controlling radio frequency (RF) front-ends in wireless communications, and making wireless communication decisions (e.g., to optimize, or at least increase, throughput and signal quality).
  • RF radio frequency
  • CIM may be based on various types of memory architectures, such as dynamic random-access memory (DRAM), static random-access memory (SRAM) (e.g., based on an SRAM cell as in FIG.
  • DRAM dynamic random-access memory
  • SRAM static random-access memory
  • MRAM magnetoresistive random-access memory
  • ReRAM resistive random-access memory
  • CPUs central processing units
  • DSPs digital signal processors
  • GPUs graphics processing units
  • FPGAs field-programmable gate arrays
  • NPUs neural processing units
  • NSPs neural signal processors
  • CIM may beneficially reduce the “memory wall” problem, which is where the movement of data in and out of memory consumes more power than the computation of the data.
  • significant power savings may be realized. This is particularly useful for various types of electronic devices, such as low power edge devices, mobile devices, and the like.
  • a mobile device may include a memory device configured for storing data and performing compute-in-memory operations.
  • the mobile device may be configured to perform an ML/AI operation based on data generated by the mobile device, such as image data generated by a camera sensor of the mobile device, audio data received via a microphone of the mobile device, inertial data gathered by an accelerometer or gyroscope of the mobile device, temperature data captured by a temperature sensor of the mobile device, etc., and/or combinations thereof.
  • a memory controller unit (MCU) of the mobile device may thus load weights from another on-board memory (e.g., flash or RAM) into a CIM array of the memory device and allocate input feature buffers and output (e.g., output activation) buffers.
  • the processing device may then commence processing of the data by loading, for example, a layer in the input buffer and processing the layer with weights loaded into the CIM array. This processing may be repeated for each layer of the data, and the outputs (e.g., output activations) may be stored in the output buffers and then used by the mobile device for an ML/AI task, such as intelligently controlling wireless communications, a heating and air conditioning system, a security system, or other Internet of Things (IoT) applications.
  • Neural networks are organized into layers of interconnected nodes.
  • a node or neuron is where computation happens.
  • a node may combine input data with a set of weights (or coefficients) that either amplifies or dampens the input data.
  • the amplification or dampening of the input signals may thus be considered an assignment of relative significances to various inputs with regard to a task the network is trying to learn.
  • input-weight products are summed (or accumulated), and then the sum is passed through a node’s activation function to determine whether and to what extent that signal should progress further through the network.
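  • A node's computation as described above (sum the input-weight products, then apply an activation function) can be written in a few lines. The sketch below is illustrative only; the name `node_output` and the choice of a rectified linear unit (ReLU) as the activation function are assumptions, since the text does not name a specific function.

```python
def node_output(inputs, weights):
    """Sum the input-weight products, then apply an activation function."""
    # Accumulate the products of each input with its weight.
    total = sum(x * w for x, w in zip(inputs, weights))
    # ReLU (assumed here): pass positive sums through, block the rest.
    return max(0.0, total)


# Hypothetical values: the first node passes its signal on, the second does not.
passed = node_output([1.0, 2.0], [0.5, 0.25])
blocked = node_output([1.0, 2.0], [0.5, -1.0])
print(passed, blocked)
```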
  • a neural network may have an input layer, a hidden layer, and an output layer. “Deep” neural networks generally have more than one hidden layer.
  • Deep learning is a method of training deep neural networks.
  • deep learning finds the right f to transform x into y.
  • Deep learning trains each layer of nodes based on a distinct set of features, which is the output from the previous layer.
  • features become more complex. Deep learning is thus powerful because it can progressively extract higher level features from input data and perform complex tasks, such as object recognition, by learning to represent inputs at successively higher levels of abstraction in each layer, thereby building up a useful feature representation of the input data.
  • a first layer of a deep neural network may learn to recognize relatively simple features, such as edges, in the input data.
  • the first layer of a deep neural network may learn to recognize spectral power in specific frequencies in the input data.
  • the second layer of the deep neural network may then learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data, based on the output of the first layer.
  • Higher layers may then learn to recognize complex shapes in visual data or words in auditory data.
  • Still higher layers may learn to recognize common visual objects or spoken phrases.
  • deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure.
  • Neural networks such as deep neural networks (DNNs) may be designed with a variety of connectivity patterns between layers.
  • DNNs deep neural networks
  • FIG. 1A illustrates an example of a fully connected neural network 102.
  • each node in a first layer communicates its output to every node in a second layer, so that each node in the second layer will receive input from every node in the first layer.
  • FIG. 1B illustrates an example of a locally connected neural network 104.
  • a node in a first layer may be connected to a limited number of nodes in the second layer.
  • a locally connected layer of the locally connected neural network 104 may be configured so that each node in a layer will have the same or a similar connectivity pattern, but with connection strengths (or weights) that may have different values (e.g., values associated with local areas 110, 112, 114, and 116 of the first layer nodes).
  • the locally connected connectivity pattern may give rise to spatially distinct receptive fields in a higher layer, because the higher layer nodes in a given region may receive inputs that are tuned through training to the properties of a restricted portion of the total input to the network.
  • One type of locally connected neural network is a convolutional neural network (CNN).
  • FIG. 1C illustrates an example of a convolutional neural network 106.
  • the convolutional neural network 106 may be configured such that the connection strengths associated with the inputs for each node in the second layer are shared (e.g., for local area 108 overlapping another local area of the first layer nodes). Convolutional neural networks are well suited to problems in which the spatial locations of inputs are meaningful.
  • DCN deep convolutional network
  • FIG. 1D illustrates an example of a DCN 100 designed to recognize visual features in an image 126 generated by an image-capturing device 130.
  • the DCN 100 may be trained with various supervised learning techniques to identify a traffic sign and even a number on the traffic sign.
  • the DCN 100 may likewise be trained for other tasks, such as identifying lane markings or identifying traffic lights. These are just some example tasks, and many others are possible.
  • the DCN 100 includes a feature-extraction section and a classification section.
  • a convolutional layer 132 Upon receiving the image 126, a convolutional layer 132 applies convolutional kernels (for example, as depicted and described in FIG. 2) to the image 126 to generate a first set of feature maps 118 (or intermediate activations).
  • a “kernel” or “filter” comprises a multidimensional array of weights designed to emphasize different aspects of an input data channel.
  • “kernel” and “filter” may be used interchangeably to refer to sets of weights applied in a convolutional neural network.
  • the first set of feature maps 118 may then be subsampled by a pooling layer (e.g., a max pooling layer, not shown) to generate a second set of feature maps 120.
  • the pooling layer may reduce the size of the first set of feature maps 118 while maintaining much of the information in order to improve model performance.
  • the second set of feature maps 120 may be downsampled to a 14x14 matrix from a 28x28 matrix by the pooling layer.
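  • The 28x28 to 14x14 downsampling can be reproduced with a small max pooling routine. This is a sketch under assumptions: the text mentions a max pooling layer only as an example and does not state the window size or stride, so the 2x2 window with stride 2 and the name `max_pool_2x2` are chosen here because they yield the stated dimensions.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window (stride 2)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c],
                feature_map[r][c + 1],
                feature_map[r + 1][c],
                feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 28x28 feature map filled with increasing values.
feature_map = [[r * 28 + c for c in range(28)] for r in range(28)]
pooled = max_pool_2x2(feature_map)
print(len(pooled), len(pooled[0]))
```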
  • This process may be repeated through many layers. In other words, the second set of feature maps 120 may be further convolved via one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
  • the second set of feature maps 120 is provided to a fully connected layer 124, which in turn generates an output feature vector 128.
  • Each feature of the output feature vector 128 may include a number that corresponds to a possible feature of the image 126, such as “sign,” “60,” and “100.”
  • a softmax function (not shown) may convert the numbers in the output feature vector 128 to a probability.
  • an output 122 of the DCN 100 is a probability of the image 126 including one or more features.
  • a softmax function may convert the individual elements of the output feature vector 128 into a probability in order that an output 122 of DCN 100 is one or more probabilities of the image 126 including one or more features, such as a sign with the number “60” thereon, as in image 126.
  • the probabilities in the output 122 for “sign” and “60” should be higher than the probabilities of the other elements of the output 122, such as “30,” “40,” “50,” “70,” “80,” “90,” and “100.”
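  • The conversion from output feature vector to probabilities can be illustrated with a standard softmax. The scores below are hypothetical values invented for the example, not outputs of the DCN 100; subtracting the maximum score before exponentiating is a common numerical-stability step and is not taken from the text.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)
    # Shift by the maximum so that exp() cannot overflow.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three elements of the output feature vector.
labels = ["sign", "60", "100"]
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
for label, p in zip(labels, probs):
    print(label, round(p, 3))
```

  • Softmax preserves the ordering of the scores, so the element with the largest score also receives the largest probability.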
  • the output 122 produced by the DCN 100 may be incorrect.
  • an error may be calculated between the output 122 and a target output known a priori.
  • the target output is an indication that the image 126 includes a “sign” and the number “60.”
  • the weights of the DCN 100 may then be adjusted through training so that a subsequent output 122 of the DCN 100 achieves the target output (with high probabilities).
  • a learning algorithm may compute a gradient vector for the weights.
  • the gradient vector may indicate an amount that an error would increase or decrease if a weight were adjusted in a particular way.
  • the weights may then be adjusted to reduce the error. This manner of adjusting the weights may be referred to as “backpropagation” because this adjustment process involves a “backward pass” through the layers of the DCN 100.
  • the error gradient of weights may be calculated over a small number of examples, so that the calculated gradient approximates the true error gradient.
  • This approximation method may be referred to as stochastic gradient descent. Stochastic gradient descent may be repeated until the achievable error rate of the entire system has stopped decreasing or until the error rate has reached a target level.
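  • A toy version of stochastic gradient descent makes the "small number of examples" idea concrete. The sketch below fits a single weight to noiseless data; the model (y = w * x), the squared-error loss, the learning rate, the batch size, and the name `sgd_fit_slope` are all assumptions chosen for illustration and do not describe how the DCN 100 is trained.

```python
import random


def sgd_fit_slope(data, lr=0.1, batch_size=4, steps=500, seed=0):
    """Fit w in y = w * x using gradients estimated from small random batches."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # A small batch approximates the gradient over the full data set.
        batch = rng.sample(data, batch_size)
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        # Adjust the weight in the direction that reduces the error.
        w -= lr * grad
    return w


# Hypothetical training set generated from a true slope of 3.
data = [(0.1 * i, 3.0 * 0.1 * i) for i in range(1, 21)]
w = sgd_fit_slope(data)
print(round(w, 3))
```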
  • the DCN 100 may be presented with new images, and the DCN 100 may generate inferences, such as classifications, or probabilities of various features being in the new image.
  • Convolution is generally used to extract useful features from an input data set. For example, in convolutional neural networks, such as described above, convolution enables the extraction of different features using kernels and/or filters whose weights are automatically learned during training. The extracted features are then combined to make inferences.
  • An activation function may be applied before and/or after each layer of a convolutional neural network.
  • Activation functions are generally mathematical functions that determine the output of a node of a neural network. Thus, the activation function determines whether a node should pass information or not, based on whether the node’s input is relevant to the model’s prediction.
  • both x and y may be generally considered as “activations.”
  • x may also be referred to as “preactivations” or “input activations” as x exists before the particular convolution, and y may be referred to as output activations or a feature map.
  • FIG. 2 depicts an example of a traditional convolution in which a 12-pixel x 12-pixel x 3-channel input image is convolved using a 5 x 5 x 3 convolution kernel 204 and a stride (or step size) of 1.
  • the resulting feature map 206 is 8 pixels x 8 pixels x 1 channel.
  • the traditional convolution may change the dimensionality of the input data as compared to the output data (here, from 12 x 12 to 8 x 8 pixels), including the channel dimensionality (here, from 3 channels to 1 channel).
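  • The dimensions above follow from sliding the kernel over every valid position: (12 - 5) / 1 + 1 = 8 positions per axis. The naive routine below reproduces this with an all-ones image and kernel; the function name `conv2d_single_kernel`, the absence of padding, and the test values are assumptions made for the example.

```python
def conv2d_single_kernel(image, kernel, stride=1):
    """Convolve an H x W x C image with one kh x kw x C kernel (no padding)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (height - kh) // stride + 1
    out_w = (width - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Sum of products over the kernel window and all channels.
            for di in range(kh):
                for dj in range(kw):
                    for c in range(channels):
                        acc += image[i * stride + di][j * stride + dj][c] * kernel[di][dj][c]
            row.append(acc)
        output.append(row)
    return output


# 12 x 12 x 3 input and 5 x 5 x 3 kernel, both filled with ones.
image = [[[1] * 3 for _ in range(12)] for _ in range(12)]
kernel = [[[1] * 3 for _ in range(5)] for _ in range(5)]
feature_map = conv2d_single_kernel(image, kernel)
print(len(feature_map), len(feature_map[0]))
```

  • Because a single kernel spans all three input channels, the three channels collapse into one output channel.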
  • a spatial separable convolution such as depicted in FIG. 2 may be factorized into two components: (1) a depthwise convolution, where each spatial channel is convolved independently by a depthwise convolution (e.g., a spatial fusion); and (2) a pointwise convolution, where all the spatial channels are linearly combined (e.g., a channel fusion).
  • a depthwise separable convolution is depicted in FIGS. 3A and 3B.
  • a network learns features from the spatial planes, and during channel fusion, the network learns relations between these features across channels.
  • a depthwise separable convolution may be implemented using 5x5 kernels for spatial fusion, and 1x1 kernels for channel fusion.
  • the channel fusion may use a 1 x 1 x d kernel that iterates through every single point in an input image of depth d, where the depth d of the kernel generally matches the number of channels of the input image.
  • Channel fusion via pointwise convolution is useful for dimensionality reduction for efficient computations.
  • Applying 1 x 1 x d kernels and adding an activation layer after the kernel may give a network added depth, which may increase the network’s performance.
  • the 12-pixel x 12-pixel x 3-channel input image 302 is convolved with a filter comprising three separate kernels 304A-C, each having a 5 x 5 x 1 dimensionality, to generate a feature map 306 of 8 pixels x 8 pixels x 3 channels, where each channel is generated by an individual kernel among kernels 304A-C.
  • feature map 306 is further convolved using a pointwise convolution operation with a kernel 308 having dimensionality 1 x 1 x 3 to generate a feature map 310 of 8 pixels x 8 pixels x 1 channel.
  • feature map 310 has reduced dimensionality (1 channel versus 3 channels), which allows for more efficient computations therewith.
  • multiple (e.g., m) pointwise convolution kernels 308 can be used to increase the channel dimensionality of the convolution output.
  • m = 256 1 x 1 x 3 kernels 308 can be generated, in which each output is an 8-pixel x 8-pixel x 1-channel feature map (e.g., feature map 310), and these feature maps can be stacked to get a resulting feature map of 8 pixels x 8 pixels x 256 channels.
  • the resulting increase in channel dimensionality provides more parameters for training, which may improve a convolutional neural network’s ability to identify features (e.g., in input image 302).
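  • The shapes quoted for the depthwise and pointwise stages can be checked with simple bookkeeping. The helper below is a sketch; its name `depthwise_separable_shapes` and the assumptions of stride 1 and no padding are not from the text, although they are the settings that reproduce the stated 12 to 8 reduction.

```python
def depthwise_separable_shapes(height, width, channels, kernel_size, num_pointwise):
    """Return the output shapes of the depthwise and pointwise stages.

    Assumes stride 1 and no padding.
    """
    out_h = height - kernel_size + 1
    out_w = width - kernel_size + 1
    # Depthwise stage: one kernel_size x kernel_size x 1 kernel per input channel.
    depthwise_shape = (out_h, out_w, channels)
    # Pointwise stage: each 1 x 1 x channels kernel yields one output channel.
    pointwise_shape = (out_h, out_w, num_pointwise)
    return depthwise_shape, pointwise_shape


single = depthwise_separable_shapes(12, 12, 3, 5, 1)
stacked = depthwise_separable_shapes(12, 12, 3, 5, 256)
print(single, stacked)
```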
  • FIG. 4 illustrates an example memory cell 400 of a static random access memory (SRAM), which may be implemented in a CIM array.
  • the memory cell 400 may be referred to as an 8-transistor (8T) SRAM cell as the memory cell 400 is implemented with eight transistors.
  • 8T 8-transistor
  • the memory cell 400 may include a flip-flop, which may be implemented as a cross-coupled invertor pair 424 having an output 414 and an output 416.
  • the cross-coupled invertor pair output 414 is selectively coupled to a write bit-line (WBL) 406 via a pass-gate transistor 402
  • the cross-coupled invertor pair output 416 is selectively coupled to a complementary write bit-line (WBLB) 420 via a pass-gate transistor 418.
  • WBL 406 and WBLB 420 are configured to provide complementary digital signals to be written (e.g., stored) in the cross-coupled invertor pair 424.
  • the WBL and WBLB may be used to store a bit for a neural network weight in the memory cell 400.
  • the gates of pass-gate transistors 402, 418 may be coupled to a write word-line (WWL) 404, as shown.
  • WWL write word-line
  • a digital signal to be written may be provided to the WBL (and a complement of the digital signal is provided to the WBLB).
  • the pass-gate transistors 402, 418 (which are implemented here as n-type field-effect transistors (NFETs)) are then turned on by providing a logic high signal to WWL 404, resulting in the digital signal being stored in the cross-coupled invertor pair 424.
  • the cross-coupled invertor pair output 414 may be coupled to a gate of a transistor 410.
  • the source of the transistor 410 may be coupled to a reference potential node (VSS or electrical ground), and the drain of the transistor 410 may be coupled to a source of a transistor 412.
  • the drain of the transistor 412 may be coupled to a read bit-line (RBL) 422, as shown.
  • the gate of transistor 412 may be controlled via a read word-line (RWL) 408.
  • the RWL 408 may be controlled via an activation input signal.
  • the RBL 422 may be precharged to logic high. If both the activation input and the weight bit stored at the cross-coupled invertor pair output 414 are logic high, then transistors 410, 412 are both turned on, electrically coupling the RBL 422 to VSS at the source of transistor 410 and discharging the RBL 422 to logic low. If either the activation input or the weight stored at the cross-coupled invertor pair output 414 is logic low, then at least one of transistors 410, 412 will be turned off, such that the RBL 422 remains logic high. Thus, the output of the memory cell 400 at RBL 422 is logic low only when both the weight bit and activation input are logic high, and is logic high otherwise, effectively implementing a NAND-gate operation.
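  • the read behavior described above can be modeled as a small behavioral sketch. This is a logic-level illustration only, not a circuit simulation; the function name `read_bit_line` and the 0/1 encoding of logic levels are assumptions made for the example.

```python
def read_bit_line(weight_bit: int, activation: int) -> int:
    """Return the RBL level (1 = logic high, 0 = logic low) after a read.

    The RBL starts precharged high. It is pulled to VSS only when both
    series transistors conduct: transistor 410 (gate driven by the stored
    weight bit) and transistor 412 (gate driven by the RWL activation).
    """
    rbl = 1  # precharged to logic high
    transistor_410_on = weight_bit == 1
    transistor_412_on = activation == 1
    if transistor_410_on and transistor_412_on:
        rbl = 0  # both transistors conduct, so the RBL discharges to VSS
    return rbl


# Enumerate all (weight bit, activation) combinations.
truth_table = {(w, x): read_bit_line(w, x) for w in (0, 1) for x in (0, 1)}
print(truth_table)
```

  • the resulting table is the NAND truth table: the output is logic low only for the (1, 1) entry.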
  • FIG. 5 illustrates a circuit 500 for CIM, in accordance with certain aspects of the present disclosure.
  • the circuit 500 includes a CIM array having N word-lines 504-1 to 504-N (also referred to herein as “rows”) and M bit-lines 506-1 to 506-M (also referred to herein as “columns”), N and M each being any integer greater than 1. N and M may be the same or different.
  • Bit-lines 506-1 to 506-M (collectively referred to herein as “BLs 506”) are labeled BL1 to BLM in FIG. 5.
  • word-lines 504-1 to 504-N (collectively referred to herein as “WLs 504”) are labeled WL1 to WLN in FIG. 5.
  • Each of the BLs 506 may correspond to the RBL in the memory cell 400 of FIG. 4, and each of the WLs 504 may correspond to the RWL in the memory cell 400 of FIG. 4.
  • memory cells 502-1,1 to 502-N,M are implemented at the intersections of the WLs 504 and BLs 506.
  • the first integer after the dash (here, 2) indicates the word-line, and the second integer after the dash (here, 1) indicates the bit-line, of the intersection where the memory cell is located.
  • Each of the memory cells 502 may be implemented using the memory cell architecture described with respect to FIG. 4. As shown, activation inputs X1 to XN may be provided to respective word-lines 504, and the memory cells 502 may store neural network weights W1 to WN, where each weight has M bits (e.g., W1,1 to W1,M, W2,1 to W2,M, and WN,1 to WN,M). For example, memory cells 502-1,1 to 502-1,M may store M bits for weight W1 (e.g., weight bits W1,1 to W1,M), memory cells 502-2,1 to 502-2,M may store M bits for weight W2 (e.g., weight bits W2,1 to W2,M), and so on.
  • the weights may be written to the memory cells 502 via write bit-lines (e.g., WBL 406 and WBLB 420), which are not shown in FIG. 5.
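  • the mapping of weights to rows of memory cells can be sketched as follows. The helper names `weight_to_row` and `store_weights` are hypothetical, and the sketch assumes unsigned weights with the most significant bit in the first column; the disclosure does not specify a bit order or number format.

```python
def weight_to_row(weight: int, m_bits: int) -> list[int]:
    """Split an unsigned weight into m_bits bits, most significant first.

    Assumption: column 1 holds the most significant bit; the source does
    not specify a bit order or a number format.
    """
    if not 0 <= weight < (1 << m_bits):
        raise ValueError("weight does not fit in m_bits")
    return [(weight >> (m_bits - 1 - j)) & 1 for j in range(m_bits)]


def store_weights(weights: list[int], m_bits: int) -> list[list[int]]:
    """Build the N x M cell array: row n holds the M bits of weight n."""
    return [weight_to_row(w, m_bits) for w in weights]


# Illustrative values: N = 3 weights of M = 3 bits each.
cells = store_weights([5, 3, 6], m_bits=3)
for row in cells:
    print(row)
```

  • each inner list models the cells on one word-line, and each list position models one bit-line.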
  • each memory cell 502 may multiply the received activation bit with the stored weight bit (e.g., may perform a logical NAND operation with the activation bit and the stored weight bit as inputs, as described with respect to FIG. 4).
  • sense amplifiers (SAs) 508-1 to 508-M may be used to sense the signal on a respective one of the bit-lines 506 (e.g., a digital signal from the NAND processing of the memory cell).
  • the SAs 508 may perform concurrent sensing of the bit-lines 506.
  • Each of the sensed signals for a respective BL may be provided to a respective one of accumulators 510-1 to 510-M (collectively referred to herein as “accumulators 510”).
  • the accumulators 510 concurrently perform accumulation of the signals sensed by the SAs 508, on a bit-line basis.
  • the activation inputs X1 to XN may be applied one row (word-line) at a time (e.g., one row each computation cycle).
  • the activation input X1 may be provided to word-line 504-1 during a first computation cycle, and the computation (e.g., multiplication, such as a NAND operation as described above) for activation input X1 and weight W1 may be performed via the memory cells on word-line 504-1 storing weight bits W1,1 to W1,M.
  • the signals (e.g., digital signals) on BLs 506 after the first computation cycle may be sensed (concurrently) via respective SAs 508 and provided to respective accumulators 510.
  • the same operation may be performed for each of activation inputs X2 to XN, one word-line at a time (and in order starting from X2 and ending with XN), during subsequent computation cycles.
  • the accumulators 510 accumulate additional signals on corresponding BLs 506 after each computation cycle. After the computation cycles are complete, each of the accumulators 510 provides an accumulation result for a respective one of the BLs 506. The accumulation result from each accumulator indicates the accumulation of the signals on the respective BL, each of the signals being generated after one of the computation cycles.
  • the number of computation cycles corresponds to the number of activation inputs and is an indication of the amount of time it takes to receive the accumulation results for the activation inputs X1 to XN, as the computation cycles occur sequentially.
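  • the row-by-row computation and per-bit-line accumulation described above can be sketched behaviorally. The names `nand` and `run_cim` are hypothetical. The sketch also assumes that each accumulator counts the cycles in which its bit-line is sensed as logic low (that is, cycles in which the weight bit and activation input are both logic high); the disclosure states that sensed signals are accumulated but does not spell out this encoding.

```python
def nand(weight_bit: int, activation: int) -> int:
    """Cell output on the bit-line: low only if both inputs are high."""
    return 0 if (weight_bit == 1 and activation == 1) else 1


def run_cim(cells, activations):
    """Activate one word-line per cycle and accumulate per bit-line.

    cells: N x M list of stored weight bits (rows are word-lines).
    activations: N activation bits, one per word-line.
    Returns (accumulators, cycles).
    """
    accumulators = [0] * len(cells[0])
    cycles = 0
    for row, x in zip(cells, activations):
        # All M bit-lines are sensed concurrently for the active word-line.
        sensed = [nand(w, x) for w in row]
        for j, level in enumerate(sensed):
            # Assumption: a sensed logic low counts as a product of 1.
            accumulators[j] += 1 - level
        cycles += 1
    return accumulators, cycles


# Illustrative values: N = 3 word-lines, M = 3 bit-lines.
cells = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
activations = [1, 0, 1]
result, cycles = run_cim(cells, activations)
print(result, cycles)
```

  • in this sketch the cycle count always equals the number of word-lines, matching the sequential timing described above.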
  • a computation cycle may be skipped if an activation input associated with the computation cycle is logic low, in effect speeding up the CIM process.
  • activation inputs X1 to XN may be provided to respective word-lines during respective computation cycles 1-N. If activation input X2 is logic low, computation cycle 2 may be skipped, reducing the total amount of time it takes for computation by the duration of one computation cycle.
  • the computation using memory cells 502-2,1 to 502-2,M may be skipped, and/or the accumulators 510 may skip accumulation of the output signals of memory cells 502-2,1 to 502-2,M based on the activation input X2 being logic low.
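  • the cycle skipping described above can be sketched as a variant of the row-by-row loop. The name `run_cim_with_skipping` is hypothetical, and the sketch again assumes that each accumulator counts bit-line discharges. Under that assumption a logic-low activation never discharges a bit-line, so skipping its cycle leaves the accumulation results unchanged.

```python
def run_cim_with_skipping(cells, activations):
    """Accumulate per bit-line, skipping cycles whose activation is low.

    cells: N x M list of stored weight bits (rows are word-lines).
    activations: N activation bits, one per word-line.
    Returns (accumulators, cycles_used).
    """
    accumulators = [0] * len(cells[0])
    cycles_used = 0
    for row, x in zip(cells, activations):
        if x == 0:
            # Skipped: the precharged bit-lines would stay high anyway.
            continue
        cycles_used += 1
        for j, w in enumerate(row):
            # With the activation high, the bit-line discharges iff w is 1.
            rbl = 0 if w == 1 else 1
            accumulators[j] += 1 - rbl  # assumption: count discharges
    return accumulators, cycles_used


# Illustrative values: the second activation input is logic low.
cells = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
activations = [1, 0, 1]
result, cycles_used = run_cim_with_skipping(cells, activations)
print(result, cycles_used)
```

  • for these values only two of the three computation cycles are used, which corresponds to the time saving of one computation cycle described above.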
  • the present disclosure provides a digital CIM array that offers a more accurate multiply-accumulate (MAC) operation as compared to certain implementations using analog CIM.
  • FIG. 6A is a flow diagram illustrating example operations 600 for in-memory computation, in accordance with certain aspects of the present disclosure.
  • the operations 600 may be performed by a circuit for CIM (e.g., digital CIM), such as the circuit 500 described with respect to FIG. 5.
  • the operations 600 begin at block 605 with the circuit storing, in a plurality of memory cells (e.g., memory cells 502) on each of multiple bit-lines (e.g., bit-lines 506) of a memory, multiple bits representing weights of a neural network.
  • the bits representing weights may be stored using other bit-lines (e.g., write bit-lines, such as WBL 406 and WBLB 420) of the memory.
  • the plurality of memory cells on each of the multiple bit-lines are on different word-lines (e.g., word-lines 504) of the memory, as shown in FIG. 5.
  • the circuit multiplies, via each of the plurality of memory cells, a bit of one of the weights with an activation input provided to a respective one of the word-lines.
  • the circuit accumulates, via each accumulator of a plurality of accumulators (e.g., accumulators 510), output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
  • the output signals may include digital signals generated by the plurality of memory cells on the respective one of the multiple bit-lines (e.g., due to NAND logic of the memory cells).
  • the circuit activates the word-lines of the memory, one word-line at a time. In this case, the activation includes multiplying one of the weights stored in the memory cells on the one word-line with an activation input provided to the one word-line.
  • the circuit senses, via each sense amplifier of a plurality of sense amplifiers (e.g., SAs 508), the respective one of the multiple bit-lines.
  • the output signals of the plurality of memory cells may be accumulated based on the sensing of the respective one of the multiple bit-lines.
  • the multiple bit-lines are sensed concurrently via the plurality of sense amplifiers, as described herein.
  • the circuit selects the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more of the word-lines being logic high. For example, the circuit may skip accumulating at least one output signal of at least one other memory cell of the plurality of memory cells based on the at least one other memory cell receiving an activation input that is logic low.
  • the output signals are accumulated, via a respective one of the plurality of accumulators, after multiple activation cycles. During each of the multiple activation cycles, a respective activation input that is logic high is provided to a respective one of the word-lines.
  • each of the plurality of memory cells includes a pass-gate transistor (e.g., pass-gate transistor 418), a flip-flop (e.g., comprising the cross-coupled invertor pair 424) coupled to the pass-gate transistor, a first transistor (e.g., transistor 410) having a gate coupled to an output of the flip-flop, and a second transistor (e.g., transistor 412) coupled between the first transistor and the respective one of the multiple bit-lines (e.g., RBL 422 shown in FIG. 4).
  • the first transistor may include a source coupled to a reference potential node (e.g., electric ground) and a drain coupled to a source of the second transistor, a drain of the second transistor being coupled to the respective one of the multiple bit-lines.
  • a gate of the second transistor may be coupled to a respective one of the word-lines (e.g., RWL 408 shown in FIG. 4).
  • FIG. 6B is a flow diagram illustrating example operations 650 for in-memory computation, in accordance with certain aspects of the present disclosure.
  • the operations 650 may be performed by a circuit for CIM (e.g., digital CIM), such as the circuit 500 described with respect to FIG. 5. Many of the operations 650 may be similar to the operations 600 described above and are not repeated below.
  • the operations 650 begin at block 655 with the circuit performing computations, in at least a portion of an array of compute-in-memory cells (e.g., memory cells 502), on a weight and an activation input for a neural network.
  • Each compute-in-memory cell may be coupled to one of multiple bit-lines (e.g., bit-lines 506) and to one of multiple word-lines (e.g., word-lines 504) and may be configured to store a bit of the weight for the neural network.
  • the circuit accumulates, via each accumulator of a plurality of accumulators (e.g., accumulators 510), output signals from two or more of the compute-in-memory cells coupled to a respective one of the multiple bit-lines.
  • the output signals may include digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines (e.g., due to NAND logic of the memory cells).
  • the operations 650 may further include the circuit sequentially activating two or more of the word-lines.
  • the accumulating at block 660 may occur after the sequentially activating.
  • the sequentially activating may involve applying the activation input to each of the two or more word-lines, one word-line at a time.
  • FIG. 7 illustrates an example electronic device 700.
  • the electronic device 700 may be configured to perform the methods described herein, including the operations 600 and 650 described with respect to FIGS. 6A and 6B.
  • the electronic device 700 includes a central processing unit (CPU) 702, which in some aspects may be a multi-core CPU. Instructions executed at the CPU 702 may be loaded, for example, from a program memory associated with the CPU 702 or may be loaded from a memory 724.
  • the electronic device 700 also includes additional processing blocks tailored to specific functions, such as a graphics processing unit (GPU) 704, a digital signal processor (DSP) 706, a neural processing unit (NPU) 708, a multimedia processing block 710, and a wireless connectivity processing block 712.
  • the NPU 708 is implemented in one or more of the CPU 702, GPU 704, and/or DSP 706.
  • the wireless connectivity processing block 712 may include components, for example, for Third-Generation (3G) connectivity, Fourth-Generation (4G) connectivity (e.g., 4G LTE), Fifth-Generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and/or wireless data transmission standards.
  • the wireless connectivity processing block 712 is further connected to one or more antennas 714 to facilitate wireless communication.
  • the electronic device 700 may also include one or more sensor processors 716 associated with any manner of sensor, one or more image signal processors (ISPs) 718 associated with any manner of image sensor, and/or a navigation processor 720, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
  • the electronic device 700 may also include one or more input and/or output devices 722, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.
  • one or more of the processors of the electronic device 700 may be based on an ARM instruction set.
  • the electronic device 700 also includes memory 724, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory (DRAM), a flash-based static memory, and the like.
  • memory 724 includes computer-executable components, which may be executed by one or more of the aforementioned processors of the electronic device 700 and/or a CIM controller 732 (also referred to as control circuitry).
  • the electronic device 700 includes a CIM circuit 726, such as the circuit 500, as described herein.
  • the CIM circuit 726 may be controlled via the CIM controller 732.
  • memory 724 may include code 724A for storing (e.g., storing weights in memory cells) and code 724B for computing (e.g., performing a neural network computation by applying activation inputs).
  • the CIM controller 732 may include a circuit 728A for storing (e.g., storing weights in memory cells), and a circuit 728B for computing (e.g., performing a neural network computation by applying activation inputs).
  • the depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
  • in some aspects, such as where the electronic device 700 is a server device, various aspects may be omitted from the example depicted in FIG. 7, such as one or more of the multimedia processing block 710, wireless connectivity processing block 712, antenna 714, sensor processors 716, ISPs 718, or navigation processor 720.
  • Clause 1 A circuit comprising: multiple bit-lines; multiple word-lines; an array of compute-in-memory cells, wherein each compute-in-memory cell is coupled to one of the bit-lines and to one of the word-lines and is configured to store a weight bit of a neural network; and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
  • Clause 2 The circuit of Clause 1, further comprising a plurality of sense amplifiers, each sense amplifier having an output coupled to a respective one of the accumulators and having an input coupled to the respective one of the multiple bit-lines.
  • Clause 3 The circuit of Clause 2, wherein the plurality of sense amplifiers are configured to concurrently sense the multiple bit-lines.
  • Clause 4 The circuit of any of Clauses 1-3, wherein the compute-in-memory cells coupled to the multiple bit-lines and to one of the multiple word-lines are configured to perform concurrent computations.
  • Clause 5 The circuit of any of Clauses 1-4, wherein the multiple word-lines are configured to be activated one word-line at a time.
  • Clause 6 The circuit of any of Clauses 1-5, wherein: two or more of the word-lines are configured to be sequentially activated; and each of the plurality of accumulators is configured to accumulate output signals from the compute-in-memory cells coupled to the respective one of the multiple bit-lines after the two or more of the word-lines are sequentially activated.
  • Clause 7 The circuit of Clause 6, wherein the output signals comprise digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines.
  • Clause 8 The circuit of Clause 6 or 7, further comprising control circuitry configured to select the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more of the word-lines being logic high.
  • Clause 9 each of the plurality of accumulators is configured to perform accumulation of output signals from the compute-in-memory cells coupled to the respective one of the multiple bit-lines, and wherein, in performing the accumulation, each of the plurality of accumulators is configured to: accumulate output signals from two or more of the compute-in-memory cells; and skip accumulation of at least one output signal from at least one other compute-in-memory cell coupled to the respective one of the multiple bit-lines, based on the at least one other compute-in-memory cell receiving an activation input that is logic low.
  • each compute-in-memory cell is configured to multiply the stored weight bit with an activation input provided to a respective one of the multiple word-lines.
  • Clause 11 The circuit of any of Clauses 1-5, wherein: the compute-in-memory cells coupled to each of the multiple word-lines are configured to be sequentially activated based on a plurality of activation inputs applied to the multiple word-lines; and a respective one of the plurality of accumulators is configured to accumulate output signals from the compute-in-memory cells after the compute-in-memory cells coupled to each of the multiple word-lines are sequentially activated.
  • Clause 12 The circuit of Clause 11, wherein: the respective one of the plurality of accumulators is configured to accumulate the output signals after multiple activation cycles; and during each of the multiple activation cycles, a respective one of the activation inputs that is logic high is provided to a respective one of the multiple word-lines.
  • each compute-in-memory cell comprises: a pass-gate transistor; a cross-coupled invertor pair having an output coupled to the pass-gate transistor; a first transistor having a gate coupled to the output of the cross-coupled invertor pair; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
  • Clause 14 The circuit of Clause 13, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
  • Clause 15 The circuit of Clause 13 or 14, wherein a gate of the second transistor is coupled to a respective one of the multiple word-lines.
  • Clause 16 The circuit of any of Clauses 1-15, wherein at least one of the compute-in-memory cells comprises an eight-transistor (8T) static random access memory (SRAM) cell.
  • Clause 17 A method comprising: performing computations, in at least a portion of an array of compute-in-memory cells, on a weight and an activation input for a neural network, each compute-in-memory cell being coupled to one of multiple bit-lines and to one of multiple word-lines and being configured to store a bit of the weight for the neural network; and accumulating, via each accumulator of a plurality of accumulators, output signals from two or more of the compute-in-memory cells coupled to a respective one of the multiple bit-lines.
  • Clause 18 The method of Clause 17, further comprising sensing, via each sense amplifier of a plurality of sense amplifiers, the respective one of the multiple bit-lines, wherein the output signals from the two or more of the compute-in-memory cells are accumulated based on the sensing of the respective one of the multiple bit-lines.
  • Clause 19 The method of Clause 18, wherein the sensing comprises concurrently sensing the multiple bit-lines via the plurality of sense amplifiers.
  • Clause 20 The method of any of Clauses 17-19, wherein the output signals comprise digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines.
  • Clause 21 The method of any of Clauses 17-20, further comprising sequentially activating two or more of the word-lines, wherein the accumulating occurs after the sequentially activating and wherein the sequentially activating comprises applying the activation input to each of the two or more word-lines, one word-line at a time.
  • Clause 22 The method of Clause 21, further comprising selecting the two or more of the word-lines that are sequentially activated based on the activation input applied to each of the two or more of the word-lines being logic high.
  • Clause 23 The method of any of Clauses 17-22, further comprising skipping accumulating of at least one output signal from at least one other compute-in-memory cell in the array of compute-in-memory cells based on the at least one other compute-in-memory cell receiving the activation input, which is logic low.
  • Clause 24 The method of any of Clauses 17-23, wherein performing the computations comprises multiplying, via each of the compute-in-memory cells coupled to a respective one of the multiple word-lines in the at least the portion of the array, the bits of the weight with the activation input provided to the respective one of the multiple word-lines.
  • Clause 25 The method of any of Clauses 17-24, wherein: the output signals are accumulated, via a respective one of the plurality of accumulators, after multiple activation cycles; and during each of the multiple activation cycles, a respective activation input that is logic high is provided to a respective one of the multiple word-lines.
  • each compute-in-memory cell comprises: a pass-gate transistor; a flip-flop having an input coupled to the pass-gate transistor; a first transistor having a gate coupled to an output of the flip-flop; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
  • Clause 27 The method of Clause 26, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
  • Clause 28 The method of Clause 26 or 27, wherein a gate of the second transistor is coupled to a respective one of the multiple word-lines.
  • Clause 29 A circuit for in-memory computation comprising: a plurality of memory cells on each of multiple bit-lines of a memory, the plurality of memory cells being configured to store multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the memory; and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
  • Clause 30 The circuit of Clause 29, further comprising a plurality of sense amplifiers, each sense amplifier having an output coupled to a respective accumulator and having an input coupled to the respective one of the multiple bit-lines.
  • Clause 31 The circuit of Clause 30, wherein the plurality of sense amplifiers are configured to concurrently sense the multiple bit-lines.
  • Clause 32 The circuit of any of Clauses 29-31, wherein the memory cells on the multiple bit-lines and on one of the word-lines are configured to perform concurrent computations.
  • Clause 33 The circuit of any of Clauses 29-32, wherein the word-lines of the memory are configured to be activated one word-line at a time.
  • Clause 34 The circuit of any of Clauses 29-33, wherein: two or more of the word-lines are configured to be sequentially activated; and each of the plurality of accumulators is configured to accumulate output signals of the plurality of memory cells on the respective one of the multiple bit-lines after the two or more of the word-lines are sequentially activated.
  • Clause 35 The circuit of Clause 34, wherein the output signals comprise digital signals generated by the plurality of memory cells on the respective one of the multiple bit-lines.
  • Clause 36 The circuit of Clause 34 or 35, further comprising control circuitry configured to select the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more word-lines being logic high.
  • each of the plurality of accumulators is configured to perform accumulation of memory cell output signals for the respective one of the multiple bit-lines, and wherein, in performing the accumulation, each of the plurality of accumulators is configured to: accumulate output signals of two or more of the plurality of memory cells; and skip accumulating of at least one output signal of at least one other memory cell of the plurality of memory cells based on the at least one other memory cell receiving an activation input that is logic low.
  • Clause 38 The circuit of any of Clauses 29-37, wherein each of the plurality of memory cells is configured to multiply a bit of one of the weights with an activation input provided to a respective one of the word-lines.
  • Clause 39 The circuit of any of Clauses 29-38, wherein: the plurality of memory cells are configured to be sequentially activated based on a plurality of activation inputs applied to the word-lines; and a respective one of the plurality of accumulators is configured to accumulate output signals from the plurality of memory cells after the plurality of memory cells are sequentially activated.
  • Clause 40 The circuit of Clause 39, wherein: the respective one of the plurality of accumulators is configured to accumulate the output signals after multiple activation cycles; and during each of the multiple activation cycles, a respective one of the activation inputs that is logic high is provided to a respective one of the word-lines.
  • each of the plurality of memory cells comprises: a pass-gate transistor; a cross-coupled invertor pair having an output coupled to the pass-gate transistor; a first transistor having a gate coupled to the output of the cross-coupled invertor pair; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
  • Clause 42 The circuit of Clause 41, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
  • Clause 43 The circuit of Clause 41 or 42, wherein a gate of the second transistor is coupled to a respective one of the word-lines.
  • Clause 44 The circuit of any of Clauses 29-43, wherein at least one of the memory cells comprises an eight-transistor (8T) static random access memory (SRAM) cell.
  • Clause 45 A method for in-memory computation comprising: storing, in a plurality of memory cells on each of multiple bit-lines of a memory, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the memory; and accumulating, via each accumulator of a plurality of accumulators, output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
  • Clause 46 The method of Clause 45, further comprising sensing, via each sense amplifier of a plurality of sense amplifiers, the respective one of the multiple bit-lines, wherein the output signals of the plurality of memory cells are accumulated based on the sensing of the respective one of the multiple bit-lines.
  • Clause 47 The method of Clause 46, wherein the multiple bit-lines are sensed concurrently via the plurality of sense amplifiers.
  • Clause 48 The method of any of Clauses 45-47, further comprising activating the word-lines of the memory, one word-line at a time, wherein the activating comprises multiplying one of the weights stored in the memory cells on the one word-line with an activation input provided to the one word-line.
  • Clause 49 The method of any of Clauses 45-48, wherein the output signals comprise digital signals generated by the plurality of memory cells on the respective one of the multiple bit-lines.
  • Clause 50 The method of any of Clauses 45-49, further comprising selecting the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more of the word-lines being logic high.
  • Clause 51 The method of any of Clauses 45-50, further comprising skipping accumulating of at least one output signal of at least one other memory cell of the plurality of memory cells based on the at least one other memory cell receiving an activation input that is logic low.
  • Clause 52 The method of any of Clauses 45-51, further comprising multiplying, via each of the plurality of memory cells, a bit of one of the weights with an activation input provided to a respective one of the word-lines.
  • Clause 53 The method of any of Clauses 45-52, wherein: the output signals are accumulated, via a respective one of the plurality of accumulators, after multiple activation cycles; and during each of the multiple activation cycles, a respective activation input that is logic high is provided to a respective one of the word-lines.
  • each of the plurality of memory cells comprises: a pass-gate transistor; a flip-flop having an input coupled to the pass-gate transistor; a first transistor having a gate coupled to an output of the flip-flop; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
  • Clause 55 The method of Clause 54, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
  • Clause 56 The method of Clause 54 or 55, wherein a gate of the second transistor is coupled to a respective one of the word-lines.
  • An apparatus for in-memory computation comprising: means for storing, in a plurality of memory cells on each of multiple bit-lines of the means for storing, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the means for storing; and means for accumulating output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
  • Clause 58 The apparatus of Clause 57, further comprising means for sensing the respective one of the multiple bit-lines, wherein the output signals of the plurality of memory cells are accumulated based on the sensing of the respective one of the multiple bit-lines.
  • an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein.
  • the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
  • exemplary means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
  • a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members.
  • “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).
  • determining encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
  • the methods disclosed herein comprise one or more steps or actions for achieving the methods.
  • the method steps and/or actions may be interchanged with one another without departing from the scope of the claims.
  • the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
  • the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions.
  • the means may include various hardware and/or software component(s) and/or module(s), including, but not limited to a circuit, an application specific integrated circuit (ASIC), or processor.
  • means for storing may include: (1) a CIM array, such as the array of memory cells 502, or (2) a CIM controller, such as the CIM controller 732 including a circuit 728A for storing, and memory such as memory 724 including code 724A for storing.
  • Means for accumulating may include an accumulator such as the accumulators 510.
  • Means for sensing may include a sense amplifier (SA), such as the SAs 508.

Abstract

Certain aspects generally relate to performing machine learning tasks, and in particular, to computation-in-memory architectures and operations. One aspect provides a circuit for in-memory computation. The circuit generally includes multiple bit-lines, multiple word-lines, an array of compute-in-memory cells, and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines. Each compute-in-memory cell is coupled to one of the bit-lines and to one of the word-lines and is configured to store a weight bit of a neural network.

Description

DIGITAL COMPUTE IN MEMORY
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to U.S. Patent Application No. 17/816,285, filed July 29, 2022, which claims the benefit of and priority to U.S. Provisional Patent Application No. 63/228,523, filed August 2, 2021, the entire contents of each of which are incorporated herein by reference in their entirety.
INTRODUCTION
[0002] Aspects of the present disclosure relate to performing machine learning tasks, and in particular, to computation-in-memory architectures.
[0003] Machine learning is generally the process of producing a trained model (e.g., an artificial neural network, a tree, or other structures), which represents a generalized fit to a set of training data that is known a priori. Applying the trained model to new data produces inferences, which may be used to gain insights into the new data. In some cases, applying the model to the new data is described as “running an inference” on the new data.
[0004] As the use of machine learning has proliferated for enabling various machine learning (or artificial intelligence) tasks, the need for more efficient processing of machine learning model data has arisen. In some cases, dedicated hardware, such as machine learning accelerators, may be used to enhance a processing system’s capacity to process machine learning model data. However, such hardware demands space and power, which is not always available on the processing device. For example, “edge processing” devices, such as mobile devices, always-on devices, Internet of Things (IoT) devices, and the like, typically have to balance processing capabilities with power and packaging constraints. Further, accelerators may move data across common data busses, which can cause significant power usage and introduce latency into other processes sharing the data bus. Consequently, other aspects of a processing system are being considered for processing machine learning model data.
[0005] Memory devices are one example of another aspect of a processing system that may be leveraged for performing processing of machine learning model data through so-called computation-in-memory (CIM) processes. CIM arrays were developed to implement a node of a neural network framework without data transfer bottlenecks. A data transfer bottleneck is avoided by storing weight data within each cell of a CIM array and also performing a multiply operation within each cell. Neural networks are a form of artificial intelligence relied on for a high level of accuracy, so the CIM array may be expected to generate accurate results. Conventional CIM processes perform computation using analog signals, which may be more susceptible to inaccuracy in computation results, adversely impacting neural network computations. Accordingly, systems and methods are needed for performing computation-in-memory with increased accuracy. Additional design goals for CIM may include flexibility and scalability.
BRIEF SUMMARY
[0006] Certain aspects provide apparatus and techniques for performing machine learning tasks, and in particular, computation-in-memory architectures.
[0007] One aspect provides a circuit for in-memory computation. The circuit generally includes a plurality of memory cells on each of multiple bit-lines of a memory, the plurality of memory cells being configured to store multiple bits representing weights of a neural network. The plurality of memory cells on each of the multiple bit-lines may be on different word-lines of the memory. The circuit also includes a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
[0008] Another aspect provides a method for in-memory computation. The method generally includes: storing, in a plurality of memory cells on each of multiple bit-lines of a memory, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the memory; and accumulating, via each accumulator of a plurality of accumulators, output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
[0009] Yet another aspect provides an apparatus for in-memory computation. The apparatus generally includes: means for storing, in a plurality of memory cells on each of multiple bit-lines of the means for storing, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the means for storing; and means for accumulating output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
[0010] Yet another aspect provides a circuit for in-memory computation. The circuit generally includes multiple bit-lines; multiple word-lines; an array of compute-in-memory cells, wherein each compute-in-memory cell is coupled to one of the bit-lines and to one of the word-lines and is configured to store a weight bit of a neural network; and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
[0011] Yet another aspect provides a method for in-memory computation. The method generally includes: performing computations, in at least a portion of an array of compute-in-memory cells, on a weight and an activation input for a neural network, each compute-in-memory cell being coupled to one of multiple bit-lines and to one of multiple word-lines and being configured to store a bit of the weight for the neural network; and accumulating, via each accumulator of a plurality of accumulators, output signals from two or more of the compute-in-memory cells coupled to a respective one of the multiple bit-lines.
[0012] Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.
[0013] The following description and the related drawings set forth in detail certain illustrative features of one or more aspects.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] So that the manner in which the above-recited features of the present disclosure can be understood in detail, a more particular description, briefly summarized above, may be had by reference to aspects, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only certain typical aspects of this disclosure and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.
[0015] FIGS. 1A-1D depict examples of various types of neural networks, which may be implemented by aspects of the present disclosure.
[0016] FIG. 2 depicts an example of a traditional convolution operation, which may be implemented by aspects of the present disclosure.
[0017] FIGS. 3A and 3B depict examples of depthwise separable convolution operations, which may be implemented by aspects of the present disclosure.
[0018] FIG. 4 illustrates an example memory cell implemented as an eight-transistor (8T) static random access memory (SRAM) cell for a compute-in-memory (CIM) circuit.
[0019] FIG. 5 is a block diagram of an example circuit for digital CIM, in accordance with certain aspects of the present disclosure.
[0020] FIGS. 6A and 6B are flow diagrams illustrating example operations for in-memory computation, in accordance with certain aspects of the present disclosure.
[0021] FIG. 7 is a block diagram illustrating an example electronic device having a neural network configured to perform in-memory computation operations, in accordance with certain aspects of the present disclosure.
[0022] To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.
DETAILED DESCRIPTION
[0023] Aspects of the present disclosure provide apparatus, methods, processing systems, and computer-readable mediums for performing computation in memory (CIM) to handle data-intensive processing, such as implementing machine learning models. Some aspects provide techniques for performing digital CIM using accumulators, each accumulator accumulating output signals on a respective one of multiple bit-lines of memory after multiple activation cycles. In some aspects, during each of the multiple activation cycles, one of the word-lines may be activated. In other words, the word-lines may be sequentially activated, and the accumulators may concurrently perform accumulation to provide an accumulation result after two or more of the word-lines are sequentially activated.
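For illustration only, the sequential word-line activation and per-bit-line accumulation described above may be sketched in software as follows. The function name `digital_cim_accumulate` and the list-based array layout are hypothetical, and modeling each cell's contribution as a logical AND of the stored weight bit and the activation input is a simplifying assumption rather than a description of the circuit itself.

```python
def digital_cim_accumulate(weight_bits, activations):
    """Accumulate per-bit-line results over sequential word-line activations.

    weight_bits[w][b] is the bit stored in the cell on word-line w and
    bit-line b; activations[w] is the 0/1 activation input for word-line w.
    Both layouts are assumptions made for this sketch.
    """
    num_bit_lines = len(weight_bits[0])
    accumulators = [0] * num_bit_lines  # one accumulator per bit-line

    # One activation cycle per word-line: only that word-line is active.
    for row, activation in zip(weight_bits, activations):
        for bit_line, weight_bit in enumerate(row):
            # Assumed cell behavior: 1-bit product of weight and activation.
            accumulators[bit_line] += weight_bit & activation
    return accumulators


weights = [
    [1, 0, 1],  # word-line 0
    [1, 1, 0],  # word-line 1
    [0, 1, 1],  # word-line 2
]
result = digital_cim_accumulate(weights, activations=[1, 0, 1])
print(result)  # [1, 1, 2]
```

In this sketch, word-line 1 receives a logic-low activation and therefore contributes nothing to any accumulator.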
[0024] CIM-based machine learning (ML)/artificial intelligence (AI) may be used for a wide variety of tasks, including image processing (e.g., still images and video), audio processing, controlling radio frequency (RF) front-ends in wireless communications, and making wireless communication decisions (e.g., to optimize, or at least increase, throughput and signal quality). Further, CIM may be based on various types of memory architectures, such as dynamic random-access memory (DRAM), static random-access memory (SRAM) (e.g., based on an SRAM cell as in FIG. 4), magnetoresistive random-access memory (MRAM), and resistive random-access memory (ReRAM or RRAM), and may be attached to various types of processing units, including central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), neural processing units (NPUs), neural signal processors (NSPs), and others. Generally, CIM may beneficially reduce the “memory wall” problem, which is where the movement of data in and out of memory consumes more power than the computation of the data. Thus, by performing the computation in memory, significant power savings may be realized. This is particularly useful for various types of electronic devices, such as low power edge devices, mobile devices, and the like.
[0025] For example, a mobile device may include a memory device configured for storing data and performing compute-in-memory operations. The mobile device may be configured to perform an ML/AI operation based on data generated by the mobile device, such as image data generated by a camera sensor of the mobile device, audio data received via a microphone of the mobile device, inertial data gathered by an accelerometer or gyroscope of the mobile device, temperature data captured by a temperature sensor of the mobile device, etc., and/or combinations thereof. A memory controller unit (MCU) of the mobile device may thus load weights from another on-board memory (e.g., flash or RAM) into a CIM array of the memory device and allocate input feature buffers and output (e.g., output activation) buffers. The processing device may then commence processing of the data by loading, for example, a layer in the input buffer and processing the layer with weights loaded into the CIM array. This processing may be repeated for each layer of the data, and the outputs (e.g., output activations) may be stored in the output buffers and then used by the mobile device for an ML/AI task, such as intelligently controlling wireless communications, a heating and air conditioning system, a security system, or other Internet of Things (IoT) applications.
Brief Background on Neural Networks, Deep Neural Networks, and Deep Learning
[0026] Neural networks are organized into layers of interconnected nodes. Generally, a node (or neuron) is where computation happens. For example, a node may combine input data with a set of weights (or coefficients) that either amplifies or dampens the input data. The amplification or dampening of the input signals may thus be considered an assignment of relative significances to various inputs with regard to a task the network is trying to learn. Generally, input-weight products are summed (or accumulated), and then the sum is passed through a node’s activation function to determine whether and to what extent that signal should progress further through the network.
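As a minimal sketch of the node computation described above, the following sums the input-weight products and passes the sum through an activation function. The function name `node_output` is hypothetical, and the choice of a rectified linear unit (ReLU) as the activation function is an assumption made only for this example.

```python
def node_output(inputs, weights):
    """Sum the input-weight products, then apply an activation function."""
    # Accumulate the input-weight products.
    total = sum(x * w for x, w in zip(inputs, weights))
    # Assumed activation: ReLU passes positive sums and blocks the rest.
    return max(0.0, total)


print(node_output([1.0, 2.0], [0.5, 0.25]))  # 1.0
print(node_output([1.0, 2.0], [0.5, -1.0]))  # 0.0
```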
[0027] In a most basic implementation, a neural network may have an input layer, a hidden layer, and an output layer. “Deep” neural networks generally have more than one hidden layer.
[0028] Deep learning is a method of training deep neural networks. Generally, deep learning maps inputs to the network to outputs from the network and is thus sometimes referred to as a “universal approximator” because deep learning can learn to approximate an unknown function f(x) = y between any input x and any output y. In other words, deep learning finds the right f to transform x into y.
[0029] More particularly, deep learning trains each layer of nodes based on a distinct set of features, which is the output from the previous layer. Thus, with each successive layer of a deep neural network, features become more complex. Deep learning is thus powerful because it can progressively extract higher level features from input data and perform complex tasks, such as object recognition, by learning to represent inputs at successively higher levels of abstraction in each layer, thereby building up a useful feature representation of the input data.
[0030] For example, if presented with visual data, a first layer of a deep neural network may learn to recognize relatively simple features, such as edges, in the input data. In another example, if presented with auditory data, the first layer of a deep neural network may learn to recognize spectral power in specific frequencies in the input data. The second layer of the deep neural network may then learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data, based on the output of the first layer. Higher layers may then learn to recognize complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases. Thus, deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure.
Layer Connectivity in Neural Networks
[0031] Neural networks, such as deep neural networks (DNNs), may be designed with a variety of connectivity patterns between layers.
[0032] FIG. 1A illustrates an example of a fully connected neural network 102. In a fully connected neural network 102, each node in a first layer communicates its output to every node in a second layer, so that each node in the second layer will receive input from every node in the first layer.
[0033] FIG. 1B illustrates an example of a locally connected neural network 104. In a locally connected neural network 104, a node in a first layer may be connected to a limited number of nodes in the second layer. More generally, a locally connected layer of the locally connected neural network 104 may be configured so that each node in a layer will have the same or a similar connectivity pattern, but with connection strengths (or weights) that may have different values (e.g., values associated with local areas 110, 112, 114, and 116 of the first layer nodes). The locally connected connectivity pattern may give rise to spatially distinct receptive fields in a higher layer, because the higher layer nodes in a given region may receive inputs that are tuned through training to the properties of a restricted portion of the total input to the network.
[0034] One type of locally connected neural network is a convolutional neural network (CNN). FIG. 1C illustrates an example of a convolutional neural network 106. The convolutional neural network 106 may be configured such that the connection strengths associated with the inputs for each node in the second layer are shared (e.g., for local area 108 overlapping another local area of the first layer nodes). Convolutional neural networks are well suited to problems in which the spatial locations of inputs are meaningful.
[0035] One type of convolutional neural network is a deep convolutional network (DCN). Deep convolutional networks are networks of multiple convolutional layers, which may further be configured with, for example, pooling and normalization layers.
[0036] FIG. 1D illustrates an example of a DCN 100 designed to recognize visual features in an image 126 generated by an image-capturing device 130. For example, if the image-capturing device 130 is a camera mounted in or on (or otherwise moving along with) a vehicle, then the DCN 100 may be trained with various supervised learning techniques to identify a traffic sign and even a number on the traffic sign. The DCN 100 may likewise be trained for other tasks, such as identifying lane markings or identifying traffic lights. These are just some example tasks, and many others are possible.
[0037] In the example of FIG. 1D, the DCN 100 includes a feature-extraction section and a classification section. Upon receiving the image 126, a convolutional layer 132 applies convolutional kernels (for example, as depicted and described in FIG. 2) to the image 126 to generate a first set of feature maps 118 (or intermediate activations). Generally, a “kernel” or “filter” comprises a multidimensional array of weights designed to emphasize different aspects of an input data channel. In various examples, “kernel” and “filter” may be used interchangeably to refer to sets of weights applied in a convolutional neural network.
[0038] The first set of feature maps 118 may then be subsampled by a pooling layer (e.g., a max pooling layer, not shown) to generate a second set of feature maps 120. The pooling layer may reduce the size of the first set of feature maps 118 while maintaining much of the information in order to improve model performance. For example, the second set of feature maps 120 may be downsampled to a 14x14 matrix from a 28x28 matrix by the pooling layer.
[0039] This process may be repeated through many layers. In other words, the second set of feature maps 120 may be further convolved via one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
[0040] In the example of FIG. 1D, the second set of feature maps 120 is provided to a fully connected layer 124, which in turn generates an output feature vector 128. Each feature of the output feature vector 128 may include a number that corresponds to a possible feature of the image 126, such as “sign,” “60,” and “100.” In some cases, a softmax function (not shown) may convert the numbers in the output feature vector 128 to a probability. In such cases, an output 122 of the DCN 100 is a probability of the image 126 including one or more features.
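The 28x28 to 14x14 downsampling mentioned for the pooling layer can be illustrated with a short sketch. The function name `max_pool_2x2` is hypothetical, and the 2x2 window with a stride of 2 is an assumption; the description states only the input and output sizes.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2-D feature map by taking the max of each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            for j in range(0, cols - 1, 2)  # step of 2 halves the width
        ]
        for i in range(0, rows - 1, 2)  # step of 2 halves the height
    ]


# A 28x28 map whose entries simply count upward, for demonstration.
feature_map = [[r * 28 + c for c in range(28)] for r in range(28)]
pooled = max_pool_2x2(feature_map)
print(len(pooled), len(pooled[0]))  # 14 14
```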
[0041] A softmax function (not shown) may convert the individual elements of the output feature vector 128 into a probability in order that an output 122 of DCN 100 is one or more probabilities of the image 126 including one or more features, such as a sign with the number “60” thereon, as in image 126. Thus, in the present example, the probabilities in the output 122 for “sign” and “60” should be higher than the probabilities of the other elements of the output 122, such as “30,” “40,” “50,” “70,” “80,” “90,” and “100.”
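For illustration, a softmax function of the kind referred to above may be sketched as follows. The function name `softmax` and the example scores are hypothetical; subtracting the maximum score before exponentiation is a common numerical-stability step and is not taken from the description.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the peak avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three elements of an output feature vector.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```

The largest raw score always maps to the largest probability, which is why the elements for “sign” and “60” would be expected to dominate the output in the example above.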
[0042] Before training the DCN 100, the output 122 produced by the DCN 100 may be incorrect. Thus, an error may be calculated between the output 122 and a target output known a priori. For example, here the target output is an indication that the image 126 includes a “sign” and the number “60.” Utilizing the known target output, the weights of the DCN 100 may then be adjusted through training so that a subsequent output 122 of the DCN 100 achieves the target output (with high probabilities).
[0043] To adjust the weights of the DCN 100, a learning algorithm may compute a gradient vector for the weights. The gradient vector may indicate an amount that an error would increase or decrease if a weight were adjusted in a particular way. The weights may then be adjusted to reduce the error. This manner of adjusting the weights may be referred to as “backpropagation” because this adjustment process involves a “backward pass” through the layers of the DCN 100.
[0044] In practice, the error gradient of weights may be calculated over a small number of examples, so that the calculated gradient approximates the true error gradient. This approximation method may be referred to as stochastic gradient descent. Stochastic gradient descent may be repeated until the achievable error rate of the entire system has stopped decreasing or until the error rate has reached a target level.
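The weight-adjustment loop described above can be sketched on a toy problem. The following fits a single weight w so that w * x approximates y, estimating the gradient of a squared error from a small random batch at each step. The function name `sgd_fit`, the one-weight model, the learning rate, the batch size, and the sample data are all assumptions chosen for illustration; a DCN would apply the same idea to many weights via backpropagation.

```python
import random


def sgd_fit(samples, learning_rate=0.05, steps=200, batch_size=2, seed=0):
    """Fit y ~ w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # A small batch yields an approximation of the true error gradient.
        batch = rng.sample(samples, batch_size)
        gradient = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        # Adjust the weight in the direction that reduces the error.
        w -= learning_rate * gradient
    return w


samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # y = 3x
print(sgd_fit(samples))  # converges toward 3.0
```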
[0045] After training, the DCN 100 may be presented with new images, and the DCN 100 may generate inferences, such as classifications, or probabilities of various features being in the new image.
Convolution Techniques for Convolutional Neural Networks
[0046] Convolution is generally used to extract useful features from an input data set. For example, in convolutional neural networks, such as described above, convolution enables the extraction of different features using kernels and/or filters whose weights are automatically learned during training. The extracted features are then combined to make inferences.
[0047] An activation function may be applied before and/or after each layer of a convolutional neural network. Activation functions are generally mathematical functions that determine the output of a node of a neural network. Thus, the activation function determines whether a node should pass information or not, based on whether the node’s input is relevant to the model’s prediction. In one example, where y = conv(x) (i.e., y is the convolution of x), both x and y may be generally considered as “activations.” However, in terms of a particular convolution operation, x may also be referred to as “preactivations” or “input activations” as x exists before the particular convolution, and y may be referred to as output activations or a feature map.
[0048] FIG. 2 depicts an example of a traditional convolution in which a 12-pixel x 12-pixel x 3-channel input image is convolved using a 5 x 5 x 3 convolution kernel 204 and a stride (or step size) of 1. The resulting feature map 206 is 8 pixels x 8 pixels x 1 channel. As seen in this example, the traditional convolution may change the dimensionality of the input data as compared to the output data (here, from 12 x 12 to 8 x 8 pixels), including the channel dimensionality (here, from 3 channels to 1 channel).
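The dimensions in this example follow from the usual relationship for an unpadded convolution, output size = (input size - kernel size) / stride + 1, which gives (12 - 5) / 1 + 1 = 8. A minimal sketch, using hypothetical function names and nested lists in place of a tensor library, is shown below; the absence of padding is an assumption consistent with the stated sizes.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Spatial output size of a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1


def conv2d(image, kernel, stride=1):
    """Convolve image[h][w][c] with kernel[kh][kw][c] into one channel."""
    k_h, k_w, channels = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_h = conv_output_size(len(image), k_h, stride)
    out_w = conv_output_size(len(image[0]), k_w, stride)
    return [
        [
            sum(
                image[i * stride + di][j * stride + dj][c] * kernel[di][dj][c]
                for di in range(k_h)
                for dj in range(k_w)
                for c in range(channels)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[[1] * 3 for _ in range(12)] for _ in range(12)]  # 12 x 12 x 3
kernel = [[[1] * 3 for _ in range(5)] for _ in range(5)]   # 5 x 5 x 3
feature_map = conv2d(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 8 8
```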
[0049] One way to reduce the computational burden (e.g., measured in floating-point operations (FLOPs)) and the number of parameters associated with a neural network comprising convolutional layers is to factorize the convolutional layers. For example, a spatial separable convolution, such as depicted in FIG. 2, may be factorized into two components: (1) a depthwise convolution, where each spatial channel is convolved independently by a depthwise convolution (e.g., a spatial fusion); and (2) a pointwise convolution, where all the spatial channels are linearly combined (e.g., a channel fusion). An example of a depthwise separable convolution is depicted in FIGS. 3A and 3B. Generally, during spatial fusion, a network learns features from the spatial planes, and during channel fusion, the network learns relations between these features across channels.
[0050] In one example, a depthwise separable convolution may be implemented using 5x5 kernels for spatial fusion, and 1x1 kernels for channel fusion. In particular, the channel fusion may use a 1 x 1 x d kernel that iterates through every single point in an input image of depth d, where the depth d of the kernel generally matches the number of channels of the input image. Channel fusion via pointwise convolution is useful for dimensionality reduction for efficient computations. Applying 1 x 1 x d kernels and adding an activation layer after the kernel may give a network added depth, which may increase the network’s performance.
[0051] In particular, in FIG. 3A, the 12-pixel x 12-pixel x 3-channel input image 302 is convolved with a filter comprising three separate kernels 304A-C, each having a 5 x 5 x 1 dimensionality, to generate a feature map 306 of 8 pixels x 8 pixels x 3 channels, where each channel is generated by an individual kernel among kernels 304A-C.
[0052] Then, feature map 306 is further convolved using a pointwise convolution operation with a kernel 308 having dimensionality 1 x 1 x 3 to generate a feature map 310 of 8 pixels x 8 pixels x 1 channel. As is depicted in this example, feature map 310 has reduced dimensionality (1 channel versus 3 channels), which allows for more efficient computations therewith.
[0053] Though the result of the depthwise separable convolution in FIGS. 3A and 3B is substantially similar to the traditional convolution in FIG. 2, the number of computations is significantly reduced, and thus depthwise separable convolution offers a significant efficiency gain where a network design allows it.
[0054] Though not depicted in FIG. 3B, multiple (e.g., m) pointwise convolution kernels 308 (e.g., individual components of a filter) can be used to increase the channel dimensionality of the convolution output. So, for example, m = 256 1x1x3 kernels 308 can be generated, in which each output is an 8-pixel x 8-pixel x 1-channel feature map (e.g., feature map 310), and these feature maps can be stacked to get a resulting feature map of 8 pixels x 8 pixels x 256 channels. The resulting increase in channel dimensionality provides more parameters for training, which may improve a convolutional neural network’s ability to identify features (e.g., in input image 302).
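To make the computational comparison concrete, the following sketch counts multiplications for the dimensions used in FIGS. 2, 3A, and 3B. The function names are hypothetical, and counting one multiplication per kernel weight per output position is a simplifying assumption that ignores additions and any hardware-specific effects.

```python
def traditional_multiplies(out_h, out_w, k, in_ch, out_ch):
    """Multiplies for a k x k x in_ch convolution per output channel."""
    return out_h * out_w * k * k * in_ch * out_ch


def separable_multiplies(out_h, out_w, k, in_ch, out_ch):
    """Multiplies for a depthwise stage followed by a pointwise stage."""
    depthwise = out_h * out_w * k * k * in_ch   # one k x k kernel per channel
    pointwise = out_h * out_w * in_ch * out_ch  # 1 x 1 x in_ch per output
    return depthwise + pointwise


for out_ch in (1, 256):
    t = traditional_multiplies(8, 8, 5, 3, out_ch)
    s = separable_multiplies(8, 8, 5, 3, out_ch)
    print(out_ch, t, s)
# 1 4800 4992
# 256 1228800 53952
```

Under these assumptions, the two approaches cost about the same for a single output channel, and the saving from the separable form grows with the number of output channels because the depthwise stage is computed once and shared by every pointwise kernel.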
Example Compute-in-Memory (CIM) Architecture
[0055] FIG. 4 illustrates an example memory cell 400 of a static random access memory (SRAM), which may be implemented in a CIM array. The memory cell 400 may be referred to as an 8-transistor (8T) SRAM cell as the memory cell 400 is implemented with eight transistors.
[0056] As shown, the memory cell 400 may include a flip-flop, which may be implemented as a cross-coupled invertor pair 424 having an output 414 and an output 416. As shown, the cross-coupled invertor pair output 414 is selectively coupled to a write bit-line (WBL) 406 via a pass-gate transistor 402, and the cross-coupled invertor pair output 416 is selectively coupled to a complementary write bit-line (WBLB) 420 via a pass-gate transistor 418. The WBL 406 and WBLB 420 are configured to provide complementary digital signals to be written (e.g., stored) in the cross-coupled invertor pair 424. The WBL and WBLB may be used to store a bit for a neural network weight in the memory cell 400. The gates of pass-gate transistors 402, 418 may be coupled to a write word-line (WWL) 404, as shown. For example, a digital signal to be written may be provided to the WBL (and a complement of the digital signal is provided to the WBLB). The pass-gate transistors 402, 418 — which are implemented here as n-type field-effect transistors (NFETs) — are then turned on by providing a logic high signal to WWL 404, resulting in the digital signal being stored in the cross-coupled invertor pair 424.
[0057] As shown, the cross-coupled invertor pair output 414 may be coupled to a gate of a transistor 410. The source of the transistor 410 may be coupled to a reference potential node (VSS or electrical ground), and the drain of the transistor 410 may be coupled to a source of a transistor 412. The drain of the transistor 412 may be coupled to a read bit-line (RBL) 422, as shown. The gate of transistor 412 may be controlled via a read word-line (RWL) 408. The RWL 408 may be controlled via an activation input signal.
[0058] During a read cycle, the RBL 422 may be precharged to logic high. If both the activation input and the weight bit stored at the cross-coupled invertor pair output 414 are logic high, then transistors 410, 412 are both turned on, electrically coupling the RBL 422 to VSS at the source of transistor 410 and discharging the RBL 422 to logic low. If either the activation input or the weight stored at the cross-coupled invertor pair output 414 is logic low, then at least one of transistors 410, 412 will be turned off, such that the RBL 422 remains logic high. Thus, the output of the memory cell 400 at RBL 422 is logic low only when both the weight bit and activation input are logic high, and is logic high otherwise, effectively implementing a NAND-gate operation.
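The read behavior described above may be modeled, purely for illustration, by the following Python sketch. The function name `read_bit_line` and the 0/1 encoding of logic levels are assumptions made for this example; the sketch models only the logical outcome of the read cycle, not the electrical behavior of the cell.

```python
def read_bit_line(weight_bit: int, activation: int) -> int:
    """Return the logic level on the RBL after one read cycle of the 8T cell."""
    rbl = 1                               # RBL is precharged to logic high
    transistor_410_on = weight_bit == 1   # gate driven by the stored weight bit
    transistor_412_on = activation == 1   # gate driven by the RWL (activation input)
    if transistor_410_on and transistor_412_on:
        rbl = 0                           # both on: RBL is discharged to VSS
    return rbl


# Enumerate (weight bit, activation input, RBL level) for all input combinations.
truth_table = [(w, x, read_bit_line(w, x)) for w in (0, 1) for x in (0, 1)]
for row in truth_table:
    print(row)
```

The enumerated table matches a NAND gate: the RBL is logic low only when both the weight bit and the activation input are logic high.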
[0059] FIG. 5 illustrates a circuit 500 for CIM, in accordance with certain aspects of the present disclosure. The circuit 500 includes a CIM array having N word-lines 504-1 to 504-N (also referred to as rows) and M bit-lines 506-1 to 506-M (also referred to herein as “columns”), N and M each being any integer greater than 1. N and M may be the same or different. Bit-lines 506-1 to 506-M (collectively referred to herein as “BLs 506”) are labeled BL1 to BLM in FIG. 5, and word-lines 504-1 to 504-N (collectively referred to herein as “WLs 504”) are labeled WL1 to WLN in FIG. 5. Each of the BLs 506 may correspond to the RBL in the memory cell 400 of FIG. 4, and each of the WLs 504 may correspond to the RWL in the memory cell 400 of FIG. 4. As shown in FIG. 5, memory cells 502-1,1 to 502-N,M (collectively referred to herein as “memory cells 502”) are implemented at the intersections of the WLs 504 and BLs 506. In the memory cell reference scheme (e.g., 502-2,1), the first integer after the dash (here, 2) indicates the word-line, and the second integer after the dash (here, 1) indicates the bit-line, of the intersection where the memory cell is located.
[0060] Each of the memory cells 502 may be implemented using the memory cell architecture described with respect to FIG. 4. As shown, activation inputs X1 to XN may be provided to respective word-lines 504, and the memory cells 502 may store neural network weights W1 to WN, where each weight has M bits (e.g., W1,1 to W1,M, W2,1 to W2,M, and WN,1 to WN,M). For example, memory cells 502-1,1 to 502-1,M may store M bits for weight W1 (e.g., weight bits W1,1 to W1,M), memory cells 502-2,1 to 502-2,M may store M bits for weight W2 (e.g., weight bits W2,1 to W2,M), and so on. The weights may be written to the memory cells 502 via write bit-lines (e.g., WBL 406 and WBLB 420), which are not shown in FIG. 5. During a computation cycle, each memory cell 502 may multiply the received activation bit with the stored weight bit (e.g., may perform a logical NAND operation with the activation bit and the stored weight bit as inputs, as described with respect to FIG. 4).
[0061] In some aspects, sense amplifiers (SAs) 508-1 to 508-M (collectively referred to herein as “SAs 508”) may be used to sense the signal on a respective one of the bit-lines 506 (e.g., a digital signal from the NAND processing of the memory cell). The SAs 508 may perform concurrent sensing of the bit-lines 506. Each of the sensed signals for a respective BL may be provided to a respective one of accumulators 510-1 to 510-M (collectively referred to herein as “accumulators 510”). The accumulators 510 concurrently perform accumulation of the signals sensed by the SAs 508, on a bit-line basis.
[0062] In some aspects, the activation inputs X1 to XN may be applied one row (word-line) at a time (e.g., one row each computation cycle). For example, the activation input X1 may be provided to word-line 504-1 during a first computation cycle, and the computation (e.g., multiplication, such as a NAND operation as described above) for activation input X1 and weight W1 may be performed via the memory cells on word-line 504-1 storing weight bits W1,1 to W1,M. The signals (e.g., digital signals) on BLs 506 after the first computation cycle may be sensed (concurrently) via respective SAs 508 and provided to respective accumulators 510. The same operation may be performed for each of activation inputs X2 to XN, one word-line at a time (and in order starting from X2 and ending with XN), during subsequent computation cycles. The accumulators 510 accumulate additional signals on corresponding BLs 506 after each computation cycle. After the computation cycles are complete, each of the accumulators 510 provides an accumulation result for a respective one of the BLs 506. The accumulation result from each accumulator indicates the accumulation of the signals on the respective BL, each of the signals being generated after one of the computation cycles.
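The row-at-a-time computation described above may be sketched in Python as follows, as a behavioral illustration only. The function name `cim_accumulate` and the sample weight and activation values are hypothetical. The sketch also assumes that each accumulator counts bit-line discharge events (i.e., the inverse of the NAND output, which equals the product of the activation bit and the weight bit); the description does not specify how the accumulators interpret the sensed level.

```python
def cim_accumulate(weights, activations):
    """Accumulate per-bit-line results over N sequential computation cycles.

    weights: N rows, each a list of M weight bits (row n holds the bits of weight n).
    activations: N activation bits, one per word-line.
    Returns a list of M accumulator values, one per bit-line.
    """
    num_bit_lines = len(weights[0])
    accumulators = [0] * num_bit_lines
    # One word-line is activated per computation cycle.
    for row_bits, x in zip(weights, activations):
        # All M bit-lines are sensed concurrently; each level is NAND(x, w).
        sensed = [0 if (w == 1 and x == 1) else 1 for w in row_bits]
        for col, level in enumerate(sensed):
            # Assumption: a discharged bit-line (level 0) contributes 1.
            accumulators[col] += 1 - level
    return accumulators


weights = [
    [1, 0, 1],  # bits of the weight on word-line 1
    [1, 1, 0],  # bits of the weight on word-line 2
    [0, 1, 1],  # bits of the weight on word-line 3
]
activations = [1, 0, 1]
result = cim_accumulate(weights, activations)
print(result)
```

In this sketch, each accumulator value equals the number of word-lines on which both the activation input and the weight bit for that bit-line are logic high.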
[0063] Since the word-lines 504 receiving the activation inputs X1 to XN may be activated one at a time, the number of computation cycles corresponds to the number of activation inputs and is an indication of the amount of time it takes to receive the accumulation results for the activation inputs X1 to XN, as the computation cycles occur sequentially. In some aspects, a computation cycle may be skipped if an activation input associated with the computation cycle is logic low, in effect speeding up the CIM process. In other words, activation inputs X1 to XN may be provided to respective word-lines during respective computation cycles 1-N. If activation input X2 is logic low, computation cycle 2 may be skipped, reducing the total amount of time it takes for computation by the duration of one computation cycle. Thus, for certain aspects, the computation using memory cells 502-2,1 to 502-2,M may be skipped, and/or the accumulators 510 may skip accumulation of the output signals of memory cells 502-2,1 to 502-2,M based on the activation input X2 being logic low. In this manner, the present disclosure provides a digital CIM array that offers a more accurate multiply-accumulate (MAC) operation as compared to certain implementations using analog CIM.
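The cycle-skipping behavior described above may be illustrated by the following Python sketch. The function name `cim_accumulate_with_skip`, the cycle counter, and the sample values are hypothetical, and the sketch again assumes that each accumulator adds the product of the activation bit and the weight bit. It is a behavioral model under those assumptions, not a description of the control circuitry.

```python
def cim_accumulate_with_skip(weights, activations):
    """Accumulate per-bit-line results, skipping cycles whose activation is logic low.

    Returns (accumulators, cycles_used).
    """
    accumulators = [0] * len(weights[0])
    cycles_used = 0
    for row_bits, x in zip(weights, activations):
        if x == 0:
            # Logic-low activation: no bit-line can discharge, so skip the cycle.
            continue
        cycles_used += 1
        for col, w in enumerate(row_bits):
            # With x == 1, the product of activation and weight bit is the weight bit.
            accumulators[col] += w
    return accumulators, cycles_used


weights = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
activations = [1, 0, 1]  # the second activation input is logic low
skipped_result, cycles_used = cim_accumulate_with_skip(weights, activations)
print(skipped_result, cycles_used)
```

For these sample values, the accumulation results are unchanged by the skip, while only two of the three computation cycles are used.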
Example Operations for Digital Computation in Memory (CIM)
[0064] FIG. 6A is a flow diagram illustrating example operations 600 for in-memory computation, in accordance with certain aspects of the present disclosure. The operations 600 may be performed by a circuit for CIM (e.g., digital CIM), such as the circuit 500 described with respect to FIG. 5.
[0065] The operations 600 begin at block 605 with the circuit storing, in a plurality of memory cells (e.g., memory cells 502) on each of multiple bit-lines (e.g., bit-lines 506) of a memory, multiple bits representing weights of a neural network. For certain aspects, the bits representing weights may be stored using other bit-lines (e.g., write bit-lines, such as WBL 406 and WBLB 420) of the memory. The plurality of memory cells on each of the multiple bit-lines are on different word-lines (e.g., word-lines 504) of the memory, as shown in FIG. 5. In some aspects, the circuit multiplies, via each of the plurality of memory cells, a bit of one of the weights with an activation input provided to a respective one of the word-lines.
[0066] At block 610, the circuit accumulates, via each accumulator of a plurality of accumulators (e.g., accumulators 510), output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated. The output signals may include digital signals generated by the plurality of memory cells on the respective one of the multiple bit-lines (e.g., due to NAND logic of the memory cells). In some aspects, the circuit activates the word-lines of the memory, one word-line at a time. In this case, the activation includes multiplying one of the weights stored in the memory cells on the one word-line with an activation input provided to the one word-line.
[0067] In some aspects, the circuit senses, via each sense amplifier of a plurality of sense amplifiers (e.g., SAs 508), the respective one of the multiple bit-lines. The output signals of the plurality of memory cells may be accumulated based on the sensing of the respective one of the multiple bit-lines. In some aspects, the multiple bit-lines are sensed concurrently via the plurality of sense amplifiers, as described herein.
[0068] In some aspects, the circuit selects the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more of the word-lines being logic high. For example, the circuit may skip accumulating at least one output signal of at least one other memory cell of the plurality of memory cells based on the at least one other memory cell receiving an activation input that is logic low.
[0069] In some aspects, the output signals are accumulated, via a respective one of the plurality of accumulators, after multiple activation cycles. During each of the multiple activation cycles, a respective activation input that is logic high is provided to a respective one of the word-lines.
[0070] In some aspects, each of the plurality of memory cells includes a pass-gate transistor (e.g., pass-gate transistor 418), a flip-flop (e.g., comprising the cross-coupled invertor pair 424) coupled to the pass-gate transistor, a first transistor (e.g., transistor 410) having a gate coupled to an output of the flip-flop, and a second transistor (e.g., transistor 412) coupled between the first transistor and the respective one of the multiple bit-lines (e.g., RBL 422 shown in FIG. 4). The first transistor may include a source coupled to a reference potential node (e.g., electric ground) and a drain coupled to a source of the second transistor, a drain of the second transistor being coupled to the respective one of the multiple bit-lines. A gate of the second transistor may be coupled to a respective one of the word-lines (e.g., RWL 408 shown in FIG. 4).
[0071] FIG. 6B is a flow diagram illustrating example operations 650 for in-memory computation, in accordance with certain aspects of the present disclosure. The operations 650 may be performed by a circuit for CIM (e.g., digital CIM), such as the circuit 500 described with respect to FIG. 5. Many of the operations 650 may be similar to the operations 600 described above and are not repeated below.
[0072] The operations 650 begin at block 655 with the circuit performing computations, in at least a portion of an array of compute-in-memory cells (e.g., memory cells 502), on a weight and an activation input for a neural network. Each compute-in-memory cell may be coupled to one of multiple bit-lines (e.g., bit-lines 506) and to one of multiple word-lines (e.g., word-lines 504) and may be configured to store a bit of the weight for the neural network.
[0073] At block 660, the circuit accumulates, via each accumulator of a plurality of accumulators (e.g., accumulators 510), output signals from two or more of the compute-in-memory cells coupled to a respective one of the multiple bit-lines. The output signals may include digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines (e.g., due to NAND logic of the memory cells).
[0074] In some aspects, the operations 650 may further include the circuit sequentially activating two or more of the word-lines. In this case, the accumulating at block 660 may occur after the sequentially activating. The sequentially activating may involve applying the activation input to each of the two or more word-lines, one word-line at a time.
Example Processing Systems for Computation in Memory (CIM)
[0075] FIG. 7 illustrates an example electronic device 700. The electronic device 700 may be configured to perform the methods described herein, including the operations 600 and 650 described with respect to FIGS. 6A and 6B.
[0076] The electronic device 700 includes a central processing unit (CPU) 702, which in some aspects may be a multi-core CPU. Instructions executed at the CPU 702 may be loaded, for example, from a program memory associated with the CPU 702 or may be loaded from a memory 724.
[0077] The electronic device 700 also includes additional processing blocks tailored to specific functions, such as a graphics processing unit (GPU) 704, a digital signal processor (DSP) 706, a neural processing unit (NPU) 708, a multimedia processing block 710, and a wireless connectivity processing block 712. In one implementation, the NPU 708 is implemented in one or more of the CPU 702, GPU 704, and/or DSP 706.
[0078] In some aspects, the wireless connectivity processing block 712 may include components, for example, for Third-Generation (3G) connectivity, Fourth-Generation (4G) connectivity (e.g., 4G LTE), Fifth-Generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and/or wireless data transmission standards. The wireless connectivity processing block 712 is further connected to one or more antennas 714 to facilitate wireless communication.
[0079] The electronic device 700 may also include one or more sensor processors 716 associated with any manner of sensor, one or more image signal processors (ISPs) 718 associated with any manner of image sensor, and/or a navigation processor 720, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
[0080] The electronic device 700 may also include one or more input and/or output devices 722, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like. In some aspects, one or more of the processors of the electronic device 700 may be based on an ARM instruction set.
[0081] The electronic device 700 also includes memory 724, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory (DRAM), a flash-based static memory, and the like. In this example, memory 724 includes computer-executable components, which may be executed by one or more of the aforementioned processors of the electronic device 700 and/or a CIM controller 732 (also referred to as control circuitry). For certain aspects, the electronic device 700 includes a CIM circuit 726, such as the circuit 500, as described herein. The CIM circuit 726 may be controlled via the CIM controller 732. For instance, in some aspects, memory 724 may include code 724A for storing (e.g., storing weights in memory cells) and code 724B for computing (e.g., performing a neural network computation by applying activation inputs). As illustrated, the CIM controller 732 may include a circuit 728A for storing (e.g., storing weights in memory cells), and a circuit 728B for computing (e.g., performing a neural network computation by applying activation inputs). The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.
[0082] In some aspects, such as where the electronic device 700 is a server device, various aspects may be omitted from the example depicted in FIG. 7, such as one or more of the multimedia processing block 710, wireless connectivity processing block 712, antenna 714, sensor processors 716, ISPs 718, or navigation processor 720.
Example Clauses
[0083] Clause 1. A circuit comprising: multiple bit-lines; multiple word-lines; an array of compute-in-memory cells, wherein each compute-in-memory cell is coupled to one of the bit-lines and to one of the word-lines and is configured to store a weight bit of a neural network; and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
[0084] Clause 2. The circuit of Clause 1, further comprising a plurality of sense amplifiers, each sense amplifier having an output coupled to a respective one of the accumulators and having an input coupled to the respective one of the multiple bit-lines.
[0085] Clause 3. The circuit of Clause 2, wherein the plurality of sense amplifiers are configured to concurrently sense the multiple bit-lines.
[0086] Clause 4. The circuit of any of Clauses 1-3, wherein the compute-in-memory cells coupled to the multiple bit-lines and to one of the multiple word-lines are configured to perform concurrent computations.
[0087] Clause 5. The circuit of any of Clauses 1-4, wherein the multiple word-lines are configured to be activated one word-line at a time.
[0088] Clause 6. The circuit of any of Clauses 1-5, wherein: two or more of the word-lines are configured to be sequentially activated; and each of the plurality of accumulators is configured to accumulate output signals from the compute-in-memory cells coupled to the respective one of the multiple bit-lines after the two or more of the word-lines are sequentially activated.
[0089] Clause 7. The circuit of Clause 6, wherein the output signals comprise digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines.
[0090] Clause 8. The circuit of Clause 6 or 7, further comprising control circuitry configured to select the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more of the word-lines being logic high.
[0091] Clause 9. The circuit of any of Clauses 1-5, wherein each of the plurality of accumulators is configured to perform accumulation of output signals from the compute-in-memory cells coupled to the respective one of the multiple bit-lines, and wherein, in performing the accumulation, each of the plurality of accumulators is configured to: accumulate output signals from two or more of the compute-in-memory cells; and skip accumulation of at least one output signal from at least one other compute-in-memory cell coupled to the respective one of the multiple bit-lines, based on the at least one other compute-in-memory cell receiving an activation input that is logic low.
[0092] Clause 10. The circuit of any of Clauses 1-7, wherein each compute-in-memory cell is configured to multiply the stored weight bit with an activation input provided to a respective one of the multiple word-lines.
[0093] Clause 11. The circuit of any of Clauses 1-5, wherein: the compute-in-memory cells coupled to each of the multiple word-lines are configured to be sequentially activated based on a plurality of activation inputs applied to the multiple word-lines; and a respective one of the plurality of accumulators is configured to accumulate output signals from the compute-in-memory cells after the compute-in-memory cells coupled to each of the multiple word-lines are sequentially activated.
[0094] Clause 12. The circuit of Clause 11, wherein: the respective one of the plurality of accumulators is configured to accumulate the output signals after multiple activation cycles; and during each of the multiple activation cycles, a respective one of the activation inputs that is logic high is provided to a respective one of the multiple word-lines.
[0095] Clause 13. The circuit of any of Clauses 1-12, wherein each compute-in-memory cell comprises: a pass-gate transistor; a cross-coupled invertor pair having an output coupled to the pass-gate transistor; a first transistor having a gate coupled to the output of the cross-coupled invertor pair; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
[0096] Clause 14. The circuit of Clause 13, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
[0097] Clause 15. The circuit of Clause 13 or 14, wherein a gate of the second transistor is coupled to a respective one of the multiple word-lines.
[0098] Clause 16. The circuit of any of Clauses 1-15, wherein at least one of the compute-in-memory cells comprises an eight-transistor (8T) static random access memory (SRAM) cell.
[0099] Clause 17. A method comprising: performing computations, in at least a portion of an array of compute-in-memory cells, on a weight and an activation input for a neural network, each compute-in-memory cell being coupled to one of multiple bit-lines and to one of multiple word-lines and being configured to store a bit of the weight for the neural network; and accumulating, via each accumulator of a plurality of accumulators, output signals from two or more of the compute-in-memory cells coupled to a respective one of the multiple bit-lines.
[0100] Clause 18. The method of Clause 17, further comprising sensing, via each sense amplifier of a plurality of sense amplifiers, the respective one of the multiple bit-lines, wherein the output signals from the two or more of the compute-in-memory cells are accumulated based on the sensing of the respective one of the multiple bit-lines.
[0101] Clause 19. The method of Clause 18, wherein the sensing comprises concurrently sensing the multiple bit-lines via the plurality of sense amplifiers.
[0102] Clause 20. The method of any of Clauses 17-19, wherein the output signals comprise digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines.
[0103] Clause 21. The method of any of Clauses 17-20, further comprising sequentially activating two or more of the word-lines, wherein the accumulating occurs after the sequentially activating and wherein the sequentially activating comprises applying the activation input to each of the two or more word-lines, one word-line at a time.
[0104] Clause 22. The method of Clause 21, further comprising selecting the two or more of the word-lines that are sequentially activated based on the activation input applied to each of the two or more of the word-lines being logic high.
[0105] Clause 23. The method of any of Clauses 17-22, further comprising skipping accumulating of at least one output signal from at least one other compute-in-memory cell in the array of compute-in-memory cells based on the at least one other compute-in-memory cell receiving the activation input, which is logic low.
[0106] Clause 24. The method of any of Clauses 17-23, wherein performing the computations comprises multiplying, via each of the compute-in-memory cells coupled to a respective one of the multiple word-lines in the at least the portion of the array, the bits of the weight with the activation input provided to the respective one of the multiple word-lines.
[0107] Clause 25. The method of any of Clauses 17-24, wherein: the output signals are accumulated, via a respective one of the plurality of accumulators, after multiple activation cycles; and during each of the multiple activation cycles, a respective activation input that is logic high is provided to a respective one of the multiple word-lines.
[0108] Clause 26. The method of any of Clauses 17-25, wherein each compute-in-memory cell comprises: a pass-gate transistor; a flip-flop having an input coupled to the pass-gate transistor; a first transistor having a gate coupled to an output of the flip-flop; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
[0109] Clause 27. The method of Clause 26, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
[0110] Clause 28. The method of Clause 26 or 27, wherein a gate of the second transistor is coupled to a respective one of the multiple word-lines.
[0111] Clause 29. A circuit for in-memory computation, comprising: a plurality of memory cells on each of multiple bit-lines of a memory, the plurality of memory cells being configured to store multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the memory; and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
[0112] Clause 30. The circuit of Clause 29, further comprising a plurality of sense amplifiers, each sense amplifier having an output coupled to a respective accumulator and having an input coupled to the respective one of the multiple bit-lines.
[0113] Clause 31. The circuit of Clause 30, wherein the plurality of sense amplifiers are configured to concurrently sense the multiple bit-lines.
[0114] Clause 32. The circuit of any of Clauses 29-31, wherein the memory cells on the multiple bit-lines and on one of the word-lines are configured to perform concurrent computations.
[0115] Clause 33. The circuit of any of Clauses 29-32, wherein the word-lines of the memory are configured to be activated one word-line at a time.
[0116] Clause 34. The circuit of any of Clauses 29-33, wherein: two or more of the word-lines are configured to be sequentially activated; and each of the plurality of accumulators is configured to accumulate output signals of the plurality of memory cells on the respective one of the multiple bit-lines after the two or more of the word-lines are sequentially activated.
[0117] Clause 35. The circuit of Clause 34, wherein the output signals comprise digital signals generated by the plurality of memory cells on the respective one of the multiple bit-lines.
[0118] Clause 36. The circuit of Clause 34 or 35, further comprising control circuitry configured to select the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more word-lines being logic high.
[0119] Clause 37. The circuit of any of Clauses 29-36, wherein each of the plurality of accumulators is configured to perform accumulation of memory cell output signals for the respective one of the multiple bit-lines, and wherein, in performing the accumulation, each of the plurality of accumulators is configured to: accumulate output signals of two or more of the plurality of memory cells; and skip accumulating of at least one output signal of at least one other memory cell of the plurality of memory cells based on the at least one other memory cell receiving an activation input that is logic low.
[0120] Clause 38. The circuit of any of Clauses 29-37, wherein each of the plurality of memory cells is configured to multiply a bit of one of the weights with an activation input provided to a respective one of the word-lines.
[0121] Clause 39. The circuit of any of Clauses 29-38, wherein: the plurality of memory cells are configured to be sequentially activated based on a plurality of activation inputs applied to the word-lines; and a respective one of the plurality of accumulators is configured to accumulate output signals from the plurality of memory cells after the plurality of memory cells are sequentially activated.
[0122] Clause 40. The circuit of Clause 39, wherein: the respective one of the plurality of accumulators is configured to accumulate the output signals after multiple activation cycles; and during each of the multiple activation cycles, a respective one of the activation inputs that is logic high is provided to a respective one of the word-lines.
[0123] Clause 41. The circuit of any of Clauses 29-40, wherein each of the plurality of memory cells comprises: a pass-gate transistor; a cross-coupled invertor pair having an output coupled to the pass-gate transistor; a first transistor having a gate coupled to the output of the cross-coupled invertor pair; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
[0124] Clause 42. The circuit of Clause 41, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
[0125] Clause 43. The circuit of Clause 41 or 42, wherein a gate of the second transistor is coupled to a respective one of the word-lines.
[0126] Clause 44. The circuit of any of Clauses 29-43, wherein at least one of the memory cells comprises an eight-transistor (8T) static random access memory (SRAM) cell.
[0127] Clause 45. A method for in-memory computation, comprising: storing, in a plurality of memory cells on each of multiple bit-lines of a memory, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the memory; and accumulating, via each accumulator of a plurality of accumulators, output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
[0128] Clause 46. The method of Clause 45, further comprising sensing, via each sense amplifier of a plurality of sense amplifiers, the respective one of the multiple bit-lines, wherein the output signals of the plurality of memory cells are accumulated based on the sensing of the respective one of the multiple bit-lines.
[0129] Clause 47. The method of Clause 46, wherein the multiple bit-lines are sensed concurrently via the plurality of sense amplifiers.
[0130] Clause 48. The method of any of Clauses 45-47, further comprising activating the word-lines of the memory, one word-line at a time, wherein the activating comprises multiplying one of the weights stored in the memory cells on the one word-line with an activation input provided to the one word-line.
[0131] Clause 49. The method of any of Clauses 45-48, wherein the output signals comprise digital signals generated by the plurality of memory cells on the respective one of the multiple bit-lines.
[0132] Clause 50. The method of any of Clauses 45-49, further comprising selecting the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more of the word-lines being logic high.
[0133] Clause 51. The method of any of Clauses 45-50, further comprising skipping accumulating of at least one output signal of at least one other memory cell of the plurality of memory cells based on the at least one other memory cell receiving an activation input that is logic low.
[0134] Clause 52. The method of any of Clauses 45-51, further comprising multiplying, via each of the plurality of memory cells, a bit of one of the weights with an activation input provided to a respective one of the word-lines.
[0135] Clause 53. The method of any of Clauses 45-52, wherein: the output signals are accumulated, via a respective one of the plurality of accumulators, after multiple activation cycles; and during each of the multiple activation cycles, a respective activation input that is logic high is provided to a respective one of the word-lines.
[0136] Clause 54. The method of any of Clauses 45-53, wherein each of the plurality of memory cells comprises: a pass-gate transistor; a flip-flop having an input coupled to the pass-gate transistor; a first transistor having a gate coupled to an output of the flip-flop; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
[0137] Clause 55. The method of Clause 54, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
[0138] Clause 56. The method of Clause 54 or 55, wherein a gate of the second transistor is coupled to a respective one of the word-lines.
[0139] Clause 57. An apparatus for in-memory computation, comprising: means for storing, in a plurality of memory cells on each of multiple bit-lines of the means for storing, multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple bit-lines are on different word-lines of the means for storing; and means for accumulating output signals of two or more of the plurality of memory cells on a respective one of the multiple bit-lines after two or more of the word-lines are sequentially activated.
[0140] Clause 58. The apparatus of Clause 57, further comprising means for sensing the respective one of the multiple bit-lines, wherein the output signals of the plurality of memory cells are accumulated based on the sensing of the respective one of the multiple bit-lines.
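As an illustration only, the method of Clauses 45-53 can be sketched behaviorally in software. The sketch below assumes single-bit weights and single-bit activation inputs, models each cell's output as the logical AND of its stored bit and the activation input on its word-line, and treats each accumulator as a plain integer counter. The function name `cim_accumulate` and the list-of-lists layout (`weights[r][c]` is the bit stored on word-line `r` and bit-line `c`) are hypothetical choices made for this example and do not appear in the disclosure.

```python
from typing import List


def cim_accumulate(weights: List[List[int]], activations: List[int]) -> List[int]:
    """Behavioral sketch: one accumulator per bit-line, one word-line per cycle.

    weights[r][c] is the weight bit stored on word-line r and bit-line c.
    activations[r] is the activation input (0 or 1) for word-line r.
    """
    if len(weights) != len(activations):
        raise ValueError("one activation input is needed per word-line")

    num_bit_lines = len(weights[0]) if weights else 0
    accumulators = [0] * num_bit_lines  # one accumulator per bit-line

    # Word-lines are visited one at a time (sequential activation).
    for row, activation in zip(weights, activations):
        if activation == 0:
            # A logic-low activation input contributes nothing, so the
            # word-line is not activated and its accumulation is skipped.
            continue
        # All bit-lines on the active word-line are evaluated together,
        # mirroring concurrent sensing of the bit-lines.
        for col, weight_bit in enumerate(row):
            accumulators[col] += weight_bit & activation

    return accumulators


weights = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
activations = [1, 0, 1, 1]
print(cim_accumulate(weights, activations))  # [2, 2, 3]
```

In this sketch the second word-line receives a logic-low activation input and is skipped, so only three activation cycles contribute to the accumulated values.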
Additional Considerations
[0141] The preceding description is provided to enable any person skilled in the art to practice the various aspects described herein. The examples discussed herein are not limiting of the scope, applicability, or aspects set forth in the claims. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
[0142] As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
[0143] As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).
[0144] As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
[0145] The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to a circuit, an application specific integrated circuit (ASIC), or processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering. For example, means for storing may include: (1) a CIM array, such as the array of memory cells 502, or (2) a CIM controller, such as the CIM controller 732 including a circuit 728A for storing, and memory such as memory 724 including code 724A for storing. Means for accumulating may include an accumulator such as the accumulators 510. Means for sensing may include a sense amplifier (SA), such as the SAs 508.
[0146] The following claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.

Claims

WHAT IS CLAIMED IS:
1. A circuit comprising: multiple bit-lines; multiple word-lines; an array of compute-in-memory cells, wherein each compute-in-memory cell is coupled to one of the bit-lines and to one of the word-lines and is configured to store a weight bit of a neural network; and a plurality of accumulators, each accumulator being coupled to a respective one of the multiple bit-lines.
2. The circuit of claim 1, further comprising a plurality of sense amplifiers, each sense amplifier having an output coupled to a respective one of the accumulators and having an input coupled to the respective one of the multiple bit-lines.
3. The circuit of claim 2, wherein the plurality of sense amplifiers are configured to concurrently sense the multiple bit-lines.
4. The circuit of claim 1, wherein the compute-in-memory cells coupled to the multiple bit-lines and to one of the multiple word-lines are configured to perform concurrent computations.
5. The circuit of claim 1, wherein the multiple word-lines are configured to be activated one word-line at a time.
6. The circuit of claim 1, wherein: two or more of the word-lines are configured to be sequentially activated; and each of the plurality of accumulators is configured to accumulate output signals from the compute-in-memory cells coupled to the respective one of the multiple bit-lines after the two or more of the word-lines are sequentially activated.
7. The circuit of claim 6, wherein the output signals comprise digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines.
8. The circuit of claim 6, further comprising control circuitry configured to select the two or more of the word-lines that are sequentially activated based on an activation input applied to each of the two or more of the word-lines being logic high.
9. The circuit of claim 1, wherein each of the plurality of accumulators is configured to perform accumulation of output signals from the compute-in-memory cells coupled to the respective one of the multiple bit-lines, and wherein, in performing the accumulation, each of the plurality of accumulators is configured to: accumulate output signals from two or more of the compute-in-memory cells; and skip accumulation of at least one output signal from at least one other compute-in-memory cell coupled to the respective one of the multiple bit-lines, based on the at least one other compute-in-memory cell receiving an activation input that is logic low.
10. The circuit of claim 1, wherein each compute-in-memory cell is configured to multiply the stored weight bit with an activation input provided to a respective one of the multiple word-lines.
11. The circuit of claim 1, wherein: the compute-in-memory cells coupled to each of the multiple word-lines are configured to be sequentially activated based on a plurality of activation inputs applied to the multiple word-lines; and a respective one of the plurality of accumulators is configured to accumulate output signals from the compute-in-memory cells after the compute-in-memory cells coupled to each of the multiple word-lines are sequentially activated.
12. The circuit of claim 11, wherein: the respective one of the plurality of accumulators is configured to accumulate the output signals after multiple activation cycles; and during each of the multiple activation cycles, a respective one of the activation inputs that is logic high is provided to a respective one of the multiple word-lines.
13. The circuit of claim 1, wherein each compute-in-memory cell comprises: a pass-gate transistor; a cross-coupled invertor pair having an output coupled to the pass-gate transistor; a first transistor having a gate coupled to the output of the cross-coupled invertor pair; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
14. The circuit of claim 13, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
15. The circuit of claim 13, wherein a gate of the second transistor is coupled to a respective one of the multiple word-lines.
16. The circuit of claim 1, wherein at least one of the compute-in-memory cells comprises an eight-transistor (8T) static random access memory (SRAM) cell.
17. A method comprising: performing computations, in at least a portion of an array of compute-in-memory cells, on a weight and an activation input for a neural network, each compute-in-memory cell being coupled to one of multiple bit-lines and to one of multiple word-lines and being configured to store a bit of the weight for the neural network; and accumulating, via each accumulator of a plurality of accumulators, output signals from two or more of the compute-in-memory cells coupled to a respective one of the multiple bit-lines.
18. The method of claim 17, further comprising sensing, via each sense amplifier of a plurality of sense amplifiers, the respective one of the multiple bit-lines, wherein the output signals from the two or more of the compute-in-memory cells are accumulated based on the sensing of the respective one of the multiple bit-lines.
19. The method of claim 18, wherein the sensing comprises concurrently sensing the multiple bit-lines via the plurality of sense amplifiers.
20. The method of claim 17, wherein the output signals comprise digital signals generated by the compute-in-memory cells on the respective one of the multiple bit-lines.
21. The method of claim 17, further comprising sequentially activating two or more of the word-lines, wherein the accumulating occurs after the sequentially activating and wherein the sequentially activating comprises applying the activation input to each of the two or more word-lines, one word-line at a time.
22. The method of claim 21, further comprising selecting the two or more word-lines that are sequentially activated based on the activation input applied to each of the two or more word-lines being logic high.
23. The method of claim 17, further comprising skipping accumulating of at least one output signal from at least one other compute-in-memory cell in the array of compute-in-memory cells based on the at least one other compute-in-memory cell receiving the activation input, which is logic low.
24. The method of claim 17, wherein performing the computations comprises multiplying, via each of the compute-in-memory cells coupled to a respective one of the multiple word-lines in the at least the portion of the array, the bits of the weight with the activation input provided to the respective one of the multiple word-lines.
25. The method of claim 17, wherein: the output signals are accumulated, via a respective one of the plurality of accumulators, after multiple activation cycles; and during each of the multiple activation cycles, a respective activation input that is logic high is provided to a respective one of the multiple word-lines.
26. The method of claim 17, wherein each compute-in-memory cell comprises: a pass-gate transistor; a flip-flop having an input coupled to the pass-gate transistor; a first transistor having a gate coupled to an output of the flip-flop; and a second transistor coupled between the first transistor and the respective one of the multiple bit-lines.
27. The method of claim 26, wherein the first transistor comprises a source coupled to a reference potential node and a drain coupled to a source of the second transistor and wherein a drain of the second transistor is coupled to the respective one of the multiple bit-lines.
28. The method of claim 26, wherein a gate of the second transistor is coupled to a respective one of the multiple word-lines.
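For illustration only, the read path recited in claims 13-15 and 26-28 can be modeled as two switches in series between the bit-line and the reference potential node: the first transistor is gated by the stored bit and the second by the word-line. The sketch below assumes both transistors conduct when their gate is logic high and that a conducting path is read as a product of 1; the function name `read_stack_conducts` is hypothetical and is not taken from the claims.

```python
def read_stack_conducts(stored_bit: int, word_line: int) -> bool:
    """Return True when the bit-line is connected to the reference node.

    Assumption for this sketch: each transistor conducts when its gate is
    logic high. The first transistor's gate follows the stored weight bit;
    the second transistor's gate follows the word-line (activation input).
    """
    first_transistor_on = stored_bit == 1
    second_transistor_on = word_line == 1
    # The two transistors are in series, so both must conduct.
    return first_transistor_on and second_transistor_on


# Truth table: the series path behaves as a logical AND, which for
# single-bit operands equals the product of weight bit and activation.
for stored_bit in (0, 1):
    for word_line in (0, 1):
        conducts = read_stack_conducts(stored_bit, word_line)
        print(stored_bit, word_line, int(conducts))
```

Under these assumptions the path conducts only when both the stored bit and the word-line are logic high, which is why a cell on an inactive word-line contributes nothing to its bit-line.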
PCT/US2022/074399 2021-08-02 2022-08-01 Digital compute in memory WO2023015167A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020247003285A KR20240038721A (en) 2021-08-02 2022-08-01 Digital Compute in Memory
CN202280051713.3A CN117751407A (en) 2021-08-02 2022-08-01 In-digital memory computation
EP22758105.5A EP4381503A1 (en) 2021-08-02 2022-08-01 Digital compute in memory

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163228523P 2021-08-02 2021-08-02
US63/228,523 2021-08-02
US17/816,285 US12019905B2 (en) 2022-07-29 Digital compute in memory
US17/816,285 2022-07-29

Publications (1)

Publication Number Publication Date
WO2023015167A1 (en) 2023-02-09

Family

ID=83005856

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/074399 WO2023015167A1 (en) 2021-08-02 2022-08-01 Digital compute in memory

Country Status (1)

Country Link
WO (1) WO2023015167A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3671748A1 (en) * 2018-12-21 2020-06-24 IMEC vzw In-memory computing for machine learning
US10831446B2 (en) * 2018-09-28 2020-11-10 Intel Corporation Digital bit-serial multi-multiply-and-accumulate compute in memory

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10831446B2 (en) * 2018-09-28 2020-11-10 Intel Corporation Digital bit-serial multi-multiply-and-accumulate compute in memory
EP3671748A1 (en) * 2018-12-21 2020-06-24 IMEC vzw In-memory computing for machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JHANG CHUAN-JIA ET AL: "Challenges and Trends of SRAM-Based Computing-In-Memory for AI Edge Devices", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, IEEE, US, vol. 68, no. 5, 22 March 2021 (2021-03-22), pages 1773 - 1786, XP011850843, ISSN: 1549-8328, [retrieved on 20210419], DOI: 10.1109/TCSI.2021.3064189 *

Similar Documents

Publication Publication Date Title
US20220414444A1 (en) Computation in memory (cim) architecture and dataflow supporting a depth-wise convolutional neural network (cnn)
US20220414443A1 (en) Compute in memory-based machine learning accelerator architecture
US20220269483A1 (en) Compute in memory accumulator
TWI815312B (en) Memory device, compute in memory device and method
EP4381376A1 (en) Folding column adder architecture for digital compute in memory
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements
EP4384899A1 (en) Partial sum management and reconfigurable systolic flow architectures for in-memory computation
US12019905B2 (en) Digital compute in memory
US20220414454A1 (en) Computation in memory architecture for phased depth-wise convolutional
US20230037054A1 (en) Digital compute in memory
WO2023015167A1 (en) Digital compute in memory
WO2023004570A1 (en) Activation buffer architecture for data-reuse in a neural network accelerator
US20230004350A1 (en) Compute in memory architecture and dataflows for depth-wise separable convolution
KR20220096991A (en) Neural network device including convolution SRAM and diagonal accumulation SRAM
WO2023064825A1 (en) Accumulator for digital computation-in-memory architectures
WO2023004374A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22758105; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 202280051713.3; Country of ref document: CN
ENP Entry into the national phase. Ref document number: 20247003285; Country of ref document: KR; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 2401000546; Country of ref document: TH
ENP Entry into the national phase. Ref document number: 2024505603; Country of ref document: JP; Kind code of ref document: A
REG Reference to national code. Ref country code: BR; Ref legal event code: B01A; Ref document number: 112024001355; Country of ref document: BR
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2022758105; Country of ref document: EP; Effective date: 20240304
ENP Entry into the national phase. Ref document number: 112024001355; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20240123