WO2021151098A1 - Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration - Google Patents

Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration

Info

Publication number
WO2021151098A1
Authority
WO
WIPO (PCT)
Prior art keywords
fifo
column
value
kernel
matrix
Prior art date
2020-01-24
Application number
PCT/US2021/014965
Other languages
French (fr)
Inventor
Liang Zhao
Zhichao Lu
Original Assignee
Reliance Memory Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-01-24
Filing date
2021-01-25
Publication date
2021-07-29
Application filed by Reliance Memory Inc. filed Critical Reliance Memory Inc.
Publication of WO2021151098A1 publication Critical patent/WO2021151098A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

A method of kernel partial sum accumulation includes: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix includes a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory includes a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory includes a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.

Description

KERNEL STACKING AND KERNEL PARTIAL SUM ACCUMULATION IN MEMORY ARRAY FOR NEURAL NETWORK INFERENCE ACCELERATION
RELATED APPLICATION
[0001] This application claims the benefit of U.S. provisional patent application serial no. 62/965,790, filed January 24, 2020, which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] The present disclosure relates generally to kernel stacking and kernel partial sum accumulation, and more specifically to kernel stacking and kernel partial sum accumulation in a memory array for neural network inference acceleration.
BACKGROUND
[0003] The application of RRAM in accelerating neural network computation has been widely studied. Many studies have focused on the acceleration of the Multiply-Accumulate (MAC) operation; these schemes are often suitable for accelerating fully-connected layers, but for a convolutional neural network (CNN) the utilization efficiency of the 1T1R array can be quite low.
[0004] On the other hand, the features of RRAM arrays, e.g. a one-transistor-one-memristor (1T1R) array, pose some practical challenges for the "fully-connected" approach, such as the limited precision of each cell and the large bitline (BL) currents when turning on multiple wordlines.
SUMMARY
[0005] A method of kernel partial sum accumulation, in some implementations, comprises: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
[0006] The method, in some implementations, further comprises: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
[0007] In some implementations, the first FIFO memory includes an SRAM-based FIFO memory.
[0008] The method, in some implementations, further comprises a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
[0009] In some implementations, the input matrix includes an input image matrix.
[0010] In some implementations, the input matrix includes at least two columns.
[0011] In some implementations, the input matrix includes a 2x5 matrix.
[0012] In some implementations, the kernel includes a 5x5 kernel.
[0013] A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing system with one or more processors, cause the computing system to execute a method of simulating a crossbar array circuit having a crossbar array, in some implementations comprising the steps of: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
[0014] The non-transitory computer-readable storage medium, in some implementations, further comprises: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
[0015] In some implementations, the first FIFO memory includes an SRAM-based FIFO memory.
[0016] The non-transitory computer-readable storage medium, in some implementations, further comprises a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
[0017] In some implementations, the input matrix includes an input image matrix.
[0018] In some implementations, the input matrix includes at least two columns.
[0019] In some implementations, the input matrix includes a 2x5 matrix and the kernel includes a 5x5 kernel.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] FIG. 1 is a schematic diagram illustrating a 1T1R memory array for neural network inference acceleration in accordance with some implementations of the present disclosure.
[0021] FIG. 2 is a schematic diagram illustrating AlexNet, which has 8 layers (5 convolutional and 3 fully-connected), in accordance with some implementations of the present disclosure.
[0022] FIG. 3 is a schematic diagram illustrating a naive version of the Inception Module in accordance with some implementations of the present disclosure.
[0023] FIG. 4 is a schematic diagram illustrating an Inception Module with dimensionality reduction in accordance with some implementations of the present disclosure.
[0024] FIG. 5 is a schematic diagram illustrating a kernel stacking structure in accordance with some implementations of the present disclosure.
[0025] FIG. 6 is a schematic diagram illustrating a kernel stack activation process in accordance with some implementations of the present disclosure.
[0026] FIG. 7 illustrates a method of kernel partial sum accumulation with pixel movement in accordance with some implementations of the present disclosure.
[0027] FIGS. 8-13 are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure.
[0028] FIGS. 14-19 are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure.
[0029] FIG. 20 is a schematic diagram illustrating a kernel stack activation process in accordance with some implementations of the present disclosure.
[0030] FIG. 21 is a schematic diagram illustrating a synchronous FIFO in accordance with some implementations of the present disclosure.
[0031] FIG. 22 is a block diagram illustrating an example computing system for implementing methods of using large output resistance to reduce the current in a crossbar array circuit in accordance with some implementations.
[0032] In the figures, elements having the same designations have the same or similar functions.
DETAILED DESCRIPTION
[0033] The disclosed technology has the following advantages:
[0034] First, it utilizes kernel data reuse and input data reuse, which significantly reduce the overall calculation time and cost. This leads to a higher utilization efficiency of the MAC array.
[0035] Second, the present disclosure provides a simple data I/O format that lends itself to pipelining.
[0036] Last, the present disclosure reduces the number of RRAM cells that are simultaneously activated on the same bitline. This prevents current crowding and IR drop on the bitline, and helps further scaling of the unit cell size in such an array.
[0037] FIG. 1 is a schematic diagram 100 illustrating a 1T1R memory array for neural network inference acceleration in accordance with some implementations of the present disclosure.
[0038] Multiply-Accumulate (MAC) is a core operation in neural network algorithms. The 1T1R array may be used to accelerate MAC operations in neural network inference. As shown in FIG. 1, the MAC operation y = Wx may be achieved by driving the input signals, in the form of voltages, on each wordline (WL). The weights are stored in the 1T1R cells, and by sensing the total output current on each bitline (BL), the results of the MAC calculation can be determined.
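As an illustration only (a software sketch with assumed shapes and values, not the claimed circuit), the bitline MAC behavior can be modeled in a few lines of Python:

```python
import numpy as np

# A minimal software model of the bitline MAC (assumed shapes and values,
# not the claimed circuit): wordline voltages x multiply the stored cell
# conductances W, and each bitline current accumulates the sum.
def mac_1t1r(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    return W @ x  # y = Wx: one dot product per bitline

W = np.array([[0.2, 0.5, 0.1],
              [0.7, 0.3, 0.9]])  # weights stored in the 1T1R cells
x = np.array([1.0, 0.0, 1.0])   # input voltages driven on the wordlines
print(mac_1t1r(W, x))           # bitline readouts: [0.3 1.6]
```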
[0039] Furthermore, deep neural networks involving convolutions often contain many different convolution kernels. For instance, FIG. 2 is a schematic diagram 200 illustrating AlexNet, which has 8 layers (5 convolutional and 3 fully-connected), in accordance with some implementations of the present disclosure.
[0040] In more advanced designs, different kernel sizes are often applied to the same set of input data. For instance, FIG. 3 is a schematic diagram 300 illustrating a naive version of the Inception Module in accordance with some implementations of the present disclosure.
[0041] It is possible to stack the kernels in one array, which reuses the input data and minimizes data movement. For instance, FIG. 4 is a schematic diagram 400 illustrating an Inception Module with dimensionality reduction in accordance with some implementations of the present disclosure.
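A minimal sketch of the stacking idea follows (the shapes and values are assumptions chosen for illustration, not taken from the disclosure):

```python
import numpy as np

# Kernel stacking sketch (assumed shapes, illustration only): flattened
# kernels become the columns of one array, so a single input patch drive
# computes every kernel's response at once and the input data is reused.
kernels = np.random.rand(4, 3, 3)       # four stacked 3x3 kernels
array = kernels.reshape(4, -1).T        # 9 wordlines x 4 bitlines
patch = np.random.rand(3, 3)            # shared input patch
responses = patch.reshape(-1) @ array   # all four responses in one pass
assert np.allclose(responses, [np.sum(patch * k) for k in kernels])
```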
[0042] FIG. 5 is a schematic diagram 500 illustrating a kernel stacking structure in accordance with some implementations of the present disclosure. As shown in FIG. 5, the 3 x 3 convolution kernel stack is often used to replace large convolution kernels, or to integrate with other different convolution kernels.
[0043] FIG. 6 is a schematic diagram 600 illustrating a kernel stack activation process in accordance with some implementations of the present disclosure. With kernel stacking, the convolution in a CNN may consist of many convolutions applied in parallel, extracting many kinds of features at many locations using kernels of different sizes. The results for the different features may then be accumulated, and the deep neural network can thereby be realized.
[0044] FIG. 7 illustrates a method of kernel partial sum accumulation with pixel movement in accordance with some implementations of the present disclosure.
[0045] First, the method includes generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first column and a second column (step 701). In some implementations, the input matrix includes an input image matrix, and thus the first column includes a first image column and the second column includes a second image column.
[0046] Second, the method further includes generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively (step 703).
[0047] Third, the method further includes accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively (step 705). The first FIFO value and the second FIFO value accumulated in the first FIFO memory and the second FIFO memory form a first FIFO column and a second FIFO column, respectively. The first FIFO column may include a first element at the top of the first FIFO column, and the second FIFO column may include a second element at the top of the second FIFO column. These first and second elements will be extracted later.
[0048] The process is shown in FIGS. 8-13, which are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure. When one row of kernels is activated, the dot product results of the input vector with each column in the kernel are extracted by an analog-to-digital converter (ADC) and accumulated in a first-in-first-out (FIFO) memory.
[0049] For instance, assume a convolution of a 1x5 image (x) with a 5x5 kernel (W). The calculation flow is demonstrated in FIGS. 9-13.
[0050] After the last step, the convolution operation is complete and the data can be pulled out of the FIFO memory, as shown in FIG. 13.
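The flow can be sketched in software as below; this is our reading of FIGS. 8-13, shown one-dimensionally for clarity, and the function and variable names are ours rather than the patent's:

```python
import numpy as np
from collections import deque

# Our reconstruction (a sketch under an assumed dataflow) of the FIFO
# partial-sum scheme of FIGS. 8-13. Each cycle one input sample is driven;
# the array returns its product with every kernel tap in parallel (the
# per-column ADC reads). The product with tap k belongs to output t - k,
# so it is accumulated k slots behind the FIFO tail; the head slot is
# popped once all K taps have contributed to it.
def conv1d_fifo(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    K = len(w)
    fifo = deque()
    out = []
    for t, sample in enumerate(x):
        fifo.append(0.0)                   # open a slot for output t
        partials = sample * w              # K products, read in parallel
        for k in range(min(K, t + 1)):
            fifo[-(k + 1)] += partials[k]  # tap k feeds output t - k
        if t >= K - 1:
            out.append(fifo.popleft())     # head slot is complete: emit it
    return np.array(out)

x = np.arange(5, dtype=float)              # a 1x5 image, as in FIG. 9
w = np.array([1.0, 2.0, 3.0, 2.0, 1.0])    # stand-in 5-tap kernel column
assert np.allclose(conv1d_fifo(x, w), np.correlate(x, w, mode="valid"))
```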
[0051] Furthermore, calculating the convolution of a 2x5 image with a 5x5 kernel requires more steps and greater complexity compared to that of the 1x5 image with a 5x5 kernel. The calculation of the convolution of a 2x5 image with a 5x5 kernel is shown in FIGS. 14-19, which are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure. This process further requires steps 707 and 709.
[0052] Therefore, the method further includes, in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as an output value, and accumulating the output value as a saved column (step 707).
[0053] Next, the method further includes, in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC (step 709).
[0054] This is also shown in FIG. 15: the left-most column can be taken out of the FIFO memory and saved as the saved column, because it already holds the correct results for the left-most column. Next, the second element from the second FIFO memory of the right column is extracted and added to the first ADC output of the left-most column, which is the subsequent FIFO value generated by extracting a subsequent dot product from the first ADC. The accumulation is repeated with the pixel movement until the state shown in FIG. 17.
[0055] In FIG. 18, if the left-most column extracted from the first pass is added back, the correct convolution results are obtained.
[0056] In FIG. 19, the convolution of the 2x5 image with the 5x5 kernel is done. The same principle applies to images and kernels of any size. This algorithm reduces the time complexity of the convolution from O(M²K²) to O(M²), where M is the side length of the image and K is the side length of the kernel.
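For a concrete sense of scale (illustrative numbers, under the assumption of one input step per cycle with all K x K multiplies done in parallel inside the array):

```python
M, K = 32, 5        # assumed image side and kernel side
print(M**2 * K**2)  # scalar MACs a sequential loop would issue: 25600
print(M**2)         # array steps when K*K MACs happen per cycle: 1024
```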
[0057] In FIG. 20, with kernel stacking activation, the convolution in a CNN may involve multiple rows of kernels. If there are M rows of kernels, then the total word length needed for the FIFO memory is MxN.
[0058] FIG. 21 is a schematic diagram illustrating a synchronous FIFO in accordance with some implementations of the present disclosure. In practice, a flip-flop based FIFO memory may not be ideal, because: (1) all data has to be moved every cycle (high dynamic power); and (2) the total memory size is not small. Therefore, in some implementations, an SRAM-based FIFO memory can be used; the counter of this SRAM-based FIFO memory can be shared across many different arrays (or, even better, all the FIFOs from each column can be viewed as one entire SRAM array with wide I/O).
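A behavioral sketch of that option is given below (an assumed design for illustration, not the patent's verified circuit): the entries stay in place in one wide SRAM-like array, and only a shared counter advances, so nothing shifts on a push or a pop:

```python
import numpy as np

# SRAM-style FIFO bank with a single shared counter (assumed design,
# illustration only). Unlike a flip-flop shift register, pushing and
# popping moves no data: only the shared head pointer advances.
class SharedCounterFifoBank:
    def __init__(self, num_columns: int, depth: int):
        self.mem = np.zeros((num_columns, depth))  # one wide SRAM array
        self.depth = depth
        self.head = 0                              # counter shared by all columns

    def accumulate(self, offset: int, values: np.ndarray) -> None:
        # Add partial sums into the slot that reaches the head in `offset` steps.
        self.mem[:, (self.head + offset) % self.depth] += values

    def pop_push(self, new_row: np.ndarray) -> np.ndarray:
        # Pop every column's head entry and reuse the freed slot for new data.
        out = self.mem[:, self.head].copy()
        self.mem[:, self.head] = new_row
        self.head = (self.head + 1) % self.depth   # advance the shared counter
        return out
```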
[0059] FIG. 22 is a block diagram 22000 illustrating an example computing system 2200 for implementing methods of using large output resistance to reduce the current in a crossbar array circuit in accordance with some implementations. The computer system 2200 may be used to implement at least the crossbars or crossbar arrays in accordance with some implementations of the present disclosure. The computer system 2200 in some implementations includes one or more processing units CPU(s) 2202 (also referred to as processors), one or more network interfaces 2205, optionally a user interface, a memory 2206, and one or more communication buses 2208 for interconnecting these components. The communication buses 2208 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The memory 2206 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid-state memory devices; and optionally includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The memory 2206 optionally includes one or more storage devices remotely located from the CPU(s) 2202. The memory 2206, or alternatively the non-volatile memory device(s) within the memory 2206, includes a non-transitory computer-readable storage medium. In some implementations, the memory 2206 or alternatively the non-transitory computer-readable storage medium stores the following programs, modules, and data structures, or a subset thereof:
[0060] an operating system 2210 (e.g., an embedded Linux operating system), which includes procedures for handling various basic system services and for performing hardware dependent tasks;
[0061] a network communication module 2212 for connecting the computer system with a manufacturing machine via one or more network interfaces (wired or wireless);
[0062] a computing module 2214 for executing programming instructions;
[0063] a controller 2216 for controlling a manufacturing machine in accordance with the execution of programming instructions; and
[0064] a user interaction module 2218 for enabling a user to interact with the computer system 2200.
[0065] Plural instances may be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the implementation(s). In general, structures and functionality presented as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the implementation(s).
[0066] It will also be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first column could be termed a second column, and, similarly, a second column could be termed the first column, without changing the meaning of the description, so long as all occurrences of the "first column" are renamed consistently and all occurrences of the "second column" are renamed consistently. The first column and the second column are both columns, but they are not the same column.
[0067] The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of the implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0068] As used herein, the term "if" may be construed to mean "when" or "upon" or "in response to determining" or "in accordance with a determination" or "in response to detecting" that a stated condition precedent is true, depending on the context. Similarly, the phrase "if it is determined (that a stated condition precedent is true)" or "if (a stated condition precedent is true)" or "when (a stated condition precedent is true)" may be construed to mean "upon determining" or "in response to determining" or "in accordance with a determination" or "upon detecting" or "in response to detecting" that the stated condition precedent is true, depending on the context.
[0069] The foregoing description included example systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative implementations. For purposes of explanation, numerous specific details were set forth in order to provide an understanding of various implementations of the inventive subject matter. It will be evident, however, to those skilled in the art that implementations of the inventive subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques have not been shown in detail.
[0070] The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the implementations and various implementations with various modifications as are suited to the particular use contemplated.

Claims

CLAIMS
What is claimed is:
1. A method of kernel partial sum accumulation, comprising: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
2. The method as claimed in claim 1, further comprising: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
3. The method as claimed in claim 1, wherein the first FIFO memory includes an SRAM- based FIFO memory.
4. The method as claimed in claim 1, further comprising a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
5. The method as claimed in claim 1, wherein the input matrix includes an input image matrix.
6. The method as claimed in claim 1, wherein the input matrix includes at least two columns.
7. The method as claimed in claim 1, wherein the input matrix includes a 2x5 matrix.
8. The method as claimed in claim 1, wherein the kernel includes a 5x5 kernel.
9. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing system with one or more processors, cause the computing system to execute a method of simulating a crossbar array circuit having a crossbar array, comprising the steps of: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
10. The non-transitory computer-readable storage medium as claimed in claim 9, further comprising: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
11. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the first FIFO memory includes an SRAM-based FIFO memory.
12. The non-transitory computer-readable storage medium as claimed in claim 9, further comprising a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
13. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the input matrix includes an input image matrix.
14. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the input matrix includes at least two columns.
15. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the input matrix includes a 2x5 matrix and the kernel includes a 5x5 kernel.
PCT/US2021/014965 2020-01-24 2021-01-25 Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration WO2021151098A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202062965790P 2020-01-24 2020-01-24
US62/965,790 2020-01-24

Publications (1)

Publication Number Publication Date
WO2021151098A1 2021-07-29

Family

ID=76993142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/014965 WO2021151098A1 (en) 2020-01-24 2021-01-25 Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration

Country Status (1)

Country Link
WO (1) WO2021151098A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024026741A1 (en) * 2022-08-03 2024-02-08 Hefei Reliance Memory Limited Acceleration architecture for neural networks

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030156602A1 (en) * 2002-02-19 2003-08-21 Sage Gerald F. Asynchronous digital signal combiner and method of combining asynchronous digital signals in cable television return path
JP3986877B2 (en) * 2002-04-26 2007-10-03 株式会社リコー Image processing device
US20110055445A1 (en) * 2009-09-03 2011-03-03 Azuray Technologies, Inc. Digital Signal Processing Systems
US20170011006A1 (en) * 2015-07-06 2017-01-12 Samsung Electronics Co., Ltd. Device and method to process data in parallel

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030156602A1 (en) * 2002-02-19 2003-08-21 Sage Gerald F. Asynchronous digital signal combiner and method of combining asynchronous digital signals in cable television return path
JP3986877B2 (en) * 2002-04-26 2007-10-03 株式会社リコー Image processing device
US20110055445A1 (en) * 2009-09-03 2011-03-03 Azuray Technologies, Inc. Digital Signal Processing Systems
US20170011006A1 (en) * 2015-07-06 2017-01-12 Samsung Electronics Co., Ltd. Device and method to process data in parallel

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GARLAND JAMES, GREGG DAVID: "Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing", ARXIV.ORG, 30 January 2018 (2018-01-30), Cornell University Library, NY 14853, pages 1 - 20, XP055830979, Retrieved from the Internet <URL:https://arxiv.org/pdf/1801.10219v1.pdf> [retrieved on 20210809] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024026741A1 (en) * 2022-08-03 2024-02-08 Hefei Reliance Memory Limited Acceleration architecture for neural networks

Similar Documents

Publication Publication Date Title
JP6736646B2 (en) Apparatus and method for performing a convolution operation in a convolutional neural network
US11048997B2 (en) Reduced complexity convolution for convolutional neural networks
US11709911B2 (en) Energy-efficient memory systems and methods
CN110163335A (en) The tune of the convolution at convolution algorithm device and convolutional Neural network input advises method
CN108304922A (en) Computing device and computational methods for neural computing
US11042795B2 (en) Sparse neuromorphic processor
EP3286638A1 (en) Logical operations
CN110647722B (en) Data processing method and device and related products
EP4128236A1 (en) Counter-based multiplication using processing in memory
TW202022711A (en) Convolution accelerator using in-memory computation
CN111340201A (en) Convolutional neural network accelerator and method for performing convolutional operation thereof
US20230068450A1 (en) Method and apparatus for processing sparse data
WO2021151098A1 (en) Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration
US20200175355A1 (en) Neural network accelerator with systolic array structure
CN117546178A (en) In-memory Computing (CIM) architecture and data flow supporting depth-wise Convolutional Neural Network (CNN)
US20230267740A1 (en) Video data processing method and system, and relevant assemblies
CN111125617A (en) Data processing method, data processing device, computer equipment and storage medium
CN112784951B (en) Winograd convolution operation method and related products
CN112765540A (en) Data processing method and device and related products
WO2023019103A1 (en) Partial sum management and reconfigurable systolic flow architectures for in-memory computation
KR20240035492A (en) Folding column adder architecture for in-memory digital computing.
US11163443B2 (en) Method and apparatus for controlling storage operations of data of region of interest
KR102154834B1 (en) In DRAM Bitwise Convolution Circuit for Low Power and Fast Computation
CN112784207B (en) Operation method and related product
CN113536219B (en) Operation method, processor and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21744871; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21744871; Country of ref document: EP; Kind code of ref document: A1)