WO2021151098A1 - Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration - Google Patents

Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration

Info

Publication number
WO2021151098A1
Authority
WO
WIPO (PCT)
Prior art keywords
fifo
column
value
kernel
matrix
Prior art date
2020-01-24
Application number
PCT/US2021/014965
Other languages
French (fr)
Inventor
Liang Zhao
Zhichao Lu
Original Assignee
Reliance Memory Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-01-24
Filing date
2021-01-25
Publication date
2021-07-29
Application filed by Reliance Memory Inc. filed Critical Reliance Memory Inc.
Publication of WO2021151098A1 publication Critical patent/WO2021151098A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

A method of kernel partial sum accumulation includes: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix includes a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory includes a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory includes a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.

Description

KERNEL STACKING AND KERNEL PARTIAL SUM ACCUMULATION IN MEMORY ARRAY FOR NEURAL NETWORK INFERENCE ACCELERATION
RELATED APPLICATION
[0001] This application claims the benefit of U.S. provisional patent application serial no. 62/965,790, filed January 24, 2020, which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] The present disclosure relates generally to kernel stacking and kernel partial sum accumulation, and more specifically to kernel stacking and kernel partial sum accumulation in a memory array for neural network inference acceleration.
BACKGROUND
[0003] The application of RRAM in accelerating neural network computation has been widely studied. Many studies have focused on the acceleration of the Multiply-Accumulate (MAC) operation; these schemes are often suitable for accelerating fully-connected layers, but for a convolutional neural network (CNN) the utilization efficiency of the 1T1R array can be quite low.
[0004] On the other hand, the features of RRAM arrays, e.g. a one-transistor-one-memristor (1T1R) array, pose some practical challenges for the "fully-connected" approach, such as the limited precision of each cell and the large bitline (BL) currents when turning on multiple wordlines.
SUMMARY
[0005] A method of kernel partial sum accumulation, in some implementations, comprises: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
[0006] The method, in some implementations, further comprises: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
[0007] In some implementations, the first FIFO memory includes an SRAM-based FIFO memory.
[0008] The method, in some implementations, further comprises a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
[0009] In some implementations, the input matrix includes an input image matrix.
[0010] In some implementations, the input matrix includes at least two columns.
[0011] In some implementations, the input matrix includes a 2x5 matrix.
[0012] In some implementations, the kernel includes a 5x5 kernel.
[0013] A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing system with one or more processors, cause the computing system to execute a method of simulating a crossbar array circuit having a crossbar array, in some implementations comprising the steps of: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
[0014] The non-transitory computer-readable storage medium, in some implementations, further comprises: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
[0015] In some implementations, the first FIFO memory includes an SRAM-based FIFO memory.
[0016] The non-transitory computer-readable storage medium, in some implementations, further comprises a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
[0017] In some implementations, the input matrix includes an input image matrix.
[0018] In some implementations, the input matrix includes at least two columns.
[0019] In some implementations, the input matrix includes a 2x5 matrix and the kernel includes a 5x5 kernel.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] FIG. 1 is a schematic diagram illustrating a 1T1R memory array for neural network inference acceleration in accordance with some implementations of the present disclosure.
[0021] FIG. 2 is a schematic diagram illustrating AlexNet, which has 8 layers (5 convolutional and 3 fully-connected), in accordance with some implementations of the present disclosure.
[0022] FIG. 3 is a schematic diagram illustrating a naive version of the Inception Module in accordance with some implementations of the present disclosure.
[0023] FIG. 4 is a schematic diagram illustrating an Inception Module with dimensionality reduction in accordance with some implementations of the present disclosure.
[0024] FIG. 5 is a schematic diagram illustrating a kernel stacking structure in accordance with some implementations of the present disclosure.
[0025] FIG. 6 is a schematic diagram illustrating a kernel stack activation process in accordance with some implementations of the present disclosure.
[0026] FIG. 7 illustrates a method of kernel partial sum accumulation with pixel movement in accordance with some implementations of the present disclosure.
[0027] FIGS. 8-13 are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure.
[0028] FIGS. 14-19 are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure.
[0029] FIG. 20 is a schematic diagram illustrating a kernel stack activation process in accordance with some implementations of the present disclosure.
[0030] FIG. 21 is a schematic diagram illustrating a synchronous FIFO in accordance with some implementations of the present disclosure.
[0031] FIG. 22 is a block diagram illustrating an example computing system for implementing methods of using large output resistance to reduce the current in a crossbar array circuit in accordance with some implementations.
[0032] In the figures, elements having the same designations have the same or similar functions.
DETAILED DESCRIPTION
[0033] The disclosed technology has the following advantages:
[0034] First, it utilizes kernel data reuse and input data reuse, which significantly reduce the overall calculation time and cost. This leads to a higher utilization efficiency of the MAC array.
[0035] Second, the present disclosure provides a simple data I/O format that lends itself to pipelining.
[0036] Last, the present disclosure reduces the number of RRAM cells that are simultaneously activated on the same bitline. This prevents current crowding and IR drop on the bitline, and helps further scaling of the unit cell size in such an array.
[0037] FIG. 1 is a schematic diagram 100 illustrating a 1T1R memory array for neural network inference acceleration in accordance with some implementations of the present disclosure.
[0038] Multiply-Accumulate (MAC) is a core operation in neural network algorithms. The 1T1R array may be used to accelerate MAC operations in neural network inference. As shown in FIG. 1, the MAC operation y = Wx may be achieved by driving the input signals, in the form of voltages, on each wordline (WL). The weights are stored in the 1T1R cells, and by sensing the total output current on each bitline (BL), the results of the MAC calculation can be determined.
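As an illustration only (a software sketch with assumed shapes and values, not the claimed circuit), the bitline MAC behavior can be modeled in a few lines of Python:

```python
import numpy as np

# A minimal software model of the bitline MAC (assumed shapes and values,
# not the claimed circuit): wordline voltages x multiply the stored cell
# conductances W, and each bitline current accumulates the sum.
def mac_1t1r(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    return W @ x  # y = Wx: one dot product per bitline

W = np.array([[0.2, 0.5, 0.1],
              [0.7, 0.3, 0.9]])  # weights stored in the 1T1R cells
x = np.array([1.0, 0.0, 1.0])   # input voltages driven on the wordlines
print(mac_1t1r(W, x))           # bitline readouts: [0.3 1.6]
```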
[0039] Furthermore, deep neural networks involving convolutions often contain many different convolution kernels. For instance, FIG. 2 is a schematic diagram 200 illustrating AlexNet, which has 8 layers (5 convolutional and 3 fully-connected), in accordance with some implementations of the present disclosure.
[0040] In more advanced designs, different kernel sizes are often applied to the same set of input data. For instance, FIG. 3 is a schematic diagram 300 illustrating a naive version of the Inception Module in accordance with some implementations of the present disclosure.
[0041] It is possible to stack the kernels in one array, which reuses the input data and minimizes data movement. For instance, FIG. 4 is a schematic diagram 400 illustrating an Inception Module with dimensionality reduction in accordance with some implementations of the present disclosure.
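A minimal sketch of the stacking idea follows (the shapes and values are assumptions chosen for illustration, not taken from the disclosure):

```python
import numpy as np

# Kernel stacking sketch (assumed shapes, illustration only): flattened
# kernels become the columns of one array, so a single input patch drive
# computes every kernel's response at once and the input data is reused.
kernels = np.random.rand(4, 3, 3)       # four stacked 3x3 kernels
array = kernels.reshape(4, -1).T        # 9 wordlines x 4 bitlines
patch = np.random.rand(3, 3)            # shared input patch
responses = patch.reshape(-1) @ array   # all four responses in one pass
assert np.allclose(responses, [np.sum(patch * k) for k in kernels])
```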
[0042] FIG. 5 is a schematic diagram 500 illustrating a kernel stacking structure in accordance with some implementations of the present disclosure. As shown in FIG. 5, the 3 x 3 convolution kernel stack is often used to replace large convolution kernels, or to integrate with other different convolution kernels.
[0043] FIG. 6 is a schematic diagram 600 illustrating a kernel stack activation process in accordance with some implementations of the present disclosure. With kernel stacking, the convolution in a CNN may consist of many convolutions applied in parallel, extracting many kinds of features at many locations using kernels of different sizes. The results for the different features may then be accumulated, and the deep neural network can thereby be realized.
[0044] FIG. 7 illustrates a method of kernel partial sum accumulation with pixel movement in accordance with some implementations of the present disclosure.
[0045] First, the method includes generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first column and a second column (step 701). In some implementations, the input matrix includes an input image matrix, and thus the first column includes a first image column and the second column includes a second image column.
[0046] Second, the method further includes generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively (step 703).
[0047] Third, the method further includes accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively (step 705). The first FIFO value and the second FIFO value accumulated in the first FIFO memory and the second FIFO memory form a first FIFO column and a second FIFO column, respectively. The first FIFO column may include a first element at the top of the first FIFO column, and the second FIFO column may include a second element at the top of the second FIFO column. These first and second elements will be extracted later.
[0048] The process is shown in FIGS. 8-13, which are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure. When one row of kernels is activated, the dot product results of the input vector with each column in the kernel are extracted by an analog-to-digital converter (ADC) and accumulated in a first-in-first-out (FIFO) memory.
[0049] For instance, assume a convolution of a 1x5 image (x) with a 5x5 kernel (W). The calculation flow is demonstrated in FIGS. 9-13.
[0050] After the last step, the convolution operation is complete and the data can be pulled out of the FIFO memory, as shown in FIG. 13.
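The flow can be sketched in software as below; this is our reading of FIGS. 8-13, shown one-dimensionally for clarity, and the function and variable names are ours rather than the patent's:

```python
import numpy as np
from collections import deque

# Our reconstruction (a sketch under an assumed dataflow) of the FIFO
# partial-sum scheme of FIGS. 8-13. Each cycle one input sample is driven;
# the array returns its product with every kernel tap in parallel (the
# per-column ADC reads). The product with tap k belongs to output t - k,
# so it is accumulated k slots behind the FIFO tail; the head slot is
# popped once all K taps have contributed to it.
def conv1d_fifo(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    K = len(w)
    fifo = deque()
    out = []
    for t, sample in enumerate(x):
        fifo.append(0.0)                   # open a slot for output t
        partials = sample * w              # K products, read in parallel
        for k in range(min(K, t + 1)):
            fifo[-(k + 1)] += partials[k]  # tap k feeds output t - k
        if t >= K - 1:
            out.append(fifo.popleft())     # head slot is complete: emit it
    return np.array(out)

x = np.arange(5, dtype=float)              # a 1x5 image, as in FIG. 9
w = np.array([1.0, 2.0, 3.0, 2.0, 1.0])    # stand-in 5-tap kernel column
assert np.allclose(conv1d_fifo(x, w), np.correlate(x, w, mode="valid"))
```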
[0051] Furthermore, calculating the convolution of a 2x5 image with a 5x5 kernel requires more steps and greater complexity compared to that of the 1x5 image with a 5x5 kernel. The calculation of the convolution of a 2x5 image with a 5x5 kernel is shown in FIGS. 14-19, which are schematic diagrams illustrating kernel partial sum accumulation in accordance with some implementations of the present disclosure. This process further requires steps 707 and 709.
[0052] Therefore, the method further includes, in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as an output value, and accumulating the output value as a saved column (step 707).
[0053] Next, the method further includes, in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC (step 709).
[0054] This is also shown in FIG. 15: the left-most column can be taken out of the FIFO memory and saved as the saved column, because it already holds the correct results for the left-most column. Next, the second element from the second FIFO memory of the right column is extracted and added to the first ADC output of the left-most column, which is the subsequent FIFO value generated by extracting a subsequent dot product from the first ADC. The accumulation is repeated with the pixel movement until the state shown in FIG. 17.
[0055] In FIG. 18, if the left-most column extracted from the first pass is added back, the correct convolution results are obtained.
[0056] In FIG. 19, the convolution of the 2x5 image with the 5x5 kernel is done. The same principle applies to images and kernels of any size. This algorithm reduces the time complexity of the convolution from O(M²K²) to O(M²), where M is the side length of the image and K is the side length of the kernel.
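For a concrete sense of scale (illustrative numbers, under the assumption of one input step per cycle with all K x K multiplies done in parallel inside the array):

```python
M, K = 32, 5        # assumed image side and kernel side
print(M**2 * K**2)  # scalar MACs a sequential loop would issue: 25600
print(M**2)         # array steps when K*K MACs happen per cycle: 1024
```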
[0057] In FIG. 20, with kernel stacking activation, the convolution in a CNN may involve multiple rows of kernels. If there are M rows of kernels, then the total word length needed for the FIFO memory is MxN.
[0058] FIG. 21 is a schematic diagram illustrating a synchronous FIFO in accordance with some implementations of the present disclosure. In practice, a flip-flop based FIFO memory may not be ideal, because: (1) all data has to be moved every cycle (high dynamic power); and (2) the total memory size is not small. Therefore, in some implementations, an SRAM-based FIFO memory can be used; the counter of this SRAM-based FIFO memory can be shared across many different arrays (or, even better, all the FIFOs from each column can be viewed as one entire SRAM array with wide I/O).
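A behavioral sketch of that option is given below (an assumed design for illustration, not the patent's verified circuit): the entries stay in place in one wide SRAM-like array, and only a shared counter advances, so nothing shifts on a push or a pop:

```python
import numpy as np

# SRAM-style FIFO bank with a single shared counter (assumed design,
# illustration only). Unlike a flip-flop shift register, pushing and
# popping moves no data: only the shared head pointer advances.
class SharedCounterFifoBank:
    def __init__(self, num_columns: int, depth: int):
        self.mem = np.zeros((num_columns, depth))  # one wide SRAM array
        self.depth = depth
        self.head = 0                              # counter shared by all columns

    def accumulate(self, offset: int, values: np.ndarray) -> None:
        # Add partial sums into the slot that reaches the head in `offset` steps.
        self.mem[:, (self.head + offset) % self.depth] += values

    def pop_push(self, new_row: np.ndarray) -> np.ndarray:
        # Pop every column's head entry and reuse the freed slot for new data.
        out = self.mem[:, self.head].copy()
        self.mem[:, self.head] = new_row
        self.head = (self.head + 1) % self.depth   # advance the shared counter
        return out
```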
[0059] FIG. 22 is a block diagram 22000 illustrating an example computing system 2200 for implementing methods of using large output resistance to reduce the current in a crossbar array circuit in accordance with some implementations. The computer system 2200 may be used to implement at least the crossbars or crossbar arrays in accordance with some implementations of the present disclosure. The computer system 2200 in some implementations includes one or more processing units CPU(s) 2202 (also referred to as processors), one or more network interfaces 2205, optionally a user interface, a memory 2206, and one or more communication buses 2208 for interconnecting these components. The communication buses 2208 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The memory 2206 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid-state memory devices; and optionally includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The memory 2206 optionally includes one or more storage devices remotely located from the CPU(s) 2202. The memory 2206, or alternatively the non-volatile memory device(s) within the memory 2206, includes a non-transitory computer-readable storage medium. In some implementations, the memory 2206 or alternatively the non-transitory computer-readable storage medium stores the following programs, modules, and data structures, or a subset thereof:
[0060] an operating system 2210 (e.g., an embedded Linux operating system), which includes procedures for handling various basic system services and for performing hardware dependent tasks;
[0061] a network communication module 2212 for connecting the computer system with a manufacturing machine via one or more network interfaces (wired or wireless);
[0062] a computing module 2214 for executing programming instructions;
[0063] a controller 2216 for controlling a manufacturing machine in accordance with the execution of programming instructions; and
[0064] a user interaction module 2218 for enabling a user to interact with the computer system 2200.
[0065] Plural instances may be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the implementation(s). In general, structures and functionality presented as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the implementation(s).
[0066] It will also be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first column could be termed a second column, and, similarly, a second column could be termed the first column, without changing the meaning of the description, so long as all occurrences of the "first column" are renamed consistently and all occurrences of the "second column" are renamed consistently. The first column and the second column are both columns, but they are not the same column.
[0067] The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of the implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0068] As used herein, the term "if" may be construed to mean "when" or "upon" or "in response to determining" or "in accordance with a determination" or "in response to detecting" that a stated condition precedent is true, depending on the context. Similarly, the phrase "if it is determined (that a stated condition precedent is true)" or "if (a stated condition precedent is true)" or "when (a stated condition precedent is true)" may be construed to mean "upon determining" or "in response to determining" or "in accordance with a determination" or "upon detecting" or "in response to detecting" that the stated condition precedent is true, depending on the context.
[0069] The foregoing description included example systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative implementations. For purposes of explanation, numerous specific details were set forth in order to provide an understanding of various implementations of the inventive subject matter. It will be evident, however, to those skilled in the art that implementations of the inventive subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques have not been shown in detail.
[0070] The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the implementations and various implementations with various modifications as are suited to the particular use contemplated.

Claims

CLAIMS
What is claimed is:
1. A method of kernel partial sum accumulation, comprising: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
2. The method as claimed in claim 1, further comprising: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
3. The method as claimed in claim 1, wherein the first FIFO memory includes an SRAM- based FIFO memory.
4. The method as claimed in claim 1, further comprising a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
5. The method as claimed in claim 1, wherein the input matrix includes an input image matrix.
6. The method as claimed in claim 1, wherein the input matrix includes at least two columns.
7. The method as claimed in claim 1, wherein the input matrix includes a 2x5 matrix.
8. The method as claimed in claim 1, wherein the kernel includes a 5x5 kernel.
9. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing system with one or more processors, cause the computing system to execute a method of simulating a crossbar array circuit having a crossbar array, comprising the steps of: generating a first dot product and a second dot product by multiplying an input image matrix with a kernel, wherein the input matrix comprises a first matrix column and a second matrix column; generating a first FIFO value and a second FIFO value by extracting the first dot product from a first ADC and the second dot product from a second ADC, respectively; accumulating the first FIFO value and the second FIFO value in a first FIFO memory and a second FIFO memory, respectively, wherein the first FIFO memory comprises a first FIFO column having a first FIFO element at the top of the first FIFO column and the second FIFO memory comprises a second FIFO column having a second FIFO element at the top of the second FIFO column; in response to the first FIFO value becoming the first FIFO element, outputting the first FIFO value as a saved output and accumulating the saved output as a saved column; and in response to the second FIFO value becoming the second FIFO element, adding the second FIFO value to a subsequent FIFO value generated by extracting a subsequent dot product from the first ADC.
10. The non-transitory computer-readable storage medium as claimed in claim 9, further comprising: generating an accumulation result by adding the saved column to the first FIFO column and the second FIFO column.
11. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the first FIFO memory includes an SRAM-based FIFO memory.
12. The non-transitory computer-readable storage medium as claimed in claim 9, further comprising a counter of the first FIFO memory, wherein the counter may be shared across arrays of the first FIFO memory.
13. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the input matrix includes an input image matrix.
14. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the input matrix includes at least two columns.
15. The non-transitory computer-readable storage medium as claimed in claim 9, wherein the input matrix includes a 2x5 matrix and the kernel includes a 5x5 kernel.
PCT/US2021/014965 2020-01-24 2021-01-25 Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration WO2021151098A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202062965790P 2020-01-24 2020-01-24
US62/965,790 2020-01-24

Publications (1)

Publication Number Publication Date
WO2021151098A1 2021-07-29

Family

ID=76993142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/014965 WO2021151098A1 (en) 2020-01-24 2021-01-25 Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration

Country Status (1)

Country Link
WO (1) WO2021151098A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024026741A1 (en) * 2022-08-03 2024-02-08 Hefei Reliance Memory Limited Acceleration architecture for neural networks

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030156602A1 (en) * 2002-02-19 2003-08-21 Sage Gerald F. Asynchronous digital signal combiner and method of combining asynchronous digital signals in cable television return path
JP3986877B2 (en) * 2002-04-26 2007-10-03 株式会社リコー Image processing device
US20110055445A1 (en) * 2009-09-03 2011-03-03 Azuray Technologies, Inc. Digital Signal Processing Systems
US20170011006A1 (en) * 2015-07-06 2017-01-12 Samsung Electronics Co., Ltd. Device and method to process data in parallel

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030156602A1 (en) * 2002-02-19 2003-08-21 Sage Gerald F. Asynchronous digital signal combiner and method of combining asynchronous digital signals in cable television return path
JP3986877B2 (en) * 2002-04-26 2007-10-03 株式会社リコー Image processing device
US20110055445A1 (en) * 2009-09-03 2011-03-03 Azuray Technologies, Inc. Digital Signal Processing Systems
US20170011006A1 (en) * 2015-07-06 2017-01-12 Samsung Electronics Co., Ltd. Device and method to process data in parallel

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GARLAND JAMES, GREGG DAVID: "Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing", ARXIV.ORG, 30 January 2018 (2018-01-30), Cornell University Library, NY 14853, pages 1 - 20, XP055830979, Retrieved from the Internet <URL:https://arxiv.org/pdf/1801.10219v1.pdf> [retrieved on 20210809] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024026741A1 (en) * 2022-08-03 2024-02-08 Hefei Reliance Memory Limited Acceleration architecture for neural networks

Similar Documents

Publication Publication Date Title
JP6736646B2 (en) Apparatus and method for performing a convolution operation in a convolutional neural network
US11048997B2 (en) Reduced complexity convolution for convolutional neural networks
US11709911B2 (en) Energy-efficient memory systems and methods
CN110163335A (en) The tune of the convolution at convolution algorithm device and convolutional Neural network input advises method
CN108304922A (en) Computing device and computational methods for neural computing
US11042795B2 (en) Sparse neuromorphic processor
EP3286638A1 (en) Logical operations
CN110647722B (en) Data processing method and device and related products
EP4128236A1 (en) Counter-based multiplication using processing in memory
TW202022711A (en) Convolution accelerator using in-memory computation
CN111340201A (en) Convolutional neural network accelerator and method for performing convolutional operation thereof
US20230068450A1 (en) Method and apparatus for processing sparse data
WO2021151098A1 (en) Kernel stacking and kernel partial sum accumulation in memory array for neural network inference acceleration
US20200175355A1 (en) Neural network accelerator with systolic array structure
CN117546178A (en) In-memory Computing (CIM) architecture and data flow supporting depth-wise Convolutional Neural Network (CNN)
US20230267740A1 (en) Video data processing method and system, and relevant assemblies
CN111125617A (en) Data processing method, data processing device, computer equipment and storage medium
CN112784951B (en) Winograd convolution operation method and related products
CN112765540A (en) Data processing method and device and related products
WO2023019103A1 (en) Partial sum management and reconfigurable systolic flow architectures for in-memory computation
KR20240035492A (en) Folding column adder architecture for in-memory digital computing.
US11163443B2 (en) Method and apparatus for controlling storage operations of data of region of interest
KR102154834B1 (en) In DRAM Bitwise Convolution Circuit for Low Power and Fast Computation
CN112784207B (en) Operation method and related product
CN113536219B (en) Operation method, processor and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21744871; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21744871; Country of ref document: EP; Kind code of ref document: A1)