CN111435459A - Double-sided neural network processor - Google Patents

Double-sided neural network processor

Info

Publication number
CN111435459A
Authority
CN
China
Prior art keywords
neural
neural network
network processor
computation circuit
circuit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910029526.5A
Other languages
Chinese (zh)
Inventor
张国飙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Haicun Information Technology Co Ltd
Original Assignee
Hangzhou Haicun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Haicun Information Technology Co Ltd filed Critical Hangzhou Haicun Information Technology Co Ltd
Priority to CN201910029526.5A priority Critical patent/CN111435459A/en
Priority to US16/249,112 priority patent/US11055606B2/en
Publication of CN111435459A publication Critical patent/CN111435459A/en
Priority to US17/227,323 priority patent/US20210232892A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Semiconductor Memories (AREA)
  • Memory System (AREA)

Abstract

The double-sided neural network processor (100) comprises a plurality of storage units (100aa-100mn), each storage unit (100ij) comprising at least one memory array (170) and a neural computation circuit (180). The neural network processor (100) is formed on a semiconductor substrate (0) having a first surface (0a) and a second surface (0b). The first surface (0a) contains the memory arrays (170) and the second surface (0b) contains the neural computation circuits (180). The memory arrays (170) and the neural computation circuits (180) are electrically coupled by a plurality of inter-surface connections (160).

Description

Double-sided neural network processor
Technical Field
The present invention relates to the field of integrated circuits, and more particularly to neural network processors (neuro-processors) used for artificial intelligence (AI).
Background
An important application of processors is the neural network. Neural networks are a powerful artificial-intelligence tool. FIG. 1A is an example of a neural network. It contains an input layer 32, a hidden layer 34 and an output layer 36. The input layer 32 contains i neurons 33, whose input data x1, …, xi constitute an input vector 30x. The output layer 36 contains k neurons 37, whose output data y1, y2, …, yk constitute an output vector 30y. The hidden layer 34 lies between the input layer 32 and the output layer 36; it contains j neurons 35, each neuron 35 electrically coupled to a first neuron in the input layer 32 and to a second neuron in the output layer 36. The coupling strength between neurons is represented by the synaptic weights wij and wjk.
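For concreteness, the forward computation of FIG. 1A can be sketched as follows (a minimal illustration, not part of the patent disclosure; the tanh activation and the layer sizes i=8, j=16, k=4 are assumptions):

```python
import numpy as np

def forward(x, W1, W2, f=np.tanh):
    """x: input vector 30x (length i); W1: i-by-j synaptic weights wij;
    W2: j-by-k synaptic weights wjk; returns output vector 30y (length k)."""
    h = f(x @ W1)        # hidden layer 34: j neurons 35
    return f(h @ W2)     # output layer 36: k neurons 37

rng = np.random.default_rng(0)
y = forward(rng.random(8), rng.random((8, 16)), rng.random((16, 4)))
print(y.shape)           # (4,)
```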
The prior art proposes a neural network accelerator chip 60 (see Chen et al., "DaDianNao: A Machine-Learning Supercomputer", IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609-622). The neural network accelerator 60 comprises 16 cores 50 coupled to one another by a tree-like connection (FIG. 1B). Each core 50 comprises a neural processing unit (NPU) 30 and four eDRAM blocks 40 (FIG. 1C). The NPU 30 performs neural computations and contains 256+32 16-bit multipliers and 256+32 16-bit adders. The eDRAMs 40 store synaptic weights, with a storage capacity of 2 MB.
There is still room for improvement in the neural network accelerator 60. First, the eDRAM 40 is a volatile memory: synaptic weights must be loaded into the eDRAM 40 from external memory before use, which takes time. Second, only 32 MB of eDRAM in each neural network accelerator chip 60 is available to store synaptic weights, a capacity still far below what is actually needed. Third, the design emphasis of the neural network accelerator 60 is skewed toward memory: the eDRAM 40 occupies 80% of the area in each core, while the NPU 30 occupies less than 10%, so the computational density is very limited.
Disclosure of Invention
The main purpose of the invention is to promote the progress of artificial intelligence.
It is another object of the invention to increase the computational power of neural network processors.
It is another object of the present invention to provide a neural network processor that can be used with mobile devices.
To achieve these and other objects, the present invention provides a double-sided neural network processor whose basic function is neural computation; more importantly, the synaptic weights required for that computation are stored on the same chip. The neural network processor comprises thousands of storage-computation units (storage units for short), each comprising at least one neural storage circuit and one neural computation circuit. The neural storage circuit contains a memory array that stores synaptic weights; the neural computation circuit performs neural computations using those synaptic weights. The neural network processor is formed on a semiconductor substrate having a first surface and a second surface: the first surface contains a plurality of memory arrays, and the second surface contains a plurality of neural computation circuits electrically coupled to them via a plurality of inter-surface connections.
This integration of the memory array and the neural computation circuit on the two sides of the substrate is referred to as double-sided integration. Double-sided integration improves computational density: with conventional two-dimensional integration, the area of the neural network processor is the sum of the areas of the memory array and the neural computation circuit; with double-sided integration, the memory array moves from beside the neural computation circuit to the other side of the substrate, the neural network processor becomes smaller, and the computational density increases.
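The area argument can be illustrated with hypothetical numbers (the block areas below are assumptions, not figures from the patent):

```python
# Hypothetical block areas, in mm^2 (illustrative values only).
a_mem, a_compute = 8.0, 2.0

area_2d = a_mem + a_compute                 # conventional 2-D integration: blocks side by side
area_double_sided = max(a_mem, a_compute)   # double-sided: blocks stacked on opposite surfaces

print(area_2d, area_double_sided)           # 10.0 vs 8.0
print(area_2d / area_double_sided)          # 1.25x improvement in computational density
```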
The first surface may employ any form of memory as the carrier of synaptic weights, such as RAM (SRAM, DRAM, MRAM, FRAM, etc.) or ROM (mask-ROM, OTP, NOR flash, NAND flash, etc.); the second surface may contain any form of neural computation circuit. Since the memory array on the first surface is formed on a single-crystal semiconductor substrate, it is fast. Furthermore, the memory array and the neural computation circuit are close together (relative to a traditional von Neumann architecture), so the time required to read synaptic weights is short. In addition, the number of inter-surface connections is large, which allows ultra-wide bandwidth between the memory array and the neural computation circuit. During neural computation, the input data are sent to all storage units, which perform their neural computations simultaneously, guaranteeing massively parallel computation. Because the neural network processor contains thousands of storage units, high-speed, high-efficiency neural computation can be realized.
Accordingly, the invention proposes a neural network processor (100), characterized in that it comprises: a plurality of storage units (100aa-100mn), each storage unit (100ij) comprising at least one memory array (170) and a neural computation circuit (180), the memory array (170) storing at least one synaptic weight, the neural computation circuit (180) performing a neural computation using the synaptic weights; and a semiconductor substrate (0) having a first surface (0a) and a second surface (0b), said first surface (0a) containing said memory array (170) and said second surface (0b) containing said neural computation circuit (180), said first surface (0a) and said second surface (0b) being electrically coupled by a plurality of inter-surface connections (160).
Drawings
FIG. 1A is a schematic diagram of a neural network; FIG. 1B is a chip layout diagram of a neural network accelerator (prior art); FIG. 1C is the core architecture of the neural network accelerator (prior art).
FIGS. 2A-2B are general descriptions of the double-sided neural network processor 100: FIG. 2A is its circuit block diagram; FIG. 2B is a circuit block diagram of one of its storage units.
FIG. 3A is a perspective view of a first surface of the neural network processor; FIG. 3B is a perspective view of its second surface; FIG. 3C is a cross-sectional view thereof.
FIGS. 4A-4B are circuit layout diagrams of the first and second surfaces of the neural network processor 100.
FIGS. 5A-5C are circuit block diagrams of three types of storage units.
FIGS. 6A-6C are circuit layouts of three types of storage units on the first and second surfaces.
FIG. 7 is a circuit block diagram of a neural computation circuit.
FIGS. 8A-8B are circuit block diagrams of two types of computing circuits.
It is noted that the figures are diagrammatic and not drawn to scale. Dimensions and structures of parts in the figures may be exaggerated or reduced for clarity and convenience. In different embodiments, an alphabetic suffix following a number denotes a different instance of the same class of structure; the same numerical prefix refers to the same or similar structures. The symbol "/" denotes an "and/or" relationship.
In this specification, "memory" broadly refers to any semiconductor-based information storage device that can store information permanently or temporarily. A "memory array" is a collection of all memory cells that share at least one address line. "Electrically coupled" means any form of coupling by which an electrical signal may be transmitted from one element to another. In other publications, the "neural processing unit (NPU)" is also referred to as a "neural function unit (NFU)" and the like; these terms are synonymous. "Neural network processor" is also referred to as "neural processor", "neural network accelerator", "machine learning accelerator", etc.; they all have the same meaning.
Detailed Description
FIGS. 2A-2B are general illustrations of the double-sided neural network processor 100. FIG. 2A is its circuit block diagram. The neural network processor 100 not only performs neural computations; the synaptic weights required for those computations are also stored locally, in close proximity. The neural network processor 100 contains an m×n array of storage units 100aa-100mn. Taking the storage unit 100ij as an example, it has an input 110 and an output 120. In general, a neural network processor 100 may contain thousands of storage units 100aa-100mn, which supports massively parallel computation.
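A behavioral sketch of this organization (hypothetical and for illustration only; the values of m, n and the array sizes are assumptions): the input 110 is broadcast to every storage unit, each unit computes with its locally stored weights, and the outputs 120 are collected.

```python
import numpy as np

class StorageUnit:
    """Behavioral stand-in for one storage unit 100ij."""
    def __init__(self, weights):
        self.weights = weights                  # synaptic weights in memory array 170

    def compute(self, x):                       # neural computation circuit 180
        return np.tanh(self.weights @ x)        # tanh activation is an assumption

rng = np.random.default_rng(1)
units = [[StorageUnit(rng.random((4, 8))) for _ in range(3)]    # n = 3 columns
         for _ in range(2)]                                     # m = 2 rows
x = rng.random(8)                               # input 110, broadcast to every unit
outputs = [[u.compute(x) for u in row] for row in units]        # outputs 120
```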
FIG. 2B is a circuit block diagram of the storage unit 100ij. The storage unit 100ij comprises at least a neural storage circuit 170 and a neural computation circuit 180, electrically coupled to each other via a plurality of inter-surface connections 160 (see FIG. 3C). Each neural storage circuit 170 contains at least one memory array storing the synaptic weights used by the neural computation circuit 180 to perform neural computations. Since the memory array 170 is located on a different surface than the neural computation circuit 180, the memory array 170 is drawn with dashed lines.
FIG. 3A is a perspective view of the first surface 0a of the neural network processor chip 100; FIG. 3B is a perspective view of its second surface 0b; FIG. 3C is a cross-sectional view thereof. The neural network processor chip 100 contains a semiconductor substrate 0. The substrate 0 has a first surface 0a (+z direction) and a second surface 0b (-z direction). In this embodiment, the neural storage circuits (memory arrays) 170aa-170bb are formed on the first surface 0a of the substrate 0, the neural computation circuits 180aa-180bb are formed on the second surface 0b of the substrate 0, and the two are electrically coupled by a plurality of inter-surface connections (160, including 160a-160c). Examples of the inter-surface connections (160) include through-substrate vias (TSVs). In other embodiments, the placement is swapped: the neural computation circuits 180aa-180bb are formed on the first surface 0a, and the memory arrays 170aa-170bb are formed on the second surface 0b.
This integration of the memory arrays 170aa-170bb and the neural computation circuits 180aa-180bb on the front and back sides (0a, 0b) of the substrate 0 is referred to as double-sided integration. Double-sided integration improves computational density: with conventional two-dimensional integration, the area of the neural network processor is the sum of the areas of the memory array and the neural computation circuit; with double-sided integration, the memory array moves from beside the neural computation circuit to the other side of the substrate, the neural network processor becomes smaller, and the computational density increases.
The first surface 0a may employ any form of memory as the carrier of synaptic weights, such as RAM (SRAM, DRAM, MRAM, FRAM, etc.) or ROM (mask-ROM, OTP, NOR flash, NAND flash, etc.); the second surface 0b may contain any form of neural computation circuit. Since the memory array 170 on the first surface 0a is formed on a single-crystal semiconductor substrate, it is fast. Furthermore, the memory array 170 and the neural computation circuit 180 are close together (relative to a traditional von Neumann architecture), so the time required to read synaptic weights is short. In addition, the number of inter-surface connections 160 is large, which allows ultra-wide bandwidth between the memory array 170 and the neural computation circuit 180. During neural computation, the input data are sent to all storage units, which perform their neural computations simultaneously, guaranteeing massively parallel computation. Because the neural network processor contains thousands of storage units (FIG. 2A), high-speed, high-efficiency neural computation can be realized.
FIGS. 4A-4B are circuit layouts of the first and second surfaces 0a, 0b of the double-sided neural network processor 100. This embodiment corresponds to the embodiment of FIGS. 5A and 6A; those skilled in the art can easily generalize it to the embodiments of FIGS. 5B and 6B, and of FIGS. 5C and 6C. FIG. 4A shows the first surface 0a, which contains a plurality of memory arrays 170aa-170mn. FIG. 4B shows the second surface 0b, which contains a plurality of neural computation circuits 180aa-180mn. The neural network processor 100 of FIGS. 4A-4B employs a "full alignment" technique, i.e., the circuit layouts of the two surfaces 0a, 0b are designed such that each memory array (e.g., 170ij) has a neural computation circuit (e.g., 180ij) aligned with it (see FIGS. 6A-6C). Since a single neural computation circuit (e.g., 180ij) may have multiple memory arrays (e.g., 170ijA-170ijD, 170ijW-170ijZ) aligned with it (see FIGS. 6B-6C), the period of the neural computation circuits (e.g., 180ij) on the second surface 0b is an integer multiple of the period of the memory arrays (e.g., 170ij) on the first surface 0a.
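The period relationship can be captured in a small indexing sketch (an illustration under the simplifying assumption that the period multiple k is the same in both directions; FIG. 6C actually uses different multiples in x and y):

```python
def aligned_circuit(array_row, array_col, k):
    """Index of the neural computation circuit lying under memory array
    (array_row, array_col) when the circuit period is k times the array period."""
    return (array_row // k, array_col // k)

# k = 1: one memory array per circuit (FIGS. 5A/6A).
assert aligned_circuit(2, 3, 1) == (2, 3)
# k = 2: four memory arrays per circuit (FIGS. 5B/6B);
# arrays (0,0), (0,1), (1,0), (1,1) all land on circuit (0,0).
assert aligned_circuit(1, 1, 2) == (0, 0)
```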
FIGS. 5A-6C show three kinds of storage units 100ij. FIGS. 5A-5C are their circuit block diagrams; FIGS. 6A-6C are their circuit layout diagrams. In these embodiments, one neural computation circuit 180ij serves different numbers of memory arrays 170ij.
The neural computation circuit 180ij in FIG. 5A serves one memory array 170ij: it performs neural computations using the synaptic weights stored in the memory array 170ij. The neural computation circuit 180ij in FIG. 5B serves four memory arrays 170ijA-170ijD: it performs neural computations using the synaptic weights stored in the memory arrays 170ijA-170ijD. The neural computation circuit 180ij in FIG. 5C serves eight memory arrays 170ijA-170ijD and 170ijW-170ijZ: it performs neural computations using the synaptic weights stored in the memory arrays 170ijA-170ijD and 170ijW-170ijZ. As can be seen from FIGS. 6A-6C below, a neural computation circuit 180ij that serves more memory arrays 170ij generally occupies a larger chip area and has greater functionality. In FIGS. 5A-6C, since the memory array 170ij and the neural computation circuit 180ij are located on different surfaces (see FIGS. 3A-3C and FIGS. 4A-4B), the memory array 170ij is drawn with dashed lines.
FIGS. 6A-6C show the circuit layout of the second surface 0b and the projections (shown in dashed lines) of the memory arrays 170ij-170ijZ (located on the first surface 0a) onto the second surface 0b. The embodiment of FIG. 6A corresponds to the embodiment of FIG. 5A. In this embodiment, the neural computation circuit 180ij of the storage unit 100ij is located on the second surface 0b and is at least partially covered by the memory array 170ij.
In this embodiment, the period of the neural computation circuit 180ij equals the period of the memory array 170ij, and the area of the neural computation circuit cannot exceed the area of the projection of the memory array 170ij onto the second surface 0b, so its functionality is limited. This embodiment is better suited to simpler neural computations. FIGS. 6B-6C disclose two more complex neural computation circuits 180ij.
The embodiment of FIG. 6B corresponds to the embodiment of FIG. 5B. In this embodiment, the neural computation circuit 180ij of the storage unit 100ij is located on the second surface 0b and is at least partially covered by four memory arrays 170ijA-170ijD. Below the four memory arrays 170ijA-170ijD, the neural computation circuit 180ij can be laid out freely. The period of the neural computation circuit 180ij in FIG. 6B is twice the period of the memory array 170ij in FIG. 6A, and its area is four times as large, so more complex neural computations can be realized.
The embodiment of FIG. 6C corresponds to the embodiment of FIG. 5C. In this embodiment, the neural computation circuit 180ij of the storage unit 100ij is located on the second surface 0b. The eight memory arrays 170ijA-170ijD, 170ijW-170ijZ are divided into two groups 170ijSA, 170ijSB; each group (e.g., 170ijSA) contains four memory arrays (e.g., 170ijA-170ijD). Under the first group 170ijSA of four memory arrays 170ijA-170ijD, the first neural computation circuit component 180ijA can be laid out freely; similarly, under the second group 170ijSB of four memory arrays 170ijW-170ijZ, the second neural computation circuit component 180ijB can be laid out freely. The first neural computation circuit component 180ijA and the second neural computation circuit component 180ijB together constitute the neural computation circuit 180ij. The wiring channels 182, 184, 186 provide electrical coupling between the different neural computation circuit components 180ijA, 180ijB, or between different neural computation circuits. The neural computation circuit 180ij in FIG. 6C has four times the period (in the x direction) and eight times the area of the memory array 170ij in FIG. 6A, and can implement even more complex neural computations.
FIGS. 7-8B disclose details of a neural computation circuit 180 and its computing circuit 730. In the embodiment of FIG. 7, the neural computation circuit 180 contains a synaptic-weight (Ws) RAM 740A, an input-neuron (Nin) RAM 740B and a computing circuit 730. The Ws RAM 740A is a cache that temporarily stores the synaptic weights 742 from the memory array 170; the Nin RAM 740B is likewise a buffer, temporarily storing the input data 746 from the input 110. The computing circuit 730 performs the neural computation and produces the output data 748.
In the embodiment of FIG. 8A, the computing circuit 730 contains a multiplier 732, an adder 734, a register 736, and an activation function circuit 738. The multiplier 732 multiplies the synaptic weight wij by the input data xi; the adder 734 and the register 736 accumulate the products (wij×xi); the accumulated value is fed to the activation function circuit 738, whose result is the output data yj.
In the embodiment of FIG. 8B, the multiplier 732 of FIG. 8A is replaced by a multiplier-adder (MAC) 732'. Of course, the multiplier-adder 732' also contains a multiplier. The Ws RAM 740A outputs not only the synaptic weight wij (via port 742w) but also the bias bj (via port 742b). The multiplier-adder 732' performs a biased multiply operation (wij×xi+bj) on the input data xi, the synaptic weight wij and the bias bj.
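The datapaths of FIGS. 8A-8B can be paraphrased behaviorally as follows (a sketch only: floating point stands in for the 16-bit hardware arithmetic, and the sigmoid activation is an assumption):

```python
import math

def neuron_output(x, w, b=0.0, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
    acc = 0.0                        # register 736 holds the running sum
    for xi, wij in zip(x, w):
        acc += wij * xi + b          # multiplier-adder 732': wij*xi + bj
    return f(acc)                    # activation function circuit 738

# b = 0 reduces to the FIG. 8A datapath (multiplier 732 plus adder 734).
y_j = neuron_output([0.5, -1.0, 0.25], [0.2, 0.4, -0.6], b=0.01)
print(y_j)
```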
An activation function is a function whose output is confined to a certain range (e.g., 0 to 1, or -1 to +1); examples include the sigmoid function, the signum function, threshold functions, piecewise-linear functions, step functions, the tanh function, etc. Activation functions are difficult to implement in circuitry. The computing circuit 730 may therefore also contain a non-volatile memory for long-term storage of a look-up table (LUT) of the activation function. The non-volatile memory is typically a read-only memory (ROM). In one embodiment of the invention, the ROM is a three-dimensional read-only memory (3D-ROM) array, which is stacked above, and coincident with, the neural computation circuit (180). The computing circuit 730 then becomes extremely simple: it only needs to implement addition and multiplication, not the activation function. A computing circuit 730 that implements the activation function with a 3D-ROM array occupies a small area, which guarantees computational density.
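How such a look-up table replaces the activation circuitry can be sketched as follows (illustrative only: the table size, input range and tanh function are assumptions, not the patent's parameters):

```python
import numpy as np

# The activation (tanh here, an assumption) is precomputed over a fixed input
# range and burned into the ROM once; at run time the circuit only reads it.
LUT_SIZE, LO, HI = 256, -4.0, 4.0
ACT_LUT = np.tanh(np.linspace(LO, HI, LUT_SIZE))

def activation_lut(s):
    """Clamp s to [LO, HI], then map it to a table index: a read, not a compute."""
    idx = int((min(max(s, LO), HI) - LO) / (HI - LO) * (LUT_SIZE - 1))
    return ACT_LUT[idx]

print(activation_lut(0.7))   # close to tanh(0.7), up to quantization error
```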
It will be understood that changes in form and detail may be made without departing from the spirit and scope of the invention, and they should not impede the practice of the invention. Accordingly, the invention is to be limited only by the spirit of the appended claims.

Claims (10)

1. A double-sided neural network processor (100), comprising:
a plurality of storage units (100aa-100mn), each storage unit (100ij) comprising at least one memory array (170) and a neural computation circuit (180), the memory array (170) storing at least one synaptic weight, the neural computation circuit (180) performing a neural computation using the synaptic weights;
a semiconductor substrate (0) having a first surface (0a) and a second surface (0b), said first surface (0a) containing said memory array (170) and said second surface (0b) containing said neural computation circuit (180), said first surface (0a) and said second surface (0b) being electrically coupled by a plurality of inter-surface connections (160).
2. The neural network processor (100) of claim 1, further characterized by: the projection of the memory array (170) on the second surface (0b) at least partially coincides with the neuro-computation circuit (180).
3. The neural network processor (100) of claim 1, further characterized by: each memory array (170ij) in the first surface (0a) has a neural computation circuit (180ij) aligned with it on the second surface (0 b).
4. The neural network processor (100) of claim 1, further characterized by: each neural computation circuit (180ij) in the second surface (0b) has at least one memory array (170ij) aligned with it on the first surface (0 a).
5. The neural network processor (100) of claim 1, further characterized by: the period of the neural computation circuit (180ij) on the second surface (0b) is an integer multiple of the period of the memory array (170ij) on the first surface (0a).
6. The neural network processor (100) of claim 1, further characterized by: the neural computation circuit (180) includes at least one multiplier (732).
7. The neural network processor (100) of claim 1, further characterized by: the neural computation circuit (180) includes at least one multiplier-adder (732').
8. The neural network processor (100) of claim 1, further characterized in that the neural computation circuit (180) includes a read-only memory (ROM) that stores a look-up table (LUT) of an activation function.
9. The neural network processor (100) of claim 8, further characterized by: the ROM is a three-dimensional read-only memory (3D-ROM) array, the 3D-ROM array being stacked above the neural computation circuit (180).
10. The neural network processor (100) of claim 1, further characterized by: the inter-surface connections (160) are through-silicon vias (TSVs).
CN201910029526.5A 2016-03-21 2019-01-13 Double-sided neural network processor Pending CN111435459A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201910029526.5A CN111435459A (en) 2019-01-13 2019-01-13 Double-sided neural network processor
US16/249,112 US11055606B2 (en) 2016-03-21 2019-01-16 Vertically integrated neuro-processor
US17/227,323 US20210232892A1 (en) 2016-03-21 2021-04-11 Neuro-Processing Circuit Using Three-Dimensional Memory to Store Look-Up Table of Activation Function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910029526.5A CN111435459A (en) 2019-01-13 2019-01-13 Double-sided neural network processor

Publications (1)

Publication Number Publication Date
CN111435459A (en) 2020-07-21

Family

ID=71579830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910029526.5A Pending CN111435459A (en) 2016-03-21 2019-01-13 Double-sided neural network processor

Country Status (1)

Country Link
CN (1) CN111435459A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220704A (en) * 2016-03-21 2017-09-29 杭州海存信息技术有限公司 Integrated neural network processor containing three-dimensional memory array
CN107305594A (en) * 2016-04-22 2017-10-31 杭州海存信息技术有限公司 Processor containing three-dimensional memory array
CN108053848A (en) * 2018-01-02 2018-05-18 清华大学 Circuit structure and neural network chip


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination