CN107305594B - Processor containing three-dimensional memory array

Info

Publication number: CN107305594B
Application number: CN201710241669.3A
Authority: CN
Inventors: 张国飙; 沈忱
Original assignee: Hangzhou Haicun Information Technology Co Ltd
Current assignee: Hangzhou Haicun Information Technology Co Ltd
Priority date: 2016-04-22
Filing date: 2017-04-13
Publication date: 2021-01-08
Anticipated expiration: 2037-04-13
Also published as: CN107305594A

Abstract

The invention provides a three-dimensional processor containing a three-dimensional memory (3D-M) array. It employs memory-based computation (MBC) rather than logic-based computation (LBC). The three-dimensional processor contains a plurality of computing units, each of which contains an arithmetic logic circuit (ALC) and a 3D-M-based look-up table (3DM-LUT). The ALC performs arithmetic operations on the 3DM-LUT data, and the 3DM-LUT is stored in at least one 3D-M array. A programmable computing unit allows the computation to be customized in the field.

Description

Processor containing three-dimensional memory array

Technical Field

The present invention relates to the field of integrated circuits, and more particularly to processors.

Background

Conventional processors employ logic-based computation (LBC), in which computation is carried out mainly by logic circuits (e.g., NAND gates). Logic circuits are well suited to arithmetic operations (e.g., addition, subtraction, and multiplication) but are inefficient for non-arithmetic functions (e.g., elementary functions and special functions). Implementing non-arithmetic functions at high speed and high efficiency remains a significant challenge.

In a conventional processor, only a few basic non-arithmetic functions (e.g., basic algebraic functions and basic transcendental functions) can be implemented directly in hardware; these are called built-in functions. A built-in function is typically implemented by a combination of logic circuits and look-up tables (LUTs). There are many examples: U.S. Pat. No. 5,954,787 (inventor: Eun; granted September 21, 1999) discloses a method for implementing sine/cosine functions using LUTs; U.S. Pat. No. 9,207,910 (inventor: Azadet; granted December 8, 2015) discloses a method for implementing a power function using a LUT.

FIG. 1A illustrates how a built-in function is implemented. A conventional processor 300 typically contains a logic circuit 380 and a memory circuit 370. The logic circuit 380 contains an arithmetic logic unit (ALU) that performs arithmetic operations. The memory circuit 370 contains a LUT. To achieve a predetermined accuracy, the polynomial representing the built-in function must be expanded to a sufficiently high order. The LUT 370 stores the polynomial coefficients, and the ALU 380 evaluates the corresponding polynomial. Because the ALU 380 and the LUT 370 are arranged side by side in the same plane (both are formed in substrate 0), this planar integration is a two-dimensional integration.
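As a rough software analogy of this conventional scheme, the sketch below keeps a small table of polynomial coefficients and evaluates the polynomial with multiply-add steps. The choice of sin(x), the six-term Taylor expansion, and the names `SIN_COEFFS` and `sin_poly` are illustrative assumptions, not details taken from the cited processors.

```python
import math

# Hypothetical coefficient LUT: Taylor coefficients of sin(x) about 0
# for the terms x^1, x^3, ..., x^11 (an assumption for illustration).
SIN_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(6)]


def sin_poly(x: float) -> float:
    """Evaluate the odd polynomial with Horner's rule in x^2.

    Each loop iteration is one multiply-add, the kind of work the ALU
    performs on coefficients read from the LUT.
    """
    x2 = x * x
    acc = 0.0
    for c in reversed(SIN_COEFFS):
        acc = acc * x2 + c
    return acc * x
```

With so few stored values, accuracy has to come from the polynomial order, which is the trade-off the following paragraphs discuss.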

Computing is evolving toward higher computational density and greater computational complexity. Computational density is the computing power (e.g., the number of floating-point operations per second) per unit chip area, and it is an important metric for parallel computing. Computational complexity is the number of built-in functions supported by a chip, and it is an important metric for scientific computing. Two-dimensional integration limits further improvement of both.

With two-dimensional integration, the LUT 370 increases the chip area of the processor 300 and reduces its computational density, which is detrimental to parallel computing. In addition, because the ALU 380 is the core component of the processor 300 and occupies a large portion of the chip area, the LUT 370 is left with limited chip area and can support only a small number of built-in functions. FIG. 1B lists all built-in transcendental functions implemented by the Itanium (IA-64) processor from Intel Corporation (see Harrison et al., "The Computation of Transcendental Functions on the IA-64 Architecture", Intel Technology Journal, Q4, 1999). The IA-64 processor supports a total of 7 transcendental functions, each using a relatively small LUT (from 0 to 24 kb) and requiring a relatively high-order Taylor series (orders 5 to 22).

The set of built-in functions (about 10 built-in functions, including arithmetic operations) is the foundation of scientific computing. Scientific computing applies substantial computing power to improve the understanding of nature and society or to solve engineering problems, and it is widely used in computational mathematics, computational physics, computational chemistry, computational biology, engineering computing, computational economics, computational finance, and similar fields. The traditional scientific computing framework contains three layers: a base layer, a function layer, and a model layer. The base layer comprises the built-in functions that can be implemented directly in hardware; the function layer comprises the mathematical functions (e.g., non-basic non-arithmetic functions) that cannot be implemented directly in hardware; the model layer comprises the mathematical models that describe the behavior (e.g., input-output characteristics) of system components.

The mathematical functions in the function layer and the mathematical models in the model layer must be implemented in software. The function layer requires one software decomposition: software decomposes the mathematical function into a combination of built-in functions, which are then evaluated in hardware and combined by arithmetic operations. The model layer requires two software decompositions: the mathematical model is first decomposed into mathematical functions, and the mathematical functions are then decomposed into built-in functions. Software implementations (e.g., of mathematical functions and mathematical models) are slower and less efficient than hardware implementations (e.g., of built-in functions). Moreover, the more software decompositions are required (as with mathematical models), the greater the latency and power consumption.

The amount of computation required by a mathematical model is surprisingly large. FIGS. 2A-2B show a simple example, the simulation of an amplifier circuit 20. The amplifier circuit 20 includes a transistor 24 and a resistor 22 (FIG. 2A). All transistor models (e.g., MOS3, BSIM3 v3.2, BSIM4 v3.0, and PSP in FIG. 2B) are built on the set of built-in functions supported by the conventional processor 300. Because the variety of built-in functions is limited, computing even a single current point of the transistor 24 requires a large amount of calculation (FIG. 2B). For example, the BSIM4 v3.0 transistor model requires 222 additions, 286 multiplications, 85 divisions, 16 square-root operations, 24 exponential operations, and 19 logarithmic operations. Such a computational load makes simulation slow and inefficient.
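To put the quoted counts in one place, the short tally below sums them. The dictionary keys and variable names are illustrative; only the six counts come from the text above.

```python
# Operation counts quoted above for one current point of the BSIM4 v3.0 model.
BSIM4_OPS = {"add": 222, "mul": 286, "div": 85, "sqrt": 16, "exp": 24, "log": 19}

# Total number of primitive operations per current point.
total_ops = sum(BSIM4_OPS.values())

# Square-root, exponential, and logarithm calls; on a conventional processor
# each of these is itself evaluated through a polynomial expansion.
function_calls = BSIM4_OPS["sqrt"] + BSIM4_OPS["exp"] + BSIM4_OPS["log"]

print(total_ops, function_calls)
```

The tally covers a single current point; a circuit simulation repeats it for every transistor at every iteration.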

Disclosure of Invention

The main purpose of the invention is to promote the revolution of scientific computing.

It is a further object of the invention to provide a processor that enables a higher computational complexity.

It is another object of the invention to provide a processor with more built-in functions.

It is another object of the present invention to efficiently calculate non-arithmetic functions at high speed.

It is another object of the invention to enable high speed and efficient simulation and emulation.

It is a further object of the invention to provide a processor that enables a higher computational density.

To achieve these and other objects, the present invention provides a processor containing a three-dimensional memory (3D-M) array, referred to as a "three-dimensional processor" for short. It contains a plurality of computing units on a semiconductor substrate, each containing an arithmetic logic circuit (ALC) and a 3D-M-based look-up table circuit (3DM-LUT). The ALC is formed in the semiconductor substrate and performs arithmetic operations on the 3DM-LUT data. The 3DM-LUT is stored in at least one 3D-M array. The 3D-M array is stacked above the ALC and covers at least a portion of it. The 3D-M array is electrically coupled to the ALC through contact vias, which are collectively referred to as a three-dimensional interconnect.

The present invention also proposes memory-based computation (MBC), in which computation is carried out mainly by looking up a 3DM-LUT. Compared with the LUT used in conventional LBC, the 3DM-LUT used in MBC has a much larger storage capacity. Most MBCs still require arithmetic operations; however, by using a larger 3DM-LUT as the starting point, MBC needs fewer polynomial expansion terms. In MBC, the 3DM-LUT contributes more to the computation than the ALC does.

Since the 3DM-LUT is stacked on top of the ALC, this vertical integration is referred to as three-dimensional integration. Three-dimensional integration improves computational density. Because the 3D-M array does not occupy substrate area, the area of a computing unit is close to the area of its ALC, whereas the area of a conventional processor is the sum of the areas of the LUT and the ALU. Moving the LUT from beside the ALC to above it makes the computing unit smaller. The three-dimensional processor can therefore contain more computing units and support massively parallel computing.

Three-dimensional integration can also greatly increase computational complexity. The total capacity of all LUTs in a conventional processor is less than 100 kb, while the total capacity of all 3DM-LUTs in a three-dimensional processor can reach 100 Gb (e.g., a single 3D-XPoint die has a storage capacity of 128 Gb). Thus, a single three-dimensional processor chip can support up to ten thousand built-in functions, three orders of magnitude more than a conventional processor.
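The arithmetic behind these figures can be checked with a few lines. The sketch assumes decimal units (1 Gb = 10^9 bits), an even split of capacity across functions, and the figure of about 10 built-in functions for a conventional processor mentioned in the Background; the even split is an illustrative assumption rather than something the text states.

```python
import math

TOTAL_3DM_LUT_BITS = 100 * 10**9   # 100 Gb, decimal units assumed
BUILT_IN_FUNCTIONS_3D = 10_000     # "up to ten thousand" from the text
BUILT_IN_FUNCTIONS_2D = 10         # conventional figure from the Background

# Average LUT budget per function if the capacity were split evenly
# (an assumption made only for this estimate).
bits_per_function = TOTAL_3DM_LUT_BITS // BUILT_IN_FUNCTIONS_3D

# Ratio of supported built-in functions, in orders of magnitude.
orders_of_magnitude = math.log10(BUILT_IN_FUNCTIONS_3D / BUILT_IN_FUNCTIONS_2D)

print(bits_per_function, round(orders_of_magnitude))
```

Under these assumptions each function has roughly 10 Mb of table space available.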

The large increase in built-in functions will flatten the traditional scientific computing framework of base layer, function layer, and model layer. In the past, only the base layer could be implemented in hardware; now the mathematical functions of the function layer can be implemented directly in hardware, and the mathematical models of the model layer can also be described directly in hardware. At the function layer, a mathematical function is implemented by the function-by-LUT method (i.e., LUT look-up plus polynomial interpolation); at the model layer, a mathematical model is implemented by the model-by-LUT method (i.e., LUT look-up plus polynomial interpolation). The high-speed, high-efficiency implementation of mathematical functions and mathematical models can promote the revolution of scientific computing.

Accordingly, the invention proposes a processor (100) characterized in that it comprises: a semiconductor substrate (0); at least one computing unit (110-i) located on the semiconductor substrate (0), said computing unit (110-i) comprising an arithmetic logic circuit (ALC) (180) and a look-up table (3DM-LUT) (170) based on a three-dimensional memory (3D-M), wherein: the ALC (180) is located in the semiconductor substrate (0) and performs arithmetic operations on the data of the 3DM-LUT (170); said 3DM-LUT (170) is stored in at least one 3D-M array (170o …), the 3D-M array (170o …) being stacked above the ALC (180); the 3D-M array (170o …) and the ALC (180) are electrically coupled through a plurality of contact vias (1av, 3av).

The invention also proposes a three-dimensional processor (100) characterized in that it comprises: a semiconductor substrate (0); at least one computing unit (110-i) located on the semiconductor substrate (0), said computing unit (110-i) comprising an arithmetic logic circuit (ALC) (180) and a look-up table (3DM-LUT) (170) based on a three-dimensional memory (3D-M), wherein: the ALC (180) is located in the semiconductor substrate (0) and performs arithmetic operations on the data of the 3DM-LUT (170); said 3DM-LUT (170) is stored in at least one 3D-M array (170o …), the 3D-M array (170o …) being stacked above the ALC (180) and storing data related to a mathematical function; the 3D-M array (170o …) and the ALC (180) are electrically coupled through a plurality of contact vias (1av, 3av).

The invention also proposes a three-dimensional processor (100) characterized in that it comprises: a semiconductor substrate (0); at least one computing unit (110-i) located on the semiconductor substrate (0), said computing unit (110-i) comprising an arithmetic logic circuit (ALC) (180) and a look-up table (3DM-LUT) (170) based on a three-dimensional memory (3D-M), wherein: the ALC (180) is located in the semiconductor substrate (0) and performs arithmetic operations on the data of the 3DM-LUT (170); said 3DM-LUT (170) is stored in at least one 3D-M array (170o …), the 3D-M array (170o …) being stacked above the ALC (180) and storing data related to a mathematical model; the 3D-M array (170o …) and the ALC (180) are electrically coupled through a plurality of contact vias (1av, 3av).

Drawings

FIG. 1A is a perspective view of a conventional processor (prior art); FIG. 1B lists all transcendental functions supported by an Intel Itanium (IA-64) processor (prior art).

FIG. 2A is a circuit diagram of an amplifier circuit; FIG. 2B lists the amount of computation required by different transistor models to compute one current point (prior art).

FIG. 3A is a circuit block diagram of a three-dimensional processor; FIG. 3B is a circuit block diagram of a computing unit.

FIGS. 4A-4C are circuit block diagrams of three ALCs.

FIG. 5A is a cross-sectional view of a computing unit containing a three-dimensional writable memory (3D-W); FIG. 5B is a cross-sectional view of a computing unit containing a three-dimensional printed memory (3D-P); FIG. 5C is a perspective view of these computing units.

FIG. 6A is a schematic diagram of a memory cell containing a diode (or diode-like device); FIG. 6B is a schematic diagram of a memory cell containing a transistor (or transistor-like device).

FIGS. 7A-7C are substrate circuit layout diagrams of three computing units.

FIG. 8A is a circuit block diagram of a first computing unit; FIG. 8B is its substrate circuit layout diagram; FIG. 8C is a circuit diagram of one specific implementation of the computing unit.

FIG. 9A is a circuit block diagram of a second computing unit; FIG. 9B is its substrate circuit layout diagram.

FIG. 10A is a circuit block diagram of a third computing unit; FIG. 10B is its substrate circuit layout diagram.

It is noted that the figures are diagrammatic and not drawn to scale. Dimensions and structures of parts in the figures may be exaggerated or reduced for clarity and convenience. In different embodiments, alphabetic suffixes following numbers denote different instances of the same class of structure, and the same numerical prefixes refer to the same or similar structures. "/" indicates a relationship of "and" or "or".

In this specification, "memory" broadly refers to any semiconductor-based information storage device that can store information permanently or temporarily. "A circuit is located in or on a substrate" means that at least some of the functional devices (e.g., transistors) of the circuit are formed in the substrate (including on the surface of the substrate). "A circuit is located over a substrate" means that all of the functional devices (e.g., memory cells) of the circuit are formed above the substrate and are not in contact with it.

Detailed Description

FIG. 3A shows a three-dimensional processor 100 having N computing units 110-1, 110-2, … 110-i, … 110-N. These computing units 110-1 … 110-N may perform the same function or different functions. Each computing unit 110-i has one or more input variables 150 and one or more output values 190 (FIG. 3B). Each computing unit 110-i contains an arithmetic logic circuit (ALC) 180 and a 3D-M-based look-up table circuit (3DM-LUT) 170. The 3DM-LUT 170 comprises all look-up tables (LUTs) stored in the 3D-M arrays coupled to the ALC 180. The ALC 180 performs arithmetic operations on the 3DM-LUT data. The 3DM-LUT 170 and the ALC 180 are electrically coupled through a three-dimensional interconnect 160. Since the 3DM-LUT 170 and the ALC 180 are located on different physical layers (see FIGS. 5A-5C for details), the 3DM-LUT 170 is drawn with dotted lines.

The three-dimensional processor 100 employs memory-based computation (MBC), in which computation is carried out mainly by looking up the 3DM-LUT 170. Compared with the LUT 370 used in conventional LBC, the 3DM-LUT 170 used in MBC has a much larger storage capacity. Most MBCs still require arithmetic operations; however, by using a larger 3DM-LUT as the starting point, MBC needs fewer polynomial expansion terms. In MBC, the 3DM-LUT 170 contributes more to the computation than the ALC 180 does.

FIGS. 4A-4C show three ALCs 180. The ALC 180 of FIG. 4A is an adder 180A; the ALC 180 of FIG. 4B is a multiplier 180M; the ALC 180 of FIG. 4C is a multiply-accumulator (MAC) that includes an adder 180A and a multiplier 180M. The ALC 180 may implement integer arithmetic, fixed-point arithmetic, or floating-point arithmetic. As will be apparent to those skilled in the art, the ALC 180 may also contain storage circuitry such as registers, flip-flops, and buffer RAM.
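The three ALC variants can be modeled in software as follows. This is a minimal sketch assuming unsigned integer arithmetic with wraparound at a fixed word width; the 32-bit width and the function names are assumptions, since the text leaves the number format open.

```python
WIDTH = 32                 # assumed word width
MASK = (1 << WIDTH) - 1    # keeps results within WIDTH bits


def adder(a: int, b: int) -> int:
    """Model of adder 180A: unsigned add with wraparound."""
    return (a + b) & MASK


def multiplier(a: int, b: int) -> int:
    """Model of multiplier 180M: unsigned multiply, low WIDTH bits kept."""
    return (a * b) & MASK


def mac(acc: int, a: int, b: int) -> int:
    """Model of the MAC of FIG. 4C: acc + a * b, built from the two blocks."""
    return adder(acc, multiplier(a, b))
```

For example, `mac(1, 2, 3)` combines one multiplication and one addition, which is the operation the first-order interpolation in later embodiments relies on.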

The 3D-M in the computing unit 110-i can take various forms. U.S. Pat. No. 5,835,396 (inventor: Zhang Guobiao; granted November 10, 1998) discloses 3D-M in detail. A 3D-M contains a plurality of memory cells stacked vertically above a semiconductor substrate.

The 3D-M is classified into a 3D-RAM (three-dimensional random access memory) and a 3D-ROM (three-dimensional read only memory). In this specification, RAM broadly refers to any semiconductor memory that temporarily stores information, including but not limited to registers, SRAM, and DRAM; ROM broadly refers to any semiconductor memory that permanently stores information, and which may be electrically programmed or non-electrically programmed. Most 3D-M are 3D-ROMs. The 3D-ROM is further classified into a three-dimensional writable memory (referred to as 3D-W) and a three-dimensional printed memory (referred to as 3D-P).

The information stored in a 3D-W is entered by electrical programming. According to the number of times it can be programmed, the 3D-W is further divided into three-dimensional one-time-programmable memory (3D-OTP) and three-dimensional multi-time-programmable memory (3D-MTP). As the names imply, a 3D-OTP can be written only once, whereas a 3D-MTP can be written many times (including reprogramming). One common 3D-MTP is 3D-XPoint. Other 3D-MTPs include memristors, resistive random-access memory (RRAM), phase-change memory (PCM), programmable metallization cells (PMC), conductive-bridging random-access memory (CBRAM), and the like. With a 3D-W, the 3DM-LUT can be programmed in the field. A 3D-MTP is preferable because it also allows the 3DM-LUT to be reprogrammed.

The information stored in a 3D-P is recorded by a printing method during factory production. This information is permanently fixed and cannot be changed after shipment. The printing method may be photolithography, nano-imprint, e-beam lithography, DUV lithography, laser patterning, or the like. A common 3D-P is the three-dimensional mask-programmed read-only memory (3D-MPROM), in which data are recorded through a mask by photolithography. Because it has no electrical programming requirement, a 3D-P memory cell can be biased at a higher voltage during reading. Therefore, a 3D-P reads faster than a 3D-W.

FIG. 5A shows a computing unit 110-i that contains a 3D-W. The computing unit 110-i contains a substrate circuit layer 0K formed in the substrate 0. The ALC 180 is formed in the substrate circuit layer 0K. A memory layer 16A is stacked above the substrate circuit layer 0K, and a memory layer 16B is stacked above the memory layer 16A. The substrate circuit layer 0K contains the peripheral circuits of the memory layers 16A, 16B, which include transistors 0t and interconnect lines 0M. Each memory layer (e.g., 16A) has a plurality of first address lines (e.g., 2a, in the y-direction), a plurality of second address lines (e.g., 1a, in the x-direction), and a plurality of 3D-W memory cells (e.g., 6aa). The memory layers 16A, 16B are coupled to the ALC 180 through contact vias 1av, 3av, respectively. The LUTs stored in all of the 3D-M arrays in the memory layers 16A, 16B that are coupled to the ALC 180 are collectively referred to as the 3DM-LUT 170. Since the contact vias 1av, 3av electrically couple the 3DM-LUT 170 and the ALC 180, they are collectively referred to as the three-dimensional interconnect 160.

The 3D-W memory cell 6aa contains a programming film 12 and a diode film 14. The programming film 12 may be an antifuse film (write-once, for 3D-OTP) or another, multi-time-programmable film (for 3D-MTP). The diode film 14 has the following general characteristic: under the read voltage, its resistance is small; when the applied voltage is smaller in magnitude than the read voltage, or opposite in polarity to it, its resistance is large. The diode film may be a P-i-N diode, a metal-oxide (e.g., TiO₂) diode, or the like.

FIG. 5B shows a computing unit 110-i that contains a 3D-P array. It is similar to FIG. 5A except that the memory cells are different. The 3D-P contains at least two types of memory cells 5aa, 5ac: memory cell 5aa is a high-resistance memory cell, and memory cell 5ac is a low-resistance memory cell. The low-resistance memory cell 5ac contains a diode film 14. The high-resistance memory cell 5aa additionally contains a high-resistance film 13, which is an insulating film (e.g., silicon oxide or silicon nitride). During production, the high-resistance film 13 is physically removed at the location of the low-resistance memory cell 5ac.

FIG. 5C shows the structure of the computing unit 110-i from another perspective. The 3DM-LUT 170 is stacked above the ALC 180, which is located in the substrate 0 and is at least partially covered by the 3DM-LUT 170. They are electrically coupled to each other through a plurality of contact vias 1av, 3av. Three-dimensional integration brings the 3DM-LUT 170 and the ALC 180 closer together. Because the contact vias 1av, 3av are numerous (at least thousands) and short (on the order of microns), the bandwidth of the three-dimensional interconnect 160 is much higher than that of the interconnect in the conventional processor 300. In the conventional processor 300, two-dimensional integration places the ALU 380 and the LUT 370 side by side on the substrate 0, so the interconnects between them are limited in number (at most hundreds) and long (on the order of hundreds of microns).

FIG. 6A shows a memory cell 5ab that contains a variable resistor 12 and a diode (or diode-like device) 14. The variable resistor 12 is implemented by the programming film 12 of FIG. 5A, and its resistance can be set before or after factory shipment. The diode (or diode-like device) 14 is implemented by the diode film 14 of FIG. 5A.

FIG. 6B shows a memory cell 5ab that contains a transistor (or transistor-like device) 16. The transistor (or transistor-like device) 16 is a device with three or more terminals and the following general characteristic: the resistance between its first and second terminals can be modulated by an electrical signal on the third terminal. In this embodiment, the transistor 16 also includes a floating gate 18 for storing charge. This charge represents the information stored in the memory cell 5ab. As those skilled in the art will appreciate, the transistors 16 may form a NOR array or a NAND array. Based on the direction of current flow in the transistor (from the first terminal to the second terminal), the 3D-M can be divided into lateral 3D-M (i.e., current flows horizontally, such as 3D-XPoint) and vertical 3D-M (i.e., current flows vertically, such as 3D-NAND).

FIGS. 7A-7C disclose three computing units 110-i. In FIG. 7A, the ALC 180 is coupled to only one 3D-M array 170o and performs arithmetic operations on the data from that array. The 3DM-LUT 170 is stored in the 3D-M array 170o. The ALC 180 is located between the four peripheral circuits of the 3D-M array 170o (X-decoders 15o, 15o' and Y-decoders 17o, 17o') and is covered by the 3D-M array 170o. In FIG. 7A and subsequent figures, since the 3D-M array is located above the substrate circuit layer 0K rather than in it, only its projection onto the substrate 0 is shown, drawn with dotted lines.

In FIG. 7B, the ALC 180 is coupled to four 3D-M arrays 170a-170d and performs arithmetic operations on the data from them. The 3DM-LUT 170 is stored in the four 3D-M arrays 170a-170d. Unlike FIG. 7A, each 3D-M array (e.g., 170a) has only two peripheral circuits (e.g., X-decoder 15a and Y-decoder 17a). The ALC 180 is located between the eight peripheral circuits (X-decoders 15a-15d and Y-decoders 17a-17d) and is covered by the four 3D-M arrays 170a-170d. The ALC 180 of FIG. 7B may be four times larger than that of FIG. 7A.

In FIG. 7C, the ALC 180 is coupled to eight 3D-M arrays 170a-170d, 170w-170z and performs arithmetic operations on the data from them. The 3DM-LUT 170 is stored in the eight 3D-M arrays 170a-170d, 170w-170z, which are divided into two groups 150a, 150b. Each group (e.g., 150a) includes four 3D-M arrays (e.g., 170a-170d). A first ALC component 180a is formed beneath the four 3D-M arrays 170a-170d of the first group 150a. Similarly, a second ALC component 180b is formed beneath the four 3D-M arrays 170w-170z of the second group 150b. The first ALC component 180a and the second ALC component 180b together constitute the ALC 180. In this embodiment, a gap G is left between adjacent peripheral circuits (e.g., between adjacent X-decoders 15a, 15c; between adjacent Y-decoders 17a, 17b; and between adjacent Y-decoders 17c, 17d) to form routing channels 182, 184, 186 for communication between different ALC components or between different ALCs. The ALC 180 of FIG. 7C may be eight times larger than that of FIG. 7A.

Since the 3DM-LUT 170 is stacked above the ALC 180, this vertical integration is referred to as three-dimensional integration. Three-dimensional integration improves computational density. Because the 3DM-LUT 170 does not occupy substrate area, the area of the computing unit 110-i is close to the area of the ALC 180, whereas the area of the conventional processor 300 is the sum of the areas of the LUT 370 and the ALU 380. Moving the LUT from beside the ALC to above it makes the computing unit smaller. The three-dimensional processor 100 can therefore contain more computing units 110-i and support massively parallel computing.

Three-dimensional integration can also greatly increase computational complexity. The total capacity of all LUTs 370 in the conventional processor 300 is less than 100 kb, while the total capacity of all 3DM-LUTs 170 in the three-dimensional processor 100 can reach 100 Gb. Thus, a single three-dimensional processor chip 100 can support up to ten thousand built-in functions, three orders of magnitude more than the conventional processor 300.

The large increase in built-in functions will flatten the traditional scientific computing framework of base layer, function layer, and model layer. In the past, only the base layer could be implemented in hardware; now the mathematical functions of the function layer can be implemented directly in hardware, and the mathematical models of the model layer can also be described directly in hardware. At the function layer, a mathematical function is implemented by the function-by-LUT method (i.e., LUT look-up plus polynomial interpolation, FIGS. 8A-9B); at the model layer, a mathematical model is implemented by the model-by-LUT method (i.e., LUT look-up plus polynomial interpolation, FIGS. 10A-10B). The high-speed, high-efficiency implementation of mathematical functions and mathematical models can promote the revolution of scientific computing.

FIGS. 8A-8C show a first computing unit 110-i. This computing unit 110-i implements the built-in function Y = f(X) by the function-by-LUT method. FIG. 8A is its circuit block diagram. The computing unit contains a pre-processing circuit 180R, a 3DM-LUT 170P, and a post-processing circuit 180T; the two processing circuits belong to the ALC 180. The pre-processing circuit 180R converts the input variable (X) 150 into the address (A) of the 3DM-LUT 170P. After the data (D) at address (A) are read out of the 3DM-LUT 170P, the post-processing circuit 180T converts them into the function value (Y) 190. To improve the calculation accuracy, the residue (R) of the input variable (X) is also sent to the post-processing circuit 180T.

FIG. 8B is its substrate circuit layout diagram. The 3DM-LUT 170P is stored in a 3D-M array 170p. The 3D-M array 170p also has an X-decoder 15p and a Y-decoder 17p. The 3D-M array 170p covers the pre-processing circuit 180R and the post-processing circuit 180T. Although this embodiment has only one 3D-M array 170p, the computing unit 110-i may have multiple 3D-M arrays (similar to FIGS. 7B-7C). Since the 3DM-LUT 170P does not occupy substrate area, the three-dimensional integration of the 3DM-LUT 170P with the pre-processing circuit 180R and the post-processing circuit 180T gives the computing unit 110-i a smaller area.

FIG. 8C shows a computing unit 110-i that implements the single-precision built-in function Y = f(X). The input variable X 150 has 32 bits (x₃₁ … x₀). The pre-processing circuit 180R extracts the upper 16 bits (x₃₁ … x₁₆) as the 16-bit address A of the 3DM-LUT 170P and sends the lower 16 bits (x₁₅ … x₀) to the post-processing circuit 180T as the 16-bit residue R. The 3DM-LUT 170P contains two 3DM-LUTs 170Q, 170R. Each has a capacity of 2 Mb (16-bit input, 32-bit output). The 3DM-LUT 170Q stores the function value D1 = f(A), and the 3DM-LUT 170R stores the first-derivative value D2 = f'(A). The post-processing circuit 180T contains a multiplier 180M and an adder 180A. The 32-bit output value (Y) 190 is calculated by polynomial interpolation. In this embodiment, the polynomial interpolation is a first-order Taylor series: Y(X) = D1 + D2 × R = f(A) + f'(A) × R. Higher-order polynomial interpolation (e.g., a higher-order Taylor series) can further improve the calculation accuracy.
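The data path of FIG. 8C can be sketched in software as below. The sketch makes several assumptions that the text does not specify: the 32-bit input word is read as an unsigned fixed-point fraction in [0, 1), the tables hold Python floats rather than 32-bit words, and the residue is rescaled to the same units as the address before the multiplication. The names `build_luts` and `evaluate` are hypothetical.

```python
import math

ADDR_BITS = 16   # upper 16 bits of X form the address A
RES_BITS = 16    # lower 16 bits of X form the residue R


def build_luts(f, df):
    """Fill two tables: function values (like 170Q) and first derivatives (like 170R)."""
    n = 1 << ADDR_BITS
    lut_q = [f(a / n) for a in range(n)]    # D1 = f(A)
    lut_r = [df(a / n) for a in range(n)]   # D2 = f'(A)
    return lut_q, lut_r


def evaluate(x_word, lut_q, lut_r):
    """First-order Taylor interpolation: Y = D1 + D2 * R."""
    address = x_word >> RES_BITS                # pre-processing: A
    residue = x_word & ((1 << RES_BITS) - 1)    # pre-processing: R
    d1 = lut_q[address]                         # look-up in the value table
    d2 = lut_r[address]                         # look-up in the derivative table
    # Post-processing: one multiply and one add. The residue is scaled so
    # that it is expressed in the same units as the argument of f.
    return d1 + d2 * (residue / (1 << (ADDR_BITS + RES_BITS)))
```

With `f = math.sin` and `df = math.cos`, the truncation error of the first-order term is bounded by about half the square of the address spacing, i.e. roughly 2^-33 for this range.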

Combining a LUT with polynomial interpolation when implementing a built-in function achieves higher calculation accuracy with a smaller LUT. If the single-precision function above (32-bit input, 32-bit output) were implemented with a LUT alone (without polynomial interpolation), the LUT would need a capacity of 2³² × 32 b = 128 Gb. Implementing a function with such a large LUT is not practical. Polynomial interpolation greatly reduces the required LUT capacity. In the embodiment above, once the first-order Taylor series is used, the LUTs require only 4 Mb in total (2 Mb for the function-value LUT and 2 Mb for the first-derivative LUT). This is far less than with a LUT alone (4 Mb vs. 128 Gb).
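The capacity comparison can be reproduced with a short calculation. The helper name `lut_bits` is illustrative, and binary units (1 Mb = 2^20 bits, 1 Gb = 2^30 bits) are assumed so that the figures match those quoted above.

```python
MBIT = 1 << 20   # binary megabit, assumed unit
GBIT = 1 << 30   # binary gigabit, assumed unit


def lut_bits(address_bits: int, word_bits: int) -> int:
    """Capacity of a LUT with 2^address_bits entries of word_bits each."""
    return (1 << address_bits) * word_bits


# LUT alone: every 32-bit input has its own 32-bit entry.
direct_bits = lut_bits(32, 32)

# LUT plus first-order Taylor series: two tables with 16-bit addresses.
interpolated_bits = 2 * lut_bits(16, 32)

# How many times smaller the interpolated scheme is.
reduction = direct_bits // interpolated_bits

print(direct_bits // GBIT, interpolated_bits // MBIT, reduction)
```

The reduction factor is 2^15, which is the price paid for one multiplication and one addition per evaluation.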

In addition to elementary functions, the embodiment of FIG. 8C can implement various higher-level functions, such as special functions. Special functions play an important role in mathematical analysis, functional analysis, physics research, and engineering applications. Many special functions are solutions of differential equations or integrals of elementary functions. Examples of special functions include the gamma function, beta function, Bessel functions, Legendre functions, elliptic functions, Lamé functions, Mathieu functions, the Riemann zeta function, and Fresnel integrals. The advent of three-dimensional processors will simplify the computation of special functions and facilitate their use in scientific computing.

Fig. 9A-9B show a second type of computing unit 110-i. The computing unit 110-i implements the composite function Y = exp[K × log(X)] = X^K. It adopts the function-by-LUT method. Fig. 9A is its circuit block diagram. The computing unit 110-i contains two 3DM-LUTs 170S, 170T and a multiplier 180M. The 3DM-LUT 170S stores the function values of Log() and the 3DM-LUT 170T stores the function values of Exp(). The input variable X is used as the address 150 of the 3DM-LUT 170S. The output Log(X) 160s of the 3DM-LUT 170S is multiplied by the exponent parameter K at the multiplier 180M, and the product 160T is sent as an address to the 3DM-LUT 170T. The output 190 of the 3DM-LUT 170T is Y = X^K.
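The Log-LUT, multiplier, Exp-LUT chain can be mimicked in software as below. This is only a sketch under stated assumptions: the input range [1, 2] for X, the range [0, 4] for the product K × log(X), the 16-bit address width, and the names `to_addr`, `from_addr`, `LOG_LUT`, `EXP_LUT` and `power_by_lut` are all chosen for illustration and do not come from the source.

```python
import math

ADDR_BITS = 16
N = 1 << ADDR_BITS

X_MIN, X_MAX = 1.0, 2.0   # assumed range of the input X
T_MIN, T_MAX = 0.0, 4.0   # assumed range of the product K * log(X)


def to_addr(value: float, lo: float, hi: float) -> int:
    """Quantize a real value in [lo, hi] to the nearest LUT address (clamped)."""
    addr = round((value - lo) / (hi - lo) * (N - 1))
    return min(max(addr, 0), N - 1)


def from_addr(addr: int, lo: float, hi: float) -> float:
    """Real value represented by a LUT address."""
    return lo + addr * (hi - lo) / (N - 1)


# Stand-ins for 3DM-LUT 170S (Log) and 3DM-LUT 170T (Exp).
LOG_LUT = [math.log(from_addr(a, X_MIN, X_MAX)) for a in range(N)]
EXP_LUT = [math.exp(from_addr(a, T_MIN, T_MAX)) for a in range(N)]


def power_by_lut(x: float, k: float) -> float:
    """Approximate x**k as Exp-LUT[k * Log-LUT[x]]."""
    log_x = LOG_LUT[to_addr(x, X_MIN, X_MAX)]       # first look-up
    product = k * log_x                             # multiplier
    return EXP_LUT[to_addr(product, T_MIN, T_MAX)]  # second look-up


print(power_by_lut(1.5, 2.0), 1.5 ** 2.0)
```

The accuracy of such a chain is limited by the address quantization of both tables; no interpolation is applied in this sketch.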

Fig. 9B is its substrate circuit layout diagram. The substrate circuit 0K contains the X decoders 15s, 15t and Y decoders 17s, 17t of the 3D-M arrays 170s, 170t, as well as the multiplier 180M. The 3D-M arrays 170s, 170t cover at least part of the multiplier 180M. Note that the embodiments of Fig. 8C and Fig. 9A both use two 3DM-LUTs. These 3DM-LUTs may be stored in the same 3D-M array 170p (Fig. 8B), in two 3D-M arrays 170s, 170t arranged side by side (Fig. 9B), or in two vertically stacked 3D-M arrays (formed in the storage layers 16A, 16B of Fig. 5A-5C, respectively). Of course, the 3DM-LUTs may also be formed in more 3D-M arrays.

Fig. 10A-10B show a third type of computing unit 110-i. The computing unit 110-i simulates the amplifier circuit 20 (Fig. 2A) and adopts the model-by-LUT method. Fig. 10A is its circuit block diagram. It contains a 3DM-LUT 170U, an adder 180A and a multiplier 180M. The 3DM-LUT 170U stores data related to the behavior (e.g., input-output characteristics) of the transistor 24. The input voltage V_IN is used as the address 150 of the 3DM-LUT 170U, and the read data 160 is the drain current I_D. The multiplier 180M multiplies I_D by the negative value (-R) of the resistor 22, and the result (-R × I_D) is added to the supply voltage V_DD at the adder 180A to obtain the output voltage value V_OUT 190.
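The data path V_OUT = V_DD + (-R × I_D), with I_D read from a table addressed by V_IN, can be sketched as follows. The table contents here are not measured data: they are filled from a hypothetical square-law expression purely as a placeholder, and the supply voltage, resistance, address width and all names (`placeholder_drain_current`, `ID_LUT`, `amplifier_output`) are assumptions made for the illustration.

```python
ADDR_BITS = 16
N = 1 << ADDR_BITS

V_DD = 3.3        # assumed supply voltage in volts
R_LOAD = 1000.0   # assumed resistance of resistor 22 in ohms
V_STEP = V_DD / (N - 1)   # input-voltage step per LUT address


def placeholder_drain_current(v_gs: float, k: float = 0.5e-3, v_t: float = 0.7) -> float:
    """Hypothetical square-law stand-in for measured I_D-V_GS data (amperes)."""
    return 0.5 * k * (v_gs - v_t) ** 2 if v_gs > v_t else 0.0


# Stand-in for 3DM-LUT 170U: one I_D entry per quantized input voltage.
ID_LUT = [placeholder_drain_current(a * V_STEP) for a in range(N)]


def amplifier_output(v_in: float) -> float:
    """Model-by-LUT evaluation: look up I_D, multiply by -R, add V_DD."""
    addr = min(max(round(v_in / V_STEP), 0), N - 1)  # V_IN used as the address
    i_d = ID_LUT[addr]                               # read data
    product = -R_LOAD * i_d                          # multiplier
    return V_DD + product                            # adder


print(amplifier_output(1.7))
```

In a real use of this approach the table would be filled with raw or smoothed measurement data instead of the placeholder expression.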

The 3DM-LUT 170U may store a variety of mathematical models. In one embodiment, the model data stored in the 3DM-LUT is raw measurement data, such as measured input-output characteristics. An example is the drain current vs. gate-source voltage (I_D-V_GS) characteristic curve of a transistor. In another embodiment, the model data stored in the 3DM-LUT is smoothed measurement data. Raw measurement data can be smoothed by purely mathematical methods (e.g., a best-fit model) or with a physical model (e.g., the BSIM4 V3.0 transistor model). In a third embodiment, the model data stored in the 3DM-LUT contains not only the measured values of the transistor but also the derivatives of the measured values. For example, the model data includes not only the drain-current values of the transistor 24 (I_D-V_GS) but also its transconductance values (G_m-V_GS). Similar to Fig. 8C, polynomial interpolation (using the derivatives of the measured values) can improve model accuracy with a 3DM-LUT of reasonable size.

Fig. 10B is a substrate circuit layout diagram of the third computing unit 110-i. The substrate circuit 0K contains the X decoder 15u and Y decoder 17u of the 3D-M array 170u, as well as the multiplier 180M and the adder 180A. The 3D-M array 170u covers the multiplier 180M and the adder 180A. Although this figure shows only one 3D-M array 170u, multiple 3D-M arrays (e.g., Fig. 7B-7C) may be used with this embodiment.

Model-by-LUT brings many advantages. Because the two software decompositions (from mathematical model to mathematical functions, and from mathematical functions to built-in functions) are not needed, it saves considerable computation time and energy. Model-by-LUT requires even fewer LUTs than function-by-LUT. Since transistor models (such as BSIM4 V3.0) involve hundreds of model parameters, the function-by-LUT method would need a large number of LUTs to calculate the intermediate functions of the transistor model. If function-by-LUT is skipped (i.e., the transistor model and its associated intermediate functions are skipped), the transistor behavior can be described by three parameters (the gate-source voltage V_GS, the drain-source voltage V_DS, and the body-source voltage V_BS). Thus a smaller LUT suffices to describe the mathematical model of the transistor.
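One way to picture a single LUT addressed by three parameters is to quantize each voltage to a few bits and concatenate the codes into one address. The source does not specify how such a table would be addressed, so the scheme below, the 6-bit width per parameter, and the names `pack_address` and `unpack_address` are purely hypothetical.

```python
BITS = 6                      # assumed bits per quantized parameter (hypothetical)
MASK = (1 << BITS) - 1
ENTRIES = 1 << (3 * BITS)     # total entries in the three-parameter table


def pack_address(gs_code: int, ds_code: int, bs_code: int) -> int:
    """Concatenate the quantized V_GS, V_DS and V_BS codes into one address."""
    for code in (gs_code, ds_code, bs_code):
        if not 0 <= code <= MASK:
            raise ValueError("code out of range")
    return (gs_code << (2 * BITS)) | (ds_code << BITS) | bs_code


def unpack_address(addr: int) -> tuple[int, int, int]:
    """Recover the three quantized codes from a table address."""
    return (addr >> (2 * BITS)) & MASK, (addr >> BITS) & MASK, addr & MASK


addr = pack_address(1, 2, 3)
print(addr, unpack_address(addr), ENTRIES)
```

With this layout the table size grows as 2^(3 × BITS), so the bit width per parameter directly trades table capacity against resolution.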

It will be understood that changes in form and detail may be made without departing from the spirit and scope of the invention, and that such changes do not impede the practice of the invention. For example, the processor may be a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), a network security processor, an encryption/decryption processor, an encoding/decoding processor, a neural network processor, an artificial intelligence (AI) processor, and the like. The invention, therefore, is not to be restricted except in the spirit of the appended claims.

Claims

1. A three-dimensional processor (100) comprising a three-dimensional memory 3D-M array, characterized by a semiconductor substrate (0) and a plurality of computational cells (110-1 … 110-i … 110-N), each of said computational cells (110-i) comprising:

a 3D-M based lookup table, 3DM-LUT (170), the 3DM-LUT (170) stored in at least one 3D-M array (170o …) and storing at least a partial lookup table of a mathematical function, the 3D-M array (170o …) containing a plurality of vertically stacked memory cells, all of the memory cells not in contact with any semiconductor substrate;

an arithmetic logic circuit ALC (180), said ALC (180) performing arithmetic operations on the look-up table data stored in said 3DM-LUT (170), said ALC (180) located on said semiconductor substrate (0) and contacting said semiconductor substrate (0);

a plurality of contact via holes (1av, 3av), the contact via holes (1av, 3av) electrically coupling the 3D-M array (170o …) and the ALC (180), all of the contact via holes (1av, 3av) located entirely between the 3D-M array (170o …) and the semiconductor substrate (0) and not penetrating any semiconductor substrate;

the 3D-M array (170o …) is a three-dimensional writable storage 3D-W array.

2. The three-dimensional processor (100) of claim 1, further characterized by: the look-up table data of the mathematical function comprises function values of the mathematical function and/or derivative values of the mathematical function.

3. A three-dimensional processor (100) comprising a three-dimensional memory 3D-M array, characterized by a semiconductor substrate (0) and a plurality of computational cells (110-1 … 110-i … 110-N), each of said computational cells (110-i) comprising:

a 3D-M based lookup table, 3DM-LUT (170), the 3DM-LUT (170) stored in at least one 3D-M array (170o …) and storing at least a partial lookup table of a mathematical model, the 3D-M array (170o …) containing a plurality of vertically stacked memory cells, all of the memory cells not in contact with any semiconductor substrate;

the 3D-M array (170o …) is a three-dimensional writable storage 3D-W array.

4. The three-dimensional processor (100) of claim 3, further characterized by: the look-up table data of the mathematical model includes raw measurement data and/or smoothed measurement data.

5. The three-dimensional processor (100) of any of claims 1-4, further characterized in that the ALC (180) has one of the following A) -C) features:

A) comprises an adder, a multiplier, or a multiplier-adder;

B) realizing integer arithmetic, fixed point arithmetic or floating point arithmetic;

C) includes a pre-processing circuit (180R), and/or a post-processing circuit (180T).

6. The three-dimensional processor (100) of any of claims 1-4, further characterized in that the 3D-M array has one of the following D) -E) characteristics:

D) the memory cells in the 3D-M array (170o …) contain at least one diode or diode-like device (14);

E) the memory cells in the 3D-M array (170o …) include at least one transistor or transistor-like device (16).

7. The three-dimensional processor (100) of any of claims 1-4, further characterized by: the 3D-M array (170o …) covers at least part of the ALC (180).

8. The three-dimensional processor (100) of any of claims 1-4, further characterized by: each of the computational cells (110-i) contains at least two 3D-M arrays stacked above the ALC (180), the two 3D-M arrays being arranged side by side.

9. The three-dimensional processor (100) of any of claims 1-4, further characterized by: each of the computational cells (110-i) contains at least two 3D-M arrays stacked above the ALC (180), the two 3D-M arrays being vertically stacked.