CN111326191A

CN111326191A - Processor containing three-dimensional vertical memory array

Info

Publication number: CN111326191A
Application number: CN201910029530.1A
Authority: CN
Inventors: 张国飙
Original assignee: Hangzhou Haicun Information Technology Co Ltd
Current assignee: Hangzhou Haicun Information Technology Co Ltd
Priority date: 2018-12-13
Filing date: 2019-01-14
Publication date: 2020-06-23

Abstract

The three-dimensional processor chip (100) comprises a plurality of computing units (100aa-100mn), each computing unit (100ij) comprising at least one three-dimensional vertical memory (3D-M_V) array (170) and an arithmetic logic circuit (ALC) (180). The 3D-M_V array (170) stores at least part of a look-up table (LUT) of at least one non-arithmetic function or non-arithmetic model, and the ALC (180) performs arithmetic operations on at least a portion of the data in the LUT. The ALC (180) is located in the semiconductor substrate (0); the 3D-M_V array (170) is stacked above the ALC (180) and electrically coupled to it through a plurality of on-chip connections (160).

Description

Processor containing three-dimensional vertical memory array

Technical Field

The present invention relates to the field of integrated circuits, and more particularly to processors.

Background

To implement mathematical calculations, conventional processors employ logic-based computation (LBC), in which computation is carried out mainly by logic circuits (i.e., arithmetic logic units, ALUs). In fact, the only arithmetic operations that an ALU can implement directly are addition, subtraction and multiplication, which are collectively referred to as basic arithmetic operations. An ALU is suitable for implementing arithmetic functions, but cannot implement non-arithmetic functions. In a processor implementing mathematical calculations, an arithmetic function is a mathematical function that can be expressed as a combination of its basic arithmetic operations, while a non-arithmetic function is a mathematical function that cannot be so expressed. Examples of non-arithmetic functions include transcendental functions, special functions, and the like. Because non-arithmetic functions contain more operations than the ALU supports and cannot be implemented by the ALU alone, their hardware implementation has long faced significant challenges.

In a conventional processor, only a few basic functions (i.e., single-variable non-arithmetic functions, including basic algebraic functions, basic transcendental functions, etc.) can be directly implemented in hardware; these are called built-in functions. A built-in function is typically implemented by a combination of logic circuits and look-up tables (LUTs). There are numerous existing techniques for implementing built-in functions. For example, U.S. Pat. No. 5,954,787 (inventor: Eun; grant date: September 21, 1999) discloses a method for implementing sine/cosine (SIN/COS) functions using LUTs; U.S. Pat. No. 9,207,910 (inventor: Azadet; grant date: December 8, 2015) discloses a method for implementing a power function using a LUT.

FIG. 1AA describes in detail how a built-in function is implemented. A conventional processor 0X typically contains logic circuitry 00L and memory circuitry 00M. The logic circuit 00L includes an ALU, which implements arithmetic operations. The memory circuit 00M stores the LUTs of functions. To achieve a predetermined accuracy, the polynomial representing the built-in function needs to be expanded to a sufficiently high order. The memory circuit 00M stores the polynomial coefficients, and ALU 00L calculates the corresponding polynomial. Since the ALU 00L and the memory circuit 00M are arranged side by side on the same plane (both formed in the substrate 00S), this planar integration is a two-dimensional integration.

Computing is currently evolving towards higher computational density and greater computational complexity. Computational density refers to the computing capacity per unit chip area (e.g., the number of floating-point operations per second per unit area) and is an important metric for parallel computing. Computational complexity refers to the number of built-in functions supported by a chip and is an important metric for scientific computing. Two-dimensional integration limits further improvement of both.

Because of two-dimensional integration, introducing the memory circuit 00M increases the chip area of the processor 0X and reduces its computational density, which is detrimental to parallel computing. Further, ALU 00L is the core component of processor 0X and occupies most of the chip area, so the memory circuit 00M has limited chip area available and can support only a small number of built-in functions. FIG. 1AB lists all built-in transcendental functions that can be implemented by Intel's IA-64 processor (see Harrison et al., "The Computation of Transcendental Functions on the IA-64 Architecture," Intel Technology Journal, Q4 1999). The IA-64 processor supports only seven built-in functions in total. So small a set of built-in functions is highly detrimental to mathematical computation, since most mathematical functions must be decomposed by software into a combination of built-in functions. The conventional processor 0X is therefore slow and inefficient for most mathematical computations.

Another important application of the processor is computer simulation, i.e., the calculation of mathematical models. Computer simulation is a natural extension of mathematical computation and is built on the set of built-in functions (only about ten) supported by a conventional processor. Conventional computer simulation contains three layers: a base layer, a function layer, and a model layer. The base layer comprises the built-in functions, which can be directly implemented in hardware; the function layer comprises mathematical functions that cannot be directly implemented in hardware; the model layer contains mathematical models that describe the behavior (e.g., input-output characteristics) of various system components.

The mathematical functions in the function layer and the mathematical models in the model layer need to be implemented by software. As mentioned previously, the function layer requires one round of software decomposition. The model layer requires two rounds: the mathematical model is first decomposed into mathematical functions, and the mathematical functions are then decomposed into built-in functions. Computing a mathematical model consumes more time and energy than computing a mathematical function because it involves more rounds of software decomposition.

The computational cost of a mathematical model can be surprisingly high. FIGS. 1BA-1BB illustrate the simulation of a simple example, the amplifying circuit 0Y. The amplifying circuit 0Y includes a transistor 0T and a resistor 0R (FIG. 1BA). The mathematical models of transistor 0T (e.g., MOS3, BSIM3 v3.2, BSIM4 v3.0, PSP, etc. in FIG. 1BB) are built on the set of built-in functions supported by the conventional processor 0X. Since the variety of built-in functions is limited, even calculating one current point of the transistor 0T requires a large amount of computation (FIG. 1BB). For example, the BSIM4 v3.0 transistor model requires 222 additions, 286 multiplications, 85 divisions, 16 square-root operations, 24 exponential operations, and 19 logarithmic operations.

ALU 00L in conventional processor 0X can by itself compute only arithmetic models. Since most mathematical models are non-arithmetic models, they cannot be implemented by ALU 00L alone. In a processor implementing computer simulation, an arithmetic model is a mathematical model that can be expressed as a combination of its basic arithmetic operations, while a non-arithmetic model is a mathematical model that cannot be so expressed. Because non-arithmetic models contain more operations than the ALU supports, they cannot be implemented by the ALU alone, and computing them with the conventional processor 0X is slow and inefficient.

On the other hand, a three-dimensional vertical memory (3D-M_V for short) is a non-volatile memory. U.S. Pat. No. 8,638,611 discloses a 3D-M_V, which is a vertical NAND. A 3D-M_V array contains a plurality of vertical memory strings. Each memory string contains a plurality of memory cells stacked on top of each other, which are coupled to each other via a vertical address line. Unfortunately, the 3D-M_V of the prior art has only a storage function and no computing function.

Disclosure of Invention

The main purpose of the invention is to promote the revolution of scientific computing.

It is a further object of the invention to provide a processor that enables a higher computational complexity.

It is a further object of the invention to provide a processor that can implement more built-in functions.

It is another object of the present invention to efficiently calculate non-arithmetic functions at high speed.

It is another object of the invention to enable high speed and efficient simulation and emulation.

It is another object of the present invention to compute non-arithmetic models efficiently and at high speed.

To achieve these and other objects, the present invention proposes a processor containing a three-dimensional vertical memory (3D-M_V) array (referred to simply as a "three-dimensional processor"): it can not only store data, but also accelerate computation using the stored data. The three-dimensional processor comprises a plurality of computing units, and each computing unit comprises at least one 3D-M_V array and an arithmetic logic circuit (ALC). The 3D-M_V array stores at least part of a look-up table (LUT) of at least one non-arithmetic function or non-arithmetic model, and the ALC performs arithmetic operations on at least a portion of the data in the LUT. The 3D-M_V array and the ALC are vertically integrated: they are integrated in the same chip, stacked on each other in a direction perpendicular to the substrate, and at least partially overlapping. By loading the LUTs of different mathematical functions into the 3D-M_V array, the three-dimensional processor can implement different mathematical functions, i.e., reconfigurable computing.

A 3D-M_V is a three-dimensional memory. A 3D-M_V array comprises a plurality of vertical memory strings, and each vertical memory string comprises a plurality of 3D-M_V memory cells stacked on one another. Note that there is no semiconductor substrate between the 3D-M_V memory cells. 3D-NAND is a 3D-M_V. Among all semiconductor memories, 3D-M_V has the highest storage density.

The three-dimensional processor employs memory-based computation (MBC), i.e., computation is realized mainly by a large-capacity LUT (i.e., a 3DM-LUT) stored in a 3D-M_V array. Compared with the LUTs of conventional logic-based computation (LBC), the 3DM-LUT used by MBC has a much larger capacity. For example, the capacity of a single 3D-NAND chip is up to 128 Gb, much higher than that of a conventional LUT (tens of kb), and can be used to implement tens of thousands of non-arithmetic functions (including various transcendental and special functions). Most MBCs still require arithmetic operations; however, by using a larger 3DM-LUT as a starting point, MBC requires less polynomial expansion. In MBC, the memory circuit contributes a greater share of the computation than the ALC.

Because the 3D-M_V array is stacked on top of the ALC and both are integrated in the same chip, this vertical integration is referred to as three-dimensional integration. Three-dimensional integration improves computational density. Because the 3D-M_V array does not occupy substrate area, the area of the computing unit is close to that of the ALC, whereas the area of a conventional processor is the sum of the areas of its memory circuits and logic circuits. By moving the memory circuit from the side to the top, the computing unit becomes smaller; the three-dimensional processor can therefore contain more computing units and support massively parallel computing. Three-dimensional integration can also greatly increase computational complexity. The total LUT capacity in a conventional processor is only a few tens of kb. In contrast, the total 3DM-LUT capacity in a three-dimensional processor can reach hundreds of Gb, and a single three-dimensional processor chip can support tens of thousands of built-in functions, about three orders of magnitude more than a conventional processor.

Accordingly, the present invention provides a three-dimensional processor (100) for implementing a non-arithmetic function, comprising: a semiconductor substrate (0); a plurality of computing units (100aa, … 100mn), each of said computing units (100ij) comprising at least one three-dimensional vertical memory (3D-M_V) array (170) and an arithmetic logic circuit (ALC) (180), the 3D-M_V array (170) storing at least part of a look-up table (LUT) of the non-arithmetic function, the ALC (180) performing arithmetic operations on at least a portion of the data of the LUT; the ALC (180) is located in the semiconductor substrate (0), and the 3D-M_V array (170) is stacked above the ALC (180) and electrically coupled with the ALC (180) through a plurality of on-chip connections (160). In the three-dimensional processor (100), the non-arithmetic function includes more operations than the ALC (180) supports.

When applied to computer simulation, the three-dimensional processor is used to implement non-arithmetic models, again employing MBC. MBC brings great advantages to computer simulation. A large increase in built-in functions (from about ten to tens of thousands) flattens the traditional framework of computer simulation (base layer, function layer, and model layer). In the past, only the base layer could be implemented in hardware; now, not only the mathematical functions of the function layer but also the mathematical models of the model layer can be implemented directly in hardware. In the function layer, a mathematical function is computed by "function look-up" (i.e., the 3DM-LUT stores function values and their derivative values, and the result is obtained by table look-up plus polynomial expansion); in the model layer, a mathematical model is computed by "model look-up" (i.e., the 3DM-LUT stores model values and their derivative values, and the result is obtained by table look-up plus polynomial expansion). The 3DM-LUT thus enables high-speed, high-efficiency computation of mathematical models, which advances computer simulation.
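The "model look-up" idea can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the diode-like model I(V) = I_S·(exp(V/V_T) − 1), the constants `I_S`, `V_T`, `V_MAX`, the table size `N`, and all function names are hypothetical and do not come from the specification. The sketch stores model values and their derivative values in two tables and evaluates the model by table look-up plus a first-order expansion.

```python
import math

# Hypothetical toy model (an assumption, not from the specification):
# a diode-like current I(V) = I_S * (exp(V / V_T) - 1).
I_S, V_T = 1e-14, 0.02585
V_MAX, N = 0.8, 4096          # assumed input range and table size
STEP = V_MAX / N

def model(v):
    """Reference model, evaluated directly with math.exp."""
    return I_S * (math.exp(v / V_T) - 1.0)

def model_derivative(v):
    """Analytic derivative dI/dV of the toy model."""
    return I_S / V_T * math.exp(v / V_T)

# Tables filled once, ahead of time: model values and derivative values.
value_lut = [model(i * STEP) for i in range(N)]
deriv_lut = [model_derivative(i * STEP) for i in range(N)]

def model_lookup(v):
    """Table look-up plus first-order expansion around the grid point."""
    i = min(int(v / STEP), N - 1)   # table address
    dv = v - i * STEP               # offset from the grid point
    return value_lut[i] + deriv_lut[i] * dv

v = 0.6123
approx = model_lookup(v)
exact = model(v)
rel_err = abs(approx - exact) / exact
```

Once the tables are filled, evaluating the model no longer calls the exponential; it needs only two table reads, one multiplication and a few additions or subtractions.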

Accordingly, the present invention proposes a three-dimensional processor (100) for implementing a non-arithmetic model, characterized in that it comprises: a semiconductor substrate (0); a plurality of computing units (100aa, … 100mn), each of said computing units (100ij) comprising at least one three-dimensional vertical memory (3D-M_V) array (170) and an arithmetic logic circuit (ALC) (180), the 3D-M_V array (170) storing at least part of a look-up table (LUT) of the non-arithmetic model, the ALC (180) performing arithmetic operations on at least a portion of the data of the LUT; the ALC (180) is located in the semiconductor substrate (0), and the 3D-M_V array (170) is stacked above the ALC (180) and electrically coupled with the ALC (180) through a plurality of on-chip connections (160). In the three-dimensional processor (100), the non-arithmetic model includes more operations than the ALC (180) supports.

Drawings

FIG. 1AA is a perspective view of a conventional processor (prior art); FIG. 1AB lists all transcendental functions supported by an Intel Itanium (IA-64) processor (prior art); FIG. 1BA is a circuit diagram of an amplifier circuit; FIG. 1BB lists the amount of computation required by different transistor models to compute one current point (prior art).

FIGS. 2A-2B give a general overview of a three-dimensional processor containing 3D-M_V arrays: FIG. 2A is a circuit block diagram thereof; FIG. 2B is a circuit block diagram of a computing unit thereof.

FIGS. 3A-3B are cross-sectional views of computing units containing two types of 3D-M_V arrays.

FIGS. 4A-4C are circuit block diagrams of three types of computing units.

FIGS. 5A-5C are substrate circuit layout diagrams of three types of computing units.

FIGS. 6A-6C are circuit block diagrams of three ALCs.

FIG. 7A is a circuit block diagram of a first computing unit; FIG. 7B is a circuit diagram of one specific implementation of that computing unit.

FIG. 8 is a circuit block diagram of a second computing unit.

FIG. 9 is a circuit block diagram of a third computing unit.

It is noted that the figures are diagrammatic and not drawn to scale. Dimensions and structures of parts in the figures may be exaggerated or reduced for clarity and convenience. In different embodiments, alphabetic suffixes following numbers represent different instances of the same class of structure; the same numerical prefixes refer to the same or similar structures. "/" indicates a relationship of "and" or "or".

In this specification, "memory" broadly refers to any semiconductor-based information storage device that can store information permanently or temporarily. A "memory array" (e.g., a 3D-M_V array) is the set of all memory cells sharing at least one address line. "Circuit in a substrate" means that the active elements of the circuit (e.g., transistors, memory cells) are located in the substrate; the interconnect lines connecting the active elements may be located above the substrate. "Circuit on a substrate" means that the active elements of the circuit (e.g., transistors, memory cells) and their interconnect lines are all located above the substrate. "Electrically coupled" means any form of coupling in which an electrical signal may be transmitted from one element to another. "Look-up table (LUT)" refers both to the data in the LUT and to the memory circuit (i.e., LUT memory) that stores the LUT; the two are not distinguished in this specification.

Detailed Description

FIGS. 2A-2B give a general overview of a three-dimensional processor containing 3D-M_V arrays. FIG. 2A is a circuit block diagram thereof. The three-dimensional processor chip 100 can not only store data, but also accelerate computation using the stored data. The three-dimensional processor 100 includes an array of m × n computing units 100aa-100mn. Taking computing unit 100ij as an example, it has an input 110 and an output 120. In general, a three-dimensional processor 100 may contain thousands of computing units 100aa-100mn, supporting massively parallel computing.

FIG. 2B is a circuit block diagram of the computing unit 100ij. The computing unit 100ij comprises at least one memory circuit 170 and a logic circuit 180, electrically coupled via a plurality of on-chip connections 160. The memory circuit 170 includes at least one 3D-M_V array, which stores at least part of a look-up table (LUT) of at least one non-arithmetic function or non-arithmetic model; the logic circuit 180 contains an arithmetic logic circuit (ALC), and ALC 180 uses at least some of the data in the LUT to accelerate computation. Because the 3D-M_V array 170 is located in a different physical plane from ALC 180, the 3D-M_V array 170 is drawn in dashed lines.

FIGS. 3A-3B show computing units 100ij based on two types of 3D-M_V arrays 170. The computing unit 100ij comprises at least one 3D-M_V array 170 and a substrate circuit 0K. The substrate circuit 0K is formed on a semiconductor substrate 0 and contains transistors 0t and their interconnects 0m. The substrate circuit 0K comprises ALC 180 and the peripheral circuitry 190 of the 3D-M_V array (see FIGS. 5A-5C). The 3D-M_V array 170 is vertically stacked on the substrate circuit 0K, and the on-chip connections 160 are realized by a plurality of vias. In the 3D-M_V array 170, the memory cells are distributed in three dimensions, and at least one set of address lines runs along the vertical z direction, i.e., perpendicular to the substrate. The 3D-M_V array 170 itself is monolithically integrated: the memory cells are stacked on top of each other without any semiconductor substrate between them. Structurally, a plurality of 3D-M_V memory cells form a vertical memory string, and a plurality of memory strings are arranged horizontally over the substrate circuit to form a memory array. Because 3D-M_V has the highest storage density among all semiconductor memories, the 3D-M_V array can store a large number of LUTs. For simplicity, the vias 160 (which realize the on-chip connections) between the 3D-M_V array 170 and ALC 180 are not shown in FIGS. 3A-3B; they are well known to those skilled in the art.

According to its programmability, 3D-M_V is divided into three-dimensional one-time-programmable memory (3D-OTP for short) and three-dimensional multi-time-programmable memory (3D-MTP for short). As the names imply, 3D-OTP can be programmed once and 3D-MTP can be programmed multiple times (including reprogramming). The 3D-OTP process is mature; 3D-MTP is a general-purpose memory.

The 3D-M_V array 170 in FIG. 3A uses transistors or transistor-like devices as memory cells. It contains a plurality of vertical memory strings 16X, 16Y arranged side by side. Each memory string (e.g., 16Y) contains a plurality of vertically stacked memory cells (e.g., 18ay-18hy). Each memory cell (e.g., 18fy) contains a vertical transistor having a gate 15 (which is a horizontal address line), a memory film 17, and a vertical channel 19 (which is a vertical address line). The memory film 17 may include a composite film of silicon oxide-silicon nitride-silicon oxide, silicon oxide-polysilicon-silicon oxide, or the like. This 3D-M_V array 170 is a 3D-NAND, whose manufacturing process is well known to those skilled in the art.

The 3D-M_V array 170 in FIG. 3B uses diodes or diode-like devices as memory cells. It contains a plurality of vertical memory strings 16U-16W arranged side by side. Each memory string (e.g., 16U) contains a plurality of vertically stacked memory cells 18au-18hu. The 3D-M_V array 170 contains a plurality of vertically stacked horizontal address lines (word lines) 15. After a plurality of memory wells 11 penetrating these horizontal address lines 15 are etched, the sidewalls of the memory wells 11 are covered with a programming film 13 and the wells are filled with a conductive material to form vertical address lines (bit lines) 19. The conductive material may be a metallic material or a doped semiconductor material. Memory cells 18au-18hu are formed at the intersections of word lines 15 and bit lines 19. The programming film 13 may be one-time programmable (OTP, e.g., an antifuse film) or multiple-time programmable (MTP, e.g., an RRAM film).

To reduce crosstalk between memory cells, a diode is preferably formed between word line 15 and bit line 19. In one embodiment, the programming film 13 itself has certain diode-like electrical characteristics. In another embodiment, a diode film (not shown) is deposited separately on the sidewalls of the memory well 11. In a third embodiment, a built-in diode (e.g., a P-N diode or Schottky diode) forms naturally between the word line 15 and the bit line 19. For details of the built-in diode, see Chinese patent application 201811117502.7 (filing date: September 20, 2018).

In the embodiment of FIGS. 3A-3B, the substrate circuit 0K contains at least part of the ALC 180 and of the peripheral circuitry 190 of the 3D-M_V array 170, and the 3D-M_V array 170 is stacked over the substrate circuit 0K. In another embodiment of the present invention, the 3D-M_V array 170 is stacked over the semiconductor substrate 0, and at least part of the ALC 180 (or of the peripheral circuitry 190 of the 3D-M_V array 170) is stacked above the 3D-M_V array 170. In that case, the 3D-M_V array 170 is interposed between the semiconductor substrate 0 and the ALC 180 (or the 3D-M_V peripheral circuitry 190). In all of these embodiments, the 3D-M_V array 170 and ALC 180 are stacked on top of each other and electrically coupled to achieve three-dimensional integration.

Because the 3D-M_V array 170 is stacked on top of ALC 180 and both are integrated in the same chip, this vertical integration is referred to as three-dimensional integration. Three-dimensional integration improves computational density. Because the 3D-M_V array 170 does not occupy substrate area, the area of the computing unit 100ij is close to that of ALC 180, whereas the area of the conventional processor 0X is the sum of the areas of the memory circuit 00M and the logic circuit 00L. By moving the memory circuit from the side to the top, the computing unit 100ij becomes smaller; the three-dimensional processor 100 can therefore contain more computing units 100ij and support massively parallel computing. Three-dimensional integration can also greatly increase computational complexity. The total LUT capacity in the conventional processor 0X is only a few tens of kb. In contrast, the total 3DM-LUT capacity in the three-dimensional processor 100 can reach hundreds of Gb, and a single three-dimensional processor chip can support tens of thousands of built-in functions, about three orders of magnitude more than a conventional processor.

FIGS. 4A-5C show three types of computing units 100ij. FIGS. 4A-4C are circuit block diagrams thereof; FIGS. 5A-5C are circuit layout diagrams thereof. In these embodiments, one ALC 180ij uses the LUTs of different numbers of 3D-M_V arrays 170ij.

ALC 180ij in FIG. 4A uses the LUT of one 3D-M_V array 170ij: it accelerates computation using the LUT stored in the 3D-M_V array 170ij. ALC 180ij in FIG. 4B uses the LUTs of four 3D-M_V arrays 170ijA-170ijD: it accelerates computation using the LUTs stored in the 3D-M_V arrays 170ijA-170ijD. ALC 180ij in FIG. 4C uses the LUTs of eight 3D-M_V arrays 170ijA-170ijD and 170ijW-170ijZ: it accelerates computation using the LUTs stored in the 3D-M_V arrays 170ijA-170ijD and 170ijW-170ijZ. As can be seen from FIGS. 5A-5C below, an ALC 180ij that uses more 3D-M_V arrays 170ij generally occupies a larger chip area and is more powerful. In FIGS. 4A-5C, because the 3D-M_V array 170ij is located in a different physical plane from ALC 180ij (see FIGS. 3A-3B), the 3D-M_V array 170ij is drawn in dashed lines.

FIGS. 5A-5C are circuit layout diagrams of three types of computing units 100ij. The embodiment of FIG. 5A corresponds to that of FIG. 4A. In this embodiment, the substrate circuit 0K of the computing unit 100ij includes ALC 180ij and the peripheral circuitry 190 of the 3D-M_V array 170ij. ALC 180ij accelerates computation using the LUT stored in the 3D-M_V array 170ij, and the peripheral circuitry 190 comprises the X decoder and Y decoder (including read/write circuits) of the 3D-M_V array, among others. ALC 180ij is at least partially covered by the 3D-M_V array 170ij. In FIGS. 5A-5C, because the 3D-M_V array 170ij is located above the substrate circuit 0K rather than in it, its projection onto substrate 0 is represented by dashed lines.

In this embodiment, the period of ALC 180ij equals the period of the 3D-M_V array 170ij, and its area cannot exceed the projected area of the 3D-M_V array 170ij on substrate 0, so its functionality is limited. This embodiment is well suited to simpler calculations. FIGS. 5B-5C disclose two more complex ALCs 180ij.

The embodiment of FIG. 5B corresponds to that of FIG. 4B. In this embodiment, the ALC 180ij of the computing unit 100ij is located in the substrate 0 and is at least partially covered by four 3D-M_V arrays 170ijA-170ijD. Under the four 3D-M_V arrays 170ijA-170ijD, ALC 180ij can be laid out freely. The period of ALC 180ij in FIG. 5B is twice the period of the 3D-M_V array 170ij in FIG. 5A, and its area is four times as large, so it can perform more complex calculations.

The embodiment of FIG. 5C corresponds to that of FIG. 4C. In this embodiment, ALC 180ij of computing unit 100ij is located in substrate 0. The eight 3D-M_V arrays 170ijA-170ijD, 170ijW-170ijZ are divided into two groups 170ijSA, 170ijSB. Each group (e.g., 170ijSA) comprises four 3D-M_V arrays (e.g., 170ijA-170ijD). Below the four 3D-M_V arrays 170ijA-170ijD of the first group 170ijSA, the first ALC component 180ijA can be laid out freely. Similarly, below the four 3D-M_V arrays 170ijW-170ijZ of the second group 170ijSB, the second ALC component 180ijB can be laid out freely. The first ALC component 180ijA and the second ALC component 180ijB together constitute ALC 180ij. In this embodiment, gaps (e.g., G) are left between adjacent peripheral circuits to form the wiring channels 182, 184, 186 for communication between different ALC components 180ijA, 180ijB, or between different ALCs. The period of ALC 180ij in FIG. 5C is four times (in the x direction) the period of the 3D-M_V array 170ij in FIG. 5A, and its area is eight times as large, enabling more complex calculations.

FIGS. 6A-6C show three ALCs 180. ALC 180 in FIG. 6A is an adder 180A; ALC 180 in FIG. 6B is a multiplier 180M; ALC 180 in FIG. 6C is a multiply-accumulator (MAC) that includes an adder 180A and a multiplier 180M. ALC 180 may implement integer arithmetic, fixed-point arithmetic, or floating-point arithmetic. ALC 180 may also contain storage circuitry such as registers, flip-flops, buffer RAM, etc., as will be apparent to those skilled in the art.

FIGS. 7A-7B show a first computing unit 100ij, which implements a non-arithmetic function Y = f(X) by function look-up. FIG. 7A is a circuit block diagram thereof. ALC 180 contains a pre-processing circuit 180R, a 3DM-LUT 170P and a post-processing circuit 180T. The pre-processing circuit 180R converts the input variable (X) 110 into the address (A) of the 3DM-LUT 170P. After the data (D) at address (A) of the 3DM-LUT 170P is read out, the post-processing circuit 180T converts it into the function value (Y) 120. To improve computational accuracy, the residue (R) of the input variable (X) is sent to the post-processing circuit 180T.

FIG. 7B shows a computing unit 100ij capable of implementing a single-precision non-arithmetic function Y = f(X). The input variable X 110 is 32 bits (x₃₁…x₀). The pre-processing circuit 180R extracts its upper 16 bits (x₃₁…x₁₆) as the 16-bit address A of the LUT 170P, and its lower 16 bits (x₁₅…x₀) as the 16-bit residue R, which is sent to the post-processing circuit 180T. The 3DM-LUT 170P contains two 3DM-LUTs 170Q, 170R. Each 3DM-LUT 170Q, 170R has a capacity of 2 Mb (16-bit input, 32-bit output). The 3DM-LUT 170Q stores the function value D1 = f(A), and the 3DM-LUT 170R stores the first-derivative value D2 = f'(A). The post-processing circuit 180T contains a multiplier 180M and an adder 180A. The output value (Y) 120 is 32 bits and is calculated by polynomial interpolation. In this embodiment, the polynomial interpolation is a first-order Taylor series: Y(X) = D1 + D2 × R = f(A) + f'(A) × R. Higher-order polynomial interpolation (e.g., a higher-order Taylor series) can further improve computational accuracy.
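The data path of FIG. 7B can be illustrated with a minimal Python sketch. Several things here are assumptions made only for illustration and are not stated in the specification: the 32-bit input word is interpreted as an unsigned fixed-point fraction x = X / 2³² in [0, 1), the tables hold Python floats rather than 32-bit words, sine is used as the example function, and the names `build_luts` and `evaluate` are hypothetical.

```python
import math

ADDR_BITS = 16   # upper 16 bits of X form the LUT address A
RES_BITS = 16    # lower 16 bits of X form the residue R

def build_luts(f, df):
    """Fill the function-value LUT (170Q) and first-derivative LUT (170R).

    Assumption: address a stands for the real number a / 2**16.
    """
    n = 1 << ADDR_BITS
    lut_q = [f(a / n) for a in range(n)]
    lut_r = [df(a / n) for a in range(n)]
    return lut_q, lut_r

def evaluate(x_word, lut_q, lut_r):
    """Pre-process, look up, then post-process: Y = D1 + D2 * R."""
    a = x_word >> RES_BITS                  # pre-processing: address A
    r = x_word & ((1 << RES_BITS) - 1)      # pre-processing: residue R
    d1 = lut_q[a]                           # D1 = f(A)
    d2 = lut_r[a]                           # D2 = f'(A)
    # Post-processing: one multiplication and one addition.
    # The residue is scaled back to the real-number offset it represents.
    return d1 + d2 * (r / (1 << (ADDR_BITS + RES_BITS)))

lut_q, lut_r = build_luts(math.sin, math.cos)
x_word = 0x8000C000                         # x = 0.5 + 0.75 / 65536
approx = evaluate(x_word, lut_q, lut_r)
exact = math.sin(x_word / 2**32)
table_only = lut_q[x_word >> RES_BITS]      # look-up without interpolation
```

In this sketch the first-order term reduces the error at the sample point by several orders of magnitude compared with the table entry alone, which is the effect the paragraph above attributes to polynomial interpolation.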

When implementing non-arithmetic functions, combining LUTs with polynomial interpolation achieves higher computational accuracy with smaller LUTs. If the above single-precision function (32-bit input, 32-bit output) were implemented with a LUT alone (without polynomial interpolation), the LUT capacity would need to be 2³² × 32 b = 128 Gb, which is impractical. Polynomial interpolation greatly reduces the required LUT capacity. In the above embodiment, with a first-order Taylor series the LUTs need only 4 Mb in total (2 Mb for the function-value LUT and 2 Mb for the first-derivative LUT), far less than with a LUT alone (4 Mb vs. 128 Gb).
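The capacity figures above can be checked with a few lines of Python. The helper name `lut_bits` is hypothetical, and binary prefixes are assumed (1 Mb = 2²⁰ bits, 1 Gb = 2³⁰ bits), which is the convention under which the figures in the text come out exactly.

```python
MB = 1 << 20   # bits per Mb (binary prefix, assumed)
GB = 1 << 30   # bits per Gb (binary prefix, assumed)

def lut_bits(input_bits, output_bits, tables=1):
    """Total capacity in bits of `tables` LUTs with the given word sizes."""
    return tables * (1 << input_bits) * output_bits

lut_only = lut_bits(32, 32)                 # 32-bit address, no interpolation
with_taylor = lut_bits(16, 32, tables=2)    # value LUT + derivative LUT

lut_only_gb = lut_only // GB
with_taylor_mb = with_taylor // MB
reduction = lut_only // with_taylor
```

Under these assumptions the first-order scheme needs 2¹⁵ times less table capacity than the LUT-only scheme.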

In addition to elementary functions (including algebraic functions and transcendental functions), the three-dimensional processor 100 can implement various advanced functions, such as special functions. Special functions play a significant role in mathematical analysis, functional analysis, physics research and engineering applications. Many special functions are solutions of differential equations or integrals of elementary functions. Examples of special functions include gamma functions, beta functions, Bessel functions, Legendre functions, elliptic functions, Lamé functions, Mathieu functions, Riemann zeta functions, Fresnel integrals, and the like. The advent of the three-dimensional processor 100 will simplify the computation of special functions and promote their application in scientific computing.

Fig. 8 shows a second computing unit 100ij. This computing unit 100ij implements a composite function Y = EXP[K × LOG(X)] = X^K by the function look-up method. The computing unit 100ij contains two 3DM-LUTs 170S, 170T and a multiplier 180M. The 3DM-LUT 170S stores the function values of LOG(), and the 3DM-LUT 170T stores the function values of EXP(). The input variable X is used as the address 110 of the 3DM-LUT 170S. The output LOG(X) 160S of the 3DM-LUT 170S is multiplied by the power parameter K at the multiplier 180M, and the product 160T is sent as an address to the 3DM-LUT 170T. The output 120 of the 3DM-LUT 170T is Y = X^K.
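The chain of Fig. 8 (LOG table, multiplier, EXP table) can be sketched as follows. This is an illustrative model only: the 16-bit address width, the tabulated ranges [1, 2) for X and [0, 2) for K × LOG(X), the use of the natural logarithm, and the names to_address, from_address and power are all assumptions not found in the embodiment. Addresses are formed by truncation and no interpolation is applied, so the accuracy is limited by the table spacing.

```python
import math

N = 16                    # address width of each LUT (assumed)
X_MIN, X_MAX = 1.0, 2.0   # assumed input range of the LOG table
P_MIN, P_MAX = 0.0, 2.0   # assumed argument range of the EXP table


def to_address(value, lo, hi):
    """Quantize a real value in [lo, hi) to an N-bit LUT address by truncation."""
    a = math.floor((value - lo) / (hi - lo) * (1 << N))
    if not 0 <= a < (1 << N):
        raise ValueError("value outside the tabulated range")
    return a


def from_address(a, lo, hi):
    """Real value represented by address a."""
    return lo + a * (hi - lo) / (1 << N)


# Role of 3DM-LUT 170S (LOG values) and 3DM-LUT 170T (EXP values).
LOG_LUT = [math.log(from_address(a, X_MIN, X_MAX)) for a in range(1 << N)]
EXP_LUT = [math.exp(from_address(a, P_MIN, P_MAX)) for a in range(1 << N)]


def power(x, k):
    """Return an approximation of x**k as EXP[k * LOG(x)] using two look-ups."""
    log_x = LOG_LUT[to_address(x, X_MIN, X_MAX)]  # first look-up: LOG(X)
    product = k * log_x                            # multiplier: K * LOG(X)
    return EXP_LUT[to_address(product, P_MIN, P_MAX)]  # second look-up: EXP()


print(power(1.5, 2.0), 1.5 ** 2.0)
```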

The functions calculated by the embodiments of Figs. 7A-7B and Fig. 8 are combinatorial functions. A combinatorial function is a combination of at least two non-arithmetic functions. For example, the single-precision function is a combination of function values and derivative values, and the composite function is a combination of two functions. Accordingly, the invention also proposes a three-dimensional processor (100) for implementing a combinatorial function, characterized in that it comprises: a semiconductor substrate (0); a first three-dimensional vertical memory (3D-M_V) array (170Q or 170S) and a second 3D-M_V array (170R or 170T), the first 3D-M_V array (170Q or 170S) storing at least part of a first look-up table (LUT) of a first non-arithmetic function, the second 3D-M_V array (170R or 170T) storing at least part of a second LUT of a second non-arithmetic function; an Arithmetic Logic Circuit (ALC) (180), said ALC (180) performing arithmetic operations on at least a portion of the data of said first LUT or second LUT; the ALC (180) is located in the semiconductor substrate (0), and the first and second 3D-M_V arrays (170Q, 170R or 170S, 170T) are stacked above the ALC (180) and electrically coupled with the ALC (180) through a plurality of on-chip connections (160); the combinatorial function is a combination of the first and second non-arithmetic functions. In the three-dimensional processor (100), the first and second non-arithmetic functions include more arithmetic operations than the ALC (180) supports.

When applied to computer simulations, a separate three-dimensional processor is used to implement the non-arithmetic model, which still employs MBC. MBC brings great advantages to computer simulations. In this application, the 3D-M array 170 in Fig. 2A stores the LUTs of non-arithmetic models, and the logic circuit 180 is an ALC.

Fig. 9 shows a third computing unit 100ij. This computing unit 100ij implements a computer simulation of the amplifying circuit 0Y (Fig. 1BA) by the model look-up method. The computing unit 100ij comprises a 3DM-LUT 170U, an adder 180A and a multiplier 180M. The 3DM-LUT 170U stores data related to the performance (e.g., input-output characteristics) of the transistor 0T. The input voltage V_IN is used as the address 110 of the 3DM-LUT 170U, and the read-out data 160U is the drain current I_D. The multiplier 180M multiplies I_D by the negative resistance value -R of the resistor 0R, and the result (-R × I_D) is added to the supply voltage V_DD at the adder 180A to obtain the output voltage value V_OUT 120.
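The computation of Fig. 9, V_OUT = V_DD + (-R) × I_D with I_D read from a table addressed by V_IN, can be sketched as below. All numeric values are assumptions chosen for illustration (supply voltage, resistance, address spacing), and the table is filled from a hypothetical square-law curve that merely stands in for the measured transistor data the embodiment would store; the names placeholder_drain_current and simulate are likewise hypothetical.

```python
V_DD = 3.3       # supply voltage in volts (assumed)
R = 1000.0       # resistance of resistor 0R in ohms (assumed)
V_STEP = 0.01    # input-voltage spacing between LUT addresses in volts (assumed)
N_ENTRIES = 331  # addresses covering V_IN from 0 V to 3.3 V


def placeholder_drain_current(v_in):
    """Hypothetical square-law curve standing in for measured I_D data."""
    k, v_t = 5e-4, 0.5  # illustrative gain factor (A/V^2) and threshold (V)
    return 0.5 * k * (v_in - v_t) ** 2 if v_in > v_t else 0.0


# Role of 3DM-LUT 170U: drain current I_D indexed by input voltage V_IN.
ID_LUT = [placeholder_drain_current(i * V_STEP) for i in range(N_ENTRIES)]


def simulate(v_in):
    """Return V_OUT = V_DD + (-R) * I_D, with I_D looked up from V_IN."""
    address = round(v_in / V_STEP)  # V_IN used as the LUT address
    if not 0 <= address < N_ENTRIES:
        raise ValueError("V_IN outside the tabulated range")
    i_d = ID_LUT[address]           # read-out data: drain current
    product = -R * i_d              # multiplier: -R * I_D
    return V_DD + product           # adder: V_DD + (-R * I_D)


for v in (0.0, 1.0, 2.0):
    print(v, simulate(v))
```

Replacing the placeholder curve with raw or smoothed measurement data changes only the table contents, not the look-up, multiply and add steps.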

The 3DM-LUT 170U may store a variety of mathematical models. In one embodiment, the model data stored by the 3DM-LUT 170U is raw measurement data, such as measured input-output characteristics. An example is the drain current vs. gate-source voltage (I_D-V_GS) characteristic curve of a transistor. In another embodiment, the model data stored by the 3DM-LUT 170U is smoothed measurement data. Raw measurement data can be smoothed by purely mathematical methods (e.g., by a best-fit model) or by a physical model (e.g., the BSIM4 V3.0 transistor model). In a third embodiment, the model data stored by the 3DM-LUT 170U contains not only the measured values of the transistor, but also derivatives of the measured values. For example, the model data includes not only the current values (I_D-V_GS) of the transistor 0T, but also its transconductance values (G_m-V_GS). Similar to Fig. 7B, polynomial interpolation (using the derivatives of the measured values) can improve model accuracy with a reasonably sized LUT.

The model look-up method brings many advantages. It saves a great deal of computation time and energy, since the two software decompositions (from mathematical model to mathematical functions, and then from mathematical functions to built-in functions) are not needed. The model look-up method requires even fewer LUTs than the function look-up method. Since transistor models (e.g., BSIM4 V3.0) require hundreds of model parameters, a large number of LUTs would be required to compute the intermediate functions of the transistor model if the function look-up method were used. If the function look-up method is skipped (i.e., the transistor model and the related intermediate functions are skipped) and the model look-up method is adopted directly, the transistor performance can be described by three measurement parameters (the gate-source voltage V_GS, the drain-source voltage V_DS, and the body-source voltage V_BS). Thus, a smaller LUT suffices to describe the mathematical model of the transistor.

It will be understood that changes in form and detail may be made without departing from the spirit and scope of the invention, and that such changes do not hinder the application of the invention. For example, the processor may be a controller (or micro-controller), a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Graphics Processing Unit (GPU), a network security processor, an encryption/decryption processor, an encoding/decoding processor, a three-dimensional processor, an Artificial Intelligence (AI) processor, and the like. The invention, therefore, is not to be restricted except in the spirit of the appended claims.

Claims

1. A three-dimensional processor (100) for implementing a non-arithmetic function, comprising:

a semiconductor substrate (0);

a plurality of computing units (100aa, … 100mn), each of said computing units (100ij) comprising at least one three-dimensional vertical memory (3D-M_V) array (170) and an Arithmetic Logic Circuit (ALC) (180), the 3D-M_V array (170) storing at least a partial look-up table (LUT) of the non-arithmetic function, the ALC (180) arithmetically operating at least a portion of the data of the LUT;

the ALC (180) is located in the semiconductor substrate (0), the 3D-M_V array (170) is stacked above the ALC (180) and electrically coupled to the ALC (180) through a plurality of on-chip connections (160).

2. A three-dimensional processor (100) for implementing a combinatorial function, comprising:

a semiconductor substrate (0);

a first three-dimensional vertical memory (3D-M_V) array (170Q or 170S) and a second 3D-M_V array (170R or 170T), the first 3D-M_V array (170Q or 170S) storing at least part of a first look-up table (LUT) of a first non-arithmetic function, the second 3D-M_V array (170R or 170T) storing at least part of a second LUT of a second non-arithmetic function;

an Arithmetic Logic Circuit (ALC) (180), said ALC (180) arithmetically operating at least a portion of the data of said first LUT or second LUT;

the ALC (180) is located in the semiconductor substrate (0), the first and second 3D-M_V arrays (170Q, 170R or 170S, 170T) are stacked above the ALC (180) and electrically coupled with the ALC (180) through a plurality of on-chip connections (160);

the combinatorial function is a combination of the first and second non-arithmetic functions.

3. The three-dimensional processor (100) of claim 2, further characterized by: the first LUT includes function values of the non-arithmetic function and the second LUT includes derivative values of the non-arithmetic function.

4. The three-dimensional processor (100) of claim 2, further characterized by: the combinatorial function is a composite function of the first and second non-arithmetic functions.

5. The three-dimensional processor (100) of any one of claims 1-4, further characterized by: the non-arithmetic function, the first non-arithmetic function, or the second non-arithmetic function contains more arithmetic operations than the ALC (180) supports.

6. The three-dimensional processor (100) of any one of claims 1-4, further characterized by: the non-arithmetic function, the first non-arithmetic function, or the second non-arithmetic function cannot be expressed as a combination of arithmetic operations supported by the ALC (180).

7. A three-dimensional processor (100) implementing a non-arithmetic model, comprising:

a semiconductor substrate (0);

a plurality of computing units (100aa, … 100mn), each of said computing units (100ij) comprising at least one three-dimensional vertical memory (3D-M_V) array (170) and an Arithmetic Logic Circuit (ALC) (180), the 3D-M_V array (170) storing at least part of a look-up table (LUT) of the non-arithmetic model, the ALC (180) arithmetically operating at least part of the data of the LUT;

the ALC (180) is located in the semiconductor substrate (0), the 3D-M_VAn array (170) is stacked above the ALC (180) and electrically coupled with the ALC (180) through a plurality of on-chip connections (160).

8. The three-dimensional processor (100) of claim 7, further characterized by: the non-arithmetic model includes more operations than the ALC (180) supports.

9. The three-dimensional processor (100) of claim 7, further characterized by: the non-arithmetic model cannot be expressed as a combination of arithmetic operations supported by the ALC (180).

10. The three-dimensional processor (100) of claim 7, further characterized by: the non-arithmetic model includes raw measurement data, and/or smoothed measurement data.