CN111290994B

CN111290994B - Discrete three-dimensional processor

Info

Publication number: CN111290994B
Application number: CN201910038528.0A
Authority: CN
Inventors: 张国飙
Original assignee: Hangzhou Haicun Information Technology Co Ltd
Current assignee: Hangzhou Haicun Information Technology Co Ltd
Priority date: 2018-12-10
Filing date: 2019-01-16
Publication date: 2023-01-10
Anticipated expiration: 2039-01-16
Also published as: CN116049093A; CN116303224A; WO2020119511A1; CN115794730A; CN111290994A; CN113918506A; CN116150085A; CN112597098A

Abstract

A discrete three-dimensional processor (100) contains a first chip (100a) and a second chip (100b). The first chip (100a) contains a three-dimensional memory (3D-M) array (170), and the second chip (100b) contains logic circuitry (180) and at least one off-chip peripheral circuit component (190) of the 3D-M array (170). The first chip (100a) does not contain the off-chip peripheral circuit component (190). The first chip (100a) and the second chip (100b) are electrically coupled by a plurality of inter-chip connections (160). The discrete three-dimensional processor can be applied to fields such as mathematical computation, computer simulation, programmable gate arrays, pattern processing, and neural networks.

Description

Discrete three-dimensional processor

Technical Field

The present invention relates to the field of integrated circuits, and more particularly to processors.

Background

Processors (including CPUs, GPUs, FPGAs, etc.) are widely used in the fields of mathematical computation, computer simulation, programmable gate arrays, pattern processing, neural networks, etc. Conventional processor chips are based on two-dimensional integration, with logic circuits (e.g., arithmetic logic units, control units, etc.) in the same plane (i.e., semiconductor substrate surface) as memory circuits (internal memory, including RAM for caching and ROM to store look-up tables, etc.). Since the main function of the processor chip is arithmetic logic operation, the capacity of its internal memory is small.

Traditional computers are based on the von Neumann architecture, in which the processor and the memory are separate: most of the memory is external memory (e.g., main memory, secondary storage, etc.) located off the processor chip. When a large amount of data is needed during computation, the processor chip fetches the data from the external memory. The data transmission bandwidth between the external memory and the processor chip is limited by the long physical distance between them and by the narrow data bus that connects them. With the advent of massive amounts of data, traditional processors and their von Neumann architecture are increasingly inadequate.

The current state and limitations of processors in each of these applications are described below.

[A] Mathematical computation.

One important application of processors is mathematical calculations, including the calculation of mathematical functions and the calculation of mathematical models. To implement mathematical computations, conventional processors employ logic-based computation (LBC), which is computed primarily by logic circuits (commonly referred to as arithmetic logic units, or ALUs). In fact, the arithmetic operations that the ALU can directly implement are only addition, subtraction and multiplication, which are collectively referred to as basic arithmetic operations. An ALU is suitable for implementing arithmetic functions, but is not capable of non-arithmetic functions. In a processor implementing mathematical calculations, an arithmetic function is a mathematical function that can be expressed as a combination of its basic arithmetic operations, while a non-arithmetic function is a mathematical function that cannot be expressed as a combination of its basic arithmetic operations. Examples of non-arithmetic functions include transcendental functions, special functions, and the like. Non-arithmetic functions cannot be implemented by an ALU alone, since they contain more operations than the ALU supports. Hardware implementation of non-arithmetic functions has been faced with significant challenges.

In a conventional processor, only a few basic functions (i.e., single-variable non-arithmetic functions, including basic algebraic functions, basic transcendental functions, etc.) can be directly implemented in hardware; these functions are called built-in functions. A built-in function is typically implemented by a combination of logic circuits and look-up tables (LUTs). Many techniques exist for implementing built-in functions. For example, U.S. Pat. No. 5,954,787 (inventor: Eun; grant date: September 21, 1999) discloses a method for implementing sine/cosine (SIN/COS) functions using LUTs; U.S. Pat. No. 9,207,910 (inventor: Azadet; grant date: December 8, 2015) discloses a method for implementing a power function using a LUT.

Fig. 1AA illustrates how a built-in function is implemented. A conventional processor 0X typically contains logic circuitry 00L and memory circuitry 00M. The logic circuit 00L includes an ALU, which implements arithmetic operations. The memory circuit 00M stores the LUTs of the functions. To achieve a predetermined accuracy, the polynomial representing the built-in function needs to be expanded to a sufficiently high order. The memory circuit 00M stores the polynomial coefficients and the ALU 00L evaluates the corresponding polynomial. Since the ALU 00L and the memory circuit 00M are arranged side by side on the same plane (both formed in the substrate 00S), this integration is two-dimensional.
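The division of labor between the memory circuit 00M and the ALU 00L can be sketched as follows. This is a minimal illustration, assuming that the stored coefficients are the Taylor coefficients of sin(x) about zero and that the ALU evaluates the polynomial by Horner's rule; the function names and the coefficient layout are hypothetical and are not taken from the cited patents.

```python
import math

def sin_coefficients(order):
    """Assumed contents of memory circuit 00M: Taylor coefficients of sin(x)
    about 0, one entry per power of x, precomputed offline."""
    coeffs = []
    for n in range(order + 1):
        if n % 2 == 0:
            coeffs.append(0.0)
        else:
            coeffs.append((-1) ** ((n - 1) // 2) / math.factorial(n))
    return coeffs

def alu_polynomial(coeffs, x):
    """Horner's rule: each step is one multiplication and one addition,
    i.e. only operations the ALU supports directly."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

x = 1.0
err_low = abs(alu_polynomial(sin_coefficients(3), x) - math.sin(x))
err_high = abs(alu_polynomial(sin_coefficients(9), x) - math.sin(x))
print(err_low, err_high)
```

In this sketch the higher-order expansion is markedly more accurate, which is the trade-off the paragraph describes: accuracy is bought with more stored coefficients and more ALU steps.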

Computing is currently evolving towards higher computational density and greater computational complexity. Computational density refers to the computing power (e.g., the number of floating-point operations per second) per unit chip area, and is an important metric for parallel computing. Computational complexity refers to the number of built-in functions supported by a chip, and is an important metric for scientific computing. Two-dimensional integration limits further improvement of both computational density and computational complexity.

With two-dimensional integration, too many memory circuits 00M increase the chip area of the processor 0X and reduce its computational density, which is detrimental to parallel computing. Further, the ALU 00L is the core component of the processor 0X and occupies most of the chip area, so the memory circuit 00M has only a limited chip area available and can support only a small number of built-in functions. FIG. 1AB lists all built-in transcendental functions that can be implemented by Intel's IA-64 processor (see Harrison et al., "The Computation of Transcendental Functions on the IA-64 Architecture", Intel Technology Journal, Q4, 1999). The IA-64 processor supports only seven built-in functions in total, and such a small set of built-in functions is extremely detrimental to mathematical computation. Because most mathematical functions must be decomposed by software into combinations of built-in functions, the conventional processor 0X is slow and inefficient for most mathematical computations.

[B] Computer simulation.

Another important application of the processor is computer simulation, i.e., the computation of mathematical models. Computer simulation is a natural extension of mathematical computation and is built on the set of built-in functions (only about ten) supported by a conventional processor. Conventional computer simulation contains three layers: a base layer, a function layer, and a model layer. The base layer comprises the built-in functions that can be directly implemented in hardware; the function layer comprises the mathematical functions that cannot be directly implemented in hardware; the model layer contains the mathematical models that describe the behavior (e.g., input-output characteristics) of various system components.

The mathematical functions in the function layer and the mathematical models in the model layer need to be implemented by software. As mentioned previously, the function layer requires one software decomposition. The model layer requires two: the mathematical model is first decomposed into mathematical functions, and the mathematical functions are then decomposed into built-in functions. Computing a mathematical model consumes more time and energy than computing a mathematical function because it involves more rounds of software decomposition.
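The two rounds of decomposition can be illustrated with a toy example. The sketch below assumes that the exponential is the only built-in function involved and that division is available as an arithmetic operation; the device model `toy_model_current` and its parameters are hypothetical and do not correspond to any real transistor model.

```python
import math

calls = {"exp": 0}  # counts how often the base layer is reached

# Base layer: a built-in function assumed to be implemented directly in hardware.
def builtin_exp(x):
    calls["exp"] += 1
    return math.exp(x)

# Function layer: tanh is decomposed by software into the built-in exp
# plus arithmetic operations (one decomposition).
def tanh_fn(x):
    e2x = builtin_exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

# Model layer: a hypothetical device model, decomposed first into a
# mathematical function (tanh_fn) and then, through it, into built-in
# functions (two decompositions).
def toy_model_current(v_gate, v_drain, gain=1e-3):
    return gain * v_gate * tanh_fn(v_drain)

current = toy_model_current(1.2, 0.5)
print(current, calls)
```

Every evaluation of the model ends up as a chain of built-in calls and arithmetic operations, so the cost of one model point grows with the depth of the decomposition.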

The computational cost of a mathematical model is striking. Figs. 1BA-1BB show the simulation of an amplifying circuit 0Y as a simple example. The amplifying circuit 0Y includes a transistor 0T and a resistor 0R (Fig. 1BA). The mathematical models of the transistor 0T (e.g., MOS3, BSIM3 V3.2, BSIM4 V3.0, PSP, etc. in Fig. 1BB) are built on the set of built-in functions supported by the conventional processor 0X. Since the variety of built-in functions is limited, computing even a single current point of the transistor 0T requires a large amount of computation (Fig. 1BB). For example, the BSIM4 V3.0 transistor model requires 222 additions, 286 multiplications, 85 divisions, 16 square root operations, 24 exponential operations, and 19 logarithmic operations.

The ALU 00L in the conventional processor 0X can, by itself, compute only arithmetic models. Since most mathematical models are non-arithmetic models, they cannot be implemented by the ALU 00L alone. In a processor implementing computer simulation, an arithmetic model is a mathematical model that can be expressed as a combination of its basic arithmetic operations, while a non-arithmetic model is a mathematical model that cannot be expressed as a combination of its basic arithmetic operations. Non-arithmetic models cannot be implemented solely by an ALU because they contain more operations than the ALU supports. Computing non-arithmetic models with the conventional processor 0X is slow and inefficient.

[C] Programmable gate array.

A third application of the processor is the programmable gate array. Programmable gate arrays (e.g., FPGAs, CPLDs, etc.) are semi-custom integrated circuits, i.e., the logic circuits are customized by back-end processes or field programming. U.S. Pat. No. 4,870,302 discloses a programmable gate array. It contains a plurality of configurable logic elements (CLEs, also called programmable logic units) and configurable interconnects (CITs, also called programmable connections). Under the control of a setting signal, a programmable logic unit can selectively implement functions such as shift, logical NOT, AND (logical AND), OR (logical OR), NOR (NOT-OR), NAND (NOT-AND), XOR (exclusive OR), addition (arithmetic addition), and subtraction (arithmetic subtraction); under the control of a setting signal, a programmable connection can selectively connect or disconnect two interconnect lines.
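The role of the setting signal can be sketched in software as follows. This is only an illustration of the idea of selecting one function out of a fixed set: the class names, the string encoding of the setting signal and the 8-bit word width are assumptions, and the shift function is omitted for brevity.

```python
from operator import add, sub

# Hypothetical encoding of the setting signal; the keys are illustrative only.
FUNCTIONS = {
    "NOT": lambda a, b: ~a,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NOR": lambda a, b: ~(a | b),
    "NAND": lambda a, b: ~(a & b),
    "XOR": lambda a, b: a ^ b,
    "ADD": add,
    "SUB": sub,
}

class ConfigurableLogicElement:
    """Implements one of the functions above, chosen by a setting signal."""

    def __init__(self, width=8):
        self.mask = (1 << width) - 1  # results wrap to the word width
        self.setting = "AND"

    def configure(self, setting):
        if setting not in FUNCTIONS:
            raise ValueError(f"unsupported setting: {setting}")
        self.setting = setting

    def evaluate(self, a, b=0):
        return FUNCTIONS[self.setting](a, b) & self.mask

class ConfigurableInterconnect:
    """Connects or disconnects two interconnect lines under a setting signal."""

    def __init__(self):
        self.connected = False

    def configure(self, connected):
        self.connected = bool(connected)

    def propagate(self, value):
        return value if self.connected else None

cle = ConfigurableLogicElement()
cle.configure("XOR")
link = ConfigurableInterconnect()
link.configure(True)
out = link.propagate(cle.evaluate(0b1100, 0b1010))
print(bin(out))
```

Note that every entry in the function table is a logic function or an addition/subtraction; nothing in such a table lets the element realize, say, a logarithm, which is the limitation discussed in the following paragraphs.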

In a programmable gate array, the arithmetic operations (arithmetic addition and arithmetic subtraction) supported by a programmable logic unit are collectively referred to as basic arithmetic operations. They are fewer than the basic arithmetic operations (addition, subtraction and multiplication) in conventional processors. When reference is made in this specification to a basic arithmetic operation, it can be determined from its context whether it is a basic arithmetic operation in a programmable gate array or a basic arithmetic operation in a conventional processor.

The programmable gate array can realize the customization of logic functions and arithmetic functions, but cannot customize non-arithmetic functions. In a programmable gate array, an arithmetic function is a mathematical function that can be expressed as a combination of its basic arithmetic operations, while a non-arithmetic function is a mathematical function that cannot be expressed as a combination of its basic arithmetic operations. Non-arithmetic functions cannot be implemented solely by a programmable logic unit because they contain more operations than the programmable logic unit supports. Customization of non-arithmetic functions is not considered possible in the prior art.

[D] Pattern processing.

A fourth application of processors is pattern processing. Pattern processing includes pattern matching and pattern recognition; both refer to finding, among target patterns (the patterns being searched), a pattern identical or close to a search pattern (the pattern searched for). Pattern matching requires finding an identical pattern, whereas pattern recognition only requires finding a close pattern. In this specification, "pattern" includes both target patterns and search patterns; "pattern library" refers to a database containing related patterns, and includes both target pattern libraries and search pattern libraries.
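The difference between the two operations can be sketched as follows. The sketch assumes that patterns are equal-length strings and that closeness is measured by Hamming distance; both are illustrative assumptions, since the specification does not prescribe a pattern encoding or a distance measure.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have equal length")
    return sum(x != y for x, y in zip(a, b))

def pattern_matching(search, targets):
    """Indices of target patterns identical to the search pattern."""
    return [i for i, t in enumerate(targets) if t == search]

def pattern_recognition(search, targets, max_distance):
    """Indices of target patterns close to the search pattern."""
    return [
        i for i, t in enumerate(targets)
        if len(t) == len(search) and hamming(search, t) <= max_distance
    ]

targets = ["10110", "10111", "00000", "10110"]
exact = pattern_matching("10110", targets)
close = pattern_recognition("10110", targets, max_distance=1)
print(exact, close)
```

Both functions scan every target pattern, which hints at why the size and location of the target pattern library dominate the cost of pattern processing.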

Pattern processing is widely used. Common pattern processing includes code matching, string matching, speech recognition, image recognition, and the like. Code matching is widely used in fields such as information security; its operations include searching for viruses in network packets or computer files, or checking whether the packets or files conform to specifications, so as to determine whether the data is safe. String matching, also referred to as keyword search, is widely used in fields such as big data analysis; its operations include regular-expression matching and the like. Speech recognition finds the acoustic/language model in an acoustic/language model library that is closest to the speech data. Image recognition finds the image model in an image model library that is closest to the image data.

With the advent of the big data era, pattern libraries have become large databases. The search pattern library (containing related search patterns, e.g., a virus library, keyword library, acoustic/language model library, image model library, etc.) is already large, and the target pattern library (containing related target patterns, e.g., the computer files on an entire hard disk, a big data database, a speech archive, an image archive, etc.) is much larger. Unfortunately, the internal memory of existing processors cannot hold these pattern libraries; they must all be stored in external memory, and patterns must be read frequently from external memory during pattern processing. Therefore, existing processors and their architecture cannot achieve fast pattern processing for large pattern libraries.

[E] Neural network.

A fifth application of the processor is the neural network. Neural networks provide a powerful artificial intelligence tool. FIG. 1C is an example of a neural network. It contains an input layer 32, a hidden layer 34 and an output layer 36. The input layer 32 contains i neurons 33, whose input data $x_1, \ldots, x_i$ constitute an input vector 30x. The output layer 36 contains k neurons 37, whose output data $y_1, y_2, \ldots, y_k$ constitute an output vector 30y. The hidden layer 34 is interposed between the input layer 32 and the output layer 36. It contains j neurons 35, each of which is electrically coupled to a first neuron in the input layer 32 and a second neuron in the output layer 36. The coupling strength between neurons is represented by the synaptic weights $w_{ij}$ and $w_{jk}$.
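A forward pass through such a three-layer network can be sketched as follows. The sketch assumes full connectivity between adjacent layers and a sigmoid activation function; neither is stated in the specification, and the weight values are arbitrary illustrative numbers.

```python
import math

def sigmoid(z):
    # Assumed activation function; the specification does not name one.
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """weights[a][b] is the synaptic weight from input neuron a to output neuron b."""
    n_out = len(weights[0])
    return [
        sigmoid(sum(inputs[a] * weights[a][b] for a in range(len(inputs))))
        for b in range(n_out)
    ]

def forward(x, w_ij, w_jk):
    hidden = layer(x, w_ij)    # input layer 32 -> hidden layer 34
    return layer(hidden, w_jk)  # hidden layer 34 -> output layer 36

x = [1.0, 0.0, -1.0]                              # i = 3 inputs
w_ij = [[0.5, -0.5], [0.1, 0.2], [0.5, -0.5]]     # 3 x 2 weights (j = 2)
w_jk = [[1.0], [-1.0]]                            # 2 x 1 weights (k = 1)
y = forward(x, w_ij, w_jk)
print(y)
```

Each output requires one multiply-accumulate per synaptic weight, so the weights must be read once per inference; this is why the storage location and capacity for the weights matter so much in the accelerator discussed next.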

The prior art proposes a neural network accelerator chip 60 (see Chen Yunji et al., "DaDianNao: A Machine-Learning Supercomputer", IEEE/ACM International Symposium on Microarchitecture, 5(1), pages 609-622, 2014). The neural network accelerator 60 contains 16 cores 50, which are coupled to each other by a tree connection (Fig. 1DA). Each core 50 contains one neural processing unit (NPU) 30 and four eDRAM blocks 40 (Fig. 1DB). The NPU 30 performs neural computation and contains 256+32 16-bit multipliers and 256+32 16-bit adders. The eDRAM 40 stores synaptic weights and has a storage capacity of 2 MB.

There is still room for improvement in the neural network accelerator 60. First, the eDRAM 40 is a volatile memory, so the synaptic weights need to be loaded into the eDRAM 40 from external memory before operation, which takes time. Second, each neural network accelerator chip 60 has only 32 MB of eDRAM available for storing synaptic weights, which is still far below actual needs. Third, the design of the neural network accelerator 60 is skewed towards memory: the eDRAM 40 occupies 80% of the area of each core, while the NPU 30 occupies less than 10%, so the computational density is very limited.

With the advent of three-dimensional memory (3D-M for short), the difficulties encountered by the above conventional processors and their architecture are largely resolved. The memory cells of a 3D-M are distributed in three dimensions, i.e., stacked on top of each other in a direction perpendicular to the substrate. Chinese patent 02131089.0 (grant publication No.: CN 1285125C; grant date: November 15, 2006) proposes a 3D-M based processor (i.e., a three-dimensional processor), which integrates logic circuits into the substrate under a 3D-M array to form an integrated three-dimensional processor. The integrated three-dimensional processor is contained in a single three-dimensional processor chip.

The integrated three-dimensional processor may be applied to the above fields: Chinese patent application 201710241669.3 (filing date: April 13, 2017) applies the integrated three-dimensional processor to mathematical computation and computer simulation; Chinese patent application 201710126067.3 (filing date: March 6, 2017) applies the integrated three-dimensional processor to a programmable gate array; Chinese patent application 201710130887.X (filing date: March 7, 2017) applies the integrated three-dimensional processor to a pattern processor; Chinese patent application 201710171413.X (filing date: March 21, 2017) applies the integrated three-dimensional processor to a neural network processor. Integrated three-dimensional processors have shown great advantages in these fields.

Figs. 1EA-1EB illustrate an integrated three-dimensional processor 80, in which a 3D-M array 77 and logic circuits 78 are integrated together. The 3D-M array 77 stores data and the logic circuit 78 processes at least a portion of the data stored in the 3D-M array 77. In the three-dimensional processor chip, the chip area occupied by the memory array 77 is the storage region 70, and the chip area outside the storage region 70 is the non-storage region 71 (Fig. 1EA). The storage region 70 contains a substrate circuit 0K and a 3D-M array 77 stacked on the substrate circuit 0K (Fig. 1EB). The substrate circuit 0K is formed on the semiconductor substrate 0 below the 3D-M array 77. It contains transistors 0t and substrate interconnects 0i. The transistors 0t are formed in the semiconductor substrate 0 and are electrically coupled to each other through the substrate interconnects 0i. The substrate interconnects 0i include two interconnect layers 0m1-0m2, each interconnect layer (e.g., 0m1) having a plurality of interconnect lines (e.g., 0m) in the same physical plane. The 3D-M array 77 includes four address line layers 0a1-0a4, each address line layer (e.g., 0a1) having a plurality of address lines (e.g., 1a) in the same physical plane. These address line layers 0a1-0a4 form two memory layers 16A, 16B, where the memory layer 16A is stacked over the substrate circuit 0K and the memory layer 16B is stacked over the memory layer 16A. A memory cell (e.g., 7aa) is located at the intersection of two address lines (e.g., 1a, 2a). The memory layers 16A, 16B are electrically coupled to the substrate circuit 0K through contact vias 1av, 3av, respectively.

The non-storage region 71 also contains part of the substrate circuit 0K (Fig. 1EB). Since the non-storage region 71 does not contain the 3D-M array 77, it has fewer back-end-of-line (BEOL) layers than the storage region 70. In this specification, a BEOL layer is an individual conductive layer formed in the back-end process (vias are not counted). In Fig. 1EB, the storage region 70 contains six BEOL layers: the two interconnect layers 0m1-0m2 of the substrate interconnects 0i and the four address line layers 0a1-0a4 of the memory array 77; the non-storage region 71 contains only two BEOL layers, namely the two interconnect layers 0m1-0m2 of the substrate interconnects 0i. In the non-storage region 71, the space 72 above the substrate circuit 0K contains neither memory cells nor interconnect lines, so the space 72 is effectively wasted.

The storage region 70 contains a plurality of 3D-M arrays 77 and their associated local peripheral circuits 75 and logic circuits 78 (Fig. 1EA). The local peripheral circuits 75 and logic circuits 78 are formed in the substrate 0, near the projection of the 3D-M array 77 onto the substrate 0. Since the 3D-M array 77 is stacked above the local peripheral circuits 75 and the logic circuits 78 rather than located in the substrate 0, it is drawn with dashed lines. The non-storage region 71, on the other hand, contains the global peripheral circuits 73 of the 3D-M arrays 77, which are formed in the substrate 0 outside the projections of all the 3D-M arrays 77 onto the substrate 0. The local peripheral circuits 75 and the global peripheral circuits 73 are collectively referred to as peripheral circuits 79.

In the three-dimensional processor chip 80, the non-storage region 71 occupies a large chip area. Currently, the non-storage region 71 occupies 20 to 30% of the chip area; for high-capacity memory this ratio can even exceed 50%. Thus, the array efficiency of the integrated three-dimensional processor 80 is low. In this specification, the array efficiency is the ratio of the total projected area of the 3D-M arrays 77 on the substrate 0 to the total area of the chip.
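The definition of array efficiency given above translates directly into a ratio. In the sketch below the chip area and the array footprints are hypothetical numbers chosen so that the non-storage region takes 25% of the chip, and the arrays are assumed to cover the whole storage region, which makes the result an upper bound for that case.

```python
def array_efficiency(array_areas_mm2, chip_area_mm2):
    """Total projected area of the 3D-M arrays divided by the total chip area."""
    total_array_area = sum(array_areas_mm2)
    if chip_area_mm2 <= 0 or total_array_area > chip_area_mm2:
        raise ValueError("array projections must fit within a positive chip area")
    return total_array_area / chip_area_mm2

# Hypothetical chip: 100 mm^2 in total, four arrays projecting onto 18.75 mm^2 each.
chip_area = 100.0
arrays = [18.75] * 4
efficiency = array_efficiency(arrays, chip_area)
print(f"array efficiency = {efficiency:.0%}")
```

With the 50% non-storage share quoted for high-capacity memory, the same calculation would give an array efficiency of at most one half.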

The prevailing view of integrated circuits is that the more integration the better, i.e., integration reduces cost and improves performance. Conventional integrated circuits favor monolithic integration, i.e., all circuit components are integrated into one chip. Monolithic integration is effective for two-dimensional circuits, but is no longer effective for three-dimensional circuits, especially when three-dimensional circuits (e.g., three-dimensional memory) are mixed with two-dimensional circuits. In this specification, a two-dimensional circuit refers to a circuit in which the active elements (e.g., transistors, memory cells, etc.) are distributed in a two-dimensional plane (e.g., the front surface of a semiconductor substrate); a three-dimensional circuit refers to a circuit in which the active elements (e.g., transistors, memory cells, etc.) are distributed in a three-dimensional space (stacked on top of each other in a direction perpendicular to the front surface of the semiconductor substrate).

The drawbacks of monolithic integration are manifold when it is applied to integrating three-dimensional circuits with two-dimensional circuits. First, their back-end processes are incompatible, so blind integration forces the logic circuits 78 and peripheral circuits 79 to be fabricated with the complex process used for the 3D-M array 77. Together with the lower array efficiency of the integrated three-dimensional processor chip 80, this increases the overall cost of the three-dimensional processor chip 80.

Second, since the 3D-M array 77 is very process-demanding, the back-end process of the three-dimensional processor chip 80 needs to be optimized for the 3D-M array 77, which inevitably sacrifices some performance of the logic circuits 78 and peripheral circuits 79. In the integrated three-dimensional processor chip 80, the logic circuits 78 and peripheral circuits 79 may use only the few (e.g., two) interconnect layers 0m1-0m2 of the substrate interconnects 0i, or may have to use slower high-temperature interconnect materials (i.e., materials that can withstand the high-temperature back-end processing used to manufacture the 3D-M array 77, such as tungsten), which degrades the overall performance of the three-dimensional processor chip 80.

Finally, with monolithic integration, the chip area available to the logic circuits 78 is limited by the projected area of the 3D-M array 77 on the substrate, so they can perform only limited processing functions. Furthermore, since the logic circuits 78 are integrated with the 3D-M array 77, the three-dimensional processor 80 can only perform fixed functions. If the three-dimensional processor 80 needs to perform other functions, the entire three-dimensional processor 80 (including its 3D-M array 77 and logic circuits 78) needs to be redesigned and remanufactured, which is both time-consuming and costly.

Disclosure of Invention

The invention mainly aims to provide a three-dimensional processor with lower overall cost.

It is another object of the present invention to provide a three-dimensional processor with more excellent overall performance.

It is another object of the present invention to provide a three-dimensional processor that is more powerful and flexible.

It is a further object of this invention to provide such a processor with greater computational density.

It is a further object of the invention to provide a processor with greater computational complexity.

It is another object of the invention to improve the speed and efficiency of mathematical calculations.

It is another object of the present invention to improve the speed and efficiency of computer simulations.

It is another object of the present invention to customize non-arithmetic functions.

It is another object of the invention to customize complex functions.

It is another object of the invention to enable reconfigurable computing.

It is another object of the present invention to enable high speed and efficient pattern processing for large pattern libraries.

It is another object of the invention to enhance information security.

It is another object of the present invention to enhance big data analysis capabilities.

It is another object of the present invention to enhance speech recognition capabilities and enable speech retrieval for a speech archive.

It is another object of the present invention to enhance image recognition capabilities and enable image retrieval from an image archive.

It is another object of the present invention to enhance neural network computational power.

To achieve these and other objects, the present invention follows a design principle distinct from that of conventional processors: de-integrating the three-dimensional circuits and the two-dimensional circuits. In particular, the three-dimensional circuits and the two-dimensional circuits are placed on different chips as far as possible so that each can be optimized separately. Accordingly, the invention proposes a discrete three-dimensional processor (100), characterized in that it comprises: a plurality of memory computing units (100aa-100mn), each memory computing unit (100ij) comprising at least one three-dimensional memory (3D-M) array (170) and a logic circuit (180); a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the logic circuit (180) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160). Briefly, the first chip is a memory chip comprising a plurality of functional layers stacked on top of each other; the second chip is a logic chip that has only one functional layer.

The discrete three-dimensional processor differs from the integrated three-dimensional processor: in the integrated three-dimensional processor, all peripheral circuit components of the 3D-M array are located on the same chip as the 3D-M array; in the discrete three-dimensional processor, at least one peripheral circuit component of the 3D-M array is located not on the first chip but on the second chip. Accordingly, a peripheral circuit component in the second chip is referred to as an off-chip peripheral circuit component. The circuit partitioning strategy adopted when designing the discrete three-dimensional processor is to have the second chip contain as many off-chip peripheral circuit components as possible. The advantage of this partitioning is that the array efficiency of the first chip is greatly improved. Note that although the first chip contains a 3D-M array, it does not contain the off-chip peripheral circuit components and therefore cannot function properly as a memory chip by itself, e.g., its performance does not meet the industry standards for comparable memory chips.

In a discrete three-dimensional processor, the first chip and the second chip may have distinct back-end structures, since they can be designed and manufactured separately. Since the back-end structure of the second chip can be optimized separately, its off-chip peripheral circuit components and logic circuits have lower cost and better performance than the same types of circuits in an integrated three-dimensional processor. The discrete three-dimensional processor is compared with the integrated three-dimensional processor below.

First, the first chip omits at least part of the peripheral circuits and logic circuits, so its array efficiency is high. Furthermore, as a two-dimensional circuit, the second chip 100b has far fewer back-end wiring layers than an integrated three-dimensional processor and can be manufactured using a conventional process. Since wafer cost is roughly proportional to the number of back-end wiring layers, the wafer cost of the second chip is much lower than that of an integrated three-dimensional processor. Thus, the total chip cost of a discrete three-dimensional processor (comprising the first and second chips) is lower than that of an integrated three-dimensional processor (comprising only one chip). Even with the additional bonding cost, the overall cost of a discrete three-dimensional processor is lower than that of an integrated three-dimensional processor.
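The cost argument can be made concrete with a simple proportional model. The sketch below assumes that cost scales with chip area multiplied by the number of back-end wiring layers; all areas, layer counts and the bonding cost are hypothetical values in arbitrary units, so the outcome only illustrates how the comparison is set up and depends entirely on these assumptions.

```python
def wafer_cost(area_mm2, beol_layers, cost_per_layer_mm2=1.0):
    """Cost model assumed here: proportional to area and to the number of BEOL layers."""
    return area_mm2 * beol_layers * cost_per_layer_mm2

# Hypothetical integrated processor: one 100 mm^2 chip with six BEOL layers.
integrated = wafer_cost(area_mm2=100, beol_layers=6)

# Hypothetical discrete processor: a smaller first chip that keeps the six
# layers of the 3D-M process, plus a second chip with two layers.
first_chip = wafer_cost(area_mm2=70, beol_layers=6)
second_chip = wafer_cost(area_mm2=40, beol_layers=2)
bonding = 50.0  # assumed extra cost of bonding the two chips

discrete = first_chip + second_chip + bonding
print(integrated, discrete)
```

Under these assumed numbers the discrete processor is cheaper because the area moved to the second chip is no longer charged for the layers of the 3D-M process; a second chip with many more wiring layers or a higher bonding cost would narrow or reverse the gap.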

Second, the performance of the off-chip peripheral circuit components and logic circuits in a discrete three-dimensional processor is better than the same type of circuits in an integrated three-dimensional processor because they can be optimized individually. In one embodiment, the number of interconnect layers (e.g., four, eight, or more) in the second chip is greater than the number of interconnect layers (e.g., two) of the substrate circuitry in the integrated three-dimensional processor (or first chip). In another embodiment, the second chip employs a high performance interconnect material (e.g., copper) instead of a high temperature interconnect material (e.g., tungsten) used by the integrated three-dimensional processor (or first chip). Thus, the overall performance of a discrete three-dimensional processor is superior to an integrated three-dimensional processor.

Finally, in an integrated three-dimensional processor, the logic circuitry is limited in area and functionality because it is confined to a single chip (e.g., within the projected area of the 3D-M array on the substrate). In contrast, in a discrete three-dimensional processor, the logic circuitry can be formed in two chips (a first portion of the logic circuitry is located in the first chip within the projected area of the 3D-M array on the substrate, and a second portion is located in the second chip), and its larger area gives the discrete three-dimensional processor greater processing power. Furthermore, since the second chip is designed and produced separately, it offers greater flexibility in design and production. By combining the same first chip with second chips having different functions, processing functions suited to different application scenarios can be realized. Better still, these various processing functions can be implemented with a shorter design cycle and a smaller design budget. Thus, the discrete three-dimensional processor is more powerful and flexible.

The application of the discrete three-dimensional processor in different fields is described below.

[A] Mathematical computation.

When applied to mathematical computation, the discrete three-dimensional processor is used to implement non-arithmetic functions. It employs memory-based computation (MBC), i.e., computation is carried out mainly through large-capacity LUTs (3DM-LUTs) stored in the 3D-M array. Compared with the LUTs used in conventional logic-based computation (LBC), the 3DM-LUT used by MBC has a much larger capacity. For example, a single 3D-XPoint die stores up to 128 Gb, far more than a conventional LUT (tens of kb), and can hold the LUTs of tens of thousands of non-arithmetic functions (including various transcendental and special functions). Most MBCs still require arithmetic operations; however, because MBC starts from a larger 3DM-LUT, it needs fewer polynomial-expansion terms. In MBC, the memory circuit contributes a greater share of the computation than the logic circuit.
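The trade-off between LUT size and polynomial expansion can be illustrated with a minimal sketch. It assumes, purely for illustration, that the function is sin(x) on [0, π/2], that the LUT is evenly spaced with nearest-entry look-up, and that the Taylor remainder bound h^(k+1)/(k+1)! (valid because all derivatives of sin are bounded by 1) is used as the accuracy criterion. The function name `taylor_order_needed` and the tolerance are hypothetical and do not come from the specification.

```python
import math

def taylor_order_needed(num_entries, tol=1e-9, span=math.pi / 2):
    """Smallest Taylor order k whose remainder bound meets `tol`.

    Assumes an evenly spaced LUT over [0, span] with nearest-entry look-up,
    so the largest distance from x to a table point is half the spacing.
    For sin(x) every derivative is bounded by 1, giving the bound
    h**(k + 1) / (k + 1)! for an order-k expansion.
    """
    h = span / (num_entries - 1) / 2
    k = 0
    while h ** (k + 1) / math.factorial(k + 1) > tol:
        k += 1
    return k

small_table_order = taylor_order_needed(16)           # a small, LBC-style table
large_table_order = taylor_order_needed(2 ** 20 + 1)  # a large, 3DM-LUT-style table
```

Under these assumptions the larger table reaches the same tolerance with a lower expansion order, which is the sense in which a larger LUT shifts work from the logic circuit to the memory circuit.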

Accordingly, the present invention proposes a three-dimensional processor (100) for computing at least one non-arithmetic function, characterized in that it comprises: a plurality of computing units (100ij), each computing unit (100ij) having at least one three-dimensional memory (3D-M) array (170) and an arithmetic logic circuit (ALC) (180 ALC), the 3D-M array (170) storing at least part of a look-up table (LUT) of the non-arithmetic function, the ALC (180 ALC) performing arithmetic operations on at least part of the data in the LUT; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the ALC (180 ALC) and at least one off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); the non-arithmetic function includes more operations than the ALC (180 ALC) supports.

[B] Computer simulation.

When applied to computer simulation, the discrete three-dimensional processor is used to implement non-arithmetic models, again employing MBC. MBC brings great advantages to computer simulation. A large increase in built-in functions (from about ten to tens of thousands) flattens the traditional framework of computer simulation (comprising the base layer, the function layer, and the model layer). In the past, hardware could implement only the functions of the base layer; now hardware can directly implement not only the mathematical functions of the function layer but also the mathematical models of the model layer. At the function layer, a mathematical function is computed by function look-up (i.e., the 3DM-LUT stores function values and their derivative values, and the result is obtained by table look-up plus polynomial expansion); at the model layer, a mathematical model is computed by model look-up (i.e., the 3DM-LUT stores model values and their derivative values, and the result is obtained by table look-up plus polynomial expansion). The 3DM-LUT thus enables fast and efficient computation of mathematical models, which promises to transform computer simulation.
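The look-up scheme described above (stored values and derivative values, followed by a polynomial expansion) can be sketched as follows. This is only an illustrative model: the helper names `build_lut` and `lut_eval`, the evenly spaced table, and the choice of a first-order expansion are assumptions, not details given in the specification.

```python
import math

def build_lut(f, df, lo, hi, n):
    """Tabulate f and its derivative df at n evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    return {"lo": lo, "step": step,
            "f": [f(x) for x in xs],
            "df": [df(x) for x in xs]}

def lut_eval(lut, x):
    """Look up the nearest entry, then apply a first-order expansion around it."""
    i = round((x - lut["lo"]) / lut["step"])
    i = max(0, min(i, len(lut["f"]) - 1))    # clamp to the table range
    dx = x - (lut["lo"] + i * lut["step"])   # offset from the table point
    return lut["f"][i] + lut["df"][i] * dx

# Example: exp(x) on [0, 1]; exp is its own derivative.
exp_lut = build_lut(math.exp, math.exp, 0.0, 1.0, 1025)
approx = lut_eval(exp_lut, 0.3)
```

The same structure would apply to a model look-up if the table stored model outputs and their derivatives instead of function values.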

Accordingly, the invention proposes a three-dimensional processor (100) for computing at least one non-arithmetic model, characterized in that it comprises: a plurality of computing units (100ij), each computing unit (100ij) having at least one three-dimensional memory (3D-M) array (170) and an arithmetic logic circuit (ALC) (180 ALC), the 3D-M array (170) storing at least part of a look-up table (LUT) of the non-arithmetic model, the ALC (180 ALC) performing arithmetic operations on at least part of the data in the LUT; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the ALC (180 ALC) and at least one off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); the non-arithmetic model includes more arithmetic operations than the ALC (180 ALC) supports.

[C] Programmable computing array.

When applied to a programmable computing array, the discrete three-dimensional processor is a three-dimensional programmable computing array. It can customize not only logic functions and arithmetic functions but also non-arithmetic functions. Accordingly, the present invention provides a three-dimensional programmable computing array (100) for customizing at least one non-arithmetic function, comprising: a plurality of programmable logic units (200) and/or programmable connections (300); and a plurality of programmable computing units (400) each comprising at least one three-dimensional memory (3D-M) array (170), the 3D-M array (170) storing at least part of a look-up table (LUT) of the non-arithmetic function; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the programmable logic units (200) and/or programmable connections (300) and at least one off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); customization of the non-arithmetic function is realized by programming the programmable logic units (200) and/or the programmable connections (300) and the programmable computing units (400); the non-arithmetic function includes more arithmetic operations than the programmable logic unit (200) supports.

The use cycle of a programmable computing unit includes two phases: a setup phase and a computation phase. In the setup phase, the LUT of a non-arithmetic function is loaded into the 3D-M array according to the user's needs; in the computation phase, the value of the non-arithmetic function is obtained by looking up the corresponding LUT in the 3D-M array. For a re-programmable 3D-M, LUTs of different non-arithmetic functions can be loaded into the 3D-M array in different use cycles to implement different non-arithmetic functions, thereby realizing reconfigurable computing.
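The two phases can be modeled with a short sketch. The class `ProgrammableComputeUnit`, its method names, the table size, and the nearest-entry look-up are all hypothetical choices made for illustration; the specification does not prescribe them.

```python
import math

class ProgrammableComputeUnit:
    """Toy model of a programmable computing unit backed by a rewritable LUT."""

    def __init__(self, entries=4097):
        self.entries = entries
        self.lut = None          # stands in for the contents of the 3D-M array
        self.lo = None
        self.step = None

    def setup(self, func, lo, hi):
        """Setup phase: load the LUT of `func`, sampled evenly on [lo, hi]."""
        self.lo = lo
        self.step = (hi - lo) / (self.entries - 1)
        self.lut = [func(lo + i * self.step) for i in range(self.entries)]

    def compute(self, x):
        """Computation phase: return the stored value nearest to x."""
        if self.lut is None:
            raise RuntimeError("unit has not been set up")
        i = round((x - self.lo) / self.step)
        i = max(0, min(i, self.entries - 1))
        return self.lut[i]

unit = ProgrammableComputeUnit()
unit.setup(math.sin, 0.0, 1.0)    # first use cycle: sin
y_sin = unit.compute(0.5)
unit.setup(math.log, 1.0, 2.0)    # second use cycle: reconfigured to log
y_log = unit.compute(1.5)
```

Calling `setup` a second time overwrites the stored table, which mirrors the reconfigurable case described for re-programmable 3D-M; a one-time-programmable memory would allow only the first call.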

[D] Pattern processing.

When applied to pattern processing, the discrete three-dimensional processor is a three-dimensional pattern processor. Its basic function is pattern processing. More importantly, most of the patterns involved in the pattern processing are stored locally, so that the pattern processing circuit is very close to the pattern storage circuit and the time required to read a new pattern is very short. In addition, a three-dimensional pattern processor contains thousands of storage units. During pattern processing, the input data is sent to all storage units, which perform pattern processing simultaneously, ensuring massively parallel computation. The three-dimensional pattern processor can therefore process a large pattern library quickly and efficiently.
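The broadcast-and-match flow described above can be sketched as follows. The sketch assumes, for illustration only, that patterns are strings, that "pattern processing" means testing whether a stored pattern occurs in the input data, and that the library is distributed round-robin; the function names are hypothetical. Real hardware would run the units concurrently, whereas this loop is sequential.

```python
def make_units(pattern_library, num_units):
    """Distribute the pattern library round-robin over the units' local stores."""
    units = [[] for _ in range(num_units)]
    for k, pattern in enumerate(pattern_library):
        units[k % num_units].append(pattern)
    return units

def unit_match(local_patterns, data):
    """One unit's work: report which locally stored patterns occur in the data."""
    return [p for p in local_patterns if p in data]

def process(units, data):
    """Send the same input data to every unit and gather all matches."""
    matches = []
    for local_patterns in units:   # sequential stand-in for parallel units
        matches.extend(unit_match(local_patterns, data))
    return matches

units = make_units(["abc", "xyz", "bcd", "qq"], num_units=2)
found = process(units, "zabcdz")
```

Because each unit only touches its own local store, the per-unit work shrinks as the number of units grows, which is the property the paragraph above relies on.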

Accordingly, the invention proposes a discrete three-dimensional pattern processor (100), characterized in that it comprises: an input (110) for transmitting at least part of a first pattern; a plurality of storage units (100aa-100mn) electrically coupled to the input (110), each storage unit (100ij) comprising at least one three-dimensional memory (3D-M) array (170) and a pattern processing circuit (180 PPC), the 3D-M array (170) storing at least part of a second pattern, the pattern processing circuit (180 PPC) performing pattern processing on the first and second patterns; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the pattern processing circuit (180 PPC) and at least one off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

[E] Neural network.

When applied to a neural network, the discrete three-dimensional processor is a three-dimensional neural network processor. Its basic function is neural computation. More importantly, most of the synaptic weights required by the neural computation are stored locally, so that the neural computation circuit is close to the circuit that stores the synaptic weights and the time required to read the synaptic weights is short. In addition, a three-dimensional neural network processor contains thousands of storage units. During neural computation, the input data is sent to all storage units, which perform neural computation simultaneously, ensuring massively parallel computation. The three-dimensional neural network processor can therefore perform neural computation quickly and efficiently.
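A minimal sketch of this arrangement is given below. It assumes that each unit holds the weight rows of a few neurons, that the same input vector is sent to every unit, and that the activation is ReLU; the activation function and all names are illustrative assumptions, since the paragraph above does not specify them.

```python
def unit_forward(weights, inputs):
    """One unit: weighted sums over its locally stored synaptic weights, then ReLU."""
    outputs = []
    for row in weights:                       # one row of weights per neuron
        s = sum(w * x for w, x in zip(row, inputs))
        outputs.append(max(0.0, s))           # ReLU is an assumed activation
    return outputs

def layer_forward(units, inputs):
    """Send the same inputs to every unit and concatenate the unit outputs."""
    outputs = []
    for weights in units:                     # sequential stand-in for parallel units
        outputs.extend(unit_forward(weights, inputs))
    return outputs

# Two units: the first stores one neuron's weights, the second stores two.
units = [
    [[1.0, -1.0]],
    [[0.5, 0.5], [-1.0, 0.0]],
]
activations = layer_forward(units, [2.0, 1.0])
```

Only the input vector travels to the units; the weights never leave their unit, which models the short weight-read path described above.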

Accordingly, the invention proposes a discrete three-dimensional neural network processor (100), characterized in that it comprises: a plurality of storage units (100aa-100mn), each storage unit (100ij) comprising at least one three-dimensional memory (3D-M) array (170) and a neural computation circuit (180 NPC), the 3D-M array (170) storing at least part of the synaptic weights, the neural computation circuit (180 NPC) performing neural computation based on the synaptic weights; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the neural computation circuit (180 NPC) and at least one off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

Drawings

FIG. 1AA is a perspective view of a conventional processor (prior art); FIG. 1AB lists all transcendental functions (prior art) supported by an Intel Itanium (IA-64) processor; FIG. 1BA is a circuit diagram of an amplifier circuit; FIG. 1BB lists the amount of computation required by different transistor models to compute a current point (prior art); FIG. 1C is a schematic diagram of a neural network; FIG. 1DA is a block circuit diagram of a neural network processor (prior art); FIG. 1DB is a chip layout diagram of a neural network accelerator (prior art); FIG. 1EA is a circuit layout diagram of an integrated three-dimensional processor (prior art); fig. 1EB is a cross-sectional view of the three-dimensional processor.

Fig. 2A-2C are general descriptions of a separate three-dimensional processor: FIG. 2A is a block circuit diagram thereof; FIG. 2B is a block circuit diagram of a memory unit; fig. 2C is a circuit layout diagram of two chips in a separate three-dimensional processor.

Fig. 3A-3D are cross-sectional views of four separate three-dimensional processors.

Fig. 4A to 4D are cross-sectional views of four kinds of first chips.

Fig. 5 is a cross-sectional view of a second chip.

FIG. 6A is a circuit layout diagram of a first chip; fig. 6BA to 6BB are circuit layout diagrams of two kinds of second chips.

Fig. 7A-7C are block circuit diagrams of three types of storage units.

Fig. 8A-8C are circuit layouts of three kinds of storage units in the first and second chips.

Fig. 9 is a circuit block diagram of a computing unit.

Fig. 10A to 10C are circuit block diagrams of three kinds of Arithmetic Logic Circuits (ALCs).

FIG. 11A is a block circuit diagram of a first type of computational unit; fig. 11B is a circuit diagram of one specific implementation of the computational cell.

Fig. 12 is a circuit block diagram of a second calculation unit.

Fig. 13 is a circuit block diagram of a third calculation unit.

FIG. 14A is a circuit block diagram of a programmable cell; fig. 14B shows functional blocks included in the programmable unit.

FIG. 15A is a circuit block diagram of a first programmable compute unit; fig. 15B is a circuit block diagram of a second programmable computing unit.

FIG. 16 shows two cycles of use of a programmable computational cell.

FIG. 17A discloses a connection library in which programmable connections can be implemented; fig. 17B discloses a logic operation library that can be implemented by a programmable logic unit.

FIG. 18 is a layout diagram of a first three-dimensional programmable computational array.

FIG. 19 is a diagram of a first three-dimensional programmable computational array implementing a non-arithmetic function.

FIG. 20 is a layout diagram of a second three-dimensional programmable computational array.

FIGS. 21A-21B are setup diagrams for the second three-dimensional programmable computational array implementing two mathematical functions.

FIG. 22 is a circuit block diagram of a split three-dimensional parallel processor.

FIG. 23 is a block circuit diagram of a storage unit in a three-dimensional pattern processor.

FIG. 24 is a block circuit diagram of a storage and computation unit in a three-dimensional neural network processor.

FIG. 25 is a block circuit diagram of a neural computation circuit.

Fig. 26A to 26B are circuit block diagrams of two kinds of calculation circuits.

It is noted that the figures are diagrammatic and not drawn to scale. Dimensions and structures of parts in the figures may be exaggerated or reduced for clarity and convenience. In different embodiments, alphabetic suffixes following numbers represent different instances of the same class of structure; the same numerical prefixes refer to the same or similar structures.

In this specification, "/" denotes a relationship of "and" or "or". "Memory" broadly refers to any semiconductor-based information storage device that can store information permanently or temporarily. A "memory array" (e.g., a 3D-M array) is the collection of all memory cells that share at least one address line. "Circuit in a substrate" means that the active elements of the circuit (e.g., transistors, memory cells) are located in the substrate; the interconnect lines that connect the active elements may be located above the substrate. "Circuit on a substrate" means that the active elements of the circuit (e.g., transistors, memory cells) and their interconnect lines are all located above the substrate. "Electrically coupled" means any form of coupling by which an electrical signal can be transmitted from one element to another. "Look-up table (LUT)" (including 3DM-LUT) refers both to the data in the LUT and to the memory circuit that stores the LUT (i.e., the LUT memory); this specification does not distinguish between the two. "Pattern" refers both to the abstract pattern and to its physical representation (i.e., the data associated with the pattern); this specification does not distinguish between the two.

Detailed Description

FIGS. 2A-2C give an overview of the discrete three-dimensional processor 100. FIG. 2A is its circuit block diagram. The discrete three-dimensional processor 100 can not only process data but also store data. More importantly, a significant portion of the data it processes is stored locally, in close proximity to the circuits that process it. The discrete three-dimensional processor 100 contains an array of m x n storage units 100aa-100mn. Taking the storage unit 100ij as an example, it has an input 110 and an output 120. In general, a discrete three-dimensional processor 100 may contain thousands of storage units 100aa-100mn, which support massively parallel computation.

Fig. 2B is a circuit block diagram of a storage unit 100ij. The storage unit 100ij includes a memory circuit 170 and a logic circuit 180 electrically coupled via a plurality of inter-chip connections 160 (see fig. 3A-3D). The memory circuit 170 contains at least one 3D-M array. The 3D-M array stores data, and the logic circuit 180 processes a portion of the data. Since the 3D-M array 170 is not located in the same chip as the logic circuit 180 (see FIG. 2C), the 3D-M array 170 is represented by a dashed line.

FIG. 2C shows an implementation of the discrete three-dimensional processor 100, which includes at least a first chip (also referred to as a memory chip) 100a and at least a second chip (also referred to as a logic chip) 100b. The first chip 100a contains three-dimensional circuitry, in this embodiment a 3D-M array 170. The second chip 100b contains two-dimensional circuitry, in this embodiment the logic circuit 180 and a peripheral circuit component 190 of the 3D-M array 170. The inter-chip connections 160 electrically couple the first chip 100a and the second chip 100b. Since the peripheral circuit component 190 is in a different chip from the 3D-M array 170, it is referred to as an off-chip peripheral circuit component. Note that part of the logic circuit may be located in the first chip 100a; for example, part of the logic circuit may be integrated below the 3D-M array 170. For simplicity, in this specification "the logic circuit" refers to the logic circuit 180 located in the second chip 100b, unless otherwise specified.

The circuit partitioning strategy employed by the discrete three-dimensional processor 100 is to have the second chip 100b contain as many off-chip peripheral circuit components 190 as possible. The peripheral circuit component 190 is an integral part of a memory chip; a memory chip that lacks it (e.g., the first chip 100a) cannot independently perform the basic functions of a memory (e.g., its performance does not meet the industry standard for memory chips of the same kind). A typical peripheral circuit component 190 may be an address decoder, a sense amplifier circuit, a write circuit, a read voltage generation circuit, a write voltage generation circuit, a data buffer, or a portion thereof.

Since the read/write voltage generally differs in value from the external supply voltage, a read/write voltage generation circuit is required to convert the external supply voltage into the read/write voltage of the 3D-M array 170. The voltage generation circuit preferably uses a DC-DC converter. DC-DC converters include step-up converters and step-down converters. The output voltage of a step-up converter is higher than its input voltage, and the output voltage of a step-down converter is lower than its input voltage. Examples of step-up converters include the charge pump, the boost converter, and the like. Examples of step-down converters include the low-dropout (LDO) regulator, the buck converter, and the like.
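As a rough illustration of the step-up/step-down distinction, the sketch below picks a converter type from the supply and target voltages and returns the textbook ideal duty cycle for the buck and boost converter cases (V_out = D·V_in for an ideal buck converter, V_out = V_in/(1 − D) for an ideal boost converter, both lossless and in continuous conduction). These relations do not apply to charge pumps or LDO regulators, and the function name and example voltages are assumptions, not values from the specification.

```python
def ideal_duty_cycle(v_in, v_out):
    """Return (converter kind, ideal duty cycle) for a lossless converter.

    v_in  : external supply voltage in volts
    v_out : required read/write voltage in volts
    """
    if v_in <= 0 or v_out <= 0:
        raise ValueError("voltages must be positive")
    if v_out < v_in:
        return "buck", v_out / v_in           # V_out = D * V_in
    if v_out > v_in:
        return "boost", 1.0 - v_in / v_out    # V_out = V_in / (1 - D)
    return "none", None                       # no conversion needed

# Hypothetical numbers: a 1.8 V supply, a 9 V write voltage, a 0.9 V read voltage.
write_kind, write_duty = ideal_duty_cycle(1.8, 9.0)
read_kind, read_duty = ideal_duty_cycle(1.8, 0.9)
```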

The address/data translation circuit converts external logical addresses/data (the addresses/data seen by a user or a host) into physical addresses/data of the 3D-M array 170, and vice versa. The address translation circuit typically contains a non-volatile memory that stores an address mapping table, a bad-block table, and/or a wear management table. The data translation circuit typically contains an error checking and correction (ECC) encoder and/or an ECC decoder. Other circuits may also serve as the peripheral circuit component 190, as will be apparent to those skilled in the art.
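The role of the address mapping table and the bad-block table can be sketched as follows. The data structures, the block granularity, and the policy of redirecting a bad block to the next spare block are hypothetical simplifications chosen for illustration; the specification does not describe a particular remapping policy.

```python
def translate(logical_block, mapping_table, bad_blocks, spare_blocks):
    """Translate a logical block number into a usable physical block number.

    mapping_table : dict mapping logical block -> physical block
    bad_blocks    : set of physical blocks known to be faulty
    spare_blocks  : list of unused physical blocks kept in reserve
    """
    physical = mapping_table[logical_block]
    if physical in bad_blocks:
        # Redirect to a spare block and remember the new mapping.
        if not spare_blocks:
            raise RuntimeError("no spare blocks left")
        physical = spare_blocks.pop(0)
        mapping_table[logical_block] = physical
    return physical

mapping = {0: 10, 1: 11}
bad = {11}
spares = [100, 101]
good_address = translate(0, mapping, bad, spares)      # healthy block, unchanged
remapped_address = translate(1, mapping, bad, spares)  # bad block, redirected
```

A fuller model would also check that the chosen spare is itself healthy and would consult the wear management table when picking it.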

Fig. 3A-3D are cross-sectional views of four separate three-dimensional processors 100, focusing on showing various implementations of inter-chip connections 160. In the embodiment of fig. 3A, the first chip 100a and the second chip 100b are stacked on each other, i.e., in a direction perpendicular to the chip surfaces. The front sides (i.e., surfaces containing circuitry) of the first chip 100a and the second chip 100b face upward (+ z direction), and the inter-chip connection 160 is implemented between them through the bonding wires 160 w.

In the embodiment of FIG. 3B, the first chip 100a and the second chip 100b are stacked face to face. Specifically, the first chip 100a faces upward (+z direction), and the second chip 100b is flipped so that it faces downward (-z direction). The inter-chip connections 160 are realized by micro-bumps 160x.

The embodiment of FIG. 3C contains two memory chips 100a1, 100a2 and one logic chip 100b. To avoid confusion, the first chips are referred to in this figure as memory chips 100a1, 100a2 and the second chip as logic chip 100b. The memory chips 100a1, 100a2 each contain a plurality of 3D-M arrays; they are stacked on each other and electrically coupled through through-silicon vias (TSVs) 160y. The stacked memory chips 100a1, 100a2 and the logic chip 100b are electrically coupled by micro-bumps 160x. The TSVs 160y and the micro-bumps 160x together constitute the inter-chip connections 160. In this embodiment, the logic circuit 180 in the logic chip 100b processes data stored in both memory chips 100a1 and 100a2.

In the embodiment of FIG. 3D, a first insulating dielectric 168a is formed on the front surface of the first chip 100a, and a plurality of first via holes 160za are then formed in the first insulating dielectric 168a. Likewise, a second insulating dielectric 168b is formed on the front surface of the second chip 100b, and a plurality of second via holes 160zb are then formed in the second insulating dielectric 168b. After the second chip 100b is flipped over, the first via holes 160za and the second via holes 160zb are aligned and the first and second chips 100a, 100b are bonded. The first and second chips 100a, 100b thus realize the inter-chip connections 160 through the electrically contacting first and second via holes 160za, 160zb. Since the via holes 160za, 160zb are formed by standard chip manufacturing processes, they can be very small and very numerous. Therefore, high-bandwidth inter-chip connections 160 can be formed between the first chip 100a and the second chip 100b. In this embodiment, the via holes 160za and 160zb are collectively referred to as vertical interconnect accesses (VIAs).

In the above-described embodiment, the memory circuit 170 and the logic circuit 180 are in close proximity (relative to a conventional von Neumann architecture). In addition, for the embodiments of fig. 3B-3D, and particularly the embodiments of fig. 3C-3D, the number of inter-chip connections (TSVs or VIAs) 160 is large, which may enable ultra-wide bandwidth between the memory circuit 170 and the logic circuit 180. Coupled with massively parallel processing (fig. 2A), the separated three-dimensional processor 100 performs well.

FIGS. 4A-4D show four first chips 100a in which the 3D-M array 170 is monolithically integrated, i.e., its memory cells are stacked on one another in the vertical direction without any semiconductor substrate between them.

According to its physical structure, 3D-M is classified into three-dimensional horizontal memory (3D-M_H) and three-dimensional vertical memory (3D-M_V). In 3D-M_H, all address lines are horizontal, and the memory cells form a plurality of horizontal memory layers that are vertically stacked on the substrate circuit. A typical example of 3D-M_H is 3D-XPoint. In 3D-M_V, the memory cells form a plurality of vertical memory strings arranged side by side on the substrate circuit. A typical example of 3D-M_V is 3D-NAND. 3D-M_H is faster, whereas 3D-M_V has a greater storage density.

The 3D-M is classified into a 3D-RAM (three-dimensional random access memory) and a 3D-ROM (three-dimensional read only memory) according to the length of time for storing information. The 3D-RAM can temporarily store information and is mainly used for caching; the 3D-ROM can store information for a long period of time, and is a non-volatile memory (NVM). Most 3D-M arrays in the present invention are 3D-ROMs.

According to its programmability, 3D-M is classified into three-dimensional writable memory (3D-W) and three-dimensional printed memory (3D-P). The information stored in a 3D-W is entered by electrical programming. According to the number of times it can be programmed, 3D-W is further divided into three-dimensional one-time-programmable memory (3D-OTP) and three-dimensional multiple-time-programmable memory (3D-MTP), the latter including re-programmable memory. Common 3D-MTPs include 3D-XPoint and 3D-NAND. Other 3D-MTPs include memristor, resistive random-access memory (RRAM), phase-change memory (PCM), programmable metallization cell (PMC), conductive-bridging random-access memory (CBRAM), and the like.

The 3D-P stored information is entered by printing during the factory production process (imprinting method). This information is permanently fixed and cannot be changed after shipment. The printing method may be photo-lithography (photo-lithography), nano-imprint method (nano-imprint), electron beam scanning exposure (e-beam lithography), DUV scanning exposure, laser scanning exposure (laser patterning), or the like. A common 3D-P is a three-dimensional mask-programmed read-only memory (3D-MPROM), which is programmed to record data through a mask by photolithography. Since it has no electrical programming requirement, the 3D-P memory cell can be biased at a higher voltage when reading. Therefore, the 3D-P read speed is faster than the 3D-W.

The first chip 100a in FIGS. 4A-4B has a substrate circuit 0Ka and a 3D-M_H array 170 stacked on the substrate circuit 0Ka. The substrate circuit 0Ka contains transistors 0t and interconnect lines 0ia. The transistors 0t are formed in the first semiconductor substrate 0a and are electrically coupled to one another through the substrate interconnect lines 0ia. The substrate interconnect lines 0ia include two interconnect layers 0m1a-0m2a, and each interconnect layer (e.g., 0m1a) includes a plurality of interconnects (e.g., 0m) in the same physical plane. The 3D-M_H array 170 includes four address line layers 0a1a-0a4a, each address line layer (e.g., 0a1a) including a plurality of address lines (e.g., 1a) in the same physical plane. These address line layers 0a1a-0a4a form two memory layers 16A, 16B, where the memory layer 16A is stacked on the substrate circuit 0Ka and the memory layer 16B is stacked on the memory layer 16A. A memory cell (e.g., 7aa) is located at the intersection of two address lines (e.g., 1a, 2a). The memory layers 16A and 16B are electrically coupled to the substrate circuit 0Ka through the contact vias 1av and 3av, respectively, which form the on-chip connections 150. The contact vias 1av, 3av each contain a plurality of via holes, each of which penetrates at least one insulating layer and is electrically coupled to the via holes above and below it. In FIGS. 4A-4B, the substrate circuit 0Ka includes at least a portion of the peripheral circuitry of the 3D-M_H array 170. In some embodiments, the substrate circuit 0Ka may also contain part of a logic circuit.

The 3D-M_H array 170 in FIG. 4A is a 3D-W. The memory cell 7aa includes a programming film 5 and a diode film 6. The programming film 5 may be an antifuse film (one-time programmable, for 3D-OTP) or a resistive-switching film such as that used in resistive RAM (RRAM) (re-programmable, for 3D-MTP). The diode film 6 has the following general characteristics: at the read voltage, its resistance is small; when the applied voltage is smaller than the read voltage or opposite in polarity to it, its resistance is larger. The diode film may be a P-i-N diode, a metal-oxide (e.g., TiO₂) diode, or the like.

The 3D-M_H array 170 in FIG. 4B is a 3D-P. It contains at least two types of memory cells: a high-resistance memory cell 7ab and a low-resistance memory cell 7ac. The low-resistance memory cell 7ac contains a diode film 6 similar to the diode film 6 in the 3D-W. The high-resistance memory cell 7ab additionally includes a high-resistance film 9, which is an insulating film (e.g., silicon oxide or silicon nitride). During production, the high-resistance film 9 at the location of the low-resistance memory cell 7ac is physically removed.
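How the presence or absence of the high-resistance film encodes data can be sketched with a simple resistance model. Everything numeric here is an assumption made for illustration: the two resistance values, the read voltage, the current threshold, and the convention that a low-resistance cell represents bit 1. Diode behavior and sneak currents are ignored.

```python
R_LOW = 1e4          # assumed resistance (ohms) where the film was removed
R_HIGH = 1e9         # assumed resistance (ohms) where the film remains
V_READ = 1.0         # assumed read voltage (volts)
I_THRESHOLD = 1e-6   # assumed sense threshold (amperes)

def print_array(data):
    """Fix the data at production time: a 1 bit means the film is removed."""
    return [[R_LOW if bit else R_HIGH for bit in row] for row in data]

def read_bit(array, word_line, bit_line):
    """Read one cell by comparing its current with the sense threshold."""
    current = V_READ / array[word_line][bit_line]
    return 1 if current > I_THRESHOLD else 0

data = [[1, 0, 1],
        [0, 0, 1]]
array = print_array(data)
read_back = [[read_bit(array, w, b) for b in range(3)] for w in range(2)]
```

There is no write function in this model, reflecting that the contents of a 3D-P are set during production and cannot be changed afterward.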

The first chip 100a in FIGS. 4C-4D has a substrate circuit 0Ka and a 3D-M_V array 170 stacked on the substrate circuit 0Ka. The substrate circuit 0Ka is similar to the substrate circuits in FIGS. 4A-4B. In certain embodiments, there is no substrate circuit 0Ka below the 3D-M_V array 170. The 3D-M_V array 170 contains a plurality of vertically stacked horizontal address line layers 0a1a-0a8a, each horizontal address line layer (e.g., 0a5a) containing a plurality of horizontal address lines (e.g., 15) in the same physical plane. The 3D-M_V array 170 also contains a set of vertical address lines that are perpendicular to the substrate 0a (i.e., along the +z direction). The storage density of 3D-M_V is the highest among all semiconductor memories. For simplicity, the on-chip connections 150 electrically coupled between the 3D-M_V array 170 and the substrate circuit 0Ka are not shown in FIGS. 4C-4D; they are well known to those skilled in the art.

The 3D-M_V array 170 in FIG. 4C employs transistors or transistor-like devices as memory cells. It contains a plurality of vertical memory strings 16X, 16Y arranged side by side. Each memory string (e.g., 16Y) contains a plurality of vertically stacked memory cells (e.g., 18ay-18hy). Each memory cell (e.g., 18fy) contains a vertical transistor having a gate 15 (which is a horizontal address line), a memory film 17, and a vertical channel 19 (which is a vertical address line). The memory film 17 may be a composite film of silicon oxide-silicon nitride-silicon oxide, silicon oxide-polysilicon-silicon oxide, or the like. This 3D-M_V array 170 is a 3D-NAND, whose production process is well known to those skilled in the art.

The 3D-M_V array 170 in FIG. 4D employs diodes or diode-like devices as memory cells. It contains a plurality of vertical memory strings 16U-16W arranged side by side. Each memory string (e.g., 16U) contains a plurality of vertically stacked memory cells 18au-18hu. The 3D-M_V array 170 contains a plurality of vertically stacked horizontal address lines (word lines) 15. After a plurality of memory wells 11 penetrating these horizontal address lines 15 are etched, the sidewalls of the memory wells 11 are covered with a programming film 13 and the wells are filled with a conductive material to form vertical address lines (bit lines) 19. The conductive material may be a metallic material or a doped semiconductor material. The memory cells 18au-18hu are formed at the intersections of the word lines 15 and the bit lines 19. The programming film 13 may be one-time programmable (OTP, e.g., an antifuse film) or multiple-time programmable (MTP, e.g., an RRAM film).

To reduce crosstalk between memory cells, a diode is preferably formed between the word line 15 and the bit line 19. In one embodiment, the programming film 13 itself may have diode-like electrical characteristics. In another embodiment, a diode film (not shown) may be deposited separately on the sidewalls of the memory well 11. In a third embodiment, a built-in diode (e.g., a P-N diode or a Schottky diode) may form naturally between the word line 15 and the bit line 19. For details of the built-in diode, see Chinese patent application 201811117502.7 (filing date: September 20, 2018).

The second chip 100b in fig. 5 is a conventional two-dimensional circuit 0Kb for implementing the logic circuit 180 and the off-chip peripheral circuit assembly 190. The second chip 100b includes a transistor 0t and an interconnection line 0ib. The transistor 0t is formed in the second semiconductor substrate 0b, and electrically coupled therebetween through an interconnection line 0ib. In this embodiment, interconnect 0ib includes four interconnect layers 0m1b-0m4b, each interconnect layer (e.g., 0m1 b) including multiple interconnects (e.g., 0 m) in the same physical plane.

Comparing the first chip 100a (FIGS. 4A-4D) with the second chip 100b (FIG. 5), the number of back-end wiring layers in the first chip 100a is larger than that in the second chip 100b. For example, the first chip 100a of FIGS. 4A-4B has six back-end wiring layers (0m1a-0m2a, 0a1a-0a4a), and the first chip 100a of FIGS. 4C-4D has ten back-end wiring layers (0m1a-0m2a, 0a1a-0a8a); both exceed the four back-end wiring layers (0m1b-0m4b) of the second chip 100b of FIG. 5. Even if only the address line layers in the first chip 100a are counted, their number is equal to or greater than the number of interconnect layers in the second chip 100b. Especially for the 3D-M_V array 170, the number of address line layers in the first chip 100a (which is approximately equal to the number of memory cells in a memory string, is approaching one hundred, and is still growing) is much greater than, and at least twice, the number of interconnect layers in the second chip 100b (e.g., four layers).

On the other hand, since the second chip 100b is designed and manufactured independently, the number of interconnect layers in its interconnect lines 0ib is larger than that in the substrate interconnect lines 0ia of the first chip 100a. For example, the second chip 100b in FIG. 5 has four interconnect layers (0m1b-0m4b), more than the two interconnect layers (0m1a-0m2a) of the first chip 100a in FIGS. 4A-4D. Therefore, circuit layout is easier in the second chip 100b than in the first chip 100a (or in the integrated three-dimensional processor 80). Moreover, the second chip 100b can use high-speed interconnect materials (e.g., copper), whereas the first chip 100a (or the integrated three-dimensional processor 80) can use only high-temperature interconnect materials (e.g., tungsten), which are generally slower.

FIGS. 6A-6BB are circuit layout diagrams of the first and second chips 100a, 100b of two discrete three-dimensional processors 100, and show more detail than FIG. 2C. This embodiment corresponds to the embodiments of FIGS. 7A and 8A. Those skilled in the art can readily generalize it to the embodiments of FIGS. 7B and 8B, and of FIGS. 7C and 8C.

FIG. 6A shows a first chip 100a that contains a plurality of 3D-M arrays 170aa-170mn. FIG. 6BA shows a second chip 100b that contains a plurality of logic circuits 180aa-180mn and a global peripheral circuit component 190G. The global peripheral circuit component 190G is located outside the projection of all 3D-M arrays 170aa-170mn onto the second chip 100b. The three-dimensional processor 100 of FIGS. 6A and 6BA employs a "full alignment" technique, i.e., the circuit layout on the two chips 100a, 100b meets the following requirement: when the two chips 100a, 100b are stacked, each 3D-M array (e.g., 170ij) has a logic circuit (e.g., 180ij) vertically aligned with and electrically coupled to it (see FIGS. 8A-8C). Since one logic circuit (e.g., 180ij) may have multiple 3D-M arrays (e.g., 170ijA-170ijD, 170ijW-170ijZ) vertically aligned with and electrically coupled to it (see FIGS. 8B-8C), the period of the logic circuits (e.g., 180ij) on the second chip 100b is an integer multiple of the period of the 3D-M arrays (e.g., 170ij) on the first chip 100a.
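The integer-multiple relationship between the two periods can be sketched as a simple index mapping. The function names, the grouping parameters, and the pitch values below are hypothetical and are not taken from the specification; the sketch only illustrates that every 3D-M array maps to exactly one vertically aligned logic circuit.

```python
def logic_index(array_row, array_col, rows_per_logic=1, cols_per_logic=1):
    """Return the (row, col) index of the logic circuit aligned with a 3D-M array.

    rows_per_logic and cols_per_logic are assumed grouping factors: 1 and 1 when
    one logic circuit serves one array, 2 and 2 when it serves four arrays, and
    2 and 4 when it serves eight arrays (four along x, two along y).
    """
    return (array_row // rows_per_logic, array_col // cols_per_logic)


def logic_period(array_period_x, array_period_y, rows_per_logic=1, cols_per_logic=1):
    """The logic-circuit period is an integer multiple of the array period."""
    return (array_period_x * cols_per_logic, array_period_y * rows_per_logic)


# Hypothetical array pitch of 10 units in both directions.
print(logic_index(3, 5, rows_per_logic=2, cols_per_logic=2))
print(logic_period(10, 10, rows_per_logic=2, cols_per_logic=4))
```

With a grouping of two rows by four columns, the logic period is four array periods along x and two along y, so the logic circuit spans eight array footprints.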

FIG. 6BB illustrates another second chip 100b, which further includes a plurality of local peripheral circuit components 190aa-190mn. The three-dimensional processor 100 of FIGS. 6A and 6BB may also employ the "full alignment" technique, in which each local peripheral circuit component 190aa-190mn is vertically aligned with and electrically coupled to a 3D-M array (e.g., 170ij). In addition to the local peripheral circuit components 190aa-190mn, the embodiment in FIG. 6BB may also contain the global peripheral circuit component 190G. In this specification, the local peripheral circuit components 190aa-190mn and the global peripheral circuit component 190G are collectively referred to as the off-chip peripheral circuit component 190.

In the embodiments of FIGS. 6A-6BB, a local peripheral circuit component (e.g., 190ij) typically includes part of an address decoder, part of a read amplifier circuit, or part of a write circuit, etc., and performs at least part of the read and write operations on the memory cells in its 3D-M array (e.g., 170ij). The global peripheral circuit component 190G generally includes a read voltage generation circuit, a write voltage generation circuit, or a data buffer, etc., and generates the read/write voltages. Of course, this partitioning between local and global peripheral circuit components is not absolute. For example, a local peripheral circuit component may include at least a portion of the read/write voltage generation circuit.

FIGS. 7A-8C show three kinds of storage units 100ij. FIGS. 7A-7C are their circuit block diagrams (for simplicity, the off-chip peripheral circuit component 190ij is not shown in FIGS. 7A-7C); FIGS. 8A-8C are their circuit layout diagrams. In these embodiments, one logic circuit 180ij serves different numbers of 3D-M arrays 170ij.

The logic circuit 180ij in FIG. 7A serves one 3D-M array 170ij: it processes the data stored in the 3D-M array 170ij. The logic circuit 180ij in FIG. 7B serves four 3D-M arrays 170ijA-170ijD: it processes the data stored in the 3D-M arrays 170ijA-170ijD. The logic circuit 180ij in FIG. 7C serves eight 3D-M arrays 170ijA-170ijD and 170ijW-170ijZ: it processes the data stored in the 3D-M arrays 170ijA-170ijD and 170ijW-170ijZ. As can be seen from FIGS. 8A-8C below, a logic circuit 180ij that serves more 3D-M arrays 170ij generally occupies more chip area and has more functionality. In FIGS. 7A-7C, since the 3D-M array 170ij and the logic circuit 180ij are located on different chips (see FIG. 2C and FIGS. 6A-6BB), the 3D-M array 170ij is drawn with a dotted line.

FIGS. 8A-8C show the circuit layout of the second chip 100b and the projection (indicated by dashed lines) of the 3D-M arrays 170 (located in the first chip 100a) onto the second chip 100b. The embodiment of FIG. 8A corresponds to the embodiment of FIG. 7A. In this embodiment, the logic circuit 180ij and the local peripheral circuit component 190ij of the storage unit 100ij are located in the second semiconductor substrate 0b of the second chip 100b. The logic circuit 180ij and the off-chip peripheral circuit component 190ij are at least partially covered by the 3D-M array 170ij.

In this embodiment, the period of the logic circuit 180ij is equal to the period of the 3D-M array 170ij, and the area of the logic circuit cannot exceed the projected area of the 3D-M array 170ij on the second chip 100b, so its functionality is limited. This embodiment is well suited to simpler data processing. FIGS. 8B-8C disclose two more complex logic circuits 180.

The embodiment of FIG. 8B corresponds to the embodiment of FIG. 7B. In this embodiment, the logic circuit 180ij and the off-chip peripheral circuit component 190ij of the storage unit 100ij are located in the second chip 100b and are at least partially covered by four 3D-M arrays 170ijA-170ijD. Under the four 3D-M arrays 170ijA-170ijD, the logic circuit 180ij can be laid out freely. The logic circuit 180ij in FIG. 8B has twice the period and four times the area of the 3D-M array 170ij in FIG. 8A, and can thus implement more complicated processing functions.

The embodiment of FIG. 8C corresponds to the embodiment of FIG. 7C. In this embodiment, the logic circuit 180ij and the off-chip peripheral circuit component 190ij of the storage unit 100ij are located in the second chip 100b. The eight 3D-M arrays 170ijA-170ijD, 170ijW-170ijZ are divided into two groups 170ijSA, 170ijSB. Each group (e.g., 170ijSA) includes four 3D-M arrays (e.g., 170ijA-170ijD). Below the first group 170ijSA of four 3D-M arrays 170ijA-170ijD, the first logic circuit component 180ijA can be laid out freely. Similarly, the second logic circuit component 180ijB can be laid out freely below the second group 170ijSB of four 3D-M arrays 170ijW-170ijZ. The first logic circuit component 180ijA and the second logic circuit component 180ijB together constitute the logic circuit 180ij. In this embodiment, gaps (e.g., G) are left between adjacent off-chip peripheral circuit components to form routing channels 182, 184, 186, which provide electrical coupling between different logic circuit components 180ijA, 180ijB, or between different logic circuits. The period of the logic circuit 180ij in FIG. 8C is four times (in the x direction) the period of the 3D-M array 170ij in FIG. 8A, and its area is eight times as large, so that more complicated processing functions can be realized.

In the discrete three-dimensional processor 100, since the first chip 100a and the second chip 100b can be designed and manufactured separately, they can have distinct back-end structures. Since the back-end structure of the second chip 100b can be optimized on its own, its off-chip peripheral circuit component 190 and logic circuit 180 have lower cost and better performance than the same kinds of circuits in the integrated three-dimensional processor 80. The discrete three-dimensional processor 100 and the integrated three-dimensional processor 80 are compared below.

First, since the first chip 100a does not include the off-chip peripheral circuit component 190 and the logic circuit 180, its array efficiency is high. Furthermore, as a two-dimensional circuit, the second chip 100b has far fewer back-end wiring layers than the integrated three-dimensional processor 80 and can be manufactured using conventional processes. Since wafer cost is roughly proportional to the number of back-end wiring layers, the wafer cost of the second chip 100b is much lower than that of the integrated three-dimensional processor 80. Therefore, the total chip cost of the discrete three-dimensional processor 100 (including the first and second chips 100a, 100b) is lower than that of the integrated three-dimensional processor 80 (including only one chip). Even with the additional bonding cost, the overall cost of the discrete three-dimensional processor 100 is lower.

Second, the off-chip peripheral circuit component 190 and the logic circuit 180 in the discrete three-dimensional processor 100 perform better than the same types of circuits in the integrated three-dimensional processor 80 because they can be optimized on their own. In one embodiment, the number of interconnect layers in the second chip 100b (e.g., four layers, eight layers, or more; FIG. 5) is greater than the number of interconnect layers of the substrate circuit 0K in the integrated three-dimensional processor 80 or the first chip 100a (e.g., two layers; FIG. 1EB). In another embodiment, the second chip 100b employs a high-performance interconnect material (e.g., copper) instead of the high-temperature interconnect material (e.g., tungsten) used by the integrated three-dimensional processor 80 (or the first chip 100a). Thus, the overall performance of the discrete three-dimensional processor 100 is better.

Finally, in the integrated three-dimensional processor 80, the logic circuit 78 is limited in area and functionality because it is confined to a single chip 80 (e.g., within the projected area of the 3D-M array 77 on the substrate 0 in FIG. 1EA). In contrast, in the discrete three-dimensional processor 100, the logic circuit 180 can be formed in both chips 100a, 100b (e.g., a first portion of the logic circuit is located below the 3D-M array 170ij of the first chip 100a in FIG. 6A, similar to the logic circuit 78 located below the 3D-M array 77 in FIG. 1EA, and a second portion of the logic circuit is located in the second chip 100b of FIG. 6BA); its larger area gives the three-dimensional processor 100 greater processing power. Furthermore, since the second chip 100b is designed and produced separately, it offers greater flexibility in design and production. By combining the same first chip 100a with second chips 100b having different functions, processing functions suited to different application scenarios can be realized. Moreover, these processing functions can be implemented with a shorter design cycle and a smaller design budget. Thus, the discrete three-dimensional processor 100 is more powerful and flexible.

The application of the discrete three-dimensional processor in various fields is described below.

[A] Mathematical computation.

When applied to mathematical computation, the discrete three-dimensional processor implements non-arithmetic functions using memory-based computation (MBC), i.e., computation implemented primarily by large-capacity LUTs (i.e., 3DM-LUTs) stored in the 3D-M arrays. In this application, the storage unit 100ij in FIG. 2A is also referred to as a calculation unit. The 3D-M array 170 stores at least part of a look-up table (LUT) of a non-arithmetic function, and the logic circuit 180 is an arithmetic logic circuit (ALC).

FIG. 9 shows a calculation unit 100ij. It includes an input 110, an output 120, a 3D-M array 170, and an ALC 180ALC (i.e., the logic circuit 180 is the ALC 180ALC). The 3D-M array 170 stores at least part of a LUT of a non-arithmetic function (or model), and the ALC 180ALC performs arithmetic operations on the data in the LUT. The 3D-M array 170 and the ALC 180ALC are electrically coupled through the inter-chip connections 160. As previously described, a non-arithmetic function involves operations beyond the arithmetic operations (i.e., addition, subtraction, and multiplication) that the ALC 180ALC supports. Because it cannot be expressed as a combination of these basic arithmetic operations, a non-arithmetic function cannot be implemented by the ALC 180ALC alone; it must be implemented by the ALC 180ALC in combination with the LUT stored in the 3D-M array 170.

FIGS. 10A-10C are circuit block diagrams of three ALCs 180ALC. The ALC 180ALC of FIG. 10A is an adder 180A; the ALC 180ALC of FIG. 10B is a multiplier 180M; the ALC 180ALC of FIG. 10C is a multiply-accumulator (MAC) that contains an adder 180A and a multiplier 180M. The ALC 180ALC may implement integer arithmetic, fixed-point arithmetic, or floating-point arithmetic.

FIGS. 11A-11B show a first calculation unit 100ij, which implements a non-arithmetic function Y = f(X) using the function lookup method. FIG. 11A is its circuit block diagram. The ALC 180ALC contains a pre-processing circuit 180R, a 3DM-LUT 170P, and a post-processing circuit 180T. The pre-processing circuit 180R converts the input variable (X) 110 into an address (A) of the 3DM-LUT 170P. After the data (D) at the address (A) of the 3DM-LUT 170P is read out, the post-processing circuit 180T converts it into the function value (Y) 120. To improve the calculation accuracy, the residue (R) of the input variable (X) is also sent to the post-processing circuit 180T.

FIG. 11B shows a calculation unit 100ij that implements a single-precision non-arithmetic function Y = f(X). The input variable X 110 has 32 bits (x₃₁…x₀). The pre-processing circuit 180R extracts the upper 16 bits (x₃₁…x₁₆) as the 16-bit address A of the 3DM-LUT 170P, and extracts the lower 16 bits (x₁₅…x₀) as the 16-bit residue R, which is sent to the post-processing circuit 180T. The 3DM-LUT 170P contains two 3DM-LUTs 170Q, 170R. Each 3DM-LUT 170Q, 170R has a capacity of 2 Mb (16-bit input, 32-bit output). The 3DM-LUT 170Q stores the function value D1 = f(A), and the 3DM-LUT 170R stores the first derivative value D2 = f'(A). The post-processing circuit 180T contains a multiplier 180M and an adder 180A. The output value (Y) 120 has 32 bits and is calculated by polynomial interpolation. In this embodiment, the polynomial interpolation is a first-order Taylor series: Y(X) = D1 + D2 × R = f(A) + f'(A) × R. Higher-order polynomial interpolation (e.g., a higher-order Taylor series) can further improve the calculation accuracy.
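The address/residue split and the first-order Taylor correction can be illustrated with a short sketch. The following is only an illustration under several assumptions that are not in the specification: the 32-bit word X is interpreted as an unsigned fixed-point fraction x = X / 2³² in [0, 1), the table entries are held as Python floats rather than 32-bit words, and sin is used as a stand-in for f. All function and variable names are hypothetical.

```python
import math

ADDR_BITS = 16   # upper bits used as the LUT address A
RES_BITS = 16    # lower bits used as the residue R


def build_luts(f, df):
    """Fill the function-value table (like 170Q) and the derivative table (like 170R)."""
    n = 1 << ADDR_BITS
    lut_q = [f(a / n) for a in range(n)]    # D1 = f(A)
    lut_r = [df(a / n) for a in range(n)]   # D2 = f'(A)
    return lut_q, lut_r


def preprocess(x_word):
    """Split a 32-bit input word into the 16-bit address and the 16-bit residue."""
    address = x_word >> RES_BITS
    residue = x_word & ((1 << RES_BITS) - 1)
    return address, residue


def postprocess(d1, d2, residue):
    """First-order Taylor step Y = D1 + D2 * R (one multiply, one add)."""
    r = residue / (1 << (ADDR_BITS + RES_BITS))   # residue scaled to a fraction
    return d1 + d2 * r


def evaluate(x_word, lut_q, lut_r):
    address, residue = preprocess(x_word)
    return postprocess(lut_q[address], lut_r[address], residue)


# Example with f = sin and f' = cos (an assumed test function).
LUT_Q, LUT_R = build_luts(math.sin, math.cos)
x_word = 0x80004000                     # x = 0.5 + 2**-18
approx = evaluate(x_word, LUT_Q, LUT_R)
exact = math.sin(x_word / 2**32)
print(abs(approx - exact))
```

Under these assumptions the truncation error is bounded by the second-order Taylor term, roughly |f''| × (2⁻¹⁶)² / 2, which is why two 16-bit-addressed tables can stand in for a single 32-bit-addressed one.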

When non-arithmetic functions are implemented, combining LUTs with polynomial interpolation achieves higher calculation accuracy with smaller LUTs. If the above single-precision function (32-bit input, 32-bit output) were implemented with a LUT alone (without polynomial interpolation), the capacity of the LUT would need to be 2³² × 32 b = 128 Gb, which is not realistic. Polynomial interpolation greatly reduces the required LUT capacity. In the above embodiment, with the first-order Taylor series, the LUTs need only 4 Mb (2 Mb for the function value LUT and 2 Mb for the first derivative value LUT). This is far less than with a LUT alone (4 Mb vs. 128 Gb).
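The capacity figures above follow from counting entries times word width. A minimal sketch of that arithmetic is given here; the helper name `lut_bits` is hypothetical, and Mb and Gb are taken as 2²⁰ and 2³⁰ bits.

```python
def lut_bits(input_bits, output_bits):
    """Capacity of a fully populated LUT: one output word per input code."""
    return (1 << input_bits) * output_bits


MB = 1 << 20   # bits per Mb (binary convention assumed)
GB = 1 << 30   # bits per Gb (binary convention assumed)

direct = lut_bits(32, 32)             # LUT alone: 32-bit input, 32-bit output
interpolated = 2 * lut_bits(16, 32)   # function-value LUT plus derivative LUT

print(direct // GB, "Gb")             # LUT alone
print(interpolated // MB, "Mb")       # LUTs with first-order Taylor series
print(direct // interpolated)         # reduction factor
```

Under these conventions the reduction factor is 2¹⁵ = 32768.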

In addition to elementary functions (including algebraic functions and transcendental functions), the three-dimensional processor 100 can implement various higher functions, such as special functions. Special functions play a significant role in mathematical analysis, functional analysis, physics research, and engineering applications. Many special functions are solutions of differential equations or integrals of basic functions. Examples of special functions include gamma functions, beta functions, Bessel functions, Legendre functions, elliptic functions, Lamé functions, Mathieu functions, Riemann zeta functions, Fresnel integrals, and the like. The three-dimensional processor 100 simplifies the computation of special functions and promotes their application in scientific computing.

FIG. 12 shows a second calculation unit 100ij. This calculation unit 100ij implements a composite function Y = EXP[K × LOG(X)] = X^K using the function lookup method. The calculation unit 100ij contains two 3DM-LUTs 170S, 170T and a multiplier 180M. The 3DM-LUT 170S stores the function values of LOG(), and the 3DM-LUT 170T stores the function values of EXP(). The input variable X is used as the address 110 of the 3DM-LUT 170S. The output LOG(X) 160S of the 3DM-LUT 170S is multiplied by the exponent parameter K at the multiplier 180M, and the product 160T is sent as an address to the 3DM-LUT 170T. The output 120 of the 3DM-LUT 170T is Y = X^K.
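The LOG-multiply-EXP data path can be sketched as two table reads with one multiplication in between. The sketch below is illustrative only: the table size, the input range [1, 2], the limit K ≤ 4, and the uniform address quantization are all assumptions chosen to keep the example small, and the names are hypothetical.

```python
import math

N = 1 << 12                                # entries per table (assumed)
X_MIN, X_MAX = 1.0, 2.0                    # assumed range of the input X
P_MIN, P_MAX = 0.0, 4.0 * math.log(2.0)    # range of K * LOG(X) for 0 <= K <= 4


def to_address(value, lo, hi):
    """Quantize a value to the nearest table address, clamped to the table."""
    idx = round((value - lo) / (hi - lo) * (N - 1))
    return min(max(idx, 0), N - 1)


def from_address(idx, lo, hi):
    """Value represented by a table address."""
    return lo + idx * (hi - lo) / (N - 1)


# Table contents, standing in for the 3DM-LUTs 170S (LOG) and 170T (EXP).
LUT_LOG = [math.log(from_address(i, X_MIN, X_MAX)) for i in range(N)]
LUT_EXP = [math.exp(from_address(i, P_MIN, P_MAX)) for i in range(N)]


def power(x, k):
    """Y = EXP[K * LOG(X)] using only table reads and one multiplication."""
    log_x = LUT_LOG[to_address(x, X_MIN, X_MAX)]      # first lookup
    product = k * log_x                               # multiplier 180M
    return LUT_EXP[to_address(product, P_MIN, P_MAX)] # second lookup


print(power(1.5, 2.0), 1.5 ** 2.0)
```

The accuracy of this sketch is limited by the address quantization of both tables; the interpolation approach described for FIG. 11B could be applied to each lookup to tighten it.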

The functions calculated by the embodiments of FIGS. 11A-11B and FIG. 12 are combined functions. A combined function is a combination of at least two non-arithmetic functions: for example, the single-precision function of FIG. 11B combines a function value with a derivative value, and the composite function of FIG. 12 combines two functions. Accordingly, the invention also proposes a three-dimensional processor (100) for computing a combined function, characterized in that it comprises: a first three-dimensional memory (3D-M) array (170Q or 170S), a second 3D-M array (170R or 170T), and an arithmetic logic circuit (ALC) (180ALC), said first 3D-M array (170Q or 170S) storing at least part of a first look-up table (LUT) for a first non-arithmetic function, said second 3D-M array (170R or 170T) storing at least part of a second LUT for a second non-arithmetic function, said ALC (180ALC) performing arithmetic operations on at least part of the data in said first or second LUT; a first chip (100a) and a second chip (100b), said first chip (100a) containing said first and second 3D-M arrays (170Q, 170R or 170S, 170T), said second chip (100b) containing at least part of said ALC (180ALC) and an off-chip peripheral circuit component (190) of said first or second 3D-M array (170Q, 170R, 170S or 170T); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); said combined function is a combination of said first and second non-arithmetic functions; the first and second non-arithmetic functions involve more operations than the arithmetic operations that the ALC (180ALC) supports.

[B] Computer simulation.

When applied to computer simulation, the discrete three-dimensional processor implements non-arithmetic models, again using MBC. MBC brings great advantages to computer simulation. In this application, the storage unit 100ij in FIG. 2A is also referred to as a calculation unit. The 3D-M array 170 stores at least part of the LUT of a non-arithmetic model, and the logic circuit 180 is an ALC 180ALC.

FIG. 13 shows a third calculation unit 100ij. This calculation unit 100ij implements a computer simulation of the amplifier circuit 0Y (FIG. 1BA) using the model lookup method. The calculation unit 100ij contains a 3DM-LUT 170U, an adder 180A, and a multiplier 180M. The 3DM-LUT 170U stores data related to the performance (e.g., input-output characteristics) of the transistor 0T. The input voltage V_IN is used as the address 110 of the 3DM-LUT 170U, and the read-out data 160U is the drain current I_D. The multiplier 180M multiplies I_D by the negative value -R of the resistor 0R, and the result (-R × I_D) is added to the supply voltage V_DD at the adder 180A to obtain the output voltage V_OUT 120.
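The model lookup data path V_OUT = V_DD + (-R) × I_D can be sketched as one table read, one multiplication, and one addition. In the sketch below, the table contents come from a hypothetical square-law expression that merely stands in for measured transistor data, and the supply voltage, resistance, and voltage step are assumed values chosen for illustration.

```python
V_DD = 3.3        # supply voltage in volts (assumed)
R_LOAD = 1000.0   # resistance of the resistor in ohms (assumed)
V_STEP = 0.01     # address resolution of the table in volts (assumed)

# Hypothetical square-law parameters; real table data would be measurements.
V_TH = 0.7
K_N = 2e-3


def drain_current(v_gs):
    """Stand-in for a measured I_D versus V_GS characteristic."""
    overdrive = max(v_gs - V_TH, 0.0)
    return 0.5 * K_N * overdrive * overdrive


# Table contents, standing in for the 3DM-LUT 170U: one I_D entry per input step.
LUT_U = [drain_current(i * V_STEP) for i in range(331)]


def simulate(v_in):
    """Look up I_D at V_IN, multiply by -R, then add V_DD."""
    address = min(max(round(v_in / V_STEP), 0), len(LUT_U) - 1)
    i_d = LUT_U[address]        # read-out data
    product = -R_LOAD * i_d     # multiplier
    return V_DD + product       # adder


print(simulate(0.0), simulate(1.7))
```

No transistor equations are evaluated at simulation time in this sketch; the only run-time operations are the lookup, the multiplication, and the addition.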

The 3DM-LUT 170U may store a variety of mathematical models. In one embodiment, the model data stored in the 3DM-LUT 170U is raw measurement data, such as measured input-output characteristics. An example is the drain current vs. gate-source voltage (I_D-V_GS) characteristic curve of a transistor. In another embodiment, the model data stored in the 3DM-LUT 170U is smoothed measurement data. Raw measurement data can be smoothed by purely mathematical methods (e.g., a best-fit model) or by a physical model (e.g., the BSIM4 V3.0 transistor model). In a third embodiment, the model data stored in the 3DM-LUT 170U contains not only the measured values of the transistor but also derivatives of the measured values. For example, the model data stored in the 3DM-LUT 170U includes not only the current values (I_D-V_GS) of the transistor 0T but also its transconductance values (G_m-V_GS). As in FIG. 11B, polynomial interpolation (using the derivatives of the measured values) can improve model accuracy while keeping the LUT to a reasonable size.

The model lookup method brings many advantages. It saves a great deal of computation time and energy, since the two software decompositions (from mathematical model to mathematical functions, and then from mathematical functions to built-in functions) are not needed. The model lookup method requires even fewer LUTs than the function lookup method. Since a transistor model (e.g., BSIM4 V3.0) requires hundreds of model parameters, the function lookup method would need a large number of LUTs to calculate the intermediate functions of the transistor model. If the function lookup method is skipped (i.e., the transistor model and its intermediate functions are skipped) and the model lookup method is adopted directly, the transistor performance can be described by three measured parameters (the gate-source voltage V_GS, the drain-source voltage V_DS, and the body-source voltage V_BS). Thus, a smaller LUT suffices to describe the mathematical model of the transistor.

[C] A programmable compute array.

When applied to a programmable compute array, the discrete three-dimensional processor is a three-dimensional programmable compute array. It can customize not only logical functions and arithmetic functions, but also non-arithmetic functions. In a three-dimensional programmable compute array, the storage unit 100ij in FIG. 2A is also referred to as a programmable unit.

FIGS. 14A-14B illustrate a programmable unit 100ij in a three-dimensional programmable compute array. It includes a 3D-M array 170 and a logic circuit 180 (FIG. 14A). The 3D-M array 170 stores at least part of a LUT for a non-arithmetic function, and the logic circuit 180 includes an arithmetic logic circuit (ALC), a programmable logic unit (CLE), and/or a programmable connection (CIT). Accordingly, the functional blocks (FIG. 14B) that the programmable unit 100ij can realize include the programmable compute unit 400 (see FIGS. 15A-15B), the programmable logic unit 200 (see FIG. 17B), and the programmable connection 300 (see FIG. 17A). The programmable compute unit 400 implements non-arithmetic functions based on LUTs; the programmable logic unit 200 implements a logic function selected from a logic operation library; the programmable connection 300 implements a connection selected from a connection library.

The input IN of the programmable compute unit 400 carries input data 410, the output OUT carries output data 420, and the configuration terminal CFG carries a configuration signal 430. When the configuration signal 430 is "write", a LUT for a mathematical function is written into the programmable compute unit 400. When the configuration signal 430 is "read", the value of the mathematical function is read from the programmable compute unit 400. FIGS. 15A-15B show two specific implementations of the programmable compute unit 400. In FIG. 15A, the programmable compute unit 400 is a 3D-M array 170 that stores the function values of a non-arithmetic function. In FIG. 15B, the programmable compute unit 400 is a combination of a 3D-M array 170 and an ALC 180. As in FIG. 11B, the 3D-M array 170 stores the function values and derivative values of a non-arithmetic function, and the ALC 180 performs the polynomial calculation.

FIG. 16 shows two usage cycles of a programmable compute unit 400. The programmable compute array enables reconfigurable computing because its 3D-M array 170 is reprogrammable. The first usage cycle 620 is divided into two phases: a setup phase 610 and a calculation phase 630. In the setup phase 610, the LUT of a first function is loaded into the 3D-M array 170 according to the user's needs; in the calculation phase 630, the corresponding LUT entries are read from the 3D-M array 170 to obtain the values of the first function. Similarly, the second usage cycle 660 is divided into a setup phase 650 and a calculation phase 670. This embodiment is particularly suitable for SIMD (single instruction, multiple data) processing. Once the LUT has been loaded into the 3D-M array 170 during the setup phase 610, a large amount of data can be sent to the programmable compute unit 400 and processed at high speed. SIMD applications are numerous, such as applying the same operation or a vector operation to many pixels in image processing, or the massively parallel computation used in scientific computing.
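The two usage cycles can be sketched as a small object that is configured once and then read many times. The class name, the method names, the table size, and the uniform input quantization below are hypothetical; the sketch only mirrors the setup-then-calculate sequence described above.

```python
import math


class ProgrammableComputeUnit:
    """Toy model of a reprogrammable LUT-based compute unit (names assumed)."""

    def __init__(self, address_bits=10):
        self.size = 1 << address_bits
        self.lut = None
        self.lo = 0.0
        self.hi = 1.0

    def configure(self, func, lo, hi):
        """Setup phase: load the LUT of the chosen function over [lo, hi]."""
        self.lo, self.hi = lo, hi
        step = (hi - lo) / (self.size - 1)
        self.lut = [func(lo + i * step) for i in range(self.size)]

    def compute(self, x):
        """Calculation phase: read the stored value nearest to x."""
        if self.lut is None:
            raise RuntimeError("unit has not been configured")
        idx = round((x - self.lo) / (self.hi - self.lo) * (self.size - 1))
        idx = min(max(idx, 0), self.size - 1)
        return self.lut[idx]


unit = ProgrammableComputeUnit()

# First usage cycle: configure for sin, then stream many inputs through it.
unit.configure(math.sin, 0.0, math.pi)
first = [unit.compute(x) for x in (0.0, math.pi / 2, math.pi)]

# Second usage cycle: reconfigure the same unit for exp.
unit.configure(math.exp, 0.0, 1.0)
second = [unit.compute(x) for x in (0.0, 1.0)]

print(first, second)
```

The setup cost is paid once per cycle, which is why the scheme suits workloads where the same function is applied to a large stream of data.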

FIGS. 17A-17B disclose a connection library and a logic operation library, respectively. FIG. 17A discloses the connection library that can be implemented by the programmable connection 300, which includes the following connections: a) interconnect lines 302/304 are connected and interconnect lines 306/308 are connected, but 302/304 are not connected to 306/308; b) interconnect lines 302/304/306/308 are all connected; c) interconnect lines 306/308 are connected, while interconnect lines 302, 304 are unconnected and are not connected to 306/308; d) interconnect lines 302/304 are connected, while interconnect lines 306, 308 are unconnected and are not connected to 302/304; e) none of the interconnect lines 302, 304, 306, 308 are connected. In this specification, the symbol "/" between two interconnect lines indicates that the two interconnect lines are connected, and the symbol "," between two interconnect lines indicates that the two interconnect lines are not connected.
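One way to picture the connection library is as a set of partitions of the four interconnect lines into electrically connected groups. The data structure below is a hypothetical illustration, not the circuit of FIG. 17A; it also assumes that option d) leaves lines 306 and 308 unconnected and that option e) covers all four lines.

```python
# Each option lists the groups of interconnect lines that are tied together.
CONNECTION_LIBRARY = {
    "a": [{302, 304}, {306, 308}],
    "b": [{302, 304, 306, 308}],
    "c": [{306, 308}, {302}, {304}],
    "d": [{302, 304}, {306}, {308}],   # assumption: 306 and 308 left unconnected
    "e": [{302}, {304}, {306}, {308}], # assumption: all four lines unconnected
}


def connected(option, line_x, line_y):
    """Return True if the two lines share a group under the selected option."""
    return any(line_x in group and line_y in group
               for group in CONNECTION_LIBRARY[option])


print(connected("a", 302, 304), connected("a", 302, 306))
```

Programming a connection then amounts to selecting one key of this dictionary.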

FIG. 17B discloses the logic operation library that can be implemented by the programmable logic unit 200. The inputs A and B are input data 210, 220, and the output C is output data 230. The programmable logic unit 200 can implement the following logic operations: C = A, NOT A, shift of A, AND(A, B), OR(A, B), NAND(A, B), NOR(A, B), XOR(A, B), arithmetic addition A + B, arithmetic subtraction A - B, etc. The programmable logic unit 200 may also contain sequential circuit elements such as registers and flip-flops to implement pipelines and the like. Details of the programmable connection 300 and the programmable logic unit 200 can be found in U.S. Patent 4,870,302.
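The logic operation library can be pictured as a table of selectable two-input operations. In the sketch below, the 8-bit word width, the operation names, the left-by-one shift, and the class interface are assumptions made for illustration; they are not taken from FIG. 17B.

```python
WIDTH = 8                   # word width in bits (assumed)
MASK = (1 << WIDTH) - 1

# Library of selectable operations; inputs A and B, output C.
LOGIC_LIBRARY = {
    "PASS":  lambda a, b: a,
    "NOT":   lambda a, b: ~a & MASK,
    "SHIFT": lambda a, b: (a << 1) & MASK,   # assumed: shift left by one bit
    "AND":   lambda a, b: a & b,
    "OR":    lambda a, b: a | b,
    "NAND":  lambda a, b: ~(a & b) & MASK,
    "NOR":   lambda a, b: ~(a | b) & MASK,
    "XOR":   lambda a, b: a ^ b,
    "ADD":   lambda a, b: (a + b) & MASK,    # wraps around at WIDTH bits
    "SUB":   lambda a, b: (a - b) & MASK,
}


class ProgrammableLogicUnit:
    """Toy unit that applies whichever library operation was selected."""

    def __init__(self):
        self.operation = None

    def configure(self, name):
        self.operation = LOGIC_LIBRARY[name]

    def evaluate(self, a, b=0):
        return self.operation(a & MASK, b & MASK)


unit = ProgrammableLogicUnit()
unit.configure("XOR")
print(unit.evaluate(0b1100, 0b1010))
```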

FIG. 18 shows a first three-dimensional programmable compute array 100. It includes regularly arranged programmable modules 100A, 100B, etc. Each programmable module (e.g., 100A) contains a plurality of programmable compute units (CCEs, e.g., 400AA-400AD) and programmable logic units (CLEs, e.g., 200AA-200AD).

Programmable channels 320, 340 are located between the programmable compute units (e.g., 400AA-400AD) and the programmable logic units (e.g., 200AA-200AD); programmable channels 310, 330, 350 are located between the programmable module 100A and the programmable module 100B. The programmable channels 310-350 contain a plurality of programmable connections (CITs) 300. As those skilled in the art will appreciate, designs other than programmable channels, such as sea-of-gates, may also be used.

Complex functions are often encountered in computation. In this specification, a complex function refers to a non-arithmetic function of multiple independent variables, and a basic function refers to a non-arithmetic function of a single independent variable. In general, a complex function is a combination of basic functions. The three-dimensional programmable compute array 100 makes it possible to customize complex functions, which was not possible in the prior art. To customize a complex function, the complex function is first decomposed into a plurality of basic functions. Each basic function is then implemented by loading its LUT into the corresponding programmable compute unit. Finally, the complex function is realized by programming the programmable logic units and the programmable connections.

FIG. 19 shows the first three-dimensional programmable compute array 100 customized to implement the following complex function: e = a·SIN(b) + c·COS(d). The programmable connections 300 in the programmable channels 310-350 use the notation of FIG. 17A: a dot at an intersection means that the intersecting lines are connected, an intersection without a dot means that the intersecting lines are not connected, and a break means that the interconnect line is divided into two interconnect line segments that are not connected to each other. In this embodiment, the programmable compute unit 400AA is set to LOG(), and its result LOG(a) is sent to the first input of the programmable logic unit 200AA. The programmable compute unit 400AB is set to LOG[SIN()], and its result LOG[SIN(b)] is sent to the second input of the programmable logic unit 200AA. The programmable logic unit 200AA is set to arithmetic addition "+", and its result LOG(a) + LOG[SIN(b)] is sent to the programmable compute unit 400BA. The programmable compute unit 400BA is set to EXP(), and its result EXP{LOG(a) + LOG[SIN(b)]} = a·SIN(b) is sent to the first input of the programmable logic unit 200BA. Similarly, with appropriate settings of the programmable compute units 400AC, 400AD, the programmable logic unit 200AC, and the programmable compute unit 400BC, the result c·COS(d) is sent to the second input of the programmable logic unit 200BA. The programmable logic unit 200BA is set to arithmetic addition "+"; it adds a·SIN(b) and c·COS(d) and sends the final result to the output e. Other complex functions can be implemented by changing the settings of the three-dimensional programmable compute array 100.
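The data flow of this example can be traced in a few lines. The sketch below replaces each LUT-based compute unit with a direct call to Python's `math` functions, so it shows only the routing and the log-domain trick, not the table lookups. The settings assumed for units 400AC, 400AD, and 400BC (LOG(), LOG[COS()], and EXP()) mirror the first branch and are an assumption, since the text only says "appropriate settings". The decomposition requires a, c, SIN(b), and COS(d) to be positive so that the logarithms are defined.

```python
import math

# Settings of the programmable compute units (direct functions stand in for LUTs).
unit_400AA = math.log                             # LOG()
unit_400AB = lambda b: math.log(math.sin(b))      # LOG[SIN()]
unit_400BA = math.exp                             # EXP()
unit_400AC = math.log                             # assumed setting: LOG()
unit_400AD = lambda d: math.log(math.cos(d))      # assumed setting: LOG[COS()]
unit_400BC = math.exp                             # assumed setting: EXP()

# Settings of the programmable logic units: all three perform arithmetic addition.
add = lambda x, y: x + y
unit_200AA = unit_200AC = unit_200BA = add


def evaluate(a, b, c, d):
    """e = a*SIN(b) + c*COS(d), computed with lookups, additions, and routing only."""
    first = unit_400BA(unit_200AA(unit_400AA(a), unit_400AB(b)))    # a*SIN(b)
    second = unit_400BC(unit_200AC(unit_400AC(c), unit_400AD(d)))   # c*COS(d)
    return unit_200BA(first, second)


print(evaluate(2.0, 0.5, 3.0, 0.25))
print(2.0 * math.sin(0.5) + 3.0 * math.cos(0.25))
```

Each product is turned into a sum of logarithms followed by an exponential, so no multiplier is needed in this configuration.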

Accordingly, the present invention also provides a three-dimensional programmable compute array (100) for customizing at least one complex function, comprising: a plurality of programmable logic units (200) and/or programmable connections (300); and a first programmable compute unit (400AA) and a second programmable compute unit (400AC), the first programmable compute unit (400AA) having a first 3D-M array storing at least part of a first look-up table (LUT) for a first non-arithmetic function, the second programmable compute unit (400AC) having a second 3D-M array storing at least part of a second LUT for a second non-arithmetic function; a first chip (100a) and a second chip (100b), said first chip (100a) containing said first and second 3D-M arrays, said second chip (100b) containing at least part of said programmable logic units (200) and/or programmable connections (300) and an off-chip peripheral circuit component (190) of said first or second 3D-M array; the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); the complex function is customized by programming the programmable logic units (200) and/or the programmable connections (300) and the programmable compute units (400); said complex function is a combination of said first and second non-arithmetic functions; the first and second non-arithmetic functions involve more operations than the arithmetic operations supported by the programmable logic units (200).

FIG. 20 shows a second three-dimensional programmable computational array 100. In addition to the programmable computation units 400A, 400B, the programmable logic unit 200A, and the programmable channels 360-380, the programmable computational array 100 also contains a multiplier 500. The multiplier 500 enables the three-dimensional programmable computational array 100 to implement more mathematical functions and makes it more computationally powerful.

FIGS. 21A-21B illustrate two specific implementations of the second three-dimensional programmable computational array 100. The embodiment in FIG. 21A implements the mathematical function h = EXP(f)/g. Here the programmable computation unit 400A is set to implement the basic function EXP(f), and the programmable computation unit 400B is set to implement the basic function INV(g). After the programmable channel 370 is set, the outputs of the programmable computation units 400A, 400B are sent to the multiplier 500. After the programmable channel 380 is set, the final output is h = EXP(f)/g. The embodiment in FIG. 21B implements another mathematical function, h = SIN(f) + COS(g). Here the programmable computation unit 400A is set to implement the basic function SIN(f), and the programmable computation unit 400B is set to implement the basic function COS(g). After the programmable channel 370 is set, the outputs of the programmable computation units 400A, 400B are sent to the programmable logic unit 200A, which implements arithmetic addition "+". After the programmable channel 380 is set, the final output is h = SIN(f) + COS(g).
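The two configurations of FIGS. 21A-21B differ only in the basic functions loaded into units 400A and 400B and in where channel 370 routes their outputs. A minimal sketch of that idea follows; the table names, the `configure` helper and the routing keys are hypothetical, and Python's math functions stand in for the LUTs.

```python
import math

# Hypothetical catalogue of basic functions a computation unit can be set to.
BASIC_FUNCTIONS = {
    "EXP": math.exp,
    "INV": lambda x: 1.0 / x,
    "SIN": math.sin,
    "COS": math.cos,
}

# Hypothetical routing targets selectable through programmable channel 370.
COMBINERS = {
    "multiplier_500": lambda x, y: x * y,
    "logic_unit_200A_add": lambda x, y: x + y,
}

def configure(unit_400a, unit_400b, route):
    """Return h(f, g) for one setting of the array."""
    func_a = BASIC_FUNCTIONS[unit_400a]
    func_b = BASIC_FUNCTIONS[unit_400b]
    combine = COMBINERS[route]
    return lambda f, g: combine(func_a(f), func_b(g))

fig_21a = configure("EXP", "INV", "multiplier_500")        # h = EXP(f)/g
fig_21b = configure("SIN", "COS", "logic_unit_200A_add")   # h = SIN(f) + COS(g)
```

Reprogramming the array corresponds to calling `configure` again with different arguments; nothing else in the sketch changes.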

[D] Pattern processing.

When applied to pattern processing, the discrete three-dimensional processor is a three-dimensional pattern processor. It can perform pattern processing; more importantly, most of the patterns involved in the processing are stored locally.

FIG. 22 shows a discrete three-dimensional parallel processor 100. It comprises an array of m × n storage units 100aa-100mn, each of which is electrically coupled to a common input 110 and a common output 120. Input data are supplied simultaneously to the storage units 100aa-100mn via the common input 110, and pattern processing is carried out simultaneously in the storage units 100aa-100mn. Since the three-dimensional parallel processor 100 contains thousands of storage units 100aa-100mn, it supports massively parallel computation. The three-dimensional parallel processor 100 can be applied to fields such as pattern processing and neural network processing.

When used for pattern processing, the discrete three-dimensional parallel processor 100 is a discrete three-dimensional pattern processor. FIG. 23 shows a storage unit 100ij in the three-dimensional pattern processor 100. It comprises a pattern storage circuit 170 and a pattern processing circuit 180PPC (i.e., the logic circuit 180 is the pattern processing circuit 180PPC), which are electrically coupled via inter-chip connections 160 (FIGS. 3A-3D). The pattern storage circuit 170 includes a 3D-M array 170 that stores at least a portion of a pattern; the pattern processing circuit 180PPC processes the pattern.

The discrete three-dimensional pattern processor 100 may take one of two forms: a processor-like form and a memory-like form. The processor-like three-dimensional pattern processor 100 is a three-dimensional processor with its own search pattern library; it processes target patterns arriving at the input 110 using its locally stored search patterns. Specifically, a search pattern library (e.g., a virus library, a keyword library, an acoustic/language model library, an image model library, etc.) is stored in the 3D-M arrays 170; the input data 110 includes target patterns (e.g., network packets, computer files, big data, voice data, image data, etc.); and the pattern processing circuit 180PPC performs pattern processing on the target pattern according to the search patterns. Since the large number of storage units 100ij (thousands, FIG. 22) supports massively parallel processing and the inter-chip connections 160 have a large bandwidth (FIGS. 3B-3D), the three-dimensional processor 100 offers fast and efficient retrieval.
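The processor-like arrangement can be modelled in a few lines. The sketch below is a software analogy only, with hypothetical class and function names: each `StorageUnit` holds a slice of the search pattern library (standing in for a 3D-M array 170), its `process` method stands in for the pattern processing circuit 180PPC, and plain substring matching is assumed as the pattern processing. The loop runs sequentially, whereas the hardware is described as processing all units simultaneously.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StorageUnit:
    # Slice of the search pattern library stored locally in this unit.
    local_patterns: List[bytes] = field(default_factory=list)

    def process(self, target: bytes) -> List[bytes]:
        # Report every locally stored search pattern that occurs in the target.
        return [p for p in self.local_patterns if p in target]

def broadcast(units: List[StorageUnit], target: bytes) -> List[bytes]:
    """Send the same target pattern to every unit and merge the hits."""
    hits: List[bytes] = []
    for unit in units:
        hits.extend(unit.process(target))
    return hits

units = [
    StorageUnit([b"EVIL", b"WORM"]),
    StorageUnit([b"TROJAN"]),
]
found = broadcast(units, b"..TROJAN..EVIL..")
```

Each unit only ever reads its own pattern slice, which mirrors the point that the search patterns are stored locally rather than fetched from external memory.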

Accordingly, the present invention provides a three-dimensional processor (100) with a search pattern library, comprising: an input (110) for transmitting at least part of a target pattern; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory (3D-M) array (170) and a pattern processing circuit (180 PPC), said 3D-M array (170) storing at least a portion of a search pattern, said pattern processing circuit (180 PPC) performing pattern processing on said target pattern based on said search pattern; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said pattern processing circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

The memory-like three-dimensional pattern processor 100 is a three-dimensional memory with a pattern processing function. Its primary function is to store a target pattern library, and its secondary function is to search the stored target patterns using the search pattern from the input 110. Specifically, a target pattern library (e.g., the computer files on an entire hard disk, big-data databases, voice archives, image archives) is stored in distributed form in the 3D-M arrays 170; the input data 110 is a search pattern (e.g., virus signatures, keywords, acoustic/language models, image models, etc.); and the pattern processing circuit 180PPC performs pattern processing on the target patterns according to the search pattern. Since the numerous storage units 100ij (thousands, FIG. 22) support massively parallel processing and the inter-chip connections 160 have a large bandwidth (FIGS. 3B-3D), pattern processing in the three-dimensional memory 100 is fast and efficient.

Like flash memory, a plurality of three-dimensional memories 100 with a pattern processing function may be packaged as a memory card (e.g., SD card, TF card) or a solid-state drive (SSD) for storing a target pattern library with a large amount of data. More importantly, they also have a pattern processing (e.g., retrieval) function. Since each storage unit 100ij has its own pattern processing circuit 180PPC, that circuit only needs to search the target patterns stored locally (in the same storage unit 100ij) in the 3D-M array 170. Thus, regardless of the capacity of the memory card or solid-state drive, the retrieval time is close to the time required to search a single 3D-M array 170. In other words, the retrieval time is independent of the capacity of the database and is, in most cases, on the order of seconds.

By comparison, in the traditional von Neumann architecture the processor (CPU) and the memory (hard disk) are physically separated from each other, and database retrieval first requires reading the database from the hard disk. Because the bandwidth of the system bus between the CPU and the hard disk is limited, the database retrieval time is bounded by the database read time. The retrieval time is therefore proportional to the size of the database; depending on that size, it generally ranges from minutes to hours or even longer. The three-dimensional memory 100 with a pattern processing function thus has a significant advantage in database retrieval.
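The scaling argument above can be illustrated with a toy timing model. The sketch is only an illustration: the function names and every numeric value are hypothetical and do not come from this specification. It assumes that read bandwidth is the sole bottleneck and that capacity grows by adding storage units of fixed size.

```python
def bus_limited_seconds(database_bytes, bus_bytes_per_second):
    """Von Neumann case: the whole database must cross one shared bus."""
    return database_bytes / bus_bytes_per_second

def unit_limited_seconds(bytes_per_unit, local_bytes_per_second):
    """Distributed case: every storage unit scans only its own 3D-M array."""
    return bytes_per_unit / local_bytes_per_second

# Hypothetical figures, chosen only to show the trend.
BYTES_PER_UNIT = 10**9
LOCAL_BANDWIDTH = 10**9
BUS_BANDWIDTH = 10**9

for n_units in (1_000, 2_000, 4_000):
    total_bytes = n_units * BYTES_PER_UNIT
    print(
        n_units,
        bus_limited_seconds(total_bytes, BUS_BANDWIDTH),
        unit_limited_seconds(BYTES_PER_UNIT, LOCAL_BANDWIDTH),
    )
```

In this model the bus-limited time grows linearly with the number of units (that is, with capacity), while the per-unit time does not change, which is the qualitative claim made in the text.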

When the three-dimensional memory 100 with the pattern processing function performs pattern processing on a large database (i.e., a target pattern library), the pattern processing circuit 180PPC only needs to complete a part of the pattern processing function. For example, the pattern processing circuit 180PPC only needs to perform simple preliminary pattern processing (e.g., string matching, code matching) on the database. The data (i.e., the target pattern) remaining after the preliminary pattern processing screening is then sent to a more powerful external processor (e.g., CPU, GPU) via output 120 to complete the final pattern processing. Since most of the data in the database will be filtered out by the simple pattern processing, the data output from the three-dimensional memory 100 will only occupy a small portion of the entire database, which can greatly reduce the bandwidth pressure of the output 120.
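This two-stage division of labour can be sketched as follows. The code is a hedged illustration with hypothetical function names: a cheap substring test plays the role of the preliminary pattern processing inside each storage unit, and a regular-expression check plays the role of the final processing on an external processor; the specification does not prescribe either choice.

```python
import re

def preliminary_filter(unit_records, keyword):
    """Cheap string matching performed locally inside one storage unit."""
    return [record for record in unit_records if keyword in record]

def final_processing(candidates, pattern):
    """Costlier check, standing in for the external CPU/GPU."""
    regex = re.compile(pattern)
    return [record for record in candidates if regex.search(record)]

def two_stage_search(units, keyword, pattern):
    # Only the survivors of the preliminary filter cross the output 120.
    survivors = []
    for unit_records in units:
        survivors.extend(preliminary_filter(unit_records, keyword))
    return survivors, final_processing(survivors, pattern)

units = [
    ["error 404 at /a", "ok"],
    ["error 500 at /b", "warning"],
]
survivors, results = two_stage_search(units, "error", r"error 5\d\d")
```

Here only two of the four stored records leave the memory, so the external processor sees a fraction of the database.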

Accordingly, the present invention provides a three-dimensional memory (100) with a pattern processing function, comprising: an input (110) for transmitting at least part of a search pattern; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory (3D-M) array (170) and a pattern processing circuit (180 PPC), said 3D-M array (170) storing at least a portion of a target pattern, said pattern processing circuit (180 PPC) performing pattern processing on said target pattern based on said search pattern; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said pattern processing circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

The following describes applications of the discrete three-dimensional pattern processor 100. The application areas include: A) information security, B) big data analysis, C) speech recognition, D) image recognition, and the like. Examples of such applications include: a) an information security processor; b) a memory capable of virus detection; c) a data analysis processor; d) a retrievable memory; e) a speech recognition processor; f) a retrievable speech memory; g) an image recognition processor; h) a retrievable image memory.

A) Information security.

Information security includes network security and computer security. The main means of enhancing network security is to check network data packets for viruses; the main means of enhancing computer security is to check computer files (including computer software) for viruses. Broadly speaking, viruses (also referred to as malware, etc.) include network viruses, computer viruses, software violating network specifications, files violating file specifications, and the like. During virus checking, the processor compares the network data packet or computer file with all virus signatures (also called virus patterns or virus identifiers) in a virus library one by one. Once a virus signature is found, the part containing it is isolated or deleted.

Virus libraries keep growing and have already reached hundreds of megabytes, while the computer data to be checked is far larger, on the order of gigabytes, terabytes, or more. On the other hand, because the number of cores in a conventional processor is limited (for example, at most tens of cores in a CPU and at most hundreds in a GPU) and each core can generally check for only one virus at a time, the parallelism of virus checking is low. Furthermore, because the processor and memory are physically separated in the von Neumann architecture, reading each new virus signature takes a long time. Conventional processors and their architecture are therefore slow and inefficient at information security tasks.

To enhance information security, the present invention proposes several discrete three-dimensional pattern processors 100, which may take a processor-like form or a memory-like form. In the processor-like form, the discrete three-dimensional pattern processor 100 is an information security processor, i.e., a processor for enhancing information security; in the memory-like form, the discrete three-dimensional pattern processor 100 is a memory capable of virus detection, i.e., a memory with a virus detection function.

a) An information security processor.

To ensure information security, the present invention proposes an information security processor 100. It searches a network data packet or computer file for the virus signatures in a virus library; if a virus signature is matched, the network packet or computer file contains the corresponding virus. The information security processor 100 may be deployed in a network or a computer as a stand-alone processor, or may be integrated into a processor (e.g., CPU) or a memory (e.g., hard disk) of the network or the computer.

In the information security processor 100, the 3D-M arrays 170 in different storage units 100ij store different virus signatures. In other words, the virus library is stored in distributed form across the storage units 100ij of the processor 100. When a network packet or computer file arrives at the input 110, at least a portion of its data is sent to all of the storage units 100ij. In each storage unit 100ij, the pattern processing circuit 180PPC searches that portion of data for the virus signatures stored in the local 3D-M array 170. If a virus signature is matched, the network packet or computer file contains the corresponding virus.

The virus checking process is performed simultaneously in all the storage units 100ij. Since the information security processor 100 contains a large number (thousands) of storage units 100ij, it supports massively parallel virus checking. Furthermore, because of the large number of inter-chip connections 160 and the close proximity between the pattern processing circuit 180PPC and the 3D-M array 170 (relative to the traditional von Neumann architecture), the pattern processing circuit 180PPC can easily read new virus signatures from the local 3D-M array 170. Therefore, the information security processor 100 checks for viruses quickly and efficiently. In this embodiment, the 3D-M array 170 storing the virus library may be a 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is a code matching circuit.

Accordingly, the invention proposes a discrete information security processor (100), characterized in that it comprises: an input (110) for transmitting at least part of the data in a network data packet or computer file; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a code matching circuit (180 PPC), said 3D-M array (170) storing at least a portion of a virus signature, said code matching circuit (180 PPC) searching said data for said virus signature; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said code matching circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

b) A memory capable of virus detection.

When a new virus is discovered, all data stored on the computer's hard disk (e.g., a mechanical hard disk or a solid-state drive) must be checked for it. Such full-disk virus checking is very difficult for the traditional von Neumann architecture. Since a computer hard disk stores a huge amount of data, merely reading all of the data from the hard disk takes a long time, let alone checking it for viruses. In the traditional von Neumann architecture, the time required for full-disk virus checking is proportional to the hard disk size.

To shorten the time required for full-disk virus checking, the present invention provides a memory 100 capable of virus detection. Its primary function is computer storage, and its secondary function is to check the stored data for viruses locally, inside the memory. Like flash memory, a plurality of such memories 100 can be packaged into a memory card or a solid-state drive that stores massive data and has a virus detection function.

In the memory 100 capable of virus detection, the 3D-M arrays 170 in different storage units 100ij store different data. In other words, a large number of computer files are stored in distributed form across the storage units 100ij of each such memory 100 in the memory card or solid-state drive. When a new virus is found and a full-disk virus check is required, its virus signature is sent via the input 110 to all the storage units 100ij, and in each unit the pattern processing circuit 180PPC searches the data stored in the local 3D-M array 170 for that virus signature.
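In this memory-like arrangement the roles are reversed relative to the processor-like case: the data stay put and the signature is broadcast. A minimal sketch under stated assumptions follows. The function names and the dictionary layout are hypothetical, substring matching stands in for the code matching circuit, and a thread pool is used only to suggest that the units work concurrently; it is not a model of the hardware.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_unit(unit_id, files, signature):
    """Search the files held by one storage unit for the broadcast signature."""
    return [(unit_id, name) for name, content in files.items() if signature in content]

def full_disk_scan(units, signature):
    """units maps a unit id to {file name: file bytes}; returns infected files."""
    with ThreadPoolExecutor() as pool:
        per_unit = pool.map(
            lambda item: scan_unit(item[0], item[1], signature),
            units.items(),
        )
        # map() yields results in input order, so the output is deterministic.
        return [hit for hits in per_unit for hit in hits]

units = {
    "100aa": {"a.txt": b"hello", "b.exe": b"..BAD.."},
    "100ab": {"c.bin": b"BAD"},
}
infected = full_disk_scan(units, b"BAD")
```

No file content is moved between units in this sketch; only the short signature goes in and the list of hits comes out.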

The virus checking process is performed simultaneously in all the storage units 100ij, and the checking time required by each storage unit 100ij is similar. Because the checking is massively parallel, the total time is close to that of a single storage unit 100ij, generally on the order of seconds, no matter how large the capacity of the memory card or solid-state drive is. By comparison, traditional full-disk virus checking takes minutes to hours, or even longer. In this embodiment, the 3D-M array 170 storing the mass of computer files is preferably a 3D-MTP; the pattern processing circuit 180PPC is a code matching circuit.

Accordingly, the present invention provides a discrete memory (100) capable of virus detection, characterized by comprising: an input (110) for transmitting at least a portion of a virus signature; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a code matching circuit (180 PPC), said 3D-M array (170) storing at least a portion of the data in a computer file, said code matching circuit (180 PPC) searching said data for said virus signature; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said code matching circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

B) Big data analysis.

Big data is a collection of huge amounts of data, consisting mainly of unstructured or semi-structured data. An important component of big data analysis is keyword retrieval (including string matching, such as regular expression matching). Keyword libraries keep growing, and big-data databases are larger still. With such large keyword libraries and databases, conventional processors and their architecture have difficulty searching unstructured or semi-structured data quickly and efficiently.

To improve the efficiency of big data analysis, the present invention proposes several discrete three-dimensional pattern processors 100, which may take a processor-like form or a memory-like form. In the processor-like form, the discrete three-dimensional pattern processor 100 is a data analysis processor, i.e., a processor for big data analysis; in the memory-like form, the discrete three-dimensional pattern processor 100 is a retrievable memory, i.e., a memory with a retrieval function.

c) A data analysis processor.

To search input data quickly and efficiently, the present invention proposes a data analysis processor 100 that searches the input data for the keywords in a keyword library. In the data analysis processor 100, the 3D-M arrays 170 in different storage units 100ij store different keywords. In other words, the keyword library is stored in distributed form across the storage units 100ij of the processor 100. Data from the input 110 is sent to all the storage units 100ij. In each storage unit 100ij, the pattern processing circuit 180PPC searches the input data for each keyword stored in the local 3D-M array 170.

The above retrieval process is performed simultaneously in all the storage units 100ij. Since it contains a large number (thousands) of storage units 100ij, the processor 100 supports massively parallel retrieval. Furthermore, because of the large number of inter-chip connections 160 and the close proximity between the pattern processing circuit 180PPC and the 3D-M array 170 (relative to the traditional von Neumann architecture), the pattern processing circuit 180PPC can easily read keywords from the local 3D-M array 170. Therefore, the processor 100 searches unstructured and semi-structured data quickly and efficiently.

In this embodiment, the 3D-M array 170 storing the keyword library may be a 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is a string matching circuit. The string matching circuit may be implemented by a Content Addressable Memory (CAM) or a comparator with an exclusive or gate (XOR). Further, the keywords may be represented by regular expressions. At this time, the character string matching circuit 180PPC is realized by finite-state automata (FSA for short).
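The two matching styles named above can be sketched in software. Both parts are illustrative assumptions rather than the circuits of this specification: `xor_match` mimics an XOR-based comparator by OR-ing together the bitwise differences of two equal-length byte strings, and the small transition table is a hand-built finite-state automaton for the hypothetical regular expression `ab*c`.

```python
def xor_match(window: bytes, keyword: bytes) -> bool:
    """Comparator built from XOR: equal inputs give an all-zero difference."""
    if len(window) != len(keyword):
        return False
    difference = 0
    for w, k in zip(window, keyword):
        difference |= w ^ k
    return difference == 0

def find_keyword(data: bytes, keyword: bytes) -> list:
    """Slide the comparator over the data and return the match offsets."""
    n = len(keyword)
    return [i for i in range(len(data) - n + 1) if xor_match(data[i:i + n], keyword)]

# Deterministic automaton for the hypothetical expression "ab*c":
# state 0 --a--> state 1, state 1 --b--> state 1, state 1 --c--> state 2 (accepting).
TRANSITIONS = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
ACCEPTING = {2}

def fsa_accepts(text: str) -> bool:
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING

offsets = find_keyword(b"xxkeyxkey", b"key")
```

A fixed keyword needs only the comparator, while a keyword given as a regular expression needs the state table, which is why the text distinguishes the two implementations.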

Accordingly, the invention proposes a discrete data analysis processor (100), characterized in that it comprises: an input (110) for transmitting at least part of the data; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory (3D-M) array (170) and a string matching circuit (180 PPC), said 3D-M array (170) storing at least a portion of a keyword, said string matching circuit (180 PPC) searching said part of the data for said keyword; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said string matching circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

d) A retrievable memory.

Big data analysis often requires searching the entire database, i.e., a full-database search. Since big-data databases are very large, ranging from gigabytes at the low end to terabytes or more, full-database search is very difficult for the traditional von Neumann architecture: merely reading out the database takes a long time, let alone searching it. In the traditional von Neumann architecture, the full-database search time is proportional to the database size.

To improve the speed and efficiency of full-library retrieval, the present invention proposes a retrievable memory 100. The primary function of the retrievable memory 100 is database storage and the secondary function is to retrieve the database locally. Like flash memory, the retrievable memories 100 may be packaged as memory cards or solid state drives for storing large databases and having retrieval functions.

In the retrievable memory 100, the 3D-M arrays 170 in different storage units 100ij store different data of the database. In other words, the database is stored in distributed form across the storage units 100ij of each retrievable memory 100 in the memory card or solid-state drive. At retrieval time, the keywords are applied to the input 110 and sent to all the storage units 100ij. In each storage unit 100ij, the pattern processing circuit 180PPC searches the data in the local 3D-M array 170 for the keyword.

The above-mentioned retrieval process is performed simultaneously in all the storage units 100 ij; the retrieval time required for each storage unit 100ij is similar. Because of the large-scale parallel search, the search time is close to the search time required for a single storage unit 100ij, generally in the order of seconds, no matter how large the capacity of the memory card and the solid state disk is. In contrast, conventional full-library searches require minutes to hours, or even longer. In the retrievable memory 100, the 3D-M storing the big data database is preferably a 3D-MTP; the pattern processing circuit 180PPC is a string matching circuit.

Because 3D-M_V has the highest storage density among all semiconductor memories, it is suitable for storing large databases. Among all 3D-M_V, 3D-OTP_V has the longest data lifetime, so it is suitable for storing large archives. Archival storage requires fast retrieval capability. A retrievable 3D-OTP_V can provide large capacity at low cost together with fast retrieval capability.

Accordingly, the invention proposes a discrete retrievable memory (100), characterized in that it comprises: an input (110) for transmitting at least part of a keyword; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory (3D-M) array (170) and a string matching circuit (180 PPC), said 3D-M array (170) storing at least a portion of data, said string matching circuit (180 PPC) searching said portion of data for said keyword; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said string matching circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

C) Speech recognition or retrieval.

One typical application of pattern processing is speech recognition. One approach to speech recognition is to perform pattern recognition on a user's speech based on an acoustic model library and a language model library. The acoustic model library stores a large number of acoustic models, and the language model library stores a large number of language models. During recognition, the pattern processing circuit 180PPC performs pattern recognition on the user's speech data according to the acoustic/language model libraries to find the closest acoustic/language model. Because conventional processors (such as CPUs and GPUs) have a limited number of cores, the parallelism of pattern recognition is low; in addition, the acoustic/language model libraries are stored in external memory. Conventional processors and their architecture are therefore slow and inefficient at speech recognition.

e) A speech recognition processor.

To improve the efficiency of speech recognition, the present invention provides a speech recognition processor 100. In the speech recognition processor 100, speech data generated by a user is supplied as the input 110 to each of the storage units 100ij; the 3D-M array 170 stores at least a part of the models in an acoustic/language model library; and the pattern processing circuit 180PPC performs speech recognition on the speech data from the input 110 based on the models stored in the 3D-M array 170. In this embodiment, the 3D-M array 170 storing the model library may be a 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is a speech recognition circuit.
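Finding the closest model across many storage units amounts to a local search in each unit followed by a global comparison. The sketch below illustrates only that structure. It is not the recognition method of this specification, which does not fix one: the feature vectors, the squared-distance metric and all names are hypothetical stand-ins.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def best_in_unit(local_models, features):
    """Closest locally stored model, as a (distance, label) pair."""
    return min(
        (squared_distance(vector, features), label)
        for label, vector in local_models.items()
    )

def recognize(units, features):
    """Each unit reports its best local match; the overall closest one wins."""
    return min(best_in_unit(local_models, features) for local_models in units)[1]

# Hypothetical model library split across two storage units.
units = [
    {"yes": [1.0, 0.0], "no": [0.0, 1.0]},
    {"maybe": [0.5, 0.5]},
]
label = recognize(units, [0.9, 0.1])
```

Each unit compares the input only with its own models, so the model library never has to be read out as a whole; only one candidate per unit takes part in the final comparison.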

Accordingly, the invention proposes a discrete speech recognition processor (100), characterized in that it comprises: an input (110) for transmitting at least part of the speech data; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a speech recognition circuit (180 PPC), the 3D-M array (170) storing at least a portion of an acoustic/language model, the speech recognition circuit (180 PPC) performing speech recognition on the speech data based on the acoustic/language model; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said speech recognition circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

f) A retrievable speech memory.

To enable speech retrieval in a speech database (such as a speech archive), the present invention also provides a retrievable speech memory 100. In the retrievable speech memory 100, the speech to be searched for is converted into an acoustic/language model, which is sent as the input 110 to each of the storage units 100ij. The speech data generated by users is stored in the 3D-M arrays 170. In other words, the speech database is stored in distributed form across the storage units 100ij of the retrievable speech memory 100. The pattern processing circuit 180PPC performs speech recognition and retrieval on the stored speech data according to the acoustic/language model. In this embodiment, the 3D-M array 170 storing the speech database is preferably a 3D-MTP; the pattern processing circuit 180PPC is a speech recognition circuit.

Accordingly, the invention proposes a discrete retrievable speech memory (100), characterized in that it comprises: an input (110) for transmitting at least part of an acoustic/language model; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a speech recognition circuit (180 PPC), the 3D-M array (170) storing at least a portion of speech data, the speech recognition circuit (180 PPC) performing speech recognition on the speech data based on the acoustic/language model; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said speech recognition circuit (180 PPC) and at least one off-chip perimeter circuit assembly (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip perimeter circuit assembly (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

D) Image recognition.

Another typical application of pattern processing is image recognition. One approach to image recognition is to perform pattern recognition on a user's image based on an image model library, which stores a large number of image models. During recognition, the pattern processor performs pattern recognition on the user's image data according to the image models in the library to find the closest image model. Because conventional processors (such as CPUs and GPUs) have a limited number of cores, the parallelism of pattern recognition is low; in addition, the image model library is stored in external memory. Conventional processors are therefore slow and inefficient at image recognition.

g) An image recognition processor.

To improve the efficiency of image recognition, the present invention proposes an image recognition processor 100. In the image recognition processor 100, user-generated image data is sent as input 110 to each of the storage units 100ij; the 3D-M array 170 stores at least a portion of the image models; and the pattern processing circuit 180PPC performs image recognition on the image data from the input 110 based on the image models stored in the 3D-M array 170. In this embodiment, the 3D-M array 170 storing the model library may be a 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is an image recognition circuit.

Accordingly, the invention proposes a discrete image recognition processor (100), characterized in that it comprises: an input (110) for transmitting at least part of the image data; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and an image recognition circuit (180 PPC), the 3D-M array (170) storing at least a portion of an image model, the image recognition circuit (180 PPC) performing image recognition on the image data based on the image model; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said image recognition circuit (180 PPC) and at least one off-chip peripheral circuit component (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).
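As a non-limiting illustration, the data flow of the image recognition processor described above can be modeled in software. The Python sketch below is an assumed model only, not the circuit itself: the names `StorageUnit`, `squared_distance` and `recognize`, the representation of image models as short feature vectors, and the squared-distance metric are all hypothetical choices made for illustration. Each unit holds a slice of the model library and finds its own best match; the input is presented to every unit, and the overall closest model is selected.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Assumed similarity metric: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class StorageUnit:
    """Hypothetical software stand-in for one storage unit 100ij."""
    # Slice of the image model library held in the local 3D-M array 170.
    models: List[Tuple[str, Vector]]

    def recognize(self, image: Vector) -> Tuple[float, str]:
        # Models the local pattern processing circuit 180PPC:
        # return (distance, label) of the closest locally stored model.
        return min((squared_distance(image, vec), label)
                   for label, vec in self.models)


def recognize(units: List[StorageUnit], image: Vector) -> str:
    # The same image data (input 110) is presented to every unit.
    # In hardware the units would work in parallel; here they run in turn.
    best_distance, best_label = min(unit.recognize(image) for unit in units)
    return best_label


units = [
    StorageUnit([("cat", [1.0, 0.0]), ("dog", [0.0, 1.0])]),
    StorageUnit([("bird", [1.0, 1.0])]),
]
print(recognize(units, [0.9, 0.8]))  # closest stored model is "bird"
```

Because each unit only searches the models stored next to it, the work per unit shrinks as the number of units grows, which is the source of the parallelism the text describes.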

h) A retrievable image memory.

To enable image retrieval in an image database (such as an image archive), the invention also provides a retrievable image memory 100. In the retrievable image memory 100, the image data to be searched for is converted into an image model, which is sent as input 110 to each storage unit 100ij. The user-generated image data is stored in the 3D-M array 170. In other words, the image database is distributed across the storage units 100ij of the retrievable image memory 100. The pattern processing circuit 180PPC performs image recognition and retrieval on the image data according to the image model. In this embodiment, the 3D-M array 170 storing the image database is preferably a 3D-MTP; the pattern processing circuit 180PPC is an image recognition circuit.

The invention also proposes a discrete retrievable image memory (100), characterized in that it comprises: an input (110) for transmitting at least part of the image model; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and an image recognition circuit (180 PPC), the 3D-M array (170) storing at least a portion of image data, the image recognition circuit (180 PPC) performing image recognition on the image data based on the image model; a first chip (100 a) and a second chip (100 b), said first chip (100 a) containing said 3D-M array (170), said second chip (100 b) containing at least part of said image recognition circuit (180 PPC) and at least one off-chip peripheral circuit component (190) of said 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).
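The retrieval flow of the retrievable image memory can likewise be sketched in software. In the hedged Python model below, the roles are reversed relative to recognition: the database items stay distributed across the units and the query model is sent to all of them. The function names, the item identifiers, the L1 distance and the match threshold are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Assumed metric: sum of absolute differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def search_unit(local_data: Dict[str, Vector], query: Vector,
                threshold: float) -> List[str]:
    # Models one storage unit 100ij: scan only the image data held in
    # its own 3D-M array and report the items that match the query model.
    return [item_id for item_id, features in local_data.items()
            if distance(features, query) <= threshold]


def retrieve(units: List[Dict[str, Vector]], query: Vector,
             threshold: float) -> List[str]:
    # The query model (input 110) is sent to every unit; the per-unit
    # hit lists are merged. Units are visited in turn in this model.
    hits: List[str] = []
    for local_data in units:
        hits.extend(search_unit(local_data, query, threshold))
    return sorted(hits)


database = [
    {"img-001": [0.1, 0.9], "img-002": [0.8, 0.2]},  # unit 1
    {"img-003": [0.2, 0.8]},                          # unit 2
]
print(retrieve(database, [0.0, 1.0], threshold=0.5))
```

With the sample data, the query matches `img-001` and `img-003`, which reside in different units, so each unit contributes its own hits without any image data leaving its unit.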

E) Neural network.

When applied to neural networks, the discrete three-dimensional processor is a three-dimensional neural network processor. It can perform neural computation; more importantly, the synaptic weights used in the neural computation are stored locally.

When used for neural networks, the discrete three-dimensional processor 100 is a discrete three-dimensional neural network processor. Fig. 24 shows a storage unit 100ij of the three-dimensional neural network processor 100, which includes a neural memory circuit 170 and a neural computation circuit 180NPC (i.e., the logic circuit 180 is the neural computation circuit 180NPC); the two are electrically coupled via inter-chip connections 160 (Figs. 3A-3D). The neural memory circuit 170 comprises a 3D-M array that stores at least a portion of the synaptic weights; the neural computation circuit 180NPC performs neural computation using the synaptic weights.

Figs. 25-26B disclose details of the neural computation circuit 180NPC and its computation circuit 730. In the embodiment of Fig. 25, the neural computation circuit 180NPC contains a synaptic weight (W_s) RAM 740A, an input neuron (N_in) RAM 740B and a computation circuit 730. The W_s RAM 740A is a cache that temporarily stores synaptic weights 742 from the 3D-M array 170; the N_in RAM 740B is also a cache, which temporarily stores input data 746 from the input 110. The computation circuit 730 performs neural computation and produces output data 748.

In the embodiment of Fig. 26A, the computation circuit 730 includes a multiplier 732, an adder 734, a register 736, and an activation function circuit 738. The multiplier 732 multiplies the synaptic weight w_ij by the input data x_i; the adder 734 and the register 736 accumulate the products (w_ij × x_i); the accumulated value is supplied to the activation function circuit 738, whose result is the output data y_j.
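The multiply, accumulate and activate sequence of Fig. 26A can be illustrated with a short software model. The Python sketch below is an assumption-laden illustration rather than the circuit: `neuron_output` and `sigmoid` are hypothetical names, the sigmoid is chosen only as one example of an activation function, and the comments map each step to the reference numerals in the text.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """One example activation function (output between 0 and 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights: Sequence[float], inputs: Sequence[float],
                  activation: Callable[[float], float] = sigmoid) -> float:
    """Compute y_j from weights w_ij and inputs x_i for one output neuron j."""
    accumulator = 0.0                    # models register 736
    for w_ij, x_i in zip(weights, inputs):
        product = w_ij * x_i             # models multiplier 732
        accumulator += product           # models adder 734 updating register 736
    return activation(accumulator)       # models activation function circuit 738


# The two products cancel, so the accumulated value is 0 and sigmoid(0) = 0.5.
print(neuron_output([0.5, -0.5], [1.0, 1.0]))  # 0.5
```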

In the embodiment of Fig. 26B, the multiplier 732 of Fig. 26A is replaced with a multiply-accumulator (MAC) 732'. Of course, the MAC 732' also includes a multiplier. The W_s RAM 740A outputs not only the synaptic weights w_ij (via port 742w) but also the biases b_j (via port 742b). The MAC 732' performs a biased multiplication (w_ij × x_i + b_j) on the input data x_i, the synaptic weight w_ij and the bias b_j.
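For orientation, the computation of one output neuron can be written compactly. The expression below assumes the common convention that the bias b_j is added once per output neuron and that f denotes the activation function; the text itself gives only the per-term expression (w_ij × x_i + b_j), so the summation form is an interpretation rather than a statement of the disclosed circuit.

```latex
y_j = f\!\left(\sum_{i} w_{ij}\, x_i + b_j\right)
```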

An activation function is a function whose output is confined to a certain range (e.g., 0 to 1, or -1 to +1); examples include the sigmoid function, the signum function, the threshold function, the piecewise linear function, the step function, the tanh function, and the like. Implementing an activation function in circuitry is difficult. In the spirit of the "mathematical computation" application of the present invention, the computation circuit 730 may also contain a non-volatile memory (NVM) for long-term storage of the LUT of the activation function. The NVM is typically a read-only memory (ROM), in particular a three-dimensional read-only memory (3D-ROM). The 3D-ROM array may be stacked above, and overlap with, the neural computation circuit (180 NPC). In this case, the computation circuit 730 becomes extremely simple: it only needs to implement addition and multiplication, not the activation function. Because the activation function is realized with the 3D-ROM array, the area of the computation circuit 730 is small and the computation density is preserved.
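The LUT-based activation function can be illustrated as follows. This Python sketch is a minimal model under stated assumptions: the table size (257 entries), the input range (-4 to +4), nearest-entry lookup and the names `build_lut` and `lut_activation` are all hypothetical; the specification does not fix any of them. The table is computed once, as the contents of a ROM would be, and evaluating the activation then reduces to an index calculation and a read.

```python
import math
from typing import Callable, List


def build_lut(func: Callable[[float], float], lo: float, hi: float,
              size: int) -> List[float]:
    """Precompute func at `size` evenly spaced points in [lo, hi].

    Models writing the activation-function LUT into the ROM once.
    """
    step = (hi - lo) / (size - 1)
    return [func(lo + k * step) for k in range(size)]


def lut_activation(lut: List[float], lo: float, hi: float, z: float) -> float:
    """Evaluate the activation by table lookup instead of computing it."""
    z = min(max(z, lo), hi)  # clamp inputs outside the tabulated range
    index = round((z - lo) / (hi - lo) * (len(lut) - 1))  # nearest entry
    return lut[index]


# Assumed example: tanh tabulated over [-4, 4] with 257 entries.
TANH_LUT = build_lut(math.tanh, -4.0, 4.0, 257)

print(lut_activation(TANH_LUT, -4.0, 4.0, 0.0))   # 0.0
print(lut_activation(TANH_LUT, -4.0, 4.0, 10.0))  # saturates at tanh(4.0)
```

A larger table reduces the lookup error at the cost of more storage, which is the trade-off that a dense 3D-ROM array is meant to ease.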

It should be understood that changes in form and detail may be made without departing from the spirit and scope of the invention, and that such changes do not impede the practice of the invention. For example, the processor in the present invention may be a central processing unit (CPU), a controller or microcontroller, a digital signal processor (DSP), a graphics processing unit (GPU), a network security processor, an encryption/decryption processor, an encoding/decoding processor, a neural network processor, an artificial intelligence (AI) processor, and the like. The invention, therefore, is not to be restricted except in the spirit of the appended claims.

Claims

1. A discrete three-dimensional processor (100) comprising:

a plurality of storage units (100 aa-100 mn), each storage unit (100 ij) comprising at least one three-dimensional storage 3D-M array (170) and a logic circuit (180); the logic circuit (180) processes data stored in the three-dimensional storage 3D-M array (170) but is not a peripheral circuit of the three-dimensional storage 3D-M array (170);

a first chip (100 a) comprising a first semiconductor substrate (0 a), said first chip (100 a) comprising said three-dimensional memory 3D-M array (170) and at least a portion of its peripheral circuitry, said three-dimensional memory 3D-M array (170) comprising a plurality of memory cells stacked on said first semiconductor substrate (0 a);

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least a portion of said logic circuit (180) and an off-chip peripheral circuit component (190) of said three-dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors in said second semiconductor substrate (0 b);

the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the second chip (100 b) does not contain the three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

2. The three-dimensional processor (100) of claim 1, further characterized by at least one of the following features A)-F):

A) The first chip (100 a) and the second chip (100 b) are stacked on each other; or

B) The first chip (100 a) and the second chip (100 b) are bonded face to face; or

C) The first chip (100 a) and the second chip (100 b) are the same or close in area; or

D) The first chip (100 a) is aligned with at least one edge of the second chip (100 b); or

E) The projection of the three-dimensional storage 3D-M array (170) on the second chip (100 b) at least partially coincides with the logic circuit (180); or

F) The inter-chip connections (160) include bond wires, micro-pads, through-silicon vias (TSVs), and/or vertical interconnect accesses (vias).

3. The three-dimensional processor (100) of claim 2, further characterized by at least one of the following G3)-L3) features:

G3) the off-chip peripheral circuit component (190) is an address decoder; or

H3) the off-chip peripheral circuit component (190) is a sense amplifier circuit; or

I3) the off-chip peripheral circuit component (190) is a write circuit; or

J3) the off-chip peripheral circuit component (190) is a read voltage generation circuit; or

K3) the off-chip peripheral circuit component (190) is a write voltage generation circuit; or

L3) the off-chip peripheral circuit component (190) is a data buffer.

4. The three-dimensional processor (100) of claim 2 or 3, further characterized by one of the following V4)-Y4) features:

V4) the three-dimensional storage 3D-M array (170) stores at least part of a look-up table LUT of a non-arithmetic function or a non-arithmetic model, the logic circuit (180) is an arithmetic logic circuit ALC (180 ALC) and performs arithmetic operations on at least part of the data in the look-up table LUT; the three-dimensional processor (100) implements the non-arithmetic function or the non-arithmetic model, which contains more operations than the arithmetic operations supported by the arithmetic logic circuit ALC (180 ALC); or

W4) said three-dimensional memory 3D-M array (170) is part of a programmable computation element CCE (400) and stores at least part of a look-up table LUT of a non-arithmetic function, said logic circuit (180) containing a plurality of programmable logic elements CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements the customization of the non-arithmetic functions by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the programmable computation unit CCE (400), the non-arithmetic functions containing more operations than the arithmetic operations supported by the programmable logic unit CLE (200); or

X4) an input of said three-dimensional processor (100) transmits at least part of a first pattern, said three-dimensional memory 3D-M array (170) stores at least part of a second pattern, said logic circuit (180) is a pattern processing circuit (180 PPC) and performs pattern processing on said first and second patterns; or

Y4) the three-dimensional storage 3D-M array (170) stores at least part of the synaptic weights, the logic circuit (180) is a neural computation circuit (180 NPC) and performs a neural computation based on the synaptic weights.

5. The three-dimensional processor (100) according to any of claims 1-4, further characterized by at least one of the following M5)-U5) features:

M5) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N5) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O5) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P5) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q5) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R5) the number of back-end wiring layers in the first chip (100 a) is greater than the number of back-end wiring layers in the second chip (100 b); or

S5) the number of the address line layers in the first chip (100 a) is at least twice as large as the number of the interconnection line layers in the second chip (100 b); or

T5) the number of memory cells in a memory string of the first chip (100 a) is at least twice the number of interconnect levels in the second chip (100 b); or

U5) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

6. A discrete three-dimensional processor (100) comprising:

a first chip (100 a) including a first semiconductor substrate (0 a), the first chip (100 a) including the three-dimensional memory 3D-M array (170) and at least a portion of peripheral circuitry thereof, the three-dimensional memory 3D-M array (170) including a plurality of memory cells stacked on the first semiconductor substrate (0 a);

a second chip (100 b) including a second semiconductor substrate (0 b), the second chip (100 b) including at least a portion of the logic circuit (180) and at least a portion of an address decoder of the three-dimensional memory 3D-M array (170), the second chip (100 b) including a plurality of transistors located in the second semiconductor substrate (0 b);

said first chip (100 a) does not contain said at least partial address decoder of said three-dimensional storage 3D-M array (170); the second chip (100 b) does not contain the three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

7. The three-dimensional processor (100) of claim 6, further characterized by one of the following V7)-Y7) features:

V7) the three-dimensional storage 3D-M array (170) stores at least part of a look-up table LUT of a non-arithmetic function or a non-arithmetic model, the logic circuit (180) is an arithmetic logic circuit ALC (180 ALC) and performs arithmetic operations on at least part of the data in the look-up table LUT; the three-dimensional processor (100) implements the non-arithmetic function or the non-arithmetic model, which contains more operations than the arithmetic operations supported by the arithmetic logic circuit ALC (180 ALC); or

W7) said three-dimensional memory 3D-M array (170) is part of a programmable computation element CCE (400) and stores at least part of a look-up table LUT of a non-arithmetic function, said logic circuit (180) containing a plurality of programmable logic elements CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements the customization of the non-arithmetic functions by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the programmable computation unit CCE (400), the non-arithmetic functions containing more operations than the arithmetic operations supported by the programmable logic unit CLE (200); or

X7) an input to said three-dimensional processor (100) transmits at least a portion of a first pattern, said three-dimensional storage 3D-M array (170) stores at least a portion of a second pattern, said logic (180) is a pattern processing circuit (180 PPC) and performs pattern processing on said first and second patterns; or

Y7) the three-dimensional memory 3D-M array (170) stores at least part of the synaptic weights, the logic circuit (180) being a neural computation circuit (180 NPC) and performing a neural computation based on the synaptic weights.

8. The three-dimensional processor (100) according to claim 6 or 7, further characterized by having at least one of the following M8)-U8) features:

M8) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N8) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O8) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P8) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q8) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R8) the number of back-end wiring layers in the first chip (100 a) is greater than the number of back-end wiring layers in the second chip (100 b); or

S8) the number of the address line layers in the first chip (100 a) is at least twice as large as the number of the interconnection line layers in the second chip (100 b); or

T8) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U8) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

9. A discrete three-dimensional processor (100) comprising:

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least part of said logic circuitry (180) and at least part of the sense amplifier circuitry of said three-dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors in said second semiconductor substrate (0 b);

the first chip (100 a) does not contain the at least part of the sense amplifier circuitry of the three-dimensional storage 3D-M array (170); the second chip (100 b) does not contain the three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

10. The three-dimensional processor (100) of claim 9, further characterized by one of the following V10)-Y10) features:

V10) the three-dimensional storage 3D-M array (170) stores at least part of a look-up table LUT of a non-arithmetic function or a non-arithmetic model, the logic circuit (180) is an arithmetic logic circuit ALC (180 ALC) and performs arithmetic operations on at least part of the data in the look-up table LUT; the three-dimensional processor (100) implements the non-arithmetic function or the non-arithmetic model, which contains more operations than the arithmetic operations supported by the arithmetic logic circuit ALC (180 ALC); or

W10) said three-dimensional memory 3D-M array (170) is part of a programmable computation element CCE (400) and stores at least part of a look-up table LUT of a non-arithmetic function, said logic circuit (180) containing a plurality of programmable logic elements CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements the customization of the non-arithmetic functions by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the programmable computation unit CCE (400), the non-arithmetic functions containing more operations than the arithmetic operations supported by the programmable logic unit CLE (200); or

X10) an input to said three-dimensional processor (100) transmits at least a portion of a first pattern, said three-dimensional storage 3D-M array (170) stores at least a portion of a second pattern, said logic (180) is a pattern processing circuit (180 PPC) and performs pattern processing on said first and second patterns; or

Y10) the three-dimensional storage 3D-M array (170) stores at least part of the synaptic weights, the logic circuit (180) is a neural computation circuit (180 NPC) and performs a neural computation based on the synaptic weights.

11. The three-dimensional processor (100) according to claim 9 or 10, further characterized by having at least one of the following M11)-U11) features:

M11) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N11) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O11) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P11) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q11) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R11) the number of back-end wiring layers in the first chip (100 a) is greater than the number of back-end wiring layers in the second chip (100 b); or

S11) the number of the address line layers in the first chip (100 a) is at least twice as large as the number of the interconnection line layers in the second chip (100 b); or

T11) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U11) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

12. A discrete three-dimensional processor (100) comprising:

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least part of said logic circuitry (180) and at least part of said write circuitry of said three-dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors located in said second semiconductor substrate (0 b);

said first chip (100 a) does not contain said at least partial write circuitry of said three-dimensional storage 3D-M array (170); the second chip (100 b) does not contain the three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

13. The three-dimensional processor (100) of claim 12, further characterized by one of the following V13)-Y13) features:

V13) the three-dimensional storage 3D-M array (170) stores at least part of a look-up table LUT of a non-arithmetic function or a non-arithmetic model, the logic circuit (180) is an arithmetic logic circuit ALC (180 ALC) and performs arithmetic operations on at least part of the data in the look-up table LUT; the three-dimensional processor (100) implements the non-arithmetic function or the non-arithmetic model, which contains more operations than the arithmetic operations supported by the arithmetic logic circuit ALC (180 ALC); or

W13) said three-dimensional memory 3D-M array (170) is part of a programmable computation element CCE (400) and stores at least part of a look-up table LUT of a non-arithmetic function, said logic circuit (180) containing a plurality of programmable logic elements CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements the customization of the non-arithmetic functions by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the programmable computation unit CCE (400), the non-arithmetic functions containing more operations than the arithmetic operations supported by the programmable logic unit CLE (200); or

X13) an input to said three-dimensional processor (100) transmits at least a portion of a first pattern, said three-dimensional storage 3D-M array (170) stores at least a portion of a second pattern, said logic (180) is a pattern processing circuit (180 PPC) and performs pattern processing on said first and second patterns; or

Y13) the three-dimensional storage 3D-M array (170) stores at least part of the synaptic weights, the logic circuit (180) is a neural computation circuit (180 NPC) and performs a neural computation based on the synaptic weights.

14. The three-dimensional processor (100) according to claim 12 or 13, further characterized by having at least one of the following M14)-U14) features:

M14) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N14) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O14) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P14) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q14) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R14) the number of back-end wiring layers in the first chip (100 a) is greater than the number of back-end wiring layers in the second chip (100 b); or

S14) the number of address line layers in the first chip (100 a) is at least twice the number of interconnect line layers in the second chip (100 b); or

T14) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U14) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

15. A discrete three-dimensional processor (100) comprising:

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least part of said logic circuitry (180) and at least part of a read voltage generation circuit of said three-dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors located in said second semiconductor substrate (0 b);

the first chip (100 a) does not contain the at least part of the read voltage generation circuitry of the three-dimensional storage 3D-M array (170); the second chip (100 b) does not contain the three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

16. The three-dimensional processor (100) of claim 15, further characterized by one of the following V16)-Y16) features:

V16) the three-dimensional storage 3D-M array (170) stores at least part of a look-up table LUT of a non-arithmetic function or a non-arithmetic model, the logic circuit (180) is an arithmetic logic circuit ALC (180 ALC) and performs arithmetic operations on at least part of the data in the look-up table LUT; the three-dimensional processor (100) implements the non-arithmetic function or the non-arithmetic model, which contains more operations than the arithmetic operations supported by the arithmetic logic circuit ALC (180 ALC); or

W16) said three-dimensional memory 3D-M array (170) is part of a programmable computation element CCE (400) and stores at least part of a look-up table LUT of a non-arithmetic function, said logic circuit (180) containing a plurality of programmable logic elements CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements the customization of the non-arithmetic functions by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the programmable computation unit CCE (400), the non-arithmetic functions containing more operations than the arithmetic operations supported by the programmable logic unit CLE (200); or

X16) an input to said three-dimensional processor (100) transmits at least a portion of a first pattern, said three-dimensional storage 3D-M array (170) stores at least a portion of a second pattern, said logic (180) is a pattern processing circuit (180 PPC) and performs pattern processing on said first and second patterns; or

Y16) the three-dimensional storage 3D-M array (170) stores at least part of the synaptic weights, the logic circuit (180) is a neural computation circuit (180 NPC) and performs a neural computation based on the synaptic weights.

17. The three-dimensional processor (100) according to claim 15 or 16, further characterized by having at least one of the following M17)-U17) features:

M17) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N17) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O17) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P17) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q17) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R17) the number of back-end wiring layers in the first chip (100 a) is larger than the number of back-end wiring layers in the second chip (100 b); or

S17) the number of the address line layers in the first chip (100 a) is at least twice as large as the number of the interconnection line layers in the second chip (100 b); or

T17) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U17) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

18. A discrete three-dimensional processor (100) comprising:

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least part of said logic circuitry (180) and at least part of a write voltage generation circuit of said three-dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors located in said second semiconductor substrate (0 b);

the first chip (100 a) does not contain the at least partial write voltage generation circuit of the three-dimensional storage 3D-M array (170); the second chip (100 b) does not contain the three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

19. The three-dimensional processor (100) of claim 18, further characterized by one of the following features V19)-Y19):

V19) said three-dimensional storage 3D-M array (170) stores at least part of a look-up table LUT of a non-arithmetic function or a non-arithmetic model, and said logic circuit (180) is an arithmetic logic circuit ALC (180 ALC) that performs arithmetic operations on at least part of the data in said look-up table LUT; said three-dimensional processor (100) is configured to implement said non-arithmetic function or said non-arithmetic model, which contains more operations than the arithmetic operations supported by said arithmetic logic circuit ALC (180 ALC); or

W19) said three-dimensional memory 3D-M array (170) is part of a programmable computation element CCE (400) and stores at least part of a look-up table LUT of a non-arithmetic function, said logic circuit (180) containing a plurality of programmable logic elements CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements the customization of the non-arithmetic functions by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the programmable computation unit CCE (400), the non-arithmetic functions containing more operations than the arithmetic operations supported by the programmable logic unit CLE (200); or

X19) an input to said three-dimensional processor (100) transmits at least a portion of a first pattern, said three-dimensional storage 3D-M array (170) stores at least a portion of a second pattern, and said logic circuit (180) is a pattern processing circuit (180 PPC) that performs pattern processing on said first and second patterns; or

Y19) the three-dimensional storage 3D-M array (170) stores at least part of the synaptic weights, and the logic circuit (180) is a neural computation circuit (180 NPC) that performs a neural computation based on the synaptic weights.

20. The three-dimensional processor (100) according to claim 18 or 19, further characterized by at least one of the following features M20)-U20):

M20) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N20) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O20) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P20) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q20) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R20) the number of back-end wiring layers in the first chip (100 a) is greater than the number of back-end wiring layers in the second chip (100 b); or

S20) the number of the address line layers in the first chip (100 a) is at least twice as large as the number of the interconnection line layers in the second chip (100 b); or

T20) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U20) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

21. A discrete three-dimensional processor (100) comprising:

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least a portion of said logic circuitry (180) and at least a portion of a data buffer of said three dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors located in said second semiconductor substrate (0 b);

the first chip (100 a) does not contain the at least part of the data buffer of the three-dimensional storage 3D-M array (170); the second chip (100 b) does not contain the three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

22. The three-dimensional processor (100) of claim 21, further characterized by one of the following features V22)-Y22):

V22) said three-dimensional storage 3D-M array (170) stores at least part of a look-up table LUT of a non-arithmetic function or a non-arithmetic model, and said logic circuit (180) is an arithmetic logic circuit ALC (180 ALC) that performs arithmetic operations on at least part of the data in said look-up table LUT; said three-dimensional processor (100) is configured to implement said non-arithmetic function or said non-arithmetic model, which contains more operations than the arithmetic operations supported by said arithmetic logic circuit ALC (180 ALC); or

W22) said three-dimensional memory 3D-M array (170) is part of a programmable computation element CCE (400) and stores at least part of a look-up table LUT of a non-arithmetic function, said logic circuit (180) containing a plurality of programmable logic elements CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements the customization of the non-arithmetic functions by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the programmable computation unit CCE (400), the non-arithmetic functions containing more operations than the arithmetic operations supported by the programmable logic unit CLE (200); or

X22) an input to said three-dimensional processor (100) transmits at least a portion of a first pattern, said three-dimensional storage 3D-M array (170) stores at least a portion of a second pattern, and said logic circuit (180) is a pattern processing circuit (180 PPC) that performs pattern processing on said first and second patterns; or

Y22) the three-dimensional storage 3D-M array (170) stores at least part of the synaptic weights, and the logic circuit (180) is a neural computation circuit (180 NPC) that performs a neural computation based on the synaptic weights.

23. The three-dimensional processor (100) according to claim 21 or 22, further characterized by at least one of the following features M23)-U23):

M23) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N23) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O23) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P23) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q23) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R23) the number of back-end wiring layers in the first chip (100 a) is larger than the number of back-end wiring layers in the second chip (100 b); or

S23) the number of the address line layers in the first chip (100 a) is at least twice as large as the number of the interconnection line layers in the second chip (100 b); or

T23) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U23) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

24. A discrete three-dimensional processor (100) comprising:

a plurality of compute units (100 ij), each compute unit (100 ij) comprising a first three-dimensional storage 3D-M array (170) and an arithmetic logic circuit ALC (180 ALC), the first three-dimensional storage 3D-M array (170) storing at least a portion of a first look-up table LUT of a first non-arithmetic function or a first non-arithmetic model, the arithmetic logic circuit ALC (180 ALC) performing arithmetic operations on at least a portion of the data in the first look-up table LUT;

a first chip (100 a) comprising a first semiconductor substrate (0 a), said first chip (100 a) comprising said first three-dimensional storage 3D-M array (170) and at least a portion of its peripheral circuitry, said first three-dimensional storage 3D-M array (170) comprising a plurality of memory cells stacked on said first semiconductor substrate (0 a);

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least a portion of said arithmetic logic circuit ALC (180 ALC), and a piece of peripheral circuitry (190) of said first three-dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors in said second semiconductor substrate (0 b);

the first chip (100 a) does not contain said piece of peripheral circuitry (190); the second chip (100 b) does not contain the first three-dimensional storage 3D-M array (170); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160);

the three-dimensional processor (100) is configured to implement the first non-arithmetic function or the first non-arithmetic model that contains more arithmetic operations than the arithmetic logic circuit ALC (180 ALC) supports.
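The division of labour recited in claim 24 (a look-up table held in the 3D-M array, and an arithmetic logic circuit that only adds, subtracts and multiplies) can be illustrated with a minimal software sketch. Everything below is an assumption made for illustration and is not taken from the claims: the choice of sin(x) as the non-arithmetic function, the table size `N`, the names `SIN_LUT` and `lut_sin`, and the use of linear interpolation between table entries.

```python
import math

# Hypothetical table size and input range (illustrative assumptions).
N = 64
X_MAX = math.pi / 2
STEP = X_MAX / N
INV_STEP = N / X_MAX  # precomputed so that no division is needed at run time

# Stands in for the LUT stored in the 3D-M array. The table is filled once,
# offline; math.sin is used here only to generate the stored values.
SIN_LUT = [math.sin(i * STEP) for i in range(N + 1)]

def lut_sin(x):
    """Approximate sin(x) on [0, pi/2] with table reads plus add/subtract/multiply."""
    if not 0.0 <= x <= X_MAX:
        raise ValueError("x must lie in [0, pi/2]")
    pos = x * INV_STEP            # multiply: position in table units
    i = min(int(pos), N - 1)      # index of the lower table entry
    frac = pos - i                # subtract: distance past that entry
    lo = SIN_LUT[i]               # table read
    hi = SIN_LUT[i + 1]           # table read
    return lo + (hi - lo) * frac  # linear interpolation: add, subtract, multiply
```

In this sketch the run-time path never evaluates a transcendental function; accuracy is set by the table density, which is the kind of trade-off a larger memory array would relax.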

25. The three-dimensional processor (100) of claim 24, further characterized by:

said compute unit (100 ij) further contains a second three-dimensional storage 3D-M array, said second three-dimensional storage 3D-M array storing at least part of a second look-up table LUT of a second non-arithmetic function, said arithmetic logic circuit ALC (180 ALC) performing arithmetic operations on at least part of the data in said first look-up table LUT and/or said second look-up table LUT to realize a combined function; said combined function is a combination of said first and second non-arithmetic functions, said first and second non-arithmetic functions containing more operations than the arithmetic operations supported by said arithmetic logic circuit ALC (180 ALC);

the first chip (100 a) contains the second three-dimensional storage 3D-M array and at least part of its peripheral circuitry; the second chip (100 b) does not contain the second three-dimensional storage 3D-M array.

26. The three-dimensional processor (100) according to claim 24 or 25, further characterized by at least one of the following features M26)-U26):

M26) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N26) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O26) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P26) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q26) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R26) the number of back-end wiring layers in the first chip (100 a) is greater than the number of back-end wiring layers in the second chip (100 b); or

S26) the number of address line layers in the first chip (100 a) is at least twice the number of interconnect line layers in the second chip (100 b); or

T26) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U26) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

27. A discrete three-dimensional processor (100) comprising:

a plurality of programmable logic units CLE (200) and/or programmable connections CIT (300); and a plurality of first programmable computation elements CCE (400) comprising a first three-dimensional memory 3D-M array (170), said first three-dimensional memory 3D-M array (170) storing at least part of a first look-up table LUT of first non-arithmetic functions;

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least a portion of said programmable logic units CLE (200) and/or programmable connections CIT (300) and a piece of peripheral circuitry (190) of said first three dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors located in said second semiconductor substrate (0 b);

the three-dimensional processor (100) enables the customization of a non-arithmetic function comprising more operations than the arithmetic operations supported by the programmable logic unit CLE (200) by programming the programmable logic unit CLE (200) and/or the programmable connection CIT (300), and the first programmable computation unit CCE (400).

28. The three-dimensional processor (100) of claim 27, further comprising:

a second programmable computation element CCE containing a second three-dimensional memory 3D-M array storing at least part of a second look-up table LUT of a second non-arithmetic function;

the first chip (100 a) contains the second three-dimensional storage 3D-M array and at least part of its peripheral circuitry; the second chip (100 b) does not contain the second three-dimensional storage 3D-M array;

said three-dimensional processor (100) enables customization of a combined function by programming said programmable logic unit CLE (200) and/or programmable connection CIT (300), and said first and second programmable computation units CCE; the combined function is a combination of the first and second non-arithmetic functions, which contain more operations than the arithmetic operations supported by the programmable logic unit CLE (200).

29. The three-dimensional processor (100) according to claim 27 or 28, further characterized by at least one of the following features M29)-U29):

M29) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N29) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O29) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P29) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q29) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R29) the number of back-end wiring layers in the first chip (100 a) is greater than the number of back-end wiring layers in the second chip (100 b); or

S29) the number of address line layers in the first chip (100 a) is at least twice the number of interconnect line layers in the second chip (100 b); or

T29) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U29) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

30. A discrete three-dimensional processor (100) comprising:

an input (110) for transmitting at least part of a first pattern;

a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory 3D-M array (170) and a pattern processing circuit (180 PPC), said three-dimensional memory 3D-M array (170) storing at least a portion of a second pattern, said pattern processing circuit (180 PPC) performing pattern processing on said first and second patterns;

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least a portion of said pattern processing circuit (180 PPC) and a piece of peripheral circuitry (190) of said three-dimensional memory 3D-M array (170), said second chip (100 b) comprising a plurality of transistors located in said second semiconductor substrate (0 b);

31. The three-dimensional processor (100) of claim 30, further characterized by: the first pattern comprises a target pattern; the second pattern comprises a retrieval pattern.

32. The three-dimensional processor (100) of claim 31, further characterized by: the target pattern comprises a network data packet or computer file data; the retrieval pattern comprises at least part of a virus signature; the pattern processing circuit (180 PPC) is a code matching circuit and searches for the virus signature in the network data packet or file data.

33. The three-dimensional processor (100) of claim 31, further characterized by: the target pattern comprises at least part of the data; the retrieval pattern comprises at least part of a keyword; the pattern processing circuit (180 PPC) is a string matching circuit and searches for the keyword in the data.

34. The three-dimensional processor (100) of claim 31, further characterized by: the target pattern comprises at least part of speech data; the retrieval pattern comprises at least part of an acoustic/language model; the pattern processing circuit (180 PPC) is a speech recognition circuit and performs speech recognition on the speech data according to the acoustic/language model.

35. The three-dimensional processor (100) of claim 31, further characterized by: the target pattern comprises at least part of image data; the retrieval pattern comprises at least part of an image model; the pattern processing circuit (180 PPC) is an image recognition circuit and performs image recognition on the image data based on the image model.

36. The three-dimensional processor (100) of claim 30, further characterized by: the first pattern comprises a retrieval pattern; the second pattern comprises a target pattern.

37. The three-dimensional processor (100) of claim 36, further characterized by: the retrieval pattern comprises at least part of a virus signature; the target pattern comprises at least part of computer file data; the pattern processing circuit (180 PPC) is a code matching circuit and searches for the virus signature in the file data.

38. The three-dimensional processor (100) of claim 36, further characterized by: the retrieval pattern comprises at least part of a keyword; the target pattern comprises at least part of the data; the pattern processing circuit (180 PPC) is a string matching circuit and searches for the keyword in the data.

39. The three-dimensional processor (100) of claim 36, further characterized by: the retrieval pattern comprises at least part of an acoustic/language model; the target pattern comprises at least part of speech data; the pattern processing circuit (180 PPC) is a speech recognition circuit and performs speech recognition on the speech data according to the acoustic/language model.

40. The three-dimensional processor (100) of claim 36, further characterized by: the retrieval pattern comprises at least part of an image model; the target pattern comprises at least part of image data; the pattern processing circuit (180 PPC) is an image recognition circuit and performs image recognition on the image data based on the image model.
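The arrangement of claims 30 to 33, in which an input broadcasts a target pattern to many units that each hold part of the retrieval patterns locally, can be modelled in software as a rough sketch. The class name `StorageUnit`, the function `search`, the byte-string representation of patterns and the use of exact substring matching are all illustrative assumptions; the claims do not prescribe a matching algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StorageUnit:
    """Hypothetical model of one unit: locally stored patterns plus a matcher."""
    stored_patterns: List[bytes]  # stands in for the second pattern held in the 3D-M array

    def process(self, target: bytes) -> List[Tuple[bytes, int]]:
        """Return (pattern, offset) for every occurrence of a stored pattern in target."""
        hits = []
        for pattern in self.stored_patterns:
            if not pattern:
                continue  # skip empty patterns, which would match everywhere
            start = target.find(pattern)
            while start != -1:
                hits.append((pattern, start))
                start = target.find(pattern, start + 1)  # allow overlapping matches
        return hits

def search(units: List[StorageUnit], target: bytes) -> List[Tuple[bytes, int]]:
    """Broadcast the target to every unit and merge the hits, ordered by offset."""
    results = []
    for unit in units:  # in hardware these units would operate in parallel
        results.extend(unit.process(target))
    return sorted(results, key=lambda hit: (hit[1], hit[0]))
```

For example, `search([StorageUnit([b"abc"]), StorageUnit([b"bc"])], b"xabcabc")` reports both keywords at each position where they occur. Swapping the roles of the broadcast data and the stored data gives the configuration of claims 36 to 38.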

41. The three-dimensional processor (100) according to any one of claims 30-40, further characterized by at least one of the following features M41)-U41):

M41) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N41) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O41) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P41) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q41) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R41) the number of back-end wiring layers in the first chip (100 a) is larger than the number of back-end wiring layers in the second chip (100 b); or

S41) the number of the address line layers in the first chip (100 a) is at least twice as large as the number of the interconnection line layers in the second chip (100 b); or

T41) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U41) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).

42. A discrete three-dimensional processor (100) comprising:

a plurality of storage units (100 aa-100 mn), each storage unit (100 ij) comprising at least one three-dimensional storage 3D-M array (170) and a neural computation circuit (180), the three-dimensional storage 3D-M array (170) storing at least a portion of the synaptic weights, the neural computation circuit (180) performing a neural computation based on the synaptic weights;

a second chip (100 b) comprising a second semiconductor substrate (0 b), said second chip (100 b) comprising at least a portion of said neural computation circuit (180) and a piece of peripheral circuitry (190) of said three-dimensional storage 3D-M array (170), said second chip (100 b) comprising a plurality of transistors in said second semiconductor substrate (0 b);

43. The three-dimensional processor (100) of claim 42, further characterized by at least one of the following features M43)-U43):

M43) the three-dimensional storage 3D-M array (170) contains a plurality of memory cells stacked on top of each other, with no semiconductor substrate between the memory cells; or

N43) the three-dimensional storage 3D-M array (170) is a three-dimensional random access memory 3D-RAM array; or

O43) the three-dimensional storage 3D-M array (170) is a three-dimensional read-only memory 3D-ROM array; or

P43) the three-dimensional storage 3D-M array (170) is a three-dimensional horizontal storage 3D-M_H array; or

Q43) the three-dimensional storage 3D-M array (170) is a three-dimensional vertical storage 3D-M_V array; or

R43) the number of back-end wiring layers in the first chip (100 a) is larger than the number of back-end wiring layers in the second chip (100 b); or

S43) the number of address line layers in the first chip (100 a) is at least twice the number of interconnect line layers in the second chip (100 b); or

T43) the number of memory cells in the memory string of the first chip (100 a) is at least twice the number of levels of interconnect lines in the second chip (100 b); or

U43) the number of interconnect layers of the substrate circuit (0 Ka) in the first chip (100 a) is smaller than the number of interconnect layers in the second chip (100 b).
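The neural computation recited in claim 42, in which each storage unit holds synaptic weights and computes on them locally, can be sketched as a weighted sum followed by an activation. The function names `neuron` and `layer_forward`, the one-row-per-unit mapping and the ReLU activation are assumptions chosen for illustration; the claims do not specify a network type or activation function.

```python
def neuron(weights, inputs, bias=0.0):
    """Multiply-accumulate over locally stored weights, then an assumed ReLU activation."""
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x                   # multiply-accumulate on the stored weight
    return acc if acc > 0.0 else 0.0   # ReLU (illustrative choice)

def layer_forward(weight_rows, inputs):
    """Each row stands in for the synaptic weights held by one storage unit."""
    return [neuron(row, inputs) for row in weight_rows]
```

For instance, `layer_forward([[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]], [2.0, 1.0])` evaluates three neurons on the same input vector, one per row of weights.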