CN116150085A

CN116150085A - Separated three-dimensional processor

Info

Publication number: CN116150085A
Application number: CN202310223434.7A
Authority: CN
Inventors: 张国飙
Original assignee: Hangzhou Haicun Information Technology Co Ltd
Current assignee: Hangzhou Haicun Information Technology Co Ltd
Priority date: 2018-12-10
Filing date: 2019-01-16
Publication date: 2023-05-23
Also published as: CN111290994A; CN111290994B; WO2020119511A1; CN115794730A; CN112597098A; CN113918506A; CN116049093A; CN116303224A

Abstract

A separated three-dimensional processor (100) includes a plurality of storage-computing units (100aa-100mn), each storage-computing unit (100ij) including a logic circuit (180) and at least one memory array (170). The three-dimensional processor (100) further comprises a first chip (100a) and a second chip (100b) that are coupled to each other: the first chip (100a) contains the memory array (170); the second chip (100b) contains the logic circuit (180) and at least a portion of the off-chip peripheral circuit components (190) of the memory array (170). The logic circuit (180) is not a peripheral circuit of the memory array (170).

Description

Separated three-dimensional processor

The present application is a divisional application of Chinese patent application No. 201910038528.0, filed on January 16, 2019 and entitled "Separated three-dimensional memory".

Technical Field

The present invention relates to the field of integrated circuits, and more particularly to processors.

Background

Processors (including CPU, GPU, FPGA, etc.) are widely used in the fields of mathematical computation, computer simulation, programmable gate arrays, pattern processing, neural networks, etc. Conventional processor chips are based on two-dimensional integration with logic circuits (e.g., arithmetic logic units, control units, etc.) in the same plane (i.e., semiconductor substrate surface) as memory circuits (internal memory, including RAM for buffering and ROM for storing look-up tables, etc.). Since the main function of the processor chip is arithmetic logic operation, the capacity of its internal memory is small.

Traditional computers are based on the von Neumann architecture, in which the processor and the memory are separated: most of the memory (e.g., main memory, external storage, etc.) is located outside the processor chip. When a computation needs a large amount of data, the processor chip must fetch that data from the external memory. Because the physical distance between the external memory and the processor chip is long and the data bus between them is relatively narrow, the data transfer bandwidth between them is limited. With the advent of massive data, traditional processors and their von Neumann architecture have become increasingly inadequate.

The current state and limitations of processors in each of their application fields are described below.

[A] Mathematical calculation.

One important application of processors is mathematical calculation, including the calculation of mathematical functions and of mathematical models. To implement mathematical calculations, conventional processors employ logic-based computation (LBC), in which the computation is carried out primarily by logic circuits (commonly referred to as arithmetic logic units, ALUs). In practice, the only arithmetic operations that an ALU can implement directly are addition, subtraction, and multiplication, which are collectively referred to as basic arithmetic operations. The ALU is suited to implementing arithmetic functions but cannot implement non-arithmetic functions. In a processor implementing mathematical calculations, an arithmetic function is a mathematical function that can be expressed as a combination of the processor's basic arithmetic operations, whereas a non-arithmetic function is one that cannot. Examples of non-arithmetic functions include transcendental functions, special functions, and the like. Since a non-arithmetic function involves operations beyond those the ALU supports, it cannot be implemented by the ALU alone. Hardware implementation of non-arithmetic functions has always been a significant challenge.

In conventional processors, only a small number of basic functions (i.e., univariate non-arithmetic functions, including basic algebraic functions, basic transcendental functions, etc.) can be implemented directly in hardware; these functions are referred to as built-in functions. A built-in function is typically implemented by a combination of logic circuits and a look-up table (LUT). There is abundant prior art on implementing built-in functions. For example, U.S. Pat. No. 5,954,787 (inventor: Eun; grant date: September 21, 1999) discloses a method for implementing sine/cosine (SIN/COS) functions using LUTs; U.S. Pat. No. 9,207,910 (inventor: Azadet; grant date: December 8, 2015) discloses a method for implementing a power function using a LUT.

FIG. 1AA specifically illustrates a method of implementing a built-in function. The conventional processor 0X generally includes a logic circuit 00L and a memory circuit 00M. Logic circuit 00L contains an ALU for implementing arithmetic operations. The memory circuit 00M stores the LUT of the function. To achieve a predetermined accuracy, the polynomial representing the built-in function needs to be expanded to a sufficiently high order. The memory circuit 00M stores polynomial coefficients and ALU 00L computes a corresponding polynomial. Since ALU 00L and memory circuit 00M are arranged side-by-side on the same plane (both formed in substrate 00S), this planar integration is a two-dimensional integration.
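By way of illustration only, the division of work described above, in which the memory circuit holds polynomial coefficients and the ALU evaluates the polynomial, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular processor: the choice of sin(x), the use of Taylor coefficients about zero, and the names `build_coefficient_table`, `evaluate_polynomial`, and `max_abs_error` are all hypothetical.

```python
import math

def build_coefficient_table(order):
    """Assumed stand-in for memory circuit 00M: Taylor coefficients of sin(x) about 0."""
    table = []
    for n in range(order + 1):
        if n % 2 == 0:
            table.append(0.0)  # even powers do not appear in the sine series
        else:
            sign = -1.0 if (n // 2) % 2 else 1.0
            table.append(sign / math.factorial(n))
    return table

def evaluate_polynomial(coefficients, x):
    """Assumed stand-in for ALU 00L: Horner's rule, using only multiply and add."""
    result = 0.0
    for c in reversed(coefficients):
        result = result * x + c
    return result

def max_abs_error(order, samples=200):
    """Largest deviation from math.sin over [-pi/2, pi/2] for a given polynomial order."""
    table = build_coefficient_table(order)
    xs = [-math.pi / 2 + math.pi * i / (samples - 1) for i in range(samples)]
    return max(abs(evaluate_polynomial(table, x) - math.sin(x)) for x in xs)

if __name__ == "__main__":
    # A higher order gives higher accuracy but needs more stored coefficients.
    for order in (3, 5, 9):
        print(order, max_abs_error(order))
```

In this sketch, raising the polynomial order improves accuracy at the cost of more coefficients in the table and more multiply-add steps, which is the trade-off the paragraph above describes.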

Computing is currently evolving towards higher computational density and greater computational complexity. The computation density refers to the computation power (such as the number of floating point number operations per second) per unit chip area, and is an important index of parallel computation. The calculation complexity refers to the number of built-in functions supported by the chip, and is an important index of scientific calculation. Two-dimensional integration limits further development of computational density and computational complexity.

With two-dimensional integration, an excessively large memory circuit 00M increases the chip area of the processor 0X and reduces its computation density, which is detrimental to parallel computation. Furthermore, since the ALU 00L is the core component of the processor 0X and occupies a large portion of the chip area, the chip area available to the memory circuit 00M is limited, and it can support only a small number of built-in functions. FIG. 1AB lists all the built-in transcendental functions that can be implemented by Intel's IA-64 processor (cf. Harrison et al., "The Computation of Transcendental Functions on the IA-64 Architecture", Intel Technical Journal, Q4, 1999). The IA-64 processor supports only seven built-in functions in total; such a small set of built-in functions is highly detrimental to mathematical calculation. Because most mathematical functions must be decomposed by software into combinations of built-in functions, the conventional processor 0X is slow and inefficient at computing most mathematical functions.

[B] Computer simulation.

Another important application of processors is computer simulation, i.e. computation of a mathematical model. Computer simulation is a natural extension of mathematical computation, based on a set of built-in functions (containing only about ten built-in functions) supported by a conventional processor. Traditional computer simulations contain three levels: a base layer, a function layer and a model layer. The base layer comprises built-in functions which can be directly realized by various hardware; the function layer contains mathematical functions which cannot be directly realized by various hardware; the model layer contains various mathematical models that describe the performance (e.g., input-output characteristics) of the various system components.

The mathematical functions in the function layer and the mathematical models in the model layer are implemented by software. As previously mentioned, the function layer requires one software decomposition. The model layer requires two software decompositions: a mathematical model is first decomposed into mathematical functions, which are then decomposed into built-in functions. Because a mathematical model involves more software decompositions, computing it costs more time and energy than computing a mathematical function.

The amount of computation required by a mathematical model is considerable. FIGS. 1BA-1BB show a simple example: the simulation of an amplifier circuit 0Y. The amplifier circuit 0Y includes a transistor 0T and a resistor 0R (FIG. 1BA). The mathematical models of the transistor 0T (e.g., MOS3, BSIM3 v3.2, BSIM4 v3.0, PSP, etc. in FIG. 1BB) are built on the set of built-in functions supported by the conventional processor 0X. Since the variety of built-in functions is limited, a large amount of computation is required even to calculate a single current point of the transistor 0T (FIG. 1BB). For example, the BSIM4 v3.0 transistor model requires 222 additions, 286 multiplications, 85 divisions, 16 square root operations, 24 exponentiations, and 19 logarithms.
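To put the operation counts quoted above in perspective, the following short sketch tallies them and separates the basic arithmetic operations (addition, subtraction, and multiplication, as defined earlier in this specification) from the remainder. The dictionary keys and the function name `tally` are illustrative assumptions; only the counts themselves come from the text.

```python
# Operation counts quoted in the text for one current point of the transistor model.
MODEL_OPERATIONS = {
    "addition": 222,
    "multiplication": 286,
    "division": 85,
    "square_root": 16,
    "exponentiation": 24,
    "logarithm": 19,
}

# Basic arithmetic operations of a conventional processor, per the specification.
BASIC_ARITHMETIC = {"addition", "subtraction", "multiplication"}

def tally(counts):
    """Return (total, basic, other) operation counts."""
    total = sum(counts.values())
    basic = sum(n for op, n in counts.items() if op in BASIC_ARITHMETIC)
    return total, basic, total - basic

if __name__ == "__main__":
    total, basic, other = tally(MODEL_OPERATIONS)
    print(f"total={total}, basic arithmetic={basic}, other={other}")
```

Under this grouping, a single current point takes 652 operations, of which 144 fall outside the basic arithmetic operations and therefore depend on built-in functions or further software decomposition.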

ALU 00L in conventional processor 0X itself can only compute an arithmetic model. Since most mathematical models are non-arithmetic models, they cannot be implemented by ALU 00L alone. In a processor implementing computer simulation, an arithmetic model is a mathematical model that can be expressed as a combination of its basic arithmetic operations, while a non-arithmetic model is a mathematical model that cannot be expressed as a combination of its basic arithmetic operations. Since the non-arithmetic model contains more operations than the arithmetic logic unit supports, the non-arithmetic model cannot be implemented by the ALU alone. Calculating the non-arithmetic model with the conventional processor 0X is slow and inefficient.

[C] Programmable gate array.

A third application of processors is the programmable gate array. Programmable gate arrays (also known as FPGAs, CPLDs, etc.) are semi-custom integrated circuits, i.e., the customization of their logic circuits is achieved by back-end processing or field programming. U.S. Pat. No. 4,870,302 discloses a programmable gate array. It contains a plurality of programmable logic units (configurable logic elements, abbreviated as CLE; also called configurable logic blocks) and programmable connections (configurable interconnects, abbreviated as CIT; also called programmable interconnects). Under the control of a configuration signal, a programmable logic unit can selectively implement functions such as shift, logical NOT, AND (logical AND), OR (logical OR), NOR, NAND, XOR (exclusive OR), addition (arithmetic addition), and subtraction (arithmetic subtraction); under the control of a configuration signal, a programmable connection can selectively implement functions such as connecting or disconnecting two interconnect lines.
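The selectable behavior of a programmable logic unit and a programmable connection can be modeled in software as follows. This is a behavioral sketch only: the 8-bit width, the configuration names, and the functions `configurable_logic_element` and `configurable_interconnect` are assumptions made for illustration and are not taken from the cited patent.

```python
WIDTH = 8                      # assumed word width of the logic element
MASK = (1 << WIDTH) - 1        # keeps every result within WIDTH bits

# Each configuration value selects one function, mirroring the list in the text.
OPERATIONS = {
    "SHIFT": lambda a, b: (a << 1) & MASK,
    "NOT":   lambda a, b: ~a & MASK,
    "AND":   lambda a, b: a & b,
    "OR":    lambda a, b: a | b,
    "NOR":   lambda a, b: ~(a | b) & MASK,
    "NAND":  lambda a, b: ~(a & b) & MASK,
    "XOR":   lambda a, b: a ^ b,
    "ADD":   lambda a, b: (a + b) & MASK,
    "SUB":   lambda a, b: (a - b) & MASK,
}

def configurable_logic_element(config, a, b=0):
    """Apply the function selected by the configuration signal to the inputs."""
    return OPERATIONS[config](a & MASK, b & MASK)

def configurable_interconnect(connected, signal):
    """Pass the signal when configured as connected; otherwise the line is open (None)."""
    return signal if connected else None

if __name__ == "__main__":
    print(configurable_logic_element("AND", 0b1100, 0b1010))
    print(configurable_logic_element("ADD", 200, 100))
    print(configurable_interconnect(False, 1))
```

Every entry in the table is a logic operation, an addition, or a subtraction, so no configuration of this element yields a function such as a logarithm.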

In a programmable gate array, the arithmetic operations supported by the programmable logic units (arithmetic addition and arithmetic subtraction) are collectively referred to as basic arithmetic operations. These are fewer than the basic arithmetic operations of a conventional processor (addition, subtraction, and multiplication). Where this specification refers to a basic arithmetic operation, the context determines whether it means a basic arithmetic operation of a programmable gate array or of a conventional processor.

The programmable gate array can achieve customization of logic functions as well as arithmetic functions, but does not have the ability to customize non-arithmetic functions. In a programmable gate array, an arithmetic function is a mathematical function that can be expressed as a combination of its basic arithmetic operations, while a non-arithmetic function is a mathematical function that cannot be expressed as a combination of its basic arithmetic operations. Since the non-arithmetic function contains more operations than the arithmetic operations supported by the programmable logic unit, the non-arithmetic function cannot be implemented by the programmable logic unit alone. Customization of non-arithmetic functions is not considered possible in the prior art.

[D] Pattern processing.

A fourth application of processors is pattern processing. Pattern processing includes pattern matching and pattern recognition; it refers to finding, in a target pattern (the pattern being searched), a pattern that is the same as or close to a search pattern (the pattern searched for). Pattern matching requires finding an identical pattern, whereas pattern recognition only requires finding a close pattern. In this specification, "pattern" includes both target patterns and search patterns; "pattern library" refers to a database containing related patterns, and includes target pattern libraries and search pattern libraries.
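The distinction drawn above between pattern matching (identical pattern) and pattern recognition (close pattern) can be illustrated with strings. The sketch below is an assumption-laden example: the specification does not define a closeness measure, so Hamming distance with a threshold is used here purely as a stand-in, and the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pattern_matching(target, search):
    """Offsets in the target pattern where the search pattern occurs exactly."""
    n = len(search)
    return [i for i in range(len(target) - n + 1) if target[i:i + n] == search]

def pattern_recognition(target, search, max_distance):
    """Offsets where the target is within max_distance of the search pattern
    (Hamming distance is an assumed closeness measure)."""
    n = len(search)
    return [
        i for i in range(len(target) - n + 1)
        if hamming_distance(target[i:i + n], search) <= max_distance
    ]

if __name__ == "__main__":
    target_pattern = "abcabdabc"
    search_pattern = "abc"
    print(pattern_matching(target_pattern, search_pattern))
    print(pattern_recognition(target_pattern, search_pattern, max_distance=1))
```

With the example strings, exact matching reports only the two identical occurrences, while recognition with a tolerance of one mismatch also reports the near match "abd".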

Pattern processing is widely used. Common types of pattern processing include code matching, string matching, speech recognition, image recognition, and the like. Code matching is widely used in fields such as information security; its operations include searching network data packets or computer files for viruses, or checking whether they conform to specifications, in order to determine whether the data is safe. String matching, also called keyword retrieval, is widely used in fields such as big data analysis; its operations include regular expression matching and the like. Speech recognition finds, in an acoustic/language model library, the acoustic/language model closest to the speech data. Image recognition finds, in an image model library, the image model closest to the image data.

With the advent of the big data era, pattern libraries have become large databases. The data volume of a search pattern library (containing related search patterns, e.g., a virus library, keyword library, acoustic/language model library, image model library, etc.) is already large, while that of a target pattern library (containing related target patterns, e.g., the computer files on an entire hard disk, a big data database, a speech archive, an image archive, etc.) is larger still. Unfortunately, the internal memory of existing processors cannot hold these pattern libraries: they all have to be stored in external memory, and patterns have to be read from external memory frequently during pattern processing. Therefore, existing processors and their architectures cannot perform fast pattern processing on large pattern libraries.

[E] Neural network.

A fifth application of processors is the neural network. Neural networks provide a powerful tool for artificial intelligence. FIG. 1C shows an example of a neural network. It contains an input layer 32, a hidden layer 34, and an output layer 36. The input layer 32 contains i neurons 33, whose input data x₁, …, xᵢ constitute the input vector 30x. The output layer 36 contains k neurons 37, whose output data y₁, y₂, …, yₖ constitute the output vector 30y. The hidden layer 34 lies between the input layer 32 and the output layer 36. It contains j neurons 35, each neuron 35 being electrically coupled to a first neuron in the input layer 32 and to a second neuron in the output layer 36. The coupling strength between neurons is represented by the synaptic weights wᵢⱼ and wⱼₖ.
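The computation performed by such a three-layer network can be sketched as follows. The sketch assumes a fully connected network and a sigmoid activation; the specification states neither, so both are illustrative assumptions, as are the function names. The arguments `w_ij` and `w_jk` play the role of the synaptic weights between the input and hidden layers and between the hidden and output layers.

```python
import math

def sigmoid(z):
    """Assumed activation function; the specification does not name one."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """weights[a][b] is the synaptic weight from input neuron a to output neuron b."""
    n_out = len(weights[0])
    return [
        sigmoid(sum(inputs[a] * weights[a][b] for a in range(len(inputs))))
        for b in range(n_out)
    ]

def forward(x, w_ij, w_jk):
    """Map the input vector x to the output vector y through one hidden layer."""
    hidden = layer(x, w_ij)      # j hidden-neuron activations
    return layer(hidden, w_jk)   # k output-neuron activations

if __name__ == "__main__":
    x = [0.5, -1.0]                              # i = 2 inputs
    w_ij = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # 2 x 3 weights, j = 3
    w_jk = [[1.0], [-1.0], [0.5]]                # 3 x 1 weights, k = 1
    print(forward(x, w_ij, w_jk))
```

Each output requires one multiply-accumulate per synaptic weight, which is why both the number of multipliers and adders and the storage available for weights matter in the accelerator discussed next.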

The prior art proposes a neural network accelerator chip 60 (see Yunji Chen et al., "DaDianNao: A Machine-Learning Supercomputer", IEEE/ACM International Symposium on Microarchitecture, 5(1), pages 609-622, 2014). The neural network accelerator 60 contains 16 cores 50 that are coupled to one another by a tree connection (FIG. 1DA). Each core 50 contains one neural processing unit (NPU) 30 and four eDRAM blocks 40 (FIG. 1DB). The NPU 30 performs the neural computation; it includes 256+32 16-bit multipliers and 256+32 16-bit adders. The eDRAM 40 stores the synaptic weights and has a storage capacity of 2MB.

There is room to improve the neural network accelerator 60. First, the eDRAM 40 is a volatile memory, so before operation the synaptic weights must be loaded into the eDRAM 40 from external memory, which takes time. Second, each neural network accelerator chip 60 has only 32MB of eDRAM available for storing synaptic weights, which is still far below actual needs. Third, the design of the neural network accelerator 60 is skewed toward memory: in each core, the eDRAM 40 occupies 80% of the area while the NPU 30 occupies less than 10%, so the computation density is very limited.

With the advent of three-dimensional memory (3D-M for short), the difficulties encountered by the above conventional processors and their architectures can be largely resolved. The memory cells of a 3D-M are distributed in three-dimensional space, i.e., stacked on top of one another in the direction perpendicular to the substrate. Chinese patent 02131089.0 (publication number CN 1285125C; grant date: November 15, 2006) proposes a 3D-M-based processor (i.e., a three-dimensional processor) that integrates logic circuits into the substrate under the 3D-M array to form an integrated three-dimensional processor. The integrated three-dimensional processor resides in a single three-dimensional processor chip.

The integrated three-dimensional processor can be applied to the above fields: Chinese patent application 201710241669.3 (filing date: April 13, 2017) applies an integrated three-dimensional processor to mathematical calculation and computer simulation; Chinese patent application 201710126067.3 (filing date: March 6, 2017) applies an integrated three-dimensional processor to a programmable gate array; Chinese patent application 201710130887.X (filing date: March 7, 2017) applies an integrated three-dimensional processor to a pattern processor; Chinese patent application 201710171413.X (filing date: March 21, 2017) applies an integrated three-dimensional processor to a neural network processor. Integrated three-dimensional processors show great advantages in these fields.

FIGS. 1EA-1EB illustrate an integrated three-dimensional processor 80, in which a 3D-M array 77 and a logic circuit 78 are integrated together. The 3D-M array 77 stores data, and the logic circuit 78 processes at least a portion of the data stored in the 3D-M array 77. In the three-dimensional processor chip, the chip area occupied by the memory array 77 is the array region 70, and the chip area outside the array region 70 is the non-array region 71 (FIG. 1EA). The array region 70 contains a substrate circuit 0K and a 3D-M array 77 stacked on the substrate circuit 0K (FIG. 1EB). The substrate circuit 0K is formed on the semiconductor substrate 0 below the 3D-M array 77. It contains transistors 0t and substrate interconnects 0i. The transistors 0t are formed in the semiconductor substrate 0 and are electrically coupled to one another through the substrate interconnects 0i. The substrate interconnects 0i contain two interconnect layers 0m1-0m2, each interconnect layer (e.g., 0m1) containing a plurality of interconnect lines (e.g., 0m) that lie in the same physical plane. The 3D-M array 77 includes four address line layers 0a1-0a4, each address line layer (e.g., 0a1) including a plurality of address lines (e.g., 1a) that lie in the same physical plane. These address line layers 0a1-0a4 form two memory layers 16A, 16B, where the memory layer 16A is stacked over the substrate circuit 0K and the memory layer 16B is stacked over the memory layer 16A. A memory cell (e.g., 7aa) is located at the intersection of two address lines (e.g., 1a, 2a). The memory layers 16A, 16B are electrically coupled to the substrate circuit 0K through the contact vias 1av, 3av, respectively.

The non-array region 71 also contains a portion of the substrate circuit 0K (FIG. 1EB). Since the non-array region 71 does not contain the 3D-M array 77, it has fewer back-end-of-line (BEOL) wiring layers than the array region 70. In this specification, a back-end wiring layer is an individual conductive layer formed in the back-end process (vias are not counted). In FIG. 1EB, the array region 70 contains six back-end wiring layers: the two interconnect layers 0m1-0m2 of the substrate interconnects 0i and the four address line layers 0a1-0a4 of the memory array 77. The non-array region 71 contains only two back-end wiring layers, namely the two interconnect layers 0m1-0m2 of the substrate interconnects 0i. In the non-array region 71, the space 72 above the substrate circuit 0K contains neither memory cells nor interconnect lines and is effectively wasted.

The array region 70 includes a plurality of 3D-M arrays 77 together with their local peripheral circuits 75 and logic circuits 78 (FIG. 1EA). The local peripheral circuits 75 and the logic circuits 78 are formed in the substrate 0, near the projection of the 3D-M array 77 onto the substrate 0. Because the 3D-M array 77 is stacked above the local peripheral circuits 75 and the logic circuits 78 rather than located in the substrate 0, it is drawn with a dashed line. The non-array region 71, on the other hand, contains the global peripheral circuits 73 of the 3D-M arrays 77, which are formed in the substrate 0 outside the projections of all 3D-M arrays 77 onto the substrate 0. The local peripheral circuits 75 and the global peripheral circuits 73 are collectively referred to as peripheral circuits 79.

In the three-dimensional processor chip 80, the non-array region 71 occupies a large amount of chip area. Currently, the non-array region 71 occupies 20-30% of the chip area; for mass storage, this ratio will even be above 50%. Thus, the array efficiency of the integrated three-dimensional processor 80 is low. In this specification, array efficiency is the ratio of the total projected area of the 3D-M array 77 in the chip onto the substrate 0 to the total area of the chip.
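The definition of array efficiency given above can be written as a short calculation. The areas used in the example are hypothetical values chosen only so that the non-array share (25%) falls within the 20-30% range mentioned in the text; the function name `array_efficiency` is likewise an assumption.

```python
def array_efficiency(array_projected_areas, chip_area):
    """Total projected area of the 3D-M arrays on the substrate divided by total chip area."""
    if chip_area <= 0:
        raise ValueError("chip_area must be positive")
    total_array_area = sum(array_projected_areas)
    if total_array_area > chip_area:
        raise ValueError("projected array area cannot exceed chip area")
    return total_array_area / chip_area

if __name__ == "__main__":
    # Hypothetical chip: 100 mm^2 in total, three arrays projecting onto 75 mm^2.
    print(array_efficiency([30.0, 30.0, 15.0], 100.0))
```

Any chip area spent on circuits outside the array projections lowers this ratio, which is the motivation for moving such circuits to another chip.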

The prevailing view in the integrated circuit field is that the higher the level of integration, the better, i.e., integration reduces cost and improves performance. Conventional integrated circuits therefore tend toward monolithic (single-chip) integration, in which all circuit components are integrated into one chip. Monolithic integration is effective for two-dimensional circuits but is no longer effective for three-dimensional circuits, especially when three-dimensional circuits (e.g., three-dimensional memory) are mixed with two-dimensional circuits. In this specification, a two-dimensional circuit is a circuit whose active elements (e.g., transistors, memory cells, etc.) are distributed in a two-dimensional plane (e.g., the front surface of a semiconductor substrate); a three-dimensional circuit is a circuit whose active elements (e.g., transistors, memory cells, etc.) are distributed in three-dimensional space (stacked on top of one another in the direction perpendicular to the front surface of a semiconductor substrate).

Monolithic integration has several drawbacks when applied to integrating three-dimensional circuits with two-dimensional circuits. First, the back-end processes of the two are incompatible. Blind integration forces the logic circuit 78 and the peripheral circuits 79 to be fabricated with the complex process used to fabricate the 3D-M array 77. Combined with the low array efficiency of the integrated three-dimensional processor chip 80, blind integration increases the overall cost of the three-dimensional processor chip 80.

Second, because the 3D-M array 77 places very stringent demands on the process, the back-end process of the three-dimensional processor chip 80 has to be optimized for the 3D-M array 77, which inevitably sacrifices some performance of the logic circuit 78 and the peripheral circuits 79. In an integrated three-dimensional processor 80, the logic circuit 78 and the peripheral circuits 79 can use only the few (e.g., two) interconnect layers 0m1-0m2 of the substrate interconnects 0i, or must use relatively slow high-temperature interconnect materials (e.g., tungsten) that can withstand the high-temperature back-end process used to fabricate the 3D-M array 77. Either constraint reduces the overall performance of the three-dimensional processor chip 80.

Finally, with monolithic integration, the chip area available to the logic circuit 78 is limited to the projected area of the 3D-M array 77 on the substrate, so the logic circuit 78 can perform only limited processing functions. In addition, since the logic circuit 78 is fixed together with the 3D-M array 77, the three-dimensional processor 80 can perform only a fixed function. If the three-dimensional processor 80 is required to perform other functions, the entire three-dimensional processor 80 (including its 3D-M array 77 and logic circuit 78) has to be redesigned and remanufactured, which costs considerable time and money.

Disclosure of Invention

The main object of the present invention is to provide a three-dimensional processor with lower overall cost.

It is another object of the present invention to provide a three-dimensional processor with more excellent overall performance.

It is a further object of the invention to provide a more powerful and flexible three-dimensional processor.

It is a further object of the invention to provide a processor with a greater computational density.

It is a further object of the invention to provide a processor with greater computational complexity.

It is a further object of the invention to increase the speed and efficiency of mathematical calculations.

It is a further object of the invention to increase the speed and efficiency of computer simulation.

It is a further object of the invention to customize non-arithmetic functions.

It is a further object of the invention to customize complex functions.

It is a further object of the invention to implement reconfigurable computing.

It is another object of the present invention to achieve high speed and efficient pattern processing for large pattern libraries.

It is a further object of the invention to enhance information security.

It is a further object of the invention to enhance big data analysis capabilities.

It is a further object of the present invention to enhance speech recognition capabilities and to enable speech retrieval from a speech archive.

It is another object of the present invention to enhance image recognition capabilities and to enable image retrieval for an image archive.

It is a further object of the invention to enhance the computational power of a neural network.

To achieve these and other objects, the present invention follows a design principle distinct from that of conventional processors: the three-dimensional circuits and the two-dimensional circuits are de-integrated. In particular, the three-dimensional circuits and the two-dimensional circuits are placed in different chips as far as possible so that each can be optimized separately. Accordingly, the present invention proposes a separated three-dimensional processor (100), characterized by comprising: a plurality of storage-computing units (100aa-100mn), each storage-computing unit (100ij) comprising at least one three-dimensional memory (3D-M) array (170) and a logic circuit (180); and a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the logic circuit (180) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160). Briefly, the first chip 100a is a memory chip that contains a plurality of functional layers stacked on top of one another; the second chip 100b is a logic chip that has only one functional layer. Note that the logic circuit 180 does not include the 3D-M array 170 or any of its peripheral circuits. The separated three-dimensional processor 100 can be generalized to any memory (including random access memory RAM, read-only memory ROM, non-volatile memory NVM, etc.): as long as the peripheral circuits of the memory array 170 have a different back-end structure from the memory array 170, at least a portion of the peripheral circuits 190 of the memory array 170 may be placed in another chip. Specifically, the memory array 170 is located in the first chip 100a, and at least a portion of the peripheral circuits 190 of the memory array 170 is located in the second chip 100b.

The separated three-dimensional processor differs from the integrated three-dimensional processor: in an integrated three-dimensional processor, the 3D-M array and all of its peripheral circuit components are located on the same chip; in a separated three-dimensional processor, at least one peripheral circuit component of the 3D-M array is located not on the first chip but on the second chip. Accordingly, a peripheral circuit component in the second chip is referred to as an off-chip peripheral circuit component. The circuit partitioning strategy adopted in designing the separated three-dimensional processor is for the second chip to contain as many off-chip peripheral circuit components as possible. The advantage of this partitioning is that the array efficiency of the first chip is greatly improved. It is noted that although the first chip contains a 3D-M array, it lacks the off-chip peripheral circuit components and therefore cannot function properly as a memory chip on its own, e.g., its performance does not meet the industry standards for comparable memory chips.

In a separated three-dimensional processor, the first chip and the second chip can have distinct back-end structures, since they can be designed and manufactured separately. Because the back-end structure of the second chip can be optimized independently, its off-chip peripheral circuit components and logic circuits have lower cost and better performance than the same types of circuits in an integrated three-dimensional processor. The separated three-dimensional processor and the integrated three-dimensional processor are compared below.

First, because at least part of the peripheral circuits and logic circuits are absent from the first chip, its array efficiency is high. In addition, as a two-dimensional circuit, the second chip 100b has far fewer back-end wiring layers than an integrated three-dimensional processor and can be manufactured using a conventional process. Since wafer cost is roughly proportional to the number of back-end wiring layers, the wafer cost of the second chip is much lower than that of the integrated three-dimensional processor. Thus, the total chip cost of the separated three-dimensional processor (including the first and second chips) is lower than that of an integrated three-dimensional processor (which contains only one chip). Even when the additional bonding cost is taken into account, the overall cost of the separated three-dimensional processor is lower than that of the integrated three-dimensional processor.

Second, because they can be optimized individually, the off-chip peripheral circuit components and logic circuits in a separated three-dimensional processor perform better than the same types of circuits in an integrated three-dimensional processor. In one embodiment, the number of interconnect layers in the second chip (e.g., four, eight, or more) is greater than the number of interconnect layers of the substrate circuit (e.g., two) in the integrated three-dimensional processor (or the first chip). In another embodiment, the second chip uses a high-performance interconnect material (e.g., copper) rather than the high-temperature interconnect material (e.g., tungsten) used by the integrated three-dimensional processor (or the first chip). Thus, the overall performance of the separated three-dimensional processor is superior to that of the integrated three-dimensional processor.

Finally, in an integrated three-dimensional processor, the logic circuit is confined to one chip (e.g., to the projected area of the 3D-M array on the substrate), so its area and functionality are limited. In contrast, in a separate three-dimensional processor, the logic circuit may be formed in two chips (a first portion of the logic circuit is located in the first chip, within the projected area of the 3D-M array on the substrate, and a second portion of the logic circuit is located in the second chip); the larger area gives the separate three-dimensional processor more processing power. Furthermore, since the second chip is designed and produced separately, it offers greater flexibility in design and production. By combining the same first chip with second chips having different functions, processing functions suited to different application scenarios can be realized. Moreover, these different processing functions can be implemented within a shorter design cycle and with a smaller design budget. Thus, the separate three-dimensional processor is more powerful and more flexible.

The application of the separate three-dimensional processor in different fields is described below.

[A] Mathematical calculation.

When applied to mathematical calculation, the separate three-dimensional processor is used to implement non-arithmetic functions. It employs memory-based computation (MBC); that is, computation is accomplished primarily through a high-capacity LUT (i.e., a 3DM-LUT) stored in the 3D-M array. Compared with the LUT used in conventional logic-based computation (LBC), the 3DM-LUT used by MBC has a much larger capacity. For example, the single-die storage capacity of 3D-XPoint is up to 128 Gb, far higher than that of a conventional LUT (tens of kb), and can be used to implement tens of thousands of non-arithmetic functions (including various transcendental functions and special functions). Most MBC still requires arithmetic operations; however, by using a larger 3DM-LUT as the starting point, MBC needs fewer polynomial expansion terms. In MBC, the memory circuit plays a larger part in the computation than the logic circuit.
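The trade-off between table size and arithmetic described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the choice of sin(x), the table sizes, the first-order expansion, and the names `build_lut` and `mbc_eval` are all assumptions made for illustration. The sketch tabulates function and derivative values, looks up the nearest entry, and applies a single expansion term; the larger table reaches a much smaller error with the same amount of arithmetic.

```python
import math

def build_lut(f, df, lo, hi, n):
    """Tabulate (value, derivative) pairs at n + 1 evenly spaced points.

    Stands in for a LUT stored in a memory array (illustrative only).
    """
    step = (hi - lo) / n
    table = [(f(lo + i * step), df(lo + i * step)) for i in range(n + 1)]
    return table, step

def mbc_eval(table, step, lo, x):
    """Look up the nearest entry, then apply a first-order expansion."""
    i = min(max(round((x - lo) / step), 0), len(table) - 1)
    value, slope = table[i]
    return value + slope * (x - (lo + i * step))

# Hypothetical table sizes: a small LUT versus a much larger one.
small, s_step = build_lut(math.sin, math.cos, 0.0, math.pi / 2, 16)
large, l_step = build_lut(math.sin, math.cos, 0.0, math.pi / 2, 4096)

x = 0.7
err_small = abs(mbc_eval(small, s_step, 0.0, x) - math.sin(x))
err_large = abs(mbc_eval(large, l_step, 0.0, x) - math.sin(x))
print(f"16-entry LUT error:   {err_small:.2e}")
print(f"4096-entry LUT error: {err_large:.2e}")
```

With the small table, more expansion terms would be needed to reach the accuracy that the large table achieves with one term, which is the sense in which a larger LUT reduces the arithmetic workload.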

Accordingly, the invention proposes a three-dimensional processor (100) for calculating at least one non-arithmetic function, characterized by comprising: a plurality of computing units (100ij), each computing unit (100ij) comprising at least one three-dimensional memory (3D-M) array (170) and an arithmetic logic circuit (ALC) (180ALC), the 3D-M array (170) storing at least part of a look-up table (LUT) of the non-arithmetic function, the ALC (180ALC) performing arithmetic operations on at least part of the data in the LUT; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least a portion of the ALC (180ALC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); the non-arithmetic function contains more operations than the arithmetic operations supported by the ALC (180ALC).

[B] Computer simulation.

When applied to computer simulation, the separate three-dimensional processor is used to implement non-arithmetic models; it still employs MBC. MBC offers great advantages for computer simulation. The large increase in built-in functions (from about ten to tens of thousands) flattens the traditional framework of computer simulation (comprising the base layer, the function layer, and the model layer). In the past, only the functions of the base layer could be implemented directly in hardware; now, not only the mathematical functions of the function layer but also the mathematical models of the model layer can be implemented directly in hardware. At the function layer, a mathematical function is calculated by a "function look-up method" (i.e., the 3DM-LUT stores function values and derivative values, and the function is calculated by table look-up and polynomial expansion); at the model layer, a mathematical model is calculated by a "model look-up method" (i.e., the 3DM-LUT stores model values and their derivatives, and the model is calculated by table look-up and polynomial expansion). The 3DM-LUT enables high-speed, efficient computation of mathematical models, which will drive a revolution in computer simulation.

Accordingly, the invention proposes a three-dimensional processor (100) for computing at least one non-arithmetic model, characterized by comprising: a plurality of computing units (100ij), each computing unit (100ij) comprising at least one three-dimensional memory (3D-M) array (170) and an arithmetic logic circuit (ALC) (180ALC), the 3D-M array (170) storing at least part of a look-up table (LUT) of the non-arithmetic model, the ALC (180ALC) performing arithmetic operations on at least part of the data in the LUT; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least a portion of the ALC (180ALC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); the non-arithmetic model contains more operations than the arithmetic operations supported by the ALC (180ALC).

[C] Programmable computing array.

When applied to a programmable computing array, the separate three-dimensional processor is a three-dimensional programmable computing array. It can customize not only logic functions and arithmetic functions but also non-arithmetic functions. Accordingly, the present invention proposes a three-dimensional programmable computing array (100) for customizing at least one non-arithmetic function, characterized by comprising: a plurality of programmable logic units (200) and/or programmable connections (300); a plurality of programmable computing units (400), each containing at least one three-dimensional memory (3D-M) array (170), the 3D-M array (170) storing at least part of a look-up table (LUT) of the non-arithmetic function; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the programmable logic units (200) and/or programmable connections (300) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); the non-arithmetic function is customized by programming the programmable logic units (200) and/or programmable connections (300) and the programmable computing units (400); the non-arithmetic function contains more operations than the arithmetic operations supported by the programmable logic units (200).

The usage cycle of a programmable computing unit includes two phases: a setup phase and a calculation phase. In the setup phase, the look-up table of a non-arithmetic function is loaded into the 3D-M array according to the user's needs; in the calculation phase, the corresponding LUT is looked up in the 3D-M array to obtain the value of the non-arithmetic function. For a 3D-M that can be reprogrammed, different non-arithmetic functions can be implemented by loading the LUTs of different non-arithmetic functions into the 3D-M array in different usage cycles, thereby enabling reconfigurable computation.

[D] Pattern processing.
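The two-phase usage cycle can be sketched in Python as follows. This is only a behavioural illustration under stated assumptions: the class name `ProgrammableComputingUnit`, the nearest-entry look-up, the input range, and the two example functions (`math.exp`, `math.log1p`) are hypothetical and do not come from the specification.

```python
import math

class ProgrammableComputingUnit:
    """Toy model of a reprogrammable memory array holding one function's LUT."""

    def __init__(self, lo, hi, entries):
        self.lo, self.hi, self.entries = lo, hi, entries
        self.step = (hi - lo) / (entries - 1)
        self.lut = None  # empty until a setup phase has run

    def setup(self, func):
        """Setup phase: load the LUT of the requested function."""
        self.lut = [func(self.lo + i * self.step) for i in range(self.entries)]

    def compute(self, x):
        """Calculation phase: return the stored value nearest to x."""
        if self.lut is None:
            raise RuntimeError("setup phase has not been run")
        i = min(max(round((x - self.lo) / self.step), 0), self.entries - 1)
        return self.lut[i]

unit = ProgrammableComputingUnit(0.0, 1.0, 1001)

unit.setup(math.exp)      # first usage cycle: the unit evaluates exp(x)
y_exp = unit.compute(0.5)

unit.setup(math.log1p)    # second usage cycle: same unit, now log(1 + x)
y_log = unit.compute(0.5)

print(y_exp, y_log)
```

The same object serves two different functions in successive usage cycles, which mirrors the reconfigurable computation described above.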

When applied to pattern processing, the separate three-dimensional processor is a three-dimensional pattern processor. Its basic function is pattern processing. More importantly, most of the patterns involved in pattern processing are stored locally, so the pattern processing circuit is very close to the pattern storage circuit, and the time required to read a new pattern is very short. In addition, a three-dimensional pattern processor contains thousands of storage units. During pattern processing, the input data are sent to all storage units, which perform pattern processing simultaneously, enabling massively parallel computation. The three-dimensional pattern processor can therefore process large pattern libraries at high speed and with high efficiency.
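The broadcast-and-match behaviour described above can be sketched in Python. The sketch is an assumption-laden illustration, not the circuit of the specification: the pattern library, the substring test used as "pattern processing", the three-way split, and the names `make_storage_unit` and `pattern_process` are all hypothetical, and a thread pool merely stands in for hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def make_storage_unit(local_patterns):
    """Return a unit that searches the input for its locally stored patterns."""
    def process(input_data):
        return [p for p in local_patterns if p in input_data]
    return process

# Hypothetical pattern library, split across three storage units.
library = ["virus_a", "virus_b", "worm_c", "trojan_d", "spy_e", "bot_f"]
units = [make_storage_unit(library[i::3]) for i in range(3)]

def pattern_process(input_data):
    """Send the same input to every unit and merge the matches."""
    with ThreadPoolExecutor(max_workers=len(units)) as pool:
        results = list(pool.map(lambda unit: unit(input_data), units))
    return sorted(match for result in results for match in result)

hits = pattern_process("header worm_c payload spy_e trailer")
print(hits)
```

Each unit only ever reads its own slice of the library, so adding units increases the library size that can be searched without lengthening the per-unit search.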

Accordingly, the present invention proposes a separate three-dimensional pattern processor (100), characterized by comprising: an input (110) transmitting at least part of a first pattern; a plurality of storage units (100aa-100mn) electrically coupled to the input (110), each storage unit (100ij) comprising at least one three-dimensional memory (3D-M) array (170) and a pattern processing circuit (180PPC), the 3D-M array (170) storing at least part of a second pattern, the pattern processing circuit (180PPC) performing pattern processing on the first and second patterns; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least part of the pattern processing circuit (180PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

[E] Neural network.

When applied to a neural network, the separate three-dimensional processor is a three-dimensional neural network processor. Its basic function is neural computation. More importantly, most of the synaptic weights required for neural computation are stored locally, so the neural computing circuit is very close to the memory circuit storing the synaptic weights, and the time required to read the synaptic weights is very short. In addition, a three-dimensional neural network processor contains thousands of storage units. During neural computation, the input data are sent to all storage units, which perform neural computation simultaneously, enabling massively parallel computation. The three-dimensional neural network processor can therefore perform neural computation at high speed and with high efficiency.
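A minimal Python sketch of this arrangement follows. It is illustrative only: the class name `NeuralStorageUnit`, the weight values, the two-unit split, and the ReLU activation are assumptions that do not appear in the specification. Each unit holds the synaptic weights of its own neurons and applies them to the broadcast input.

```python
def relu(v):
    """Assumed activation function (not specified in the text)."""
    return max(0.0, v)

class NeuralStorageUnit:
    """Toy storage unit: local synaptic weights plus a neural computing step."""

    def __init__(self, weights):
        self.weights = weights  # one row of weights per neuron, stored locally

    def compute(self, inputs):
        """Weighted sum of the inputs for each local neuron, then activation."""
        return [relu(sum(w * x for w, x in zip(row, inputs)))
                for row in self.weights]

# Hypothetical layer of four neurons, split across two storage units.
layer_weights = [
    [0.5, -1.0, 0.25],
    [1.0, 1.0, 1.0],
    [-0.5, 0.0, 0.5],
    [0.0, 2.0, -1.0],
]
units = [NeuralStorageUnit(layer_weights[0:2]),
         NeuralStorageUnit(layer_weights[2:4])]

inputs = [2.0, 1.0, 4.0]  # the same input is sent to every unit
outputs = [y for unit in units for y in unit.compute(inputs)]
print(outputs)
```

In the sketch the units are evaluated one after another; in the processor described above they would operate simultaneously, each reading only its own local weights.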

Accordingly, the present invention proposes a separate three-dimensional neural network processor (100), characterized by comprising: a plurality of storage units (100aa-100mn), each storage unit (100ij) containing at least one three-dimensional memory (3D-M) array (170) and a neural computing circuit (180NPC), the 3D-M array (170) storing at least part of the synaptic weights, the neural computing circuit (180NPC) performing neural computation based on the synaptic weights; a first chip (100a) and a second chip (100b), the first chip (100a) containing the 3D-M array (170), the second chip (100b) containing at least a portion of the neural computing circuit (180NPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100a) does not contain the off-chip peripheral circuit component (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

Drawings

FIG. 1AA is a perspective view of a conventional processor (prior art); FIG. 1AB lists all transcendental functions supported by the Intel Itanium (IA-64) processor (prior art); FIG. 1BA is a circuit diagram of an amplifying circuit; FIG. 1BB lists the amount of computation required for different transistor models to calculate one current point (prior art); FIG. 1C is a schematic diagram of a neural network; FIG. 1DA is a circuit block diagram of a neural network processor (prior art); FIG. 1DB is a chip layout diagram of a neural network accelerator (prior art); FIG. 1EA is a circuit layout diagram of an integrated three-dimensional processor (prior art); FIG. 1EB is a cross-sectional view of the three-dimensional processor.

FIGS. 2A-2C give an overall description of a separate three-dimensional processor: FIG. 2A is its circuit block diagram; FIG. 2B is a circuit block diagram of a storage unit; FIG. 2C is a circuit layout diagram of the two chips in a separate three-dimensional processor.

Fig. 3A-3D are cross-sectional views of four separate three-dimensional processors.

Fig. 4A-4D are cross-sectional views of four first chips.

Fig. 5 is a cross-sectional view of a second chip.

FIG. 6A is a circuit layout of a first chip; FIGS. 6BA-6BB are circuit layouts of two second chips.

Fig. 7A-7C are circuit block diagrams of three types of storage units.

Fig. 8A-8C are circuit layouts of three types of storage units in the first and second chips.

Fig. 9 is a circuit block diagram of a computing unit.

Fig. 10A to 10C are circuit block diagrams of three kinds of Arithmetic Logic Circuits (ALC).

FIG. 11A is a circuit block diagram of a first type of computing unit; FIG. 11B is a circuit diagram of one implementation of the computing unit.

Fig. 12 is a circuit block diagram of a second type of computing unit.

Fig. 13 is a circuit block diagram of a third type of computing unit.

FIG. 14A is a circuit block diagram of a programmable unit; FIG. 14B shows the functional blocks included in the programmable unit.

FIG. 15A is a circuit block diagram of a first programmable computing unit; FIG. 15B is a circuit block diagram of a second programmable computing unit.

Fig. 16 shows two usage cycles of a programmable computing unit.

FIG. 17A discloses a connection library that can be implemented with programmable connections; FIG. 17B discloses a logic operation library that can be implemented by a programmable logic unit.

Fig. 18 is a layout diagram of a first three-dimensional programmable computing array.

FIG. 19 is a diagram of a first three-dimensional programmable computing array implementing a non-arithmetic function.

FIG. 20 is a layout diagram of a second three-dimensional programmable computing array.

Fig. 21A-21B are diagrams illustrating the implementation of two mathematical functions by the second three-dimensional programmable computing array.

Fig. 22 is a circuit block diagram of a separate three-dimensional parallel processor.

Fig. 23 is a circuit block diagram of a storage unit in a three-dimensional pattern processor.

Fig. 24 is a circuit block diagram of a storage unit in a three-dimensional neural network processor.

Fig. 25 is a circuit block diagram of a neural computing circuit.

Fig. 26A-26B are circuit block diagrams of two kinds of calculation circuits.

It is noted that these figures are merely schematic and are not drawn to scale. Some dimensions and structures in the figures may be exaggerated or reduced for clarity and convenience. Across the embodiments, the same numerical prefix indicates the same or a similar structure, and the letter suffix following the numeral refers to a different instance of the same class of structure.

In the present specification, "/" means "and" or "or". "Memory" refers broadly to any semiconductor-based information storage device that may store information permanently or temporarily. A "memory array (e.g., a 3D-M array)" is a collection of all memory cells sharing at least one address line. "(Data) processing" means that information is changed with respect to a user or host, whereas the "(3D-M) peripheral circuitry" does not change the stored information with respect to the user or host. "Circuitry in a substrate" means that the active elements of the circuitry (e.g., transistors, memory cells) are located in the substrate; the interconnect lines connecting the active elements may be located above the substrate. "Circuitry on a substrate" means that the active elements of the circuitry (e.g., transistors, memory cells) and their interconnect lines are located above the substrate. "Electrically coupled" means any form of coupling in which an electrical signal may be transmitted from one element to another. "Look-up table (LUT) (including 3DM-LUT)" may refer either to the data in the LUT or to the memory circuit (i.e., the LUT memory) used to store the LUT; this specification does not distinguish between them. "Pattern" may refer either to an abstract pattern or to a physical manifestation of the pattern (i.e., data related to the pattern); this specification does not distinguish between them.

Description of the embodiments

FIGS. 2A-2C give an overall description of the separate three-dimensional processor 100. FIG. 2A is its circuit block diagram. The separate three-dimensional processor 100 can not only process data but also store data. More importantly, a significant portion of the data it processes is stored locally, in close proximity to the logic that processes it. The separate three-dimensional processor 100 contains an array of m × n storage units 100aa-100mn. Each storage unit, taking the storage unit 100ij as an example, has an input 110 and an output 120. In general, a three-dimensional processor 100 may contain thousands of storage units 100aa-100mn, which support massively parallel computing.

FIG. 2B is a circuit block diagram of a storage unit 100ij. The storage unit 100ij includes a memory circuit 170 and a logic circuit 180, which are electrically coupled to each other via a plurality of inter-chip connections 160 (see FIGS. 3A-3D). The memory circuit 170 contains at least one 3D-M array. The 3D-M array stores data, and the logic circuit 180 processes a portion of that data. Since the 3D-M array 170 is not located in the same chip as the logic circuit 180 (see FIG. 2C), the 3D-M array 170 is drawn with a dotted line.

FIG. 2C shows a specific implementation of the separate three-dimensional processor 100, which includes at least one first chip (also referred to as a memory chip) 100a and at least one second chip (also referred to as a logic chip) 100b. The first chip 100a contains three-dimensional circuitry, in this embodiment a 3D-M array 170. The second chip 100b contains two-dimensional circuitry, in this embodiment the logic circuit 180 and a peripheral circuit component 190 of the 3D-M array 170. The inter-chip connections 160 provide electrical coupling between the first chip 100a and the second chip 100b. Since the peripheral circuit component 190 is in a different chip from the 3D-M array 170, it is referred to as an off-chip peripheral circuit component. Note that some logic circuits may be located in the first chip 100a (for example, below the 3D-M array 170); that is, some logic circuits may be integrated. For simplicity, in this specification, the logic circuit refers to the logic circuit 180 located in the second chip 100b unless otherwise stated. The separate three-dimensional processor 100 may be generalized to any memory (including random access memory RAM, read-only memory ROM, non-volatile memory NVM, etc.): as long as the peripheral circuitry of the memory array 170 has a different back-end structure from the memory array 170, at least a portion of the peripheral circuitry 190 of the memory array 170 may be placed in another chip. Specifically, the memory array 170 is located in the first chip 100a, and at least a portion of the peripheral circuitry 190 of the memory array 170 is located in the second chip 100b.

The circuit partitioning strategy employed by the separate three-dimensional processor 100 is to have the second chip 100b contain as many off-chip peripheral circuit components 190 as possible. The peripheral circuit component 190 is an integral part of a memory chip; a memory chip lacking it (e.g., the first chip 100a) cannot independently perform the basic functions of a memory (e.g., its performance does not meet the industry standards for the same type of memory chip). Typical peripheral circuit components 190 include address decoders, sense amplifiers, write circuits, read voltage generation circuits, write voltage generation circuits, data buffers, or portions thereof.

FIGS. 3A-3D are cross-sectional views of four separate three-dimensional processors 100, which illustrate various implementations of the inter-chip connections 160. In the embodiment of FIG. 3A, the first chip 100a and the second chip 100b are stacked on each other, i.e., in a direction perpendicular to the chip surface. The front sides (i.e., the surfaces containing the circuitry) of the first chip 100a and the second chip 100b both face upward (+z direction), and the inter-chip connections 160 between them are implemented by bonding wires 160w.

In the embodiment of FIG. 3B, the first chip 100a and the second chip 100b are stacked face-to-face. Specifically, the front side of the first chip 100a faces upward (+z direction), while the second chip 100b is flipped so that its front side faces downward (-z direction). The inter-chip connections 160 between them are implemented by micro-bumps 160x.

The embodiment of FIG. 3C contains two memory chips 100a1, 100a2 and one logic chip 100b. To avoid confusion, the first chips are referred to as memory chips 100a1, 100a2 in this figure, and the second chip is referred to as logic chip 100b. The memory chips 100a1, 100a2 each contain a plurality of 3D-M arrays; they are stacked on each other and electrically coupled through through-silicon vias (TSVs) 160y. The stacked memory chips 100a1, 100a2 are electrically coupled to the logic chip 100b by micro-bumps 160x. The TSVs 160y and the micro-bumps 160x are both inter-chip connections 160. In this embodiment, the logic circuit 180 in the logic chip 100b processes data stored in the two memory chips 100a1, 100a2.

In the embodiment of FIG. 3D, a first insulating medium 168a is formed on the front surface of the first chip 100a, and a plurality of first via holes 160za are then formed in the first insulating medium 168a. Likewise, a second insulating medium 168b is formed on the front surface of the second chip 100b, and a plurality of second via holes 160zb are then formed in the second insulating medium 168b. After the second chip 100b is flipped over, the first and second via holes 160za, 160zb are aligned and the first and second chips 100a, 100b are bonded. Accordingly, the first and second chips 100a, 100b implement the inter-chip connections 160 through the first and second via holes 160za, 160zb, which are in electrical contact. Since the via holes 160za, 160zb are formed by a standard chip manufacturing process, they can be small in size and large in number. Accordingly, inter-chip connections 160 with a large bandwidth can be formed between the first chip 100a and the second chip 100b. In this embodiment, the via holes 160za, 160zb are collectively referred to as vertical interconnect accesses (VIAs).

In the above embodiments, the distance between the memory circuit 170 and the logic circuit 180 is short (relative to the conventional von Neumann architecture). Furthermore, for the embodiments of FIGS. 3B-3D, and particularly those of FIGS. 3C-3D, the number of inter-chip connections 160 (TSVs or VIAs) is large, which enables an ultra-wide bandwidth between the memory circuit 170 and the logic circuit 180. Combined with massively parallel processing (FIG. 2A), the separate three-dimensional processor 100 offers excellent performance.

FIGS. 4A-4D show four types of first chips 100a, in which the 3D-M arrays 170 are monolithically integrated, i.e., their memory cells are stacked on each other in the vertical direction without any semiconductor substrate between them.

According to its physical structure, 3D-M is divided into three-dimensional horizontal memory (3D-M_H) and three-dimensional vertical memory (3D-M_V). The memory cells of a 3D-M_H form a plurality of horizontal memory layers vertically stacked on the substrate circuit; a typical example of 3D-M_H is 3D-XPoint. The memory cells of a 3D-M_V form a plurality of vertical memory strings arranged side by side on the substrate circuit; a typical example of 3D-M_V is 3D-NAND. 3D-M_H is faster, whereas 3D-M_V has a greater storage density.

Conventionally, 3D-M is classified into 3D-RAM (three-dimensional random access memory) and 3D-ROM (three-dimensional read-only memory). 3D-RAM provides random access; examples include 3D-SRAM, 3D-DRAM, 3D-RRAM, 3D-MRAM, 3D-FeRAM, and others. 3D-ROM, on the other hand, is a non-volatile memory (NVM) which is electrically programmable; examples include 3D-MPROM, 3D-OTP, 3D-MTP, 3D-EPROM, 3D-EEPROM, 3D-flash, 3D-NOR, 3D-NAND, 3D-XPoint, and others.

According to its programming method, 3D-M is classified into three-dimensional writable memory (3D-W) and three-dimensional printed memory (3D-P). The information stored in a 3D-W is entered by electrical programming. Depending on the number of times it can be programmed, 3D-W is further divided into three-dimensional one-time-programmable memory (3D-OTP) and three-dimensional multiple-time-programmable memory (3D-MTP, which includes reprogrammable memory). Common 3D-MTPs are 3D-XPoint and 3D-NAND. Other 3D-MTPs include memristors, resistive random-access memory (RRAM), phase-change memory (PCM), programmable metallization cells (PMC), conductive-bridging random-access memory (CBRAM), and the like.

The information stored in a 3D-P is entered by a printing method during factory production. The information is permanently fixed and cannot be changed after the chip leaves the factory. The printing method may be photolithography, nano-imprint, e-beam lithography, DUV lithography, laser programming, or the like. A common 3D-P is the three-dimensional mask-programmed read-only memory (3D-MPROM), whose data are entered by photolithography (mask programming). Since it has no electrical programming requirement, a 3D-P memory cell can be biased at a higher voltage during reading. Therefore, 3D-P reads faster than 3D-W.

The first chip 100a in FIGS. 4A-4B contains a substrate circuit 0Ka and a 3D-M_H array 170 stacked on the substrate circuit 0Ka. The substrate circuit 0Ka includes transistors 0t and substrate interconnect lines 0ia. The transistors 0t are formed in the first semiconductor substrate 0a and are electrically coupled to one another through the substrate interconnect lines 0ia. The substrate interconnect lines 0ia comprise two interconnect line layers 0m1a-0m2a, each interconnect line layer (e.g., 0m1a) containing a plurality of interconnect lines (e.g., 0m) in the same physical plane. The 3D-M_H array 170 includes four address line layers 0a1a-0a4a, each address line layer (e.g., 0a1a) containing a plurality of address lines (e.g., 1a) in the same physical plane. These address line layers 0a1a-0a4a form two memory layers 16A, 16B, where the memory layer 16A is stacked over the substrate circuit 0Ka and the memory layer 16B is stacked over the memory layer 16A. A memory cell (e.g., 7aa) is located at the intersection of two address lines (e.g., 1a, 2a). The memory layers 16A, 16B are coupled to the substrate circuit 0Ka by on-chip connections 150 through the contact via holes 1av, 3av, respectively. The contact via holes 1av, 3av each contain a plurality of via holes, each of which penetrates at least one insulating layer and is electrically coupled with the via holes above and below it. In FIGS. 4A-4B, the substrate circuit 0Ka contains at least part of the peripheral circuitry of the 3D-M_H array 170. In some embodiments, the substrate circuit 0Ka may also contain a portion of the logic circuitry.

The 3D-M_H array 170 in FIG. 4A is a 3D-W. The memory cell 7aa includes a programming film 5 and a diode film 6. The programming film 5 may be an antifuse film (programmable once, for 3D-OTP) or a resistive RAM (RRAM) film (reprogrammable, for 3D-MTP). The diode film 6 has the following general characteristics: at the read voltage, its resistance is small; when the applied voltage is smaller than the read voltage or has the opposite polarity, its resistance is larger. The diode film may be a P-i-N diode, a metal-oxide (e.g., TiO₂) diode, or the like.

The 3D-M_H array 170 in FIG. 4B is a 3D-P. It contains at least two types of memory cells: a high-resistance memory cell 7ab and a low-resistance memory cell 7ac. The low-resistance memory cell 7ac contains a diode film 6, which is similar to the diode film 6 in the 3D-W. The high-resistance memory cell 7ab further contains a high-resistance film 9, which is an insulating film (e.g., silicon oxide or silicon nitride). In the production flow, the high-resistance film 9 at the location of the low-resistance memory cell 7ac is physically removed.

The first chip 100a in FIGS. 4C-4D contains a substrate circuit 0Ka and a 3D-M_V array 170 stacked on the substrate circuit 0Ka. The substrate circuit 0Ka is similar to the substrate circuit in FIGS. 4A-4B. In certain embodiments, there is no substrate circuit 0Ka under the 3D-M_V array 170. The 3D-M_V array 170 includes a plurality of vertically stacked horizontal address line layers 0a1a-0a8a, each horizontal address line layer (e.g., 0a5a) containing a plurality of horizontal address lines (e.g., 15) in the same physical plane. The 3D-M_V array 170 also contains a set of vertical address lines that are perpendicular to the substrate 0a (i.e., in the +z direction). The storage density of 3D-M_V is the highest among all semiconductor memories. For simplicity, the on-chip connections 150 that electrically couple the 3D-M_V array 170 with the substrate circuit 0Ka are not shown in FIGS. 4C-4D; they are well known to those skilled in the art.

The 3D-M_V array 170 in FIG. 4C employs transistors or transistor-like devices as memory cells. It contains a plurality of vertical memory strings 16X, 16Y arranged side by side. Each memory string (e.g., 16Y) contains a plurality of vertically stacked memory cells (e.g., 18ay-18hy). Each memory cell (e.g., 18fy) contains a vertical transistor with a gate (horizontal address line) 15, a memory film 17, and a vertical channel (vertical address line) 19. The memory film 17 may be a composite film of silicon oxide-silicon nitride-silicon oxide, silicon oxide-polysilicon-silicon oxide, or the like. This 3D-M_V array 170 is a 3D-NAND, the fabrication process of which is well known to those skilled in the art.

The 3D-M_V array 170 in FIG. 4D employs diodes or diode-like devices as memory cells. It contains a plurality of vertical memory strings 16U-16W arranged side by side. Each memory string (e.g., 16U) contains a plurality of vertically stacked memory cells 18au-18hu. The 3D-M_V array 170 contains a plurality of vertically stacked horizontal address lines (word lines) 15. After a plurality of memory wells 11 penetrating these horizontal address lines 15 are etched, the sidewalls of the memory wells 11 are covered with a programming film 13, and the memory wells 11 are filled with a conductive material to form vertical address lines (bit lines) 19. The conductive material may be a metallic material or a doped semiconductor material. The memory cells 18au-18hu are formed at the intersections of the word lines 15 and the bit lines 19. The programming film 13 may be one-time programmable (OTP, e.g., an antifuse film) or multiple-time programmable (MTP, e.g., an RRAM film).

To reduce mutual interference between memory cells, a diode is preferably formed between the word line 15 and the bit line 19. In one embodiment, the programming film 13 itself has certain diode electrical characteristics. In another embodiment, a diode film (not shown here) is deposited separately on the sidewalls of the memory well 11. In a third embodiment, a built-in diode (e.g., a P-N diode or a Schottky diode) forms naturally between the word line 15 and the bit line 19. For details on built-in diodes, reference is made to Chinese patent application 201811117502.7 (filing date: September 20, 2018).

The second chip 100b in FIG. 5 is a conventional two-dimensional circuit 0Kb that implements the logic circuit 180 and the off-chip peripheral circuit assembly 190. The second chip 100b contains transistors 0t and interconnect lines 0ib. The transistors 0t are formed in the second semiconductor substrate 0b and are electrically coupled to one another through the interconnect lines 0ib. In this embodiment, the interconnect lines 0ib comprise four interconnect line layers 0m1b-0m4b, each interconnect line layer (e.g., 0m1b) containing a plurality of interconnect lines (e.g., 0m) in the same physical plane.

Comparing the first chip 100a (FIGS. 4A-4D) with the second chip 100b (FIG. 5), the number of back-end wiring layers in the first chip 100a is greater than that in the second chip 100b. For example, the first chip 100a in FIGS. 4A-4B has six back-end wiring layers (0m1a-0m2a, 0a1a-0a4a), and the first chip 100a in FIGS. 4C-4D has ten back-end wiring layers (0m1a-0m2a, 0a1a-0a8a), both more than the four back-end wiring layers (0m1b-0m4b) of the second chip 100b in FIG. 5. Even if only the address line layers in the first chip 100a are counted, their number is equal to or greater than the number of interconnect line layers in the second chip 100b. For 3D-M_V in particular, the number of address line layers in the first chip 100a (approximately equal to the number of memory cells in a memory string, about a hundred layers and still increasing) is far greater than the number of interconnect line layers (e.g., four) in the second chip 100b, by at least a factor of two.

On the other hand, since the second chip 100b is independently designed and manufactured, the number of interconnect line layers in its interconnect lines 0ib is greater than the number of interconnect line layers in the substrate interconnect lines 0ia of the first chip 100a. For example, the second chip 100b in FIG. 5 has four interconnect line layers (0m1b-0m4b), more than the two interconnect line layers (0m1a-0m2a) of the first chip 100a in FIGS. 4A-4D. Therefore, the circuit layout of the second chip 100b is easier than that of the first chip 100a (or the integrated three-dimensional processor 80). Moreover, the second chip 100b may employ a high-speed interconnect material (e.g., copper), whereas the first chip 100a (or the integrated three-dimensional processor 80) can employ only a high-temperature interconnect material (e.g., tungsten), which is generally slower.

FIGS. 6A-6BB are circuit layouts of the first and second chips 100a, 100b in two separated three-dimensional processors 100, shown in greater detail than in FIG. 2C. This embodiment corresponds to the embodiment of FIGS. 7A and 8A. Those skilled in the art can readily generalize it to the embodiments of FIGS. 7B and 8B, and of FIGS. 7C and 8C.

FIG. 6A shows a first chip 100a that contains a plurality of 3D-M arrays 170aa-170mn. FIG. 6BA shows a second chip 100b that contains a plurality of logic circuits 180aa-180mn and a global peripheral circuit assembly 190G. The global peripheral circuit assembly 190G is located outside the projection of all 3D-M arrays 170aa-170mn on the second chip 100b. The three-dimensional processor 100 of FIGS. 6A and 6BA employs a "full alignment" technique, i.e., the circuit layouts on the two chips 100a, 100b meet the following requirement: when the two chips 100a, 100b are stacked, each 3D-M array (e.g., 170ij) has a logic circuit (e.g., 180ij) vertically aligned with and electrically coupled to it (see FIGS. 8A-8C). Since a logic circuit (e.g., 180ij) may have multiple 3D-M arrays (e.g., 170ijA-170ijD, 170ijW-170ijZ) vertically aligned with and electrically coupled to it (see FIGS. 8B-8C), the period of the logic circuits (e.g., 180ij) on the second chip 100b is an integer multiple of the period of the 3D-M arrays (e.g., 170ij) on the first chip 100a.

FIG. 6BB shows another second chip 100b, which also contains a plurality of local peripheral circuit assemblies 190aa-190mn. It is apparent that the three-dimensional processor 100 of fig. 6A and 6BB may also employ a "full alignment" technique. Wherein each local peripheral circuit assembly 190aa-190mn is vertically aligned with and electrically coupled to a 3D-M array (e.g., 170 ij). In addition to the local peripheral circuit assemblies 190aa-190mn, the embodiment in FIG. 6BB may also contain global peripheral circuit assembly 190G. In this description, all of the local peripheral circuit assemblies 190aa-190mn and the global peripheral circuit assembly 190G are collectively referred to as off-chip peripheral circuit assemblies 190.

In the embodiment of FIGS. 6A-6BB, a local peripheral circuit assembly (e.g., 190ij) typically includes part of the address decoders, part of the sense amplifier circuits, part of the write circuits, etc., which perform at least part of the read and write operations on the memory cells in each 3D-M array (e.g., 170ij). The global peripheral circuit assembly 190G typically contains a read voltage generation circuit, a write voltage generation circuit, a data buffer, or the like, which generates the read/write voltages, etc. Of course, this partitioning between local and global peripheral circuit assemblies is not absolute. For example, a local peripheral circuit assembly may contain at least a portion of the read/write voltage generation circuitry.

Fig. 7A-8C illustrate three types of storage units 100ij. FIGS. 7A-7C are circuit block diagrams (for simplicity, the off-chip peripheral circuit assembly 190ij is not shown in FIGS. 7A-7C); fig. 8A to 8C are circuit layout diagrams thereof. In these embodiments, one logic circuit 180ij serves a different number of 3D-M arrays 170 ij.

The logic circuit 180ij in FIG. 7A serves one 3D-M array 170ij: it processes the data stored in the 3D-M array 170ij. The logic circuit 180ij in FIG. 7B serves four 3D-M arrays 170ijA-170ijD: it processes the data stored in the 3D-M arrays 170ijA-170ijD. The logic circuit 180ij in FIG. 7C serves eight 3D-M arrays 170ijA-170ijD and 170ijW-170ijZ: it processes the data stored in the 3D-M arrays 170ijA-170ijD and 170ijW-170ijZ. As can be seen from FIGS. 8A-8C below, a logic circuit 180ij that serves more 3D-M arrays 170ij generally occupies more chip area and has more functionality. In FIGS. 7A-7C, the 3D-M array 170ij is drawn with dashed lines because it is located on a different chip than the logic circuit 180ij (see FIGS. 2C and 6A-6BB).

Fig. 8A-8C show the circuit layout of the second chip 100b, and the projection of the 3D-M array 170 (located in the first chip 100 a) onto the second chip 100b (shown in dashed lines). The embodiment of fig. 8A corresponds to the embodiment of fig. 7A. In this embodiment, the logic circuit 180ij and the local peripheral circuit assembly 190ij in the storage unit 100ij are located in the second semiconductor substrate 0b of the second chip 100 b. Logic circuitry 180ij and off-chip peripheral circuit assembly 190ij are at least partially covered by 3D-M array 170 ij.

In the present embodiment, the period of the logic circuit 180ij is equal to the period of the 3D-M array 170ij, and the area cannot exceed the projected area of the 3D-M array 170ij on the second chip 100b, so the function is limited. This embodiment is better suited for achieving simpler data processing. Fig. 8B-8C disclose two complex logic circuits 180.

The embodiment of FIG. 8B corresponds to the embodiment of FIG. 7B. In this embodiment, the logic circuit 180ij and the off-chip peripheral circuit assembly 190ij of the storage unit 100ij are located in the second chip 100b and are at least partially covered by the four 3D-M arrays 170ijA-170ijD. Under the four 3D-M arrays 170ijA-170ijD, the logic circuit 180ij can be laid out freely. The logic circuit 180ij in FIG. 8B has a period twice that of the 3D-M array 170ij in FIG. 8A and an area four times as large, so that more complex processing functions can be realized.

The embodiment of FIG. 8C corresponds to the embodiment of FIG. 7C. In this embodiment, the logic circuit 180ij and the off-chip peripheral circuit assembly 190ij in the storage unit 100ij are located in the second chip 100b. The eight 3D-M arrays 170ijA-170ijD, 170ijW-170ijZ are divided into two sets 170ijSA, 170ijSB. Each set (e.g., 170ijSA) includes four 3D-M arrays (e.g., 170ijA-170ijD). Under the four 3D-M arrays 170ijA-170ijD of the first set 170ijSA, the first logic circuit assembly 180ijA can be laid out freely. Similarly, the second logic circuit assembly 180ijB can be laid out freely under the four 3D-M arrays 170ijW-170ijZ of the second set 170ijSB. The first logic circuit assembly 180ijA and the second logic circuit assembly 180ijB constitute the logic circuit 180ij. In this embodiment, gaps (e.g., G) are left between adjacent off-chip peripheral circuit components to form routing channels 182, 184, 186 for electrical coupling between the different logic circuit assemblies 180ijA, 180ijB, or between different logic circuits. The logic circuit 180ij in FIG. 8C has a period (in the x direction) four times that of the 3D-M array 170ij in FIG. 8A and an area eight times as large, so that more complex processing functions can be realized. Note that in the embodiments of FIGS. 8B-8C, each SPU 100ij contains more than one 3D-M array 170ij, and the logic circuit 180ij can perform more powerful functions than in the embodiment that contains only one 3D-M array 170ij (FIG. 8A). Compared with the integrated three-dimensional processor 80 (FIG. 1EA), since the first chip 100a of the separated three-dimensional processor 100 also contains a portion of the peripheral circuit components, the off-chip peripheral circuit components 190 in the second chip 100b are sparser than the peripheral circuit components in the integrated three-dimensional processor 80, which facilitates routing of the logic circuits 180ij in the second chip 100b.

In the separated three-dimensional processor 100, since the first chip 100a and the second chip 100b can be designed and manufactured separately, they can have distinct back-end structures. Since the back-end structure of the second chip 100b can be optimized on its own, its off-chip peripheral circuit assembly 190 and logic circuit 180 have lower cost and better performance than the same type of circuits in the integrated three-dimensional processor 80. A comparison between the separated three-dimensional processor 100 and the integrated three-dimensional processor 80 follows.

First, since the peripheral circuit components 190 and the logic circuits 180 are not included in the first chip 100a, its array efficiency is high. In addition, as a two-dimensional circuit, the second chip 100b has far fewer back-end wiring layers than the integrated three-dimensional processor 80 and can be manufactured using a conventional process. Since the wafer cost is roughly proportional to the number of back-end wiring layers, the wafer cost of the second chip 100b is much lower than that of the integrated three-dimensional processor 80. Thus, the total chip cost of the separated three-dimensional processor 100 (including the first and second chips 100a, 100b) is lower than that of the integrated three-dimensional processor 80 (which contains only one chip). The overall cost of the separated three-dimensional processor 100 is lower even when the additional bonding cost is taken into account.

Second, because they can be optimized individually, the off-chip peripheral circuit components 190 and logic circuits 180 in the separated three-dimensional processor 100 perform better than the same type of circuits in the integrated three-dimensional processor 80. In one embodiment, the number of interconnect line layers in the second chip 100b (e.g., four, eight, or more layers, FIG. 5) is greater than the number of interconnect line layers of the substrate circuit 0K in the integrated three-dimensional processor 80 or the first chip 100a (e.g., two layers, FIG. 1EB). In another embodiment, the second chip 100b uses a high-performance interconnect material (e.g., copper) instead of the high-temperature interconnect material (e.g., tungsten) used by the integrated three-dimensional processor 80 (or the first chip 100a). Thus, the overall performance of the separated three-dimensional processor 100 is better.

Finally, in the integrated three-dimensional processor 80, since the logic circuit 78 is confined to one chip 80 (e.g., to the projected area of the 3D-M array 77 on the substrate 0 in FIG. 1EA), its area and therefore its functionality are limited. In contrast, in the separated three-dimensional processor 100, a logic circuit 180 of larger area can be formed across the two chips 100a, 100b (e.g., a first portion of the logic circuit is located below the 3D-M array 170ij of the first chip 100a in FIG. 6A, similar to the logic circuit 78 located below the 3D-M array 77 in FIG. 1EA; a second portion of the logic circuit is located on the second chip 100b of FIG. 6BA), which gives the three-dimensional processor 100 more processing power. Furthermore, since the second chip is designed and produced separately, it offers greater flexibility in design and production. By combining the same first chip 100a with second chips 100b having different functions, processing functions suited to different application scenarios can be realized. Moreover, these different processing functions can be implemented within a shorter design cycle and on a smaller design budget. Thus, the separated three-dimensional processor 100 is more powerful and more flexible.

The application of the separated three-dimensional processor in various fields is described below.

[A] Mathematical calculation.

When applied to mathematical calculation, the separated three-dimensional processor is used to implement non-arithmetic functions. It uses memory-based computation (MBC), i.e., the calculation is implemented mainly by means of a high-capacity LUT (i.e., a 3DM-LUT) stored in a 3D-M array. In this application, the storage unit 100ij in FIG. 2A is also referred to as a calculation unit, in which the 3D-M array 170 stores at least a portion of a look-up table (LUT) of a non-arithmetic function, and the logic circuit 180 is an arithmetic logic circuit (ALC).

Fig. 9 shows a calculation unit 100ij. It includes an input 110, an output 120, a 3D-M array 170, and an ALC 180ALC (i.e., the logic circuit 180 is an ALC 180 ALC). The 3D-M array 170 stores at least a portion of the LUT of a non-arithmetic function (or model), and the ALC 180ALC performs arithmetic operations on the data in the LUT. The 3D-M array 170 and the ALC 180ALC are electrically coupled through the inter-chip connections 160. As previously described, the non-arithmetic function contains more operations than ALC 180ALC supports (i.e., addition, subtraction, and multiplication). Since it cannot be expressed as a combination of basic arithmetic operations, the non-arithmetic function cannot be implemented by the ALC 180ALC alone, and it needs to be implemented by the ALC 180ALC in combination with the LUT 170.

FIGS. 10A-10C are circuit block diagrams of three kinds of ALC 180ALC. The ALC 180ALC in FIG. 10A is an adder 180A; the ALC 180ALC in FIG. 10B is a multiplier 180M; the ALC 180ALC in FIG. 10C is a multiply-accumulator (MAC) that contains an adder 180A and a multiplier 180M. The ALC 180ALC may implement integer operations, fixed-point operations, or floating-point operations.

FIGS. 11A-11B show a first calculation unit 100ij, which implements a non-arithmetic function Y=f(X) by a function look-up table method. FIG. 11A is its circuit block diagram. The ALC 180ALC includes a preprocessing circuit 180R, a 3DM-LUT 170P, and a post-processing circuit 180T. The preprocessing circuit 180R converts the input variable (X) 110 into an address (A) of the 3DM-LUT 170P. After the data (D) at the address (A) of the 3DM-LUT 170P is read out, the post-processing circuit 180T converts it into the function value (Y) 120. To improve the calculation accuracy, the residue (R) of the input variable (X) is sent to the post-processing circuit 180T.

FIG. 11B shows a calculation unit 100ij capable of realizing a single-precision non-arithmetic function Y=f(X). The input variable X 110 has 32 bits (x₃₁…x₀). The preprocessing circuit 180R extracts its upper 16 bits (x₃₁…x₁₆) as the 16-bit address A of the 3DM-LUT 170P, and extracts its lower 16 bits (x₁₅…x₀) as a 16-bit residue R that is sent to the post-processing circuit 180T. The 3DM-LUT 170P contains two 3DM-LUTs 170Q, 170R. Each 3DM-LUT 170Q, 170R has a capacity of 2 Mb (16-bit input, 32-bit output). The 3DM-LUT 170Q stores the function value D1=f(A), and the 3DM-LUT 170R stores the first derivative value D2=f'(A). The post-processing circuit 180T contains a multiplier 180M and an adder 180A. The output value (Y) 120 has 32 bits and is calculated by polynomial interpolation. In this embodiment, the polynomial interpolation is a first-order Taylor series: Y(X)=D1+D2×R=f(A)+f'(A)×R. Higher calculation accuracy can be achieved by using higher-order polynomial interpolation (e.g., a higher-order Taylor series).
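The address/residue split and the first-order Taylor step described above can be sketched in software as follows. This is only an illustrative sketch under stated assumptions, not the circuit of FIG. 11B: the input is assumed to be an unsigned 32-bit fixed-point fraction in [0, 1), the function is assumed to be sin, and the tables hold Python floats rather than 32-bit words. The names `lut_q`, `lut_r`, `preprocess`, `postprocess` and `evaluate` are hypothetical.

```python
import math

ADDR_BITS = 16                      # upper 16 bits of X form the address A
RES_BITS = 16                       # lower 16 bits of X form the residue R
N = 1 << ADDR_BITS                  # number of entries in each table

# Assumptions: X encodes a fraction in [0, 1) and f is sin; the
# specification fixes neither choice.
f, df = math.sin, math.cos

lut_q = [f(a / N) for a in range(N)]    # stands in for 170Q: D1 = f(A)
lut_r = [df(a / N) for a in range(N)]   # stands in for 170R: D2 = f'(A)


def preprocess(x):
    """Split the 32-bit input into (A, R), as the circuit 180R does."""
    return x >> RES_BITS, x & ((1 << RES_BITS) - 1)


def postprocess(d1, d2, r):
    """First-order Taylor step Y = D1 + D2 * R (multiplier, then adder)."""
    # The residue is rescaled to the same [0, 1) units as the address.
    return d1 + d2 * (r / (1 << (ADDR_BITS + RES_BITS)))


def evaluate(x):
    a, r = preprocess(x)
    return postprocess(lut_q[a], lut_r[a], r)
```

Under these assumptions the truncation error is bounded by the second-order Taylor remainder, roughly half the square of the address step (2⁻¹⁶).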

Combining the LUT with polynomial interpolation allows higher calculation accuracy with a smaller LUT when implementing non-arithmetic functions. If the single-precision function described above (32-bit input, 32-bit output) were implemented with a LUT alone (without polynomial interpolation), the capacity of the LUT would need to reach 2³²×32 = 128 Gb, which is not practical. Polynomial interpolation greatly reduces the capacity of the LUT. In the above embodiment, with a first-order Taylor series the LUT requires only 4 Mb (2 Mb for the function value LUT and 2 Mb for the first derivative value LUT). This is far less than with a LUT alone (4 Mb vs. 128 Gb).
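The capacity figures quoted above can be checked with a few lines of arithmetic. The helper name `lut_bits` is hypothetical, and Mb and Gb are taken here as 2²⁰ and 2³⁰ bits, which is the reading under which the quoted numbers come out exactly.

```python
MB = 1 << 20   # bits per Mb (binary reading, an assumption)
GB = 1 << 30   # bits per Gb (binary reading, an assumption)


def lut_bits(input_bits, output_bits):
    """Capacity in bits of a table with one output word per input code."""
    return (1 << input_bits) * output_bits


full_lut = lut_bits(32, 32)          # LUT alone: 32-bit in, 32-bit out
value_lut = lut_bits(16, 32)         # function value table
deriv_lut = lut_bits(16, 32)         # first derivative table
taylor_total = value_lut + deriv_lut

print(full_lut // GB, "Gb")          # capacity without interpolation
print(taylor_total // MB, "Mb")      # capacity with first-order Taylor
print(full_lut // taylor_total)      # reduction factor
```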

In addition to elementary functions (including algebraic functions and transcendental functions), the three-dimensional processor 100 can implement various higher functions, such as special functions. Special functions play an important role in mathematical analysis, functional analysis, physics research, and engineering applications. Many special functions are solutions of differential equations or integrals of elementary functions. Examples of special functions include the gamma function, beta function, Bessel functions, Legendre functions, elliptic functions, Lamé functions, Mathieu functions, the Riemann zeta function, Fresnel integrals, and the like. The advent of the three-dimensional processor 100 will simplify the computation of special functions and promote their use in scientific computing.

FIG. 12 shows a second calculation unit 100ij. The calculation unit 100ij implements a composite function Y=EXP[K×LOG(X)]=X^K by a function look-up table method. The calculation unit 100ij contains two 3DM-LUTs 170S, 170T and a multiplier 180M. The 3DM-LUT 170S stores the function values of LOG(), and the 3DM-LUT 170T stores the function values of EXP(). The input variable X is used as the address 110 of the 3DM-LUT 170S. The output LOG(X) 160S of the 3DM-LUT 170S is multiplied by the power parameter K at the multiplier 180M, and the product 160T is fed as an address into the 3DM-LUT 170T. The output 120 of the 3DM-LUT 170T is Y=X^K.
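The chain of two table look-ups with a multiplication in between can be sketched as follows. This is a minimal sketch under assumptions that the specification does not state: 16-bit addresses, a LOG table covering X in [1, 2), an EXP table covering [0, 4), and plain truncation when the product is turned into an address. All names and ranges (`lut_log`, `lut_exp`, `power`, `X_MIN`, `X_MAX`, `P_MAX`) are hypothetical.

```python
import math

N = 1 << 16                  # entries per table (assumed 16-bit address)
X_MIN, X_MAX = 1.0, 2.0      # assumed input range of the LOG table
P_MAX = 4.0                  # assumed range [0, P_MAX) of the EXP table

# Stand-ins for the LOG table 170S and the EXP table 170T.
lut_log = [math.log(X_MIN + (X_MAX - X_MIN) * i / N) for i in range(N)]
lut_exp = [math.exp(P_MAX * i / N) for i in range(N)]


def power(x_addr, k):
    """Return an approximation of X**k, where X is encoded by x_addr."""
    log_x = lut_log[x_addr]              # first look-up (output 160S)
    product = k * log_x                  # multiplier 180M (product 160T)
    addr = int(product * N / P_MAX)      # truncate product to an address
    if not 0 <= addr < N:
        raise ValueError("K*LOG(X) falls outside the EXP table range")
    return lut_exp[addr]                 # second look-up (output 120)


# Usage: address N // 2 encodes X = 1.5 under the assumed range.
y = power(N // 2, 3.0)
```

Because the product is only truncated to the nearest table entry, the accuracy is limited by the EXP table step; the interpolation of FIG. 11B could be applied to each table to tighten it.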

The functions calculated in the embodiments of FIGS. 11A-11B and FIG. 12 are combined functions. A combined function is a combination of at least two non-arithmetic functions; e.g., a single-precision function is a combination of a function value and a derivative value, and a composite function is a combination of two functions. Accordingly, the invention also proposes a three-dimensional processor (100) for calculating a combined function, characterized by comprising: a first three-dimensional memory (3D-M) array (170Q or 170S), a second 3D-M array (170R or 170T), and an arithmetic logic circuit (ALC) (180ALC), the first 3D-M array (170Q or 170S) storing at least a portion of a first look-up table (LUT) of a first non-arithmetic function, the second 3D-M array (170R or 170T) storing at least a portion of a second LUT of a second non-arithmetic function, the ALC (180ALC) performing an arithmetic operation on at least a portion of the data in the first or second LUT; a first chip (100a) and a second chip (100b), the first chip (100a) containing the first and second 3D-M arrays (170Q, 170R or 170S, 170T), the second chip (100b) containing at least a portion of the ALC (180ALC) and an off-chip peripheral circuit assembly (190) of the first or second 3D-M array (170Q, 170R, 170S or 170T); the first chip (100a) is free of the off-chip peripheral circuit assembly (190); the first chip (100a) and the second chip (100b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); the combined function is a combination of the first and second non-arithmetic functions; the first and second non-arithmetic functions contain more operations than are supported by the ALC (180ALC).

[B] Computer simulation.

When applied to computer simulation, a separate three-dimensional processor is used to implement the non-arithmetic model, which still employs MBC. MBC offers great advantages for computer simulation. In this application, the storage unit 100ij in fig. 2A is also referred to as a calculation unit. Wherein the 3D-M array 170 stores at least a portion of the LUT of the non-arithmetic model, and the logic circuit 180 is an ALC 180ALC.

FIG. 13 shows a third calculation unit 100ij. The calculation unit 100ij implements a computer simulation of the amplifying circuit 0Y (FIG. 1BA) by a model look-up table method. The calculation unit 100ij includes a 3DM-LUT 170U, an adder 180A, and a multiplier 180M. The 3DM-LUT 170U stores data related to the performance of the transistor 0T (e.g., its input-output characteristics). The input voltage V_IN is used as the address 110 of the 3DM-LUT 170U, and the read data 160U is the drain current I_D. The multiplier 180M multiplies I_D by the negative value (-R) of the resistance of the resistor 0R, and the result (-R×I_D) is added to the supply voltage V_DD at the adder 180A to obtain the output voltage value V_OUT 120.
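The data path of FIG. 13 (look up the current, multiply by the negated resistance, add the supply voltage) can be sketched as below. The table contents are an assumption: in place of measured data, a hypothetical square-law expression fills the table purely as a placeholder, and the sweep range, step count, threshold voltage, gain factor, supply voltage and resistance are all illustrative values that do not come from the specification.

```python
V_MAX = 2.0              # assumed input sweep range [0, V_MAX], in volts
STEPS = 200              # assumed number of table intervals
V_T, K_N = 0.5, 2e-3     # hypothetical threshold voltage and gain factor


def placeholder_id(v_gs):
    """Hypothetical square-law current standing in for measured data."""
    return 0.5 * K_N * (v_gs - V_T) ** 2 if v_gs > V_T else 0.0


# Stand-in for the 3DM-LUT 170U: current as a function of V_IN.
lut_id = [placeholder_id(V_MAX * i / STEPS) for i in range(STEPS + 1)]


def simulate(v_in, v_dd=1.8, r=1000.0):
    """Return V_OUT = V_DD + (-R) * I_D(V_IN) using the table."""
    addr = round(v_in / V_MAX * STEPS)   # V_IN used as the address 110
    i_d = lut_id[addr]                   # read data 160U
    product = -r * i_d                   # multiplier 180M
    return v_dd + product                # adder 180A, output 120


v_out = simulate(1.0)    # one operating point of the amplifier
```

Replacing `placeholder_id` with measured or smoothed data changes only how the table is filled; the look-up, multiply and add steps stay the same.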

The 3DM-LUT 170U may store a variety of mathematical models. In one embodiment, the model data stored in the 3DM-LUT 170U is raw measurement data, such as measured input-output characteristics. An example is the drain current vs. gate-source voltage (I_D-V_GS) characteristic curve. In another embodiment, the model data stored in the 3DM-LUT 170U is smoothed measurement data. Raw measurement data may be smoothed by purely mathematical methods (e.g., by a best-fit model) or with the aid of a physical model (e.g., the BSIM4 V3.0 transistor model). In a third embodiment, the model data stored in the 3DM-LUT 170U includes not only the measured values of the transistor but also their derivatives. For example, the model data stored in the 3DM-LUT 170U includes not only the current values (I_D-V_GS) but also the transconductance values (G_m-V_GS). Similar to FIG. 11B, polynomial interpolation (using the derivatives of the measured values) can improve model accuracy with a LUT of reasonable size.

The model look-up table method brings many advantages. Since the two software decompositions (from mathematical model to mathematical functions, and from mathematical functions to built-in functions) are not required, it saves a large amount of calculation time and energy. The model look-up table method also requires fewer LUTs than the function look-up table method. Since a transistor model (e.g., BSIM4 V3.0) requires hundreds of model parameters, a large number of LUTs would be needed to calculate the intermediate functions of the transistor model if the function look-up table method were used. If the function look-up table method is bypassed (i.e., the transistor model and its intermediate functions are skipped), the transistor performance can be described by three parameters (the gate-source voltage V_GS, the drain-source voltage V_DS, and the body-source voltage V_BS). Thus, a smaller LUT suffices to describe the mathematical model of the transistor.

[C] Programmable computing array.

When applied to a programmable computing array, the separate three-dimensional processor is a three-dimensional programmable computing array. It can customize not only logical functions and arithmetic functions, but also non-arithmetic functions. In a three-dimensional programmable computing array, the storage units 100ij in FIG. 2A are also referred to as programmable units.

Fig. 14A-14B illustrate a programmable cell 100ij in a three-dimensional programmable computing array that includes a 3D-M array 170 and logic circuitry 180 (fig. 14A). The 3D-M array 170 stores at least a portion of the LUT of non-arithmetic functions, and the logic circuit 180 includes an Arithmetic Logic Circuit (ALC), a programmable logic unit (CLE), and/or a programmable Connection (CIT). Accordingly, the functional modules that the programmable unit 100ij can implement (fig. 14B) include the programmable computing unit 400 (see fig. 15A-15B), the programmable logic unit 200 (see fig. 17B), and the programmable connection 300 (see fig. 17A). The programmable computing unit 400 implements a non-arithmetic function based on the LUT; the programmable logic unit 200 implements the selected logic function from a logic operation library; the programmable connection 300 implements the selected connection from a connection library.

The input terminal IN of the programmable computing unit 400 receives input data 410, the output terminal OUT provides output data 420, and the set terminal CFG receives a set signal 430. When the set signal 430 is "write", the LUT of a mathematical function is written into the programmable computing unit 400. When the set signal 430 is "read", the value of the mathematical function is read from the programmable computing unit 400. FIGS. 15A-15B illustrate specific implementations of two programmable computing units 400. In FIG. 15A, the programmable computing unit 400 is a 3D-M array 170 that stores the function values of a non-arithmetic function. In FIG. 15B, the programmable computing unit 400 is a combination of a 3D-M array 170 and an ALC 180. As in FIG. 11B, the 3D-M array 170 stores the function values and derivative values of the non-arithmetic function, and the ALC 180 performs the polynomial calculation.

FIG. 16 shows two usage cycles of a programmable computing unit 400. Because its 3D-M array 170 is reprogrammable, the programmable computing array is capable of reconfigurable computing. The first usage cycle 620 is divided into two phases: a setup phase 610 and a calculation phase 630. In the setup phase 610, the LUT of a first function is loaded into the 3D-M array 170 according to the user's needs; in the calculation phase 630, the corresponding LUT entries are read from the 3D-M array 170 to obtain the function values of the first function. Similarly, the second usage cycle 660 is divided into a setup phase 650 and a calculation phase 670. This embodiment is particularly suitable for SIMD (single instruction, multiple data) processing. Once the LUT has been loaded into the 3D-M array 170 in the setup phase 610, a large amount of data can be fed into the programmable computing unit 400 and processed at high speed. SIMD has many applications, such as applying the same operation or a vector operation to a plurality of pixels in image processing, and the large-scale parallel computation used in scientific computing.
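The two-phase usage cycle can be illustrated with a small software model: a setup step that writes a table, and a calculation step that streams many inputs through the same table. This is a sketch only; the class `ProgrammableComputingUnit`, its methods `configure` and `compute`, the 10-bit address width and the truncating address mapping are all hypothetical choices made for illustration.

```python
import math


class ProgrammableComputingUnit:
    """Toy model of a unit 400: a rewritable table plus a read path."""

    def __init__(self, addr_bits=10):
        self.size = 1 << addr_bits
        self.lo, self.hi = 0.0, 1.0
        self.lut = [0.0] * self.size

    def configure(self, func, lo, hi):
        """Setup phase: write the table of `func` sampled on [lo, hi)."""
        self.lo, self.hi = lo, hi
        step = (hi - lo) / self.size
        self.lut = [func(lo + i * step) for i in range(self.size)]

    def compute(self, xs):
        """Calculation phase: read the same table for every input."""
        scale = self.size / (self.hi - self.lo)
        out = []
        for x in xs:
            addr = int((x - self.lo) * scale)
            addr = min(self.size - 1, max(0, addr))  # clamp to the table
            out.append(self.lut[addr])
        return out


unit = ProgrammableComputingUnit()
unit.configure(math.sin, 0.0, 1.0)        # first cycle: setup
first = unit.compute([0.0, 0.25, 0.5])    # first cycle: calculation
unit.configure(math.exp, 0.0, 1.0)        # second cycle: setup
second = unit.compute([0.0, 0.25, 0.5])   # second cycle: calculation
```

The cost of `configure` is paid once per cycle, while `compute` is a plain read per datum, which is why the scheme suits workloads that apply one function to many inputs.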

FIGS. 17A-17B disclose a connection library and a logic operation library, respectively. FIG. 17A discloses the connection library that can be implemented by the programmable connection 300, which includes the following connection modes: a) interconnect lines 302/304 are connected and interconnect lines 306/308 are connected, but 302/304 are disconnected from 306/308; b) interconnect lines 302/304/306/308 are all connected; c) interconnect lines 306/308 are connected, while interconnect lines 302, 304 are connected neither to each other nor to interconnect lines 306/308; d) interconnect lines 302/304 are connected, while interconnect lines 306, 308 are connected neither to each other nor to interconnect lines 302/304; e) none of the interconnect lines 302, 304, 306, 308 are connected. In this specification, the symbol "/" between two interconnect lines indicates that the two interconnect lines are connected, and the symbol "," between two interconnect lines indicates that the two interconnect lines are disconnected.
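One way to picture the connection library is as a set of partitions of the four interconnect lines, where lines in the same group are connected. The sketch below is illustrative only and rests on an assumed reading of modes a) to e): four lines 302, 304, 306 and 308, with mode c) leaving 302 and 304 isolated, mode d) leaving 306 and 308 isolated, and mode e) leaving all four isolated. The names `CONNECTION_LIBRARY` and `connected` are hypothetical.

```python
# Each mode maps to groups of mutually connected interconnect lines.
# The grouping reflects an assumed reading of modes a) to e).
CONNECTION_LIBRARY = {
    "a": [{302, 304}, {306, 308}],
    "b": [{302, 304, 306, 308}],
    "c": [{306, 308}, {302}, {304}],
    "d": [{302, 304}, {306}, {308}],
    "e": [{302}, {304}, {306}, {308}],
}


def connected(mode, p, q):
    """Return True if lines p and q are connected in the given mode."""
    return any(p in group and q in group
               for group in CONNECTION_LIBRARY[mode])


# Usage: selecting a mode corresponds to programming the connection.
print(connected("a", 302, 304), connected("a", 302, 306))
```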

FIG. 17B discloses the logic operation library that can be implemented by the programmable logic unit 200. The inputs A and B are the input data 210, 220, and the output C is the output data 230. The programmable logic unit 200 may implement the following logic operations: C=A, logical negation of A, shift of A, AND(A, B), OR(A, B), NAND(A, B), NOR(A, B), XOR(A, B), arithmetic addition A+B, arithmetic subtraction A-B, and the like. The programmable logic unit 200 may also contain circuit elements such as registers and flip-flops to implement pipelining and the like. Details of the programmable connection 300 and the programmable logic unit 200 can be found in U.S. Pat. No. 4,870,302.
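Selecting one operation from the logic operation library can be modelled as choosing an entry from a table of functions. The sketch below is an illustration under assumptions: an 8-bit word width, a one-position left shift for the shift operation, and wrap-around on addition and subtraction. The operation names and the class `ProgrammableLogicUnit` are hypothetical.

```python
WIDTH = 8                       # assumed word width
MASK = (1 << WIDTH) - 1

# Library of operations on inputs A and B producing output C.
LOGIC_LIBRARY = {
    "PASS": lambda a, b: a,
    "NOT":  lambda a, b: ~a & MASK,
    "SHL":  lambda a, b: (a << 1) & MASK,   # assumed one-bit left shift
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: ~(a & b) & MASK,
    "NOR":  lambda a, b: ~(a | b) & MASK,
    "XOR":  lambda a, b: a ^ b,
    "ADD":  lambda a, b: (a + b) & MASK,    # wraps around at WIDTH bits
    "SUB":  lambda a, b: (a - b) & MASK,    # wraps around at WIDTH bits
}


class ProgrammableLogicUnit:
    """Toy model of a unit 200 with one selectable operation."""

    def __init__(self, op="PASS"):
        self.configure(op)

    def configure(self, op):
        if op not in LOGIC_LIBRARY:
            raise KeyError(f"operation {op!r} is not in the library")
        self.op = op

    def __call__(self, a, b=0):
        return LOGIC_LIBRARY[self.op](a & MASK, b & MASK)


cle = ProgrammableLogicUnit("NAND")
c = cle(0b1100, 0b1010)         # C = NAND(A, B)
```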

FIG. 18 shows a first three-dimensional programmable computing array 100. It contains regularly arranged programmable modules 100A, 100B, etc. Each programmable module (e.g., 100A) contains a plurality of programmable computing units (CCE, e.g., 400AA-400AD) and programmable logic units (CLE, e.g., 200AA-200AD). Programmable channels 320, 340 are provided between the programmable computing units (e.g., 400AA-400AD) and the programmable logic units (e.g., 200AA-200AD); programmable channels 310, 330, 350 are also provided between the programmable module 100A and the programmable module 100B. The programmable channels 310-350 contain a plurality of programmable connections (CIT) 300. It will be apparent to those skilled in the art that a sea-of-gates design or the like may be used instead of the programmable channels.

Complex functions are often encountered in computation. In this specification, a complex function refers to a non-arithmetic function of multiple independent variables; a basis function refers to a non-arithmetic function of a single independent variable. In general, a complex function is a combination of basis functions. The three-dimensional programmable computing array 100 enables the customization of complex functions, which the prior art did not envision. To customize a complex function, the complex function is first decomposed into a plurality of basis functions. Each basis function is then implemented by loading its LUT into a respective programmable computing unit. Finally, the complex function is customized by programming the programmable logic units and the programmable connections.

FIG. 19 shows a specific implementation of the first three-dimensional programmable computing array 100, which customizes the following complex function: e=a·SIN(b)+c·COS(d). The programmable connections 300 in the programmable channels 310-350 take the forms shown in FIG. 17A: a programmable connection with a dot at the cross-point indicates that the crossing lines are connected, a programmable connection without a dot at the cross-point indicates that the crossing lines are disconnected, and a broken programmable connection indicates that the broken interconnect line is divided into two interconnect segments that are disconnected from each other. In this embodiment, the programmable computing unit 400AA is set to LOG(), and its calculation result LOG(a) is sent to a first input of the programmable logic unit 200AA. The programmable computing unit 400AB is set to LOG[SIN()], and its calculation result LOG[SIN(b)] is sent to a second input of the programmable logic unit 200AA. The programmable logic unit 200AA is set to arithmetic addition "+", and its calculation result LOG(a)+LOG[SIN(b)] is sent to the programmable computing unit 400BA. The programmable computing unit 400BA is set to EXP(), and its calculation result EXP{LOG(a)+LOG[SIN(b)]}=a·SIN(b) is sent to a first input of the programmable logic unit 200BA. Similarly, with appropriate settings of the programmable computing units 400AC, 400AD, the programmable logic unit 200AC and the programmable computing unit 400BC, the result c·COS(d) is sent to a second input of the programmable logic unit 200BA. The programmable logic unit 200BA is set to arithmetic addition "+"; a·SIN(b) and c·COS(d) are added there, and the final result is sent to the output e. Obviously, the three-dimensional programmable computing array 100 can implement other complex functions by changing these settings.
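The decomposition used in FIG. 19 relies on the identity EXP{LOG(a)+LOG[SIN(b)]} equal to the product of a and SIN(b), so that only single-variable look-ups and additions are needed. The sketch below traces that data flow, with library calls standing in for the loaded LUTs. It is a sketch under assumptions: the settings of the second branch (units 400AC, 400AD, 200AC, 400BC) are inferred by symmetry with the first branch, and the inputs are assumed to satisfy a > 0, c > 0, SIN(b) > 0 and COS(d) > 0 so that every logarithm is defined. The function name `complex_function` is hypothetical.

```python
import math


def complex_function(a, b, c, d):
    """Trace the FIG. 19 data flow for e = a*SIN(b) + c*COS(d)."""
    # First branch, as described in the text.
    log_a = math.log(a)                     # 400AA set to LOG()
    log_sin_b = math.log(math.sin(b))       # 400AB set to LOG[SIN()]
    sum_1 = log_a + log_sin_b               # 200AA set to "+"
    term_1 = math.exp(sum_1)                # 400BA set to EXP()

    # Second branch; these settings are assumed by symmetry.
    log_c = math.log(c)                     # 400AC assumed LOG()
    log_cos_d = math.log(math.cos(d))       # 400AD assumed LOG[COS()]
    sum_2 = log_c + log_cos_d               # 200AC assumed "+"
    term_2 = math.exp(sum_2)                # 400BC assumed EXP()

    return term_1 + term_2                  # 200BA set to "+"


e = complex_function(2.0, 0.5, 3.0, 0.25)
```

Every step is either a one-variable look-up or an addition, which is what lets a multi-variable function be assembled from basis-function LUTs and the logic operation library.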

Accordingly, the invention also proposes a three-dimensional programmable computing array (100) for customizing at least one complex function, characterized by comprising: a plurality of programmable logic units (200) and/or programmable connections (300); a first programmable computing unit (400 AA) and a second programmable computing unit (400 AC), the first programmable computing unit (400 AA) containing a first 3D-M array storing at least a portion of a first look-up table (LUT) of a first non-arithmetic function, the second programmable computing unit (400 AC) containing a second 3D-M array storing at least a portion of a second LUT of a second non-arithmetic function; and a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the first and second 3D-M arrays, the second chip (100 b) containing at least part of the programmable logic units (200) and/or programmable connections (300) and an off-chip peripheral circuit component (190) of the first or second 3D-M array; the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160); customization of the complex function is enabled by programming the programmable logic units (200) and/or programmable connections (300) and the programmable computing units (400); the complex function is a combination of the first and second non-arithmetic functions; the first and second non-arithmetic functions include more operations than are supported by the programmable logic units (200).

FIG. 20 shows a second three-dimensional programmable computing array 100. In addition to the programmable computing units 400A, 400B, the programmable logic unit 200A, and the programmable channels 360-380, the programmable computing array 100 also includes a multiplier 500. The multiplier 500 enables the three-dimensional programmable computing array 100 to implement a wider range of mathematical functions, including more computationally intensive ones.

FIGS. 21A-21B illustrate two specific implementations of the second three-dimensional programmable computing array 100. The embodiment in FIG. 21A implements the mathematical function h = EXP(f)/g. The programmable computing unit 400A is set to implement the basis function EXP(f), and the programmable computing unit 400B is set to implement the basis function INV(g). After the programmable channel 370 is set, the outputs of the programmable computing units 400A, 400B are sent to the multiplier 500. After the programmable channel 380 is set, the final output is h = EXP(f)/g. The embodiment in FIG. 21B implements another mathematical function, h = SIN(f) + COS(g). The programmable computing unit 400A is set to implement the basis function SIN(f), and the programmable computing unit 400B is set to implement the basis function COS(g). After the programmable channel 370 is set, the outputs of the programmable computing units 400A, 400B are sent to the programmable logic unit 200A, which implements arithmetic addition "+". After the programmable channel 380 is set, the final output is h = SIN(f) + COS(g).
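The two configurations of FIGS. 21A-21B can be modeled as the same two LUT-based units routed either to the multiplier 500 or to the programmable logic unit 200A. The sketch below is illustrative only; `make_lut_unit`, `configure`, the table size, the input ranges, and the linear interpolation are assumptions introduced here and do not describe the actual circuits.

```python
import math


def make_lut_unit(fn, lo, hi, size=4096):
    """Tabulate a basis function on [lo, hi] and return an interpolating
    lookup; stands in for a programmable computing unit (assumed model)."""
    step = (hi - lo) / (size - 1)
    table = [fn(lo + i * step) for i in range(size)]

    def unit(x):
        pos = (x - lo) / step
        i = min(max(int(pos), 0), size - 2)
        return table[i] + (pos - i) * (table[i + 1] - table[i])

    return unit


def multiplier_500(x, y):
    return x * y


def logic_unit_200A_add(x, y):
    return x + y


def configure(unit_400A, unit_400B, combiner):
    """Model setting the programmable channels: both unit outputs are routed
    to one combiner, whose result becomes the output h."""
    def array(f, g):
        return combiner(unit_400A(f), unit_400B(g))
    return array


# FIG. 21A: h = EXP(f)/g, computed as EXP(f) * INV(g).
fig_21a = configure(make_lut_unit(math.exp, -2.0, 2.0),
                    make_lut_unit(lambda g: 1.0 / g, 0.5, 8.0),
                    multiplier_500)

# FIG. 21B: h = SIN(f) + COS(g).
fig_21b = configure(make_lut_unit(math.sin, 0.0, 2.0 * math.pi),
                    make_lut_unit(math.cos, 0.0, 2.0 * math.pi),
                    logic_unit_200A_add)

h_a = fig_21a(1.0, 4.0)
h_b = fig_21b(0.3, 1.2)
```

Division is obtained here without a divider: the INV(g) table supplies the reciprocal and the multiplier does the rest.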

[D] Pattern processing.

When applied to pattern processing, the separate three-dimensional processor is a three-dimensional pattern processor. It can perform pattern processing; more importantly, most of the patterns involved in pattern processing are stored locally.

FIG. 22 shows a separate three-dimensional parallel processor 100. It comprises an m×n array of storage units 100aa-100mn, each of which is electrically coupled to a common input 110 and a common output 120. The input data is supplied to the storage units 100aa-100mn simultaneously via the common input 110, and pattern processing is performed simultaneously in the storage units 100aa-100mn. Since the three-dimensional parallel processor 100 contains thousands of storage units 100aa-100mn, it supports large-scale parallel computation. The three-dimensional parallel processor 100 can be applied to fields such as pattern processing and neural network processing.

When used for pattern processing, the separate three-dimensional parallel processor 100 is a separate three-dimensional pattern processor. FIG. 23 shows a storage unit 100ij in the three-dimensional pattern processor 100, which includes a pattern storage circuit 170 and a pattern processing circuit 180PPC (i.e., the logic circuit 180 is the pattern processing circuit 180PPC), which are electrically coupled via inter-chip connections 160 (FIGS. 3A-3D). The pattern storage circuit 170 includes a 3D-M array 170 that stores at least a portion of the patterns; the pattern processing circuit 180PPC processes the patterns.

The separate three-dimensional pattern processor 100 may take two forms: a processor-like form and a memory-like form. The processor-like three-dimensional pattern processor 100 is a three-dimensional processor with its own search pattern library, which uses its locally stored search patterns to perform pattern processing on the target pattern from the input 110. Specifically, a search pattern library (e.g., a virus library, a keyword library, an acoustic/language model library, an image model library, etc.) is stored in the 3D-M array 170; the input data 110 includes target patterns (e.g., network packets, computer files, big data, voice data, image data, etc.); the pattern processing circuit 180PPC performs pattern processing on the target pattern according to the search pattern. Because the large number of storage units 100ij (thousands, FIG. 22) supports large-scale parallel processing, and because the inter-chip connections 160 (FIGS. 3B-3D) provide a large bandwidth, the three-dimensional processor 100 retrieves quickly and efficiently.

Accordingly, the present invention proposes a three-dimensional processor (100) with a search pattern library, characterized by comprising: an input (110) conveying at least a portion of a target pattern; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a pattern processing circuit (180 PPC), the 3D-M array (170) storing at least a portion of a search pattern, the pattern processing circuit (180 PPC) performing pattern processing on the target pattern according to the search pattern; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the pattern processing circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

The memory-like three-dimensional pattern processor 100 is a three-dimensional memory with its own pattern processing function. Its primary function is to store a target pattern library, and its secondary function is to search its stored target patterns using the search pattern from the input 110. Specifically, the target pattern library (e.g., the computer files on an entire hard disk, a big data database, a voice archive, an image archive) is stored and distributed in the 3D-M arrays 170; the input data 110 is a search pattern (e.g., a virus identifier, keywords, acoustic/language models, image models, etc.); the pattern processing circuit 180PPC performs pattern processing on the target pattern according to the search pattern. Because the large number of storage units 100ij (thousands, FIG. 22) supports large-scale parallel processing, and because the inter-chip connections 160 (FIGS. 3B-3D) provide a large bandwidth, the three-dimensional memory 100 performs pattern processing quickly and efficiently.

Like flash memory, multiple three-dimensional memories 100 with the pattern processing function can be packaged into a memory card (e.g., an SD card or a TF card) or a solid state disk (SSD) for storing a target pattern library with massive data. More importantly, they also have a pattern processing (e.g., retrieval) function. Since each storage unit 100ij has its own pattern processing circuit 180PPC, that circuit only needs to search the target patterns stored in the local 3D-M array 170 (in the same storage unit 100ij). Therefore, regardless of the capacity of the memory card or solid state disk, the retrieval time is close to the time required to search a single 3D-M array 170. In other words, the search time of the database is independent of the capacity of the database and is, in most cases, on the order of seconds.

In contrast, in a conventional von Neumann architecture, the processor (CPU) and the memory (hard disk) are physically separated from each other, and database retrieval first requires reading the database from the hard disk. Because the bandwidth of the system bus between the CPU and the hard disk is limited, the retrieval time of the database is limited by the readout time of the database and is therefore proportional to the size of the database. Generally, the retrieval time ranges from minutes to hours, or even longer, depending on the size of the database. The three-dimensional memory 100 with its own pattern processing function therefore has a significant advantage in database retrieval.
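The scaling argument of the two preceding paragraphs can be written as a minimal timing model. The model is an illustration only: the function names, parameter names, and all numeric values below are hypothetical and are not taken from this specification. It assumes that a conventional search is bounded by moving the whole database over one bus, while a search in the three-dimensional memory 100 is bounded by one storage unit 100ij scanning its own 3D-M array 170.

```python
def bus_limited_search_time(database_bytes, bus_bytes_per_second):
    """Conventional architecture: the whole database must cross the bus,
    so the time grows in proportion to the database size."""
    return database_bytes / bus_bytes_per_second


def local_search_time(array_bytes, local_bytes_per_second):
    """Memory-like pattern processor: each storage unit scans only its own
    3D-M array and all units work at once, so the total time is that of one
    unit regardless of how many units hold the database."""
    return array_bytes / local_bytes_per_second


# Hypothetical figures, chosen only to show the scaling behaviour.
ARRAY_BYTES = 2**20        # capacity of one 3D-M array (assumed)
LOCAL_RATE = 2**20         # bytes/s one unit scans locally (assumed)
BUS_RATE = 500 * 2**20     # bytes/s over the system bus (assumed)

small_db = 1_000 * ARRAY_BYTES
large_db = 1_000_000 * ARRAY_BYTES

conventional_small = bus_limited_search_time(small_db, BUS_RATE)
conventional_large = bus_limited_search_time(large_db, BUS_RATE)
in_memory_any_size = local_search_time(ARRAY_BYTES, LOCAL_RATE)
```

In this model a database that is 1000 times larger takes 1000 times longer over the bus, while the in-memory search time does not depend on the number of storage units at all.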

When the three-dimensional memory 100 with the pattern processing function performs pattern processing on a large database (i.e., a target pattern library), the pattern processing circuit 180PPC only needs to perform part of the pattern processing function. For example, the pattern processing circuit 180PPC only needs to perform simple preliminary pattern processing (e.g., string matching, code matching) on the database. The data (i.e., target pattern) remaining after the preliminary pattern processing screening is then sent to a more powerful external processor (e.g., CPU, GPU) via output 120 to complete the final pattern processing. Since most of the data in the database is filtered out by simple pattern processing, the data output from the three-dimensional memory 100 is only a small fraction of the entire database, which can greatly reduce the bandwidth pressure of the output 120.
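The two-stage scheme described above can be sketched as follows. The record contents, the function names, and the choice of a regular expression as the "final" processing step are hypothetical and serve only as illustration; the loop over storage units stands in for work that the hardware would perform in all units at the same time.

```python
import re


def preliminary_screen(local_records, keyword):
    """Simple string matching done inside one storage unit on the records
    held in its local 3D-M array."""
    return [record for record in local_records if keyword in record]


def search_database(storage_units, keyword, final_check):
    """Screen locally in every storage unit, then pass only the survivors
    over the output to a more powerful external processor."""
    survivors = []
    for local_records in storage_units:   # parallel in hardware
        survivors.extend(preliminary_screen(local_records, keyword))
    matches = [record for record in survivors if final_check(record)]
    return matches, len(survivors)


# Hypothetical database distributed over three storage units.
storage_units = [
    [b"alpha error 42", b"beta ok"],
    [b"gamma error 7", b"delta ok"],
    [b"epsilon ok"],
]

# Hypothetical final processing on the external processor.
two_digit_error = re.compile(rb"error \d{2,}")
matches, records_sent = search_database(
    storage_units,
    b"error",
    lambda record: two_digit_error.search(record) is not None,
)
```

Only the records that pass the local screen are transferred, which is what relieves the bandwidth pressure on the output 120.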

Accordingly, the present invention proposes a three-dimensional memory (100) with a pattern processing function, characterized by comprising: an input (110) conveying at least part of a search pattern; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a pattern processing circuit (180 PPC), the 3D-M array (170) storing at least part of a target pattern, the pattern processing circuit (180 PPC) performing pattern processing on the target pattern according to the search pattern; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the pattern processing circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

The applications of the separate three-dimensional pattern processor 100 are introduced below. Its application fields include: A) information security, B) big data analysis, C) speech recognition, D) image recognition, etc. Examples of such applications include: a) an information security processor; b) a virus-checking memory; c) a data analysis processor; d) a retrievable memory; e) a speech recognition processor; f) a retrievable speech memory; g) an image recognition processor; h) a retrievable image memory.

A) Information security.

Information security includes network security and computer security. The main means of enhancing network security is to check network data packets for viruses; the main means of enhancing computer security is to check computer files (including computer software) for viruses. Broadly, viruses (also known as malware, etc.) include network viruses, computer viruses, software that violates network specifications, files that violate file specifications, etc. During virus checking, the processor compares the network data packet or computer file with all the virus identifiers (also called virus patterns or virus signatures) in a virus library one by one. When a virus identifier is found, the portion containing it is quarantined or deleted.

Currently, virus libraries are growing ever larger, reaching hundreds of MB in size, and the computer data to be checked for viruses is larger still, at the GB level, the TB level, or even more. On the other hand, the number of cores in a conventional processor is limited (e.g., at most tens of cores in a CPU and at most hundreds in a GPU), and each core can generally check for only one virus at a time, so the parallelism of virus checking is low. Furthermore, in the von Neumann architecture the processor and memory are physically separated from each other, so reading each new virus identifier takes a long time. Therefore, conventional processors and their architectures are slow and inefficient at information security tasks.

To enhance information security, the present invention proposes several separate three-dimensional pattern processors 100. They can take a processor-like form or a memory-like form: in the processor-like form, the separate three-dimensional pattern processor 100 is an information security processor, i.e., a processor for enhancing information security; in the memory-like form, the separate three-dimensional pattern processor 100 is a virus-checking memory, i.e., a memory with a virus checking function.

a) An information security processor.

In order to ensure information security, the present invention proposes an information security processor 100. It searches a network data packet or computer file for the virus identifiers in the virus library; if a virus identifier is matched, the network data packet or computer file contains the corresponding virus. The information security processor 100 may be implemented as a stand-alone processor in a network or a computer, or may be integrated into a network processor, or into a processor (e.g., a CPU) or a memory (e.g., a hard disk) of a computer.

In the information security processor 100, the 3D-M arrays 170 in different storage units 100ij store different virus identifiers. In other words, the virus library is stored and distributed in the respective storage units 100ij of the processor 100. When a network data packet or computer file arrives at the input 110, at least a portion of its data is sent to all of the storage units 100ij. In each storage unit 100ij, the pattern processing circuit 180PPC searches this portion of data for the virus identifiers stored in the local 3D-M array 170. If a virus identifier is matched, the network data packet or computer file contains the corresponding virus.

The above virus checking process is performed simultaneously in all of the storage units 100ij. Since the information security processor 100 contains a large number (thousands) of storage units 100ij, it supports large-scale parallel virus checking. Furthermore, because of the large number of inter-chip connections 160 and the short distance between the pattern processing circuit 180PPC and the 3D-M array 170 (relative to conventional von Neumann architectures), the pattern processing circuit 180PPC can easily read new virus identifiers from the 3D-M array 170. Thus, the information security processor 100 checks for viruses quickly and efficiently. In this embodiment, the 3D-M array 170 storing the virus library may be 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is a code matching circuit.
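The processor-like arrangement described above, in which the virus library is distributed over the storage units and the incoming data is broadcast to all of them, can be sketched as follows. The byte strings used as virus identifiers, the round-robin distribution, and all function names are hypothetical; the loop over units stands in for matching that the hardware would perform in all storage units simultaneously.

```python
def distribute(virus_identifiers, num_units):
    """Spread the virus library over the storage units (round-robin is an
    assumption of this sketch; any partitioning would do)."""
    units = [[] for _ in range(num_units)]
    for index, identifier in enumerate(virus_identifiers):
        units[index % num_units].append(identifier)
    return units


def check_in_unit(local_identifiers, data):
    """Code matching inside one storage unit: search the broadcast data for
    each identifier held in the local 3D-M array."""
    return [identifier for identifier in local_identifiers if identifier in data]


def check_data(units, data):
    """Broadcast the same data to every unit and collect all matches."""
    hits = []
    for local_identifiers in units:   # parallel in hardware
        hits.extend(check_in_unit(local_identifiers, data))
    return hits


# Hypothetical virus identifiers and packet contents.
library = [b"\xde\xad\xbe\xef", b"EVIL", b"\x90\x90\x90", b"X5O!"]
units = distribute(library, num_units=3)
packet = b"header EVIL payload \x90\x90\x90 end"
found = check_data(units, packet)
```

Each unit only ever compares the data against its own share of the library, so adding units adds matching capacity without lengthening the check.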

Accordingly, the present invention proposes a separate information security processor (100), characterized by comprising: an input (110) for transmitting at least a portion of the data in a network data packet or computer file; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a code matching circuit (180 PPC), the 3D-M array (170) storing at least part of a virus identifier, the code matching circuit (180 PPC) searching the data for the virus identifier; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the code matching circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

b) A virus-checking memory.

When a new virus is discovered, all the data stored on the computer hard disk (e.g., a mechanical hard disk or a solid state disk) needs to be checked for it. Such a full-disk virus check is very difficult for the traditional von Neumann architecture. Because a massive amount of data is stored on the computer hard disk, merely reading all of that data from the hard disk takes a long time, let alone checking it for viruses. In the traditional von Neumann architecture, the time required for a full-disk virus check is proportional to the hard disk capacity.

In order to shorten the time required for a full-disk virus check, the present invention proposes a virus-checking memory 100. Its primary function is computer storage, and its secondary function is to check the stored data for viruses locally. Like flash memory, a plurality of virus-checking memories 100 can be packaged into a memory card or a solid state disk, which stores massive data and has a virus checking function.

In the virus-checking memory 100, the 3D-M arrays 170 in different storage units 100ij store different data. In other words, a large number of computer files are stored and distributed in the storage units 100ij of the virus-checking memories 100 in the memory card or solid state disk. When a newly discovered virus calls for a full-disk virus check, its virus identifier is sent as input 110 to all storage units 100ij, and the pattern processing circuit 180PPC searches the data stored in the local 3D-M array 170 for the virus identifier.

The above virus checking process is performed simultaneously in all of the storage units 100ij, and the virus checking time required for each storage unit 100ij is similar. Because virus checking is massively parallel, no matter how large the capacity of the memory card or solid state disk is, the virus checking time is close to that of a single storage unit 100ij, generally on the order of seconds. In contrast, a conventional full-disk virus check takes minutes to hours, or even longer. In this embodiment, the 3D-M array 170 storing the massive computer files is preferably a 3D-MTP; the pattern processing circuit 180PPC is a code matching circuit.

Accordingly, the invention proposes a separate virus-checking memory (100), characterized by comprising: an input (110) for transmitting at least part of a virus identifier; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a code matching circuit (180 PPC), the 3D-M array (170) storing at least part of the data in a computer file, the code matching circuit (180 PPC) searching the data for the virus identifier; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the code matching circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

B) Big data analysis.

Big data is a collection of huge amounts of data, mainly unstructured or semi-structured data. An important component of big data analysis is keyword retrieval (including string matching, such as regular expression matching). Currently, keyword libraries are growing ever larger, and big data databases are larger still. With such large keyword libraries and big data databases, conventional processors and their architectures have difficulty retrieving unstructured or semi-structured data quickly and efficiently.

To improve the efficiency of big data analysis, the present invention proposes several separate three-dimensional pattern processors 100. They can take a processor-like form or a memory-like form: in the processor-like form, the separate three-dimensional pattern processor 100 is a data analysis processor, i.e., a processor for big data analysis; in the memory-like form, the separate three-dimensional pattern processor 100 is a retrievable memory, i.e., a memory with a retrieval function.

c) A data analysis processor.

In order to achieve high-speed and efficient retrieval of input data, the present invention proposes a data analysis processor 100, which searches the input data for the keywords in a keyword library. In the data analysis processor 100, the 3D-M arrays 170 in different storage units 100ij store different keywords. In other words, the keyword library is stored and distributed in the respective storage units 100ij of the processor 100. Data from the input 110 is sent to all of the storage units 100ij. In each storage unit 100ij, the pattern processing circuit 180PPC searches the input data for each keyword stored in the local 3D-M array 170.

The above-described search process is performed simultaneously in all the storage units 100 ij. Because it contains a large number (thousands) of storage units 100ij, the processor 100 supports large-scale parallel retrieval. Furthermore, because of the large number of inter-chip connections 160 and the close distance between the pattern processing circuit 180PPC and the 3D-M array 170 (as opposed to the conventional von Neumann architecture), the pattern processing circuit 180PPC can easily read keywords from the local 3D-M array 170. Thus, the processor 100 has a fast retrieval speed and a high retrieval efficiency for unstructured data and semi-structured data.

In this embodiment, the 3D-M array 170 storing the keyword library may be 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is a string matching circuit. The string matching circuit may be implemented by a content addressable memory (CAM) or by a comparator containing exclusive-OR (XOR) gates. Further, keywords may be represented by regular expressions. In that case, the string matching circuit 180PPC is implemented by a finite state automaton (FSA).
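The two matching styles mentioned above can be illustrated in software. The sketch below is a behavioral analogy only, not a description of the circuit: the function names, the sample keyword, and the hand-built automaton for the hypothetical regular expression `ab+c` are all assumptions introduced here.

```python
def xor_match(window: bytes, keyword: bytes) -> bool:
    """XOR comparator: two equal-length words match when every byte-wise
    XOR is zero."""
    if len(window) != len(keyword):
        return False
    difference = 0
    for w, k in zip(window, keyword):
        difference |= w ^ k
    return difference == 0


def find_keyword(data: bytes, keyword: bytes):
    """Slide the comparator over the data and report matching offsets."""
    n = len(keyword)
    return [i for i in range(len(data) - n + 1) if xor_match(data[i:i + n], keyword)]


# Finite state automaton for the hypothetical regular expression "ab+c".
# States: 0 = start, 1 = seen "a", 2 = seen "ab+", 3 = accept.
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "b"): 2, (2, "c"): 3}
ACCEPTING = {3}


def fsa_matches_at(text: str, start: int) -> bool:
    """Run the automaton from one start offset; stop on accept or dead end."""
    state = 0
    for ch in text[start:]:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
        if state in ACCEPTING:
            return True
    return False


def fsa_search(text: str):
    """Report every offset at which a match of the expression begins."""
    return [i for i in range(len(text)) if fsa_matches_at(text, i)]


keyword_offsets = find_keyword(b"xxkeyxxkey", b"key")
regex_offsets = fsa_search("zabbc_ac_abc")
```

A fixed keyword needs only the comparator; a keyword given as a regular expression needs the state table, which is why the text distinguishes the two cases.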

Accordingly, the invention proposes a separate data analysis processor (100), characterized by comprising: an input (110) for transmitting at least part of the data; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a string matching circuit (180 PPC), the 3D-M array (170) storing at least a portion of a keyword, the string matching circuit (180 PPC) searching the part of the data for the keyword; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the string matching circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

d) A retrievable memory.

Big data analysis often requires a search of the entire database, i.e., a full-library search. Since big data databases are very large, from the GB level to the TB level or even higher, a full-library search is very difficult for the traditional von Neumann architecture: merely reading out the database takes a long time, let alone searching it. In the traditional von Neumann architecture, the full-library search time is proportional to the database size.

In order to increase the speed and efficiency of full library search, the present invention proposes a retrievable memory 100. The primary function of the retrievable memory 100 is database storage and the secondary function is to retrieve the database locally. Like flash memory, a plurality of retrievable memories 100 may be packaged as memory cards or solid state disks for storing large data databases and having a retrieval function.

In the retrievable memory 100, the 3D-M array 170 in different storage units 100ij stores different data in a database. In other words, the database is stored and distributed in the storage unit 100ij of each retrievable memory 100 in the memory card or the solid state disk. At the time of retrieval, the keywords are transmitted to the input 110 and sent to all storage units 100 ij. In each storage unit 100ij, the pattern processing circuit 180PPC retrieves the keyword in the data of the local 3D-M array 170.

The above search process is performed simultaneously in all the storage units 100ij, and the search time required for each storage unit 100ij is similar. Because the search is massively parallel, no matter how large the capacity of the memory card or solid state disk is, the search time is close to the search time required by a single storage unit 100ij, generally on the order of seconds. In contrast, a conventional full-library search takes minutes to hours, or even longer. In the retrievable memory 100, the 3D-M array 170 storing the big data database is preferably 3D-MTP; the pattern processing circuit 180PPC is a string matching circuit.

Because 3D-M_V has the highest storage density among all semiconductor memories, it is suitable for storing big data databases. Among all types of 3D-M_V, 3D-OTP_V has the longest data lifetime and is therefore suitable for storing large archives. Archival storage requires fast retrieval capability. A retrievable 3D-OTP_V can provide high-capacity, low-cost archival storage with fast retrieval capability.

Accordingly, the invention proposes a separate retrievable memory (100), characterized by comprising: an input (110) conveying at least a portion of a keyword; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a string matching circuit (180 PPC), the 3D-M array (170) storing at least part of the data, the string matching circuit (180 PPC) searching the part of the data for the keyword; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the string matching circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

C) Speech recognition or retrieval.

One typical application of pattern processing is speech recognition. One approach to speech recognition is to perform pattern recognition on the user's speech based on an acoustic model library and a language model library. The acoustic model library stores a plurality of acoustic models; the language model library stores a large number of language models. During recognition, the pattern processing circuit 180PPC performs pattern recognition on the user's voice data according to the acoustic/language model library and finds the closest acoustic/language model. Because a conventional processor (e.g., CPU, GPU) has a limited number of cores, its pattern recognition parallelism is low, and the acoustic/language model library is stored in external memory, conventional processors and their architectures are slow and inefficient at speech recognition.

e) A speech recognition processor.

In order to improve the efficiency of speech recognition, the present invention proposes a speech recognition processor 100. In the speech recognition processor 100, user-generated voice data is provided as input 110 to each of the storage units 100ij, the 3D-M array 170 stores at least part of the models in an acoustic/language model library, and the pattern processing circuit 180PPC performs speech recognition on the voice data from the input 110 based on the model data stored in the 3D-M array 170. In this embodiment, the 3D-M array 170 storing the model library may be 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is a speech recognition circuit.

Accordingly, the invention proposes a separate speech recognition processor (100), characterized by comprising: an input (110) for transmitting at least part of the voice data; a plurality of storage units (100 aa-100 mn) electrically coupled to the input (110), each storage unit (100 ij) comprising at least one three-dimensional memory (3D-M) array (170) and a speech recognition circuit (180 PPC), the 3D-M array (170) storing at least part of an acoustic/language model, the speech recognition circuit (180 PPC) performing speech recognition on the voice data according to the acoustic/language model; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the speech recognition circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) does not contain the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

f) A retrievable speech memory.

In order to implement voice retrieval in a voice database (e.g., a voice archive), the present invention also proposes a retrievable speech memory 100. In the retrievable speech memory 100, the speech to be looked up is converted into an acoustic/language model, which is sent as input 110 to each storage unit 100ij. The user-generated voice data is stored in the 3D-M arrays 170. In other words, the voice database is stored and distributed in the respective storage units 100ij of the retrievable speech memory 100. The pattern processing circuit 180PPC performs speech recognition on the stored voice data according to the acoustic/language model and retrieves the matching data. In this embodiment, the 3D-M array 170 storing the voice database is preferably a 3D-MTP; the pattern processing circuit 180PPC is a speech recognition circuit.

Accordingly, the invention proposes a separate retrievable voice memory (100), characterized by comprising: an input (110) for transmitting at least part of an acoustic/language model; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory (3D-M) array (170) and a speech recognition circuit (180 PPC), said 3D-M array (170) storing at least part of voice data, said speech recognition circuit (180 PPC) performing speech recognition on said voice data according to said acoustic/language model; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the speech recognition circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) is free of the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).

[D] Image recognition.

Another typical application of pattern processing is image recognition. One approach is to perform pattern recognition on a user image against an image model library, which stores a plurality of image models. During recognition, the pattern processor performs pattern recognition on the user's image data according to the image models in the library and searches for the closest image model. Because a conventional processor (e.g., CPU or GPU) has a limited number of cores, the parallelism of its pattern recognition is low; moreover, the image model library is stored in external memory. As a result, a conventional processor performs image recognition slowly and inefficiently.

g) An image recognition processor.

In order to improve the efficiency of image recognition, the present invention proposes an image recognition processor 100. In the image recognition processor 100, user-generated image data is provided as input 110 to each of the storage units 100ij, the 3D-M array 170 stores at least part of the image models in an image model library, and the pattern processing circuit 180PPC performs image recognition on the image data from the input 110 based on the image models stored in the 3D-M array 170. In this embodiment, the 3D-M array 170 storing the model library may be 3D-P, 3D-OTP or 3D-MTP; the pattern processing circuit 180PPC is an image recognition circuit.

Accordingly, the invention proposes a separate image recognition processor (100), characterized by comprising: an input (110) for transmitting at least part of image data; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory (3D-M) array (170) and an image recognition circuit (180 PPC), said 3D-M array (170) storing at least part of an image model, said image recognition circuit (180 PPC) performing image recognition on said image data according to said image model; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the image recognition circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) is free of the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).
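The data flow of the image recognition processor can be illustrated in software. The following is a minimal sketch, not the disclosed circuit: the class name `StorageUnit`, the `(label, feature vector)` model format and the squared-Euclidean distance are all assumptions made for illustration. Each storage unit searches only its own shard of the model library, and the same input is broadcast to every unit.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class StorageUnit:
    """Stand-in for one storage unit 100ij holding a shard of the model library."""
    # Hypothetical model format: (label, feature vector) pairs.
    models: List[Tuple[str, Sequence[float]]]

    def best_match(self, image: Sequence[float]) -> Tuple[float, str]:
        # Stand-in for the image recognition circuit 180PPC.
        # Assumed metric: squared Euclidean distance to each local model.
        return min(
            (sum((a - b) ** 2 for a, b in zip(image, vec)), label)
            for label, vec in self.models
        )


def recognize(units: List[StorageUnit], image: Sequence[float]) -> str:
    # The same input 110 is given to every unit. In hardware the units
    # would work concurrently; this loop only models the data flow.
    candidates = [unit.best_match(image) for unit in units]
    # Reduce the per-unit winners to the closest model overall.
    return min(candidates)[1]


units = [
    StorageUnit([("cat", [0.0, 0.0]), ("dog", [1.0, 1.0])]),
    StorageUnit([("bird", [5.0, 5.0])]),
]
print(recognize(units, [0.9, 1.2]))
```

In this sketch the only data leaving a unit is its best local candidate, which mirrors the idea that the model library itself never has to cross to an external memory.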

h) A retrievable image memory.

In order to implement image retrieval in an image database (e.g., an image archive), the present invention also proposes a retrievable image memory 100. In the retrievable image memory 100, the image data to be looked up is converted into an image model, which is provided as input 110 to each storage unit 100ij. The user-generated image data is stored in the 3D-M array 170. In other words, the image database is distributed across the storage units 100ij of the retrievable image memory 100. The pattern processing circuit 180PPC performs image recognition on, and retrieval of, the image data based on the image model. In this embodiment, the 3D-M array 170 storing the image database is preferably a 3D-MTP; the pattern processing circuit 180PPC is an image recognition circuit.

Accordingly, the invention proposes a separate retrievable image memory (100), characterized by comprising: an input (110) for transmitting at least part of an image model; a plurality of storage units (100 aa-100 mn) electrically coupled to said input (110), each storage unit (100 ij) comprising at least a three-dimensional memory (3D-M) array (170) and an image recognition circuit (180 PPC), said 3D-M array (170) storing at least part of image data, said image recognition circuit (180 PPC) performing image recognition on said image data according to said image model; a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the 3D-M array (170), the second chip (100 b) containing at least part of the image recognition circuit (180 PPC) and an off-chip peripheral circuit component (190) of the 3D-M array (170); the first chip (100 a) is free of the off-chip peripheral circuit component (190); the first chip (100 a) and the second chip (100 b) are two different chips and are electrically coupled by a plurality of inter-chip connections (160).
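In the retrievable image memory the roles are reversed relative to the recognition processor: the query model arrives at the input while the database stays in the storage units. The sketch below illustrates that reversal under stated assumptions; the function names `unit_search` and `retrieve`, the feature-vector representation and the distance threshold are hypothetical and are not taken from the disclosure.

```python
from typing import List, Sequence, Tuple

Image = Sequence[float]  # assumed representation: a feature vector


def unit_search(shard: List[Image], model: Image, threshold: float) -> List[int]:
    """Return local indices of stored images within `threshold` of the query model."""
    hits = []
    for idx, img in enumerate(shard):
        # Assumed similarity test: squared Euclidean distance.
        dist = sum((a - b) ** 2 for a, b in zip(img, model))
        if dist <= threshold:
            hits.append(idx)
    return hits


def retrieve(shards: List[List[Image]], model: Image,
             threshold: float) -> List[Tuple[int, int]]:
    """Send one query model to every storage unit and collect (unit, index) hits."""
    results = []
    for unit_id, shard in enumerate(shards):
        # Each unit scans only the part of the database it stores locally.
        for idx in unit_search(shard, model, threshold):
            results.append((unit_id, idx))
    return results


shards = [
    [[0.0, 0.0], [1.0, 1.0]],   # images held by unit 0
    [[1.0, 1.1], [9.0, 9.0]],   # images held by unit 1
]
print(retrieve(shards, [1.0, 1.0], 0.05))
```

Only the hit locations are returned, so the stored images do not need to be streamed out of the units to be searched.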

[E] Neural network.

When applied to a neural network, the separate three-dimensional processor is a three-dimensional neural network processor. It can perform neural calculation; more importantly, the synaptic weights used in neural computation are stored locally.

When used for neural computation, the separate three-dimensional processor 100 is a separate three-dimensional neural network processor. Fig. 24 shows a storage unit 100ij in the three-dimensional neural network processor 100. It includes a neural memory circuit 170 and a neural calculation circuit 180NPC (i.e., the logic circuit 180 is the neural calculation circuit 180NPC), which are electrically coupled via the inter-chip connections 160 (Figs. 3A-3D). The neural memory circuit 170 contains a 3D-M array that stores at least some of the synaptic weights; the neural calculation circuit 180NPC performs neural calculation using the synaptic weights.

Figs. 25-26B disclose details of a neural calculation circuit 180NPC and its calculation circuit 730. In the embodiment of Fig. 25, the neural calculation circuit 180NPC contains a synaptic weight ($W_s$) RAM 740A, an input neuron ($N_{in}$) RAM 740B and a calculation circuit 730. The $W_s$ RAM 740A is a buffer that temporarily stores synaptic weights 742 from the 3D-M array 170; the $N_{in}$ RAM 740B is also a buffer, which temporarily stores input data 746 from the input 110. The calculation circuit 730 performs neural calculation and generates output data 748.

In the embodiment of Fig. 26A, the calculation circuit 730 includes a multiplier 732, an adder 734, a register 736, and an activation function circuit 738. The multiplier 732 multiplies the synaptic weight $w_{ij}$ by the input data $x_i$; the adder 734 and the register 736 accumulate the products ($w_{ij} \times x_i$); the accumulated value is sent to the activation function circuit 738, whose result is the output data $y_j$.
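The multiply, accumulate and activate sequence of Fig. 26A can be mirrored step by step in software. This is a minimal sketch for illustration only: the function name `neuron_output` is hypothetical, floating-point arithmetic stands in for whatever number format the circuit uses, and the sigmoid default is just one of the possible activation functions.

```python
import math
from typing import Callable, Sequence


def sigmoid(s: float) -> float:
    return 1.0 / (1.0 + math.exp(-s))


def neuron_output(weights: Sequence[float], inputs: Sequence[float],
                  activation: Callable[[float], float] = sigmoid) -> float:
    """Compute one output y_j from the weights w_ij and the inputs x_i."""
    register = 0.0                        # plays the role of register 736
    for w_ij, x_i in zip(weights, inputs):
        product = w_ij * x_i              # multiplier 732
        register = register + product     # adder 734 writes back to the register
    return activation(register)           # activation function circuit 738


# Two weights of opposite sign cancel, so the sigmoid sees an input of 0.
print(neuron_output([1.0, -1.0], [2.0, 2.0]))
```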

In the embodiment of Fig. 26B, the multiplier 732 of Fig. 26A is replaced with a multiply-accumulator (MAC) 732'. Of course, the MAC 732' also contains a multiplier. The $W_s$ RAM 740A outputs not only the synaptic weight $w_{ij}$ (via port 742w) but also the bias $b_j$ (via port 742b). The MAC 732' performs a biased multiplication ($w_{ij} \times x_i + b_j$) on the input data $x_i$, the synaptic weight $w_{ij}$ and the bias $b_j$.
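For orientation, the per-operation expression above can be related to the usual form of a neuron with bias. The equation below is a standard textbook formulation, given under the assumption that the bias $b_j$ is applied once per output neuron and that $f$ denotes the activation function; the source text itself only states the per-operation form $w_{ij} \times x_i + b_j$.

$$y_j = f\left(\sum_i w_{ij}\, x_i + b_j\right)$$

Under this reading, setting $b_j = 0$ recovers the accumulate-then-activate behaviour described for Fig. 26A.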

An activation function is a function whose output is confined to a certain range (e.g., 0 to 1, or -1 to +1); examples include the sigmoid function, the signum function, the threshold function, the piecewise linear function, the step function, and the tanh function. An activation function circuit is difficult to implement. Extending the "mathematical computation" approach of the present invention, the calculation circuit 730 may also contain a non-volatile memory (NVM) for long-term storage of a look-up table (LUT) of the activation function. The NVM is typically a read-only memory (ROM), especially a three-dimensional read-only memory (3D-ROM). The 3D-ROM array may be stacked above the neural calculation circuit 180NPC and overlap with it. In this case, the calculation circuit 730 becomes extremely simple: it only needs to implement addition and multiplication, not the activation function. A calculation circuit 730 that implements the activation function with a 3D-ROM array occupies a small area, which preserves the computational density.
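The LUT approach to the activation function can be sketched as follows. This is an illustrative assumption rather than the disclosed design: the table size, the input range, the nearest-entry indexing and the names `build_sigmoid_lut` and `lut_activation` are all hypothetical. The table is computed once (standing in for the contents programmed into the ROM), after which evaluation needs only a clamp, a scale and a read.

```python
import math
from typing import List


def build_sigmoid_lut(lo: float, hi: float, entries: int) -> List[float]:
    """Precompute sigmoid samples on [lo, hi]; stands in for the ROM contents."""
    step = (hi - lo) / (entries - 1)
    return [1.0 / (1.0 + math.exp(-(lo + k * step))) for k in range(entries)]


def lut_activation(lut: List[float], lo: float, hi: float, s: float) -> float:
    """Evaluate the activation by table read instead of computing exp()."""
    s = min(max(s, lo), hi)                               # clamp to the table range
    index = round((s - lo) / (hi - lo) * (len(lut) - 1))  # nearest table entry
    return lut[index]


lut = build_sigmoid_lut(-8.0, 8.0, 257)
print(lut_activation(lut, -8.0, 8.0, 0.0))
```

A larger table reduces the quantization error at the cost of more stored entries, which is the kind of trade-off that a dense memory array makes cheaper.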

It will be understood that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. For example, the processor in the present invention may be a central processing unit (CPU), a controller, a microcontroller, a digital signal processor (DSP), a graphics processing unit (GPU), a network security processor, an encryption/decryption processor, an encoding/decoding processor, a neural network processor, an artificial intelligence (AI) processor, or the like. Accordingly, the invention should not be limited except as by the appended claims.

Claims

1. A separate three-dimensional processor (100) comprising:

a plurality of storage units (100 aa-100 mn), each storage unit (100 ij) comprising a logic circuit (180) and at least one memory array (170); the logic circuit (180) processes at least a portion of the data stored in the memory array (170) but is not a peripheral circuit of the memory array (170);

a first chip (100 a) and a second chip (100 b), the first chip (100 a) containing the memory array (170); the second chip (100 b) contains at least part of the logic circuit (180) and at least part of an off-chip peripheral circuit component (190) of the memory array (170); the first chip (100 a) and the second chip (100 b) have different back-end structures; the first chip (100 a) and the second chip (100 b) are electrically coupled by a plurality of inter-chip connections (160);

the off-chip peripheral circuit component (190) has at least one of the following features 1a)-1f):

1a) The off-chip peripheral circuit component (190) is an address decoder of the memory array (170); or

1b) The off-chip peripheral circuit component (190) is a sense amplifier circuit of the memory array (170); or

1c) The off-chip peripheral circuit component (190) is a writer of the memory array (170); or

1d) The off-chip peripheral circuit component (190) is a read voltage generation circuit of the memory array (170); or

1e) The off-chip peripheral circuit component (190) is a write voltage generation circuit of the memory array (170); or

1f) The off-chip peripheral circuit component (190) is a data buffer of the memory array (170).

2. The three-dimensional processor (100) of claim 1, further characterized by having one of the following features 2a)-2o):

2a) The logic circuit (180) is a processing circuit; or

2b) The memory array (170) stores at least a portion of a look-up table LUT of at least one non-arithmetic function or at least one non-arithmetic model, the logic circuit (180) being an arithmetic logic circuit ALC (180 ALC) and performing arithmetic operations on at least a portion of the data in the look-up table LUT, the three-dimensional processor (100) being configured to implement the non-arithmetic function or the non-arithmetic model, the non-arithmetic function or the non-arithmetic model comprising more operations than are supported by the arithmetic logic circuit ALC (180 ALC); or

2c) The memory array (170) is part of at least one programmable computing unit CCE (400) and stores at least a portion of a look-up table LUT of at least one non-arithmetic function, the logic circuit (180) containing a plurality of programmable logic units CLE (200) and/or programmable connections CIT (300); the three-dimensional processor (100) implements customization of the non-arithmetic function by programming the programmable logic units CLE (200) and/or the programmable connections CIT (300), and the programmable computing unit CCE (400), the non-arithmetic function comprising more operations than the programmable logic unit CLE (200) supports; or

2d) The input of the three-dimensional processor (100) transmits at least part of a first pattern, the memory array (170) stores at least part of a second pattern, and the logic circuit (180) is a pattern processing circuit (180 PPC) and performs pattern processing on the first and second patterns; or

2e) The input of the three-dimensional processor (100) transmits at least part of a target pattern, the memory array (170) stores at least part of a retrieval pattern, and the logic circuit (180) is a pattern processing circuit (180 PPC) and performs pattern processing on the target pattern and the retrieval pattern; or

2f) The input of the three-dimensional processor (100) transmits at least one network data packet or at least one item of computer file data, the memory array (170) stores at least part of a virus identification, and the logic circuit (180) is a pattern processing circuit (180 PPC) and retrieves the virus identification in the network data packet or the file data; or

2g) The input of the three-dimensional processor (100) transmits at least part of data, the memory array (170) stores at least part of a keyword, and the logic circuit (180) is a pattern processing circuit (180 PPC) and retrieves the keyword in the data; or

2h) The input of the three-dimensional processor (100) transmits at least part of speech data, the memory array (170) stores at least part of an acoustic/language model, and the logic circuit (180) is a pattern processing circuit (180 PPC) and performs speech recognition on the speech data according to the acoustic/language model; or

2i) The input of the three-dimensional processor (100) transmits at least part of image data, the memory array (170) stores at least part of an image model, and the logic circuit (180) is a pattern processing circuit (180 PPC) and performs image recognition on the image data according to the image model; or

2j) The input of the three-dimensional processor (100) transmits at least part of a retrieval pattern, the memory array (170) stores at least part of a target pattern, and the logic circuit (180) is a pattern processing circuit (180 PPC) and performs pattern processing on the target pattern and the retrieval pattern; or

2k) The input of the three-dimensional processor (100) transmits at least part of a virus identification, the memory array (170) stores at least part of computer file data, and the logic circuit (180) is a pattern processing circuit (180 PPC) and retrieves the virus identification in the file data; or

2l) The input of the three-dimensional processor (100) transmits at least part of a keyword, the memory array (170) stores at least part of data, and the logic circuit (180) is a pattern processing circuit (180 PPC) and retrieves the keyword in the data; or

2m) The input of the three-dimensional processor (100) transmits at least part of an acoustic/language model, the memory array (170) stores at least part of speech data, and the logic circuit (180) is a pattern processing circuit (180 PPC) and performs speech recognition on the speech data according to the acoustic/language model; or

2n) The input of the three-dimensional processor (100) transmits at least part of an image model, the memory array (170) stores at least part of image data, and the logic circuit (180) is a pattern processing circuit (180 PPC) and performs image recognition on the image data according to the image model; or

2o) The memory array (170) stores at least a portion of the synaptic weights, and the logic circuit (180) is a neural calculation circuit (180 NPC) and performs neural calculation based on the synaptic weights.

3. The three-dimensional processor (100) of claim 1, further characterized by at least one of the following features 3a)-3f):

3a) The first chip (100 a) and the second chip (100 b) are vertically stacked; or

3b) The first chip (100 a) is bonded face-to-face with the second chip (100 b); or

3c) The first chip (100 a) and the second chip (100 b) are the same or close in area; or

3d) The first chip (100 a) is aligned with at least one edge of the second chip (100 b); or

3e) The projection of the memory array (170) onto the second chip (100 b) at least partially overlaps the logic circuit (180); or

3f) The inter-chip connections (160) include bond wires, micro-pads, through-substrate vias (TSVs), and/or vertical interconnect accesses (VIAs).

4. The three-dimensional processor (100) of claim 2, further characterized by at least one of the following features 4a)-4f):

4a) The first chip (100 a) and the second chip (100 b) are vertically stacked; or

4b) The first chip (100 a) is bonded face-to-face with the second chip (100 b); or

4c) The first chip (100 a) and the second chip (100 b) are the same or close in area; or

4d) The first chip (100 a) is aligned with at least one edge of the second chip (100 b); or

4e) The projection of the memory array (170) onto the second chip (100 b) at least partially overlaps the logic circuit (180); or

4f) The inter-chip connections (160) include bond wires, micro-pads, through-substrate vias (TSVs), and/or vertical interconnect accesses (VIAs).

5. The three-dimensional processor (100) of claim 1, further characterized by at least one of the following features 5a)-5c):

5a) The memory array (170) includes a plurality of memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ); or

5b) The memory array (170) includes four memory arrays (170 ijA-170 ijD); or

5c) The memory array (170) includes eight memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ).

6. The three-dimensional processor (100) of claim 2, further characterized by at least one of the following features 6a)-6c):

6a) The memory array (170) includes a plurality of memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ); or

6b) The memory array (170) includes four memory arrays (170 ijA-170 ijD); or

6c) The memory array (170) includes eight memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ).

7. The three-dimensional processor (100) of claim 3, further characterized by at least one of the following features 7a)-7c):

7a) The memory array (170) includes a plurality of memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ); or

7b) The memory array (170) includes four memory arrays (170 ijA-170 ijD); or

7c) The memory array (170) includes eight memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ).

8. The three-dimensional processor (100) of claim 4, further characterized by at least one of the following features 8a)-8c):

8a) The memory array (170) includes a plurality of memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ); or

8b) The memory array (170) includes four memory arrays (170 ijA-170 ijD); or

8c) The memory array (170) includes eight memory arrays (170 ijA-170 ijD, 170 ijW-170 ijZ).

9. The three-dimensional processor (100) of any of claims 1-8, further characterized by at least one of the following features 9a)-9c):

9a) The memory array (170) is a RAM array; or

9b) The memory array (170) is a ROM array; or

9c) The memory array (170) is an NVM array.