CN114586019A - Memory-based processor - Google Patents
- Publication number: CN114586019A
- Application number: CN202080071415.1A
- Authority
- CN
- China
- Prior art keywords
- memory
- processing
- processor
- database
- integrated circuit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0215—Addressing or allocation; Relocation with look ahead addressing means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/14—Protection against unauthorised use of memory or access to memory
- G06F12/1416—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights
- G06F12/1425—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights the protection being physical, e.g. cell, word, block
- G06F12/1441—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights the protection being physical, e.g. cell, word, block for a range
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/70—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
- G06F21/78—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C7/00—Arrangements for writing information into, or reading information out from, a digital store
- G11C7/10—Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
- G11C7/1006—Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0813—Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/45—Caching of specific data in cache memory
- G06F2212/454—Vector or matrix data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Computer Security & Cryptography (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Software Systems (AREA)
- Dram (AREA)
- Memory System (AREA)
- Storage Device Security (AREA)
- Semiconductor Memories (AREA)
Abstract
In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, wherein the memory array includes a plurality of discrete memory banks. The integrated circuit can also include a processing array disposed on the substrate, wherein the processing array includes a plurality of processor sub-units, each of the plurality of processor sub-units being associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to an operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.
Description
Cross Reference to Related Applications
The present application claims priority to: U.S. provisional application No. 62/886,328, filed on August 13, 2019; U.S. provisional application No. 62/907,659, filed on September 29, 2019; U.S. provisional application No. 62/971,912, filed on February 7, 2020; and U.S. provisional application No. 62/983,174, filed on February 28, 2020. The foregoing applications are incorporated herein by reference in their entirety.
Technical Field
The present disclosure generally relates to devices for facilitating memory-intensive operations. In particular, the present disclosure relates to hardware chips that include processing elements coupled to dedicated memory banks. The present disclosure also relates to apparatuses for improving the power efficiency and speed of memory chips. In particular, the present disclosure relates to systems and methods for implementing partial refresh, or even no refresh, on a memory chip. The present disclosure also relates to size-selectable memory chips and dual-port capabilities on memory chips.
Background
As processor speeds and memory sizes continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from the throughput limitations of conventional computer architectures. In particular, data transfer from memory to the processor often becomes a bottleneck compared to the actual computations performed by the processor. Thus, the number of clock cycles spent reading from and writing to memory increases significantly with memory-intensive processes. These clock cycles lower the effective processing speed because reading from and writing to memory consumes clock cycles that cannot be used to perform operations on the data. Furthermore, the computational bandwidth of a processor is typically greater than the bandwidth of the bus that the processor uses to access the memory.
These bottlenecks are particularly pronounced for memory-intensive processes such as neural networks and other machine learning algorithms; database construction, index searching, and querying; and other tasks that involve more read and write operations than data processing operations.
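As a rough, illustrative model (not part of the patent disclosure), the effective throughput of a memory-bound workload can be estimated by comparing compute throughput with bus bandwidth; all numbers below are assumed purely for the sake of the example.

```python
# Back-of-the-envelope model of the von Neumann bottleneck.
# All figures are hypothetical assumptions, not values from this disclosure.

compute_ops_per_s = 200e9   # assumed peak arithmetic throughput (ops/s)
bus_bytes_per_s = 25e9      # assumed memory-bus bandwidth (bytes/s)
bytes_per_op = 8            # assumed data moved per operation (one 64-bit word)

# Operations per second the bus can actually feed to the processor:
memory_bound_ops_per_s = bus_bytes_per_s / bytes_per_op

effective_ops_per_s = min(compute_ops_per_s, memory_bound_ops_per_s)
print(f"effective: {effective_ops_per_s / 1e9:.1f} Gop/s "
      f"(peak: {compute_ops_per_s / 1e9:.0f} Gop/s)")
# Under these assumptions the bus limits the processor to ~3.1 Gop/s,
# far below its 200 Gop/s peak -- the bottleneck described above.
```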
In addition, the rapid growth in the volume and granularity of available digital data has created opportunities to develop machine learning algorithms and has enabled new technologies. However, it has also presented formidable challenges for the fields of databases and parallel computing. For example, the rise of social media and the Internet of Things (IoT) produces digital data at a record rate. This new data can be used to generate algorithms for a variety of purposes, ranging from new advertising techniques to more precise control methods for industrial processes. However, the new data is difficult to store, process, analyze, and handle.
New data resources may be enormous, sometimes on the order of petabytes to zettabytes. Moreover, the growth rate of these data resources may exceed data processing capabilities. Thus, data scientists have turned to parallel data processing techniques to address these challenges. To increase computing power and handle massive amounts of data, scientists have attempted to create systems and methods capable of parallel, intensive computing. However, these existing systems and methods have not kept up with data processing requirements, often because the techniques used are limited by the need for additional resources for data management, integration of partitioned data, and analysis of segmented data.
To facilitate the manipulation of large data sets, engineers and scientists are now seeking to improve the hardware used to analyze data. For example, new semiconductor processors or chips, such as those described herein, may be designed specifically for data-intensive tasks by incorporating memory and processing functions in a single substrate fabricated in a technology more suitable for memory operations than for arithmetic computations. With integrated circuits specifically designed for data-intensive tasks, it is possible to meet the new data processing requirements. However, this new approach to data processing of large data sets requires solving new problems in chip design and fabrication. For example, if a new chip designed for data-intensive tasks is fabricated using the fabrication techniques and architectures of conventional chips, the new chip will have poor performance and/or unacceptable yield. Furthermore, if the new chip is designed to operate with current data handling methods, it will have poor performance because current methods may limit the chip's ability to handle parallel operations.
The present disclosure describes solutions to mitigate or overcome one or more of the problems set forth above, as well as other problems in the prior art.
Disclosure of Invention
In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, wherein the memory array includes a plurality of discrete memory banks. The integrated circuit can also include a processing array disposed on the substrate, wherein the processing array includes a plurality of processor sub-units, each of the plurality of processor sub-units being associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to an operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.
The disclosed embodiments may also include a method of protecting an integrated circuit from tampering, wherein the method includes implementing at least one security measure with respect to operation of the integrated circuit using a controller associated with the integrated circuit and taking one or more remedial actions if the at least one security measure is triggered, and wherein the integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits associated with one or more discrete memory banks among the plurality of discrete memory banks.
The disclosed embodiments may include an integrated circuit comprising: a substrate; a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array comprising a plurality of processor subunits, each of the plurality of processor subunits associated with one or more discrete memory banks among the plurality of discrete memory banks; and a controller configured to implement at least one security measure with respect to operation of the integrated circuit, wherein the at least one security measure comprises copying program code into at least two different memory portions.
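A minimal sketch of the code-duplication security measure described above may help clarify the idea; the class names, the comparison point, and the remedial action below are illustrative assumptions rather than the patented implementation.

```python
# Sketch of the security measure above: program code is copied into two
# memory portions, and a controller compares the copies before execution,
# taking a remedial action on mismatch. All names and the specific remedy
# are assumptions for illustration only.

class Controller:
    def __init__(self, portion_a: bytearray, portion_b: bytearray):
        self.portion_a = portion_a   # first memory portion holding the code
        self.portion_b = portion_b   # duplicate copy in a second portion

    def load_program(self, code: bytes) -> None:
        # Security measure: store the same program in two different portions.
        self.portion_a[:len(code)] = code
        self.portion_b[:len(code)] = code

    def verify_before_execute(self, length: int) -> bool:
        # Triggered when the two copies diverge (e.g., due to tampering).
        if self.portion_a[:length] != self.portion_b[:length]:
            self.remedial_action()
            return False
        return True

    def remedial_action(self) -> None:
        # One possible remedy (assumed): wipe both copies and refuse to run.
        for mem in (self.portion_a, self.portion_b):
            mem[:] = bytes(len(mem))
        print("tamper detected: program erased, execution blocked")

ctrl = Controller(bytearray(64), bytearray(64))
ctrl.load_program(b"\x90\x90\xc3")
ctrl.portion_b[1] ^= 0xFF                 # simulate a fault-injection attack
print("ok to run:", ctrl.verify_before_execute(3))   # False
```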
In some embodiments, a distributed processor memory chip is provided, comprising: a substrate; a memory array disposed on the substrate; a processing array disposed on the substrate; a first communication port; and a second communication port. The memory array may include a plurality of discrete memory banks. The processing array may include a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks. The first communication port may be configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip. The second communication port may be configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.
In some embodiments, a method of transferring data between a first distributed processor memory chip and a second distributed processor memory chip may comprise: determining, using a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip, whether a first processor subunit among a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and after determining that the first processor subunit is ready to transfer data to the second processor subunit, using a clock enable signal controlled by the controller to initiate transfer of data from the first processor subunit to the second processor subunit.
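The following sketch models the clock-enable handshake described in the method above; the class names and the single-word-per-cycle transfer are assumptions made only to illustrate the control flow.

```python
# Sketch of the transfer method above: a controller checks whether the
# sending processor subunit is ready and only then asserts a clock-enable
# signal that lets data move to the receiving subunit. Names are assumed.

from collections import deque

class ProcessorSubunit:
    def __init__(self):
        self.outbox = deque()   # data waiting to leave this subunit
        self.inbox = deque()    # data received from another chip

    def ready_to_send(self) -> bool:
        return bool(self.outbox)

class TransferController:
    def __init__(self, sender: ProcessorSubunit, receiver: ProcessorSubunit):
        self.sender = sender
        self.receiver = receiver
        self.clock_enable = False   # the gating signal for the transfer

    def tick(self) -> None:
        # Step 1: determine readiness; step 2: if ready, enable the clock
        # and transfer one data word on this cycle.
        self.clock_enable = self.sender.ready_to_send()
        if self.clock_enable:
            self.receiver.inbox.append(self.sender.outbox.popleft())

a, b = ProcessorSubunit(), ProcessorSubunit()
a.outbox.extend([0x11, 0x22])
ctl = TransferController(a, b)
for _ in range(3):
    ctl.tick()             # third tick transfers nothing: sender not ready
print(list(b.inbox))       # [17, 34]
```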
In some embodiments, a memory unit may comprise: a memory array comprising a plurality of memory banks; at least one controller configured to control at least one aspect of read operations with respect to the plurality of memory banks; and at least one zero-value detection logic unit configured to detect a multi-bit zero value stored at a particular address of the plurality of memory banks; wherein the at least one controller and the at least one zero-value detection logic unit are configured to return a zero-value indicator to one or more circuits external to the memory unit in response to a zero-value detection by the at least one zero-value detection logic unit.
Some embodiments may include a method for detecting a zero value at a particular address of a plurality of discrete memory banks, comprising: receiving a request from a circuit external to the memory unit to read data stored at an address of the plurality of discrete memory banks; activating, by the controller, zero-value detection logic to detect a zero value at the received address in response to the received request; and transmitting, by the controller, a zero-value indicator to the circuit in response to a zero-value detection by the zero-value detection logic.
Some embodiments may include a non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to detect a zero value at a particular address of a plurality of discrete memory banks by: receiving a request from a circuit external to the memory unit to read data stored at an address of the plurality of discrete memory banks; activating, by the controller, zero-value detection logic to detect a zero value at the received address in response to the received request; and transmitting, by the controller, a zero-value indicator to the circuit in response to a zero-value detection by the zero-value detection logic.
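A brief sketch, under assumed word widths and interfaces, of the zero-value detection flow just described: instead of returning an all-zero word over the bus, the memory unit returns a compact indicator.

```python
# Sketch of zero-value detection: on a read request, detection logic checks
# the addressed multi-bit word, and the controller returns a "zero" indicator
# rather than the full word when it is all zeros. The word width, the dict
# standing in for memory banks, and all names are assumptions.

WORD_BYTES = 4   # assumed word width

class MemoryUnit:
    def __init__(self, banks: dict):
        self.banks = banks   # address -> stored word (bytes)

    def read(self, address: int):
        word = self.banks[address]
        if word == bytes(WORD_BYTES):          # zero-value detection logic
            return ("ZERO_INDICATOR", None)    # no data payload transferred
        return ("DATA", word)

mem = MemoryUnit({0x10: bytes(4), 0x14: b"\xde\xad\xbe\xef"})
print(mem.read(0x10))   # ('ZERO_INDICATOR', None) -> saves the data transfer
print(mem.read(0x14))   # ('DATA', b'\xde\xad\xbe\xef')
```

For sparse data such as neural-network weights, returning an indicator instead of the word itself can save bus bandwidth and energy.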
In some embodiments, a memory unit may comprise: one or more memory banks; a bank controller; and an address generator; wherein the address generator is configured to provide to the bank controller a current address of a current row to be accessed in the associated memory bank, determine a predicted address of a next row to be accessed in the associated memory bank, and provide the predicted address to the bank controller before a read operation with respect to the current row associated with the current address is completed.
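The sketch below illustrates this next-row prediction under the simplifying assumption of a fixed-stride access pattern; the class structure and the use of a set to model open rows are assumptions for illustration.

```python
# Sketch of the address generator above: while the current row is being
# read, a predicted next-row address is handed to the bank controller so the
# next row can be activated early, hiding row-activation latency. A fixed
# stride is assumed as the (predictable) access pattern.

class AddressGenerator:
    def __init__(self, stride: int = 1):
        self.stride = stride   # assumed access pattern: fixed stride

    def predict_next(self, current_row: int) -> int:
        return current_row + self.stride

class BankController:
    def __init__(self):
        self.open_rows = set()

    def activate(self, row: int) -> None:
        self.open_rows.add(row)   # stands in for a DRAM row activation

    def read(self, row: int) -> str:
        assert row in self.open_rows, "row must be activated before reading"
        return f"read row {row}"

gen, ctrl = AddressGenerator(), BankController()
row = 0
ctrl.activate(row)
for _ in range(3):
    # The predicted address is provided before the read of `row` completes,
    # so the next activation overlaps the current read.
    ctrl.activate(gen.predict_next(row))
    print(ctrl.read(row))
    row = gen.predict_next(row)
```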
In some embodiments, a memory unit may comprise: one or more memory banks, wherein each of the one or more memory banks comprises a plurality of rows; a first row controller configured to control a first subset of the plurality of rows; a second row controller configured to control a second subset of the plurality of rows; a single data input to receive data to be stored in a plurality of columns; and a single data output for providing data retrieved from the plurality of columns.
In some embodiments, a distributed processor memory chip may include: a substrate; a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array comprising a plurality of processor subunits, each of the processor subunits associated with a corresponding dedicated memory bank of the plurality of discrete memory banks; a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding dedicated memory bank; and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits. At least one of the memory banks may include at least one DRAM memory pad disposed on the substrate. At least one of the processor subunits may include one or more logic components associated with the at least one memory pad. The at least one memory pad and the one or more logic components may be configured to act as a cache for one or more of the plurality of processor subunits.
In some embodiments, a method of executing at least one instruction in a distributed processor memory chip may comprise: retrieving one or more data values from a memory array of the distributed processor memory chip; storing the one or more data values in a register formed in a memory pad of the distributed processor memory chip; and accessing the one or more data values stored in the register in accordance with at least one instruction executed by a processor element; wherein the memory array comprises a plurality of discrete memory banks disposed on a substrate; wherein the processor element is a processor subunit included among a plurality of processor subunits in a processing array disposed on the substrate, each of the processor subunits being associated with a corresponding dedicated memory bank of the plurality of discrete memory banks; and wherein the register is provided by a memory pad disposed on the substrate.
Some embodiments may include a device comprising: a substrate; a processing unit disposed on the substrate; and a memory unit disposed on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit, and wherein the processing unit includes a memory pad configured to act as a cache for the processing unit.
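As a rough illustration of how a memory pad might serve as a register file for a nearby processing unit, consider the following sketch; the register count, the instruction set, and the class names are assumptions, not the disclosed design.

```python
# Sketch of a memory pad acting as a register file/cache: operands are
# fetched once from the large memory array and then accessed from the small
# pad located next to the processing unit. All names are assumed.

class MemoryPad:
    """A tiny storage block next to the processor, used as registers."""
    def __init__(self, num_registers: int = 8):
        self.regs = [0] * num_registers

class ProcessingUnit:
    def __init__(self, memory_array: list):
        self.memory_array = memory_array   # large, slower memory banks
        self.pad = MemoryPad()             # local pad used as a register file

    def load(self, reg: int, address: int) -> None:
        # One fetch from the memory array; the value then lives in the pad.
        self.pad.regs[reg] = self.memory_array[address]

    def add(self, dst: int, a: int, b: int) -> None:
        # Instruction operands come from the pad, not the memory array.
        self.pad.regs[dst] = self.pad.regs[a] + self.pad.regs[b]

pu = ProcessingUnit(memory_array=[10, 20, 30, 40])
pu.load(0, 1)            # pad reg0 <- mem[1] (20)
pu.load(1, 3)            # pad reg1 <- mem[3] (40)
pu.add(2, 0, 1)
print(pu.pad.regs[2])    # 60
```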
Processing systems are expected to process increasing amounts of information at extremely high rates. For example, fifth-generation (5G) mobile internet is expected to receive a large number of information streams and to process those streams at ever-increasing rates.
The processing system may include one or more buffers and a processor. Processing operations applied by a processor may have some latency and this may require a large number of buffers. A large number of buffers can be costly and/or area consuming.
Transferring large amounts of information from the buffer to the processor may require a high bandwidth connector and/or a high bandwidth bus between the buffer and the processor, which may also increase the cost and area of the processing system.
There is an increasing need to provide efficient processing systems.
A disaggregated server includes a plurality of subsystems, each having a unique role. For example, a disaggregated server may include one or more switching subsystems, one or more computing subsystems, and one or more storage subsystems.
The one or more computing subsystems and the one or more storage subsystems are coupled to each other via the one or more switching subsystems.
The computing subsystem may include a plurality of computing units.
The switching subsystem may include a plurality of switching units.
The storage subsystem may include a plurality of storage units.
The bottleneck of this disaggregated server is the bandwidth required to transfer information between subsystems.
This is particularly true when performing distributed computations that require sharing information units among all (or at least most) of the computing units (such as graphics processing units) of the different computing subsystems.
Assume that there are N computing units participating in the sharing, where N is a very large integer (e.g., at least 1024), and that each of the N computing units must send information units to, and receive information units from, all other computing units. Under these assumptions, about N × N transfers of information units must be executed. This bulk transfer process is time- and energy-consuming and will significantly limit the throughput of the disaggregated server.
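To make the scaling concrete, the short calculation below uses the stated example of N = 1024 computing units in an all-to-all exchange.

```python
# Worked example of the N x N scaling claim above, with the stated N = 1024.
N = 1024
transfers = N * (N - 1)   # each unit sends to every other unit
print(transfers)          # 1,047,552 -- roughly N*N, i.e., about a million
```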
There is an increasing need to provide efficient disaggregated servers and efficient ways to perform distributed processing.
A database includes a number of entries, each having a plurality of fields. Database processing typically includes executing one or more queries, where a query includes one or more filtering parameters (e.g., identifying one or more relevant fields and one or more relevant field values) and one or more operational parameters that may determine the type of operation to be performed, variables or constants to be used in applying the operation, and the like.
For example, a database query may request that a statistical operation (operational parameter) be performed on all records of the database in which a certain field has a value within a predefined range (filtering parameter). As another example, a database query may request deletion (operational parameter) of records in which a certain field is less than a threshold (filtering parameter).
Large databases are typically stored in storage devices. In order to respond to a query, the database is sent to the memory units, typically one database segment followed by another.
Entries of the database segment are sent from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entries are then processed by the processor.
For each database segment of the database stored in the memory unit, the process comprises the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether the record is relevant; and (iv) performing one or more additional operations (summing, or applying any other mathematical and/or statistical operations) on the relevant records.
The filtering process ends after all records have been sent to the processor and the processor has determined which records are relevant.
In the case where the relevant entries of the database segment are not stored in the processor, those relevant records need to be sent to the processor again for further processing (applying the post-filtering operations) after the filtering phase.
When multiple processing operations follow a single filtering pass, the result of each operation may be sent to the memory unit and then back again to the processor.
This process is bandwidth consuming and time consuming.
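The following sketch walks through steps (i) through (iv) above for a toy segment; the record fields, the filter, and the aggregation are illustrative assumptions.

```python
# Sketch of the conventional per-segment flow, steps (i)-(iv): every record
# is shipped from the memory unit to an external processor, which filters it
# and then aggregates the relevant records. Fields and thresholds are assumed.

segment = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 500},
    {"id": 3, "amount": 75},
]

def send_to_processor(record: dict) -> dict:
    # Stand-in for step (ii): a costly off-chip transfer of the full record.
    return record

relevant = []
for record in segment:                      # step (i): select a record
    record = send_to_processor(record)      # step (ii): send it over the bus
    if record["amount"] < 100:              # step (iii): filter for relevance
        relevant.append(record)

total = sum(r["amount"] for r in relevant)  # step (iv): additional operation
print(total)                                # 125
```

Note that every record crosses the memory-to-processor link even though only some turn out to be relevant, which is exactly the bandwidth cost described above.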
There is an increasing need to provide efficient ways of performing database processing.
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.
Methods of generating this mapping include neural networks, dimensionality reduction of word co-occurrence matrices, probability models, interpretable knowledge base methods, and explicit representations based on the context in which the word occurs.
When used as the underlying input representation, word and phrase embeddings have been shown to improve the performance of NLP tasks such as syntactic parsing and sentiment analysis.
A sentence may be segmented into words or phrases, and each segment may be represented by a vector. The sentence may then be represented by a matrix that includes all of the vectors representing the words or phrases of the sentence.
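As a small illustration of this mapping (with a made-up vocabulary and 4-dimensional vectors, purely for the example):

```python
# Sketch of the representation above: each word of a sentence is looked up
# in a vocabulary-to-vector table, and the sentence becomes a matrix whose
# rows are the word vectors. The vocabulary and dimensions are assumptions.

vocabulary = {
    "memory":    [0.1, 0.3, 0.0, 0.7],
    "based":     [0.5, 0.2, 0.9, 0.1],
    "processor": [0.8, 0.6, 0.4, 0.2],
}

sentence = "memory based processor"
matrix = [vocabulary[word] for word in sentence.split()]
print(matrix)   # 3 x 4 matrix: one row (vector) per word of the sentence
```

Because consecutive words map to essentially arbitrary entries of the table, these lookups translate into random addresses when the vocabulary is held in DRAM, which is the access pattern discussed next.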
The vocabulary that maps words to vectors may be stored in a memory unit, such as a Dynamic Random Access Memory (DRAM), that may be accessed using a word or a phrase (or an index representing a word).
The accesses may be random accesses, which reduces the throughput of the DRAM. In addition, these accesses may saturate the DRAM, especially when a large number of accesses are fed into the DRAM.
In particular, the words included in a sentence are often quite random. Even when DRAM bursts are used, accessing the DRAM memory that stores the mapping will typically yield the lower performance of random access, since during a burst typically only a small portion of the DRAM bank entries (among the multiple entries of different memory banks accessed at the same time) will store entries associated with a given sentence.
Thus, the throughput of DRAM memory is low and non-continuous.
Each word or phrase of a sentence is retrieved from the DRAM memory under the control of a host computer that is external to the integrated circuit of the DRAM memory and must control each retrieval of each vector representing each word or segment based on knowledge of the word's location, which is a time-consuming and resource-consuming task.
Data centers and other computerized systems are expected to process and exchange increasing amounts of information at extremely high rates.
The exchange of increasing amounts of data can become a bottleneck for data centers and other computerized systems, leaving such data centers and other computerized systems able to use only a portion of their capabilities.
Fig. 96A illustrates an example of a prior art database 12010 and a prior art server motherboard 12011. The database may include a plurality of servers, each server including a plurality of server motherboards (also denoted "CPU + memory + network"). Each server motherboard 12011 includes a CPU 12012 (such as, but not limited to, Intel's XEON) that receives traffic and is connected to a memory unit 12013 (shown as RAM) and to a plurality of database accelerators (DB accelerators) 12014.
The DB accelerator is optional, and DB acceleration operations may be performed by the CPU 12012.
All traffic flows through the CPU, and the CPU may be coupled to the DB accelerators via links with relatively limited bandwidth (such as PCIe).
A large amount of resources is dedicated to routing information units between multiple server motherboards.
There is an increasing need to provide efficient data centers and other computerized systems.
Artificial intelligence (AI) applications such as neural networks have grown significantly in size. To cope with the increasing size of neural networks, multiple servers, each acting as an AI acceleration server (including a server motherboard), are used to perform neural network processing tasks such as, but not limited to, training. An example of a system including multiple AI acceleration servers configured in different racks is shown in fig. 97A.
In a typical training session, a large number of images are processed simultaneously to produce aggregate quantities, such as a loss. Large amounts of traffic are transported between the different AI acceleration servers, resulting in exceptional amounts of network traffic. For example, some neural network layers may be computed across multiple GPUs located in different AI acceleration servers and may require bandwidth-consuming aggregation over the network.
The transmission of exceptional amounts of traffic requires ultra-high bandwidth, which may not be feasible or may not be cost-effective.
Fig. 97A illustrates a system 12050 including subsystems, each subsystem including a switch 12051 that connects AI acceleration servers 12052. Each AI acceleration server has a server board 12055 that includes RAM memory (RAM 12056), a central processing unit (CPU) 12054, and a network adapter (NIC) 12053, with the CPU 12054 connected (via a PCIe bus) to a plurality of AI accelerators 12057 such as graphics processing units, AI chips (AI ASICs), FPGAs, and the like. The NICs are coupled to each other over a network (e.g., by one or more switches, using Ethernet, UDP links, and the like), and these NICs must be capable of delivering the ultra-high bandwidth required by the system.
There is an increasing need to provide efficient AI computing systems.
According to other embodiments of the disclosure, a non-transitory computer-readable storage medium may store program instructions that are executed by at least one processing device and perform any of the methods described herein.
The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various disclosed embodiments. In the drawings:
FIG. 1 is a schematic diagram of a Central Processing Unit (CPU).
FIG. 2 is a schematic diagram of a Graphics Processing Unit (GPU).
FIG. 3A is a schematic diagram of one embodiment of an exemplary hardware chip consistent with the disclosed embodiments.
FIG. 3B is a schematic diagram of another embodiment of an exemplary hardware chip consistent with the disclosed embodiments.
FIG. 4 is a schematic diagram of a generic command executed by an exemplary hardware chip consistent with the disclosed embodiments.
FIG. 5 is a diagram of specialized commands executed by an exemplary hardware chip consistent with the disclosed embodiments.
FIG. 6 is a schematic diagram of a processing group for use in an exemplary hardware chip consistent with the disclosed embodiments.
FIG. 7A is a schematic diagram of a rectangular array of processing groups consistent with the disclosed embodiments.
FIG. 7B is a schematic illustration of an elliptical array of processing groups consistent with the disclosed embodiments.
FIG. 7C is a schematic diagram of an array of hardware chips consistent with the disclosed embodiments.
FIG. 7D is a schematic diagram of another array of hardware chips consistent with the disclosed embodiments.
FIG. 8 is a flow diagram depicting an exemplary method for compiling a series of instructions for execution on an exemplary hardware chip consistent with the disclosed embodiments.
FIG. 9 is a schematic diagram of a memory bank.
FIG. 10 is a schematic diagram of a memory bank.
FIG. 11 is a schematic diagram of an embodiment of an exemplary memory bank having subgroup controls consistent with the disclosed embodiments.
FIG. 12 is a schematic diagram of another embodiment of an exemplary memory bank having subgroup controls consistent with the disclosed embodiments.
FIG. 13 is a functional block diagram of an exemplary memory chip consistent with the disclosed embodiments.
FIG. 14 is a functional block diagram of an exemplary set of redundant logical blocks consistent with the disclosed embodiments.
FIG. 15 is a functional block diagram of exemplary logical blocks consistent with the disclosed embodiments.
FIG. 16 is a functional block diagram of exemplary logical blocks coupled to a bus consistent with the disclosed embodiments.
FIG. 17 is a functional block diagram of exemplary logical blocks connected in series consistent with the disclosed embodiments.
FIG. 18 is a functional block diagram of exemplary logic blocks connected in a two-dimensional array consistent with the disclosed embodiments.
FIG. 19 is a functional block diagram of exemplary logical blocks in a complex connection consistent with the disclosed embodiments.
FIG. 20 is an exemplary flow chart illustrating a redundant block enable process consistent with the disclosed embodiments.
FIG. 21 is an exemplary flow chart illustrating an address assignment process consistent with the disclosed embodiments.
FIG. 22 is a functional block diagram of an exemplary processing device consistent with the disclosed embodiments.
FIG. 23 is a functional block diagram of an exemplary processing device consistent with the disclosed embodiments.
FIG. 24 includes an exemplary memory configuration diagram consistent with the disclosed embodiments.
FIG. 25 is an exemplary flow chart illustrating a memory configuration handler consistent with the disclosed embodiments.
FIG. 26 is an exemplary flow chart illustrating a memory read process consistent with the disclosed embodiments.
FIG. 27 is an exemplary flow chart illustrating the execution of a handler consistent with the disclosed embodiments.
FIG. 28 is an embodiment of a memory chip with a refresh controller consistent with the present disclosure.
FIG. 29A is a refresh controller consistent with an embodiment of the present disclosure.
FIG. 29B is a refresh controller according to another embodiment consistent with the present disclosure.
FIG. 30 is a flow diagram of one embodiment of a process performed by the refresh controller consistent with the present disclosure.
FIG. 31 is a flow diagram of one embodiment of a process implemented by a compiler consistent with the present disclosure.
FIG. 32 is a flow diagram of another embodiment of a handler implemented by a compiler consistent with the present disclosure.
FIG. 33 shows an example refresh controller configured with stored patterns consistent with the present disclosure.
FIG. 34 is an example flow diagram of a process implemented by software within the refresh controller consistent with the present disclosure.
Fig. 35A shows an example wafer including dice consistent with the present disclosure.
FIG. 35B shows an example memory chip connected to an input/output bus consistent with the present disclosure.
Fig. 35C shows an example wafer including memory chips arranged in rows and connected to input-output buses consistent with the present disclosure.
FIG. 35D shows two memory chips grouped and connected to an input-output bus consistent with the present disclosure.
Fig. 35E shows an example wafer consistent with the present disclosure including dice disposed in a hexagonal lattice and connected to input-output buses.
FIGS. 36A-36D show various possible configurations of memory chips connected to an input/output bus consistent with the present disclosure.
FIG. 37 shows an example grouping of dice sharing glue logic consistent with the present disclosure.
Fig. 38A-38B show example cuts through a wafer consistent with the present disclosure.
Fig. 38C shows an example arrangement of dice on a wafer and an arrangement of input-output buses consistent with the present disclosure.
FIG. 39 shows an example memory chip on a wafer with interconnected processor subunits consistent with the present disclosure.
FIG. 40 is a flowchart of an example process for laying out groups of memory chips from a wafer consistent with the present disclosure.
FIG. 41A is another example flow chart of a process for laying out groups of memory chips from a wafer consistent with the present disclosure.
FIGS. 41B-41C are example flow charts of a process of determining a dicing pattern for dicing one or more groups of memory chips from a wafer consistent with the present disclosure.
FIG. 42 is an example of circuitry within a memory chip providing dual port access along a column consistent with the present disclosure.
FIG. 43 is an example of circuitry within a memory chip providing dual port access along a row consistent with the present disclosure.
Fig. 44 is an example of circuitry within a memory chip providing dual port access along both rows and columns consistent with the present disclosure.
FIG. 45A is a dual read using a duplicate memory array or pad.
FIG. 45B is a dual write using a duplicate memory array or pad.
FIG. 46 is an example of circuitry within a memory chip having switching elements for dual port access along a column consistent with the present disclosure.
FIG. 47A is an example flow diagram of a process for providing dual port access on a single port memory array or pad consistent with the present disclosure.
FIG. 47B is an example flow diagram of another process for providing dual port access on a single port memory array or pad consistent with the present disclosure.
Fig. 48 is another example of circuitry within a memory chip providing dual port access along both rows and columns consistent with the present disclosure.
FIG. 49 is an example of a switching element for dual port access within a memory mat consistent with the present disclosure.
Fig. 50 is an example integrated circuit having a reduction unit configured to access a partial word consistent with the present disclosure.
FIG. 51 is a memory bank for using a reduction unit as described with respect to FIG. 50.
Fig. 52 is a memory bank utilizing a reduction unit integrated into PIM logic consistent with the present disclosure.
Fig. 53 is a memory bank using PIM logic to activate a switch for accessing a partial word consistent with the present disclosure.
FIG. 54A is a memory bank having a segmented column multiplexer for deactivating to access a partial word consistent with the present disclosure.
FIG. 54B is an example flow diagram of a process for partial word access in memory consistent with the present disclosure.
Fig. 55 is a conventional memory chip including a plurality of memory pads.
FIG. 56 is an embodiment of a memory chip having a startup circuit for reducing power consumption during line disconnection consistent with the present disclosure.
FIG. 57 is another embodiment of a memory chip having a startup circuit for reducing power consumption during line disconnection consistent with the present disclosure.
FIG. 58 is yet another embodiment of a memory chip having a startup circuit for reducing power consumption during line disconnection consistent with the present disclosure.
FIG. 59 is yet another embodiment of a memory chip having a power up circuit for reducing power consumption during line disconnection consistent with the present disclosure.
FIG. 60 is an embodiment of a memory chip having global word lines and local word lines for reducing power consumption during line disconnection consistent with the present disclosure.
FIG. 61 is another embodiment of a memory chip having global word lines and local word lines for reducing power consumption during line disconnection consistent with the present disclosure.
FIG. 62 is a flow chart of a process for sequentially disconnecting lines in memory consistent with the present disclosure.
FIG. 63 is a prior art tester for memory chips.
FIG. 64 is another prior art tester for memory chips.
FIG. 65 is an embodiment of testing a memory chip using logic cells on the same substrate as the memory consistent with the present disclosure.
FIG. 66 is another embodiment of testing a memory chip using logic cells on the same substrate as the memory consistent with the present disclosure.
FIG. 67 is yet another embodiment of testing a memory chip using logic cells on the same substrate as the memory consistent with the present disclosure.
FIG. 68 is yet another embodiment of testing a memory chip using logic cells on the same substrate as the memory consistent with the present disclosure.
FIG. 69 is another embodiment consistent with the present disclosure for testing a memory chip using logic cells on the same substrate as the memory.
FIG. 70 is a flow chart of a process for testing memory chips consistent with the present disclosure.
FIG. 71 is a flow chart of another process for testing memory chips consistent with the present disclosure.
FIG. 72A is a diagrammatic representation of an integrated circuit that includes a memory array and a processing array consistent with an embodiment of the invention.
FIG. 72B is a diagrammatic representation of a memory region within an integrated circuit consistent with embodiments of the invention.
FIG. 73A is a diagrammatic representation of an integrated circuit with an example configuration of a controller consistent with embodiments of the invention.
FIG. 73B is a diagrammatic representation of an arrangement for concurrently executing a replication model in accordance with an embodiment of the present invention.
FIG. 74A is a diagrammatic representation of an integrated circuit with another example configuration of a controller consistent with embodiments of the invention.
Fig. 74B is a flowchart representation of a method of protecting an integrated circuit according to an exemplary disclosed embodiment.
Fig. 74C is a diagrammatic representation of detection elements located at various points within a chip in accordance with an exemplary disclosed embodiment.
FIG. 75A is a diagrammatic representation of a scalable processor memory system that includes a plurality of distributed processor memory chips consistent with an embodiment of the invention.
FIG. 75B is a diagrammatic representation of a scalable processor memory system that includes a plurality of distributed processor memory chips consistent with an embodiment of the invention.
FIG. 75C is a diagrammatic representation of a scalable processor memory system that includes a plurality of distributed processor memory chips consistent with an embodiment of the invention.
FIG. 75D is a diagrammatic representation of a dual port distributed processor memory chip consistent with an embodiment of the invention.
FIG. 75E is an example timing diagram consistent with an embodiment of the present invention.
FIG. 76 is a diagrammatic representation of a processor memory chip with an integrated controller and interface module and constituting a scalable processor memory system consistent with an embodiment of the invention.
FIG. 77 is a flow diagram for transferring data between processor memory chips in the scalable processor memory system shown in FIG. 75A, consistent with an embodiment of the invention.
FIG. 78A illustrates a system for detecting zero values at the chip level stored in one or more particular addresses of a plurality of memory banks implemented in a memory chip, consistent with an embodiment of the invention.
FIG. 78B illustrates a memory chip for detecting zero values stored in one or more of specific addresses of multiple memory banks at the memory bank level, consistent with an embodiment of the present invention.
FIG. 79 illustrates a memory bank at a memory pad level detecting a zero value stored in one or more of specific addresses of a plurality of memory pads, consistent with an embodiment of the invention.
FIG. 80 is a flow chart illustrating an exemplary method of detecting a zero value in a particular address of a plurality of discrete memory banks, consistent with an embodiment of the present invention.
FIG. 81A illustrates a system for initiating a next row associated with a memory bank based on a next row prediction, consistent with an embodiment of the invention.
FIG. 81B illustrates another embodiment of the system of FIG. 81A consistent with embodiments of the invention.
FIG. 81C illustrates first and second subsets of row controllers for each memory subset consistent with an embodiment of the invention.
FIG. 81D illustrates an embodiment of next row prediction consistent with an embodiment of the present invention.
FIG. 81E illustrates an embodiment of a memory bank consistent with an embodiment of the present invention.
FIG. 81F illustrates another embodiment of a memory bank consistent with an embodiment of the invention.
FIG. 82 illustrates a dual control memory bank for reducing memory row launch penalties, consistent with an embodiment of the invention.
FIG. 83A illustrates a first example of accessing and activating a row of a memory bank.
FIG. 83B illustrates a second example of accessing and activating a row of a memory bank.
FIG. 83C illustrates a third example of accessing and activating a row of a memory bank.
FIG. 84 provides a diagrammatic representation of a conventional CPU/register file and external memory architecture.
FIG. 85A illustrates an exemplary distributed processor memory chip having memory pads that function as register files, consistent with one embodiment.
FIG. 85B illustrates an exemplary distributed processor memory chip having memory pads configured to act as a register file, consistent with another embodiment.
FIG. 85C illustrates an exemplary device having memory pads that function as a register file, consistent with another embodiment.
FIG. 86 provides a flowchart representing an exemplary method for executing at least one instruction in a distributed processor memory chip consistent with the disclosed embodiments.
FIG. 87A includes an example of a disaggregated server;
FIG. 87B is an example of distributed processing;
FIG. 87C is an example of a memory/processing unit;
FIG. 87D is an example of a memory/processing unit;
FIG. 87E is an example of a memory/processing unit;
FIG. 87F is an example of an integrated circuit including a memory/processing unit and one or more communication modules;
FIG. 87G is an example of an integrated circuit including a memory/processing unit and one or more communication modules;
FIG. 87H is an example of a method;
FIG. 87I is an example of a method;
FIG. 88A is an example of a method;
FIG. 88B is an example of a method;
FIG. 88C is an example of a method;
FIG. 89A is an example of a memory/processing unit and a vocabulary table;
FIG. 89B is an example of a memory/processing unit;
FIG. 89C is an example of a memory/processing unit;
FIG. 89D is an example of a memory/processing unit;
FIG. 89E is an example of a memory/processing unit;
FIG. 89F is an example of a memory/processing unit;
FIG. 89G is an example of a memory/processing unit;
FIG. 89H is an example of a memory/processing unit;
FIG. 90A is an example of a system;
FIG. 90B is an example of a system;
FIG. 90C is an example of a system;
FIG. 90D is an example of a system;
FIG. 90E is an example of a system;
FIG. 90F is an example of a method;
FIG. 91A is an example of a memory and screening system, storage device, and CPU;
FIG. 91B is an example of a memory and processing system, storage device, and CPU;
FIG. 92A is an example of a memory and processing system, storage device, and CPU;
FIG. 92B is an example of a memory/processing unit;
FIG. 92C illustrates an example of a memory and screening system, storage device, and CPU;
FIG. 92D illustrates an example of a memory and processing system, a storage device, and a CPU;
FIG. 92E illustrates an example of a memory and processing system, a storage device, and a CPU;
FIG. 92F is an example of a method;
FIG. 92G is an example of a method;
FIG. 92H is an example of a method;
FIG. 92I is an example of a method;
FIG. 92J is an example of a method;
FIG. 92K is an example of a method;
FIG. 93A is a cross-sectional view of an example of a hybrid integrated circuit;
FIG. 93B is a cross-sectional view of an example of a hybrid integrated circuit;
FIG. 93C is a cross-sectional view of an example of a hybrid integrated circuit;
FIG. 93D is a cross-sectional view of an example of a hybrid integrated circuit;
FIG. 93E is a top view of an example of a hybrid integrated circuit;
FIG. 93F is a top view of an example of a hybrid integrated circuit;
FIG. 93G is a top view of an example of a hybrid integrated circuit;
fig. 93H is a cross-sectional view of an example of a hybrid integrated circuit;
FIG. 93I is a cross-sectional view of an example of a hybrid integrated circuit;
FIG. 93J is an example of a method;
FIG. 94A is an example of a storage system, one or more devices, and a computing system;
FIG. 94B is an example of a storage system, one or more devices, and a computing system;
FIG. 94C is an example of one or more devices and a computing system;
FIG. 94D is an example of one or more devices and a computing system;
FIG. 94E is an example of a database acceleration integrated circuit;
FIG. 94F is an example of a database acceleration integrated circuit;
FIG. 94G is an example of a database acceleration integrated circuit;
FIG. 94H is an example of a database acceleration unit;
FIG. 94I is an example of a blade and a group of database acceleration integrated circuits;
FIG. 94J is an example of a group of database acceleration integrated circuits;
FIG. 94K is an example of a group of database acceleration integrated circuits;
FIG. 94L is an example of a group of database acceleration integrated circuits;
FIG. 94M is an example of a group of database accelerated integrated circuits;
FIG. 94N is an example of a system;
FIG. 94O is an example of a system;
FIG. 94P is an example of a method;
FIG. 95A is an example of a method;
FIG. 95B is an example of a method;
FIG. 95C is an example of a method;
FIG. 96A is an example of a prior art system;
FIG. 96B is an example of a system;
FIG. 96C is an example of a database accelerator board;
FIG. 96D is an example of a portion of a system;
FIG. 97A is an example of a prior art system;
FIG. 97B is an example of a system; and
fig. 97C is an example of an AI network adapter.
Detailed Description
The following detailed description refers to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings and the following description to refer to the same or like parts. While several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the components illustrated in the drawings, and the illustrative methods described herein may be modified by substituting, reordering, removing, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.
Processor architecture
As used throughout this disclosure, the term "hardware chip" refers to a semiconductor wafer (such as silicon or the like) having one or more circuit elements (such as transistors, capacitors, resistors, and/or the like) formed thereon. The circuit elements may form processing elements or memory elements. A "processing element" refers to one or more circuit elements that collectively perform at least one logical function, such as an arithmetic function, logic gates, other Boolean operations, or the like. The processing elements may be general purpose processing elements (such as a configurable plurality of transistors) or special purpose processing elements (such as a particular logic gate or a plurality of circuit elements designed to perform a particular logic function). A "memory element" refers to one or more circuit elements that may be used to store data. A "memory element" may also be referred to as a "memory cell". The memory elements may be dynamic (such that electrical refresh is required to maintain data storage), static (such that data persists for at least a period of time after power is lost), or non-volatile memory.
The processing elements may be joined to form a processor subunit. A "processor subunit" can thus comprise a minimal grouping of processing elements that can execute at least one task or instruction (e.g., belonging to a processor instruction set). For example, a subunit may include one or more general-purpose processing elements configured to collectively execute instructions, one or more general-purpose processing elements paired with one or more special-purpose processing elements configured to execute instructions in a complementary manner, or the like. The processor subunits may be arranged in an array on a substrate (e.g., a wafer). Although an "array" may comprise a rectangular shape, any arrangement of subunits in an array may be formed on a substrate.
The memory elements may be joined to form a memory bank. For example, a memory bank may include one or more lines of memory elements linked along at least one conductive line (or other conductive connection). Further, the memory elements may be linked along at least one additional conductive line in another direction. For example, the memory elements may be arranged along word lines and bit lines, as explained below. Although a memory bank may include lines, any arrangement of elements in a bank may be used to form the bank on a substrate. Further, one or more banks may be electrically joined to at least one memory controller to form a memory array. Although the memory array may include a rectangular arrangement of banks, any arrangement of banks in the array may be formed on the substrate.
As further used throughout this disclosure, a "bus" refers to any communication connection between components of a substrate. For example, wires or lines (forming electrical connections), optical fibers (forming optical connections), or any other connection that enables communication between elements may be referred to as a "bus."
Conventional processors pair general-purpose logic circuits with shared memory. The shared memory may store both an instruction set for execution by the logic circuit and data for execution of the instruction set and resulting from execution of the instruction set. As described below, some conventional processors use cache systems to reduce latency in performing fetches from shared memory; however, conventional cache systems remain shared. Conventional processors include Central Processing Units (CPUs), Graphics Processing Units (GPUs), various Application Specific Integrated Circuits (ASICs), or the like. Fig. 1 shows an example of a CPU, and fig. 2 shows an example of a GPU.
As shown in fig. 1, CPU 100 may include a processing unit 110, and processing unit 110 may include one or more processor subunits, such as processor subunit 120a and processor subunit 120b. Although not depicted in fig. 1, each processor subunit may include multiple processing elements. Further, processing unit 110 may include one or more levels of on-chip cache. Such cache elements are typically formed on the same semiconductor die as processing unit 110 rather than being connected to processor subunits 120a and 120b via one or more buses formed in the substrate that contains processor subunits 120a and 120b and the cache elements. Placement directly on the same die, rather than connection via a bus, is common for both level one (L1) and level two (L2) caches in conventional processors. In early processors, by contrast, the L2 cache was shared among the processor subunits using a backside bus between the subunits and the L2 cache. Backside buses are typically larger than the frontside buses described below. Thus, because the cache is to be shared by all processor subunits on the die, cache 130 may be formed on the same die as processor subunits 120a and 120b or communicatively coupled to processor subunits 120a and 120b via one or more backside buses. In both embodiments without a bus (e.g., with the cache formed directly on the die) and embodiments using a backside bus, the cache is shared among the processor subunits of the CPU.
In addition, processing unit 110 communicates with shared memory 140a and memory 140b. For example, memories 140a and 140b may represent banks of shared Dynamic Random Access Memory (DRAM). Although depicted as having two memory banks, most conventional memory chips include between eight and sixteen memory banks. Thus, the processor subunits 120a and 120b may use the shared memories 140a and 140b to store data, which is then operated on by the processor subunits 120a and 120b. However, this arrangement results in the bus between the memories 140a and 140b and the processing unit 110 becoming a bottleneck when the clock speed of the processing unit 110 exceeds the data transfer speed of the bus. This is typically the case for conventional processors, resulting in effective processing speeds that are lower than the specified processing speed based on clock rate and number of transistors.
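The scale of this bottleneck can be illustrated with a rough back-of-envelope model. The numbers below are illustrative assumptions, not figures from this disclosure: effective throughput is the lesser of what the logic can compute and what the shared bus can feed.

```python
# Back-of-envelope sketch (illustrative numbers only): a processor
# stalls when its demand for operands exceeds what the shared memory
# bus can deliver, so effective throughput is the minimum of the
# compute-bound rate and the bus-fed rate.

def effective_gops(clock_ghz, ops_per_cycle, bus_gbps, bytes_per_op):
    compute_bound = clock_ghz * ops_per_cycle  # Gops the cores could do
    memory_bound = bus_gbps / bytes_per_op     # Gops the bus can feed
    return min(compute_bound, memory_bound)

# A 3 GHz unit doing 8 ops/cycle, fed by a 25 GB/s bus, where every op
# consumes 8 bytes of fresh operand data:
print(effective_gops(3.0, 8, 25.0, 8))  # 3.125 Gops, far below the 24 Gops peak
```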
As shown in fig. 2, similar drawbacks also exist in GPUs. GPU 200 may include a processing unit 210, and processing unit 210 may include one or more processor subunits (e.g., subunits 220a, 220b, 220c, 220d, 220e, 220f, 220g, 220h, 220i, 220j, 220k, 220l, 220m, 220n, 220o, and 220p). Further, processing unit 210 may include one or more levels of on-chip cache and/or register files. Such cache elements are typically formed on the same semiconductor die as processing unit 210. Indeed, in the embodiment of fig. 2, a shared cache is formed on the same die as processing unit 210 and is shared among all of the processor subunits, while caches 230a, 230b, 230c, and 230d are each formed on a subset of the processor subunits and dedicated thereto.
Further, processing unit 210 communicates with shared memories 250a, 250b, 250c, and 250d. For example, memories 250a, 250b, 250c, and 250d may represent banks of shared DRAM. Thus, the processor subunits of the processing unit 210 may use the shared memories 250a, 250b, 250c, and 250d to store data, which is then operated on by the processor subunits. However, this arrangement results in the buses between the memories 250a, 250b, 250c, and 250d and the processing unit 210 becoming bottlenecks, similar to those described above with respect to the CPU.
Overview of the disclosed hardware chip
Fig. 3A is a schematic diagram depicting an embodiment of an exemplary hardware chip 300. The hardware chip 300 may include a distributed processor designed to alleviate the bottlenecks described above with respect to CPUs, GPUs, and other conventional processors. A distributed processor may include multiple processor subunits spatially distributed on a single substrate. Furthermore, in the distributed processors of the present disclosure, corresponding memory banks are also spatially distributed on the substrate. In some embodiments, a distributed processor may be associated with a set of instructions, and each of the processor subunits of the distributed processor may be responsible for performing one or more tasks included in the set of instructions.
As depicted in fig. 3A, hardware chip 300 may include a plurality of processor subunits, e.g., logic and control subunits 320a, 320b, 320c, 320d, 320e, 320f, 320g, and 320h. As further depicted in fig. 3A, each processor subunit may have a dedicated memory instance. For example, logic and control subunit 320a is operably connected to dedicated memory instance 330a, logic and control subunit 320b is operably connected to dedicated memory instance 330b, logic and control subunit 320c is operably connected to dedicated memory instance 330c, logic and control subunit 320d is operably connected to dedicated memory instance 330d, logic and control subunit 320e is operably connected to dedicated memory instance 330e, logic and control subunit 320f is operably connected to dedicated memory instance 330f, logic and control subunit 320g is operably connected to dedicated memory instance 330g, and logic and control subunit 320h is operably connected to dedicated memory instance 330h.
Although fig. 3A depicts each memory instance as a single memory bank, hardware chip 300 may include two or more memory banks as dedicated memory instances for processor subunits on hardware chip 300. Furthermore, although fig. 3A depicts each processor subunit as including both logic components and controls for a dedicated memory bank, hardware chip 300 may use controls for a memory bank that are at least partially separate from the logic components. Furthermore, as depicted in fig. 3A, two or more processor subunits and their corresponding memory components may be grouped into, for example, processing groups 310a, 310b, 310c, and 310d. A "processing group" may represent a spatial distinction on a substrate on which hardware chip 300 is formed. Thus, a processing group may include other controls for the memory banks in the group, such as controls 340a, 340b, 340c, and 340d. Additionally or alternatively, a "processing group" may represent a logical grouping for the purpose of compiling code for execution on hardware chip 300. Thus, a compiler (described further below) for hardware chip 300 may divide the entire set of instructions among the processing groups on hardware chip 300.
In addition, host 350 may provide instructions, data, and other inputs to and read outputs from hardware chip 300. Thus, a set of instructions may all be executed on a single die, such as a die hosting hardware chip 300. Indeed, the only communication outside the die may include the loading of instructions to hardware chip 300, any input sent to hardware chip 300, and any output read from hardware chip 300. Thus, all computation and memory operations may be performed on the die (on hardware chip 300) because the processor subunits of hardware chip 300 communicate with the dedicated memory banks of hardware chip 300.
FIG. 3B is a schematic diagram of an embodiment of another exemplary hardware chip 300'. Although depicted as an alternative to hardware chip 300, the architecture depicted in fig. 3B may be combined, at least in part, with the architecture depicted in fig. 3A.
As depicted in fig. 3B, hardware chip 300' may include a plurality of processor subunits, e.g., processor subunits 350a, 350b, 350c, and 350d. As further depicted in fig. 3B, each processor subunit may have multiple dedicated memory instances. For example, processor subunit 350a is operably connected to dedicated memory instances 330a and 330b, processor subunit 350b is operably connected to dedicated memory instances 330c and 330d, processor subunit 350c is operably connected to dedicated memory instances 330e and 330f, and processor subunit 350d is operably connected to dedicated memory instances 330g and 330h. Furthermore, as depicted in fig. 3B, the processor subunits and their corresponding memory components may be grouped into, for example, processing groups 310a, 310b, 310c, and 310d. As explained above, a "processing group" may represent a spatial distinction on the substrate on which hardware chip 300' is formed and/or a logical grouping for the purpose of compiling code for execution on hardware chip 300'.
As further depicted in fig. 3B, the processor subunits may communicate with each other over buses. For example, as shown in fig. 3B, processor subunit 350a may communicate with processor subunit 350b via bus 360a, with processor subunit 350c via bus 360c, and with processor subunit 350d via bus 360f. Similarly, processor subunit 350b may communicate with processor subunit 350a via bus 360a (as described above), with processor subunit 350c via bus 360e, and with processor subunit 350d via bus 360d. Further, processor subunit 350c may communicate with processor subunit 350a via bus 360c (as described above), with processor subunit 350b via bus 360e (as described above), and with processor subunit 350d via bus 360b. Finally, processor subunit 350d may communicate with processor subunit 350a via bus 360f (as described above), with processor subunit 350b via bus 360d (as described above), and with processor subunit 350c via bus 360b (as described above). Those skilled in the art will appreciate that fewer buses may be used than depicted in fig. 3B. For example, bus 360e may be eliminated such that communications between processor subunits 350b and 350c pass through processor subunit 350a and/or 350d. Similarly, bus 360f may be eliminated such that communications between processor subunit 350a and processor subunit 350d pass through processor subunit 350b or 350c.
Furthermore, those skilled in the art will appreciate that architectures other than those depicted in fig. 3A and 3B may be used. For example, an array of processing groups, each having a single processor subunit and a memory instance, may be arranged on a substrate. The processor sub-units may additionally or alternatively form part of a controller for a corresponding dedicated memory bank, part of a controller for a memory pad for a corresponding dedicated memory, or the like.
In view of the above-described architecture, the hardware chips 300 and 300' may significantly improve the efficiency of memory-intensive tasks compared to conventional architectures. For example, database operations and artificial intelligence algorithms (such as neural networks) are examples of memory intensive tasks for which the conventional architecture is less efficient than the hardware chips 300 and 300'. Thus, hardware chips 300 and 300' may be referred to as database accelerator processors and/or artificial intelligence accelerator processors.
Configuring disclosed hardware chips
The hardware chip architecture described above may be configured for code execution. For example, each processor subunit may individually execute code (defining a set of instructions) separate from other processor subunits in the hardware chip. Thus, instead of relying on an operating system to manage multi-threaded processing or to use multi-tasking (which is concurrent rather than parallel), the hardware chip of the present disclosure may allow the processor subunits to operate in full parallel.
In addition to the fully parallel implementation described above, at least some of the instructions assigned to each processor subunit may overlap. For example, multiple processor subunits on a distributed processor may execute overlapping instructions as, for example, implementations of an operating system or other management software, while executing non-overlapping instructions to perform parallel tasks within the context of the operating system or other management software.
FIG. 4 depicts an exemplary process 400 for executing a general command by processing group 410. For example, processing group 410 may include a portion of a hardware chip of the present disclosure (e.g., hardware chip 300', or the like).
As depicted in fig. 4, the command may be sent to processor subunit 430 paired with dedicated memory instance 420. An external host (e.g., host 350) may send the command to processing group 410 for execution. Alternatively, host 350 may have sent an instruction set including the command for storage in memory instance 420, so that processor subunit 430 can retrieve the command from memory instance 420 and execute the retrieved command. Thus, the command may be executed by processing element 440, which is a general purpose processing element that may be configured to execute the received command. Further, processing group 410 may include controls 460 for memory instance 420. As depicted in fig. 4, control 460 may perform any reads and/or writes to memory instance 420 that are required by processing element 440 when executing the received command. After executing the command, processing group 410 may output the results of the command to, for example, an external host or to a different processing group on the same hardware chip.
In some embodiments, as depicted in fig. 4, processor subunit 430 may also include an address generator 450. An "address generator" may include a plurality of processing elements configured to determine addresses in one or more memory banks for performing reads and writes, and may also perform operations (e.g., additions, subtractions, multiplications, or the like) on data located at the determined addresses. For example, address generator 450 may determine the address for any read from or write to memory. In one example, address generator 450 may improve efficiency by overwriting a read value, once it is no longer needed, with a new value determined based on the command. Additionally or alternatively, address generator 450 may select an available address for storing the results of command execution. This may allow the results to be read at a later clock cycle, when it is convenient for the external host. In another example, address generator 450 may determine the addresses of reads and writes during a multi-cycle computation, such as a vector or matrix multiply-accumulate computation. Thus, address generator 450 may maintain or calculate the memory addresses for reading data and writing intermediate results of the multi-cycle computation, so that processor subunit 430 may continue processing without itself having to track these memory addresses.
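As an illustration only, the following Python sketch models the role described for address generator 450 during a multi-cycle multiply-accumulate; the function names and the flat address space are assumptions made for this example, not part of the disclosure.

```python
# A minimal behavioral model of the address-generator role described
# above (hypothetical, for illustration only): it supplies the read and
# write addresses for a multi-cycle multiply-accumulate over a vector,
# so the processing element never has to track memory addresses itself.

def mac_address_stream(vec_a_base, vec_b_base, acc_addr, length):
    """Yield (read_a, read_b, write) address triples, one per cycle."""
    for i in range(length):
        # Operand addresses advance linearly; the intermediate result is
        # written back to the same accumulator address each cycle.
        yield vec_a_base + i, vec_b_base + i, acc_addr

memory = {addr: float(addr % 7) for addr in range(64)}  # toy memory bank
memory[63] = 0.0                                        # accumulator slot
for ra, rb, w in mac_address_stream(0, 32, 63, 16):
    memory[w] += memory[ra] * memory[rb]                # the MAC itself
print(memory[63])
```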
Fig. 5 depicts an exemplary process 500 for executing specialized commands by the processing group 510. For example, processing group 510 may include a portion of a hardware chip of the present disclosure (e.g., hardware chip 300', or the like).
As depicted in FIG. 5, a specialized command (e.g., a multiply-accumulate command) may be sent to processing element 530 paired with dedicated memory instance 520. An external host (e.g., host 350) may send the command to processing element 530 for execution. The command may then be executed, upon a given signal from the host, by processing element 530, which is a specialized processing element configurable to execute particular commands (including the received command). Alternatively, processing element 530 may retrieve the command from memory instance 520 for execution. Thus, in the example of fig. 5, processing element 530 is a multiply-accumulate (MAC) circuit configured to execute MAC commands received from an external host or retrieved from memory instance 520. After executing the command, processing group 510 may output the result of the command to, for example, an external host or to a different processing group on the same hardware chip. Although depicted with respect to a single command and a single result, multiple commands may be received or retrieved and executed, and multiple results may be combined on processing group 510 prior to output.
Although depicted as MAC circuitry in fig. 5, additional or alternative specialized circuitry may be included in processing group 510. For example, a MAX read command (which returns the maximum value of a vector), a MAX0 read command (a common function, also termed a rectifier or ReLU, which returns the vector with each element clamped to a minimum of 0), or the like may be implemented.
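For reference, the semantics of the specialized commands named above can be expressed in a few lines of Python; in hardware these would be dedicated circuits, not software loops, and the function names here are illustrative.

```python
# Hedged reference semantics for the specialized commands named above,
# expressed in plain Python for clarity only.

def mac(vector_a, vector_b, acc=0.0):
    """Multiply-accumulate: acc + sum(a_i * b_i)."""
    for a, b in zip(vector_a, vector_b):
        acc += a * b
    return acc

def max_read(vector):
    """MAX read: return the maximum element of the vector."""
    return max(vector)

def max0_read(vector):
    """MAX0 read (rectifier/ReLU): clamp each element to >= 0."""
    return [max(x, 0) for x in vector]

print(mac([1, 2, 3], [4, 5, 6]))   # 32.0
print(max_read([-3, 7, 2]))        # 7
print(max0_read([-3, 7, -2]))      # [0, 7, 0]
```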
Although depicted separately, the general processing group 410 of FIG. 4 and the specialized processing group 510 of FIG. 5 may be combined. For example, a general-purpose processor subunit may be coupled to one or more specialized processor subunits to form a processor subunit. Thus, a general-purpose processor subunit may be used for all instructions that may not be executed by one or more specialized processor subunits.
Those skilled in the art will appreciate that neural network implementations and other memory-intensive tasks may be handled by specialized logic circuits. For example, database queries, packet inspection, string comparison, and other functions may increase in efficiency if executed by the hardware chips described herein.
Memory-based architecture for distributed processing
On a hardware chip consistent with the present disclosure, a dedicated bus may transfer data between processor subunits on the chip and/or between the processor subunits and their corresponding dedicated memory banks. The use of a dedicated bus may reduce arbitration costs because competing requests are not possible or are easily avoided using software rather than hardware.
Fig. 6 schematically depicts a processing group 600. Processing group 600 may be used within a hardware chip (e.g., hardware chip 300', or the like). Processor subunit 610 may be connected to a memory 620 via a bus 630. Memory 620 may include Random Access Memory (RAM) elements that store data and code for execution by processor subunit 610. In some embodiments, memory 620 may be an N-way memory (where N is a number equal to or greater than 1 denoting the number of segments, or ways, in interleaved memory 620). Because processor subunit 610 is coupled via bus 630 to memory 620, which is dedicated to processor subunit 610, N may be kept relatively small without compromising performance. This represents an improvement over conventional multi-way register files or caches, where a smaller N generally results in lower performance, and a larger N generally results in large area and power penalties.
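One plausible mapping for such an N-way interleaved memory is sketched below; this is an assumption for illustration, as the disclosure does not fix a particular interleaving scheme. Consecutive addresses rotate across the N ways so that sequential accesses land in different ways and can be pipelined.

```python
# Illustrative interleaving for an N-way memory (assumed scheme):
# the low bits of the address select the way, the remaining bits
# select the row within that way.

def interleave(address, n_ways):
    way = address % n_ways       # which way services this address
    offset = address // n_ways   # row within that way
    return way, offset

for addr in range(8):
    print(addr, interleave(addr, n_ways=2))
```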
The size of memory 620, the number of ways, and the width of bus 630 may be adjusted according to, for example, the size of the data involved in one or more tasks, in order to meet the requirements of the tasks and applications implemented using the system of processing group 600. Memory element 620 may include one or more types of memory known in the art, e.g., volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments, a portion of memory element 620 may comprise a first memory type, while another portion may comprise another memory type. For example, the code region of memory element 620 may comprise ROM elements, while the data region of memory element 620 may comprise DRAM elements. Another example of such partitioning is to store the weights of a neural network in flash memory while storing the data used for the calculations in DRAM.
The processing element 640 that executes received or stored code may comprise a general purpose processing element and thus be flexible and capable of performing a wide variety of processing operations, in accordance with some embodiments of the present disclosure. However, when comparing the power consumed during execution of a particular operation, non-application-specific circuitry typically consumes more power than operation-specific circuitry. Thus, when performing certain complex arithmetic calculations, processing element 640 may consume more power and perform less efficiently than dedicated hardware would. Therefore, according to some embodiments, a controller of processing element 640 may be designed to perform specific operations (e.g., addition or "move" operations).
In an embodiment of the present disclosure, certain operations may be performed by one or more accelerators 650. Each accelerator may be dedicated and programmed to perform a particular computation, such as a multiplication, a floating point vector operation, or the like. By using accelerators, the average power consumed per computation per processor subunit may be reduced, and the computational throughput thereby increased. Accelerator 650 may be selected according to the application that the system is designed to implement (e.g., execution of a neural network, execution of a database query, or the like). Accelerator 650 may be configured by processing element 640 and may operate in tandem with it to reduce power consumption and accelerate computations. The accelerators may additionally or alternatively be used to transfer data between the memory of processing group 600 and the MUX/DEMUX/input/output ports (e.g., MUX 660 and DEMUX 670), for example functioning as intelligent Direct Memory Access (DMA) peripherals.
The accelerator 650 may be configured to perform a variety of functions. For example, one accelerator may be configured to perform 16-bit floating point calculations or 8-bit integer calculations commonly used in neural networks. Another example of an accelerator function is a 32-bit floating point calculation commonly used during the training phase of neural networks. Yet another example of an accelerator function is query processing, such as for use in a database. In some embodiments, the accelerator 650 may include specialized processing elements to perform these functions and/or may be configured so that it may be modified based on configuration data stored on the memory element 620.
The accelerator 650 may additionally or alternatively be configured via a configurable, scripted list of memory moves in order to time the movement of data to/from memory 620, to/from other accelerators, and/or to/from the inputs/outputs. Thus, as explained further below, all data movement within a hardware chip using the processing group 600 may use software synchronization rather than hardware synchronization. For example, an accelerator in one processing group (e.g., group 600) may transfer data from its input to its accelerator every tenth cycle and then output the data in the next cycle, thereby streaming information from the memory of the processing group to another memory.
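A minimal sketch of such a scripted list of timed memory moves follows. The entry format and names are assumptions made for illustration, but the idea matches the description above: every transfer fires at a cycle fixed in advance by software, so no hardware synchronization is needed.

```python
# Illustrative sketch of a scripted memory-move list: each entry says
# on which cycle a move fires, its source, and its destination, so data
# movement is synchronized purely by the software schedule.

script = [
    # (cycle, source, destination)
    (10, "memory_620", "accelerator_in"),
    (11, "accelerator_out", "output_port"),
    (20, "memory_620", "accelerator_in"),
    (21, "accelerator_out", "output_port"),
]

def run(script, total_cycles):
    moves_by_cycle = {cycle: (src, dst) for cycle, src, dst in script}
    for cycle in range(total_cycles):
        if cycle in moves_by_cycle:
            src, dst = moves_by_cycle[cycle]
            print(f"cycle {cycle:3d}: move {src} -> {dst}")

run(script, total_cycles=25)
```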
As further depicted in fig. 6, in some embodiments, processing group 600 may also include at least one input multiplexer (MUX) 660 connected to its input ports and at least one output demultiplexer (DEMUX) 670 connected to its output ports. These MUXs/DEMUXs may be controlled by control signals (not shown) from processing element 640 and/or from one of accelerators 650, as determined by the current instruction being executed by processing element 640 and/or the operation being performed by an accelerator of accelerators 650. In some scenarios, processing group 600 may be required (according to predefined instructions from its code memory) to transfer data from its input ports to its output ports. Thus, in addition to each of the DEMUXs/MUXs being connected to processing element 640 and accelerators 650, one or more of the input MUXs (e.g., MUX 660) may also be directly connected via one or more buses to an output DEMUX (e.g., DEMUX 670).
The processing groups 600 of fig. 6 may be arrayed to form a distributed processor, e.g., as depicted in fig. 7A. The processing groups may be disposed on a substrate 710 to form an array. In some embodiments, the substrate 710 may comprise a semiconductor substrate such as silicon. Additionally or alternatively, the substrate 710 may include a circuit board, such as a flexible circuit board.
As depicted in fig. 7A, the substrate 710 may include a plurality of processing groups, such as processing group 600, disposed thereon. Thus, the substrate 710 includes a memory array that includes a plurality of banks, such as banks 720a, 720b, 720c, 720d, 720e, 720f, 720g, and 720h. In addition, the substrate 710 includes a processing array, which may include a plurality of processor subunits, such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h.
Further, as explained above, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. Thus, as depicted in fig. 7A, each subunit is associated with a corresponding dedicated memory bank, such as: processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processor subunit 730c is associated with memory bank 720c, processor subunit 730d is associated with memory bank 720d, processor subunit 730e is associated with memory bank 720e, processor subunit 730f is associated with memory bank 720f, processor subunit 730g is associated with memory bank 720g, and processor subunit 730h is associated with memory bank 720h.
To allow each processor subunit to communicate with its corresponding dedicated memory bank, the substrate 710 may include a first plurality of buses connecting one of the processor subunits to its corresponding dedicated memory bank. Thus, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730c to memory bank 720c, bus 740d connects processor subunit 730d to memory bank 720d, bus 740e connects processor subunit 730e to memory bank 720e, bus 740f connects processor subunit 730f to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. Further, to allow each processor subunit to communicate with other processor subunits, the substrate 710 may include a second plurality of buses connecting one of the processor subunits to another of the processor subunits. In the example of fig. 7A, bus 750a connects processor subunit 730a to processor subunit 730e, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730f, bus 750d connects processor subunit 730b to processor subunit 730c, bus 750e connects processor subunit 730c to processor subunit 730g, bus 750f connects processor subunit 730c to processor subunit 730d, bus 750g connects processor subunit 730d to processor subunit 730h, bus 750h connects processor subunit 730h to processor subunit 730g, bus 750i connects processor subunit 730g to processor subunit 730f, and bus 750j connects processor subunit 730f to processor subunit 730e.
Thus, in the example arrangement shown in fig. 7A, the plurality of processor subunits are arranged in at least one row and at least one column. The second plurality of buses connects each processor subunit to at least one neighboring processor subunit in the same row and to at least one neighboring processor subunit in the same column. The arrangement of fig. 7A may be referred to as a "partial block connection."
The configuration shown in fig. 7A may be modified to form a "full block connection." A full block connection includes additional buses connecting diagonal processor subunits. For example, the second plurality of buses may include additional buses between processor subunit 730a and processor subunit 730f, between processor subunit 730b and processor subunit 730e, between processor subunit 730b and processor subunit 730g, between processor subunit 730c and processor subunit 730f, between processor subunit 730c and processor subunit 730h, and between processor subunit 730d and processor subunit 730g.
A full block connection may be used for convolution calculations, in which data and results stored in nearby processor subunits are used. For example, during convolutional image processing, each processor subunit may receive a block of an image (such as a pixel or group of pixels). To compute the convolution result, each processor subunit may then obtain data from all eight adjacent processor subunits, each of which has received a corresponding block. In a partial block connection, data from diagonally adjacent processor subunits may instead be passed via the other adjacent processor subunits connected to that processor subunit. Thus, the distributed processor on the chip may be an artificial intelligence accelerator processor.
In a particular embodiment of convolution computation, an N×M image may be partitioned across multiple processor subunits. Each processor subunit may perform a convolution with an A×B filter on its corresponding block. To filter one or more pixels on a boundary between blocks, each processor subunit may require data from neighboring processor subunits having blocks that include pixels on the same boundary. Thus, the code generated for each processor subunit configures that subunit to compute the convolution and to fetch from the second plurality of buses whenever data from an adjacent subunit is needed. Corresponding commands to output data to the second plurality of buses are provided to the subunit to ensure proper timing of the desired data transfers.
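The following hedged sketch illustrates the tiling logic just described: each subunit determines which neighboring blocks hold the boundary ("halo") pixels it needs for an A×B filter. All names are illustrative assumptions, not part of the disclosure.

```python
# Illustrative halo bookkeeping for block-partitioned convolution:
# list the neighboring blocks whose border pixels a given block needs.

def halo_requests(block_row, block_col, filter_a, filter_b,
                  grid_rows, grid_cols):
    if filter_a <= 1 and filter_b <= 1:
        return []  # a 1x1 filter never crosses a block boundary
    needs = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == dc == 0:
                continue
            r, c = block_row + dr, block_col + dc
            if 0 <= r < grid_rows and 0 <= c < grid_cols:
                # Diagonal neighbors are reached directly only under a
                # full block connection; under a partial block connection
                # their data is relayed via a shared row/column neighbor.
                kind = "direct" if 0 in (dr, dc) else "diagonal"
                needs.append((r, c, kind))
    return needs

print(halo_requests(1, 1, filter_a=3, filter_b=3, grid_rows=2, grid_cols=4))
```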
The partial block connection of fig. 7A may be modified into an N-partial block connection. In this modification, the second plurality of buses further connects each processor subunit to processor subunits that are within a threshold distance of that processor subunit (e.g., within n processor subunits) in the four directions along which the buses of fig. 7A run (i.e., up, down, left, and right). A similar modification may be made to the full block connection (to yield an N-full block connection) such that the second plurality of buses further connects each processor subunit to processor subunits that are within a threshold distance of that processor subunit (e.g., within n processor subunits) in those four directions, in addition to the two diagonal directions.
Other arrangements are also possible. For example, in the arrangement shown in fig. 7B, bus 750a connects processor subunit 730a to processor subunit 730d, bus 750B connects processor subunit 730a to processor subunit 730B, bus 750c connects processor subunit 730B to processor subunit 730c, and bus 750d connects processor subunit 730c to processor subunit 730 d. Thus, in the example arrangement illustrated in fig. 7B, the plurality of processor subunits are arranged in a star pattern. A second plurality of buses connects each processor subunit to at least one adjacent processor subunit within the star pattern.
Other arrangements (not shown) are also possible. For example, a neighbor connection arrangement may be used such that multiple processor subunits are arranged in one or more lines (e.g., similar to the case depicted in fig. 7A). In an adjacent connection arrangement, a second plurality of buses connects each processor subunit to a processor subunit on the left in the same line, a processor subunit on the right in the same line, a processor subunit on both the left and right in the same line, and so on.
In another embodiment, an N-linear connection arrangement may be used. In an N-linear connection arrangement, a second plurality of buses connects each processor subunit to processor subunits that are within a threshold distance of that processor subunit (e.g., within N processor subunits). The N linear connection arrangement may be used with an array of lines (described above), a rectangular array (depicted in fig. 7A), an elliptical array (depicted in fig. 7B), or any other geometric array.
In yet another embodiment, an N-log connection arrangement may be used. In an N-log connection arrangement, a second plurality of buses connects each processor subunit to processor subunits that are within a threshold power-of-two distance of that processor subunit (e.g., within 2^n processor subunits). The N-log connection arrangement may be used with an array of lines (described above), a rectangular array (depicted in fig. 7A), an elliptical array (depicted in fig. 7B), or any other geometric array.
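To make the contrast concrete, the following sketch computes the neighbor sets wired up by the N-linear and N-log arrangements. It is illustrative only: it assumes subunits arranged in a single line and reads "N-log" as connections at power-of-two hop distances, which is one plausible interpretation of the description above.

```python
# Illustrative neighbor sets for a single line of `count` subunits.

def n_linear_neighbors(i, n, count):
    """All subunits within distance n of subunit i."""
    return [j for j in range(count) if j != i and abs(j - i) <= n]

def n_log_neighbors(i, n, count):
    """Subunits at power-of-two distances 1, 2, 4, ..., 2**n from i."""
    hops = [2 ** k for k in range(n + 1)]
    return sorted(j for d in hops for j in (i - d, i + d)
                  if 0 <= j < count)

print(n_linear_neighbors(8, 3, 16))  # [5, 6, 7, 9, 10, 11]
print(n_log_neighbors(8, 3, 16))     # [0, 4, 6, 7, 9, 10, 12]
```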
Any of the connection schemes described above may be combined for use in the same hardware chip. For example, full block connections may be used in one zone, while partial block connections may be used in another zone. In another embodiment, an N linear connection arrangement may be used in one zone and an N full block connection in another zone.
Instead of or in addition to a dedicated bus between the processor subunits of the memory chip, one or more shared buses may also be used to interconnect all of the processor subunits (or a subset of the processor subunits) of the distributed processor. Conflicts on the shared bus may still be avoided by clocking data transfers on the shared bus using code executed by the processor subunit, as explained further below. Instead of or in addition to a shared bus, a configurable bus may also be used to dynamically connect processor subunits to form groups of processor units connected to separate buses. For example, the configurable bus may include transistors or other mechanisms that may be controlled by the processor subunits to direct data transfers to selected processor subunits.
In both fig. 7A and 7B, the plurality of processor subunits of the processing array are spatially distributed among a plurality of discrete memory banks of the memory array. In other alternative embodiments (not shown), multiple processor subunits may be grouped together in one or more zones of the substrate and multiple memory banks may be grouped together in one or more other zones of the substrate. In some embodiments, a combination of spatial distribution and aggregation may be used (not shown). For example, one region of the substrate may include a cluster of processor sub-units, another region of the substrate may include a cluster of memory banks, and yet another region of the substrate may include a processing array distributed among the memory banks.
Those skilled in the art will recognize that arraying processing groups 600 on a substrate is not the only possible arrangement. For example, each processor subunit may be associated with at least two dedicated memory banks. Thus, processing groups 310a, 310b, 310c, and 310d of fig. 3B may be used in place of processing group 600, or in combination with processing group 600, to form the processing array and the memory array. Other processing groups (not shown) including, for example, three, four, or more than four dedicated memory banks may be used.
Each of the plurality of processor subunits may be configured to independently execute software code associated with a particular application program relative to other processor subunits included in the plurality of processor subunits. For example, as explained below, multiple sub-series of instructions may be grouped into machine code and provided to each processor subunit for execution.
In some embodiments, each dedicated memory bank includes at least one Dynamic Random Access Memory (DRAM). Alternatively, the memory banks may include a mix of memory types such as Static Random Access Memory (SRAM), DRAM, flash memory, or the like.
In conventional processors, data sharing between processor subunits is typically performed by shared memory. Shared memory typically requires a large portion of chip area and/or implements a bus that is managed by additional hardware, such as an arbiter. As described above, the bus creates a bottleneck. Furthermore, shared memory, which may be external to the chip, typically includes cache coherency mechanisms and more complex caches (e.g., L1 cache, L2 cache, and shared DRAM) in order to provide accurate and up-to-date data to the processor subunits. As explained further below, the dedicated bus depicted in fig. 7A and 7B allows for hardware chips without hardware management, such as an arbiter. Furthermore, using a dedicated memory as depicted in fig. 7A and 7B allows for elimination of complex cache layers and coherency mechanisms.
Instead, to allow each processor subunit to access data computed by other processor subunits and/or stored in memory banks dedicated to other processor subunits, buses are provided whose timing is controlled dynamically by code executed individually by each processor subunit. This allows most, if not all, conventionally used bus-management hardware to be eliminated. Furthermore, direct transfers over these buses replace complex caching mechanisms, reducing latency during memory reads and writes.
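The following toy simulation, an illustration rather than the disclosed hardware, shows how purely software-scheduled timing avoids conflicts on an arbiter-free bus: each subunit's compiled program simply knows the cycles in which it alone may drive the bus. All names are assumptions for the example.

```python
# Toy model of a software-timed bus: the bus is a bare shared variable
# with no arbiter; disjoint per-subunit transmit slots guarantee that
# no two drivers ever collide.

bus = None  # stands in for a plain wire; no arbitration hardware

def subunit_program(my_id, send_cycles, payload):
    """Return a cycle -> action table compiled for one subunit."""
    return {cycle: (my_id, payload) for cycle in send_cycles}

programs = [
    subunit_program(0, send_cycles=[0, 2], payload="a"),
    subunit_program(1, send_cycles=[1, 3], payload="b"),  # disjoint slots
]

for cycle in range(4):
    drivers = [p[cycle] for p in programs if cycle in p]
    assert len(drivers) <= 1, "schedule bug: bus conflict"
    if drivers:
        bus = drivers[0]
        print(f"cycle {cycle}: subunit {bus[0]} drives {bus[1]!r}")
```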
Memory-based processing array
As depicted in fig. 7A and 7B, the memory chips of the present disclosure can operate independently. Alternatively, the memory chips of the present disclosure may be operatively connected with one or more additional integrated circuits such as a memory device (e.g., one or more DRAM banks), a system-on-a-chip, a Field Programmable Gate Array (FPGA), or other processing and/or memory chip. In these embodiments, the tasks in the series of instructions executed by the architecture may be divided (e.g., by a compiler, as described below) between the processor subunit of the memory chip and any processor subunit of the additional integrated circuit. For example, other integrated circuits may include a host (e.g., host 350 of FIG. 3A) that inputs instructions and/or data to the memory chip and receives outputs therefrom.
To interconnect the memory chip of the present disclosure with one or more additional integrated circuits, the memory chip may include a memory interface, such as one that complies with any of the Joint Electron Device Engineering Council (JEDEC) standards or variants thereof. One or more additional integrated circuits may then be connected to the memory interface. Thus, if the one or more additional integrated circuits are connected to multiple memory chips of the present disclosure, data may be shared between the memory chips via the one or more additional integrated circuits. Additionally or alternatively, the one or more additional integrated circuits can include a bus to connect to a bus on the memory chip of the present disclosure, such that the one or more additional integrated circuits can execute code in series with the memory chip of the present disclosure. In these embodiments, the one or more additional integrated circuits further assist in distributed processing, even though the additional integrated circuits may be on different substrates than the memory chips of the present disclosure.
Further, the memory chips of the present disclosure may be arrayed so as to form an array of distributed processors. For example, one or more buses may connect memory chip 770a to an additional memory chip 770b, as depicted in fig. 7C. In the embodiment of fig. 7C, memory chip 770a includes processor subunits and one or more corresponding memory banks dedicated to each processor subunit, such as: processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processor subunit 730e is associated with memory bank 720c, and processor subunit 730f is associated with memory bank 720d. A bus connects each processor subunit to its corresponding memory bank. Thus, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730e to memory bank 720c, and bus 740d connects processor subunit 730f to memory bank 720d. In addition, bus 750a connects processor subunit 730a to processor subunit 730e, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730f, and bus 750d connects processor subunit 730e to processor subunit 730f. Other arrangements of memory chip 770a may be used, for example, as described above.
Similarly, memory chip 770b includes processor subunits and one or more corresponding memory banks dedicated to each processor subunit, such as: processor subunit 730c is associated with memory bank 720e, processor subunit 730d is associated with memory bank 720f, processor subunit 730g is associated with memory bank 720g, and processor subunit 730h is associated with memory bank 720h. A bus connects each processor subunit to its corresponding memory bank. Thus, bus 740e connects processor subunit 730c to memory bank 720e, bus 740f connects processor subunit 730d to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. Furthermore, bus 750g connects processor subunit 730c to processor subunit 730g, bus 750h connects processor subunit 730d to processor subunit 730h, bus 750i connects processor subunit 730c to processor subunit 730d, and bus 750j connects processor subunit 730g to processor subunit 730h. Other arrangements of memory chip 770b may be used, for example, as described above.
The processor subunits of memory chips 770a and 770b may be connected using one or more buses. Thus, in the embodiment of fig. 7C, bus 750e may connect processor subunit 730b of memory chip 770a with processor subunit 730c of memory chip 770b, and bus 750f may connect processor subunit 730f of memory chip 770a with processor subunit 730c of memory chip 770b. For example, bus 750e may serve as an input bus to memory chip 770b (and thus as an output bus for memory chip 770a), while bus 750f may serve as an input bus to memory chip 770a (and thus as an output bus for memory chip 770b), or vice versa. Alternatively, buses 750e and 750f may both serve as bidirectional buses between memory chips 770a and 770b.
Thus, although depicted using buses 750e and 750f, architecture 760 may include fewer or additional buses. For example, a single bus between processor subunits 730b and 730c or between processor subunits 730f and 730c may be used. Alternatively, additional buses may be used, such as between processor subunits 730b and 730d, between processor subunits 730f and 730d, or the like.
Furthermore, although described above as a single memory chip connected to additional integrated circuits, multiple memory chips may also be connected to one another using buses as explained above. For example, as depicted in the embodiment of fig. 7C, memory chips 770a, 770b, 770c, and 770d are connected in an array. Similar to the memory chips described above, each of these memory chips includes processor subunits and dedicated memory banks. Therefore, the description of those components is not repeated here.
In the embodiment of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in a loop. Thus, bus 750a connects memory chips 770a and 770d, bus 750c connects memory chips 770a and 770b, bus 750e connects memory chips 770b and 770c, and bus 750g connects memory chips 770c and 770d. Although the memory chips 770a, 770b, 770c, and 770d may be connected using full block connections, partial block connections, or other connection arrangements, the loop arrangement of FIG. 7C allows for fewer pin connections between the memory chips 770a, 770b, 770c, and 770d.
Relatively large memory
Embodiments of the present disclosure may use dedicated memories that are relatively large in size compared to the shared memories of conventional processors. The use of dedicated memories rather than shared memories allows efficiency gains to continue without tapering off as memory size grows. This allows memory-intensive tasks, such as neural network processing and database queries, to be performed more efficiently than in conventional processors, for which the efficiency gains of increasing shared memory diminish due to the von Neumann bottleneck.
For example, in a distributed processor of the present disclosure, a memory array disposed on the substrate of the distributed processor may include a plurality of discrete memory banks, each having a capacity greater than one megabyte, and a processing array disposed on the substrate may include a plurality of processor subunits. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of the plurality of discrete memory banks. In some embodiments, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. By using at least one megabyte of dedicated memory per processor subunit, rather than a shared cache of only a few megabytes as in large CPUs and GPUs, the distributed processors of the present disclosure achieve efficiencies that are not possible in conventional systems due to the von Neumann bottleneck.
Different memories may be used as the dedicated memories. For example, each dedicated memory bank may comprise at least one DRAM bank. Alternatively, each dedicated memory bank may comprise at least one SRAM bank. In other embodiments, different types of memory may be combined on a single hardware chip.
As explained above, each dedicated memory bank may be at least one megabyte in size. The dedicated memory banks may all be the same size, or at least two of the plurality of memory banks may have different sizes.
Further, as described above, the distributed processor may include: a first plurality of buses each connecting one of the plurality of processor subunits to a corresponding dedicated memory bank; and a second plurality of buses each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
Synchronization using software
As explained above, the hardware chip of the present disclosure may manage data transfer using software rather than hardware. In particular, because the timing of transfers on the bus, reads and writes to memory, and computations by the processor subunit is set by a subsequence of instructions executed by the processor subunit, the hardware chip of the present disclosure may execute code to prevent conflicts on the bus. Thus, the hardware chip of the present disclosure may avoid hardware mechanisms conventionally used to manage data transfers (such as a network controller within the chip, packet parsers and packet forwarders between processor subunits, bus arbiters, multiple buses used to avoid arbitration, or the like).
If the hardware chip of the present disclosure transferred data conventionally, connecting N processor subunits with a bus would require bus arbitration or wide MUXs controlled by an arbiter. Instead, as described above, embodiments of the present disclosure may use a bus that is merely a wire, an optical cable, or the like between processor subunits, with the subunits individually executing code so as to avoid conflicts on the bus. Thus, embodiments of the present disclosure may save space on the substrate as well as material costs and efficiency losses (e.g., power and time consumption due to arbitration). The efficiency and space gains are even greater when compared to other architectures that use first-in-first-out (FIFO) controllers and/or mailboxes.
Further, as explained above, each processor subunit may also include one or more accelerators in addition to one or more processing elements. In some embodiments, the accelerators, rather than the processing elements, may read from and write to the bus. In these embodiments, additional efficiency may be obtained by allowing an accelerator to transfer data during the same cycle in which a processing element performs one or more computations. However, these embodiments require additional materials for the accelerators. For example, additional transistors may be required to fabricate the accelerators.
The code may also consider the internal behavior of the processor subunit (e.g., including the processing elements and/or accelerators that form part of the processor subunit), including timing and latency. For example, a compiler (as described below) may perform pre-processing that takes into account timing and latency when generating a subsequence of instructions that control data transfer.
In one embodiment, a plurality of processor subunits may be assigned the task of computing a neural network layer containing a plurality of neurons, each fully connected to a previous layer containing a larger plurality of neurons. Assuming the data of the previous layer is evenly spread among the plurality of processor subunits, one way to perform this calculation is to have each processor subunit transmit its portion of the previous layer's data to the main bus in turn, while each processor subunit multiplies the data on the bus by the weights of the corresponding neurons implemented by that subunit. Because each processor subunit computes more than one neuron, each processor subunit will transmit the data of the previous layer a number of times equal to the number of its neurons. Thus, the code of each processor subunit is not the same as the code of the other processor subunits, because each subunit transmits at different times.
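A hedged sketch of this scheduling idea follows; names such as emit_schedule are assumptions made for the example. The point it illustrates is that the generated programs differ between subunits only in when each one broadcasts.

```python
# Illustrative per-subunit schedule generation for the layer computation
# described above: each subunit takes a turn broadcasting its slice of
# the previous layer, and every subunit multiplies whatever is on the
# bus by its own neurons' weights.

def emit_schedule(num_subunits, neurons_per_subunit):
    schedules = {}
    for s in range(num_subunits):
        ops = []
        for turn in range(num_subunits):
            # Only one subunit drives the bus in any given turn.
            ops.append(f"broadcast slice[{s}]" if turn == s
                       else f"recv slice[{turn}]")
            for n in range(neurons_per_subunit):
                ops.append(f"acc[{n}] += weight[{turn}][{n}] * bus")
        schedules[s] = ops
    return schedules

for s, ops in emit_schedule(2, 2).items():
    print(f"subunit {s}: " + "; ".join(ops))
```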
In some embodiments, a distributed processor may include a substrate (e.g., a semiconductor substrate such as silicon and/or a circuit board such as a flexible circuit board) having: a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory groups; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, for example, in fig. 7A and 7B. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of a plurality of discrete memory banks. Moreover, as depicted in, for example, fig. 7A and 7B, the distributed processor may also include a plurality of buses, each of the plurality of buses connecting one of the plurality of processor sub-units to at least one other of the plurality of processor sub-units.
As explained above, the multiple buses may be controlled in software. Thus, the multiple buses may be free of sequential hardware logic components, such that data transfers between processor subunits and across corresponding ones of the multiple buses are not controlled by sequential hardware logic components. In one embodiment, the plurality of buses may be free of a bus arbiter, such that data transfers between the processor subunits and across corresponding ones of the plurality of buses are not controlled by the bus arbiter.
In some embodiments, as depicted, for example, in fig. 7A and 7B, the distributed processor may also include a second plurality of buses connecting one of the plurality of processor subunits to a corresponding dedicated memory bank. Similar to the plurality of buses described above, the second plurality of buses may be free of sequential hardware logic components such that data transfer between the processor subunits and the corresponding dedicated memory banks is not controlled by sequential hardware logic components. In one embodiment, the second plurality of buses may not contain a bus arbiter, such that data transfers between the processor subunits and the corresponding dedicated memory banks are not controlled by the bus arbiter.
As used herein, the phrase "free of" does not necessarily imply the absolute absence of components such as sequential hardware logic components (e.g., bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like). Such components may still be included in a hardware chip described as "free of" those components. Instead, the phrase "free of" refers to the function of the hardware chip; that is, a hardware chip that is "free of" sequential hardware logic components controls the timing of its data transfers without using any sequential hardware logic components that may be included therein. For example, such a hardware chip executes code that includes sub-series of instructions controlling data transfers between the processor subunits of the hardware chip, even if the hardware chip includes sequential hardware logic components as a secondary precaution against conflicts due to errors in the executed code.
As explained above, the plurality of buses may include at least one of wires or optical fibers between corresponding ones of the plurality of processor subunits. Thus, in one embodiment, a distributed processor without sequential hardware logic components may include only wires or fibers, without a bus arbiter, arbitration tree, FIFO controller, mailbox, or the like.
In some embodiments, the plurality of processor subunits are configured to transfer data across at least one of the plurality of buses in accordance with code executed by the plurality of processor subunits. Thus, as explained below, a compiler may organize subseries of instructions, each containing code that is executed by a single processor subunit. The sub-series of instructions may indicate when the processor subunit is to transfer data onto one of the buses and when to retrieve data from the bus. When the sub-series is executed in a cascaded fashion across distributed processors, the timing of transfers between processor subunits may be controlled by instructions included in the sub-series for transfer and retrieval. Thus, the code specifies the timing of data transfer across at least one of the plurality of buses. A compiler may produce code to be executed by a single processor subunit. Additionally, a compiler may generate code to be executed by a group of processor sub-units. In some cases, the compiler may collectively treat all processor subunits as one hyper-processor (e.g., a distributed processor) and the compiler may generate code for execution by the hyper-processor/distributed processor defined thereby.
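As an illustration of this compilation step (all names assumed; this is not the disclosed compiler), the sketch below splits a flat series of instructions into per-subunit sub-series and lowers every cross-subunit data dependency into a send/recv pair sharing a cycle number, which supplies the timing that replaces a bus arbiter.

```python
# Illustrative splitting of one instruction series into per-subunit
# sub-series with matched, cycle-stamped send/recv pairs.

def split_series(series, owner_of):
    """series: list of (instr_id, subunit, deps); owner_of: id -> subunit."""
    sub_series = {}
    cycle = 0
    for instr_id, subunit, deps in series:
        for dep in deps:
            src = owner_of[dep]
            if src != subunit:  # value crosses subunits: schedule a transfer
                sub_series.setdefault(src, []).append(f"@{cycle} send {dep}")
                sub_series.setdefault(subunit, []).append(f"@{cycle} recv {dep}")
                cycle += 1
        sub_series.setdefault(subunit, []).append(f"@{cycle} exec {instr_id}")
        cycle += 1
    return sub_series

series = [("t0", 0, []), ("t1", 1, ["t0"]), ("t2", 0, ["t1"])]
print(split_series(series, owner_of={"t0": 0, "t1": 1, "t2": 0}))
```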
As explained above and as depicted in fig. 7A and 7B, the plurality of processor sub-units may be spatially distributed among a plurality of discrete memory banks within the memory array. Alternatively, multiple processor subunits may be grouped in one or more zones of a substrate and multiple memory groups may be grouped in one or more other zones of the substrate. In some embodiments, a combination of spatial distribution and aggregation may be used, as explained above.
In some embodiments, a distributed processor may include a substrate (e.g., a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board) having a memory array disposed thereon, the memory array including a plurality of discrete memory groups. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, for example, in fig. 7A and 7B. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of a plurality of discrete memory banks. Moreover, as depicted in, for example, fig. 7A and 7B, the distributed processor may also include a plurality of buses, each of the plurality of buses connecting one of the plurality of processor subunits to a corresponding dedicated memory bank of the plurality of discrete memory banks.
As explained above, the multiple buses may be controlled in software. Thus, the plurality of buses may be free of sequential hardware logic components such that data transfers between the processor subunit and a corresponding dedicated discrete memory bank of the plurality of discrete memory banks and across a corresponding one of the plurality of buses are not controlled by the sequential hardware logic components. In one embodiment, the plurality of buses may be free of a bus arbiter, such that data transfers between the processor subunits and across corresponding ones of the plurality of buses are not controlled by the bus arbiter.
In some embodiments, as depicted, for example, in fig. 7A and 7B, the distributed processor may also include a second plurality of buses, each connecting one of the plurality of processor subunits to at least one other of the plurality of processor subunits. Similar to the plurality of buses described above, the second plurality of buses may be free of sequential hardware logic components, such that data transfers between processor subunits across the second plurality of buses are not controlled by sequential hardware logic components. In one embodiment, the second plurality of buses may be free of a bus arbiter, such that data transfers between processor subunits across the second plurality of buses are not controlled by a bus arbiter.
In some embodiments, the distributed processor may use a combination of software timing components and hardware timing components. For example, a distributed processor may include a substrate (e.g., a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board) having disposed thereon a memory array including a plurality of discrete memory banks. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, for example, in fig. 7A and 7B. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of the plurality of discrete memory banks. Moreover, as depicted in, for example, fig. 7A and 7B, the distributed processor may also include a plurality of buses, each of the plurality of buses connecting one of the plurality of processor subunits to at least one other of the plurality of processor subunits. Furthermore, as explained above, the plurality of processor subunits may be configured to execute software that controls the timing of data transfers across the plurality of buses to avoid collisions with data transfers on at least one of the plurality of buses. In this embodiment, software may control the timing of the data transfers, but the transfers themselves may be controlled, at least in part, by one or more hardware components.
In these embodiments, the distributed processor may further include a second plurality of buses connecting one of the plurality of processor subunits to a corresponding dedicated memory bank. Similar to the plurality of buses described above, the plurality of processor subunits may be configured to execute software that controls the timing of data transfers across the second plurality of buses to avoid collisions with data transfers on at least one of the second plurality of buses. In this embodiment, software may control the timing of the data transfer, as explained above, but the transfer itself may be controlled at least in part by one or more hardware components.
Code partitioning
As explained above, the hardware chips of the present disclosure may execute code in parallel across processor subunits included on a substrate forming the hardware chip. In addition, the hardware chips of the present disclosure may perform multitasking. For example, a hardware chip of the present disclosure may perform regional multitasking, in which one group of processor subunits of the hardware chip performs one task (e.g., audio processing) while another group of processor subunits of the hardware chip performs another task (e.g., image processing). In another embodiment, a hardware chip of the present disclosure may perform time-sequential multitasking, in which one or more processor subunits of the hardware chip perform one task during a first time period and another task during a second time period. A combination of regional and time-sequential multitasking may also be used, such that one task may be assigned to a first group of processor subunits during a first time period while another task is assigned to a second group of processor subunits during the first time period, after which a third task may be assigned to the processor subunits included in the first and second groups during a second time period.
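For illustration only (the task names, group sizes, and periods below are invented for the example and are not part of the disclosure), such a combined regional and time-sequential schedule can be expressed as a mapping from time periods to assignments of tasks onto groups of processor subunits:

```python
# Illustrative schedule combining regional and time-sequential multitasking.
# Subunits 0-3 form one group and subunits 4-7 another (assumed sizes).
schedule = {
    # period 0: regional multitasking -- two tasks run side by side
    0: {"audio_processing": [0, 1, 2, 3], "image_processing": [4, 5, 6, 7]},
    # period 1: time-sequential multitasking -- a third task takes over
    # the subunits of both earlier groups
    1: {"matrix_multiply": [0, 1, 2, 3, 4, 5, 6, 7]},
}

for period, tasks in schedule.items():
    for task, subunits in tasks.items():
        print(f"period {period}: {task} on subunits {subunits}")
```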
To organize the machine code for execution on the memory chips of the present disclosure, the machine code may be divided among the processor subunits of the memory chip. For example, a processor on a memory chip may include a substrate and a plurality of processor subunits disposed on the substrate. The memory chip may also include a corresponding plurality of memory banks disposed on the substrate, each of the plurality of processor subunits being connected to at least one dedicated memory bank not shared by any other of the plurality of processor subunits. Each processor subunit on the memory chip may be configured to execute a series of instructions independently of the other processor subunits. Each series of instructions may be executed by configuring one or more general processing elements of the processor subunit according to code defining the series of instructions and/or by activating one or more special processing elements (e.g., one or more accelerators) of the processor subunit according to a sequence provided in the code defining the series of instructions.
Thus, each series of instructions may define a series of tasks to be performed by a single processor subunit. A single task may include instructions within an instruction set defined by the architecture of one or more processing elements in the processor subunit. For example, the processor subunit may include particular registers, and a single task may push data onto a register, fetch data from a register, perform an arithmetic function on data within a register, perform a logical operation on data within a register, or the like. Further, the processor subunits may be configured for any number of operands, such as a 0-operand processor subunit (also referred to as a "stack machine"), a 1-operand processor subunit (also referred to as an accumulator machine), a 2-operand processor subunit (such as a reduced instruction set computer (RISC)), a 3-operand processor subunit (such as a complex instruction set computer (CISC)), or the like. In another embodiment, the processor subunit may include one or more accelerators, and a single task may activate an accelerator to perform a particular function, such as a MAC function, a MAX-0 function, or the like.
The series of instructions may also include tasks for reading from and writing to the dedicated memory banks of the memory chip. For example, a task may include writing a piece of data to a memory bank dedicated to the processor subunit performing the task, reading a piece of data from a memory bank dedicated to the processor subunit performing the task, or the like. In some embodiments, the reading and writing may be performed by the processor subunit in tandem with a controller of the memory bank. For example, the processor subunit may perform a read or write task by sending a control signal to the controller to perform the read or write. In some embodiments, the control signal may include a specific address for the read or write. Alternatively, the processor subunit may listen to the memory controller for an address selected as available for the read or write.
Additionally or alternatively, the reading and writing may be performed by one or more accelerators in tandem with the controller of the memory bank. For example, the accelerator may generate control signals for a memory controller, similar to how the processor subunits generate the control signals, as described above.
In any of the embodiments described above, an address generator may also be used to direct reads and writes to specific addresses of a memory bank. For example, the address generator may include a processing element configured to generate memory addresses for reading and writing. The address generator may be configured to generate addresses in a way that improves efficiency, for example by writing results of later calculations to the same addresses as results of earlier calculations that are no longer needed. Thus, the address generator may generate control signals for the memory controller in response to commands from, or in tandem with, the processor subunit (e.g., from processing elements included therein or from one or more accelerators therein). Additionally or alternatively, the address generator may generate addresses based on a configuration or on registers, for example by generating a nested loop structure that iterates over certain addresses in the memory in a fixed pattern.
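A minimal sketch of such an address generator follows, assuming a simple two-level, row-major nested-loop pattern; the base address, strides, and loop bounds are illustrative assumptions:

```python
# Illustrative address generator: walks a nested-loop pattern over a
# memory bank and recycles the address of a result that is no longer needed.

def nested_loop_addresses(base, outer, inner, stride_outer, stride_inner):
    """Yield addresses for a two-level nested loop (row-major order)."""
    for i in range(outer):
        for j in range(inner):
            yield base + i * stride_outer + j * stride_inner

# Generate read addresses for an assumed 4x3 tile stored at base 0x100.
reads = list(nested_loop_addresses(0x100, outer=4, inner=3,
                                   stride_outer=16, stride_inner=1))

# Reuse: once the earlier result at a given address is dead, the generator
# can emit the same address for a later result instead of fresh storage.
dead_result_addr = reads[0]
write_addr = dead_result_addr  # later computation overwrites the stale value
print([hex(a) for a in reads[:4]], hex(write_addr))
```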
In some embodiments, each series of instructions may include a set of machine code that defines a corresponding series of tasks. Thus, the series of tasks described above may be encapsulated within machine code that includes the series of instructions. In some embodiments, as explained below with respect to fig. 8, the series of tasks may be defined by a compiler that is configured to distribute higher-order series of tasks among a plurality of logic circuits as a plurality of series of tasks. For example, a compiler may generate multiple series of tasks based on a higher order series of tasks such that the processor subunits that execute each corresponding series of tasks in tandem perform the same functions as outlined by the higher order series of tasks.
As explained further below, the higher-order series of tasks may include a set of instructions written in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise a lower order series of tasks, each of which comprises a set of instructions written in machine code.
As explained above with respect to fig. 7A and 7B, the memory chip may also include a plurality of buses, each connecting one of the plurality of processor subunits to at least one other of the plurality of processor subunits. Furthermore, as explained above, data transfers over multiple buses may be controlled using software. Thus, the transfer of data across at least one of the plurality of buses may be predefined by the series of instructions included in the processor subunit connected to the at least one of the plurality of buses. Thus, one of the tasks included in the series of instructions may include outputting data to one of the buses or extracting data from one of the buses. These tasks may be performed by the processing elements of the processor sub-unit or by one or more accelerators included in the processor sub-unit. In the latter embodiment, the processor subunit may perform the calculations or send control signals to the corresponding memory banks in the same cycle during which the accelerator fetches or places data from or onto one of the buses.
In one embodiment, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a send task that includes a command for the processor subunit connected to at least one of the plurality of buses to write data to at least one of the plurality of buses. Additionally or alternatively, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a receive task that includes a command for the processor subunit connected to at least one of the plurality of buses to read data from at least one of the plurality of buses.
In addition to or instead of distributing code among processor subunits, data may be divided among the memory banks of the memory chip. For example, as explained above, a distributed processor on a memory chip may include a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of memory banks may be configured to store data independent of data stored in others of the plurality of memory banks, and each of the plurality of processor subunits may be connected to at least one dedicated memory bank among the plurality of memory banks. For example, each processor subunit may access one or more memory controllers dedicated to its one or more corresponding memory banks, and other processor subunits may not access these one or more memory controllers. Thus, the data stored in each memory bank may be unique to the processor subunit to which that bank is dedicated. Moreover, the data stored in each memory bank may be independent of the data stored in the other memory banks because no memory controllers need be shared between memory banks.
In some embodiments, as described below with respect to fig. 8, the data stored in each of the plurality of memory banks may be defined by a compiler configured to distribute the data among the plurality of memory banks. In addition, the compiler may be configured to distribute data defined in a higher-order series of tasks among a plurality of memory banks using a plurality of lower-order tasks distributed among corresponding processor subunits.
As explained further below, the higher-order series of tasks may include a set of instructions written in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise a lower order series of tasks, each of which comprises a set of instructions written in machine code.
As explained above with respect to fig. 7A and 7B, the memory chip may also include a plurality of buses, each connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks among the plurality of memory banks. Furthermore, as explained above, data transfers over multiple buses may be controlled using software. Thus, data transfer across a particular bus of the plurality of buses may be controlled by the corresponding processor subunit connected to the particular bus of the plurality of buses. Thus, one of the tasks included in the series of instructions may include outputting data to one of the buses or extracting data from one of the buses. As explained above, these tasks may be performed by (i) the processing elements of the processor subunit or (ii) one or more accelerators included in the processor subunit. In the latter embodiment, the processor subunits may perform computations or use the bus connecting the processor subunit to other processor subunits in the same cycle during which the accelerator fetches or places data on one of the buses connected to the corresponding dedicated memory bank or banks.
Thus, in one embodiment, the series of instructions included in a processor subunit connected to at least one of the plurality of buses may include a send task. The send task may include a command for the processor subunit connected to at least one of the plurality of buses to write data to the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Additionally or alternatively, the series of instructions included in a processor subunit connected to at least one of the plurality of buses may include a receive task. The receive task may include a command for the processor subunit connected to at least one of the plurality of buses to read data from the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Thus, the send and receive tasks in these embodiments may include control signals sent along the at least one of the plurality of buses to one or more memory controllers in the one or more corresponding dedicated memory banks. Further, the send task and the receive task may be performed by one portion of the processing subunit (e.g., by one or more accelerators thereof) concurrently with calculations or other tasks performed by another portion of the processing subunit (e.g., by one or more different accelerators thereof). An example of such concurrent execution is a MAC relay command, in which receiving, multiplying, and sending are performed in tandem.
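The MAC relay pattern can be sketched behaviorally as follows (the register names and the single-cycle model are assumptions for the example): in each modeled cycle, one portion of the subunit receives an operand from a bus while other portions multiply-accumulate the previously received operand and drive the running result onto the outgoing bus.

```python
# Behavioral sketch of a MAC relay: receive, multiply-accumulate, and send
# are performed in tandem by different parts of one processing subunit.

def mac_relay(incoming, weight):
    """Pipeline that multiplies each incoming value by `weight`,
    accumulates, and relays the running sum downstream."""
    recv_reg = None   # operand just taken off the inter-subunit bus
    acc = 0           # accumulator inside the subunit
    sent = []         # values placed on the outgoing bus
    for value in incoming + [None]:
        # These steps model concurrent hardware activity in one cycle:
        if recv_reg is not None:
            acc += recv_reg * weight   # one accelerator performs the MAC
            sent.append(acc)           # another accelerator drives the bus
        recv_reg = value               # receive the next operand off the bus
    return sent

print(mac_relay([1, 2, 3], weight=10))  # [10, 30, 60]
```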
In addition to distributing data among memory banks, particular portions of data may also be replicated across different memory banks. For example, as explained above, a distributed processor on a memory chip may include a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of processor subunits may be connected to at least one dedicated memory bank among the plurality of memory banks, and each of the plurality of memory banks may be configured to store data independent of data stored in others of the plurality of memory banks. In addition, at least some of the data stored in a particular memory bank of the plurality of memory banks may comprise a duplicate of data stored in at least another memory bank of the plurality of memory banks. For example, a number, string, or other type of data used in the series of instructions may be stored in multiple memory banks dedicated to different processor subunits, rather than being transferred from one memory bank to the other processor subunits in the memory chip.
In one embodiment, parallel string matching may use the data replication described above. For example, multiple strings may be compared to the same string. A conventional processor may compare each string of the multiple strings to the same string in sequence. On a hardware chip of the present disclosure, the same string may be replicated across the memory banks, so that the processor subunits may compare separate strings of the multiple strings to the replicated string in parallel.
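A sketch of this parallel string-matching scheme appears below; the thread pool merely stands in for genuinely parallel processor subunits, and the bank contents are invented for the example:

```python
# Illustrative parallel string matching with a replicated target string.
# Each "bank" holds its own copy of the target plus one candidate string,
# so every "subunit" can compare locally without cross-subunit transfers.

from concurrent.futures import ThreadPoolExecutor

TARGET = "pattern"
candidates = ["pattern", "pattren", "pattern", "nrettap"]

# Replicate the target into every bank alongside that bank's candidate.
banks = [{"target": TARGET, "candidate": c} for c in candidates]

def subunit_compare(bank):
    # Each subunit reads only its dedicated bank: no bus traffic needed.
    return bank["candidate"] == bank["target"]

with ThreadPoolExecutor(max_workers=len(banks)) as pool:
    results = list(pool.map(subunit_compare, banks))

print(results)  # [True, False, True, False]
```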
In some embodiments, as described below with respect to fig. 8, at least some of the data replicated across a particular one of the plurality of memory banks and at least another one of the plurality of memory banks is defined by a compiler configured to replicate the data across the memory banks. Further, the compiler may be configured to replicate at least some of the data using a plurality of lower-order tasks distributed among the corresponding processor subunits.
Replication of data may be applicable to a particular task that reuses the same portion of data across different computations. By copying these portions of data, different computations may be distributed among the processor subunits of the memory chip for parallel execution, and each processor subunit may store and access the portions of data in and from a dedicated memory bank (rather than pushing and fetching the portions of data across a bus connecting the processor subunits). In one embodiment, at least some of the data replicated across a particular one of the plurality of memory banks and at least another one of the plurality of memory banks may include a weight of the neural network. In this embodiment, each node in the neural network may be defined by at least one processor subunit among a plurality of processor subunits. For example, each node may contain machine code that is executed by at least one processor subunit that defines the node. In this embodiment, the duplication of the weights may allow each processor subunit to execute machine code to at least partially implement the corresponding node while only accessing one or more dedicated memory banks (rather than performing data transfers with other processor subunits). Since the timing of reads and writes to the dedicated memory bank is independent of other processor subunits, while the timing of data transfers between processor subunits requires timing synchronization (e.g., using software, as explained above), duplicating memory to avoid data transfers between processor subunits may further improve the efficiency of overall execution.
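To make the weight-duplication point concrete, the following sketch (with an invented network shape and invented values) stores a private replica of the shared weights in each node's dedicated bank, so that each node can be evaluated without any inter-subunit transfer:

```python
# Illustrative duplication of shared neural-network weights across the
# dedicated banks of the subunits that implement individual nodes.

shared_weights = [0.5, -0.25, 1.0]   # weights reused by several nodes

# Each subunit's dedicated bank receives its own replica of the weights
# plus that node's inputs, so evaluation never crosses subunit boundaries.
dedicated_banks = [
    {"weights": list(shared_weights), "inputs": [1.0, 2.0, 3.0]},
    {"weights": list(shared_weights), "inputs": [0.0, 4.0, -1.0]},
]

def node_forward(bank):
    # Dot product computed entirely out of the subunit's own bank.
    return sum(w * x for w, x in zip(bank["weights"], bank["inputs"]))

print([node_forward(b) for b in dedicated_banks])  # [3.0, -2.0]
```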
As explained above with respect to fig. 7A and 7B, the memory chip may also include a plurality of buses, each connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks among the plurality of memory banks. Furthermore, as explained above, data transfers over multiple buses may be controlled using software. Thus, data transfer across a particular bus of the plurality of buses may be controlled by the corresponding processor subunit connected to the particular bus of the plurality of buses. Thus, one of the tasks included in the series of instructions may include outputting data to one of the buses or extracting data from one of the buses. As explained above, these tasks may be performed by (i) the processing elements of the processor subunit or (ii) one or more accelerators included in the processor subunit. As further explained above, these tasks may include transmit tasks and/or receive tasks that include control signals that are transmitted along at least one of the plurality of buses to one or more memory controllers in one or more corresponding dedicated memory banks.
Fig. 8 depicts a flow diagram of a method 800 for compiling a series of instructions for execution on an exemplary memory chip of the present disclosure, e.g., as depicted in fig. 7A and 7B. The method 800 may be implemented by any conventional processor, whether general or special purpose.
The method 800 may be performed as part of a computer program forming a compiler. As used herein, a "compiler" refers to any computer program that converts a higher level language (e.g., procedural languages, such as C, FORTRAN, BASIC, or the like; object oriented languages, such as Java, C++, Pascal, Python, or the like; and the like) into a lower level language (e.g., assembly code, object code, machine code, or the like). A compiler may allow a human to program a series of instructions in a human-readable language and then convert the human-readable language into a machine-executable language.
At step 810, the processor may assign tasks associated with the series of instructions to different ones of the processor subunits. For example, the series of instructions may be divided into subgroups to be executed in parallel across the processor subunits. In one embodiment, a neural network may be divided into its nodes, and one or more nodes may be assigned to separate processor subunits. In this embodiment, each subgroup may include a plurality of nodes connected across different layers. Thus, a processor subunit may implement a node from a first layer of the neural network, a node from a second layer connected to the node from the first layer implemented by the same processor subunit, and the like. By assigning nodes based on their connections, data transfers between processor subunits may be reduced, which may result in increased efficiency, as explained above.
As explained above and as depicted in fig. 7A and 7B, the processor subunits may be spatially distributed among the plurality of memory banks disposed on the memory chip. Thus, the assignment of tasks may be at least partially a spatial partitioning as well as a logical partitioning.
At step 820, the processor may generate tasks to transfer data between pairs of processor subunits of the memory chip, each pair of processor subunits being connected by a bus. For example, as explained above, the data transfer may be controlled using software. Thus, the processor subunit may be configured to push data on the bus and to extract data on the bus at synchronized times. The generated tasks may thus include tasks for performing such synchronized pushing and fetching of data.
As explained above, step 820 may include preprocessing to account for internal behavior of the processor subunits, including timing and latency. For example, the processor may use known times and latencies of the processor subunits (e.g., time to push data to the bus, time to fetch data from the bus, latency between calculation and push or fetch, or the like) to ensure that the generated tasks are synchronized. Thus, data transfers including at least one push by one or more processor subunits and at least one fetch by one or more processor subunits may occur simultaneously without delay due to timing differences between processor subunits, latency of processor subunits, or the like.
At step 830, the processor may group the assigned and generated tasks into a plurality of groups of sub-series instructions. For example, the sub-series of instructions may each comprise a series of tasks for execution by a single processor subunit. Thus, each of the plurality of groups of sub-series instructions may correspond to a different processor subunit of the plurality of processor subunits. Thus, steps 810, 820 and 830 may result in dividing the series of instructions into multiple groups of sub-series of instructions. As explained above, step 820 can ensure that any data transfer between different groups is synchronized.
At step 840, the processor may generate machine code corresponding to each of the plurality of groups of sub-series instructions. For example, higher order code representing a sub-series of instructions may be converted into lower order code, such as machine code, that may be executed by a corresponding processor subunit.
At step 850, the processor may assign the generated machine code corresponding to each of the plurality of groups of sub-series of instructions to the corresponding processor subunit of the plurality of processor subunits in accordance with the partitioning. For example, the processor may tag each sub-series of instructions with an identifier of the corresponding processor subunit. Thus, when the sub-series of instructions are uploaded to the memory chip for execution (e.g., by the host 350 of FIG. 3A), each sub-series may configure the correct processor subunit.
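The overall flow of steps 810 through 850 can be summarized in the following compiler-shaped sketch; every stage is a stub with an invented placeholder policy, intended only to show how data flows through the method, not to serve as a working compiler:

```python
# Skeleton of method 800 (steps 810-850); each stage is a stub that shows
# the data flow, not a working compiler.

def assign_tasks(series, num_subunits):          # step 810
    """Split a series of instructions into per-subunit task lists."""
    groups = [[] for _ in range(num_subunits)]
    for i, task in enumerate(series):
        groups[i % num_subunits].append(task)    # placeholder policy
    return groups

def generate_transfer_tasks(groups):             # step 820
    """Insert synchronized push/fetch pairs between dependent subunits."""
    # A real implementation accounts for push/fetch latencies here.
    return groups

def group_into_sub_series(groups):               # step 830
    return {unit: tasks for unit, tasks in enumerate(groups)}

def to_machine_code(sub_series):                 # step 840
    return {unit: [f"OPCODE<{t}>" for t in tasks]
            for unit, tasks in sub_series.items()}

def tag_for_subunits(machine_code):              # step 850
    return [(unit, code) for unit, code in machine_code.items()]

series = ["load", "mul", "add", "store"]
print(tag_for_subunits(to_machine_code(group_into_sub_series(
    generate_transfer_tasks(assign_tasks(series, num_subunits=2))))))
```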
In some embodiments, assigning tasks associated with the series of instructions to different ones of the processor subunits may depend at least in part on spatial proximity between two or more of the processor subunits on the memory chip. For example, as explained above, efficiency may be improved by reducing the number of data transfers between processor subunits. Accordingly, the processor may minimize data transfers that move data across more than two of the processor subunits. Thus, the processor may use a known layout of the memory chip in conjunction with one or more optimization algorithms (such as a greedy algorithm) to assign sub-series to processor subunits in a manner that maximizes (at least locally) adjacent transfers and minimizes (at least locally) transfers to non-adjacent processor subunits.
The method 800 may include further optimizations for the memory chip of the present disclosure. For example, the processor may group data associated with the series of instructions based on the partitioning and assign the data to the memory banks according to the grouping. Thus, the memory banks may hold data for a sub-series of instructions assigned to each processor subunit to which each memory bank is dedicated.
In some embodiments, grouping the data may include determining at least a portion of the data that is replicated in two or more of the memory banks. For example, as explained above, some data may be used across more than one sub-series instruction. This data may be replicated across a memory bank dedicated to a plurality of processor subunits assigned different sub-series instructions. This optimization may further reduce data transfer across the processor subunits.
The output of the method 800 may be input to a memory chip of the present disclosure for execution. For example, a memory chip may include a plurality of processor subunits and a corresponding plurality of memory banks, each processor subunit connected to at least one memory bank dedicated to that processor subunit, and the processor subunits of the memory chip may be configured to execute the machine code generated by method 800. As explained above with respect to fig. 3A, the host 350 may input the machine code generated by the method 800 to the processor subunit for execution.
Subgroup and sub-controller
In a conventional memory bank, the controller is disposed at the bank level. Each bank includes a plurality of pads, which are typically arranged in a rectangular fashion but may be arranged in any geometric shape. Each pad includes a plurality of memory cells, which are also typically arranged in a rectangular fashion but may be arranged in any geometric shape. Each cell may store a single bit of data (e.g., depending on whether the cell is held at a high voltage or a low voltage).
Embodiments of this conventional architecture are depicted in fig. 9 and 10. As shown in fig. 9, at the bank level, a plurality of pads (e.g., pads 930-1, 930-2, 940-1, and 940-2) may form a bank 900. In a conventional rectangular organization, bank 900 may be controlled across global word lines (e.g., word line 950) and global bit lines (e.g., bit line 960). Thus, the row decoder 910 may select the correct word line based on an incoming control signal (e.g., a request to read from an address, a request to write to an address, or the like), and the global sense amplifiers 920 (and/or a global column decoder, not shown in fig. 9) may select the correct bit line based on the control signal. The amplifiers 920 may also amplify voltage levels from the selected cells during a read operation. Although depicted as using a row decoder for initial selection and amplifying along the columns, a bank may additionally or alternatively use a column decoder for initial selection and amplify along the rows.
Fig. 10 depicts an embodiment of a pad 1000. For example, the pad 1000 may form part of a memory bank, such as bank 900 of fig. 9. As depicted in fig. 10, a plurality of cells (e.g., cells 1030-1, 1030-2, and 1030-3) may form the pad 1000. Each cell may include a capacitor, a transistor, or other circuitry that stores at least one data bit. For example, a cell may include a capacitor that is charged to represent a "1" and discharged to represent a "0", or may include a flip-flop having a first state representing a "1" and a second state representing a "0". A conventional pad may contain, for example, 512 bits by 512 bits. In embodiments where the pad 1000 forms a portion of an MRAM, ReRAM, or the like, a cell may include a transistor, resistor, capacitor, or other mechanism for isolating ions or a portion of a material storing at least one data bit. For example, a cell may include electrolyte ions or a portion of chalcogenide glass having a first state representing "1" and a second state representing "0", or the like.
As further depicted in fig. 10, in a conventional rectangular organization, the pad 1000 may be controlled across local word lines (e.g., word line 1040) and local bit lines (e.g., bit line 1050). Thus, word line drivers (e.g., word line drivers 1020-1, 1020-2, ..., 1020-x) may control a selected word line to perform a read, write, or refresh based on a control signal (e.g., a request to read from an address, a request to write to an address, a refresh signal) from a controller associated with the memory bank of which the pad 1000 forms a portion. Furthermore, local sense amplifiers (e.g., local amplifiers 1010-1, 1010-2, ..., 1010-x) and/or a local column decoder (not shown in fig. 10) may control a selected bit line to perform a read, write, or refresh. The local sense amplifiers may also amplify voltage levels from the selected cells during a read operation. Although depicted as using word line drivers for initial selection and amplifying along the columns, a pad could alternatively use bit line drivers for initial selection and amplify along the rows.
As explained above, a large number of pads are duplicated to form a memory bank. Memory banks may be grouped to form a memory chip. For example, a memory chip may contain eight to thirty-two memory banks. Thus, pairing processor subunits with memory banks on a conventional memory chip may yield only eight to thirty-two processor subunits. Accordingly, embodiments of the present disclosure may include memory chips with an additional subgroup level. The memory chips of the present disclosure may thus include processor subunits paired with memory subsets that serve as dedicated memory banks, allowing a larger number of sub-processors and thereby achieving higher parallelism and performance for in-memory computations.
In some embodiments of the present disclosure, the global row decoder and global sense amplifiers of bank 900 may be replaced with subgroup controllers. Thus, rather than sending control signals to the global row decoder and global sense amplifiers of the memory bank, the controller of the memory bank may direct a control signal to the appropriate subgroup controller. This directing may be dynamically controlled or may be hardwired (e.g., via one or more logic gates). In some embodiments, fuses may be used to indicate, for each subgroup or pad, whether its controller blocks or passes the control signal to the appropriate subgroup or pad. In these embodiments, fuses may thus be used to deactivate a faulty subgroup.
In one of these embodiments, a memory chip may include a plurality of memory banks, each memory bank having a bank controller and a plurality of memory subsets, each memory subset having a subset row decoder and a subset column decoder to allow reads and writes to locations on the memory subset. Each subset may include a plurality of memory pads, each having a plurality of memory cells, and each pad may have an internal local row decoder, column decoder, and/or local sense amplifiers. The subset row decoder and subset column decoder may process read and write requests from the bank controller or from a subset processor subunit used for in-memory computations on the subset's memory, as described below. Additionally, each memory subset may further have a controller configured to determine whether to process read and write requests from the bank controller and/or forward them to a next level (e.g., the row and column decoders on a pad), or to block the requests, e.g., to allow an internal processing element or processor subunit to access the memory. In some embodiments, the bank controller may be synchronized to a system clock, while the subgroup controllers need not be synchronized to the system clock.
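A behavioral sketch of this forwarding decision follows; the field names, address ranges, and fuse flag are assumptions for the example. The subgroup controller either passes a bank-level request down toward the pad-level decoders or blocks it, for example when a fuse marks the subgroup as deactivated or when the local processor subunit currently holds the memory:

```python
# Illustrative subgroup controller: forwards, blocks, or errors on requests
# from the bank controller. Field names and the fuse flag are assumptions.

class SubgroupController:
    def __init__(self, addr_lo, addr_hi, fuse_ok=True):
        self.addr_lo, self.addr_hi = addr_lo, addr_hi
        self.fuse_ok = fuse_ok        # a blown fuse deactivates the subgroup
        self.busy = False             # set while the local subunit has access

    def handle(self, addr, op):
        if not self.fuse_ok:
            return "blocked: subgroup deactivated by fuse"
        if not (self.addr_lo <= addr <= self.addr_hi):
            return "blocked: address outside subgroup"
        if self.busy:
            return "error: subgroup in use by its processor subunit"
        return f"forwarded {op}@{hex(addr)} to pad row/column decoders"

ctrl = SubgroupController(addr_lo=0x0000, addr_hi=0x0FFF)
print(ctrl.handle(0x0040, "read"))
ctrl.busy = True
print(ctrl.handle(0x0040, "write"))  # conflict with local subunit access
```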
As explained above, the use of subgroups may allow a greater number of processor subunits to be included in the memory chip than if processor subunits were paired with the memory banks of a conventional chip. Thus, each subset may further have a processor subunit that uses the subset as a dedicated memory. As explained above, the processor subunits may include a RISC, a CISC, or another general-purpose processing subunit and/or may include one or more accelerators. In addition, the processor subunits may include address generators, as explained above. In any of the embodiments described above, each processor subunit may be configured to access the subset dedicated to it using the row decoder and column decoder dedicated to that subset, without using the bank controller. The processor subunit associated with a subset may also control the memory pads (including the decoders and the memory redundancy mechanisms described below) and/or determine whether read or write requests from an upper level (e.g., the bank level or the memory level) are forwarded and thus handled.
In some embodiments, the subgroup controller may further comprise a register to store the status of the subgroup. Thus, if the subgroup controller receives a control signal from a memory controller while the register indicates that the subgroup is in use, the subgroup controller may return an error. In embodiments where each subset also includes a processor subunit, the register may indicate an error if the processor subunit in the subset is accessing memory that conflicts with an external request from the memory controller.
FIG. 11 shows another embodiment of a memory bank using subgroup controllers. In the embodiment of fig. 11, bank 1100 has a row decoder 1110, a column decoder 1120, and a plurality of memory subsets (e.g., subsets 1170a, 1170b, and 1170c) having subgroup controllers (e.g., controllers 1130a, 1130b, and 1130c). The subgroup controllers may include address resolvers (e.g., resolvers 1140a, 1140b, and 1140c), which may determine whether to pass a request on to the subset controlled by the subgroup controller.
The subgroup controllers may also include one or more logic circuits (e.g., logic 1150a, 1150b, and 1150c). For example, a logic circuit including one or more processing elements may allow one or more operations, such as refreshing cells in the subset, clearing cells in the subset, or the like, to be performed without any request being processed from outside bank 1100. Alternatively, the logic circuit may include a processor subunit, as explained above, such that the processor subunit has the subset controlled by the subgroup controller as its corresponding dedicated memory. In the embodiment of fig. 11, logic 1150a may have subset 1170a as a corresponding dedicated memory, logic 1150b may have subset 1170b as a corresponding dedicated memory, and logic 1150c may have subset 1170c as a corresponding dedicated memory. In any of the embodiments described above, the logic circuits may have buses to their subsets, such as buses 1131a, 1131b, or 1131c. As further depicted in fig. 11, the subgroup controllers may each include a plurality of decoders, such as a subgroup row decoder and a subgroup column decoder, to allow a processing element or processor subunit, or a higher-level memory controller issuing commands, to read from and write to addresses on the memory subsets. For example, subgroup controller 1130a includes decoders 1160a, 1160b, and 1160c; subgroup controller 1130b includes decoders 1160d, 1160e, and 1160f; and subgroup controller 1130c includes decoders 1160g, 1160h, and 1160i. Based on a request from the bank row decoder 1110, the subgroup controllers may select a word line using the decoders included in the subgroup controllers. The described system may allow a processing element or processor subunit of a subset to access the memory without interrupting other banks or even other subsets, thereby allowing the processor subunit of each subset to perform in-memory computations in parallel with the processor subunits of the other subsets.
Further, each subset may include a plurality of memory pads, each having a plurality of memory cells. For example, subgroup 1170a includes pads 1190a-1, 1190a-2, ..., 1190a-x; subgroup 1170b includes pads 1190b-1, 1190b-2, ..., 1190b-x; and subgroup 1170c includes pads 1190c-1, 1190c-2, ..., 1190c-x. As further depicted in fig. 11, each subset may include at least one decoder. For example, subgroup 1170a includes decoder 1180a, subgroup 1170b includes decoder 1180b, and subgroup 1170c includes decoder 1180c. Thus, the bank column decoder 1120 may select global bit lines (e.g., bit lines 1121a or 1121b) based on an external request, while the subset selected by the bank row decoder 1110 may use its column decoder to select a local bit line (e.g., bit line 1181a or 1181b) based on a local request from the logic circuit to which the subset is dedicated. Thus, each processor subunit may be configured to access the subset dedicated to it using that subset's row decoder and column decoder, without using the bank row decoder and bank column decoder. Accordingly, each processor subunit may access its corresponding subset without interrupting the other subsets. Further, when a request for a subset comes from outside the processor subunit, the subset decoders may return the accessed data to the bank decoders. Alternatively, in embodiments where each subset has only one row of memory pads, the local bit lines may be the bit lines of the pads rather than bit lines of the subset.
Combinations of the embodiments described above may also be used: for example, an embodiment using subgroup row decoders and subgroup column decoders may be combined with the embodiment depicted in fig. 11. In one such combination, the bank row decoder may be eliminated while the bank column decoder is retained and local bit lines are used.
FIG. 12 shows an embodiment of a memory subset 1200 having a plurality of pads. For example, the subset 1200 may represent a portion of bank 1100 of fig. 11 or may represent an alternative implementation of a memory bank. In the embodiment of fig. 12, the subset 1200 includes a plurality of pads (e.g., pads 1240a and 1240b). Furthermore, each pad may include a plurality of cells. For example, pad 1240a includes cells 1260a-1, 1260a-2, ..., 1260a-x, and pad 1240b includes cells 1260b-1, 1260b-2, ..., 1260b-x.
Each pad may be assigned a range of addresses of the memory cells to be assigned to the pad. These addresses may be configured at production time so that the pads can be moved around and so that the failed pads can be deactivated and remain unused (e.g., using one or more fuses, as explained further below).
The subset 1200 receives read and write requests from the memory controller 1210. Although not depicted in fig. 12, requests from the memory controller 1210 may be screened by the controller of the subset 1200 and directed to an appropriate pad of the subset 1200 for address resolution. Alternatively, at least a portion (e.g., the higher bits) of an address in a request from the memory controller 1210 may be transmitted to all of the pads of the subset 1200 (e.g., pads 1240a and 1240b), so that each pad may process the full address and the request associated with that address only when the pad's assigned address range includes the address specified in the command. Similar to the subgroup directing described above, the pad determination may be dynamically controlled or may be hardwired. In some embodiments, fuses may be used to determine the address range of each pad, which also allows a faulty pad to be disabled by assigning it an illegal address range. Pads may additionally or alternatively be deactivated by other conventional methods or fuse connections.
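The per-pad address matching can be sketched in the same spirit (the pad names and address ranges below are assumptions): every pad sees the broadcast address, and only a pad whose configured range contains the address responds, while a pad fused to an illegal range never matches and is thereby disabled:

```python
# Illustrative pad-level address matching: the full address is broadcast to
# all pads; each pad's comparator checks its assigned range. A faulty pad is
# disabled by fusing it to an empty (illegal) range. Ranges are assumptions.

pads = [
    {"name": "pad0", "lo": 0x000, "hi": 0x1FF},
    {"name": "pad1", "lo": 0x200, "hi": 0x3FF},
    {"name": "pad2", "lo": 0x001, "hi": 0x000},  # illegal range: deactivated
]

def resolve(addr):
    # Every pad processes the broadcast address; at most one range matches.
    return [p["name"] for p in pads if p["lo"] <= addr <= p["hi"]]

print(resolve(0x250))  # ['pad1']
print(resolve(0x050))  # ['pad0'] -- pad2 never matches, acting as disabled
```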
In any of the embodiments described above, each pad of the subset may include a row decoder (e.g., row decoder 1230a or 1230b) for selecting a word line in the pad. In some embodiments, each pad may also include fuses and comparators (e.g., 1220a and 1220b). As described above, the comparators may allow each pad to determine whether to process an incoming request, and the fuses may allow a pad to be deactivated if a failure occurs. Alternatively, rather than using a row decoder in each pad, the bank and/or subset row decoders may be used.
Further, in any of the embodiments described above, a column decoder (e.g., column decoder 1250a or 1250b) included in the appropriate pad may select a local bit line (e.g., bit line 1251 or 1253). The local bit lines may be connected to the global bit lines of the memory bank. In embodiments where the subset has its own local bit lines, the local bit lines of a pad may be further connected to the local bit lines of the subset. Thus, data in a selected cell may be read through the column decoder (and/or sense amplifiers) of the pad, then through the column decoder (and/or sense amplifiers) of the subset (in embodiments including subset column decoders and/or sense amplifiers), and then through the column decoder (and/or sense amplifiers) of the bank.
The subset 1200 may be duplicated and arrayed to form a memory bank (or a memory subset). For example, a memory chip of the present disclosure may include a plurality of memory banks, each memory bank having a plurality of memory subsets, and each memory subset having a subgroup controller for handling reads and writes to locations on the memory subset. Further, each memory subset may include a plurality of memory pads, each having a plurality of memory cells and having a pad row decoder and a pad column decoder (e.g., as depicted in fig. 12). The pad row decoders and pad column decoders may process read and write requests from the subgroup controller. For example, the pad decoders may receive all requests and determine (e.g., using a comparator) whether to process a request based on each pad's known address range, or the pad decoders may receive only requests within the known address range based on a selection of pads by the subgroup (or bank) controller.
Controller data transfer
In addition to sharing data using processing subunits, any of the memory chips of the present disclosure may also share data using memory controllers (or subgroup controllers or pad controllers). For example, a memory chip of the present disclosure may include: a plurality of memory banks (e.g., SRAM banks, DRAM banks, or the like), each memory bank having a bank controller, a row decoder, and a column decoder to allow reads and writes to locations on the memory bank; and a plurality of buses connecting each of the plurality of bank controllers to at least one other of the plurality of bank controllers. The plurality of buses may be similar to the buses connecting the processing subunits described above, but the plurality of buses connect the bank controllers directly rather than through the processing subunits. Further, although described as connecting bank controllers, the buses may additionally or alternatively connect subgroup controllers and/or pad controllers.
In some embodiments, the multiple buses may be accessed without interrupting data transfers on a main bus connecting memory banks of one or more processor subunits. Thus, a memory group (or subset) may transfer data to or from a corresponding processor sub-unit in the same clock cycle as data is transferred to or from a different memory group (or subset). In embodiments where each controller is connected to a plurality of other controllers, the controller may be configurable for selecting another one of the other controllers for sending or receiving data. In some embodiments, each controller may be connected to at least one neighboring controller (e.g., a pair of spatially adjacent controllers may be connected to each other).
Redundancy logic in memory circuits
The present disclosure relates generally to memory chips having primary logic for on-chip data processing. A memory chip may include redundant logic that can replace defective primary logic to improve the manufacturing yield of the chip. Thus, the chip may include on-chip components that allow the logic blocks in the memory chip to be configured based on individual testing of the logic portions. This feature of the chip may improve yield because memory chips with larger areas dedicated to logic portions are more prone to manufacturing failures. For example, DRAM memory chips with large logic portions may be prone to manufacturing problems that reduce yield. However, implementing redundant logic portions may result in improved yield and reliability because the implementation enables a manufacturer or user of the DRAM memory chip to turn entire logic portions on or off while maintaining high parallelism. It should be noted that certain memory types (such as DRAM) may be identified herein and throughout this disclosure in order to facilitate explanation of the disclosed embodiments. It should be understood, however, that the identified memory types are not intended to be limiting in these cases. Rather, memory types such as DRAM, flash memory, SRAM, ReRAM, PRAM, MRAM, ROM, or any other memory may be used with the disclosed embodiments, even where a specific memory type is identified in a given section of the disclosure.
FIG. 13 is a functional block diagram of an exemplary memory chip 1300 consistent with the disclosed embodiments. The memory chip 1300 may be implemented as a DRAM memory chip. The memory chip 1300 may also be implemented as any type of volatile or non-volatile memory, such as flash memory, SRAM, ReRAM, PRAM, and/or MRAM. The memory chip 1300 may include a substrate 1301 having disposed thereon an address manager 1302, a memory array 1304 including a plurality of memory banks 1304(a, a) through 1304(z, z), memory logic 1306, business logic 1308, and redundant business logic 1310. The memory logic 1306 and the business logic 1308 may constitute primary logic blocks, while the redundant business logic 1310 may constitute a redundant block. In addition, the memory chip 1300 may include configuration switches, which may include a deactivation switch 1312 and an activation switch 1314. The deactivation switch 1312 and the activation switch 1314 may also be disposed on the substrate 1301. In this application, the memory logic 1306, the business logic 1308, and the redundant business logic 1310 may also be collectively referred to as "logic blocks."
The address manager 1302 may include row and column decoders or other types of memory auxiliary devices. Alternatively or additionally, address manager 1302 may include a microcontroller or processing unit.
In some embodiments, as shown in fig. 13, a memory chip 1300 may include a single memory array 1304 that may arrange multiple memory blocks in a two-dimensional array on a substrate 1301. However, in other embodiments, the memory chip 1300 may include multiple memory arrays 1304, and each of the memory arrays 1304 may arrange the memory blocks in different configurations. For example, memory blocks (also referred to as memory banks) in at least one of the memory arrays may be arranged in a radial distribution to facilitate routing between the address manager 1302 or memory logic 1306 to the memory blocks.
In some embodiments, the logic blocks in the memory chip 1300 may be connected to a subset of the memory array 1304 through a dedicated bus. For example, a set of memory logic 1306, business logic 1308, and redundant business logic 1310 may be connected to a first row of memory blocks in memory array 1304 (i.e., memory blocks 1304(a, a) -1304 (a, z)). A dedicated bus may allow the associated logical block to quickly access the data of the memory block without requiring a communication line to be opened through, for example, the address manager 1302.
Each of the plurality of primary logic blocks may be connected to at least one of the plurality of memory banks 1304. In addition, a redundant block, such as redundant business logic 1310, may be connected to at least one of the memory instances 1304(a, a) through 1304(z, z). The redundant block may duplicate at least one of the primary logic blocks, such as the memory logic 1306 or the business logic 1308. The deactivation switch 1312 may be connected to at least one of the plurality of primary logic blocks, and the activation switch 1314 may be connected to at least one of the plurality of redundant blocks.
In these embodiments, upon detecting a failure associated with one of the primary logic blocks (memory logic 1306 and/or business logic 1308), the deactivation switch 1312 may be configured to deactivate that primary logic block. At the same time, the activation switch 1314 may be configured to enable a redundant block of the plurality of redundant blocks, such as redundant business logic 1310, that duplicates the deactivated primary logic block.
In addition, the activation switch 1314 and the deactivation switch 1312, which may be collectively referred to as "configuration switches," may include an external input to configure the state of the switch. For example, the activation switch 1314 may be configured such that an activation signal on the external input produces a closed-switch condition, while the deactivation switch 1312 may be configured such that a deactivation signal on the external input produces an open-switch condition. In some embodiments, all configuration switches in memory chip 1300 may default to deactivated and become activated or enabled after a test indicates that the associated logic block is functional and a signal is applied to the external input. Alternatively, in some cases, all configuration switches in memory chip 1300 may default to enabled and may be deactivated or disabled after a test indicates that the associated logic block is not functional and a deactivation signal is applied to the external input.
Regardless of whether the configuration switch is initially enabled or disabled, the configuration switch may disable the associated logical block upon detecting a failure associated with the associated logical block. In the case where the configuration switch is initially enabled, the state of the configuration switch may change to disabled in order to disable the associated logic block. In the case where the configuration switches are initially disabled, the state of the configuration switches may remain in their disabled state in order to disable the associated logic blocks. For example, the results of the operability test may indicate that a logical block is not operating or that the logical block is not operating within certain specifications. In these cases, the logic blocks may be disabled and their corresponding configuration switches may not be enabled.
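One way to model the pairing of deactivation and activation switches is sketched below (a behavioral model only; the default-disabled policy shown is just one of the two defaults described above, and the test outcomes are invented): a failed operability test leaves the primary block's switch open and closes the switch of its redundant duplicate:

```python
# Behavioral sketch of configuration switches pairing a primary logic block
# with its redundant duplicate. The defaults-to-disabled policy is one of
# the two options described above; test results here are invented.

def configure(block_test_passed):
    """Return (primary_enabled, redundant_enabled) switch states."""
    if block_test_passed:
        return True, False    # activation signal closes the primary's switch
    return False, True        # primary stays open; redundant block enabled

for passed in (True, False):
    primary, redundant = configure(passed)
    print(f"test passed={passed}: primary={primary}, redundant={redundant}")
```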
In some embodiments, the configuration switch may be connected to two or more logic blocks and may be configured to select between different logic blocks. For example, configuration switches may be connected to both the business logic block 1308 and the redundant logic block 1310. The configuration switches may enable the redundant logic block 1310 while disabling the business logic 1308.
Alternatively or additionally, at least one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308) may be connected to a subset of the plurality of memory banks or memory instances 1304 through a first dedicated connection. In turn, at least one redundant block of the plurality of redundant blocks (such as redundant business logic 1310) that duplicates the at least one primary logic block may be connected to the same subset of the plurality of memory banks or memory instances 1304 through a second dedicated connection.
Further, the memory logic 1306 may have functions and capabilities different from those of the business logic 1308. For example, while the memory logic 1306 may be designed to implement read and write operations in the memory bank 1304, the business logic 1308 may be designed to perform in-memory computations. Thus, if the business logic 1308 comprises a first business logic block and the redundant business logic 1310 comprises a second business logic block that duplicates the first, it is possible to disconnect the defective business logic 1308 and reconnect the redundant business logic 1310 without losing any capability.
In some embodiments, the configuration switches (including the deactivation switch 1312 and the activation switch 1314) may be implemented with fuses, antifuses, or programmable devices (including one-time programmable devices) or other forms of non-volatile memory.
FIG. 14 is a functional block diagram of an exemplary set of redundant logical blocks 1400 consistent with the disclosed embodiments. In some embodiments, the set of redundant logic blocks 1400 may be disposed in the substrate 1301. The set of redundant logic blocks 1400 may include at least one of the business logic 1308 and the redundant business logic 1310 connected to the switches 1312 and 1314, respectively. Further, business logic 1308 and redundant business logic 1310 can be connected to the address bus 1402 and the data bus 1404.
In some embodiments, switches 1312 and 1314 may connect the logic blocks to the clock node, as shown in fig. 14. In this manner, the configuration switch may engage or disengage the logic block from the clock signal to effectively enable or disable the logic block. However, in other embodiments, switches 1312 and 1314 may connect the logic blocks to other nodes for activation or deactivation. For example, a configuration switch may connect a logic block to a voltage supply node (e.g., VCC) or to a ground node (e.g., GND) or a clock signal. In this way, the logic blocks may be enabled or disabled by the configuration switches, as the configuration switches may create an open circuit or block the logic block power supply.
In some embodiments, as shown in fig. 14, the address bus 1402 and the data bus 1404 may be on opposite sides of the logic blocks, which are connected in parallel to each of the buses. In this manner, routing of the different on-chip components may be facilitated by the set of logic blocks 1400.
In some embodiments, each of the plurality of deactivation switches 1312 couples at least one of the plurality of primary logic blocks to the clock node, and each of the plurality of activation switches 1314 may couple at least one of the plurality of redundant blocks to the clock node, to allow connection/disconnection of the clock as a simple activation/deactivation mechanism.
The redundant business logic 1310 of the redundant logic block set 1400 allows a designer to choose, based on area and routing considerations, which blocks are worth duplicating. For example, a chip designer may choose a larger block for duplication because larger blocks may be more prone to errors. Thus, a chip designer may decide to duplicate large logic blocks. On the other hand, a designer may prefer to duplicate smaller logic blocks because they are easily duplicated without significant loss of space. Furthermore, using the configuration in fig. 14, a designer may easily choose which logic blocks to duplicate based on error statistics for each area.
FIG. 15 is a functional block diagram of an exemplary logical block 1500 consistent with the disclosed embodiments. The logic blocks may be business logic 1308 and/or redundant business logic 1310. However, in other embodiments, the example logic blocks may describe the memory logic 1306 or other components of the memory chip 1300.
The compute unit 1510 and the replica compute unit 1512 may include digital circuits capable of performing digital calculations. For example, the compute unit 1510 and the replica compute unit 1512 may include an Arithmetic Logic Unit (ALU) to perform arithmetic and bitwise operations on binary numbers. Alternatively, the compute unit 1510 and the replica compute unit 1512 may comprise a Floating Point Unit (FPU) that operates on floating point numbers. Furthermore, in some embodiments, the compute unit 1510 and the replica compute unit 1512 may perform database-related functions such as minimum, maximum, count, and compare operations, among others.
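By way of illustration only, the following sketch models the kind of database-related reductions such a compute unit might perform. The class and method names are hypothetical, and the operations are modeled in software rather than as the digital circuits the disclosure describes.

```python
# Software model of a compute unit's database-style operations
# (min, max, count, compare). Names are illustrative; the actual
# hardware would implement these as digital circuits (ALU/FPU).
class ComputeUnitModel:
    def minimum(self, values):
        return min(values)

    def maximum(self, values):
        return max(values)

    def count(self, values, key):
        # Count entries equal to a key, as a filter/aggregate might.
        return sum(1 for v in values if v == key)

    def compare(self, a, b):
        # Three-way compare: -1, 0, or 1, as ALU flags might encode.
        return (a > b) - (a < b)

unit = ComputeUnitModel()
data = [7, 3, 9, 3]
assert unit.minimum(data) == 3 and unit.maximum(data) == 9
assert unit.count(data, 3) == 2
assert unit.compare(7, 9) == -1
```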
In some embodiments, as shown in fig. 15, the compute unit 1510 and the replica compute unit 1512 may be connected to switch circuits 1514 and 1516. When activated, the switching circuit may enable or disable the computational unit.
In logic block 1500, the replica compute unit 1512 may duplicate the compute unit 1510. Further, in some embodiments, the register 1508, fetch circuitry 1504, decoder 1506, and write-back circuitry 1518 (collectively, the local logic units) may be smaller than the compute unit 1510. Because larger elements are more prone to problems during manufacturing, a designer may decide to duplicate larger units (such as compute unit 1510) instead of smaller units (such as the local logic units). However, depending on historical yield and error rates, the designer may also choose to duplicate the local logic units in addition to, or instead of, duplicating the large units (or the entire block). For example, the compute unit 1510 may be larger, and thus more error prone, than the register 1508, fetch circuitry 1504, decoder 1506, and write-back circuitry 1518. The designer may then choose to duplicate the compute unit 1510 rather than the other elements or the entire logic block 1500.
The logic block 1500 may include a plurality of zone configuration switches, each of which is connected to at least one of the compute unit 1510 or the replica compute unit 1512. When a failure in the compute unit 1510 is detected, the zone configuration switch may be configured to disable the compute unit 1510 and enable the duplicate compute unit 1512.
FIG. 16 shows a functional block diagram of exemplary logic blocks connected to a bus, consistent with the disclosed embodiments. In some embodiments, the logic blocks 1602 (which may represent the memory logic 1306, the business logic 1308, or the redundant business logic 1310) may be independent of one another, may be connected via a bus, and may be activated externally by specifically addressing them. For example, the memory chip 1300 may include many logic blocks, each having an ID number. However, in other embodiments, a logic block 1602 may represent a larger unit composed of one or more of the memory logic 1306, the business logic 1308, or the redundant business logic 1310.
In some embodiments, each of logical blocks 1602 may be redundant with other logical blocks 1602. This full redundancy, where all blocks can operate as primary or redundant blocks, can improve manufacturing yield because the designer can disconnect the failed cells while maintaining the functionality of the entire chip. For example, a designer may be able to disable error-prone logic areas but maintain similar computational power because all copy blocks may be connected to the same address and data buses. For example, the initial number of logical blocks 1602 may be greater than the target capacity. Thus, disabling some of the logical blocks 1602 will not affect the target capacity.
The buses connected to the logic blocks may include an address bus 1614, command lines 1616, and data lines 1618. As shown in fig. 16, each of the logic blocks may be connected independently to each line in the bus. However, in some embodiments, the logic blocks 1602 may be connected in a hierarchical structure to facilitate routing. For example, each line in the bus may be connected to a multiplexer that routes the line to different logic blocks 1602.
In some embodiments, to allow external access without knowledge of the internal chip structure (which may change as units are enabled and disabled), each of the logic blocks may include a fuse ID, such as fuse identification 1604. The fuse identification 1604 may comprise an array of switches (e.g., fuses) that determine the ID and may be connected to a managing circuit. For example, the fuse identification 1604 may be connected to the address manager 1302. Alternatively, the fuse identification 1604 may be connected to higher memory address locations. In these embodiments, the fuse identification 1604 may be configurable for a specific address. For example, the fuse identification 1604 may comprise a programmable, non-volatile device that determines a final ID based on instructions received from the managing circuit.
A distributed processor on a memory chip may be designed with the configuration depicted in fig. 16. A test program, executed as a BIST at chip wake-up or at factory testing, may assign running ID numbers to the blocks of the plurality of primary logic blocks (memory logic 1306 and business logic 1308) that pass a test protocol, and assign illegal ID numbers to the blocks that fail the test protocol. The test program may also assign running ID numbers to the blocks of the plurality of redundant blocks (redundant business logic 1310) that pass the test protocol. Because the redundant blocks replace failed primary logic blocks, the number of redundant blocks assigned a running ID number may be equal to or greater than the number of primary logic blocks assigned an illegal ID number, the illegal ID disabling those blocks. Further, each of the plurality of primary logic blocks and each of the plurality of redundant blocks may include at least one fuse identification 1604. Additionally, as shown in fig. 16, the bus connecting the logic blocks 1602 may include command lines, data lines, and address lines.
In other embodiments, however, all of the logic blocks 1602 connected to the bus may start out disabled and without ID numbers. Tested one by one, each good logic block receives a running ID number, while non-working logic blocks retain an illegal ID, which disables them. In this manner, redundant logic blocks may improve manufacturing yield by replacing blocks found to be defective during testing.
An address bus 1614 may couple the managing circuitry to each of the plurality of memory banks, each of the plurality of primary logical blocks, and each of the plurality of redundant blocks. These connections allow the management circuitry to assign an invalid address to one of the primary logical blocks and a valid address to one of the redundant blocks upon detecting a failure associated with the primary logical block, such as business logic 1308.
For example, as shown in fig. 16A, an illegal ID (e.g., address 0xFFF) is initially assigned to all logic blocks 1602(a) through 1602(c). After testing, logic blocks 1602(a) and 1602(c) are verified as functional, while logic block 1602(b) is not. In fig. 16A, unshaded logic blocks represent logic blocks that passed the functionality test, while shaded logic blocks represent logic blocks that failed it. The test program therefore changes the illegal ID to a legal ID for the functional logic blocks, while preserving the illegal ID for the non-functional ones. As an example, in fig. 16A, the addresses of logic blocks 1602(a) and 1602(c) are changed from 0xFFF to 0x001 and 0x002, respectively. In contrast, the address of logic block 1602(b) remains the illegal address 0xFFF. In some embodiments, the ID is changed by programming the corresponding fuse identification 1604.
Different test results for the logic blocks 1602 may yield different configurations. For example, as shown in fig. 16B, the address manager 1302 may initially assign an illegal ID (i.e., 0xFFF) to all logic blocks 1602. The test results may, however, indicate that both logic blocks 1602(a) and 1602(b) are functional. In that case, testing logic block 1602(c) may be unnecessary, because the memory chip 1300 may require only two functional logic blocks. Thus, to minimize test resources, only the minimum number of functional logic blocks required by the product definition of memory chip 1300 may be tested, leaving the other logic blocks untested. Fig. 16B likewise shows unshaded logic blocks representing tested logic blocks that passed the functionality test and shaded logic blocks representing untested logic blocks.
In these embodiments, a production tester (external or internal, automatic or manual) or a controller executing the BIST at start-up may change the illegal ID to a running ID for the functional, tested logic blocks, while leaving the illegal ID for the untested logic blocks. As an example, in fig. 16B, the addresses of logic blocks 1602(a) and 1602(b) are changed from 0xFFF to 0x001 and 0x002, respectively. In contrast, the address of untested logic block 1602(c) remains the illegal address 0xFFF.
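The ID-assignment behavior of figs. 16A and 16B may be summarized in a short software sketch of the tester's logic. The illegal address 0xFFF follows the figures; the block names, the test predicate, and the required block count are hypothetical.

```python
ILLEGAL_ID = 0xFFF  # illegal address from figs. 16A/16B

def assign_ids(blocks, passes_test, required_blocks):
    """Assign running IDs to functional blocks, stopping once the
    product definition's minimum block count is met (fig. 16B)."""
    ids = {block: ILLEGAL_ID for block in blocks}  # all start disabled
    next_id = 0x001
    for block in blocks:
        if next_id > required_blocks:
            break  # remaining blocks stay untested and disabled
        if passes_test(block):
            ids[block] = next_id  # program the fuse ID to a legal address
            next_id += 1
        # a failing block keeps the illegal ID, which disables it
    return ids

# Fig. 16A: block "b" fails testing; fig. 16B: only two blocks are needed.
assert assign_ids("abc", lambda b: b != "b", 2) == \
    {"a": 0x001, "b": ILLEGAL_ID, "c": 0x002}
assert assign_ids("abc", lambda b: True, 2) == \
    {"a": 0x001, "b": 0x002, "c": ILLEGAL_ID}
```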
FIG. 17 is a functional block diagram of exemplary cells 1702 and 1712 connected in series, consistent with the disclosed embodiments. Fig. 17 may represent an entire system or chip. Alternatively, FIG. 17 may represent blocks in a chip that contain other functional blocks.
The units 1702 and 1712 may represent complete units including a plurality of logic blocks, such as the memory logic 1306 and/or the business logic 1308. In these embodiments, the units 1702 and 1712 may also include elements required to perform operations, such as the address manager 1302. However, in other embodiments, the units 1702 and 1712 may represent logic units such as the business logic 1308 or the redundant business logic 1310.
FIG. 17 presents an embodiment in which units 1702 and 1712 may need to communicate among themselves. In such cases, units 1702 and 1712 may be connected in series. However, a non-working unit would break the continuity between the logic blocks. Thus, when a unit needs to be disabled due to a defect, the connections between units may include a bypass option. The bypass option may also be part of the bypassed unit itself.
In fig. 17, units may be connected in series (e.g., 1702(a) through 1702(c)), and a failed unit (e.g., 1702(b)) may be bypassed when it is defective. The units may further be connected in parallel with switching circuits. For example, in some embodiments, units 1702 and 1712 may be connected with switching circuits 1722 and 1728, as depicted in FIG. 17. In the embodiment depicted in FIG. 17, unit 1702(b) is defective; for example, it failed the circuit functionality test. Thus, unit 1702(b) may be disabled using, for example, the deactivation switch 1312 (not shown in fig. 17), and/or switching circuit 1722(b) may be enabled to bypass unit 1702(b) and maintain connectivity between the logic blocks.
Accordingly, when a plurality of primary units are connected in series, each of the plurality of units may be connected in parallel with a parallel switch. Upon detection of a fault associated with one of the plurality of units, the parallel switch connected to that unit may be activated to connect the two units adjacent to it.
In other embodiments, as shown in fig. 17, the switching circuit 1728 may include one or more sampling points that introduce one or more cycles of delay to maintain synchronization between different lines of units. When a unit is disabled, shorting the connection between adjacent logic blocks may create synchronization errors with other computations. For example, if a task requires data from both line A and line B, and each of A and B is carried by a separate series of units, disabling a unit would desynchronize the lines and require further data management. To prevent desynchronization, sampling circuit 1730 may emulate the delay caused by the disabled unit 1712(b). However, in some embodiments, the parallel switch may include an antifuse instead of the sampling circuit 1730.
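A minimal behavioral sketch of this bypass-with-delay idea follows, assuming single-cycle units and one sampling register per bypass; all names are illustrative rather than part of the disclosed hardware.

```python
def chain_latency(cells_working, use_sampling):
    """Model one series chain: each working cell adds a cycle of
    latency; a bypassed cell adds a cycle only if a sampling register
    emulates it, otherwise the bypass is a direct short."""
    latency = 0
    for working in cells_working:
        if working or use_sampling:
            latency += 1  # cell (or sampling circuit) holds the value
        # a direct short adds no latency, so parallel chains can drift
    return latency

# Chain A is fully functional; chain B has one defective, bypassed cell.
lat_a = chain_latency([True, True, True], use_sampling=True)
lat_b = chain_latency([True, False, True], use_sampling=True)
assert lat_a == lat_b  # sampling keeps lines A and B synchronized
```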
FIG. 18 is a functional block diagram of exemplary cells connected in a two-dimensional array consistent with the disclosed embodiments. Fig. 18 may represent an entire system or chip. Alternatively, FIG. 18 may represent blocks in a chip that contain other functional blocks.
As shown in fig. 18, the cells may be arranged in a two-dimensional array, with cells 1806 (which may include or represent one or more of memory logic 1306, business logic 1308, or redundant business logic 1310) interconnected via switch boxes 1808 and connection boxes 1810. Further, to control its configuration, the two-dimensional array may include I/O blocks 1804 at its perimeter.
The connection box 1810 may be a programmable and reconfigurable device that may respond to signals input from the I/O block 1804. For example, the connection box may include multiple input pins from unit 1806 and may also be connected to switch box 1808. Alternatively, the connection box 1810 may include a group of switches that connect the pins of the programmable logic cells with routing traces, while the switch box 1808 may include a group of switches that connect different traces.
In some embodiments, the connection box 1810 and the switch box 1808 may be implemented by configuration switches such as switches 1312 and 1314. In these embodiments, the connection box 1810 and the switch box 1808 may be configured by a production tester or BIST performed at chip start-up.
In some embodiments, the connection box 1810 and the switch box 1808 may be configured after testing the circuit functionality of the unit 1806. In these embodiments, I/O block 1804 may be used to send test signals to unit 1806. Depending on the test results, the I/O block 1804 may send programming signals that configure the connection boxes 1810 and switch boxes 1808 in a manner that disables units 1806 that fail the test protocol and enables units 1806 that pass the test protocol.
In these embodiments, the plurality of primary logic blocks and the plurality of redundant blocks may be disposed on the substrate in a two-dimensional grid. Thus, each of the plurality of primary units 1806 and each of the plurality of redundant blocks (such as redundant business logic 1310) may be interconnected with switch boxes 1808, and input blocks may be disposed at the perimeter of each row and each column of the two-dimensional grid.
FIG. 19 is a functional block diagram of exemplary units in a complex connection consistent with the disclosed embodiments. Fig. 19 may represent the entire system. Alternatively, FIG. 19 may represent blocks in a chip that contain other functional blocks.
The complex connection of fig. 19 includes cells 1902(a) through 1902(f) and configuration switches 1904(a) through 1904(f). The cells 1902 may represent autonomous units including a plurality of logic blocks, such as the memory logic 1306 and/or the business logic 1308. However, in other embodiments, the cells 1902 may represent logic units such as the memory logic 1306, the business logic 1308, or the redundant business logic 1310. The configuration switches 1904 may include deactivation switches 1312 and/or activation switches 1314.
As shown in fig. 19, the complex connection may include cells 1902 in two planes. For example, the complex connection may include two separate substrates separated along the z-axis. Alternatively or additionally, the cells 1902 may be arranged on both surfaces of a substrate. For example, for the purpose of reducing the area of the memory chip 1300, the substrate 1301 may be arranged in two overlapping surfaces connected by configuration switches 1904 arranged in three dimensions.
The first plane of the substrate may include the "main" cells 1902. These blocks may be enabled by default. In these embodiments, the second plane may include the "redundant" cells 1902, which may be disabled by default.
In some embodiments, the configuration switches 1904 may include antifuses. Thus, after testing the cells 1902, a block may be wired into the set of active cells by switching certain antifuses to "always on", disabling the selected defective cells 1902 even when they lie in a different plane. In the embodiment presented in fig. 19, one of the "main" cells (cell 1902(e)) is not operational. Fig. 19 represents inactive or untested blocks as shaded and tested or active blocks as unshaded. Accordingly, the configuration switches 1904 are configured so that one of the logic blocks in the other plane (e.g., cell 1902(f)) becomes active. In this way, the memory chip still operates, with the spare logic cell replacing the defective main logic block.
Fig. 19 additionally shows that one of the cells 1902 in the second plane is not tested or enabled (i.e., 1902(c)) because the main logic block is functional. For example, in fig. 19, two master units 1902(a) and 1902(d) pass the functionality test. Thus, cell 1902(c) is not tested or enabled. Thus, FIG. 19 shows the ability to specifically select the logical block that becomes active depending on the test results.
In some embodiments, as shown in fig. 19, not all cells 1902 in the first plane may have a corresponding spare or redundant block. However, in other embodiments, all cells may be redundant of each other to achieve full redundancy, where all cells are primary or redundant. Further, while some implementations may follow the star network topology depicted in fig. 19, other implementations may use parallel connections, series connections, and/or coupling different elements in parallel or in series with the configuration switches.
Fig. 20 is an exemplary flow chart illustrating a redundant block enable process 2000 consistent with the disclosed embodiments. The enable process 2000 may be implemented for the memory chip 1300, and particularly for a DRAM memory chip. In some embodiments, the process 2000 may include the following steps: testing at least one circuit functionality of each of a plurality of primary logic blocks on the substrate of the memory chip; identifying at least one failed logic block among the plurality of primary logic blocks based on the test results; testing at least one circuit functionality of at least one redundant or additional logic block on the substrate of the memory chip; disabling the at least one failed logic block by applying an external signal to a deactivation switch; and enabling the at least one redundant block by applying the external signal to an activation switch connected with the at least one redundant block and disposed on the substrate of the memory chip. Each step of the process 2000 is further detailed below in the description of fig. 20.
The process 2000 may include testing a plurality of logic blocks, such as business logic 1308 (step 2002), and a plurality of redundant blocks (e.g., redundant business logic 1310). Testing may be performed before packaging using, for example, a probe station for on-wafer testing. However, the testing of step 2002 may also be performed after packaging.
The test in step 2002 may include applying a limited sequence of test signals to each logic block in the memory chip 1300 or a subset of the logic blocks in the memory chip 1300. The test signal may include a request for a calculation that is expected to yield a 0 or a 1. In other embodiments, the test signal may request a read of a particular address in a memory bank or a write to a particular memory bank.
A test technique may be implemented in step 2002 to test the response of the logic block under an iterative process. For example, the testing may involve testing the logical block by transmitting an instruction to write data into the memory bank and then verifying the integrity of the written data. In some embodiments, the testing may include utilizing an inverted data repetition algorithm.
In an alternative embodiment, the testing of step 2002 may comprise running a model of the logic block to generate a target memory image based on a set of test instructions. The same sequence of instructions may then be executed by the logic block in the memory chip, and the results may be recorded. The residual memory image from the simulation may also be compared to the image obtained from the self-test, and any mismatch may be flagged as a fault.
Alternatively, in step 2002, the testing may include shadow modeling, in which a diagnosis is generated but a predicted result is not necessarily computed in advance. Instead, the test using shadow modeling may run in parallel on both the memory chip and a simulation. For example, when a logic block in the memory chip completes an instruction or task, the simulation may be signaled to execute the same instruction. Once the logic block in the memory chip completes the instruction, the architectural states of the two models may be compared. If there is a mismatch, a fault is flagged.
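A condensed software sketch of the shadow-modeling comparison is shown below. The device and model interfaces, and the toy accumulator used to demonstrate a mismatch, are assumptions for illustration only.

```python
class Adder:
    """Toy architectural state: a single accumulator register. The
    'bug' flag injects a fault so a mismatch can be demonstrated."""
    def __init__(self, bug=False):
        self.acc, self.bug = 0, bug

    def execute(self, operand):
        self.acc += operand + (1 if self.bug and operand == 3 else 0)

    def state(self):
        return self.acc

def shadow_test(device, model, instructions):
    """Run the same instruction stream on the device under test and on
    the simulation in lockstep; flag a fault at any state mismatch."""
    faults = []
    for instr in instructions:
        device.execute(instr)  # logic block completes the instruction
        model.execute(instr)   # emulation executes the same instruction
        if device.state() != model.state():
            faults.append(instr)  # architectural states diverged
    return faults

assert shadow_test(Adder(bug=True), Adder(), [1, 2, 3]) == [3]
```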
In some embodiments, all logic blocks (including, for example, each of the memory logic 1306, the business logic 1308, or the redundant business logic 1310) may be tested in step 2002. However, in other embodiments, only a subset of the logic blocks may be tested in different test rounds. For example, in a first test pass, only the memory logic 1306 and associated blocks may be tested. In the second pass, only business logic 1308 and associated blocks may be tested. In the third pass, the logic blocks associated with the redundant business logic 1310 may be tested depending on the results of the first two passes.
The process 2000 may continue to step 2004. In step 2004, a failed logical block may be identified, and a failed redundant block may also be identified. For example, a logical block that fails the test of step 2002 may be identified as a failed block in step 2004. However, in other embodiments, only certain failing logical blocks may be initially identified. For example, in some embodiments, only logical blocks associated with business logic 1308 may be identified, and only failed redundant blocks are identified if needed to replace failed logical blocks. Further, identifying the failing block may include writing identification information of the identified failing block on the memory bank or the non-volatile memory.
In step 2006, the failed logical block may be disabled. For example, using configuration circuitry, a failed logic block may be disabled by disconnecting it from clock, ground, and/or power nodes. Alternatively, a failed logic block may be disabled by configuring the connection box in an arrangement that avoids the logic block. Additionally, in other embodiments, failed logical blocks may be disabled by receiving an illegal address from address manager 1302.
In step 2008, a redundant block that duplicates the failed logic block may be identified. To support the same capabilities of the memory chip even when some logic blocks have failed, step 2008 identifies redundant blocks that are available and can duplicate the failed logic blocks. For example, if a logic block that performs multiplications of vectors is determined to be faulty, the address manager 1302 or an on-chip controller may identify an available redundant logic block that also performs multiplications of vectors.
In step 2010, the redundant blocks identified in step 2008 may be enabled. In contrast to the disabling operation of step 2006, the identified redundant blocks may be enabled in step 2010 by connecting them to the clock, ground, and/or power nodes. Alternatively, the identified redundant blocks may be enabled by configuring the connection boxes in an arrangement that connects them. Additionally, in other embodiments, the identified redundant blocks may be enabled by receiving a running address at test-program run time.
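Putting steps 2002 through 2010 together, the enable flow may be sketched as follows. This is a software model of the tester or controller logic; the block names and test predicate are placeholders, and it assumes any passing redundant block can duplicate any failed primary block.

```python
def enable_redundancy(primary, redundant, passes_test):
    """Process 2000 sketch: test blocks (2002), identify failures
    (2004), disable them (2006), and identify and enable a redundant
    replacement for each failure (2008/2010)."""
    spares = [r for r in redundant if passes_test(r)]    # step 2002
    active, disabled = [], []
    for block in primary:
        if passes_test(block):                           # step 2002
            active.append(block)
        else:
            disabled.append(block)                       # steps 2004/2006
            if not spares:
                raise RuntimeError("no redundant block available")
            active.append(spares.pop(0))                 # steps 2008/2010
    return active, disabled

active, disabled = enable_redundancy(
    ["logic_a", "logic_b"], ["spare_a"], lambda b: b != "logic_b")
assert disabled == ["logic_b"] and "spare_a" in active
```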
FIG. 21 is an exemplary flow chart illustrating an address assignment process 2100 consistent with the disclosed embodiments. The address assignment process 2100 may be implemented for the memory chip 1300, and particularly for DRAM memory chips. As described with respect to fig. 16, in some embodiments, the logic blocks in the memory chip 1300 may be connected to a data bus and have an address identification. The process 2100 describes an address assignment method that disables failed logic blocks and enables logic blocks that pass testing. The steps of process 2100 will be described as being performed by a production tester or a BIST executed at chip start-up; however, other components of the memory chip 1300 and/or external devices may also perform one or more steps of the process 2100.
In step 2102, the tester may disable all logical blocks and redundant blocks by assigning an illegal identification to each logical block at the chip level.
In step 2104, the tester may execute a test protocol for the logical block. For example, the tester may perform the test method described in step 2002 for one or more of the logic blocks in the memory chip 1300.
In step 2106, depending on the results of the test in step 2104, the tester may determine whether the logic block is defective. If the logic block is not defective (step 2106: no), the address manager may assign a running ID to the tested logic block in step 2108. If the logic block is defective (step 2106: yes), the address manager 1302 may leave the illegal ID on the defective logic block in step 2110.
In step 2112, the address manager 1302 may select a redundant logic block to duplicate the defective logic block. In some embodiments, the redundant logic block that duplicates the defective logic block may have the same components and connections as the defective logic block. However, in other embodiments, the redundant logic block may have components and/or connections different from those of the defective logic block, but be capable of performing an equivalent operation. For example, if the defective logic block is designed to perform multiplication of vectors, the selected redundant logic block would also be capable of performing multiplication of vectors, even if it does not have the same architecture as the defective unit.
In step 2114, the identified redundant block may be tested. For example, the tester may apply the testing techniques of step 2104 to the identified redundant block.
In step 2116, based on the results of the test in step 2114, the tester may determine whether the redundant block is defective. If the redundant block is not defective (step 2116: no), the tester may assign a running ID to the identified redundant block in step 2118. In some embodiments, the process 2100 may return to step 2104 after step 2118, creating an iterative loop that tests all the logic blocks in the memory chip.
If the tester determines that the redundant block is defective (step 2116: yes), then in step 2120 the tester may determine whether additional redundant blocks are available. For example, the tester may query a memory bank holding information about available redundant logic blocks. If a redundant logic block is available (step 2120: yes), the tester may return to step 2112 and identify a new redundant logic block to duplicate the defective logic block. If no redundant logic block is available (step 2120: no), the tester may generate an error signal in step 2122. The error signal may include information about the defective logic block and the defective redundant block.
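The flow of fig. 21 may be condensed into the following sketch. The illegal ID 0xFFF and the running-ID scheme follow fig. 16, while the block names and defect predicate are placeholders.

```python
ILLEGAL_ID = 0xFFF

def assign_addresses(logic_blocks, spares, is_defective):
    """Process 2100 sketch: start every block at the illegal ID
    (step 2102), test each block (2104-2110), and substitute a tested
    redundant block for each defective one (2112-2122)."""
    ids = {b: ILLEGAL_ID for b in logic_blocks + spares}  # step 2102
    next_id = 0x001
    for block in logic_blocks:
        if not is_defective(block):                       # steps 2104/2106
            ids[block] = next_id                          # step 2108
            next_id += 1
            continue
        # defective: the block keeps the illegal ID (step 2110)
        for spare in spares:                              # step 2112
            if ids[spare] == ILLEGAL_ID and not is_defective(spare):
                ids[spare] = next_id                      # steps 2114-2118
                next_id += 1
                break
        else:
            raise RuntimeError("error signal: no spare left")  # step 2122
    return ids

ids = assign_addresses(["blk0", "blk1"], ["spare0"],
                       is_defective=lambda b: b == "blk1")
assert ids == {"blk0": 0x001, "blk1": ILLEGAL_ID, "spare0": 0x002}
```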
Coupled memory banks
Embodiments disclosed in the present disclosure also include a distributed high-performance processor. The processor may include a memory controller that interfaces memory banks with processing units. The processor may be configurable to expedite delivery of data to the processing units for computation. For example, if a processing unit requires two data instances to perform a task, the memory controller may be configured so that communication lines independently provide access to the information from the two data instances. The disclosed memory architecture seeks to minimize the hardware requirements associated with complex cache and complex register file schemes. Typically, processor chips include a cache hierarchy that allows the cores to work directly with registers. However, cache operations require significant die area and consume additional power. The disclosed memory architecture avoids the use of a cache hierarchy by adding logic components in the memory.
The disclosed architecture also enables strategic (or even optimized) placement of data in a memory bank. The disclosed memory architecture achieves high performance and avoids memory access bottlenecks by strategically locating data in different blocks of a memory bank even though the memory bank has a single port and high latency. With the goal of providing a continuous stream of data to the processing unit, the compile optimization step may determine how the data should be stored in the memory banks for a particular or general task. The memory controller interfacing the processing units and memory banks may then be configured to grant access to a particular processing unit when the particular processing unit requires data to perform an operation.
The configuration of the memory chip may be performed by a processing unit (e.g., the configuration manager) or by an external interface. The configuration may also be written by a compiler or other software facility. Further, the configuration of the memory controller may be based on the available ports in the memory banks and the organization of the data in the memory banks. Accordingly, the disclosed architecture may provide a processing unit with a constant data stream or with simultaneous information from different memory blocks. In this way, computation tasks within the memory may be processed quickly, avoiding latency bottlenecks or cache requirements.
In addition, the data stored in the memory chip may be arranged based on a compilation optimization step. Compilation may allow building processes in which the processor assigns tasks to processing units efficiently, without delays associated with memory latency. The compilation may be performed by a compiler and transmitted to a host connected to an external interface in the substrate. Typically, high latency and/or a small number of ports will, for certain access patterns, create a data bottleneck for the processing units that need the data. However, the disclosed compilation may place data in the memory banks in a manner that enables the processing units to receive data continuously, even with unfavorable memory types.
Further, in some embodiments, the configuration manager may signal the required processing units based on the computations required by the tasks. Different processing units or logic blocks in a chip may have specialized hardware or architecture for different tasks. Thus, depending on the task to be performed, a processing unit or group of processing units may be selected to perform the task. The memory controller on the substrate may be configurable to route data or grant access based on the selection of the processing subunit to improve data transfer speed. For example, based on compilation optimization and memory architecture, a processing unit may be granted access to a memory bank when it is needed to perform a task.
Further, the chip architecture may include on-chip components that facilitate the transfer of data by reducing the time required to access the data in the memory banks. Thus, the present disclosure describes a chip architecture, together with a compilation optimization step, for a high-performance processor capable of performing specific or general tasks using simple memory instances. The memory instances may have high random-access latency and/or a small number of ports, such as those used in DRAM devices or other memory-oriented technologies, but the disclosed architecture may overcome these drawbacks by enabling a continuous (or nearly continuous) data flow from the memory banks to the processing units.
In the present application, simultaneous communication may refer to communication within a clock cycle. Alternatively, simultaneous communication may refer to sending information within a predetermined amount of time. For example, simultaneous communication may refer to communication within a few nanoseconds.
FIG. 22 provides a functional block diagram of an exemplary processing device consistent with the disclosed embodiments. Fig. 22A shows a first embodiment of the processing apparatus 2200, in which a memory controller 2210 connects a first memory block 2202 and a second memory block 2204 using multiplexers. The memory controller 2210 may also be coupled to at least one configuration manager 2212, a logic block 2214, and a plurality of accelerators 2216(a) through 2216(n). Fig. 22B shows a second embodiment of the processing apparatus 2200, in which the memory controller 2210 connects memory blocks 2202 and 2204 using a bus that connects the memory controller 2210 with the at least one configuration manager 2212, the logic block 2214, and the plurality of accelerators 2216(a) through 2216(n). Further, the host 2230 may be external to the processing apparatus 2200 and connected to it via, for example, an external interface.
Memory blocks 2202 and 2204 may comprise DRAM pads or groups of pads, DRAM banks, MRAM/PRAM/ReRAM/SRAM cells, flash memory pads, or other memory technologies. The memory blocks 2202 and 2204 may alternatively comprise non-volatile memory, flash memory devices, resistive random access memory (ReRAM) devices, or magnetoresistive random access memory (MRAM) devices.
The memory blocks 2202 and 2204 may additionally include a plurality of memory cells arranged in rows and columns between a plurality of word lines (not shown) and a plurality of bit lines (not shown). The gates of each row of memory cells may be connected to a respective one of the plurality of word lines. Each column of memory cells may be connected to a respective one of the plurality of bit lines.
In other embodiments, the memory areas (including memory blocks 2202 and 2204) may be implemented with simple memory instances. In the present application, the term "memory instance" may be used interchangeably with the term "memory block". Memory instances (or blocks) may have poor characteristics. For example, the memories may be single-port-only memories and may have high random-access latency. Alternatively or additionally, the memories may be inaccessible during column and row changes and may face data-access issues related to, for example, capacitive charging and/or circuitry setup. However, the architecture presented in fig. 22 still facilitates parallel processing in the memory device by allowing dedicated connections between memory instances and processing units and by arranging the data in a way that accounts for the characteristics of the blocks.
In some device architectures, a memory instance may include several ports to facilitate parallel operations. However, even in these embodiments, the chip may achieve improved performance when the data is compiled and organized based on the chip architecture. For example, a compiler may improve access efficiency in a memory area by providing instructions and organizing the data placement, making the memory area easy to access even when implemented with single-port memory.
Furthermore, memory blocks 2202 and 2204 may be multiple types of memory in a single chip. For example, memory blocks 2202 and 2204 may be eFlash and eDRAM. Additionally, the memory banks may include DRAMs with ROM instances.
Further, memory controller 2210 may constitute a dual-channel memory controller. The incorporation of a dual-channel memory may facilitate control of parallel access lines by memory controller 2210. The parallel access lines may be configured to have the same length to facilitate data synchronization when multiple lines are used in combination. Alternatively or additionally, the parallel access lines may allow access to multiple memory ports of a memory bank.
In some embodiments, the processing device 2200 may include one or more multiplexers connectable to the processing units. The processing unit may include a configuration manager 2212, a logic block 2214, and an accelerator 2216, which may be directly connected to the multiplexer. Additionally, memory controller 2210 may include at least one data input from multiple memory banks or blocks 2202 and 2204, and at least one data output connected to each of multiple processing units. With this arrangement, memory controller 2210 can receive data from memory banks or memory blocks 2202 and 2204 simultaneously via two data inputs and transmit the received data to at least one selected processing unit simultaneously via two data outputs. However, in some embodiments, at least one data input and at least one data output may be implemented in a single port to allow only read or write operations. In these embodiments, a single port may be implemented as a data bus comprising data lines, address lines, and command lines.
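As an illustration of this two-input/two-output arrangement, the following behavioral model delivers one word from each of two memory blocks in the same cycle. The interfaces are assumptions, not the disclosed hardware.

```python
class DualInputController:
    """Behavioral model of memory controller 2210: two data inputs
    from the memory blocks and two data outputs toward the processing
    units, so two words can be delivered in the same cycle."""
    def __init__(self, block_a, block_b):
        self.blocks = (block_a, block_b)

    def fetch_pair(self, addr_a, addr_b):
        # Both blocks are read over independent lines in one cycle.
        return self.blocks[0][addr_a], self.blocks[1][addr_b]

ctrl = DualInputController(block_a=[10, 11], block_b=[20, 21])
assert ctrl.fetch_pair(0, 1) == (10, 21)  # delivered simultaneously
```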
In some embodiments, a first memory block 2202 of the at least two memory blocks may be arranged on a first side of the plurality of processing units; and a second memory bank 2204 of the at least two memory banks may be disposed on a second side of the plurality of processing units opposite the first side. Additionally, the selected processing unit (e.g., accelerator 2216(n)) to perform the task may be configured to access the second memory bank 2204 during clock cycles in which communication lines to the first memory bank or first memory block 2202 are open. Alternatively, the selected processing unit may be configured to transfer data to the second memory block 2204 during clock cycles in which communication lines to the first memory block 2202 are open.
In some embodiments, memory controller 2210 can be implemented as a standalone element, as shown in fig. 22. However, in other embodiments, the memory controller 2210 may be embedded in a memory area or may be disposed along the accelerators 2216(a) through 2216 (n).
The processing regions in processing device 2200 may include a configuration manager 2212, a logical block 2214, and accelerators 2216(a) through 2216 (n). The accelerator 2216 may include a plurality of processing circuits with predefined functionality and may be defined by a particular application. For example, the accelerator may be a vector Multiply Accumulate (MAC) unit or a Direct Memory Access (DMA) unit that handles memory movement between modules. The accelerator 2216 may also be capable of calculating its own address and requesting data from or writing data to the memory controller 2210. For example, the configuration manager 2212 may signal at least one of the accelerators 2216 that the accelerator may access a bank of memory. Then, accelerator 2216 may configure memory controller 2210 to route the data or grant access to the accelerator itself. Further, accelerator 2216 may include at least one arithmetic logic unit, at least one vector handling logic unit, at least one string comparison logic unit, at least one register, and at least one direct memory access.
The configuration manager 2212 may include digital processing circuitry to configure the accelerators 2216 and instruct the execution of tasks. For example, the configuration manager 2212 may be connected to the memory controller 2210 and to each of the plurality of accelerators 2216. The configuration manager 2212 may have its own dedicated memory to save the configuration of the accelerators 2216. Configuration manager 2212 may use the memory banks to fetch commands and configurations via memory controller 2210. Alternatively, the configuration manager 2212 may be programmed via an external interface. In some embodiments, configuration manager 2212 may be implemented with an on-chip Reduced Instruction Set Computer (RISC) or an on-chip complex CPU having its own cache hierarchy. In some embodiments, the configuration manager 2212 may also be omitted, and the accelerators may be configured via an external interface.
The processing device 2200 may also include an external interface (not shown). The external interface allows access to the memory from an upper level, such as a memory bank controller that receives commands from an external host 2230 or an on-chip host processor. The external interface may allow the configuration manager 2212 and accelerators 2216 to be programmed by writing configurations or code to the memory through the memory controller 2210 for later use by the configuration manager 2212 or by the units 2214 and 2216 themselves. However, the external interface may also program the processing units directly, without routing through the memory controller 2210. In the case where the configuration manager 2212 is a microcontroller, the configuration manager 2212 may allow code to be loaded from main memory to the controller's area memory via the external interface. The memory controller 2210 may be configured to interrupt a task in response to receiving a request from the external interface.
The external interface may include a plurality of connectors associated with logic circuitry that provide a glueless interface to the various components on the processing device. The external interface may include: data I/O inputs for data reads and outputs for data writes; external address outputs; an external CE0 chip-select pin; an active-low chip selector; byte enable pins; a pin for memory-cycle wait states; a write enable pin; an output enable active pin; and read/write enable pins. Thus, the external interface has the inputs and outputs needed to control processing and to obtain information from the processing device. For example, the external interface may conform to the JEDEC DDR standard. Alternatively or additionally, the external interface may conform to other standards, such as SPI/OSPI or UART.
In some embodiments, the external interface may be disposed on the chip substrate and may connect to an external host 2230. An external host may access memory blocks 2202 and 2204, memory controller 2210, and the processing unit via an external interface. Alternatively or additionally, the external host 2230 may read and write to memory, or may signal to the configuration manager 2212 via read and write commands to perform operations, such as start handlers and/or stop handlers. Further, the external host 2230 may directly configure the accelerator 2216. In some embodiments, external host 2230 is capable of performing read/write operations directly to memory blocks 2202 and 2204.
In some embodiments, the configuration manager 2212 and the accelerators 2216 may be configured to use a direct bus to connect device regions and memory regions depending on the target task. For example, when a subset of accelerators 2216 are capable of performing the computations required for task execution, that subset of accelerators may be connected with memory instance 2204. By doing this separation, it is possible to ensure that the dedicated accelerator obtains the Bandwidth (BW) required by memory blocks 2202 and 2204. Furthermore, such a configuration with a dedicated bus may allow for splitting a large memory into smaller instances or blocks, as connecting memory instances to the memory controller 2210 allows for fast access to data in different memories even with high row latency. To achieve parallelism of connections, memory controller 2210 can connect to each of the memory instances with a data bus, an address bus, and/or a control bus.
The above-described inclusion of memory controller 2210 may eliminate the need for a cache hierarchy or complex register file in the processing device. Although a cache hierarchy may be added for added capability, the architecture in the processing apparatus 2200 may allow a designer to add sufficient memory blocks or instances based on the processing operations and to manage the instances accordingly without a cache hierarchy. For example, the architecture of the processing apparatus 2200 may eliminate the need for a cache hierarchy by implementing pipelined memory accesses. In pipelined memory accesses, a processing unit may receive a continuous data stream in every cycle, with some data lines being opened (or activated) while other data lines receive or transmit data. The continuous data flow over separate communication lines may enable improved execution speed and minimal latency due to line changes.
Furthermore, the disclosed architecture in FIG. 22 enables pipelined memory accesses, potentially organizing data in a small number of memory blocks and saving the power consumption caused by line switching. For example, in some embodiments, a compiler may communicate to the host 2230 the organization of data in the memory banks, or the method used to organize the data, to facilitate accessing the data during a given task. The configuration manager 2212 may then define which memory banks, and in some cases which ports of the memory banks, may be accessed by the accelerators. This synchronization between the location of the data in the memory banks and the data access method improves computing tasks by feeding data to the accelerators with minimal delay. For example, in embodiments in which the configuration manager 2212 includes a RISC/CPU, the method may be implemented in offline software (SW), and the configuration manager 2212 may then be programmed to perform the method. The method may be developed in any language executable by RISC/CPU computers and may be executed on any platform. The inputs to the method may include the configuration of the memory behind the memory controller and the data itself, along with the pattern of memory accesses. Further, the method may be implemented in an embodiment-specific language or a machine language, and may also simply be a series of configuration values represented in binary or text.
As discussed above, in some embodiments, the compiler may provide instructions to the host 2230 for organizing data in the memory blocks 2202 and 2204 in preparation for pipelined memory access. A pipelined memory access may generally include the following steps: receiving a plurality of addresses for the plurality of memory banks or memory blocks 2202 and 2204; accessing the plurality of memory banks via independent data lines according to the received addresses; supplying, within a first clock cycle, data from a first address in a first memory bank of the plurality of memory banks to at least one of the plurality of processing units via a first communication line, while opening a second communication line to a second address in a second memory bank 2204 of the plurality of memory banks; and supplying, within a second clock cycle, data from the second address to the at least one processing unit via the second communication line, while opening a third communication line to a third address in the first memory bank. In some embodiments, the pipelined memory access may be performed with two memory blocks connected to a single port. In these embodiments, the memory controller 2210 may hide the two memory blocks behind a single port but transmit the data to the processing unit using the pipelined memory access method.
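These steps amount to alternating line activations and reads across the banks. The following cycle-level sketch charges a stall only for the very first activation and assumes every later activation is fully hidden behind a read on the other bank; the interfaces and line size are illustrative.

```python
def pipelined_read(banks, schedule, words_per_line=4):
    """Stream words from alternating banks. Opening a line costs one
    cycle, but (by assumption) every activation after the first is
    fully hidden behind a data cycle on the other bank, so the
    processing unit sees one word per cycle after the initial open."""
    open_row = {b: None for b in range(len(banks))}
    stream, cycle = [], 0
    for bank, addr in schedule:
        row = addr // words_per_line
        if open_row[bank] != row:
            if not stream:
                cycle += 1  # only the very first activation stalls
            open_row[bank] = row  # later opens overlap the other bank
        stream.append((cycle, banks[bank][addr]))
        cycle += 1
    return stream

# Even-indexed words in bank 0, odd-indexed words in bank 1.
banks = [[0, 2, 4, 6], [1, 3, 5, 7]]
schedule = [(i % 2, i // 2) for i in range(8)]
assert [w for _, w in pipelined_read(banks, schedule)] == list(range(8))
```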
In some embodiments, the compiler can execute on the host 2230 before performing the tasks. In these embodiments, the compiler may be able to determine the configuration of the data stream based on the architecture of the memory device, as the configuration will be known to the compiler.
In other embodiments, if the configuration of the memory blocks 2204 and 2202 is unknown at offline time, the pipelined method may be performed on the host 2230, which may arrange the data in the memory blocks before starting the computation. For example, host 2230 can write data directly into memory blocks 2204 and 2202. In these embodiments, processing units such as configuration manager 2212 and memory controller 2210 may not have information about the required hardware until runtime. It may then be necessary to delay the selection of the accelerator 2216 until the task begins to run. In these cases, the processing unit or memory controller 2210 may randomly select the accelerator 2216 and generate a test data access pattern, which may be modified in performing the task.
However, when the tasks are known in advance, the compiler may organize data and instructions in the memory banks so that the host 2230 can provide them to a processing unit, such as the configuration manager 2212, to set the signal connections that minimize access latency. For example, in some cases, an accelerator 2216 may require n words at a time, while each memory instance supports retrieving only m words at a time, where "m" and "n" are integers and m < n. Thus, the compiler may place the required data across different memory instances or blocks to facilitate data access. Additionally, to avoid row-miss latency where the processing apparatus 2200 includes multiple memory instances, the host may split the data across different rows of different memory instances. The partitioning of the data may allow the next data row in the next instance to be accessed while data from the current instance is still in use.
For example, accelerator 2216(a) may be configured to multiply two vectors. Each of the vectors may be stored in a separate memory block, such as memory blocks 2202 and 2204, and each vector may include multiple words. Thus, completing a task that requires accelerator 2216(a) to perform a multiplication may require accessing the two memory blocks and retrieving multiple words. However, in some embodiments, a memory block allows only one word to be accessed per clock cycle; for example, a memory block may have a single port. In these cases, to expedite data transfer during the operation, a compiler may organize the words that compose the vectors in different memory blocks, allowing parallel and/or simultaneous reads of the words. In these cases, the compiler may store the words in memory blocks having dedicated lines. For example, if each vector includes two words and the memory controller has direct access to four memory blocks, the compiler may arrange the data in the four memory blocks, each transmitting one word and thus speeding up data delivery. Moreover, in embodiments in which the memory controller 2210 has more than a single connection to each memory block, the compiler may instruct the configuration manager 2212 (or another processing unit) to access specific ports. In this manner, the processing apparatus 2200 may perform pipelined memory accesses, providing data to the processing units sequentially by loading words in some lines while simultaneously transmitting data in other lines. Thus, this pipelined memory access may avoid latency problems.
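For this two-vector example, a compiler-style layout step might scatter consecutive words across blocks with dedicated lines so that all operands arrive in the same cycle. The following sketch is illustrative only; the layout policy is an assumption.

```python
def scatter(vector, num_blocks):
    """Place consecutive words of a vector into different memory
    blocks so that single-port blocks can be read in parallel."""
    blocks = [[] for _ in range(num_blocks)]
    for i, word in enumerate(vector):
        blocks[i % num_blocks].append(word)
    return blocks

# Two 2-word vectors spread over four single-port blocks: every word
# sits in its own block, so one cycle suffices to fetch all operands.
blocks = scatter([1, 2], 2) + scatter([3, 4], 2)
assert [b[0] for b in blocks] == [1, 2, 3, 4]
```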
Fig. 23 is a functional block diagram of an exemplary processing device 2300 consistent with the disclosed embodiments. The functional block diagram shows a simplified processing apparatus 2300 showing a single accelerator in the form of a MAC unit 2302, a configuration manager 2304 (equivalent or similar to configuration manager 2212), a memory controller 2306 (equivalent or similar to memory controller 2210), and a plurality of memory blocks 2308(a) -2308 (d).
In some embodiments, MAC unit 2302 may be a specific accelerator for processing specific tasks. As an example, processing device 2300 may perform a 2D convolution as a task. The configuration manager 2304 may then signal an accelerator with appropriate hardware to perform computations associated with the task. For example, MAC unit 2302 may have four internal incremental counters (logical adders and registers to manage the four loops required for convolution calculations) and a multiply-accumulate unit. The configuration manager 2304 may signal the MAC unit 2302 to process incoming data and perform tasks. The configuration manager 2304 may transmit an indication to the MAC unit 2302 to perform a task. In these cases, the MAC unit 2302 may iterate over the calculated addresses, multiply the numbers, and accumulate them to internal registers.
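The four counters correspond to the four nested loops of a direct 2D convolution. The following reference sketch shows, in software, the computation the MAC unit 2302 would iterate through; the dimensions and data are hypothetical.

```python
def conv2d(image, kernel):
    """Direct 2D convolution: four nested loops (output row/column and
    kernel row/column) with a multiply-accumulate in the inner body,
    mirroring the MAC unit's four incremental counters."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):              # counter 1: output row
        for c in range(ow):          # counter 2: output column
            acc = 0                  # internal accumulator register
            for i in range(kh):      # counter 3: kernel row
                for j in range(kw):  # counter 4: kernel column
                    acc += image[r + i][c + j] * kernel[i][j]  # MAC
            out[r][c] = acc
    return out

assert conv2d([[1, 2], [3, 4]], [[1]]) == [[1, 2], [3, 4]]
```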
In some embodiments, configuration manager 2304 may configure the accelerator, while memory controller 2306 grants access to dedicated bus access block 2308 and MAC unit 2302. However, in other embodiments, the memory controller 2306 may directly configure the accelerator based on instructions received from the configuration manager 2304 or an external interface. Alternatively or in addition, the configuration manager 2304 may preload several configurations and allow the accelerator to run repeatedly on different addresses with different sizes. In these embodiments, the configuration manager 2304 may include a cache that stores commands that are then transmitted to at least one of a plurality of processing units, such as the accelerator 2216. However, in other embodiments, the configuration manager 2304 may not include a cache.
In some embodiments, the configuration manager 2304 or the memory controller 2306 may receive the addresses that need to be accessed for a task. The configuration manager 2304 or the memory controller 2306 may check a register to determine whether an address is already on a loaded line to one of the memory blocks 2308. If so, the memory controller 2306 may read the word from the memory block 2308 and pass it to the MAC unit 2302. If the address is not on a loaded line, the configuration manager 2304 may request that the memory controller 2306 load the line and may signal the MAC unit 2302 to stall until the loaded line is fetched.
In some embodiments, as shown in fig. 23, the memory controller 2306 may include two inputs forming two independent addresses. However, if more than two addresses must be accessed simultaneously and these addresses reside in a single memory block (e.g., all in memory block 2308(a)), the memory controller 2306 or the configuration manager 2304 may raise an exception. Alternatively, the configuration manager 2304 may return an invalid-data signal when both addresses are accessible only via a single line. In other embodiments, the unit may delay execution until all the required data can be retrieved, which reduces overall performance. However, the compiler may be able to find a configuration and data placement that prevents such delays.
In some embodiments, a compiler may generate configurations or instruction sets for the processing apparatus 2300 that configure the configuration manager 2304, the memory controller 2306, and the accelerator 2302 to handle situations in which multiple addresses must be accessed from a single memory block that has only one port. For example, the compiler may rearrange data in the memory blocks 2308 so that the processing units may access multiple lines in the memory blocks 2308.
In addition, the memory controller 2306 may also operate on more than one input at the same time. For example, memory controller 2306 may allow one of memory blocks 2308 to be accessed via one port and supply data when a request for a different memory block is received in another input. Thus, this operation may result in accelerator 2216 tasked with the exemplary 2D convolution receiving data from the dedicated communication lines of the relevant memory bank.
Additionally or alternatively, the memory controller 2306 or the logic blocks may maintain a refresh counter for each memory block 2308 and handle the refresh of all lines. Having this counter allows the memory controller 2306 to insert refresh cycles during the dead access times of the device.
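A behavioral sketch of this opportunistic refresh insertion follows; the single-cycle timing model and the refresh interval are assumptions for illustration.

```python
def schedule_refreshes(accesses, total_cycles, interval):
    """Track a refresh counter and issue a refresh in the first idle
    ('dead') cycle once the counter expires, so accesses proceed
    without being delayed by refresh."""
    busy = set(accesses)
    last_refresh, issued = 0, []
    for cycle in range(total_cycles):
        due = cycle - last_refresh >= interval
        if due and cycle not in busy:   # hide refresh in a dead cycle
            issued.append(cycle)
            last_refresh = cycle
    return issued

# Accesses occupy cycles 0-2 and 5-6; with interval 4, refreshes land
# in idle cycles 4 and 8 without delaying any access.
assert schedule_refreshes({0, 1, 2, 5, 6}, 10, 4) == [4, 8]
```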
Further, the memory controller 2306 may be configured to perform pipelined memory accesses, receiving addresses and opening lines in a memory block before supplying the data. A pipelined memory access may provide data to the processing units without interrupting or delaying clock cycles. For example, while the memory controller 2306 or a logic block uses the right-hand line to access data in fig. 23, it may be transmitting data over the left-hand line. These methods are explained in more detail with respect to fig. 26.
Depending on the required data, processing device 2300 may use multiplexers and/or other switching devices to select which devices are serviced to perform a given task. For example, configuration manager 2304 may configure a multiplexer so that at least two data lines reach MAC unit 2302. In this way, tasks that require data from multiple addresses, such as 2D convolution, may be performed faster, because the vectors or words that must be multiplied during convolution can arrive at the processing unit simultaneously in a single clock. This data transfer method may allow a processing unit, such as accelerator 2216, to output results quickly.
In some embodiments, configuration manager 2304 may be configurable to execute processes based on the priority of tasks. For example, configuration manager 2304 may be configured to let a running process complete without interruption. In that case, configuration manager 2304 may provide the instructions or configuration of a task to accelerator 2216, run the accelerator without interruption, and switch the multiplexers only when the task is complete. However, in other embodiments, configuration manager 2304 may interrupt a task and reconfigure data routing when it receives a higher-priority task (such as a request from an external interface). Where memory blocks 2308 are sufficient, however, memory controller 2306 may be configurable to route data to, or grant access to, processing units using dedicated lines that need not be changed before the task is completed. Further, in some embodiments, all devices may be connected to configuration manager 2304 through a bus, and the devices may manage access between themselves and the bus (e.g., using the same logic as a multiplexer). Thus, memory controller 2306 may be directly connected to several memory instances or memory blocks.
Alternatively, memory controller 2306 may be directly connected to memory sub-instances. In some embodiments, each memory instance or bank may be implemented as sub-instances (e.g., a DRAM may be implemented with mats having independent data lines, arranged as multiple sub-instances). Such instances may include at least one of a DRAM mat, a DRAM bank, a flash mat, or an SRAM mat, or any other type of memory. Memory controller 2306 may then include dedicated lines to address the sub-instances directly, to minimize latency during pipelined memory accesses.
In some embodiments, memory controller 2306 may also maintain the logic required for a particular memory instance (such as row/column decoders, refresh logic, etc.), or memory block 2308 may handle its own logic. In the latter case, memory block 2308 can receive an address and generate commands for returning/writing data.
FIG. 24 depicts an exemplary memory configuration diagram consistent with the disclosed embodiments. In some embodiments, a compiler that generates code or configurations for processing device 2200 may configure loading from memory blocks 2202 and 2204 by pre-arranging data in each block. For example, the compiler may pre-arrange the data such that each word required by a task is associated with a line of a memory instance or memory block. For tasks that require more words than fit in the available memory blocks of processing device 2200, the compiler may place data in more than one memory location of each memory block. The compiler may also store data in sequence and evaluate the latency of each memory block to avoid line-miss latency. In some embodiments, the host running the compiler may be part of a processing unit, such as configuration manager 2212, while in other embodiments the compiler host may be connected to processing device 2200 via an external interface. In the latter embodiments, the host may run compilation functions such as those described for the compiler.
In some embodiments, configuration manager 2212 may be a CPU or a microcontroller (uC). In such embodiments, configuration manager 2212 may have to access memory to retrieve commands or instructions placed there. A particular compiler may generate code and place it in memory so that consecutive commands are stored in the same memory line and across several memory banks, allowing fetched commands to benefit from pipelined memory access as well. In these embodiments, configuration manager 2212 and memory controller 2210 may avoid row latency in linear execution by facilitating pipelined memory accesses.
The preceding case of linear program execution describes how a compiler may recognize and place instructions to allow pipelined memory execution. Other software structures, however, may be more complex and require the compiler to recognize them and act accordingly. For example, where a task requires loops and branches, the compiler may place all loop code within a single line so that the loop can execute within that line without line-open latency. Memory controller 2210 then need not change lines during execution.
In some embodiments, configuration manager 2212 may include an internal cache or small memory. The internal cache may store commands that are executed by configuration manager 2212 to handle branches and loops. For example, commands in the internal cache may include instructions to configure an accelerator for accessing a memory block.
FIG. 25 is an exemplary flowchart illustrating a possible memory configuration process 2500 consistent with the disclosed embodiments. For ease of description of memory configuration process 2500, reference may be made to the identifiers of the elements depicted in FIG. 22 and described above. In some embodiments, process 2500 may be executed by a compiler that provides instructions to a host connected via an external interface. In other embodiments, process 2500 may be executed by a component of processing device 2200, such as configuration manager 2212.
In general, process 2500 may include: determining the number of words needed simultaneously to execute the task; determining the number of words that can be accessed simultaneously from each of a plurality of memory banks; and, when the number of words needed simultaneously is greater than the number of words that can be accessed simultaneously, dividing the needed words among the plurality of memory banks. Dividing the words may include organizing them round-robin, assigning one word per memory bank in sequence.
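As an illustration of the word-division step, the following Python sketch (hypothetical names; a simplification of what a real compiler would do) deals the words needed in the same cycle round-robin, one word per bank, so that each can be fetched from a different bank simultaneously.

```python
def divide_words_round_robin(words, num_banks):
    """Assign one word per memory bank in sequence (round-robin)."""
    placement = {bank: [] for bank in range(num_banks)}
    for i, word in enumerate(words):
        placement[i % num_banks].append(word)
    return placement


# Eight words needed simultaneously, four single-port banks:
print(divide_words_round_robin([f"w{i}" for i in range(8)], num_banks=4))
# {0: ['w0', 'w4'], 1: ['w1', 'w5'], 2: ['w2', 'w6'], 3: ['w3', 'w7']}
```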
More specifically, process 2500 may begin at step 2502, where the compiler may receive a task specification. The specification may include the required computations and/or a priority level.
In step 2504, the compiler may identify an accelerator, or group of accelerators, that can execute the task. Alternatively, the compiler may generate instructions so that a processing unit (such as configuration manager 2212) can identify an accelerator to execute the task. For example, using the required computations, configuration manager 2212 may identify an accelerator in the group of accelerators 2216 that can handle the task.
In step 2506, the compiler may determine the number of words that must be accessed simultaneously to execute the task. For example, multiplying two vectors requires access to at least two words at a time (one from each vector), so the compiler may determine that a word of each vector must be accessed simultaneously to perform the operation.
In step 2508, the compiler may determine the number of cycles necessary to execute the task. For example, if the task requires four sequential convolution operations, the compiler may determine that at least 4 cycles will be necessary to execute the task.
In step 2510, the compiler may place words that must be accessed simultaneously in different memory banks. In this way, memory controller 2210 may open lines to different memory instances and access the required memory blocks within a clock cycle, without needing any cached data.
In step 2512, the compiler may place sequentially accessed words in the same memory bank. For example, where four cycles of operations are required, the compiler may generate instructions to write the words needed in sequential cycles to a single memory block, avoiding line changes between different memory blocks during execution.
In step 2514, the compiler may generate instructions for programming a processing unit, such as configuration manager 2212. The instructions may specify conditions for operating switching devices (such as multiplexers) or configuring a data bus. With these instructions, configuration manager 2212 may configure memory controller 2210 to route data from the memory blocks to the processing units, or to grant access to the memory blocks using dedicated communication lines, according to the task.
FIG. 26 is an exemplary flowchart illustrating a memory read process 2600 consistent with the disclosed embodiments. For ease of description of memory read process 2600, reference may be made to the identifiers of the elements depicted in fig. 22 and described above. In some embodiments, process 2600 may be executed by memory controller 2210, as described below. In other embodiments, however, process 2600 may be executed by other elements in processing device 2200, such as configuration manager 2212.
In step 2602, memory controller 2210, configuration manager 2212, or another processing unit may receive a request to route data from, or grant access to, a memory bank. The request may specify an address and a memory block.
In some embodiments, the request may be received via a data bus, with line 2218 specifying a read command and line 2220 specifying an address. In other embodiments, the request may be received via a demultiplexer connected to memory controller 2210.
In step 2604, configuration manager 2212, a host, or another processing unit may query an internal register. The internal register may include information about open lines to memory banks, open addresses, open memory blocks, and/or upcoming tasks. Based on this information, it may be determined whether there is an open line to the memory bank specified by the request received in step 2602. Alternatively or additionally, memory controller 2210 may query the internal register directly.
If the internal register indicates that the memory bank does not have a loaded open line (step 2606: no), process 2600 may proceed to step 2616 and load a line in the memory bank associated with the received address. In addition, memory controller 2210 or a processing unit, such as configuration manager 2212, may signal a delay to the element requesting information from the memory address. For example, if accelerator 2216 requests information located in a memory bank that is occupied, memory controller 2210 may send a delay signal to the accelerator in step 2618. In step 2620, configuration manager 2212 or memory controller 2210 may update the internal register to indicate that a line has been opened to the new memory bank or memory block.
If the internal register indicates that the memory bank has a loaded open line (step 2606: yes), process 2600 may proceed to step 2608. In step 2608, it may be determined whether the loaded line is being used for a different address. If so (step 2608: yes), two addresses would require the same line in a single block and therefore could not be accessed simultaneously; an error or exception signal may then be sent to the element requesting information from the memory address in step 2616. If, however, the line is not being used for a different address (step 2608: no), the line may be used for the requested address, data may be retrieved from the target memory bank, and process 2600 may continue to step 2614 to transfer the data to the element requesting information from the memory address.
Through process 2600, processing device 2200 can establish a direct connection between a processing unit and the memory block or memory instance containing the information needed to execute a task. This organization of data enables reading information from vectors organized across different memory instances, and allows information to be retrieved from different memory blocks simultaneously when a device requests multiple such addresses.
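The decision logic of process 2600 can be condensed into the following Python sketch (hypothetical data structures; the strings 'delay', 'error', and 'data' stand in for the signals described above).

```python
def handle_read(request, open_lines, in_use):
    """open_lines: block -> open line; in_use: block -> address being served."""
    block, address, line = request["block"], request["address"], request["line"]
    if open_lines.get(block) != line:              # step 2606: no open line
        open_lines[block] = line                   # step 2616: load the line
        return "delay"                             # step 2618: delay signal
    if in_use.get(block) not in (None, address):   # step 2608: line used elsewhere
        return "error"                             # two addresses, one block
    in_use[block] = address
    return "data"                                  # step 2614: transfer the data


open_lines, in_use = {}, {}
req = {"block": 0, "address": 100, "line": 3}
print(handle_read(req, open_lines, in_use))  # 'delay' (line must be loaded first)
print(handle_read(req, open_lines, in_use))  # 'data'
```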
Fig. 27 is an exemplary flowchart illustrating an execution process 2700 consistent with the disclosed embodiments. For ease of description of execution process 2700, reference may be made to the identifiers of the elements depicted in FIG. 22 and described above.
In step 2702, a compiler or an internal unit, such as configuration manager 2212, may receive an indication of a task that needs to be executed. The task may comprise a single operation (e.g., multiplication) or a more complex operation (e.g., convolution between matrices). The task may also indicate the required computation.
In step 2704, the compiler or configuration manager 2212 may determine the number of words that must be accessed simultaneously to execute the task. For example, the compiler may determine that two words must be accessed simultaneously to perform a multiplication between vectors. In another example (a 2D convolution task), configuration manager 2212 may determine that the convolution between matrices requires "n" times "m" words, where "n" and "m" are the matrix dimensions. Further, in step 2704, configuration manager 2212 may also determine the number of cycles necessary to execute the task.
In step 2706, depending on the determination in step 2704, the compiler may write words that must be accessed simultaneously into multiple memory banks disposed on the substrate. For example, when the number of words that can be accessed simultaneously from one memory bank is less than the number of words needed simultaneously, the compiler may organize the data across multiple memory banks to allow the different required words to be accessed within one clock. Further, when configuration manager 2212 or the compiler determines the number of cycles necessary to execute the task, the compiler may write the words needed in sequential cycles into a single one of the memory banks, to prevent line switching between memory banks.
In step 2708, memory controller 2210 may be configured to read at least one first word from, or grant access to at least one first word in, a first memory bank of the plurality of memory banks or blocks, using a first memory line.
In step 2710, a processing unit (e.g., one of accelerators 2216) may process the task using the at least one first word.
In step 2712, memory controller 2210 may be configured to open a second memory line in a second memory bank. For example, based on the task and using a pipelined memory access method, memory controller 2210 may open a second memory line in the second memory block, to which information required by the task was written in step 2706. In some embodiments, the second memory line may be opened as the task of step 2710 nears completion. For example, if a task requires 100 clocks, the second memory line may be opened at the 90th clock.
In some embodiments, steps 2708-2712 may be performed in one line access cycle.
In step 2714, memory controller 2210 may be configured to grant access to data of the at least one second word from the second memory bank, using the second memory line opened in step 2712.
In step 2716, a processing unit (e.g., one of accelerators 2216) may process the task using the at least one second word.
In step 2718, memory controller 2210 may be configured to open a second memory line in the first memory bank. For example, based on the task and using a pipelined memory access method, memory controller 2210 may open a second memory line to the first memory block. In some embodiments, the second memory line to the first bank may be opened as the task of step 2716 nears completion.
In some embodiments, steps 2714-2718 may be performed within one line access cycle.
At step 2720, memory controller 2210 may read, or grant access to, at least one third word from the first memory bank of the plurality of memory banks or blocks, using the second memory line in the first bank (or a first line in a third bank), continuing in this manner across different memory banks.
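The alternating pattern of steps 2708-2720 can be sketched as follows in Python (stub classes and method names are hypothetical): while the processing unit consumes the current word, the controller is already opening the line that holds the next word in the other bank.

```python
class ControllerStub:
    def open_line(self, bank, line):
        return f"word@bank{bank}/line{line}"

    def begin_open_line(self, bank, line):
        pass  # line opening overlaps with computation on the current word

    def finish_open_line(self, bank, line):
        return f"word@bank{bank}/line{line}"


class ProcessorStub:
    def process(self, word):
        print("processing", word)


def pipelined_execution(words_by_step, controller, processing_unit):
    current = controller.open_line(*words_by_step[0])   # step 2708
    for nxt in words_by_step[1:]:
        controller.begin_open_line(*nxt)                # steps 2712/2718
        processing_unit.process(current)                # steps 2710/2716
        current = controller.finish_open_line(*nxt)     # steps 2714/2720
    processing_unit.process(current)


# Alternate between banks 0 and 1 so every line open hides behind compute:
pipelined_execution([(0, 0), (1, 0), (0, 1)], ControllerStub(), ProcessorStub())
```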
Some memory chips, such as Dynamic Random Access Memory (DRAM) chips, use refreshing to avoid the loss of data (stored, e.g., as charge on capacitors) due to voltage decay in the chip's capacitors or other electrical components. For example, in DRAM, each cell must be refreshed from time to time (at an interval based on the particular process and design) to restore the charge in its capacitor so that data is not lost or corrupted. As the memory capacity of DRAM chips increases, the amount of time spent refreshing the memory becomes significant. During the period in which a line of memory is being refreshed, the bank containing that line cannot be accessed, which can reduce performance. The power consumed by the refresh process can also be significant. Previous efforts have attempted to reduce the rate at which refreshes are performed in order to reduce these adverse effects, but most of them have focused on the physical level of the DRAM.
Refreshing is analogous to reading and writing back a row of memory. Using this principle and focusing on the patterns in which memory is accessed, embodiments of the present disclosure include software and hardware techniques, and modifications to the memory chip, that use less power for refreshing and reduce the amount of time spent refreshing memory. For example, as an overview, some embodiments may use hardware and/or software to track line access timing and skip recently accessed rows within a refresh cycle (e.g., based on a timing threshold). As another example, some embodiments may rely on software executed by the memory chip's refresh controller to assign reads and writes so that accesses to memory are non-random; the software can then control refreshes more precisely to avoid wasting refresh cycles and/or lines. These techniques may be used alone or in combination with a compiler that encodes commands for the refresh controller and machine code for the processor so that memory access is likewise non-random. Using any combination of the techniques and configurations described in detail below, the disclosed embodiments may reduce memory refresh power requirements and/or improve system performance by reducing the time spent refreshing memory cells.
FIG. 28 depicts an example memory chip 2800 having a refresh controller 2803 consistent with the present disclosure. For example, memory chip 2800 may include multiple memory banks (e.g., memory bank 2801a and the like) on a substrate. In the embodiment of fig. 28, the substrate includes four memory banks, each having four lines. A line may refer to a word line within one or more memory banks of memory chip 2800, or to any other set of memory cells within memory chip 2800, such as a portion of a memory bank or an entire row spanning a memory bank or group of memory banks.
In other embodiments, the substrate may include any number of memory banks, and each memory bank may include any number of lines. Some memory banks may include the same number of lines (as shown in fig. 28), while others may include different numbers of lines. As further depicted in fig. 28, memory chip 2800 may include a controller 2805 to receive inputs to memory chip 2800 and to transmit outputs from memory chip 2800 (e.g., as described above in "division of code").
In some embodiments, the plurality of memory banks may include Dynamic Random Access Memory (DRAM). However, the plurality of memory banks may include any volatile memory that stores data that requires periodic refreshing.
As will be discussed in more detail below, embodiments disclosed herein may use a counter or resistor-capacitor circuit to time the refresh cycle. For example, a counter or timer may be used to count the time since the last full refresh cycle, and then when the counter reaches its target value, all rows may be iterated using another counter. Embodiments of the present disclosure may additionally track accesses to sections of memory chip 2800 and reduce the required refresh power. For example, although not depicted in fig. 28, the memory chip 2800 may also include data storage configured to store access information indicating access operations of one or more sectors in a plurality of memory banks. For example, the one or more sections may include any portion of a line, column, or any other grouping of memory cells within the memory chip 2800. In one particular embodiment, the one or more sections may include at least one row of memory structures within a plurality of memory banks. The refresh controller 2803 may be configured to perform a refresh operation for the one or more sectors based at least in part on the stored access information.
For example, the data store may include one or more registers associated with sections of memory chip 2800 (e.g., lines, columns, or any other grouping of memory cells within memory chip 2800), Static Random Access Memory (SRAM) cells, or the like. Additionally, the data store may be configured to store a bit indicating whether the associated section was accessed in one or more previous cycles. A "bit" may include any data structure that stores at least one bit, such as a register, an SRAM cell, a non-volatile memory, or the like. Further, a bit may be set by setting a corresponding switch (or switching element, such as a transistor) of the data structure to on (which may be equivalent to "1" or "true"). Additionally or alternatively, a bit may be set by modifying any other property within the data structure (such as charging a floating gate of a flash memory, modifying the state of one or more flip-flops in an SRAM, or the like) so as to write a "1" (or any other value indicating a set bit) to the data structure. If a bit is determined to be set during a refresh operation, refresh controller 2803 may skip the refresh cycle for the associated section and clear the associated register.
In another embodiment, the data store may include one or more non-volatile memories (e.g., flash memories or the like) associated with a section of the memory chip 2800 (e.g., a line, column, or any other grouping of memory cells within the memory chip 2800). The non-volatile memory may be configured to store a bit indicating whether the associated segment was accessed in one or more previous cycles.
Some embodiments may additionally or alternatively add a timestamp register on each row or group of rows (or other section of the memory chip 2800) that holds the last moment a line was accessed within the current refresh cycle. This means that the refresh controller can update the row timestamp register with each row access. Thus, when the next refresh occurs (e.g., at the end of a refresh cycle), the refresh controller can compare the stored timestamps, and if the associated section was previously accessed within a certain time period (e.g., within a certain threshold as applied to the stored timestamps), the refresh controller can jump to the next section. This prevents the system from consuming refresh power on sectors that have been recently accessed. In addition, the refresh controller may continue to track accesses to ensure that each sector is accessed or refreshed on the next cycle.
Thus, in yet another embodiment, the data store may include one or more registers or non-volatile memory associated with a section of the memory chip 2800 (e.g., a line, column, or any other grouping of memory cells within the memory chip 2800). Rather than using a bit to indicate whether the associated section has been accessed, the register or non-volatile memory may be configured to store a timestamp or other information indicating the most recent access of the associated section. In such an embodiment, the refresh controller 2803 can determine whether to refresh or access the associated segment based on whether the amount of time between the timestamp stored in the associated register or memory and the current time (e.g., from a timer, as explained below in fig. 29A and 29B) exceeds a predetermined threshold (e.g., 8ms, 16ms, 32ms, 64ms, or the like).
Thus, the predetermined threshold may include an amount of time for a refresh cycle that ensures that the associated segment is refreshed (if not accessed) at least once within each refresh cycle. Alternatively, the predetermined threshold may include an amount of time that is less than the amount of time required for a refresh cycle (e.g., to ensure that any required refresh or access signals may reach the associated sector before the refresh cycle is completed). For example, the predetermined time may comprise 7ms for a memory chip having an 8ms refresh period, such that if a sector has not been accessed within 7ms, the refresh controller will send a refresh or access signal that reaches the sector at the end of the 8ms refresh period. In some embodiments, the predetermined threshold may depend on the size of the associated segment. For example, the predetermined threshold may be smaller for smaller sections of the memory chip 2800.
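The timestamp comparison can be expressed in a few lines of Python (a sketch only; the units and the 7ms/8ms figures follow the example above, and a real controller would compare timer ticks rather than floats).

```python
REFRESH_PERIOD_MS = 8.0
SKIP_THRESHOLD_MS = 7.0  # margin so the refresh lands before the cycle ends


def needs_refresh(last_access_ms, now_ms, threshold_ms=SKIP_THRESHOLD_MS):
    """Refresh only if the section was not accessed within the threshold."""
    return (now_ms - last_access_ms) >= threshold_ms


print(needs_refresh(last_access_ms=97.0, now_ms=100.0))  # False -> skip refresh
print(needs_refresh(last_access_ms=92.5, now_ms=100.0))  # True  -> refresh
```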
Although described above with respect to memory chips, the refresh controller of the present disclosure may also be used in distributed processor architectures, such as those described above and throughout the sections of the present disclosure. One embodiment of such an architecture is depicted in fig. 7A. In these embodiments, the same substrate as the memory chip 2800 can include multiple processing groups disposed thereon, e.g., as depicted in fig. 7A. As explained above with respect to fig. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on a substrate. The groups may represent spatial distributions and/or logical groupings on a substrate for purposes of compiling code for execution on the memory chip 2800. Thus, the substrate can include a memory array that includes a plurality of banks, such as bank 2801a and other banks shown in fig. 28. In addition, the substrate may include a processing array that may include a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in fig. 7A).
As further explained above with respect to fig. 7A, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to the processor subunit. Further, to allow each processor subunit to communicate with its corresponding dedicated memory bank, the substrate may include a first plurality of buses connecting one of the processor subunits to its corresponding dedicated memory bank.
In these embodiments, as shown in fig. 7A, the substrate may include a second plurality of buses to connect each processor subunit to at least one other processor subunit (e.g., an adjacent subunit in the same row, an adjacent processor subunit in the same column, or any other processor subunit on the substrate). The first plurality of buses and/or the second plurality of buses may be free of sequential hardware logic components, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by sequential hardware logic components, as explained above in the section "synchronization using software".
In embodiments where the same substrate as memory chip 2800 may include multiple processing groups disposed thereon (e.g., as depicted in fig. 7A), the processor subunits may also include address generators (e.g., address generator 450 as depicted in fig. 4). Further, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to the processor subunit. Thus, each of the address generators may be associated with a corresponding dedicated memory bank of the plurality of memory banks. In addition, the substrate may include a plurality of buses, each bus connecting one of the plurality of address generators to its corresponding dedicated memory bank.
FIG. 29A depicts an example refresh controller 2900 consistent with the present disclosure. Refresh controller 2900 may be incorporated in a memory chip of the present disclosure (such as memory chip 2800 of fig. 28). As depicted in fig. 29A, refresh controller 2900 may include a timer 2901, which may comprise an on-chip oscillator or any other timing circuit for refresh controller 2900. In the configuration depicted in fig. 29A, timer 2901 may trigger a refresh cycle periodically (e.g., every 8ms, 16ms, 32ms, 64ms, or the like). A refresh cycle may use row counter 2903 to cycle through all rows of the corresponding memory chip, with adder 2907 in conjunction with valid bit 2905 generating a refresh signal for each row. As shown in fig. 29A, bit 2905 may be fixed to 1 ("true") to ensure that every row is refreshed during the cycle.
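Behaviorally, the fig. 29A configuration reduces to the following Python sketch (hypothetical function names): the timer starts a cycle, and the row counter walks every row with the valid bit fixed to 1, so no row is skipped.

```python
def full_refresh_cycle(num_rows, refresh_row):
    valid_bit = 1                    # fixed to "true" in fig. 29A
    for row in range(num_rows):      # row counter 2903 advanced by adder 2907
        if valid_bit:
            refresh_row(row)


# Triggered by timer 2901 every refresh period (e.g., every 8 ms):
full_refresh_cycle(4, lambda row: print("refresh row", row))
```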
In embodiments of the present disclosure, refresh controller 2900 may include data storage. As described above, the data store may include one or more registers or non-volatile memory associated with a section of the memory chip 2800 (e.g., a line, column, or any other grouping of memory cells within the memory chip 2800). The register or non-volatile memory may be configured to store a timestamp or other information indicating the most recent access of the associated segment.
In any of the embodiments described above, refresh controller 2900 may be integrated with a memory controller for multiple memory banks. For example, similar to the embodiment depicted in FIG. 3A, refresh controller 2900 may be incorporated into logic and control subunits associated with memory banks or other sections of memory chip 2800.
FIG. 29B depicts another example refresh controller 2900' consistent with the present disclosure. Refresh controller 2900' may be incorporated in a memory chip of the present disclosure (such as memory chip 2800 of fig. 28). Similar to refresh controller 2900, refresh controller 2900' includes timer 2901, row counter 2903, valid bit 2905, and adder 2907. Additionally, refresh controller 2900' may include data storage 2909. As shown in fig. 29B, data storage 2909 may include one or more registers or non-volatile memory associated with sections of memory chip 2800 (e.g., lines, columns, or any other grouping of memory cells within memory chip 2800), and a state within the data storage may be configured to change (e.g., by sense amplifiers and/or other elements of refresh controller 2900', as described above) in response to one or more sections being accessed. Thus, refresh controller 2900' may be configured to skip the refresh of one or more sections based on the state within the data storage. For example, if the state associated with a section is enabled (e.g., set to 1 by turning on a switch, changing a property to store a "1," or the like), refresh controller 2900' may skip the refresh cycle for that section and clear the associated state. The state may be stored in a register of at least one bit or any other memory structure configured to store at least one data bit.
To ensure that a segment of the memory chip is refreshed or accessed during each refresh cycle, refresh controller 2900' may reset or otherwise clear the state in order to trigger a refresh signal during the next refresh cycle. In some embodiments, after a section is skipped, refresh controller 2900' may clear the associated state to ensure that the section is refreshed in the next refresh cycle. In other embodiments, refresh controller 2900' may be configured to reset the state within the data storage after a threshold time interval. For example, refresh controller 2900' may clear (e.g., set to 0) the state in the data storage whenever timer 2901 indicates that a threshold time has passed since the associated state was set (e.g., set to 1 by turning on a switch, changing a property to store a "1," or the like). In some embodiments, refresh controller 2900' may use a threshold number of refresh cycles (e.g., one, two, or the like) or a threshold number of clock cycles (e.g., two, four, or the like) instead of a threshold time.
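The skip-and-clear behavior can be sketched in Python as follows (hypothetical data-store layout: one access bit per section).

```python
def refresh_cycle_with_skip(access_bits, refresh_section):
    for section, accessed in enumerate(access_bits):
        if accessed:
            access_bits[section] = 0  # clear: forces a refresh next cycle
        else:
            refresh_section(section)


bits = [1, 0, 0, 1]  # sections 0 and 3 were accessed this cycle
refresh_cycle_with_skip(bits, lambda s: print("refresh section", s))
print(bits)          # [0, 0, 0, 0] -> every section refreshes next cycle
```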
In other embodiments, the state may include a timestamp of the most recent refresh or access of the associated segment, such that if the amount of time between the timestamp and the current time (e.g., from timer 2901 of fig. 29A and 29B) exceeds a predetermined threshold (e.g., 8ms, 16ms, 32ms, 64ms, or the like), refresh controller 2900' may send an access command or refresh signal to the associated segment and update the timestamp associated with a portion thereof (e.g., using timer 2901). Additionally or alternatively, if the refresh time indicator indicates that the last refresh time is within a predetermined time threshold, refresh controller 2900' may be configured to skip refresh operations relative to one or more sectors in the plurality of memory banks. In these embodiments, after skipping refresh operations with respect to one or more segments, refresh controller 2900' may be configured to alter the stored refresh time indicators associated with the one or more segments so that the one or more segments will be refreshed during the next cycle of operation. For example, as described above, refresh controller 2900' may use timer 2901 to update the stored refresh time indicator.
Accordingly, the data store may include a timestamp register configured to store a refresh time indicator indicating a time at which the one or more banks in the plurality of memory banks were last refreshed. In addition, refresh controller 2900' may use the output of the timer to clear access information stored in data storage after a threshold time interval.
In any of the embodiments described above, the access to the one or more sectors may comprise a write operation associated with the one or more sectors. Additionally or alternatively, the access to the one or more sectors may include a read operation associated with the one or more sectors.
Further, as depicted in fig. 29B, refresh controller 2900' may include a row counter 2903 and an adder 2907 configured to assist in updating data storage 2909 based at least in part on a state within the data storage. Data store 2909 may contain bit tables associated with multiple memory banks. For example, the bit table may contain an array of switches (or switching elements, such as transistors) or registers (e.g., SRAM or the like) configured to hold bits for the associated section. Additionally or alternatively, data store 2909 can store timestamps associated with multiple memory banks.
In addition, refresh controller 2900' may include refresh gate 2911 configured to control whether to refresh one or more segments based on the corresponding values stored in the bit table. For example, refresh gate 2911 may include a logic gate (such as an AND gate) that disables refresh signals from row counter 2903 if a corresponding state of data storage 2909 indicates that the associated segment was refreshed or accessed during one or more previous clock cycles.
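The refresh gate itself reduces to simple combinational logic, as in this one-function Python sketch (an assumption of AND-gate behavior consistent with the description above).

```python
def refresh_gate(row_counter_refresh, accessed_recently):
    # AND the cycle's refresh signal with NOT(access bit) for the section
    return row_counter_refresh and not accessed_recently


print(refresh_gate(True, True))   # False: skip, section was just accessed
print(refresh_gate(True, False))  # True: drive the refresh signal
```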
Fig. 30 is a flow diagram of an embodiment of a process 3000 for partial refresh in a memory chip (e.g., memory chip 2800 of fig. 28), process 3000 being executable by a refresh controller consistent with the present disclosure, such as refresh controller 2900 of fig. 29A or refresh controller 2900' of fig. 29B.
At step 3010, the refresh controller may access information indicating access operations for one or more sectors in the plurality of memory banks. For example, as explained above with respect to fig. 29A and 29B, the refresh controller may include a data store associated with a segment of the memory chip 2800 (e.g., a line, column, or any other grouping of memory cells within the memory chip 2800) and configured to store a timestamp or other information indicating a most recent access of the associated segment.
At step 3020, the refresh controller may generate refresh and/or access commands based at least in part on the accessed information. For example, as explained above with respect to fig. 29A and 29B, the refresh controller may skip refresh operations for one or more sectors in the plurality of memory banks if the accessed information indicates that the last refresh or access time is within a predetermined time threshold and/or that the last refresh or access occurred during one or more previous clock cycles. Additionally or alternatively, the refresh controller may generate a command to refresh or access the associated segment based on whether the accessed information indicates that the last refresh or access time exceeds a predetermined threshold and/or that the last refresh or access did not occur during one or more previous clock cycles.
At step 3030, the refresh controller can alter the stored refresh time indicators associated with one or more segments such that the one or more segments will be refreshed during the next cycle of operation. For example, after skipping a refresh operation with respect to one or more segments, the refresh controller may alter information indicative of an access operation of the one or more segments such that the one or more segments will be refreshed during a next clock cycle. Thus, after skipping a refresh cycle, the refresh controller may clear the state of the segment (e.g., set to 0). Additionally or alternatively, the refresh controller can set the state (e.g., to 1) of the segment being refreshed and/or accessed during the current cycle. In embodiments where the information indicative of the access operation of one or more segments includes a timestamp, the refresh controller may update any stored timestamps associated with the segments refreshed and/or accessed during the current cycle.
FIG. 31 is a flow diagram of one embodiment of a process 3100 for determining refreshes of a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3100 may be implemented within a compiler consistent with the present disclosure. As explained above, a "compiler" refers to any computer program that converts a higher-level language (e.g., a procedural language such as C, FORTRAN, or BASIC; an object-oriented language such as Java, C++, Pascal, or Python; and the like) into a lower-level language (e.g., assembly code, object code, machine code, or the like). A compiler may allow a human to program a series of instructions in a human-readable language and then convert that language into a machine-executable language. A compiler may comprise software instructions executed by one or more processors.
At step 3110, one or more processors may receive higher-order computer code. For example, the higher-order computer code may be encoded in one or more files on a memory (e.g., a non-volatile memory such as a hard disk drive or the like, a volatile memory such as DRAM, or the like) or received via a network (e.g., the Internet or the like). Additionally or alternatively, the higher-order computer code may be received from a user (e.g., using an input device such as a keyboard).
At step 3120, the one or more processors may identify a plurality of memory segments distributed across a plurality of memory banks associated with the memory chip to be accessed by the higher-order computer code. For example, the one or more processors may access a data structure defining the plurality of memory segments and the corresponding structure of the memory chip. The one or more processors may access the data structure from a memory (e.g., a non-volatile memory such as a hard disk drive or the like, or a volatile memory such as DRAM or the like), or receive the data structure via a network (e.g., the Internet or the like). In these embodiments, the data structure is included in one or more libraries accessible to the compiler, permitting the compiler to generate instructions for the particular memory chip to be accessed.
At step 3130, the one or more processors may evaluate the higher-order computer code to identify a plurality of memory access commands occurring within a plurality of memory access cycles. For example, the one or more processors may identify each operation within the higher-order computer code that requires one or more read commands to read from memory and/or one or more write commands to write to memory. These instructions may include variable initialization, variable reassignment, logical operations on variables, input/output operations, or the like.
At step 3140, the one or more processors may cause data associated with the plurality of memory access commands to be distributed across each of the plurality of memory segments, such that each memory segment is accessed during each of the plurality of memory access cycles. For example, the one or more processors may use the data structure of the memory chip to identify the memory segments and then assign variables from the higher-order code to each of the memory segments, so that each segment is accessed (e.g., via a write or read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In this embodiment, the one or more processors may access information indicating how many clock cycles each line of higher-order code requires, in order to assign variables from those lines such that each memory segment is accessed at least once during the particular number of clock cycles.
In another embodiment, one or more processors may first generate machine code or other lower order code from higher order code. The one or more processors may then assign variables from the lower-order code to each of the memory segments such that each memory segment is accessed (e.g., via writing or reading) at least once during each refresh cycle (which may include a particular number of clock cycles). In this embodiment, each line of lower order code may require a single clock cycle.
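A simplified Python sketch of this assignment step (hypothetical inputs; a real compiler would weigh access counts and cycle budgets) spreads variables over the memory segments and reports any segment left uncovered, which would then need an explicit refresh.

```python
def distribute_variables(variables, segments):
    """Round-robin variables over segments; report segments with no access."""
    placement = {seg: [] for seg in segments}
    for i, var in enumerate(variables):
        placement[segments[i % len(segments)]].append(var)
    uncovered = [seg for seg, assigned in placement.items() if not assigned]
    return placement, uncovered


placement, uncovered = distribute_variables(
    ["a", "b", "c", "d", "e"], ["seg0", "seg1", "seg2"])
print(placement)  # every segment holds at least one accessed variable
print(uncovered)  # [] -> no explicit refresh commands needed for this code
```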
In any of the embodiments presented above, one or more processors may further assign logical operations or other commands that use the temporary output to various ones of the memory segments. These temporary outputs may also generate read and/or write commands such that the assigned memory segment is accessed during this refresh cycle even though the specified variable has not been assigned to the memory segment.
Process 3100 may also include additional steps. For example, in embodiments where variables are assigned prior to compilation, the one or more processors may generate machine code or other lower-order code from the higher-order code. Further, the one or more processors may transmit the compiled code for execution by a memory chip and corresponding logic circuitry. The logic circuitry may comprise conventional circuitry such as a GPU or CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in fig. 7A. Thus, as described above, the substrate can include a memory array that includes a plurality of banks, such as bank 2801a and others shown in fig. 28. In addition, the substrate may include a processing array that may include a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in fig. 7A).
FIG. 32 is a flow diagram of another embodiment of a process 3200 for determining refreshes of a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3200 may be implemented within a compiler consistent with the present disclosure and may be executed by one or more processors executing software instructions that comprise the compiler. Process 3200 may be implemented separately or in combination with process 3100 of fig. 31.
At step 3210, similar to step 3110, the one or more processors may receive higher-order computer code. At step 3220, similar to step 3120, the one or more processors may identify a plurality of memory segments distributed across a plurality of memory banks associated with the memory chip to be accessed by the higher-order computer code.
At step 3230, the one or more processors may evaluate the higher-order computer code to identify a plurality of memory access commands, each involving one or more of the plurality of memory segments. For example, the one or more processors may identify each operation within the higher-order computer code that requires one or more read commands to read from memory and/or one or more write commands to write to memory. These instructions may include variable initialization, variable reassignment, logical operations on variables, input/output operations, or the like.
In some embodiments, one or more processors may emulate execution of higher-order code using logic circuits and multiple memory sections. For example, the emulation may include a line-by-line step-through of higher-order code, similar to that of a debugger or other Instruction Set Simulator (ISS). The simulation may further maintain internal variables representing addresses of the multiple memory sections, similar to how a debugger may maintain internal variables representing registers of a processor.
At step 3240, the one or more processors may track, based on the analysis of the memory access command and for each memory segment among the plurality of memory segments, an amount of time accumulated since a last access to the memory segment. For example, using the simulations described above, one or more processors may determine a length of time between each access (e.g., read or write) to one or more addresses within each of a plurality of memory segments. The length of time may be measured in absolute time, clock cycles, or refresh cycles (e.g., as determined by a known refresh rate of the memory chip).
At step 3250, in response to a determination that the amount of time elapsed since the last access of a particular memory segment will exceed a predetermined threshold, the one or more processors may introduce into the higher-order computer code at least one of a memory refresh command or a memory access command configured to cause an access to that memory segment. For example, the one or more processors may include refresh commands for execution by a refresh controller (e.g., refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B). In embodiments where the logic circuitry is not disposed on the same substrate as the memory chip, the one or more processors may generate the refresh commands to be sent to the memory chip separately from the lower-order code to be sent to the logic circuitry.
Additionally or alternatively, the one or more processors may include access commands for execution by a memory controller (which may be separate from, or incorporated into, the refresh controller). An access command may include a dummy command configured to trigger a read operation on the memory segment without causing the logic circuit to perform any other operation on the variable read from (or written to) the memory segment.
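Steps 3240-3250 can be sketched as a trace walk in Python (hypothetical trace format; a real tool would check every cycle rather than only access events): the compiler tracks the time since each segment's last access and emits a refresh (or dummy-access) command wherever the gap would exceed the threshold.

```python
def insert_refresh_commands(trace, segments, threshold):
    """trace: list of (cycle, segment) accesses sorted by cycle."""
    last_access = {seg: 0 for seg in segments}
    commands = []
    for cycle, seg in trace:
        for other in segments:  # check stale segments at each access event
            if other != seg and cycle - last_access[other] >= threshold:
                commands.append((cycle, "refresh", other))
                last_access[other] = cycle
        last_access[seg] = cycle
    return commands


trace = [(1, "seg0"), (2, "seg0"), (9, "seg0")]
print(insert_refresh_commands(trace, ["seg0", "seg1"], threshold=4))
# [(9, 'refresh', 'seg1')] -> seg1 was never accessed, so a refresh is added
```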
In some embodiments, the compiler may include a combination of steps from process 3100 and steps from process 3200. For example, the compiler may assign variables according to step 3140 and then run the simulation described above to add in any additional memory refresh commands or memory access commands according to step 3250. This combination may allow the compiler to distribute the variables across as many memory segments as possible, and generate refresh or access commands for any memory segment that cannot be accessed for a predetermined threshold amount of time. In another combined embodiment, the compiler may simulate the code according to step 3230 and assign variables according to step 3140 based on the simulation indicating any memory segments that will not be accessed within a predetermined threshold amount of time. In some embodiments, this combination may also include step 3250 to allow the compiler to generate a refresh or access command for any memory segment that cannot be accessed within a predetermined threshold amount of time, even after the assignment according to step 3140 is complete.
The refresh controller of the present disclosure may allow software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, e.g., as depicted in fig. 7A) to disable the auto-refresh performed by the refresh controller and instead control refreshes via the executed software. Thus, in some embodiments, software may access the memory chip with a known access pattern (e.g., if the compiler is able to access the data structures defining the plurality of memory banks and the corresponding structure of the memory chip). In these embodiments, a post-compile optimizer may disable auto-refresh and manually set refresh controls only for sections of the memory chip that are not accessed within a threshold amount of time. Thus, similar to step 3250 described above but after compilation, the post-compile optimizer may generate refresh commands ensuring that each memory segment is accessed or refreshed within a predetermined threshold amount of time.
Another approach to reducing refresh cycles can include using predefined patterns of access to the memory chip. For example, if software executed by the logic circuits can control its pattern of access to the memory chip, some embodiments may generate refresh access patterns that go beyond the conventional linear line-by-line refresh. For example, if the refresh controller determines that software executed by the logic circuits regularly accesses every other row of memory, the refresh controller of the present disclosure may use an access pattern that does not refresh every other row, in order to speed up the memory chip and reduce power usage.
One embodiment of such a refresh controller is shown in fig. 33, which depicts a refresh controller 3300 configured with stored patterns consistent with the present disclosure. Refresh controller 3300 may be incorporated in a memory chip of the present disclosure having a plurality of memory banks and a plurality of memory segments included in each of the plurality of memory banks, such as memory chip 2800 of fig. 28.
The refresh controller 3300 includes a timer 3301 (similar to timer 2901 of figs. 29A and 29B), a row counter 3303 (similar to row counter 2903), and an adder 3305 (similar to adder 2907). In addition, refresh controller 3300 includes a data storage 3307. Unlike data storage 2909 of fig. 29B, data storage 3307 may store at least one memory refresh pattern to be implemented to refresh the plurality of memory segments included in each of the plurality of memory banks. For example, as depicted in fig. 33, data storage 3307 may include entries Li (e.g., L1, L2, L3, and L4 in the embodiment of fig. 33) and Hi (e.g., H1, H2, H3, and H4) defining the rows and/or columns of a section in a memory bank. Further, each segment may be associated with an Inc variable (Inc1, Inc2, Inc3, and Inc4 in the embodiment of fig. 33) that defines how the rows associated with the segment are incremented (e.g., whether every row is accessed or refreshed, whether every other row is accessed or refreshed, or the like). Thus, as shown in fig. 33, a refresh pattern may comprise a table containing a plurality of memory segment identifiers, assigned by software, that identify ranges of memory segments in a particular memory bank that need to be refreshed during a refresh cycle and ranges that do not.
Thus, data storage 3307 may define a refresh pattern that software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, e.g., as depicted in fig. 7A) may select for use. The memory refresh pattern may be configurable in software to identify which of the plurality of memory segments in a particular memory bank need to be refreshed during a refresh cycle and which do not. Accordingly, refresh controller 3300 may refresh, according to Inci, some or all rows within a defined section that are not accessed during the current cycle, and may skip the other rows of the defined section that are set to be accessed during the current cycle.
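Iterating one stored pattern can be sketched in Python as follows (the field names lo/hi/inc are hypothetical stand-ins for the Li, Hi, and Inci entries of fig. 33).

```python
pattern = [
    {"lo": 0, "hi": 7, "inc": 1},   # refresh every row of this segment
    {"lo": 8, "hi": 15, "inc": 2},  # every other row: software accesses the rest
]


def run_pattern(pattern, refresh_row):
    for entry in pattern:
        for row in range(entry["lo"], entry["hi"] + 1, entry["inc"]):
            refresh_row(row)


run_pattern(pattern, lambda row: print("refresh row", row))
```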
In embodiments where data storage 3307 of refresh controller 3300 includes multiple memory refresh patterns, each pattern may represent a different way of refreshing the plurality of memory segments included in each of the plurality of memory banks. The memory refresh patterns may be selectable for use on the plurality of memory segments. Thus, refresh controller 3300 may be configured to allow selection of which of the multiple memory refresh patterns is implemented during a particular refresh cycle. For example, software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, e.g., as depicted in fig. 7A) may select different memory refresh patterns for use during one or more different refresh cycles. Alternatively, the software may select one memory refresh pattern for use throughout some or all of the refresh cycles.
A memory refresh pattern may be encoded using one or more variables stored in data storage 3307. For example, in embodiments where the plurality of memory segments are arranged in rows, each memory segment identifier may be configured to identify a particular location within a row of memory where a refresh should begin or end. For example, in addition to Li and Hi, one or more additional variables may define which portions of which rows fall within the segment defined by Li and Hi.
FIG. 34 is a flow diagram of an embodiment of a process 3400 for determining refreshes of a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3400 may be implemented by software within a refresh controller consistent with the present disclosure (e.g., refresh controller 3300 of fig. 33).
At step 3410, the refresh controller may store at least one memory refresh pattern to be implemented to refresh the plurality of memory segments included in each of the plurality of memory banks. For example, as explained above with respect to fig. 33, the refresh pattern may include a table that includes a plurality of memory segment identifiers assigned by software to identify ranges of a plurality of memory segments in a particular memory bank that need to be refreshed during a refresh cycle and ranges of a plurality of memory segments in the particular memory bank that do not need to be refreshed during the refresh cycle.
In some embodiments, the at least one refresh pattern may be encoded onto the refresh controller during manufacturing (e.g., onto a read-only memory associated with or at least accessible by the refresh controller). Thus, the refresh controller can access the at least one memory refresh pattern but does not store the at least one memory refresh pattern.
At steps 3420 and 3430, the refresh controller may use software to identify which of the plurality of memory segments in a particular memory bank are to be refreshed during a refresh cycle, and which of the plurality of memory segments in a particular memory bank are not to be refreshed during the refresh cycle. For example, as explained above with respect to fig. 33, software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chips, e.g., as depicted in fig. 7A) may select at least one memory refresh pattern. Further, the refresh controller can access a selected at least one memory refresh pattern to generate a corresponding refresh signal during each refresh cycle. The refresh controller may refresh some or all portions within the defined section that are not accessed during the current cycle according to the at least one memory refresh pattern, and may skip other portions of the defined section that are set to be accessed during the current cycle.
At step 3440, the refresh controller may generate a corresponding refresh command. For example, as depicted in fig. 33, the adder 3305 may include logic circuitry configured to invalidate refresh signals for particular segments that are not refreshed according to at least one memory refresh pattern in the data store 3307. Additionally or alternatively, a microprocessor (not shown in FIG. 33) may generate a particular refresh signal based on which segments are to be refreshed according to at least one memory refresh pattern in the data store 3307.
Process 3400 may also include additional steps. For example, in embodiments where at least one memory refresh pattern is configured to change every one, two, or other number of refresh cycles (e.g., moving from L1, H1, and Inc1 to L2, H2, and Inc2, as shown in fig. 33), the refresh controller may access a different portion of the data storage for the next determination of the refresh signal according to steps 3430 and 3440. Similarly, if software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chips, e.g., as depicted in fig. 7A) selects a new memory refresh pattern from the data store for use in one or more future refresh cycles, the refresh controller can access a different portion of the data store for the next determination of the refresh signal according to steps 3430 and 3440.
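The per-cycle behavior described in steps 3420 through 3440, including switching between stored patterns across cycles, might be sketched as follows. This is a behavioral illustration under assumed names ((L, H, Inc) triplets and an issue_refresh callback), not the controller's actual implementation.

```python
# Behavioral sketch: step through the (L, H, Inc) triplet selected for this
# cycle, refreshing rows in [L, H] except those the software marks as
# accessed (such rows are refreshed implicitly by the access itself).
def run_refresh_cycle(patterns, cycle, accessed_rows, issue_refresh):
    lo, hi, inc = patterns[cycle % len(patterns)]  # e.g., alternate patterns
    row = lo
    while row <= hi:
        if row not in accessed_rows:  # skip rows accessed this cycle
            issue_refresh(row)
        row += inc

# Example: two triplets, as in fig. 33, applied on alternating cycles.
patterns = [(0, 4095, 1), (4096, 8191, 1)]
run_refresh_cycle(patterns, cycle=0, accessed_rows={7, 9},
                  issue_refresh=lambda row: None)
```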
When designing a memory chip targeting a certain memory capacity, a later change in capacity to a larger or smaller size may require redesigning the product and the entire reticle set. Typically, product design is conducted in parallel with market research, and in some cases product design is completed before market research is available. Thus, there may be a disconnect between product design and the actual needs of the market. The present disclosure proposes a way to flexibly provide a memory chip with a memory capacity that meets market demands. The design method may include designing dies on a wafer, along with appropriate interconnect circuitry, so that memory chips containing one or more dies may be selectively diced from the wafer, providing an opportunity to produce memory chips of variable memory capacity from a single wafer.
The present disclosure relates to systems and methods for manufacturing memory chips by dicing the memory chips from a wafer. The method can be used to produce size-selectable memory chips from a wafer. An embodiment of a wafer 3501 containing dies 3503 is shown in fig. 35A. Wafer 3501 may be formed from a semiconductor material, such as silicon (Si), silicon germanium (SiGe), silicon-on-insulator (SOI), gallium nitride (GaN), aluminum nitride (AlN), aluminum gallium nitride (AlGaN), boron nitride (BN), gallium arsenide (GaAs), aluminum gallium arsenide (AlGaAs), indium nitride (InN), combinations thereof, and the like. The dies 3503 may include any suitable circuit elements (e.g., transistors, capacitors, resistors, and/or the like), which may include any suitable semiconductor, dielectric, or metallic components. The dies 3503 may be formed of a semiconductor material that may be the same as or different from the material of wafer 3501. In addition to dies 3503, wafer 3501 may also include other structures and/or circuitry. In some embodiments, one or more coupling circuits may be provided to couple one or more of the dies together. In one embodiment, the coupling circuit may comprise a bus shared by two or more dies 3503. Additionally, the coupling circuitry may comprise one or more logic circuits designed to control circuitry associated with dies 3503 and/or to direct information to and from dies 3503. In some cases, the coupling circuit may include memory access management logic. This logic may translate logical memory addresses into physical addresses associated with dies 3503. It should be noted that the term fabrication, as used herein, may collectively refer to any of the steps used to build the disclosed wafers, dies, and/or chips. For example, fabrication may refer to the simultaneous placement and formation of the various dies (and any other circuitry) included on a wafer. Manufacturing may also refer to cutting a selectable-size memory chip from the wafer to include one die in some cases, or multiple dies in other cases. Of course, the term fabrication is not intended to be limited to these embodiments, but may include other aspects associated with the production of any or all of the disclosed memory chips and intermediate structures.
The dies 3503 may also include a processing array disposed on a substrate that includes a plurality of processor subunits 3515A-3515D, as shown in fig. 35B. As described above, each memory bank may include a dedicated processor subunit connected by a dedicated bus. For example, processor subunit 3515A is associated with memory bank 3511A via a bus or connection 3512. It should be understood that various connections between the memory banks 3511A-3511D and the processor subunits 3515A-3515D are possible, and only some illustrative connections are shown in fig. 35B. In an embodiment, a processor subunit may perform read/write operations on its associated memory bank, and may further perform refresh operations or any other suitable operations with respect to the data stored in the various memory banks.
As mentioned, the dies 3503 may include a first group of buses configured to connect processor subunits with their corresponding memory banks. Such a bus may comprise a set of wires or conductors that connect the electrical components and allow data and addresses to be transferred to and from each memory bank and its associated processor subunit. In one embodiment, connection 3512 may serve as a dedicated bus for connecting processor subunit 3515A to memory bank 3511A. Die 3503 may include a group of such buses, each connecting a processor subunit to a corresponding dedicated memory bank. Additionally, die 3503 may include another group of buses, each connecting processor subunits (e.g., subunits 3515A-3515D) to one another. For example, such buses may include connections 3516A-3516D. In various embodiments, data for memory banks 3511A-3511D can be delivered via input-output bus 3530. In an embodiment, input-output bus 3530 may carry data-related information as well as command-related information for controlling the operation of the memory cells of die 3503. The data information may include data for storage in the memory banks, data read from the memory banks, processing results from one or more of the processor subunits based on operations performed on the data stored in the corresponding memory banks, command-related information, various codes, and the like.
In various cases, data and commands transmitted over input-output bus 3530 may be controlled by an input-output (IO) controller 3521. In an embodiment, IO controller 3521 may control the flow of data between bus 3530 and processor subunits 3515A-3515D. IO controller 3521 may determine from which of the processor subunits 3515A-3515D information is retrieved. In various embodiments, IO controller 3521 may include a fuse 3554 configured to deactivate IO controller 3521. Fuse 3554 may be used if multiple dies are combined together to form a larger memory chip (also referred to as a multi-die memory chip, as an alternative to a single-die memory chip containing only one die). The multi-die memory chip may then use the IO controller of one of the dies forming the multi-die memory chip while disabling the other IO controllers by using the fuses corresponding to the IO controllers of the other dies.
As mentioned, each memory chip, or each die or group of dies prior to dicing, may include distributed processors associated with corresponding memory banks. In some embodiments, these distributed processors may be arranged in a processing array disposed on the same substrate as a plurality of memory banks. In addition, the processing array may include one or more logic sections that each include an address generator, also referred to as an address generator unit (AGU). In some cases, the address generator may be part of at least one processor subunit. The address generator may generate the memory addresses needed to fetch data from the one or more memory banks associated with the memory chip. The address-generation calculations may involve integer arithmetic operations, such as addition, subtraction, modulo operations, or bit shifts. The address generator may be configured to operate on multiple operands at once. Furthermore, multiple address generators may perform more than one address-calculation operation simultaneously. In various embodiments, an address generator may be associated with a corresponding memory bank. The address generator may be connected with its corresponding memory bank by means of a corresponding bus line.
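As a concrete illustration of the address-generation arithmetic described above, an AGU combining addition, a bit shift, and a modulo operation might compute addresses as follows. The function name and parameters are assumptions made for the example only.

```python
# Hypothetical AGU calculation: base + (index << element_size_log2),
# wrapped to the number of rows in the bank.
def generate_address(base, index, element_size_log2, bank_rows):
    return (base + (index << element_size_log2)) % bank_rows

# Two address generators could compute addresses for two banks at once:
addr_a = generate_address(base=0x100, index=7, element_size_log2=2, bank_rows=4096)
addr_b = generate_address(base=0x800, index=3, element_size_log2=3, bank_rows=4096)
```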
In various embodiments, size-selectable memory chips can be formed from wafer 3501 by selectively dicing different regions of the wafer. As mentioned, the wafer can include a group of dies 3503, which may be any group of two or more dies included on the wafer (e.g., 2, 3, 4, 5, 10, or more than 10 dies). As will be discussed further below, in some cases a single memory chip may be formed by dicing a portion of the wafer that includes only one die of the group. In these cases, the resulting memory chip will include the memory units associated with one die. In other cases, however, a size-selectable memory chip may be formed to include more than one die. Such memory chips may be formed by dicing a region of the wafer that includes two or more dies of the group of dies included on the wafer. In these cases, the dies, along with the coupling circuitry that couples them together, provide a multi-die memory chip. Additional circuit elements, such as clock elements, data buses, or any suitable logic circuits, may also be connected between the dies on the chip.
In some cases, at least one controller associated with a group of dies may be configured to control the group of dies to operate as a single memory chip (e.g., a multi-memory-unit memory chip). The controller may include one or more circuits that manage the flow of data into and out of the memory chip. The memory controller may be part of the memory chip, or it may be part of a separate chip not directly related to the memory chip. In an embodiment, the controller may be configured to facilitate read and write requests or other commands associated with the distributed processors of the memory chip, and may be configured to control any other suitable aspects of the memory chip (e.g., refreshing the memory chip, interacting with the distributed processors, etc.). In some cases, the controller may be part of a die 3503, and in other cases, the controller may be disposed adjacent to the dies 3503. In various embodiments, the controller may also include at least one memory controller for at least one of the memory units included on the memory chip. In some cases, the protocol used to access information on the memory chip may be independent of the replicated logic and memory units (e.g., memory banks) that may be present on the memory chip. The protocol may assign different IDs or address ranges so as to provide full access to the data on the memory chip. Examples of chips using such a protocol include chips having Joint Electron Device Engineering Council (JEDEC) Double Data Rate (DDR) controllers, in which different memory banks may have different address ranges, and chips having Serial Peripheral Interface (SPI) connections, in which different memory units (e.g., memory banks) have different identifications (IDs).
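To illustrate the protocol idea above, the following sketch assigns each memory unit its own address window and routes each access accordingly. The address windows are invented for the example; the disclosure does not specify particular ranges.

```python
# Hypothetical mapping of bank IDs to address windows, in the spirit of a
# DDR-style protocol where each bank owns a distinct address range.
BANK_ADDRESS_RANGES = {
    0: range(0x0000_0000, 0x0100_0000),  # bank 0
    1: range(0x0100_0000, 0x0200_0000),  # bank 1
}

def route_access(address):
    # Return the ID of the bank whose window contains the target address.
    for bank_id, window in BANK_ADDRESS_RANGES.items():
        if address in window:
            return bank_id
    raise ValueError("address outside all bank windows")
```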
In various embodiments, a plurality of regions may be cut from the wafer, where each region includes one or more dies. In some cases, each region may be used to build a multi-die memory chip. In other cases, each region to be diced from the wafer can include a single die to provide a single-die memory chip. In some cases, two or more of the regions may have the same shape and the same number of dies coupled to the coupling circuit in the same manner. Alternatively, in some embodiments, a first group of regions may be used to form a first type of memory chip and a second group of regions may be used to form a second type of memory chip. For example, as shown in fig. 35C, wafer 3501 may include a region 3505 that includes a single die, and a second region 3504 that includes a group of two dies. When region 3505 is cut from wafer 3501, a single-die memory chip is provided. When region 3504 is diced from wafer 3501, a multi-die memory chip is provided. The groupings shown in fig. 35C are merely illustrative, and various other regions and groups of dies may be cut from wafer 3501.
In various embodiments, the dies may be formed on wafer 3501 such that they are arranged along one or more rows of the wafer, as shown, for example, in fig. 35C. The dies may share input-output buses 3530 corresponding to one or more rows. In an embodiment, groups of dies may be cut from wafer 3501 using various dicing shapes, such that when the dies used to form a given memory chip are cut, part of shared input-output bus 3530 may be excluded (e.g., only the portion of input-output bus 3530 within the cut region may be included as part of the memory chip formed from the group of dies).
As previously discussed, when multiple dies (e.g., dies 3506A and 3506B, as shown in fig. 35C) are used to form memory chip 3517, one IO controller corresponding to one of the dies may be enabled and configured to control data flow to all processor subunits of dies 3506A and 3506B. For example, fig. 35D shows memory dies 3506A and 3506B combined to form a memory chip 3517 that includes banks 3511A-3511H, processor subunits 3515A-3515H, IO controllers 3521A and 3521B, and fuses 3554A and 3554B. Note that before memory chip 3517 is removed from the wafer, it corresponds to region 3517 of wafer 3501. In other words, as used herein and elsewhere in this disclosure, once cut from wafer 3501, regions 3504, 3505, 3517, etc. of wafer 3501 result in memory chips 3504, 3505, 3517, etc. Additionally, a fuse herein is also referred to as a disabling element. In one embodiment, fuse 3554B may be used to deactivate IO controller 3521B, and IO controller 3521A may then control the flow of data to all memory banks 3511A-3511H by communicating data to processor subunits 3515A-3515H. In an embodiment, IO controller 3521A may connect to the various processor subunits using any suitable connection. In some embodiments, as described further below, processor subunits 3515A-3515H may be interconnected, and IO controller 3521A may be configured to control data flow to processor subunits 3515A-3515H, which form the processing logic of memory chip 3517.
In one embodiment, IO controllers such as controllers 3521A and 3521B, and corresponding fuses 3554A and 3554B, may be formed on wafer 3501 along with banks 3511A-3511H and processor subunits 3515A-3515H. In various embodiments, when memory chip 3517 is formed, one of the fuses (e.g., fuse 3554B) may be activated such that dies 3506A and 3506B are configured to form memory chip 3517, which functions as a single chip controlled by a single input-output controller (e.g., controller 3521A). In an embodiment, activating the fuse may include applying a current to trigger the fuse. In various embodiments, when more than one die is used to form a memory chip, all but one IO controller may be deactivated via the corresponding fuses.
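The fuse mechanism described above might be modeled in software as follows; the class and method names are invented for illustration, and the model captures only the behavior (one active IO controller per multi-die chip), not the electrical details.

```python
# Behavioral model of disabling all but one IO controller via fuses.
class IOController:
    def __init__(self, name):
        self.name = name
        self.fuse_blown = False  # fuse intact -> controller active

    def blow_fuse(self):
        # Modeled after "applying a current to trigger the fuse".
        self.fuse_blown = True

controllers = [IOController("3521A"), IOController("3521B")]
controllers[1].blow_fuse()  # deactivate IO controller 3521B

active = [c for c in controllers if not c.fuse_blown]
assert [c.name for c in active] == ["3521A"]  # 3521A controls all banks
```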
In various embodiments, as shown in fig. 35C, multiple dies may be formed on wafer 3501 along with a shared set of input-output buses and/or control buses. An embodiment of input-output bus 3530 is shown in fig. 35C. In an embodiment, one of the input-output buses (e.g., input-output bus 3530) may be connected to multiple dies. Fig. 35C shows an embodiment in which input-output bus 3530 passes close to dies 3506A and 3506B. The configuration of dies 3506A and 3506B and input-output bus 3530 shown in fig. 35C is merely illustrative, and various other configurations may be used. For example, fig. 35E illustrates dies 3540 formed on wafer 3501 and arranged in a hexagonal formation. A memory chip 3532 including four dies 3540 may be cut from wafer 3501. In an embodiment, memory chip 3532 can include a portion of input-output bus 3530 connected to the four dies by suitable bus lines (e.g., lines 3533, as shown in fig. 35E). To route information to the appropriate memory units of memory chip 3532, memory chip 3532 may include input-output controllers 3542A and 3542B placed at the branch points of input-output bus 3530. Controllers 3542A and 3542B may receive command data via input-output bus 3530 and select a branch of bus 3530 for transmitting the information to the appropriate memory units. For example, if the command data includes read/write information for memory units associated with die 3546, controller 3542A may receive the command request and transmit the data to branch 3531A of bus 3530, as shown in fig. 35E, while controller 3542B may receive the command request and transmit the data to branch 3531B. FIG. 35E also indicates, with dashed lines, various cuts of different regions that may be made.
In an embodiment, a group of dies and interconnect circuitry can be designed to be included in the memory chip 3506 shown in FIG. 36A. This embodiment may include processor subunits (for in-memory processing) that may be configured to communicate with each other. For example, each die to be included in memory chip 3506 can include various memory units such as banks 3511A-3511D, processor subunits 3515A-3515D, and IO controllers 3521 and 3522. IO controllers 3521 and 3522 may be connected in parallel to input-output bus 3530. IO controller 3521 may have a fuse 3554 and IO controller 3522 may have a fuse 3555. In one embodiment, processor subunits 3515A-3515D can be connected by way of, for example, a bus 3613. In some cases, one of the IO controllers may be disabled using its corresponding fuse. For example, IO controller 3522 may be disabled using fuse 3555, and IO controller 3521 may then control data flow into memory banks 3511A-3511D via processor subunits 3515A-3515D, which are connected to each other via bus 3613.
The configuration of memory units shown in fig. 36A is merely illustrative, and various other configurations may be formed by dicing different regions of wafer 3501. For example, FIG. 36B shows a configuration with three regions 3601-3603 that contain memory units and are connected to input-output bus 3530. In one embodiment, regions 3601-3603 are connected to input-output bus 3530 using IO controllers 3521-3523, which may be disabled by corresponding fuses 3554-3556. Another embodiment for arranging regions containing memory units is shown in FIG. 36C, where three regions 3601, 3602, and 3603 are connected to input-output bus 3530 using bus lines 3611, 3612, and 3613. FIG. 36D shows another embodiment, in which memory chips 3506A-3506D are connected to input-output buses 3530A and 3530B via IO controllers 3521-3524. In an embodiment, the IO controllers may be deactivated using corresponding fuse elements 3554-3557, as shown in fig. 36D.
Fig. 37 shows various groups of dies 3503, such as group 3713 and group 3715, each of which can include one or more dies 3503. In one embodiment, in addition to dies 3503, wafer 3501 may also contain logic 3711, referred to as glue logic 3711. Glue logic 3711 takes up space on wafer 3501 that could otherwise hold additional dies, resulting in fewer dies manufactured per wafer 3501. However, the presence of glue logic 3711 may allow multiple dies to be configured to function collectively as a single memory chip. For example, glue logic may connect multiple dies without changing their configuration and without reserving area within any of the dies themselves for circuitry used only to connect dies together. In various embodiments, glue logic 3711 provides an interface to other memory controllers such that a multi-die memory chip functions as a single memory chip. Glue logic 3711 may be cut along with a group of dies (e.g., as shown by group 3713). Alternatively, if a memory chip requires only one die, as for group 3715, the glue logic may not be cut. For example, glue logic may be selectively excluded where cooperation between different dies need not be achieved. In fig. 37, various cuts of different regions may be made, for example, as shown by the dashed regions. In various embodiments, as shown in fig. 37, one glue logic element 3711 may be arranged on the wafer for every two dies 3506. In some cases, one glue logic element 3711 may serve any suitable number of dies 3506 in a group of dies. Glue logic 3711 may be configured to connect to all dies of a group of dies. In various embodiments, dies connected to glue logic 3711 may be configured to form a multi-die memory chip, while dies not connected to glue logic 3711 may be configured to form separate single-die memory chips. In various embodiments, dies connected to glue logic 3711 and designed to function together may be cut from wafer 3501 as a group including glue logic 3711, e.g., as indicated by group 3713. Dies not connected to glue logic 3711 may be cut from wafer 3501 without including glue logic 3711, e.g., as indicated by group 3715, to form single-die memory chips.
In some embodiments, during the manufacturing of multi-die memory chips from the wafer 3501, one or more cut shapes (e.g., shapes forming groups 3713, 3715) may be determined for generating a desired set of multi-die memory chips. In some cases, as shown by group 3715, cut shapes may not include glue logic 3711.
In various embodiments, glue logic 3711 may be a controller for controlling the plurality of memory units of a multi-die memory chip. In some cases, glue logic 3711 may include parameters that may be modified by various other controllers. For example, the coupling circuitry for a multi-die memory chip may include circuitry for configuring parameters of glue logic 3711 or parameters of a memory controller (e.g., processor subunits 3515A-3515D, as shown, for example, in fig. 35B). Glue logic 3711 may be configured to perform a variety of tasks. For example, logic 3711 may be configured to determine which dies need to be addressed. In some cases, logic 3711 may be used to synchronize multiple memory units. In various embodiments, logic 3711 may be configured to control the various memory units such that they operate as a single chip. In some cases, amplifiers may be added between an input-output bus (e.g., bus 3530, as shown in fig. 35C) and processor subunits 3515A-3515D to amplify data signals from bus 3530.
In various embodiments, cutting complex shapes from wafer 3501 may be technically difficult or expensive, and simpler cutting approaches may be employed, subject to the constraint that the dies are aligned on wafer 3501. For example, fig. 38A shows dies 3506 aligned to form a rectangular grid. In one embodiment, vertical cuts 3803 and horizontal cuts 3801 may be made across the entire wafer 3501 to separate groups of dies. In an embodiment, the vertical cuts 3803 and horizontal cuts 3801 may result in groups containing a selected number of dies. For example, cuts 3803 and 3801 may result in a region containing a single die (e.g., region 3811A), a region containing two dies (e.g., region 3811B), and a region containing four dies (e.g., region 3811C). The regions formed by cuts 3801 and 3803 are merely illustrative, and any other suitable regions may be formed. In various embodiments, various cuts may be made depending on the die alignment. For example, if the dies are arranged in a triangular grid, as shown in fig. 38B, cut lines such as lines 3802, 3804, and 3806 may be used to cut out multi-die memory chips. For example, some regions may include six dies, five dies, four dies, three dies, two dies, one die, or any other suitable number of dies.
Fig. 38C shows bus lines 3530 arranged in a triangular grid, with dies 3503 aligned at the centers of the triangles formed by the intersections of bus lines 3530. A die 3503 may be connected to all adjacent bus lines via bus lines 3820. When a region containing two or more adjacent dies (e.g., region 3822, as shown in fig. 38C) is diced, at least one bus line (e.g., line 3824) remains within region 3822, and that bus line may be used to supply data and commands to the multi-die memory chip formed from region 3822.
FIG. 39 shows various connections that can be formed between processor subunits 3515A-3515P to allow a group of memory units to function as a single memory chip. For example, the group of memory units 3901 may include a connection 3905 between processor subunit 3515B and subunit 3515E. Connection 3905 may serve as a bus line for transmitting data and commands from subunit 3515B to subunit 3515E, which may be used to control the respective memory bank 3511E. In various embodiments, the connections between the processor subunits may be implemented during the formation of the dies on wafer 3501. In some cases, additional connections may be fabricated during the packaging stage of a memory chip formed from several dies.
As shown in fig. 39, processor subunits 3515A-3515P may be connected to each other using various buses (e.g., connection 3905). Connection 3905 may contain no sequential hardware logic components, such that data transfers between processor subunits across connection 3905 are not controlled by sequential hardware logic components. In various embodiments, the buses connecting processor subunits 3515A-3515P may be disposed on wafer 3501 prior to the fabrication of the various circuits on wafer 3501.
In various embodiments, the processor subunits (e.g., subunits 3515A-3515P) may be interconnected. For example, subunits 3515A-3515P may be connected by a suitable bus (e.g., connection 3905). Connection 3905 may connect any one of subunits 3515A-3515P with any other of subunits 3515A-3515P. In one embodiment, the connected subunits may be on the same die (e.g., subunits 3515A and 3515B); in other cases, the connected subunits may be on different dies (e.g., subunits 3515B and 3515E). Connection 3905 may include a dedicated bus for the connected subunits and may be configured to transfer data efficiently between subunits 3515A-3515P.
Various aspects of the present disclosure are directed to methods for producing size-selectable memory chips from a wafer. In one embodiment, a size-selectable memory chip may be formed from one or more dies. As previously mentioned, the dies may be arranged along one or more rows, as shown, for example, in fig. 35C. In some cases, at least one shared input-output bus corresponding to one or more rows may be disposed on wafer 3501. For example, bus 3530 may be arranged as shown in fig. 35C. In various embodiments, bus 3530 may be electrically connected to the memory units of at least two of the dies, and the connected dies may be used to form a multi-die memory chip. In an embodiment, one or more controllers (e.g., input-output controllers 3521 and 3522, as shown in fig. 35B) may be configured to control the memory units of the at least two dies forming the multi-die memory chip. In various embodiments, dies having memory units connected to bus 3530 may be cut from the wafer together with the corresponding portion of the shared input-output bus (e.g., bus 3530, as shown in fig. 35B), which transmits information to at least one controller (e.g., controllers 3521, 3522) configured to control the memory units of the connected dies so that they function as a single chip.
In some cases, the memory units located on wafer 3501 may be tested before memory chips are manufactured by dicing regions of wafer 3501. The test may be performed using at least one shared input-output bus (e.g., bus 3530, as shown in fig. 35C). When the memory units pass testing, a memory chip may be formed from a group of dies containing those memory units. Memory units that fail the test (and the dies containing them) may be discarded and not used to manufacture memory chips.
FIG. 40 shows one embodiment of a process 4000 for building a memory chip from a group of dies. At step 4011 of process 4000, the dies may be disposed on semiconductor wafer 3501. At step 4015, the dies may be fabricated on wafer 3501 using any suitable method. For example, the dies may be fabricated by etching wafer 3501, depositing various dielectric, metal, or semiconductor layers, further etching the deposited layers, and so forth. For example, multiple layers may be deposited and etched. In various embodiments, the layers may be n-type or p-type doped using any suitable doping elements. For example, a semiconductor layer may be n-type doped with phosphorus and p-type doped with boron. As shown in fig. 35A, the dies 3503 can be separated from each other by space available for cutting the dies 3503 from wafer 3501. For example, the dies 3503 can be separated from each other by spaces whose width is selected to allow wafer dicing within those spaces.
At step 4017, the dies 3503 can be cut from wafer 3501 using any suitable method. In one embodiment, the dies 3503 may be cut using a laser. In one embodiment, wafer 3501 may be scribed first and then mechanically broken along the scribe lines. Alternatively, a mechanical saw may be used. In some cases, a stealth dicing process may be used. During dicing, wafer 3501 may be mounted on dicing tape, which holds the dies once they are cut. In various embodiments, long cuts across the wafer may be made, for example, as shown by cuts 3801 and 3803 in fig. 38A or by cuts 3802, 3804, or 3806 in fig. 38B. Once the dies 3503 are cut out individually or in groups (as shown in fig. 35C, for example, by group 3504), the dies 3503 may be packaged. Packaging the dies may include forming contacts to the dies 3503, depositing a protective layer over the contacts, attaching a thermal management device (e.g., a heat sink), and encapsulating the dies 3503. In various embodiments, appropriate configurations of contacts and buses may be used, depending on how many dies are selected to form the memory chip. In an embodiment, some of the contacts between the different dies forming the memory chip may be fabricated during packaging of the memory chip.
FIG. 41A shows one embodiment of a process 4100 for manufacturing a memory chip containing multiple dies. Step 4011 of process 4100 can be the same as step 4011 of process 4000. At step 4111, glue logic 3711 may be arranged on wafer 3501, as shown in fig. 37. Glue logic 3711 may be any suitable logic for controlling the operation of die 3506, as shown in fig. 37. As previously described, the presence of glue logic 3711 may allow multiple dies to be used as a single memory chip. Glue logic 3711 may provide an interface to other memory controllers such that a memory chip formed from multiple dies functions as a single memory chip.
At step 4113 of the process 4100, buses (e.g., input output buses and control buses) may be placed on the wafer 3501. The bus may be arranged such that it connects with various dies and logic circuitry, such as glue logic 3711. In some cases, a bus may connect the memory units. For example, a bus may be configured to connect processor subunits of different dies. At step 4115, the die, glue logic, and bus may be fabricated using any suitable method. For example, logic elements may be fabricated by etching wafer 3501, depositing various dielectric, metal, or semiconductor layers, and further etching the deposited layers, and the like. The bus may be fabricated using, for example, metal evaporation.
At step 4140, a cut shape may be used to cut a group of dice connected to a single glue logic 3711, as shown, for example, in fig. 37. The cut shape may be determined using the memory requirements of a memory chip containing multiple dies 3503. For example, fig. 41B shows a process 4101, which can be a variation of process 4100, wherein steps 4117 and 4119 can be placed before step 4140 of process 4100. At step 4117, the system for dicing wafer 3501 may receive instructions describing the requirements of the memory chips. For example, the requirements may include forming a memory chip including four dies 3503. In some cases, at step 4119, program software may determine a periodic pattern for a group of dies and glue logic 3711. For example, the periodic pattern may include two glue logic 3711 elements and four dies 3503, where each two dies are connected to one glue logic 3711. Alternatively, at step 4119, the pattern may be provided by a designer of the memory chip.
In some cases, the pattern can be selected to maximize the yield of memory chips formed from wafer 3501. In an embodiment, the memory units of dies 3503 may be tested to identify dies having failed memory units (such a die is referred to as a failed die), and based on the locations of the failed dies, a group of dies 3503 containing memory units that pass the test may be identified and an appropriate cutting pattern determined. For example, if a large number of dies 3503 fail at the edge of wafer 3501, a dicing pattern may be determined that avoids the dies at the edge of wafer 3501. Other steps of process 4101, such as steps 4011, 4111, 4113, 4115, and 4140, may be the same as the like-numbered steps of process 4100.
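The yield-driven selection of a cutting pattern might be sketched as follows: given per-die test results, only candidate regions in which every die passed are kept. The grid coordinates and region shapes are assumptions made for the example.

```python
# Keep only candidate cut regions whose dies all passed testing.
def choose_cut_regions(candidate_regions, passed):
    """candidate_regions: list of regions, each a list of (row, col) positions.
    passed: set of (row, col) die positions whose memory units passed testing."""
    return [region for region in candidate_regions
            if all(pos in passed for pos in region)]

# Example: two-die regions on a small grid; die (0, 1) failed at the wafer edge.
passed = {(0, 0), (1, 0), (1, 1)}
regions = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
print(choose_cut_regions(regions, passed))  # -> [[(1, 0), (1, 1)]]
```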
Fig. 41C shows an embodiment of a process 4102 that can be a variation of process 4101. Steps 4011, 4111, 4113, 4115, and 4140 of process 4102 may be the same as the like-numbered steps of process 4101; step 4131 of process 4102 may replace step 4117 of process 4101, and step 4133 of process 4102 may replace step 4119 of process 4101. At step 4131, the system for dicing wafer 3501 may receive instructions describing the requirements for a first set of memory chips and a second set of memory chips. For example, the requirements may include: forming a first set of memory chips consisting of four dies 3503 each; and forming a second set of memory chips consisting of two dies 3503 each. In some cases, it may be desirable to form more than two sets of memory chips from wafer 3501. For example, a third set of memory chips may include memory chips consisting of only one die 3503. In some cases, at step 4133, program software may determine a periodic pattern of groups of dies and glue logic 3711 for forming the memory chips of each set. For example, the first set of memory chips may include memory chips containing two glue logic elements 3711 and four dies 3503, where each pair of dies is connected to one glue logic element 3711. In various embodiments, the glue logic units 3711 of the same memory chip may be linked together to serve as a single glue logic. For example, during the manufacture of glue logic 3711, appropriate bus lines may be formed that link the glue logic units 3711 to each other.
The second set of memory chips may include memory chips containing one glue logic element 3711 and two dies 3503, with both dies 3503 connected to the glue logic 3711. When a third set of memory chips consisting of single-die memory chips is selected, glue logic 3711 may not be needed for those memory chips.
When designing a memory chip, or an instance of memory within a chip, one important characteristic is the number of words that can be accessed simultaneously during a single clock cycle. The more addresses that can be accessed simultaneously for reading and/or writing (e.g., addresses along rows, also referred to as words or word lines, and along columns, also referred to as bits or bit lines), the faster the memory chip. While there has been some activity in developing memories that include multi-way ports allowing multiple addresses to be accessed simultaneously, such as for building register files, caches, or shared memories, most examples use memory pads that are large in size in order to support multiple address accesses. However, DRAM chips typically include a single bit line and a single word line connected to each capacitor of each memory cell. Accordingly, embodiments of the present disclosure seek to provide multi-port access to existing DRAM chips without modifying this conventional single-port memory structure of the DRAM array.
Embodiments of the present disclosure may clock memory instances or chips at twice the speed of the logic circuits that use them. Any logic circuit using a memory may thus be said to "correspond to" the memory and any components thereof. Thus, embodiments of the present disclosure can fetch or write two addresses in two memory-array clock cycles, which are equivalent to a single processing clock cycle of the logic circuitry. The logic circuitry may include circuits such as a controller, accelerator, GPU, or CPU, or may include processing groups on the same substrate as the memory chip, e.g., as depicted in fig. 7A. As explained above with respect to fig. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on a substrate. The groups may represent spatial distributions and/or logical groupings on a substrate for purposes of compiling code for execution on the memory chip 2800. Thus, as described above with respect to fig. 7A, a substrate having the memory chip may include a memory array having a plurality of banks, such as bank 2801a and other banks shown in fig. 28. In addition, the substrate may also include a processing array, which may include a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in fig. 7A).
Thus, embodiments of the present disclosure may retrieve data from the array within each of two consecutive memory cycles, so as to handle two addresses for each logic cycle and provide two results to the logic circuits, as if the single-port memory array were a dual-port memory instance. Additional clocking may allow the memory chips of the present disclosure to function as if the single-port memory array were a dual-port memory instance, a three-port memory instance, a four-port memory instance, or any other multi-port memory instance.
Fig. 42 depicts an embodiment of circuitry 4200, consistent with the present disclosure, that provides dual-port access along columns of a memory chip in which circuitry 4200 is used. The embodiment depicted in FIG. 42 may use one memory array 4201 having two column multiplexers ("muxes") 4205a and 4205b to access two words on the same row during the same clock cycle of the logic circuits. For example, during a memory clock cycle, RowAddrA is used in row decoder 4203 and ColAddrA is used in multiplexer 4205a to buffer data from the memory cell having address (RowAddrA, ColAddrA). During the same memory clock cycle, ColAddrB is used in multiplexer 4205b to buffer data from the memory cell having address (RowAddrA, ColAddrB). Thus, circuitry 4200 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses along the same row or word line. The two addresses may share a row, such that row decoder 4203 activates the same word line for both fetches. Furthermore, the embodiment depicted in FIG. 42 uses column multiplexers so that both addresses can be accessed during the same memory clock cycle.
Similarly, fig. 43 depicts an embodiment of circuitry 4300, consistent with the present disclosure, that provides dual-port access along rows of a memory chip in which circuitry 4300 is used. The embodiment depicted in FIG. 43 may use one memory array 4301 with a row decoder 4303 coupled to a multiplexer ("mux") to access two words on the same column during the same clock cycle of the logic circuits. For example, in the first of two memory clock cycles, RowAddrA is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data (e.g., in the "buffer word" buffer of FIG. 43) from the memory cell having address (RowAddrA, ColAddrA). In the second of the two memory clock cycles, RowAddrB is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell having address (RowAddrB, ColAddrA). Thus, circuitry 4300 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses along the same column or bit line. The two addresses may share a column, such that the column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) activates the same bit line for both fetches. The embodiment depicted in FIG. 43 uses two memory clock cycles because row decoder 4303 may require one memory clock cycle to activate each word line. Thus, a memory chip using circuitry 4300 may function as a dual-port memory if clocked at least twice as fast as the corresponding logic circuits.
Thus, as explained above, the circuitry of fig. 43 may retrieve DataA and DataB during two memory clock cycles that together take no longer than a single clock cycle of the corresponding logic circuit. For example, a row decoder (e.g., row decoder 4303 of FIG. 43) and a column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) may be configured to be clocked at a rate at least twice the rate at which the corresponding logic circuits generate the two addresses. For example, a clock circuit (not shown in fig. 43) for circuitry 4300 may clock circuitry 4300 at a rate at least twice the rate at which the corresponding logic circuits generate the two addresses.
The embodiments of fig. 42 and fig. 43 may be used separately or in combination. Thus, circuitry (e.g., circuitry 4200 or 4300) that provides dual-port functionality on a single-port memory array or pad may include a plurality of memory banks arranged along at least one row and at least one column. The plurality of memory banks is depicted in FIG. 42 as memory array 4201 and in FIG. 43 as memory array 4301. Such an embodiment may further use at least one row multiplexer (as depicted in FIG. 43) or at least one column multiplexer (as depicted in FIG. 42) configured to receive two addresses for reading or writing during a single clock cycle. Furthermore, this embodiment may use a row decoder (e.g., row decoder 4203 of FIG. 42 or row decoder 4303 of FIG. 43) and a column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 42 and FIG. 43) to read from or write to the two addresses. For example, the row decoder and the column decoder may retrieve a first address of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to that first address during a first cycle. The row decoder and the column decoder may then retrieve the second address of the two addresses from the at least one row multiplexer or the at least one column multiplexer during a second cycle and decode the word line and bit line corresponding to that second address. Each fetch may include activating the word line corresponding to the address using the row decoder and activating the bit line corresponding to the address on the activated word line using the column decoder.
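The timing idea of figs. 42 through 44 can be summarized with a small behavioral model: a single-port array clocked twice per logic cycle serves two addresses per logic cycle, buffering the first result until the second is available. This sketch is purely illustrative and abstracts away the decoders and multiplexers.

```python
# Behavioral model of dual-port emulation over two memory clock cycles.
def dual_port_read(array, addr_a, addr_b):
    row_a, col_a = addr_a
    buffered = array[row_a][col_a]  # memory cycle 1: read and buffer DataA

    row_b, col_b = addr_b
    data_b = array[row_b][col_b]    # memory cycle 2: read DataB

    return buffered, data_b         # both results returned in one logic cycle

array = [[0, 1], [2, 3]]
print(dual_port_read(array, (0, 1), (1, 0)))  # -> (1, 2)
```

When the two addresses share a word line, as in fig. 42, both reads can instead complete in a single memory clock cycle, since only the column multiplexers differ.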
Although described above with respect to fetching, the embodiments of fig. 42 and 43 (whether implemented separately or in combination) may include write commands. For example, during a first cycle, the row decoder and the column decoder may write first data retrieved from the at least one row multiplexer or the at least one column multiplexer to a first address of the two addresses. For example, during a second cycle, the row decoder and the column decoder may write second data retrieved from the at least one row multiplexer or the at least one column multiplexer to a second address of the two addresses.
FIG. 42 shows an embodiment of the process in which the first address and the second address share a word line address, while FIG. 43 shows an embodiment of the process in which the first address and the second address share a bit line (column) address. As described further below with respect to fig. 47, the same process can be implemented when the first address and the second address share neither a word line address nor a bit line address.
Thus, while the above embodiments provide dual-port access along at least one of a row or a column, additional embodiments may provide dual-port access along both rows and columns. Fig. 44 depicts an embodiment of circuitry 4400, consistent with the present disclosure, that provides dual-port access along both rows and columns of a memory chip in which circuitry 4400 is used. Accordingly, circuitry 4400 may represent a combination of circuitry 4200 of fig. 42 and circuitry 4300 of fig. 43.
The embodiment depicted in FIG. 44 can use one memory array 4401 having a row decoder 4403 coupled to a multiplexer ("mux") to access two rows during the same clock cycle for the logic circuits. Furthermore, the embodiment depicted in fig. 44 may use a memory array 4401 having a column decoder (or multiplexer) 4405 coupled to a multiplexer ("mux") to access two columns during the same clock cycle. For example, on the first of two memory clock cycles, RowAddrA is used in the row decoder 4403 and ColAddrA is used in the column multiplexer 4405 to buffer data from memory cells having addresses (RowAddrA, ColAddrA) (e.g., to the "buffer word" buffer of fig. 44). On the second of the two memory clock cycles, RowAddrB is used in the row decoder 4403 and ColAddrB is used in the column multiplexer 4405 to buffer data from the memory cell having the address (RowAddrB, ColAddrB). Thus, the circuitry 4400 may allow dual port access to data (e.g., DataA and DataB) stored on memory cells at two different addresses. The embodiment as depicted in fig. 44 may use additional buffers because the row decoder 4403 may require one memory clock cycle to activate each word line. Thus, a memory chip using the circuit system 4400 can be used as a dual port memory if clocked at least twice as fast as the corresponding logic circuitry.
Although not depicted in fig. 44, circuitry 4400 may also include the additional circuitry of fig. 46 (described further below) along rows or word lines and/or similar additional circuitry along columns or bit lines. Accordingly, circuitry 4400 may activate the corresponding circuitry (e.g., by switching on one or more switching elements, such as one or more of switching elements 4613a and 4613b of fig. 46) in order to activate the disconnected portion containing the address (e.g., by connecting a voltage or allowing current to flow to that portion). Circuitry may thus be said to "correspond" to an address when an element of the circuitry (such as a line) includes the location identified by the address and/or when an element of the circuitry (such as a switching element) controls the supply of voltage and/or current to the memory cell identified by the address. Circuitry 4400 may then use row decoder 4403 and column multiplexer 4405 to decode the corresponding word lines and bit lines to retrieve data from, or write data to, the address located in the activated disconnected portion.
As further depicted in fig. 44, circuitry 4400 may further employ at least one row multiplexer (depicted as separate from, but can be incorporated into, row decoder 4403) and/or at least one column multiplexer (depicted as separate from, but can be incorporated into, column multiplexer 4405) configured to receive two addresses for reading or writing during a single clock cycle. Thus, embodiments may use a row decoder (e.g., row decoder 4403) and a column decoder (which may be separate from or combined with column multiplexer 4405) to read from or write to two addresses. For example, a row decoder and a column decoder may retrieve a first address of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode a word line and a bit line corresponding to the first address during a memory clock cycle. Further, the row decoder and the column decoder may retrieve a second address of the two addresses from the at least one row multiplexer or the at least one column multiplexer during the same memory cycle and decode word lines and bit lines corresponding to the second address.
Fig. 45A and 45B depict prior-art replication techniques for providing dual-port functionality on a single-port memory array or pad. As shown in fig. 45A, dual-port reads may be provided by keeping copies of the data synchronized across the memory arrays or pads, so that two reads can be performed simultaneously, one from each copy of the memory instance. Furthermore, as shown in FIG. 45B, dual-port writes may be provided by duplicating all writes across the memory arrays or pads. For example, the memory chip may need to send write commands repeatedly, one for each copy of the data, using the logic circuits of the memory chip. Alternatively, in some embodiments, as shown in FIG. 45B, additional circuitry may allow the logic circuits of the memory instance to send a single write command that is automatically replicated by the additional circuitry, generating copies of the written data across the memory arrays or pads in order to keep the copies synchronized. The embodiments of figs. 42, 43, and 44 may eliminate the redundancy of these prior-art replication techniques by using multiplexers to access two bit lines in a single memory clock cycle (e.g., as depicted in fig. 42) and/or by clocking the memory faster than the corresponding logic circuits (e.g., as depicted in figs. 43 and 44) and providing additional multiplexers to handle the additional addresses, rather than replicating all of the data in the memory.
In addition to the faster clocking and/or additional multiplexers described above, embodiments of the present disclosure may also use circuitry that disconnects the bit lines and/or word lines at certain points within the memory array. These embodiments can allow multiple simultaneous accesses to the array, as long as the row and column decoders access different locations that are not coupled to the same portion of the disconnect circuitry. For example, locations on different word lines and bit lines may be accessed simultaneously because the disconnect circuitry allows the row and column decoders to access the different addresses without interference. The granularity of the disconnected regions within the memory array can be traded off against the additional area required for the disconnect circuitry during the design of the memory chip.
An architecture for implementing this simultaneous access is depicted in fig. 46. In particular, fig. 46 depicts an embodiment of circuitry 4600 that provides dual port functionality on a single port memory array or pad. As depicted in fig. 46, circuitry 4600 may include a plurality of memory pads (e.g., memory pads 4609a, pads 4609b, and the like) arranged along at least one row and at least one column. The layout of circuitry 4600 also includes a plurality of wordlines, such as wordlines 4611a and 4611b corresponding to rows, and bit lines 4615a and 4615b corresponding to columns.
The embodiment depicted in FIG. 46 includes twelve memory pads, each having two rows and eight columns. In other embodiments, the substrate may include any number of memory pads, and each memory pad may include any number of rows and any number of columns. Some memory pads may include the same number of rows and columns (as shown in fig. 46), while other memory pads may include different numbers of rows and/or columns.
Although not depicted in fig. 46, circuitry 4600 may further use at least one row multiplexer (separate from or merged with row decoders 4601a and/or 4601b) or at least one column multiplexer (e.g., column multiplexers 4603a and/or 4603b) configured to receive two (or more) addresses for reading or writing during a single clock cycle. Further, embodiments may use row decoders (e.g., row decoders 4601a and/or 4601b) and column decoders (which may be separate from or combined with column multiplexers 4603a and/or 4603b) to read from or write to the two (or more) addresses. For example, the row decoder and the column decoder may retrieve a first address of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to the first address during a memory clock cycle. The row decoder and the column decoder may retrieve the second address of the two addresses from the at least one row multiplexer or the at least one column multiplexer during the same memory clock cycle and decode the word line and bit line corresponding to the second address. As explained above, the accesses may occur during the same memory clock cycle as long as the two addresses are in different locations that are not coupled to the same portion of the disconnect circuitry (e.g., switching elements such as 4613a, 4613b, and the like). Additionally, circuitry 4600 may simultaneously access the first two addresses during a first memory clock cycle and then simultaneously access the next two addresses during a second memory clock cycle. In these embodiments, a memory chip using circuitry 4600 can function as a four-port memory if clocked at least twice as fast as the corresponding logic circuits.
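The constraint that simultaneous accesses must fall in different disconnected regions might be checked as in the following sketch. The mapping of addresses to regions stands in for the function of row control 4607, and the region dimensions (two rows by sixteen columns) are assumptions made for the example.

```python
# Two addresses can be served in the same memory clock cycle only if they
# map to different disconnected regions (i.e., different switching elements).
def region_of(address, rows_per_region=2, cols_per_region=16):
    row, col = address
    return (row // rows_per_region, col // cols_per_region)

def can_access_simultaneously(addr_a, addr_b):
    return region_of(addr_a) != region_of(addr_b)

print(can_access_simultaneously((0, 3), (4, 20)))  # True: different regions
print(can_access_simultaneously((0, 3), (1, 5)))   # False: same region
```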
Fig. 46 also includes at least one row circuit and at least one column circuit configured to function as switches. For example, the corresponding switching elements, such as 4613a, 4613b, and the like, may include transistors or any other electrical elements configured to allow or stop current flow and/or to connect or disconnect voltages to and from the word lines or bit lines connected to those switching elements. Thus, the switching elements may divide circuitry 4600 into disconnected portions. Although each disconnected region is depicted as including a single row of sixteen columns, the disconnected regions within circuitry 4600 may have different granularities depending on the design of circuitry 4600.
More than one of the switching elements may be activated by circuitry 4600, depending on the disconnected region containing the address. For example, to reach an address within memory pad 4609b of fig. 46, the switching elements that allow access to memory pad 4609a must be switched on, as well as the switching elements that allow access to memory pad 4609b. Based on a particular address, row control 4607 may determine which switching elements to activate in order to reach that address within circuitry 4600.
FIG. 46 shows an embodiment of circuitry 4600 that divides the word lines of a memory array (e.g., one including memory pads 4609a, 4609b, and the like). However, other embodiments may use similar circuitry (e.g., switching elements that divide circuitry 4600 into disconnected regions) to divide the bit lines of a memory array. Thus, the architecture of circuitry 4600 may be used for dual-column access (as depicted in fig. 42 or fig. 44) as well as dual-row access (as depicted in fig. 43 or fig. 44).
A process for multi-cycle access to a memory array or pad is depicted in fig. 47A. Specifically, fig. 47A is a flow diagram of an embodiment of a process 4700 for providing dual-port access on a single-port memory array or pad (e.g., using circuitry 4300 of fig. 43 or circuitry 4400 of fig. 44). Process 4700 may be performed using a row decoder and a column decoder consistent with the present disclosure, such as row decoder 4303 of fig. 43 or row decoder 4403 of fig. 44, and a column decoder (which may be separate from or combined with one or more column multiplexers, such as column multiplexer 4305 of fig. 43 or column multiplexer 4405 of fig. 44).
At step 4710, during a first memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the first of the two addresses. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from the memory cell along the activated word line corresponding to the first address. The amplified voltage may be provided to a logic circuit using the memory chip that includes the circuitry, or buffered according to step 4720 described below. The logic circuit may include a circuit such as a GPU or CPU, or may include a processing group on the same substrate as the memory chip, e.g., as depicted in fig. 7A.
Although described above as a read operation, method 4700 may similarly process a write operation. For example, at least one row decoder may activate a word line, and at least one column multiplexer may apply a voltage to a memory cell along the activated word line and corresponding to the first address to write new data to the memory cell. In some embodiments, the circuitry may provide confirmation of the write to logic circuitry using a memory chip that includes the circuitry, or buffer the confirmation according to step 4720 below.
At step 4720, the circuitry may buffer the retrieved data for the first address. For example, as depicted in fig. 43 and 44, the buffer may allow the circuitry to retrieve the second of the two addresses (as described below in step 4730) and return the results of both retrievals back together. The buffer may include registers, SRAM, non-volatile memory, or any other data storage device.
At step 4730, during a second memory clock cycle, the circuitry may use at least one row decoder and at least one column multiplexer to decode the word line and bit line corresponding to a second address of the two addresses. For example, the at least one row decoder may activate a word line, and the at least one column multiplexer may amplify a voltage from a memory cell along the activated word line and corresponding to the second address. The amplified voltage may be provided to logic circuits using a memory chip that includes the circuitry, either individually or in conjunction with, for example, the buffered voltage from step 4720. The logic circuit may comprise conventional circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in fig. 7A.
Although described above as a read operation, method 4700 may similarly process a write operation. For example, at least one row decoder may activate a word line, and at least one column multiplexer may apply a voltage to the memory cells along the activated word line and corresponding to the second address to write new data to the memory cells. In some embodiments, the circuitry may provide confirmation of the write to logic circuitry using the memory chip that includes the circuitry, either individually or in conjunction with, for example, the buffered confirmation from step 4720.
At step 4740, the circuitry may output the retrieved data for the second address and the buffered first address. For example, as depicted in fig. 43 and 44, the circuitry may return the results of the two fetches (e.g., from steps 4710 and 4730) back together. The circuitry may communicate the results back to logic using the memory chip that includes the circuitry, which may include circuitry such as a GPU or CPU, or may include a processing group on the same substrate as the memory chip, e.g., as depicted in fig. 7A.
Although described with reference to multiple cycles, method 4700 may allow for a single cycle access to two addresses if the two addresses share a word line, as depicted in FIG. 42. For example, steps 4710 and 4730 may be performed during the same memory clock cycle because multiple column multiplexers may decode different bit lines on the same word line during the same memory clock cycle. In these embodiments, buffering step 4720 may be skipped.
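By way of illustration only, the two-cycle behavior of process 4700 may be sketched in software. The following Python sketch is hypothetical (the names MemoryArray and dual_read do not appear in the disclosure) and models only the ordering of steps 4710 through 4740, not the underlying circuitry:

```python
# Hypothetical sketch of process 4700 (not the actual circuitry): a
# single-port array clocked twice per logic cycle, with the first result
# buffered (step 4720) so both results return together (step 4740).

class MemoryArray:
    """Single-port array: one (row, col) access per memory clock cycle."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def access(self, row, col, value=None):
        # One word-line/bit-line decode per memory clock cycle.
        if value is not None:          # write variant of steps 4710/4730
            self.cells[row][col] = value
        return self.cells[row][col]    # read: the "amplified voltage"


def dual_read(array, addr_a, addr_b):
    """Serve two addresses in one logic cycle using two memory cycles."""
    buffered = array.access(*addr_a)   # memory cycle 1 (steps 4710, 4720)
    second = array.access(*addr_b)     # memory cycle 2 (step 4730)
    return buffered, second            # step 4740: both results together


array = MemoryArray(rows=4, cols=8)
array.access(1, 2, value=42)
array.access(3, 5, value=7)
print(dual_read(array, (1, 2), (3, 5)))   # -> (42, 7)
```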
A process for simultaneous access (e.g., using circuitry 4600 described above) is depicted in fig. 47B. Thus, although shown sequentially, the steps of fig. 47B may all be performed during the same memory clock cycle, and at least some steps (e.g., steps 4760 and 4780, or steps 4770 and 4790) may be performed simultaneously. Specifically, fig. 47B is a flow diagram of an embodiment of a process 4750 for providing dual-port access on a single-port memory array or pad (e.g., using circuitry 4200 of fig. 42 or circuitry 4600 of fig. 46). Process 4750 may be performed using row decoders consistent with the present disclosure, such as row decoder 4203 or row decoders 4601a and 4601b of fig. 42 or 46, respectively, and column decoders (which may be separate from or combined with one or more column multiplexers, such as column multiplexers 4205a and 4205b or column multiplexers 4603a and 4603b, respectively, depicted in fig. 42 or 46).
At step 4760, during a memory clock cycle, the circuitry may activate corresponding ones of the at least one row circuit and the at least one column circuit based on a first address of the two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Thus, the circuitry may access the corresponding disconnected region that includes the first of the two addresses.
At step 4770, the circuitry may use at least one row decoder and at least one column multiplexer to decode the word line and bit line corresponding to the first address during the memory clock cycle. For example, the at least one row decoder may activate a word line, and the at least one column multiplexer may amplify a voltage from a memory cell along the activated word line and corresponding to the first address. The amplified voltage may be provided to a logic circuit using a memory chip that includes the circuitry. For example, as described above, the logic circuitry may include circuits such as a GPU or CPU, or may include processing groups on the same substrate as the memory chip, e.g., as depicted in fig. 7A.
Although described above as a read operation, method 4750 may similarly process write operations. For example, at least one row decoder may activate a word line, and at least one column multiplexer may apply a voltage to a memory cell along the activated word line and corresponding to the first address to write new data to the memory cell. In some embodiments, the circuitry may provide confirmation of the write to logic circuitry using a memory chip that includes the circuitry.
At step 4780, during the same cycle, the circuitry may activate corresponding ones of the at least one row circuit and the at least one column circuit based on a second address of the two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Thus, the circuitry may access the corresponding disconnected region that includes the second of the two addresses.
At step 4790, during the same cycle, the circuitry may use at least one row decoder and at least one column multiplexer to decode the word line and bit line corresponding to the second address. For example, the at least one row decoder may activate a word line, and the at least one column multiplexer may amplify a voltage from a memory cell along the activated word line and corresponding to the second address. The amplified voltage may be provided to a logic circuit using a memory chip including the circuitry. For example, as described above, the logic circuitry may comprise conventional circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in fig. 7A.
Although described above as a read operation, method 4750 may similarly process write operations. For example, at least one row decoder may activate a word line, and at least one column multiplexer may apply a voltage to a memory cell along the activated word line and corresponding to the second address to write new data to the memory cell. In some embodiments, the circuitry may provide confirmation of the write to logic circuitry using a memory chip that includes the circuitry.
Although described with reference to a single cycle, method 4750 may allow multi-cycle access to the two addresses if the two addresses fall in disconnected regions that share word lines or bit lines (or that otherwise share switching elements in the at least one row circuit and the at least one column circuit). For example, steps 4760 and 4770 may be performed during a first memory clock cycle, in which a first row decoder and a first column multiplexer decode the word line and bit line corresponding to the first address, while steps 4780 and 4790 may be performed during a second memory clock cycle, in which a second row decoder and a second column multiplexer decode the word line and bit line corresponding to the second address.
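For illustration, the conflict check underlying process 4750 may be sketched as follows; the region granularity constants and the region_of and cycles_needed helpers are hypothetical and merely model when two addresses fall in disconnected regions that do not share switching elements:

```python
# Hypothetical conflict check for process 4750: one memory clock cycle when
# the two addresses sit in disconnected regions, two cycles otherwise. The
# region granularity is an assumption made for illustration.

ROWS_PER_REGION = 2     # assumed granularity of the row circuits
COLS_PER_REGION = 8     # assumed granularity of the column circuits

def region_of(row, col):
    # Each disconnected region is isolated by its own switching elements.
    return (row // ROWS_PER_REGION, col // COLS_PER_REGION)

def cycles_needed(addr_a, addr_b):
    # 1 cycle: steps 4760-4790 proceed in parallel on disjoint regions.
    # 2 cycles: shared switching elements force sequential access.
    return 1 if region_of(*addr_a) != region_of(*addr_b) else 2

print(cycles_needed((0, 0), (2, 8)))   # different regions -> 1
print(cycles_needed((0, 0), (1, 3)))   # same region       -> 2
```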
Another embodiment of an architecture for dual-port access along both rows and columns is depicted in fig. 48. In particular, fig. 48 depicts an embodiment of circuitry 4800 that provides dual-port access along both rows and columns using multiple row decoders in conjunction with multiple column multiplexers. In fig. 48, row decoder 4801a may access a first word line, and column multiplexer 4803a may decode data from one or more memory cells along the first word line, while row decoder 4801b may access a second word line, and column multiplexer 4803b may decode data from one or more memory cells along the second word line.
As described with respect to fig. 47B, this access may be done simultaneously during one memory clock cycle. Thus, similar to the architecture of FIG. 46, the architecture of FIG. 48 (including the memory pads described below in FIG. 49) may allow multiple addresses to be accessed in the same clock cycle. For example, the architecture of FIG. 48 may include any number of row decoders and any number of column multiplexers such that a number of addresses corresponding to the number of row decoders and column multiplexers may all be accessed within a single memory clock cycle.
In other embodiments, this access may occur sequentially across two memory clock cycles. By clocking a memory chip using circuitry 4800 faster than the corresponding logic circuits, the two memory clock cycles may be equivalent to one clock cycle of the logic circuits using the memory. For example, as described above, the logic circuits may comprise conventional circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in fig. 7A.
Other embodiments may allow simultaneous access. For example, as described with respect to fig. 42, multiple column decoders (which may include column multiplexers such as 4803a and 4803b, as shown in fig. 48) may read multiple bit lines along the same word line during a single memory clock cycle. Additionally or alternatively, as described with respect to fig. 46, circuitry 4800 may incorporate additional circuitry so that such access can be simultaneous. For example, row decoder 4801a may access a first word line, and column multiplexer 4803a may decode data from memory cells along the first word line during the same memory clock cycle in which row decoder 4801b accesses a second word line and column multiplexer 4803b decodes data from memory cells along the second word line.
The architecture of fig. 48 may be used with modified memory pads forming a memory bank, as shown in fig. 49. In fig. 49, each memory cell (depicted as a capacitor, similar to DRAM, but which may also include several transistors arranged in a manner similar to SRAM or any other memory cell) is accessed by two word lines and two bit lines. Thus, memory pad 4900 of fig. 49 allows two different bits, or even the same bit, to be accessed simultaneously by two different logic circuits. However, rather than implementing a dual-port solution on a standard DRAM memory pad wired for single-port access, as in the embodiments above, the embodiment of fig. 49 uses a modification to the memory pad itself.
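As an illustrative model only, the dual-wired memory pad of fig. 49 may be thought of as a cell array reachable through either of two independent ports; the DualPortPad class below is a hypothetical stand-in for the two sets of word lines and bit lines:

```python
# Hypothetical model of the pad of fig. 49: each cell is wired to two word
# lines and two bit lines, so either port can reach any cell in a cycle.

class DualPortPad:
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def access(self, port, row, col, value=None):
        # 'port' selects which word-line/bit-line pair drives the access;
        # both pairs decode into the same cell array.
        assert port in ("A", "B")
        if value is not None:
            self.cells[row][col] = value
        return self.cells[row][col]

pad = DualPortPad(4, 4)
pad.access("A", 2, 3, value=9)   # logic circuit 1 writes via port A
print(pad.access("B", 2, 3))     # logic circuit 2 reads the same bit -> 9
```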
Although described as having two ports, any of the embodiments described above may be extended to more than two ports. For example, the embodiments of figs. 42, 46, 48, and 49 may include additional row decoders or column multiplexers to provide access to additional rows or columns during a single clock cycle. As another example, the embodiments of figs. 43 and 44 may include additional row decoders and/or column multiplexers to provide access to additional rows or columns, respectively, during a single clock cycle.
Variable word length access in a memory device
As used above and further below, the term "coupled" may include direct connection, indirect connection, electrical communication, and the like.
Moreover, terms such as "first," "second," and the like are used to distinguish elements or method steps having the same or similar designation or heading, and do not necessarily indicate spatial or temporal order.
Typically, a memory chip may include a memory bank. The memory bank may be coupled to a row decoder and a column decoder configured to select a particular word (or other fixed-size unit of data) to be read or written. Each memory bank may include memory cells to store units of data, sense amplifiers to amplify voltages from the memory cells selected by the row and column decoders, and any other suitable circuitry.
Each memory bank typically has a particular I/O width. For example, the I/O width may comprise a word.
While some processes executed by logic circuits using memory chips may benefit from using very long words, other processes may require only a portion of the word.
In practice, an in-memory compute unit (such as a processor subunit disposed on the same substrate as the memory chip, e.g., as depicted and described in fig. 7A) frequently performs memory access operations that require only a portion of the word.
To reduce the latency associated with accessing an entire word when only a portion is used, embodiments of the present disclosure may provide methods and systems for retrieving only one or more portions of a word, thereby reducing the power loss associated with transferring unnecessary portions of the word and allowing power savings in a memory device.
Furthermore, embodiments of the present disclosure may also reduce power consumption for interactions between the memory chip and other entities (such as logic circuits, whether separate, such as a CPU and GPU, or included on the same substrate as the memory chip, such as the processor subunit depicted and described in fig. 7A) that access the memory chip, which may receive or write only a portion of the word.
A memory access command (e.g., from a logic circuit using the memory) may include an address in the memory. The address may comprise a column address and a row address, for example, or may be translated into a column address and a row address, e.g., by a memory controller of the memory.
In many volatile memories, such as DRAMs, a row address is sent (e.g., directly by logic circuitry or using a memory controller) to a row decoder, which activates an entire row (also referred to as a word line) and loads all of the bit lines included in the row.
The column address identifies the bit lines on the activated row that are forwarded outside the memory bank that includes the bit lines and passed to the next level circuitry. For example, the next level circuitry may include an I/O bus for the memory chip. In embodiments using in-memory processing, the next level circuitry may include a processor subunit of the memory chip (e.g., as depicted in fig. 7A).
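For illustration, the division of a flat address into a row address and a column address may be sketched as follows, assuming a hypothetical geometry of 1024 bit lines per row:

```python
# Illustrative address split: the row address drives the row decoder (one
# word line activated), the column address selects bit lines on that row.
# The 1024-columns-per-row geometry is an assumption.

COLS_PER_ROW = 1024

def split_address(addr):
    row_address = addr // COLS_PER_ROW   # which word line to activate
    col_address = addr % COLS_PER_ROW    # which bit lines to forward
    return row_address, col_address

print(split_address(5000))   # -> (4, 904)
```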
Thus, the memory chips described below may be included in, or otherwise include, a memory chip as illustrated in any one of fig. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.
The memory chip may be manufactured by a first manufacturing process optimized for memory cells rather than logic cells. For example, the memory cells fabricated by the first manufacturing process may exhibit critical dimensions smaller than the critical dimensions of logic circuits fabricated by the same process (e.g., smaller by a factor of more than 2, 3, 4, 5, 6, 7, 8, 9, 10, or the like). For example, the first manufacturing process may include an analog manufacturing process, a DRAM manufacturing process, or the like.
Such a memory chip may include an integrated circuit, which may include memory cells. The memory unit may include a memory cell, an output port, and read circuitry. In some embodiments, the memory unit may also include a processing unit, such as a processor subunit as described above.
For example, the read circuitry may include a reduction unit and a first group of in-memory read paths for outputting up to a first number of bits via an output port. The output port may be connected to off-chip logic circuitry (such as an accelerator, CPU, GPU, or the like) or on-chip processor subunits, as described above.
In some embodiments, the processing unit may include, may be part of, may be distinct from, or may otherwise include a reduction unit.
The in-memory read path may be included in an integrated circuit (e.g., may be in a memory unit) and may include any circuitry and/or links configured for reading from and/or writing to memory cells. For example, the in-memory read path may include sense amplifiers, conductors coupled to the memory cells, multiplexers, and the like.
The processing unit may be configured to send a read request to the memory unit to read a second number of bits from the memory unit. Additionally or alternatively, the read request may originate from off-chip logic circuitry (such as an accelerator, CPU, GPU, or the like).
The reduction unit may be configured to assist in reducing power consumption associated with access requests, for example by using any of the partial word accesses described herein.
The reduction unit may be configured to control the in-memory read paths based on the first number of bits and the second number of bits during a read operation triggered by the read request. For example, a control signal from the reduction unit may affect the power consumption of the read paths, reducing the energy consumed by memory read paths not associated with the requested second number of bits. For example, the reduction unit may be configured to control the irrelevant memory read paths when the second number is less than the first number.
As explained above, the integrated circuit may be included in, may include, or otherwise include a memory chip as illustrated in any of fig. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.
The irrelevant in-memory read paths may correspond to irrelevant bits among the first number of bits, such as bits of the first number of bits that are not included in the second number of bits.
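As a minimal sketch of this control decision, the following hypothetical Python model enables only the read paths carrying the requested bits and holds the rest in a low power mode; it assumes, purely for illustration, that the relevant bits occupy the lowest positions of the word:

```python
# Hypothetical reduction-unit decision: with read paths for up to
# FIRST_NUMBER bits, a request for second_number bits enables only the
# relevant paths (e.g., their sense amplifiers) and keeps the irrelevant
# paths in a low power mode. Assumes the relevant bits are the lowest ones.

FIRST_NUMBER = 32   # width of the full group of in-memory read paths

def read_path_modes(second_number):
    """Per-path mode for one read operation."""
    assert 0 < second_number <= FIRST_NUMBER
    return ["active" if i < second_number else "low_power"
            for i in range(FIRST_NUMBER)]

modes = read_path_modes(second_number=8)
print(modes.count("active"), modes.count("low_power"))   # -> 8 24
```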
Fig. 50 illustrates an integrated circuit 5000 in an embodiment of the disclosure, comprising: memory cells 5001-5008 in memory cell array 5050; an output port 5020, which includes bits 5021 through 5028; read circuitry 5040, which includes memory read paths 5011-5018; and a reduction unit 5030.
When the second number of bits is read using the corresponding memory read path, the irrelevant bits of the first number of bits may correspond to bits that should not be read (e.g., bits that are not included in the second number of bits).
During a read operation, the reduction unit 5030 can be configured to initiate a memory read path corresponding to the second number of bits such that the initiated memory read path can be configured to convey the second number of bits. In these embodiments, only the memory read path corresponding to the second number of bits may be enabled.
During a read operation, the reduction unit 5030 can be configured to cut off at least a portion of each unrelated memory read path. For example, the uncorrelated memory read paths may correspond to uncorrelated bits in the first number of bits.
It should be noted that instead of cutting off at least a portion of the unrelated memory paths, the reduction unit 5030 may instead ensure that the unrelated memory paths are not enabled.
Additionally or alternatively, during a read operation, the reduction unit 5030 can be configured to maintain an unrelated memory read path in a low power mode. For example, the low power mode may include supplying a voltage or current lower than a normal operating voltage or current, respectively, to an unrelated memory path.
The reduction unit 5030 may be further configured to control the bit lines of the irrelevant memory read paths.

Thus, the reduction unit 5030 may be configured to load the bit lines of the relevant memory read paths and maintain the bit lines of the irrelevant memory read paths in the low power mode. For example, only the bit lines of the relevant memory read paths may be loaded.

Additionally or alternatively, the reduction unit 5030 may be configured to load the bit lines of the relevant memory read paths while maintaining the bit lines of the irrelevant memory read paths deactivated.

In some embodiments, the reduction unit 5030 may be configured to utilize a portion of the relevant memory read paths during a read operation and maintain a corresponding portion of each irrelevant memory read path in a low power mode, where that portion differs from the bit lines.

As explained above, a memory chip may use sense amplifiers to amplify voltages from memory cells included in the memory chip. Thus, the reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during a read operation and maintain the sense amplifiers associated with at least some of the irrelevant memory read paths in a low power mode.

In these embodiments, the reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during a read operation and maintain one or more sense amplifiers associated with all of the irrelevant memory read paths in a low power mode.

Additionally or alternatively, the reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during a read operation and maintain, in a low power mode, the portions of the irrelevant memory read paths that follow (e.g., spatially and/or temporally) the one or more sense amplifiers associated with the irrelevant memory read paths.
In any of the embodiments described above, the memory unit may include a column multiplexer (not shown).
In these embodiments, the reduction unit 5030 may be coupled between the column multiplexer and the output port.

Additionally or alternatively, the reduction unit 5030 may be embedded in the column multiplexer.

Additionally or alternatively, the reduction unit 5030 may be coupled between the memory cells and the column multiplexer.
The reduction unit 5030 may comprise reduction subunits that may be independently controllable. For example, different reduction subunits may be associated with different columns of memory cells.
Although described above with respect to read operations and read circuitry, the above embodiments may be similarly applied to write operations and write circuitry.
For example, an integrated circuit according to the present disclosure may include a memory unit that includes memory cells, an output port, and write circuitry. In some embodiments, the memory unit may also include a processing unit, such as a processor subunit as described above. The write circuitry may include a reduction unit and a first group of in-memory write paths for writing up to a first number of bits via the output port. The processing unit may be configured to send a write request to the memory unit to write a second number of bits to the memory unit. Additionally or alternatively, the write request may originate from off-chip logic circuitry (such as an accelerator, CPU, GPU, or the like). The reduction unit 5030 may be configured to control the in-memory write paths based on the first number of bits and the second number of bits during a write operation triggered by the write request.
Fig. 51 illustrates a memory bank 5100 that includes an array 5111 of memory cells that are addressed using row addresses and column addresses (e.g., from an on-chip processor subunit or off-chip logic circuitry, such as an accelerator, CPU, GPU, or the like). As shown in fig. 51, the memory cells are coupled to bit lines (vertical) and word lines (horizontal; many word lines are omitted for simplicity). In addition, row decoder 5112 may be fed with a row address (e.g., from an on-chip processor subunit, off-chip logic circuit, or a memory controller not shown in fig. 51), column multiplexer 5113 may be fed with a column address (e.g., from the same sources), and column multiplexer 5113 may receive the outputs of up to an entire line and output up to a word via output bus 5115. In fig. 51, the output bus 5115 of the column multiplexer 5113 is coupled to the main I/O bus 5114. In other embodiments, the output bus 5115 may be coupled to a processor subunit of the memory chip (e.g., as depicted in fig. 7A) that sends the column address and the row address. The division of the memory bank into memory pads is not shown for simplicity.
Fig. 52 illustrates a memory bank 5101. In fig. 52, the memory bank is also illustrated as including processing-in-memory (PIM) logic 5116 having an input coupled to an output bus 5115. The PIM logic 5116 may generate an address (e.g., comprising a column address and a row address) and output the address via the PIM address bus 5118 to access the memory bank. The PIM logic 5116 is an embodiment of a reduction unit (e.g., unit 5030) that also includes a processing unit. PIM logic 5116 may control other circuitry (not shown in fig. 52) to assist in reducing power. The PIM logic 5116 may further control the memory paths of the memory units comprising the memory bank 5101.
As explained above, in some cases, the word length (e.g., the number of bit lines selected to be transferred at one time) may be large.
In these cases, each word for read and/or write may be associated with a memory path that may consume power at various stages of a read and/or write operation, such as:
a. loading the bit lines — to load a bit line to a desired value (whether with the value from the capacitor on the bit line in a read cycle, or with a new value to be written to the capacitor in a write cycle), it is necessary to enable the sense amplifier at the edge of the memory array while ensuring that capacitors holding data are not unintentionally discharged or charged (otherwise, the data stored thereon would be corrupted); and
b. moving the data from the sense amplifiers to the rest of the chip (to the I/O bus that transfers data to and from the chip, or to embedded logic that will use the data, such as a processor subunit on the same substrate as the memory) via a column multiplexer that selects the relevant bit lines.
To achieve power savings, the integrated circuit of the present disclosure may determine, at row activation time, that portions of a word are irrelevant and then send a disable signal to one or more sense amplifiers associated with the irrelevant portions of the word.
Fig. 53 illustrates a memory unit 5102 including a memory cell array 5111, a row decoder 5112, a column multiplexer 5113 coupled to an output bus 5115, and PIM logic 5116.
Memory unit 5102 also includes switches 5201 that enable or disable the passage of bits to the column multiplexer 5113. The switches 5201 may include analog switches or transistors configured to function as switches, controlling the supply of voltage and/or the flow of current to portions of memory unit 5102. Sense amplifiers (not shown) may be located at the edge of the memory cell array, e.g., before (spatially and/or temporally) the switches 5201.
The switches 5201 may be controlled by an enable signal sent from the PIM logic 5116 via bus 5117. When opened, a switch cuts off the corresponding sense amplifier (not shown) of memory unit 5102, and thus the bit line disconnected from the sense amplifier is not discharged or charged.
The switches 5201 and the PIM logic 5116 may form a reduction unit (e.g., reduction unit 5030).
In yet another embodiment, the PIM logic 5116 may send an enable signal to the sense amplifier (e.g., when the sense amplifier has an enable input) instead of to the switch 5201.
The bit lines may additionally or alternatively be disconnected at other points, e.g., not at the ends of the bit lines and after the sense amplifiers. For example, the bit lines may be disconnected prior to entering the array 5111.
In these embodiments, power may also be saved on data transfers from the sense amplifiers and forwarding hardware (such as output bus 5115).
Other embodiments (which may save less power, but may be easier to implement) focus on saving power to row multiplexer 5113 and shifting the penalty from row multiplexer 5113 to the next level of circuitry. For example, as explained above, the next level circuitry may include an I/O bus (such as bus 5115) for the memory chip. In embodiments using in-memory processing, the next level circuitry may additionally or alternatively include a processor subunit of a memory chip (such as PIM logic 5116).
FIG. 54A illustrates a column multiplexer 5113 segmented into segments 5202. Each segment 5202 of the column multiplexer 5113 may be individually enabled or disabled by enable and/or disable signals sent from the PIM logic 5116 via bus 5119. The column multiplexer 5113 may also be fed by the address bus 5118.
The embodiment of fig. 54A may provide better control of different portions of the output from column multiplexer 5113.
It should be noted that the control of the different memory paths may have different resolutions, e.g., ranging from one-bit resolution to multi-bit resolution. The former may be more efficient in terms of power saving, while the latter may be simpler to implement and require fewer control signals.
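The trade-off between the two resolutions can be illustrated numerically; the word width and resolutions below are example values only:

```python
# Example numbers for the resolution trade-off: finer resolution wastes
# fewer bits on partial-word accesses but needs more control signals.

WORD_BITS = 128

for resolution in (1, 8, 16):   # bits per individually controlled path
    control_signals = WORD_BITS // resolution
    wasted = resolution - 1     # worst case: 1 relevant bit per group
    print(f"resolution={resolution:>2} bits: {control_signals:>3} control "
          f"signals, up to {wasted} unneeded bits active per group")
```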
Fig. 54B illustrates an embodiment of a method 5130. For example, method 5130 may be implemented using any of the memory units described above with respect to figs. 50, 51, 52, 53, or 54A.
Step 5132 may include: an access request is sent by a processing unit (e.g., PIM logic 5116) of the integrated circuit to a memory unit of the integrated circuit to read a second number of bits from the memory unit. The memory unit can include a memory cell (e.g., a memory cell of the array 5111), an output port (e.g., an output bus 5115), and read/write circuitry that can include a reduction unit (e.g., the reduction unit 5030) and a first group of memory read/write paths for outputting and/or inputting up to a first number of bits via the output port.
The access request may include a read request and/or a write request.
The memory input/output paths may include a memory read path, a memory write path, and/or paths for both reads and writes.
Step 5134 can include responding to the access request.
For example, step 5134 may include controlling, by a reduction unit (e.g., unit 5030), a memory read/write path based on the first number of bits and the second number of bits during an access operation triggered by the access request.
Step 5134 can also include any of the following operations and/or any combination of any of the following operations. Any of the operations listed below may be performed during, but may also be performed before and/or after responding to an access request.
Thus, step 5134 may include at least one of the following operations:
a. controlling the irrelevant memory read paths when the second number is less than the first number, wherein the irrelevant memory read paths are associated with bits of the first number of bits that are not included in the second number of bits;

b. initiating the relevant memory read paths during a read operation, wherein the relevant memory read paths are configured to convey the second number of bits;

c. cutting off at least a portion of each of the irrelevant memory read paths during a read operation;

d. maintaining the irrelevant memory read paths in a low power mode during a read operation;

e. controlling the bit lines of the irrelevant memory read paths;

f. loading the bit lines of the relevant memory read paths and maintaining the bit lines of the irrelevant memory read paths in a low power mode;

g. loading the bit lines of the relevant memory read paths while maintaining the bit lines of the irrelevant memory read paths deactivated;

h. utilizing a portion of the relevant memory read paths during a read operation and maintaining a corresponding portion of each of the irrelevant memory read paths in a low power mode, wherein the portion differs from the bit lines;

i. utilizing portions of the relevant memory read paths during a read operation and maintaining the sense amplifiers of at least some of the irrelevant memory read paths in a low power mode;

j. utilizing portions of the relevant memory read paths during a read operation and maintaining the sense amplifiers of all of the irrelevant memory read paths in a low power mode; and

k. utilizing portions of the relevant memory read paths during read operations and maintaining, in a low power mode, the portions of the irrelevant memory read paths that follow the sense amplifiers of the irrelevant memory read paths.
The low power mode or idle mode may include a mode in which the power consumption of the memory access path is lower than the power consumption of the memory access path when the memory access path is used for an access operation. In some embodiments, the low power mode may even involve shutting down the memory access path. The low power mode may additionally or alternatively include not activating a memory access path.
It should be noted that power reduction occurring during the bit-line phase may require that the relevance or irrelevance of each memory access path be known before the word line is opened. Power reduction occurring elsewhere (e.g., in a column multiplexer) may instead allow the relevance or irrelevance of the memory access paths to be determined at each access.
Fast and low-power startup and fast-access memory
DRAM and other memory types (such as SRAM, flash memory, or the like) are often built from memory banks that are typically built to allow row and column access schemes.
FIG. 55 illustrates an embodiment of a memory chip 5140 that includes a plurality of memory pads and associated logic (such as row and column decoders, depicted in FIG. 55 as RD and COL, respectively). In the embodiment of FIG. 55, the pads are grouped into groups and have word lines and bit lines passing through them. The memory pads and associated logic are designated 5141, 5142, 5143, 5144, 5145, and 5146 in figure 55, and share at least one bus 5147.
The memory chip 5140 may be included in, may include, or otherwise comprise a memory chip as illustrated in any of fig. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.
For example, in a DRAM, the overhead associated with starting a new row (e.g., preparing a new line for access) is significant. Once a line is activated (also referred to as open), the data within that line is available for faster access. In a DRAM, this access may be performed in a random manner.
Two issues associated with starting a new line are power and time:
a. power rises due to current surges caused by accessing all of the capacitors on the line together and having to load the line (e.g., power can reach several amperes when opening a line across only a few memory banks); and

b. time delay, primarily associated with the time it takes to load the row (word) lines and then the bit (column) lines.
Some embodiments of the present disclosure may include systems and methods to reduce peak power consumption during line start-up and reduce line start-up time. Some embodiments may sacrifice full random access in a line, at least to some extent, to reduce these power and time costs.
For example, in one embodiment, a memory unit may include a first memory pad, a second memory pad, and an activation unit configured to activate a first group of memory cells included in the first memory pad and not activate a second group of memory cells included in the second memory pad. The first group of memory cells and the second group of memory cells may both belong to a single column of the memory unit.
Alternatively, the activation unit may be configured to activate the second group of memory cells included in the second memory pad without activating the first group of memory cells.
In some embodiments, the activation unit may be configured to activate the second group of memory cells after activating the first group of memory cells.
For example, the activation unit may be configured to activate the second group of memory cells after an expiration of a delay period that is initiated after activation of the first group of memory cells has been completed.
Additionally or alternatively, the activation unit may be configured to activate the second group of memory cells based on a value of a signal generated on a first wordline segment coupled to the first group of memory cells.
In any of the embodiments described above, the activation cell may include an intermediate circuit disposed between the first word line segment and the second word line segment. In these embodiments, the first wordline section may be coupled to the first memory cell and the second wordline section may be coupled to the second memory cell. Non-limiting examples of intermediate circuits include switches, flip-flops, buffers, inverters, and the like, some of which are illustrated throughout fig. 56-61.
In some embodiments, the second memory cell may be coupled to a second wordline segment. In these embodiments, the second word line segment may be coupled to a bypass word line path through at least the first memory pad. An embodiment of such a bypass path is illustrated in fig. 61.
The activation unit may include a control unit configured to control supply of voltages (and/or currents) to the first group of memory cells and the second group of memory cells based on an activation signal from a wordline associated with a single row.
In another embodiment, a memory unit may include a first memory pad, a second memory pad, and an activation unit configured to supply an activation signal to a first group of memory cells of the first memory pad and delay the supply of the activation signal to a second group of memory cells of the second memory pad at least until activation of the first group of memory cells has completed. The first group of memory cells and the second group of memory cells may belong to a single column of the memory unit.
For example, the start-up unit may comprise a delay unit which may be configured to delay supplying the start-up signal.
Additionally or alternatively, the start-up unit may comprise a comparator which may be configured to receive the start-up signal at its input and to control the delay unit based on at least one characteristic of the start-up signal.
In another embodiment, a memory unit may include a first memory pad, a second memory pad, and an isolation unit, which may be configured to: isolate first memory cells of the first memory pad from second memory cells of the second memory pad during an initial start-up period in which the first memory cells are activated; and couple the first memory cells to the second memory cells after the initial start-up period. The first memory cells and the second memory cells may belong to a single column of the memory unit.
In the following embodiments, no modification to the memory pads themselves may be required. In some examples, embodiments may rely on a small number of modifications to the memory banks.
The following figures depict mechanisms added to a memory bank for shortening the word-line signals, thereby splitting the word lines into shorter portions.
In the following figures, various memory bank components are omitted for clarity.
Fig. 56-61 illustrate portions of memory banks (designated 5140(1), 5140(2), 5140(3), 5140(4), 5140(5), and 5140(6), respectively) that include a row decoder 5112 and a plurality of memory pads (such as 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), 5150(6), 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), 5151(6), 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6), respectively).
The memory pads arranged in rows may comprise different groups.
Fig. 56-59 and 61 illustrate nine groups of memory pads, where each group includes a pair of memory pads. Any number of groups, each having any number of memory pads, may be used.
Memory pads 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), and 5150(6) are arranged in rows, sharing a plurality of memory lines, and divided into three groups: a first upper group comprising memory pads 5150(1) and 5150 (2); a second upper group comprising memory pads 5150(3) and 5150 (4); and a third upper group comprising memory pads 5150(5) and 5150 (6).
Similarly, memory pads 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), and 5151(6) are arranged in rows, sharing multiple memory lines and divided into three groups: a first intermediate group comprising memory pads 5151(1) and 5151 (2); a second intermediate group comprising memory pads 5151(3) and 5151 (4); and a third intermediate group comprising memory pads 5151(5) and 5151 (6).
In addition, memory pads 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6) are arranged in rows, sharing multiple memory lines and grouped into three groups: a first lower group comprising memory pads 5152(1) and 5152 (2); a second lower group comprising memory pads 5152(3) and 5152 (4); and a third lower group comprising memory pads 5152(5) and 5152 (6). Any number of memory pads may be arranged in rows and share memory lines, and may be divided into any number of groups.
For example, the number of memory pads per group may be one, two, or may exceed two.
As explained above, the activation circuit may be configured to activate one group of memory pads without activating another group of memory pads that shares the same word line, or that is at least coupled to a different segment of a word line having the same row address.
Fig. 56 to 61 illustrate different examples of the start-up circuit. In some embodiments, at least a portion of the activation circuitry (such as intermediate circuitry) may be located between groups of memory pads to allow activation of memory pads of one group without activating another group of memory pads of the same row.
Fig. 56 illustrates intermediate circuitry, such as delay or isolation circuits 5153(1) through 5153(3), positioned between the lines of the first upper group of memory pads and the lines of the second upper group of memory pads.

Fig. 56 also illustrates intermediate circuitry, such as delay or isolation circuits 5154(1) through 5154(3), positioned between the lines of the second upper group and the lines of the third upper group of memory pads. In addition, delay or isolation circuitry is positioned between the groups formed by the intermediate groups of memory pads, and between the groups formed by the lower groups of memory pads.
The delay or isolation circuit can delay or stop the propagation of the word line signal from the row decoder 5112 along a row to another group.
Fig. 57 illustrates intermediate circuitry, such as delay or isolation circuitry, including flip-flops (such as 5155(1) -5155 (3) and 5156(1) -5156 (3)).
When an activation signal is injected into a word line, the portion of the word line within the first group of pads is activated (depending on the word line), while the other groups along the word line remain deactivated. The other groups may be activated on subsequent clock cycles. For example, the second group may be activated on the next clock cycle, and the third group after yet another clock cycle.
The flip-flop may comprise a D-type flip-flop or any other type of flip-flop. For simplicity, the clocks fed to the D-type flip-flops are omitted from the figure.
Thus, access to the first group may use power to charge only a portion of the word lines associated with the first group, which charges faster and requires less current than the entire word line.
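This peak-power effect can be illustrated with a toy calculation; the charge unit and group count below are arbitrary and stand in for the line capacitance actually being charged:

```python
# Toy model of peak start-up power: an unsegmented word line charges all
# group segments in one cycle; with flip-flops, one segment per cycle.

GROUPS = 3
CHARGE_PER_SEGMENT = 1.0   # arbitrary unit; depends on line capacitance

peak_unsegmented = GROUPS * CHARGE_PER_SEGMENT   # whole line at once
peak_segmented = CHARGE_PER_SEGMENT              # one group per cycle,
                                                 # first group usable early

print(peak_unsegmented, peak_segmented)   # -> 3.0 1.0
```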
More than one flip-flop may be used between groups of memory pads, thereby increasing the delay between open portions. Additionally or alternatively, embodiments may use a slower clock to increase delay.
Furthermore, the groups not yet activated may still hold data from the previously used line. For example, this approach may allow a new line segment to be activated while data from a previous line is still being accessed in other segments, thereby reducing the penalty associated with activating the new line.
Thus, some embodiments may activate a first group while allowing other groups, belonging to a previously activated line, to remain active, as long as the signals on the bit lines do not interfere with each other.
Additionally, some embodiments may include switches and control signals. The control signals may be controlled by a bank controller, or flip-flops may be added between the control signals (e.g., to produce the same timing effects as the mechanisms described above).
Fig. 58 illustrates intermediate circuitry, such as delay or isolation circuits, implemented as switches (such as 5157(1) through 5157(3) and 5158(1) through 5158(3)) positioned between one group and another. A set of switches positioned between groups may be controlled by a dedicated control signal. In fig. 58, the control signals may be sent by row control unit 5160(1) and delayed by a sequence of one or more delay units (e.g., units 5160(2) and 5160(3)) between the different sets of switches.
FIG. 59 illustrates intermediate circuitry, such as delay or isolation circuits, implemented as sequences of inverter gates or buffers (such as 5159(1) through 5159(3) and 5159'(1) through 5159'(3)) positioned between groups of memory pads.
Buffers may be used between groups of memory pads instead of switches. Unlike switches, buffers may prevent the voltage drop along the word line that sometimes occurs across single-transistor switch structures.
Other embodiments may allow for more random access and still provide very low power-up and time by using area added to the memory bank.
FIG. 60 illustrates an embodiment using global word lines (such as 5152(1) through 5152(8)) located close to the memory pads. These global word lines may pass by the memory pads and may be coupled, via intermediate circuitry such as switches (e.g., 5157(1) through 5157(8)), to the word lines within the memory pads. The switches may control which memory pad is activated, allowing the memory controller to activate only the relevant line portion at each point in time. Unlike the embodiments described above that use sequential activation of line portions, the embodiment of fig. 60 may provide better control.
The enable signals, such as row portion enable signals 5170(1) and 5170(2), may originate from logic not shown, such as a memory controller.
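As an illustration of this control scheme, the sketch below models the enable signals selecting which local word-line segments couple to an activated global line; all names are hypothetical:

```python
# Hypothetical model of fig. 60: row-portion enable signals choose which
# pads' local word-line segments connect to the activated global line.

def activated_segments(global_line_active, enables):
    """Indices of local segments coupled to the global word line."""
    if not global_line_active:
        return []
    return [i for i, on in enumerate(enables) if on]

# Activate only the segments of pads 2 and 3 out of six along the line.
print(activated_segments(True, [False, False, True, True, False, False]))
```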
FIG. 61 illustrates a global word line 5180 that passes through the memory pads, forming a bypass path for word-line signals so that they need not be routed outside the pads. Thus, the embodiment shown in fig. 61 may reduce the area of the memory bank at the expense of some memory density.
In fig. 61, the global word line may pass through the memory pad without interruption and may not be connected to the memory cells. A local word-line segment may be controlled by one of the switches and connected to the memory cells in the pad.
A memory bank may effectively support full random access when the groups of memory pads provide a sufficiently fine partitioning of the word lines.
Another embodiment for slowing the propagation of the activation signal along the word line may also save some wiring and logic by using switches and/or other buffering or isolation circuitry between the memory pads, rather than a dedicated enable signal and dedicated lines to convey it.
For example, the comparator may be used to control a switch or other buffer or isolation circuit. When the level of the signal on the word line segment monitored by the comparator reaches a certain level, the comparator may activate a switch or other buffering or isolation circuit. For example, a certain level may indicate that a previous word line segment is fully loaded.
FIG. 62 illustrates a method 5190 for operating a memory unit. For example, method 5190 may be implemented using any of the memory banks described above with respect to figs. 56-61.
Step 5192 may include activating, by an activation unit, a first group of memory cells included in a first memory pad of the memory unit. Step 5194 may include activating, by the activation unit, a second group of memory cells, e.g., after step 5192.
Step 5194 may be performed upon activation of the first group of memory cells, upon full activation of the first group of memory cells, upon expiration of a delay period that is initiated after activation of the first group of memory cells has been completed, upon deactivation of the first group of memory cells, and the like.
The delay period may be fixed or adjustable. For example, the duration of the delay period may be based on the expected access pattern of the memory cells, or may be set independently of the expected access pattern. The delay period may range between less than one millisecond and more than one second.
In some embodiments, step 5194 may be initiated based on the value of a signal generated on a first wordline segment coupled to a first group of memory cells. For example, when the value of the signal exceeds a first threshold, it may indicate that a first group of memory cells is fully activated.
Either of steps 5192 and 5194 may involve the use of an intermediate circuit (e.g., an intermediate circuit that activates a cell) disposed between the first word line segment and the second word line segment. The first wordline section may be coupled to the first memory cell and the second wordline section may be coupled to the second memory cell.
Embodiments of intermediate circuits are described throughout fig. 56-61.
Accelerated testing using memory parallelism, and testing logic in memory using vectors
Some embodiments of the present disclosure may use an on-chip test unit to speed up testing.
In general, memory chip testing requires a significant amount of test time. Reducing test time reduces production costs and also allows more tests to be performed to produce a more reliable product.
Fig. 63 and 64 illustrate tester 5200 and a chip (or a wafer of chips) 5210. Tester 5200 may include software to manage the tests. Tester 5200 may write different data sequences to all of memory 5210 and then read the sequences back to identify where any failed bits of memory 5210 are located. Once the failed bits are identified, tester 5200 may issue a repair command; if the problem can be repaired, tester 5200 may declare memory 5210 to have passed. In other cases, some chips may be declared failed.
Fig. 64 shows a test system with tester 5200 and a complete wafer 5202 of chips (such as 5210) that are tested in parallel. For example, tester 5200 may be connected to each of the chips by a wire bus.
As shown in FIG. 64, tester 5200 must read and write all the memory chips several times, and the data must be passed through an external chip interface.
Furthermore, it may be beneficial to test both logic and memory banks of an integrated circuit, for example, using programmable configuration information, which may be provided using regular I/O operations.
The testing may also benefit from the presence of test units within the integrated circuit.
The test unit may belong to an integrated circuit and may analyze test results and find faults, for example, in logic (e.g., a processor subunit as depicted and described in fig. 7A) and/or memory (e.g., across multiple memory banks).
Memory testers are typically very simple and exchange test vectors with integrated circuits according to a simple format. For example, there may be a write vector that includes pairs of addresses of memory entries to be written and values to be written to the memory entries. There may also be a read vector that includes the address of the memory entry to be read. At least some of the addresses of the write vectors may be the same as at least some of the addresses of the read vectors. At least some other addresses of the write vector may be different from at least some other addresses of the read vector. When programmed, the memory tester may also receive an expected result vector, which may include the address of the memory entry to be read and the expected value to be read. The memory tester may compare the expected value to its read value.
According to an embodiment, logic (e.g., processor subunits) of an integrated circuit (with or without memory of the integrated circuit) may be tested by a memory tester using the same protocol/format. For example, some values in the write vector may be commands to be executed by logic of the integrated circuit (and may, for example, involve computations and/or memory accesses). The memory tester may be programmed with a read vector and an expected result vector, which may include memory entry addresses, some of which store calculated expected values. Thus, the memory tester can be used to test logic as well as memory. Memory testers are typically simpler and cheaper than logic testers, and the proposed method allows complex logic tests to be performed using a simple memory tester.
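For illustration, the vector format described above may be modeled as follows; the addresses, the 0x1000 command-region convention, and the "ADD" opcode are all hypothetical and serve only to show how one set of write, read, and expected-result vectors can exercise both memory and logic:

```python
# Hypothetical test vectors: writes at addresses below 0x1000 store data;
# a write at or above 0x1000 stands in for a command executed by the
# in-memory logic (here, adding the entries at 0x10 and 0x11).

memory = {}

def execute(address, value):
    if address >= 0x1000:                 # assumed command region
        memory[address] = memory.get(0x10, 0) + memory.get(0x11, 0)
    else:                                 # plain memory write
        memory[address] = value

write_vector = [(0x10, 3), (0x11, 4), (0x1000, "ADD")]
read_vector = [0x10, 0x11, 0x1000]
expected_vector = [(0x10, 3), (0x11, 4), (0x1000, 7)]   # 7 = computed sum

for address, value in write_vector:
    execute(address, value)

results = {a: memory.get(a) for a in read_vector}
failures = [(a, v) for a, v in expected_vector if results[a] != v]
print("pass" if not failures else f"failed entries: {failures}")
```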
In some embodiments, logic within the memory may enable testing of logic within the memory by using only vectors (or other data structures) without using more complex mechanisms common in logic testing, such as communicating with a controller, e.g., via an interface, telling the logic which circuit to test.
Instead of using a test unit, the memory controller may be configured to receive an instruction to access a memory entry included in the configuration information, execute the access instruction, and output a result.
Any of the integrated circuits illustrated in figs. 65-69 may perform the testing even in the absence of a test unit, or in the presence of a test unit that is unable to perform the testing.
Embodiments of the present disclosure may include methods and systems that use memory parallelism and internal chip bandwidth to accelerate and improve test time.
The method and system may be based on the memory chip testing itself (running the tests, reading the test results, and analyzing the results), rather than on the tester, saving the results and ultimately allowing the tester to read the results (and, if desired, program the memory chip back, e.g., to activate a redundancy mechanism). The testing may include testing the memory alone, or testing both the memory banks and the logic (in the case of computational memory having active logic portions to be tested, such as the case described above in fig. 7A).
In one embodiment, the method may include reading and writing data within the chip such that external bandwidth does not limit the test.
In embodiments where the memory chip includes processor subunits, each processor subunit may be programmed by test code or configuration.
In embodiments where the memory chip has processor subunits that cannot execute test code, or has no processor subunits but has a memory controller, the memory controller may be configured to read and write patterns (e.g., programmed into the controller externally) and to mark the locations of failures (e.g., where a value was written to a memory entry, the entry was read back, and a value other than the written value was received) for further analysis.
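A minimal sketch of such a pattern test, with a deliberately injected fault for demonstration, might look as follows; the pattern values and fault model are illustrative only:

```python
# Sketch of a write/read-back pattern test that records failing addresses;
# the stuck-bit fault at address 5 is injected only for demonstration.

def run_pattern_test(size, pattern, fault_at=None):
    memory = {}
    for addr in range(size):            # write phase
        memory[addr] = pattern
    if fault_at is not None:
        memory[fault_at] = pattern ^ 1  # injected stuck bit
    return [addr for addr in range(size)   # read-back phase
            if memory[addr] != pattern]

# Alternating patterns, repeated as needed per voltage/temperature corner.
for pattern in (0b01010101, 0b10101010):
    print(f"pattern {pattern:08b}: failed ->",
          run_pattern_test(size=16, pattern=pattern, fault_at=5))
```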
It should be noted that testing the memory may require testing a large number of bits, e.g., testing each bit of the memory and verifying whether the bit under test is functional. Furthermore, memory tests can sometimes be repeated under different voltage and temperature conditions.
For some defects, one or more redundancy mechanisms may be activated (e.g., by programming flash memory or OTP or blowing fuses). Furthermore, logic and analog circuits (e.g., controller, regulator, I/O) of the memory chip may also have to be tested.
In one embodiment, an integrated circuit may comprise: a substrate, a memory array disposed on the substrate, a processing array disposed on the substrate, and an interface disposed on the substrate.
The integrated circuits described herein may be included in, may include, or otherwise include a memory chip as illustrated in any one of fig. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.
Fig. 65-69 illustrate various integrated circuits 5210 and tester 5200.
The integrated circuits are illustrated as including memory banks 5212, a chip interface 5211 (including an I/O controller 5214 and a bus 5213 shared by the memory banks), and logic units (hereinafter "logic") 5215. Fig. 66 also illustrates a fuse interface 5216 and a bus 5217 coupled to the fuse interface and to the different memory banks.
Fig. 65-70 also illustrate various steps in a test process, such as:
a. writing a test sequence 5221 (figs. 65, 67, 68, and 69);

b. reading back the test results 5222 (figs. 67, 68, and 69);

c. writing an expected result sequence 5223 (fig. 65);

d. reading the failed addresses to repair 5224 (fig. 66); and

e. programming fuses 5225 (fig. 66).
Each memory bank may be coupled to and/or controlled by its own logic unit 5215. However, as described above, any allocation of memory banks to logic units 5215 may be used. Thus, the number of logic units 5215 may differ from the number of memory banks, and a logic unit may control more than a single memory bank or only a portion of a memory bank, and the like.
The logic unit 5215 may include one or more test units. Fig. 65 illustrates a test unit (TU) 5218 within the logic 5215. TUs may be included in all or some of the logic units 5215. It should be noted that a test unit may be separate from the logic unit or integrated with the logic unit.
Fig. 65 also illustrates a test pattern generator (designated GEN) 5219 within TU 5218.
The test pattern generator may be included in all or some of the test units. For simplicity, the test pattern generator and test unit are not illustrated in fig. 66-70, but may be included in those embodiments.
The memory array may include a plurality of memory banks. In addition, the processing array may include a plurality of test units. The plurality of test units may be configured to test the plurality of memory banks to provide test results. The interface may be configured to output information indicative of the test results to a device external to the integrated circuit.
The plurality of test units may include at least one test pattern generator configured to generate at least one test pattern for testing one or more of the plurality of memory banks. In some embodiments, as explained above, each of the plurality of test units may include a test pattern generator configured to generate a test pattern for use by a particular test unit of the plurality of test units to test at least one of the plurality of memory banks. As indicated above, fig. 65 illustrates a test pattern generator (GEN) 5219 within a test unit. One or more, or even all, of the logic units may include a test pattern generator.
The at least one test pattern generator may be configured to receive, from the interface, instructions for generating the at least one test pattern. A test pattern may include the memory entries that should be accessed (e.g., read and/or written) during testing and/or the values to be written to those entries, and the like.
The interface may be configured to receive configuration information from an external unit, which may be external to the integrated circuit, the configuration information comprising instructions for generating at least one test pattern.
The at least one test pattern generator may be configured to read configuration information from the memory array, the configuration information including instructions for generating the at least one test pattern.
In some embodiments, the configuration information may include a vector.
The interface may be configured to receive configuration information from a device that may be external to the integrated circuit, where the configuration information may include at least one test pattern.
For example, the at least one test pattern may include memory array entries to be accessed during testing of the memory array.
The at least one test pattern further may include input data to be written to a memory array entry accessed during testing of the memory array.
Additionally or alternatively, the at least one test pattern further may include input data to be written to memory array entries accessed during testing of the memory array, and expected values of output data to be read from memory array entries accessed during testing of the memory array.
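As a concrete illustration of the test-pattern contents enumerated above, a pattern can be represented as the entries to access, the input data to write, and the expected read-back values. The sketch below is hypothetical; its field names are invented for illustration and are not from the disclosure.

```python
# Sketch: one possible in-memory representation of a test pattern.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TestPattern:
    # (address, value) pairs to write during the test
    writes: List[Tuple[int, int]] = field(default_factory=list)
    # (address, expected value) pairs to read back and compare
    expected_reads: List[Tuple[int, int]] = field(default_factory=list)

pattern = TestPattern(
    writes=[(0x00, 0xFF), (0x01, 0x00)],
    expected_reads=[(0x00, 0xFF), (0x01, 0x00)],
)
```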
In some embodiments, the plurality of test units may be configured to retrieve from the memory array test instructions that, once executed by the plurality of test units, cause the plurality of test units to test the memory array.
For example, the test instructions may be included in the configuration information.
The configuration information may include expected results of testing of the memory array.
Additionally or alternatively, the configuration information may include a value of output data to be read from a memory array entry accessed during testing of the memory array.
Additionally or alternatively, the configuration information may include a vector.
In some embodiments, the plurality of test units may be configured to retrieve from the memory array test instructions that, once executed by the plurality of test units, cause the plurality of test units to test the memory array and test the processing array.
For example, the test instructions may be included in the configuration information.
The configuration information may include a vector.
Additionally or alternatively, the configuration information may include expected results of testing of the memory array and the processing array.
In some embodiments, as described above, the plurality of test units may lack a test pattern generator for generating test patterns for use during testing of the plurality of memory banks.
In these embodiments, at least two of the plurality of test units may be configured to test at least two of the plurality of memory banks in parallel.
Alternatively, at least two of the plurality of test units may be configured to serially test at least two of a plurality of memory banks.
In some embodiments, the information indicative of the test result may include an identifier of the failed memory array entry.
In some embodiments, the interface may be configured to retrieve, multiple times during testing of the memory array, partial test results obtained by the plurality of test units.
In some embodiments, the integrated circuit may include an error correction unit configured to correct at least one error detected during testing of the memory array. For example, the error correction unit may be configured to repair memory errors using any suitable technique, e.g., by disabling some memory words and replacing them with redundant words.
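The following sketch illustrates, under stated assumptions, the repair technique mentioned above: failed memory words are disabled and transparently replaced with redundant words. A Python remapping table stands in for the fuse/OTP programming a real chip would use; all names are assumptions.

```python
# Sketch: disable failed words and redirect them to redundant words.

class RepairableMemory:
    def __init__(self, size: int, redundant_words: int):
        self.data = [0] * (size + redundant_words)
        self.remap = {}                    # failed address -> redundant address
        self.next_spare = size

    def disable_word(self, addr: int):
        """Replace a failed word with the next available redundant word."""
        if self.next_spare >= len(self.data):
            raise RuntimeError("out of redundant words; chip fails the test")
        self.remap[addr] = self.next_spare
        self.next_spare += 1

    def _resolve(self, addr: int) -> int:
        return self.remap.get(addr, addr)

    def write(self, addr, value): self.data[self._resolve(addr)] = value
    def read(self, addr): return self.data[self._resolve(addr)]

mem = RepairableMemory(size=8, redundant_words=2)
mem.disable_word(3)            # e.g., address 3 failed an earlier test
mem.write(3, 0xA5)
assert mem.read(3) == 0xA5     # access is transparently redirected
```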
In any of the embodiments described above, the integrated circuit may be a memory chip.
For example, the integrated circuit may comprise a distributed processor, wherein the processing array may comprise a plurality of sub-units of the distributed processor, as depicted in fig. 7A.
In these embodiments, each of the processor subunits may be associated with a corresponding dedicated memory bank of a plurality of memory banks.
In any of the above described embodiments, the information indicative of the test results may indicate a state of at least one memory bank. The state of a memory bank may be provided at one or more granularities: per memory word, per group of entries, or per complete memory bank.
Fig. 65-66 illustrate four steps in the tester test phase.
In a first step, the tester writes (5221) a test sequence, and the logic units write the data to their memory banks. The logic units may also be complex enough to receive commands from the tester and generate the sequences themselves (as explained below).
In a second step, the tester writes (5223) the expected results to the memory under test, and each logic unit compares the expected results to the data read from its bank, saving a list of errors. Writing the expected results may be simplified if the logic is complex enough to produce the sequence of expected results itself, as explained below.
In a third step, the tester reads (5224) the failure address from the logic unit.
In a fourth step, the tester acts on the results (5225) so that errors may be repaired. For example, the tester may connect to a specific interface to program fuses in the memory, but any other mechanism that allows programming of error correction mechanisms within the memory may also be used.
In these embodiments, the memory tester may use the vectors to test the memory.
For example, each vector may be built from an input series and an output series.
The input series may include pairs of addresses and data to be written to memory (in many embodiments, this series may be modeled as a formula that allows a program, such as a program executed by a logic unit, to generate the series on demand).
In some embodiments, the test pattern generator may produce such vectors.
It should be noted that a vector is one embodiment of a data structure, but some embodiments may use other data structures. The data structure may be compatible with other test data structures generated by a tester external to the integrated circuit.
The output series may include pairs of addresses and the expected data to be read back from memory (in some embodiments, this series may additionally or alternatively be generated at runtime by a program, such as a program executed by a logic unit).
Memory testing typically involves executing a list of vectors, each of which writes data to memory according to its input series and then reads the data back according to its output series, comparing the read data to the expected data.
In the event of a mismatch, the memory may be classified as failing, or if the memory includes a mechanism for redundancy, the redundancy mechanism may be activated such that the vectors are again tested on the activated redundancy mechanism.
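A minimal sketch of this vector-driven flow follows, assuming a vector is simply an input series of (address, data) writes plus an output series of (address, expected data) reads; the structure and names are illustrative assumptions.

```python
# Sketch: apply one test vector and collect mismatching addresses.

def run_vector(memory, input_series, output_series):
    """Apply one test vector; return addresses whose read-back mismatches."""
    for addr, data in input_series:
        memory[addr] = data
    return [addr for addr, expected in output_series if memory[addr] != expected]

memory = [0] * 16
vector = {
    "inputs":  [(i, i ^ 0x5A) for i in range(16)],
    "outputs": [(i, i ^ 0x5A) for i in range(16)],
}
mismatches = run_vector(memory, vector["inputs"], vector["outputs"])
print(mismatches or "vector passed")  # a mismatch would classify the memory as failing
```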
In embodiments where the memory includes a processor subunit (as described above with respect to FIG. 7A) or contains many memory controllers, the entire test may be handled by the set of logic units. Thus, the memory controller or processor subunit may perform the test.
The memory controller may be programmed from the tester and the test results may be saved in the controller itself for later reading by the tester.
To configure and test the operation of the logic units, the tester may configure the logic units via memory accesses and verify the results by reading them back via memory accesses.
For example, an input vector may contain a programming sequence for a logic unit, and an output vector may contain the expected results of that test. For example, if a logic unit, such as a processor subunit, contains a multiplier or adder configured to perform a computation on two addresses in memory, the input vector may include a set of commands to write data to memory and a set of commands to the adder/multiplier logic. As long as the adder/multiplier result can be read back and compared against the output vector, the result may be sent to the tester.
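The following hypothetical sketch illustrates this style of logic test through memory: the input vector writes two operands and a command for the adder, and the output vector holds the expected sum. The toy adder model and all names are assumptions made for the sketch.

```python
# Sketch: testing an adder in a logic unit via memory accesses.

def adder_logic(memory, cmd):
    """Toy stand-in for a processor subunit's adder: reads two addresses
    and writes their sum to a result address."""
    a, b, dst = cmd
    memory[dst] = memory[a] + memory[b]

memory = [0] * 8
input_vector = {"writes": [(0, 17), (1, 25)], "adder_cmd": (0, 1, 2)}
output_vector = {"expected": [(2, 42)]}

for addr, value in input_vector["writes"]:
    memory[addr] = value
adder_logic(memory, input_vector["adder_cmd"])

for addr, expected in output_vector["expected"]:
    assert memory[addr] == expected  # result read back and compared by the tester
```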
The testing may also include loading the logic configuration from memory and causing the logic output to be sent to memory.
In embodiments where the logic loads its configuration from memory (e.g., if the logic is a memory controller), the logic may run code from the memory itself.
Thus, the input vector may include a program for the logic cell, and the program itself may test various circuits in the logic cell.
Thus, testing may not be limited to receiving vectors in a format used by an external tester.
If the command loaded into the logic unit instructs the logic unit to write results back into the memory bank, the tester may read the results and compare the results to the expected output series.
For example, the vector written to memory may be or may include a test program for the logic (testing may assume that the memory is valid; even if it is not, the written test program will not work and the test will fail, which is an acceptable result because the chip is invalid anyway) and/or instructions for how the logic runs the code and writes the results back to memory. Because all testing of the logic units may be done through the memory (e.g., logic test inputs are written to the memory and test results are written back to the memory), the tester may run a simple vector test with a sequence of inputs and expected outputs.
The logic configuration and results may be accessed as read and/or write commands.
Fig. 68 illustrates tester 5200 sending a write test sequence 5221, which is a vector.
Portions of the vector include test code 5232 that is split between memory banks 5212 coupled to the logic 5215 of the processing array.
Each logic 5215 may execute code 5232 stored in its associated memory bank, and the execution may include accessing one or more memory banks, performing calculations, and storing results (e.g., test results 5231) in the memory bank 5212.
The test results may then be read back by tester 5200 (e.g., read-back results 5222).
This may allow the logic 5215 to be controlled by commands received by the I/O controller 5214.
In fig. 68, the I/O controller 5214 is connected to memory banks and logic. In other embodiments, logic may be connected between the I/O controller 5214 and the memory banks.
FIG. 70 illustrates a method 5300 for testing a memory bank. For example, method 5300 may be implemented using any of the memory banks described above with respect to fig. 65-69. Method 5300 may begin with step 5302, which may include receiving a request for testing a plurality of memory banks of the integrated circuit.
In some embodiments, the request may include configuration information, one or more vectors, commands, and the like.
In these embodiments, the configuration information may include expected results of testing of the memory array, instructions, data, values of output data to be read from memory array entries accessed during testing of the memory array, test patterns, and the like.
The test pattern may include at least one of: (i) a memory array entry to be accessed during testing of the memory array, (ii) input data to be written to the memory array entry to be accessed during testing of the memory array, or (iii) an expected value of output data to be read from the memory array entry to be accessed during testing of the memory array.
Step 5302 may include and/or may be followed by at least one of:
a. receiving, by at least one test pattern generator and from the interface, instructions for generating at least one test pattern;
b. receiving, through the interface and from a unit external to the integrated circuit, configuration information including instructions for generating at least one test pattern;
c. reading, by at least one test pattern generator, configuration information from the memory array, the configuration information including instructions for generating at least one test pattern;
d. receiving, through the interface and from a unit external to the integrated circuit, configuration information including at least one test pattern;
e. retrieving, by the plurality of test units and from the memory array, test instructions that, upon execution by the plurality of test units, cause the plurality of test units to test the memory array; and
f. retrieving, by the plurality of test units and from the memory array, test instructions that, upon execution by the plurality of test units, cause the plurality of test units to test the memory array and to test the processing array.
Step 5302 may be followed by step 5310. Step 5310 may include testing, by the plurality of test units and in response to the request, the plurality of memory banks to provide test results.
Step 5310 may include and/or may be followed by at least one of:
a. generating, by one or more test pattern generators (e.g., included in one, some, or all of the plurality of test units), a test pattern for use by one or more test units in testing at least one of the plurality of memory banks;
b. testing at least two of a plurality of memory banks in parallel by at least two of the plurality of test units;
c. serially testing at least two of a plurality of memory banks by at least two of the plurality of test units;
d. writing a value to a memory entry, reading the memory entry, and comparing the read value to the written value; and
e. at least one error detected during testing of the memory array is corrected by an error correction unit.
Step 5310 may be followed by step 5320. Step 5320 may include outputting, via the interface and to a device external to the integrated circuit, information indicative of the test results.
This information indicating the test results may include an identifier of the failed memory array entry. By not sending read data for each memory entry, time may be saved.
Additionally or alternatively, the information indicative of the test results may indicate a status of the at least one memory bank.
Thus, in some embodiments, this information indicative of the test results may be much smaller than the total size of data units written to or read from the memory bank during testing, and may be much smaller than input data that may be sent from a tester testing the memory without the assistance of a test unit.
The integrated circuit under test may include a memory chip and/or a distributed processor as illustrated in any of the previous figures. For example, the integrated circuits described herein may be included in, may include, or may otherwise correspond to a memory chip as illustrated in any one of fig. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.
FIG. 71 illustrates an embodiment of a method 5350 for testing a memory bank of an integrated circuit. For example, method 5350 may be implemented using any of the memory banks described above with respect to fig. 65-69. Method 5350 may begin with step 5352, which may include receiving configuration information, e.g., through the interface of the integrated circuit.
The configuration information may include expected results of a test of the memory array, instructions, data, values of output data to be read from memory array entries accessed during the test of the memory array, test patterns, and the like.
Additionally or alternatively, the configuration information may include instructions, addresses of memory entries to write instructions, input data, and may also include addresses of memory entries to receive output values computed during execution of instructions.
The test pattern may include at least one of: (i) a memory array entry to be accessed during testing of the memory array, (ii) input data to be written to the memory array entry to be accessed during testing of the memory array, or (iii) an expected value of output data to be read from the memory array entry to be accessed during testing of the memory array.
Step 5352 may be followed by step 5355. Step 5355 may include executing the instructions, by the processing array, by accessing the memory array, performing computational operations, and providing results.
Step 5355 may be followed by step 5358. Step 5358 may include outputting, via the interface and to a device external to the integrated circuit, information indicative of the results.
Cyber security and tamper detection techniques
The memory chip and/or processor may be the target of a malicious actor and may be subject to various types of network attacks. In some cases, such attacks may attempt to change data and/or code stored in one or more memory resources. Cyber attacks can be particularly problematic with respect to trained neural networks or other types of Artificial Intelligence (AI) models that depend on large amounts of data stored in memory. If the stored data is manipulated or even obscured, such manipulation may be detrimental. For example, if the data relied upon by the data-intensive AI models is corrupted or obscured, an autonomous vehicle system that relies upon the models to identify other vehicles or pedestrians, etc., may incorrectly assess the environment of the host vehicle. As a result, an accident may occur. As AI models become more prevalent in a wide range of technologies, cyber attacks on data associated with such models can cause significant damage.
In other cases, a network attack may include one or more actors tampering with or attempting to tamper with operating parameters associated with a processor or other type of integrated circuit-based logic element. For example, processors are typically designed to operate within certain operating specifications. A network attack involving tampering may attempt to change one or more of the operating parameters of a processor, memory unit, or other circuit such that the processor, memory unit, or other circuit exceeds its design operating specifications (e.g., clock speed, bandwidth specifications, temperature limits, operating rates, etc.). This tampering can cause the target hardware to fail.
Conventional techniques for defending against cyber attacks may include computer programs (e.g., anti-virus software or anti-malware software) that operate at the processor level. Other techniques may include using a software-based firewall associated with a router or other hardware. While these techniques may use software programs that execute outside of the memory cells to combat network attacks, there remains a need for additional or alternative techniques for efficiently protecting data stored in memory cells, particularly where the accuracy and availability of the data is critical to the operation of memory intensive applications such as neural networks. Embodiments of the invention may provide various integrated circuit designs including memory that are resistant to cyber attacks on the memory.
Capturing sensitive information and commands into the integrated circuit in a secure manner (e.g., during a boot process, before interfaces external to the chip/integrated circuit have become functional), and then maintaining the sensitive information and commands within the integrated circuit without exposing them outside the integrated circuit, can increase the security of the sensitive information and commands. CPUs and other types of processing units are vulnerable to cyber attacks, especially when those CPUs/processing units operate with external memory. The disclosed embodiments, which include processor subunits distributed on a memory chip among a memory array comprising a plurality of memory banks, may be less susceptible to cyber attacks and tampering (e.g., because processing occurs within the memory chip). Any combination of the disclosed security measures, discussed in more detail below, may further reduce the susceptibility of the disclosed embodiments to cyber attacks and/or tampering.
FIG. 72A is a diagrammatic representation of an integrated circuit 7200 that includes a memory array and a processing array consistent with an embodiment of the invention. For example, integrated circuit 7200 can include any of the distributed processor architectures (and features) on memory chips described in the above sections and throughout this disclosure. The memory array and the processing array may be formed on a common substrate, and in some disclosed embodiments, the integrated circuit 7200 may constitute a memory chip. For example, as discussed above, integrated circuit 7200 can comprise a memory chip comprising a plurality of memory banks and a plurality of processor subunits spatially distributed over the memory chip, wherein each of the plurality of memory banks is associated with a dedicated one or more of the plurality of processor subunits. In some cases, each processor subunit may be dedicated to one or more memory banks.
In some embodiments, the memory array may include a plurality of discrete memory banks 7210_1, 7210_2, ..., 7210_Jn, as shown in fig. 72A. According to embodiments of the invention, the memory array 7210 can include one or more types of memory, including, for example, volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments of the invention, memory banks 7210_1 to 7210_Jn may include a plurality of MOS memory structures.
As mentioned above, the processing array may include a plurality of processor subunits 7220_1 to 7220_K. In some embodiments, each of the processor subunits 7220_1 to 7220_K may be associated with one or more discrete memory banks among the plurality of discrete memory banks 7210_1 to 7210_Jn. While the example embodiment of fig. 72A illustrates each processor subunit being associated with two discrete memory banks 7210, it should be appreciated that each processor subunit may be associated with any number of discrete dedicated memory banks. Conversely, each memory bank may be associated with any number of processor subunits. According to embodiments of the invention, the number of discrete memory banks included in the memory array of integrated circuit 7200 can be equal to, less than, or greater than the number of processor subunits included in the processing array of integrated circuit 7200.
Integrated circuit 7200 can further include a plurality of first buses 7260, consistent with embodiments of the present invention (and as described in the sections above). Each bus 7260 can connect a processor subunit 7220_k to a corresponding dedicated memory bank 7210_j. According to some embodiments of the invention, integrated circuit 7200 can further include a plurality of second buses 7261. Each bus 7261 may connect a processor subunit 7220_k to another processor subunit 7220_k+1. As shown in fig. 72A, multiple processor subunits 7220_1 to 7220_K may be connected to each other via buses 7261. Although fig. 72A illustrates the multiple processor subunits 7220_1 to 7220_K as being connected in series via buses 7261 to form a loop, it should be appreciated that the processor subunits 7220 may be connected in any other manner. For example, in some cases, a particular processor subunit may not be connected to other processor subunits via a bus 7261. In other cases, a particular processor subunit may be connected to only one other processor subunit, and in still other cases, a particular processor subunit may be connected to two or more other processor subunits via one or more buses 7261 (e.g., forming a series connection, a parallel connection, a branch connection, etc.). It should be noted that the embodiments of integrated circuit 7200 described herein are merely exemplary. In some cases, integrated circuit 7200 may have different internal components and connections, and in other cases, one or more of the internal components and described connections may be omitted (e.g., depending on the needs of a particular application).
Referring back to fig. 72A, integrated circuit 7200 can include one or more structures for implementing at least one security measure with respect to integrated circuit 7200. In some cases, such a structure may be configured to detect a cyber attack that manipulates or masks (or attempts to manipulate or mask) data stored in one or more of the memory banks. In other cases, such structures may be configured to detect tampering with an operating parameter associated with integrated circuit 7200 or tampering with one or more hardware elements (whether included within integrated circuit 7200 or external to integrated circuit 7200) that directly or indirectly affect one or more operations associated with integrated circuit 7200.
In some cases, a controller 7240 may be included in integrated circuit 7200. The controller 7240 may be connected to, for example, one or more of the processor subunits 7220_1 to 7220_K via one or more buses 7250. The controller 7240 can also be connected to one or more of the memory banks 7210_1 to 7210_Jn. Although the example embodiment of fig. 72A shows one controller 7240, it is to be understood that controller 7240 may include multiple processor elements and/or logic circuits. In the disclosed embodiments, the controller 7240 can be configured to implement at least one security measure with respect to at least one operation of the integrated circuit 7200. Additionally, in the disclosed embodiments, if the at least one security measure is triggered, the controller 7240 can be configured to take (or cause) one or more remedial actions.
According to some embodiments of the invention, the at least one security measure may comprise a controller-implemented process for locking access to certain aspects of the integrated circuit 7200. Access locking involves having the controller prevent access (reading and/or writing) to certain regions of the memory from outside the chip. Access control may be applied at address granularity, at the granularity of portions of a memory bank, and the like. In some cases, one or more physical locations in a memory associated with integrated circuit 7200 may be locked (e.g., one or more memory banks or any portion of one or more of the memory banks of integrated circuit 7200). In some embodiments, the controller 7240 may lock access to portions of the integrated circuit 7200 associated with execution of an artificial intelligence model (or other type of software-based system). For example, in some embodiments, the controller 7240 may lock access to weights of a neural network model stored in a memory associated with the integrated circuit 7200. It should be noted that a software program (i.e., a model) may include three components: the program's input data, the program's code, and the output data of executing the program. These components also apply to neural network models. During operation of such a model, input data may be generated and fed into the model, and executing the model may generate output data for reading. However, the program code and data values (e.g., predetermined model weights, etc.) associated with executing the model on the received input data may remain fixed.
As described herein, locking may refer to the controller disallowing, for example, read or write operations initiated from outside the chip/integrated circuit with respect to certain regions of the memory. The controller through which the I/O of the chip/integrated circuit passes may lock not only entire memory banks but any range of memory addresses within a memory bank, from a single memory address to an address range that includes all addresses of the available memory banks (or any address range therebetween).
Because memory locations associated with receiving input data and storing output data are associated with changing values and interactions with components external to integrated circuit 7200 (e.g., components supplying input data or receiving output data), locking access to those memory locations may be impractical in some cases. On the other hand, restricting access to memory locations associated with model program code and fixed data values may be effective against certain types of cyber attacks. Thus, in some embodiments, as a security measure, memory associated with program code and data values may be locked (e.g., memory not used for writing/receiving input data and for reading/providing output data). Restricting access may include locking certain memory locations so that changes cannot be made to certain program code and/or data values (e.g., those associated with executing a model based on received input data). Additionally, memory regions associated with intermediate data (e.g., data generated during execution of the model) may also be locked against external access. Thus, while various operational logic (whether on board integrated circuit 7200 or external to integrated circuit 7200) may provide data to or receive data from memory locations associated with receiving input data or retrieving generated output data, such operational logic will not be able to access or modify memory locations storing program code and data values associated with program execution based on the received input data.
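As a hypothetical illustration of this partitioning, the sketch below marks the program code, model weights, and intermediate data regions as locked while leaving the input and output regions externally accessible. All region boundaries and names are invented for the sketch, not taken from the disclosure.

```python
# Sketch: locked vs. externally accessible memory regions.

LOCKED_REGIONS = [
    (0x0000, 0x0FFF),   # model program code
    (0x1000, 0x7FFF),   # fixed model weights
    (0x8000, 0x9FFF),   # intermediate (per-run) data
]
OPEN_REGIONS = [
    (0xA000, 0xBFFF),   # input data written from outside the chip
    (0xC000, 0xDFFF),   # output data read from outside the chip
]

def externally_accessible(addr: int) -> bool:
    return any(lo <= addr <= hi for lo, hi in OPEN_REGIONS)

print(externally_accessible(0x1234))  # False: the weight region is locked
print(externally_accessible(0xA100))  # True: the input region stays open
```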
In addition to locking memory locations on the integrated circuit 7200 as a security measure, other security measures may be implemented by restricting access to certain computational logic elements (and the memory regions they access) that are configured to execute program code associated with a particular program or model. In some cases, this access constraint may be implemented with respect to computational logic (and its associated memory regions) located on integrated circuit 7200 (e.g., computational memory, i.e., memory including computational capabilities, such as a distributed processor on a memory chip as disclosed herein). Access to the computational logic (and associated memory locations) associated with any execution of code stored in a locked memory portion of the integrated circuit 7200, or associated with any access to data values stored in a locked memory portion of the integrated circuit 7200, may also be locked/restricted, regardless of whether the computational logic is located on board the integrated circuit 7200. Restricting access to the computational logic responsible for executing programs/models may further ensure that code and data values associated with operations on received input data remain protected from manipulation, shadowing, and the like.
The controller-implemented security measures may be implemented in any suitable manner, including locking or restricting access to hardware-based regions associated with certain portions of the memory array of integrated circuit 7200. In some embodiments, this locking may be implemented by supplying a command to controller 7240 configured to cause controller 7240 to lock certain memory portions. In some embodiments, the hardware-based memory portions to be locked may be designated by particular memory addresses (e.g., addresses associated with any memory elements of memory banks 7210_1 to 7210_Jn, etc.). In some embodiments, the locked region of memory may remain fixed during program or model execution. In other cases, the locked region may be configurable. That is, in some cases, commands may be supplied to the controller 7240 such that the locked region may change during execution of a program or model. For example, certain memory locations may be added to a locked region of memory at a particular time. Alternatively, certain memory locations (e.g., previously locked memory locations) may be excluded from the locked region of memory at a particular time.
The locking of certain memory locations may be accomplished in any suitable manner. In some cases, a record of the locked memory locations (e.g., a file, database, or data structure storing and identifying the locked memory addresses) may be accessible by the controller 7240 such that the controller 7240 can determine whether a given memory request is associated with a locked memory location. In some cases, the controller 7240 maintains a database of locked addresses used to control access to certain memory locations. In other cases, the controller may have a table or a set of one or more registers that are configurable until locked and that may include fixed predetermined values identifying the memory locations to be locked (e.g., memory accesses to those locations should be restricted from outside the chip). For example, when a memory access is requested, the controller 7240 may compare the memory address associated with the memory access request to the locked memory addresses. If the memory address associated with the memory access request is determined to be within the list of locked memory addresses, the memory access request (e.g., a read or write operation) may be denied.
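A minimal sketch of this comparison follows, assuming the lock record is a table of address ranges held by the controller; the class and method names are assumptions for illustration.

```python
# Sketch: controller check of an external memory request against locked ranges.

class LockingController:
    def __init__(self):
        self.locked_ranges = []            # register/table of locked address ranges

    def lock(self, lo: int, hi: int):
        self.locked_ranges.append((lo, hi))

    def allow_external_access(self, addr: int) -> bool:
        """Compare the requested address against the locked ranges; deny if inside one."""
        return not any(lo <= addr <= hi for lo, hi in self.locked_ranges)

ctrl = LockingController()
ctrl.lock(0x1000, 0x7FFF)                       # e.g., lock the region holding model weights
assert not ctrl.allow_external_access(0x2000)   # read/write request denied
assert ctrl.allow_external_access(0x9000)       # request outside locked ranges allowed
```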
As discussed above, the at least one security measure may include locking access to certain memory portions of memory array 7210 that are not used to receive input data or to provide access to generated output data. In some cases, the portion of memory within the lock region may be adjusted. For example, the locked memory portion may be unlocked and the unlocked memory portion may be locked. Any suitable method may be used to unlock the locked memory portion. For example, the security measures implemented may include the need for complex passwords to unlock one or more portions of the locked memory region.
An implemented security measure may be triggered upon detection of any action that runs against it. For example, attempting to access a locked memory portion (whether via a read or write request) may trigger a security measure. Additionally, if an entered complex password (e.g., in an attempt to unlock a locked memory portion) does not match the predetermined complex password, a security measure may be triggered. In some cases, a security measure may be triggered if the correct complex password is not provided within an allowable threshold number of password entry attempts (e.g., 1, 2, 3, etc.).
The memory portion may be locked at any suitable time. For example, in some cases, memory portions may be locked at various times during program execution. In other cases, the memory portion may be locked after startup or prior to program/model execution. For example, the memory address to be locked may be determined and identified in connection with programming of the program/model program code or after data to be accessed by the program/model is generated and stored. As such, vulnerabilities to attacks on the memory array 7210 may be reduced or eliminated during times when or after program/model execution begins, after data to be used by the program/model has been generated and stored, and so on.
Unlocking of the locked memory may be accomplished by any suitable method or at any suitable time. As described above, the locked memory portion may be unlocked upon receipt of a correct complex password, or the like. In other cases, the locked memory may be unlocked by rebooting (by command or by powering down and up) or deleting the entire memory array 7210. Additionally or alternatively, a release command sequence may be implemented to unlock one or more memory portions.
According to embodiments of the invention, and as described above, the controller 7240 can be configured to control traffic to and from the integrated circuit 7200, particularly from sources external to the integrated circuit 7200. For example, as shown in fig. 72A, traffic between components external to integrated circuit 7200 and components internal to integrated circuit 7200 (e.g., memory array 7210 or processor subunits 7220) may be controlled by controller 7240. This traffic may pass through controller 7240 or through one or more buses (e.g., 7250, 7260, or 7261) controlled or monitored by controller 7240.
According to some embodiments of the invention, the integrated circuit 7200 can receive non-changeable data (e.g., fixed data; such as model weights, coefficients, etc.) and certain commands (e.g., code; such as identifying a portion of memory to lock) during a boot process. As used herein, immutable data may refer to data that remains fixed during execution of a program or model and may remain unchanged until a subsequent boot process. During program execution, integrated circuit 7200 can interact with alterable data, which can include input data to be processed and/or output data generated by processes associated with integrated circuit 7200. As discussed above, access to the memory array 7210 or the processing array 7220 may be restricted during program or model execution. For example, access may be limited to certain portions of memory array 7210 or to certain processor subunits associated with: processing with or interaction with incoming input data to be written, or processing with or interaction with generated output data to be read. During program or model execution, the portion of memory containing the immutable data may be locked and thereby made inaccessible. In some embodiments, the immutable data and/or commands associated with the portion of memory to be locked can be included in any suitable data structure. For example, such data and/or commands may be made available to the controller 7240 via one or more configuration files that may be accessed during or after a boot sequence.
Referring back to fig. 72A, integrated circuit 7200 can further include a communication port 7230. As shown in fig. 72A, controller 7240 may be coupled between communication port 7230 and bus 7250, which is shared among processor subunits 7220_1 to 7220_K. In some embodiments, the communication port 7230 can be coupled indirectly or directly to a host computer 7270 associated with host memory 7280, which can include, for example, non-volatile memory. In some embodiments, host computer 7270 may retrieve changeable data 7281 (e.g., input data to be used during execution of a program or model), non-changeable data 7282, and/or commands 7283 from its associated host memory 7280. Changeable data 7281, non-changeable data 7282, and commands 7283 may be uploaded from host computer 7270 to controller 7240 via communication port 7230 during a boot process.
FIG. 72B is a diagrammatic representation of a memory region within an integrated circuit consistent with embodiments of the invention. As shown, fig. 72B depicts an example of a data structure included in host memory 7280.
Reference is now made to fig. 73A, which is another example of an integrated circuit consistent with an embodiment of the invention. As shown in fig. 73A, controller 7240 may include a network attack detector 7241 and a response module 7242. In some embodiments of the invention, controller 7240 may be configured to store or access access control rules 7243. According to some embodiments of the invention, access control rules 7243 may be included in a configuration file accessible to controller 7240. In some embodiments, access control rules 7243 may be uploaded to controller 7240 during a boot process. Access control rules 7243 may include information specifying access rules associated with any of: changeable data 7281, non-changeable data 7282, and commands 7283, and their corresponding memory locations. As explained above, access control rules 7243 or configuration files may include information that identifies certain memory addresses within memory array 7210. In some embodiments, the controller 7240 can be configured to provide a locking mechanism and/or function that locks various addresses of the memory array 7210, such as addresses used to store commands or non-changeable data.
In addition to locking out portions of memory, other techniques for defending against network attacks can be implemented to provide the described security measures associated with integrated circuit 7200. For example, in some embodiments, the controller 7240 can be configured to replicate programs or models across different memory locations and processor subunits associated with the integrated circuit 7200. In this way, the program/model and its replica may be executed independently, and the results of the independent executions may be compared. For example, a program/model may be replicated in two memory banks 7210 and executed by different processor subunits 7220 in integrated circuit 7200. In other embodiments, the program/model may be replicated in two different integrated circuits 7200. In either case, the results of the program/model executions may be compared to determine whether there are any differences between the replicated executions. Detected differences in execution results (e.g., intermediate execution results, final execution results, etc.) may suggest a network attack that altered one or more aspects of the program/model or its associated data. In some embodiments, different memory banks 7210 and processor subunits 7220 may be assigned to execute the two replicated models on the same input data. In some embodiments, intermediate results may be compared during execution of the two replicated models on the same input data, and if there is a mismatch between the two intermediate results at the same stage, execution may be temporarily suspended as a potential remedial action. The integrated circuit may also compare results where both replicas are executed by processor subunits of the same integrated circuit. This may be done without notifying any entity external to the integrated circuit about the execution of the two replicas; in other words, entities external to the chip are unaware that the replicas are running in parallel on the integrated circuit.
FIG. 73B is a diagrammatic representation of an arrangement for concurrently executing replicated models in accordance with an embodiment of the present invention.
Although a single program/model copy is described as one example for detecting possible network attacks, any number of copies (e.g., 1, 2, 3, or more than 3) may be used. As the number of copies and independent program/model executions increases, the confidence in detecting a network attack may also increase. A larger number of replicas may also reduce the potential success rate of a network attack, as it may be more difficult for an attacker to influence multiple program/model replicas. The number of program or model replicas may be determined at runtime to further increase the difficulty for a network attacker to successfully affect program or model execution.
In some embodiments, the replicated models may differ from one another in one or more aspects. In this example, the code associated with the two programs/models may differ, but the programs/models may be designed such that both return the same output results. At least in this sense, the two programs/models can be considered replicas of each other. For example, two neural network models may have different neuron orderings in one layer relative to each other. However, despite this change in model program code, both neural network models can return the same output results. Replicating programs/models in this manner can make it more difficult for a network attacker to identify the valid replicas of the program or model to be compromised. As a result, model/program replication can not only provide redundancy to minimize the impact of a network attack, but can also enhance attack detection (e.g., by exposing tampering or unauthorized access in which an attacker changes one program/model or its data but fails to make corresponding changes to the replica).
In many cases, replica programs/models (including, among others, replicas exhibiting code differences) can be designed such that their outputs do not match exactly but instead constitute soft values (e.g., approximately the same output values) rather than exact fixed values. In such embodiments, output results from two or more valid replicas may be compared (e.g., using a dedicated module or by a host processor) to determine whether the difference between their output results (whether intermediate or final) is within a predetermined range. Differences in output soft values that do not exceed a predetermined threshold or range may be considered evidence of no tampering, unauthorized access, etc. On the other hand, if the difference in output soft values exceeds a predetermined threshold or range, such a difference may be considered evidence that a network attack in the form of tampering, unauthorized memory access, or the like has occurred. In these situations, the replicated program/model security measure will be triggered, and one or more remedial actions may be taken (e.g., stopping execution of the program or model, shutting down one or more operations of the integrated circuit 7200, or operating in a secure mode with limited functionality, among many other actions).
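The soft-value comparison can be sketched as follows; the two replica functions, the tolerance, and the remedial action shown are illustrative assumptions, not the disclosure's method.

```python
# Sketch: compare the outputs of two replica models within a tolerance.

TOLERANCE = 1e-3

def replica_a(x: float) -> float:
    return 2.0 * x + 1.0

def replica_b(x: float) -> float:
    # functionally equivalent code written differently (e.g., reordered arithmetic)
    return x + x + 1.0

def outputs_agree(x: float) -> bool:
    """Return True if the replica outputs agree within tolerance."""
    return abs(replica_a(x) - replica_b(x)) <= TOLERANCE

if not outputs_agree(3.5):
    # remedial action: e.g., stop execution or enter a limited secure mode
    print("possible cyber attack: replica outputs diverge")
```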
Security measures associated with the integrated circuit 7200 may also involve quantitative analysis of data associated with an executing or executed program or model. For example, in some embodiments, the controller 7240 can be configured to calculate one or more checksum/hash/cyclic redundancy check (CRC)/parity values for data stored in at least a portion of the memory array 7210. The calculated values may be compared to one or more predetermined values. If there is a deviation between the compared values, this deviation may be interpreted as evidence of tampering with the data stored in at least a portion of the memory array 7210. In some embodiments, checksum/hash/CRC/parity values may be calculated for all memory locations associated with memory array 7210 to identify changes in the data. In this example, the entire memory (or memory bank) in question may be read by, for example, host computer 7270 or a processor associated with integrated circuit 7200 for use in calculating the checksum/hash/CRC/parity values. In other cases, checksum/hash/CRC/parity values may be calculated for a predetermined subset of memory locations associated with memory array 7210 to identify changes to the data associated with that subset of memory locations. In some embodiments, the controller 7240 may be configured to calculate checksum/hash/CRC/parity values associated with predetermined data paths (e.g., associated with memory access patterns), and the calculated values may be compared to each other or to predetermined values to determine whether tampering or another form of network attack has occurred.
By protecting one or more predetermined values (e.g., expected checksum/hash/CRC/parity values, expected difference values for intermediate or final output results, expected difference ranges associated with certain values, etc.) within integrated circuit 7200 or in locations accessible to integrated circuit 7200, integrated circuit 7200 may be made even more secure against cyber attacks. For example, in some embodiments, one or more predetermined values may be stored in registers of the memory array 7210 and may be used during or after each run of the model to evaluate intermediate or final output results, checksums, etc. (e.g., by the controller 7240 of the integrated circuit 7200). In some cases, the register value may be updated using a "save last result data" command to calculate a predetermined value in operation, and the calculated value may be saved in a register or another memory location. In this manner, the valid output value may be used to update the predetermined value for comparison after each program or model execution or partial execution. This technique may increase the difficulty that a network attacker may experience when attempting to modify or otherwise tamper with one or more predetermined reference values designed to expose the network attacker's activities.
In operation, a CRC calculator may be used to track memory accesses. For example, such calculation circuitry may be disposed at the memory bank level, in a processor subunit, or at the controller, where each calculation circuit may be configured to accumulate each memory access into its CRC value.
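The following sketch shows per-access CRC accumulation, using Python's zlib.crc32 as a stand-in for the hardware CRC calculator; the byte encoding of an access is an assumption made for the sketch.

```python
# Sketch: accumulate every memory access into a running CRC.

import zlib

class CRCTrackedMemory:
    def __init__(self, size: int):
        self.data = bytearray(size)
        self.crc = 0

    def _accumulate(self, addr: int, value: int):
        # fold the (address, value) pair of this access into the running CRC
        self.crc = zlib.crc32(bytes([addr & 0xFF, value & 0xFF]), self.crc)

    def write(self, addr: int, value: int):
        self.data[addr] = value
        self._accumulate(addr, value)

    def read(self, addr: int) -> int:
        value = self.data[addr]
        self._accumulate(addr, value)
        return value

mem = CRCTrackedMemory(16)
mem.write(0, 0x12)
mem.read(0)
expected_crc = mem.crc            # reference value kept in a protected register
print(hex(expected_crc))          # deviation on a later run would suggest tampering
```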
Referring now to FIG. 74A, a diagrammatic representation of another embodiment of an integrated circuit 7200 is provided. In the example embodiment represented by fig. 74A, the controller 7240 can include a tamper detector 7245 and a response module 7246. Similar to other disclosed embodiments, tamper detector 7245 may be configured to detect evidence of a potential tamper attempt. According to some embodiments of the invention, the security measures associated with the integrated circuit 7200 and implemented by the controller 7240 may include, for example, comparing actual program/model operational patterns to predetermined/allowed operational patterns. If, in one or more aspects, the actual program/model operational pattern differs from the predetermined/allowed operational pattern, a security measure may be triggered. And if a security measure is triggered, the response module 7246 of the controller 7240 can be configured to implement one or more remedial measures in response.
Fig. 74C is a diagrammatic representation of detection elements that may be located at various points within a chip in accordance with an exemplary disclosed embodiment. As described above, detection of network attacks and tampering may be performed using detection elements located at various points within the chip, as shown, for example, in fig. 74C. For example, a certain code may be associated with an expected number of processing events within a certain time period. The detector shown in fig. 74C may count the number of events the system experiences (monitored by the event counter) during a certain period of time (monitored by the time counter). Tampering may be suspected if the number of events exceeds some predetermined threshold (e.g., the number of expected events during a predefined time period). Such detectors may be included at multiple points in the system to monitor various types of events, as shown in fig. 74C.
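A sketch of such a detector follows, pairing an event counter with a time counter; the threshold and window values are assumptions, and time.monotonic stands in for the hardware time counter.

```python
# Sketch: flag tampering when the event rate exceeds an expected threshold.

import time

class TamperDetector:
    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window = window_seconds
        self.count = 0
        self.window_start = time.monotonic()

    def record_event(self) -> bool:
        """Count one event; return True if the allowed rate is exceeded."""
        now = time.monotonic()
        if now - self.window_start > self.window:
            self.count = 0                 # time counter expired: start a new window
            self.window_start = now
        self.count += 1
        return self.count > self.max_events

detector = TamperDetector(max_events=1000, window_seconds=0.01)
for _ in range(1500):
    if detector.record_event():            # likely triggers on a fast burst of events
        print("tampering suspected: event rate exceeds the expected threshold")
        break
```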
More specifically, in some embodiments, the controller 7240 can be configured to store or access an expected program/model operation pattern 7244. For example, in some cases, the operational pattern may be represented as a curve 7247 that indicates allowed load-versus-time patterns and prohibited or illegal load-versus-time patterns. Tampering attempts may cause memory array 7210 or processing array 7220 to operate outside of certain operating specifications. This may cause memory array 7210 or processing array 7220 to generate heat or to fail, and may enable changes to data or code related to memory array 7210 or processing array 7220. Such changes may cause the operational pattern to exceed the allowed operational pattern indicated by curve 7247.
According to some embodiments of the invention, controller 7240 can be configured to monitor an operating pattern associated with memory array 7210 or processing array 7220. The operational pattern may be associated with the number of access requests, the type of access requests, the timing of the access requests, and the like. The controller 7240 can be further configured to detect a tampering attack if the operational pattern is different from an allowable operational pattern.
It should be noted that the disclosed embodiments may be used to defend not only against network attacks, but also against non-malicious errors in operation. For example, the disclosed embodiments may also effectively protect systems such as integrated circuit 7200 from errors caused by environmental factors such as temperature or voltage changes or levels, particularly if such levels exceed operating specifications for integrated circuit 7200.
In response to detecting a suspected network attack (e.g., as a response to a triggered security measure), any suitable remedial action may be implemented. For example, the remedial action may include stopping one or more operations associated with program/model execution, operating one or more components associated with integrated circuit 7200 in a secure mode, locking one or more components of integrated circuit 7200 to additional inputs or accesses, and so forth.
Fig. 74B provides a flowchart representation of a method 7450 of protecting an integrated circuit against tampering according to an exemplary disclosed embodiment. For example, step 7452 may include implementing at least one security measure with respect to operation of the integrated circuit using a controller associated with the integrated circuit. At step 7454, if the at least one security measure is triggered, one or more remedial actions may be taken. The integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks.
In some embodiments, the disclosed security measures may be implemented in multiple memory chips, and at least one or more of the disclosed security mechanisms may be implemented for each memory chip/integrated circuit. In some cases, each memory chip/integrated circuit may implement the same security measures, but in some cases, different memory chips/integrated circuits may implement different security measures (e.g., when different security measures may be more appropriate for a certain type of operation associated with a particular integrated circuit). In some embodiments, more than one security measure may be implemented by a particular controller of the integrated circuit. For example, a particular integrated circuit may implement any number or type of the disclosed security measures. Additionally, a particular integrated circuit controller may be configured to implement a plurality of different remedial actions in response to the triggered security action.
It should also be noted that two or more of the above-described security mechanisms may be combined to improve security against network attacks or tampering attacks. Additionally, security measures may be implemented across different integrated circuits, and these integrated circuits may coordinate security measure implementations. For example, model copying may be performed within one memory chip or may be performed across different memory chips. In this example, results from one memory chip or results from two or more memory chips may be compared to detect a potential network attack or tampering attack. In some embodiments, the copy security measures applied across multiple integrated circuits may include one or more of: the disclosed access locking mechanism, hash protection mechanism, model replication, program/model execution pattern analysis, or any combination of these or other disclosed embodiments.
Multi-port processor subunit in DRAM
As described above, the disclosed embodiments of the invention may include a distributed processor memory chip that includes an array of processor subunits and an array of memory banks, where each of the processor subunits may be dedicated to at least one of the array of memory banks. As discussed in the following sections, distributed processor memory chips may serve as the basis for a scalable system. That is, in some cases, a distributed processor memory chip may include one or more communication ports configured to communicate data from one distributed processor memory chip to another distributed processor memory chip. In this way, any desired number of distributed processor memory chips may be linked together (e.g., in series, in parallel, in a loop, or any combination thereof) to form a scalable array of distributed processor memory chips. Such an array may provide a flexible solution for efficiently performing memory-intensive operations and for expanding the computational resources associated with the performance of memory-intensive operations. Because distributed processor memory chips may include clocks having different timing patterns, the disclosed embodiments of the present invention include features to accurately control data transfers between distributed processor memory chips even in the presence of clock timing differences. Such embodiments may enable efficient data sharing among different distributed processor memory chips.
FIG. 75A is a diagrammatic representation of a scalable processor memory system that includes multiple distributed processor memory chips consistent with an embodiment of the invention. According to embodiments of the present invention, the scalable processor memory system may include a plurality of distributed processor memory chips, such as a first distributed processor memory chip 7500, a second distributed processor memory chip 7500', and a third distributed processor memory chip 7500″. Each of the first distributed processor memory chip 7500, the second distributed processor memory chip 7500', and the third distributed processor memory chip 7500″ may include any of the configurations and/or features associated with any of the distributed processor embodiments described in the present disclosure.
In some embodiments, each of the first, second, and third distributed processor memory chips 7500, 7500', 7500″ may be implemented similarly to the integrated circuit 7200 shown in fig. 72A. As shown in fig. 75A, a first distributed processor memory chip 7500 can include a memory array 7510, a processing array 7520, and a controller 7540. Memory array 7510, processing array 7520, and controller 7540 may be configured similarly to memory array 7210, processing array 7220, and controller 7240 in FIG. 72A.
According to an embodiment of the invention, the first distributed processor memory chip 7500 can include a first communication port 7530. In some embodiments, the first communication port 7530 can be configured to communicate with one or more external entities. For example, the communication port 7530 can be configured to establish a communication connection between the distributed processor memory chip 7500 and an external entity other than another distributed processor memory chip (such as distributed processor memory chips 7500' and 7500 "). For example, the communication port 7530 can be indirectly or directly coupled to a host computer (e.g., as illustrated in fig. 72A) or any other computing device, communication module, or the like.
According to embodiments of the invention, the first distributed processor memory chip 7500 may further include one or more additional communication ports configured to communicate with other distributed processor memory chips, e.g., 7500' or 7500″. In some embodiments, the one or more additional communication ports can include a second communication port 7531 and a third communication port 7532, as shown in fig. 75A. The second communication port 7531 can be configured to communicate with the second distributed processor memory chip 7500' and establish a communication connection between the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500'. Similarly, the third communication port 7532 can be configured to communicate with the third distributed processor memory chip 7500″ and establish a communication connection between the first distributed processor memory chip 7500 and the third distributed processor memory chip 7500″. In some embodiments, the first distributed processor memory chip 7500 (and any of the memory chips disclosed herein) can include a plurality of communication ports, including any suitable number of communication ports (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 1000, etc.).
In some embodiments, the first communication port, the second communication port, and the third communication port are associated with corresponding buses. The corresponding bus may be a bus common to each of the first communication port, the second communication port, and the third communication port. In some embodiments, the corresponding bus associated with each of the first, second, and third communication ports is connected to a plurality of discrete memory banks. In some embodiments, the first communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip. In some embodiments, the second communication port is connected to at least one of a bus internal to the memory chip or at least one processor subunit included in the memory chip.
Although the configuration of the disclosed distributed processor memory chip is explained with respect to the first distributed processor memory chip 7500, it should be noted that the second processor memory chip 7500' and the third processor memory chip 7500 ″ may be configured similarly to the first distributed processor memory chip 7500. For example, the second distributed processor memory chip 7500' may also include a memory array 7510', a processing array 7520', a controller 7540', and/or a plurality of communication ports, such as ports 7530', 7531', and 7532 '. Similarly, a third distributed processor memory chip 7500 "can include memory array 7510", processing array 7520 ", controller 7540", and/or a plurality of communication ports, such as ports 7530 ", 7531", and 7532 ". In some embodiments, the second communication port 7531' and the third communication port 7532' of the second distributed processor memory chip 7500' can be configured to communicate with the third distributed processor memory chip 7500 "and the first distributed processor memory chip 7500, respectively. Similarly, the second communication port 7531 "and the third communication port 7532" of the third distributed processor memory chip 7500 "can be configured to communicate with the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500', respectively. This similarity in configuration among distributed processor memory chips may facilitate scaling of computing systems based on the disclosed distributed processor memory chips. In addition, the disclosed arrangement and configuration of the communication ports associated with each distributed processor memory chip may enable flexible arrangement of the array of distributed processor memory chips (e.g., including series connections, parallel connections, ring connections, star connections, network connections, or the like).
According to embodiments of the present invention, distributed processor memory chips, such as the first through third distributed processor memory chips 7500, 7500', and 7500 ", may communicate with each other via a bus 7533. In some embodiments, bus 7533 may connect two communication ports of two different distributed processor memory chips. For example, the second communication port 7531 of the first processor memory chip 7500 can be connected to the third communication port 7532 'of the second processor memory chip 7500' via the bus 7533. According to embodiments of the present invention, distributed processor memory chips, such as first through third distributed processor memory chips 7500, 7500', and 7500 ", may also communicate with external entities (e.g., host computers) via a bus, such as bus 7534. For example, a first communication port 7530 of a first distributed processor memory chip 7500 can be connected to one or more external entities via a bus 7534. The distributed processor memory chips may be connected to each other in various ways. In some cases, the distributed processor memory chips may exhibit series connectivity, with each distributed processor memory chip connected to a pair of adjacent distributed processor memory chips. In other cases, distributed processor memory chips may exhibit a higher degree of connectivity, with at least one distributed processor memory chip connected to two or more other distributed processor memory chips. In some cases, all distributed processor memory chips within the plurality of memory chips may be connected to all other distributed processor memory chips in the plurality of memory chips.
As shown in fig. 75A, bus 7533 (or any other bus associated with the embodiment of fig. 75A) may be unidirectional. Although fig. 75A illustrates bus 7533 as unidirectional, with a particular data transfer flow suggested by the arrows shown in fig. 75A, bus 7533 (or any other bus in fig. 75A) may instead be implemented as a bidirectional bus. According to some embodiments of the present invention, a bus connected between two distributed processor memory chips may be configured to have a higher communication speed than that of a bus connected between a distributed processor memory chip and an external entity. In some embodiments, communication between the distributed processor memory chips and external entities may occur only during limited periods, such as while preparing for execution (loading program code, input data, weight data, etc. from a host computer) or while outputting results produced by execution of a neural network model to the host computer. During execution of one or more programs associated with the distributed processors of chips 7500, 7500', and 7500″ (e.g., during memory intensive operations associated with artificial intelligence applications, etc.), communication between the distributed processor memory chips may occur via buses 7533, 7533', etc. In some embodiments, communications between a distributed processor memory chip and an external entity may occur less frequently than communications between two processor memory chips. Depending on the communication requirements and the embodiment, the bus between a distributed processor memory chip and an external entity may be configured to have a communication speed equal to, greater than, or less than that of the bus between distributed processor memory chips.
In some embodiments, as represented by fig. 75A, a plurality of distributed processor memory chips, such as first through third distributed processor memory chips 7500, 7500', and 7500 ", may be configured to communicate with each other. As mentioned, this capability may facilitate assembly of a scalable distributed processor memory chip system. For example, memory arrays 7510, 7510', and 7510 "and processing arrays 7520, 7520', and 7520" from the first through third processor memory chips 7500, 7500', and 7500 "may be considered to actually belong to a single distributed processor memory chip when linked by a communication channel (such as the bus shown in fig. 75A).
According to embodiments of the invention, communication between multiple distributed processor memory chips and/or communication between a distributed processor memory chip and one or more external entities may be managed in any suitable manner. In some embodiments, these communications may be managed by processing resources such as processing array 7520 in distributed processor memory chip 7500. In some other embodiments, a controller, such as controllers 7540, 7540', 7540 ", etc., of the distributed processor memory chips may be configured to manage communications between the distributed processor memory chips and/or communications between the distributed processor memory chips and one or more external entities, for example, to alleviate the computational load imposed by communication management on the processing resources provided by the array of distributed processors. For example, each controller 7540, 7540', and 7540 "of the first through third processor memory chips 7500, 7500', and 7500" can be configured to manage communications related to its corresponding distributed processor memory chip relative to other distributed processor memory chips. In some embodiments, the controllers 7540, 7540', and 7540 "can be configured to control these communications via corresponding communications ports, such as ports 7531, 7531', 7531", 7532', and 7532 ", and so on.
The controllers 7540, 7540', and 7540 ″ may also be configured to manage communication between the distributed processor memory chips while taking into account timing differences that may exist between the distributed processor memory chips. For example, a distributed processor memory chip (e.g., 7500) may be fed by an internal clock, which may be different relative to the clocks of other distributed processor memory chips (e.g., 7500' and 7500 "). Thus, in some embodiments, the controller 7540 can be configured to implement one or more policies for accounting for different clock timing patterns among the distributed processor memory chips and manage communication among the distributed processor memory chips by accounting for possible time skew among the distributed processor memory chips.
For example, in some embodiments, the controller 7540 of the first distributed processor memory chip 7500 can be configured to enable data to be transferred from the first distributed processor memory chip 7500 to the second processor memory chip 7500' under certain conditions. In some cases, controller 7540 may refrain from data transfer if one or more processor subunits of first distributed processor memory chip 7500 are not ready to transfer data. Alternatively or additionally, controller 7540 may suppress data transfer if the receive processor subunit of second distributed processor memory chip 7500' is not ready to receive data. In some cases, controller 7540 may initiate transfer of data from the sending processor subunit (e.g., in chip 7500) to the receiving processor subunit (e.g., in chip 7500') after determining that the sending processor subunit is ready to send data and that the receiving processor subunit is ready to receive data. In other embodiments, controller 7540 may initiate data transfer based only on whether the sending processor subunit is ready to send data, particularly if the data may be buffered in controller 7540 or 7540', for example, until the receiving processor subunit is ready to receive the transferred data.
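As a minimal behavioral sketch of this readiness-based control (the predicate names below are assumptions for illustration, not the disclosed interface):

```python
def may_initiate_transfer(sender_ready: bool,
                          receiver_ready: bool,
                          can_buffer: bool = False) -> bool:
    # A transfer may begin once the sending processor subunit is ready and
    # either the receiving subunit is ready or the controller can buffer
    # the data until the receiver becomes ready.
    return sender_ready and (receiver_ready or can_buffer)
```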
According to an embodiment of the invention, the controller 7540 may be configured to determine whether one or more other timing constraints are satisfied in order to enable data transfer. Such temporal constraints may be related to: a time difference from a transmit time of a sending processor subunit to a receive time in a receiving processor subunit, an access request from an external entity (e.g., a host computer) for processed data, a refresh operation performed on a memory resource (e.g., a memory array) associated with the sending or receiving processor subunit, and others.
FIG. 75E is an example timing diagram, consistent with an embodiment of the present invention, illustrating the clock-enable-based control of inter-chip data transfers described below.
In some embodiments, the controller 7540 and other controllers associated with the distributed processor memory chips may be configured to manage data transfers between the chips using a clock enable signal. For example, processing array 7520 may be fed by a clock. In some embodiments, a clock enable signal (e.g., shown as "to CE" in fig. 75A) may be used, for example, by controller 7540 to control whether one or more processor subunits respond to a supplied clock signal. Each processor subunit, e.g., 7520_1 to 7520_K, may execute program code, and the program code may include communication commands. According to some embodiments of the invention, the controller 7540 may control the timing of communication commands by controlling the clock enable signals to the processor subunits 7520_1 through 7520_K. For example, according to some embodiments, when a sending processor subunit (e.g., in first processor memory chip 7500) is programmed to transmit data at a certain clock cycle (e.g., cycle 1000) and a receiving processor subunit (e.g., in second processor memory chip 7500') is programmed to receive data at a certain clock cycle (e.g., cycle 1000), controller 7540 of first processor memory chip 7500 and controller 7540' of second processor memory chip 7500' may disallow the data transfer until both the sending processor subunit and the receiving processor subunit are ready to perform the data transfer. For example, controller 7540 may "suppress" data transfers from the sending processor subunit in chip 7500 by supplying a certain clock enable signal (e.g., a logic low) to the sending processor subunit, which may prevent the sending processor subunit from transmitting data in response to the received clock signal. Such a clock enable signal may "freeze" the entire distributed processor memory chip or any portion of the distributed processor memory chip. Conversely, the controller 7540 may cause the sending processor subunit to initiate the data transfer by supplying the opposite clock enable signal (e.g., a logic high) to the sending processor subunit, which causes the sending processor subunit to respond to the received clock signal. A similar operation may be controlled using a clock enable signal issued by controller 7540', e.g., received or not received by a receiving processor subunit in chip 7500'.
In some embodiments, the clock enable signal may be sent to all processor subunits (e.g., 7520_1 to 7520_ K) in the processor memory chip (e.g., 7500). In general, the clock enable signal may have the effect of causing the processor subunits to respond to their respective clock signals or to ignore those clock signals. For example, in some cases, when the clock enable signal is high (depending on the conventions of a particular application), the processor subunit may respond to its clock signal and may execute one or more instructions according to its clock signal timing. On the other hand, when the clock enable signal is low, the processor subunit is prevented from responding to its clock signal so that it does not execute instructions in response to clock timing. In other words, the processor subunit may ignore the received clock signal when the clock enable signal is low.
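The clock enable semantics described above may be summarized by the following behavioral sketch (a hypothetical Python model assuming an active-high convention; it is not a description of the actual circuit):

```python
class ProcessorSubunitModel:
    """Models a processor subunit gated by a clock enable (CE) signal."""

    def __init__(self, program):
        self.program = program      # list of callables, one per instruction
        self.pc = 0                 # program counter
        self.clock_enable = False   # CE low by default: subunit is "frozen"

    def on_clock_edge(self):
        if not self.clock_enable:
            return                  # CE low: the clock edge is ignored
        if self.pc < len(self.program):
            self.program[self.pc]()  # one instruction per enabled cycle
            self.pc += 1
```

A controller model could then suppress or advance a pending communication command simply by deasserting or asserting `clock_enable` on the relevant subunit.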
Returning to the example of fig. 75A, any of the controllers 7540, 7540', or 7540″ may be configured to use the clock enable signal to control the operation of the respective distributed processor memory chip by having one or more processor subunits in the respective array respond or not respond to the received clock signal. In some embodiments, the controller 7540, 7540', or 7540″ may be configured to selectively advance program code execution, such as when such code is associated with or includes data transfer operations and their timing. In some embodiments, the controller 7540, 7540', or 7540″ can be configured to use the clock enable signal to control the timing of data transmissions between two different distributed processor memory chips via any of the communication ports 7531, 7531', 7531″, 7532, 7532', 7532″, and so on. In some embodiments, the controller 7540, 7540', or 7540″ can be configured to use the clock enable signal to control the timing of data reception between two different distributed processor memory chips via any of those same communication ports.
In some embodiments, the timing of data transfer between two different distributed processor memory chips may be configured based on a compilation optimization step. Compilation may allow construction of a program in which tasks are efficiently assigned to processing subunits such that they are not stalled by transmission delays on the bus connected between two different processor memory chips. The compilation may be performed by a compiler running on the host computer, or its output may be transmitted to the host computer. Typically, transfer delays on the bus between two different processor memory chips would result in a data bottleneck for the processing subunits requiring the data. The disclosed compilation may schedule data transmissions in a manner that enables a processing subunit to continuously receive data even in the presence of adverse transmission delays on the bus.
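One way to picture such a compile-time scheduling step (a simplified sketch under the assumption that the bus latency is a known constant; all names are illustrative only):

```python
def schedule_transfers(consume_times: dict, bus_latency: int) -> dict:
    """Schedule each inter-chip transfer early enough to hide bus delay.

    consume_times: {data_id: clock cycle at which a processing subunit
    consumes the data}. Returns {data_id: cycle at which the transfer
    should start} so that the data arrives exactly when needed.
    """
    return {data_id: max(cycle - bus_latency, 0)
            for data_id, cycle in consume_times.items()}
```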
Although the embodiment of fig. 75A includes three ports for each distributed processor memory chip (7500, 7500', 7500″), any number of ports may be included in a distributed processor memory chip in accordance with the disclosed embodiments. For example, in some cases, a distributed processor memory chip may include more or fewer ports. In the embodiment of fig. 75B, each distributed processor memory chip (e.g., 7500A-7500I) may be configured with multiple ports. These ports may be substantially the same as one another or may be different. In the example shown, each distributed processor memory chip includes five ports, including a host communication port 7570 and four chip ports 7572. The host communication port 7570 can be configured to communicate (via bus 7534) between any of the distributed processors in the array (as shown in fig. 75B) and a host computer, e.g., located remotely with respect to the array of distributed processor memory chips. The chip ports 7572 can be configured to enable communication between distributed processor memory chips via a bus 7535.
Any number of distributed processor memory chips may be connected to each other. In the example shown in fig. 75B including four chip ports per distributed processor, an array may be implemented in which each distributed processor memory chip is connected to two or more other distributed processor memory chips, and in some cases, some chips may be connected to four other distributed processor memory chips. Including more chip ports in the distributed processor memory chips may enable more interconnectivity between the distributed processor memory chips.
Additionally, while the distributed processor memory chips 7500A-7500I are shown in fig. 75B as having two different types of communication ports 7570 and 7572, in some cases a single type of communication port may be included in each distributed processor memory chip. In other cases, more than two different types of communication ports may be included in one or more of the distributed processor memory chips. In the example of fig. 75C, each of the distributed processor memory chips 7500A 'to 7500C' includes two (or more than two) communication ports 7570 of the same type. In this embodiment, the communication port 7570 can be configured to enable communication with external entities such as host computers via the bus 7534, and can also be configured to enable communication between distributed processor memory chips (e.g., between distributed processor memory chips 7500B 'and 7500C') via the bus 7535.
In some embodiments, ports disposed on one or more distributed processor memory chips may be used to provide access to more than one host. For example, in the embodiment shown in fig. 75D, the distributed processor memory chip includes two or more ports 7570. Port 7570 may constitute a host port, a chip port, or a combination of a host port and a chip port. In the example shown, two ports 7570 and 7570 'can enable two different hosts (e.g., host computers or computing elements or other types of logic units) to access the distributed processor memory chip 7500A via buses 7534 and 7534'. This embodiment may enable two (or more than two) different host computers to access the distributed processor memory chip 7500A. However, in other embodiments, both buses 7534 and 7534' may be connected to the same host entity, e.g., where the host entity requires additional bandwidth or parallel access to one or more of the processor subunits/memory banks of the distributed processor memory chip 7500A.
In some cases, as shown in fig. 75D, more than one controller 7540 and 7540' may be used to control access to the distributed processor subunits/memory banks of the distributed processor memory chip 7500A. In other cases, a single controller may be used to handle communications from one or more external host entities.
Additionally, one or more buses internal to the distributed processor memory chip 7500A may enable parallel access to the distributed processor subunits/memory banks of the distributed processor memory chip 7500A. For example, distributed processor memory chip 7500A may include a first bus 7580 and a second bus 7580' that enable parallel access to, for example, distributed processor subunits 7520_1 through 7520_6 and their corresponding dedicated memory banks 7510_1 through 7510_6. This configuration may allow simultaneous access to two different locations in the distributed processor memory chip 7500A. Additionally, in cases where not all ports are used simultaneously, the ports may share hardware resources (e.g., a common bus and/or a common controller) within the distributed processor memory chip 7500A and may constitute multiplexed (mux) IO to that hardware.
In some embodiments, some of the arithmetic units (e.g., processor sub-units 7520_1 to 7520_6) may be connected to additional ports (7570') or controllers, while others are not connected to additional ports or controllers. However, data from arithmetic units not connected to the additional port 7570 'may pass through an internal grid (grid) of connections to arithmetic units connected to the port 7570'. In this way, communication can be performed at both ports 7570 and 7570' simultaneously without adding an additional bus.
While the communication ports (e.g., 7530-7532) and the controller (e.g., 7540) have been illustrated as separate components, it should be appreciated that the communication ports and the controller (or any other components) can be implemented as an integrated unit in accordance with embodiments of the invention. FIG. 76 provides a diagrammatic representation of a distributed processor memory chip 7600 with an integrated controller and interface module consistent with an embodiment of the present invention. As shown in fig. 76, the processor memory chip 7600 can be implemented with an integrated controller and interface module 7547 configured to perform the functions of the controller 7540 and the communication ports 7530, 7531, and 7532 in fig. 75A. As shown in fig. 76, the controller and interface module 7547 is configured to communicate with a plurality of different entities, such as external entities, one or more distributed processor memory chips, etc., via interfaces 7548_1 to 7548_N, which are similar to the communication ports (e.g., 7530, 7531, and 7532). The controller and interface module 7547 can be further configured to control communication between distributed processor memory chips or between the distributed processor memory chip 7600 and external entities such as host computers. In some embodiments, the controller and interface module 7547 can include communication interfaces 7548_1 to 7548_N configured to communicate in parallel with one or more other distributed processor memory chips and with external entities such as host computers, communication modules, and the like.
FIG. 77 provides a flowchart representing a process for transferring data between distributed processor memory chips in the scalable processor memory system shown in FIG. 75A, consistent with an embodiment of the invention. For purposes of illustration, the data transfer flow will be described with reference to FIG. 75A, and it is assumed that data is transferred from the first processor memory chip 7500 to the second processor memory chip 7500'.
At step S7710, a data transfer request may be received. However, it should be noted, as described above, that in some embodiments a data transfer request may not be necessary. For example, in some cases, the timing of data transfer may be predetermined (e.g., by particular software code). In such cases, the data transfer may proceed without a separate data transfer request. Step S7710 can be performed by, for example, controller 7540, among others. In some embodiments, the data transfer request may include a request to transfer data from one processor subunit of the first distributed processor memory chip 7500 to another processor subunit of the second distributed processor memory chip 7500'.
At step S7720, a data transfer timing may be determined. As mentioned, the data transfer timing may be predetermined and may depend on the execution order of a particular software program. Step S7720 may be performed by, for example, controller 7540, among others. In some embodiments, the data transfer timing may be determined by considering (1) whether the sending processor subunit is ready to transfer data and/or (2) whether the receiving processor subunit is ready to receive data. In accordance with embodiments of the present invention, whether one or more other timing constraints are satisfied may also be considered in enabling the data transfer. One or more timing constraints may be associated with each of: a time difference from a transmit time of a sending processor subunit to a receive time at a receiving processor subunit, an access request from an external entity (e.g., a host computer) for processed data, a refresh operation performed on a memory resource (e.g., a memory array) associated with the sending or receiving processor subunit, and so forth. According to an embodiment of the invention, the processing subunit may be fed by a clock. In some embodiments, the clock supplied to the processing subunit may be gated, for example, using a clock enable signal. According to some embodiments of the invention, the controller 7540 may control the timing of communication commands by controlling the clock enable signals to the processor subunits 7520_1 through 7520_K.
At step S7730, data transmission may be performed based on the data transmission timing determined at step S7720. Step S7730 may be performed by, for example, controller 7540, among others. For example, the transmitting processor subunit of the first processor memory chip 7500 may transmit data to the receiving processor subunit of the second processor memory chip 7500' according to the data transmission timing determined at step S7720.
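The three steps of FIG. 77 might be modeled as follows (a hedged sketch only; `controller` and its methods are hypothetical stand-ins for the chip controllers and their readiness/timing checks):

```python
def transfer_between_chips(controller, request):
    # S7710: a data transfer request arrives (or is preprogrammed).
    src, dst, payload = request.source, request.destination, request.data
    # S7720: determine the transfer timing from sender/receiver readiness
    # and other timing constraints (refresh operations, host accesses, ...).
    while not (controller.ready_to_send(src)
               and controller.ready_to_receive(dst)
               and controller.timing_constraints_met(src, dst)):
        controller.wait_one_cycle()   # e.g., hold CE low for the subunits
    # S7730: perform the transfer at the determined time.
    controller.send(src, dst, payload)
```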
The disclosed architecture may be applicable to a variety of applications. For example, in some cases, the above architecture may facilitate sharing data, such as weights or neuron values or partial neuron values associated with a neural network (particularly a large neural network), among different distributed processor memory chips. In addition, data from multiple different distributed processor memory chips may be needed in certain operations such as SUM, AVG, and the like. In this case, the disclosed architecture may facilitate sharing of this data from multiple distributed processor memory chips. Still further, for example, the disclosed architecture may facilitate sharing records among distributed processor memory chips to support join operations for queries.
It should also be noted that while the above embodiments have been described with respect to distributed processor memory chips, the same principles and techniques may be applied to, for example, conventional memory chips that do not include distributed processor subunits. For example, in some cases, multiple memory chips may be combined together into a multi-port memory chip to form an array of memory chips even without an array of processor subunits. In another embodiment, multiple memory chips can be combined together to form an array of connected memory, effectively providing the host with one larger memory comprising multiple memory chips.
In any of these configurations, the internal connection of a port may be to the main bus or to one of the processor subunits included in the processing array.
In-memory zero detection
Some embodiments of the invention relate to a memory unit for detecting a zero value stored in one or more specific addresses of a plurality of memory banks. This zero-value detection feature of the disclosed memory unit may be useful for reducing power consumption of the computing system, and additionally or alternatively, may reduce the processing time required for retrieving zero values from memory. This feature may be particularly relevant in systems in which a large fraction of the data read is effectively zero-valued and is used in computational operations such as multiply, add, or subtract. For such operations it may not be necessary to retrieve a zero value from memory (e.g., the product of zero and any other value is zero), and the computational circuitry may exploit the fact that one operand is zero to compute the result more efficiently in both time and energy. In such cases, detection of the presence of a zero value may be used in lieu of a memory access and retrieval of the zero value from memory.
Throughout this section, the disclosed embodiments are described with respect to read functions. It should be noted, however, that the disclosed architecture and techniques are equally applicable to zero-value write operations, or to operations involving other specific predetermined non-zero values in situations where those other values may occur more often.
In the disclosed embodiments, instead of retrieving a zero value from memory, when such a value is detected at a particular address, the memory unit may return a zero value indicator to one or more circuits external to the memory unit (e.g., one or more processors, CPUs, etc. external to the memory unit). The zero value may be a multi-bit zero value (e.g., a zero-valued byte, a zero-valued word, a multi-bit zero value smaller than one byte or larger than one byte, and the like). The zero value indicator may be a 1-bit signal indicating that a zero value is stored in memory, and transmitting this 1-bit hint signal is beneficial compared to transmitting the n data bits stored in memory. The transmitted zero-value hint may reduce the energy consumed by the transmission to 1/n and may speed up operations, such as where multiplication operations are involved in computing inputs multiplied by neuron weights, convolutions, applying kernels to input data, and many other calculations associated with trained neural networks, artificial intelligence, and a wide range of other types of operations. To provide this functionality, the disclosed memory unit may include one or more zero value detection logic units that may detect the presence of a zero value in a particular location in memory, prevent the retrieval of the zero value (e.g., via a read command), and cause a zero value indicator to be transmitted instead to circuitry external to the memory unit (e.g., using one or more control lines of the memory, one or more buses associated with the memory unit, etc.). Zero detection may be performed at the memory pad level, at the bank level, at the subset level, at the chip level, etc.
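The read-path behavior described above can be sketched as follows (an illustrative Python model, not the disclosed circuit; the dictionary-based return value is an assumption for clarity):

```python
def read_with_zero_detection(memory, address, word_bits=32):
    """Return a 1-bit zero indicator instead of an n-bit zero word."""
    word = memory[address]
    if word == 0:
        # Zero detected: suppress the n-bit data transfer and assert the
        # 1-bit zero value indicator line instead (roughly a 1/n savings
        # in transmission energy for zero-valued words).
        return {"zero_indicator": 1, "data": None}
    return {"zero_indicator": 0, "data": word & ((1 << word_bits) - 1)}
```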
It should be noted that although the disclosed embodiments are described with respect to delivering a zero indicator to a location external to a memory chip, the disclosed embodiments and features may also provide significant benefits in systems where processing may occur internal to a memory chip. For example, in embodiments of a distributed processor memory chip such as disclosed herein, processing of data in various memory banks may be performed by corresponding processor subunits. In many cases, such as neural network execution or data analysis where the associated data may include many zeros, the disclosed techniques may speed up processing and/or reduce power consumption associated with processing performed by processor subunits in a distributed processor memory chip.
FIG. 78A illustrates a system 7800 for detecting zero values at the chip level stored in one or more specific addresses of multiple memory banks implemented in a memory chip 7810, in accordance with an embodiment of the present invention. System 7800 may include a memory chip 7810 and a host 7820. Memory chip 7810 may include multiple control units and each control unit may have a dedicated memory bank. For example, the control unit may be operatively connected to the dedicated memory bank.
In some cases, for example, with respect to the distributed processor memory chips disclosed herein, processing within a memory chip may involve memory access (whether read or write), which includes processor subunits spatially distributed among an array of memory banks. Even in the case of processing internal to the memory chip, the disclosed techniques of detecting a zero value associated with a read or write command may allow an internal processor unit or sub-unit to forgo transmitting an actual zero value. Instead, in response to a zero value detection and a zero value indicator transmission (e.g., to one or more internal processing subunits), the distributed processor memory chip may save energy that would otherwise have been used to transmit the zero data value within the memory chip.
In another example, each of memory chip 7810 and host 7820 may include input/output (IO) circuitry to enable communication between memory chip 7810 and host 7820. Each IO may be coupled with a zero value indicator line 7830A and a bus 7840A. The zero value indicator line 7830A may communicate a zero value indicator from the memory chip 7810 to the host 7820, where the zero value indicator may comprise a 1-bit signal generated by the memory chip 7810 upon detection of a zero value stored in a particular address of a memory bank requested by the host 7820. Upon receiving a zero value indicator via zero value indicator line 7830A, host 7820 may perform one or more predefined actions associated with the zero value indicator. For example, if host 7820 requests that memory chip 7810 retrieve operands for a multiplication, host 7820 may compute the multiplication more efficiently because host 7820 can recognize from the received zero value indicator, without receiving the actual memory value, that one of the operands is zero. Host 7820 may also provide instructions, data, and other inputs to memory chip 7810 through bus 7840A and read outputs from memory chip 7810. Upon receiving a communication from host 7820, memory chip 7810 may retrieve data associated with the received communication and transmit the retrieved data to host 7820 via bus 7840A.
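For example, a host-side shortcut for multiplication might look like the following sketch (hypothetical names, offered only to illustrate the predefined action):

```python
def multiply_operands(indicator_a, value_a, indicator_b, value_b):
    # If either zero value indicator is asserted, the product is known to
    # be zero without reading the actual operand from memory.
    if indicator_a or indicator_b:
        return 0
    return value_a * value_b
```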
In some embodiments, the host may send a zero value indicator to the memory chip instead of a zero data value. In this manner, a memory chip (e.g., a controller disposed on the memory chip) can store or refresh a zero value in the memory without having to receive a zero data value. This update may occur based on receipt of a zero value indicator (e.g., as part of a write command).
FIG. 78B illustrates a memory chip 7810 for detecting zero values stored in one or more specific addresses of multiple memory banks 7811A-7811B at the memory bank level in accordance with an embodiment of the present invention. The memory chip 7810 may include a plurality of memory banks 7811A through 7811B and IO buses 7812. Although FIG. 78B depicts two memory banks 7811A-7811B implemented in memory chip 7810, memory chip 7810 may include any number of memory banks.
IO bus 7812 may be configured to transfer data to/from an external chip (e.g., host 7820 in fig. 78A) via bus 7840B. The bus 7840B may function similarly to the bus 7840A in fig. 78A. IO bus 7812 may also transmit a zero value indicator via zero value indicator line 7830B, where zero value indicator line 7830B may function similarly to zero value indicator line 7830A in fig. 78A. IO bus 7812 may also be configured to communicate with memory banks 7811A-7811B via internal zero value indicator line 7831 and bus 7841. The IO bus 7812 may transmit data received from an external chip to one of the memory banks 7811A through 7811B. For example, IO bus 7812 may transmit data over bus 7841 including instructions to read data stored at a particular address in memory bank 7811A. Multiplexers may be included between IO bus 7812 and memory banks 7811A-7811B, connected by internal zero value indicator line 7831 and bus 7841. Each multiplexer may be configured to transmit data received from IO bus 7812 to a particular memory bank and may be further configured to transmit data, or a zero value indicator, received from the particular memory bank to IO bus 7812.
In some cases, the host entity may be configured only to receive regular data transmissions and may not be equipped to interpret or respond to the disclosed zero value indicator. In this case, the disclosed embodiments (e.g., controller/chip IO, etc.) may regenerate a zero value on the data line to the host IO in place of the zero value indicator signal, and thus may save data transmission power internal to the chip.
Each of the memory banks 7811A through 7811B may include a control unit. The control unit may detect a zero value stored in the requested address of the memory bank. Upon detecting the stored zero value, the control unit may generate a zero value indicator and transmit the generated zero value indicator to IO bus 7812 via internal zero value indicator line 7831, wherein the zero value indicator is further communicated to an external chip via zero value indicator line 7830B.
FIG. 79 illustrates a memory bank 7911 for detecting, at the memory pad level, a zero value stored in one or more particular addresses of a plurality of memory pads, consistent with an embodiment of the invention. In some embodiments, memory bank 7911 may be organized into memory pads 7912A-7912B, each of which may be independently controlled and independently accessed. The memory bank 7911 may include memory pad controllers 7913A-7913B, which may include zero value detection logic 7914A-7914B. Each of the memory pad controllers 7913A-7913B may allow reading and writing of locations on the memory pads 7912A-7912B. The memory bank 7911 may further include read-disable components, local sense amplifiers 7915A-7915B, and/or a global sense amplifier 7916.
Each of the memory pads 7912A-7912B may include a plurality of memory cells. Each of the plurality of memory cells may store one bit of binary information. For example, any of the memory cells may individually store a zero value. If all memory cells in a particular memory pad store a zero value, the zero value may be associated with the entire memory pad.
Each of the memory pad controllers 7913A-7913B may be configured to access a dedicated memory pad and read or write data stored in the dedicated memory pad.
In some embodiments, the zero detection logic 7914A or 7914B may be implemented in the memory bank 7911. One or more zero value detection logic units 7914A-7914B may be associated with a set of memory banks, memory subsets, memory pads, and one or more memory cells. The zero value detection logic 7914A or 7914B may detect that a particular address requested (e.g., memory pad 7912A or 7912B) stores a zero value. The detection may be performed in a number of ways.
The first method may comprise using a digital comparator to compare against zero. The digital comparator may be configured to take two numbers as input in binary form and determine whether the first number (the captured data) is equal to the second number (zero). If the digital comparator determines that the two numbers are equal, the zero value detection logic may generate a zero value indicator. The zero value indicator may be a 1-bit signal and may disable the amplifiers (e.g., local sense amplifiers 7915A-7915B), transmitters, and buffers that would otherwise send the data bits to the next level (e.g., IO bus 7812 in FIG. 78B). The zero value indicator may be further transmitted to the global sense amplifier 7916 via the zero value indicator line 7931A or 7931B, although in some cases the global sense amplifier may be bypassed.
A second method for zero detection may include using an analog comparator. An analog comparator may function similarly to a digital comparator, except that it compares the voltages of two analog inputs. For example, all bits may be sensed, and the comparator may act as a logical OR function over the sensed signals.
A third method for zero value detection may include using a transfer signal from the local sense amplifiers 7915A-7915B into the global sense amplifier 7916, where the global sense amplifier 7916 is configured to sense whether any of the inputs is high (non-zero) and use the logic signal to control the next level of amplifiers. The local sense amplifiers 7915A-7915B and the global sense amplifier 7916 may include a plurality of transistors configured to sense low power signals from multiple memory banks, and the amplifiers amplify small voltage swings to higher voltage levels so that data stored in the multiple memory banks may be interpreted by at least one controller, such as the memory pad controller 7913A or 7913B. For example, the memory cells may be arranged in rows and columns on the memory bank 7911. Each line may be attached to each memory cell in a row. The lines running along the rows are referred to as word lines, which are activated by selectively applying voltages to the word lines. The lines running along the columns are referred to as bit lines, and two such complementary bit lines may be attached to the sense amplifiers at the edge of the memory array. The number of sense amplifiers may correspond to the number of bit lines (columns) on the memory bank 7911. To read a bit from a particular memory cell, the word line along a row of cells is turned on, thereby activating all of the memory cells in the row. The stored value (0 or 1) from each cell is then available on the bitline associated with the particular cell. At the ends of the two complementary bit lines, the sense amplifier may amplify a small voltage to a normal logic level. The bit from the desired cell may then be latched into a buffer from the sense amplifier of the cell and placed on the output bus.
A fourth method for zero value detection may include storing an extra bit for each word saved to memory: the bit is set at write time if the value is zero, and is consulted when reading the data to determine whether the data is zero. This approach may also avoid actually writing all-zero words to the memory, thus saving additional energy.
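A behavioral sketch of this fourth method (a simplified model under the stated assumption of one extra flag bit per word; not the disclosed circuit):

```python
class ZeroFlaggedMemory:
    """Memory model storing a per-word zero flag set at write time."""

    def __init__(self, num_words):
        self.data = [0] * num_words
        self.zero_flag = [True] * num_words   # one extra bit per word

    def write(self, addr, value):
        self.zero_flag[addr] = (value == 0)
        if value != 0:          # all-zero words are never actually written
            self.data[addr] = value

    def read(self, addr):
        # Returns (value, zero_indicator); a flagged word is never fetched.
        if self.zero_flag[addr]:
            return 0, True
        return self.data[addr], False
```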
As described above and throughout this disclosure, some embodiments may include a memory unit (such as memory unit 7800) that includes multiple processor sub-units. These processor subunits may be spatially distributed on a single substrate (e.g., a substrate of a memory chip such as memory unit 7800). Further, each of the plurality of processor sub-units may be dedicated to a corresponding memory bank among the plurality of memory banks of memory unit 7800. And the memory banks dedicated to the corresponding processor sub-units may also be spatially distributed over the substrate. In some embodiments, memory unit 7800 may be associated with a particular task (e.g., performing one or more operations associated with running a neural network, etc.), and each of the processor subunits of memory unit 7800 may be responsible for performing a portion of such task. For example, each processor subunit may be equipped with instructions that may include data handling and memory operations, arithmetic and logical operations, and so forth. In some cases, the zero value detection logic may be configured to provide zero value indicators to one or more of the described processor sub-units that are spatially distributed over the memory unit 7800.
Referring now to FIG. 80, a flow diagram is presented illustrating an exemplary method 8000 of detecting a zero value stored in a particular address of a plurality of memory banks, consistent with embodiments of the present invention. Method 8000 may be performed by a memory chip (e.g., memory chip 7810 of fig. 78B). In particular, a controller of a memory unit (e.g., controller 7913A of fig. 79) and zero value detection logic (e.g., zero value detection logic 7914A) may perform method 8000.
In step 8010, a read or write operation may be initiated by any suitable technique. In some cases, the controller may receive a request to read data stored in a particular address of a plurality of discrete memory banks (e.g., the memory bank depicted in fig. 78). The controller may be configured to control at least one aspect of read/write operations with respect to the plurality of discrete memory banks.
In step 8020, one or more zero value detection circuits may be used to detect the presence of a zero value associated with the read or write command. For example, zero value detection logic (e.g., zero value detection logic 7830 of fig. 78) may detect a zero value associated with a particular address associated with a read or write.
In step 8030, the controller can transmit a zero value indicator to one or more circuits external to the memory unit in response to the detection of a zero value by the zero value detection logic in step 8020. For example, the zero value detection logic may detect that the requested address stores a zero value, and a zero-value hint may be transmitted to an entity (e.g., one or more circuits) external to the memory chip (or within the memory chip, such as in the disclosed distributed processor memory chip including processor subunits distributed among an array of memory banks). If a zero value associated with the read or write command is not detected, the controller may transmit the data value instead of a zero value indicator. In some embodiments, the one or more circuits to which the zero value indicator is returned may be internal to the memory unit.
Although the disclosed embodiments have been described with respect to zero value detection, the same principles and techniques will be applicable to detecting other memory values (e.g., 1, etc.). In some cases, in addition to a zero value indicator, the detection logic may also return one or more indicators of other values (e.g., 1, etc.) associated with the read or write command, and these indicators may be returned/transmitted if any value corresponding to the value indicator is detected. In some cases, the values may be adjusted by a user (e.g., by updating one or more registers). Such updating may be particularly useful where characteristics about the data set may be known and it is known (e.g., to the user) that certain values may be more prevalent in the data than others. In this case, one, two, three, or more than three value indicators may be associated with the most prevalent data associated with the data set.
Compensating for the DRAM activation penalty
In certain types of memory (e.g., DRAM), memory cells may be arranged in arrays within a memory bank, and the values stored in the memory cells may be accessed and retrieved (read) one row of memory cells at a time. This read process may involve first opening (activating) a row of memory cells to make available the data values stored by those memory cells. Next, the values of the memory cells in the open row may be sensed simultaneously, and a column address may be used to cycle through individual memory cell values or groups of memory cell values (i.e., words) and connect each memory cell value to an external data bus in order to read it. These procedures take time. In some cases, opening a row for reading may require 32 cycles of operation time, and reading values from an open row may require another 32 cycles. Significant latency can result if the next row to be read is opened only after the read operation on the currently open row is complete. In this example, no data is read during the 32 cycles required to open the next row, so reading each row effectively requires a total of 64 cycles instead of only the 32 cycles needed to traverse the row data. Conventional memory systems do not allow a second row to be opened in a bank while a first row in the same bank is being read or written. To hide this latency, the next row to be opened may therefore be located in a different bank, or in a special bank supporting dual-row access, as discussed in further detail below. Alternatively, before the next row is opened, the current row may be sampled in its entirety into flip-flops or latches; the next row can then be opened while all processing proceeds on the flip-flops/latches. If the predicted next row is in the same bank (and none of the above mechanisms is available), the latency may not be avoidable and the system may need to wait. These mechanisms apply both to standard memory and, in particular, to memory processing devices.
Embodiments disclosed herein may reduce this latency by, for example, predicting the next memory row to be opened before the read operation on the currently open row has completed. That is, if the next row to be opened can be predicted, the process of opening that row may begin before the read operation on the current row has completed. Depending on when in the process the next-row prediction is made, the latency associated with opening the next row may be reduced from 32 cycles (in the specific example described above) to fewer than 32 cycles. In one particular example, if the next row open is predicted 20 cycles ahead, the additional latency is only 12 cycles. In another example, if the next row open is predicted 32 cycles ahead, there is no additional latency at all. As a result, instead of requiring a total of 64 cycles to open and read each row in series, the effective time to read each row can be reduced by opening the next row while the current row is being read.
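The cycle arithmetic above can be checked with a short calculation (using the example's 32-cycle figures, which are illustrative rather than guaranteed device timings):

```python
OPEN_CYCLES, READ_CYCLES = 32, 32

def effective_cycles_per_row(prediction_lead):
    # Opening the predicted next row overlaps with reading the current row;
    # only the un-overlapped portion of the open adds latency.
    exposed_open = max(OPEN_CYCLES - prediction_lead, 0)
    return READ_CYCLES + exposed_open

assert effective_cycles_per_row(0) == 64   # no prediction: serial open+read
assert effective_cycles_per_row(20) == 44  # 12 exposed cycles of latency
assert effective_cycles_per_row(32) == 32  # activation fully hidden
```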
The mechanisms below may require that the current row and the predicted row reside in different banks, although they may also be used within a single bank if that bank supports activating one row while work proceeds on another.
In the disclosed embodiments, next row prediction may be performed using various techniques (discussed in more detail below). For example, the next row prediction may be based on pattern recognition, on a predetermined row access schedule, on the output of an artificial intelligence model (e.g., a neural network trained to analyze row accesses and predict the next row to be opened), or on any other suitable prediction technique. In some embodiments, 100% successful prediction may be achieved by using a delayed address generator, a formula as described below, or other methods. Prediction may include building a system with the ability to predict, sufficiently far in advance, which row will require access before that row needs to be opened. In some cases, the next row prediction may be performed by a next row predictor, which may be implemented in various ways. For example, a predictive address generator may generate both the current address used for reading and/or writing a memory row and a predicted address. The entity that generates addresses for accessing memory (for reads or writes) may be based on any logic circuit, or on a controller/CPU executing software instructions. The predictive address generator may include a pattern learning model that observes the accessed rows, identifies one or more patterns associated with the accesses (e.g., sequential row accesses, accesses to every second row, accesses to every third row, etc.), and estimates the next row to be accessed based on the observed patterns. In other examples, the predictive address generator may include a unit that applies a formula/algorithm to predict the next row to be accessed. In still other embodiments, the predictive address generator may include a trained neural network that outputs a predicted next row to be accessed (including one or more addresses associated with the predicted row) based on inputs such as the current row address being accessed, the last 2, 3, 4, or more than 4 addresses/rows accessed, and so on. Predicting the next memory row to be accessed using any of the described predictive address generators may significantly reduce the latency associated with memory accesses. The described predictive address/row generators may be applicable in any system that involves accessing memory to retrieve data. In some cases, the described predictive address/row generators and associated techniques for predicting the next memory row access may be particularly suitable in systems that execute artificial intelligence models, as AI models may be associated with repeating memory access patterns that facilitate next row prediction.
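As one concrete illustration of a pattern learning approach (a minimal stride detector; this is a sketch of just one of the prediction options named above, not the only possible predictor):

```python
class StridePredictor:
    """Predicts the next row from the stride between recent row accesses."""

    def __init__(self):
        self.last_row = None
        self.stride = None

    def observe(self, row):
        if self.last_row is not None:
            self.stride = row - self.last_row  # e.g., +1, +2, +3 patterns
        self.last_row = row

    def predict_next(self):
        if self.last_row is None or self.stride is None:
            return None                        # not enough history yet
        return self.last_row + self.stride
```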
FIG. 81A illustrates a system 8100 for activating a next row associated with a memory bank based on a next row prediction, consistent with embodiments of the invention. System 8100 may include a current and predicted address generator 8192, a bank controller 8191, and memory banks 8180A-8180B. The address generator may be an entity that generates addresses for accessing the memory banks 8180A-8180B, and may be based on any logic circuitry, controller, or microprocessor that executes a software program. Bank controller 8191 may be configured to access the current row of memory bank 8180A (e.g., using the current row identifier generated by address generator 8192). The bank controller 8191 may also be configured to activate, within memory bank 8180B, the predicted next row to be accessed, based on the predicted row identifier generated by the address generator 8192. The following example describes two banks; in other examples, more banks may be used. In some embodiments, there may be memory banks that allow more than one row to be accessed at a time (as discussed below), in which case the same process may be performed on a single bank. As described above, the activation of the predicted next row to be accessed may begin before the read operation performed with respect to the current row being accessed is completed. Thus, in some cases, the address generator 8192 may predict the next row to be accessed and may send an identifier (e.g., one or more addresses) of the predicted next row to the bank controller 8191 at any time before access to the current row has completed. This timing may allow the bank controller to initiate activation of the predicted next row at any point in time while the current row is being accessed and before the access to the current row is complete. In some cases, the bank controller 8191 may initiate activation of the predicted next row at the same time (or within a few clock cycles of the time) that activation of the current row to be accessed completes and/or a read operation with respect to the current row has begun.
In some embodiments, the operation with respect to the current row associated with the current address may be a read or write operation. In some embodiments, the current row and the next row may be in the same memory bank. In some embodiments, the same memory bank may allow the next row to be activated while the current row is being accessed. In other embodiments, the current row and the next row may be in different memory banks. In some embodiments, the memory unit may include a processor configured to generate a current address and a predicted address. In some embodiments, the memory unit may comprise a distributed processor. A distributed processor may include a plurality of processor subunits of a processing array spatially distributed among a plurality of discrete memory banks of a memory array. In some embodiments, the predicted address may be generated by a series of flip-flops that sample the generated address with a delay. The delay may be configurable via a multiplexer that selects between the flip-flops that store the sampled addresses.
It should be noted that upon confirming that the predicted next row is actually the next row that the executing software requests to access (e.g., upon completing a read operation with respect to the current row), the predicted next row may become the current row to be accessed. In the disclosed embodiments, because the process for activating the predicted next row may be initiated before the current row read operation is completed, the next row to be accessed may already be fully or partially activated by the time the predicted next row is confirmed as the correct next row to be accessed. This may significantly reduce the latency associated with row activation. If the next row is activated such that the activation ends before or at the same time as the reading of the current row ends, a power reduction may also be obtained.
Current and predicted address generator 8192 may include any suitable logic components, arithmetic units, memory units, algorithms, trained models, etc., configured to identify a row in memory bank 8180 to be accessed (e.g., based on program execution) and to predict a next row to be accessed (e.g., based on an observed pattern in the row accesses, based on a predetermined pattern (n+1, n+2), etc.). For example, in some embodiments, current and predicted address generator 8192 may include a counter 8192A, a current address generator 8192B, and a predicted address generator 8192C. Current address generator 8192B may be configured to generate the current address of the current row to be accessed in memory bank 8180 based on an output of counter 8192A, e.g., in response to a request from an arithmetic unit. The address associated with the current row to be accessed may be provided to bank controller 8191. Predicted address generator 8192C may be configured to determine a predicted address for the next row in memory bank 8180 to be accessed based on an output of counter 8192A, based on a predetermined access pattern (e.g., in conjunction with counter 8192A), or based on the output of a trained neural network or other type of pattern prediction algorithm that observes row accesses and predicts the next row to be accessed based on, for example, a pattern associated with the observed row accesses. Address generator 8192 may provide the predicted next row address from predicted address generator 8192C to bank controller 8191.
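A hedged sketch of the counter-driven generator pair described above follows; 8192A-8192C are the figure's reference numerals, while the classes and the fixed n+1 stride below are illustrative assumptions.

```python
class Counter:
    """Stands in for counter 8192A."""
    def __init__(self):
        self.value = 0
    def step(self):
        self.value += 1
        return self.value

class CurrentAddressGenerator:
    """Stands in for 8192B: maps the counter output to the requested row."""
    def generate(self, count):
        return count

class PredictedAddressGenerator:
    """Stands in for 8192C: applies a predetermined pattern such as n+1."""
    def __init__(self, stride=1):
        self.stride = stride
    def generate(self, count):
        return count + self.stride

counter = Counter()
current_gen = CurrentAddressGenerator()
predicted_gen = PredictedAddressGenerator(stride=1)

n = counter.step()
current_row = current_gen.generate(n)      # sent to bank controller 8191
predicted_row = predicted_gen.generate(n)  # activated speculatively
```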
In some embodiments, current address generator 8192B and predicted address generator 8192C may each be implemented within system 8100 or external to it, e.g., at an external host connected to system 8100. For example, current address generator 8192B may be software executing a program at an external host while, to avoid any added latency, predicted address generator 8192C may be implemented internal to system 8100; alternatively, predicted address generator 8192C may also be implemented external to system 8100.
As mentioned, the predicted next row address may be determined using a trained neural network that predicts the next row to be accessed based on inputs that may include one or more previously accessed row addresses. A trained neural network or other type of model may operate within the logic associated with the predictive address generator 8192C. In some cases, a trained neural network or the like may be executed by one or more arithmetic units external to, but in communication with, predictive address generator 8192C.
In some embodiments, the predicted address generator 8192C may comprise a replica, or a substantial replica, of the current address generator 8192B. Additionally, the timing of the operations of current address generator 8192B and predicted address generator 8192C may be fixed relative to each other or may be adjustable. For example, in some cases, predicted address generator 8192C may be configured to output the address identifier associated with the predicted next row at a fixed time (e.g., a fixed number of clock cycles) relative to when current address generator 8192B issues the address identifier associated with the current row to be accessed. In some cases, the predicted next row identifier may be generated before or after activation of the current row to be accessed begins, before or after a read operation associated with the current row to be accessed begins, or at any time before the read operation associated with the current row being accessed completes. In some cases, the predicted next row identifier may be generated at the same time that activation of the current row to be accessed begins, or at the same time that the read operation associated with the current row to be accessed begins.
In other cases, the time between the generation of the predicted next row identifier and the activation of the current row to be accessed, or the start of a read operation associated with the current row, may be adjustable. For example, in some cases, this time may be lengthened or shortened during operation of memory unit 8100 based on values associated with one or more operating parameters. In some cases, the current temperature (or any other parameter value) associated with a memory unit or another component of the operating system may cause current address generator 8192B and predicted address generator 8192C to change their relative operating timing. In an embodiment, the prediction mechanism may, among other things, be part of the logic in a memory processing unit.
Current and predicted address generator 8192 may generate a confidence level associated with the predicted-next-row determination. This confidence level, which may be determined by predicted address generator 8192C as part of the prediction process, may be used to determine, for example, whether to initiate activation of the predicted next row during the read operation of the current row (i.e., before the current row read operation has completed and before the identity of the next row to be accessed has been confirmed). For example, in some cases, the confidence level associated with the predicted next row to be accessed may be compared to a threshold level. If the confidence level falls below the threshold level, for example, memory unit 8100 may forgo activating the predicted next row. On the other hand, if the confidence level exceeds the threshold level, memory unit 8100 may initiate activation of the predicted next row in memory bank 8180.
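A minimal sketch of the confidence gate just described, assuming a hypothetical bank_controller.activate() interface and an illustrative threshold value not taken from the source:

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative; the text does not fix a value

def maybe_preactivate(bank_controller, predicted_row, confidence):
    """Speculatively activate the predicted row only above the threshold."""
    if confidence >= CONFIDENCE_THRESHOLD:
        bank_controller.activate(predicted_row)  # hypothetical interface
        return True
    return False  # forgo activation; wait for the confirmed next row
```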
The mechanism for testing the confidence of the predicted next row against the threshold level, and the subsequent activation or non-activation of the predicted next row, may be implemented in any suitable manner. In some cases, for example, if the confidence associated with the predicted next row falls below a threshold, predicted address generator 8192C may refrain from outputting its predicted next row result to downstream logic components. Alternatively, in this case, current and predicted address generator 8192 may withhold the predicted next row identifier from bank controller 8191, or the bank controller (or another logic unit) may be equipped to use the confidence of the predicted next row to determine whether to begin activating the predicted next row before the read operation associated with the current row being read is complete.
The confidence associated with the predicted next row may be generated in any suitable manner. In some cases, such as where the predicted next row is identified based on a predetermined known access pattern, predicted address generator 8192C may generate a high confidence or may forgo generating a confidence altogether in view of the predetermined pattern of row accesses. On the other hand, where predicted address generator 8192C executes one or more algorithms to monitor row accesses and outputs a predicted row based on a pattern computed with respect to the monitored row accesses, or where one or more trained neural networks or other models are configured to output a predicted next row based on inputs including the most recent row accesses, the confidence of the predicted next row may be determined based on any relevant parameters. For example, in some cases, the confidence may depend on whether one or more previous next row predictions proved accurate (e.g., past performance indicators). The confidence may also be based on one or more characteristics of the inputs to the algorithm/model. For example, an input that includes actual row accesses that follow a pattern may result in a higher confidence than actual row accesses that exhibit less patterning. And in some cases, for example where randomness is detected in the stream of inputs including the most recent row accesses, the confidence generated may be low. Additionally, in the event randomness is detected, the next row prediction process may be aborted entirely, one or more of the components of memory unit 8100 may ignore the next row prediction, or any other action may be taken to forgo activating the predicted next row.
In some cases, a feedback mechanism may be included with respect to the operation of memory unit 8100. For example, periodically, or even after each next row prediction, the accuracy with which predicted address generator 8192C predicts the actual next row to be accessed may be determined. In some cases, if there is an error in predicting the next row to be accessed (or after a predetermined number of errors), the next row prediction operation of predicted address generator 8192C may be temporarily suspended. In other cases, predicted address generator 8192C may include a learning element such that one or more aspects of its prediction operation may be adjusted based on received feedback regarding the accuracy with which it predicts the next row to be accessed. This capability may improve the operation of predicted address generator 8192C so that it may adapt to changing access patterns, etc.
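The feedback loop might look like the following sketch, where the error limit that triggers suspension is an assumption rather than a value from the source:

```python
class PredictionFeedback:
    """Tracks next-row prediction accuracy and suspends after errors."""

    def __init__(self, max_errors=3):     # assumed limit, not from source
        self.errors = 0
        self.max_errors = max_errors
        self.suspended = False

    def report(self, predicted_row, actual_row):
        if predicted_row == actual_row:
            self.errors = 0                # a correct prediction resets
        else:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.suspended = True      # temporarily stop predicting
```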
In some embodiments, the timing of the generation of the predicted next row and/or the activation of the predicted next row may depend on the overall operation of memory unit 8100. For example, after power-up or after resetting memory unit 8100, prediction of the next row to be accessed may be temporarily suspended (or the prediction may not be forwarded to bank controller 8191), e.g., for a predetermined amount of time or number of clock cycles, until a predetermined number of row accesses/reads have completed, until the confidence of the predicted next row exceeds a predetermined threshold, or based on any other suitable criteria.
FIG. 81B illustrates another configuration of a memory unit 8100 according to an exemplary disclosed embodiment. In the system 8100B of FIG. 81B, a cache 8193 may be associated with bank controller 8191. For example, cache 8193 may be configured to store one or more rows of data after they have been accessed, avoiding the need to re-activate those rows. Thus, cache 8193 may enable bank controller 8191 to access row data from cache 8193 rather than accessing memory bank 8180. For example, cache 8193 may store the last X rows of data (or follow any other cache retention policy), and bank controller 8191 may populate cache 8193 according to the predicted row. Additionally, if the predicted row is already in cache 8193, then the predicted row need not be reopened, and the bank controller (or a cache controller implemented in cache 8193) may protect the predicted row from being swapped out. Cache 8193 may provide several benefits. First, because cache 8193 loads a row into the cache and the bank controller may access cache 8193 to retrieve the row data, no special bank, or more than one bank, is required for the next row prediction. Second, reading from and writing to cache 8193 may save energy because the physical distance from bank controller 8191 to cache 8193 is less than the physical distance from bank controller 8191 to memory bank 8180. Third, the latency incurred by cache 8193 is typically lower than that of memory bank 8180 because cache 8193 is smaller and closer to controller 8191. In some cases, when the predicted next row is activated in memory bank 8180 by bank controller 8191, the identifier of the predicted next row generated by the predicted address generator may be stored in cache 8193, for example. Based on program execution, etc., current address generator 8192B may identify the actual next row in memory bank 8180 to be accessed. The identifier associated with the actual next row to be accessed may be compared to the identifier of the predicted next row stored in cache 8193. If the actual next row to be accessed is the same as the predicted next row, bank controller 8191 may begin a read operation with respect to the actual next row (which may already be fully or partially activated due to the next row prediction process) after activation of that row has completed. On the other hand, if the actual next row to be accessed (as determined by current address generator 8192B) does not match the predicted next row identifier stored in cache 8193, then the read operation will not begin with respect to the fully or partially activated predicted next row; instead, the system will begin activating the actual next row to be accessed.
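The cache-assisted flow of FIG. 81B could be sketched as follows; the RowCache class, its simple FIFO-style eviction, and the bank.activate_and_read() call are all illustrative assumptions:

```python
class RowCache:
    """Stands in for cache 8193: holds the last few rows of data."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.rows = {}                    # row address -> row data

    def fill(self, row, data):
        if len(self.rows) >= self.capacity:
            self.rows.pop(next(iter(self.rows)))  # simple eviction policy
        self.rows[row] = data

    def lookup(self, row):
        return self.rows.get(row)

def read_next(cache, bank, actual_row):
    data = cache.lookup(actual_row)
    if data is not None:                  # predicted row already cached
        return data                       # no re-activation needed
    return bank.activate_and_read(actual_row)  # miss: open the real row
```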
Dual activation bank
As discussed, it is valuable to provide mechanisms that allow building memory banks capable of activating one row while another row is still being processed. Several embodiments may be provided for banks that activate an additional row while another row is being accessed. Although the embodiments describe activation of only two rows, it should be appreciated that they may be applied to more rows. In the first suggested embodiment, the memory bank may be divided into memory sub-banks, and the described embodiments may be used to perform read operations with respect to a row in one sub-bank while activating a predicted or desired next row in another sub-bank. For example, as shown in FIG. 81C, memory bank 8180 may be configured to include a plurality of memory sub-banks 8181. Additionally, the bank controller 8191 associated with memory bank 8180 may include a plurality of sub-bank controllers associated with corresponding sub-banks. A first sub-bank controller of the plurality of sub-bank controllers may be configured to enable access to data in a current row included in a first sub-bank, and a second sub-bank controller may activate a next row in a second sub-bank. Only one column decoder may be needed when the words of only one sub-bank are accessed at a time. Both sub-banks may be tied to the same output bus so as to appear as a single bank. The input of this new single bank may then be a single address plus an additional row address for opening the next row.
FIG. 81C illustrates first and second sub-bank row controllers (8183A, 8183B) for each memory sub-bank 8181. Memory bank 8180 may include multiple sub-banks 8181, as shown in FIG. 81C. Additionally, bank controller 8191 may include a plurality of sub-bank controllers 8183A-8183B, each associated with a corresponding sub-bank 8181. A first sub-bank controller 8183A may be configured to enable access to data in a current row included in a first portion of sub-bank 8181, while a second sub-bank controller 8183B may activate a next row in a second portion of sub-bank 8181.
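A sketch of how the two sub-bank controllers might be coordinated, assuming the even/odd row assignment mentioned later in this section and hypothetical read()/activate() methods:

```python
def subbank_for(row):
    """Assumed assignment: sub-bank 0 holds even rows, sub-bank 1 odd rows."""
    return row % 2

def access_and_preactivate(controllers, current_row, next_row):
    # The two sub-bank controllers may work concurrently as long as the
    # accessed row and the activated row fall in different sub-banks.
    controllers[subbank_for(current_row)].read(current_row)
    if subbank_for(next_row) != subbank_for(current_row):
        controllers[subbank_for(next_row)].activate(next_row)
```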
Because activating a row immediately adjacent to the row being accessed may distort and/or corrupt the data being read from the accessed row, the disclosed embodiments may be configured, for example, such that the predicted next row to be activated is separated from the current row of data being accessed in the first sub-bank by at least two rows. In some embodiments, the rows to be activated may be separated by at least one pad, such that activation may be performed in different pads. The second sub-bank controller may be configured such that data included in a current row of the second sub-bank is accessed while the first sub-bank controller activates a next row in the first sub-bank. The activated next row of the first sub-bank may be separated from the current row of data being accessed in the second sub-bank by at least two rows.
This predefined distance between the row being read/accessed and the row being activated may be enforced by, for example, hardware coupling different portions of the memory bank to different row decoders, while software maintains the predefined distance so as not to corrupt the data. The spacing between the accessed row and the activated row may be more than two rows (e.g., 3 rows, 4 rows, 5 rows, or even more than 5 rows). The distance may change over time, for example based on an evaluation of distortion introduced in the stored data. The distortion may be evaluated in various ways, such as by calculating a signal-to-noise ratio, an error rate, the error correction needed to repair the distortion, and the like. If two rows are far enough apart and two bank controllers are implemented on the same bank, then in fact both rows can be activated. The new architecture (implementing two controllers on the same bank) may prevent multiple rows in the same pad from being opened.
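The distance rule reduces to a simple guard; the sketch below assumes a minimum distance of two rows per the text, with the value left adjustable as the paragraph suggests:

```python
MIN_ROW_DISTANCE = 2  # per the text; may grow if distortion is observed

def safe_to_activate(current_row, predicted_row,
                     min_distance=MIN_ROW_DISTANCE):
    """Allow speculative activation only when the rows are far enough apart."""
    return abs(predicted_row - current_row) >= min_distance

assert safe_to_activate(10, 13)       # three rows apart: allowed
assert not safe_to_activate(10, 11)   # adjacent rows: not allowed
```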
FIG. 81D illustrates an embodiment of next row prediction consistent with embodiments of the present invention. The embodiment may include an additional pipeline of flip-flops (address registers A-C). The pipeline may be implemented with any number of flip-flop stages matching the delay required for activation, and the overall execution after the address generator may be delayed so as to use the delayed address; the prediction is then simply the newly generated address (at the beginning of the pipeline, below address register C), while the current address is taken from the end of the pipeline. In this embodiment, no duplicate address generator is required. A selector (the multiplexer shown in FIG. 81D) may be added to configure the delay, while the address registers provide the delay.
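The flip-flop pipeline of FIG. 81D can be modeled as a shift register whose oldest stage supplies the current address while the newest entry is the prediction; the deque-based model and the tap() selector standing in for the multiplexer are illustrative only:

```python
from collections import deque

class DelayedAddressPipeline:
    """Models address registers A-C of FIG. 81D as a shift register."""

    def __init__(self, stages=3):
        self.regs = deque([None] * stages, maxlen=stages)

    def clock(self, new_address):
        """Each cycle, the new address (the prediction) enters the head;
        the address leaving the tail serves as the current address."""
        current = self.regs[0]         # oldest stage = delayed address
        self.regs.append(new_address)  # newest entry = predicted address
        return current

    def tap(self, stage):
        """Multiplexer-style selection of a configurable delay."""
        return self.regs[stage]

pipe = DelayedAddressPipeline(stages=3)
for addr in (100, 101, 102, 103):
    current = pipe.clock(addr)         # current lags the prediction by 3
```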
FIG. 81E illustrates an embodiment of a memory bank consistent with embodiments of the present invention. The memory bank may be implemented such that if the newly activated row is sufficiently far from the currently open row, activating the new row will not corrupt the current row. As shown in FIG. 81E, a memory bank may include additional memory pads (black) between every two rows of pads; thus, a control unit, such as a row decoder, may activate multiple rows that are separated by at least one pad.
In some embodiments, the memory unit may be configured to receive a first address for processing and a second address for activation and access at predetermined times.
FIG. 81F illustrates another embodiment of a memory bank consistent with embodiments of the present invention. The memory bank may be implemented such that if the newly activated row is sufficiently far from the currently open row, activating the new row will not corrupt the current row. The embodiment depicted in FIG. 81F may allow the row decoders to open rows n and n+1 by ensuring that all even rows are implemented in the upper half of the memory bank and all odd rows are implemented in the lower half of the memory bank. Such implementations may allow access to consecutive rows that are always far enough apart.
In accordance with the disclosed embodiments, dual control memory banks may allow different portions of a single memory bank to be accessed and activated, even when the dual control memory bank is configured to output one data unit at a time. For example, as described, dual control may enable a memory bank to access a first row when a second row (e.g., a predicted next row or a predetermined next row to be accessed) is activated.
FIG. 82 illustrates a dual control memory bank 8280 for reducing the memory row activation penalty (e.g., latency) consistent with embodiments of the invention. The dual control memory bank 8280 may comprise inputs including a data input (DIN) 8290, a row address (ROW) 8291, a column address (COLUMN) 8292, a first command input (COMMAND_1) 8293, and a second command input (COMMAND_2) 8294. The memory bank 8280 may comprise a data output (Dout) 8295.
Assume that the address may comprise a row address and a column address, and that there are two row decoders. Other configurations of addresses may be provided, the number of row decoders may exceed two, and there may be more than a single column decoder.
The row address (ROW) 8291 may identify a row associated with a command, such as an activate command. Because a row, once activated, may then be read from or written to, it may not be necessary to resend the row address for writing to or reading from the open row after the row has been opened (i.e., after its activation).
A first command input (COMMAND_1) 8293 may be used to send commands, such as, but not limited to, an activate command, to the rows accessed by the first row decoder. A second command input (COMMAND_2) 8294 may be used to send commands, such as, but not limited to, an activate command, to the rows accessed by the second row decoder.
The data input (DIN) 8290 may be used to feed in data when performing a write operation.
Because an entire row cannot be read at once, sections of a row may be read sequentially, and the column address (COLUMN) 8292 may indicate which section(s) of the row to read. For simplicity of explanation, it may be assumed that there are 2^Q sections and that the column input has Q bits, where Q is a positive integer greater than one.
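A small sketch of this section addressing, assuming Q = 5 so that 2^Q = 32 sections (matching the 32-cycle example used below) and a hypothetical bank.read_section() call:

```python
Q = 5                      # example column width in bits (assumed)
NUM_SECTIONS = 2 ** Q      # -> 32 sections per row

def read_full_row(bank, row):
    """Reading a whole row takes NUM_SECTIONS sequential section reads."""
    return [bank.read_section(row, col) for col in range(NUM_SECTIONS)]
```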
The dual control memory bank 8280 may operate with or without the address prediction described above with respect to fig. 81A-81B. Of course, to reduce operating latency, dual control memory banks may operate with address prediction according to the disclosed embodiments.
FIGS. 83A, 83B, and 83C illustrate examples of accessing and activating rows of the memory bank 8180. As mentioned above, assume that in one example, 32 cycles (sections) are required both to read a row and to activate a row. Additionally, to reduce the activation penalty (whose length is expressed as Delta), it may be beneficial to know in advance (at least Delta before the next row needs to be accessed) that the next row should be opened. In some cases, Delta may be equal to four cycles. Each memory bank depicted in FIGS. 83A, 83B, and 83C may include two or more sub-banks, within which, in some embodiments, only one row may be open at any given time. In some cases, even rows may be associated with a first sub-bank and odd rows may be associated with a second sub-bank. In this example, using the disclosed predictive addressing embodiments may enable activation of a row of one sub-bank to begin before the end of a read operation on a row of another sub-bank is reached (a Delta period before reaching the end). In this way, sequential memory accesses (e.g., a predefined sequence of memory accesses in which rows 1, 2, 3, 4, 5, 6, 7, 8 ... are to be read, and rows 1, 3, 5 ..., etc. are associated with a first sub-bank while rows 2, 4, 6 ..., etc. are associated with a second, different sub-bank) may be performed in an efficient manner.
FIG. 83A may illustrate a state for accessing rows of memory included in two different memory sub-banks. In the state shown in FIG. 83A:
a. Row A may be accessible by a first row decoder. The first section (the leftmost section, shown in gray) can be accessed after the first row decoder activates row A.
b. Row B may be accessible by a second row decoder. In the state shown in FIG. 83A, row B is closed and has not yet been activated.
The state illustrated in FIG. 83A may be preceded by sending an activate command and the address of row A to the first row decoder.
FIG. 83B illustrates the state for accessing row B after accessing row A. According to this example, row A may be accessible by the first row decoder. In the state shown in FIG. 83B, the first row decoder has activated row A, and all but the four rightmost sections (the four sections not shown in gray) have been accessed. Because Delta (the four white sections in row A) is equal to four cycles, the bank controller can enable the second row decoder to activate row B before the rightmost section in row A is accessed. In some cases, activating row B may be in response to a predetermined access pattern (e.g., sequential row access, where odd rows are assigned to the first sub-bank and even rows to the second sub-bank). In other cases, activating row B may be in response to any of the row prediction techniques described above. The bank controller may enable the second row decoder to pre-activate row B such that row B is already activated (open) when it is accessed, rather than waiting for row B to be activated at access time.
The state illustrated in FIG. 83B may be preceded by the following operations:
a. The activate command and the address of row A are sent to the first row decoder.
b. The first twenty-eight sections of row A are written to or read from.
c. After the read or write operations to the twenty-eight sections of row A, an activate command with the address of row B is sent to the second row decoder.
In some embodiments, the even-numbered rows are located in one half of the one or more memory banks. In some embodiments, the odd-numbered rows are located in the other half of the one or more memory banks.
In some embodiments, a row of additional redundant pads is placed between every two pad rows to establish a distance that allows activation. In some embodiments, multiple rows in proximity to each other may not be activated at the same time.
FIG. 83C may illustrate the state for accessing row C (e.g., the next odd row included in the first sub-bank) after accessing row A. As shown in FIG. 83C, row B may be accessible by the second row decoder. As shown, the second row decoder has activated row B, and all but the four rightmost sections (the four remaining sections not shown in gray) have been accessed. Because Delta is equal to four cycles in this example, the bank controller may enable the first row decoder to activate row C before the rightmost section in row B is accessed. The bank controller may enable the first row decoder to pre-activate row C such that row C is already activated when it is accessed, rather than waiting for row C to be activated. Operating in this manner may reduce or entirely eliminate the latency associated with memory read operations.
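The sequence of FIGS. 83A-83C can be condensed into a cycle-level sketch: 32 section reads per row, with activation of the next row (on the other decoder) starting Delta = 4 cycles before the current row finishes. The generator below is an illustration of that schedule, not an implementation from the source:

```python
SECTIONS = 32   # section reads per row (from the example above)
DELTA = 4       # cycles needed to activate a row

def stream_rows(rows):
    """Yield (cycle, event) pairs for sequential, two-decoder access."""
    cycle = 0
    for i, row in enumerate(rows):
        for section in range(SECTIONS):
            if section == SECTIONS - DELTA and i + 1 < len(rows):
                # Start activating the next row early, on the other
                # row decoder, so it is already open when its turn comes.
                yield cycle, f"activate row {rows[i + 1]}"
            yield cycle, f"read row {row} section {section}"
            cycle += 1

for cyc, event in stream_rows(["A", "B", "C"]):
    pass  # row B is fully open by the time row A's last section is read
```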
Memory mat as register file
In computer architectures, processor registers constitute storage locations that are quickly accessible to a computer processor (e.g., a central processing unit (CPU)). Registers typically occupy the memory locations closest to the processor core (L0). Registers may provide the fastest way to access certain types of data. Computers may have several types of registers, each classified according to the type of information it stores or the type of instruction that operates on information in that type of register. For example, a computer may include: data registers to hold numerical information, operands, intermediate results, and configurations; address registers storing address information used by instructions to access main memory; general purpose registers storing both data and address information; status registers; and other registers. A register file comprises a logical group of registers available for use by the computer processing unit.
In many cases, a computer's register file is located within a processing unit (e.g., a CPU) and implemented by logic transistors. In the disclosed embodiments, however, the arithmetic processing units may not reside in a conventional CPU. Rather, such processing elements (e.g., processor subunits) may be spatially distributed (as described in the sections above) within a memory chip as a processing array. Each processor subunit may be associated with one or more corresponding and dedicated memory units (e.g., memory banks). Via this architecture, each processor subunit may be spatially located near the one or more memory elements that store the data on which that particular processor subunit is to operate. As described herein, this architecture can significantly speed up certain memory-intensive operations by, for example, eliminating the memory access bottlenecks experienced by typical CPU and external memory architectures.
However, the distributed processor memory chip architecture described herein may still utilize a register file that includes various types of registers for operating on data from memory elements dedicated to the corresponding processor subunit. Because the processor subunits may be distributed among the memory elements of the memory chip, it is possible to add to the corresponding processor subunit one or more memory elements (which may benefit from being fabricated by a memory-oriented manufacturing process rather than the process used for logic components) to act as a register file or cache for that processor subunit, rather than as primary memory storage.
This architecture may provide several advantages. For example, because the register file is part of the corresponding processor subunit, the processor subunit may be spatially located near the relevant register file. This configuration can significantly increase operating efficiency. Conventional register files are implemented with logic transistors. For example, each bit of a conventional register file is made of approximately 12 logic transistors, so a 16-bit register is made of 192 logic transistors. Such a register file may also require a large number of logic components for access, and thus may occupy a large area. The register file of the disclosed embodiments may require significantly less space than a register file implemented with logic transistors. This size reduction may be achieved by implementing the register file of the disclosed embodiments using memory pads comprising memory cells fabricated by processes optimized for fabricating memory structures rather than logic structures. The reduced size may also allow for larger register files or caches.
In some embodiments, distributed processor memory chips may be provided. The distributed processor memory chip may include: a substrate; a memory array disposed on a substrate and comprising a plurality of discrete memory banks; and a processing array disposed on the substrate and including a plurality of processor subunits. Each of the processor subunits may be associated with a corresponding dedicated memory bank of a plurality of discrete memory banks. The distributed processor memory chip may also include a first plurality of buses and a second plurality of buses. Each of the first plurality of buses may connect one of the plurality of processor subunits to its corresponding dedicated memory bank. Each of the second plurality of buses may connect one of the plurality of processor sub-units to another of the plurality of processor sub-units. In some cases, the second plurality of buses may connect one or more of the plurality of processor sub-units to two or more other processor sub-units among the plurality of processor sub-units. One or more of the processor subunits may also include at least one memory pad disposed on the substrate. The at least one memory pad may be configured to act as at least one register of a register file for one or more of the plurality of processing subunits.
In some cases, the register file may be associated with one or more logic components to enable the memory mat to function as one or more registers of the register file. For example, such logic elements may include switches, amplifiers, inverters, sense amplifiers, and others. In examples where the register file is implemented by a Dynamic Random Access Memory (DRAM) pad, logic components may be included to perform refresh operations to prevent loss of stored data. Such logic elements may include row and column multiplexers ("muxes"). Furthermore, the register file implemented by the DRAM pad may include a redundancy mechanism to counter yield degradation.
FIG. 84 illustrates a conventional computer architecture 8400 including a CPU 8402 and an external memory 8406. During operation, values from memory 8406 may be loaded into registers associated with register file 8404 included in CPU 8402.
FIG. 85A illustrates an exemplary distributed processor memory chip 8500a consistent with the disclosed embodiments. In contrast to the architecture of FIG. 84, the distributed processor memory chip 8500a includes memory components and processor components disposed on the same substrate. That is, chip 8500a may include a memory array and a processing array that includes a plurality of processor subunits, each associated with one or more dedicated memory banks included in the memory array. In the architecture of FIG. 85A, the registers used by the processor subunits are provided by one or more memory pads disposed on the same substrate on which the memory array and the processing array are formed.
As depicted in fig. 85A, a distributed processor memory chip 8500a may be formed by a plurality of processing groups 8510a, 8510b, and 8510c disposed on a substrate 8502. More specifically, distributed processor memory chip 8500a may include a memory array 8520 and a processing array 8530 disposed on a substrate 8502. The memory array 8520 may include a plurality of memory banks, such as memory banks 8520a, 8520b, and 8520 c. Processing array 8530 may include multiple processor subunits, such as processor subunits 8530a, 8530b, and 8530 c.
Further, each of processing groups 8510a, 8510b, and 8510c may include a processor subunit and one or more corresponding memory banks dedicated to the processor subunit. In the embodiment depicted in FIG. 85A, each of processor sub-units 8530a, 8530b, and 8530c may be associated with a corresponding dedicated memory bank 8520a, 8520b, or 8520 c. That is, processor subunit 8530a may be associated with memory bank 8520 a; processor subunit 8530b can be associated with memory bank 8520 b; and processor subunit 8530c may be associated with memory bank 8520 c.
To allow each processor subunit to communicate with its corresponding dedicated memory bank, distributed processor memory chip 8500a may include a first plurality of buses 8540a, 8540b, and 8540c connecting one of the processor subunits to its corresponding dedicated memory bank. In the embodiment depicted in FIG. 85A, a bus 8540a may connect processor subunit 8530a to memory bank 8520 a; bus 8540b may connect processor subunit 8530b to memory bank 8520 b; and a bus 8540c may connect processor subunit 8530c to memory bank 8520 c.
Further, to allow each processor subunit to communicate with other processor subunits, the distributed processor memory chip 8500a may include a second plurality of buses 8550a and 8550b that connect one of the processor subunits to at least one other processor subunit. In the embodiment depicted in FIG. 85A, bus 8550a may connect processor subunit 8530a to processor subunit 8530b, and bus 8550b may connect processor subunit 8530b to processor subunit 8530c, and so on.
Each of the discrete memory banks 8520a, 8520b, and 8520c may comprise a plurality of memory pads. In the embodiment depicted in FIG. 85A, memory bank 8520a may include memory pads 8522a, 8524a, and 8526a; memory bank 8520b may include memory pads 8522b, 8524b, and 8526b; and memory bank 8520c may include memory pads 8522c, 8524c, and 8526c. As previously disclosed with respect to FIG. 10, a memory pad may include a plurality of memory cells, and each cell may include a capacitor, transistor, or other circuitry that stores at least one data bit. A conventional memory pad may include, for example, 512 bits by 512 bits, although embodiments disclosed herein are not so limited.
At least one of processor sub-units 8530a, 8530b, and 8530c may comprise at least one memory pad, such as memory pads 8532a, 8532b, and 8532c, configured to serve as a register file for the corresponding processor sub-unit 8530a, 8530b, and 8530 c. That is, at least one memory pad 8532a, 8532b, and 8532c provides at least one register of a register file used by one or more of processor subunits 8530a, 8530b, and 8530 c. The register file may include one or more registers. In the embodiment depicted in fig. 85A, memory pad 8532a in processor subunit 8530a may serve as a register file (also referred to as "register file 8532 a") for processor subunit 8530a (and/or any other processor subunit included in distributed processor memory chip 8500 a); memory pad 8532b in processor subunit 8530b may serve as a register file for processor subunit 8530 b; and memory pad 8532c in processor subunit 8530c may serve as a register file for processor subunit 8530 c.
At least one of processor sub-units 8530a, 8530b, and 8530c may also include at least one logic component, such as logic components 8534a, 8534b, and 8534 c. Each logic component 8534a, 8534b, or 8534c may be configured to enable a corresponding memory pad 8532a, 8532b, or 8532c to function as a register file for a corresponding processor subunit 8530a, 8530b, or 8530 c.
In some embodiments, at least one memory mat may be disposed on the substrate, and the at least one memory mat may contain at least one redundant memory bit configured to provide at least one redundant register for one or more of the plurality of processor subunits. In some embodiments, at least one of the processor subunits may include a mechanism to stop the current task and at some time trigger a memory refresh operation to refresh the memory pads.
FIG. 85B illustrates an exemplary distributed processor memory chip 8500b consistent with the disclosed embodiments. The memory chip 8500b illustrated in FIG. 85B is substantially the same as the memory chip 8500a illustrated in FIG. 85A, except that memory pads 8532a, 8532b, and 8532c in FIG. 85B are not included in the corresponding processor subunits 8530a, 8530b, and 8530c. Instead, memory pads 8532a, 8532b, and 8532c in FIG. 85B are disposed outside of, but spatially near, the corresponding processor subunits 8530a, 8530b, and 8530c. In this way, memory pads 8532a, 8532b, and 8532c may still act as register files for the corresponding processor subunits 8530a, 8530b, and 8530c.
FIG. 85C illustrates a device 8500c consistent with the disclosed embodiments. The device 8500c includes a substrate 8560, a first memory bank 8570, a second memory bank 8572, and a processing unit 8580. The first memory bank 8570, the second memory bank 8572, and the processing unit 8580 are disposed on the substrate 8560. Processing unit 8580 includes a processor 8584 and a register file 8582 implemented by a memory pad. During operation of processing unit 8580, processor 8584 may access register file 8582 to read or write data.
Distributed processor memory chips 8500a, 8500b or device 8500c may provide multiple functions based on the processor subunits' access to registers provided by the memory mats. For example, in some embodiments, distributed processor memory chips 8500a or 8500b may include a processor subunit that acts as an accelerator coupled to memory, allowing it to use more memory bandwidth. In the embodiment depicted in fig. 85A, processor subunit 8530a may act as an accelerator (also referred to as "accelerator 8530 a"). Accelerator 8530a may use memory pad 8532a disposed in accelerator 8530a to provide one or more registers of a register file. Alternatively, in the embodiment depicted in FIG. 85B, the accelerator 8530a may use a memory pad 8532a disposed outside the accelerator 8530a as a register file. Still further, accelerator 8530a may use any of memory pads 8522b, 8524b, and 8526b in memory bank 8520b or any of memory pads 8522c, 8524c, and 8526c in memory bank 8520c to provide one or more registers.
The disclosed embodiments may be particularly applicable to certain types of image processing, neural networks, database analysis, compression and decompression, and more. For example, in the embodiments of FIG. 85A or 85B, the memory mat may provide one or more registers of a register file for one or more processor subunits included on the same chip as the memory mat. One or more registers may be used to store data that is frequently accessed by the processor subunits. For example, during convolutional image processing, a convolution accelerator may repeatedly use the same coefficients over an entire image stored in memory. A proposed implementation for this convolution accelerator may keep all of these coefficients in a nearby ("close") register file, within one or more registers included in a memory mat dedicated to one or more processor subunits located on the same chip as the register file memory mat. This architecture places the registers (and stored coefficient values) in close proximity to the processor subunit operating on the coefficient values. Because the register file implemented by the memory pad can act as an efficient, spatially compact cache, significantly lower data transfer losses and lower access latency can be achieved.
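As an illustration of the coefficient-reuse pattern (a sketch only; the kernel values and row-wise loop are assumptions, not the accelerator's actual design):

```python
register_file = [1, 2, 1]   # kernel coefficients held in the mat-backed
                            # register file, close to the processor subunit

def convolve_row(image_row):
    """1-D convolution of one image row; the coefficients never leave the
    register file, so only image data is fetched from the memory bank."""
    width = len(register_file)
    return [
        sum(register_file[k] * image_row[i + k] for k in range(width))
        for i in range(len(image_row) - width + 1)
    ]

print(convolve_row([0, 1, 0, 0, 2, 0]))  # -> [2, 1, 2, 4]
```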
In another example, the disclosed embodiments may include an accelerator that may load words into registers provided by a memory mat. The accelerator may treat the registers as a circular buffer in order to multiply vectors in a single cycle. For example, in the device 8500c illustrated in FIG. 85C, the processor 8584 in the processing unit 8580 acts as an accelerator, using the register file 8582 implemented by a memory pad as a circular buffer to store data A1, A2, A3, .... The first memory bank 8570 stores data B1, B2, B3, ... to be multiplied by data A1, A2, A3, .... The second memory bank 8572 stores the multiplication results C1, C2, C3, .... That is, Ci = Ai × Bi. If no register file existed in processing unit 8580, processor 8584 would require more memory bandwidth and more cycles to read both data A1, A2, A3, ... and data B1, B2, B3, ... from external memory banks, such as memory banks 8570 or 8572, which could create significant latency. In the present embodiment, on the other hand, the data A1, A2, A3, ... are stored in the register file 8582 formed within processing unit 8580. Thus, processor 8584 only needs to read the data B1, B2, B3, ... from the external memory bank 8570, and the required memory bandwidth may be significantly reduced.
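A sketch of the circular-buffer usage (the index wrapping and concrete values are assumptions; the text only states Ci = Ai × Bi):

```python
A = [3, 5, 7]              # held in register file 8582 (circular buffer)
B = [2, 4, 6, 8, 10, 12]   # streamed from memory bank 8570

# Each cycle, one B value is fetched from the external bank while the
# matching A value comes from the local register file; products go to
# memory bank 8572.
C = [B[i] * A[i % len(A)] for i in range(len(B))]
print(C)  # -> [6, 20, 42, 24, 50, 84]
```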
In memory processing, the memory pads typically allow unidirectional access (i.e., a single access). In unidirectional access, there is one port to memory. As a result, only one access operation, such as a read or write, to a particular address may be performed at a time. However, if the memory pad itself allows bi-directional access, then bi-directional access may be an effective option. In bidirectional access, two different addresses may be accessed at a time. The method of accessing the memory pad may be determined based on area and requirements. In some cases, register files implemented by memory pads may allow four-way access if they are connected to a processor that needs to read two sources and has one destination register. In some cases, the register file may only allow one-way access when implemented by a DRAM pad to store configuration or cache data. A standard CPU may include multi-directional access pads, while uni-directional access pads may be better for DRAM applications.
When a controller or accelerator is designed in such a way that it needs only a single access to its registers (possible in a few cases), memory-pad-implemented registers may be used instead of a traditional register file. In a single access, only one word can be accessed at a time. For example, a processing unit may access two words from two register files at a time, where each of the two register files is implemented by a memory pad (e.g., a DRAM pad) that allows only a single access.
In most technologies, memory pad IPs, which are closed blocks (IPs) obtained from the manufacturer, come with wiring, such as word lines and bit lines, in place for row and column access. However, the memory pad IP does not include the surrounding logic components. Thus, a register file implemented by a memory mat as disclosed in embodiments of the present invention may add those logic components. The size of the memory pad may be selected based on the desired size of the register file.
Certain challenges may arise when using a memory pad to provide registers of a register file, and these challenges may depend on the particular memory technology used to form the memory pad. For example, in memory production, not all manufactured memory cells may operate properly after production. This is a known problem, especially if there is a high density of SRAM or DRAM on the chip. To address this issue in memory technology, one or more redundancy mechanisms may be used in order to maintain yield at a reasonable level. In the disclosed embodiments, because the number of memory instances (e.g., memory banks) used to provide registers of a register file may be quite small, the redundancy mechanism may not be as important as in normal memory applications. On the other hand, the same production issues that affect memory functionality may also affect whether a particular memory pad may function properly when providing one or more registers. As a result, redundant elements may be included in the disclosed embodiments. For example, at least one redundant memory pad may be disposed on a substrate of a distributed processor memory chip. The at least one redundant memory pad may be configured to provide at least one redundant register for one or more of the plurality of processor subunits. In another example, the pads may be larger than desired (e.g., 620 × 620 instead of 512 × 512), and redundancy mechanisms may be built into regions of the memory pads outside the 512 × 512 region or its equivalent.
Another challenge may be timing related. The time to load the word and bit lines is typically determined by the size of the memory. Since the register file may be implemented by a relatively small single memory pad (e.g., 512 x 512 bits), the time required to load a word from the memory pad will be small, and the timing may be sufficient to run relatively quickly compared to the logic.
Refresh: some memory types, such as DRAM, require periodic refreshing. The refresh may be performed while the processor or accelerator is halted. For small memory pads, the refresh time may be only a small fraction of the operating time. Therefore, even if the system is stopped for a short period, the gain obtained by using the memory pad as a register is worth the downtime from an overall performance perspective. In one embodiment, the processing unit may comprise a counter counting down from a predefined number. When the counter reaches 0, the processing unit may stop the current task being executed by the processor (e.g., accelerator) and trigger a refresh operation that refreshes the memory pads row by row. When the refresh operation is complete, the processor may resume its task and the counter may be reset to count down from the predefined number.
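The countdown-and-refresh behavior might be sketched as follows, assuming hypothetical processor.halt()/resume() and pad.refresh_row() interfaces and an arbitrary counter period:

```python
class RefreshController:
    """Counts down and refreshes the memory pad row by row at zero."""

    def __init__(self, period=1024, rows=512):  # assumed values
        self.period = period
        self.counter = period
        self.rows = rows

    def tick(self, processor, pad):
        self.counter -= 1
        if self.counter == 0:
            processor.halt()                 # stop the current task
            for row in range(self.rows):
                pad.refresh_row(row)         # refresh row by row
            processor.resume()               # resume the task
            self.counter = self.period       # reset; count down again
```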
FIG. 86 provides a flowchart 8600 representative of an exemplary method for executing at least one instruction in a distributed processor memory chip consistent with the disclosed embodiments. For example, at step 8602, at least one data value may be retrieved from a memory array on a substrate of the distributed processor memory chip. At step 8604, the retrieved data value may be stored in a register provided by a memory pad on the substrate of the distributed processor memory chip. At step 8606, a processor element, such as one or more of the distributed processor subunits on board the distributed processor memory chip, may operate on the stored data value from the memory pad register.
Here and throughout, it should be understood that all references to a register file apply equally to a cache, as the register file may be the lowest level cache.
Processing bottlenecks
The terms "first," "second," "third," and the like are used solely to distinguish one from another. These terms may not imply an order and/or timing and/or importance to the components. For example, the first process may be preceded by a second process, and the like.
The term "coupled" may mean directly connected and/or indirectly connected.
The terms "memory/processing," "memory and processing," and "memory processing" are used interchangeably.
Various methods, computer-readable media, memory/processing units, and/or systems may be provided that may be a memory/processing unit.
The memory/processing unit is a hardware unit with memory and processing capabilities.
The memory/processing unit may be, may be included in, or may include one or more memory processing integrated circuits.
The memory/processing unit may be a distributed processor as illustrated in PCT patent application publication WO 2019025892.
The memory/processing unit may comprise a distributed processor as illustrated in PCT patent application publication WO 2019025892.
The memory/processing unit may belong to a distributed processor as illustrated in PCT patent application publication WO 2019025892.
The memory/processing unit may be a memory chip as illustrated in PCT patent application publication WO 2019025892.
The memory/processing unit may comprise a memory chip as illustrated in PCT patent application publication WO 2019025892.
The memory/processing unit may be a distributed processor as illustrated in PCT patent application No. PCT/IB 2019/001005.
The memory/processing unit may belong to a distributed processor as illustrated in PCT patent application No. PCT/IB 2019/001005.
The memory/processing unit may be a memory chip as described in PCT patent application No. PCT/IB 2019/001005.
The memory/processing unit may comprise memory chips as described in PCT patent application No. PCT/IB 2019/001005.
The memory/processing unit may belong to a memory chip as illustrated in PCT patent application No. PCT/IB 2019/001005.
The memory/processing unit may be implemented as integrated circuits that are connected to each other using inter-wafer bonding and multiple conductors.
Any reference to a distributed processor memory chip, a distributed memory processing integrated circuit, a memory chip, a distributed processor may be implemented as a pair of integrated circuits connected to each other by an inter-wafer bond and a plurality of conductors.
The memory/processing unit may be manufactured by a first manufacturing process that is better suited to memory cells than to logic cells. Thus, the first fabrication process may be considered a memory-class fabrication process, and it may be applied to fabricate the memory banks. A logic cell may include one or more transistors that together perform a logic function and may be used as the basic building block of larger logic circuits. A memory cell may include one or more transistors that together perform a memory function and may be used as the basic building block of larger memory circuits. A corresponding logic cell is one that performs the same logic function.
The memory/processing unit may be different from a processor, processing integrated circuit, and/or processing unit that is fabricated by a second fabrication process that is better suited to logic cells than to memory cells. Thus, the second manufacturing process may be considered a logic-class manufacturing process. The second fabrication process may be used to fabricate CPUs, GPUs, and the like.
The memory/processing unit may be more suitable for performing less arithmetic-intensive operations than the processor, processing integrated circuit, and/or processing unit.
For example, memory cells fabricated by the first fabrication process may exhibit critical dimensions that exceed, and even greatly exceed (e.g., by 2 times, 3 times, 4 times, 5 times, 6 times, 7 times, 8 times, 9 times, 10 times, and the like), the critical dimensions of logic circuits fabricated by the second fabrication process.
The first manufacturing process may be an analog manufacturing process, the first manufacturing process may be a DRAM manufacturing process, and the like.
The size of a logic cell fabricated by the first fabrication process may exceed the size of a corresponding logic cell fabricated by the second fabrication process by at least a factor of two. The corresponding logic cell may have the same functionality as the logic cell fabricated by the first fabrication process.
The second fabrication process may be a digital fabrication process.
The second fabrication process may be any one of complementary metal oxide semiconductor (CMOS), bipolar CMOS (BiCMOS), double-diffused metal oxide semiconductor (DMOS), and silicon-on-insulator (SOI) fabrication processes, and the like.
The memory/processing unit may include multiple processor subunits.
The processor sub-units of one or more memory/processing units may operate independently of each other and/or may cooperate with each other and/or perform distributed processing. Distributed processing may be performed in various ways, such as in a planar manner or in a hierarchical manner.
The planar approach may involve having the processor subunits perform the same operations (and processing results may or may not be output between the processor subunits).
A hierarchical approach may involve performing a sequence of processing operations at different levels, with processing operations at one level being performed after processing operations at another level. The processor subunits may be assigned (dynamically or statically) to different layers and participate in hierarchical processing.
Distributed processing may also involve other units, such as controllers of memory/processing units and/or units not belonging to memory/processing units.
The terms logic and processor subunit are used interchangeably.
Any of the processes mentioned in the present application may be performed in any manner (distributed and/or non-distributed and the like).
In the following applications, PCT patent application publication WO2019025892 and PCT patent application No. PCT/IB2019/001005 (9/2019) are variously referenced and/or incorporated by reference. PCT patent application publication nos. WO2019025892 and/or PCT patent application nos. PCT/IB2019/001005 provide non-limiting examples of various methods, systems, processors, memory chips, and the like. Other methods, systems, processors may be provided.
A processing system (system) may be provided in which a processor is preceded by one or more memory/processing units, each memory and processing unit (memory/processing unit) having processing resources and storage resources.
A processor may request or issue instructions to one or more memory/processing units to perform various processing tasks. Execution of various processing tasks may relieve the processor of burden, reduce latency, and in some cases reduce the overall information bandwidth between one or more memory/processing units and the processor, and the like.
The processor may provide instructions and/or requests with different granularities, e.g., the processor may send instructions for certain processing resources or may send higher order instructions for memory/processing units without specifying any processing resources.
The memory/processing units may manage their processing and/or memory resources in any manner (dynamic, static, distributed, centralized, offline, online, and the like). The management of resources may be performed in the following cases: autonomously, under control of the processor, after the processor has been configured, and the like.
For example, a task may be divided into subtasks, each of which may require one or more processing resources and/or memory resources of one or more memory/processing units to execute one or more instructions. Each processing resource may be configured to execute (e.g., independently or non-independently) at least one instruction. See, e.g., the execution of instruction subsequences by processing resources of a processor subunit, as illustrated in PCT patent application publication WO 2019025892.
At least an allocation of memory resources may also be provided to entities other than the one or more memory/processing units, such as a direct memory access (DMA) unit that may be coupled to the one or more memory/processing units.
The compiler may prepare a configuration file for each type of task performed by the memory/processing unit. The configuration file includes memory allocations and processing resource allocations associated with the task type. The configuration file may include instructions that may be executed by different processing resources and/or may define memory allocations.
For example, a configuration file associated with the task of matrix multiplication (multiplying matrix A by matrix B to provide matrix C, A × B = C) may indicate where to store elements of matrix A, where to store elements of matrix B, where to store elements of matrix C, and where to store intermediate results produced during the matrix multiplication, and may include instructions for the processing resources that perform any mathematical operations associated with the matrix multiplication. A configuration file is an example of a data structure, and other data structures may be provided.
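As an illustration, a compiler-emitted configuration file of the kind described above might look like the following Python sketch; all field names and values are hypothetical assumptions, not taken from the patent:

```python
# A hypothetical sketch of a per-task configuration file for C = A x B.
# Field names, addresses, and instruction mnemonics are illustrative only.
matmul_config = {
    "task": "matmul",
    "memory_allocation": {
        "A": {"memory_unit": 0, "base_address": 0x0000},
        "B": {"memory_unit": 1, "base_address": 0x0000},
        "C": {"memory_unit": 2, "base_address": 0x0000},
        "intermediate": {"memory_unit": 3, "base_address": 0x0000},
    },
    "processing_allocation": [
        # one instruction stream per processing resource
        {"resource": p, "instructions": ["load_row", "mac", "store_partial"]}
        for p in range(4)
    ],
}
```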
The matrix multiplication may be performed by one or more memory/processing units in any manner.
The one or more memory/processing units may multiply the matrix A by a vector V. This can be done in any manner. For example, this may involve maintaining one row or column of the matrix per processing resource (a different row per processing resource), circulating values between the different processing resources, and accumulating at each processing resource the products of its row or column elements with the circulated values: during the first iteration each processing resource multiplies using the vector element initially sent to it, and during the second through last iterations it multiplies using the value received from its neighbor in the previous iteration.
Assume that matrix A is a 4 × 4 matrix, vector V is a 4 × 1 vector, and there are four processing resources. Under this assumption, the first row of matrix A is stored at the first processor subunit, the second row of matrix A is stored at the second processor subunit, the third row of matrix A is stored at the third processor subunit, and the fourth row of matrix A is stored at the fourth processor subunit. The multiplication starts by sending the first through fourth elements of vector V to the first through fourth processing resources, and multiplying the first through fourth elements of vector V by the corresponding elements of the different rows of A to provide first intermediate results. The multiplication continues by circulating the held values: each processing resource sends the value it holds to its neighboring processing resource, multiplies the value it receives by the corresponding row element, and accumulates the product with its previous intermediate result. This process is repeated until the multiplication of matrix A by vector V is complete.
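The ring-style scheme above can be sketched in a few lines of Python. This is a minimal illustrative model (one consistent reading of the scheme, not the patented implementation): each slot of the arrays stands for one processing resource, vector elements circulate between ring neighbors, and each resource accumulates one output element.

```python
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
V = [1, 0, 2, 1]
N = 4

held = list(V)   # vector element currently held at each processing resource
acc = [0] * N    # per-resource accumulator for one element of A @ V

for step in range(N):
    for p in range(N):
        col = (p + step) % N             # column of A matched by the held element
        acc[p] += A[p][col] * held[p]
    held = [held[(p + 1) % N] for p in range(N)]   # pass elements around the ring

assert acc == [sum(A[r][c] * V[c] for c in range(N)) for r in range(N)]
print(acc)   # [11, 27, 43, 59]
```

After N iterations, each processing resource holds one element of the product without any resource ever needing more than one row of A and one vector element at a time.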
FIG. 90A is an example of a system 10900 that includes one or more memory/processing units (collectively 10910) and a processor 10920. Processor 10920 may send requests or instructions (via link 10931) to the one or more memory/processing units 10910, which in turn complete (or selectively complete) the requests and/or instructions and send the results (via link 10932) to processor 10920, as described above. The processor 10920 may further process the results to provide (via link 10933) one or more outputs.
The one or more memory/processing units may include J (J is a positive integer) memory resources 10912(1,1) through 10912(1, J) and K (K is a positive integer) processing resources 10911(1,1) through 10911(1, K).
J may be equal to K or may be different from K.
The processing resources 10911(1,1) to 10911(1, K) may be, for example, processing groups or processor subunits, as illustrated in PCT patent application publication WO 2019025892.
The memory resources 10912(1,1) to 10912(1, J) may be memory instances, memory pads, memory banks, as illustrated in PCT patent application publication WO 2019025892.
There may be any connectivity and/or any functional relationship between any of the resources (memory or processing) of one or more memory/processing units.
Fig. 90B illustrates an example of the memory/processing unit 10910 (1).
In fig. 90B, K (K is a positive integer) processing resources 10911(1,1) to 10911(1, K) form a loop because the processing resources are connected to each other in series (see link 10915). Each processing resource is also coupled to its own pair of dedicated memory resources (e.g., processing resource 10911(1) is coupled to memory resources 10912(1) and 10912(2), and processing resource 10911(K) is coupled to memory resources 10912(J-1) and 10912 (J)). The processing resources may be connected to each other in any other way. The number of memory resources allocated per processing resource may be different than two. Examples of connectivity between different resources are illustrated in PCT patent application publication WO 2019025892.
FIG. 90C is an example of a system 10901 that includes N (N is a positive integer) memory/processing units 10910(1) through 10910(N) and a processor 10920. The processor 10920 may send requests or instructions (via links 10931(1) through 10931(N)) to the memory/processing units 10910(1) through 10910(N), which in turn complete the requests and/or instructions and send the results (via links 10932(1) through 10932(N)) to the processor 10920, as described above. The processor 10920 may further process the results to provide (via link 10933) one or more outputs.
FIG. 90D illustrates an example of a system 10902 including N memory/processing units 10910(1) - (N) and a processor 10920. Fig. 90D illustrates the preprocessor 10909 before the memory/processing units 10910(1) to 10910 (N). The preprocessor may perform various preprocessing operations such as frame extraction, header detection, and the like.
Fig. 90E is an example of a system 10903 that includes one or more memory/processing units 10910 and a processor 10920. Fig. 90E illustrates the preprocessor 10909 before the one or more memory/processing units 10910 and the DMA controller 10908.
Fig. 90F illustrates a method 10800 for distributed processing of at least one information stream.
The method 10800 can begin at step 10810 with receiving, by one or more memory processing integrated circuits, at least one information stream via a first communication channel, wherein each memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units.
Thus, the total size of the information stream may exceed the total size of the first processing result; in this case, compression is obtained. The total size of the information stream may reflect the amount of information received during a period of a given duration, and the total size of the first processing result may reflect the amount of first processing results output during any period of the same given duration.
Alternatively, the total size of the information stream (or of any other information entity mentioned in this specification) may be smaller than the total size of the first processing result.
One or more memory processing integrated circuits may be fabricated by a memory class of fabrication process.
One or more processing integrated circuits may be fabricated by a logic class of fabrication processes.
In a memory processing integrated circuit, each of the memory units may be coupled to a processor subunit.
Step 10840 may be followed by step 10850 of performing, by one or more processing integrated circuits, a second processing operation on the first processing result to provide a second processing result.
The first processing operation may have a lower arithmetic intensity than the second processing operation.
Disaggregated system, memory/processing unit, and method for distributed processing
A disaggregated system, a method for distributed processing, a processing/memory unit, a method for operating a disaggregated system, a method for operating a processing/memory unit, and a non-transitory computer-readable medium storing instructions for performing any of the methods may be provided. The disaggregated system allocates different subsystems to perform different functions. For example, storage may be implemented primarily in one or more storage subsystems, and computation may be performed primarily in one or more compute subsystems.
The disaggregated system may be a disaggregated server, one or more disaggregated servers, and/or may be distinct from one or more servers.
The disaggregated system may include one or more switching subsystems, one or more compute subsystems, one or more storage subsystems, and one or more processing/memory subsystems.
The one or more processing/memory subsystems, one or more compute subsystems, and one or more storage subsystems are coupled to each other via the one or more switching subsystems.
One or more processing/memory subsystems may be included in one or more subsystems of the disaggregated system.
FIG. 87A illustrates various examples of a disaggregated system.
Any number of any type of subsystem may be provided. The disaggregated system may include one or more additional subsystems of a type not included in fig. 87A, may include fewer types of subsystems, and the like.
The disaggregated system 7101 includes two storage subsystems 7130, a compute subsystem 7120, a switching subsystem 7140, and a processing/memory subsystem 7110.
The disaggregated system 7102 includes two storage subsystems 7130, a compute subsystem 7120, a switching subsystem 7140, a processing/memory subsystem 7110, and an accelerator subsystem 7150.
The disaggregated system 7103 includes two storage subsystems 7130, a compute subsystem 7120, and a switching subsystem 7140 that includes a processing/memory subsystem 7110.
The disaggregated system 7104 includes two storage subsystems 7130, a compute subsystem 7120, a switching subsystem 7140 that includes a processing/memory subsystem 7110, and an accelerator subsystem 7150.
Including processing/memory subsystem 7110 in switching subsystem 7140 (as in disaggregated systems 7103 and 7104) may reduce traffic relative to disaggregated systems 7101 and 7102, may reduce switching latency, and the like.
The different subsystems of the disaggregated system can communicate with each other using various communication protocols. It has been found that using Ethernet, and even Ethernet RDMA, communication protocols can increase throughput and possibly even reduce the complexity of various control and/or storage operations related to the exchange of information units between components of the disaggregated system.
A disaggregated system may perform distributed processing by allowing the processing/memory subsystem to participate in computations, particularly by performing memory-intensive computations.
For example, assuming that N compute units should share information units among themselves (all-to-all), then (a) the N information units may be sent to one or more processing/memory units of one or more processing/memory subsystems, (b) the one or more processing/memory units may perform the computations that require full sharing, and (c) N updated information units are sent back to the N compute units. This requires only on the order of N transfer operations, rather than a full all-to-all exchange between the N compute units.
For example, fig. 87B illustrates a distributed process of updating a model of a neural network (the model including weights assigned to nodes of the neural network).
Each of the N compute units PU(1) 7120(1) through PU(N) 7120(N) may belong to the compute subsystem 7120 of any of the disaggregated systems 7101, 7102, 7103, and 7104.
The N compute units compute N partial model updates (N different portions of the update) 7121(1) through 7121(N) and send them (via the switching subsystem 7140) to the processing/memory subsystem 7110.
The processing/memory subsystem 7110 computes the updated model 7122 and sends it (via the switching subsystem 7140) to the N compute units PU(1) 7120(1) through PU(N) 7120(N).
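A minimal Python sketch of this model-update flow, assuming simple additive weight updates (all function and variable names are illustrative, not from the patent):

```python
N = 4  # number of compute units PU(1)..PU(N)

def compute_partial_update(unit_id, model):
    # stands in for the gradient/update computed locally at compute unit PU(unit_id)
    return [0.01 * w for w in model]

def aggregate_in_memory(model, partial_updates):
    # runs inside the processing/memory subsystem, next to the stored model
    for update in partial_updates:
        model = [w - u for w, u in zip(model, update)]
    return model

model = [0.5, -1.2, 3.3]
partials = [compute_partial_update(i, model) for i in range(N)]   # N inbound transfers
model = aggregate_in_memory(model, partials)                      # aggregation near memory
updated_copies = [list(model) for _ in range(N)]                  # N outbound transfers
```

The point of the sketch is the traffic pattern: each compute unit sends one partial update in and receives one updated model back, instead of exchanging partial updates with every other compute unit.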
Figs. 87C, 87D, and 87E illustrate examples of memory/processing units 7011, 7012, and 7013, respectively, and figs. 87F and 87G illustrate integrated circuits 7014 and 7015 that include one or more communication modules of memory/processing unit 9010, such as an Ethernet module and an Ethernet RDMA module.
The memory/processing unit includes a controller 9020, an internal bus 9021, and pairs of logic 9030 and memory banks 9040. The controller is configured to operate as or be coupled to the communication module.
Connectivity between the controller 9020 and the pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and logic may be configured in other ways (unpaired).
One or more memory/processing units 9010 of processing/memory subsystem 7110 may process model updates in parallel (using different logic and retrieving different portions of the model in parallel from different memory banks) and, benefiting from the extremely high bandwidth of the connections between the massive memory resources (memory banks) and the logic, may perform these computations in an efficient manner.
Memory/processing units 7011, 7012, and 7013 of figs. 87C through 87E and integrated circuits 7014 and 7015 of figs. 87F and 87G include one or more communication modules, such as Ethernet module 7023 (in figs. 87C through 87G) and Ethernet RDMA module 7022 (in figs. 87E and 87G).
Having such RDMA and/or Ethernet modules (within the memory/processing unit, or within the same integrated circuit as the memory/processing unit) greatly speeds up communication between different elements of the disaggregated system and, in the case of RDMA, greatly simplifies communication between different elements of the disaggregated system.
It should be noted that a memory/processing unit that includes RDMA and/or ethernet modules may be beneficial in other environments, even when the memory/processing unit is not included in the disaggregated system.
It should also be noted that RDMA and/or ethernet modules may be allocated for each group of memory/processing units, e.g., for cost reduction reasons.
It should be noted that the memory/processing units, groups of memory/processing units, and even the processing/memory subsystems may include other communication ports, such as PCIe communication ports.
Using RDMA and/or Ethernet modules may be cost effective because the need to connect the memory/processing unit to a bridge connected to a network interface controller (NIC) that may have an Ethernet port may be eliminated.
The use of RDMA and/or Ethernet modules may make Ethernet (or Ethernet RDMA) native to the memory/processing unit.
It should be noted that Ethernet is merely an example of a Local Area Network (LAN) protocol, and PCIe is merely an example of another communication protocol; Ethernet may be used over longer distances than PCIe.
FIG. 87H illustrates a method 7000 for distributed processing.
The processing iterations may be performed by one or more memory processing integrated circuits of the disaggregated system.
The processing iterations may be performed by one or more processing integrated circuits of the disaggregated system.
The processing iterations performed by the one or more memory processing integrated circuits may be followed by processing iterations performed by the one or more processing integrated circuits.
The processing iterations performed by the one or more memory processing integrated circuits may be preceded by processing iterations performed by the one or more processing integrated circuits.
Yet other processing iterations may be performed by other circuits of the disaggregated system. For example, one or more pre-processing circuits may perform any type of pre-processing, including preparing information units for the processing iterations performed by the one or more memory processing integrated circuits.
Each memory processing integrated circuit may include a controller, a plurality of processor subunits, and a plurality of memory units.
One or more memory processing integrated circuits may be fabricated by a memory class of fabrication processes.
The information unit may convey part of a model of the neural network.
The information unit may convey partial results of at least one database query.
The information unit may convey partial results of at least one aggregated database query.
The total size of the information units may exceed, may be equal to, or may be less than the total size of the processing results.
Step 7040 may comprise outputting the processing results to one or more compute subsystems of the disaggregated system, which may comprise a plurality of processing integrated circuits manufactured by a logic class of manufacturing process.
Step 7040 may include outputting the processing results to one or more storage subsystems of the disaggregated system.
The information units may be sent from different groups of processing units of the multiple processing integrated circuits and may be different portions of an intermediate result of a process executed in a distributed manner by the multiple processing integrated circuits. A group of processing units may include at least one processing integrated circuit.
Step 7040 may comprise sending the result of the entire process to each of the plurality of processing integrated circuits.
The different portions of the intermediate results may be different portions of an updated neural network model, in which case the result of the entire process is the updated neural network model.
Step 7040 may comprise sending the updated neural network model to each of the plurality of processing integrated circuits.
Step 7040 may be followed by step 7050 of performing, by the plurality of processing integrated circuits, another process based, at least in part, on the processing results sent to the plurality of processing integrated circuits.
Step 7040 may include outputting the processing result using a switching subsystem of the disaggregated system.
Fig. 87I illustrates a method 7001 for distributed processing.
Step 7010 may be followed by steps 7020, 7030, and 7040.
Database analysis acceleration
An apparatus, method, and computer-readable medium storing instructions may be provided for performing at least the screening by a screening unit that belongs to the same integrated circuit as the memory unit, using a filter that can indicate which entries are relevant to a database query. An arbiter, or any other flow control manager, may send the relevant entries to the processor and refrain from sending irrelevant entries, thereby eliminating almost all unnecessary traffic to and from the processor.
See, e.g., fig. 91A, which shows a processor (CPU 9240) and an integrated circuit that includes memory and screening system 9220. Memory and screening system 9220 may include screening units 9224 coupled to memory unit entries 9222, and one or more arbiters, such as arbiter 9229, for sending the relevant entries to the processor. Any arbitration scheme may be applied. There may be any relationship between the number of entries, the number of screening units, and the number of arbiters.
The arbiter may be replaced by any unit capable of controlling the flow of information, such as a communication interface, a flow controller, and the like.
The screening is based on one or more relevancy/screening criteria.
The relevancy may be set per database query and may be indicated in any manner; e.g., the memory unit may store relevancy flags 9224' that indicate which entries are relevant. There is also a storage 9210 that stores K database segments (indexed 1 through K). It should be noted that the entire database may be stored in the memory units and not in the storage device (this solution is also referred to as an in-memory database).
The memory unit entries may be too small to store the entire database, and thus may receive one segment at a time.
The screening unit may perform screening operations such as comparing the value of a field to a threshold, comparing the value of a field to a predefined value, determining whether the value of a field is within a predefined range, and the like.
Thus, the screening unit may perform known database screening operations and may be a compact and inexpensive circuit.
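For illustration, the screening operations named above (threshold, equality, and range tests applied near the memory entries) can be modeled as a small predicate; the criterion encoding below is an assumption, not a format defined by the patent:

```python
# A minimal sketch of a screening predicate; only entries that pass would be
# forwarded to the processor by the arbiter.
def screen(entry, criterion):
    value = entry[criterion["field"]]
    op = criterion["op"]
    if op == "gt":
        return value > criterion["threshold"]        # compare to a threshold
    if op == "eq":
        return value == criterion["value"]           # compare to a predefined value
    if op == "in_range":
        lo, hi = criterion["range"]
        return lo <= value <= hi                      # check a predefined range
    raise ValueError(f"unknown screening op: {op}")

entries = [{"age": 17}, {"age": 42}, {"age": 30}]
criterion = {"field": "age", "op": "in_range", "range": (18, 40)}
relevant = [e for e in entries if screen(e, criterion)]
print(relevant)   # [{'age': 30}]
```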
The final result (e.g., the contents of the relevant database entries) 9101 of the screening operation is sent to the CPU 9240 for processing.
Memory and screening system 9220 may be replaced by a memory and processing system as illustrated in figure 91B.
Memory and processing system 9229 includes processing unit 9225 coupled to memory unit entry 9222. Processing unit 9225 may perform the screening operation and may participate, at least in part, in performing one or more additional operations on the associated record.
The processing unit may be customized to perform a particular operation and/or may be a programmable unit configured to perform multiple operations. For example, the processing unit may be a pipelined processing unit, may include an ALU, may include multiple ALUs, and the like.
Alternatively, a portion of the one or more additional operations are performed by the processing unit and another portion of the one or more additional operations are performed by the processor (CPU 9240).
The final result of the processing operation (e.g., a partial response 9102 to a database query, or a complete response 9103) is sent to the CPU 9240.
The partial response requires further processing.
Fig. 92A illustrates a memory/processing system 9228 that includes a memory/processing unit 9227 configured to perform screening and additional processing.
Memory/processing system 9228 implements the processing units and memory units of fig. 91B through memory/processing unit 9227.
The role of the processor may include controlling the processing unit, performing at least a portion of one or more additional operations, and the like.
The combination of memory entries and processing units may be implemented, at least in part, by one or more memory/processing units.
Fig. 92B illustrates an example memory/processing unit 9010.
The memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and pairs of logic 9030 and memory banks 9040. The controller is configured to operate as or be coupled to the communication module.
Connectivity between the controller 9020 and the pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and logic may be configured in other ways (unpaired). Multiple memory banks may be coupled to and/or managed by a single logic.
The memory/processing system receives database query 9100 via interface 9211. The interface 9211 may be a bus, a port, an input/output interface, and the like.
It should be noted that the response to the database query may result from at least one of the following (or a combination of one or more of the following): one or more memory/processing systems, one or more memory and screening systems, one or more processors external to these systems, and the like.
It should be noted that the response to the database query may result from at least one of the following (or a combination of one or more of the following): one or more screening units, one or more memory/processing units, one or more other processors (such as one or more other CPUs), and the like.
Any processing procedure may include finding relevant database entries and processing the relevant database entries. The processing may be performed by one or more processing entities.
The processing entity may be at least one of: a processing unit of a memory and processing system (e.g., processing unit 9225 of memory and processing system 9229), a processor subunit (or logic) of a memory/processing unit, another processor (e.g., CPU 9240 of fig. 91A, 91B, and 74), and the like.
The processing involved in generating a response to a database query may result from any one or a combination of:
a. A processing unit 9225 of a memory and processing system 9229.
b. Processing units 9225 of different memory and processing systems 9229.
c. A processor subunit (or logic 9030) of one or more memory/processing units 9227 of memory/processing system 9228.
d. A processor subunit (or logic 9030) of memory/processing unit 9227 of different memory/processing system 9228.
e. A controller for one or more memory/processing units 9227 of memory/processing system 9228.
f. A controller for one or more memory/processing units 9227 of a different memory/processing system 9228.
Thus, the processing involved in responding to a database query may result from a combination or sub-combination of: (a) one or more controllers of one or more memory/processing units, (b) one or more processing units of a memory processing system, (c) one or more processor sub-units of one or more memory/processing units, and (d) one or more other processors, and the like.
The processing performed by more than one processing entity may be referred to as distributed processing.
It should be noted that the screening may be performed by a screening entity in one or more screening units and/or one or more processing units and/or one or more processor sub-units. In this sense, the processing units and/or processor sub-units performing the screening operation may be referred to as screening units.
The processing entity may be a screening entity or may be distinct from the screening entity.
The processing entity may perform processing operations for database entries deemed relevant by another screening entity.
The processing entity may also perform the screening operation.
Responses to the database query may utilize one or more screening entities and one or more processing entities.
The one or more screening entities and the one or more processing entities may belong to the same system (e.g., memory/processing system 9228, memory and processing system 9229, memory and screening system 9220) or belong to different systems.
The memory/processing unit may include multiple processor subunits. The processor subunits may operate independently of one another, may partially cooperate with one another, may participate in distributed processing, and the like.
Fig. 92C illustrates a plurality of memory and screening systems 9220, a plurality of other processors (such as CPUs 9240), and storage 9210.
Multiple memory and screening systems 9220 may participate (simultaneously or non-simultaneously) in the screening of one or more database entries based on one or more screening criteria included in one of multiple database queries.
Fig. 92D illustrates a plurality of memory and processing systems 9229, a plurality of other processors (such as CPUs 9240), and storage 9210.
Multiple memory and processing systems 9229 may participate (simultaneously or non-simultaneously) in the screening and at least partial processing involved in responding to one of the multiple database queries.
Fig. 92F illustrates a plurality of memory/processing systems 9228, a plurality of other processors (such as CPUs 9240), and storage 9210.
Multiple memory/processing systems 9228 may participate (simultaneously or non-simultaneously) in the screening and at least partial processing involved in responding to one of multiple database queries.
FIG. 92G illustrates a method 9300 for database analysis acceleration.
The method 9300 can begin at step 9310 with receiving, by a memory processing integrated circuit, a database query that includes at least one relevancy criterion indicating which database entries in a database are relevant to the database query.
The database entries relevant to the database query may be none, one, some, or all of the database entries of the database.
The memory processing integrated circuit may include a controller, a plurality of processor sub-units, and a plurality of memory units.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevancy criterion, a group of relevant database entries stored in the memory processing integrated circuit.
The phrase "without substantially sending" means that no irrelevant entries are sent at all (during the response to a database query) or that only a small number of irrelevant entries are sent. A small number may mean up to 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 percent, or any amount that does not have a significant impact on bandwidth.
FIG. 92H illustrates a method 9301 for database analysis acceleration.
It is assumed that the screening and the entire processing required to respond to a database query are performed by the memory processing integrated circuit.
The method 9301 can begin at step 9310 with receiving, by a memory processing integrated circuit, a database query that includes at least one relevancy criterion indicating which database entries in a database are relevant to the database query.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevancy criterion, a group of relevant database entries stored in the memory processing integrated circuit.
Step 9331 may be followed by step 9341 of fully processing the group of related database entries to provide a response to the database query.
Step 9341 may be followed by step 9351 of outputting a response to the database query from the memory processing integrated circuit.
FIG. 92I illustrates a method 9302 for database analysis acceleration.
It is assumed that only a portion of the screening and processing required to respond to a database query is performed by the memory processing integrated circuit. The memory processing integrated circuit outputs partial results that are further processed by one or more other processing entities located external to the memory processing integrated circuit.
The method 9302 can begin at step 9310 with receiving, by a memory processing integrated circuit, a database query that includes at least one relevancy criterion indicating which database entries in a database are relevant to the database query.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevancy criterion, a group of relevant database entries stored in the memory processing integrated circuit.
Step 9342 may be followed by step 9352 of outputting an intermediate response to the database query from the memory processing integrated circuit.
Step 9352 may be followed by step 9390 of further processing the intermediate responses to provide a response to the database query.
Fig. 92J illustrates a method 9303 for database analysis acceleration.
Assume that the memory processing integrated circuit performs the screening of the relevant database entries but does not perform the processing of the relevant database entries. The memory processing integrated circuit outputs the group of relevant database entries to be fully processed by one or more other processing entities located external to the memory processing integrated circuit.
The method 9303 can begin at step 9310 with receiving, by a memory processing integrated circuit, a database query that includes at least one relevancy criterion indicating which database entries in a database are relevant to the database query.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevancy criterion, a group of relevant database entries stored in the memory processing integrated circuit.
Step 9333 may be followed by step 9391 of fully processing the intermediate responses to provide a response to the database query.
FIG. 92K illustrates a method 9304 for database analysis acceleration.
The method 9304 can begin at step 9315 with receiving, by an integrated circuit, a database query that includes at least one relevancy criterion indicating which database entries in a database are relevant to the database query; the integrated circuit includes a controller, a screening unit, and a plurality of memory units.
Step 9335 may be followed by step 9391.
FIG. 92L illustrates a method 9305 of database analysis acceleration.
The method 9305 can begin at step 9314 with receiving, by an integrated circuit, a database query that includes at least one relevancy criterion indicating which database entries in a database are relevant to the database query; the integrated circuit includes a controller, a processing unit, and a plurality of memory units.
Step 9314 may be followed by step 9324 of determining, by the processing unit and based on the at least one relevancy criterion, a group of relevant database entries stored in the integrated circuit.
In any of the methods 9300, 9301, 9302, 9304, and 9305, the memory processing integrated circuit outputs an output. The output may be relevant database entries, one or more intermediate results, or one or more (complete) results.
The output may be preceded by the retrieval of one or more relevant database entries and/or one or more results (complete or intermediate) from a screening entity and/or a processing entity of the memory processing integrated circuit.
The retrieval may be controlled in one or more ways and may be controlled by an arbiter and/or one or more controllers of the memory processing integrated circuit.
The outputting and/or retrieving may include controlling one or more parameters of the retrieving and/or outputting. The parameters may include retrieval timing, retrieval rate, retrieval source, retrieval bandwidth, retrieval sequence, output timing, output rate, output source, output bandwidth, output sequence, type of retrieval method, type of arbitration method, and the like.
The outputting and/or retrieving may apply a flow control process.
The outputting and/or retrieving (e.g., applying a flow control process) may be responsive to an indicator output from one or more processing entities regarding completion of processing of the group's database entries. The indicator may indicate whether an intermediate result is ready to be retrieved from the processing entity.
Outputting may include attempting to match a bandwidth used during outputting to a maximum allowable bandwidth on a link coupling the memory processing integrated circuit to the requester unit. The link may be a link to a recipient of an output of the memory processing integrated circuit. The maximum allowable bandwidth may be dictated by the capacity and/or availability of the link, the capacity and/or availability of the recipient of the output content, and the like.
Outputting may include attempting to output the output content in an optimal or near-optimal manner.
The outputting of the output content may include attempting to maintain fluctuations in the output traffic rate below a threshold.
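As an illustration of one conventional way to bound the output rate and its fluctuations (the text does not prescribe a specific shaping algorithm), consider a token-bucket shaper placed in front of the output link; all names and values below are assumptions:

```python
import time

class TokenBucket:
    """Token-bucket shaper: tokens accrue at a fixed rate up to a burst cap."""
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_send(self, nbytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + self.rate * (now - self.last))
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True        # enough budget: emit the result now
        return False           # hold the result until enough tokens accrue

shaper = TokenBucket(rate_bytes_per_s=1e9, burst_bytes=64 * 1024)
if shaper.try_send(4096):
    pass  # place the 4 KiB result on the output link
```

Bounding the burst size keeps the instantaneous output rate near the link's sustainable rate, which is one way to keep fluctuations below a threshold.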
Any of the methods 9300, 9301, 9302, and 9305 can include generating, by the one or more processing entities, a process status indicator that can indicate the progress of further processing of the group of relevant database entries.
When a process included in any of the above-mentioned methods is performed by more than a single processing entity, then the process may be considered a distributed process, as the process is performed in a distributed manner.
As noted above, processing may be performed in a hierarchical manner or in a planar manner.
Any of the methods 9300-9305 can be performed by multiple systems that can respond to one or more database queries simultaneously or sequentially.
Word embedding
As mentioned above, word embedding is a generic term for a collection of language modeling and feature learning techniques in Natural Language Processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, word embedding involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.
The vectors may be mathematically processed. For example, vectors belonging to a matrix may be summed to provide a summed vector.
For yet another example, the covariance of the matrix (of a sentence) may be calculated. This may include multiplying the matrix by its transpose.
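A small worked sketch of these sentence-matrix operations, using a hypothetical 3-word sentence with 3-element word vectors:

```python
import numpy as np

M = np.array([[1.0, 0.0, 2.0],    # one row per word vector of the sentence
              [0.5, 1.5, 0.0],
              [2.0, 1.0, 1.0]])

summed = M.sum(axis=0)   # summed vector of the sentence: [3.5, 2.5, 3.0]
gram = M @ M.T           # multiply the matrix by its transpose (3 x 3 result)
```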
The memory/processing unit may store a vocabulary. In particular, portions of the vocabulary may be stored in multiple memory banks of the memory/processing unit.
Thus, the memory/processing unit may be accessed using access information (such as retrieval keys) for a set of words that represent a phrase or a sentence, such that the vectors of the words representing the phrase or sentence are retrieved from at least some of the memory banks of the memory/processing unit.
Different memory banks of the memory/processing unit may store different parts of the vocabulary and may be accessed in parallel (depending on the distribution of the indexes of the sentence). Predictive fetching may reduce the penalty even when more than a single memory bank needs to be accessed sequentially.
The allocation of vocabulary words among the different memory banks of the memory/processing unit may be optimized, or at least highly beneficial, in the sense that it may increase the chances of parallel access to different memory banks of the memory/processing unit per sentence. The assignment may be learned per user, per group of users, or for the general population.
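As a sketch of such an allocation, a simple hash-based spread of vocabulary indexes across banks increases the chance that the words of a sentence fall in different banks and can be fetched in parallel; the modulo assignment below is an illustrative assumption (a learned assignment, as described above, could replace it):

```python
NUM_BANKS = 8

def bank_of(word_index: int) -> int:
    return word_index % NUM_BANKS   # simple spread; could be tuned per user/corpus

def plan_parallel_fetch(sentence_indices):
    by_bank = {}
    for idx in sentence_indices:
        by_bank.setdefault(bank_of(idx), []).append(idx)
    # indexes in distinct banks can be fetched in the same cycle; collisions
    # within a bank must be serialized (or hidden by predictive fetching)
    return by_bank

print(plan_parallel_fetch([3, 10, 17, 11]))   # banks 3, 2, 1, 3 -> one collision
```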
Furthermore, the memory/processing unit may also be used to perform at least some of the processing operations (by virtue of its logic), thereby reducing the bandwidth required from a bus external to the memory/processing unit; multiple operations may be computed in an efficient manner (even in parallel, using multiple processors of the memory/processing unit).
The memory banks may be associated with logic.
At least a portion of the processing operations may be performed by one or more additional processors, such as vector processors, including but not limited to vector adders.
The memory/processing unit may include one or more additional processors that may be allocated to some or all of the memory bank and logic pairs.
Thus, a single additional processor may be allocated to all or some of the memory bank and logic pairs. For yet another example, the additional processors may be configured in a hierarchical manner, such that an additional processor of one level processes the outputs of additional processors of a lower level.
It should be noted that the processing operations may be performed without using any additional processors, but may be performed by logic of the memory/processing unit.
Fig. 89A, 89B, 89C, 89D, 89E, 89F, and 89G illustrate examples of memory/ processing units 9010, 9011, 9012, 9013, 9014, 9015, and 9019, respectively. The memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and pairs of logic 9030 and memory banks 9040.
It should be noted that the logic 9030 and the memory bank 9040 may be coupled to the controller and/or to each other in other manners, e.g., multiple buses may be disposed between the controller and the logic, the logic may be configured in multiple layers, a single logic may be shared by multiple memory banks (see, e.g., fig. 89E), and the like.
The length of a page of each memory bank within memory/processing unit 9010 may be defined in any manner; e.g., pages may be small enough, and the number of memory banks large enough, to enable a large number of vectors to be output in parallel without wasting many bits on irrelevant information.
Connectivity between the controller 9020 and the pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and logic may be configured in other ways (unpaired).
The memory/processing unit 9010 of fig. 89A may not have additional processors, and the processing of vectors (from the memory banks) is performed by the logic 9030.
Fig. 89B illustrates additional processors, such as vector processor 9050 coupled to internal bus 9021.
Fig. 89C illustrates additional processors, such as vector processor 9050 coupled to internal bus 9021. One or more additional processors perform (either alone or in cooperation with logic) the processing operations.
Fig. 89D illustrates a host 9018 coupled to the memory/processing unit 9010 via a bus 9022.
Fig. 89D also illustrates a vocabulary 9070 that maps words/phrases 9072 to vectors 9073. The memory/processing unit is accessed using retrieval keys 9071, each representing a previously recognized word or phrase. The host 9018 sends a plurality of retrieval keys 9071 representing a sentence to the memory/processing unit, and the memory/processing unit may output the vectors associated with the sentence or the final result of a processing operation applied to those vectors. The words/phrases themselves are typically not stored in the memory/processing unit 9010.
Memory controller functionality for controlling the memory banks may be included (fully or partially) in the logic, may be included (fully or partially) in controller 9020, and/or may be included (fully or partially) in one or more memory controllers (not shown) within memory/processing unit 9010.
The memory/processing unit may be configured to maximize the throughput of vectors/results sent to host 9018, and any process for controlling internal memory/processing unit traffic and/or controlling traffic between the memory/processing unit and the host computer (or any other entity external to the memory/processing unit) may be applied.
When the total size of the processed vectors exceeds the total size of the result, a reduction in output bandwidth (outside the memory/processing unit) is obtained. For example, when K vectors are summed by the memory/processing unit to provide a single output vector, a K:1 reduction in bandwidth is obtained.
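As an illustrative calculation (numbers assumed, not from the source): if K = 8 vectors of 128 words each are summed inside the memory/processing unit, 1024 words are read internally from the memory banks but only 128 words cross the external bus, i.e., an 8:1 reduction in output bandwidth.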
The controller 9020 may be configured to open multiple memory banks in parallel by broadcasting the addresses of different vectors to be accessed.
The controller may be configured to control the order in which the different vectors are retrieved from the multiple memory banks (or from any intermediate buffer or storage circuit that stores the different vectors; see buffer 9033 of fig. 89D), based at least in part on the order of the words or phrases in the sentence.
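For illustration, reordering out-of-order fetch completions back into sentence order might look like the following sketch (the buffering layout and names are assumptions):

```python
def reorder_by_sentence(sentence_order, arrived):
    """sentence_order: word ids in sentence order; arrived: word id -> fetched vector."""
    out = []
    for word_id in sentence_order:
        out.append(arrived[word_id])   # emit vectors in the order of the sentence
    return out

# vectors completed in fetch order 7, 3, 9 (e.g., due to bank conflicts) ...
arrived = {7: [0.1, 0.2], 3: [0.3, 0.4], 9: [0.5, 0.6]}
# ... but are emitted in the sentence order 3, 9, 7
print(reorder_by_sentence([3, 9, 7], arrived))
```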
The controller 9020 may be configured to manage retrieval of the different vectors based on one or more parameters associated with outputting the vectors outside the memory/processing unit 9010, e.g., the rate at which the different vectors are retrieved from the memory bank may be set substantially equal to the allowable rate at which the different vectors are output from the memory/processing unit 9010.
The controller may output the different vectors outside the memory/processing unit 9010 by applying any traffic shaping process. For example, the controller 9020 may aim to output the different vectors at a rate as close as possible to the maximum rate allowed by the host computer or by the link coupling the memory/processing unit 9010 to the host computer. For yet another example, the controller may output the different vectors while minimizing, or at least substantially reducing, fluctuations of the traffic rate over time.
The controller 9020 belongs to the same integrated circuit as the memory banks 9040 and the logic 9030, and thus may readily receive feedback from the different logic/memory bank pairs regarding the fetch status of the different vectors (e.g., whether a vector is ready, is being fetched, or whether another vector is about to be fetched from the same memory bank), and the like. Feedback may be provided in any manner: via dedicated control lines, via shared control lines, using one or more status bits, and the like (see status line 9039 of fig. 89F).
The controller 9020 can independently control the fetching and outputting of the different vectors, and thus can reduce the involvement of the host computer. Alternatively, the host computer may be unaware of the management capabilities of the controller and may continue to send detailed instructions; in this case, memory/processing unit 9010 may ignore the detailed instructions, may hide the management capabilities of the controller, and the like. Which of the above-mentioned solutions is used may be determined by a protocol managed by the host computer.
It has been found that performing processing operations in the memory/processing unit can be extremely beneficial (in terms of energy), even when these operations consume more power than the corresponding processing operations in the host, and even when they consume more power than transfer operations between the host and the memory/processing unit. For example, assuming that the vector is large enough, that the energy consumed to transfer a data unit is 4 pJ, and that the energy consumed to process the data unit (by the host) is 0.1 pJ, it is more efficient to process the data unit in the memory/processing unit whenever doing so consumes less than about 4.1 pJ (the 4 pJ transfer cost plus the 0.1 pJ host processing cost).
Each vector (of the matrix representing a sentence) may be represented by a sequence of words (or other multi-bit segments). For simplicity of explanation, the multi-bit segments are assumed to be words.
Additional power savings may be obtained when a vector includes zero-valued words. Instead of outputting an entire zero-valued word, a zero-valued flag shorter than the word (e.g., a single bit) may be output (or even conveyed via a dedicated control line) instead of the entire word. Flags may also be assigned to other common values (e.g., words of value 1).
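A minimal sketch of this zero-value flagging (the encoding format below is an assumption for illustration):

```python
def encode_word(word: int):
    if word == 0:
        return ("flag", 0)     # short flag (conceptually one bit) replaces the word
    return ("word", word)      # non-zero words are output in full

vector = [0, 7, 0, 0, 3]
encoded = [encode_word(w) for w in vector]
# 3 of the 5 words collapse to single-bit flags, saving most of their output bits
```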
Fig. 88A illustrates a method 9400 for word embedding or, more precisely, for retrieving feature-vector-related information. The feature-vector-related information may comprise feature vectors and/or results of processing the feature vectors.
The method 9400 can begin at step 9410 by receiving, by a memory processing integrated circuit, fetch information for fetching a plurality of requested feature vectors, which may be mapped to a plurality of sentence segments.
The memory processing integrated circuit may include a controller, a plurality of processor subunits, and a plurality of memory units. Each of the memory units may be coupled to a processor subunit.
The retrieval may include simultaneously requesting, from two or more memory units, the requested feature vectors stored in the two or more memory units.
The requests are issued based on a known mapping between sentence segments and the locations of the feature vectors mapped to the sentence segments.
The mapping may be uploaded during a boot process of the memory processing integrated circuit.
It may be beneficial to retrieve as many requested feature vectors as possible at a time, but this depends on where the requested feature vectors are stored and the number of different memory units.
If more than one requested feature vector is stored in the same memory bank, predictive fetching may be applied for reducing the penalty associated with fetching information from the memory bank. Various methods for reducing penalties are described in various sections of the present application.
Retrieving may include applying predictive retrieval of at least some of the set of requested feature vectors stored in the single memory unit.
The requested feature vectors may be distributed among the memory units in an optimal manner.
The requested feature vectors may be distributed among the memory units based on the expected extraction pattern.
The retrieval of the plurality of requested feature vectors may be performed according to a certain order, for example, according to the order of the sentence segments in one or more sentences.
The retrieval of the plurality of requested feature vectors may be performed at least partially out of order; and wherein retrieving further may comprise reordering the plurality of requested feature vectors.
The retrieval of the plurality of requested feature vectors may include buffering the requested feature vectors before they are read by the controller.
Retrieval of the plurality of requested features may include generating buffer status indicators that indicate when one or more buffers associated with the plurality of memory units store the one or more requested feature vectors.
The method may include transmitting the buffer status indicator via a dedicated control line.
A dedicated control line may be allocated per memory cell.
The buffer status indicator may be a status bit stored in one or more buffers.
The method may include transmitting the buffer status indicator via one or more shared control lines.
Additionally or alternatively, step 9420 may be followed by step 9440 of outputting, from the memory processing integrated circuit, an output that may include at least one of: (a) the requested feature vectors; and (b) a result of processing the requested feature vectors. Items (a) and (b) are also referred to as feature-vector-related information.
When step 9430 is performed, step 9440 may include outputting (at least) the result of processing the requested feature vectors.
When step 9430 is skipped, step 9440 includes outputting the requested feature vectors and may not include outputting a result of processing the requested feature vectors.
Fig. 88B illustrates a method 9401 for embedding.
Assume that the output includes the requested feature vector but does not include the result of processing the requested feature vector.
The method 9401 may begin at step 9410 with receiving, by a memory processing integrated circuit, retrieval information for retrieving a plurality of requested feature vectors, which may be mapped to a plurality of sentence segments.
Fig. 88C illustrates a method 9402 for embedding.
Assume that the output includes a result of processing the requested feature vector.
The method 9402 may begin at step 9410 with receiving, by a memory processing integrated circuit, retrieval information for retrieving a plurality of requested feature vectors, which may be mapped to a plurality of sentence segments.
Step 9430 may be followed by step 9442 of outputting, from the memory processing integrated circuit, an output that may include the result of processing the requested feature vectors.
The outputting of the output may include applying traffic shaping to the output.
The outputting of the output may include attempting to match a bandwidth used during the outputting to a maximum allowable bandwidth on a link coupling the memory processing integrated circuit to the requester unit.
The outputting of the output may include attempting to maintain fluctuations in the output traffic rate below a threshold.
Any of the fetching and outputting steps may be performed under the control of the host and/or performed, fully or in part, independently by the controller.
The host may send fetch commands with different granularities, ranging from general fetch information that ignores the locations of the requested feature vectors within the plurality of memory units, to detailed fetch information based on those locations.
The host may control (or attempt to control) the timing of different fetch operations within the memory processing integrated circuit, or may be agnostic of that timing.
The controller may be controlled by the host at various levels, may even ignore detailed commands from the host, and may control the fetching and/or outputting at least partially independently.
The processing of the requested feature vectors may be performed by at least one of (or a combination of one or more of) the following: one or more memory/processing units, one or more processors external to the one or more memory/processing units, and the like.
It should be noted that the processing of the requested feature vectors may be performed by at least one of (or a combination of one or more of) the following: one or more processor subunits, a controller, one or more vector processors, and one or more other processors external to the one or more memory/processing units.
The processing of the requested feature vectors may be performed by, or may result from, any one or a combination of:
a. A processor subunit (or logic 9030) of a memory/processing unit.
b. Processor subunits (or logic 9030) of multiple memory/processing units.
c. A controller of a memory/processing unit.
d. Controllers of multiple memory/processing units.
e. One or more vector processors of a memory/processing unit.
f. One or more vector processors of multiple memory/processing units.
Thus, the processing of the requested feature vector may be performed by any combination or sub-combination of: (a) one or more controllers of one or more memory/processing units; (b) one or more processor sub-units of one or more memory/processing units; (c) one or more vector processors of the one or more memory/processing units; and (d) one or more other processors external to the one or more memory/processing units.
The processing performed by more than one processing entity may be referred to as distributed processing.
The memory/processing unit may include multiple processor subunits. The processor subunits may operate independently of one another, may partially cooperate with one another, may participate in distributed processing, and the like.
Processing may be performed in a planar fashion, with all processor subunits performing the same operation (with or without processing results being exchanged between them).
Processing may be performed in a hierarchical manner, where processing involves a sequence of processing operations at different levels, with processing operations at one level following processing operations at another level. The processor subunits may be assigned (dynamically or statically) to different layers and participate in hierarchical processing.
Any processing of the requested feature vectors may be performed by more than one processing entity (processor subunits, controllers, vector processors, other processors) and may be distributed in any manner (planar, hierarchical, or otherwise). For example, a processor subunit may output its processing results to a controller, which may further process the results. One or more other processors external to the one or more memory/processing units may further process the output of the memory processing integrated circuit.
It should be noted that the retrieval information may also include information for retrieving requested feature vectors that are not mapped to sentence segments. These feature vectors may be mapped to one or more persons, devices, or any other entities that may be associated with a sentence segment, for example, a user of the device that sensed the sentence segment, the device that sensed the segment, a user identified as the source of the sentence segment, a website accessed when the sentence was generated, a location where the sentence was captured, and the like.
Non-limiting examples of processing of feature vectors may include summing, weighted sum, averaging, subtracting, or applying any other mathematical function.
Hybrid device
As both processor speeds and memory sizes continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck is caused by throughput limitations inherent in traditional computer architectures. In particular, data transfer from memory to the processor (i.e., memory external to the logic die, such as external DRAM) often becomes a bottleneck compared to the actual computations performed by the processor. Thus, the number of clock cycles spent reading from and writing to memory increases significantly for memory-intensive processes. These clock cycles result in a lower effective processing speed because reading and writing to memory consumes clock cycles that are not available for performing operations on the data. Furthermore, the computational bandwidth of a processor is typically greater than the bandwidth of the bus that the processor uses to access the memory.
These bottlenecks are particularly evident for memory-intensive processes, such as neural networks and other machine learning algorithms; database construction, index searching, and querying; and other tasks that include more read and write operations than data processing operations.
The present invention describes solutions to mitigate or overcome one or more of the problems set forth above, as well as other problems of the prior art.
A hybrid device for memory-intensive processing may be provided that may include a base die, a plurality of processors, a first memory resource of at least one other die, and a second memory resource of at least one additional die.
The base die and the at least one other die are connected to each other by wafer-on-wafer bonding.
The plurality of processors are configured to perform processing operations and to retrieve information stored in the first memory resource.
The second memory resource is configured to send additional information to the first memory resource.
The total bandwidth of a first path between the base die and the at least one other die exceeds the total bandwidth of a second path between the at least one other die and the at least one additional die, and the storage capacity of the first memory resource is a fraction of the storage capacity of the second memory resource.
The second memory resource is a High Bandwidth Memory (HBM) resource.
At least one other die is a stack of High Bandwidth Memory (HBM) chips.
At least some of the second memory resources may belong to another die that is connected to the base die by connectivity that differs from the wafer-on-wafer bonding.
At least some of the second memory resources may belong to a further die that is connected to another die by connectivity that differs from the wafer-on-wafer bonding.
The first memory resource and the second memory resource are different levels of cache.
The first memory resource is positioned between the base die and the second memory resource.
The first memory resource is positioned to one side of the second memory resource.
Another die is configured to perform additional processing, where the other die includes a plurality of processor subunits and a first memory resource.
Each processor subunit is coupled to a unique portion of the first memory resources allocated to the processor subunit.
The unique portion of the first memory resources may be at least one memory bank.
The plurality of processors may be a plurality of processor subunits included in a memory processing chip that also includes the first memory resource.
The base die includes a plurality of processors, wherein the plurality of processors are a plurality of processor subunits coupled to the first memory resource via conductors formed using the wafer-on-wafer bonding.
Each processor subunit is coupled to a unique portion of the first memory resources allocated to the processor subunit.
A hybrid integrated circuit may be provided that may utilize wafer-on-wafer (WOW) connectivity to couple at least a portion of a base die to second memory resources included in one or more other dies and connected using connectivity different from the WOW connectivity. An example of the second memory resource may be a High Bandwidth Memory (HBM) memory resource. In various figures, the second memory resource is included in a stack of HBM memory cells, which may be coupled to the controller using through-silicon-via (TSV) connectivity. The controller may be included in the base die or coupled (e.g., via micro bumps) to at least a portion of the base die.
The base die may be a logic die, but may alternatively be a memory/processing unit.
WOW connectivity is used to couple one or more portions of the base die to one or more portions of another die (the WOW-connected die), which may be a memory die or a memory/processing unit. WOW connectivity provides very high throughput.
A stack of High Bandwidth Memory (HBM) chips can be coupled to the base die (directly or via a WOW-connected die) and can provide high-throughput connections and very large memory resources.
The WOW-connected die may be coupled between the stack of HBM chips and the base die to form an HBM memory chip stack having TSV connectivity and a WOW-connected die at its bottom.
An HBM chip stack with TSV connectivity and a WOW-connected die at its bottom can provide multiple levels of memory hierarchy, where the WOW-connected die can serve as lower-level memory (e.g., a level-3 cache) accessible by the base die, and where fetch and/or prefetch operations from the higher-level HBM memory stack populate the WOW-connected die.
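For illustration only, the fetch/prefetch behavior of such a lower-level memory can be modeled as a small cache in front of a larger backing store. The following toy model uses assumed names and a naive eviction policy; it is not the disclosed circuit:

```python
# Toy model of the hierarchy: the base die reads through the WOW-connected
# die (lower-level memory), which is populated on demand from the HBM stack.
class WowCache:
    def __init__(self, backing, capacity):
        self.backing = backing          # the HBM stack (higher level)
        self.capacity = capacity
        self.lines = {}                 # address -> data held in the WOW die

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from the HBM stack
            if len(self.lines) >= self.capacity:
                self.lines.pop(next(iter(self.lines)))  # naive FIFO eviction
            self.lines[addr] = self.backing[addr]
        return self.lines[addr]

    def prefetch(self, addrs):
        for addr in addrs:              # populate ahead of anticipated reads
            self.read(addr)

hbm = {i: i * 10 for i in range(1024)}  # stand-in for the HBM memory stack
cache = WowCache(hbm, capacity=64)
cache.prefetch(range(8))
value = cache.read(3)                   # served from the WOW-connected die
```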
The HBM memory chips may be HBM DRAM chips, although any other memory technology may be used.
Using WOW connectivity in combination with HBM chips enables a multi-tier memory structure that may include multiple memory tiers offering different tradeoffs between bandwidth and memory density.
The proposed solution may serve as an entirely new memory hierarchy level between traditional DRAM/HBM memory and the internal caches of the logic die, enabling more bandwidth on the DRAM side and better management and reuse.
This may provide a new memory hierarchy on the DRAM side that better manages memory reads in a fast manner.
Fig. 93A to 93I illustrate hybrid integrated circuits 11011 'to 11019', respectively.
Fig. 93A illustrates an HBM DRAM stack with TSV connectivity and micro-bumps at the lowest level (collectively 11030), which includes a stack of HBM DRAM memory chips 11032 coupled to each other, and to a first memory controller 11031 of a base die, using TSVs (11039).
Fig. 93A also illustrates a wafer (collectively 11040) having at least memory resources and coupled using WOW technology, in which a DRAM die (11021) is coupled to a second memory controller 11022 of base die 11019 via one or more WOW intermediate layers (11023). The one or more WOW interlayers may be made of various materials, and may differ from pad connectivity and/or from TSV connectivity.
Conductors 11022' that pass through the one or more WOW intermediate layers electrically couple the DRAM die to components of the base die.
The base die 11019 is coupled to an interposer 11018, which is in turn coupled to a package substrate 11017 using micro-bumps. The package substrate has an array of micro-bumps at a lower surface thereof.
The micro-bumps may be replaced by other types of connectivity. The interposer 11018 and the package substrate 11017 may be replaced by other layers.
The first and/or second memory controllers (11031 and 11022, respectively) may be positioned (at least in part) outside base die 11019, such as in the DRAM die, between the DRAM die and the base die, between the stack of HBM memory units and the base die, and the like.
The first and/or second memory controllers (11031 and 11022, respectively) may belong to the same controller or may belong to different controllers.
One or more of the HBM memory units may include logic and memory, and may be or include a memory/processing unit.
The first and second memory controllers are coupled to each other by a plurality of buses 11016 for transporting information between the first memory resource and the second memory resource. Fig. 93A also illustrates a bus 11014 from the second memory controller to components of the base die, such as multiple processors. FIG. 93A further illustrates a bus 11015 from the first memory controller to components of the base die (e.g., multiple processors, as shown in FIG. 93C).
Fig. 93B illustrates a hybrid integrated circuit 11012 that differs from the hybrid integrated circuit 11011 of fig. 93A in having a memory/processing unit 11021' instead of a DRAM die 11021.
Fig. 93C illustrates a hybrid integrated circuit 11013 that differs from the hybrid integrated circuit 11011 of fig. 93A in having an HBM memory chip stack with TSV connectivity and a WOW-connected die at its bottom (collectively designated 11040), which includes a DRAM die 11021 between the stack of HBM memory units and base die 11019.
DRAM die 11021 is coupled to first memory controller 11031 of base die 11019 using WOW technology (see WOW interlayer 11023). One or more of HBM memory dies 11032 may include logic and memory, and may be or include memory/processing units.
The lowermost DRAM die (denoted DRAM die 11021 in fig. 93C) may be an HBM memory die or may differ from an HBM die. The lowermost DRAM die (DRAM die 11021) may be replaced by a memory/processing unit 11021', as illustrated by hybrid integrated circuit 11014 of fig. 93D.
Figs. 93E-93G illustrate hybrid integrated circuits 11015, 11016, and 11016', respectively, in which a base die 11019 is coupled to multiple instances of an HBM DRAM stack (11020) with TSV connectivity and micro-bumps at the lowest level, of a wafer (11030) having at least memory resources and coupled using WOW technology, and/or of an HBM memory chip stack (11040) with TSV connectivity and a WOW-connected die at its bottom.
Fig. 93H illustrates a hybrid integrated circuit 11014' that differs from the hybrid integrated circuit 11014 of fig. 93D in also illustrating a memory unit 11053, a level-2 cache (L2 cache 11052), and multiple processors 11051. Processors 11051 are coupled to L2 cache 11052 and may be fed with coefficients and/or data stored in memory unit 11053 and L2 cache 11052.
Any of the hybrid integrated circuits mentioned above may be used for Artificial Intelligence (AI) processing, which is bandwidth intensive.
When coupled to a memory controller using WOW techniques, memory/processing unit 11021' of fig. 93D and 93H may perform AI calculations and may receive both data and coefficients from the HBM DRAM stack and/or from WOW connected dies at very high rates.
Any memory/processing unit may include a distributed memory array and a processor array. A distributed memory and processor array may include multiple memory banks and multiple processors. The plurality of processors may form a processing array.
Referring to figs. 93C, 93D, and 93H, assume that a hybrid integrated circuit (11013, 11014, or 11014') is required to perform general matrix-vector multiplications (GEMVs), which compute products of matrices and vectors. This type of computation is bandwidth intensive because there is no reuse of the fetched matrix: the entire matrix is retrieved, and each of its elements is used only once.
The GEMV may be part of a sequence of mathematical operations that involves (i) multiplying a first matrix (A) by a first vector (V1) to provide a first intermediate vector, and applying a first non-linear operation (NLO1) to the first intermediate vector to provide a first intermediate result; (ii) multiplying a second matrix (B) by the first intermediate result to provide a second intermediate vector, and applying a second non-linear operation (NLO2) to the second intermediate vector to provide a second intermediate result; and so on, until an Nth intermediate result is obtained (N may exceed 2).
Assuming each matrix is large (e.g., 1 Gb), the computation may require on the order of 1 Tb/s of computational throughput and 1 Tb/s of memory bandwidth. Operations and computations may be performed in parallel.
Assume that the GEMV calculation exhibits N = 4 and has the following form: Result = NLO4(D × NLO3(C × NLO2(B × NLO1(A × V1)))).
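For illustration, the chained computation above can be sketched in Python with NumPy. The matrix size and the use of ReLU as a stand-in for the unspecified non-linear operations NLO1 through NLO4 are assumptions, not part of the disclosure:

```python
import numpy as np

def nlo(x):
    # Stand-in for the unspecified non-linear operations NLO1..NLO4 (here ReLU).
    return np.maximum(x, 0.0)

n = 1024  # illustrative matrix dimension
rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((n, n)) for _ in range(4))
v1 = rng.standard_normal(n)

# Result = NLO4(D x NLO3(C x NLO2(B x NLO1(A x V1))))
result = nlo(D @ nlo(C @ nlo(B @ nlo(A @ v1))))
```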
Also assuming that DRAM die 11021 (or memory/processing unit 11021') does not have enough memory resources to store A, B, C, and D simultaneously, at least some of these matrices will be stored in HBM DRAM dies 11032.
The base die is assumed to be a logic die that includes computational units such as, but not limited to, processors, arithmetic logic units, and the like.
While the base die is computing A × V1, the first memory controller 11031 retrieves the missing portions of the other matrices from one or more HBM DRAM dies 11032 for subsequent computations.
Referring to fig. 93H, assume that (a) DRAM die 11021 has a bandwidth of 2 TB/s and a capacity of 512 Mb, (b) HBM DRAM dies 11032 have a bandwidth of 0.2 TB/s and a capacity of 8 Gb, and (c) L2 cache 11052 is an SRAM having a bandwidth of 6 TB/s and a capacity of 10 Mb.
Matrix multiplication involves data reuse. A large matrix is segmented into multiple sections (e.g., 5 Mb sections, sized to fit an L2 cache that may be used in a double-buffer configuration), and the fetched first-matrix section is multiplied by sections of the second matrix, one second-matrix section after another.
While the first-matrix section is being multiplied by one second-matrix section, another second-matrix section is fetched from DRAM die 11021 (or memory/processing unit 11021') to the L2 cache.
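A minimal sketch of this double-buffer pattern, assuming a simple thread-based prefetcher that models the DRAM-to-L2 transfer; the section size and all names are illustrative:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fetch_section(matrix, j, cols):
    # Models a DRAM -> L2 transfer of one second-matrix section.
    return matrix[:, j:j + cols].copy()

def tiled_multiply(a_section, b, cols=256):
    n = b.shape[1]
    out = []
    with ThreadPoolExecutor(max_workers=1) as fetcher:
        nxt = fetcher.submit(fetch_section, b, 0, cols)
        for j in range(0, n, cols):
            cur = nxt.result()              # section already in the "L2 buffer"
            if j + cols < n:                # prefetch the next section
                nxt = fetcher.submit(fetch_section, b, j + cols, cols)
            out.append(a_section @ cur)     # compute overlaps the next fetch
    return np.concatenate(out, axis=1)
```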
Assuming the matrices are each 1 Gb, DRAM die 11021 or memory/processing unit 11021' is fed with matrix segments from HBM DRAM dies 11032 while fetches and computations are performed.
DRAM die 11021 or memory/processing unit 11021' aggregates matrix segments, which are then fed to base die 11019 via the WOW interlayer (11023).
Memory/processing unit 11021' may reduce the amount of information sent to base die 11019 via the WOW interlayer (11023) by performing computations and sending results instead of the intermediate values from which those results are computed. When multiple (Q) intermediate values are processed to provide a single result, the compression ratio may be Q to 1.
Fig. 93I illustrates an example of a memory processing unit 11019' implemented using WOW technology. Logic 9030 (which may be a processor subunit), controller 9020, and bus 9021 are located in one chip 11061; memory banks 9040 allocated to different logic are located in a second chip 11062; and the first and second chips are connected to each other using conductors 11012' that pass through a WOW junction, which may include one or more WOW interlayers.
Fig. 93J is an example of a method 11100 for memory-intensive processing. Memory-intensive means that the processing requires, or is associated with, high-bandwidth memory consumption.
Each processor subunit may be coupled to a unique portion of the first memory resources allocated to the processor subunit.
The unique portion of the first memory resources may be at least one memory bank.
The second memory resource may be a High Bandwidth Memory (HBM) memory resource or may be different from the HBM memory resource.
At least one other die is a stack of High Bandwidth Memory (HBM) memory chips.
Communication chip
A database includes numerous records, each including a plurality of fields. Database processing typically includes executing one or more queries that include one or more filtering parameters (e.g., identifying one or more relevant fields and one or more relevant field values) and one or more operational parameters that may determine the type of operation to be performed, the variables or constants to be used when applying the operation, and the like. The database processing may include database analytics or other database processing procedures.
For example, a database query may request that a statistical operation (an operational parameter) be performed on all records of the database in which a certain field has a value within a predefined range (a filtering parameter). As yet another example, a database query may request deletion (an operational parameter) of records in which a certain field is less than a threshold (a filtering parameter).
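For illustration, the two query shapes above (a filtering parameter combined with an operational parameter) can be sketched over in-memory records; the field names and values are invented:

```python
records = [
    {"id": 1, "age": 34, "balance": 120.0},
    {"id": 2, "age": 71, "balance": 15.5},
    {"id": 3, "age": 52, "balance": 48.0},
]

# Query 1: statistical operation (mean) over records whose field is in a range.
in_range = [r["balance"] for r in records if 20.0 <= r["balance"] <= 200.0]
mean_balance = sum(in_range) / len(in_range)

# Query 2: delete records whose field is below a threshold.
records = [r for r in records if r["age"] >= 40]
```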
Large databases are typically stored in storage devices. In order to respond to a query, the database is sent to memory units, typically one database segment after another.
Records of the database segment are sent from the memory unit to processors that do not belong to the same integrated circuit as the memory unit. The records are then processed by the processors.
For each database segment of the database stored in the memory unit, the process includes the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether the record is relevant; and (iv) performing one or more additional operations (summing, or applying any other mathematical and/or statistical operations) on the relevant records.
The filtering process ends after all records have been sent to the processor and the processor has determined which records are relevant.
If the relevant records of the database segment are not stored in the processor, these relevant records need to be sent to the processor again after the filtering phase for further processing (applying the post-filtering operations).
When multiple processing operations follow a single filtering pass, the results of each operation may be sent back to the memory unit and then sent again to the processor.
This process is bandwidth consuming and time consuming.
There is an increasing need to provide efficient ways of performing database processing.
An apparatus may be provided that may include a database acceleration integrated circuit.
An apparatus may be provided that may include one or more groups of database acceleration integrated circuits, which may be configured to exchange information and/or database acceleration results (final results of the processing performed by the database acceleration integrated circuits) between the database acceleration integrated circuits of the one or more groups.
The group of database acceleration integrated circuits may be connected to the same printed circuit board.
The group of database acceleration integrated circuits may belong to a modular unit of a computerized system.
Different groups of database acceleration integrated circuits may be connected to different printed circuit boards.
Different groups of database acceleration integrated circuits may belong to different modular units of a computerized system.
The apparatus may be configured to accelerate execution of a distributed processing program by the one or more groups of database acceleration integrated circuits.
The apparatus may be configured to use at least one switch for exchanging at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of different ones of the one or more groups.
The apparatus may be configured to accelerate execution of the distributed processing program by some of the database acceleration integrated circuits of some of the one or more groups.
The apparatus may be configured to perform a distributed processing procedure of first and second data structures, wherein a total size of the first and second data structures exceeds a storage capacity of the plurality of memory processing integrated circuits.
The apparatus may be configured to perform the distributed processing procedure by performing a plurality of iterations of the following steps: (a) performing a new assignment of different pairs of the first data structure portion and the second data structure portion to different database acceleration integrated circuits; and (b) processing the different pairs.
Figs. 94A and 94B illustrate examples of a storage system 11560, a computing system 11510, and one or more devices for database acceleration 11520. The one or more devices for database acceleration 11520 may monitor communication between storage system 11560 and computing system 11510 in various ways (by listening, or by being located between computing system 11510 and storage system 11560).
The storage system 11560 may include many (e.g., more than 20, 50, 100, and the like) storage units (such as disks or RAID arrays thereof) and may store, for example, more than 100 terabytes of information. Computing system 11510 may be a large computer system and may include tens, hundreds, or even thousands of processing units.
The computing system 11510 may include a plurality of computing nodes 11512 controlled by a manager 11511.
The compute nodes may control or otherwise interact with one or more devices 11520 for database acceleration.
The one or more means for database acceleration 11520 may include one or more database acceleration integrated circuits (see, e.g., database acceleration integrated circuit 11530 of figs. 94A and 94B) and memory resources 11550. The memory resources may belong to one or more chips dedicated to memory, but may alternatively belong to a memory/processing unit.
Figs. 94C and 94D illustrate an example of a computing system 11510 and one or more devices 11520 for database acceleration.
One or more database acceleration integrated circuits of one or more devices for database acceleration 11520 may be controlled by a management unit 11513, which may be located within the computer system (see fig. 94C) or within one or more devices for database acceleration 11520 (fig. 94D).
Fig. 94E illustrates an apparatus 11520 for database acceleration, which includes a database acceleration integrated circuit 11530 and a plurality of memory processing integrated circuits 11551. Each memory processing integrated circuit may include a controller, a plurality of processor subunits, and a plurality of memory units.
Database acceleration integrated circuit 11530 is illustrated as including network communication interface 11531, first processing unit 11532, memory controller 11533, database acceleration unit 11535, interconnect 11536, and management unit 11513.
The network communication interface (11531) may be configured to receive (e.g., via first port 11531(1) of the network communication interface) vast amounts of information from a large number of storage units. Each storage unit may output information at rates exceeding tens and even hundreds of megabytes per second, and data transfer speeds are expected to increase over time (e.g., to double every 2 to 3 years). The number of storage units may exceed 10, 50, 100, 200, and even more. The aggregate amount of information may exceed tens or hundreds of gigabytes per second, and may even reach the terabyte-per-second range.
First processing unit 11532 may be configured to perform a first processing (pre-processing) on the bulk of the information to provide first processed information.
The plurality of memory processing integrated circuits 11551 may be configured to perform second processing (processing) of at least a portion of the first processed information to provide second processed information.
The memory controller 11533 may be configured to retrieve the retrieved information from the plurality of memory processing integrated circuits. The retrieved information may include at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information.
The database acceleration unit 11535 may be configured to perform database processing operations on the retrieved information to provide database acceleration results.
The database acceleration integrated circuit may be configured to output the database acceleration results, for example, via one or more second ports 11531(2) of the network communication interface.
Fig. 94E also illustrates a management unit 11513 configured to manage at least one of: the retrieval of the retrieved information, the first processing (preprocessing), the second processing (processing), and the third processing (database processing). The management unit 11513 may alternatively be located external to the database acceleration integrated circuit.
The management unit may be configured to perform the management based on an execution plan. The execution plan may be generated by the management unit, or may be generated by an entity external to the database acceleration integrated circuit. The execution plan may include at least one of: (a) instructions to be executed by various components of the database acceleration integrated circuit, (b) data and/or coefficients needed to implement the execution plan, and (c) memory allocation for instructions and/or data.
The management unit may be configured to perform the management by assigning at least some of: (a) network communication network interface resources, (b) decompression unit resources, (c) memory controller resources, (d) memory processing integrated circuit resources, and (e) database acceleration unit resources.
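As a purely hypothetical illustration of how such an execution plan might be structured (none of these field names or values come from the disclosure):

```python
# Hypothetical shape of an execution plan; all field names are illustrative.
execution_plan = {
    "instructions": [
        {"unit": "first_processing_unit", "op": "decompress"},
        {"unit": "memory_processing_ics", "op": "filter", "field": "age", "min": 40},
        {"unit": "db_acceleration_unit",  "op": "aggregate", "func": "sum"},
    ],
    "coefficients": {"filter_threshold": 40},
    "memory_allocation": {
        "instructions_region": (0x0000, 0x0FFF),
        "data_region":         (0x1000, 0xFFFF),
    },
    "resource_assignment": {
        "network_ports": ["eth0"],
        "memory_controllers": [0, 1],
    },
}
```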
As illustrated in figs. 94E and 94G, the network communication interface may include different types of network communication ports.
The different types of network communication ports may include storage interface protocol ports (e.g., SATA ports, ATA ports, iSCSI ports, network file system ports, Fibre Channel ports) and networked storage interface protocol ports (e.g., ATA over Ethernet, Fibre Channel over Ethernet, NVMe, RoCE, and others).
The different types of network communication ports may include a storage interface protocol port and a PCIe port.
FIG. 94F includes dashed lines illustrating the flow of bulk information, first processed information, retrieved information, and database acceleration results. FIG. 94F illustrates the database acceleration integrated circuit 11530 as coupled to a plurality of memory resources 11550. The plurality of memory resources 11550 may not belong to a memory processing integrated circuit.
The means for database acceleration 11520 may be configured to perform multiple tasks concurrently by the database acceleration integrated circuit 11530 because the network communication interface 11531 may receive multiple information streams (concurrently), the first processing unit 11532 may perform the first processing on multiple information units concurrently, the memory controller 11533 may send multiple first processed information units concurrently to the multiple memory processing integrated circuits 11551, and the database acceleration unit 11535 may process multiple retrieved information units concurrently.
The means for database acceleration 11520 may be configured to perform at least one of the fetching, the first processing, the sending, and the third processing based on an execution plan sent to the database acceleration integrated circuit by a compute node of the large computing system.
The means for database acceleration 11520 may be configured to manage at least one of the fetching, the first processing, the sending, and the third processing in a manner that substantially optimizes utilization of the database acceleration integrated circuit. The optimization takes into account latency, throughput, and any other timing, storage, or processing considerations, and attempts to keep all components along the flow path busy and free of bottlenecks.
The means for database acceleration 11520 may be configured to substantially optimize the bandwidth of traffic exchanged via the network communication interface.
The means for database acceleration 11520 may be configured to substantially prevent bottlenecks from forming in at least one of the fetching, the first processing, the sending, and the third processing, in a manner that substantially optimizes utilization of the database acceleration integrated circuit.
The means for database acceleration 11520 may be configured to allocate resources of the database acceleration integrated circuit according to the instantaneous I/O bandwidth.
Fig. 94G illustrates an apparatus 11520 for database acceleration, which includes a database acceleration integrated circuit 11530 and a plurality of memory processing integrated circuits 11551. Fig. 94G also illustrates various units coupled to the database acceleration integrated circuit 11530: a remote RAM 11546, an Ethernet memory DIMM 11547, a storage system 11560, a local storage unit 11561, and a non-volatile memory (NVM) 11563 (which may be a fast NVM unit (NVMe)).
The database acceleration integrated circuit 11530 is illustrated as including an Ethernet port 11531(1), an RDMA unit 11545, a serial expansion port 11531(15), a SATA controller 11540, a PCIe port 11531(9), a first processing unit 11532, a memory controller 11533, a database acceleration unit 11535, an interconnect 11536, a management unit 11513, a cryptographic engine 11537 for performing cryptographic operations, and a level-2 static random access memory (L2 SRAM) 11538.
The database acceleration unit is illustrated as including a DMA engine 11549, a level-3 (L3) memory 11548, and a database acceleration subunit 11547. The database acceleration subunit 11547 may be a configurable unit.
Ethernet port 11531(1), RDMA unit 11545, serial expansion port 11531(15), SATA controller 11540, and PCIe port 11531(9) may be considered part of network communication interface 11531.
PCIe port 11531(9) is coupled to NVM 11563. PCIe ports may also be used to exchange commands, for example for management purposes.
Fig. 94H is an example of the database acceleration unit 11535.
The database acceleration unit 11535 may be configured to execute database processing instructions concurrently by database processing subunits 11573; the database acceleration unit may include a group of database accelerator subunits that share a shared memory unit 11575.
Different combinations of database acceleration subunits may be dynamically linked to each other (via configurable links or interconnect 11576) to provide the execution pipelines needed to perform database processing operations that may include multiple instructions.
Each database processing subunit may be configured to execute a particular type of database processing instruction (e.g., filter, merge, accumulate, and the like).
Fig. 94H also illustrates a separate database processing unit 11572 coupled to a cache 11571. Database processing unit 11572 and cache 11571 may be provided instead of, or in addition to, the reconfigurable array 11574 of the DB accelerator.
The apparatus may facilitate scaling in and/or scaling out, enabling multiple database acceleration integrated circuits 11530 (and their associated memory resources 11550 or their associated memory processing integrated circuits 11551) to cooperate with one another, for example by participating in distributed processing of database operations.
Fig. 94I illustrates a modular unit, such as a blade 11580, that includes two database acceleration integrated circuits 11530 (and their associated memory resources 11550). A blade may include one, two, or more than two memory processing integrated circuits 11551 and their associated memory resources 11550.
A blade may also include one or more non-volatile memory units, Ethernet switches, and PCIe switches.
Multiple blades may communicate with each other using any communication method, communication protocol, and connectivity.
Fig. 94I also illustrates four database acceleration integrated circuits 11530 (and their associated memory resources 11550) fully connected to each other, with each database acceleration integrated circuit 11530 connected to all three other database acceleration integrated circuits 11530. Connectivity may be achieved using any communication protocol, for example by using the RDMA over Ethernet protocol.
Fig. 94I further illustrates a database acceleration integrated circuit 11530 connected to its associated memory resources 11550 and to a unit 11531 that includes RAM memory and an Ethernet port.
Fig. 94J, 94K, 94L, and 94M illustrate four groups 11580 of database acceleration integrated circuits, each group including four database acceleration integrated circuits 11530 (fully connected to each other) and their associated memory resources 11550. The different groups are connected to each other via a switch 11590.
The number of groups may be two, three or more than four. The number of database acceleration integrated circuits per group may be two, three, or more than four. The number of groups may be the same as (or may be different from) the number of database acceleration integrated circuits per group.
Fig. 94K illustrates two tables A and B that are too large (e.g., 1 terabyte each) to be joined efficiently at once.
Each table is therefore segmented into shards, and the join operation is applied to pairs of shards, one shard from table A and one shard from table B.
The group of database acceleration integrated circuits may process the shards in various ways.
For example, the apparatus may be configured to perform the distributed processing by:
(a) assigning different first data structure portions (shards of table A, e.g., first through sixteenth shards A0 through A15) to different database acceleration integrated circuits of the one or more groups; and
(b) performing a plurality of iterations of: (i) newly assigning different second data structure portions (shards of table B, e.g., first through sixteenth shards B0 through B15) to different database acceleration integrated circuits of the one or more groups; and (ii) processing the first and second data structure portions by the database acceleration integrated circuits.
The apparatus may be configured to perform the new assignment for the next iteration in a manner that at least partially overlaps in time with the processing of the current iteration.
The apparatus may be configured to perform the new allocation by exchanging the second data structure portion between different database acceleration integrated circuits.
The exchange may be performed in a manner that at least partially overlaps the processing.
The apparatus may be configured to perform the new allocation by exchanging second data structure portions between different database acceleration integrated circuits of a group, and, once that exchange has been completed, exchanging second data structure portions between the different groups of database acceleration integrated circuits.
In fig. 94K, four cycles of some of the join operations are shown. For example, with reference to the upper-left database acceleration integrated circuit 11530 of the upper-left group, the four cycles include computing Join(A0, B0), Join(A0, B3), Join(A0, B2), and Join(A0, B1). During these four cycles, A0 remains at the same database acceleration integrated circuit 11530, while the shards of table B (B0, B1, B2, and B3) rotate between the members of the same group of database acceleration integrated circuits 11530.
In fig. 94L, the shards of the second table are rotated between the different groups: (a) shards B0, B1, B2, and B3 (previously processed by the upper-left group) are sent from the upper-left group to the lower-left group; (b) shards B4, B5, B6, and B7 (previously processed by the lower-left group) are sent from the lower-left group to the upper-right group; (c) shards B8, B9, B10, and B11 (previously processed by the upper-right group) are sent from the upper-right group to the lower-right group; and (d) shards B12, B13, B14, and B15 (previously processed by the lower-right group) are sent from the lower-right group to the upper-left group.
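The rotation schedule described above can be sketched as follows, assuming 4 groups of 4 database acceleration integrated circuits. Rotating the B shards within each group on every cycle, and rotating the shard sets between groups after each full inner rotation, covers every (A, B) shard pair; the code below only models the schedule, not the join itself:

```python
from itertools import product

def rotate(items):
    # One rotation step: each member hands its shard to the next member.
    return items[-1:] + items[:-1]

groups, members = 4, 4
a_shards = [[f"A{g * members + m}" for m in range(members)] for g in range(groups)]
b_shards = [[f"B{g * members + m}" for m in range(members)] for g in range(groups)]

joins = set()
for outer in range(groups):                 # rotate B shard sets between groups
    for inner in range(members):            # rotate B shards within each group
        for g, m in product(range(groups), range(members)):
            joins.add((a_shards[g][m], b_shards[g][m]))  # Join(Ai, Bj)
        b_shards = [rotate(grp) for grp in b_shards]
    b_shards = rotate(b_shards)             # hand shard sets to the next group

assert len(joins) == (groups * members) ** 2  # all 256 shard pairs covered
```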
Fig. 94N is an example of a system that includes a plurality of blades 11580, SATA controllers 11540, local storage units 11561, NVMe units 11563, PCIe switches 11601, Ethernet memory DIMMs 11547, and Ethernet ports 11531(4).
Fig. 94O illustrates two systems 11621 and 11622.
The system 11621 may include one or more devices 11520 for database acceleration, a switching system 11611, a storage system 11612, and a computing system 11613. Switching system 11611 provides connectivity between one or more devices 11520 for database acceleration, storage system 11612 and computing system 11613.
The system 11622 may include a storage system with one or more devices for database acceleration 11615, a switching system 11611, and a computing system 11613. The switching system 11611 provides connectivity between the storage system with the one or more devices for database acceleration 11615 and the computing system 11613.
FIG. 95A illustrates a method 11200 for database acceleration.
The method 11200 may begin at step 11210 of retrieving vast amounts of information from a large number of storage units via a network communication interface of a database acceleration integrated circuit.
Connecting to a large number of storage units (e.g., using multiple different buses) enables the network communication interface to receive vast amounts of information even when each individual storage unit has limited throughput.
It should be noted that steps 11210, 11220, 11230, 11240, 11250, and 11260 of method 11200, or any other steps, may be performed in a pipelined manner. These steps may also be performed simultaneously, or in an order different from the one mentioned above.
For example, step 11220 may be followed by step 11250, such that the first processed information is further processed by the database acceleration unit.
For yet another example, the first processed information may be sent to a plurality of memory processing integrated circuits and then sent (without processing by the plurality of memory processing integrated circuits) to the database acceleration unit.
For yet another example, the first processed information and/or the second processed information may be output from the database acceleration integrated circuit without database processing by the database acceleration unit.
The method may include performing, based on an execution plan sent to the database acceleration integrated circuit by a compute node of the large computing system, at least one of: the fetching, the first processing, the sending, and the third processing.
The method may include managing at least one of the fetching, the first processing, the sending, and the third processing in a manner that substantially optimizes utilization of the database acceleration integrated circuit.
The method may include substantially optimizing the bandwidth of traffic exchanged via the network communication interface.
The method may include substantially preventing bottlenecks from forming in at least one of the fetching, the first processing, the sending, and the third processing, in a manner that substantially optimizes utilization of the database acceleration integrated circuit.
The management may be performed based on an execution plan generated by a management unit of the database acceleration integrated circuit.
The management may be performed based on an execution plan received by a management unit of the database acceleration integrated circuit and not generated by the management unit.
The managing may include assigning at least some of: (a) network communication network interface resources, (b) decompression unit resources, (c) memory controller resources, (d) memory processing integrated circuit resources, and (e) database acceleration unit resources.
FIG. 95B illustrates a method 11300 for operating a group of database accelerated integrated circuits.
The method 11300 may begin in step 11310 by executing a database acceleration operation by the database acceleration integrated circuit. Step 11310 may include performing one or more steps of method 11200.
The method 11300 may also include a step 11320 of exchanging at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.
The combination of steps 11310 and 11320 may amount to performing distributed processing by the one or more groups of database acceleration integrated circuits.
The exchanging may be performed using the network communication interfaces of the database acceleration integrated circuits of the one or more groups.
The exchanging may be performed via a plurality of groups, which may be connected to each other by a star connection.
The execution of the distributed processing may include performing multiple iterations of: (a) performing a new assignment of different pairs of the first data structure portion and the second data structure portion to different database acceleration integrated circuits; and (b) processing the different pairs.
Execution of the distributed processing may include performing a database join operation.
FIG. 95C illustrates a method 11350 for database acceleration.
The method 11350 may include a step 11352 of retrieving vast amounts of information from a large number of storage units via a network communication interface of the database acceleration integrated circuit.
The method can also include a step 11355 of second processing the first processed information to provide second processed information. The second processing is performed by a plurality of processors located in one or more memory processing integrated circuits further comprising a plurality of memory resources. Step 11355 follows step 11354 and precedes step 11356.
The total size of the second processed information may be less than the total size of the first processed information.
The total size of the first processed information may be less than the total size of the volume of information.
The first processing may include filtering database entries. Thus, irrelevant database entries are filtered out before any other processing is performed, and even before they are stored in the multiple memory resources, thereby saving bandwidth, storage resources, and other processing resources.
The second processing may include filtering database entries. Such filtering may be applied when the filtering condition is complex (includes multiple conditions) and requires receiving multiple fields of a database entry before the filtering can proceed. For example, when searching for (a) people over a certain age who like bananas and (b) people over another age who like apples.
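For illustration, such a compound filtering condition, which needs several fields of each record before it can be evaluated, might be sketched as follows (the field names and thresholds are invented):

```python
def complex_filter(record):
    # Both branches need several fields, so the full record (or at least
    # these fields) must be available before filtering can be decided.
    return ((record["age"] > 60 and record["favorite_fruit"] == "banana") or
            (record["age"] > 30 and record["favorite_fruit"] == "apple"))

people = [
    {"age": 65, "favorite_fruit": "banana"},
    {"age": 35, "favorite_fruit": "apple"},
    {"age": 25, "favorite_fruit": "banana"},
]
relevant = [p for p in people if complex_filter(p)]  # first two records
```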
Database
The following examples may refer to a database. The database may be a data center, may be part of a data center, or may not belong to a data center.
The database may be coupled to a plurality of users via one or more networks. The database may be a cloud database.
A database may be provided that includes one or more management units and a plurality of database accelerator boards, including one or more memory/processing units.
Fig. 96B illustrates a database 12020 including a management unit 12021 and a plurality of DB accelerator boards 12022 each including a communication/management processor (processor 12024) and a plurality of memory/processing units 12026.
The processor 12024 may support various communication protocols, such as but not limited to PCIe, a RoCE-like protocol, and the like.
Database commands may be executed by the memory/processing units 12026, and the processor may route traffic between the memory/processing units 12026, between different DB accelerator boards 12022, and to and from the management unit 12021.
The use of multiple memory/processing units 12026 can significantly accelerate the execution of database commands and avoid communication bottlenecks, especially when the memory/processing units include large internal memory banks.
Fig. 96C illustrates a DB accelerator board 12022 that includes a processor 12024 and a plurality of memory/processing units 12026. The processor 12024 includes a number of communication-specific components, such as a DDR controller 12033 for communicating with the memory/processing units 12026, an RDMA engine 12031, a DB query engine 12034, and the like. The DDR controller is an example of a communication controller, and the RDMA engine is an example of a communication engine.
A method may be provided for operating the system (or any portion thereof) of any of figs. 96B, 96C, and 96D.
It should be noted that the database acceleration integrated circuit 11530 may be associated with a plurality of memory resources that are not included in a plurality of memory processing integrated circuits or otherwise associated with a processing unit. In this case, the processing is performed primarily and even only by the database acceleration integrated circuit.
FIG. 94P illustrates a method 11700 for database acceleration.
The method 11700 can include a step 11710 of retrieving information from a storage unit via a network communication interface of the database acceleration integrated circuit.
The first and/or second processing may include filtering the database entries to determine which database entries should be further processed.
The second processing may include filtering the database entries.
Hybrid system
The memory/processing unit may be efficient when performing computations that are memory intensive and/or whose bottlenecks relate to fetch operations. Processing-oriented (and less memory-oriented) processor units (such as, but not limited to, graphics processing units and central processing units) may be more efficient when the bottlenecks are associated with arithmetic operations.
A hybrid system may include both one or more processor units and one or more memory/processing units, which may be fully or partially connected to each other.
The memory/processing unit (MPU) may be fabricated by a first fabrication process that is better suited to memory cells than to logic cells. For example, memory cells fabricated by the first fabrication process may exhibit critical dimensions that are smaller, and even much smaller (e.g., by a factor of more than 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like), than the critical dimensions of logic circuits fabricated by the same first fabrication process. For example, the first fabrication process may be an analog fabrication process, a DRAM fabrication process, and the like.
The processor may be fabricated by a second fabrication process that is better suited to logic cells. For example, the critical dimensions of logic circuits fabricated by the second fabrication process may be smaller, and even much smaller, than the critical dimensions of logic circuits fabricated by the first fabrication process. As yet another example, the critical dimensions of logic circuits fabricated by the second fabrication process may be smaller, and even much smaller, than the critical dimensions of memory cells fabricated by the first fabrication process. For example, the second fabrication process may be an analog fabrication process, a CMOS fabrication process, and the like.
Tasks may be distributed between different units in a static or dynamic manner by taking into account the benefits of each unit and any penalties associated with transferring data between units.
For example, memory-intensive processing programs may be allocated to the memory/processing units, while processing-intensive but memory-light programs may be allocated to the processor units.
A processor may send requests or instructions to one or more memory/processing units to perform various processing tasks. Execution of these processing tasks by the memory/processing units may relieve the processor of the burden, reduce latency, and in some cases reduce the overall information bandwidth between the one or more memory/processing units and the processor, and the like.
The processor may provide instructions and/or requests at different granularities; for example, the processor may send instructions directed to specific processing resources, or may send higher-order instructions to a memory/processing unit without specifying any processing resource.
Fig. 96D is an example of a hybrid system 12040 including one or more memory/processing units (MPUs) 12043 and a processor 12042. Processor 12042 may send the request or instruction to one or more MPUs 12043, which in turn complete (or selectively complete) the request and/or instruction and send the results to processor 12042, as described above.
The processor 12042 may further process the results to provide one or more outputs.
Each MPU includes memory resources, processing resources (such as a compact microcontroller 12044), and a cache 12049. The microcontroller may have limited arithmetic capabilities (e.g., it may mainly include multiply-accumulate units).
Microcontroller 12044 can apply processing programs for in-memory acceleration purposes, or can serve as the CPU, or as an entire DB processing engine or a subset thereof.
There may be more than one DDR controller for fast inter-DIMM communication.
The goal of in-memory processing is to reduce bandwidth, data movement, and power consumption, and to increase performance. Using in-memory processing can result in a significant performance/TCO improvement over standard solutions.
It should be noted that the management unit is optional.
Each MPU may operate as an Artificial Intelligence (AI) memory/processing unit, as it may perform AI calculations and pass only the results back to the processor, thereby reducing traffic. This is particularly true when the MPU receives and stores neural network coefficients to be used in multiple calculations, so that the coefficients need not be received from an external chip each time a portion of the neural network processes new data.
The MPU can determine when coefficients are zero and notify the processor that multiplications involving zero-valued coefficients need not be performed.
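A minimal sketch of this zero-skipping idea (purely illustrative; in the disclosure the detection happens at the hardware level inside the MPU):

```python
def sparse_dot(coefficients, activations):
    # Skip multiplications whose coefficient is zero; the MPU can detect
    # these in memory and tell the processor they need not be performed.
    total = 0.0
    for c, a in zip(coefficients, activations):
        if c != 0.0:          # zero-valued coefficients contribute nothing
            total += c * a
    return total

coeffs = [0.0, 0.5, 0.0, -1.2]
acts   = [3.0, 2.0, 7.0,  1.0]
print(sparse_dot(coeffs, acts))  # only two multiplications performed
```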
It should be noted that the first and second processing may include filtering database entries.
The MPU may be any memory processing unit described in any of this specification, PCT patent application WO2019025862 and PCT patent application No. PCT/IB 2019/001005.
An AI computing system (and methods executable by the system) may be provided in which a network adapter has AI processing capabilities and is configured to perform some AI processing tasks in order to reduce the amount of traffic sent over a network that couples a plurality of AI acceleration servers.
For example, in some inference systems the input arrives over a network (e.g., multiple streams from IP cameras connected to an AI server). In this case, utilizing RDMA and AI on the processing and network connection unit may reduce the load on the CPU and the PCIe bus, with the processing provided by the processing and network connection unit rather than by a GPU that is not included in the processing and network connection unit.
For example, instead of computing and sending initial results to a target AI acceleration server (which applies one or more AI processing operations), the processing and network connection unit may perform pre-processing that reduces the number of values sent to the target AI acceleration server. The target AI acceleration server is an AI acceleration server assigned to perform calculations on values provided by other AI acceleration servers. This reduces the bandwidth of the traffic exchanged between the AI acceleration servers and also reduces the load on the target AI acceleration server.
The target AI acceleration server may be allocated in a dynamic or static manner by using load balancing or other allocation algorithms. There may be more than a single target AI acceleration server.
For example, if the target AI acceleration server adds multiple losses, the processing and network connection unit may add the losses generated by its own AI acceleration server and send the sum of the losses to the target AI acceleration server, thereby reducing bandwidth. The same benefit may be obtained when performing other preprocessing operations, such as derivative calculations, aggregation, and the like.
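For illustration, the loss pre-aggregation described above amounts to a local reduction before transmission; a trivial sketch with invented values:

```python
def local_preaggregate(losses):
    # The processing and network connection unit sums the losses produced by
    # its own server's accelerators and ships one value instead of many.
    return sum(losses)

per_accelerator_losses = [0.42, 0.37, 0.51, 0.44]   # illustrative values
payload = local_preaggregate(per_accelerator_losses)
# send(payload) -> one number crosses the network instead of four
```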
Fig. 97B illustrates a system 12060 that includes subsystems, each including a switch 12061 for connecting AI processing and network connection units 12063 and server boards 12064 to each other. Each server motherboard includes one or more AI processing and network connection units 12063 having network capabilities and AI processing capabilities. The AI processing and network connection unit 12063 may include one or more NICs, and ALUs or other computational circuitry for performing the pre-processing.
The AI processing and network connection unit 12063 may be a single chip or may include more than a single chip. Implementing the AI processing and network connection unit 12063 as a single chip may be beneficial.
The AI processing and network connection unit 12063 may include only, or primarily, processing resources. It may or may not include in-memory compute circuitry, and may or may not include large memory resources.
The AI processing and network connection unit 12063 may be an integrated circuit, may include more than a single integrated circuit, may be part of an integrated circuit, and the like.
The AI processing and network connection unit 12063 can transport (see, e.g., fig. 97C) traffic between the AI acceleration server that includes the AI processing and network connection unit 12063 and other AI acceleration servers (e.g., by using communication ports such as DDR channels, network channels, and/or PCIe channels). The AI processing and network connection unit 12063 may also be coupled to external memory, such as DDR memory. The processing and network connection unit may include memory and/or may include a memory/processing unit.
In fig. 97C, the AI processing and network connection unit 12063 is illustrated as including a local DDR connection, DDR channels, an AI accelerator, RAM memory, an encryption/decryption engine, a PCIe switch, a PCIe interface, multiple core processing arrays, fast network connections, and the like.
A method may be provided for operating the system (or any portion thereof) of either of figs. 97B and 97C.
Any combination of any steps of any of the methods mentioned in the present application may be provided.
Any combination of any of the units, integrated circuits, memory resources, logic, processing sub-units, controllers, components mentioned in the present application may be provided.
Any reference to "comprising" and/or "including" may be applied to "consisting", "consisting essentially" with necessary modification in detail.
The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and adaptations will be apparent to those skilled in the art from consideration of this specification and practice of the disclosed embodiments. Additionally, although aspects of the disclosed embodiments are described as being stored in memory, those skilled in the art will appreciate that these aspects can also be stored on other types of computer-readable media, such as secondary storage devices, for example, hard disks or CD ROMs, or other forms of RAM or ROM, USB media, DVD, Blu-ray, 4K Ultra HD Blu-ray, or other optical drive media.
Computer programs based on the written description and the disclosed methods are within the skill of an experienced developer. The various programs or program modules can be created using any of the techniques known to one skilled in the art or can be designed in connection with existing software. For example, program sections or program modules can be designed in or by means of .Net Framework, .Net Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, C++, Objective-C, HTML, HTML/AJAX combinations, XML, or HTML with included Java applets.
Moreover, although illustrative embodiments have been described herein, the scope of any and all embodiments having equivalent components, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations and/or alterations will become apparent to those skilled in the art based on the present disclosure. The limitations in the claims are to be interpreted broadly based on the language employed in the claims and not limited to embodiments described in the specification or during the prosecution of the application. The embodiments are to be construed as non-exclusive. Moreover, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. Accordingly, the specification and examples are to be considered as illustrative only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
Claims (368)
1. An integrated circuit, comprising:
a substrate;
a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks;
a processing array disposed on the substrate, the processing array comprising a plurality of processor sub-units, each of the plurality of processor sub-units associated with one or more discrete memory banks among the plurality of discrete memory banks; and
a controller configured to:
implement at least one security measure with respect to operation of the integrated circuit.
2. The integrated circuit of claim 1, wherein the controller is configured to take one or more remedial actions if the at least one security measure is triggered.
3. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure in at least one memory location.
4. The integrated circuit of claim 2, wherein the data comprises weight data for a neural network model.
5. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising locking access to one or more memory portions of the memory array that are not used for input data or output data operations.
6. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising locking only a subset of the memory array.
7. The integrated circuit of claim 6, wherein the subset of the memory array is specified by a particular memory address.
8. The integrated circuit of claim 6, wherein the subset of the memory array is configurable.
9. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising controlling traffic to or from the integrated circuit.
10. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising uploading changeable data, code, or fixed data.
11. The integrated circuit of claim 10, wherein the uploading of the changeable data, code, or fixed data occurs during a boot process.
12. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising uploading a configuration file during a boot process, the configuration file identifying a particular memory address of at least a portion of the memory array to be locked after completion of the boot process.
13. The integrated circuit of claim 1, wherein the controller is further configured to require a complex password to unlock access to a memory portion of the memory array associated with one or more memory addresses.
14. The integrated circuit of claim 1, wherein the at least one security measure is triggered upon detecting an attempted access to at least one locked memory address.
15. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising:
calculating a checksum, hash, CRC (cyclic redundancy check), or check bits with respect to at least a portion of the memory array; and
comparing the calculated checksum, hash, CRC, or check bits with a predetermined value.
16. The integrated circuit of claim 15, wherein the controller is configured to determine whether the calculated checksum, hash, CRC, or check bits match the predetermined value as part of the at least one security measure.
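Illustrative note (not claim language): a minimal Python sketch of the checksum-style security measure of claims 15 and 16, with a remedial action in the spirit of claim 2. The function name and the use of zlib's CRC-32 are assumptions chosen for illustration, not features of the disclosed controller.

    import zlib

    def crc_security_check(memory_region: bytes, expected_crc: int) -> bool:
        # Calculate a CRC over a protected portion of the memory array
        # (claim 15) and compare it with a predetermined value (claim 16).
        calculated = zlib.crc32(memory_region) & 0xFFFFFFFF
        return calculated == expected_crc

    # Example: a remedial action (claim 2) is taken when the check fails.
    protected = bytes([0x12, 0x34, 0x56, 0x78])
    expected = zlib.crc32(protected) & 0xFFFFFFFF
    if not crc_security_check(protected, expected):
        raise RuntimeError("security measure triggered: ceasing operation")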
17. The integrated circuit of claim 1, wherein the at least one security measure comprises copying program code in at least two different memory portions.
18. The integrated circuit of claim 17, wherein the at least one security measure comprises determining whether output results of executing the program code in the at least two different memory portions are different.
19. The integrated circuit of claim 18, wherein the output result comprises an intermediate or final output result.
20. The integrated circuit of claim 17, wherein the at least two different memory portions are included within the integrated circuit.
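Illustrative note (not claim language): a hypothetical sketch of the duplicated-code security measure of claims 17 through 20. Each copy of the program code is modeled here as a Python callable; in the claimed circuit the copies would reside in two different memory portions, and intermediate results could be compared the same way (claim 19).

    def run_redundant(code_copy_a, code_copy_b, data):
        # Execute the same program code from two different memory portions
        # (claim 17) and flag a mismatch when the outputs differ (claim 18).
        result_a = code_copy_a(data)
        result_b = code_copy_b(data)
        return result_a == result_b, result_a

    square = lambda x: x * x
    ok, result = run_redundant(square, square, 7)
    if not ok:
        print("outputs differ: possible fault or tampering")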
21. The integrated circuit of claim 1, wherein the at least one security measure comprises determining whether an operational pattern differs from one or more predetermined operational patterns.
22. The integrated circuit of claim 2, wherein the one or more remedial actions include ceasing to perform an operation.
23. A method of protecting an integrated circuit against tampering, the method comprising:
implementing at least one security measure with respect to operation of the integrated circuit using a controller associated with the integrated circuit; wherein the integrated circuit comprises:
a substrate;
a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; and
a processing array disposed on the substrate, the processing array comprising a plurality of processor sub-units, each of the plurality of processor sub-units associated with one or more discrete memory banks among the plurality of discrete memory banks.
24. The method of claim 23, further comprising taking one or more remedial actions if the at least one security measure is triggered.
25. An integrated circuit, comprising:
a substrate;
a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks;
a processing array disposed on the substrate, the processing array comprising a plurality of processor sub-units, each of the plurality of processor sub-units associated with one or more discrete memory banks among the plurality of discrete memory banks; and
a controller configured to:
implementing at least one security measure with respect to operation of the integrated circuit; wherein the at least one security measure comprises copying the program code in at least two different memory portions.
26. An integrated circuit, comprising:
a substrate;
a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks;
a processing array disposed on the substrate, the processing array comprising a plurality of processor sub-units, each of the plurality of processor sub-units associated with one or more discrete memory banks among the plurality of discrete memory banks; and
a controller configured to implement at least one security measure with respect to operation of the integrated circuit.
27. The integrated circuit of claim 26, wherein the controller is further configured to take one or more remedial actions if the at least one security measure is triggered.
28. A distributed processor memory chip, comprising:
a substrate;
a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks;
a processing array disposed on the substrate, the processing array comprising a plurality of processor sub-units, each of the plurality of processor sub-units associated with one or more discrete memory banks among the plurality of discrete memory banks; and
a first communication port configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip; and
a second communication port configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.
29. The distributed processor memory chip of claim 28 further comprising a third communication port configured to establish a communication connection between the distributed processor memory chip and a second additional distributed processor memory chip.
30. The distributed processor memory chip of claim 29 further comprising a controller configured to control communications via at least one of the first communication port, the second communication port, the third communication port.
31. The distributed processor memory chip of claim 29, wherein each of the first communication port, the second communication port, and the third communication port is associated with a corresponding bus.
32. The distributed processor memory chip of claim 31, wherein the corresponding bus is a bus common to each of the first communication port, the second communication port, and the third communication port.
33. The distributed processor memory chip of claim 31, wherein the corresponding bus associated with each of the first, second, and third communication ports is connected to the plurality of discrete memory banks.
34. The distributed processor memory chip of claim 31 wherein at least one bus associated with the first communication port, the second communication port, and the third communication port is unidirectional.
35. The distributed processor memory chip of claim 31 wherein at least one bus associated with the first communication port, the second communication port, and the third communication port is bidirectional.
36. The distributed processor memory chip of claim 30, wherein the controller is configured to schedule a data transfer between the distributed processor memory chip and the first additional distributed processor memory chip such that a receiving processor subunit of the first additional distributed processor memory chip executes its associated program code based on the data transfer and during a time period when the data transfer is received.
37. The distributed processor memory chip of claim 30 wherein the controller is configured to send a clock enable signal to at least one of the plurality of processor subunits of the distributed processor memory chip to control one or more operational aspects of the at least one of the plurality of processor subunits.
38. The distributed processor memory chip of claim 37, wherein the controller is configured to control timing of one or more communication commands associated with the at least one of the plurality of processor subunits by controlling the clock enable signal sent to the at least one of the plurality of processor subunits.
39. The distributed processor memory chip of claim 30, wherein the controller is configured to selectively initiate execution of program code by one or more of the plurality of processor subunits on the distributed processor memory chip.
40. The distributed processor memory chip of claim 30, wherein the controller is configured to use a clock enable signal to control timing of data transfers from one or more of the plurality of processor subunits to at least one of the second communication port and the third communication port.
41. The distributed processor memory chip of claim 28 wherein a communication speed associated with the first communication port is lower than a communication speed associated with the second communication port.
42. The distributed processor memory chip of claim 30, wherein the controller is configured to determine whether a first processor subunit among the plurality of processor subunits is ready to transfer data to a second processor subunit included in the first additional distributed processor memory chip, and to use a clock enable signal to initiate transfer of the data from the first processor subunit to the second processor subunit after determining that the first processor subunit is ready to transfer the data to the second processor subunit.
43. The distributed processor memory chip of claim 42 wherein the controller is further configured to determine whether the second processor subunit is ready to receive the data, and to use the clock enable signal to initiate the transfer of the data from the first processor subunit to the second processor subunit after determining that the second processor subunit is ready to receive the data.
44. The distributed processor memory chip of claim 42 wherein the controller is further configured to determine whether the second processor subunit is ready to receive the data and buffer the data included in the transfer until after a determination that the second processor subunit of the first additional distributed processor memory chip is ready to receive the data.
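Illustrative note (not claim language): a minimal Python sketch of the clock-enable-gated transfer of claims 42 through 44. The class and return values are assumptions for illustration; in the claimed chip the "hold", "send", and "buffered" outcomes would correspond to deasserting the clock enable signal, asserting it to start the transfer, and buffering the data until the receiver is ready.

    from collections import deque

    class TransferController:
        # Gate an inter-chip transfer once the sender is ready (claim 42),
        # wait for the receiver (claim 43), or buffer toward a receiver
        # that is not yet ready (claim 44).
        def __init__(self):
            self.buffer = deque()

        def transfer(self, sender_ready, receiver_ready, data):
            if not sender_ready:
                return "hold"                 # keep clock enable deasserted
            if receiver_ready:
                return ("send", data)         # assert clock enable, transfer
            self.buffer.append(data)          # buffer until receiver is ready
            return "buffered"

    ctrl = TransferController()
    print(ctrl.transfer(sender_ready=True, receiver_ready=False, data=0xAB))
    print(ctrl.transfer(sender_ready=True, receiver_ready=True, data=0xCD))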
45. A method of transferring data between a first distributed processor memory chip and a second distributed processor memory chip, the method comprising:
determining, using a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip, whether a first processor subunit among a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and
using a clock enable signal controlled by the controller to initiate transfer of the data from the first processor subunit to the second processor subunit after determining that the first processor subunit is ready to transfer the data to the second processor subunit.
46. The method of claim 45, further comprising:
determining, using the controller, whether the second processor subunit is ready to receive the data; and
using the clock enable signal to initiate the transfer of the data from the first processor subunit to the second processor subunit after determining that the second processor subunit is ready to receive the data.
47. The method of claim 45, further comprising:
determining, using the controller, whether the second processor subunit is ready to receive the data, and buffering the data included in the transfer until after a determination that the second processor subunit of the second distributed processor memory chip is ready to receive the data.
48. A memory chip, comprising:
a substrate;
a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; and
a first communication port configured to establish a communication connection between the memory chip and an external entity other than another memory chip; and
a second communication port configured to establish a communication connection between the memory chip and a first additional memory chip.
49. The memory chip of claim 48, in which the first communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip.
50. The memory chip of claim 48, in which the second communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip.
51. A memory unit, comprising:
a memory array comprising a plurality of memory banks;
at least one controller configured to control at least one aspect of a read operation with respect to the plurality of memory banks;
at least one zero value detection logic configured to detect a multi-bit zero value associated with data stored in a particular address of the plurality of memory banks; and
wherein the at least one controller is configured to return a zero value indicator to one or more circuits in response to a zero value detection by the at least one zero value detection logic.
52. The memory unit of claim 51, wherein the one or more circuits to which the zero value indicator is returned are external to the memory unit.
53. The memory unit of claim 51, wherein the one or more circuits to which the zero value indicator is returned are internal to the memory unit.
54. The memory unit of claim 51, wherein the memory unit further comprises at least one read disable element configured to interrupt a read command associated with the particular address when the at least one zero value detection logic detects a zero value associated with the particular address.
55. The memory unit of claim 51, wherein the at least one controller is configured to send the zero value indicator to the one or more circuits instead of sending zero value data stored in the particular address.
56. The memory unit of claim 51, wherein a size of the zero value indicator is less than a size of the zero value data.
57. The memory unit of claim 51, wherein the energy consumed by a first process comprising the following is less than the energy consumed by sending zero value data to the one or more circuits: (a) detecting the zero value; (b) generating the zero value indicator; and (c) sending the zero value indicator to the one or more circuits.
58. The memory unit of claim 57, wherein the energy consumed by the first process is less than half the energy consumed by sending the zero value data to the one or more circuits.
59. The memory unit of claim 51, wherein the memory unit further comprises at least one sense amplifier configured to prevent activation of at least one of the plurality of memory banks after a zero value detection by the at least one zero value detection logic.
60. The memory unit of claim 59, wherein the at least one sense amplifier comprises a plurality of transistors configured to sense low power signals from the plurality of memory banks, and the at least one sense amplifier amplifies small voltage swings to higher voltage levels so that data stored in the plurality of memory banks can be interpreted by the at least one controller.
61. The memory unit of claim 51, wherein each of the plurality of memory banks is further organized into subsets, the at least one controller comprises a subset controller, and wherein the at least one zero value detection logic comprises zero value detection logic associated with the subsets.
62. The memory unit of claim 61, wherein the memory unit further comprises at least one read disable element comprising a sense amplifier associated with each of the subsets.
63. The memory unit of claim 51, further comprising a plurality of processor sub-units spatially distributed within the memory unit, wherein each of the plurality of processor sub-units is associated with a dedicated at least one of the plurality of memory banks, and wherein each of the plurality of processor sub-units is configured to access and operate on data stored in a corresponding memory bank.
64. The memory unit of claim 63, wherein the one or more circuits comprise one or more of the processor subunits.
65. The memory unit of claim 63, wherein each of the plurality of processor subunits is connected to two or more other processor subunits among the plurality of processor subunits by one or more buses.
66. The memory unit of claim 51, further comprising a plurality of buses.
67. The memory unit of claim 66, wherein the plurality of buses are configured to transfer data between the plurality of memory banks.
68. The memory unit of claim 67, wherein at least one of the plurality of buses is configured to communicate the zero value indicator to the one or more circuits.
69. A method for detecting a zero value in a particular address of a plurality of memory banks, comprising:
receiving, from a circuit external to the memory unit, a request to read data stored in an address of the plurality of memory banks;
activating, by the controller and in response to the received request, zero value detection logic to detect a zero value in the received address; and
transmitting, by the controller, a zero value indicator to the circuit in response to a zero value detection by the zero value detection logic.
70. The method of claim 69, further comprising configuring, by the controller, a read disable element to interrupt a read command associated with a requested address when the zero value detection logic detects a zero value associated with the requested address.
71. The method of claim 69, further comprising configuring, by the controller, a sense amplifier to prevent activation of at least one of the plurality of memory banks when the zero-value detection unit detects a zero value.
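Illustrative note (not claim language): a minimal Python sketch of the zero-value read path of claims 69 through 71. The dictionary model of memory and the one-byte indicator are assumptions for illustration; the point is that when the addressed multi-bit word is all zero, a short zero value indicator is returned instead of the full word, and the full read can be suppressed, saving bandwidth and energy.

    def read_with_zero_detection(memory, address, zero_indicator=b"\x00"):
        # Zero value detection logic checks the addressed word (claim 69);
        # if it is all zero, the read of the full word can be interrupted
        # (claim 70) and only the indicator is transmitted.
        word = memory[address]
        if all(b == 0 for b in word):
            return zero_indicator
        return word

    memory = {0x10: bytes(8), 0x18: bytes([1, 2, 3, 4, 5, 6, 7, 8])}
    print(read_with_zero_detection(memory, 0x10))  # -> b'\x00' (indicator)
    print(read_with_zero_detection(memory, 0x18))  # -> full 8-byte word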
72. A non-transitory computer readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to detect a zero value in a particular address of a plurality of memory banks, the method comprising:
receiving, from a circuit external to the memory unit, a request to read data stored in an address of the plurality of memory banks;
activating, by the controller and in response to the received request, zero value detection logic to detect a zero value in the received address; and
transmitting, by the controller, a zero value indicator to the circuit in response to a zero value detection by the zero value detection logic.
73. The non-transitory computer-readable medium of claim 72, wherein the method further comprises configuring, by the controller, a read disable element to interrupt a read command associated with a requested address when the zero value detection logic detects a zero value associated with the requested address.
74. The non-transitory computer readable medium of claim 72, wherein the method further comprises configuring, by the controller, a sense amplifier to prevent activation of at least one of the plurality of memory banks when the zero-value detection unit detects a zero value.
75. An integrated circuit, comprising:
a memory unit including a plurality of memory banks, at least one controller configured to control at least one aspect of a read operation with respect to the plurality of memory banks, and at least one zero value detection logic configured to detect a multi-bit zero value associated with data stored in a particular address of the plurality of memory banks;
a processing unit configured to send a read request to the memory unit for reading data from the memory unit; and
wherein the at least one controller and the at least one zero value detection logic are configured to return a zero value indicator to one or more circuits in response to a zero value detection by the at least one zero value detection logic.
76. A memory unit, comprising:
a memory array comprising a plurality of memory banks;
at least one controller configured to control at least one aspect of a read operation with respect to the plurality of memory banks;
at least one detection logic configured to detect a predetermined multi-bit value associated with data stored in a particular address of the plurality of memory banks; and
wherein the at least one controller is configured to return a value indicator to one or more circuits in response to detection of the predetermined multi-bit value by the at least one detection logic.
77. The memory unit of claim 76, wherein the predetermined multi-bit value is selectable by a user.
78. A memory unit, comprising:
a memory array comprising a plurality of memory banks;
at least one controller configured to control at least one aspect of a write operation with respect to the plurality of memory banks;
at least one detection logic configured to detect a predetermined multi-bit value associated with data to be written to a particular address of the plurality of memory banks; and
wherein the at least one controller is configured to provide a value indicator to one or more circuits in response to detection of the predetermined multi-bit value by the at least one detection logic.
79. A distributed processor memory chip, comprising:
a substrate;
a memory array comprising a plurality of memory banks disposed on the substrate;
a plurality of processor subunits disposed on the substrate;
at least one controller configured to control at least one aspect of a read operation with respect to the plurality of memory banks;
at least one detection logic configured to detect a predetermined multi-bit value associated with data stored in a particular address of the plurality of memory banks; and
wherein the at least one controller is configured to return a value indicator to one or more of the plurality of processor subunits in response to the detection of the predetermined multi-bit value by the at least one detection logic.
80. A memory unit, comprising:
one or more memory banks;
a bank controller; and
an address generator;
wherein the address generator is configured to:
providing a current address of a current row to be accessed in an associated one of the one or more memory banks to the bank controller;
determining a predicted address of a next row to be accessed in the associated memory bank; and
providing the predicted address to the bank controller before operations with respect to the current row associated with the current address are completed.
81. The memory unit of claim 80, wherein the operation with respect to the current row associated with the current address is a read operation or a write operation.
82. The memory unit of claim 80, wherein the current row and the next row are in the same memory bank.
83. The memory unit of claim 82, wherein the same memory bank allows the next row to be accessed while the current row is being accessed.
84. The memory unit of claim 80, wherein the current row and the next row are in different memory banks.
85. The memory unit of claim 80, wherein the memory unit is included in a distributed processor, and wherein the distributed processor includes a plurality of processor sub-units of a processing array spatially distributed among a plurality of discrete memory banks of a memory array.
86. The memory unit of claim 80, wherein the bank controller is configured to access the current row and activate the next row prior to completion of the operation with respect to the current row.
87. The memory unit of claim 80, wherein each of the one or more memory banks comprises at least a first subset and a second subset, and wherein a bank controller associated with each of the one or more memory banks comprises a first subset controller associated with the first subset and a second subset controller associated with the second subset.
88. The memory unit of claim 87, wherein the first subset controller is configured to enable access to data included in a current row of the first subset while the second subset controller activates a next row of the second subset.
89. The memory unit of claim 88, wherein the activated next row of the second subset is spaced at least two rows apart from the current row of data being accessed in the first subset.
90. The memory unit of claim 87, wherein the second subset controller is configured to enable access to data included in a current row of the second subset while the first subset controller activates a next row of the first subset.
91. The memory unit of claim 90, wherein the activated next row of the first subset is separated from the current row of data being accessed in the second subset by at least two rows.
92. The memory unit of claim 80, wherein the predicted address is determined using a trained neural network.
93. The memory unit of claim 80, wherein the predicted address is determined based on a determined bank access pattern.
94. The memory unit of claim 80, wherein the address generator comprises a first address generator configured to generate the current address and a second address generator configured to generate the predicted address.
95. The memory unit of claim 94, wherein the second address generator is configured to calculate the predicted address a predetermined period of time after the first address generator has generated the current address.
96. The memory unit of claim 95, wherein the predetermined period of time is adjustable.
97. The memory unit of claim 96, wherein the predetermined period of time is adjusted based on a value of at least one operating parameter associated with the memory unit.
98. The memory unit of claim 97, wherein the at least one operating parameter comprises a temperature of the memory unit.
99. The memory unit of claim 80, wherein the address generator is further configured to generate a confidence level associated with the predicted address and to cause the bank controller to forgo accessing the next row at the predicted address if the confidence level falls below a predetermined threshold.
100. The memory unit of claim 80, wherein the predicted address is generated by a series of flip-flops that sample the generated address after a delay.
101. The memory unit of claim 100, wherein the delay is configurable via a multiplexer that selects between the flip-flops storing sampled addresses.
102. The memory unit of claim 80, wherein the bank controller is configured to ignore a predicted address received from the address generator during a predetermined period after a reset of the memory unit.
103. The memory unit of claim 80, wherein the address generator is configured to forgo providing the predicted address to the bank controller after detecting a random pattern in a row access with respect to the associated memory bank.
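Illustrative note (not claim language): a hypothetical Python sketch of the next-row prediction of claims 80, 93, 99, and 103. The stride-detection rule, history length, and thresholds are assumptions for illustration; the claimed address generator could equally use a trained neural network (claim 92).

    class AddressPredictor:
        # Detect a constant stride from recent row accesses (claim 93) and
        # forgo prediction when accesses look random (claim 103), which
        # also stands in for a low confidence level (claim 99).
        def __init__(self, history=4):
            self.recent = []
            self.history = history

        def predict(self, current_row):
            self.recent = (self.recent + [current_row])[-self.history:]
            if len(self.recent) < self.history:
                return None
            strides = [b - a for a, b in zip(self.recent, self.recent[1:])]
            if len(set(strides)) != 1:
                return None          # random pattern: no predicted address
            return current_row + strides[0]

    p = AddressPredictor()
    for row in (0, 2, 4, 6):
        print(p.predict(row))        # None, None, None, then 8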
104. A memory unit, comprising:
one or more memory banks, wherein each of the one or more memory banks comprises:
a plurality of rows;
a first row controller configured to control a first subset of the plurality of rows;
a second row controller configured to control a second subset of the plurality of rows;
a single data input to receive data to be stored in the plurality of rows; and
a single data output to provide data retrieved from the plurality of rows.
105. The memory unit of claim 104 wherein the memory unit is configured to receive a first address for processing and a second address for activation and access at predetermined times.
106. The memory unit of claim 104, wherein the first subset of the plurality of rows consists of even numbered rows.
107. The memory unit of claim 106, wherein the even numbered rows are located in one half of the one or more memory banks.
108. The memory unit of claim 106, wherein the odd numbered rows are located in one half of the one or more memory banks.
109. The memory unit of claim 104, wherein the second subset of the plurality of rows consists of odd numbered rows.
110. The memory unit of claim 104, wherein the first subset of the plurality of rows is contained in a first subset of a memory bank that is adjacent to a second subset of the memory bank that contains the second subset of the plurality of rows.
111. The memory unit of claim 104, wherein the first row controller is configured to cause access to data included in a row among the first subset of the plurality of rows while the second row controller activates a row among the second subset of the plurality of rows.
112. The memory unit of claim 111, wherein an activated row among the second subset of the plurality of rows is separated by at least two rows from a row of the first subset of the plurality of rows from which data is being accessed.
113. The memory unit of claim 104, wherein the second row controller is configured to cause access to data included in a row of the second subset of the plurality of rows while the first row controller activates a row of the first subset of the plurality of rows.
114. The memory unit of claim 113, wherein an activated row among the first subset of the plurality of rows is separated by at least two rows from a row of the second subset of the plurality of rows from which data is being accessed.
115. The memory unit of claim 104, wherein each of the one or more memory banks comprises a column input for receiving a column identifier indicating a portion of a row to be accessed.
116. The memory unit of claim 104, wherein an additional row of redundant pads is placed between every two rows of pads to create a distance that enables activation.
117. The memory unit of claim 104, wherein rows that are proximate to each other are not activated simultaneously.
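Illustrative note (not claim language): a minimal Python sketch of the split even/odd row control of claims 106, 109, 112, 114, and 117. The class, the two-row spacing check, and the exception are assumptions for illustration of the behavior, not the disclosed circuitry.

    class DualRowController:
        # Even rows are handled by a first row controller, odd rows by a
        # second, so one row can be accessed while another is activated,
        # provided the two rows are at least two rows apart.
        def __init__(self):
            self.active = {"even": None, "odd": None}

        def activate(self, row):
            side = "even" if row % 2 == 0 else "odd"
            other = self.active["odd" if side == "even" else "even"]
            if other is not None and abs(row - other) < 2:
                raise ValueError("rows too close to activate simultaneously")
            self.active[side] = row

    ctrl = DualRowController()
    ctrl.activate(10)   # first controller activates even row 10
    ctrl.activate(13)   # second controller activates odd row 13 concurrently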
118. A distributed processor on a memory chip, comprising:
a substrate;
a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks;
a processing array disposed on the substrate, the processing array comprising a plurality of processor sub-units, each of the processor sub-units being associated with a corresponding dedicated memory bank of the plurality of discrete memory banks; and
at least one memory pad disposed on the substrate, wherein the at least one memory pad is configured to act as at least one register of a register file for one or more of the plurality of processor subunits.
119. The memory chip of claim 118, wherein the at least one memory pad is included in at least one of the plurality of processor subunits of the processing array.
120. The memory chip of claim 118, wherein the register file is configured as a data register file.
121. The memory chip of claim 118, wherein the register file is configured as an address register file.
122. The memory chip of claim 118, wherein the at least one memory pad is configured to provide at least one register of a register file for one or more of the plurality of processor subunits to store data to be accessed by one or more of the plurality of processor subunits.
123. The memory chip of claim 118, wherein the at least one memory pad is configured to provide at least one register of a register file for one or more of the plurality of processing subunits, wherein the at least one register of the register file is configured to store coefficients used by the plurality of processor subunits during execution of a convolution accelerator operation by the plurality of processor subunits.
124. The memory chip of claim 118, wherein the at least one memory pad is a DRAM memory pad.
125. The memory chip of claim 118, in which the at least one memory pad is configured to communicate via unidirectional access.
126. The memory chip of claim 118, in which the at least one memory pad allows bidirectional access.
127. The memory chip of claim 118, further comprising at least one redundant memory pad disposed on the substrate, wherein the at least one redundant memory pad is configured to provide at least one redundant register for one or more of the plurality of processor subunits.
128. The memory chip of claim 118, further comprising at least one memory pad disposed on the substrate, wherein the at least one memory pad contains at least one redundant memory bit configured to provide at least one redundant register for one or more of the plurality of processor subunits.
129. The memory chip of claim 118, further comprising:
a first plurality of buses, each of the first plurality of buses connecting one of the plurality of processor subunits to a corresponding dedicated memory bank; and
A second plurality of buses, each of the second plurality of buses connecting one of the plurality of processor sub-units to another of the plurality of processor sub-units.
130. The memory chip of claim 118, wherein at least one of the processor subunits comprises a counter configured to count down from a predefined number, and upon the counter reaching zero, the at least one of the processor subunits is configured to stop a current task and trigger a memory refresh operation.
131. The memory chip of claim 118, wherein at least one of the processor subunits comprises a mechanism to stop a current task and trigger a memory refresh operation at a particular time to refresh the memory pads.
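Illustrative note (not claim language): a minimal Python sketch of the countdown refresh mechanism of claims 130 and 131. The interval value and the "compute"/"refresh" return strings are assumptions for illustration.

    class ProcessorSubunit:
        # A counter counts down from a predefined number (claim 130); on
        # reaching zero the subunit stops its current task and triggers a
        # refresh of its memory pad, then reloads the counter.
        def __init__(self, refresh_interval=3):
            self.counter = refresh_interval
            self.refresh_interval = refresh_interval

        def step(self):
            self.counter -= 1
            if self.counter == 0:
                self.counter = self.refresh_interval
                return "refresh"     # stop current task, refresh memory pad
            return "compute"

    unit = ProcessorSubunit()
    print([unit.step() for _ in range(6)])
    # ['compute', 'compute', 'refresh', 'compute', 'compute', 'refresh']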
132. The memory chip of claim 118, wherein said register file is configured to function as a cache.
133. A method of executing at least one instruction in a distributed processor memory chip, the method comprising:
retrieving one or more data values from a memory array of the distributed processor memory chip;
storing the one or more data values in a register formed in a memory pad of the distributed processor memory chip; and
accessing the one or more data values stored in the register in accordance with at least one instruction executed by a processor element;
wherein the memory array comprises a plurality of discrete memory banks disposed on a substrate;
wherein the processor element is a processor sub-unit included among a plurality of processor sub-units in a processing array disposed on the substrate, wherein each of the processor sub-units is associated with a corresponding dedicated memory bank of the plurality of discrete memory banks; and
wherein the register is provided by a memory pad disposed on the substrate.
134. The method of claim 133, wherein the processor element is configured to act as an accelerator, and the method further comprises:
accessing first data stored in the register;
accessing second data from the memory array; and
performing an operation on the first data and the second data.
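Illustrative note (not claim language): a minimal Python sketch of the accelerator step of claim 134, using a register-held coefficient in the spirit of claim 123. Modeling the register file and memory bank as Python lists, and the multiply operation, are assumptions for illustration.

    def convolution_step(register_file, memory_bank, coeff_index, data_index):
        # First data comes from a register formed in a memory pad; second
        # data comes from the memory array; the operation combines them.
        first = register_file[coeff_index]    # data held in the register
        second = memory_bank[data_index]      # data from the memory array
        return first * second                 # the operation of claim 134

    registers = [0.5, -1.0, 2.0]              # e.g., convolution coefficients
    bank = [10, 20, 30]
    print(convolution_step(registers, bank, coeff_index=2, data_index=1))  # 40.0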
135. The method of claim 133, wherein at least one memory pad comprises a plurality of word lines and bit lines, and further comprising:
determining a timing of loading the word lines and bit lines, the timing determined by a size of the memory pad.
136. The method of claim 133, further comprising:
the registers are periodically refreshed.
137. The method of claim 133, wherein the memory pad comprises a DRAM memory pad.
138. The method of claim 133 wherein the memory pad is included in at least one of the plurality of discrete memory banks of the memory array.
139. A device, comprising:
a substrate;
a processing unit disposed on the substrate; and
a memory unit disposed on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit; and
wherein the processing unit includes a memory pad configured to act as a cache for the processing unit.
140. A method for distributed processing of at least one information stream, the method comprising:
receiving, by one or more memory processing integrated circuits, the at least one information stream via a first communication channel; wherein each memory processing integrated circuit comprises a controller, a plurality of processor subunits, and a plurality of memory units;
buffering, by the one or more memory processing integrated circuits, the at least one information stream;
performing, by the one or more memory processing integrated circuits, a first processing operation on the at least one information stream to provide a first processing result;
sending the first processing result to a processing integrated circuit; and
performing, by the one or more memory processing integrated circuits, a second processing operation on the first processing result to provide a second processing result;
wherein a size of logic cells of the one or more memory processing integrated circuits is larger than a size of logic cells of the processing integrated circuit.
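Illustrative note (not claim language): a minimal Python sketch of the two-stage pipeline of claim 140. The filter and sum-of-squares operations are placeholders chosen to show a data-reducing first operation near memory feeding a more compute-intensive second operation on the processing integrated circuit.

    def memory_side_first_stage(stream):
        # Performed by the memory processing integrated circuits: a simple,
        # data-reducing first processing operation (e.g., filtering).
        return [x for x in stream if x >= 0]

    def processing_ic_second_stage(first_results):
        # Performed by the processing integrated circuit: a more
        # computationally complex second processing operation (claim 153).
        return sum(x * x for x in first_results)

    stream = [3, -1, 4, -1, 5]
    first = memory_side_first_stage(stream)       # first processing result
    second = processing_ic_second_stage(first)    # second processing result
    print(first, second)                          # [3, 4, 5] 50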
141. The method of claim 140, wherein each of the plurality of memory units is coupled to at least one of the plurality of processor subunits.
142. The method of claim 140, wherein a total size of information units of the at least one information stream received during a particular time duration exceeds a total size of first processing results output during the particular time duration.
143. The method of claim 140, wherein a total size of the at least one information stream is less than a total size of the first processing result.
144. The method of claim 145, wherein the memory-class fabrication process is a DRAM fabrication process.
145. The method of claim 140, wherein the one or more memory processing integrated circuits are fabricated by a memory-class fabrication process; and
wherein the processing integrated circuit is fabricated by a logic-class fabrication process.
146. The method of claim 140, wherein a size of a logic cell of the one or more memory processing integrated circuits is at least twice a size of a corresponding logic cell of the processing integrated circuit.
147. The method of claim 140, wherein a critical dimension of a logic cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the processing integrated circuit.
148. The method of claim 140, wherein a critical dimension of a memory cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the processing integrated circuit.
149. The method of claim 140, comprising requesting, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the first processing operation.
150. The method of claim 140, comprising instructing, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the first processing operation.
151. The method of claim 140, comprising configuring, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the first processing operation.
152. The method of claim 140, comprising performing the first processing operation by the one or more memory processing integrated circuits without intervention by the processing integrated circuits.
153. The method of claim 140, wherein the first processing operation is less computationally complex than the second processing operation.
154. The method of claim 140, wherein an overall throughput of the first processing operation exceeds an overall throughput of the second processing operation.
155. The method of claim 140, wherein the at least one information stream comprises one or more pre-processed information streams.
156. The method of claim 155, wherein the one or more preprocessed information streams are data extracted from a network transport unit.
157. The method of claim 140, wherein a portion of the first processing operation is performed by one of the plurality of processor sub-units and another portion of the first processing operation is performed by another one of the plurality of processor sub-units.
158. The method of claim 140, wherein the first processing operation and the second processing operation comprise cellular network processing operations.
159. The method of claim 140, wherein the first processing operation and the second processing operation comprise database processing operations.
160. The method of claim 140, wherein the first processing operation and the second processing operation comprise database analysis processing operations.
161. The method of claim 140, wherein the first processing operation and the second processing operation comprise artificial intelligence processing operations.
162. A method for distributed processing, the method comprising:
receiving an information unit by one or more memory processing integrated circuits of a decomposed system, the decomposed system comprising one or more arithmetic subsystems separate from one or more storage subsystems; wherein each of the one or more memory processing integrated circuits includes a controller, a plurality of processor sub-units, and a plurality of memory units;
wherein the one or more arithmetic subsystems comprise a plurality of processing integrated circuits;
wherein a size of a logic cell of the one or more memory processing integrated circuits is at least twice a size of a corresponding logic cell of the plurality of processing integrated circuits;
performing, by the one or more memory processing integrated circuits, a processing operation on the information unit to provide a processing result; and
outputting the processing result from the one or more memory processing integrated circuits.
163. The method of claim 162, comprising outputting the processing result to the one or more arithmetic subsystems of the decomposed system.
164. The method of claim 162, comprising receiving the information unit from the one or more storage subsystems of the decomposed system.
165. The method of claim 162, comprising outputting the processing results to the one or more storage subsystems of the decomposed system.
166. The method of claim 162, comprising receiving the information unit from the one or more arithmetic subsystems of the decomposed system.
167. The method of claim 166, wherein information units sent from different groups of processing units of the plurality of processing integrated circuits comprise different portions of intermediate results of processing programs executed by the plurality of processing integrated circuits, wherein a group of processing units includes at least one processing integrated circuit.
168. The method of claim 167, comprising outputting, by the one or more memory processing integrated circuits, a result of an entire process.
169. The method of claim 168, comprising sending the result of the entire process to each of the plurality of processing integrated circuits.
170. The method of claim 168, wherein the different portions of the intermediate result are different portions of an updated neural network model, and wherein the result of the entire process is the updated neural network model.
171. The method of claim 168, including sending the updated neural network model to each of the plurality of processing integrated circuits.
172. The method of claim 162, comprising outputting the processing result using a switch subunit of the decomposed system.
173. The method of claim 162, wherein the one or more memory processing integrated circuits are included in a memory processing subsystem of the decomposed system.
174. The method of claim 162, wherein at least one of the one or more memory processing integrated circuits is included in one or more arithmetic subsystems of the decomposed system.
175. The method of claim 162, wherein at least one of the one or more memory processing integrated circuits is included in one or more memory subsystems of the decomposed system.
176. The method of claim 162, wherein at least one of the following is true: (a) the information unit is received from at least one of the plurality of processing integrated circuits; and (b) the processing result is sent to one or more of the plurality of processing integrated circuits.
177. The method of claim 176, wherein a critical dimension of a logic cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the plurality of processing integrated circuits.
178. The method of claim 176, wherein a critical dimension of a memory cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the plurality of processing integrated circuits.
179. The method of claim 162, wherein the information unit comprises a preprocessed information unit.
180. The method of claim 179, comprising pre-processing the information unit by the plurality of processing integrated circuits to provide the pre-processed information unit.
181. The method of claim 162, wherein the information unit conveys a portion of a model of a neural network.
182. The method of claim 162, wherein the information unit conveys partial results of at least one database query.
183. The method of claim 162, wherein the information unit conveys partial results of at least one aggregated database query.
184. A method for database analysis acceleration, the method comprising:
receiving, by a memory processing integrated circuit, a database query, the database query including at least one relevancy criterion indicating database entries in a database that are relevant to the database query;
wherein the memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units;
determining, by the memory processing integrated circuit and based on the at least one relevancy criterion, a group of related database entries stored in the memory processing integrated circuit; and
sending the group of related database entries to one or more processing entities for further processing, substantially without sending unrelated database entries stored in the memory processing integrated circuit to the one or more processing entities;
wherein the unrelated database entries are different from the related database entries.
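Illustrative note (not claim language): a minimal Python sketch of the in-memory screening of claim 184. The list-of-dictionaries data model and the age predicate are assumptions for illustration; only entries satisfying the relevancy criterion leave the memory processing integrated circuit.

    def filter_near_memory(stored_entries, relevancy_criterion):
        # Only database entries satisfying the relevancy criterion are sent
        # onward; unrelated entries stored locally are not transmitted.
        return [e for e in stored_entries if relevancy_criterion(e)]

    entries = [{"age": 25}, {"age": 41}, {"age": 17}]
    criterion = lambda e: e["age"] >= 18           # from the database query
    related = filter_near_memory(entries, criterion)
    print(related)                                 # the group of related entries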
185. The method of claim 184, wherein the one or more processing entities are included in the plurality of processor subunits of the memory processing integrated circuit.
186. The method of claim 185, comprising further processing, by the memory processing integrated circuit, the group of related database entries to complete a response to the database query.
187. The method of claim 186, including outputting the response to the database query from the memory processing integrated circuit.
188. The method of claim 187, wherein the outputting comprises applying a flow control process.
189. The method of claim 188, wherein the applying of the flow control process is responsive to an indicator, output from the one or more processing entities, regarding completion of processing of one or more database entries of the group.
190. The method of claim 185, comprising further processing, by the memory processing integrated circuit, the group of related database entries to provide an intermediate response to the database query.
191. The method of claim 190 including outputting the intermediate response to the database query from the memory processing integrated circuit.
192. The method of claim 191, wherein the outputting includes applying a flow control process.
193. The method of claim 192, wherein the applying of the flow control process is responsive to an indicator, output from the one or more processing entities, regarding completion of partial processing of database entries of the group.
194. The method of claim 185, comprising generating, by the one or more processing entities, a processing status indicator indicating progress of the further processing of the group of related database entries.
195. The method of claim 185, including further processing the group of related database entries using the memory processing integrated circuit.
196. The method of claim 195, wherein the processing is performed by the plurality of processor subunits.
197. The method of claim 195, wherein the processing comprises computing an intermediate result by one of the plurality of processor subunits, sending the intermediate result to another one of the plurality of processor subunits, and performing additional computations by the other processor subunit.
198. The method of claim 195, wherein the processing is performed by the controller.
199. The method of claim 195, wherein the processing is performed by the plurality of processor subunits and the controller.
200. The method of claim 184, wherein the one or more processing entities are external to the memory processing integrated circuit.
201. The method of claim 200, comprising outputting the group of related database entries from the memory processing integrated circuit.
202. The method of claim 201, wherein the outputting comprises applying a flow control process.
203. The method of claim 202, wherein the applying of the flow control process is responsive to an indicator, output from the one or more processing entities, regarding a relevance of a database entry associated with the one or more processing entities.
204. The method of claim 184, wherein the plurality of processor subunits comprise a complete arithmetic logic unit.
205. The method of claim 184, wherein the plurality of processor subunits comprise partial arithmetic logic units.
206. The method of claim 184, wherein the plurality of processor subunits comprises a memory controller.
207. The method of claim 184, wherein the plurality of processor subunits comprises a partial memory controller.
208. The method of claim 184, comprising outputting at least one of: (i) the group of related database entries, (ii) a response to the database query, and (iii) an intermediate response to the database query.
209. The method of claim 208, wherein the outputting comprises applying traffic shaping.
210. The method of claim 208, wherein the outputting comprises attempting to match a bandwidth used during the outputting to a maximum allowable bandwidth on a link coupling the memory processing integrated circuit to a requester unit.
211. The method of claim 208, wherein the outputting comprises maintaining fluctuations in an output traffic rate below a threshold.
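Illustrative note (not claim language): a minimal Python sketch of the traffic shaping of claims 209 through 211. Sleep-based pacing is a crude stand-in chosen for illustration; the idea is to keep the bandwidth used near a maximum allowable rate and the output rate fluctuation small.

    import time

    def shaped_output(chunks, max_bytes_per_sec):
        # Pace the output so transmissions are spread evenly over time
        # rather than sent in a burst.
        for chunk in chunks:
            yield chunk
            time.sleep(len(chunk) / max_bytes_per_sec)

    for piece in shaped_output([b"a" * 64, b"b" * 64], max_bytes_per_sec=1024):
        print(len(piece), "bytes sent")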
212. The method of claim 184, wherein the one or more processing entities comprise a plurality of processing entities, wherein at least one of the plurality of processing entities belongs to the memory processing integrated circuit and at least another one of the plurality of processing entities does not belong to the memory processing integrated circuit.
213. The method of claim 184, wherein the one or more processing entities belong to another memory processing integrated circuit.
214. A method for database analysis acceleration, the method comprising:
receiving, by a plurality of memory processing integrated circuits, a database query, the database query including at least one relevancy criterion indicating database entries in a database that are relevant to the database query; wherein each of the plurality of memory processing integrated circuits includes a controller, a plurality of processor sub-units, and a plurality of memory units;
determining, by each of the plurality of memory processing integrated circuits and based on the at least one relevancy criterion, a group of related database entries stored in the memory processing integrated circuit; and
sending, by each of the plurality of memory processing integrated circuits, the group of related database entries stored in the memory processing integrated circuit to one or more processing entities for further processing, substantially without sending unrelated database entries stored in the memory processing integrated circuit to the one or more processing entities; wherein the unrelated database entries are different from the related database entries.
215. A method for database analysis acceleration, the method comprising:
receiving, by an integrated circuit, a database query, the database query including at least one relevancy criterion indicating database entries in a database that are relevant to the database query; wherein the integrated circuit includes a controller, a screening unit, and a plurality of memory units;
determining, by the screening unit and based on the at least one relevancy criterion, a group of related database entries stored in the integrated circuit; and
sending the group of related database entries to one or more processing entities external to the integrated circuit for further processing, substantially without sending unrelated database entries stored in the integrated circuit to the one or more processing entities.
216. A method for database analysis acceleration, the method comprising:
receiving, by an integrated circuit, a database query, the database query including at least one relevancy criterion indicating database entries in a database that are relevant to the database query;
wherein the integrated circuit includes a controller, a processing unit, and a plurality of memory units;
determining, by the processing unit and based on the at least one relevancy criterion, a group of related database entries stored in the integrated circuit;
processing, by the processing unit, the group of related database entries, without processing unrelated database entries stored in the integrated circuit, to provide a processing result; wherein the unrelated database entries are different from the related database entries; and
outputting the processing result from the integrated circuit.
217. A method for retrieving feature vector related information, the method comprising:
receiving, by a memory processing integrated circuit, fetch information for fetching of a plurality of requested feature vectors mapped to a plurality of statement segments; wherein the memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units, each of the memory units coupled to a processor subunit;
retrieving the plurality of requested feature vectors from at least some of the plurality of memory units; wherein the retrieving comprises simultaneously requesting, from two or more memory units, the requested feature vectors stored in the two or more memory units; and
outputting, from the memory processing integrated circuit, an output including at least one of: (a) the requested feature vectors; and (b) processing results of the requested feature vectors.
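Illustrative note (not claim language): a minimal Python sketch of the concurrent retrieval of claim 217, where feature vectors mapped to statement segments are requested from two or more memory units at the same time. The thread pool, the list-of-dictionaries memory model, and the (unit, key) request format are assumptions for illustration.

    from concurrent.futures import ThreadPoolExecutor

    def retrieve_feature_vectors(memory_units, requests):
        # Each request names a memory unit and the key of the feature
        # vector stored there; fetches are issued concurrently.
        def fetch(unit_and_key):
            unit, key = unit_and_key
            return memory_units[unit][key]
        with ThreadPoolExecutor() as pool:
            return list(pool.map(fetch, requests))

    units = [{"seg0": [0.1, 0.2]}, {"seg1": [0.3, 0.4]}]
    print(retrieve_feature_vectors(units, [(0, "seg0"), (1, "seg1")]))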
218. The method of claim 217, wherein the output includes the requested feature vectors.
219. The method of claim 217, wherein the output includes the results of the processing of the requested feature vectors.
220. The method of claim 219, wherein the processing is performed by the plurality of processor subunits.
221. The method of claim 220, wherein the processing comprises sending requested feature vectors from one processor subunit to another processor subunit.
222. The method of claim 220, wherein the processing comprises computing an intermediate result by one processor subunit, sending the intermediate result to another processor subunit, and computing another intermediate result or a processing result by the other processor subunit.
223. The method of claim 219, wherein the processing is performed by the controller.
224. The method of claim 219, wherein the processing is performed by the plurality of processor subunits and the controller.
225. The method of claim 219, wherein the processing is performed by a vector processor of the memory processing integrated circuit.
226. The method of claim 217, wherein the controller is configured to simultaneously request the requested feature vectors based on a known mapping between statement segments and locations of feature vectors mapped to the statement segments.
227. The method of claim 226, wherein the mapping is uploaded during a boot-up process of the memory processing integrated circuit.
228. The method of claim 217, wherein the controller is configured to manage the retrieving of the plurality of requested feature vectors.
229. The method of claim 217, wherein the plurality of statement segments have a particular order, and wherein the outputting of the requested feature vector is performed according to the particular order.
230. The method of claim 229, wherein the retrieving of the plurality of requested feature vectors is performed according to the particular order.
231. The method of claim 229, wherein the retrieving of the plurality of requested feature vectors is performed at least partially out of order; and wherein the retrieving further comprises reordering the plurality of requested feature vectors.
232. The method of claim 217, wherein the retrieving of the plurality of requested feature vectors comprises buffering the plurality of requested feature vectors before the plurality of requested feature vectors are read by the controller.
233. The method of claim 232, wherein the retrieving of the plurality of requested feature vectors comprises generating a buffer status indicator that indicates when one or more buffers associated with the plurality of memory units store one or more requested feature vectors.
234. The method of claim 233, including conveying the buffer status indicator over a dedicated control line.
235. The method of claim 234, wherein a dedicated control line is allocated per memory unit.
236. The method of claim 234, wherein the buffer status indicator comprises one or more status bits stored in one or more of the buffers.
237. The method of claim 233, including conveying the buffer status indicator via one or more shared control lines.
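As a toy illustration of the buffering and status signaling of claims 232 to 237, the sketch below models a per-unit buffer and a status indicator as an event flag polled by the controller; all names are hypothetical.

```python
# Hypothetical sketch of claims 232-237: a memory unit fills its buffer and
# raises a status indicator; the controller reads the buffer only once the
# indicator shows that a requested feature vector is present.
import queue
import threading
import time

class BufferedUnit:
    def __init__(self):
        self.buf = queue.Queue(maxsize=2)   # per-memory-unit buffer
        self.ready = threading.Event()      # stands in for a control-line status bit

    def fill(self, vec):
        self.buf.put(vec)
        self.ready.set()                    # "buffer holds a requested vector"

    def read(self):
        vec = self.buf.get()
        if self.buf.empty():
            self.ready.clear()
        return vec

unit = BufferedUnit()
threading.Thread(target=lambda: (time.sleep(0.01), unit.fill([1.0, 2.0]))).start()
unit.ready.wait()                           # controller waits on the indicator
print(unit.read())                          # -> [1.0, 2.0]
```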
238. The method of claim 217, wherein the retrieval information is included in one or more retrieval commands of a first resolution, the first resolution representing a particular number of bits.
239. The method of claim 238, comprising managing, by the controller, the retrieving at a finer resolution, the finer resolution representing a number of bits lower than the particular number of bits.
240. The method of claim 238, wherein the controller is configured to manage the retrieving according to a feature vector resolution.
241. The method of claim 238, comprising independently managing the retrieving by the controller.
242. The method of claim 217, wherein the plurality of processor subunits comprises a complete arithmetic logic unit.
243. The method of claim 217, wherein the plurality of processor subunits comprises partial arithmetic logic units.
244. The method of claim 217, wherein the plurality of processor subunits comprises a memory controller.
245. The method of claim 217, wherein the plurality of processor subunits comprises a partial memory controller.
246. The method of claim 217, wherein the outputting of the output comprises applying traffic shaping to the output.
247. The method of claim 217, wherein the outputting of the output includes matching a bandwidth used during the outputting to a maximum allowable bandwidth on a link coupling the memory processing integrated circuit to a requester unit.
248. The method of claim 217, wherein the outputting of the output comprises maintaining fluctuations in output traffic rate below a predetermined threshold.
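Claims 246 to 248 describe shaping the output traffic; one common way to realize this (offered here only as an assumed example, not the claimed mechanism) is rate pacing, sketched below in Python.

```python
# Rate-pacing sketch for claims 246-248 (an assumed realization, not the
# claimed circuit): each chunk is delayed so the average output rate never
# exceeds the link budget, which also bounds rate fluctuations.
import time

def shaped_output(chunks, max_bytes_per_sec):
    for chunk in chunks:
        time.sleep(len(chunk) / max_bytes_per_sec)  # pace to the allowed rate
        yield chunk

data = [b"x" * 1000] * 5
start = time.time()
for _ in shaped_output(data, max_bytes_per_sec=100_000):
    pass
print(f"sent 5 kB in {time.time() - start:.2f} s")  # roughly 0.05 s at 100 kB/s
```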
249. The method of claim 217, wherein the retrieving comprises applying predictive retrieval to at least some of the requested feature vectors from a set of requested feature vectors stored in a single memory unit.
250. The method of claim 217, wherein the requested feature vectors are distributed among the memory units.
251. The method of claim 217, wherein the requested feature vectors are distributed among the memory units based on an expected retrieval pattern.
252. A method for memory intensive processing, the method comprising:
performing, by a plurality of processors included in a hybrid device, a processing operation, the hybrid device including a base die, a first memory resource associated with at least one second die, and a second memory resource associated with at least one third die; wherein the base die and the at least one second die are connected to each other by inter-wafer bonding;
retrieving, using the plurality of processors, information stored in the first memory resource; and
sending additional information from the second memory resource to the first memory resource, wherein a total bandwidth of a first path between the base die and the at least one second die exceeds a total bandwidth of a second path between the at least one second die and the at least one third die, and wherein a storage capacity of the first memory resource is less than a storage capacity of the second memory resource.
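To make the two-tier arrangement of claim 252 concrete, the following hypothetical Python sketch models a small near-processor first memory staged from a larger second memory; the class name and capacities are illustrative only.

```python
# Hypothetical model of claim 252's hierarchy: compute reads only the small,
# high-bandwidth first tier, which is staged on demand from the larger,
# lower-bandwidth second tier (HBM-like).
from collections import OrderedDict

class TwoTierMemory:
    def __init__(self, first_capacity):
        self.first = OrderedDict()          # small tier near the processors
        self.second = {}                    # large backing tier
        self.first_capacity = first_capacity

    def stage(self, key):
        """Bring data into the first tier (evicting oldest) and return it."""
        if key not in self.first:
            if len(self.first) >= self.first_capacity:
                self.first.popitem(last=False)      # evict the oldest entry
            self.first[key] = self.second[key]      # second -> first transfer
        return self.first[key]

mem = TwoTierMemory(first_capacity=2)
mem.second = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
print(sum(mem.stage("a")) + sum(mem.stage("c")))    # compute reads the first tier only
```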
253. The method of claim 252, wherein the second memory resource comprises a High Bandwidth Memory (HBM) resource.
254. The method of claim 252, wherein the at least one third die comprises a stack of High Bandwidth Memory (HBM) chips.
255. The method of claim 252, wherein at least some of the second memory resources belong to a third die of the at least one third die, the third die being connected to the base die without using inter-wafer bonding.
256. The method of claim 252, wherein at least some of the second memory resources belong to a third die of the at least one third die that is connected to a second die of the at least one second die without using inter-wafer bonding.
257. The method of claim 252, wherein the first memory resource and the second memory resource comprise different levels of cache.
258. The method of claim 252, wherein the first memory resource is located between the base die and the second memory resource.
259. The method of claim 252, wherein the first memory resource is not located on top of the second memory resource.
260. The method of claim 252, comprising performing additional processing by a second die of the at least one second die, the second die comprising a plurality of processor subunits and the first memory resource.
261. The method of claim 260, wherein at least one processor subunit is coupled to a dedicated portion of the first memory resources allocated to the processor subunit.
262. The method of claim 261, wherein the dedicated portion of the first memory resource comprises at least one memory bank.
263. The method of claim 252, wherein the plurality of processors belong to a memory processing chip that also includes the first memory resource.
264. The method of claim 252, wherein the base die includes the plurality of processors, wherein the plurality of processors includes a plurality of processor subunits coupled to the first memory resource via conductors formed using the inter-wafer bonds.
265. The method of claim 264, wherein each processor subunit is coupled to a dedicated portion of the first memory resources allocated to the processor subunit.
266. A hybrid device for memory intensive processing, the hybrid device comprising:
a base die;
a plurality of processors;
a first memory resource of at least one second die;
a second memory resource of the at least one third die;
wherein the base die and the at least one second die are connected to each other by inter-wafer bonding;
wherein the plurality of processors are configured to perform processing operations and retrieve information stored in the first memory resource;
wherein the second memory resource is configured to send additional information from the second memory resource to the first memory resource;
wherein a total bandwidth of a first path between the base die and the at least one second die exceeds a total bandwidth of a second path between the at least one second die and the at least one third die; and
wherein a storage capacity of the first memory resource is less than a storage capacity of the second memory resource.
267. The hybrid device of claim 266, wherein the second memory resource comprises a High Bandwidth Memory (HBM) resource.
268. The hybrid device of claim 266, wherein the at least one third die comprises a stack of High Bandwidth Memory (HBM) memory chips.
269. The hybrid device of claim 266, wherein at least some of the second memory resources belong to a third die of the at least one third die, the third die being connected to the base die without using inter-wafer bonding.
270. The hybrid device of claim 266, wherein at least some of the second memory resources belong to a third die of the at least one third die that is connected to a second die of the at least one second die without using inter-wafer bonding.
271. The hybrid device of claim 266, wherein the first memory resource and the second memory resource comprise different levels of cache.
272. The hybrid device of claim 266, wherein the first memory resource is located between the base die and the second memory resource.
273. The hybrid device of claim 266, wherein the first memory resource is located to one side of the second memory resource.
274. The hybrid device of claim 266, wherein a second die of the at least one second die is configured to perform additional processing, wherein the second die comprises a plurality of processor subunits and the first memory resource.
275. The hybrid device of claim 274, wherein each processor subunit is coupled to a dedicated portion of the first memory resources allocated to the processor subunit.
276. The hybrid device of claim 275, wherein the dedicated portion of the first memory resource comprises at least one memory bank.
277. The hybrid device of claim 266, wherein the plurality of processors comprises a plurality of processor subunits of a memory processing chip that also includes the first memory resource.
278. The hybrid device of claim 266, wherein the base die comprises the plurality of processors, wherein the plurality of processors comprise a plurality of processor subunits coupled to the first memory resource via conductors formed using the inter-wafer bonds.
279. The hybrid device of claim 278, wherein each processor subunit is coupled to a dedicated portion of the first memory resources allocated to the processor subunit.
280. A method for database acceleration, the method comprising:
retrieving, by a network communication interface of a database acceleration integrated circuit, a certain amount of information from a storage unit;
first processing the amount of information to provide first processed information;
sending, by a memory controller of the database acceleration integrated circuit, the first processed information to a plurality of memory processing integrated circuits via an interface, wherein each memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units;
second processing at least a portion of the first processed information using the plurality of memory processing integrated circuits to provide second processed information;
retrieving, by the memory controller of the database acceleration integrated circuit, information from the plurality of memory processing integrated circuits, wherein the retrieved information comprises at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information;
performing a database processing operation on the retrieved information using a database acceleration unit of the database acceleration integrated circuit to provide a database acceleration result; and
outputting the database acceleration result.
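The pipeline of claim 280 can be paraphrased in a short Python sketch, with decompression standing in for the first processing, per-shard filtering for the second processing, and an aggregation for the database processing operation; all of these concrete choices are assumptions made purely for illustration.

```python
# Hypothetical end-to-end sketch of claim 280: fetch -> first processing
# (decompression) -> fan-out to "memory processing ICs" (modeled as shards)
# -> second processing (filtering) -> retrieve -> database operation.
import json
import zlib

def first_process(blob):
    return json.loads(zlib.decompress(blob))              # e.g. decompress rows

def second_process(shard):
    return [r for r in shard if r["qty"] > 0]             # per-shard filtering

def accelerate(blob, num_shards=4):
    rows = first_process(blob)
    shards = [rows[i::num_shards] for i in range(num_shards)]   # send to shards
    retrieved = [r for s in shards for r in second_process(s)]  # retrieve results
    return sum(r["qty"] for r in retrieved)                     # database operation

payload = zlib.compress(json.dumps([{"qty": q} for q in (0, 1, 2, 0, 3)]).encode())
print(accelerate(payload))                                # -> 6
```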
281. The method of claim 280, comprising managing at least one of the retrieving, first processing, sending, and second processing using a management unit of the database acceleration integrated circuit.
282. The method of claim 281, wherein the managing is performed based on an execution plan generated by the management unit of the database acceleration integrated circuit.
283. The method of claim 281, wherein the managing is performed based on an execution plan received by the management unit of the database acceleration integrated circuit and not generated by the management unit.
284. The method of claim 281, wherein the managing comprises assigning at least one of: (a) network communication interface resources; (b) a decompression unit resource; (c) a memory controller resource; (d) a plurality of memory processing integrated circuit resources; and (e) database acceleration unit resources.
285. The method of claim 280, wherein the network communication interface includes two or more different types of network communication ports.
286. The method of claim 285, wherein the two or more different types of network communication ports include a storage interface protocol port and a universal network protocol storage interface port.
287. The method of claim 285, wherein the two or more different types of network communication ports include a storage interface protocol port and an Ethernet protocol storage interface port.
288. The method of claim 285, wherein the two or more different types of network communication ports comprise a storage interface protocol port and a PCIe port.
289. The method of claim 280, comprising a management unit that includes a compute node of a computing system and is controlled by a manager of the computing system.
290. The method of claim 280, comprising controlling, by a compute node of a computing system, at least one of the retrieving, first processing, sending, and second processing.
291. The method of claim 280, comprising simultaneously executing multiple tasks by the database acceleration integrated circuit.
292. The method of claim 280, comprising managing at least one of the retrieving, first processing, sending, and second processing using a management unit located outside the database acceleration integrated circuit.
293. The method of claim 280, wherein the database acceleration integrated circuit belongs to a computing system.
294. The method of claim 280, wherein the database acceleration integrated circuit does not belong to a computing system.
295. The method of claim 280, comprising performing, by a compute node of a computing system, at least one of the retrieving, first processing, sending, and second processing based on an execution plan sent to the database acceleration integrated circuit.
296. The method of claim 280, wherein the performing of the database processing operation comprises concurrently executing database processing instructions by database processing subunits, wherein the database acceleration unit comprises a group of database processing subunits that share a shared memory unit.
297. The method of claim 296, wherein each database processing subunit is configured to execute a particular type of database processing instruction.
298. The method of claim 297, comprising dynamically linking database processing subunits to provide an execution pipeline for performing database processing operations comprising a plurality of instructions.
299. The method of claim 280, wherein the performing of the database processing operation includes allocating resources of the database acceleration integrated circuit as a function of temporal I/O bandwidth.
300. The method of claim 280, comprising outputting the database acceleration results to local storage and retrieving the database acceleration results from the local storage.
301. The method of claim 280, wherein the network communication interface comprises an RDMA unit.
302. The method of claim 280, comprising exchanging information between database accelerated integrated circuits of one or more groups of database accelerated integrated circuits.
303. The method of claim 280, comprising exchanging database acceleration results between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.
304. The method of claim 280, comprising exchanging, between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits, at least one of: (a) information; and (b) database acceleration results.
305. The method of claim 304, wherein the database acceleration integrated circuits of a group are connected to a common printed circuit board.
306. The method of claim 304, wherein the database acceleration integrated circuits of a group belong to a modular unit of a computerized system.
307. The method of claim 304, wherein different groups of database acceleration integrated circuits are connected to different printed circuit boards.
308. The method of claim 304, wherein different groups of database acceleration integrated circuits belong to different modular units of a computerized system.
309. The method of claim 304, comprising performing distributed processing by the database acceleration integrated circuits of the one or more groups.
310. The method of claim 304, wherein the exchanging is performed using network communication interfaces of the database acceleration integrated circuits of the one or more groups.
311. The method of claim 304, wherein the exchanging is performed over a plurality of groups connected to each other by a star connection.
312. The method of claim 304, comprising using at least one switch for exchanging, between database acceleration integrated circuits of different groups of the one or more groups, at least one of: (a) information; and (b) database acceleration results.
313. The method of claim 304, comprising performing distributed processing by at least some of the database acceleration integrated circuits of the one or more groups.
314. The method of claim 304, comprising performing distributed processing of first and second data structures, wherein a total size of the first and second data structures exceeds a storage capacity of the plurality of memory processing integrated circuits.
315. The method of claim 314, wherein the performing of the distributed processing comprises performing multiple iterations of: (a) newly allocating different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits; and (b) processing the different pairs.
316. The method of claim 315, wherein the execution of the distributed processing comprises a database join operation.
317. The method of claim 315, wherein the performing of the distributed processing comprises:
assigning different first data structure portions to different database acceleration integrated circuits of the one or more groups; and
performing a plurality of iterations of:
newly allocating different second data structure portions to different database acceleration integrated circuits of the one or more groups, and
processing the first and second data structure portions by the database acceleration integrated circuits.
318. The method of claim 317, wherein the new allocation for a next iteration is performed in a manner that at least partially overlaps in time with the processing of a current iteration.
319. The method of claim 317, wherein the new allocation comprises exchanging a second data structure portion between the different database acceleration integrated circuits.
320. The method of claim 319, wherein the exchanging is performed in a manner that at least partially overlaps in time with the processing.
321. The method of claim 317, wherein the new allocation comprises exchanging second data structure portions between the different database acceleration integrated circuits of a group; and, once that exchanging has been completed, exchanging second data structure portions between different groups of database acceleration integrated circuits.
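The rotation scheme of claims 317 to 321 amounts to keeping each first data structure portion resident while the second portions circulate among the accelerators; a minimal Python sketch of that schedule follows, with a toy key-match join as the assumed pairwise operation (all names hypothetical).

```python
# Hypothetical sketch of claims 317-321: first portions stay pinned to their
# accelerator; second portions rotate each iteration, so every (first, second)
# pair is processed even though neither structure fits on one device.

def distributed_join(first_parts, second_parts, join_pair):
    n = len(first_parts)
    results = []
    for step in range(n):                               # one iteration per rotation
        for dev in range(n):                            # each "accelerator"
            second = second_parts[(dev + step) % n]     # newly allocated portion
            results.extend(join_pair(first_parts[dev], second))
    return results

# Toy key-match join split across two "accelerators".
A = [[(1, "a"), (2, "b")], [(3, "c")]]
B = [[(2, "B")], [(3, "C"), (4, "D")]]
out = distributed_join(A, B, lambda f, s: [(k, v, w) for k, v in f
                                           for k2, w in s if k == k2])
print(sorted(out))                                      # -> [(2, 'b', 'B'), (3, 'c', 'C')]
```

In such a scheme the rotation for step s+1 could be performed while step s is still processing, which is the time overlap recited in claims 318 to 320.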
322. The method of claim 280, wherein the database acceleration integrated circuit is included in a blade that includes a plurality of database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet switch, a PCIe switch, and the plurality of memory processing integrated circuits.
323. A device for database acceleration, the device comprising:
A database acceleration integrated circuit; and
a plurality of memory processing integrated circuits; wherein each memory processing integrated circuit comprises a controller, a plurality of processor subunits and a plurality of memory units;
wherein a network communication interface of the database acceleration integrated circuit is configured to receive information from a storage unit;
wherein the database acceleration integrated circuit is configured to first process an amount of information to provide first processed information;
wherein a memory controller of the database acceleration integrated circuit is configured to send the first processed information to the plurality of memory processing integrated circuits via an interface;
wherein the plurality of memory processing integrated circuits are configured to second process at least a portion of the first processed information to provide second processed information;
wherein the memory controller of the database acceleration integrated circuit is configured to retrieve information from the plurality of memory processing integrated circuits, wherein the retrieved information includes at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information;
wherein a database acceleration unit of the database acceleration integrated circuit is configured to perform database processing operations on the retrieved information to provide a database acceleration result; and
wherein the database acceleration integrated circuit is configured to output the database acceleration result.
324. The apparatus of claim 323, configured to manage at least one of the retrieving, first processing, and second processing of the retrieved information using a management unit of the database acceleration integrated circuit.
325. The apparatus of claim 324, wherein the management unit is configured to manage based on an execution plan generated by the management unit of the database acceleration integrated circuit.
326. The apparatus of claim 324, wherein the management unit is configured to manage based on an execution plan received by the management unit of the database acceleration integrated circuit and not generated by the management unit.
327. The apparatus of claim 324, wherein the management unit is configured to manage by distributing one or more of: (a) network communication interface resources; (b) a decompression unit resource; (c) a memory controller resource; (d) a plurality of memory processing integrated circuit resources; and (e) database acceleration unit resources.
328. The apparatus of claim 323, wherein the network communication interfaces include different types of network communication ports.
329. The device of claim 328, wherein the different types of network communication ports include a storage interface protocol port and a universal network protocol storage interface port.
330. The device of claim 328, wherein the different types of network communication ports include a storage interface protocol port and an Ethernet protocol storage interface port.
331. The device of claim 328, wherein the different types of network communication ports include a storage interface protocol port and a PCIe port.
332. The apparatus of claim 323, wherein the apparatus is coupled to a management unit that includes a compute node of a computing system and that is controlled by a manager of the computing system.
333. The apparatus of claim 323, configured to be controlled by a compute node of a computing system.
334. The apparatus of claim 323, configured to simultaneously execute multiple tasks by the database acceleration integrated circuit.
335. The apparatus of claim 323, wherein the database acceleration integrated circuit belongs to a computing system.
336. The apparatus of claim 323, wherein the database acceleration integrated circuit does not belong to a computing system.
337. The apparatus of claim 323, configured to perform at least one of the retrieving, the first processing, the sending, and the second processing by a compute node of a computing system based on an execution plan sent to the database acceleration integrated circuit.
338. The apparatus of claim 323, wherein the database acceleration unit is configured to concurrently execute database processing instructions by database processing subunits, wherein the database acceleration unit includes a group of database processing subunits that share a shared memory unit.
339. The apparatus of claim 338, wherein each database processing subunit is configured to execute a particular type of database processing instruction.
340. The apparatus of claim 339, wherein the apparatus is configured to dynamically link database processing subunits to provide an execution pipeline for performing database processing operations comprising a plurality of instructions.
341. The apparatus of claim 323, wherein the apparatus is configured to allocate resources of the database accelerated integrated circuit as a function of temporal I/O bandwidth.
342. The device of claim 323, wherein the device includes local storage accessible by the database acceleration integrated circuit.
343. The apparatus of claim 323, wherein the network communication interface comprises an RDMA unit.
344. The device of claim 323, wherein the device includes one or more groups of database acceleration integrated circuits configured to exchange information between database acceleration integrated circuits of the one or more groups of database acceleration integrated circuits.
345. The device of claim 323, wherein the device includes one or more groups of database acceleration integrated circuits configured to exchange acceleration results between database acceleration integrated circuits of the one or more groups of database acceleration integrated circuits.
346. The device of claim 323, wherein the device includes one or more groups of database acceleration integrated circuits configured to exchange, between the database acceleration integrated circuits of the one or more groups, at least one of: (a) information; and (b) database acceleration results.
347. The apparatus of claim 346, wherein the database acceleration integrated circuits of a group are connected to the same printed circuit board.
348. The apparatus of claim 346, wherein the database acceleration integrated circuits of a group belong to a modular unit of a computerized system.
349. The apparatus of claim 346, wherein different groups of database acceleration integrated circuits are connected to different printed circuit boards.
350. The device of claim 346, wherein different groups of database acceleration integrated circuits belong to different modular units of a computerized system.
351. The apparatus of claim 346, wherein the exchange is performed using network communication interfaces of the database acceleration integrated circuits of the one or more groups.
352. The apparatus of claim 346, wherein the exchange is performed over a plurality of groups connected to each other by a star connection.
353. The apparatus of claim 346, wherein the apparatus is configured to use at least one switch for exchanging, between database acceleration integrated circuits of different ones of the one or more groups, at least one of: (a) information; and (b) database acceleration results.
354. The apparatus of claim 346, wherein the apparatus is configured to perform distributed processing by some of the database acceleration integrated circuits of the one or more groups.
355. The apparatus of claim 346, wherein the apparatus is configured to perform distributed processing using first and second data structures, wherein a total size of the first and second data structures exceeds a storage capacity of the plurality of memory processing integrated circuits.
356. The apparatus of claim 355, wherein the apparatus is configured to perform the distributed processing by performing a plurality of iterations of: (a) newly allocating different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits; and (b) processing the different pairs.
357. The apparatus of claim 355, wherein the distributed processing comprises a database join operation.
358. The apparatus of claim 355, wherein the apparatus is configured to perform the distributed processing by:
assigning different first data structure portions to different database acceleration integrated circuits of the one or more groups; and
performing a plurality of iterations of:
newly allocating different second data structure portions to different database acceleration integrated circuits of the one or more groups, and
processing the first and second data structure portions by the database acceleration integrated circuits.
359. The apparatus of claim 358, wherein the apparatus is configured to perform the new allocation for a next iteration in a manner that at least partially overlaps in time with the processing of a current iteration.
360. The apparatus of claim 358, wherein the apparatus is configured to perform the new allocation by exchanging a second data structure portion between the different database acceleration integrated circuits.
361. The apparatus of claim 360, wherein the exchanging is performed by the database acceleration integrated circuits in a manner that at least partially overlaps in time with the processing.
362. The apparatus of claim 358, wherein the apparatus is configured to perform the new allocation by: exchanging second data structure portions between the different database acceleration integrated circuits of a group; and, once that exchanging has been completed, exchanging second data structure portions between different groups of database acceleration integrated circuits.
363. The apparatus of claim 323, wherein the database acceleration integrated circuit is included in a blade that includes a plurality of database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet switch, a PCIe switch, and the plurality of memory processing integrated circuits.
364. A method for database acceleration, the method comprising:
retrieving information from a storage unit via a network communication interface of a database acceleration integrated circuit;
first processing an amount of information to provide first processed information;
sending, by a memory controller of the database acceleration integrated circuit, the first processed information to a plurality of memory resources via an interface;
retrieving information from the plurality of memory resources;
performing, by a database acceleration unit of the database acceleration integrated circuit, a database processing operation on the retrieved information to provide a database acceleration result; and
outputting the database acceleration result.
365. The method of claim 364, further comprising processing the first processed information to provide second processed information, wherein the processing of the first processed information is performed by a plurality of processors located in one or more memory processing integrated circuits further comprising the plurality of memory resources.
366. The method of claim 364, wherein the first processing comprises filtering database entries.
367. The method of claim 365, wherein the second processing comprises filtering database entries.
368. The method of claim 365, wherein the first processing and the second processing comprise filtering database entries.
Applications Claiming Priority (11)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962886328P | 2019-08-13 | 2019-08-13 | |
US62/886,328 | 2019-08-13 | ||
US201962907659P | 2019-09-29 | 2019-09-29 | |
US62/907,659 | 2019-09-29 | ||
US201962930593P | 2019-11-05 | 2019-11-05 | |
US62/930,593 | 2019-11-05 | ||
US202062971912P | 2020-02-07 | 2020-02-07 | |
US62/971,912 | 2020-02-07 | ||
US202062983174P | 2020-02-28 | 2020-02-28 | |
US62/983,174 | 2020-02-28 | ||
PCT/IB2020/000665 WO2021028723A2 (en) | 2019-08-13 | 2020-08-13 | Memory-based processors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114586019A true CN114586019A (en) | 2022-06-03 |
Family
ID=74570549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202080071415.1A Pending CN114586019A (en) | 2019-08-13 | 2020-08-13 | Memory-based processor |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP4010808A4 (en) |
KR (1) | KR20220078566A (en) |
CN (1) | CN114586019A (en) |
TW (1) | TW202122993A (en) |
WO (1) | WO2021028723A2 (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12073251B2 (en) | 2020-12-29 | 2024-08-27 | Advanced Micro Devices, Inc. | Offloading computations from a processor to remote execution logic |
WO2022245382A1 (en) * | 2021-05-18 | 2022-11-24 | Silicon Storage Technology, Inc. | Split array architecture for analog neural memory in a deep learning artificial neural network |
US20220374696A1 (en) * | 2021-05-18 | 2022-11-24 | Silicon Storage Technology, Inc. | Split array architecture for analog neural memory in a deep learning artificial neural network |
US11327771B1 (en) * | 2021-07-16 | 2022-05-10 | SambaNova Systems, Inc. | Defect repair circuits for a reconfigurable data processor |
US12112792B2 (en) * | 2021-08-10 | 2024-10-08 | Micron Technology, Inc. | Memory device for wafer-on-wafer formed memory and logic |
CN115729845A (en) * | 2021-08-30 | 2023-03-03 | 华为技术有限公司 | Data storage device and data processing method |
US11914532B2 (en) | 2021-08-31 | 2024-02-27 | Apple Inc. | Memory device bandwidth optimization |
US11947940B2 (en) | 2021-10-11 | 2024-04-02 | International Business Machines Corporation | Training data augmentation via program simplification |
CN116264085A (en) | 2021-12-14 | 2023-06-16 | 长鑫存储技术有限公司 | Storage system and data writing method thereof |
TWI819480B (en) | 2022-01-27 | 2023-10-21 | 緯創資通股份有限公司 | Acceleration system and dynamic configuration method thereof |
TWI776785B (en) * | 2022-04-07 | 2022-09-01 | 點序科技股份有限公司 | Die test system and die test method thereof |
US11755399B1 (en) * | 2022-05-24 | 2023-09-12 | Macronix International Co., Ltd. | Bit error rate reduction technology |
TW202406056A (en) * | 2022-05-25 | 2024-02-01 | 以色列商紐羅布萊德有限公司 | Processing systems and methods |
US20230393849A1 (en) * | 2022-06-01 | 2023-12-07 | Advanced Micro Devices, Inc. | Method and apparatus to expedite system services using processing-in-memory (pim) |
WO2024027937A1 (en) * | 2022-08-05 | 2024-02-08 | Synthara Ag | Memory-mapped compact computing array |
TWI843280B (en) * | 2022-11-09 | 2024-05-21 | 財團法人工業技術研究院 | Artificial intelligence accelerator and operating method thereof |
CN115599025B (en) * | 2022-12-12 | 2023-03-03 | 南京芯驰半导体科技有限公司 | Resource grouping control system, method and storage medium of chip array |
CN116962176B (en) * | 2023-09-21 | 2024-01-23 | 浪潮电子信息产业股份有限公司 | Data processing method, device and system of distributed cluster and storage medium |
CN118133574B (en) * | 2024-05-06 | 2024-07-19 | 沐曦集成电路(上海)有限公司 | SRAM (static random Access memory) generating system |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002063069A (en) * | 2000-08-21 | 2002-02-28 | Hitachi Ltd | Memory controller, data processing system, and semiconductor device |
US9612979B2 (en) * | 2010-10-22 | 2017-04-04 | Intel Corporation | Scalable memory protection mechanism |
US20140040622A1 (en) * | 2011-03-21 | 2014-02-06 | Mocana Corporation | Secure unlocking and recovery of a locked wrapped app on a mobile device |
US9262246B2 (en) * | 2011-03-31 | 2016-02-16 | Mcafee, Inc. | System and method for securing memory and storage of an electronic device with a below-operating system security agent |
US8590050B2 (en) * | 2011-05-11 | 2013-11-19 | International Business Machines Corporation | Security compliant data storage management |
US8996951B2 (en) * | 2012-11-15 | 2015-03-31 | Elwha, Llc | Error correction with non-volatile memory on an integrated circuit |
US9424213B2 (en) * | 2012-11-21 | 2016-08-23 | Coherent Logix, Incorporated | Processing system with interspersed processors DMA-FIFO |
CN111149166B (en) * | 2017-07-30 | 2024-01-09 | 纽罗布拉德有限公司 | Memory-based distributed processor architecture |
US10810141B2 (en) * | 2017-09-29 | 2020-10-20 | Intel Corporation | Memory control management of a processor |
- 2020-08-13 WO PCT/IB2020/000665 patent/WO2021028723A2/en unknown
- 2020-08-13 CN CN202080071415.1A patent/CN114586019A/en active Pending
- 2020-08-13 EP EP20852497.5A patent/EP4010808A4/en not_active Withdrawn
- 2020-08-13 KR KR1020227008116A patent/KR20220078566A/en unknown
- 2020-08-13 TW TW109127495A patent/TW202122993A/en unknown
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110875066A (en) * | 2018-09-03 | 2020-03-10 | 爱思开海力士有限公司 | Semiconductor device and semiconductor system including the same |
CN115237036A (en) * | 2022-09-22 | 2022-10-25 | 之江实验室 | Full-digitalization management device for wafer-level processor system |
CN115237036B (en) * | 2022-09-22 | 2023-01-10 | 之江实验室 | Full-digitalization management device for wafer-level processor system |
WO2024193274A1 (en) * | 2023-03-22 | 2024-09-26 | 华为技术有限公司 | Memory and device |
CN118295960A (en) * | 2024-06-03 | 2024-07-05 | 芯方舟(上海)集成电路有限公司 | Force calculating chip, design method and manufacturing method thereof and force calculating chip system |
CN118295960B (en) * | 2024-06-03 | 2024-09-03 | 芯方舟(上海)集成电路有限公司 | Force calculating chip, design method and manufacturing method thereof and force calculating chip system |
Also Published As
Publication number | Publication date |
---|---|
WO2021028723A3 (en) | 2021-07-08 |
EP4010808A2 (en) | 2022-06-15 |
TW202122993A (en) | 2021-06-16 |
KR20220078566A (en) | 2022-06-10 |
EP4010808A4 (en) | 2023-11-15 |
WO2021028723A2 (en) | 2021-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114586019A (en) | Memory-based processor | |
US20220164284A1 (en) | In-memory zero value detection | |
TWI779069B (en) | Memory chip with a memory-based distributed processor architecture | |
US11901026B2 (en) | Partial refresh | |
CN111433758B (en) | Programmable operation and control chip, design method and device thereof | |
CN112912856B (en) | Memory-based processor | |
TWI856974B (en) | Variable word length access |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||