TW202122993A - Memory-based processors - Google Patents
- Publication number
- TW202122993A; TW109127495A
- Authority
- TW
- Taiwan
- Prior art keywords
- memory
- processing
- processor
- database
- unit
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0215—Addressing or allocation; Relocation with look ahead addressing means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/14—Protection against unauthorised use of memory or access to memory
- G06F12/1416—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights
- G06F12/1425—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights the protection being physical, e.g. cell, word, block
- G06F12/1441—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights the protection being physical, e.g. cell, word, block for a range
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/70—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
- G06F21/78—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C7/00—Arrangements for writing information into, or reading information out from, a digital store
- G11C7/10—Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
- G11C7/1006—Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0813—Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/45—Caching of specific data in cache memory
- G06F2212/454—Vector or matrix data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Priority
This application claims priority to the following: U.S. Provisional Application No. 62/886,328, filed on August 13, 2019; U.S. Provisional Application No. 62/907,659, filed on September 29, 2019; U.S. Provisional Application No. 62/971,912, filed on February 7, 2020; and U.S. Provisional Application No. 62/983,174, filed on February 28, 2020. The foregoing applications are incorporated herein by reference in their entirety.
The present disclosure relates generally to apparatuses for facilitating memory-intensive operations. In particular, the present disclosure relates to hardware chips that include processing elements coupled to dedicated memory banks. The present disclosure also relates to apparatuses for improving the power efficiency and speed of memory chips. In particular, the present disclosure relates to systems and methods for implementing partial refresh, or even no refresh, on a memory chip. The present disclosure also relates to selectable-size memory chips and to dual-port capabilities on a memory chip.
As both processor speeds and memory sizes continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from throughput limitations imposed by conventional computer architecture. In particular, data transfer from memory to the processor is often bottlenecked compared with the actual computations performed by the processor. Accordingly, the number of clock cycles spent reading from and writing to memory increases significantly with memory-intensive processes. These clock cycles result in lower effective processing speeds, because reading from and writing to memory consumes clock cycles that cannot be used to perform operations on data. Moreover, the computational bandwidth of the processor is generally larger than the bandwidth of the buses that the processor uses to access the memory.
These bottlenecks are particularly pronounced for memory-intensive processes, such as neural networks and other machine learning algorithms; for database construction, index searching, and querying; and for other tasks that include more read and write operations than data processing operations.
Additionally, the rapid growth in the volume and granularity of available digital data has created opportunities to develop machine learning algorithms and has enabled new technologies. However, it has also brought cumbersome challenges to the fields of databases and parallel computing. For example, the rise of social media and the Internet of Things (IoT) creates digital data at a record rate. This new data may be used to create algorithms for a variety of purposes, ranging from new advertising techniques to more precise control methods for industrial processes. However, the new data is difficult to store, process, analyze, and handle.
New data resources can be massive, sometimes on the order of petabytes to zettabytes. Moreover, the growth rate of these data resources may exceed data processing capabilities. Data scientists have therefore turned to parallel data processing techniques to tackle these challenges. In an effort to increase computation power and handle massive amounts of data, scientists have attempted to create systems and methods capable of parallel intensive computing. But these existing systems and methods have not kept up with data processing requirements, often because the techniques employed are limited by their demand for additional resources for data management, integration of divided data, and analysis of the sectioned data.
To facilitate the manipulation of large data sets, engineers and scientists now seek to improve the hardware used to analyze data. For example, new semiconductor processors or chips (such as those described herein) may be designed specifically for data-intensive tasks by incorporating memory and processing functions in a single substrate fabricated in technologies better suited to memory operations than to arithmetic computation. With integrated circuits specifically designed for data-intensive tasks, it is possible to meet the new data processing requirements. Nonetheless, this new approach to processing large data sets requires solving new problems in chip design and fabrication. For instance, if new chips designed for data-intensive tasks are manufactured with the fabrication techniques and architectures used for general-purpose chips, they will have poor performance and/or unacceptable yields. In addition, if the new chips are designed to operate with current data handling methods, they will have poor performance because current methods can limit a chip's ability to handle parallel operations.
The present disclosure describes solutions for mitigating or overcoming one or more of the problems set forth above, among other problems in the prior art.
In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, where the memory array includes a plurality of discrete memory banks. The integrated circuit may also include a processing array disposed on the substrate, where the processing array includes a plurality of processor subunits, each of which is associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to an operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.
The disclosed embodiments may also include a method of securing an integrated circuit against tampering. The method includes implementing, using a controller associated with the integrated circuit, at least one security measure with respect to an operation of the integrated circuit, and taking one or more remedial actions if the at least one security measure is triggered. The integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of which is associated with one or more discrete memory banks among the plurality of discrete memory banks.
The disclosed embodiments may include an integrated circuit comprising: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of which is associated with one or more discrete memory banks among the plurality of discrete memory banks; and a controller configured to implement at least one security measure with respect to an operation of the integrated circuit, wherein the at least one security measure includes duplicating program code in at least two different memory portions.
In some embodiments, a distributed processor memory chip is provided that includes: a substrate; a memory array disposed on the substrate; a processing array disposed on the substrate; a first communication port; and a second communication port. The memory array may include a plurality of discrete memory banks. The processing array may include a plurality of processor subunits, each of which is associated with one or more discrete memory banks among the plurality of discrete memory banks. The first communication port may be configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip. The second communication port may be configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.
In some embodiments, a method of transferring data between a first distributed processor memory chip and a second distributed processor memory chip may include: determining, using a controller associated with at least one of the first and second distributed processor memory chips, whether a first processor subunit among a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and, after determining that the first processor subunit is ready to transfer the data to the second processor subunit, using a clock enable signal controlled by the controller to initiate the transfer of the data from the first processor subunit to the second processor subunit.
In some embodiments, a memory unit may include: a memory array including a plurality of memory banks; at least one controller configured to control at least one aspect of read operations with respect to the plurality of memory banks; and at least one zero value detection logic unit configured to detect a multi-bit zero value stored in a particular address of the plurality of memory banks. The at least one controller and the at least one zero value detection logic unit are configured to return a zero value indicator to one or more circuits outside the memory unit in response to a zero value detection by the at least one zero value detection logic unit.
Some embodiments may include a method for detecting a zero value in a particular address of a plurality of discrete memory banks, the method including: receiving, from a circuit outside a memory unit, a request to read data stored in an address of the plurality of discrete memory banks; activating, by a controller and in response to the received request, a zero value detection logic unit to detect a zero value in the received address; and transmitting, by the controller, a zero value indicator to the circuit in response to the zero value detection by the zero value detection logic unit.
Some embodiments may include a non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to detect a zero value in a particular address of a plurality of discrete memory banks, by a method including: receiving, from a circuit outside the memory unit, a request to read data stored in an address of the plurality of discrete memory banks; activating, by the controller and in response to the received request, a zero value detection logic unit to detect a zero value in the received address; and transmitting, by the controller, a zero value indicator to the circuit in response to the zero value detection by the zero value detection logic unit.
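The zero value detection flow described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and field names (MemoryUnit, ReadResult, is_zero) are hypothetical, the banks are modeled as lists of integers, and omitting the payload when a zero is detected is an assumption made for illustration, not something the text specifies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReadResult:
    is_zero: bool          # zero value indicator returned to the requesting circuit
    data: Optional[int]    # payload; left out here when the word is zero (assumption)


class MemoryUnit:
    """Toy model: memory banks of multi-bit words plus zero value detection."""

    def __init__(self, banks: List[List[int]]):
        self.banks = banks

    def read(self, bank: int, address: int) -> ReadResult:
        # The controller receives the read request and activates the detection logic.
        word = self.banks[bank][address]
        if word == 0:
            # Zero detected: return only the indicator instead of the full word.
            return ReadResult(is_zero=True, data=None)
        return ReadResult(is_zero=False, data=word)
```

In this model, a requester reading a sparse data set would receive a one-field indicator for every zero word, which hints at why such detection may reduce traffic out of the memory unit.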
In some embodiments, a memory unit may include: one or more memory banks; a bank controller; and an address generator. The address generator is configured to provide to the bank controller a current address in a current row to be accessed in an associated memory bank, to determine a predicted address of a next row to be accessed in the associated memory bank, and to provide the predicted address to the bank controller before completion of a read operation with respect to the current row associated with the current address.
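The ordering of events in the address generator embodiment can be sketched as follows. The class names, the method names, and the fixed-stride prediction rule are all assumptions chosen for illustration; the text does not say how the next row is predicted.

```python
class BankController:
    """Hypothetical bank controller that simply records the order of events."""

    def __init__(self):
        self.events = []

    def open_row(self, row):
        self.events.append(("open", row))

    def hint_next_row(self, row):
        self.events.append(("hint", row))

    def finish_read(self, row):
        self.events.append(("read_done", row))


class AddressGenerator:
    """Predicts the next row with a fixed stride (an illustrative assumption)."""

    def __init__(self, start_row, stride, num_rows):
        self.row = start_row
        self.stride = stride
        self.num_rows = num_rows

    def access(self, controller):
        current = self.row
        predicted = (current + self.stride) % self.num_rows
        controller.open_row(current)          # current address goes to the controller
        controller.hint_next_row(predicted)   # prediction arrives before the read completes
        controller.finish_read(current)
        self.row = predicted
        return current, predicted
```

The point of the sketch is the event order: the predicted row reaches the controller while the current read is still in flight, which is what would allow a controller to prepare the next row early.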
In some embodiments, a memory unit may include: one or more memory banks, each of which includes a plurality of rows; a first row controller configured to control a first subset of the plurality of rows; a second row controller configured to control a second subset of the plurality of rows; a single data input to receive data to be stored in the plurality of rows; and a single data output to provide data retrieved from the plurality of rows.
In some embodiments, a distributed processor memory chip may include: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of which is associated with a corresponding, dedicated one of the plurality of discrete memory banks; a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank; and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits. At least one of the memory banks may include at least one DRAM memory mat disposed on the substrate. At least one of the processor units may include one or more logic components associated with the at least one memory mat. The at least one memory mat and the one or more logic components may be configured to serve as a cache for one or more of the plurality of processor subunits.
In some embodiments, a method of executing at least one instruction in a distributed processor memory chip may include: retrieving one or more data values from a memory array of the distributed processor memory chip; storing the one or more data values in a register formed in a memory mat of the distributed processor memory chip; and accessing the one or more data values stored in the register according to at least one instruction executed by a processor element. The memory array includes a plurality of discrete memory banks disposed on a substrate. The processor element is a processor subunit among a plurality of processor subunits included in a processing array disposed on the substrate, where each of the processor subunits is associated with a corresponding, dedicated one of the plurality of discrete memory banks. The register is provided by a memory mat disposed on the substrate.
Some embodiments may include a device comprising: a substrate; a processing unit disposed on the substrate; and a memory unit disposed on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit, and wherein the processing unit includes a memory mat configured to serve as a cache for the processing unit.
Processing systems are expected to process increasing amounts of information at very high rates. For example, fifth generation (5G) mobile Internet is expected to receive vast information streams and to process them at an increasing rate.
The processing system may include one or more buffers and a processor. The processing operations applied by the processor may have a certain latency, which may require vast buffers. Vast buffers may be costly and/or area-consuming.
Transferring vast amounts of information from the buffers to the processor may require high-bandwidth connectors and/or high-bandwidth buses between the buffers and the processor, which may also increase the cost and area of the processing system.
There is a growing need to provide an efficient processing system.
A disaggregated server includes multiple subsystems, each of which has a unique role. For example, a disaggregated server may include one or more switching subsystems, one or more computing subsystems, and one or more storage subsystems.
The one or more computing subsystems and the one or more storage subsystems are coupled to each other via the one or more switching subsystems.
A computing subsystem may include multiple computing units.
A switching subsystem may include multiple switching units.
A storage subsystem may include multiple storage units.
The bottleneck of such a disaggregated server lies in the bandwidth required to transfer information between the subsystems.
This is especially true when executing distributed calculations that require sharing information units among all (or at least most) of the computing units (such as graphics processing units) of the different computing subsystems.
Assume that N computing units participate in the sharing, that N is a very large integer (for example, at least 1024), and that each of the N computing units must send information units to all the other computing units (and receive information units from all the other computing units). Under these assumptions, approximately N×N transfers of information units need to be performed. Such a large number of transfers is time-consuming and energy-consuming, and will significantly limit the throughput of the disaggregated server.
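As a rough check on the N×N estimate, the short sketch below counts the point-to-point transfers in an all-to-all exchange. The function name is hypothetical, and the model assumes one transfer per ordered pair of distinct computing units.

```python
def all_to_all_transfers(n: int) -> int:
    """Number of point-to-point transfers when each of n units sends to every other unit."""
    return n * (n - 1)


# For the N = 1024 example in the text, this gives 1,047,552 transfers,
# which is close to N x N = 1,048,576.
transfers = all_to_all_transfers(1024)
```

Because the count grows quadratically, doubling the number of computing units roughly quadruples the number of transfers.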
There is a growing need to provide an efficient disaggregated server and an efficient manner of performing distributed processing.
A database includes many entries, each of which includes multiple fields. Database processing usually includes executing one or more queries that include one or more filtering parameters (for example, identifying one or more relevant fields and one or more relevant field values) and one or more operation parameters, which may determine the type of operation to be executed, a variable or constant to be used when applying the operation, and the like.
For example, a database query may request that a statistical operation (the operation parameter) be performed on all records of the database in which a certain field has a value within a predefined range (the filtering parameter). As another example, a database query may request the deletion (the operation parameter) of records in which a certain field is smaller than a threshold (the filtering parameter).
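The two example queries can be sketched in a few lines. The records, the field names, and the helper functions below are hypothetical and exist only to show how a filtering parameter and an operation parameter combine.

```python
from statistics import mean

# Hypothetical records; the field names are chosen for illustration only.
records = [
    {"id": 1, "age": 25, "salary": 3000},
    {"id": 2, "age": 41, "salary": 5200},
    {"id": 3, "age": 33, "salary": 4100},
    {"id": 4, "age": 58, "salary": 6100},
]


def select(rows, field, predicate):
    """Filtering parameter: keep the rows whose field satisfies the predicate."""
    return [r for r in rows if predicate(r[field])]


def mean_in_range(rows, filter_field, low, high, value_field):
    """Statistical operation applied to rows whose filter_field lies in [low, high]."""
    relevant = select(rows, filter_field, lambda v: low <= v <= high)
    return mean(r[value_field] for r in relevant)


def delete_below(rows, field, threshold):
    """Deletion applied to rows whose field is smaller than the threshold."""
    return [r for r in rows if not r[field] < threshold]
```

Here mean_in_range(records, "age", 30, 45, "salary") averages the salaries of the two records aged 30 to 45, and delete_below(records, "salary", 4000) drops the single record below the threshold.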
Large databases are usually stored in storage devices. To respond to a query, the database is sent to a memory unit, usually one database segment after another.
The entries of a database segment are sent from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entries are then processed by the processor.
For each database segment of the database stored in the memory unit, the processing includes the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether the record is relevant; and (iv) performing one or more additional operations (summing, or applying any other mathematical and/or statistical operation) on the relevant records.
The filtering process ends after all the records have been sent to the processor and the processor has determined which records are relevant.
In cases where the relevant entries of a database segment are not stored in the processor, these relevant records need to be sent to the processor again after the filtering stage for further processing (that is, to apply the subsequent operations).
當多個處理操作在單個篩選之後時,則可將每一操作之結果發送至記憶體單元且接著再次發送至處理器。 When multiple processing operations follow a single screening, the result of each operation can be sent to the memory unit and then sent to the processor again.
此處理程序為耗頻寬且耗時的。 This processing procedure is bandwidth-consuming and time-consuming.
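The conventional flow described above can be sketched as follows. This is only an illustrative model, not an implementation from the disclosure: the `Record` type, the `host_side_query` function, and the transfer counter are hypothetical names introduced here to show that every record crosses the link between the memory unit and the processor before the processor can decide whether it is relevant.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Record:
    # Hypothetical record layout: one key and one numeric field.
    key: int
    value: float


def host_side_query(
    segments: Iterable[List[Record]],
    keep: Callable[[Record], bool],
    operation: Callable[[List[Record]], float],
) -> Tuple[float, int]:
    """Model the conventional flow; return (result, records transferred)."""
    transferred = 0
    relevant: List[Record] = []
    for segment in segments:           # one database segment after another
        for record in segment:         # (i) select a record of the segment
            transferred += 1           # (ii) send it to the external processor
            if keep(record):           # (iii) filter on the processor
                relevant.append(record)
    return operation(relevant), transferred  # (iv) operate on relevant records


segments = [[Record(1, 5.0), Record(2, 50.0)], [Record(3, 7.0)]]
total, moved = host_side_query(
    segments,
    keep=lambda r: r.value < 10.0,                  # filtering parameter
    operation=lambda rs: sum(r.value for r in rs),  # operation parameter
)
```

In this toy run all three records are transferred although only two of them pass the filter, which is the overhead the passage describes.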
There is a growing need for an efficient way to perform database processing.
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension (www.wikipedia.org).
Methods for generating this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base methods, and explicit representation in terms of the context in which words appear.
Word and phrase embeddings, when used as the underlying input representation, have been shown to improve performance in NLP tasks such as syntactic parsing and sentiment analysis.
A sentence may be segmented into words or phrases, and each segment may be represented by a vector. A sentence may be represented by a matrix that includes all the vectors representing the words or phrases of the sentence.
The vocabulary that maps words to vectors may be stored in a memory unit, such as a dynamic random access memory (DRAM), which may be accessed using the words or phrases (or indexes representing the words).
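As a rough sketch of the representation described above, the snippet below looks up one vector per word and stacks the vectors into a sentence matrix. The vocabulary contents, the whitespace segmentation, and the `embed_sentence` helper are assumptions made for illustration; a real vocabulary would hold far more entries and would reside in a memory unit such as a DRAM.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical three-word vocabulary mapping each word to a 2-element vector.
VOCABULARY: Dict[str, Vector] = {
    "memory": [0.1, 0.2],
    "is": [0.0, 1.0],
    "fast": [0.5, 0.5],
}


def embed_sentence(sentence: str, vocabulary: Dict[str, Vector]) -> List[Vector]:
    """Segment a sentence into words and return one vector (row) per word."""
    words = sentence.lower().split()             # assumed segmentation rule
    return [vocabulary[word] for word in words]  # one lookup per word


matrix = embed_sentence("Memory is fast", VOCABULARY)
```

Each lookup is keyed by whichever word comes next in the sentence, so consecutive lookups need not touch neighboring vocabulary entries.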
These accesses may be random accesses, which reduces the throughput of the DRAM. In addition, the accesses may saturate the DRAM, especially when a large number of accesses is fed to the DRAM.
In particular, the words included in a sentence are usually quite random. Even when DRAM bursts are used, accessing the DRAM memory that stores the mapping will usually yield the low performance of random access, because during a burst typically only a small fraction of the DRAM bank entries (out of the multiple entries of different banks that are accessed concurrently) will store entries related to a given sentence.
Accordingly, the throughput of the DRAM memory is low and non-continuous.
Each word or phrase of a sentence is retrieved from the DRAM memory under the control of a host computer that is external to the integrated circuit of the DRAM memory and that must control each retrieval of each vector representing each word or segment based on knowledge of the location of the words, which is a time-consuming and resource-consuming task.
Data centers and other computerized systems are expected to process and exchange increasing amounts of information at very high rates.
The exchange of increasing amounts of data may be the bottleneck of data centers and other computerized systems, and may cause such data centers and other computerized systems to utilize only a portion of their capabilities.
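FIG. 96A illustrates an example of a prior art database 12010 and a prior art server motherboard 12011. The database may include multiple servers, each server including multiple server motherboards (also denoted "CPU + memory + network"). Each server motherboard 12011 includes a CPU 12012 (such as, but not limited to, an Intel XEON) that receives traffic and is connected to a memory unit 12013 (denoted RAM) and to multiple database accelerators (DB accelerators) 12014.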
The DB accelerators are optional, and the DB acceleration operations may be executed by the CPU 12012.
All traffic flows through the CPU, and the CPU may be coupled to the DB accelerators via a link of relatively limited bandwidth, such as PCIe.
A large amount of resources is dedicated to routing information units between the multiple server motherboards.
There is a growing need for efficient data centers and other computerized systems.
The size of artificial intelligence (AI) applications such as neural networks has increased significantly. To cope with the increasing size of neural networks, multiple servers, each acting as an AI acceleration server (including a server motherboard), are used to perform neural network processing tasks such as, but not limited to, training. An example of a system including multiple AI acceleration servers arranged in different racks is shown in FIG. 1.
In a typical training session, a large number of images is processed concurrently to provide a large number of values, such as losses. The values are conveyed between different AI acceleration servers and result in an exceptional amount of traffic. For example, some neural network layers may be computed across multiple GPUs located in different AI acceleration servers, and may require bandwidth-consuming aggregation over the network.
Conveying this exceptional amount of traffic requires ultra-high bandwidth, which may be infeasible or may not be cost effective.
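FIG. 97A illustrates a system 12050 that includes subsystems, each subsystem including a switch 12051 for connecting AI acceleration servers 12052 that have server motherboards 12055. Each server motherboard includes RAM memory (RAM 12056), a central processing unit (CPU) 12054, and a network interface card (NIC) 12053, and the CPU 12054 is connected (via a PCIe bus) to multiple AI accelerators 12057 (such as graphics processing units, AI chips (AI ASICs), FPGAs, and the like). The NICs are coupled to each other (for example, by one or more switches) by a network (using, for example, Ethernet, UDP links, and the like), and these NICs may be able to convey the ultra-high bandwidth required by the system.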
There is a growing need for efficient AI computing systems.
Consistent with other disclosed embodiments, a non-transitory computer-readable storage medium may store program instructions that are executed by at least one processing device and perform any of the methods described herein.
The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.
38: line
100: CPU
110: processing unit
120a: processor subunit
120b: processor subunit
130: cache
140a: shared memory
140b: shared memory
200: GPU
210: processing unit
220a: processor subunit
220b: processor subunit
220c: processor subunit
220d: processor subunit
220e: processor subunit
220f: processor subunit
220g: processor subunit
220h: processor subunit
220i: processor subunit
220j: processor subunit
220k: processor subunit
220l: processor subunit
220m: processor subunit
220n: processor subunit
220o: processor subunit
220p: processor subunit
230a: cache
230b: cache
230c: cache
230d: cache
250a: shared memory
250b: shared memory
250c: shared memory
250d: shared memory
300: hardware chip
300': hardware chip
310a: processing group
310b: processing group
310c: processing group
310d: processing group
320a: logic and control subunit
320b: logic and control subunit
320c: logic and control subunit
320d: logic and control subunit
320e: logic and control subunit
320f: logic and control subunit
320g: logic and control subunit
320h: logic and control subunit
330a: dedicated memory instance
330b: dedicated memory instance
330c: dedicated memory instance
330d: dedicated memory instance
330e: dedicated memory instance
330f: dedicated memory instance
330g: dedicated memory instance
330h: dedicated memory instance
340a: control
340b: control
340c: control
340d: control
350: host
350a: processor subunit
350b: processor subunit
350c: processor subunit
350d: processor subunit
360a: bus
360b: bus
360c: bus
360d: bus
360e: bus
360f: bus
400: process
410: processing group
420: dedicated memory instance
430: processor subunit
440: processing element
450: address generator
460: control
500: exemplary process for executing specialized commands
510: processing group
520: dedicated memory instance
530: processing element
600: processing group
610: processor subunit
620: memory/memory element
630: bus
640: processing element
650: accelerator/MUX
660: input multiplexer (MUX)/DEMUX
670: output DEMUX
710: substrate
720a: memory bank
720b: memory bank
720c: memory bank
720d: memory bank
720e: memory bank
720f: memory bank
720g: memory bank
720h: memory bank
730a: processor subunit
730b: processor subunit
730c: processor subunit
730d: processor subunit
730e: processor subunit
730f: processor subunit
730g: processor subunit
730h: processor subunit
740a: bus
740b: bus
740c: bus
740d: bus
740e: bus
740f: bus
740g: bus
740h: bus
750a: bus
750b: bus
750c: bus
750d: bus
750e: bus
750f: bus
750g: bus
750h: bus
750i: bus
750j: bus
760: architecture
770a: memory chip
770b: memory chip
770c: memory chip
770d: memory chip
800: method for compiling a series of instructions
810: step
820: step
830: step
840: step
850: step
900: bank
910: row decoder
920: global sense amplifier
930-1: mat
930-2: mat
940-1: mat
940-2: mat
950: word line
960: bit line
1000: mat
1010-1: local amplifier
1010-2: local amplifier
1010-x: local amplifier
1020-1: word line driver
1020-2: word line driver
1020-x: word line driver
1030-1: cell
1030-2: cell
1030-3: cell
1040: word line
1050: bit line
1100: bank/subbank
1110: bank row decoder
1120: bank column decoder
1121a: bit line
1121b: bit line
1130a: subbank controller
1130b: subbank controller
1130c: subbank controller
1131a: bus
1131b: bus
1131c: bus
1140a: resolver
1140b: resolver
1140c: resolver
1150a: logic
1150b: logic
1150c: logic
1160a: decoder
1160b: decoder
1160c: decoder
1160d: decoder
1160e: decoder
1160f: decoder
1160g: decoder
1160h: decoder
1160i: decoder
1170a: subbank
1170b: subbank
1170c: subbank
1180a: decoder
1180b: decoder
1180c: decoder
1181a: bit line
1181b: bit line
1190a-1: mat
1190a-2: mat
1190a-x: mat
1190b-1: mat
1190b-2: mat
1190b-x: mat
1190c-1: mat
1190c-2: mat
1190c-x: mat
1200: memory subbank
1210: memory controller
1220a: fuse and comparator
1220b: fuse and comparator
1230a: row decoder
1230b: row decoder
1240a: mat
1240b: mat
1250a: column decoder
1250b: column decoder
1251: bit line
1253: bit line
1260a-1: cell
1260a-2: cell
1260a-x: cell
1260b-1: cell
1260b-2: cell
1260b-x: cell
1300: memory chip
1301: substrate
1302: address manager
1304: memory array
1304(a,a): memory bank/memory block/memory instance
1304(z,z): memory bank/memory instance
1306: memory logic
1308: business logic
1310: redundant business logic/redundant logic block/redundant business block
1312: deactivation switch
1314: activation switch
1400: set of redundant logic blocks
1402: address bus
1404: data bus
1500: logic block
1504: fetch circuit
1506: decoder
1508: register
1510: computing unit
1512: duplicate computing unit
1514: switching circuit
1516: switching circuit
1518: write-back circuit
1602: logic block
1602(a): logic block
1602(b): logic block
1602(c): logic block
1604: fused identification
1604(a): logic block
1604(b): logic block
1604(c): logic block
1614: address bus
1616: command line
1618: data line
1702: unit
1702(a): unit
1702(b): faulty unit
1702(c): unit
1712: unit
1712(a): unit
1712(b): unit
1712(c): unit
1722: switching circuit
1722(a): switching circuit
1722(b): switching circuit
1722(c): switching circuit
1728: switching circuit
1728(a): switching circuit
1728(c): switching circuit
1730: sampling circuit
1730(a): sampling circuit
1730(b): sampling circuit
1730(c): sampling circuit
1804: I/O block
1806: unit
1808: switch box
1810: connection box
1902(a): unit
1902(b): unit
1902(c): unit
1902(d): unit
1902(e): unit
1902(f): unit
1904(a): configuration switch
1904(b): configuration switch
1904(c): configuration switch
1904(d): configuration switch
1904(e): configuration switch
1904(f): configuration switch
1904(g): configuration switch
1904(h): configuration switch
2000: redundant block enabling process
2002: step
2004: step
2006: step
2008: step
2010: step
2100: address assignment process
2102: step
2104: step
2106: step
2108: step
2110: step
2112: step
2114: step
2116: step
2118: step
2120: step
2122: step
2200: processing device
2202: first memory block
2204: second memory block
2210: memory controller
2212: configuration manager
2214: logic block/unit
2216: accelerator/unit
2216(a): accelerator
2216(n): accelerator
2218: line
2220: line
2230: host
2300: processing device
2302: MAC unit
2304: configuration manager
2306: memory controller
2308(a): memory block
2308(b): memory block
2308(c): memory block
2308(d): memory block
2500: memory configuration process
2502: step
2504: step
2506: step
2508: step
2510: step
2512: step
2514: step
2600: memory read process
2602: step
2604: step
2606: step
2608: step
2614: step
2616: step
2618: step
2620: step
2700: execution process
2702: step
2704: step
2706: step
2708: step
2710: step
2712: step
2714: step
2176: step
2718: step
2720: step
2800: memory chip
2801a: memory bank
2803: refresh controller
2805: controller
2900: example refresh controller
2900': example refresh controller
2901: timer
2903: row counter
2905: valid bit
2907: adder
2909: data storage
2911: refresh gate
3000: process for partial refresh in a memory chip
3010: step
3020: step
3030: step
3100: process for determining refreshes for a memory chip
3110: step
3120: step
3130: step
3140: step
3200: process for determining refreshes for a memory chip
3210: step
3220: step
3230: step
3240: step
3250: step
3300: example refresh controller
3301: timer
3303: row counter
3305: adder
3307: data storage
3400: process for determining refreshes for a memory chip
3410: step
3420: step
3430: step
3440: step
3501: wafer
3503: die
3504: second region/memory chip/group
3505: region/memory chip
3506: memory chip
3506A: memory die/memory chip
3506B: memory die/memory chip
3506C: memory chip
3506D: memory chip
3507: substrate
3511A: memory bank
3511B: memory bank
3511C: memory bank
3511D: memory bank
3511E: memory bank
3511F: memory bank
3511G: memory bank
3511H: memory bank
3512: bus or connection
3515A: processor subunit
3515B: processor subunit
3515C: processor subunit
3515D: processor subunit
3515E: processor subunit
3515F: processor subunit
3515G: processor subunit
3515H: processor subunit/connection
3515I: processor subunit
3515J: processor subunit
3515K: processor subunit
3515L: processor subunit
3515M: processor subunit
3515N: processor subunit
3515O: processor subunit
3515P: processor subunit
3516A: connection
3516B: connection
3516C: connection
3516D: connection
3517: region/memory chip
3521: input/output (IO) controller/IO control module
3521A: IO controller
3521B: IO controller
3521: IO controller/IO control module
3522: IO controller/IO control module
3523: IO control module
3524: IO controller
3530: input/output bus/bus line
3530A: input/output bus
3530B: input/output bus
3531A: branch
3531B: branch
3532: memory chip
3533: line
3540: die
3542A: input/output controller
3542B: input/output controller
3546: die
3554: fuse/fuse element
3554A: fuse
3554B: fuse
3555: fuse/fuse element
3556: fuse/fuse element
3557: fuse/fuse element
3601: domain
3602: domain
3603: domain
3611: bus line
3612: bus line
3613: bus line
3711: glue logic/logic circuit/glue logic unit
3713: group
3715: group
3801: horizontal cut
3802: line/cut
3803: vertical cut
3804: line/cut
3806: line/cut
3811A: region
3811B: region
3811C: region
3820: bus line
3822: region
3824: bus line
3901: group of memory units
3905: connection
4000: example process for building a memory chip from a group of dies
4011: step
4015: step
4017: step
4100: example process for manufacturing a memory chip containing multiple dies
4101: process
4102: process
4111: step
4113: step
4115: step
4117: step
4119: step
4131: step
4133: step
4140: step
4200: example circuitry
4201: memory array
4203: row decoder
4205a: column multiplexer ("mux")
4205b: column multiplexer ("mux")
4300: example circuitry
4301: memory array
4303: row decoder
4305: column multiplexer
4400: example circuitry
4401: memory array
4403: row decoder
4405: column decoder (or multiplexer)
4500: method
4600: example circuitry
4601a: row decoder
4601b: row decoder
4603a: column multiplexer
4603b: column multiplexer
4607: row control
4609a: memory mat
4609b: memory mat
4611a: word line
4611b: word line
4613a: switching element
4613b: switching element
4615a: bit line
4615b: bit line
4700: process for providing dual-port access on a single-port memory array or mat
4710: step
4720: step
4730: step
4740: step
4750: process for providing dual-port access on a single-port memory array or mat
4760: step
4770: step
4780: step
4790: step
4800: example circuitry
4801a: row decoder
4801b: row decoder
4803a: column multiplexer
4803b: column multiplexer
4900: memory mat
5000: example integrated circuit
5001: memory cell
5008: memory cell
5011: memory read path
5018: memory read path
5020: output port
5021: bit
5028: bit
5030: reduction unit
5040: read circuitry
5050: array of memory cells
5100: memory bank
5101: memory bank
5102: memory unit
5111: array
5112: row decoder
5113: column multiplexer
5114: main I/O bus
5115: output bus
5116: processing-in-memory (PIM) logic
5117: bus
5118: PIM address bus/address column bus
5119: bus
5130: example method
5132: step
5134: step
5140: memory chip
5140(1): part of a memory bank
5140(2): part of a memory bank
5140(3): part of a memory bank
5140(4): part of a memory bank
5140(5): part of a memory bank
5140(6): part of a memory bank
5141: memory mat and associated logic
5142: memory mat and associated logic
5143: memory mat and associated logic
5144: memory mat and associated logic
5145: memory mat and associated logic
5146: memory mat and associated logic
5147: bus
5150(10): memory mat
5150(2): memory mat
5150(3): memory mat
5150(4): memory mat
5150(5): memory mat
5150(6): memory mat
5151(1): memory mat
5151(2): memory mat
5151(3): memory mat
5151(4): memory mat
5151(5): memory mat
5151(6): memory mat
5152(1): memory mat/global word line
5152(2): memory mat/global word line
5152(3): memory mat/global word line
5152(4): memory mat/global word line
5152(5): memory mat/global word line
5152(6): memory mat/global word line
5152(8): global word line
5153(1): delay or isolation circuit
5153(3): delay or isolation circuit
5154(1): delay or isolation circuit
5154(3): delay or isolation circuit
5155(1): flip-flop
5155(3): flip-flop
5156(1): flip-flop
5156(3): flip-flop
5157(1): switch
5157(3): switch
5157(8): switch
5158(1): switch
5158(3): switch
5158(8): switch
5159(1): inverter gate or buffer
5159(3): inverter gate or buffer
5159'(1): inverter gate or buffer
5159'(3): inverter gate or buffer
5160(1): row control unit
5160(2): unit
5160(3): unit
5170(1): row part enable signal
5170(2): row part enable signal
5180: global word line
5190: method for operating a memory unit
5192: step
5194: step
5200: tester
5201: switch
5202: section/full wafer
5210: chip (or wafer of chips)/integrated circuit/memory
5211: chip interface
5212: memory bank
5213: bus
5214: I/O controller
5215: logic unit/logic
5216: fuse interface
5217: bus
5218: test unit (TU)
5219: test pattern generator
5221: write test sequence/first step
5222: read back test results
5223: write expected result sequence/second step
5224: read faulty addresses to fix/third step
5225: program fuses/fourth step
5231: test results
5232: test code
5300: method for testing memory banks
5302: step
5310: step
5320: step
5350: method for testing memory banks of an integrated circuit
5352: step
5355: step
5358: step
7000: method for distributed processing
7001: method for distributed processing
7010: step
7020: step
7030: step
7040: step
7050: step
7011: memory/processing unit
7012: memory/processing unit
7013: memory/processing unit
7014: integrated circuit
7015: integrated circuit
7101: disaggregated system
7102: disaggregated system
7103: disaggregated system
7104: disaggregated system
7110: processing/memory subsystem
7120: computing subsystem
7120(1): computing unit PU(1)
7120(n): computing unit PU(n)
7120(N): computing unit PU(N)
7121(1): partial model update
7121(n): partial model update
7121(N): partial model update
7122: updated model
7130: storage subsystem
7140: switching subsystem
7150: accelerator subsystem
7200: integrated circuit/integrated chip
7210: memory array
7210_1: discrete memory bank
7210_2: discrete memory bank
7210_j: dedicated memory bank
7210_J1: discrete memory bank
7210_Jn: discrete memory bank
7220: processor subunit
7220_1: processor subunit
7220_2: processor subunit
7220_k: processor subunit
7220_k+1: processor subunit
7220_K: processor subunit
7230: communication port
7240: controller
7241: cyberattack detector
7242: response module
7243: access control rules
7244: program/model operation pattern
7245: tamper detector
7246: response module
7247: curve
7250: bus
7260: first bus
7261: second bus
7270: host computer
7280: host memory
7281: changeable data
7282: unchangeable data
7283: command
7450: method
7452: step
7454: step
7500: first distributed processor memory chip
7500': second distributed processor memory chip
7500": third distributed processor memory chip
7500''': distributed processor memory chip
7500A: distributed processor memory chip
7500B: distributed processor memory chip
7500C: distributed processor memory chip
7500D: distributed processor memory chip
7500E: distributed processor memory chip
7500F: distributed processor memory chip
7500G: distributed processor memory chip
7500H: distributed processor memory chip
7500I: distributed processor memory chip
7500A': distributed processor memory chip
7500B': distributed processor memory chip
7500C': distributed processor memory chip
7510: memory array
7510_1: dedicated memory bank
7510_2: dedicated memory bank
7510_3: dedicated memory bank
7510_4: dedicated memory bank
7510_5: dedicated memory bank
7510_6: dedicated memory bank
7510': memory array
7510": memory array
7520: processing array
7520_1: distributed processor subunit
7520_2: distributed processor subunit
7520_3: distributed processor subunit
7520_4: distributed processor subunit
7520_5: distributed processor subunit
7520_6: distributed processor subunit
7520_K: distributed processor subunit
7520': processing array
7520": processing array
7530: first communication port
7530': communication port
7530": communication port
7531: second communication port
7531': communication port
7531": communication port
7532: third communication port
7532': communication port
7532": communication port
7533: bus
7533': bus
7534: bus
7534': bus
7535: bus
7540: controller
7540': controller
7540": controller
7547: controller and interface module
7548_1: communication interface
7548_N: communication interface
7570: host communication port
7570': port
7572: chip port/communication port
7580: first bus
7580': second bus
7600: distributed processor memory chip
S7710: step
S7720: step
S7730: step
7800:用於偵測儲存於複數個記憶體組之一或多個特定位址中之零值的系統/記憶體單元 7800: System/memory unit for detecting zero values stored in one or more specific addresses of a plurality of memory banks
7810:記憶體晶片 7810: memory chip
7811A:記憶體組 7811A: Memory Bank
7811B:記憶體組 7811B: Memory Bank
7812:IO匯流排 7812: IO bus
7820:主機 7820: host
7830:零值偵測邏輯單元 7830: Zero detection logic unit
7830A:零值指示符線 7830A: Zero indicator line
7830B:零值指示符線 7830B: Zero indicator line
7831:內部零值指示符線 7831: Internal zero indicator line
7840:匯流排 7840: bus
7840A:匯流排 7840A: Bus
7840B:匯流排 7840B: bus
7841:匯流排 7841: bus
7841A:匯流排 7841A: Bus
7911:記憶體組 7911: Memory Bank
7912A:記憶體墊 7912A: Memory pad
7912B:記憶體墊 7912B: memory pad
7913A:記憶體墊控制器 7913A: Memory Pad Controller
7913B:記憶體墊控制器 7913B: Memory Pad Controller
7914A:零值偵測邏輯單元 7914A: Zero detection logic unit
7914B:零值偵測邏輯單元 7914B: Zero detection logic unit
7915A:區域感測放大器 7915A: Local sense amplifier
7915B:區域感測放大器 7915B: Local sense amplifier
7916:全域感測放大器 7916: Global sense amplifier
7931A:零值指示符線 7931A: Zero indicator line
7931B:零值指示符線 7931B: Zero indicator line
8000:偵測儲存於複數個記憶體組之特定位址中之零值的例示性方法 8000: An exemplary method for detecting zero values stored in specific addresses of multiple memory banks
8010:步驟 8010: steps
8020:步驟 8020: steps
8030:步驟 8030: steps
8100:系統/記憶體單元 8100: system/memory unit
8180:記憶體組 8180: memory bank
8180A:記憶體組 8180A: Memory bank
8180B:記憶體組 8180B: memory bank
8181:記憶體子組 8181: memory sub-bank
8183A:第一子組列控制器 8183A: first sub-bank row controller
8183B:第二子組列控制器 8183B: second sub-bank row controller
8191:組控制器 8191: bank controller
8192:當前及預測位址產生器 8192: Current and predicted address generator
8192A:計數器 8192A: Counter
8192B:當前位址產生器 8192B: current address generator
8192C:預測位址產生器 8192C: Predictive address generator
8193:快取記憶體 8193: Cache memory
8280:雙重控制記憶體組 8280: Dual control memory bank
8290:資料輸入(DIN) 8290: Data input (DIN)
8291:組控制器/列位址(ROW) 8291: Bank controller/row address (ROW)
8292:行位址(COLUMN) 8292: Column address (COLUMN)
8293:第一命令輸入(COMMAND_1) 8293: The first command input (COMMAND_1)
8294:第二命令輸入(COMMAND_2) 8294: Second command input (COMMAND_2)
8295:資料輸出(Dout) 8295: Data output (Dout)
8400:傳統電腦架構 8400: traditional computer architecture
8402:CPU 8402: CPU
8406:外部記憶體 8406: External memory
8500a:分散式處理器記憶體晶片 8500a: Distributed processor memory chip
8500b:分散式處理器記憶體晶片 8500b: Distributed processor memory chip
8500c:裝置 8500c: device
8502:基板 8502: substrate
8504:暫存器檔案 8504: Register file
8510a:處理群組 8510a: Processing group
8510b:處理群組 8510b: Processing group
8510c:處理群組 8510c: Processing group
8520:記憶體陣列 8520: memory array
8520a:專用記憶體組 8520a: dedicated memory bank
8520b:專用記憶體組 8520b: dedicated memory bank
8520c:專用記憶體組 8520c: dedicated memory bank
8522a:記憶體墊 8522a: Memory pad
8522b:記憶體墊 8522b: Memory pad
8522c:記憶體墊 8522c: memory pad
8524a:記憶體墊 8524a: Memory pad
8524b:記憶體墊 8524b: memory pad
8524c:記憶體墊 8524c: memory pad
8526a:記憶體墊 8526a: Memory pad
8526b:記憶體墊 8526b: Memory pad
8526c:記憶體墊 8526c: memory pad
8530:處理陣列 8530: Processing array
8530a:處理器子單元/加速器 8530a: processor subunit/accelerator
8530b:處理器子單元 8530b: processor subunit
8530c:處理器子單元 8530c: processor subunit
8532a:記憶體墊/暫存器檔案 8532a: Memory pad/register file
8532b:記憶體墊 8532b: memory pad
8532c:記憶體墊 8532c: memory pad
8534a:邏輯組件 8534a: logic component
8534b:邏輯組件 8534b: logic component
8534c:邏輯組件 8534c: logical component
8540a:匯流排 8540a: bus
8540b:匯流排 8540b: bus
8540c:匯流排 8540c: bus
8550a:匯流排 8550a: bus
8550b:匯流排 8550b: bus
8560:基板 8560: substrate
8570:第一記憶體組 8570: first memory bank
8572:第二記憶體組 8572: second memory bank
8580:處理單元 8580: processing unit
8582:暫存器檔案 8582: Register file
8584:處理器 8584: processor
8600:流程圖 8600: Flow chart
8602:步驟 8602: step
8604:步驟 8604: step
8606:步驟 8606: step
9010:記憶體/處理單元 9010: memory/processing unit
9011:記憶體/處理單元 9011: memory/processing unit
9012:記憶體/處理單元 9012: memory/processing unit
9013:記憶體/處理單元 9013: memory/processing unit
9014:記憶體/處理單元 9014: memory/processing unit
9015:記憶體/處理單元 9015: memory/processing unit
9018:主機 9018: host
9019:記憶體/處理單元 9019: memory/processing unit
9020:控制器/邏輯 9020: Controller/Logic
9021:內部匯流排 9021: internal bus
9022:匯流排 9022: Bus
9030:邏輯 9030: logic
9033:緩衝器 9033: Buffer
9039:狀態線 9039: Status line
9040:記憶體組 9040: memory bank
9050:向量處理器 9050: vector processor
9070:詞彙表 9070: Glossary
9071:擷取金鑰 9071: Retrieval key
9072:字/片語 9072: words/phrases
9073:向量 9073: Vector
9100:資料庫查詢 9100: database query
9101:篩選操作之最終結果 9101: Final result of the filtering operation
9102:部分回應 9102: partial response
9103:完整回應 9103: complete response
9210:儲存裝置 9210: storage device
9211:介面 9211: Interface
9220:記憶體及篩選系統 9220: Memory and filtering system
9220(k):資料庫區段 9220(k): database section
9222:記憶體單元條目 9222: Memory cell entry
9224:篩選單元 9224: Filtering unit
9224':相關性旗標 9224': relevance flag
9225:處理單元 9225: Processing Unit
9227:記憶體/處理單元 9227: memory/processing unit
9228:記憶體/處理系統 9228: Memory/Processing System
9229:仲裁器/記憶體及處理系統 9229: Arbiter/Memory and Processing System
9240:CPU 9240: CPU
9300:用於資料庫分析加速之方法 9300: Method for accelerating database analysis
9301:用於資料庫分析加速之方法 9301: Method for accelerating database analysis
9302:用於資料庫分析加速之方法 9302: Method for accelerating database analysis
9303:用於資料庫分析加速之方法 9303: Method for accelerating database analysis
9304:資料庫分析加速之方法 9304: Methods of accelerating database analysis
9305:資料庫分析加速之方法 9305: Methods of accelerating database analysis
9310:步驟 9310: steps
9314:步驟 9314: step
9315:步驟 9315: steps
9320:步驟 9320: steps
9324:步驟 9324: step
9325:步驟 9325: Step
9330:步驟 9330: steps
9331:步驟 9331: step
9332:步驟 9332: step
9333:步驟 9333: Step
9334:步驟 9334: step
9335:步驟 9335: Step
9340:步驟 9340: steps
9341:步驟 9341: Step
9342:步驟 9342: steps
9344:步驟 9344: step
9351:步驟 9351: step
9352:步驟 9352: step
9390:步驟 9390: steps
9391:步驟 9391: step
9400:用於嵌入之方法 9400: Method for embedding
9401:用於嵌入之方法 9401: Method for embedding
9402:用於嵌入之方法 9402: Method for embedding
9410:步驟 9410: steps
9420:步驟 9420: steps
9430:步驟 9430: steps
9431:步驟 9431: step
9440:步驟 9440: Step
9442:步驟 9442: step
10800:用於至少一個資訊串流之分散式處理的方法 10800: Method for distributed processing of at least one information stream
10810:步驟 10810: step
10820:步驟 10820: step
10830:步驟 10830: step
10840:步驟 10840: step
10850:步驟 10850: step
10900:系統 10900: System
10901:系統 10901: System
10902:系統 10902: system
10903:系統 10903: System
10908:DMA控制器 10908: DMA controller
10909:預處理器 10909: preprocessor
10910:記憶體/處理單元 10910: memory/processing unit
10910(1):記憶體/處理單元 10910(1): memory/processing unit
10910(N):記憶體/處理單元 10910(N): Memory/processing unit
10911(1,1):處理資源 10911(1,1): Processing resources
10911(1,2):處理資源 10911(1,2): Processing resources
10911(1,K):處理資源 10911(1,K): Processing resources
10912(1,1):記憶體資源 10912(1,1): Memory resource
10912(1,2):記憶體資源 10912(1,2): Memory resources
10912(1,J-1):記憶體資源 10912(1,J-1): Memory resource
10912(1,J):記憶體資源 10912(1,J): Memory resource
10915:鏈路 10915: link
10920:處理器 10920: processor
10931:鏈路 10931: link
10931(1):鏈路 10931(1): link
10931(N):鏈路 10931(N): link
10932:鏈路 10932: link
10932(1):鏈路 10932(1): link
10932(N):鏈路 10932(N): link
10933:鏈路 10933: link
11011:混合積體電路 11011: Hybrid integrated circuit
11011':混合積體電路 11011': Hybrid integrated circuit
11012:混合積體電路 11012: Hybrid integrated circuit
11012':導體 11012': Conductor
11013:混合積體電路 11013: Hybrid integrated circuit
11013':混合積體電路 11013': Hybrid integrated circuit
11014:混合積體電路/匯流排 11014: Hybrid integrated circuit/bus
11014':混合積體電路 11014': Hybrid integrated circuit
11015:混合積體電路/匯流排 11015: Hybrid integrated circuit/bus
11015':混合積體電路 11015': Hybrid integrated circuit
11016:混合積體電路/匯流排 11016: Hybrid integrated circuit/bus
11016':混合積體電路 11016': Hybrid integrated circuit
11017:封裝基板 11017: Package substrate
11017':混合積體電路 11017': Hybrid integrated circuit
11018:中介層 11018: Interposer
11018':混合積體電路 11018': Hybrid integrated circuit
11019:基礎晶粒 11019: base die
11019':記憶體處理單元/混合積體電路 11019': Memory processing unit/hybrid integrated circuit
11020:微凸塊 11020: Micro bump
11021:DRAM晶圓/DRAM晶粒 11021: DRAM wafer/DRAM die
11021':記憶體/處理單元 11021': memory/processing unit
11022:第二記憶體控制器 11022: Second memory controller
11022':導體 11022': Conductor
11023:WOW中間層 11023: WOW middle layer
11030:HBM DRAM堆疊/晶圓 11030: HBM DRAM stack/wafer
11031:第一記憶體控制器 11031: The first memory controller
11032:HDM DRAM記憶體晶片/HDM DRAM晶粒/第二記憶體控制器 11032: HDM DRAM memory chip/HDM DRAM die/second memory controller
11039:TSV 11039: TSV
11040:晶圓/HBM記憶體晶片堆疊 11040: Wafer/HBM memory chip stacking
11051:處理器 11051: processor
11052:L2快取記憶體 11052: L2 cache
11053:記憶體單元 11053: memory unit
11061:WOW接合部 11061: WOW joint
11062:第二晶片 11062: second chip
11100:用於記憶體密集型處理之方法 11100: Method for memory-intensive processing
11110:步驟 11110: steps
11120:步驟 11120: step
11130:步驟 11130: steps
11140:步驟 11140: steps
11150:電腦系統 11150: computer system
11200:用於資料庫加速之方法 11200: Method for database acceleration
11210:步驟 11210: steps
11220:步驟 11220: steps
11230:步驟 11230: steps
11240:步驟 11240: steps
11250:步驟 11250: steps
11260:步驟 11260: steps
11270:步驟 11270: steps
11271:步驟 11271: steps
11272:步驟 11272: steps
11300:用於操作資料庫加速積體電路之群組的方法 11300: Method for operating a group of database acceleration integrated circuits
11310:步驟 11310: steps
11311:步驟 11311: Step
11312:步驟 11312: steps
11314:步驟 11314: steps
11316:步驟 11316: steps
11320:步驟 11320: steps
11350:用於資料庫加速之方法 11350: Method for database acceleration
11352:步驟 11352: step
11354:步驟 11354: step
11355:步驟 11355: steps
11356:步驟 11356: step
11358:步驟 11358: step
11359:步驟 11359: step
11510:運算系統 11510: computing system
11511:管理器 11511: Manager
11512:運算節點 11512: computing node
11513:管理單元 11513: Management Unit
11520:用於資料庫加速之裝置 11520: Device used for database acceleration
11530:資料庫加速積體電路 11530: Database acceleration integrated circuit
11531:網路通信介面/單元 11531: network communication interface/unit
11531(1):網路通信介面之第一埠/乙太網路埠 11531(1): The first port of the network communication interface/Ethernet port
11531(2):網路通信介面之一或多個第二埠 11531(2): One or more second ports of network communication interface
11531(4):乙太網路埠 11531(4): Ethernet port
11531(5):串列擴展埠 11531(5): Serial expansion port
11531(9):PCIe埠 11531(9): PCIe port
11532:第一處理單元 11532: the first processing unit
11533:記憶體控制器 11533: Memory Controller
11534:大輸送量介面 11534: High-throughput interface
11535:資料庫加速單元 11535: Database acceleration unit
11536:互連件 11536: Interconnect
11537:密碼編譯引擎 11537: Cryptographic engine
11538:二階靜態隨機存取記憶體(L2 SRAM) 11538: Second-level static random access memory (L2 SRAM)
11540:SATA控制器 11540: SATA controller
11545:RDMA單元 11545: RDMA unit
11546:遠端RAM 11546: remote RAM
11547:乙太網路記憶體DIMM/資料庫加速子單元 11547: Ethernet memory DIMM/database accelerator subunit
11548:三階L3記憶體 11548: Level-3 (L3) memory
11549:DMA引擎 11549: DMA engine
11550:記憶體資源 11550: Memory resources
11551:記憶體處理積體電路 11551: Memory processing integrated circuit
11560:儲存系統 11560: storage system
11561:本端儲存單元 11561: Local storage unit
11563:非揮發性記憶體(NVM) 11563: Non-volatile memory (NVM)
11571:快取記憶體 11571: Cache memory
11572:獨立資料庫處理單元 11572: Independent database processing unit
11573:資料庫處理子單元 11573: database processing subunit
11574:DB加速器之可重組態陣列 11574: Reconfigurable array of DB accelerator
11575:共用記憶體單元 11575: Shared memory unit
11576:可組態鏈路或互連件 11576: Configurable link or interconnect
11580:刀鋒/群組 11580: Blade/Group
11590:交換器 11590: switch
11601:PCIe交換器 11601: PCIe switch
11611:交換系統 11611: exchange system
11612:儲存系統 11612: storage system
11613:運算系統 11613: computing system
11615:用於資料庫加速之一或多個裝置 11615: One or more devices used for database acceleration
11621:系統 11621: system
11622:系統 11622: system
11700:用於資料庫加速之方法 11700: Method for database acceleration
11710:步驟 11710: steps
11720:步驟 11720: steps
11730:步驟 11730: steps
11740:步驟 11740: steps
11750:步驟 11750: steps
11760:步驟 11760: steps
12010:資料庫 12010: database
12011:伺服器主機板 12011: Server motherboard
12012:CPU 12012: CPU
12013:記憶體單元 12013: memory unit
12014:資料庫加速器 12014: Database Accelerator
12020:資料庫 12020: database
12021:管理單元 12021: Management Unit
12022:DB加速器板 12022: DB accelerator board
12024:處理器 12024: processor
12026:記憶體/處理單元 12026: memory/processing unit
12031:RDMA引擎 12031: RDMA engine
12033:DDR控制器 12033: DDR controller
12034:DB查詢資料庫引擎 12034: DB query database engine
12040:混合系統 12040: Hybrid system
12042:處理器 12042: processor
12043:記憶體/處理單元(MPU) 12043: Memory/Processing Unit (MPU)
12044:緊密微控制器 12044: compact microcontroller
12049:快取記憶體 12049: Cache memory
12050:系統 12050: System
12051:交換器 12051: Switch
12052:AI加速伺服器 12052: AI acceleration server
12053:網路介面卡(NIC) 12053: Network Interface Card (NIC)
12054:中央處理單元(CPU) 12054: Central Processing Unit (CPU)
12055:伺服器主機板 12055: Server motherboard
12056:RAM 12056: RAM
12057:AI加速器 12057: AI accelerator
12060:系統 12060: System
12061:交換器 12061: Switch
12063:AI處理及網路連接單元 12063: AI processing and network connection unit
12064:伺服器主機板 12064: Server motherboard
A:矩陣 A: Matrix
A1:資料/分片 A1: Data/Shard
A15:資料/分片 A15: Data/Shard
B:矩陣 B: Matrix
B1:資料/分片 B1: data/shard
B2:資料/分片 B2: data/shard
B3:資料/分片 B3: data/shard
B4:資料/分片 B4: data/shard
B5:資料/分片 B5: data/shard
B6:資料/分片 B6: data/shard
B7:資料/分片 B7: data/shard
B8:資料/分片 B8: data/shard
B9:資料/分片 B9: data/shard
B10:資料/分片 B10: data/shard
B11:資料/分片 B11: data/shard
B12:資料/分片 B12: data/shard
B13:資料/分片 B13: data/shard
B14:資料/分片 B14: data/shard
B15:資料/分片 B15: data/shard
RD:列解碼器 RD: row decoder
COL:行解碼器 COL: column decoder
併入於本發明中且構成本發明之一部分的隨附圖式說明各種所揭示實施例。在圖式中: The accompanying drawings, which are incorporated in and constitute a part of the present invention, illustrate various disclosed embodiments. In the drawings:
圖1為中央處理單元(CPU)之圖解表示。 Figure 1 is a diagrammatic representation of a central processing unit (CPU).
圖2為圖形處理單元(GPU)之圖解表示。 Figure 2 is a diagrammatic representation of a graphics processing unit (GPU).
圖3A為符合所揭示實施例之例示性硬體晶片之實施例的圖解表示。 FIG. 3A is a diagrammatic representation of an embodiment of an exemplary hardware chip in accordance with the disclosed embodiment.
圖3B為符合所揭示實施例之例示性硬體晶片之另一實施例的圖解表示。 FIG. 3B is a diagrammatic representation of another embodiment of an exemplary hardware chip in accordance with the disclosed embodiment.
圖4為符合所揭示實施例之由例示性硬體晶片執行之一般命令的圖解表示。 Figure 4 is a diagrammatic representation of general commands executed by an exemplary hardware chip consistent with the disclosed embodiments.
圖5為符合所揭示實施例之由例示性硬體晶片執行之專門命令的圖解表示。 Figure 5 is a diagrammatic representation of a special command executed by an exemplary hardware chip consistent with the disclosed embodiment.
圖6為符合所揭示實施例之供用於例示性硬體晶片中之處理群組的圖解表示。 Figure 6 is a diagrammatic representation of processing groups for use in an exemplary hardware chip consistent with the disclosed embodiments.
圖7A為符合所揭示實施例之處理群組之矩形陣列的圖解表示。 Figure 7A is a diagrammatic representation of a rectangular array of processing groups consistent with the disclosed embodiment.
圖7B為符合所揭示實施例之處理群組之橢圓形陣列的圖解表示。 Figure 7B is a diagrammatic representation of an elliptical array of processing groups consistent with the disclosed embodiment.
圖7C為符合所揭示實施例之硬體晶片之一陣列的圖解表示。 Figure 7C is a diagrammatic representation of an array of hardware chips in accordance with the disclosed embodiments.
圖7D為符合所揭示實施例之硬體晶片之另一陣列的圖解表示。 Figure 7D is a diagrammatic representation of another array of hardware chips in accordance with the disclosed embodiments.
圖8為描繪符合所揭示實施例之用於編譯一系列指令以供在例示性硬體晶片上執行的例示性方法之流程圖。 FIG. 8 is a flowchart depicting an exemplary method for compiling a series of instructions for execution on an exemplary hardware chip consistent with the disclosed embodiments.
圖9為記憶體組之圖解表示。 Figure 9 is a diagrammatic representation of the memory bank.
圖10為記憶體組之圖解表示。 Figure 10 is a diagrammatic representation of the memory bank.
圖11為符合所揭示實施例之具有子組控制件的例示性記憶體組之一實施例的圖解表示。 FIG. 11 is a diagrammatic representation of an embodiment of an exemplary memory bank with sub-bank controls in accordance with the disclosed embodiments.
圖12為符合所揭示實施例之具有子組控制件的例示性記憶體組之另一實施例的圖解表示。 FIG. 12 is a diagrammatic representation of another embodiment of an exemplary memory bank with sub-bank controls in accordance with the disclosed embodiments.
圖13為符合所揭示實施例之例示性記憶體晶片的方塊圖。 FIG. 13 is a block diagram of an exemplary memory chip in accordance with the disclosed embodiment.
圖14為符合所揭示實施例之例示性冗餘邏輯區塊集合的方塊圖。 FIG. 14 is a block diagram of an exemplary set of redundant logic blocks in accordance with the disclosed embodiment.
圖15為符合所揭示實施例之例示性邏輯區塊的方塊圖。 FIG. 15 is a block diagram of an exemplary logic block in accordance with the disclosed embodiment.
圖16為符合所揭示實施例之與匯流排連接之例示性邏輯區塊的方塊圖。 FIG. 16 is a block diagram of an exemplary logic block connected to the bus in accordance with the disclosed embodiment.
圖17為符合所揭示實施例之串聯連接之例示性邏輯區塊的方塊圖。 FIG. 17 is a block diagram of an exemplary logic block connected in series in accordance with the disclosed embodiment.
圖18為符合所揭示實施例之成二維陣列連接之例示性邏輯區塊的方塊圖。 FIG. 18 is a block diagram of an exemplary logic block connected in a two-dimensional array in accordance with the disclosed embodiment.
圖19為符合所揭示實施例之成複雜連接之例示性邏輯區塊的方塊圖。 FIG. 19 is a block diagram of an exemplary logic block forming a complex connection in accordance with the disclosed embodiment.
圖20為說明符合所揭示實施例之冗餘區塊賦能處理程序的例示性流程圖。 FIG. 20 is an exemplary flowchart illustrating a redundant block enabling processing procedure in accordance with the disclosed embodiment.
圖21為說明符合所揭示實施例之位址指派處理程序的例示性流程圖。 FIG. 21 is an exemplary flowchart illustrating an address assignment processing procedure in accordance with the disclosed embodiment.
圖22提供符合所揭示實施例之例示性處理裝置的方塊圖。 Figure 22 provides a block diagram of an exemplary processing device in accordance with the disclosed embodiments.
圖23為符合所揭示實施例之例示性處理裝置的方塊圖。 Figure 23 is a block diagram of an exemplary processing device in accordance with the disclosed embodiments.
圖24包括符合所揭示實施例之例示性記憶體組態圖。 Figure 24 includes an exemplary memory configuration diagram consistent with the disclosed embodiments.
圖25為說明符合所揭示實施例之記憶體組態處理程序的例示性流程圖。 FIG. 25 is an exemplary flowchart illustrating a memory configuration processing procedure in accordance with the disclosed embodiment.
圖26為說明符合所揭示實施例之記憶體讀取處理程序的例示性流程圖。 FIG. 26 is an exemplary flowchart illustrating a memory read processing procedure in accordance with the disclosed embodiment.
圖27為說明符合所揭示實施例之處理程序執行的例示性流程圖。 FIG. 27 is an exemplary flowchart illustrating the execution of a processing program in accordance with the disclosed embodiment.
圖28展示符合本發明之具有再新控制器的實例記憶體晶片。 Figure 28 shows an example memory chip with a refresh controller in accordance with the present invention.
圖29A展示符合本發明之一實例再新控制器。 Figure 29A shows an example refresh controller in accordance with the present invention.
圖29B展示符合本發明之另一實例再新控制器。 Figure 29B shows another example refresh controller in accordance with the present invention.
圖30為符合本發明之由再新控制器執行之處理程序的實例流程圖。 Fig. 30 is an example flowchart of a process executed by a refresh controller in accordance with the present invention.
圖31為符合本發明之由編譯器實施之處理程序的一實例流程圖。 FIG. 31 is a flowchart of an example of a processing program implemented by a compiler according to the present invention.
圖32為符合本發明之由編譯器實施之處理程序的另一實例流程圖。 FIG. 32 is a flowchart of another example of a processing program implemented by a compiler according to the present invention.
圖33展示符合本發明之根據所儲存圖案組態的實例再新控制器。 Figure 33 shows an example refresh controller configured according to a stored pattern in accordance with the present invention.
圖34為符合本發明之由再新控制器內之軟體實施的處理程序的實例流程圖。 FIG. 34 is an example flowchart of a process implemented by software within a refresh controller in accordance with the present invention.
圖35A展示符合本發明之包括晶粒的實例晶圓。 Figure 35A shows an example wafer including dies in accordance with the present invention.
圖35B展示符合本發明之連接至輸入/輸出匯流排的實例記憶體晶片。 Figure 35B shows an example memory chip connected to the input/output bus in accordance with the present invention.
圖35C展示符合本發明之包括成列配置且連接至輸入輸出匯流排之記憶體晶片的實例晶圓。 Figure 35C shows an example wafer including memory chips arranged in rows and connected to the input and output bus in accordance with the present invention.
圖35D展示符合本發明之形成群組且連接至輸入輸出匯流排的兩個記憶體晶片。 FIG. 35D shows two memory chips that are grouped and connected to the input and output bus in accordance with the present invention.
圖35E展示符合本發明之實例晶圓,其包括以六邊形晶格置放且連接至輸入輸出匯流排之晶粒。 Figure 35E shows an example wafer in accordance with the present invention, which includes dies placed in a hexagonal lattice and connected to an input and output bus.
圖36A至圖36D展示符合本發明之連接至輸入/輸出匯流排之記憶體晶片的各種可能組態。 36A to 36D show various possible configurations of the memory chip connected to the input/output bus in accordance with the present invention.
圖37展示符合本發明之共用膠合邏輯(glue logic)之晶粒的實例分組。 Figure 37 shows an example grouping of dies that share glue logic in accordance with the present invention.
圖38A至圖38B展示符合本發明之穿過晶圓的實例切割。 Figures 38A to 38B show example cutting through the wafer in accordance with the present invention.
圖38C展示符合本發明之晶圓上之晶粒的實例配置及輸入輸出匯流排之配置。 FIG. 38C shows an example configuration of the die on a wafer and the configuration of the input/output bus in accordance with the present invention.
圖39展示符合本發明之具有互連處理器子單元之晶圓上的實例記憶體晶片。 Figure 39 shows an example memory chip on a wafer with interconnected processor subunits in accordance with the present invention.
圖40為符合本發明之自晶圓佈置記憶體晶片之群組的處理程序的一實例流程圖。 FIG. 40 is an example flow chart of the processing procedure for arranging groups of memory chips from the wafer in accordance with the present invention.
圖41A為符合本發明之自晶圓佈置記憶體晶片之群組的處理程序的另一實例流程圖。 FIG. 41A is a flowchart of another example of a processing procedure for arranging a group of memory chips from a wafer according to the present invention.
圖41B至圖41C為符合本發明之判定用於自晶圓切割記憶體晶片之一或多個群組的切割圖案之處理程序的實例流程圖。 41B to 41C are flow charts of an example of a processing procedure for determining a cutting pattern for cutting one or more groups of memory chips from a wafer according to the present invention.
圖42展示符合本發明的提供沿著行之雙埠存取的記憶體晶片內之電路系統的實例。 FIG. 42 shows an example of a circuit system in a memory chip that provides dual-port access along columns in accordance with the present invention.
圖43展示符合本發明的提供沿著列之雙埠存取的記憶體晶片內之電路系統的實例。 FIG. 43 shows an example of a circuit system in a memory chip that provides dual-port access along rows in accordance with the present invention.
圖44展示符合本發明的提供沿著列及行兩者之雙埠存取的記憶體晶片內之電路系統的實例。 FIG. 44 shows an example of a circuit system in a memory chip that provides dual-port access along both rows and columns in accordance with the present invention.
圖45A展示使用複製記憶體陣列或墊的雙讀取。 Figure 45A shows double read using a replicated memory array or pad.
圖45B展示使用複製記憶體陣列或墊的雙寫入。 Figure 45B shows double writing using a replicated memory array or pad.
圖46展示符合本發明之具有用於沿著列之雙埠存取的開關元件之記憶體晶片內之電路系統的實例。 FIG. 46 shows an example of a circuit system in a memory chip having switching elements for dual-port access along a row in accordance with the present invention.
圖47A為符合本發明之用於在單埠記憶體陣列或墊上提供雙埠存取之一處理程序的實例流程圖。 FIG. 47A is an example flowchart of a process for providing dual-port access on a single-port memory array or pad according to the present invention.
圖47B為符合本發明之用於在單埠記憶體陣列或墊上提供雙埠存取之另一處理程序的實例流程圖。 FIG. 47B is an example flowchart of another processing procedure for providing dual-port access on a single-port memory array or pad according to the present invention.
圖48展示符合本發明之提供沿著列及行兩者之雙埠存取的記憶體晶片內之電路系統的另一實例。 FIG. 48 shows another example of a circuit system in a memory chip that provides dual-port access along both rows and columns in accordance with the present invention.
圖49展示符合本發明之用於記憶體墊內的雙埠存取之開關元件的實例。 FIG. 49 shows an example of a switch element for dual-port access in a memory pad according to the present invention.
圖50說明符合本發明之具有經組態以存取部分字之縮減單元的實例積體電路。 Figure 50 illustrates an example integrated circuit with a reduction unit configured to access partial words in accordance with the present invention.
圖51說明用於使用如關於圖50所描述之縮減單元的記憶體組。 FIG. 51 illustrates a memory bank for using the reduction unit as described in relation to FIG. 50.
圖52說明符合本發明之使用整合至PIM邏輯中之縮減單元的記憶體組。 FIG. 52 illustrates a memory bank that uses a reduction unit integrated into the PIM logic in accordance with the present invention.
圖53說明符合本發明之使用PIM邏輯以啟動用於存取部分字之開關的記憶體組。 FIG. 53 illustrates a memory bank that uses PIM logic to activate switches for accessing partial words in accordance with the present invention.
圖54A說明符合本發明之具有用於不啟動以存取部分字之分段行多工器的記憶體組。 FIG. 54A illustrates a memory bank with a segmented column multiplexer that can be deactivated for accessing partial words in accordance with the present invention.
圖54B為符合本發明之用於記憶體中的部分字存取之處理程序的實例流程圖。 FIG. 54B is an example flowchart of a processing procedure for partial word access in memory according to the present invention.
圖55說明包括多個記憶體墊的現有記憶體晶片。 FIG. 55 illustrates a conventional memory chip including multiple memory pads.
圖56說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的一實例記憶體晶片。 Figure 56 illustrates an example memory chip with an activation circuit for reducing power consumption during line disconnection in accordance with the present invention.
圖57說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的另一實例記憶體晶片。 FIG. 57 illustrates another example memory chip with an activation circuit for reducing power consumption during line disconnection in accordance with the present invention.
圖58說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的又一實例記憶體晶片。 FIG. 58 illustrates yet another example memory chip with an activation circuit for reducing power consumption during line disconnection in accordance with the present invention.
圖59說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的額外實例記憶體晶片。 FIG. 59 illustrates an additional example memory chip with an activation circuit for reducing power consumption during line disconnection in accordance with the present invention.
圖60說明符合本發明之具有用於在線斷開期間減少功率消耗之全域字線及區域字線的一實例記憶體晶片。 FIG. 60 illustrates an example memory chip with global word lines and local word lines for reducing power consumption during line disconnection in accordance with the present invention.
圖61說明符合本發明之具有用於在線斷開期間減少功率消耗之全域字線及區域字線的另一實例記憶體晶片。 Fig. 61 illustrates another example memory chip with global word lines and local word lines for reducing power consumption during line disconnection in accordance with the present invention.
圖62為符合本發明之用於依序斷開記憶體中的線之處理程序的實例流程圖。 FIG. 62 is an example flow chart of the processing procedure for sequentially disconnecting the lines in the memory according to the present invention.
圖63說明用於記憶體晶片之一現有測試器。 Figure 63 illustrates an existing tester used for memory chips.
圖64說明用於記憶體晶片之另一現有測試器。 Figure 64 illustrates another conventional tester used for memory chips.
圖65說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的一實例。 FIG. 65 illustrates an example of testing a memory chip using logic units on the same substrate as the memory in accordance with the present invention.
圖66說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的另一實例。 FIG. 66 illustrates another example of testing a memory chip using logic units on the same substrate as the memory in accordance with the present invention.
圖67說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的又一實例。 FIG. 67 illustrates yet another example of testing a memory chip using logic units on the same substrate as the memory in accordance with the present invention.
圖68說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的額外實例。 FIG. 68 illustrates an additional example of testing a memory chip using logic units on the same substrate as the memory in accordance with the present invention.
圖69說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的另一實例。 FIG. 69 illustrates another example of testing a memory chip using logic units on the same substrate as the memory in accordance with the present invention.
圖70為符合本發明之用於測試記憶體晶片之一處理程序的實例流程圖。 FIG. 70 is an example flow chart of a processing procedure for testing memory chips in accordance with the present invention.
圖71為符合本發明之用於測試記憶體晶片之另一處理程序的實例流程圖。 FIG. 71 is an example flowchart of another processing procedure for testing memory chips in accordance with the present invention.
圖72A為符合本發明之實施例的包括記憶體陣列及處理陣列之積體電路的圖解表示。 FIG. 72A is a diagrammatic representation of an integrated circuit including a memory array and a processing array in accordance with an embodiment of the present invention.
圖72B為符合本發明之實施例的積體電路內部之記憶體區的圖解表示。 FIG. 72B is a diagrammatic representation of a memory area inside an integrated circuit according to an embodiment of the present invention.
圖73A為符合本發明之實施例的具有控制器之實例組態的積體電路之圖解表示。 Figure 73A is a diagrammatic representation of an integrated circuit with an example configuration of a controller in accordance with an embodiment of the present invention.
圖73B為符合本發明之實施例的用於同時執行複製模型的組態之圖解表示。 Figure 73B is a diagrammatic representation of a configuration for simultaneously executing duplicate models in accordance with an embodiment of the present invention.
圖74A為符合本發明之實施例的具有控制器之另一實例組態的積體電路之圖解表示。 Figure 74A is a diagrammatic representation of an integrated circuit with another example configuration of a controller in accordance with an embodiment of the present invention.
圖74B為根據例示性所揭示實施例之保護積體電路的方法之流程圖表示。 FIG. 74B is a flowchart representation of a method of protecting an integrated circuit according to an exemplary disclosed embodiment.
圖74C為根據例示性所揭示實施例之位於晶片內之各個點處的偵測元件之圖解表示。 FIG. 74C is a diagrammatic representation of detection elements located at various points within a chip according to an exemplary disclosed embodiment.
圖75A為符合本發明之實施例的包括複數個分散式處理器記憶體晶片之可擴展處理器記憶體系統的圖解表示。 FIG. 75A is a diagrammatic representation of an expandable processor memory system including a plurality of distributed processor memory chips in accordance with an embodiment of the present invention.
圖75B為符合本發明之實施例的包括複數個分散式處理器記憶體晶片之可擴展處理器記憶體系統的圖解表示。 FIG. 75B is a diagrammatic representation of an expandable processor memory system including a plurality of distributed processor memory chips in accordance with an embodiment of the present invention.
圖75C為符合本發明之實施例的包括複數個分散式處理器記憶體晶片之可擴展處理器記憶體系統的圖解表示。 FIG. 75C is a diagrammatic representation of an expandable processor memory system including a plurality of distributed processor memory chips in accordance with an embodiment of the present invention.
圖75D為符合本發明之實施例的雙埠分散式處理器記憶體晶片之圖解表示。 FIG. 75D is a diagrammatic representation of a dual-port distributed processor memory chip in accordance with an embodiment of the present invention.
圖75E為符合本發明之實施例的實例時序圖。 FIG. 75E is an example timing diagram according to an embodiment of the present invention.
圖76為符合本發明之實施例的具有整合式控制器及介面模組且構成可擴展處理器記憶體系統之處理器記憶體晶片的圖解表示。 FIG. 76 is a diagrammatic representation of a processor memory chip that has an integrated controller and interface module and forms an expandable processor memory system, in accordance with an embodiment of the present invention.
圖77為符合本發明之實施例的用於在圖75A中所展示之可擴展處理器記憶體系統中的處理器記憶體晶片之間傳送資料的流程圖。 FIG. 77 is a flowchart for transferring data between processor memory chips in the expandable processor memory system shown in FIG. 75A according to an embodiment of the present invention.
圖78A說明符合本發明之實施例的用於在晶片層級偵測儲存於實施於記憶體晶片中之複數個記憶體組的一或多個特定位址中之零值的系統。 FIG. 78A illustrates a system for detecting zero values stored in one or more specific addresses of a plurality of memory banks implemented in a memory chip at the chip level in accordance with an embodiment of the present invention.
圖78B說明符合本發明之實施例的用於在記憶體組層級偵測儲存於複數個記憶體組之特定位址中的一或多者中之零值的記憶體晶片。 FIG. 78B illustrates a memory chip for detecting, at the memory bank level, zero values stored in one or more specific addresses of a plurality of memory banks, in accordance with an embodiment of the present invention.
圖79說明符合本發明之實施例的用於在記憶體墊層級偵測儲存於複數個記憶體墊之特定位址中的一或多者中之零值的記憶體組。 FIG. 79 illustrates a memory bank for detecting, at the memory pad level, zero values stored in one or more specific addresses of a plurality of memory pads, in accordance with an embodiment of the present invention.
圖80為說明符合本發明之實施例的偵測複數個離散記憶體組之特定位址中之零值的例示性方法之流程圖。 FIG. 80 is a flowchart illustrating an exemplary method for detecting zero values in specific addresses of a plurality of discrete memory banks in accordance with an embodiment of the present invention.
圖81A說明符合本發明之實施例的用於基於下一列預測啟動與記憶體組相關聯之下一列的系統。 FIG. 81A illustrates a system for activating the next row associated with a memory bank based on the prediction of the next row in accordance with an embodiment of the present invention.
圖81B說明符合本發明之實施例的圖81A之系統的另一實施例。 FIG. 81B illustrates another embodiment of the system of FIG. 81A in accordance with an embodiment of the present invention.
圖81C說明符合本發明之實施例的每一記憶體子組之第一及第二子組列控制器。 FIG. 81C illustrates the first and second sub-group row controllers of each memory sub-group in accordance with an embodiment of the present invention.
圖81D說明符合本發明之實施例的下一列預測之實施例。 FIG. 81D illustrates an embodiment of next row prediction in accordance with an embodiment of the present invention.
圖81E說明符合本發明之實施例的記憶體組之實施例。 FIG. 81E illustrates an embodiment of a memory bank in accordance with an embodiment of the present invention.
圖81F說明符合本發明之實施例的記憶體組之另一實施例。 FIG. 81F illustrates another embodiment of the memory bank in accordance with the embodiment of the present invention.
圖82說明符合本發明之實施例的用於減少記憶體列啟動懲罰之雙重控制記憶體組。 FIG. 82 illustrates a dual-control memory bank for reducing memory row activation penalty in accordance with an embodiment of the present invention.
圖83A說明存取及啟動記憶體組之列的第一實例。 Figure 83A illustrates the first example of accessing and activating a row of memory banks.
圖83B說明存取及啟動記憶體組之列的第二實例。 FIG. 83B illustrates a second example of accessing and activating a row of memory banks.
圖83C說明存取及啟動記憶體組之列的第三實例。 FIG. 83C illustrates a third example of accessing and activating a row of memory banks.
圖84提供習知CPU/暫存器檔案及外部記憶體架構之圖解表示。 Figure 84 provides a graphical representation of the conventional CPU/register file and external memory architecture.
圖85A說明符合一個實施例之具有充當暫存器檔案之記憶體墊的例示性分散式處理器記憶體晶片。 Figure 85A illustrates an exemplary distributed processor memory chip with memory pads acting as register files in accordance with one embodiment.
圖85B說明符合另一實施例之具有經組態以充當暫存器檔案之記憶體墊的例示性分散式處理器記憶體晶片。 FIG. 85B illustrates an exemplary distributed processor memory chip with a memory pad configured to act as a register file in accordance with another embodiment.
圖85C說明符合另一實施例之具有充當暫存器檔案之記憶體墊的例示性裝置。 FIG. 85C illustrates an exemplary device with a memory pad acting as a register file in accordance with another embodiment.
圖86提供表示符合所揭示實施例之用於在分散式處理器記憶體晶片中執行至少一個指令的例示性方法之流程圖。 FIG. 86 provides a flowchart representing an exemplary method for executing at least one instruction in a distributed processor memory chip consistent with the disclosed embodiments.
圖87A包括分解式伺服器之實例; FIG. 87A includes an example of a disaggregated server;
圖87B為分散式處理之實例; Figure 87B is an example of distributed processing;
圖87C為記憶體/處理單元之實例; Figure 87C is an example of a memory/processing unit;
圖87D為記憶體/處理單元之實例; Figure 87D is an example of a memory/processing unit;
圖87E為記憶體/處理單元之實例; Figure 87E is an example of a memory/processing unit;
圖87F為包括記憶體/處理單元及一或多個通信模組之積體電路的實例; FIG. 87F is an example of an integrated circuit including a memory/processing unit and one or more communication modules;
圖87G為包括記憶體/處理單元及一或多個通信模組之積體電路的實例; FIG. 87G is an example of an integrated circuit including a memory/processing unit and one or more communication modules;
圖87H為方法之實例; Figure 87H is an example of the method;
圖87I為方法之實例; Figure 87I is an example of the method;
圖88A為方法之實例; Figure 88A is an example of the method;
圖88B為方法之實例; Figure 88B is an example of the method;
圖88C為方法之實例; Figure 88C is an example of the method;
圖89A為記憶體/處理單元及詞彙表之實例; Figure 89A is an example of memory/processing unit and vocabulary;
圖89B為記憶體/處理單元之實例; Figure 89B is an example of a memory/processing unit;
圖89C為記憶體/處理單元之實例; Figure 89C is an example of a memory/processing unit;
圖89D為記憶體/處理單元之實例; Figure 89D is an example of a memory/processing unit;
圖89E為記憶體/處理單元之實例; Figure 89E is an example of a memory/processing unit;
圖89F為記憶體/處理單元之實例; Figure 89F is an example of a memory/processing unit;
圖89G為記憶體/處理單元之實例; Figure 89G is an example of a memory/processing unit;
圖89H為記憶體/處理單元之實例; Figure 89H is an example of a memory/processing unit;
圖90A為系統之實例; Figure 90A is an example of the system;
圖90B為系統之實例; Figure 90B is an example of the system;
圖90C為系統之實例; Figure 90C is an example of the system;
圖90D為系統之實例; Figure 90D is an example of the system;
圖90E為系統之實例; Figure 90E is an example of the system;
圖90F為方法之實例; Figure 90F is an example of the method;
圖91A為記憶體及篩選系統、儲存裝置以及CPU之實例; FIG. 91A is an example of a memory and filtering system, a storage device, and a CPU;
圖91B為記憶體及處理系統、儲存裝置以及CPU之實例; Figure 91B is an example of memory and processing system, storage device and CPU;
圖92A為記憶體及處理系統、儲存裝置以及CPU之實例; Figure 92A is an example of a memory and processing system, storage device and CPU;
圖92B為記憶體/處理單元之實例; Figure 92B is an example of a memory/processing unit;
圖92C為記憶體及篩選系統、儲存裝置以及CPU之實例; FIG. 92C is an example of a memory and filtering system, a storage device, and a CPU;
圖92D為記憶體及處理系統、儲存裝置以及CPU之實例; Figure 92D is an example of memory and processing system, storage device and CPU;
圖92E為記憶體及處理系統、儲存裝置以及CPU之實例; Figure 92E is an example of memory and processing system, storage device and CPU;
圖92F為方法之實例; Figure 92F is an example of the method;
圖92G為方法之實例; Figure 92G is an example of the method;
圖92H為方法之實例; Figure 92H is an example of the method;
圖92I為方法之實例; Figure 92I is an example of the method;
圖92J為方法之實例; Figure 92J is an example of the method;
圖92K為方法之實例; Figure 92K is an example of the method;
圖93A為混合積體電路之實例的橫截面圖; Fig. 93A is a cross-sectional view of an example of a hybrid integrated circuit;
圖93B為混合積體電路之實例的橫截面圖; Figure 93B is a cross-sectional view of an example of a hybrid integrated circuit;
圖93C為混合積體電路之實例的橫截面圖; Figure 93C is a cross-sectional view of an example of a hybrid integrated circuit;
圖93D為混合積體電路之實例的橫截面圖; Figure 93D is a cross-sectional view of an example of a hybrid integrated circuit;
圖93E為混合積體電路之實例的俯視圖; Fig. 93E is a top view of an example of a hybrid integrated circuit;
圖93F為混合積體電路之實例的俯視圖; FIG. 93F is a top view of an example of a hybrid integrated circuit;
圖93G為混合積體電路之實例的俯視圖; Figure 93G is a top view of an example of a hybrid integrated circuit;
圖93H為混合積體電路之實例的橫截面圖; Figure 93H is a cross-sectional view of an example of a hybrid integrated circuit;
圖93I為混合積體電路之實例的橫截面圖; Figure 93I is a cross-sectional view of an example of a hybrid integrated circuit;
圖93J為方法之實例; Figure 93J is an example of the method;
圖94A為儲存系統、一或多個裝置及運算系統之實例; Fig. 94A is an example of a storage system, one or more devices, and a computing system;
圖94B為儲存系統、一或多個裝置及運算系統之實例; FIG. 94B is an example of a storage system, one or more devices, and a computing system;
圖94C為一或多個裝置及運算系統之實例; FIG. 94C is an example of one or more devices and computing systems;
圖94D為一或多個裝置及運算系統之實例; Fig. 94D is an example of one or more devices and computing systems;
圖94E為資料庫加速積體電路之實例; FIG. 94E is an example of a database acceleration integrated circuit;
圖94F為資料庫加速積體電路之實例; FIG. 94F is an example of a database acceleration integrated circuit;
圖94G為資料庫加速積體電路之實例; FIG. 94G is an example of a database acceleration integrated circuit;
圖94H為資料庫加速單元之實例; FIG. 94H is an example of a database acceleration unit;
圖94I為刀鋒以及資料庫加速積體電路之群組的實例; FIG. 94I is an example of a blade and a group of database acceleration integrated circuits;
圖94J為資料庫加速積體電路之群組的實例; FIG. 94J is an example of a group of database acceleration integrated circuits;
圖94K為資料庫加速積體電路之群組的實例; FIG. 94K is an example of a group of database acceleration integrated circuits;
圖94L為資料庫加速積體電路之群組的實例; FIG. 94L is an example of a group of database acceleration integrated circuits;
圖94M為資料庫加速積體電路之群組的實例; FIG. 94M is an example of a group of database acceleration integrated circuits;
圖94N為系統之實例; Figure 94N is an example of the system;
圖94O為系統之實例; Figure 94O is an example of the system;
圖94P為方法之實例; Figure 94P is an example of the method;
圖95A為方法之實例; Figure 95A is an example of the method;
圖95B為方法之實例; Figure 95B is an example of the method;
圖95C為方法之實例; Figure 95C is an example of the method;
圖96A為先前技術系統之實例; Figure 96A is an example of a prior art system;
圖96B為系統之實例; Figure 96B is an example of the system;
圖96C為資料庫加速器板之實例; Figure 96C is an example of a database accelerator board;
圖96D為系統之一部分的實例; Figure 96D is an example of a part of the system;
圖97A為先前技術系統之實例; Figure 97A is an example of a prior art system;
圖97B為系統之實例;及 Figure 97B is an example of the system; and
圖97C為AI網路介面卡之實例。 Figure 97C shows an example of an AI network interface card.
以下詳細描述參考隨附圖式。在任何方便之處,在圖式及以下描述中使用相同參考編號來指相同或類似部分。雖然本文中描述了若干說明性實施例,但修改、調適及其他實施方案為可能的。舉例而言,可對圖式中所說明之組件進行替代、添加或修改,且可藉由替代、重排序、移除步驟或添加步驟至所揭示方法來修改本文中所描述之說明性方法。因此,以下詳細描述不限於所揭示實施例及實例。實情為,適當範圍由隨附申請專利範圍界定。 The following detailed description refers to the accompanying drawings. Wherever convenient, the same reference numbers are used in the drawings and the following description to refer to the same or similar parts. Although several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the components illustrated in the drawings, and the illustrative methods described herein may be modified by substituting, reordering, removing, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.
處理器架構 Processor architecture
如貫穿本發明所使用,術語「硬體晶片」係指半導體晶圓(諸如,矽或其類似者),其上形成有一或多個電路元件(諸如,電晶體、電容器、電阻器及/或其類似者)。該等電路元件可形成處理元件或記憶體元件。「處理元件」係指一起執行至少一個邏輯功能(諸如,算術函數、邏輯閘、其他布林運算或其類似者)之一或多個電路元件。處理元件可為通用處理元件(諸如,可組態的複數個電晶體)或專用處理元件(諸如,經設計以執行特定邏輯功能之一特定邏輯閘或複數個電路元件)。「記憶體元件」係指可用以儲存資料之一或多個電路元件。「記憶體元件」亦可被稱作「記憶體胞元」。記憶體元件可為動態(使得需要電再新以維持資料儲存)、靜態(使得資料在失去電力之後持續存在至少一段時間)或非揮發性之記憶體。 As used throughout the present invention, the term "hardware chip" refers to a semiconductor wafer (such as silicon or the like) on which one or more circuit elements (such as transistors, capacitors, resistors, and/or the like) are formed. The circuit elements may form processing elements or memory elements. A "processing element" refers to one or more circuit elements that together perform at least one logic function (such as an arithmetic function, a logic gate, another Boolean operation, or the like). A processing element may be a general-purpose processing element (such as a configurable plurality of transistors) or a special-purpose processing element (such as a particular logic gate or a plurality of circuit elements designed to perform a particular logic function). A "memory element" refers to one or more circuit elements that can be used to store data. A "memory element" may also be referred to as a "memory cell." A memory element may be dynamic (such that electrical refreshes are required to maintain the stored data), static (such that data persists for at least some time after power loss), or non-volatile memory.
處理元件可接合以形成處理器子單元。「處理器子單元」因此可包含可執行至少一個任務或指令(例如,屬於處理器指令集)的處理元件之最小分組。舉例而言,子單元可包含經組態以一起執行指令之一或多個通用處理元件、與經組態成以互補方式執行指令之一或多個專用處理元件配對的一或多個通用處理元件,或其類似者。該等處理器子單元可成陣列配置於基板(例如,晶圓)上。儘管「陣列」可包含矩形形狀,但陣列中之子單元的任何配置可形成於基板上。 Processing elements may be joined to form processor subunits. A "processor subunit" may thus comprise the smallest grouping of processing elements that can execute at least one task or instruction (e.g., of a processor instruction set). For example, a subunit may comprise one or more general-purpose processing elements configured to execute instructions together, one or more general-purpose processing elements paired with one or more special-purpose processing elements configured to execute instructions in a complementary fashion, or the like. The processor subunits may be arranged on a substrate (e.g., a wafer) in an array. Although the "array" may comprise a rectangular shape, any arrangement of the subunits in the array may be formed on the substrate.
記憶體元件可接合以形成記憶體組。舉例而言,記憶體組可包含沿著至少一條導線(或其他導電連接件)鏈接之記憶體元件的一或多排。此外,記憶體元件可在另一方向上沿著至少一條添加導線鏈接。舉例而言,記憶體元件可沿著字線及位元線配置,如下文所解釋。儘管記憶體組可包含多個排,但組中之元件的任何配置可用以在基板上形成組。此外,一或多個組可電接合至至少一個記憶體控制器以形成記憶體陣列。儘管記憶體陣列可包含組之矩形配置,但陣列中之組的任何配置可形成於基板上。 The memory components can be joined to form a memory bank. For example, the memory bank may include one or more rows of memory elements linked along at least one wire (or other conductive connection). In addition, the memory device can be linked along at least one additional wire in another direction. For example, memory devices can be arranged along word lines and bit lines, as explained below. Although the memory bank can include multiple rows, any configuration of the elements in the group can be used to form the group on the substrate. In addition, one or more groups can be electrically coupled to at least one memory controller to form a memory array. Although the memory array can include rectangular configurations of groups, any configuration of the groups in the array can be formed on the substrate.
如貫穿本發明進一步所使用,「匯流排」係指基板之元件之間的任何通信連接件。舉例而言,導線或線(形成電連接件)、光纖(形成光學連接件)或進行組件之間的通信之任何其他連接件可被稱作「匯流排」。 As further used throughout the present invention, a "bus" refers to any communicative connection between elements of a substrate. For example, a wire or a line (forming an electrical connection), an optical fiber (forming an optical connection), or any other connection carrying communications between components may be referred to as a "bus."
習知處理器使通用邏輯電路與共用記憶體配對。共用記憶體可儲存供邏輯電路執行之指令集以及用於指令集之執行且由指令集之執行產生的資料兩者。如下文所描述,一些習知處理器使用快取系統來減少執行自共用記憶體取得時的延遲;然而,習知快取系統保持共用。習知處理器包括中央處理單元(CPU)、圖形處理單元(GPU)、各種特殊應用積體電路(ASIC)或其類似者。圖1展示CPU之實例,且圖2展示GPU之實例。 Conventional processors pair general-purpose logic circuits with shared memory. The shared memory can store both the instruction set for the execution of the logic circuit and the data used for the execution of the instruction set and generated by the execution of the instruction set. As described below, some conventional processors use a cache system to reduce the delay in executing fetches from shared memory; however, the conventional cache system remains shared. The conventional processor includes a central processing unit (CPU), a graphics processing unit (GPU), various application-specific integrated circuits (ASIC), or the like. Figure 1 shows an example of a CPU, and Figure 2 shows an example of a GPU.
如圖1中所展示,CPU 100可包含處理單元110,該處理單元包括一或多個處理器子單元,諸如處理器子單元120a及處理器子單元120b。儘管圖1中未描繪,但每一處理器子單元可包含複數個處理元件。此外,處理單元110可包括一或多個層級之晶載快取記憶體。此類快取記憶體元件通常與處理單元110形成於同一半導體晶粒上,而非經由形成於基板中之一或多個匯流排連接至處理器子單元120a及120b,該基板含有處理器子單元120a及120b以及快取記憶體元件。對於習知處理器中之一階(L1)及二階(L2)快取記憶體,直接在同一晶粒上而非經由匯流排連接之配置為常用的。替代地,在早期處理器中,使用子單元與L2快取記憶體之間的背側匯流排而在處理器子單元間共用L2快取記憶體。背側匯流排通常大於下文所描述之前側匯流排。因此,因為快取記憶體待供晶粒上之所有處理器子單元共用,所以快取記憶體130可與處理器子單元120a及120b在同一晶粒上形成或經由一或多個背側匯流排以通信方式耦接至處理器子單元120a及120b。在不具有匯流排(例如,快取記憶體直接形成於晶粒上)之實施例以及使用背側匯流排之實施例兩者中,快取記憶體在CPU之處理器子單元之間共用。
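As shown in FIG. 1, CPU 100 may comprise a processing unit 110 that includes one or more processor subunits, such as processor subunit 120a and processor subunit 120b. Although not depicted in FIG. 1, each processor subunit may comprise a plurality of processing elements. Moreover, processing unit 110 may include one or more levels of on-chip cache. Such cache elements are generally formed on the same semiconductor die as processing unit 110 rather than being connected to processor subunits 120a and 120b via one or more buses formed in the substrate containing processor subunits 120a and 120b and the cache elements. An arrangement directly on the same die, rather than a connection via buses, is common for both first-level (L1) and second-level (L2) caches in conventional processors. Alternatively, in earlier processors, L2 caches were shared among processor subunits using back-side buses between the subunits and the L2 caches. Back-side buses are generally larger than the front-side buses described below. Accordingly, because the cache is to be shared by all processor subunits on the die, cache 130 may be formed on the same die as processor subunits 120a and 120b or communicatively coupled to processor subunits 120a and 120b via one or more back-side buses. In both embodiments without buses (e.g., where the cache is formed directly on the die) and embodiments using back-side buses, the caches are shared between the processor subunits of the CPU.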
此外,處理單元110與共用記憶體140a及記憶體140b通信。舉
例而言,記憶體140a及140b可表示共用動態隨機存取記憶體(DRAM)之記憶體組。儘管描繪為具有兩個組,但大部分習知記憶體晶片包括介於八個與十六個之間的記憶體組。因此,處理器子單元120a及120b可使用共用記憶體140a及140b儲存資料,該資料接著由處理器子單元120a及120b進行操作。然而,此配置導致在處理單元110之時脈速度超過匯流排之資料傳送速度時,記憶體140a及140b與處理單元110之間的匯流排成為瓶頸。對於習知處理器,通常係如此情況,從而導致低於基於時脈速率及電晶體數目之規定處理速度的有效處理速度。
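In addition, processing unit 110 communicates with shared memory 140a and memory 140b. For example, memories 140a and 140b may represent memory banks of shared dynamic random access memory (DRAM). Although depicted with two banks, most conventional memory chips include between eight and sixteen memory banks. Accordingly, processor subunits 120a and 120b may use shared memories 140a and 140b to store data that is then operated upon by processor subunits 120a and 120b. This arrangement, however, causes the buses between memories 140a and 140b and processing unit 110 to become a bottleneck when the clock speed of processing unit 110 exceeds the data transfer speed of the buses. This is generally the case for conventional processors, resulting in effective processing speeds lower than the stated processing speeds based on clock rate and number of transistors.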
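The bottleneck described above can be illustrated with a small calculation. This is a minimal sketch, not part of the disclosure: the function name, its parameters, and the numbers in the usage note are hypothetical, and it assumes every operation must pull a fixed number of bytes across the shared bus.

```python
def effective_ops_per_second(clock_hz, ops_per_cycle,
                             bus_bytes_per_second, bytes_per_op):
    """Return the lower of the compute-bound and bus-bound operation rates.

    All parameters are illustrative assumptions:
      clock_hz             -- processing-unit clock rate
      ops_per_cycle        -- operations the unit can issue per cycle
      bus_bytes_per_second -- sustained transfer rate of the shared-memory bus
      bytes_per_op         -- bytes that must cross the bus per operation
    """
    stated_rate = clock_hz * ops_per_cycle            # rate implied by the clock
    bus_limited_rate = bus_bytes_per_second / bytes_per_op
    return min(stated_rate, bus_limited_rate)
```

With a hypothetical 3 GHz unit issuing 4 operations per cycle, a 25.6 GB/s bus, and 8 bytes fetched per operation, the stated rate is 12e9 operations per second but the function returns 3.2e9, since the bus rather than the clock sets the limit.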
如圖2中所展示,GPU中亦存在類似缺陷。GPU 200可包含處理單元210,該處理單元包括一或多個處理器子單元(例如,子單元220a、220b、220c、220d、220e、220f、220g、220h、220i、220j、220k、220l、220m、220n、220o及220p)。此外,處理單元210可包括一或多個層級之晶載快取記憶體及/或暫存器檔案。此類快取記憶體元件通常與處理單元210形成於同一半導體晶粒上。實際上,在圖2之實例中,快取記憶體210與處理單元210形成於同一晶粒上且在所有處理器子單元間共用,而快取記憶體230a、230b、230c及230d分別形成於處理器子單元之子集上且專用於該等處理器子單元。
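As shown in FIG. 2, similar deficiencies exist in GPUs. GPU 200 may comprise a processing unit 210 that includes one or more processor subunits (e.g., subunits 220a, 220b, 220c, 220d, 220e, 220f, 220g, 220h, 220i, 220j, 220k, 220l, 220m, 220n, 220o, and 220p). Moreover, processing unit 210 may include one or more levels of on-chip cache and/or register files. Such cache elements are generally formed on the same semiconductor die as processing unit 210. Indeed, in the example of FIG. 2, cache 210 is formed on the same die as processing unit 210 and shared among all of the processor subunits, while caches 230a, 230b, 230c, and 230d are formed on respective subsets of the processor subunits and dedicated to those processor subunits.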
此外,處理單元210與共用記憶體250a、250b、250c及250d通信。舉例而言,記憶體250a、250b、250c及250d可表示共用DRAM之記憶體組。因此,處理單元210之處理器子單元可使用共用記憶體250a、250b、250c及250d儲存資料,該資料接著由該等處理器子單元進行操作。然而,此配置導致記憶體250a、250b、250c及250d與處理單元210之間的匯流排成為瓶頸,其類似於上文關於CPU所描述之瓶頸。
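In addition, processing unit 210 communicates with shared memories 250a, 250b, 250c, and 250d. For example, memories 250a, 250b, 250c, and 250d may represent memory banks of shared DRAM. Accordingly, the processor subunits of processing unit 210 may use shared memories 250a, 250b, 250c, and 250d to store data that is then operated upon by the processor subunits. This arrangement, however, causes the buses between memories 250a, 250b, 250c, and 250d and processing unit 210 to become a bottleneck, similar to the bottleneck described above for CPUs.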
所揭示硬體晶片之綜述 Overview of disclosed hardware chips
圖3A為描繪例示性硬體晶片300之實施例的圖解表示。硬體晶
片300可包含經設計以緩解上文關於CPU、GPU及其他習知處理器所描述之瓶頸的分散式處理器。分散式處理器可包括在空間上分佈於單個基板上之複數個處理器子單元。此外,如上文所解釋,在本發明之分散式處理器中,對應記憶體組亦在空間上分佈於基板上。在一些實施例中,分散式處理器可與指令集相關聯,且分散式處理器之處理器子單元中的每一者可負責執行包括於該指令集中之一或多個任務。
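FIG. 3A is a diagrammatic representation of an embodiment depicting an exemplary hardware chip 300. Hardware chip 300 may comprise a distributed processor designed to mitigate the bottlenecks described above for CPUs, GPUs, and other conventional processors. A distributed processor may include a plurality of processor subunits distributed spatially on a single substrate. Moreover, as explained above, in distributed processors of the present invention, the corresponding memory banks are also spatially distributed on the substrate. In some embodiments, a distributed processor may be associated with a set of instructions, and each one of the processor subunits of the distributed processor may be responsible for performing one or more tasks included in the set of instructions.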
如圖3A中所描繪,硬體晶片300可包含複數個處理器子單元,例如邏輯及控制子單元320a、320b、320c、320d、320e、320f、320g及320h。如圖3A中進一步所描繪,每一處理器子單元可具有專用記憶體例項。舉例而言,邏輯及控制子單元320a可操作地連接至專用記憶體例項330a,邏輯及控制子單元320b可操作地連接至專用記憶體例項330b,邏輯及控制子單元320c可操作地連接至專用記憶體例項330c,邏輯及控制子單元320d可操作地連接至專用記憶體例項330d,邏輯及控制子單元320e可操作地連接至專用記憶體例項330e,邏輯及控制子單元320f可操作地連接至專用記憶體例項330f,邏輯及控制子單元320g可操作地連接至專用記憶體例項330g,且邏輯及控制子單元320h可操作地連接至專用記憶體例項330h。
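As depicted in FIG. 3A, hardware chip 300 may comprise a plurality of processor subunits, e.g., logic and control subunits 320a, 320b, 320c, 320d, 320e, 320f, 320g, and 320h. As further depicted in FIG. 3A, each processor subunit may have a dedicated memory instance. For example, logic and control subunit 320a is operably connected to dedicated memory instance 330a, logic and control subunit 320b is operably connected to dedicated memory instance 330b, logic and control subunit 320c is operably connected to dedicated memory instance 330c, logic and control subunit 320d is operably connected to dedicated memory instance 330d, logic and control subunit 320e is operably connected to dedicated memory instance 330e, logic and control subunit 320f is operably connected to dedicated memory instance 330f, logic and control subunit 320g is operably connected to dedicated memory instance 330g, and logic and control subunit 320h is operably connected to dedicated memory instance 330h.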
儘管圖3A將每一記憶體例項描繪為單個記憶體組,但硬體晶片300可包括兩個或多於兩個記憶體組作為用於硬體晶片300上之處理器子單元的專用記憶體例項。此外,儘管圖3A將每一處理器子單元描繪為包含邏輯組件及用於專用記憶體組之控制件兩者,但硬體晶片300可使用用於記憶體組之控制件,該等控制件至少部分地與該等邏輯組件分開。此外,如圖3A中所描繪,可將兩個或多於兩個處理器子單元及其對應記憶體組分組成例如處理群組310a、310b、310c及310d。「處理群組」可表示上面形成有硬體晶片300之基板上的空間區別。因此,處理群組可包括用於群組中之記憶體組的其他控制件,例如
控制件340a、340b、340c及340d。另外或替代地,「處理群組」可表示用於編譯程式碼以供在硬體晶片300上執行之目的之邏輯分組。因此,用於硬體晶片300之編譯器(下文進一步描述)可在硬體晶片300上之處理群組之間劃分整個指令集。
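Although FIG. 3A depicts each memory instance as a single memory bank, hardware chip 300 may include two or more memory banks as a dedicated memory instance for a processor subunit on hardware chip 300. Furthermore, although FIG. 3A depicts each processor subunit as comprising both a logic component and a control for the dedicated memory bank(s), hardware chip 300 may use controls for the memory banks that are, at least in part, separate from the logic components. Moreover, as depicted in FIG. 3A, two or more processor subunits and their corresponding memory banks may be grouped, e.g., into processing groups 310a, 310b, 310c, and 310d. A "processing group" may represent a spatial distinction on the substrate on which hardware chip 300 is formed. Accordingly, a processing group may include further controls for the memory banks in the group, e.g., controls 340a, 340b, 340c, and 340d. Additionally or alternatively, a "processing group" may represent a logical grouping for the purpose of compiling code for execution on hardware chip 300. Accordingly, a compiler for hardware chip 300 (further described below) may divide an overall set of instructions between the processing groups on hardware chip 300.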
此外,主機350可將指令、資料及其他輸入提供至硬體晶片300且自該硬體晶片讀取輸出。因此,指令集可全部在單個晶粒上,例如在代管硬體晶片300之晶粒上執行。實際上,晶粒外之僅有通信可包括指令至硬體晶片300之載入、發送至硬體晶片300之任何輸入及自硬體晶片300讀取之任何輸出。因此,所有計算及記憶體操作可在晶粒上(在硬體晶片300上)執行,此係因為硬體晶片300之處理器子單元與硬體晶片300之專用記憶體組通信。
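In addition, host 350 may provide instructions, data, and other input to hardware chip 300 and read output from it. Accordingly, a set of instructions may be executed entirely on a single die, e.g., the die hosting hardware chip 300. Indeed, the only communications off-die may include the loading of instructions to hardware chip 300, any input sent to hardware chip 300, and any output read from hardware chip 300. Accordingly, all calculations and memory operations may be performed on-die (on hardware chip 300) because the processor subunits of hardware chip 300 communicate with the dedicated memory banks of hardware chip 300.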
圖3B為描繪另一例示性硬體晶片300'之實施例的圖解表示。儘管描繪為硬體晶片300之替代,但圖3B中所描繪之架構可至少部分地與圖3A中所描繪之架構組合。
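FIG. 3B is a diagrammatic representation of an embodiment depicting another exemplary hardware chip 300'. Although depicted as an alternative to hardware chip 300, the architecture depicted in FIG. 3B may be combined, at least in part, with the architecture depicted in FIG. 3A.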
如圖3B中所描繪,硬體晶片300'可包含複數個處理器子單元,例如處理器子單元350a、350b、350c及350d。如圖3B中進一步所描繪,每一處理器子單元可具有複數個專用記憶體例項。舉例而言,處理器子單元350a可操作地連接至專用記憶體例項330a及330b,處理器子單元350b可操作地連接至專用記憶體例項330c及330d,處理器子單元350c可操作地連接至專用記憶體例項330e及330f,且處理器子單元350d可操作地連接至專用記憶體例項330g及330h。此外,如圖3B中所描繪,可將處理器子單元及其對應記憶體組分組成例如處理群組310a、310b、310c及310d。如上文所解釋,「處理群組」可表示上面形成有硬體晶片300'之基板上的空間區別及/或用於編譯程式碼以供在硬體晶片300'上執行之目的之邏輯分組。
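As depicted in FIG. 3B, hardware chip 300' may comprise a plurality of processor subunits, e.g., processor subunits 350a, 350b, 350c, and 350d. As further depicted in FIG. 3B, each processor subunit may have a plurality of dedicated memory instances. For example, processor subunit 350a is operably connected to dedicated memory instances 330a and 330b, processor subunit 350b is operably connected to dedicated memory instances 330c and 330d, processor subunit 350c is operably connected to dedicated memory instances 330e and 330f, and processor subunit 350d is operably connected to dedicated memory instances 330g and 330h. Moreover, as depicted in FIG. 3B, the processor subunits and their corresponding memory banks may be grouped, e.g., into processing groups 310a, 310b, 310c, and 310d. As explained above, a "processing group" may represent a spatial distinction on the substrate on which hardware chip 300' is formed and/or a logical grouping for the purpose of compiling code for execution on hardware chip 300'.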
如圖3B中進一步所描繪,處理器子單元可經由匯流排彼此通信。
舉例而言,如圖3B所展示,處理器子單元350a可經由匯流排360a與處理器子單元350b通信,經由匯流排360c與處理器子單元350c通信,且經由匯流排360f與處理器子單元350d通信。類似地,處理器子單元350b可經由匯流排360a(如上文所描述)與處理器子單元350a通信,經由匯流排360e與處理器子單元350c通信,且經由匯流排360d與處理器子單元350d通信。此外,處理器子單元350c可經由匯流排360c(如上文所描述)與處理器子單元350a通信,經由匯流排360e(如上文所描述)與處理器子單元350b通信,且經由匯流排360b與處理器子單元350d通信。相應地,處理器子單元350d可經由匯流排360f(如上文所描述)與處理器子單元350a通信,經由匯流排360d(如上文所描述)與處理器子單元350b通信,且經由匯流排360b(如上文所描述)與處理器子單元350c通信。一般熟習此項技術者將理解,可使用比圖3B中所描繪之匯流排少的匯流排。舉例而言,可消除匯流排360e,使得處理器子單元350b與350c之間的通信經由處理器子單元350a及/或350d傳遞。類似地,可消除匯流排360f,使得處理器子單元350a與處理器子單元350d之間的通信經由處理器子單元350b或350c傳遞。
As further depicted in Figure 3B, the processor sub-units can communicate with each other via a bus.
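For example, as shown in FIG. 3B, processor subunit 350a may communicate with processor subunit 350b via bus 360a, with processor subunit 350c via bus 360c, and with processor subunit 350d via bus 360f. Similarly, processor subunit 350b may communicate with processor subunit 350a via bus 360a (as described above), with processor subunit 350c via bus 360e, and with processor subunit 350d via bus 360d. In addition, processor subunit 350c may communicate with processor subunit 350a via bus 360c (as described above), with processor subunit 350b via bus 360e (as described above), and with processor subunit 350d via bus 360b. Accordingly, processor subunit 350d may communicate with processor subunit 350a via bus 360f (as described above), with processor subunit 350b via bus 360d (as described above), and with processor subunit 350c via bus 360b (as described above). One of ordinary skill in the art will understand that fewer buses than those depicted in FIG. 3B may be used. For example, bus 360e may be eliminated such that communications between processor subunits 350b and 350c pass through processor subunit 350a and/or 350d. Similarly, bus 360f may be eliminated such that communications between processor subunit 350a and processor subunit 350d pass through processor subunit 350b or 350c.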
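The effect of eliminating a bus can be sketched as a small graph search. The bus labels and subunit labels below follow the FIG. 3B description; the `shortest_route` helper and the use of breadth-first search are illustrative assumptions, not a routing method stated in the disclosure.

```python
from collections import deque

# Bus label -> the two processor subunits it joins, per the FIG. 3B text.
BUSES = {
    "360a": ("350a", "350b"),
    "360b": ("350c", "350d"),
    "360c": ("350a", "350c"),
    "360d": ("350b", "350d"),
    "360e": ("350b", "350c"),
    "360f": ("350a", "350d"),
}


def shortest_route(src, dst, removed=()):
    """Breadth-first search for a shortest chain of subunits from src to dst.

    `removed` lists bus labels treated as eliminated (hypothetical parameter).
    Returns the list of subunits visited, or None if dst is unreachable.
    """
    adjacency = {}
    for label, (u, v) in BUSES.items():
        if label in removed:
            continue
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

With all six buses present, `shortest_route("350b", "350c")` is the direct hop over bus 360e. With `removed=("360e",)`, the route gains one intermediate subunit, either 350a or 350d, matching the description above.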
此外,一般熟習此項技術者將理解,可使用除圖3A及圖3B中所描繪之架構以外的架構。舉例而言,各具有單個處理器子單元及記憶體例項之處理群組的陣列可配置於基板上。處理器子單元可另外或替代地形成用於對應的專用記憶體組之控制器的部分、用於對應的專用記憶體之記憶體墊之控制器的部分,或其類似者。 In addition, those who are generally familiar with the technology will understand that architectures other than those depicted in FIGS. 3A and 3B can be used. For example, arrays of processing groups each having a single processor subunit and memory instance can be arranged on the substrate. The processor subunit may additionally or alternatively form part of the controller for the corresponding dedicated memory bank, part of the controller for the memory pad of the corresponding dedicated memory, or the like.
鑒於上文所描述之架構,相較於傳統架構,硬體晶片300及300'可顯著提高記憶體密集型任務之效率。舉例而言,資料庫操作及人工智慧演算法(諸如,神經網路)為記憶體密集型任務之實例,對於記憶體密集型任務,傳統架構在效率上低於硬體晶片300及300'。因此,硬體晶片300及300'可被稱作資料庫加速器處理器及/或人工智慧加速器處理器。
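In view of the architecture described above, hardware chips 300 and 300' may significantly increase the efficiency of memory-intensive tasks compared with traditional architectures. For example, database operations and artificial intelligence algorithms (such as neural networks) are examples of memory-intensive tasks for which traditional architectures are less efficient than hardware chips 300 and 300'. Accordingly, hardware chips 300 and 300' may be referred to as database accelerator processors and/or artificial intelligence accelerator processors.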
組態所揭示硬體晶片 Hardware chip revealed by configuration
上文所描述之硬體晶片架構可經組態以用於程式碼執行。舉例而言,每一處理器子單元可與硬體晶片中之其他處理器子單元隔開而個別地執行程式碼(定義指令集)。因此,替代依賴於作業系統來管理多執行緒處理或使用多任務處理(其為同時而非並列的),本發明之硬體晶片可允許處理器子單元完全並列地操作。 The hardware chip architecture described above may be configured for execution of code. For example, each processor subunit may individually execute code (defining a set of instructions) apart from the other processor subunits in the hardware chip. Accordingly, rather than relying on an operating system to manage multithreading or using multitasking (which provides concurrency rather than parallelism), hardware chips of the present invention may allow the processor subunits to operate fully in parallel.
除上文所描述之完全並列實施方案以外,指派給每一處理器子單元之指令中的至少一些可重疊。舉例而言,分散式處理器上之複數個處理器子單元可執行重疊指令作為例如作業系統或其他管理軟體之實施方案,同時執行非重疊指令以便在作業系統或其他管理軟體之上下文內執行並列任務。 Except for the fully parallel implementation described above, at least some of the instructions assigned to each processor subunit may overlap. For example, a plurality of processor subunits on a distributed processor can execute overlapping instructions as an implementation solution for operating systems or other management software, and execute non-overlapping instructions at the same time to execute parallel operations within the context of the operating system or other management software. task.
圖4描繪藉由處理群組410進行之用於執行一般命令的例示性處理程序400。舉例而言,處理群組410可包含本發明之硬體晶片(例如,硬體晶片300、硬體晶片300'或其類似者)的一部分。
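FIG. 4 depicts an exemplary process 400 for executing a generic command, performed by processing group 410. For example, processing group 410 may comprise a portion of a hardware chip of the present invention (e.g., hardware chip 300, hardware chip 300', or the like).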
如圖4中所描繪,可將命令發送至與專用記憶體例項420配對之處理器子單元430。外部主機(例如,主機350)可將命令發送至處理群組410以供執行。替代地,主機350可能已發送包括該命令之指令集以用於儲存於記憶體例項420中,使得處理器子單元430可自記憶體例項420擷取命令且執行所擷取命令。因此,該命令可由處理元件440執行,該處理元件為可組態以執行所接收命令之一般處理元件。此外,處理群組410可包括用於記憶體例項420之控制件460。如圖4中所描繪,控制件460可執行處理元件440在執行所接收命令時所需的對記憶體例項420之任何讀取及/或寫入。在執行命令之後,處理群組410可將命令之結果輸出至例如外部主機或輸出至同一硬體晶片上之不同處理群組。
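As depicted in FIG. 4, a command may be sent to processor subunit 430, which is paired with dedicated memory instance 420. An external host (e.g., host 350) may send the command to processing group 410 for execution. Alternatively, host 350 may have sent an instruction set including the command for storage in memory instance 420 such that processor subunit 430 may retrieve the command from memory instance 420 and execute the retrieved command. Accordingly, the command may be executed by processing element 440, which is a generic processing element configurable to execute the received command. Moreover, processing group 410 may include a control 460 for memory instance 420. As depicted in FIG. 4, control 460 may perform any reads and/or writes to memory instance 420 required by processing element 440 when executing the received command. After executing the command, processing group 410 may output the result of the command, e.g., to the external host or to a different processing group on the same hardware chip.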
在一些實施例中,如圖4中所描繪,處理器子單元430可進一步
包括位址產生器450。「位址產生器」可包含複數個處理元件,該等複數個處理元件經組態以判定用於執行讀取及寫入之一或多個記憶體組中的位址,且亦可對位於所判定位址處之資料執行操作(例如,加法、減法、乘法或其類似者)。舉例而言,位址產生器450可判定用於對記憶體進行之任何讀取或寫入的位址。在一個實例中,位址產生器450可藉由在不再需要讀取值時用基於命令所判定之新值覆寫讀取值來提高效率。另外或替代地,位址產生器450可選擇可用位址以用於儲存來自命令執行之結果。此可允許為後一時脈循環排程結果讀出,此對於外部主機較為便利。在另一實例中,位址產生器450可在諸如向量或矩陣乘法累加(multiply-accumulate)計算之多循環計算期間判定讀取及寫入的位址。因此,位址產生器450可維持或計算用於讀取資料及寫入多循環計算之中間結果的記憶體位址,使得處理器子單元430可繼續處理而不必儲存此等記憶體位址。
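In some embodiments, as depicted in FIG. 4, processor subunit 430 may further include an address generator 450. An "address generator" may comprise a plurality of processing elements that are configured to determine addresses in one or more memory banks for performing reads and writes, and that may also perform operations (e.g., addition, subtraction, multiplication, or the like) on the data located at the determined addresses. For example, address generator 450 may determine addresses for any reads or writes to memory. In one example, address generator 450 may increase efficiency by overwriting a read value with a new value determined based on the command when the read value is no longer needed. Additionally or alternatively, address generator 450 may select available addresses for storing results from execution of the command. This may allow the read-out of results to be scheduled for a later clock cycle, when it is more convenient for the external host. In another example, address generator 450 may determine the addresses to read from and write to during a multi-cycle calculation, such as a vector or matrix multiply-accumulate calculation. Accordingly, address generator 450 may maintain or calculate the memory addresses for reading data and writing intermediate results of the multi-cycle calculation such that processor subunit 430 may continue processing without having to store these memory addresses.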
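The multi-cycle case can be sketched in software. This is only an illustrative model under stated assumptions: the `AddressGenerator` class, its constructor parameters, and the flat list standing in for a memory bank are all hypothetical, and the disclosure does not specify how address generator 450 is implemented.

```python
class AddressGenerator:
    """Toy address generator for a dot product (a multi-cycle multiply-accumulate).

    It tracks the two read addresses and the single accumulator address so the
    processing loop never has to store or compute addresses itself.
    """

    def __init__(self, base_a, base_b, acc_addr, length):
        self.base_a = base_a      # first operand vector starts here
        self.base_b = base_b      # second operand vector starts here
        self.acc_addr = acc_addr  # intermediate result is rewritten in place
        self.length = length
        self.step = 0

    def next_addresses(self):
        """Return (read_a, read_b, write_acc) for this cycle, or None when done."""
        if self.step >= self.length:
            return None
        addrs = (self.base_a + self.step, self.base_b + self.step, self.acc_addr)
        self.step += 1
        return addrs


def dot_product(memory, gen):
    """Processing-element loop: multiply, accumulate, ask for the next addresses."""
    memory[gen.acc_addr] = 0
    while (addrs := gen.next_addresses()) is not None:
        read_a, read_b, write_acc = addrs
        memory[write_acc] += memory[read_a] * memory[read_b]
    return memory[gen.acc_addr]
```

For example, with `memory = [1, 2, 3, 4, 5, 6, 0]` and `AddressGenerator(0, 3, 6, 3)`, the loop accumulates 1*4 + 2*5 + 3*6 into address 6. Reusing one accumulator address mirrors the idea of overwriting a value that is no longer needed.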
圖5描繪藉由處理群組510進行之用於執行專門命令的例示性處理程序500。舉例而言,處理群組510可包含本發明之硬體晶片(例如,硬體晶片300、硬體晶片300'或其類似者)的一部分。
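FIG. 5 depicts an exemplary process 500 for executing a specialized command, performed by processing group 510. For example, processing group 510 may comprise a portion of a hardware chip of the present invention (e.g., hardware chip 300, hardware chip 300', or the like).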
如圖5中所描繪,專門命令(例如,乘法累加命令)可發送至與專用記憶體例項520配對之處理元件530。外部主機(例如,主機350)可將命令發送至處理元件530以供執行。因此,該命令可由處理元件530在來自主機之給定信號下執行,該處理元件為可組態以執行特定命令(包括所接收命令)的專門處理元件。替代地,處理元件530可自記憶體例項520擷取命令以供執行。因此,在圖5之實例中,處理元件530為乘法累加(MAC)電路,該電路經組態以執行自外部主機接收或自記憶體例項520擷取之MAC命令。在執行命令之後,處理群組410可將命令之結果輸出至例如外部主機或輸出至同一硬體晶片上之不同處理群組。儘管關於單個命令及單個結果來描繪,但可接收或擷取及
執行複數個命令,且複數個結果可在輸出之前在處理群組510上組合。
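As depicted in FIG. 5, a specialized command (e.g., a multiply-accumulate command) may be sent to processing element 530, which is paired with dedicated memory instance 520. An external host (e.g., host 350) may send the command to processing element 530 for execution. Accordingly, the command may be executed, at a given signal from the host, by processing element 530, a specialized processing element configurable to execute particular commands (including the received command). Alternatively, processing element 530 may retrieve the command from memory instance 520 for execution. Thus, in the example of FIG. 5, processing element 530 is a multiply-accumulate (MAC) circuit configured to execute MAC commands received from the external host or retrieved from memory instance 520. After executing the command, processing group 410 may output the result of the command, e.g., to the external host or to a different processing group on the same hardware chip. Although depicted with a single command and a single result, a plurality of commands may be received or retrieved and executed, and a plurality of results may be combined on processing group 510 before output.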
儘管在圖5中描繪為MAC電路,但額外或替代的專門電路可包括於處理群組510中。舉例而言,可實施MAX讀取命令(其傳回向量之最大值)、MAX0讀取命令(亦被稱作整流器之常用功能,其傳回整個向量,而且傳回為0之最大值),或其類似者。
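Although depicted as a MAC circuit in FIG. 5, additional or alternative specialized circuits may be included in processing group 510. For example, a MAX-read command (which returns the maximum value of a vector), a MAX0-read command (a common function also termed a rectifier, which returns the entire vector but takes the maximum of each value and 0), or the like may be implemented.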
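The three specialized commands named above have simple reference semantics, sketched here in software. The function names are hypothetical, and the sketch models only the arithmetic, not the circuits or the command encoding.

```python
def mac(accumulator, a, b):
    """Multiply-accumulate: add the element-wise products of a and b to accumulator."""
    return accumulator + sum(x * y for x, y in zip(a, b))


def max_read(vector):
    """MAX-read: return the single largest value in the vector."""
    return max(vector)


def max0_read(vector):
    """MAX0-read (rectifier): return the whole vector with negatives clamped to 0."""
    return [max(x, 0) for x in vector]
```

For instance, `max0_read([3, -1, 0, 7])` yields `[3, 0, 0, 7]`, which is the rectifier commonly applied between neural-network layers, while `max_read` on the same vector yields `7`.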
儘管分開地描繪,但可組合圖4之一般處理群組410與圖5之專門處理群組510。舉例而言,一般處理器子單元可耦接至一或多個專門處理器子單元以形成處理器子單元。因此,一般處理器子單元可用於不可由一或多個專門處理器子單元執行的所有指令。
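Although depicted separately, the generic processing group 410 of FIG. 4 and the specialized processing group 510 of FIG. 5 may be combined. For example, a generic processor subunit may be coupled to one or more specialized processor subunits to form a processor subunit. Accordingly, the generic processor subunit may be used for all instructions that cannot be executed by the one or more specialized processor subunits.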
一般熟習此項技術者將理解,可藉由專門邏輯電路來處置神經網路實施方案及其他記憶密集型任務。舉例而言,資料庫查詢、封包檢測、字串比較及其他功能在由本文中所描述之硬體晶片執行的情況下可提高效率。 Those who are familiar with this technology will understand that special logic circuits can be used to handle neural network implementations and other memory-intensive tasks. For example, database query, packet inspection, string comparison, and other functions can improve efficiency when executed by the hardware chip described in this article.
用於分散式處理之基於記憶體之架構 Memory-based architecture for distributed processing
在符合本發明之硬體晶片上,專用匯流排可在該晶片上之處理器子單元之間及/或在該等處理器子單元與其對應的專用記憶體組之間傳送資料。使用專用匯流排可降低仲裁成本,此係因為競爭請求係不可能的或容易使用軟體而非使用硬體來避免。 On the hardware chip according to the present invention, the dedicated bus can transfer data between the processor sub-units on the chip and/or between the processor sub-units and their corresponding dedicated memory banks. Using a dedicated bus can reduce the cost of arbitration, because competing requests are impossible or easy to avoid using software rather than hardware.
圖6示意性地描繪處理群組600之圖解表示。處理群組600可供用於硬體晶片(例如,硬體晶片300、硬體晶片300'或其類似者)中。處理器子單元610可經由匯流排630連接至記憶體620。記憶體620可包含隨機可存取記憶體(RAM)元件,其儲存供處理器子單元610執行之資料及程式碼。在一些實施例中,記憶體620可為N路記憶體(其中N為等於或大於1之數字,其暗示交錯式記憶體620中之區段的數目)。因為處理器子單元610經由匯流排630耦接至專用於處理器子單元610之記憶體620,所以N可保持相對較小而不損害
執行效能。此表示對習知多路暫存器檔案或快取記憶體之改善,其中較低N通常導致較低執行效能,且較高N通常導致大的面積及功率損失。
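FIG. 6 schematically depicts a diagrammatic representation of a processing group 600. Processing group 600 may be used in a hardware chip (e.g., hardware chip 300, hardware chip 300', or the like). Processor subunit 610 may be connected to memory 620 via bus 630. Memory 620 may comprise a random access memory (RAM) element that stores data and code for execution by processor subunit 610. In some embodiments, memory 620 may be an N-way memory (where N is a number equal to or greater than 1 that implies the number of segments in interleaved memory 620). Because processor subunit 610 is coupled via bus 630 to memory 620, which is dedicated to processor subunit 610, N may be kept relatively small without compromising execution performance. This represents an improvement over conventional multiway register files or caches, where a lower N generally results in lower execution performance and a higher N generally results in a large area and power penalty.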
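One common way to realize an N-way interleaved memory is low-order interleaving, where consecutive word addresses rotate through the N segments. The sketch below assumes that scheme purely for illustration; the disclosure does not say how memory 620 maps addresses to segments, and the `interleave` function is hypothetical.

```python
def interleave(address, n_ways):
    """Map a flat word address to (way, offset) under low-order interleaving.

    Assumption: consecutive addresses land in consecutive ways, so a burst of
    sequential accesses is spread across all n_ways segments.
    """
    return address % n_ways, address // n_ways
```

With `n_ways=4`, addresses 0 through 5 fall in ways 0, 1, 2, 3, 0, 1, so back-to-back sequential accesses do not contend for the same segment.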
可根據例如一或多個任務中所涉及之資料的大小而調整記憶體620之大小、通路之數目及匯流排630之寬度以滿足使用處理群組600之系統之任務及應用程式實施方案的要求。記憶體元件620可包含此項技術中已知的一或多個類型之記憶體,例如揮發性記憶體(諸如,RAM、DRAM、SRAM、相變RAM(PRAM)、磁阻式RAM(MRAM)、電阻式RAM(ReRAM)或其類似者)或非揮發性記憶體(諸如,快閃記憶體或ROM)。根據一些實施例,記憶體元件620之一部分可包含第一記憶體類型,而另一部分可包含另一記憶體類型。舉例而言,記憶體元件620之程式碼區可包含ROM元件,而記憶體元件620之資料區可包含DRAM元件。此分割之另一實例為將神經網路之權重儲存於快閃記憶體中,而將用於計算之資料儲存於DRAM中。
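The size of memory 620, the number of ways, and the width of bus 630 may be adjusted, according to, for instance, the size of the data involved in the task or tasks, to meet the requirements of the tasks and application implementations of a system using processing group 600. Memory element 620 may comprise one or more types of memory known in the art, e.g., volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments, a portion of memory element 620 may comprise a first memory type, while another portion may comprise another memory type. For instance, a code region of memory element 620 may comprise a ROM element, while a data region of memory element 620 may comprise a DRAM element. Another example of this partitioning is storing the weights of a neural network in flash memory while storing the data used for calculation in DRAM.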
處理器子單元610包含處理元件640,該處理元件可包含處理器。該處理器可為管線式或非管線式的,可為定製精簡指令集運算(RISC)元件或實施於此項技術中已知之任何商業積體電路(IC)(諸如,ARM、ARC、RISC-V等)上的其他處理方案,如一般熟習此項技術者所瞭解。處理元件640可包含控制器,該控制器在一些實施例中包括算術邏輯單元(ALU)或其他控制器。
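Processor subunit 610 comprises a processing element 640, which may comprise a processor. The processor may be pipelined or not pipelined and may be a customized reduced instruction set computing (RISC) element or another processing scheme implemented on any commercial integrated circuit (IC) known in the art (such as ARM, ARC, RISC-V, etc.), as appreciated by one of ordinary skill in the art. Processing element 640 may comprise a controller that, in some embodiments, includes an arithmetic logic unit (ALU) or other controller.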
根據一些實施例,執行所接收或所儲存之程式碼的處理元件640可包含一般處理元件,且因此為靈活的並能夠執行廣泛多種處理操作。當比較在特定操作之執行期間所消耗的功率時,非專用電路系統通常比特定操作專用電路系統消耗更多功率。因此,當執行特定的複雜算術計算時,處理元件640可比專用硬體消耗更多功率且執行效率更低。因此,根據一些實施例,處理元件640之控制器可經設計以執行特定操作(例如,加法或「移動」操作)。
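According to some embodiments, processing element 640, which executes received or stored code, may comprise a generic processing element and may therefore be flexible and capable of performing a wide variety of processing operations. Non-dedicated circuitry typically consumes more power than operation-specific circuitry when the power consumed during the performance of a specific operation is compared. Therefore, when performing specific complex arithmetic calculations, processing element 640 may consume more power and perform less efficiently than dedicated hardware. Accordingly, in some embodiments, the controller of processing element 640 may be designed to perform specific operations (e.g., addition or "move" operations).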
在一個實例中,特定操作可由一或多個加速器650執行。每一加
速器可為專用的且經程式化以執行特定計算(諸如,乘法、浮點向量運算或其類似者)。藉由使用加速器,每個處理器子單元之每次計算所消耗的平均功率可降低,且計算輸送量因此增加。可根據系統經設計以實施之應用程式(例如,執行神經網路、執行資料庫查詢或其類似者)而選擇加速器650。加速器650可由處理元件640組態且可與處理元件協同操作以用於降低功率消耗且加速計算及運算。加速器可另外或替代地用以在記憶體與諸如智慧型直接記憶體存取(DMA)周邊裝置之處理群組600的MUX/DEMUX/輸入/輸出埠(例如,MUX 650及DEMUX 660)之間傳送資料。
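In one example, the specific operations may be performed by one or more accelerators 650. Each accelerator may be dedicated and programmed to perform a specific calculation (such as multiplication, floating-point vector operations, or the like). By using accelerators, the average power consumed per calculation per processor subunit may be lowered, and the calculation throughput increases accordingly. Accelerators 650 may be chosen according to the application that the system is designed to implement (e.g., execution of neural networks, execution of database queries, or the like). Accelerators 650 may be configured by processing element 640 and may operate in tandem with it to lower power consumption and accelerate calculations and computations. The accelerators may additionally or alternatively be used to transfer data between memory and the MUX/DEMUX/input/output ports (e.g., MUX 650 and DEMUX 660) of processing group 600, in the manner of a smart direct memory access (DMA) peripheral.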
加速器650可經組態以執行多種功能。舉例而言,一個加速器可經組態以執行通常用於神經網路中之16位元浮點計算或8位元整數計算。加速器功能之另一實例為通常用於神經網路之訓練階段期間的32位元浮點計算。加速器功能之又一實例為查詢處理,諸如用於資料庫中之查詢處理。在一些實施例中,加速器650可包含用以執行此等功能之專門處理元件及/或可根據儲存於記憶體元件620上之組態資料進行組態使得其可加以修改。
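Accelerators 650 may be configured to perform a variety of functions. For instance, one accelerator may be configured to perform the 16-bit floating-point calculations or 8-bit integer calculations commonly used in neural networks. Another example of an accelerator function is the 32-bit floating-point calculation commonly used during the training stage of a neural network. Yet another example of an accelerator function is query processing, such as that used in databases. In some embodiments, accelerators 650 may comprise specialized processing elements to perform these functions and/or may be configured according to configuration data stored on memory element 620 such that they may be modified.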
加速器650可另外或替代地實施記憶體移動之可組態的指令碼處理清單以對資料至/自記憶體620或至/自其他加速器及/或輸入/輸出端的移動進行計時。因此,如下文進一步所解釋,使用處理群組600之硬體晶片內部的所有資料移動可使用軟體同步而非硬體同步。舉例而言,一個處理群組(例如,群組600)中之加速器可每第十循環將資料自其輸入端傳送至其加速器,且接著在下一循環輸出資料,藉此使資訊自處理群組之記憶體流送至另一記憶體。
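Accelerators 650 may additionally or alternatively implement a configurable scripted list of memory movements to time the movement of data to/from memory 620 or to/from other accelerators and/or inputs/outputs. Accordingly, as explained further below, all data movement inside a hardware chip using processing group 600 may use software synchronization rather than hardware synchronization. For example, an accelerator in one processing group (e.g., group 600) may transfer data from its input to its accelerator every tenth cycle and then output the data on the next cycle, thereby letting the information flow from the memory of the processing group to another memory.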
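The every-tenth-cycle example can be simulated to show why no handshake is needed. This is a minimal sketch under assumptions: the `run_schedule` function, its parameters, and the cycle-indexed loop are hypothetical modelling choices, not the scripted-list format used by accelerators 650.

```python
def run_schedule(inputs, cycles, period=10):
    """Simulate an accelerator that latches one input word on every `period`-th
    cycle and emits the latched word on the following cycle.

    Returns (cycle, word) pairs. Because the timing is fixed in advance, a
    receiver that knows the script can sample on the right cycles without any
    request/acknowledge signalling.
    """
    latched = None
    outputs = []
    stream = iter(inputs)
    for cycle in range(cycles):
        if latched is not None:          # emit what was latched last cycle
            outputs.append((cycle, latched))
            latched = None
        if cycle % period == 0:          # scripted transfer slot
            latched = next(stream, None)
    return outputs
```

Running `run_schedule(["w0", "w1", "w2"], 25)` produces outputs on cycles 1, 11, and 21, so a neighbouring group scripted to read on exactly those cycles stays in step purely through its own program.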
如圖6中進一步所描繪,在一些實施例中,處理群組600可進一步包含連接至其輸入埠之至少一個輸入多工器(MUX)660及連接至其輸出埠之至少一個輸出DEMUX670。此等MUX/DEMUX可由來自處理元件640及/或來自加速器650中之一者的控制信號(未圖示)控制,該等控制信號係根據
正由處理元件640進行之當前指令及/或由加速器650中之加速器執行的操作而判定。在一些情境中,可能需要處理群組600(根據來自其程式碼記憶體之預定義指令)將資料自其輸入埠傳送至其輸出埠。因此,除DEMUX/MUX中之每一者連接至處理元件640及加速器650以外,輸入MUX(例如,MUX 660)中之一或多者亦可經由一或多個匯流排直接連接至輸出DEMUX(例如,DEMUX 670)。
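As further depicted in FIG. 6, in some embodiments, processing group 600 may further comprise at least one input multiplexer (MUX) 660 connected to its input port and at least one output DEMUX 670 connected to its output port. These MUXs/DEMUXs may be controlled by control signals (not shown) from processing element 640 and/or from one of accelerators 650, the control signals being determined according to the current instruction being carried out by processing element 640 and/or the operation being executed by an accelerator of accelerators 650. In some scenarios, processing group 600 may be required (according to a predefined instruction from its code memory) to transfer data from its input port to its output port. Accordingly, in addition to each of the DEMUXs/MUXs being connected to processing element 640 and accelerators 650, one or more of the input MUXs (e.g., MUX 660) may be directly connected via one or more buses to an output DEMUX (e.g., DEMUX 670).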
圖6之處理群組600可排成陣列以形成分散式處理器,例如,如圖7A中所描繪。處理群組可安置於基板710上以形成陣列。在一些實施例中,基板710可包含諸如矽之半導體基板。另外或替代地,基板710可包含電路板,諸如可撓性電路板。
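The processing group 600 of FIG. 6 may be arrayed to form a distributed processor, for example, as depicted in FIG. 7A. The processing groups may be disposed on substrate 710 to form an array. In some embodiments, substrate 710 may comprise a semiconductor substrate, such as silicon. Additionally or alternatively, substrate 710 may comprise a circuit board, such as a flexible circuit board.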
如圖7A中所描繪,基板710可包括安置於其上之複數個處理群組,諸如處理群組600。因此,基板710包括記憶體陣列,該記憶體陣列包括複數個組,諸如組720a、720b、720c、720d、720e、720f、720g及720h。此外,基板710包括處理陣列,該處理陣列可包括複數個處理器子單元,諸如子單元730a、730b、730c、730d、730e、730f、730g及730h。
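As depicted in FIG. 7A, substrate 710 may include, disposed thereon, a plurality of processing groups, such as processing group 600. Accordingly, substrate 710 includes a memory array that includes a plurality of banks, such as banks 720a, 720b, 720c, 720d, 720e, 720f, 720g, and 720h. Furthermore, substrate 710 includes a processing array that may include a plurality of processor subunits, such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h.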
此外,如上文所解釋,每一處理群組可包括處理器子單元及專用於該處理器子單元之一或多個對應的記憶體組。因此,如圖7A中所描繪,每一子單元與一對應的專用記憶體組相關聯,例如:處理器子單元730a與記憶體組720a相關聯,處理器子單元730b與記憶體組720b相關聯,處理器子單元730c與記憶體組720c相關聯,處理器子單元730d與記憶體組720d相關聯,處理器子單元730e與記憶體組720e相關聯,處理器子單元730f與記憶體組720f相關聯,處理器子單元730g與記憶體組720g相關聯,處理器子單元730h與記憶體組720h相關聯。
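In addition, as explained above, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. Accordingly, as depicted in FIG. 7A, each subunit is associated with a corresponding, dedicated memory bank, e.g.: processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processor subunit 730c is associated with memory bank 720c, processor subunit 730d is associated with memory bank 720d, processor subunit 730e is associated with memory bank 720e, processor subunit 730f is associated with memory bank 720f, processor subunit 730g is associated with memory bank 720g, and processor subunit 730h is associated with memory bank 720h.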
為了允許每一處理器子單元與其對應的專用記憶體組通信,基板
710可包括將處理器子單元中之一者連接至其對應的專用記憶體組的第一複數個匯流排。因此,匯流排740a將處理器子單元730a連接至記憶體組720a,匯流排740b將處理器子單元730b連接至記憶體組720b,匯流排740c將處理器子單元730c連接至記憶體組720c,匯流排740d將處理器子單元730d連接至記憶體組720d,匯流排740e將處理器子單元730e連接至記憶體組720e,匯流排740f將處理器子單元730f連接至記憶體組720f,匯流排740g將處理器子單元730g連接至記憶體組720g,且匯流排740h將處理器子單元730h連接至記憶體組720h。此外,為了允許每一處理器子單元與其他處理器子單元通信,基板710可包括將處理器子單元中之一者連接至處理器子單元中之另一者的第二複數個匯流排。在圖7A之實例中,匯流排750a將處理器子單元730a連接至處理器子單元750e,匯流排750b將處理器子單元730a連接至處理器子單元750b,匯流排750c將處理器子單元730b連接至處理器子單元750f,匯流排750d將處理器子單元730b連接至處理器子單元750c,匯流排750e將處理器子單元730c連接至處理器子單元750g,匯流排750f將處理器子單元730c連接至處理器子單元750d,匯流排750g將處理器子單元730d連接至處理器子單元750h,匯流排750h將處理器子單元730h連接至處理器子單元750g,匯流排750i將處理器子單元730g連接至處理器子單元750g,且匯流排750j將處理器子單元730f連接至處理器子單元750e。
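To allow each processor subunit to communicate with its corresponding, dedicated memory bank(s), substrate 710 may include a first plurality of buses, each connecting one of the processor subunits to its corresponding, dedicated memory bank(s). Accordingly, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730c to memory bank 720c, bus 740d connects processor subunit 730d to memory bank 720d, bus 740e connects processor subunit 730e to memory bank 720e, bus 740f connects processor subunit 730f to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. Moreover, to allow each processor subunit to communicate with other processor subunits, substrate 710 may include a second plurality of buses, each connecting one of the processor subunits to another of the processor subunits. In the example of FIG. 7A, bus 750a connects processor subunit 730a to processor subunit 750e, bus 750b connects processor subunit 730a to processor subunit 750b, bus 750c connects processor subunit 730b to processor subunit 750f, bus 750d connects processor subunit 730b to processor subunit 750c, bus 750e connects processor subunit 730c to processor subunit 750g, bus 750f connects processor subunit 730c to processor subunit 750d, bus 750g connects processor subunit 730d to processor subunit 750h, bus 750h connects processor subunit 730h to processor subunit 750g, bus 750i connects processor subunit 730g to processor subunit 750g, and bus 750j connects processor subunit 730f to processor subunit 750e.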
因此,在圖7A中所展示之實例配置中,複數個邏輯處理器子單元配置成至少一列及至少一行。第二複數個匯流排將每一處理器子單元連接至同一列中之至少一個鄰近處理器子單元且連接至同一行中之至少一個鄰近處理器子單元。圖7A可被稱作「部分塊連接」。 Accordingly, in the example arrangement shown in FIG. 7A, the plurality of logic processor subunits is arranged in at least one row and at least one column. The second plurality of buses connects each processor subunit to at least one adjacent processor subunit in the same row and to at least one adjacent processor subunit in the same column. The arrangement of FIG. 7A may be referred to as a "partial block connection."
圖7A中所展示之配置可經修改以形成「完全塊連接」。完全塊連接包括連接對角線處理器子單元之額外匯流排。舉例而言,第二複數個匯流
排可包括處理器子單元730a與處理器子單元730f之間、處理器子單元730b與處理器子單元730e之間、處理器子單元730b與處理器子單元730g之間、處理器子單元730c與處理器子單元730f之間、處理器子單元730c與處理器子單元730h之間以及處理器子單元730d與處理器子單元730g之間的額外匯流排。
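The arrangement shown in FIG. 7A may be modified to form a "full block connection." A full block connection includes additional buses connecting diagonal processor subunits. For example, the second plurality of buses may include additional buses between processor subunit 730a and processor subunit 730f, between processor subunit 730b and processor subunit 730e, between processor subunit 730b and processor subunit 730g, between processor subunit 730c and processor subunit 730f, between processor subunit 730c and processor subunit 730h, and between processor subunit 730d and processor subunit 730g.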
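The difference between the partial and full block connections can be sketched as a neighbor computation on a grid. The 2 x 4 layout below (subunits 730a to 730d on the top row, 730e to 730h on the bottom row) is an assumption inferred from the diagonal pairs listed above, and the `neighbors` function is hypothetical.

```python
ROWS, COLS = 2, 4
NAMES = "abcdefgh"  # assumed layout: 730a-730d on the top row, 730e-730h below


def neighbors(row, col, full=False):
    """Return the subunits directly bused to the subunit at (row, col).

    Partial block connection: up, down, left, right.
    Full block connection: the same four plus the four diagonals.
    """
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if full:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    return sorted(
        "730" + NAMES[(row + dr) * COLS + (col + dc)]
        for dr, dc in steps
        if 0 <= row + dr < ROWS and 0 <= col + dc < COLS
    )
```

Under this assumed layout, subunit 730b at `(0, 1)` has neighbors 730a, 730c, and 730f in the partial case; the full case adds 730e and 730g, which are exactly the diagonal buses named for 730b above.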
完全塊連接可用於卷積計算,在卷積計算中,使用儲存於附近處理器子單元中之資料及結果。舉例而言,在卷積影像處理期間,每一處理器子單元可接收影像之塊(諸如,像素或像素群組)。為了詳算卷積結果,每一處理器子單元可自所有八個鄰近處理器子單元獲取資料,該等鄰近處理器子單元中之每一者已接收對應塊。在部分塊連接中,來自對角線鄰近處理器子單元之資料可經由連接至該處理器子單元之其他鄰近處理器子單元傳遞。因此,晶片上之分散式處理器可為人工智慧加速器處理器。 Complete block connection can be used for convolution calculations. In convolution calculations, data and results stored in nearby processor subunits are used. For example, during convolutional image processing, each processor sub-unit may receive a block of the image (such as pixels or pixel groups). In order to calculate the convolution result in detail, each processor subunit can obtain data from all eight neighboring processor subunits, each of which has received the corresponding block. In partial block connections, data from diagonally adjacent processor subunits can be transferred via other adjacent processor subunits connected to the processor subunit. Therefore, the distributed processor on the chip can be an artificial intelligence accelerator processor.
在卷積計算之特定實例中,可跨越複數個處理器子單元來劃分N×M影像。每一處理器子單元可在其對應塊上執行與A×B濾波器的卷積。為了對塊之間的邊界上的一或多個像素執行濾波,每一處理器子單元可能需要來自相鄰處理器子單元之資料,該等相鄰處理器子單元具有包括同一邊界上之像素的塊。因此,針對每一處理器子單元產生之程式碼組態該子單元以計算卷積,且每當需要來自鄰近子單元之資料時便自第二複數個匯流排取得。將資料輸出至第二複數個匯流排之對應命令被提供至該子單元以確保所需資料傳送之適當時序。 In a specific example of a convolution calculation, an N×M image may be divided across a plurality of processor subunits. Each processor subunit may perform a convolution with an A×B filter on its corresponding block. To perform the filtering on one or more pixels on a boundary between blocks, each processor subunit may require data from neighboring processor subunits whose blocks include pixels on the same boundary. Accordingly, the code generated for each processor subunit configures the subunit to calculate the convolutions and to pull data from the second plurality of buses whenever data is needed from an adjacent subunit. Corresponding commands to output data to the second plurality of buses are provided to the subunits to ensure proper timing of the needed data transfers.
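The boundary exchange described above can be sketched in plain Python. Several simplifications are assumed here and are not taken from the disclosure: the image is split into horizontal strips rather than two-dimensional blocks, borders are zero-padded, the filter has an odd height, and the "bus transfer" is modelled as a list slice. The function names are hypothetical.

```python
def convolve_same(image, kernel):
    """Zero-padded, same-size 2D filtering (correlation form) on lists of lists."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - kh // 2, c + j - kw // 2
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def tiled_convolve(image, kernel, n_subunits):
    """Give each subunit a strip of rows plus halo rows from its neighbors."""
    h = len(image)
    halo = len(kernel) // 2              # boundary rows needed from each side
    rows_per = -(-h // n_subunits)       # ceiling division
    result = []
    for k in range(n_subunits):
        start, end = k * rows_per, min((k + 1) * rows_per, h)
        if start >= end:
            break
        top = max(start - halo, 0)       # rows pulled from the strip above
        bottom = min(end + halo, h)      # rows pulled from the strip below
        local = image[top:bottom]        # own strip plus halo
        out = convolve_same(local, kernel)
        result.extend(out[start - top:start - top + (end - start)])
    return result
```

Each strip's output matches the whole-image result because the halo supplies exactly the neighbor rows the filter reaches into. In hardware terms, those halo rows are the data a subunit would pull over the second plurality of buses at scripted times.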
圖7A之部分塊連接可修改為N部分塊連接。在此修改中,第二複數個匯流排可進一步將每一處理器子單元連接至在圖7A之匯流排運行所沿的四個方向(亦即,上、下、左及右)上處於該處理器子單元之臨限距離內(例如,處於n個處理器子單元內)的處理器子單元。可對完全塊連接進行類似修改(以產生N完全塊連接),使得第二複數個匯流排進一步將每一處理器子單元連接至在除兩個對角線方向以外的圖7A之匯流排運行所沿的四個方向上處於該處理器子單元之臨限距離內(例如,處於n個處理器子單元內)的處理器子單元。 The partial block connection of FIG. 7A may be modified into an N-partial block connection. In this modification, the second plurality of buses may further connect each processor subunit to processor subunits within a threshold distance of the processor subunit (e.g., within n processor subunits) in the four directions along which the buses of FIG. 7A run (i.e., up, down, left, and right). A similar modification may be made to the full block connection (to produce an N-full block connection) such that the second plurality of buses further connects each processor subunit to processor subunits within a threshold distance of the processor subunit (e.g., within n processor subunits) in the four directions along which the buses of FIG. 7A run as well as in the two diagonal directions.
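Other configurations are possible. For example, in the configuration shown in FIG. 7B, bus 750a connects processor subunit 730a to processor subunit 730d, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730c, and bus 750d connects processor subunit 730c to processor subunit 730d. Accordingly, in the example configuration shown in FIG. 7B, the plurality of processor subunits are arranged in a star pattern. The second plurality of buses connect each processor subunit to at least one adjacent processor subunit within the star pattern.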
Other configurations (not shown) are possible. For example, a neighbor connection configuration may be used, in which the plurality of processor subunits are arranged in one or more rows (e.g., similar to the arrangement depicted in FIG. 7A). In a neighbor connection configuration, the second plurality of buses connect each processor subunit to the processor subunit to its left in the same row, to the processor subunit to its right in the same row, to both the processor subunits to its left and to its right in the same row, and so on.
在另一實例中,可使用N線性連接配置。在N線性連接配置中,第二複數個匯流排將每一處理器子單元連接至處於該處理器子單元之臨限距離內(例如,處於n個處理器子單元內)的處理器子單元。N線性連接配置可與線形陣列(上文所描述)、矩形陣列(圖7A中所描繪)、橢圓形陣列(圖7B中所描繪)或任何其他幾何形狀陣列一起使用。 In another example, an N linear connection configuration can be used. In the N linear connection configuration, the second plurality of buses connect each processor subunit to the processor subunit within the threshold distance of the processor subunit (for example, within n processor subunits) . The N linear connection configuration can be used with linear arrays (described above), rectangular arrays (depicted in FIG. 7A), elliptical arrays (depicted in FIG. 7B), or any other geometric arrays.
In yet another example, an N logarithmic connection configuration may be used. In an N logarithmic connection configuration, the second plurality of buses connect each processor subunit to the processor subunits within a threshold power-of-two distance of that processor subunit (e.g., within 2^n processor subunits). An N logarithmic connection configuration may be used with a linear array (described above), a rectangular array (depicted in FIG. 7A), an elliptical array (depicted in FIG. 7B), or an array of any other geometric shape.
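The connection configurations described above can be compared by listing which subunits each one links to a given subunit. The Python sketch below is one possible reading, not a definitive one: the function names are hypothetical, grid positions are (row, column) pairs, and the N logarithmic case is interpreted as links at distances 1, 2, 4, ..., 2^n, which is an assumption since the text only states a power-of-two threshold distance.

```python
def _in_grid(r, c, rows, cols):
    return 0 <= r < rows and 0 <= c < cols


def n_partial_block(r, c, rows, cols, n):
    """Subunits within n steps of (r, c) along up, down, left, and right."""
    steps = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return {(r + d * dr, c + d * dc)
            for d in range(1, n + 1) for dr, dc in steps
            if _in_grid(r + d * dr, c + d * dc, rows, cols)}


def n_complete_block(r, c, rows, cols, n):
    """As n_partial_block, plus the diagonal directions."""
    steps = ((-1, 0), (1, 0), (0, -1), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1))
    return {(r + d * dr, c + d * dc)
            for d in range(1, n + 1) for dr, dc in steps
            if _in_grid(r + d * dr, c + d * dc, rows, cols)}


def n_linear(i, size, n):
    """Subunits within n positions of index i in a one-dimensional ordering."""
    return {j for j in range(i - n, i + n + 1) if j != i and 0 <= j < size}


def n_logarithmic(i, size, n):
    """Assumed reading: links at power-of-two distances up to 2**n."""
    out = set()
    for k in range(n + 1):
        for j in (i - 2 ** k, i + 2 ** k):
            if 0 <= j < size:
                out.add(j)
    return out
```

For a subunit in the middle of a 5×5 grid with n = 1, the partial variant yields four neighbors and the complete variant eight, which mirrors the distinction between the two block connections.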
可組合上文所描述之連接方案中之任一者以供用於同一硬體晶片中。舉例而言,可在一個區中使用完全塊連接,而在另一區中使用部分塊連接。在另一實例中,可在一個區中使用N線性連接配置,而在另一區中使用N完全塊連接。 Any of the connection schemes described above can be combined for use in the same hardware chip. For example, a complete block connection can be used in one zone, and a partial block connection can be used in another zone. In another example, an N linear connection configuration can be used in one zone, while an N complete block connection can be used in another zone.
Instead of or in addition to dedicated buses between the processor subunits of the memory chip, one or more shared buses may be used to interconnect all of the processor subunits of the distributed processor (or a subset of them). Collisions on the shared buses may still be avoided by timing the data transfers on those buses using code executed by the processor subunits, as explained further below. In addition to or instead of shared buses, configurable buses may be used to dynamically connect processor subunits into groups of processor units that are connected to separate buses. For example, a configurable bus may include transistors or other mechanisms that can be controlled by a processor subunit to direct data transfers to a selected processor subunit.
在圖7A及圖7B兩者中,處理陣列之複數個處理器子單元在空間上分佈於記憶體陣列之複數個離散記憶體組當中。在其他替代實施例(未圖示)中,複數個處理器子單元可叢集於基板之一或多個區中,且複數個記憶體組可叢集於基板之一或多個其他區中。在一些實施例中,可使用空間分佈與叢集之組合(未圖示)。舉例而言,基板之一個區可包括處理器子單元之叢集,基板之另一區可包括記憶體組之叢集,且基板之又一區可包括分佈於記憶體組當中之處理陣列。 In both FIGS. 7A and 7B, the plurality of processor subunits of the processing array are spatially distributed among the plurality of discrete memory groups of the memory array. In other alternative embodiments (not shown), a plurality of processor subunits may be clustered in one or more regions of the substrate, and a plurality of memory groups may be clustered in one or more other regions of the substrate. In some embodiments, a combination of spatial distribution and clustering (not shown) may be used. For example, one area of the substrate may include a cluster of processor subunits, another area of the substrate may include a cluster of memory banks, and another area of the substrate may include a processing array distributed among the memory banks.
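One of ordinary skill in the art will recognize that arranging processing groups 600 in an array on the substrate is not the only possible embodiment. For example, each processor subunit may be associated with at least two dedicated memory banks. Accordingly, processing groups 310a, 310b, 310c, and 310d of FIG. 3B may be used in place of, or in combination with, processing group 600 to form the processing array and the memory array. Other processing groups (not shown) that include, for example, three, four, or more than four dedicated memory banks may also be used.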
複數個處理器子單元中之每一者可經組態以相對於包括於複數個處理器子單元中之其他處理器子單元獨立地執行與特定應用程式相關聯之軟體程式碼。舉例而言,如下文所解釋,指令之複數個子系列可分組為機器碼且被提供至每一處理器子單元以供執行。 Each of the plurality of processor subunits can be configured to independently execute software code associated with a specific application program relative to other processor subunits included in the plurality of processor subunits. For example, as explained below, multiple sub-series of instructions can be grouped into machine code and provided to each processor sub-unit for execution.
在一些實施例中,每一專用記憶體組包含至少一個動態隨機存取記憶體(DRAM)。替代地,記憶體組可包含諸如靜態隨機存取記憶體(SRAM)、DRAM、快閃記憶體或其類似者之記憶體類型的混合。 In some embodiments, each dedicated memory bank includes at least one dynamic random access memory (DRAM). Alternatively, the memory bank may include a mixture of memory types such as static random access memory (SRAM), DRAM, flash memory, or the like.
在習知處理器中,處理器子單元之間的資料共用通常藉由共用記憶體來執行。共用記憶體通常需要大部分晶片面積及/或執行由額外硬體(諸如,仲裁器)管理之匯流排。如上文所描述,該匯流排造成瓶頸。此外,可在晶片外部之共用記憶體通常包括快取一致性機制及更複雜的快取記憶體(例如,L1快取記憶體、L2快取記憶體及共用DRAM),以便將準確且最新的資料提供至處理器子單元。如下文進一步所解釋,圖7A及圖7B中所描繪之專用匯流排允許無硬體管理(諸如,仲裁器)之硬體晶片。此外,使用如圖7A及圖7B中所描繪之專用記憶體允許消除複雜的快取層及一致性機制。 In conventional processors, data sharing between processor subunits is usually performed by sharing memory. Shared memory usually requires most of the chip area and/or implementation of a bus managed by additional hardware (such as an arbiter). As described above, this bus bar creates a bottleneck. In addition, the shared memory that can be external to the chip usually includes a cache coherency mechanism and more complex cache memory (for example, L1 cache memory, L2 cache memory, and shared DRAM), so that the accurate and up-to-date The data is provided to the processor sub-unit. As explained further below, the dedicated bus depicted in FIGS. 7A and 7B allows for hardware chips without hardware management (such as an arbiter). In addition, the use of dedicated memory as depicted in Figures 7A and 7B allows the elimination of complex cache layers and coherency mechanisms.
Rather, to allow each processor subunit to access data calculated by other processor subunits and/or stored in memory banks dedicated to other processor subunits, buses are provided whose timing is controlled dynamically by code that each processor subunit executes individually. This allows most, if not all, of the bus management hardware conventionally used to be eliminated. Moreover, direct transfers over these buses replace complex caching mechanisms and thereby reduce latency during memory reads and writes.
基於記憶體之處理陣列 Memory-based processing array
As depicted in FIGS. 7A and 7B, a memory chip of the present invention may operate independently. Alternatively, a memory chip of the present invention may be operably connected to one or more additional integrated circuits, such as a memory device (e.g., one or more DRAM banks), a system-on-a-chip, a field-programmable gate array (FPGA), or another processing and/or memory chip. In such embodiments, the tasks in a series of instructions executed by the architecture may be divided (e.g., by a compiler, as described below) between the processor subunits of the memory chip and any processor subunits of the additional integrated circuits. For example, the other integrated circuits may include a host (e.g., host 350 of FIG. 3A) that inputs instructions and/or data to the memory chip and receives output from it.
To interconnect a memory chip of the present invention with one or more additional integrated circuits, the memory chip may include a memory interface, such as one conforming to a Joint Electron Device Engineering Council (JEDEC) standard or any of its variants. The one or more additional integrated circuits may then connect to that memory interface. Accordingly, if the one or more additional integrated circuits are connected to a plurality of memory chips of the present invention, data may be shared between the memory chips through the one or more additional integrated circuits. Additionally or alternatively, the one or more additional integrated circuits may include buses that connect to buses on the memory chip of the present invention, so that the one or more additional integrated circuits can execute code in tandem with the memory chip. In such embodiments, the one or more additional integrated circuits further assist with distributed processing, even though they may be on different substrates from the memory chip of the present invention.
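In addition, memory chips of the present invention may be arranged in an array to form an array of distributed processors. For example, one or more buses may connect memory chip 770a to an additional memory chip 770b, as depicted in FIG. 7C. In the example of FIG. 7C, memory chip 770a includes processor subunits and one or more corresponding memory banks dedicated to each processor subunit: processor subunit 730a is associated with memory bank 720a, processor subunit 730b with memory bank 720b, processor subunit 730e with memory bank 720c, and processor subunit 730f with memory bank 720d. A bus connects each processor subunit to its corresponding memory bank. Accordingly, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730e to memory bank 720c, and bus 740d connects processor subunit 730f to memory bank 720d. In addition, bus 750a connects processor subunit 730a to processor subunit 750e, bus 750b connects processor subunit 730a to processor subunit 750b, bus 750c connects processor subunit 730b to processor subunit 750f, and bus 750d connects processor subunit 730e to processor subunit 750f. Other configurations of memory chip 770a may be used, for example, as described above.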
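Similarly, memory chip 770b includes processor subunits and one or more corresponding memory banks dedicated to each processor subunit: processor subunit 730c is associated with memory bank 720e, processor subunit 730d with memory bank 720f, processor subunit 730g with memory bank 720g, and processor subunit 730h with memory bank 720h. A bus connects each processor subunit to its corresponding memory bank. Accordingly, bus 740e connects processor subunit 730c to memory bank 720e, bus 740f connects processor subunit 730d to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. In addition, bus 750g connects processor subunit 730c to processor subunit 750g, bus 750h connects processor subunit 730d to processor subunit 750h, bus 750i connects processor subunit 730c to processor subunit 750d, and bus 750j connects processor subunit 730g to processor subunit 750h. Other configurations of memory chip 770b may be used, for example, as described above.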
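The processor subunits of memory chips 770a and 770b may be connected using one or more buses. Accordingly, in the example of FIG. 7C, bus 750e may connect processor subunit 730b of memory chip 770a with processor subunit 730c of memory chip 770b, and bus 750f may connect processor subunit 730f of memory chip 770a with processor subunit 730c of memory chip 770b. For example, bus 750e may serve as an input bus to memory chip 770b (and thus as an output bus of memory chip 770a), while bus 750f may serve as an input bus to memory chip 770a (and thus as an output bus of memory chip 770b), or vice versa. Alternatively, buses 750e and 750f may both serve as bidirectional buses between memory chips 770a and 770b.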
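Buses 750e and 750f may include direct wires or may be interleaved on a high-speed connection in order to reduce the pins used for the inter-chip interface between memory chip 770a and integrated circuit 770b. Moreover, any of the connection configurations described above for use within the memory chip itself may be used to connect the memory chip to one or more additional integrated circuits. For example, memory chips 770a and 770b may be connected using a complete block or partial block connection rather than only the two buses shown in FIG. 7C.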
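Accordingly, although depicted using buses 750e and 750f, architecture 760 may include fewer buses or additional buses. For example, a single bus may be used between processor subunits 730b and 730c or between processor subunits 730f and 730c. Alternatively, additional buses may be used, for example between processor subunits 730b and 730d, between processor subunits 730f and 730d, or the like.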
Furthermore, although depicted as using a single memory chip and an additional integrated circuit, a plurality of memory chips may be connected using buses, as explained above. For example, as depicted in the example of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in an array. Like the memory chips described above, each memory chip includes processor subunits and dedicated memory banks. A description of these components is therefore not repeated here.
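In the example of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in a loop. Accordingly, bus 750a connects memory chips 770a and 770d, bus 750c connects memory chips 770a and 770b, bus 750e connects memory chips 770b and 770c, and bus 750g connects memory chips 770c and 770d. Although memory chips 770a, 770b, 770c, and 770d could be connected using a complete block connection, a partial block connection, or another connection configuration, the example of FIG. 7C allows for fewer pin connections between memory chips 770a, 770b, 770c, and 770d.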
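The trade-off of the loop arrangement can be made concrete with a small Python sketch. It counts chip-to-chip links as a stand-in for pin connections (an assumption, since pin counts per link are not given) and computes how many hops separate two chips on the loop. The helper names `ring_hops` and `link_counts` are hypothetical.

```python
# Loop order taken from the text: 770a-770b-770c-770d-770a.
CHIPS = ["770a", "770b", "770c", "770d"]


def ring_hops(src, dst, chips=CHIPS):
    """Fewest chip-to-chip hops between two chips connected in a loop."""
    i, j = chips.index(src), chips.index(dst)
    d = (j - i) % len(chips)
    return min(d, len(chips) - d)


def link_counts(num_chips):
    """Return (links in a loop, links if every chip pair were wired directly)."""
    ring = num_chips if num_chips > 2 else max(num_chips - 1, 0)
    full = num_chips * (num_chips - 1) // 2
    return ring, full
```

With four chips the loop uses four links where a pairwise wiring would use six, at the cost of two hops between opposite chips such as 770a and 770c.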
相對較大的記憶體 Relatively large memory
本發明之實施例可使用大小與習知處理器之共用記憶體相比相對較大的專用記憶體。使用專用記憶體而非共用記憶體允許繼續獲得效率增益而不會隨著記憶體增加而逐漸減少。此允許諸如神經網路處理及資料庫查詢之記憶體密集型任務比在習知處理器中更高效地執行,在習知處理器中,共用記憶體增加之效率增益由於馮諾伊曼瓶頸而逐漸減少。 The embodiments of the present invention can use a dedicated memory that is relatively larger in size than the shared memory of a conventional processor. Using dedicated memory instead of shared memory allows continued efficiency gains without gradual decrease as memory increases. This allows memory-intensive tasks such as neural network processing and database query to be performed more efficiently than in conventional processors. In conventional processors, the increased efficiency gain of shared memory is due to the von Neumann bottleneck. gradually decreases.
For example, in a distributed processor of the present invention, a memory array disposed on the substrate of the distributed processor may include a plurality of discrete memory banks, each of which may have a capacity greater than one megabyte. The distributed processor may also include a processing array disposed on the substrate, the processing array including a plurality of processor subunits. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of the plurality of discrete memory banks. In some embodiments, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. By using dedicated memories of at least one megabyte, rather than the few megabytes of shared cache found in a large CPU or GPU, the distributed processor of the present invention can achieve efficiencies that are not attainable in conventional systems because of the von Neumann bottleneck in CPUs and GPUs.
不同記憶體可用作專用記憶體。舉例而言,每一專用記憶體組可包含至少一個DRAM組。替代地,每一專用記憶體組可包含至少一個靜態隨機存取記憶體組。在其他實施例中,不同類型之記憶體可在單個硬體晶片上組合。 Different memories can be used as dedicated memories. For example, each dedicated memory bank may include at least one DRAM bank. Alternatively, each dedicated memory bank may include at least one static random access memory bank. In other embodiments, different types of memory can be combined on a single hardware chip.
As explained above, each dedicated memory may be at least one megabyte. Accordingly, each dedicated memory bank may be the same size, or at least two of the plurality of memory banks may have different sizes.
In addition, as described above, the distributed processor may include a first plurality of buses, each connecting one of the plurality of processor subunits to a corresponding dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
使用軟體之同步 Synchronization using software
如上文所解釋,本發明之硬體晶片可使用軟體而非硬體來管理資料傳送。特定而言,因為匯流排上之傳送、對記憶體進行之讀取及寫入以及處理器子單元之計算的時序係藉由處理器子單元所執行的指令之子系列設定,所以本發明之硬體晶片可執行程式碼以防止匯流排上之衝突。因此,本發明之硬體晶片可避免習知地用以管理資料傳送之硬體機構(諸如,晶片內之網路控制器、處理器子單元之間的封包剖析器及封包傳送器、匯流排仲裁器、用以避免仲裁的複數個匯流排,或其類似者)。 As explained above, the hardware chip of the present invention can use software instead of hardware to manage data transmission. In particular, because the timing of the transmission on the bus, the reading and writing of the memory, and the calculation of the processor sub-unit is set by the sub-series of instructions executed by the processor sub-unit, the hardware of the present invention is The body chip can execute code to prevent conflicts on the bus. Therefore, the hardware chip of the present invention can avoid the conventional hardware mechanism used to manage data transmission (such as the network controller in the chip, the packet parser and the packet transmitter between the processor sub-units, and the bus Arbiter, multiple buses used to avoid arbitration, or the like).
If a hardware chip of the present invention transferred data conventionally, connecting N processor subunits with buses would require bus arbitration or wide multiplexers (MUXs) controlled by an arbiter. Instead, as described above, embodiments of the present invention may use buses between the processor subunits that are nothing more than wires, optical cables, or the like, with the processor subunits individually executing code to avoid collisions on the buses. Embodiments of the present invention may thereby save space on the substrate, reduce material costs, and avoid efficiency losses (e.g., power and time consumed by arbitration). The efficiency and space gains are even greater when compared with other architectures that use first-in-first-out (FIFO) controllers and/or mailboxes.
此外,如上文所解釋,除一或多個處理元件以外,每一處理器子單元亦可包括一或多個加速器。在一些實施例中,加速器可自匯流排而非自處理元件進行讀取及寫入。在此等實施例中,可藉由允許加速器在處理元件執行一或多個計算之同一循環期間傳輸資料來獲得額外效率。然而,此等實施例需要用於加速器之額外材料。舉例而言,可能需要額外電晶體以用於製造加速器。 In addition, as explained above, in addition to one or more processing elements, each processor sub-unit may also include one or more accelerators. In some embodiments, the accelerator can read and write from the bus instead of the processing element. In these embodiments, additional efficiency can be obtained by allowing the accelerator to transmit data during the same cycle in which the processing element performs one or more calculations. However, these embodiments require additional materials for the accelerator. For example, additional transistors may be required for manufacturing accelerators.
The code may also take into account the internal behavior of the processor subunits (e.g., of the processing elements and/or accelerators forming part of a processor subunit), including timing and latency. For example, a compiler (as described below) may perform preprocessing that accounts for timing and latency when generating the sub-series of instructions that control data transfers.
In one example, a plurality of processor subunits may be assigned the task of computing a neural network layer containing a plurality of neurons, each fully connected to a previous layer containing a larger plurality of neurons. Assuming the data of the previous layer is spread evenly among the plurality of processor subunits, one way to perform the computation is to configure each processor subunit to transmit its portion of the previous layer's data onto the main bus in turn, after which each processor subunit multiplies that data by the weights of the corresponding neurons that the subunit implements. Because each processor subunit computes more than one neuron, each processor subunit transmits the previous layer's data a number of times equal to the number of neurons. The code for each processor subunit therefore differs from the code for the other processor subunits, because the subunits transmit at different times.
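The round-robin scheme in this example can be sketched as a small simulation. In the Python below, the function name `fully_connected`, the data layout (`prev_chunks`, `weights`), and the slot counter are all assumptions introduced for illustration. One subunit drives the shared bus per time slot, and the whole round of transmissions is repeated once per neuron that each subunit implements.

```python
def fully_connected(prev_chunks, weights):
    """
    prev_chunks[s]   : portion of the previous layer held by subunit s
    weights[s][k][j] : weight applied by neuron k of subunit s to global input j
    Returns (out, bus_log), where out[s][k] is the pre-activation sum of
    neuron k on subunit s and bus_log lists (slot, sender) pairs.
    """
    num_sub = len(prev_chunks)
    # Global index at which each subunit's portion of the previous layer starts.
    offsets = [0]
    for chunk in prev_chunks:
        offsets.append(offsets[-1] + len(chunk))

    neurons_per_sub = len(weights[0])
    out = [[0.0] * neurons_per_sub for _ in range(num_sub)]
    bus_log, slot = [], 0

    for k in range(neurons_per_sub):        # one full round per neuron index
        for sender in range(num_sub):       # exactly one sender per time slot
            bus = prev_chunks[sender]       # data currently on the main bus
            bus_log.append((slot, sender))
            base = offsets[sender]
            for s in range(num_sub):        # every subunit listens and accumulates
                out[s][k] += sum(weights[s][k][base + j] * x
                                 for j, x in enumerate(bus))
            slot += 1
    return out, bus_log
```

Because the sender for each slot is fixed in advance, no two subunits drive the bus in the same slot, and each subunit appears in the log once per neuron, consistent with the description above.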
在一些實施例中,分散式處理器可包含:基板(例如,諸如矽之半導體基板及/或諸如可撓性電路板之電路板);安置於該基板上之記憶體陣列,該記憶體陣列包括複數個離散記憶體組;及安置於該基板上之處理陣列,該處理陣列包括複數個處理器子單元,如描繪於例如圖7A及圖7B中。如上文所解釋,該等處理器子單元中之每一者可與該等複數個離散記憶體組中之對應的專用記憶體組相關聯。此外,如描繪於例如圖7A及圖7B中,分散式處理器可進一步包含複數個匯流排,該等複數個匯流排中之每一者將該等複數個處理器子單元中之一者連接至該等複數個處理器子單元中之至少另一者。 In some embodiments, the distributed processor may include: a substrate (for example, a semiconductor substrate such as silicon and/or a circuit board such as a flexible circuit board); a memory array disposed on the substrate, the memory array It includes a plurality of discrete memory groups; and a processing array disposed on the substrate. The processing array includes a plurality of processor subunits, as depicted in, for example, FIG. 7A and FIG. 7B. As explained above, each of the processor subunits can be associated with a corresponding dedicated memory group among the plurality of discrete memory groups. In addition, as depicted in, for example, FIGS. 7A and 7B, the distributed processor may further include a plurality of bus bars, each of which is connected to one of the plurality of processor subunits To at least another of the plurality of processor subunits.
如上文所解釋,該等複數個匯流排可用軟體來控制。因此,該等複數個匯流排可能不含時序硬體邏輯組件,使得在處理器子單元之間及跨越該等複數個匯流排中之對應者的資料傳送不受時序硬體邏輯組件控制。在一個實例中,該等複數個匯流排可能不含匯流排仲裁器,使得在處理器子單元之間及跨越該等複數個匯流排中之對應者的資料傳送不受匯流排仲裁器控制。 As explained above, these multiple buses can be controlled by software. Therefore, the plurality of buses may not contain sequential hardware logic components, so that the data transmission between the processor subunits and across the corresponding ones of the plurality of buses is not controlled by the sequential hardware logic components. In one example, the plurality of buses may not contain a bus arbiter, so that data transmission between processor subunits and across corresponding ones of the plurality of buses is not controlled by the bus arbiter.
在一些實施例中,如描繪於例如圖7A及圖7B中,分散式處理器可進一步包含第二複數個匯流排,該等第二複數個匯流排將複數個處理器子單元中之一者連接至對應的專用記憶體組。類似於上文所描述之複數個匯流排,第二複數個匯流排可能不含時序硬體邏輯組件,使得處理器子單元與對應的專用記憶體組之間的資料傳送不受時序硬體邏輯組件控制。在一個實例中,第二複數個匯流排可能不含匯流排仲裁器,使得處理器子單元與對應的專用記憶體組之間的資料傳送不受匯流排仲裁器控制。 In some embodiments, as depicted in, for example, FIG. 7A and FIG. 7B, the distributed processor may further include a second plurality of buses, and the second plurality of buses may be one of the plurality of processor subunits. Connect to the corresponding dedicated memory bank. Similar to the plurality of buses described above, the second plurality of buses may not contain sequential hardware logic components, so that the data transfer between the processor subunit and the corresponding dedicated memory bank is not affected by the sequential hardware logic. Component control. In one example, the second plurality of buses may not contain a bus arbiter, so that the data transmission between the processor subunit and the corresponding dedicated memory bank is not controlled by the bus arbiter.
As used herein, the phrase "does not contain" does not necessarily imply the absolute absence of components such as sequential hardware logic components (e.g., bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like). Such components may still be included in a hardware chip described as "not containing" them. Rather, the phrase refers to the function of the hardware chip: a hardware chip that "does not contain" sequential hardware logic components controls the timing of its data transfers without using any such components that may be included in it. For example, such a hardware chip executes code including sub-series of instructions that control data transfers between its processor subunits, even if the chip also includes sequential hardware logic components as a secondary precaution against collisions caused by errors in the executed code.
如上文所解釋,複數個匯流排可包含介於複數個處理器子單元中之對應者之間的導線或光纖中之至少一者。因此,在一個實例中,不含時序硬體邏輯組件之分散式處理器可僅包括導線或光纖,而無匯流排仲裁器、仲裁樹、FIFO控制器、信箱或其類似者。 As explained above, the plurality of bus bars may include at least one of wires or optical fibers between corresponding ones of the plurality of processor subunits. Therefore, in one example, a distributed processor without sequential hardware logic components may only include wires or fibers, and no bus arbiter, arbitration tree, FIFO controller, mailbox, or the like.
In some embodiments, the plurality of processor subunits are configured to transfer data across at least one of the plurality of buses in accordance with code executed by the plurality of processor subunits. Accordingly, as explained below, a compiler may organize sub-series of instructions, each sub-series comprising code executed by a single processor subunit. The sub-series of instructions may tell a processor subunit when to transfer data onto one of the buses and when to retrieve data from a bus. When the sub-series are executed in tandem across the distributed processor, the timing of transfers between processor subunits is governed by the transfer and retrieval instructions included in the sub-series. The code thus dictates the timing of data transfers across at least one of the plurality of buses. The compiler may generate code to be executed by a single processor subunit, or code to be executed by a group of processor subunits. In some cases, the compiler may treat all of the processor subunits together as if they were one super-processor (e.g., a distributed processor) and generate code for execution by that super-processor/distributed processor.
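To make the idea of compiler-assigned timing more tangible, the following Python sketch turns a list of transfers into per-subunit sub-series of timed instructions and then checks that no bus is driven twice in the same cycle. It is a toy model under stated assumptions: the names `compile_transfers` and `has_bus_conflict`, the `(cycle, op, bus)` instruction tuples, and the one-cycle cost per transfer are hypothetical and not taken from the disclosure.

```python
from collections import defaultdict


def compile_transfers(transfers):
    """
    transfers: list of (src_subunit, dst_subunit, bus) in program order.
    Returns {subunit: [(cycle, op, bus), ...]} with op in {"SEND", "RECV"}.
    Each transfer is placed in the earliest cycle in which the bus and both
    subunits are free (assumed cost: one cycle per transfer).
    """
    bus_free = defaultdict(int)
    unit_free = defaultdict(int)
    programs = defaultdict(list)
    for src, dst, bus in transfers:
        cycle = max(bus_free[bus], unit_free[src], unit_free[dst])
        programs[src].append((cycle, "SEND", bus))
        programs[dst].append((cycle, "RECV", bus))
        bus_free[bus] = unit_free[src] = unit_free[dst] = cycle + 1
    return dict(programs)


def has_bus_conflict(programs):
    """True if two SEND instructions target the same bus in the same cycle."""
    seen = set()
    for program in programs.values():
        for cycle, op, bus in program:
            if op != "SEND":
                continue
            if (cycle, bus) in seen:
                return True
            seen.add((cycle, bus))
    return False
```

Since the cycle numbers are fixed before execution, the subunits need no arbiter at run time in this model; a conflict check of this kind would be run at compile time only.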
如上文所解釋且如圖7A及圖7B中所描繪,複數個處理器子單元可在空間上分佈於記憶體陣列內之複數個離散記憶體組當中。替代地,複數個處理器子單元可叢集於基板之一或多個區中,且複數個記憶體組可叢集於基板之一或多個其他區中。在一些實施例中,可使用空間分佈與叢集之組合,如上文所解釋。 As explained above and as depicted in FIGS. 7A and 7B, a plurality of processor subunits may be spatially distributed among a plurality of discrete memory groups in the memory array. Alternatively, a plurality of processor subunits may be clustered in one or more regions of the substrate, and a plurality of memory groups may be clustered in one or more other regions of the substrate. In some embodiments, a combination of spatial distribution and clustering may be used, as explained above.
在一些實施例中,分散式處理器可包含基板(例如,包括矽之半導體基板及/或諸如可撓性電路板之電路板),該基板具有安置於其上之記憶體陣列,該記憶體陣列包括複數個離散記憶體組。處理陣列亦可安置於基板上,該處理陣列包括複數個處理器子單元,如描繪於例如圖7A及圖7B中。如上文所解釋,該等處理器子單元中之每一者可與該等複數個離散記憶體組中之對應的專用記憶體組相關聯。此外,如描繪於例如圖7A及圖7B中,該分散式處理器可進一步包含複數個匯流排,該等複數個匯流排中之每一者將該等複數個處理器子單元中之一者連接至該等複數個離散記憶體組中之對應的專用記憶體組。 In some embodiments, the distributed processor may include a substrate (for example, a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board), the substrate having a memory array disposed thereon, and the memory The array includes a plurality of discrete memory groups. The processing array can also be disposed on the substrate, and the processing array includes a plurality of processor subunits, as depicted in, for example, FIG. 7A and FIG. 7B. As explained above, each of the processor subunits can be associated with a corresponding dedicated memory group among the plurality of discrete memory groups. In addition, as depicted in, for example, FIGS. 7A and 7B, the distributed processor may further include a plurality of bus bars, each of the plurality of bus bars is one of the plurality of processor subunits Connected to the corresponding dedicated memory group among the plurality of discrete memory groups.
As explained above, the plurality of buses may be controlled in software. Accordingly, the plurality of buses may not contain sequential hardware logic components, so that data transfers between a processor subunit and its corresponding dedicated discrete memory bank, and across a corresponding one of the plurality of buses, are not controlled by sequential hardware logic components. In one example, the plurality of buses may not contain bus arbiters, so that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by bus arbiters.
在一些實施例中,如描繪於例如圖7A及圖7B中,分散式處理器可進一步包含第二複數個匯流排,該等第二複數個匯流排將該等複數個處理器子單元中之一者連接至該等複數個處理器子單元中之至少另一者。類似於上文所描述之複數個匯流排,第二複數個匯流排可能不含時序硬體邏輯組件,使得處理器子單元與對應的專用記憶體組之間的資料傳送不受時序硬體邏輯組件控制。在一個實例中,第二複數個匯流排可能不含匯流排仲裁器,使得處理器子單元與對應的專用記憶體組之間的資料傳送不受匯流排仲裁器控制。 In some embodiments, as depicted in, for example, FIG. 7A and FIG. 7B, the distributed processor may further include a second plurality of buses, and the second plurality of buses may be one of the plurality of processor subunits. One is connected to at least another of the plurality of processor subunits. Similar to the plurality of buses described above, the second plurality of buses may not contain sequential hardware logic components, so that the data transfer between the processor subunit and the corresponding dedicated memory bank is not affected by the sequential hardware logic. Component control. In one example, the second plurality of buses may not contain a bus arbiter, so that the data transmission between the processor subunit and the corresponding dedicated memory bank is not controlled by the bus arbiter.
在一些實施例中,分散式處理器可使用軟體時序組件與硬體時序組件之組合。舉例而言,分散式處理器可包含基板(例如,包括矽之半導體基板及/或諸如可撓性電路板之電路板),該基板具有安置於其上之記憶體陣列,該記憶體陣列包括複數個離散記憶體組。處理陣列亦可安置於基板上,該處理陣列包括複數個處理器子單元,如描繪於例如圖7A及圖7B中。如上文所解釋,該等處理器子單元中之每一者可與該等複數個離散記憶體組中之對應的專用記憶體組相關聯。此外,如描繪於例如圖7A及圖7B中,分散式處理器可進一步包含複數個匯流排,該等複數個匯流排中之每一者將該等複數個處理器子單元中之一者連接至該等複數個處理器子單元中之至少另一者。此外,如上文所解釋,該等複數個處理器子單元可經組態以執行軟體,該軟體控制跨越該等複數個匯流排之資料傳送的時序,以避免與該等複數個匯流排中之至少一者上的資料傳送衝突。在此實例中,軟體可控制資料傳送之時序,但傳送本身可至少部分地由一或多個硬體組件控制。 In some embodiments, distributed processors can use a combination of software timing components and hardware timing components. For example, a distributed processor may include a substrate (for example, a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board) having a memory array disposed thereon, the memory array including Multiple discrete memory groups. The processing array can also be disposed on the substrate, and the processing array includes a plurality of processor subunits, as depicted in, for example, FIG. 7A and FIG. 7B. As explained above, each of the processor subunits can be associated with a corresponding dedicated memory group among the plurality of discrete memory groups. In addition, as depicted in, for example, FIGS. 7A and 7B, the distributed processor may further include a plurality of bus bars, each of which is connected to one of the plurality of processor subunits To at least another of the plurality of processor subunits. In addition, as explained above, the plurality of processor sub-units can be configured to execute software that controls the timing of data transmission across the plurality of buses to avoid interfering with the plurality of buses. Data transfer conflicts on at least one of them. In this example, the software can control the timing of data transmission, but the transmission itself can be controlled at least in part by one or more hardware components.
Division of Code
As explained above, hardware chips of the present disclosure may execute code in parallel across the processor subunits included on the substrate forming the hardware chip. In addition, hardware chips of the present disclosure may perform multitasking. For example, a hardware chip may perform area multitasking, in which one group of processor subunits of the hardware chip executes one task (e.g., audio processing) while another group of processor subunits executes another task (e.g., image processing). In another example, a hardware chip may perform timing multitasking, in which one or more processor subunits of the hardware chip execute one task during a first period of time and another task during a second period of time. A combination of area multitasking and timing multitasking may also be used, such that one task is assigned to a first group of processor subunits during a first period of time while another task is assigned to a second group of processor subunits during the first period of time, after which a third task is assigned, during a second period of time, to processor subunits included in the first group and the second group.
In order to organize machine code for execution on memory chips of the present disclosure, the machine code may be divided among the processor subunits of the memory chip. For example, a processor on a memory chip may comprise a substrate and a plurality of processor subunits disposed on the substrate. The memory chip may further comprise a corresponding plurality of memory banks disposed on the substrate, each of the plurality of processor subunits being connected to at least one dedicated memory bank that is not shared by any other processor subunit of the plurality of processor subunits. Each processor subunit on the memory chip may be configured to execute a series of instructions independently of the other processor subunits. Each series of instructions may be executed by configuring one or more general processing elements of the processor subunit in accordance with code defining the series of instructions, and/or by activating one or more special processing elements (e.g., one or more accelerators) of the processor subunit in accordance with a sequence provided in the code defining the series of instructions.
Accordingly, each series of instructions may define a series of tasks to be performed by a single processor subunit. A single task may comprise an instruction within an instruction set defined by the architecture of one or more processing elements in the processor subunit. For example, the processor subunit may include particular registers, and a single task may push data onto a register, pull data from a register, perform an arithmetic function on data within a register, perform a logic operation on data within a register, or the like. Moreover, the processor subunit may be configured for any number of operands, such as a 0-operand processor subunit (also called a "stack machine"), a 1-operand processor subunit (also called an accumulator machine), a 2-operand processor subunit (such as a RISC), a 3-operand processor subunit (such as a complex instruction set computer (CISC)), or the like. In another example, the processor subunit may include one or more accelerators, and a single task may activate an accelerator to perform a specific function, such as a MAC function, a MAX function, a MAX-0 function, or the like.
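The accelerator functions named above can be illustrated with a minimal Python sketch. The function names (`mac`, `max_fn`, `max0`, `dot_then_max0`) are hypothetical, and reading MAX-0 as "the maximum of a value and zero" is an assumption made for illustration; the text does not define the function further.

```python
def mac(acc, a, b):
    """Multiply-accumulate: add the product a * b to a running total."""
    return acc + a * b


def max_fn(a, b):
    """MAX: return the larger of two values."""
    return a if a >= b else b


def max0(x):
    """MAX-0, assumed here to mean max(x, 0)."""
    return max_fn(x, 0)


def dot_then_max0(weights, inputs):
    """Chain the three functions: accumulate products, then clamp at zero."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc = mac(acc, w, x)
    return max0(acc)


result_positive = dot_then_max0([1, -2, 3], [4, 5, 6])
result_clamped = dot_then_max0([1, -2], [1, 1])
```

In this sketch, each call stands in for a single task that activates one accelerator; a real processor subunit would issue these as instructions rather than function calls.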
The series of instructions may further include tasks for reading from and writing to the dedicated memory banks of the memory chip. For example, a task may include writing a piece of data to the memory bank dedicated to the processor subunit executing the task, reading a piece of data from the memory bank dedicated to the processor subunit executing the task, or the like. In some embodiments, the reading and writing may be performed by the processor subunit in tandem with a controller of the memory bank. For example, the processor subunit may execute a read or write task by sending a control signal to the controller to perform the read or write. In some embodiments, the control signal may include a particular address to use for the read or write. Alternatively, the processor subunit may defer to the memory controller to select an available address for the read or write.
Additionally or alternatively, the reading and writing may be performed by one or more accelerators in tandem with a controller of the memory bank. For example, the accelerators may generate control signals for the memory controller, similar to how the processor subunit generates control signals, as described above.
In any of the embodiments described above, an address generator may also be used to direct reads and writes to specific addresses of a memory bank. For example, the address generator may comprise a processing element configured to generate memory addresses for reads and writes. The address generator may be configured to generate addresses so as to increase efficiency, for example by writing the result of a later calculation to the same address as a result of an earlier calculation that is no longer needed. Accordingly, the address generator may generate the control signals for the memory controller either in response to a command from the processor subunit (e.g., from a processing element included therein or from one or more accelerators therein) or in tandem with the processor subunit. Additionally or alternatively, the address generator may generate addresses based on some configuration or registers, for example generating a nested loop structure that iterates over certain addresses in the memory in a certain pattern.
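The nested-loop address pattern mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name `nested_loop_addresses` and the representation of the configuration as a base address plus a list of `(count, stride)` pairs are hypothetical and are not taken from the disclosure.

```python
from itertools import product


def nested_loop_addresses(base, loops):
    """Yield addresses for a nested-loop access pattern.

    `loops` is a list of (count, stride) pairs, outermost loop first.
    Each address is the base plus the sum of index * stride per loop.
    """
    ranges = [range(count) for count, _ in loops]
    for indices in product(*ranges):
        offset = sum(i * stride for i, (_, stride) in zip(indices, loops))
        yield base + offset


# Outer loop: 2 iterations with stride 16; inner loop: 3 iterations with stride 4.
addresses = list(nested_loop_addresses(base=0x100, loops=[(2, 16), (3, 4)]))
```

With this configuration the generator walks two rows of three elements each, which is the kind of regular pattern that a register-configured address generator could produce without per-access commands from the processor subunit.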
In some embodiments, each series of instructions may comprise a set of machine code defining a corresponding series of tasks. Accordingly, the series of tasks described above may be encapsulated within the machine code comprising the series of instructions. In some embodiments, as explained below with respect to FIG. 8, the series of tasks may be defined by a compiler configured to distribute a higher-level series of tasks among a plurality of logic circuits as a plurality of series of tasks. For example, the compiler may generate the plurality of series of tasks from the higher-level series of tasks such that the processor subunits, each executing its corresponding series of tasks in tandem, perform the same function as outlined by the higher-level series of tasks.
As explained further below, the higher-level series of tasks may comprise a set of instructions written in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise a lower-level series of tasks, each of which comprises a set of instructions written in machine code.
As explained above with respect to FIGS. 7A and 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to at least another of the plurality of processor subunits. Moreover, as explained above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across at least one of the plurality of buses may be predefined by the series of instructions included in a processor subunit connected to the at least one of the plurality of buses. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. Such tasks may be executed by a processing element of the processor subunit or by one or more accelerators included in the processor subunit. In the latter embodiment, the processor subunit may perform a calculation or send a control signal to a corresponding memory bank in the same cycle during which the accelerator pulls data from, or places data on, one of the buses.
In one example, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a sending task, the sending task comprising a command for the processor subunit to write data to the at least one of the plurality of buses. Additionally or alternatively, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a receiving task, the receiving task comprising a command for the processor subunit to read data from the at least one of the plurality of buses.
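The sending and receiving tasks described above can be illustrated with a small cycle-by-cycle simulation. This is a sketch under assumptions, not the disclosed hardware: the `Bus` class, the `run` function, and the per-cycle schedule dictionaries are hypothetical, and the bus is modeled as a single value that carries no arbitration logic, so the only protection against a conflict is that the schedules never place two sends in the same cycle.

```python
class Bus:
    """A bus with no arbiter: at most one write per cycle is tolerated."""

    def __init__(self):
        self.value = None
        self.last_write_cycle = None

    def write(self, cycle, value):
        if self.last_write_cycle == cycle:
            raise RuntimeError(f"bus conflict in cycle {cycle}")
        self.value = value
        self.last_write_cycle = cycle

    def read(self):
        return self.value


def run(schedules, bus, cycles):
    """Execute per-subunit schedules of the form {cycle: ("send", v) or ("recv",)}."""
    received = {unit: [] for unit in schedules}
    for cycle in range(cycles):
        # Within a cycle, sends are applied before receives.
        for sched in schedules.values():
            task = sched.get(cycle)
            if task and task[0] == "send":
                bus.write(cycle, task[1])
        for unit, sched in schedules.items():
            task = sched.get(cycle)
            if task and task[0] == "recv":
                received[unit].append(bus.read())
    return received


# Schedules as a compiler might emit them: sends and receives are paired by cycle.
schedules = {
    "A": {0: ("send", 7), 2: ("recv",)},
    "B": {0: ("recv",), 2: ("send", 9)},
}
received = run(schedules, Bus(), cycles=3)
```

If both subunits were scheduled to send in the same cycle, `Bus.write` would raise an error, which stands in for the collision that the software-defined timing is meant to rule out ahead of time.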
In addition to or instead of distributing code among the processor subunits, data may be divided among the memory banks of the memory chip. For example, as explained above, a distributed processor on a memory chip may comprise a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of memory banks may be configured to store data independent of the data stored in the others of the plurality of memory banks, and each of the plurality of processor subunits may be connected to at least one dedicated memory bank from among the plurality of memory banks. For example, each processor subunit may have access to one or more memory controllers of the one or more corresponding memory banks dedicated to that processor subunit, and no other processor subunit may have access to these corresponding memory controllers. Accordingly, the data stored in each memory bank may be unique to the dedicated processor subunit. Moreover, the data stored in each memory bank may be independent of the data stored in the other memory banks, because no memory controller is shared between memory banks.
In some embodiments, as described below with respect to FIG. 8, the data stored in each of the plurality of memory banks may be defined by a compiler configured to distribute data among the plurality of memory banks. Moreover, the compiler may be configured to distribute data defined in a higher-level series of tasks among the plurality of memory banks using a plurality of lower-level tasks distributed among the corresponding processor subunits.
As explained further below, the higher-level series of tasks may comprise a set of instructions written in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise a lower-level series of tasks, each of which comprises a set of instructions written in machine code.
As explained above with respect to FIGS. 7A and 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks from among the plurality of memory banks. Moreover, as explained above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular one of the plurality of buses may be controlled by the corresponding processor subunit connected to that particular bus. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. As explained above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. In the latter embodiment, the processor subunit may perform a calculation, or use a bus connecting the processor subunit to other processor subunits, in the same cycle during which the accelerator pulls data from, or places data on, one of the buses connected to the one or more corresponding dedicated memory banks.
Accordingly, in one example, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a sending task. The sending task may comprise a command for the processor subunit to write data to the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Additionally or alternatively, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a receiving task. The receiving task may comprise a command for the processor subunit to read data from the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Accordingly, the sending and receiving tasks in such embodiments may comprise control signals that are sent, along the at least one of the plurality of buses, to one or more memory controllers of the one or more corresponding dedicated memory banks. Moreover, the sending and receiving tasks may be executed by one portion of the processing subunit (e.g., by one or more accelerators of the processing subunit) concurrently with a calculation or other task executed by another portion of the processing subunit (e.g., by one or more different accelerators of the processing subunit). An example of such concurrent execution is a MAC-relay command, in which receiving, multiplying, and sending are executed in tandem.
In addition to distributing data among the memory banks, particular portions of data may be duplicated across different memory banks. For example, as explained above, a distributed processor on a memory chip may comprise a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of processor subunits may be connected to at least one dedicated memory bank from among the plurality of memory banks, and each memory bank of the plurality of memory banks may be configured to store data independent of the data stored in the others of the plurality of memory banks. Moreover, at least some of the data stored in one particular memory bank from among the plurality of memory banks may comprise a duplicate of data stored in at least another memory bank of the plurality of memory banks. For example, a number, string, or other type of data used in the series of instructions may be stored in a plurality of memory banks dedicated to different processor subunits, rather than being transferred from one memory bank to other processor subunits in the memory chip.
In one example, parallel string matching may use the data duplication described above. For example, a plurality of strings may be compared with the same string. A conventional processor would compare each string of the plurality of strings with the same string in sequence. On a hardware chip of the present disclosure, the same string may be duplicated across the memory banks, such that the processor subunits can compare separate strings of the plurality of strings with the duplicated string in parallel.
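The parallel string matching described above can be sketched in Python. The names `make_banks`, `match_in_bank`, and `parallel_match` are hypothetical, the round-robin split of candidate strings is an assumption, and a thread pool merely stands in for processor subunits working side by side; the point shown is that each bank holds its own copy of the reference string, so no worker reads another worker's bank.

```python
from concurrent.futures import ThreadPoolExecutor


def make_banks(candidates, pattern, n_units):
    """Give every bank its own copy of the pattern plus a slice of the candidates."""
    banks = [{"pattern": pattern, "strings": []} for _ in range(n_units)]
    for index, text in enumerate(candidates):
        banks[index % n_units]["strings"].append((index, text))
    return banks


def match_in_bank(bank):
    """Work done by one subunit, touching only its dedicated bank."""
    return [index for index, text in bank["strings"] if text == bank["pattern"]]


def parallel_match(candidates, pattern, n_units=4):
    """Return the indices of all candidates equal to the pattern."""
    banks = make_banks(candidates, pattern, n_units)
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        partial_hits = pool.map(match_in_bank, banks)
    return sorted(i for hits in partial_hits for i in hits)


hits = parallel_match(["cat", "dog", "cat", "bird", "cat"], "cat", n_units=2)
```

The only data that crosses between workers in this sketch is the final list of matching indices, which mirrors the reduction in bus traffic that duplication is meant to achieve.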
In some embodiments, as described below with respect to FIG. 8, the at least some data duplicated across one particular memory bank from among the plurality of memory banks and at least another memory bank of the plurality of memory banks is defined by a compiler configured to duplicate data across memory banks. Moreover, the compiler may be configured to duplicate the at least some data using a plurality of lower-level tasks distributed among the corresponding processor subunits.
Duplication of data may be useful for certain tasks that reuse the same portions of data across different calculations. By duplicating these portions of data, the different calculations may be distributed among the processor subunits of the memory chip for parallel execution, while each processor subunit may store the portions of data in a dedicated memory bank and access the stored portions from that dedicated memory bank (rather than pushing and pulling the portions of data across the buses connecting the processor subunits). In one example, the at least some data duplicated across one particular memory bank from among the plurality of memory banks and at least another memory bank of the plurality of memory banks may comprise weights of a neural network. In this example, each node in the neural network may be defined by at least one processor subunit from among the plurality of processor subunits. For example, each node may comprise machine code executed by the at least one processor subunit defining the node. In this example, duplication of the weights may allow each processor subunit to execute machine code to effect, at least in part, a corresponding node while accessing only one or more dedicated memory banks (rather than performing data transfers with other processor subunits). Because the timing of reads from and writes to a dedicated memory bank is independent of the other processor subunits, while the timing of data transfers between processor subunits requires timing synchronization (e.g., using software, as explained above), duplicating memory to avoid data transfers between processor subunits may further increase the efficiency of overall execution.
As explained above with respect to FIGS. 7A and 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks from among the plurality of memory banks. Moreover, as explained above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular one of the plurality of buses may be controlled by the corresponding processor subunit connected to that particular bus. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. As explained above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. As further explained above, such tasks may include a sending task and/or a receiving task comprising control signals that are sent, along at least one of the plurality of buses, to one or more memory controllers of the one or more corresponding dedicated memory banks.
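FIG. 8 depicts a flowchart of a method 800 for compiling a series of instructions for execution on an exemplary memory chip of the present disclosure (e.g., as depicted in FIGS. 7A and 7B). Method 800 may be implemented by any conventional processor, whether general-purpose or special-purpose.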
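Method 800 may be executed as part of a computer program forming a compiler. As used herein, a "compiler" refers to any computer program that converts a higher-level language (e.g., a procedural language, such as C, FORTRAN, BASIC, or the like; an object-oriented language, such as Java, C++, Pascal, Python, or the like; and so on) to a lower-level language (e.g., assembly code, object code, machine code, or the like). The compiler may allow a human to program a series of instructions in a human-readable language, which is then converted to a machine-executable language.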
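At step 810, the processor may assign tasks associated with the series of instructions to different ones of the processor subunits. For example, the series of instructions may be divided into subgroups to be executed in parallel across the processor subunits. In one example, a neural network may be divided into its nodes, and one or more nodes may be assigned to separate processor subunits. In this example, each subgroup may comprise a plurality of nodes connected across different layers. Thus, a processor subunit may implement a node from a first layer of the neural network; a node from a second layer that is connected to the node from the first layer implemented by the same processor subunit; and the like. By assigning nodes based on their connections, data transfers between the processor subunits may be reduced, which may result in greater efficiency, as explained above.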
As explained above and depicted in FIGS. 7A and 7B, the processor subunits may be spatially distributed among the plurality of memory banks disposed on the memory chip. Accordingly, the assignment of tasks may be, at least in part, a spatial division as well as a logical division.
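At step 820, the processor may generate tasks to transfer data between pairs of the processor subunits of the memory chip, each pair of processor subunits being connected by a bus. For example, as explained above, the data transfers may be controlled using software. Accordingly, the processor subunits may be configured to push data onto, and pull data from, the buses at synchronized times. The generated tasks may thus include tasks for performing this synchronized pushing and pulling of data.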
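As explained above, step 820 may include pre-processing to account for the internal behavior of the processor subunits, including timing and latencies. For example, the processor may use known times and latencies of the processor subunits (e.g., the time to push data to a bus, the time to pull data from a bus, the latency between a calculation and a push or pull, or the like) to ensure that the generated tasks are synchronized. Accordingly, a data transfer comprising at least one push by one or more processor subunits and at least one pull by one or more processor subunits may occur simultaneously, without incurring delays due to timing differences between the processor subunits, latencies of the processor subunits, or the like.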
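The use of known latencies to synchronize a push with its matching pull can be sketched as a small scheduling calculation. The model here is a deliberately simplified assumption: `schedule_transfer`, its parameters, and the idea that a push becomes visible on the bus a fixed number of cycles after it is issued are all hypothetical and only illustrate the kind of arithmetic a compiler could perform ahead of time.

```python
def schedule_transfer(sender_ready, receiver_ready, push_latency):
    """Pick cycles for a push and its matching pull.

    sender_ready:   first cycle in which the sender has the data available
    receiver_ready: first cycle in which the receiver can accept the data
    push_latency:   cycles between issuing the push and the data being on the bus

    Returns (push_cycle, pull_cycle) such that the pull happens exactly when
    the data arrives, and neither side acts before it is ready.
    """
    push_cycle = max(sender_ready, receiver_ready - push_latency)
    pull_cycle = push_cycle + push_latency
    return push_cycle, pull_cycle


# The receiver is the bottleneck: the push is delayed so the data arrives at cycle 10.
late_receiver = schedule_transfer(sender_ready=3, receiver_ready=10, push_latency=2)

# The sender is the bottleneck: the pull waits until the data arrives at cycle 7.
late_sender = schedule_transfer(sender_ready=5, receiver_ready=4, push_latency=2)
```

Because both cycles are fixed at compile time in this sketch, neither subunit needs to poll or handshake at run time, which is the property the pre-processing described above is meant to secure.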
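At step 830, the processor may group the assigned and generated tasks into a plurality of groups of sub-series instructions. For example, the sub-series instructions may each comprise a series of tasks for execution by a single processor subunit. Accordingly, each of the plurality of groups of sub-series instructions may correspond to a different one of the plurality of processor subunits. Thus, steps 810, 820, and 830 may result in dividing the series of instructions into a plurality of groups of sub-series instructions. As explained above, step 820 may ensure that any data transfers between the different groups are synchronized.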
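At step 840, the processor may generate machine code corresponding to each of the plurality of groups of sub-series instructions. For example, higher-level code representing the sub-series instructions may be converted to lower-level code, such as machine code, executable by the corresponding processor subunits.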
At step 850, the processor may assign the generated machine code corresponding to each of the plurality of groups of sub-series instructions to a corresponding one of the plurality of processor subunits in accordance with the division. For example, the processor may label each sub-series of instructions with an identifier of the corresponding processor subunit. Thus, when the sub-series instructions are uploaded to a memory chip for execution (e.g., by host 350 of FIG. 3A), each sub-series may configure the correct processor subunit.
In some embodiments, assigning tasks associated with the series of instructions to different ones of the processor subunits may depend, at least in part, on the spatial proximity between two or more of the processor subunits on the memory chip. For example, as explained above, efficiency may be increased by reducing the number of data transfers between processor subunits. Accordingly, the processor may minimize data transfers that move data across more than two of the processor subunits. Therefore, the processor may use the known layout of the memory chip in combination with one or more optimization algorithms (such as a greedy algorithm) in order to assign sub-series to processor subunits in a way that maximizes (at least locally) adjacent transfers and minimizes (at least locally) transfers to non-neighboring processor subunits.
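One way a greedy algorithm could use the known chip layout is sketched below. This is an illustrative heuristic, not the disclosed compiler: the function `greedy_place`, the rectangular grid of subunits, the use of Manhattan distance as a proxy for transfer cost, and the traffic table are all assumptions. The sketch also assumes there are at least as many grid positions as sub-series.

```python
def greedy_place(traffic, grid_w, grid_h):
    """Place sub-series on a grid of processor subunits.

    traffic: {(a, b): number of transfers between sub-series a and b}
    Returns {sub_series: (x, y)}. Sub-series with the most traffic are placed
    first; each goes to the free position that is cheapest with respect to
    the sub-series already placed (a locally optimal choice only).
    """
    tasks = sorted({t for pair in traffic for t in pair})
    weight = {t: 0 for t in tasks}
    for (a, b), w in traffic.items():
        weight[a] += w
        weight[b] += w
    order = sorted(tasks, key=lambda t: (-weight[t], t))

    free = [(x, y) for y in range(grid_h) for x in range(grid_w)]
    placed = {}
    for task in order:
        def cost(cell):
            total = 0
            for (a, b), w in traffic.items():
                other = b if a == task else a if b == task else None
                if other in placed:
                    ox, oy = placed[other]
                    total += w * (abs(cell[0] - ox) + abs(cell[1] - oy))
            return total

        best = min(free, key=cost)
        free.remove(best)
        placed[task] = best
    return placed


placement = greedy_place({("a", "b"): 10, ("b", "c"): 10, ("a", "c"): 1}, 2, 2)
```

In this example the two heavily communicating pairs end up on neighboring subunits, while the lightly communicating pair is left farther apart. As the text notes, a greedy choice of this kind is only locally optimal.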
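Method 800 may include further optimizations for memory chips of the present disclosure. For example, the processor may group data associated with the series of instructions based on the division and assign the data to the memory banks in accordance with the grouping. Accordingly, each memory bank may hold the data used by the sub-series instructions assigned to the processor subunit to which that memory bank is dedicated.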
In some embodiments, grouping the data may include determining at least a portion of the data to duplicate in two or more of the memory banks. For example, as explained above, some data may be used across more than one sub-series of instructions. Such data may be duplicated across the memory banks dedicated to the plurality of processor subunits to which the different sub-series instructions are assigned. This optimization may further reduce data transfers across processor subunits.
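The grouping and duplication decision described above can be sketched as follows. The function `assign_data` and the usage table mapping each data item to the subunits whose sub-series use it are hypothetical; how a real compiler would derive such a table, and whether it would always duplicate shared data, is not specified here.

```python
def assign_data(uses):
    """Assign data items to dedicated banks, duplicating shared items.

    uses: {item: set of subunit identifiers whose sub-series use the item}
    Returns (banks, duplicated) where banks maps each subunit to the sorted
    items stored in its dedicated bank, and duplicated lists the items that
    were copied into more than one bank.
    """
    banks = {}
    for item, units in uses.items():
        for unit in units:
            banks.setdefault(unit, []).append(item)
    duplicated = sorted(item for item, units in uses.items() if len(units) > 1)
    return {unit: sorted(items) for unit, items in banks.items()}, duplicated


banks, duplicated = assign_data({
    "weights": {0, 1},   # used by two sub-series, so copied into both banks
    "input_a": {0},
    "input_b": {1},
})
```

Here the shared item is stored twice so that neither subunit has to fetch it over a bus, at the cost of extra storage in each bank.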
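The output of method 800 may be input to a memory chip of the present disclosure for execution. For example, a memory chip may comprise a plurality of processor subunits and a corresponding plurality of memory banks, each processor subunit being connected to at least one memory bank dedicated to that processor subunit, and the processor subunits of the memory chip may be configured to execute the machine code generated by method 800. As explained above with respect to FIG. 3A, host 350 may input the machine code generated by method 800 to the processor subunits for execution.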
Sub-Banks and Sub-Controllers
In conventional memory banks, controllers are provided at the bank level. Each bank includes a plurality of mats, which are typically arranged in a rectangular manner but may be arranged in any geometric shape. Each mat includes a plurality of memory cells, which are also typically arranged in a rectangular manner but may be arranged in any geometric shape. Each cell may store a single bit of data (e.g., depending on whether the cell is held at a high voltage or a low voltage).
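An example of this conventional architecture is depicted in FIGS. 9 and 10. As shown in FIG. 9, at the bank level, a plurality of mats (e.g., mats 930-1, 930-2, 940-1, and 940-2) may form bank 900. In a conventional rectangular organization, bank 900 may be controlled across global wordlines (e.g., wordline 950) and global bitlines (e.g., bitline 960). Accordingly, row decoder 910 may select the correct wordline based on an incoming control signal (e.g., a request to read from an address, a request to write to an address, or the like), and global sense amplifier 920 (and/or a global column decoder, not shown in FIG. 9) may select the correct bitline based on the control signal. Amplifier 920 may also amplify any voltage levels from a selected bank during a read operation. Although depicted as using a row decoder for initial selection and performing amplification along columns, a bank may additionally or alternatively use a column decoder for initial selection and perform amplification along rows.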
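FIG. 10 depicts an example of a mat 1000. For example, mat 1000 may form a portion of a memory bank, such as bank 900 of FIG. 9. As depicted in FIG. 10, a plurality of cells (e.g., cells 1030-1, 1030-2, and 1030-3) may form mat 1000. Each cell may comprise a capacitor, a transistor, or other circuitry that stores at least one bit of data. For example, a cell may comprise a capacitor that is charged to represent a "1" and discharged to represent a "0", or may comprise a flip-flop having a first state representing a "1" and a second state representing a "0". A conventional mat may comprise, for example, 512 bits by 512 bits. In embodiments where mat 1000 forms a portion of MRAM, ReRAM, or the like, a cell may comprise a transistor, resistor, capacitor, or other mechanism for isolating an ion or portion of a material that stores at least one bit of data. For example, a cell may comprise an electrolyte ion, a portion of chalcogenide glass, or the like, having a first state representing a "1" and a second state representing a "0".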
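As further depicted in FIG. 10, in a conventional rectangular organization, mat 1000 may be controlled across local wordlines (e.g., wordline 1040) and local bitlines (e.g., bitline 1050). Accordingly, wordline drivers (e.g., wordline drivers 1020-1, 1020-2, ..., 1020-x) may control the selected wordline to perform a read, write, or refresh based on a control signal (e.g., a request to read from an address, a request to write to an address, or a refresh signal) from a controller associated with the memory bank of which mat 1000 forms a part. Moreover, local sense amplifiers (e.g., local amplifiers 1010-1, 1010-2, ..., 1010-x) and/or local column decoders (not shown in FIG. 10) may control the selected bitline to perform a read, write, or refresh. The local sense amplifiers may also amplify any voltage levels from a selected cell during a read operation. Although depicted as using wordline drivers for initial selection and performing amplification along columns, a mat may alternatively use bitline drivers for initial selection and perform amplification along rows.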
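The wordline and bitline selection described above can be illustrated with a toy model of a mat. This sketch makes several assumptions that are not in the text: the class name `Mat`, a flat linear address that is split into a row and a column by integer division, and plain Python lists in place of analog cells. It shows only the selection logic, not sensing, amplification, or refresh.

```python
class Mat:
    """Toy model of a rectangular mat of single-bit cells."""

    def __init__(self, rows=512, cols=512):
        self.rows = rows
        self.cols = cols
        self.cells = [[0] * cols for _ in range(rows)]

    def decode(self, addr):
        """Split a linear address into (wordline index, bitline index)."""
        if not 0 <= addr < self.rows * self.cols:
            raise ValueError("address outside the mat")
        return divmod(addr, self.cols)

    def write(self, addr, bit):
        row, col = self.decode(addr)
        self.cells[row][col] = bit & 1

    def read(self, addr):
        row, col = self.decode(addr)
        wordline = self.cells[row]   # the wordline driver selects a whole row
        return wordline[col]         # the bitline selection picks one cell


mat = Mat()
mat.write(513, 1)
selected = mat.decode(513)
```

With the 512 by 512 size mentioned above, address 513 falls on the second wordline and the second bitline in this model.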
As explained above, a large number of mats are duplicated to form a memory bank. Memory banks may be grouped to form a memory chip. For example, a memory chip may comprise eight to thirty-two memory banks. Accordingly, pairing processor subunits with memory banks on a conventional memory chip may result in only eight to thirty-two processor subunits. Embodiments of the present disclosure may therefore include memory chips with an additional sub-bank hierarchy. Such memory chips may then include processor subunits paired with memory sub-banks that serve as the dedicated memory banks for those processor subunits, forming a larger number of sub-processors, which may in turn achieve higher parallelism and performance for in-memory computing.
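In some embodiments of the present disclosure, the global row decoder and global sense amplifier of bank 900 may be replaced with sub-bank controllers. Accordingly, rather than sending control signals to a global row decoder and a global sense amplifier of the memory bank, a controller of the memory bank may direct control signals to the appropriate sub-bank controller. The direction may be dynamically controlled or may be hard-wired (e.g., via one or more logic gates). In some embodiments, fuses may be used to indicate to the controller of each sub-bank or mat whether to block or pass a control signal to the appropriate sub-bank or mat. In such embodiments, fuses may therefore be used to deactivate faulty sub-banks.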
In one example of such embodiments, a memory chip may include a plurality of memory banks, each memory bank having a bank controller and a plurality of memory sub-banks, each memory sub-bank having a sub-bank row decoder and a sub-bank column decoder to allow reads from and writes to locations on the memory sub-bank. Each sub-bank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells and optionally having internal local row decoders, column decoders, and/or local sense amplifiers. The sub-bank row decoders and the sub-bank column decoders may process read and write requests from the bank controller or from a sub-bank processor subunit used for in-memory computation on the sub-bank memory, as described below. In addition, each memory sub-bank may further have a controller configured to determine whether to process read and write requests from the bank controller and/or forward them to the next level (e.g., the next level of row decoders and column decoders on a mat), or to block the requests, for example to allow an internal processing element or processor subunit to access the memory. In some embodiments, the bank controller may be synchronized to a system clock. The sub-bank controllers, however, may not be synchronized to the system clock.
如上文所解釋,子組之使用可允許在記憶體晶片中包括比在處理器子單元與習知晶片之記憶體組配對之情況下更大數目個處理器子單元。因此,每一子組可進一步具有使用子組作為專用記憶體之處理器子單元。如上文所解釋,該處理器子單元可包含RISC、CISC或其他通用處理子單元及/或可包含一或多個加速器。另外,該處理器子單元可包括位址產生器,如上文所解釋。在上文所描述之實施例中之任一者中,每一處理器子單元可經組態以使用專用於該處理器子單元之子組的列解碼器及行解碼器來存取該子組,而不使用組控制器。與子組相關聯之處理器子單元亦可處置記憶體墊(包括下文所描述之解碼器及記憶體冗餘機構)及/或判定是否轉送且因此處置來自上部層級(例如, 組層級或記憶體層級)之讀取或寫入請求。 As explained above, the use of sub-groups may allow a larger number of processor sub-units to be included in the memory chip than if the processor sub-units are paired with the memory group of the conventional chip. Therefore, each sub-group may further have a processor sub-unit that uses the sub-group as a dedicated memory. As explained above, the processor sub-unit may include RISC, CISC, or other general-purpose processing sub-units and/or may include one or more accelerators. In addition, the processor subunit may include an address generator, as explained above. In any of the embodiments described above, each processor sub-unit can be configured to use column decoders and row decoders dedicated to the sub-group of the processor sub-unit to access the sub-group , Instead of using the group controller. The processor subunits associated with the subgroups can also handle memory pads (including the decoders and memory redundancy mechanisms described below) and/or determine whether to forward and therefore handle them from the upper level (e.g., Group level or memory level) read or write request.
在一些實施例中,子組控制器可進一步包括儲存子組之狀態的暫存器。因此,在該暫存器提示該子組處於使用中時,若該子組控制器接收到來自記憶體控制器之控制信號,則該子組控制器可傳回錯誤。在每一子組進一步包括處理器子單元之實施例中,若該子組中之該處理器子單元正存取與來自記憶體控制器之外部請求衝突的記憶體,則該暫存器可提示錯誤。 In some embodiments, the sub-group controller may further include a register for storing the state of the sub-group. Therefore, when the register indicates that the subgroup is in use, if the subgroup controller receives a control signal from the memory controller, the subgroup controller can return an error. In an embodiment where each sub-group further includes a processor sub-unit, if the processor sub-unit in the sub-group is accessing a memory that conflicts with an external request from the memory controller, the register may Prompt an error.
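The state-register behavior described above can be sketched in software. The following is a minimal illustrative model, not the disclosed circuit: the class and method names (`SubbankController`, `local_begin`, `external_request`) and the tuple-based return convention are assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class SubbankState(Enum):
    IDLE = 0
    IN_USE = 1


@dataclass
class SubbankController:
    """Toy model of a subbank controller with a state register (hypothetical names)."""
    state: SubbankState = SubbankState.IDLE
    cells: dict = field(default_factory=dict)

    def local_begin(self) -> None:
        # The subbank's own processor subunit claims the subbank.
        self.state = SubbankState.IN_USE

    def local_end(self) -> None:
        # The processor subunit releases the subbank.
        self.state = SubbankState.IDLE

    def external_request(self, op: str, addr: int, value: int = 0):
        # A request from the memory controller is rejected with an error
        # while the register marks the subbank as in use.
        if self.state is SubbankState.IN_USE:
            return ("error", None)
        if op == "write":
            self.cells[addr] = value
            return ("ok", None)
        return ("ok", self.cells.get(addr, 0))


ctrl = SubbankController()
status_idle = ctrl.external_request("write", 0x10, 7)   # accepted while idle
ctrl.local_begin()
status_busy = ctrl.external_request("read", 0x10)       # rejected while in use
ctrl.local_end()
status_after = ctrl.external_request("read", 0x10)      # accepted again
```

In this sketch the external read succeeds only after the local access ends, which mirrors the conflict case in which the register indicates an error.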
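FIG. 11 shows an example of another embodiment of a memory bank using subbank controllers. In the example of FIG. 11, bank 1100 has a row decoder 1110, a column decoder 1120, and a plurality of memory subbanks (e.g., subbanks 1170a, 1170b, and 1170c) with subbank controllers (e.g., controllers 1130a, 1130b, and 1130c). The subbank controllers may include address resolvers (e.g., resolvers 1140a, 1140b, and 1140c), which may determine whether to pass a request to one or more subbanks controlled by the subbank controller.
The subbank controllers may further include one or more logic circuits (e.g., logic 1150a, 1150b, and 1150c). For example, a logic circuit comprising one or more processing elements may allow one or more operations, such as refreshing cells in the subbank, clearing cells in the subbank, or the like, to be performed without processing requests from outside bank 1100. Alternatively, the logic circuit may comprise a processor subunit, as explained above, such that the processor subunit has any subbank controlled by the subbank controller as its corresponding dedicated memory. In the example of FIG. 11, logic 1150a may have subbank 1170a as its corresponding dedicated memory, logic 1150b may have subbank 1170b, and logic 1150c may have subbank 1170c. In any of the embodiments described above, the logic circuits may have buses to the subbanks, for example buses 1131a, 1131b, or 1131c. As further depicted in FIG. 11, the subbank controllers may each include a plurality of decoders, such as a subbank row decoder and a subbank column decoder, to allow a processing element or processor subunit, or a higher-level memory controller issuing commands, to read and write addresses on the memory subbank. For example, subbank controller 1130a includes decoders 1160a, 1160b, and 1160c; subbank controller 1130b includes decoders 1160d, 1160e, and 1160f; and subbank controller 1130c includes decoders 1160g, 1160h, and 1160i. Based on a request from bank row decoder 1110, the subbank controllers may select a word line using the decoders included in the subbank controllers. The described system may allow a processing element or processor subunit of a subbank to access the memory without interrupting other banks or even other subbanks, thereby allowing each subbank processor subunit to perform memory computations in parallel with the other subbank processor subunits.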
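Furthermore, each subbank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells. For example, subbank 1170a includes mats 1190a-1, 1190a-2, ..., 1190a-x; subbank 1170b includes mats 1190b-1, 1190b-2, ..., 1190b-x; and subbank 1170c includes mats 1190c-1, 1190c-2, ..., 1190c-x. As further depicted in FIG. 11, each subbank may include at least one decoder. For example, subbank 1170a includes decoder 1180a, subbank 1170b includes decoder 1180b, and subbank 1170c includes decoder 1180c. Accordingly, bank column decoder 1120 may select a global bit line (e.g., bit line 1121a or 1121b) based on external requests, while the subbank selected by bank row decoder 1110 may use its column decoder to select a local bit line (e.g., bit line 1181a or 1181b) based on local requests from the logic circuit to which the subbank is dedicated. Each processor subunit may therefore be configured to access the subbank dedicated to it using the row decoder and column decoder of that subbank, without using the bank row decoder and the bank column decoder, and may do so without interrupting other subbanks. Moreover, when a request to the subbank originates outside the processor subunit, the subbank decoders may reflect the accessed data to the bank decoders. Alternatively, in embodiments where each subbank has only one row of memory mats, the local bit lines may be the bit lines of the mat rather than bit lines of the subbank.
Embodiments that use subbank row decoders and subbank column decoders may be combined with the embodiment depicted in FIG. 11. For example, the bank row decoder may be eliminated while the bank column decoder is retained and local bit lines are used.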
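FIG. 12 shows an example of an embodiment of a memory subbank 1200 having a plurality of mats. For example, subbank 1200 may represent a portion of bank 1100 of FIG. 11 or may represent an alternative implementation of a memory bank. In the example of FIG. 12, subbank 1200 includes a plurality of mats (e.g., mats 1240a and 1240b). Moreover, each mat may include a plurality of cells. For example, mat 1240a includes cells 1260a-1, 1260a-2, ..., 1260a-x, and mat 1240b includes cells 1260b-1, 1260b-2, ..., 1260b-x.
Each mat may be assigned a range of addresses that will be assigned to the memory cells of the mat. These addresses may be configured at production such that mats may be shuffled around and such that faulty mats may be deactivated and left unused (e.g., using one or more fuses, as explained further below).
Subbank 1200 receives read and write requests from memory controller 1210. Although not depicted in FIG. 12, requests from memory controller 1210 may be filtered through a controller of subbank 1200 and directed to the appropriate mat of subbank 1200 for address resolution. Alternatively, at least a portion (e.g., the higher bits) of the address of a request from memory controller 1210 may be transmitted to all mats of subbank 1200 (e.g., mats 1240a and 1240b), such that each mat processes the full address and the associated request only if the mat's assigned address range includes the address specified in the command. As with the subbank direction described above, the mat determination may be dynamically controlled or may be hard-wired. In some embodiments, fuses may be used to determine the address range of each mat, which also allows faulty mats to be disabled by assigning them an illegal address range. Mats may additionally or alternatively be disabled by other common methods or by connection of fuses.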
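The broadcast-and-compare scheme described above can be illustrated with a short sketch. It assumes, purely for illustration, that each mat's fused range is an inclusive pair over the upper address bits, that the lower 8 bits are the in-mat offset, and that an "illegal" range is encoded as an empty range; none of these encodings come from the source.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed encoding of an illegal range: low > high, so no address can match.
ILLEGAL_RANGE = (1, 0)


@dataclass
class Mat:
    name: str
    fused_range: tuple  # (low, high), inclusive, over the upper address bits

    def matches(self, upper_bits: int) -> bool:
        # Models the per-mat comparator: accept only addresses in the fused range.
        low, high = self.fused_range
        return low <= upper_bits <= high


def select_mat(mats, address: int, offset_bits: int = 8) -> Optional[str]:
    """Broadcast the upper address bits to every mat; return the mat that accepts."""
    upper = address >> offset_bits
    for mat in mats:
        if mat.matches(upper):
            return mat.name
    return None


mats = [
    Mat("1240a", (0x0, 0x3)),
    Mat("1240b", ILLEGAL_RANGE),   # faulty mat, disabled via an illegal range
    Mat("spare", (0x4, 0x7)),      # hypothetical mat that takes over the range
]

hit_a = select_mat(mats, 0x0123)      # upper bits 0x01 fall in mat 1240a
hit_spare = select_mat(mats, 0x0456)  # upper bits 0x04 fall in the spare mat
miss = select_mat(mats, 0x0900)       # no mat claims upper bits 0x09
```

Because the disabled mat never matches, requests simply pass it by, which is the effect the text attributes to assigning an illegal address range.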
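In any of the embodiments described above, each mat of the subbank may include a row decoder (e.g., row decoder 1230a or 1230b) for selecting a word line in the mat. In some embodiments, each mat may further include fuses and comparators (e.g., 1220a and 1220b). As described above, the comparators may allow each mat to determine whether to process an incoming request, and the fuses may allow each mat to be deactivated if it is faulty. Alternatively, row decoders of the bank and/or subbank may be used rather than a row decoder in each mat.
Furthermore, in any of the embodiments described above, a column decoder included in the appropriate mat (e.g., column decoder 1250a or 1250b) may select a local bit line (e.g., bit line 1251 or 1253). The local bit line may be connected to a global bit line of the memory bank. In embodiments where the subbank has local bit lines of its own, the local bit line of the cell may be further connected to the local bit line of the subbank. Accordingly, data in the selected cell may be read through the column decoder (and/or sense amplifier) of the cell, then through the column decoder (and/or sense amplifier) of the subbank (in embodiments that include a subbank column decoder and/or sense amplifier), and then through the column decoder (and/or sense amplifier) of the bank.
The mats of subbank 1200 may be duplicated and arrayed to form a memory bank (or a memory subbank). For example, a memory chip of the present disclosure may comprise a plurality of memory banks, each memory bank having a plurality of memory subbanks, and each memory subbank having a subbank controller for processing reads and writes to locations on the memory subbank. Furthermore, each memory subbank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells as well as a mat row decoder and a mat column decoder (e.g., as depicted in FIG. 12). The mat row decoders and mat column decoders may process read and write requests from the subbank controller. For example, the mat decoders may receive all requests and determine (e.g., using a comparator) whether to process a request based on a known address range of each mat, or the mat decoders may receive only requests within the known address range based on the selection of a mat by the subbank (or bank) controller.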
Controller Data Transfers
In addition to sharing data using processing subunits, any of the memory chips of the present disclosure may also share data using memory controllers (or subbank controllers or mat controllers). For example, a memory chip of the present disclosure may comprise a plurality of memory banks (e.g., SRAM banks, DRAM banks, or the like), each memory bank having a bank controller, a row decoder, and a column decoder for allowing reads and writes to locations on the memory bank, as well as a plurality of buses connecting each controller of the plurality of bank controllers to at least one other controller of the plurality of bank controllers. The plurality of buses may be similar to the buses connecting the processing subunits, as described above, but connect the bank controllers directly rather than through the processing subunits. Furthermore, although described as connecting the bank controllers, the buses may additionally or alternatively connect subbank controllers and/or mat controllers.
In some embodiments, the plurality of buses may be accessed without interrupting data transfers on the main buses of the memory banks connected to one or more processor subunits. Accordingly, a memory bank (or subbank) may transmit data to or from a corresponding processor subunit in the same clock cycle in which it transmits data to or from a different memory bank (or subbank). In embodiments where each controller is connected to a plurality of other controllers, the controllers may be configurable to select one of the other controllers for sending or receiving data. In some embodiments, each controller may be connected to at least one neighboring controller (e.g., pairs of spatially adjacent controllers may be connected to one another).
Redundant Logic in Memory Circuits
The disclosure is generally directed to a memory chip with primary logic portions for on-chip data processing. The memory chip may include redundant logic portions, which may replace defective primary logic portions to increase the fabrication yield of the chip. Thus, the chip may include on-chip components that allow the logic blocks in the memory chip to be configured based on individual testing of the logic portions. This feature may increase yield because a memory chip with a larger area dedicated to logic portions is more susceptible to fabrication failures. For example, DRAM memory chips with large redundant logic portions may be susceptible to fabrication issues that reduce yield. Implementing redundant logic portions, however, may increase yield and reliability because it enables a manufacturer or user of DRAM memory chips to turn entire logic portions on or off while maintaining high parallelism. It should be noted that here and throughout the disclosure, examples of certain memory types (such as DRAM) may be identified to facilitate the explanation of the disclosed embodiments. It is to be understood, however, that in such instances the identified memory types are not intended to be limiting. Rather, memory types such as DRAM, Flash, SRAM, ReRAM, PRAM, MRAM, ROM, or any other memory may be used with the disclosed embodiments, even if fewer examples are specifically identified in a certain section of the disclosure.
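FIG. 13 is a block diagram of an exemplary memory chip 1300, consistent with the disclosed embodiments. Memory chip 1300 may be implemented as a DRAM memory chip. Memory chip 1300 may also be implemented as any type of volatile or non-volatile memory, such as Flash, SRAM, ReRAM, PRAM, and/or MRAM. Memory chip 1300 may include a substrate 1301 in which are disposed an address manager 1302, a memory array 1304 including a plurality of memory banks 1304(a,a) to 1304(z,z), a memory logic 1306, a business logic 1308, and a redundant business logic 1310. Memory logic 1306 and business logic 1308 may constitute primary logic blocks, while redundant business logic 1310 may constitute redundant blocks. In addition, memory chip 1300 may include configuration switches, which may include deactivation switches 1312 and activation switches 1314. Deactivation switches 1312 and activation switches 1314 may also be disposed in substrate 1301. In this application, memory logic 1306, business logic 1308, and redundant business logic 1310 may also be collectively referred to as the "logic blocks."
Address manager 1302 may include row and column decoders or other types of memory auxiliaries. Alternatively, or additionally, address manager 1302 may include a microcontroller or processing unit.
In some embodiments, as shown in FIG. 13, memory chip 1300 may include a single memory array 1304 that arranges the plurality of memory blocks in a two-dimensional array on substrate 1301. In other embodiments, however, memory chip 1300 may include multiple memory arrays 1304, and each of the memory arrays 1304 may arrange its memory blocks in a different configuration. For example, the memory blocks (also known as memory banks) in at least one of the memory arrays may be arranged in a radial distribution to facilitate routing from address manager 1302 or memory logic 1306 to the memory blocks.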
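Business logic 1308 may be used to perform the in-memory computation of an application that is unrelated to the logic used to manage the memory itself. For example, business logic 1308 may implement AI-related functions, such as floating-point, integer, or MAC operations used as activation functions. In addition, business logic 1308 may implement database-related functions such as min, max, sort, and count, among others. Memory logic 1306 may perform tasks related to memory management, including (but not limited to) read, write, and refresh operations. Business logic may therefore be added at one or more of the bank level, the mat level, or the group-of-mats level. Business logic 1308 may have one or more address outputs and one or more data inputs/outputs. For instance, business logic 1308 may be addressed through row/column lines to address manager 1302. In certain embodiments, however, the logic blocks may additionally or alternatively be addressed via the data inputs/outputs.
Redundant business logic 1310 may be a replica of business logic 1308. In addition, redundant business logic 1310 may be connected to deactivation switches 1312 and/or activation switches 1314, which may include small fuses/anti-fuses and which are used to logically disable or enable one of the instances (e.g., an instance that is connected by default) and to enable one of the other logic blocks (e.g., an instance that is disconnected by default). In some embodiments, as further described in connection with FIG. 15, the redundancy of blocks may be local within a logic block, such as business logic 1308.
In some embodiments, the logic blocks in memory chip 1300 may be connected to subsets of memory array 1304 by dedicated buses. For example, a set of memory logic 1306, business logic 1308, and redundant business logic 1310 may be connected to the first row of memory blocks in memory array 1304 (i.e., memory blocks 1304(a,a) to 1304(a,z)). The dedicated buses may allow the associated logic blocks to access data in the memory blocks quickly, without requiring communication lines to be opened through, for example, address manager 1302.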
Each of the plurality of primary logic blocks may be connected to at least one of the plurality of memory banks 1304. Also, redundant blocks, such as redundant business logic 1310, may be connected to at least one of the memory instances 1304(a,a) to 1304(z,z). A redundant block may replicate at least one of the plurality of primary logic blocks, such as memory logic 1306 or business logic 1308. Deactivation switches 1312 may be connected to at least one of the plurality of primary logic blocks, and activation switches 1314 may be connected to at least one of the plurality of redundant blocks.
In these embodiments, upon detection of a fault associated with one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308), deactivation switches 1312 may be configured to disable that primary logic block. Simultaneously, activation switches 1314 may be configured to enable a redundant block, such as redundant logic block 1310, that replicates the disabled primary logic block.
In addition, activation switches 1314 and deactivation switches 1312, which may collectively be referred to as "configuration switches," may include an external input for configuring the state of the switch. For instance, activation switches 1314 may be configured so that an activation signal on the external input causes a closed-switch condition, while deactivation switches 1312 may be configured so that a deactivation signal on the external input causes an open-switch condition. In some embodiments, all configuration switches in memory chip 1300 may be deactivated by default and become activated or enabled after a test indicates that the associated logic block is functional and a signal is applied on the external input. Alternatively, in some cases, all configuration switches in memory chip 1300 may be enabled by default and may be deactivated or disabled after a test indicates that the associated logic block is not functional and a deactivation signal is applied on the external input.
Regardless of whether a configuration switch is initially enabled or disabled, the configuration switch may disable the associated logic block upon detection of a fault associated with that logic block. Where the configuration switch is initially enabled, its state may be changed to disabled in order to disable the associated logic block. Where the configuration switch is initially disabled, its state may be left disabled in order to disable the associated logic block. For example, the result of an operability test may indicate that a certain logic block is nonoperational or that it fails to operate within certain specifications. In such cases, the logic block may be disabled by not enabling its corresponding configuration switch.
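In some embodiments, a configuration switch may be connected to two or more logic blocks and may be configured to choose between different logic blocks. For example, a configuration switch may be connected to both business logic 1308 and redundant logic block 1310. The configuration switch may enable redundant logic block 1310 while disabling business logic 1308.
Alternatively, or additionally, at least one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308) may be connected to a subset of the plurality of memory banks or memory instances 1304 by a first dedicated connection. At least one redundant block (such as redundant business logic 1310) that replicates that primary logic block may then be connected to the same subset of memory banks or instances 1304 by a second dedicated connection.
Moreover, memory logic 1306 may have different functions and capabilities than business logic 1308. For example, while memory logic 1306 may be designed to enable read and write operations in memory bank 1304, business logic 1308 may be designed to perform in-memory computations. Therefore, if business logic 1308 includes a first business logic block and a second business logic block (such as redundant business logic 1310), it is possible to disconnect a defective business logic 1308 and reconnect redundant business logic 1310 without losing any capability.
In some embodiments, the configuration switches (including deactivation switches 1312 and activation switches 1314) may be implemented with a fuse, an anti-fuse, a programmable device (including a one-time programmable device), or another form of non-volatile memory.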
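FIG. 14 is a block diagram of an exemplary redundant logic block set 1400, consistent with the disclosed embodiments. In some embodiments, redundant logic block set 1400 may be disposed in substrate 1301. Redundant logic block set 1400 may include at least one of business logic 1308 and redundant business logic 1310, connected to switches 1312 and 1314, respectively. In addition, business logic 1308 and redundant business logic 1310 may be connected to an address bus 1402 and a data bus 1404.
In some embodiments, as shown in FIG. 14, switches 1312 and 1314 may connect the logic blocks to a clock node. In this way, the configuration switches may engage or disengage the logic blocks from the clock signal, effectively activating or deactivating them. In other embodiments, however, switches 1312 and 1314 may connect the logic blocks to other nodes for activation or deactivation. For instance, the configuration switches may connect the logic blocks to a voltage supply node (e.g., VCC), to a ground node (e.g., GND), or to a clock signal. In this way, the logic blocks may be enabled or disabled by the configuration switches, because the switches can create an open circuit or cut off the power supply of a logic block.
In some embodiments, as shown in FIG. 14, address bus 1402 and data bus 1404 may be on opposite sides of the logic blocks, which are connected in parallel to each of the buses. In this way, logic block set 1400 may facilitate the routing of the different on-chip components.
In some embodiments, each of the plurality of deactivation switches 1312 couples at least one of the plurality of primary logic blocks with a clock node, and each of the plurality of activation switches 1314 may couple at least one of the plurality of redundant blocks with the clock node, allowing the clock to be connected or disconnected as a simple activation/deactivation mechanism.
Redundant business logic 1310 of redundant logic block set 1400 allows the designer to choose, based on area and routing, the blocks that are worth duplicating. For example, a chip designer may select larger blocks for duplication because larger blocks can be more error prone. On the other hand, a designer may prefer to duplicate smaller logic blocks because they are easily duplicated without a significant loss of area. Moreover, using the configuration of FIG. 14, the designer may easily choose which logic blocks to duplicate depending on the statistics of errors per area.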
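FIG. 15 is a block diagram of an exemplary logic block 1500, consistent with the disclosed embodiments. The logic block may be business logic 1308 and/or redundant business logic 1310. In other embodiments, however, the exemplary logic block may describe memory logic 1306 or other components of memory chip 1300.
Logic block 1500 presents yet another embodiment, in which logic redundancy is used within a small processor pipeline. Logic block 1500 may include a register 1508, a fetch circuit 1504, a decoder 1506, and a write-back circuit 1518. In addition, logic block 1500 may include a computation unit 1510 and a duplicated computation unit 1512. In other embodiments, however, logic block 1500 may include other units that do not comprise a controller pipeline but include sporadic processing elements that comprise the required business logic.
Computation unit 1510 and duplicated computation unit 1512 may include digital circuits capable of performing digital calculations. For example, computation unit 1510 and duplicated computation unit 1512 may include an arithmetic logic unit (ALU) to perform arithmetic and bitwise operations on binary numbers. Alternatively, computation unit 1510 and duplicated computation unit 1512 may include a floating-point unit (FPU), which operates on floating-point numbers. In addition, in some embodiments, computation unit 1510 and duplicated computation unit 1512 may implement database-related functions such as min, max, count, and compare operations, among others.
In some embodiments, as shown in FIG. 15, computation unit 1510 and duplicated computation unit 1512 may be connected to switching circuits 1514 and 1516. When activated, the switching circuits may enable or disable the computation units.
In logic block 1500, duplicated computation unit 1512 may replicate computation unit 1510. Moreover, in some embodiments, register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518 (collectively referred to as the local logic units) may be smaller than computation unit 1510. Because larger elements are more prone to issues during fabrication, a designer may decide to replicate larger units (such as computation unit 1510) rather than smaller units (such as the local logic units). Depending on historic yields and error rates, however, a designer may elect to duplicate local logic units in addition to, or instead of, large units (or the entire block). For example, computation unit 1510 may be larger, and thus more error prone, than register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518. A designer may choose to duplicate computation unit 1510 rather than the other elements of logic block 1500 or the whole block.
Logic block 1500 may include a plurality of local configuration switches, each of which is connected to at least one of computation unit 1510 or duplicated computation unit 1512. The local configuration switches may be configured to disable computation unit 1510 and enable duplicated computation unit 1512 when a fault is detected in computation unit 1510.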
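FIG. 16 shows block diagrams of exemplary logic blocks connected with a bus, consistent with the disclosed embodiments. In some embodiments, logic blocks 1602 (which may represent memory logic 1306, business logic 1308, or redundant business logic 1310) may be independent of each other, may be connected via a bus, and may be activated externally by addressing them specifically. For example, memory chip 1300 may include many logic blocks, each having an ID number. In other embodiments, however, logic blocks 1602 may represent larger units comprising a plurality of one or more of memory logic 1306, business logic 1308, or redundant business logic 1310.
In some embodiments, each of logic blocks 1602 may be redundant with the other logic blocks 1602. This complete redundancy, in which all blocks may operate as primary or redundant blocks, may improve fabrication yield because a designer may disconnect faulty units while maintaining the functionality of the overall chip. For example, a designer may be able to disable logic areas that are prone to errors while maintaining similar computation capabilities, because all duplicated blocks may be connected to the same address and data buses. For example, the initial number of logic blocks 1602 may be greater than a target capability, so that disabling some logic blocks 1602 would not affect the target capability.
A bus connected to the logic blocks may include address bus 1614, command lines 1616, and data lines 1618. As shown in FIG. 16, each of the logic blocks may be connected independently to each line of the bus. In certain embodiments, however, logic blocks 1602 may be connected in a hierarchical structure to facilitate routing. For instance, each line of the bus may be connected to a multiplexer that routes the line to different logic blocks 1602.
In some embodiments, to allow external access without knowledge of the internal chip structure, which may change as units are enabled and disabled, each of the logic blocks may include a fused ID, such as fused identification 1604. Fused identification 1604 may include an array of switches (such as fuses) that determine an ID and may be connected to a managing circuit. For example, fused identification 1604 may be connected to address manager 1302. Alternatively, fused identification 1604 may be connected to higher memory address units. In these embodiments, fused identification 1604 may be configurable for a specific address. For example, fused identification 1604 may include a programmable, non-volatile device that determines a final ID based on instructions received from the managing circuit.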
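A distributed processor on a memory chip may be designed with the configuration depicted in FIG. 16. A testing procedure, executed as a BIST at chip wake-up or at factory testing, may assign running ID numbers to those blocks of the plurality of primary logic blocks (memory logic 1306 and business logic 1308) that pass a testing protocol. The testing procedure may also assign illegal ID numbers to those primary logic blocks that do not pass the testing protocol, and may assign running ID numbers to those blocks of the plurality of redundant blocks (redundant logic block 1310) that pass the testing protocol. Because redundant blocks replace failing primary logic blocks, the number of redundant blocks assigned running ID numbers may be equal to or greater than the number of primary logic blocks assigned illegal ID numbers and thereby disabled. In addition, each of the plurality of primary logic blocks and each of the plurality of redundant blocks may include at least one fused identification 1604. Also, as shown in FIG. 16, the bus connecting logic blocks 1602 may include a command line, a data line, and an address line.
In other embodiments, however, all logic blocks 1602 connected to the bus start out disabled and without an ID number. The blocks are tested one by one; each good logic block receives a running ID number, and the logic blocks that do not work retain an illegal ID, which disables them. In this manner, redundant logic blocks may improve fabrication yield by replacing blocks found to be defective during the testing process.
Address bus 1614 may couple a managing circuit to each of the plurality of memory banks, each of the plurality of primary logic blocks, and each of the plurality of redundant blocks. These connections allow the managing circuit, upon detection of a fault associated with a primary logic block (such as business logic 1308), to assign an invalid address to that primary logic block and a valid address to one of the plurality of redundant blocks.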
For example, as shown in FIG. 16A, illegal IDs (e.g., address 0xFFF) are configured for all logic blocks 1602(a) to 1602(c). After testing, logic blocks 1602(a) and 1602(c) are verified to be functional, while logic block 1602(b) is not functional. In FIG. 16A, unshaded logic blocks may represent logic blocks that passed the functionality test, while shaded logic blocks may represent logic blocks that failed it. The test procedure then changes the illegal IDs to legal IDs for the functional logic blocks while leaving the illegal IDs in place for the nonfunctional logic blocks. As an example, in FIG. 16A, the addresses of logic blocks 1602(a) and 1602(c) are changed from 0xFFF to 0x001 and 0x002, respectively. In contrast, the address of logic block 1602(b) remains the illegal address 0xFFF. In some embodiments, the ID is changed by programming the corresponding fused identification 1604.
Different results from the testing of logic blocks 1602 may produce different configurations. For example, as shown in FIG. 16B, address manager 1302 may initially assign illegal IDs (i.e., 0xFFF) to all logic blocks 1602. The testing results, however, may indicate that both logic blocks 1602(a) and 1602(b) are functional. In such cases, testing of logic block 1602(c) may be unnecessary because memory chip 1300 may need only two logic blocks. Therefore, to minimize testing resources, logic blocks may be tested only up to the minimum number of functional logic blocks required by the product definition of memory chip 1300, leaving the other logic blocks untested. In FIG. 16B, unshaded logic blocks represent tested logic blocks that passed the functionality test, and shaded logic blocks represent untested logic blocks.
In these embodiments, a production tester (external or internal, automatic or manual) or a controller executing a BIST at startup may change the illegal IDs to running IDs for the tested logic blocks that are functional, while leaving the illegal IDs in place for the untested logic blocks. As an example, in FIG. 16B, the addresses of logic blocks 1602(a) and 1602(b) are changed from 0xFFF to 0x001 and 0x002, respectively. In contrast, the address of untested logic block 1602(c) remains the illegal address 0xFFF.
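The two ID-assignment outcomes of FIGS. 16A and 16B can be reproduced with a small sketch. The function name `assign_ids`, the `required` parameter, and the string block names are hypothetical; only the ID values 0xFFF, 0x001, and 0x002 come from the text.

```python
ILLEGAL_ID = 0xFFF  # illegal ID value used in the FIG. 16A / 16B examples


def assign_ids(block_names, passes_test, required=None):
    """Give running IDs to blocks that pass; all others keep the illegal ID.

    If `required` is set, testing stops once that many working blocks
    have been found, so later blocks remain untested (FIG. 16B case).
    """
    ids = {name: ILLEGAL_ID for name in block_names}  # every block starts disabled
    next_id = 0x001
    tested = []
    for name in block_names:
        if required is not None and next_id > required:
            break  # enough functional blocks found; skip remaining tests
        tested.append(name)
        if passes_test(name):
            ids[name] = next_id
            next_id += 1
    return ids, tested


blocks = ["1602(a)", "1602(b)", "1602(c)"]

# FIG. 16A: block 1602(b) fails its test.
ids_16a, tested_16a = assign_ids(blocks, lambda n: n != "1602(b)")

# FIG. 16B: only two functional blocks are needed, so 1602(c) is never tested.
ids_16b, tested_16b = assign_ids(blocks, lambda n: True, required=2)
```

In the FIG. 16B case the untested block keeps 0xFFF for the same reason a failed block does in FIG. 16A: nothing ever reprograms its ID.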
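FIG. 17 is a block diagram of exemplary units 1702 and 1712 connected in series, consistent with the disclosed embodiments. FIG. 17 may represent an entire system or chip. Alternatively, FIG. 17 may represent a block in a chip containing other functional blocks.
Units 1702 and 1712 may represent complete units that include a plurality of logic blocks such as memory logic 1306 and/or business logic 1308. In these embodiments, units 1702 and 1712 may also include the elements required to perform operations, such as address manager 1302. In other embodiments, however, units 1702 and 1712 may represent logic units such as business logic 1308 or redundant business logic 1310.
FIG. 17 presents embodiments in which units 1702 and 1712 may need to communicate among themselves. In such cases, units 1702 and 1712 may be connected in series. A non-working unit, however, may break the continuity between the logic blocks. Therefore, the connection between units may include a bypass option for when a unit needs to be disabled because of a defect. The bypass option may also be part of the bypassed unit itself.
In FIG. 17, units may be connected in series (e.g., 1702(a) to 1702(c)), and a failing unit (e.g., 1702(b)) may be bypassed when it is defective. The units may further be connected in parallel with switching circuits. For example, in some embodiments, units 1702 and 1712 may be connected with switching circuits 1722 and 1728, as depicted in FIG. 17. In the example depicted in FIG. 17, unit 1702(b) is defective; for example, it does not pass a test for circuit functionality. Therefore, unit 1702(b) may be disabled using, for example, activation switches 1314 (not shown in FIG. 17), and/or switching circuit 1722(b) may be activated to bypass unit 1702(b) and maintain the connectivity between logic blocks.
Accordingly, when a plurality of primary units are connected in series, each of the plurality of units may be connected in parallel with a parallel switch. Upon detection of a fault associated with one of the plurality of units, the parallel switch connected to that unit may be activated to connect two of the plurality of units.
In other embodiments, as shown in FIG. 17, switching circuits 1728 may include one or more sampling points that introduce one or more cycles of delay in order to maintain synchronization between different lines of units. When a unit is disabled, shorting the connection between adjacent logic blocks may cause synchronization errors with other calculations. For example, if a task requires data from both line A and line B, and each of A and B is carried by an independent series of units, disabling a unit would desynchronize the lines and require further data management. To prevent desynchronization, sampling circuit 1730 may simulate the delay caused by the disabled unit 1712(b). Nonetheless, in some embodiments, the parallel switch may include an anti-fuse instead of sampling circuit 1730.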
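The synchronization argument above can be checked with a short cycle-count sketch. It assumes, for illustration only, that every working unit adds exactly one clock cycle and that a sampling point in the bypass adds one cycle as well; the source does not state these latencies, and the function name is hypothetical.

```python
def line_latency(units, bypass_with_sampling):
    """Cycles needed for data to traverse a serial line of units.

    `units` is a list of booleans (True = working, False = disabled and bypassed).
    Assumption: each working unit, and each sampling point, costs one cycle.
    """
    cycles = 0
    for working in units:
        if working:
            cycles += 1
        elif bypass_with_sampling:
            cycles += 1  # the sampling point reproduces the missing unit's delay
        # otherwise the bypass is a plain short and adds no cycles
    return cycles


line_a = [True, True, True]    # all units working
line_b = [True, False, True]   # middle unit disabled and bypassed

# A plain short leaves line B one cycle ahead of line A.
skew_short = line_latency(line_a, False) - line_latency(line_b, False)

# A bypass with a sampling point keeps both lines aligned.
skew_sampled = line_latency(line_a, True) - line_latency(line_b, True)
```

Under these assumptions the plain bypass produces a one-cycle skew between the lines, while the sampled bypass produces none, which is the role the text assigns to the sampling circuit.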
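FIG. 18 is a block diagram of exemplary units connected in a two-dimensional array, consistent with the disclosed embodiments. FIG. 18 may represent an entire system or chip. Alternatively, FIG. 18 may represent a block in a chip containing other functional blocks.
Units 1806 may represent autonomous units that include a plurality of logic blocks such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1806 may represent logic units such as business logic 1308. Where convenient, the discussion of FIG. 18 may refer to elements identified in FIG. 13 (e.g., memory chip 1300) and discussed above.
As shown in FIG. 18, units may be arranged in a two-dimensional array in which units 1806 (which may include or represent one or more of memory logic 1306, business logic 1308, or redundant business logic 1310) are interconnected via switching boxes 1808 and connection boxes 1810. In addition, to control the configuration of the two-dimensional array, the array may include I/O blocks 1804 in its periphery.
Connection boxes 1810 may be programmable and reconfigurable devices that respond to signals input from I/O blocks 1804. For example, connection boxes may include a plurality of input pins from units 1806 and may also be connected to switching boxes 1808. Alternatively, connection boxes 1810 may include a group of switches connecting the pins of programmable logic cells with routing tracks, while switching boxes 1808 may include a group of switches connecting different tracks.
In certain embodiments, connection boxes 1810 and switching boxes 1808 may be implemented with configuration switches such as switches 1312 and 1314. In such embodiments, connection boxes 1810 and switching boxes 1808 may be configured by a production tester or by a BIST executed at chip startup.
In some embodiments, connection boxes 1810 and switching boxes 1808 may be configured after units 1806 are tested for circuit functionality. In such embodiments, I/O blocks 1804 may be used to send testing signals to units 1806. Depending on the test results, I/O blocks 1804 may send programming signals that configure connection boxes 1810 and switching boxes 1808 so as to disable the units 1806 that fail the testing protocol and enable the units 1806 that pass it.
In such embodiments, the plurality of primary logic blocks and the plurality of redundant blocks may be disposed on the substrate in a two-dimensional grid. Therefore, each of the plurality of primary units 1806 and each of the plurality of redundant blocks, such as redundant business logic 1310, may be interconnected with switching boxes 1808, and an input block may be disposed in the periphery of each line and each column of the two-dimensional grid.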
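FIG. 19 is a block diagram of exemplary units in a complex connection, consistent with the disclosed embodiments. FIG. 19 may represent an entire system. Alternatively, FIG. 19 may represent a block in a chip containing other functional blocks.
The complex connection of FIG. 19 includes units 1902(a) to 1902(f) and configuration switches 1904(a) to 1904(f). Units 1902 may represent autonomous units that include a plurality of logic blocks such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1902 may represent logic units such as memory logic 1306, business logic 1308, or redundant business logic 1310. Configuration switches 1904 may include any of deactivation switches 1312 and activation switches 1314.
As shown in FIG. 19, the complex connection may include units 1902 in two planes. For example, the complex connection may include two independent substrates separated along the z-axis. Alternatively, or additionally, units 1902 may be arranged on two surfaces of a substrate. For example, to reduce the area of memory chip 1300, substrate 1301 may be arranged in two overlapping surfaces that are connected with configuration switches 1904 arranged in three dimensions. The configuration switches may include deactivation switches 1312 and/or activation switches 1314.
A first plane of the substrate may include "main" units 1902. These blocks may be enabled by default. In such embodiments, a second plane may include "redundant" units 1902. These units may be disabled by default.
In some embodiments, configuration switches 1904 may include anti-fuses. Thus, after units 1902 are tested, the blocks may be connected into a tile of functional units by switching certain anti-fuses to "always on" and disabling selected units 1902, even if the units are in different planes. In the example presented in FIG. 19, one of the "main" units (unit 1902(e)) is not working. FIG. 19 may represent nonfunctional or untested blocks as shaded blocks, while tested or functional blocks may be unshaded. Configuration switches 1904 are therefore configured so that one of the logic blocks in a different plane (e.g., unit 1902(f)) becomes active. In this way, even though one of the main logic blocks is defective, the memory chip still works by substituting a spare logic unit.
FIG. 19 additionally shows that one of the units 1902 in the second plane (i.e., 1902(c)) is neither tested nor enabled because the main logic blocks are functional. For example, in FIG. 19, both main units 1902(a) and 1902(d) passed a test for functionality, so unit 1902(c) was not tested or enabled. FIG. 19 therefore shows the ability to select specifically which logic blocks become active depending on the testing results.
In some embodiments, as shown in FIG. 19, not all units 1902 in the first plane may have a corresponding spare or redundant block. In other embodiments, however, all units may be redundant with each other for complete redundancy, in which every unit may serve as either a primary or a redundant unit. In addition, while some implementations may follow the star network topology depicted in FIG. 19, other implementations may use parallel connections, serial connections, and/or couple the different elements with configuration switches in parallel or in series.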
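FIG. 20 is an exemplary flowchart illustrating a redundant block enabling process 2000, consistent with the disclosed embodiments. Enabling process 2000 may be implemented for memory chip 1300 and especially for DRAM memory chips. In some embodiments, process 2000 may include the steps of: testing each of a plurality of logic blocks on the substrate of the memory chip for at least one circuit functionality; identifying faulty logic blocks among the plurality of primary logic blocks based on the testing results; testing at least one redundant or additional logic block on the substrate of the memory chip for the at least one circuit functionality; disabling the at least one faulty logic block by applying an external signal to a deactivation switch; and enabling the at least one redundant block by applying the external signal to an activation switch, the activation switch being connected with the at least one redundant block and being disposed on the substrate of the memory chip. The description of FIG. 20 below further elaborates on each step of process 2000.
Process 2000 may include testing a plurality of logic blocks (step 2002), such as business logic 1308, and a plurality of redundant blocks (e.g., redundant business logic 1310). The testing may be performed before packaging using, for example, probing stations for on-wafer testing. Step 2002, however, may also be performed after packaging.
The testing in step 2002 may include applying a finite sequence of testing signals to every logic block in memory chip 1300 or to a subset of the logic blocks in memory chip 1300. The testing signals may include requesting a computation that is expected to yield a 0 or a 1. In other embodiments, the testing signals may request reading a specific address in a memory bank or writing to a specific memory bank.
Testing techniques may be implemented in step 2002 to test the response of the logic blocks under iterative processes. For example, the test may involve testing the logic blocks by transmitting instructions to write data in a memory bank and then verifying the integrity of the written data. In some embodiments, the testing may include repeating the algorithm with inverted data.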
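A write-then-verify test repeated with inverted data can be sketched as follows. This is an illustrative software model, not the disclosed test circuit: `FakeBank`, its `stuck_low` fault model, the 8-bit word width, and the 0x55 pattern are all assumptions chosen for the example.

```python
def write_verify(write, read, addresses, pattern=0x55, width=8):
    """Write a pattern, read it back, then repeat with the inverted pattern.

    Returns a list of (address, expected_data) pairs that failed verification.
    """
    mask = (1 << width) - 1
    failures = []
    for data in (pattern & mask, ~pattern & mask):  # second pass uses inverted data
        for addr in addresses:
            write(addr, data)
        for addr in addresses:
            if read(addr) != data:
                failures.append((addr, data))
    return failures


class FakeBank:
    """Stand-in for a memory bank; `stuck_low` maps address -> bits forced to 0."""

    def __init__(self, stuck_low=None):
        self.cells = {}
        self.stuck_low = stuck_low or {}

    def write(self, addr, data):
        self.cells[addr] = data & ~self.stuck_low.get(addr, 0)

    def read(self, addr):
        return self.cells.get(addr, 0)


good = FakeBank()
bad = FakeBank(stuck_low={2: 0x02})  # bit 1 of address 2 is stuck at 0

good_failures = write_verify(good.write, good.read, range(4))
bad_failures = write_verify(bad.write, bad.read, range(4))
```

The stuck bit in this example happens to be 0 in the pattern 0x55, so the first pass cannot see it; only the inverted pass (0xAA) exposes the fault, which illustrates why repeating the algorithm with inverted data is useful.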
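In alternative embodiments, the testing of step 2002 may include running a model of the logic blocks to generate a target memory image based on a set of testing instructions. The same sequence of instructions may then be executed on the logic blocks in the memory chip, and the results may be recorded. The residual memory image of the simulation may be compared with the image obtained from the test, and any mismatch may be flagged as a failure.
Alternatively, in step 2002, the testing may include shadow modeling, in which a diagnostic is generated but the results are not necessarily predicted. Instead, the test using shadow modeling may be run in parallel on both the memory chip and a simulation. For example, when the logic blocks in the memory chip complete an instruction or task, the simulation may be signaled to execute the same instruction. Once the logic blocks in the memory chip finish the instructions, the architectural states of the two models may be compared. If there is a mismatch, a failure is flagged.
In some embodiments, all logic blocks (including, e.g., each of memory logic 1306, business logic 1308, or redundant business logic 1310) may be tested in step 2002. In other embodiments, however, only subsets of the logic blocks may be tested in different testing rounds. For example, in a first round of testing, only memory logic 1306 and its associated blocks may be tested. In a second round, only business logic 1308 and its associated blocks may be tested. In a third round, depending on the results of the first two rounds, the logic blocks associated with redundant business logic 1310 may be tested.
Process 2000 may continue to step 2004. In step 2004, faulty logic blocks may be identified, and faulty redundant blocks may also be identified. For example, logic blocks that do not pass the testing of step 2002 may be identified as faulty blocks in step 2004. In other embodiments, however, only certain faulty logic blocks may be identified initially. For example, in some embodiments, only logic blocks associated with business logic 1308 may be identified, and faulty redundant blocks are identified only if they are needed to substitute for a faulty logic block. In addition, identifying faulty blocks may include writing the identification information of the identified faulty blocks to a memory bank or a non-volatile memory.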
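In step 2006, faulty logic blocks may be disabled. For example, using a configuration circuit, the faulty logic blocks may be disabled by disconnecting them from clock, ground, and/or power nodes. Alternatively, faulty logic blocks may be disabled by configuring the connection boxes in an arrangement that avoids those logic blocks. In yet other embodiments, faulty logic blocks may be disabled by receiving an illegal address from address manager 1302.
In step 2008, redundant blocks that duplicate the faulty logic blocks may be identified. To support the same capabilities of the memory chip even though some logic blocks have failed, step 2008 identifies redundant blocks that are available and can duplicate the faulty logic blocks. For example, if a logic block that performs vector multiplication is determined to be faulty, then in step 2008 address manager 1302 or an on-chip controller may identify an available redundant logic block that also performs vector multiplication.
In step 2010, the redundant blocks identified in step 2008 may be enabled. In contrast to the disabling operation of step 2006, in step 2010 the identified redundant blocks may be enabled by connecting them to clock, ground, and/or power nodes. Alternatively, the identified redundant blocks may be enabled by configuring the connection boxes in an arrangement that connects them. In yet other embodiments, the identified redundant blocks may be enabled by receiving a running address at test procedure execution time.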
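Steps 2002 through 2010 can be summarized in a compact sketch. It models enabling and disabling as set membership rather than as switches, and matches spares to faulty blocks by a function label; the function name, the labels, and the block names are hypothetical and serve only to illustrate the flow.

```python
def enable_redundant_blocks(primary, redundant, passes_test):
    """Sketch of process 2000.

    `primary` and `redundant` map block name -> function label (e.g. "vector_mul").
    Returns (enabled, disabled) sets of block names.
    """
    # Steps 2002 and 2004: test the primary blocks and identify the faulty ones.
    faulty = [name for name in primary if not passes_test(name)]
    enabled = {name for name in primary if name not in faulty}
    disabled = set(faulty)  # step 2006: faulty blocks are disabled

    free = list(redundant)
    for bad in faulty:
        # Step 2008: find an available, working spare with the same function.
        match = next(
            (s for s in free if redundant[s] == primary[bad] and passes_test(s)),
            None,
        )
        if match is None:
            raise RuntimeError(f"no working spare replicates {bad}")
        free.remove(match)
        enabled.add(match)  # step 2010: enable the identified spare
    return enabled, disabled


primary = {"logic_a": "vector_mul", "logic_b": "min_max"}
redundant = {"spare_0": "min_max", "spare_1": "vector_mul"}
broken = {"logic_a"}

enabled, disabled = enable_redundant_blocks(
    primary, redundant, lambda name: name not in broken
)
```

Here the faulty vector-multiplication block is replaced by the spare with the same label, so the chip keeps both capabilities.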
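FIG. 21 is an exemplary flowchart illustrating an address assignment process 2100, consistent with the disclosed embodiments. Address assignment process 2100 may be implemented for memory chip 1300 and especially for DRAM memory chips. As described in relation to FIG. 16, in some embodiments the logic blocks in memory chip 1300 may be connected to a data bus and have an address identification. Process 2100 describes an address assignment method that disables faulty logic blocks and enables logic blocks that pass a test. The steps of process 2100 are described as being performed by a production tester or by a BIST executed at chip startup; however, other components of memory chip 1300 and/or external devices may also perform one or more steps of process 2100.
In step 2102, the tester may disable all logic blocks and redundant blocks by assigning an illegal identification to each logic block at the chip level.
In step 2104, the tester may execute a testing protocol on a logic block. For example, the tester may run the testing methods described for step 2002 on one or more of the logic blocks in memory chip 1300.
In step 2106, depending on the results of the test in step 2104, the tester may determine whether the logic block is defective. If the logic block is not defective (step 2106: no), the address manager may assign a running ID to the tested logic block in step 2108. If the logic block is defective (step 2106: yes), address manager 1302 may leave the illegal ID in place for the defective logic block in step 2110.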
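In step 2112, address manager 1302 may select a redundant logic block that replicates the defective logic block. In some embodiments, the redundant logic block that replicates the defective logic block may have the same components and connections as the defective logic block. In other embodiments, however, the redundant logic block may have different components and/or connections than the defective logic block but be able to perform an equivalent operation. For example, if the defective logic block is designed to perform vector multiplication, the selected redundant logic block would also be capable of performing vector multiplication, even if it does not have the same architecture as the defective unit.
In step 2114, address manager 1302 may test the redundant block. For instance, the tester may apply the testing techniques applied in step 2104 to the identified redundant block.
In step 2116, based on the results of the test in step 2114, the tester may determine whether the redundant block is defective. In step 2118, if the redundant block is not defective (step 2116: no), the tester may assign a running ID to the identified redundant block. In some embodiments, process 2100 may return to step 2104 after step 2118, creating an iteration loop that tests all logic blocks in the memory chip.
If the tester determines that the redundant block is defective (step 2116: yes), then in step 2120 the tester may determine whether additional redundant blocks are available. For example, the tester may query a memory bank for information regarding available redundant logic blocks. If redundant logic blocks are available (step 2120: yes), the tester may return to step 2112 and identify a new redundant logic block that replicates the defective logic block. If redundant logic blocks are not available (step 2120: no), then in step 2122 the tester may generate an error signal. The error signal may include information on the defective logic block and the defective redundant block.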
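The flowchart of process 2100 can be traced with a short sketch. The exception type `NoSpareAvailable`, the function name, and the block names are hypothetical, and for simplicity the sketch treats every spare as able to replicate any primary block, which the source does not guarantee.

```python
ILLEGAL_ID = 0xFFF  # illegal ID value borrowed from the FIG. 16 examples


class NoSpareAvailable(Exception):
    """Stands in for the error signal of step 2122 (hypothetical name)."""


def assign_addresses(primary, spares, passes_test):
    """Sketch of process 2100: returns a dict of block name -> assigned ID."""
    # Step 2102: every block starts disabled with an illegal ID.
    ids = {name: ILLEGAL_ID for name in list(primary) + list(spares)}
    free_spares = list(spares)
    next_id = 0x001
    for block in primary:
        candidate = block
        # Steps 2104/2106 for the primary block, 2114/2116 for a spare.
        while not passes_test(candidate):
            # Step 2110: the failed candidate keeps its illegal ID.
            if not free_spares:                 # step 2120: no spare left
                raise NoSpareAvailable(block)   # step 2122: error signal
            candidate = free_spares.pop(0)      # step 2112: pick a spare
        ids[candidate] = next_id                # step 2108 or 2118: running ID
        next_id += 1
    return ids


defective = {"p1", "s0"}
ids = assign_addresses(["p0", "p1"], ["s0", "s1"], lambda n: n not in defective)

try:
    assign_addresses(["p0"], [], lambda n: False)
    ran_out = False
except NoSpareAvailable:
    ran_out = True
```

In the first call the defective primary block and the first defective spare both keep the illegal ID, and the second spare receives the next running ID; the second call shows the error path when no spare remains.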
耦接之記憶體組 Coupled memory bank
本發明所揭示之實施例亦包括分散式高效能處理器。該處理器可包括介接記憶體組及處理單元之記憶體控制器。該處理器可為可組態的以加快將資料遞送至處理單元以用於計算。舉例而言,若處理單元需要兩個資料例項以執行任務,則記憶體控制器可經組態以使得通信線獨立地提供對來自兩個資料例項之資訊的存取。所揭示之記憶體架構試圖最小化與複雜快取記憶體及複雜暫存器檔案方案相關聯之硬體要求。通常,處理器晶片包括允許核心直接與暫存器一起工作的快取記憶體階層。然而,快取記憶體操作需要相當大的晶粒面積且消耗額外功率。所揭示之記憶體架構藉由在記憶體中添加邏輯組件來避 免使用快取記憶體階層。 The disclosed embodiments of the present invention also include distributed high-performance processors. The processor may include a memory controller that interfaces with the memory bank and the processing unit. The processor can be configurable to expedite the delivery of data to the processing unit for calculation. For example, if the processing unit requires two data instances to perform tasks, the memory controller can be configured so that the communication line independently provides access to information from the two data instances. The disclosed memory architecture attempts to minimize the hardware requirements associated with complex cache and complex register file solutions. Generally, the processor chip includes a cache hierarchy that allows the core to work directly with the register. However, the cache operation requires a considerable die area and consumes additional power. The disclosed memory architecture avoids the problem by adding logical components to the memory Avoid using the cache hierarchy.
所揭示架構亦實現資料在記憶體組中之策略性(或甚至最佳化)置放。即使記憶體組具有單個埠及高潛時,所揭示之記憶體架構亦可藉由將資料策略性地定位於記憶體組之不同區塊中來實現高效能且避免記憶體存取瓶頸。以將資料之連續串流提供至處理單元為目標,編譯最佳化步驟可針對特定或一般任務判定資料應如何儲存於記憶體組中。接著,介接處理單元及記憶體組之記憶體控制器可經組態以在特定處理單元需要資料以執行操作時向該等特定處理單元授權存取。 The disclosed architecture also realizes the strategic (or even optimized) placement of data in the memory bank. Even if the memory bank has a single port and high potential, the disclosed memory architecture can also achieve high performance and avoid memory access bottlenecks by strategically positioning data in different blocks of the memory bank. With the goal of providing a continuous stream of data to the processing unit, the compilation optimization step can determine how the data should be stored in the memory bank for a specific or general task. Then, the memory controller that interfaces the processing unit and the memory bank can be configured to authorize access to specific processing units when they need data to perform operations.
The configuration of the memory chip may be performed by a processing unit (e.g., a configuration manager) or by an external interface. The configuration may also be written by a compiler or other software tool. In addition, the configuration of the memory controller may be based on the available ports in the memory banks and on the organization of the data in the memory banks. Accordingly, the disclosed architecture may provide processing units with a constant flow of data or simultaneous information from different memory blocks. In this way, computation tasks within the memory may be processed quickly by avoiding latency bottlenecks or cache memory requirements.
Moreover, data stored in the memory chip may be arranged based on a compilation optimization step. The compilation may allow processing routines to be built in which the processor efficiently assigns tasks to processing units without the delays associated with memory latency. The compilation may be performed by a compiler and transmitted to a host connected to an external interface in the substrate. Normally, high latency for certain access patterns and/or a low number of ports would result in data bottlenecks for processing units requiring the data. The disclosed compilation, however, may position data in memory banks in a way that enables processing units to receive data continuously even with disadvantageous memory types.
Furthermore, in some embodiments, a configuration manager may signal the required processing units based on the computations a task requires. Different processing units or logic blocks in the chip may have specialized hardware or architectures for different tasks. Therefore, depending on the task to be performed, a processing unit or a group of processing units may be selected to perform it. The memory controller on the substrate may be configurable to route data, or grant access, according to the selection of processing subunits to improve data transfer rates. For example, based on the compilation optimization and the memory architecture, processing units may be granted access to memory banks when they are required to perform a task.
Moreover, the chip architecture may include on-chip components that facilitate data transfer by reducing the time required to access data in the memory banks. Therefore, the present disclosure describes a chip architecture, along with a compilation optimization step, for a high-performance processor capable of performing specific or generic tasks using simple memory instances. The memory instances may have high random-access latency and/or a low number of ports, such as those used in DRAM devices or other memory-oriented technologies, but the disclosed architecture may overcome these shortcomings by enabling a continuous (or nearly continuous) flow of data from the memory banks to the processing units.
In this application, simultaneous communication may refer to communication within a clock cycle. Alternatively, simultaneous communication may refer to sending information within a predetermined amount of time. For example, simultaneous communication may refer to communication within a few nanoseconds.
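FIG. 22 provides block diagrams of exemplary processing devices, consistent with disclosed embodiments. FIG. 22A shows a first embodiment of processing device 2200, in which memory controller 2210 connects a first memory block 2202 and a second memory block 2204 using multiplexers. Memory controller 2210 may also connect at least a configuration manager 2212, a logic block 2214, and multiple accelerators 2216(a)-(n). FIG. 22B shows a second embodiment of processing device 2200, in which memory controller 2210 connects memory blocks 2202 and 2204 using a bus that connects memory controller 2210 with at least a configuration manager 2212, a logic block 2214, and multiple accelerators 2216(a)-(n). In addition, host 2230 may be external to processing device 2200 and connected to it through, for example, an external interface.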
Memory blocks 2202 and 2204 may include a DRAM mat or group of mats, DRAM banks, MRAM/PRAM/ReRAM/SRAM units, flash mats, or other memory technologies. Memory blocks 2202 and 2204 may alternatively include non-volatile memories, a flash memory device, a resistive random access memory (ReRAM) device, or a magnetoresistive random access memory (MRAM) device.
Memory blocks 2202 and 2204 may additionally include a plurality of memory cells arranged in rows and columns between a plurality of word lines (not shown) and a plurality of bit lines (not shown). The gates of each row of memory cells may be connected to a respective one of the plurality of word lines. Each column of memory cells may be connected to a respective one of the plurality of bit lines.
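In other embodiments, a memory area (including memory blocks 2202 and 2204) is built from simple memory instances. In this application, the term "memory instance" may be used interchangeably with the term "memory block." The memory instances (or blocks) may have poor characteristics. For example, the memories may be single-port only and may have high random-access latency. Alternatively or additionally, the memories may be inaccessible during column and line changes and may face data access problems related to, for example, capacitive charging and/or circuitry setups. Nonetheless, by allowing dedicated connections between memory instances and processing units and by arranging the data in a way that takes the characteristics of the blocks into account, the architecture presented in FIG. 22 still facilitates parallel processing in the memory device.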
In some device architectures, memory instances may include several ports, facilitating parallel operations. Nonetheless, in such embodiments, the chip may still achieve improved performance when data is compiled and organized based on the chip architecture. For example, a compiler may improve the efficiency of access in the memory area by providing instructions and organizing data placement, so that the memory area can be accessed readily even when single-port memories are used.
Furthermore, memory blocks 2202 and 2204 may be multiple types of memory in a single chip. For example, memory blocks 2202 and 2204 may be eFlash and eDRAM. Also, the memory blocks may include DRAM with instances of ROM.
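Memory controller 2210 may include logic circuitry to handle memory access and to return results to the rest of the module. For example, memory controller 2210 may include an address manager and selection devices, such as multiplexers, to route data between the memory blocks and processing units or to grant access to the memory blocks. Alternatively, memory controller 2210 may include a double data rate (DDR) memory controller used to drive DDR SDRAM, where data is transferred on both the rising and falling edges of the system's memory clock.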
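In addition, memory controller 2210 may constitute a dual-channel memory controller. The incorporation of dual-channel memory may facilitate control of parallel access lines by memory controller 2210. The parallel access lines may be configured to have identical lengths to facilitate synchronization of data when multiple lines are used in conjunction. Alternatively or additionally, the parallel access lines may allow access to multiple memory ports of the memory banks.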
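In some embodiments, processing device 2200 may include one or more multiplexers that may be connected to the processing units. The processing units may include configuration manager 2212, logic block 2214, and accelerators 2216, which may be connected directly to the multiplexers. Also, memory controller 2210 may include at least one data input from a plurality of memory banks or blocks 2202 and 2204 and at least one data output connected to each of the plurality of processing units. With this configuration, memory controller 2210 may simultaneously receive data from memory banks or memory blocks 2202 and 2204 via two data inputs, and simultaneously transmit the received data to at least one selected processing unit via two data outputs. In some embodiments, however, the at least one data input and the at least one data output may be implemented in a single port allowing only read or write operations. In such embodiments, the single port may be implemented as a data bus including data, address, and command lines.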
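Memory controller 2210 may be connected to each of the plurality of memory blocks 2202 and 2204, and may also connect to the processing units via, for example, a selection switch. The processing units on the substrate, including configuration manager 2212, logic block 2214, and accelerators 2216, may also be independently connected to memory controller 2210. In some embodiments, configuration manager 2212 may receive an indication of a task to be performed and, in response, configure memory controller 2210, accelerators 2216, and/or logic blocks 2214 according to a configuration stored in memory or supplied externally. Alternatively, memory controller 2210 may be configured by an external interface. The task may require at least one computation that may be used to select at least one selected processing unit from the plurality of processing units. Alternatively or additionally, the selection may be based at least in part on the capability of the selected processing unit to perform the at least one computation. In response, memory controller 2210 may grant access to the memory banks, or route data between the at least one selected processing unit and at least two memory banks, using dedicated buses and/or pipelined memory access.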
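In some embodiments, first memory block 2202 of the at least two memory blocks may be arranged on a first side of the plurality of processing units, and second memory bank 2204 of the at least two memory banks may be arranged on a second side of the plurality of processing units opposite the first side. Further, a processing unit selected to perform the task, for instance accelerator 2216(n), may be configured to access second memory bank 2204 during a clock cycle in which a communication line to the first memory bank or first memory block 2202 is opened. Alternatively, the selected processing unit may be configured to transfer data to second memory block 2204 during a clock cycle in which a communication line is opened to first memory block 2202.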
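In some embodiments, memory controller 2210 may be implemented as an independent element, as shown in FIG. 22. In other embodiments, however, memory controller 2210 may be embedded in the memory area or may be disposed along accelerators 2216(a)-(n).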
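A processing area in processing device 2200 may include configuration manager 2212, logic block 2214, and accelerators 2216(a)-(n). Accelerators 2216 may include multiple processing circuits with pre-defined functions and may be defined by a specific application. For example, an accelerator may be a vector multiply-accumulate (MAC) unit or a direct memory access (DMA) unit handling memory movement between modules. Accelerators 2216 may also be able to calculate their own addresses and request data from, or write data to, memory controller 2210. For example, configuration manager 2212 may signal at least one of accelerators 2216 that it may access the memory bank. Accelerators 2216 may then configure memory controller 2210 to route data or grant access to themselves. In addition, accelerators 2216 may include at least one arithmetic logic unit, at least one vector handling logic unit, at least one string compare logic unit, at least one register, and at least one direct memory access.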
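Configuration manager 2212 may include digital processing circuitry to configure accelerators 2216 and to direct the execution of tasks. For example, configuration manager 2212 may be connected to memory controller 2210 and to each of the plurality of accelerators 2216. Configuration manager 2212 may have its own dedicated memory to hold the configurations of accelerators 2216. Configuration manager 2212 may use the memory banks to fetch commands and configurations via memory controller 2210. Alternatively, configuration manager 2212 may be programmed through an external interface. In certain embodiments, configuration manager 2212 may be implemented with an on-chip reduced instruction set computer (RISC) or an on-chip complex CPU with its own cache hierarchy. In some embodiments, configuration manager 2212 may also be omitted, and the accelerators may be configured through an external interface.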
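Processing device 2200 may also include an external interface (not shown). The external interface allows access to the memory from an upper level, such as a memory bank controller that receives commands from external host 2230 or an on-chip main processor, or directly from external host 2230 or the on-chip main processor. The external interface may allow programming of configuration manager 2212 and accelerators 2216 by writing configurations or code to the memory via memory controller 2210, to be used later by configuration manager 2212 or by units 2214 and 2216 themselves. The external interface, however, may also directly program processing units without being routed through memory controller 2210. Where configuration manager 2212 is a microcontroller, configuration manager 2212 may allow code to be loaded from a main memory to the controller's local memory via the external interface. Memory controller 2210 may be configured to interrupt the task in response to receiving a request from the external interface.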
The external interface may include multiple connectors associated with logic circuits that provide a glueless interface to a variety of elements on the processing device. The external interface may include: data I/O inputs for data reads and outputs for data writes; external address outputs; an external CE0 chip select pin; active-low chip selectors; byte enable pins; a pin for wait states on the memory cycle; a write enable pin; an output-enable active pin; and a read-write enable pin. The external interface therefore has the inputs and outputs required to control processes and obtain information from the processing device. For example, the external interface may conform to the JEDEC DDR standard. Alternatively or additionally, the external interface may conform to other standards, such as SPI/OSPI or UART.
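In some embodiments, the external interface may be disposed on the chip substrate and may be connected to external host 2230. The external host may gain access to memory blocks 2202 and 2204, memory controller 2210, and the processing units via the external interface. Alternatively or additionally, external host 2230 may read and write to the memory, or may signal configuration manager 2212, through read and write commands, to perform operations such as starting and/or stopping a process. In addition, external host 2230 may configure accelerators 2216 directly. In some embodiments, external host 2230 may be able to perform read/write operations directly on memory blocks 2202 and 2204.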
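In some embodiments, configuration manager 2212 and accelerators 2216 may be configured to connect the device area with the memory area using direct buses, depending on the target task. For example, a subset of accelerators 2216 may be connected with memory instances 2204 when that subset is capable of performing the computations required to execute the task. With such a separation, it is possible to ensure that dedicated accelerators obtain the bandwidth (BW) they need to memory blocks 2202 and 2204. Moreover, this configuration with dedicated buses may allow a large memory to be split into smaller instances or blocks, because connecting memory instances to memory controller 2210 allows quick access to data in different memories even with high row latency. To achieve parallelization of the connections, memory controller 2210 may be connected to each of the memory instances with data, address, and/or control buses.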
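The above-described inclusion of memory controller 2210 may eliminate the requirement for a cache hierarchy or a complex register file in the processing device. Although a cache hierarchy can be added for additional capability, the architecture of processing device 2200 may allow a designer to add enough memory blocks or instances based on the processing operations and to manage the instances accordingly without a cache hierarchy. For example, the architecture of processing device 2200 may eliminate the requirement for a cache hierarchy by implementing pipelined memory access. In pipelined memory access, processing units may receive a sustained flow of data in every cycle, in which certain data lines may be opened (or activated) while other data lines receive or transmit data. The sustained flow of data using independent communication lines may allow improved execution speed and minimal latency due to line changes.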
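Moreover, because the architecture disclosed in FIG. 22 enables pipelined memory access, it may be possible to organize data in a small number of memory blocks and save the power losses caused by line switching. For example, in some embodiments, a compiler may communicate to host 2230 the organization of data in the memory banks, or a method for organizing data in the memory banks, to facilitate access to the data during a given task. Then, configuration manager 2212 may define which memory banks, and in some cases which ports of the memory banks, may be accessed by the accelerators. This synchronization between the location of data in the memory banks and the method of accessing it improves computing tasks by feeding data to the accelerators with minimal latency. For example, in embodiments in which configuration manager 2212 includes a RISC/CPU, the method may be implemented in offline software (SW), and configuration manager 2212 may then be programmed to execute the method. The method may be developed in any language executable by RISC/CPU computers and may be executed on any platform. The inputs of the method may include the configuration of the memories behind the memory controller and the data itself, along with the pattern of memory accesses. In addition, the method may be implemented in a language or machine language specific to the embodiment, and may also be just a series of configuration values represented in binary or text.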
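As discussed above, in some embodiments a compiler may provide instructions to host 2230 for organizing data in memory blocks 2202 and 2204 in preparation for pipelined memory access. The pipelined memory access may generally include the steps of: receiving a plurality of addresses of a plurality of memory banks or memory blocks 2202 and 2204; accessing the plurality of memory banks according to the received addresses using independent data lines; supplying data from a first address through a first communication line to at least one of the plurality of processing units and opening a second communication line to a second address, the first address being in a first memory bank of the plurality of memory banks and the second address being in second memory bank 2204 of the plurality of memory banks; and, within a second clock cycle, supplying data from the second address through the second communication line to the at least one of the plurality of processing units and opening a third communication line to a third address in the first memory bank in the first line. In some embodiments, the pipelined memory access may be executed with two memory blocks connected to a single port. In such embodiments, memory controller 2210 may hide the two memory blocks behind a single port but transmit data to the processing units using the pipelined memory access approach.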
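By way of illustration only, the alternation described above (supplying a word over one line while the next line is being opened in another bank) can be sketched in Python. The names `Access` and `pipeline_schedule` are hypothetical and do not appear in the disclosure, and the rule that consecutive accesses must fall in different banks is a simplifying assumption about single-port banks, not a stated requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    bank: int      # index of the memory bank holding the word (assumed model)
    address: int   # address of the word inside that bank


def pipeline_schedule(accesses):
    """Return one (supplying, opening) pair per clock cycle.

    In cycle k the controller supplies the word whose line was opened in
    cycle k-1 and, in the same cycle, opens the line for the next access.
    Assumption: a single-port bank cannot supply data and open a new line
    in the same cycle, so consecutive accesses must use different banks.
    """
    for prev, nxt in zip(accesses, accesses[1:]):
        if prev.bank == nxt.bank:
            raise ValueError(f"bank conflict between {prev} and {nxt}")
    schedule = []
    for k in range(len(accesses) + 1):
        supplying = accesses[k - 1] if k >= 1 else None
        opening = accesses[k] if k < len(accesses) else None
        schedule.append((supplying, opening))
    return schedule


if __name__ == "__main__":
    # First address in bank 0, second in bank 1, third back in bank 0.
    for cycle, (supplying, opening) in enumerate(
        pipeline_schedule([Access(0, 10), Access(1, 20), Access(0, 11)])
    ):
        print(cycle, "supply:", supplying, "open:", opening)
```

Under this model the only cycle that delivers no data is the first one, which opens the initial line; every later cycle overlaps a data transfer with a line opening.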
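In some embodiments, the compiler may run on host 2230 before a task is executed. In such embodiments, the compiler may be able to determine a configuration of the data flow based on the architecture of the memory device, since that configuration would be known to the compiler.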
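In other embodiments, if the configuration of memory blocks 2204 and 2202 is unknown at offline time, the pipelined method may run on host 2230, which may arrange the data in the memory blocks before starting calculations. For example, host 2230 may write data directly into memory blocks 2204 and 2202. In such embodiments, processing units, such as configuration manager 2212, and memory controller 2210 may not have information about the required hardware until run time. It may then be necessary to delay the selection of an accelerator 2216 until a task starts running. In these situations, the processing units or memory controller 2210 may randomly select an accelerator 2216 and create a test data access pattern, which may be modified as the task is executed.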
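Nonetheless, when the task is known in advance, a compiler may organize data and instructions in the memory banks for host 2230 to provide to a processing unit, such as configuration manager 2212, to set signal connections that minimize access latency. For example, in some cases accelerators 2216 may require n words at the same time, while each memory instance supports retrieving only m words at a time, where m and n are integers and m < n. The compiler may therefore place the needed data across different memory instances or blocks to facilitate data access. Also, to avoid line-miss latencies where processing device 2200 includes multiple memory instances, the host may split data across different lines of different memory instances. The division of data may allow accessing the next line of data in the next instance while still using data from the current instance.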
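For example, accelerator 2216(a) may be configured to multiply two vectors. Each vector may be stored in an independent memory block, such as memory blocks 2202 and 2204, and each vector may include multiple words. Completing a task that requires a multiplication by accelerator 2216(a) may therefore require accessing the two memory blocks and retrieving multiple words. In some embodiments, however, memory blocks allow access to only one word per clock cycle; for instance, the memory blocks may have a single port. In these cases, to expedite data transmission during an operation, a compiler may organize the words composing the vectors in different memory blocks, allowing parallel and/or simultaneous reading of the words. In these situations, the compiler may store words in memory blocks that have a dedicated line. For instance, if each vector includes two words and the memory controller has direct access to four memory blocks, the compiler may arrange the data in the four memory blocks, each transmitting one word and expediting data delivery. Moreover, in embodiments in which memory controller 2210 has more than a single connection to each memory block, the compiler may instruct configuration manager 2212 (or another processing unit) to access a specific port. In this way, processing device 2200 may perform pipelined memory access, continuously providing data to the processing units by simultaneously loading words in some lines and transmitting data in others. This pipelined memory access may thus avoid latency issues.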
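The benefit of spreading the vector words over four blocks can be checked with a small worked example. The Python sketch below is illustrative only: `fetch_cycles` is a hypothetical helper, and it assumes that every block delivers at most one word per clock cycle and that distinct blocks can be read in parallel over dedicated lines.

```python
from collections import Counter


def fetch_cycles(placement, words_per_cycle=1):
    """Clock cycles needed to read every word once.

    placement maps a word label to the index of the memory block holding it.
    Assumptions: each block delivers at most words_per_cycle words per cycle,
    and distinct blocks are read in parallel.
    """
    per_block = Counter(placement.values())
    if not per_block:
        return 0
    busiest = max(per_block.values())
    return -(-busiest // words_per_cycle)  # ceiling division


# Two vectors (a and b) of two words each.
one_vector_per_block = {"a0": 0, "a1": 0, "b0": 1, "b1": 1}
one_word_per_block = {"a0": 0, "a1": 1, "b0": 2, "b1": 3}

if __name__ == "__main__":
    print(fetch_cycles(one_vector_per_block))  # the busiest block holds two words
    print(fetch_cycles(one_word_per_block))    # every block holds a single word
```

With one vector per block the busiest block must be read twice, whereas placing one word in each of the four blocks lets all four words arrive in a single cycle under these assumptions.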
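FIG. 23 is a block diagram of an exemplary processing device 2300, consistent with disclosed embodiments. The block diagram shows a simplified processing device 2300 displaying a single accelerator in the form of MAC unit 2302, configuration manager 2304 (equivalent or similar to configuration manager 2212), memory controller 2306 (equivalent or similar to memory controller 2210), and a plurality of memory blocks 2308(a)-(d).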
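In some embodiments, MAC unit 2302 may be a specific accelerator for processing a particular task. By way of example, processing device 2300 may be tasked with 2D convolutions. Configuration manager 2304 may then signal an accelerator that has the appropriate hardware to perform the calculations associated with the task. For instance, MAC unit 2302 may have four internal incrementing counters (logical adders and registers to manage the four loops needed by a convolution calculation) and a multiply-accumulate unit. Configuration manager 2304 may signal MAC unit 2302 to process incoming data and execute the task, and may transmit an indication to MAC unit 2302 to do so. In these situations, MAC unit 2302 may iterate over the calculated addresses, multiply the numbers, and accumulate them into an internal register.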
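In some embodiments, configuration manager 2304 may configure the accelerators while memory controller 2306 grants access to blocks 2308 and MAC unit 2302 using dedicated buses. In other embodiments, however, memory controller 2306 may configure the accelerators directly, based on instructions received from configuration manager 2304 or an external interface. Alternatively or additionally, configuration manager 2304 may pre-load a few configurations and allow the accelerator to run iteratively on different addresses with different sizes. In such embodiments, configuration manager 2304 may include a cache memory that stores a command before it is transmitted to at least one of the plurality of processing units, such as accelerators 2216. In other embodiments, configuration manager 2304 may not include a cache.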
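In some embodiments, configuration manager 2304 or memory controller 2306 may receive addresses that need to be accessed for a task. Configuration manager 2304 or memory controller 2306 may check a register to determine whether the address is already in a loaded line to one of memory blocks 2308. If so, memory controller 2306 may read the word from memory block 2308 and pass it to MAC unit 2302. If the address is not in a loaded line, configuration manager 2304 may request memory controller 2306 to load the line and signal MAC unit 2302 to delay until the address is retrieved.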
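In some embodiments, as shown in FIG. 23, memory controller 2306 may include two inputs forming two independent addresses. But if more than two addresses should be accessed simultaneously, and these addresses are in a single memory block (for example, only in memory block 2308(a)), memory controller 2306 or configuration manager 2304 may raise an exception. Alternatively, configuration manager 2304 may return an invalid data signal when the two addresses can only be accessed through a single line. In other embodiments, the unit may delay the process execution until it is possible to retrieve all needed data, which may diminish overall performance. Nonetheless, a compiler may be able to find a configuration and data placement that would prevent delays.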
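In some embodiments, a compiler may create a configuration or instruction set for processing device 2300 that may configure configuration manager 2304, memory controller 2306, and accelerator 2302 to handle situations in which multiple addresses need to be accessed from a single memory block that has only one port. For instance, a compiler may rearrange data in memory blocks 2308 so that processing units may access multiple lines in memory blocks 2308.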
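In addition, memory controller 2306 may also work on more than one input at the same time. For example, memory controller 2306 may allow accessing one of memory blocks 2308 through one port and supplying the data while receiving a request for a different memory block at another input. This operation may therefore result in accelerator 2216, tasked with the exemplary 2D convolution, receiving data from dedicated lines of communication with the pertinent memory blocks.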
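Additionally or alternatively, memory controller 2306 or a logic block may hold refresh counters for every memory block 2308 and handle the refresh of all lines. Having such a counter allows memory controller 2306 to slip refresh cycles in between dead access times of the devices.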
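Furthermore, memory controller 2306 may be configurable to perform pipelined memory access, receiving addresses and opening lines in memory blocks before supplying the data. The pipelined memory access may provide data to processing units without interrupting or delaying clock cycles. For example, while memory controller 2306 or one of the logic blocks accesses data with the right line in FIG. 23, it may be transmitting data in the left line. These methods are explained in greater detail in connection with FIG. 26.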
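In response to the required data, processing device 2300 may use multiplexers and/or other switching devices to choose which devices are serviced to perform a given task. For example, configuration manager 2304 may configure multiplexers so that at least two data lines reach MAC unit 2302. In this way, a task requiring data from multiple addresses, such as a 2D convolution, may be performed faster, because the vectors or words that require multiplication during the convolution can reach the processing unit simultaneously, in a single clock. This data transfer method may allow the processing units, such as accelerators 2216, to output a result quickly.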
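In some embodiments, configuration manager 2304 may be configurable to execute processes based on the priority of tasks. For example, configuration manager 2304 may be configured to let a running process finish without any interruptions. In that case, configuration manager 2304 may provide the instructions or configurations of a task to accelerators 2216, let them run uninterrupted, and switch multiplexers only when the task is finished. In other embodiments, however, configuration manager 2304 may interrupt a task and reconfigure data routing when it receives a priority task, such as a request from an external interface. Nevertheless, with enough memory blocks 2308, memory controller 2306 may be configurable to route data, or grant access, to processing units using dedicated lines that do not have to be changed until a task is completed. Moreover, in some embodiments, all devices may be connected by buses to the entities of configuration manager 2304, and the devices may manage access between themselves and the buses (e.g., using the same logic as a multiplexer). Memory controller 2306 may therefore be directly connected to a number of memory instances or memory blocks.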
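Alternatively, memory controller 2306 may be connected directly to memory sub-instances. In some embodiments, each memory instance or block can be built from sub-instances (for example, DRAM may be built from mats with independent data lines arranged in multiple sub-blocks). Further, the instances may include at least one of DRAM mats, DRAM banks, flash mats, SRAM mats, or any other type of memory. Memory controller 2306 may then include dedicated lines to address sub-instances directly, minimizing latency during pipelined memory access.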
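In some embodiments, memory controller 2306 may also hold the logic needed for a specific memory instance (such as row/column decoders, refresh logic, etc.), and memory blocks 2308 may handle their own logic. Memory blocks 2308 may therefore receive an address and generate commands for returning or writing data.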
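FIG. 24 depicts exemplary memory configuration diagrams, consistent with disclosed embodiments. In some embodiments, a compiler generating code or configuration for processing device 2200 may perform a method to configure loading from memory blocks 2202 and 2204 by pre-arranging data in each block. For example, a compiler may prearrange data so that each word required for a task is correlated with a line of a memory instance or memory block. But for tasks that require more memory blocks than the ones available in processing device 2200, a compiler may implement methods of fitting data in more than one memory location of each memory block. The compiler may also store data in sequence and evaluate the latency of each memory block to avoid line-miss latency. In some embodiments, the host may be part of a processing unit, such as configuration manager 2212, but in other embodiments the compiler host may be connected to processing device 2200 via an external interface. In such embodiments, the host may run compiling functions, such as the ones described for the compiler.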
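In some embodiments, configuration manager 2212 may be a CPU or a microcontroller (uC). In such embodiments, configuration manager 2212 may have to access the memory to fetch commands or instructions placed in the memory. A specific compiler may generate the code and place it in the memory in a manner that allows consecutive commands to be stored in the same memory line and across a number of memory banks, so that the fetched commands can also be accessed with pipelined memory access. In these embodiments, configuration manager 2212 and memory controller 2210 may be capable of avoiding row latency in linear execution by facilitating pipelined memory access.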
The previous case of linear execution of a program described a method for a compiler to recognize and place the instructions to allow pipelined memory execution. However, other software structures may be more complex and would require the compiler to recognize them and act accordingly. For example, where a task requires loops and branches, a compiler may place all the loop code inside a single line so that the single line can be looped without line-opening latency. Memory controller 2210 then may not need to change lines during execution.
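In some embodiments, configuration manager 2212 may include an internal cache or a small memory. The internal cache may store commands that are executed by configuration manager 2212 to handle branches and loops. For example, commands in the internal cache may include instructions to configure accelerators for accessing memory blocks.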
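FIG. 25 is an exemplary flowchart illustrating a possible memory configuration process 2500, consistent with disclosed embodiments. Where convenient in describing memory configuration process 2500, reference may be made to the identifiers of elements depicted in FIG. 22 and described above. In some embodiments, process 2500 may be executed by a compiler that provides instructions to a host connected through an external interface. In other embodiments, process 2500 may be executed by components of processing device 2200, such as configuration manager 2212.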
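In general, process 2500 may include: determining a number of words required simultaneously to perform the task; determining a number of words that can be accessed simultaneously from each of the plurality of memory banks; and dividing the number of words required simultaneously between multiple memory banks when the number of words required simultaneously is greater than the number of words that can be accessed simultaneously. Dividing the number of words required simultaneously may include executing a cyclic organization of words and sequentially assigning one word per memory bank.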
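The cyclic organization mentioned above can be sketched as a round-robin assignment. The following Python fragment is a minimal illustration under stated assumptions: `divide_round_robin` and its parameters are hypothetical names, and the choice to leave the words in a single bank when that bank can already deliver them together is an assumption, not something the disclosure specifies.

```python
def divide_round_robin(words, num_banks, words_per_bank_access=1):
    """Assign simultaneously needed words to banks, one word per bank in turn.

    words: labels of the words that a task needs at the same time.
    num_banks: number of memory banks available (assumed at least 1).
    words_per_bank_access: words one bank can deliver simultaneously.
    Returns a list with one list of words per bank.
    """
    if num_banks < 1:
        raise ValueError("at least one memory bank is required")
    banks = [[] for _ in range(num_banks)]
    if len(words) <= words_per_bank_access:
        # A single bank can already supply everything at once (assumption).
        banks[0].extend(words)
        return banks
    for index, word in enumerate(words):
        banks[index % num_banks].append(word)  # cyclic, one word per bank
    return banks


if __name__ == "__main__":
    print(divide_round_robin(["w0", "w1", "w2", "w3"], num_banks=2))
```

When more words are needed than there are banks, the assignment wraps around, so some banks hold several words and would need more than one access.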
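More specifically, process 2500 may begin with step 2502, in which a compiler may receive a task specification. The specification includes the required computations and/or a priority level.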
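In step 2504, a compiler may identify an accelerator, or group of accelerators, that may perform the task. Alternatively, the compiler may generate instructions so that the processing units, such as configuration manager 2212, may identify an accelerator to perform the task. For example, using the required computations, configuration manager 2212 may identify accelerators in the group of accelerators 2216 that can process the task.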
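In step 2506, the compiler may determine a number of words that need to be accessed simultaneously to execute the task. For example, the multiplication of two vectors requires access to at least two vectors, and the compiler may therefore determine that vector words must be accessed simultaneously to perform the operation.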
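In step 2508, a compiler may determine a number of cycles necessary to execute the task. For example, if the task requires a convolution operation of four by-products, the compiler may determine that at least 4 cycles will be necessary to perform the task.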
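In step 2510, a compiler may place words that need to be accessed simultaneously in different memory banks. In that way, memory controller 2210 may be configured to open lines to different memory instances and access the required memory blocks within a clock cycle, without needing any cached data.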
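In step 2512, a compiler places words that are accessed sequentially in the same memory banks. For example, where four cycles of operations are required, the compiler may generate instructions to write the needed words in sequential cycles in a single memory block, to avoid changing lines between different memory blocks during execution.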
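In step 2514, a compiler generates instructions for programming processing units, such as configuration manager 2212. The instructions may specify conditions to operate a switching device (such as a multiplexer) or to configure a data bus. With such instructions, configuration manager 2212 may configure memory controller 2210 to route data from, or grant access to, memory blocks to processing units using dedicated lines of communication according to a task.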
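FIG. 26 is an exemplary flowchart illustrating a memory read process 2600, consistent with disclosed embodiments. Where convenient in describing memory read process 2600, reference may be made to the identifiers of elements depicted in FIG. 22 and described above. In some embodiments, as described below, process 2600 may be implemented by memory controller 2210. In other embodiments, however, process 2600 may be implemented by other elements in processing device 2200, such as configuration manager 2212.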
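In step 2602, memory controller 2210, configuration manager 2212, or other processing units may receive an indication to route data from, or grant access to, a memory bank. The request may specify an address and a memory block.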
In some embodiments, the request may be received via a data bus specifying a read command in line 2218 and an address in line 2220. In other embodiments, the request may be received via demultiplexers connected to memory controller 2210.
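In step 2604, configuration manager 2212, a host, or other processing units may query an internal register. The internal register may include information regarding opened lines to memory banks, opened addresses, opened memory blocks, and/or upcoming tasks. Based on the information in the internal register, it may be determined whether there are lines opened to the memory bank and/or memory block named in the request received in step 2602. Alternatively or additionally, memory controller 2210 may directly query the internal register.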
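If the internal register indicates that the memory bank is not loaded in an opened line (step 2606: no), process 2600 may continue to step 2616, and a line may be loaded to a memory bank associated with the received address. In addition, memory controller 2210 or a processing unit, such as configuration manager 2212, may signal a delay to the element requesting information from the memory address in step 2616. For example, if accelerator 2216 is requesting memory information located in an already occupied memory block, memory controller 2210 may send a delay signal to the accelerator in step 2618. In step 2620, configuration manager 2212 or memory controller 2210 may update the internal register to indicate that a line has been opened to a new memory bank or a new memory block.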
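If the internal register indicates that the memory bank is loaded in an opened line (step 2606: yes), process 2600 may continue to step 2608. In step 2608, it may be determined whether the line loaded with the memory bank is being used for a different address. If the line is being used for a different address (step 2608: yes), this would indicate that there are two instances in a single block, which therefore cannot be accessed simultaneously. An error or exemption signal may thus be sent to the element requesting information from the memory address in step 2616. But if the line is not being used for a different address (step 2608: no), a line may be opened for the address, data may be retrieved from the target memory bank, and the process may continue to step 2614 to transmit the data to the element requesting information from the memory address.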
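The decision flow of steps 2604 through 2620 can be summarized in a short sketch. The Python below is illustrative only and rests on assumptions: the internal register is modeled as a dictionary mapping each bank to the address its opened line currently serves, and `Outcome` and `handle_read` are hypothetical names that do not appear in the disclosure.

```python
from enum import Enum


class Outcome(Enum):
    DATA = "data"    # word supplied to the requesting element (step 2614)
    DELAY = "delay"  # line had to be loaded first (steps 2616 to 2620)
    ERROR = "error"  # opened line is in use for another address (step 2608: yes)


def handle_read(register, memory, bank, address):
    """Resolve one read request against the internal register.

    register: dict mapping bank -> address served by its opened line
              (assumed model of the internal register).
    memory:   dict mapping bank -> dict of address -> word (assumed model).
    Returns (Outcome, word or None).
    """
    if bank not in register:             # step 2606: no opened line to this bank
        register[bank] = address         # load the line and update the register
        return Outcome.DELAY, None       # requester is told to wait
    if register[bank] != address:        # step 2608: line used for another address
        return Outcome.ERROR, None
    return Outcome.DATA, memory[bank][address]


if __name__ == "__main__":
    register = {}
    memory = {0: {5: "w5", 6: "w6"}}
    print(handle_read(register, memory, 0, 5))  # first request loads the line
    print(handle_read(register, memory, 0, 5))  # retry finds the line opened
    print(handle_read(register, memory, 0, 6))  # different address, same bank
```

The sketch never releases a line once it is opened; a fuller model would clear the register entry when the requesting element finishes with the address.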
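With process 2600, processing device 2200 is able to establish direct connections between processing units and the memory blocks or memory instances that contain the information required to perform a task. This organization of data would enable reading information from organized vectors in different memory instances, as well as allow the retrieval of information simultaneously from different memory blocks when a device requests a plurality of these addresses.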
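FIG. 27 is an exemplary flowchart illustrating an execution process 2700, consistent with disclosed embodiments. Where convenient in describing execution process 2700, reference may be made to the identifiers of elements depicted in FIG. 22 and described above.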
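In step 2702, a compiler or a local unit, such as configuration manager 2212, may receive an indication of a task that needs to be performed. The task may include a single operation (e.g., multiplication) or a more complex operation (e.g., convolution between matrices). The task may also indicate a required computation.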
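In step 2704, the compiler or configuration manager 2212 may determine a number of words required simultaneously to perform the task. For example, the compiler may determine that two words are required simultaneously to perform a multiplication between vectors. In another example, a 2D convolution task, configuration manager 2212 may determine that "n" times "m" words are required for a convolution between matrices, where "n" and "m" are the matrix dimensions. Moreover, in step 2704, configuration manager 2212 may also determine a number of cycles necessary to perform the task.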
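In step 2706, depending on the determinations in step 2704, a compiler may write words that need to be accessed simultaneously in a plurality of memory banks disposed on the substrate. For instance, when the number of words that can be accessed simultaneously from one of the plurality of memory banks is lower than the number of words required simultaneously, a compiler may organize data in multiple memory banks to facilitate access to the different required words within a clock. Moreover, when configuration manager 2212 or the compiler determines the number of cycles necessary to perform the task, the compiler may write words that are needed in sequential cycles in a single memory bank of the plurality of memory banks to prevent switching of lines between memory banks.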
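In step 2708, memory controller 2210 may be configured to read, or grant access to, at least one first word from a first memory bank of the plurality of memory banks or blocks using a first memory line.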
In step 2710, a processing unit, for example one of accelerators 2216, may process the task using the at least one first word.
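In step 2712, memory controller 2210 may be configured to open a second memory line in a second memory bank. For example, based on the task and using the pipelined memory access approach, memory controller 2210 may be configured to open a second memory line in the second memory block where information required for the task was written in step 2706. In some embodiments, the second memory line may be opened when the task in step 2710 is about to be completed. For example, if a task requires 100 clocks, the second memory line may be opened in the 90th clock.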
In some embodiments, steps 2708-2712 may be executed within one line access cycle.
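In step 2714, memory controller 2210 may be configured to grant access to data from at least one second word from the second memory bank using the second memory line opened in step 2712.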
In step 2716, a processing unit, for example one of accelerators 2216, may process the task using the at least one second word.
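In step 2718, memory controller 2210 may be configured to open a second memory line in the first memory bank. For example, based on the task and using the pipelined memory access approach, memory controller 2210 may be configured to open a second memory line to the first memory block. In some embodiments, the second memory line to the first block may be opened when the task in step 2716 is about to be completed.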
In some embodiments, steps 2714-2718 may be executed within one line access cycle.
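In step 2720, memory controller 2210 may read, or grant access to, at least one third word from the first memory bank of the plurality of memory banks or blocks, using the second memory line in the first bank or a first line in a third bank, and continuing in different memory banks.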
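The timing example in process 2700 (a task of 100 clocks with the next line opened in the 90th clock) can be worked through numerically. In the Python sketch below, `stall_cycles` is a hypothetical helper, and the line-opening latency of 10 clocks is an assumed figure chosen for illustration; the disclosure does not state a latency value.

```python
def stall_cycles(task_cycles, open_at_cycle, line_open_latency):
    """Clocks a processing unit waits, after its task ends, for the next line.

    task_cycles:       clocks the current task needs (100 in the example).
    open_at_cycle:     clock at which the next line starts opening.
    line_open_latency: clocks needed to open a line (assumed value).
    """
    ready_at = open_at_cycle + line_open_latency
    return max(0, ready_at - task_cycles)


if __name__ == "__main__":
    ASSUMED_LATENCY = 10  # illustrative only, not taken from the disclosure
    print(stall_cycles(100, 90, ASSUMED_LATENCY))   # opened early, overlap hides latency
    print(stall_cycles(100, 100, ASSUMED_LATENCY))  # opened only when the task ends
```

Under the assumed latency, opening the second line in the 90th clock hides the whole opening time behind the running task, whereas waiting until the task ends exposes all of it.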
Partial Refreshes
Some memory chips, such as dynamic random access memory (DRAM) chips, use refreshes to keep stored data (e.g., stored using capacitance) from being lost due to voltage decay in capacitors or other electric components of the chips. For example, in DRAM each cell has to be refreshed from time to time (based on the specific process and design) to restore the charge in the capacitors so that data is not lost or damaged. As the memory capacities of a DRAM chip increase, the amount of time required to refresh the memory becomes significant. During the time periods when a certain line of memory is being refreshed, the bank containing that line cannot be accessed. This can result in reduced performance. Additionally, the power associated with the refresh process may also be significant. Prior efforts have attempted to reduce the rate at which refreshes are performed to reduce the adverse effects associated with refreshing memory, but most of these efforts have focused on the physical layers of the DRAM.
Refreshing is similar to reading and writing back a row of the memory. Using this principle and focusing on the access pattern to the memory, embodiments of the present disclosure include software and hardware techniques, as well as modifications to the memory chips, to use less power for refreshing and to reduce the amount of time during which memory is refreshed. For example, as an overview, some embodiments may use hardware and/or software to track line access timing and skip recently accessed rows within a refresh cycle (e.g., based on a timing threshold). In another example, some embodiments may rely on software executed by the refresh controller of the memory chip to assign reads and writes such that access to the memory is non-random. Accordingly, the software may control the refresh more precisely to avoid wasted refresh cycles and/or lines. These techniques may be used alone or combined with a compiler that encodes commands for the refresh controller along with machine code for a processor, such that access to the memory is again non-random. Using any combination of these techniques and configurations, which are described in detail below, the disclosed embodiments may reduce memory refresh power requirements and/or increase system performance by reducing the amount of time during which a memory unit is refreshed.
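FIG. 28 depicts an example memory chip 2800 with a refresh controller 2803, consistent with the present disclosure. For example, memory chip 2800 may include a plurality of memory banks (e.g., memory bank 2801a and the like) on a substrate. In the example of FIG. 28, the substrate includes four memory banks, each with four lines. A line may refer to a wordline within one or more memory banks of memory chip 2800 or any other collection of memory cells within memory chip 2800, such as a portion of a memory bank, an entire row along a memory bank, or a group of memory banks.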
In other embodiments, the substrate may include any number of memory banks, and each memory bank may include any number of lines. Some memory banks may include the same number of lines (as shown in FIG. 28) while other memory banks may include different numbers of lines. As further depicted in FIG. 28, memory chip 2800 may include a controller 2805 to receive input to memory chip 2800 and transmit output from memory chip 2800 (e.g., as described above in "Division of Code").
In some embodiments, the plurality of memory banks may comprise dynamic random access memory (DRAM). However, the plurality of memory banks may comprise any volatile memory that stores data requiring periodic refreshes.
As will be discussed in more detail below, the presently disclosed embodiments may employ counters or resistor-capacitor circuits to time refresh cycles. For example, a counter or timer may be used to count time from the last full refresh cycle and then, when the counter reaches its target value, another counter may be used to iterate over all rows. Embodiments of the present disclosure may additionally track accesses to segments of memory chip 2800 and reduce the refresh power required. For example, although not depicted in FIG. 28, memory chip 2800 may further include a data storage configured to store access information indicative of access operations for one or more segments of the plurality of memory banks. For example, the one or more segments may comprise any portions of lines, columns, or any other groupings of memory cells within memory chip 2800. In one particular example, the one or more segments may include at least one row of memory structures within the plurality of memory banks. Refresh controller 2803 may be configured to perform a refresh operation of the one or more segments based, at least in part, on the stored access information.
For example, the data storage may comprise one or more registers, static random access memory (SRAM) cells, or the like associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800). Further, the data storage may be configured to store bits indicative of whether the associated segment was accessed in one or more previous cycles. A "bit" may comprise any data structure storing at least one bit, such as a register, an SRAM cell, a nonvolatile memory, or the like. Moreover, a bit may be set by setting a corresponding switch (or switching element, such as a transistor) of the data structure to ON (which may be equivalent to "1" or "true"). Additionally or alternatively, a bit may be set by modifying any other property within the data structure (such as charging a floating gate of a flash memory, modifying a state of one or more flip-flops in an SRAM, or the like) in order to write a "1" to the data structure (or any other value indicating the setting of a bit). If a bit is determined to be set as part of the memory controller's refresh operation, refresh controller 2803 may skip a refresh cycle for the associated segment and clear the register(s) associated with that portion.
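The bit-based skipping described above can be sketched as a single pass over the segments. The Python below is a minimal illustration, not the disclosed circuit: `refresh_pass` is a hypothetical name, the access bits are modeled as a list of booleans with one entry per row, and the `refresh_row` callback stands in for whatever issues the actual refresh.

```python
def refresh_pass(accessed_bits, refresh_row):
    """Run one refresh cycle, skipping rows whose access bit is set.

    accessed_bits: list of booleans, one per row; True means the row was
                   accessed since the previous pass (assumed model).
    refresh_row:   callable invoked with the index of each refreshed row.
    Returns the indices of the rows that were refreshed. Set bits are
    cleared so that the next pass starts from a clean state.
    """
    refreshed = []
    for row, accessed in enumerate(accessed_bits):
        if accessed:
            # The access itself read and wrote back the row, so skip it
            # and clear the bit for the next refresh cycle.
            accessed_bits[row] = False
        else:
            refresh_row(row)
            refreshed.append(row)
    return refreshed


if __name__ == "__main__":
    bits = [True, False, True, False]
    log = []
    print(refresh_pass(bits, log.append), bits)
```

Because each bit is cleared when its row is skipped, a row that is not accessed again is refreshed on the following pass.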
In another example, the data storage may comprise one or more nonvolatile memories (e.g., a flash memory or the like) associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800). The nonvolatile memory may be configured to store bits indicative of whether the associated segment was accessed in one or more previous cycles.
Some embodiments may additionally or alternatively add a timestamp register on each row or group of rows (or other segment of memory chip 2800) holding the last time the line was accessed within a current refresh cycle. This means that with each row access, the refresh controller may update the row timestamp register. Thus, when the next refresh is due (e.g., at the end of a refresh cycle), the refresh controller may compare the stored timestamp, and if the associated segment was previously accessed within a certain period of time (e.g., within a certain threshold as applied to the stored timestamp), the refresh controller may skip to the next segment. This saves the system from expending refresh power on segments that have been recently accessed. Moreover, the refresh controller may continue to track access to make sure each segment is accessed or refreshed at the next cycle.
Accordingly, in yet another example, the data storage may include one or more registers or non-volatile memories associated with a segment of memory chip 2800 (for example, a line, a column, or any other grouping of memory cells within memory chip 2800). Rather than using bits to indicate whether the associated segment has been accessed, the registers or non-volatile memories may be configured to store timestamps or other information indicating the most recent access of the associated segment. In this example, refresh controller 2803 may determine whether to refresh or access the associated segment based on whether the amount of time between the timestamp stored in the associated register or memory and the current time (for example, from a timer, as explained below with respect to FIGS. 29A and 29B) exceeds a predetermined threshold (for example, 8 ms, 16 ms, 32 ms, 64 ms, or the like).
Accordingly, the predetermined threshold may comprise the length of a refresh cycle, which ensures that the associated segment is refreshed (if not accessed) at least once within each refresh cycle. Alternatively, the predetermined threshold may be shorter than the length of a refresh cycle (for example, to ensure that any required refresh or access signal reaches the associated segment before the refresh cycle completes). For example, the predetermined threshold may be 7 ms for a memory chip with an 8 ms refresh period, such that if a segment has not been accessed within 7 ms, the refresh controller sends a refresh or access signal that reaches the segment by the end of the 8 ms refresh period. In some embodiments, the predetermined threshold may depend on the size of the associated segment. For example, the predetermined threshold may be smaller for smaller segments of memory chip 2800.
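The threshold comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the disclosed circuit: the names `SegmentTimestamp`, `needs_refresh`, and `refresh_pass` are hypothetical, time is modeled as a plain number of milliseconds, and the 7 ms default simply mirrors the 8 ms refresh-period example in the text.

```python
from dataclasses import dataclass


@dataclass
class SegmentTimestamp:
    """Hypothetical per-segment register: time of the last access or refresh."""
    last_touch_ms: float = 0.0


def needs_refresh(seg, now_ms, threshold_ms):
    """True when the segment has gone at least threshold_ms without a touch."""
    return (now_ms - seg.last_touch_ms) >= threshold_ms


def refresh_pass(segments, now_ms, threshold_ms=7.0):
    """Visit every segment once; refresh stale ones, skip recent ones.

    Returns the indices of the segments that received a refresh signal.
    """
    refreshed = []
    for index, seg in enumerate(segments):
        if needs_refresh(seg, now_ms, threshold_ms):
            seg.last_touch_ms = now_ms  # a refresh also counts as a touch
            refreshed.append(index)
    return refreshed
```

With segments last touched at 0 ms, 5 ms, and 1 ms and a pass at 8 ms, only the first and third segments are refreshed; the second was accessed 3 ms earlier and is skipped.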
Although described above with respect to a memory chip, the refresh controllers of the present disclosure may also be used in distributed processor architectures, such as those described in the sections above and throughout the present disclosure. One example of such an architecture is depicted in FIG. 7A. In such embodiments, the same substrate as memory chip 2800 may include a plurality of processing groups disposed thereon, for example, as depicted in FIG. 7A. As explained above with respect to FIG. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on the substrate. The group may represent a spatial distribution on the substrate and/or a logical grouping for the purposes of compiling code for execution on memory chip 2800. Accordingly, the substrate may include a memory array that includes a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. Furthermore, the substrate may include a processing array that includes a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).
As further explained above with respect to FIG. 7A, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to the processor subunit. Moreover, to allow each processor subunit to communicate with its corresponding, dedicated memory banks, the substrate may include a first plurality of buses, each connecting one of the processor subunits to its corresponding, dedicated memory banks.
In such embodiments, as shown in FIG. 7A, the substrate may include a second plurality of buses to connect each processor subunit to at least one other processor subunit (for example, an adjacent subunit in the same row, an adjacent processor subunit in the same column, or any other processor subunit on the substrate). The first and/or second plurality of buses may be free of timing hardware logic components, such that data transfers between processor subunits and across corresponding ones of the buses are uncontrolled by timing hardware logic components, as explained above in the section "Synchronization Using Software."
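In embodiments where the same substrate as memory chip 2800 includes a plurality of processing groups disposed thereon (for example, as depicted in FIG. 7A), the processor subunits may further include address generators (for example, address generator 450 as depicted in FIG. 4). Moreover, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to the processor subunit. Accordingly, each of the address generators may be associated with a corresponding, dedicated one of the plurality of memory banks. In addition, the substrate may include a plurality of buses, each connecting one of the plurality of address generators to its corresponding, dedicated memory bank.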
FIG. 29A depicts an example refresh controller 2900 consistent with the present disclosure. Refresh controller 2900 may be incorporated in a memory chip of the present disclosure, such as memory chip 2800 of FIG. 28. As depicted in FIG. 29A, refresh controller 2900 may include a timer 2901, which may comprise an on-chip oscillator or any other timing circuit for refresh controller 2900. In the configuration depicted in FIG. 29A, timer 2901 may trigger a refresh cycle periodically (for example, every 8 ms, 16 ms, 32 ms, 64 ms, or the like). The refresh cycle may use a row counter 2903 to cycle through all rows of the corresponding memory chip and may generate a refresh signal for each row using adder 2907 combined with valid bit 2905. As shown in FIG. 29A, bit 2905 may be fixed at 1 ("true") to ensure that every row is refreshed during a cycle.
In embodiments of the present disclosure, refresh controller 2900 may include a data storage. As described above, the data storage may include one or more registers or non-volatile memories associated with a segment of memory chip 2800 (for example, a line, a column, or any other grouping of memory cells within memory chip 2800). The registers or non-volatile memories may be configured to store timestamps or other information indicating the most recent access of the associated segment.
Refresh controller 2900 may use the stored information to skip refreshes of segments of memory chip 2800. For example, if the information indicates that a segment was refreshed during one or more previous refresh cycles, refresh controller 2900 may skip the segment in the current refresh cycle. In another example, if the difference between the stored timestamp for a segment and the current time is below a threshold, refresh controller 2900 may skip the segment in the current refresh cycle. Refresh controller 2900 may further continue to track accesses and refreshes of the segments of memory chip 2800 across multiple refresh cycles. For example, refresh controller 2900 may update the stored timestamps using timer 2901. In such embodiments, refresh controller 2900 may be configured to use the output of the timer to clear the access information stored in the data storage after a threshold time interval. For example, in embodiments where the data storage stores timestamps of the most recent access or refresh of the associated segment, refresh controller 2900 may store a new timestamp in the data storage whenever an access command or refresh signal is sent to the segment. If the data storage stores bits rather than timestamps, timer 2901 may be configured to clear bits that have been set for longer than a threshold period of time. For example, in embodiments where the data storage stores bits indicating that the associated segment was accessed in one or more previous cycles, refresh controller 2900 may clear a bit in the data storage (for example, set it to 0) whenever timer 2901 triggers a new refresh cycle that falls a threshold number of cycles (for example, one, two, or the like) after the associated bit was set (for example, set to 1).
Refresh controller 2900 may track accesses to segments of memory chip 2800 in cooperation with other hardware of memory chip 2800. For example, a memory chip uses sense amplifiers to perform read operations (for example, as shown above in FIGS. 9 and 10). A sense amplifier may comprise a plurality of transistors configured to sense low-power signals from a segment of memory chip 2800 that stores data in one or more memory cells, and to amplify the small voltage swing to higher voltage levels so that the data can be interpreted by logic, such as an external CPU or GPU or an integrated processor subunit (as explained above). Although not depicted in FIG. 29A, refresh controller 2900 may further communicate with a sense amplifier configured to access one or more segments and change the state of at least one bit register. For example, when the sense amplifier accesses one or more segments, it may set the bits associated with those segments (for example, to 1) to indicate that the associated segments were accessed in the previous cycle. In embodiments where the data storage stores timestamps of the most recent access or refresh of the associated segment, when the sense amplifier accesses one or more segments, it may trigger a write of a timestamp from timer 2901 to the register, memory, or other element comprising the data storage.
In any of the embodiments described above, refresh controller 2900 may be integrated with a memory controller for the plurality of memory banks. For example, similar to the embodiments depicted in FIG. 3A, refresh controller 2900 may be incorporated into a logic and control subunit associated with a memory bank or other segment of memory chip 2800.
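FIG. 29B depicts another example refresh controller 2900' consistent with the present disclosure. Refresh controller 2900' may be incorporated in a memory chip of the present disclosure, such as memory chip 2800 of FIG. 28. Similar to refresh controller 2900, refresh controller 2900' includes timer 2901, row counter 2903, valid bit 2905, and adder 2907. In addition, refresh controller 2900' may include data storage 2909. As shown in FIG. 29B, data storage 2909 may include one or more registers or non-volatile memories associated with a segment of memory chip 2800 (for example, a line, a column, or any other grouping of memory cells within memory chip 2800), and states within the data storage may be configured to change (for example, by a sense amplifier and/or other elements of refresh controller 2900', as described above) in response to one or more segments being accessed. Accordingly, refresh controller 2900' may be configured to skip a refresh of the one or more segments based on the states within the data storage. For example, if the state associated with a segment is activated (for example, set to 1 by switching on, by changing a property so as to store a "1," or the like), refresh controller 2900' may skip the refresh cycle for the associated segment and clear the state associated with that segment. The state may be stored in at least a one-bit register or in any other memory structure configured to store at least one bit of data.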
To ensure that segments of the memory chip are refreshed or accessed during each refresh cycle, refresh controller 2900' may reset or otherwise clear the states so that a refresh signal is triggered during the next refresh cycle. In some embodiments, after a segment is skipped, refresh controller 2900' may clear the associated state to ensure that the segment is refreshed in the next refresh cycle. In other embodiments, refresh controller 2900' may be configured to reset the states within the data storage after a threshold time interval. For example, refresh controller 2900' may clear a state in the data storage (for example, set it to 0) whenever timer 2901 exceeds a threshold time since the associated state was set (for example, set to 1 by switching on, by changing a property so as to store a "1," or the like). In some embodiments, refresh controller 2900' may use a threshold number of refresh cycles (for example, one, two, or the like) or a threshold number of clock cycles (for example, two, four, or the like) rather than a threshold time.
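In other embodiments, the state may comprise a timestamp of the most recent refresh or access of the associated segment, such that if the amount of time between the timestamp and the current time (for example, from timer 2901 of FIGS. 29A and 29B) exceeds a predetermined threshold (for example, 8 ms, 16 ms, 32 ms, 64 ms, or the like), refresh controller 2900' may send an access command or a refresh signal to the associated segment and update the timestamp associated with that segment (for example, using timer 2901). Additionally or alternatively, refresh controller 2900' may be configured to skip a refresh operation for one or more segments of the plurality of memory banks if the refresh time indicator indicates that the last refresh time is within a predetermined time threshold. In such embodiments, after skipping the refresh operation for the one or more segments, refresh controller 2900' may be configured to alter the stored refresh time indicator associated with the one or more segments so that the one or more segments will be refreshed during the next operation cycle. For example, as described above, refresh controller 2900' may use timer 2901 to update the stored refresh time indicator.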
Accordingly, the data storage may include a timestamp register configured to store a refresh time indicator indicating the time at which the one or more segments of the plurality of memory banks were last refreshed. Moreover, refresh controller 2900' may use the output of the timer to clear the access information stored in the data storage after a threshold time interval.
In any of the embodiments described above, an access to the one or more segments may include a write operation associated with the one or more segments. Additionally or alternatively, an access to the one or more segments may include a read operation associated with the one or more segments.
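Moreover, as depicted in FIG. 29B, refresh controller 2900' may include row counter 2903 and adder 2907, which are configured to assist in updating data storage 2909 based, at least in part, on the states within the data storage. Data storage 2909 may comprise a bit table associated with the plurality of memory banks. For example, the bit table may comprise an array of switches (or switching elements, such as transistors) or registers (for example, SRAM or the like) configured to hold bits for the associated segments. Additionally or alternatively, data storage 2909 may store timestamps associated with the plurality of memory banks.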
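In addition, refresh controller 2900' may include a refresh gate 2911 configured to control whether a refresh of the one or more segments occurs based on the corresponding value stored in the bit table. For example, refresh gate 2911 may comprise a logic gate (such as an AND gate) configured to nullify the refresh signal from row counter 2903 if the corresponding state in data storage 2909 indicates that the associated segment was refreshed or accessed during one or more previous clock cycles. In other embodiments, refresh gate 2911 may comprise a microprocessor or other circuit configured to nullify the refresh signal from row counter 2903 if the corresponding timestamp from data storage 2909 indicates that the associated segment was refreshed or accessed within a predetermined threshold time value.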
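As a rough behavioral model of the bit-table variant of FIG. 29B, the Python sketch below walks a row counter over all rows and gates each refresh signal on the row's table bit. The class and method names are hypothetical, one bit per row is an assumption (the text allows any segment granularity), and the gate is modeled as a logical AND of the valid bit with the inverted table bit, which is only one way to read "AND gate" here.

```python
class BitTableRefreshController:
    """Hypothetical software model of a refresh controller with a bit table."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        # Models the data storage as a bit table: True means "accessed
        # since the last refresh cycle".
        self.accessed = [False] * num_rows

    def record_access(self, row):
        """Called when a sense amplifier reads or writes the given row."""
        self.accessed[row] = True

    def refresh_cycle(self):
        """Run one refresh cycle and return the rows that were refreshed."""
        refreshed = []
        for row in range(self.num_rows):       # models the row counter
            valid = True                        # models the fixed valid bit
            # Models the refresh gate: the signal passes only when the
            # row's table bit is clear.
            if valid and not self.accessed[row]:
                refreshed.append(row)
            # Clear the bit so the row is refreshed in the next cycle
            # unless it is accessed again in the meantime.
            self.accessed[row] = False
        return refreshed
```

Clearing every bit at the end of a cycle corresponds to the embodiment in which a skipped segment's state is cleared so that it is refreshed in the next cycle; a threshold of several cycles would need a counter per row instead of a single bit.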
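FIG. 30 is an example flowchart of a process 3000 for partial refreshes in a memory chip (for example, memory chip 2800 of FIG. 28). Process 3000 may be executed by a refresh controller consistent with the present disclosure, such as refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B.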
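At step 3010, the refresh controller may access information indicating access operations for one or more segments of a plurality of memory banks. For example, as explained above with respect to FIGS. 29A and 29B, the refresh controller may include a data storage associated with a segment of memory chip 2800 (for example, a line, a column, or any other grouping of memory cells within memory chip 2800) and configured to store timestamps or other information indicating the most recent access of the associated segment.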
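At step 3020, the refresh controller may generate refresh and/or access commands based, at least in part, on the accessed information. For example, as explained above with respect to FIGS. 29A and 29B, the refresh controller may skip a refresh operation for one or more segments of the plurality of memory banks if the accessed information indicates that the last refresh or access time is within a predetermined time threshold and/or that the last refresh or access occurred during one or more previous clock cycles. Additionally or alternatively, the refresh controller may generate commands to refresh or access the associated segment based on whether the accessed information indicates that the last refresh or access time exceeds a predetermined threshold and/or that the last refresh or access did not occur during one or more previous clock cycles.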
At step 3030, the refresh controller may alter the stored refresh time indicator associated with the one or more segments so that the one or more segments will be refreshed during the next operation cycle. For example, after skipping a refresh operation for one or more segments, the refresh controller may alter the information indicating access operations for the one or more segments so that they will be refreshed during the next clock cycle. Thus, after skipping a refresh cycle, the refresh controller may clear the state of the segment (for example, set it to 0). Additionally or alternatively, the refresh controller may set the state (for example, to 1) of segments that are refreshed and/or accessed during the current cycle. In embodiments where the information indicating access operations for the one or more segments includes timestamps, the refresh controller may update any stored timestamps associated with segments refreshed and/or accessed during the current cycle.
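Process 3000 may further include additional steps. For example, in addition to or as an alternative to step 3030, a sense amplifier may access the one or more segments and may change the information associated with them. Additionally or alternatively, the sense amplifier may signal the refresh controller when an access has occurred so that the refresh controller can update the information associated with the one or more segments. As explained above, a sense amplifier may comprise a plurality of transistors configured to sense low-power signals from a segment of the memory chip that stores data in one or more memory cells, and to amplify the small voltage swing to higher voltage levels so that the data can be interpreted by logic, such as an external CPU or GPU or an integrated processor subunit (as explained above). In this example, whenever the sense amplifier accesses one or more segments, it may set the bits associated with those segments (for example, to 1) to indicate that the associated segments were accessed in the previous cycle. In embodiments where the information indicating access operations for the one or more segments includes timestamps, whenever the sense amplifier accesses one or more segments, it may trigger a write of a timestamp from the timer of the refresh controller to the data storage, updating any stored timestamps associated with those segments.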
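FIG. 31 is an example flowchart of a process 3100 for determining refreshes for a memory chip (for example, memory chip 2800 of FIG. 28). Process 3100 may be implemented within a compiler consistent with the present disclosure. As explained above, a "compiler" refers to any computer program that converts a higher-level language (for example, a procedural language, such as C, FORTRAN, BASIC, or the like; an object-oriented language, such as Java, C++, Pascal, Python, or the like; and so on) to a lower-level language (for example, assembly code, object code, machine code, or the like). The compiler may allow a human to program a series of instructions in a human-readable language, which is then converted to a machine-executable language. The compiler may comprise software instructions executed by one or more processors.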
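At step 3110, the one or more processors may receive higher-level computer code. For example, the higher-level computer code may be encoded in one or more files on a memory (for example, a non-volatile memory such as a hard disk drive or the like, a volatile memory such as DRAM, or the like) or received over a network (for example, the Internet or the like). Additionally or alternatively, the higher-level computer code may be received from a user (for example, using an input device such as a keyboard).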
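At step 3120, the one or more processors may identify a plurality of memory segments, distributed over a plurality of memory banks associated with a memory chip, that are to be accessed by the higher-level computer code. For example, the one or more processors may access a data structure defining the plurality of memory banks and a corresponding structure of the memory chip. The one or more processors may access the data structure from a memory (for example, a non-volatile memory such as a hard disk drive or the like, a volatile memory such as DRAM, or the like) or receive it over a network (for example, the Internet or the like). In such embodiments, the data structure is included in one or more libraries accessible by the compiler, permitting the compiler to generate instructions for the particular memory chip to be accessed.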
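At step 3130, the one or more processors may assess the higher-level computer code to identify a plurality of memory read commands that occur over a plurality of memory access cycles. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands from memory and/or one or more write commands to memory. Such instructions may include variable initialization, variable re-assignment, logic operations on variables, input-output operations, or the like.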
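At step 3140, the one or more processors may cause data associated with the plurality of memory access commands to be distributed across each of the plurality of memory segments such that each of the plurality of memory segments is accessed during each of the plurality of memory access cycles. For example, the one or more processors may identify the memory segments from the data structure defining the structure of the memory chip and then assign variables from the higher-level code to various ones of the memory segments such that each memory segment is accessed (for example, via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In such an example, the one or more processors may access information indicating how many clock cycles each line of the higher-level code requires, in order to assign variables from the lines of higher-level code such that each memory segment is accessed (for example, via a write or a read) at least once during the particular number of clock cycles.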
In another example, the one or more processors may first generate machine code or other lower-level code from the higher-level code. The one or more processors may then assign variables from the lower-level code to various ones of the memory segments such that each memory segment is accessed (for example, via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In such an example, each line of lower-level code may require a single clock cycle.
In any of the examples given above, the one or more processors may further assign logic operations or other commands that use temporary outputs to various ones of the memory segments. Such temporary outputs may still result in read and/or write commands, such that the assigned memory segment is still accessed during that refresh cycle even though no named variable has been assigned to it.
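One simple way to picture the assignment in step 3140 is the Python sketch below. It is an illustrative heuristic, not an algorithm taken from the disclosure: the function name `distribute_variables` and the split into "hot" variables (assumed to be accessed at least once per refresh cycle) and "cold" variables are assumptions made for the example.

```python
def distribute_variables(hot_vars, cold_vars, num_segments):
    """Place variables so that as many segments as possible hold a hot one.

    hot_vars:  variables assumed to be read or written every refresh cycle.
    cold_vars: all other variables.
    Returns (placement, uncovered), where placement maps a segment index to
    its variables and uncovered lists segments holding no hot variable.
    """
    placement = {seg: [] for seg in range(num_segments)}

    # First give each segment one hot variable, so that the program's own
    # accesses touch the segment in every refresh cycle.
    for seg, var in zip(range(num_segments), hot_vars):
        placement[seg].append(var)
    covered = min(len(hot_vars), num_segments)

    # Spread whatever is left round-robin over all segments.
    leftovers = list(hot_vars[num_segments:]) + list(cold_vars)
    for i, var in enumerate(leftovers):
        placement[i % num_segments].append(var)

    # Segments without a hot variable still need explicit refresh or
    # dummy access commands.
    uncovered = list(range(covered, num_segments))
    return placement, uncovered
```

Segments reported as uncovered are the ones a compiler would still have to handle with explicit refresh commands or dummy accesses.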
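Process 3100 may further include additional steps. For example, in embodiments where the variables are assigned prior to compilation, the one or more processors may generate machine code or other lower-level code from the higher-level code. Moreover, the one or more processors may transmit the compiled code for execution by the memory chip and corresponding logic circuits. The logic circuits may comprise conventional circuits, such as GPUs or CPUs, or may comprise processing groups on the same substrate as the memory chip, for example, as depicted in FIG. 7A. Accordingly, as described above, the substrate may include a memory array that includes a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. Furthermore, the substrate may include a processing array that includes a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).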
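FIG. 32 is another example flowchart of a process 3200 for determining refreshes for a memory chip (for example, memory chip 2800 of FIG. 28). Process 3200 may be implemented within a compiler consistent with the present disclosure. Process 3200 may be executed by one or more processors executing software instructions comprising the compiler. Process 3200 may be implemented separately from or in combination with process 3100 of FIG. 31.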
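At step 3210, similar to step 3110, the one or more processors may receive higher-level computer code. At step 3220, similar to step 3120, the one or more processors may identify a plurality of memory segments, distributed over a plurality of memory banks associated with a memory chip, that are to be accessed by the higher-level computer code.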
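At step 3230, the one or more processors may assess the higher-level computer code to identify a plurality of memory read commands, each implicating one or more of the plurality of memory segments. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands from memory and/or one or more write commands to memory. Such instructions may include variable initialization, variable re-assignment, logic operations on variables, input-output operations, or the like.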
In some embodiments, the one or more processors may simulate an execution of the higher-level code using logic circuits and the plurality of memory segments. For example, the simulation may comprise a line-by-line step-through of the higher-level code, similar to that of a debugger or other instruction set simulator (ISS). The simulation may further maintain internal variables that represent the addresses of the plurality of memory segments, similar to how a debugger may maintain internal variables that represent registers of a processor.
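At step 3240, the one or more processors may, based on an analysis of the memory access commands and for each memory segment among the plurality of memory segments, track an amount of time that would accrue from the last access to the memory segment. For example, using the simulation described above, the one or more processors may determine the length of time between each access (for example, a read or a write) to one or more addresses within each of the plurality of memory segments. The length of time may be measured in absolute time, clock cycles, or refresh cycles (for example, as determined by a known refresh rate of the memory chip).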
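At step 3250, in response to a determination that the amount of time since the last access to any particular memory segment would exceed a predetermined threshold, the one or more processors may introduce into the higher-level computer code at least one of a memory refresh command or a memory access command configured to cause an access to the particular memory segment. For example, the one or more processors may include a refresh command for execution by a refresh controller (for example, refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B). In embodiments where the logic circuits are not embedded on the same substrate as the memory chip, the one or more processors may generate the refresh commands for sending to the memory chip separately from the lower-level code for sending to the logic circuits.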
Additionally or alternatively, the one or more processors may include an access command for execution by a memory controller (which may be separate from the refresh controller or incorporated into it). The access command may comprise a dummy command configured to trigger a read operation on the memory segment without causing the logic circuits to perform any further operation on the variable read from or written to the memory segment.
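Steps 3240 and 3250 can be approximated by replaying a simulated access trace and inserting a refresh wherever a segment would otherwise go too long without being touched. The Python sketch below is an assumption-laden illustration: the function name `insert_refreshes`, the trace format of `(time, segment)` pairs, and the use of clock cycles as the time unit are all hypothetical, and every segment is assumed to have been touched at time 0.

```python
def insert_refreshes(trace, num_segments, threshold, end_time):
    """Return the trace with refresh commands added where needed.

    trace:     list of (time, segment) accesses, sorted by time.
    threshold: longest allowed gap between touches of one segment.
    Each returned command is a (time, kind, segment) tuple, where kind is
    "access" or "refresh".
    """
    last_touch = [0] * num_segments
    commands = []

    def catch_up(now):
        # Emit a refresh each time a segment's gap would exceed the
        # threshold before `now`.
        for seg in range(num_segments):
            while last_touch[seg] + threshold < now:
                last_touch[seg] += threshold
                commands.append((last_touch[seg], "refresh", seg))

    for time, seg in trace:
        catch_up(time)
        commands.append((time, "access", seg))
        last_touch[seg] = time      # a program access resets the gap

    catch_up(end_time)
    commands.sort(key=lambda cmd: cmd[0])   # stable sort keeps tie order
    return commands
```

For the trace `[(2, 0), (5, 1), (12, 0)]` with two segments, a threshold of 8, and an end time of 16, the sketch adds a refresh of segment 0 at time 10 and a refresh of segment 1 at time 13. A dummy access command, as described above, could be emitted in place of each refresh.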
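In some embodiments, the compiler may include a combination of steps from process 3100 and steps from process 3200. For example, the compiler may assign variables according to step 3140 and then run the simulation described above to add any additional memory refresh commands or memory access commands according to step 3250. This combination may allow the compiler to distribute the variables across as many memory segments as possible and to generate refresh or access commands for any memory segments that cannot be accessed within the predetermined threshold amount of time. In another combinatory example, the compiler may simulate the code according to step 3230 and assign variables according to step 3140 based on any memory segments that the simulation indicates will not be accessed within the predetermined threshold amount of time. In some embodiments, this combination may further include step 3250, allowing the compiler to generate refresh or access commands for any memory segments that cannot be accessed within the predetermined threshold amount of time even after the assignments according to step 3140 are complete.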
Refresh controllers of the present disclosure may allow software executed by logic circuits (whether conventional logic circuits, such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, for example, as depicted in FIG. 7A) to disable the automatic refresh executed by the refresh controller and instead control the refresh via the executed software. Accordingly, some embodiments of the present disclosure may provide software with a known access pattern to a memory chip (for example, if the compiler has access to a data structure defining the plurality of memory banks and a corresponding structure of the memory chip). In such embodiments, a post-compiling optimizer may disable the automatic refresh and manually set refresh controls only for segments of the memory chip that are not accessed within a threshold amount of time. Thus, similar to step 3250 described above but after compilation, the post-compiling optimizer may generate refresh commands to ensure that each memory segment is accessed or refreshed within the predetermined threshold amount of time.
Another example of reducing refresh cycles may include using predefined patterns of access to the memory chip. For example, if the software executed by the logic circuits can control its access pattern for the memory chip, some embodiments may create access patterns for refresh that go beyond conventional linear line refreshes. For example, if a controller determines that the software executed by the logic circuits regularly accesses every second row of memory, a refresh controller of the present disclosure may use an access pattern that does not refresh every second row, speeding up the memory chip and reducing power usage.
An example of such a refresh controller is shown in FIG. 33. FIG. 33 depicts an example refresh controller 3300 configured by stored patterns consistent with the present disclosure. Refresh controller 3300 may be incorporated in a memory chip of the present disclosure, for example, one having a plurality of memory banks and a plurality of memory segments included in each of the plurality of memory banks, such as memory chip 2800 of FIG. 28.
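Refresh controller 3300 includes a timer 3301 (similar to timer 2901 of FIGS. 29A and 29B), a row counter 3303 (similar to row counter 2903 of FIGS. 29A and 29B), and an adder 3305 (similar to adder 2907 of FIGS. 29A and 29B). In addition, refresh controller 3300 includes a data storage 3307. Unlike data storage 2909 of FIG. 29B, data storage 3307 may store at least one memory refresh pattern to be implemented in refreshing the plurality of memory segments included in each of the plurality of memory banks. For example, as depicted in FIG. 33, data storage 3307 may include Li (for example, L1, L2, L3, and L4 in the example of FIG. 33) and Hi (for example, H1, H2, H3, and H4 in the example of FIG. 33) that define segments in the memory banks by row and/or column. Moreover, each segment may be associated with an Inci variable (for example, Inc1, Inc2, Inc3, and Inc4 in the example of FIG. 33) that defines how the rows associated with the segment are incremented (for example, whether every row is accessed or refreshed, whether every other row is accessed or refreshed, or the like). Thus, as shown in FIG. 33, a refresh pattern may comprise a table including a plurality of memory segment identifiers assigned by the software to identify ranges of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and ranges of memory segments in the particular memory bank that are not to be refreshed during the refresh cycle.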
Thus, data storage 3307 may define refresh patterns that software executed by logic circuits (whether conventional logic circuits, such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, for example, as depicted in FIG. 7A) may select for use. The memory refresh pattern may be configurable using software to identify which of the plurality of memory segments in a particular memory bank are to be refreshed during a refresh cycle and which are not. Accordingly, refresh controller 3300 may refresh, according to Inci, some or all rows within the defined segments that are not accessed during the current cycle. Refresh controller 3300 may skip other rows of the defined segments that are set to be accessed during the current cycle.
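The pattern table of FIG. 33 can be read as a list of (Li, Hi, Inci) entries, each naming a row range and a stride. The Python sketch below expands such a table into the rows to refresh in one cycle. The names `PatternEntry` and `rows_to_refresh` are hypothetical, and treating Hi as an inclusive upper bound is an assumption, since the text does not say whether the bound is inclusive.

```python
from typing import NamedTuple


class PatternEntry(NamedTuple):
    """One hypothetical row of the refresh-pattern table."""
    low: int    # Li: first row of the segment
    high: int   # Hi: last row of the segment (assumed inclusive)
    inc: int    # Inci: stride between refreshed rows


def rows_to_refresh(pattern):
    """Expand a pattern table into the list of rows refreshed in one cycle."""
    rows = []
    for entry in pattern:
        rows.extend(range(entry.low, entry.high + 1, entry.inc))
    return rows
```

For instance, the table `[PatternEntry(0, 7, 2), PatternEntry(8, 11, 1)]` refreshes every other row in rows 0 through 7, on the assumption that software accesses the rows in between, and every row in rows 8 through 11.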
In embodiments where data storage 3307 of refresh controller 3300 includes a plurality of memory refresh patterns, each may represent a different refresh pattern for refreshing the plurality of memory segments included in each of the plurality of memory banks. The memory refresh patterns may be selectable for use on the plurality of memory segments. Accordingly, refresh controller 3300 may be configured to allow selection of which of the plurality of memory refresh patterns to implement during a particular refresh cycle. For example, software executed by logic circuits (whether conventional logic circuits, such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, for example, as depicted in FIG. 7A) may select different memory refresh patterns for use during one or more different refresh cycles. Alternatively, the software executed by the logic circuits may select one memory refresh pattern for use throughout some or all of the different refresh cycles.
The memory refresh patterns may be encoded using one or more variables stored in data storage 3307. For example, in embodiments where the plurality of memory segments are arranged in rows, each memory segment identifier may be configured to identify a particular location within a row of memory where a memory refresh should begin or end. For example, in addition to Li and Hi, one or more additional variables may define which portions of the rows defined by Li and Hi are within the segment.
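FIG. 34 is an example flowchart of a process 3400 for determining refreshes for a memory chip (for example, memory chip 2800 of FIG. 28). Process 3400 may be implemented by software within a refresh controller consistent with the present disclosure (for example, refresh controller 3300 of FIG. 33).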
At step 3410, the refresh controller may store at least one memory refresh pattern to be implemented in refreshing a plurality of memory segments included in each of a plurality of memory banks. For example, as explained above with respect to FIG. 33, the refresh pattern may comprise a table including a plurality of memory segment identifiers assigned by software to identify ranges of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and ranges of memory segments in the particular memory bank that are not to be refreshed during the refresh cycle.
In some embodiments, the at least one refresh pattern may be encoded onto the refresh controller during manufacture (for example, onto a read-only memory associated with, or at least accessible by, the refresh controller). Accordingly, the refresh controller may access the at least one memory refresh pattern without storing it.
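At steps 3420 and 3430, the refresh controller may use software to identify which of the plurality of memory segments in a particular memory bank are to be refreshed during a refresh cycle and which are not. For example, as explained above with respect to FIG. 33, software executed by logic circuits (whether conventional logic circuits, such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, for example, as depicted in FIG. 7A) may select the at least one memory refresh pattern. Moreover, the refresh controller may access the selected at least one memory refresh pattern during each refresh cycle to generate the corresponding refresh signals. The refresh controller may refresh, according to the at least one memory refresh pattern, some or all portions within the defined segments that are not accessed during the current cycle and may skip other portions of the defined segments that are set to be accessed during the current cycle.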
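At step 3440, the refresh controller may generate corresponding refresh commands. For example, as depicted in FIG. 33, adder 3305 may comprise a logic circuit configured to nullify refresh signals for particular segments that are not to be refreshed according to the at least one memory refresh pattern in data storage 3307. Additionally or alternatively, a microprocessor (not shown in FIG. 33) may generate particular refresh signals based on which segments are to be refreshed according to the at least one memory refresh pattern in data storage 3307.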
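Process 3400 may further include additional steps. For example, in embodiments where the at least one memory refresh pattern is configured to change every one, two, or other number of refresh cycles (for example, moving from L1, H1, and Inc1 to L2, H2, and Inc2, as shown in FIG. 33), the refresh controller may access a different portion of the data storage for its next determination of refresh signals according to steps 3430 and 3440. Similarly, if software executed by logic circuits (whether conventional logic circuits, such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, for example, as depicted in FIG. 7A) selects a new memory refresh pattern from the data storage for use in one or more future refresh cycles, the refresh controller may access a different portion of the data storage for its next determination of refresh signals according to steps 3430 and 3440.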
Selectable Sized Memory Chips
When a memory chip is designed for a certain memory capacity, changing that capacity to a larger or smaller size may require redesigning the product and the entire mask set. Product design often proceeds in parallel with market research and, in some cases, is completed before the market research is available. There may therefore be a disconnect between the product design and the actual demands of the market. The present disclosure proposes a way to flexibly provide memory chips with memory capacities commensurate with market demands. The design method may include designing dies on a wafer along with appropriate interconnect circuitry, such that memory chips containing one or more dies can be selectively cut from the wafer, providing an opportunity to produce memory chips of variable memory capacity from a single wafer.
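The present disclosure relates to systems and methods for fabricating memory chips by cutting them from a wafer. The method may be used to produce selectable-size memory chips from the wafer. An example embodiment of a wafer 3501 containing dies 3503 is shown in FIG. 35A. Wafer 3501 may be formed from a semiconductor material (e.g., silicon (Si), silicon-germanium (SiGe), silicon on insulator (SOI), gallium nitride (GaN), aluminum nitride (AlN), aluminum gallium nitride (AlGaN), boron nitride (BN), gallium arsenide (GaAs), gallium aluminum arsenide (AlGaAs), indium nitride (InN), combinations thereof, and the like). Dies 3503 may include any suitable circuit elements (e.g., transistors, capacitors, resistors, and/or the like), which may include any suitable semiconductor, dielectric, or metallic components. Dies 3503 may be formed from a semiconductor material that may be the same as or different from the material of wafer 3501. In addition to dies 3503, wafer 3501 may include other structures and/or circuitry. In some embodiments, one or more coupling circuits may be provided to couple one or more of the dies together. In an example embodiment, such a coupling circuit may include a bus shared by two or more dies 3503. Additionally, the coupling circuit may include one or more logic circuits designed to control circuitry associated with dies 3503 and/or to direct information to and from dies 3503. In some cases, the coupling circuit may include memory access management logic. Such logic may translate logical memory addresses into physical addresses associated with dies 3503. It should be noted that the term fabrication, as used herein, may refer collectively to any of the steps for building the disclosed wafers, dies, and/or chips. For example, fabrication may refer to the simultaneous laying out and forming of the various dies (and any other circuitry) included on the wafer. Fabrication may also refer to the cutting of selectable-size memory chips from the wafer to include one die in some cases, or multiple dies in other cases. Of course, the term fabrication is not intended to be limited to these examples and may include other aspects associated with generating any or all of the disclosed memory chips and intermediate structures.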
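The translation of logical memory addresses into per-die physical addresses can be illustrated with a minimal sketch. It assumes, purely for illustration, that each die contributes an equally sized, contiguous range of the logical address space; the function name `translate` and its parameters are hypothetical and do not come from the disclosure.

```python
def translate(logical_addr, die_capacity, num_dies):
    """Map a logical address to (die index, offset within that die).

    Assumption: dies are laid end to end in the logical address space,
    each covering `die_capacity` consecutive addresses.
    """
    if not 0 <= logical_addr < die_capacity * num_dies:
        raise ValueError("logical address outside the chip's capacity")
    die_index, offset = divmod(logical_addr, die_capacity)
    return die_index, offset


# Two dies of 1024 addressable words each: address 1500 lands on the second die.
print(translate(1500, 1024, 2))
```

Under this assumed layout, a chip cut with more dies only changes `num_dies`; the mapping itself stays the same.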
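A die 3503 or a group of dies may be used to fabricate a memory chip. The memory chip may include distributed processors, as described in other sections of the present disclosure. As shown in FIG. 35B, die 3503 may include a substrate 3507 and a memory array disposed on the substrate. The memory array may include one or more memory units, such as memory banks 3511A-3511D designed to store data. In various embodiments, the memory banks may include semiconductor-based circuit elements such as transistors, capacitors, and the like. In an example embodiment, a memory bank may include multiple rows and columns of storage cells. In some cases, such a memory bank may have a capacity greater than one megabyte. The memory banks may include dynamic or static access memory.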
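Die 3503 may further include a processing array disposed on the substrate, the processing array including a plurality of processor subunits 3515A-3515D, as shown in FIG. 35B. As described above, each memory bank may include a dedicated processor subunit connected by a dedicated bus. For example, processor subunit 3515A is associated with memory bank 3511A via a bus or connection 3512. It should be understood that various connections between memory banks 3511A-3511D and processor subunits 3515A-3515D are possible, and only some illustrative connections are shown in FIG. 35B. In an example embodiment, a processor subunit may perform read/write operations on its associated memory bank and may further perform refresh operations or any other suitable operations on the memory stored in the various memory banks.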
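As noted, die 3503 may include a first group of buses configured to connect processor subunits with their corresponding memory banks. An example bus may include a set of wires or conductors that connect electrical components and allow data and addresses to be transferred to and from each memory bank and its associated processor subunit. In an example embodiment, connection 3512 may serve as a dedicated bus connecting processor subunit 3515A to memory bank 3511A. Die 3503 may include a group of such buses, each connecting a processor subunit to a corresponding, dedicated memory bank. Additionally, die 3503 may include another group of buses, each connecting processor subunits (e.g., subunits 3515A-3515D) to one another. For example, such buses may include connections 3516A-3516D. In various embodiments, data for memory banks 3511A-3511D may be delivered via input-output bus 3530. In an example embodiment, input-output bus 3530 may carry data-related information and command-related information for controlling the operation of the memory units of die 3503. Data information may include data to be stored in the memory banks, data read from the memory banks, processing results from one or more of the processor subunits based on operations performed on data stored in the corresponding memory banks, command-related information, various code, and the like.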
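In various cases, data and commands transmitted over input-output bus 3530 may be controlled by input-output (IO) controller 3521. In an example embodiment, IO controller 3521 may control the flow of data from bus 3530 to processor subunits 3515A-3515D and from processor subunits 3515A-3515D. IO controller 3521 may determine from which of processor subunits 3515A-3515D information is retrieved. In various embodiments, IO controller 3521 may include a fuse 3554 configured to deactivate IO controller 3521. Fuse 3554 may be used if multiple dies are combined to form a larger memory chip (also referred to as a multi-die memory chip, as an alternative to a single-die memory chip containing only one die). The multi-die memory chip may then use the IO controller of one of the die units forming the multi-die memory chip, while disabling the IO controllers of the other die units by using the fuses corresponding to those other IO controllers.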
As noted, each memory chip, predecessor die, or group of dies may include distributed processors associated with corresponding memory banks. In some embodiments, these distributed processors may be arranged in a processing array disposed on the same substrate as a plurality of memory banks. Additionally, the processing array may include one or more logic portions, each including an address generator (also referred to as an address generator unit (AGU)). In some cases, the address generator may be part of at least one processor subunit. The address generator may generate the memory addresses required for fetching data from the one or more memory banks associated with the memory chip. Address-generation calculations may involve integer arithmetic operations, such as addition, subtraction, modulo operations, or bit shifts. The address generator may be configured to operate on multiple operands at a time. Furthermore, multiple address generators may perform more than one address-calculation operation simultaneously. In various embodiments, an address generator may be associated with a corresponding memory bank. The address generators may be connected with their corresponding memory banks by means of corresponding bus lines.
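The integer operations named above (addition, bit shift, modulo) can be combined into a small address-generation sketch. The formula, the function names `generate_address` and `generate_addresses`, and all parameters are illustrative assumptions; the disclosure does not specify a particular addressing mode.

```python
def generate_address(base, index, shift, bank_size):
    """Compute (base + (index << shift)) mod bank_size with integer ops only.

    `shift` scales the index by a power of two (e.g., the word size), and the
    modulo wraps the result so it stays inside the bank (an assumed policy).
    """
    return (base + (index << shift)) % bank_size


def generate_addresses(base, indices, shift, bank_size):
    """Apply the same calculation to several operands, as one AGU might."""
    return [generate_address(base, i, shift, bank_size) for i in indices]


# A 64-entry bank, base address 60, 4-entry stride: the sequence wraps around.
print(generate_addresses(60, [0, 1, 2], 2, 64))
```

Wrapping with a modulo is only one possible choice; a real address generator could instead saturate or signal an out-of-range access.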
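In various embodiments, selectable-size memory chips may be formed from wafer 3501 by selectively cutting different regions of the wafer. As noted, the wafer may include a group of dies 3503, the group including any group of two or more dies (e.g., 2, 3, 4, 5, 10, or more dies) included on the wafer. As discussed further below, in some cases a single memory chip may be formed by cutting a portion of the wafer that includes only one die of the group of dies. In such cases, the resulting memory chip would include the memory units associated with one die. In other cases, however, selectable-size memory chips may be formed to include more than one die. Such memory chips may be formed by cutting regions of the wafer that include two or more dies of the group of dies included on the wafer. In such cases, the dies, together with the coupling circuit that couples them together, provide a multi-die memory chip. Some additional circuit elements, such as clock elements, data buses, or any suitable logic circuits, may also be wired on board between the chips.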
In some cases, at least one controller associated with the group of dies may be configured to control the operation of the group of dies as a single memory chip (e.g., a multiple-memory-unit memory chip). The controller may include one or more circuits that manage the flow of data going into and out of the memory chip. A memory controller may be part of the memory chip, or it may be part of a separate chip not directly related to the memory chip. In an example embodiment, the controller may be configured to facilitate read and write requests or other commands associated with the distributed processors of the memory chip, and may be configured to control any other suitable aspects of the memory chip (e.g., refreshing the memory chip, interacting with the distributed processors, and so on). In some cases, the controller may be part of die 3503, and in other cases the controller may be laid out adjacent to die 3503. In various embodiments, the controller may also include at least one memory controller of at least one of the memory units included on the memory chip. In some cases, the protocol used to access information on the memory chip may be agnostic to duplicate logic and memory units (e.g., memory banks) that may be present on the memory chip. The protocol may be configured to have different IDs or address ranges for adequate access to data on the memory chip. Examples of chips with such a protocol include a chip with a Joint Electron Device Engineering Council (JEDEC) double data rate (DDR) controller, where different memory banks may have different address ranges; a serial peripheral interface (SPI) connection, where different memory units (e.g., memory banks) have different identifiers (IDs); and the like.
In various embodiments, multiple regions may be cut from the wafer, each region including one or more dies. In some cases, each separate region may be used to build a multi-die memory chip. In other cases, each region to be cut from the wafer may include a single die to provide a single-die memory chip. In some cases, two or more of the regions may have the same shape and the same number of dies coupled to the coupling circuit in the same way. Alternatively, in some example embodiments, a first group of regions may be used to form a first type of memory chip, and a second group of regions may be used to form a second type of memory chip. For example, as shown in FIG. 35C, wafer 3501 may include a region 3505, which may include a single die, and a second region 3504, which may include a group of two dies. When region 3505 is cut from wafer 3501, a single-die memory chip is provided. When region 3504 is cut from wafer 3501, a multi-die memory chip is provided. The groups shown in FIG. 35C are only illustrative, and various other regions and groups of dies may be cut out from wafer 3501.
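In various embodiments, dies may be formed on wafer 3501 such that they are arranged along one or more rows of the wafer, as shown, for example, in FIG. 35C. The dies may share an input-output bus 3530 corresponding to the one or more rows. In an example embodiment, a group of dies may be cut out from wafer 3501 using various cutting shapes, where the cut that removes a group of dies usable to form a memory chip may exclude at least a portion of the shared input-output bus 3530 (e.g., only a portion of input-output bus 3530 may be included as part of the memory chip formed from the group of dies).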
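As previously discussed, when multiple dies (e.g., dies 3506A and 3506B, as shown in FIG. 35C) are used to form a memory chip 3517, one IO controller corresponding to one of the dies may be enabled and configured to control the flow of data to all the processor subunits of dies 3506A and 3506B. For example, FIG. 35D shows memory dies 3506A and 3506B combined to form memory chip 3517, which includes memory banks 3511A-3511H, processor subunits 3515A-3515H, IO controllers 3521A and 3521B, and fuses 3554A and 3554B. It should be noted that memory chip 3517 corresponds to a region 3517 of wafer 3501 before the memory chip is removed from the wafer. In other words, as used here and elsewhere in the present disclosure, regions 3504, 3505, 3517, and so on of wafer 3501, once cut from wafer 3501, yield memory chips 3504, 3505, 3517, and so on. Additionally, fuses are also referred to herein as disabling elements. In an example embodiment, fuse 3554B may be used to deactivate IO controller 3521B, and IO controller 3521A may be used to control the flow of data to all memory banks 3511A-3511H by communicating data to processor subunits 3515A-3515H. In an example embodiment, IO controller 3521A may be connected to the various processor subunits using any suitable connection. In some embodiments, as described further below, processor subunits 3515A-3515H may be interconnected, and IO controller 3521A may be configured to control the flow of data to processor subunits 3515A-3515H, which form the processing logic of memory chip 3517.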
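In an example embodiment, IO controllers, such as controllers 3521A and 3521B, and corresponding fuses 3554A and 3554B may be formed on wafer 3501 together with memory banks 3511A-3511H and processor subunits 3515A-3515H. In various embodiments, when forming memory chip 3517, one of the fuses (e.g., fuse 3554B) may be activated so that dies 3506A and 3506B are configured to form memory chip 3517, which functions as a single chip and is controlled by a single input-output controller (e.g., controller 3521A). In an example embodiment, activating a fuse may include applying a current to trigger the fuse. In various embodiments, when more than one die is used to form a memory chip, all IO controllers but one may be deactivated via their corresponding fuses.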
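The rule that all IO controllers but one are deactivated through their fuses can be modeled with a short sketch. The class `IOController`, the function `configure_multi_die_chip`, and the choice of which controller to keep are hypothetical modeling choices, not details from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class IOController:
    name: str
    fuse_blown: bool = False  # a blown fuse permanently disables the controller

    @property
    def enabled(self):
        return not self.fuse_blown


def configure_multi_die_chip(controllers, keep=0):
    """Blow the fuse of every controller except the one at index `keep`.

    Returns the single controller left in charge of the whole chip.
    """
    for i, ctrl in enumerate(controllers):
        if i != keep:
            ctrl.fuse_blown = True  # models applying current to trigger the fuse
    return controllers[keep]


# Two dies combined into one chip: 3521A stays active, 3521B is disabled.
ctrls = [IOController("3521A"), IOController("3521B")]
active = configure_multi_die_chip(ctrls)
print(active.name, [c.enabled for c in ctrls])
```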
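In various embodiments, as shown in FIG. 35C, multiple dies are formed on wafer 3501 together with a set of input-output buses and/or control buses. An example input-output bus 3530 is shown in FIG. 35C. In an example embodiment, one of the input-output buses (e.g., input-output bus 3530) may be connected to multiple dies. FIG. 35C shows an example embodiment of input-output bus 3530 passing next to dies 3506A and 3506B. The configuration of dies 3506A and 3506B and input-output bus 3530 shown in FIG. 35C is only illustrative, and various other configurations may be used. For example, FIG. 35E illustrates dies 3540 formed on wafer 3501 and arranged in a hexagonal formation. A memory chip 3532 that includes four dies 3540 may be cut out from wafer 3501. In an example embodiment, memory chip 3532 may include a portion of input-output bus 3530 connected to the four dies by suitable bus lines (e.g., line 3533, as shown in FIG. 35E). To route information to the appropriate memory unit of memory chip 3532, memory chip 3532 may include input-output controllers 3542A and 3542B placed at branch points of input-output bus 3530. Controllers 3542A and 3542B may receive command data via input-output bus 3530 and select a branch of bus 3530 for transmitting information to the appropriate memory unit. For example, if the command data includes read/write information for memory units associated with die 3546, controller 3542A may receive the command request and transmit data to branch 3531A of bus 3530, as shown in FIG. 35D, while controller 3542B may receive the command request and transmit data to branch 3531B. FIG. 35E indicates various cuts of different regions that may be made, with the cut lines represented by dashed lines.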
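In an example embodiment, a group of dies and interconnect circuitry may be designed for inclusion in a memory chip 3506, as shown in FIG. 36A. Such an embodiment may include processor subunits (for in-memory processing) that may be configured to communicate with one another. For example, each die to be included in memory chip 3506 may include various memory units such as memory banks 3511A-3511D, processor subunits 3515A-3515D, and IO controllers 3521 and 3522. IO controllers 3521 and 3522 may be connected in parallel to input-output bus 3530. IO controller 3521 may have a fuse 3554, and IO controller 3522 may have a fuse 3555. In an example embodiment, processor subunits 3515A-3515D may be connected by means of, for example, bus 3613. In some cases, one of the IO controllers may be disabled using its corresponding fuse. For example, IO controller 3522 may be disabled using fuse 3555, and IO controller 3521 may control the flow of data into memory banks 3511A-3511D via processor subunits 3515A-3515D, which are connected to one another via bus 3613.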
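The configuration of memory units shown in FIG. 36A is only illustrative, and various other configurations may be formed by cutting different regions of wafer 3501. For example, FIG. 36B shows a configuration with three domains 3601-3603 that contain memory units and are connected to input-output bus 3530. In an example embodiment, domains 3601-3603 are connected to input-output bus 3530 using IO control modules 3521-3523, which can be disabled by corresponding fuses 3554-3556. Another example of an embodiment arranging domains that contain memory units is shown in FIG. 36C, where three domains 3601, 3602, and 3603 are connected to input-output bus 3530 using bus lines 3611, 3612, and 3613. FIG. 36D shows another example embodiment of memory chips 3506A-3506D connected to input-output buses 3530A and 3530B via IO controllers 3521-3524. In an example embodiment, the IO controllers may be deactivated using corresponding fuse elements 3554-3557, as shown in FIG. 36D.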
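FIG. 37 shows various groups of dies 3503, such as group 3713 and group 3715, each of which may include one or more dies 3503. In an example embodiment, in addition to dies 3503 being formed on wafer 3501, wafer 3501 may also contain logic circuits 3711 referred to as glue logic 3711. Glue logic 3711 may take up some space on wafer 3501, so that fewer dies are fabricated per wafer 3501 than could have been fabricated without glue logic 3711. However, the presence of glue logic 3711 may allow multiple dies to be configured to function together as a single memory chip. The glue logic may, for example, connect multiple dies without requiring configuration changes and without designating area inside any of the dies themselves for circuitry that is used only to connect dies together. In various embodiments, glue logic 3711 provides an interface with other memory controllers, such that a multi-die memory chip functions as a single memory chip. Glue logic 3711 may be cut together with a group of dies (e.g., as shown by group 3713). Alternatively, if only one die is needed for the memory chip, as, for example, for group 3715, the glue logic may not be cut. For example, the glue logic may be selectively eliminated where cooperation between different dies is not needed. In FIG. 37, various cuts of different regions may be made, as shown, for example, by the dashed regions. In various embodiments, as shown in FIG. 37, one glue logic element 3711 may be laid out on the wafer for every two dies 3506. In some cases, one glue logic element 3711 may be used for any suitable number of dies 3506 forming a group of dies. Glue logic 3711 may be configured to connect to all the dies of a group of dies. In various embodiments, dies connected to glue logic 3711 may be configured to form multi-die memory chips, and may be configured to form separate single-die memory chips when they are not connected to glue logic 3711. In various embodiments, dies connected to glue logic 3711 and designed to function together may be cut out from wafer 3501 as a group that includes glue logic 3711, as indicated, for example, by group 3713. Dies not connected to glue logic 3711 may be cut out from wafer 3501 without glue logic 3711, as indicated, for example, by group 3715, to form single-die memory chips.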
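In some embodiments, during fabrication of multi-die memory chips from wafer 3501, one or more cutting shapes (e.g., the shapes forming groups 3713 and 3715) may be determined for producing a desired set of multi-die memory chips. In some cases, as shown by group 3715, a cutting shape may exclude glue logic 3711.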
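In various embodiments, glue logic 3711 may be a controller for controlling multiple memory units of a multi-die memory chip. In some cases, glue logic 3711 may include parameters that can be modified by various other controllers. For example, a coupling circuit for a multi-die memory chip may include circuitry for configuring parameters of glue logic 3711 or parameters of memory controllers (e.g., processor subunits 3515A-3515D, as shown, for example, in FIG. 35B). Glue logic 3711 may be configured to perform a variety of tasks. For example, logic 3711 may be configured to determine which die needs to be addressed. In some cases, logic 3711 may be used to synchronize multiple memory units. In various embodiments, logic 3711 may be configured to control the various memory units such that they operate as a single chip. In some cases, amplifiers may be added between the input-output bus (e.g., bus 3530, as shown in FIG. 35C) and processor subunits 3515A-3515D to amplify data signals from bus 3530.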
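In various embodiments, cutting complex shapes from wafer 3501 may be technologically difficult or expensive, and a simpler cutting approach may be adopted, provided that the dies are aligned on wafer 3501. For example, FIG. 38A shows dies 3506 aligned to form a rectangular grid. In an example embodiment, vertical cuts 3803 and horizontal cuts 3801 across the entire wafer 3501 may be made to separate cut-out groups of dies. In an example embodiment, vertical cuts 3803 and horizontal cuts 3801 may produce groups containing a selected number of dies. For example, cuts 3803 and 3801 may produce regions containing a single die (e.g., region 3811A), regions containing two dies (e.g., region 3811B), and regions containing four dies (e.g., region 3811C). The regions formed by cuts 3801 and 3803 are only illustrative, and any other suitable regions may be formed. In various embodiments, various cuts may be made depending on the alignment of the dies. For example, if the dies are arranged in a triangular grid, as shown in FIG. 38B, cut lines such as lines 3802, 3804, and 3806 may be used to make multi-die memory chips. For example, some regions may include six dies, five dies, four dies, three dies, two dies, one die, or any other suitable number of dies.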
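How straight cuts across a rectangular grid of dies yield regions of different sizes can be sketched as follows. The grid model, the function names, and the convention that a cut position `k` falls between die row (or column) `k - 1` and `k` are assumptions made for illustration.

```python
def band_sizes(total, cuts):
    """Sizes of the bands produced by cutting a span of `total` dies at `cuts`."""
    edges = [0] + sorted(cuts) + [total]
    return [b - a for a, b in zip(edges, edges[1:])]


def region_die_counts(rows, cols, horizontal_cuts, vertical_cuts):
    """Die count of every region when each cut spans the entire wafer grid."""
    heights = band_sizes(rows, horizontal_cuts)
    widths = band_sizes(cols, vertical_cuts)
    return [h * w for h in heights for w in widths]


# A 3 x 3 grid with one horizontal and one vertical cut yields regions of
# one, two, two, and four dies.
print(region_die_counts(3, 3, [1], [1]))
```

Because each cut spans the whole grid, every region is a rectangle, which matches the simpler full-width cuts described for the rectangular grid.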
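FIG. 38C shows bus lines 3530 arranged in a triangular grid, with dies 3503 aligned at the centers of the triangles formed by the intersecting bus lines 3530. Dies 3503 may be connected via bus lines 3820 to all the neighboring bus lines. By cutting a region containing two or more adjacent dies (e.g., region 3822, as shown in FIG. 38C), at least one bus line (e.g., line 3824) remains within region 3822, and bus line 3824 may be used to supply data and commands to the multi-die memory chip formed using region 3822.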
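FIG. 39 shows various connections that may be formed between processor subunits 3515A-3515P to allow a group of memory units to function as a single memory chip. For example, a group 3901 of various memory units may include a connection 3905 between processor subunit 3515B and subunit 3515E. Connection 3905 may serve as a bus line for transmitting data and commands from subunit 3515B to subunit 3515E, which may be used to control the respective memory bank 3511E. In various embodiments, connections between processor subunits may be implemented during the formation of the dies on wafer 3501. In some cases, additional connections may be fabricated during a packaging stage of a memory chip formed from several dies.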
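As shown in FIG. 39, processor subunits 3515A-3515P may be connected to one another using various buses (e.g., connection 3905). Connection 3905 may be free of timing hardware logic components, so that data transfers between processor subunits and across connection 3905 may not be controlled by timing hardware logic components. In various embodiments, the buses connecting processor subunits 3515A-3515P may be laid out on wafer 3501 before the various circuits are fabricated on wafer 3501.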
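In various embodiments, processor subunits (e.g., subunits 3515A-3515P) may be interconnected. For example, subunits 3515A-3515P may be connected by suitable buses (e.g., connection 3905). Connection 3905 may connect any one of subunits 3515A-3515P with any other of subunits 3515A-3515P. In some cases, connected subunits may be on the same die (e.g., subunits 3515A and 3515B), and in other cases connected subunits may be on different dies (e.g., subunits 3515B and 3515E). Connection 3905 may include a dedicated bus for connecting subunits and may be configured to efficiently transmit data between subunits 3515A-3515P.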
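Various aspects of the present disclosure relate to methods for producing selectable-size memory chips from a wafer. In an example embodiment, a selectable-size memory chip may be formed from one or more dies. As noted earlier, the dies may be arranged along one or more rows, as shown, for example, in FIG. 35C. In some cases, at least one shared input-output bus corresponding to the one or more rows may be laid out on wafer 3501. For example, bus 3530 may be laid out as shown in FIG. 35C. In various embodiments, bus 3530 may be electrically connected to the memory units of at least two of the dies, and the connected dies may be used to form a multi-die memory chip. In an example embodiment, one or more controllers (e.g., input-output controllers 3521 and 3522, as shown in FIG. 35B) may be configured to control the memory units of the at least two dies used to form the multi-die memory chip. In various embodiments, the dies with memory units connected to bus 3530 may be cut out from the wafer together with at least one corresponding portion of the shared input-output bus (e.g., bus 3530, as shown in FIG. 35B) that transmits information to at least one controller (e.g., controllers 3521 and 3522), so as to configure the controller to control the memory units of the connected dies to function together as a single chip.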
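In some cases, the memory units located on wafer 3501 may be tested before memory chips are fabricated by cutting regions of wafer 3501. The testing may be performed using at least one shared input-output bus (e.g., bus 3530, as shown in FIG. 35C). When the memory units pass the testing, a memory chip may be formed from a group of dies containing those memory units. Memory units that fail the testing may be discarded and not used for fabricating memory chips.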
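FIG. 40 shows an example process 4000 for building memory chips from a group of dies. At step 4011 of process 4000, dies may be laid out on semiconductor wafer 3501. At step 4015, the dies may be fabricated on wafer 3501 using any suitable approach. For example, dies may be fabricated by etching wafer 3501, depositing various dielectric, metallic, or semiconductor layers, further etching the deposited layers, and so on. For example, multiple layers may be deposited and etched. In various embodiments, layers may be n-type doped or p-type doped using any suitable doping elements. For example, semiconductor layers may be n-type doped with phosphorus and p-type doped with boron. As shown in FIG. 35A, dies 3503 may be separated from each other by space that may be used to cut dies 3503 out from wafer 3501. For example, dies 3503 may be spaced apart from each other by spacing regions, where the width of the spacing regions may be selected to allow wafer cuts within the spacing regions.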
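At step 4017, dies 3503 may be cut out from wafer 3501 using any suitable approach. In an example embodiment, dies 3503 may be cut using a laser. In an example embodiment, wafer 3501 may first be scribed, followed by mechanical dicing. Alternatively, a mechanical dicing saw may be used. In some cases, a stealth dicing process may be used. During dicing, wafer 3501 may be mounted on dicing tape that holds the dies once they are cut. In various embodiments, large cuts may be made, as shown, for example, by cuts 3801 and 3803 in FIG. 38A or by cuts 3802, 3804, or 3806 in FIG. 38B. Once dies 3503 are cut out individually or in groups, as shown, for example, by group 3504 in FIG. 35C, dies 3503 may be packaged. Packaging of the dies may include forming contacts to dies 3503, depositing protective layers over the contacts, attaching heat-management devices (e.g., heat sinks), and encapsulating dies 3503. In various embodiments, an appropriate configuration of contacts and buses may be used depending on how many dies are selected to form a memory chip. In an example embodiment, some of the contacts between the different dies forming the memory chip may be made during packaging of the memory chip.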
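FIG. 41A shows an example process 4100 for fabricating memory chips containing multiple dies. Step 4011 of process 4100 may be the same as step 4011 of process 4000. At step 4111, glue logic 3711 may be laid out on wafer 3501, as shown in FIG. 37. Glue logic 3711 may be any suitable logic for controlling the operation of dies 3506, as shown in FIG. 37. As described earlier, the presence of glue logic 3711 may allow multiple dies to function as a single memory chip. Glue logic 3711 may provide an interface with other memory controllers, such that a memory chip formed from multiple dies functions as a single memory chip.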
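At step 4113 of process 4100, buses (e.g., input-output buses and control buses) may be laid out on wafer 3501. The buses may be laid out such that they connect with the various dies and with logic circuits such as glue logic 3711. In some cases, the buses may connect memory units. For example, buses may be configured to connect processor subunits of different dies. At step 4115, the dies, glue logic, and buses may be fabricated using any suitable approach. For example, logic elements may be fabricated by etching wafer 3501, depositing various dielectric, metallic, or semiconductor layers, further etching the deposited layers, and so on. Buses may be fabricated using, for example, metal evaporation.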
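At step 4140, cutting shapes may be used to cut groups of dies connected to a single glue logic 3711, as shown, for example, in FIG. 37. The cutting shapes may be determined using the memory requirements for a memory chip containing multiple dies 3503. For example, FIG. 41B shows a process 4101, which may be a variant of process 4100 in which step 4140 is preceded by steps 4117 and 4119. At step 4117, a system for cutting wafer 3501 may receive instructions describing the requirements for a memory chip. For example, the requirements may include forming a memory chip that includes four dies 3503. In some cases, at step 4119, program software may determine a periodic pattern for groups of dies and glue logic 3711. For example, a periodic pattern may include two glue logic 3711 elements and four dies 3503, with every two dies connected to one glue logic 3711. Alternatively, at step 4119, the pattern may be provided by a designer of the memory chips.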
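In some cases, the pattern may be selected to maximize the yield of memory chips from wafer 3501. In an example embodiment, the memory units of dies 3503 may be tested to identify dies with faulty memory units (such dies are referred to as faulty or failed dies), and, based on the locations of the faulty dies, groups of dies 3503 containing memory units that pass the test may be identified and an appropriate cutting pattern may be determined. For example, if a large number of dies 3503 fail at the edges of wafer 3501, a cutting pattern may be determined that avoids the dies at the edges of wafer 3501. Other steps of process 4101, such as steps 4011, 4111, 4113, 4115, and 4140, may be the same as the identically numbered steps of process 4100.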
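One simple way to derive a cutting pattern from test results is a greedy row scan, sketched below. This is not the method of the disclosure, which does not specify an algorithm; the function `plan_cuts`, the pass/fail map layout, and the restriction to horizontally adjacent dies are all illustrative assumptions, and a greedy scan does not guarantee maximum yield.

```python
def plan_cuts(pass_map, dies_per_chip):
    """Group horizontally adjacent passing dies into chips of a fixed size.

    pass_map: list of rows, each a list of booleans (True = die passed test).
    Returns a list of chips, each a tuple of (row, column) die coordinates.
    """
    chips = []
    for r, row in enumerate(pass_map):
        run = []  # consecutive passing dies collected so far in this row
        for c, passed in enumerate(row):
            if not passed:
                run = []  # a failed die breaks the run; skip it
                continue
            run.append((r, c))
            if len(run) == dies_per_chip:
                chips.append(tuple(run))
                run = []
    return chips


# Failed dies at (0, 0) and (1, 2) are avoided when forming two-die chips.
wafer = [[False, True, True, True],
         [True, True, False, True]]
print(plan_cuts(wafer, 2))
```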
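FIG. 41C shows an example process 4102 that may be a variation of process 4101. Steps 4011, 4111, 4113, 4115, and 4140 of process 4102 may be the same as the identically numbered steps of process 4101; step 4131 of process 4102 may substitute for step 4117 of process 4101, and step 4133 of process 4102 may substitute for step 4119 of process 4101. At step 4131, a system for cutting wafer 3501 may receive instructions describing the requirements for a first set of memory chips and a second set of memory chips. For example, the requirements may include forming a first set of memory chips, each consisting of four dies 3503, and forming a second set of memory chips, each consisting of two dies 3503. In some cases, more than two sets of memory chips may need to be formed from wafer 3501. For example, a third set of memory chips may include memory chips consisting of only one die 3503. In some cases, at step 4133, program software may determine a periodic pattern for groups of dies and glue logic 3711 for forming the memory chips of each set. For example, the first set of memory chips may include memory chips containing two glue logic 3711 elements and four dies 3503, with every two dies connected to one glue logic 3711. In various embodiments, glue logic units 3711 for the same memory chip may be linked together to function as a single glue logic. For example, during fabrication of glue logic 3711, appropriate bus lines linking glue logic units 3711 to each other may be formed.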
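The second set of memory chips may include memory chips containing one glue logic 3711 and two dies 3503, with dies 3503 connected to glue logic 3711. In some cases, when a third set of memory chips is selected and that set includes memory chips consisting of a single die 3503, those memory chips may not require glue logic 3711.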
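The glue logic counts in these examples (two elements for a four-die chip, one for a two-die chip, none for a single die) follow from one glue logic element per two dies. The sketch below tallies them for an order; the function names, the `order` format, and the rounding-up rule for odd die counts are assumptions for illustration only.

```python
import math


def glue_logic_elements(dies_per_chip, dies_per_glue=2):
    """Glue logic elements in one chip, assuming one element per two dies."""
    if dies_per_chip < 1:
        raise ValueError("a chip needs at least one die")
    if dies_per_chip == 1:
        return 0  # single-die chips may not require glue logic
    return math.ceil(dies_per_chip / dies_per_glue)


def glue_logic_for_order(order):
    """Total glue logic elements for an order {dies per chip: chip count}."""
    return sum(glue_logic_elements(d) * n for d, n in order.items())


# Ten four-die chips, five two-die chips, and three single-die chips.
print(glue_logic_for_order({4: 10, 2: 5, 1: 3}))
```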
Dual-port functionality
When designing a memory chip or a memory instance within a chip, one important characteristic is the number of words that can be accessed simultaneously during a single clock cycle. The more addresses that can be accessed simultaneously for reading and/or writing (e.g., addresses along rows, also called words or wordlines, and columns, also called bits or bitlines), the faster the memory chip. While there has been some activity in developing memories that include multi-way ports allowing access to multiple addresses at the same time, such as for building register files, caches, or shared memories, most instances use memory mats that are larger in size and that support the multiple address accesses. DRAM chips, however, usually include a single bitline and a single row line connected to each capacitor of each memory cell. Accordingly, embodiments of the present disclosure seek to provide multi-port access on existing DRAM chips without modifying this conventional single-port memory structure of DRAM arrays.
Embodiments of the present disclosure may clock a memory instance or chip at twice the speed of the logic circuits that use the memory. Any logic circuit that uses the memory may therefore "correspond" to the memory and any components thereof. Accordingly, embodiments of the present disclosure may retrieve or write to two addresses in two memory array clock cycles, which are equivalent to a single processing clock cycle of the logic circuits. The logic circuits may comprise circuits such as controllers, accelerators, GPUs, or CPUs, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A. As explained above with respect to FIG. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on a substrate. The group may represent a spatial distribution on the substrate and/or a logical grouping for purposes of compiling code for execution on memory chip 2800. Accordingly, as described above with respect to FIG. 7A, a substrate with the memory chip may include a memory array with a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. Furthermore, the substrate may also include a processing array that may include a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).
Accordingly, embodiments of the present disclosure may retrieve data from the array in each of two consecutive memory cycles in order to handle two addresses for each logic cycle and provide the logic with two results, as though the single-port memory array were a two-port memory chip. Additional clocking may allow the memory chips of the present disclosure to function as though the single-port memory arrays were a two-port memory instance, a three-port memory instance, a four-port memory instance, or any other multi-port memory instance.
FIG. 42 depicts example circuitry 4200, consistent with the present disclosure, that provides dual-port access along the columns of a memory chip in which circuitry 4200 is used. The embodiment depicted in FIG. 42 may use one memory array 4201 with two column multiplexers ("mux") 4205a and 4205b to access two words on the same row during the same clock cycle for the logic circuit. For example, during a memory clock cycle, RowAddrA is used in row decoder 4203, and ColAddrA is used in multiplexer 4205a to buffer data from the memory cell with address (RowAddrA, ColAddrA). During the same memory clock cycle, ColAddrB is used in multiplexer 4205b to buffer data from the memory cell with address (RowAddrA, ColAddrB). Thus, circuitry 4200 may allow dual-port access to data (e.g., DataA and DataB) stored on memory cells at two different addresses along the same row or wordline. The two addresses may therefore share a row, such that row decoder 4203 activates the same wordline for both retrievals. Moreover, embodiments like the example depicted in FIG. 42 may use the column multiplexers so that the two addresses can be accessed during the same memory clock cycle.
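The shared-row access of FIG. 42 can be modeled behaviorally: one wordline activation, then two independent column selections. The class `SharedRowDualPort` and its members are hypothetical names for this sketch, which models only the data flow and none of the electrical behavior.

```python
class SharedRowDualPort:
    """Behavioral model: one row decoder, two column multiplexers."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]
        self.row_activations = 0  # counts wordline activations

    def read_pair(self, row_addr, col_addr_a, col_addr_b):
        """Return (DataA, DataB) from two columns of the same row."""
        word_line = self.cells[row_addr]  # single activation by the row decoder
        self.row_activations += 1
        data_a = word_line[col_addr_a]    # selected by the first column mux
        data_b = word_line[col_addr_b]    # selected by the second column mux
        return data_a, data_b


mem = SharedRowDualPort(rows=4, cols=8)
mem.cells[2][1], mem.cells[2][5] = 7, 9
print(mem.read_pair(2, 1, 5), mem.row_activations)
```

Both results come from a single wordline activation, which is why this arrangement needs no extra memory clock cycle.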
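Similarly, FIG. 43 depicts example circuitry 4300, consistent with the present disclosure, that provides dual-port access along the rows of a memory chip in which circuitry 4300 is used. The embodiment depicted in FIG. 43 may use one memory array 4301 with a row decoder 4303 coupled to a multiplexer ("mux") to access two words on the same column during the same clock cycle for the logic circuit. For example, on the first of two memory clock cycles, RowAddrA is used in row decoder 4303, and ColAddrA is used in column multiplexer 4305 to buffer data (e.g., to the "Buffered Word" buffer of FIG. 43) from the memory cell with address (RowAddrA, ColAddrA). On the second of the two memory clock cycles, RowAddrB is used in row decoder 4303, and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell with address (RowAddrB, ColAddrA). Thus, circuitry 4300 may allow dual-port access to data (e.g., DataA and DataB) stored on memory cells at two different addresses along the same column or bitline. The two addresses may therefore share a column, such that the column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) activates the same bitline for both retrievals. Embodiments like the example depicted in FIG. 43 may use two memory clock cycles because row decoder 4303 may need one memory clock cycle to activate each wordline. Accordingly, a memory chip using circuitry 4300 may function as a dual-port memory if clocked at least twice as fast as the corresponding logic circuit.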
Accordingly, as explained above, the circuitry of FIG. 43 may retrieve DataA and DataB during two memory clock cycles that are faster than the clock cycle of the corresponding logic circuit. For example, the row decoder (e.g., row decoder 4303 of FIG. 43) and the column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) may be configured to be clocked at a rate at least twice the rate at which the corresponding logic circuit generates the two addresses. For example, a clock circuit for circuitry 4300 (not shown in FIG. 43) may clock circuitry 4300 at a rate at least twice the rate at which the corresponding logic circuit generates the two addresses.
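The double-clocking scheme can be sketched as a behavioral model in which one logic cycle spans two memory cycles, with the first result held in a buffer. The class `DoubleClockedMemory` and its method names are hypothetical; timing is represented only by a cycle counter.

```python
class DoubleClockedMemory:
    """Single-port array clocked twice per logic cycle (behavioral model)."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]
        self.memory_cycles = 0

    def _memory_cycle_read(self, row_addr, col_addr):
        # One memory clock cycle: the row decoder activates a single wordline.
        self.memory_cycles += 1
        return self.cells[row_addr][col_addr]

    def logic_cycle_read(self, addr_a, addr_b):
        """Serve two (row, column) addresses within one logic clock cycle."""
        buffered_word = self._memory_cycle_read(*addr_a)  # first memory cycle
        second_word = self._memory_cycle_read(*addr_b)    # second memory cycle
        return buffered_word, second_word


mem = DoubleClockedMemory(rows=4, cols=4)
mem.cells[0][2], mem.cells[3][2] = 5, 6
print(mem.logic_cycle_read((0, 2), (3, 2)), mem.memory_cycles)
```

From the point of view of the logic circuit, both words arrive in the same logic cycle, so the array behaves like a dual-port memory.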
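The embodiment of FIG. 42 and the embodiment of FIG. 43 may be used separately or in combination. Accordingly, circuitry (e.g., circuitry 4200 or 4300) that provides dual-port functionality on a single-port memory array or mat may comprise a plurality of memory banks arranged along at least one row and at least one column. The plurality of memory banks is depicted as memory array 4201 in FIG. 42 and as memory array 4301 in FIG. 43. The embodiments may further use at least one row multiplexer (as depicted in FIG. 43) or at least one column multiplexer (as depicted in FIG. 42) configured to receive, during a single clock cycle, two addresses for reading or writing. Moreover, the embodiments may use a row decoder (e.g., row decoder 4203 of FIG. 42 and row decoder 4303 of FIG. 43) and a column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIGS. 42 and 43) to read from or write to the two addresses. For example, the row decoder and column decoder may, during a first cycle, retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the first address. Moreover, the row decoder and column decoder may, during a second cycle, retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the second address. The retrievals may each comprise activating the wordline corresponding to an address using the row decoder and activating the bitline on the activated wordline corresponding to the address using the column decoder.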
Although described above for retrievals, the embodiments of FIGS. 42 and 43, whether implemented separately or in combination, may include write commands. For example, during the first cycle, the row decoder and column decoder may write first data retrieved from the at least one row multiplexer or the at least one column multiplexer to the first of the two addresses. Moreover, during the second cycle, the row decoder and column decoder may write second data retrieved from the at least one row multiplexer or the at least one column multiplexer to the second of the two addresses.
The example of FIG. 42 shows this process when the first and second addresses share a wordline address, while the example of FIG. 43 shows this process when the first and second addresses share a column address. As described further below with respect to FIG. 47, the same process may be implemented when the first and second addresses share neither a wordline address nor a column address.
Accordingly, although the examples above provide dual-port access along at least one of rows or columns, additional embodiments may provide dual-port access along both rows and columns. FIG. 44 depicts example circuitry 4400, consistent with the present disclosure, that provides dual-port access along both the rows and the columns of a memory chip in which circuitry 4400 is used. Accordingly, circuitry 4400 may represent a combination of circuitry 4200 of FIG. 42 and circuitry 4300 of FIG. 43.
The embodiment depicted in FIG. 44 may use one memory array 4401 with a row decoder 4403 coupled to a multiplexer ("mux") to access two rows during the same clock cycle for the logic circuit. Moreover, the embodiment depicted in FIG. 44 may use memory array 4401 with a column decoder (or multiplexer) 4405 coupled to a multiplexer ("mux") to access two columns during the same clock cycle. For example, on the first of two memory clock cycles, RowAddrA is used in row decoder 4403, and ColAddrA is used in column multiplexer 4405 to buffer data (e.g., to the "Buffered Word" buffer of FIG. 44) from the memory cell with address (RowAddrA, ColAddrA). On the second of the two memory clock cycles, RowAddrB is used in row decoder 4403, and ColAddrB is used in column multiplexer 4405 to buffer data from the memory cell with address (RowAddrB, ColAddrB). Thus, circuitry 4400 may allow dual-port access to data (e.g., DataA and DataB) stored on memory cells at two different addresses. Embodiments like the example depicted in FIG. 44 may use an additional buffer because row decoder 4403 may need one memory clock cycle to activate each wordline. Accordingly, a memory chip using circuitry 4400 may function as a dual-port memory if clocked at least twice as fast as the corresponding logic circuit.
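Although not depicted in FIG. 44, circuitry 4400 may further include the additional circuitry of FIG. 46 (described further below) along the rows or wordlines and/or similar additional circuitry along the columns or bitlines. Accordingly, circuitry 4400 may activate the corresponding circuitry (e.g., by opening one or more switching elements, such as one or more of switching elements 4613a, 4613b, and the like of FIG. 46) to activate disconnected portions that include the addresses (e.g., by connecting voltages or allowing current to flow to the disconnected portions). Circuitry may thus "correspond" when elements of the circuitry (such as lines or the like) include locations identified by the addresses and/or when elements of the circuitry (such as switching elements) control a supply of voltage and/or current to memory cells identified by the addresses. Circuitry 4400 may then use row decoder 4403 and column multiplexer 4405 to decode the corresponding wordlines and bitlines in order to retrieve data from, or write data to, the addresses located in the activated disconnected portions.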
As further depicted in FIG. 44, circuitry 4400 may further use at least one row multiplexer (depicted as separate from row decoder 4403 but may be incorporated therein) and/or at least one column multiplexer (e.g., depicted as separate from column multiplexer 4405 but may be incorporated therein) configured to receive, during a single clock cycle, two addresses for reading or writing. Accordingly, embodiments may use a row decoder (e.g., row decoder 4403) and a column decoder (which may be separate from or combined with column multiplexer 4405) to read from or write to the two addresses. For example, the row decoder and column decoder may, during a memory clock cycle, retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the first address. Moreover, the row decoder and column decoder may, during the same memory cycle, retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the second address.
FIGS. 45A and 45B depict existing duplication techniques for providing dual-port functionality on a single-port memory array or mat. As shown in FIG. 45A, dual-port reading may be provided by keeping duplicate copies of data in sync across memory arrays or mats. Accordingly, reads may be performed from both copies of the memory instance, as depicted in FIG. 45A. Moreover, as shown in FIG. 45B, dual-port writing may be provided by duplicating all writes across the memory arrays or mats. For example, the memory chip may require that logic circuits using the memory chip send write commands in duplicate, one for each duplicate copy of the data. Alternatively, in some embodiments, as shown in FIG. 45A, additional circuitry may allow the logic circuits using the memory instance to send single write commands that the additional circuitry automatically duplicates to generate duplicate copies of the written data across the memory arrays or mats in order to keep the copies in sync. The embodiments of FIGS. 42, 43, and 44 may reduce the redundancy of these existing duplication techniques by using multiplexers to access two bitlines in a single memory clock cycle (e.g., as depicted in FIG. 42) and/or by clocking the memory faster than the corresponding logic circuit (e.g., as depicted in FIGS. 43 and 44) and providing additional multiplexers to handle the extra addresses, rather than duplicating all the data in the memory.
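The duplication technique and its cost can be made concrete with a short behavioral sketch. The class `MirroredMemory` and its methods are hypothetical names; the model simply keeps two synchronized copies, serves each read port from its own copy, and reports the doubled storage.

```python
class MirroredMemory:
    """Dual-port behavior obtained by duplicating a single-port array."""

    def __init__(self, size):
        # Two full copies of the data, one per read port.
        self.copies = [[0] * size, [0] * size]

    def write(self, addr, value):
        """Duplicate every write so that both copies stay in sync."""
        for copy in self.copies:
            copy[addr] = value

    def read_pair(self, addr_a, addr_b):
        """Serve two reads at once, each from its own copy."""
        return self.copies[0][addr_a], self.copies[1][addr_b]

    def cells_used(self):
        return sum(len(copy) for copy in self.copies)


mem = MirroredMemory(8)
mem.write(1, 11)
mem.write(5, 55)
print(mem.read_pair(1, 5), mem.cells_used())
```

The storage used is twice the logical capacity, which is the redundancy that the multiplexer-based and double-clocked embodiments aim to avoid.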
In addition to the faster clocking and/or additional multiplexers described above, embodiments of the present disclosure may use circuitry that disconnects the bit lines and/or word lines at some points within the memory array. Such embodiments may allow multiple simultaneous accesses to the array so long as the row and column decoders access different locations that are not coupled to the same portion of the disconnect circuitry. For example, locations with different word lines and bit lines may be accessed simultaneously because the disconnect circuitry allows the row and column decoding to access different addresses without electrical interference. During design of the memory chip, the granularity of the disconnected regions within the memory array may be weighed against the additional area required by the disconnect circuitry.
An architecture for implementing such simultaneous access is depicted in FIG. 46. In particular, FIG. 46 depicts example circuitry 4600 that provides dual-port functionality on a single-port memory array or mat. As depicted in FIG. 46, circuitry 4600 may include a plurality of memory mats (e.g., memory mat 4609a, mat 4609b, and the like) arranged along at least one row and at least one column. The layout of circuitry 4600 further includes a plurality of word lines, such as word lines 4611a and 4611b corresponding to rows, and bit lines 4615a and 4615b corresponding to columns.
The example of FIG. 46 includes twelve memory mats, each with two lines and eight columns. In other embodiments, the substrate may include any number of memory mats, and each memory mat may include any number of lines and any number of columns. Some memory mats may include the same number of lines and columns (as shown in FIG. 46), while other memory mats may include different numbers of lines and/or columns.
Although not depicted in FIG. 46, circuitry 4600 may further use at least one row multiplexer (separate from or incorporated into row decoders 4601a and/or 4601b) or at least one column multiplexer (e.g., column multiplexer 4603a and/or 4603b) configured to receive two (or three, or any plurality of) addresses for reading or writing during a single clock cycle. Moreover, embodiments may use row decoders (e.g., row decoders 4601a and/or 4601b) and column decoders (which may be separate from or combined with column multiplexers 4603a and/or 4603b) to read from or write to the two (or more) addresses. For example, the row decoder and column decoder may retrieve a first of the two addresses from the at least one row multiplexer or the at least one column multiplexer during a memory clock cycle and decode the word line and bit line corresponding to the first address. Moreover, the row decoder and column decoder may retrieve a second of the two addresses from the at least one row multiplexer or the at least one column multiplexer during the same memory cycle and decode the word line and bit line corresponding to the second address. As explained above, the two addresses may be accessed during the same memory clock cycle so long as they are in different locations that are not coupled to the same portion of the disconnect circuitry (e.g., switching elements such as 4613a, 4613b, and the like). Additionally, circuitry 4600 may access a first two addresses simultaneously during a first memory clock cycle and then a next two addresses simultaneously during a second memory clock cycle. In such embodiments, a memory chip using circuitry 4600 may function as a four-port memory if clocked at least twice as fast as the corresponding logic circuits.
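The four-port figure above follows from simple arithmetic, sketched here in Python. The function name `effective_ports` and its parameters are hypothetical and do not come from the disclosure; the model only assumes that each memory clock cycle serves a fixed number of addresses and counts how many whole memory cycles fit into one logic clock cycle.

```python
def effective_ports(addresses_per_memory_cycle: int,
                    memory_clock_hz: float,
                    logic_clock_hz: float) -> int:
    """Addresses served per logic clock cycle (illustrative model only)."""
    if memory_clock_hz <= 0 or logic_clock_hz <= 0:
        raise ValueError("clock rates must be positive")
    # Whole memory clock cycles that fit inside one logic clock cycle.
    memory_cycles_per_logic_cycle = int(memory_clock_hz // logic_clock_hz)
    return addresses_per_memory_cycle * memory_cycles_per_logic_cycle


# Two simultaneous addresses per memory cycle, memory clocked at twice
# the logic rate: the logic circuits see four accesses per logic cycle.
print(effective_ports(2, 2.0e9, 1.0e9))  # 4
```

Under the same assumptions, a memory clocked at the logic rate would behave as a dual-port memory, since only one memory cycle fits into each logic cycle.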
FIG. 46 further includes at least one row circuit and at least one column circuit configured to function as switches. For example, corresponding switching elements such as 4613a, 4613b, and the like may comprise transistors or any other electrical elements configured to allow or stop current flow and/or to connect or disconnect a voltage to or from the word line or bit line connected to the switching element. Accordingly, the corresponding switching elements may divide circuitry 4600 into disconnected portions. Although depicted as comprising single rows with sixteen columns per row, the disconnected regions within circuitry 4600 may have different levels of granularity depending on the design of circuitry 4600.
Circuitry 4600 may use a controller (e.g., row control 4607) to activate corresponding ones of the at least one row circuit and the at least one column circuit in order to activate the corresponding disconnected regions during the address operations described above. For example, circuitry 4600 may transmit one or more control signals to close corresponding ones of the switching elements (e.g., switching elements 4613a, 4613b, and the like). In embodiments where switching elements 4613a, 4613b, and the like comprise transistors, the control signals may comprise voltages to open the transistors.
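Depending on the disconnected region that includes the address, more than one of the switching elements may be activated by circuitry 4600. For example, to reach an address within memory mat 4609b of FIG. 46, the switching element allowing access to memory mat 4609a must be opened, as well as the switching element allowing access to memory mat 4609b. Row control 4607 may determine which switching elements to activate in order to retrieve a particular address within circuitry 4600 according to that address.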
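The selection performed by the row control can be sketched as follows. This is a minimal illustration under an assumed topology, not the circuit of FIG. 46: mats are taken to be chained along a word line, with switching element `i` gating access to mat `i`, so reaching a mat requires every switching element between the decoder and that mat. The function name `switches_to_activate` and the zero-based indexing are hypothetical.

```python
from typing import List


def switches_to_activate(target_mat: int) -> List[int]:
    """Indices of the switching elements needed to reach `target_mat`.

    Assumed model: switch i gates mat i, and mats are chained, so all
    switches from the decoder side up to the target must be activated.
    """
    if target_mat < 0:
        raise ValueError("mat index must be non-negative")
    return list(range(target_mat + 1))


# Reaching the second mat (index 1, playing the role of mat 4609b)
# needs the switch for the first mat as well as its own switch.
print(switches_to_activate(1))  # [0, 1]
```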
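FIG. 46 represents an example of circuitry 4600 for dividing the word lines of a memory array (e.g., comprising memory mats 4609a, 4609b, and the like). However, other embodiments may use similar circuitry (e.g., switching elements dividing memory chip 4600 into disconnected regions) to divide the bit lines of the memory array. Accordingly, the architecture of circuitry 4600 may be used in dual-column access (like that depicted in FIG. 42 or FIG. 44) as well as dual-row access (like that depicted in FIG. 43 or FIG. 44).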
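A process for multi-cycle access to a memory array or mat is depicted in FIG. 47A. In particular, FIG. 47A is an example flowchart of a process 4700 for providing dual-port access on a single-port memory array or mat (e.g., using circuitry 4300 of FIG. 43 or circuitry 4400 of FIG. 44). Process 4700 may be executed using row decoders and column decoders consistent with the present disclosure, such as row decoder 4303 or 4403 of FIG. 43 or FIG. 44, respectively, and column decoders (which may be separate from or combined with one or more column multiplexers, such as column multiplexer 4305 or 4405 depicted in FIG. 43 or FIG. 44, respectively).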
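At step 4710, during a first memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to a first of two addresses. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from a memory cell along the activated word line and corresponding to the first address. The amplified voltage may be provided to logic circuits using the memory chip that includes the circuitry, or buffered according to step 4720 described below. The logic circuits may comprise circuits such as a GPU or a CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.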
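Although described above as a read operation, method 4700 may similarly process a write operation. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to a memory cell along the activated word line and corresponding to the first address in order to write new data to the memory cell. In some embodiments, the circuitry may provide a confirmation of the write to the logic circuits using the memory chip that includes the circuitry, or buffer the confirmation according to step 4720 below.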
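At step 4720, the circuitry may buffer the retrieved data of the first address. For example, as depicted in FIGS. 43 and 44, the buffer may allow the circuitry to retrieve the second of the two addresses (as described in step 4730 below) and return the results of both retrievals together. The buffer may comprise a register, an SRAM, a non-volatile memory, or any other data storage device.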
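At step 4730, during a second memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the second of the two addresses. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from a memory cell along the activated word line and corresponding to the second address. The amplified voltage may be provided to the logic circuits using the memory chip that includes the circuitry, whether individually or together with, e.g., the buffered voltage from step 4720. The logic circuits may comprise circuits such as a GPU or a CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.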
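Although described above as a read operation, method 4700 may similarly process a write operation. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to a memory cell along the activated word line and corresponding to the second address in order to write new data to the memory cell. In some embodiments, the circuitry may provide a confirmation of the write to the logic circuits using the memory chip that includes the circuitry, whether individually or together with, e.g., the buffered voltage from step 4720.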
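At step 4740, the circuitry may output the retrieved data of the second address together with the buffered data of the first address. For example, as depicted in FIGS. 43 and 44, the circuitry may return the results of both retrievals (e.g., from steps 4710 and 4730) together. The circuitry may return the results to the logic circuits using the memory chip that includes the circuitry. The logic circuits may comprise circuits such as a GPU or a CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.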
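Although described with reference to multiple cycles, method 4700 may allow single-cycle access to the two addresses if they share a word line, as depicted in FIG. 42. For example, steps 4710 and 4730 may be performed during the same memory clock cycle, because multiple column multiplexers may decode different bit lines on the same word line during the same memory clock cycle. In such embodiments, buffering step 4720 may be skipped.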
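The flow of steps 4710 through 4740, including the single-cycle shortcut for addresses on a shared word line, can be sketched as a small behavioral model. This is an illustration under stated assumptions rather than the disclosed circuit: the memory is modeled as a dictionary keyed by `(row, column)`, and the names `dual_read` and `Address` are hypothetical.

```python
from typing import Dict, List, Tuple

Address = Tuple[int, int]  # (row / word line, column / bit line)


def dual_read(memory: Dict[Address, int],
              first: Address,
              second: Address) -> Tuple[List[int], int]:
    """Return ([data_first, data_second], memory_cycles_used)."""
    if first[0] == second[0]:
        # Shared word line: both bit lines are decoded in the same
        # memory clock cycle, so the buffering step is skipped.
        return [memory[first], memory[second]], 1
    buffered = memory[first]        # steps 4710 and 4720: cycle 1, buffer
    fetched = memory[second]        # step 4730: cycle 2
    return [buffered, fetched], 2   # step 4740: return both together


mem = {(0, 0): 11, (0, 3): 22, (1, 5): 33}
print(dual_read(mem, (0, 0), (0, 3)))  # ([11, 22], 1)
print(dual_read(mem, (0, 0), (1, 5)))  # ([11, 33], 2)
```

If the memory is clocked at twice the logic rate, the two-cycle case still completes within one logic clock cycle, which is what lets the single-port array appear dual-ported.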
A process for simultaneous access (e.g., using circuitry 4600 described above) is depicted in FIG. 47B. Accordingly, although shown in sequence, the steps of FIG. 47B may all be performed during the same memory clock cycle, and at least some steps (e.g., steps 4760 and 4780, or steps 4770 and 4790) may be executed simultaneously. In particular, FIG. 47B is an example flowchart of a process 4750 for providing dual-port access on a single-port memory array or mat (e.g., using circuitry 4200 of FIG. 42 or circuitry 4600 of FIG. 46). Process 4750 may be executed using row decoders and column decoders consistent with the present disclosure, such as row decoder 4203 of FIG. 42 or row decoders 4601a and 4601b of FIG. 46, and column decoders (which may be separate from or combined with one or more column multiplexers, such as column multiplexers 4205a and 4205b of FIG. 42 or column multiplexers 4603a and 4603b of FIG. 46).
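At step 4760, during a memory clock cycle, the circuitry may activate corresponding ones of at least one row circuit and at least one column circuit based on a first of two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Accordingly, the circuitry may access the corresponding disconnected region that includes the first of the two addresses.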
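At step 4770, during the memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the first address. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from a memory cell along the activated word line and corresponding to the first address. The amplified voltage may be provided to the logic circuits using the memory chip that includes the circuitry. For example, as described above, the logic circuits may comprise circuits such as a GPU or a CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.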
Although described above as a read operation, method 4750 may similarly process a write operation. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to a memory cell along the activated word line and corresponding to the first address in order to write new data to the memory cell. In some embodiments, the circuitry may provide a confirmation of the write to the logic circuits using the memory chip that includes the circuitry.
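At step 4780, during the same cycle, the circuitry may activate corresponding ones of the at least one row circuit and the at least one column circuit based on the second of the two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Accordingly, the circuitry may access the corresponding disconnected region that includes the second of the two addresses.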
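At step 4790, during the same cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the second address. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from a memory cell along the activated word line and corresponding to the second address. The amplified voltage may be provided to the logic circuits using the memory chip that includes the circuitry. For example, as described above, the logic circuits may comprise conventional circuits such as a GPU or a CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.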
Although described above as a read operation, method 4750 may similarly process a write operation. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to a memory cell along the activated word line and corresponding to the second address in order to write new data to the memory cell. In some embodiments, the circuitry may provide a confirmation of the write to the logic circuits using the memory chip that includes the circuitry.
Although described with reference to a single cycle, method 4750 may allow multi-cycle access to the two addresses if they are in disconnected regions that share a word line or bit line (or otherwise share switching elements in the at least one row circuit and the at least one column circuit). For example, steps 4760 and 4770 may be performed during a first memory clock cycle, in which a first row decoder and a first column multiplexer may decode the word line and bit line corresponding to the first address, while steps 4780 and 4790 may be performed during a second memory clock cycle, in which a second row decoder and a second column multiplexer may decode the word line and bit line corresponding to the second address.
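The choice between single-cycle and multi-cycle access can be sketched as a small scheduler. This is a simplified model, not the disclosed circuit: it assumes the array is tiled into rectangular disconnected regions and treats two addresses as conflicting when their regions fall in the same row band (shared word line) or the same column band (shared bit line). The names `region`, `conflicts`, and `schedule`, and the default region size, are hypothetical.

```python
from typing import List, Tuple

Address = Tuple[int, int]  # (row, column)


def region(addr: Address, rows_per_region: int,
           cols_per_region: int) -> Tuple[int, int]:
    """Map an address to the (row band, column band) of its region."""
    return (addr[0] // rows_per_region, addr[1] // cols_per_region)


def conflicts(a: Address, b: Address, rows_per_region: int,
              cols_per_region: int) -> bool:
    """Assumed rule: regions conflict if they share a row or column band."""
    ra = region(a, rows_per_region, cols_per_region)
    rb = region(b, rows_per_region, cols_per_region)
    return ra[0] == rb[0] or ra[1] == rb[1]


def schedule(first: Address, second: Address, rows_per_region: int = 1,
             cols_per_region: int = 16) -> List[List[Address]]:
    """Return the addresses grouped by memory clock cycle."""
    if conflicts(first, second, rows_per_region, cols_per_region):
        return [[first], [second]]   # fall back to two cycles
    return [[first, second]]         # both served in the same cycle


print(schedule((0, 0), (1, 16)))  # [[(0, 0), (1, 16)]]
print(schedule((0, 0), (0, 20)))  # [[(0, 0)], [(0, 20)]]
```

A finer region size makes conflicts rarer but, as noted earlier in the description, costs more area for the disconnect circuitry.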
Another example of an architecture for dual-port access along both rows and columns is depicted in FIG. 48. In particular, FIG. 48 depicts example circuitry 4800 that provides dual-port access along both rows and columns using multiple row decoders in combination with multiple column multiplexers. In FIG. 48, row decoder 4801a may access a first word line, and column multiplexer 4803a may decode data from one or more memory cells along the first word line, while row decoder 4801b may access a second word line, and column multiplexer 4803b may decode data from one or more memory cells along the second word line.
As described with respect to FIG. 47B, this access may be performed simultaneously during one memory clock cycle. Accordingly, similar to the architecture of FIG. 46, the architecture of FIG. 48 (including the memory mats described below with respect to FIG. 49) may allow multiple addresses to be accessed in the same clock cycle. For example, the architecture of FIG. 48 may include any number of row decoders and any number of column multiplexers, such that a number of addresses corresponding to the number of row decoders and column multiplexers may all be accessed within a single memory clock cycle.
In other embodiments, this access may be performed sequentially over two memory clock cycles. By clocking memory chip 4800 faster than the corresponding logic circuits, two memory clock cycles may be equivalent to one clock cycle of the logic circuits using the memory. For example, as described above, the logic circuits may comprise conventional circuits such as a GPU or a CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.
Other embodiments may allow simultaneous access. For example, as described with respect to FIG. 42, multiple column decoders (which may comprise column multiplexers, such as 4803a and 4803b, as shown in FIG. 48) may read multiple bit lines along the same word line during a single memory clock cycle. Additionally or alternatively, as described with respect to FIG. 46, circuitry 4800 may incorporate additional circuitry so that this access may be simultaneous. For example, row decoder 4801a may access a first word line, and column multiplexer 4803a may decode data from memory cells along the first word line, during the same memory clock cycle in which row decoder 4801b accesses a second word line and column multiplexer 4803b decodes data from memory cells along the second word line.
The architecture of FIG. 48 may be used with modified memory mats forming a memory bank, as shown in FIG. 49. In FIG. 49, each memory cell (depicted as a capacitor, as in a DRAM, but which may also comprise a number of transistors arranged as in an SRAM or any other memory cell) is accessed by two word lines and two bit lines. Accordingly, memory mat 4900 of FIG. 49 allows two different bits, or even the same bit, to be accessed simultaneously by two different logic circuits. However, unlike the embodiments above, the embodiment of FIG. 49 modifies the memory mats themselves rather than implementing a dual-port solution on standard DRAM memory mats, which are wired for single-port access.
Although described as having two ports, any of the embodiments described above may be extended to more than two ports. For example, the embodiments of FIGS. 42, 46, 48, and 49 may include additional column or row multiplexers, respectively, to provide access to additional columns or rows during a single clock cycle. As another example, the embodiments of FIGS. 43 and 44 may include additional row decoders and/or column multiplexers to provide access to additional rows or columns, respectively, during a single clock cycle.
Variable Word Length Access in Memory
As used above and further below, the term "coupled" may include direct connection, indirect connection, electrical communication, and the like.
Moreover, terms such as "first," "second," and the like are used to distinguish between elements or method steps having the same or similar names or titles and do not necessarily indicate a spatial or temporal order.
Typically, a memory chip may include memory banks. A memory bank may be coupled to a row decoder and a column decoder configured to select a particular word (or other fixed-size data unit) to be read or written. Each memory bank may include memory cells for storing the data units, sense amplifiers for amplifying voltages from the memory cells selected by the row and column decoders, and any other appropriate circuits.
Each memory bank typically has a particular I/O width. For example, the I/O width may comprise a word.
While some processes executed by logic circuits using the memory chip may benefit from very long words, other processes may require only a part of the word.
Indeed, in-memory computing units (such as processor subunits disposed on the same substrate as the memory chip, e.g., as depicted and described in FIG. 7A) frequently perform memory access operations that require only a part of the word.
To reduce the latency associated with accessing an entire word when only a portion is used, embodiments of the present disclosure may provide methods and systems for fetching only one or more parts of a word, thereby reducing the data losses associated with transferring unneeded parts of the word and allowing power saving in the memory device.
Moreover, embodiments of the present disclosure may also reduce the power consumed by the interaction between the memory chip and other entities that access the memory chip and that may receive or write only a part of the word (such as logic circuits, whether separate, like CPUs and GPUs, or included on the same substrate as the memory chip, like the processor subunits depicted and described in FIG. 7A).
A memory access command (e.g., from the logic circuits using the memory) may include an address in the memory. For example, the address may include a row address and a column address, or may be translated into a row address and a column address, e.g., by a memory controller of the memory.
In many volatile memories, such as DRAM, the row address is sent (e.g., directly by the logic circuits or using a memory controller) to a row decoder, which activates an entire row (also called a word line) and loads all of the bit lines included in that row.
The column address identifies the bit line(s) on the activated row that are transferred outside the memory bank that includes them and on to next-level circuitry. For example, the next-level circuitry may comprise an I/O bus of the memory chip. In embodiments using in-memory processing, the next-level circuitry may comprise a processor subunit of the memory chip (e.g., as depicted in FIG. 7A).
Accordingly, the memory chips described below may be included in, or otherwise comprise, a memory chip as illustrated in any of FIGS. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.
The memory chip may be manufactured by a first manufacturing process optimized for memory cells rather than logic cells. For example, the memory cells manufactured by the first manufacturing process may exhibit a critical dimension that is smaller (e.g., by a factor exceeding 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like) than the critical dimension of logic circuits manufactured by the first manufacturing process. For example, the first manufacturing process may comprise an analog manufacturing process, a DRAM manufacturing process, and the like.
Such a memory chip may comprise an integrated circuit that may include a memory unit. The memory unit may include memory cells, an output port, and read circuitry. In some embodiments, the memory unit may further include a processing unit, such as a processor subunit as described above.
For example, the read circuitry may include a reduction unit and a first group of memory read paths for outputting up to a first number of bits through the output port. The output port may connect to off-chip logic circuits (such as an accelerator, CPU, GPU, or the like) or to an on-chip processor subunit, as described above.
In some embodiments, the processing unit may include the reduction unit, may be a part of the reduction unit, may differ from the reduction unit, or may otherwise comprise the reduction unit.
In-memory read paths may be included in the integrated circuit (e.g., may be in the memory unit) and may include any circuits and/or links configured for reading from and/or writing to memory cells. For example, the in-memory read paths may include sense amplifiers, conductors coupled to the memory cells, multiplexers, and the like.
The processing unit may be configured to send to the memory unit a read request for reading a second number of bits from the memory unit. Additionally or alternatively, the read request may originate from off-chip logic circuits (such as an accelerator, CPU, GPU, or the like).
The reduction unit may be configured to assist in reducing the power consumption related to an access request, e.g., by using any of the partial word accesses described herein.
The reduction unit may be configured to control the memory read paths, during a read operation triggered by the read request, based on the first number of bits and the second number of bits. For example, control signals from the reduction unit may affect the memory consumption of the read paths so as to reduce the energy consumption of memory read paths that are not relevant to the requested second number of bits. For example, the reduction unit may be configured to control the irrelevant memory read paths when the second number is smaller than the first number.
As explained above, the integrated circuit may be included in, may include, or may otherwise comprise a memory chip as illustrated in any of FIGS. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.
The irrelevant in-memory read paths may be associated with irrelevant bits of the first number of bits, such as bits of the first number of bits that are not included in the second number of bits.
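FIG. 50 illustrates an example integrated circuit 5000 that includes memory cells 5001 to 5008 in a memory cell array 5050; an output port 5020 that includes bits 5021 to 5028; read circuitry 5040 that includes memory read paths 5011 to 5018; and a reduction unit 5030.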
When the second number of bits is read using the corresponding memory read paths, the irrelevant bits of the first number of bits may correspond to bits that should not be read (e.g., bits not included in the second number of bits).
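During a read operation, reduction unit 5030 may be configured to activate the memory read paths corresponding to the second number of bits, such that the activated memory read paths may convey the second number of bits. In such embodiments, only the memory read paths corresponding to the second number of bits may be activated.
During the read operation, reduction unit 5030 may be configured to shut down at least a portion of each irrelevant memory read path. For example, the irrelevant memory read paths may correspond to the irrelevant bits of the first number of bits.
It should be noted that, instead of shutting down at least a portion of an irrelevant memory path, reduction unit 5030 may ensure that the irrelevant memory path is not activated.
Additionally or alternatively, during the read operation, reduction unit 5030 may be configured to maintain the irrelevant memory read paths in a low-power mode. For example, the low-power mode may comprise supplying the irrelevant memory paths with a voltage or current lower than the normal operating voltage or current, respectively.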
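The decision the reduction unit makes for each read path can be sketched as follows. This is an illustrative model only: read paths are identified by bit position within the output port, and the function name `plan_read_paths` and the returned dictionary keys are hypothetical. Whether an irrelevant path is shut down, left unactivated, or held in a low-power mode is a hardware choice that this sketch does not distinguish.

```python
from typing import Dict, Iterable, List


def plan_read_paths(first_number: int,
                    requested_bits: Iterable[int]) -> Dict[str, List[int]]:
    """Split the port's read paths into relevant and irrelevant ones.

    first_number   -- width of the output port in bits
    requested_bits -- positions of the bits actually requested
                      (the "second number" of bits)
    """
    requested = set(requested_bits)
    if any(b < 0 or b >= first_number for b in requested):
        raise ValueError("requested bit outside the output port width")
    irrelevant = [b for b in range(first_number) if b not in requested]
    return {"activate": sorted(requested), "low_power": irrelevant}


# An 8-bit port (as in FIG. 50) serving a 3-bit request: five read
# paths need not be activated.
print(plan_read_paths(8, [0, 1, 2]))
# {'activate': [0, 1, 2], 'low_power': [3, 4, 5, 6, 7]}
```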
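Reduction unit 5030 may be further configured to control the bit lines of the irrelevant memory read paths.
Accordingly, reduction unit 5030 may be configured to load the bit lines of the relevant memory read paths and to maintain the bit lines of the irrelevant memory read paths in a low-power mode. For example, only the bit lines of the relevant memory read paths may be loaded.
Additionally or alternatively, reduction unit 5030 may be configured to load the bit lines of the relevant memory read paths while keeping the bit lines of the irrelevant memory read paths deactivated.
In some embodiments, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during the read operation and to maintain in a low-power mode a portion of each irrelevant memory read path, that portion being different from the bit line.
As explained above, a memory chip may use sense amplifiers to amplify voltages from the memory cells included in the memory chip. Accordingly, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during the read operation and to maintain in a low-power mode the sense amplifiers associated with at least some of the irrelevant memory read paths.
In such embodiments, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during the read operation and to maintain in a low-power mode one or more sense amplifiers associated with all of the irrelevant memory read paths.
Additionally or alternatively, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during the read operation and to maintain in a low-power mode the portions of the irrelevant memory read paths that follow (e.g., spatially and/or temporally) the one or more sense amplifiers associated with the irrelevant memory read paths.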
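In any of the embodiments described above, the memory unit may include a column multiplexer (not shown).
In such embodiments, reduction unit 5030 may be coupled between the column multiplexer and the output port.
Additionally or alternatively, reduction unit 5030 may be embedded in the column multiplexer.
Additionally or alternatively, reduction unit 5030 may be coupled between the memory cells and the column multiplexer.
Reduction unit 5030 may comprise reduction subunits that may be independently controllable. For example, different reduction subunits may be associated with different memory unit columns.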
Although described above with respect to read operations and read circuitry, the above embodiments may similarly be applied to write operations and write circuitry.
For example, an integrated circuit according to the present disclosure may include a memory unit comprising memory cells, an output port, and write circuitry. In some embodiments, the memory unit may further include a processing unit, such as a processor subunit as described above. The write circuitry may include a reduction unit and a first group of memory write paths for outputting up to a first number of bits through the output port. The processing unit may be configured to send to the memory unit a write request for writing a second number of bits from the memory unit. Additionally or alternatively, the write request may originate from off-chip logic circuits (such as an accelerator, CPU, GPU, or the like). Reduction unit 5030 may be configured to control the memory write paths, during a write operation triggered by the write request, based on the first number of bits and the second number of bits.
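FIG. 51 illustrates a memory bank 5100 that includes an array 5111 of memory cells addressed using row and column addresses (e.g., from an on-chip processor subunit or from off-chip logic circuits, such as an accelerator, CPU, GPU, or the like). As shown in FIG. 51, the memory cells are fed to bit lines (vertical) and word lines (horizontal, many omitted for simplicity). Moreover, row decoder 5112 may be fed with a row address (e.g., from the on-chip processor subunit, the off-chip logic circuits, or a memory controller not shown in FIG. 51), column multiplexer 5113 may be fed with a column address (e.g., from the on-chip processor subunit, the off-chip logic circuits, or a memory controller not shown in FIG. 51), and column multiplexer 5113 may receive outputs from up to an entire line and output up to a word over output bus 5115. In FIG. 51, output bus 5115 of column multiplexer 5113 is coupled to a main I/O bus 5114. In other embodiments, output bus 5115 may be coupled to a processor subunit of the memory chip (e.g., as depicted in FIG. 7A) that sends the row and column addresses. The division of the memory bank into memory mats is not shown for simplicity.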
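FIG. 52 illustrates a memory bank 5101. In FIG. 52, the memory bank is also illustrated as including processing-in-memory (PIM) logic 5116 that has inputs coupled to output bus 5115. PIM logic 5116 may generate addresses (e.g., comprising row addresses and column addresses) and output the addresses over PIM address bus 5118 to access the memory bank. PIM logic 5116 is an example of a reduction unit (e.g., unit 5030) that also comprises a processing unit. PIM logic 5116 may control other circuits, not shown in FIG. 52, that assist in reducing power. PIM logic 5116 may further control the memory paths of the memory unit that includes memory bank 5101.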
As explained above, in some cases the word length (e.g., the number of bit lines selected for transfer at a time) may be large.
In those cases, each word for reading and/or writing may be associated with a memory path that may consume power at various stages of the read and/or write operation, for example:
a. Loading the bit lines. To avoid loading a bit line to the needed value (either from the capacitor on the bit line in a read cycle, or to the new value to be written to the capacitor in a write cycle), the sense amplifier located at the end of the memory array must be disabled, and the capacitor holding the data must not be discharged or charged (otherwise, the data stored on it would be destroyed); and
b. Moving the data from the sense amplifiers, through the column multiplexer that selects the bit lines, to the rest of the chip (either to the I/O bus that transfers data into and out of the chip, or to embedded logic that will use the data, such as a processor subunit on the same substrate as the memory).
To achieve power saving, the integrated circuit of the present disclosure may determine, at row activation time, that some parts of a word are irrelevant and then send a disable signal to one or more sense amplifiers for those irrelevant parts of the word.
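FIG. 53 illustrates memory unit 5102, which includes memory cell array 5111, row decoder 5112, column multiplexer 5113 coupled to output bus 5115, and PIM logic 5116.
Memory unit 5102 also includes switches 5201 that enable or disable the passage of bits to column multiplexer 5113. Switches 5201 may comprise analog switches, transistors configured to function as switches, or any other circuitry configured to control the supply of voltage and/or the flow of current to portions of memory unit 5102. Sense amplifiers (not shown) may be located at the edge of the memory cell array, e.g., before (spatially and/or temporally) switches 5201.
Switches 5201 may be controlled by enable signals sent from PIM logic 5116 via bus 5117. When open, the switches are configured to disconnect the sense amplifiers (not shown) of memory unit 5102, so that the bit lines disconnected from the sense amplifiers are neither discharged nor charged.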
Switches 5201 and PIM logic 5116 may form a reduction unit (e.g., reduction unit 5030).
In yet another example, PIM logic 5116 may send the enable signals to the sense amplifiers (e.g., when the sense amplifiers have an enable input) rather than to switches 5201.
The bit lines may additionally or alternatively be disconnected at other points, e.g., not at the end of the bit lines and after the sense amplifiers. For example, a bit line may be disconnected before entering array 5111.
In these embodiments, power may also be saved on the data transfer from the sense amplifiers and the forwarding hardware (such as output bus 5115).
Other embodiments (which may save less power but may be easier to implement) focus on saving the power of column multiplexer 5113 and the transfer losses from column multiplexer 5113 to the next level of circuitry. For example, as explained above, the next level of circuitry may comprise an I/O bus of the memory chip (such as bus 5115). In embodiments using processing in memory, the next level of circuitry may additionally or alternatively comprise a processor subunit of the memory chip (such as PIM logic 5116).
FIG. 54A illustrates column multiplexer 5113 segmented into multiple sections 5202. Each section 5202 of column multiplexer 5113 may be individually enabled or disabled by enable and/or disable signals sent from PIM logic 5116 via bus 5119. Column multiplexer 5113 may also be fed by address column bus 5118.
The embodiment of FIG. 54A may provide finer control over different portions of the output from column multiplexer 5113.
It should be noted that the control of different memory paths may have different resolutions, e.g., ranging from single-bit resolution to multi-bit resolution. The former may be more effective in terms of power savings. The latter may be simpler to implement and may require fewer control signals.
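By way of a non-limiting illustration, the trade-off between control resolutions can be sketched in Python. The function names, the contiguous-bit-range assumption, and the `start_bit` parameter are hypothetical and are not taken from the described circuits; the sketch only shows how a reduction unit might derive one enable flag per controlled section from the path width (the "first number") and the requested width (the "second number").

```python
def segment_enables(path_width, requested_bits, start_bit=0, segment_bits=1):
    """Return one enable flag per control segment of the read/write paths.

    path_width     -- the "first number": bits the paths can carry at once
    requested_bits -- the "second number": bits the access actually needs
    start_bit      -- assumed position of the first requested bit (hypothetical)
    segment_bits   -- control resolution (1 = per bit, 8 = per byte, ...)
    """
    if segment_bits <= 0 or path_width % segment_bits != 0:
        raise ValueError("path_width must be a multiple of segment_bits")
    if requested_bits < 0 or start_bit < 0 or start_bit + requested_bits > path_width:
        raise ValueError("requested bits must fit inside the path width")

    end_bit = start_bit + requested_bits
    enables = []
    for seg_start in range(0, path_width, segment_bits):
        seg_end = seg_start + segment_bits
        # A segment stays enabled only if it overlaps the requested bit range.
        overlaps = requested_bits > 0 and seg_start < end_bit and seg_end > start_bit
        enables.append(overlaps)
    return enables


def powered_paths(enables, segment_bits):
    """Number of single-bit paths left powered for a given enable list."""
    return sum(enables) * segment_bits
```

Under these assumptions, a 10-bit request starting at bit 4 of a 64-bit path keeps 10 paths powered with single-bit control (64 enable signals) and 16 paths powered with 8-bit control (8 enable signals).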
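FIG. 54B illustrates example method 5130. For example, method 5130 may be implemented using any of the memory units described above with respect to FIG. 50, FIG. 51, FIG. 52, FIG. 53, or FIG. 54A.
Method 5130 may include steps 5132 and 5134.
Step 5132 may include sending, by a processing unit of an integrated circuit (e.g., PIM logic 5116), an access request to a memory unit of the integrated circuit to read a second number of bits from the memory unit. The memory unit may include memory cells (e.g., the memory cells of array 5111), an output port (e.g., output bus 5115), and read/write circuitry that may include a reduction unit (e.g., reduction unit 5030) and a first group of memory read/write paths for outputting and/or inputting up to a first number of bits through the output port.
The access request may comprise a read request and/or a write request.
The memory input/output paths may comprise memory read paths, memory write paths, and/or paths used for both reading and writing.
Step 5134 may include responding to the access request.
For example, step 5134 may include controlling, by the reduction unit (e.g., unit 5030), the memory read/write paths during an access operation triggered by the access request, based on the first number of bits and the second number of bits.
Step 5134 may further include any one of the following operations and/or any combination of them. Any of the operations listed below may be performed while responding to the access request, but may also be performed before and/or after responding to the access request.
Accordingly, step 5134 may include at least one of the following operations: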
a. controlling irrelevant memory read paths when the second number is smaller than the first number, wherein the irrelevant memory read paths are associated with bits of the first number of bits that are not included in the second number of bits;
b. activating relevant memory read paths during a read operation, wherein the relevant memory read paths are configured to convey the second number of bits;
c. shutting down at least a portion of each of the irrelevant memory read paths during the read operation;
d. maintaining the irrelevant memory read paths in a low-power mode during the read operation;
e. controlling bit lines of the irrelevant memory read paths;
f. loading bit lines of the relevant memory read paths and maintaining bit lines of the irrelevant memory read paths in a low-power mode;
g. loading bit lines of the relevant memory read paths while keeping bit lines of the irrelevant memory read paths deactivated;
h. utilizing portions of the relevant memory read paths during the read operation and maintaining a portion of each irrelevant memory read path in a low-power mode, wherein the portion differs from a bit line;
i. utilizing portions of the relevant memory read paths during the read operation and maintaining in a low-power mode the sense amplifiers used for at least some of the irrelevant memory read paths;
j. utilizing portions of the relevant memory read paths during the read operation and maintaining in a low-power mode the sense amplifiers of at least some of the irrelevant memory read paths; and
k. utilizing portions of the relevant memory read paths during the read operation and maintaining in a low-power mode the portions of the irrelevant memory read paths that follow the sense amplifiers of the irrelevant memory read paths.
A low-power mode or idle mode may comprise a mode in which the power consumption of a memory access path is lower than its power consumption when the memory access path is used for an access operation. In some embodiments, the low-power mode may even involve shutting down the memory access path. The low-power mode may additionally or alternatively include not activating the memory access path.
It should be noted that power reductions that occur during the bit-line phase may require that the relevance or irrelevance of a memory access path be known before the word line is opened. Power reductions that occur elsewhere (e.g., in the column multiplexer) may instead allow the relevance or irrelevance of a memory access path to be decided on each access.
Fast and low-power activation and fast-access memory
DRAM and other memory types (such as SRAM, flash memory, or the like) are often built from memory banks, which are usually built to allow a row-and-column access scheme.
FIG. 55 illustrates an example of memory chip 5140, which includes multiple memory mats and associated logic (such as row and column decoders, depicted as RD and COL in FIG. 55, respectively). In the example of FIG. 55, the mats are grouped into banks and have word lines and bit lines running through them. The memory mats and associated logic are denoted 5141, 5142, 5143, 5144, 5145, and 5146 in FIG. 55 and share at least one bus 5147.
Memory chip 5140 may be included in, may include, or may otherwise comprise a memory chip as illustrated in any of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, or 23.
For example, in DRAM, the overhead associated with activating a new row (e.g., preparing a new line for access) is substantial. Once a line has been activated (also referred to as opened), the data within that row is available for much faster access. In DRAM, this access may occur in a random manner.
Two problems associated with activating a new line are power and time:
a. power rises because of the current surge caused by accessing all the capacitors on the line together and having to load the line (e.g., power may reach several amperes when opening a line with only a few memory banks); and
b. the time-delay problem is mainly associated with the time it takes to load the row (word) line and then the bit (column) lines.
Some embodiments of the present invention may include systems and methods for reducing peak power consumption during line activation and for reducing line activation time. Some embodiments may sacrifice, at least to some extent, full random access within a line in order to reduce these power and time costs.
For example, in one embodiment, a memory unit may include a first memory mat, a second memory mat, and an activation unit configured to activate a first group of memory cells included in the first memory mat without activating a second group of memory cells included in the second memory mat. The first group of memory cells and the second group of memory cells may both belong to a single row of the memory unit.
Alternatively, the activation unit may be configured to activate the second group of memory cells included in the second memory mat without activating the first group of memory cells.
In some embodiments, the activation unit may be configured to activate the second group of memory cells after activating the first group of memory cells.
For example, the activation unit may be configured to activate the second group of memory cells upon expiration of a delay period that starts after activation of the first group of memory cells has been completed.
Additionally or alternatively, the activation unit may be configured to activate the second group of memory cells based on the value of a signal generated on a first word line segment coupled to the first group of memory cells.
In any of the embodiments described above, the activation unit may include an intermediate circuit disposed between a first word line segment and a second word line segment. In these embodiments, the first word line segment may be coupled to the first memory cells and the second word line segment may be coupled to the second memory cells. Non-limiting examples of intermediate circuits include switches, flip-flops, buffers, inverters, and the like, some of which are illustrated throughout FIGS. 56 to 61.
In some embodiments, the second memory cells may be coupled to a second word line segment. In these embodiments, the second word line segment may be coupled to a bypass word line path that passes through at least the first memory mat. An example of such a bypass path is illustrated in FIG. 61.
The activation unit may comprise a control unit configured to control the supply of voltage (and/or current) to the first group of memory cells and the second group of memory cells based on an activation signal from a word line associated with the single row.
In another example embodiment, a memory unit may include a first memory mat, a second memory mat, and an activation unit configured to supply an activation signal to a first group of memory cells of the first memory mat and to delay the supply of the activation signal to a second group of memory cells of the second memory mat at least until activation of the first group of memory cells has been completed. The first group of memory cells and the second group of memory cells may belong to a single row of the memory unit.
For example, the activation unit may include a delay unit that may be configured to delay the supply of the activation signal.
Additionally or alternatively, the activation unit may include a comparator that may be configured to receive the activation signal at its input and to control the delay unit based on at least one characteristic of the activation signal.
In another example embodiment, a memory unit may include a first memory mat, a second memory mat, and an isolation unit that may be configured to: isolate first memory cells of the first memory mat from second memory cells of the second memory mat during an initial activation period in which the first memory cells are activated; and couple the first memory cells to the second memory cells after the initial activation period. The first memory cells and the second memory cells may belong to a single row of the memory unit.
In the following examples, no modification of the memory mats themselves may be required. In some examples, embodiments may rely on minor modifications to the memory banks.
The figures below depict mechanisms, added to the memory bank, that shorten the word signal and thereby split the word line into several shorter portions.
In the following figures, various memory bank components are omitted for clarity.
FIGS. 56 to 61 illustrate portions of memory banks (denoted 5140(1), 5140(2), 5140(3), 5140(4), 5140(5), and 5140(6), respectively) that include row decoder 5112 and multiple memory mats (such as 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), 5150(6), 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), 5151(6), 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6)) grouped into different groups.
Memory mats arranged in a row may include different groups.
FIGS. 56 to 59 and FIG. 61 illustrate nine groups of memory mats, where each group includes a pair of memory mats. Any number of groups, each having any number of memory mats, may be used.
Memory mats 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), and 5150(6) are arranged in a row, share multiple memory lines, and are divided into three groups: a first upper group that includes memory mats 5150(1) and 5150(2); a second upper group that includes memory mats 5150(3) and 5150(4); and a third upper group that includes memory mats 5150(5) and 5150(6).
Similarly, memory mats 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), and 5151(6) are arranged in a row, share multiple memory lines, and are divided into three groups: a first intermediate group that includes memory mats 5151(1) and 5151(2); a second intermediate group that includes memory mats 5151(3) and 5151(4); and a third intermediate group that includes memory mats 5151(5) and 5151(6).
Furthermore, memory mats 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6) are arranged in a row, share multiple memory lines, and are divided into three groups: a first lower group that includes memory mats 5152(1) and 5152(2); a second lower group that includes memory mats 5152(3) and 5152(4); and a third lower group that includes memory mats 5152(5) and 5152(6). Any number of memory mats may be arranged in a row and share memory lines, and may be divided into any number of groups.
For example, the number of memory mats per group may be one, two, or more than two.
As explained above, the activation circuit may be configured to activate one group of memory mats without activating another group of memory mats that share the same memory line, or that are at least coupled to different memory line segments having the same line address.
FIGS. 56 to 61 illustrate different examples of activation circuits. In some embodiments, at least a portion of the activation circuit (such as an intermediate circuit) may be located between groups of memory mats to allow one group of memory mats to be activated while another group of memory mats in the same row is not activated.
FIG. 56 illustrates intermediate circuits, such as delay or isolation circuits 5153(1) to 5153(3), positioned between different lines of the first upper group of memory mats and different lines of the second upper group of memory mats.
FIG. 56 also illustrates intermediate circuits, such as delay or isolation circuits 5154(1) to 5154(3), positioned between different lines of the second upper group of memory mats and different lines of the third upper group of memory mats. In addition, some delay or isolation circuits are positioned between the groups formed from the memory mats of the intermediate groups. Furthermore, some delay or isolation circuits are positioned between the groups formed from the memory mats of the lower groups.
The delay or isolation circuits may delay or stop a word line signal from row decoder 5112 from propagating along a row to another group.
FIG. 57 illustrates intermediate circuits, such as delay or isolation circuits, that comprise flip-flops (such as 5155(1) to 5155(3) and 5156(1) to 5156(3)).
When an activation signal is injected onto a word line, one of the first mat groups (depending on the word line) is activated, while the other groups along that word line remain deactivated. The other groups may be activated on subsequent clock cycles. For example, a second of the groups may be activated on the next clock cycle, and a third of the groups may be activated after yet another clock cycle.
The flip-flops may comprise D-type flip-flops or any other type of flip-flop. For simplicity, the clock fed to the D-type flip-flops is omitted from the figure.
Accordingly, access to the first group may use power to charge only the portion of the word line associated with the first group, which is faster and requires less current than charging the entire word line.
More than one flip-flop may be used between groups of memory mats, thereby increasing the delay between opening portions. Additionally or alternatively, embodiments may use a slower clock to increase the delay.
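The staggered activation produced by flip-flops between mat groups may be illustrated with a small behavioral model. This is a sketch under simplifying assumptions (ideal D-type flip-flops, a word line held asserted from cycle 0, and a hypothetical function name), not a description of the actual circuit.

```python
def simulate_word_line(num_groups, flops_per_gap=1, num_cycles=None):
    """Return, for each clock cycle, a list of 0/1 activation flags per group.

    Group 0 is driven directly by the row decoder. Each later group sees the
    word-line signal only after it has been clocked through `flops_per_gap`
    flip-flops placed between it and the preceding group.
    """
    total_flops = (num_groups - 1) * flops_per_gap
    if num_cycles is None:
        num_cycles = total_flops + 1  # enough cycles to activate every group
    flops = [0] * total_flops         # flip-flop outputs, all cleared at start
    history = []
    for _ in range(num_cycles):
        # Group k (k >= 1) is driven by the last flip-flop in front of it.
        active = [1] + [flops[k * flops_per_gap - 1] for k in range(1, num_groups)]
        history.append(active)
        # Clock edge: the asserted decoder output shifts one stage further.
        flops = ([1] + flops)[:total_flops]
    return history
```

With three groups and one flip-flop per gap, the model activates one additional group per clock cycle; with two flip-flops per gap, each additional group is activated two cycles after the previous one.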
Furthermore, the activated groups may still include groups from the previously used line value. For example, the method may allow a new line segment to be activated while data of the previous line is still being accessed, thereby reducing the penalty associated with activating a new line.
Accordingly, some embodiments may have a first group activated while allowing other groups of the previously activated line to remain active, with the signals on the bit lines not interfering with each other.
Additionally, some embodiments may include switches and control signals. The control signals may be controlled by the bank controller or by adding flip-flops between the control signals (e.g., producing the same timing effect as the mechanism described above).
FIG. 58 illustrates intermediate circuits, such as delay or isolation circuits, that are switches (such as 5157(1) to 5157(3) and 5158(1) to 5158(3)) positioned between one group and another. A set of switches positioned between groups may be controlled by a dedicated control signal. In FIG. 58, the control signal may be sent by row control unit 5160(1) and delayed by a sequence of one or more delay units (e.g., units 5160(2) and 5160(3)) between the different sets of switches.
FIG. 59 illustrates intermediate circuits, such as delay or isolation circuits, that are sequences of inverter gates or buffers (such as 5159(1) to 5159(3) and 5159'(1) to 5159'(3)) positioned between groups of memory mats.
Instead of switches, buffers may be used between groups of memory mats. Buffers may prevent the voltage from dropping along the word line from switch to switch, an effect that sometimes occurs when a single-transistor structure is used.
Other embodiments may allow more random access while still providing very low activation power and time, by using additional area added to the memory bank.
An example is shown in FIG. 60, which illustrates the use of global word lines (such as 5152(1) to 5152(8)) positioned close to the memory mats. These word lines may or may not pass through the memory mats and are coupled to the word lines within the memory mats via intermediate circuits such as switches (such as 5157(1) to 5157(8)). The switches may control which memory mat is activated and allow the memory controller to activate only the relevant line portion at each point in time. Unlike the embodiments described above that use sequential activation of line portions, the example of FIG. 60 may provide greater control.
Enable signals, such as row-portion enable signals 5170(1) and 5170(2), may originate from logic that is not shown, such as a memory controller.
FIG. 61 illustrates global word lines 5180 that pass through the memory mats and form bypass paths for word line signals, which then may not need to be routed outside the mats. Accordingly, the embodiment shown in FIG. 61 may reduce the area of the memory bank at the cost of some memory density.
In FIG. 61, the global word lines may pass through the memory mats uninterrupted and may not be connected to memory cells. A local word line segment may be controlled by one of the switches and connected to the memory cells in the mat.
When the groups of memory mats provide a substantial partition of the word lines, the memory bank may virtually support full random access.
Another embodiment for slowing the spread of the activation signal along the word line, which may also save some wiring and logic, uses switches and/or other buffering or isolation circuits between memory mats without using dedicated enable signals and dedicated lines to convey the enable signals.
For example, a comparator may be used to control the switches or other buffering or isolation circuits. The comparator may activate a switch or other buffering or isolation circuit when the level of the signal on the word line segment monitored by the comparator reaches a certain level. For example, that level may indicate that the previous word line segment is fully loaded.
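As a rough illustration of comparator-driven propagation, the sketch below estimates when each word line segment would start charging. It assumes, purely for illustration, that every segment charges as an independent first-order RC load, V(t) = Vdd(1 - exp(-t/tau)), and that the comparator enables the next segment once the monitored voltage reaches a threshold. The RC model, the parameter values, and the function name are assumptions and are not part of the described embodiments.

```python
import math


def segment_enable_times(num_segments, tau_ns, vdd=1.0, threshold=0.9):
    """Return the time (in ns) at which each word-line segment starts charging.

    Segment 0 starts at t = 0. Segment k starts once segment k - 1 has charged
    to `threshold` volts, which is when the comparator would close the switch.
    """
    if not 0.0 < threshold < vdd:
        raise ValueError("threshold must lie strictly between 0 and vdd")
    # Time for one RC segment to rise from 0 V to the comparator threshold.
    charge_time = -tau_ns * math.log(1.0 - threshold / vdd)
    return [k * charge_time for k in range(num_segments)]
```

In this simplified model, a higher comparator threshold lengthens the interval between segment activations, which spreads the charging current over a longer time.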
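FIG. 62 illustrates method 5190 for operating a memory unit. For example, method 5190 may be implemented using any of the memory banks described above with respect to FIGS. 56 to 61.
Method 5190 may include steps 5192 and 5194.
Step 5192 may include activating, by an activation unit, a first group of memory cells included in a first memory mat of the memory unit without activating a second group of memory cells included in a second memory mat of the memory unit. The first group of memory cells and the second group of memory cells may both belong to a single row of the memory unit.
Step 5194 may include activating, by the activation unit, the second group of memory cells, e.g., after step 5192.
Step 5194 may be executed while the first group of memory cells is being activated, after the first group of memory cells has been fully activated, upon expiration of a delay period that starts after activation of the first group of memory cells has been completed, after the first group of memory cells has been deactivated, and the like.
The delay period may be fixed or adjustable. For example, the duration of the delay period may be based on an expected access pattern of the memory unit, or may be set regardless of the expected access pattern. The delay period may range from less than one millisecond to more than one second.
In some embodiments, step 5194 may be initiated based on the value of a signal generated on a first word line segment coupled to the first group of memory cells. For example, when the value of the signal exceeds a first threshold, this may indicate that the first group of memory cells is fully activated.
Either of steps 5192 and 5194 may involve using an intermediate circuit (e.g., an intermediate circuit of the activation unit) disposed between a first word line segment and a second word line segment. The first word line segment may be coupled to the first memory cells and the second word line segment may be coupled to the second memory cells.
Examples of intermediate circuits are illustrated throughout FIGS. 56 to 61.
Steps 5192 and 5194 may further include controlling, by a control unit, the supply of an activation signal from a word line associated with the single row to the first group of memory cells and the second group of memory cells.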
Using memory parallelism to speed up testing times and testing logic in memory using vectors
Some embodiments of the present invention may use in-chip test units to speed up testing.
In general, memory chip testing requires a large amount of testing time. Reducing testing time may reduce production cost and may also allow more testing to be performed, yielding a more reliable product.
FIGS. 63 and 64 illustrate tester 5200 and a chip (or a wafer of chips) 5210.
Tester 5200 may include software that manages the testing. Tester 5200 may run different sequences of data to all of memory 5210 and then read the sequences back to identify where the failed bits of memory 5210 are located. Once these are identified, tester 5200 may issue a command to fix the bits, and if the problem can be fixed, tester 5200 may declare memory 5210 as passed. In other cases, some chips may be declared as failed.
Tester 5200 may write test sequences and then read back the data to compare it with expected results.
FIG. 64 shows a test system with tester 5200 and a full wafer 5202 of chips (such as 5210) being tested in parallel. For example, tester 5200 may be connected to each of the chips by a bus of wires.
As shown in FIG. 64, tester 5200 must read and write all of the memory chips several times, and that data must pass through the external chip interface.
Moreover, it may be beneficial to test both the logic and the memory banks of an integrated circuit, e.g., using programmable configuration information that may be provided using regular I/O operations.
The testing may also benefit from the presence of test units within the integrated circuit.
The test units may belong to the integrated circuit and may analyze test results and find faults, e.g., in the logic (e.g., processor subunits as depicted and described in FIG. 7A) and/or in the memory (e.g., across a plurality of memory banks).
Memory testers are usually very simple and exchange test vectors with integrated circuits according to a simple format. For example, there may be write vectors that include pairs of addresses of memory entries to be written and values to be written to those memory entries. There may also be read vectors that include addresses of memory entries to be read. At least some of the addresses of the write vectors may be the same as at least some of the addresses of the read vectors. At least some other addresses of the write vectors may differ from at least some other addresses of the read vectors. When programmed, the memory tester may also receive an expected result vector, which may include the addresses of the memory entries to be read and the expected values to be read. The memory tester may compare the expected values with the values it reads.
According to an embodiment, the logic (e.g., processor subunits) of an integrated circuit (with or without the memory of the integrated circuit) may be tested by a memory tester using the same protocol/format. For example, some of the values in the write vectors may be commands to be executed by the logic of the integrated circuit (and may, for example, involve computations and/or memory accesses). The memory tester may be programmed with read vectors and with an expected result vector that may include memory entry addresses, at least some of which store expected values of the computations. Thus, the memory tester may be used to test the logic as well as the memory. Memory testers are generally simpler and cheaper than logic testers, and the proposed method allows complex logic tests to be performed using a simple memory tester.
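The vector format described above can be illustrated with a toy model. Everything specific in this sketch is hypothetical and chosen only for illustration: the `ToyPimDevice` class, the command address `0xFF`, the single ADD opcode, and the 32-bit command encoding are not taken from any actual device. The point shown is that a tester which only writes (address, value) pairs and compares read-back values against an expected result vector can exercise logic as well as memory.

```python
class ToyPimDevice:
    """Toy memory with one logic command, triggered by writing to COMMAND_ADDR."""

    COMMAND_ADDR = 0xFF  # hypothetical address that the logic treats as a command port
    OP_ADD = 0x01        # hypothetical opcode: mem[dst] = mem[a] + mem[b]

    def __init__(self, size=256, stuck_at_zero=()):
        self.cells = [0] * size
        self.stuck = set(stuck_at_zero)  # addresses that always store 0 (injected faults)

    def _store(self, addr, value):
        self.cells[addr] = 0 if addr in self.stuck else value

    def write(self, addr, value):
        if addr == self.COMMAND_ADDR:
            self._execute(value)
        else:
            self._store(addr, value)

    def read(self, addr):
        return self.cells[addr]

    def _execute(self, word):
        # Command layout (assumed): opcode | operand a | operand b | destination
        opcode, a, b, dst = (word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF
        if opcode == self.OP_ADD:
            self._store(dst, (self.cells[a] + self.cells[b]) & 0xFFFFFFFF)


def encode_add(a, b, dst):
    """Build the hypothetical ADD command word."""
    return (ToyPimDevice.OP_ADD << 24) | (a << 16) | (b << 8) | dst


def run_tester(device, write_vectors, expected_vectors):
    """Apply write vectors, then compare read-back values with expected values."""
    for addr, value in write_vectors:
        device.write(addr, value)
    failures = []
    for addr, expected in expected_vectors:
        got = device.read(addr)
        if got != expected:
            failures.append((addr, expected, got))
    return failures
```

In this model the tester never learns that address `0xFF` is special; the computed sum simply appears in the expected result vector next to ordinary memory values.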
In some embodiments, the logic within the memory may enable testing of the logic within the memory using only vectors (or other data structures), without the more complex mechanisms common in logic testing (such as communicating with a controller, e.g., via an interface, to tell the logic which circuit to test).
Instead of using test units, a memory controller may be configured to receive instructions to access memory entries included in the configuration information, to execute the access instructions, and to output the results.
Any of the integrated circuits illustrated in FIGS. 65 to 69 may perform the testing even in the absence of test units, or in the presence of test units that are not capable of performing the testing.
Embodiments of the present invention may include methods and systems that use memory parallelism and internal chip bandwidth to speed up and improve testing times.
The method and system may be based on the memory chip testing itself (as opposed to a tester running the test, reading the test results, and analyzing them), saving the results, and eventually allowing the tester to read them (and, if needed, to program the memory chip in return, e.g., to activate redundancy mechanisms). The testing may include testing the memory, or testing the memory banks and the logic (in the case of a computational memory with functional logic portions to test, such as that described above with respect to FIG. 7A).
In one embodiment, the method may include reading and writing data within the chip so that external bandwidth does not limit the testing.
In embodiments where the memory chip includes processor subunits, each processor subunit may be programmed with test code or a test configuration.
In embodiments where the memory chip has processor subunits that cannot execute test code, or has no processor subunits but has a memory controller, the memory controller may then be configured to read and write patterns (e.g., programmed into the controller externally) and to mark the locations of faults (e.g., writing a value to a memory entry, reading the entry, and receiving a value different from the written value) for further analysis.
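A pattern-based self-test of this kind might look like the following sketch. The `FaultyMemory` class, its 8-bit entries, the injected stuck-at faults, and the choice of the complementary patterns 0x55 and 0xAA are all assumptions made for illustration; an actual controller could use any externally programmed patterns.

```python
class FaultyMemory:
    """Toy 8-bit-wide memory with optional stuck-at bits, for illustration only."""

    def __init__(self, num_entries, stuck_at_one=None, stuck_at_zero=None):
        self.cells = [0] * num_entries
        self.stuck_at_one = dict(stuck_at_one or {})    # addr -> mask of bits stuck at 1
        self.stuck_at_zero = dict(stuck_at_zero or {})  # addr -> mask of bits stuck at 0

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        value |= self.stuck_at_one.get(addr, 0)
        value &= ~self.stuck_at_zero.get(addr, 0) & 0xFF
        return value


def find_faulty_entries(memory, patterns=(0x55, 0xAA)):
    """Write each pattern to every entry, read it back, and mark mismatches."""
    faulty = set()
    for pattern in patterns:
        for addr in range(len(memory.cells)):
            memory.write(addr, pattern)
        for addr in range(len(memory.cells)):
            if memory.read(addr) != pattern:
                faulty.add(addr)  # location recorded for further analysis
    return sorted(faulty)
```

Because the whole loop runs inside the chip in this model, only the short list of faulty addresses would need to cross the external interface, rather than every written and read value.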
應注意,測試記憶體可能需要測試大量位元,例如,測試記憶體之每一位元及驗證受測位元是否起作用。此外,有時可在不同電壓及溫度條件下重複記憶體測試。 It should be noted that testing memory may need to test a large number of bits, for example, testing each bit of the memory and verifying whether the tested bit works. In addition, sometimes the memory test can be repeated under different voltage and temperature conditions.
對於一些缺陷,可啟動一或多個冗餘機構(例如,藉由程式化快閃記憶體或OTP或燒斷熔斷器)。此外,可能亦必須測試記憶體晶片之邏輯及類比電路(例如,控制器、調節器、I/O)。 For some defects, one or more redundant mechanisms can be activated (for example, by programming a flash memory or OTP or blowing a fuse). In addition, it may also be necessary to test the logic and analog circuits of the memory chip (eg, controller, regulator, I/O).
在一個實施例中,積體電路可包括:基板、安置於基板上之記憶體陣列、安置於基板上之處理陣列,及安置於基板上之介面。 In one embodiment, the integrated circuit may include a substrate, a memory array disposed on the substrate, a processing array disposed on the substrate, and an interface disposed on the substrate.
本文中所描述之積體電路可包括於如圖3A、圖3B、圖4至圖6、圖7A至圖7D、圖11至圖13、圖16至圖19、圖22或圖23中之任一者中所說明的記憶體晶片中,可包括該記憶體晶片,或以其他方式包含該記憶體晶片。 The integrated circuit described herein may be included in, may include, or may otherwise comprise a memory chip as illustrated in any one of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, or 23.
圖65至圖69說明各種積體電路5210及測試器5200。
Figures 65 to 69 illustrate various integrated circuits 5210 and a tester 5200.
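該積體電路說明為包括記憶體組5212、晶片介面5211(諸如,由該等記憶體組共用之I/O控制器5214及匯流排5213)及邏輯單元(在下文中為「邏輯」)5215。圖66說明熔斷器介面5216及耦接至熔斷器介面及不同記憶體組之匯流排5217。
The integrated circuit is illustrated as including memory banks 5212, a chip interface 5211 (such as an I/O controller 5214 and a bus 5213 shared by the memory banks), and logic units (hereinafter "logic") 5215. Figure 66 illustrates a fuse interface 5216 and a bus 5217 coupled to the fuse interface and to the different memory banks.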
圖65至圖70亦說明測試處理程序中之各種步驟,諸如: Figure 65 to Figure 70 also illustrate various steps in the test process, such as:
a.寫入測試序列5221(圖65、圖67、圖68及圖69); a. Write test sequence 5221 (Figure 65, Figure 67, Figure 68 and Figure 69);
b.讀回測試結果5222(圖67、圖68及圖69); b. Read back the test result 5222 (Figure 67, Figure 68 and Figure 69);
c.寫入預期結果序列5223(圖65); c. Write the expected result sequence 5223 (Figure 65);
d.讀取故障位址以修復5224(圖66);及 d. Read the faulty address to repair 5224 (Figure 66); and
e.程式化熔斷器5225(圖66)。 e. Program fuses 5225 (Figure 66).
每一記憶體組可耦接至其自身的邏輯單元5215及/或由該邏輯單元來控制。然而,如上文所描述,可提供對邏輯單元5215之任何記憶體組分配。因此,邏輯單元5215之數目可不同於記憶體組之數目,邏輯單元可控制多於單個記憶體組或一記憶體組之一部分,及其類似者。
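Each memory bank can be coupled to and/or controlled by its own logic unit 5215. However, as described above, any allocation of memory banks to logic units 5215 may be provided. Thus, the number of logic units 5215 may differ from the number of memory banks, a logic unit may control more than a single memory bank or only a portion of a memory bank, and the like.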
邏輯單元5215可包括一或多個測試單元。圖65說明邏輯5215內之測試單元(TU)5218。TU可包括於所有或一些邏輯單元5212中。應注意,測試單元可與邏輯單元分開或與邏輯單元整合。
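The logic unit 5215 may include one or more test units. Figure 65 illustrates a test unit (TU) 5218 within the logic 5215. A TU may be included in all or some of the logic units 5212. It should be noted that a test unit may be separate from the logic unit or integrated with it.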
圖65亦說明TU 5218內之測試圖案產生器(表示為GEN)5219。
Figure 65 also illustrates the test pattern generator (denoted as GEN) 5219 within the TU 5218.
測試圖案產生器可包括於所有或一些測試單元中。為簡單起見,測試圖案產生器及測試單元未說明於圖66至圖70中,但可包括於此等實施例中。 The test pattern generator may be included in all or some of the test units. For simplicity, the test pattern generator and the test unit are not illustrated in FIGS. 66 to 70, but may be included in these embodiments.
該記憶體陣列可包括多個記憶體組。此外,該處理陣列可包括複數個測試單元。該等複數個測試單元可經組態以測試多個記憶體組以提供測試結果。該介面可經組態以將提示測試結果之資訊輸出至在積體電路外部之裝置。 The memory array may include a plurality of memory banks. In addition, the processing array may include a plurality of test units. The plurality of test units can be configured to test multiple memory banks to provide test results. The interface can be configured to output information that prompts the test result to a device outside the integrated circuit.
該等複數個測試單元可包括至少一個測試圖案產生器,該至少一個測試圖案產生器經組態以產生至少一個測試圖案以供用於測試多個記憶體組中之一或多者。在一些實施例中,如上文所解釋,該等複數個測試單元中之每一者可包括測試圖案產生器,該測試圖案產生器經組態以產生測試圖案以供該等複數個測試單元中之特定測試單元使用以測試多個記憶體組中之至少一者。如上文所提示,圖65說明測試單元內之測試圖案產生器(GEN)5219。一或多個或甚至所有邏輯單元可包括測試圖案產生器。 The plurality of test units may include at least one test pattern generator configured to generate at least one test pattern for use in testing one or more of the plurality of memory banks. In some embodiments, as explained above, each of the plurality of test units may include a test pattern generator configured to generate a test pattern for use by a particular test unit of the plurality of test units in testing at least one of the plurality of memory banks. As indicated above, FIG. 65 illustrates the test pattern generator (GEN) 5219 within a test unit. One or more, or even all, of the logic units may include a test pattern generator.
至少一個測試圖案產生器可經組態以自介面接收用於產生至少一個測試圖案之指令。測試圖案可包括在測試期間應存取(例如,讀取及/或寫入)之記憶體條目及/或待寫入至該等條目之值,及其類似者。 The at least one test pattern generator can be configured to receive instructions for generating at least one test pattern from the interface. The test pattern may include memory entries that should be accessed (for example, read and/or write) during the test and/or values to be written to these entries, and the like.
該介面可經組態以自可在積體電路外部之外部單元接收組態資訊,該組態資訊包括用於產生至少一個測試圖案之指令。 The interface can be configured to receive configuration information from external units that can be external to the integrated circuit, the configuration information including instructions for generating at least one test pattern.
至少一個測試圖案產生器可經組態以自記憶體陣列讀取組態資訊,該組態資訊包括用於產生至少一個測試圖案之指令。 The at least one test pattern generator can be configured to read configuration information from the memory array, the configuration information including commands for generating at least one test pattern.
在一些實施例中,該組態資訊可包括向量。 In some embodiments, the configuration information may include vectors.
該介面可經組態以自可在積體電路外部之裝置接收組態資訊,該組態資訊可包括可為至少一個測試圖案之指令。 The interface can be configured to receive configuration information from a device that can be external to the integrated circuit, and the configuration information can include commands that can be at least one test pattern.
舉例而言,至少一個測試圖案可包括待在記憶體陣列之測試期間存取的記憶體陣列條目。 For example, the at least one test pattern may include memory array entries to be accessed during the test of the memory array.
至少一個測試圖案進一步可包括待寫入至在記憶體陣列之測試期間存取之記憶體陣列條目的輸入資料。 The at least one test pattern may further include input data to be written to the memory array entry accessed during the test of the memory array.
另外或替代地,至少一個測試圖案進一步可包括待寫入至在記憶體陣列之測試期間存取之記憶體陣列條目的輸入資料,及待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之預期值。 Additionally or alternatively, the at least one test pattern may further include input data to be written to the memory array entries accessed during the test of the memory array, and expected values of the output data to be read from the memory array entries accessed during the test of the memory array.
在一些實施例中,該等複數個測試單元可經組態以自記憶體陣列擷取一旦由該等複數個測試單元執行便使該等複數個測試單元測試該記憶體陣列之測試指令。 In some embodiments, the plurality of test units may be configured to retrieve the test instructions from the memory array that, once executed by the plurality of test units, cause the plurality of test units to test the memory array.
舉例而言,該等測試指令可包括於組態資訊中。 For example, these test commands can be included in the configuration information.
組態資訊可包括記憶體陣列之測試的預期結果。 The configuration information may include the expected result of the test of the memory array.
另外或替代地,該組態資訊可包括待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之值。 Additionally or alternatively, the configuration information may include the value of the output data to be read from the memory array entry accessed during the test of the memory array.
另外或替代地,該組態資訊可包括向量。 Additionally or alternatively, the configuration information may include vectors.
在一些實施例中,該等複數個測試單元可經組態以自記憶體陣列擷取一旦由該等複數個測試單元執行便使該等複數個測試單元測試該記憶體陣列且測試該處理陣列之測試指令。 In some embodiments, the plurality of test units can be configured to retrieve from the memory array once executed by the plurality of test units to make the plurality of test units test the memory array and test the processing array The test instruction.
舉例而言,該等測試指令可包括於組態資訊中。 For example, these test commands can be included in the configuration information.
該組態資訊可包括向量。 The configuration information may include vectors.
另外或替代地,該組態資訊可包括記憶體陣列及處理陣列之測試的預期結果。 Additionally or alternatively, the configuration information may include the expected result of the test of the memory array and the processing array.
在一些實施例中,如上文所描述,該等複數個測試單元可能缺乏測試圖案產生器,該測試圖案產生器用於產生在多個記憶體組之測試期間使用的測試圖案。 In some embodiments, as described above, the plurality of test units may lack a test pattern generator, which is used to generate test patterns used during testing of a plurality of memory banks.
在此等實施例中,該等複數個測試單元中之至少兩個可經組態以並列地測試多個記憶體組中之至少兩個。 In these embodiments, at least two of the plurality of test units may be configured to test at least two of the plurality of memory groups in parallel.
替代地,該等複數個測試單元中之至少兩個可經組態以串列地測試多個記憶體組中之至少兩個。 Alternatively, at least two of the plurality of test units may be configured to test at least two of the plurality of memory groups in series.
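The difference between serial and parallel testing of memory banks can be illustrated with a short sketch. Everything here is assumed for the example: the dictionary-based bank model, the `check_bank` function, and the 0x5A pattern. Python threads only model the scheduling (each bank handled by its own worker); on real hardware, each test unit would run concurrently next to its bank.

```python
from concurrent.futures import ThreadPoolExecutor

PATTERN = 0x5A  # assumed test value


def check_bank(bank):
    """Write the pattern to every entry of one bank, read it back, return faulty addresses."""
    faults = []
    for address in range(len(bank["cells"])):
        bank["cells"][address] = PATTERN
        # Entries in bank["bad"] model defective cells that read back 0.
        observed = 0 if address in bank["bad"] else bank["cells"][address]
        if observed != PATTERN:
            faults.append(address)
    return faults


banks = [
    {"cells": [0] * 16, "bad": {1}},
    {"cells": [0] * 16, "bad": set()},
    {"cells": [0] * 16, "bad": {4, 9}},
]

# Serial: one bank after another.
serial_results = [check_bank(bank) for bank in banks]

# Parallel: one worker per bank, standing in for one test unit per bank.
with ThreadPoolExecutor(max_workers=len(banks)) as pool:
    parallel_results = list(pool.map(check_bank, banks))

print(serial_results == parallel_results)
```

Both orders produce the same per-bank fault lists; the parallel arrangement only shortens the elapsed test time, because no bank waits for another.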
在一些實施例中,提示測試結果之資訊可包括故障記憶體陣列條目之識別符。 In some embodiments, the information prompting the test result may include the identifier of the faulty memory array entry.
在一些實施例中,該介面可經組態以在記憶體陣列之測試期間多次擷取由複數個測試電路獲得之部分測試結果。 In some embodiments, the interface can be configured to capture partial test results obtained by a plurality of test circuits multiple times during the test of the memory array.
在一些實施例中,該積體電路可包括錯誤校正單元,該錯誤校正單元經組態以校正在記憶體陣列之測試期間偵測到的至少一個錯誤。舉例而言,該錯誤校正單元可經組態以使用任何適當技術來修復記憶體誤差,例如藉由使一些記憶體字去能及用冗餘字替換該等字。 In some embodiments, the integrated circuit may include an error correction unit configured to correct at least one error detected during the test of the memory array. For example, the error correction unit may be configured to repair memory errors using any suitable technique, for example by disabling some memory words and replacing them with redundant words.
在上文所描述之實施例中之任一者中,該積體電路可為記憶體晶片。 In any of the embodiments described above, the integrated circuit may be a memory chip.
舉例而言,該積體電路可包括分散式處理器,其中處理陣列可包括分散式處理器之複數個子單元,如圖7A中所描繪。 For example, the integrated circuit may include a distributed processor, where the processing array may include a plurality of subunits of the distributed processor, as depicted in FIG. 7A.
在此等實施例中,該等處理器子單元中之每一者可與多個記憶體組中之對應的專用記憶體組相關聯。 In these embodiments, each of the processor sub-units may be associated with a corresponding dedicated memory group among multiple memory groups.
在上文所描述之實施例中之任一者中,提示測試結果之資訊可提示至少一個記憶體組之狀態。可按一或多個粒度來提供記憶體組之狀態:每一記憶體字,每一條目群組,或每一完整記憶體組。 In any of the above-described embodiments, the information prompting the test result can prompt the state of at least one memory bank. The state of the memory group can be provided in one or more granularities: each memory word, each entry group, or each complete memory group.
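The three granularities mentioned above (per memory word, per group of entries, per complete bank) can be derived from a single list of faulty words. The sketch below is only an illustration; the `summarize_bank_status` function, the group size of 8 words, and the example addresses are assumptions.

```python
def summarize_bank_status(faulty_words, words_per_group):
    """Report the state of one bank per word, per group of entries, and for the whole bank."""
    return {
        "faulty_words": sorted(faulty_words),
        "faulty_groups": sorted({word // words_per_group for word in faulty_words}),
        "bank_ok": not faulty_words,
    }


# Assumed example: words 3, 17, and 18 failed in a bank organized in groups of 8 words.
status = summarize_bank_status({3, 17, 18}, words_per_group=8)
print(status)
```

Coarser granularities carry less information but are also smaller to output through the interface.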
圖65至圖66說明測試器測試階段中之四個步驟。 Figure 65 to Figure 66 illustrate the four steps in the test phase of the tester.
在第一步驟中,測試器寫入(5221)測試序列,且組之邏輯單元將資料寫入至其記憶體。該邏輯亦可能足夠複雜以自測試器接收命令且其自身產生序列(如下文所解釋)。 In the first step, the tester writes (5221) the test sequence, and the logical unit of the group writes data to its memory. The logic may also be complex enough to receive commands from the tester and generate the sequence itself (as explained below).
在第二步驟中,測試器將預期結果寫入(5223)至受測記憶體,且邏輯單元將預期結果與自其記憶體組讀取之資料進行比較,以保存錯誤清單。若邏輯足夠複雜以自身產生預期結果之序列(如下文所解釋),則可簡化預期結果之寫入。 In the second step, the tester writes (5223) the expected result to the memory under test, and the logic unit compares the expected result with the data read from its memory bank to save the error list. If the logic is complex enough to produce a sequence of expected results by itself (as explained below), the writing of expected results can be simplified.
在第三步驟中,測試器自邏輯單元讀取(5224)故障位址。 In the third step, the tester reads (5224) the fault address from the logic unit.
在第四步驟中,測試器對結果採取動作(5225)且可修復錯誤。舉例而言,測試器可連接至特定介面以程式化記憶體中之熔斷器,但亦可使用允許程式化記憶體內之錯誤校正機構的任何其他機構。 In the fourth step, the tester takes action on the result (5225) and the error can be repaired. For example, the tester can be connected to a specific interface to program the fuses in the memory, but any other mechanism that allows programming of the error correction mechanism in the memory can also be used.
在此等實施例中,記憶體測試器可使用向量以測試記憶體。 In these embodiments, the memory tester can use vectors to test the memory.
舉例而言,每一向量可自輸入系列及輸出系列建置。 For example, each vector can be built from the input series and the output series.
輸入系列可包括成對的位址與寫入至記憶體之資料(在許多實施例中,此系列可模型化為允許程式在需要時產生的公式,該程式諸如由邏輯單元執行之程式)。 The input series can include a pair of addresses and data written to the memory (in many embodiments, this series can be modeled as a formula that allows a program to be generated when needed, such as a program executed by a logic unit).
在一些實施例中,測試圖案產生器可產生此類向量。 In some embodiments, the test pattern generator can generate such vectors.
應注意,向量為實例資料結構,但一些實施例可使用其他資料結構。該等資料結構可與由位於積體電路外部之測試器產生的其他測試資料結構相容。 It should be noted that the vector is an example data structure, but some embodiments may use other data structures. These data structures are compatible with other test data structures generated by testers located outside the integrated circuit.
該輸出系列可包括位址與資料對,其包含待自記憶體讀回之預期資料(在一些實施例中,該系列可另外或替代地由程式在執行階段產生,例如藉由邏輯單元)。 The output series may include address and data pairs, which include expected data to be read back from memory (in some embodiments, the series may be generated in addition or alternatively by the program at the execution stage, such as by a logic unit).
記憶體測試通常包括執行向量清單,每一向量根據輸入系列將資料寫入至記憶體,且接著根據輸出系列讀回資料並將該資料與其預期資料進行比較。 Memory testing usually involves executing a list of vectors, each vector writes data to memory based on the input series, and then reads back the data based on the output series and compares the data with its expected data.
在失配之狀況下,記憶體可分類為發生故障的,或若記憶體包括用於冗餘之機構,則可啟動冗餘機構使得再次在經啟動冗餘機構上測試向量。 In the case of a mismatch, the memory can be classified as malfunctioning, or if the memory includes a mechanism for redundancy, the redundant mechanism can be activated so that the vector is tested on the activated redundant mechanism again.
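A vector built from an input series and an output series, with both series generated from a formula, might look like the following. The `Vector` dataclass, the `make_vector` and `run_vectors` helpers, the `StuckMemory` model, and the two formulas are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Vector:
    input_series: list   # (address, data) pairs to write
    output_series: list  # (address, expected data) pairs to read back


def make_vector(size, formula):
    """Generate both series from a formula instead of storing them explicitly."""
    series = [(address, formula(address)) for address in range(size)]
    return Vector(input_series=series, output_series=list(series))


class StuckMemory:
    """Hypothetical memory in which the listed entries always read back 0."""

    def __init__(self, size, stuck=()):
        self.cells = [0] * size
        self.stuck = set(stuck)

    def __setitem__(self, address, value):
        self.cells[address] = value

    def __getitem__(self, address):
        return 0 if address in self.stuck else self.cells[address]


def run_vectors(memory, vectors):
    """Execute the vector list and return (vector index, address) for each mismatch."""
    mismatches = []
    for index, vector in enumerate(vectors):
        for address, data in vector.input_series:
            memory[address] = data
        for address, expected in vector.output_series:
            if memory[address] != expected:
                mismatches.append((index, address))
    return mismatches


vectors = [make_vector(8, lambda a: a ^ 0x55), make_vector(8, lambda a: 0xFF - a)]
mismatches = run_vectors(StuckMemory(size=8, stuck={5}), vectors)
verdict = "faulty" if mismatches else "pass"
print(mismatches, verdict)
```

A memory with redundancy would, instead of being classified as faulty at this point, activate the redundant entry and run the same vectors again.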
在記憶體包括處理器子單元(如上文關於圖7A所描述)或含有許多記憶體控制器之實施例中,整個測試可由組之邏輯單元操縱。因此,記憶體控制器或處理器子單元可執行測試。 In embodiments where the memory includes a processor sub-unit (as described above with respect to FIG. 7A) or contains many memory controllers, the entire test can be handled by the group of logic units. Therefore, the memory controller or processor sub-unit can perform the test.
該記憶體控制器可自測試器程式化,且測試結果可保存於控制器本身中以稍後由測試器讀取。 The memory controller can be programmed from the tester, and the test results can be saved in the controller itself for later read by the tester.
為了組態及測試邏輯單元之操作,測試器可組態邏輯單元以用於記憶體存取且確認結果可藉由記憶體存取來讀取。 In order to configure and test the operation of the logic unit, the tester can configure the logic unit for memory access and confirm that the result can be read by the memory access.
舉例而言,輸入向量可含有用於邏輯單元之程式化序列,且輸出向量可含有此測試之預期結果。舉例而言,若諸如處理器子單元之邏輯單元包含經組態以對記憶體中之兩個位址執行運算的乘法器或加法器,則輸入向量可包括將資料寫入至記憶體之一組命令以及至加法器/乘法器邏輯之一組命令。只要可將加法器/乘法器結果讀回至輸出向量,便可將結果發送至測試器。 For example, the input vector may contain a programming sequence for the logic unit, and the output vector may contain the expected results of this test. For example, if a logic unit such as a processor subunit includes a multiplier or an adder configured to perform a computation on two addresses in the memory, the input vector may include a set of commands that write data to the memory and a set of commands to the adder/multiplier logic. As long as the adder/multiplier results can be read back into the output vector, the results can be sent to the tester.
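The adder/multiplier example can be sketched with a plain list standing in for the memory. The command format `(operation, source A, source B, destination)`, the function names, and the operand values are assumptions for this illustration; the source does not define a command encoding.

```python
def apply_input_vector(memory, writes, commands):
    """Write operand data to memory, then issue commands to the adder/multiplier logic."""
    for address, value in writes:
        memory[address] = value
    for operation, src_a, src_b, dst in commands:
        if operation == "add":
            memory[dst] = memory[src_a] + memory[src_b]
        elif operation == "mul":
            memory[dst] = memory[src_a] * memory[src_b]
        else:
            raise ValueError(f"unknown operation: {operation}")


def check_output_vector(memory, expected):
    """Read results back through ordinary memory reads and list mismatching addresses."""
    return [address for address, value in expected if memory[address] != value]


memory = [0] * 8
apply_input_vector(
    memory,
    writes=[(0, 6), (1, 7)],
    commands=[("add", 0, 1, 2), ("mul", 0, 1, 3)],
)
logic_faults = check_output_vector(memory, expected=[(2, 13), (3, 42)])
print(logic_faults)
```

Because the results land in memory, the tester needs nothing beyond its usual read path to verify the logic.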
該測試可進一步包括自記憶體載入邏輯組態及將邏輯輸出發送至記憶體。 The test may further include loading logic configuration from memory and sending logic output to memory.
在邏輯單元自記憶體載入其組態(例如,若該邏輯為記憶體控制器)之實施例中,該邏輯單元可運行來自記憶體本身之程式碼。 In embodiments where the logic unit loads its configuration from memory (for example, if the logic is a memory controller), the logic unit can run code from the memory itself.
因此,該輸入向量可包括用於邏輯單元之程式,且該程式本身可測試邏輯單元中之各種電路。 Therefore, the input vector can include a program for the logic unit, and the program itself can test various circuits in the logic unit.
因此,測試可能不限於接收呈由外部測試器使用之格式的向量。 Therefore, testing may not be limited to receiving vectors in a format used by an external tester.
若載入至邏輯單元之命令發指令給邏輯單元以將結果寫回至記憶體組中,則測試器可讀取彼等結果且將該等結果與預期輸出系列進行比較。 If the command loaded into the logic unit is issued to the logic unit to write the results back to the memory bank, the tester can read their results and compare the results with the expected output series.
舉例而言,寫入至記憶體之向量可為或可包括用於邏輯單元之測試程式(例如,測試可假定記憶體有效,但即使記憶體無效,寫入之測試程式仍將不工作且測試將未通過,此為可接受之結果,此係因為晶片無論如何為無效的)及/或邏輯單元如何運行程式碼及將結果寫回至記憶體。由於邏輯單元之所有測試可經由記憶體進行(例如,將邏輯測試輸入寫入至記憶體及將測試結果寫回至該記憶體),因此測試器可運用輸入序列及預期輸出序列來運行簡單的向量測試。 For example, the vector written to the memory may be or may include a test program for the logic unit (e.g., the test may assume that the memory is valid; even if the memory is not valid, the written test program will not work and the test will fail, which is an acceptable result because the chip is invalid anyway) and/or how the logic unit runs the code and writes the results back to the memory. Because all testing of the logic unit may be done through the memory (e.g., writing logic test inputs to the memory and writing test results back to the memory), the tester can run a simple vector test with an input sequence and an expected output sequence.
邏輯組態及結果可作為讀取及/或寫入命令來存取。 The logical configuration and results can be accessed as read and/or write commands.
圖68說明發送寫入測試序列5221之測試器5200,該寫入測試序列為向量。
Figure 68 illustrates a tester 5200 sending a write test sequence 5221, the write test sequence being a vector.
向量之部分包括在耦接至處理陣列之邏輯5215的記憶體組5212之間分裂的測試程式碼5232。
Parts of the vector include test code 5232 that is split between the memory banks 5212 coupled to the logic 5215 of the processing array.
每一邏輯5215可執行儲存於其相關聯記憶體組中之程式碼5232,且該執行可包括存取一或多個記憶體組,執行計算及將結果(例如,測試結果5231)儲存於記憶體組5212中。
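Each logic 5215 may execute the code 5232 stored in its associated memory bank, and the execution may include accessing one or more memory banks, performing computations, and storing results (e.g., test results 5231) in the memory banks 5212.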
測試結果可由測試器5200發送回(例如,讀回結果5222)。 The test result may be sent back by the tester 5200 (eg, read back the result 5222).
此可允許邏輯5215受由I/O控制器5214接收之命令控制。
This may allow the logic 5215 to be controlled by commands received by the I/O controller 5214.
在圖68中,I/O控制器5214連接至記憶體組及邏輯。在其他實施例中,邏輯可連接於I/O控制器5214與記憶體組之間。
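In Figure 68, the I/O controller 5214 is connected to the memory banks and to the logic. In other embodiments, the logic may be connected between the I/O controller 5214 and the memory banks.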
圖70說明用於測試記憶體組之方法5300。舉例而言,可使用上文關於圖65至圖69所描述之記憶體組中之任一者來實施方法5300。
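Figure 70 illustrates a method 5300 for testing memory banks. For example, method 5300 may be implemented using any of the memory banks described above with respect to Figures 65 to 69.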
方法5300可包括步驟5302、5310及5320。步驟5302可包括接收測試積體電路之記憶體組的請求。該積體電路可包括:基板、安置於基板上且包含記憶體組之記憶體陣列、安置於基板上之處理陣列,及安置於基板上之介面。該處理陣列可包括複數個測試單元,如上文所描述。
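Method 5300 may include steps 5302, 5310, and 5320. Step 5302 may include receiving a request to test memory banks of an integrated circuit. The integrated circuit may include a substrate, a memory array disposed on the substrate and including the memory banks, a processing array disposed on the substrate, and an interface disposed on the substrate. The processing array may include a plurality of test units, as described above.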
在一些實施例中,該請求可包括組態資訊、一或多個向量、命令,及其類似者。 In some embodiments, the request may include configuration information, one or more vectors, commands, and the like.
在此等實施例中,該組態資訊可包括記憶體陣列之測試的預期結果、指令、資料、待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之值、測試圖案,及其類似者。 In these embodiments, the configuration information may include the expected results, commands, data of the memory array test, the value of the output data read from the memory array entry to be accessed during the test of the memory array, and the test Patterns, and similar ones.
該測試圖案可包括以下各者中之至少一者:(i)待在記憶體陣列之測試期間存取的記憶體陣列條目,(ii)待寫入至在記憶體陣列之測試期間存取之記憶體陣列條目的輸入資料,或(iii)待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之預期值。 The test pattern may include at least one of the following: (i) memory array entries to be accessed during the test of the memory array, (ii) to be written to the memory array entries accessed during the test of the memory array The input data of the memory array entry, or (iii) the expected value of the output data to be read from the memory array entry accessed during the test of the memory array.
步驟5302可包括以下各者中之至少一者及/或其後可接著以下各者中之至少一者: Step 5302 may include at least one of the following and/or may be followed by at least one of the following:
a.藉由至少一個測試圖案產生器自介面接收用於產生至少一個測試圖案之指令; a. At least one test pattern generator receives an instruction for generating at least one test pattern from the interface;
b.藉由該介面及自在積體電路外部之外部單元接收組態資訊,該組態資訊包括用於產生至少一個測試圖案之指令; b. Receiving, by the interface and from an external unit outside the integrated circuit, configuration information that includes instructions for generating at least one test pattern;
c.藉由至少一個測試圖案產生器自記憶體陣列讀取組態資訊,該組態資訊包括用於產生至少一個測試圖案之指令; c. Reading, by at least one test pattern generator, configuration information from the memory array, the configuration information including instructions for generating at least one test pattern;
d.藉由該介面及自在積體電路外部之外部單元接收組態資訊,該組態資訊包含為至少一個測試圖案之指令; d. Receiving, by the interface and from an external unit outside the integrated circuit, configuration information that includes instructions that are the at least one test pattern;
e.藉由複數個測試單元及自記憶體陣列擷取一旦由該等複數個測試單元執行便使該等複數個測試單元測試記憶體陣列之測試指令;及 e. Retrieving, by the plurality of test units and from the memory array, test instructions that, once executed by the plurality of test units, cause the plurality of test units to test the memory array; and
f.藉由複數個測試單元及自該記憶體陣列接收一旦由該等複數個測試單元執行便使該等複數個測試單元測試記憶體陣列且測試處理陣列之測試指令。 f. Receiving, by the plurality of test units and from the memory array, test instructions that, once executed by the plurality of test units, cause the plurality of test units to test the memory array and to test the processing array.
步驟5302之後可接著步驟5310。步驟5310可包括藉由複數個測試單元且回應於請求而測試多個記憶體組以提供測試結果。
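Step 5302 may be followed by step 5310. Step 5310 may include testing, by the plurality of test units and in response to the request, the plurality of memory banks to provide test results.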
方法5300可進一步包括藉由該介面在記憶體陣列之測試期間複數次接收由複數個測試電路獲得之部分測試結果。
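Method 5300 may further include receiving, by the interface and multiple times during the test of the memory array, partial test results obtained by the plurality of test circuits.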
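步驟5310可包括以下各者中之至少一者及/或其後可接著以下各者中之至少一者: Step 5310 may include at least one of the following and/or may be followed by at least one of the following: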
a.藉由一或多個測試圖案產生器(例如,包括於複數個測試單元中之一者、一些或全部中)產生測試圖案以供一或多個測試單元用於測試多個記憶體組中之至少一者; a. Generate test patterns by one or more test pattern generators (for example, included in one, some or all of a plurality of test units) for one or more test units to test multiple memory groups At least one of;
b.藉由該等複數個測試單元中之至少兩個並列地測試多個記憶體組中之至少兩個; b. Testing at least two of the plurality of memory banks in parallel by at least two of the plurality of test units;
c.藉由該等複數個測試單元中之至少兩個串列地測試多個記憶體組中之至少兩個; c. Test at least two of the plurality of memory groups in series by at least two of the plurality of test units;
d.將值寫入至記憶體條目,讀取記憶體條目及比較結果;及 d. Write the value to the memory entry, read the memory entry and compare the result; and
e.藉由錯誤校正單元校正在記憶體陣列之測試期間偵測到的至少一個錯誤。 e. Correct at least one error detected during the test of the memory array by the error correction unit.
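步驟5310之後可接著步驟5320。步驟5320可包括藉由介面及在積體電路外部輸出提示測試結果之資訊。 Step 5310 may be followed by step 5320. Step 5320 may include outputting, by the interface and to outside the integrated circuit, information indicative of the test results.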
提示測試結果之資訊可包括故障記憶體陣列條目之識別符。此可藉由不發送關於每一記憶體條目之讀取資料來節省時間。 The information prompting the test result may include the identifier of the faulty memory array entry. This can save time by not sending read data about each memory entry.
另外或替代地,提示測試結果之資訊可提示至少一個記憶體組之狀態。 Additionally or alternatively, the information prompting the test result may prompt the state of at least one memory bank.
因此,在一些實施例中,提示測試結果之資訊可比在測試期間寫入至記憶體組或自記憶體組讀取之資料單元的總大小小得多,且可比在無測試單元輔助之情況下可自測試記憶體之測試器發送的輸入資料小得多。 Therefore, in some embodiments, the information that prompts the test result can be much smaller than the total size of the data unit written to or read from the memory set during the test, and is comparable to that without the aid of the test unit The input data sent by the tester that can test the memory is much smaller.
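A rough calculation shows why a fault report can be far smaller than the data moved during the test. All figures below (bank size, word width, address width, number of faults) are assumed purely for illustration and are not taken from the source.

```python
# Assumed figures for one memory bank and one test pass.
WORDS_PER_BANK = 1 << 20   # 1,048,576 words
BYTES_PER_WORD = 4
BYTES_PER_ADDRESS = 4
FAULTY_ENTRIES = 3

# Tester-driven test: every word is written once and read back once over the interface.
external_test_bytes = 2 * WORDS_PER_BANK * BYTES_PER_WORD

# On-chip test: only the identifiers of the faulty entries cross the interface.
fault_report_bytes = FAULTY_ENTRIES * BYTES_PER_ADDRESS

reduction = external_test_bytes // fault_report_bytes
print(external_test_bytes, fault_report_bytes, reduction)
```

Under these assumptions the interface carries 12 bytes instead of 8 MiB per pass; the exact ratio depends on the fault count and on how much configuration information the tester must send to start the test.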
受測試積體電路可包含如先前諸圖中之任一者中所說明的記憶體晶片及/或分散式處理器。舉例而言,本文中所描述之積體電路可包括於如圖3A、圖3B、圖4至圖6、圖7A至圖7D、圖11至圖13、圖16至圖19、圖22或圖23中之任一者中所說明的記憶體晶片中,可包括該記憶體晶片,或以其他方式包含該記憶體晶片。 The integrated circuit under test may include a memory chip and/or a distributed processor as illustrated in any of the previous figures. For example, the integrated circuit described herein may be included in, may include, or may otherwise comprise a memory chip as illustrated in any one of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, or 23.
圖71說明用於測試積體電路之記憶體組的方法5350之實例。舉例而言,可使用上文關於圖65至圖69所描述之記憶體組中之任一者來實施方法5350。
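Figure 71 illustrates an example of a method 5350 for testing memory banks of an integrated circuit. For example, method 5350 may be implemented using any of the memory banks described above with respect to Figures 65 to 69.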
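方法5350可包括步驟5352、5355及5358。步驟5352可包括藉由積體電路之介面接收包含指令之組態資訊。包括介面之積體電路亦可包括基板、包含記憶體組且安置於基板上之記憶體陣列、安置於基板上之處理陣列,及安置於基板上之介面。
Method 5350 may include steps 5352, 5355, and 5358. Step 5352 may include receiving, by an interface of an integrated circuit, configuration information that includes instructions. The integrated circuit that includes the interface may also include a substrate, a memory array that includes memory banks and is disposed on the substrate, and a processing array disposed on the substrate, the interface likewise being disposed on the substrate.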
該組態資訊可包括記憶體陣列之測試的預期結果、指令、資料、待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之值、測試圖案,及其類似者。 The configuration information may include expected results, commands, data of the memory array test, the value of the output data read from the memory array entry to be accessed during the test of the memory array, test patterns, and the like.
另外或替代地,該組態資訊可包括指令、用以寫入該等指令之記憶體條目的位址、輸入資料,且亦可包括用以接收在指令執行期間計算之輸出值的記憶體條目之位址。 Additionally or alternatively, the configuration information may include commands, addresses of memory entries used to write these commands, input data, and may also include memory entries used to receive output values calculated during command execution The address.
該測試圖案可包括以下各者中之至少一者:(i)待在記憶體陣列之測試期間存取的記憶體陣列條目,(ii)待寫入至在記憶體陣列之測試期間存取之記憶體陣列條目的輸入資料,或(iii)待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之預期值。 The test pattern may include at least one of the following: (i) memory array entries to be accessed during the test of the memory array, (ii) to be written to the memory array entries accessed during the test of the memory array The input data of the memory array entry, or (iii) the expected value of the output data to be read from the memory array entry accessed during the test of the memory array.
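步驟5352之後可接著步驟5355。步驟5355可包括藉由處理陣列執行指令,該執行藉由存取記憶體陣列,執行運算操作及提供結果來進行。 Step 5352 may be followed by step 5355. Step 5355 may include executing the instructions by the processing array, by accessing the memory array, performing computational operations, and providing results.
步驟5355之後可接著步驟5358。步驟5358可包括藉由介面及在積體電路外部輸出提示結果之資訊。 Step 5355 may be followed by step 5358. Step 5358 may include outputting, by the interface and to outside the integrated circuit, information indicative of the results.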
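Steps 5352, 5355, and 5358 can be sketched end to end. The layout of the configuration information used here (a `code_base` address, an instruction list, input data, and output addresses) and the tuple instruction format are assumptions for this example; the text only says that such information may include instructions, the addresses at which to write them, input data, and the addresses that receive output values.

```python
def receive_configuration(memory, config):
    """Step 5352: store the received instructions and input data in the memory array."""
    for offset, instruction in enumerate(config["instructions"]):
        memory[config["code_base"] + offset] = instruction
    for address, value in config["input_data"].items():
        memory[address] = value


def execute_instructions(memory, config):
    """Step 5355: fetch each instruction from memory, compute, and store the result."""
    for offset in range(len(config["instructions"])):
        operation, src_a, src_b, dst = memory[config["code_base"] + offset]
        if operation == "add":
            memory[dst] = memory[src_a] + memory[src_b]
        elif operation == "mul":
            memory[dst] = memory[src_a] * memory[src_b]
        else:
            raise ValueError(f"unknown operation: {operation}")


def output_results(memory, config):
    """Step 5358: output only the entries designated to receive output values."""
    return {address: memory[address] for address in config["output_addresses"]}


config = {
    "code_base": 8,
    "instructions": [("add", 0, 1, 2), ("mul", 2, 1, 3)],
    "input_data": {0: 2, 1: 5},
    "output_addresses": [2, 3],
}
memory = [0] * 16
receive_configuration(memory, config)
execute_instructions(memory, config)
results = output_results(memory, config)
print(results)
```

The second instruction consumes the result of the first, so a fault in either the memory entries or the arithmetic would show up in the output values.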
網路(cyber)安全性及篡改偵測技術 Cyber security and tamper detection technology
記憶體晶片及/或處理器可為惡意行動者之目標,且可能會受到各種類型之網路攻擊。在一些狀況下,此類攻擊可能嘗試改變儲存於一或多個記憶體資源中之資料及/或程式碼。相對於經訓練神經網路或取決於儲存於記憶體中之大量資料的其他類型之人工智慧(AI)模型,網路攻擊可能尤其成問題。若所儲存資料被操縱或甚至遮蔽,則此操縱可為有害的。舉例而言,若資料密集型AI模型所依賴之資料被破壞或遮蔽,則依賴於該等模型以識別其他車輛或行人等之自主車輛系統可能會不正確地評估主機車輛之環境。結果,可能會發生事故。隨著AI模型在廣泛技術中變得愈來愈普遍,針對與此類模型相關聯之資料的網路攻擊可能造成重大破壞。 Memory chips and/or processors may be targeted by malicious actors and may be subject to various types of cyber attacks. In some cases, such attacks may attempt to alter data and/or code stored in one or more memory resources. Cyber attacks may be especially problematic with respect to trained neural networks or other types of artificial intelligence (AI) models that depend on large amounts of data stored in memory. If the stored data is manipulated or even obscured, the manipulation can be harmful. For example, if the data on which data-intensive AI models rely is corrupted or obscured, an autonomous vehicle system that relies on those models to identify other vehicles, pedestrians, and the like may incorrectly assess the environment of the host vehicle. As a result, accidents may occur. As AI models become more prevalent across a wide range of technologies, cyber attacks against the data associated with such models may cause significant damage.
在其他狀況下,網路攻擊可包括一或多個行動者篡改或嘗試篡改與處理器或其他類型之基於積體電路之邏輯元件相關聯的操作參數。舉例而言,處理器通常經設計以在某些操作規格內操作。涉及篡改之網路攻擊可試圖改變處理器、記憶體單元或其他電路之操作參數中之一或多者,使得處理器、記憶體單元或其他電路超出其設計操作規格(例如,時脈速度、頻寬規格、溫度限制、操作速率等)。此篡改可導致目標硬體發生故障。 In other situations, cyber attacks may include one or more actors tampering or attempting to tamper with operating parameters associated with the processor or other types of integrated circuit-based logic components. For example, processors are often designed to operate within certain operating specifications. Cyber attacks involving tampering can attempt to change one or more of the operating parameters of the processor, memory unit, or other circuit, causing the processor, memory unit, or other circuit to exceed its design operating specifications (for example, clock speed, Bandwidth specifications, temperature limits, operating speeds, etc.). This tampering can cause the target hardware to malfunction.
用於防禦網路攻擊之習知技術可包括在處理器層級操作之電腦程式(例如,防病毒軟體或防惡意軟體的軟體)。其他技術可包括使用與路由器或其他硬體相關聯的基於軟體之防火牆。雖然此等技術可使用在記憶體單元外部執行之軟體程式來對抗網路攻擊,但仍需要用於高效地保護儲存於記憶體單元中之資料的額外或替代技術,尤其在彼資料之準確性及可用性對諸如神經網路等之記憶體密集型應用之操作至關重要的情況下。本發明之實施例可提供包含記憶體之抵抗對記憶體之網路攻擊的各種積體電路設計。 Conventional technologies used to defend against cyber attacks may include computer programs that operate at the processor level (for example, anti-virus software or anti-malware software). Other technologies may include the use of software-based firewalls associated with routers or other hardware. Although these technologies can use software programs running outside the memory unit to fight against cyber attacks, they still need additional or alternative technologies to efficiently protect the data stored in the memory unit, especially in the accuracy of the data. And availability is critical to the operation of memory-intensive applications such as neural networks. The embodiments of the present invention can provide various integrated circuit designs including memory to resist cyber attacks on the memory.
以安全方式將敏感資訊及命令擷取至積體電路(例如,在至晶片/積體電路外部之介面尚未起作用時的開機處理程序期間)及接著維護積體電路內之敏感資訊及命令而不將其曝露於積體電路外部,此可增加敏感資訊及命令之安全性。CPU及其他類型之處理單元易受網路攻擊,尤其在彼等CPU/處理單元與外部記憶體一起操作時。包括安置於記憶體陣列當中之記憶體晶片上之分散式處理器子單元的所揭示實施例可能不易受到網路攻擊及篡改(例如,此係因為處理在記憶體晶片內發生),該記憶體陣列包括複數個記憶體組。包括在下文更詳細地論述之所揭示安全措施的任何組合可進一步降低所揭示實施例對網路攻擊及/或篡改之易感性。 Retrieving sensitive information and commands into an integrated circuit in a secure manner (e.g., during a boot process, when interfaces to the outside of the chip/integrated circuit are not yet functional) and then maintaining the sensitive information and commands within the integrated circuit without exposing them outside the integrated circuit may increase the security of the sensitive information and commands. CPUs and other types of processing units are vulnerable to cyber attacks, especially when those CPUs/processing units operate with external memory. The disclosed embodiments, which include distributed processor subunits disposed on a memory chip among a memory array that includes a plurality of memory banks, may be less susceptible to cyber attacks and tampering (e.g., because the processing occurs within the memory chip). Including any combination of the disclosed security measures, discussed in more detail below, may further reduce the susceptibility of the disclosed embodiments to cyber attacks and/or tampering.
圖72A為符合本發明之實施例的包括記憶體陣列及處理陣列之積體電路7200的圖解表示。舉例而言,積體電路7200可包括在以上章節中且貫穿本發明描述之記憶體晶片上分散式處理器架構(及特徵)中之任一者。記憶體陣列及處理陣列可形成於共同基板上,且在某些所揭示實施例中,積體電路7200可構成記憶體晶片。舉例而言,如上文所論述,積體電路7200可包括記憶體晶片,該記憶體晶片包括複數個記憶體組及在空間上分佈於記憶體晶片上之複數個處理器子單元,其中複數個記憶體組中之每一者與複數個處理器子單元中之專用的一或多者相關聯。在一些狀況下,每一處理器子單元可專用於一或多個記憶體組。
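FIG. 72A is a diagrammatic representation of an integrated circuit 7200 including a memory array and a processing array, consistent with embodiments of the present invention. For example, integrated circuit 7200 may include any of the distributed-processor-on-a-memory-chip architectures (and features) described in the sections above and throughout the present disclosure. The memory array and the processing array may be formed on a common substrate, and in certain disclosed embodiments, integrated circuit 7200 may constitute a memory chip. For example, as discussed above, integrated circuit 7200 may include a memory chip that includes a plurality of memory banks and a plurality of processor subunits spatially distributed on the memory chip, where each of the plurality of memory banks is associated with a dedicated one or more of the plurality of processor subunits. In some cases, each processor subunit may be dedicated to one or more memory banks.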
在一些實施例中,記憶體陣列可包括複數個離散記憶體組7210_1、7210_2……7210_J1、7210_Jn,如圖72A中所展示。根據本發明之實施例,記憶體陣列7210可包含一或多種類型之記憶體,包括例如揮發性記憶體(諸如,RAM、DRAM、SRAM、相變RAM(PRAM)、磁阻式RAM(MRAM)、電阻式RAM(ReRAM)或其類似者)或非揮發性記憶體(諸如,快閃記憶體或ROM)。根據本發明之一些實施例,記憶體組7210_1至7210_Jn可包括複數個MOS記憶體結構。
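In some embodiments, the memory array may include a plurality of discrete memory banks 7210_1, 7210_2, ..., 7210_J1, 7210_Jn, as shown in FIG. 72A. According to embodiments of the present invention, the memory array 7210 may include one or more types of memory, including, for example, volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments of the present invention, the memory banks 7210_1 to 7210_Jn may include a plurality of MOS memory structures.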
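如上文所提及,處理陣列可包括複數個處理器子單元7220_1至7220_K。在一些實施例中,處理器子單元7220_1至7220_K中之每一者可與複數個離散記憶體組7210_1至7210_Jn當中之一或多個離散記憶體組相關聯。雖然圖72A之實例實施例說明每一處理器子單元與兩個離散記憶體組7210相關聯,但應瞭解,每一處理器子單元可與任何數目個離散的專用記憶體組相關聯。且反之亦然,每一記憶體組可與任何數目個處理器子單元相關聯。根據本發明之實施例,包括於積體電路7200之記憶體陣列中的離散記憶體組之數目可等於、小於或大於包括於積體電路7200之處理陣列中的處理器子單元之數目。
As mentioned above, the processing array may include a plurality of processor subunits 7220_1 to 7220_K. In some embodiments, each of the processor subunits 7220_1 to 7220_K may be associated with one or more discrete memory banks among the plurality of discrete memory banks 7210_1 to 7210_Jn. Although the example embodiment of FIG. 72A illustrates each processor subunit as associated with two discrete memory banks 7210, it should be understood that each processor subunit may be associated with any number of discrete dedicated memory banks. Conversely, each memory bank may be associated with any number of processor subunits. According to embodiments of the present invention, the number of discrete memory banks included in the memory array of integrated circuit 7200 may be equal to, less than, or greater than the number of processor subunits included in the processing array of integrated circuit 7200.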
積體電路7200可進一步包括符合本發明之實施例(且如描述於以上章節中)的複數個第一匯流排7260。每一匯流排7260可將處理器子單元7220_k連接至對應的專用記憶體組7210_j。根據本發明之一些實施例,積體電路7200可進一步包括複數個第二匯流排7261。每一匯流排7261可將處理器子單元7220_k連接至另一處理器子單元7220_k+1。如圖72A中所展示,複數個處理器子單元7220_1至7220_K可經由匯流排7261連接至彼此。雖然圖72A將形成迴路之複數個處理器子單元7220_1至7220_K說明為其經由匯流排7261串聯連接,但應瞭解,處理器單元7220可用任何其他方式連接。舉例而言,在一些狀況下,特定處理器子單元可能不經由匯流排7261連接至其他處理器子單元。在其他狀況下,特定處理器子單元可僅連接至一個其他處理器子單元,且在另外其他狀況下,特定處理器子單元可經由一或多個匯流排7261連接至兩個或多於兩個其他處理器子單元(例如,形成串聯連接、並聯連接、分支連接等)。應注意,本文中所描述之積體電路7200的實施例僅為例示性的。在一些狀況下,積體電路7200可具有不同的內部組件及連接,且在其他狀況下,可省略內部組件及所描述連接中之一或多者(例如,取決於特定應用之需要)。
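The integrated circuit 7200 may further include a plurality of first buses 7260 consistent with embodiments of the present invention (and as described in the sections above). Each bus 7260 may connect a processor subunit 7220_k to a corresponding dedicated memory bank 7210_j. According to some embodiments of the present invention, integrated circuit 7200 may further include a plurality of second buses 7261. Each bus 7261 may connect a processor subunit 7220_k to another processor subunit 7220_k+1. As shown in FIG. 72A, the plurality of processor subunits 7220_1 to 7220_K may be connected to one another via buses 7261. Although FIG. 72A illustrates the plurality of processor subunits 7220_1 to 7220_K as forming a loop in which they are connected in series via buses 7261, it should be understood that the processor units 7220 may be connected in any other manner. For example, in some cases, a particular processor subunit may not be connected to other processor subunits via a bus 7261. In other cases, a particular processor subunit may be connected to only one other processor subunit, and in still other cases, a particular processor subunit may be connected to two or more other processor subunits via one or more buses 7261 (e.g., forming series connections, parallel connections, branched connections, etc.). It should be noted that the embodiments of integrated circuit 7200 described herein are merely exemplary. In some cases, integrated circuit 7200 may have different internal components and connections, and in other cases, one or more of the internal components and described connections may be omitted (e.g., depending on the needs of a particular application).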
返回參看圖72A,積體電路7200可包括用於相對於積體電路7200實施至少一個安全措施的一或多個結構。在一些狀況下,此等結構可經組態以偵測操縱或遮蔽(或嘗試操縱或遮蔽)儲存於記憶體組中之一或多者中之資料的網路攻擊。在其他狀況下,此等結構可經組態以偵測篡改與積體電路7200相關聯之操作參數或篡改直接或間接影響與積體電路7200相關聯之一或多個操作的一或多個硬體元件(無論包括於積體電路7200內抑或積體電路7200外部)。
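Referring back to FIG. 72A, integrated circuit 7200 may include one or more structures for implementing at least one security measure with respect to integrated circuit 7200. In some cases, these structures may be configured to detect cyber attacks that manipulate or obscure (or attempt to manipulate or obscure) data stored in one or more of the memory banks. In other cases, these structures may be configured to detect tampering with operating parameters associated with integrated circuit 7200, or tampering with one or more hardware elements (whether included within integrated circuit 7200 or external to it) that directly or indirectly affect one or more operations associated with integrated circuit 7200.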
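在一些狀況下,控制器7240可包括於積體電路7200中。控制器7240可經由一或多個匯流排7250連接至例如處理器子單元7220_1……7220_k中之一或多者。控制器7240亦可連接至記憶體組7210_1……7210_Jn中之一或多者。雖然圖72A之實例實施例展示一個控制器7240,但應理解,控制器7240可包括多個處理器元件及/或邏輯電路。在所揭示實施例中,控制器7240可經組態以相對於積體電路7200之至少一個操作實施至少一個安全措施。另外,在所揭示實施例中,若至少一個安全措施被觸發,則控制器7240可經組態以採取(或引起)一或多個補救動作。
In some cases, a controller 7240 may be included in integrated circuit 7200. Controller 7240 may be connected, for example, to one or more of the processor subunits 7220_1 ... 7220_k via one or more buses 7250. Controller 7240 may also be connected to one or more of the memory banks 7210_1 ... 7210_Jn. Although the example embodiment of FIG. 72A shows one controller 7240, it should be understood that controller 7240 may include multiple processor elements and/or logic circuits. In the disclosed embodiments, controller 7240 may be configured to implement at least one security measure with respect to at least one operation of integrated circuit 7200. Furthermore, in the disclosed embodiments, controller 7240 may be configured to take (or cause) one or more remedial actions if the at least one security measure is triggered.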
根據本發明之一些實施例,至少一個安全措施可包括用於鎖定對積體電路7200之某些態樣之存取的控制器實施處理程序。存取鎖定涉及使控制器防止自晶片外部對記憶體之某些區的存取(讀取及/或寫入)。可按位址解析度、記憶體組解析度之部分、記憶體組解析度及其類似者來應用存取控制。在一些狀況下,可鎖定與積體電路7200相關聯之記憶體中的一或多個實體位置(例如,積體電路7200之一或多個記憶體組或記憶體組中之一或多者的任何部分)。在一些實施例中,控制器7240可鎖定對與人工智慧模型(或其他類型的基於軟體之系統)之執行相關聯的積體電路7200之某些部分的存取。舉例而言,在一些實施例中,控制器7240可鎖定對儲存於與積體電路7200相關聯之記憶體中的神經網路模型之權重的存取。應注意,軟體程式(亦即,模型)可包括三個組件,包括:程式之輸入資料、程式之程式碼資料及執行程式之輸出資料。此等組件亦可適用於神經網路模型。在此模型之操作期間,可產生輸入資料並將其饋入至模型,且執行模型可產生輸出資料以供讀取。然而,與使用所接收輸入資料執行模型相關聯的程式碼及資料值(例如,預定模型權重等)可保持固定。
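According to some embodiments of the present invention, the at least one security measure may include a controller-implemented process for locking access to certain aspects of integrated circuit 7200. Access locking involves having the controller prevent access (read and/or write) to certain regions of the memory from outside the chip. Access control may be applied at address resolution, at a resolution of part of a memory bank, at memory bank resolution, and the like. In some cases, one or more physical locations in the memory associated with integrated circuit 7200 may be locked (e.g., one or more memory banks of integrated circuit 7200, or any portion of one or more of the memory banks). In some embodiments, controller 7240 may lock access to certain portions of integrated circuit 7200 associated with execution of an artificial intelligence model (or another type of software-based system). For example, in some embodiments, controller 7240 may lock access to weights of a neural network model stored in the memory associated with integrated circuit 7200. It should be noted that a software program (i.e., a model) may include three components: input data to the program, code data of the program, and output data from executing the program. These components may also apply to a neural network model. During operation of such a model, input data may be generated and fed to the model, and executing the model may generate output data to be read. However, the program code and data values (e.g., predetermined model weights, etc.) associated with executing the model using the received input data may remain fixed.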
As described herein, locking may refer to an operation in which the controller disallows, for example, read or write operations that are initiated from outside the chip/integrated circuit and directed at certain regions of memory. The controller, through which the I/O of the chip/integrated circuit may pass, may lock not only entire memory banks but also any range of memory addresses within a memory bank, from a single memory address up to an address range that includes all addresses of the available memory banks (or any address range in between).
Because the memory locations associated with receiving input data and storing output data are associated with changing values and with interactions with components external to integrated circuit 7200 (e.g., components that supply input data or receive output data), locking access to those memory locations may be impractical in some cases. On the other hand, restricting access to memory locations associated with model code and fixed data values may be effective against certain types of cyberattacks. Thus, in some embodiments, memory associated with program code and data values (e.g., memory not used for writing/receiving input data or for reading/providing output data) may be locked as a security measure. Restricting access may include locking certain memory locations such that certain program code and/or data values (e.g., those associated with executing a model on received input data) cannot be changed. In addition, memory regions associated with intermediate data (e.g., data generated during execution of the model) may also be locked against external access. Thus, while various computational logic (whether on board integrated circuit 7200 or external to it) may provide data to, or receive data from, memory locations associated with receiving input data or retrieving generated output data, such computational logic will not be able to access or modify the memory locations that store the program code and data values associated with program execution on the received input data.
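In addition to locking memory locations on integrated circuit 7200, other security measures may be implemented by restricting access to certain computational logic elements (and the memory regions they access) that are configured to execute code associated with a particular program or model. In some cases, such access restriction may be applied to computational logic (and its associated memory regions) located on integrated circuit 7200 (e.g., computational memory, meaning memory that includes computational capability, such as the distributed processors on a memory chip disclosed herein). Access may also be locked or restricted to computational logic (and associated memory locations) associated with any execution of code stored in a locked memory portion of integrated circuit 7200, or with any access to data values stored in a locked memory portion of integrated circuit 7200, regardless of whether that computational logic is located on board integrated circuit 7200. Restricting access to the computational logic responsible for executing the program/model may further ensure that the code and data values associated with operating on received input data remain protected from manipulation, obscuring, and the like.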
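Controller-implemented security measures may be realized in any suitable manner, including locking or restricting access to hardware-based regions associated with certain portions of the memory array of integrated circuit 7200. In some embodiments, such locking may be implemented by adding or supplying to controller 7240 a command configured to cause controller 7240 to lock certain memory portions. In some embodiments, the hardware-based memory portions to be locked may be designated by specific memory addresses (e.g., addresses associated with any memory elements of memory banks 7210_1 ... 7210_J2, etc.). In some embodiments, the locked region of memory may remain fixed during program or model execution. In other cases, the locked region may be configurable. That is, in some cases, commands may be supplied to controller 7240 such that the locked region changes during execution of a program or model. For example, at a particular time, certain memory locations may be added to the locked region of memory. Or, at a particular time, certain memory locations (e.g., previously locked memory locations) may be excluded from the locked region.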
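Locking of certain memory locations may be accomplished in any suitable manner. In some cases, a record of locked memory locations (e.g., a file, database, or data structure that stores and identifies locked memory addresses) may be accessible to controller 7240, such that controller 7240 can determine whether a given memory request relates to a locked memory location. In some cases, controller 7240 maintains a database of locked addresses for use in controlling access to certain memory locations. In other cases, the controller may have a table, or a set of one or more registers, that is configurable until locked and that may hold fixed predetermined values identifying the memory locations to be locked (e.g., locations for which memory access from outside the chip should be restricted). For example, when a memory access is requested, controller 7240 may compare the memory address associated with the memory access request against the locked memory addresses. If the memory address associated with the request is determined to be within the list of locked memory addresses, the memory access request (e.g., a read or a write operation) may be denied.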
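As a non-limiting illustration, the address comparison described above may be sketched in Python. The names `LockTable` and `handle_external_request` are hypothetical and introduced here only for illustration; the disclosure does not specify how the record of locked addresses is organized, and the sorted-range representation is an assumption.

```python
from bisect import bisect_right


class LockTable:
    """Hypothetical record of locked address ranges kept by a controller."""

    def __init__(self):
        # Sorted, non-overlapping (start, end) pairs; end is inclusive.
        self._ranges = []

    def lock(self, start, end):
        """Lock addresses start..end, merging with overlapping or adjacent ranges."""
        merged = []
        for s, e in self._ranges:
            if e < start - 1 or s > end + 1:
                merged.append((s, e))          # disjoint: keep as is
            else:
                start, end = min(s, start), max(e, end)
        merged.append((start, end))
        self._ranges = sorted(merged)

    def is_locked(self, addr):
        # Find the last range whose start is <= addr, then test its end.
        i = bisect_right(self._ranges, (addr, float("inf"))) - 1
        return i >= 0 and self._ranges[i][0] <= addr <= self._ranges[i][1]


def handle_external_request(table, addr, op):
    """Deny an externally initiated read or write that targets a locked address."""
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    return "denied" if table.is_locked(addr) else "allowed"
```

In this sketch a single address, part of a bank, or a whole bank can be locked by choosing the range passed to `lock`, which mirrors the resolutions mentioned above.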
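As discussed above, the at least one security measure may include locking access to certain memory portions of memory array 7210 that are not used for receiving input data or for providing access to generated output data. In some cases, the memory portions within the locked region may be adjusted. For example, locked memory portions may be unlocked, and unlocked memory portions may be locked. Any suitable method may be used to unlock a locked memory portion. For example, an implemented security measure may require a passphrase in order to unlock one or more portions of the locked memory region.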
An implemented security measure may be triggered upon detection of any action that works against it. For example, an attempt to access a locked memory portion (whether a read or a write request) may trigger the security measure. In addition, the security measure may be triggered if an entered passphrase (e.g., one entered in an attempt to unlock a locked memory portion) does not match the predetermined passphrase. In some cases, the security measure may be triggered if the correct passphrase is not provided within an allowable threshold number of passphrase entry attempts (e.g., 1, 2, 3, etc.).
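By way of example only, the attempt threshold described above may be sketched as follows. The class `UnlockGuard`, its fields, and the use of a SHA-256 digest with a constant-time comparison are assumptions made for illustration; the disclosure does not state how the passphrase is stored or compared.

```python
import hashlib
import hmac


class UnlockGuard:
    """Hypothetical passphrase check with a limit on failed attempts."""

    def __init__(self, passphrase: str, max_attempts: int = 3):
        # Store only a digest of the predetermined passphrase (an assumption).
        self._digest = hashlib.sha256(passphrase.encode()).digest()
        self._max_attempts = max_attempts
        self.failed = 0
        self.triggered = False  # set once the security measure is triggered

    def try_unlock(self, candidate: str) -> bool:
        if self.triggered:
            return False  # once triggered, no further unlock attempts succeed
        digest = hashlib.sha256(candidate.encode()).digest()
        if hmac.compare_digest(digest, self._digest):
            self.failed = 0
            return True
        self.failed += 1
        if self.failed >= self._max_attempts:
            self.triggered = True
        return False
```

What happens after `triggered` is set (for example, halting execution) would be decided by the remedial actions the controller is configured to take.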
Memory portions may be locked at any suitable time. For example, in some cases, memory portions may be locked at various times during program execution. In other cases, memory portions may be locked upon startup or before program/model execution. For example, the memory addresses to be locked may be determined and identified together with the programming of the program/model code, or after the data to be accessed by the program/model has been generated and stored. In this way, vulnerability to attacks on memory array 7210 may be reduced or eliminated at or after the start of program/model execution, after the data to be used by the program/model has been generated and stored, and so on.
Unlocking of locked memory may be accomplished by any suitable method and at any suitable time. As described above, a locked memory portion may be unlocked after the correct passphrase, password, or the like is received. In other cases, locked memory may be unlocked by restarting (by command, or by powering off and on) or by erasing the entire memory array 7210. Additionally or alternatively, a release command sequence may be implemented to unlock one or more memory portions.
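According to embodiments of the present invention and as described above, controller 7240 may be configured to control traffic to and from integrated circuit 7200, especially traffic from sources external to integrated circuit 7200. For example, as shown in FIG. 72A, traffic between components external to integrated circuit 7200 and components internal to integrated circuit 7200 (e.g., memory array 7210 or processor subunits 7220) may be controlled by controller 7240. Such traffic may pass through controller 7240, or through one or more buses (e.g., 7250, 7260, or 7261) that are controlled or monitored by controller 7240.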
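According to some embodiments of the present invention, integrated circuit 7200 may receive unchangeable data (e.g., fixed data such as model weights, coefficients, etc.) and certain commands (e.g., code, such as code identifying the memory portions to be locked) during a boot process. Here, unchangeable data refers to data that remains fixed during execution of a program or model and that may remain unchanged until a subsequent boot process. During program execution, integrated circuit 7200 may interact with changeable data, which may include input data to be processed and/or output data generated by processing associated with integrated circuit 7200. As discussed above, access to memory array 7210 or processing array 7220 may be restricted during program or model execution. For example, access may be limited to certain portions of memory array 7210, or to certain processor subunits, that are associated with processing of or interaction with incoming input data to be written, or with processing of or interaction with generated output data to be read. During program or model execution, memory portions containing unchangeable data may be locked and thereby made inaccessible. In some embodiments, the unchangeable data and/or the commands associated with the memory portions to be locked may be included in any appropriate data structure. For example, such data and/or commands may be made available to controller 7240 via one or more configuration files accessible during or after a boot sequence.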
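Referring back to FIG. 72A, integrated circuit 7200 may further include a communication port 7230. As shown in FIG. 72A, controller 7240 may be coupled between communication port 7230 and bus 7250, which is shared among processing subunits 7220_1 to 7220_K. In some embodiments, communication port 7230 may be coupled, indirectly or directly, to a host computer 7270 associated with a host memory 7280 that may include, for example, non-volatile memory. In some embodiments, host computer 7270 may retrieve changeable data 7281 (e.g., input data to be used during execution of a program or model), unchangeable data 7282, and/or commands 7283 from its associated host memory 7280. Changeable data 7281, unchangeable data 7282, and commands 7283 may be uploaded from host computer 7270 to controller 7240 via communication port 7230 during a boot process.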
FIG. 72B is a diagrammatic representation of memory regions inside an integrated circuit, consistent with embodiments of the present invention. As shown, FIG. 72B depicts an example of the data structures included in host memory 7280.
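Reference is now made to FIG. 73A, which shows another example of an integrated circuit consistent with embodiments of the present invention. As shown in FIG. 73A, controller 7240 may include a cyberattack detector 7241 and a response module 7242. In some embodiments of the present invention, controller 7240 may be configured to store or access access control rules 7243. According to some embodiments of the present invention, access control rules 7243 may be included in a configuration file accessible to controller 7240. In some embodiments, access control rules 7243 may be uploaded to controller 7240 during a boot process. Access control rules 7243 may include information indicating the access rules associated with any of changeable data 7281, unchangeable data 7282, and commands 7283, together with their corresponding memory locations. As explained above, access control rules 7243 or the configuration file may include information identifying certain memory addresses within memory array 7210. In some embodiments, controller 7240 may be configured to provide a locking mechanism and/or function that locks various addresses of memory array 7210, such as addresses used to store commands or unchangeable data.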
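Controller 7240 may be configured to enforce access control rules 7243, for example to prevent unauthorized entities from altering unchangeable data or commands. In some embodiments, reading of unchangeable data or commands may be prohibited under access control rules 7243. According to some embodiments of the present invention, controller 7240 may be configured to determine whether an access attempt has been made on at least a portion of certain commands or unchangeable data. Controller 7240 (e.g., including cyberattack detector 7241) may compare the memory address associated with an access request against the memory addresses used for unchangeable data and commands, in order to detect whether an unauthorized access attempt has been made on one or more locked memory locations. In this way, for example, cyberattack detector 7241 of controller 7240 may be configured to determine whether a suspected cyberattack has occurred, such as a request to alter one or more commands, or to change or obscure unchangeable data associated with one or more locked memory portions. Response module 7242 may be configured to determine how to respond to a detected cyberattack and/or to implement the response. For example, in some cases, in response to detecting an attack on data or commands in one or more locked memory locations, response module 7242 of controller 7240 may implement, or cause to be implemented, a response that may include halting one or more operations, such as the memory access operation associated with the detected attack. A response to a detected attack may also include halting one or more operations associated with execution of a program or model, returning a warning or other indicator of the attempted attack, asserting an indication line to the host, erasing the entire memory, and so on.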
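In addition to locking memory portions, other techniques for defending against cyberattacks may be implemented to provide the described security measures associated with integrated circuit 7200. For example, in some embodiments, controller 7240 may be configured to duplicate a program or model in different memory locations and processor subunits associated with integrated circuit 7200. In this way, the program/model and its duplicate may be executed independently, and the results of the independent executions may be compared. For example, a program/model may be duplicated in two memory banks 7210 and executed by different processor subunits 7220 in integrated circuit 7200. In other embodiments, the program/model may be duplicated in two different integrated circuits 7200. In either case, the results of program/model execution may be compared to determine whether any difference exists between the duplicate executions. A detected difference in execution results (e.g., intermediate execution results, final execution results, etc.) may indicate a cyberattack that has altered one or more aspects of the program/model or its associated data. In some embodiments, different memory banks 7210 and processor subunits 7220 may be assigned to execute two duplicate models on the same input data. In some embodiments, intermediate results may be compared while the two duplicate models execute on the same input data, and if the two intermediate results are mismatched at the same stage, execution may be suspended as a potential remedial action. Where the two duplicate models are executed by processor subunits of the same integrated circuit, that integrated circuit may also compare the results. This can be done without notifying any entity outside the integrated circuit that two duplicate models are being executed. In other words, entities outside the chip are unaware that duplicate models are running in parallel on the integrated circuit.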
FIG. 73B is a diagrammatic representation of a configuration for simultaneously executing duplicate models, consistent with embodiments of the present invention.
Although a single program/model duplication is described as one example for detecting a possible cyberattack, any number of duplicates (e.g., 1, 2, 3, or more) may be used to detect a possible cyberattack. As the number of duplicates and independent program/model executions increases, the confidence level of cyberattack detection may also increase. A larger number of duplicates may also reduce the potential success rate of a cyberattack, because it may be harder for an attacker to affect multiple program/model duplicates. The number of program or model duplicates may be determined at runtime to further increase the difficulty for a cyberattacker of successfully affecting program or model execution.
In some embodiments, duplicate models may differ from one another in one or more aspects. In this example, the code associated with two programs/models may be made different, yet the programs/models may be designed so that both return the same output results. In this sense, at least, the two programs/models may be regarded as duplicates of each other. For example, two neural network models may have different neuron orderings relative to each other within a layer. Despite this change in the model code, both neural network models may return the same output results. Duplicating a program/model in this way may make it harder for a cyberattacker to identify such effective duplicates of the program or model to be compromised. As a result, the duplicate models/programs may not only provide redundancy that minimizes the impact of a cyberattack, but may also enhance cyberattack detection (e.g., by exposing tampering or unauthorized access where the attacker alters one program/model or its data but fails to make a corresponding change to its duplicate).
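As a non-limiting illustration of duplicates that differ in neuron ordering, the following Python sketch builds a small two-layer network and a reordered duplicate. The functions `forward` and `permute_hidden`, the ReLU activation, and all weight values are hypothetical choices made for illustration and do not come from the disclosure.

```python
def relu(x):
    return x if x > 0 else 0.0


def forward(w1, b1, w2, x):
    """Two-layer network: hidden = relu(w1 @ x + b1), output = w2 . hidden."""
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden))


def permute_hidden(w1, b1, w2, order):
    """Reorder hidden neurons: rows of w1 and entries of b1 and w2 move together."""
    return (
        [w1[i] for i in order],
        [b1[i] for i in order],
        [w2[i] for i in order],
    )


# Illustrative weights only.
W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
B1 = [0.0, 1.0, -0.5]
W2 = [2.0, -1.0, 0.5]

# The duplicate stores the same neurons in a different order.
W1_DUP, B1_DUP, W2_DUP = permute_hidden(W1, B1, W2, [2, 0, 1])
```

The stored weights of the two copies differ position by position, yet `forward` returns the same value for both on any input, up to floating-point rounding. An attacker who alters one copy's weights without the matching change in the other would therefore produce a detectable mismatch.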
In many cases, duplicate programs/models (especially those that exhibit code differences) may be designed such that their outputs do not match exactly, but instead constitute soft values (e.g., approximately equal output values) rather than exact fixed values. In such embodiments, the output results from two or more effective program/model duplicates may be compared (e.g., using a dedicated module or by a host processor) to determine whether the difference between their output results (whether intermediate or final) falls within a predetermined range. A difference in the output soft values that does not exceed a predetermined threshold or range may be regarded as evidence of no tampering, unauthorized access, etc. On the other hand, if the difference exceeds the predetermined threshold or range, it may be regarded as evidence that a cyberattack in the form of tampering, unauthorized access to memory, etc., has occurred. In such cases, the duplicate program/model security measure is triggered and one or more remedial actions may be taken (e.g., halting execution of the program or model, shutting down one or more operations of integrated circuit 7200, operating in a safe mode with limited functionality, among many other actions).
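By way of example only, the soft-value comparison described above may be sketched as follows. The function names, the element-wise absolute-difference criterion, and the action names "halt" and "continue" are assumptions for illustration; the disclosure leaves the form of the predetermined threshold or range open.

```python
def outputs_consistent(a, b, tolerance):
    """True if every paired output value differs by at most `tolerance`."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


def check_duplicates(results, tolerance):
    """Compare each duplicate's outputs against the first duplicate's outputs.

    Returns the name of a (hypothetical) action: "halt" if any duplicate
    deviates beyond the tolerance, otherwise "continue".
    """
    reference = results[0]
    for other in results[1:]:
        if not outputs_consistent(reference, other, tolerance):
            return "halt"
    return "continue"
```

The same check can be applied to intermediate results at matching stages as well as to final results.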
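Security measures associated with integrated circuit 7200 may also involve quantitative analysis of data associated with an executing or executed program or model. For example, in some embodiments, controller 7240 may be configured to calculate one or more checksum/hash/cyclic redundancy check (CRC)/parity values over data stored in at least a portion of memory array 7210. The calculated values may be compared against one or more predetermined values. If there is a discrepancy between the compared values, the discrepancy may be interpreted as evidence of tampering with the data stored in the at least a portion of memory array 7210. In some embodiments, checksum/hash/CRC/parity values may be calculated over all memory locations associated with memory array 7210 to identify changes in data. In this example, the entire memory (or memory bank) in question may be read by, for example, host computer 7270 or a processor associated with integrated circuit 7200 in order to calculate the checksum/hash/CRC/parity values. In other cases, checksum/hash/CRC/parity values may be calculated over a predetermined subset of the memory locations associated with memory array 7210 to identify changes in the data associated with that subset. In some embodiments, controller 7240 may be configured to calculate checksum/hash/CRC/parity values associated with a predetermined data path (e.g., associated with a memory access pattern), and the calculated values may be compared with one another or with predetermined values to determine whether tampering or another form of cyberattack has occurred.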
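As a non-limiting illustration, a check of a predetermined subset of memory against an expected CRC may be sketched in Python using the standard `zlib.crc32` function. The function names and the choice of CRC-32 are assumptions for illustration; a checksum, hash, or parity value could be substituted.

```python
import zlib


def region_crc(memory, start: int, length: int) -> int:
    """CRC-32 over `length` bytes of `memory` beginning at offset `start`."""
    return zlib.crc32(bytes(memory[start:start + length]))


def verify_region(memory, start: int, length: int, expected_crc: int) -> bool:
    """True if the region still matches the predetermined CRC value."""
    return region_crc(memory, start, length) == expected_crc
```

For instance, the expected value could be computed once after the fixed data (such as model weights) has been stored, and `verify_region` could then be called during or after each run.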
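Integrated circuit 7200 may be made even more secure against cyberattacks by protecting the one or more predetermined values (e.g., expected checksum/hash/CRC/parity values, expected differences between intermediate or final output results, expected ranges of differences associated with certain values, etc.) within integrated circuit 7200 or in a location accessible to integrated circuit 7200. For example, in some embodiments, the one or more predetermined values may be stored in registers of memory array 7210 and may be used (e.g., by controller 7240 of integrated circuit 7200) during or after each run of the model to evaluate intermediate or final output results, checksums, and so on. In some cases, the register values may be updated using a "save last result data" command to calculate the predetermined values on the fly, and the calculated values may be saved in a register or another memory location. In this way, valid output values may be used to update the predetermined values used for comparison after each complete or partial execution of a program or model. This technique may increase the difficulty a cyberattacker faces when attempting to modify or otherwise tamper with the one or more predetermined reference values that are designed to expose the attacker's activity.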
In operation, a CRC calculator may be used to track memory accesses. For example, such calculation circuits may be placed at the memory bank level, in the processor subunits, or at the controller, where each calculation circuit may be configured to accumulate into the CRC calculator upon each memory access.
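By way of example only, a running CRC that accumulates on each memory access may be sketched as follows. The class `AccessCrc` and the way each access is encoded (operation, 8-byte address, then data) are hypothetical; the sketch relies only on the fact that `zlib.crc32` accepts a starting value as its second argument.

```python
import zlib


class AccessCrc:
    """Hypothetical running CRC updated once per memory access."""

    def __init__(self):
        self.value = 0

    def record(self, op: str, addr: int, data: bytes = b"") -> None:
        # Encode the access and fold it into the running CRC.
        header = op.encode() + addr.to_bytes(8, "little")
        self.value = zlib.crc32(header + data, self.value)
```

Two executions that perform the same accesses in the same order end with the same `value`, so a final value that differs from the expected one suggests that the access pattern or the written data changed.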
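Reference is now made to FIG. 74A, which provides a diagrammatic representation of another embodiment of integrated circuit 7200. In the example embodiment represented by FIG. 74A, controller 7240 may include a tamper detector 7245 and a response module 7246. As in the other disclosed embodiments, tamper detector 7245 may be configured to detect evidence of a potential tampering attempt. According to some embodiments of the present invention, a security measure associated with integrated circuit 7200 and implemented by controller 7240 may include, for example, comparing an actual program/model operation pattern against a predetermined/allowed operation pattern. If the actual program/model operation pattern differs from the predetermined/allowed operation pattern in one or more aspects, the security measure may be triggered. If the security measure is triggered, response module 7246 of controller 7240 may be configured to implement one or more remedial measures in response.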
FIG. 74C is a diagrammatic representation of detection elements that may be located at various points within the chip, according to exemplary disclosed embodiments. As described above, detection of cyberattacks and tampering may be performed using detection elements located at various points within the chip, as shown, for example, in FIG. 74C. For example, certain code may be associated with an expected number of processing events within a certain time period. The detector shown in FIG. 74C may count the number of events (monitored by an event counter) experienced by the system during a certain time period (monitored by a time counter). If the number of events exceeds a certain predetermined threshold (e.g., the number of events expected during a predefined time period), tampering may be indicated. Such detectors may be included at multiple points in the system to monitor various types of events, as shown in FIG. 74C.
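As a non-limiting illustration, a detector built from a time counter and an event counter may be sketched as follows. The class `EventRateDetector`, its per-cycle `tick` interface, and the fixed-length window that resets both counters are assumptions for illustration and are not taken from FIG. 74C.

```python
class EventRateDetector:
    """Hypothetical detector: flags when events in a time window exceed a threshold."""

    def __init__(self, window_cycles: int, max_events: int):
        self.window_cycles = window_cycles
        self.max_events = max_events
        self.time_counter = 0
        self.event_counter = 0

    def tick(self, events: int = 0) -> bool:
        """Advance one cycle; return True if the threshold is exceeded in this window."""
        self.event_counter += events
        self.time_counter += 1
        exceeded = self.event_counter > self.max_events
        if self.time_counter == self.window_cycles:
            # End of window: both counters start over.
            self.time_counter = 0
            self.event_counter = 0
        return exceeded
```

Several such detectors, each with its own window and threshold, could watch different event types at different points in the system.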
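More specifically, in some embodiments, controller 7240 may be configured to store or access an expected program/model operation pattern 7244. For example, in some cases, an operation pattern may be represented as a curve 7247 indicating the allowed load per time pattern and the prohibited or illegal load per time pattern. A tampering attempt may cause memory array 7210 or processing array 7220 to operate outside certain operating specifications. This may cause memory array 7210 or processing array 7220 to generate heat or malfunction, and may make it possible to alter data or code associated with memory array 7210 or processing array 7220. Such alterations may cause the operation pattern to fall outside the allowed operation pattern indicated by curve 7247.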
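According to some embodiments of the present invention, controller 7240 may be configured to monitor an operation pattern associated with memory array 7210 or processing array 7220. The operation pattern may be associated with the number of access requests, the types of access requests, the timing of access requests, and so on. Controller 7240 may be further configured to detect a tampering attack if the operation pattern differs from the allowable operation pattern.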
It should be noted that the disclosed embodiments may be used to protect not only against cyberattacks but also against non-malicious errors in operation. For example, the disclosed embodiments may also effectively protect a system such as integrated circuit 7200 from errors caused by environmental factors such as temperature or voltage changes or levels, especially where those levels fall outside the operating specifications of integrated circuit 7200.
In response to detection of a suspected cyberattack (e.g., in response to a triggered security measure), any suitable remedial action may be implemented. For example, remedial actions may include halting one or more operations associated with program/model execution, operating one or more components associated with integrated circuit 7200 in a safe mode, locking one or more components of integrated circuit 7200 against additional input or access, and so on.
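FIG. 74B provides a flowchart representation of a method 7450 of securing an integrated circuit against tampering, according to exemplary disclosed embodiments. For example, step 7452 may include implementing at least one security measure with respect to operation of the integrated circuit, using a controller associated with the integrated circuit. At step 7454, one or more remedial actions may be taken if the at least one security measure is triggered. The integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of which is associated with one or more discrete memory banks among the plurality of discrete memory banks.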
In some embodiments, the disclosed security measures may be implemented across multiple memory chips, and at least one or more of the disclosed security mechanisms may be implemented for each memory chip/integrated circuit. In some cases, each memory chip/integrated circuit may implement the same security measures; in other cases, different memory chips/integrated circuits may implement different security measures (e.g., when a different security measure is better suited to a certain type of operation associated with a particular integrated circuit). In some embodiments, more than one security measure may be implemented by a particular controller of an integrated circuit. For example, a particular integrated circuit may implement any number or type of the disclosed security measures. In addition, a particular integrated circuit controller may be configured to implement multiple different remedial measures in response to a triggered security measure.
It should also be noted that two or more of the security mechanisms described above may be combined to improve security against cyberattacks or tampering attacks. In addition, security measures may be implemented across different integrated circuits, and those integrated circuits may coordinate their implementation. For example, model duplication may be performed within one memory chip or across different memory chips. In this example, results from one memory chip, or results from two or more memory chips, may be compared to detect a potential cyberattack or tampering attack. In some embodiments, the replicated security measures applied across multiple integrated circuits may include one or more of the following: the disclosed access locking mechanism, the hash protection mechanism, model duplication, program/model execution pattern analysis, or any combination of these or other disclosed embodiments.
Multi-port processor subunits in DRAM
As described above, the presently disclosed embodiments may include a distributed processor memory chip that includes an array of processor subunits and an array of memory banks, where each of the processor subunits may be dedicated to at least one of the memory banks in the array. As discussed in the following sections, a distributed processor memory chip may serve as the basis for a scalable system. That is, in some cases, a distributed processor memory chip may include one or more communication ports configured to transfer data from one distributed processor memory chip to another. In this way, any desired number of distributed processor memory chips may be linked together (e.g., in series, in parallel, in a loop, or any combination thereof) to form a scalable array of distributed processor memory chips. Such an array may provide a flexible solution for efficiently performing memory-intensive operations and for scaling the computing resources associated with the performance of those operations. Because distributed processor memory chips may include clocks with differing timing patterns, the presently disclosed embodiments include features for accurately controlling data transfers between distributed processor memory chips even in the presence of clock timing differences. Such embodiments may enable efficient data sharing among different distributed processor memory chips.
FIG. 75A is a diagrammatic representation of a scalable processor memory system including a plurality of distributed processor memory chips, consistent with embodiments of the present invention. According to embodiments of the present invention, the scalable processor memory system may include a plurality of distributed processor memory chips, such as a first distributed processor memory chip 7500, a second distributed processor memory chip 7500', and a third distributed processor memory chip 7500". Each of the first distributed processor memory chip 7500, the second distributed processor memory chip 7500', and the third distributed processor memory chip 7500" may include any of the configurations and/or features associated with any of the distributed processor embodiments described in the present disclosure.
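In some embodiments, each of the first distributed processor memory chip 7500, the second distributed processor memory chip 7500', and the third distributed processor memory chip 7500" may be implemented similarly to integrated circuit 7200 shown in FIG. 72A. As shown in FIG. 75A, the first distributed processor memory chip 7500 may include a memory array 7510, a processing array 7520, and a controller 7540. Memory array 7510, processing array 7520, and controller 7540 may be configured similarly to memory array 7210, processing array 7220, and controller 7240 of FIG. 72A.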
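According to embodiments of the present invention, the first distributed processor memory chip 7500 may include a first communication port 7530. In some embodiments, first communication port 7530 may be configured to communicate with one or more external entities. For example, communication port 7530 may be configured to establish a communication connection between distributed processor memory chip 7500 and an external entity other than another distributed processor memory chip (such as distributed processor memory chips 7500' and 7500"). For example, communication port 7530 may be coupled, indirectly or directly, to a host computer (e.g., as illustrated in FIG. 72A) or to any other computing device, communication module, and the like.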
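According to embodiments of the present invention, the first distributed processor memory chip 7500 may further include one or more additional communication ports configured to communicate with other distributed processor memory chips, such as 7500' or 7500". In some embodiments, the one or more additional communication ports may include a second communication port 7531 and a third communication port 7532, as shown in FIG. 75A. Second communication port 7531 may be configured to communicate with the second distributed processor memory chip 7500' and to establish a communication connection between the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500'. Similarly, third communication port 7532 may be configured to communicate with the third distributed processor memory chip 7500" and to establish a communication connection between the first distributed processor memory chip 7500 and the third distributed processor memory chip 7500". In some embodiments, the first distributed processor memory chip 7500 (and any of the memory chips disclosed herein) may include a plurality of communication ports, in any appropriate number (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 1000, etc.).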
In some embodiments, the first, second, and third communication ports are associated with corresponding buses. The corresponding bus may be a bus common to each of the first, second, and third communication ports. In some embodiments, the corresponding buses associated with each of the first, second, and third communication ports are all connected to the plurality of discrete memory banks. In some embodiments, the first communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip. In some embodiments, the second communication port is connected to at least one of a bus internal to the memory chip or at least one processor subunit included in the memory chip.
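Although the configuration of the disclosed distributed processor memory chips has been explained with respect to the first distributed processor memory chip 7500, it should be noted that the second processor memory chip 7500' and the third processor memory chip 7500" may be configured similarly to the first distributed processor memory chip 7500. For example, the second distributed processor memory chip 7500' may also include a memory array 7510', a processing array 7520', a controller 7540', and/or a plurality of communication ports, such as ports 7530', 7531', and 7532'. Similarly, the third distributed processor memory chip 7500" may include a memory array 7510", a processing array 7520", a controller 7540", and/or a plurality of communication ports, such as ports 7530", 7531", and 7532". In some embodiments, second communication port 7531' and third communication port 7532' of the second distributed processor memory chip 7500' may be configured to communicate with the third distributed processor memory chip 7500" and the first distributed processor memory chip 7500, respectively. Similarly, second communication port 7531" and third communication port 7532" of the third distributed processor memory chip 7500" may be configured to communicate with the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500', respectively. This similarity of configuration among distributed processor memory chips may facilitate scaling a computing system based on the disclosed distributed processor memory chips. In addition, the disclosed arrangement and configuration of the communication ports associated with each distributed processor memory chip may enable flexible arrangement of an array of distributed processor memory chips (e.g., including series, parallel, ring, star, or network connections).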
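According to embodiments of the present invention, distributed processor memory chips, such as the first to third distributed processor memory chips 7500, 7500', and 7500", may communicate with one another via buses 7533. In some embodiments, a bus 7533 may connect two communication ports of two different distributed processor memory chips. For example, second communication port 7531 of the first processor memory chip 7500 may be connected via bus 7533 to third communication port 7532' of the second processor memory chip 7500'. According to embodiments of the present invention, distributed processor memory chips, such as the first to third distributed processor memory chips 7500, 7500', and 7500", may also communicate with external entities (e.g., a host computer) via a bus such as bus 7534. For example, first communication port 7530 of the first distributed processor memory chip 7500 may be connected to one or more external entities via bus 7534. Distributed processor memory chips may be connected to one another in various ways. In some cases, the distributed processor memory chips may exhibit serial connectivity, in which each distributed processor memory chip is connected to a pair of neighboring distributed processor memory chips. In other cases, the distributed processor memory chips may exhibit a higher degree of connectivity, in which at least one distributed processor memory chip is connected to two or more other distributed processor memory chips. In some cases, every distributed processor memory chip within a plurality of memory chips may be connected to every other distributed processor memory chip in the plurality.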
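As shown in FIG. 75A, bus 7533 (or any other bus associated with the embodiment of FIG. 75A) may be unidirectional. Although FIG. 75A illustrates bus 7533 as unidirectional, with a certain data transfer flow (as indicated by the arrows shown in FIG. 75A), bus 7533 (or any other bus in FIG. 75A) may be implemented as a bidirectional bus. According to some embodiments of the present invention, a bus connected between two distributed processor memory chips may be configured to have a higher communication speed than a bus connected between a distributed processor memory chip and an external entity. In some embodiments, communication between a distributed processor memory chip and an external entity may occur during limited periods, for example during preparation for execution (loading code, input data, weight data, etc. from a host computer) and during periods in which results generated by execution of a neural network model are output to the host computer. During execution of one or more programs associated with the distributed processors of chips 7500, 7500', and 7500" (e.g., during memory-intensive operations associated with an artificial intelligence application), communication between the distributed processor memory chips may take place via buses 7533, 7533', and so on. In some embodiments, communication between a distributed processor memory chip and an external entity may occur less frequently than communication between two processor memory chips. Depending on communication requirements and the embodiment, the bus between a distributed processor memory chip and an external entity may be configured to have a communication speed equal to, greater than, or less than that of the buses between distributed processor memory chips.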
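In some embodiments, as represented by FIG. 75A, a plurality of distributed processor memory chips, such as the first to third distributed processor memory chips 7500, 7500', and 7500", may be configured to communicate with one another. As noted, this capability may facilitate assembly of a scalable distributed processor memory chip system. For example, memory arrays 7510, 7510', and 7510" and processing arrays 7520, 7520', and 7520" of the first to third processor memory chips 7500, 7500', and 7500", when linked by communication channels (such as the buses shown in FIG. 75A), may be regarded as effectively belonging to a single distributed processor memory chip.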
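According to embodiments of the present invention, communication among the plurality of distributed processor memory chips, and/or communication between a distributed processor memory chip and one or more external entities, may be managed in any suitable manner. In some embodiments, such communication may be managed by processing resources such as processing array 7520 in distributed processor memory chip 7500. In some other embodiments, for example to relieve the processing resources provided by the arrays of distributed processors of the computational load imposed by communication management, the controllers of the distributed processor memory chips, such as controllers 7540, 7540', and 7540", may be configured to manage communication between distributed processor memory chips and/or communication between a distributed processor memory chip and one or more external entities. For example, each controller 7540, 7540', and 7540" of the first to third processor memory chips 7500, 7500', and 7500" may be configured to manage communication of its corresponding distributed processor memory chip with respect to the other distributed processor memory chips. In some embodiments, controllers 7540, 7540', and 7540" may be configured to control such communication via corresponding communication ports, such as ports 7531, 7531', 7531", 7532, 7532', and 7532".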
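Controllers 7540, 7540', and 7540" may also be configured to manage communication between distributed processor memory chips while accounting for timing differences that may exist among those chips. For example, a distributed processor memory chip (e.g., 7500) may be fed by an internal clock that may differ from the clocks of other distributed processor memory chips (e.g., 7500' and 7500"). Thus, in some embodiments, controller 7540 may be configured to implement one or more strategies for accounting for the different clock timing patterns among distributed processor memory chips, and to manage communication between distributed processor memory chips by taking into account possible time deviations between them.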
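For example, in some embodiments, controller 7540 of the first distributed processor memory chip 7500 may be configured to enable data to be transferred from the first distributed processor memory chip 7500 to the second processor memory chip 7500' under certain conditions. In some cases, controller 7540 may suppress the data transfer if one or more processor subunits of the first distributed processor memory chip 7500 are not ready to transfer the data. Alternatively or additionally, controller 7540 may suppress the data transfer if a receiving processor subunit of the second distributed processor memory chip 7500' is not ready to receive the data. In some cases, controller 7540 may initiate the transfer of data from a sending processor subunit (e.g., in chip 7500) to a receiving processor subunit (e.g., in chip 7500') after determining that the sending processor subunit is ready to send the data and the receiving processor subunit is ready to receive it. In other embodiments, controller 7540 may initiate the data transfer based solely on whether the sending processor subunit is ready to send the data, particularly where the data can be buffered in controller 7540 or 7540', for example until the receiving processor subunit is ready to receive the transferred data.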
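By way of example only, the two transfer policies described above (wait for both sides, or buffer on behalf of the receiver) may be sketched as follows. The class `LinkController`, its `step` interface, and the unbounded queue are hypothetical and are not taken from the disclosure.

```python
from collections import deque


class LinkController:
    """Hypothetical model of a controller gating chip-to-chip transfers."""

    def __init__(self, buffered: bool = False):
        self.buffered = buffered
        self.buffer = deque()

    def step(self, sender_ready: bool, receiver_ready: bool, payload=None):
        """Return the payload delivered to the receiver in this step, or None."""
        # Accept data from the sender only if it can be delivered or buffered.
        if sender_ready and (receiver_ready or self.buffered):
            self.buffer.append(payload)
        # Deliver the oldest pending payload once the receiver is ready.
        if receiver_ready and self.buffer:
            return self.buffer.popleft()
        return None
```

Without buffering, a transfer is suppressed until both sides are ready in the same step. With buffering, the sender may proceed as soon as it is ready and the data is held until the receiver can take it.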
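According to embodiments of the present invention, controller 7540 may be configured to determine whether one or more other timing constraints are satisfied in order to enable a data transfer. Such timing constraints may relate to: a time difference between the transmit time at the sending processor subunit and the receive time at the receiving processor subunit, an access request from an external entity (e.g., a host computer) for the data being processed, a refresh operation performed on the memory resources (e.g., a memory array) associated with the sending or receiving processor subunit, among others.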
FIG. 75E is an example timing diagram consistent with embodiments of the present invention. FIG. 75E illustrates the following example.
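In some embodiments, controller 7540 and the other controllers associated with the distributed processor memory chips may be configured to use a clock enable signal to manage data transfers between chips. For example, processing array 7520 may be fed by a clock. In some embodiments, whether one or more processor subunits respond to the supplied clock signal may be controlled, for example by controller 7540, using a clock enable signal (e.g., shown in FIG. 75A as "To CE"). Each processor subunit, e.g., 7520_1 to 7520_K, may execute program code, and the program code may include communication commands. According to some embodiments of the present invention, controller 7540 may control the timing of the communication commands by controlling the clock enable signals to processor subunits 7520_1 to 7520_K. For example, according to some embodiments, when a sending processor subunit (e.g., in the first processor memory chip 7500) is programmed to transmit data at a certain cycle (e.g., the 1000th clock cycle) and a receiving processor subunit (e.g., in the second processor memory chip 7500') is programmed to receive data at a certain cycle (e.g., the 1000th clock cycle), controller 7540 of the first processor memory chip 7500 and controller 7540' of the second processor memory chip 7500' may not allow the data transfer until both the sending processor subunit and the receiving processor subunit are ready to perform it. For example, controller 7540 may "suppress" a data transfer from the sending processor subunit in chip 7500 by supplying that subunit with a certain clock enable signal (e.g., logic low), which prevents the sending processor subunit from sending data in response to the received clock signal. A certain clock enable signal may "freeze" an entire distributed processor memory chip or any portion of it. On the other hand, controller 7540 may cause the sending processor subunit to initiate the data transfer by supplying it with the opposite clock enable signal (e.g., logic high), which causes the sending processor subunit to respond to the received clock signal. Similar operations, such as receiving or not receiving by a receiving processor subunit in chip 7500', may be controlled using a clock enable signal issued by controller 7540'.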
In some embodiments, the clock enable signal may be sent to all processor subunits (e.g., 7520_1 to 7520_K) in a processor memory chip (e.g., 7500). In general, the clock enable signal has the effect of causing a processor subunit either to respond to its clock signal or to ignore it. For example, in some cases, when the clock enable signal is high (depending on the convention of the particular application), a processor subunit may respond to its clock signal and execute one or more instructions according to the timing of that clock signal. When the clock enable signal is low, on the other hand, the processor subunit is prevented from responding to its clock signal, so that it does not execute instructions in response to the clock timing. In other words, when the clock enable signal is low, the processor subunit may ignore the received clock signal.
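As a non-limiting illustration of clock gating, the following Python sketch models a processor subunit that advances its program only on clock edges where the clock enable signal is high. The class `ProcessorSubunit`, the instruction names, and the active-high convention are assumptions for illustration.

```python
class ProcessorSubunit:
    """Hypothetical subunit that executes one instruction per enabled clock edge."""

    def __init__(self, program):
        self.program = list(program)
        self.pc = 0            # index of the next instruction
        self.executed = []     # instructions executed so far

    def clock(self, clock_enable: bool) -> None:
        """One clock edge: the edge is ignored when clock_enable is low."""
        if clock_enable and self.pc < len(self.program):
            self.executed.append(self.program[self.pc])
            self.pc += 1
```

Holding the enable signal low for some cycles delays every later instruction, including any communication command, by the same number of cycles, which is how a controller in this sketch could postpone a send until the other side is ready.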
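Returning to the example of FIG. 75A, any of controllers 7540, 7540', or 7540" may be configured to use a clock enable signal to control operation of its respective distributed processor memory chip by causing one or more processor subunits in the respective array to respond or not respond to a received clock signal. In some embodiments, controllers 7540, 7540', or 7540" may be configured to selectively advance code execution, e.g., when such code relates to or includes data transfer operations and their timing. In some embodiments, controllers 7540, 7540', or 7540" may be configured to use a clock enable signal to control the timing of a data transmission between two different distributed processor memory chips via any of communication ports 7531, 7531', 7531", 7532, 7532', and 7532", etc. In some embodiments, controllers 7540, 7540', or 7540" may be configured to use a clock enable signal to control the time of data reception between two different distributed processor memory chips via any of communication ports 7531, 7531', 7531", 7532, 7532', and 7532", etc.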
In some embodiments, the timing of data transfers between two different distributed processor memory chips may be configured based on a compilation optimization step. The compilation may allow for building processing routines in which tasks are efficiently assigned to processing subunits without being affected by transmission delays over the bus connecting two different processor memory chips. The compilation may be performed by a compiler in a host computer or transmitted to a host computer. Normally, a transfer delay over a bus between two different processor memory chips would result in a data bottleneck for the processing subunits that need the data. The disclosed compilation may schedule data transmissions in a way that enables the processing units to receive data continuously even with unfavorable transmission delays over the bus.
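One simple way a compiler could hide a fixed bus delay is to issue each send early by the length of that delay. The sketch below assumes a constant, known latency and a known list of cycles at which the receiving subunit consumes each word; the function name `schedule_sends` and its parameters are hypothetical and are not taken from the disclosure.

```python
def schedule_sends(consume_cycles, bus_latency):
    """Return the cycle at which each word must be sent so that it
    arrives exactly when the receiving subunit needs it."""
    send_cycles = [cycle - bus_latency for cycle in consume_cycles]
    if any(cycle < 0 for cycle in send_cycles):
        raise ValueError("first word is needed before the bus can deliver it")
    return send_cycles


# Receiver consumes one word per cycle starting at cycle 10; the bus takes 4 cycles.
consume = [10, 11, 12, 13]
sends = schedule_sends(consume, bus_latency=4)
arrivals = [cycle + 4 for cycle in sends]
```

With this schedule the receiver sees a new word on every cycle from cycle 10 onward, even though each word spends four cycles on the bus.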
While the embodiment of FIG. 75A includes three ports per distributed processor memory chip (7500', 7500", 7500'''), any number of ports may be included in a distributed processor memory chip according to the disclosed embodiments. For example, in some cases, a distributed processor memory chip may include more or fewer ports. In the embodiment of FIG. 75B, each distributed processor memory chip (e.g., 7500A to 7500I) may be configured with multiple ports. These ports may be substantially the same as one another or may differ. In the example shown, each distributed processor memory chip includes five ports: one host communication port 7570 and four chip ports 7572. Host communication port 7570 may be configured to communicate (via bus 7534) between any of the distributed processors in the array (as shown in FIG. 75B) and, e.g., a host computer located remotely relative to the array of distributed processor memory chips. Chip ports 7572 may be configured to enable communication between distributed processor memory chips via buses 7535.
Any number of distributed processor memory chips may be connected to one another. In the example shown in FIG. 75B, where each distributed processor includes four chip ports, the memory chips may form an array in which each distributed processor memory chip is connected to two or more other distributed processor memory chips, and in some cases certain chips may be connected to four other distributed processor memory chips. Including more chip ports in the distributed processor memory chips may enable more interconnectivity among them.
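Additionally, while distributed processor memory chips 7500A to 7500I are shown in FIG. 75B with two different types of communication ports 7570 and 7572, in some cases a single type of communication port may be included in each distributed processor memory chip. In other cases, more than two different types of communication ports may be included in one or more of the distributed processor memory chips. In the example of FIG. 75C, each of distributed processor memory chips 7500A' to 7500C' includes two (or more) communication ports 7570 of the same type. In this embodiment, communication ports 7570 may be configured to enable communication with an external entity, such as a host computer, via bus 7534, and may also be configured to enable communication between distributed processor memory chips (e.g., between distributed processor memory chips 7500B' and 7500C') via bus 7535.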
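In some embodiments, ports provided on one or more distributed processor memory chips may be used to provide access to more than one host. For example, in the embodiment shown in FIG. 75D, a distributed processor memory chip includes two or more ports 7570. Ports 7570 may constitute host ports, chip ports, or a combination of host ports and chip ports. In the example shown, two ports 7570 and 7570' may enable two different hosts (e.g., host computers or compute elements or other types of logic units) to access distributed processor memory chip 7500A via buses 7534 and 7534'. This embodiment may enable two (or more) different host computers to access distributed processor memory chip 7500A. In other embodiments, however, both buses 7534 and 7534' may be connected to the same host entity, e.g., where that host entity needs additional bandwidth or parallel access to one or more of the processor subunits/memory banks of distributed processor memory chip 7500A.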
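In some cases, as shown in FIG. 75D, more than one controller 7540 and 7540' may be used to control access to the distributed processor subunits/memory banks of distributed processor memory chip 7500A. In other cases, a single controller may be used to handle communications from one or more external host entities.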
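Additionally, one or more buses internal to distributed processor memory chip 7500A may enable parallel access to the distributed processor subunits/memory banks of distributed processor memory chip 7500A. For example, distributed processor memory chip 7500A may include a first bus 7580 and a second bus 7580' that enable parallel access to, e.g., distributed processor subunits 7520_1 to 7520_6 and their corresponding dedicated memory banks 7510_1 to 7510_6. Such an arrangement may allow simultaneous access to two different locations in distributed processor memory chip 7500A. Further, in cases where not all ports are used simultaneously, the ports may share hardware resources within distributed processor memory chip 7500A (e.g., a common bus and/or a common controller) and may constitute IOs that are multiplexed to that hardware.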
In some embodiments, some of the compute units (e.g., processor subunits 7520_1 to 7520_6) may be connected to an extra port (7570') or controller, while others are not. Nevertheless, data from compute units not connected to extra port 7570' may pass through an internal grid of connections to the compute units that are connected to port 7570'. In this way, communication may be performed at both ports 7570 and 7570' simultaneously without adding additional buses.
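While the communication ports (e.g., 7530 to 7532) and the controller (e.g., 7540) have been described as separate elements, it should be appreciated that the communication ports and the controller (or any other components) may be implemented as an integrated unit according to embodiments of the present disclosure. FIG. 76 provides a diagrammatic representation of a distributed processor memory chip 7600 with an integrated controller and interface module, consistent with embodiments of the present disclosure. As shown in FIG. 76, processor memory chip 7600 may be implemented with an integrated controller and interface module 7547 configured to perform the functions of controller 7540 and communication ports 7530, 7531, and 7532 of FIG. 75. As shown in FIG. 76, controller and interface module 7547 is configured to communicate with multiple different entities, such as external entities, one or more distributed processor memory chips, etc., via interfaces 7548_1 to 7548_N, which are similar to communication ports (e.g., 7530, 7531, and 7532). Controller and interface module 7547 may be further configured to control communication between distributed processor memory chips or between distributed processor memory chip 7600 and an external entity such as a host computer. In some embodiments, controller and interface module 7547 may include communication interfaces 7548_1 to 7548_N configured to communicate in parallel with one or more other distributed processor memory chips and with external entities such as host computers, communication modules, etc.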
FIG. 77 provides a flowchart representing a process for transferring data between distributed processor memory chips in the scalable processor memory system shown in FIG. 75, consistent with embodiments of the present disclosure. For illustrative purposes, the flow for transferring data will be described with reference to FIG. 75, and it is assumed that data is transferred from first processor memory chip 7500 to second processor memory chip 7500'.
At step S7710, a data transfer request may be received. It should be noted, however, and as described above, that in some embodiments a data transfer request may not be necessary. For example, in some cases, the timing of the data transfer may be predetermined (e.g., by particular software code). In such cases, the data transfer may proceed without a separate data transfer request. Step S7710 may be performed by, e.g., controller 7540, among others. In some embodiments, the data transfer request may include a request to transfer data from one processor subunit of first distributed processor memory chip 7500 to another processor subunit of second distributed processor memory chip 7500'.
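At step S7720, the data transfer timing may be determined. As mentioned, the data transfer timing may be predetermined and may depend on the order of execution of a particular software program. Step S7720 may be performed by, e.g., controller 7540, among others. In some embodiments, the data transfer timing may be determined by considering (1) whether the sending processor subunit is ready to transfer the data and/or (2) whether the receiving processor subunit is ready to receive the data. According to embodiments of the present disclosure, whether one or more other timing constraints are satisfied to enable such a data transfer may also be considered. The one or more timing constraints may relate to a time difference between the transfer time from the sending processor subunit and the reception time at the receiving processor subunit, an access request to the processed data from an external entity (e.g., a host computer), a refresh operation performed on memory resources (e.g., a memory array) associated with the sending or receiving processor subunit, etc. According to embodiments of the present disclosure, the processing subunits may be fed by a clock. In some embodiments, the clock supplied to the processing subunits may be controlled, e.g., using a clock enable signal. According to some embodiments of the present disclosure, controller 7540 may control the timing of communication commands by controlling the clock enable signal to processor subunits 7520_1 to 7520_K.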
At step S7730, the data transmission may be performed based on the data transfer timing determined at step S7720. Step S7730 may be performed by, e.g., controller 7540, among others. For example, the sending processor subunit of first processor memory chip 7500 may transfer data to the receiving processor subunit of second processor memory chip 7500' according to the data transfer timing determined at step S7720.
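The timing decision of step S7720 and the gating that enforces it in step S7730 can be sketched as follows. This is an illustrative model only: the function names, the representation of readiness as cycle numbers, and the `blocked` set standing in for constraints such as refresh or host access are all assumptions made for the example.

```python
def determine_transfer_cycle(sender_ready, receiver_ready, blocked=frozenset()):
    """S7720 sketch: earliest cycle at which both subunits are ready
    and no other timing constraint (e.g., a refresh) blocks the transfer."""
    cycle = max(sender_ready, receiver_ready)
    while cycle in blocked:
        cycle += 1
    return cycle


def clock_enable_at(cycle, transfer_cycle):
    """S7730 sketch: the controller keeps the clock enable low (False)
    until the agreed transfer cycle, then drives it high."""
    return cycle >= transfer_cycle


# Sender is ready at cycle 1000, receiver at 1003, refresh occupies 1003 and 1004.
start = determine_transfer_cycle(1000, 1003, blocked={1003, 1004})
```

Here the sender would be held for five cycles beyond its programmed cycle, and the transfer would proceed at the first cycle that satisfies every constraint.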
The disclosed architectures may be useful in a variety of applications. For example, in some cases, the architectures above may facilitate sharing of data among different distributed processor memory chips, such as weights or neuron values or partial neuron values associated with neural networks (especially large neural networks). Additionally, certain operations, such as SUM, AVG, etc., may require data from multiple different distributed processor memory chips. In such cases, the disclosed architectures may facilitate sharing of this data from the multiple distributed processor memory chips. Still further, the disclosed architectures may, for example, facilitate sharing of records between distributed processor memory chips to support join operations of queries.
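For aggregates such as SUM and AVG, each chip would typically need to share only a small partial result rather than its raw data. The sketch below assumes that approach; the function names `local_partial` and `combine` and the sample values are hypothetical and are not part of the disclosure.

```python
def local_partial(values):
    """Computed on each chip: a (sum, count) pair over its local data."""
    return sum(values), len(values)


def combine(partials):
    """Computed wherever the partials are gathered: global SUM, COUNT and AVG."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    average = total / count if count else None
    return total, count, average


# Three chips, each holding a different slice of the data set.
chip_data = [[1, 2, 3], [4, 5], [6]]
partials = [local_partial(values) for values in chip_data]
total, count, average = combine(partials)
```

Only two numbers per chip cross the inter-chip buses in this scheme, regardless of how much data each chip holds.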
It should also be noted that while the embodiments above have been described relative to distributed processor memory chips, the same principles and techniques may be applied to, e.g., conventional memory chips that do not include distributed processor subunits. For example, in some cases, multiple memory chips may be combined together into a multi-port memory chip, forming an array of memory chips even without an array of processor subunits. In another embodiment, multiple memory chips may be combined to form an array of connected memories, effectively presenting the host with one larger memory made up of the multiple memory chips.
The internal connection of a port may be to the main bus or to one of the internal processor subunits included in the processing array.
In-Memory Zero Detection
Some embodiments of the present disclosure are directed to memory units for detecting a zero value stored in one or more particular addresses of a plurality of memory banks. This zero-value detection feature of the disclosed memory units may be useful in reducing power consumption of a computing system and, additionally or alternatively, may reduce the processing time needed to retrieve zero values from memory. The feature may be especially relevant in systems where much of the data read is actually zero and is used in computations such as multiplication, addition, subtraction, and more, for which retrieving the zero value from memory may be unnecessary (e.g., the product of a zero value and any other value is zero), and where the compute circuit may use the fact that one of the operands is zero to compute the result more efficiently in time and energy. In such cases, detection of the presence of a zero value may be used in place of a memory access and retrieval of the zero value from memory.
Throughout this section, the disclosed embodiments are described relative to a read function. It should be noted, however, that the disclosed architectures and techniques apply equally to zero-value write operations or, where other values are likely to appear more often, to operations on other particular predetermined non-zero values.
In the disclosed embodiments, rather than retrieving a zero value from memory, when such a value is detected at a particular address, the memory unit may return a zero value indicator to one or more circuits external to the memory unit (e.g., one or more processors, CPUs, etc. located outside the memory unit). The zero value is a multi-bit zero value (e.g., a zero-value byte, a zero-value word, a multi-bit zero value that is less than one byte or more than one byte, and the like). The zero value indicator is a 1-bit signal indicating a zero value stored in memory, so transferring the 1-bit indicator is beneficial compared to transferring the n data bits stored in memory. The transmitted zero indication may reduce the energy consumed for the transfer to about 1/n and may speed up computations, e.g., where multiplications are involved in computing inputs by neuron weights, convolutions, applying kernels to input data, and many other computations associated with trained neural networks, artificial intelligence, and a wide array of other types of computation. To provide this functionality, the disclosed memory units may include one or more zero-value detection logic units that may detect the presence of a zero value in a particular location in memory, prevent retrieval of the zero value (e.g., via a read command), and cause a zero value indicator to be transmitted instead to circuitry external to the memory unit (e.g., using one or more control lines of the memory, one or more buses associated with the memory unit, etc.). Zero-value detection may be performed at the memory mat level, at the bank level, at the sub-bank level, at the chip level, etc.
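The saving from sending a 1-bit indicator instead of an n-bit word can be made concrete with a toy read path. This is a sketch under stated assumptions: the function `read_with_zero_indicator`, the 16-bit word width, and the choice to count only data bits for non-zero words are all illustrative and are not taken from the disclosure.

```python
WORD_BITS = 16  # assumed word width n for this example


def read_with_zero_indicator(memory, address, word_bits=WORD_BITS):
    """Return (zero_indicator, data, bits_transferred) for one read.

    A stored zero is reported with a single indicator bit and no data;
    any other value is transferred as a full word."""
    value = memory[address]
    if value == 0:
        return True, None, 1
    return False, value, word_bits


memory = [0, 0, 0x1234, 0]
results = [read_with_zero_indicator(memory, a) for a in range(len(memory))]
bits_sent = sum(bits for _, _, bits in results)
baseline_bits = WORD_BITS * len(memory)
```

For this sparse example, 19 bits cross the interface instead of 64; each zero word on its own costs 1 bit instead of n.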
It should be noted that while the disclosed embodiments are described relative to delivering a zero indicator to a location external to a memory chip, the disclosed embodiments and features may also provide significant benefits in systems where processing is done within the memory chip. For example, in embodiments such as the distributed processor memory chips disclosed herein, processing may be performed on the data in various memory banks by corresponding processor subunits. In many cases, such as the execution of neural networks or data analytics, where the associated data may include many zeros, the disclosed techniques may speed up processing and/or reduce the power consumption associated with processing performed by the processor subunits in a distributed processor memory chip.
FIG. 78A illustrates a system 7800 for detecting, at a chip level, a zero value stored in one or more particular addresses of a plurality of memory banks implemented in a memory chip 7810, consistent with embodiments of the present disclosure. System 7800 may include memory chip 7810 and host 7820. Memory chip 7810 may include a plurality of control units, and each control unit may have a dedicated memory bank. For example, a control unit may be operably connected to its dedicated memory bank.
In some cases, e.g., relative to the distributed processor memory chips disclosed herein, which include processor subunits spatially distributed among an array of memory banks, processing within the memory chip may involve memory accesses (whether reads or writes). Even for processing internal to the memory chip, the disclosed techniques of detecting a zero value associated with a read or write command may allow an internal processor unit or subunit to forgo transferring the actual zero value. Instead, in response to zero-value detection and transmission of a zero value indicator (e.g., to one or more internal processing subunits), the distributed processor memory chip may save energy that would otherwise have been used to transmit zero data values within the memory chip.
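In another example, each of memory chip 7810 and host 7820 may include an input/output (IO) to enable communication between memory chip 7810 and host 7820. Each IO may be coupled with zero value indicator line 7830A and bus 7840A. Zero value indicator line 7830A may transfer a zero value indicator from memory chip 7810 to host 7820, where the zero value indicator may include a 1-bit signal generated by memory chip 7810 upon detecting a zero value stored in a particular address of a memory bank requested by host 7820. Upon receiving the zero value indicator via zero value indicator line 7830A, host 7820 may perform one or more predefined actions associated with the zero value indicator. For example, if host 7820 requested memory chip 7810 to retrieve an operand for a multiplication, host 7820 may compute the multiplication more efficiently, because it will know from the received zero value indicator (without receiving the actual memory value) that one of the operands is zero. Host 7820 may also provide instructions, data, and other input to memory chip 7810 and read output from memory chip 7810 via bus 7840. Upon receiving a communication from host 7820, memory chip 7810 may retrieve the data associated with the received communication and transfer the retrieved data to host 7820 via bus 7840.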
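On the host side, one possible "predefined action" for a zero value indicator is simply to skip the multiplication. The sketch below assumes each read arrives as an `(is_zero, value)` pair; that pairing and the function name `dot_with_zero_skip` are hypothetical conventions chosen for the example.

```python
def dot_with_zero_skip(weight_reads, activations):
    """Multiply-accumulate that skips every term whose weight was
    reported through the zero value indicator.

    weight_reads: iterable of (is_zero, value) pairs, value is None when is_zero.
    Returns (accumulated result, number of multiplications actually performed)."""
    accumulator = 0
    multiplies = 0
    for (is_zero, weight), activation in zip(weight_reads, activations):
        if is_zero:
            continue  # product is known to be zero; no data word was received
        accumulator += weight * activation
        multiplies += 1
    return accumulator, multiplies


weight_reads = [(True, None), (False, 3), (True, None), (False, 2)]
activations = [5, 6, 7, 8]
result, multiplies = dot_with_zero_skip(weight_reads, activations)
```

The result matches the full dot product, but only half of the multiplications are carried out for this input.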
In some embodiments, a host may send a zero value indicator, rather than a zero data value, to the memory chip. In this way, the memory chip (e.g., a controller disposed on the memory chip) may store or refresh a zero value in memory without having to receive the zero data value. Such an update may occur based on receipt of the zero value indicator (e.g., as part of a write command).
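FIG. 78B illustrates a memory chip 7810 for detecting, at a memory bank level, a zero value stored in one or more particular addresses of a plurality of memory banks 7811A to 7811B, consistent with embodiments of the present disclosure. Memory chip 7810 may include a plurality of memory banks 7811A to 7811B and IO bus 7812. Although FIG. 78B depicts two memory banks 7811A to 7811B implemented in memory chip 7810, memory chip 7810 may include any number of memory banks.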
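IO bus 7812 may be configured to transfer data to and from an external chip (e.g., host 7820 in FIG. 78A) via bus 7840B. Bus 7840B may function similarly to bus 7840A in FIG. 78A. IO bus 7812 may also transmit a zero value indicator via zero value indicator line 7830B, which may function similarly to zero value indicator line 7830A in FIG. 78A. IO bus 7812 may also be configured to communicate with memory banks 7811A to 7811B via internal zero value indicator line 7831 and bus 7841. IO bus 7812 may transmit data received from the external chip to one of memory banks 7811A to 7811B. For example, IO bus 7812 may transfer, via bus 7841, data including an instruction to read data stored in a particular address of memory bank 7811A. A multiplexer may be included between IO bus 7812 and memory banks 7811A to 7811B and may be connected by internal zero value indicator line 7831 and bus 7841A. The multiplexer may be configured to transmit data received from IO bus 7812 to a particular memory bank, and may be further configured to transmit data or a zero value indicator received from a particular memory bank to IO bus 7812.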
In some cases, the host entity may be configured only to receive regular data transmissions and may not be equipped to interpret or respond to the disclosed zero value indicator. In such a case, the disclosed embodiments (e.g., controller/chip IO, etc.) may regenerate a zero value on the data lines to the host IO in place of the zero value indicator signal, and thus may still save data transmission power inside the chip.
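Each of memory banks 7811A to 7811B may include a control unit. The control unit may detect a zero value stored in a requested address of the memory bank. Upon detecting a stored zero value, the control unit may generate a zero value indicator and transmit it via internal zero value indicator line 7831 to IO bus 7812, from which the zero value indicator is further transferred to the external chip via zero value indicator line 7830B.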
FIG. 79 illustrates a memory bank 7911 for detecting, at a memory mat level, a zero value stored in one or more particular addresses of a plurality of memory mats, consistent with embodiments of the present disclosure. In some embodiments, memory bank 7911 may be organized into memory mats 7912A to 7912B, each of which may be independently controlled and independently accessed. Memory bank 7911 may include memory mat controllers 7913A to 7913B, which may include zero-value detection logic units 7914A to 7914B. Each of memory mat controllers 7913A to 7913B may allow reads and writes to locations on memory mats 7912A to 7912B. Memory bank 7911 may further include a read disable element, local sense amplifiers 7915A to 7915B, and/or a global sense amplifier 7916.
Each of memory mats 7912A to 7912B may include a plurality of memory cells. Each of the plurality of memory cells may store one bit of binary information. For example, any of the memory cells may individually store a zero value. If all memory cells in a particular memory mat store zero values, then a zero value may be associated with the entire memory mat.
Each of memory mat controllers 7913A to 7913B may be configured to access a dedicated memory mat and to read data stored in, or write data to, the dedicated memory mat.
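In some embodiments, zero-value detection logic unit 7914A or 7914B may be implemented in memory bank 7911. One or more zero-value detection logic units 7914A to 7914B may be associated with memory banks, memory sub-banks, memory mats, and sets of one or more memory cells. Zero-value detection logic unit 7914A or 7914B may detect that a particular requested address (e.g., in memory mat 7912A or 7912B) stores a zero value. The detection may be performed in a number of ways.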
A first method may include using a digital comparator against zero. The digital comparator may be configured to take two numbers as input in binary form and determine whether the first number (the retrieved data) equals the second number (zero). If the digital comparator determines that the two numbers are equal, the zero-value detection logic unit may generate a zero value indicator. The zero value indicator may be a 1-bit signal and may disable the amplifiers (e.g., local sense amplifiers 7915A to 7915B), transmitters, and buffers that would otherwise send the data bits to the next level (e.g., IO bus 7812 in FIG. 78B). The zero value indicator may be further transmitted via zero value indicator line 7931A or 7931B to global sense amplifier 7916, though in some cases the global sense amplifier may be bypassed.
A second method for zero detection may include using an analog comparator. The analog comparator may function similarly to the digital comparator, except that the voltages of two analog inputs are used for the comparison. For example, all bits may be sensed, and the comparator may act as a logical OR function across the signals.
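The first two methods reach the same decision by different routes: an equality comparison against zero, or an OR across all sensed bits. A purely digital sketch of both follows; it does not model the analog voltages, and the function names and the 8-bit width are assumptions made for illustration.

```python
def is_zero_by_comparator(word):
    """First method (sketch): compare the retrieved word with the constant zero."""
    return word == 0


def is_zero_by_or_reduction(word, width=8):
    """Second method (digital stand-in): OR all sensed bits together;
    the word is zero exactly when no bit is high."""
    any_bit_high = 0
    for bit_index in range(width):
        any_bit_high |= (word >> bit_index) & 1
    return any_bit_high == 0


# The two detectors agree on every 8-bit value.
agree = all(
    is_zero_by_comparator(word) == is_zero_by_or_reduction(word)
    for word in range(256)
)
```

The OR form maps naturally onto wired logic, since it needs no reference operand and can be evaluated as the bits are sensed.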
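A third method for zero-value detection may include using the signals transferred from local sense amplifiers 7915A to 7915B into global sense amplifier 7916, where global sense amplifier 7916 is configured to sense whether any of its inputs is high (non-zero) and to use that logic signal to control the next level of amplifiers. Local sense amplifiers 7915A to 7915B and global sense amplifier 7916 may include a plurality of transistors configured to sense low-power signals from the plurality of memory banks, and the amplifiers amplify small voltage swings to higher voltage levels so that data stored in the plurality of memory banks can be interpreted by at least one controller, such as memory mat controller 7913A or 7913B. For example, memory cells may be laid out in rows and columns on memory bank 7911. Each line may be attached to every memory cell in a row. The lines that run along the rows are called word lines, which are activated by selectively applying a voltage to them. The lines that run along the columns are called bit lines, and two such complementary bit lines may be attached to a sense amplifier at the edge of the memory array. The number of sense amplifiers may correspond to the number of bit lines (columns) on memory bank 7911. To read a bit from a particular memory cell, the word line along the cell's row is turned on, activating all the memory cells in that row. The stored value (0 or 1) from each cell then becomes available on the bit line associated with that cell. At the ends of the two complementary bit lines, the sense amplifier may amplify the small voltages to normal logic levels. The bit from the desired cell may then be latched from the cell's sense amplifier into a buffer and placed on the output bus.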
A fourth method for zero-value detection may include using an extra bit for each word saved to memory, set at write time if the value is zero, and using that extra bit when the data is read out to know whether the data is zero. This method may also avoid writing all the zeros to memory, thereby saving more energy.
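The fourth method can be modeled as a memory that keeps one flag bit next to each word. This is a minimal sketch, assuming the flag is set on a zero write and that the data cells are left untouched in that case; the class name `ZeroFlagMemory` and its fields are hypothetical.

```python
class ZeroFlagMemory:
    """Toy memory with one extra 'is zero' bit per word (hypothetical model)."""

    def __init__(self, size):
        self.data = [0] * size          # data cells
        self.zero_flag = [True] * size  # extra bit per word, set at write time
        self.data_writes = 0            # counts writes that touched data cells

    def write(self, address, value):
        if value == 0:
            # Only the flag is updated; the data cells are not written.
            self.zero_flag[address] = True
            return
        self.zero_flag[address] = False
        self.data[address] = value
        self.data_writes += 1

    def read(self, address):
        """Return (zero_indicator, data); data is None when the flag is set."""
        if self.zero_flag[address]:
            return True, None
        return False, self.data[address]


mem = ZeroFlagMemory(4)
mem.write(1, 5)
mem.write(1, 0)   # overwrites with zero by setting the flag only
mem.write(2, 9)
```

Note that after the zero write the data cells at address 1 still hold the stale value 5, so in this model the flag must always be consulted before the data is trusted.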
As described above and throughout the disclosure, some embodiments may include a memory unit (such as memory unit 7800) that includes a plurality of processor subunits. These processor subunits may be spatially distributed on a single substrate (e.g., a substrate of a memory chip such as memory unit 7800). Moreover, each of the plurality of processor subunits may be dedicated to a corresponding memory bank among a plurality of memory banks of memory unit 7800, and these memory banks dedicated to corresponding processor subunits may also be spatially distributed on the substrate. In some embodiments, memory unit 7800 may be associated with a particular task (e.g., performing one or more operations associated with running a neural network, etc.), and each of the processor subunits of memory unit 7800 may be responsible for performing a portion of that task. For example, each processor subunit may be equipped with instructions that may include data handling and memory operations, arithmetic and logical operations, etc. In some cases, the zero-value detection logic may be configured to provide a zero value indicator to one or more of the described processor subunits spatially distributed on memory unit 7800.
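Reference is now made to FIG. 80, which is a flowchart illustrating an exemplary method 8000 of detecting a zero value stored in a particular address of a plurality of memory banks, consistent with embodiments of the present disclosure. Method 8000 may be performed by a memory chip (e.g., memory chip 7810 of FIG. 78B). In particular, a controller of the memory unit (e.g., controller 7913A of FIG. 79) and a zero-value detection logic unit (e.g., zero-value detection logic unit 7914A) may perform method 8000.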
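In step 8010, a read or write operation may be initiated by any suitable technique. In some cases, a controller may receive a request to read data stored in a particular address of a plurality of discrete memory banks (e.g., the memory banks depicted in FIG. 78). The controller may be configured to control at least one aspect of read/write operations relative to the plurality of discrete memory banks.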
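In step 8020, one or more zero-value detection circuits may be used to detect the presence of a zero value associated with the read or write command. For example, a zero-value detection logic unit (e.g., zero-value detection logic unit 7830 of FIG. 78) may detect a zero value associated with the particular address involved in the read or write.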
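In step 8030, the controller may transmit a zero value indicator to one or more circuits external to the memory unit in response to the zero-value detection performed by the zero-value detection logic unit in step 8020. For example, the zero-value detection logic may detect that the requested address stores a zero value and may transmit an indication that the value is zero to an entity (e.g., one or more circuits) outside the memory chip (or within the memory chip, e.g., where the disclosed distributed processor memory chip includes processor subunits distributed among an array of memory banks). If no zero value associated with the read or write command is detected, the controller may transmit the data value rather than a zero value indicator. In some embodiments, the one or more circuits to which the zero value indicator is returned may be internal to the memory unit.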
While the disclosed embodiments have been described with respect to zero-value detection, the same principles and techniques apply to detecting other memory values (e.g., 1, etc.). In some cases, in addition to a zero value indicator, the detection logic may return one or more indicators of other values (e.g., 1, etc.) associated with a read or write command, and these indicators may be returned/transmitted whenever a value corresponding to one of the value indicators is detected. In some cases, the values may be adjusted by a user (e.g., by updating one or more registers). Such updates may be especially useful where characteristics of a data set are known and it is understood (e.g., on the part of the user) that certain values are likely to be more prevalent in the data than others. In such cases, one, two, three, or more value indicators may be associated with the most prevalent values in the data set.
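Generalizing from a single zero indicator to several user-adjustable value indicators might look like the following. This is a sketch only: the class `ValueIndicatorLogic`, the list of registers, and the tagged return tuples are assumptions chosen for illustration and do not describe a specific circuit in the disclosure.

```python
class ValueIndicatorLogic:
    """Detection logic with one register per indicated value (hypothetical)."""

    def __init__(self, values=(0,)):
        self.registers = list(values)

    def set_register(self, index, value):
        """User update, e.g. after learning which values dominate a data set."""
        self.registers[index] = value

    def read(self, memory, address):
        """Return ('indicator', register index) on a match, else ('data', value)."""
        value = memory[address]
        if value in self.registers:
            return "indicator", self.registers.index(value)
        return "data", value


memory = [0, 1, 7, 255]
logic = ValueIndicatorLogic(values=(0, 1))
before = logic.read(memory, 3)   # 255 is not yet an indicated value
logic.set_register(1, 255)       # replace the indicator for 1 with one for 255
after = logic.read(memory, 3)
```

With k registers, an indicator needs only enough bits to identify which register matched, which is still far fewer than a full data word for small k.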
Compensating for DRAM Activation Penalty
In certain types of memory (e.g., DRAM), memory cells may be arranged in arrays within a memory bank, and the values held in the memory cells may be accessed and retrieved (read) one line of memory cells in the array at a time. This read process may involve first opening (activating) a line (or row) of memory cells to make the data values stored by the memory cells available. Next, the values of the memory cells in the open line may be sensed simultaneously, and column addresses may be used to cycle through the individual memory cell values or groups of memory cell values (i.e., words) and to connect each memory cell value to an external data bus so that it can be read. These processes take time. In some cases, opening a memory line for reading may require 32 cycles of compute time, and reading the values from the open line may require another 32 cycles. If the next line to be read is opened only after the read operation of the currently open line completes, significant latency may result. In this example, no data is read during the 32 cycles needed to open the next line, so reading each line effectively requires a total of 64 cycles rather than just the 32 cycles needed to traverse the line data. Conventional memory systems do not allow opening a second line in the same bank while a first line is being read or written. To save latency, the next line to be opened may therefore be in a different bank or in a special bank for dual-line access, as discussed in further detail below. Alternatively, the current line may be sampled in its entirety into flip-flops or latches before the next line is opened, with all processing done on the flip-flops/latches while the next line is being opened. If the next predicted line is in the same bank (and none of the above exists), the latency may be unavoidable and the system may need to wait. These mechanisms are relevant both to standard memories and, especially, to memory processing devices.
The presently disclosed embodiments may reduce this latency by, e.g., predicting the next memory line to be opened before the read operation of the currently open memory line has completed. That is, if the next line to be opened can be predicted, the process of opening it can begin before the read of the current line has finished. Depending on when in the process the next-line prediction is made, the latency associated with opening the next line may be reduced from 32 cycles (in the particular example described above) to fewer than 32 cycles. In one particular example, if the next line to be opened is predicted 20 cycles in advance, the additional latency is only 12 cycles. In another example, if the next line is predicted 32 cycles in advance, there is no latency at all. As a result, rather than requiring a total of 64 cycles to open and read each row serially, the effective time to read each row may be reduced by opening the next row while the current row is being read.
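The arithmetic in this example reduces to a single expression: the exposed latency is the activation time minus the prediction lead, floored at zero. The sketch below restates that with the 32-cycle figures from the text; the function names are illustrative only.

```python
def extra_latency(activation_cycles, prediction_lead):
    """Cycles of activation that are NOT hidden behind the current read."""
    return max(0, activation_cycles - prediction_lead)


def effective_cycles_per_line(read_cycles, activation_cycles, prediction_lead):
    """Steady-state cost of one line: its read plus any exposed activation time."""
    return read_cycles + extra_latency(activation_cycles, prediction_lead)


ACTIVATION = 32  # cycles to open a line, from the example above
READ = 32        # cycles to read an open line, from the example above

no_prediction = effective_cycles_per_line(READ, ACTIVATION, prediction_lead=0)
lead_20 = effective_cycles_per_line(READ, ACTIVATION, prediction_lead=20)
lead_32 = effective_cycles_per_line(READ, ACTIVATION, prediction_lead=32)
```

Predicting earlier than the activation time brings no further gain in this model, since the exposed latency cannot go below zero.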
The mechanisms below may require that the current line and the predicted line be in different banks, but they may also be used where there is a bank that supports activating one line while working on another at the same time.
In the disclosed embodiments, next row prediction may be performed using various techniques (discussed in more detail below). For example, next row prediction may be based on pattern recognition, on a predetermined row access schedule, on the output of an artificial intelligence model (e.g., a trained neural network that analyzes row accesses and predicts the next row to be opened), or on any other suitable prediction technique. In some embodiments, 100% successful prediction may be achieved by using a delayed address generator or a formula, as described below, or other methods. Prediction may include building a system capable of predicting the next line to be opened sufficiently ahead of the time at which that line must be accessed. In some cases, next row prediction may be performed by a next row predictor, which may be implemented in various ways. One example is a predicted address generator used together with the generation of the current addresses for reading from and/or writing to memory rows. The entity that generates addresses for accessing the memory (for reading or writing) may be based on any logic circuit or controller/CPU that executes software instructions. The predicted address generator may include a pattern learning model that observes the rows accessed, identifies one or more patterns associated with the accesses (e.g., sequential line access, access to every second line, access to every third line, etc.), and estimates the next row to be accessed based on the observed patterns. In other examples, the predicted address generator may include a unit that applies a formula or algorithm to predict the next row to be accessed. In still other embodiments, the predicted address generator may include a trained neural network that outputs the predicted next row to be accessed (including one or more addresses associated with the predicted row) based on inputs such as the current address/row being accessed and the last two, three, four, or more addresses/rows accessed. Predicting the next memory line to be accessed with any of the described predicted address generators can significantly reduce the latency associated with memory access. The described predicted address/row generators may be used in any system that accesses memory to retrieve data. In some cases, the described predicted address/row generators and the associated techniques for predicting the next memory line access may be especially suitable for systems that execute artificial intelligence models, because AI models may be associated with repetitive memory access patterns that facilitate next row prediction.
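By way of illustration only, a pattern learning model of the kind described above might be sketched as a simple stride detector. The sketch assumes that a "pattern" is a constant difference between consecutive row numbers (sequential access, every second line, every third line, and so on); the class name `StridePredictor` and the `history` parameter are hypothetical and are not taken from the disclosure.

```python
from collections import deque
from typing import Optional


class StridePredictor:
    """Hypothetical pattern-learning next-row predictor (illustrative only)."""

    def __init__(self, history: int = 4) -> None:
        # Keep only the most recent `history` accessed row numbers.
        self.rows = deque(maxlen=history)

    def observe(self, row: int) -> None:
        """Record a row that was actually accessed."""
        self.rows.append(row)

    def predict(self) -> Optional[int]:
        """Return the predicted next row, or None if no pattern is seen."""
        # At least three accesses are needed to see two equal strides.
        if len(self.rows) < 3:
            return None
        rows = list(self.rows)
        strides = {later - earlier for earlier, later in zip(rows, rows[1:])}
        if len(strides) != 1:
            return None  # the recent accesses do not share a single stride
        return rows[-1] + strides.pop()
```

With accesses to rows 0, 2, and 4 observed, such a predictor would propose row 6; once an access breaks the stride, it would decline to predict until the history again shows a single stride.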
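FIG. 81A illustrates a system 8100 for activating a next row associated with memory bank 8180 based on next row prediction, consistent with embodiments of the present disclosure. System 8100 may include a current and predicted address generator 8192, a bank controller 8191, and memory banks 8180A to 8180B. The address generator may be an entity that generates addresses for accessing memory banks 8180A to 8180B, and may be based on any logic circuit, controller, or microprocessor that executes a software program. Bank controller 8191 may be configured to access a current row of memory bank 8180A (e.g., using a current row identifier generated by address generator 8192). Bank controller 8191 may also be configured to activate a predicted next row to be accessed within memory bank 8180B based on a predicted row identifier generated by address generator 8192. The following example describes two banks. In other examples, more banks may be used. In some embodiments, there may be a memory bank that allows more than one row to be accessed at a time (as discussed below), so that the same process may be carried out on a single bank. As described above, activation of the predicted next row to be accessed may begin before completion of a read operation performed with respect to the current row being accessed. Thus, in some cases, address generator 8192 may predict the next row to be accessed and may send an identifier (e.g., one or more addresses) of the predicted next row to bank controller 8191 at any time before the access to the current row has completed. This timing may allow the bank controller to initiate activation of the predicted next row at any point while the current row is being accessed and before the access to the current row completes. In some cases, bank controller 8191 may initiate activation of the predicted next row of memory bank 8180 at the same time as (or within a few clock cycles of) completion of the activation of the current row to be accessed and/or the start of the read operation with respect to the current row.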
In some embodiments, the operation performed with respect to the current row associated with the current address may be a read or a write operation. In some embodiments, the current row and the next row may be in the same memory bank. In some embodiments, the same memory bank may allow the next row to be accessed while the current row is being accessed. The current row and the next row may be in different memory banks. In some embodiments, the memory unit may include a processor configured to generate the current address and the predicted address. In some embodiments, the memory unit may include a distributed processor. The distributed processor may include a plurality of processor subunits of a processing array that are spatially distributed among a plurality of discrete memory banks of a memory array. In some embodiments, the predicted address may be generated by a series of flip-flops that sample the generated address with a delay. The delay may be configurable via a multiplexer that selects between the flip-flops storing the sampled addresses.
It should be noted that upon confirming that the predicted next row is in fact the next row that the executing software requests to access (e.g., after completion of the read operation with respect to the current row), the predicted next row may become the current row to be accessed. In the disclosed embodiments, because the process for activating the predicted next row can be initiated before the current row read operation completes, the next row to be accessed may already be fully or partially activated by the time the predicted next row is confirmed as the correct next row to be accessed. This can significantly reduce the latency associated with line activation. If the next row is activated such that its activation ends before or at the same time as the read of the current row ends, a power reduction may be obtained.
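Current and predicted address generator 8192 may include any suitable logic components, computing units, memory units, algorithms, trained models, etc., configured to identify rows to be accessed in memory bank 8180 (e.g., based on program execution) and to predict the next row to be accessed (e.g., based on an observed pattern in row accesses, based on a predetermined pattern (n+1, n+2), etc.). For example, in some embodiments, current and predicted address generator 8192 may include a counter 8192A, a current address generator 8192B, and a predicted address generator 8192C. Current address generator 8192B may be configured to generate a current address of the current row to be accessed in memory bank 8180 based on the output of counter 8192A, for example, in response to a request from a computing unit. The address associated with the current row to be accessed may be provided to bank controller 8191. Predicted address generator 8192C may be configured to determine a predicted address of the next row to be accessed in memory bank 8180 based on the output of counter 8192A, based on a predetermined access pattern (e.g., in conjunction with counter 8192A), or based on the output of a trained neural network or another type of pattern prediction algorithm that observes line accesses and predicts the next line to be accessed based on, for example, patterns associated with the observed line accesses. Address generator 8192 may provide the predicted next row address from predicted address generator 8192C to bank controller 8191.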
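In some embodiments, current address generator 8192B and predicted address generator 8192C may be implemented inside or outside system 8100. An external host may also be implemented outside system 8100 and further connected to system 8100. For example, current address generator 8192B may be software running at an external host that executes a program, and, to avoid any latency, predicted address generator 8192C may be implemented either inside system 8100 or outside system 8100.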
As noted, the predicted next row address may be determined using a trained neural network that predicts the next row to be accessed based on inputs that may include one or more previously accessed row addresses. The trained neural network, or another type of model, may run within logic associated with predicted address generator 8192C. In some cases, the trained neural network or other model may be executed by one or more computing units that are external to predicted address generator 8192C but in communication with it.
In some embodiments, predicted address generator 8192C may include a replica, or a substantial replica, of current address generator 8192B. In addition, the timing of the operations of current address generator 8192B and predicted address generator 8192C may be fixed or adjustable relative to one another. For example, in some cases, predicted address generator 8192C may be configured to output the address identifier associated with the predicted next row at a fixed time (e.g., a fixed number of clock cycles) relative to when current address generator 8192B issues the address identifier associated with the next row to be accessed. In some cases, the predicted next row identifier may be generated before or after activation of the current row to be accessed begins, before or after the read operation associated with the current row to be accessed begins, or at any time before the read operation associated with the current row being accessed completes. In some cases, the predicted next row identifier may be generated at the same time that activation of the current row to be accessed begins, or at the same time that the read operation associated with the current row to be accessed begins.
In other cases, the time between generation of the predicted next row identifier and either the activation of the current row to be accessed or the start of the read operation associated with the current row may be adjustable. For example, in some cases, this time may be lengthened or shortened during operation of memory unit 8100 based on values associated with one or more operating parameters. In some cases, a current temperature (or any other parameter value) associated with the memory unit or with another component of the computing system may cause current address generator 8192B and predicted address generator 8192C to change their relative timing of operation. In embodiments involving in-memory processing, the prediction mechanism may be part of that logic.
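Current and predicted address generator 8192 may generate a confidence level associated with the determination of the predicted next row to be accessed. This confidence level (which may be determined by predicted address generator 8192C as part of the prediction process) may be used to decide, for example, whether to initiate activation of the predicted next row during the read operation of the current row (i.e., before the current row read operation has completed and before the identity of the next row to be accessed has been confirmed). For example, in some cases, the confidence level associated with the predicted next row to be accessed may be compared with a threshold level. If the confidence level falls below the threshold level, memory unit 8100 may, for example, forgo activating the predicted next row. If, on the other hand, the confidence level exceeds the threshold level, memory unit 8100 may initiate activation of the predicted next row in memory bank 8180.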
The mechanism for testing the confidence level of the predicted next row against the threshold level, and for subsequently initiating or not initiating activation of the predicted next row, may be implemented in any suitable manner. In some cases, for example, if the confidence level associated with the predicted next row falls below the threshold, predicted address generator 8192C may forgo outputting its predicted next row result to downstream logic components. Alternatively, in such a case, current and predicted address generator 8192 may withhold the predicted next row identifier from bank controller 8191, or the bank controller (or another logic unit) may be equipped to use the confidence level of the predicted next row to determine whether to begin activating the predicted next row before the read operation associated with the current row being read has completed.
The confidence level associated with the predicted next row may be generated in any suitable manner. In some cases, such as where the predicted next row is identified based on a predetermined, known access pattern, predicted address generator 8192C may generate a high confidence level or, in view of the predetermined pattern of row accesses, may forgo generating a confidence level altogether. On the other hand, where predicted address generator 8192C executes one or more algorithms to monitor row accesses and outputs a predicted row based on patterns computed from the monitored row accesses, or where one or more trained neural networks or other models are configured to output a predicted next row based on inputs that include recent row accesses, the confidence level of the predicted next row may be determined based on any relevant parameters. For example, in some cases, the confidence level may depend on whether one or more previous next row predictions proved to be accurate (e.g., a past performance indicator). The confidence level may also be based on one or more characteristics of the inputs to the algorithm/model. For example, inputs that include actual row accesses following a pattern may result in a higher confidence level than actual row accesses that exhibit less patterning. And in some cases where randomness is detected in the stream of inputs that includes recent row accesses, for example, the resulting confidence level may be low. In addition, where randomness is detected, the next row prediction process may be suspended altogether, one or more of the components of memory unit 8100 may ignore the next row prediction, or any other action may be taken to forgo activating the predicted next row.
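As one possible illustration of a confidence level derived from a past performance indicator and compared with a threshold level, the sketch below scores the predictor by the fraction of its recent predictions that proved correct. The class name `ConfidenceGate`, the sliding window, and the default values are assumptions made for the example only; the disclosure leaves the manner of generating and testing the confidence level open.

```python
from collections import deque


class ConfidenceGate:
    """Hypothetical confidence check based on recent prediction accuracy."""

    def __init__(self, threshold: float = 0.5, window: int = 8) -> None:
        self.threshold = threshold
        # True/False outcome of each of the last `window` predictions.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted_row: int, actual_row: int) -> None:
        """Store whether an earlier prediction matched the actual next row."""
        self.outcomes.append(predicted_row == actual_row)

    @property
    def confidence(self) -> float:
        """Fraction of recent predictions that were correct (0.0 if none)."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_activate(self) -> bool:
        """Activate the predicted row early only above the threshold."""
        return self.confidence > self.threshold
```

In this sketch a bank controller would consult `should_activate()` before starting the early activation; when the confidence is at or below the threshold, the prediction is simply not acted upon.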
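In some cases, a feedback mechanism may be included with respect to the operation of memory unit 8100. For example, periodically, or even after every next row prediction, the accuracy with which predicted address generator 8192C predicts the actual next row to be accessed may be determined. In some cases, if there is an error in predicting the next row to be accessed (or after a predetermined number of errors), the next row prediction operation of predicted address generator 8192C may be temporarily suspended. In other cases, predicted address generator 8192C may include a learning element, such that one or more aspects of its prediction operation can be adjusted based on received feedback regarding the accuracy of its predictions of the next row to be accessed. This capability can refine the operation of predicted address generator 8192C, so that address generator 8192C can adapt to changing access patterns and the like.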
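The feedback mechanism described above (suspending prediction after a predetermined number of errors) could, for example, be modeled as follows. This is a sketch under stated assumptions: errors are counted cumulatively since the last suspension, the suspension lasts for a fixed number of row accesses, and the names `PredictionFeedback`, `max_errors`, and `suspend_accesses` are hypothetical.

```python
class PredictionFeedback:
    """Hypothetical feedback logic that pauses next-row prediction."""

    def __init__(self, max_errors: int = 2, suspend_accesses: int = 4) -> None:
        self.max_errors = max_errors
        self.suspend_accesses = suspend_accesses
        self.errors = 0          # mispredictions since the last suspension
        self.suspended_left = 0  # row accesses left before prediction resumes

    @property
    def enabled(self) -> bool:
        """True while next-row prediction is allowed to operate."""
        return self.suspended_left == 0

    def report(self, predicted_row: int, actual_row: int) -> None:
        """Call once per row access with the prediction and the actual row."""
        if self.suspended_left > 0:
            # Prediction is paused; just count down the suspension.
            self.suspended_left -= 1
            return
        if predicted_row != actual_row:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.errors = 0
                self.suspended_left = self.suspend_accesses
```

A learning element, as also contemplated above, would go further and adjust the predictor itself rather than merely pausing it.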
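In some embodiments, the timing of the generation of the predicted next row and/or of the activation of the predicted next row may depend on the overall operation of memory unit 8100. For example, after power-up or after a reset of memory unit 8100, prediction of the next row to be accessed (or forwarding of the predicted next row to bank controller 8191) may be temporarily suspended (e.g., for a predetermined amount of time or number of clock cycles, until a predetermined number of row accesses/reads have completed, until the confidence level of the predicted next row exceeds a predetermined threshold, or based on any other suitable criterion).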
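FIG. 81B illustrates another configuration of memory unit 8100, according to exemplary disclosed embodiments. In system 8100B of FIG. 81B, a cache 8193 may be associated with bank controller 8191. For example, cache 8193 may be configured to store one or more rows of data after they have been accessed, avoiding the need to activate those rows again. Thus, cache 8193 may enable bank controller 8191 to access row data from cache 8193 rather than accessing memory bank 8180. For example, cache 8193 may store the last X rows of data (or follow any other cache retention strategy), and bank controller 8191 may fill cache 8193 according to the predicted rows. Moreover, if a predicted row is already in cache 8193, the predicted row need not be opened again, and the bank controller (or a cache controller implemented in cache 8193) may protect the predicted row from being swapped out. Cache 8193 may provide several benefits. First, because rows are loaded into cache 8193 and the bank controller can access cache 8193 to retrieve row data, no special bank, and no more than one bank, is needed for next row prediction. Second, reading from and writing to cache 8193 may save energy, because the physical distance from bank controller 8191 to cache 8193 is smaller than the physical distance from bank controller 8191 to memory bank 8180. Third, the latency incurred by cache 8193 is typically lower than that of memory bank 8180, because cache 8193 is smaller and closer to controller 8191. In some cases, when the predicted next row is activated in memory bank 8180 by bank controller 8191, the identifier of the predicted next row generated by the predicted address generator may, for example, be stored in cache 8193. Based on program execution or the like, current address generator 8192B may identify the actual next row to be accessed in memory bank 8180. The identifier associated with the actual next row to be accessed may be compared with the identifier of the predicted next row stored in cache 8193. If the actual next row to be accessed is the same as the predicted next row, bank controller 8191 may begin a read operation with respect to that row once its activation has completed (the row may already be fully or partially activated as a result of the next row prediction process). On the other hand, if the actual next row to be accessed (as determined by current address generator 8192B) does not match the predicted next row identifier stored in cache 8193, no read operation will begin with respect to the fully or partially activated predicted next row; instead, the system will begin activating the actual next row to be accessed.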
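The cache behavior described for FIG. 81B (filling the cache according to predicted rows, skipping activation when a predicted row is already cached, and protecting the predicted row from being swapped out) might be sketched as below. The sketch assumes a least-recently-used replacement policy and a single protected row; `RowCache`, `prefetch`, `read`, and the `load_row` callback standing in for a bank activation are all hypothetical names introduced for the example.

```python
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class RowCache:
    """Hypothetical row cache sitting next to a bank controller."""

    def __init__(self, capacity: int = 4) -> None:
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity
        # Cached rows, least recently used first.
        self.rows: "OrderedDict[int, str]" = OrderedDict()
        self.protected: Optional[int] = None  # predicted row, never evicted

    def _insert(self, row: int, data: str) -> None:
        while len(self.rows) >= self.capacity:
            # Evict the least recently used row that is not protected.
            victim = next(r for r in self.rows if r != self.protected)
            del self.rows[victim]
        self.rows[row] = data

    def prefetch(self, row: int, load_row: Callable[[int], str]) -> bool:
        """Cache a predicted row; return True if the bank had to be opened."""
        self.protected = row
        if row in self.rows:
            return False  # already cached, no need to open the row again
        self._insert(row, load_row(row))
        return True

    def read(self, row: int, load_row: Callable[[int], str]) -> Tuple[str, bool]:
        """Return (row data, cache hit flag) for an actual access."""
        if row in self.rows:
            self.rows.move_to_end(row)
            return self.rows[row], True
        data = load_row(row)
        self._insert(row, data)
        return data, False
```

Here every call to `load_row` corresponds to one row activation in the memory bank, so counting those calls gives a rough measure of the activations the cache avoids.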
Dual activation bank
As discussed, it is valuable to describe several mechanisms that allow building a bank capable of activating one row while another row is still being processed. Several embodiments may be provided for a bank that activates an additional row while another row is being accessed. Although the embodiments describe only two-row activation, it should be appreciated that they may be applied to more rows. In the first proposed embodiment, a memory bank may be divided into memory sub-banks, and the described embodiments may be used to perform a read operation with respect to a line in one sub-bank while activating a predicted or required next row in another sub-bank. For example, as shown in FIG. 81C, memory bank 8180 may be arranged to include multiple memory sub-banks 8181. In addition, bank controller 8191 associated with memory bank 8180 may include a plurality of sub-bank controllers associated with the corresponding sub-banks. A first sub-bank controller of the plurality of sub-bank controllers may be configured to enable access to data included in a current row of a first sub-bank of the plurality of sub-banks, while a second sub-bank controller of the plurality of sub-bank controllers may activate a next row in a second sub-bank of the plurality of sub-banks. Only one column decoder may be used when words are accessed in only one sub-bank at a time. The two banks may be tied to the same output bus so as to appear as a single bank. The inputs of the new single bank may also be a single address and an additional row address for opening the next row.
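FIG. 81C illustrates first and second sub-bank row controllers (8183A, 8183B) for each memory sub-bank 8181. Memory bank 8180 may include a plurality of sub-banks 8181, as shown in FIG. 81C. In addition, bank controller 8191 may include a plurality of sub-bank controllers 8183A to 8183B, each associated with a corresponding sub-bank 8181. A first sub-bank controller 8183A of the plurality of sub-bank controllers may be configured to enable access to data included in a current row of a first portion of sub-bank 8181, while a second sub-bank controller 8183B may activate a next row in a second portion of sub-bank 8181.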
Because activating a row directly adjacent to a row being accessed may distort the accessed row and/or corrupt the data being read from it, the disclosed embodiments may be configured such that the predicted next row to be activated is separated by, for example, at least two rows from the current row in the first sub-bank whose data is being accessed. In some embodiments, the rows to be activated may be separated by at least one mat, so that the activations are performed in different mats. The second sub-bank controller may be configured to enable access to data included in a current row of the second sub-bank while the first sub-bank controller activates a next row in the first sub-bank. The activated next row of the first sub-bank may be separated by at least two rows from the current row in the second sub-bank whose data is being accessed.
This predefined distance between the row being read/accessed and the row being activated may be determined by hardware, for example, by coupling different portions of the memory bank to different row decoders, and software may maintain the predefined distance so as not to corrupt the data. The spacing between the rows may exceed two rows (it may be, for example, three, four, five, or even more than five rows). The distance may change over time, for example, based on an evaluation of the distortion introduced into the stored data. The distortion may be evaluated in various ways, for example, by calculating a signal-to-noise ratio, an error rate, the error codes required to repair the distortion, and the like. If the two rows are far enough apart and two bank controllers are implemented on the same bank, the two rows can in fact both be activated. The new architecture (implementing two controllers on the same bank) may prevent multiple lines from being opened in the same mat.
FIG. 81D illustrates an embodiment of next row prediction consistent with embodiments of the present disclosure. The embodiment may include an additional pipeline of flip-flops (address registers A to C). The pipeline may be implemented with any number of flip-flops (stages), as needed to provide the delay required for activation after the address generator, with overall execution delayed so as to use the delayed address. The prediction may then be the newly generated address (at the beginning of the pipeline, below address register C), while the current address is at the end of the pipeline. In this embodiment, the address generator does not need to be duplicated. A selector (the multiplexer shown in FIG. 81D) may be added to configure the delay, while the address registers provide the delay.
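A behavioral sketch of the delayed address pipeline of FIG. 81D is given below for illustration. It assumes one address is generated per clock cycle, models each address register as one entry of a shift register, and models the multiplexer as an index `delay` selecting which register supplies the current address; `DelayedAddressPipeline` and its parameters are hypothetical names, and real hardware would of course be described in an HDL rather than Python.

```python
from typing import List, Optional, Tuple


class DelayedAddressPipeline:
    """Hypothetical model of flip-flop stages with a delay-selecting mux."""

    def __init__(self, stages: int = 3, delay: int = 3) -> None:
        if not 1 <= delay <= stages:
            raise ValueError("delay must be between 1 and the stage count")
        # regs[0] holds the address generated one cycle ago, regs[1] the
        # address generated two cycles ago, and so on.
        self.regs: List[Optional[int]] = [None] * stages
        self.delay = delay

    def clock(self, generated: int) -> Tuple[int, Optional[int]]:
        """Advance one cycle; return (predicted address, current address)."""
        predicted = generated                # head of the pipeline
        current = self.regs[self.delay - 1]  # multiplexer output
        # Clock edge: every register samples the value ahead of it.
        self.regs = [generated] + self.regs[:-1]
        return predicted, current
```

Because the "prediction" is simply the same address stream seen `delay` cycles early, such a scheme is always correct, which matches the statement that a delayed address generator can achieve 100% successful prediction.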
FIG. 81E illustrates an embodiment of a memory bank consistent with embodiments of the present disclosure. The memory bank may be implemented such that, if a newly activated line is far enough from the current line, activating the new line will not corrupt the current line. As shown in FIG. 81E, the memory bank may include additional memory mats (shown in black) between every two lines of mats. A control unit (such as a row decoder) may therefore activate multiple lines that are separated by a mat.
In some embodiments, the memory unit may be configured to receive, at a predetermined time, a first address for processing and a second address for activation and access.
FIG. 81F illustrates another embodiment of a memory bank consistent with embodiments of the present disclosure. The memory bank may be implemented such that, if a newly activated line is far enough from the current line, activating the new line will not corrupt the current line. The embodiment depicted in FIG. 81F may allow the row decoder to open lines n and n+1 by ensuring that all even lines are implemented in the upper half of the memory bank and all odd lines are implemented in the lower half of the memory bank. This implementation may allow access to consecutive lines that are always far enough apart.
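The even/odd placement of FIG. 81F can be illustrated with a small address-mapping sketch. It assumes a bank with an even number of lines, with physical positions numbered from the top of the bank; the function name `physical_line` and this numbering are assumptions made for the example.

```python
def physical_line(logical_line: int, lines_per_bank: int) -> int:
    """Map a logical line number to a hypothetical physical position.

    Even logical lines fill the upper half of the bank (positions
    0 .. half-1) and odd logical lines fill the lower half
    (positions half .. lines_per_bank-1).
    """
    if lines_per_bank % 2 != 0:
        raise ValueError("lines_per_bank must be even")
    if not 0 <= logical_line < lines_per_bank:
        raise ValueError("logical_line out of range")
    half = lines_per_bank // 2
    if logical_line % 2 == 0:
        return logical_line // 2
    return half + logical_line // 2
```

Under this mapping, consecutive logical lines n and n+1 always land in different halves and are at least `lines_per_bank // 2 - 1` physical positions apart, so opening line n+1 while line n is being read does not involve neighboring lines.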
According to the disclosed embodiments, a dual-control memory bank may allow different portions of a single memory bank to be accessed and activated, even where the dual-control memory bank is configured to output one data unit at a time. For example, as described, dual control may enable the memory bank to access a first row while activating a second row (e.g., a predicted next row or a predetermined next row to be accessed).
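FIG. 82 illustrates a dual-control memory bank 8280 for reducing memory row activation penalties (e.g., latency), consistent with embodiments of the present disclosure. Dual-control memory bank 8280 may include inputs comprising a data input (DIN) 8290, a row address (ROW) 8291, a column address (COLUMN) 8292, a first command input (COMMAND_1) 8293, and a second command input (COMMAND_2) 8294. Memory bank 8280 may include a data output (Dout) 8295.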
It is assumed that an address may include a row address and a column address, and that there are two row decoders. Other address configurations may be provided, the number of row decoders may exceed two, and there may be more than a single column decoder.
Row address (ROW) 8291 may identify the row associated with a command such as an activate command. Because a row, once activated, can then be read from or written to, it may not be necessary to send the row address for writing to or reading from the open row while the row remains open (after its activation).
First command input (COMMAND_1) 8293 may be used to send commands (such as, but not limited to, an activate command) to rows accessed by the first row decoder. Second command input (COMMAND_2) 8294 may be used to send commands (such as, but not limited to, an activate command) to rows accessed by the second row decoder.
Data input (DIN) 8290 may be used to feed in data when a write operation is performed.
Because an entire row cannot be read at once, individual row sections may be read sequentially, and column address (COLUMN) 8292 may indicate which section (which columns) of the row is to be read. For simplicity of explanation, it may be assumed that there are 2^Q sections and that the column input has Q bits, where Q is a positive integer greater than one.
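The relationship between the number of row sections and the width Q of the column input can be checked with a short calculation, sketched here under the assumption that the section count is an exact power of two; the helper name `column_bits` is hypothetical.

```python
def column_bits(sections: int) -> int:
    """Return Q such that 2**Q == sections, requiring Q > 1."""
    # A power of two has exactly one bit set.
    if sections < 4 or sections & (sections - 1) != 0:
        raise ValueError("sections must be a power of two of at least 4")
    return sections.bit_length() - 1
```

For a row read as 32 sections, as in the example of FIGS. 83A to 83C, the column input would need Q = 5 bits.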
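Dual-control memory bank 8280 may operate with or without the address prediction described above with respect to FIGS. 81A to 81B. Of course, to reduce operating latency, the dual-control memory bank may operate with address prediction in accordance with the disclosed embodiments.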
FIGS. 83A, 83B, and 83C illustrate examples of accessing and activating rows of memory bank 8180. As mentioned above, it is assumed in one example that reading a row and activating a row both require 32 cycles (sections). In addition, to reduce the activation penalty (whose length is denoted Delta), it may be beneficial to know in advance (at least Delta before the next row needs to be accessed) that the next row should be opened. In some cases, Delta may equal four cycles. Each memory bank depicted in FIGS. 83A, 83B, and 83C may include two or more sub-banks within which, in some embodiments, only one row may be open at any given time. In some cases, even rows may be associated with a first sub-bank and odd rows with a second sub-bank. In this example, use of the disclosed predictive addressing embodiments may allow activation of a row of one memory sub-bank to be initiated before the end of a read operation with respect to a row of another memory sub-bank is reached (a delay period before the end is reached). In this way, sequential memory accesses (e.g., a predefined memory access sequence in which rows 1, 2, 3, 4, 5, 6, 7, 8, ... are to be read, with rows 1, 3, 5, ... associated with a first memory sub-bank and rows 2, 4, 6, ... associated with a second, different memory sub-bank) can be carried out in an efficient manner.
FIG. 83A may illustrate a state for accessing memory rows included in two different memory sub-banks. In the state shown in FIG. 83A:
a. Row A may be accessible by the first row decoder. The first section (the leftmost section, marked in gray) may be accessed after the first row decoder activates row A.
b. Row B may be accessible by the second row decoder. In the state shown in FIG. 83A, row B is closed and has not yet been activated.
The state illustrated in FIG. 83A may be preceded by sending an activate command and the address of row A to the first row decoder.
FIG. 83B illustrates a state for accessing row B after accessing row A. In this example, row A may be accessible by the first row decoder. In the state shown in FIG. 83B, the first row decoder has activated row A, and all sections except the four rightmost sections (the four sections not marked in gray) have been accessed. Because Delta (the four white sections in row A) equals four cycles, the bank controller may enable the second row decoder to activate row B before the rightmost sections in row A are accessed. In some cases, the activation of row B may be in response to a predetermined access pattern (e.g., sequential row access, where odd rows are assigned to the first sub-bank and even rows are assigned to the second sub-bank). In other cases, the activation of row B may be in response to any of the row prediction techniques described above. The bank controller may enable the second row decoder to activate row B in advance, so that by the time row B is accessed, it has already been activated (opened), rather than having to wait for row B to be activated in order to open it.
The state illustrated in FIG. 83B may be preceded by the following operations:
a. Sending an activate command and the address of row A to the first row decoder.
b. Writing or reading the first twenty-eight sections of row A.
c. After the read or write operations on the twenty-eight sections of the row, sending an activate command with respect to the address of row B to the second row decoder.
In some embodiments, the even-numbered rows are located in one half of one or more memory banks. In some embodiments, the odd-numbered rows are located in one half of one or more memory banks.
In some embodiments, a line of additional redundant mats is placed between every two lines of mats to create the distance that permits activation. In some embodiments, multiple lines that are close to one another may not be activated at the same time.
FIG. 83C may illustrate a state for accessing row C (e.g., the next odd row included in the first sub-bank) after accessing row A. As shown in FIG. 83C, row B may be accessible by the second row decoder. As shown, the second row decoder has activated row B, and all sections except the four rightmost sections (the four remaining sections not marked in gray) have been accessed. Because Delta equals four cycles in this example, the bank controller may enable the first row decoder to activate row C before the rightmost sections in row B are accessed. The bank controller may enable the first row decoder to activate row C in advance, so that by the time row C is accessed, it has already been activated rather than having to wait for its activation. Operating in this manner may reduce or completely eliminate the latency associated with memory read operations.
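The timing of FIGS. 83A to 83C can be made concrete with a rough cycle count. The sketch below assumes 32 sections per row read at one section per cycle, an activation penalty Delta of four cycles, and rows that alternate between two sub-banks so that each activation can overlap the tail of the previous read; the function names and the simple additive model are assumptions made for illustration, not figures from the disclosure.

```python
def activation_issue_section(sections: int = 32, delta: int = 4) -> int:
    """Sections of the current row accessed before the next row's
    activate command is sent to the other row decoder."""
    if not 0 < delta <= sections:
        raise ValueError("delta must be between 1 and sections")
    return sections - delta


def total_cycles(num_rows: int, sections: int = 32, delta: int = 4,
                 preactivate: bool = True) -> int:
    """Cycles needed to read `num_rows` rows back to back."""
    if not 0 < delta <= sections:
        raise ValueError("delta must be between 1 and sections")
    if preactivate:
        # Only the first activation is exposed; each later activation
        # overlaps the last `delta` sections of the previous row.
        return delta + num_rows * sections
    # Without early activation every row pays the full penalty.
    return num_rows * (delta + sections)
```

With these example values the activate command for the next row is issued after 28 sections, matching operation (b) and (c) above, and reading eight rows takes 260 cycles instead of 288.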
Memory mats as a register file
In computer architecture, processor registers constitute storage locations that are quickly accessible to a computer's processor (e.g., a central processing unit (CPU)). Registers typically comprise the memory units closest to the processor core (L0). Registers may provide the fastest way to access certain types of data. A computer may have several types of registers, each classified according to the type of information it stores or according to the types of instructions that operate on the information in a given type of register. For example, a computer may include data registers that hold numeric information, operands, intermediate results, and configuration; address registers that store address information used by instructions to access main memory; general-purpose registers that store both data and address information; and status registers, among others. A register file comprises a logical group of registers available for use by a computer processing unit.
In many cases, a computer's register file is located within the processing unit (e.g., the CPU) and is implemented with logic transistors. In the disclosed embodiments, however, the computational processing units may not reside in a conventional CPU. Rather, such processing elements (e.g., processor subunits) may be spatially distributed, as a processing array, within a memory chip (as described in the sections above). Each processor subunit may be associated with one or more corresponding, dedicated memory units (e.g., memory banks). With this architecture, each processor subunit may be located spatially close to the one or more memory elements that store the data on which that particular processor subunit operates. As described herein, this architecture can significantly speed up certain memory-intensive operations by, for example, eliminating the memory access bottleneck experienced by typical CPU and external memory architectures.
The distributed processor memory chip architecture described herein may nevertheless still make use of a register file that includes various types of registers for operating on data from the memory elements dedicated to the corresponding processor subunit. However, because the processor subunits may be distributed among the memory elements of the memory chip, it is possible to add one or more memory elements (which, compared with logic elements in a particular fabrication process, may benefit from that same process) to the corresponding processor subunit to serve as a register file or cache for that processor subunit, rather than serving as main memory storage.
This architecture may offer several advantages. For example, because the register file is part of the corresponding processor subunit, the processor subunit may be located spatially close to the relevant register file. This arrangement can significantly increase operating efficiency. Conventional register files are implemented with logic transistors. For example, each bit of a conventional register file is made of about 12 logic transistors, so a 16-bit register file is made of 192 logic transistors. Such a register file may require a large number of logic components to access the logic transistors and may therefore occupy a large amount of space. Compared with a register file implemented with logic transistors, the register file of the presently disclosed embodiments may require significantly less space. This reduction in size may be achieved by implementing the register file of the disclosed embodiments using memory mats that include memory cells fabricated by a process optimized for manufacturing memory structures rather than for manufacturing logic structures. The reduction in size may also allow for larger register files or caches.
在一些實施例中,可提供分散式處理器記憶體晶片。分散式處理器記憶體晶片可包括:基板;記憶體陣列,其安置於基板上且包括複數個離散記憶體組;及處理陣列,其安置於基板上且包括複數個處理器子單元。該等處理器子單元中之每一者可與複數個離散記憶體組中之對應的專用記憶體組相關聯。分散式處理器記憶體晶片亦可包括第一複數個匯流排及第二複數個匯流排。第一複數個匯流排中之每一者可將複數個處理器子單元中之一者連接至其對應的專用記憶體組。第二複數個匯流排中之每一者可將複數個處理器子單元中之一者連接至複數個處理器子單元中之另一者。在一些狀況下,第二複數個匯流排可將複數個處理器子單元中之一或多者連接至複數個處理器子單元當中之兩個或多於兩個其他處理器子單元。處理器子單元中之一或多者亦可包括安置於基板上之至少一個記憶體墊。至少一個記憶體墊可經組態以充當用於複數個處理子單元中之一或多者的暫存器檔案之至少一個暫存器。 In some embodiments, a distributed processor memory chip may be provided. The distributed processor memory chip may include: a substrate; a memory array disposed on the substrate and including a plurality of discrete memory banks; and a processing array disposed on the substrate and including a plurality of processor subunits. Each of the processor subunits may be associated with a corresponding dedicated memory bank among the plurality of discrete memory banks. The distributed processor memory chip may also include a first plurality of buses and a second plurality of buses. Each of the first plurality of buses may connect one of the plurality of processor subunits to its corresponding dedicated memory bank. Each of the second plurality of buses may connect one of the plurality of processor subunits to another of the plurality of processor subunits. In some cases, the second plurality of buses may connect one or more of the plurality of processor subunits to two or more other processor subunits among the plurality of processor subunits. One or more of the processor subunits may also include at least one memory pad disposed on the substrate. The at least one memory pad may be configured to serve as at least one register of a register file for one or more of the plurality of processor subunits.
在一些狀況下,暫存器檔案可與一或多個邏輯組件相關聯以使得記憶體墊能夠充當暫存器檔案之一或多個暫存器。舉例而言,此等邏輯組件可包括開關、放大器、反相器、感測放大器以及其他者。在暫存器檔案由動態隨機存取記憶體(DRAM)墊實施之實例中,可包括邏輯組件以執行再新操作從而防止所儲存資料丟失。此等邏輯組件可包括列及行多工器(「mux」)。此外,由DRAM墊實施之暫存器檔案可包括冗餘機構以對抗良率下降。 In some cases, the register file may be associated with one or more logic components that enable the memory pad to serve as one or more registers of the register file. For example, these logic components may include switches, amplifiers, inverters, sense amplifiers, and others. Where the register file is implemented by a dynamic random access memory (DRAM) pad, logic components may be included to perform refresh operations so that stored data is not lost. These logic components may include row and column multiplexers ("mux"). In addition, a register file implemented by a DRAM pad may include redundancy mechanisms to counter yield loss.
圖84說明包括CPU 8402及外部記憶體8406之傳統電腦架構8400。在操作期間,可將來自記憶體8406之值載入至與包括於CPU 8402中之暫存器檔案8504相關聯的暫存器中。
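FIG. 84 illustrates a traditional computer architecture 8400 that includes a CPU 8402 and an external memory 8406. During operation, values from memory 8406 may be loaded into registers associated with a register file 8504 included in CPU 8402.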
圖85A說明符合所揭示實施例之例示性分散式處理器記憶體晶片8500a。相比於圖84之架構,分散式處理器記憶體晶片8500a包括安置於同一基板上之記憶體元件及處理器元件。亦即,晶片8500a可包括記憶體陣列及處理陣列,該處理陣列包括各與包括於記憶體陣列中之一或多個專用記憶體組相關聯的複數個處理器子單元。在圖85之架構中,由處理器子單元使用之暫存器係藉由安置於同一基板上之一或多個記憶體墊提供,記憶體陣列及處理陣列形成於該基板上。 FIG. 85A illustrates an exemplary distributed processor memory chip 8500a in accordance with the disclosed embodiments. Compared with the architecture of FIG. 84, the distributed processor memory chip 8500a includes memory elements and processor elements arranged on the same substrate. That is, the chip 8500a may include a memory array and a processing array, the processing array including a plurality of processor subunits each associated with one or more dedicated memory groups included in the memory array. In the architecture of FIG. 85, the register used by the processor subunit is provided by one or more memory pads disposed on the same substrate, and the memory array and the processing array are formed on the substrate.
如圖85A中所描繪,分散式處理器記憶體晶片8500a可藉由安置於基板8502上之複數個處理群組8510a、8510b及8510c形成。更具體而言,分散式處理器記憶體晶片8500a可包括安置於基板8502上之記憶體陣列8520及處理陣列8530。記憶體陣列8520可包括複數個記憶體組,諸如記憶體組8520a、8520b及8520c。處理陣列8530可包括複數個處理器子單元,諸如處理器子單元8530a、8530b及8530c。
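As depicted in FIG. 85A, the distributed processor memory chip 8500a may be formed by a plurality of processing groups 8510a, 8510b, and 8510c disposed on a substrate 8502. More specifically, the distributed processor memory chip 8500a may include a memory array 8520 and a processing array 8530 disposed on the substrate 8502. The memory array 8520 may include a plurality of memory banks, such as memory banks 8520a, 8520b, and 8520c. The processing array 8530 may include a plurality of processor subunits, such as processor subunits 8530a, 8530b, and 8530c.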
此外,處理群組8510a、8510b及8510c中之每一者可包括處理器子單元及專用於該處理器子單元之一或多個對應記憶體組。在圖85A中所描繪之實施例中,處理器子單元8530a、8530b及8530c中之每一者可與對應的專用記憶體組8520a、8520b或8520c相關聯。亦即,處理器子單元8530a可與記憶體組8520a相關聯;處理器子單元8530b可與記憶體組8520b相關聯;且處理器子單元8530c可與記憶體組8520c相關聯。 In addition, each of the processing groups 8510a, 8510b, and 8510c may include a processor subunit and one or more corresponding memory groups dedicated to the processor subunit. In the embodiment depicted in FIG. 85A, each of the processor subunits 8530a, 8530b, and 8530c may be associated with a corresponding dedicated memory bank 8520a, 8520b, or 8520c. That is, the processor subunit 8530a can be associated with the memory group 8520a; the processor subunit 8530b can be associated with the memory group 8520b; and the processor subunit 8530c can be associated with the memory group 8520c.
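為了允許每一處理器子單元與其對應的專用記憶體組通信,分散式處理器記憶體晶片8500a可包括將處理器子單元中之一者連接至其對應的專用記憶體組之第一複數個匯流排8540a、8540b及8540c。在圖85A中所描繪之實施例中,匯流排8540a可將處理器子單元8530a連接至記憶體組8520a;匯流排8540b可將處理器子單元8530b連接至記憶體組8520b;且匯流排8540c可將處理器子單元8530c連接至記憶體組8520c。
To allow each processor subunit to communicate with its corresponding dedicated memory bank, the distributed processor memory chip 8500a may include a first plurality of buses 8540a, 8540b, and 8540c, each connecting one of the processor subunits to its corresponding dedicated memory bank. In the embodiment depicted in FIG. 85A, bus 8540a may connect processor subunit 8530a to memory bank 8520a; bus 8540b may connect processor subunit 8530b to memory bank 8520b; and bus 8540c may connect processor subunit 8530c to memory bank 8520c.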
此外,為了允許每一處理器子單元與其他處理器子單元通信,分散式處理器記憶體晶片8500a可包括將處理器子單元中之一者連接至至少另一處理器子單元之第二複數個匯流排8550a及8550b。在圖85中所描繪之實施例中,匯流排8550a可將處理器子單元8530a連接至處理器子單元8530b,且匯流排8550b可將處理器子單元8530a連接至處理器子單元8550b,等等。
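In addition, to allow each processor subunit to communicate with other processor subunits, the distributed processor memory chip 8500a may include a second plurality of buses 8550a and 8550b, each connecting one of the processor subunits to at least one other processor subunit. In the embodiment depicted in FIG. 85, bus 8550a may connect processor subunit 8530a to processor subunit 8530b, bus 8550b may connect processor subunit 8530a to processor subunit 8550b, and so on.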
離散記憶體組8520a、8520b及8520c中之每一者可包括複數個記憶體墊。在圖84中所描繪之實施例中,記憶體組8520a可包括記憶體墊8522a、8524a及8526a;記憶體組8520b可包括記憶體墊8522b、8524b及8526b;且記憶體組8520c可包括記憶體墊8522c、8524c及8526c。如先前關於圖10所揭示,記憶體墊可包括複數個記憶體胞元,且每一胞元可包含電容器、電晶體或儲存至少一個資料位元之其他電路系統。習知記憶體墊可包含例如512個位元×512個位元,但本文中所揭示之實施例不限於此。
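Each of the discrete memory banks 8520a, 8520b, and 8520c may include a plurality of memory pads. In the embodiment depicted in FIG. 84, memory bank 8520a may include memory pads 8522a, 8524a, and 8526a; memory bank 8520b may include memory pads 8522b, 8524b, and 8526b; and memory bank 8520c may include memory pads 8522c, 8524c, and 8526c. As previously disclosed with respect to FIG. 10, a memory pad may include a plurality of memory cells, and each cell may comprise a capacitor, a transistor, or other circuitry that stores at least one bit of data. A conventional memory pad may comprise, for example, 512 bits by 512 bits, but the embodiments disclosed herein are not limited thereto.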
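處理器子單元8530a、8530b及8530c中之至少一者可包括經組態以充當用於對應處理器子單元8530a、8530b及8530c之暫存器檔案的至少一個記憶體墊,諸如記憶體墊8532a、8532b及8532c。亦即,至少一個記憶體墊8532a、8532b及8532c提供由處理器子單元8530a、8530b及8530c中之一或多者使用的暫存器檔案之至少一個暫存器。暫存器檔案可包括一或多個暫存器。在圖85A中所描繪之實施例中,處理器子單元8530a中之記憶體墊8532a可充當用於處理器子單元8530a(及/或包括於分散式處理器記憶體晶片8500a中之任何其他處理器子單元)之暫存器檔案(亦被稱作「暫存器檔案8532a」);處理器子單元8530b中之記憶體墊8532b可充當用於處理器子單元8530b之暫存器檔案;且處理器子單元8530c中之記憶體墊8532c可充當用於處理器子單元8530c之暫存器檔案。
At least one of the processor subunits 8530a, 8530b, and 8530c may include at least one memory pad, such as memory pads 8532a, 8532b, and 8532c, configured to serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c. That is, the at least one memory pad 8532a, 8532b, and 8532c provides at least one register of a register file used by one or more of the processor subunits 8530a, 8530b, and 8530c. The register file may include one or more registers. In the embodiment depicted in FIG. 85A, memory pad 8532a in processor subunit 8530a may serve as a register file (also referred to as "register file 8532a") for processor subunit 8530a (and/or any other processor subunit included in the distributed processor memory chip 8500a); memory pad 8532b in processor subunit 8530b may serve as a register file for processor subunit 8530b; and memory pad 8532c in processor subunit 8530c may serve as a register file for processor subunit 8530c.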
處理器子單元8530a、8530b及8530c中之至少一者亦可包括至少一個邏輯組件,諸如邏輯組件8534a、8534b及8534c。每一邏輯組件8534a、8534b或8534c可經組態以使得對應記憶體墊8532a、8532b或8532c能夠充當用於對應處理器子單元8530a、8530b或8530c之暫存器檔案。
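At least one of the processor subunits 8530a, 8530b, and 8530c may also include at least one logic component, such as logic components 8534a, 8534b, and 8534c. Each logic component 8534a, 8534b, or 8534c may be configured to enable the corresponding memory pad 8532a, 8532b, or 8532c to serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c.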
在一些實施例中,至少一個記憶體墊可安置於基板上,且至少一個記憶體墊可含有經組態以提供用於複數個處理器子單元中之一或多者之至少一個冗餘暫存器的至少一個冗餘記憶體位元。在一些實施例中,處理器子單元中之至少一者可包括用以停止當前任務且在某些時間觸發記憶體再新操作以再新記憶體墊之機制。 In some embodiments, at least one memory pad may be disposed on the substrate, and the at least one memory pad may contain at least one redundant memory bit configured to provide at least one redundant register for one or more of the plurality of processor subunits. In some embodiments, at least one of the processor subunits may include a mechanism for halting the current task and triggering a memory refresh operation at certain times to refresh the memory pad.
圖85B說明符合所揭示實施例之例示性分散式處理器記憶體晶片8500b。圖85B中所說明之記憶體晶片8500b與圖85A中所說明之記憶體晶片8500大體上相同,除了圖85B中之記憶體墊8532a、8532b及8532c不包括於對應處理器子單元8530a、8530b及8530c中以外。實情為,圖85B中之記憶體墊8532a、8532b及8532c安置於對應處理器子單元8530a、8530b及8530c外部但在空間上靠近該等處理器子單元。以此方式,記憶體墊8532a、8532b及8532c仍可充當用於對應處理器子單元8530a、8530b及8530c之暫存器檔案。
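FIG. 85B illustrates an exemplary distributed processor memory chip 8500b consistent with the disclosed embodiments. The memory chip 8500b illustrated in FIG. 85B is substantially the same as the memory chip 8500 illustrated in FIG. 85A, except that the memory pads 8532a, 8532b, and 8532c in FIG. 85B are not included in the corresponding processor subunits 8530a, 8530b, and 8530c. Instead, the memory pads 8532a, 8532b, and 8532c in FIG. 85B are disposed outside of, but spatially close to, the corresponding processor subunits 8530a, 8530b, and 8530c. In this way, the memory pads 8532a, 8532b, and 8532c can still serve as register files for the corresponding processor subunits 8530a, 8530b, and 8530c.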
圖85C說明符合所揭示實施例之裝置8500c。裝置8500c包括基板8560、第一記憶體組8570、第二記憶體組8572及處理單元8580。第一記憶體組8570、第二記憶體組8572及處理單元8580安置於基板8560上。處理單元8580包括處理器8584及由記憶體墊實施之暫存器檔案8582。在處理單元8580之操作期間,處理器8584可存取暫存器檔案8582以讀取或寫入資料。
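FIG. 85C illustrates a device 8500c consistent with the disclosed embodiments. The device 8500c includes a substrate 8560, a first memory bank 8570, a second memory bank 8572, and a processing unit 8580. The first memory bank 8570, the second memory bank 8572, and the processing unit 8580 are disposed on the substrate 8560. The processing unit 8580 includes a processor 8584 and a register file 8582 implemented by a memory pad. During operation of the processing unit 8580, the processor 8584 may access the register file 8582 to read or write data.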
分散式處理器記憶體晶片8500a、8500b或裝置8500c可基於處理器子單元對由記憶體墊提供之暫存器的存取而提供多種功能。舉例而言,在一些實施例中,分散式處理器記憶體晶片8500a或8500b可包括處理器子單元,該處理器子單元充當耦接至記憶體之加速器,從而允許其使用更多記憶體頻寬。在圖85A中所描繪之實施例中,處理器子單元8530a可充當加速器(亦被稱作「加速器8530a」)。加速器8530a可使用安置於加速器8530a中之記憶體墊8532a以提供暫存器檔案之一或多個暫存器。替代地,在圖85B中所描繪之實施例中,加速器8530a可使用安置於加速器8530a外部之記憶體墊8532a作為暫存器檔案。又另外,加速器8530a可使用記憶體組8520b中之記憶體墊8522b、8524b及8526b中之任一者或記憶體組8520c中之記憶體墊8522c、8524c及8526c中之任一者,以提供一或多個暫存器。
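The distributed processor memory chip 8500a or 8500b, or the device 8500c, can provide a variety of functions based on the processor subunits' access to registers provided by the memory pads. For example, in some embodiments, the distributed processor memory chip 8500a or 8500b may include a processor subunit that acts as an accelerator coupled to memory, allowing it to use more memory bandwidth. In the embodiment depicted in FIG. 85A, the processor subunit 8530a may act as an accelerator (also referred to as "accelerator 8530a"). The accelerator 8530a may use the memory pad 8532a disposed in the accelerator 8530a to provide one or more registers of a register file. Alternatively, in the embodiment depicted in FIG. 85B, the accelerator 8530a may use the memory pad 8532a disposed outside the accelerator 8530a as a register file. Still further, the accelerator 8530a may use any of the memory pads 8522b, 8524b, and 8526b in memory bank 8520b, or any of the memory pads 8522c, 8524c, and 8526c in memory bank 8520c, to provide one or more registers.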
所揭示實施例可尤其適用於某些類型之影像處理、神經網路、資料庫分析、壓縮及解壓縮以及更多應用。舉例而言,在圖85A或圖85B之實施例中,記憶體墊可提供用於與記憶體墊包括在同一晶片上之一或多個處理器子單元的暫存器檔案之一或多個暫存器作為記憶體墊。一或多個暫存器可用以儲存由處理器子單元頻繁存取之資料。舉例而言,在卷積影像處理期間,卷積加速器可在保存於記憶體中之整個影像上反覆使用相同係數。用於此卷積加速器之所建議實施方案可將所有此等係數保存於在一或多個暫存器內之「關閉」暫存器檔案中,該一或多個暫存器包括於專用於一或多個處理器子單元之記憶體墊內,該一或多個處理器子單元與暫存器檔案記憶體墊位於同一晶片上。此架構可將暫存器(及所儲存之係數值)置放成緊密接近對係數值操作之處理器子單元。因為由記憶體墊實施之暫存器檔案可充當在空間上緊密之高效快取記憶體,所以可達成資料傳送之顯著較低損失及存取之較低潛時。 The disclosed embodiments may be especially useful for certain types of image processing, neural networks, database analysis, compression and decompression, and other applications. For example, in the embodiment of FIG. 85A or FIG. 85B, a memory pad may provide one or more registers of a register file for one or more processor subunits included on the same chip as the memory pad. The one or more registers can store data frequently accessed by the processor subunits. For example, during convolutional image processing, a convolution accelerator may reuse the same coefficients over an entire image held in memory. A proposed implementation of such a convolution accelerator may hold all of these coefficients in a "close" register file, that is, in one or more registers within a memory pad dedicated to one or more processor subunits located on the same chip as the register file memory pad. This architecture places the registers (and the stored coefficient values) in close proximity to the processor subunits that operate on the coefficient values. Because a register file implemented by a memory pad can act as a spatially close, efficient cache, significantly lower data-transfer losses and lower access latency can be achieved.
在另一實例中,所揭示實施例可包括可將字輸入至由記憶體墊提供之暫存器中的加速器。加速器可將暫存器處置為循環緩衝器以在單個循環中將向量相乘。舉例而言,在圖85C中所說明之裝置8500c中,處理單元8580中之處理器8584充當加速器,其使用由記憶體墊實施之暫存器檔案8582作為循環緩衝器以儲存資料A1、A2、A3……。第一記憶體組8570儲存待與資料A1、A2、A3……相乘之資料B1、B2、B3……。第二記憶體組8572儲存乘法結果C1、C2、C3……。亦即,Ci=Ai×Bi。若處理單元8580中不存在暫存器檔案,則處理器8584將需要更多記憶體頻寬及更多循環以自諸如記憶體組8570或8572之外部記憶體組讀取資料A1、A2、A3……及資料B1、B2、B3……兩者,此可產生顯著延遲。另一方面,在本實施例中,資料A1、A2、A3……儲存於形成於處理單元8580內之暫存器檔案8582中。因此,處理器8584將僅需要自外部記憶體組8570讀取資料B1、B2、B3……。因此,可顯著減少記憶體頻寬。 In another example, the disclosed embodiments may include an accelerator that can input words into registers provided by a memory pad. The accelerator may treat the registers as a circular buffer to multiply vectors in a single cycle. For example, in the device 8500c illustrated in FIG. 85C, the processor 8584 in the processing unit 8580 acts as an accelerator that uses the register file 8582, implemented by a memory pad, as a circular buffer to store data A1, A2, A3, .... The first memory bank 8570 stores data B1, B2, B3, ... to be multiplied by data A1, A2, A3, .... The second memory bank 8572 stores the multiplication results C1, C2, C3, .... That is, Ci = Ai × Bi. If the processing unit 8580 had no register file, the processor 8584 would need more memory bandwidth and more cycles to read both data A1, A2, A3, ... and data B1, B2, B3, ... from an external memory bank such as memory bank 8570 or 8572, which could cause significant delay. In this embodiment, by contrast, data A1, A2, A3, ... are stored in the register file 8582 formed within the processing unit 8580. The processor 8584 therefore only needs to read data B1, B2, B3, ... from the external memory bank 8570, and the required memory bandwidth can be significantly reduced.
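The bandwidth argument in the FIG. 85C example can be illustrated with a minimal Python sketch. It is a software analogy only, not the disclosed hardware: the function name, the use of a `deque` to stand in for register file 8582, and the read counter are all assumptions introduced for illustration.

```python
from collections import deque


def multiply_with_register_buffer(a_values, b_bank):
    """Compute Ci = Ai * Bi with the A values held in a circular register buffer.

    Hypothetical model: `registers` stands in for register file 8582,
    `b_bank` for first memory bank 8570, and the returned list for
    second memory bank 8572.
    """
    registers = deque(a_values)   # A1, A2, A3, ... kept next to the processor
    external_reads = 0            # reads that go out to a memory bank
    c_bank = []
    for b in b_bank:
        external_reads += 1       # only B has to be fetched from a bank
        a = registers[0]          # local register access, no bank read
        registers.rotate(-1)      # advance the circular buffer by one word
        c_bank.append(a * b)
    return c_bank, external_reads
```

For three products this model counts three external reads. A design without the register file would have to fetch both Ai and Bi from a bank, that is, two reads per product.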
在記憶體處理程序中,記憶體墊通常允許單向存取(亦即,單次存取)。在單向存取中,存在至記憶體之一個埠。結果,可在某一時間僅執行對特定位址之一個存取操作,例如讀取或寫入。然而,若記憶體墊本身允許雙向存取,則雙向存取可為有效選項。在雙向存取中,可在某一時間存取兩個不同位址。存取記憶體墊之方法可基於面積及要求而判定。在一些狀況下,若由記憶體墊實施之暫存器檔案連接至需要讀取兩個源且具有一個目的地暫存器之處理器,則該等暫存器檔案可允許四向存取。在一些狀況下,當暫存器檔案由DRAM墊實施以儲存組態或快取記憶體資料時,暫存器檔案可僅允許單向存取。標準CPU可包括多向存取墊,而單向存取墊對於DRAM應用可為更佳的。 In the memory processing procedure, the memory pad usually allows one-way access (ie, single access). In one-way access, there is a port to the memory. As a result, only one access operation to a specific address, such as reading or writing, can be performed at a certain time. However, if the memory pad itself allows two-way access, then two-way access may be a valid option. In bidirectional access, two different addresses can be accessed at a certain time. The method of accessing the memory pad can be determined based on the area and requirements. In some situations, if the register files implemented by the memory pad are connected to a processor that needs to read two sources and has a destination register, the register files can allow four-way access. In some situations, when the register file is implemented by a DRAM pad to store configuration or cache memory data, the register file may only allow one-way access. Standard CPUs may include multi-directional access pads, while unidirectional access pads may be better for DRAM applications.
當控制器或加速器以其僅需要單次存取暫存器(在可能的少數情況下)之方式設計時,可使用記憶體墊實施之暫存器而非傳統的暫存器檔案。在單次存取中,一次僅可存取一個字。舉例而言,處理單元可在某一時間自兩個暫存器檔案存取兩個字。兩個暫存器檔案中之每一者可藉由僅允許單次存取之記憶體墊(例如,DRAM墊)實施。 When a controller or accelerator is designed so that it needs only single access to registers (possibly in a small number of cases), registers implemented by a memory pad can be used instead of a traditional register file. In single access, only one word can be accessed at a time. For example, a processing unit can access two words at a time from two register files, each of which is implemented by a memory pad (for example, a DRAM pad) that allows only single access.
在大多數技術中,記憶體墊IP(其為自製造商獲得之封閉區塊(IP))將附帶有處於適當位置以用於列及行存取之佈線,諸如字線及列線。但記憶體墊IP不包括環繞邏輯組件。因此,由揭示於本發明實施例中之記憶體墊實施的暫存器檔案可包括邏輯組件。可基於暫存器檔案之所需大小選擇記憶體墊之大小。 In most technologies, the memory pad IP, which is a closed block (IP) obtained from the manufacturer, will be accompanied by wiring in place for column and row access, such as word lines and column lines. But the memory pad IP does not include surround logic components. Therefore, the register file implemented by the memory pad disclosed in the embodiment of the present invention may include logic components. The size of the memory pad can be selected based on the required size of the register file.
當使用記憶體墊以提供暫存器檔案之暫存器時,可能會出現某些挑戰,且此等挑戰可取決於用以形成記憶體墊之特定記憶體技術。舉例而言,在記憶體生產中,並非所有製造之記憶體胞元皆可在生產之後適當地操作。此為已知問題,尤其在晶片上存在高密度之SRAM或DRAM的情況下。為了解決記憶體技術中之此問題,可使用一或多個冗餘機構以便將良率維持於合理位準。在所揭示實施例中,因為用以提供暫存器檔案之暫存器的記憶體例項(例如,記憶體組)之數目可相當小,所以冗餘機構可能不如正常記憶體應用中那樣重要。另一方面,影響記憶體功能性之相同生產問題亦可影響特定記憶體墊在提供一或多個暫存器時是否可適當地起作用。結果,冗餘元件可包括於所揭示實施例中。舉例而言,至少一個冗餘記憶體墊可安置於分散式處理器記憶體晶片之基板上。至少一個冗餘記憶體墊可經組態以針對複數個處理器子單元中之一或多者提供至少一個冗餘暫存器。在另一實例中,墊可大於所需大小(例如,620×620而非512×512),且冗餘機構可建置至512×512區或其等效物外部之記憶體墊的區中。 When a memory pad is used to provide a register for a register file, certain challenges may arise, and these challenges may depend on the specific memory technology used to form the memory pad. For example, in memory production, not all memory cells manufactured can be properly operated after production. This is a known problem, especially when there is a high density of SRAM or DRAM on the chip. To solve this problem in memory technology, one or more redundant mechanisms can be used to maintain the yield rate at a reasonable level. In the disclosed embodiment, because the number of memory instances (for example, memory banks) of the register used to provide the register file can be quite small, the redundancy mechanism may not be as important as in normal memory applications. On the other hand, the same production issues that affect the functionality of the memory can also affect whether a particular memory pad can function properly when one or more registers are provided. As a result, redundant elements can be included in the disclosed embodiments. For example, at least one redundant memory pad can be disposed on the substrate of the distributed processor memory chip. The at least one redundant memory pad can be configured to provide at least one redundant register for one or more of the plurality of processor subunits. 
In another example, the pad may be larger than the required size (for example, 620×620 instead of 512×512), and the redundancy mechanism may be built into the area of the memory pad outside the 512×512 area or its equivalent.
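One way to picture the oversized-pad redundancy scheme is row remapping, sketched below in Python. This is an assumption for illustration, not a mechanism stated in the text: the class name `RedundantPad`, the row-granular repair, and the premise that the spare rows themselves are defect-free are all hypothetical.

```python
class RedundantPad:
    """Memory pad whose spare rows replace rows found defective at test time.

    Hypothetical model: logical rows 0..usable_rows-1 are visible to the
    register file; physical rows usable_rows..total_rows-1 are spares.
    """

    def __init__(self, usable_rows=512, total_rows=620, defective_rows=()):
        spare_rows = list(range(usable_rows, total_rows))
        bad = sorted(set(defective_rows))
        if len(bad) > len(spare_rows):
            raise ValueError("not enough spare rows to repair this pad")
        # Each defective logical row is redirected to one spare physical row.
        self.remap = dict(zip(bad, spare_rows))
        self.cells = [0] * total_rows   # one word per physical row

    def _physical(self, row):
        return self.remap.get(row, row)

    def write(self, row, word):
        self.cells[self._physical(row)] = word

    def read(self, row):
        return self.cells[self._physical(row)]
```

With `defective_rows=[3, 100]`, accesses to logical rows 3 and 100 land in physical rows 512 and 513, and all other rows are untouched.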
另一挑戰可與時序相關。載入字及位元線之時序通常由記憶體之大小判定。由於暫存器檔案可由相當小之單個記憶體墊(例如,512×512個位元)實施,因此自記憶體墊載入字所需之時間將為少的,相較於邏輯,時序可足以相當快速地運行。 Another challenge may relate to timing. The timing of loading the word lines and bit lines is usually determined by the size of the memory. Since the register file can be implemented by a single, fairly small memory pad (for example, 512×512 bits), the time required to load a word from the memory pad will be short, and the timing can be fast enough to keep pace with the logic.
再新:如DRAM之一些記憶體類型需要週期性地再新。再新可在暫停處理器或加速器時執行。對於小的記憶體墊,再新時間可為時間之一小部分。因此,即使系統在短時間段內停止,自總效能來看,藉由使用記憶體墊作為暫存器所獲得之增益亦值得停工時間。在一個實施例中,處理單元可包括自預定義數目向後計數之計數器。當計數器到達「0」時,處理單元可停止由處理器(例如,加速器)執行之當前任務,且觸發逐排再新記憶體墊之再新操作。當再新操作完成時,處理器可重新繼續其任務,且計數器可經重設以自預定義數目向後計數。 Refresh: some memory types, such as DRAM, need to be refreshed periodically. Refresh can be performed while the processor or accelerator is paused. For a small memory pad, the refresh time can be a small fraction of the total time. Therefore, even if the system halts for short periods, the gain obtained by using a memory pad as registers is worth the downtime in terms of overall performance. In one embodiment, the processing unit may include a counter that counts down from a predefined number. When the counter reaches "0", the processing unit may halt the current task being executed by the processor (for example, an accelerator) and trigger a refresh operation that refreshes the memory pad row by row. When the refresh operation completes, the processor may resume its task, and the counter may be reset to count down again from the predefined number.
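The countdown-and-refresh behavior can be modeled in a few lines of Python to show why the downtime is a small fraction of total time. The class name, the one-cycle-per-row refresh cost, and the interval and row counts used afterwards are assumptions chosen for illustration, not figures from the disclosure.

```python
class RefreshingProcessingUnit:
    """Toy cycle model of a processing unit that pauses to refresh its pad."""

    def __init__(self, refresh_interval, rows):
        self.refresh_interval = refresh_interval  # the predefined number
        self.counter = refresh_interval           # counts down to 0
        self.rows = rows                          # rows refreshed one by one
        self.work_cycles = 0
        self.refresh_cycles = 0

    def tick(self):
        """Run one cycle of the current task; refresh when the counter hits 0."""
        self.work_cycles += 1
        self.counter -= 1
        if self.counter == 0:
            # Task is halted; assume each row takes one cycle to refresh.
            self.refresh_cycles += self.rows
            self.counter = self.refresh_interval  # reset and resume the task

    def overhead(self):
        """Fraction of all cycles spent refreshing instead of working."""
        return self.refresh_cycles / (self.work_cycles + self.refresh_cycles)
```

Under the assumed values of a 10,000-cycle interval and a 512-row pad, 20,000 work cycles trigger two refreshes (1,024 refresh cycles), so refresh takes under 5% of all cycles in this model.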
圖86提供表示符合所揭示實施例之用於在分散式處理器記憶體晶片中執行至少一個指令的例示性方法之流程圖8600。舉例而言,在步驟8602處,可自分散式處理器記憶體晶片之基板上的記憶體陣列擷取至少一個資料值。在步驟8604處,可將所擷取之資料值儲存於由分散式處理器記憶體晶片之基板上的記憶體陣列之記憶體墊提供的暫存器中。在步驟8606處,諸如分散式處理器記憶體晶片板上之分散式處理器子單元中之一或多者的處理器元件可對來自記憶體墊暫存器之所儲存資料值操作。
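FIG. 86 provides a flowchart 8600 representing an exemplary method for executing at least one instruction in a distributed processor memory chip, consistent with the disclosed embodiments. For example, at step 8602, at least one data value may be retrieved from a memory array on a substrate of the distributed processor memory chip. At step 8604, the retrieved data value may be stored in a register provided by a memory pad of the memory array on the substrate of the distributed processor memory chip. At step 8606, a processor element, such as one or more of the distributed processor subunits on board the distributed processor memory chip, may operate on the stored data value from the memory pad register.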
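The three steps of flowchart 8600 can be written out as a short Python sketch. Plain lists stand in for the memory array and for the pad-backed register file, and the function and parameter names are hypothetical; only the step numbers come from the text.

```python
def run_instruction(memory_array, address, register_file, reg_index, operation):
    """Walk through steps 8602 to 8606 for a single data value."""
    value = memory_array[address]                 # step 8602: retrieve the data value
    register_file[reg_index] = value              # step 8604: store it in a pad-backed register
    return operation(register_file[reg_index])    # step 8606: operate on the stored value
```

For example, `run_instruction([5, 7, 9], 1, [0] * 4, 2, lambda x: x * 2)` loads the value 7 into register 2 and returns the doubled value.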
此處且貫穿全文,應理解,對暫存器檔案之所有參考皆應等同地指快取記憶體,此係因為暫存器檔案可為最低層級快取記憶體。 Here and throughout the text, it should be understood that all references to the register file should equally refer to the cache, because the register file can be the lowest level cache.
處理瓶頸 Processing bottlenecks
術語「第一」、「第二」、「第三」及其類似者僅用以區分不同術語。此等術語可能不提示元件之次序及/或時序及/或重要性。舉例而言,第一處理程序之前可為第二處理程序,及其類似者。 The terms "first", "second", "third" and the like are only used to distinguish different terms. These terms may not indicate the order and/or timing and/or importance of the components. For example, the first processing procedure can be preceded by the second processing procedure, and the like.
術語「耦接」可意謂直接連接及/或間接連接。 The term "coupled" can mean direct connection and/or indirect connection.
術語「記憶體/處理」、「記憶體及處理」及「記憶體處理」係以可互換方式使用。 The terms "memory/processing", "memory and processing" and "memory processing" are used interchangeably.
可提供可為記憶體/處理單元之多個方法、電腦可讀媒體、記憶體/處理單元及/或系統。 Multiple methods, computer-readable media, memory/processing units, and/or systems that can be memory/processing units can be provided.
記憶體/處理單元為具有記憶體及處理能力之硬體單元。 The memory/processing unit is a hardware unit with memory and processing capabilities.
記憶體/處理單元可為記憶體處理積體電路,可包括於記憶體處理積體電路中或可包括一或多個記憶體處理積體電路。 The memory/processing unit may be a memory processing integrated circuit, may be included in a memory processing integrated circuit, or may include one or more memory processing integrated circuits.
記憶體/處理單元可為如PCT專利申請公開案WO2019025892中所說明之分散式處理器。 The memory/processing unit may be a distributed processor as described in PCT Patent Application Publication WO2019025892.
記憶體/處理單元可包括如PCT專利申請公開案WO2019025892中所說明之分散式處理器。 The memory/processing unit may include a distributed processor as described in PCT Patent Application Publication WO2019025892.
記憶體/處理單元可屬於如PCT專利申請公開案WO2019025892中所說明之分散式處理器。 The memory/processing unit may belong to a distributed processor as described in PCT Patent Application Publication WO2019025892.
記憶體/處理單元可為如PCT專利申請公開案WO2019025892中所說明之記憶體晶片。 The memory/processing unit may be a memory chip as described in PCT Patent Application Publication WO2019025892.
記憶體/處理單元可包括如PCT專利申請公開案WO2019025892中所說明之記憶體晶片。 The memory/processing unit may include a memory chip as described in PCT Patent Application Publication WO2019025892.
記憶體/處理單元可為如PCT專利申請案第PCT/IB2019/001005號中所說明之分散式處理器。 The memory/processing unit can be a distributed processor as described in PCT Patent Application No. PCT/IB2019/001005.
記憶體/處理單元可屬於如PCT專利申請案第PCT/IB2019/001005號中所說明之分散式處理器。 The memory/processing unit may belong to a distributed processor as described in PCT Patent Application No. PCT/IB2019/001005.
記憶體/處理單元可為如PCT專利申請案第PCT/IB2019/001005號中所說明之記憶體晶片。 The memory/processing unit may be a memory chip as described in PCT Patent Application No. PCT/IB2019/001005.
記憶體/處理單元可包括如PCT專利申請案第PCT/IB2019/001005號中所說明之記憶體晶片。 The memory/processing unit may include a memory chip as described in PCT Patent Application No. PCT/IB2019/001005.
記憶體/處理單元可屬於如PCT專利申請案第PCT/IB2019/001005號中所說明之記憶體晶片。 The memory/processing unit may belong to a memory chip as described in PCT Patent Application No. PCT/IB2019/001005.
記憶體/處理單元可為使用晶圓間接合及多個導體彼此連接之積體電路。 The memory/processing unit can be an integrated circuit that uses inter-wafer bonding and multiple conductors to connect to each other.
對分散式處理器記憶體晶片、分散式記憶體處理積體電路、記憶體晶片、分散式處理器之任何參考可實施為藉由晶圓間接合及多個導體彼此連接之一對積體電路。 Any reference to distributed processor memory chips, distributed memory processing integrated circuits, memory chips, and distributed processors can be implemented as a pair of integrated circuits by bonding between wafers and connecting multiple conductors to each other .
記憶體/處理單元可藉由相比邏輯胞元更佳地適合記憶體胞元之第一製造製程來製造。因此,第一製造製程可被視為記憶體類別之製造製程。記憶體胞元可包括多個電晶體中之一者。邏輯胞元可包括一或多個電晶體。可應用第一製造製程以製造記憶體組。邏輯胞元可包括一起實施邏輯功能之一或多個電晶體,且可用作較大邏輯電路之基本建置區塊。記憶體胞元可包括一起實施記憶體功能之一或多個電晶體,且可用作較大邏輯電路之基本建置區塊。對應邏輯胞元可實施相同邏輯功能。 The memory/processing unit can be manufactured by a first manufacturing process that is better suited for memory cells than logic cells. Therefore, the first manufacturing process can be regarded as the manufacturing process of the memory type. The memory cell may include one of a plurality of transistors. The logic cell may include one or more transistors. The first manufacturing process can be applied to manufacture the memory bank. A logic cell can include one or more transistors that implement logic functions together, and can be used as a basic building block for larger logic circuits. A memory cell may include one or more transistors that perform memory functions together, and may be used as a basic building block for larger logic circuits. Corresponding logic cells can implement the same logic function.
記憶體/處理單元可不同於處理器、處理積體電路及/或處理單元中之任一者,該處理器、處理積體電路及/或處理單元藉由相比記憶體胞元更佳地適合於邏輯胞元之第二製造製程來製造。因此,第二製造製程可被視為邏輯類別之製造製程。第二製造製程可用以製造中央處理單元、圖形處理單元及其類似者。 The memory/processing unit may differ from any processor, processing integrated circuit, and/or processing unit that is manufactured by a second manufacturing process that is better suited to logic cells than to memory cells. The second manufacturing process may therefore be regarded as a logic-class manufacturing process. The second manufacturing process may be used to manufacture central processing units, graphics processing units, and the like.
相比處理器、處理積體電路及/或處理單元,記憶體/處理單元可更適合於執行較少算術密集型運算。 Compared to processors, processing integrated circuits, and/or processing units, memory/processing units may be more suitable for performing less arithmetic-intensive operations.
舉例而言,由第一製造製程製造之記憶體胞元可展現超過且甚至大大超過(例如,超過2倍、3倍、4倍、5倍、6倍、7倍、8倍、9倍、10倍及其類似者)由第一製造製程製造之邏輯電路之臨界尺寸的臨界尺寸。 For example, a memory cell manufactured by the first manufacturing process may exhibit a critical dimension that exceeds, and even greatly exceeds (for example, by a factor of more than 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like), the critical dimension of a logic circuit manufactured by the first manufacturing process.
第一製造製程可為類比製造製程,第一製造製程可為DRAM製造製程,及其類似者。 The first manufacturing process may be an analog manufacturing process, a DRAM manufacturing process, and the like.
由第一製造製程製造之邏輯胞元的大小可超過由第二製造製程製造之對應邏輯胞元的大小至少兩倍。對應邏輯胞元可具有與由第一製造製程製造之邏輯胞元相同的功能性。 The size of a logic cell manufactured by the first manufacturing process may exceed the size of a corresponding logic cell manufactured by the second manufacturing process by a factor of at least two. The corresponding logic cell may have the same functionality as the logic cell manufactured by the first manufacturing process.
第二製造製程可為數位製造製程。 The second manufacturing process may be a digital manufacturing process.
第二製造製程可為互補金屬氧化物半導體(CMOS)、雙極、雙極CMOS(BiCMOS)、雙擴散金屬氧化物半導體(DMOS)、氧化物上矽製造製程及其類似者中之任一者。 The second manufacturing process may be any of a complementary metal oxide semiconductor (CMOS), bipolar, bipolar CMOS (BiCMOS), double-diffused metal oxide semiconductor (DMOS), or silicon-on-oxide manufacturing process, and the like.
記憶體/處理單元可包括多個處理器子單元。 The memory/processing unit may include multiple processor sub-units.
一或多個記憶體/處理單元之處理器子單元可彼此獨立地操作及/或可彼此相配合及/或執行分散式處理。可以各種方式,例如以平面方式或以階層式方式執行分散式處理。 The processor subunits of one or more memory/processing units can operate independently of each other and/or can cooperate with each other and/or perform distributed processing. Distributed processing can be performed in various ways, such as in a planar manner or in a hierarchical manner.
平面方式可涉及使處理器子單元執行相同操作(且可能在或可能不在處理器子單元之間輸出處理結果)。 The planar approach may involve causing processor sub-units to perform the same operation (and may or may not output processing results between the processor sub-units).
階層式方式可涉及執行不同層級之處理操作序列,而某一層之處理操作在又一層級之處理操作之後進行。處理器子單元可經分配(動態地或靜態地)給不同層且參與階層式處理。 The hierarchical approach may involve executing a sequence of processing operations at different levels, and the processing operations at one level are performed after the processing operations at another level. Processor sub-units can be allocated (dynamically or statically) to different layers and participate in hierarchical processing.
分散式處理亦可涉及其他單元,例如記憶體/處理單元之控制器及/或不屬於記憶體/處理單元之單元。 Distributed processing may also involve other units, such as memory/processing unit controllers and/or units that are not memory/processing units.
以可互換方式使用術語邏輯及處理器子單元。 The terms logic and processor subunit are used interchangeably.
可以任何方式(分散式及/或非分散式及其類似者)執行本申請案中所提及之任何處理。 Any processing mentioned in this application can be executed in any way (distributed and/or non-distributed and the like).
在以下申請案中,對PCT專利申請公開案WO2019025892及PCT專利申請案第PCT/IB2019/001005號(2019年9月9日)進行各種參考及/或以參考方式併入。PCT專利申請公開案WO2019025892及/或PCT專利申請案第PCT/IB2019/001005號提供各種方法、系統、處理器、記憶體晶片及其類似者之非限制性實例。可提供其他方法、系統、處理器。 In the following application, various references are made to PCT Patent Application Publication WO2019025892 and PCT Patent Application No. PCT/IB2019/001005 (September 9, 2019), and/or they are incorporated by reference. PCT Patent Application Publication WO2019025892 and/or PCT Patent Application No. PCT/IB2019/001005 provide non-limiting examples of various methods, systems, processors, memory chips, and the like. Other methods, systems, and processors may be provided.
可提供處理系統(系統),其中處理器之前為一或多個記憶體/處理單元,每一記憶體及處理單元(記憶體/處理單元)具有處理資源及儲存資源。 A processing system (system) may be provided in which a processor is preceded by one or more memory/processing units, each memory and processing unit (memory/processing unit) having processing resources and storage resources.
處理器可請求或發指令給一或多個記憶體/處理單元以執行各種處理任務。各種處理任務之執行可減輕處理器之負擔,減少潛時,且在一些狀況下減少一或多個記憶體/處理單元與處理器之間的總資訊頻寬,及其類似者。 The processor can request or issue instructions to one or more memory/processing units to perform various processing tasks. The execution of various processing tasks can reduce the burden on the processor, reduce latency, and in some cases reduce the total information bandwidth between one or more memory/processing units and the processor, and the like.
處理器可用不同粒度提供指令及/或請求,例如處理器可發送針對某些處理資源之指令或可發送針對記憶體/處理單元之較高階指令,而不指定任何處理資源。 The processor can provide instructions and/or requests with different granularities. For example, the processor can send instructions for certain processing resources or can send higher-level instructions for memory/processing units without specifying any processing resources.
記憶體/處理單元可用任何方式(動態、靜態、分散式、集中式、離線、線上及其類似者)管理其處理及/或記憶體資源。資源之管理可在以下情況下執行:自主地、在處理器之控制下、在處理器進行組態之後,及其類似者。 The memory/processing unit can manage its processing and/or memory resources in any way (dynamic, static, distributed, centralized, offline, online, and the like). Resource management can be performed in the following situations: autonomously, under the control of the processor, after the processor is configured, and the like.
舉例而言,可將任務分割成可能需要一或多個記憶體/處理單元之一或多個處理資源及/或記憶體資源執行或一或多個指令之子任務。每一處理資源可經組態以執行(例如,獨立地或非獨立地)至少一個指令。參見例如藉由諸如PCT專利申請公開案WO2019025892之處理器子單元的處理資源對指令子系列之執行。 For example, a task may be split into subtasks that may require execution by one or more processing resources and/or memory resources of one or more memory/processing units, or one or more instructions. Each processing resource may be configured to execute (for example, independently or not) at least one instruction. See, for example, the execution of sub-series of instructions by processing resources such as the processor subunits of PCT Patent Application Publication WO2019025892.
亦可至少將記憶體資源之分配提供至除一或多個記憶體/處理單元以外之實體,例如可耦接至一或多個記憶體/處理單元之直接記憶體存取(DMA)單元。 The allocation of at least the memory resources may also be provided to entities other than the one or more memory/processing units, such as a direct memory access (DMA) unit that may be coupled to the one or more memory/processing units.
編譯器可針對由記憶體/處理單元執行之任務的每個類型準備組態檔案。組態檔案包括與任務類型相關聯之記憶體分配及處理資源分配。組態檔案可包括可由不同處理資源執行及/或可定義記憶體分配之指令。 A compiler may prepare a configuration file for each type of task executed by the memory/processing unit. The configuration file includes the memory allocation and processing resource allocation associated with the task type. The configuration file may include instructions that can be executed by different processing resources and/or that define memory allocations.
舉例而言,與矩陣乘法(將矩陣A乘以矩陣B,A*B=C)之任務相關的組態檔案可提示在何處儲存矩陣A之元素,在何處儲存矩陣B之元素,在何處儲存矩陣C之元素,在何處儲存在矩陣乘法期間產生之中間結果,且可包括針對用於執行與矩陣乘法相關之任何數學運算之處理資源的指令。組態檔案為資料結構之實例,可提供其他資料結構。 For example, the configuration file related to the task of matrix multiplication (multiply matrix A by matrix B, A*B=C) can prompt where to store the elements of matrix A and where to store the elements of matrix B. Where to store the elements of the matrix C, where to store the intermediate results generated during the matrix multiplication, and may include instructions for processing resources used to perform any mathematical operations related to the matrix multiplication. The configuration file is an example of the data structure, and other data structures can be provided.
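A configuration file of the kind described for matrix multiplication could be represented by a data structure along the following lines. This Python sketch is purely illustrative: the text does not define a file format, so the `Region` and `MatMulConfig` types, their field names, and the overlap check are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A contiguous allocation inside one memory bank (hypothetical layout)."""
    bank: int
    offset: int
    length: int


@dataclass
class MatMulConfig:
    """Where A, B, C, and intermediate results live, plus per-resource instructions."""
    a: Region
    b: Region
    c: Region
    scratch: Region
    instructions: dict = field(default_factory=dict)  # resource id -> instruction list


def regions_overlap(x, y):
    """True when two regions share at least one address in the same bank."""
    return (x.bank == y.bank
            and x.offset < y.offset + y.length
            and y.offset < x.offset + x.length)


def validate(cfg):
    """Check that no two allocations in the configuration collide."""
    regions = [cfg.a, cfg.b, cfg.c, cfg.scratch]
    return not any(regions_overlap(p, q)
                   for i, p in enumerate(regions)
                   for q in regions[i + 1:])
```

A compiler-like tool could emit one such object per task type and run `validate` before handing it to a memory/processing unit; other data structures would serve equally well.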
可藉由一或多個記憶體/處理單元以任何方式執行矩陣乘法。 The matrix multiplication can be performed in any manner by one or more memory/processing units.
One or more memory/processing units may multiply matrix A by vector V. This may be done in any manner. For example, it may involve each processing resource maintaining one row or column of the matrix (different processing resources maintaining different rows or columns), circulating among the processing resources the results of multiplying the rows or columns of the matrix by the vector (during the first iteration), and circulating the results of the previous multiplication (during the second through last iterations).
Assume that matrix A is a 4×4 matrix, vector V is a 1×4 vector, and there are four processing resources. Under these assumptions, the first row of matrix A is stored at the first processing resource, the second row at the second processing resource, the third row at the third processing resource, and the fourth row at the fourth processing resource. The multiplication starts by sending the first through fourth elements of vector V to the first through fourth processing resources, respectively, and multiplying those elements by different vectors of A to provide first intermediate results. The multiplication continues by circulating the first intermediate results: each processing resource sends the first intermediate result it computed to its neighboring processing resource. Each processing resource then multiplies the received first multiplication result by a vector to provide a second multiplication result. This process repeats until the multiplication of matrix A by vector V is complete.
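The ring-style multiplication may be sketched, under stated assumptions, as follows. This sketch is one possible reading only: it swaps in a common ring scheme in which the elements of V (rather than intermediate products) are passed between neighboring processing resources while each resource accumulates the dot product for its own row. The function name ring_matvec and the direction of circulation are assumptions.

```python
def ring_matvec(a, v):
    """Multiply matrix a (list of K rows of length K) by vector v on a ring of K resources.

    Resource i holds row a[i] and, initially, element v[i]. In every step each
    resource multiplies the element it currently holds by the matching entry of
    its row, adds the product to its accumulator, and forwards the element to
    its neighbor. After K steps resource i holds the i-th entry of a * v.
    """
    k = len(v)
    held = [(i, v[i]) for i in range(k)]  # (column index, value) held by resource i
    acc = [0] * k
    for _ in range(k):
        for i in range(k):
            col, value = held[i]
            acc[i] += a[i][col] * value
        # Resource i receives the element previously held by resource i + 1.
        held = held[1:] + held[:1]
    return acc
```

With four processing resources the loop runs four steps, matching the 4×4 example above; no processing resource ever needs more than one row of A and one element of V at a time.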
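FIG. 90A is an example of a system 10900 that includes one or more memory/processing units (collectively denoted 10910) and a processor 10920. The processor 10920 may send requests or instructions (via link 10931) to the one or more memory/processing units 10910, which in turn fulfill (or selectively fulfill) the requests and/or instructions and send the results (via link 10932) to the processor 10920, as explained above. The processor 10920 may further process the results to provide (via link 10933) one or more outputs.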
一或多個記憶體/處理單元可包括J(J為正整數)個記憶體資源10912(1,1)至10912(1,J)及K(K為正整數)個處理資源10911(1,1)至10911(1,K)。 One or more memory/processing units may include J (J is a positive integer) memory resources 10912 (1, 1) to 10912 (1, J) and K (K is a positive integer) processing resources 10911 (1, 1) to 10911 (1, K).
J可等於K或可不同於K。 J can be equal to K or can be different from K.
處理資源10911(1,1)至10911(1,K)可為例如處理群組或處理器子單元,如PCT專利申請公開案WO2019025892中所說明。 The processing resources 10911(1,1) to 10911(1,K) may be, for example, processing groups or processor subunits, as described in PCT Patent Application Publication WO2019025892.
The memory resources 10912(1,1) to 10912(1,J) may be memory instances, memory mats, or memory banks, as described in PCT Patent Application Publication WO2019025892.
一或多個記憶體/處理單元之資源(記憶體或處理)中的任一者之間可存在任何連接性及/或任何功能關係。 There may be any connectivity and/or any functional relationship between any of the resources (memory or processing) of one or more memory/processing units.
圖90B為記憶體/處理單元10910(1)之實例。 Figure 90B shows an example of the memory/processing unit 10910(1).
In FIG. 90B, the K (K being a positive integer) processing resources 10911(1,1) to 10911(1,K) form a loop because they are connected to each other in series (see link 10915). Each processing resource is also coupled to its own pair of dedicated memory resources (for example, processing resource 10911(1) is coupled to memory resources 10912(1) and 10912(2), and processing resource 10911(K) is coupled to memory resources 10912(J-1) and 10912(J)). The processing resources may be connected to each other in any other manner. The number of memory resources allocated to each processing resource may differ from two. Examples of connectivity between different resources are described in PCT Patent Application Publication WO2019025892.
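FIG. 90C is an example of a system 10901 that includes N (N being a positive integer) memory/processing units 10910(1) to 10910(N) and a processor 10920. The processor 10920 may send requests or instructions (via links 10931(1) to 10931(N)) to the memory/processing units 10910(1) to 10910(N), which in turn fulfill the requests and/or instructions and send the results (via links 10932(1) to 10932(N)) to the processor 10920, as explained above. The processor 10920 may further process the results to provide (via link 10933) one or more outputs.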
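FIG. 90D is an example of a system 10902 that includes N (N being a positive integer) memory/processing units 10910(1) to 10910(N) and a processor 10920. FIG. 90D illustrates a preprocessor 10909 that precedes the memory/processing units 10910(1) to 10910(N). The preprocessor may perform various preprocessing operations, such as frame extraction, header detection, and the like.
FIG. 90E is an example of a system 10903 that includes one or more memory/processing units 10910 and a processor 10920. FIG. 90E illustrates a preprocessor 10909 that precedes the one or more memory/processing units 10910 and a DMA controller 10908.
FIG. 90F illustrates a method 10800 for distributed processing of at least one information stream.
Method 10800 may start with step 10810 of receiving, by one or more memory processing integrated circuits, at least one information stream via a first communication channel, wherein each memory processing integrated circuit includes a controller, multiple processor subunits, and multiple memory units.
Step 10810 may be followed by steps 10820 and 10830.
Step 10820 may include buffering the information stream by the one or more memory processing integrated circuits.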
步驟10830可包括藉由一或多個記憶體處理積體電路對至少一個資訊串流執行第一處理操作以提供第一處理結果。 Step 10830 may include performing a first processing operation on at least one information stream by one or more memory processing integrated circuits to provide a first processing result.
步驟10830可涉及壓縮或解壓縮。 Step 10830 may involve compression or decompression.
Accordingly, the aggregate size of the information stream may exceed the aggregate size of the first processing results. The aggregate size of the information stream may reflect the amount of information received during a period of a given duration. The aggregate size of the first processing results may reflect the amount of first processing results output during any period of the same given duration.
Alternatively, the aggregate size of the information stream (or of any other information entity mentioned in this specification) may be smaller than the aggregate size of the first processing results. In this case, decompression is obtained.
步驟10830之後可接著為將第一處理結果發送至一或多個處理積體電路之步驟10840。 Step 10830 can be followed by step 10840 of sending the first processing result to one or more processing integrated circuits.
一或多個記憶體處理積體電路可由記憶體類別之製造製程製造。 One or more memory processing integrated circuits can be manufactured by the manufacturing process of the memory type.
一或多個記憶體處理積體電路可由邏輯類別之製造製程製造。 One or more memory processing integrated circuits can be manufactured by a logic type manufacturing process.
在記憶體處理積體單元中,記憶體單元中之每一者可耦接至處理器子單元。 In the memory processing integrated unit, each of the memory units can be coupled to a processor sub-unit.
步驟10840之後可接著為藉由一或多個處理積體電路對第一處理結果執行第二處理操作以提供第二處理結果之步驟10850。 Step 10840 can be followed by step 10850 of performing a second processing operation on the first processing result by one or more processing integrated circuits to provide a second processing result.
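Step 10820 and/or step 10830 may be instructed by the one or more processing integrated circuits, may be requested by the one or more processing integrated circuits, may be executed after the one or more processing integrated circuits configure the one or more memory processing integrated circuits, or may be executed independently, without intervention by the one or more processing integrated circuits.
The first processing operation may have a lower arithmetic intensity than the second processing operation.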
步驟10830及/或步驟10850可為以下各者中之至少一者:(a)蜂巢式網路處理操作;(b)其他網路相關處理操作(不同於蜂巢式網路之網路的處理);(c)資料庫處理操作;(d)資料庫分析處理操作;(e)人工智慧處理操作;或任何其他處理操作。 Step 10830 and/or step 10850 can be at least one of the following: (a) cellular network processing operations; (b) other network-related processing operations (different from cellular network processing) ; (C) Database processing operation; (d) Database analysis processing operation; (e) Artificial intelligence processing operation; or any other processing operation.
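The flow of method 10800 may be sketched, purely as an illustration, as follows. The function names, the chunk size, and the choice of a minimum/maximum reduction as the first processing operation are assumptions for this sketch; the description does not prescribe any particular operation. The sketch also assumes a non-empty stream.

```python
from collections import deque


def first_processing(chunk):
    """Simple reduction standing in for step 10830: keep only the minimum and maximum."""
    return (min(chunk), max(chunk))


def second_processing(first_results):
    """Stand-in for step 10850: combine per-chunk results into the overall value range."""
    lows = [low for low, _ in first_results]
    highs = [high for _, high in first_results]
    return max(highs) - min(lows)


def method_10800(stream, chunk_size):
    buffer = deque()
    first_results = []
    for item in stream:                      # step 10810: receive the information stream
        buffer.append(item)                  # step 10820: buffer it
        if len(buffer) == chunk_size:
            first_results.append(first_processing(buffer))  # step 10830
            buffer.clear()
    if buffer:                               # flush a final, partially filled chunk
        first_results.append(first_processing(buffer))
    # Step 10840: the first processing results are handed to the processing
    # integrated circuit, which performs step 10850.
    return second_processing(first_results)
```

In this sketch the first processing results are smaller than the stream they summarize, which corresponds to the case in which the aggregate size of the information stream exceeds that of the first processing results.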
Disaggregated-system memory/processing unit and method for distributed processing
A disaggregated system, a method for distributed processing, a processing/memory unit, a method for operating a disaggregated system, a method for operating a processing/memory unit, and a non-transitory computer-readable medium storing instructions for executing any of these methods may be provided. A disaggregated system allocates different subsystems to perform different functions. For example, storage may be implemented mainly in one or more storage subsystems, while computation may be performed mainly in one or more computing subsystems.
The disaggregated system may be a disaggregated server, may be one or more disaggregated servers, and/or may be other than one or more servers.
The disaggregated system may include one or more switching subsystems, one or more computing subsystems, one or more storage subsystems, and one or more processing/memory subsystems.
The one or more processing/memory subsystems, the one or more computing subsystems, and the one or more storage subsystems are coupled to each other via the one or more switching subsystems.
The one or more processing/memory subsystems may be included in one or more subsystems of the disaggregated system.
FIG. 87A illustrates various examples of disaggregated systems.
Any number of subsystems of any type may be provided. A disaggregated system may include one or more additional subsystems of types not shown in FIG. 87A, may include fewer types of subsystems, and the like.
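Disaggregated system 7101 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140, and a processing/memory subsystem 7110.
Disaggregated system 7102 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140, a processing/memory subsystem 7110, and an accelerator subsystem 7150.
Disaggregated system 7103 includes two storage subsystems 7130, a computing subsystem 7120, and a switching subsystem 7140 that includes a processing/memory subsystem 7110.
Disaggregated system 7104 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140 that includes a processing/memory subsystem 7110, and an accelerator subsystem 7150.
Including the processing/memory subsystem 7110 in the switching subsystem 7140 may reduce the traffic within disaggregated systems 7101 and 7102, may reduce switching latency, and the like.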
The different subsystems of the disaggregated system may communicate with each other using various communication protocols. It has been found that using Ethernet, and even Ethernet RDMA, communication protocols may increase throughput and may even reduce the complexity of the various control and/or storage operations related to exchanging information units between the elements of the disaggregated system.
分解式系統可藉由允許處理/記憶體子系統參與計算(尤其藉由執行記憶體密集型計算)來執行分散式處理。 Disaggregated systems can perform distributed processing by allowing the processing/memory subsystem to participate in calculations (especially by performing memory-intensive calculations).
舉例而言,假定N個運算單元應在其間共用資訊單元(全部共用),則(a)可將N個資訊單元發送至一或多個處理/記憶體子系統之一或多個處理/記憶體單元,(b)一或多個處理/記憶體單元可執行需要全部共用之計算,且(c)將N個經更新資訊單元發送至N個運算單元。此將需要大約N個傳送操作。 For example, assuming that N arithmetic units should share information units among them (all shared), then (a) N information units can be sent to one or more processing/memory subsystems or one or more processing/memory Volume unit, (b) one or more processing/memory units can perform calculations that need to be all shared, and (c) send N updated information units to N arithmetic units. This will require about N transfer operations.
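The transfer count in this example may be illustrated with a short calculation. The baseline in which every computing unit sends its information unit directly to every other unit is an assumption added here for comparison only; the function names are likewise hypothetical. Counting both directions, the centralized scheme needs 2N transfers, which grows linearly with N.

```python
def direct_exchange_transfers(n):
    """Assumed baseline: each of n units sends its information unit to the other n - 1 units."""
    return n * (n - 1)


def centralized_transfers(n):
    """Each of n units sends one information unit to the processing/memory unit
    and receives one updated information unit back."""
    return 2 * n
```

For N = 8, the assumed direct exchange needs 56 transfers, whereas the centralized scheme needs 16.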
For example, FIG. 87B illustrates distributed processing that updates a model of a neural network (the model includes the weights assigned to the nodes of the neural network).
Each of the N computing units PU(1) 7120(1) to PU(N) 7120(N) may belong to the computing subsystem 7120 of any of disaggregated systems 7101, 7102, 7103, and 7104.
The N computing units compute N partial model updates (N different parts of the update) 7121(1) to 7121(N) and send them (via the switching subsystem 7140) to the processing/memory subsystem 7110.
The processing/memory subsystem 7110 computes an updated model 7122 and sends the updated model (via the switching subsystem 7140) to the N computing units PU(1) 7120(1) to PU(N) 7120(N).
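The model update of FIG. 87B may be sketched as follows, under the assumption (not stated in this description) that each partial model update is a gradient-like vector of the same length as the weight vector and that the processing/memory subsystem averages the partial updates before applying them. The function names and the learning_rate parameter are hypothetical.

```python
def update_model(weights, partial_updates, learning_rate=0.1):
    """Combine N partial updates (one per computing unit) into an updated model."""
    n = len(partial_updates)
    # Average the partial updates element by element.
    averaged = [sum(column) / n for column in zip(*partial_updates)]
    # Apply the averaged update to the current weights.
    return [w - learning_rate * g for w, g in zip(weights, averaged)]


def broadcast(updated_model, n_units):
    """Return one copy of the updated model per computing unit."""
    return [list(updated_model) for _ in range(n_units)]
```

Each computing unit sends one partial update and receives one copy of the updated model, so the exchange stays linear in the number of computing units.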
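FIGS. 87C, 87D, and 87E illustrate examples of memory/processing units 7011, 7012, and 7013, respectively, and FIGS. 87F and 87G illustrate integrated circuits 7014 and 7015, each of which includes a memory/processing unit 9010 and one or more communication modules, such as an Ethernet module and an Ethernet RDMA module 7022.
A memory/processing unit includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040. The controller is configured to operate as a communication module or may be coupled to a communication module.
The connectivity between the controller 9020 and the multiple pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and the logic may be arranged in other (unpaired) ways.
One or more memory/processing units 9010 of the processing/memory subsystem 7110 may process the model update in parallel (using different logic and retrieving different parts of the model from different memory banks in parallel) and, benefiting from the large amount of memory resources and the very high bandwidth of the connections between the memory banks and the logic, may perform these computations efficiently.
The memory/processing units 7011, 7012, and 7013 of FIGS. 87C to 87E and the integrated circuits 7014 and 7015 of FIGS. 87F and 87G include one or more communication modules, such as an Ethernet module 7023 (in FIGS. 87C to 87G) and an Ethernet RDMA module 7022 (in FIGS. 87E and 87G).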
具有此等RDMA及/或乙太網路模組(在記憶體/處理單元內或在與記憶體/處理單元相同的積體電路內)大大加速分解式系統之不同元件之間的通信,且在RDMA之狀況下,大大簡化分解式系統之不同元件之間的通信。 Having these RDMA and/or Ethernet modules (in the memory/processing unit or in the same integrated circuit as the memory/processing unit) greatly accelerates the communication between the different components of the decomposed system, and In the case of RDMA, the communication between the different components of the decomposition system is greatly simplified.
應注意,包括RDMA及/或乙太網路模組之記憶體/處理單元在其他環境中可為有益的,即使當記憶體/處理單元不包括於分解式系統中時亦如此。 It should be noted that the memory/processing unit that includes RDMA and/or Ethernet modules can be beneficial in other environments, even when the memory/processing unit is not included in the disassembled system.
亦應注意,例如出於減少成本原因,可針對記憶體/處理單元之每個群組分配RDMA及/或乙太網路模組。 It should also be noted that, for example, for cost reduction reasons, RDMA and/or Ethernet modules can be allocated to each group of memory/processing units.
應注意,記憶體/處理單元、記憶體/處理單元之群組及甚至處理/記憶體子系統可包括其他通信埠,例如PCIe通信埠。 It should be noted that the memory/processing unit, the group of memory/processing units, and even the processing/memory subsystem may include other communication ports, such as PCIe communication ports.
Using RDMA and/or Ethernet modules may be cost-effective because it eliminates the need to connect the memory/processing unit to a bridge that is in turn connected to a network integrated circuit (NIC) that may have an Ethernet port.
Using RDMA and/or Ethernet modules may make Ethernet (or Ethernet RDMA) native to the memory/processing unit.
應注意,乙太網路僅為區域網路(LAN)協定之實例。PCIe僅為可在比乙太網路更長之距離上使用的另一通信協定之實例。 It should be noted that Ethernet is only an example of a local area network (LAN) protocol. PCIe is only an example of another communication protocol that can be used over longer distances than Ethernet.
FIG. 87H illustrates a method 7000 for distributed processing.
Method 7000 may include one or more processing iterations.
A processing iteration may be executed by one or more memory processing integrated circuits of the disaggregated system.
A processing iteration may be executed by one or more processing integrated circuits of the disaggregated system.
A processing iteration executed by the one or more memory processing integrated circuits may be followed by a processing iteration executed by the one or more processing integrated circuits.
A processing iteration executed by the one or more memory processing integrated circuits may be preceded by a processing iteration executed by the one or more processing integrated circuits.
Yet another processing iteration may be executed by other circuits of the disaggregated system. For example, one or more preprocessing circuits may perform any type of preprocessing, including preparing information units for a processing iteration executed by the one or more memory processing integrated circuits.
Method 7000 may include step 7020 of receiving information units by one or more memory processing integrated circuits of the disaggregated system.
每一記憶體處理積體單元可包括控制器、多個處理器子單元及多個記憶體單元。 Each memory processing integrated unit may include a controller, multiple processor sub-units, and multiple memory units.
一或多個記憶體處理積體電路可由記憶體類別之製造製程製造。 One or more memory processing integrated circuits can be manufactured by the manufacturing process of the memory type.
資訊單元可輸送神經網路之模型的部分。 The information unit can convey the part of the neural network model.
資訊單元可輸送至少一個資料庫查詢之部分結果。 The information unit can deliver part of the results of at least one database query.
資訊單元可輸送至少一個聚集資料庫查詢之部分結果。 The information unit can deliver part of the results of at least one aggregate database query.
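Step 7020 may include receiving the information units from one or more storage subsystems of the disaggregated system.
Step 7020 may include receiving the information units from one or more computing subsystems of the disaggregated system; the one or more computing subsystems may include multiple processing integrated circuits manufactured by a logic-class manufacturing process.
Step 7020 may be followed by step 7030 of performing, by the one or more memory processing integrated circuits, processing operations on the information units to provide processing results.
The aggregate size of the information units may exceed, may equal, or may be smaller than the aggregate size of the processing results.
Step 7030 may be followed by step 7040 of outputting the processing results by the one or more memory processing integrated circuits.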
步驟7040可包括將處理結果輸出至分解式系統之一或多個運算子系統,一或多個運算子系統可包括由邏輯類別之製造製程製造的多個處理積體電路。 Step 7040 may include outputting the processing result to one or more computing subsystems of the decomposition system, and the one or more computing subsystems may include multiple processing integrated circuits manufactured by the manufacturing process of the logic category.
步驟7040可包括將處理結果輸出至分解式系統之一或多個儲存子系統。 Step 7040 may include outputting the processing result to one or more storage subsystems of the disaggregated system.
資訊單元可自多個處理積體電路之處理單元的不同群組發送,且可為藉由多個處理積體電路以分散式方式執行之處理程序的中間結果之不同部分。處理單元之群組可包括至少一個處理積體電路。 Information units can be sent from different groups of processing units of multiple processing integrated circuits, and can be different parts of intermediate results of processing procedures executed in a distributed manner by multiple processing integrated circuits. The group of processing units may include at least one processing integrated circuit.
Step 7030 may include processing the information units to provide a result of the entire process.
步驟7040可包括將整個處理程序之結果發送至多個處理積體電路中之每一者。 Step 7040 may include sending the result of the entire processing procedure to each of the multiple processing integrated circuits.
中間結果之不同部分可為經更新神經網路模型之不同部分,且其中整個處理程序之結果為經更新神經網路模型。 The different parts of the intermediate result can be different parts of the updated neural network model, and the result of the entire processing procedure is the updated neural network model.
步驟7040可包括將經更新神經網路模型發送至多個處理積體電路中之每一者。 Step 7040 may include sending the updated neural network model to each of the multiple processing integrated circuits.
Step 7040 may be followed by step 7050 of performing, by the multiple processing integrated circuits, another process based at least in part on the processing results sent to the multiple processing integrated circuits.
Step 7040 may include outputting the processing results using a switching subunit of the disaggregated system.
Step 7020 may include receiving information units that are preprocessed information units.
FIG. 87I illustrates a method 7001 for distributed processing.
Method 7001 differs from method 7000 in that it includes step 7010 of preprocessing information by the multiple processing integrated circuits to provide preprocessed information units.
Step 7010 may be followed by steps 7020, 7030, and 7040.
資料庫分析加速 Database analysis acceleration
A device, a method, and a computer-readable medium storing instructions are provided for performing at least screening by a screening unit that belongs to the same integrated circuit as a memory unit, where the screening may indicate which entries are relevant to a given database query. An arbiter or any other flow-control manager may send the relevant entries to a processor and not send the irrelevant entries, thereby saving most of the traffic to and from the processor.
See, for example, FIG. 91A, which shows a processor (CPU 9240) and an integrated circuit that includes a memory and screening system 9220. The memory and screening system 9220 may include screening units 9224 coupled to memory unit entries 9222, and one or more arbiters (such as arbiter 9229 for sending relevant entries to the processor). Any arbitration process may be applied. There may be any relationship between the number of entries, the number of screening units, and the number of arbiters.
仲裁器可由能夠控制資訊流之任何單元替換,例如通信介面、流控制器及其類似者。 The arbiter can be replaced by any unit capable of controlling the flow of information, such as a communication interface, a flow controller, and the like.
The screening is based on one or more relevance/screening criteria.
The relevance may be set per database query and may be indicated in any manner; for example, a memory unit may store relevance flags 9224' that indicate which entries are relevant. There is also a storage device 9210 that stores K database segments 9220(k), where k ranges between 1 and K. It should be noted that the entire database may be stored in the memory units rather than in the storage device (a solution also referred to as a volatile-memory-stored database).
The memory unit entries may be too small to store the entire database and may therefore receive one segment at a time.
篩選單元可執行篩選操作,諸如比較欄位之值與臨限值,比較欄位之值與預定義值,判定欄位之值是否在預定義範圍內,及其類似者。 The filtering unit can perform filtering operations, such as comparing the value of the field with the threshold value, comparing the value of the field with a predefined value, and determining whether the value of the field is within the predefined range, and the like.
因此,篩選單元可執行已知資料庫篩選操作,且可為緊密且廉價的電路。 Therefore, the screening unit can perform screening operations of the known database, and can be a compact and inexpensive circuit.
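The screening (filtering) operations listed above may be sketched as simple predicates that produce one relevance flag per entry. The entry layout (a dictionary per database entry) and all function names are assumptions for this sketch; a hardware screening unit would implement the equivalent comparisons in circuitry.

```python
def threshold_filter(field, threshold):
    """Relevant if the field value exceeds a threshold."""
    return lambda entry: entry[field] > threshold


def equals_filter(field, value):
    """Relevant if the field value equals a predefined value."""
    return lambda entry: entry[field] == value


def range_filter(field, low, high):
    """Relevant if the field value lies within a predefined (inclusive) range."""
    return lambda entry: low <= entry[field] <= high


def flag_relevant(entries, predicate):
    """Return one relevance flag per stored entry."""
    return [predicate(entry) for entry in entries]
```

Each predicate is a single comparison (or a pair of comparisons), which is consistent with the screening unit being a compact circuit.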
The final results of the screening operations (for example, the content of the relevant database entries) 9101 are sent to the CPU 9240 for processing.
The memory and screening system 9220 may be replaced by a memory and processing system, as illustrated in FIG. 91B.
The memory and processing system 9229 includes processing units 9225 coupled to memory unit entries 9222. The processing units 9225 may perform the screening operations and may participate, at least in part, in performing one or more additional operations on the relevant records.
A processing unit may be customized to perform specific operations and/or may be a programmable unit configured to perform multiple operations. For example, a processing unit may be a pipelined processing unit, may include an ALU, may include multiple ALUs, and the like.
The processing units 9225 may perform all of the one or more additional operations.
Alternatively, part of the one or more additional operations is performed by the processing units, and the processor (CPU 9240) may perform another part of the one or more additional operations.
The final results of the processing operations (for example, a partial response 9102 to the database query, or a complete response 9103) are sent to the CPU 9240.
A partial response requires further processing.
FIG. 92A illustrates a memory/processing system 9228 that includes memory/processing units 9227 configured to perform the screening and the additional processing.
The memory/processing system 9228 uses the memory/processing units 9227 to implement the processing units and memory units of FIG. 91B.
處理器之作用可包括控制處理單元、執行一或多個額外操作之至少一部分,及其類似者。 The role of the processor may include controlling the processing unit, performing at least part of one or more additional operations, and the like.
記憶體條目與處理單元之組合可至少部分地由一或多個記憶體/處理單元實施。 The combination of memory entries and processing units can be implemented at least in part by one or more memory/processing units.
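FIG. 92B illustrates an example memory/processing unit 9010.
The memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040. The controller is configured to operate as a communication module or may be coupled to a communication module.
The connectivity between the controller 9020 and the multiple pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and the logic may be arranged in other (unpaired) ways. Multiple memory banks may be coupled to and/or managed by a single logic.
The memory/processing system receives a database query 9100 via an interface 9211. The interface 9211 may be a bus, a port, an input/output interface, and the like.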
應注意,對資料庫查詢之回應可由以下各者中之至少一者(或以下各者中之一或多者的組合)產生:一或多個記憶體/處理系統、一或多個記憶體及處理系統、一或多個記憶體及篩選系統、位於此等系統外部之一或多個處理器,及其類似者。 It should be noted that the response to the database query can be generated by at least one of the following (or a combination of one or more of the following): one or more memories/processing systems, one or more memories And processing systems, one or more memory and screening systems, one or more processors located outside these systems, and the like.
應注意,對資料庫查詢之回應可由以下各者中之至少一者(或以下各者中之一或多者的組合)產生:一或多個篩選單元、一或多個記憶體/處理 單元、一或多個處理單元、一或多個其他處理器(諸如,一或多個其他CPU),及其類似者。 It should be noted that the response to the database query can be generated by at least one of the following (or a combination of one or more of the following): one or more filtering units, one or more memories/processing Unit, one or more processing units, one or more other processors (such as one or more other CPUs), and the like.
任何處理程序可包括尋找相關資料庫條目,及其處理相關資料庫條目。處理可由一或多個處理實體執行。 Any processing procedure may include searching for related database entries and processing related database entries. The processing can be performed by one or more processing entities.
A processing entity may be at least one of the following: a processing unit of a memory and processing system (for example, processing unit 9225 of memory and processing system 9229), a processor subunit (or logic) of a memory/processing unit, another processor (for example, the CPU 9240 of FIGS. 91A, 91B, and 74), and the like.
The processing involved in generating a response to a database query may be performed by any one of the following, or by a combination of the following:
a. A processing unit 9225 of a memory and processing system 9229.
b. Processing units 9225 of different memory and processing systems 9229.
c. Processor subunits (or logic 9030) of one or more memory/processing units 9227 of a memory/processing system 9228.
d. Processor subunits (or logic 9030) of memory/processing units 9227 of different memory/processing systems 9228.
e. A controller of one or more memory/processing units 9227 of a memory/processing system 9228.
f. Controllers of one or more memory/processing units 9227 of different memory/processing systems 9228.
因此,在對資料庫查詢之回應中所涉及的處理可由以下各者之組合或子組合產生:(a)一或多個記憶體/處理單元之一或多個控制器、(b)記憶體處理系統之一或多個處理單元、(c)一或多個記憶體/處理單元之一或多個處理器子單元,及(d)一或多個其他處理器,及其類似者。 Therefore, the processing involved in the response to the database query can be generated by a combination or sub-combination of: (a) one or more memory/processing unit or one or more controllers, (b) memory One or more processing units of the processing system, (c) one or more memory/processing units or one or more processor sub-units, and (d) one or more other processors, and the like.
由多於一個處理實體執行之處理可被稱作分散式處理。 Processing performed by more than one processing entity can be referred to as distributed processing.
It should be noted that the screening may be performed by screening entities among one or more screening units and/or one or more processing units and/or one or more processor subunits. In this sense, a processing unit and/or a processor subunit that performs screening operations may be referred to as a screening unit.
處理實體可為篩選實體或可不同於篩選實體。 The processing entity may be a screening entity or may be different from the screening entity.
處理實體可執行由另一篩選實體視為相關之資料庫條目的處理操作。 The processing entity can perform processing operations on the database entries deemed to be related by another screening entity.
處理實體亦可執行篩選操作。 The processing entity can also perform filtering operations.
對資料庫查詢之回應可利用一或多個篩選實體及一或多個處理實體。 One or more screening entities and one or more processing entities can be used in response to database queries.
One or more screening entities and one or more processing entities may belong to the same system (for example, memory/processing system 9228, memory and processing system 9229, or memory and screening system 9220) or to different systems.
記憶體/處理單元可包括多個處理器子單元。處理器子單元可彼此獨立地操作,可彼此部分地合作,可參與分散式處理,及其類似者。 The memory/processing unit may include multiple processor sub-units. The processor sub-units can operate independently of each other, can partially cooperate with each other, can participate in distributed processing, and the like.
FIG. 92C illustrates multiple memory and screening systems 9220, multiple other processors (such as CPUs 9240), and a storage device 9210.
The multiple memory and screening systems 9220 may participate (concurrently or not) in screening one or more database entries based on one or more screening criteria within one of multiple database queries.
FIG. 92D illustrates multiple memory and processing systems 9229, multiple other processors (such as CPUs 9240), and a storage device 9210.
The multiple memory and processing systems 9229 may participate (concurrently or not) in the screening and at least part of the processing involved in responding to one of multiple database queries.
FIG. 92F illustrates multiple memory/processing systems 9228, multiple other processors (such as CPUs 9240), and a storage device 9210.
The multiple memory/processing systems 9228 may participate (concurrently or not) in the screening and at least part of the processing involved in responding to one of multiple database queries.
FIG. 92G illustrates a method 9300 for database analysis acceleration.
Method 9300 may start with step 9310 of receiving, by a memory processing integrated circuit, a database query that includes at least one relevance criterion indicating which database entries in a database are relevant to the database query.
The database entries that are relevant to the database query may be none, one, some, or all of the database entries of the database.
The memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit.
Step 9320 may be followed by step 9330 of sending the group of relevant database entries to one or more processing entities for further processing, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.
The phrase "substantially without sending" means either not sending irrelevant entries at all (while responding to the database query) or sending only an insignificant number of them. Insignificant may mean at most 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 percent, or any amount whose sending has no significant effect on bandwidth.
Step 9330 may be followed by step 9340 of processing the group of relevant database entries to provide a response to the database query.
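The steps of method 9300 may be sketched in software as follows. This is an illustrative model only: the function name method_9300, the representation of a relevance criterion as a Python callable, and the traffic_saved figure are assumptions for this sketch and do not describe the circuit implementation.

```python
def method_9300(stored_entries, relevance_criterion, process):
    """Screen entries near the memory, forward only relevant ones, then process them.

    Returns the response and the fraction of stored entries that never had to be
    sent to the processing entity.
    """
    # Step 9320: determine the group of relevant database entries.
    relevant = [entry for entry in stored_entries if relevance_criterion(entry)]
    # Step 9330: only the relevant group is sent; irrelevant entries stay in place.
    sent = list(relevant)
    # Step 9340: the processing entity processes the group to provide the response.
    response = process(sent)
    if stored_entries:
        traffic_saved = 1 - len(sent) / len(stored_entries)
    else:
        traffic_saved = 0.0
    return response, traffic_saved
```

For a query such as "sum of prices of items priced at 10 or more", only the matching entries cross the link to the processing entity; in the usage below half of the stored entries are never sent.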
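FIG. 92H illustrates a method 9301 for database analysis acceleration.
It is assumed that the screening and the entire processing required to respond to the database query are performed by the memory processing integrated circuit.
Method 9301 may start with step 9310 of receiving, by a memory processing integrated circuit, a database query that includes at least one relevance criterion indicating which database entries in a database are relevant to the database query.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit.
Step 9320 may be followed by step 9331 of sending the group of relevant database entries to one or more processing entities of the memory processing integrated circuit for full processing, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.
Step 9331 may be followed by step 9341 of fully processing the group of relevant database entries to provide a response to the database query.
Step 9341 may be followed by step 9351 of outputting the response to the database query from the memory processing integrated circuit.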
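FIG. 92I illustrates a method 9302 for database analysis acceleration.
It is assumed that the screening and only part of the processing required to respond to the database query are performed by the memory processing integrated circuit. The memory processing integrated circuit outputs partial results, which are processed by one or more other processing entities located outside the memory processing integrated circuit.
Method 9302 may start with step 9310 of receiving, by a memory processing integrated circuit, a database query that includes at least one relevance criterion indicating which database entries in a database are relevant to the database query.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit.
Step 9320 may be followed by step 9332 of sending the group of relevant database entries to one or more processing entities of the memory processing integrated circuit for partial processing, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.
Step 9332 may be followed by step 9342 of partially processing the group of relevant database entries to provide an intermediate response to the database query.
Step 9342 may be followed by step 9352 of outputting the intermediate response to the database query from the memory processing integrated circuit.
Step 9352 may be followed by step 9390 of further processing the intermediate response to provide a response to the database query.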
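Figure 92J illustrates a method 9303 for database analytics acceleration.
It is assumed that the memory processing integrated circuit performs the filtering of the relevant database entries but not their processing. The memory processing integrated circuit outputs a group of relevant database entries that is fully processed by one or more other processing entities located outside the memory processing integrated circuit.
Method 9303 may start with step 9310 of receiving, by the memory processing integrated circuit, a database query that includes at least one relevance criterion indicating which database entries in the database are relevant to the database query.
Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit.
Step 9320 may be followed by step 9333 of sending the group of relevant database entries to one or more processing entities located outside the memory processing integrated circuit, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.
Step 9333 may be followed by step 9391 of fully processing the intermediate response to provide the response to the database query.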
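Figure 92K illustrates a method 9304 for database analytics acceleration.
Method 9304 may start with step 9315 of receiving, by an integrated circuit, a database query that includes at least one relevance criterion indicating which database entries in the database are relevant to the database query, wherein the integrated circuit includes a controller, a filtering unit, and multiple memory units.
Step 9315 may be followed by step 9325 of determining, by the filtering unit and based on the at least one relevance criterion, a group of relevant database entries stored in the integrated circuit.
Step 9325 may be followed by step 9335 of sending the group of relevant database entries to one or more processing entities located outside the integrated circuit for further processing, substantially without sending irrelevant data entries stored in the integrated circuit to the one or more processing entities.
Step 9335 may be followed by step 9391.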
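Figure 92L illustrates a method 9305 for database analytics acceleration.
Method 9305 may start with step 9314 of receiving, by an integrated circuit, a database query that includes at least one relevance criterion indicating which database entries in the database are relevant to the database query, wherein the integrated circuit includes a controller, a filtering unit, and multiple memory units.
Step 9314 may be followed by step 9324 of determining, by a processing unit and based on the at least one relevance criterion, a group of relevant database entries stored in the integrated circuit.
Step 9324 may be followed by step 9334 of processing, by the processing unit, the group of relevant database entries, without processing irrelevant data entries stored in the integrated circuit, to provide a processing result.
Step 9334 may be followed by step 9344 of outputting the processing result from the integrated circuit.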
In any of methods 9300, 9301, 9302, 9304, and 9305, the memory processing integrated circuit outputs an output. The output may be a group of relevant database entries, one or more intermediate results, or one or more (complete) results.
The output may be preceded by retrieving one or more relevant database entries and/or one or more results (complete or intermediate) from a filtering entity and/or a processing entity of the memory processing integrated circuit.
The retrieval may be controlled in one or more ways, and may be controlled by an arbiter and/or one or more controllers of the memory processing integrated circuit.
The outputting and/or the retrieval may include controlling one or more parameters of the retrieval and/or of the outputting. The parameters may include retrieval timing, retrieval rate, retrieval source, bandwidth, order of retrieval, output timing, output rate, output source, bandwidth, order of output, type of retrieval method, type of arbitration method, and the like.
The outputting and/or the retrieval may apply a flow control process.
The outputting and/or the retrieval (for example, the applied flow control process) may be responsive to indicators, output from the one or more processing entities, regarding the completion of the processing of database entries of the group. An indicator may indicate whether an intermediate result is ready to be retrieved from a processing entity.
The outputting may include attempting to match the bandwidth used during the outputting to the maximum allowable bandwidth over a link that couples the memory processing integrated circuit to a requester unit. The link may be a link to the recipient of the output of the memory processing integrated circuit. The maximum allowable bandwidth may be dictated by the capacity and/or availability of the link, the capacity and/or availability of the recipient of the output, and the like.
The outputting may include attempting to output the output content in an optimal or suboptimal manner.
The outputting may include attempting to keep fluctuations of the output traffic rate below a threshold.
Any of methods 9300, 9301, 9302, and 9305 may include generating, by the one or more processing entities, processing status indicators that may indicate the progress of the further processing of the group of relevant database entries.
When the processing included in any of the methods mentioned above is performed by more than a single processing entity, the processing may be regarded as distributed processing, because it is performed in a distributed manner.
As indicated above, the processing may be performed in a hierarchical manner or in a flat manner.
Any of methods 9300 through 9305 may be performed by multiple systems that may respond to one or more database queries concurrently or sequentially.
Word embedding
As mentioned above, word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space of much lower dimension.
The vectors may be processed mathematically. For example, the vectors that belong to a matrix may be summed to provide a summed vector.
As yet another example, the covariance of a matrix (of a sentence) may be calculated. This may include multiplying the matrix by its transpose.
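As a non-limiting illustration, the two operations mentioned above may be sketched as follows, assuming a sentence is represented as a matrix whose rows are word vectors. The helper names `sum_vectors` and `times_transpose` are hypothetical. The second helper computes only the product of the matrix with its transpose, which is one ingredient of a covariance calculation and not a complete, mean-centered covariance.

```python
def sum_vectors(matrix):
    """Element-wise sum of the row vectors of a matrix."""
    return [sum(column) for column in zip(*matrix)]


def times_transpose(matrix):
    """Product of a matrix with its own transpose (M x M^T)."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
        for row_i in matrix
    ]


# Hypothetical sentence of three words, each embedded in two dimensions.
sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = sum_vectors(sentence)
gram = times_transpose(sentence)
```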
A memory/processing unit may store the vocabulary. In particular, portions of the vocabulary may be stored in multiple memory banks of the memory/processing unit.
Thus, the memory/processing unit may be accessed using access information (such as retrieval keys) that represents the set of words or phrases of a sentence, so that the vectors representing those words or phrases are retrieved from at least some of the memory banks of the memory/processing unit.
Different memory banks of the memory/processing unit may store different portions of the vocabulary and may be accessed in parallel (depending on the distribution of the indexes of the sentence). Even when more than a single row of a memory bank needs to be accessed sequentially, prediction may reduce the penalty.
The allocation of the words of the vocabulary among the different memory banks of the memory/processing unit may be optimized, or may at least be highly beneficial, in the sense that it increases the chance of parallel access to different memory banks per sentence. The allocation may be learned per user, per general population, or per group of people.
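As a non-limiting illustration of why the allocation matters, the following sketch counts how many sequential access rounds a sentence needs under two hypothetical allocations of a six-word vocabulary to three memory banks. It assumes that each bank can serve one vector per round and that all banks are accessed in parallel. The name `access_rounds` and both allocations are assumptions of this sketch.

```python
from collections import Counter


def access_rounds(sentence, bank_of):
    """Rounds needed when each bank serves one vector per round."""
    per_bank = Counter(bank_of[word] for word in set(sentence))
    return max(per_bank.values())


vocab = ["the", "cat", "sat", "on", "mat", "dog"]
# Allocation A: spread consecutive vocabulary entries over the banks.
round_robin = {word: i % 3 for i, word in enumerate(vocab)}
# Allocation B: store consecutive vocabulary entries in the same bank.
clustered = {word: i // 2 for i, word in enumerate(vocab)}

sentence = ["the", "cat", "sat"]
rounds_a = access_rounds(sentence, round_robin)
rounds_b = access_rounds(sentence, clustered)
```

For this sentence, allocation A places the three words in three different banks, whereas allocation B places two of them in the same bank and therefore needs an additional round.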
In addition, the memory/processing unit may be used to perform at least some of the processing operations (by its logic). This may reduce the bandwidth required from buses external to the memory/processing unit, and multiple operations may be computed efficiently, even in parallel, using multiple processors of the memory/processing unit.
A memory bank may be associated with a logic.
At least part of the processing operations may be performed by one or more additional processors (such as vector processors, including but not limited to vector adders).
The memory/processing unit may include one or more additional processors that may be allocated to some or all of the memory bank and logic pairs.
Thus, a single additional processor may be allocated to all or some of the memory bank and logic pairs. As yet another example, the additional processors may be arranged in a hierarchical manner, so that an additional processor of a certain level processes the outputs of additional processors of a lower level.
It should be noted that the processing operations may be performed without using any additional processor, by the logic of the memory/processing unit.
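Figures 89A, 89B, 89C, 89D, 89E, 89F, and 89G illustrate examples of memory/processing units 9010, 9011, 9012, 9013, 9014, 9015, and 9019, respectively. Memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory bank 9040.
It should be noted that the logics 9030 and the memory banks 9040 may be coupled to the controller and/or to each other in other ways. For example, multiple buses may be provided between the controller and the logics, the logics may be arranged in multiple layers, a single logic may be shared by multiple memory banks (see, for example, Figure 89E), and the like.
The length of a page of each memory bank within memory/processing unit 9010 may be defined in any manner. For example, it may be small enough, and the number of memory banks large enough, to enable a large number of vectors to be output in parallel without wasting many bits on irrelevant information.
Logic 9030 may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like. A partial ALU (or partial memory controller) is able to perform only part of the functions that a full ALU (or memory controller) can perform. Any logic or sub-processor described in this application may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like.
The connectivity between controller 9020 and the multiple pairs of logic 9030 and memory bank 9040 may be implemented in other ways. The memory banks and the logics may be arranged in other (non-paired) ways.
Memory/processing unit 9010 may have no additional vector processor, in which case the processing of the vectors (from the memory banks) is performed by logic 9030.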
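Figure 89B illustrates an additional processor, such as vector processor 9050, coupled to internal bus 9021.
Figure 89C illustrates an additional processor, such as vector processor 9050, coupled to internal bus 9021. The one or more additional processors perform the processing operations, alone or in cooperation with the logics.
Figure 89D also illustrates a host 9018 coupled to memory/processing unit 9010 via bus 9022.
Figure 89D also illustrates a vocabulary 9070 that maps words/phrases 9072 to vectors 9073. The memory/processing unit is accessed using retrieval keys 9071, each of which represents a previously recognized word or phrase. Host 9018 sends multiple retrieval keys 9071 that represent a sentence to the memory/processing unit, and the memory/processing unit may output the vectors 9073 or the final result of processing operations applied to the vectors related to the sentence. The words/phrases themselves are usually not stored in memory/processing unit 9010.
The memory controller functionality for controlling the memory banks may be included (solely or partially) in the logics, may be included (solely or partially) in controller 9020, and/or may be included (solely or partially) in one or more memory controllers (not shown) within memory/processing unit 9010.
The memory/processing unit may be configured to maximize the throughput of vectors/results sent to host 9018, or may apply any process for controlling traffic within the memory/processing unit and/or traffic between the memory/processing unit and the host computer (or any other entity external to the memory/processing unit).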
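Different logics 9030 are coupled to the memory banks 9040 of the memory/processing unit and may perform mathematical operations on the vectors (preferably in parallel) to produce processed vectors. One logic 9030 may send a vector to another logic (see, for example, line 38 of Figure 89G), and the other logic may apply a mathematical operation to the received vector and to a vector that it computed itself. The logics may be arranged in levels, and a logic of a certain level may process vectors or intermediate results (produced by applying mathematical operations) from logics of the previous level. The logics may form a tree (binary, ternary, and the like).
When the aggregate size of the processed vectors exceeds the aggregate size of the results, a reduction in output bandwidth (out of the memory/processing unit) is obtained. For example, when K vectors are summed by the memory/processing unit to provide a single output vector, a K:1 reduction in bandwidth is obtained.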
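As a non-limiting illustration, the following sketch sums K vectors level by level, in the manner of a binary tree of logics, and compares the number of words that would leave the unit with and without the in-memory summation. The names `add` and `tree_sum` and the example sizes (K = 8 vectors of 4 words each) are assumptions of this sketch.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def tree_sum(vectors):
    """Sum vectors level by level, as a binary tree of adders would."""
    level = list(vectors)
    while len(level) > 1:
        # Each pair is combined by one adder of the current level.
        nxt = [add(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired vector moves up unchanged
        level = nxt
    return level[0]


K, WORDS_PER_VECTOR = 8, 4
vectors = [[k] * WORDS_PER_VECTOR for k in range(1, K + 1)]
result = tree_sum(vectors)

words_without_reduction = K * WORDS_PER_VECTOR  # all K vectors are output
words_with_reduction = len(result)              # only the summed vector is output
reduction = words_without_reduction // words_with_reduction
```

With these sizes the unit outputs 4 words instead of 32, which corresponds to the K:1 reduction for K = 8.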
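Controller 9020 may be configured to open multiple memory banks in parallel by broadcasting the addresses of the different vectors to be accessed.
The controller may be configured to control the order in which the different vectors are retrieved from the multiple memory banks (or from any intermediate buffer or storage circuit that stores the different vectors; see buffer 9033 of Figure 89D) based, at least in part, on the order of the words or phrases in the sentence.
Controller 9020 may be configured to manage the retrieval of the different vectors based on one or more parameters related to outputting the vectors out of memory/processing unit 9010. For example, the rate at which the different vectors are retrieved from the memory banks may be set to substantially equal the allowable rate at which the different vectors are output from memory/processing unit 9010.
The controller may output the different vectors out of memory/processing unit 9010 by applying any traffic shaping process. For example, controller 9020 may aim to output the different vectors at a rate as close as possible to the maximum rate allowed by the host computer or by the link that couples memory/processing unit 9010 to the host computer. As yet another example, the controller may output the different vectors while minimizing, or at least substantially reducing, fluctuations of the traffic rate over time.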
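As a non-limiting illustration of one possible traffic shaping process, the following sketch assigns vectors, in order, to output time slots so that no slot carries more words than the link allows. The function name `shape`, the slot model, and the example sizes are assumptions of this sketch; any other shaping process may be applied instead.

```python
def shape(vector_sizes, words_per_slot):
    """Assign each vector, in order, to a time slot without exceeding the link budget."""
    schedule, slot, used = [], 0, 0
    for size in vector_sizes:
        if size > words_per_slot:
            raise ValueError("vector larger than the per-slot link budget")
        if used + size > words_per_slot:
            slot += 1  # link budget exhausted: continue in the next slot
            used = 0
        schedule.append(slot)
        used += size
    return schedule


# Five hypothetical vectors of 4 words each over a link carrying 8 words per slot.
schedule = shape([4, 4, 4, 4, 4], words_per_slot=8)
```

Filling each slot up to the budget keeps the output rate close to the maximum the link allows while avoiding bursts above it.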
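Controller 9020 belongs to the same integrated circuit as memory banks 9040 and logics 9030, and can therefore easily receive feedback from the different logics/memory banks regarding the retrieval status of the different vectors (for example, whether a vector is ready, or whether a vector is ready but another vector is being retrieved or is about to be retrieved from the same memory bank), and the like. The feedback may be provided in any manner: via dedicated control lines, via shared control lines, using one or more status bits, and the like (see status lines 9039 of Figure 89F).
Controller 9020 may control the retrieval and outputting of the different vectors independently, and may thereby reduce the involvement of the host computer. Alternatively, the host computer may be unaware of the management capabilities of the controller and may continue to send detailed instructions, in which case memory/processing unit 9010 may ignore the detailed instructions, may hide the management capabilities of the controller, and the like. The solutions mentioned above may be used based on a protocol that is manageable by the host computer.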
It has been found that performing the processing operations in the memory/processing unit is highly beneficial (in terms of energy), even when these operations consume more power than the corresponding processing operations in the host, and even when they consume more power than the transfer operations between the host and the memory/processing unit. For example, assuming that the vectors are large enough, that transferring a data unit consumes 4 pJ, and that a processing operation on the data unit (by the host) consumes 0.1 pJ, processing the data unit by the memory/processing unit is more effective when the energy consumed by processing the data unit in the memory/processing unit is lower than 5 pJ.
Each vector (of a matrix that represents a sentence) may be represented by a sequence of words (or other multi-bit segments). For simplicity of explanation, it is assumed that the multi-bit segments are words.
Additional power savings may be obtained when a vector includes zero-valued words. Instead of outputting an entire zero-valued word, a zero-value flag that is shorter than a word (for example, a single bit) may be output, possibly even conveyed over a dedicated control line. Flags may also be allocated to other values (for example, a word of value 1).
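As a non-limiting illustration, the zero-value flag may be sketched as follows. The sketch assumes a 16-bit word and a one-bit flag that accompanies every word. Both sizes, and the names `encode`, `decode`, and `bits_sent`, are assumptions of this sketch.

```python
WORD_BITS = 16  # assumed word width for this sketch


def encode(words):
    """Replace each zero-valued word with a flag; other words are sent in full."""
    return [(1, None) if w == 0 else (0, w) for w in words]


def decode(stream):
    """Rebuild the original words from the flagged stream."""
    return [0 if flag else w for flag, w in stream]


def bits_sent(stream):
    """One flag bit per word, plus a full word only when the flag is clear."""
    return sum(1 if flag else 1 + WORD_BITS for flag, _ in stream)


vector = [0, 7, 0, 0, 3]
stream = encode(vector)
flagged_bits = bits_sent(stream)
raw_bits = len(vector) * WORD_BITS
```

For this vector, three of the five words are zero, so 37 bits are sent instead of 80. The saving grows with the fraction of zero-valued words.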
Figure 88A illustrates a method 9400 for embedding or, more precisely, a method for retrieving feature vector related information. The feature vector related information may include feature vectors and/or results of processing feature vectors.
Method 9400 may start with step 9410 of receiving, by a memory processing integrated circuit, retrieval information for retrieving multiple requested feature vectors that may be mapped to multiple sentence segments.
The memory processing unit may include a controller, multiple processor subunits, and multiple memory units. Each of the memory units may be coupled to a processor subunit.
Step 9410 may be followed by step 9420 of retrieving the multiple requested feature vectors from at least some of the multiple memory units.
The retrieval may include concurrently requesting, from two or more memory units, requested feature vectors stored in the two or more memory units.
The requesting is performed based on a known mapping between sentence segments and the locations of the feature vectors mapped to the sentence segments.
The mapping may be uploaded during a boot process of the memory processing integrated circuit.
It may be beneficial to retrieve as many requested feature vectors as possible at once, but this depends on where the requested feature vectors are stored and on the number of different memory units.
If more than one requested feature vector is stored in the same memory bank, predictive retrieval may be applied to reduce the penalty associated with retrieving information from the memory bank. Various methods for reducing the penalty are described in various sections of this application.
The retrieval may include applying predictive retrieval to at least some requested feature vectors of a set of requested feature vectors stored in a single memory unit.
The requested feature vectors may be distributed among the memory units in an optimal manner.
The requested feature vectors may be distributed among the memory units based on expected retrieval patterns.
The retrieval of the multiple requested feature vectors may be performed according to a certain order, for example according to the order of the sentence segments in one or more sentences.
The retrieval of the multiple requested feature vectors may be performed at least partially out of order, in which case the retrieval may further include reordering the multiple requested feature vectors.
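As a non-limiting illustration, out-of-order retrieval followed by reordering may be sketched as follows. The sketch assumes that each retrieved feature vector arrives tagged with the position of its sentence segment. The tagging scheme and the name `reorder` are assumptions of this sketch.

```python
def reorder(arrivals, count):
    """Place vectors that arrived in arbitrary order back into sentence order."""
    slots = [None] * count
    for position, vector in arrivals:
        slots[position] = vector
    if any(slot is None for slot in slots):
        raise ValueError("a requested feature vector is missing")
    return slots


# Hypothetical arrival order: the vector of the third segment is ready first.
arrivals = [(2, [5, 6]), (0, [1, 2]), (1, [3, 4])]
ordered = reorder(arrivals, count=3)
```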
The retrieval of the multiple requested feature vectors may include buffering the multiple requested feature vectors before they are read by the controller.
The retrieval of the multiple requested feature vectors may include generating buffer status indicators that indicate when one or more buffers associated with the multiple memory units store one or more requested feature vectors.
The method may include conveying the buffer status indicators over dedicated control lines.
One dedicated control line may be allocated per memory unit.
The buffer status indicators may be status bits stored in the one or more buffers.
The method may include conveying the buffer status indicators over one or more shared control lines.
Step 9420 may be followed by step 9430 of processing the multiple requested feature vectors to provide processing results.
Additionally or alternatively, step 9420 may be followed by step 9440 of outputting, from the memory processing integrated circuit, an output that may include at least one of (a) the requested feature vectors and (b) results of processing the requested feature vectors. At least one of (a) the requested feature vectors and (b) the results of processing the requested feature vectors is also referred to as feature vector related information.
When step 9430 is executed, step 9440 may include outputting (at least) the results of processing the requested feature vectors.
When step 9430 is skipped, step 9440 includes outputting the requested feature vectors and may not include outputting results of processing the requested feature vectors.
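Figure 88B illustrates a method 9401 for embedding.
It is assumed that the output includes the requested feature vectors but does not include results of processing the requested feature vectors.
Method 9401 may start with step 9410 of receiving, by a memory processing integrated circuit, retrieval information for retrieving multiple requested feature vectors that may be mapped to multiple sentence segments.
Step 9410 may be followed by step 9420 of retrieving the multiple requested feature vectors from at least some of the multiple memory units.
Step 9420 may be followed by step 9431 of outputting, from the memory processing integrated circuit, an output that includes the requested feature vectors but does not include results of processing the requested feature vectors.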
Figure 88C illustrates a method 9402 for embedding.
It is assumed that the output includes results of processing the requested feature vectors.
Method 9402 may start with step 9410 of receiving, by a memory processing integrated circuit, retrieval information for retrieving multiple requested feature vectors that may be mapped to multiple sentence segments.
Step 9410 may be followed by step 9420 of retrieving the multiple requested feature vectors from at least some of the multiple memory units.
Step 9420 may be followed by step 9430 of processing the multiple requested feature vectors to provide processing results.
Step 9430 may be followed by step 9442 of outputting, from the memory processing integrated circuit, an output that may include the results of processing the requested feature vectors.
The outputting may include applying traffic shaping to the output.
The outputting may include attempting to match the bandwidth used during the outputting to the maximum allowable bandwidth over a link that couples the memory processing integrated circuit to a requester unit.
The outputting may include attempting to keep fluctuations of the output traffic rate below a threshold.
Any of the retrieval and outputting steps may be performed under the control of the host and/or independently, or partially independently, by the controller.
The host may send retrieval commands of different granularity, ranging from generic retrieval information that is sent regardless of the locations of the requested feature vectors within the multiple memory units, to detailed retrieval information that is based on the locations of the requested feature vectors within the multiple memory units.
The host may control (or attempt to control) the timing of the different retrieval operations within the memory processing integrated circuit, but may also be indifferent to that timing.
The controller may be controlled by the host at various levels, and may even ignore the detailed commands of the host and independently control at least the retrieval and/or the outputting.
The processing of the requested feature vectors may be performed by at least one of (or a combination of one or more of) the following: one or more memory/processing units, one or more processors located outside the one or more memory/processing units, and the like.
It should be noted that the processing of the requested feature vectors may be performed by at least one of (or a combination of one or more of) the following: one or more processor subunits, a controller, one or more vector processors, and one or more other processors located outside the one or more memory/processing units.
The processing of the requested feature vectors may be performed by, or may result from, any one or any combination of the following:
a. Processor subunits (or logics 9030) of a memory/processing unit.
b. Processor subunits (or logics 9030) of multiple memory/processing units.
c. A controller of a memory/processing unit.
d. Controllers of multiple memory/processing units.
e. One or more vector processors of a memory/processing unit.
f. One or more vector processors of multiple memory/processing units.
Thus, the processing of the requested feature vectors may be performed by any combination or sub-combination of: (a) one or more controllers of one or more memory/processing units; (b) one or more processor subunits of one or more memory/processing units; (c) one or more vector processors of one or more memory/processing units; and (d) one or more other processors located outside the one or more memory/processing units.
Processing performed by more than one processing entity may be referred to as distributed processing.
A memory/processing unit may include multiple processor subunits. The processor subunits may operate independently of each other, may partially cooperate with each other, may participate in distributed processing, and the like.
The processing may be performed in a flat manner, in which all processor subunits perform the same operations (and may or may not output processing results between them).
The processing may be performed in a hierarchical manner, in which the processing involves a sequence of processing operations of different levels, where the processing operations of one level follow the processing operations of another level. The processor subunits may be allocated (dynamically or statically) to different levels and participate in the hierarchical processing.
Any processing of the requested feature vectors that is performed by more than one processing entity (processor subunit, controller, vector processor, other processor) may be distributed in any manner (flat, hierarchical, or otherwise). For example, the processor subunits may output their processing results to the controller, which may further process the results. One or more other processors located outside the one or more memory/processing units may further process the output of the memory processing integrated circuit.
It should be noted that the retrieval information may also include information for retrieving requested feature vectors that are not mapped to sentence segments. Such feature vectors may be mapped to one or more persons, devices, or any other entities that may be related to the sentence segments. Examples include a user of the device that sensed the sentence segment, the device that sensed the segment, a user identified as the source of the sentence segment, a website that was accessed when the sentence was generated, the location at which the sentence was captured, and the like.
Methods 9400, 9401, and 9402 are applicable, mutatis mutandis, to processing and/or retrieving requested feature vectors that are not mapped to sentence segments.
Non-limiting examples of processing feature vectors include a sum, a weighted sum, an average, a subtraction, or the application of any other mathematical function.
Hybrid device
As processor speeds and memory sizes both continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from throughput limitations of conventional computer architecture. In particular, data transfer from memory to the processor (outside the logic die, for example from external DRAM memory) is often a bottleneck compared to the actual computations performed by the processor. Accordingly, the number of clock cycles spent reading from and writing to memory increases significantly for memory-intensive processes. These clock cycles result in lower effective processing speeds, because clock cycles consumed by reading from and writing to memory cannot be used to perform operations on data. Moreover, the computational bandwidth of the processor is generally larger than the bandwidth of the buses that the processor uses to access the memory.
These bottlenecks are particularly pronounced for memory-intensive processes, such as neural networks and other machine learning algorithms; for database construction, index searching, and querying; and for other tasks that include more read and write operations than data processing operations.
The present disclosure describes solutions for mitigating or overcoming one or more of the problems set forth above, among other problems in the prior art.
There may be provided a hybrid device for memory-intensive processing. The hybrid device may include a base die, multiple processors, first memory resources of at least one other die, and second memory resources of at least one further die.
The base die and the at least one other die are connected to each other by wafer-on-wafer bonding.
The multiple processors are configured to perform processing operations and to retrieve information stored in the first memory resources.
The second memory resources are configured to send additional information from the second memory resources to the first memory resources.
The overall bandwidth of first paths between the base die and the at least one other die exceeds the overall bandwidth of second paths between the at least one other die and the at least one further die, and the storage capacity of the first memory resources is a fraction of the storage capacity of the second memory resources.
The second memory resources are high-bandwidth memory (HBM) resources.
The at least one further die is a stack of high-bandwidth memory (HBM) chips.
At least some of the second memory resources may belong to a further die that is connected to the base die by connectivity other than wafer-to-wafer bonding.
At least some of the second memory resources belong to a further die that is connected to the other die by connectivity other than wafer-to-wafer bonding.
The first memory resources and the second memory resources are cache memories of different levels.
The first memory resources are positioned between the base die and the second memory resources.
The first memory resources are positioned to one side of the second memory resources.
The other die is configured to perform additional processing, wherein the other die includes a plurality of processor subunits and the first memory resources.
Each processor subunit is coupled to a unique portion of the first memory resources that is allocated to the processor subunit.
The unique portion of the first memory resources is at least one memory bank.
The multiple processors are a plurality of processor subunits included in a memory processing chip that also includes the first memory resources.
The base die includes the multiple processors, wherein the multiple processors are a plurality of processor subunits coupled to the first memory resources via conductors formed using wafer-to-wafer bonding.
Each processor subunit is coupled to a unique portion of the first memory resources that is allocated to the processor subunit.
可提供混合積體電路,其可利用晶圓上晶圓(WOW)連接性以將基礎晶粒之至少一部分耦接至第二記憶體資源,該等第二記憶體資源包括於一或多個其他晶粒中且使用不同於WOW連接性之連接性而連接。第二記憶體資源之實例可為高頻寬記憶體(HBM)記憶體資源。在各種圖中,第二記憶體資源包括於HBM記憶體單元之堆疊中,可使用矽穿孔(TSV)連接性而耦接至控制器。控制器可包括於基礎晶粒中或耦接(例如,經由微凸塊)至基礎晶粒 之至少部分。 A hybrid integrated circuit can be provided, which can utilize wafer-on-wafer (WOW) connectivity to couple at least a part of the basic die to a second memory resource, and the second memory resource is included in one or more Other dies are connected using a connectivity different from that of WOW. An example of the second memory resource may be a high-bandwidth memory (HBM) memory resource. In the various figures, the second memory resource is included in the stack of HBM memory cells and can be coupled to the controller using via-silicon via (TSV) connectivity. The controller can be included in the base die or coupled (for example, via micro bumps) to the base die At least part of it.
The base die may be a logic die, but may instead be a memory/processing unit.
WOW connectivity is used to couple one or more portions of the base die to one or more portions of another die (the WOW-connected die), which may be a memory die or a memory/processing unit. WOW connectivity provides extremely high throughput.
A stack of high-bandwidth memory (HBM) chips may be coupled to the base die (directly or via the WOW-connected die), and may provide a high-throughput connection and very extensive memory resources.
The WOW-connected die may be coupled between the stack of HBM chips and the base die to form an HBM memory chip stack that has TSV connectivity and a WOW-connected die at its bottom.
An HBM chip stack with TSV connectivity and a WOW-connected die at its bottom may provide a multi-level memory hierarchy, in which the WOW-connected die serves as a lower-level memory (for example, a level-3 cache) accessible by the base die, and fetch and/or prefetch operations from the higher-level HBM memory stack fill the WOW-connected die.
The HBM memory chips may be HBM DRAM chips, but any other memory technology may be used.
Combining WOW connectivity with HBM chips makes it possible to provide a multi-layer memory structure that may include multiple memory layers offering different trade-offs between bandwidth and memory density.
The proposed solution may serve as an additional, entirely new memory hierarchy level between traditional DRAM memory/HBM and the internal cache of the logic die, providing more bandwidth as well as better management and reuse on the DRAM side.
This may provide a new memory hierarchy level on the DRAM side that better manages memory reads in a fast manner.
FIGS. 93A to 93I illustrate hybrid integrated circuits 11011' to 11019', respectively.
FIG. 93A illustrates an HBM DRAM stack with TSV connectivity and micro-bumps at the lowest level (collectively denoted 11030). The stack includes a stack of HBM DRAM memory chips 11032 that are coupled to each other and, using TSVs (11039), to a first memory controller 11031 of the base die.
FIG. 93A also illustrates a wafer that has at least memory resources and is coupled using WOW technology (collectively denoted 11040). It includes a second memory controller 11022 of the base die 11019, coupled to a DRAM wafer (11021) via one or more WOW intermediate layers (11023). The one or more WOW intermediate layers may be made of different materials, but may differ from pad connectivity and/or from TSV connectivity.
Conductors 11022' passing through the one or more WOW intermediate layers electrically couple the DRAM die to components of the base die.
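The base die 11019 is coupled to an interposer 11018, which in turn is coupled to a package substrate 11017 using micro-bumps. The package substrate has an array of micro-bumps at its lower surface.
The micro-bumps may be replaced by other connectivity. The interposer 11018 and the package substrate 11017 may be replaced by other layers.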
The first and/or second memory controllers (11031 and 11022, respectively) may be positioned (at least in part) outside the base die 11019, for example in the DRAM wafer, between the DRAM wafer and the base die, between the stack of HBM memory units and the base die, and the like.
The first and/or second memory controllers (11031 and 11022, respectively) may belong to the same controller or to different controllers.
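One or more of the HBM memory units may include logic as well as memory, and may be or may include a memory/processing unit.
The first and second memory controllers are coupled to each other by a plurality of buses 11016 for transferring information between the first memory resource and the second memory resource. FIG. 93A also illustrates a bus 11014 from the second memory controller to components of the base die (for example, a plurality of processors). FIG. 93A further illustrates a bus 11015 from the first memory controller to components of the base die (for example, a plurality of processors, as shown in FIG. 93C).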
FIG. 93B illustrates a hybrid integrated circuit 11012, which differs from the hybrid integrated circuit 11011 of FIG. 93A in having a memory/processing unit 11021' instead of the DRAM die 11021.
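FIG. 93C illustrates a hybrid integrated circuit 11013, which differs from the hybrid integrated circuit 11011 of FIG. 93A in having an HBM memory chip stack with TSV connectivity and a WOW-connected die at its bottom (collectively denoted 11040), which includes a DRAM die 11021 between the stack of HBM memory units and the base die 11019.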
The DRAM die 11021 is coupled to the first memory controller 11031 of the base die 11019 using WOW technology (see WOW intermediate layer 11023). One or more of the HBM memory dies 11032 may include logic as well as memory, and may be or may include a memory/processing unit.
The lowermost DRAM die (denoted DRAM die 11021 in FIG. 93C) may be an HBM memory die or may differ from an HBM die. The lowermost DRAM die (DRAM die 11021) may be replaced by a memory/processing unit 11021', as illustrated by the hybrid integrated circuit 11014 of FIG. 93D.
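FIGS. 93E to 93G illustrate hybrid integrated circuits 11015, 11016, and 11016', respectively, in which the base die 11019 is coupled to multiple instances of an HBM DRAM stack with TSV connectivity and micro-bumps at the lowest level (11020) and of a wafer that has at least memory resources and is coupled using WOW technology (11030), and/or to multiple instances of an HBM memory chip stack with TSV connectivity and a WOW-connected die at its bottom (11040).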
FIG. 93H illustrates a hybrid integrated circuit 11014', which differs from the hybrid integrated circuit 11014 of FIG. 93D in illustrating a memory unit 11053, a level-2 cache (L2 cache 11052), and a plurality of processors 11051. The processors 11051 are coupled to the L2 cache 11052 and may be fed with coefficients and/or data stored in the memory unit 11053 and the L2 cache 11052.
Any of the hybrid integrated circuits mentioned above may be used for artificial intelligence (AI) processing, which is bandwidth intensive.
When coupled to a memory controller using WOW technology, the memory/processing unit 11021' of FIGS. 93D and 93H may perform AI computations and may receive both data and coefficients at a very high rate from the HBM DRAM stack and/or from the WOW-connected die.
Any memory/processing unit may include a distributed memory array and a processor array. The distributed memory and processor arrays may include a plurality of memory banks and a plurality of processors. The plurality of processors may form a processing array.
Referring to FIGS. 93C, 93D, and 93H, assume that the hybrid integrated circuit (11013, 11014, or 11014') is required to perform general matrix-vector multiplication (GEMV), which involves computing the product of a matrix and a vector. This type of computation is bandwidth intensive because the retrieved matrix is not reused: the entire matrix is retrieved and used only once.
GEMV may be part of a sequence of mathematical operations that involves (i) multiplying a first matrix (A) by a first vector (V1) to provide a first intermediate vector, and applying a first nonlinear operation (NLO1) to the first intermediate vector to provide a first intermediate result; (ii) multiplying a second matrix (B) by the first intermediate result to provide a second intermediate vector, and applying a second nonlinear operation (NLO2) to the second intermediate vector to provide a second intermediate result; and so on, until an Nth intermediate result is obtained (N may exceed 2).
Assuming that each matrix is large (for example, 1 Gb), the computation will require 1 Tbs of computing power and 1 Tbs of bandwidth/throughput. The operations and computations may be performed in parallel.
Assume that the GEMV computation has N = 4 and the following form: result = NLO4(D*(NLO3(C*(NLO2(B*(NLO1(A*V1))))))).
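The chained computation above can be sketched in a few lines of Python. This is a minimal illustration only: the function names (`gemv`, `gemv_chain`) and the use of ReLU as the nonlinear operation are assumptions, since the text does not name specific operations, and two small matrices stand in for A through D.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def gemv(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Multiply a matrix by a vector; each matrix element is read exactly once."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(vector: Sequence[float]) -> Vector:
    """Stand-in nonlinear operation (assumption: the source does not name one)."""
    return [max(0.0, x) for x in vector]


def gemv_chain(matrices: Sequence[Matrix],
               nlos: Sequence[Callable[[Sequence[float]], Vector]],
               v1: Sequence[float]) -> Vector:
    """Compute NLO_N(M_N * ... NLO_1(M_1 * V1)), innermost product first."""
    result = list(v1)
    for matrix, nlo in zip(matrices, nlos):
        result = nlo(gemv(matrix, result))
    return result


# Two-stage example: NLO2(B * NLO1(A * V1)) with illustrative values.
A = [[1.0, -2.0], [0.5, 1.0]]
B = [[1.0, 1.0], [-1.0, 2.0]]
out = gemv_chain([A, B], [relu, relu], [2.0, 1.0])
```

Because every matrix is traversed once and then discarded, the loop makes the lack of reuse visible: the cost is dominated by streaming each matrix in, not by arithmetic.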
Assume also that the DRAM die 11021 (or the memory/processing unit 11021') does not have enough memory resources to store A, B, C, and D at the same time; at least some of these matrices will then be stored in the HBM DRAM dies 11032.
Assume that the base die is a logic die that includes computing units such as, but not limited to, processors, arithmetic logic units, and the like.
While the first die computes A*V1, the first memory controller 11031 retrieves the missing portions of the other matrices from one or more HBM DRAM dies 11032 for the subsequent computations.
Referring to FIG. 93H, assume that (a) the DRAM die 11021 has a bandwidth of 2 TBs and a capacity of 512 Mb, (b) the HBM DRAM dies 11032 have a bandwidth of 0.2 TBs and a capacity of 8 Gb, and (c) the L2 cache 11052 is an SRAM with a bandwidth of 6 TBs and a capacity of 10 Mb.
Matrix multiplication involves reusing data: a large matrix is segmented into multiple segments (for example, 5 Mb segments, sized to fit an L2 cache used in a double-buffer configuration), and a fetched first-matrix segment is multiplied by segments of the second matrix, one second-matrix segment after another.
While the first-matrix segment is being multiplied by one second-matrix segment, another second-matrix segment is fetched from the DRAM die 11021 (of the memory/processing unit 11021') into the L2 cache.
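The double-buffer pattern described above can be sketched as follows. This is a software model under stated assumptions, not the hardware mechanism: the helper name `double_buffered` and the event log are hypothetical, and in this single-threaded sketch the prefetch is merely issued before the compute step, whereas in hardware the two would overlap in time.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def double_buffered(segments: Iterable[T],
                    compute: Callable[[T], R],
                    log: List[Tuple[str, int]]) -> Iterator[R]:
    """Yield compute(segment) while the next segment is already fetched.

    Two buffer slots are modelled: `current` is being multiplied while
    `pending` holds the segment fetched for the following step.
    """
    source = enumerate(segments)
    pending = next(source, None)          # initial fetch fills the first slot
    if pending is not None:
        log.append(("fetch", pending[0]))
    while pending is not None:
        current = pending
        pending = next(source, None)      # prefetch into the other slot
        if pending is not None:
            log.append(("fetch", pending[0]))
        log.append(("compute", current[0]))
        yield compute(current[1])


# One first-matrix segment is reused against successive second-matrix segments.
first_segment = [1, 2, 3]
second_segments = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
events: List[Tuple[str, int]] = []
dots = list(double_buffered(
    second_segments,
    lambda seg: sum(a * b for a, b in zip(first_segment, seg)),
    events,
))
```

The event log shows that segment k+1 is always requested before segment k is consumed, which is why each segment may occupy only half of the cache.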
Assuming the matrices are 1 Gb each, the DRAM die 11021 or the memory/processing unit 11021' is fed with matrix segments from the HBM DRAM dies 11032 while the fetching and computation are performed.
The DRAM die 11021 or the memory/processing unit 11021' aggregates the matrix segments, which are then fed to the base die 11019 through the WOW intermediate layer (11023).
The memory/processing unit 11021' may reduce the amount of information sent to the base die 11019 through the WOW intermediate layer (11023) by performing computations and sending the results, rather than sending the intermediate values from which the results are computed. When multiple (Q) intermediate values are processed to provide one result, the compression ratio may be Q to 1.
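The Q-to-1 reduction can be made concrete with a small sketch. The two functions below are hypothetical names used only for illustration; each returns the computed value together with the number of values that would cross the link to the base die.

```python
from typing import Sequence, Tuple


def send_intermediates(row: Sequence[float],
                       vec: Sequence[float]) -> Tuple[float, int]:
    """Baseline: ship all Q products to the base die, which then sums them."""
    products = [r * v for r, v in zip(row, vec)]   # Q values cross the link
    return sum(products), len(products)


def send_result(row: Sequence[float],
                vec: Sequence[float]) -> Tuple[float, int]:
    """Near-memory reduction: sum locally and ship only the final value."""
    return sum(r * v for r, v in zip(row, vec)), 1


row = [1.0, 2.0, 3.0, 4.0]
vec = [1.0, 1.0, 1.0, 1.0]
baseline_value, baseline_traffic = send_intermediates(row, vec)
reduced_value, reduced_traffic = send_result(row, vec)
ratio = baseline_traffic // reduced_traffic   # Q to 1, here Q = 4
```

Both paths produce the same result; only the amount of data moved differs.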
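FIG. 93I illustrates an example of a memory/processing unit 11019' implemented using WOW technology. Logic units 9030 (which may be processor subunits), a controller 9020, and a bus 9021 are located in a first chip 11061; memory banks 9040 allocated to the different logic units are located in a second chip 11062; and the first and second chips are connected to each other by conductors 11012' that pass through a WOW junction 11061, which may include one or more WOW intermediate layers.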
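FIG. 93J is an example of a method 11100 for memory-intensive processing. Memory intensive means that the processing requires, or is associated with, high-bandwidth memory consumption.
Method 11100 may begin with steps 11110, 11120, and 11130.
Step 11110 includes performing processing operations by a plurality of processors of a hybrid device that includes a base die, first memory resources of at least one other die, and second memory resources of at least one further die, wherein the base die and the at least one other die are connected to each other by wafer-on-wafer bonding.
Step 11120 includes retrieving, by the plurality of processors, retrieved information stored in the first memory resources.
Step 11130 may include sending additional information from the second memory resources to the first memory resources, wherein the total bandwidth of a first path between the base die and the at least one other die exceeds the total bandwidth of a second path between the at least one other die and the at least one further die, and wherein the storage capacity of the first memory resources is a fraction of the storage capacity of the second memory resources.
Method 11100 may also include step 11140 of performing additional processing by the other die, which includes a plurality of processor subunits and the first memory resources.
Each processor subunit may be coupled to a unique portion of the first memory resources that is allocated to that processor subunit.
The unique portion of the first memory resources is at least one memory bank.
Steps 11110, 11120, 11130, and 11140 may be performed simultaneously, in a partially overlapping manner, and the like.
The second memory resources may be high-bandwidth memory (HBM) memory resources or may differ from HBM memory resources.
The at least one further die is a stack of high-bandwidth memory (HBM) memory chips.
Communication chip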
A database includes many entries, each of which includes multiple fields. Database processing usually involves executing one or more queries that include one or more filter parameters (for example, identifying one or more relevant fields and one or more relevant field values) as well as one or more operation parameters, which may determine the type of operation to be performed, the variables or constants to be used when applying the operation, and the like. Data processing may include database analytics or other database processing.
For example, a database query may request that a statistical operation (the operation parameter) be performed on all records of the database in which a certain field has a value within a predefined range (the filter parameter). As another example, a database query may request the deletion (the operation parameter) of records in which a certain field is below a threshold (the filter parameter).
Large databases are usually stored in storage devices. To respond to a query, the database is sent to a memory unit, usually one database segment after another.
The entries of a database segment are sent from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entries are then processed by the processor.
For each database segment of the database stored in the memory unit, the processing includes the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether it is relevant; and (iv) performing one or more additional operations on the relevant records (summation, or applying any other mathematical and/or statistical operation).
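Steps (i) to (iv) can be sketched as a plain loop. The `Query` class, its field names, and the sample records are hypothetical and serve only to separate the filter parameter from the operation parameter; the counter shows that every record crosses from memory to the processor, whether or not it turns out to be relevant.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, float]


@dataclass
class Query:
    filter_field: str                            # filter parameter: which field
    low: float                                   # filter parameter: range start
    high: float                                  # filter parameter: range end
    target_field: str                            # field the operation reads
    operation: Callable[[List[float]], float]    # operation parameter


def run_on_processor(segments: Iterable[List[Record]],
                     query: Query) -> Tuple[float, int]:
    """Return (result, number of records sent from memory to the processor)."""
    transferred = 0
    relevant: List[float] = []
    for segment in segments:              # one database segment after another
        for record in segment:            # (i) select a record
            transferred += 1              # (ii) send it to the processor
            if query.low <= record[query.filter_field] <= query.high:  # (iii)
                relevant.append(record[query.target_field])
    return query.operation(relevant), transferred   # (iv) apply the operation


segments = [
    [{"age": 30, "salary": 100.0}, {"age": 55, "salary": 200.0}],
    [{"age": 41, "salary": 300.0}],
]
total, sent = run_on_processor(segments, Query("age", 25, 45, "salary", sum))
```

Here only two of the three records are relevant, yet all three are transferred, which is the overhead the following paragraphs describe.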
The filtering process ends after all records have been sent to the processor and the processor has determined which records are relevant.
If the relevant entries of the database segment are not stored in the processor, these relevant records need to be sent to the processor after the filtering stage for further processing (applying the operations that follow the filtering).
When multiple processing operations follow a single filtering, the result of each operation may be sent to the memory unit and then sent to the processor again.
This process is bandwidth consuming and time consuming.
There is a growing need to provide an efficient way to perform database processing.
A device that may include a database acceleration integrated circuit may be provided.
A device may be provided that includes one or more groups of database acceleration integrated circuits, which may be configured to exchange information and/or acceleration results (final results of the processing performed by a database acceleration integrated circuit) among the database acceleration integrated circuits of the one or more groups.
The database acceleration integrated circuits of a group may be connected to the same printed circuit board.
The database acceleration integrated circuits of a group may belong to a modular unit of a computerized system.
The database acceleration integrated circuits of different groups may be connected to different printed circuit boards.
The database acceleration integrated circuits of different groups may belong to different modular units of a computerized system.
The device may be configured to execute a distributed process by the database acceleration integrated circuits of the one or more groups.
The device may be configured to use at least one switch for exchanging at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of different groups of the one or more groups.
The device may be configured to execute a distributed process by some of the database acceleration integrated circuits of some of the one or more groups.
The device may be configured to perform distributed processing of first and second data structures, wherein the total size of the first and second data structures exceeds the storage capacity of the plurality of memory processing integrated circuits.
The device may be configured to perform the distributed processing by performing multiple iterations of (a) newly allocating different pairs of first-data-structure portions and second-data-structure portions to different database acceleration integrated circuits, and (b) processing the different pairs.
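FIGS. 94A and 94B illustrate examples of a storage system 11560, a computer system 11150, and one or more devices 11520 for database acceleration. The one or more devices 11520 for database acceleration may monitor the communication between the storage system 11560 and the computer system 11150 in various ways (by snooping, or by being positioned between the computer system 11150 and the storage system 11560).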
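The storage system 11560 may include many (for example, more than 20, 50, 100, and the like) storage units (such as disks or RAID arrays of disks) and may, for example, store more than 100 terabytes of information. The computing system 11510 may be a large computer system and may include tens, hundreds, or even thousands of processing units.
The computing system 11510 may include a plurality of compute nodes 11512 controlled by a manager 11511.
The compute nodes may control, or otherwise interact with, the one or more devices 11520 for database acceleration.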
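The one or more devices 11520 for database acceleration may include one or more database acceleration integrated circuits (see, for example, database acceleration integrated circuit 11530 of FIGS. 94A and 94B) and memory resources 11550. The memory resources may belong to one or more chips that are dedicated to memory, but may instead belong to memory/processing units.
FIGS. 94C and 94D illustrate examples of the computer system 11150 and the one or more devices 11520 for database acceleration.
The one or more database acceleration integrated circuits of the one or more devices 11520 for database acceleration may be controlled by a management unit 11513, which may be located within the computer system (see FIG. 94C) or within the one or more devices 11520 for database acceleration (FIG. 94D).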
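FIG. 94E illustrates a device 11520 for database acceleration that includes a database acceleration integrated circuit 11530 and a plurality of memory processing integrated circuits 11551. Each memory processing integrated circuit may include a controller, a plurality of processor subunits, and a plurality of memory units.
The database acceleration integrated circuit 11530 is illustrated as including a network communication interface 11531, a first processing unit 11532, a memory controller 11533, a database acceleration unit 11535, an interconnect 11536, and a management unit 11513.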
The network communication interface (11531) may be configured to receive a large amount of information from a large number of storage units (for example, via first ports 11531(1) of the network communication interface). Each storage unit may output information at a rate exceeding tens or even hundreds of megabytes per second, and data transfer rates are expected to increase over time (for example, doubling every two to three years). The number of storage units (the large number) may exceed 10, 50, 100, 200, or even more. The large amount of information may exceed tens or hundreds of gigabytes per second, and may even be in the range of terabytes per second or petabytes per second.
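The first processing unit 11532 may be configured to perform first processing (preprocessing) on the large amount of information to provide first processed information.
The memory controller 11533 may be configured to send the first processed information to the plurality of memory processing integrated circuits via a high-throughput interface 11534.
The plurality of memory processing integrated circuits 11551 may be configured to perform second processing (processing) on at least part of the first processed information to provide second processed information.
The memory controller 11533 may be configured to retrieve retrieved information from the plurality of memory processing integrated circuits. The retrieved information may include at least one of (a) at least a portion of the first processed information and (b) at least a portion of the second processed information.
The database acceleration unit 11535 may be configured to perform database processing operations on the retrieved information to provide database acceleration results.
The database acceleration integrated circuit may be configured to output the database acceleration results, for example via one or more second ports 11531(2) of the network communication interface.
FIG. 94E also illustrates the management unit 11513, which is configured to manage at least one of the retrieval of the retrieved information, the first processing (preprocessing), the second processing (processing), and the third processing (database processing). The management unit 11513 may be located outside the database acceleration integrated circuit.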
The management unit may be configured to perform this management based on an execution plan. The execution plan may be generated by the management unit, or by an entity located outside the database acceleration integrated circuit. The execution plan may include at least one of (a) instructions to be executed by the various components of the database acceleration integrated circuit, (b) data and/or coefficients required to implement the execution plan, and (c) a memory allocation for the instructions and/or data.
The management unit may be configured to perform the management by allocating at least some of (a) network communication interface resources, (b) decompression unit resources, (c) memory controller resources, (d) resources of the plurality of memory processing integrated circuits, and (e) database acceleration unit resources.
As illustrated in FIGS. 94E and 94G, the network communication interface may include different types of network communication ports.
The different types of network communication ports may include storage interface protocol ports (for example, SATA ports, ATA ports, iSCSI ports, network file system ports, Fibre Channel ports) and general network storage interface protocol ports (for example, ATA over Ethernet, Fibre Channel over Ethernet, NVMe, RoCE, and others).
The different types of network communication ports may include storage interface protocol ports and PCIe ports.
FIG. 94F includes dashed lines that illustrate the flow of the large amount of information, the first processed information, the retrieved information, and the database acceleration results. FIG. 94F illustrates the database acceleration integrated circuit 11530 as coupled to a plurality of memory resources 11550. The memory resources 11550 may not belong to memory processing integrated circuits.
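The device 11520 for database acceleration may be configured to execute multiple tasks concurrently by the database acceleration integrated circuit 11530, because the network communication interface 11531 may receive multiple information streams concurrently, the first processing unit 11532 may perform the first processing on multiple information units concurrently, the memory controller 11533 may send multiple first processed information units to the plurality of memory processing integrated circuits 11551 concurrently, and the database acceleration unit 11535 may process multiple retrieved information units concurrently.
The device 11520 for database acceleration may be configured to perform at least one of the retrieving, the first processing, the sending, and the third processing based on an execution plan sent to the database acceleration integrated circuit by a compute node of a large computing system.
The device 11520 for database acceleration may be configured to manage at least one of the retrieving, the first processing, the sending, and the third processing in a manner that substantially optimizes the utilization of the database acceleration integrated circuit. The optimization takes into account latency, throughput, and any other timing, storage, or processing considerations, and attempts to keep all components along the flow path busy and free of bottlenecks.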
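The device 11520 for database acceleration may be configured to substantially optimize the bandwidth of the traffic exchanged through the network communication interface.
The device 11520 for database acceleration may be configured to substantially prevent the formation of bottlenecks in at least one of the retrieving, the first processing, the sending, and the third processing, in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.
The device 11520 for database acceleration may be configured to allocate the resources of the database acceleration integrated circuit according to the temporal I/O bandwidth.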
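FIG. 94G illustrates a device 11520 for database acceleration that includes a database acceleration integrated circuit 11530 and a plurality of memory processing integrated circuits 11551. FIG. 94G also illustrates various units coupled to the database acceleration integrated circuit 11530: a remote RAM 11546, an Ethernet memory DIMM 11547, a storage system 11560, a local storage unit 11561, and a non-volatile memory (NVM) 11563 (which may be an NVM Express (NVMe) unit).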
The database acceleration integrated circuit 11530 is illustrated as including an Ethernet port 11531(1), an RDMA unit 11545, a serial expansion port 11531(15), a SATA controller 11540, a PCIe port 11531(9), a first processing unit 11532, a memory controller 11533, a database acceleration unit 11535, an interconnect 11536, a management unit 11513, a cryptographic engine 11537 for performing cryptographic operations, and a level-2 static random access memory (L2 SRAM) 11538.
The database acceleration unit is illustrated as including a DMA engine 11549, a level-3 (L3) memory 11548, and database acceleration subunits 11547. The database acceleration subunits 11547 may be configurable units.
The Ethernet port 11531(1), the RDMA unit 11545, the serial expansion port 11531(15), the SATA controller 11540, and the PCIe port 11531(9) may be regarded as parts of the network communication interface 11531.
The remote RAM 11546, the Ethernet memory DIMM 11547, and the storage system 11560 are coupled to the Ethernet port 11531(1), which in turn is coupled to the RDMA unit 11545.
The local storage unit 11561 is coupled to the SATA controller 11540.
The PCIe port 11531(9) is coupled to the NVM 11563. The PCIe port may also be used for exchanging commands, for example for management purposes.
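FIG. 94H is an example of the database acceleration unit 11535.
The database acceleration unit 11535 may be configured to execute database processing instructions concurrently by database processing subunits 11573, wherein the database acceleration unit may include a group of database accelerator subunits that share a shared memory unit 11575.
Different combinations of the database acceleration subunits 11535 may be dynamically linked to each other (via configurable links or interconnects 11576) to provide the execution pipeline required for executing a database processing operation that may include multiple instructions.
Each database processing subunit may be configured to execute a particular type of database processing instruction (for example, filtering, merging, accumulating, and the like).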
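The idea of linking single-purpose subunits into an execution pipeline can be modelled in software as function composition. This is only an analogy under stated assumptions: the stage constructors (`make_filter`, `make_project`, `make_accumulate`) and `link` are hypothetical names, and real subunits would be hardware blocks joined by configurable interconnects rather than Python generators.

```python
from typing import Callable, Dict, Iterable, List

Stage = Callable[[Iterable], Iterable]


def make_filter(field: str, low: float, high: float) -> Stage:
    """Subunit that passes only rows whose field lies within [low, high]."""
    return lambda rows: (r for r in rows if low <= r[field] <= high)


def make_project(field: str) -> Stage:
    """Subunit that extracts a single field from each row."""
    return lambda rows: (r[field] for r in rows)


def make_accumulate() -> Stage:
    """Subunit that reduces its input stream to a single sum."""
    return lambda values: [sum(values)]


def link(*stages: Stage) -> Stage:
    """Chain subunits so that the output of each stage feeds the next."""
    def pipeline(rows: Iterable) -> Iterable:
        for stage in stages:
            rows = stage(rows)
        return rows
    return pipeline


rows: List[Dict[str, float]] = [
    {"qty": 1, "price": 5.0},
    {"qty": 3, "price": 7.0},
    {"qty": 9, "price": 1.5},
]
# A multi-instruction operation: filter on qty, then sum the price field.
pipeline = link(make_filter("qty", 2, 10), make_project("price"), make_accumulate())
result = list(pipeline(rows))
```

A different query would reuse the same stage types in a different combination, which mirrors the dynamic linking described above.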
FIG. 94H also illustrates an independent database processing unit 11572 coupled to a cache 11571. The database processing unit 11572 and the cache 11571 may be provided instead of, or in addition to, the reconfigurable array 11574 of DB accelerators.
The device may facilitate scaling in and/or scaling out, thereby enabling multiple database acceleration integrated circuits 11530 (and their associated memory resources 11550 or their associated memory processing integrated circuits 11551) to cooperate with each other, for example by participating in distributed processing of database operations.
FIG. 94I illustrates a modular unit, such as a blade 11580, that includes two database acceleration integrated circuits 11530 (and their associated memory resources 11550). The blade may include one, two, or more than two memory processing integrated circuits 11551 and their associated memory resources 11550.
The blade may also include one or more non-volatile memory units, an Ethernet switch, and a PCIe switch.
Multiple blades may communicate with each other using any communication method, communication protocol, and connectivity.
FIG. 94I illustrates four database acceleration integrated circuits 11530 (and their associated memory resources 11550) that are fully connected to each other: each database acceleration integrated circuit 11530 is connected to all three other database acceleration integrated circuits 11530. The connectivity may use any communication protocol, for example the RDMA over Ethernet protocol.
FIG. 94I also illustrates a database acceleration integrated circuit 11530 that is connected to its associated memory resources 11550 and to a unit 11531 that includes RAM memory and an Ethernet port.
FIGS. 94J, 94K, 94L, and 94M illustrate four groups 11580 of database acceleration integrated circuits, each group including four database acceleration integrated circuits 11530 (fully connected to each other) and their associated memory resources 11550. The different groups are connected to each other via a switch 11590.
The number of groups may be two, three, or more than four. The number of database acceleration integrated circuits per group may be two, three, or more than four. The number of groups may be the same as (or may differ from) the number of database acceleration integrated circuits per group.
FIG. 94K illustrates two tables A and B that are too large (for example, 1 terabyte) to be joined efficiently at one time.
The tables are virtually segmented into shards, and the join operation is applied to pairs that each include a shard of table A and a shard of table B.
The groups of database acceleration integrated circuits may process the shards in various ways.
For example, the device may be configured to perform the distributed processing by:
g. Allocating different first-data-structure portions (shards of table A, for example the first to sixteenth shards A0 to A15) to different database acceleration integrated circuits of the one or more groups.
h. Performing multiple iterations of (i) newly allocating different second-data-structure portions (shards of table B, for example the first to sixteenth shards B0 to B15) to different database acceleration integrated circuits of the one or more groups, and (ii) processing the first and second data structure portions by the database acceleration integrated circuits.
The device may be configured to perform the new allocation of the next iteration in a manner that at least partially overlaps in time with the processing of the current iteration.
The device may be configured to perform the new allocation by exchanging second-data-structure portions between different database acceleration integrated circuits.
The exchange may be performed in a manner that at least partially overlaps in time with the processing.
The device may be configured to perform the new allocation by exchanging second-data-structure portions between the different database acceleration integrated circuits of a group and, once that exchange has been completed, exchanging second-data-structure portions between the different groups of database acceleration integrated circuits.
FIG. 94K shows four cycles of some of the join operations. For example, for the upper-left database acceleration integrated circuit 11530 of the upper-left group, the four cycles include computing Join(A0, B0), Join(A0, B3), Join(A0, B2), and Join(A0, B1). During these four cycles, A0 stays at the same database acceleration integrated circuit 11530, while the shards of table B (B0, B1, B2, and B3) rotate among the members of the same group of database acceleration integrated circuits 11530.
In FIG. 94L, the shards of the second table rotate between the different groups: (a) shards B0, B1, B2, and B3 (previously processed by the upper-left group) are sent from the upper-left group to the lower-left group, (b) shards B4, B5, B6, and B7 (previously processed by the lower-left group) are sent from the lower-left group to the upper-right group, (c) shards B8, B9, B10, and B11 (previously processed by the upper-right group) are sent from the upper-right group to the lower-right group, and (d) shards B12, B13, B14, and B15 (previously processed by the lower-right group) are sent from the lower-right group to the upper-left group.
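The two-level rotation can be checked with a short simulation. The sketch below assumes four groups of four circuits, numbers circuits 0 to 15 in the ring order upper-left, lower-left, upper-right, lower-right, and uses hypothetical names throughout; it records which (A shard, B shard) pairs meet and confirms that every pair is eventually joined.

```python
from typing import List, Set, Tuple

GROUPS = 4       # groups of database acceleration integrated circuits
PER_GROUP = 4    # circuits per group


def simulate_rotation() -> Tuple[Set[Tuple[int, int]], List[Tuple[int, int]]]:
    """Return every (A, B) shard pair joined, plus circuit 0's first four joins."""
    n = GROUPS * PER_GROUP
    a_at = list(range(n))    # A shard i stays on circuit i throughout
    b_at = list(range(n))    # b_at[c] is the B shard currently on circuit c
    joined: Set[Tuple[int, int]] = set()
    trace: List[Tuple[int, int]] = []
    for outer in range(GROUPS):
        for _ in range(PER_GROUP):
            for c in range(n):
                joined.add((a_at[c], b_at[c]))     # Join(A_c, current B shard)
            if outer == 0:
                trace.append((a_at[0], b_at[0]))
            # Rotate the B shards by one position within each group.
            nxt = b_at[:]
            for g in range(GROUPS):
                base = g * PER_GROUP
                for k in range(PER_GROUP):
                    nxt[base + (k + 1) % PER_GROUP] = b_at[base + k]
            b_at = nxt
        # After a full intra-group cycle, hand each group's shards to the next group.
        b_at = b_at[-PER_GROUP:] + b_at[:-PER_GROUP]
    return joined, trace


joined, trace = simulate_rotation()
```

With 16 shards per table, 16 cycles of 16 parallel joins cover all 256 pairs, and the first four joins on circuit 0 follow the order B0, B3, B2, B1 given in the text.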
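FIG. 94N is an example of a system that includes multiple blades 11580, a SATA controller 11540, a local storage unit 11561, an NVMe 11563, a PCIe switch 11601, an Ethernet memory DIMM 11547, and Ethernet ports 11531(4).
The blades 11580 may be coupled to each of the PCIe switch 11601, the Ethernet ports 11531, and the SATA controller 11540.
FIG. 94O illustrates two systems 11621 and 11622.
System 11621 may include one or more devices 11520 for database acceleration, a switching system 11611, a storage system 11612, and a computing system 11613. The switching system 11611 provides connectivity between the one or more devices 11520 for database acceleration, the storage system 11612, and the computing system 11613.
System 11622 may include a storage system together with one or more devices for database acceleration 11615, a switching system 11611, and a computing system 11613. The switching system 11611 provides connectivity between the storage system together with the one or more devices for database acceleration 11615 and the computing system 11613.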
FIG. 95A illustrates a method 11200 for database acceleration.
Method 11200 may begin with step 11210 of retrieving, by a network communication interface of a database acceleration integrated circuit, a large amount of information from a large number of storage units.
Connecting to a large number of storage units (for example, using multiple different buses) enables the network communication interface to receive a large amount of information even when a single storage unit has limited throughput.
Step 11210 may be followed by performing first processing on the large amount of information to provide first processed information. The first processing may include buffering, extracting information from payloads, removing headers, decompressing, compressing, decrypting, filtering database queries, or performing any other processing operation. The first processing may also be limited to buffering.
Step 11210 may be followed by step 11220 of sending, by a memory controller of the database acceleration integrated circuit and via a high-throughput interface, the first processed information to a plurality of memory processing integrated circuits, wherein each memory processing integrated circuit may include a controller, a plurality of processor subunits, and a plurality of memory units. A memory processing integrated circuit may be a memory/processing unit, a distributed processor, or a memory chip, as illustrated in any other part of this patent application.
Step 11220 may be followed by step 11230 of performing, by the plurality of memory processing integrated circuits, second processing on at least part of the first processed information to provide second processed information.
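Step 11230 may include concurrently executing multiple tasks by the database acceleration integrated circuit.
Step 11230 may include concurrently executing database processing instructions by database processing subunits, where the database acceleration unit may include groups of database accelerator subunits that share a shared memory unit.
Step 11230 may be followed by step 11240 of retrieving, by the memory controller of the database acceleration integrated circuit, retrieved information from the multiple memory processing integrated circuits, where the retrieved information may include at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information.
Step 11240 may be followed by step 11250 of performing, by a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results.
Step 11250 may include allocating resources of the database acceleration integrated circuit according to temporal I/O bandwidth.
Step 11250 may be followed by step 11260 of outputting the database acceleration results.
Step 11260 may include dynamically linking database processing subunits to provide the execution pipeline required for executing a database processing operation that may include multiple instructions.
Step 11260 may include outputting the database acceleration results to local storage and retrieving the database acceleration results from the local storage.
It should be noted that steps 11210, 11220, 11230, 11240, 11250, and 11260, or any other step, of method 11200 may be executed in a pipelined manner. These steps may be executed concurrently or in an order that differs from the order given above.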
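As a rough illustration of how the steps of method 11200 chain together and can run in a pipelined way, the following sketch models each step as a Python generator stage, so that a record can reach a later stage before earlier stages have consumed all input. Everything here is an assumption made for illustration: the packet layout, the function names, the use of `zlib` for decompression, and the choice of a filter and a count as the second processing and the database operation.

```python
import zlib


def make_packet(row):
    """Build a toy (header, compressed payload) packet; this format is an assumption."""
    return ("hdr", zlib.compress(",".join(row).encode()))


def retrieve(storage_units):
    """Step 11210 (sketch): pull packets from many storage units."""
    for unit in storage_units:
        yield from unit


def first_processing(packets):
    """First processing (sketch): drop the header and decompress the payload."""
    for _header, payload in packets:
        yield tuple(zlib.decompress(payload).decode().split(","))


def second_processing(rows, predicate):
    """Step 11230 (sketch): stand-in for work done in the memory processing circuits."""
    return (row for row in rows if predicate(row))


def database_processing(rows):
    """Step 11250 (sketch): a simple count per value as the database operation."""
    counts = {}
    for _name, fruit in rows:
        counts[fruit] = counts.get(fruit, 0) + 1
    return counts


storage_units = [
    [make_packet(("alice", "banana")), make_packet(("bob", "apple"))],
    [make_packet(("carol", "banana")), make_packet(("dave", "cherry"))],
]
rows = second_processing(
    first_processing(retrieve(storage_units)),
    lambda row: row[1] != "cherry",
)
result = database_processing(rows)  # step 11260 would output this result
```

Because the stages are lazy generators, no stage waits for the previous one to finish the whole data set, which loosely mirrors the pipelined execution mentioned in the text.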
For example, step 1120 may be followed by step 11250, so that the first processed information is further processed by the database acceleration unit.
As another example, the first processed information may be sent to the multiple memory processing integrated circuits and then sent, without being processed by the multiple memory processing integrated circuits, to the database acceleration unit.
As yet another example, the first processed information and/or the second processed information may be output from the database acceleration integrated circuit without undergoing database processing by the database acceleration unit.
The method may include performing, based on an execution plan sent to the database acceleration integrated circuit by a compute node of a large-scale computing system, at least one of the following operations: the retrieving, the first processing, the sending, and the third processing.
The method may include managing at least one of the retrieving, the first processing, the sending, and the third processing in a manner that substantially optimizes utilization of the database acceleration integrated circuit.
The method may include substantially optimizing the bandwidth of the traffic exchanged through the network communication network interface.
The method may include substantially preventing bottlenecks from forming in at least one of the retrieving, the first processing, the sending, and the third processing, in a manner that substantially optimizes utilization of the database acceleration integrated circuit.
Method 11200 may also include at least one of the following steps:
Step 11270 may include managing, by a management unit of the database acceleration integrated circuit, at least one of the retrieving, the first processing, the sending, and the third processing.
The managing may be performed based on an execution plan generated by the management unit of the database acceleration integrated circuit.
The managing may be performed based on an execution plan that is received by, rather than generated by, the management unit of the database acceleration integrated circuit.
The managing may include allocating at least some of the following: (a) network communication network interface resources, (b) decompression unit resources, (c) memory controller resources, (d) resources of the multiple memory processing integrated circuits, and (e) database acceleration unit resources.
Step 11271 may include controlling, by a compute node of a large-scale computing system, at least one of the retrieving, the first processing, the sending, and the third processing.
Step 11272 may include managing, by a management unit located outside the database acceleration integrated circuit, at least one of the retrieving, the first processing, the sending, and the third processing.
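Figure 95B illustrates a method 11300 for operating a group of database acceleration integrated circuits.
Method 11300 may begin with step 11310 of performing database acceleration operations by the database acceleration integrated circuits. Step 11310 may include performing one or more steps of method 11200.
Method 11300 may also include step 11320 of exchanging at least one of (a) information and (b) database acceleration results between the database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.
The combination of steps 11310 and 11320 may amount to performing distributed processing by the database acceleration integrated circuits of the one or more groups.
The exchange may be performed using the network communication network interfaces of the database acceleration integrated circuits of the one or more groups.
The exchange may be performed across multiple groups, which may be connected to one another by a star connection.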
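Step 11320 may include using at least one switch to exchange at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of different groups of the one or more groups.
Step 11310 may include step 11311 of performing distributed processing by some of the database acceleration integrated circuits of some of the one or more groups.
Step 11311 may include performing distributed processing of first and second data structures, where the total size of the first and second data structures exceeds the storage capacity of the multiple memory processing integrated circuits.
Performing the distributed processing may include performing multiple iterations of: (a) newly allocating different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits; and (b) processing the different pairs.
Performing the distributed processing may include performing a database join operation.
Step 11310 may include (a) step 11312 of allocating different first data structure portions to different database acceleration integrated circuits of the one or more groups; and (b) performing multiple iterations of: step 11314 of newly allocating different second data structure portions to different database acceleration integrated circuits of the one or more groups; and step 11316 of processing the first and second data structure portions by the database acceleration integrated circuits.
Step 11314 may be executed in a manner that at least partially overlaps in time with the processing of the current iteration.
Step 11314 may include exchanging second data structure portions between different database acceleration integrated circuits.
Step 11320 may be executed in a manner that at least partially overlaps in time with step 11310.
Step 11314 may include exchanging second data structure portions between different database acceleration integrated circuits of a group and, once that exchange is complete, exchanging second data structure portions between different groups of database acceleration integrated circuits.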
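The iteration of steps 11312, 11314, and 11316 can be pictured as a join in which each accelerator keeps its portion of the first data structure while portions of the second data structure are passed around. The sketch below is a minimal, sequential model under stated assumptions: the function name `distributed_join`, the ring-style hand-off, and the use of an equality join on a key are illustrative choices, not details taken from the description.

```python
def distributed_join(a_parts, b_parts):
    """Join two partitioned tables of (key, value) rows.

    a_parts[i] stays on accelerator i (step 11312, sketch).
    b_parts are re-allocated on every iteration (step 11314, sketch).
    Each accelerator joins the pair it currently holds (step 11316, sketch).
    """
    n = len(a_parts)
    if len(b_parts) != n:
        raise ValueError("expected one second-structure portion per accelerator")

    # Build a lookup index once per first-structure portion.
    indexes = []
    for part in a_parts:
        index = {}
        for key, a_val in part:
            index.setdefault(key, []).append(a_val)
        indexes.append(index)

    results = []
    held_b = list(b_parts)  # held_b[i] is the portion currently at accelerator i
    for _ in range(n):
        for i in range(n):  # conceptually these run in parallel
            for key, b_val in held_b[i]:
                for a_val in indexes[i].get(key, []):
                    results.append((key, a_val, b_val))
        # Hand every second-structure portion to the next accelerator.
        held_b = held_b[-1:] + held_b[:-1]
    return results


a_parts = [[(1, "a1"), (2, "a2")], [(3, "a3")]]
b_parts = [[(3, "b3")], [(1, "b1"), (2, "b2"), (4, "b4")]]
joined = sorted(distributed_join(a_parts, b_parts))
```

After as many iterations as there are accelerators, every pair of portions has met exactly once, so the combined data never has to fit in a single accelerator's memory. In hardware the hand-off could overlap with the processing of the current iteration, as the text notes; this sequential sketch does not model that overlap.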
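Figure 95C illustrates a method 11350 for database acceleration.
Method 11350 may include step 11352 of retrieving, by a network communication network interface of a database acceleration integrated circuit, a large amount of information from a large number of storage units.
Step 11352 may be followed by step 11354 of performing first processing on the large amount of information to provide first processed information.
Step 11352 may be followed by step 11354 of sending, by a memory controller of the database acceleration integrated circuit and via a high-throughput interface, the first processed information to multiple memory resources.
Step 11354 may be followed by step 11356 of retrieving retrieved information from the multiple memory resources.
Step 11356 may be followed by step 11358 of performing, by a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results.
Step 11358 may be followed by step 11359 of outputting the database acceleration results.
The method may also include step 11355 of performing second processing on the first processed information to provide second processed information. The second processing is performed by multiple processors located in one or more memory processing integrated circuits that further include the multiple memory resources. Step 11355 follows step 11354 and precedes step 11356.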
The total size of the second processed information may be smaller than the total size of the first processed information.
The total size of the first processed information may be smaller than the total size of the large amount of information.
The first processing may include filtering database entries. Database entries that are irrelevant to the query are thus filtered out before any other processing is performed, and/or even before the irrelevant entries are stored in the multiple memory resources, which saves bandwidth, storage resources, and other processing resources.
The second processing may include filtering database entries. Such filtering may be applied when the filter condition is complex (includes multiple conditions) and may require multiple database entry fields to be received before filtering can proceed, for example when searching for (a) people over a certain age who like bananas and (b) people over another age who like apples.
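The two levels of filtering can be illustrated with the banana and apple example. The sketch below is an assumption-laden illustration: the field names, the thresholds 30 and 50, and the split into a cheap single-field first filter followed by a compound second filter are all hypothetical choices made to show why the second filter needs several fields of each entry.

```python
def first_filter(entries, min_age):
    """Cheap single-field filter that could run while entries stream in."""
    return [e for e in entries if e["age"] > min_age]


def second_filter(entries, age_a, age_b):
    """Compound filter that needs both the age and the preference of each entry."""
    return [
        e
        for e in entries
        if (e["age"] > age_a and e["likes"] == "banana")
        or (e["age"] > age_b and e["likes"] == "apple")
    ]


entries = [
    {"name": "ann", "age": 25, "likes": "banana"},
    {"name": "ben", "age": 35, "likes": "banana"},
    {"name": "cat", "age": 40, "likes": "apple"},
    {"name": "dan", "age": 60, "likes": "apple"},
    {"name": "eve", "age": 70, "likes": "cherry"},
]

AGE_A, AGE_B = 30, 50  # hypothetical thresholds
stage_one = first_filter(entries, min(AGE_A, AGE_B))
stage_two = second_filter(stage_one, AGE_A, AGE_B)
names = [e["name"] for e in stage_two]
```

The first filter can only discard entries that fail every branch of the query, so it uses the smaller threshold; the second filter then applies the full condition to the reduced set.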
Database
The following examples may refer to a database. The database may be a data center, may be part of a data center, or may not belong to a data center.
The database may be coupled to multiple users via one or more networks. The database may be a cloud database.
A database may be provided that includes one or more management units and multiple database accelerator boards, where the accelerator boards include one or more memory/processing units.
Figure 96B illustrates a database 12020 that includes a management unit 12021 and multiple DB accelerator boards 12022, each of which includes a communication/management processor (processor 12024) and multiple memory/processing units 12026.
Processor 12024 may support various communication protocols, such as, but not limited to, PCIe, RoCE-like protocols, and the like.
Database commands may be executed by the memory/processing units 12026, and the processor may route traffic among the memory/processing units 12026, between different DB accelerator boards 12022, and to and from the management unit 12021.
Using multiple memory/processing units 12026, especially when they include large internal memory banks, can significantly accelerate the execution of database commands and avoid communication bottlenecks.
Figure 96C illustrates a DB accelerator board 12022 that includes a processor 12024 and multiple memory/processing units 12026. Processor 12024 includes multiple communication-dedicated components, such as DDR controllers 12033 for communicating with the memory/processing units 12026, RDMA engines 12031, DB query database engines 12034, and the like. The DDR controller is an example of a communication controller, and the RDMA engine is an example of any communication engine.
A method may be provided for operating the system (or any part of the system) of any one of Figures 96B, 96C, and 96D.
It should be noted that the database acceleration integrated circuit 11530 may be associated with multiple memory resources that are not included in multiple memory processing integrated circuits or are otherwise not associated with processing units. In that case, the processing is performed mainly, or even solely, by the database acceleration integrated circuit.
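Figure 94P illustrates a method 11700 for database acceleration.
Method 11700 may include step 11710 of retrieving, by a network communication interface of a database acceleration integrated circuit, information from storage units.
Step 11710 may be followed by step 11720 of performing first processing on the amount of information to provide first processed information.
Step 11720 may be followed by step 11730 of sending, by a memory controller of the database acceleration integrated circuit and via a throughput interface, the first processed information to multiple memory resources.
Step 11730 may be followed by step 11740 of retrieving information from the multiple memory resources.
Step 11740 may be followed by step 11750 of performing, by a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results.
Step 11750 may be followed by step 11760 of outputting the database acceleration results.
The first processing and/or the second processing may include filtering database entries to determine which database entries should be processed further.
The second processing includes filtering database entries.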
Hybrid system
A memory/processing unit can be efficient when executing computations that are memory-intensive and/or whose bottleneck is related to retrieval operations. When the bottleneck is related to arithmetic operations, processing-oriented (and less memory-oriented) processor units, such as, but not limited to, graphics processing units and central processing units, can be more efficient.
A hybrid system may include both one or more processor units and one or more memory/processing units, which may be fully or partially connected to one another.
A memory/processing unit (MPU) may be manufactured by a first manufacturing process that is better suited to memory cells than to logic cells. For example, memory cells manufactured by the first manufacturing process may exhibit a critical dimension that is smaller, and even much smaller (for example, by a factor of more than 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like), than the critical dimension of logic circuits manufactured by the first manufacturing process. For example, the first manufacturing process may be an analog manufacturing process, a DRAM manufacturing process, and the like.
The processor may be manufactured by a second manufacturing process that is better suited to logic. For example, the critical dimension of logic circuits manufactured by the second manufacturing process may be smaller, and even much smaller, than the critical dimension of logic circuits manufactured by the first manufacturing process. As another example, the critical dimension of logic circuits manufactured by the second manufacturing process may be smaller, and even much smaller, than the critical dimension of memory cells manufactured by the first manufacturing process. For example, the second manufacturing process may be an analog manufacturing process, a CMOS manufacturing process, and the like.
Tasks may be allocated among the different units in a static or dynamic manner by weighing the benefits of each unit against any penalty associated with transferring data between the units.
For example, memory-intensive processes may be allocated to the memory/processing units, while processing-intensive, memory-light processes may be allocated to the processing units.
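One way to picture the allocation rule above is a simple cost model that adds the time to reach the data to the time to compute on it, and then places each task on the cheaper unit. The model, the class names, and every number below are assumptions chosen for illustration; the description does not specify any particular cost function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    bytes_per_us: float  # rate at which this unit can reach the task's data
    ops_per_us: float    # arithmetic throughput of this unit


@dataclass(frozen=True)
class Task:
    name: str
    bytes_touched: int
    operations: int


def estimated_time_us(task, unit):
    """Data access cost (including any transfer penalty) plus compute cost."""
    return task.bytes_touched / unit.bytes_per_us + task.operations / unit.ops_per_us


def place(task, units):
    """Pick the unit with the lowest estimated time for this task."""
    return min(units, key=lambda unit: estimated_time_us(task, unit))


# Hypothetical figures: the MPU sits next to the data but computes slowly;
# the processor computes quickly but must pull data over a narrower link.
MPU = Unit("memory/processing unit", bytes_per_us=1000.0, ops_per_us=10.0)
CPU = Unit("processor", bytes_per_us=10.0, ops_per_us=1000.0)

scan = Task("table scan", bytes_touched=1_000_000, operations=1_000)
dense = Task("dense math", bytes_touched=1_000, operations=10_000_000)

scan_unit = place(scan, [MPU, CPU]).name
dense_unit = place(dense, [MPU, CPU]).name
```

With these figures the memory-heavy scan lands on the memory/processing unit and the compute-heavy task lands on the processor. A dynamic allocator could re-evaluate the same estimate at run time with measured rates.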
The processor may request or instruct one or more memory/processing units to perform various processing tasks. Executing these tasks on the memory/processing units may offload the processor, reduce latency, and in some cases reduce the total information bandwidth between the one or more memory/processing units and the processor, and the like.
The processor may provide instructions and/or requests at different granularities. For example, the processor may send instructions directed at specific processing resources, or may send higher-level instructions directed at a memory/processing unit without specifying any processing resource.
Figure 96D is an example of a hybrid system 12040 that includes one or more memory/processing units (MPUs) 12043 and a processor 12042. The processor 12042 may send requests or instructions to the one or more MPUs 12043, which in turn complete (or selectively complete) the requests and/or instructions and send the results to the processor 12042, as described above.
The processor 12042 may further process the results to provide one or more outputs.
Each MPU includes memory resources, processing resources (such as a compact microcontroller 12044), and a cache 12049. The microcontroller may have limited computing capabilities (for example, it may mainly include multiply-accumulate units).
The microcontroller 12044 may apply processes for in-memory acceleration purposes, and may also be a CPU, a full DB processing engine, or a subset thereof.
The MPU 12043 may include microprocessors and packet processing units that may be connected in a mesh, ring, or other topology for fast inter-bank communication.
There may be more than one DDR controller for fast inter-DIMM communication.
The goal of the in-memory packet processors is to reduce bandwidth (BW), data movement, and power consumption, and to increase performance. Compared with standard solutions, using in-memory packet processors will significantly increase performance/TCO.
It should be noted that the management unit is optional.
Each MPU may operate as an artificial intelligence (AI) memory/processing unit, because it can perform AI computations and return only the results to the processor, thereby reducing the amount of traffic. This holds especially when the MPU receives and stores neural network coefficients to be used in multiple computations, so that the coefficients need not be received from an external chip each time a portion of the neural network is used to process new data.
The MPU may determine when a coefficient is zero and notify the processor that multiplications involving zero-valued coefficients need not be performed.
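The zero-coefficient shortcut can be sketched as follows. This is a minimal illustration, assuming the MPU reports the positions of nonzero coefficients and the processor multiplies only at those positions; the function names and the index-list interface are hypothetical.

```python
def nonzero_indices(coefficients):
    """MPU side (sketch): report which stored coefficients are not zero."""
    return [i for i, c in enumerate(coefficients) if c != 0]


def sparse_dot(coefficients, inputs, indices):
    """Processor side (sketch): multiply only where the coefficient is nonzero."""
    return sum(coefficients[i] * inputs[i] for i in indices)


coefficients = [0, 2, 0, 0, -1, 3]
inputs = [5, 7, 11, 13, 17, 19]

indices = nonzero_indices(coefficients)
sparse_result = sparse_dot(coefficients, inputs, indices)
dense_result = sum(c * x for c, x in zip(coefficients, inputs))
```

Here only three of the six multiplications are performed, and the result matches the dense computation because the skipped terms contribute nothing to the sum.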
It should be noted that the first processing and the second processing may include filtering database entries.
The MPU may be any memory processing unit described in any one of this specification, PCT patent application WO2019025862, and PCT patent application No. PCT/IB2019/001005.
An AI computing system (and a system that can be executed by the system) may be provided in which network interface cards have AI processing capabilities and are configured to perform some AI processing tasks, in order to reduce the amount of traffic to be sent over the network that couples multiple AI acceleration servers.
For example, in some inference systems the input is a network (for example, multiple streams from IP cameras connected to an AI server). In such cases, using RDMA+AI on the processing and networking unit can reduce the load on the CPU and the PCIe bus and have the processing performed on the processing and networking unit rather than by a GPU that is not included in the processing and networking unit.
For example, instead of computing initial results and sending the initial results to a target AI acceleration server (which applies one or more AI processing operations), the processing and networking unit may perform preprocessing that reduces the number of values sent to the target AI acceleration server. A target AI computing server is an AI computing server that is assigned to perform computations on values provided by other AI acceleration servers. This reduces the bandwidth of the traffic exchanged between the AI acceleration servers and also reduces the load on the target AI acceleration server.
The target AI acceleration server may be assigned dynamically or statically, using load balancing or another allocation algorithm. There may be more than a single target AI acceleration server.
For example, if the target AI acceleration server sums multiple losses, the processing and networking unit may sum the losses generated by its own AI acceleration server and send the sum of the losses to the target AI acceleration server, thereby reducing bandwidth. The same benefit can be obtained when performing other preprocessing operations, such as derivative computation and aggregation, and the like.
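The bandwidth saving from summing losses before sending them can be shown with a small count of messages. The sketch below is illustrative only: the function names, the loss values, and the idea of counting one message per transmitted value are assumptions, not details from the description.

```python
def send_raw(per_server_losses):
    """Every server forwards each loss; the target server does all the summing."""
    messages = [loss for losses in per_server_losses for loss in losses]
    return sum(messages), len(messages)


def send_preaggregated(per_server_losses):
    """Each processing and networking unit sums its own server's losses first."""
    messages = [sum(losses) for losses in per_server_losses]
    return sum(messages), len(messages)


# Hypothetical losses produced on three AI acceleration servers.
per_server_losses = [
    [0.5, 0.25, 0.25],
    [1.0, 0.5],
    [0.125, 0.125, 0.25, 0.5],
]

raw_total, raw_messages = send_raw(per_server_losses)
agg_total, agg_messages = send_preaggregated(per_server_losses)
```

Both approaches give the target server the same total, but the pre-aggregated version sends one value per server instead of one value per loss. This works because addition is associative; an operation without that property would need a different preprocessing step.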
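Figure 97B illustrates a system 12060 that includes subsystems, each of which includes a switch 12061 for connecting AI processing and networking units 12063 with server motherboards 12064 to one another. A server motherboard includes one or more AI processing and networking units 12063 that have network capabilities and AI processing capabilities. An AI processing and networking unit 12063 may include one or more NICs and ALUs or other computing circuits for performing preprocessing.
An AI processing and networking unit 12063 may be a chip or may include more than a single chip. It may be beneficial for the AI processing and networking unit 12063 to be a single chip.
An AI processing and networking unit 12063 may include (only or mainly) processing resources. An AI processing and networking unit 12063 may include in-memory computing circuits, may include no in-memory computing circuits, or may not include a significant amount of in-memory computing circuits.
An AI processing and networking unit 12063 may be an integrated circuit, may include more than a single integrated circuit, may be part of an integrated circuit, and the like.
An AI processing and networking unit 12063 may convey traffic (see, for example, Figure 97C) between the AI acceleration server that includes the unit and other AI acceleration servers (for example, by using communication ports such as DDR channels, network channels, and/or PCIe channels). The AI processing and networking unit 12063 may also be coupled to external memory such as DDR memory. The processing and networking unit may include memory and/or may include memory/processing units.
In Figure 97C, the AI processing and networking unit 12063 is illustrated as including a local DDR connection, DDR channels, an AI accelerator, RAM memory, an encryption/decryption engine, a PCIe switch, a PCIe interface, multiple core processing arrays, a fast network connection, and the like.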
A method may be provided for operating the system (or any part of the system) of any one of Figures 97B and 97C.
Any combination of any steps of any methods mentioned in this application may be provided.
Any combination of any units, integrated circuits, memory resources, logic, processing subunits, controllers, and components mentioned in this application may be provided.
Any reference to "including" and/or "comprising" should be applied, mutatis mutandis, to "consisting of" and "consisting essentially of".
The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and adaptations will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed embodiments. In addition, although aspects of the disclosed embodiments are described as being stored in memory, those skilled in the art will appreciate that these aspects can also be stored on other types of computer-readable media, such as secondary storage devices, for example hard disks or CD-ROM, or other forms of RAM or ROM, USB media, DVD, Blu-ray, 4K Ultra HD Blu-ray, or other optical drive media.
Computer programs based on the written description and the disclosed methods are within the skill of an experienced developer. The various programs or program modules can be created using any of the techniques known to those skilled in the art or can be designed in connection with existing software. For example, program sections or program modules can be designed in or by means of .Net Framework, .Net Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, C++, Objective-C, HTML, HTML/AJAX combinations, XML, or HTML with included Java applets.
Moreover, while illustrative embodiments have been described herein, those skilled in the art will appreciate, based on the present disclosure, the scope of any and all embodiments having equivalent elements, modifications, omissions, combinations (for example, of aspects across various embodiments), adaptations, and/or alterations. The limitations in the claims are to be interpreted broadly based on the language employed in the claims and are not limited to examples described in this specification or during the prosecution of the application. The examples are to be construed as non-exclusive. Furthermore, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. It is intended, therefore, that the specification and examples be considered as illustrative only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.
300: hardware chip
310a: processing group
310b: processing group
310c: processing group
310d: processing group
320a: logic and control subunit
320b: logic and control subunit
320c: logic and control subunit
320d: logic and control subunit
320e: logic and control subunit
320f: logic and control subunit
320g: logic and control subunit
320h: logic and control subunit
330a: dedicated memory instance
330b: dedicated memory instance
330c: dedicated memory instance
330d: dedicated memory instance
330e: dedicated memory instance
330f: dedicated memory instance
330g: dedicated memory instance
330h: dedicated memory instance
340a: control
340b: control
340c: control
340d: control
350: host
Claims (368)
Applications Claiming Priority (10)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962886328P | 2019-08-13 | 2019-08-13 | |
US62/886,328 | 2019-08-13 | ||
US201962907659P | 2019-09-29 | 2019-09-29 | |
US62/907,659 | 2019-09-29 | ||
US201962930593P | 2019-11-05 | 2019-11-05 | |
US62/930,593 | 2019-11-05 | ||
US202062971912P | 2020-02-07 | 2020-02-07 | |
US62/971,912 | 2020-02-07 | ||
US202062983174P | 2020-02-28 | 2020-02-28 | |
US62/983,174 | 2020-02-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
TW202122993A true TW202122993A (en) | 2021-06-16 |
Family
ID=74570549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW109127495A TW202122993A (en) | 2019-08-13 | 2020-08-13 | Memory-based processors |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP4010808A4 (en) |
KR (1) | KR20220078566A (en) |
CN (1) | CN114586019A (en) |
TW (1) | TW202122993A (en) |
WO (1) | WO2021028723A2 (en) |
-
2020
- 2020-08-13 CN CN202080071415.1A patent/CN114586019A/en active Pending
- 2020-08-13 KR KR1020227008116A patent/KR20220078566A/en unknown
- 2020-08-13 WO PCT/IB2020/000665 patent/WO2021028723A2/en unknown
- 2020-08-13 EP EP20852497.5A patent/EP4010808A4/en active Pending
- 2020-08-13 TW TW109127495A patent/TW202122993A/en unknown
Also Published As
Publication number | Publication date |
---|---|
KR20220078566A (en) | 2022-06-10 |
EP4010808A4 (en) | 2023-11-15 |
WO2021028723A2 (en) | 2021-02-18 |
CN114586019A (en) | 2022-06-03 |
WO2021028723A3 (en) | 2021-07-08 |
EP4010808A2 (en) | 2022-06-15 |